• Re: Efficiency of in-order vs. OoO

    From Paul A. Clayton@paaronclayton@gmail.com to comp.arch on Wed Mar 20 21:12:16 2024
    From Newsgroup: comp.arch

    On 1/22/24 9:44 AM, Paul A. Clayton wrote:
    [snip]
    Obviously an extremely biased workload like the data analysis
    workloads targeted by Intel's research chip would probably show
    A55 in a better light (though A55 would likely be very inefficient
    compared to the research design, I think it used 4-way threaded
    in-order cores with limited cache and narrow memory channels [to
    avoid 64-byte accesses when touching 64 bits or less of data]), but
    that would not be "fair".

    I (finally) found a reference to the Intel research chip. https://ieeexplore.ieee.org/document/10188866
    "The Intel Programmable and Integrated Unified Memory Architecture
    Graph Analytics Processor" (Sriram Aananthakrishnan et al., 2023)
    A PDF of the paper appears to be available at https://heirman.net/papers/aananthakrishnan2023piuma.pdf

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Mar 22 10:23:09 2024
    From Newsgroup: comp.arch

    Paul A. Clayton wrote:
    On 1/22/24 9:44 AM, Paul A. Clayton wrote:
    [snip]
    Obviously an extremely biased workload like the data analysis
    workloads targeted by Intel's research chip would probably show
    A55 in a better light (though A55 would likely be very inefficient
    compared to the research design, I think it used 4-way threaded
    in-order cores with limited cache and narrow memory channels [to avoid
    64-byte accesses when touching 64 bits or less of data]), but
    that would not be "fair".

    I (finally) found a reference to the Intel research chip. https://ieeexplore.ieee.org/document/10188866
    "The Intel Programmable and Integrated Unified Memory Architecture
    Graph Analytics Processor" (Sriram Aananthakrishnan et al., 2023)
    A PDF of the paper appears to be available at https://heirman.net/papers/aananthakrishnan2023piuma.pdf

    Interesting. Thanks.
    I haven't finished reading it, but one thing I noticed: normally all
    of the chased pointers are virtual addresses, and while they
    mention "Address translation tables (ATT)", I didn't see how they
    actually DO the virtual address translation during these offloaded chases.

    Also interesting are some of the authors' other recent publications. E.g.:

    https://scholar.google.com/citations?hl=en&user=bUTgzBUAAAAJ&view_op=list_works&sortby=pubdate

    https://scholar.google.com/citations?hl=en&user=ySqvmSQAAAAJ&view_op=list_works&sortby=pubdate

    This is a different approach to OoO uArch.
    Existing OoO designs work on the basis that most things are serial and
    predictable. This approach is optimized for sparse workloads: short
    sequential code segments intermixed with sparse conditional code
    segments, chasing sparse data.
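
    For illustration (my own sketch, not code from the PIUMA paper), the
    kind of kernel such designs target is a pointer chase: every load
    address depends on the previous load, so there is almost no ILP and
    prefetchers see no usable pattern.

    # Illustrative pointer-chasing kernel (assumed workload, not from the
    # paper): each iteration's load depends on the previous one.
    import random

    N = 1 << 20
    perm = list(range(N))
    random.shuffle(perm)              # random cycle => "sparse" access pattern
    next_node = [0] * N
    for i in range(N - 1):
        next_node[perm[i]] = perm[i + 1]
    next_node[perm[-1]] = perm[0]

    def chase(start, steps):
        node = start
        for _ in range(steps):
            node = next_node[node]    # dependent load: the classic pointer chase
        return node

    print(chase(perm[0], 100_000))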


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Paul A. Clayton@paaronclayton@gmail.com to comp.arch on Sun Mar 24 12:38:44 2024
    From Newsgroup: comp.arch

    On 2/25/24 5:22 PM, MitchAlsup1 wrote:
    Paul A. Clayton wrote:
    [snip]
    When I looked at the pipeline design presented in the Arm Cortex-
    A55 Software Optimization Guide, I was surprised by the design.
    Figure 1 (page 10 in Revision R2p0) shows nine execution pipelines
    (ALU0, ALU1, MAC, DIV, branch, store, load, FP/Neon MAC &
    DIV/SQRT, FP/Neon ALU) and ALU0 and ALU1 have a shift pipeline
    stage before an ALU stage (clearly for AArch32).

    Almost like an Mc88100 which had 5 pipelines.

    I think I have an incorrect conception of data communication
    (forwarding and register-to-functional-unit). I also seem to be
    somewhat conflating issue port and functional unit. Forwarding
    from nine locations to nine locations, and the remaining eight
    locations to eight locations (counting a functional unit as a single
    target location even though a functional unit may have three
    functionally different input operands).

    I am used to functionality being merged; e.g., the multiplier also
    having a general ALU. Merged functional units would still need to
    route the operands to the appropriate functionality, but selecting
    the operation path for two operands *seems* simpler than selecting
    distinct operands and separate functional unit independently. This
    might also be a nomenclature issue.

    If one can only begin two operations in a cycle, the generality of
    having nine potential paths seems wasteful to me. Having separate
    paths for FP/Neon and GPR-using operations makes sense because of
    the different register sets (as well as latency/efficiency-
    optimized functional units vs. SIMD-optimized functional units;
    sharing execution hardware is tempting but there are tradeoffs).

    With nine potential issue ports, it seems strange to me that width
    is strictly capped at two. Even though AArch64 does not have My
    66000's Virtual Vector Method to exploit normally underutilized resources,
    there would be cases where an extra instruction or two could
    execute in parallel without increasing resources significantly. As
    an outsider, I can only assume that any benefit did not justify
    the costs in hardware and design effort. (With in-order execution,
    even a nearly free [hardware] increase in width may not result
    in improved performance or efficiency.)

    The separation of MAC and DIV is mildly questionable — from my
    very amateur perspective — not supporting dual issue of a MAC-DIV
    pair seems very unlikely to hurt performance but the cost may be
    trivial.

    Many (MANY) MUL-DIV pairs are data dependent. y = i*m/n;

    I also assume the other operations are usually available for
    parallel execution (though this depends somewhat on compiler
    optimization for the microarchitecture), so execution of a
    multiply and a divide in parallel is probably uncommon.

    The FP/Neon section has these operations merged into a functional
    unit; I guess — I am not motivated to look this up — that this is
    because FP divide/sqrt use the multiplier while integer divide
    does not.

    The Chips and Cheese article also indicated that branches are only
    resolved at writeback, two cycles later than if branch direction
    was resolved in the first execution stage. The difference between
    a six stage misprediction penalty and an eight stage one is not
    huge, but it seems to indicate a difference in focus. With

    In an 8 stage pipeline, the 2 cycles of added delay should hurt by
    ~5%-7%

    5% performance loss sounds expensive for something that *seems*
    not terribly expensive to fix.
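
    A rough back-of-the-envelope, with assumed workload numbers (not from
    this thread), for where a figure in the 5%-7% range can come from:

    # Added misprediction penalty expressed as extra CPI over a base CPI.
    extra_penalty = 2              # resolve at writeback vs. first EX stage
    mispredicts_per_kinstr = 30    # assumed: ~1 mispredict per 33 instructions
    base_cpi = 1.2                 # assumed for a small 2-wide in-order core

    added_cpi = extra_penalty * mispredicts_per_kinstr / 1000
    print(f"slowdown ~ {added_cpi / base_cpi:.1%}")   # ~5%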

    [snip]
    I would have *guessed* that an AGLU (a functional unit providing
    address generation and "simple" ALU functions, like AMD's Bobcat?)
    would be more area and power efficient than having separate
    pipelines, at least for store address generation.

    Be careful with assumptions like that. Silicon area with no moving
    signals is remarkably power efficient.

    There is also the extra forwarding for separate functional units
    (and perhaps some extra costs from increased distance), but I
    admit that such factors really expose my complete lack of hardware
    experience. (I am aware of clock gating as a power saving
    technique and that "doing nothing" is cheap, but I have no
    intuition of the weights of the tradeoffs.)

    (I was also very surprised by how much extra state the A55 has:
    over 100 extra "registers". Even though these are not all 64-bit
    data storage units, this was still a surprising amount of extra
    state for a core targeting area efficiency. The storage itself may
    not be particularly expensive, but it gives some insight into how
    complex even a "simple" implementation can be.)

    [snip interesting stuff]
    Perhaps mildly out-of-order designs (say a little more than the
    PowerPC 750) are not actually useful (other than as a starting
    point for understanding out-of-order design). I do not understand
    why such an intermediate design (between in-order and 30+
    scheduling window out-of-order) is not useful. It may be that

    It is useful, just not all that much.

    going from say 10 to 30 scheduler entries gives so much benefit
    for relatively little extra cost (and no design is so precisely
    area constrained — even doubling core size would not mean pushing
    L1 off-chip, e.g.). I have a lumper taxonomic bias, so I have some
    emotional investment in intermediate and mixed designs.

    10 does not accommodate much ILP beyond that of a 10 deep pipeline.
    30 accommodates L1 cache misses and typical FP latencies.
    90 accommodates "almost everything else"
    250 accommodates multiple L1 misses with L2 hits and "everything
    else".

    Presumably the benefit depends on issue width and load-to-use
    latency (pipeline depth, cache capacity, etc.). [For a cheap
    "general purpose" processor, not covering FP latencies well may
    not be very important.] Better hiding L1 _hit_ latency would seem
    to provide a significant fraction of the frequency and ILP benefit
    of out-of-order for a smallish core. (Some branch resolution
    latency can also be hidden; an in-order core can delay resolution
    until writeback of control-dependent instructions, but OoO's extra
    buffering facilitates deeper speculation.)
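
    The window sizes listed above also line up with a Little's-law style
    estimate: to keep W instructions per cycle flowing past an L-cycle
    latency you need roughly W*L independent instructions in flight. The
    widths and latencies below are my assumptions, just to show the shape:

    # Rough in-flight requirement = issue width * latency to be covered.
    cases = [
        ("L1 hit / short pipeline", 2, 5),
        ("L1 miss, L2 hit",         3, 12),
        ("FP chains + L2 misses",   4, 22),
        ("L2 miss to memory",       4, 60),
    ]
    for name, width, latency in cases:
        print(f"{name:26s} ~{width * latency:4d} in flight")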

    If one has a scheduling window of 90 operations, having only three
    issue ports seems imbalanced to me.

    Out-of-order execution would also seem to facilitate opportunistic
    use of existing functionality. Even just buffering decoded
    instructions would seem to allow a 16-byte (aligned) instruction
    fetch with two instruction decoders to issue more than two
    instructions on some cycles without increasing register port
    count, forwarding paths, etc. OoO would further increase the
    frequency of being able to do more work with given hardware
    resources.
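
    As a toy model of that opportunistic issue (my own sketch, with
    assumed parameters): a 2-wide decode feeding a small buffer lets some
    cycles issue 3 or 4 operations, even though sustained throughput stays
    capped by the front end.

    import random
    random.seed(1)

    buffered, hist = 0, []
    for cycle in range(10_000):
        buffered = min(buffered + 2, 16)   # 2-wide decode into a 16-entry buffer
        ready = random.randint(0, 4)       # buffered ops whose operands are ready
        issued = min(buffered, ready, 4)   # up to 4 issue ports this cycle
        buffered -= issued
        hist.append(issued)

    print("avg issue/cycle:", sum(hist) / len(hist))
    print("cycles issuing >2:", sum(i > 2 for i in hist))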

    Perhaps there may even be a case for a 1+ wide OoO core, i.e., an
    OoO core which sometimes issues more than one instruction in a
    cycle.

    For something like a smart phone, one or two small cores might be
    useful for background activity, tasks whose latency (within a
    broad range) is not related to system responsiveness for the user.

    For a server expected to run embarrassingly parallel workloads, if

    Servers are not expected to run embarrassingly parallel applications;
    they are expected to run an embarrassingly large number of essentially
    serial applications.

    Shared caching of instructions still seems beneficial in "server
    worklaods" compared to fully general multiprogram workloads. A
    database server might even have more sharing, potentially having a
    single process (so page table sharing would be more beneficial),
    but that seems a less common use.

    a wimpy core provides sufficient responsiveness, I would expect
    most of the cores (possibly even all of the cores) to be wimpy.
    There might not be many workloads with such characteristics;

    Talk to Google about that....

    Urs Hölzle of Google put out a paper "Brawny cores still beat
    wimpy cores, most of the time" (2010). While some of the points —
    such as tail latency effects and software development costs —
    made in the paper are (in my opinion) quite significant, I thought
    the argument significantly flawed. (I even wrote a blog post about
    this paper: https://dandelion-watcher.blogspot.com/2012/01/weak-case-against-wimpy-cores.html)

    The microservice programming model (motivated, from what I
    understand, by problem-size and performance scaling and service
    reliability with moderately reliable hardware without requiring
    much programming effort to support scaling) may also have
    significant implications on microarchitecture.

    The design space is also very large. One can have heterogeneity of
    wimpy and brawny cores at the rack level, wimpy-only chips within
    a heterogeneous package, heterogeneity within a chip, temporal
    heterogeneity (SMT and dynamic partitioning of core resources),
    etc. Core strength can vary widely and performance balance can be
    diverse (e.g., a core with a quarter of the performance of a
    brawny core on general tasks might have — with coprocessors,
    tightly coupled accelerators, or general microarchitecture —
    approximately equal performance for some tasks).

    The performance of weaker cores can also be increased by
    increasing communication performance within local groups of such
    cores. Exploiting this would likely require significant
    programming effort, but some of the effort might be automated
    (even before AI replaces programmers). This assumes that there is
    significant communication that is less temporally local than
    within a core (out-of-order execution changes the temporal
    proximity of value communication; a result consumer might be
    nearby in program order but substantially more distant in
    execution order) and that intermediate resource allocation to
    intermediate latency/bandwidth communication can be beneficial.

    (I also think that there is an opportunity for optimization in the
    on-chip network. Optimizing the on-chip network for any-to-any
    communication seems less appropriate for many workloads not only
    because of the often limited scale of communication but also
    because the communication is, I suspect, often specialized.
    Getting a network design that is very good for some uses and
    adequate for others seems challenging even with software cooperation.
    Rings seem really nice for pipeline-style parallelism and some
    other uses, crossbars seem nice for small node groups with heavy
    communication, grids seem to fit large node counts with nearest
    neighbor communication (physical modeling?), etc. Channel width,
    flit size, and channel count also involve tradeoffs. Some
    communication does not require sending an entire cache block of
    data, but a smaller flit will have more overhead.)
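
    To put a number on that last point, with an assumed per-flit header
    size (illustration only, not any particular fabric's format):

    # Overhead fraction of a fixed header for various payload sizes.
    header_bits = 32                  # assumed header/sideband bits per flit
    for payload_bytes in (8, 16, 32, 64):
        payload_bits = payload_bytes * 8
        print(f"{payload_bytes:2d}B payload:"
              f" {header_bits / (header_bits + payload_bits):5.1%} overhead")
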
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Mar 24 19:00:22 2024
    From Newsgroup: comp.arch

    Paul A. Clayton wrote:

    On 2/25/24 5:22 PM, MitchAlsup1 wrote:
    Paul A. Clayton wrote:
    [snip]
    When I looked at the pipeline design presented in the Arm Cortex-
    A55 Software Optimization Guide, I was surprised by the design.
    Figure 1 (page 10 in Revision R2p0) shows nine execution pipelines
    (ALU0, ALU1, MAC, DIV, branch, store, load, FP/Neon MAC &
    DIV/SQRT, FP/Neon ALU) and ALU0 and ALU1 have a shift pipeline
    stage before an ALU stage (clearly for AArch32).

    Almost like an Mc88100 which had 5 pipelines.

    I think I have an incorrect conception of data communication
    (forwarding and register-to-functional-unit). I also seem to be
    somewhat conflating issue port and functional unit. Forwarding
    from nine locations to nine locations, and the remaining eight
    locations to eight locations (counting a functional unit as a single
    target location even though a functional unit may have three
    functionally different input operands).

    Much newer µArchitectural literature does not properly draw a firm
    box around real function units.

    For example, Mc 88120 has 6 function units buffered by 6 reservation
    stations. Each function unit had an Integer Adder including things
    like the branch resolution unit, FADD, and FMUL. When I drew those
    boxes, I would show post-forwarding operands arriving at the FU
    and then after arriving either being diverted to the INT unit or
    being diverted to the "other" function unit. This way you could
    count operand and result busses and end points for fan-in::fan-out
    reasons.

    This style seems to have fallen from favor; possibly because we made
    the transition from value-containing reservation stations to value-
    free reservation stations--alleviating register file porting problems.

    I am used to functionality being merged; e.g., the multiplier also
    having a general ALU. Merged functional units would still need to
    route the operands to the appropriate functionality, but selecting
    the operation path for two operands *seems* simpler than selecting
    distinct operands and separate functional unit independently. This
    might also be a nomenclature issue.

    The above remains my style in µArchitecture literature, but when
    describing block diagram and circuit design levels, only the interior
    of the function unit is illustrated.

    If one can only begin two operations in a cycle, the generality of
    having nine potential paths seems wasteful to me. Having separate
    paths for FP/Neon and GPR-using operations makes sense because of
    the different register sets (as well as latency/efficiency-
    optimized functional units vs. SIMD-optimized functional units;
    sharing execution hardware is tempting but there are tradeoffs).

    In general, operand timing is tight and you better not screw it up;
    while result delivery timing only has to deal with fan-out and data
    arrival issues.

    My style was conceived back in the days when wires were fast and
    metal was precious (3 layers). Now that we have 12-15 layers it
    matters less, I suppose.

    With nine potential issue ports, it seems strange to me that width
    is strictly capped at two.

    Likely to be a register porting or a register port analysis limitation. Value-free reservation stations exacerbate this.

    Even though AArch64 does not have My
    66000's Virtual Vector Method to exploit normally underutilized resources,
    there would be cases where an extra instruction or two could
    execute in parallel without increasing resources significantly. As
    an outsider, I can only assume that any benefit did not justify
    the costs in hardware and design effort. (With in-order execution,
    even a nearly free [hardware] increase in width may not result
    in improved performance or efficiency.)

    VVM works best with value-containing reservation stations.

    The separation of MAC and DIV is mildly questionable — from my
    very amateur perspective — not supporting dual issue of a MAC-DIV
    pair seems very unlikely to hurt performance but the cost may be
    trivial.

    Many (MANY) MUL-DIV pairs are data dependent. y = i*m/n;

    I also assume the other operations are usually available for
    parallel execution (though this depends somewhat on compiler
    optimization for the microarchitecture), so execution of a
    multiply and a divide in parallel is probably uncommon.

    In general, any 2 calculations that are not data-dependent can
    be launched into execution without temporal binds.

    The FP/Neon section has these operations merged into a functional
    unit; I guess — I am not motivated to look this up — that this is
    because FP divide/sqrt use the multiplier while integer divide
    does not.

    The Chips and Cheese article also indicated that branches are only
    resolved at writeback, two cycles later than if branch direction
    was resolved in the first execution stage. The difference between
    a six stage misprediction penalty and an eight stage one is not
    huge, but it seems to indicate a difference in focus. With

    In an 8 stage pipeline, the 2 cycles of added delay should hurt by
    ~5%-7%

    5% performance loss sounds expensive for something that *seems*
    not terribly expensive to fix.

    [snip]
    I would have *guessed* that an AGLU (a functional unit providing
    address generation and "simple" ALU functions, like AMD's Bobcat?)
    would be more area and power efficient than having separate
    pipelines, at least for store address generation.

    Be careful with assumptions like that. Silicon area with no moving
    signals is remarkably power efficient.

    There is also the extra forwarding for separate functional units
    (and perhaps some extra costs from increased distance), but I
    admit that such factors really expose my complete lack of hardware experience. (I am aware of clock gating as a power saving
    technique and that "doing nothing" is cheap, but I have no
    intuition of the weights of the tradeoffs.)

    Mc 88120 had forwarding into the reservation stations and forwarding
    between reservation station output and function unit input. That is
    a lot of forwarding.

    (I was also very surprised by how much extra state the A55 has:
    over 100 extra "registers". Even though these are not all 64-bit
    data storage units, this was still a surprising amount of extra
    state for a core targeting area efficiency. The storage itself may
    not be particularly expensive, but it gives some insight into how
    complex even a "simple" implementation can be.)

    Imagine having to stick all this stuff on a die at 2µ instead of 5nm !!

    [snip interesting stuff]
    Perhaps mildly out-of-order designs (say a little more than the
    PowerPC 750) are not actually useful (other than as a starting
    point for understanding out-of-order design). I do not understand
    why such an intermediate design (between in-order and 30+
    scheduling window out-of-order) is not useful. It may be that

    It is useful, just not all that much.

    going from say 10 to 30 scheduler entries gives so much benefit
    for relatively little extra cost (and no design is so precisely
    area constrained — even doubling core size would not mean pushing
    L1 off-chip, e.g.). I have a lumper taxonomic bias, so I have some
    emotional investment in intermediate and mixed designs.

    10 does not accommodate much ILP beyond that of a 10 deep pipeline.
    30 accommodates L1 cache misses and typical FP latencies.
    90 accommodates "almost everything else"
    250 accommodates multiple L1 misses with L2 hits and "everything
    else".

    Presumably the benefit depends on issue width and load-to-use
    latency (pipeline depth, cache capacity, etc.). [For a cheap
    "general purpose" processor, not covering FP latencies well may
    not be very important.] Better hiding L1 _hit_ latency would seem
    to provide a significant fraction of the frequency and ILP benefit
    of out-of-order for a smallish core. (Some branch resolution
    latency can also be hidden; an in-order core can delay resolution
    until writeback of control-dependent instructions, but OoO's extra
    buffering facilitates deeper speculation.)

    If one has a scheduling window of 90 operations, having only three
    issue ports seems imbalanced to me.

    I agree:: for Mc 88120 we had 96 instructions (max) in flight for
    a 6-wide {issue, launch, execute, result, and retire}; we also
    had a 16-cycle execution window, so to stream DGEMM (from Matrix300)
    we had to execute a LD {which would miss ½ the time} and then have
    4 cycles for FMUL and 3 cycles for FADD, allowing ST to capture the
    FADD result and ship it off to cache. Going backwards, 16-(1+3+4)
    meant the LD->L1$->miss->memory->LDalign had only 8 cycles.

    The modern version with FMAC would allow 11-cycles LD-Miss-Align.
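
    Spelling out that cycle budget with the same numbers:

    # 16-cycle execution window minus the compute chain leaves the budget
    # available for servicing the load miss.
    window = 16
    ld_issue, fmul, fadd = 1, 4, 3
    print("LD miss budget:", window - (ld_issue + fmul + fadd))   # 8 cycles
    fmac = 4                            # fused FMUL+FADD
    print("with FMAC:", window - (ld_issue + fmac))               # 11 cycles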

    Out-of-order execution would also seem to facilitate opportunistic
    use of existing functionality. Even just buffering decoded
    instructions would seem to allow a 16-byte (aligned) instruction
    fetch with two instruction decoders to issue more than two
    instructions on some cycles without increasing register port
    count, forwarding paths, etc. OoO would further increase the
    frequency of being able to do more work with given hardware
    resources.

    My 66150 does 16B fetch and parses 2 instructions per cycle,
    even though it is only 1-wide. By fetching wide, and scanning
    ahead, one can identify branches and fetch their targets prior
    to executing the branch, eliminating the need for the delay-slot
    and reducing branch taken overhead down to about 0.13 cycles
    even without branch prediction !!

    But anything wider than 1 instruction will need a branch predictor
    of some sort.

    Perhaps there may even be a case for a 1+ wide OoO core, i.e., an
    OoO core which sometimes issues more than one instruction in a
    cycle.

    For something like a smart phone, one or two small cores might be
    useful for background activity, tasks whose latency (within a
    broad range) is not related to system responsiveness for the user.

    For a server expected to run embarrassingly parallel workloads, if

    Servers are not expected to run embarrassingly parallel applications;
    they are expected to run an embarrassingly large number of essentially
    serial applications.

    Shared caching of instructions still seems beneficial in "server
    worklaods" compared to fully general multiprogram workloads. A
    database server might even have more sharing, potentially having a
    single process (so page table sharing would be more beneficial),
    but that seems a less common use.

    a wimpy core provides sufficient responsiveness, I would expect
    most of the cores (possibly even all of the cores) to be wimpy.
    There might not be many workloads with such characteristics;

    Talk to Google about that....

    Urs Hölzle of Google put out a paper "Brawny cores still beat
    wimpy cores, most of the time" (2010). While some of the points —
    such as tail latency effects and software development costs —
    made in the paper are (in my opinion) quite significant, I thought
    the argument significantly flawed. (I even wrote a blog post about
    this paper: https://dandelion-watcher.blogspot.com/2012/01/weak-case-against-wimpy-cores.html)

    The microservice programming model (motivated, from what I
    understand, by problem-size and performance scaling and service
    reliability with moderately reliable hardware without requiring
    much programming effort to support scaling) may also have
    significant implications on microarchitecture.

    The design space is also very large. One can have heterogeneity of
    wimpy and brawny cores at the rack level, wimpy-only chips within
    a heterogeneous package, heterogeneity within a chip, temporal
    heterogeneity (SMT and dynamic partitioning of core resources),
    etc. Core strength can vary widely and performance balance can be
    diverse (e.g., a core with a quarter of the performance of a
    brawny core on general tasks might have — with coprocessors,
    tightly coupled accelerators, or general microarchitecture —
    approximately equal performance for some tasks).

    With a "proper interface" one should be able to off-load any
    crypto processing to a place that is both constant-time and
    where sensitive data never passes into the cache hierarchy of
    an untrusted core.

    The performance of weaker cores can also be increased by
    increasing communication performance within local groups of such
    cores. Exploiting this would likely require significant
    programming effort, but some of the effort might be automated
    (even before AI replaces programmers). This assumes that there is
    significant communication that is less temporally local than
    within a core (out-of-order execution changes the temporal
    proximity of value communication; a result consumer might be
    nearby in program order but substantially more distant in
    execution order) and that intermediate resource allocation to
    intermediate latency/bandwidth communication can be beneficial.

    (I also think that there is an opportunity for optimization in the
    on-chip network. Optimizing the on-chip network for any-to-any
    communication seems less appropriate for many workloads not only
    because of the often limited scale of communication but also
    because the communication is, I suspect, often specialized.

    And often necessarily serialized.

    Getting a network design that is very good for some uses and
    adequate for others seems challenging even with software cooperation.

    See:: https://www.tachyum.com/media/pdf/tachyum_20isc20.pdf

    Rings seem really nice for pipeline-style parallelism and some
    other uses, crossbars seem nice for small node groups with heavy communication, grids seem to fit large node counts with nearest
    neighbor communication (physical modeling?), etc. Channel width,
    flit size, and channel count also involve tradeoffs. Some
    communication does not require sending an entire cache block of
    data, but a smaller flit will have more overhead.)

    We are arriving at the scale where we want to ship a cache line of data
    in a single clock in order to have sufficient coherent BW for 128+ cores.
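
    As a sanity check on the sizing (the clock here is an assumption, not
    a quoted number):

    # A cache line per clock implies a 512-bit datapath; per-link bandwidth
    # scales with the fabric clock.
    line_bytes = 64
    clock_ghz = 2.0                               # assumed fabric clock
    print("datapath:", line_bytes * 8, "bits")    # 512
    print("per link:", line_bytes * clock_ghz, "GB/s")   # 128.0
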
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sun Mar 24 20:39:18 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    Paul A. Clayton wrote:

    (I was also very surprised by how much extra state the A55 has:
    over 100 extra "registers". Even though these are not all 64-bit
    data storage units, this was still a surprising amount of extra
    state for a core targeting area efficiency. The storage itself may
    not be particularly expensive, but it gives some insight into how
    complex even a "simple" implementation can be.)

    Imagine having to stick all this stuff on a die at 2µ instead of 5nm !!

    I suspect Paul is referring to what ARMv8 calls "System Registers";
    despite the name, most are stored in flops, and in the case of
    the ID registers, wires (perhaps anded with local e-fuses).

    Accesses to some of them are self-synchronizing[*];
    the rest must be followed by an appropriate barrier
    instruction for the effects to be architecturally visible.

    [*] E.g. ICC_IAR1_EL1 (An interrupt acknowledge register).
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Paul A. Clayton@paaronclayton@gmail.com to comp.arch on Sun Mar 24 21:46:18 2024
    From Newsgroup: comp.arch

    On 3/24/24 4:39 PM, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    Paul A. Clayton wrote:

    (I was also very surprised by how much extra state the A55 has:
    over 100 extra "registers". Even though these are not all 64-bit
    data storage units, this was still a surprising amount of extra
    state for a core targeting area efficiency. The storage itself may
    not be particularly expensive, but it gives some insight into how
    complex even a "simple" implementation can be.)

    Imagine having to stick all this stuff on a die at 2µ instead of 5nm !!

    I suspect Paul is referring to what ARMv8 calls "System Registers";

    Yes. (There were also some debug registers, performance monitoring
    registers, trace registers, etc.)

    despite the name, most are stored in flops, and in the case of
    the ID registers, wires (perhaps anded with local e-fuses).

    Yes, many of the bits would be implemented as ROM/PROM and many
    would presumably be scattered about because they control/interact
    with specific functionality. They are similar to I/O device
    registers. (I/O devices have also become more complex.)

    However, having over 100 seems like a lot. Supporting performance
    counters and tracing is also something that would have been nearly inconceivable for something like the MIPS R2000.

    An argument might be made that some designs would have no use for
    most of such extra state. Performance monitoring is useful for
    software development (and theoretically for OS decisions for
    scheduling, core migration, and other functions), but seems likely
    to be highly underutilized for typical use. A55 is presumably
    large enough that a synthesis-time removal of much of this
    functionality would have a tiny effect on total area. Even for a microcontroller the area cost might not be problematic.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Mar 25 08:41:06 2024
    From Newsgroup: comp.arch

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 3/24/24 4:39 PM, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    Paul A. Clayton wrote:

    (I was also very surprised by how much extra state the A55 has:
    over 100 extra "registers". Even though these are not all 64-bit
    data storage units, this was still a surprising amount of extra
    state for a core targeting area efficiency. The storage itself may
    not be particularly expensive, but it gives some insight into how
    complex even a "simple" implementation can be.)
    ...
    However, having over 100 seems like a lot. Supporting performance
    counters and tracing is also something that would have been nearly inconceivable for something like the MIPS R2000.

    Certainly. The A55 is similar to the 21164 (1994), which is much
    bigger than the R2000. For competition to the R2000, better look at
    the ARM1/ARM2, or, for something more contemporary, maybe the
    Cortex-M1.

    An argument might be made that some designs would have no use for
    most of such extra state. Performance monitoring is useful for
    software development (and theoretically for OS decisions for
    scheduling, core migration, and other functions), but seems likely
    to be highly underutilized for typical use. A55 is presumably
    large enough that a synthesis-time removal of much of this
    functionality would have a tiny effect on total area.

    ARM also has the Cortex-A35 (with a 25% smaller core than the A53 and
    80-100% of its performance according to ARM). I am unaware of it
    being used in smartphones, though.

    Even for a
    microcontroller the area cost might not be problematic.

    ARM-A is not for microcontrollers. ARM has ARM-M for that, e.g., the
    Cortex-M0 if you want it to be really small.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Mar 25 12:36:27 2024
    From Newsgroup: comp.arch

    Paul A. Clayton wrote:
    On 3/24/24 4:39 PM, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    Paul A. Clayton wrote:

    (I was also very surprised by how much extra state the A55 has:
    over 100 extra "registers". Even though these are not all 64-bit
    data storage units, this was still a surprising amount of extra
    state for a core targeting area efficiency. The storage itself may
    not be particularly expensive, but it gives some insight into how
    complex even a "simple" implementation can be.)

    However, having over 100 seems like a lot. Supporting performance
    counters and tracing is also something that would have been nearly inconceivable for something like the MIPS R2000.

    Many of these registers are configuration controls that
    get set once at boot and never change.
    Others are dynamic but not time critical, like debug registers.

    Only a small number would be diddled on a regular basis,
    like interrupt control.

    They don't all need the same access speed -
    depending on usage some (most?) can be on "slow" buses
    that may take multiple clocks to read or write.



    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Mar 25 13:03:59 2024
    From Newsgroup: comp.arch

    EricP wrote:
    Paul A. Clayton wrote:
    On 3/24/24 4:39 PM, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    Paul A. Clayton wrote:

    (I was also very surprised by how much extra state the A55 has:
    over 100 extra "registers". Even though these are not all 64-bit
    data storage units, this was still a surprising amount of extra
    state for a core targeting area efficiency. The storage itself may
    not be particularly expensive, but it gives some insight into how
    complex even a "simple" implementation can be.)

    However, having over 100 seems like a lot. Supporting performance
    counters and tracing is also something that would have been nearly
    inconceivable for something like the MIPS R2000.

    Many of these registers are configuration controls that
    get set once at boot and never change.
    Others are dynamic but not time critical, like debug registers.

    Only a small number would be diddled on a regular basis,
    like interrupt control.

    They don't all need the same access speed -
    depending on usage some (most?) can be on "slow" buses
    that may take multiple clocks to read or write.

    Also, accesses to many control registers must not occur out of order
    and must be guarded either implicitly or explicitly by instructions
    or uOps before and after to drain the pipeline.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Mar 25 17:04:44 2024
    From Newsgroup: comp.arch

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 3/24/24 4:39 PM, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    Paul A. Clayton wrote:

    (I was also very surprised by how much extra state the A55 has:
    over 100 extra "registers". Even though these are not all 64-bit
    data storage units, this was still a surprising amount of extra
    state for a core targeting area efficiency. The storage itself may
    not be particularly expensive, but it gives some insight into how
    complex even a "simple" implementation can be.)

    Imagine having to stick all this stuff on a die at 2µ instead of 5nm !!
    I suspect Paul is referring to what ARMv8 calls "System Registers";

    Yes. (There were also some debug registers, performance monitoring
    registers, trace registers, etc.)

    despite the name, most are stored in flops, and in the case of
    the ID registers, wires (perhaps anded with local e-fuses).

    Yes, many of the bits would be implemented as ROM/PROM and many
    would presumably be scattered about because they control/interact
    with specific functionality. They are similar to I/O device
    registers. (I/O devices have also become more complex.)

    However, having over 100 seems like a lot. Supporting performance
    counters and tracing is also something that would have been nearly inconceivable for something like the MIPS R2000.

    Yes, there are over 1000 system registers. Most of them are
    only present and implemented if associated feature(s) are supported by the implementation.

    The MIPS R2000 was designed nearly four decades ago and implemented in
    a 2 micrometer node. Whose law states that logic will expand to
    fill the area available :-)?


    An argument might be made that some designs would have no use for
    most of such extra state. Performance monitoring is useful for
    software development (and theoretically for OS decisions for
    scheduling, core migration, and other functions), but seems likely
    to be highly underutilized for typical use.

    There is a significant demand for performance monitoring. Note
    that in addition to standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Mar 25 17:38:58 2024
    From Newsgroup: comp.arch

    Scott Lurndal wrote:

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 3/24/24 4:39 PM, Scott Lurndal wrote:

    There is a significant demand for performance monitoring. Note
    that in addition to standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    My 66000 Architecture defines 8 performance counters at each layer of
    the design:: cores gets 8 counters, L1s gets 8 counters, L3s gets 8
    counters, Interconnect gets 8 counters, Memory Controller gets 8 counters,
    PCIe root gets 8 counters--and every instance multiplies the counters.
    All counters are available via MMI/O space, and can be copied out or reinitialized in a single LDM, STM, or MM instruction. Any thread with
    a TLB mapping can read or write based on permission bits.
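
    On the software side, reading such memory-mapped counters is just a
    block copy. A minimal sketch, assuming a hypothetical base address and
    a flat block of eight 64-bit counters (this is not the actual My 66000
    layout; it only shows the "snapshot the whole block" idea):

    import mmap, os, struct

    PMC_BASE = 0xF000_0000      # hypothetical physical address of the block
    PMC_COUNT = 8

    def snapshot_counters():
        fd = os.open("/dev/mem", os.O_RDONLY | os.O_SYNC)
        try:
            m = mmap.mmap(fd, PMC_COUNT * 8, mmap.MAP_SHARED,
                          mmap.PROT_READ, offset=PMC_BASE)
            try:
                return struct.unpack("<8Q", m.read(PMC_COUNT * 8))
            finally:
                m.close()
        finally:
            os.close(fd)

    print(snapshot_counters())    # needs privileges for /dev/mem
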
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Mar 25 18:23:50 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 3/24/24 4:39 PM, Scott Lurndal wrote:

    There is a significant demand for performance monitoring. Note
    that in addition to standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    My 66000 Architecture defines 8 performance counters at each layer of
    the design:: cores gets 8 counters, L1s gets 8 counters, L3s gets 8
    counters, Interconnect gets 8 counters, Memory Controller gets 8 counters, PCIe root gets 8 counters--and every instance multiplies the counters.

    It's not really the number of counters that is important; rather,
    it is what the counters count (i.e. which events can be counted).

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Mar 25 18:35:35 2024
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    There is a significant demand for performance monitoring. Note
    that in addition to standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    Interesting. I would have expected that the likes of me are few and
    far between, and easy to ignore for a big company like ARM, Intel or AMD.

    My theory was that the CPU manufacturers put performance monitoring
    counters in CPUs in order to understand the performance of real-world
    programs themselves, and how they should tweak the successor core to
    relieve it of bottlenecks.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Mon Mar 25 14:33:44 2024
    From Newsgroup: comp.arch

    On 3/25/2024 1:35 PM, Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    There is a significant demand for performance monitoring. Note
    that in addition to standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    Interesting. I would have expected that the likes of me are few and
    far between, and easy to ignore for a big company like ARM, Intel or AMD.

    My theory was that the CPU manufacturers put performance monitoring
    counters in CPUs in order to understand the performance of real-world programs themselves, and how they should tweak the successor core to
    relieve it of bottlenecks.


    Odd...

    I had mostly skipped any performance counters in hardware, and was
    instead using an emulator to model performance (among other things), but
    for performance tuning this only works insofar as the emulator remains accurate in terms of cycle costs (I make an effort, but accuracy seems to vary).


    One annoyance is that trying to model some newer or more advanced
    features may bog down the emulator enough that it can't maintain
    real-time performance.

    Though, I guess it is likely that for a "not painfully slow" processor
    (like an A55 or similar) cycle-accurate emulation in real-time at the
    native clock speed may not be viable (one would burn way too many cycles
    trying to model things like the cache hierarchy and branch predictor, ...).
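
    A quick estimate of why real time is out of reach (all numbers here
    are assumptions, just to show the shape of the problem):

    # Host work needed to model one target cycle vs. what a host can do.
    target_clock_mhz = 50        # assumed FPGA-class target clock
    host_work_per_cycle = 300    # assumed host instrs to model caches, BP, pipe
    host_ips = 5e9               # assumed host throughput, instructions/second

    needed = target_clock_mhz * 1e6 * host_work_per_cycle
    print(f"need {needed/1e9:.0f}G instr/s, have ~{host_ips/1e9:.0f}G"
          f" -> {needed/host_ips:.0f}x short of real time")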


    Some amount of debugging and performance measurements are possible via
    "LED" outputs, which show the status of pipeline stalls and the signals
    that feed into these stalls (and, indirectly, the percentage of time
    spent running instructions via the absence of stalls, ...), ...

    Had generated a cycle-use ranking for the full simulation by having the testbench code run checks mostly on these LED outputs (vs looking at
    them directly).

    Runs on an actual FPGA are admittedly comparably infrequent.


    Though, ironically, have noted that things like shell commands, etc, can
    still be fairly responsive even for Verilog simulations effectively
    running in kHz territory (where good responsiveness is sometimes a
    struggle even for modern PCs running Windows).

    Or, having recently been working on a tool: due to some combination
    of factors, at one point in the testing, process creation kept taking
    around 20 seconds each time, which was rather annoying (because
    seemingly Windows would just upload the whole binary to the internet,
    then wait for a response, before letting it run).

    Seemingly, something about the tool was triggering "Windows Defender
    SmartScreen" or similar; it never gave any warnings/messages
    about it, merely caused a fairly long/annoying delay whenever
    relaunching the tool. Then it just magically went away (after one of my
    secondary UPS's had "let the smoke out" and also the ISP had partly gone
    down for a while; I could see ISP local IPs but access to the wider
    internet was seemingly disrupted, ... Like, seemingly, a "the ghosts in
    the machine are not happy right now" type event).

    The tool itself was mostly writing something sorta like SFTP, but
    for working with disk images. Starting to want to revisit the filesystem question, but looking back at NTFS, still don't really want to try to implement an NTFS driver.

    Possibly EXT2/3/4 would be an option, apart from the annoyance that
    Windows can't access it, so I would still be little better off than just rolling my own (and trying to have the core design be hopefully not
    needlessly complicated).

    ...


    - anton

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Mon Mar 25 20:22:00 2024
    From Newsgroup: comp.arch

    In article <2024Mar25.193535@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    There is a significant demand for performance monitoring. Note
    that in addition to standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    Interesting. I would have expected that the likes of me are few and
    far between, and easy to ignore for a big company like ARM, Intel
    or AMD.

    The question is if "users" to ARM Holdings are actual end-users, or the
    SoC manufacturers who build chips incorporating Aarch64 cores. I'd expect
    most of the latter to want those features so that they can understand the performance of their silicon better.

    John
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Mar 25 21:42:18 2024
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    There is a significant demand for performance monitoring. Note
    that in addition to standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    Interesting. I would have expected that the likes of me are few and
    far between, and easy to ignore for a big company like ARM, Intel or AMD.

    My theory was that the CPU manufacturers put performance monitoring
    counters in CPUs in order to understand the performance of real-world programs themselves, and how they should tweak the successor core to
    relieve it of bottlenecks.

    Having reverse engineered the original Pentium EMON counters, I got a
    meeting with Intel about their next CPU (the PentiumPro). What I was
    told about the Pentium was that this chip was the first one which was
    too complicated to create/sell an In-Circuit Emulator (ICE) version, so instead they added a bunch of counters for near-zero overhead monitoring
    and depended on a bit-serial read-out when they needed to dump all state
    for debugging. (I have forgotten the proper term for that interface! :-( )

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Mar 25 20:46:39 2024
    From Newsgroup: comp.arch

    jgd@cix.co.uk (John Dallman) writes:
    In article <2024Mar25.193535@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    There is a significant demand for performance monitoring. Note
    that in addition to standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    Interesting. I would have expected that the likes of me are few and
    far between, and easy to ignore for a big company like ARM, Intel
    or AMD.

    The question is if "users" to ARM Holdings are actual end-users, or the
    SoC manufacturers who build chips incorporating Aarch64 cores. I'd expect >most of the latter to want those features so that they can understand the >performance of their silicon better.

    The biggest demand is from the OS vendors. Hardware folks have
    simulation and emulators.

    Look at VTune, for example.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Mar 25 20:48:08 2024
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    There is a significant demand for performance monitoring. Note
    that in addition to standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    Interesting. I would have expected that the likes of me are few and
    far between, and easy to ignore for a big company like ARM, Intel or AMD.

    My theory was that the CPU manufacturers put performance monitoring
    counters in CPUs in order to understand the performance of real-world
    programs themselves, and how they should tweak the successor core to
    relieve it of bottlenecks.

    Having reverse engineered the original Pentium EMON counters, I got a
    meeting with Intel about their next CPU (the PentiumPro). What I was
    told about the Pentium was that this chip was the first one which was
    too complicated to create/sell an In-Circuit Emulator (ICE) version, so instead they added a bunch of counters for near-zero overhead monitoring
    and depended on a bit-serial read-out when they needed to dump all state
    for debugging. (I have forgotten the proper term for that interface! :-( )

    Scan chains. The modern interface to scan chains (which we used on the mainframes in the late 70's/early 80's) is JTAG.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Mar 26 09:22:31 2024
    From Newsgroup: comp.arch

    jgd@cix.co.uk (John Dallman) writes:
    The question is if "users" to ARM Holdings are actual end-users, or the
    SoC manufacturers who build chips incorporating Aarch64 cores. I'd expect >most of the latter to want those features so that they can understand the >performance of their silicon better.

    That might explain why for the AmLogic S922X in the Odroid N2/N2+
    there is a Linux 4.9 kernel that supports performance monitoring
    counters (AmLogic put that in for their own uses), but the mainline
    Linux kernel does not support perf on the S922X (perf was not in the requirements of whoever integrated the S922X stuff into the mainline).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Mar 26 09:27:54 2024
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    The biggest demand is from the OS vendors. Hardware folks have
    simulation and emulators.

    You don't want to use a full-blown microarchitectural emulator for a long-running program.

    Look at vtune, for example.

    And?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Mar 26 10:47:07 2024
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Having reverse engineered the original Pentium EMON counters, I got a
    meeting with Intel about their next CPU (the PentiumPro). What I was
    told about the Pentium was that this chip was the first one which was
    too complicated to create/sell an In-Circuit Emulator (ICE) version, so
    instead they added a bunch of counters for near-zero overhead monitoring
    and depended on a bit-serial read-out when they needed to dump all state
    for debugging. (I have forgotten the proper term for that interface! :-( )

    Scan chains. The modern interface to scan chains (which we used on the mainframes in the late 70's/early 80's) is JTAG.


    Thanks!

    JTAG was indeed the term I was looking for (and not remembering). Maybe
    I'm getting old?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Mar 26 14:15:41 2024
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    The biggest demand is from the OS vendors. Hardware folks have simulation and emulators.

    You don't want to use a full-blown microarchitectural emulator for a long-running program.

    Generally hardware folks don't run 'long-running programs' when
    analyzing performance; they use the emulator for determining latencies, bandwidths, and the efficacy of cache coherency algorithms and
    cache prefetchers.

    Their target is not application analysis.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Mar 26 16:47:02 2024
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    The biggest demand is from the OS vendors. Hardware folks have simulation and emulators.

    You don't want to use a full-blown microarchitectural emulator for a long-running program.

    Generally hardware folks don't run 'long-running programs' when
    analyzing performance; they use the emulator for determining latencies, bandwidths, and the efficacy of cache coherency algorithms and
    cache prefetchers.

    Their target is not application analysis.

    This sounds like hardware folks that are only concerned with
    memory-bound programs.

    I OTOH expect that designers of out-of-order (and in-order) cores
    analyse the performance of various programs to find out where the
    bottlenecks of their microarchitectures are in benchmarks and
    applications that people look at to determine which CPU to buy. And
    that's why we have PMCs not just for memory accesses, but also
    for branch prediction accuracy, functional unit utilization, scheduler utilization, etc.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Tue Mar 26 17:29:00 2024
    From Newsgroup: comp.arch

    In article <2024Mar26.174702@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Their target is not application analysis.

    This sounds like hardware folks that are only concerned with
    memory-bound programs.

    There can be considerable confusion on this point. In the early days of
    Intel VTune, it would only work on small and simple programs, but Intel
    sent one of the lead developers to visit the UK with it, expecting that
    it would instantly find huge speed-ups in my employers' code.

    What happened was that VTune crashed almost instantly when faced with
    something that large, and Intel learned about the difference between microarchitecture analysis and application analysis.

    John
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Mar 26 18:47:38 2024
    From Newsgroup: comp.arch

    Anton Ertl wrote:

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    The biggest demand is from the OS vendors. Hardware folks have simulation and emulators.

    You don't want to use a full-blown microarchitectural emulator for a long-running program.

    Generally hardware folks don't run 'long-running programs' when
    analyzing performance; they use the emulator for determining latencies, bandwidths, and the efficacy of cache coherency algorithms and
    cache prefetchers.

    Their target is not application analysis.

    This sounds like hardware folks that are only concerned with
    memory-bound programs.

    I OTOH expect that designers of out-of-order (and in-order) cores
    analyse the performance of various programs to find out where the
    bottlenecks of their microarchitectures are in benchmarks and
    applications that people look at to determine which CPU to buy. And
    that's why we have PMCs not just for memory accesses, but also
    for branch prediction accuracy, functional unit utilization, scheduler utilization, etc.

    Quit being so CPU-centric.

    You also need measurements of how many of which transactions flow across
    the bus, DRAM use analysis, and PCIe usage to fully tune the system.

    - anton
    --- Synchronet 3.20a-Linux NewsLink 1.114