• Re: Microarchitectural support for counting

    From Paul A. Clayton@paaronclayton@gmail.com to comp.arch on Wed Dec 25 12:50:12 2024
    From Newsgroup: comp.arch

    On 10/5/24 11:11 AM, EricP wrote:
    MitchAlsup1 wrote:
    [snip ROB clearing for branch mispredict with following interrupt]
    Not necessary, you purge all of the younger instructions from the
    thread at retirement, but none of the instructions associated with
    the new <interrupt> thread at the front.

    That's difficult with a circular buffer for the instruction queue/rob
    as you can't edit the order. For a branch mispredict you might be
    able
    to mark a circular range of entries as voided, and leave the entries
    to be recovered serially at retire.

    One might also invert the order of ROB insertion for the interrupt
    thread, starting at the tail of the earlier thread and progressing
    toward and through the head of the earlier thread (the
    instructions canceled by the misprediction).

    This has the disadvantage of not allowing the new thread to use
    old ROB entries from the earlier thread until all instructions in
    the earlier thread have committed. If mispredictions and
    exceptions were handled only at commitment/retirement, there would
    be no such old entries. If a misprediction is handled before
    retirement, there generally would be earlier instructions/ROB
    entries that are not yet freeable.

    Another possibility is to use a circular buffer of pointers to ROB
    chunks. I think some SMT implementations did this to provide
    finer-grained resource sharing.
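
    A minimal software sketch of that layout (sizes and field names are
    invented for illustration, not taken from any real design): the ROB is
    carved into fixed-size chunks, and a small ring holds the chunk indices
    in program order, so a newly arriving <interrupt> thread can begin
    filling a fresh chunk while the older thread's chunks drain and are
    recycled whole at retirement.

        /* Toy model of a chunked ROB: a ring of chunk indices, not of entries. */
        #include <stdbool.h>
        #include <stdint.h>

        #define NCHUNKS       8          /* chunks in the ROB (illustrative)  */
        #define CHUNK_ENTRIES 16         /* entries per chunk (illustrative)  */

        struct rob_chunk {
            uint8_t  thread;                 /* owning thread/context         */
            uint8_t  used;                   /* entries allocated so far      */
            uint64_t entry[CHUNK_ENTRIES];   /* stand-in for per-uop state    */
        };

        struct rob {
            struct rob_chunk chunk[NCHUNKS];
            uint8_t ring[NCHUNKS];    /* program-order ring of chunk indices  */
            uint8_t head, tail, live; /* oldest slot, next free slot, count   */
            uint8_t free_mask;        /* bit i set => chunk i is free         */
        };                            /* init: free_mask = 0xFF, rest zero    */

        /* Open a new chunk for 'thread', e.g. at an interrupt hand-off. */
        static bool rob_open_chunk(struct rob *r, uint8_t thread) {
            if (r->live == NCHUNKS || r->free_mask == 0)
                return false;                          /* ROB full            */
            uint8_t c = (uint8_t)__builtin_ctz(r->free_mask);
            r->free_mask &= (uint8_t)~(1u << c);
            r->chunk[c].thread = thread;
            r->chunk[c].used   = 0;
            r->ring[r->tail]   = c;                 /* append in program order */
            r->tail = (uint8_t)((r->tail + 1) % NCHUNKS);
            r->live++;
            return true;
        }

        /* Retire the oldest chunk once all of its entries have committed;
           the whole chunk is recycled at once. */
        static void rob_retire_chunk(struct rob *r) {
            uint8_t c = r->ring[r->head];
            r->free_mask |= (uint8_t)(1u << c);
            r->head = (uint8_t)((r->head + 1) % NCHUNKS);
            r->live--;
        }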

    But voiding doesn't look like it works for exceptions or conflicting interrupt priority adjustments. In those cases purging the interrupt
    handler and rejecting the hand-off looks like the only option.

    Should exceptions always have priority? It seems to me that if a
    thread is low enough priority to be interrupted, it is low enough
    priority to have its exception processing interrupted/delayed.

    (There might be cases where normal operation allows deadlines to
    be met with lower priority and unusual extended operation requires
    high priority/resource allocation. Boosting the priority/resource
    budget of a thread/task to meet deadlines seems likely to make
    system-level reasoning more difficult. It seems one could also
    create an inflationary spiral.)

    With substantial support for Switch-on-Event MultiThreading, it
    is conceivable that a lower priority interrupt could be held
    "resident" after being interrupted by a higher priority interrupt.
    A chunked ROB could support such, but it is not clear that such
    is desirable even ignoring complexity factors.

    Being able to overlap latency of a memory-mapped I/O access (or
    other slow access) with execution of another thread seems
    attractive and even an interrupt handler with few instructions
    might have significant run time. Since interrupt blocking is
    used to avoid core-localized resource contention, software would
    have to know about such SoEMT.

    (Interrupts seem similar to certain server software threads in
    having lower ILP from control dependencies and more frequent high
    latency operations, which hints that multithreading may be
    desirable.)

    If one can live with the occasional replay of an interrupt
    hand-off and
    handler execute due to mispredict/exception/interrupt_priority_adjust
    then the interrupt pipelining looks much simplified.

    Like with multiple-instruction ROB entries, replay is a useful
    mechanism for managing complexity and resource use in the common
    case. Exploiting instruction reuse is sort of a complement:
    increasing complexity to avoid re-execution of both-path
    instructions.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Dec 25 18:30:43 2024
    From Newsgroup: comp.arch

    On Wed, 25 Dec 2024 17:50:12 +0000, Paul A. Clayton wrote:

    On 10/5/24 11:11 AM, EricP wrote:
    MitchAlsup1 wrote:
    [snip]
    --------------------------

    But voiding doesn't look like it works for exceptions or conflicting
    interrupt priority adjustments. In those cases purging the interrupt
    handler and rejecting the hand-off looks like the only option.

    Should exceptions always have priority? It seems to me that if a
    thread is low enough priority to be interrupted, it is low enough
    priority to have its exception processing interrupted/delayed.

    It depends on what you mean::

    a) if you mean that exceptions are prioritized and the highest
    priority exception is the one taken, then OK you are working
    in an ISA that has multiple exceptions per instruction. Most
    RISC ISAs do not have this property.

    b) if you mean that exceptions take priority over non-exception
    instruction streaming, well that is what exceptions ARE. In these
    cases, the exception handler inherits the priority of the instruction
    stream that raised it--but that is NOT assigning a priority to the
    exception.

    c) and then there are the cases where a PageFault from GuestOS
    page tables is serviced by GuestOS, while a PageFault from
    HyperVisor page tables is serviced by HyperVisor. You could
    assert that HV has higher priority than GuestOS, but it is
    more like HV has privilege over GuestOS while running at the
    same priority level.

    (There might be cases where normal operation allows deadlines to
    be met with lower priority and unusual extended operation requires
    high priority/resource allocation. Boosting the priority/resource
    budget of a thread/task to meet deadlines seems likely to make
    system-level reasoning more difficult. It seems one could also
    create an inflationary spiral.)

    With substantial support for Switch-on-Event MultiThreading, it
    is conceivable that a lower priority interrupt could be held
    "resident" after being interrupted by a higher priority interrupt.

    I don't know what you mean by 'resident'; would "lower priority
    ISR gets pushed on stack to allow higher priority ISR to run"
    qualify as 'resident'?

    And then there is the slightly easier case: where GuestOS is
    servicing an interrupt and the ISR takes a PageFault in HyperVisor
    page tables. HV PF ISR fixes GuestOS ISR PF, and returns
    to interrupted interrupt handler. Here, even an instruction
    stream incapable (IE & EE=OFF) of taking an Exception takes an
    Exception to a different privilege level.

    Switch-on-Event helps but is not necessary.

    A chunked ROB could support such, but it is not clear that such
    is desirable even ignoring complexity factors.

    Being able to overlap latency of a memory-mapped I/O access (or
    other slow access) with execution of another thread seems
    attractive and even an interrupt handler with few instructions
    might have significant run time. Since interrupt blocking is
    used to avoid core-localized resource contention, software would
    have to know about such SoEMT.

    It may take 10,000 cycles to read an I/O control register way
    down the PCIe tree; the ISR reads several of these registers
    and constructs a data structure to be processed by softIRQ (or
    DPC) at lower priority. So, allowing the long-cycle MMI/O LDs
    to overlap with ISR thread setup is advantageous.
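
    A generic C sketch of that shape (the register layout, the dpc_enqueue
    hand-off, and all names are hypothetical, not any particular OS's API):
    the ISR performs the slow non-posted reads, packs the results into a
    record, and only then hands it to lower-priority deferred processing.

        #include <stdint.h>

        struct nic_regs {                   /* made-up MMIO register layout  */
            volatile uint32_t int_status;
            volatile uint32_t rx_head;
            volatile uint32_t error_count;
        };

        struct dpc_item {                   /* record handed to softIRQ/DPC  */
            uint32_t status, rx_head, errors;
        };

        /* Assumed to exist in the surrounding (hypothetical) kernel code. */
        extern void dpc_enqueue(const struct dpc_item *item);

        void device_isr(struct nic_regs *regs)
        {
            struct dpc_item it;

            /* Each volatile load is a non-posted MMIO read that may take
               thousands of cycles; the three loads are independent, so a
               core that can overlap them (or run another thread meanwhile)
               hides much of the round-trip latency.                        */
            it.status  = regs->int_status;
            it.rx_head = regs->rx_head;
            it.errors  = regs->error_count;

            /* Only after the reads complete can the deferred-work record be
               queued and control be allowed to leave the ISR.              */
            dpc_enqueue(&it);
        }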

    (Interrupts seem similar to certain server software threads in
    having lower ILP from control dependencies and more frequent high
    latency operations, which hints that multithreading may be
    desirable.)

    Sooner or later an ISR has to actually deal with the MMI/O
    control registers associated with the <ahem> interrupt.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Dec 25 18:44:19 2024
    From Newsgroup: comp.arch

    On Sat, 5 Oct 2024 22:55:47 +0000, MitchAlsup1 wrote:

    On Sat, 5 Oct 2024 15:11:29 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Fri, 4 Oct 2024 18:11:23 +0000, EricP wrote:
    On interrupt, if the core starts fetching instructions from the handler and
    stuffing them into the instruction queue (ROB) while there are still
    instructions in flight, and if those older instructions get a branch
    mispredict, then the purge of mispredicted older instructions will also
    purge the interrupt handler.

    Not necessary, you purge all of the younger instructions from the
    thread at retirement, but none of the instructions associated with
    the new <interrupt> thread at the front.

    That's difficult with a circular buffer for the instruction queue/rob
    as you can't edit the order. For a branch mispredict you might be able
    to mark a circular range of entries as voided, and leave the entries
    to be recovered serially at retire.

    Every instruction needs a way to place itself before or after
    any mispredictable branch. Once you know which branch mispredicted, you
    know which instructions will not retire, transitively. All you really
    need to know is whether the instruction will retire, or not. The rest of the
    mechanics play out naturally in the pipeline.

    If, instead of nullifying every instruction past a given point, you
    make each instruction dependent on its branch's execution (as predicted),
    then instructions issued under a mispredict shadow remove THEMSELVES
    from instruction queues.
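
    One conventional way to let each instruction "place itself before or
    after any mispredictable branch" is a per-instruction branch mask, one
    bit per unresolved branch; the toy bookkeeping below is a sketch of that
    idea, not a claim about what the above pipeline actually does.

        #include <stdbool.h>
        #include <stdint.h>

        struct uop {
            uint16_t branch_mask;  /* bit b set => issued in the shadow of
                                      unresolved branch b                  */
            bool     squashed;
        };

        /* At dispatch: snapshot the mask of currently unresolved branches. */
        static void uop_dispatch(struct uop *u, uint16_t live_branch_mask) {
            u->branch_mask = live_branch_mask;
            u->squashed    = false;
        }

        /* Branch b resolved as predicted: everyone drops that dependence bit. */
        static void branch_resolved_ok(struct uop *win, int n, int b) {
            for (int i = 0; i < n; i++)
                win[i].branch_mask &= (uint16_t)~(1u << b);
        }

        /* Branch b mispredicted: every uop issued in its shadow removes
           itself; uops whose bit is clear (older ones, or ones dispatched
           for a new thread without inheriting the bit) are untouched.     */
        static void branch_mispredicted(struct uop *win, int n, int b) {
            for (int i = 0; i < n; i++)
                if (win[i].branch_mask & (1u << b))
                    win[i].squashed = true;
        }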

    If one is doing Predication with then-clauses and else-clauses*, one
    drops both clauses into execution and lets branch resolution choose which
    instructions execute and which die. At this point, the pipeline is well
    set up for using the same structure wrt interrupt hand-over. Should an
    exception happen in the application instruction stream, which was already
    in execution at the time of interruption, any branch mispredict from
    application instructions stops the application instruction stream
    precisely, and we will get back to that precise point after the ISR
    services the interrupt.

    (*) like My 66000

    But voiding doesn't look like it works for exceptions or conflicting
    interrupt priority adjustments. In those cases purging the interrupt
    handler and rejecting the hand-off looks like the only option.

    Can you make this statement again and use different words?

    If one can live with the occasional replay of an interrupt hand-off and
    handler execute due to mispredict/exception/interrupt_priority_adjust
    then the interrupt pipelining looks much simplified.

    You just have to cover the depth of the pipeline.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Dec 25 19:10:09 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 25 Dec 2024 17:50:12 +0000, Paul A. Clayton wrote:

    On 10/5/24 11:11 AM, EricP wrote:
    MitchAlsup1 wrote:
    [snip]
    --------------------------

    But voiding doesn't look like it works for exceptions or conflicting
    interrupt priority adjustments. In those cases purging the interrupt
    handler and rejecting the hand-off looks like the only option.

    Should exceptions always have priority? It seems to me that if a
    thread is low enough priority to be interrupted, it is low enough
    priority to have its exception processing interrupted/delayed.

    It depends on what you mean::

    a) if you mean that exceptions are prioritized and the highest
    priority exception is the one taken, then OK you are working
    in an ISA that has multiple exceptions per instruction. Most
    RISC ISAs do not have this property.

    AArch64 has 44 different synchronous exception priorities, and within
    each priority that describes more than one exception, there
    is a sub-prioritization therein. (Section D 1.3.5.5 pp 6080 in DDI0487K_a).

    While it is not common for a particular instruction to generate
    multiple exceptions, it is certainly possible (e.g. when
    instructions are trapped to a more privileged execution mode).


    b) if you mean that exceptions take priority over non-exception
    instruction streaming, well that is what exceptions ARE. In these
    cases, the exception handler inherits the priority of the instruction
    stream that raised it--but that is NOT assigning a priority to the
    exception.

    c) and then there are the cases where a PageFault from GuestOS
    page tables is serviced by GuestOS, while a PageFault from
    HyperVisor page tables is serviced by HyperVisor. You could
    assert that HV has higher priority than GuestOS, but it is
    more like HV has privilege over GuestOS while running at the
    same priority level.

    It seems unlikely that a translation fault in user mode would need
    handling in both the guest OS and the hypervisor during the
    execution of an instruction; the
    exception to the hypervisor would generally occur when the
    instruction trapped by the guest (who updated the guest translation
    tables) is restarted.

    Other exception causes (such as asynchronous exceptions
    like interrupts) would remain pending and be taken (subject
    to priority and control enables) when the instruction is
    restarted (or the next instruction is dispatched for asynchronous
    exceptions).


    <snip>

    Being able to overlap latency of a memory-mapped I/O access (or
    other slow access) with execution of another thread seems

    That depends on whether the access is posted or non-posted. Only
    the latter affects instruction latency. The bulk of I/O to and
    from a PCIe device is initiated by the device directly
    to memory (subject to iommu translation), not by the CPU, so
    generally the latency to read an MMIO register is not high enough
    to worry about scheduling other work on the core during
    the transfer.

    In most cases, it takes 1 or 2 orders of magnitude less than 10,000
    cycles to read an I/O control register in a typical PCI express function[***], particularly with modern on-chip PCIe endpoints[*] and CXL[**] (absent
    a PCIe Switched fabric such as the now deprecated multi-root
    I/O virtualization (MR-IOV)). A PCIe Gen-5 card can turn around
    a memory read request rather rapidly if the host I/O bus is
    clocked at a significant fraction (or unity) of the processor
    clock.

    [*] Such as the various bus 0 functions integrated into Intel and
    ARM processors (e.g. memory controller, I2C, SPI, etc.) or
    on-chip network and crypto accelerators.

    [**] 150ns round trip additional latency compared with
    local DRAM with PCIe GEN5.

    [***] which don't need to deal with the PCIe transport
    and data link layers
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Dec 25 20:26:05 2024
    From Newsgroup: comp.arch

    On Wed, 25 Dec 2024 19:10:09 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 25 Dec 2024 17:50:12 +0000, Paul A. Clayton wrote:

    On 10/5/24 11:11 AM, EricP wrote:
    MitchAlsup1 wrote:
    [snip]
    --------------------------

    But voiding doesn't look like it works for exceptions or conflicting
    interrupt priority adjustments. In those cases purging the interrupt
    handler and rejecting the hand-off looks like the only option.

    Should exceptions always have priority? It seems to me that if a
    thread is low enough priority to be interrupted, it is low enough
    priority to have its exception processing interrupted/delayed.

    It depends on what you mean::

    a) if you mean that exceptions are prioritized and the highest
    priority exception is the one taken, then OK you are working
    in an ISA that has multiple exceptions per instruction. Most
    RISC ISAs do not have this property.

    AArch64 has 44 different synchronous exception priorities, and within
    each priority that describes more than one exception, there
    is a sub-prioritization therein. (Section D 1.3.5.5 pp 6080 in
    DDI0487K_a).

    Thanks for the link::

    However, I would claim that the vast majority of those 44 things
    are interrupts and not exceptions (in colloquial nomenclature).

    An exception is raised if an instruction cannot execute to completion
    and is raised synchronously with the instruction stream (and at a
    precise point in the instruction stream).

    An interrupt is raised asynchronous to the instruction stream.

    Reset is an interrupt and not an exception.

    Debug that hits an address range is closer to an interrupt than an
    exception. <but I digress>

    But it appears that ARM has many interrupts classified as exceptions.
    Anything not generated from instructions within the architectural
    instruction stream is an interrupt, and anything generated from
    within an architectural instructions stream is an exception.

    It also appears ARM uses priority to sort exceptions into an order,
    while most architectures define priority as a mechanism to choose
    when to take hard-control-flow-events rather than what.

    Be that as it may...


    While it is not common for a particular instruction to generate
    multiple exceptions, it is certainly possible (e.g. when
    instructions are trapped to a more privileged execution mode).


    b) if you mean that exceptions take priority over non-exception
    instruction streaming, well that is what exceptions ARE. In these
    cases, the exception handler inherits the priority of the instruction
    stream that raised it--but that is NOT assigning a priority to the
    exception.

    c) and then there are the cases where a PageFault from GuestOS
    page tables is serviced by GuestOS, while a PageFault from
    HyperVisor page tables is serviced by HyperVisor. You could
    assert that HV has higher priority than GuestOS, but it is
    more like HV has privilege over GuestOS while running at the
    same priority level.

    It seems unlikely that a translation fault in user mode would need
    handling in both the guest OS and the hypervisor during the
    execution of an instruction;

    Neither stated nor inferred. A PageFault is handled singularly by
    the level in the system that controls (writes) those PTEs.

    There is a significant period of time in many architectures after
    control arrives at ISR where the ISR is not allowed to raise a
    page fault {Storing registers to a stack}, and since this ISR
    might be the PageFault handler, it is not in a position to
    handle its own faults. However, HyperVisor can handle GuestOS PageFaults--GuestOS thinks the pages are present with reasonable
    access rights, HyperVisor tables are used to swap them in/out.
    Other than latency GuestOS ISR does not see the PageFault.

    My 66000, on the other hand, when ISR receives control, state
    has been saved on a stack, the instruction stream is already
    re-entrant, and the register file is as it was the last time
    this ISR ran.

    the
    exception to the hypervisor would generally occur when the
    instruction trapped by the guest (who updated the guest translation
    tables) is restarted.

    Other exception causes (such as asynchronous exceptions
    like interrupts)

    Asynchronous exceptions A R E interrupts, not like interrupts;
    they ARE interrupts. If it is not synchronous with instruction
    stream it is an interrupt. Only if it is synchronous with the
    instruction stream is it an exception.

    would remain pending and be taken (subject
    to priority and control enables) when the instruction is
    restarted (or the next instruction is dispatched for asynchronous
    exceptions).


    <snip>

    Being able to overlap latency of a memory-mapped I/O access (or
    other slow access) with execution of another thread seems

    That depends on whether the access is posted or non-posted.

    Writes can be posted, Reads cannot. Reads must complete for the
    ISR to be able to setup the control block softIRQ/DPC will
    process shortly. Only after the data structure for softIRQ/DPC
    is written can ISR allow control flow to leave.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Dec 25 20:35:29 2024
    From Newsgroup: comp.arch

    On Sat, 5 Oct 2024 15:11:29 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Fri, 4 Oct 2024 18:11:23 +0000, EricP wrote:

    --------------------------

    Not necessary, you purge all of the younger instructions from the
    thread at retirement, but none of the instructions associated with
    the new <interrupt> thread at the front.

    That's difficult with a circular buffer for the instruction queue/rob
    as you can't edit the order. For a branch mispredict you might be able
    to mark a circular range of entries as voided, and leave the entries
    to be recovered serially at retire.

    Sooner or later, the pipeline designer needs to recognize the oft-occurring
    code sequence pictured as::

         INST
         INST
         BC-------\
         INST     |
         INST     |
         INST     |
    /----BR       |
    |    INST<----/
    |    INST
    |    INST
    \--->INST
         INST

    So that the branch predictor predicts as usual, but DECODER recognizes
    the join point of this prediction, so if the prediction is wrong, one
    only nullifies the mispredicted instructions and then inserts the
    alternate instructions while holding the join point instructions until
    the alternate instructions complete.

    But voiding doesn't look like it works for exceptions or conflicting interrupt priority adjustments. In those cases purging the interrupt
    handler and rejecting the hand-off looks like the only option.

    Nullify instructions from the mispredicted paths. On hand-off to the ISR,
    adjust the recovery IP to just past the last instruction that executed
    properly, nullifying everything between the exception and the ISR.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Dec 26 09:46:21 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    Sooner or later, the pipeline designer needs to recognize the oft-occurring
    code sequence pictured as::

         INST
         INST
         BC-------\
         INST     |
         INST     |
         INST     |
    /----BR       |
    |    INST<----/
    |    INST
    |    INST
    \--->INST
         INST

    So that the branch predictor predicts as usual, but DECODER recognizes
    the join point of this prediction, so if the prediction is wrong, one
    only nullifies the mispredicted instructions and then inserts the
    alternate instructions while holding the join point instructions until
    the alternate instructions complete.

    Would this really save much? The main penalty here would still be
    fetching and decoding the alternate instructions. Sure, the
    instructions after the join point would not have to be fetched and
    decoded, but they would still have to go through the renamer, which
    typically is as narrow or narrower than instruction fetch and decode,
    so avoiding fetch and decode only helps for power (ok, that's
    something), but probably not performance.

    And the kind of insertion you imagine makes things more complicated,
    and only helps in the rare case of a misprediction.

    What alternatives do we have? There still are some branches that are
    hard to predict and for which it would be helpful to optimize them.

    Classically the programmer or compiler was supposed to turn
    hard-to-predict branches into conditional execution (e.g., someone
    (IIRC ARM) has an ITE instruction for that, and My 66000 has something
    similar IIRC). These kinds of instructions tend to turn the condition
    from a control-flow dependency (free when predicted, costly when
    mispredicted) into a data-flow dependency (usually some cost, but
    usually much lower than a misprediction).
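
    A C-level picture of the trade-off, with nothing assumed about any
    particular compiler: the first form is a control dependency that the
    predictor has to guess; the second expresses the same result as a data
    dependency, which a compiler will typically lower to a conditional
    move/select rather than a branch.

        /* Branchy form: ~free when predicted, expensive when mispredicted. */
        int clamp_branchy(int x, int limit) {
            if (x > limit)
                return limit;
            return x;
        }

        /* Data-flow form: the condition feeds a select, so every call pays
           a small fixed cost but never a misprediction penalty.            */
        int clamp_select(int x, int limit) {
            return x > limit ? limit : x;
        }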

    But programmers are not that great at predicting mispredictions (and
    compilers are worse (even with feedback-directed optimization as it
    exists, i.e., without prediction accuracy feedback), and
    predictability might change between phases or callers.

    So it seems to me that this is something where the hardware might use
    history data to predict whether a branch is hard to predict (and maybe
    also take into account how the dependencies affect the cost), and
    switch between a branch-predicting implementation and a data-flow
    implementation of the condition.

    I have not followed ISCA and Micro proceedings in recent years, but I
    would not be surprised if somebody has already done a paper on such an
    idea.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Dec 26 02:21:23 2024
    From Newsgroup: comp.arch

    On 10/3/2024 7:00 AM, Anton Ertl wrote:
    Two weeks ago Rene Mueller presented the paper "The Cost of Profiling
    in the HotSpot Virtual Machine" at MPLR 2024. He reported that for
    some programs the counters used for profiling the program result in
    cache contention due to true or false sharing among threads.

    The traditional software mitigation for that problem is to split the
    counters into per-thread or per-core instances. But for heavily multi-threaded programs running on machines with many cores the cost
    of this mitigation is substantial.

    So I thought up the idea of doing a similar thing in the caches (as a hardware mechanism), and Rene Mueller indicated that he had been
    thinking in the same direction, but his hardware people were not
    interested.

    In any case, here's how this might work. First, you need an
    add-to-memory instruction that does not need to know anything about
    the result (so the existing AMD64 instruction is not enough, thanks to
    it producing a flags result). Have cache consistency states
    "accumulator64", "accumulator32", "accumulator16", "accumulator8" (in addition to MOESI), which indicate that the cache line contains
    naturally aligned 64-bit, 32-bit, 16-bit, or 8-bit counters
    respectively. Not all of these states need to be supported. You also
    add a state shared-plus-accumulators (SPA).

    The architectural value of the values in all these cache lines is the
    sum of the SPA line (or main memory) plus all the accumulator lines.

    An n-bit add-to-memory adds that value to an accumulator-n line. If
    there is no such line, such a line is allocated, initialized to zero,
    and the add then stores the increment in the corresponding part. When allocating the accumulator line, an existing line may be forced to
    switch to SPA, and/or may be moved outwards in the cache. But
    if the add applies to some local line in exclusive or modified state,
    it's probably better to just update that line without any accumulator
    stuff.

    If there is a read access to such memory, all the accumulator lines
    are summed up and added with an SPA line (or main memory); this is
    relatively expensive, so this whole thing makes most sense if the
    programmer can arrange to have many additions relative to reads or
    writes. The SPA line is shared, so we keep its contents and the
    contents of the accumulator lines unchanged.

    For writes various options are possible; the catchall would be to add
    all the accumulator lines for that address to one of the SPA lines of
    that memory, overwrite the memory there, and broadcast the new line
    contents to the other SPA lines or invalidate them, and zero or
    invalidate all the accumulator lines. Another option is to write the
    value to one SPA copy (and invalidate the other SPA lines), and zero
    the corresponding bytes in the accumulator lines; this only works if
    there are no accumulators wider than the write.

    You will typically support the accumulator states only at L1 and maybe
    L2; if an accumulator line gets cool enough to be evicted from that,
    it can be added to the SPA line or to main memory.

    How do we enter these states? Given that every common architecture
    needs special instructions for using them, use of these instructions
    on a shared cache line or modified or exclusive line of another core
    would be a hint that using these states is a good idea.
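
    A software analogue of those semantics (it mimics only the invariant
    "architectural value = SPA line plus all accumulator lines", not the
    cache machinery; the core count and names are illustrative): adds touch
    only the local accumulator, and the rare read folds everything together.

        #include <stdatomic.h>
        #include <stdint.h>

        #define NCORES 8                       /* illustrative */

        struct acc_counter {
            _Atomic int64_t shared;            /* plays the role of the SPA line */
            _Atomic int64_t acc[NCORES];       /* per-core "accumulator lines"   */
        };

        /* Add-to-memory path: touch only this core's accumulator. */
        static void counter_add(struct acc_counter *c, int core, int64_t x) {
            atomic_fetch_add_explicit(&c->acc[core], x, memory_order_relaxed);
        }

        /* Read path (rare and comparatively expensive): shared + sum of
           all accumulators, mirroring the architectural value above.      */
        static int64_t counter_read(struct acc_counter *c) {
            int64_t v = atomic_load_explicit(&c->shared, memory_order_relaxed);
            for (int i = 0; i < NCORES; i++)
                v += atomic_load_explicit(&c->acc[i], memory_order_relaxed);
            return v;
        }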

    This is all a rather elaborate mechanism. Are counters in
    multi-threaded programs used enough (and read rarely enough) to
    justify the cost of implementing it? For the HotSpot application, the eventual answer was that they live with the cost of cache contention
    for the programs that have that problem. After some minutes the hot
    parts of the program are optimized, and cache contention is no longer
    a problem. But maybe there are some other applications that do more long-time accumulating that would benefit.

    If the per-thread counters are properly padded to an L2 cache line and
    properly aligned on cache line boundaries, well, they should not cause
    false sharing with other cache lines... Right?
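
    A minimal sketch of that padding, assuming a 64-byte cache line and
    illustrative counts (NCOUNTERS, NTHREADS):

        #include <stdint.h>

        #define CACHE_LINE 64          /* assumed line size                  */
        #define NCOUNTERS  1024        /* counters per thread (illustrative) */
        #define NTHREADS   64          /* illustrative                       */

        /* One private block of counters per thread. Aligning the block to a
           cache line (the compiler rounds its size up to a multiple of the
           alignment) means no line ever holds counters of two different
           threads, so their increments cannot falsely share.              */
        struct thread_counters {
            _Alignas(CACHE_LINE) uint64_t c[NCOUNTERS];
        };

        static struct thread_counters counters[NTHREADS];

        static inline void count_event(unsigned thread, unsigned idx) {
            counters[thread].c[idx]++;  /* no other thread touches this line */
        }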

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Dec 26 12:32:29 2024
    From Newsgroup: comp.arch

    On Wed, 25 Dec 2024 20:35:29 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Sat, 5 Oct 2024 15:11:29 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Fri, 4 Oct 2024 18:11:23 +0000, EricP wrote:

    --------------------------

    Not necessary, you purge all of the younger instructions from the
    thread at retirement, but none of the instructions associated with
    the new <interrupt> thread at the front.

    That's difficult with a circular buffer for the instruction
    queue/rob as you can't edit the order. For a branch mispredict you
    might be able to mark a circular range of entries as voided, and
    leave the entries to be recovered serially at retire.

    Sooner or later, the pipeline designer needs to recognize the of
    occuring
    code sequence pictured as::

    INST
    INST
    BC-------\
    INST |
    INST |
    INST |
    /----BR |
    | INST<----/
    | INST
    | INST
    \--->INST
    INST

    So that the branch predictor predicts as usual, but DECODER recognizes
    the join point of this prediction, so if the prediction is wrong, one
    only nullifies the mispredicted instructions and then inserts the
    alternate instructions while holding the join point instructions until
    the alternate instruction complete.


    Yes, compilers often generate such code.
    When coding in asm, I typically know at least something about
    probability of branches, so I tend to code it differently:

    inst
    inst
    bc colder_section
    inst
    inst
    inst
    merge_flow:
    inst
    inst
    ...
    ret

    colder_section:
    inst
    inst
    inst
    br merge_flow


    Intel's "efficiency" cores family starting from Tremont has weird
    "clustered" front end design. It often prefers [predicted] taken
    branches over [predicted] non-taken branches. On front ends like that
    my optimization is likely to become pessimization.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Dec 26 14:56:30 2024
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 10/3/2024 7:00 AM, Anton Ertl wrote:
    Two weeks ago Rene Mueller presented the paper "The Cost of Profiling
    in the HotSpot Virtual Machine" at MPLR 2024. He reported that for
    some programs the counters used for profiling the program result in
    cache contention due to true or false sharing among threads.

    The traditional software mitigation for that problem is to split the
    counters into per-thread or per-core instances. But for heavily
    multi-threaded programs running on machines with many cores the cost
    of this mitigation is substantial.
    ...
    For the HotSpot application, the
    eventual answer was that they live with the cost of cache contention
    for the programs that have that problem. After some minutes the hot
    parts of the program are optimized, and cache contention is no longer
    a problem.
    ...
    If the per-thread counters are properly padded to an L2 cache line and
    properly aligned on cache line boundaries, well, they should not cause
    false sharing with other cache lines... Right?

    Sure, that's what the first sentence of the second paragraph you cited
    (and which I cited again) is about. Next, read the next sentence.

    Maybe I should give an example (fully made up on the spot, read the
    paper for real numbers): If HotSpot uses, on average, one counter per
    conditional branch, and assuming a conditional branch every 10 static
    instructions (each having, say, 4 bytes), with 1MB of generated code
    and 8 bytes per counter, that's 200KB of counters. But these counters
    are shared between all threads, so for code running on many cores you
    get true and false sharing.
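
    (Spelling that arithmetic out: 1 MB of code at 4 bytes per instruction
    is ~262,000 instructions; one conditional branch per 10 instructions
    gives ~26,000 counters; at 8 bytes each that is ~210 KB, the ~200 KB
    above, and per core that figure is what multiplies in the next
    paragraph.)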

    As mentioned, the usual mitigation is per-core counters. With a
    256-core machine, we now have 51.2MB of counters for 1MB of executable
    code. Now this is Java, so there might be quite a bit more executable
    code and correspondingly more counters. They eventually decided that
    the benefit of reduced cache coherence traffic is not worth that cost
    (or the cost of a hardware mechanism), as described in the last
    paragraph, from which I cited the important parts.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Dec 26 14:25:37 2024
    From Newsgroup: comp.arch

    MitchAlsup1 wrote:
    On Sat, 5 Oct 2024 15:11:29 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Fri, 4 Oct 2024 18:11:23 +0000, EricP wrote:

    --------------------------

    Not necessary, you purge all of the younger instructions from the
    thread at retirement, but none of the instructions associated with
    the new <interrupt> thread at the front.

    That's difficult with a circular buffer for the instruction queue/rob
    as you can't edit the order. For a branch mispredict you might be able
    to mark a circular range of entries as voided, and leave the entries
    to be recovered serially at retire.

    Sooner or later, the pipeline designer needs to recognize the oft-occurring
    code sequence pictured as::

         INST
         INST
         BC-------\
         INST     |
         INST     |
         INST     |
    /----BR       |
    |    INST<----/
    |    INST
    |    INST
    \--->INST
         INST

    So that the branch predictor predicts as usual, but DECODER recognizes
    the join point of this prediction, so if the prediction is wrong, one
    only nullifies the mispredicted instructions and then inserts the
    alternate instructions while holding the join point instructions until
    the alternate instructions complete.

    Yes. Long ago I looked at some academic papers on hardware IF-conversion.
    Those papers were in the context of Itanium around 2005 or so,
    automatically converting short forward branches into predication.

    There were also papers that looked at HW converting predication
    back into short branches because they tie down less resources.

    IIRC they were looking at interactions between predication,
    aka Guarded Execution, and branch predictors, and how IF-conversion
    affects the branch predictor stats.

    But voiding doesn't look like it works for exceptions or conflicting
    interrupt priority adjustments. In those cases purging the interrupt
    handler and rejecting the hand-off looks like the only option.

    Nullify instructions from the mispredicted paths. On hand-off to the ISR,
    adjust the recovery IP to just past the last instruction that executed
    properly, nullifying everything between the exception and the ISR.

    Yes, that seems the most straightforward way to do it.
    But to nullify *some* of the in-flight instructions and not others,
    just the ones in the mispredicted shadow, in the middle of a stream
    of other instructions, seems to require much of the logic necessary
    to support general OoO predication/guarded-execution.

    Branch mispredict could use two mechanisms, one using checkpoint
    and rollback for a normal branch mispredict which recovers resources immediately in one clock, and another if there is a pipelined interrupt
    already appended which defers resource recovery to retire.



    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Dec 26 21:54:53 2024
    From Newsgroup: comp.arch

    On Thu, 26 Dec 2024 9:46:21 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    Sooner or later, the pipeline designer needs to recognize the oft-occurring
    code sequence pictured as::

         INST
         INST
         BC-------\
         INST     |
         INST     |
         INST     |
    /----BR       |
    |    INST<----/
    |    INST
    |    INST
    \--->INST
         INST

    So that the branch predictor predicts as usual, but DECODER recognizes
    the join point of this prediction, so if the prediction is wrong, one
    only nullifies the mispredicted instructions and then inserts the
    alternate instructions while holding the join point instructions until
    the alternate instructions complete.

    Would this really save much? The main penalty here would still be
    fetching and decoding the alternate instructions. Sure, the
    instructions after the join point would not have to be fetched and
    decoded, but they would still have to go through the renamer, which
    typically is as narrow or narrower than instruction fetch and decode,
    so avoiding fetch and decode only helps for power (ok, that's
    something), but probably not performance.

    When you have the property that FETCH will stumble over the join point
    before the branch resolves, the fact you reached the join point means
    a branch misprediction is avoided (~16 cycles) and you nullify 4
    instructions from reservation stations. FETCH is not disrupted, and
    execution continues.

    The balance is the mispredict recovery overhead (~16 cycles) compared
    to the cost of inserting the un-predicted path into execution (1 cycle
    in the illustrated case).

    And the kind of insertion you imagine makes things more complicated,
    and only helps in the rare case of a misprediction.

    PREDication is designed for the unpredictable branches--as a means to
    directly express the fact that the <equivalent> branch code is not
    expected to be predicted well.

    For easy-to-predict branches, don't recode as PRED--presto; done. So,
    rather than having to Hint branches or to guard individual instructions,
    I PREDicate short clauses, saving bits in each instruction because these
    bits come from the PRED-instruction.

    What alternatives do we have? There still are some branches that are
    hard to predict and for which it would be helpful to optimize them.

    Classically the programmer or compiler was supposed to turn
    hard-to-predict branches into conditional execution (e.g., someone
    (IIRC ARM) has an ITE instruction for that, and My 66000 has something
    similar IIRC). These kinds of instructions tend to turn the condition
    from a control-flow dependency (free when predicted, costly when mispredicted) into a data-flow dependency (usually some cost, but
    usually much lower than a misprediction).

    Conditional execution and merging (CMOV) rarely takes as few
    instructions
    as branchy code and <almost> always consumes more power. However, there
    are a few cases where CMOV works out better than PRED, so: My 66000 has
    both.

    But programmers are not that great at predicting mispredictions (and
    programming languages usually don't have ways to express them),
    compilers are worse (even with feedback-directed optimization as it
    exists, i.e., without prediction accuracy feedback), and
    predictability might change between phases or callers.

    A conditional branch inside a subroutine is almost always dependent on
    who calls the subroutine. Some calls may have a nearly 100% prediction
    rate in one direction, other calls a near 100% prediction rate in the
    other direction.

    One thing that IS different in My 66000 (other than PREDs not needing to
    be predicted) is that loops are not predicted--there is a LOOP
    instruction that performs ADD-CMP-BC back to the top of the loop in 1 cycle.
    Since HW can see the iterating register and the terminating limit,
    one does not overpredict iterations and then mispredict them away;
    instead one predicts the loop only so long as the arithmetic supports
    that prediction.

    Thus, in My 66000, looping branches do not contribute to predictor
    pollution (updates), leaving the branch predictors to deal with the
    harder stuff. In addition we have a LD IP instruction (called CALX)
    that loads a value from a table directly into IP, so no jumps here.
    And finally:: My 66000 has a Jump Through Table (JTT) instruction,
    which performs:: range check, table access, add scaled table entry
    to IP and transfer control.

    Thus, there is very little indirect prediction (maybe none on smaller implementations), switch tables are all PIC, and the tables are
    typically ¼ the size of the equivalents in other 64-bit ISAs.

    So, by taking Looping branches, indirect branches, and indirect calls
    out of the prediction tables, those that remain should be more
    predictable.

    So it seems to me that this is something where the hardware might use
    history data to predict whether a branch is hard to predict (and maybe
    also take into account how the dependencies affect the cost), and
    switch between a branch-predicting implementation and a data-flow
    implementation of the condition.

    I have not followed ISCA and Micro proceedings in recent years, but I
    would not be surprised if somebody has already done a paper on such an
    idea.

    - anton
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From jseigh@jseigh_es00@xemaps.com to comp.arch on Fri Dec 27 11:16:47 2024
    From Newsgroup: comp.arch

    On 10/3/24 10:00, Anton Ertl wrote:
    Two weeks ago Rene Mueller presented the paper "The Cost of Profiling
    in the HotSpot Virtual Machine" at MPLR 2024. He reported that for
    some programs the counters used for profiling the program result in
    cache contention due to true or false sharing among threads.

    The traditional software mitigation for that problem is to split the
    counters into per-thread or per-core instances. But for heavily multi-threaded programs running on machines with many cores the cost
    of this mitigation is substantial.


    For profiling, do we really need accurate counters? They just need to
    be statistically accurate I would think.

    Instead of incrementing a counter, just store a non-zero immediate into
    a zero-initialized byte array at a per-"counter" index. There's no
    rmw data dependency, just a store, so it should have little impact on
    the pipeline.

    A profiling thread loops thru the byte array, incrementing an actual
    counter when it sees a non-zero byte, and resets the byte to zero. You
    could use vector ops to process the array.

    If the stores were fast enough, you could do 2 or more stores at
    hashed indices, different hash for each store. Sort of a counting
    Bloom filter. The effective count would be the minimum of the
    hashed counts.

    No idea how feasible this would be though.
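
    A minimal sketch of the byte-array variant (sizes, names, and the sweep
    policy are all made up): the instrumented code does a plain one-byte
    store, and a background thread periodically folds set bytes into real
    counters; counts are statistical because hits between two sweeps of the
    same slot collapse into one.

        #include <stdint.h>

        #define NCOUNTERS 4096                    /* illustrative            */

        static volatile uint8_t hit[NCOUNTERS];   /* written by worker threads
                                                     (relaxed atomics would be
                                                     the tidier choice)      */
        static uint64_t counts[NCOUNTERS];        /* owned by the profiler    */

        /* Hot path: no read-modify-write, just a byte store. */
        static inline void profile_hit(unsigned idx) {
            hit[idx] = 1;
        }

        /* Profiler thread, run periodically (or with vector ops): fold the
           flags into the real counters and clear them.                     */
        static void profile_sweep(void) {
            for (unsigned i = 0; i < NCOUNTERS; i++) {
                if (hit[i]) {
                    counts[i]++;
                    hit[i] = 0;
                }
            }
        }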

    Joe Seigh




    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Dec 27 16:38:21 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 25 Dec 2024 19:10:09 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 25 Dec 2024 17:50:12 +0000, Paul A. Clayton wrote:

    On 10/5/24 11:11 AM, EricP wrote:
    MitchAlsup1 wrote:
    [snip]
    --------------------------

    But voiding doesn't look like it works for exceptions or conflicting
    interrupt priority adjustments. In those cases purging the interrupt
    handler and rejecting the hand-off looks like the only option.

    Should exceptions always have priority? It seems to me that if a
    thread is low enough priority to be interrupted, it is low enough
    priority to have its exception processing interrupted/delayed.

    It depends on what you mean::

    a) if you mean that exceptions are prioritized and the highest
    priority exception is the one taken, then OK you are working
    in an ISA that has multiple exceptions per instruction. Most
    RISC ISAs do not have this property.

    AArch64 has 44 different synchronous exception priorities, and within
    each priority that describes more than one exception, there
    is a sub-prioritization therein. (Section D 1.3.5.5 pp 6080 in
    DDI0487K_a).

    Thanks for the link::

    However, I would claim that the vast majority of those 44 things
    are interrupts and not exceptions (in colloquial nomenclature).

    I think that nomenclature is often processor specific. Anything
    that occurs synchronously during instruction execution as a result
    of executing that particular instruction is considered an exception
    in AArch64. Many of them are traps to higher exception levels
    for various reasons (including hypervisor traps) which can occur
    potentially with other exceptions such as TLB faults, etc.

    Interrupts, in the ARM sense, are _always_ asynchronous, and more
    specifically refer to the two signals IRQ and FIQ that the Generic
    Interrupt Controller uses to inform a processing thread that it
    needs to handle an I/O interrupt.

    In AArch64, they all vector through the same per-exception-level (kernel, hypervisor, secure monitor, realm) vector table.


    An exception is raised if an instruction cannot execute to completion
    and is raised synchronously with the instruction stream (and at a
    precise point in the instruction stream.

    That description describes accurately all of the 44 conditions
    above - the section is entitled, after all, 'SYNCHRONOUS exception
    priorities'. Interrupts are by definition asynchronous in the
    AArch64 architecture.


    An interrupt is raised asynchronous to the instruction stream.

    Reset is an interrupt and not an exceptions.

    I would argue that reset is a condition and is in this list
    as such - sometimes it is synchronous (a result of executing
    a special instruction or store to a system register), sometimes
    it is asynchronous (via the chipset/SoC). The fact that reset
    has the highest priority is noted here specifically.


    Debug that hits an address range is closer to an interrupt than an
    exception. <but I digress>

    It is still synchronous to instruction execution.


    But it appears that ARM has many interrupts classified as exceptions.
    Anything not generated from instructions within the architectural
    instruction stream is an interrupt, and anything generated from
    within an architectural instruction stream is an exception.

    That's your definition. It certainly doesn't apply to AArch64
    (or the Burroughs mainframes, for that matter).


    It also appears ARM uses priority to sort exceptions into an order,
    while most architectures define priority as a mechanism to choose
    when to take hard-control-flow-events rather than what.

    They desire determinism for the software.



    It seems unlikely that a translation fault in user mode would need
    handling in both the guest OS and the hypervisor during the
    execution of an instruction;

    Neither stated nor inferred. A PageFault is handled singularly by
    the level in the system that controls (writes) those PTEs.

    Indeed. And the guest OS owns the PTEs (TTEs) for the guest
    user process, and the hypervisor owns the PTEs for the guest
    "physical address space view". This is true for ARM, Intel
    and AMD.


    There is a significant period of time in many architectures after
    control arrives at ISR where the ISR is not allowed to raise a
    page fault {Storing registers to a stack}, and since this ISR
    might be the PageFault handler, it is not in a position to
    handle its own faults. However, HyperVisor can handle GuestOS
    PageFaults--GuestOS thinks the pages are present with reasonable
    access rights, HyperVisor tables are used to swap them in/out.
    Other than latency GuestOS ISR does not see the PageFault.

    I've written two hypervisors (one on x86, long before hardware
    assist, in 1998, and one using AMD SVM and NPT in the mid 2000s). There is a
    very clean delineation between the guest physical address space view
    from the guest and guest applications, and the host physical
    address space apportioned out to the various guest OS' by
    the hypervisor. In some cases the hypervisor can not even
    peek into the guest physical address space. They are distinct
    and independent (sans paravirtualization).


    My 66000, on the other hand, when ISR receives control, state
    has been saved on a stack, the instruction stream is already
    re-entrant, and the register file is as it was the last time
    this ISR ran.

    The AArch64 exception entry (for both interrupts and exceptions)
    is identical and takes only a few cycles. The exception routine
    (ISR in your nomenclature) can decide for itself what state
    to preserve (the processor state and return address are saved
    in special per-exception-level system registers automatically
    during exception entry and restored by exception return (eret
    instruction)).


    the
    exception to the hypervisor would generally occur when the
    instruction trapped by the guest (who updated the guest translation
    tables) is restarted.

    Other exception causes (such as asynchronous exceptions
    like interrupts)

    Asynchronous exceptions A R E interrupts, not like interrupts;
    they ARE interrupts. If it is not synchronous with instruction
    stream it is an interrupt. Only if it is synchronous with the
    instruction stream is it an exception.

    Your interrupt terminology differs from the ARM version. An
    interrupt is considered an asynchronous exception (of which
    there are three - IRQ, FIQ and SError[*]). Both synchronous
    exceptions and asynchronous exceptions use the
    same vector table (indexed by exception level (privilege))
    and the ESR_ELx (Exception Syndrome Register) has a 6-bit
    exception code that the exception routine uses to vector
    to the appropriate handler. Data and Instruction abort
    (translation fault) exception codes distinguish
    a translation fault that occurred at the lesser privilege
    (e.g. user mode trapping to kernel, or a guest page fault
    trapping to the hypervisor) from one at the current level.

    [*] Asynchronous system error (e.g. a posted store that subsequently
    failed downstream).
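
    A sketch of that dispatch (EC values quoted from memory and best checked
    against the ARM ARM; the handler names are placeholders): the 6-bit EC
    field lives in ESR_ELx[31:26], and the top-level handler switches on it.

        #include <stdint.h>

        /* A few Exception Class values, from memory - verify against DDI0487. */
        #define EC_SVC64        0x15u   /* SVC from AArch64 state             */
        #define EC_IABT_LOWER   0x20u   /* Instruction Abort from a lower EL  */
        #define EC_DABT_LOWER   0x24u   /* Data Abort from a lower EL         */
        #define EC_DABT_SAME    0x25u   /* Data Abort from the current EL     */

        void sync_exception_dispatch(uint64_t esr)
        {
            unsigned ec = (unsigned)(esr >> 26) & 0x3fu;  /* ESR_ELx.EC       */

            switch (ec) {
            case EC_SVC64:       /* handle_syscall(esr & 0xffff);   */ break;
            case EC_DABT_LOWER:
            case EC_DABT_SAME:   /* handle_data_abort(esr);         */ break;
            case EC_IABT_LOWER:  /* handle_instruction_abort(esr);  */ break;
            default:             /* unexpected/unhandled class      */ break;
            }
        }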


    would remain pending and be taken (subject
    to priority and control enables) when the instruction is
    restarted (or the next instruction is dispatched for asynchronous
    exceptions).


    <snip>

    Being able to overlap latency of a memory-mapped I/O access (or
    other slow access) with execution of another thread seems

    That depends on whether the access is posted or non-posted.

    Writes can be posted, Reads cannot. Reads must complete for the
    ISR to be able to setup the control block softIRQ/DPC will
    process shortly. Only after the data structure for softIRQ/DPC
    is written can ISR allow control flow to leave.

    As I said, it depends on whether it is posted or not. A store
    to trigger a doorbell that starts processing a ring of
    DMA instructions, for example, has no latency. And the DMA
    is all initiated by the endpoint device, not the OS.

    All that said, this isn't 1993 PCI; modern chipset and PCIe
    latencies are less than they used to be, especially on
    SoCs where you don't have SERDES overhead.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From jseigh@jseigh_es00@xemaps.com to comp.arch on Sat Dec 28 07:20:17 2024
    From Newsgroup: comp.arch

    On 12/27/24 11:16, jseigh wrote:
    On 10/3/24 10:00, Anton Ertl wrote:
    Two weeks ago Rene Mueller presented the paper "The Cost of Profiling
    in the HotSpot Virtual Machine" at MPLR 2024.  He reported that for
    some programs the counters used for profiling the program result in
    cache contention due to true or false sharing among threads.

    The traditional software mitigation for that problem is to split the
    counters into per-thread or per-core instances.  But for heavily
    multi-threaded programs running on machines with many cores the cost
    of this mitigation is substantial.


    For profiling, do we really need accurate counters?  They just need to
    be statistically accurate I would think.

    Instead of incrementing a counter, just store a non-zero immediate into
    a zero initialized byte array at a per "counter" index.   There's no
    rmw data dependency, just a store so should have little impact on
    pipeline.

    A profiling thread loops thru the byte array, incrementing an actual
    counter when it sees a non-zero byte, and resets the byte to zero.  You
    could use vector ops to process the array.

    If the stores were fast enough, you could do 2 or more stores at
    hashed indices, different hash for each store. Sort of a counting
    Bloom filter.  The effective count would be the minimum of the
    hashed counts.

    No idea how feasible this would be though.


    Probably not feasible. The polling frequency wouldn't be high enough.


    If the problem is the number of counters, then counting Bloom filters
    might be worth looking into, assuming the overhead of incrementing
    the counts isn't a problem.

    Joe Seigh
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Dec 30 14:39:27 2024
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 10/3/2024 7:00 AM, Anton Ertl wrote:
    Two weeks ago Rene Mueller presented the paper "The Cost of Profiling
    in the HotSpot Virtual Machine" at MPLR 2024. He reported that for
    some programs the counters used for profiling the program result in
    cache contention due to true or false sharing among threads.

    The traditional software mitigation for that problem is to split the
    counters into per-thread or per-core instances. But for heavily
    multi-threaded programs running on machines with many cores the cost
    of this mitigation is substantial.
    ....
    For the HotSpot application, the
    eventual answer was that they live with the cost of cache contention
    for the programs that have that problem. After some minutes the hot
    parts of the program are optimized, and cache contention is no longer
    a problem.
    ....
    If the per-thread counters are properly padded to an L2 cache line and
    properly aligned on cache line boundaries, well, they should not cause
    false sharing with other cache lines... Right?

    Sure, that's what the first sentence of the second paragraph you cited
    (and which I cited again) is about. Next, read the next sentence.

    Maybe I should give an example (fully made up on the spot, read the
    paper for real numbers): If HotSpot uses, on average, one counter per
    conditional branch, and assuming a conditional branch every 10 static
    instructions (each having, say, 4 bytes), with 1MB of generated code
    and 8 bytes per counter, that's 200KB of counters. But these counters
    are shared between all threads, so for code running on many cores you
    get true and false sharing.

    As mentioned, the usual mitigation is per-core counters. With a
    256-core machine, we now have 51.2MB of counters for 1MB of executable
    code. Now this is Java, so there might be quite a bit more executable
    code and correspondingly more counters. They eventually decided that
    the benefit of reduced cache coherence traffic is not worth that cost
    (or the cost of a hardware mechanism), as described in the last
    paragraph, from which I cited the important parts.

    - anton

    They could do this by having each thread log its own profile data
    into a thread-local profile bucket. When the bucket is full, the
    thread queues it to a "full" list and dequeues a new bucket from
    an "empty" list. A dedicated thread processes full buckets into the
    profile summary arrays, then puts the emptied buckets back on the
    empty list.

    A profile bucket is an array of 32-bit values. Each value is
    a 16-bit event type and a 16-bit item id (or whatever).
    Simple events like counting each use of a branch take just one entry.
    Other profile events could take multiple entries if they recorded
    CPU performance counters or real-time timestamps or both.

    The atomic accesses are only on the full and empty bucket list heads.
    By playing with the bucket sizes you can keep the chance of
    core collisions on the list heads negligible.
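
    A sketch of that bucket scheme in C11. The list handling is a bare
    Treiber stack; the names, sizes, and the assumption of a pre-allocated,
    never-freed bucket pool are mine, and a production version would need
    ABA protection (or simply a small lock) on the list heads:

        #include <stdatomic.h>
        #include <stddef.h>
        #include <stdint.h>

        #define BUCKET_ENTRIES 4096

        /* One record: 16-bit event type in the high half, 16-bit item id low. */
        struct bucket {
            struct bucket *next;
            size_t        used;
            uint32_t      entry[BUCKET_ENTRIES];
        };

        /* The only locations touched with atomics are these two list heads.
         * Buckets are assumed to be pre-allocated onto empty_list and never
         * freed, so dereferencing a popped node is safe; ABA could still
         * mis-link the stack, hence the caveat above. */
        static _Atomic(struct bucket *) full_list;
        static _Atomic(struct bucket *) empty_list;

        static void push(_Atomic(struct bucket *) *head, struct bucket *b)
        {
            b->next = atomic_load_explicit(head, memory_order_relaxed);
            while (!atomic_compare_exchange_weak_explicit(head, &b->next, b,
                    memory_order_release, memory_order_relaxed))
                ;
        }

        static struct bucket *pop(_Atomic(struct bucket *) *head)
        {
            struct bucket *b = atomic_load_explicit(head, memory_order_acquire);
            while (b && !atomic_compare_exchange_weak_explicit(head, &b, b->next,
                    memory_order_acquire, memory_order_relaxed))
                ;
            return b;
        }

        static _Thread_local struct bucket *cur;   /* this thread's open bucket */

        static void log_event(uint16_t type, uint16_t id)
        {
            if (!cur || cur->used == BUCKET_ENTRIES) {
                if (cur)
                    push(&full_list, cur);
                cur = pop(&empty_list);   /* assume the pool never runs dry */
                cur->used = 0;
            }
            cur->entry[cur->used++] = ((uint32_t)type << 16) | id;
        }

        /* Dedicated profiling thread: drain full buckets into a summary
         * array (here indexed by item id only), then recycle them. */
        static uint64_t summary[1u << 16];

        static void drain(void)
        {
            struct bucket *b;
            while ((b = pop(&full_list)) != NULL) {
                for (size_t i = 0; i < b->used; i++)
                    summary[b->entry[i] & 0xffffu]++;
                push(&empty_list, b);
            }
        }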


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Paul A. Clayton@paaronclayton@gmail.com to comp.arch on Mon Dec 30 21:02:05 2024
    From Newsgroup: comp.arch

    On 12/25/24 1:30 PM, MitchAlsup1 wrote:
    On Wed, 25 Dec 2024 17:50:12 +0000, Paul A. Clayton wrote:

    On 10/5/24 11:11 AM, EricP wrote:
    MitchAlsup1 wrote:
    [snip]
    --------------------------

    But voiding doesn't look like it works for exceptions or conflicting
    interrupt priority adjustments. In those cases purging the interrupt
    handler and rejecting the hand-off looks like the only option.

    Should exceptions always have priority? It seems to me that if a
    thread is low enough priority to be interrupted, it is low enough
    priority to have its exception processing interrupted/delayed.

    It depends on what you mean::

    a) if you mean that exceptions are prioritized and the highest
    priority exception is the one taken, then OK you are working
    in an ISA that has multiple exceptions per instruction. Most
    RISC ISAs do not have this property.

    The context was any exception taking priority over an interrupt
    that was accepted, at least on a speculative path. I.e., the
    statement would have been more complete as "Should exceptions
    always (or ever) have priority over an accepted interrupt?"

    b) if you mean that exceptions take priority over non-exception
    instruction streaming, well that is what exceptions ARE. In these
    cases, the exception handler inherits the priority of the instruction
    stream that raised it--but that is NOT assigning a priority to the
    exception.

    Yes, if the architecture has precise exceptions.

    I do not see much benefit from continuing execution along even
    part of the non-exception instruction stream. There might be a
    class of exceptions which are more "monitoring" exceptions
    (something vaguely like event counters?) that do not need to be
    precise in all visible state. Completing an atomic event (in
    the normal instruction stream) _might_ make sense rather than
    delaying and possibly failing (if optimistic) the atomic event.

    There _might_ be some cases where an external thread activation
    might be somewhat timing critical and not dependent on any
    action performed by the exception handler. (I cannot think of
    any case, but that may be a lack of imagination.)

    Perhaps if execution is divided into tasks, an exception within
    a task might restart the entire task after the exception handling,
    pause the task during the handling of the exception, or complete
    the task possibly running the exception handling in a parallel
    thread. Even partial task completion, i.e., marking the task
    done even though the result is not complete, might be appropriate
    for some failures.

    I guess I am thinking that any delay or elimination of some
    computation from an exception may not necessarily have to delay
    or break data flow at a higher level. Managing this seems likely
    to be difficult. Debugging software would likely be even more
    difficult.

    c) and then there are the cases where a PageFault from GuestOS
    page tables is serviced by GuestOS, while a PageFault from
    HyperVisor page tables is serviced by HyperVisor. You could
    assert that HV has higher priority than GuestOS, but it is
    more like HV has privilege over GuestOS while running at the
    same priority level.

    (There might be cases where normal operation allows deadlines to
    be met with lower priority and unusual extended operation requires
    high priority/resource allocation. Boosting the priority/resource
    budget of a thread/task to meet deadlines seems likely to make
    system-level reasoning more difficult. It seems one could also
    create an inflationary spiral.)

    With substantial support for Switch-on-Event MultiThreading, it
    is conceivable that a lower priority interrupt could be held
    "resident" after being interrupted by a higher priority interrupt.

    I don't know what you mean by 'resident'. Would "lower priority
    ISR gets pushed on stack to allow higher priority ISR to run"
    qualify as 'resident'?

    "resident" was a vague term to mean low overhead of switching
    back. Saving state to cache even with 64-byte per cycle bandwidth
    might still be slower than another mechanism and cached data
    might be evicted from L1, substantially increasing the latency.


    And then there is the slightly easier case: where GuestOS is
    servicing an interrupt and the ISR takes a PageFault in HyperVisor
    page tables. HV PF ISR fixes the GuestOS ISR PF, and returns
    to the interrupted interrupt handler. Here, even an instruction
    stream incapable (IE & EE=OFF) of taking an Exception takes an
    Exception to a different privilege level.

    Switch-on-Event helps but is not necessary.

    A chunked ROB could support such, but it is not clear that such
    is desirable even ignoring complexity factors.

    Being able to overlap latency of a memory-mapped I/O access (or
    other slow access) with execution of another thread seems
    attractive and even an interrupt handler with few instructions
    might have significant run time. Since interrupt blocking is
    used to avoid core-localized resource contention, software would
    have to know about such SoEMT.

    It may take 10,000 cycles to read an I/O control register way
    down the PCIe tree; the ISR reads several of these registers
    and constructs a data structure to be processed by softIRQ (or
    DPC) at lower priority. So, allowing the long-cycle MMI/O LDs
    to overlap with ISR thread setup is advantageous.
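
    A sketch of that top-half/bottom-half split in C. The register layout
    is made up, and queue_softirq_work is a hypothetical stand-in for
    whatever softIRQ/DPC hand-off the OS provides:

        #include <stdint.h>

        /* Illustrative device register block; a real driver would get
         * this from its BAR mapping. */
        struct dev_regs {
            volatile uint32_t status;
            volatile uint32_t error;
            volatile uint32_t completed_tail;
        };

        /* Work item built by the ISR and consumed at lower priority. */
        struct irq_work {
            uint32_t status, error, tail;
        };

        /* Hypothetical hand-off to softIRQ/DPC-level processing. */
        extern void queue_softirq_work(struct irq_work *w);

        static struct irq_work work_item;

        /* Top half: each register read may cost thousands of cycles down
         * the PCIe tree. A core that can overlap those loads with the rest
         * of the ISR's setup (or with another thread) hides much of that
         * latency. */
        void isr(struct dev_regs *r)
        {
            work_item.status = r->status;
            work_item.error  = r->error;
            work_item.tail   = r->completed_tail;
            queue_softirq_work(&work_item);   /* real processing happens later */
        }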

    (Interrupts seem similar to certain server software threads in
    having lower ILP from control dependencies and more frequent high
    latency operations, which hints that multithreading may be
    desirable.)

    Sooner or later an ISR has to actually deal with the MMI/O
    control registers associated with the <ahem> interrupt.

    Yes, but multithreading could hide some of those latencies in
    terms of throughput.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Jan 1 00:34:44 2025
    From Newsgroup: comp.arch

    On Tue, 31 Dec 2024 2:02:05 +0000, Paul A. Clayton wrote:

    On 12/25/24 1:30 PM, MitchAlsup1 wrote:
    On Wed, 25 Dec 2024 17:50:12 +0000, Paul A. Clayton wrote:

    On 10/5/24 11:11 AM, EricP wrote:
    MitchAlsup1 wrote:
    [snip]
    --------------------------

    But voiding doesn't look like it works for exceptions or conflicting
    interrupt priority adjustments. In those cases purging the interrupt
    handler and rejecting the hand-off looks like the only option.

    Should exceptions always have priority? It seems to me that if a
    thread is low enough priority to be interrupted, it is low enough
    priority to have its exception processing interrupted/delayed.

    It depends on what you mean::

    a) if you mean that exceptions are prioritized and the highest
    priority exception is the one taken, then OK you are working
    in an ISA that has multiple exceptions per instruction. Most
    RISC ISAs do not have this property.

    The context was any exception taking priority over an interrupt
    that was accepted, at least on a speculative path. I.e., the
    statement would have been more complete as "Should exceptions
    always (or ever) have priority over an accepted interrupt?"

    In the parlance I used to document My 66000 architecture, exceptions
    happen at instruction boundaries, while interrupts happen between
    instructions. Thus CPU is never deciding between an interrupt or an
    exception.

    Interrupts take on the priority assigned at I/O creation time.
    {{Oh and BTW, a single I/O request can take I/O exception to
    GuestOS, to HyperVisor, can deliver completion to assigned
    supervisor (Guest OS or HV), and deliver I/O failures to
    Secure Monitor (or whomever is assigned)}}

    Exceptions take on the priority of the currently running thread.
    A page fault at priority min does not block any interrupt at
    priority > min. A page fault at priority max is not interruptible.


    --------------------------------------

    Sooner or later an ISR has to actually deal with the MMI/O
    control registers associated with the <ahem> interrupt.

    Yes, but multithreading could hide some of those latencies in
    terms of throughput.

    EricP is the master proponent of finishing the instructions in the
    execution window that are finishable. I, merely, have no problem
    in allowing the pipe to complete or take a flush based on the kind
    of pipeline being engineered.

    With 300-odd instructions in the window this thesis has merit,
    with a 5-stage pipeline 1-wide, it does not have merit but is
    not devoid of merit either.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Jan 2 14:14:50 2025
    From Newsgroup: comp.arch

    MitchAlsup1 wrote:
    On Tue, 31 Dec 2024 2:02:05 +0000, Paul A. Clayton wrote:
    On 12/25/24 1:30 PM, MitchAlsup1 wrote:

    Sooner or later an ISR has to actually deal with the MMI/O
    control registers associated with the <ahem> interrupt.

    Yes, but multithreading could hide some of those latencies in
    terms of throughput.

    EricP is the master proponent of finishing the instructions in the
    execution window that are finishable. I, merely, have no problem
    in allowing the pipe to complete or take a flush based on the kind
    of pipeline being engineered.

    With 300-odd instructions in the window this thesis has merit,
    with a 5-stage pipeline 1-wide, it does not have merit but is
    not devoid of merit either.

    It is also possible that the speculation barriers I describe below
    will limit the benefits that pipelining exceptions and interrupts
    might be able to see.

    The issue is that both exception handlers and interrupt handlers usually
    read and write Privileged Control Registers (PCR) and/or MMIO device
    registers very early in the handler. Most MMIO device registers and CPU
    PCRs cannot be speculatively read, as that may cause a state transition.
    Of course, stores are never speculated and can only be initiated
    at commit/retire.

    The normal memory coherence rules assume that loads are to memory-like
    locations that do not state transition on reads and that therefore
    memory loads can be harmlessly replayed if needed.
    While memory stores are not performed speculatively, an implementation
    might speculatively prefetch a cache line as soon as a store is queued
    and cause cache lines to ping-pong.

    But loads to many MMIO devices and PCRs effectively require a
    speculation barrier in front of them to prevent replays.

    A SPCB Speculation Barrier instruction could block speculation.
    It stalls execution until all older conditional branches are resolved and
    all older instructions that might throw an exception have determined
    they won't do so.

    The core could have an internal lookup table telling it which PCR can be
    read speculatively because there are no side effects to doing so.
    Those PCR would not require an SPCB to guard them.

    For MMIO device registers I think having an explicit SPCB instruction
    might be better than putting a "no-speculate" flag on the PTE for the
    device register address as that flag would be difficult to propagate
    backwards from address translate to all the parts of the core that
    we might have to sync with.
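
    A sketch of how a handler might use such a barrier, with spcb() as a
    placeholder for the proposed (not existing) instruction and a made-up
    device register address:

        #include <stdint.h>

        /* Placeholder for the proposed SPCB: stall until all older branches
         * are resolved and no older instruction can still fault. The inline
         * asm is only a compiler barrier marking where the real instruction
         * would go -- no such instruction exists today. */
        static inline void spcb(void)
        {
            __asm__ volatile ("" ::: "memory");
        }

        /* Made-up address of a read-to-clear interrupt status register. */
        #define DEV_INTR_STATUS ((volatile uint32_t *)0x40001000u)

        uint32_t read_intr_status(void)
        {
            spcb();                    /* older speculation resolved past here */
            return *DEV_INTR_STATUS;   /* device read cannot be replayed now   */
        }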

    This all means that there may be very little opportunity for speculative
    execution of their handlers, no matter how much hardware one tosses at them.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Jan 2 19:45:36 2025
    From Newsgroup: comp.arch

    On Thu, 2 Jan 2025 19:14:50 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Tue, 31 Dec 2024 2:02:05 +0000, Paul A. Clayton wrote:
    On 12/25/24 1:30 PM, MitchAlsup1 wrote:

    Sooner or later an ISR has to actually deal with the MMI/O
    control registers associated with the <ahem> interrupt.

    Yes, but multithreading could hide some of those latencies in
    terms of throughput.

    EricP is the master proponent of finishing the instructions in the
    execution window that are finishable. I, merely, have no problem
    in allowing the pipe to complete or take a flush based on the kind
    of pipeline being engineered.

    With 300-odd instructions in the window this thesis has merit,
    with a 5-stage pipeline 1-wide, it does not have merit but is
    not devoid of merit either.

    It is also possible that the speculation barriers I describe below
    will limit the benefits that pipelining exceptions and interrupts
    might be able to see.

    The issue is that both exception handlers and interrupt handlers usually
    read and write Privileged Control Registers (PCR) and/or MMIO device
    registers very early in the handler. Most MMIO device registers and CPU
    PCRs cannot be speculatively read, as that may cause a state transition.
    Of course, stores are never speculated and can only be initiated
    at commit/retire.

    This becomes a question of "who knows what when".

    At the point of interrupt recognition (It has been raised, and I am
    going to take that interrupt) the pipeline has instructions retiring
    from the execution window, and instructions being performed, and
    instructions waiting for "things to happen".

    After interrupt recognition, you are inserting instructions into the
    execution window--but these are not speculative--they are known to
    not be under any speculation--they WILL execute to completion--
    regardless of whether speculative instructions from before recognition
    are performed or flushed. This property is known until the ISR performs
    a predicted branch.

    So, it is possible to stream right onto an ISR--but few pipelines do.

    The normal memory coherence rules assume that loads are to memory-like
    locations that do not state transition on reads and that therefore
    memory loads can be harmlessly replayed if needed.
    While memory stores are not performed speculatively, an implementation
    might speculatively prefetch a cache line as soon as a store is queued
    and cause cache lines to ping-pong.

    But loads to many MMIO devices and PCRs effectively require a
    speculation barrier in front of them to prevent replays.

    My 66000 architecture specifies that accesses to MMI/O space are
    performed as if the core were performing its memory references in a
    sequentially consistent manner, obviating the need for an SPCB
    instruction there.

    There is only 1 instruction used to read/write control registers. It
    reads the operand registers and the control register at the beginning
    of execution, but does not write the control register until retirement,
    obviating the need for an SPCB instruction there.

    Also note: core[i] can access core[j] control registers, but this access
    takes place in MMI/O space (and is sequentially consistent).

    A SPCB Speculation Barrier instruction could block speculation.
    It stalls execution until all older conditional branches are resolved and
    all older instructions that might throw an exception have determined
    they won't do so.

    The core could have an internal lookup table telling it which PCR can be
    read speculatively because there are no side effects to doing so.
    Those PCR would not require an SPCB to guard them.

    For MMIO device registers I think having an explicit SPCB instruction
    might be better than putting a "no-speculate" flag on the PTE for the
    device register address as that flag would be difficult to propagate
    backwards from address translate to all the parts of the core that
    we might have to sync with.

    I am curious. Is "unCacheable and MMI/O space" insufficient to figure
    out "Hey, it's non-speculative" too ??

    This all means that there may be very little opportunity for speculative
    execution of their handlers, no matter how much hardware one tosses at
    them.

    Good point, often unseen or unstated.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Jan 3 17:24:33 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    MitchAlsup1 wrote:
    On Tue, 31 Dec 2024 2:02:05 +0000, Paul A. Clayton wrote:
    On 12/25/24 1:30 PM, MitchAlsup1 wrote:

    Sooner or later an ISR has to actually deal with the MMI/O
    control registers associated with the <ahem> interrupt.

    Yes, but multithreading could hide some of those latencies in
    terms of throughput.

    EricP is the master proponent of finishing the instructions in the
    execution window that are finishable. I, merely, have no problem
    in allowing the pipe to complete or take a flush based on the kind
    of pipeline being engineered.

    With 300-odd instructions in the window this thesis has merit,
    with a 5-stage pipeline 1-wide, it does not have merit but is
    not devoid of merit either.

    It is also possible that the speculation barriers I describe below
    will limit the benefits that pipelining exceptions and interrupts
    might be able to see.

    The issue is that both exception handlers and interrupt handlers usually
    read and write Privileged Control Registers (PCR) and/or MMIO device
    registers very early in the handler. Most MMIO device registers and CPU
    PCRs cannot be speculatively read, as that may cause a state transition.
    Of course, stores are never speculated and can only be initiated
    at commit/retire.

    The normal memory coherence rules assume that loads are to memory-like
    locations that do not state transition on reads and that therefore
    memory loads can be harmlessly replayed if needed.
    While memory stores are not performed speculatively, an implementation
    might speculatively prefetch a cache line as soon as a store is queued
    and cause cache lines to ping-pong.

    But loads to many MMIO devices and PCRs effectively require a
    speculation barrier in front of them to prevent replays.

    A SPCB Speculation Barrier instruction could block speculation.
    It stalls execution until all older conditional branches are resolved and
    all older instructions that might throw an exception have determined
    they won't do so.

    The core could have an internal lookup table telling it which PCR can be
    read speculatively because there are no side effects to doing so.
    Those PCR would not require an SPCB to guard them.

    For MMIO device registers I think having an explicit SPCB instruction
    might be better than putting a "no-speculate" flag on the PTE for the
    device register address as that flag would be difficult to propagate
    backwards from address translate to all the parts of the core that
    we might have to sync with.

    MMIO accesses are, by definition, non-cacheable, which is typically
    designated in either a translation table entry or associated
    attribute registers (MTRR, MAIR). Non-cacheable accesses
    are not speculatively executed, which provides the
    correct semantics for device registers which have side effects
    on read accesses.

    Granted the granularity of that attribute is usually a translation unit
    (page) size.


    This all means that there may be very little opportunity for speculative
    execution of their handlers, no matter how much hardware one tosses at them.

    That's true. ARM goes to some lengths to ensure that the access
    to the system register (ICC_IARx_EL1) that contains the current
    pending interrupt number for a given hardware thread/core is
    synchronized appropriately.

    "To allow software to ensure appropriate observability of actions
    initiated by GIC register accesses, the PE and CPU interface logic
    must ensure that reads of this register are self-synchronising when
    interrupts are masked by the PE (that is when PSTATE.{I,F} == {0,0}).
    This ensures that the effect of activating an interrupt on the signaling
    of interrupt exceptions is observed when a read of this register is
    architecturally executed so that no spurious interrupt exception
    occurs if interrupts are unmasked by an instruction immediately
    following the read."
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Paul A. Clayton@paaronclayton@gmail.com to comp.arch on Mon Jan 6 11:33:05 2025
    From Newsgroup: comp.arch

    On 1/3/25 12:24 PM, Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    [snip]
    For MMIO device registers I think having an explicit SPCB instruction
    might be better than putting a "no-speculate" flag on the PTE for the
    device register address as that flag would be difficult to propagate
    backwards from address translate to all the parts of the core that
    we might have to sync with.

    MMIO accesses are, by definition, non-cacheable, which is typically
    designated in either a translation table entry or associated
    attribute registers (MTRR, MAIR). Non-cacheable accesses
    are not speculatively executed, which provides the
    correct semantics for device registers which have side effects
    on read accesses.

    It is not clear to me that Memory-Mapped I/O requires
    non-cacheable accesses. Some addresses within I/O device
    address areas do not have access side effects. I would **GUESS**
    that most I/O addresses do not have read side effects.

    (One obvious exception would be implicit buffers where a read
    "pops" a value from a queue allowing the next value to be accessed
    at the same address. _Theoretically_ one could buffer such reads
    outside of the I/O device such that old values would not be lost
    and incorrect speculation could be rolled back — this might be a
    form of versioned memory. Along similar lines, values could be
    prefetched and cached as long as all modifiers of the values use
    cache coherency. There may well be other cases of read side
    effects.)

    In general writes require hidden buffering for speculation, but
    write side effects can affect later reads. One possibility would
    be a write that changes which buffer is accessed at a given
    address. Such a write followed by a read of such a buffer address
    must have the read presented after the write, so caching the read
    address would be problematic.

    One weak type of write side effect would be similar to releasing
    a lock, where with a weaker memory order one needs to ensure that
    previous writes are visible before the "lock is released". E.g.,
    one might update a command buffer on an I/O device with multiple
    writes and lastly update an I/O device pointer to indicate that
    the buffer was added to. The ordering required for this is weaker
    than sequential consistency.
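
    A sketch of that command-buffer-then-doorbell pattern, using C11
    release semantics purely to illustrate the ordering that is actually
    needed; a real driver would use the platform's MMIO accessors and
    write barriers, and the names here are made up:

        #include <stdatomic.h>
        #include <stdint.h>

        /* Illustrative layout: a command ring in host memory that the
         * device reads, plus a device tail-pointer/doorbell register. */
        struct cmd { uint32_t opcode, arg; };

        extern struct cmd        ring[256];   /* command buffer (host memory)  */
        extern _Atomic uint32_t *doorbell;    /* device tail/doorbell register */

        void post_command(uint32_t slot, uint32_t opcode, uint32_t arg)
        {
            ring[slot].opcode = opcode;       /* ordinary writes to the buffer... */
            ring[slot].arg    = arg;

            /* ...which only need to be visible before the doorbell update.
             * Release ordering (or a write barrier plus a plain store) is
             * sufficient; sequential consistency is stronger than needed. */
            atomic_store_explicit(doorbell, slot + 1, memory_order_release);
        }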

    If certain kinds of side effects are limited to a single device,
    then the ordering of accesses to different devices may allow
    greater flexibility in ordering. (This seems conceptually similar
    to cache coherence vs. consistency where "single I/O device"
    corresponds to single address. Cache coherence provides strict
    consistency for a single address.)

    I seem to recall that StrongARM exploited a distinction between
    "bufferable" and "cacheable" marked in PTEs to select the cache
    to which an access would be allocated. This presumably means
    that the two terms had different consistency/coherence
    constraints.

    I am very skeptical that an extremely complex system with best
    possible performance would be worthwhile. However, I suspect that
    some relaxation of ordering and cacheability would be practical
    and worthwhile.

    I do very much object to memory-mapped I/O as a concept
    requiring non-cacheability, even if existing software
    (and hardware) and the development mindset make any relaxation
    impractical.

    Since x86 allowed a different kind of consistency for non-temporal
    stores, it may not be absurd for a new architecture to present
    a more complex interface, presumably with the option not to deal
    with that complexity. Of course, the most likely result would be
    hardware having to support the complexity with no actual benefit
    from use.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Jan 7 16:26:57 2025
    From Newsgroup: comp.arch

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 1/3/25 12:24 PM, Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    [snip]
    For MMIO device registers I think having an explicit SPCB instruction
    might be better than putting a "no-speculate" flag on the PTE for the
    device register address as that flag would be difficult to propagate
    backwards from address translate to all the parts of the core that
    we might have to sync with.

    MMIO accesses are, by definition, non-cacheable, which is typically
    designated in either a translation table entry or associated
    attribute registers (MTRR, MAIR). Non-cacheable accesses
    are not speculatively executed, which provides the
    correct semantics for device registers which have side effects
    on read accesses.

    It is not clear to me that Memory-Mapped I/O requires
    non-cacheable accesses. Some addresses within I/O device
    address areas do not have access side effects. I would **GUESS**
    that most I/O addresses do not have read side effects.

    Generally such controls (cacheable/noncacheable) are on
    a page granularity (although some modern instruction
    sets include non-temporal move variants that bypass
    the cache).


    (One obvious exception would be implicit buffers where a read
    "pops" a value from a queue allowing the next value to be accessed
    at the same address. _Theoretically_ one could buffer such reads
    outside of the I/O device such that old values would not be lost
    and incorrect speculation could be rolled back — this might be a
    form of versioned memory. Along similar lines, values could be
    prefetched and cached as long as all modifiers of the values use
    cache coherency. There may well be other cases of read side
    effects.)

    In general writes require hidden buffering for speculation, but
    write side effects can affect later reads. One possibility would
    be a write that changes which buffer is accessed at a given
    address. Such a write followed by a read of such a buffer address
    must have the read presented after the write, so caching the read
    address would be problematic.

    One weak type of write side effect would be similar to releasing
    a lock, where with a weaker memory order one needs to ensure that
    previous writes are visible before the "lock is released". E.g.,
    one might update a command buffer on an I/O device with multiple
    writes and lastly update a I/O device pointer to indicate that
    the buffer was added to. The ordering required for this is weaker
    than sequential consistency.

    If certain kinds of side effects are limited to a single device,
    then the ordering of accesses to different devices may allow
    greater flexibility in ordering. (This seems conceptually similar
    to cache coherence vs. consistency where "single I/O device"
    corresponds to single address. Cache coherence provides strict
    consistency for a single address.)

    I seem to recall that StrongARM exploited a distinction between
    "bufferable" and "cacheable" marked in PTEs to select the cache
    to which an access would be allocated. This presumably means
    that the two terms had different consistency/coherence
    constraints.

    I am very skeptical that an extremely complex system with best
    possible performance would be worthwhile. However, I suspect that
    some relaxation of ordering and cacheability would be practical
    and worthwhile.

    Agreed with the first sentence. Ordering rules are generally
    defined by the hardware (e.g. PCIe ordering), although various
    host chipsets allow partial relaxation in cases where the
    device supports the relaxed ordering bit in the TLP header.


    I do very much object to memory-mapped I/O as a concept
    requiring non-cacheability, even if existing software
    (and hardware) and the development mindset make any relaxation
    impractical.

    The hardware cost of doing otherwise seems to be a barrier
    to any relaxation.


    Since x86 allowed a different kind of consistency for non-temporal
    stores, it may not be absurd for a new architecture to present
    a more complex interface, presumably with the option not to deal
    with that complexity. Of course, the most likely result would be
    hardware having to support the complexity with no actual benefit
    from use.

    My current work involves modeling devices (PCI, PCIe and onboard
    accelerators) in software. Most device status registers, for
    example, don't have side effects on read, but they can be changed
    by hardware at any time, so caching them doesn't make sense.

    For PCIe devices that expose a memory BAR backed by regular DRAM
    the host can map the entire BAR as cacheable.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Jan 7 17:31:11 2025
    From Newsgroup: comp.arch

    Paul A. Clayton wrote:
    On 1/3/25 12:24 PM, Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    [snip]
    For MMIO device registers I think having an explicit SPCB instruction
    might be better than putting a "no-speculate" flag on the PTE for the
    device register address as that flag would be difficult to propagate
    backwards from address translate to all the parts of the core that
    we might have to sync with.

    MMIO accesses are, by definition, non-cacheable, which is typically
    designated in either a translation table entry or associated
    attribute registers (MTRR, MAIR). Non-cacheable accesses
    are not speculatively executed, which provides the
    correct semantics for device registers which have side effects
    on read accesses.

    It is not clear to me that Memory-Mapped I/O requires
    non-cacheable accesses. Some addresses within I/O device
    address areas do not have access side effects. I would **GUESS**
    that most I/O addresses do not have read side effects.
    And you would be wrong, at least back in the "old days" when I wrote
    drivers for some such devices.
    The worst was probably the EGA text/graphics adapter, which had a bunch
    of write-only ports, making it completely impossible to context switch.
    Even IBM realized this, so it was shortly afterwards (1-2 years?) replaced
    by the VGA adapter, which did allow you to query the current status.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114