• Microarch Club

    From George Musk@grgmusk@skiff.com to comp.arch on Thu Mar 21 19:34:50 2024
    From Newsgroup: comp.arch

    Thought this may be interesting:
    https://microarch.club/
    https://www.youtube.com/@MicroarchClub/videos
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Mon Mar 25 16:48:43 2024
    From Newsgroup: comp.arch

    On 3/21/2024 2:34 PM, George Musk wrote:
    Thought this may be interesting:
    https://microarch.club/
    https://www.youtube.com/@MicroarchClub/videos

    At least sort of interesting...

    I guess one of the guys on there did a manycore VLIW architecture with
    the memory local to each of the cores. Seems like an interesting
    approach, though not sure how well it would work on a general purpose workload. This is also closer to what I had imagined when I first
    started working on this stuff, but it had drifted more towards a
    slightly more conventional design.


But, admittedly, this is for small-N core counts; 16/32K of L1 with a shared
L2 seemed like a better option than cores with a very large shared L1
cache.

    I am not sure that abandoning a global address space is such a great
    idea, as a lot of the "merits" can be gained instead by using weak
    coherence models (possibly with a shared 256K or 512K or so for each
    group of 4 cores, at which point it goes out to a higher latency global
    bus). In this case, the division into independent memory regions could
    be done in software.

It is unclear if my approach is "sufficiently minimal". There is more complexity than I would like in my ISA (and effectively turning it into
the common superset of both my original design and RV64G doesn't really
help matters here).

    If going for a more minimal core optimized for perf/area, some stuff
    might be dropped. Would likely drop integer and floating-point divide
    again. Might also make sense to add an architectural zero register, and eliminate some number of encodings which exist merely because of the
    lack of a zero register (though, encodings are comparably cheap, as the internal uArch has a zero register, and effectively treats immediate
    values as a special register as well, ...). Some of the debate is more
    related to the logic cost of dealing with some things in the decoder.

    Though, would likely still make a few decisions differently from those
    in RISC-V. Things like indexed load/store, predicated ops (with a
    designated flag bit), and large-immediate encodings, help enough with performance (relative to cost) to be worth keeping (though, mostly
    because the alternatives are not so good in terms of performance).

    Staying with 3-wide also makes sense, as going to 1-wide or 2-wide will
    not save much if one still wants things like FP-SIMD and unaligned
    memory access (when as-is, all the 3rd lane can really do is basic ALU
    ops and similar; and the savings are negligible if one has a register
    file with 6-ports, which in turn is needed to fully provision stuff for
    the 2-wide cases, ...).

Practically, the manycore idea doesn't go very far on FPGAs: to have
"useful" cores, one can't fit more than a few of them on any of the
"affordable" FPGAs...

    But, maybe other people will have different ideas.

    ...

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Mar 25 22:17:03 2024
    From Newsgroup: comp.arch

    BGB-Alt wrote:

    On 3/21/2024 2:34 PM, George Musk wrote:
    Thought this may be interesting:
    https://microarch.club/
    https://www.youtube.com/@MicroarchClub/videos

    At least sort of interesting...

    I guess one of the guys on there did a manycore VLIW architecture with
    the memory local to each of the cores. Seems like an interesting
    approach, though not sure how well it would work on a general purpose workload. This is also closer to what I had imagined when I first
    started working on this stuff, but it had drifted more towards a
    slightly more conventional design.


    But, admittedly, this is for small-N cores, 16/32K of L1 with a shared
    L2, seemed like a better option than cores with a very large shared L1 cache.

    You appear to be "starting to get it"; congratulations.

    I am not sure that abandoning a global address space is such a great
    idea, as a lot of the "merits" can be gained instead by using weak
    coherence models (possibly with a shared 256K or 512K or so for each
    group of 4 cores, at which point it goes out to a higher latency global bus). In this case, the division into independent memory regions could
    be done in software.

    Most of the last 50 years has been towards a single global address space.

    It is unclear if my approach is "sufficiently minimal". There is more complexity than I would like in my ISA (and effectively turning it into
    the common superset of both my original design and RV64G, doesn't really help matters here).

    If going for a more minimal core optimized for perf/area, some stuff
    might be dropped. Would likely drop integer and floating-point divide

    I think this is pound foolish even if penny wise.

    again. Might also make sense to add an architectural zero register, and eliminate some number of encodings which exist merely because of the
    lack of a zero register (though, encodings are comparably cheap, as the

    I got an effective zero register without having to waste a register
    name to "get it". My 66000 gives you 32 registers of 64-bits each and
    you can put any bit pattern in any register and treat it as you like.
    Accessing #0 takes 1/16 of a 5-bit encoding space, and is universally available.

    internal uArch has a zero register, and effectively treats immediate
    values as a special register as well, ...). Some of the debate is more related to the logic cost of dealing with some things in the decoder.

    The problem is universal constants. RISCs being notably poor in their support--however this is better than addressing modes which require
    µCode.

    Though, would likely still make a few decisions differently from those
    in RISC-V. Things like indexed load/store,

    Absolutely

    predicated ops (with a
    designated flag bit),

    Predicated then and else clauses which are branch free.
    {{Also good for constant time crypto in need of flow control...}}

    and large-immediate encodings,

    Nothing else is so poorly served in typical ISAs.

    help enough with performance (relative to cost)

    +40%

    to be worth keeping (though, mostly
    because the alternatives are not so good in terms of performance).

    Damage to pipeline ability less than -5%.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Mon Mar 25 22:32:13 2024
    From Newsgroup: comp.arch

    On 3/25/2024 5:17 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:

    On 3/21/2024 2:34 PM, George Musk wrote:
    Thought this may be interesting:
    https://microarch.club/
    https://www.youtube.com/@MicroarchClub/videos

    At least sort of interesting...

    I guess one of the guys on there did a manycore VLIW architecture with
    the memory local to each of the cores. Seems like an interesting
    approach, though not sure how well it would work on a general purpose
    workload. This is also closer to what I had imagined when I first
    started working on this stuff, but it had drifted more towards a
    slightly more conventional design.


    But, admittedly, this is for small-N cores, 16/32K of L1 with a shared
    L2, seemed like a better option than cores with a very large shared L1
    cache.

    You appear to be "starting to get it"; congratulations.


    I had experimented with stuff before, and "big L1 caches" seemed to be
    in most regards worse. Hit rate goes into diminishing return territory,
    and timing isn't too happy either.

    At least for my workloads, 32K seemed like the local optimum.

    Say, checking hit rates (in Doom):
    8K: 67%, 16K: 78%,
    32K: 85%, 64K: 87%
    128K: 88%
    This being for a direct-mapped cache configuration with even/odd paired 16-byte cache lines.

    Other programs seem similar.


    For a direct-mapped L1 cache, there is an issue with conflict misses,
    where I was able to add in a small cache to absorb ~ 1-2% that was due
    to conflict misses, which also had the (seemingly more obvious) effect
    of reducing L2 misses (from a direct-mapped L2 cache). Though, it is
    likely that a set-associative L2 cache could have also addressed this
    issue (but likely with a higher cost impact).
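
As a rough C sketch of that arrangement (the sizes, line length, and round-robin replacement are assumptions for illustration, not the actual implementation): a direct-mapped L1 lookup backed by a tiny fully associative victim buffer that absorbs conflict misses:

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_SHIFT   4      /* 16-byte lines (assumed)          */
    #define L1_LINES     2048   /* 32K direct-mapped L1             */
    #define VICTIM_LINES 4      /* tiny buffer for conflict victims */

    typedef struct { uint64_t tag; bool valid; } Line;

    static Line l1[L1_LINES];
    static Line victim[VICTIM_LINES];
    static unsigned victim_next;   /* round-robin replacement */

    /* Returns true on a hit in either the L1 or the victim buffer. */
    static bool cache_access(uint64_t addr)
    {
        uint64_t line = addr >> LINE_SHIFT;
        unsigned idx  = line % L1_LINES;
        uint64_t tag  = line / L1_LINES;

        if (l1[idx].valid && l1[idx].tag == tag)
            return true;                        /* L1 hit */

        for (unsigned i = 0; i < VICTIM_LINES; i++) {
            if (victim[i].valid && victim[i].tag == line) {
                /* conflict miss absorbed: swap with the L1 occupant */
                bool     had = l1[idx].valid;
                uint64_t old = had ? l1[idx].tag * (uint64_t)L1_LINES + idx : 0;
                l1[idx].valid   = true;
                l1[idx].tag     = tag;
                victim[i].valid = had;
                victim[i].tag   = old;
                return true;
            }
        }

        /* true miss: fill L1, demote the old line into the victim buffer */
        if (l1[idx].valid) {
            victim[victim_next].valid = true;
            victim[victim_next].tag   = l1[idx].tag * (uint64_t)L1_LINES + idx;
            victim_next = (victim_next + 1) % VICTIM_LINES;
        }
        l1[idx].valid = true;
        l1[idx].tag   = tag;
        return false;
    }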



    I am not sure that abandoning a global address space is such a great
    idea, as a lot of the "merits" can be gained instead by using weak
    coherence models (possibly with a shared 256K or 512K or so for each
    group of 4 cores, at which point it goes out to a higher latency
    global bus). In this case, the division into independent memory
    regions could be done in software.

    Most of the last 50 years has been towards a single global address space.


    Yeah.

    From what I can gather, the guy in the video had an architecture which
    gives each CPU its own 128K and needs explicit message passing to access outside of this (and faking a global address space in software, at a significant performance penalty). As I see it, this does not seem like
    such a great idea...


    Something like weak coherence can get most of the same savings, with
    much less impact on how one writes code (albeit, it does mean that mutex locking may still be painfully slow).

    But, this does mean it is better to try to approach software in a way
    that neither requires TSO semantics nor frequent mutex locking.


    It is unclear if my approach is "sufficiently minimal". There is more
    complexity than I would like in my ISA (and effectively turning it
    into the common superset of both my original design and RV64G, doesn't
    really help matters here).

    If going for a more minimal core optimized for perf/area, some stuff
    might be dropped. Would likely drop integer and floating-point divide

    I think this is pound foolish even if penny wise.


    The "shift and add" unit isn't free, and the relative gains are small.
    For integer divide, granted, it is faster than the pure software version
    in the general case. For FPU divide, N-R is faster, but shift-add can
    give an exact result. Most other / faster hardware divide strategies
    seem to be more expensive than a shift-and-add unit.
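
For reference, a minimal C sketch of the kind of loop such a unit iterates (the classic restoring shift/subtract long division, one quotient bit per step; a real unit would retire 1-2 bits per clock and trap on divide-by-zero rather than leaving it undefined):

    #include <stdint.h>

    /* Restoring long division: slow but exact, and very little logic. */
    static uint64_t divu64(uint64_t n, uint64_t d, uint64_t *rem_out)
    {
        uint64_t q = 0, rem = 0;
        for (int i = 63; i >= 0; i--) {
            rem = (rem << 1) | ((n >> i) & 1);  /* shift in next dividend bit */
            if (rem >= d) {                     /* trial subtract             */
                rem -= d;
                q |= 1ull << i;
            }
        }
        if (rem_out) *rem_out = rem;
        return q;
    }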


    My confidence in hardware divide isn't too high, noting for example that
    the AMD K10 and Bulldozer/15h had painfully slow divide operations (to
    such a degree that doing it in software was often faster). This implies
    that divide cost/performance is still not really a "solved" issue, even
    if one has the resources to throw at it.

    One can avoid the cost of the shift-and-add unit via "trap and emulate",
    but then the performance is worse.


    Say, "we have an instruction, but it is a boat anchor" isn't an ideal situation (unless to be a placeholder for if/when it is not a boat anchor).


    again. Might also make sense to add an architectural zero register,
    and eliminate some number of encodings which exist merely because of
    the lack of a zero register (though, encodings are comparably cheap,
    as the

    I got an effective zero register without having to waste a register name
    to "get it". My 66000 gives you 32 registers of 64-bits each and you can
    put any bit pattern in any register and treat it as you like.
    Accessing #0 takes 1/16 of a 5-bit encoding space, and is universally available.


    I guess offloading this to the compiler can also make sense.

    Least common denominator would be, say, not providing things like NEG instructions and similar (pretending as-if one had a zero register), and
    if a program needs to do a NEG or similar, it can load 0 into a register itself.

    In the extreme case (say, one also lacks a designated "load immediate" instruction or similar), there is still the "XOR Rn, Rn, Rn" strategy to
    zero a register...


    Say:
    XOR R14, R14, R14 //Designate R14 as pseudo-zero...
    ...
    ADD R14, 0x123, R8 //Load 0x123 into R8

    Though, likely still makes sense in this case to provide some
    "convenience" instructions.


    internal uArch has a zero register, and effectively treats immediate
    values as a special register as well, ...). Some of the debate is more
    related to the logic cost of dealing with some things in the decoder.

    The problem is universal constants. RISCs being notably poor in their support--however this is better than addressing modes which require
    µCode.


    Yeah.

    I ended up with jumbo-prefixes. Still not perfect, and not perfectly orthogonal, but mostly works.

    Allows, say:
    ADD R4, 0x12345678, R6

    To be performed in potentially 1 clock-cycle and with a 64-bit encoding,
    which is better than, say:
    LUI X8, 0x12345
    ADD X8, X8, 0x678
    ADD X12, X10, X8
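
For comparison, the split that the LUI+ADD sequence relies on can be computed as below (a plain C sketch; the bias matters because ADDI sign-extends its 12-bit immediate, though 0x678 above happens not to need it):

    #include <stdint.h>

    /* Split a 32-bit constant into the LUI/ADDI pair RISC-V uses. */
    static void lui_addi_split(uint32_t value, uint32_t *hi20, int32_t *lo12)
    {
        int32_t lo = (int32_t)(value & 0xFFF);
        if (lo & 0x800)
            lo -= 0x1000;                       /* ADDI sign-extends 12 bits */
        *lo12 = lo;
        *hi20 = (value - (uint32_t)lo) >> 12;   /* LUI's 20-bit payload      */
    }

For 0x12345678 this gives hi20=0x12345 and lo12=0x678, matching the sequence above.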


Though, for jumbo-prefixes, did end up adding a special case in the
compiler where it will try to figure out if a constant will be used
multiple times in a basic-block and, if so, will load it into a register rather than use a jumbo-prefix form.


    It could maybe make sense to have function-scale static-assigned
    constants, but have not done so yet.

    Though, it appears as if one of the "top contenders" here would be 0,
    mostly because things like:
    foo->x=0;
    And:
    bar[i]=0;
    Are semi-common, and as-is end up needing to load 0 into a register each
    time they appear.

    Had already ended up with a similar sort of special case to optimize
    "return 0;" and similar, mostly because this was common enough that it
    made more sense to have a special case:
    BRA .lbl_ret //if function does not end with "return 0;"
    .lbl_ret_zero:
    MOV 0, R2
    .lbl_ret:
    ... epilog ...

    For many functions, which allowed "return 0;" to be emitted as:
    BRA .lbl_ret_zero
    Rather than:
    MOV 0, R2
    BRA .lbl_ret
    Which on average ended up as a net-win when there are more than around 3
    of them per function.


    Though, another possibility could be to allow constants to be included
    in the "statically assign variables to registers" logic (as-is, they are excluded except in "tiny leaf" functions).


    Though, would likely still make a few decisions differently from those
    in RISC-V. Things like indexed load/store,

    Absolutely

                                               predicated ops (with a
    designated flag bit),

    Predicated then and else clauses which are branch free.
    {{Also good for constant time crypto in need of flow control...}}


    I have per instruction predication:
    CMPxx ...
    OP?T //if-true
    OP?F //if-false
    Or:
    OP?T | OP?F //both in parallel, subject to encoding and ISA rules

    Performance gains are modest, but still noticeable (part of why
    predication ended up as a core ISA feature). Effect on pipeline seems to
    be small in its current form (it is handled along with register fetch,
    mostly turning non-executed instructions into NOPs during the EX stages).

    For the most part, 1-bit seems sufficient.
    More complex schemes generally ran into issues (had experimented with
    allowing a second predicate bit, or handling predicates as a
    stack-machine, but these ideas were mostly dead on arrival).
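
In source terms, what the 1-bit ?T/?F scheme buys is roughly this (a hedged C illustration, not the compiler's actual output): a short if/else where both arms are cheap becomes straight-line code, with the not-taken arm squashed to NOPs rather than branched over:

    #include <stdint.h>

    /* Branchy form: a mispredict costs a pipeline flush. */
    static int64_t sat_inc_branchy(int64_t x, int64_t limit)
    {
        if (x >= limit)
            return limit;
        else
            return x + 1;
    }

    /* Predicated form: both arms are issued, the CMP result selects which
     * one survives; no branch, no flush. */
    static int64_t sat_inc_predicated(int64_t x, int64_t limit)
    {
        int64_t t = limit;    /* then-arm,  OP?T */
        int64_t e = x + 1;    /* else-arm,  OP?F */
        return (x >= limit) ? t : e;
    }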


                          and large-immediate encodings,

    Nothing else is so poorly served in typical ISAs.


    Probably true.


                                                         help enough with
    performance (relative to cost)

    +40%


    I am mostly seeing around 30% or so, for Doom and similar.
    A few other programs still being closer to break-even at present.


    Things are a bit more contentious in terms of code density:
    With size-minimizing options to GCC:
    ".text" is slightly larger with BGBCC vs GCC (around 11%);
    However, the GCC output has significantly more ".rodata".

    A reasonable chunk of the code-size difference could be attributed to
    jumbo prefixes making the average instruction size slightly bigger.


    More could be possible with more compiler optimization effort.
    Currently, a few recent optimization cases are disabled as they seem to
    be causing bugs that I haven't figured out yet.


                                   to be worth keeping (though, mostly
    because the alternatives are not so good in terms of performance).

    Damage to pipeline ability less than -5%.

    Yeah.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Mar 26 19:16:07 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 3/25/2024 5:17 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Say, "we have an instruction, but it is a boat anchor" isn't an ideal situation (unless to be a placeholder for if/when it is not a boat anchor).

If the boat anchor is a required unit of functionality, and I believe
IDIV and FPDIV are, it should be defined in the ISA, and if you can't afford
it, find some way to trap rapidly so you can fix it up without excessive overhead. Like a MIPS TLB reload. If you can't get trap and emulate at sufficient performance, then add the HW to perform the instruction.

    again. Might also make sense to add an architectural zero register,
    and eliminate some number of encodings which exist merely because of
    the lack of a zero register (though, encodings are comparably cheap,
    as the

    I got an effective zero register without having to waste a register name
    to "get it". My 66000 gives you 32 registers of 64-bits each and you can
    put any bit pattern in any register and treat it as you like.
    Accessing #0 takes 1/16 of a 5-bit encoding space, and is universally
    available.


    I guess offloading this to the compiler can also make sense.

    Least common denominator would be, say, not providing things like NEG instructions and similar (pretending as-if one had a zero register), and
    if a program needs to do a NEG or similar, it can load 0 into a register itself.

    In the extreme case (say, one also lacks a designated "load immediate" instruction or similar), there is still the "XOR Rn, Rn, Rn" strategy to zero a register...

    MOV Rd,#imm16

Costs 1 instruction of 32-bits in size and can be performed in 0 cycles

    Say:
    XOR R14, R14, R14 //Designate R14 as pseudo-zero...
    ...
    ADD R14, 0x123, R8 //Load 0x123 into R8

    Though, likely still makes sense in this case to provide some
    "convenience" instructions.


    internal uArch has a zero register, and effectively treats immediate
    values as a special register as well, ...). Some of the debate is more
    related to the logic cost of dealing with some things in the decoder.

    The problem is universal constants. RISCs being notably poor in their
    support--however this is better than addressing modes which require
    µCode.


    Yeah.

    I ended up with jumbo-prefixes. Still not perfect, and not perfectly orthogonal, but mostly works.

    Allows, say:
    ADD R4, 0x12345678, R6

    To be performed in potentially 1 clock-cycle and with a 64-bit encoding, which is better than, say:
    LUI X8, 0x12345
    ADD X8, X8, 0x678
    ADD X12, X10, X8

    This strategy completely fails when the constant contains more than 32-bits

    FDIV R9,#3.141592653589247,R17

    When you have universal constants (including 5-bit immediates), you rarely
    need a register containing 0.

    Though, for jumbo-prefixes, did end up adding a special case in the
compiler where it will try to figure out if a constant will be used
    multiple times in a basic-block and, if so, will load it into a register rather than use a jumbo-prefix form.

This is a delicate balance:: while each use of the constant takes a
unit or 2 of the instruction stream, each use costs 0 more instructions.
The breakeven point in My 66000 is about 4 uses in a small area (loop),
at which point it should be hoisted into a register.

    It could maybe make sense to have function-scale static-assigned
    constants, but have not done so yet.

    Though, it appears as if one of the "top contenders" here would be 0,
    mostly because things like:
    foo->x=0;
    And:
    bar[i]=0;

    I see no need for a zero register:: the following are 1 instruction !

    ST #0,[Rfoo,offset(x)]

    ST #0,[Rbar,Ri]

    Are semi-common, and as-is end up needing to load 0 into a register each time they appear.

    Had already ended up with a similar sort of special case to optimize
    "return 0;" and similar, mostly because this was common enough that it
    made more sense to have a special case:
    BRA .lbl_ret //if function does not end with "return 0;"
    .lbl_ret_zero:
    MOV 0, R2
    .lbl_ret:
    ... epilog ...

    For many functions, which allowed "return 0;" to be emitted as:
    BRA .lbl_ret_zero
    Rather than:
    MOV 0, R2
    BRA .lbl_ret
    Which on average ended up as a net-win when there are more than around 3
    of them per function.

    Special defined tails......


    Though, another possibility could be to allow constants to be included
    in the "statically assign variables to registers" logic (as-is, they are excluded except in "tiny leaf" functions).


    Though, would likely still make a few decisions differently from those
    in RISC-V. Things like indexed load/store,

    Absolutely

                                               predicated ops (with a
    designated flag bit),

    Predicated then and else clauses which are branch free.
    {{Also good for constant time crypto in need of flow control...}}


    I have per instruction predication:
    CMPxx ...
    OP?T //if-true
    OP?F //if-false
    Or:
    OP?T | OP?F //both in parallel, subject to encoding and ISA rules

    CMP Rt,Ra,#whatever
    PLE Rt,TTTTTEEE
    // This begins the then-clause 5Ts -> 5 instructions
    OP1
    OP2
    OP3
    OP4
    OP5
    // this begins the else-clause 3Es -> 3 instructions
    OP6
    OP7
    OP8
    // we are now back join point.

    Notice no internal flow control instructions.

    Performance gains are modest, but still noticeable (part of why
    predication ended up as a core ISA feature). Effect on pipeline seems to
    be small in its current form (it is handled along with register fetch, mostly turning non-executed instructions into NOPs during the EX stages).

The effect is that one uses Predication whenever you will have already
fetched instructions at the join point by the time you have determined
the predicate value for the {then, else} clauses. The PARSE and DECODE do
the flow control without bothering FETCH.

    For the most part, 1-bit seems sufficient.

    How do you do && and || predication with 1 bit ??

    More complex schemes generally ran into issues (had experimented with allowing a second predicate bit, or handling predicates as a
    stack-machine, but these ideas were mostly dead on arrival).

Also note: the instructions in the then and else clauses know NOTHING
about being under a predicate mask (or not). Thus, they waste no bits
while retaining the ability to run under predication.


and large-immediate encodings,
    Nothing else is so poorly served in typical ISAs.


    Probably true.


                                                         help enough with
    performance (relative to cost)

    +40%


    I am mostly seeing around 30% or so, for Doom and similar.
    A few other programs still being closer to break-even at present.


    Things are a bit more contentious in terms of code density:
    With size-minimizing options to GCC:
    ".text" is slightly larger with BGBCC vs GCC (around 11%);
    However, the GCC output has significantly more ".rodata".

    A lot of this .rodata becomes constants in .text with universal constants.

    A reasonable chunk of the code-size difference could be attributed to
    jumbo prefixes making the average instruction size slightly bigger.

Size is one thing and it primarily diddles in cache footprint statistics. Instruction count is another and primarily diddles in pipeline cycles
to execute statistics.
Fewer instructions win almost all the time.

    More could be possible with more compiler optimization effort.
    Currently, a few recent optimization cases are disabled as they seem to
    be causing bugs that I haven't figured out yet.


                                   to be worth keeping (though, mostly
    because the alternatives are not so good in terms of performance).

    Damage to pipeline ability less than -5%.

    Yeah.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Tue Mar 26 16:59:57 2024
    From Newsgroup: comp.arch

    On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 3/25/2024 5:17 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Say, "we have an instruction, but it is a boat anchor" isn't an ideal
    situation (unless to be a placeholder for if/when it is not a boat
    anchor).

    If the boat anchor is a required unit of functionality, and I believe
    IDIV and FPDIV is, it should be defined in ISA and if you can't afford
    it find some way to trap rapidly so you can fix it up without excessive overhead. Like a MIPS TLB reload. If you can't get trap and emulate at sufficient performance, then add the HW to perform the instruction.


    Though, 32-bit ARM managed OK without integer divide.

In my case, I ended up supporting it mostly for the sake of the RV64 'M' extension, but it is in this case a little faster than a pure software solution (unlike on the K10 and Piledriver).


    Still costs around 1.5 kLUTs though for 64-bit MUL/DIV support, and a
    little more to route FDIV through it.

    Cheapest FPU approach is still the "ADD/SUB/MUL only" route.


    again. Might also make sense to add an architectural zero register,
    and eliminate some number of encodings which exist merely because of
    the lack of a zero register (though, encodings are comparably cheap,
    as the

    I got an effective zero register without having to waste a register
    name to "get it". My 66000 gives you 32 registers of 64-bits each and
    you can put any bit pattern in any register and treat it as you like.
    Accessing #0 takes 1/16 of a 5-bit encoding space, and is universally
    available.


    I guess offloading this to the compiler can also make sense.

    Least common denominator would be, say, not providing things like NEG
    instructions and similar (pretending as-if one had a zero register),
    and if a program needs to do a NEG or similar, it can load 0 into a
    register itself.

    In the extreme case (say, one also lacks a designated "load immediate"
    instruction or similar), there is still the "XOR Rn, Rn, Rn" strategy
    to zero a register...

        MOV   Rd,#imm16

    Cost 1 instruction of 32-bits in size and can be performed in 0 cycles


    Though, RV had skipped this:
    ADD Xd, Zero, Imm12s
    Or:
    LUI Xd, ImmHi20
    ADD Xd, Xd, ImmLo12s

    One can argue for this on the basis of not needing an immediate-load instruction (nor a MOV instruction, nor NEG, nor ...).


    Though, yeah, in my case I ended up with more variety here:
    LDIZ Imm10u, Rn //10-bit, zero-extend, Imm12u (XG2)
    LDIN Imm10n, Rn //10-bit, one-extend, Imm12n (XG2)
    LDIMIZ Imm10u, Rn //Rn=Imm10u<<16 (newish)
    LDIMIN Imm10n, Rn //Rn=Imm10n<<16 (newish)
    LDIHI Imm10u, Rn //Rn=Imm10u<<22
    LDIQHI Imm10u, Rn //Rn=Imm10u<<54

    LDIZ Imm16u, Rn //16-bit, zero-extend
    LDIN Imm16n, Rn //16-bit, one-extend

    Then 64-bit jumbo forms:
    LDI Imm33s, Rn //33-bit, sign-extend
    LDIHI Imm33s, Rn //Rn=Imm33s<<16
    LDIQHI Imm33s, Rn //Rn=Imm33s<<32

    Then, 96 bit:
    LDI Imm64, Rn //64-bit, sign-extend

    And, some special cases:
    FLDCH Imm16u, Rn //Binary16->Binary64

    One could argue though that this is wild extravagance...

    The recent addition of LDIMIx was mostly because otherwise one needed a
    64-bit encoding to load constants like 262144 or similar (and a lot of bit-masks).


    At one point I did evaluate a more ARM32-like approach (effectively
    using a small value and a rotate). But, this cost more than the other
    options (would have required the great evil of effectively being able to
    feed two immediate values into the integer-shift unit, whereas many of
    the others could be routed through logic I already have for other ops).
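
For reference, the ARM32 scheme being compared against encodes an 8-bit value rotated right by an even amount; a small C check for whether a constant fits that form (a sketch, not tied to any particular assembler):

    #include <stdbool.h>
    #include <stdint.h>

    /* True if 'value' is expressible as imm8 ROR (2*n), the classic
     * 32-bit ARM data-processing immediate. */
    static bool arm32_imm_encodable(uint32_t value)
    {
        for (unsigned rot = 0; rot < 32; rot += 2) {
            /* rotate left by 'rot' to undo a right rotation by 'rot' */
            uint32_t undone = (value << rot) | (value >> ((32 - rot) & 31));
            if (undone <= 0xFF)
                return true;
        }
        return false;
    }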


    Though, one can argue that the drawback is that one does end up with
    more instructions in the ISA listing.


    Say:
       XOR R14, R14, R14  //Designate R14 as pseudo-zero...
       ...
       ADD R14, 0x123, R8  //Load 0x123 into R8

    Though, likely still makes sense in this case to provide some
    "convenience" instructions.


    internal uArch has a zero register, and effectively treats immediate
    values as a special register as well, ...). Some of the debate is
    more related to the logic cost of dealing with some things in the
    decoder.

    The problem is universal constants. RISCs being notably poor in their
    support--however this is better than addressing modes which require
    µCode.


    Yeah.

    I ended up with jumbo-prefixes. Still not perfect, and not perfectly
    orthogonal, but mostly works.

    Allows, say:
       ADD R4, 0x12345678, R6

    To be performed in potentially 1 clock-cycle and with a 64-bit
    encoding, which is better than, say:
       LUI X8, 0x12345
       ADD X8, X8, 0x678
       ADD X12, X10, X8

    This strategy completely fails when the constant contains more than 32-bits

        FDIV   R9,#3.141592653589247,R17

    When you have universal constants (including 5-bit immediates), you rarely need a register containing 0.


    The jumbo prefixes at least allow for a 64-bit constant load, but as-is
    not for 64-bit immediate values to 3RI ops. The latter could be done,
    but would require 128-bit fetch and decode, which doesn't seem worth it.

    There is the limbo feature of allowing for 57-bit immediate values, but
    this is optional.


    OTOH, on the RISC-V side, one needs a minimum of 5 instructions (with
    Zbb), or 6 instructions (without Zbb) to encode a 64-bit constant inline.

    Typical GCC response on RV64 seems to be to turn nearly all of the big-constant cases into memory loads, which kinda sucks.

    Even something like a "LI Xd, Imm17s" instruction, would notably reduce
    the number of constants loaded from memory (as GCC seemingly prefers to
    use a LHU or LW or similar rather than encode it using LUI+ADD).


    I experimented with FPU immediate values, generally E3.F2 (Imm5fp) or
    S.E5.F4 (Imm10fp), but the gains didn't seem enough to justify keeping
    them enabled in the CPU core (they involved the non-zero cost of
    repacking them into Binary16 in ID1 and then throwing a
    Binary16->Binary64 converter into the ID2 stage).

    Generally, the "FLDCH Imm16, Rn" instruction works well enough here (and
    can leverage a more generic Binary16->Binary64 converter path).
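
A plain C sketch of the widening that converter path performs (standard IEEE 754 Binary16->Binary64, including subnormals, infinities, and NaNs; illustrative only, not the actual hardware):

    #include <stdint.h>
    #include <string.h>

    static double half_to_double(uint16_t h)
    {
        uint64_t sign = (uint64_t)(h >> 15) << 63;
        int      exp  = (h >> 10) & 0x1F;
        uint64_t frac = h & 0x3FF;
        uint64_t bits;

        if (exp == 0x1F) {                      /* Inf or NaN */
            bits = sign | (0x7FFull << 52) | (frac << 42);
        } else if (exp == 0) {
            if (frac == 0) {                    /* +/- zero */
                bits = sign;
            } else {                            /* subnormal: renormalize */
                int e = -14;
                while (!(frac & 0x400)) { frac <<= 1; e--; }
                frac &= 0x3FF;                  /* drop the implicit 1 */
                bits = sign | ((uint64_t)(e + 1023) << 52) | (frac << 42);
            }
        } else {                                /* normal number */
            bits = sign | ((uint64_t)(exp - 15 + 1023) << 52) | (frac << 42);
        }

        double d;
        memcpy(&d, &bits, sizeof d);
        return d;
    }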


    For FPU compare with zero, can almost leverage the integer compare ops,
    apart from the annoying edge cases of -0.0 and NaN leading to "not
    strictly equivalent" behavior (though, an ASM programmer could more
    easily get away with this). But, not common enough to justify adding FPU specific ops for this.
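
The edge cases in question, as a tiny C demo: raw bit-pattern comparison disagrees with a true FP compare exactly at -0.0 (different bits, compares equal) and NaN (identical bits, never compares equal):

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint64_t bits_of(double d)
    {
        uint64_t b;
        memcpy(&b, &d, sizeof b);
        return b;
    }

    int main(void)
    {
        double nz = -0.0, qn = nan("");

        /* -0.0: bits differ from +0.0, yet the FP compare says equal. */
        printf("bits equal: %d   fp equal: %d\n",
               bits_of(nz) == bits_of(0.0), nz == 0.0);

        /* NaN: bits identical to themselves, yet FP compare says unequal. */
        printf("bits equal: %d   fp equal: %d\n",
               bits_of(qn) == bits_of(qn), qn == qn);
        return 0;
    }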


    Though, for jumbo-prefixes, did end up adding a special case in the
compiler where it will try to figure out if a constant will be used
    multiple times in a basic-block and, if so, will load it into a
    register rather than use a jumbo-prefix form.

    This is a delicate balance:: while each use of the constant takes a
    unit or 2 of the instruction stream, each use cost 0 more instructions.
    The breakeven point in My 66000 is about 4 uses in a small area (loop)
    means that it should be hoisted into a register.


    IIRC, I had set it as x>2, since 2 seemed break-even, 3+ favoring using
    a register, and 1 favoring a constant.

    This did require adding logic to look forward over the current basic
    block to check for additional usage, which does seem a little like a
    hack. But, similar logic had already been used by the register allocator
    to try to prioritize which values should be evicted.
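
A hedged sketch of that look-ahead (the IR record here is made up, just enough to show the shape of the check): scan the remainder of the basic block for further uses of the same immediate, and only materialize it in a register past the threshold:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical IR op: only the fields this check needs. */
    typedef struct {
        bool    has_imm;
        int64_t imm;
    } IrOp;

    /* Count uses of 'imm' in ops[start..end); the caller hoists the
     * constant into a register only when the count exceeds 2. */
    static int count_imm_uses(const IrOp *ops, size_t start, size_t end,
                              int64_t imm)
    {
        int n = 0;
        for (size_t i = start; i < end; i++)
            if (ops[i].has_imm && ops[i].imm == imm)
                n++;
        return n;
    }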


    It could maybe make sense to have function-scale static-assigned
    constants, but have not done so yet.

    Though, it appears as if one of the "top contenders" here would be 0,
    mostly because things like:
       foo->x=0;
    And:
       bar[i]=0;

    I see no need for a zero register:: the following are 1 instruction !

       ST     #0,[Rfoo,offset(x)]

       ST     #0,[Rbar,Ri]


    In my case, there is no direct equivalent in the core ISA.

    But, in retrospect, a zero-register is probably not worth it if one is
    not trying for RISC-V style minimalism in the core ISA (nevermind if
    they seemingly promptly disregard minimalism as soon as one gets outside
    of 'I', but then fail to address the shortcomings of said minimalism in
    'I').


    Are semi-common, and as-is end up needing to load 0 into a register
    each time they appear.

    Had already ended up with a similar sort of special case to optimize
    "return 0;" and similar, mostly because this was common enough that it
    made more sense to have a special case:
       BRA .lbl_ret  //if function does not end with "return 0;"
       .lbl_ret_zero:
       MOV 0, R2
       .lbl_ret:
       ... epilog ...

    For many functions, which allowed "return 0;" to be emitted as:
       BRA .lbl_ret_zero
    Rather than:
       MOV 0, R2
       BRA .lbl_ret
    Which on average ended up as a net-win when there are more than around
    3 of them per function.

    Special defined tails......


    I didn't really go much beyond 0, as supporting more than 0 ends up
    adding cost in terms of needing additional branches.

    But, yeah, 0/1/NULL/etc, seem to be fairly common return values.


    Though, another possibility could be to allow constants to be included
    in the "statically assign variables to registers" logic (as-is, they
    are excluded except in "tiny leaf" functions).


Went and tweaked the compiler rules, so now they are included in the
ranking in normal functions but given a lower priority than the local variables (by roughly "one loop nesting level"). Seemed to help slightly (making it 1 level less appeared to be the local optimum, judging by
binary size).


    Seems that generally 0 still isn't quite common enough to justify having
    one register fewer for variables though (or to have a designated zero register), but otherwise it seems there is not much to justify trying to exclude the "implicit zero" ops from the ISA listing.



    Though, would likely still make a few decisions differently from
    those in RISC-V. Things like indexed load/store,

    Absolutely

                                               predicated ops (with a
    designated flag bit),

    Predicated then and else clauses which are branch free.
    {{Also good for constant time crypto in need of flow control...}}


    I have per instruction predication:
       CMPxx ...
       OP?T  //if-true
       OP?F  //if-false
    Or:
       OP?T | OP?F  //both in parallel, subject to encoding and ISA rules

        CMP  Rt,Ra,#whatever
        PLE  Rt,TTTTTEEE
        // This begins the then-clause 5Ts -> 5 instructions
        OP1
        OP2
        OP3
        OP4
        OP5
        // this begins the else-clause 3Es -> 3 instructions
        OP6
        OP7
        OP8
        // we are now back join point.

    Notice no internal flow control instructions.


    It can be similar in my case, with the ?T / ?F encoding scheme.

    While poking at it, did go and add a check to exclude large struct-copy operations from predication, as it is slower to turn a large struct copy
    into NOPs than to branch over it.

    Did end up leaving struct-copies where sz<=64 as allowed though (where a
    64 byte copy at least has the merit of achieving full pipeline
    saturation and being roughly break-even with a branch-miss, whereas a
    128 byte copy would cost roughly twice as much as a branch miss).


    Performance gains are modest, but still noticeable (part of why
    predication ended up as a core ISA feature). Effect on pipeline seems
    to be small in its current form (it is handled along with register
    fetch, mostly turning non-executed instructions into NOPs during the
    EX stages).

    The effect is that one uses Predication whenever you will have already fetched instructions at the join point by the time you have determined
    the predicate value {then, else} clauses. The PARSE and DECODE do the
    flow control without bothering FETCH.


    Yeah, though in my pipeline, it is still a tradeoff of the relative cost
    of a missed branch, vs the cost of sliding over both the THEN and ELSE branches as a series of NOPs.


    For the most part, 1-bit seems sufficient.

    How do you do && and || predication with 1 bit ??


    Originally, it didn't.
    Now I added some 3R and 3RI CMPxx encodings.

    This allows, say:
    CMPGT R8, R10, R4
    CMPGT R8, R11, R5
    TST R4, R5
    ...

    The stack-machine predicate model could have worked, but would have been
    more of an issue to deal with in the compiler (in addition to the
    non-zero LUT cost, and not really saving any clock-cycles as compared
    with the newer bit-twiddling in registers strategy).

    Compiler does need some additional handling here, to not try this if the expressions could lead to side-effects.
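
In C terms, the transformation being described is roughly the following (a sketch; it is only valid because both conditions are side-effect-free):

    #include <stdbool.h>
    #include <stdint.h>

    /* Short-circuit form: needs a branch or a second predicated compare. */
    static bool in_range_branchy(int64_t x, int64_t lo, int64_t hi)
    {
        return (x > lo) && (x < hi);
    }

    /* Flattened form: both compares run unconditionally, then one
     * TST-style combine of the two boolean results. */
    static bool in_range_flat(int64_t x, int64_t lo, int64_t hi)
    {
        int a = (x > lo);      /* CMPGT -> R4 */
        int b = (hi > x);      /* CMPGT -> R5 */
        return (a & b) != 0;   /* TST R4, R5  */
    }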


    More complex schemes generally ran into issues (had experimented with
    allowing a second predicate bit, or handling predicates as a
    stack-machine, but these ideas were mostly dead on arrival).

Also note: the instructions in the then and else clauses know NOTHING
about being under a predicate mask (or not). Thus, they waste no bits
while retaining the ability to run under predication.


    It is a tradeoff. In my case they cost 2 bits in the encoding, which is
    shared with the WEX scheme.


and large-immediate encodings,
    Nothing else is so poorly served in typical ISAs.


    Probably true.


                                                         help enough
    with performance (relative to cost)

    +40%


    I am mostly seeing around 30% or so, for Doom and similar.
       A few other programs still being closer to break-even at present.


    Things are a bit more contentious in terms of code density:
       With size-minimizing options to GCC:
         ".text" is slightly larger with BGBCC vs GCC (around 11%);
         However, the GCC output has significantly more ".rodata".

    A lot of this .rodata becomes constants in .text with universal constants.


    Yeah.

    A reasonable chunk of the code-size difference could be attributed to
    jumbo prefixes making the average instruction size slightly bigger.

Size is one thing and it primarily diddles in cache footprint statistics. Instruction count is another and primarily diddles in pipeline cycles
to execute statistics. Fewer instructions win almost all the time.


    Yeah, pretty much...


    More could be possible with more compiler optimization effort.
    Currently, a few recent optimization cases are disabled as they seem
    to be causing bugs that I haven't figured out yet.


                                   to be worth keeping (though, mostly
    because the alternatives are not so good in terms of performance).

    Damage to pipeline ability less than -5%.

    Yeah.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Mar 27 00:27:15 2024
    From Newsgroup: comp.arch

    On Tue, 26 Mar 2024 16:59:57 -0500
    BGB-Alt <bohannonindustriesllc@gmail.com> wrote:

    On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 3/25/2024 5:17 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Say, "we have an instruction, but it is a boat anchor" isn't an
    ideal situation (unless to be a placeholder for if/when it is not
    a boat anchor).

    If the boat anchor is a required unit of functionality, and I
    believe IDIV and FPDIV is, it should be defined in ISA and if you
    can't afford it find some way to trap rapidly so you can fix it up
    without excessive overhead. Like a MIPS TLB reload. If you can't
    get trap and emulate at sufficient performance, then add the HW to
    perform the instruction.

    Though, 32-bit ARM managed OK without integer divide.


For slightly less than 20 years ARM managed OK without integer divide.
Then in 2004 they added an integer divide instruction in ARMv7 (including
the ARMv7-M variant intended for small microcontroller cores like
Cortex-M3), and for the following 20 years, instead of merely OK, they have
been doing great :-)







    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Mar 27 00:02:05 2024
    From Newsgroup: comp.arch

    BGB-Alt wrote:

    On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
    BGB wrote:

    I ended up with jumbo-prefixes. Still not perfect, and not perfectly
    orthogonal, but mostly works.

    Allows, say:
       ADD R4, 0x12345678, R6

    To be performed in potentially 1 clock-cycle and with a 64-bit
    encoding, which is better than, say:
       LUI X8, 0x12345
       ADD X8, X8, 0x678
       ADD X12, X10, X8

This strategy completely fails when the constant contains more than 32-bits
        FDIV   R9,#3.141592653589247,R17

When you have universal constants (including 5-bit immediates), you rarely
need a register containing 0.


    The jumbo prefixes at least allow for a 64-bit constant load, but as-is
    not for 64-bit immediate values to 3RI ops. The latter could be done,
    but would require 128-bit fetch and decode, which doesn't seem worth it.

    There is the limbo feature of allowing for 57-bit immediate values, but
    this is optional.


    OTOH, on the RISC-V side, one needs a minimum of 5 instructions (with
    Zbb), or 6 instructions (without Zbb) to encode a 64-bit constant inline.

Which the LLVM compiler for RISC-V does not do; instead it uses an AUIPC
and a LD to get the value from data memory within ±2GB of IP. This takes
3 instructions and 2 words in memory, when universal constants do this in
1 instruction and 2 words in the code stream.

    Typical GCC response on RV64 seems to be to turn nearly all of the big-constant cases into memory loads, which kinda sucks.

This is typical when the underlying architecture is not very extensible
to 64-bit virtual address spaces; they have to waste a portion of the
32-bit space to get access to all the 64-bit space. Universal constants
make this problem vanish.

    Even something like a "LI Xd, Imm17s" instruction, would notably reduce
    the number of constants loaded from memory (as GCC seemingly prefers to
    use a LHU or LW or similar rather than encode it using LUI+ADD).

Reduced when compared to RISC-V, but increased when compared to My 66000.
My 66000 (at the 99% level) uses no instructions to fetch or create
constants, nor does it waste any register (or registers) to hold use-once
constants.

    I experimented with FPU immediate values, generally E3.F2 (Imm5fp) or S.E5.F4 (Imm10fp), but the gains didn't seem enough to justify keeping
    them enabled in the CPU core (they involved the non-zero cost of
    repacking them into Binary16 in ID1 and then throwing a
    Binary16->Binary64 converter into the ID2 stage).

    Generally, the "FLDCH Imm16, Rn" instruction works well enough here (and
    can leverage a more generic Binary16->Binary64 converter path).

    Sometimes I see a::

    CVTSD R2,#5

Where a 5-bit immediate (value = 5) is converted into 5.0D0 and placed in register R2 so it can be accessed as an argument in the subroutine call
to happen in a few instructions.

Mostly, a floating point immediate is available from a 32-bit constant container. When accessed in a float calculation it is used as IEEE32;
when accessed by a double calculation, IEEE32->IEEE64 promotion is
performed in the constant delivery path. So, one can use almost any
floating point constant that is representable in float as a double
without eating cycles and while saving code footprint.

    For FPU compare with zero, can almost leverage the integer compare ops, apart from the annoying edge cases of -0.0 and NaN leading to "not
    strictly equivalent" behavior (though, an ASM programmer could more
    easily get away with this). But, not common enough to justify adding FPU specific ops for this.

    Actually, the edge/noise cases are not that many gates.
    a) once you are separating out NaNs, infinities are free !!
b) once you are checking denorms for zero, infinities become free !!

Having structured a compare-to-zero circuit based on the fields in double,
you can compose the terms to do all signed and unsigned integers and get
a gate count; the number of gates you add to cover all 10 cases of
floating point is then 12% over the simple integer version. Also
note:: this circuit is about 10% of the gate count of an integer adder.

    -----------------------

    Seems that generally 0 still isn't quite common enough to justify having
    one register fewer for variables though (or to have a designated zero register), but otherwise it seems there is not much to justify trying to exclude the "implicit zero" ops from the ISA listing.


It is common enough,
but there are lots of ways to get a zero where you want it for a return.


    Though, would likely still make a few decisions differently from
    those in RISC-V. Things like indexed load/store,

    Absolutely

                                               predicated ops (with a
    designated flag bit),

    Predicated then and else clauses which are branch free.
    {{Also good for constant time crypto in need of flow control...}}


    I have per instruction predication:
       CMPxx ...
       OP?T  //if-true
       OP?F  //if-false
    Or:
       OP?T | OP?F  //both in parallel, subject to encoding and ISA rules

        CMP  Rt,Ra,#whatever
        PLE  Rt,TTTTTEEE
        // This begins the then-clause 5Ts -> 5 instructions
        OP1
        OP2
        OP3
        OP4
        OP5
        // this begins the else-clause 3Es -> 3 instructions
        OP6
        OP7
        OP8
        // we are now back join point.

    Notice no internal flow control instructions.


    It can be similar in my case, with the ?T / ?F encoding scheme.

    Except you eat that/those bits in OpCode encoding.

    While poking at it, did go and add a check to exclude large struct-copy operations from predication, as it is slower to turn a large struct copy into NOPs than to branch over it.

    Did end up leaving struct-copies where sz<=64 as allowed though (where a
    64 byte copy at least has the merit of achieving full pipeline
    saturation and being roughly break-even with a branch-miss, whereas a
    128 byte copy would cost roughly twice as much as a branch miss).

I decided to bite the bullet and have LDM, STM and MM so the compiler does
not have to do any analysis. This puts the onus on the memory unit designer
to process these at least as fast as a series of LDs and STs. Done right
this saves ~40% of the power of the caches by avoiding ~70% of tag accesses
and 90% of TLB accesses. You access the tag only when/after crossing a line boundary and you access the TLB only after crossing a page boundary.

    Performance gains are modest, but still noticeable (part of why
    predication ended up as a core ISA feature). Effect on pipeline seems
    to be small in its current form (it is handled along with register
    fetch, mostly turning non-executed instructions into NOPs during the
    EX stages).

    The effect is that one uses Predication whenever you will have already
    fetched instructions at the join point by the time you have determined
    the predicate value {then, else} clauses. The PARSE and DECODE do the
    flow control without bothering FETCH.


    Yeah, though in my pipeline, it is still a tradeoff of the relative cost
    of a missed branch, vs the cost of sliding over both the THEN and ELSE branches as a series of NOPs.


    For the most part, 1-bit seems sufficient.

    How do you do && and || predication with 1 bit ??


    Originally, it didn't.
    Now I added some 3R and 3RI CMPxx encodings.

    This allows, say:
    CMPGT R8, R10, R4
    CMPGT R8, R11, R5
    TST R4, R5
    ....


    All I had to do was to make the second predication overwrite the first predication's mask, and the compiler did the rest.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Wed Mar 27 13:21:04 2024
    From Newsgroup: comp.arch

    On 3/26/2024 7:02 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:

    On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
    BGB wrote:

    I ended up with jumbo-prefixes. Still not perfect, and not perfectly
    orthogonal, but mostly works.

    Allows, say:
       ADD R4, 0x12345678, R6

    To be performed in potentially 1 clock-cycle and with a 64-bit
    encoding, which is better than, say:
       LUI X8, 0x12345
       ADD X8, X8, 0x678
       ADD X12, X10, X8

    This strategy completely fails when the constant contains more than
    32-bits

         FDIV   R9,#3.141592653589247,R17

    When you have universal constants (including 5-bit immediates), you
    rarely
    need a register containing 0.


    The jumbo prefixes at least allow for a 64-bit constant load, but
    as-is not for 64-bit immediate values to 3RI ops. The latter could be
    done, but would require 128-bit fetch and decode, which doesn't seem
    worth it.

    There is the limbo feature of allowing for 57-bit immediate values,
    but this is optional.


    OTOH, on the RISC-V side, one needs a minimum of 5 instructions (with
    Zbb), or 6 instructions (without Zbb) to encode a 64-bit constant inline.

Which the LLVM compiler for RISC-V does not do; instead it uses an AUIPC
and a LD to get the value from data memory within ±2GB of IP. This takes
3 instructions and 2 words in memory, when universal constants do this in
1 instruction and 2 words in the code stream.


    I was mostly testing with GCC for RV64, but, yeah, it just does memory
    loads.


    Typical GCC response on RV64 seems to be to turn nearly all of the
    big-constant cases into memory loads, which kinda sucks.

    This is typical when the underlying architecture is not very extensible
    to 64-bit virtual address spaces; they have to waste a portion of the
    32-bit space to get access to all the 64-bit space. Universal constants
    makes this problem vanish.


    Yeah.

    It at least seems worthwhile to have non-suck fallback strategies.



    Even something like a "LI Xd, Imm17s" instruction, would notably
    reduce the number of constants loaded from memory (as GCC seemingly
    prefers to use a LHU or LW or similar rather than encode it using
    LUI+ADD).

Reduced when compared to RISC-V, but increased when compared to My 66000.
My 66000 (at the 99% level) uses no instructions to fetch or create constants, nor does it waste any register (or registers) to hold use-once
constants.


    Yeah, but the issue here is mostly with RISC-V and its lack of constant
    load.

    Or burning lots of encoding space (on LUI and AUIPC), but still lacks
    good general-purpose options.

    FWIW, if they had:
    LI Xd, Imm17s
    SHORI Xd, Imm16u //Xd=(Xd<<16)|Imm16u

    Would have still allowed a 32-bit constant in 2 ops, but would have also allowed 64-bit in 4 ops (within the limits of fixed-length 32-bit instructions); while also needing significantly less encoding space
    (could fit both of them into the remaining space in the OP-IMM-32 block).
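
In C terms, the LI+SHORI construction assembles the value 16 bits at a time; a hedged sketch of the semantics (LI and SHORI here are the hypothetical instructions proposed above, not RISC-V as it exists today):

    #include <stdint.h>

    /* Rebuild a 64-bit constant the way LI Xd,Imm17s + 3x SHORI would:
     * sign-extended top chunk first, then shift in each lower 16 bits. */
    static uint64_t build_const_shori(uint64_t value)
    {
        uint64_t x = (uint64_t)(int64_t)(int16_t)(value >> 48);   /* LI    */
        x = (x << 16) | ((value >> 32) & 0xFFFF);                 /* SHORI */
        x = (x << 16) | ((value >> 16) & 0xFFFF);                 /* SHORI */
        x = (x << 16) | (value & 0xFFFF);                         /* SHORI */
        return x;    /* == value for any 64-bit input */
    }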


    I experimented with FPU immediate values, generally E3.F2 (Imm5fp) or
    S.E5.F4 (Imm10fp), but the gains didn't seem enough to justify keeping
    them enabled in the CPU core (they involved the non-zero cost of
    repacking them into Binary16 in ID1 and then throwing a
    Binary16->Binary64 converter into the ID2 stage).

    Generally, the "FLDCH Imm16, Rn" instruction works well enough here
    (and can leverage a more generic Binary16->Binary64 converter path).

    Sometimes I see a::

        CVTSD     R2,#5

Where a 5-bit immediate (value = 5) is converted into 5.0D0 and placed
in register R2 so it can be accessed as an argument in the subroutine call
to happen in a few instructions.


    I had looked into, say:
    FADD Rm, Imm5fp, Rn
    Where, despite Imm5fp being severely limited, it had an OK hit rate.

    Unpacking imm5fp to Binary16 being, essentially:
    aee.fff -> 0.aAAee.fff0000000



    OTOH, can note that a majority of typical floating point constants can
    be represented exactly in Binary16 (well, excluding "0.1" or similar),
    so it works OK as an immediate format.

    This allows a single 32-bit op to be used for constant loads (nevermind
    if one needs a 96 bit encoding for 0.1, or PI, or ...).


    IIRC, I originally had it in the CONV2 path, which would give it a
    2-cycle latency and only allow it in Lane 1.

    I later migrated the logic to the "MOV_IR" path which also deals with "Rn=(Rn<<16)|Imm" and similar, and currently allows 1-cycle Binary16 immediate-loads in all 3 lanes.

    Though, BGBCC still assumes it is Lane-1 only unless the FPU Immediate extension is enabled (as with the other FP converters).


Mostly, a floating point immediate is available from a 32-bit constant container. When accessed in a float calculation it is used as IEEE32;
when accessed by a double calculation, IEEE32->IEEE64 promotion is
performed in the constant delivery path. So, one can use almost any
floating point constant that is representable in float as a double
without eating cycles and while saving code footprint.


    Don't currently have the encoding space for this.

Could in theory pull off a truncated Binary32 as an Imm29s form, but not
likely worth it. Would also require putting a converter in the ID2
stage, so not free.

    In this case, the issue is more one of LUT cost to support these cases.


    For FPU compare with zero, can almost leverage the integer compare
    ops, apart from the annoying edge cases of -0.0 and NaN leading to
    "not strictly equivalent" behavior (though, an ASM programmer could
    more easily get away with this). But, not common enough to justify
    adding FPU specific ops for this.

    Actually, the edge/noise cases are not that many gates.
    a) once you are separating out NaNs, infinities are free !!
b) once you are checking denorms for zero, infinities become free !!

Having structured a compare-to-zero circuit based on the fields in double,
you can compose the terms to do all signed and unsigned integers and get
a gate count; the number of gates you add to cover all 10 cases of
floating point is then 12% over the simple integer version. Also
note:: this circuit is about 10% of the gate count of an integer adder.


    I could add them, but, is it worth it?...

    In this case, it is more a question of encoding space than logic cost.

    It is semi-common in FP terms, but likely not common enough to justify dedicated compare-and-branch ops and similar (vs the minor annoyance at
    the integer ops not quite working correctly due to edge cases).


    -----------------------

    Seems that generally 0 still isn't quite common enough to justify
    having one register fewer for variables though (or to have a
    designated zero register), but otherwise it seems there is not much to
    justify trying to exclude the "implicit zero" ops from the ISA listing.


    It is common enough,
    But there are lots of ways to get a zero where you want it for a return.


    I think the main use case for a zero register is mostly that it allows
    using it as a special case for pseudo-ops. I guess, not quite the same
    if it is a normal GPR that just so happens to be 0.

    Recently ended up fixing a bug where:
    y=-x;
    Was misbehaving with "unsigned int":
    "NEG" produces a value which falls outside of UInt range;
    But, "NEG; EXTU.L" is a 2-op sequence.
    It had the EXT for SB/UB/SW/UW, but not for UL.
    For SL, bare NEG almost works, apart from ((-1)<<31).
    Could encode it as:
    SUBU.L Zero, Rs, Rn
    But, without a zero register, the compiler needs to special-case
    provision this (or, in theory, add a "NEGU.L" instruction, but doesn't
    seem common enough to justify this).
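
    A small C example of the edge case in question, assuming 32-bit unsigned
    values are kept zero-extended in 64-bit registers (which is what makes
    the bare NEG insufficient):

    #include <stdint.h>

    /* x holds a uint32 value zero-extended into a 64-bit register.
       A bare 64-bit NEG leaves all-ones in the upper half, so the
       result must be re-zero-extended: the "NEG; EXTU.L" pair.
       A SUBU.L from a zero source would do both steps in one op. */
    uint64_t neg_u32_zext(uint64_t x)
    {
        uint64_t neg = 0 - x;           /* NEG: upper 32 bits now set  */
        return neg & 0xFFFFFFFFu;       /* EXTU.L: back in UInt range  */
    }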

    ...



    Though, would likely still make a few decisions differently from
    those in RISC-V. Things like indexed load/store,

    Absolutely

                                               predicated ops (with a
    designated flag bit),

    Predicated then and else clauses which are branch free.
    {{Also good for constant time crypto in need of flow control...}}


    I have per instruction predication:
       CMPxx ...
       OP?T  //if-true
       OP?F  //if-false
    Or:
       OP?T | OP?F  //both in parallel, subject to encoding and ISA rules
         CMP  Rt,Ra,#whatever
         PLE  Rt,TTTTTEEE
         // This begins the then-clause 5Ts -> 5 instructions
         OP1
         OP2
         OP3
         OP4
         OP5
         // this begins the else-clause 3Es -> 3 instructions
         OP6
         OP7
         OP8
         // we are now back join point.

    Notice no internal flow control instructions.


    It can be similar in my case, with the ?T / ?F encoding scheme.

    Except you eat that/those bits in OpCode encoding.


    It is less bad than 32-bit ARM, where I only burnt 2 bits, rather than 4.

    Also seems like a reasonable tradeoff, as the 2 bits effectively gain:
    Per-instruction predication;
    WEX / Bundle encoding;
    Jumbo prefixes;
    ...

    But, maybe otherwise could have justified slightly bigger immediate
    fields, dunno.


    While poking at it, did go and add a check to exclude large
    struct-copy operations from predication, as it is slower to turn a
    large struct copy into NOPs than to branch over it.

    Did end up leaving struct-copies where sz<=64 as allowed though (where
    a 64 byte copy at least has the merit of achieving full pipeline
    saturation and being roughly break-even with a branch-miss, whereas a
    128 byte copy would cost roughly twice as much as a branch miss).

    I decided to bite the bullet and have LDM, STM and MM so the compiler does
    not have to do any analysis. This puts the onus on the memory unit designer
    to process these at least as fast as a series of LDs and STs. Done right,
    this saves ~40% of the power of the caches, avoiding ~70% of tag accesses
    and 90% of TLB accesses. You access the tag only when/after crossing a
    line boundary and you access the TLB only after crossing a page boundary.
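
    A rough C model of the boundary-crossing point (illustrative only, not
    the actual memory-unit logic): during a multi-register transfer, charge
    a tag lookup only when the running address crosses a cache-line boundary
    and a TLB lookup only when it crosses a page boundary:

    #include <stdint.h>

    #define LINE_BYTES 64
    #define PAGE_BYTES 4096

    /* Count tag/TLB lookups for an LDM-style burst of 'nregs'
       consecutive 64-bit loads starting at virtual address 'va'. */
    void ldm_lookup_counts(uint64_t va, int nregs,
                           int *tag_lookups, int *tlb_lookups)
    {
        *tag_lookups = 1;                 /* first access always probes */
        *tlb_lookups = 1;
        for (int i = 1; i < nregs; i++) {
            uint64_t prev = va + 8u * (uint64_t)(i - 1);
            uint64_t cur  = va + 8u * (uint64_t)i;
            if (prev / LINE_BYTES != cur / LINE_BYTES)
                (*tag_lookups)++;         /* crossed a cache line       */
            if (prev / PAGE_BYTES != cur / PAGE_BYTES)
                (*tlb_lookups)++;         /* crossed a page             */
        }
    }

    For an aligned 8-register LDM this gives 1 tag probe instead of 8, which
    is the flavor of the savings being claimed.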

    OK.

    In my case, it was more a case of noting that sliding over, say, 1kB
    worth of memory loads/stores, is slower than branching around it.

    Previously this case was not checked.
    Had also changed some of the rules to allow FPU stuff to be predicated,
    since the current form of the ISA has no issue predicating FPU ops
    (IIRC, in an early form of the ISA, the FPU ops could not be predicated).

    Do still need to exclude any cases which may result in a function call
    though, ...


    Performance gains are modest, but still noticeable (part of why
    predication ended up as a core ISA feature). Effect on pipeline
    seems to be small in its current form (it is handled along with
    register fetch, mostly turning non-executed instructions into NOPs
    during the EX stages).

    The effect is that one uses Predication whenever you will have already
    fetched the instructions at the join point by the time you have determined
    the predicate value for the {then, else} clauses. The PARSE and DECODE do
    the flow control without bothering FETCH.


    Yeah, though in my pipeline, it is still a tradeoff of the relative
    cost of a missed branch, vs the cost of sliding over both the THEN and
    ELSE branches as a series of NOPs.


    For the most part, 1-bit seems sufficient.

    How do you do && and || predication with 1 bit ??


    Originally, it didn't.
    Now I added some 3R and 3RI CMPxx encodings.

    This allows, say:
    CMPGT R8, R10, R4
    CMPGT R8, R11, R5
    TST R4, R5
    ....
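
    The kind of source this is aimed at, roughly (hypothetical example): both
    compares are materialized into registers and combined, so the whole
    condition can feed a single predicate bit without internal branches:

    #include <stdint.h>

    /* Roughly: CMPGT -> R4, CMPGT -> R5, then TST R4,R5 sets the predicate. */
    int both_greater(int64_t x, int64_t a, int64_t b)
    {
        int c0 = (x > a);      /* CMPGT  R8, R10, R4 */
        int c1 = (x > b);      /* CMPGT  R8, R11, R5 */
        return c0 & c1;        /* TST    R4, R5      */
    }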


    All I had to do was to make the second predication overwrite the first predication's mask, and the compiler did the rest.

    Not so simple in my case, but the hardware is simpler, since it just
    cares about the state of 1 bit (which is explicitly saved/restored along
    with the rest of the status-register if an interrupt occurs).

    Various cases get kind of annoying in the compiler, as it needs to
    figure out to some extent what the conditional is doing, and how to
    translate this through the IR stages.

    ...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Wed Mar 27 13:32:56 2024
    From Newsgroup: comp.arch

    On 3/26/2024 5:27 PM, Michael S wrote:
    On Tue, 26 Mar 2024 16:59:57 -0500
    BGB-Alt <bohannonindustriesllc@gmail.com> wrote:

    On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 3/25/2024 5:17 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Say, "we have an instruction, but it is a boat anchor" isn't an
    ideal situation (unless to be a placeholder for if/when it is not
    a boat anchor).

    If the boat anchor is a required unit of functionality, and I
    believe IDIV and FPDIV is, it should be defined in ISA and if you
    can't afford it find some way to trap rapidly so you can fix it up
    without excessive overhead. Like a MIPS TLB reload. If you can't
    get trap and emulate at sufficient performance, then add the HW to
    perform the instruction.

    Though, 32-bit ARM managed OK without integer divide.


    For slightly less than 20 years ARM managed OK without integer divide.
    Then in 2004 they added an integer divide instruction in ARMv7 (including
    the ARMv7-M variant intended for small microcontroller cores like
    Cortex-M3) and for the following 20 years, instead of merely OK, they are
    doing great :-)


    OK.

    I think both modern ARM and AMD Zen went over to "actually fast" integer divide.

    I think for a long time, the de-facto integer divide was ~ 36-40 cycles
    for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I
    can get from a shift-add unit.


    On my BJX2 core, it is currently similar (36 and 68 cycle for divide).
    This works out faster than a generic shift-subtract divider (or using a runtime call which then sorts out what to do).

    A special case allows turning small divisors internally into divide-by-reciprocal, which allows for a 3-cycle divide special case.
    But, this is a LUT cost tradeoff.

    It could be possible in theory to support a general 3-cycle integer
    divide, albeit if one can accept inexact results (would be faster than
    the software-based lookup table strategy).
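
    As a rough sketch of the general reciprocal-table idea (not the actual
    BJX2 mechanism; the table size, shift, and ranges here are chosen only
    for illustration): for divisors 1..256 and 32-bit dividends,
    q = (n * ceil(2^40/d)) >> 40 is exact, so a small table turns these
    divides into a multiply plus a shift:

    #include <stdint.h>

    static uint64_t recip40[257];

    static void init_recip(void)
    {
        for (uint64_t d = 1; d <= 256; d++)
            recip40[d] = (UINT64_C(1) << 40) / d +
                         ((UINT64_C(1) << 40) % d != 0);   /* ceil(2^40/d) */
    }

    /* Exact n/d for 0 <= n < 2^32, 1 <= d <= 256 (GCC/Clang __int128). */
    static uint32_t div_small(uint32_t n, uint32_t d)
    {
        unsigned __int128 p = (unsigned __int128)n * recip40[d];
        return (uint32_t)(p >> 40);
    }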


    But, it is debatable. Pure minimalism would likely favor leaving out
    divide (and a bunch of other stuff). Usual rationale being, say, to try
    to fit the entire ISA listing on a single page of paper or similar (vs
    having a listing with several hundred defined encodings).


    Nevermind if the commonly used ISAs (x86 and 64-bit ARM) have ISA
    listings that are considerably larger (thousands of encodings).

    ...

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Mar 27 21:03:52 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 3/26/2024 7:02 PM, MitchAlsup1 wrote:

    Sometimes I see a::

        CVTSD     R2,#5

    Where a 5-bit immediate (value = 5) is converted into 5.0D0 and placed
    in register R2 so it can be accessed as an argument in the subroutine call
    that happens a few instructions later.


    I had looked into, say:
    FADD Rm, Imm5fp, Rn
    Where, despite Imm5fp being severely limited, it had an OK hit rate.

    Unpacking imm5fp to Binary16 being, essentially:
    aee.fff -> 0.aAAee.fff0000000

    realistically ±{0, 1, 2, 3, 4, 5, .., 31} only misses a few of the often
    used fp constants--but does include 0, 1, 2, and 10. Also, realistically
    the missing cases are the 0.5s.

    OTOH, can note that a majority of typical floating point constants can
    be represented exactly in Binary16 (well, excluding "0.1" or similar),
    so it works OK as an immediate format.

    This allows a single 32-bit op to be used for constant loads (nevermind
    if one needs a 96 bit encoding for 0.1, or PI, or ...).


    Mostly, a floating point immediate is available from a 32-bit constant
    container. When accessed in a float calculation it is used as IEEE32;
    when accessed by a double calculation, IEEE32->IEEE64 promotion is
    performed in the constant delivery path. So, one can use almost any
    floating point constant that is representable in float as a double
    without eating cycles and while saving code footprint.


    Don't currently have the encoding space for this.

    Could in theory pull off truncated Binary32 as an Imm29s form, but not
    likely worth it. Would also require putting a converter in the ID2
    stage, so not free.

    In this case, the issue is more one of LUT cost to support these cases.


    For FPU compare with zero, can almost leverage the integer compare
    ops, apart from the annoying edge cases of -0.0 and NaN leading to
    "not strictly equivalent" behavior (though, an ASM programmer could
    more easily get away with this). But, not common enough to justify
    adding FPU specific ops for this.

    Actually, the edge/noise cases are not that many gates.
    a) once you are separating out NaNs, infinities are free !!
    b) once you are checking denorms for zero, infinities become free !!

    Having structured a Compare-to-zero circuit based on the fields in double;
    You can compose the terms to do all signed and unsigned integers and get
    a gate count, then the number of gates you add to cover all 10 cases of
    floating point is 12% gate count over the simple integer version. Also
    note:: this circuit is about 10% of the gate count of an integer adder.


    I could add them, but, is it worth it?...

    Whether to add them or not is on you.
    I found things like this to be more straw on the camel's back
    {where the camel collapses to a unified register file model.}

    In this case, it is more a question of encoding space than logic cost.

    It is semi-common in FP terms, but likely not common enough to justify dedicated compare-and-branch ops and similar (vs the minor annoyance at
    the integer ops not quite working correctly due to edge cases).

    My model requires about ½ the instruction count when processing FP
    comparisons compared to RISC-V {big, no; around 5% in FP code and 0
    elsewhere}. Where it wins big is compare against a non-zero FP constant.
    My 66000 uses 2 instructions {FCMP, BB} whereas RISC-V uses 4 {AUIPC,
    LD, FCMP, BC}.


    -----------------------

    Seems that generally 0 still isn't quite common enough to justify
    having one register fewer for variables though (or to have a
    designated zero register), but otherwise it seems there is not much to
    justify trying to exclude the "implicit zero" ops from the ISA listing.


    It is common enough,
    But there are lots of ways to get a zero where you want it for a return.


    I think the main use case for a zero register is mostly that it allows
    using it as a special case for pseudo-ops. I guess, not quite the same
    if it is a normal GPR that just so happens to be 0.

    Recently ended up fixing a bug where:
    y=-x;
    Was misbehaving with "unsigned int":
    "NEG" produces a value which falls outside of UInt range;
    But, "NEG; EXTU.L" is a 2-op sequence.
    It had the EXT for SB/UB/SW/UW, but not for UL.
    For SL, bare NEG almost works, apart from ((-1)<<31).
    Could encode it as:
    SUBU.L Zero, Rs, Rn

    ADD Rn,#0,-Rs

    But notice::
    y = -x;
    a = b + y;
    can be performed as if it had been written::
    y = -x;
    a = b + (-x);
    Which is encoded as::

    ADD Ry,#0,-Rx
    ADD Ra,Rb,-Rx

    But, without a zero register,

    #0 is not a register, but its value is 0x0000000000000000 anyway.

    You missed the point entirely, if you can get easy access to #0
    then you no longer need a register to hold this simple bit pattern.
    In fact a large portion of My 66000 ISA over RISC-V comes from this
    mechanism.

    the compiler needs to special-case

    The compiler needs easy access to #0 and the compiler needs to know
    that #0 exists, but the compiler does not need to know if some register contains that same bit pattern.

    provision this (or, in theory, add a "NEGU.L" instruction, but doesn't
    seem common enough to justify this).

    ....

    It is less bad than 32-bit ARM, where I only burnt 2 bits, rather than 4.

    I burned 0 per instruction, but you can claim I burned 1 instruction PRED and 6.4 bits of that instruction are used to create masks that project upon up to
    8 following instructions.

    Also seems like a reasonable tradeoff, as the 2 bits effectively gain:
    Per-instruction predication;
    WEX / Bundle encoding;
    Jumbo prefixes;
    ...

    But, maybe otherwise could have justified slightly bigger immediate
    fields, dunno.


    While poking at it, did go and add a check to exclude large
    struct-copy operations from predication, as it is slower to turn a
    large struct copy into NOPs than to branch over it.

    Did end up leaving struct-copies where sz<=64 as allowed though (where
    a 64 byte copy at least has the merit of achieving full pipeline
    saturation and being roughly break-even with a branch-miss, whereas a
    128 byte copy would cost roughly twice as much as a branch miss).

    I decided to bite the bullet and have LDM, STM and MM so the compiler does
    not have to do any analysis. This puts the onus on the memory unit designer
    to process these at least as fast as a series of LDs and STs. Done right,
    this saves ~40% of the power of the caches, avoiding ~70% of tag accesses
    and 90% of TLB accesses. You access the tag only when/after crossing a
    line boundary and you access the TLB only after crossing a page boundary.

    OK.

    In my case, it was more a case of noting that sliding over, say, 1kB
    worth of memory loads/stores, is slower than branching around it.

    This is why My 66000 predication has use limits. Once you can get where
    you want faster with a branch, then a branch is what you should use.
    I reasoned that my 1-wide machine would fetch 16-bytes (4 words) per
    cycle and that the minimum DECODE time is 2 cycles, that Predication
    wins when the number of instructions <= FWidth × Dcycles = 8.
    Use predication and save cycles by not disrupting the front end.
    Use branching and save cycles by disrupting the front end.


    All I had to do was to make the second predication overwrite the first
    predication's mask, and the compiler did the rest.

    Not so simple in my case, but the hardware is simpler, since it just
    cares about the state of 1 bit (which is explicitly saved/restored along with the rest of the status-register if an interrupt occurs).

    Simpler than 8-flip flops used as a shift right register ??
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Mar 27 21:14:01 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 3/26/2024 5:27 PM, Michael S wrote:


    For slightly less then 20 years ARM managed OK without integer divide.
    Then in 2004 they added integer divide instruction in ARMv7 (including
    ARMv7-M variant intended for small microcontroller cores like
    Cortex-M3) and for the following 20 years instead of merely OK they are
    doing great :-)


    OK.

    The point is they are doing better now after adding IDIV and FDIV.

    I think both modern ARM and AMD Zen went over to "actually fast" integer divide.

    I think for a long time, the de-facto integer divide was ~ 36-40 cycles
    for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I
    can get from a shift-add unit.

    While those numbers are acceptable for shift-subtract division (including
    SRT variants).

    What I don't get is the reluctance to use the FP multiplier as a fast
    divider (IBM 360/91). AMD Opteron used this means to achieve 17-cycle
    FDIV and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ??
    and with special casing of leading 1s and 0s average around 10-cycles ???

    I submit that at 10-cycles for average latency, the need to invent screwy
    forms of even faster division fall by the wayside {accurate or not}.

    NOTE well:: The size of the FMUL (or FMAC) unit does increase, but its
    increase is less than that of an SRT divider unit.

    On my BJX2 core, it is currently similar (36 and 68 cycle for divide).
    This works out faster than a generic shift-subtract divider (or using a runtime call which then sorts out what to do).

    This is because you are using a linear iteration; try using a quadratically
    convergent iteration instead. OH but you CAN'T because your multiplier
    tree does not give accurate lower order bits.

    A special case allows turning small divisors internally into divide-by-reciprocal, which allows for a 3-cycle divide special case.
    But, this is a LUT cost tradeoff.

    It could be possible in theory to support a general 3-cycle integer
    divide, albeit if one can accept inexact results (would be faster than
    the software-based lookup table strategy).


    But, it is debatable. Pure minimalism would likely favor leaving out
    divide (and a bunch of other stuff). Usual rationale being, say, to try
    to fit the entire ISA listing on a single page of paper or similar (vs having a listing with several hundred defined encodings).


    Nevermind if the commonly used ISAs (x86 and 64-bit ARM) have ISA
    listings that are considerably larger (thousands of encodings).

    ....
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Mar 27 21:53:36 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 3/26/2024 5:27 PM, Michael S wrote:


    For slightly less then 20 years ARM managed OK without integer divide.
    Then in 2004 they added integer divide instruction in ARMv7 (including
    ARMv7-M variant intended for small microcontroller cores like
    Cortex-M3) and for the following 20 years instead of merely OK they are
    doing great :-)


    OK.

    The point is they are doing better now after adding IDIV and FDIV.

    I think both modern ARM and AMD Zen went over to "actually fast" integer
    divide.

    I think for a long time, the de-facto integer divide was ~ 36-40 cycles
    for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I
    can get from a shift-add unit.

    While those numbers are acceptable for shift-subtract division (including
    SRT variants).

    What I don't get is the reluctance for using the FP multiplier as a fast >divisor (IBM 360/91). AMD Opteron used this means to achieve 17-cycle
    FDIS and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ??
    and with special casing of leading 1s and 0s average around 10-cycles ???

    Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2] cycles
    (where s is the number of significant digits in the quotient).

    https://www.quinapalus.com/cm7cycles.html


    I submit that at 10-cycles for average latency, the need to invent screwy >forms of even faster division fall by the wayside {accurate or not}.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Mar 28 01:06:05 2024
    From Newsgroup: comp.arch

    On Wed, 27 Mar 2024 21:14:01 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    What I don't get is the reluctance for using the FP multiplier as a
    fast divisor (IBM 360/91). AMD Opteron used this means to achieve
    17-cycle FDIS and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ?? and with special casing of leading 1s and 0s average
    around 10-cycles ???

    I submit that at 10-cycles for average latency, the need to invent
    screwy forms of even faster division fall by the wayside {accurate or
    not}.


    All modern performance-oriented cores from Intel, AMD, ARM and Apple
    have fast integer dividers and typically even faster FP dividers.
    The last "big" cores with relatively slow 64-bit IDIV were Intel Skylake (launched in 2015) and AMD Zen2 (launched in 2019), but the later is
    slow only in the worst case, the best case is o.k.
    I'd guess, when Skylake was designed nobody at Intel could imagine that
    it and its variations would be manufactured in huge volumes up to 2021.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Mar 27 23:11:34 2024
    From Newsgroup: comp.arch

    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 3/26/2024 5:27 PM, Michael S wrote:


    For slightly less then 20 years ARM managed OK without integer divide. >>>> Then in 2004 they added integer divide instruction in ARMv7 (including >>>> ARMv7-M variant intended for small microcontroller cores like
    Cortex-M3) and for the following 20 years instead of merely OK they are >>>> doing great :-)


    OK.

    The point is they are doing better now after adding IDIV and FDIV.

    I think both modern ARM and AMD Zen went over to "actually fast" integer >>> divide.

    I think for a long time, the de-facto integer divide was ~ 36-40 cycles >>> for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I >>> can get from a shift-add unit.

    While those numbers are acceptable for shift-subtract division (including >>SRT variants).

    What I don't get is the reluctance for using the FP multiplier as a fast >>divisor (IBM 360/91). AMD Opteron used this means to achieve 17-cycle
    FDIS and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ?? >>and with special casing of leading 1s and 0s average around 10-cycles ???

    Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2] cycles
    (where s is the number of significant digits in the quotient).

    I submit that a 5+2×ln8(s) is faster still.
    32-bits = 15 cycles <not so much faster>
    64-bits = 17 cycles <A lot faster>

    {Log base 8, where one uses Newton-Raphson or Goldschmidt to get 8 significant digits (9.2 bits are correct) and double the significant bits each iteration (2-cycles). }

    5 comes from looking at numerator and denominator to find the first bit of significance, and then shifting numerator and denominator so that the FDIV algorithm can work.
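
    A small software model of the quadratically convergent iteration being
    described (this only models the math, not any particular divider
    datapath; the 8-bit seed stands in for a small hardware table):

    #include <math.h>
    #include <stdio.h>

    /* Seed reciprocal good to ~8 significant bits (a table in hardware). */
    static double recip_seed(double d)
    {
        int e;
        double m = frexp(d, &e);              /* d = m * 2^e, m in [0.5,1) */
        double r = 256.0 / round(m * 256.0);  /* ~8-bit-accurate 1/m       */
        return ldexp(r, -e);
    }

    int main(void)
    {
        double d = 12345.678;
        double x = recip_seed(d);
        for (int i = 0; i < 3; i++) {
            x = x * (2.0 - d * x);            /* Newton-Raphson step       */
            printf("step %d: rel. error = %.3e\n", i + 1, fabs(x * d - 1.0));
        }
        return 0;
    }

    Each step costs two multiplies and roughly doubles the number of correct
    bits.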

    https://www.quinapalus.com/cm7cycles.html


    I submit that at 10-cycles for average latency, the need to invent screwy >>forms of even faster division fall by the wayside {accurate or not}.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Mar 28 09:31:11 2024
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 3/26/2024 5:27 PM, Michael S wrote:


    For slightly less then 20 years ARM managed OK without integer divide. >>>> Then in 2004 they added integer divide instruction in ARMv7 (including >>>> ARMv7-M variant intended for small microcontroller cores like
    Cortex-M3) and for the following 20 years instead of merely OK they are >>>> doing great :-)


    OK.

    The point is they are doing better now after adding IDIV and FDIV.

    I think both modern ARM and AMD Zen went over to "actually fast" integer >>> divide.

    I think for a long time, the de-facto integer divide was ~ 36-40 cycles
    for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I >>> can get from a shift-add unit.

    While those numbers are acceptable for shift-subtract division (including
    SRT variants).

    What I don't get is the reluctance for using the FP multiplier as a fast
    divisor (IBM 360/91). AMD Opteron used this means to achieve 17-cycle
    FDIS and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ??
    and with special casing of leading 1s and 0s average around 10-cycles ???

    Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2] cycles
    (where s is the number of significant digits in the quotient).

    https://www.quinapalus.com/cm7cycles.html

    That looks a lot like an SRT divisor with early out?

    Having variable timing DIV means that for any crypto operation (including
    hashes?) where you use modulo operations, said modulus _must_ be a known
    constant, otherwise information about it will leak from the timings, right?


    I submit that at 10-cycles for average latency, the need to invent screwy
    forms of even faster division fall by the wayside {accurate or not}.

    I agree.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Mar 28 14:03:18 2024
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 3/26/2024 5:27 PM, Michael S wrote:


    For slightly less then 20 years ARM managed OK without integer divide. >>>>> Then in 2004 they added integer divide instruction in ARMv7 (including >>>>> ARMv7-M variant intended for small microcontroller cores like
    Cortex-M3) and for the following 20 years instead of merely OK they are >>>>> doing great :-)


    OK.

    The point is they are doing better now after adding IDIV and FDIV.

    I think both modern ARM and AMD Zen went over to "actually fast" integer >>>> divide.

    I think for a long time, the de-facto integer divide was ~ 36-40 cycles >>>> for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I >>>> can get from a shift-add unit.

    While those numbers are acceptable for shift-subtract division (including >>> SRT variants).

    What I don't get is the reluctance for using the FP multiplier as a fast >>> divisor (IBM 360/91). AMD Opteron used this means to achieve 17-cycle
    FDIS and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ?? >>> and with special casing of leading 1s and 0s average around 10-cycles ??? >>
    Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2] cycles
    (where s is the number of significant digits in the quotient).

    https://www.quinapalus.com/cm7cycles.html

    That looks a lot like an SRT divisor with early out?

    Having variable timing DIV means that any crypto operating (including >hashes?) where you use modulo operations, said modulus _must_ be a known >constant, otherwise information about will leak from the timings, right?

    Perhaps, yet I suspect that the m7 isn't generally used for crypto,
    nor used in an SMP implementation where an observer can monitor
    a shared cache.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Mar 28 22:38:41 2024
    From Newsgroup: comp.arch

    On Thu, 28 Mar 2024 09:31:11 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 3/26/2024 5:27 PM, Michael S wrote:


    For slightly less then 20 years ARM managed OK without integer
    divide. Then in 2004 they added integer divide instruction in
    ARMv7 (including ARMv7-M variant intended for small
    microcontroller cores like Cortex-M3) and for the following 20
    years instead of merely OK they are doing great :-)


    OK.

    The point is they are doing better now after adding IDIV and FDIV.

    I think both modern ARM and AMD Zen went over to "actually fast"
    integer divide.

    I think for a long time, the de-facto integer divide was ~ 36-40
    cycles for 32-bit, and 68-72 cycles for 64-bit. This is also
    on-par with what I can get from a shift-add unit.

    While those numbers are acceptable for shift-subtract division
    (including SRT variants).

    What I don't get is the reluctance for using the FP multiplier as
    a fast divisor (IBM 360/91). AMD Opteron used this means to
    achieve 17-cycle FDIS and 22-cycle SQRT in 1998. Why should IDIV
    not be under 20-cycles ?? and with special casing of leading 1s
    and 0s average around 10-cycles ???

    Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2]
    cycles (where s is the number of significant digits in the
    quotient).

    https://www.quinapalus.com/cm7cycles.html

    That looks a lot like an SRT divisor with early out?

    Having variable timing DIV means that any crypto operating (including hashes?) where you use modulo operations, said modulus _must_ be a
    known constant, otherwise information about will leak from the
    timings, right?

    Are you aware of any professional crypto algorithm, including hashes,
    that uses modulo operations by modulo that is neither power-of-two nor
    at least 192-bit wide?



    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Mar 29 13:38:55 2024
    From Newsgroup: comp.arch

    Michael S wrote:
    On Thu, 28 Mar 2024 09:31:11 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 3/26/2024 5:27 PM, Michael S wrote:


    For slightly less then 20 years ARM managed OK without integer
    divide. Then in 2004 they added integer divide instruction in
    ARMv7 (including ARMv7-M variant intended for small
    microcontroller cores like Cortex-M3) and for the following 20
    years instead of merely OK they are doing great :-)


    OK.

    The point is they are doing better now after adding IDIV and FDIV.

    I think both modern ARM and AMD Zen went over to "actually fast"
    integer divide.

    I think for a long time, the de-facto integer divide was ~ 36-40
    cycles for 32-bit, and 68-72 cycles for 64-bit. This is also
    on-par with what I can get from a shift-add unit.

    While those numbers are acceptable for shift-subtract division
    (including SRT variants).

    What I don't get is the reluctance for using the FP multiplier as
    a fast divisor (IBM 360/91). AMD Opteron used this means to
    achieve 17-cycle FDIS and 22-cycle SQRT in 1998. Why should IDIV
    not be under 20-cycles ?? and with special casing of leading 1s
    and 0s average around 10-cycles ???

    Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2]
    cycles (where s is the number of significant digits in the
    quotient).

    https://www.quinapalus.com/cm7cycles.html

    That looks a lot like an SRT divisor with early out?

    Having variable timing DIV means that any crypto operating (including
    hashes?) where you use modulo operations, said modulus _must_ be a
    known constant, otherwise information about will leak from the
    timings, right?

    Are you aware of any professional crypto algorithm, including hashes,
    that uses modulo operations by modulo that is neither power-of-two nor
    at least 192-bit wide?

    I was involved with the optimization of DFC, the AES candidate from CERN:

    It uses a fixed prime just above 2^64 as the modulus (2^64+13 afair),
    and that resulted in a very simple reciprocal, i.e. no need for a DIV
    opcode.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Mar 29 16:38:58 2024
    From Newsgroup: comp.arch

    On Fri, 29 Mar 2024 13:38:55 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Thu, 28 Mar 2024 09:31:11 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:


    Are you aware of any professional crypto algorithm, including
    hashes, that uses modulo operations by modulo that is neither
    power-of-two nor at least 192-bit wide?

    I was involved with the optimization of DFC, the AES candidate from
    CERN:

    It uses a fixed prime just above 2^64 as the modulus (2^64+13 afair),
    and that resulted in a very simple reciprocal, i.e. no need for a DIV opcode.

    Terje


    Since DFC lost, I suppose that even ignoring reciprocal optimization
    the answer to my question is 'No'.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Mon Apr 1 04:09:35 2024
    From Newsgroup: comp.arch

    ( Was distracted with other stuff, mostly trying to find and fix bugs. )


    On 3/27/2024 4:03 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 3/26/2024 7:02 PM, MitchAlsup1 wrote:

    Sometimes I see a::

         CVTSD     R2,#5

    Where a 5-bit immediate (value = 5) is converted into 5.0D0 and
    placed in register R2 so it can be accessed as an argument in the
    subroutine call that happens a few instructions later.


    I had looked into, say:
       FADD Rm, Imm5fp, Rn
    Where, despite Imm5fp being severely limited, it had an OK hit rate.

    Unpacking imm5fp to Binary16 being, essentially:
       aee.fff -> 0.aAAee.fff0000000


    Self correction: aee.ff


    realistically ±{0, 1, 2, 3, 4, 5, .., 31} only misses a few of the often used fp constants--but does include 0, 1, 2, and 10. Also, realistically
    the missing cases are the 0.5s.


    This scheme had (rounded):
    * 000xx: 3000, 3100, 3200, 3300 ( 0.125, 0.156, 0.188, 0.219)
    * 001xx: 3400, 3500, 3600, 3700 ( 0.250, 0.313, 0.375, 0.438)
    * 010xx: 3800, 3900, 3A00, 3B00 ( 0.500, 0.625, 0.750, 0.875)
    * 011xx: 3C00, 3D00, 3E00, 3F00 ( 1.000, 1.250, 1.500, 1.750)
    * 100xx: 4000, 4100, 4200, 4300 ( 2.000, 2.500, 3.000, 3.500)
    * 101xx: 4400, 4500, 4600, 4700 ( 4.000, 5.000, 6.000, 7.000)
    * 110xx: 4800, 4900, 4A00, 4B00 ( 8.000, 10.000, 12.000, 14.000)
    * 111xx: 4C00, 4D00, 4E00, 4F00 (16.000, 20.000, 24.000, 28.000)

    Of the options evaluated, this seemed to get the best general-case hit rate.
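
    Read as a C sketch (field placement inferred from the listed values,
    with 'A' being the complement of a):

    #include <stdint.h>

    /* Unpack a 5-bit FP immediate "a ee ff" into a Binary16 pattern:
       sign = 0, exponent = { a, ~a, ~a, e1, e0 }, fraction = { f1, f0, 0.. }.
       E.g. 0b00000 -> 0x3000 (0.125), 0b01100 -> 0x3C00 (1.0),
       0b11111 -> 0x4F00 (28.0), matching the table above. */
    static uint16_t imm5fp_to_binary16(unsigned imm5)
    {
        unsigned a  = (imm5 >> 4) & 1;
        unsigned ee = (imm5 >> 2) & 3;
        unsigned ff =  imm5       & 3;
        unsigned na = a ^ 1;
        unsigned exp = (a << 4) | (na << 3) | (na << 2) | ee;
        return (uint16_t)((exp << 10) | (ff << 8));
    }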


    OTOH, can note that a majority of typical floating point constants can
    be represented exactly in Binary16 (well, excluding "0.1" or similar),
    so it works OK as an immediate format.

    This allows a single 32-bit op to be used for constant loads
    (nevermind if one needs a 96 bit encoding for 0.1, or PI, or ...).


    Mostly, a floating point immediate is available from a 32-bit constant
    container. When accessed in a float calculation it is used as IEEE32;
    when accessed by a double calculation, IEEE32->IEEE64 promotion is
    performed in the constant delivery path. So, one can use almost any
    floating point constant that is representable in float as a double
    without eating cycles and while saving code footprint.


    Don't currently have the encoding space for this.

    Could in theory pull off truncated Binary32 as an Imm29s form, but not
    likely worth it. Would also require putting a converter in the ID2
    stage, so not free.

    In this case, the issue is more one of LUT cost to support these cases.


    For FPU compare with zero, can almost leverage the integer compare
    ops, apart from the annoying edge cases of -0.0 and NaN leading to
    "not strictly equivalent" behavior (though, an ASM programmer could
    more easily get away with this). But, not common enough to justify
    adding FPU specific ops for this.

    Actually, the edge/noise cases are not that many gates.
    a) once you are separating out NaNs, infinities are free !!
    b) once you are checking denorms for zero, infinities become free !!

    Having structured a Compare-to-zero circuit based on the fields in double;
    You can compose the terms to do all signed and unsigned integers and get
    a gate count, then the number of gates you add to cover all 10 cases
    of floating point is 12% gate count over the simple integer version. Also
    note:: this circuit is about 10% of the gate count of an integer adder.


    I could add them, but, is it worth it?...

    Whether to add them or not is on you.
    I found things like this to be more straw on the camel's back {where the camel collapses to a unified register file model.}


    I already have a unified register file in my ISA (only the RISC-V mode
    treats it as two sub-spaces).


    But, in this case:
    BEQ/BNE/BGT/... almost work with floating-point values, apart from the annoying edge cases like -0 and similar.

    If substituted for floating-point comparison, most things work as
    expected, except that Quake ends up with some minor physics glitches
    (*1) and similar (mostly due to the occasional appearance of -0, and it
    not comparing equal to +0 in this case).

    *1: Most commonly in the form that sliding a bounding box along the
    ground will occasionally encounter seams where it gets stuck and can't
    slide across (so the player would need to jump over these seams).
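
    The mismatch in a few lines of C (the exact case behind those glitches):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    /* -0.0 and +0.0 compare equal as floats, but their bit patterns
       differ, so an integer compare substituted for an FP compare
       sees them as unequal. */
    int main(void)
    {
        float pos = 0.0f, neg = -0.0f;
        uint32_t pb, nb;
        memcpy(&pb, &pos, sizeof pb);
        memcpy(&nb, &neg, sizeof nb);
        printf("FP compare:  %s\n", (pos == neg) ? "equal" : "unequal");
        printf("bit compare: %s\n", (pb == nb)  ? "equal" : "unequal");
        return 0;
    }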


    In this case, it is more a question of encoding space than logic cost.

    It is semi-common in FP terms, but likely not common enough to justify
    dedicated compare-and-branch ops and similar (vs the minor annoyance
    at the integer ops not quite working correctly due to edge cases).

    My model requires about ½ the instruction count when processing FP
    comparisons compared to RISC-V {big, no; around 5% in FP code and 0
    elsewhere}. Where it wins big is compare against a non-zero FP constant.
    My 66000 uses 2 instructions {FCMP, BB} whereas RISC-V uses 4 {AUIPC,
    LD, FCMP, BC}.


    At present, options are:
    FLDCH + FCMPxx (2R) + ( BT/BF | MOVT/MOVNT )
    FCMPxx (2RI) + ( BT/BF | MOVT/MOVNT )

    The latter currently requires the FpImm extension.

    The integer ops currently have some additional denser cases:
    Compare and Branch
    Compare-Zero and Branch
    Compare and Set (3R and small 3RI).

    However, adding these cases for FP would eat encoding space, and floating-point compare is less common than integer compare.


    Granted, it does look like FCMPGT is around 0.7% of the clock-cycle
    budget in GLQuake, so isn't too far down on the list. Would need to
    evaluate whether it is more commonly being used for branching or as a
    predicate or similar, and the relative amount of compare-with-zero vs
    comparing two values.

    Can note that FCMPEQ is much further down the list, at 0.03%.



    -----------------------

    Seems that generally 0 still isn't quite common enough to justify
    having one register fewer for variables though (or to have a
    designated zero register), but otherwise it seems there is not much
    to justify trying to exclude the "implicit zero" ops from the ISA
    listing.


    It is common enough,
    But there are lots of ways to get a zero where you want it for a return.

    I think the main use case for a zero register is mostly that it allows
    using it as a special case for pseudo-ops. I guess, not quite the same
    if it is a normal GPR that just so happens to be 0.

    Recently ended up fixing a bug where:
       y=-x;
    Was misbehaving with "unsigned int":
       "NEG" produces a value which falls outside of UInt range;
       But, "NEG; EXTU.L" is a 2-op sequence.
       It had the EXT for SB/UB/SW/UW, but not for UL.
         For SL, bare NEG almost works, apart from ((-1)<<31).
    Could encode it as:
       SUBU.L  Zero, Rs, Rn

        ADD  Rn,#0,-Rs

    But notice::
        y = -x;
        a = b + y;
    can be performed as if it had been written::
        y = -x;
        a = b + (-x);
    Which is encoded as::

        ADD  Ry,#0,-Rx
        ADD  Ra,Rb,-Rx

    But, without a zero register,

    #0 is not a register, but its value is 0x0000000000000000 anyway.

    You missed the point entirely, if you can get easy access to #0
    then you no longer need a register to hold this simple bit pattern.
    In fact a large portion of My 66000 ISA over RISC-V comes from this mechanism.


    OK.

    I did end up switching "-x" for Int and UInt over to ADDS.L and ADDU.L
    with the zero in a register. Noted that on-average, this is still faster
    than "NEG + EXTx.L" even in the absence of a zero register.


    But, then ended up fighting with a bunch of compiler bugs, as some
    recent behavioral tweaks in the compiler had started stepping on a bunch
    of buggy cases:
    Some instructions which were invalid as prefixes were being set as
    prefixes in the WEXifier;
    The edge case for expanding Op32 to Op64 was mangling large immediate
    values (mostly related to cases dealing with encoding R32..R63 in
    Baseline mode);
    The handling for "tiny leaf functions" was incorrectly allowing
    functions to be compiled as tiny-leaf functions in cases where they had
    a by-value structure type on the local frame (in this case, the local
    struct address would end up as "whatever value was already in the
    register", potentially stomping stuff elsewhere in memory, *1).

    *1: TBD if this is related to some of the other
    memory-stomping/corruption bugs I had been seeing (which were sensitive
    to register allocator behavior). It is possible this was it, or they may
    yet still be other bugs that result in memory-stomping.


    This sort of stuff has eaten up a lot of my time recently.


                                  the compiler needs to special-case

    The compiler needs easy access to #0 and the compiler needs to know
    that #0 exists, but the compiler does not need to know if some register contains that same bit pattern.


    Conventionally, this was done with "pseudo-instructions", but for a
    pseudo instruction, one needs to know that zero exists in some location.

    Otherwise, one can leave it up to the register allocator ("hey, give me
    a register that holds zero").


    Though, looking at the compiler output, it still doesn't yet seem to
    realize that immediate 0 is 0 regardless of type. So, it is treating 0,
    0U, 0LL, NULL, ... effectively as different values (and allocating them
    in different registers). May need to address this...


    provision this (or, in theory, add a "NEGU.L" instruction, but doesn't
    seem common enough to justify this).

    ....

    It is less bad than 32-bit ARM, where I only burnt 2 bits, rather than 4.

    I burned 0 per instruction, but you can claim I burned 1 instruction PRED
    and 6.4 bits of that instruction are used to create masks that project
    upon up to 8 following instructions.


    Yeah.

    Say:
    CMPxx OP1?T OP2?T OP3?F OP4?F
    Is one less instruction word than, say:
    CMPxx PRED OP1 OP2 OP3 OP4

    And, as noted, most predication blocks tend to be short.


    Though, one could argue, say (very rough stats):
    Scalar: ~ 55%
    WEX : ~ 17% (has gone up slightly)
    Jumbo : ~ 15% (this is just the jumbo prefixes)
    Pred : ~ 13% (6.5% for ?T and ?F)
    PrWEX : ~ 0.4%

    Would, in terms of entropy, favor spending 0 or 1 bits per instruction
    on this.


    But, spending 2 bits here allows the encoding space to be more
    orthogonal (at least in XG2). So, nearly any unconditional instruction
    can also be encoded in WEX'ed or Predicated forms (there are a few large unconditional-only instructions; their alternate forms used to encode
    the Jumbo prefixes and PrWEX blocks).

    Like, a lopsided encoding space would be less ideal even if "more
    optimal" in a Huffman sense...

    Though, FWIW, 25% isn't *that* far off from the ~17% of ops flagged as
    WEX in this example; granted, one could argue that superscalar could
    deal with this.


    Granted, potentially a clever CPU could also avoid needing to burn an
    extra clock-cycle on a PRED op or similar, or 2 cycles to perform an unconditional branch, ... But, a naive CPU will likely not be able to
    avoid these sorts of costs.

    Granted, some past architectures were like, "well, an instruction would
    be there already..." and treat the cycle following the branch as a branch-delay-slot. Granted, had resisted that urge, and had previously
    noted that if the delay-slot would contain an instruction that triggers
    a pipeline interlock stall, this would foul up the pipeline and cause
    the branch-initiation process to fail (and the "fix" here would have
    been a lot more painful if the ISA actually had branch-delay slots;
    rather than just the would-be delay slot instruction having failed to be "sufficiently ignored").



    In Baseline mode, the encoding-space breakdown differs:
    75%: 16-bit ops
    12.5%: 32-bit ops (7x/9x), R32..R63,
    scalar ops only, covering a subset of the ISA.
    6.25%: Pred (Ex)
    6.25%: Scalar and WEX (R0..R31 only, Fx).

    Or, if no XGPR:
    87.5%: 16-bit ops.
    6.25%: Pred (Ex)
    6.25%: Scalar and WEX (R0..R31 only, Fx).

    Where the encoding of the Ex and Fx blocks is common between Baseline
    and XG2 (allowing for a common subset to exist).


    Comparably, XG2 Mode, despite only 1/4 of the encoding space being used
    for Scalar ops, generally still ends up expanding some of the immediate
    fields and similar vs Baseline mode (and mostly avoiding the need to use
    Op64 encodings due to the more orthogonal encodings).



    Also seems like a reasonable tradeoff, as the 2 bits effectively gain:
       Per-instruction predication;
       WEX / Bundle encoding;
       Jumbo prefixes;
       ...

    But, maybe otherwise could have justified slightly bigger immediate
    fields, dunno.


    While poking at it, did go and add a check to exclude large
    struct-copy operations from predication, as it is slower to turn a
    large struct copy into NOPs than to branch over it.

    Did end up leaving struct-copies where sz<=64 as allowed though
    (where a 64 byte copy at least has the merit of achieving full
    pipeline saturation and being roughly break-even with a branch-miss,
    whereas a 128 byte copy would cost roughly twice as much as a branch
    miss).

    I decided to bite the bullet and have LDM, STM and MM so the compiler does
    not have to do any analysis. This puts the onus on the memory unit designer
    to process these at least as fast as a series of LDs and STs. Done right,
    this saves ~40% of the power of the caches, avoiding ~70% of tag accesses
    and 90% of TLB accesses. You access the tag only when/after crossing
    a line boundary and you access the TLB only after crossing a page boundary.

    OK.

    In my case, it was more a case of noting that sliding over, say, 1kB
    worth of memory loads/stores, is slower than branching around it.

    This is why My 66000 predication has use limits. Once you can get where
    you want faster with a branch, then a branch is what you should use.
    I reasoned that my 1-wide machine would fetch 16-bytes (4 words) per
    cycle and that the minimum DECODE time is 2 cycles, that Predication
    wins when the number of instructions <= FWidth × Dcycles = 8.
    Use predication and save cycles by not disrupting the front end.
    Use branching   and save cycles by     disrupting the front end.



    In this case, where and how much to predicate is left up to the compiler frontend, and it doesn't need to know in advance exactly how many
    instructions will be emitted by the backend (but, does need to avoid
    anything which might potentially result in an implicit runtime call or similar).


    In this case, the compiler decides this at the AST level, but there is
    an uncertainty how many CPU instructions would be covered by the AST
    (and the front-end hand-waves it mostly by counting AST nodes and using
    each AST node as a stand-in for an instruction; rejecting cases which
    have too many AST nodes).
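
    A hypothetical sketch of that heuristic (all names invented; BGBCC's
    actual code surely differs): walk the then/else subtrees, count nodes as
    a stand-in for instructions, and bail out on anything disallowed, such
    as a call:

    struct ast_node {
        int kind;                        /* AST_CALL, AST_ADD, ...        */
        int nkids;
        struct ast_node **kid;
    };

    enum { AST_CALL = 1 /* , ... */ };
    #define PRED_MAX_NODES 8             /* illustrative threshold        */

    static int count_nodes(const struct ast_node *n)
    {
        if (!n) return 0;
        if (n->kind == AST_CALL)
            return 1000;                 /* poison value: never predicate */
        int total = 1;
        for (int i = 0; i < n->nkids; i++)
            total += count_nodes(n->kid[i]);
        return total;
    }

    static int can_predicate(const struct ast_node *then_blk,
                             const struct ast_node *else_blk)
    {
        return count_nodes(then_blk) + count_nodes(else_blk) <= PRED_MAX_NODES;
    }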


    But, had needed to fiddle with stuff a bit to fix up some edge cases
    that were popping up here:
    A struct-copy may emit a much longer sequence than intended;
    Some seemingly benign cases, like "ptr-ptr" were resulting in "evil"
    cases with some types (like "Foo*" where "sizeof(Foo)" is not a power of
    2, resulting in an implicit integer division, so then one needs to
    detect these cases and reject them);
    ...

    But, yeah, cases where the predication results in 100+ instructions, or
    tries to do something invalid (such as trying to call a runtime support function), are undesirable.

    But, the logic here was a little lax in some of its validity checks.

    Where operators are potentially very different depending on which types
    they are applied to, and in the AST we don't implicitly know the type
    (the compiler can try to figure it out, but it is not guaranteed).

    Though, the predication handling can deal with the "What is the type of
    this expression? Hell if I know." cases by not predicating the blocks.
    FWIW, earlier on, BGBCC did not know any types at the AST level, and
    left type handling up to the RIL->3AC translation stage to sort this out
    (or, in difficult cases, the type might decay to "variant" and it could
    be sorted out at runtime using tagrefs; but I have since moved to full
    static typing, nevermind potential type system holes and similar in the frontend stages).

    But, there is a performance cost, as any time one asks the type of an expression, it needs to walk this part of the AST and potentially lookup variables and similar, which isn't free.

    But, some of this, while it works OK for C, would not fly so well with
    C++ (which would not be quite so tolerant of "Swiss cheese" typesystem semantics in the frontend stages). Did work better for some of my own
    language designs though (mostly built around hybrid static/dynamic
    systems; some things are a little easier when one can justify dealing
    with them by falling back to type-tagging and run-time type-checks, ...).



    Something like your style of predication mechanism would likely need to
    be applied much later in the process (such as when applying branch
    fixups), rather than in the front-end and encoded via the IR stages.


    Granted, with my scheme, trying to turn a branch into predication would
    leave a "hole" in the place where the branch op was (where a NOP or
    similar may need to be left), in which case, something like a "PRED" instruction wouldn't be a huge loss.

    Though, I guess, logic for "hey, turn this short forward branch into predication" could be a possibility as well (and might hit some cases
    missed by the AST level predication logic, or which appeared spontaneously).

    Would require logic to detect short forward branches and confirm that
    the instruction sequence following the branch does not contain any
    labels or similar though.




    All I had to do was to make the second predication overwrite the first
    predication's mask, and the compiler did the rest.

    Not so simple in my case, but the hardware is simpler, since it just
    cares about the state of 1 bit (which is explicitly saved/restored
    along with the rest of the status-register if an interrupt occurs).

    Simpler than 8-flip flops used as a shift right register ??


    It would need to be captured and preserved correctly during interrupts
    (and remain correctly aligned with the instruction stream when returning
    from an interrupt, etc).

    This would make it fundamentally different, say, than the shift-register
    type logic used to deal with a pipeline flush during a branch (where one really does not care what happens if an interrupt happens here; only
    that the flushed stages do not execute).


    Granted, in my case, one does need logic to detect, say, if ID2 contains
    a predicated instruction, and EX1 contains an instruction that will
    update the SR.T flag.

    Logic for predication handling is otherwise something like:
    casez( { opBraFlush, opUCmd[7:6], regInSr[0] } )
    4'b000z: tOpEnable = 1; //always
    4'b001z: tOpEnable = 0; //never (internal use by CPU)
    4'b0100: tOpEnable = 0; //?T and F
    4'b0101: tOpEnable = 1; //?T and T
    4'b0110: tOpEnable = 1; //?F and F
    4'b0111: tOpEnable = 0; //?F and T
    4'b1zzz: tOpEnable = 0; //flushed stage
    endcase

    Where, if tOpEnable is 0, the opcode (in opUCmd[5:0]) is replaced with
    NOP, unless it is a branch and !opBraFlush, in which case it is replaced
    with a "NO_BRANCH" (which, if the branch-predictor predicted a branch,
    initiates a branch to the following instruction; unlike "BRA", which
    initiates a branch to the target if the branch-predictor did not predict
    a branch having been taken).

    Where, generally, the branch-flush mask is updated on the transition to
    the following cycle (and the flush flag corresponding to the IF stage
    may also be set by the branch-predictor if it predicts a branch having
    been taken).

    ...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Tue Apr 2 16:12:49 2024
    From Newsgroup: comp.arch

    On 3/27/2024 6:06 PM, Michael S wrote:
    On Wed, 27 Mar 2024 21:14:01 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    What I don't get is the reluctance for using the FP multiplier as a
    fast divisor (IBM 360/91). AMD Opteron used this means to achieve
    17-cycle FDIS and 22-cycle SQRT in 1998. Why should IDIV not be under
    20-cycles ?? and with special casing of leading 1s and 0s average
    around 10-cycles ???

    I submit that at 10-cycles for average latency, the need to invent
    screwy forms of even faster division fall by the wayside {accurate or
    not}.


    All modern performance-oriented cores from Intel, AMD, ARM and Apple
    have fast integer dividers and typically even faster FP dividers.
    The last "big" cores with relatively slow 64-bit IDIV were Intel Skylake (launched in 2015) and AMD Zen2 (launched in 2019), but the later is
    slow only in the worst case, the best case is o.k.
    I'd guess, when Skylake was designed nobody at Intel could imagine that
    it and its variations would be manufactured in huge volumes up to 2021.


    FWIW: Newest CPU that I have is Zen+ ...
    A lot of the other stuff around (excluding a few RasPi's I have) is
    stuff from the mid-to-late 2000's.

    But, yeah, I guess fast dividers are possible, question I guess is more whether they can be done cheaply enough for cores where one would
    actually care about things like the costs of instruction decoding?...


    Otherwise:

    But, yeah, at the moment, apart from debugging, had gotten around to
    starting to work on some more of the GUI related stuff:
    Added support for drawing window bitmaps using 256 and 16 color graphics;
    Started working on a "toolkit" for GUI widget drawing;
    Also started adding stuff for scalable / SDF font rendering.

    For UI widgets:
    Idea is that the widgets will be primarily drawn into a bitmap buffer
    that will be mapped to the window (lower level window management will
    not know or care about UI widgets);
    Generally, I am breaking the UI into "forms" which will be applied to
    the windows, with the information about the "form" and its contained
    widgets being handled separately from any state associated with those
    widgets (unlike GTK, which tends to keep the widget definition and state
    in the same object).

    Likewise, also generally approaching things in a way that "makes sense"
    for C (IOW: not the struct-casting mess that GTK had used). Also in the current design, the idea is that UI elements will be sized and
    positioned explicitly using window pixel coordinates (scaling UI based
    on window resizing may possibly be treated as more of an application
    concern, but could maybe have "layout helper" logic for this, TBD).
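
    A minimal sketch of the intended split, with all names invented here
    (not the actual toolkit API): the form describes the widgets and their
    fixed pixel-space placement, while mutable per-widget state lives in a
    separate array owned by the program.

    /* Hypothetical example; field and type names are placeholders. */
    typedef struct {
        int kind;                /* button, label, checkbox, ... */
        int x, y, w, h;          /* explicit window pixel coordinates */
        const char *text;        /* static label/caption */
        int state_ix;            /* index into the program's state array */
    } ui_widget_def;

    typedef struct {
        const ui_widget_def *widgets;  /* read-only, possibly shared */
        int n_widgets;
    } ui_form_def;

    typedef struct {
        int checked;             /* example runtime state for a checkbox */
        int pressed;             /* example runtime state for a button */
    } ui_widget_state;           /* array owned/allocated by the program */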


    I have yet to decide on the default bit-depth for UI widgets, mostly
    torn between 16-color and 256-color. Probably don't need high bit-depths
    for "text and widgets" windows, but may need more than 16-color. Also
    TBD which color palette to use in the case where 256 color is used (the
    palette used for the lower-level parts of the GUI works OK, but requires
    a lookup table to convert RGB555 to indexed color, don't want to
    duplicate this table for every program; and don't necessarily need to
    force programs to all use the same palette, even though this would be
    better in a color-fidelity sense). Though, for UI widgets it almost
    makes the most sense to instead use a palette based around RGB222 (64 color)
    or similar.

    Say:
    00..0F: RGBI colors
    10..1F: Grayscale
    20..3F: Dunno
    40..7F: RGB222 colors
    80..FE: Dunno
    FF: Transparent
    Well, or maybe reuse a variant of the 216 color "web safe" palette:
    00..0F: RGBI colors
    10..1F: Grayscale
    20..FD: 216 colors
    FE: Dunno
    FF: Transparent

    Well, unless RGBI is sufficient (where, in this mode, the high-intensity-magenta color was interpreted as transparent).
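
    FWIW, the second layout above is simple enough to generate procedurally;
    a rough sketch (the exact RGBI levels, the bit order, and what to do
    with FE and the leftover F8..FD slots are just one plausible choice
    here):

    #include <stdint.h>

    static uint32_t ui_pal[256];   /* 0x00RRGGBB */

    void build_ui_palette(void)
    {
        for (int i = 0; i < 16; i++) {              /* 00..0F: RGBI */
            uint8_t r = (i & 4) ? 0xAA : 0x00;
            uint8_t g = (i & 2) ? 0xAA : 0x00;
            uint8_t b = (i & 1) ? 0xAA : 0x00;
            if (i & 8) { r += 0x55; g += 0x55; b += 0x55; }  /* intensity bit */
            ui_pal[i] = (r << 16) | (g << 8) | b;
        }
        for (int i = 0; i < 16; i++) {              /* 10..1F: grayscale ramp */
            uint8_t v = (uint8_t)(i * 255 / 15);
            ui_pal[0x10 + i] = (v << 16) | (v << 8) | v;
        }
        for (int i = 0; i < 216; i++) {             /* 20..F7: 6x6x6 web-safe cube */
            uint8_t r = (uint8_t)((i / 36) * 51);
            uint8_t g = (uint8_t)(((i / 6) % 6) * 51);
            uint8_t b = (uint8_t)((i % 6) * 51);
            ui_pal[0x20 + i] = (r << 16) | (g << 8) | b;
        }
        /* F8..FE: left unassigned here; FF: transparent by convention. */
        ui_pal[0xFF] = 0;
    }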


    May also make sense to either overlay UI elements with different
    bit-depths, and/or treat the UI widgets as an overlay (though, this
    could have its own issues; say, if the background image is redrawn and
    the UI overlaps part of it, the UI will also need to be redrawn). And/or
    render the UI bitmap at a higher bit-depth (such as hi-color) if higher bit-depth elements are present (say, if there is a widget representing a "Bitmap object" or similar). In the "overlaid UI widgets" scenario,
    would likely make most sense to draw the parts showing the underlying
    image as a dedicated transparent color.

    Though, non-overlap would be preferable in the case of things like video playback or similar.


    For scalable fonts, current idea is mostly to use SDF fonts encoded as 256-color BMP images, where the X/Y distances are encoded as 4 bits
    each. The color palette is kinda redundant in this case, but doesn't add
    too much additional storage overhead to a group of 16x16 character cells
    (say, representing CP-1252 with each glyph encoded as 16x16 pixels).

    Partly this is because SDF font rendering seems technically simpler than dealing with TrueType fonts (and was also used in some of my prior 3D
    engine projects; though with the alteration of representing the SDF's as 256-color BMP images rather than TGA or similar).

    Currently no plans to address the rest of Unicode with this though (and
    it would take around 17MB to represent the entire Unicode BMP with this scheme).

    In the past, had used Unifont, which was encoded in a form that stores
    each glyph as 16x16 at 1bpp; basically works, but no good way to scale
    the glyphs in this case without them looking terrible (well, absent
    upscaling them, possibly smoothing the edges, and transcoding into SDF
    form).

    But, there is not really a good algorithmic way to figure out which
    corners need to be sharp and which should be rounded. Best option is
    likely to look at blocks of, say, 3x3 or 5x5 pixels, and then make a
    guess, based on the pattern, as to how the pixel in the center should
    expand from 1 pixel to 2x2
    (which is applied iteratively to double the resolution of each glyph).

    Though, 3x3 is small enough to make it viable to figure out a lookup
    table by hand. Well, or generate a bunch of images of large-font text
    and try to "mine" the patterns from the images (and then use these
    patterns to hopefully upsample Unifont glyphs to 64x64 or similar, as
    needed for the 16x16 SDF generation...).
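
    As a sketch of the doubling pass itself (the 512-entry pattern table is
    the hard part and is left external here; its contents would come from
    the hand-tuning and/or pattern mining described above):

    #include <stdint.h>

    extern uint8_t tbl2x2[512];   /* 4-bit output pattern per 9-bit neighborhood */

    static int getpx(const uint8_t *img, int w, int h, int x, int y)
    {
        if (x < 0 || y < 0 || x >= w || y >= h) return 0;
        return img[y * w + x] & 1;
    }

    /* src: w x h 1bpp glyph (one byte per pixel); dst: 2w x 2h output */
    void upsample2x(const uint8_t *src, int w, int h, uint8_t *dst)
    {
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) {
                int idx = 0;                          /* 9-bit neighborhood code */
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++)
                        idx = (idx << 1) | getpx(src, w, h, x + dx, y + dy);
                int pat = tbl2x2[idx];                /* 2x2 expansion pattern */
                dst[(2*y  )*(2*w) + 2*x  ] = (pat >> 3) & 1;
                dst[(2*y  )*(2*w) + 2*x+1] = (pat >> 2) & 1;
                dst[(2*y+1)*(2*w) + 2*x  ] = (pat >> 1) & 1;
                dst[(2*y+1)*(2*w) + 2*x+1] =  pat       & 1;
            }
    }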


    Any thoughts on any of this?...

    ...

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Apr 5 21:43:08 2024
    From Newsgroup: comp.arch

    BGB-Alt wrote:



    I have yet to decide on the default bit-depth for UI widgets, mostly
    torn between 16-color and 256-color. Probably don't need high bit-depths
    for "text and widgets" windows, but may need more than 16-color. Also

    When I write documents I have a favorite set of pastel colors I use to
    shade boxes, arrows, and text. I draw the figures first and then fill in
    the text later. To give the eye an easy means to go from reading the text
    to looking at a figure and finding the discussed item, I place a box around
    the text and shade the box with the same R-G-B color of the pre.jpg box in
    the figure. All figures are *.jpg. So, even a word processor needs 24-bit color.

    {{I have also found that Word R-G-B color values are not exactly the same
    as the R-G-B values in the *.jpg figure, too; but they are close enough
    to avoid "figuring it out and trying to fix it to perfection".}}

    TBD which color palette to use in the case if 256 color is used (the

    256 is OK enough for DOOM games, but not for professional documentation.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Sat Apr 6 01:42:58 2024
    From Newsgroup: comp.arch

    On 4/5/2024 4:43 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:



    I have yet to decide on the default bit-depth for UI widgets, mostly
    torn between 16-color and 256-color. Probably don't need high
    bit-depths for "text and widgets" windows, but may need more than
    16-color. Also

    When I write documents I have a favorite set of pastel colors I use to
    shade boxes, arrows, and text. I draw the figures first and then fill in
    the text later. To give the eye an easy means to go from reading the
    text to looking at a figure and finding the discussed item, I place a
    box around
    the text and shade the box with the same R-G-B color of the pre.jpg box
    in the figure. All figures are *.jpg. So, even a word processor needs
    24-bit color.

    {{I have also found that Word R-G-B color values are not exactly the same
    as the R-G-B values in the *.jpg figure, too; but they are close enough
    to avoid "figuring it out and trying to fix it to perfection".}}

    TBD which color palette to use in the case if 256 color is used (the

    256 is OK enough for DOOM games, but not for professional documentation.


    If the palette is set up well, one can approximate most RGB colors semi-passably via the palette. Granted, still not great for photos or
    video (or even games); and a tradeoff needs to be made between color
    fidelity and levels of brightness (with the system palette I was using favoring brightness accuracy over color fidelity).

    Where games only really look good if the color palette used by the
    system roughly matches that used by the game. The palette I had used, as
    can be noted, was 14 levels of 18 colors:
    6 high-saturation (3-bit RGB);
    6 low saturation;
    3 off-white;
    grayscale (16 levels);
    orange;
    azure.

    Though, there is not really a fast/accurate procedural transform here,
    as the palette is complicated enough to require a lookup table (unlike
    216-color or RGB332).
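
    For contrast, the simpler palettes mentioned do have cheap closed-form
    encoders (these are just the generic formulas, nothing specific to this
    system):

    #include <stdint.h>

    uint8_t rgb_to_332(uint8_t r, uint8_t g, uint8_t b)
    {
        return (r & 0xE0) | ((g >> 3) & 0x1C) | (b >> 6);   /* RRRGGGBB */
    }

    uint8_t rgb_to_web216(uint8_t r, uint8_t g, uint8_t b)
    {
        int r6 = (r + 25) / 51;     /* round each channel to one of 6 levels */
        int g6 = (g + 25) / 51;
        int b6 = (b + 25) / 51;
        return (uint8_t)(r6 * 36 + g6 * 6 + b6);            /* 0..215 */
    }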

    Had at one point used a palette encoding which used a 4-bit Y and the
    remaining 4 bits effectively encoding a normalized vector in YUV space,
    but the current scheme is more efficient. In this older scheme, palette
    entries that would represent an out-of-gamut color were effectively
    wasted. This strategy worked semi-OK for photo-like graphics, but worse
    for other things; the encoding logic was of intermediate complexity
    (effectively built around a highly modified YCbCr transform).

    Though, one possible (semi-OK) way to algorithmically encode colors to
    the current palette would be:
    Calculate the Luma;
    Normalize the RGB color into a color vector;
    Feed this vector through a fixed RGB222 lookup table;
    Encode the color based on which axis was selected.
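
    In very rough C, those steps might look like the following; the two
    tables are stand-ins (the real palette layout, with its 14 levels of 18
    colors, is not reproduced here):

    #include <stdint.h>

    extern uint8_t rgb222_to_axis[64];          /* hypothetical: RGB222 cell -> color "axis" (0..17) */
    extern uint8_t axis_level_to_index[18][14]; /* hypothetical: (axis, level) -> palette index */

    uint8_t encode_to_sys_palette(uint8_t r, uint8_t g, uint8_t b)
    {
        int y = (2 * r + 5 * g + b) / 8;        /* cheap luma approximation, 0..255 */

        int maxc = r;                           /* normalize RGB into a direction vector */
        if (g > maxc) maxc = g;
        if (b > maxc) maxc = b;
        if (maxc == 0) maxc = 1;
        int rn = r * 255 / maxc;
        int gn = g * 255 / maxc;
        int bn = b * 255 / maxc;

        int cell = ((rn >> 6) << 4) | ((gn >> 6) << 2) | (bn >> 6);  /* RGB222 cell, 0..63 */
        int axis = rgb222_to_axis[cell];        /* which of the palette's color axes */

        int level = y * 14 / 256;               /* quantize brightness to 14 levels */
        return axis_level_to_index[axis][level];
    }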



    Though, for most UI widgets, one arguably only needs 4 colors, say:
    Black, White, Dark-Gray, Light-Gray.

    Add maybe a few other colors for things like title-bars and similar, ...


    But, imposing a 16-color RGBI limit may be too restricting (even in the
    name of saving memory by representing windows at a lower bit-depth when
    a higher bit depth is unnecessary).

    If one allows for customizable UI color schemes (like in older versions
    of Windows), then RGBI would be insufficient even for simple pointy
    clicky apps (leaving 256-color as the minimum option).


    Here, 256 colors could make sense. Though, RGB555 would likely be
    overkill for most simple pointy clicky GUI apps.


    One possibility here could be for the widget toolkit window to
    automatically upgrade to a higher bit-depth if it detects the use of bitmap-object widgets or similar (possibly making the choice of
    bit-depth mostly invisible to the program in this case).

    Well, or leave it per program, where a program that may use more complex graphics requests the use of a higher bit-depth, and then say, one that
    only does a text-editor or calculator or similar, sticks with a more
    limited colorspace to save memory.


    Say, looking at Windows Calculator:
    Looks like it uses:
    ~ 5 or 6 shades of gray;
    A shade of blue.

    Programs of this sort don't really need a higher color depth.
    Would a Notepad style program need lots of colors?
    ...

    Granted, something like an image editor would need full-color, so in
    this case supporting higher color depths will make sense as well. Just,
    one doesn't want to use higher bit-depths when they are not needed, as
    this wastes more RAM on the backing framebuffers for the various
    windows (say, a 640x400 window needs roughly 125K of backing store at
    4 bpp, 250K at 8 bpp, and 500K at 16 bpp), etc...




    In other news, had been messing with doing SDF versions of Unifont.

    Curiously, a newer version (15.1.05) of the Unifont font (post
    processing) seems to actually be a little bit smaller than an older
    version (5.1). Not entirely sure what is going on with this (but, can
    note that there are a number of large gaps in the codepoint space).

    Well, and seemingly multiple versions of the font:
    The normal Unicode BMP;
    Another 'JP' version which follows JIS rules;
    Apparently maps in various characters from Planes 1 and 2, ...
    Appears mostly similar apart from the contents of the CJK ranges.
    ...


    Say:
    After conversion to a 1bpp glyph set:
    5.1 : 1.99 MB
    15.1.05: 1.78 MB
    After conversion to SDF and storing the bitmaps in a WAD container:
    5.1 : ~ 7.0MB
    15.1.05: ~ 6.1MB

    Where, in this case, WAD based packaging seemed "mostly sufficient".
    May or may not end up putting additional metadata in the WAD files.
    For now, it is mostly just a collection of bitmap images...

    With each image having its number encoded in the lump name, and understood
    to be a grid of 16x16 glyphs.


    Though, glyph shape reconstruction is less exact than could be hoped.
    Arguably, still better than if one tries to scale the bitmaps directly,
    but results may come out a little lumpy/weird here (though, trying to
    store them at 4-bits per sub-channel probably doesn't exactly help this).


    At the moment, storage is basically:
    Represent the SDF as a 256-color image, with each pixel split into 2 sub-channels. The palette color is interpreted as the input vector for
    the SDF drawing (interpolate between the pixels, take the median, and
    compare this value against a threshold).
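
    In code terms, sampling could look roughly like the sketch below (one
    reading of the above; the palette indirection is skipped here and the
    index nibbles are used directly, and all of the names are made up):

    #include <stdint.h>

    /* low nibble = channel 0 distance, high nibble = channel 1 distance */
    static float sdf_texel(const uint8_t *img, int w, int x, int y, int ch)
    {
        return (float)((img[y * w + x] >> (4 * ch)) & 15) / 15.0f;
    }

    /* Returns nonzero if (u,v) in [0,1]^2 lands inside the glyph. */
    int sdf_sample(const uint8_t *img, int w, int h, float u, float v)
    {
        float fx = u * (w - 1), fy = v * (h - 1);
        int x0 = (int)fx, y0 = (int)fy;
        int x1 = (x0 + 1 < w) ? x0 + 1 : x0;
        int y1 = (y0 + 1 < h) ? y0 + 1 : y0;
        float ax = fx - x0, ay = fy - y0;
        float c[2];
        for (int ch = 0; ch < 2; ch++) {        /* bilinear filter each sub-channel */
            float t = sdf_texel(img, w, x0, y0, ch) * (1 - ax) +
                      sdf_texel(img, w, x1, y0, ch) * ax;
            float b = sdf_texel(img, w, x0, y1, ch) * (1 - ax) +
                      sdf_texel(img, w, x1, y1, ch) * ax;
            c[ch] = t * (1 - ay) + b * ay;
        }
        /* With two stored axes plus an implicit third (their average, per
           the remark further below), the median collapses to the average. */
        float d = 0.5f * (c[0] + c[1]);
        return d >= 0.5f;                       /* threshold at the 0.5 midpoint */
    }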

    Initially, was not using the palette, but the palette allows more
    flexibility in this case.

    The current algorithm generates the SDF images by finding the horizontal
    and vertical distances. Some papers imply that it may make sense to
    allow more distance-calculation functions, possibly trying to regenerate
    each glyph using each combination of functions and picking whichever
    combination yields the lowest reconstruction error.
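
    A rough sketch of one per-axis distance pass (the scan range and the
    quantization details are guesses; the vertical pass would be the same
    with x and y swapped):

    #include <stdint.h>

    /* src: upsampled 1bpp glyph, w*h bytes, each 0 or 1 */
    uint8_t sdf_dist_h(const uint8_t *src, int w, int x, int y, int maxd)
    {
        int inside = src[y * w + x];
        int d = maxd;
        for (int i = 1; i < maxd; i++) {        /* scan left/right for the nearest edge */
            int xl = x - i, xr = x + i;
            if ((xl >= 0 && src[y * w + xl] != inside) ||
                (xr <  w && src[y * w + xr] != inside)) {
                d = i;
                break;
            }
        }
        /* Map to [0,1] with the edge at 0.5: inside above, outside below. */
        float sd = 0.5f + (inside ? 1.0f : -1.0f) * ((float)d / (2.0f * maxd));
        if (sd < 0.0f) sd = 0.0f;
        if (sd > 1.0f) sd = 1.0f;
        return (uint8_t)(sd * 15.0f + 0.5f);    /* quantize to one 4-bit sub-channel */
    }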


    Granted, ended up using a fairly naive up-sampling algorithm (from the
    16x16 pixel input glyphs), which possibly doesn't really help (I fiddled
    with it until I got something semi-OK looking, but it has a limitation
    of not having any idea for what the upsampled glyphs are "supposed" to
    look like, what corners should be rounded and which kept sharp and
    pointy; and patterns that work for ASCII/8859-1 range don't extend to
    the rest of the BMP).

    Though, at least for 1252, this range of characters can be handled
    manually. And in this case, the upsampled glyphs were based on 8x8 pixel
    variants, where following the shapes of the low-res glyphs makes the
    font much more tolerant of scaling (comparatively, the Unifont glyphs
    are "less robust" in these areas).



    However, as-is, processing the Unicode BMP still takes an annoyingly
    long time (and a more involved axis-search for each glyph wouldn't
    exactly help matters).

    Likely options, if I did so:
    Simple 2D euclidean distance (naive algorithm);
    Horizontal distance (current);
    Vertical distance (current);
    Down-Left (diagonal);
    Up-Left (diagonal).

    Where, the generation would pick the best 2, with an implicit "3rd axis"
    which would merely be the average of the first 2.


    In theory, could use hi-color images (with 3x 5-bit channels), but this
    would effectively double the memory requirements of each SDF image
    (though, the current implementation demand loads each "page", rather
    than bulk-loading the entire Unicode BMP).

    Well, or I use hi-color for CP-1252, but then the 8-bit scheme for Unifont.



    Still not currently wanting to face the complexity of dealing with
    TrueType fonts.

    Can also note that TTF fonts that cover the entire Unicode BMP are "not exactly small" either...



    --- Synchronet 3.20a-Linux NewsLink 1.114