• Generic GPUs

    From John Levine@johnl@taugh.com to comp.arch on Thu Apr 2 18:02:01 2026
    From Newsgroup: comp.arch

    This draft paper looks at a bunch of GPU instruction sets and wonders
    if we could come up with a widely implemented GPU instruction set
    analogous to the ARM CPU instruction set.

    Toward a Universal GPU Instruction Set Architecture: A Cross-Vendor
    Analysis of Hardware-Invariant Computational Primitives in Parallel
    Processors

    Ojima Abraham, Onyinye Okoli

    We present the first systematic cross-vendor analysis of GPU
    instruction set architectures spanning all four major GPU vendors:
    NVIDIA (PTX ISA v1.0 through v9.2, Fermi through Blackwell), AMD (RDNA
    1 to 4 and CDNA 1 to 4), Intel (Gen11, Xe-LP, Xe-HPG, Xe-HPC), and
    Apple (G13, reverse-engineered). Drawing on official ISA reference
    manuals, architecture whitepapers, patent filings, and community reverse-engineering efforts totaling over 5,000 pages of primary
    sources across 16 distinct microarchitectures, we identify ten hardware-invariant computational primitives that appear across all
    four architectures, six parameterizable dialects where vendors
    implement identical concepts with different parameters, and six true architectural divergences representing fundamental design
    disagreements. Based on this analysis, we propose an abstract
    execution model for a vendor-neutral GPU ISA grounded in the physical constraints of parallel computation. We validate our model with
    benchmark results on NVIDIA T4 and Apple M1 hardware, the two most architecturally distant platforms in our study. On five of six benchmark-platform pairs, the abstract model matches or exceeds native vendor-optimized performance. The single outlier (parallel reduction
    on NVIDIA, 62.5% of native) reveals that intra-wave shuffle must be a
    mandatory primitive, a finding that refines our proposed model.


    https://arxiv.org/abs/2603.28793
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Apr 2 15:57:01 2026
    From Newsgroup: comp.arch

    On 4/2/2026 1:02 PM, John Levine wrote:
    This draft paper looks at a bunch of GPU instruction sets and wonders
    if we could come up with a widely implemented GPU instruction set
    analogous to the ARM CPU instruction set.

    Toward a Universal GPU Instruction Set Architecture: A Cross-Vendor
    Analysis of Hardware-Invariant Computational Primitives in Parallel Processors

    Ojima Abraham, Onyinye Okoli

    We present the first systematic cross-vendor analysis of GPU
    instruction set architectures spanning all four major GPU vendors:
    NVIDIA (PTX ISA v1.0 through v9.2, Fermi through Blackwell), AMD (RDNA
    1 to 4 and CDNA 1 to 4), Intel (Gen11, Xe-LP, Xe-HPG, Xe-HPC), and
    Apple (G13, reverse-engineered). Drawing on official ISA reference
    manuals, architecture whitepapers, patent filings, and community reverse-engineering efforts totaling over 5,000 pages of primary
    sources across 16 distinct microarchitectures, we identify ten hardware-invariant computational primitives that appear across all
    four architectures, six parameterizable dialects where vendors
    implement identical concepts with different parameters, and six true architectural divergences representing fundamental design
    disagreements. Based on this analysis, we propose an abstract
    execution model for a vendor-neutral GPU ISA grounded in the physical constraints of parallel computation. We validate our model with
    benchmark results on NVIDIA T4 and Apple M1 hardware, the two most architecturally distant platforms in our study. On five of six benchmark-platform pairs, the abstract model matches or exceeds native vendor-optimized performance. The single outlier (parallel reduction
    on NVIDIA, 62.5% of native) reveals that intra-wave shuffle must be a mandatory primitive, a finding that refines our proposed model.


    https://arxiv.org/abs/2603.28793


    Skims...

    It almost seems like "XG3 but SIMT" wouldn't be too far off...


    Main differences:
    64 GPRs vs 128/256
    64-bit vs 32-bit
    2xF32/4xF16 vs 1xF32/2xF16

    Weird to think that XG3 has wider SIMD than a GPU, but yeah...

    Seems like they aren't really using SIMD in this case, but likely
    faking SIMD vectors by using 2 or 4 scalar registers (where the high
    register count makes sense). Otherwise, unless one is inlining the
    whole shader into a single monolithic blob, one is unlikely to be able
    to make effective use of that many registers.

    Does come off as something very different than RV-v in any case...


    But, narrower SIMD does make sense in a way vs wider SIMD:
    The wider you go, the more of a PITA that shuffles become.

    Scalar: No Shuffles at all;
    Two Wide: Some Half-Register ops;
    Four Wide: Need 4-element shuffles alongside half-register stuff;
    Eight Wide (or wider): All hell breaks loose (note the complete lack of
    native 8 wide SIMD ops in my ISA, there was a reason here...).

    Very wide vectors only really make sense if one assumes the
    computational task looks like an array walk or similar.

    I am also using a fairly limited predication scheme, but often 1 bit of predicate is sufficient. Can evaluate more complex conditionals in GPRs
    and then fold to the final predicate flag for the last step.

    ...


    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Apr 2 22:28:22 2026
    From Newsgroup: comp.arch


    John Levine <johnl@taugh.com> posted:

    This draft paper looks at a bunch of GPU instruction sets and wonders
    if we could come up with a widely implemented GPU instruction set
    analogous to the ARM CPU instruction set.

    Toward a Universal GPU Instruction Set Architecture: A Cross-Vendor
    Analysis of Hardware-Invariant Computational Primitives in Parallel Processors

    Ojima Abraham, Onyinye Okoli

    We present the first systematic cross-vendor analysis of GPU
    instruction set architectures spanning all four major GPU vendors:
    NVIDIA (PTX ISA v1.0 through v9.2, Fermi through Blackwell), AMD (RDNA
    1 to 4 and CDNA 1 to 4), Intel (Gen11, Xe-LP, Xe-HPG, Xe-HPC), and
    Apple (G13, reverse-engineered). Drawing on official ISA reference
    manuals, architecture whitepapers, patent filings, and community reverse-engineering efforts totaling over 5,000 pages of primary
    sources across 16 distinct microarchitectures, we identify ten hardware-invariant computational primitives that appear across all
    four architectures, six parameterizable dialects where vendors
    implement identical concepts with different parameters, and six true architectural divergences representing fundamental design
    disagreements. Based on this analysis, we propose an abstract
    execution model for a vendor-neutral GPU ISA grounded in the physical constraints of parallel computation. We validate our model with
    benchmark results on NVIDIA T4 and Apple M1 hardware, the two most architecturally distant platforms in our study. On five of six benchmark-platform pairs, the abstract model matches or exceeds native vendor-optimized performance. The single outlier (parallel reduction
    on NVIDIA, 62.5% of native) reveals that intra-wave shuffle must be a mandatory primitive, a finding that refines our proposed model.


    https://arxiv.org/abs/2603.28793


    What I find missing is illustrative::
    a) Texture LDs,
    b) Rasterization,
    c) Interpolation,
    d) Transcendental instructions,
    e) 'WARP' setup-tear down;
    Or, roughly 1/2 of what GPUs do when running GPU codes.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Fri Apr 3 15:30:00 2026
    From Newsgroup: comp.arch

    MitchAlsup [2026-04-02 22:28:22] wrote:
    What I find missing is illustrative::
    a) Texture LDs,
    b) Rasterization,
    c) Interpolation,
    d) Transcendental instructions,

    IIUC these are largely orthogonal to the ISA. Is there a lot of
    variation in the form of those primitives in different GPU architectures?

    Also, my impression is that the article was mostly interested in the
    "GPGPU" angle rather than computer graphics (tho I can't find any
    explicit statement about that, so maybe it reflects my own bias).

    e) 'WARP' setup-tear down;

    I'd have liked to see more discussion of that one, indeed.

    Or, roughly 1/2 of what GPUs do when running GPU codes.

    Even in application domains other than computer graphics (well, I guess
    for things like physics simulations transcendental functions would be important, but I expect adding them to the authors' "abstract ISA"
    would be trivial and not particularly enlightening).


    === Stefan
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Apr 3 21:46:14 2026
    From Newsgroup: comp.arch


    Stefan Monnier <monnier@iro.umontreal.ca> posted:

    MitchAlsup [2026-04-02 22:28:22] wrote:
    What I find missing is illustrative::
    a) Texture LDs,
    b) Rasterization,
    c) Interpolation,
    d) Transcendental instructions,

    IIUC these are largely orthogonal to the ISA. Is there a lot of
    variation in the form of those primitives in different GPU architectures?

    While different ISA to ISA, they all allow what appears to be a LD
    in ISA to perform {2,4,8} memory reads and perform {1,2,3}-D linear interpolations between points so that the shader program did not
    need instructions to perform that work.
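    For illustration, the 2-D case (4 reads plus a bilinear interpolation)
    amounts to something like the following C sketch. This is a
    hypothetical helper, not any vendor's actual semantics; it assumes a
    row-major float texture and u,v in [0,1]:

    ```c
    /* Bilinear texture fetch: 4 memory reads plus a 2-D linear
       interpolation, all folded into one texture-LD instruction on the
       GPUs in question.  Hypothetical helper for illustration only. */
    float tex2d_bilinear(const float *tex, int w, int h, float u, float v)
    {
        float x = u * (float)(w - 1);
        float y = v * (float)(h - 1);
        int x0 = (int)x, y0 = (int)y;       /* u,v >= 0, so truncation == floor */
        int x1 = x0 + 1 < w ? x0 + 1 : x0;  /* clamp at the edge */
        int y1 = y0 + 1 < h ? y0 + 1 : y0;
        float fx = x - (float)x0, fy = y - (float)y0;

        /* the 4 memory reads */
        float t00 = tex[y0 * w + x0], t10 = tex[y0 * w + x1];
        float t01 = tex[y1 * w + x0], t11 = tex[y1 * w + x1];

        /* the 2-D linear interpolation */
        float top = t00 + (t10 - t00) * fx;
        float bot = t01 + (t11 - t01) * fx;
        return top + (bot - top) * fy;
    }
    ```

    The 1-D and 3-D cases just change the read count (2 or 8) and drop or
    add one level of lerp.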

    Also, my impression is that the article was mostly interested in the
    "GPGPU" angle rather than computer graphics (tho I can't find any
    explicit statement about that, so maybe it reflects my own bias).

    e) 'WARP' setup-tear down;

    I'd have liked to see more discussion of that one, indeed.

    In many GPUs, there is a front end scheduler that reads the draw-calls
    from SW and sets up WARPs to run. This scheduler needs to be able to
    spawn WARPs as fast as shader programs finish--roughly 1 new WARP
    every 8 cycles (32 new threads set up and ready to go every 8 cycles).
    This is almost as important to GPU performance as 0-cycle context switch.

    Or, roughly 1/2 of what GPUs do when running GPU codes.

    Even in application domains other than computer graphics (well, I guess
    for things like physics simulations transcendental functions would be important, but I expect adding them to the authors' "abstract ISA"
    would be trivial and not particularly enlightening).


    === Stefan
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Apr 3 17:01:10 2026
    From Newsgroup: comp.arch

    On 4/3/2026 2:30 PM, Stefan Monnier wrote:
    MitchAlsup [2026-04-02 22:28:22] wrote:
    What I find missing is illustrative::
    a) Texture LDs,
    b) Rasterization,
    c) Interpolation,
    d) Transcendental instructions,

    IIUC these are largely orthogonal to the ISA. Is there a lot of
    variation in the form of those primitives in different GPU architectures?

    Also, my impression is that the article was mostly interested in the
    "GPGPU" angle rather than computer graphics (tho I can't find any
    explicit statement about that, so maybe it reflects my own bias).


    FWIW: A few specialized instructions related to A and C exist in XG1/2/3 despite it not being a GPU ISA.

    But... It was being used some for 3D rendering tasks and similar, so I
    guess it stands to reason.


    Though, D seems a little odd, as transcendentals aren't really used in
    the backend parts of a 3D renderer, nor much in typical shader code,
    and to the extent they are, cheaper approximations usually suffice
    (in a graphics shader there is rarely much reason to care whether
    "log()" or "pow()" are numerically accurate).


    Still, I didn't really expect to see something kinda like what I was
    doing already, but even more in that direction.


    e) 'WARP' setup-tear down;

    I'd have liked to see more discussion of that one, indeed.

    Or, roughly 1/2 of what GPUs do when running GPU codes.

    Even in application domains other than computer graphics (well, I guess
    for things like physics simulations transcendental functions would be important, but I expect adding them to the authors' "abstract ISA"
    would be trivial and not particularly enlightening).


    Yeah.


    Meanwhile I was now working on re-implementing the Int128 and Float128
    stuff for the XG3 and RISC-V backends. Writing the Float128 stuff in C
    this time (the original code for BJX2 was all in ASM and didn't work on
    either RV or XG3).

    Initially, I had just sorta stubbed this stuff over to make it build
    (there were some initial placeholders for the RISC-V versions of this
    stuff, wouldn't actually work correctly for 128-bit numbers though).
    Didn't start getting around to it until now (IIRC, started working on it
    some yesterday or so).


    Don't have the new Float128 code working yet, is giving me some crap,
    partly as I am left also needing to debug the Int128 support at the same
    time (the Float128 code being written on top of the Int128 support).

    Did partly switch over from the old Float128 code to the new code for
    the older BJX2 ISA as well, partly as the new code is at least intended
    to be able to support proper IEEE-754 semantics (with subnormal numbers
    and all that fun, whereas the former ASM code was still DAZ/FTZ).

    Still don't have the new code working yet though, and kept running into
    bugs (some in BGBCC, others in some of the new ASM support code being
    written for RV).



    Also generates epic dog-crap code for the RV ABI, as the registers don't
    go quite so far when dealing with 128-bit stuff. And the "va_list"
    handling seems to be broken as well for 128-bit types for some reason.


    Though, there isn't really a standard printf modifier for these.
    Was using things like "%032LX" with the thinking that the "long double"
    type modifier also means either int128 or float128.
    %X : assumed 32-bits ("sizeof(int)")
    %lX : 32 or 64-bits ("sizeof(long)")
    %llX : 64-bits ("sizeof(long long)")
    %LX : 128-bits (non-standard)
    %I64X : 64-bits (old MSVC convention)
    %I128X: 128-bits (old MSVC convention)

    Vs:
    %f : Assumed double (with float auto-promoting)
    %Lf : Long Double (float128)

    TBD:
    %LLX : Possible for a 256-bit type?...


    Ended up writing some test-bench code to start trying to debug some of
    this stuff, though for now just printing stuff using 64-bit types or
    64-bit chunks to workaround the va_list issue for now, will need to
    figure out the issue here.
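    The 64-bit-chunk workaround can be sketched roughly like so (assuming
    a compiler that provides "__int128", e.g. GCC/Clang on 64-bit targets;
    the helper name is made up):

    ```c
    #include <stdio.h>
    #include <stdint.h>

    /* Format an unsigned __int128 as two 64-bit hex chunks, sidestepping
       both the non-standard "%032LX" modifier and any 128-bit va_list
       handling.  Hypothetical helper name, for illustration. */
    static void u128_to_hex(unsigned __int128 v, char out[33])
    {
        sprintf(out, "%016llX%016llX",
                (unsigned long long)(uint64_t)(v >> 64),
                (unsigned long long)(uint64_t)v);
    }
    ```

    Printing the two halves with a pair of "%016llX" is then enough until
    the va_list issue gets sorted out.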


    Can note that at present, a lot of the "__int128" code generated on RV
    ends up looking like:
    Load value from the stack;
    Load other value from the stack;
    Copy into argument registers;
    Copy into argument registers;
    Call Int128 support function;
    Copy return value into other registers;
    Store back to the stack (so it can repeat this whole process).

    Which, granted, there isn't that much of a better way to do this...


    Though, almost starts to make one question if on RV it would almost be
    better to skip keeping int128/float128 in registers, and instead be like:
    ADDI X10, SP, DispDst //load destination address
    ADDI X11, SP, DispSrc1 //load first source address
    ADDI X12, SP, DispSrc2
    JAL X1, __xli_add_3m //do operation in memory

    Since:
    On RV there is relatively little that can be done inline;
    The register pressure is too high to effectively keep Int128 values in registers;
    A lot of the compiler logic was built to assume working on the values in registers (presumably inline in many cases), rather than quite so
    heavily relying on runtime calls (but, RV is weak here vs my own ISA
    even in the absence of the ALUX instructions).

    Though, could make sense (if adding this) to have it as a special case
    if none of the values are currently in registers.

    Note that passing memory addresses is generally how anything larger than 128-bits will work (So, "_BitInt(256)" would be passed as pointers).


    But, I guess no one says that working with "__int128" and similar on
    RISC-V needs to be fast.

    For now, the support functions (shared between RV and XG3) are being
    written to assume RV limitations. Though, for XG3 both the register
    pressure situation is less bad and there is more that could be done
    inline (either ALUX, or using ADDC and similar, but ADDC is now itself
    not necessarily a core instruction; and absent both ALUX and ADDC, one
    is left having to fake carry propagation using additional instructions,
    namely SLT+ADD, ...).
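    The SLT+ADD carry-faking looks something like this in C (hypothetical
    helper; on RV64 the unsigned compare is exactly what SLTU computes):

    ```c
    #include <stdint.h>

    /* 128-bit add from two 64-bit halves.  The carry compare is the
       SLT(U)+ADD trick: an unsigned sum wrapped around iff the result
       is smaller than either addend.  Hypothetical helper name. */
    static void add128(uint64_t a_hi, uint64_t a_lo,
                       uint64_t b_hi, uint64_t b_lo,
                       uint64_t *r_hi, uint64_t *r_lo)
    {
        uint64_t lo = a_lo + b_lo;      /* ADD       */
        uint64_t carry = lo < a_lo;     /* SLTU      */
        *r_hi = a_hi + b_hi + carry;    /* ADD, ADD  */
        *r_lo = lo;
    }
    ```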

    But, yeah...


    Other thoughts:

    __int256? Possible I could add, but would likely be handled as an alias
    to "_BitInt(256)" rather than a dedicated type.

    __float256? Debatable, possibly could make sense as a "floating point
    type to end all floating point types" kinda thing, no immediate priority.

    But, these would likely need to wait until after I get the 128-bit types
    fully working on RV and similar...


    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Apr 4 15:54:28 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 4/3/2026 2:30 PM, Stefan Monnier wrote:
    MitchAlsup [2026-04-02 22:28:22] wrote:
    -------------------
    Can note that at present, a lot of the "__int128" code generated on RV
    ends up looking like:
    Load value from the stack;
    Load other value from the stack;
    Copy into argument registers;
    Copy into argument registers;
    Call Int128 support function;
    Copy return value into other registers;
    Store back to the stack (so it can repeat this whole process).

    Which, granted, there isn't that much of a better way to do this...


    Though, almost starts to make one question if on RV it would almost be better to skip keeping int128/float128 in registers, and instead be like:
    ADDI X10, SP, DispDst //load destination address
    ADDI X11, SP, DispSrc1 //load first source address
    ADDI X12, SP, DispSrc2
    JAL X1, __xli_add_3m //do operation in memory

    This is an argument against separate register files. Since FP128
    is nothing like FP32 or FP64 and FP32 or FP64 instructions do not
    help in calculations, ...

    It also shows that once something does not fit the register format
    one is better off passing via pointers. ...
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Apr 4 12:11:14 2026
    From Newsgroup: comp.arch

    On 4/4/2026 10:54 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/3/2026 2:30 PM, Stefan Monnier wrote:
    MitchAlsup [2026-04-02 22:28:22] wrote:
    -------------------
    Can note that at present, a lot of the "__int128" code generated on RV
    ends up looking like:
    Load value from the stack;
    Load other value from the stack;
    Copy into argument registers;
    Copy into argument registers;
    Call Int128 support function;
    Copy return value into other registers;
    Store back to the stack (so it can repeat this whole process).

    Which, granted, there isn't that much of a better way to do this...


    Though, almost starts to make one question if on RV it would almost be
    better to skip keeping int128/float128 in registers, and instead be like:
    ADDI X10, SP, DispDst //load destination address
    ADDI X11, SP, DispSrc1 //load first source address
    ADDI X12, SP, DispSrc2
    JAL X1, __xli_add_3m //do operation in memory

    This is an argument against separate register files. Since FP128
    is nothing like FP32 or FP64 and FP32 or FP64 instructions do not
    help in calculations, ...

    It also shows that once something does not fit the register format
    one is better off passing via pointers. ...

    It is a great tradeoff of register file size:
    32 GPRs;
    12 callee save registers;
    Register pairing.

    Result:
    Huge register pressure and a whole lot of spill-and-fill...
    While theoretically could hold 6 pairs, due to needing registers for
    other stuff it is closer to 3 or 4.

    Could potentially put the Int128 values off in FPR space, since there is
    less pressure there, and not like RV can actually do much with the
    values in GPRs anyways.


    Less bad in XG3 as with 28 callee save registers, there is less of an
    issue with spill and fill.

    More so when useful work can be inline, but this is more of a problem
    with RISC-V.

    Basically, only really AND/OR/XOR can be done inline.
    ADD/SUB/NEG: Logic too complicated, needs a function call.
    Shifts: Likewise, also needs multiple branch-paths depending on the
    shift amount.
    ...



    Oh well, got as far as getting FADD/FSUB/FMUL basically working, but
    FDIV still breaks. Likely still some bugs here...


    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Apr 4 20:10:28 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 4/4/2026 10:54 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/3/2026 2:30 PM, Stefan Monnier wrote:
    MitchAlsup [2026-04-02 22:28:22] wrote:
    -------------------
    Can note that at present, a lot of the "__int128" code generated on RV
    ends up looking like:
    Load value from the stack;
    Load other value from the stack;
    Copy into argument registers;
    Copy into argument registers;
    Call Int128 support function;
    Copy return value into other registers;
    Store back to the stack (so it can repeat this whole process).

    Which, granted, there isn't that much of a better way to do this...


    Though, almost starts to make one question if on RV it would almost be
    better to skip keeping int128/float128 in registers, and instead be like:
    ADDI X10, SP, DispDst //load destination address
    ADDI X11, SP, DispSrc1 //load first source address
    ADDI X12, SP, DispSrc2
    JAL X1, __xli_add_3m //do operation in memory

    This is an argument against separate register files. Since FP128
    is nothing like FP32 or FP64 and FP32 or FP64 instructions do not
    help in calculations, ...

    It also shows that once something does not fit the register format
    one is better off passing via pointers. ...

    It is a great tradeoff of register file size:
    32 GPRs;
    12 callee save registers;
    Register pairing.
    32 GPRs
    16 callee save registers
    No register pairing

    Result:
    Huge register pressure and a whole lot of spill-and-fill...

    My 66000 has less spill fill than RISC-V even without FPRs.

    While theoretically could hold 6 pairs, due to needing registers for
    other stuff it is closer to 3 or 4.

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64

    Right now, one needs great FP64; and the occasional FP128.

    Choose wisely.

    Could potentially put the Int128 values off in FPR space, since there is less pressure there, and not like RV can actually do much with the
    values in GPRs anyways.

    You need 64×64->128, 128 << or >>, 128+128->128 for FP128 emulation
    So, either
    a) you have a complete integer ISA on FPRs
    b) you do FP128 emulation in GPRs

    -------------------
    Basically, only really AND/OR/XOR can be done inline.

    In your ISA,

    ADD/SUB/NEG: Logic too complicated, needs a function call.

    LoL,

    Shifts: Likewise, also needs multiple branch-paths depending on the
    shift amount.

    LoL, Wrong ISA model.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Apr 4 20:06:09 2026
    From Newsgroup: comp.arch

    On 4/4/2026 3:10 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/4/2026 10:54 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/3/2026 2:30 PM, Stefan Monnier wrote:
    MitchAlsup [2026-04-02 22:28:22] wrote:
    -------------------
    Can note that at present, a lot of the "__int128" code generated on RV
    ends up looking like:
    Load value from the stack;
    Load other value from the stack;
    Copy into argument registers;
    Copy into argument registers;
    Call Int128 support function;
    Copy return value into other registers;
    Store back to the stack (so it can repeat this whole process).

    Which, granted, there isn't that much of a better way to do this...


    Though, almost starts to make one question if on RV it would almost be
    better to skip keeping int128/float128 in registers, and instead be like:
    ADDI X10, SP, DispDst //load destination address
    ADDI X11, SP, DispSrc1 //load first source address
    ADDI X12, SP, DispSrc2
    JAL X1, __xli_add_3m //do operation in memory

    This is an argument against separate register files. Since FP128
    is nothing like FP32 or FP64 and FP32 or FP64 instructions do not
    help in calculations, ...

    It also shows that once something does not fit the register format
    one is better off passing via pointers. ...

    It is a great tradeoff of register file size:
    32 GPRs;
    12 callee save registers;
    Register pairing.
    32 GPRs
    16 callee save registers
    No register pairing

    Result:
    Huge register pressure and a whole lot of spill-and-fill...

    My 66000 has less spill fill than RISC-V even without FPRs.


    Possibly true.

    As for Callee-Save GPRs:
    XG1 : 15 (31 with XGPR)
    XG2 : 31
    XG3 : 28
    RISC-V: 12 (with 0/12/16 more on the FPR side)
    LP64 ABI: 12+0
    LP64D ABI: 12+12
    BGBCC's ABI: 12+16


    Also RISC-V mode has the highest spill-and-fill.

    Not usually too bad, but trying to put Int128 in GPRs causes it to
    become absurd. Likely better to move Int128 to FPRs despite being an
    integer type, will need to look into this.


    While theoretically could hold 6 pairs, due to needing registers for
    other stuff it is closer to 3 or 4.

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64

    Right now, one needs great FP64; and the occasional FP128.

    Choose wisely.


    FP64 + poor FP128 is the better option IMO.

    The cases where FP128 is used are so rare as to make its performance
    mostly irrelevant, but in this case, did realize a need for full
    IEEE-754 support, which meant likely needing to redo this stuff in C
    (also relevant to getting it working on RISC-V).

    Didn't necessarily want 3 different ASM versions:
    XG1/2, RISC-V, and XG3.

    Even if RISC-V is kind of a poor ISA to support this kind of thing.


    Could potentially put the Int128 values off in FPR space, since there is
    less pressure there, and not like RV can actually do much with the
    values in GPRs anyways.

    You need 64×64->128, 128 << or >>, 128+128->128 for FP128 emulation
    So, either
    a) you have a complete integer ISA on FPRs
    b) you do FP128 emulation in GPRs


    You can move from FPRs to GPRs on function calls, and back on return:
    MV X10, F18
    MV X11, F19
    MV X12, F20
    MV X13, F21
    JAL X1, __whatever
    MV F22, X10
    MV F23, X11

    In the case of RISC-V, sorta works since typically pretty much nothing
    is done inline in this case (and at least the ability to freely move
    stuff between registers isn't an issue in RV64G).


    Though, within the ASM support functions, still limited to using GPRs
    (since the integer ops are used and only work on GPRs).



    Ironically though, this is one area where having the Q extension is
    actively worse than not having the Q extension:
    F/D only: Can freely move values between GPRs and FPRs;
    Q extension: This goes away, can only use memory loads/stores to move
    the values between register types in this case...


    As for multiply, RV has:
    MUL : 64*64->64, low results
    MULHU: 64*64->64, high results

    For Float128 with IEEE rounding, needed a 128*128->256 bit multiply.

    Noted that it would be faster in this case to have a single multiply
    that produces both high-and-low results at the same time, than to
    emulate a 128-bit MULHU, as the MULHU would effectively still need to calculate the low-half to get carry propagation correct.

    Ended up doing this part in ASM partly to make it less needlessly slow.



    -------------------
    Basically, only really AND/OR/XOR can be done inline.

    In your ISA,


    In this case, in RV64G mode...

    In XG3 the situation doesn't suck nearly so badly.


    ADD/SUB/NEG: Logic too complicated, needs a function call.

    LoL,

    Shifts: Likewise, also needs multiple branch-paths depending on the
    shift amount.

    LoL, Wrong ISA model.

    Predication would help here...

    But, RISC-V only has BEQ/BNE/BLT/... to resolve this issue.

    So, no predication, or conditional-select, or ...


    So, one needs paths, say:
    Shift is negative;
    Shift is 0 bits;
    Shift is 1 to 63 bits;
    Shift is 64 to 127 bits;
    ...
    For 128+, can mask to 7 bits after verifying that the shift is positive.

    Checking the sign of the shifts is needed to match the behavior of the
    shift ops in my ISA, where negative shifts go the opposite direction
    (and I already needed to deal specially with 0).

    So, say (pseudo-code, I actually wrote this part in ASM):
        if(shl>0)
        {
            shl&=127;
            if(shl>=64)
            {
                out_hi=in_lo<<(shl&63);
                out_lo=0;
            }else
            {
                out_hi=(in_hi<<(shl&63))|(in_lo>>(64-(shl&63)));
                out_lo=in_lo<<(shl&63);
            }
        }else
        {
            if(shl==0)
            {
                out_hi=in_hi;
                out_lo=in_lo;
            }else
            {
                //flip sign and call opposite shift
            }
        }

    ...

    Well, unless one wants to argue that there are more efficient ways to
    deal with 128-bit integer math in RISC-V.
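    As a runnable cross-check, the same shift logic in C (hypothetical
    helper; the negative-shift case mirrors the shift semantics of my ISA
    and is stubbed out here, and a guard is added for shifts that mask to
    0 so no shift-by-64 happens):

    ```c
    #include <stdint.h>

    /* 128-bit left shift over two 64-bit halves, following the
       pseudo-code above.  Hypothetical helper for illustration. */
    static void shl128(uint64_t in_hi, uint64_t in_lo, int shl,
                       uint64_t *out_hi, uint64_t *out_lo)
    {
        if (shl > 0) {
            shl &= 127;
            if (shl == 0) {               /* e.g. shl==128 masked to 0 */
                *out_hi = in_hi;
                *out_lo = in_lo;
            } else if (shl >= 64) {
                *out_hi = in_lo << (shl & 63);
                *out_lo = 0;
            } else {
                *out_hi = (in_hi << shl) | (in_lo >> (64 - shl));
                *out_lo = in_lo << shl;
            }
        } else {
            /* shl==0 passes through; a negative shift would flip sign
               and call the opposite-direction helper (omitted) */
            *out_hi = in_hi;
            *out_lo = in_lo;
        }
    }
    ```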


    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sat Apr 4 20:41:23 2026
    From Newsgroup: comp.arch

    On 4/4/2026 1:10 PM, MitchAlsup wrote:

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64


    I am not sure those are the only choices. Go back to the days of 32 bit registers. Many/most of the architectures of that era had good support
    for both 32 bit single precision and 64 bit double precision. So why
    now, in the era of 64 bit registers, can't you have good support for 64
    and 128 bit floating point? i.e. what's different?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Apr 5 00:31:48 2026
    From Newsgroup: comp.arch

    On 2026-04-04 9:06 p.m., BGB wrote:
    On 4/4/2026 3:10 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/4/2026 10:54 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/3/2026 2:30 PM, Stefan Monnier wrote:
    MitchAlsup [2026-04-02 22:28:22] wrote:
    -------------------
    Can note that at present, a lot of the "__int128" code generated on RV ends up looking like:
         Load value from the stack;
         Load other value from the stack;
         Copy into argument registers;
         Copy into argument registers;
         Call Int128 support function;
         Copy return value into other registers;
         Store back to the stack (so it can repeat this whole process).
    Which, granted, there isn't that much of a better way to do this...


    Though, almost starts to make one question if on RV it would almost be better to skip keeping int128/float128 in registers, and instead be like:
         ADDI X10, SP, DispDst   //load destination address
         ADDI X11, SP, DispSrc1  //load first source address
         ADDI X12, SP, DispSrc2
         JAL  X1, __xli_add_3m   //do operation in memory

    This is an argument against separate register files. Since FP128
    is nothing like FP32 or FP64 and FP32 or FP64 instructions do not
    help in calculations, ...

    It also shows that once something does not fit the register format
    one is better off passing via pointers. ...

    It is a great tradeoff of register file size:
        32 GPRs;
        12 callee save registers;
        Register pairing.
          32 GPRs
          16 callee save registers
          No register pairing

    Result:
    Huge register pressure and a whole lot of spill-and-fill...

    My 66000 has less spill fill than RISC-V even without FPRs.


    Possibly true.

    As for Callee-Save GPRs:
      XG1   : 15 (31 with XGPR)
      XG2   : 31
      XG3   : 28
      RISC-V: 12 (with 0/12/16 more on the FPR side)
        LP64 ABI: 12+0
        LP64D ABI: 12+12
        BGBCC's ABI: 12+16


    Also RISC-V mode has the highest spill-and-fill.

    Not usually too bad, but trying to put Int128 in GPRs causes it to
    become absurd. Likely better to move Int128 to FPRs despite being an
    integer type, will need to look into this.


    While theoretically could hold 6 pairs, due to needing registers for
    other stuff it is closer to 3 or 4.

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64

    Right now, one needs great FP64; and the occasional FP128.

    Choose wisely.


    FP64 + poor FP128 is the better option IMO.

    Qupls has support for both binary and decimal float128 in the ISA, but
    I have not bothered to implement either yet. Binary floats support
    multiple precisions, whereas there is only a 128-bit decimal float. This is
    on the assumption that one would want lots of precision with decimal
    floats, not performance. There are also opcodes reserved for 64-bit
    posits.

    Qupls uses a seven-bit base opcode space which is mostly full. All the
    different precisions, data types, and support for SIMD and vectors
    really use up opcode space. Opcode usage is fairly regular, however: the
    same func code represents the same instruction in different precisions.

    The cases where FP128 is used are so rare as to make its performance
    mostly irrelevant, but in this case, did realize a need for full
    IEEE-754 support, which meant likely needing to redo this stuff in C
    (also relevant to get it working on RISC-V).

    Didn't necessarily want 3 different ASM versions:
      XG1/2, RISC-V, and XG3.

    Even if RISC-V is kind of a poor ISA to support this kind of thing.


    Could potentially put the Int128 values off in FPR space, since there is less pressure there, and not like RV can actually do much with the
    values in GPRs anyways.

    You need 64×64->128, 128 << or >>, 128+128->128 for FP128 emulation
    So, either
    a) you have a complete integer ISA on FPRs
    b) you do FP128 emulation in GPRs

    You can move from FPRs to GPRs on function calls, and back on return:
      MV  X10, F18
      MV  X11, F19
      MV  X12, F20
      MV  X13, F21
      JAL X1, __whatever
      MV  F22, X10
      MV  F23, X11

    In the case of RISC-V, sorta works since typically pretty much nothing
    is done inline in this case (and at least the ability to freely move
    stuff between registers isn't an issue in RV64G).


    Hee, hee. I have gone with 128 GPRs for Qupls. So, no trouble
    using register pairs or quads for data. I figured it was better to
    expose the registers that might be used for vector storage and allow
    them to be used for other purposes. Vector instructions are still code
    dense: the instruction specifies only the first register of the vector,
    and the machine increments the register number for the required vector
    length.
    It is really quasi-vectors in Qupls. A vector instruction gets converted
    into multiple SIMD instructions as needed. However, some vector
    instructions cannot be done, like arbitrary vector slides.

    Though, within the ASM support functions, still limited to using GPRs
    (since the integer ops are used and only work on GPRs).



    Ironically though, this is one area where having the Q extension is
    actively worse than not having the Q extension:
    F/D only: Can freely move values between GPRs and FPRs;
    Q extension: This goes away, can only use memory loads/stores to move
    the values between register types in this case...


    As for multiply, RV has:
      MUL  : 64*64->64, low results
      MULHU: 64*64->64, high results

    For Float128 with IEEE rounding, needed a 128*128->256 bit multiply.

    Noted that it would be faster in this case to have a single multiply
    that produces both high-and-low results at the same time, than to
    emulate a 128-bit MULHU, as the MULHU would effectively still need to calculate the low-half to get carry propagation correct.

    Ended up doing this part in ASM partly to make it less needlessly slow.
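The "both halves at once" point can be sketched in C. This is an illustration only (function names are made up): `unsigned __int128` (a GCC/Clang extension) stands in for what would be a MUL/MULHU pair per partial product on RV64, and the schoolbook structure shows why the high half of the 128-bit product cannot be computed without also propagating carries out of the low half.

```c
#include <stdint.h>

/* 64x64 -> 128 multiply returning both halves at once.
   On RV64 each call corresponds to a MUL (low) plus MULHU (high). */
static void mul64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    *lo = (uint64_t)p;
    *hi = (uint64_t)(p >> 64);
}

/* 128x128 -> 256 schoolbook multiply; r[0] is the lowest 64-bit limb.
   Note the cross terms feed carries from the low limbs into the high
   ones, which is why a 128-bit "MULHU" still has to compute the low
   half internally. */
void mul128x128(uint64_t a_hi, uint64_t a_lo,
                uint64_t b_hi, uint64_t b_lo, uint64_t r[4])
{
    uint64_t hi, lo;
    unsigned __int128 t;

    mul64x64(a_lo, b_lo, &r[1], &r[0]);
    mul64x64(a_hi, b_hi, &r[3], &r[2]);

    mul64x64(a_lo, b_hi, &hi, &lo);       /* cross term 1, offset 64 */
    t = (unsigned __int128)r[1] + lo;
    r[1] = (uint64_t)t;
    t = (unsigned __int128)r[2] + hi + (uint64_t)(t >> 64);
    r[2] = (uint64_t)t;
    r[3] += (uint64_t)(t >> 64);

    mul64x64(a_hi, b_lo, &hi, &lo);       /* cross term 2, offset 64 */
    t = (unsigned __int128)r[1] + lo;
    r[1] = (uint64_t)t;
    t = (unsigned __int128)r[2] + hi + (uint64_t)(t >> 64);
    r[2] = (uint64_t)t;
    r[3] += (uint64_t)(t >> 64);
}
```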



    -------------------
    Basically, only really AND/OR/XOR can be done inline.

    In your ISA,


    In this case, in RV64G mode...

    In XG3 the situation doesn't suck nearly so badly.


    ADD/SUB/NEG: Logic too complicated, needs a function call.

    LoL,

    Shifts: Likewise, also needs multiple branch-paths depending on the
    shift amount.

    LoL,  Wrong ISA model.

    Predication would help here...

    But, RISC-V only has BEQ/BNE/BLT/... to resolve this issue.

    So, no predication, or conditional-select, or ...


    So, one needs paths, say:
      Shift is negative;
      Shift is 0 bits;
      Shift is 1 to 63 bits;
      Shift is 64 to 127 bits;
      ...
    For 128+, can mask to 7 bits after verifying that the shift is positive.

    Checking the sign of the shifts is needed to match the behavior of the
    shift ops in my ISA, where negative shifts go the opposite direction
    (and I already needed to deal specially with 0).

    So, say (pseudo-code, I actually wrote this part in ASM):
      if(shl>0)
      {
        shl&=127;
        if(shl>=64)
        {
          out_hi=in_lo<<(shl&63);
          out_lo=0;
        }else
        {
          out_hi=(in_hi<<(shl&63))|(in_lo>>(64-(shl&63)));
          out_lo=in_lo<<(shl&63);
        }
      }else
      {
        if(shl==0)
        {
          out_hi=in_hi;
          out_lo=in_lo;
        }else
        {
           //flip sign and call opposite shift
        }
      }

    ...

    Well, unless one wants to argue that there are more efficient ways to
    deal with 128-bit integer math in RISC-V.


    I need to review shifting for Qupls. It is being done in a somewhat
    simple manner, with RTL code for both right and left shifts. I would hide
    the negative shift behind an ISA instruction which has a positive shift.
    That is, both left and right shifts rotate left internally
    (micro-architecturally), but the programmer does not see that; they see
    only shifts with positive shift amounts. Negative shift amounts can be confusing.

    Love the power of micro-ops.
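The rotate-left trick can be modeled in C (a sketch of the idea, not actual Qupls RTL; names are made up): a logical right shift by n is a left rotate by 64-n followed by masking off the bits that wrapped around, so one internal rotator serves both shift directions.

```c
#include <stdint.h>

/* Left rotate, well-defined for any n (masked to 0..63). */
static inline uint64_t rotl64(uint64_t x, unsigned n)
{
    n &= 63;
    return n ? (x << n) | (x >> (64 - n)) : x;
}

/* Logical right shift modeled as a left rotate plus mask:
   the datapath only ever rotates left; the mask discards the
   wrapped-around bits that a true shift would have dropped. */
uint64_t srl_via_rotl(uint64_t x, unsigned n)
{
    n &= 63;
    uint64_t mask = n ? (~0ULL >> n) : ~0ULL;
    return rotl64(x, (64 - n) & 63) & mask;
}
```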



    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Apr 5 00:40:28 2026
    From Newsgroup: comp.arch

    On 2026-04-04 11:41 p.m., Stephen Fuld wrote:
    On 4/4/2026 1:10 PM, MitchAlsup wrote:

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64


    I am not sure those are the only choices.  Go back to the days of 32 bit registers.  Many/most of the architectures of that era had good support
    for both 32 bit single precision and 64 bit double precision.  So why
    now, in the era of 64 bit registers, can't you have good support for 64
    and 128 bit floating point?  i.e. what's different?


    IMO the difference may be the usefulness for hardware cost (value). It
    takes twice as much hardware for something that may be only rarely used.
    I think there was good demand for 32-bit machines, a little bit less for 64-bit and even less for 128-bit.
    Wide registers are handy for handling wider data, but the float
    precision may not be as in demand. With 512-bit wide SIMD registers,
    should there be 512-bit floating-point precision?

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sat Apr 4 22:12:25 2026
    From Newsgroup: comp.arch

    On 4/4/2026 9:40 PM, Robert Finch wrote:
    On 2026-04-04 11:41 p.m., Stephen Fuld wrote:
    On 4/4/2026 1:10 PM, MitchAlsup wrote:

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64


    I am not sure those are the only choices.  Go back to the days of 32
    bit registers.  Many/most of the architectures of that era had good
    support for both 32 bit single precision and 64 bit double precision.
    So why now, in the era of 64 bit registers, can't you have good
    support for 64 and 128 bit floating point?  i.e. what's different?


    IMO the difference may be the usefulness for hardware cost (value). It
    takes twice as much hardware for something that may be only rarely used.
    I think there was good demand for 32-bit machines, a little bit less for 64-bit and even less for 128-bit.

    If you wanted to add a fourth choice to Mitch's list - something like
    Great FP 64 and 128 but at a significant extra hardware cost, I would
    accept that. I wasn't commenting on the utility of implementing FP128,
    only Mitch's alternatives for such an implementation.


    Wide registers are handy for handling wider data, but the float
    precision may not be as in demand. With 512-bit wide SIMD register,
    should there be 512-bit floating-point precision?

    Probably not, as I see essentially no demand for 512 bit floats. I
    don't know what the demand for 128 bit floats is, but Mitch's post seems
    to assume there was enough demand to implement it, and was discussing alternative implementations.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Apr 5 07:58:23 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64

    I would consider a single fully pipelined 128-bit FMA unit in
    parallel to the usual FP64 units great, and it should not
    compromise FP64.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Apr 5 12:01:37 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/4/2026 10:54 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/3/2026 2:30 PM, Stefan Monnier wrote:
    MitchAlsup [2026-04-02 22:28:22] wrote:
    -------------------
    Can note that at present, a lot of the "__int128" code generated on RV ends up looking like:
    Load value from the stack;
    Load other value from the stack;
    Copy into argument registers;
    Copy into argument registers;
    Call Int128 support function;
    Copy return value into other registers;
    Store back to the stack (so it can repeat this whole process).

    Which, granted, there isn't that much of a better way to do this...


    Though, almost starts to make one question if on RV it would almost be better to skip keeping int128/float128 in registers, and instead be like:
    ADDI X10, SP, DispDst //load destination address
    ADDI X11, SP, DispSrc1 //load first source address
    ADDI X12, SP, DispSrc2
    JAL X1, __xli_add_3m //do operation in memory

    This is an argument against separate register files. Since FP128
    is nothing like FP32 or FP64 and FP32 or FP64 instructions do not
    help in calculations, ...

    It also shows that once something does not fit the register format
    one is better off passing via pointers. ...

    It is a great tradeoff of register file size:
    32 GPRs;
    12 callee save registers;
    Register pairing.
    32 GPRs
    16 callee save registers
    No register pairing

    Result:
    Huge register pressure and a whole lot of spill-and-fill...

    My 66000 has less spill fill than RISC-V even without FPRs.

    While theoretically could hold 6 pairs, due to needing registers for
    other stuff it is closer to 3 or 4.

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64

    Right now, one needs great FP64; and the occasional FP128.
    I believe that the best compromise, both today and for the next 10 years,
    is to have great f64 which includes Augmented[Addition|Multiplication].
    Those two were added in 754-2019; they enable exact/arbitrary-precision arithmetic with very little overhead on an FPU which already supports
    FMAC (which was officially included in the 2008 revision of the standard).
    On top of this you need the occasional full f128 operation in order to
    get the extended exponent range, but this is much less common than just
    an extended mantissa.
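For concreteness: augmentedAddition returns the rounded sum together with the exact rounding error. On hardware without it, the classic software analogue is Knuth's TwoSum, six ordinary f64 adds under round-to-nearest (the 754-2019 operation additionally changes the tie-breaking rule to ties-toward-zero, so this is an analogue, not an exact substitute):

```c
/* Knuth's TwoSum: computes s = fl(a+b) and e such that
   a + b == s + e exactly, using only ordinary double adds.
   This is the software fallback for augmentedAddition. */
void two_sum(double a, double b, double *s, double *e)
{
    *s = a + b;
    double bb = *s - a;           /* the part of b that made it into s */
    *e = (a - (*s - bb)) + (b - bb);  /* what was rounded away */
}
```

Chaining TwoSum (and its FMA-based multiplication counterpart) is what makes double-double and arbitrary-precision arithmetic cheap on an FMA-capable FPU.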

    Choose wisely.

    Could potentially put the Int128 values off in FPR space, since there is
    less pressure there, and not like RV can actually do much with the
    values in GPRs anyways.

    You need 64×64->128, 128 << or >>, 128+128->128 for FP128 emulation
    So, either
    a) you have a complete integer ISA on FPRs
    b) you do FP128 emulation in GPRs
    If you emulate f128, then using pairs of 64-bit integer regs is the
    easiest solution. If you have a single register set then it becomes much easier to mix integer/logic ops with fp operations, i.e. to implement
    special functions where an f64 result can be used as a starting point for one or two NR iterations.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Apr 5 14:46:40 2026
    From Newsgroup: comp.arch

    On Sun, 5 Apr 2026 12:01:37 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/4/2026 10:54 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/3/2026 2:30 PM, Stefan Monnier wrote:
    MitchAlsup [2026-04-02 22:28:22] wrote:
    -------------------
    Can note that at present, a lot of the "__int128" code generated
    on RV ends up looking like:
    Load value from the stack;
    Load other value from the stack;
    Copy into argument registers;
    Copy into argument registers;
    Call Int128 support function;
    Copy return value into other registers;
    Store back to the stack (so it can repeat this whole
    process).

    Which, granted, there isn't that much of a better way to do
    this...


    Though, almost starts to make one question if on RV it would
    almost be better to skip keeping int128/float128 in registers,
    and instead be like: ADDI X10, SP, DispDst //load destination
    address ADDI X11, SP, DispSrc1 //load first source address
    ADDI X12, SP, DispSrc2
    JAL X1, __xli_add_3m //do operation in memory

    This is an argument against separate register files. Since FP128
    is nothing like FP32 or FP64 and FP32 or FP64 instructions do not
    help in calculations, ...

    It also shows that once something does not fit the register format
    one is better off passing via pointers. ...

    It is a great tradeoff of register file size:
    32 GPRs;
    12 callee save registers;
    Register pairing.
    32 GPRs
    16 callee save registers
    No register pairing

    Result:
    Huge register pressure and a whole lot of spill-and-fill...

    My 66000 has less spill fill than RISC-V even without FPRs.

    While theoretically could hold 6 pairs, due to needing registers
    for other stuff it is closer to 3 or 4.

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64

    Right now, one needs great FP64; and the occasional FP128.

    I believe that the best compromise both today and for the next 10
    years is to have great f64 which includes
    Augmented[Addition|Multiplication].

    Those two were added in 754-2019, they enable exact/arbitrary
    precision arithmetic with very little overhead on an fpu which
    already supports FMAC (which was officially included in the standard
    in 1998 or 2008).

    On top of this you need the occasional full f128 operation in order
    to get the extended exponent range, but this is much less common than
    just an extended mantissa.

    Choose wisely.

    Could potentially put the Int128 values off in FPR space, since
    there is less pressure there, and not like RV can actually do much
    with the values in GPRs anyways.

    You need 64×64->128, 128 << or >>, 128+128->128 for FP128 emulation
    So, either
    a) you have a complete integer ISA on FPRs
    b) you do FP128 emulation in GPRs

    If you emulate f128, then using pairs of 64-bit integer regs is the
    easiest solution. If you have a single register set then it becomes
    much easier to mix integer/logic ops with fp operations, i.e. to
    implement special functions where a f64 result can be used as a
    starting point for one or two NR iterations.

    Terje

    On modern CPUs, i.e. Zen3 and better, integer division is a better
    (than fp64 fdiv) starting point for fp128 division.
    For sqrtq/rsqrtq, an approximate fp64 rsqrt is probably the best starting
    step, but not every ISA has it. For example, x86-64 only has it
    with AVX512, and even there the precision is only 28 bits. 28 bits is an
    excellent starting point for fp64 sqrt/rsqrt, but for fp128 it's not
    quite sufficient. I'd prefer 31-32 bits, in order to get the exact value
    after 2 NR iterations for >99.9% of the inputs.
    The cost of moving things between FP and general registers that you
    mentioned above is hardly above the noise floor, esp. for slower operations
    like sqrt and div.
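The NR step in question, for reciprocal refinement, roughly doubles the number of correct bits per iteration, which is why a 28-bit seed (~56 bits after one step, ~112 after two) falls just short of fp128's 113-bit mantissa while a 31-32 bit seed suffices. A scalar f64 sketch of the iteration (illustration only; a real fp128 path would carry it in extended precision):

```c
/* One Newton-Raphson step refining r ~= 1/d:
   r' = r * (2 - d*r). The relative error squares each step,
   so correct bits roughly double per iteration. */
double nr_recip_step(double d, double r)
{
    return r * (2.0 - d * r);
}
```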
    However, avoidance of FP could have other advantages. The most
    important is that FP affects flags, and in current ABIs flags are shared
    between fp128 and fp64/fp32. If one uses the FPU for emulation of fp128,
    then setting flags in the IEEE-prescribed manner becomes quite difficult,
    especially so for Inexact - the most useless of them all, but still
    mandatory according to IEEE.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Apr 5 13:30:42 2026
    From Newsgroup: comp.arch

    On 4/4/2026 11:31 PM, Robert Finch wrote:
    On 2026-04-04 9:06 p.m., BGB wrote:
    On 4/4/2026 3:10 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/4/2026 10:54 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/3/2026 2:30 PM, Stefan Monnier wrote:
    MitchAlsup [2026-04-02 22:28:22] wrote:
    -------------------
    Can note that at present, a lot of the "__int128" code generated
    on RV
    ends up looking like:
         Load value from the stack;
         Load other value from the stack;
         Copy into argument registers;
         Copy into argument registers;
         Call Int128 support function;
         Copy return value into other registers;
         Store back to the stack (so it can repeat this whole process).
    Which, granted, there isn't that much of a better way to do this...

    Though, almost starts to make one question if on RV it would
    almost be
    better to skip keeping int128/float128 in registers, and instead
    be like:
         ADDI X10, SP, DispDst   //load destination address
         ADDI X11, SP, DispSrc1  //load first source address
         ADDI X12, SP, DispSrc2
         JAL  X1, __xli_add_3m   //do operation in memory

    This is an argument against separate register files. Since FP128
    is nothing like FP32 or FP64 and FP32 or FP64 instructions do not
    help in calculations, ...

    It also shows that once something does not fit the register format
    one is better off passing via pointers. ...

    It is a great tradeoff of register file size:
        32 GPRs;
        12 callee save registers;
        Register pairing.
          32 GPRs
          16 callee save registers
          No register pairing

    Result:
    Huge register pressure and a whole lot of spill-and-fill...

    My 66000 has less spill fill than RISC-V even without FPRs.


    Possibly true.

    As for Callee-Save GPRs:
       XG1   : 15 (31 with XGPR)
       XG2   : 31
       XG3   : 28
       RISC-V: 12 (with 0/12/16 more on the FPR side)
         LP64 ABI: 12+0
         LP64D ABI: 12+12
         BGBCC's ABI: 12+16


    Also RISC-V mode has the highest spill-and-fill.

    Not usually too bad, but trying to put Int128 in GPRs causes it to
    become absurd. Likely better to move Int128 to FPRs despite being an
    integer type, will need to look into this.


    While theoretically could hold 6 pairs, due to needing registers for
    other stuff it is closer to 3 or 4.

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64

    Right now, one needs great FP64; and the occasional FP128.

    Choose wisely.


    FP64 + poor FP128 is the better option IMO.

    Qupls has support for both binary and decimal float 128 in the ISA. But
    I have not bothered to implement either yet. binary floats support
    multiple precisions whereas there is only 128-bit decimal float. This is
    on the assumption that one would want lots of precision with decimal
    floats and not performance. There is also opcodes reserved for 64-bit posits.

    Using a seven-bit base opcode space which is mostly full. All the
    different precisions, data types, and support for SIMD and vectors
    really uses up opcode space. Opcode usage is fairly regular however. The same func code represents the same instruction in different precisions.


    OK.

    Current formats:
    BJX2:
    Binary64 (Scalar)
    Binary32 and Binary16: SIMD and Converters
    FP8/FP8U/FP8A: SIMD Converters
    Binary128: Software
    BFloat16: Faked in software
    RISC-V:
    Binary32 (F, Scalar | SIMD)
    Binary64 (D, Scalar)
    Binary16 (Zfh)
    Binary128 (Q; or "Pseudo Q" in my case)
    (Pseudo Q using the same encodings, but register pairs)
    (Implementation via trap-and-emulate, potential patching)
    Recently implemented: Plain software path.
    BFloat16 ("Zfbf16" or such)
    FP8 via RV-V (unsupported).

    FP-SIMD formats in my case:
    2x Binary32 (both ISAs, RV via F encodings)
    4x Binary16 (both ISAs, RV via Zfh encodings)
    4x Binary32 (Register Pair)
    Via converters:
    4x FP8/FP8U/FP8A (<> 4x Binary16)



    Spent a while trying to figure out why my new Float128 code wasn't
    working, thinking there was still an Int128 bug...

    Turns out I did a screw-up and was using the non-shifted mantissa for FADD/FSUB rather than the shifted one (which didn't exactly work correctly).


    The cases where FP128 are used are so rare as to make its performance
    mostly irrelevant, but in this case, did realize a need for full
    IEEE-754 support, which meant likely needing to redo this stuff in C
    (also relevant to get it working on RISC-V).

    Didn't necessarily want 3 different ASM versions:
       XG1/2, RISC-V, and XG3.

    Even if RISC-V is kind of a poor ISA to support this kind of thing.


    Could potentially put the Int128 values off in FPR space, since
    there is
    less pressure there, and not like RV can actually do much with the
    values in GPRs anyways.

    You need 64×64->128, 128 << or >>, 128+128->128 for FP128 emulation
    So, either
    a) you have a complete integer ISA on FPRs
    b) you do FP128 emulation in GPRs

    You can move from FPRs to GPRs on function calls, and back on return:
       MV  X10, F18
       MV  X11, F19
       MV  X12, F20
       MV  X13, F21
       JAL X1, __whatever
       MV  F22, X10
       MV  F23, X11

    In the case of RISC-V, sorta works since typically pretty much nothing
    is done inline in this case (and at least the ability to freely move
    stuff between registers isn't an issue in RV64G).


    Hee, hee. I have gone with 128 GPR registers for Qupls. So, no trouble
    using register pairs or quads for data. I figured it was better to
    expose the registers that might be used for vector storage and allow
    them to be used for other purposes. Vector instructions are still code dense, taking the first register for the vector specified in the ISA instruction. The machine increments the register number for the vector length required.
    It is really quasi-vectors in Qupls. A vector instruction gets converted into multiple SIMD instructions as needed. However some vector
    instructions cannot be done, like arbitrary vector slides.


    Yeah.

    As-is:
    XG1: 32/64 registers
    Whole ISA only has 32 registers;
    Subset has 64 registers;
    Other ops can encode 64 registers via Jumbo Prefixes.
    Has 16-bit ops, mostly limited to R0..R15.
    Oldest and least orthogonal of the schemes.
    XG2: 64 registers
    Trades 16-bit encodings for making 64-bit GPRs orthogonal.
    XG3: 64 registers
    Mostly a repack of XG2 to coexist with RISC-V;
    Also switches to RISC-V's register numbering;
    Albeit with a unified register space.
    Canonically drops some parts of XG2.
    RISC-V: 32+32 registers.
    RV-V would add 32x more, don't currently support RV-V,
    If I did, would likely be via software emulation.

    Register spaces:
    XG1/XG2:
    R0/R1: Stomp Registers, Special | N/E in some contexts.
    Functionally, scratch registers with encoding restrictions.
    R2..R7: Scratch
    R4..R7: Arg0..Arg3
    R4..R14: Callee Save
    R15: SP
    R16..R23: Scratch
    R20..R23: Arg4..Arg7
    R24..R31: Callee Save

    R32..R39: Scratch
    R36..R39: Arg8..Arg11
    R40..R47: Callee Save
    R48..R55: Scratch
    R52..R55: Arg12..Arg15
    R56..R63: Callee Save

    XG3 / RISC-V
    R0/X0 (ZR): Zero
    R1/X1 (LR/RA): Link Register
    R2/X2 (SP): Stack pointer
    R3/X3 (GBR/GP): Global Pointer
    R4/X4 (TP): Task Pointer
    R5..R7: Scratch (Stomp Registers in BGBCC)
    R8/R9: Callee Save
    R10..R17: Scratch / Arg0..Arg7
    R18..R27: Callee Save
    R28..R31: Scratch
    R32/F0 .. R35/F3: Scratch (Stomp in BGBCC)
    R36/F4 .. R39/F7: Callee Save (BGBCC)
    R40/F8 .. R41/F9: Callee Save
    R42/F10 .. R49/F17: Scratch (potential Arg8..Arg15)
    R50/F18 .. R59/F27: Callee Save
    F60/F28 .. R63/F31: Scratch

    There is not a 1:1 mapping between the XG1/XG2 and RV/XG3 register spaces.

    As-is, RV -> XG1/2 space:
    X0 -> ZR (N/E)
    X1 -> LR (CR)
    X2 -> SP (R15)
    X3 -> GBR (CR)
    X4..X13: R4..R13
    X14/X15: R2/R3
    X16..X31: R16..R31
    F0..F31: R32..R63

    The XG1/XG2 registers for R0, R1, and R14, are inaccessible on the
    RV/XG3 side.

    As-is, this means a core hosted using RV or XG3 in effect couldn't run
    XG1 or XG2 code natively, as some of the registers could not be
    preserved on context switches (absent adding some other mechanism to
    access them).


    Comparison if RV and XG3 instruction layouts:
    ZZZZttttttssssssZZZZnnnnnnYYYYpp (XG3, 3R)
    ZZZZZZZtttttsssssZZZnnnnnYYYYY11 (RV, 3R)
    iiiiiiiiiissssssZZZZnnnnnnYYYYpp (XG3, 3RI, Imm10)
    iiiiiiiiiiiisssssZZZnnnnnYYYYY11 (RV, 3RI, Imm12)

    Divergent:
    iiiiiiiiiiiiiiiiZZZZnnnnnnYYYYpp (XG3, 2RI-Imm16)
    iiiiiiiiiiiiiiiiZZZZiiiiiiiYYYpp (XG3, Imm23, BRA/BSR)
    iiiiiiiiiiiiiiiiiiiinnnnnYYYYY11 (RV, 2RI-Imm20; JAL/LUI/AUIPC)
    Despite superficial similarity, JAL uses a different Disp layout.
    RV's JAL displacement is horribly dog-chewed.


    RISC-V shuffles the Imm12 bits around depending on the instruction type;
    for XG3, the layout remains unchanged (Load/Store/ALU/Bcc all use the
    same layout in XG3, though the scale may differ).

    In XG3 encodings, the assumption is that the scale of the immediate
    differs; in RISC-V, the encoding bits move around instead.
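The "dog-chewed" J-type layout can be made concrete: JAL's 20 immediate bits sit in inst[31:12] in the order imm[20|10:1|11|19:12], so a decoder has to reassemble them piecewise. A sketch, assuming the standard RV32/RV64 encoding:

```c
#include <stdint.h>

/* Reassemble the sign-extended JAL offset from a 32-bit RISC-V
   instruction word. The J-type format packs imm[20|10:1|11|19:12]
   into inst[31:12], so every field lands at a different position. */
int32_t jal_offset(uint32_t inst)
{
    uint32_t imm = (((inst >> 31) & 0x001) << 20)  /* imm[20]    */
                 | (((inst >> 21) & 0x3FF) << 1)   /* imm[10:1]  */
                 | (((inst >> 20) & 0x001) << 11)  /* imm[11]    */
                 | (((inst >> 12) & 0x0FF) << 12); /* imm[19:12] */
    return (int32_t)(imm << 11) >> 11;  /* sign-extend from bit 20 */
}
```

The scattering exists so that the sign bit and most immediate bits stay in the same positions across B/J-type formats, simplifying the mux tree in hardware at the expense of software decoders.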



    Encoding was different in XG1 and XG2, divided into 16-bit words.

    XG1
    YYYYZZZZnnnnmmmm 2R
    YYYYZZZZnnnniiii 2RI Imm4
    YYYYnnnniiiiiiii 2RI Imm8 (ADD)
    YYYYZZZZiiiiiiii Imm8
    YYYYiiiiiiiiiiii Imm12 (LDI Imm, R0)
    111PYwYYnnnnmmmm ZZZZqnmoooooZZZZ //3R
    111PYwYYnnnnmmmm ZZZZqnmiiiiiiiii //3RI Imm9
    111PYwYYZZZnnnnn iiiiiiiiiiiiiiii //2RI Imm16
    111PYwYYiiiiiiii ZZZZiiiiiiiiiiii //Imm20 (BRA/BSR)

    XG2:
    NMOPYwYYnnnnmmmm ZZZZqnmoooooZZZZ //3R
    NMIPYwYYnnnnmmmm ZZZZqnmiiiiiiiii //3RI Imm10
    NZZPYwYYZZZnnnnn iiiiiiiiiiiiiiii //2RI Imm16
    IIIPYwYYiiiiiiii ZZZZiiiiiiiiiiii //Imm23 (BRA/BSR)


    Note that XG1's 32-bit encodings are still valid in XG2, and XG2's
    encodings mostly map up to XG3 (though there are things that are N/E
    between them).


    Though, within the ASM support functions, still limited to using GPRs
    (since the integer ops are used and only work on GPRs).



    Ironically though, this is one area where having the Q extension is
    actively worse than not having the Q extension:
    F/D only: Can freely move values between GPRs and FPRs;
    Q extension: This goes away, can only use memory loads/stores to move
    the values between register types in this case...


    As for multiply, RV has:
       MUL  : 64*64->64, low results
       MULHU: 64*64->64, high results

    For Float128 with IEEE rounding, needed a 128*128->256 bit multiply.

    Noted that it would be faster in this case to have a single multiply
    that produces both high-and-low results at the same time, than to
    emulate a 128-bit MULHU, as the MULHU would effectively still need to
    calculate the low-half to get carry propagation correct.

    Ended up doing this part in ASM partly to make it less needlessly slow.



    -------------------
    Basically, only really AND/OR/XOR can be done inline.

    In your ISA,


    In this case, in RV64G mode...

    In XG3 the situation doesn't suck nearly so badly.


    ADD/SUB/NEG: Logic too complicated, needs a function call.

    LoL,

    Shifts: Likewise, also needs multiple branch-paths depending on the
    shift amount.

    LoL,  Wrong ISA model.

    Predication would help here...

    But, RISC-V only has BEQ/BNE/BLT/... to resolve this issue.

    So, no predication, or conditional-select, or ...


    So, one needs paths, say:
       Shift is negative;
       Shift is 0 bits;
       Shift is 1 to 63 bits;
       Shift is 64 to 127 bits;
       ...
    For 128+, can mask to 7 bits after verifying that the shift is positive.

    Checking the sign of the shifts is needed to match the behavior of the
    shift ops in my ISA, where negative shifts go the opposite direction
    (and I already needed to deal specially with 0).

    So, say (pseudo-code, I actually wrote this part in ASM):
       if(shl>0)
       {
         shl&=127;
         if(shl>=64)
         {
            out_hi=in_lo<<(shl&63);
           out_lo=0;
         }else
         {
           out_hi=(in_hi<<(shl&63))|(in_lo>>(64-(shl&63)));
           out_lo=in_lo<<(shl&63);
         }
       }else
       {
         if(shl==0)
         {
           out_hi=in_hi;
           out_lo=in_lo;
         }else
         {
            //flip sign and call opposite shift
         }
       }

    ...

    Well, unless one wants to argue that there are more efficient ways to
    deal with 128-bit integer math in RISC-V.


    I need to review shifting for Qupls. It is being done in a somewhat
    simple manner with RTL code for both right and left shifts. I would
    hide the negative shift behind an ISA instruction which has a positive
    shift. That is, both left and right shifts rotate left internally
    (micro-architecturally?), but the programmer does not see that. They
    see only shifts with positive shift amounts. Negative shift amounts
    can be

    Love the power of micro-ops.


    In my case, initially I only had left-shift with right-shifted being
    expressed as a negative left shift (carried over from SH and similar).

    Later added right-shift instructions, but they follow the same pattern.

    So, say, for 64-bit shift:
    0.. 63: Shift left 0 to 63 bits;
    64.. 127: Shift left 0 to 63 bits (Mod-64);
    -1..-63: Shift right 1 to 63 bits;
    -64..-128: Shift right 0 to 63 bits (Mod-64);

    Generally with higher bits being ignored (and values outside of
    -128..127 being undefined).


    For 128-bit shift (via ALUX):
    0.. 127: Shift left 0 to 127 bits;
    -1..-127: Shift right 1 to 127 bits;
    Anything outside this range being undefined.

    In practice, the existing unit would alternate directions on overflow
    for the ALUX handling (but with the 128-bit shift instructions being N/E
    in RV land, and optional in XG3).


    There are sometimes cases where a negative shift becoming a shift in
    the opposite direction can be useful.

    Though, this differs from the canonical behavior in RISC-V (naive
    modulo shift, like in x86).

    For the support functions, a signed directional shift had been assumed,
    though differing slightly:
    Positive shifts: Always Mod-N in the same direction (eg, left);
    Negative shifts: Always Mod-N in the same direction (eg, right).
    Though, would still become undefined outside of Int32 range.

    ...


    Can note that it seems for my FP128 test program there is a quite
    significant code size difference between the RV and XG3 versions (even
    in the current absence of ALUX instructions, and with some of my own
    extensions on the RV side).

    ...

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Apr 5 14:23:48 2026
    From Newsgroup: comp.arch

    On 4/5/2026 12:12 AM, Stephen Fuld wrote:
    On 4/4/2026 9:40 PM, Robert Finch wrote:
    On 2026-04-04 11:41 p.m., Stephen Fuld wrote:
    On 4/4/2026 1:10 PM, MitchAlsup wrote:

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64


    I am not sure those are the only choices.  Go back to the days of 32
    bit registers.  Many/most of the architectures of that era had good
    support for both 32 bit single precision and 64 bit double precision.
    So why now, in the era of 64 bit registers, can't you have good
    support for 64 and 128 bit floating point?  i.e. what's different?


    IMO the difference may be the usefulness for hardware cost (value). It
    takes twice as much hardware for something that may be only rarely used.
    I think there was good demand for 32-bit machines, a little bit less
    for 64-bit and even less for 128-bit.

    If you wanted to add a fourth choice to Mitch's list - something like
    Great FP 64 and 128 but at a significant extra hardware cost, I would
    accept that.  I wasn't commenting on the utility of implementing FP128, only Mitch's alternatives for such an implementation.


    I could in theory do native FP128 with paired registers, similar to how
    some existing machines did native FP64 with paired 32-bit registers.

    But, yeah, they are not the same:
    FP64: Needed often.
    FP128: Needed rarely.


    Current cost for doing FP128 in hardware would be too expensive though.
    So, for now, it is software options.

    Which, as-is, are:
    Pseudo-Q option:
    Pretend the Q ops exist, but operate on pairs;
    Trap and emulate them;
    Better code density but slower (due to trap overheads).
    Other option:
    Runtime calls (recently implemented for the RV case);
    Awful code density but slightly less slow in this case.



    Wide registers are handy for handling wider data, but the float
    precision may not be as in demand. With 512-bit wide SIMD register,
    should there be 512-bit floating-point precision?

    Probably not, as I see essentially no demand for 512 bit floats.  I
    don't know what the demand for 128 bit floats is, but Mitch's post seems
    to assume there was enough demand to implement it, and was discussing alternative implementations.


    Yeah:
    FP64 : Semi-common workhorse;
    FP128: Rare, when FP64 isn't enough;
    FP256: Universe-scale numbers.
    FP512: What even would it be used for?...

    Like, say, with FP256 you could already express a coordinate space
    effectively covering the size of the observable universe.

    OTOH, it isn't that much harder to support FP256 than to support FP128,
    if one has a way to express 256-bit integer math. It is, of course,
    slower...


    Though one possible optimization would be to detect cases where the
    wider formats could fall back to narrower math.

    Where, FP128 and FP256 could have enough general overhead to make it worthwhile to detect these sorts of narrowing cases.




    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Apr 5 20:29:50 2026
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 4/4/2026 1:10 PM, MitchAlsup wrote:

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64


    I am not sure those are the only choices. Go back to the days of 32 bit registers. Many/most of the architectures of that era had good support
    for both 32 bit single precision and 64 bit double precision. So why
    now, in the era of 64 bit registers, can't you have good support for 64
    and 128 bit floating point? i.e. what's different?

    In the 32-bit era, there was very significant demand for 64-bit FP
    and we were just waiting for silicon acreage to do 64-bits;
    whereas,
    In the 64-bit era, there is a bit of demand for FP128 to make it not
    stink to high heaven, but not enough to even consider making everything 128-bits.

    So in a) we had to have useful FP64 in an essentially 32-bit machine;
    but right now, in b), there is no essentiality: just don't screw it up
    enough that it can't be useful on occasion.

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Apr 5 20:40:25 2026
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/4/2026 10:54 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/3/2026 2:30 PM, Stefan Monnier wrote:
    MitchAlsup [2026-04-02 22:28:22] wrote:
    -------------------
    Can note that at present, a lot of the "__int128" code generated on RV
    ends up looking like:
    Load value from the stack;
    Load other value from the stack;
    Copy into argument registers;
    Copy into argument registers;
    Call Int128 support function;
    Copy return value into other registers;
    Store back to the stack (so it can repeat this whole process).

    Which, granted, there isn't that much of a better way to do this...


    Though, it almost starts to make one question if, on RV, it would be
    better to skip keeping int128/float128 in registers, and instead be like:
    ADDI X10, SP, DispDst //load destination address
    ADDI X11, SP, DispSrc1 //load first source address
    ADDI X12, SP, DispSrc2
    JAL X1, __xli_add_3m //do operation in memory

    This is an argument against separate register files. Since FP128
    is nothing like FP32 or FP64 and FP32 or FP64 instructions do not
    help in calculations, ...

    It also shows that once something does not fit the register format
    one is better off passing via pointers. ...

    It is a great tradeoff of register file size:
    32 GPRs;
    12 callee save registers;
    Register pairing.
    32 GPRs
    16 callee save registers
    No register pairing

    Result:
    Huge register pressure and a whole lot of spill-and-fill...

    My 66000 has less spill fill than RISC-V even without FPRs.

    While theoretically could hold 6 pairs, due to needing registers for
    other stuff it is closer to 3 or 4.

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64

    Right now, one needs great FP64; and the occasional FP128.

    I believe that the best compromise both today and for the next 10 years
    is to have great f64 which includes Augmented[Addition|Multiplication].

    As:

    GLOBAL Kahan_Babuška
    ENTRY Kahan_Babuška
    Kahan_Babuška:
    // note compiler put &residual in R3 at CALL
    LDD R4,[R3]
    CARRY R4,{IO}
    FADD R1,R1,R2 // will not set inexact
    STD R4,[R3]
    RET

    Inexact can be set in this usage of CARRY and FADD.


    Those two were added in 754-2019; they enable exact/arbitrary-precision
    arithmetic with very little overhead on an FPU which already supports
    FMAC (which was officially included in the standard in 2008).

    On top of this you need the occasional full f128 operation in order to
    get the extended exponent range, but this is much less common than just
    an extended mantissa.

    Choose wisely.

    Could potentially put the Int128 values off in FPR space, since there
    is less pressure there, and not like RV can actually do much with the
    values in GPRs anyways.

    You need 64×64->128, 128 << or >>, 128+128->128 for FP128 emulation
    So, either
    a) you have a complete integer ISA on FPRs
    b) you do FP128 emulation in GPRs

    If you emulate f128, then using pairs of 64-bit integer regs is the
    easiest solution. If you have a single register set then it becomes
    much easier to mix integer/logic ops with fp operations, i.e. to
    implement special functions where a f64 result can be used as a
    starting point for one or two NR iterations.

    Terje

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Apr 5 20:48:44 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Sun, 5 Apr 2026 12:01:37 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/4/2026 10:54 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/3/2026 2:30 PM, Stefan Monnier wrote:
    MitchAlsup [2026-04-02 22:28:22] wrote:
    -------------------
    Can note that at present, a lot of the "__int128" code generated
    on RV ends up looking like:
    Load value from the stack;
    Load other value from the stack;
    Copy into argument registers;
    Copy into argument registers;
    Call Int128 support function;
    Copy return value into other registers;
    Store back to the stack (so it can repeat this whole
    process).

    Which, granted, there isn't that much of a better way to do
    this...


    Though, almost starts to make one question if on RV it would
    almost be better to skip keeping int128/float128 in registers,
    and instead be like: ADDI X10, SP, DispDst //load destination
    address ADDI X11, SP, DispSrc1 //load first source address
    ADDI X12, SP, DispSrc2
    JAL X1, __xli_add_3m //do operation in memory

    This is an argument against separate register files. Since FP128
    is nothing like FP32 or FP64 and FP32 or FP64 instructions do not
    help in calculations, ...

    It also shows that once something does not fit the register format
    one is better off passing via pointers. ...

    It is a great tradeoff of register file size:
    32 GPRs;
    12 callee save registers;
    Register pairing.
    32 GPRs
    16 callee save registers
    No register pairing

    Result:
    Huge register pressure and a whole lot of spill-and-fill...

    My 66000 has less spill fill than RISC-V even without FPRs.

    While theoretically could hold 6 pairs, due to needing registers
    for other stuff it is closer to 3 or 4.

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64

    Right now, one needs great FP64; and the occasional FP128.

    I believe that the best compromise both today and for the next 10
    years is to have great f64 which includes Augmented[Addition|Multiplication].

    Those two were added in 754-2019; they enable exact/arbitrary-precision
    arithmetic with very little overhead on an FPU which already supports
    FMAC (which was officially included in the standard in 2008).

    On top of this you need the occasional full f128 operation in order
    to get the extended exponent range, but this is much less common than
    just an extended mantissa.

    Choose wisely.

    Could potentially put the Int128 values off in FPR space, since
    there is less pressure there, and not like RV can actually do much
    with the values in GPRs anyways.

    You need 64×64->128, 128 << or >>, 128+128->128 for FP128 emulation.
    So, either
    a) you have a complete integer ISA on FPRs
    b) you do FP128 emulation in GPRs

    If you emulate f128, then using pairs of 64-bit integer regs is the
    easiest solution. If you have a single register set then it becomes
    much easier to mix integer/logic ops with fp operations, i.e. to
    implement special functions where a f64 result can be used as a
    starting point for one or two NR iterations.

    Terje




    On modern CPUs, i.e. Zen3 and better, integer division is a better
    (than fp64 fdiv) starting point for fp128 division.

    For sqrtq/rsqrtq, an approximate fp64 rsqrt is probably the best
    starting step, but not every ISA has it. For example, x86-64 only has
    it with AVX512, and even there the precision is only 28 bits. 28 bits
    is an excellent starting point for fp64 sqrt/rsqrt, but for fp128 it's
    not quite sufficient. I'd prefer 31-32 bits, in order to get an exact
    value after 2 NR iterations for >99.9% of the inputs.

    The cost of moving things between FP and general registers that you
    mentioned above is hardly above the noise floor, esp. for slower
    operations like sqrt and div.

    However, avoidance of FP could have other advantages. The most
    important is that FP affects flags, and in current ABIs the flags are
    shared between fp128 and fp64/fp32. If one uses the FPU for emulation
    of fp128, then setting the flags in the IEEE-prescribed manner becomes
    quite difficult, especially so for Inexact: the most useless of them
    all, but still mandatory according to IEEE.

    There is also the problem of getting the inexact bit correct on FP128
    calculations (or exact FP64 calculations where the inexact bit is never
    to be set because no bits are lost at/in rounding).

    {double, double} TwoSum( double a, double b )
    {
    return a + b; // the pair {sum, residual} of a single augmented addition
    }

    Inexact should never be set in this sequence since no significance
    has been lost. Most ISAs fail miserably here, especially the ones that
    have to do this as:

    {double, double} TwoSum( double a, double b )
    { // Knuth
    x = a + b ;
    q = x - a ;
    r = x - q ;
    s = b - q ;
    t = a - r ;
    y = s + t ;
    return { x, y };
    }

    or

    {double, double} FastTwoSum( double a, double b )
    { // Dekker
    // ASSERT a > b
    x = a + b ;
    q = x - a ;
    y = b - q ;
    return { x, y };
    }

    All that "extra" arithmetic is to "get" the inexactness and put
    it back in the result-pair. Yet accessing the flags is so expensive
    that most ISAs (and accompanying SW) don't even try.

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Apr 6 14:50:39 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:
    Right now, one needs great FP64; and the occasional FP128.

    I believe that the best compromise both today and for the next 10 years
    is to have great f64 which includes Augmented[Addition|Multiplication].

    As:

    GLOBAL Kahan_Babuška
    ENTRY Kahan_Babuška
    Kahan_Babuška:
    // note compiler put &residual in R3 at CALL
    LDD R4,[R3]
    CARRY R4,{IO}
    FADD R1,R1,R2 // will not set inexact
    STD R4,[R3]
    RET

    Inexact can be set in this usage of CARRY and FADD.
    That is a lovely implementation of AugmentedAddition.
    It is probably clear by now that I simply love your CARRY feature. I
    just wish it could turn up on a CPU I can use as a daily driver.
    CARRY is one of those things which I wish that I had been able to think
    of, as opposed to most of the stuff that the patent office have allowed
    over the last few decades.
    (I.e. like "We have implemented a classic Kahan textbook algorithm in
    HW, please give us a patent for doing it.")
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Apr 6 08:36:39 2026
    From Newsgroup: comp.arch

    On 4/5/2026 1:29 PM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 4/4/2026 1:10 PM, MitchAlsup wrote:

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64


    I am not sure those are the only choices. Go back to the days of 32 bit
    registers. Many/most of the architectures of that era had good support
    for both 32 bit single precision and 64 bit double precision. So why
    now, in the era of 64 bit registers, can't you have good support for 64
    and 128 bit floating point? i.e. what's different?

    In the 32-bit era, there was very significant demand for 64-bit FP
    and we were just waiting for silicon acreage to do 64-bits;
    whereas,
    In the 64-bit era, there is a bit of demand for FP128 to make it not
    stink to high heaven, but not enough to even consider making everything 128-bits.

    So in a) we had to have useful FP64 in an essentially 32-bit machine;
    but right now, in b), there is no essentiality: just don't screw it up
    enough that it can't be useful on occasion.


    So I think you are agreeing with Robert Finch and me, that an
    alternative that existed, but that you rejected, for good reasons, would
    have been to provide great support for FP128.

    I agree with your decision to provide "a little" support for FP128. I
    would add that an additional requirement for the implementation is to do nothing to prevent a better, but backwards compatible, implementation in
    the future if the demand increases.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Apr 6 16:38:40 2026
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 4/5/2026 1:29 PM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 4/4/2026 1:10 PM, MitchAlsup wrote:

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64


    I am not sure those are the only choices. Go back to the days of 32 bit
    registers. Many/most of the architectures of that era had good support
    for both 32 bit single precision and 64 bit double precision. So why
    now, in the era of 64 bit registers, can't you have good support for 64
    and 128 bit floating point? i.e. what's different?

    In the 32-bit era, there was very significant demand for 64-bit FP
    and we were just waiting for silicon acreage to do 64-bits;
    whereas,
    In the 64-bit era, there is a bit of demand for FP128 to make it not
    stink to high heaven, but not enough to even consider making everything 128-bits.

    So in a) we had to have useful FP64 in an essentially 32-bit machine;
    but right now, in b), there is no essentiality: just don't screw it up
    enough that it can't be useful on occasion.


    So I think you are agreeing with Robert Finch and me, that an
    alternative that existed, but that you rejected, for good reasons, would have been to provide great support for FP128.

    I agree with your decision to provide "a little" support for FP128. I
    would add that an additional requirement for the implementation is to do nothing to prevent a better, but backwards compatible, implementation in
    the future if the demand increases.

    I accept that philosophy graciously.


    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Apr 6 16:25:06 2026
    From Newsgroup: comp.arch

    On 4/6/2026 10:36 AM, Stephen Fuld wrote:
    On 4/5/2026 1:29 PM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 4/4/2026 1:10 PM, MitchAlsup wrote:

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64


    I am not sure those are the only choices.  Go back to the days of 32 bit
    registers.  Many/most of the architectures of that era had good support
    for both 32 bit single precision and 64 bit double precision.  So why
    now, in the era of 64 bit registers, can't you have good support for 64
    and 128 bit floating point?  i.e. what's different?

    In the 32-bit era, there was very significant demand for 64-bit FP
    and we were just waiting for silicon acreage to do 64-bits;
    whereas,
    In the 64-bit era, there is a bit of demand for FP128 to make it not
    stink to high heaven, but not enough to even consider making everything
    128-bits.

    So in a) we had to have useful FP64 in an essentially 32-bit machine;
    but right now, in b), there is no essentiality: just don't screw it up
    enough that it can't be useful on occasion.


    So I think you are agreeing with Robert Finch and me, that an
    alternative that existed, but that you rejected, for good reasons, would have been to provide great support for FP128.

    I agree with your decision to provide "a little" support for FP128.  I would add that an additional requirement for the implementation is to do nothing to prevent a better, but backwards compatible, implementation in
    the future if the demand increases.


    IMO, this goal could still be served with, say, having Binary64 ops that
    work on pairs and use trap-and-emulate. The later hardware option being
    to have real hardware support instead.

    In the case of RISC-V, and my "Pseudo-Q" idea, it also remains fully compatible with the 'D' extension.


    Whereas, in the case of F/D vs F/D/Q, there would be potential for
    issues that would either break things or have a detrimental impact on
    the existing D extension:
    Moves between X and F registers no longer as well defined.
    There are issues that happen when XLEN!=FLEN.
    Context switching and ABI issues would result if mixing F/D and F/D/Q.


    Whereas, using pairs over Q's bigger FPRs avoids any potential for compatibility issues:
    FMV.X.D / FMV.D.X remain well defined;
    ABIs and context-switching remain unchanged.

    Tradeoffs:
    If working with FP128, there are effectively half as many FPU registers
    (16 vs 32).


    The bigger selling point for 128-bit registers would likely be if there
    were a whole lot of 4xFP32 SIMD, but as-is, there is probably not
    enough of this to justify bigger registers over register pairs.


    ...

    --- Synchronet 3.21f-Linux NewsLink 1.2