This draft paper looks at a bunch of GPU instruction sets and wonders
if we could come up with a widely implemented GPU instruction set
analogous to the ARM CPU instruction set.
Toward a Universal GPU Instruction Set Architecture: A Cross-Vendor
Analysis of Hardware-Invariant Computational Primitives in Parallel Processors
Ojima Abraham, Onyinye Okoli
We present the first systematic cross-vendor analysis of GPU
instruction set architectures spanning all four major GPU vendors:
NVIDIA (PTX ISA v1.0 through v9.2, Fermi through Blackwell), AMD (RDNA
1 to 4 and CDNA 1 to 4), Intel (Gen11, Xe-LP, Xe-HPG, Xe-HPC), and
Apple (G13, reverse-engineered). Drawing on official ISA reference
manuals, architecture whitepapers, patent filings, and community reverse-engineering efforts totaling over 5,000 pages of primary
sources across 16 distinct microarchitectures, we identify ten hardware-invariant computational primitives that appear across all
four architectures, six parameterizable dialects where vendors
implement identical concepts with different parameters, and six true architectural divergences representing fundamental design
disagreements. Based on this analysis, we propose an abstract
execution model for a vendor-neutral GPU ISA grounded in the physical constraints of parallel computation. We validate our model with
benchmark results on NVIDIA T4 and Apple M1 hardware, the two most architecturally distant platforms in our study. On five of six benchmark-platform pairs, the abstract model matches or exceeds native vendor-optimized performance. The single outlier (parallel reduction
on NVIDIA, 62.5% of native) reveals that intra-wave shuffle must be a mandatory primitive, a finding that refines our proposed model.
https://arxiv.org/abs/2603.28793
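The abstract's one outlier is worth making concrete. A hedged sketch (not from the paper): a log-step wave reduction sums 32 lanes in 5 steps when each lane can read another lane's register directly (what CUDA exposes as `__shfl_down_sync`); without that primitive, each step round-trips through shared memory. Here the wave is simulated with a plain array, so the loop shows only the dataflow, not the hardware.

```c
#include <assert.h>

#define WAVE 32   /* NVIDIA warp width; AMD waves are 32 or 64 */

/* Tree reduction across one simulated wave: at each step, lane i
   pulls in the partial sum from lane i+offset.  With a shuffle
   instruction each step is one register-to-register move per lane. */
static int wave_reduce_sum(int lanes[WAVE]) {
    for (int offset = WAVE / 2; offset > 0; offset >>= 1)
        for (int lane = 0; lane < offset; lane++)
            lanes[lane] += lanes[lane + offset];
    return lanes[0];   /* lane 0 ends up holding the total */
}
```

Five dependent steps instead of 31, which is why an abstract GPU ISA that omits shuffle pays the 62.5%-of-native penalty the authors measured.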
What I find missing is illustrative::
a) Texture LDs,
b) Rasterization,
c) Interpolation,
d) Transcendental instructions,
e) 'WARP' setup-tear down;
Or, roughly 1/2 of what GPUs do when running GPU codes.
MitchAlsup [2026-04-02 22:28:22] wrote:
> What I find missing is illustrative::
> a) Texture LDs,
> b) Rasterization,
> c) Interpolation,
> d) Transcendental instructions,

IIUC these are largely orthogonal to the ISA.  Is there a lot of
variation in the form of those primitives in different GPU architectures?

Also, my impression is that the article was mostly interested in the
"GPGPU" angle rather than computer graphics (tho I can't find any
explicit statement about that, so maybe it reflects my own bias).

> e) 'WARP' setup-tear down;

I'd have liked to see more discussion of that one, indeed.

> Or, roughly 1/2 of what GPUs do when running GPU codes.

Even in application domains other than computer graphics (well, I guess
for things like physics simulations transcendental functions would be
important, but I expect adding them to the authors' "abstract ISA"
would be trivial and not particularly enlightening).


        Stefan

--- Synchronet 3.21f-Linux NewsLink 1.2
On 4/3/2026 2:30 PM, Stefan Monnier wrote:
Can note that at present, a lot of the "__int128" code generated on RV
ends up looking like:
Load value from the stack;
Load other value from the stack;
Copy into argument registers;
Copy into argument registers;
Call Int128 support function;
Copy return value into other registers;
Store back to the stack (so it can repeat this whole process).
Which, granted, there isn't that much of a better way to do this...
Though, it almost makes one question whether on RV it would be better to
skip keeping int128/float128 in registers, and instead do something like:
ADDI X10, SP, DispDst //load destination address
ADDI X11, SP, DispSrc1 //load first source address
ADDI X12, SP, DispSrc2
JAL X1, __xli_add_3m //do operation in memory
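What such a memory-to-memory helper would do can be sketched in C. The name `__xli_add_3m` comes from the post, but its exact contract is not given, so the three-pointer (destination, source1, source2) semantics here are an assumption read off the ADDI setup sequence; the carry logic is the standard way to do a 128-bit add with only 64-bit operations and no carry flag.

```c
#include <stdint.h>

/* Assumed shape of a __xli_add_3m-style helper: a three-address,
   memory-to-memory 128-bit add (dst = a + b).  The struct layout is
   illustrative, not from any real ABI. */
typedef struct { uint64_t lo, hi; } u128;

static void xli_add_3m(u128 *dst, const u128 *a, const u128 *b) {
    uint64_t lo = a->lo + b->lo;
    uint64_t carry = lo < a->lo;      /* RV has no carry flag: compare
                                         the wrapped sum to detect it */
    dst->lo = lo;
    dst->hi = a->hi + b->hi + carry;
}
```

The `lo < a->lo` comparison is the same SLTU-based trick the RV support function has to use internally, which is part of why inline 128-bit adds bloat so quickly.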
BGB <cr88192@gmail.com> posted:
> ADDI X10, SP, DispDst   //load destination address
> ADDI X11, SP, DispSrc1  //load first source address
> ADDI X12, SP, DispSrc2
> JAL  X1, __xli_add_3m   //do operation in memory
This is an argument against separate register files. Since FP128
is nothing like FP32 or FP64 and FP32 or FP64 instructions do not
help in calculations, ...
It also shows that once something does not fit the register format
one is better off passing via pointers. ...
On 4/4/2026 10:54 AM, MitchAlsup wrote:
> This is an argument against separate register files. Since FP128
> is nothing like FP32 or FP64 and FP32 or FP64 instructions do not
> help in calculations, ...
> It also shows that once something does not fit the register format
> one is better off passing via pointers. ...
It is a great tradeoff of register file size:
32 GPRs;
12 callee save registers;
Register pairing.
Result:
Huge register pressure and a whole lot of spill-and-fill...
While theoretically could hold 6 pairs, due to needing registers for
other stuff it is closer to 3 or 4.
Could potentially put the Int128 values off in FPR space, since there is less pressure there, and not like RV can actually do much with the
values in GPRs anyways.
Basically, only really AND/OR/XOR can be done inline.
ADD/SUB/NEG: Logic too complicated, needs a function call.
Shifts: Likewise, also needs multiple branch-paths depending on the
shift amount.
BGB <cr88192@gmail.com> posted:
> It is a great tradeoff of register file size:
>   32 GPRs;
>   12 callee save registers;
>   Register pairing.

16 callee save registers
No register pairing

> Result:
>   Huge register pressure and a whole lot of spill-and-fill...

My 66000 has less spill fill than RISC-V even without FPRs.

> While theoretically could hold 6 pairs, due to needing registers for
> other stuff it is closer to 3 or 4.

One has to make a choice:
a) great FP64 and useful {but not great} FP128
b) better FP128 {but still not great} with compromised FP64
c) great FP128 with seriously compromised FP64

Right now, one needs great FP64; and the occasional FP128.
Choose wisely.

> Could potentially put the Int128 values off in FPR space, since there is
> less pressure there, and not like RV can actually do much with the
> values in GPRs anyways.

You need 64×64->128, 128 << or >>, 128+128->128 for FP128 emulation
So, either
a) you have a complete integer ISA on FPRs
b) you do FP128 emulation in GPRs
-------------------

> Basically, only really AND/OR/XOR can be done inline.

In your ISA,

> ADD/SUB/NEG: Logic too complicated, needs a function call.

LoL,

> Shifts: Likewise, also needs multiple branch-paths depending on the
> shift amount.

LoL, Wrong ISA model.
On 4/4/2026 3:10 PM, MitchAlsup wrote:
> My 66000 has less spill fill than RISC-V even without FPRs.

Possibly true.
As for Callee-Save GPRs:
XG1 : 15 (31 with XGPR)
XG2 : 31
XG3 : 28
RISC-V: 12 (with 0/12/16 more on the FPR side)
LP64 ABI: 12+0
LP64D ABI: 12+12
BGBCC's ABI: 12+16
Also RISC-V mode has the highest spill-and-fill.
Not usually too bad, but trying to put Int128 in GPRs causes it to
become absurd. Likely better to move Int128 to FPRs despite being an
integer type, will need to look into this.
>> While theoretically could hold 6 pairs, due to needing registers for
>> other stuff it is closer to 3 or 4.
>
> One has to make a choice:
> a) great FP64 and useful {but not great} FP128
> b) better FP128 {but still not great} with compromised FP64
> c) great FP128 with seriously compromised FP64
> Right now, one needs great FP64; and the occasional FP128.
> Choose wisely.
FP64 + poor FP128 is the better option IMO.
The cases where FP128 are used are so rare as to make its performance
mostly irrelevant, but in this case, did realize a need for full
IEEE-754 support, which meant likely needing to redo this stuff in C
(also relevant to get it working on RISC-V).
Didn't necessarily want 3 different ASM versions:
XG1/2, RISC-V, and XG3.
Even if RISC-V is kind of a poor ISA to support this kind of thing.
>> Could potentially put the Int128 values off in FPR space, since there
>> is less pressure there, and not like RV can actually do much with the
>> values in GPRs anyways.
>
> You need 64×64->128, 128 << or >>, 128+128->128 for FP128 emulation
> So, either
> a) you have a complete integer ISA on FPRs
> b) you do FP128 emulation in GPRs
You can move from FPRs to GPRs on function calls, and back on return
(strictly FMV.X.D / FMV.D.X in actual RV64 mnemonics):
    MV X10, F18
    MV X11, F19
    MV X12, F20
    MV X13, F21
    JAL X1, __whatever
    MV F22, X10
    MV F23, X11
In the case of RISC-V, sorta works since typically pretty much nothing
is done inline in this case (and at least the ability to freely move
stuff between registers isn't an issue in RV64G).
Though, within the ASM support functions, still limited to using GPRs
(since the integer ops are used and only work on GPRs).
Ironically though, this is one area where having the Q extension is
actively worse than not having the Q extension:
F/D only: Can freely move values between GPRs and FPRs;
Q extension: This goes away, can only use memory loads/stores to move
the values between register types in this case...
As for multiply, RV has:
MUL : 64*64->64, low results
MULHU: 64*64->64, high results
For Float128 with IEEE rounding, needed a 128*128->256 bit multiply.
Noted that it would be faster in this case to have a single multiply
that produces both high-and-low results at the same time, than to
emulate a 128-bit MULHU, as the MULHU would effectively still need to calculate the low-half to get carry propagation correct.
Ended up doing this part in ASM partly to make it less needlessly slow.
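The MULHU observation can be seen directly in a portable C high-half multiply. This is the standard 32-bit-limb decomposition (the same structure MULHU implements in hardware), written here only to illustrate BGB's point, not taken from his code: even if you want only the upper 64 bits, the carries rippling up from the low partial products must still be computed.

```c
#include <stdint.h>

/* High 64 bits of an unsigned 64x64 multiply, via 32-bit limbs.
   Note that lo_lo (the pure low partial product) cannot be skipped:
   its upper half feeds carries into the final result. */
static uint64_t mulhu64(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t lo_lo = a_lo * b_lo;            /* low  x low  */
    uint64_t hi_lo = a_hi * b_lo;            /* high x low  */
    uint64_t lo_hi = a_lo * b_hi;            /* low  x high */
    uint64_t hi_hi = a_hi * b_hi;            /* high x high */

    uint64_t cross = hi_lo + (lo_lo >> 32);  /* carry in from low half */
    uint64_t mid   = (cross & 0xFFFFFFFFu) + lo_hi;
    return hi_hi + (cross >> 32) + (mid >> 32);
}
```

Scaled up to 128x128, this is why a fused "high and low at once" multiply beats a separate 128-bit MULHU emulation: the low half's work is done exactly once instead of twice.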
>> Basically, only really AND/OR/XOR can be done inline.
>
> In your ISA,

In this case, in RV64G mode...
In XG3 the situation doesn't suck nearly so badly.

>> ADD/SUB/NEG: Logic too complicated, needs a function call.
>
> LoL,

>> Shifts: Likewise, also needs multiple branch-paths depending on the
>> shift amount.
>
> LoL, Wrong ISA model.
Predication would help here...
But, RISC-V only has BEQ/BNE/BLT/... to resolve this issue.
So, no predication, or conditional-select, or ...
So, one needs paths, say:
Shift is negative;
Shift is 0 bits;
Shift is 1 to 63 bits;
Shift is 64 to 127 bits;
...
For 128+, can mask to 7 bits after verifying that the shift is positive.
Checking the sign of the shifts is needed to match the behavior of the
shift ops in my ISA, where negative shifts go the opposite direction
(and I already needed to deal specially with 0).
So, say (pseudo-code, I actually wrote this part in ASM):
    if(shl>0)
    {
        shl&=127;
        if(shl>=64)
        {
            out_hi=in_lo<<(shl&63);
            out_lo=0;
        }else
        {
            out_hi=(in_hi<<(shl&63))|(in_lo>>(64-(shl&63)));
            out_lo=in_lo<<(shl&63);
        }
    }else
    {
        if(shl==0)
        {
            out_hi=in_hi;
            out_lo=in_lo;
        }else
        {
            //flip sign and call opposite shift
        }
    }
...
Well, unless one wants to argue that there are more efficient ways to
deal with 128-bit integer math in RISC-V.
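For reference, the pseudo-code above compiles as plain C with the same branch structure. The `u128`/`shl128`/`shr128` names are illustrative; one behavioral tweak is needed versus the sketch: the `amt == 0` case is checked after masking, so no 64-bit value is ever shifted by 64 (undefined behavior in C, unlike in most ASM).

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;

static u128 shr128(u128 v, int amt);    /* needed for the flip case */

static u128 shl128(u128 v, int amt) {
    u128 r;
    if (amt < 0)
        return shr128(v, -amt);         /* negative shift goes the other way */
    amt &= 127;
    if (amt == 0) {
        r = v;
    } else if (amt >= 64) {
        r.hi = v.lo << (amt - 64);
        r.lo = 0;
    } else {
        r.hi = (v.hi << amt) | (v.lo >> (64 - amt));
        r.lo = v.lo << amt;
    }
    return r;
}

static u128 shr128(u128 v, int amt) {   /* logical right shift */
    u128 r;
    if (amt < 0)
        return shl128(v, -amt);
    amt &= 127;
    if (amt == 0) {
        r = v;
    } else if (amt >= 64) {
        r.lo = v.hi >> (amt - 64);
        r.hi = 0;
    } else {
        r.lo = (v.lo >> amt) | (v.hi << (64 - amt));
        r.hi = v.hi >> amt;
    }
    return r;
}
```

Four-way branching per direction, times two directions: roughly the branch-path count the post describes, all of which collapses to straight-line code on an ISA with predication or conditional select.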
On 4/4/2026 1:10 PM, MitchAlsup wrote:
> One has to make a choice:
> a) great FP64 and useful {but not great} FP128
> b) better FP128 {but still not great} with compromised FP64
> c) great FP128 with seriously compromised FP64
I am not sure those are the only choices. Go back to the days of 32 bit registers. Many/most of the architectures of that era had good support
for both 32 bit single precision and 64 bit double precision. So why
now, in the era of 64 bit registers, can't you have good support for 64
and 128 bit floating point? i.e. what's different?
On 2026-04-04 11:41 p.m., Stephen Fuld wrote:
> So why now, in the era of 64 bit registers, can't you have good
> support for 64 and 128 bit floating point? i.e. what's different?

IMO the difference may be the usefulness for hardware cost (value). It
takes twice as much hardware for something that may be only rarely used.
I think there was good demand for 32-bit machines, a little bit less for
64-bit and even less for 128-bit.

Wide registers are handy for handling wider data, but the float
precision may not be as in demand. With 512-bit wide SIMD registers,
should there be 512-bit floating-point precision?
MitchAlsup wrote:
> One has to make a choice:
> a) great FP64 and useful {but not great} FP128
> b) better FP128 {but still not great} with compromised FP64
> c) great FP128 with seriously compromised FP64
> Right now, one needs great FP64; and the occasional FP128.
I believe that the best compromise both today and for the next 10
years is to have great f64 which includes
Augmented[Addition|Multiplication].

Those two were added in 754-2019; they enable exact/arbitrary
precision arithmetic with very little overhead on an FPU which
already supports FMAC (itself officially in the standard since
754-2008).
On top of this you need the occasional full f128 operation in order
to get the extended exponent range, but this is much less common than
just an extended mantissa.
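A software analogue of AugmentedAddition is Knuth's classic 2Sum, shown below as an illustration of what the hardware operation collapses into one instruction. (Caveat: 754-2019's augmented operations round ties toward zero rather than to even, so this is the same idea, not bit-identical semantics.)

```c
/* 2Sum: given doubles a and b, produce s = fl(a+b) and the exact
   rounding error e, so that a + b == s + e exactly (no FMA needed).
   Six dependent FP adds in software; one instruction in hardware. */
static void two_sum(double a, double b, double *s, double *e) {
    double t  = a + b;
    double bv = t - a;           /* portion of b that made it into t */
    double av = t - bv;          /* portion of a that made it into t */
    *s = t;
    *e = (a - av) + (b - bv);    /* what rounding threw away */
}
```

Chaining such (s, e) pairs is the basis of double-double and arbitrary-precision float arithmetic, which is exactly the "extended mantissa without f128" case Terje describes.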
> Choose wisely.

>> Could potentially put the Int128 values off in FPR space, since
>> there is less pressure there, and not like RV can actually do much
>> with the values in GPRs anyways.
>
> You need 64×64->128, 128 << or >>, 128+128->128 for FP128 emulation
> So, either
> a) you have a complete integer ISA on FPRs
> b) you do FP128 emulation in GPRs
If you emulate f128, then using pairs of 64-bit integer regs is the
easiest solution. If you have a single register set then it becomes
much easier to mix integer/logic ops with fp operations, i.e. to
implement special functions where a f64 result can be used as a
starting point for one or two NR iterations.
Terje
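The mixed integer/FP special-function pattern Terje mentions has a well-known concrete instance: seed a reciprocal square root from integer manipulation of the float's bit pattern, then refine with Newton-Raphson. This sketch uses the standard double-precision magic constant purely as illustration; it is not library-quality, and on a split-register-file machine each `memcpy` below would be a cross-file move.

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x): integer bit-trick seed + two NR iterations.
   With a unified register set the int<->FP transitions are free moves;
   with split files they cost transfer instructions or memory trips. */
static double rsqrt(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);           /* FP -> integer domain */
    bits = 0x5FE6EB50C7B537A9ull - (bits >> 1);
    double y;
    memcpy(&y, &bits, sizeof y);              /* integer -> FP domain */
    y = y * (1.5 - 0.5 * x * y * y);          /* NR iteration 1 */
    y = y * (1.5 - 0.5 * x * y * y);          /* NR iteration 2 */
    return y;
}
```

Two iterations bring the ~3% seed error down to roughly 1e-5 relative, after which one more NR step (or a correctly rounded final fixup) reaches full f64 accuracy.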
On 2026-04-04 9:06 p.m., BGB wrote:
> FP64 + poor FP128 is the better option IMO.

Qupls has support for both binary and decimal float 128 in the ISA. But
I have not bothered to implement either yet. Binary floats support
multiple precisions whereas there is only 128-bit decimal float. This is
on the assumption that one would want lots of precision with decimal
floats and not performance. There are also opcodes reserved for 64-bit
posits.

Using a seven-bit base opcode space which is mostly full. All the
different precisions, data types, and support for SIMD and vectors
really use up opcode space. Opcode usage is fairly regular however. The
same func code represents the same instruction in different precisions.

> The cases where FP128 are used are so rare as to make its performance
> mostly irrelevant, ...

Hee, hee. I have gone with 128 GPR registers for Qupls. So, no trouble
using register pairs or quads for data. I figured it was better to
expose the registers that might be used for vector storage and allow
them to be used for other purposes. Vector instructions are still code
dense, taking the first register for the vector specified in the ISA
instruction. The machine increments the register number for the vector
length required.

It is really quasi-vectors in Qupls. A vector instruction gets converted
into multiple SIMD instructions as needed. However some vector
instructions cannot be done, like arbitrary vector slides.

> Shifts: Likewise, also needs multiple branch-paths depending on the
> shift amount.

I need to review shifting for Qupls. It is being done in a somewhat
simple manner with RTL code for both right and left shifts. I would hide
the negative shift behind an ISA instruction which has a positive shift.
That is, both left and right shifts rotate left internally (micro-
architecturally?), but the programmer does not see that. They see only
shifts with positive shift amounts. Negative shift amounts can be
confusing.

Love the power of micro-ops.
On 4/4/2026 9:40 PM, Robert Finch wrote:
On 2026-04-04 11:41 p.m., Stephen Fuld wrote:
On 4/4/2026 1:10 PM, MitchAlsup wrote:
One has to make a choice:
a) great FP64 and useful {but not great} FP128
b) better FP128 {but still not great} with compromised FP64
c) great FP128 with seriously compromised FP64
I am not sure those are the only choices. Go back to the days of 32
bit registers. Many/most of the architectures of that era had good
support for both 32 bit single precision and 64 bit double precision.
So why now, in the era of 64 bit registers, can't you have good
support for 64 and 128 bit floating point? i.e. what's different?
IMO the difference may be the usefulness for hardware cost (value). It
takes twice as much hardware for something that may be only rarely used.
I think there was good demand for 32-bit machines, a little bit less
for 64-bit and even less for 128-bit.
If you wanted to add a fourth choice to Mitch's list - something like
Great FP64 and 128 but at a significant extra hardware cost, I would
accept that. I wasn't commenting on the utility of implementing FP128,
only Mitch's alternatives for such an implementation.
Wide registers are handy for handling wider data, but the float
precision may not be as in demand. With 512-bit wide SIMD registers,
should there be 512-bit floating-point precision?
Probably not, as I see essentially no demand for 512 bit floats. I
don't know what the demand for 128 bit floats is, but Mitch's post seems
to assume there was enough demand to implement it, and was discussing
alternative implementations.
On 4/4/2026 1:10 PM, MitchAlsup wrote:
One has to make a choice:
a) great FP64 and useful {but not great} FP128
b) better FP128 {but still not great} with compromised FP64
c) great FP128 with seriously compromised FP64
I am not sure those are the only choices. Go back to the days of 32 bit registers. Many/most of the architectures of that era had good support
for both 32 bit single precision and 64 bit double precision. So why
now, in the era of 64 bit registers, can't you have good support for 64
and 128 bit floating point? i.e. what's different?
MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
On 4/4/2026 10:54 AM, MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
On 4/3/2026 2:30 PM, Stefan Monnier wrote:
MitchAlsup [2026-04-02 22:28:22] wrote:
-------------------
Can note that at present, a lot of the "__int128" code generated on RV
ends up looking like:
Load value from the stack;
Load other value from the stack;
Copy into argument registers;
Copy into argument registers;
Call Int128 support function;
Copy return value into other registers;
Store back to the stack (so it can repeat this whole process).
Which, granted, there isn't that much of a better way to do this...
Though, almost starts to make one question if on RV it would almost be
better to skip keeping int128/float128 in registers, and instead be like:
ADDI X10, SP, DispDst   //load destination address
ADDI X11, SP, DispSrc1  //load first source address
ADDI X12, SP, DispSrc2
JAL X1, __xli_add_3m    //do operation in memory
This is an argument against separate register files. Since FP128
is nothing like FP32 or FP64 and FP32 or FP64 instructions do not
help in calculations, ...
It also shows that once something does not fit the register format
one is better off passing via pointers. ...
It is a great tradeoff of register file size:
32 GPRs;
12 callee save registers;
Register pairing.
Result:
Huge register pressure and a whole lot of spill-and-fill...
32 GPRs
16 callee save registers
No register pairing
My 66000 has less spill fill than RISC-V even without FPRs.
While theoretically could hold 6 pairs, due to needing registers for
other stuff it is closer to 3 or 4.
One has to make a choice:
a) great FP64 and useful {but not great} FP128
b) better FP128 {but still not great} with compromised FP64
c) great FP128 with seriously compromised FP64
Right now, one needs great FP64; and the occasional FP128.
I believe that the best compromise both today and for the next 10 years
is to have great f64 which includes Augmented[Addition|Multiplication].
Those two were added in 754-2019, they enable exact/arbitrary precision arithmetic with very little overhead on an fpu which already supports
FMAC (which was officially included in the standard in 1998 or 2008).
On top of this you need the occasional full f128 operation in order to
get the extended exponent range, but this is much less common than just
an extended mantissa.
Choose wisely.
Could potentially put the Int128 values off in FPR space, since there is
less pressure there, and not like RV can actually do much with the
values in GPRs anyways.
You need 64×64->128, 128 << or >>, 128+128->128 for FP128 emulation
So, either
a) you have a complete integer ISA on FPRs
b) you do FP128 emulation in GPRs
If you emulate f128, then using pairs of 64-bit integer regs is the
easiest solution. If you have a single register set then it becomes much easier to mix integer/logic ops with fp operations, i.e to implement
special functions where a f64 result can be used as a starting point for
one or two NR iterations.
Terje
On Sun, 5 Apr 2026 12:01:37 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
MitchAlsup wrote:
[big snip]
If you emulate f128, then using pairs of 64-bit integer regs is the
easiest solution. If you have a single register set then it becomes
much easier to mix integer/logic ops with fp operations, i.e to
implement special functions where a f64 result can be used as a
starting point for one or two NR iterations.
Terje
On modern CPUs, e.g. Zen3 and better, integer division is a better
(than fp64 fdiv) starting point for fp128 division.
For sqrtq/rsqrtq, approximate fp64 rsqrt is probably the best starting
step, but not every ISA has it. For example, x86-64 only has it
with AVX512 and even here precision is only 28 bits. 28 bits is an
excellent starting point for fp64 sqrt/rsqrt, but for fp128 it's not
quite sufficient. I'd prefer 31-32 bits, in order to get exact value
after 2 NR iterations for >99.9% of the inputs.
The cost of moving things between FP and general registers that you
mentioned above is hardly above noise floor, esp. for slower operations
like sqrt and div.
However, avoidance of FP could have other advantages. The most
important is that FP affects flags, and in current ABIs flags are shared
between fp128 and fp64/fp32. If one uses the FPU for emulation of fp128,
then setting flags in the IEEE-prescribed manner becomes quite difficult,
especially so for Inexact - the most useless of them all, but still
mandatory according to IEEE.
Terje Mathisen <terje.mathisen@tmsw.no> posted:
Right now, one needs great FP64; and the occasional FP128.
I believe that the best compromise both today and for the next 10 years
is to have great f64 which includes Augmented[Addition|Multiplication].
That is a lovely implementation of AugmentedAddition. As:
GLOBAL Kahan_Babuška
ENTRY Kahan_Babuška
Kahan_Babuška:
// note compiler put &residual in R3 at CALL
LDD R4,[R3]
CARRY R4,{IO}
FADD R1,R1,R2 // will not set inexact
STD R4,[R3]
RET
Inexact can be set in this usage of CARRY and FADD.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 4/4/2026 1:10 PM, MitchAlsup wrote:
One has to make a choice:
a) great FP64 and useful {but not great} FP128
b) better FP128 {but still not great} with compromised FP64
c) great FP128 with seriously compromised FP64
I am not sure those are the only choices. Go back to the days of 32 bit
registers. Many/most of the architectures of that era had good support
for both 32 bit single precision and 64 bit double precision. So why
now, in the era of 64 bit registers, can't you have good support for 64
and 128 bit floating point? i.e. what's different?
In the 32-bit era, there was very significant demand for 64-bit FP
and we were just waiting for silicon acreage to do 64-bits;
whereas,
In the 64-bit era, there is a bit of demand for FP128 to make it not
stink to high heaven, but not enough to even consider making everything 128-bits.
So in a) we had to have useful FP64 in an essentially 32-bit machine;
but right now b) there is no essentiality - just don't screw it up
enough that it can't be useful on occasion.
On 4/5/2026 1:29 PM, MitchAlsup wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 4/4/2026 1:10 PM, MitchAlsup wrote:
One has to make a choice:
a) great FP64 and useful {but not great} FP128
b) better FP128 {but still not great} with compromised FP64
c) great FP128 with seriously compromised FP64
I am not sure those are the only choices. Go back to the days of 32 bit
registers. Many/most of the architectures of that era had good support
for both 32 bit single precision and 64 bit double precision. So why
now, in the era of 64 bit registers, can't you have good support for 64
and 128 bit floating point? i.e. what's different?
In the 32-bit era, there was very significant demand for 64-bit FP
and we were just waiting for silicon acreage to do 64-bits;
whereas,
In the 64-bit era, there is a bit of demand for FP128 to make it not
stink to high heaven, but not enough to even consider making everything 128-bits.
So in a) we had to have useful FP64 in an essentially 32-bit machine;
but right now b) there is no essentiality - just don't screw it up
enough that it can't be useful on occasion.
So I think you are agreeing with Robert Finch and me, that an
alternative that existed, but that you rejected, for good reasons, would have been to provide great support for FP128.
I agree with your decision to provide "a little" support for FP128. I
would add that an additional requirement for the implementation is to do nothing to prevent a better, but backwards compatible, implementation in
the future if the demand increases.