• Generic GPUs

    From John Levine@johnl@taugh.com to comp.arch on Thu Apr 2 18:02:01 2026
    From Newsgroup: comp.arch

    This draft paper looks at a bunch of GPU instruction sets and wonders
    if we could come up with a widely implemented GPU instruction set
    analogous to the ARM CPU instruction set.

    Toward a Universal GPU Instruction Set Architecture: A Cross-Vendor
    Analysis of Hardware-Invariant Computational Primitives in Parallel
    Processors

    Ojima Abraham, Onyinye Okoli

    We present the first systematic cross-vendor analysis of GPU
    instruction set architectures spanning all four major GPU vendors:
    NVIDIA (PTX ISA v1.0 through v9.2, Fermi through Blackwell), AMD (RDNA
    1 to 4 and CDNA 1 to 4), Intel (Gen11, Xe-LP, Xe-HPG, Xe-HPC), and
    Apple (G13, reverse-engineered). Drawing on official ISA reference
    manuals, architecture whitepapers, patent filings, and community reverse-engineering efforts totaling over 5,000 pages of primary
    sources across 16 distinct microarchitectures, we identify ten hardware-invariant computational primitives that appear across all
    four architectures, six parameterizable dialects where vendors
    implement identical concepts with different parameters, and six true architectural divergences representing fundamental design
    disagreements. Based on this analysis, we propose an abstract
    execution model for a vendor-neutral GPU ISA grounded in the physical constraints of parallel computation. We validate our model with
    benchmark results on NVIDIA T4 and Apple M1 hardware, the two most architecturally distant platforms in our study. On five of six benchmark-platform pairs, the abstract model matches or exceeds native vendor-optimized performance. The single outlier (parallel reduction
    on NVIDIA, 62.5% of native) reveals that intra-wave shuffle must be a
    mandatory primitive, a finding that refines our proposed model.


    https://arxiv.org/abs/2603.28793
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Apr 2 15:57:01 2026
    From Newsgroup: comp.arch

    On 4/2/2026 1:02 PM, John Levine wrote:
    This draft paper looks at a bunch of GPU instruction sets and wonders
    if we could come up with a widely implemented GPU instruction set
    analogous to the ARM CPU instruction set.

    Toward a Universal GPU Instruction Set Architecture: A Cross-Vendor
    Analysis of Hardware-Invariant Computational Primitives in Parallel Processors

    Ojima Abraham, Onyinye Okoli

    We present the first systematic cross-vendor analysis of GPU
    instruction set architectures spanning all four major GPU vendors:
    NVIDIA (PTX ISA v1.0 through v9.2, Fermi through Blackwell), AMD (RDNA
    1 to 4 and CDNA 1 to 4), Intel (Gen11, Xe-LP, Xe-HPG, Xe-HPC), and
    Apple (G13, reverse-engineered). Drawing on official ISA reference
    manuals, architecture whitepapers, patent filings, and community reverse-engineering efforts totaling over 5,000 pages of primary
    sources across 16 distinct microarchitectures, we identify ten hardware-invariant computational primitives that appear across all
    four architectures, six parameterizable dialects where vendors
    implement identical concepts with different parameters, and six true architectural divergences representing fundamental design
    disagreements. Based on this analysis, we propose an abstract
    execution model for a vendor-neutral GPU ISA grounded in the physical constraints of parallel computation. We validate our model with
    benchmark results on NVIDIA T4 and Apple M1 hardware, the two most architecturally distant platforms in our study. On five of six benchmark-platform pairs, the abstract model matches or exceeds native vendor-optimized performance. The single outlier (parallel reduction
    on NVIDIA, 62.5% of native) reveals that intra-wave shuffle must be a mandatory primitive, a finding that refines our proposed model.


    https://arxiv.org/abs/2603.28793


    Skims...

    It almost seems like "XG3 but SIMT" wouldn't be too far off...


    Main differences:
    64 GPRs vs 128/256
    64-bit vs 32-bit
    2xF32/4xF16 vs 1xF32/2xF16

    Weird to think that XG3 has wider SIMD than a GPU, but yeah...

    Seems like they aren't really using SIMD in this case, but likely
    faking SIMD vectors by using 2 or 4 scalar registers (where the high
    register count makes sense). Otherwise, unless one is inlining the
    whole shader into a single monolithic blob, one is unlikely to be able
    to make effective use of that many registers.

    Does come off as something very different than RV-v in any case...


    But, narrower SIMD does make sense in a way vs wider SIMD:
    The wider you go, the more of a PITA that shuffles become.

    Scalar: No Shuffles at all;
    Two Wide: Some Half-Register ops;
    Four Wide: Need 4-element shuffles alongside half-register stuff;
    Eight Wide (or wider): All hell breaks loose (note the complete lack of
    native 8 wide SIMD ops in my ISA, there was a reason here...).

    Very wide vectors only really make sense if one assumes the
    computational task looks like an array walk or similar.

    I am also using a fairly limited predication scheme, but often 1 bit of predicate is sufficient. Can evaluate more complex conditionals in GPRs
    and then fold to the final predicate flag for the last step.

    ...


    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Apr 2 22:28:22 2026
    From Newsgroup: comp.arch


    John Levine <johnl@taugh.com> posted:

    This draft paper looks at a bunch of GPU instruction sets and wonders
    if we could come up with a widely implemented GPU instruction set
    analogous to the ARM CPU instruction set.

    Toward a Universal GPU Instruction Set Architecture: A Cross-Vendor
    Analysis of Hardware-Invariant Computational Primitives in Parallel Processors

    Ojima Abraham, Onyinye Okoli

    We present the first systematic cross-vendor analysis of GPU
    instruction set architectures spanning all four major GPU vendors:
    NVIDIA (PTX ISA v1.0 through v9.2, Fermi through Blackwell), AMD (RDNA
    1 to 4 and CDNA 1 to 4), Intel (Gen11, Xe-LP, Xe-HPG, Xe-HPC), and
    Apple (G13, reverse-engineered). Drawing on official ISA reference
    manuals, architecture whitepapers, patent filings, and community reverse-engineering efforts totaling over 5,000 pages of primary
    sources across 16 distinct microarchitectures, we identify ten hardware-invariant computational primitives that appear across all
    four architectures, six parameterizable dialects where vendors
    implement identical concepts with different parameters, and six true architectural divergences representing fundamental design
    disagreements. Based on this analysis, we propose an abstract
    execution model for a vendor-neutral GPU ISA grounded in the physical constraints of parallel computation. We validate our model with
    benchmark results on NVIDIA T4 and Apple M1 hardware, the two most architecturally distant platforms in our study. On five of six benchmark-platform pairs, the abstract model matches or exceeds native vendor-optimized performance. The single outlier (parallel reduction
    on NVIDIA, 62.5% of native) reveals that intra-wave shuffle must be a mandatory primitive, a finding that refines our proposed model.


    https://arxiv.org/abs/2603.28793


    What I find missing is illustrative::
    a) Texture LDs,
    b) Rasterization,
    c) Interpolation,
    d) Transcendental instructions,
    e) 'WARP' setup-tear down;
    Or, roughly 1/2 of what GPUs do when running GPU codes.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Fri Apr 3 15:30:00 2026
    From Newsgroup: comp.arch

    MitchAlsup [2026-04-02 22:28:22] wrote:
    What I find missing is illustrative::
    a) Texture LDs,
    b) Rasterization,
    c) Interpolation,
    d) Transcendental instructions,

    IIUC these are largely orthogonal to the ISA. Is there a lot of
    variation in the form of those primitives in different GPU architectures?

    Also, my impression is that the article was mostly interested in the
    "GPGPU" angle rather than computer graphics (tho I can't find any
    explicit statement about that, so maybe it reflects my own bias).

    e) 'WARP' setup-tear down;

    I'd have liked to see more discussion of that one, indeed.

    Or, roughly 1/2 of what GPUs do when running GPU codes.

    Even in application domains other than computer graphics (well, I guess
    for things like physics simulations transcendental functions would be important, but I expect adding them to the authors' "abstract ISA"
    would be trivial and not particularly enlightening).


    === Stefan
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Apr 3 21:46:14 2026
    From Newsgroup: comp.arch


    Stefan Monnier <monnier@iro.umontreal.ca> posted:

    MitchAlsup [2026-04-02 22:28:22] wrote:
    What I find missing is illustrative::
    a) Texture LDs,
    b) Rasterization,
    c) Interpolation,
    d) Transcendental instructions,

    IIUC these are largely orthogonal to the ISA. Is there a lot of
    variation in the form of those primitives in different GPU architectures?

    While different ISA to ISA, they all allow what appears to be a LD
    in ISA to perform {2,4,8} memory reads and perform {1,2,3}-D linear interpolations between points so that the shader program did not
    need instructions to perform that work.
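    For illustration, the 2-D case (4 reads plus a bilinear interpolation)
    amounts to something like the following C sketch. This is a
    hypothetical helper, not any vendor's actual semantics; it assumes a
    row-major float texture and u,v in [0,1]:

    ```c
    /* Bilinear texture fetch: 4 memory reads plus a 2-D linear
       interpolation, all folded into one texture-LD instruction on the
       GPUs in question.  Hypothetical helper for illustration only. */
    float tex2d_bilinear(const float *tex, int w, int h, float u, float v)
    {
        float x = u * (float)(w - 1);
        float y = v * (float)(h - 1);
        int x0 = (int)x, y0 = (int)y;       /* u,v >= 0, so truncation == floor */
        int x1 = x0 + 1 < w ? x0 + 1 : x0;  /* clamp at the edge */
        int y1 = y0 + 1 < h ? y0 + 1 : y0;
        float fx = x - (float)x0, fy = y - (float)y0;

        /* the 4 memory reads */
        float t00 = tex[y0 * w + x0], t10 = tex[y0 * w + x1];
        float t01 = tex[y1 * w + x0], t11 = tex[y1 * w + x1];

        /* the 2-D linear interpolation */
        float top = t00 + (t10 - t00) * fx;
        float bot = t01 + (t11 - t01) * fx;
        return top + (bot - top) * fy;
    }
    ```

    The 1-D and 3-D cases just change the read count (2 or 8) and drop or
    add one level of lerp.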

    Also, my impression is that the article was mostly interested in the
    "GPGPU" angle rather than computer graphics (tho I can't find any
    explicit statement about that, so maybe it reflects my own bias).

    e) 'WARP' setup-tear down;

    I'd have liked to see more discussion of that one, indeed.

    In many GPUs, there is a front end scheduler that reads the draw-calls
    from SW and sets up WARPs to run. This scheduler needs to be able to
    spawn WARPs as fast as shader programs finish--roughly 1 new WARP
    every 8 cycles (32 new threads set up and ready to go every 8 cycles).
    This is almost as important to GPU performance as 0-cycle context switch.

    Or, roughly 1/2 of what GPUs do when running GPU codes.

    Even in application domains other than computer graphics (well, I guess
    for things like physics simulations transcendental functions would be important, but I expect adding them to the authors' "abstract ISA"
    would be trivial and not particularly enlightening).


    === Stefan
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Apr 3 17:01:10 2026
    From Newsgroup: comp.arch

    On 4/3/2026 2:30 PM, Stefan Monnier wrote:
    MitchAlsup [2026-04-02 22:28:22] wrote:
    What I find missing is illustrative::
    a) Texture LDs,
    b) Rasterization,
    c) Interpolation,
    d) Transcendental instructions,

    IIUC these are largely orthogonal to the ISA. Is there a lot of
    variation in the form of those primitives in different GPU architectures?

    Also, my impression is that the article was mostly interested in the
    "GPGPU" angle rather than computer graphics (tho I can't find any
    explicit statement about that, so maybe it reflects my own bias).


    FWIW: A few specialized instructions related to A and C exist in XG1/2/3 despite it not being a GPU ISA.

    But... It was being used some for 3D rendering tasks and similar, so I
    guess it stands to reason.


    Though, D seems a little odd, as transcendentals aren't really used in
    the backend parts of a 3D renderer, nor much in typical shader code,
    and to the extent they are, cheaper approximations usually suffice
    (in a graphics shader there is rarely much reason to care whether
    "log()" or "pow()" are numerically accurate).


    Still, I didn't really expect to see something kinda like what I was
    doing already, but even more in that direction.


    e) 'WARP' setup-tear down;

    I'd have liked to see more discussion of that one, indeed.

    Or, roughly 1/2 of what GPUs do when running GPU codes.

    Even in application domains other than computer graphics (well, I guess
    for things like physics simulations transcendental functions would be important, but I expect adding them to the authors' "abstract ISA"
    would be trivial and not particularly enlightening).


    Yeah.


    Meanwhile I was now working on re-implementing the Int128 and Float128
    stuff for the XG3 and RISC-V backends. Writing the Float128 stuff in C
    this time (the original code for BJX2 was all in ASM and didn't work on
    either RV or XG3).

    Initially, I had just sorta stubbed this stuff over to make it build
    (there were some initial placeholders for the RISC-V versions of this
    stuff, wouldn't actually work correctly for 128-bit numbers though).
    Didn't start getting around to it until now (IIRC, started working on it
    some yesterday or so).


    Don't have the new Float128 code working yet, is giving me some crap,
    partly as I am left also needing to debug the Int128 support at the same
    time (the Float128 code being written on top of the Int128 support).

    Did partly switch over from the old Float128 code to the new code for
    the older BJX2 ISA as well, partly as the new code is at least intended
    to be able to support proper IEEE-754 semantics (with subnormal numbers
    and all that fun, whereas the former ASM code was still DAZ/FTZ).

    Still don't have the new code working yet though, and kept running into
    bugs (some in BGBCC, others in some of the new ASM support code being
    written for RV).



    Also generates epic dog-crap code for the RV ABI, as the registers don't
    go quite so far when dealing with 128-bit stuff. And the "va_list"
    handling seems to be broken as well for 128-bit types for some reason.


    Though, there isn't really a standard printf modifier for these.
    Was using things like "%032LX" with the thinking that the "long double"
    type modifier also means either int128 or float128.
    %X : assumed 32-bits ("sizeof(int)")
    %lX : 32 or 64-bits ("sizeof(long)")
    %llX : 64-bits ("sizeof(long long)")
    %LX : 128-bits (non-standard)
    %I64X : 64-bits (old MSVC convention)
    %I128X: 128-bits (old MSVC convention)

    Vs:
    %f : Assumed double (with float auto-promoting)
    %Lf : Long Double (float128)

    TBD:
    %LLX : Possible for a 256-bit type?...


    Ended up writing some test-bench code to start trying to debug some of
    this stuff, though for now just printing stuff using 64-bit types or
    64-bit chunks to workaround the va_list issue for now, will need to
    figure out the issue here.
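    The 64-bit-chunk workaround can be sketched roughly like so (assuming
    a compiler that provides "__int128", e.g. GCC/Clang on 64-bit targets;
    the helper name is made up):

    ```c
    #include <stdio.h>
    #include <stdint.h>

    /* Format an unsigned __int128 as two 64-bit hex chunks, sidestepping
       both the non-standard "%032LX" modifier and any 128-bit va_list
       handling.  Hypothetical helper name, for illustration. */
    static void u128_to_hex(unsigned __int128 v, char out[33])
    {
        sprintf(out, "%016llX%016llX",
                (unsigned long long)(uint64_t)(v >> 64),
                (unsigned long long)(uint64_t)v);
    }
    ```

    Printing the two halves with a pair of "%016llX" is then enough until
    the va_list issue gets sorted out.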


    Can note that at present, a lot of the "__int128" code generated on RV
    ends up looking like:
    Load value from the stack;
    Load other value from the stack;
    Copy into argument registers;
    Copy into argument registers;
    Call Int128 support function;
    Copy return value into other registers;
    Store back to the stack (so it can repeat this whole process).

    Which, granted, there isn't that much of a better way to do this...


    Though, almost starts to make one question if on RV it would almost be
    better to skip keeping int128/float128 in registers, and instead be like:
    ADDI X10, SP, DispDst //load destination address
    ADDI X11, SP, DispSrc1 //load first source address
    ADDI X12, SP, DispSrc2
    JAL X1, __xli_add_3m //do operation in memory

    Since:
    On RV there is relatively little that can be done inline;
    The register pressure is too high to effectively keep Int128 values in registers;
    A lot of the compiler logic was built to assume working on the values in registers (presumably inline in many cases), rather than quite so
    heavily relying on runtime calls (but, RV is weak here vs my own ISA
    even in the absence of the ALUX instructions).

    Though, could make sense (if adding this) to have it as a special case
    if none of the values are currently in registers.

    Note that passing memory addresses is generally how anything larger than 128-bits will work (So, "_BitInt(256)" would be passed as pointers).


    But, I guess no one says that working with "__int128" and similar on
    RISC-V needs to be fast.

    For now, the support functions (shared between RV and XG3) are being
    written to assume RV limitations. Though, for XG3 both the register
    pressure situation is less bad and there is more that could be done
    inline (either ALUX, or using ADDC and similar, but ADDC is now itself
    not necessarily a core instruction; and absent both ALUX and ADDC, one
    is left having to fake carry propagation using additional instructions,
    namely SLT+ADD, ...).
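    The SLT+ADD carry-faking looks something like this in C (hypothetical
    helper; on RV64 the unsigned compare is exactly what SLTU computes):

    ```c
    #include <stdint.h>

    /* 128-bit add from two 64-bit halves.  The carry compare is the
       SLT(U)+ADD trick: an unsigned sum wrapped around iff the result
       is smaller than either addend.  Hypothetical helper name. */
    static void add128(uint64_t a_hi, uint64_t a_lo,
                       uint64_t b_hi, uint64_t b_lo,
                       uint64_t *r_hi, uint64_t *r_lo)
    {
        uint64_t lo = a_lo + b_lo;      /* ADD       */
        uint64_t carry = lo < a_lo;     /* SLTU      */
        *r_hi = a_hi + b_hi + carry;    /* ADD, ADD  */
        *r_lo = lo;
    }
    ```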

    But, yeah...


    Other thoughts:

    __int256? Possible I could add, but would likely be handled as an alias
    to "_BitInt(256)" rather than a dedicated type.

    __float256? Debatable, possibly could make sense as a "floating point
    type to end all floating point types" kinda thing, no immediate priority.

    But, these would likely need to wait until after I get the 128-bit types
    fully working on RV and similar...


    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Apr 4 15:54:28 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 4/3/2026 2:30 PM, Stefan Monnier wrote:
    MitchAlsup [2026-04-02 22:28:22] wrote:
    -------------------
    Can note that at present, a lot of the "__int128" code generated on RV
    ends up looking like:
    Load value from the stack;
    Load other value from the stack;
    Copy into argument registers;
    Copy into argument registers;
    Call Int128 support function;
    Copy return value into other registers;
    Store back to the stack (so it can repeat this whole process).

    Which, granted, there isn't that much of a better way to do this...


    Though, almost starts to make one question if on RV it would almost be better to skip keeping int128/float128 in registers, and instead be like:
    ADDI X10, SP, DispDst //load destination address
    ADDI X11, SP, DispSrc1 //load first source address
    ADDI X12, SP, DispSrc2
    JAL X1, __xli_add_3m //do operation in memory

    This is an argument against separate register files. Since FP128
    is nothing like FP32 or FP64 and FP32 or FP64 instructions do not
    help in calculations, ...

    It also shows that once something does not fit the register format
    one is better off passing via pointers. ...
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Apr 4 12:11:14 2026
    From Newsgroup: comp.arch

    On 4/4/2026 10:54 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/3/2026 2:30 PM, Stefan Monnier wrote:
    MitchAlsup [2026-04-02 22:28:22] wrote:
    -------------------
    Can note that at present, a lot of the "__int128" code generated on RV
    ends up looking like:
    Load value from the stack;
    Load other value from the stack;
    Copy into argument registers;
    Copy into argument registers;
    Call Int128 support function;
    Copy return value into other registers;
    Store back to the stack (so it can repeat this whole process).

    Which, granted, there isn't that much of a better way to do this...


    Though, almost starts to make one question if on RV it would almost be
    better to skip keeping int128/float128 in registers, and instead be like:
    ADDI X10, SP, DispDst //load destination address
    ADDI X11, SP, DispSrc1 //load first source address
    ADDI X12, SP, DispSrc2
    JAL X1, __xli_add_3m //do operation in memory

    This is an argument against separate register files. Since FP128
    is nothing like FP32 or FP64 and FP32 or FP64 instructions do not
    help in calculations, ...

    It also shows that once something does not fit the register format
    one is better off passing via pointers. ...

    It is a great tradeoff of register file size:
    32 GPRs;
    12 callee save registers;
    Register pairing.

    Result:
    Huge register pressure and a whole lot of spill-and-fill...
    While theoretically could hold 6 pairs, due to needing registers for
    other stuff it is closer to 3 or 4.

    Could potentially put the Int128 values off in FPR space, since there is
    less pressure there, and not like RV can actually do much with the
    values in GPRs anyways.


    Less bad in XG3 as with 28 callee save registers, there is less of an
    issue with spill and fill.

    More so when useful work can be inline, but this is more of a problem
    with RISC-V.

    Basically, only really AND/OR/XOR can be done inline.
    ADD/SUB/NEG: Logic too complicated, needs a function call.
    Shifts: Likewise, also needs multiple branch-paths depending on the
    shift amount.
    ...



    Oh well, got as far as getting FADD/FSUB/FMUL basically working, but
    FDIV still breaks. Likely still some bugs here...


    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Apr 4 20:10:28 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 4/4/2026 10:54 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/3/2026 2:30 PM, Stefan Monnier wrote:
    MitchAlsup [2026-04-02 22:28:22] wrote:
    -------------------
    Can note that at present, a lot of the "__int128" code generated on RV
    ends up looking like:
    Load value from the stack;
    Load other value from the stack;
    Copy into argument registers;
    Copy into argument registers;
    Call Int128 support function;
    Copy return value into other registers;
    Store back to the stack (so it can repeat this whole process).

    Which, granted, there isn't that much of a better way to do this...


    Though, almost starts to make one question if on RV it would almost be
    better to skip keeping int128/float128 in registers, and instead be like:
    ADDI X10, SP, DispDst //load destination address
    ADDI X11, SP, DispSrc1 //load first source address
    ADDI X12, SP, DispSrc2
    JAL X1, __xli_add_3m //do operation in memory

    This is an argument against separate register files. Since FP128
    is nothing like FP32 or FP64 and FP32 or FP64 instructions do not
    help in calculations, ...

    It also shows that once something does not fit the register format
    one is better off passing via pointers. ...

    It is a great tradeoff of register file size:
    32 GPRs;
    12 callee save registers;
    Register pairing.
    32 GPRs
    16 callee save registers
    No register pairing

    Result:
    Huge register pressure and a whole lot of spill-and-fill...

    My 66000 has less spill fill than RISC-V even without FPRs.

    While theoretically could hold 6 pairs, due to needing registers for
    other stuff it is closer to 3 or 4.

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64

    Right now, one needs great FP64; and the occasional FP128.

    Choose wisely.

    Could potentially put the Int128 values off in FPR space, since there is less pressure there, and not like RV can actually do much with the
    values in GPRs anyways.

    You need 64×64->128, 128 << or >>, 128+128->128 for FP128 emulation
    So, either
    a) you have a complete integer ISA on FPRs
    b) you do FP128 emulation in GPRs

    -------------------
    Basically, only really AND/OR/XOR can be done inline.

    In your ISA,

    ADD/SUB/NEG: Logic too complicated, needs a function call.

    LoL,

    Shifts: Likewise, also needs multiple branch-paths depending on the
    shift amount.

    LoL, Wrong ISA model.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Apr 4 20:06:09 2026
    From Newsgroup: comp.arch

    On 4/4/2026 3:10 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/4/2026 10:54 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/3/2026 2:30 PM, Stefan Monnier wrote:
    MitchAlsup [2026-04-02 22:28:22] wrote:
    -------------------
    Can note that at present, a lot of the "__int128" code generated on RV
    ends up looking like:
    Load value from the stack;
    Load other value from the stack;
    Copy into argument registers;
    Copy into argument registers;
    Call Int128 support function;
    Copy return value into other registers;
    Store back to the stack (so it can repeat this whole process).

    Which, granted, there isn't that much of a better way to do this...


    Though, almost starts to make one question if on RV it would almost be
    better to skip keeping int128/float128 in registers, and instead be like:
    ADDI X10, SP, DispDst //load destination address
    ADDI X11, SP, DispSrc1 //load first source address
    ADDI X12, SP, DispSrc2
    JAL X1, __xli_add_3m //do operation in memory

    This is an argument against separate register files. Since FP128
    is nothing like FP32 or FP64 and FP32 or FP64 instructions do not
    help in calculations, ...

    It also shows that once something does not fit the register format
    one is better off passing via pointers. ...

    It is a great tradeoff of register file size:
    32 GPRs;
    12 callee save registers;
    Register pairing.
    32 GPRs
    16 callee save registers
    No register pairing

    Result:
    Huge register pressure and a whole lot of spill-and-fill...

    My 66000 has less spill fill than RISC-V even without FPRs.


    Possibly true.

    As for Callee-Save GPRs:
    XG1 : 15 (31 with XGPR)
    XG2 : 31
    XG3 : 28
    RISC-V: 12 (with 0/12/16 more on the FPR side)
    LP64 ABI: 12+0
    LP64D ABI: 12+12
    BGBCC's ABI: 12+16


    Also RISC-V mode has the highest spill-and-fill.

    Not usually too bad, but trying to put Int128 in GPRs causes it to
    become absurd. Likely better to move Int128 to FPRs despite being an
    integer type, will need to look into this.


    While theoretically could hold 6 pairs, due to needing registers for
    other stuff it is closer to 3 or 4.

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64

    Right now, one needs great FP64; and the occasional FP128.

    Choose wisely.


    FP64 + poor FP128 is the better option IMO.

    The cases where FP128 is used are so rare as to make its performance
    mostly irrelevant, but in this case, did realize a need for full
    IEEE-754 support, which meant likely needing to redo this stuff in C
    (also relevant to getting it working on RISC-V).

    Didn't necessarily want 3 different ASM versions:
    XG1/2, RISC-V, and XG3.

    Even if RISC-V is kind of a poor ISA to support this kind of thing.


    Could potentially put the Int128 values off in FPR space, since there is
    less pressure there, and not like RV can actually do much with the
    values in GPRs anyways.

    You need 64×64->128, 128 << or >>, 128+128->128 for FP128 emulation
    So, either
    a) you have a complete integer ISA on FPRs
    b) you do FP128 emulation in GPRs


    You can move from FPRs to GPRs on function calls, and back on return:
    MV X10, F18
    MV X11, F19
    MV X12, F20
    MV X13, F21
    JAL X1, __whatever
    MV F22, X10
    MV F23, X11

    In the case of RISC-V, sorta works since typically pretty much nothing
    is done inline in this case (and at least the ability to freely move
    stuff between registers isn't an issue in RV64G).


    Though, within the ASM support functions, still limited to using GPRs
    (since the integer ops are used and only work on GPRs).



    Ironically though, this is one area where having the Q extension is
    actively worse than not having the Q extension:
    F/D only: Can freely move values between GPRs and FPRs;
    Q extension: This goes away, can only use memory loads/stores to move
    the values between register types in this case...


    As for multiply, RV has:
    MUL : 64*64->64, low results
    MULHU: 64*64->64, high results

    For Float128 with IEEE rounding, needed a 128*128->256 bit multiply.

    Noted that it would be faster in this case to have a single multiply
    that produces both high-and-low results at the same time, than to
    emulate a 128-bit MULHU, as the MULHU would effectively still need to calculate the low-half to get carry propagation correct.

    Ended up doing this part in ASM partly to make it less needlessly slow.



    -------------------
    Basically, only really AND/OR/XOR can be done inline.

    In your ISA,


    In this case, in RV64G mode...

    In XG3 the situation doesn't suck nearly so badly.


    ADD/SUB/NEG: Logic too complicated, needs a function call.

    LoL,

    Shifts: Likewise, also needs multiple branch-paths depending on the
    shift amount.

    LoL, Wrong ISA model.

    Predication would help here...

    But, RISC-V only has BEQ/BNE/BLT/... to resolve this issue.

    So, no predication, or conditional-select, or ...


    So, one needs paths, say:
    Shift is negative;
    Shift is 0 bits;
    Shift is 1 to 63 bits;
    Shift is 64 to 127 bits;
    ...
    For 128+, can mask to 7 bits after verifying that the shift is positive.

    Checking the sign of the shifts is needed to match the behavior of the
    shift ops in my ISA, where negative shifts go the opposite direction
    (and I already needed to deal specially with 0).

    So, say (pseudo-code, I actually wrote this part in ASM):
        if(shl>0)
        {
            shl&=127;
            if(shl>=64)
            {
                out_hi=in_lo<<(shl&63);
                out_lo=0;
            }else
            {
                out_hi=(in_hi<<(shl&63))|(in_lo>>(64-(shl&63)));
                out_lo=in_lo<<(shl&63);
            }
        }else
        {
            if(shl==0)
            {
                out_hi=in_hi;
                out_lo=in_lo;
            }else
            {
                //flip sign and call opposite shift
            }
        }

    ...

    Well, unless one wants to argue that there are more efficient ways to
    deal with 128-bit integer math in RISC-V.
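    As a runnable cross-check, the same shift logic in C (hypothetical
    helper; the negative-shift case mirrors the shift semantics of my ISA
    and is stubbed out here, and a guard is added for shifts that mask to
    0 so no shift-by-64 happens):

    ```c
    #include <stdint.h>

    /* 128-bit left shift over two 64-bit halves, following the
       pseudo-code above.  Hypothetical helper for illustration. */
    static void shl128(uint64_t in_hi, uint64_t in_lo, int shl,
                       uint64_t *out_hi, uint64_t *out_lo)
    {
        if (shl > 0) {
            shl &= 127;
            if (shl == 0) {               /* e.g. shl==128 masked to 0 */
                *out_hi = in_hi;
                *out_lo = in_lo;
            } else if (shl >= 64) {
                *out_hi = in_lo << (shl & 63);
                *out_lo = 0;
            } else {
                *out_hi = (in_hi << shl) | (in_lo >> (64 - shl));
                *out_lo = in_lo << shl;
            }
        } else {
            /* shl==0 passes through; a negative shift would flip sign
               and call the opposite-direction helper (omitted) */
            *out_hi = in_hi;
            *out_lo = in_lo;
        }
    }
    ```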


    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sat Apr 4 20:41:23 2026
    From Newsgroup: comp.arch

    On 4/4/2026 1:10 PM, MitchAlsup wrote:

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64


    I am not sure those are the only choices. Go back to the days of 32 bit registers. Many/most of the architectures of that era had good support
    for both 32 bit single precision and 64 bit double precision. So why
    now, in the era of 64 bit registers, can't you have good support for 64
    and 128 bit floating point? i.e. what's different?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Apr 5 00:31:48 2026
    From Newsgroup: comp.arch

    On 2026-04-04 9:06 p.m., BGB wrote:
    On 4/4/2026 3:10 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/4/2026 10:54 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/3/2026 2:30 PM, Stefan Monnier wrote:
    MitchAlsup [2026-04-02 22:28:22] wrote:
    -------------------
    Can note that at present, a lot of the "__int128" code generated on RV ends up looking like:
         Load value from the stack;
         Load other value from the stack;
         Copy into argument registers;
         Copy into argument registers;
         Call Int128 support function;
         Copy return value into other registers;
         Store back to the stack (so it can repeat this whole process).
    Which, granted, there isn't that much of a better way to do this...


    Though, almost starts to make one question if on RV it would almost be better to skip keeping int128/float128 in registers, and instead be like:
         ADDI X10, SP, DispDst   //load destination address
         ADDI X11, SP, DispSrc1  //load first source address
         ADDI X12, SP, DispSrc2
         JAL  X1, __xli_add_3m   //do operation in memory

    This is an argument against separate register files. Since FP128
    is nothing like FP32 or FP64 and FP32 or FP64 instructions do not
    help in calculations, ...

    It also shows that once something does not fit the register format
    one is better off passing via pointers. ...

    It is a great tradeoff of register file size:
        32 GPRs;
        12 callee save registers;
        Register pairing.
          32 GPRs
          16 callee save registers
          No register pairing

    Result:
    Huge register pressure and a whole lot of spill-and-fill...

    My 66000 has less spill fill than RISC-V even without FPRs.


    Possibly true.

    As for Callee-Save GPRs:
      XG1   : 15 (31 with XGPR)
      XG2   : 31
      XG3   : 28
      RISC-V: 12 (with 0/12/16 more on the FPR side)
        LP64 ABI: 12+0
        LP64D ABI: 12+12
        BGBCC's ABI: 12+16


    Also RISC-V mode has the highest spill-and-fill.

    Not usually too bad, but trying to put Int128 in GPRs causes it to
    become absurd. Likely better to move Int128 to FPRs despite being an
    integer type, will need to look into this.


    While theoretically could hold 6 pairs, due to needing registers for
    other stuff it is closer to 3 or 4.

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64

    Right now, one needs great FP64; and the occasional FP128.

    Choose wisely.


    FP64 + poor FP128 is the better option IMO.

    Qupls has support for both binary and decimal float128 in the ISA, but
    I have not bothered to implement either yet. Binary floats support
    multiple precisions, whereas there is only a 128-bit decimal float. This is
    on the assumption that one would want lots of precision with decimal
    floats, not performance. There are also opcodes reserved for 64-bit
    posits.

    Qupls uses a seven-bit base opcode space which is mostly full. All the
    different precisions, data types, and support for SIMD and vectors
    really use up opcode space. Opcode usage is fairly regular, however: the
    same func code represents the same instruction in different precisions.

    The cases where FP128 is used are so rare as to make its performance
    mostly irrelevant, but in this case, did realize a need for full
    IEEE-754 support, which meant likely needing to redo this stuff in C
    (also relevant to get it working on RISC-V).

    Didn't necessarily want 3 different ASM versions:
      XG1/2, RISC-V, and XG3.

    Even if RISC-V is kind of a poor ISA to support this kind of thing.


    Could potentially put the Int128 values off in FPR space, since there is less pressure there, and not like RV can actually do much with the
    values in GPRs anyways.

    You need 64×64->128, 128 << or >>, 128+128->128 for FP128 emulation
    So, either
    a) you have a complete integer ISA on FPRs
    b) you do FP128 emulation in GPRs

    You can move from FPRs to GPRs on function calls, and back on return:
      MV  X10, F18
      MV  X11, F19
      MV  X12, F20
      MV  X13, F21
      JAL X1, __whatever
      MV  F22, X10
      MV  F23, X11

    In the case of RISC-V, sorta works since typically pretty much nothing
    is done inline in this case (and at least the ability to freely move
    stuff between registers isn't an issue in RV64G).


    Hee, hee. I have gone with 128 GPRs for Qupls. So, no trouble
    using register pairs or quads for data. I figured it was better to
    expose the registers that might be used for vector storage and allow
    them to be used for other purposes. Vector instructions are still code
    dense: the instruction specifies only the first register of the vector,
    and the machine increments the register number for the required vector
    length.
    It is really quasi-vectors in Qupls. A vector instruction gets converted
    into multiple SIMD instructions as needed. However, some vector
    instructions cannot be done, like arbitrary vector slides.

    Though, within the ASM support functions, still limited to using GPRs
    (since the integer ops are used and only work on GPRs).



    Ironically though, this is one area where having the Q extension is
    actively worse than not having the Q extension:
    F/D only: Can freely move values between GPRs and FPRs;
    Q extension: This goes away, can only use memory loads/stores to move
    the values between register types in this case...


    As for multiply, RV has:
      MUL  : 64*64->64, low results
      MULHU: 64*64->64, high results

    For Float128 with IEEE rounding, needed a 128*128->256 bit multiply.

    Noted that it would be faster in this case to have a single multiply
    that produces both high-and-low results at the same time, than to
    emulate a 128-bit MULHU, as the MULHU would effectively still need to calculate the low-half to get carry propagation correct.

    Ended up doing this part in ASM partly to make it less needlessly slow.
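The "both halves at once" point can be sketched in C. This is an illustration only (function names are made up): `unsigned __int128` (a GCC/Clang extension) stands in for what would be a MUL/MULHU pair per partial product on RV64, and the schoolbook structure shows why the high half of the 128-bit product cannot be computed without also propagating carries out of the low half.

```c
#include <stdint.h>

/* 64x64 -> 128 multiply returning both halves at once.
   On RV64 each call corresponds to a MUL (low) plus MULHU (high). */
static void mul64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    *lo = (uint64_t)p;
    *hi = (uint64_t)(p >> 64);
}

/* 128x128 -> 256 schoolbook multiply; r[0] is the lowest 64-bit limb.
   Note the cross terms feed carries from the low limbs into the high
   ones, which is why a 128-bit "MULHU" still has to compute the low
   half internally. */
void mul128x128(uint64_t a_hi, uint64_t a_lo,
                uint64_t b_hi, uint64_t b_lo, uint64_t r[4])
{
    uint64_t hi, lo;
    unsigned __int128 t;

    mul64x64(a_lo, b_lo, &r[1], &r[0]);
    mul64x64(a_hi, b_hi, &r[3], &r[2]);

    mul64x64(a_lo, b_hi, &hi, &lo);       /* cross term 1, offset 64 */
    t = (unsigned __int128)r[1] + lo;
    r[1] = (uint64_t)t;
    t = (unsigned __int128)r[2] + hi + (uint64_t)(t >> 64);
    r[2] = (uint64_t)t;
    r[3] += (uint64_t)(t >> 64);

    mul64x64(a_hi, b_lo, &hi, &lo);       /* cross term 2, offset 64 */
    t = (unsigned __int128)r[1] + lo;
    r[1] = (uint64_t)t;
    t = (unsigned __int128)r[2] + hi + (uint64_t)(t >> 64);
    r[2] = (uint64_t)t;
    r[3] += (uint64_t)(t >> 64);
}
```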



    -------------------
    Basically, only really AND/OR/XOR can be done inline.

    In your ISA,


    In this case, in RV64G mode...

    In XG3 the situation doesn't suck nearly so badly.


    ADD/SUB/NEG: Logic too complicated, needs a function call.

    LoL,

    Shifts: Likewise, also needs multiple branch-paths depending on the
    shift amount.

    LoL,  Wrong ISA model.

    Predication would help here...

    But, RISC-V only has BEQ/BNE/BLT/... to resolve this issue.

    So, no predication, or conditional-select, or ...


    So, one needs paths, say:
      Shift is negative;
      Shift is 0 bits;
      Shift is 1 to 63 bits;
      Shift is 64 to 127 bits;
      ...
    For 128+, can mask to 7 bits after verifying that the shift is positive.

    Checking the sign of the shifts is needed to match the behavior of the
    shift ops in my ISA, where negative shifts go the opposite direction
    (and I already needed to deal specially with 0).

    So, say (pseudo-code, I actually wrote this part in ASM):
      if(shl>0)
      {
        shl&=127;
        if(shl>=64)
        {
          out_hi=in_lo<<(shl&63);
          out_lo=0;
        }else
        {
          out_hi=(in_hi<<(shl&63))|(in_lo>>(64-(shl&63)));
          out_lo=in_lo<<(shl&63);
        }
      }else
      {
        if(shl==0)
        {
          out_hi=in_hi;
          out_lo=in_lo;
        }else
        {
           //flip sign and call opposite shift
        }
      }

    ...

    Well, unless one wants to argue that there are more efficient ways to
    deal with 128-bit integer math in RISC-V.


    I need to review shifting for Qupls. It is being done in a somewhat
    simple manner, with RTL code for both right and left shifts. I would hide
    the negative shift behind an ISA instruction which has a positive shift.
    That is, both left and right shifts rotate left internally
    (micro-architecturally), but the programmer does not see that; they see
    only shifts with positive shift amounts. Negative shift amounts can be confusing.

    Love the power of micro-ops.
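The rotate-left trick can be modeled in C (a sketch of the idea, not actual Qupls RTL; names are made up): a logical right shift by n is a left rotate by 64-n followed by masking off the bits that wrapped around, so one internal rotator serves both shift directions.

```c
#include <stdint.h>

/* Left rotate, well-defined for any n (masked to 0..63). */
static inline uint64_t rotl64(uint64_t x, unsigned n)
{
    n &= 63;
    return n ? (x << n) | (x >> (64 - n)) : x;
}

/* Logical right shift modeled as a left rotate plus mask:
   the datapath only ever rotates left; the mask discards the
   wrapped-around bits that a true shift would have dropped. */
uint64_t srl_via_rotl(uint64_t x, unsigned n)
{
    n &= 63;
    uint64_t mask = n ? (~0ULL >> n) : ~0ULL;
    return rotl64(x, (64 - n) & 63) & mask;
}
```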



    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Apr 5 00:40:28 2026
    From Newsgroup: comp.arch

    On 2026-04-04 11:41 p.m., Stephen Fuld wrote:
    On 4/4/2026 1:10 PM, MitchAlsup wrote:

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64


    I am not sure those are the only choices.  Go back to the days of 32 bit registers.  Many/most of the architectures of that era had good support
    for both 32 bit single precision and 64 bit double precision.  So why
    now, in the era of 64 bit registers, can't you have good support for 64
    and 128 bit floating point?  i.e. what's different?


    IMO the difference may be the usefulness for hardware cost (value). It
    takes twice as much hardware for something that may be only rarely used.
    I think there was good demand for 32-bit machines, a little bit less for 64-bit and even less for 128-bit.
    Wide registers are handy for handling wider data, but the float
    precision may not be as in demand. With 512-bit wide SIMD registers,
    should there be 512-bit floating-point precision?

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sat Apr 4 22:12:25 2026
    From Newsgroup: comp.arch

    On 4/4/2026 9:40 PM, Robert Finch wrote:
    On 2026-04-04 11:41 p.m., Stephen Fuld wrote:
    On 4/4/2026 1:10 PM, MitchAlsup wrote:

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64


    I am not sure those are the only choices.  Go back to the days of 32
    bit registers.  Many/most of the architectures of that era had good
    support for both 32 bit single precision and 64 bit double precision.
    So why now, in the era of 64 bit registers, can't you have good
    support for 64 and 128 bit floating point?  i.e. what's different?


    IMO the difference may be the usefulness for hardware cost (value). It
    takes twice as much hardware for something that may be only rarely used.
    I think there was good demand for 32-bit machines, a little bit less for 64-bit and even less for 128-bit.

    If you wanted to add a fourth choice to Mitch's list - something like
    Great FP 64 and 128 but at a significant extra hardware cost, I would
    accept that. I wasn't commenting on the utility of implementing FP128,
    only Mitch's alternatives for such an implementation.


    Wide registers are handy for handling wider data, but the float
    precision may not be as in demand. With 512-bit wide SIMD register,
    should there be 512-bit floating-point precision?

    Probably not, as I see essentially no demand for 512 bit floats. I
    don't know what the demand for 128 bit floats is, but Mitch's post seems
    to assume there was enough demand to implement it, and was discussing alternative implementations.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Apr 5 07:58:23 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64

    I would consider a single fully pipelined 128-bit FMA unit in
    parallel to the usual FP64 units great, and it should not
    compromise FP64.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Apr 5 12:01:37 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/4/2026 10:54 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/3/2026 2:30 PM, Stefan Monnier wrote:
    MitchAlsup [2026-04-02 22:28:22] wrote:
    -------------------
    Can note that at present, a lot of the "__int128" code generated on RV ends up looking like:
    Load value from the stack;
    Load other value from the stack;
    Copy into argument registers;
    Copy into argument registers;
    Call Int128 support function;
    Copy return value into other registers;
    Store back to the stack (so it can repeat this whole process).

    Which, granted, there isn't that much of a better way to do this...


    Though, almost starts to make one question if on RV it would almost be better to skip keeping int128/float128 in registers, and instead be like:
    ADDI X10, SP, DispDst //load destination address
    ADDI X11, SP, DispSrc1 //load first source address
    ADDI X12, SP, DispSrc2
    JAL X1, __xli_add_3m //do operation in memory

    This is an argument against separate register files. Since FP128
    is nothing like FP32 or FP64 and FP32 or FP64 instructions do not
    help in calculations, ...

    It also shows that once something does not fit the register format
    one is better off passing via pointers. ...

    It is a great tradeoff of register file size:
    32 GPRs;
    12 callee save registers;
    Register pairing.
    32 GPRs
    16 callee save registers
    No register pairing

    Result:
    Huge register pressure and a whole lot of spill-and-fill...

    My 66000 has less spill fill than RISC-V even without FPRs.

    While theoretically could hold 6 pairs, due to needing registers for
    other stuff it is closer to 3 or 4.

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64

    Right now, one needs great FP64; and the occasional FP128.
    I believe that the best compromise, both today and for the next 10 years,
    is to have great f64 which includes Augmented[Addition|Multiplication].
    Those two were added in 754-2019; they enable exact/arbitrary-precision arithmetic with very little overhead on an FPU which already supports
    FMAC (which was officially included in the 2008 revision of the standard).
    On top of this you need the occasional full f128 operation in order to
    get the extended exponent range, but this is much less common than just
    an extended mantissa.
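For concreteness: augmentedAddition returns the rounded sum together with the exact rounding error. On hardware without it, the classic software analogue is Knuth's TwoSum, six ordinary f64 adds under round-to-nearest (the 754-2019 operation additionally changes the tie-breaking rule to ties-toward-zero, so this is an analogue, not an exact substitute):

```c
/* Knuth's TwoSum: computes s = fl(a+b) and e such that
   a + b == s + e exactly, using only ordinary double adds.
   This is the software fallback for augmentedAddition. */
void two_sum(double a, double b, double *s, double *e)
{
    *s = a + b;
    double bb = *s - a;           /* the part of b that made it into s */
    *e = (a - (*s - bb)) + (b - bb);  /* what was rounded away */
}
```

Chaining TwoSum (and its FMA-based multiplication counterpart) is what makes double-double and arbitrary-precision arithmetic cheap on an FMA-capable FPU.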

    Choose wisely.

    Could potentially put the Int128 values off in FPR space, since there is
    less pressure there, and not like RV can actually do much with the
    values in GPRs anyways.

    You need 64×64->128, 128 << or >>, 128+128->128 for FP128 emulation
    So, either
    a) you have a complete integer ISA on FPRs
    b) you do FP128 emulation in GPRs
    If you emulate f128, then using pairs of 64-bit integer regs is the
    easiest solution. If you have a single register set then it becomes much easier to mix integer/logic ops with fp operations, i.e. to implement
    special functions where an f64 result can be used as a starting point for one or two NR iterations.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Apr 5 14:46:40 2026
    From Newsgroup: comp.arch

    On Sun, 5 Apr 2026 12:01:37 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/4/2026 10:54 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/3/2026 2:30 PM, Stefan Monnier wrote:
    MitchAlsup [2026-04-02 22:28:22] wrote:
    -------------------
    Can note that at present, a lot of the "__int128" code generated
    on RV ends up looking like:
    Load value from the stack;
    Load other value from the stack;
    Copy into argument registers;
    Copy into argument registers;
    Call Int128 support function;
    Copy return value into other registers;
    Store back to the stack (so it can repeat this whole
    process).

    Which, granted, there isn't that much of a better way to do
    this...


    Though, almost starts to make one question if on RV it would
    almost be better to skip keeping int128/float128 in registers,
    and instead be like: ADDI X10, SP, DispDst //load destination
    address ADDI X11, SP, DispSrc1 //load first source address
    ADDI X12, SP, DispSrc2
    JAL X1, __xli_add_3m //do operation in memory

    This is an argument against separate register files. Since FP128
    is nothing like FP32 or FP64 and FP32 or FP64 instructions do not
    help in calculations, ...

    It also shows that once something does not fit the register format
    one is better off passing via pointers. ...

    It is a great tradeoff of register file size:
    32 GPRs;
    12 callee save registers;
    Register pairing.
    32 GPRs
    16 callee save registers
    No register pairing

    Result:
    Huge register pressure and a whole lot of spill-and-fill...

    My 66000 has less spill fill than RISC-V even without FPRs.

    While theoretically could hold 6 pairs, due to needing registers
    for other stuff it is closer to 3 or 4.

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64

    Right now, one needs great FP64; and the occasional FP128.

    I believe that the best compromise both today and for the next 10
    years is to have great f64 which includes
    Augmented[Addition|Multiplication].

    Those two were added in 754-2019, they enable exact/arbitrary
    precision arithmetic with very little overhead on an fpu which
    already supports FMAC (which was officially included in the standard
    in 1998 or 2008).

    On top of this you need the occasional full f128 operation in order
    to get the extended exponent range, but this is much less common than
    just an extended mantissa.

    Choose wisely.

    Could potentially put the Int128 values off in FPR space, since
    there is less pressure there, and not like RV can actually do much
    with the values in GPRs anyways.

    You need 64×64->128, 128 << or >>, 128+128->128 for FP128 emulation
    So, either
    a) you have a complete integer ISA on FPRs
    b) you do FP128 emulation in GPRs

    If you emulate f128, then using pairs of 64-bit integer regs is the
    easiest solution. If you have a single register set then it becomes
    much easier to mix integer/logic ops with fp operations, i.e. to
    implement special functions where a f64 result can be used as a
    starting point for one or two NR iterations.

    Terje

    On modern CPUs, i.e. Zen3 and better, integer division is a better
    (than fp64 fdiv) starting point for fp128 division.
    For sqrtq/rsqrtq, an approximate fp64 rsqrt is probably the best starting
    step, but not every ISA has it. For example, x86-64 only has it
    with AVX512, and even there the precision is only 28 bits. 28 bits is an
    excellent starting point for fp64 sqrt/rsqrt, but for fp128 it's not
    quite sufficient. I'd prefer 31-32 bits, in order to get the exact value
    after 2 NR iterations for >99.9% of the inputs.
    The cost of moving things between FP and general registers that you
    mentioned above is hardly above the noise floor, esp. for slower operations
    like sqrt and div.
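The NR step in question, for reciprocal refinement, roughly doubles the number of correct bits per iteration, which is why a 28-bit seed (~56 bits after one step, ~112 after two) falls just short of fp128's 113-bit mantissa while a 31-32 bit seed suffices. A scalar f64 sketch of the iteration (illustration only; a real fp128 path would carry it in extended precision):

```c
/* One Newton-Raphson step refining r ~= 1/d:
   r' = r * (2 - d*r). The relative error squares each step,
   so correct bits roughly double per iteration. */
double nr_recip_step(double d, double r)
{
    return r * (2.0 - d * r);
}
```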
    However, avoidance of FP could have other advantages. The most
    important is that FP affects flags, and in current ABIs flags are shared
    between fp128 and fp64/fp32. If one uses the FPU for emulation of fp128,
    then setting flags in the IEEE-prescribed manner becomes quite difficult,
    especially so for Inexact - the most useless of them all, but still
    mandatory according to IEEE.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Apr 5 13:30:42 2026
    From Newsgroup: comp.arch

    On 4/4/2026 11:31 PM, Robert Finch wrote:
    On 2026-04-04 9:06 p.m., BGB wrote:
    On 4/4/2026 3:10 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/4/2026 10:54 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/3/2026 2:30 PM, Stefan Monnier wrote:
    MitchAlsup [2026-04-02 22:28:22] wrote:
    -------------------
    Can note that at present, a lot of the "__int128" code generated
    on RV
    ends up looking like:
         Load value from the stack;
         Load other value from the stack;
         Copy into argument registers;
         Copy into argument registers;
         Call Int128 support function;
         Copy return value into other registers;
         Store back to the stack (so it can repeat this whole process).
    Which, granted, there isn't that much of a better way to do this...

    Though, almost starts to make one question if on RV it would
    almost be
    better to skip keeping int128/float128 in registers, and instead
    be like:
         ADDI X10, SP, DispDst   //load destination address
         ADDI X11, SP, DispSrc1  //load first source address
         ADDI X12, SP, DispSrc2
         JAL  X1, __xli_add_3m   //do operation in memory

    This is an argument against separate register files. Since FP128
    is nothing like FP32 or FP64 and FP32 or FP64 instructions do not
    help in calculations, ...

    It also shows that once something does not fit the register format
    one is better off passing via pointers. ...

    It is a great tradeoff of register file size:
        32 GPRs;
        12 callee save registers;
        Register pairing.
          32 GPRs
          16 callee save registers
          No register pairing

    Result:
    Huge register pressure and a whole lot of spill-and-fill...

    My 66000 has less spill fill than RISC-V even without FPRs.


    Possibly true.

    As for Callee-Save GPRs:
       XG1   : 15 (31 with XGPR)
       XG2   : 31
       XG3   : 28
       RISC-V: 12 (with 0/12/16 more on the FPR side)
         LP64 ABI: 12+0
         LP64D ABI: 12+12
         BGBCC's ABI: 12+16


    Also RISC-V mode has the highest spill-and-fill.

    Not usually too bad, but trying to put Int128 in GPRs causes it to
    become absurd. Likely better to move Int128 to FPRs despite being an
    integer type, will need to look into this.


    While theoretically could hold 6 pairs, due to needing registers for
    other stuff it is closer to 3 or 4.

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64

    Right now, one needs great FP64; and the occasional FP128.

    Choose wisely.


    FP64 + poor FP128 is the better option IMO.

    Qupls has support for both binary and decimal float 128 in the ISA. But
    I have not bothered to implement either yet. binary floats support
    multiple precisions whereas there is only 128-bit decimal float. This is
    on the assumption that one would want lots of precision with decimal
    floats and not performance. There is also opcodes reserved for 64-bit posits.

    Using a seven-bit base opcode space which is mostly full. All the
    different precisions, data types, and support for SIMD and vectors
    really uses up opcode space. Opcode usage is fairly regular however. The same func code represents the same instruction in different precisions.


    OK.

    Current formats:
    BJX2:
    Binary64 (Scalar)
    Binary32 and Binary16: SIMD and Converters
    FP8/FP8U/FP8A: SIMD Converters
    Binary128: Software
    BFloat16: Faked in software
    RISC-V:
    Binary32 (F, Scalar | SIMD)
    Binary64 (D, Scalar)
    Binary16 (Zfh)
    Binary128 (Q; or "Pseudo Q" in my case)
    (Pseudo Q using the same encodings, but register pairs)
    (Implementation via trap-and-emulate, potential patching)
    Recently implemented: Plain software path.
    BFloat16 ("Zfbf16" or such)
    FP8 via RV-V (unsupported).

    FP-SIMD formats in my case:
    2x Binary32 (both ISAs, RV via F encodings)
    4x Binary16 (both ISAs, RV via Zfh encodings)
    4x Binary32 (Register Pair)
    Via converters:
    4x FP8/FP8U/FP8A (<> 4x Binary16)



    Spent a while trying to figure out why my new Float128 code wasn't
    working, thinking there was still an Int128 bug...

    Turns out I did a screw-up and was using the non-shifted mantissa for FADD/FSUB rather than the shifted one (which didn't exactly work correctly).


    The cases where FP128 are used are so rare as to make its performance
    mostly irrelevant, but in this case, did realize a need for full
    IEEE-754 support, which meant likely needing to redo this stuff in C
    (also relevant to get it working on RISC-V).

    Didn't necessarily want 3 different ASM versions:
       XG1/2, RISC-V, and XG3.

    Even if RISC-V is kind of a poor ISA to support this kind of thing.


    Could potentially put the Int128 values off in FPR space, since
    there is
    less pressure there, and not like RV can actually do much with the
    values in GPRs anyways.

    You need 64×64->128, 128 << or >>, 128+128->128 for FP128 emulation
    So, either
    a) you have a complete integer ISA on FPRs
    b) you do FP128 emulation in GPRs

    You can move from FPRs to GPRs on function calls, and back on return:
       MV  X10, F18
       MV  X11, F19
       MV  X12, F20
       MV  X13, F21
       JAL X1, __whatever
       MV  F22, X10
       MV  F23, X11

    In the case of RISC-V, sorta works since typically pretty much nothing
    is done inline in this case (and at least the ability to freely move
    stuff between registers isn't an issue in RV64G).


    Hee, hee. I have gone with 128 GPR registers for Qupls. So, no trouble
    using register pairs or quads for data. I figured it was better to
    expose the registers that might be used for vector storage and allow
    them to be used for other purposes. Vector instructions are still code dense, taking the first register for the vector specified in the ISA instruction. The machine increments the register number for the vector length required.
    It is really quasi-vectors in Qupls. A vector instruction gets converted into multiple SIMD instructions as needed. However some vector
    instructions cannot be done, like arbitrary vector slides.


    Yeah.

    As-is:
    XG1: 32/64 registers
    Whole ISA only has 32 registers;
    Subset has 64 registers;
    Other ops can encode 64 registers via Jumbo Prefixes.
    Has 16-bit ops, mostly limited to R0..R15.
    Oldest and least orthogonal of the schemes.
    XG2: 64 registers
    Trades 16-bit encodings for making 64-bit GPRs orthogonal.
    XG3: 64 registers
    Mostly a repack of XG2 to coexist with RISC-V;
    Also switches to RISC-V's register numbering;
    Albeit with a unified register space.
    Canonically drops some parts of XG2.
    RISC-V: 32+32 registers.
    RV-V would add 32x more, don't currently support RV-V,
    If I did, would likely be via software emulation.

    Register spaces:
    XG1/XG2:
    R0/R1: Stomp Registers, Special | N/E in some contexts.
    Functionally, scratch registers with encoding restrictions.
    R2..R7: Scratch
    R4..R7: Arg0..Arg3
    R4..R14: Callee Save
    R15: SP
    R16..R23: Scratch
    R20..R23: Arg4..Arg7
    R24..R31: Callee Save

    R32..R39: Scratch
    R36..R39: Arg8..Arg11
    R40..R47: Callee Save
    R48..R55: Scratch
    R52..R55: Arg12..Arg15
    R56..R63: Callee Save

    XG3 / RISC-V
    R0/X0 (ZR): Zero
    R1/X1 (LR/RA): Link Register
    R2/X2 (SP): Stack pointer
    R3/X3 (GBR/GP): Global Pointer
    R4/X4 (TP): Task Pointer
    R5..R7: Scratch (Stomp Registers in BGBCC)
    R8/R9: Callee Save
    R10..R17: Scratch / Arg0..Arg7
    R18..R27: Callee Save
    R28..R31: Scratch
    R32/F0 .. R35/F3: Scratch (Stomp in BGBCC)
    R36/F4 .. R39/F7: Callee Save (BGBCC)
    R40/F8 .. R41/F9: Callee Save
    R42/F10 .. R49/F17: Scratch (potential Arg8..Arg15)
    R50/F18 .. R59/F27: Callee Save
    F60/F28 .. R63/F31: Scratch

    There is not a 1:1 mapping between the XG1/XG2 and RV/XG3 register spaces.

    As-is, RV -> XG1/2 space:
    X0 -> ZR (N/E)
    X1 -> LR (CR)
    X2 -> SP (R15)
    X3 -> GBR (CR)
    X4..X13: R4..R13
    X14/X15: R2/R3
    X16..X31: R16..R31
    F0..F31: R32..R63

    The XG1/XG2 registers for R0, R1, and R14, are inaccessible on the
    RV/XG3 side.

    As-is, this means a core hosted using RV or XG3 in effect couldn't run
    XG1 or XG2 code natively, as some of the registers could not be
    preserved on context switches (absent adding some other mechanism to
    access them).


    Comparison if RV and XG3 instruction layouts:
    ZZZZttttttssssssZZZZnnnnnnYYYYpp (XG3, 3R)
    ZZZZZZZtttttsssssZZZnnnnnYYYYY11 (RV, 3R)
    iiiiiiiiiissssssZZZZnnnnnnYYYYpp (XG3, 3RI, Imm10)
    iiiiiiiiiiiisssssZZZnnnnnYYYYY11 (RV, 3RI, Imm12)

    Divergent:
    iiiiiiiiiiiiiiiiZZZZnnnnnnYYYYpp (XG3, 2RI-Imm16)
    iiiiiiiiiiiiiiiiZZZZiiiiiiiYYYpp (XG3, Imm23, BRA/BSR)
    iiiiiiiiiiiiiiiiiiiinnnnnYYYYY11 (RV, 2RI-Imm20; JAL/LUI/AUIPC)
    Despite superficial similarity, JAL uses a different Disp layout.
    RV's JAL displacement is horribly dog-chewed.


    RISC-V shuffles the Imm12 bits around depending on the instruction type;
    for XG3, the layout remains unchanged (Load/Store/ALU/Bcc all use the
    same layout in XG3, though the scale may differ).

    In XG3 encodings, the assumption is that the scale of the immediate
    differs; in RISC-V, the encoding bits move around instead.
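The "dog-chewed" J-type layout can be made concrete: JAL's 20 immediate bits sit in inst[31:12] in the order imm[20|10:1|11|19:12], so a decoder has to reassemble them piecewise. A sketch, assuming the standard RV32/RV64 encoding:

```c
#include <stdint.h>

/* Reassemble the sign-extended JAL offset from a 32-bit RISC-V
   instruction word. The J-type format packs imm[20|10:1|11|19:12]
   into inst[31:12], so every field lands at a different position. */
int32_t jal_offset(uint32_t inst)
{
    uint32_t imm = (((inst >> 31) & 0x001) << 20)  /* imm[20]    */
                 | (((inst >> 21) & 0x3FF) << 1)   /* imm[10:1]  */
                 | (((inst >> 20) & 0x001) << 11)  /* imm[11]    */
                 | (((inst >> 12) & 0x0FF) << 12); /* imm[19:12] */
    return (int32_t)(imm << 11) >> 11;  /* sign-extend from bit 20 */
}
```

The scattering exists so that the sign bit and most immediate bits stay in the same positions across B/J-type formats, simplifying the mux tree in hardware at the expense of software decoders.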



    Encoding was different in XG1 and XG2, divided into 16-bit words.

    XG1
    YYYYZZZZnnnnmmmm 2R
    YYYYZZZZnnnniiii 2RI Imm4
    YYYYnnnniiiiiiii 2RI Imm8 (ADD)
    YYYYZZZZiiiiiiii Imm8
    YYYYiiiiiiiiiiii Imm12 (LDI Imm, R0)
    111PYwYYnnnnmmmm ZZZZqnmoooooZZZZ //3R
    111PYwYYnnnnmmmm ZZZZqnmiiiiiiiii //3RI Imm9
    111PYwYYZZZnnnnn iiiiiiiiiiiiiiii //2RI Imm16
    111PYwYYiiiiiiii ZZZZiiiiiiiiiiii //Imm20 (BRA/BSR)

    XG2:
    NMOPYwYYnnnnmmmm ZZZZqnmoooooZZZZ //3R
    NMIPYwYYnnnnmmmm ZZZZqnmiiiiiiiii //3RI Imm10
    NZZPYwYYZZZnnnnn iiiiiiiiiiiiiiii //2RI Imm16
    IIIPYwYYiiiiiiii ZZZZiiiiiiiiiiii //Imm23 (BRA/BSR)


    Note that XG1's 32-bit encodings are still valid in XG2, and XG2's
    encodings mostly map up to XG3 (though there are things that are N/E
    between them).


    Though, within the ASM support functions, still limited to using GPRs
    (since the integer ops are used and only work on GPRs).



    Ironically though, this is one area where having the Q extension is
    actively worse than not having the Q extension:
    F/D only: Can freely move values between GPRs and FPRs;
    Q extension: This goes away, can only use memory loads/stores to move
    the values between register types in this case...


    As for multiply, RV has:
       MUL  : 64*64->64, low results
       MULHU: 64*64->64, high results

    For Float128 with IEEE rounding, needed a 128*128->256 bit multiply.

    Noted that it would be faster in this case to have a single multiply
    that produces both high-and-low results at the same time, than to
    emulate a 128-bit MULHU, as the MULHU would effectively still need to
    calculate the low-half to get carry propagation correct.

    Ended up doing this part in ASM partly to make it less needlessly slow.



    -------------------
    Basically, only really AND/OR/XOR can be done inline.

    In your ISA,


    In this case, in RV64G mode...

    In XG3 the situation doesn't suck nearly so badly.


    ADD/SUB/NEG: Logic too complicated, needs a function call.

    LoL,

    Shifts: Likewise, also needs multiple branch-paths depending on the
    shift amount.

    LoL,  Wrong ISA model.

    Predication would help here...

    But, RISC-V only has BEQ/BNE/BLT/... to resolve this issue.

    So, no predication, or conditional-select, or ...


    So, one needs paths, say:
       Shift is negative;
       Shift is 0 bits;
       Shift is 1 to 63 bits;
       Shift is 64 to 127 bits;
       ...
    For 128+, can mask to 7 bits after verifying that the shift is positive.

    Checking the sign of the shifts is needed to match the behavior of the
    shift ops in my ISA, where negative shifts go the opposite direction
    (and I already needed to deal specially with 0).

    So, say (pseudo-code, I actually wrote this part in ASM):
       if(shl>0)
       {
         shl&=127;
         if(shl>=64)
         {
            out_hi=in_lo<<(shl&63);
           out_lo=0;
         }else
         {
           out_hi=(in_hi<<(shl&63))|(in_lo>>(64-(shl&63)));
           out_lo=in_lo<<(shl&63);
         }
       }else
       {
         if(shl==0)
         {
           out_hi=in_hi;
           out_lo=in_lo;
         }else
         {
            //flip sign and call opposite shift
         }
       }

    ...

    Well, unless one wants to argue that there are more efficient ways to
    deal with 128-bit integer math in RISC-V.


    I need to review shifting for Qupls. It is being done in a somewhat
    simple manner with RTL code for both right and left shifts. I would
    hide the negative shift behind an ISA instruction which has a positive
    shift. That is, both left and right shifts rotate left internally
    (micro-architecturally?), but the programmer does not see that. They
    see only shifts with positive shift amounts. Negative shift amounts
    can be

    Love the power of micro-ops.


    In my case, initially I only had left-shift with right-shifted being
    expressed as a negative left shift (carried over from SH and similar).

    Later added right-shift instructions, but they follow the same pattern.

    So, say, for 64-bit shift:
    0.. 63: Shift left 0 to 63 bits;
    64.. 127: Shift left 0 to 63 bits (Mod-64);
    -1..-63: Shift right 1 to 63 bits;
    -64..-128: Shift right 0 to 63 bits (Mod-64);

    Generally with higher bits being ignored (and values outside of
    -128..127 being undefined).


    For 128-bit shift (via ALUX):
    0.. 127: Shift left 0 to 127 bits;
    -1..-127: Shift right 1 to 127 bits;
    Anything outside this range being undefined.

    In practice, the existing unit would alternate directions on overflow
    for the ALUX handling (but with the 128-bit shift instructions being N/E
    in RV land, and optional in XG3).


    There are sometimes cases where a negative shift becoming a shift in
    the opposite direction can be useful.

    Though, this differs from the canonical behavior in RISC-V (naive
    modulo shift, like in x86).

    For the support functions, a signed directional shift had been assumed,
    though differing slightly:
    Positive shifts: Always Mod-N in the same direction (eg, left);
    Negative shifts: Always Mod-N in the same direction (eg, right).
    Though, would still become undefined outside of Int32 range.

    ...


    Can note that it seems for my FP128 test program there is a quite
    significant code size difference between the RV and XG3 versions (even
    in the current absence of ALUX instructions, and with some of my own
    extensions on the RV side).

    ...

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Apr 5 14:23:48 2026
    From Newsgroup: comp.arch

    On 4/5/2026 12:12 AM, Stephen Fuld wrote:
    On 4/4/2026 9:40 PM, Robert Finch wrote:
    On 2026-04-04 11:41 p.m., Stephen Fuld wrote:
    On 4/4/2026 1:10 PM, MitchAlsup wrote:

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64


    I am not sure those are the only choices.  Go back to the days of 32
    bit registers.  Many/most of the architectures of that era had good
    support for both 32 bit single precision and 64 bit double precision.
    So why now, in the era of 64 bit registers, can't you have good
    support for 64 and 128 bit floating point?  i.e. what's different?


    IMO the difference may be the usefulness for hardware cost (value). It
    takes twice as much hardware for something that may be only rarely used.
    I think there was good demand for 32-bit machines, a little bit less
    for 64-bit and even less for 128-bit.

    If you wanted to add a fourth choice to Mitch's list - something like
    Great FP 64 and 128 but at a significant extra hardware cost, I would
    accept that.  I wasn't commenting on the utility of implementing FP128, only Mitch's alternatives for such an implementation.


    I could in theory do native FP128 with paired registers, similar to how
    some existing machines did native FP64 with paired 32-bit registers.

    But, yeah, they are not the same:
    FP64: Needed often.
    FP128: Needed rarely.


    Current cost for doing FP128 in hardware would be too expensive though.
    So, for now, it is software options.

    Which, as-is, are:
    Pseudo-Q option:
    Pretend the Q ops exist, but operate on pairs;
    Trap and emulate them;
    Better code density but slower (due to trap overheads).
    Other option:
    Runtime calls (recently implemented for the RV case);
    Awful code density but slightly less slow in this case.



    Wide registers are handy for handling wider data, but the float
    precision may not be as in demand. With 512-bit wide SIMD register,
    should there be 512-bit floating-point precision?

    Probably not, as I see essentially no demand for 512 bit floats.  I
    don't know what the demand for 128 bit floats is, but Mitch's post seems
    to assume there was enough demand to implement it, and was discussing alternative implementations.


    Yeah:
    FP64 : Semi-common workhorse;
    FP128: Rare, when FP64 isn't enough;
    FP256: Universe-scale numbers.
    FP512: What even would it be used for?...

    Like, say, with FP256 you could already express a coordinate space
    effectively covering the size of the observable universe.

    OTOH, it isn't that much harder to support FP256 than to support FP128,
    if one has a way to express 256-bit integer math. It is, of course,
    slower...


    Though one possible optimization would be to detect cases where the
    wider formats could fall back to narrower math.

    Where, FP128 and FP256 could have enough general overhead to make it worthwhile to detect these sorts of narrowing cases.




    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Apr 5 20:29:50 2026
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 4/4/2026 1:10 PM, MitchAlsup wrote:

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64


    I am not sure those are the only choices. Go back to the days of 32 bit registers. Many/most of the architectures of that era had good support
    for both 32 bit single precision and 64 bit double precision. So why
    now, in the era of 64 bit registers, can't you have good support for 64
    and 128 bit floating point? i.e. what's different?

    In the 32-bit era, there was very significant demand for 64-bit FP
    and we were just waiting for silicon acreage to do 64-bits;
    whereas,
    In the 64-bit era, there is a bit of demand for FP128 to make it not
    stink to high heaven, but not enough to even consider making everything 128-bits.

    So in a) we had to have useful FP64 in an essentially 32-bit machine;
    but right now, in b), there is no essentiality: just don't screw it up
    enough that it can't be useful on occasion.

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Apr 5 20:40:25 2026
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/4/2026 10:54 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/3/2026 2:30 PM, Stefan Monnier wrote:
    MitchAlsup [2026-04-02 22:28:22] wrote:
    -------------------
    Can note that at present, a lot of the "__int128" code generated on RV
    ends up looking like:
    Load value from the stack;
    Load other value from the stack;
    Copy into argument registers;
    Copy into argument registers;
    Call Int128 support function;
    Copy return value into other registers;
    Store back to the stack (so it can repeat this whole process).

    Which, granted, there isn't that much of a better way to do this...


    Though, it almost starts to make one question if, on RV, it would be
    better to skip keeping int128/float128 in registers, and instead be like:
    ADDI X10, SP, DispDst //load destination address
    ADDI X11, SP, DispSrc1 //load first source address
    ADDI X12, SP, DispSrc2
    JAL X1, __xli_add_3m //do operation in memory

    This is an argument against separate register files. Since FP128
    is nothing like FP32 or FP64 and FP32 or FP64 instructions do not
    help in calculations, ...

    It also shows that once something does not fit the register format
    one is better off passing via pointers. ...

    It is a great tradeoff of register file size:
    32 GPRs;
    12 callee save registers;
    Register pairing.
    32 GPRs
    16 callee save registers
    No register pairing

    Result:
    Huge register pressure and a whole lot of spill-and-fill...

    My 66000 has less spill fill than RISC-V even without FPRs.

    While theoretically could hold 6 pairs, due to needing registers for
    other stuff it is closer to 3 or 4.

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64

    Right now, one needs great FP64; and the occasional FP128.

    I believe that the best compromise both today and for the next 10 years
    is to have great f64 which includes Augmented[Addition|Multiplication].

    As:

    GLOBAL Kahan_Babuška
    ENTRY Kahan_Babuška
    Kahan_Babuška:
    // note compiler put &residual in R3 at CALL
    LDD R4,[R3]
    CARRY R4,{IO}
    FADD R1,R1,R2 // will not set inexact
    STD R4,[R3]
    RET

    Inexact can be set in this usage of CARRY and FADD.


    Those two were added in 754-2019; they enable exact/arbitrary-precision
    arithmetic with very little overhead on an FPU which already supports
    FMAC (which was officially included in the standard in 2008).

    On top of this you need the occasional full f128 operation in order to
    get the extended exponent range, but this is much less common than just
    an extended mantissa.

    Choose wisely.

    Could potentially put the Int128 values off in FPR space, since there
    is less pressure there, and not like RV can actually do much with the
    values in GPRs anyways.

    You need 64×64->128, 128 << or >>, 128+128->128 for FP128 emulation
    So, either
    a) you have a complete integer ISA on FPRs
    b) you do FP128 emulation in GPRs

    If you emulate f128, then using pairs of 64-bit integer regs is the
    easiest solution. If you have a single register set then it becomes
    much easier to mix integer/logic ops with fp operations, i.e. to
    implement special functions where a f64 result can be used as a
    starting point for one or two NR iterations.

    Terje

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Apr 5 20:48:44 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Sun, 5 Apr 2026 12:01:37 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/4/2026 10:54 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 4/3/2026 2:30 PM, Stefan Monnier wrote:
    MitchAlsup [2026-04-02 22:28:22] wrote:
    -------------------
    Can note that at present, a lot of the "__int128" code generated
    on RV ends up looking like:
    Load value from the stack;
    Load other value from the stack;
    Copy into argument registers;
    Copy into argument registers;
    Call Int128 support function;
    Copy return value into other registers;
    Store back to the stack (so it can repeat this whole
    process).

    Which, granted, there isn't that much of a better way to do
    this...


    Though, almost starts to make one question if on RV it would
    almost be better to skip keeping int128/float128 in registers,
    and instead be like: ADDI X10, SP, DispDst //load destination
    address ADDI X11, SP, DispSrc1 //load first source address
    ADDI X12, SP, DispSrc2
    JAL X1, __xli_add_3m //do operation in memory

    This is an argument against separate register files. Since FP128
    is nothing like FP32 or FP64 and FP32 or FP64 instructions do not
    help in calculations, ...

    It also shows that once something does not fit the register format
    one is better off passing via pointers. ...

    It is a great tradeoff of register file size:
    32 GPRs;
    12 callee save registers;
    Register pairing.
    32 GPRs
    16 callee save registers
    No register pairing

    Result:
    Huge register pressure and a whole lot of spill-and-fill...

    My 66000 has less spill fill than RISC-V even without FPRs.

    While theoretically could hold 6 pairs, due to needing registers
    for other stuff it is closer to 3 or 4.

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64

    Right now, one needs great FP64; and the occasional FP128.

    I believe that the best compromise both today and for the next 10
    years is to have great f64 which includes Augmented[Addition|Multiplication].

    Those two were added in 754-2019; they enable exact/arbitrary-precision
    arithmetic with very little overhead on an FPU which already supports
    FMAC (which was officially included in the standard in 2008).

    On top of this you need the occasional full f128 operation in order
    to get the extended exponent range, but this is much less common than
    just an extended mantissa.

    Choose wisely.

    Could potentially put the Int128 values off in FPR space, since
    there is less pressure there, and not like RV can actually do much
    with the values in GPRs anyways.

    You need 64×64->128, 128 << or >>, 128+128->128 for FP128 emulation.
    So, either
    a) you have a complete integer ISA on FPRs
    b) you do FP128 emulation in GPRs

    If you emulate f128, then using pairs of 64-bit integer regs is the
    easiest solution. If you have a single register set then it becomes
    much easier to mix integer/logic ops with fp operations, i.e. to
    implement special functions where a f64 result can be used as a
    starting point for one or two NR iterations.

    Terje




    On modern CPUs, i.e. Zen3 and better, integer division is a better
    (than fp64 fdiv) starting point for fp128 division.

    For sqrtq/rsqrtq, an approximate fp64 rsqrt is probably the best
    starting step, but not every ISA has it. For example, x86-64 only has
    it with AVX512, and even there the precision is only 28 bits. 28 bits
    is an excellent starting point for fp64 sqrt/rsqrt, but for fp128 it's
    not quite sufficient. I'd prefer 31-32 bits, in order to get an exact
    value after 2 NR iterations for >99.9% of the inputs.

    The cost of moving things between FP and general registers that you
    mentioned above is hardly above the noise floor, esp. for slower
    operations like sqrt and div.

    However, avoidance of FP could have other advantages. The most
    important is that FP affects flags, and in current ABIs the flags are
    shared between fp128 and fp64/fp32. If one uses the FPU for emulation
    of fp128, then setting the flags in the IEEE-prescribed manner becomes
    quite difficult, especially so for Inexact: the most useless of them
    all, but still mandatory according to IEEE.

    There is also the problem of getting the inexact bit correct on FP128
    calculations (or exact FP64 calculations where the inexact bit is never
    to be set because no bits are lost at/in rounding).

    {double, double} TwoSum( double a, double b )
    {
    return a + b; // the pair {sum, residual} of a single augmented addition
    }

    Inexact should never be set in this sequence since no significance
    has been lost. Most ISAs fail miserably here, especially the ones that
    have to do this as:

    {double, double} TwoSum( double a, double b )
    { // Knuth
    x = a + b ;
    q = x - a ;
    r = x - q ;
    s = b - q ;
    t = a - r ;
    y = s + t ;
    return { x, y };
    }

    or

    {double, double} FastTwoSum( double a, double b )
    { // Dekker
    // ASSERT a > b
    x = a + b ;
    q = x - a ;
    y = b - q ;
    return { x, y };
    }

    All that "extra" arithmetic is to "get" the inexactness and put
    it back in the result-pair. Yet accessing the flags is so expensive
    that most ISAs (and accompanying SW) don't even try.

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Apr 6 14:50:39 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:
    Right now, one needs great FP64; and the occasional FP128.

    I believe that the best compromise both today and for the next 10 years
    is to have great f64 which includes Augmented[Addition|Multiplication].

    As:

    GLOBAL Kahan_Babuška
    ENTRY Kahan_Babuška
    Kahan_Babuška:
    // note compiler put &residual in R3 at CALL
    LDD R4,[R3]
    CARRY R4,{IO}
    FADD R1,R1,R2 // will not set inexact
    STD R4,[R3]
    RET

    Inexact can be set in this usage of CARRY and FADD.
    That is a lovely implementation of AugmentedAddition.
    It is probably clear by now that I simply love your CARRY feature. I
    just wish it could turn up on a CPU I can use as a daily driver.
    CARRY is one of those things which I wish that I had been able to think
    of, as opposed to most of the stuff that the patent office have allowed
    over the last few decades.
    (I.e. like "We have implemented a classic Kahan textbook algorithm in
    HW, please give us a patent for doing it.")
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Apr 6 08:36:39 2026
    From Newsgroup: comp.arch

    On 4/5/2026 1:29 PM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 4/4/2026 1:10 PM, MitchAlsup wrote:

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64


    I am not sure those are the only choices. Go back to the days of 32 bit
    registers. Many/most of the architectures of that era had good support
    for both 32 bit single precision and 64 bit double precision. So why
    now, in the era of 64 bit registers, can't you have good support for 64
    and 128 bit floating point? i.e. what's different?

    In the 32-bit era, there was very significant demand for 64-bit FP
    and we were just waiting for silicon acreage to do 64-bits;
    whereas,
    In the 64-bit era, there is a bit of demand for FP128 to make it not
    stink to high heaven, but not enough to even consider making everything 128-bits.

    So in a) we had to have useful FP64 in an essentially 32-bit machine;
    but right now, in b), there is no essentiality: just don't screw it up
    enough that it can't be useful on occasion.


    So I think you are agreeing with Robert Finch and me, that an
    alternative that existed, but that you rejected, for good reasons, would
    have been to provide great support for FP128.

    I agree with your decision to provide "a little" support for FP128. I
    would add that an additional requirement for the implementation is to do nothing to prevent a better, but backwards compatible, implementation in
    the future if the demand increases.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Apr 6 16:38:40 2026
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 4/5/2026 1:29 PM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 4/4/2026 1:10 PM, MitchAlsup wrote:

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64


    I am not sure those are the only choices. Go back to the days of 32 bit
    registers. Many/most of the architectures of that era had good support
    for both 32 bit single precision and 64 bit double precision. So why
    now, in the era of 64 bit registers, can't you have good support for 64
    and 128 bit floating point? i.e. what's different?

    In the 32-bit era, there was very significant demand for 64-bit FP
    and we were just waiting for silicon acreage to do 64-bits;
    whereas,
    In the 64-bit era, there is a bit of demand for FP128 to make it not
    stink to high heaven, but not enough to even consider making everything 128-bits.

    So in a) we had to have useful FP64 in an essentially 32-bit machine;
    but right now, in b), there is no essentiality: just don't screw it up
    enough that it can't be useful on occasion.


    So I think you are agreeing with Robert Finch and me, that an
    alternative that existed, but that you rejected, for good reasons, would have been to provide great support for FP128.

    I agree with your decision to provide "a little" support for FP128. I
    would add that an additional requirement for the implementation is to do nothing to prevent a better, but backwards compatible, implementation in
    the future if the demand increases.

    I accept that philosophy graciously.


    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Apr 6 16:25:06 2026
    From Newsgroup: comp.arch

    On 4/6/2026 10:36 AM, Stephen Fuld wrote:
    On 4/5/2026 1:29 PM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 4/4/2026 1:10 PM, MitchAlsup wrote:

    One has to make a choice:
    a) great FP64 and useful {but not great} FP128
    b) better FP128 {but still not great} with compromised FP64
    c) great FP128 with seriously compromised FP64


    I am not sure those are the only choices.  Go back to the days of 32 bit
    registers.  Many/most of the architectures of that era had good support
    for both 32 bit single precision and 64 bit double precision.  So why
    now, in the era of 64 bit registers, can't you have good support for 64
    and 128 bit floating point?  i.e. what's different?

    In the 32-bit era, there was very significant demand for 64-bit FP
    and we were just waiting for silicon acreage to do 64-bits;
    whereas,
    In the 64-bit era, there is a bit of demand for FP128 to make it not
    stink to high heaven, but not enough to even consider making everything
    128-bits.

    So in a) we had to have useful FP64 in an essentially 32-bit machine;
    but right now, in b), there is no essentiality: just don't screw it up
    enough that it can't be useful on occasion.


    So I think you are agreeing with Robert Finch and me, that an
    alternative that existed, but that you rejected, for good reasons, would have been to provide great support for FP128.

    I agree with your decision to provide "a little" support for FP128.  I would add that an additional requirement for the implementation is to do nothing to prevent a better, but backwards compatible, implementation in
    the future if the demand increases.


    IMO, this goal could still be served with, say, having Binary64 ops that
    work on pairs and use trap-and-emulate. The later hardware option being
    to have real hardware support instead.

    In the case of RISC-V, and my "Pseudo-Q" idea, it also remains fully compatible with the 'D' extension.


    Whereas, in the case of F/D vs F/D/Q, there would be potential for
    issues that would either break things or have a detrimental impact on
    the existing D extension:
    Moves between X and F registers no longer as well defined.
    There are issues that happen when XLEN!=FLEN.
    Context switching and ABI issues would result if mixing F/D and F/D/Q.


    Whereas, using pairs over Q's bigger FPRs avoids any potential for compatibility issues:
    FMV.X.D / FMV.D.X remain well defined;
    ABIs and context-switching remain unchanged.

    Tradeoffs:
    If working with FP128, there are effectively half as many FPU registers
    (16 vs 32).


    The bigger selling point for 128-bit registers would likely be if there
    were a whole lot of 4xFP32 SIMD, but as-is, there is probably not
    enough of this to justify bigger registers over register pairs.


    ...

    --- Synchronet 3.21f-Linux NewsLink 1.2