• Re: Tonights Tradeoff

    From Robert Finch@robfi680@gmail.com to comp.arch on Tue Oct 28 23:52:53 2025
    From Newsgroup: comp.arch

    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Same for other registers. GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10, 50, 90,
    or 130 bits.
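
    (Presumably that is 10 bits of immediate in the base 40-bit word plus 40 bits
    per extension word: 10, 10+40 = 50, 10+80 = 90, 10+120 = 130. That reading of
    the encoding is an inference from the numbers, not something stated above.)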

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Oct 29 00:14:08 2025
    From Newsgroup: comp.arch

    On 10/28/2025 8:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers.

    I assume the "high" registers are for handling 128 bit operations
    without the need to specify another register name. Do you have 5 or 6
    bit register numbers in the instructions? Five allows you to use the
    high registers for 128 bit operations without needing another register specifier, but then the high registers can only be used for 128 bit operations, which seems a waste. If you have six bits, you can use all
    64 registers for any operation, but how is the "upper" method better
    than automatically using r(x+1)?



    GPRs may contain either integer or
    floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those seem like a call from the My 66000 playbook, which I like.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 04:29:15 2025
    From Newsgroup: comp.arch

    On 10/28/2025 10:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or floating-point values.


    OK.

    I mostly stuck with 32-bit encodings, but 40 could maybe allow more
    encoding space, with the drawback of being non-power-of-2.

    But, yeah, occasionally dealing with 128-bit data is a major case for 64
    GPRs and paired registers.


    Well, that and when co-existing with RV64G, it gives somewhere to put
    the FPRs. But, in turn this was initially motivated by me failing to
    figure out how to get GCC configured to target Zfinx/Zdinx.


    Had ended up going with the Even/Odd pairing scheme as it is less wonky
    IMO to deal with R5:R4 than R36:R4.


    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.


    BT/BF works well. I otherwise also ended up using RISC-V style branches,
    which I originally disliked due to higher implementation cost, but they
    do technically allow for higher performance than just BT/BF or Branch-Compare-with-Zero in 2-R cases.

    So, it becomes harder to complain about a feature that does technically
    help with performance.


    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.


    Hmm...

    My case: 10/33/64.
    No direct 128-bit constant, but can use two 64-bit constants whenever
    128 bits is needed.



    Otherwise, goings on in my land:
    ISA development is slow, and had mostly turned into bug hunting;
    There are some unresolved bugs, but I haven't been able to fully hunt
    them down. A lot was in relation to RISC-V's C extension, but at least
    it seems like at this point the C extension is likely fully working.

    Haven't been many features that can usefully increase general-case performance. So, it is starting to seem like XG2 and XG3 may be fairly
    stable at this point.

    The longer term future is uncertain.


    My ISAs can beat RISC-V in terms of code-density and performance, but
    when RISC-V is extended with similar features, it is harder to make
    a case that it is "enough".

    Doesn't seem like (within the ISA) there are many obvious ways left to
    grab large general-case performance gains over what I have done already.

    Some code benefits from lots of GPRs, but harder to make the case that
    it reflects the general case.



    Recently got a new very-cheap laptop (a Dell Latitude 7490, for around
    $240), made some curious observations:
    It seems to slightly outperform my main PC in single-threaded performance;
    Its RAM timings don't seem to match the expected values.

    My main PC still wins at multi-threaded performance, and has the
    advantage of 7x more RAM.

    Had noted in Cinebench that my main PC is actually performing a little
    slower than is typical for the 2700X, but then again, it is effectively
    a 2700X running with DDR4-2133 rather than DDR4-2933. Partly this
    was a case of the RAM I have being unstable if run that fast (and in
    this case, more RAM but slightly slower seemed preferable to less RAM
    but slightly faster, or running it slightly faster but having the
    computer be crash-prone).

    They sold the RAM with its on-the-box speed being the XMP2 settings
    rather than the baseline settings, but the RAM in question didn't run
    reliably at the XMP or XMP2 settings (and I wasn't inclined to spend more;
    more so when there was already the annoyance that my MOBO chipset
    apparently doesn't deal with a full 128GB, but can tolerate 112GB, which
    is maybe not an ideal setup for perf).

    So, yeah, it seems that I have a setup where the 2700X is getting worse single-threaded performance than the i7 8650U in the laptop.

    Apparently, going by Cinebench scores, my PC's single threaded
    performance is mostly hanging out with a bunch of Xeons (getting a score
    in R23 of around 700 vs 950).

    Well, could be addressed, in theory, but would need some RAM that
    actually runs reliably at 2933 or 3200 MT/s and is also cheap...


    In both cases, they are CPUs originally released in 2018.

    Had noted, in a few tests:
    LZ4 benchmark (same file):
    Main PC: 3.3 GB/s
    Laptop: 3.9 GB/s
    memcpy (single threaded):
    Main PC: 3.8 GB/s
    Laptop : 5.6 GB/s
    memcpy (all threads):
    Main PC: ~ 15 GB/s
    Laptop : ~ 24 GB/s
    ( Like, what; thing only has 1 stick of RAM... *1 )

    *1: Also, how is a laptop with 1 stick of RAM matching a dual-socket
    Xeon E5410 with like 8 sticks of RAM...

    or, maybe it was just weak that my main PC was failing to beat the Xeon
    at this?... My main PC does at least beat the Xeon at single-threaded performance (was less true of my older Piledriver based PC).


    Granted, then again, I am using (almost) the cheapest MOBO I could find
    at the time (that had an OK number of RAM slots and SATA connectors).
    Can't quite identify the MOBO or chipset as I lost the box (and not
    clearly labeled on the MOBO itself); except that it is a
    something-or-another ASUS board.

    Like, at the time, IIRC:
    Went on Newegg;
    Picked mostly the cheapest parts on the site;
    Say, a Zen+ CPU being a lot cheaper than Zen 2,
    or pretty much anything from Intel.
    ...


    Did get a slightly fancy/beefy case, but partly this was because I was
    annoyed with the late-90s-era beige tower case I had been using, which I
    had ended up hot-gluing a bunch of extra PC fans into in an
    attempt to keep airflow good enough that it didn't melt, and
    under-clocking the CPU so that it could run reliably.

    Like, 4GHz Piledriver ran too hot and was unreliable, but was far more
    stable at 3.4 GHz. Was technically faster than a Phenom II underclocked
    to 2.8 GHz (for similar reasons).

    Where, at least the Zen+ doesn't overheat at stock settings (but, they
    also supplied the thing with a comparably much bigger stock CPU cooler).

    The case I got is slightly more traditional, with 5.25" bays and similar
    and mostly sheet-steel construction, vs the "new" trend of mostly glass-covered-box PC cases. Sadly, it seems like companies have mostly
    stopped selling the traditional sheet-steel PC cases with open 5.25"
    bays. Like, where exactly is someone supposed to put their DVD-RW drive,
    or hot-swap HDD trays?...

    Well, in the past we also had floppy drives, but the MOBOs removed the connectors, forcing one to now go the USB route if they want a floppy
    drive (but, now mostly moot as relatively few other computers still have floppy drives either).




    Well, in theory could build a PC with newer components and a bigger
    budget for parts. Still wouldn't want to go over to Win11, so now it is a
    choice between jumping to Linux or "Windows Server" or similar (like, at
    least they didn't pollute Windows Server with a bunch of random
    pointless crap).

    For now, the inertia option is to just keep using Win10.


    As for the laptop, had noted:
    Can run Minecraft:
    Yes; though best results at an 8-chunk draw distance.
    Much more than this, and the "Intel UHD" graphics struggle.
    At 12 chunks, there is obvious chug.
    At 16 chunks, it starts dropping into single digit territory.
    Can run Doom3:
    Yes: Mostly gets 40-50 fps in Doom 3.

    My main PC can manage a 16-chunk draw distance in Minecraft and mostly
    gets a constant 63 fps in Doom3.

    Don't have many other newer games to test, as I mostly lost interest in
    modern "AAA" games. And, stuff like Doom+RTX, I already know this wont
    work. I can mostly just be happy that Minecraft works and is playable
    (and that its GPU is solidly faster than just using a software renderer...).


    On both fronts, this is a significant improvement over the older laptop.
    For the price, I sort of worried that it would be dead slow, but it significantly outperforms its Vista-era predecessor.

    This is mostly because I had noticed that, right now (unlike a few years
    ago), there are actually OK laptops at cheap prices (along with all the
    $80 Dell OptiPlex computers and similar on Amazon...).



    Otherwise, went and recently wrote up a spec partly based on a BASIC
    dialect I had used in one of my 3D engines, with some design cleanup: https://pastebin.com/2pEE7VE8

    Where I was able to get a usable implementation for something similar in
    a little over 1000 lines of C.

    Though, this was for an Unstructured BASIC dialect.


    Decided then to try something a little harder:
    Doing a small JavaScript like language, and trying to keep the
    interpreter small.

    I don't yet have the full language implemented, but for a partial JS
    like language, I currently have something in around 2500 lines of C.

    I had initially set a target estimate of 4-8 kLOC.
    Unless the remaining functionality ends up eating a lot of code, I am on target towards hitting the lower end of this range (need to get most of
    the rest of the core-language implemented within around 1.5 kLOC or so).

    Note: No 3rd party libraries allowed, only the normal C runtime library.
    Did end up using a few C99 features, but mostly still C95.


    For now, I was calling the language BS3L, where:
    Dynamically typed;
    Supports: Integers, Floating-Point, Strings, Objects, Arrays, ...
    JS style syntax;
    Nothing too exciting here.
    Still has JS style arrays and objects;
    Dynamically scoped.
    Where, dynamic scoping needs less code than lexical scoping;
    But, dynamic scoping is also a potential foot-gun as well.
    Not sure if too much of a foot-gun.
    Vs going to C-style scoping;
    Or, biting the bullet and properly implementing lexical scoping.
    Leaving out most advanced features.
    will be fairly minimal even vs early versions of JS.

    But, in some cases, was borrowing some design ideas from the BASIC interpreter. There were some unavoidable costs, such as in this case
    needing a full parser (that builds an AST) and an AST-walking
    interpreter. Unlike BASIC, it wouldn't be possible to implement an
    interpreter by directly walking and pattern matching lists of tokens.

    And, a parser that builds an AST, and code to walk said AST, necessarily
    needs more code.

    I guess, it is a question of whether someone else could manage to implement a JavaScript style language in under 1000 lines of C while also writing "relatively normal" C (no huge blocks of obfuscated code or rampant abuse of
    the preprocessor). Or, basically, where one has to stick to similar C
    coding conventions to those used in Doom and Quake.


    I am not sure if this would be possible. Both the dynamic type-system
    and parser have eaten up a fair chunk of the code budget. A sub 1000
    line parser is also a little novel; but the parser itself got a little
    wonky and doesn't fully abstract over what it parses (as there is still
    a fair bit of bleed-over from the token stream). And, it sorta ended up abusing the use of binary operators a little.

    For example, it has wonk like dealing with lists of statements as-if
    there were a right-associative semicolon operator (allowing it to be
    walked like a linked list).
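
    As a rough illustration of that shape (a minimal sketch with made-up names,
    not the actual BS3L code): the ';' node chains statements right-associatively,
    so the evaluator can walk a block iteratively like a linked list.

        #include <stdio.h>

        enum { OP_SEMI = 1, OP_PRINT = 2 };   /* assumed node tags */

        typedef struct AstNode AstNode;
        struct AstNode {
            int      op;         /* OP_SEMI chains statements; others are statements */
            AstNode *lhs, *rhs;  /* for OP_SEMI: lhs = statement, rhs = rest of list */
            int      value;      /* payload for the toy OP_PRINT statement */
        };

        static void EvalStatement(AstNode *st)
        {
            if (st->op == OP_PRINT)
                printf("stmt: %d\n", st->value);
        }

        static void EvalBlock(AstNode *node)
        {
            /* Shape is (s1 ; (s2 ; (s3 ; ...))), so just follow rhs. */
            while (node && node->op == OP_SEMI) {
                EvalStatement(node->lhs);
                node = node->rhs;
            }
            if (node)
                EvalStatement(node);   /* trailing statement, if any */
        }

        int main(void)
        {
            AstNode s1   = { OP_PRINT, 0, 0, 1 };
            AstNode s2   = { OP_PRINT, 0, 0, 2 };
            AstNode s3   = { OP_PRINT, 0, 0, 3 };
            AstNode tail = { OP_SEMI, &s2, &s3, 0 };
            AstNode head = { OP_SEMI, &s1, &tail, 0 };
            EvalBlock(&head);
            return 0;
        }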

    There is slightly wonky operator tokenization again to save code:
    Separately matching every possible operator pattern is a bunch of extra
    logic. Was using rules that mostly give the correct operators, but with
    the possibility of non-sense operators. Also the precedence levels don't
    match up exactly, but this is a lower priority issue.


    I guess, if someone thinks they can do so in significantly less code,
    they can try.

    Note that while a language like Lua sort of resembles an intermediate
    between BASIC and JavaScript, I wouldn't expect Lua to save that much
    here (it would still have the cost of needing to build an AST and similar).

    Going from an AST to a bytecode or 3AC IR would allow for higher
    performance.

    But, I decided to go for an AST walking interpreter in this case as it
    would be the least LOC.


    Actually takes more effort trying to keep the code small. Rather than
    just copy-pasting stuff a bunch of times, one spends more time needing
    to try to factor out and reuse common patterns.


    Though, in a way, some of this is revisiting stuff I did 20+ years ago,
    but from a different perspective.

    Like, 20+ years ago, my first interpreters also used AST walkers.

    As for where I will go with this, I don't know.
    Some of it could make sense as a starting point for a GLSL compiler;
    Or maybe adapted into parsing the SCAD language;

    Or, as a cheaper alternative to what my first script VM became.
    By the end of its span, it had become quite massive...
    Though, still not too bad if compared with SpiderMonkey or V8.

    Ironically, my jump to a Java + JVM/.NET like design was actually to
    make it simpler.

    For a simple but slow language, JS works, but if you want it fast it
    quickly turns worse (and simpler to jump to a more traditional
    statically typed language). Like, there was this thing, known as "Hindley-Milner Type Inference", which on one hand, could be used to
    make a JavaScript style language fast (by turning it transparently into
    a statically-typed language), but also, was a huge PITA to deal with
    (this was combined in my VM with optional explicit type declarations;
    with a syntax inspired by ActionScript).


    Well, and when something gets big and complicated enough that one almost
    may as well just use SpiderMonkey or similar to run their JS code, this
    is a problem...

    Still less bad than LLVM, not sure why anyone would willingly submit to
    this.


    Well, there is still a surviving descendant of the original VM (although branching

    Though, makes more sense to do a clean interpreter in this case, than to
    try to build one by copy-pasting the parser from BGBCC or my old VM and
    trying to build a new lighter-weight VM.

    In some of these cases, it is easier to scale up than to scale back down:
    easier to take simpler code and add features or improve performance
    than to take more complex code and try to trim it down.


    And, sometimes it does make more sense to just write something starting
    from a clean slate.

    Well, except for my attempt at a clean-slate C compiler, which was
    more a case of realizing I wouldn't undershoot BGBCC by enough to be worthwhile, and there were some new problem points that were emerging in
    the design. Partly as I was trying to follow a model more like that used
    by GCC and binutils, which I was then left to suspect is not the right approach (and in some ways, the approach I had used in BGBCC seemed to
    make more sense than trying to imitate how GCC does things).

    Might still make sense at some point to try for another clean-slate C
    compiler, though if I would still end up taking a similar general
    approach to BGBCC (or .NET), there isn't a huge incentive (vs continuing
    to use BGBCC).

    Where, say, the main things that would ideally need improvement would
    be BGBCC's compile-time performance and memory footprint. As-is, compiling with BGBCC is about as slow as compiling with GCC, which isn't great.

    Comparably, MSVC typically being a bit faster at compiling stuff IME.


    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 08:41:46 2025
    From Newsgroup: comp.arch

    On 2025-10-29 3:14 a.m., Stephen Fuld wrote:
    On 10/28/2025 8:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit
    instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops.
    Registers are named as if there were 32 GPRs, A0 (arg 0 register is
    r1) and A0H (arg 0 high is r33). Sameo for other registers.

    I assume the "high" registers are for handling 128 bit operations
    without the need to specify another register name.  Do you have 5 or 6
    bit register numbers in the instructions.  Five allows you to use the
    high registers for 128 bit operations without needing another register specifier, but then the high registers can only be used for 128 bit operations, which seems a waste.  If you have six bits, you can use all
    64 registers for any operation, but how is the "upper" method that
    better than automatically using r(x+1)?

    Yes, but it is just a suggested usage. The registers are GPRs that can
    be used for anything, specified using a six bit register number. I
    suggested it that way because most of the time register values would be
    passed around as 64-bit quantities and it keeps the same set of
    registers for the same register type (argument, temp, saved). But since
    it should be using mostly compiled code, it does not make much difference.

    Also, the high registers could be used as FP registers. Maybe allowing
    for saving only the low order 32 regs during a context switch.

    GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch
    on bit-set/clear for conditional branches. Might also include branch
    true / false.

    Using operand routing for immediate constants and an operation size
    for the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be
    10,50,90 or 130 bits.

    Those seem like a call from the My 66000 playbook, which I like.

    Yup.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 08:50:35 2025
    From Newsgroup: comp.arch

    On 2025-10-29 8:41 a.m., Robert Finch wrote:
    On 2025-10-29 3:14 a.m., Stephen Fuld wrote:
    On 10/28/2025 8:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit
    instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops.
    Registers are named as if there were 32 GPRs, A0 (arg 0 register is
    r1) and A0H (arg 0 high is r33). Sameo for other registers.

    I assume the "high" registers are for handling 128 bit operations
    without the need to specify another register name.  Do you have 5 or 6
    bit register numbers in the instructions.  Five allows you to use the
    high registers for 128 bit operations without needing another register
    specifier, but then the high registers can only be used for 128 bit
    operations, which seems a waste.  If you have six bits, you can use
    all 64 registers for any operation, but how is the "upper" method that
    better than automatically using r(x+1)?

    Yes, but it is just a suggested usage. The registers are GPRs that can
    be used for anything, specified using a six bit register number. I
    suggested it that way because most of the time register values would be passed around as 64-bit quantities and it keeps the same set of
    registers for the same register type (argument, temp, saved). But since
    it should be using mostly compiled code, it does not make much difference.

    Also, the high registers could be used as FP registers. Maybe allowing
    for saving only the low order 32 regs during a context switch.

    GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch
    on bit-set/clear for conditional branches. Might also include branch
    true / false.

    Using operand routing for immediate constants and an operation size
    for the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be
    10,50,90 or 130 bits.

    Those seem like a call from the My 66000 playbook, which I like.

    Yup.


    I should mention that the high registers are available only in user/app
    mode. For other modes of operation only the low order 32 registers are available. I did this to reduce the number of logical registers in the
    design. There are about 160 (64+32+32+32) logical registers then. They
    are supported by 512 physical registers. My previous design had 224
    logical registers which eats up more hardware, probably for little benefit.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Oct 29 17:44:14 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Do you have 5 or 6
    bit register numbers in the instructions. Five allows you to use the
    high registers for 128 bit operations without needing another register specifier, but then the high registers can only be used for 128 bit operations, which seems a waste.

    These days, that's not so clear. E.g., Zen4 has 192 physical 512-bit
    SIMD registers, despite having only 256-bit wide FUs. The way I
    understand it, a 512-bit operation comes as one uop to the FU,
    occupies it for two cycles (and of course the result latency is
    extra), and then has a 512-bit result.

    The alternative would be to do as AMD did in some earlier cores,
    starting with (I think) K8: have registers that are half as wide and
    split each 512-bit operation into 2 256-bit uops that go through the
    OoO engine individually. This approach would allow more physical
    256-bit registers, and waste less on 32-bit, 64-bit, 128-bit and
    256-bit operations, but would cost additional decoding bandwidth,
    renaming bandwidth, renaming checkpoint size (a little), and scheduler
    space compared with the approach AMD have taken. Apparently the cost of this
    approach is higher than the benefit.

    Doubling the logical register size doubles the renamer checkpoint
    size, no? This way of avoiding "waste" looks quite a bit more
    expensive.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 13:04:42 2025
    From Newsgroup: comp.arch

    On 10/29/2025 7:50 AM, Robert Finch wrote:
    On 2025-10-29 8:41 a.m., Robert Finch wrote:
    On 2025-10-29 3:14 a.m., Stephen Fuld wrote:
    On 10/28/2025 8:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit
    instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops.
    Registers are named as if there were 32 GPRs, A0 (arg 0 register is
    r1) and A0H (arg 0 high is r33). Sameo for other registers.

    I assume the "high" registers are for handling 128 bit operations
    without the need to specify another register name.  Do you have 5 or
    6 bit register numbers in the instructions.  Five allows you to use
    the high registers for 128 bit operations without needing another
    register specifier, but then the high registers can only be used for
    128 bit operations, which seems a waste.  If you have six bits, you
    can use all 64 registers for any operation, but how is the "upper"
    method that better than automatically using r(x+1)?

    Yes, but it is just a suggested usage. The registers are GPRs that can
    be used for anything, specified using a six bit register number. I
    suggested it that way because most of the time register values would
    be passed around as 64-bit quantities and it keeps the same set of
    registers for the same register type (argument, temp, saved). But
    since it should be using mostly compiled code, it does not make much
    difference.

    Also, the high registers could be used as FP registers. Maybe allowing
    for saving only the low order 32 regs during a context switch.

    I am not as sure about this approach...

    Well, Low 32=GPR, High 32=FPR, makes sense, I did this.

    But, pairing a GPR and FPR for the 128-bit cases seems wonky; or
    subsetting registers on context switch seems like it could turn into a problem.


    Or, if a goal is to allow for encodings with a 5-bit register field,
    would make sense to use 32-bit encodings.

    Where, granted, 6b register fields in a 32-bit instruction does have the drawback of limiting how much encoding space exists for opcode and
    immediate (and one has to be more careful not to "waste" the encoding
    space as badly as RISC-V had done).

    Though, can note that both:
    R6+R6+Imm10
    R5+R5+Imm12
    Use the same amount of encoding space.
    But, R6+R6+R6 uses 3 bits more than R5+R5+R5.


    Though, one could debate my case, as I did effectively end up burning
    1/4 of the total encoding space mostly on Jumbo prefixes.

    ...



    GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a
    branch on bit-set/clear for conditional branches. Might also include
    branch true / false.

    Using operand routing for immediate constants and an operation size
    for the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be
    10,50,90 or 130 bits.

    Those seem like a call from the My 66000 playbook, which I like.

    Yup.


    I should mention that the high registers are available only in user/app mode. For other modes of operation only the low order 32 registers are available. I did this to reduce the number of logical registers in the design. There are about 160 (64+32+32+32) logical registers then. They
    are supported by 512 physical registers. My previous design had 224
    logical registers which eats up more hardware, probably for little benefit.


    FWIW: I have gotten by OK with 128 internal registers:
    00..3F: Array-Mapped Registers (mostly the GPRs)
    40..7F: CRs and SPRs

    Mostly sufficient.

    For the array-mapped registers, these ones use LUTRAM, with a logical
    copy of the array per write port, and some control bits to encode which
    array currently holds the up-to-date copy of the register.

    All this gets internally replicated for each read port.

    So, roughly 18 internal copies of all of the registers with 6R3W, but
    this is unavoidable (since LUTRAMs are 1R1W).


    The other option is using flip-flops, which is the strategy mostly used
    for the writable CRs and SPRs. This is done sparingly as the resource
    cost is higher in this case (at least on Xilinx, *).

    *: Things went amiss on Altera; when I tried to build on it, I needed
    to use FFs for all the GPRs as well, as these FPGAs lack a direct
    equivalent of LUTRAMs and instead have smaller Block RAMs. The
    Lattice FPGAs also lack LUTRAM IIRC (but, my core doesn't map as well to Lattice FPGAs either).


    As for the CR/SPR space:
    Some of it is used for writable registers;
    A big chunk is used for internal read-only registers.
    ZZR, IMM, IMMB, JIMM, etc.
    ZZR: Zero Register / Null Register (Write)
    IMM: Immediate for current lane (33-bit, sign-ext).
    IMMB: Immediate from Lane 3.
    JIMM: 64-bit immediate spanning Lanes A and B.
    ...

    Could also be seen as C0..C63 (or, all control registers) except that
    much of C32..C63 is used for internal read-only SPRs, and a few other
    SPRs (DLR, DHR, and SP).

    Originally, the CRs and SPRs were handled as separate, but now things
    have gotten fuzzy (and, for RISC-V, some of the CRs need to be accessed
    in GPR like ways).

    There is some wonk as they were handled as separate modules, but with
    the current way things are done it would almost make more sense to fold
    all of the CRs into the GPR file module.

    The module might also continue to deal with forwarding, but might also
    make sense to have a RegisterFile module, possibly with a disjoint
    "Register Forwarding And Interlocks" style module (which forwards
    registers if the value is available and signals pipeline stalls as
    needed; this logic currently partly handled by the existing
    register-file module).



    Did experiment with a mechanism to allow bank-swapped registers. This
    would have added an internal 2-bit mode for the registers, and would
    stall the pipeline to swap the current registers with their bank-swapped versions if needed (with the registers internally backed to Block-RAM).
    Ended up mostly not using this though (at best, it wouldn't gain much
    over the existing "Load and Store everything to RAM" strategy; and would
    make context switching slower than it is already).

    It is more likely that a practical mechanism for fast bank swapping
    would need a mechanism to bank-swap the registers to external RAM. Or
    maybe a special "Stall and dump all the registers to this RAM Address" instruction.


    For the RISC-V CSRs:
    Part of the space maps to the CRs, and part maps to CPUID;
    For pretty much everything else, it traps.
    So, pretty much all of the normal RISC-V CSRs will trap.

    Ended up trapping for the RISC-V FPU CSRs as well:
    Rarely accessed;
    Rather than just one CSR for the FPU status, they broke it up into
    multiple sub-registers for parts of the register (like, there is a
    special CSR just for the rounding-mode, ...).

    Also the hardware only supports moving to/from a CR, so any more complex scenarios will also trap. They had gotten a little too fancy with this
    stuff IMO.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Oct 29 18:15:42 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some
    alignment that the first instruction of a cache line is always aligned?

    Having register pairs does not make the compiler writer's life easier, unfortunately.

    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.

    Having 64 registers and 64 bit registers makes life easier for that
    particular task :-)

    If you have that many bits available, do you still go for a load-store architecture, or do you have memory operations? This could offset the
    larger size of your instructions.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those sizes are not really a good fit for constants from programs,
    where quite a few constants tend to be 32 or 64 bits. Would a
    64-bit FP constant leave 26 bits empty?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Oct 29 11:29:54 2025
    From Newsgroup: comp.arch

    On 10/29/2025 10:44 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Do you have 5 or 6
    bit register numbers in the instructions. Five allows you to use the
    high registers for 128 bit operations without needing another register
    specifier, but then the high registers can only be used for 128 bit
    operations, which seems a waste.

    At this point, the discussion is academic, as Robert has said he has 6
    bit register specifiers in the instructions. But my issue had nothing
    to do with SIMD registers, as he said he supported 128 bit arithmetic
    and the "high" registers were used for that. e.g.

    Add A1,A2,A3 would be a 64 bit add on those registers but
    Add128 A1,A2,A3 would be a 128 bit add using A1H for the high order
    bits of the destination, etc. So the question becomes how is using
    Rn+32 better than using Rn+1?

    That being said, your points are well taken for a different implementation.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 29 18:33:46 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.

    I have both the bit-vector compare and branch, but also a compare to zero
    and branch as a single instruction. I suggest you should too, if for no
    other reason than:

    if( p && p->next )

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    My 66000 allows for occasional use of 128-bit values but is designed mainly
    for 64-bit and smaller.

    With 32-bit instructions, I provide {5, 16, 32, and 64}-bit constants.

    Just last week we discovered a case where HW can do a better job than SW. Previously, the compiler would emit:

    CVTfd Rt,Rf
    FMUL Rt,Rt,#1.425D0
    CVTdf Rd,Rt

    Which is subject to double rounding once at the FMUL and again at the
    down conversion. I thought about the problem and it seems fairly easy
    to gate the 24-bit fraction into the multiplier tree along with the
    53-bit fraction of the constant, and then normalize and round the
    result dropping out of the tree--avoiding the double rounding case.

    Now, the compiler emits:

    FMULf Rd,Rf,#1.425D0

    saving 2 instructions along with the higher precision.
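
    (For anyone who hasn't run into double rounding: the effect can be shown in
    plain C with a constructed value. This is a minimal sketch using a contrived
    exact product of 1 + 2^-24 + 2^-54; it is not tied to the actual #1.425D0
    constant above.)

        #include <stdio.h>

        int main(void)
        {
            /* Pretend the exact product is 1 + 2^-24 + 2^-54, just above the
               tie point between the two nearest binary32 values 1.0f and
               1.0f + 2^-23.  Correct single rounding gives 1.0f + 2^-23. */

            /* Round to binary64 first: the 2^-54 tail is a quarter of a
               binary64 ulp, so this rounds down to exactly 1 + 2^-24. */
            double to_double = 1.0 + 0x1p-24 + 0x1p-54;

            /* Round again to binary32: 1 + 2^-24 is now an exact tie, and
               round-to-nearest-even picks 1.0f.  Two roundings lose the bit. */
            float twice_rounded = (float)to_double;

            /* The correctly single-rounded binary32 result of the exact value. */
            float once_rounded = 1.0f + 0x1p-23f;

            printf("double rounded: %a\n", (double)twice_rounded); /* 0x1p+0 */
            printf("single rounded: %a\n", (double)once_rounded);  /* 0x1.000002p+0 */
            return 0;
        }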
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 29 18:47:09 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 10/28/2025 10:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or floating-point values.


    OK.

    I mostly stuck with 32-bit encodings, but 40 could maybe allow more
    encoding space, but the drawback of being non-power-of-2.

    it is definitely an issue.

    But, yeah, occasionally dealing with 128-bit data is a major case for 64 GPRs and paired-registers registers.

    There is always the DBLE pseudo-instruction.

    DBLE Rd,Rs1,Rs2,Rs3

    All DBLE does is to provide more registers for the wide computation
    in such a way that compiler is not forced to pair or share any reg-
    isters. The other thing DBLE does is to tell the decoder that the
    next instruction is 2× as wide as its OpCode states. In lower end
    machines (and in GPUs) DBLE is sequenced as if it were an instruction.
    In higher end machines, DBLE would be CoIssued with its mate.

    ----------

    My case: 10/33/64.
    No direct 128-bit constant, but can use two 64-bit constants whenever
    128 bits is needed.

    {5, 16, 32, 64}-bit immediates.



    Otherwise, goings on in my land:
    ISA development is slow, and had mostly turned into bug hunting;
    <snip>

    The longer term future is uncertain.


    My ISA's can beat RISC-V in terms of code-density and performance, but
    when when RISC-V is extended with similar features, it is harder to make
    a case that it is "enough".

    I am still running at 70% of RISC-V's instruction count.

    Doesn't seem like (within the ISA) there are many obvious ways left to
    grab large general-case performance gains over what I have done already.

    Fewer instructions, and/or instructions that take fewer cycles to execute.

    Example, ENTER and EXIT instructions move 4 registers per cycle to/from
    cache in a pipeline that has 1 result per cycle.

    Some code benefits from lots of GPRs, but harder to make the case that
    it reflects the general case.

    There is very little to be gained with that many registers.

    Recently got a new very-cheap laptop (a Dell Latitude 7490, for around $240), made some curious observations:
    It seems to slightly outperform my main PC in single-threaded performance; Its RAM timings don't seem to match the expected values.

    My main PC still wins at multi-threaded performance, and has the
    advantage of 7x more RAM.

    My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 14:02:32 2025
    From Newsgroup: comp.arch

    On 10/29/2025 1:15 PM, Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some alignment that the first instruction of a cache line is always aligned?

    Having register pairs does not make the compiler writer's life easier, unfortunately.


    Yeah, and from the compiler POV, would likely prefer having Even+Odd pairs.

    Going with a bit result vector in any GPR for compares, then a branch on
    bit-set/clear for conditional branches. Might also include branch true /
    false.

    Having 64 registers and 64 bit registers makes life easier for that particular task :-)

    If you have that many bits available, do you still go for a load-store architecture, or do you have memory operations? This could offset the
    larger size of your instructions.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those sizes are not really a good fit for constants from programs,
    where quite a few constants tend to be 32 or 64 bits. Would a
    64-bit FP constant leave 26 bits empty?


    Agreed.

    From what I have seen, the vast bulk of constants tend to come in
    several major clusters:
    0 to 511: The bulk of all constants (peaks near 0, geometric fall-off)
    -64 to -1: Much of what falls outside 0 to 511.
    -32768 to 65535: Second major group
    -2G to +4G: Third group (smaller than second)
    64-bit: Another smaller spike.

    For values between 512 and 16384: Sparsely populated.
    Mostly the continued geometric fall-off from the near-0 peak.
    Likewise for values between 65536 and 1G.
    Values between 4G and 4E tend to be mostly unused.

    Like, in the sense of, if you have 33-bit vs 52 or 56-bit for a
    constant, the larger constants would have very little advantage (in
    terms of statistical hit rate) over the 33 bit constant (and, it isn't
    until you reach 64 bits that it suddenly becomes worthwhile again).


    Partly why I go with 33 bit immediate fields in the pipeline in my core,
    but nothing much bigger or smaller:
    Slightly smaller misses out on a lot, so almost may as well drop back to
    17 in this case;
    Going slightly bigger would gain pretty much nothing.

    Like, in the latter case, does sort of almost turn into a "go all the
    way to 64 bits or don't bother" thing.
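
    A toy version of the selection logic that implies, as a minimal sketch
    (hypothetical helper, not BGBCC's actual code): pick the smallest of the
    17/33/64 tiers that reproduces the value by sign-extension. A sign-extended
    33-bit field also covers zero-extended 32-bit values, which is part of why
    33 beats 32 here.

        #include <stdio.h>
        #include <stdint.h>

        static int ImmWidthNeeded(int64_t v)
        {
            if (v >= -(INT64_C(1) << 16) && v < (INT64_C(1) << 16))
                return 17;   /* catches the 0..511, -64..-1, -32768..65535 clusters */
            if (v >= -(INT64_C(1) << 32) && v < (INT64_C(1) << 32))
                return 33;   /* catches the -2G..+4G cluster */
            return 64;       /* everything else: go straight to 64 bits */
        }

        int main(void)
        {
            int64_t samples[] = { 7, -3, 40000, -70000, 0xFFFFFFFF,
                                  INT64_C(0x123456789A) };
            for (int i = 0; i < 6; i++)
                printf("%lld -> %d-bit immediate\n",
                       (long long)samples[i], ImmWidthNeeded(samples[i]));
            return 0;
        }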


    That said, I do use a 48-bit address space, so while in concept 48-bits
    could be useful for pointers: This is statistically insignificant in an
    ISA which doesn't encode absolute addresses in instructions.

    So, ironically, there are a lot of 48-bit values around, just pretty
    much none of them being encoded via instructions.


    Kind of a similar situation to function argument counts:
    8 arguments: Most of the functions;
    12: Vast majority of them;
    16: Often only a few stragglers remain.

    So, 16 gets like 99.95% of the functions, but maybe there are a few
    isolated ones taking 20+ arguments lurking somewhere in the code. One
    would then need to go up to 32 arguments to have reasonable confidence
    of "100%" coverage.

    Or, impose an arbitrary limit, where the stragglers would need to be
    modified to pass arguments using a struct or something.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Oct 29 13:05:08 2025
    From Newsgroup: comp.arch

    On 10/29/2025 11:47 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    snip
    But, yeah, occasionally dealing with 128-bit data is a major case for 64
    GPRs and paired-registers registers.

    There is always the DBLE pseudo-instruction.

    DBLE Rd,Rs1,Rs2,Rs3

    All DBLE does is to provide more registers for the wide computation
    in such a way that compiler is not forced to pair or share any reg-
    isters. The other thing DBLE does is to tell the decoder that the
    next instruction is 2× as wide as its OpCode states. In lower end
    machines (and in GPUs) DBLE is sequenced as if it were an instruction.
    In higher end machines, DBLE would be CoIssued with its mate.

    So if DBLE says the next instruction is double width, does that mean
    that all "128 bit instructions" require 64 bits in the instruction
    stream? So a sequence of say four 128 bit arithmetic instructions would require the I space of 8 instructions?

    If so, I guess it is a tradeoff for not requiring register pairing, e.g.
    Rn and Rn+1.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 15:58:40 2025
    From Newsgroup: comp.arch

    On 10/29/2025 1:47 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 10/28/2025 10:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.


    OK.

    I mostly stuck with 32-bit encodings, but 40 could maybe allow more
    encoding space, but the drawback of being non-power-of-2.

    it is definitely an issue.

    But, yeah, occasionally dealing with 128-bit data is a major case for 64
    GPRs and paired-registers registers.

    There is always the DBLE pseudo-instruction.

    DBLE Rd,Rs1,Rs2,Rs3

    All DBLE does is to provide more registers for the wide computation
    in such a way that compiler is not forced to pair or share any reg-
    isters. The other thing DBLE does is to tell the decoder that the
    next instruction is 2× as wide as its OpCode states. In lower end
    machines (and in GPUs) DBLE is sequenced as if it were an instruction.
    In higher end machines, DBLE would be CoIssued with its mate.


    OK.

    In my case, a lot of the 128-bit operations are a single 32-bit
    instruction, which splits (in decode) into operations spanning multiple lanes (using
    the 6R3W register file as a virtual 3R1W 128-bit register file).

    In some cases, pairs of 64-bit SIMD instructions may be merged to send
    both through the SIMD unit at the same time. Say, as a special-case
    co-issue for 2x Binary32 ops (which can basically be handled the same as
    the 4x Binary32 scenario by the SIMD unit).

    ----------

    My case: 10/33/64.
    No direct 128-bit constant, but can use two 64-bit constants whenever
    128 bits is needed.

    {5, 16, 32, 64}-bit immediates.


    The reason 17 and 33 ended up slightly preferable is that both
    zero-extended and sign-extended 16 and 32 bit values are fairly common.

    And, if one has both a zero and sign extended immediate, this eats the
    same encoding space as having a 17-bit immediate, or a separate
    zero-extended and one-extended variant.

    There are a few 5/6 bit immediate instructions, but I didn't really
    count them.

    XG3's equivalent of SLTI and similar only has Imm6 encodings (can be
    extended to 33 bits with a jumbo prefix).



    There isn't much need for a direct 128-bit immediate though:
    This case is exceedingly rare;
    Register-pairs basically make it a non-issue;
    Even if it were supported:
    This would still require a 24-byte encoding...
    Which, doesn't save anything over 2x 12-bytes.
    And doesn't gain much, apart from making CPU more expensive.

    Someone could maybe do 20 bytes by using a 128-bit memory load, but with
    the usual drawbacks of using a memory load (BGBCC doesn't usually do
    this). The memory load will have a higher latency than a pair of
    immediate instructions.





    Otherwise, goings on in my land:
    ISA development is slow, and had mostly turned into bug hunting;
    <snip>

    The longer term future is uncertain.


    My ISA's can beat RISC-V in terms of code-density and performance, but
    when when RISC-V is extended with similar features, it is harder to make
    a case that it is "enough".

    I am still running at 70% RISC-Vs instruction count.


    Basically similar.

    XG3 also uses only 70% as many instructions as RV64G.

    But, if you throw Indexed Load/Store, Load/Store Pair, Jumbo Prefixes,
    etc, at the problem (on top of RISC-V), suddenly RISC-V becomes a lot
    more competitive (30% smaller and 50% faster).

    Not found a good way to improve much over this though...


    But, yeah, if comparing against RV64G as it exists in its standard form,
    there is a bit of room for improvement.



    Doesn't seem like (within the ISA) there are many obvious ways left to
    grab large general-case performance gains over what I have done already.

    Fewer instructions, and or instructions that take fewer cycles to execute.

    Example, ENTER and EXIT instructions move 4 registers per cycle to/from
    cache in a pipeline that has 1 result per cycle.

    Some code benefits from lots of GPRs, but harder to make the case that
    it reflects the general case.

    There is very little to be gained with that many registers.


    Granted.

    The main thing it benefits is things like TKRA-GL, ...

    Doom basically sees no real difference between 32 and 64 GPRs (nor does SW-Quake).


    Mostly matters for code where one has functions with around 100+ local variables... which are uncommon outside of TKRA-GL or similar.


    As-is, SW-Quake is one of the cases that does well with RISC-V, though GL-Quake performs like hot dog-crap; mostly as TKRA-GL gets wrecked if
    it is limited to 32 registers and doesn't have SIMD.


    Only real saving point is when running with TKRA-GL over system calls in
    which case it runs in the kernel (as XG1) which is slightly less bad.
    For reasons, TestKern kinda still needs to be built as XG1.


    Recently got a new very-cheap laptop (a Dell Latitude 7490, for around
    $240), made some curious observations:
    It seems to slightly outperform my main PC in single-threaded performance; Its RAM timings don't seem to match the expected values.

    My main PC still wins at multi-threaded performance, and has the
    advantage of 7x more RAM.

    My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.

    Desktop PC:
    8C/16T: 3.7 Base, 4.3 Turbo, 112GB RAM (just, not very fast RAM)
    Rarely reaches turbo
    pretty much only happens if just running a single thread...
    With all cores running stuff in the background:
    Idles around 3.6 to 3.8.

    Laptop:
    4C/8T, 1.9 GHz Base, 4.2 GHz Turbo
    If power set to performance, reaches turbo a lot more easily,
    and with multi-core workloads.
    But, puts out a lot of heat while doing so...

    If set to Efficiency, mostly stays below 3 GHz.

    As noted, the laptop is surprisingly speedy for how cheap it was.

    For $240 I was paranoid it still might not have been fast enough to run Minecraft...


    Still annoyed as the RAM claimed like DDR4-3200 on the box, but doesn't
    run reliably at more than DDR4-2133... Like, you can try 3200 if you
    don't mind the computer blue-screening after a few minutes I guess...



    But, without much RAM, nor enough SSD space to set up a huge pagefile,
    not going to try compiling LLVM on the thing.

    Even with all the RAM, a full rebuild of LLVM still takes several hours
    on my main PC (though, trying to build LLVM or GCC is at least slightly
    faster if one tells the AV software to stop grinding the CPU by looking
    at every file accessed).


    Vs the $80 OptiPlex that came with a 2C/4T Core i3 variant, that wasn't particularly snappy (seemed on-par with the Vista era laptop; though
    this has a 2C/2T CPU).

    Basically, was a small PC that was using mostly laptop-style parts
    internally (laptop DVD-RW drive and laptop style HDD); some sort of ITX
    MOBO layout I think.

    I don't remember there being any card slots; so like if you want to
    install a PCIe card or similar, basically SOL.

    But, it was either this or an off-brand NUC clone...


    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 29 21:52:54 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 10/29/2025 11:47 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    snip
    But, yeah, occasionally dealing with 128-bit data is a major case for 64 GPRs and paired-registers registers.

    There is always the DBLE pseudo-instruction.

    DBLE Rd,Rs1,Rs2,Rs3

    All DBLE does is to provide more registers for the wide computation
    in such a way that compiler is not forced to pair or share any reg-
    isters. The other thing DBLE does is to tell the decoder that the
    next instruction is 2× as wide as its OpCode states. In lower end
    machines (and in GPUs) DBLE is sequenced as if it were an instruction.
    In higher end machines, DBLE would be CoIssued with its mate.

    So if DBLE says the next instruction is double width, does that mean
    that all "128 bit instructions" require 64 bits in the instruction
    stream? So a sequence of say four 128 bit arithmetic instructions would require the I space of 8 instructions?

    It is a 64-bit machine that provides a small modicum of support for
    larger sizes. It is not and never will be a 128-bit machine--that is
    what vVM is for.

    Key words "small modicum"

    DBLE simply supplies registers to the pipeline and width to decode.

    If so, I guess it is a tradeoff for not requiring register pairing, e.g.
    Rn and Rn+1.

    DBLE supports 128-bits in the ISA at the total cost of 1 instruction
    added per use. In many situations (especially integer) CARRY is the
    better option because it throws a shadow of width over a number of
    instructions and thereby has lower code footprint costs. So, a 256
    bit shift is only 5 instructions instead of 8. And realistically, if
    you want wider than that, you have already run out of registers.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 18:01:17 2025
    From Newsgroup: comp.arch

    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some alignment that the first instruction of a cache line is always aligned?

    The 40-bit instructions are byte aligned. This does add more shifting in
    the align stage. Once shifted, though, instructions are easily peeled off
    from fixed positions. One consequence is jump targets must be byte
    aligned, OR routines could be required to be 32-bit aligned, for instance.
    Having register pairs does not make the compiler writer's life easier, unfortunately.

    Going with a bit result vector in any GPR for compares, then a branch on
    bit-set/clear for conditional branches. Might also include branch true /
    false.

    Having 64 registers and 64 bit registers makes life easier for that particular task :-)

    If you have that many bits available, do you still go for a load-store architecture, or do you have memory operations? This could offset the
    larger size of your instructions.

    It is load/store with no memory ops, except possibly atomic memory ops.
    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those sizes are not really a good fit for constants from programs,
    where quite a few constants tend to be 32 or 64 bits. Would a
    64-bit FP constant leave 26 bits empty?

    I found that 16-bit immediates could be encoded instead of 10-bit.
    So, now 16, 56, 96 and 136 bit constants are possible. The 56-bit
    constant likely has enough range for most 64-bit ops. Otherwise, using
    a 96-bit constant for 64-bit ops would leave the upper 32 bits of the
    constant unused. 136-bit constants may not be implemented, but a size
    code is reserved for that size.
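
    As a minor aside, the check an assembler or compiler might use to pick
    the 56-bit form for a 64-bit operand is cheap (an illustrative sketch;
    it assumes the constant field is sign-extended, which is a guess rather
    than the defined encoding rule):

        #include <stdint.h>

        /* does v survive truncation to 56 bits plus sign extension? */
        static int fits_sext56(int64_t v)
        {
            return v >= -(INT64_C(1) << 55) && v < (INT64_C(1) << 55);
        }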


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 18:20:51 2025
    From Newsgroup: comp.arch

    On 2025-10-29 2:33 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on
    bit-set/clear for conditional branches. Might also include branch true /
    false.

    I have both the bit-vector compare and branch, but also a compare to zero
    and branch as a single instruction. I suggest you should too, if for no
    other reason than:

    if( p && p->next )


    Yes, I was going to have at least branch on register 0 (false) 1 (true)
    as there is encoding room to support it. It does add more cases in the
    branch eval, but is probably well worth it.
    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    My 66000 allows for occasional use of 128-bit values but is designed mainly for 64-bit and smaller.


    Following the same philosophy. Expecting only some use for 128-bit
    floats. Integers can only handle 8,16,32, or 64-bits.

    With 32-bit instructions, I provide, {5, 16, 32, and 64}-bit constants.

    Just last week we discovered a case where HW can do a better job than SW. Previously, the compiler would emit:

    CVTfd Rt,Rf
    FMUL Rt,Rt,#1.425D0
    CVTdf Rd,Rt

    Which is subject to double rounding once at the FMUL and again at the
    down conversion. I though about the problem and it seems fairly easy
    to gate the 24-bit fraction into the multiplier tree along with the
    53-bit fraction of the constant, and then normalize and round the
    result dropping out of the tree--avoiding the double rounding case.

    Now, the compiler emits:

    FMULf Rd,Rf,#1.425D0

    saving 2 instructions alongwith the higher precision.

    Improves the accuracy(?) of algorithms, but seems a bit specific to me.
    Are there other instruction sequences where double rounding would be good
    to avoid? Seems like HW could detect the sequence and fuse the instructions.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 18:26:05 2025
    From Newsgroup: comp.arch

    <snip>>> My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.

    Desktop PC:
      8C/16T: 3.7 Base, 4.3 Turbo, 112GB RAM (just, not very fast RAM)
        Rarely reaches turbo
          pretty much only happens if just running a single thread...
        With all cores running stuff in the background:
          Idles around 3.6 to 3.8.

    Laptop:
      4C/8T, 1.9 GHz Base, 4.2 GHz Turbo
        If power set to performance, reaches turbo a lot more easily,
          and with multi-core workloads.
        But, puts out a lot of heat while doing so...

    If set to Efficiency, mostly stays below 3 GHz.

    As noted, the laptop is surprisingly speedy for how cheap it was.

    <snip>
    For my latest PC I bought a gaming machine – i7-14700KF CPU (20 cores),
    32 GB RAM, 16 GB graphics RAM, 3.4 GHz (5.6 GHz in turbo mode). More RAM
    was needed; my last machine only had 16 GB, and I found it using about
    20 GB. I did not want to spring for a machine with even more RAM, as
    those tended to be high-end machines.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Oct 29 22:31:12 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    At this point, the discussion is academic, as Robert has said he has 6
    bit register specifiers in the instructions.

    He could still make these registers have 128 bits rather than pairing
    registers for 128-bit operation.

    But my issue had nothing
    to do with SIMD registers, as he said he supported 128 bit arithmetic
    and the "high" registers were used for that.

    As far as waste etc. is concerned, it does not matter if the 128-bit
    operation is a SIMD operation or a scalar 128-bit operation.

    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 18:48:56 2025
    From Newsgroup: comp.arch

    On 10/29/2025 5:26 PM, Robert Finch wrote:
    <snip>>> My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.

    Desktop PC:
       8C/16T: 3.7 Base, 4.3 Turbo, 112GB RAM (just, not very fast RAM)
         Rarely reaches turbo
           pretty much only happens if just running a single thread...
         With all cores running stuff in the background:
           Idles around 3.6 to 3.8.

    Laptop:
       4C/8T, 1.9 GHz Base, 4.2 GHz Turbo
         If power set to performance, reaches turbo a lot more easily,
           and with multi-core workloads.
         But, puts out a lot of heat while doing so...

    If set to Efficiency, mostly stays below 3 GHz.

    As noted, the laptop is surprisingly speedy for how cheap it was.

    <snip>
    For my latest PC I bought a gaming machine – i7-14700KF CPU (20 cores).
    32 GB RAM, 16GB graphics RAM. 3.4 GHz (5.6 GHz in turbo mode). More RAM
    was needed, my last machine only had 16GB, found it using about 20GB. I
    did not want to spring for a machine with even more RAM, they tended to
    be high-end machines.


    IIRC, current PC was something like:
    CPU: $80 (Zen+; Zen 2 and 3 were around, but more expensive)
    MOBO: $60
    Case: $50
    ...

    Spent around $200 for 128GB of RAM.
    Could have gotten a cheaper 64GB kit had I known my MOBO would not
    accept a full 128GB (then could have had 96 GB).


    The RTX card I have (RTX 3060) has 12 GB of VRAM.

    IIRC, it was also about the cheapest semi-modern graphics card I could
    find at the time. Like, while I could have bought an RTX 4090 or similar
    at the time, I am not made of money.

    Like, a prior-generation mid-range card being the cheaper option.
    And, still newer than the GTX980 that had died on me (where, the GTX980
    was itself second-hand).


    Before this, had been running a GTX 460, and before that, a Radeon HD
    4850 (IIRC).

    I think it was a case of:
    Had a Phenom II box, with the HD 4850;
    Switched to GTX 460, as I got one second-hand for free, slightly better;
    Replaced Phenom II board+CPU with FX-8350;
    Got GTX 980 (also second hand);
    Got Ryzen 7 2700X and new MOBO;
    Got RTX 3060 (as the 980 was failing).

    With the RTX 3060, had to go single-monitor, mostly as it only has
    DisplayPort outputs, and DP->HDMI->DVI via adapters doesn't seem to work (whereas HDMI->DVI did work via adapters).

    Well, also the RTX 3060 doesn't have a VGA output either (monitor would
    also accept VGA).

    Though, the current monitor I am using is newer and does support
    DisplayPort.


    I also managed to get a MultiSync CRT a while ago, but it only really
    gives good results at 640x480 and 800x600, 1024x768 sorta-works (but
    1280x1024 does not work), has a roughly 16" CRT or so; VGA input.

    I also have an LCD that goes up to 1280x1024, although it looks like
    garbage if set above 1024x768. Only accepts VGA.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Oct 30 07:13:54 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, >>> 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some
    alignment that the first instruction of a cache line is always aligned?

    The 40-bit instructions are byte aligned. This does add more shifting in
    the align stage. Once shifted though instructions are easily peeled off
    from fixed positions. One consequence is jump targets must be byte
    aligned OR routines could be required to be 32-bit aligned for instance.>

    That raises an interesting question. If you want to align a branch
    target on a 32-bit boundary, or even a cache line, how do you fill
    up the rest? If all instructions are 40 bits, you cannot have a
    NOP that is not 40 bits, so there would need to be a jump before
    a gap that does not fit 40 bits.

    If you have that many bits available, do you still go for a load-store
    architecture, or do you have memory operations? This could offset the
    larger size of your instructions.

    It is load/store with no memory ops excepting possibly atomic memory ops.>

    OK. Starting with 40 vs 32 bits, you have a factor of 1.25 disadvantage
    in code density to start with. Having memory operations could offset
    that by a certain factor, which was why I was asking.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those sizes are not really a good fit for constants from programs,
    where quite a few constants tend to be 32 or 64 bits. Would a
    64-bit FP constant leave 26 bits empty?

    I found that 16-bit immediates could be encoded instead of 10-bit.

    OK. That should also help for offsets in load/store.

    So, now there are 16,56,96 and 136 bit constants possible. The 56-bitconstant likely has enough range for most 64-bit ops.

    For addresses, it will take some time for this to overflow :-)
    For floating point constants, that will be hard.

    I have done some analysis on frequency of floating point constants
    in different programs, and what I found was that there are a few
    floating point constants that keep coming up, like a few integers
    around zero (biased towards the positive side), plus a few more
    golden oldies like 0.5, 1.5 and pi. Apart from that, I found that
    different programs have wildly different floating point constants,
    which is not surprising. (I based that analysis on the grand
    total of three packages, namely Perl, gnuplot and GSL, so coverage
    is not really extensive).

    Otherwise using
    a 96-bit constant for 64-bit ops would leave the upper 32-bit of the constant unused.

    There are also 32-bit floating point constants, and 32-bit integers
    as constants. There are also very many small integer constants, but
    of course there also could be others.

    136 bit constants may not be implemented, but a size
    code is reserved for that size.

    I'm still hoping for good 128-bit IEEE hardware float support.
    POWER has this, but stuck it on their decimal float arithmetic,
    which does not perform particularly well...
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Oct 30 13:53:04 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, >>>> 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named >>>> as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high >>>> is r33). Sameo for other registers. GPRs may contain either integer or >>>> floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some
    alignment that the first instruction of a cache line is always aligned?

    The 40-bit instructions are byte aligned. This does add more shifting in
    the align stage. Once shifted though instructions are easily peeled off
    from fixed positions. One consequence is jump targets must be byte
    aligned OR routines could be required to be 32-bit aligned for instance.>

    That raises an interesting question. If you want to align a branch
    target on a 32-bit boundary, or even a cache line, how do you fill
    up the rest? If all instructions are 40 bits, you cannot have a
    NOP that is not 40 bits, so there would need to be a jump before
    a gap that is does not fit 40 bits.

    iCache lines could be a multiple of 5-bytes in size (e.g. 80 bytes
    instead of 64).

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 30 16:09:00 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-10-29 2:33 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, >> 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on >> bit-set/clear for conditional branches. Might also include branch true / >> false.

    I have both the bit-vector compare and branch, but also a compare to zero and branch as a single instruction. I suggest you should too, if for no other reason than:

    if( p && p->next )


    Yes, I was going to have at least branch on register 0 (false) 1 (true)
    as there is encoding room to support it. It does add more cases in the branch eval, but is probably well worth it.
    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    My 66000 allows for occasional use of 128-bit values but is designed mainly for 64-bit and smaller.


    Following the same philosophy. Expecting only some use for 128-bit
    floats. Integers can only handle 8,16,32, or 64-bits.

    With 32-bit instructions, I provide, {5, 16, 32, and 64}-bit constants.

    Just last week we discovered a case where HW can do a better job than SW. Previously, the compiler would emit:

    CVTfd Rt,Rf
    FMUL Rt,Rt,#1.425D0
    CVTdf Rd,Rt

    Which is subject to double rounding once at the FMUL and again at the
    down conversion. I though about the problem and it seems fairly easy
    to gate the 24-bit fraction into the multiplier tree along with the
    53-bit fraction of the constant, and then normalize and round the
    result dropping out of the tree--avoiding the double rounding case.

    Now, the compiler emits:

    FMULf Rd,Rf,#1.425D0

    saving 2 instructions along with the higher precision.

    Improves the accuracy? of algorithms, but seems a bit specific to me.

    It is down in the 1% footprint area.

    Are there other instruction sequence where double-rounding would be good
    to avoid?

    Back when I joined Moto (1983) there was a lot of talk about double
    roundings and how it could screw up various algorithms but mainly in
    the 64-bit versus 80-bit stuff of 68881, where you got 11 more bits
    of precision and thus took a chance of 2/2^10 of a double rounding.
    Today with 32-bit versus 64-bit you take a chance of 2/2^28 so the
    problem is greatly ameliorated although technically still present.

    The problem arises due to a cross product of various {machine,
    language, compiler} features not working "all ends towards the middle".

    LLVM promotes FP calculations with a constant to 64-bits whenever the
    constant cannot be represented exactly in 32-bits. {Strike one}

    C makes no <useful> statements about precision of calculation control.
    {strike two}

    HW almost never provides mixed mode calculations which provide the
    means to avoid the double rounding. {strike three}

    So, technically, My 66000 does not provide general-mixed-mode FP,
    but I wrote a special rule to allow for larger constants used with
    narrower registers to cover exactly this case. {It also saves 2 CVT
    instructions (latency and footprint).}
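
    At the C level the pattern in question is simply (an illustration using
    plain C promotion rules rather than anything LLVM-specific):

        float scale(float x)
        {
            /* 1.425 is a double constant, so x is promoted, the multiply is
               done and rounded in double, and the return value is rounded
               again to float: two roundings.  A single-rounded mixed-width
               FMUL collapses this to one rounding and drops both CVTs. */
            return x * 1.425;
        }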

    Seems like HW could detect the sequence and fuse the instructions.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 30 16:10:47 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    At this point, the discussion is academic, as Robert has said he has 6
    bit register specifiers in the instructions.

    He could still make these registers have 128 bits rather than pairing registers for 128-bit operation.

    But my issue had nothing
    to do with SIMD registers, as he said he supported 128 bit arithmetic
    and the "high" registers were used for that.

    As far as waste etc. is concerned, it does not matter if the 128-bit operation is a SIMD operation or a scalar 128-bit operation.

    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    Which only goes to prove that x86 is not RISC.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Oct 30 12:29:39 2025
    From Newsgroup: comp.arch

    On 10/30/2025 11:10 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    At this point, the discussion is academic, as Robert has said he has 6
    bit register specifiers in the instructions.

    He could still make these registers have 128 bits rather than pairing
    registers for 128-bit operation.


    Only really makes sense if one assumes these resources are "borderline
    free".

    If you are also paying for logic complexity and wires/routing, then
    having bigger registers just to typically waste most of them is not ideal.


    Granted, one could argue that most of the register is wasted when, say:
    Most integer values could easily fit into 16 bits;
    We have 64-bit registers.

    But, there is enough that actually uses the 64-bits of a 64-bit register
    to make it worthwhile. Would be harder to say the same for 128-bit
    registers.

    It is common on many 32-bit machines to use register pairs for 64-bit operations.


    But my issue had nothing
    to do with SIMD registers, as he said he supported 128 bit arithmetic
    and the "high" registers were used for that.

    As far as waste etc. is concerned, it does not matter if the 128-bit
    operation is a SIMD operation or a scalar 128-bit operation.

    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    Which only goes to prove that x86 is not RISC.


    Also questionable to read as someone lacking much hardware that actually supports 256 or 512-bit AVX on the actual HW level. And, both AVX and
    AVX-512 had not exactly had clean roll-outs.


    Checks and, ironically, my recent super-cheap laptop was the first thing
    I got that apparently has proper 256-bit AVX support (still no AVX-512 though...).


    Still some oddities though:
    RAM that appears to be faster than it should be;
    The MHz and CAS latency appear abnormally high.
    They do not match the values for DDR4-2400.
    (Nor, even DDR4 in general).
    Appears to exceed expected bandwidth on memcpy test;
    ...
    Windows 11 on an unsupported CPU model;
    More so, Windows 11 Professional, also on something cheap.
    (Listing said it would come with Win10, got Win11 instead, OK).

    So, technically seems good, but also slightly sus...


    Differs slightly from what I was expecting:
    Something kinda old and not super fast;
    Listing said Windows 10, kinda expected Windows 10;
    ...

    Like, something non-standard may have been done here.


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 30 16:46:14 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    Which only goes to prove that x86 is not IRSC.

    I don't see that following at all, but it inspired a closer look at
    the usage/waste of register bits in RISCs:

    Every 64-bit RISC, starting with MIPS-IV and Alpha, wastes a lot of
    precious register bits by keeping 8-bit, 16-bit, and 32-bit values in
    64-bit registers rather than following the idea of Intel and Robert
    Finch of splitting the 64-bit register into double the number of 32-bit
    registers; this idea can be extended to eliminate waste by having
    quadruple the number of 16-bit registers that can be joined into 32-bit
    and 64-bit registers when needed, or even better, octuple the number
    of 8-bit registers that can be joined into 16-bit, 32-bit, and 64-bit
    registers. We can even resurrect the character-oriented or
    digit-oriented architectures of the 1950s.

    Intel split AX into AL and AH, similar for BX, CX, and DX, but not for
    SI, DI, BP, and SP. In the 32-bit extension, they did not add ways to
    access the third and fourth byte, or the second wyde (16-bit value).
    In the 64-bit extension, AMD added ways to access the low byte of
    every register (in addition to AH-DH), but no way to access the second
    byte of other registers than RAX-RDX, nor ways to access higher wydes,
    or 32-bit units. Apparently they were not concerned about this kind
    of waste. For the 8086 the explanation is not trying to avoid waste,
    but an easy automatic mapping from 8080 code to 8086 code.

    Writing to AL-DL or AX-DX,SI,DI,BP,SP leaves the other bits of the
    32-bit register alone, which one can consider to be useful for storing
    data in those bits (and in case of AL, AH actually provides a
    convenient way to access some of the bits, and vice versa), but leads
    to partial-register stalls. The hardware contains fast paths for some
    common cases of partial-register writes, but AFAIK AH-DH do not get
    fast paths in most CPUs.

    By contrast, RISCs waste the other 24 or 56 bits on a byte load by zero-extending or sign-extending the byte.

    Alpha avoids wasting register bits for some idioms by keeping up to 8
    bytes in a register in SIMD style (a few years before the wave of SIMD extensions across the industry), but still provides no direct name for
    the individual bytes of a register.

    IIRC the original HPPA has 32 or so 64-bit FP registers, which they
    then split into 58? 32-bit FP registers. I don't know how they
    further evolved that feature.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Oct 30 17:58:34 2025
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, >>>>> 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named >>>>> as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high >>>>> is r33). Sameo for other registers. GPRs may contain either integer or >>>>> floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some
    alignment that the first instruction of a cache line is always aligned? >>>
    The 40-bit instructions are byte aligned. This does add more shifting in >>> the align stage. Once shifted though instructions are easily peeled off
    from fixed positions. One consequence is jump targets must be byte
    aligned OR routines could be required to be 32-bit aligned for instance.> >>
    That raises an interesting question. If you want to align a branch
    target on a 32-bit boundary, or even a cache line, how do you fill
    up the rest? If all instructions are 40 bits, you cannot have a
    NOP that is not 40 bits, so there would need to be a jump before
    a gap that is does not fit 40 bits.

    iCache lines could be a multiple of 5-bytes in size (e.g. 80 bytes
    instead of 64).

    There is a cache level (L2 usually, I believe) where icache and
    dcache are no longer separate. Wouldn't this cause problems
    or inefficiencies?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Oct 30 23:39:28 2025
    From Newsgroup: comp.arch

    On Thu, 30 Oct 2025 16:46:14 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    Alpha avoids wasting register bits for some idioms by keeping up to 8
    bytes in a register in SIMD style (a few years before the wave of SIMD extensions across the industry), but still provides no direct name for
    the individual bytes of a register.


    According to my understanding, EV4 had no SIMD-style instructions.
    They were introduced in EV5 (Jan 1995). Which makes it only ~6 months
    ahead of VIS in UltraSPARC.




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 30 22:00:50 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    Which only goes to prove that x86 is not RISC.

    I don't see that following at all, but it inspired a closer look at
    the usage/waste of register bits in RISCs:

    Every 64-bit RISC starting with MIPS-IV and Alpha, wastes a lot of
    precious register bits by keeping 8-bit, 16-bit, and 32-bit values in
    64-bit registers rather than following the idea of Intel and Robert
    Finch of splitting the 64-bit register in the double number of 32-bit registers; this idea can be extended to eliminate waste by having the quadruple number of 16-bit registers that can be joined into 32-bit
    anbd 64-bit registers when needed, or even better, the octuple number
    of 8-bit registers that can be joined to 16-bit, 32-bit, and 64-bit registers. We can even ressurrect the character-oriented or
    digit-oriented architectures of the 1950s.

    Consider that being able to address every 2^(3+n) field of a register
    is far from free. Take a simple add of 2 bytes::

    ADDB R8[7], R6[3], R19[4]

    One has to individually align each of the bytes, which is going to blow
    out your timing for forwarding by at least 3 gates of delay (operands)
    and 4 gates for the result (register). The only way it makes "timing"
    sense is if you restrict the patterns to::

    ADDB R8[7], R6[7], R19[7]

    Where there is no "vertical" routing in obtaining operands and delivering results. {{OR you could always just eat a latency cycle when all fields
    are not the same.}}

    I also suspect that you would find few compiler writers willing to support random fields in registers.

    Intel split AX into AL and AH, similar for BX, CX, and DX, but not for
    SI, DI, BP, and SP.

    {ABCD}X registers were data.
    {SDBS} registers were pointer registers.

    There are vanishingly few useful manipulations on part of pointers.

    Oh and BTW:: using x86-history as justification for an architectural
    feature is "bad style".

    In the 32-bit extension, they did not add ways to
    access the third and fourth byte, or the second wyde (16-bit value).
    In the 64-bit extension, AMD added ways to access the low byte of
    every register (in addition to AH-DH), but no way to access the second
    byte of other registers than RAX-RDX, nor ways to access higher wydes,
    or 32-bit units. Apparently they were not concerned about this kind
    of waste. For the 8086 the explanation is not trying to avoid waste,
    but an easy automatic mapping from 8080 code to 8086 code.

    Writing to AL-DL or AX-DX,SI,DI,BP,SP leaves the other bits of the
    32-bit register alone, which one can consider to be useful for storing
    data in those bits (and in case of AL, AH actually provides a
    conventient way to access some of the bits, and vice versa), but leads
    to partial-register stalls. The hardware contains fast paths for some
    common cases of partial-register writes, but AFAIK AH-DH do not get
    fast paths in most CPUs.

    By contrast, RISCs waste the other 24 of 56 bits on a byte load by zero-extending or sign-extending the byte.

    But it gains the property that the whole register contains 1 proper value
    {range-limited to the container size whence it came}. This in turn makes
    tracking values easy--in fact, placing several different sized values
    in a single register makes it essentially impossible to perform value
    analysis in the compiler.

    Alpha avoids wasting register bits for some idioms by keeping up to 8
    bytes in a register in SIMD style (a few years before the wave of SIMD extensions across the industry), but still provides no direct name for
    the individual bytes of a register.

    If your ISA has excellent support for statically positioned bit-fields
    (or even better with dynamically positioned bit fields) fetching the
    fields and depositing them back into containers does not add significant latency. {volatile notwithstanding} While poor ISA support does add
    significant latency.

    IIRC the original HPPA has 32 or so 64-bit FP registers, which they
    then split into 58? 32-bit FP registers. I don't know how they
    further evolved that feature.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 30 22:06:35 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named >>>>> as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high >>>>> is r33). Sameo for other registers. GPRs may contain either integer or >>>>> floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some >>>> alignment that the first instruction of a cache line is always aligned? >>>
    The 40-bit instructions are byte aligned. This does add more shifting in >>> the align stage. Once shifted though instructions are easily peeled off >>> from fixed positions. One consequence is jump targets must be byte
    aligned OR routines could be required to be 32-bit aligned for instance.> >>
    That raises an interesting question. If you want to align a branch
    target on a 32-bit boundary, or even a cache line, how do you fill
    up the rest? If all instructions are 40 bits, you cannot have a
    NOP that is not 40 bits, so there would need to be a jump before
    a gap that is does not fit 40 bits.

    iCache lines could be a multiple of 5-bytes in size (e.g. 80 bytes
    instead of 64).

    There is a cache level (L2 usually, I believe) when icache and
    dcache are no longer separate. Wouldn't this cause problems
    or inefficiencies?

    Consider trying to invalidate an ICache line--this requires looking
    at 2 DCache lines to see if they, too, need invalidation.

    Consider self-modifying code, the data stream overwrites an instruction,
    then later the FETCH engine runs over the modified line, but the modified
    line is 64-bytes of the needed 80-bytes, so you take a hit and a miss on
    a single fetch.

    It also prevents SNARFing updates to ICache instructions, unless the
    SNARFed data is entirely retained in a single ICache line.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 30 22:19:18 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    According to my understanding, EV4 had no SIMD-style instructions.

    My understanding is that CMPBGE and ZAP(NOT), both SIMD-style
    instructions, were already present in EV4. The architecture
    description <https://download.majix.org/dec/alpha_arch_ref.pdf> does
    not say that some implementations don't include these instructions in
    hardware, whereas for the Multimedia support instructions (Section
    4.13), the reference does say that.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Oct 31 00:57:42 2025
    From Newsgroup: comp.arch

    On Thu, 30 Oct 2025 22:19:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    According to my understanding, EV4 had no SIMD-style instructions.

    My understanding ist that CMPBGE and ZAP(NOT), both SIMD-style
    instructions, were already present in EV4.


    Yes, those were in EV4.

    Alpha 21064 and Alpha 21064A HRM is here: https://github.com/JonathanBelanger/DECaxp/blob/master/ExternalDocumentation

    I didn't consider these instructions as SIMD. Maybe I should have.
    Looks like these instructions are intended to accelerate string
    processing. That's unusual for the first wave of SIMD extensions.

    The architecture
    description <https://download.majix.org/dec/alpha_arch_ref.pdf> does
    not say that some implementations don't include these instructons in hardware, whereas for the Multimedia support instructions (Section
    4.13), the reference does say that.

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Oct 31 14:48:41 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 30 Oct 2025 22:19:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    My understanding ist that CMPBGE and ZAP(NOT), both SIMD-style
    instructions, were already present in EV4.
    ...
    I didn't consider these instructions as SIMD. May be, I should have.

    They definitely are, but they were not touted as such at the time, and
    they use the GPRs, unlike most SIMD extensions to instruction sets.

    Looks like these instructions are intended to accelerated string
    processing. That's unusual for the first wave of SIMD extensions.

    Yes. This was pre-first-wave. The Alpha architects just wanted to
    speed up some common operations that would otherwise have been
    relatively slow thanks to Alpha initially not having BWX instructions. Ironically, when Alpha showed a particularly good result on some
    benchmark (maybe Dhrystone), someone claimed that these string
    instructions gave Alpha an unfair advantage.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Oct 31 13:21:45 2025
    From Newsgroup: comp.arch

    On 10/31/2025 9:48 AM, Anton Ertl wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 30 Oct 2025 22:19:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    My understanding ist that CMPBGE and ZAP(NOT), both SIMD-style
    instructions, were already present in EV4.
    ...
    I didn't consider these instructions as SIMD. May be, I should have.

    They definitely are, but they were not touted as such at the time, and
    they use the GPRs, unlike most SIMD extensions to instruction sets.

    Looks like these instructions are intended to accelerated string
    processing. That's unusual for the first wave of SIMD extensions.

    Yes. This was pre-first-wave. The Alpha architects just wanted to
    speed up some common operations that would otherwise have been
    relatively slow thanks to Alpha initially not having BWX instructions. Ironically, when Alpha showed a particularly good result on some
    benchmark (maybe Dhrystone), someone claimed that these string
    instructions gave Alpha an unfair advantage.


    Most likely Dhrystone:
    It shows disproportionate impact from the relative speed of things like "strcmp()" and integer divide.


    I had experimented with special instructions for packed search, which
    could be used to help with either string compare or implementing
    dictionary objects in my usual way.


    Though, had later fallen back to a more generic way of implementing
    "strcmp()" that allows a fairer comparison between my own ISA and
    RISC-V. Where, say, one instead makes the determination based on how
    efficiently the ISA can handle various pieces of C code (rather than the
    use of niche instructions that typically require hand-written ASM or
    similar).
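
    For reference, this is the sort of word-at-a-time idiom a generic
    plain-C strcmp/strlen can lean on (the well-known zero-byte test, shown
    here as a generic sketch rather than the actual implementation):

        #include <stdint.h>

        /* true if any byte of v is zero (classic SWAR trick) */
        static int has_zero_byte(uint64_t v)
        {
            return ((v - UINT64_C(0x0101010101010101)) & ~v &
                    UINT64_C(0x8080808080808080)) != 0;
        }

        /* usage: load 8 bytes from each string, XOR them to locate the
           first differing byte, and use has_zero_byte() on either word
           to detect the terminator */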



    Generally, makes more sense to use helper instructions that have a
    general impact on performance, say for example, affecting how quickly a
    new image can be drawn into VRAM.

    For example, in my GUI experiments:
    Most of the programs are redrawing the screens as, say, 320x200 RGB555.

    Well, except ROTT, which uses 384x200 8-bit, on top of a bunch of code
    to mimic planar VGA behavior. In this case, for the port it was easier
    to write wrapper code to fake the VGA weirdness than to try to rewrite
    the whole renderer to work with a normal linear framebuffer (like what
    Doom and similar had used).


    In a lot of the cases, I was using an 8-bit indexed color or color-cell
    mode. For indexed color, one needs to send each image through a palette conversion (to the OS color palette); or run a color-cell encoder.
    Mostly because the display HW used 128K of VRAM.

    And, even if RAM backed, there are bandwidth problems with going bigger;
    so higher-resolutions had typically worked to reduce the bits per pixel:
    320x200: 16 bpp
    640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
    800x600: 2 or 4 bpp color-cell
    1024x768: 1 bpp monochrome, other experiments (*1)
    Or, use the 2 bpp mode, for 192K.

    *1: Bayer Pattern Mode/Logic (where the pattern of pixels also encodes
    the color);
    One possibility also being to use an indexed color pair for every 8x8, allowing for a 1.25 bpp color cell mode.

    Though, thus far the 1024x768 mode is still mostly untested on real
    hardware.

    Had experimented some with special instructions to speed up the indexed
    color conversion and color-cell encoding, but had mostly gone back and
    forth between using helper instructions and normal plain C logic, and
    which exact route to take.

    Had at one point had a helper instruction for the "convert 4 RGB555
    colors to 4 indexed colors using a hardware palette", but this broke
    when I later ended up modifying the system palette for better results
    (which was a critical weakness of this approach). Also the naive
    strategy of using a 32K lookup table isn't great, as this doesn't fit
    into the L1 cache.


    So, for 4 bpp color cell:
    Generally, each block of 4x4 pixels understood as 2 RGB555 endpoints,
    and 2 selector bits per pixel. Though, in VRAM, 4 of these are packed
    into a logical 8x8 pixel block; rather than a linear ordering like in
    DXT1 or similar (specifics differ, but general concept is similar to DXT1/S3TC).

    The 2bpp mode generally has 8x8 pixels encoded as 1bpp in raster order
    (same order as a character cell, with MSB in top-left corner and LSB in lower-right corner). And, then typically 2x RGB555 over the 8x8 block.
    IIRC, had also experimented with having each 4x4 sub-block able to use a
    pair of RGB232 colors, but was harder to get good results.

    But, to help with this process, it was useful to have helper operations
    for, say:
    Map RGB555 values to a luma value;
    Select minimum and maximum RGB555 values for block;
    Map luma values to 1 or 2 bit selectors;
    ...
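
    Put together, a plain-C 4x4 encode along those lines might look like
    this (an illustrative DXT1-style sketch, not the actual encoder; pixels
    assumed in row-major order):

        #include <stdint.h>

        /* crude luma from RGB555: 2*G + R + B, range 0..124 */
        static int luma555(uint16_t c)
        {
            int r = (c >> 10) & 31, g = (c >> 5) & 31, b = c & 31;
            return 2 * g + r + b;
        }

        /* encode 16 RGB555 pixels as 2 endpoints + a 2-bit selector each */
        static void encode_block(const uint16_t px[16],
                                 uint16_t *c_lo, uint16_t *c_hi,
                                 uint32_t *sel)
        {
            int ymin = 999, ymax = -1, ilo = 0, ihi = 0;
            for (int i = 0; i < 16; i++) {           /* min/max by luma */
                int y = luma555(px[i]);
                if (y < ymin) { ymin = y; ilo = i; }
                if (y > ymax) { ymax = y; ihi = i; }
            }
            *c_lo = px[ilo];
            *c_hi = px[ihi];
            *sel = 0;
            for (int i = 0; i < 16; i++) {           /* selector per pixel */
                int y = luma555(px[i]);
                int t = (ymax > ymin) ? (4 * (y - ymin)) / (ymax - ymin + 1) : 0;
                *sel |= (uint32_t)t << (2 * i);
            }
        }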


    Internally, the GUI mode had worked by drawing everything to an RGB555
    framebuffer (~512K or 1MB) and then using a bitmap to track which
    blocks had been modified and need to be re-encoded and sent over to
    VRAM. Blocks are first flagged during window redraw, then compared
    against a previous copy of the framebuffer to refine the set that
    actually differs, with blocks copied into that previous copy as needed
    to keep it current.

    Process wasn't particularly efficient (and performance is considerably
    worse than what Win3.x or Win9x seemed to give).
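
    The dirty-block bookkeeping amounts to roughly the following (an
    illustrative sketch with invented names, assuming a 320x200 RGB555
    buffer and 8x8 blocks; not the actual GUI code):

        #include <stdint.h>
        #include <string.h>

        #define W  320
        #define H  200
        #define BW (W / 8)
        #define BH (H / 8)

        /* mark blocks whose pixels changed since the previous frame
           (caller clears the dirty map beforehand) */
        static void find_dirty(const uint16_t *cur, uint16_t *prev,
                               uint8_t dirty[BH][BW])
        {
            for (int by = 0; by < BH; by++) {
                for (int bx = 0; bx < BW; bx++) {
                    int changed = 0;
                    for (int y = 0; y < 8 && !changed; y++) {
                        const uint16_t *c = cur  + (by * 8 + y) * W + bx * 8;
                        const uint16_t *p = prev + (by * 8 + y) * W + bx * 8;
                        changed = memcmp(c, p, 8 * sizeof *c) != 0;
                    }
                    if (changed) {
                        dirty[by][bx] = 1;   /* re-encode + send to VRAM */
                        for (int y = 0; y < 8; y++)
                            memcpy(prev + (by * 8 + y) * W + bx * 8,
                                   cur  + (by * 8 + y) * W + bx * 8,
                                   8 * sizeof *cur);
                    }
                }
            }
        }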



    As for the packed-search instructions, there were 16-bit versions as
    well, which could be used either to help with UTF-16 operations; or for dictionary objects.

    Where, a common way I implement dictionary objects is to use arrays of
    16-bit keys with 64-bit values (often tagged values or similar).

    Though, this does put a limit on the maximum number of unique symbols
    that can be used as dictionary keys, but this is not often an issue in
    practice. Generally these are not QNames or C function names, so this
    reduces the issue of running out of symbol names somewhat.
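
    A plain-C stand-in for the 16-bit packed-search case looks roughly like
    this (a sketch of the operation itself, not the actual helper
    instruction or its encoding):

        #include <stdint.h>

        /* return the lane (0..3) of the first 16-bit key in 'packed' equal
           to 'key', or -1 if none match; 'packed' holds 4 keys */
        static int psearch16(uint64_t packed, uint16_t key)
        {
            for (int i = 0; i < 4; i++)
                if ((uint16_t)(packed >> (16 * i)) == key)
                    return i;
            return -1;
        }

    A dictionary lookup then walks the key array a 64-bit word at a time,
    and on a hit indexes the parallel array of 64-bit (tagged) values.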

    One can also differ though on how much sense it makes to have
    ISA level helpers for working with tagrefs and similar (or, getting the
    ABI involved with these matters, like defining in the ABI the encodings
    for things like fixnum/flonum/etc).

    ...


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Oct 31 14:32:00 2025
    From Newsgroup: comp.arch

    On 10/31/2025 1:21 PM, BGB wrote:

    ...


    In a lot of the cases, I was using an 8-bit indexed color or color-cell mode. For indexed color, one needs to send each image through a palette conversion (to the OS color palette); or run a color-cell encoder.
    Mostly because the display HW used 128K of VRAM.

    And, even if RAM backed, there are bandwidth problems with going bigger;
    so higher-resolutions had typically worked to reduce the bits per pixel:
       320x200: 16 bpp
       640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
       800x600: 2 or 4 bpp color-cell
      1024x768: 1 bpp monochrome, other experiments (*1)
        Or, use the 2 bpp mode, for 192K.

    *1: Bayer Pattern Mode/Logic (where the pattern of pixels also encodes
    the color);
    One possibility also being to use an indexed color pair for every 8x8, allowing for a 1.25 bpp color cell mode.



    Expanding on this:
    Idea 1, original:
    Each group of 2x2 pixels understood as:
    G R
    B G
    With each pixel alternating color.

    But, slightly better for quality is to operate on blocks of 4x4 pixels,
    with the pixel bits encoding color indirectly for the whole 4x4 block:
    G R G B
    B G R G
    G R G B
    B G R G
    So, if >= 4 G bits are set, G is High.
    So, if >= 2 R bits are set, R is High.
    So, if >= 2 B bits are set, B is High.
    If > 8 bits are set, I is high.

    The non-set pixels usually assume either 0000 (Black) or 1000 (Dark
    Grey) depending on the I bit. Or, a low intensity version of the main
    color if over 75% of the bits for a given channel are set (say, for mostly flat
    color blocks).

    Still kinda sucks, but allows a crude approximation of 16 color graphics
    at 1 bpp...
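
    In code form, the block color recovery works out to popcounts against
    per-channel masks, along these lines (an illustrative sketch; the exact
    4x4 bit packing and the IRGB nibble layout are assumptions):

        #include <stdint.h>

        static int pop16(uint16_t v)
        {
            int n = 0;
            while (v) { n += v & 1; v >>= 1; }
            return n;
        }

        /* bits: the 4x4 block, bit 15 = top-left pixel, row-major.
           Pattern:  G R G B / B G R G / G R G B / B G R G            */
        static uint8_t block_irgb(uint16_t bits)
        {
            int g = pop16(bits & 0xA5A5);    /* 8 G positions */
            int r = pop16(bits & 0x4242);    /* 4 R positions */
            int b = pop16(bits & 0x1818);    /* 4 B positions */
            uint8_t c = 0;
            if (g >= 4)          c |= 0x4;   /* G high */
            if (r >= 2)          c |= 0x2;   /* R high */
            if (b >= 2)          c |= 0x1;   /* B high */
            if (pop16(bits) > 8) c |= 0x8;   /* I high */
            return c;                        /* IRGB nibble */
        }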


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Oct 31 21:09:23 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:
    Improves the accuracy? of algorithms, but seems a bit specific to me.

    It is down in the 1% footprint area.

    Are there other instruction sequence where double-rounding would be good
    to avoid?

    Back when I joined Moto (1983) there was a lot of talk about double
    roundings and how it could screw up various algorithms but mainly in
    the 64-bit versus 80-bit stuff of 68881, where you got 11-more bits
    of precision and thus took a change of 2/2^10 of a double rounding.
    Today with 32-bit versus 64-bit you take a chance of 2/2^28 so the
    problem is greatly ameliorated although technically still present.

    Actually, for the five required basic operations, you can always do the
    op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    This is because the mantissa lengths (including the hidden bit) increase
    to at least 2n+2:

    f16 1:5:10 (1+10=11, 11*2+2 = 22)
    f32 1:8:23 (1+23=24, 24*2+2 = 50)
    f64 1:11:52 (1+52=53, 53*2+2 = 108)
    f128 1:15:112 (1+112=113)

    You can however NOT use f128 FMUL + FADD to emulate f64 FMAC, since that
    would require a triple sized mantissa.
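
    A quick empirical check of that 2n+2 claim for f32 ops done via f64 (a
    throwaway sketch, assuming FLT_EVAL_METHOD == 0 so float expressions
    really are rounded to float):

        #include <stdio.h>
        #include <stdlib.h>

        int main(void)
        {
            srand(42);
            for (long i = 0; i < 10000000; i++) {
                float a = (float)(rand() % 1000000 + 1) / 1000.0f;
                float b = (float)(rand() % 1000000 + 1) / 1000.0f;
                float direct = a / b;                          /* one rounding */
                float via_d  = (float)((double)a / (double)b); /* round twice  */
                if (direct != via_d) {
                    printf("mismatch: %a / %a\n", a, b);
                    return 1;
                }
            }
            printf("no double-rounding mismatch found\n");
            return 0;
        }

    Run over random inputs it should never report a mismatch, since
    53 >= 2*24+2 = 50.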

    The Intel+Motorola 80-bit format was a bastard that made it effectively impossible to produce bit-for-bit identical results even when the FPU
    was set to 64-bit precision.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Oct 31 21:12:45 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Thu, 30 Oct 2025 16:46:14 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    Alpha avoids wasting register bits for some idioms by keeping up to 8
    bytes in a register in SIMD style (a few years before the wave of SIMD
    extensions across the industry), but still provides no direct name for
    the individual bytes of a register.


    According to my understanding, EV4 had no SIMD-style instructions.
    They were introduced in EV5 (Jan 1995). Which makes it only ~6 months
    ahead of VIS in UltraSPARC.

    The original (v1?) Alpha had instructions intended to make it "easy" to
    process character data in 8-byte chunks inside a register.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 1 18:19:48 2025
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:
    Improves the accuracy? of algorithms, but seems a bit specific to me.

    It is down in the 1% footprint area.

    Are there other instruction sequence where double-rounding would be good >> to avoid?

    Back when I joined Moto (1983) there was a lot of talk about double roundings and how it could screw up various algorithms but mainly in
    the 64-bit versus 80-bit stuff of 68881, where you got 11-more bits
    of precision and thus took a change of 2/2^10 of a double rounding.
    Today with 32-bit versus 64-bit you take a chance of 2/2^28 so the
    problem is greatly ameliorated although technically still present.

    Actually, for the five required basic operations, you can always do the
    op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    This is because the mantissa lengths (including the hidden bit) increase
    to at least 2n+2:

    f16 1:5:10 (1+10=11, 11*2+2 = 22)
    f32 1:8:23 (1+23=24, 24*2+2 = 50)
    f64 1:11:52 (1+52=53, 53*2+2 = 108)
    f128 1:15:112 (1+112=113)

    You can however NOT use f128 FMUL + FADD to emulate f64 FMAC, since that would require a triple sized mantissa.

    The Intel+Motorola 80-bit format was a bastard that made it effectively impossible to produce bit-for-bit identical results even when the FPU
    was set to 64-bit precision.

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Nov 1 19:18:39 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Intel split AX into AL and AH, similar for BX, CX, and DX, but not for
    SI, DI, BP, and SP.

    {ABCD}X registers were data.
    {SDBS} registers were pointer registers.

    The 8086 is no 68000. The [BX] addressing mode makes it obvious that
    that's not the case.

    What is actually the case: AL-DL, AH-DH correspond to 8-bit registers
    of the 8080, some of AX-DX correspond to register pairs. SI, DI, BP
    are new, SP corresponds to the 8080 SP, which does not have 8-bit
    components. That's why SI, DI, BP, SP have no low or high
    sub-registers.

    Oh and BTW:: using x86-history as justification for an architectural
    feature is "bad style".

    I think that we can learn a lot from earlier architectures, some
    things to adopt and some things to avoid. Concerning subregisters, I
    lean towards avoiding.

    That's also another reason to avoid load-and-op and RMW instructions.
    With a load/store architecture, load can sign/zero extend as
    necessary, and then most operations can be done at full width.

    But gains the property that the whole register contains 1 proper value
    {range-limited to the container size whence it came} This in turn makes
    tracking values easy--in fact placing several different sized values
    in a single register makes it essentially impossible to perform value
    analysis in the compiler.

    I don't think it's impossible or particularly hard for the compiler. Implementing it in OoO hardware causes complications, though.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Nov 1 21:08:35 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always do the
    op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Nov 2 02:21:18 2025
    From Newsgroup: comp.arch

    On 10/31/2025 2:32 PM, BGB wrote:
    On 10/31/2025 1:21 PM, BGB wrote:

    ...


    In a lot of the cases, I was using an 8-bit indexed color or color-
    cell mode. For indexed color, one needs to send each image through a
    palette conversion (to the OS color palette); or run a color-cell
    encoder. Mostly because the display HW used 128K of VRAM.

    And, even if RAM backed, there are bandwidth problems with going
    bigger; so higher-resolutions had typically worked to reduce the bits
    per pixel:
        320x200: 16 bpp
        640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
        800x600: 2 or 4 bpp color-cell
       1024x768: 1 bpp monochrome, other experiments (*1)
         Or, use the 2 bpp mode, for 192K.

    *1: Bayer Pattern Mode/Logic (where the pattern of pixels also encodes
    the color);
    One possibility also being to use an indexed color pair for every 8x8,
    allowing for a 1.25 bpp color cell mode.
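
    As a rough check on those memory figures, a small helper (purely
    illustrative, not from the original design) computing frame size as
    width * height * bpp / 8:

        #include <stdio.h>

        /* Hypothetical helper: framebuffer bytes for a given video mode. */
        static long fb_bytes(long w, long h, long bpp)
        {
            return (w * h * bpp) / 8;
        }

        int main(void)
        {
            printf("320x200x16: %ld\n", fb_bytes(320, 200, 16)); /* 128000, fits 128K VRAM */
            printf("640x400x8:  %ld\n", fb_bytes(640, 400, 8));  /* 256000, the "256K" case */
            printf("1024x768x2: %ld\n", fb_bytes(1024, 768, 2)); /* 196608 = 192K exactly */
            return 0;
        }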



    Expanding on this:
    Idea 1, original:
    Each group of 2x2 pixels understood as:
      G R
      B G
    With each pixel alternating color.

    But, slightly better for quality is to operate on blocks of 4x4 pixels,
    with the pixel bits encoding color indirectly for the whole 4x4 block:
      G R G B
      B G R G
      G R G B
      B G R G
    So:
    If >= 4 of the 8 G bits are set, G is High.
    If >= 2 of the 4 R bits are set, R is High.
    If >= 2 of the 4 B bits are set, B is High.
    If > 8 bits total are set, I is High.

    The non-set pixels usually assume either 0000 (Black) or 1000 (Dark
    Grey) depending on the I bit. Or, a low intensity version of the main
    color if over 75% of a given channel's bits are set one way (say, for
    mostly flat color blocks).

    Still kinda sucks, but allows a crude approximation of 16 color graphics
    at 1 bpp...
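
    As a software sketch of that 4x4 RGBI recovery (my own illustration;
    it assumes the 16 block bits are stored row-major with bit 0 at the
    top-left, following the G R G B / B G R G tiling above):

        #include <stdint.h>

        /* Count set bits in a 16-bit value. */
        static int popcount16(uint16_t v)
        {
            int n = 0;
            while (v) { n += v & 1; v >>= 1; }
            return n;
        }

        /* Recover a 4-bit IRGB value (bit3=I, bit2=R, bit1=G, bit0=B)
         * from one 4x4 block of pixel bits. */
        static unsigned rgbi_from_block(uint16_t blk)
        {
            const uint16_t gmask = 0xA5A5;  /* the 8 G positions */
            const uint16_t rmask = 0x4242;  /* the 4 R positions */
            const uint16_t bmask = 0x1818;  /* the 4 B positions */
            unsigned g = popcount16(blk & gmask) >= 4;
            unsigned r = popcount16(blk & rmask) >= 2;
            unsigned b = popcount16(blk & bmask) >= 2;
            unsigned i = popcount16(blk) > 8;
            return (i << 3) | (r << 2) | (g << 1) | b;
        }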


    Well, anyways, here is me testing with another variation of the idea
    (after thinking about it again).

    Using a joke image as a test case here...

    https://x.com/cr88192/status/1984694932666261839

    This variation uses:
    Y R
    B G

    In this case tiling as:
    Y R Y R ...
    B G B G ...
    Y R Y R ...
    B G B G ...
    ...

    Where Y is a pure luma value.
    May or may not use this; the alternative is:
    Y R B G Y R B G
    B G Y R B G Y R
    ...
    But the prior pattern is simpler to deal with.

    Note that having every line follow the same pattern (with no
    alternation) would lead to obvious vertical lines in the output.


    This uses a different (slightly more complicated) color recovery
    algorithm, and operates on 8x8 pixel blocks.

    With 4x4, there are effectively 4 bits per channel, which is enough to
    recover 1 bit of color per channel.

    With 8x8, there are 16 bits per channel, and it is possible to recover
    ~3 bits per channel, allowing for roughly an RGB333 color space
    (though, the vectors are normalized here).

    Having both a Y and G channel slightly helps with the color-recovery
    process; and allows a way to signal a monochrome block (if Y==G, the
    block is assumed to be monochrome, and the R/B bits can be used more
    freely for expressing luma).

    Where:
    Chroma accuracy comes at the expense of luma accuracy;
    An increased colorspace comes at the cost of spatial resolution of chroma;
    ...


    Dealing with chroma does have the effect of making the dithering process
    more complicated. As noted, reliable recovery of the color vector is
    itself a bit fiddly (and is very sensitive to the encoder side dither process).

    The former image was itself an example of an artifact caused by the
    dithering process, which in this case was over-boosting the green
    channel (and rotating the dither matrix would result in drastic color
    shifts). The latter image was from after I realized the issue with the
    dither pattern and modified how it was being handled (replacing the
    8x8 ordered dither with a 4x4 ordered dither, and then rotating the
    matrix for each channel).
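
    A minimal sketch of what that could look like (my illustration, not
    the actual encoder; it assumes "rotating the matrix for each channel"
    means rotating the 4x4 threshold matrix by 90 degrees per channel):

        #include <stdint.h>

        /* Classic 4x4 ordered-dither (Bayer) threshold matrix, values 0..15. */
        static const uint8_t bayer4[4][4] = {
            {  0,  8,  2, 10 },
            { 12,  4, 14,  6 },
            {  3, 11,  1,  9 },
            { 15,  7, 13,  5 },
        };

        /* Dither one 8-bit channel value at pixel (x,y); 'chan' selects a
         * 90-degree rotation of the threshold matrix. Returns 0 or 1. */
        static int dither_bit(int v, int x, int y, int chan)
        {
            int xx = x & 3, yy = y & 3, t;
            switch (chan & 3) {
            case 0:  t = bayer4[yy][xx];         break;
            case 1:  t = bayer4[3 - xx][yy];     break;  /*  90 deg */
            case 2:  t = bayer4[3 - yy][3 - xx]; break;  /* 180 deg */
            default: t = bayer4[xx][3 - yy];     break;  /* 270 deg */
            }
            return (v * 16) > (t * 255);  /* v=0 -> all clear, v=255 -> all set */
        }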


    Image quality isn't great, but then again, not sure how to do that much
    better with a naive 1 bit/pixel encoding.


    I guess, an open question here is whether the color-recovery algorithm
    would be practical for hardware / FPGA.

    One possibility could be:
    Use LUT4 to map 4b -> 2b (as a count);
    Then, map 2x2b -> 3b (adder);
    Then, map 2x3b -> 4b (adder), then discard the LSB.
    Then, select the max of R/G/B/Y;
    This is used as an inverse normalization scale.
    Feed each value and the scale through a LUT (for R/G/B),
    Getting a 5-bit scaled value for each of R/G/B;
    Roughly: (Val<<5)/Max.
    Compose an RGB555 (5 bits per channel) value used for each pixel that is set.

    The actual pixel decoding process works the same as with 8x8 blocks of
    1-bit monochrome, selecting the minimum or maximum color based on each
    bit.

    Possibly, Y could also be used to select "relative" minimum and maximum values, vs full intensity and black, but this would add more logic
    complexity.
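
    In software terms, the normalization step could look something like
    the sketch below (my illustration; it assumes the per-channel set-bit
    counts for an 8x8 block, 0..16 each, have already been computed, and
    it uses val*31/max in place of the (Val<<5)/Max approximation so the
    result stays within 5 bits):

        #include <stdint.h>

        /* Recover an RGB555 "block color" from per-channel bit counts.
         * Set pixels in the block then use this color; clear pixels use
         * black (or a Y-derived minimum, per the note above). */
        static uint16_t block_color_rgb555(int ycnt, int rcnt, int gcnt, int bcnt)
        {
            int max = ycnt;
            if (rcnt > max) max = rcnt;
            if (gcnt > max) max = gcnt;
            if (bcnt > max) max = bcnt;
            if (max == 0)
                return 0;                   /* fully dark block */

            /* Normalize each channel against the largest count. */
            int r = (rcnt * 31) / max;
            int g = (gcnt * 31) / max;
            int b = (bcnt * 31) / max;
            return (uint16_t)((r << 10) | (g << 5) | b);
        }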


    Pros/Cons:
    +: Looks better than per-pixel Bayer-RGB
    +: Looks better than 4x4 RGBI
    -: Would require more complex decoder logic;
    -: Requires specialized dither logic to not look like broken crap.
    -: Doesn't give passable results if handed naive grayscale dithering.

    Per-Pixel RGB still holds up OK with naive grayscale dither.
    But, this approach is a lot more particular.

    The RGBI approach seems intermediate; it is more likely to decode
    grayscale patterns as gray.



    I guess a more open question is whether such a thing could be useful
    (it is pretty far down the image-quality scale). But, OTOH, with
    simpler (non-randomized) dither patterns, it can LZ compress OK
    (depending on the image, 0.1 to 0.8 bpp, which is generally JPEG
    territory).

    If combined with delta encoding or similar, it could almost be adapted
    into a very crappy video codec.

    Well, or LZ4, where (at 320x200) one could potentially hold several
    frames of video in a 64K sliding window.

    But, image quality might be unacceptably poor. Also if decoded in
    software, the color-reconstruction is likely to be more computationally expensive than just using a CRAM style codec (while also giving worse
    image quality).


    More just interesting that I was able to get things "almost half-way
    passable" from 1 bpp monochrome.


    ...



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Nov 2 11:36:36 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always do the
    op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).

    Rounding to odd is basically the same as rounding to sticky, i.e. if
    there are any trailing 1 bits in the exact result, then OR a 1 into the
    ulp position.
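
    A minimal sketch of that, assuming an integer mantissa from which the
    low 'drop' bits (1..63) are being discarded (illustration only, not
    the PowerISA implementation):

        #include <stdint.h>

        /* Narrow 'mant' by 'drop' bits using round-to-odd: truncate, and
         * force the result odd if any discarded bit was set (the sticky
         * information lands in the ulp position). */
        static uint64_t round_to_odd(uint64_t mant, unsigned drop)
        {
            uint64_t kept = mant >> drop;
            uint64_t lost = mant & ((1ULL << drop) - 1);
            if (lost != 0)
                kept |= 1;
            return kept;
        }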

    We have known since before the 1978 ieee754 standard that guard+sticky
    (plus sign and ulp) is enough to get the rounding correct in all modes.

    The single exception is when rounding up from the maximum magnitude
    value to inf should be suppressed, there you do in fact need to check
    all the bits.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Nov 2 15:56:12 2025
    From Newsgroup: comp.arch

    On Sun, 2 Nov 2025 11:36:36 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always
    do the op in the next higher precision, then round again down to
    the target, and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit floating point arithmetic, for that very reason (I assume).

    Rounding to odd is basically the same as rounding to sticky, i.e if
    there are any trailing 1 bits in the exact result, then put that in
    the ulp position.

    We have known since before the 1978 ieee754 standard that
    guard+sticky (plus sign and ulp) is enough to get the rounding
    correct in all modes.

    The single exception is when rounding up from the maximum magnitude
    value to inf should be suppressed, there you do in fact need to check
    all the bits.

    Terje


    People use names like guard and sticky bits, and sometimes also a
    rounding bit (e.g. in the Wikipedia article), without explanation, as
    if everybody had agreed about what they mean. But I don't think that
    everybody really agrees.

    Shockingly, the absence of strict definitions applies even to the most
    widely referenced article, David Goldberg's "What Every Computer
    Scientist Should Know About Floating-Point Arithmetic". It seems people
    copy the name of the article from one another, but only a small
    fraction of them have bothered to actually read it.


    --- Synchronet 3.21a-Linux NewsLink 1.2