• Microarch Club

    From George Musk@grgmusk@skiff.com to comp.arch on Thu Mar 21 19:34:50 2024
    From Newsgroup: comp.arch

    Thought this may be interesting:
    https://microarch.club/
    https://www.youtube.com/@MicroarchClub/videos
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Mon Mar 25 16:48:43 2024
    From Newsgroup: comp.arch

    On 3/21/2024 2:34 PM, George Musk wrote:
    Thought this may be interesting:
    https://microarch.club/
    https://www.youtube.com/@MicroarchClub/videos

    At least sort of interesting...

    I guess one of the guys on there did a manycore VLIW architecture with
    the memory local to each of the cores. Seems like an interesting
    approach, though not sure how well it would work on a general purpose workload. This is also closer to what I had imagined when I first
    started working on this stuff, but it had drifted more towards a
    slightly more conventional design.


But, admittedly, this is for small-N core counts; 16/32K of L1 with a shared
L2 seemed like a better option than cores with a very large shared L1
cache.

    I am not sure that abandoning a global address space is such a great
    idea, as a lot of the "merits" can be gained instead by using weak
    coherence models (possibly with a shared 256K or 512K or so for each
    group of 4 cores, at which point it goes out to a higher latency global
    bus). In this case, the division into independent memory regions could
    be done in software.

It is unclear if my approach is "sufficiently minimal". There is more complexity than I would like in my ISA (and effectively turning it into
the common superset of both my original design and RV64G doesn't really
help matters here).

    If going for a more minimal core optimized for perf/area, some stuff
    might be dropped. Would likely drop integer and floating-point divide
    again. Might also make sense to add an architectural zero register, and eliminate some number of encodings which exist merely because of the
    lack of a zero register (though, encodings are comparably cheap, as the internal uArch has a zero register, and effectively treats immediate
    values as a special register as well, ...). Some of the debate is more
    related to the logic cost of dealing with some things in the decoder.

    Though, would likely still make a few decisions differently from those
    in RISC-V. Things like indexed load/store, predicated ops (with a
    designated flag bit), and large-immediate encodings, help enough with performance (relative to cost) to be worth keeping (though, mostly
    because the alternatives are not so good in terms of performance).

    Staying with 3-wide also makes sense, as going to 1-wide or 2-wide will
    not save much if one still wants things like FP-SIMD and unaligned
    memory access (when as-is, all the 3rd lane can really do is basic ALU
    ops and similar; and the savings are negligible if one has a register
    file with 6-ports, which in turn is needed to fully provision stuff for
    the 2-wide cases, ...).

Practically, the manycore idea doesn't go very far on FPGAs: to have
"useful" cores, one can't fit more than a few of them on any of the
"affordable" FPGAs...

    But, maybe other people will have different ideas.

    ...

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Mar 25 22:17:03 2024
    From Newsgroup: comp.arch

    BGB-Alt wrote:

    On 3/21/2024 2:34 PM, George Musk wrote:
    Thought this may be interesting:
    https://microarch.club/
    https://www.youtube.com/@MicroarchClub/videos

    At least sort of interesting...

    I guess one of the guys on there did a manycore VLIW architecture with
    the memory local to each of the cores. Seems like an interesting
    approach, though not sure how well it would work on a general purpose workload. This is also closer to what I had imagined when I first
    started working on this stuff, but it had drifted more towards a
    slightly more conventional design.


    But, admittedly, this is for small-N cores, 16/32K of L1 with a shared
    L2, seemed like a better option than cores with a very large shared L1 cache.

    You appear to be "starting to get it"; congratulations.

    I am not sure that abandoning a global address space is such a great
    idea, as a lot of the "merits" can be gained instead by using weak
    coherence models (possibly with a shared 256K or 512K or so for each
    group of 4 cores, at which point it goes out to a higher latency global bus). In this case, the division into independent memory regions could
    be done in software.

    Most of the last 50 years has been towards a single global address space.

    It is unclear if my approach is "sufficiently minimal". There is more complexity than I would like in my ISA (and effectively turning it into
    the common superset of both my original design and RV64G, doesn't really help matters here).

    If going for a more minimal core optimized for perf/area, some stuff
    might be dropped. Would likely drop integer and floating-point divide

    I think this is pound foolish even if penny wise.

    again. Might also make sense to add an architectural zero register, and eliminate some number of encodings which exist merely because of the
    lack of a zero register (though, encodings are comparably cheap, as the

    I got an effective zero register without having to waste a register
    name to "get it". My 66000 gives you 32 registers of 64-bits each and
    you can put any bit pattern in any register and treat it as you like.
    Accessing #0 takes 1/16 of a 5-bit encoding space, and is universally available.

    internal uArch has a zero register, and effectively treats immediate
    values as a special register as well, ...). Some of the debate is more related to the logic cost of dealing with some things in the decoder.

    The problem is universal constants. RISCs being notably poor in their support--however this is better than addressing modes which require
    µCode.

    Though, would likely still make a few decisions differently from those
    in RISC-V. Things like indexed load/store,

    Absolutely

    predicated ops (with a
    designated flag bit),

    Predicated then and else clauses which are branch free.
    {{Also good for constant time crypto in need of flow control...}}

    and large-immediate encodings,

    Nothing else is so poorly served in typical ISAs.

    help enough with performance (relative to cost)

    +40%

    to be worth keeping (though, mostly
    because the alternatives are not so good in terms of performance).

    Damage to pipeline ability less than -5%.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Mon Mar 25 22:32:13 2024
    From Newsgroup: comp.arch

    On 3/25/2024 5:17 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:

    On 3/21/2024 2:34 PM, George Musk wrote:
    Thought this may be interesting:
    https://microarch.club/
    https://www.youtube.com/@MicroarchClub/videos

    At least sort of interesting...

    I guess one of the guys on there did a manycore VLIW architecture with
    the memory local to each of the cores. Seems like an interesting
    approach, though not sure how well it would work on a general purpose
    workload. This is also closer to what I had imagined when I first
    started working on this stuff, but it had drifted more towards a
    slightly more conventional design.


    But, admittedly, this is for small-N cores, 16/32K of L1 with a shared
    L2, seemed like a better option than cores with a very large shared L1
    cache.

    You appear to be "starting to get it"; congratulations.


    I had experimented with stuff before, and "big L1 caches" seemed to be
    in most regards worse. Hit rate goes into diminishing return territory,
    and timing isn't too happy either.

    At least for my workloads, 32K seemed like the local optimum.

    Say, checking hit rates (in Doom):
    8K: 67%, 16K: 78%,
    32K: 85%, 64K: 87%
    128K: 88%
    This being for a direct-mapped cache configuration with even/odd paired 16-byte cache lines.

    Other programs seem similar.


    For a direct-mapped L1 cache, there is an issue with conflict misses,
    where I was able to add in a small cache to absorb ~ 1-2% that was due
    to conflict misses, which also had the (seemingly more obvious) effect
    of reducing L2 misses (from a direct-mapped L2 cache). Though, it is
    likely that a set-associative L2 cache could have also addressed this
    issue (but likely with a higher cost impact).
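
As a rough C sketch of that arrangement (the sizes, line length, and round-robin replacement are assumptions for illustration, not the actual implementation): a direct-mapped L1 lookup backed by a tiny fully associative victim buffer that absorbs conflict misses:

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_SHIFT   4      /* 16-byte lines (assumed)          */
    #define L1_LINES     2048   /* 32K direct-mapped L1             */
    #define VICTIM_LINES 4      /* tiny buffer for conflict victims */

    typedef struct { uint64_t tag; bool valid; } Line;

    static Line l1[L1_LINES];
    static Line victim[VICTIM_LINES];
    static unsigned victim_next;   /* round-robin replacement */

    /* Returns true on a hit in either the L1 or the victim buffer. */
    static bool cache_access(uint64_t addr)
    {
        uint64_t line = addr >> LINE_SHIFT;
        unsigned idx  = line % L1_LINES;
        uint64_t tag  = line / L1_LINES;

        if (l1[idx].valid && l1[idx].tag == tag)
            return true;                        /* L1 hit */

        for (unsigned i = 0; i < VICTIM_LINES; i++) {
            if (victim[i].valid && victim[i].tag == line) {
                /* conflict miss absorbed: swap with the L1 occupant */
                bool     had = l1[idx].valid;
                uint64_t old = had ? l1[idx].tag * (uint64_t)L1_LINES + idx : 0;
                l1[idx].valid   = true;
                l1[idx].tag     = tag;
                victim[i].valid = had;
                victim[i].tag   = old;
                return true;
            }
        }

        /* true miss: fill L1, demote the old line into the victim buffer */
        if (l1[idx].valid) {
            victim[victim_next].valid = true;
            victim[victim_next].tag   = l1[idx].tag * (uint64_t)L1_LINES + idx;
            victim_next = (victim_next + 1) % VICTIM_LINES;
        }
        l1[idx].valid = true;
        l1[idx].tag   = tag;
        return false;
    }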



    I am not sure that abandoning a global address space is such a great
    idea, as a lot of the "merits" can be gained instead by using weak
    coherence models (possibly with a shared 256K or 512K or so for each
    group of 4 cores, at which point it goes out to a higher latency
    global bus). In this case, the division into independent memory
    regions could be done in software.

    Most of the last 50 years has been towards a single global address space.


    Yeah.

    From what I can gather, the guy in the video had an architecture which
    gives each CPU its own 128K and needs explicit message passing to access outside of this (and faking a global address space in software, at a significant performance penalty). As I see it, this does not seem like
    such a great idea...


    Something like weak coherence can get most of the same savings, with
    much less impact on how one writes code (albeit, it does mean that mutex locking may still be painfully slow).

    But, this does mean it is better to try to approach software in a way
    that neither requires TSO semantics nor frequent mutex locking.


    It is unclear if my approach is "sufficiently minimal". There is more
    complexity than I would like in my ISA (and effectively turning it
    into the common superset of both my original design and RV64G, doesn't
    really help matters here).

    If going for a more minimal core optimized for perf/area, some stuff
    might be dropped. Would likely drop integer and floating-point divide

    I think this is pound foolish even if penny wise.


    The "shift and add" unit isn't free, and the relative gains are small.
    For integer divide, granted, it is faster than the pure software version
    in the general case. For FPU divide, N-R is faster, but shift-add can
    give an exact result. Most other / faster hardware divide strategies
    seem to be more expensive than a shift-and-add unit.
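
For reference, a minimal C sketch of the kind of loop such a unit iterates (the classic restoring shift/subtract long division, one quotient bit per step; a real unit would retire 1-2 bits per clock and trap on divide-by-zero rather than leaving it undefined):

    #include <stdint.h>

    /* Restoring long division: slow but exact, and very little logic. */
    static uint64_t divu64(uint64_t n, uint64_t d, uint64_t *rem_out)
    {
        uint64_t q = 0, rem = 0;
        for (int i = 63; i >= 0; i--) {
            rem = (rem << 1) | ((n >> i) & 1);  /* shift in next dividend bit */
            if (rem >= d) {                     /* trial subtract             */
                rem -= d;
                q |= 1ull << i;
            }
        }
        if (rem_out) *rem_out = rem;
        return q;
    }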


    My confidence in hardware divide isn't too high, noting for example that
    the AMD K10 and Bulldozer/15h had painfully slow divide operations (to
    such a degree that doing it in software was often faster). This implies
    that divide cost/performance is still not really a "solved" issue, even
    if one has the resources to throw at it.

    One can avoid the cost of the shift-and-add unit via "trap and emulate",
    but then the performance is worse.


    Say, "we have an instruction, but it is a boat anchor" isn't an ideal situation (unless to be a placeholder for if/when it is not a boat anchor).


    again. Might also make sense to add an architectural zero register,
    and eliminate some number of encodings which exist merely because of
    the lack of a zero register (though, encodings are comparably cheap,
    as the

    I got an effective zero register without having to waste a register name
    to "get it". My 66000 gives you 32 registers of 64-bits each and you can
    put any bit pattern in any register and treat it as you like.
    Accessing #0 takes 1/16 of a 5-bit encoding space, and is universally available.


    I guess offloading this to the compiler can also make sense.

    Least common denominator would be, say, not providing things like NEG instructions and similar (pretending as-if one had a zero register), and
    if a program needs to do a NEG or similar, it can load 0 into a register itself.

    In the extreme case (say, one also lacks a designated "load immediate" instruction or similar), there is still the "XOR Rn, Rn, Rn" strategy to
    zero a register...


    Say:
    XOR R14, R14, R14 //Designate R14 as pseudo-zero...
    ...
    ADD R14, 0x123, R8 //Load 0x123 into R8

    Though, likely still makes sense in this case to provide some
    "convenience" instructions.


    internal uArch has a zero register, and effectively treats immediate
    values as a special register as well, ...). Some of the debate is more
    related to the logic cost of dealing with some things in the decoder.

    The problem is universal constants. RISCs being notably poor in their support--however this is better than addressing modes which require
    µCode.


    Yeah.

    I ended up with jumbo-prefixes. Still not perfect, and not perfectly orthogonal, but mostly works.

    Allows, say:
    ADD R4, 0x12345678, R6

    To be performed in potentially 1 clock-cycle and with a 64-bit encoding,
    which is better than, say:
    LUI X8, 0x12345
    ADD X8, X8, 0x678
    ADD X12, X10, X8
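
For comparison, the split that the LUI+ADD sequence relies on can be computed as below (a plain C sketch; the bias matters because ADDI sign-extends its 12-bit immediate, though 0x678 above happens not to need it):

    #include <stdint.h>

    /* Split a 32-bit constant into the LUI/ADDI pair RISC-V uses. */
    static void lui_addi_split(uint32_t value, uint32_t *hi20, int32_t *lo12)
    {
        int32_t lo = (int32_t)(value & 0xFFF);
        if (lo & 0x800)
            lo -= 0x1000;                       /* ADDI sign-extends 12 bits */
        *lo12 = lo;
        *hi20 = (value - (uint32_t)lo) >> 12;   /* LUI's 20-bit payload      */
    }

For 0x12345678 this gives hi20=0x12345 and lo12=0x678, matching the sequence above.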


Though, for jumbo-prefixes, did end up adding a special case in the
compiler where it will try to figure out if a constant will be used
multiple times in a basic-block and, if so, will load it into a register rather than use a jumbo-prefix form.


    It could maybe make sense to have function-scale static-assigned
    constants, but have not done so yet.

    Though, it appears as if one of the "top contenders" here would be 0,
    mostly because things like:
    foo->x=0;
    And:
    bar[i]=0;
    Are semi-common, and as-is end up needing to load 0 into a register each
    time they appear.

    Had already ended up with a similar sort of special case to optimize
    "return 0;" and similar, mostly because this was common enough that it
    made more sense to have a special case:
    BRA .lbl_ret //if function does not end with "return 0;"
    .lbl_ret_zero:
    MOV 0, R2
    .lbl_ret:
    ... epilog ...

    For many functions, which allowed "return 0;" to be emitted as:
    BRA .lbl_ret_zero
    Rather than:
    MOV 0, R2
    BRA .lbl_ret
    Which on average ended up as a net-win when there are more than around 3
    of them per function.


    Though, another possibility could be to allow constants to be included
    in the "statically assign variables to registers" logic (as-is, they are excluded except in "tiny leaf" functions).


    Though, would likely still make a few decisions differently from those
    in RISC-V. Things like indexed load/store,

    Absolutely

                                               predicated ops (with a
    designated flag bit),

    Predicated then and else clauses which are branch free.
    {{Also good for constant time crypto in need of flow control...}}


    I have per instruction predication:
    CMPxx ...
    OP?T //if-true
    OP?F //if-false
    Or:
    OP?T | OP?F //both in parallel, subject to encoding and ISA rules

    Performance gains are modest, but still noticeable (part of why
    predication ended up as a core ISA feature). Effect on pipeline seems to
    be small in its current form (it is handled along with register fetch,
    mostly turning non-executed instructions into NOPs during the EX stages).

    For the most part, 1-bit seems sufficient.
    More complex schemes generally ran into issues (had experimented with
    allowing a second predicate bit, or handling predicates as a
    stack-machine, but these ideas were mostly dead on arrival).
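
In source terms, what the 1-bit ?T/?F scheme buys is roughly this (a hedged C illustration, not the compiler's actual output): a short if/else where both arms are cheap becomes straight-line code, with the not-taken arm squashed to NOPs rather than branched over:

    #include <stdint.h>

    /* Branchy form: a mispredict costs a pipeline flush. */
    static int64_t sat_inc_branchy(int64_t x, int64_t limit)
    {
        if (x >= limit)
            return limit;
        else
            return x + 1;
    }

    /* Predicated form: both arms are issued, the CMP result selects which
     * one survives; no branch, no flush. */
    static int64_t sat_inc_predicated(int64_t x, int64_t limit)
    {
        int64_t t = limit;    /* then-arm,  OP?T */
        int64_t e = x + 1;    /* else-arm,  OP?F */
        return (x >= limit) ? t : e;
    }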


                          and large-immediate encodings,

    Nothing else is so poorly served in typical ISAs.


    Probably true.


                                                         help enough with
    performance (relative to cost)

    +40%


    I am mostly seeing around 30% or so, for Doom and similar.
    A few other programs still being closer to break-even at present.


    Things are a bit more contentious in terms of code density:
    With size-minimizing options to GCC:
    ".text" is slightly larger with BGBCC vs GCC (around 11%);
    However, the GCC output has significantly more ".rodata".

    A reasonable chunk of the code-size difference could be attributed to
    jumbo prefixes making the average instruction size slightly bigger.


    More could be possible with more compiler optimization effort.
    Currently, a few recent optimization cases are disabled as they seem to
    be causing bugs that I haven't figured out yet.


                                   to be worth keeping (though, mostly
    because the alternatives are not so good in terms of performance).

    Damage to pipeline ability less than -5%.

    Yeah.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Mar 26 19:16:07 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 3/25/2024 5:17 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Say, "we have an instruction, but it is a boat anchor" isn't an ideal situation (unless to be a placeholder for if/when it is not a boat anchor).

If the boat anchor is a required unit of functionality, and I believe
IDIV and FPDIV are, it should be defined in the ISA, and if you can't afford
it, find some way to trap rapidly so you can fix it up without excessive overhead. Like a MIPS TLB reload. If you can't get trap and emulate at sufficient performance, then add the HW to perform the instruction.

    again. Might also make sense to add an architectural zero register,
    and eliminate some number of encodings which exist merely because of
    the lack of a zero register (though, encodings are comparably cheap,
    as the

    I got an effective zero register without having to waste a register name
    to "get it". My 66000 gives you 32 registers of 64-bits each and you can
    put any bit pattern in any register and treat it as you like.
    Accessing #0 takes 1/16 of a 5-bit encoding space, and is universally
    available.


    I guess offloading this to the compiler can also make sense.

    Least common denominator would be, say, not providing things like NEG instructions and similar (pretending as-if one had a zero register), and
    if a program needs to do a NEG or similar, it can load 0 into a register itself.

    In the extreme case (say, one also lacks a designated "load immediate" instruction or similar), there is still the "XOR Rn, Rn, Rn" strategy to zero a register...

    MOV Rd,#imm16

Costs 1 instruction of 32-bits in size and can be performed in 0 cycles

    Say:
    XOR R14, R14, R14 //Designate R14 as pseudo-zero...
    ...
    ADD R14, 0x123, R8 //Load 0x123 into R8

    Though, likely still makes sense in this case to provide some
    "convenience" instructions.


    internal uArch has a zero register, and effectively treats immediate
    values as a special register as well, ...). Some of the debate is more
    related to the logic cost of dealing with some things in the decoder.

    The problem is universal constants. RISCs being notably poor in their
    support--however this is better than addressing modes which require
    µCode.


    Yeah.

    I ended up with jumbo-prefixes. Still not perfect, and not perfectly orthogonal, but mostly works.

    Allows, say:
    ADD R4, 0x12345678, R6

    To be performed in potentially 1 clock-cycle and with a 64-bit encoding, which is better than, say:
    LUI X8, 0x12345
    ADD X8, X8, 0x678
    ADD X12, X10, X8

    This strategy completely fails when the constant contains more than 32-bits

    FDIV R9,#3.141592653589247,R17

    When you have universal constants (including 5-bit immediates), you rarely
    need a register containing 0.

    Though, for jumbo-prefixes, did end up adding a special case in the
compiler where it will try to figure out if a constant will be used
    multiple times in a basic-block and, if so, will load it into a register rather than use a jumbo-prefix form.

This is a delicate balance:: while each use of the constant takes a
unit or 2 of the instruction stream, each use costs 0 more instructions.
The breakeven point in My 66000 is about 4 uses in a small area (loop),
at which point it should be hoisted into a register.

    It could maybe make sense to have function-scale static-assigned
    constants, but have not done so yet.

    Though, it appears as if one of the "top contenders" here would be 0,
    mostly because things like:
    foo->x=0;
    And:
    bar[i]=0;

    I see no need for a zero register:: the following are 1 instruction !

    ST #0,[Rfoo,offset(x)]

    ST #0,[Rbar,Ri]

    Are semi-common, and as-is end up needing to load 0 into a register each time they appear.

    Had already ended up with a similar sort of special case to optimize
    "return 0;" and similar, mostly because this was common enough that it
    made more sense to have a special case:
    BRA .lbl_ret //if function does not end with "return 0;"
    .lbl_ret_zero:
    MOV 0, R2
    .lbl_ret:
    ... epilog ...

    For many functions, which allowed "return 0;" to be emitted as:
    BRA .lbl_ret_zero
    Rather than:
    MOV 0, R2
    BRA .lbl_ret
    Which on average ended up as a net-win when there are more than around 3
    of them per function.

    Special defined tails......


    Though, another possibility could be to allow constants to be included
    in the "statically assign variables to registers" logic (as-is, they are excluded except in "tiny leaf" functions).


    Though, would likely still make a few decisions differently from those
    in RISC-V. Things like indexed load/store,

    Absolutely

                                               predicated ops (with a
    designated flag bit),

    Predicated then and else clauses which are branch free.
    {{Also good for constant time crypto in need of flow control...}}


    I have per instruction predication:
    CMPxx ...
    OP?T //if-true
    OP?F //if-false
    Or:
    OP?T | OP?F //both in parallel, subject to encoding and ISA rules

    CMP Rt,Ra,#whatever
    PLE Rt,TTTTTEEE
    // This begins the then-clause 5Ts -> 5 instructions
    OP1
    OP2
    OP3
    OP4
    OP5
    // this begins the else-clause 3Es -> 3 instructions
    OP6
    OP7
    OP8
    // we are now back join point.

    Notice no internal flow control instructions.

    Performance gains are modest, but still noticeable (part of why
    predication ended up as a core ISA feature). Effect on pipeline seems to
    be small in its current form (it is handled along with register fetch, mostly turning non-executed instructions into NOPs during the EX stages).

The effect is that one uses Predication whenever you will have already
fetched instructions at the join point by the time you have determined
the predicate value for the {then, else} clauses. The PARSE and DECODE do
the flow control without bothering FETCH.

    For the most part, 1-bit seems sufficient.

    How do you do && and || predication with 1 bit ??

    More complex schemes generally ran into issues (had experimented with allowing a second predicate bit, or handling predicates as a
    stack-machine, but these ideas were mostly dead on arrival).

Also note: the instructions in the then and else clauses know NOTHING
about being under a predicate mask (or not). Thus, they waste no bits
while retaining the ability to run under predication.


and large-immediate encodings,
    Nothing else is so poorly served in typical ISAs.


    Probably true.


                                                         help enough with
    performance (relative to cost)

    +40%


    I am mostly seeing around 30% or so, for Doom and similar.
    A few other programs still being closer to break-even at present.


    Things are a bit more contentious in terms of code density:
    With size-minimizing options to GCC:
    ".text" is slightly larger with BGBCC vs GCC (around 11%);
    However, the GCC output has significantly more ".rodata".

    A lot of this .rodata becomes constants in .text with universal constants.

    A reasonable chunk of the code-size difference could be attributed to
    jumbo prefixes making the average instruction size slightly bigger.

Size is one thing and it primarily diddles in cache footprint statistics. Instruction count is another and primarily diddles in pipeline cycles
to execute statistics.
Fewer instructions win almost all the time.

    More could be possible with more compiler optimization effort.
    Currently, a few recent optimization cases are disabled as they seem to
    be causing bugs that I haven't figured out yet.


                                   to be worth keeping (though, mostly
    because the alternatives are not so good in terms of performance).

    Damage to pipeline ability less than -5%.

    Yeah.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Tue Mar 26 16:59:57 2024
    From Newsgroup: comp.arch

    On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 3/25/2024 5:17 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Say, "we have an instruction, but it is a boat anchor" isn't an ideal
    situation (unless to be a placeholder for if/when it is not a boat
    anchor).

    If the boat anchor is a required unit of functionality, and I believe
    IDIV and FPDIV is, it should be defined in ISA and if you can't afford
    it find some way to trap rapidly so you can fix it up without excessive overhead. Like a MIPS TLB reload. If you can't get trap and emulate at sufficient performance, then add the HW to perform the instruction.


    Though, 32-bit ARM managed OK without integer divide.

In my case, I ended up supporting it mostly for the sake of the RV64 'M' extension, but it is in this case a little faster than a pure software solution (unlike on the K10 and Piledriver).


    Still costs around 1.5 kLUTs though for 64-bit MUL/DIV support, and a
    little more to route FDIV through it.

    Cheapest FPU approach is still the "ADD/SUB/MUL only" route.


    again. Might also make sense to add an architectural zero register,
    and eliminate some number of encodings which exist merely because of
    the lack of a zero register (though, encodings are comparably cheap,
    as the

    I got an effective zero register without having to waste a register
    name to "get it". My 66000 gives you 32 registers of 64-bits each and
    you can put any bit pattern in any register and treat it as you like.
    Accessing #0 takes 1/16 of a 5-bit encoding space, and is universally
    available.


    I guess offloading this to the compiler can also make sense.

    Least common denominator would be, say, not providing things like NEG
    instructions and similar (pretending as-if one had a zero register),
    and if a program needs to do a NEG or similar, it can load 0 into a
    register itself.

    In the extreme case (say, one also lacks a designated "load immediate"
    instruction or similar), there is still the "XOR Rn, Rn, Rn" strategy
    to zero a register...

        MOV   Rd,#imm16

    Cost 1 instruction of 32-bits in size and can be performed in 0 cycles


    Though, RV had skipped this:
    ADD Xd, Zero, Imm12s
    Or:
    LUI Xd, ImmHi20
    ADD Xd, Xd, ImmLo12s

    One can argue for this on the basis of not needing an immediate-load instruction (nor a MOV instruction, nor NEG, nor ...).


    Though, yeah, in my case I ended up with more variety here:
    LDIZ Imm10u, Rn //10-bit, zero-extend, Imm12u (XG2)
    LDIN Imm10n, Rn //10-bit, one-extend, Imm12n (XG2)
    LDIMIZ Imm10u, Rn //Rn=Imm10u<<16 (newish)
    LDIMIN Imm10n, Rn //Rn=Imm10n<<16 (newish)
    LDIHI Imm10u, Rn //Rn=Imm10u<<22
    LDIQHI Imm10u, Rn //Rn=Imm10u<<54

    LDIZ Imm16u, Rn //16-bit, zero-extend
    LDIN Imm16n, Rn //16-bit, one-extend

    Then 64-bit jumbo forms:
    LDI Imm33s, Rn //33-bit, sign-extend
    LDIHI Imm33s, Rn //Rn=Imm33s<<16
    LDIQHI Imm33s, Rn //Rn=Imm33s<<32

    Then, 96 bit:
    LDI Imm64, Rn //64-bit, sign-extend

    And, some special cases:
    FLDCH Imm16u, Rn //Binary16->Binary64

    One could argue though that this is wild extravagance...

    The recent addition of LDIMIx was mostly because otherwise one needed a
    64-bit encoding to load constants like 262144 or similar (and a lot of bit-masks).


    At one point I did evaluate a more ARM32-like approach (effectively
    using a small value and a rotate). But, this cost more than the other
    options (would have required the great evil of effectively being able to
    feed two immediate values into the integer-shift unit, whereas many of
    the others could be routed through logic I already have for other ops).
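
For reference, the ARM32 scheme being compared against encodes an 8-bit value rotated right by an even amount; a small C check for whether a constant fits that form (a sketch, not tied to any particular assembler):

    #include <stdbool.h>
    #include <stdint.h>

    /* True if 'value' is expressible as imm8 ROR (2*n), the classic
     * 32-bit ARM data-processing immediate. */
    static bool arm32_imm_encodable(uint32_t value)
    {
        for (unsigned rot = 0; rot < 32; rot += 2) {
            /* rotate left by 'rot' to undo a right rotation by 'rot' */
            uint32_t undone = (value << rot) | (value >> ((32 - rot) & 31));
            if (undone <= 0xFF)
                return true;
        }
        return false;
    }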


    Though, one can argue that the drawback is that one does end up with
    more instructions in the ISA listing.


    Say:
       XOR R14, R14, R14  //Designate R14 as pseudo-zero...
       ...
       ADD R14, 0x123, R8  //Load 0x123 into R8

    Though, likely still makes sense in this case to provide some
    "convenience" instructions.


    internal uArch has a zero register, and effectively treats immediate
    values as a special register as well, ...). Some of the debate is
    more related to the logic cost of dealing with some things in the
    decoder.

    The problem is universal constants. RISCs being notably poor in their
    support--however this is better than addressing modes which require
    µCode.


    Yeah.

    I ended up with jumbo-prefixes. Still not perfect, and not perfectly
    orthogonal, but mostly works.

    Allows, say:
       ADD R4, 0x12345678, R6

    To be performed in potentially 1 clock-cycle and with a 64-bit
    encoding, which is better than, say:
       LUI X8, 0x12345
       ADD X8, X8, 0x678
       ADD X12, X10, X8

    This strategy completely fails when the constant contains more than 32-bits

        FDIV   R9,#3.141592653589247,R17

    When you have universal constants (including 5-bit immediates), you rarely need a register containing 0.


    The jumbo prefixes at least allow for a 64-bit constant load, but as-is
    not for 64-bit immediate values to 3RI ops. The latter could be done,
    but would require 128-bit fetch and decode, which doesn't seem worth it.

    There is the limbo feature of allowing for 57-bit immediate values, but
    this is optional.


    OTOH, on the RISC-V side, one needs a minimum of 5 instructions (with
    Zbb), or 6 instructions (without Zbb) to encode a 64-bit constant inline.

    Typical GCC response on RV64 seems to be to turn nearly all of the big-constant cases into memory loads, which kinda sucks.

    Even something like a "LI Xd, Imm17s" instruction, would notably reduce
    the number of constants loaded from memory (as GCC seemingly prefers to
    use a LHU or LW or similar rather than encode it using LUI+ADD).


    I experimented with FPU immediate values, generally E3.F2 (Imm5fp) or
    S.E5.F4 (Imm10fp), but the gains didn't seem enough to justify keeping
    them enabled in the CPU core (they involved the non-zero cost of
    repacking them into Binary16 in ID1 and then throwing a
    Binary16->Binary64 converter into the ID2 stage).

    Generally, the "FLDCH Imm16, Rn" instruction works well enough here (and
    can leverage a more generic Binary16->Binary64 converter path).
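
A plain C sketch of the widening that converter path performs (standard IEEE 754 Binary16->Binary64, including subnormals, infinities, and NaNs; illustrative only, not the actual hardware):

    #include <stdint.h>
    #include <string.h>

    static double half_to_double(uint16_t h)
    {
        uint64_t sign = (uint64_t)(h >> 15) << 63;
        int      exp  = (h >> 10) & 0x1F;
        uint64_t frac = h & 0x3FF;
        uint64_t bits;

        if (exp == 0x1F) {                      /* Inf or NaN */
            bits = sign | (0x7FFull << 52) | (frac << 42);
        } else if (exp == 0) {
            if (frac == 0) {                    /* +/- zero */
                bits = sign;
            } else {                            /* subnormal: renormalize */
                int e = -14;
                while (!(frac & 0x400)) { frac <<= 1; e--; }
                frac &= 0x3FF;                  /* drop the implicit 1 */
                bits = sign | ((uint64_t)(e + 1023) << 52) | (frac << 42);
            }
        } else {                                /* normal number */
            bits = sign | ((uint64_t)(exp - 15 + 1023) << 52) | (frac << 42);
        }

        double d;
        memcpy(&d, &bits, sizeof d);
        return d;
    }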


    For FPU compare with zero, can almost leverage the integer compare ops,
    apart from the annoying edge cases of -0.0 and NaN leading to "not
    strictly equivalent" behavior (though, an ASM programmer could more
    easily get away with this). But, not common enough to justify adding FPU specific ops for this.
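
The edge cases in question, as a tiny C demo: raw bit-pattern comparison disagrees with a true FP compare exactly at -0.0 (different bits, compares equal) and NaN (identical bits, never compares equal):

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint64_t bits_of(double d)
    {
        uint64_t b;
        memcpy(&b, &d, sizeof b);
        return b;
    }

    int main(void)
    {
        double nz = -0.0, qn = nan("");

        /* -0.0: bits differ from +0.0, yet the FP compare says equal. */
        printf("bits equal: %d   fp equal: %d\n",
               bits_of(nz) == bits_of(0.0), nz == 0.0);

        /* NaN: bits identical to themselves, yet FP compare says unequal. */
        printf("bits equal: %d   fp equal: %d\n",
               bits_of(qn) == bits_of(qn), qn == qn);
        return 0;
    }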


    Though, for jumbo-prefixes, did end up adding a special case in the
compiler where it will try to figure out if a constant will be used
    multiple times in a basic-block and, if so, will load it into a
    register rather than use a jumbo-prefix form.

    This is a delicate balance:: while each use of the constant takes a
    unit or 2 of the instruction stream, each use cost 0 more instructions.
    The breakeven point in My 66000 is about 4 uses in a small area (loop)
    means that it should be hoisted into a register.


    IIRC, I had set it as x>2, since 2 seemed break-even, 3+ favoring using
    a register, and 1 favoring a constant.

    This did require adding logic to look forward over the current basic
    block to check for additional usage, which does seem a little like a
    hack. But, similar logic had already been used by the register allocator
    to try to prioritize which values should be evicted.
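
A hedged sketch of that look-ahead (the IR record here is made up, just enough to show the shape of the check): scan the remainder of the basic block for further uses of the same immediate, and only materialize it in a register past the threshold:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical IR op: only the fields this check needs. */
    typedef struct {
        bool    has_imm;
        int64_t imm;
    } IrOp;

    /* Count uses of 'imm' in ops[start..end); the caller hoists the
     * constant into a register only when the count exceeds 2. */
    static int count_imm_uses(const IrOp *ops, size_t start, size_t end,
                              int64_t imm)
    {
        int n = 0;
        for (size_t i = start; i < end; i++)
            if (ops[i].has_imm && ops[i].imm == imm)
                n++;
        return n;
    }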


    It could maybe make sense to have function-scale static-assigned
    constants, but have not done so yet.

    Though, it appears as if one of the "top contenders" here would be 0,
    mostly because things like:
       foo->x=0;
    And:
       bar[i]=0;

    I see no need for a zero register:: the following are 1 instruction !

       ST     #0,[Rfoo,offset(x)]

       ST     #0,[Rbar,Ri]


    In my case, there is no direct equivalent in the core ISA.

    But, in retrospect, a zero-register is probably not worth it if one is
    not trying for RISC-V style minimalism in the core ISA (nevermind if
    they seemingly promptly disregard minimalism as soon as one gets outside
    of 'I', but then fail to address the shortcomings of said minimalism in
    'I').


    Are semi-common, and as-is end up needing to load 0 into a register
    each time they appear.

    Had already ended up with a similar sort of special case to optimize
    "return 0;" and similar, mostly because this was common enough that it
    made more sense to have a special case:
       BRA .lbl_ret  //if function does not end with "return 0;"
       .lbl_ret_zero:
       MOV 0, R2
       .lbl_ret:
       ... epilog ...

    For many functions, which allowed "return 0;" to be emitted as:
       BRA .lbl_ret_zero
    Rather than:
       MOV 0, R2
       BRA .lbl_ret
    Which on average ended up as a net-win when there are more than around
    3 of them per function.

    Special defined tails......


    I didn't really go much beyond 0, as supporting more than 0 ends up
    adding cost in terms of needing additional branches.

    But, yeah, 0/1/NULL/etc, seem to be fairly common return values.


    Though, another possibility could be to allow constants to be included
    in the "statically assign variables to registers" logic (as-is, they
    are excluded except in "tiny leaf" functions).


Went and tweaked the compiler rules, so now they are included in the
ranking in normal functions but given a lower priority than the local variables (by roughly "one loop nesting level"). Seemed to help slightly (making it 1 level less appeared to be the local optimum, judging by
binary size).


    Seems that generally 0 still isn't quite common enough to justify having
    one register fewer for variables though (or to have a designated zero register), but otherwise it seems there is not much to justify trying to exclude the "implicit zero" ops from the ISA listing.



    Though, would likely still make a few decisions differently from
    those in RISC-V. Things like indexed load/store,

    Absolutely

                                               predicated ops (with a
    designated flag bit),

    Predicated then and else clauses which are branch free.
    {{Also good for constant time crypto in need of flow control...}}


    I have per instruction predication:
       CMPxx ...
       OP?T  //if-true
       OP?F  //if-false
    Or:
       OP?T | OP?F  //both in parallel, subject to encoding and ISA rules

        CMP  Rt,Ra,#whatever
        PLE  Rt,TTTTTEEE
        // This begins the then-clause 5Ts -> 5 instructions
        OP1
        OP2
        OP3
        OP4
        OP5
        // this begins the else-clause 3Es -> 3 instructions
        OP6
        OP7
        OP8
        // we are now back join point.

    Notice no internal flow control instructions.


    It can be similar in my case, with the ?T / ?F encoding scheme.

    While poking at it, did go and add a check to exclude large struct-copy operations from predication, as it is slower to turn a large struct copy
    into NOPs than to branch over it.

    Did end up leaving struct-copies where sz<=64 as allowed though (where a
    64 byte copy at least has the merit of achieving full pipeline
    saturation and being roughly break-even with a branch-miss, whereas a
    128 byte copy would cost roughly twice as much as a branch miss).


    Performance gains are modest, but still noticeable (part of why
    predication ended up as a core ISA feature). Effect on pipeline seems
    to be small in its current form (it is handled along with register
    fetch, mostly turning non-executed instructions into NOPs during the
    EX stages).

    The effect is that one uses Predication whenever you will have already fetched instructions at the join point by the time you have determined
    the predicate value {then, else} clauses. The PARSE and DECODE do the
    flow control without bothering FETCH.


    Yeah, though in my pipeline, it is still a tradeoff of the relative cost
    of a missed branch, vs the cost of sliding over both the THEN and ELSE branches as a series of NOPs.


    For the most part, 1-bit seems sufficient.

    How do you do && and || predication with 1 bit ??


    Originally, it didn't.
    Now I added some 3R and 3RI CMPxx encodings.

    This allows, say:
    CMPGT R8, R10, R4
    CMPGT R8, R11, R5
    TST R4, R5
    ...

    The stack-machine predicate model could have worked, but would have been
    more of an issue to deal with in the compiler (in addition to the
    non-zero LUT cost, and not really saving any clock-cycles as compared
    with the newer bit-twiddling in registers strategy).

    Compiler does need some additional handling here, to not try this if the expressions could lead to side-effects.
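
In C terms, the transformation being described is roughly the following (a sketch; it is only valid because both conditions are side-effect-free):

    #include <stdbool.h>
    #include <stdint.h>

    /* Short-circuit form: needs a branch or a second predicated compare. */
    static bool in_range_branchy(int64_t x, int64_t lo, int64_t hi)
    {
        return (x > lo) && (x < hi);
    }

    /* Flattened form: both compares run unconditionally, then one
     * TST-style combine of the two boolean results. */
    static bool in_range_flat(int64_t x, int64_t lo, int64_t hi)
    {
        int a = (x > lo);      /* CMPGT -> R4 */
        int b = (hi > x);      /* CMPGT -> R5 */
        return (a & b) != 0;   /* TST R4, R5  */
    }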


    More complex schemes generally ran into issues (had experimented with
    allowing a second predicate bit, or handling predicates as a
    stack-machine, but these ideas were mostly dead on arrival).

Also note: the instructions in the then and else clauses know NOTHING
about being under a predicate mask (or not). Thus, they waste no bits
while retaining the ability to run under predication.


    It is a tradeoff. In my case they cost 2 bits in the encoding, which is
    shared with the WEX scheme.


and large-immediate encodings,
    Nothing else is so poorly served in typical ISAs.


    Probably true.


                                                         help enough
    with performance (relative to cost)

    +40%


    I am mostly seeing around 30% or so, for Doom and similar.
       A few other programs still being closer to break-even at present.


    Things are a bit more contentious in terms of code density:
       With size-minimizing options to GCC:
         ".text" is slightly larger with BGBCC vs GCC (around 11%);
         However, the GCC output has significantly more ".rodata".

    A lot of this .rodata becomes constants in .text with universal constants.


    Yeah.

    A reasonable chunk of the code-size difference could be attributed to
    jumbo prefixes making the average instruction size slightly bigger.

Size is one thing and it primarily diddles in cache footprint statistics. Instruction count is another and primarily diddles in pipeline cycles
to execute statistics. Fewer instructions win almost all the time.


    Yeah, pretty much...


    More could be possible with more compiler optimization effort.
    Currently, a few recent optimization cases are disabled as they seem
    to be causing bugs that I haven't figured out yet.


                                   to be worth keeping (though, mostly
    because the alternatives are not so good in terms of performance).

    Damage to pipeline ability less than -5%.

    Yeah.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Mar 27 00:27:15 2024
    From Newsgroup: comp.arch

    On Tue, 26 Mar 2024 16:59:57 -0500
    BGB-Alt <bohannonindustriesllc@gmail.com> wrote:

    On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 3/25/2024 5:17 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Say, "we have an instruction, but it is a boat anchor" isn't an
    ideal situation (unless to be a placeholder for if/when it is not
    a boat anchor).

    If the boat anchor is a required unit of functionality, and I
    believe IDIV and FPDIV is, it should be defined in ISA and if you
    can't afford it find some way to trap rapidly so you can fix it up
    without excessive overhead. Like a MIPS TLB reload. If you can't
    get trap and emulate at sufficient performance, then add the HW to
    perform the instruction.

    Though, 32-bit ARM managed OK without integer divide.


For slightly less than 20 years ARM managed OK without integer divide.
Then in 2004 they added an integer divide instruction in ARMv7 (including
the ARMv7-M variant intended for small microcontroller cores like
Cortex-M3), and for the following 20 years, instead of merely OK, they have
been doing great :-)







    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Mar 27 00:02:05 2024
    From Newsgroup: comp.arch

    BGB-Alt wrote:

    On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
    BGB wrote:

    I ended up with jumbo-prefixes. Still not perfect, and not perfectly
    orthogonal, but mostly works.

    Allows, say:
       ADD R4, 0x12345678, R6

    To be performed in potentially 1 clock-cycle and with a 64-bit
    encoding, which is better than, say:
       LUI X8, 0x12345
       ADD X8, X8, 0x678
       ADD X12, X10, X8

This strategy completely fails when the constant contains more than 32-bits
        FDIV   R9,#3.141592653589247,R17

When you have universal constants (including 5-bit immediates), you rarely
need a register containing 0.


    The jumbo prefixes at least allow for a 64-bit constant load, but as-is
    not for 64-bit immediate values to 3RI ops. The latter could be done,
    but would require 128-bit fetch and decode, which doesn't seem worth it.

    There is the limbo feature of allowing for 57-bit immediate values, but
    this is optional.


    OTOH, on the RISC-V side, one needs a minimum of 5 instructions (with
    Zbb), or 6 instructions (without Zbb) to encode a 64-bit constant inline.

Which the LLVM compiler for RISC-V does not do; instead it uses an AUIPC
and a LD to get the value from data memory within ±2GB of IP. This takes
3 instructions and 2 words in memory, when universal constants do this in
1 instruction and 2 words in the code stream.

    Typical GCC response on RV64 seems to be to turn nearly all of the big-constant cases into memory loads, which kinda sucks.

This is typical when the underlying architecture is not very extensible
to 64-bit virtual address spaces; they have to waste a portion of the
32-bit space to get access to all the 64-bit space. Universal constants
make this problem vanish.

    Even something like a "LI Xd, Imm17s" instruction, would notably reduce
    the number of constants loaded from memory (as GCC seemingly prefers to
    use a LHU or LW or similar rather than encode it using LUI+ADD).

Reduced when compared to RISC-V, but increased when compared to My 66000.
My 66000 (at the 99% level) uses no instructions to fetch or create
constants, nor does it waste any register (or registers) to hold use-once
constants.

    I experimented with FPU immediate values, generally E3.F2 (Imm5fp) or S.E5.F4 (Imm10fp), but the gains didn't seem enough to justify keeping
    them enabled in the CPU core (they involved the non-zero cost of
    repacking them into Binary16 in ID1 and then throwing a
    Binary16->Binary64 converter into the ID2 stage).

    Generally, the "FLDCH Imm16, Rn" instruction works well enough here (and
    can leverage a more generic Binary16->Binary64 converter path).

    Sometimes I see a::

    CVTSD R2,#5

Where a 5-bit immediate (value = 5) is converted into 5.0D0 and placed in register R2 so it can be accessed as an argument in the subroutine call
to happen in a few instructions.

Mostly, a floating point immediate is available from a 32-bit constant container. When accessed in a float calculation it is used as IEEE32;
when accessed by a double calculation, IEEE32->IEEE64 promotion is
performed in the constant delivery path. So, one can use almost any
floating point constant that is representable in float as a double
without eating cycles and while saving code footprint.

    For FPU compare with zero, can almost leverage the integer compare ops, apart from the annoying edge cases of -0.0 and NaN leading to "not
    strictly equivalent" behavior (though, an ASM programmer could more
    easily get away with this). But, not common enough to justify adding FPU specific ops for this.

    Actually, the edge/noise cases are not that many gates.
    a) once you are separating out NaNs, infinities are free !!
b) once you are checking denorms for zero, infinities become free !!

Having structured a compare-to-zero circuit based on the fields in double,
you can compose the terms to do all signed and unsigned integers and get
a gate count; the number of gates you add to cover all 10 cases of
floating point is then 12% over the simple integer version. Also
note:: this circuit is about 10% of the gate count of an integer adder.

    -----------------------

    Seems that generally 0 still isn't quite common enough to justify having
    one register fewer for variables though (or to have a designated zero register), but otherwise it seems there is not much to justify trying to exclude the "implicit zero" ops from the ISA listing.


It is common enough,
but there are lots of ways to get a zero where you want it for a return.


    Though, would likely still make a few decisions differently from
    those in RISC-V. Things like indexed load/store,

    Absolutely

                                               predicated ops (with a
    designated flag bit),

    Predicated then and else clauses which are branch free.
    {{Also good for constant time crypto in need of flow control...}}


    I have per instruction predication:
       CMPxx ...
       OP?T  //if-true
       OP?F  //if-false
    Or:
       OP?T | OP?F  //both in parallel, subject to encoding and ISA rules

        CMP  Rt,Ra,#whatever
        PLE  Rt,TTTTTEEE
        // This begins the then-clause 5Ts -> 5 instructions
        OP1
        OP2
        OP3
        OP4
        OP5
        // this begins the else-clause 3Es -> 3 instructions
        OP6
        OP7
        OP8
        // we are now back join point.

    Notice no internal flow control instructions.


    It can be similar in my case, with the ?T / ?F encoding scheme.

    Except you eat that/those bits in OpCode encoding.

    While poking at it, did go and add a check to exclude large struct-copy operations from predication, as it is slower to turn a large struct copy into NOPs than to branch over it.

    Did end up leaving struct-copies where sz<=64 as allowed though (where a
    64 byte copy at least has the merit of achieving full pipeline
    saturation and being roughly break-even with a branch-miss, whereas a
    128 byte copy would cost roughly twice as much as a branch miss).

I decided to bite the bullet and have LDM, STM and MM so the compiler does
not have to do any analysis. This puts the onus on the memory unit designer
to process these at least as fast as a series of LDs and STs. Done right
this saves ~40% of the power of the caches by avoiding ~70% of tag accesses
and 90% of TLB accesses. You access the tag only when/after crossing a line boundary and you access the TLB only after crossing a page boundary.

    Performance gains are modest, but still noticeable (part of why
    predication ended up as a core ISA feature). Effect on pipeline seems
    to be small in its current form (it is handled along with register
    fetch, mostly turning non-executed instructions into NOPs during the
    EX stages).

    The effect is that one uses Predication whenever you will have already
    fetched instructions at the join point by the time you have determined
    the predicate value {then, else} clauses. The PARSE and DECODE do the
    flow control without bothering FETCH.


    Yeah, though in my pipeline, it is still a tradeoff of the relative cost
    of a missed branch, vs the cost of sliding over both the THEN and ELSE branches as a series of NOPs.


    For the most part, 1-bit seems sufficient.

    How do you do && and || predication with 1 bit ??


    Originally, it didn't.
    Now I added some 3R and 3RI CMPxx encodings.

    This allows, say:
    CMPGT R8, R10, R4
    CMPGT R8, R11, R5
    TST R4, R5
    ....


    All I had to do was to make the second predication overwrite the first predication's mask, and the compiler did the rest.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Wed Mar 27 13:21:04 2024
    From Newsgroup: comp.arch

    On 3/26/2024 7:02 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:

    On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
    BGB wrote:

    I ended up with jumbo-prefixes. Still not perfect, and not perfectly
    orthogonal, but mostly works.

    Allows, say:
       ADD R4, 0x12345678, R6

    To be performed in potentially 1 clock-cycle and with a 64-bit
    encoding, which is better than, say:
       LUI X8, 0x12345
       ADD X8, X8, 0x678
       ADD X12, X10, X8

    This strategy completely fails when the constant contains more than
    32-bits

         FDIV   R9,#3.141592653589247,R17

    When you have universal constants (including 5-bit immediates), you
    rarely
    need a register containing 0.


    The jumbo prefixes at least allow for a 64-bit constant load, but
    as-is not for 64-bit immediate values to 3RI ops. The latter could be
    done, but would require 128-bit fetch and decode, which doesn't seem
    worth it.

    There is the limbo feature of allowing for 57-bit immediate values,
    but this is optional.


    OTOH, on the RISC-V side, one needs a minimum of 5 instructions (with
    Zbb), or 6 instructions (without Zbb) to encode a 64-bit constant inline.

Which the LLVM compiler for RISC-V does not do; instead it uses an AUIPC
and a LD to get the value from data memory within ±2GB of IP. This takes
3 instructions and 2 words in memory, when universal constants do this in
1 instruction and 2 words in the code stream.


    I was mostly testing with GCC for RV64, but, yeah, it just does memory
    loads.


    Typical GCC response on RV64 seems to be to turn nearly all of the
    big-constant cases into memory loads, which kinda sucks.

    This is typical when the underlying architecture is not very extensible
    to 64-bit virtual address spaces; they have to waste a portion of the
    32-bit space to get access to all the 64-bit space. Universal constants
    makes this problem vanish.


    Yeah.

    It at least seems worthwhile to have non-suck fallback strategies.



    Even something like a "LI Xd, Imm17s" instruction, would notably
    reduce the number of constants loaded from memory (as GCC seemingly
    prefers to use a LHU or LW or similar rather than encode it using
    LUI+ADD).

Reduced when compared to RISC-V, but increased when compared to My 66000.
My 66000 (at the 99% level) uses no instructions to fetch or create constants, nor does it waste any register (or registers) to hold use-once
constants.


    Yeah, but the issue here is mostly with RISC-V and its lack of constant
    load.

    Or burning lots of encoding space (on LUI and AUIPC), but still lacks
    good general-purpose options.

    FWIW, if they had:
    LI Xd, Imm17s
    SHORI Xd, Imm16u //Xd=(Xd<<16)|Imm16u

    Would have still allowed a 32-bit constant in 2 ops, but would have also allowed 64-bit in 4 ops (within the limits of fixed-length 32-bit instructions); while also needing significantly less encoding space
    (could fit both of them into the remaining space in the OP-IMM-32 block).
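
In C terms, the LI+SHORI construction assembles the value 16 bits at a time; a hedged sketch of the semantics (LI and SHORI here are the hypothetical instructions proposed above, not RISC-V as it exists today):

    #include <stdint.h>

    /* Rebuild a 64-bit constant the way LI Xd,Imm17s + 3x SHORI would:
     * sign-extended top chunk first, then shift in each lower 16 bits. */
    static uint64_t build_const_shori(uint64_t value)
    {
        uint64_t x = (uint64_t)(int64_t)(int16_t)(value >> 48);   /* LI    */
        x = (x << 16) | ((value >> 32) & 0xFFFF);                 /* SHORI */
        x = (x << 16) | ((value >> 16) & 0xFFFF);                 /* SHORI */
        x = (x << 16) | (value & 0xFFFF);                         /* SHORI */
        return x;    /* == value for any 64-bit input */
    }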


    I experimented with FPU immediate values, generally E3.F2 (Imm5fp) or
    S.E5.F4 (Imm10fp), but the gains didn't seem enough to justify keeping
    them enabled in the CPU core (they involved the non-zero cost of
    repacking them into Binary16 in ID1 and then throwing a
    Binary16->Binary64 converter into the ID2 stage).

    Generally, the "FLDCH Imm16, Rn" instruction works well enough here
    (and can leverage a more generic Binary16->Binary64 converter path).

    Sometimes I see a::

        CVTSD     R2,#5

Where a 5-bit immediate (value = 5) is converted into 5.0D0 and placed
in register R2 so it can be accessed as an argument in the subroutine call
to happen in a few instructions.


    I had looked into, say:
    FADD Rm, Imm5fp, Rn
    Where, despite Imm5fp being severely limited, it had an OK hit rate.

    Unpacking imm5fp to Binary16 being, essentially:
    aee.fff -> 0.aAAee.fff0000000



    OTOH, can note that a majority of typical floating point constants can
    be represented exactly in Binary16 (well, excluding "0.1" or similar),
    so it works OK as an immediate format.

    This allows a single 32-bit op to be used for constant loads (nevermind
    if one needs a 96 bit encoding for 0.1, or PI, or ...).


    IIRC, I originally had it in the CONV2 path, which would give it a
    2-cycle latency and only allow it in Lane 1.

    I later migrated the logic to the "MOV_IR" path which also deals with "Rn=(Rn<<16)|Imm" and similar, and currently allows 1-cycle Binary16 immediate-loads in all 3 lanes.

    Though, BGBCC still assumes it is Lane-1 only unless the FPU Immediate extension is enabled (as with the other FP converters).


Mostly, a floating point immediate is available from a 32-bit constant container. When accessed in a float calculation it is used as IEEE32;
when accessed by a double calculation, IEEE32->IEEE64 promotion is
performed in the constant delivery path. So, one can use almost any
floating point constant that is representable in float as a double
without eating cycles and while saving code footprint.


    Don't currently have the encoding space for this.

Could in theory pull off a truncated Binary32 as an Imm29s form, but not
likely worth it. Would also require putting a converter in the ID2
stage, so not free.

    In this case, the issue is more one of LUT cost to support these cases.


    For FPU compare with zero, can almost leverage the integer compare
    ops, apart from the annoying edge cases of -0.0 and NaN leading to
    "not strictly equivalent" behavior (though, an ASM programmer could
    more easily get away with this). But, not common enough to justify
    adding FPU specific ops for this.

    Actually, the edge/noise cases are not that many gates.
    a) once you are separating out NaNs, infinities are free !!
b) once you are checking denorms for zero, infinities become free !!

Having structured a compare-to-zero circuit based on the fields in double,
you can compose the terms to do all signed and unsigned integers and get
a gate count; the number of gates you add to cover all 10 cases of
floating point is then 12% over the simple integer version. Also
note:: this circuit is about 10% of the gate count of an integer adder.


    I could add them, but, is it worth it?...

    In this case, it is more a question of encoding space than logic cost.

    It is semi-common in FP terms, but likely not common enough to justify dedicated compare-and-branch ops and similar (vs the minor annoyance at
    the integer ops not quite working correctly due to edge cases).


    -----------------------

    Seems that generally 0 still isn't quite common enough to justify
    having one register fewer for variables though (or to have a
    designated zero register), but otherwise it seems there is not much to
    justify trying to exclude the "implicit zero" ops from the ISA listing.


    It is common enough,
    But there are lots of ways to get a zero where you want it for a return.


    I think the main use case for a zero register is mostly that it allows
    using it as a special case for pseudo-ops. I guess, not quite the same
    if it is a normal GPR that just so happens to be 0.

    Recently ended up fixing a bug where:
    y=-x;
    Was misbehaving with "unsigned int":
    "NEG" produces a value which falls outside of UInt range;
    But, "NEG; EXTU.L" is a 2-op sequence.
    It had the EXT for SB/UB/SW/UW, but not for UL.
    For SL, bare NEG almost works, apart from ((-1)<<31).
    Could encode it as:
    SUBU.L Zero, Rs, Rn
    But, without a zero register, the compiler needs to special-case
    provision this (or, in theory, add a "NEGU.L" instruction, but doesn't
    seem common enough to justify this).
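
    A small C example of the edge case in question, assuming 32-bit unsigned
    values are kept zero-extended in 64-bit registers (which is what makes
    the bare NEG insufficient):

    #include <stdint.h>

    /* x holds a uint32 value zero-extended into a 64-bit register.
       A bare 64-bit NEG leaves all-ones in the upper half, so the
       result must be re-zero-extended: the "NEG; EXTU.L" pair.
       A SUBU.L from a zero source would do both steps in one op. */
    uint64_t neg_u32_zext(uint64_t x)
    {
        uint64_t neg = 0 - x;           /* NEG: upper 32 bits now set  */
        return neg & 0xFFFFFFFFu;       /* EXTU.L: back in UInt range  */
    }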

    ...



    Though, would likely still make a few decisions differently from
    those in RISC-V. Things like indexed load/store,

    Absolutely

                                               predicated ops (with a
    designated flag bit),

    Predicated then and else clauses which are branch free.
    {{Also good for constant time crypto in need of flow control...}}


    I have per instruction predication:
       CMPxx ...
       OP?T  //if-true
       OP?F  //if-false
    Or:
       OP?T | OP?F  //both in parallel, subject to encoding and ISA rules
         CMP  Rt,Ra,#whatever
         PLE  Rt,TTTTTEEE
         // This begins the then-clause 5Ts -> 5 instructions
         OP1
         OP2
         OP3
         OP4
         OP5
         // this begins the else-clause 3Es -> 3 instructions
         OP6
         OP7
         OP8
         // we are now back join point.

    Notice no internal flow control instructions.


    It can be similar in my case, with the ?T / ?F encoding scheme.

    Except you eat that/those bits in OpCode encoding.


    It is less bad than 32-bit ARM, where I only burnt 2 bits, rather than 4.

    Also seems like a reasonable tradeoff, as the 2 bits effectively gain:
    Per-instruction predication;
    WEX / Bundle encoding;
    Jumbo prefixes;
    ...

    But, maybe otherwise could have justified slightly bigger immediate
    fields, dunno.


    While poking at it, did go and add a check to exclude large
    struct-copy operations from predication, as it is slower to turn a
    large struct copy into NOPs than to branch over it.

    Did end up leaving struct-copies where sz<=64 as allowed though (where
    a 64 byte copy at least has the merit of achieving full pipeline
    saturation and being roughly break-even with a branch-miss, whereas a
    128 byte copy would cost roughly twice as much as a branch miss).

    I decided to bite the bullet and have LDM, STM and MM so the compiler does
    not have to do any analysis. This puts the onus on the memory unit designer
    to process these at least as fast as a series of LDs and STs. Done right,
    this saves ~40% of the power of the caches, avoiding ~70% of tag accesses
    and 90% of TLB accesses. You access the tag only when/after crossing a
    line boundary and you access the TLB only after crossing a page boundary.
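
    A rough C model of the boundary-crossing point (illustrative only, not
    the actual memory-unit logic): during a multi-register transfer, charge
    a tag lookup only when the running address crosses a cache-line boundary
    and a TLB lookup only when it crosses a page boundary:

    #include <stdint.h>

    #define LINE_BYTES 64
    #define PAGE_BYTES 4096

    /* Count tag/TLB lookups for an LDM-style burst of 'nregs'
       consecutive 64-bit loads starting at virtual address 'va'. */
    void ldm_lookup_counts(uint64_t va, int nregs,
                           int *tag_lookups, int *tlb_lookups)
    {
        *tag_lookups = 1;                 /* first access always probes */
        *tlb_lookups = 1;
        for (int i = 1; i < nregs; i++) {
            uint64_t prev = va + 8u * (uint64_t)(i - 1);
            uint64_t cur  = va + 8u * (uint64_t)i;
            if (prev / LINE_BYTES != cur / LINE_BYTES)
                (*tag_lookups)++;         /* crossed a cache line       */
            if (prev / PAGE_BYTES != cur / PAGE_BYTES)
                (*tlb_lookups)++;         /* crossed a page             */
        }
    }

    For an aligned 8-register LDM this gives 1 tag probe instead of 8, which
    is the flavor of the savings being claimed.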

    OK.

    In my case, it was more a case of noting that sliding over, say, 1kB
    worth of memory loads/stores, is slower than branching around it.

    Previously this case was not checked.
    Had also changed some of the rules to allow FPU stuff to be predicated,
    since the current form of the ISA has no issue predicating FPU ops
    (IIRC, in an early form of the ISA, the FPU ops could not be predicated).

    Do still need to exclude any cases which may result in a function call
    though, ...


    Performance gains are modest, but still noticeable (part of why
    predication ended up as a core ISA feature). Effect on pipeline
    seems to be small in its current form (it is handled along with
    register fetch, mostly turning non-executed instructions into NOPs
    during the EX stages).

    The effect is that one uses Predication whenever you will have already
    fetched the instructions at the join point by the time you have determined
    the predicate value for the {then, else} clauses. The PARSE and DECODE do
    the flow control without bothering FETCH.


    Yeah, though in my pipeline, it is still a tradeoff of the relative
    cost of a missed branch, vs the cost of sliding over both the THEN and
    ELSE branches as a series of NOPs.


    For the most part, 1-bit seems sufficient.

    How do you do && and || predication with 1 bit ??


    Originally, it didn't.
    Now I added some 3R and 3RI CMPxx encodings.

    This allows, say:
    CMPGT R8, R10, R4
    CMPGT R8, R11, R5
    TST R4, R5
    ....
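
    The kind of source this is aimed at, roughly (hypothetical example): both
    compares are materialized into registers and combined, so the whole
    condition can feed a single predicate bit without internal branches:

    #include <stdint.h>

    /* Roughly: CMPGT -> R4, CMPGT -> R5, then TST R4,R5 sets the predicate. */
    int both_greater(int64_t x, int64_t a, int64_t b)
    {
        int c0 = (x > a);      /* CMPGT  R8, R10, R4 */
        int c1 = (x > b);      /* CMPGT  R8, R11, R5 */
        return c0 & c1;        /* TST    R4, R5      */
    }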


    All I had to do was to make the second predication overwrite the first predication's mask, and the compiler did the rest.

    Not so simple in my case, but the hardware is simpler, since it just
    cares about the state of 1 bit (which is explicitly saved/restored along
    with the rest of the status-register if an interrupt occurs).

    Various cases get kind of annoying in the compiler, as it needs to
    figure out to some extent what the conditional is doing, and how to
    translate this through the IR stages.

    ...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Wed Mar 27 13:32:56 2024
    From Newsgroup: comp.arch

    On 3/26/2024 5:27 PM, Michael S wrote:
    On Tue, 26 Mar 2024 16:59:57 -0500
    BGB-Alt <bohannonindustriesllc@gmail.com> wrote:

    On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 3/25/2024 5:17 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Say, "we have an instruction, but it is a boat anchor" isn't an
    ideal situation (unless to be a placeholder for if/when it is not
    a boat anchor).

    If the boat anchor is a required unit of functionality, and I
    believe IDIV and FPDIV is, it should be defined in ISA and if you
    can't afford it find some way to trap rapidly so you can fix it up
    without excessive overhead. Like a MIPS TLB reload. If you can't
    get trap and emulate at sufficient performance, then add the HW to
    perform the instruction.

    Though, 32-bit ARM managed OK without integer divide.


    For slightly less than 20 years ARM managed OK without integer divide.
    Then in 2004 they added an integer divide instruction in ARMv7 (including
    the ARMv7-M variant intended for small microcontroller cores like
    Cortex-M3) and for the following 20 years, instead of merely OK, they are
    doing great :-)


    OK.

    I think both modern ARM and AMD Zen went over to "actually fast" integer divide.

    I think for a long time, the de-facto integer divide was ~ 36-40 cycles
    for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I
    can get from a shift-add unit.


    On my BJX2 core, it is currently similar (36 and 68 cycle for divide).
    This works out faster than a generic shift-subtract divider (or using a runtime call which then sorts out what to do).

    A special case allows turning small divisors internally into divide-by-reciprocal, which allows for a 3-cycle divide special case.
    But, this is a LUT cost tradeoff.

    It could be possible in theory to support a general 3-cycle integer
    divide, albeit if one can accept inexact results (would be faster than
    the software-based lookup table strategy).
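
    As a rough sketch of the general reciprocal-table idea (not the actual
    BJX2 mechanism; the table size, shift, and ranges here are chosen only
    for illustration): for divisors 1..256 and 32-bit dividends,
    q = (n * ceil(2^40/d)) >> 40 is exact, so a small table turns these
    divides into a multiply plus a shift:

    #include <stdint.h>

    static uint64_t recip40[257];

    static void init_recip(void)
    {
        for (uint64_t d = 1; d <= 256; d++)
            recip40[d] = (UINT64_C(1) << 40) / d +
                         ((UINT64_C(1) << 40) % d != 0);   /* ceil(2^40/d) */
    }

    /* Exact n/d for 0 <= n < 2^32, 1 <= d <= 256 (GCC/Clang __int128). */
    static uint32_t div_small(uint32_t n, uint32_t d)
    {
        unsigned __int128 p = (unsigned __int128)n * recip40[d];
        return (uint32_t)(p >> 40);
    }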


    But, it is debatable. Pure minimalism would likely favor leaving out
    divide (and a bunch of other stuff). Usual rationale being, say, to try
    to fit the entire ISA listing on a single page of paper or similar (vs
    having a listing with several hundred defined encodings).


    Nevermind if the commonly used ISAs (x86 and 64-bit ARM) have ISA
    listings that are considerably larger (thousands of encodings).

    ...

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Mar 27 21:03:52 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 3/26/2024 7:02 PM, MitchAlsup1 wrote:

    Sometimes I see a::

        CVTSD     R2,#5

    Where a 5-bit immediate (value = 5) is converted into 5.0D0 and placed
    in register R2 so it can be accessed as an argument in the subroutine call
    that happens a few instructions later.


    I had looked into, say:
    FADD Rm, Imm5fp, Rn
    Where, despite Imm5fp being severely limited, it had an OK hit rate.

    Unpacking imm5fp to Binary16 being, essentially:
    aee.fff -> 0.aAAee.fff0000000

    realistically ±{0, 1, 2, 3, 4, 5, .., 31} only misses a few of the often
    used fp constants--but does include 0, 1, 2, and 10. Also, realistically
    the missing cases are the 0.5s.

    OTOH, can note that a majority of typical floating point constants can
    be represented exactly in Binary16 (well, excluding "0.1" or similar),
    so it works OK as an immediate format.

    This allows a single 32-bit op to be used for constant loads (nevermind
    if one needs a 96 bit encoding for 0.1, or PI, or ...).


    Mostly, a floating point immediate is available from a 32-bit constant
    container. When accessed in a float calculation it is used as IEEE32;
    when accessed by a double calculation, IEEE32->IEEE64 promotion is
    performed in the constant delivery path. So, one can use almost any
    floating point constant that is representable in float as a double
    without eating cycles and while saving code footprint.


    Don't currently have the encoding space for this.

    Could in theory pull off truncated Binary32 as an Imm29s form, but not
    likely worth it. Would also require putting a converter in the ID2
    stage, so not free.

    In this case, the issue is more one of LUT cost to support these cases.


    For FPU compare with zero, can almost leverage the integer compare
    ops, apart from the annoying edge cases of -0.0 and NaN leading to
    "not strictly equivalent" behavior (though, an ASM programmer could
    more easily get away with this). But, not common enough to justify
    adding FPU specific ops for this.

    Actually, the edge/noise cases are not that many gates.
    a) once you are separating out NaNs, infinities are free !!
    b) once you are checking denorms for zero, infinities become free !!

    Having structured a Compare-to-zero circuit based on the fields in double;
    You can compose the terms to do all signed and unsigned integers and get
    a gate count, then the number of gates you add to cover all 10 cases of
    floating point is 12% gate count over the simple integer version. Also
    note:: this circuit is about 10% of the gate count of an integer adder.


    I could add them, but, is it worth it?...

    Whether to add them or not is on you.
    I found things like this to be more straw on the camel's back
    {where the camel collapses to a unified register file model.}

    In this case, it is more a question of encoding space than logic cost.

    It is semi-common in FP terms, but likely not common enough to justify dedicated compare-and-branch ops and similar (vs the minor annoyance at
    the integer ops not quite working correctly due to edge cases).

    My model requires about ½ the instruction count when processing FP
    comparisons compared to RISC-V {big, no; around 5% in FP code and 0
    elsewhere}. Where it wins big is compare against a non-zero FP constant.
    My 66000 uses 2 instructions {FCMP, BB} whereas RISC-V uses 4 {AUIPC,
    LD, FCMP, BC}.


    -----------------------

    Seems that generally 0 still isn't quite common enough to justify
    having one register fewer for variables though (or to have a
    designated zero register), but otherwise it seems there is not much to
    justify trying to exclude the "implicit zero" ops from the ISA listing.


    It is common enough,
    But there are lots of ways to get a zero where you want it for a return.


    I think the main use case for a zero register is mostly that it allows
    using it as a special case for pseudo-ops. I guess, not quite the same
    if it is a normal GPR that just so happens to be 0.

    Recently ended up fixing a bug where:
    y=-x;
    Was misbehaving with "unsigned int":
    "NEG" produces a value which falls outside of UInt range;
    But, "NEG; EXTU.L" is a 2-op sequence.
    It had the EXT for SB/UB/SW/UW, but not for UL.
    For SL, bare NEG almost works, apart from ((-1)<<31).
    Could encode it as:
    SUBU.L Zero, Rs, Rn

    ADD Rn,#0,-Rs

    But notice::
    y = -x;
    a = b + y;
    can be performed as if it had been written::
    y = -x;
    a = b + (-x);
    Which is encoded as::

    ADD Ry,#0,-Rx
    ADD Ra,Rb,-Rx

    But, without a zero register,

    #0 is not a register, but its value is 0x0000000000000000 anyway.

    You missed the point entirely, if you can get easy access to #0
    then you no longer need a register to hold this simple bit pattern.
    In fact a large portion of My 66000 ISA over RISC-V comes from this
    mechanism.

    the compiler needs to special-case

    The compiler needs easy access to #0 and the compiler needs to know
    that #0 exists, but the compiler does not need to know if some register contains that same bit pattern.

    provision this (or, in theory, add a "NEGU.L" instruction, but doesn't
    seem common enough to justify this).

    ....

    It is less bad than 32-bit ARM, where I only burnt 2 bits, rather than 4.

    I burned 0 per instruction, but you can claim I burned 1 instruction PRED and 6.4 bits of that instruction are used to create masks that project upon up to
    8 following instructions.

    Also seems like a reasonable tradeoff, as the 2 bits effectively gain:
    Per-instruction predication;
    WEX / Bundle encoding;
    Jumbo prefixes;
    ...

    But, maybe otherwise could have justified slightly bigger immediate
    fields, dunno.


    While poking at it, did go and add a check to exclude large
    struct-copy operations from predication, as it is slower to turn a
    large struct copy into NOPs than to branch over it.

    Did end up leaving struct-copies where sz<=64 as allowed though (where
    a 64 byte copy at least has the merit of achieving full pipeline
    saturation and being roughly break-even with a branch-miss, whereas a
    128 byte copy would cost roughly twice as much as a branch miss).

    I decided to bite the bullet and have LDM, STM and MM so the compiler does
    not have to do any analysis. This puts the onus on the memory unit designer
    to process these at least as fast as a series of LDs and STs. Done right,
    this saves ~40% of the power of the caches, avoiding ~70% of tag accesses
    and 90% of TLB accesses. You access the tag only when/after crossing a
    line boundary and you access the TLB only after crossing a page boundary.

    OK.

    In my case, it was more a case of noting that sliding over, say, 1kB
    worth of memory loads/stores, is slower than branching around it.

    This is why My 66000 predication has use limits. Once you can get where
    you want faster with a branch, then a branch is what you should use.
    I reasoned that my 1-wide machine would fetch 16-bytes (4 words) per
    cycle and that the minimum DECODE time is 2 cycles, that Predication
    wins when the number of instructions <= FWidth × Dcycles = 8.
    Use predication and save cycles by not disrupting the front end.
    Use branching and save cycles by disrupting the front end.


    All I had to do was to make the second predication overwrite the first
    predication's mask, and the compiler did the rest.

    Not so simple in my case, but the hardware is simpler, since it just
    cares about the state of 1 bit (which is explicitly saved/restored along with the rest of the status-register if an interrupt occurs).

    Simpler than 8-flip flops used as a shift right register ??
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Mar 27 21:14:01 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 3/26/2024 5:27 PM, Michael S wrote:


    For slightly less then 20 years ARM managed OK without integer divide.
    Then in 2004 they added integer divide instruction in ARMv7 (including
    ARMv7-M variant intended for small microcontroller cores like
    Cortex-M3) and for the following 20 years instead of merely OK they are
    doing great :-)


    OK.

    The point is they are doing better now after adding IDIV and FDIV.

    I think both modern ARM and AMD Zen went over to "actually fast" integer divide.

    I think for a long time, the de-facto integer divide was ~ 36-40 cycles
    for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I
    can get from a shift-add unit.

    While those numbers are acceptable for shift-subtract division (including
    SRT variants).

    What I don't get is the reluctance to use the FP multiplier as a fast
    divider (IBM 360/91). AMD Opteron used this means to achieve 17-cycle
    FDIV and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ??
    and with special casing of leading 1s and 0s average around 10-cycles ???

    I submit that at 10-cycles for average latency, the need to invent screwy
    forms of even faster division fall by the wayside {accurate or not}.

    NOTE well:: The size of the FMUL (or FMAC) unit does increase, but its
    increase is less than that of an SRT divider unit.

    On my BJX2 core, it is currently similar (36 and 68 cycle for divide).
    This works out faster than a generic shift-subtract divider (or using a runtime call which then sorts out what to do).

    This is because you are using a linear iteration; try using a quadratically
    convergent iteration instead. OH but you CAN'T because your multiplier
    tree does not give accurate lower order bits.

    A special case allows turning small divisors internally into divide-by-reciprocal, which allows for a 3-cycle divide special case.
    But, this is a LUT cost tradeoff.

    It could be possible in theory to support a general 3-cycle integer
    divide, albeit if one can accept inexact results (would be faster than
    the software-based lookup table strategy).


    But, it is debatable. Pure minimalism would likely favor leaving out
    divide (and a bunch of other stuff). Usual rationale being, say, to try
    to fit the entire ISA listing on a single page of paper or similar (vs having a listing with several hundred defined encodings).


    Nevermind if the commonly used ISAs (x86 and 64-bit ARM) have ISA
    listings that are considerably larger (thousands of encodings).

    ....
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Mar 27 21:53:36 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 3/26/2024 5:27 PM, Michael S wrote:


    For slightly less then 20 years ARM managed OK without integer divide.
    Then in 2004 they added integer divide instruction in ARMv7 (including
    ARMv7-M variant intended for small microcontroller cores like
    Cortex-M3) and for the following 20 years instead of merely OK they are
    doing great :-)


    OK.

    The point is they are doing better now after adding IDIV and FDIV.

    I think both modern ARM and AMD Zen went over to "actually fast" integer
    divide.

    I think for a long time, the de-facto integer divide was ~ 36-40 cycles
    for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I
    can get from a shift-add unit.

    While those numbers are acceptable for shift-subtract division (including
    SRT variants).

    What I don't get is the reluctance for using the FP multiplier as a fast >divisor (IBM 360/91). AMD Opteron used this means to achieve 17-cycle
    FDIS and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ??
    and with special casing of leading 1s and 0s average around 10-cycles ???

    Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2] cycles
    (where s is the number of significant digits in the quotient).

    https://www.quinapalus.com/cm7cycles.html


    I submit that at 10-cycles for average latency, the need to invent screwy >forms of even faster division fall by the wayside {accurate or not}.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Mar 28 01:06:05 2024
    From Newsgroup: comp.arch

    On Wed, 27 Mar 2024 21:14:01 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    What I don't get is the reluctance for using the FP multiplier as a
    fast divisor (IBM 360/91). AMD Opteron used this means to achieve
    17-cycle FDIS and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ?? and with special casing of leading 1s and 0s average
    around 10-cycles ???

    I submit that at 10-cycles for average latency, the need to invent
    screwy forms of even faster division fall by the wayside {accurate or
    not}.


    All modern performance-oriented cores from Intel, AMD, ARM and Apple
    have fast integer dividers and typically even faster FP dividers.
    The last "big" cores with relatively slow 64-bit IDIV were Intel Skylake (launched in 2015) and AMD Zen2 (launched in 2019), but the later is
    slow only in the worst case, the best case is o.k.
    I'd guess, when Skylake was designed nobody at Intel could imagine that
    it and its variations would be manufactured in huge volumes up to 2021.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Mar 27 23:11:34 2024
    From Newsgroup: comp.arch

    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 3/26/2024 5:27 PM, Michael S wrote:


    For slightly less then 20 years ARM managed OK without integer divide. >>>> Then in 2004 they added integer divide instruction in ARMv7 (including >>>> ARMv7-M variant intended for small microcontroller cores like
    Cortex-M3) and for the following 20 years instead of merely OK they are >>>> doing great :-)


    OK.

    The point is they are doing better now after adding IDIV and FDIV.

    I think both modern ARM and AMD Zen went over to "actually fast" integer >>> divide.

    I think for a long time, the de-facto integer divide was ~ 36-40 cycles >>> for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I >>> can get from a shift-add unit.

    While those numbers are acceptable for shift-subtract division (including >>SRT variants).

    What I don't get is the reluctance for using the FP multiplier as a fast >>divisor (IBM 360/91). AMD Opteron used this means to achieve 17-cycle
    FDIS and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ?? >>and with special casing of leading 1s and 0s average around 10-cycles ???

    Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2] cycles
    (where s is the number of significant digits in the quotient).

    I submit that a 5+2×ln8(s) is faster still.
    32-bits = 15 cycles <not so much faster>
    64-bits = 17 cycles <A lot faster>

    {Log base 8, where one uses Newton-Raphson or Goldschmidt to get 8 significant digits (9.2 bits are correct) and double the significant bits each iteration (2-cycles). }

    5 comes from looking at numerator and denominator to find the first bit of significance, and then shifting numerator and denominator so that the FDIV algorithm can work.
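
    A small software model of the quadratically convergent iteration being
    described (this only models the math, not any particular divider
    datapath; the 8-bit seed stands in for a small hardware table):

    #include <math.h>
    #include <stdio.h>

    /* Seed reciprocal good to ~8 significant bits (a table in hardware). */
    static double recip_seed(double d)
    {
        int e;
        double m = frexp(d, &e);              /* d = m * 2^e, m in [0.5,1) */
        double r = 256.0 / round(m * 256.0);  /* ~8-bit-accurate 1/m       */
        return ldexp(r, -e);
    }

    int main(void)
    {
        double d = 12345.678;
        double x = recip_seed(d);
        for (int i = 0; i < 3; i++) {
            x = x * (2.0 - d * x);            /* Newton-Raphson step       */
            printf("step %d: rel. error = %.3e\n", i + 1, fabs(x * d - 1.0));
        }
        return 0;
    }

    Each step costs two multiplies and roughly doubles the number of correct
    bits.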

    https://www.quinapalus.com/cm7cycles.html


    I submit that at 10-cycles for average latency, the need to invent screwy >>forms of even faster division fall by the wayside {accurate or not}.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Mar 28 09:31:11 2024
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 3/26/2024 5:27 PM, Michael S wrote:


    For slightly less then 20 years ARM managed OK without integer divide. >>>> Then in 2004 they added integer divide instruction in ARMv7 (including >>>> ARMv7-M variant intended for small microcontroller cores like
    Cortex-M3) and for the following 20 years instead of merely OK they are >>>> doing great :-)


    OK.

    The point is they are doing better now after adding IDIV and FDIV.

    I think both modern ARM and AMD Zen went over to "actually fast" integer >>> divide.

    I think for a long time, the de-facto integer divide was ~ 36-40 cycles
    for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I >>> can get from a shift-add unit.

    While those numbers are acceptable for shift-subtract division (including
    SRT variants).

    What I don't get is the reluctance for using the FP multiplier as a fast
    divisor (IBM 360/91). AMD Opteron used this means to achieve 17-cycle
    FDIS and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ??
    and with special casing of leading 1s and 0s average around 10-cycles ???

    Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2] cycles
    (where s is the number of significant digits in the quotient).

    https://www.quinapalus.com/cm7cycles.html

    That looks a lot like an SRT divisor with early out?

    Having variable timing DIV means that for any crypto operation (including
    hashes?) where you use modulo operations, said modulus _must_ be a known
    constant, otherwise information about it will leak from the timings, right?


    I submit that at 10-cycles for average latency, the need to invent screwy
    forms of even faster division fall by the wayside {accurate or not}.

    I agree.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Mar 28 14:03:18 2024
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 3/26/2024 5:27 PM, Michael S wrote:


    For slightly less then 20 years ARM managed OK without integer divide. >>>>> Then in 2004 they added integer divide instruction in ARMv7 (including >>>>> ARMv7-M variant intended for small microcontroller cores like
    Cortex-M3) and for the following 20 years instead of merely OK they are >>>>> doing great :-)


    OK.

    The point is they are doing better now after adding IDIV and FDIV.

    I think both modern ARM and AMD Zen went over to "actually fast" integer >>>> divide.

    I think for a long time, the de-facto integer divide was ~ 36-40 cycles >>>> for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I >>>> can get from a shift-add unit.

    While those numbers are acceptable for shift-subtract division (including >>> SRT variants).

    What I don't get is the reluctance for using the FP multiplier as a fast >>> divisor (IBM 360/91). AMD Opteron used this means to achieve 17-cycle
    FDIS and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ?? >>> and with special casing of leading 1s and 0s average around 10-cycles ??? >>
    Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2] cycles
    (where s is the number of significant digits in the quotient).

    https://www.quinapalus.com/cm7cycles.html

    That looks a lot like an SRT divisor with early out?

    Having variable timing DIV means that any crypto operating (including >hashes?) where you use modulo operations, said modulus _must_ be a known >constant, otherwise information about will leak from the timings, right?

    Perhaps, yet I suspect that the m7 isn't generally used for crypto,
    nor used in an SMP implementation where an observer can monitor
    a shared cache.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Mar 28 22:38:41 2024
    From Newsgroup: comp.arch

    On Thu, 28 Mar 2024 09:31:11 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 3/26/2024 5:27 PM, Michael S wrote:


    For slightly less then 20 years ARM managed OK without integer
    divide. Then in 2004 they added integer divide instruction in
    ARMv7 (including ARMv7-M variant intended for small
    microcontroller cores like Cortex-M3) and for the following 20
    years instead of merely OK they are doing great :-)


    OK.

    The point is they are doing better now after adding IDIV and FDIV.

    I think both modern ARM and AMD Zen went over to "actually fast"
    integer divide.

    I think for a long time, the de-facto integer divide was ~ 36-40
    cycles for 32-bit, and 68-72 cycles for 64-bit. This is also
    on-par with what I can get from a shift-add unit.

    While those numbers are acceptable for shift-subtract division
    (including SRT variants).

    What I don't get is the reluctance for using the FP multiplier as
    a fast divisor (IBM 360/91). AMD Opteron used this means to
    achieve 17-cycle FDIS and 22-cycle SQRT in 1998. Why should IDIV
    not be under 20-cycles ?? and with special casing of leading 1s
    and 0s average around 10-cycles ???

    Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2]
    cycles (where s is the number of significant digits in the
    quotient).

    https://www.quinapalus.com/cm7cycles.html

    That looks a lot like an SRT divisor with early out?

    Having variable timing DIV means that any crypto operating (including hashes?) where you use modulo operations, said modulus _must_ be a
    known constant, otherwise information about will leak from the
    timings, right?

    Are you aware of any professional crypto algorithm, including hashes,
    that uses modulo operations by modulo that is neither power-of-two nor
    at least 192-bit wide?



    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Mar 29 13:38:55 2024
    From Newsgroup: comp.arch

    Michael S wrote:
    On Thu, 28 Mar 2024 09:31:11 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 3/26/2024 5:27 PM, Michael S wrote:


    For slightly less then 20 years ARM managed OK without integer
    divide. Then in 2004 they added integer divide instruction in
    ARMv7 (including ARMv7-M variant intended for small
    microcontroller cores like Cortex-M3) and for the following 20
    years instead of merely OK they are doing great :-)


    OK.

    The point is they are doing better now after adding IDIV and FDIV.

    I think both modern ARM and AMD Zen went over to "actually fast"
    integer divide.

    I think for a long time, the de-facto integer divide was ~ 36-40
    cycles for 32-bit, and 68-72 cycles for 64-bit. This is also
    on-par with what I can get from a shift-add unit.

    While those numbers are acceptable for shift-subtract division
    (including SRT variants).

    What I don't get is the reluctance for using the FP multiplier as
    a fast divisor (IBM 360/91). AMD Opteron used this means to
    achieve 17-cycle FDIS and 22-cycle SQRT in 1998. Why should IDIV
    not be under 20-cycles ?? and with special casing of leading 1s
    and 0s average around 10-cycles ???

    Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2]
    cycles (where s is the number of significant digits in the
    quotient).

    https://www.quinapalus.com/cm7cycles.html

    That looks a lot like an SRT divisor with early out?

    Having variable timing DIV means that any crypto operating (including
    hashes?) where you use modulo operations, said modulus _must_ be a
    known constant, otherwise information about will leak from the
    timings, right?

    Are you aware of any professional crypto algorithm, including hashes,
    that uses modulo operations by modulo that is neither power-of-two nor
    at least 192-bit wide?

    I was involved with the optimization of DFC, the AES candidate from CERN:

    It uses a fixed prime just above 2^64 as the modulus (2^64+13 afair),
    and that resulted in a very simple reciprocal, i.e. no need for a DIV
    opcode.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Mar 29 16:38:58 2024
    From Newsgroup: comp.arch

    On Fri, 29 Mar 2024 13:38:55 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Thu, 28 Mar 2024 09:31:11 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:


    Are you aware of any professional crypto algorithm, including
    hashes, that uses modulo operations by modulo that is neither
    power-of-two nor at least 192-bit wide?

    I was involved with the optimization of DFC, the AES candidate from
    CERN:

    It uses a fixed prime just above 2^64 as the modulus (2^64+13 afair),
    and that resulted in a very simple reciprocal, i.e. no need for a DIV opcode.

    Terje


    Since DFC lost, I suppose that even ignoring reciprocal optimization
    the answer to my question is 'No'.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Mon Apr 1 04:09:35 2024
    From Newsgroup: comp.arch

    ( Was distracted with other stuff, mostly trying to find and fix bugs. )


    On 3/27/2024 4:03 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 3/26/2024 7:02 PM, MitchAlsup1 wrote:

    Sometimes I see a::

         CVTSD     R2,#5

    Where a 5-bit immediate (value = 5) is converted into 5.0D0 and
    placed in register R2 so it can be accessed as an argument in the
    subroutine call that happens a few instructions later.


    I had looked into, say:
       FADD Rm, Imm5fp, Rn
    Where, despite Imm5fp being severely limited, it had an OK hit rate.

    Unpacking imm5fp to Binary16 being, essentially:
       aee.fff -> 0.aAAee.fff0000000


    Self correction: aee.ff


    realistically ±{0, 1, 2, 3, 4, 5, .., 31} only misses a few of the often used fp constants--but does include 0, 1, 2, and 10. Also, realistically
    the missing cases are the 0.5s.


    This scheme had (rounded):
    * 000xx: 3000, 3100, 3200, 3300 ( 0.125, 0.156, 0.188, 0.219)
    * 001xx: 3400, 3500, 3600, 3700 ( 0.250, 0.313, 0.375, 0.438)
    * 010xx: 3800, 3900, 3A00, 3B00 ( 0.500, 0.625, 0.750, 0.875)
    * 011xx: 3C00, 3D00, 3E00, 3F00 ( 1.000, 1.250, 1.500, 1.750)
    * 100xx: 4000, 4100, 4200, 4300 ( 2.000, 2.500, 3.000, 3.500)
    * 101xx: 4400, 4500, 4600, 4700 ( 4.000, 5.000, 6.000, 7.000)
    * 110xx: 4800, 4900, 4A00, 4B00 ( 8.000, 10.000, 12.000, 14.000)
    * 111xx: 4C00, 4D00, 4E00, 4F00 (16.000, 20.000, 24.000, 28.000)

    Of the options evaluated, this seemed to get the best general-case hit rate.
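
    Read as a C sketch (field placement inferred from the listed values,
    with 'A' being the complement of a):

    #include <stdint.h>

    /* Unpack a 5-bit FP immediate "a ee ff" into a Binary16 pattern:
       sign = 0, exponent = { a, ~a, ~a, e1, e0 }, fraction = { f1, f0, 0.. }.
       E.g. 0b00000 -> 0x3000 (0.125), 0b01100 -> 0x3C00 (1.0),
       0b11111 -> 0x4F00 (28.0), matching the table above. */
    static uint16_t imm5fp_to_binary16(unsigned imm5)
    {
        unsigned a  = (imm5 >> 4) & 1;
        unsigned ee = (imm5 >> 2) & 3;
        unsigned ff =  imm5       & 3;
        unsigned na = a ^ 1;
        unsigned exp = (a << 4) | (na << 3) | (na << 2) | ee;
        return (uint16_t)((exp << 10) | (ff << 8));
    }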


    OTOH, can note that a majority of typical floating point constants can
    be represented exactly in Binary16 (well, excluding "0.1" or similar),
    so it works OK as an immediate format.

    This allows a single 32-bit op to be used for constant loads
    (nevermind if one needs a 96 bit encoding for 0.1, or PI, or ...).


    Mostly, a floating point immediate is available from a 32-bit constant
    container. When accessed in a float calculation it is used as IEEE32;
    when accessed by a double calculation, IEEE32->IEEE64 promotion is
    performed in the constant delivery path. So, one can use almost any
    floating point constant that is representable in float as a double
    without eating cycles and while saving code footprint.


    Don't currently have the encoding space for this.

    Could in theory pull off truncated Binary32 as an Imm29s form, but not
    likely worth it. Would also require putting a converter in the ID2
    stage, so not free.

    In this case, the issue is more one of LUT cost to support these cases.


    For FPU compare with zero, can almost leverage the integer compare
    ops, apart from the annoying edge cases of -0.0 and NaN leading to
    "not strictly equivalent" behavior (though, an ASM programmer could
    more easily get away with this). But, not common enough to justify
    adding FPU specific ops for this.

    Actually, the edge/noise cases are not that many gates.
    a) once you are separating out NaNs, infinities are free !!
    b) once you are checking denorms for zero, infinities become free !!

    Having structured a Compare-to-zero circuit based on the fields in double;
    You can compose the terms to do all signed and unsigned integers and get
    a gate count, then the number of gates you add to cover all 10 cases
    of floating point is 12% gate count over the simple integer version. Also
    note:: this circuit is about 10% of the gate count of an integer adder.


    I could add them, but, is it worth it?...

    Whether to add them or not is on you.
    I found things like this to be more straw on the camel's back {where the camel collapses to a unified register file model.}


    I already have a unified register file in my ISA (only the RISC-V mode
    treats it as two sub-spaces).


    But, in this case:
    BEQ/BNE/BGT/... almost work with floating-point values, apart from the annoying edge cases like -0 and similar.

    If substituted for floating-point comparison, most things work as
    expected, except that Quake ends up with some minor physics glitches
    (*1) and similar (mostly due to the occasional appearance of -0, and it
    not comparing equal to +0 in this case).

    *1: Most commonly in the form that sliding a bounding box along the
    ground will occasionally encounter seams where it gets stuck and can't
    slide across (so the player would need to jump over these seams).
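
    The mismatch in a few lines of C (the exact case behind those glitches):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    /* -0.0 and +0.0 compare equal as floats, but their bit patterns
       differ, so an integer compare substituted for an FP compare
       sees them as unequal. */
    int main(void)
    {
        float pos = 0.0f, neg = -0.0f;
        uint32_t pb, nb;
        memcpy(&pb, &pos, sizeof pb);
        memcpy(&nb, &neg, sizeof nb);
        printf("FP compare:  %s\n", (pos == neg) ? "equal" : "unequal");
        printf("bit compare: %s\n", (pb == nb)  ? "equal" : "unequal");
        return 0;
    }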


    In this case, it is more a question of encoding space than logic cost.

    It is semi-common in FP terms, but likely not common enough to justify
    dedicated compare-and-branch ops and similar (vs the minor annoyance
    at the integer ops not quite working correctly due to edge cases).

    My model requires about ½ the instruction count when processing FP
    comparisons compared to RISC-V {big, no; around 5% in FP code and 0
    elsewhere}. Where it wins big is compare against a non-zero FP constant.
    My 66000 uses 2 instructions {FCMP, BB} whereas RISC-V uses 4 {AUIPC,
    LD, FCMP, BC}.


    At present, options are:
    FLDCH + FCMPxx (2R) + ( BT/BF | MOVT/MOVNT )
    FCMPxx (2RI) + ( BT/BF | MOVT/MOVNT )

    The latter currently requires the FpImm extension.

    The integer ops currently have some additional denser cases:
    Compare and Branch
    Compare-Zero and Branch
    Compare and Set (3R and small 3RI).

    However, adding these cases for FP would eat encoding space, and floating-point compare is less common than integer compare.


    Granted, it does look like FCMPGT is around 0.7% of the clock-cycle
    budget in GLQuake, so isn't too far down on the list. Would need to
    evaluate whether it is more commonly being used for branching or as a
    predicate or similar, and the relative amount of compare-with-zero vs
    comparing two values.

    Can note that FCMPEQ is much further down the list, at 0.03%.



    -----------------------

    Seems that generally 0 still isn't quite common enough to justify
    having one register fewer for variables though (or to have a
    designated zero register), but otherwise it seems there is not much
    to justify trying to exclude the "implicit zero" ops from the ISA
    listing.


    It is common enough,
    But there are lots of ways to get a zero where you want it for a return.

    I think the main use case for a zero register is mostly that it allows
    using it as a special case for pseudo-ops. I guess, not quite the same
    if it is a normal GPR that just so happens to be 0.

    Recently ended up fixing a bug where:
       y=-x;
    Was misbehaving with "unsigned int":
       "NEG" produces a value which falls outside of UInt range;
       But, "NEG; EXTU.L" is a 2-op sequence.
       It had the EXT for SB/UB/SW/UW, but not for UL.
         For SL, bare NEG almost works, apart from ((-1)<<31).
    Could encode it as:
       SUBU.L  Zero, Rs, Rn

        ADD  Rn,#0,-Rs

    But notice::
        y = -x;
        a = b + y;
    can be performed as if it had been written::
        y = -x;
        a = b + (-x);
    Which is encoded as::

        ADD  Ry,#0,-Rx
        ADD  Ra,Rb,-Rx

    But, without a zero register,

    #0 is not a register, but its value is 0x0000000000000000 anyway.

    You missed the point entirely, if you can get easy access to #0
    then you no longer need a register to hold this simple bit pattern.
    In fact a large portion of My 66000 ISA over RISC-V comes from this mechanism.


    OK.

    I did end up switching "-x" for Int and UInt over to ADDS.L and ADDU.L
    with the zero in a register. Noted that on-average, this is still faster
    than "NEG + EXTx.L" even in the absence of a zero register.


    But, then ended up fighting with a bunch of compiler bugs, as some
    recent behavioral tweaks in the compiler had started stepping on a bunch
    of buggy cases:
    Some instructions which were invalid as prefixes were being set as
    prefixes in the WEXifier;
    The edge case for expanding Op32 to Op64 was mangling large immediate
    values (mostly related to cases dealing with encoding R32..R63 in
    Baseline mode);
    The handling for "tiny leaf functions" was incorrectly allowing
    functions to be compiled as tiny-leaf functions in cases where they had
    a by-value structure type on the local frame (in this case, the local
    struct address would end up as "whatever value was already in the
    register", potentially stomping stuff elsewhere in memory, *1).

    *1: TBD if this is related to some of the other
    memory-stomping/corruption bugs I had been seeing (which were sensitive
    to register allocator behavior). It is possible this was it, or they may
    yet still be other bugs that result in memory-stomping.


    This sort of stuff has eaten up a lot of my time recently.


                                  the compiler needs to special-case

    The compiler needs easy access to #0 and the compiler needs to know
    that #0 exists, but the compiler does not need to know if some register contains that same bit pattern.


    Conventionally, this was done with "pseudo-instructions", but for a
    pseudo instruction, one needs to know that zero exists in some location.

    Otherwise, one can leave it up to the register allocator ("hey, give me
    a register that holds zero").


    Though, looking at the compiler output, it still doesn't yet seem to
    realize that immediate 0 is 0 regardless of type. So, it is treating 0,
    0U, 0LL, NULL, ... effectively as different values (and allocating them
    in different registers). May need to address this...


    provision this (or, in theory, add a "NEGU.L" instruction, but doesn't
    seem common enough to justify this).

    ....

    It is less bad than 32-bit ARM, where I only burnt 2 bits, rather than 4.

    I burned 0 per instruction, but you can claim I burned 1 instruction PRED
    and 6.4 bits of that instruction are used to create masks that project
    upon up to 8 following instructions.


    Yeah.

    Say:
    CMPxx OP1?T OP2?T OP3?F OP4?F
    Is one less instruction word than, say:
    CMPxx PRED OP1 OP2 OP3 OP4

    And, as noted, most predication blocks tend to be short.


    Though, one could argue, say (very rough stats):
    Scalar: ~ 55%
    WEX : ~ 17% (has gone up slightly)
    Jumbo : ~ 15% (this is just the jumbo prefixes)
    Pred : ~ 13% (6.5% for ?T and ?F)
    PrWEX : ~ 0.4%

    Would, in terms of entropy, favor spending 0 or 1 bits per instruction
    on this.


    But, spending 2 bits here allows the encoding space to be more
    orthogonal (at least in XG2). So, nearly any unconditional instruction
    can also be encoded in WEX'ed or Predicated forms (there are a few large unconditional-only instructions; their alternate forms used to encode
    the Jumbo prefixes and PrWEX blocks).

    Like, a lopsided encoding space would be less ideal even if "more
    optimal" in a Huffman sense...

    Though, FWIW, 25% isn't *that* far off from the ~17% of ops flagged as
    WEX in this example; granted, one could argue that superscalar could
    deal with this.


    Granted, potentially a clever CPU could also avoid needing to burn an
    extra clock-cycle on a PRED op or similar, or 2 cycles to perform an unconditional branch, ... But, a naive CPU will likely not be able to
    avoid these sorts of costs.

    Granted, some past architectures were like, "well, an instruction would
    be there already..." and treat the cycle following the branch as a branch-delay-slot. Granted, had resisted that urge, and had previously
    noted that if the delay-slot would contain an instruction that triggers
    a pipeline interlock stall, this would foul up the pipeline and cause
    the branch-initiation process to fail (and the "fix" here would have
    been a lot more painful if the ISA actually had branch-delay slots;
    rather than just the would-be delay slot instruction having failed to be "sufficiently ignored").



    In Baseline mode, the encoding-space breakdown differs:
    75%: 16-bit ops
    12.5%: 32-bit ops (7x/9x), R32..R63,
    scalar ops only, covering a subset of the ISA.
    6.25%: Pred (Ex)
    6.25%: Scalar and WEX (R0..R31 only, Fx).

    Or, if no XGPR:
    87.5%: 16-bit ops.
    6.25%: Pred (Ex)
    6.25%: Scalar and WEX (R0..R31 only, Fx).

    Where the encoding of the Ex and Fx blocks is common between Baseline
    and XG2 (allowing for a common subset to exist).


    Comparably, XG2 Mode, despite only 1/4 of the encoding space being used
    for Scalar ops, generally still ends up expanding some of the immediate
    fields and similar vs Baseline mode (and mostly avoiding the need to use
    Op64 encodings due to the more orthogonal encodings).



    Also seems like a reasonable tradeoff, as the 2 bits effectively gain:
       Per-instruction predication;
       WEX / Bundle encoding;
       Jumbo prefixes;
       ...

    But, maybe otherwise could have justified slightly bigger immediate
    fields, dunno.


    While poking at it, did go and add a check to exclude large
    struct-copy operations from predication, as it is slower to turn a
    large struct copy into NOPs than to branch over it.

    Did end up leaving struct-copies where sz<=64 as allowed though
    (where a 64 byte copy at least has the merit of achieving full
    pipeline saturation and being roughly break-even with a branch-miss,
    whereas a 128 byte copy would cost roughly twice as much as a branch
    miss).

    I decided to bite the bullet and have LDM, STM and MM so the compiler does
    not have to do any analysis. This puts the onus on the memory unit designer
    to process these at least as fast as a series of LDs and STs. Done right,
    this saves ~40% of the power of the caches, avoiding ~70% of tag accesses
    and 90% of TLB accesses. You access the tag only when/after crossing
    a line boundary and you access the TLB only after crossing a page boundary.

    OK.

    In my case, it was more a case of noting that sliding over, say, 1kB
    worth of memory loads/stores, is slower than branching around it.

    This is why My 66000 predication has use limits. Once you can get where
    you want faster with a branch, then a branch is what you should use.
    I reasoned that my 1-wide machine would fetch 16-bytes (4 words) per
    cycle and that the minimum DECODE time is 2 cycles, that Predication
    wins when the number of instructions <= FWidth × Dcycles = 8.
    Use predication and save cycles by not disrupting the front end.
    Use branching   and save cycles by     disrupting the front end.



    In this case, where and how much to predicate is left up to the compiler frontend, and it doesn't need to know in advance exactly how many
    instructions will be emitted by the backend (but, does need to avoid
    anything which might potentially result in an implicit runtime call or similar).


    In this case, the compiler decides this at the AST level, but there is
    an uncertainty how many CPU instructions would be covered by the AST
    (and the front-end hand-waves it mostly by counting AST nodes and using
    each AST node as a stand-in for an instruction; rejecting cases which
    have too many AST nodes).
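
    A hypothetical sketch of that heuristic (all names invented; BGBCC's
    actual code surely differs): walk the then/else subtrees, count nodes as
    a stand-in for instructions, and bail out on anything disallowed, such
    as a call:

    struct ast_node {
        int kind;                        /* AST_CALL, AST_ADD, ...        */
        int nkids;
        struct ast_node **kid;
    };

    enum { AST_CALL = 1 /* , ... */ };
    #define PRED_MAX_NODES 8             /* illustrative threshold        */

    static int count_nodes(const struct ast_node *n)
    {
        if (!n) return 0;
        if (n->kind == AST_CALL)
            return 1000;                 /* poison value: never predicate */
        int total = 1;
        for (int i = 0; i < n->nkids; i++)
            total += count_nodes(n->kid[i]);
        return total;
    }

    static int can_predicate(const struct ast_node *then_blk,
                             const struct ast_node *else_blk)
    {
        return count_nodes(then_blk) + count_nodes(else_blk) <= PRED_MAX_NODES;
    }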


    But, had needed to fiddle with stuff a bit to fix up some edge cases
    that were popping up here:
    A struct-copy may emit a much longer sequence than intended;
    Some seemingly benign cases, like "ptr-ptr" were resulting in "evil"
    cases with some types (like "Foo*" where "sizeof(Foo)" is not a power of
    2, resulting in an implicit integer division, so then one needs to
    detect these cases and reject them);
    ...

    But, yeah, cases where the predication results in 100+ instructions, or
    tries to do something invalid (such as trying to call a runtime support function), are undesirable.

    But, the logic here was a little lax in some of its validity checks.

    Where operators are potentially very different depending on which types
    they are applied to, and in the AST we don't implicitly know the type
    (the compiler can try to figure it out, but it is not guaranteed).

    Though, the predication handling can deal with the "What is the type of
    this expression? Hell if I know." cases by not predicating the blocks.
    FWIW, earlier on, BGBCC did not know any types at the AST level, and
    left type handling up to the RIL->3AC translation stage to sort this out
    (or, in difficult cases, the type might decay to "variant" and it could
    be sorted out at runtime using tagrefs; but I have since moved to full
    static typing, nevermind potential type system holes and similar in the frontend stages).

    But, there is a performance cost, as any time one asks the type of an expression, it needs to walk this part of the AST and potentially lookup variables and similar, which isn't free.

    But, some of this, while it works OK for C, would not fly so well with
    C++ (which would not be quite so tolerant of "Swiss cheese" typesystem semantics in the frontend stages). Did work better for some of my own
    language designs though (mostly built around hybrid static/dynamic
    systems; some things are a little easier when one can justify dealing
    with them by falling back to type-tagging and run-time type-checks, ...).



    Something like your style of predication mechanism would likely need to
    be applied much later in the process (such as when applying branch
    fixups), rather than in the front-end and encoded via the IR stages.


    Granted, with my scheme, trying to turn a branch into predication would
    leave a "hole" in the place where the branch op was (where a NOP or
    similar may need to be left), in which case, something like a "PRED" instruction wouldn't be a huge loss.

    Though, I guess, logic for "hey, turn this short forward branch into predication" could be a possibility as well (and might hit some cases
    missed by the AST level predication logic, or which appeared spontaneously).

    Would require logic to detect short forward branches and confirm that
    the instruction sequence following the branch does not contain any
    labels or similar though.




    All I had to do was to make the second predication overwrite the first
    predication's mask, and the compiler did the rest.

    Not so simple in my case, but the hardware is simpler, since it just
    cares about the state of 1 bit (which is explicitly saved/restored
    along with the rest of the status-register if an interrupt occurs).

    Simpler than 8-flip flops used as a shift right register ??


    It would need to be captured and preserved correctly during interrupts
    (and remain correctly aligned with the instruction stream when returning
    from an interrupt, etc).

    This would make it fundamentally different, say, than the shift-register
    type logic used to deal with a pipeline flush during a branch (where one really does not care what happens if an interrupt happens here; only
    that the flushed stages do not execute).


    Granted, in my case, one does need logic to detect, say, if ID2 contains
    a predicated instruction, and EX1 contains an instruction that will
    update the SR.T flag.

    Logic for predication handling is otherwise something like:
    casez( { opBraFlush, opUCmd[7:6], regInSr[0] } )
    4'b000z: tOpEnable = 1; //always
    4'b001z: tOpEnable = 0; //never (internal use by CPU)
    4'b0100: tOpEnable = 0; //?T and F
    4'b0101: tOpEnable = 1; //?T and T
    4'b0110: tOpEnable = 1; //?F and F
    4'b0111: tOpEnable = 0; //?F and T
    4'b1zzz: tOpEnable = 0; //flushed stage
    endcase

    Where, if tOpEnable is 0, the opcode (in opUCmd[5:0]) is replaced with
    NOP, unless it is a branch and !opBraFlush, in which case it is replaced
    with a "NO_BRANCH" (which, if the branch-predictor predicted a branch,
    initiates a branch to the following instruction; unlike "BRA", which
    initiates a branch to the target if the branch-predictor did not predict
    a branch having been taken).

    Where, generally, the branch-flush mask is updated on the transition to
    the following cycle (and the flush flag corresponding to the IF stage
    may also be set by the branch-predictor if it predicts a branch having
    been taken).

    ...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Tue Apr 2 16:12:49 2024
    From Newsgroup: comp.arch

    On 3/27/2024 6:06 PM, Michael S wrote:
    On Wed, 27 Mar 2024 21:14:01 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    What I don't get is the reluctance for using the FP multiplier as a
    fast divisor (IBM 360/91). AMD Opteron used this means to achieve
    17-cycle FDIS and 22-cycle SQRT in 1998. Why should IDIV not be under
    20-cycles ?? and with special casing of leading 1s and 0s average
    around 10-cycles ???

    I submit that at 10-cycles for average latency, the need to invent
    screwy forms of even faster division fall by the wayside {accurate or
    not}.


    All modern performance-oriented cores from Intel, AMD, ARM and Apple
    have fast integer dividers and typically even faster FP dividers.
    The last "big" cores with relatively slow 64-bit IDIV were Intel Skylake (launched in 2015) and AMD Zen2 (launched in 2019), but the later is
    slow only in the worst case, the best case is o.k.
    I'd guess, when Skylake was designed nobody at Intel could imagine that
    it and its variations would be manufactured in huge volumes up to 2021.


    FWIW: Newest CPU that I have is Zen+ ...
    A lot of the other stuff around (excluding a few RasPi's I have) is
    stuff from the mid-to-late 2000's.

    But, yeah, I guess fast dividers are possible, question I guess is more whether they can be done cheaply enough for cores where one would
    actually care about things like the costs of instruction decoding?...


    Otherwise:

    But, yeah, at the moment, apart from debugging, had gotten around to
    starting to work on some more of the GUI related stuff:
    Added support for drawing window bitmaps using 256 and 16 color graphics;
    Started working on a "toolkit" for GUI widget drawing;
    Also started adding stuff for scalable / SDF font rendering.

    For UI widgets:
    Idea is that the widgets will be primarily drawn into a bitmap buffer
    that will be mapped to the window (lower level window management will
    not know or care about UI widgets);
    Generally, I am breaking the UI into "forms" which will be applied to
    the windows, with the information about the "form" and its contained
    widgets being handled separately from any state associated with those
    widgets (unlike GTK, which tends to keep the widget definition and state
    in the same object).

    Likewise, also generally approaching things in a way that "makes sense"
    for C (IOW: not the struct-casting mess that GTK had used). Also in the current design, the idea is that UI elements will be sized and
    positioned explicitly using window pixel coordinates (scaling UI based
    on window resizing may possibly be treated as more of an application
    concern, but could maybe have "layout helper" logic for this, TBD).
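
    A minimal sketch of the intended split, with all names invented here
    (not the actual toolkit API): the form describes the widgets and their
    fixed pixel-space placement, while mutable per-widget state lives in a
    separate array owned by the program.

    /* Hypothetical example; field and type names are placeholders. */
    typedef struct {
        int kind;                /* button, label, checkbox, ... */
        int x, y, w, h;          /* explicit window pixel coordinates */
        const char *text;        /* static label/caption */
        int state_ix;            /* index into the program's state array */
    } ui_widget_def;

    typedef struct {
        const ui_widget_def *widgets;  /* read-only, possibly shared */
        int n_widgets;
    } ui_form_def;

    typedef struct {
        int checked;             /* example runtime state for a checkbox */
        int pressed;             /* example runtime state for a button */
    } ui_widget_state;           /* array owned/allocated by the program */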


    I have yet to decide on the default bit-depth for UI widgets, mostly
    torn between 16-color and 256-color. Probably don't need high bit-depths
    for "text and widgets" windows, but may need more than 16-color. Also
    TBD which color palette to use in the case where 256 color is used (the
    palette used for the lower-level parts of the GUI works OK, but requires
    a lookup table to convert RGB555 to indexed color, don't want to
    duplicate this table for every program; and don't necessarily need to
    force programs to all use the same palette, even though this would be
    better in a color-fidelity sense). Though, for UI widgets it almost
    makes the most sense to instead use a palette based around RGB222 (64 color)
    or similar.

    Say:
    00..0F: RGBI colors
    10..1F: Grayscale
    20..3F: Dunno
    40..7F: RGB222 colors
    80..FE: Dunno
    FF: Transparent
    Well, or maybe reuse a variant of the 216 color "web safe" palette:
    00..0F: RGBI colors
    10..1F: Grayscale
    20..FD: 216 colors
    FE: Dunno
    FF: Transparent

    Well, unless RGBI is sufficient (where, in this mode, the high-intensity-magenta color was interpreted as transparent).
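
    FWIW, the second layout above is simple enough to generate procedurally;
    a rough sketch (the exact RGBI levels, the bit order, and what to do
    with FE and the leftover F8..FD slots are just one plausible choice
    here):

    #include <stdint.h>

    static uint32_t ui_pal[256];   /* 0x00RRGGBB */

    void build_ui_palette(void)
    {
        for (int i = 0; i < 16; i++) {              /* 00..0F: RGBI */
            uint8_t r = (i & 4) ? 0xAA : 0x00;
            uint8_t g = (i & 2) ? 0xAA : 0x00;
            uint8_t b = (i & 1) ? 0xAA : 0x00;
            if (i & 8) { r += 0x55; g += 0x55; b += 0x55; }  /* intensity bit */
            ui_pal[i] = (r << 16) | (g << 8) | b;
        }
        for (int i = 0; i < 16; i++) {              /* 10..1F: grayscale ramp */
            uint8_t v = (uint8_t)(i * 255 / 15);
            ui_pal[0x10 + i] = (v << 16) | (v << 8) | v;
        }
        for (int i = 0; i < 216; i++) {             /* 20..F7: 6x6x6 web-safe cube */
            uint8_t r = (uint8_t)((i / 36) * 51);
            uint8_t g = (uint8_t)(((i / 6) % 6) * 51);
            uint8_t b = (uint8_t)((i % 6) * 51);
            ui_pal[0x20 + i] = (r << 16) | (g << 8) | b;
        }
        /* F8..FE: left unassigned here; FF: transparent by convention. */
        ui_pal[0xFF] = 0;
    }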


    May also make sense to either overlay UI elements with different
    bit-depths, and/or treat the UI widgets as an overlay (though, this
    could have its own issues; say, if the background image is redrawn and
    the UI overlaps part of it, the UI will also need to be redrawn). And/or
    render the UI bitmap at a higher bit-depth (such as hi-color) if higher bit-depth elements are present (say, if there is a widget representing a "Bitmap object" or similar). In the "overlaid UI widgets" scenario,
    would likely make most sense to draw the parts showing the underlying
    image as a dedicated transparent color.

    Though, non-overlap would be preferable in the case of things like video playback or similar.


    For scalable fonts, current idea is mostly to use SDF fonts encoded as 256-color BMP images, where the X/Y distances are encoded as 4 bits
    each. The color palette is kinda redundant in this case, but doesn't add
    too much additional storage overhead to a group of 16x16 character cells
    (say, representing CP-1252 with each glyph encoded as 16x16 pixels).

    Partly this is because SDF font rendering seems technically simpler than dealing with TrueType fonts (and was also used in some of my prior 3D
    engine projects; though with the alteration of representing the SDF's as 256-color BMP images rather than TGA or similar).

    Currently no plans to address the rest of Unicode with this though (and
    it would take around 17MB to represent the entire Unicode BMP with this scheme).

    In the past, had used Unifont, which was encoded in a form that stores
    each glyph as 16x16 at 1bpp; basically works, but no good way to scale
    the glyphs in this case without them looking terrible (well, absent
    upscaling them, possibly smoothing the edges, and transcoding into SDF
    form).

    But, there is not really a good algorithmic way to figure out which
    corners need to be sharp and which should be rounded. Best option is
    likely to look at blocks of, say, 3x3 or 5x5 pixels, and then make a
    guess, based on the pattern, as to how the pixel in the center should
    expand from 1 pixel to 2x2
    (which is applied iteratively to double the resolution of each glyph).

    Though, 3x3 is small enough to make it viable to figure out a lookup
    table by hand. Well, or generate a bunch of images of large-font text
    and try to "mine" the patterns from the images (and then use these
    patterns to hopefully upsample Unifont glyphs to 64x64 or similar, as
    needed for the 16x16 SDF generation...).
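
    As a sketch of the doubling pass itself (the 512-entry pattern table is
    the hard part and is left external here; its contents would come from
    the hand-tuning and/or pattern mining described above):

    #include <stdint.h>

    extern uint8_t tbl2x2[512];   /* 4-bit output pattern per 9-bit neighborhood */

    static int getpx(const uint8_t *img, int w, int h, int x, int y)
    {
        if (x < 0 || y < 0 || x >= w || y >= h) return 0;
        return img[y * w + x] & 1;
    }

    /* src: w x h 1bpp glyph (one byte per pixel); dst: 2w x 2h output */
    void upsample2x(const uint8_t *src, int w, int h, uint8_t *dst)
    {
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) {
                int idx = 0;                          /* 9-bit neighborhood code */
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++)
                        idx = (idx << 1) | getpx(src, w, h, x + dx, y + dy);
                int pat = tbl2x2[idx];                /* 2x2 expansion pattern */
                dst[(2*y  )*(2*w) + 2*x  ] = (pat >> 3) & 1;
                dst[(2*y  )*(2*w) + 2*x+1] = (pat >> 2) & 1;
                dst[(2*y+1)*(2*w) + 2*x  ] = (pat >> 1) & 1;
                dst[(2*y+1)*(2*w) + 2*x+1] =  pat       & 1;
            }
    }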


    Any thoughts on any of this?...

    ...

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Apr 5 21:43:08 2024
    From Newsgroup: comp.arch

    BGB-Alt wrote:



    I have yet to decide on the default bit-depth for UI widgets, mostly
    torn between 16-color and 256-color. Probably don't need high bit-depths
    for "text and widgets" windows, but may need more than 16-color. Also

    When I write documents I have a favorite set of pastel colors I use to
    shade boxes, arrows, and text. I draw the figures first and then fill in
    the text later. To give the eye an easy means to go from reading the text
    to looking at a figure and finding the discussed item, I place a box around
    the text and shade the box with the same R-G-B color of the pre.jpg box in
    the figure. All figures are *.jpg. So, even a word processor needs 24-bit color.

    {{I have also found that Word R-G-B color values are not exactly the same
    as the R-G-B values in the *.jpg figure, too; but they are close enough
    to avoid "figuring it out and trying to fix it to perfection".}}

    TBD which color palette to use in the case if 256 color is used (the

    256 is OK enough for DOOM games, but not for professional documentation.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Sat Apr 6 01:42:58 2024
    From Newsgroup: comp.arch

    On 4/5/2024 4:43 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:



    I have yet to decide on the default bit-depth for UI widgets, mostly
    torn between 16-color and 256-color. Probably don't need high
    bit-depths for "text and widgets" windows, but may need more than
    16-color. Also

    When I write documents I have a favorite set of pastel colors I use to
    shade boxes, arrows, and text. I draw the figures first and then fill in
    the text later. To give the eye an easy means to go from reading the
    text to looking at a figure and finding the discussed item, I place a
    box around
    the text and shade the box with the same R-G-B color of the pre.jpg box
    in the figure. All figures are *.jpg. So, even a word processor needs
    24-bit color.

    {{I have also found that Word R-G-B color values are not exactly the same
    as the R-G-B values in the *.jpg figure, too; but they are close enough
    to avoid "figuring it out and trying to fix it to perfection".}}

    TBD which color palette to use in the case if 256 color is used (the

    256 is OK enough for DOOM games, but not for professional documentation.


    If the palette is set up well, one can approximate most RGB colors semi-passably via the palette. Granted, still not great for photos or
    video (or even games); and a tradeoff needs to be made between color
    fidelity and levels of brightness (with the system palette I was using favoring brightness accuracy over color fidelity).

    Where games only really look good if the color palette used by the
    system roughly matches that used by the game. The palette I had used, as
    can be noted, was 14 levels of 18 colors:
    6 high-saturation (3-bit RGB);
    6 low saturation;
    3 off-white;
    grayscale (16 levels);
    orange;
    azure.

    Though, there is not really a fast/accurate procedural transform here,
    as the palette is complicated enough to require a lookup table (unlike
    216-color or RGB332).
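
    For contrast, the simpler palettes mentioned do have cheap closed-form
    encoders (these are just the generic formulas, nothing specific to this
    system):

    #include <stdint.h>

    uint8_t rgb_to_332(uint8_t r, uint8_t g, uint8_t b)
    {
        return (r & 0xE0) | ((g >> 3) & 0x1C) | (b >> 6);   /* RRRGGGBB */
    }

    uint8_t rgb_to_web216(uint8_t r, uint8_t g, uint8_t b)
    {
        int r6 = (r + 25) / 51;     /* round each channel to one of 6 levels */
        int g6 = (g + 25) / 51;
        int b6 = (b + 25) / 51;
        return (uint8_t)(r6 * 36 + g6 * 6 + b6);            /* 0..215 */
    }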

    Had at one point used a palette encoding which used a 4-bit Y and the
    remaining 4 bits effectively encoding a normalized vector in YUV space,
    but the current scheme is more efficient. In this older scheme, palette
    entries that would represent an out-of-gamut color were effectively
    wasted. This strategy worked semi-OK for photo-like graphics, but worse
    for other things; the encoding logic was of intermediate complexity
    (effectively built around a highly modified YCbCr transform).

    Though, one possible (semi-OK) way to algorithmically encode colors to
    the current palette would be:
    Calculate the Luma;
    Normalize the RGB color into a color vector;
    Feed this vector through a fixed RGB222 lookup table;
    Encode the color based on which axis was selected.
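
    In very rough C, those steps might look like the following; the two
    tables are stand-ins (the real palette layout, with its 14 levels of 18
    colors, is not reproduced here):

    #include <stdint.h>

    extern uint8_t rgb222_to_axis[64];          /* hypothetical: RGB222 cell -> color "axis" (0..17) */
    extern uint8_t axis_level_to_index[18][14]; /* hypothetical: (axis, level) -> palette index */

    uint8_t encode_to_sys_palette(uint8_t r, uint8_t g, uint8_t b)
    {
        int y = (2 * r + 5 * g + b) / 8;        /* cheap luma approximation, 0..255 */

        int maxc = r;                           /* normalize RGB into a direction vector */
        if (g > maxc) maxc = g;
        if (b > maxc) maxc = b;
        if (maxc == 0) maxc = 1;
        int rn = r * 255 / maxc;
        int gn = g * 255 / maxc;
        int bn = b * 255 / maxc;

        int cell = ((rn >> 6) << 4) | ((gn >> 6) << 2) | (bn >> 6);  /* RGB222 cell, 0..63 */
        int axis = rgb222_to_axis[cell];        /* which of the palette's color axes */

        int level = y * 14 / 256;               /* quantize brightness to 14 levels */
        return axis_level_to_index[axis][level];
    }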



    Though, for most UI widgets, one arguably only needs 4 colors, say:
    Black, White, Dark-Gray, Light-Gray.

    Add maybe a few other colors for things like title-bars and similar, ...


    But, imposing a 16-color RGBI limit may be too restricting (even in the
    name of saving memory by representing windows at a lower bit-depth when
    a higher bit depth is unnecessary).

    If one allows for customizable UI color schemes (like in older versions
    of Windows), then RGBI would be insufficient even for simple pointy
    clicky apps (leaving 256-color as the minimum option).


    Here, 256 colors could make sense. Though, RGB555 would likely be
    overkill for most simple pointy clicky GUI apps.


    One possibility here could be for the widget toolkit window to
    automatically upgrade to a higher bit-depth if it detects the use of bitmap-object widgets or similar (possibly making the choice of
    bit-depth mostly invisible to the program in this case).

    Well, or leave it per program, where a program that may use more complex graphics requests the use of a higher bit-depth, and then say, one that
    only does a text-editor or calculator or similar, sticks with a more
    limited colorspace to save memory.


    Say, looking at Windows Calculator:
    Looks like it uses:
    ~ 5 or 6 shades of gray;
    A shade of blue.

    Programs of this sort don't really need a higher color depth.
    Would a Notepad style program need lots of colors?
    ...

    Granted, something like an image editor would need full-color, so in
    this case supporting higher color depths will make sense as well. Just,
    one doesn't want to use higher bit-depths when they are not needed, as
    this wastes more RAM on the backing framebuffers for the various
    windows (say, a 640x400 window needs roughly 125K of backing store at
    4 bpp, 250K at 8 bpp, and 500K at 16 bpp), etc...




    In other news, had been messing with doing SDF versions of Unifont.

    Curiously, a newer version (15.1.05) of the Unifont font (post
    processing) seems to actually be a little bit smaller than an older
    version (5.1). Not entirely sure what is going on with this (but, can
    note that there are a number of large gaps in the codepoint space).

    Well, and seemingly multiple versions of the font:
    The normal Unicode BMP;
    Another 'JP' version which follows JIS rules;
    Apparently maps in various characters from Planes 1 and 2, ...
    Appears mostly similar apart from the contents of the CJK ranges.
    ...


    Say:
    After conversion to a 1bpp glyph set:
    5.1 : 1.99 MB
    15.1.05: 1.78 MB
    After conversion to SDF and storing the bitmaps in a WAD container:
    5.1 : ~ 7.0MB
    15.1.05: ~ 6.1MB

    Where, in this case, WAD based packaging seemed "mostly sufficient".
    May or may not end up putting additional metadata in the WAD files.
    For now, it is mostly just a collection of bitmap images...

    With each image having its number encoded in the lump name, and understood
    to be a grid of 16x16 glyphs.


    Though, glyph shape reconstruction is less exact than could be hoped.
    Arguably, still better than if one tries to scale the bitmaps directly,
    but results may come out a little lumpy/weird here (though, trying to
    store them at 4-bits per sub-channel probably doesn't exactly help this).


    At the moment, storage is basically:
    Represent the SDF as a 256-color image, with each pixel split into 2 sub-channels. The palette color is interpreted as the input vector for
    the SDF drawing (interpolate between the pixels, take the median, and
    compare this value against a threshold).
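
    In code terms, sampling could look roughly like the sketch below (one
    reading of the above; the palette indirection is skipped here and the
    index nibbles are used directly, and all of the names are made up):

    #include <stdint.h>

    /* low nibble = channel 0 distance, high nibble = channel 1 distance */
    static float sdf_texel(const uint8_t *img, int w, int x, int y, int ch)
    {
        return (float)((img[y * w + x] >> (4 * ch)) & 15) / 15.0f;
    }

    /* Returns nonzero if (u,v) in [0,1]^2 lands inside the glyph. */
    int sdf_sample(const uint8_t *img, int w, int h, float u, float v)
    {
        float fx = u * (w - 1), fy = v * (h - 1);
        int x0 = (int)fx, y0 = (int)fy;
        int x1 = (x0 + 1 < w) ? x0 + 1 : x0;
        int y1 = (y0 + 1 < h) ? y0 + 1 : y0;
        float ax = fx - x0, ay = fy - y0;
        float c[2];
        for (int ch = 0; ch < 2; ch++) {        /* bilinear filter each sub-channel */
            float t = sdf_texel(img, w, x0, y0, ch) * (1 - ax) +
                      sdf_texel(img, w, x1, y0, ch) * ax;
            float b = sdf_texel(img, w, x0, y1, ch) * (1 - ax) +
                      sdf_texel(img, w, x1, y1, ch) * ax;
            c[ch] = t * (1 - ay) + b * ay;
        }
        /* With two stored axes plus an implicit third (their average, per
           the remark further below), the median collapses to the average. */
        float d = 0.5f * (c[0] + c[1]);
        return d >= 0.5f;                       /* threshold at the 0.5 midpoint */
    }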

    Initially, was not using the palette, but the palette allows more
    flexibility in this case.

    The current algorithm generates the SDF images by finding the horizontal
    and vertical distances. Some papers imply that it may make sense to
    allow more distance-calculation functions, possibly trying to regenerate
    each glyph using each combination of functions and picking whichever
    combination yields the lowest reconstruction error.
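
    A rough sketch of one per-axis distance pass (the scan range and the
    quantization details are guesses; the vertical pass would be the same
    with x and y swapped):

    #include <stdint.h>

    /* src: upsampled 1bpp glyph, w*h bytes, each 0 or 1 */
    uint8_t sdf_dist_h(const uint8_t *src, int w, int x, int y, int maxd)
    {
        int inside = src[y * w + x];
        int d = maxd;
        for (int i = 1; i < maxd; i++) {        /* scan left/right for the nearest edge */
            int xl = x - i, xr = x + i;
            if ((xl >= 0 && src[y * w + xl] != inside) ||
                (xr <  w && src[y * w + xr] != inside)) {
                d = i;
                break;
            }
        }
        /* Map to [0,1] with the edge at 0.5: inside above, outside below. */
        float sd = 0.5f + (inside ? 1.0f : -1.0f) * ((float)d / (2.0f * maxd));
        if (sd < 0.0f) sd = 0.0f;
        if (sd > 1.0f) sd = 1.0f;
        return (uint8_t)(sd * 15.0f + 0.5f);    /* quantize to one 4-bit sub-channel */
    }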


    Granted, ended up using a fairly naive up-sampling algorithm (from the
    16x16 pixel input glyphs), which possibly doesn't really help (I fiddled
    with it until I got something semi-OK looking, but it has a limitation
    of not having any idea for what the upsampled glyphs are "supposed" to
    look like, what corners should be rounded and which kept sharp and
    pointy; and patterns that work for ASCII/8859-1 range don't extend to
    the rest of the BMP).

    Though, at least for 1252, this range of characters can be handled
    manually. And in this case, the upsampled glyphs were based on 8x8 pixel
    variants, where following the shapes of the low-res glyphs makes the
    font much more tolerant of scaling (comparatively, the Unifont glyphs
    are "less robust" in these areas).



    However, as-is, processing the Unicode BMP still takes an annoyingly
    long time (and a more involved axis-search for each glyph wouldn't
    exactly help matters).

    Likely options, if I did so:
    Simple 2D euclidean distance (naive algorithm);
    Horizontal distance (current);
    Vertical distance (current);
    Down-Left (diagonal);
    Up-Left (diagonal).

    Where, the generation would pick the best 2, with an implicit "3rd axis"
    which would merely be the average of the first 2.


    In theory, could use hi-color images (with 3x 5-bit channels), but this
    would effectively double the memory requirements of each SDF image
    (though, the current implementation demand loads each "page", rather
    than bulk-loading the entire Unicode BMP).

    Well, or I use hi-color for CP-1252, but then the 8-bit scheme for Unifont.



    Still not currently wanting to face the complexity of dealing with
    TrueType fonts.

    Can also note that TTF fonts that cover the entire Unicode BMP are "not exactly small" either...



    --- Synchronet 3.20a-Linux NewsLink 1.114