• Re: Tonights Tradeoff

    From Robert Finch@robfi680@gmail.com to comp.arch on Tue Oct 28 23:52:53 2025
    From Newsgroup: comp.arch

    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Same for other registers. GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10, 50, 90,
    or 130 bits.
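
    (Presumably that is 10 bits of immediate in the base 40-bit word plus 40 bits
    per extension word: 10, 10+40 = 50, 10+80 = 90, 10+120 = 130. That reading of
    the encoding is an inference from the numbers, not something stated above.)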

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Oct 29 00:14:08 2025
    From Newsgroup: comp.arch

    On 10/28/2025 8:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers.

    I assume the "high" registers are for handling 128 bit operations
    without the need to specify another register name. Do you have 5 or 6
    bit register numbers in the instructions? Five allows you to use the
    high registers for 128 bit operations without needing another register specifier, but then the high registers can only be used for 128 bit operations, which seems a waste. If you have six bits, you can use all
    64 registers for any operation, but how is the "upper" method better
    than automatically using r(x+1)?



    GPRs may contain either integer or
    floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those seem like a call from the My 66000 playbook, which I like.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 04:29:15 2025
    From Newsgroup: comp.arch

    On 10/28/2025 10:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or floating-point values.


    OK.

    I mostly stuck with 32-bit encodings, but 40 could maybe allow more
    encoding space, with the drawback of being non-power-of-2.

    But, yeah, occasionally dealing with 128-bit data is a major case for 64
    GPRs and paired registers.


    Well, that and when co-existing with RV64G, it gives somewhere to put
    the FPRs. But, in turn this was initially motivated by me failing to
    figure out how to get GCC configured to target Zfinx/Zdinx.


    Had ended up going with the Even/Odd pairing scheme as it is less wonky
    IMO to deal with R5:R4 than R36:R4.


    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.


    BT/BF works well. I otherwise also ended up using RISC-V style branches,
    which I originally disliked due to higher implementation cost, but they
    do technically allow for higher performance than just BT/BF or Branch-Compare-with-Zero in 2-R cases.

    So, it becomes harder to complain about a feature that does technically
    help with performance.


    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.


    Hmm...

    My case: 10/33/64.
    No direct 128-bit constant, but can use two 64-bit constants whenever
    128 bits is needed.



    Otherwise, goings on in my land:
    ISA development is slow, and had mostly turned into bug hunting;
    There are some unresolved bugs, but I haven't been able to fully hunt
    them down. A lot was in relation to RISC-V's C extension, but at least
    it seems like at this point the C extension is likely fully working.

    Haven't been many features that can usefully increase general-case performance. So, it is starting to seem like XG2 and XG3 may be fairly
    stable at this point.

    The longer term future is uncertain.


    My ISAs can beat RISC-V in terms of code-density and performance, but
    when RISC-V is extended with similar features, it is harder to make
    a case that it is "enough".

    Doesn't seem like (within the ISA) there are many obvious ways left to
    grab large general-case performance gains over what I have done already.

    Some code benefits from lots of GPRs, but harder to make the case that
    it reflects the general case.



    Recently got a new very-cheap laptop (a Dell Latitude 7490, for around
    $240), made some curious observations:
    It seems to slightly outperform my main PC in single-threaded performance;
    Its RAM timings don't seem to match the expected values.

    My main PC still wins at multi-threaded performance, and has the
    advantage of 7x more RAM.

    Had noted in Cinebench that my main PC is actually performing a little
    slower than is typical for the 2700X, but then again, it is effectively
    a 2700X running with DDR4-2133 rather than DDR4-2933. Partly this
    was a case of the RAM I have being unstable if run that fast (and in
    this case, more RAM but slightly slower seemed preferable to less RAM
    but slightly faster, or running it slightly faster but having the
    computer be crash-prone).

    They sold the RAM with its on-the-box speed being the XMP2 settings
    rather than the baseline settings, but the RAM in question didn't run
    reliably at the XMP or XMP2 settings (and I wasn't inclined to spend more;
    more so when there was already the annoyance that my MOBO chipset
    apparently doesn't deal with a full 128GB, but can tolerate 112GB, which
    is maybe not an ideal setup for perf).

    So, yeah, it seems that I have a setup where the 2700X is getting worse single-threaded performance than the i7 8650U in the laptop.

    Apparently, going by Cinebench scores, my PC's single threaded
    performance is mostly hanging out with a bunch of Xeons (getting a score
    in R23 of around 700 vs 950).

    Well, could be addressed, in theory, but would need some RAM that
    actually runs reliably at 2933 or 3200 MT/s and is also cheap...


    In both cases, they are CPUs originally released in 2018.

    Had noted, in a few tests:
    LZ4 benchmark (same file):
    Main PC: 3.3 GB/s
    Laptop: 3.9 GB/s
    memcpy (single threaded):
    Main PC: 3.8 GB/s
    Laptop : 5.6 GB/s
    memcpy (all threads):
    Main PC: ~ 15 GB/s
    Laptop : ~ 24 GB/s
    ( Like, what; thing only has 1 stick of RAM... *1 )

    *1: Also, how is a laptop with 1 stick of RAM matching a dual-socket
    Xeon E5410 with like 8 sticks of RAM...

    or, maybe it was just weak that my main PC was failing to beat the Xeon
    at this?... My main PC does at least beat the Xeon at single-threaded performance (was less true of my older Piledriver based PC).


    Granted, then again, I am using (almost) the cheapest MOBO I could find
    at the time (that had an OK number of RAM slots and SATA connectors).
    Can't quite identify the MOBO or chipset as I lost the box (and not
    clearly labeled on the MOBO itself); except that it is a
    something-or-another ASUS board.

    Like, at the time, IIRC:
    Went on Newegg;
    Picked mostly the cheapest parts on the site;
    Say, a Zen+ CPU being a lot cheaper than Zen 2,
    or pretty much anything from Intel.
    ...


    Did get a slightly fancy/beefy case, but partly this was because I was
    annoyed with the late-90s-era beige tower case I had been using, which I
    had ended up hot-gluing a bunch of extra PC fans into in an
    attempt to keep airflow good enough that it didn't melt, and
    under-clocking the CPU so that it could run reliably.

    Like, 4GHz Piledriver ran too hot and was unreliable, but was far more
    stable at 3.4 GHz. Was technically faster than a Phenom II underclocked
    to 2.8 GHz (for similar reasons).

    Where, at least the Zen+ doesn't overheat at stock settings (but, they
    also supplied the thing with a comparably much bigger stock CPU cooler).

    The case I got is slightly more traditional, with 5.25" bays and similar
    and mostly sheet-steel construction, vs the "new" trend of mostly glass-covered-box PC cases. Sadly, it seems like companies have mostly
    stopped selling the traditional sheet-steel PC cases with open 5.25"
    bays. Like, where exactly is someone supposed to put their DVD-RW drive,
    or hot-swap HDD trays?...

    Well, in the past we also had floppy drives, but the MOBOs removed the connectors, forcing one to now go the USB route if they want a floppy
    drive (but, now mostly moot as relatively few other computers still have floppy drives either).




    Well, in theory could build a PC with newer components and a bigger
    budget for parts. Still wouldn't want to go over to Win11, so now it is a
    choice between jumping to Linux or "Windows Server" or similar (like, at
    least they didn't pollute Windows Server with a bunch of random
    pointless crap).

    For now, the inertia option is to just keep using Win10.


    As for the laptop, had noted:
    Can run Minecraft:
    Yes; though best results at an 8-chunk draw distance.
    Much more than this, and the "Intel UHD" graphics struggle.
    At 12 chunks, there is obvious chug.
    At 16 chunks, it starts dropping into single digit territory.
    Can run Doom3:
    Yes: Mostly gets 40-50 fps in Doom 3.

    My main PC can manage a 16-chunk draw distance in Minecraft and mostly
    gets a constant 63 fps in Doom3.

    Don't have many other newer games to test, as I mostly lost interest in
    modern "AAA" games. And, stuff like Doom+RTX, I already know this wont
    work. I can mostly just be happy that Minecraft works and is playable
    (and that its GPU is solidly faster than just using a software renderer...).


    On both fronts, this is a significant improvement over the older laptop.
    For the price, I sort of worried that it would be dead slow, but it significantly outperforms its Vista-era predecessor.

    This is mostly because I had noticed that, right now (unlike a few years
    ago), there are actually OK laptops at cheap prices (along with all the
    $80 Dell OptiPlex computers and similar on Amazon...).



    Otherwise, went and recently wrote up a spec partly based on a BASIC
    dialect I had used in one of my 3D engines, with some design cleanup: https://pastebin.com/2pEE7VE8

    Where I was able to get a usable implementation for something similar in
    a little over 1000 lines of C.

    Though, this was for an Unstructured BASIC dialect.


    Decided then to try something a little harder:
    Doing a small JavaScript like language, and trying to keep the
    interpreter small.

    I don't yet have the full language implemented, but for a partial JS
    like language, I currently have something in around 2500 lines of C.

    I had initially set a target estimate of 4-8 kLOC.
    Unless the remaining functionality ends up eating a lot of code, I am on target towards hitting the lower end of this range (need to get most of
    the rest of the core-language implemented within around 1.5 kLOC or so).

    Note: No 3rd party libraries allowed, only the normal C runtime library.
    Did end up using a few C99 features, but mostly still C95.


    For now, I was calling the language BS3L, where:
    Dynamically typed;
    Supports: Integers, Floating-Point, Strings, Objects, Arrays, ...
    JS style syntax;
    Nothing too exciting here.
    Still has JS style arrays and objects;
    Dynamically scoped.
    Where, dynamic scoping needs less code than lexical scoping;
    But, dynamic scoping is also a potential foot-gun as well.
    Not sure if too much of a foot-gun.
    Vs going to C-style scoping;
    Or, biting the bullet and properly implementing lexical scoping.
    Leaving out most advanced features.
    will be fairly minimal even vs early versions of JS.

    But, in some cases, was borrowing some design ideas from the BASIC interpreter. There were some unavoidable costs, such as in this case
    needing a full parser (that builds an AST) and an AST-walking
    interpreter. Unlike BASIC, it wouldn't be possible to implement an
    interpreter by directly walking and pattern matching lists of tokens.

    And, a parser that builds an AST, and code to walk said AST, necessarily
    needs more code.

    I guess, it is a question of whether someone else could manage to implement a JavaScript style language in under 1000 lines of C while also writing "relatively normal" C (no huge blocks of obfuscated code or rampant abuse of
    the preprocessor). Or, basically, where one has to stick to similar C
    coding conventions to those used in Doom and Quake.


    I am not sure if this would be possible. Both the dynamic type-system
    and parser have eaten up a fair chunk of the code budget. A sub 1000
    line parser is also a little novel; but the parser itself got a little
    wonky and doesn't fully abstract over what it parses (as there is still
    a fair bit of bleed-over from the token stream). And, it sorta ended up abusing the use of binary operators a little.

    For example, it has wonk like dealing with lists of statements as-if
    there were a right-associative semicolon operator (allowing it to be
    walked like a linked list).
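
    As a rough illustration of that shape (a minimal sketch with made-up names,
    not the actual BS3L code): the ';' node chains statements right-associatively,
    so the evaluator can walk a block iteratively like a linked list.

        #include <stdio.h>

        enum { OP_SEMI = 1, OP_PRINT = 2 };   /* assumed node tags */

        typedef struct AstNode AstNode;
        struct AstNode {
            int      op;         /* OP_SEMI chains statements; others are statements */
            AstNode *lhs, *rhs;  /* for OP_SEMI: lhs = statement, rhs = rest of list */
            int      value;      /* payload for the toy OP_PRINT statement */
        };

        static void EvalStatement(AstNode *st)
        {
            if (st->op == OP_PRINT)
                printf("stmt: %d\n", st->value);
        }

        static void EvalBlock(AstNode *node)
        {
            /* Shape is (s1 ; (s2 ; (s3 ; ...))), so just follow rhs. */
            while (node && node->op == OP_SEMI) {
                EvalStatement(node->lhs);
                node = node->rhs;
            }
            if (node)
                EvalStatement(node);   /* trailing statement, if any */
        }

        int main(void)
        {
            AstNode s1   = { OP_PRINT, 0, 0, 1 };
            AstNode s2   = { OP_PRINT, 0, 0, 2 };
            AstNode s3   = { OP_PRINT, 0, 0, 3 };
            AstNode tail = { OP_SEMI, &s2, &s3, 0 };
            AstNode head = { OP_SEMI, &s1, &tail, 0 };
            EvalBlock(&head);
            return 0;
        }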

    There is slightly wonky operator tokenization again to save code:
    Separately matching every possible operator pattern is a bunch of extra
    logic. Was using rules that mostly give the correct operators, but with
    the possibility of non-sense operators. Also the precedence levels don't
    match up exactly, but this is a lower priority issue.


    I guess, if someone thinks they can do so in significantly less code,
    they can try.

    Note that while a language like Lua sort of resembles an intermediate
    between BASIC and JavaScript, I wouldn't expect Lua to save that much
    here (it would still have the cost of needing to build an AST and similar).

    Going from an AST to a bytecode or 3AC IR would allow for higher
    performance.

    But, I decided to go for an AST walking interpreter in this case as it
    would be the least LOC.


    Actually takes more effort trying to keep the code small. Rather than
    just copy-pasting stuff a bunch of times, one spends more time needing
    to try to factor out and reuse common patterns.


    Though, in a way, some of this is revisiting stuff I did 20+ years ago,
    but from a different perspective.

    Like, 20+ years ago, my first interpreters also used AST walkers.

    As for where I will go with this, I don't know.
    Some of it could make sense as a starting point for a GLSL compiler;
    Or maybe adapted into parsing the SCAD language;

    Or, as a cheaper alternative to what my first script VM became.
    By the end of its span, it had become quite massive...
    Though, still not too bad if compared with SpiderMonkey or V8.

    Ironically, my jump to a Java + JVM/.NET like design was actually to
    make it simpler.

    For a simple but slow language, JS works, but if you want it fast it
    quickly turns worse (and simpler to jump to a more traditional
    statically typed language). Like, there was this thing, known as "Hindley-Milner Type Inference", which on one hand, could be used to
    make a JavaScript style language fast (by turning it transparently into
    a statically-typed language), but also, was a huge PITA to deal with
    (this was combined in my VM with optional explicit type declarations;
    with a syntax inspired by ActionScript).


    Well, and when something gets big and complicated enough that one almost
    may as well just use SpiderMonkey or similar to run their JS code, this
    is a problem...

    Still less bad than LLVM, not sure why anyone would willingly submit to
    this.


    Well, there is still a surviving descendant of the original VM (although branching

    Though, makes more sense to do a clean interpreter in this case, than to
    try to build one by copy-pasting the parser from BGBCC or my old VM and
    trying to build a new lighter-weight VM.

    In some of these cases, it is easier to scale up than to scale back down:
    easier to take simpler code and add features or improve performance
    than to take more complex code and try to trim it down.


    And, sometimes it does make more sense to just write something starting
    from a clean slate.

    Well, except for my attempt at a clean-slate C compiler, which was
    more a case of realizing I wouldn't undershoot BGBCC by enough to be worthwhile, and there were some new problem points that were emerging in
    the design. Partly as I was trying to follow a model more like that used
    by GCC and binutils, which I was then left to suspect is not the right approach (and in some ways, the approach I had used in BGBCC seemed to
    make more sense than trying to imitate how GCC does things).

    Might still make sense at some point to try for another clean-slate C
    compiler, though if I would still end up taking a similar general
    approach to BGBCC (or .NET), there isn't a huge incentive (vs continuing
    to use BGBCC).

    Where, say, the main things that would ideally need improvement would
    be BGBCC's compile-time performance and memory footprint. As-is, compiling with BGBCC is about as slow as compiling with GCC, which isn't great.

    Comparably, MSVC typically being a bit faster at compiling stuff IME.


    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 08:41:46 2025
    From Newsgroup: comp.arch

    On 2025-10-29 3:14 a.m., Stephen Fuld wrote:
    On 10/28/2025 8:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit
    instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops.
    Registers are named as if there were 32 GPRs, A0 (arg 0 register is
    r1) and A0H (arg 0 high is r33). Sameo for other registers.

    I assume the "high" registers are for handling 128 bit operations
    without the need to specify another register name.  Do you have 5 or 6
    bit register numbers in the instructions.  Five allows you to use the
    high registers for 128 bit operations without needing another register specifier, but then the high registers can only be used for 128 bit operations, which seems a waste.  If you have six bits, you can use all
    64 registers for any operation, but how is the "upper" method that
    better than automatically using r(x+1)?

    Yes, but it is just a suggested usage. The registers are GPRs that can
    be used for anything, specified using a six bit register number. I
    suggested it that way because most of the time register values would be
    passed around as 64-bit quantities and it keeps the same set of
    registers for the same register type (argument, temp, saved). But since
    it should be using mostly compiled code, it does not make much difference.

    Also, the high registers could be used as FP registers. Maybe allowing
    for saving only the low order 32 regs during a context switch.

    GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch
    on bit-set/clear for conditional branches. Might also include branch
    true / false.

    Using operand routing for immediate constants and an operation size
    for the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be
    10,50,90 or 130 bits.

    Those seem like a call from the My 66000 playbook, which I like.

    Yup.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 08:50:35 2025
    From Newsgroup: comp.arch

    On 2025-10-29 8:41 a.m., Robert Finch wrote:
    On 2025-10-29 3:14 a.m., Stephen Fuld wrote:
    On 10/28/2025 8:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit
    instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops.
    Registers are named as if there were 32 GPRs, A0 (arg 0 register is
    r1) and A0H (arg 0 high is r33). Sameo for other registers.

    I assume the "high" registers are for handling 128 bit operations
    without the need to specify another register name.  Do you have 5 or 6
    bit register numbers in the instructions.  Five allows you to use the
    high registers for 128 bit operations without needing another register
    specifier, but then the high registers can only be used for 128 bit
    operations, which seems a waste.  If you have six bits, you can use
    all 64 registers for any operation, but how is the "upper" method that
    better than automatically using r(x+1)?

    Yes, but it is just a suggested usage. The registers are GPRs that can
    be used for anything, specified using a six bit register number. I
    suggested it that way because most of the time register values would be passed around as 64-bit quantities and it keeps the same set of
    registers for the same register type (argument, temp, saved). But since
    it should be using mostly compiled code, it does not make much difference.

    Also, the high registers could be used as FP registers. Maybe allowing
    for saving only the low order 32 regs during a context switch.

    GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch
    on bit-set/clear for conditional branches. Might also include branch
    true / false.

    Using operand routing for immediate constants and an operation size
    for the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be
    10,50,90 or 130 bits.

    Those seem like a call from the My 66000 playbook, which I like.

    Yup.


    I should mention that the high registers are available only in user/app
    mode. For other modes of operation only the low order 32 registers are available. I did this to reduce the number of logical registers in the
    design. There are about 160 (64+32+32+32) logical registers then. They
    are supported by 512 physical registers. My previous design had 224
    logical registers which eats up more hardware, probably for little benefit.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Oct 29 17:44:14 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Do you have 5 or 6
    bit register numbers in the instructions. Five allows you to use the
    high registers for 128 bit operations without needing another register specifier, but then the high registers can only be used for 128 bit operations, which seems a waste.

    These days, that's not so clear. E.g., Zen4 has 192 physical 512-bit
    SIMD registers, despite having only 256-bit wide FUs. The way I
    understand it, a 512-bit operation comes as one uop to the FU,
    occupies it for two cycles (and of course the result latency is
    extra), and then has a 512-bit result.

    The alternative would be to do as AMD did in some earlier cores,
    starting with (I think) K8: have registers that are half as wide and
    split each 512-bit operation into 2 256-bit uops that go through the
    OoO engine individually. This approach would allow more physical
    256-bit registers, and waste less on 32-bit, 64-bit, 128-bit and
    256-bit operations, but would cost additional decoding bandwidth,
    renaming bandwidth, renaming checkpoint size (a little), and scheduler
    space compared with the approach AMD have taken. Apparently the cost of this
    approach is higher than the benefit.

    Doubling the logical register size doubles the renamer checkpoint
    size, no? This way of avoiding "waste" looks quite a bit more
    expensive.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 13:04:42 2025
    From Newsgroup: comp.arch

    On 10/29/2025 7:50 AM, Robert Finch wrote:
    On 2025-10-29 8:41 a.m., Robert Finch wrote:
    On 2025-10-29 3:14 a.m., Stephen Fuld wrote:
    On 10/28/2025 8:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit
    instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops.
    Registers are named as if there were 32 GPRs, A0 (arg 0 register is
    r1) and A0H (arg 0 high is r33). Sameo for other registers.

    I assume the "high" registers are for handling 128 bit operations
    without the need to specify another register name.  Do you have 5 or
    6 bit register numbers in the instructions.  Five allows you to use
    the high registers for 128 bit operations without needing another
    register specifier, but then the high registers can only be used for
    128 bit operations, which seems a waste.  If you have six bits, you
    can use all 64 registers for any operation, but how is the "upper"
    method that better than automatically using r(x+1)?

    Yes, but it is just a suggested usage. The registers are GPRs that can
    be used for anything, specified using a six bit register number. I
    suggested it that way because most of the time register values would
    be passed around as 64-bit quantities and it keeps the same set of
    registers for the same register type (argument, temp, saved). But
    since it should be using mostly compiled code, it does not make much
    difference.

    Also, the high registers could be used as FP registers. Maybe allowing
    for saving only the low order 32 regs during a context switch.

    I am not as sure about this approach...

    Well, Low 32=GPR, High 32=FPR, makes sense, I did this.

    But, pairing a GPR and FPR for the 128-bit cases seems wonky; or
    subsetting registers on context switch seems like it could turn into a problem.


    Or, if a goal is to allow for encodings with a 5-bit register field,
    would make sense to use 32-bit encodings.

    Where, granted, 6b register fields in a 32-bit instruction does have the drawback of limiting how much encoding space exists for opcode and
    immediate (and one has to be more careful not to "waste" the encoding
    space as badly as RISC-V had done).

    Though, can note that both:
    R6+R6+Imm10
    R5+R5+Imm12
    Use the same amount of encoding space.
    But, R6+R6+R6 uses 3 bits more than R5+R5+R5.


    Though, one could debate my case, as I did effectively end up burning
    1/4 of the total encoding space mostly on Jumbo prefixes.

    ...



    GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a
    branch on bit-set/clear for conditional branches. Might also include
    branch true / false.

    Using operand routing for immediate constants and an operation size
    for the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be
    10,50,90 or 130 bits.

    Those seem like a call from the My 66000 playbook, which I like.

    Yup.


    I should mention that the high registers are available only in user/app mode. For other modes of operation only the low order 32 registers are available. I did this to reduce the number of logical registers in the design. There are about 160 (64+32+32+32) logical registers then. They
    are supported by 512 physical registers. My previous design had 224
    logical registers which eats up more hardware, probably for little benefit.


    FWIW: I have gotten by OK with 128 internal registers:
    00..3F: Array-Mapped Registers (mostly the GPRs)
    40..7F: CRs and SPRs

    Mostly sufficient.

    For the array-mapped registers, these ones use LUTRAM, with a logical
    copy of the array per write port, and some control bits to encode which
    array currently holds the up-to-date copy of the register.

    All this gets internally replicated for each read port.

    So, roughly 18 internal copies of all of the registers with 6R3W, but
    this is unavoidable (since LUTRAMs are 1R1W).


    The other option is using flip-flops, which is the strategy mostly used
    for the writable CRs and SPRs. This is done sparingly as the resource
    cost is higher in this case (at least on Xilinx, *).

    *: Things went amiss on Altera; when I tried to build on it, I needed
    to use FFs for all the GPRs as well, as these FPGAs lack a direct
    equivalent of LUTRAMs and instead have smaller Block RAMs. The
    Lattice FPGAs also lack LUTRAM IIRC (but, my core doesn't map as well to Lattice FPGAs either).


    As for the CR/SPR space:
    Some of it is used for writable registers;
    A big chunk is used for internal read-only registers.
    ZZR, IMM, IMMB, JIMM, etc.
    ZZR: Zero Register / Null Register (Write)
    IMM: Immediate for current lane (33-bit, sign-ext).
    IMMB: Immediate from Lane 3.
    JIMM: 64-bit immediate spanning Lanes A and B.
    ...

    Could also be seen as C0..C63 (or, all control registers) except that
    much of C32..C63 is used for internal read-only SPRs, and a few other
    SPRs (DLR, DHR, and SP).

    Originally, the CRs and SPRs were handled as separate, but now things
    have gotten fuzzy (and, for RISC-V, some of the CRs need to be accessed
    in GPR like ways).

    There is some wonk as they were handled as separate modules, but with
    the current way things are done it would almost make more sense to fold
    all of the CRs into the GPR file module.

    The module might also continue to deal with forwarding, but might also
    make sense to have a RegisterFile module, possibly with a disjoint
    "Register Forwarding And Interlocks" style module (which forwards
    registers if the value is available and signals pipeline stalls as
    needed; this logic currently partly handled by the existing
    register-file module).



    Did experiment with a mechanism to allow bank-swapped registers. This
    would have added an internal 2-bit mode for the registers, and would
    stall the pipeline to swap the current registers with their bank-swapped versions if needed (with the registers internally backed to Block-RAM).
    Ended up mostly not using this though (at best, it wouldn't gain much
    over the existing "Load and Store everything to RAM" strategy; and would
    make context switching slower than it is already).

    It is more likely that a practical mechanism for fast bank swapping
    would need a mechanism to bank-swap the registers to external RAM. Or
    maybe a special "Stall and dump all the registers to this RAM Address" instruction.


    For the RISC-V CSRs:
    Part of the space maps to the CRs, and part maps to CPUID;
    For pretty much everything else, it traps.
    So, pretty much all of the normal RISC-V CSRs will trap.

    Ended up trapping for the RISC-V FPU CSRs as well:
    Rarely accessed;
    Rather than just one CSR for the FPU status, they broke it up into
    multiple sub-registers for parts of the register (like, there is a
    special CSR just for the rounding-mode, ...).

    Also the hardware only supports moving to/from a CR, so any more complex scenarios will also trap. They had gotten a little too fancy with this
    stuff IMO.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Oct 29 18:15:42 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some
    alignment that the first instruction of a cache line is always aligned?

    Having register pairs does not make the compiler writer's life easier, unfortunately.

    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.

    Having 64 registers and 64 bit registers makes life easier for that
    particular task :-)

    If you have that many bits available, do you still go for a load-store architecture, or do you have memory operations? This could offset the
    larger size of your instructions.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those sizes are not really a good fit for constants from programs,
    where quite a few constants tend to be 32 or 64 bits. Would a
    64-bit FP constant leave 26 bits empty?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Oct 29 11:29:54 2025
    From Newsgroup: comp.arch

    On 10/29/2025 10:44 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Do you have 5 or 6
    bit register numbers in the instructions. Five allows you to use the
    high registers for 128 bit operations without needing another register
    specifier, but then the high registers can only be used for 128 bit
    operations, which seems a waste.

    At this point, the discussion is academic, as Robert has said he has 6
    bit register specifiers in the instructions. But my issue had nothing
    to do with SIMD registers, as he said he supported 128 bit arithmetic
    and the "high" registers were used for that. e.g.

    Add A1,A2,A3 would be a 64 bit add on those registers but
    Add128 A1,A2,A3 would be a 128 bit add using A1H for the high order
    bits of the destination, etc. So the question becomes how is using
    Rn+32 better than using Rn+1?

    That being said, your points are well taken for a different implementation.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 29 18:33:46 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.

    I have both the bit-vector compare and branch, but also a compare to zero
    and branch as a single instruction. I suggest you should too, if for no
    other reason than:

    if( p && p->next )

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    My 66000 allows for occasional use of 128-bit values but is designed mainly
    for 64-bit and smaller.

    With 32-bit instructions, I provide {5, 16, 32, and 64}-bit constants.

    Just last week we discovered a case where HW can do a better job than SW. Previously, the compiler would emit:

    CVTfd Rt,Rf
    FMUL Rt,Rt,#1.425D0
    CVTdf Rd,Rt

    Which is subject to double rounding once at the FMUL and again at the
    down conversion. I thought about the problem and it seems fairly easy
    to gate the 24-bit fraction into the multiplier tree along with the
    53-bit fraction of the constant, and then normalize and round the
    result dropping out of the tree--avoiding the double rounding case.

    Now, the compiler emits:

    FMULf Rd,Rf,#1.425D0

    saving 2 instructions along with the higher precision.
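
    (For anyone who hasn't run into double rounding: the effect can be shown in
    plain C with a constructed value. This is a minimal sketch using a contrived
    exact product of 1 + 2^-24 + 2^-54; it is not tied to the actual #1.425D0
    constant above.)

        #include <stdio.h>

        int main(void)
        {
            /* Pretend the exact product is 1 + 2^-24 + 2^-54, just above the
               tie point between the two nearest binary32 values 1.0f and
               1.0f + 2^-23.  Correct single rounding gives 1.0f + 2^-23. */

            /* Round to binary64 first: the 2^-54 tail is a quarter of a
               binary64 ulp, so this rounds down to exactly 1 + 2^-24. */
            double to_double = 1.0 + 0x1p-24 + 0x1p-54;

            /* Round again to binary32: 1 + 2^-24 is now an exact tie, and
               round-to-nearest-even picks 1.0f.  Two roundings lose the bit. */
            float twice_rounded = (float)to_double;

            /* The correctly single-rounded binary32 result of the exact value. */
            float once_rounded = 1.0f + 0x1p-23f;

            printf("double rounded: %a\n", (double)twice_rounded); /* 0x1p+0 */
            printf("single rounded: %a\n", (double)once_rounded);  /* 0x1.000002p+0 */
            return 0;
        }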
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 29 18:47:09 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 10/28/2025 10:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or floating-point values.


    OK.

    I mostly stuck with 32-bit encodings, but 40 could maybe allow more
    encoding space, but the drawback of being non-power-of-2.

    it is definitely an issue.

    But, yeah, occasionally dealing with 128-bit data is a major case for 64 GPRs and paired-registers registers.

    There is always the DBLE pseudo-instruction.

    DBLE Rd,Rs1,Rs2,Rs3

    All DBLE does is to provide more registers for the wide computation
    in such a way that compiler is not forced to pair or share any reg-
    isters. The other thing DBLE does is to tell the decoder that the
    next instruction is 2× as wide as its OpCode states. In lower end
    machines (and in GPUs) DBLE is sequenced as if it were an instruction.
    In higher end machines, DBLE would be CoIssued with its mate.

    ----------

    My case: 10/33/64.
    No direct 128-bit constant, but can use two 64-bit constants whenever
    128 bits is needed.

    {5, 16, 32, 64}-bit immediates.



    Otherwise, goings on in my land:
    ISA development is slow, and had mostly turned into bug hunting;
    <snip>

    The longer term future is uncertain.


    My ISA's can beat RISC-V in terms of code-density and performance, but
    when when RISC-V is extended with similar features, it is harder to make
    a case that it is "enough".

    I am still running at 70% of RISC-V's instruction count.

    Doesn't seem like (within the ISA) there are many obvious ways left to
    grab large general-case performance gains over what I have done already.

    Fewer instructions, and/or instructions that take fewer cycles to execute.

    Example, ENTER and EXIT instructions move 4 registers per cycle to/from
    cache in a pipeline that has 1 result per cycle.

    Some code benefits from lots of GPRs, but harder to make the case that
    it reflects the general case.

    There is very little to be gained with that many registers.

    Recently got a new very-cheap laptop (a Dell Latitude 7490, for around $240), made some curious observations:
    It seems to slightly outperform my main PC in single-threaded performance; Its RAM timings don't seem to match the expected values.

    My main PC still wins at multi-threaded performance, and has the
    advantage of 7x more RAM.

    My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 14:02:32 2025
    From Newsgroup: comp.arch

    On 10/29/2025 1:15 PM, Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some alignment that the first instruction of a cache line is always aligned?

    Having register pairs does not make the compiler writer's life easier, unfortunately.


    Yeah, and from the compiler POV, would likely prefer having Even+Odd pairs.

    Going with a bit result vector in any GPR for compares, then a branch on
    bit-set/clear for conditional branches. Might also include branch true /
    false.

    Having 64 registers and 64 bit registers makes life easier for that particular task :-)

    If you have that many bits available, do you still go for a load-store architecture, or do you have memory operations? This could offset the
    larger size of your instructions.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those sizes are not really a good fit for constants from programs,
    where quite a few constants tend to be 32 or 64 bits. Would a
    64-bit FP constant leave 26 bits empty?


    Agreed.

    From what I have seen, the vast bulk of constants tend to come in
    several major clusters:
    0 to 511: The bulk of all constants (peaks near 0, geometric fall-off)
    -64 to -1: Much of what falls outside 0 to 511.
    -32768 to 65535: Second major group
    -2G to +4G: Third group (smaller than second)
    64-bit: Another smaller spike.

    For values between 512 and 16384: Sparsely populated.
    Mostly the continued geometric fall-off from the near-0 peak.
    Likewise for values between 65536 and 1G.
    Values between 4G and 4E tend to be mostly unused.

    Like, in the sense of, if you have 33-bit vs 52 or 56-bit for a
    constant, the larger constants would have very little advantage (in
    terms of statistical hit rate) over the 33 bit constant (and, it isn't
    until you reach 64 bits that it suddenly becomes worthwhile again).


    Partly why I go with 33 bit immediate fields in the pipeline in my core,
    but nothing much bigger or smaller:
    Slightly smaller misses out on a lot, so almost may as well drop back to
    17 in this case;
    Going slightly bigger would gain pretty much nothing.

    Like, in the latter case, does sort of almost turn into a "go all the
    way to 64 bits or don't bother" thing.
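
    A toy version of the selection logic that implies, as a minimal sketch
    (hypothetical helper, not BGBCC's actual code): pick the smallest of the
    17/33/64 tiers that reproduces the value by sign-extension. A sign-extended
    33-bit field also covers zero-extended 32-bit values, which is part of why
    33 beats 32 here.

        #include <stdio.h>
        #include <stdint.h>

        static int ImmWidthNeeded(int64_t v)
        {
            if (v >= -(INT64_C(1) << 16) && v < (INT64_C(1) << 16))
                return 17;   /* catches the 0..511, -64..-1, -32768..65535 clusters */
            if (v >= -(INT64_C(1) << 32) && v < (INT64_C(1) << 32))
                return 33;   /* catches the -2G..+4G cluster */
            return 64;       /* everything else: go straight to 64 bits */
        }

        int main(void)
        {
            int64_t samples[] = { 7, -3, 40000, -70000, 0xFFFFFFFF,
                                  INT64_C(0x123456789A) };
            for (int i = 0; i < 6; i++)
                printf("%lld -> %d-bit immediate\n",
                       (long long)samples[i], ImmWidthNeeded(samples[i]));
            return 0;
        }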


    That said, I do use a 48-bit address space, so while in concept 48-bits
    could be useful for pointers: This is statistically insignificant in an
    ISA which doesn't encode absolute addresses in instructions.

    So, ironically, there are a lot of 48-bit values around, just pretty
    much none of them being encoded via instructions.


    Kind of a similar situation to function argument counts:
    8 arguments: Most of the functions;
    12: Vast majority of them;
    16: Often only a few stragglers remain.

    So, 16 gets like 99.95% of the functions, but maybe there are a few
    isolated ones taking 20+ arguments lurking somewhere in the code. One
    would then need to go up to 32 arguments to have reasonable confidence
    of "100%" coverage.

    Or, impose an arbitrary limit, where the stragglers would need to be
    modified to pass arguments using a struct or something.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Oct 29 13:05:08 2025
    From Newsgroup: comp.arch

    On 10/29/2025 11:47 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    snip
    But, yeah, occasionally dealing with 128-bit data is a major case for 64
    GPRs and paired-registers registers.

    There is always the DBLE pseudo-instruction.

    DBLE Rd,Rs1,Rs2,Rs3

    All DBLE does is to provide more registers for the wide computation
    in such a way that compiler is not forced to pair or share any reg-
    isters. The other thing DBLE does is to tell the decoder that the
    next instruction is 2× as wide as its OpCode states. In lower end
    machines (and in GPUs) DBLE is sequenced as if it were an instruction.
    In higher end machines, DBLE would be CoIssued with its mate.

    So if DBLE says the next instruction is double width, does that mean
    that all "128 bit instructions" require 64 bits in the instruction
    stream? So a sequence of say four 128 bit arithmetic instructions would require the I space of 8 instructions?

    If so, I guess it is a tradeoff for not requiring register pairing, e.g.
    Rn and Rn+1.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 15:58:40 2025
    From Newsgroup: comp.arch

    On 10/29/2025 1:47 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 10/28/2025 10:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.


    OK.

    I mostly stuck with 32-bit encodings, but 40 could maybe allow more
    encoding space, but the drawback of being non-power-of-2.

    it is definitely an issue.

    But, yeah, occasionally dealing with 128-bit data is a major case for 64
    GPRs and paired-registers registers.

    There is always the DBLE pseudo-instruction.

    DBLE Rd,Rs1,Rs2,Rs3

    All DBLE does is to provide more registers for the wide computation
    in such a way that compiler is not forced to pair or share any reg-
    isters. The other thing DBLE does is to tell the decoder that the
    next instruction is 2× as wide as its OpCode states. In lower end
    machines (and in GPUs) DBLE is sequenced as if it were an instruction.
    In higher end machines, DBLE would be CoIssued with its mate.


    OK.

    In my case, a lot of the 128-bit operations are a single 32-bit
    instruction, which splits (in decode) into operations spanning multiple lanes (using
    the 6R3W register file as a virtual 3R1W 128-bit register file).

    In some cases, pairs of 64-bit SIMD instructions may be merged to send
    both through the SIMD unit at the same time. Say, as a special-case
    co-issue for 2x Binary32 ops (which can basically be handled the same as
    the 4x Binary32 scenario by the SIMD unit).

    ----------

    My case: 10/33/64.
    No direct 128-bit constant, but can use two 64-bit constants whenever
    128 bits is needed.

    {5, 16, 32, 64}-bit immediates.


    The reason 17 and 33 ended up slightly preferable is that both
    zero-extended and sign-extended 16 and 32 bit values are fairly common.

    And, if one has both a zero and sign extended immediate, this eats the
    same encoding space as having a 17-bit immediate, or a separate
    zero-extended and one-extended variant.

    There are a few 5/6 bit immediate instructions, but I didn't really
    count them.

    XG3's equivalent of SLTI and similar only has Imm6 encodings (can be
    extended to 33 bits with a jumbo prefix).



    There isn't much need for a direct 128-bit immediate though:
    This case is exceedingly rare;
    Register-pairs basically make it a non-issue;
    Even if it were supported:
    This would still require a 24-byte encoding...
    Which, doesn't save anything over 2x 12-bytes.
    And doesn't gain much, apart from making CPU more expensive.

    Someone could maybe do 20 bytes by using a 128-bit memory load, but with
    the usual drawbacks of using a memory load (BGBCC doesn't usually do
    this). The memory load will have a higher latency than a pair of
    immediate instructions.





    Otherwise, goings on in my land:
    ISA development is slow, and had mostly turned into bug hunting;
    <snip>

    The longer term future is uncertain.


    My ISA's can beat RISC-V in terms of code-density and performance, but
    when when RISC-V is extended with similar features, it is harder to make
    a case that it is "enough".

    I am still running at 70% RISC-Vs instruction count.


    Basically similar.

    XG3 also uses only 70% as many instructions as RV64G.

    But, if you throw Indexed Load/Store, Load/Store Pair, Jumbo Prefixes,
    etc, at the problem (on top of RISC-V), suddenly RISC-V becomes a lot
    more competitive (30% smaller and 50% faster).

    Not found a good way to improve much over this though...


    But, yeah, if comparing against RV64G as it exists in its standard form,
    there is a bit of room for improvement.



    Doesn't seem like (within the ISA) there are many obvious ways left to
    grab large general-case performance gains over what I have done already.

    Fewer instructions, and or instructions that take fewer cycles to execute.

    Example, ENTER and EXIT instructions move 4 registers per cycle to/from
    cache in a pipeline that has 1 result per cycle.

    Some code benefits from lots of GPRs, but harder to make the case that
    it reflects the general case.

    There is very little to be gained with that many registers.


    Granted.

    The main thing it benefits is things like TKRA-GL, ...

    Doom basically sees no real difference between 32 and 64 GPRs (nor does SW-Quake).


    Mostly matters for code where one has functions with around 100+ local variables... which are uncommon outside of TKRA-GL or similar.


    As-is, SW-Quake is one of the cases that does well with RISC-V, though GL-Quake performs like hot dog-crap; mostly as TKRA-GL gets wrecked if
    it is limited to 32 registers and doesn't have SIMD.


    Only real saving point is when running with TKRA-GL over system calls in
    which case it runs in the kernel (as XG1) which is slightly less bad.
    For reasons, TestKern kinda still needs to be built as XG1.


    Recently got a new very-cheap laptop (a Dell Latitude 7490, for around
    $240), made some curious observations:
    It seems to slightly outperform my main PC in single-threaded performance; Its RAM timings don't seem to match the expected values.

    My main PC still wins at multi-threaded performance, and has the
    advantage of 7x more RAM.

    My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.

    Desktop PC:
    8C/16T: 3.7 Base, 4.3 Turbo, 112GB RAM (just, not very fast RAM)
    Rarely reaches turbo
    pretty much only happens if just running a single thread...
    With all cores running stuff in the background:
    Idles around 3.6 to 3.8.

    Laptop:
    4C/8T, 1.9 GHz Base, 4.2 GHz Turbo
    If power set to performance, reaches turbo a lot more easily,
    and with multi-core workloads.
    But, puts out a lot of heat while doing so...

    If set to Efficiency, mostly stays below 3 GHz.

    As noted, the laptop is surprisingly speedy for how cheap it was.

    For $240 I was paranoid it still might not have been fast enough to run Minecraft...


    Still annoyed as the RAM claimed like DDR4-3200 on the box, but doesn't
    run reliably at more than DDR4-2133... Like, you can try 3200 if you
    don't mind the computer blue-screening after a few minutes I guess...



    But, without much RAM, nor enough SSD space to set up a huge pagefile,
    not going to try compiling LLVM on the thing.

    Even with all the RAM, a full rebuild of LLVM still takes several hours
    on my main PC (though, trying to build LLVM or GCC is at least slightly
    faster if one tells the AV software to stop grinding the CPU by looking
    at every file accessed).


    Vs the $80 OptiPlex that came with a 2C/4T Core i3 variant, that wasn't particularly snappy (seemed on-par with the Vista era laptop; though
    this has a 2C/2T CPU).

    Basically, was a small PC that was using mostly laptop-style parts
    internally (laptop DVD-RW drive and laptop style HDD); some sort of ITX
    MOBO layout I think.

    I don't remember there being any card slots; so like if you want to
    install a PCIe card or similar, basically SOL.

    But, it was either this or an off-brand NUC clone...


    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 29 21:52:54 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 10/29/2025 11:47 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    snip
    But, yeah, occasionally dealing with 128-bit data is a major case for 64 GPRs and paired-registers registers.

    There is always the DBLE pseudo-instruction.

    DBLE Rd,Rs1,Rs2,Rs3

    All DBLE does is to provide more registers for the wide computation
    in such a way that compiler is not forced to pair or share any reg-
    isters. The other thing DBLE does is to tell the decoder that the
    next instruction is 2× as wide as its OpCode states. In lower end
    machines (and in GPUs) DBLE is sequenced as if it were an instruction.
    In higher end machines, DBLE would be CoIssued with its mate.

    So if DBLE says the next instruction is double width, does that mean
    that all "128 bit instructions" require 64 bits in the instruction
    stream? So a sequence of say four 128 bit arithmetic instructions would require the I space of 8 instructions?

    It is a 64-bit machine that provides a small modicum of support for
    larger sizes. It is not and never will be a 128-bit machine--that is
    what vVM is for.

    Key words "small modicum"

    DBLE simply supplies registers to the pipeline and width to decode.

    If so, I guess it is a tradeoff for not requiring register pairing, e.g.
    Rn and Rn+1.

    DBLE supports 128-bits in the ISA at the total cost of 1 instruction
    added per use. In many situations (especially integer) CARRY is the
    better option because it throws a shadow of width over a number of
    instructions and thereby has lower code footprint costs. So, a 256
    bit shift is only 5 instructions instead of 8. And realistically, if
    you want wider than that, you have already run out of registers.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 18:01:17 2025
    From Newsgroup: comp.arch

    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some alignment that the first instruction of a cache line is always aligned?

    The 40-bit instructions are byte aligned. This does add more shifting in
    the align stage. Once shifted, though, instructions are easily peeled off
    from fixed positions. One consequence is jump targets must be byte
    aligned, OR routines could be required to be 32-bit aligned, for instance.
    Having register pairs does not make the compiler writer's life easier, unfortunately.

    Going with a bit result vector in any GPR for compares, then a branch on
    bit-set/clear for conditional branches. Might also include branch true /
    false.

    Having 64 registers and 64 bit registers makes life easier for that particular task :-)

    If you have that many bits available, do you still go for a load-store architecture, or do you have memory operations? This could offset the
    larger size of your instructions.

    It is load/store with no memory ops, except possibly atomic memory ops.
    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those sizes are not really a good fit for constants from programs,
    where quite a few constants tend to be 32 or 64 bits. Would a
    64-bit FP constant leave 26 bits empty?

    I found that 16-bit immediates could be encoded instead of 10-bit.
    So, now 16, 56, 96 and 136 bit constants are possible. The 56-bit
    constant likely has enough range for most 64-bit ops. Otherwise, using
    a 96-bit constant for 64-bit ops would leave the upper 32 bits of the
    constant unused. 136-bit constants may not be implemented, but a size
    code is reserved for that size.
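
    As a minor aside, the check an assembler or compiler might use to pick
    the 56-bit form for a 64-bit operand is cheap (an illustrative sketch;
    it assumes the constant field is sign-extended, which is a guess rather
    than the defined encoding rule):

        #include <stdint.h>

        /* does v survive truncation to 56 bits plus sign extension? */
        static int fits_sext56(int64_t v)
        {
            return v >= -(INT64_C(1) << 55) && v < (INT64_C(1) << 55);
        }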


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 18:20:51 2025
    From Newsgroup: comp.arch

    On 2025-10-29 2:33 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on
    bit-set/clear for conditional branches. Might also include branch true /
    false.

    I have both the bit-vector compare and branch, but also a compare to zero
    and branch as a single instruction. I suggest you should too, if for no
    other reason than:

    if( p && p->next )


    Yes, I was going to have at least branch on register 0 (false) 1 (true)
    as there is encoding room to support it. It does add more cases in the
    branch eval, but is probably well worth it.
    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    My 66000 allows for occasional use of 128-bit values but is designed mainly for 64-bit and smaller.


    Following the same philosophy. Expecting only some use for 128-bit
    floats. Integers can only handle 8,16,32, or 64-bits.

    With 32-bit instructions, I provide, {5, 16, 32, and 64}-bit constants.

    Just last week we discovered a case where HW can do a better job than SW. Previously, the compiler would emit:

    CVTfd Rt,Rf
    FMUL Rt,Rt,#1.425D0
    CVTdf Rd,Rt

    Which is subject to double rounding once at the FMUL and again at the
    down conversion. I though about the problem and it seems fairly easy
    to gate the 24-bit fraction into the multiplier tree along with the
    53-bit fraction of the constant, and then normalize and round the
    result dropping out of the tree--avoiding the double rounding case.

    Now, the compiler emits:

    FMULf Rd,Rf,#1.425D0

    saving 2 instructions alongwith the higher precision.

    Improves the accuracy(?) of algorithms, but seems a bit specific to me.
    Are there other instruction sequences where double rounding would be good
    to avoid? Seems like HW could detect the sequence and fuse the instructions.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 18:26:05 2025
    From Newsgroup: comp.arch

    <snip>>> My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.

    Desktop PC:
      8C/16T: 3.7 Base, 4.3 Turbo, 112GB RAM (just, not very fast RAM)
        Rarely reaches turbo
          pretty much only happens if just running a single thread...
        With all cores running stuff in the background:
          Idles around 3.6 to 3.8.

    Laptop:
      4C/8T, 1.9 GHz Base, 4.2 GHz Turbo
        If power set to performance, reaches turbo a lot more easily,
          and with multi-core workloads.
        But, puts out a lot of heat while doing so...

    If set to Efficiency, mostly stays below 3 GHz.

    As noted, the laptop is surprisingly speedy for how cheap it was.

    <snip>
    For my latest PC I bought a gaming machine – i7-14700KF CPU (20 cores),
    32 GB RAM, 16 GB graphics RAM, 3.4 GHz (5.6 GHz in turbo mode). More RAM
    was needed; my last machine only had 16 GB, and I found it using about
    20 GB. I did not want to spring for a machine with even more RAM, as
    those tended to be high-end machines.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Oct 29 22:31:12 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    At this point, the discussion is academic, as Robert has said he has 6
    bit register specifiers in the instructions.

    He could still make these registers have 128 bits rather than pairing
    registers for 128-bit operation.

    But my issue had nothing
    to do with SIMD registers, as he said he supported 128 bit arithmetic
    and the "high" registers were used for that.

    As far as waste etc. is concerned, it does not matter if the 128-bit
    operation is a SIMD operation or a scalar 128-bit operation.

    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 18:48:56 2025
    From Newsgroup: comp.arch

    On 10/29/2025 5:26 PM, Robert Finch wrote:
    <snip>>> My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.

    Desktop PC:
       8C/16T: 3.7 Base, 4.3 Turbo, 112GB RAM (just, not very fast RAM)
         Rarely reaches turbo
           pretty much only happens if just running a single thread...
         With all cores running stuff in the background:
           Idles around 3.6 to 3.8.

    Laptop:
       4C/8T, 1.9 GHz Base, 4.2 GHz Turbo
         If power set to performance, reaches turbo a lot more easily,
           and with multi-core workloads.
         But, puts out a lot of heat while doing so...

    If set to Efficiency, mostly stays below 3 GHz.

    As noted, the laptop is surprisingly speedy for how cheap it was.

    <snip>
    For my latest PC I bought a gaming machine – i7-14700KF CPU (20 cores).
    32 GB RAM, 16GB graphics RAM. 3.4 GHz (5.6 GHz in turbo mode). More RAM
    was needed, my last machine only had 16GB, found it using about 20GB. I
    did not want to spring for a machine with even more RAM, they tended to
    be high-end machines.


    IIRC, current PC was something like:
    CPU: $80 (Zen+; Zen 2 and 3 were around, but more expensive)
    MOBO: $60
    Case: $50
    ...

    Spent around $200 for 128GB of RAM.
    Could have gotten a cheaper 64GB kit had I known my MOBO would not
    accept a full 128GB (then could have had 96 GB).


    The RTX card I have (RTX 3060) has 12 GB of VRAM.

    IIRC, it was also about the cheapest semi-modern graphics card I could
    find at the time. Like, while I could have bought an RTX 4090 or similar
    at the time, I am not made of money.

    Like, a prior-generation mid-range card being the cheaper option.
    And, still newer than the GTX980 that had died on me (where, the GTX980
    was itself second-hand).


    Before this, had been running a GTX 460, and before that, a Radeon HD
    4850 (IIRC).

    I think it was a case of:
    Had a Phenom II box, with the HD 4850;
    Switched to GTX 460, as I got one second-hand for free, slightly better;
    Replaced Phenom II board+CPU with FX-8350;
    Got GTX 980 (also second hand);
    Got Ryzen 7 2700X and new MOBO;
    Got RTX 3060 (as the 980 was failing).

    With the RTX 3060, had to go single-monitor, mostly as it only has
    DisplayPort outputs, and DP->HDMI->DVI via adapters doesn't seem to work (whereas HDMI->DVI did work via adapters).

    Well, also the RTX 3060 doesn't have a VGA output either (monitor would
    also accept VGA).

    Though, the current monitor I am using is newer and does support
    DisplayPort.


    I also managed to get a MultiSync CRT a while ago, but it only really
    gives good results at 640x480 and 800x600, 1024x768 sorta-works (but
    1280x1024 does not work), has a roughly 16" CRT or so; VGA input.

    I also have an LCD that goes up to 1280x1024, although it looks like
    garbage if set above 1024x768. Only accepts VGA.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Oct 30 07:13:54 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, >>> 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some
    alignment that the first instruction of a cache line is always aligned?

    The 40-bit instructions are byte aligned. This does add more shifting in
    the align stage. Once shifted though instructions are easily peeled off
    from fixed positions. One consequence is jump targets must be byte
    aligned OR routines could be required to be 32-bit aligned for instance.>

    That raises an interesting question. If you want to align a branch
    target on a 32-bit boundary, or even a cache line, how do you fill
    up the rest? If all instructions are 40 bits, you cannot have a
    NOP that is not 40 bits, so there would need to be a jump before
    a gap that does not fit 40 bits.

    If you have that many bits available, do you still go for a load-store
    architecture, or do you have memory operations? This could offset the
    larger size of your instructions.

    It is load/store with no memory ops excepting possibly atomic memory ops.>

    OK. Starting with 40 vs 32 bits, you have a factor of 1.25 disadvantage
    in code density to start with. Having memory operations could offset
    that by a certain factor, which was why I was asking.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those sizes are not really a good fit for constants from programs,
    where quite a few constants tend to be 32 or 64 bits. Would a
    64-bit FP constant leave 26 bits empty?

    I found that 16-bit immediates could be encoded instead of 10-bit.

    OK. That should also help for offsets in load/store.

    So, now there are 16,56,96 and 136 bit constants possible. The 56-bitconstant likely has enough range for most 64-bit ops.

    For addresses, it will take some time for this to overflow :-)
    For floating point constants, that will be hard.

    I have done some analysis on frequency of floating point constants
    in different programs, and what I found was that there are a few
    floating point constants that keep coming up, like a few integers
    around zero (biased towards the positive side), plus a few more
    golden oldies like 0.5, 1.5 and pi. Apart from that, I found that
    different programs have wildly different floating point constants,
    which is not surprising. (I based that analysis on the grand
    total of three packages, namely Perl, gnuplot and GSL, so coverage
    is not really extensive).

    Otherwise using
    a 96-bit constant for 64-bit ops would leave the upper 32-bit of the constant unused.

    There are also 32-bit floating point constants, and 32-bit integers
    as constants. There are also very many small integer constants, but
    of course there also could be others.

    136 bit constants may not be implemented, but a size
    code is reserved for that size.

    I'm still hoping for good 128-bit IEEE hardware float support.
    POWER has this, but stuck it on their decimal float arithmetic,
    which does not perform particularly well...
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Oct 30 13:53:04 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, >>>> 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named >>>> as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high >>>> is r33). Sameo for other registers. GPRs may contain either integer or >>>> floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some
    alignment that the first instruction of a cache line is always aligned?

    The 40-bit instructions are byte aligned. This does add more shifting in
    the align stage. Once shifted though instructions are easily peeled off
    from fixed positions. One consequence is jump targets must be byte
    aligned OR routines could be required to be 32-bit aligned for instance.>

    That raises an interesting question. If you want to align a branch
    target on a 32-bit boundary, or even a cache line, how do you fill
    up the rest? If all instructions are 40 bits, you cannot have a
    NOP that is not 40 bits, so there would need to be a jump before
    a gap that is does not fit 40 bits.

    iCache lines could be a multiple of 5-bytes in size (e.g. 80 bytes
    instead of 64).

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 30 16:09:00 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-10-29 2:33 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, >> 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on >> bit-set/clear for conditional branches. Might also include branch true / >> false.

    I have both the bit-vector compare and branch, but also a compare to zero and branch as a single instruction. I suggest you should too, if for no other reason than:

    if( p && p->next )


    Yes, I was going to have at least branch on register 0 (false) 1 (true)
    as there is encoding room to support it. It does add more cases in the branch eval, but is probably well worth it.
    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    My 66000 allows for occasional use of 128-bit values but is designed mainly for 64-bit and smaller.


    Following the same philosophy. Expecting only some use for 128-bit
    floats. Integers can only handle 8,16,32, or 64-bits.

    With 32-bit instructions, I provide, {5, 16, 32, and 64}-bit constants.

    Just last week we discovered a case where HW can do a better job than SW. Previously, the compiler would emit:

    CVTfd Rt,Rf
    FMUL Rt,Rt,#1.425D0
    CVTdf Rd,Rt

    Which is subject to double rounding once at the FMUL and again at the
    down conversion. I though about the problem and it seems fairly easy
    to gate the 24-bit fraction into the multiplier tree along with the
    53-bit fraction of the constant, and then normalize and round the
    result dropping out of the tree--avoiding the double rounding case.

    Now, the compiler emits:

    FMULf Rd,Rf,#1.425D0

    saving 2 instructions along with the higher precision.

    Improves the accuracy? of algorithms, but seems a bit specific to me.

    It is down in the 1% footprint area.

    Are there other instruction sequence where double-rounding would be good
    to avoid?

    Back when I joined Moto (1983) there was a lot of talk about double
    roundings and how it could screw up various algorithms but mainly in
    the 64-bit versus 80-bit stuff of 68881, where you got 11 more bits
    of precision and thus took a chance of 2/2^10 of a double rounding.
    Today with 32-bit versus 64-bit you take a chance of 2/2^28 so the
    problem is greatly ameliorated although technically still present.

    The problem arises due to a cross product of various {machine,
    language, compiler} features not working "all ends towards the middle".

    LLVM promotes FP calculations with a constant to 64-bits whenever the
    constant cannot be represented exactly in 32-bits. {Strike one}

    C makes no <useful> statements about precision of calculation control.
    {strike two}

    HW almost never provides mixed mode calculations which provide the
    means to avoid the double rounding. {strike three}

    So, technically, My 66000 does not provide general-mixed-mode FP,
    but I wrote a special rule to allow for larger constants used with
    narrower registers to cover exactly this case. {It also saves 2 CVT
    instructions (latency and footprint).}
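
    At the C level the pattern in question is simply (an illustration using
    plain C promotion rules rather than anything LLVM-specific):

        float scale(float x)
        {
            /* 1.425 is a double constant, so x is promoted, the multiply is
               done and rounded in double, and the return value is rounded
               again to float: two roundings.  A single-rounded mixed-width
               FMUL collapses this to one rounding and drops both CVTs. */
            return x * 1.425;
        }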

    Seems like HW could detect the sequence and fuse the instructions.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 30 16:10:47 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    At this point, the discussion is academic, as Robert has said he has 6
    bit register specifiers in the instructions.

    He could still make these registers have 128 bits rather than pairing registers for 128-bit operation.

    But my issue had nothing
    to do with SIMD registers, as he said he supported 128 bit arithmetic
    and the "high" registers were used for that.

    As far as waste etc. is concerned, it does not matter if the 128-bit operation is a SIMD operation or a scalar 128-bit operation.

    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    Which only goes to prove that x86 is not RISC.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Oct 30 12:29:39 2025
    From Newsgroup: comp.arch

    On 10/30/2025 11:10 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    At this point, the discussion is academic, as Robert has said he has 6
    bit register specifiers in the instructions.

    He could still make these registers have 128 bits rather than pairing
    registers for 128-bit operation.


    Only really makes sense if one assumes these resources are "borderline
    free".

    If you are also paying for logic complexity and wires/routing, then
    having bigger registers just to typically waste most of them is not ideal.


    Granted, one could argue that most of the register is wasted when, say:
    Most integer values could easily fit into 16 bits;
    We have 64-bit registers.

    But, there is enough that actually uses the 64-bits of a 64-bit register
    to make it worthwhile. Would be harder to say the same for 128-bit
    registers.

    It is common on many 32-bit machines to use register pairs for 64-bit operations.


    But my issue had nothing
    to do with SIMD registers, as he said he supported 128 bit arithmetic
    and the "high" registers were used for that.

    As far as waste etc. is concerned, it does not matter if the 128-bit
    operation is a SIMD operation or a scalar 128-bit operation.

    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    Which only goes to prove that x86 is not RISC.


    Also questionable to read as someone lacking much hardware that actually supports 256 or 512-bit AVX on the actual HW level. And, both AVX and
    AVX-512 had not exactly had clean roll-outs.


    Checks and, ironically, my recent super-cheap laptop was the first thing
    I got that apparently has proper 256-bit AVX support (still no AVX-512 though...).


    Still some oddities though:
    RAM that appears to be faster than it should be;
    The MHz and CAS latency appear abnormally high.
    They do not match the values for DDR4-2400.
    (Nor, even DDR4 in general).
    Appears to exceed expected bandwidth on memcpy test;
    ...
    Windows 11 on an unsupported CPU model;
    More so, Windows 11 Professional, also on something cheap.
    (Listing said it would come with Win10, got Win11 instead, OK).

    So, technically seems good, but also slightly sus...


    Differs slightly from what I was expecting:
    Something kinda old and not super fast;
    Listing said Windows 10, kinda expected Windows 10;
    ...

    Like, something non-standard may have been done here.


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 30 16:46:14 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    Which only goes to prove that x86 is not IRSC.

    I don't see that following at all, but it inspired a closer look at
    the usage/waste of register bits in RISCs:

    Every 64-bit RISC, starting with MIPS-IV and Alpha, wastes a lot of
    precious register bits by keeping 8-bit, 16-bit, and 32-bit values in
    64-bit registers rather than following the idea of Intel and Robert
    Finch of splitting the 64-bit register into double the number of 32-bit
    registers; this idea can be extended to eliminate waste by having
    quadruple the number of 16-bit registers that can be joined into 32-bit
    and 64-bit registers when needed, or even better, octuple the number
    of 8-bit registers that can be joined into 16-bit, 32-bit, and 64-bit
    registers. We can even resurrect the character-oriented or
    digit-oriented architectures of the 1950s.

    Intel split AX into AL and AH, similar for BX, CX, and DX, but not for
    SI, DI, BP, and SP. In the 32-bit extension, they did not add ways to
    access the third and fourth byte, or the second wyde (16-bit value).
    In the 64-bit extension, AMD added ways to access the low byte of
    every register (in addition to AH-DH), but no way to access the second
    byte of other registers than RAX-RDX, nor ways to access higher wydes,
    or 32-bit units. Apparently they were not concerned about this kind
    of waste. For the 8086 the explanation is not trying to avoid waste,
    but an easy automatic mapping from 8080 code to 8086 code.

    Writing to AL-DL or AX-DX,SI,DI,BP,SP leaves the other bits of the
    32-bit register alone, which one can consider to be useful for storing
    data in those bits (and in case of AL, AH actually provides a
    convenient way to access some of the bits, and vice versa), but leads
    to partial-register stalls. The hardware contains fast paths for some
    common cases of partial-register writes, but AFAIK AH-DH do not get
    fast paths in most CPUs.

    By contrast, RISCs waste the other 24 or 56 bits on a byte load by zero-extending or sign-extending the byte.

    Alpha avoids wasting register bits for some idioms by keeping up to 8
    bytes in a register in SIMD style (a few years before the wave of SIMD extensions across the industry), but still provides no direct name for
    the individual bytes of a register.

    IIRC the original HPPA has 32 or so 64-bit FP registers, which they
    then split into 58? 32-bit FP registers. I don't know how they
    further evolved that feature.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Oct 30 17:58:34 2025
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, >>>>> 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named >>>>> as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high >>>>> is r33). Sameo for other registers. GPRs may contain either integer or >>>>> floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some
    alignment that the first instruction of a cache line is always aligned? >>>
    The 40-bit instructions are byte aligned. This does add more shifting in >>> the align stage. Once shifted though instructions are easily peeled off
    from fixed positions. One consequence is jump targets must be byte
    aligned OR routines could be required to be 32-bit aligned for instance.> >>
    That raises an interesting question. If you want to align a branch
    target on a 32-bit boundary, or even a cache line, how do you fill
    up the rest? If all instructions are 40 bits, you cannot have a
    NOP that is not 40 bits, so there would need to be a jump before
    a gap that is does not fit 40 bits.

    iCache lines could be a multiple of 5-bytes in size (e.g. 80 bytes
    instead of 64).

    There is a cache level (L2 usually, I believe) where icache and
    dcache are no longer separate. Wouldn't this cause problems
    or inefficiencies?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Oct 30 23:39:28 2025
    From Newsgroup: comp.arch

    On Thu, 30 Oct 2025 16:46:14 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    Alpha avoids wasting register bits for some idioms by keeping up to 8
    bytes in a register in SIMD style (a few years before the wave of SIMD extensions across the industry), but still provides no direct name for
    the individual bytes of a register.


    According to my understanding, EV4 had no SIMD-style instructions.
    They were introduced in EV5 (Jan 1995). Which makes it only ~6 months
    ahead of VIS in UltraSPARC.




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 30 22:00:50 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    Which only goes to prove that x86 is not RISC.

    I don't see that following at all, but it inspired a closer look at
    the usage/waste of register bits in RISCs:

    Every 64-bit RISC starting with MIPS-IV and Alpha, wastes a lot of
    precious register bits by keeping 8-bit, 16-bit, and 32-bit values in
    64-bit registers rather than following the idea of Intel and Robert
    Finch of splitting the 64-bit register in the double number of 32-bit registers; this idea can be extended to eliminate waste by having the quadruple number of 16-bit registers that can be joined into 32-bit
    anbd 64-bit registers when needed, or even better, the octuple number
    of 8-bit registers that can be joined to 16-bit, 32-bit, and 64-bit registers. We can even ressurrect the character-oriented or
    digit-oriented architectures of the 1950s.

    Consider that being able to address every 2^(3+n) field of a register
    is far from free. Take a simple add of 2 bytes::

    ADDB R8[7], R6[3], R19[4]

    One has to individually align each of the bytes, which is going to blow
    out your timing for forwarding by at least 3 gates of delay (operands)
    and 4 gates for the result (register). The only way it makes "timing"
    sense is if you restrict the patterns to::

    ADDB R8[7], R6[7], R19[7]

    Where there is no "vertical" routing in obtaining operands and delivering results. {{OR you could always just eat a latency cycle when all fields
    are not the same.}}

    I also suspect that you would find few compiler writers willing to support random fields in registers.

    Intel split AX into AL and AH, similar for BX, CX, and DX, but not for
    SI, DI, BP, and SP.

    {ABCD}X registers were data.
    {SDBS} registers were pointer registers.

    There are vanishingly few useful manipulations on part of pointers.

    Oh and BTW:: using x86-history as justification for an architectural
    feature is "bad style".

    In the 32-bit extension, they did not add ways to
    access the third and fourth byte, or the second wyde (16-bit value).
    In the 64-bit extension, AMD added ways to access the low byte of
    every register (in addition to AH-DH), but no way to access the second
    byte of other registers than RAX-RDX, nor ways to access higher wydes,
    or 32-bit units. Apparently they were not concerned about this kind
    of waste. For the 8086 the explanation is not trying to avoid waste,
    but an easy automatic mapping from 8080 code to 8086 code.

    Writing to AL-DL or AX-DX,SI,DI,BP,SP leaves the other bits of the
    32-bit register alone, which one can consider to be useful for storing
    data in those bits (and in case of AL, AH actually provides a
    conventient way to access some of the bits, and vice versa), but leads
    to partial-register stalls. The hardware contains fast paths for some
    common cases of partial-register writes, but AFAIK AH-DH do not get
    fast paths in most CPUs.

    By contrast, RISCs waste the other 24 of 56 bits on a byte load by zero-extending or sign-extending the byte.

    But it gains the property that the whole register contains 1 proper value
    {range-limited to the container size whence it came}. This in turn makes
    tracking values easy--in fact, placing several different sized values
    in a single register makes it essentially impossible to perform value
    analysis in the compiler.

    Alpha avoids wasting register bits for some idioms by keeping up to 8
    bytes in a register in SIMD style (a few years before the wave of SIMD extensions across the industry), but still provides no direct name for
    the individual bytes of a register.

    If your ISA has excellent support for statically positioned bit-fields
    (or even better with dynamically positioned bit fields) fetching the
    fields and depositing them back into containers does not add significant latency. {volatile notwithstanding} While poor ISA support does add
    significant latency.

    IIRC the original HPPA has 32 or so 64-bit FP registers, which they
    then split into 58? 32-bit FP registers. I don't know how they
    further evolved that feature.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 30 22:06:35 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named >>>>> as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high >>>>> is r33). Sameo for other registers. GPRs may contain either integer or >>>>> floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some >>>> alignment that the first instruction of a cache line is always aligned? >>>
    The 40-bit instructions are byte aligned. This does add more shifting in >>> the align stage. Once shifted though instructions are easily peeled off >>> from fixed positions. One consequence is jump targets must be byte
    aligned OR routines could be required to be 32-bit aligned for instance.> >>
    That raises an interesting question. If you want to align a branch
    target on a 32-bit boundary, or even a cache line, how do you fill
    up the rest? If all instructions are 40 bits, you cannot have a
    NOP that is not 40 bits, so there would need to be a jump before
    a gap that is does not fit 40 bits.

    iCache lines could be a multiple of 5-bytes in size (e.g. 80 bytes
    instead of 64).

    There is a cache level (L2 usually, I believe) when icache and
    dcache are no longer separate. Wouldn't this cause problems
    or inefficiencies?

    Consider trying to invalidate an ICache line--this requires looking
    at 2 DCache lines to see if they, too, need invalidation.

    Consider self-modifying code, the data stream overwrites an instruction,
    then later the FETCH engine runs over the modified line, but the modified
    line is 64-bytes of the needed 80-bytes, so you take a hit and a miss on
    a single fetch.

    It also prevents SNARFing updates to ICache instructions, unless the
    SNARFed data is entirely retained in a single ICache line.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 30 22:19:18 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    According to my understanding, EV4 had no SIMD-style instructions.

    My understanding is that CMPBGE and ZAP(NOT), both SIMD-style
    instructions, were already present in EV4. The architecture
    description <https://download.majix.org/dec/alpha_arch_ref.pdf> does
    not say that some implementations don't include these instructions in
    hardware, whereas for the Multimedia support instructions (Section
    4.13), the reference does say that.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Oct 31 00:57:42 2025
    From Newsgroup: comp.arch

    On Thu, 30 Oct 2025 22:19:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    According to my understanding, EV4 had no SIMD-style instructions.

    My understanding ist that CMPBGE and ZAP(NOT), both SIMD-style
    instructions, were already present in EV4.


    Yes, those were in EV4.

    Alpha 21064 and Alpha 21064A HRM is here: https://github.com/JonathanBelanger/DECaxp/blob/master/ExternalDocumentation

    I didn't consider these instructions as SIMD. Maybe I should have.
    Looks like these instructions are intended to accelerate string
    processing. That's unusual for the first wave of SIMD extensions.

    The architecture
    description <https://download.majix.org/dec/alpha_arch_ref.pdf> does
    not say that some implementations don't include these instructons in hardware, whereas for the Multimedia support instructions (Section
    4.13), the reference does say that.

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Oct 31 14:48:41 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 30 Oct 2025 22:19:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    My understanding ist that CMPBGE and ZAP(NOT), both SIMD-style
    instructions, were already present in EV4.
    ...
    I didn't consider these instructions as SIMD. May be, I should have.

    They definitely are, but they were not touted as such at the time, and
    they use the GPRs, unlike most SIMD extensions to instruction sets.

    Looks like these instructions are intended to accelerated string
    processing. That's unusual for the first wave of SIMD extensions.

    Yes. This was pre-first-wave. The Alpha architects just wanted to
    speed up some common operations that would otherwise have been
    relatively slow thanks to Alpha initially not having BWX instructions. Ironically, when Alpha showed a particularly good result on some
    benchmark (maybe Dhrystone), someone claimed that these string
    instructions gave Alpha an unfair advantage.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Oct 31 13:21:45 2025
    From Newsgroup: comp.arch

    On 10/31/2025 9:48 AM, Anton Ertl wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 30 Oct 2025 22:19:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    My understanding ist that CMPBGE and ZAP(NOT), both SIMD-style
    instructions, were already present in EV4.
    ...
    I didn't consider these instructions as SIMD. May be, I should have.

    They definitely are, but they were not touted as such at the time, and
    they use the GPRs, unlike most SIMD extensions to instruction sets.

    Looks like these instructions are intended to accelerated string
    processing. That's unusual for the first wave of SIMD extensions.

    Yes. This was pre-first-wave. The Alpha architects just wanted to
    speed up some common operations that would otherwise have been
    relatively slow thanks to Alpha initially not having BWX instructions. Ironically, when Alpha showed a particularly good result on some
    benchmark (maybe Dhrystone), someone claimed that these string
    instructions gave Alpha an unfair advantage.


    Most likely Dhrystone:
    It shows disproportionate impact from the relative speed of things like "strcmp()" and integer divide.


    I had experimented with special instructions for packed search, which
    could be used to help with either string compare or implementing
    dictionary objects in my usual way.


    Though, had later fallen back to a more generic way of implementing
    "strcmp()" that allows a fairer comparison between my own ISA and
    RISC-V. Where, say, one instead makes the determination based on how
    efficiently the ISA can handle various pieces of C code (rather than the
    use of niche instructions that typically require hand-written ASM or
    similar).
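
    For reference, this is the sort of word-at-a-time idiom a generic
    plain-C strcmp/strlen can lean on (the well-known zero-byte test, shown
    here as a generic sketch rather than the actual implementation):

        #include <stdint.h>

        /* true if any byte of v is zero (classic SWAR trick) */
        static int has_zero_byte(uint64_t v)
        {
            return ((v - UINT64_C(0x0101010101010101)) & ~v &
                    UINT64_C(0x8080808080808080)) != 0;
        }

        /* usage: load 8 bytes from each string, XOR them to locate the
           first differing byte, and use has_zero_byte() on either word
           to detect the terminator */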



    Generally, makes more sense to use helper instructions that have a
    general impact on performance, say for example, affecting how quickly a
    new image can be drawn into VRAM.

    For example, in my GUI experiments:
    Most of the programs are redrawing the screens as, say, 320x200 RGB555.

    Well, except ROTT, which uses 384x200 8-bit, on top of a bunch of code
    to mimic planar VGA behavior. In this case, for the port it was easier
    to write wrapper code to fake the VGA weirdness than to try to rewrite
    the whole renderer to work with a normal linear framebuffer (like what
    Doom and similar had used).


    In a lot of the cases, I was using an 8-bit indexed color or color-cell
    mode. For indexed color, one needs to send each image through a palette conversion (to the OS color palette); or run a color-cell encoder.
    Mostly because the display HW used 128K of VRAM.

    And, even if RAM backed, there are bandwidth problems with going bigger;
    so higher-resolutions had typically worked to reduce the bits per pixel:
    320x200: 16 bpp
    640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
    800x600: 2 or 4 bpp color-cell
    1024x768: 1 bpp monochrome, other experiments (*1)
    Or, use the 2 bpp mode, for 192K.

    *1: Bayer Pattern Mode/Logic (where the pattern of pixels also encodes
    the color);
    One possibility also being to use an indexed color pair for every 8x8, allowing for a 1.25 bpp color cell mode.

    Though, thus far the 1024x768 mode is still mostly untested on real
    hardware.

    Had experimented some with special instructions to speed up the indexed
    color conversion and color-cell encoding, but had mostly gone back and
    forth between using helper instructions and normal plain C logic, and
    which exact route to take.

    Had at one point had a helper instruction for the "convert 4 RGB555
    colors to 4 indexed colors using a hardware palette", but this broke
    when I later ended up modifying the system palette for better results
    (which was a critical weakness of this approach). Also the naive
    strategy of using a 32K lookup table isn't great, as this doesn't fit
    into the L1 cache.


    So, for 4 bpp color cell:
    Generally, each block of 4x4 pixels understood as 2 RGB555 endpoints,
    and 2 selector bits per pixel. Though, in VRAM, 4 of these are packed
    into a logical 8x8 pixel block; rather than a linear ordering like in
    DXT1 or similar (specifics differ, but general concept is similar to DXT1/S3TC).

    The 2bpp mode generally has 8x8 pixels encoded as 1bpp in raster order
    (same order as a character cell, with MSB in top-left corner and LSB in lower-right corner). And, then typically 2x RGB555 over the 8x8 block.
    IIRC, had also experimented with having each 4x4 sub-block able to use a
    pair of RGB232 colors, but was harder to get good results.

    But, to help with this process, it was useful to have helper operations
    for, say:
    Map RGB555 values to a luma value;
    Select minimum and maximum RGB555 values for block;
    Map luma values to 1 or 2 bit selectors;
    ...
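
    Put together, a plain-C 4x4 encode along those lines might look like
    this (an illustrative DXT1-style sketch, not the actual encoder; pixels
    assumed in row-major order):

        #include <stdint.h>

        /* crude luma from RGB555: 2*G + R + B, range 0..124 */
        static int luma555(uint16_t c)
        {
            int r = (c >> 10) & 31, g = (c >> 5) & 31, b = c & 31;
            return 2 * g + r + b;
        }

        /* encode 16 RGB555 pixels as 2 endpoints + a 2-bit selector each */
        static void encode_block(const uint16_t px[16],
                                 uint16_t *c_lo, uint16_t *c_hi,
                                 uint32_t *sel)
        {
            int ymin = 999, ymax = -1, ilo = 0, ihi = 0;
            for (int i = 0; i < 16; i++) {           /* min/max by luma */
                int y = luma555(px[i]);
                if (y < ymin) { ymin = y; ilo = i; }
                if (y > ymax) { ymax = y; ihi = i; }
            }
            *c_lo = px[ilo];
            *c_hi = px[ihi];
            *sel = 0;
            for (int i = 0; i < 16; i++) {           /* selector per pixel */
                int y = luma555(px[i]);
                int t = (ymax > ymin) ? (4 * (y - ymin)) / (ymax - ymin + 1) : 0;
                *sel |= (uint32_t)t << (2 * i);
            }
        }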


    Internally, the GUI mode had worked by drawing everything to an RGB555
    framebuffer (~512K or 1MB) and then using a bitmap to track which
    blocks had been modified and need to be re-encoded and sent over to
    VRAM. Blocks are first flagged during window redraw, then compared
    against a previous copy of the framebuffer to refine the set that
    actually differs, with blocks copied into that previous copy as needed
    to keep it current.

    Process wasn't particularly efficient (and performance is considerably
    worse than what Win3.x or Win9x seemed to give).
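
    The dirty-block bookkeeping amounts to roughly the following (an
    illustrative sketch with invented names, assuming a 320x200 RGB555
    buffer and 8x8 blocks; not the actual GUI code):

        #include <stdint.h>
        #include <string.h>

        #define W  320
        #define H  200
        #define BW (W / 8)
        #define BH (H / 8)

        /* mark blocks whose pixels changed since the previous frame
           (caller clears the dirty map beforehand) */
        static void find_dirty(const uint16_t *cur, uint16_t *prev,
                               uint8_t dirty[BH][BW])
        {
            for (int by = 0; by < BH; by++) {
                for (int bx = 0; bx < BW; bx++) {
                    int changed = 0;
                    for (int y = 0; y < 8 && !changed; y++) {
                        const uint16_t *c = cur  + (by * 8 + y) * W + bx * 8;
                        const uint16_t *p = prev + (by * 8 + y) * W + bx * 8;
                        changed = memcmp(c, p, 8 * sizeof *c) != 0;
                    }
                    if (changed) {
                        dirty[by][bx] = 1;   /* re-encode + send to VRAM */
                        for (int y = 0; y < 8; y++)
                            memcpy(prev + (by * 8 + y) * W + bx * 8,
                                   cur  + (by * 8 + y) * W + bx * 8,
                                   8 * sizeof *cur);
                    }
                }
            }
        }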



    As for the packed-search instructions, there were 16-bit versions as
    well, which could be used either to help with UTF-16 operations; or for dictionary objects.

    Where, a common way I implement dictionary objects is to use arrays of
    16-bit keys with 64-bit values (often tagged values or similar).

    Though, this does put a limit on the maximum number of unique symbols
    that can be used as dictionary keys, but this is not often an issue in
    practice. Generally these are not QNames or C function names, so this
    reduces the issue of running out of symbol names somewhat.
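
    A plain-C stand-in for the 16-bit packed-search case looks roughly like
    this (a sketch of the operation itself, not the actual helper
    instruction or its encoding):

        #include <stdint.h>

        /* return the lane (0..3) of the first 16-bit key in 'packed' equal
           to 'key', or -1 if none match; 'packed' holds 4 keys */
        static int psearch16(uint64_t packed, uint16_t key)
        {
            for (int i = 0; i < 4; i++)
                if ((uint16_t)(packed >> (16 * i)) == key)
                    return i;
            return -1;
        }

    A dictionary lookup then walks the key array a 64-bit word at a time,
    and on a hit indexes the parallel array of 64-bit (tagged) values.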

    One can also differ though on how much sense it makes to have
    ISA level helpers for working with tagrefs and similar (or, getting the
    ABI involved with these matters, like defining in the ABI the encodings
    for things like fixnum/flonum/etc).

    ...


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Oct 31 14:32:00 2025
    From Newsgroup: comp.arch

    On 10/31/2025 1:21 PM, BGB wrote:

    ...


    In a lot of the cases, I was using an 8-bit indexed color or color-cell mode. For indexed color, one needs to send each image through a palette conversion (to the OS color palette); or run a color-cell encoder.
    Mostly because the display HW used 128K of VRAM.

    And, even if RAM backed, there are bandwidth problems with going bigger;
    so higher-resolutions had typically worked to reduce the bits per pixel:
       320x200: 16 bpp
       640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
       800x600: 2 or 4 bpp color-cell
      1024x768: 1 bpp monochrome, other experiments (*1)
        Or, use the 2 bpp mode, for 192K.

    *1: Bayer Pattern Mode/Logic (where the pattern of pixels also encodes
    the color);
    One possibility also being to use an indexed color pair for every 8x8, allowing for a 1.25 bpp color cell mode.



    Expanding on this:
    Idea 1, original:
    Each group of 2x2 pixels understood as:
    G R
    B G
    With each pixel alternating color.

    But, slightly better for quality is to operate on blocks of 4x4 pixels,
    with the pixel bits encoding color indirectly for the whole 4x4 block:
    G R G B
    B G R G
    G R G B
    B G R G
    So, if >= 4 G bits are set, G is High.
    So, if >= 2 R bits are set, R is High.
    So, if >= 2 B bits are set, B is High.
    If > 8 bits are set, I is high.

    The non-set pixels usually assume either 0000 (Black) or 1000 (Dark
    Grey) depending on the I bit. Or, a low intensity version of the main
    color if over 75% of the bits for a given channel are set (say, for mostly flat
    color blocks).

    Still kinda sucks, but allows a crude approximation of 16 color graphics
    at 1 bpp...
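
    In code form, the block color recovery works out to popcounts against
    per-channel masks, along these lines (an illustrative sketch; the exact
    4x4 bit packing and the IRGB nibble layout are assumptions):

        #include <stdint.h>

        static int pop16(uint16_t v)
        {
            int n = 0;
            while (v) { n += v & 1; v >>= 1; }
            return n;
        }

        /* bits: the 4x4 block, bit 15 = top-left pixel, row-major.
           Pattern:  G R G B / B G R G / G R G B / B G R G            */
        static uint8_t block_irgb(uint16_t bits)
        {
            int g = pop16(bits & 0xA5A5);    /* 8 G positions */
            int r = pop16(bits & 0x4242);    /* 4 R positions */
            int b = pop16(bits & 0x1818);    /* 4 B positions */
            uint8_t c = 0;
            if (g >= 4)          c |= 0x4;   /* G high */
            if (r >= 2)          c |= 0x2;   /* R high */
            if (b >= 2)          c |= 0x1;   /* B high */
            if (pop16(bits) > 8) c |= 0x8;   /* I high */
            return c;                        /* IRGB nibble */
        }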


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Oct 31 21:09:23 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:
    Improves the accuracy? of algorithms, but seems a bit specific to me.

    It is down in the 1% footprint area.

    Are there other instruction sequence where double-rounding would be good
    to avoid?

    Back when I joined Moto (1983) there was a lot of talk about double
    roundings and how it could screw up various algorithms but mainly in
    the 64-bit versus 80-bit stuff of 68881, where you got 11-more bits
    of precision and thus took a change of 2/2^10 of a double rounding.
    Today with 32-bit versus 64-bit you take a chance of 2/2^28 so the
    problem is greatly ameliorated although technically still present.

    Actually, for the five required basic operations, you can always do the
    op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    This is because the mantissa lengths (including the hidden bit) increase
    to at least 2n+2:

    f16 1:5:10 (1+10=11, 11*2+2 = 22)
    f32 1:8:23 (1+23=24, 24*2+2 = 50)
    f64 1:11:52 (1+52=53, 53*2+2 = 108)
    f128 1:15:112 (1+112=113)

    You can however NOT use f128 FMUL + FADD to emulate f64 FMAC, since that
    would require a triple sized mantissa.
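
    A quick empirical check of that 2n+2 claim for f32 ops done via f64 (a
    throwaway sketch, assuming FLT_EVAL_METHOD == 0 so float expressions
    really are rounded to float):

        #include <stdio.h>
        #include <stdlib.h>

        int main(void)
        {
            srand(42);
            for (long i = 0; i < 10000000; i++) {
                float a = (float)(rand() % 1000000 + 1) / 1000.0f;
                float b = (float)(rand() % 1000000 + 1) / 1000.0f;
                float direct = a / b;                          /* one rounding */
                float via_d  = (float)((double)a / (double)b); /* round twice  */
                if (direct != via_d) {
                    printf("mismatch: %a / %a\n", a, b);
                    return 1;
                }
            }
            printf("no double-rounding mismatch found\n");
            return 0;
        }

    Run over random inputs it should never report a mismatch, since
    53 >= 2*24+2 = 50.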

    The Intel+Motorola 80-bit format was a bastard that made it effectively impossible to produce bit-for-bit identical results even when the FPU
    was set to 64-bit precision.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Oct 31 21:12:45 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Thu, 30 Oct 2025 16:46:14 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    Alpha avoids wasting register bits for some idioms by keeping up to 8
    bytes in a register in SIMD style (a few years before the wave of SIMD
    extensions across the industry), but still provides no direct name for
    the individual bytes of a register.


    According to my understanding, EV4 had no SIMD-style instructions.
    They were introduced in EV5 (Jan 1995). Which makes it only ~6 months
    ahead of VIS in UltraSPARC.

    The original (v1?) Alpha had instructions intended to make it "easy" to
    process character data in 8-byte chunks inside a register.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 1 18:19:48 2025
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:
    Improves the accuracy? of algorithms, but seems a bit specific to me.

    It is down in the 1% footprint area.

    Are there other instruction sequence where double-rounding would be good >> to avoid?

    Back when I joined Moto (1983) there was a lot of talk about double roundings and how it could screw up various algorithms but mainly in
    the 64-bit versus 80-bit stuff of 68881, where you got 11-more bits
    of precision and thus took a change of 2/2^10 of a double rounding.
    Today with 32-bit versus 64-bit you take a chance of 2/2^28 so the
    problem is greatly ameliorated although technically still present.

    Actually, for the five required basic operations, you can always do the
    op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    This is because the mantissa lengths (including the hidden bit) increase
    to at least 2n+2:

    f16 1:5:10 (1+10=11, 11*2+2 = 22)
    f32 1:8:23 (1+23=24, 24*2+2 = 50)
    f64 1:11:52 (1+52=53, 53*2+2 = 108)
    f128 1:15:112 (1+112=113)

    You can however NOT use f128 FMUL + FADD to emulate f64 FMAC, since that would require a triple sized mantissa.

    The Intel+Motorola 80-bit format was a bastard that made it effectively impossible to produce bit-for-bit identical results even when the FPU
    was set to 64-bit precision.

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Nov 1 19:18:39 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Intel split AX into AL and AH, similar for BX, CX, and DX, but not for
    SI, DI, BP, and SP.

    {ABCD}X registers were data.
    {SDBS} registers were pointer registers.

    The 8086 is no 68000. The [BX] addressing mode makes it obvious that
    that's not the case.

    What is actually the case: AL-DL, AH-DH correspond to 8-bit registers
    of the 8080, some of AX-DX correspond to register pairs. SI, DI, BP
    are new, SP corresponds to the 8080 SP, which does not have 8-bit
    components. That's why SI, DI, BP, SP have no low or high
    sub-registers.

    Oh and BTW:: using x86-history as justification for an architectural
    feature is "bad style".

    I think that we can learn a lot from earlier architectures, some
    things to adopt and some things to avoid. Concerning subregisters, I
    lean towards avoiding.

    That's also another reason to avoid load-and-op and RMW instructions.
    With a load/store architecture, load can sign/zero extend as
    necessary, and then most operations can be done at full width.

    But gains the property that the whole register contains 1 proper value
    {range-limited to the container size whence it came} This in turn makes
    tracking values easy--in fact placing several different sized values
    in a single register makes it essentially impossible to perform value
    analysis in the compiler.

    I don't think it's impossible or particularly hard for the compiler. Implementing it in OoO hardware causes complications, though.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Nov 1 21:08:35 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always do the
    op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Nov 2 02:21:18 2025
    From Newsgroup: comp.arch

    On 10/31/2025 2:32 PM, BGB wrote:
    On 10/31/2025 1:21 PM, BGB wrote:

    ...


    In a lot of the cases, I was using an 8-bit indexed color or color-
    cell mode. For indexed color, one needs to send each image through a
    palette conversion (to the OS color palette); or run a color-cell
    encoder. Mostly because the display HW used 128K of VRAM.

    And, even if RAM backed, there are bandwidth problems with going
    bigger; so higher-resolutions had typically worked to reduce the bits
    per pixel:
        320x200: 16 bpp
        640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
        800x600: 2 or 4 bpp color-cell
       1024x768: 1 bpp monochrome, other experiments (*1)
         Or, use the 2 bpp mode, for 192K.

    *1: Bayer Pattern Mode/Logic (where the pattern of pixels also encodes
    the color);
    One possibility also being to use an indexed color pair for every 8x8,
    allowing for a 1.25 bpp color cell mode.
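
    As a rough check on those memory figures, a small helper (purely
    illustrative, not from the original design) computing frame size as
    width * height * bpp / 8:

        #include <stdio.h>

        /* Hypothetical helper: framebuffer bytes for a given video mode. */
        static long fb_bytes(long w, long h, long bpp)
        {
            return (w * h * bpp) / 8;
        }

        int main(void)
        {
            printf("320x200x16: %ld\n", fb_bytes(320, 200, 16)); /* 128000, fits 128K VRAM */
            printf("640x400x8:  %ld\n", fb_bytes(640, 400, 8));  /* 256000, the "256K" case */
            printf("1024x768x2: %ld\n", fb_bytes(1024, 768, 2)); /* 196608 = 192K exactly */
            return 0;
        }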



    Expanding on this:
    Idea 1, original:
    Each group of 2x2 pixels understood as:
      G R
      B G
    With each pixel alternating color.

    But, slightly better for quality is to operate on blocks of 4x4 pixels,
    with the pixel bits encoding color indirectly for the whole 4x4 block:
      G R G B
      B G R G
      G R G B
      B G R G
    So:
    If >= 4 of the 8 G bits are set, G is High.
    If >= 2 of the 4 R bits are set, R is High.
    If >= 2 of the 4 B bits are set, B is High.
    If > 8 bits total are set, I is High.

    The non-set pixels usually assume either 0000 (Black) or 1000 (Dark
    Grey) depending on the I bit. Or, a low intensity version of the main
    color if over 75% of a given channel's bits are set one way (say, for
    mostly flat color blocks).

    Still kinda sucks, but allows a crude approximation of 16 color graphics
    at 1 bpp...
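
    As a software sketch of that 4x4 RGBI recovery (my own illustration;
    it assumes the 16 block bits are stored row-major with bit 0 at the
    top-left, following the G R G B / B G R G tiling above):

        #include <stdint.h>

        /* Count set bits in a 16-bit value. */
        static int popcount16(uint16_t v)
        {
            int n = 0;
            while (v) { n += v & 1; v >>= 1; }
            return n;
        }

        /* Recover a 4-bit IRGB value (bit3=I, bit2=R, bit1=G, bit0=B)
         * from one 4x4 block of pixel bits. */
        static unsigned rgbi_from_block(uint16_t blk)
        {
            const uint16_t gmask = 0xA5A5;  /* the 8 G positions */
            const uint16_t rmask = 0x4242;  /* the 4 R positions */
            const uint16_t bmask = 0x1818;  /* the 4 B positions */
            unsigned g = popcount16(blk & gmask) >= 4;
            unsigned r = popcount16(blk & rmask) >= 2;
            unsigned b = popcount16(blk & bmask) >= 2;
            unsigned i = popcount16(blk) > 8;
            return (i << 3) | (r << 2) | (g << 1) | b;
        }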


    Well, anyways, here is me testing with another variation of the idea
    (after thinking about it again).

    Using a joke image as a test case here...

    https://x.com/cr88192/status/1984694932666261839

    This variation uses:
    Y R
    B G

    In this case tiling as:
    Y R Y R ...
    B G B G ...
    Y R Y R ...
    B G B G ...
    ...

    Where Y is a pure luma value.
    May or may not use this; the alternative is:
    Y R B G Y R B G
    B G Y R B G Y R
    ...
    But the prior pattern is simpler to deal with.

    Note that having every line follow the same pattern (with no
    alternation) would lead to obvious vertical lines in the output.


    This uses a different (slightly more complicated) color recovery
    algorithm, and operates on 8x8 pixel blocks.

    With 4x4, there are effectively 4 bits per channel, which is enough to
    recover 1 bit of color per channel.

    With 8x8, there are 16 bits per channel, and it is possible to recover
    ~3 bits per channel, allowing for roughly an RGB333 color space
    (though, the vectors are normalized here).

    Having both a Y and G channel slightly helps with the color-recovery
    process; and allows a way to signal a monochrome block (if Y==G, the
    block is assumed to be monochrome, and the R/B bits can be used more
    freely for expressing luma).

    Where:
    Chroma accuracy comes at the expense of luma accuracy;
    An increased colorspace comes at the cost of spatial resolution of chroma;
    ...


    Dealing with chroma does have the effect of making the dithering process
    more complicated. As noted, reliable recovery of the color vector is
    itself a bit fiddly (and is very sensitive to the encoder side dither process).

    The former image was itself an example of an artifact caused by the
    dithering process, which in this case was over-boosting the green
    channel (and rotating the dither matrix would result in drastic color
    shifts). The latter image was from after I realized the issue with the
    dither pattern and modified how it was being handled (replacing the
    8x8 ordered dither with a 4x4 ordered dither, and then rotating the
    matrix for each channel).
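
    A minimal sketch of what that could look like (my illustration, not
    the actual encoder; it assumes "rotating the matrix for each channel"
    means rotating the 4x4 threshold matrix by 90 degrees per channel):

        #include <stdint.h>

        /* Classic 4x4 ordered-dither (Bayer) threshold matrix, values 0..15. */
        static const uint8_t bayer4[4][4] = {
            {  0,  8,  2, 10 },
            { 12,  4, 14,  6 },
            {  3, 11,  1,  9 },
            { 15,  7, 13,  5 },
        };

        /* Dither one 8-bit channel value at pixel (x,y); 'chan' selects a
         * 90-degree rotation of the threshold matrix. Returns 0 or 1. */
        static int dither_bit(int v, int x, int y, int chan)
        {
            int xx = x & 3, yy = y & 3, t;
            switch (chan & 3) {
            case 0:  t = bayer4[yy][xx];         break;
            case 1:  t = bayer4[3 - xx][yy];     break;  /*  90 deg */
            case 2:  t = bayer4[3 - yy][3 - xx]; break;  /* 180 deg */
            default: t = bayer4[xx][3 - yy];     break;  /* 270 deg */
            }
            return (v * 16) > (t * 255);  /* v=0 -> all clear, v=255 -> all set */
        }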


    Image quality isn't great, but then again, not sure how to do that much
    better with a naive 1 bit/pixel encoding.


    I guess, an open question here is whether the color-recovery algorithm
    would be practical for hardware / FPGA.

    One possibility could be:
    Use LUT4 to map 4b -> 2b (as a count);
    Then, map 2x2b -> 3b (adder);
    Then, map 2x3b -> 4b (adder), then discard the LSB.
    Then, select the max of R/G/B/Y;
    This is used as an inverse normalization scale.
    Feed each value and the scale through a LUT (for R/G/B),
    Getting a 5-bit scaled value for each of R/G/B;
    Roughly: (Val<<5)/Max.
    Compose an RGB555 (5 bits per channel) value used for each pixel that is set.

    The actual pixel decoding process works the same as with 8x8 blocks of
    1-bit monochrome, selecting the minimum or maximum color based on each
    bit.

    Possibly, Y could also be used to select "relative" minimum and maximum values, vs full intensity and black, but this would add more logic
    complexity.
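
    In software terms, the normalization step could look something like
    the sketch below (my illustration; it assumes the per-channel set-bit
    counts for an 8x8 block, 0..16 each, have already been computed, and
    it uses val*31/max in place of the (Val<<5)/Max approximation so the
    result stays within 5 bits):

        #include <stdint.h>

        /* Recover an RGB555 "block color" from per-channel bit counts.
         * Set pixels in the block then use this color; clear pixels use
         * black (or a Y-derived minimum, per the note above). */
        static uint16_t block_color_rgb555(int ycnt, int rcnt, int gcnt, int bcnt)
        {
            int max = ycnt;
            if (rcnt > max) max = rcnt;
            if (gcnt > max) max = gcnt;
            if (bcnt > max) max = bcnt;
            if (max == 0)
                return 0;                   /* fully dark block */

            /* Normalize each channel against the largest count. */
            int r = (rcnt * 31) / max;
            int g = (gcnt * 31) / max;
            int b = (bcnt * 31) / max;
            return (uint16_t)((r << 10) | (g << 5) | b);
        }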


    Pros/Cons:
    +: Looks better than per-pixel Bayer-RGB
    +: Looks better than 4x4 RGBI
    -: Would require more complex decoder logic;
    -: Requires specialized dither logic to not look like broken crap.
    -: Doesn't give passable results if handed naive grayscale dithering.

    Per-Pixel RGB still holds up OK with naive grayscale dither.
    But, this approach is a lot more particular.

    The RGBI approach seems intermediate; it is more likely to decode
    grayscale patterns as gray.



    I guess a more open question is whether such a thing could be useful
    (it is pretty far down the image-quality scale). But, OTOH, with
    simpler (non-randomized) dither patterns, it can LZ compress OK
    (depending on the image, 0.1 to 0.8 bpp, which is generally JPEG
    territory).

    If combined with delta encoding or similar, it could almost be adapted
    into a very crappy video codec.

    Well, or LZ4, where (at 320x200) one could potentially hold several
    frames of video in a 64K sliding window.

    But, image quality might be unacceptably poor. Also if decoded in
    software, the color-reconstruction is likely to be more computationally expensive than just using a CRAM style codec (while also giving worse
    image quality).


    More just interesting that I was able to get things "almost half-way
    passable" from 1 bpp monochrome.


    ...



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Nov 2 11:36:36 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always do the
    op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).

    Rounding to odd is basically the same as rounding to sticky, i.e. if
    there are any trailing 1 bits in the exact result, then OR a 1 into the
    ulp position.
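
    A minimal sketch of that, assuming an integer mantissa from which the
    low 'drop' bits (1..63) are being discarded (illustration only, not
    the PowerISA implementation):

        #include <stdint.h>

        /* Narrow 'mant' by 'drop' bits using round-to-odd: truncate, and
         * force the result odd if any discarded bit was set (the sticky
         * information lands in the ulp position). */
        static uint64_t round_to_odd(uint64_t mant, unsigned drop)
        {
            uint64_t kept = mant >> drop;
            uint64_t lost = mant & ((1ULL << drop) - 1);
            if (lost != 0)
                kept |= 1;
            return kept;
        }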

    We have known since before the 1978 ieee754 standard that guard+sticky
    (plus sign and ulp) is enough to get the rounding correct in all modes.

    The single exception is when rounding up from the maximum magnitude
    value to inf should be suppressed, there you do in fact need to check
    all the bits.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Nov 2 15:56:12 2025
    From Newsgroup: comp.arch

    On Sun, 2 Nov 2025 11:36:36 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always
    do the op in the next higher precision, then round again down to
    the target, and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit floating point arithmetic, for that very reason (I assume).

    Rounding to odd is basically the same as rounding to sticky, i.e if
    there are any trailing 1 bits in the exact result, then put that in
    the ulp position.

    We have known since before the 1978 ieee754 standard that
    guard+sticky (plus sign and ulp) is enough to get the rounding
    correct in all modes.

    The single exception is when rounding up from the maximum magnitude
    value to inf should be suppressed, there you do in fact need to check
    all the bits.

    Terje


    People use names like guard and sticky bits, and sometimes also a
    rounding bit (e.g. in the Wikipedia article), without explanation, as
    if everybody had agreed about what they mean. But I don't think that
    everybody really agrees.

    Shockingly, the absence of strict definitions applies even to the most
    widely referenced article, David Goldberg's "What Every Computer
    Scientist Should Know About Floating-Point Arithmetic". It seems people
    copy the name of the article from one another, but only a small
    fraction of them have bothered to actually read it.


    --- Synchronet 3.21a-Linux NewsLink 1.2