• Concertina III May Be Returning

    From John Savard@quadibloc@invalid.invalid to comp.arch on Sun Aug 31 06:17:07 2025
    From Newsgroup: comp.arch

    I have had so much success in adjusting Concertina II to achieve my goals
    more fully than I had thought possible... that I now think that it may be possible to proceed from Concertina II to a design which gets rid of the
    one feature of Concertina II that has been the most controversial.

    Yes, I think that I could actually do without block structure.

    What would Concertina III look like?

    Well, the basic instruction set would be similar to that of Concertina II.
    But the P bits would be taken out of the operate instructions, and so
    would the option of replacing a register specification by a pseudo-
    immediate pointer.

    The tiny gaps left between the opcodes of some instructions to squeeze
    out space for block headers would be removed.

    But the big spaces reserved for the shortest block-header prefixes are
    what would be used to do without headers.

    Instead of a block header being used to indicate code consisting of variable-length instructions, variable-length instructions would be
    contained within a sequence of pairs of 32-bit instructions of this form:

    11110xx(17 bits)(8 bits)
    11111x(9 bits)(17 bits)

    Instructions could be 17 bits long, 34 bits long, 51 bits long, and so on,
    any multiple of 17 bits in length.

    In the first instruction slot of the pair, the two bits xx indicate, for
    the two 17-bit regions of the variable-length instruction area that start
    in it, whether each is the first 17-bit area of an instruction. The second
    instruction slot contains the start of only one 17-bit area, so only one
    bit x is needed. Since 17 is an odd number, this meshes perfectly with the
    fact that the 17-bit area straddling both words isn't split evenly: one
    extra bit of it falls in the second 32-bit instruction slot.
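    The pair format above can be sketched in C. This is a minimal model: the
    exact payload bit ordering within each word, and the polarity of the x
    bits, are assumptions taken from the description.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed MSB-first fields:
   w0 = 11110 xx (25 payload bits: 17 + 8), w1 = 11111 x (26 payload bits: 9 + 17). */
typedef struct {
    uint32_t area[3];   /* the three 17-bit areas */
    int      start[3];  /* 1 if that area begins an instruction */
} vl_pair;

static vl_pair unpack_pair(uint32_t w0, uint32_t w1)
{
    vl_pair r;
    uint32_t p = w0 & 0x1FFFFFFu;   /* low 25 bits of w0 */
    uint32_t q = w1 & 0x3FFFFFFu;   /* low 26 bits of w1 */

    r.area[0] = (p >> 8) & 0x1FFFFu;                       /* wholly in w0 */
    r.area[1] = ((p & 0xFFu) << 9) | ((q >> 17) & 0x1FFu); /* 8 + 9 bits: the straddler */
    r.area[2] = q & 0x1FFFFu;                              /* wholly in w1 */

    r.start[0] = (int)((w0 >> 26) & 1u);  /* first x of xx */
    r.start[1] = (int)((w0 >> 25) & 1u);  /* second x of xx */
    r.start[2] = (int)((w1 >> 26) & 1u);  /* the lone x in w1 */
    return r;
}
```

    Note how the straddling area takes 8 bits from the first word and 9 from
    the second, matching the odd split described above.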

    I had been hoping to use 18-bit areas instead, but after re-checking my calculations, I found there just wasn't enough opcode space.

    Long instructions that contain immediates would not be part of variable-
    length instruction code. Instead, their lengths would be multiples of 32
    bits, making them part of ordinary code with 32-bit instructions.

    Their form would be like this:

    32-bit immediate:

    1010(12 bits)(16 bits)
    10111(11 bits)(16 bits)

    where the first parenthesized area belongs to the instruction, and the
    second to the immediate.

    48-bit immediate:

    1010(12 bits)(16 bits)
    10110(11 bits)(16 bits)
    10111(11 bits)(16 bits)

    64-bit immediate:

    1010(12 bits)(16 bits)
    10110(3 bits)(24 bits)
    10111(3 bits)(24 bits)

    Since the instruction, exclusive of the immediate, really only needs 12
    bits (a 7-bit opcode and a 5-bit destination register), in each case
    there's enough additional space for the instruction to begin with a few
    bits that indicate its length, so that decoding is simple.
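    As a sketch, stitching the 64-bit immediate back together might look like
    this; which word carries which part of the immediate (here w0 high, w2
    low) is an assumption, since only the field widths (16 + 24 + 24) are
    fixed above:

```c
#include <assert.h>
#include <stdint.h>

/* Assemble a 64-bit immediate from the three-word form:
   w0 = 1010(12)(16), w1 = 10110(3)(24), w2 = 10111(3)(24).
   The high-to-low ordering of the pieces is an assumption. */
static uint64_t imm64_of(uint32_t w0, uint32_t w1, uint32_t w2)
{
    uint64_t hi  = w0 & 0xFFFFu;     /* 16 immediate bits in the 1010... word  */
    uint64_t mid = w1 & 0xFFFFFFu;   /* 24 immediate bits in the 10110... word */
    uint64_t lo  = w2 & 0xFFFFFFu;   /* 24 immediate bits in the 10111... word */
    return (hi << 48) | (mid << 24) | lo;
}
```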

    The scheme is not really space-efficient.

    But the question that I really have is... is this really any better than having block headers? Or is it just as bad, just as complicated?

    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Aug 31 13:12:52 2025
    From Newsgroup: comp.arch

    On 8/31/2025 1:17 AM, John Savard wrote:
    [...]
    But the question that I really have is... is this really any better than having block headers? Or is it just as bad, just as complicated?



    How about, say, 16/32/48/64/96:
                        xxxx-xxxx-xxxx-xxx0  //16 bit
    xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xxyy-yyy1  //32 bit
    xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xx11-1111  //64/48/96 bit prefix

    Already elaborate enough...



    Where:
    Prefix+16b = 48b
    Prefix+32b = 64b
    Prefix+Prefix+32b = 96b

    This leaves 15 bits of encoding space for 16 bit ops.
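    A rough model of the length rule, assuming little-endian 16-bit parcels
    with the tag bits in the first parcel of each unit:

```c
#include <assert.h>
#include <stdint.h>

/* Length rule sketched above, per 16-bit parcel:
   bit 0 = 0           -> 16-bit op (one parcel)
   low 6 bits = 111111 -> 32-bit prefix (two parcels), instruction continues
   otherwise           -> plain 32-bit op (two parcels) */
static int insn_length_bits(const uint16_t *p)
{
    int bits = 0;
    for (;;) {
        if ((p[0] & 1u) == 0)
            return bits + 16;     /* 16-bit op ends the instruction */
        if ((p[0] & 0x3Fu) != 0x3Fu)
            return bits + 32;     /* plain 32-bit op ends it */
        bits += 32;               /* prefix: keep going */
        p += 2;
    }
}
```

    So Prefix+16b gives 48, Prefix+32b gives 64, and Prefix+Prefix+32b gives
    96, exactly the combinations listed above.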

    Nominally, the 16-bit ops can have 16 registers.
    Preferably:
    8 scratch, 8 callee save
    Roughly the first 4-8 argument registers.
    A few cases could have 32 registers.

    For 32-bit ops, 32 or 64 registers.

    Might map out register space as, say:
    R0..R3: ZR/LR/SP/GP
    R4..R7: Scratch
    R8..R15: Scratch / Arg
    R16..R31: Callee Save
    For high 32:
    R32..R47: Scratch
    R48..R63: Callee Save


    With 16-bit ops maybe having R8..R23 (or R16..R24, R8..R15) for the
    Reg4 field.

    Some possible 16-bit ops:
    00in-nnnn-iiii-0000 ADD Imm5s, Rn5 //"ADD 0, R0" = TRAP
    01in-nnnn-iiii-0000 MOV Imm5s, Rn5
    10mn-nnnn-mmmm-0000 ADD Rm5, Rn5
    11mn-nnnn-mmmm-0000 MOV Rm5, Rn5

    0000-nnnn-iiii-0010 ADDW Imm4u, Rn4 //ADD Imm, 32-bit sign extension
    0001-nnnn-mmmm-0010 SUB Rm4, Rn4
    0010-nnnn-mmmm-0010 ADDW Imm4n, Rn4 //ADD Imm, 32-bit sign extension
    0011-nnnn-mmmm-0010 MOVW Rm4, Rn4 //MOV with 32-bit sign extension
    0100-nnnn-mmmm-0010 ADDW Rm4, Rn4
    0101-nnnn-mmmm-0010 AND Rm4, Rn4
    0110-nnnn-mmmm-0010 OR Rm4, Rn4
    0111-nnnn-mmmm-0010 XOR Rm4, Rn4

    1000-dddd-dddd-0010 BRA Disp8
    ...

    0ddd-nnnn-mmmm-0100 LW Disp3(Rm4), Rn4
    1ddd-nnnn-mmmm-0100 SW Disp3(Rm4), Rn4
    0ddd-nnnn-mmmm-0110 LD Disp3(Rm4), Rn4
    1ddd-nnnn-mmmm-0110 SD Disp3(Rm4), Rn4

    00dn-nnnn-dddd-1000 LW Disp5(SP), Rn5
    01dn-nnnn-dddd-1000 LD Disp5(SP), Rn5
    10dn-nnnn-dddd-1000 SW Disp5(SP), Rn5
    11dn-nnnn-dddd-1000 SD Disp5(SP), Rn5

    ...
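    As an illustration, a hypothetical encoder for the first form in the list
    (ADD Imm5s, Rn5), assuming the lone i bit is the high (sign) bit of the
    5-bit immediate:

```c
#include <assert.h>
#include <stdint.h>

/* Encode 00in-nnnn-iiii-0000 (ADD Imm5s, Rn5), bit 15 down to bit 0. */
static uint16_t enc_add_imm5s(int imm, int rn)
{
    uint16_t i5 = (uint16_t)(imm & 0x1F);
    return (uint16_t)((((i5 >> 4) & 1u) << 13) |  /* i: bit 13          */
                      ((rn & 0x1Fu)      << 8) |  /* n-nnnn: bits 12..8 */
                      ((i5 & 0xFu)       << 4));  /* iiii: bits 7..4    */
}
```

    Note that enc_add_imm5s(0, 0) yields all-zero bits, matching the
    "ADD 0, R0" = TRAP remark above.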

    Avoid the temptation to make immediate and displacement fields overly
    large, as for small values these tend to have a very rapid drop-off. Not
    enough registers hurts more here than narrow Imm/Disp fields.

    Main place a larger Disp is justified is for SP-rel Load/Store.

    The 16-bit ops don't need to be sufficient to support the whole ISA,
    merely to provide space-savings for a subset of common-case operations. Preferably with the encodings not being confetti.



    32 bit instruction layout could do whatever.
    zzzz-oooo-oomm-mmmm-zzzz-nnnn-nnyy-yyy1 //Similar to XG3
    zzzz-zzzo-oooo-mmmm-mzzz-nnnn-nyyy-yyy1 //Similar to RISC-V
    zzzz-zzoo-ooom-mmmm-zzzz-znnn-nnyy-yyy1 //Intermediate, 5b registers
    ...

    May make sense to slightly shrink immediate and displacement fields
    relative to RISC-V, and instead assume use of jumbo prefixes.

    Also, probably not doing something like RISC-V's JAL, which is an
    unreasonable waste of encoding space.

    As for 5 or 6 bit registers, possibilities:
    * Purely 5-bit, like traditional RISC-V
    * Purely 6 bit, like XG3, but this leaves less encoding space.
    * Mixed 5 or 6 bit, like XG1, but more intentionally
    ** In XG1, the subset of 32b ops with Reg6 were a bit of a hack.


    Possible, assuming a mixed 5/6 bit scheme:
    zzzz-oooo-oomm-mmmm-zzzz-nnnn-nnyy-yyy1 //3R, Reg6
    zzzz-zooo-oozm-mmmm-zzzz-znnn-nnyy-yyy1 //3R, Reg5
    iiii-iiii-iimm-mmmm-zzzz-nnnn-nnyy-yyy1 //3RI, Reg6 (Imm10)
    iiii-iiii-iizm-mmmm-zzzz-znnn-nnyy-yyy1 //3RI, Reg5 (Imm10)
    iiii-iiii-iiii-iiii-zzzz-nnnn-nnyy-yyy1 //2RI, Reg6 (Imm16)
    iiii-iiii-iiii-iiii-zzzz-0jjj-jj11-1111 //Jumbo Prefix (Imm, Typical)
    Extends Imm10 to Imm33, leaves 4 bits for sub-operation.
    If prefix is used with a Reg5 op, extend reg fields to 6-bit.

    Keeping register fields consistent helps with superscalar. What matters
    more here is that the low-order bits remain in the same place.

    Keeping immediate fields consistent helps with "everything not being a
    massive pain". I would assume all normal 3RI immediate and displacement instructions have the same size and layout of immediate. However,
    depending on instruction it may change scale. As I see it, scale
    changing is preferable to bit-slicing though.


    While less elegant to have a mix of 5 and 6 bit register encodings,
    doing so could be more economical regarding the use of encoding space
    compared with purely 6 bit (while being less limiting compared with pure
    5 bit).

    Possibly, one could have a semi-split space where, say:
    R0..R31 are primarily integer registers;
    R32..R63 are primarily FPU and SIMD registers;
    A lot of core operations have access to the entire register space;
    Non-core operations might be specific to the low or high 32;
    SIMD-128 ops could use 5b register fields,
    but still have the full register space


    The semi-split scheme could work well in medium to low register pressure scenarios; with 6-bit core ops helping with high register pressure.

    Likely, at least, all the Load/Store and basic ALU operations need Reg6.


    Not having 16-bit ops could simplify implementation slightly and also
    allow better performance from a simpler implementation. One other option
    being to allow 16-bit ops, but with the caveat that using them may
    reduce performance (with the compiler ideally keeping performance
    optimized by using primarily 32-bit encodings wherever possible, and
    maintaining 32-bit alignment of the instruction stream).

    Where, in such a case, 16-bit ops could be left either for low-traffic
    code or for size optimized binaries.

    ...


    Though, more elegant (and possibly also highest performance) could be to
    just go for 32/64/96 bit instructions with 6-bit register fields (also,
    any lower-probability ops can use 64-bit encodings; though, at the cost
    of code density).


    Just another random pull here...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Tue Sep 2 09:15:59 2025
    From Newsgroup: comp.arch

    On Sun, 31 Aug 2025 13:12:52 -0500, BGB wrote:

    How about, say, 16/32/48/64/96:
    xxxx-xxxx-xxxx-xxx0 //16 bit
    xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xxyy-yyy1 //32 bit
    xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xx11-1111 //64/48/96 bit prefix

    Already elaborate enough...

    Thank you for your interesting suggestions.

    I'm envisaging Concertina III as closely based on Concertina II, with only minimal changes.

    Like Concertina II, it is to meet the overriding condition that
    instructions do not have to be decoded sequentially. This means that
    whenever an instruction, or group of instructions, spans more than 32
    bits, the 32 bit areas of the instruction, other than the first, must
    begin with a combination of bits that says "don't decode me".

    The first 32 bits of an instruction get decoded directly, and then trigger
    and control the decoding of the rest of the instruction.

    This has the consequence that any immediate value that is 32 bits or more
    in length has to be split up into smaller pieces; this is what I really
    don't like about giving up the block structure.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Sep 2 13:07:07 2025
    From Newsgroup: comp.arch

    On 9/2/2025 4:15 AM, John Savard wrote:
    [...]

    The first 32 bits of an instruction get decoded directly, and then trigger and control the decoding of the rest of the instruction.

    This has the consequence that any immediate value that is 32 bits or more
    in length has to be split up into smaller pieces; this is what I really
    don't like about giving up the block structure.


    Note that tagging like that described does still allow some amount of
    parallel decoding, since we still have combinatorial logic. Granted, scalability is an issue.

    As can be noted, my use of jumbo-prefixes for large immediate values
    does have the property of allowing reusing 32-bit decoders for 64-bit
    and 96-bit instructions. In most cases, the 64-bit and 96-bit encodings
    don't change the instruction being decoded, but merely extend it.

    Some internal plumbing is needed to stitch the immediate values together though, typically:
      We have OpA, OpB, OpC
      DecC gets OpC, and JBits from OpB
      DecB gets OpB, and JBits from OpA
      DecA gets OpA, and 0 for JBits.
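    That plumbing might be sketched like so; the prefix tag (low 6 bits all
    ones) and the payload extraction are assumptions for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Each decoder lane sees its own op plus the jumbo payload (if any) of the
   op one slot earlier; the leftmost lane gets 0 for its jumbo bits. */
typedef struct { uint32_t op, jbits; } dec_in;

static int is_jumbo(uint32_t op) { return (op & 0x3Fu) == 0x3Fu; }

static void route(uint32_t opA, uint32_t opB, uint32_t opC, dec_in d[3])
{
    d[0].op = opA; d[0].jbits = 0;                             /* DecA */
    d[1].op = opB; d[1].jbits = is_jumbo(opA) ? opA >> 6 : 0;  /* DecB */
    d[2].op = opC; d[2].jbits = is_jumbo(opB) ? opB >> 6 : 0;  /* DecC */
}
```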

    In my CPU core, I had a few times considered changing how decoding
    worked, to either reverse or right-align the instruction block to reduce
    the amount of MUX'ing needed in the decoder. If going for
    right-alignment, then DecC would always go to Lane1, DecB to Lane2, and
    DecA to Lane3.

    Can note that for immediate-handling, the Lane1 decoder produces the low
    33 bits of the result. If a decoder has a jumbo prefix and is itself
    given a jumbo-prefix, it assumes a 96 bit encoding and produces the
    value for the high 32 bits.

    At least in my designs, I only account for 33 bits of immediate per
    lane. Instead, when a full 64-bit immediate is encoded, its value is
    assembled in the ID2/RF stage.


    Though, admittedly my CPU core design did fall back to sequential
    execution for 16-bit ops, but this was partly for cost reasons.

    For BJX2/XG1 originally, it was because the instructions couldn't use
    WEX tagging, but after adding superscalar it was because I would either
    need multiple parallel 16-bit decoders, or to change how 16 bit ops were handled (likely using a 16->32 repacker).

    So, say:
      IF stage:
        Retrieve instruction from Cache Line;
        Determine fetch length:
          XG1/XG2 used explicit tagging;
          XG3 and RV use SuperScalar checks.
        Run repackers.
          Currently both XG3 and RISC-V 48-bit ops are handled by repacking.
      Decode Stage:
        Decode N parallel 32-bit ops;
        Prefixes route to the corresponding instructions;
        Any lane holding solely a prefix goes NOP.


    For a repacker, it would help if there were fairly direct mappings
    between the 16-bit and 32-bit ops. Contrary to claims, RVC does not
    appear to fit such a pattern. Personally, there isn't much good to say
    about RVC's encoding scheme, as it is very much ad-hoc dog chew.

    The usual claim is more that it is "compressed" in that you can first
    generate a 32-bit op internally and "squish" it down into a 16-bit form
    if it fits. This isn't terribly novel as I see it. Repacking RVC has
    similar problems to decoding it directly, namely that for a fair number
    of instructions, nearly each instruction has its own idiosyncratic
    encoding scheme (and you can't just simply shuffle some of the bits
    around and fill others with 0s and similar to arrive back at a valid RVI instruction).


    By contrast, XG3 is mostly XG2 with the bits shuffled around, though
    there were some special cases made in the decoding rules. Admittedly, I
    did do more than the bare minimum here (to fit it into the same encoding
    space as RV), mostly as I ended up going for a "Dog Chew Reduction" route
    rather than merely doing the bare minimum bit-shuffling needed to make
    it fit.

    For better or worse, it effectively made XG3 its own ISA as far as BGBCC
    is concerned. Even if in theory I could have used repacking, the
    original XG1/XG2 emitter logic is a total mess. It was written
    originally for fixed-length 16-bit ops, so encodes and outputs
    instructions 16 bits at a time (using big "switch()" blocks, but the
    RISC-V and XG3 emitters also went this path; as far as BGBCC is
    concerned, it is treating XG3 as part of RISC-V land).


    Both the CPU core and also JX2VM handle it by repacking to XG2 though.
    For the XG3VM (userland only emulator for now), it instead decodes XG3 directly, with decoders for XG3, RVI, and RVC.

    Had noted the relative irony that despite XG3 having a longer
    instruction listing (than RVI) it still ends up with a slightly shorter decoder.

    Some of this has to deal with one big annoyance of RISC-V's encoding
    scheme: its inconsistent and dog-chewed handling of immediate and
    displacement values.


    Though, for mixed-output, there are still a handful of cases where RVI encodings can beat XG3 encodings, mostly involving cases where the RVI encodings have a slightly larger displacement.

    In compiler stats, this seems to mostly affect:
      LB, LBU, LW, LWU
      SB, SW
      ADDI, ADDIW, LUI
    The former:
      Well, unscaled 12-bit beats scaled 10-bit for 8 and 16-bit load/store.
      ADDI: 12b > 10b
      LUI: because loading a 32-bit value of the form XXXXX000 does happen
      sometimes it seems.

    Instruction counts are low enough that a "pure XG3" would likely result
    in Doom being around 1K larger (the cases where RVI ops are used would
    need a 64-bit jumbo-encoding in XG3).

    Though, the relative wonk of handling ADD in XG1/XG2/XG3 by using
    separate Imm10u/Imm10n encodings, rather than an Imm10s, does have merit
    in that this effectively gives it an Imm11s encoding; and ADD is one of
    the main instructions that tends to be big-immediate-heavy (and in early design it was a close race between ADD ImmU/ImmN, vs ADD/SUB ImmU, but
    the current scheme has a tiny bit more range, albeit SUB-ImmU could have possibly avoided the need for an ImmN case).

    So, say:
      ADD: Large immediate heavy.
        SUB: Can reduce to ADD in the immediate case.
      AND: Preferable to have signed immediate values;
        Not common enough to justify the ImmU/ImmN scheme.
      OR: Almost exclusively positive;
      XOR: Almost exclusively positive.

    Load/Store displacements are very lopsided in the positive direction.
      Disp10s slightly beats Disp10u though.
        More negative displacements than 512..1023.
      XG1 had sorta hacked around it by:
        Disp9u, Disp5n
        Disp10u, Disp6n was considered, but didn't go that way.
          Disp10s was at least, "slightly less ugly",
            Even if Disp10u+Disp6n would have arguably been better
            for code density.

    Or, cough, someone could maybe do signed load/store displacements like:
      000..110: Positive
      111: Negative
    So, Disp10as is 0..1791, -256..-1
    Would better fit the statistical distribution, but... Errm...


    ...



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Sep 2 13:10:23 2025
    From Newsgroup: comp.arch

    On 9/2/2025 1:07 PM, BGB wrote:
    On 9/2/2025 4:15 AM, John Savard wrote:
    On Sun, 31 Aug 2025 13:12:52 -0500, BGB wrote:

    [...]
    Or, cough, someone could maybe do signed load/store displacements like:
      000..110: Positive
      111: Negative
    So, Disp10as is 0..1791, -256..-1
    Would better fit the statistical distribution, but... Errm...


    Self-correction (brain fart), that would be a Disp11as, groan...

    Probably Disp10as:
      000..110: Positive
      111: Negative
      Disp10as is 0..895, -128..-1
    Or:
      00..10: Positive
      11: Negative
      Disp10as is 0..767, -256..-1
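    A decode sketch of the first variant, with the stated 0..895, -128..-1
    range:

```c
#include <assert.h>

/* Disp10as, first variant above: a 10-bit field whose top three bits select
   sign (000..110 positive, 111 negative), giving an asymmetric range that
   leans positive, matching the displacement statistics discussed earlier. */
static int disp10as(unsigned f)
{
    f &= 0x3FFu;                         /* 10-bit field */
    if ((f >> 7) == 7u)                  /* top 3 bits are 111 */
        return (int)(f & 0x7Fu) - 128;   /* -128..-1 */
    return (int)f;                       /* 0..895 */
}
```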

    ...


    ...




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Sep 2 18:40:16 2025
    From Newsgroup: comp.arch


    John Savard <quadibloc@invalid.invalid> posted:

    [...]

    This has the consequence that any immediate value that is 32 bits or more
    in length has to be split up into smaller pieces; this is what I really don't like about giving up the block structure.

    I found this completely unnecessary.

    Only a small number of Major OpCodes can have constants, denoted by::
        0b'001xxxdd dddsssss D12dsmin orxsssss

    D=0 signifies '1' and '2' specify 5-bit immediates
    D=1 signifies a constant
    d=0 signifies 32-bit constant
    d=1 signifies 64-bit constant
    '1' signifies negation of Src1
    '2' signifies negation of Src2

    In effect, D12ds is a routing specifier, telling DECODE what to route
    where in an easy to determine pattern. You could go so far as to call
    it a routing OpCode. This field is a large contributor to how My 66000
    requires fewer instructions than Other ISAs.

    However, I also found that STs need an immediate and a displacement, so,
    Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
    potential displacement (from D12ds above) and the immediate has the
    size of the ST. This provides for::
    std #4607182418800017408,[r3,r2<<3,96]

    Lest one thinks this results in serial decoding, consider that the
    pattern decoder is 40 gates (just larger than 3-flip-flops) so one
    can afford to put this pattern decoder on every word in the inst-
    buffer and then inst[0] selects inst[1], but inst[1] has already
    selected inst[2] which has selected inst[3] and we have a tree
    pattern that can parse 16-instructions in a 16-gate cycle time
    from a 24-32 word input-buffer to DECODE. I call this stage of
    the pipeline PARSE.
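    The per-word decode plus chained selection might be modeled like this;
    the toy length rule merely stands in for the real D12ds-driven pattern
    decoder, which isn't spelled out here:

```c
#include <assert.h>
#include <stdint.h>

/* A small pattern decoder runs on every word independently (all in parallel
   in hardware); instruction starts are then found by chained selection. */
static int toy_len(uint32_t w) { return 1 + (int)(w & 3u); }  /* toy rule */

static void parse_starts(const uint32_t *w, int n, int *start)
{
    int len[32];                       /* n is assumed <= 32 */
    for (int i = 0; i < n; i++)
        len[i] = toy_len(w[i]);        /* per-word decode: independent */
    for (int i = 0; i < n; ) {         /* chained selection of starts */
        start[i] = 1;
        for (int k = 1; k < len[i] && i + k < n; k++)
            start[i + k] = 0;          /* constant words: not decoded as ops */
        i += len[i];
    }
}
```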

    Also note that 1 My 66000 instruction does the work of 1.4 RISC-V
    instructions, so a 6-wide My 66000 machine is equivalent to an
    8.4-to-9-wide RISC-V machine.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Tue Sep 2 23:55:17 2025
    From Newsgroup: comp.arch

    On Tue, 02 Sep 2025 18:40:16 +0000, MitchAlsup wrote:

    Lest one thinks this results in serial decoding, consider that the
    pattern decoder is 40 gates (just larger than 3-flip-flops) so one can
    afford to put this pattern decoder on every word in the inst- buffer

    Yes, given sufficiently simple decoding, one could allow backtracking when
    the second word of an instruction is decoded as if it was the first.

    Of course, though, it wastes electricity and produces heat, but a
    negligible amount, I agree.

    I'm designing my ISA, though, to make it simple to implement... in one
    specific sense. It's horribly large and complicated, but at least it
    doesn't demand that implementors understand any fancy tricks.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2