• Deproved RAT

    From Robert Finch@robfi680@gmail.com to comp.arch on Sat Mar 28 17:29:47 2026
    From Newsgroup: comp.arch

    Okay, I have come up with the following scheme for mapping registers in
    the RAT to reduce the storage and logic requirements. Probably at the
    cost of some performance.

    Organize the physical registers into sets. Have only eight physical
    registers associated with each pair of ISA registers. That is an average
    of four rename registers available for each ISA register. They are
    organized in pairs to try and increase the odds of a rename register
    being available.

    For the register map, store a three-bit index into the group of eight
    physical registers for each ISA register. When referenced, the physical
    register number is the ISA register number divided by two, concatenated
    with the three-bit index.
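
    A minimal Python sketch of the mapping described above. It assumes 128 ISA
    registers (512 physical registers in 64 groups of eight) and a reset state
    where each register of a pair owns one slot; the class and method names
    are illustrative and not taken from Qupls.

```python
GROUP_BITS = 3                  # three-bit index stored per ISA register
GROUP_SIZE = 1 << GROUP_BITS    # eight physical registers per ISA pair
NUM_ISA_REGS = 128              # assumption: 512 physical / 8 per pair * 2


class GroupedRAT:
    def __init__(self):
        # Assumed reset state: even register in slot 0, odd register in slot 1.
        self.index = [r & 1 for r in range(NUM_ISA_REGS)]
        self.busy = [{0, 1} for _ in range(NUM_ISA_REGS // 2)]

    def lookup(self, isa_reg):
        """PRN = (ISA register number / 2) concatenated with the 3-bit index."""
        return ((isa_reg >> 1) << GROUP_BITS) | self.index[isa_reg]

    def rename(self, isa_reg):
        """Claim a free slot in the pair's group; None means a rename stall."""
        group = isa_reg >> 1
        for slot in range(GROUP_SIZE):
            if slot not in self.busy[group]:
                self.busy[group].add(slot)
                self.index[isa_reg] = slot
                return self.lookup(isa_reg)
        return None

    def release(self, prn):
        """Return a physical register to its group's free pool."""
        self.busy[prn >> GROUP_BITS].discard(prn & (GROUP_SIZE - 1))
```

    The stall case (rename returning None) is where this scheme can lose
    performance relative to a flat map: a group can run dry while other
    groups still have free registers.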

    The difference between this and a flat register map is that only three
    bits are required to identify the physical register, rather than the nine
    needed to index 512 physical registers. So it requires 1/3 the storage
    space and 1/3 the muxing.

    For Qupls RAT it works out to 17,300 LUTs instead of 27,600 LUTs.
    However, the number of stalls in renaming is potentially increased… IDK
    how big the impact is.
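
    The storage arithmetic can be checked with a few lines of Python. The ISA
    register count of 128 is an assumption derived from 512 physical registers
    in groups of eight per pair; the LUT figures are the ones quoted above.

```python
PHYS_REGS = 512
GROUP_SIZE = 8
ISA_REGS = 2 * PHYS_REGS // GROUP_SIZE          # 128, assuming every group is used

flat_bits = (PHYS_REGS - 1).bit_length()        # 9 bits per flat-map entry
grouped_bits = (GROUP_SIZE - 1).bit_length()    # 3 bits per grouped entry

flat_total = ISA_REGS * flat_bits               # bits of map storage, flat
grouped_total = ISA_REGS * grouped_bits         # bits of map storage, grouped

lut_ratio = 17_300 / 27_600                     # LUT counts quoted in the post
print(flat_total, grouped_total, round(lut_ratio, 2))
```

    The map storage shrinks to one third, while the quoted LUT counts shrink
    to roughly 63% of the original, presumably because not all of the RAT
    logic scales with the entry width.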
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Sat Mar 28 22:11:00 2026
    From Newsgroup: comp.arch

    In article <10q9h8c$v7qa$1@dont-email.me>, robfi680@gmail.com (Robert
    Finch) wrote:

    Organize the physical registers into sets. Have only eight physical registers associated with each pair of ISA registers. That is an
    average of four rename registers available for each ISA register.
    They are organized in pairs to try and increase the odds of a
    rename register being available.

    This seems to present the compiler writer with a temptation to make use
    of information about the number of rename registers in long expression sequences. That causes problems when the implementation changes.

    John
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Mar 28 19:17:45 2026
    From Newsgroup: comp.arch

    On 2026-03-28 6:11 p.m., John Dallman wrote:
    In article <10q9h8c$v7qa$1@dont-email.me>, robfi680@gmail.com (Robert
    Finch) wrote:

    Organize the physical registers into sets. Have only eight physical
    registers associated with each pair of ISA registers. That is an
    average of four rename registers available for each ISA register.
    They are organized in pairs to try and increase the odds of a
    rename register being available.

    This seems to present the compiler writer with a temptation to make use
    of information about the number of rename registers in long expression sequences. That causes problems when the implementation changes.

    John

    Yes. I think it would only cause performance differences. Performance
    should only improve on a better implementation.

    It should be possible to trade-off the number of registers in a set and
    the number of ISA registers. I tried sixteen physical with four ISA
    which should lower the number of stalls. It was only about 1k LUTs more.

    I have thought that superscalars were complex enough that people would
    not be cycle counting, but measuring instead.

    I got the CPU core to fit on the FPGA now, but without a DRAM
    controller. Meaning I should be able to get some small demos running
    eventually. I had to axe the instruction expander too. Instructions to
    micro-ops are now 1:1.

  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Mar 30 21:01:13 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Okay, I have come up with the following scheme for mapping registers in
    the RAT to reduce the storage and logic requirements. Probably at the
    cost of some performance.

    Organize the physical registers into sets. Have only eight physical registers associated with each pair of ISA registers. That is an average
    of four rename registers available for each ISA register. They are
    organized in pairs to try and increase the odds of a rename register
    being available.

    For the register map, store a three-bit index into the group of eight physical registers for each ISA register. When referenced the physical register number is the ISA register number divided by two concatenated
    with the three-bit index.

    Take what follows with a grain of salt::

    In the machines on which I participated, we kept additional information
    in <essentially> the RAT--in particular, the identity of the FU which
    will deliver the result. So, reading the RAT gave the Physical Register
    Number (PRN), the FU that delivers the result, and whether the result is
    in the RF or still waiting on the FU. We encoded this such that FU and PRN
    used the same bits along with a state bit, so the whole thing used only
    8 bits: 1 state bit selecting between a PRN (7 bits: 128 Physical
    Registers) and an FU (3 bits) plus tag (4 bits).
    We had 6 FUs and an execution window 16 deep.
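
    One way to picture the 8-bit entry is the Python sketch below. The bit
    positions (state in bit 7, FU in bits 6:4, tag in bits 3:0) and the
    function names are assumptions for illustration; the post only gives the
    field widths.

```python
IN_RF = 0x80  # assumed position of the state bit


def encode_in_rf(prn):
    """Entry for a result already in the register file: state=1, 7-bit PRN."""
    if not 0 <= prn < 128:
        raise ValueError("PRN must fit in 7 bits")
    return IN_RF | prn


def encode_waiting(fu, tag):
    """Entry for a latent result: state=0, 3-bit FU, 4-bit tag."""
    if not 0 <= fu < 6:
        raise ValueError("the machine described had 6 FUs")
    if not 0 <= tag < 16:
        raise ValueError("tag must fit in 4 bits (16-deep window)")
    return (fu << 4) | tag


def decode(entry):
    """Return ('RF', prn) or ('FU', fu, tag) depending on the state bit."""
    if entry & IN_RF:
        return ("RF", entry & 0x7F)
    return ("FU", (entry >> 4) & 0x7, entry & 0xF)
```

    Because the PRN and the {FU, tag} pair share the same seven bits, the
    entry stays one byte wide regardless of which state it is in.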

    When the operand is latent, FU tells the Reservation Station entry which
    result bus to monitor and which tag to match. This means the RS entries
    are only watching 1 bus each, need only 1 comparator; but over time each
    entry can monitor all result busses.
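
    A behavioural sketch of one reservation-station operand under this
    scheme, in Python. The Operand class and the bus representation (a list
    indexed by FU, holding a (tag, value) pair or None) are hypothetical; the
    point is that the FU id acts as a mux select and only one tag compare is
    needed per operand.

```python
from dataclasses import dataclass


@dataclass
class Operand:
    fu: int              # which result bus to monitor (from the RAT)
    tag: int             # which tag to match on that bus
    ready: bool = False
    value: int = 0

    def snoop(self, buses):
        """buses[fu] is (tag, value) when that FU drives a result, else None."""
        if self.ready:
            return
        driven = buses[self.fu]                 # mux: pick the one watched bus
        if driven is not None and driven[0] == self.tag:   # single comparator
            self.value = driven[1]
            self.ready = True
```

    A result with the right tag on a different bus is ignored, which is what
    allows one comparator per operand instead of one per result bus.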

    I also used an associative lookup (ARN in, PRN/{FU,tag} out) with a valid
    bit per PRN. The valid bits are written into the history table each issue,
    and are read out as Bcnd begins execution, so that if the branch was
    mispredicted, the RAT can be recovered by writing 128 valid bits to
    RAT.cam--as few as zero cycles of branch recovery. The entries in the
    history table have logic per column that can discover that the register
    is free even before the result is written to RF. Issue has logic to
    determine if the register is over-written in the same issue cycle, in
    which case the PRN does not get allocated, so the pool is effectively
    larger.
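
    The valid-bit checkpointing can be sketched in Python as below. This is
    only one reading of the description, not the actual design: the CamRAT
    class, its method names, and the dictionary used as a history table are
    assumptions, and the caller is assumed not to reuse a PRN row until every
    checkpoint that still marks it valid has been discarded.

```python
class CamRAT:
    def __init__(self, num_prn=128):
        self.arn = [None] * num_prn      # ARN held in each CAM row (one per PRN)
        self.valid = [False] * num_prn   # one valid bit per PRN
        self.history = {}                # checkpoint id -> saved valid bits

    def lookup(self, arn):
        """Associative lookup: ARN in, PRN out (None if unmapped)."""
        for prn, row_arn in enumerate(self.arn):
            if self.valid[prn] and row_arn == arn:
                return prn
        return None

    def rename(self, arn, new_prn):
        """Map arn to new_prn; the old row is only invalidated, not erased."""
        old = self.lookup(arn)
        if old is not None:
            self.valid[old] = False
        self.arn[new_prn] = arn
        self.valid[new_prn] = True

    def checkpoint(self, key):
        """Save the valid bits, as done each issue in the post."""
        self.history[key] = list(self.valid)

    def recover(self, key):
        """Misprediction: restore the mapping by rewriting the valid bits."""
        self.valid = list(self.history[key])
```

    Recovery touches only the valid bits because the ARN stored in an
    invalidated row is left in place, so re-validating the row restores the
    older mapping.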

    The difference between this and a flat register map is that only three
    bits are required to identify the physical register. So it requires 1/3
    the storage space and 1/3 the muxing. (with 512 physical registers).

    For Qupls RAT it works out to 17,300 LUTs instead of 27,600 LUTs.
    However, the number of stalls in renaming is potentially increased… IDK how big the impact is.
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Mar 31 08:05:55 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    In the machines on which I participated, we kept additional information
    in <essentially> the RAT--in particular, the identity of the FU which
    will deliver the result. So, reading the RAT gave the Physical Register
    Number (PRN), the FU who delivers a result and whether the result is in
    RF or waiting on FU. We encoded this such that FU and PRN used the same
    bits along with a state, so the whole thing used only 8-bits; 1 for PRN
    (7-bits: 128 Physical Registers) versus FU (3-bits) and tag (4-bits).

    My impression is that in recent CPUs with valueless reservation
    stations the PRN is used in the in-flight instructions and the RAT
    from the start, without needing to change anything in in-flight
    instructions once an instruction that writes a register it depends on
    delivers its results. Maybe they have additional bits for detecting
    that the result is available without having to compare everything.

    One disadvantage of this approach is that PRNs are allocated before
    they actually need to store something. For programs that have a lot
    of instructions waiting for some result (of, e.g., a cache miss) to
    become ready, that might be an issue. OTOH, you need those physical
    registers anyway for programs that have a lot of finished instructions
    waiting for committing in the reorder buffer (e.g., due to having to
    wait for an instruction that might trap or mispredict).

    So finding a way to reduce the register needs of the former kind of
    program may not lead to actually reducing the number of registers;
    therefore such a way may not actually be useful.

    When the operand is latent, FU tells the Reservation Station entry which
    result bus to monitor and which tag to match. This means the RS entries
    are only watching 1 bus each, need only 1 comparator; but over time each
    entry can monitor all result busses.

    Your description inspired an idea: My impression is that having many
    write ports is much more expensive than having more registers. So
    have a register file for each FU, with one write port for each
    FU-specific register file. The total number of physical registers
    would have to be increased to achieve a similar renaming capacity
    across typical workloads, and one probably still needs a similar
    number of read ports as before, but the result might still require
    less area.

    However, based on register file capacity measurements (e.g., at
    chipsandcheese) it seems that modern microarchitectures differentiate
    at most between GPRs, SIMD registers, and various flags when it comes
    to physical registers. So either the area saving from reduced write
    ports is not that relevant and many-write-ported register files are
    used.

    Or there is a backup mechanism for making use of other register files
    when the FU's register file has no free registers; e.g., if the
    register renamer finds that all registers in the FU's register file
    are allocated, it can insert a move uop from a register of the target
    FU to an idle FU with enough free registers before
    allocating a register to an uop for the target FU.
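
    A rough Python sketch of the fallback idea, to make the sequence of steps
    concrete. Everything here is hypothetical: the PerFuFiles class, the
    choice of the donor FU as the one with the most free registers, and the
    fixed victim register 0 are placeholders, and the RAT update for the
    moved value is not modelled.

```python
class PerFuFiles:
    def __init__(self, num_fus, regs_per_fu):
        # One free list per FU-specific register file (one write port each).
        self.free = [list(range(regs_per_fu)) for _ in range(num_fus)]
        self.uops = []   # move uops inserted by the renamer

    def allocate(self, fu):
        """Return (fu, reg) for a new result, or None if renaming must stall."""
        if not self.free[fu]:
            # Target file is full: find the file with the most free registers.
            donor = max(range(len(self.free)), key=lambda f: len(self.free[f]))
            if not self.free[donor]:
                return None                      # nothing free anywhere
            dst = self.free[donor].pop()
            victim = 0                           # placeholder victim choice
            # Move a value out of the target file to make room for the uop.
            self.uops.append(("move", (fu, victim), (donor, dst)))
            self.free[fu].append(victim)
        return (fu, self.free[fu].pop())
```

    The extra move uop costs an issue slot and a cycle or more of latency,
    which is the price paid for keeping one write port per file.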

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Mar 31 18:35:53 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    In the machines on which I participated, we kept additional information
    in <essentially> the RAT--in particular, the identity of the FU which
    will deliver the result. So, reading the RAT gave the Physical Register
    Number (PRN), the FU who delivers a result and whether the result is in
    RF or waiting on FU. We encoded this such that FU and PRN used the same
    bits along with a state, so the whole thing used only 8-bits; 1 for PRN
    (7-bits: 128 Physical Registers) versus FU (3-bits) and tag (4-bits).

    My impression is that in recent CPUs with valueless reservation
    stations the PRN is used in the in-flight instructions and the RAT
    from the start, without needing to change anything in in-flight
    instructions once an instruction that writes a register it depends on delivers its results. Maybe they have additional bits for detecting
    that the result is available without having to compare everything.

    There are several ways to do the above {ARN number}, {PRN}, {FU,tag}, ...
    all arriving at the same spot--Data-flow works.

    One disadvantage of this approach is that PRNs are allocated before
    they actually need to store something. For programs that have a lot
    of instructions waiting for some result (of, e.g., a cache miss) to
    become ready, that might be an issue. OTOH, you need those physical registers anyway for programs that have a lot of finished instructions waiting for committing in the reorder buffer (e.g., due to having to
    wait for an instruction that might trap or mispredict).

    PRNs are required from about 2 cycles after Decode until RoB retirement. Certain PRNs may die earlier (over written) and different choices make
    this easier or harder.

    So finding a way to reduce the register needs of the former kind of
    program may not lead to actually reducing the number of registers;
    therefore such a way may not actually be useful.

    Not by enough to count.

    When the operand is latent, FU tells the Reservation Station entry which
    result bus to monitor and which tag to match. This means the RS entries
    are only watching 1 bus each, need only 1 comparator; but over time each
    entry can monitor all result busses.

    Your description inspired an idea: My impression is that having many
    write ports is much more expensive than having more registers. So
    have a register file for each FU, with one write port for each
    FU-specific register file. The total number of physical registers
    would have to be increased to achieve a similar renaming capacity
    across typical workloads, and one probably still needs a similar
    number of read ports as before, but the result might still require
    less area.

    Mc 88120 had 6 "FU"s, 6 write ports. Each FU contained an Integer unit
    (less shift) and whatever the FU was named for. I found this "better"
    than having 12 FUs {6 Int, 3 Mem, 1 FMUL, 1 FADD, 1 Branch} because of
    bus contention between Ints and Others. Not much but enough.

    However, based on register file capacity measurements (e.g., at chipsandcheese) it seems that modern microarchitectures differentiate
    at most between GPRs, SIMD registers, and various flags when it comes
    to physical registers. So either the area saving from reduced write
    ports is not that relevant and many-write-ported register files are
    used.

    Or there is a backup mechanism for making use of other register files
    when the FU's register file has no free registers; e.g., if the
    register renamer finds that all registers in the FU's register file
    are allocated, it can insert a move uop from a register of the target
    FU to an idle FU with enough free registers before
    allocating a register to an uop for the target FU.

    - anton
  • From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Tue Mar 31 20:07:40 2026
    From Newsgroup: comp.arch

    In article <10q9nir$11j88$1@dont-email.me>, robfi680@gmail.com (Robert
    Finch) wrote:

    This seems to present the compiler writer with a temptation to
    make use of information about the number of rename registers in
    long expression sequences. That causes problems when the
    implementation changes.

    Yes. I think it would only cause performance differences.

    Yes, but those can matter a lot.

    Performance should only improve on a better implementation.

    It should, yes. The ability of customers to chance upon and rely on pathological cases should not be under-estimated.

    I have thought that superscalars were complex enough that people
    would not be cycle counting, but measuring instead.

    That's generally true.

    John
  • From Robert Finch@robfi680@gmail.com to comp.arch on Thu Apr 2 01:27:11 2026
    From Newsgroup: comp.arch

    On 2026-03-31 2:35 p.m., MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    In the machines on which I participated, we kept additional information
    in <essentially> the RAT--in particular, the identity of the FU which
    will deliver the result. So, reading the RAT gave the Physical Register
    Number (PRN), the FU who delivers a result and whether the result is in
    RF or waiting on FU. We encoded this such that FU and PRN used the same
    bits along with a state, so the whole thing used only 8-bits; 1 for PRN
    (7-bits: 128 Physical Registers) versus FU (3-bits) and tag (4-bits).

    My impression is that in recent CPUs with valueless reservation
    stations the PRN is used in the in-flight instructions and the RAT
    from the start, without needing to change anything in in-flight
    instructions once an instruction that writes a register it depends on
    delivers its results. Maybe they have additional bits for detecting
    that the result is available without having to compare everything.

    There are several ways to do the above {ARN number}, {PRN}, {FU,tag}, ...
    all arriving at the same spot--Data-flow works.

    One disadvantage of this approach is that PRNs are allocated before
    they actually need to store something. For programs that have a lot
    of instructions waiting for some result (of, e.g., a cache miss) to
    become ready, that might be an issue. OTOH, you need those physical
    registers anyway for programs that have a lot of finished instructions
    waiting for committing in the reorder buffer (e.g., due to having to
    wait for an instruction that might trap or mispredict).

    PRNs are required from about 2 cycles after Decode until RoB retirement. Certain PRNs may die earlier (over written) and different choices make
    this easier or harder.

    So finding a way to reduce the register needs of the former kind of
    program may not lead to actually reducing the number of registers;
    therefore such a way may not actually be useful.

    Not by enough to count.

    When the operand is latent, FU tells the Reservation Station entry which
    result bus to monitor and which tag to match. This means the RS entries
    are only watching 1 bus each, need only 1 comparator; but over time each
    entry can monitor all result busses.

    Your description inspired an idea: My impression is that having many
    write ports is much more expensive than having more registers. So
    have a register file for each FU, with one write port for each
    FU-specific register file. The total number of physical registers
    would have to be increased to achieve a similar renaming capacity
    across typical workloads, and one probably still needs a similar
    number of read ports as before, but the result might still require
    less area.

    Mc 88120 had 6 "FU"s, 6 write ports. Each FU contained an Integer unit
    (less shift) and whatever the FU was named for. I found this "better"
    than having 12 FUs {6 Int, 3 Mem, 1 FMUL, 1 FADD, 1 Branch} because of
    bus contention between Ints and Others. Not much but enough.

    However, based on register file capacity measurements (e.g., at
    chipsandcheese) it seems that modern microarchitectures differentiate
    at most between GPRs, SIMD registers, and various flags when it comes
    to physical registers. So either the area saving from reduced write
    ports is not that relevant and many-write-ported register files are
    used.

    Or there is a backup mechanism for making use of other register files
    when the FU's register file has no free registers; e.g., if the
    register renamer finds that all registers in the FU's register file
    are allocated, it can insert a move uop from a register of the target
    FU to an idle FU with enough free registers before
    allocating a register to an uop for the target FU.

    - anton

    These posts have given me something to investigate: whether it is smaller
    to add to the RAT and reduce the number of comparators in the
    reservation stations, or to reduce the RAT.
    More config options coming up.

    Let me see if I understand this. While there may only be one bus being
    monitored, that bus has to originate from the other result busses via a
    mux. So, the result busses run past the reservation stations and feed
    into a mux controlled by the FU id, and the reservation station examines
    the mux output for values. I think I can see where that would make the
    reservation stations smaller. It gets rid of the comparators in the
    reservation stations and replaces them with muxes on the result busses.

    Qupls has a slightly different organization. There are a lot of
    functional units, 14 IIRC for a full-blown version, each with four or
    more read ports. But there are only four result busses being examined.
    The result bus is dynamically selected to update the register file.
    Whichever set of four results is selected is looked at.

    Qupls has values stored in the reservation stations. There are only 16
    register read ports running to the reservation stations that are used to
    load values. The four result busses are also monitored for values to
    load. All of this is still smaller than the RAT, as Qupls is configured
    at the moment.

    I could try changing things so that all 14 (or more) result busses run
    past the reservation stations, but I have a feeling that all the muxes
    for the busses will consume a lot of logic. Muxes are relatively
    expensive in an FPGA. Comparators are less expensive, I think.

    Current config (8 units):
    ALU1, ALU2, IMUL, DIV, FMA, FPU, MEM, BRANCH

    Reservation stations are using about 5k LUTs each.
    The RAT is about 50k LUTs.


  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Apr 2 17:57:15 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2026-03-31 2:35 p.m., MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    In the machines on which I participated, we kept additional information
    in <essentially> the RAT--in particular, the identity of the FU which
    will deliver the result. So, reading the RAT gave the Physical Register
    Number (PRN), the FU who delivers a result and whether the result is in
    RF or waiting on FU. We encoded this such that FU and PRN used the same
    bits along with a state, so the whole thing used only 8-bits; 1 for PRN
    (7-bits: 128 Physical Registers) versus FU (3-bits) and tag (4-bits).

    My impression is that in recent CPUs with valueless reservation
    stations the PRN is used in the in-flight instructions and the RAT
    from the start, without needing to change anything in in-flight
    instructions once an instruction that writes a register it depends on
    delivers its results. Maybe they have additional bits for detecting
    that the result is available without having to compare everything.

    There are several ways to do the above {ARN number}, {PRN}, {FU,tag}, ... all arriving at the same spot--Data-flow works.

    One disadvantage of this approach is that PRNs are allocated before
    they actually need to store something. For programs that have a lot
    of instructions waiting for some result (of, e.g., a cache miss) to
    become ready, that might be an issue. OTOH, you need those physical
    registers anyway for programs that have a lot of finished instructions
    waiting for committing in the reorder buffer (e.g., due to having to
    wait for an instruction that might trap or mispredict).

    PRNs are required from about 2 cycles after Decode until RoB retirement. Certain PRNs may die earlier (over written) and different choices make
    this easier or harder.

    So finding a way to reduce the register needs of the former kind of
    program may not lead to actually reducing the number of registers;
    therefore such a way may not actually be useful.

    Not by enough to count.

    When the operand is latent, FU tells the Reservation Station entry which
    result bus to monitor and which tag to match. This means the RS entries
    are only watching 1 bus each, need only 1 comparator; but over time each
    entry can monitor all result busses.

    Your description inspired an idea: My impression is that having many
    write ports is much more expensive than having more registers. So
    have a register file for each FU, with one write port for each
    FU-specific register file. The total number of physical registers
    would have to be increased to achieve a similar renaming capacity
    across typical workloads, and one probably still needs a similar
    number of read ports as before, but the result might still require
    less area.

    Mc 88120 had 6 "FU"s, 6 write ports. Each FU contained an Integer unit (less shift) and whatever the FU was named for. I found this "better"
    than having 12 FUs {6 Int, 3 Mem, 1 FMUL, 1 FADD, 1 Branch} because of
    bus contention between Ints and Others. Not much but enough.

    However, based on register file capacity measurements (e.g., at
    chipsandcheese) it seems that modern microarchitectures differentiate
    at most between GPRs, SIMD registers, and various flags when it comes
    to physical registers. So either the area saving from reduced write
    ports is not that relevant and many-write-ported register files are
    used.

    Or there is a backup mechanism for making use of other register files
    when the FU's register file has no free registers; e.g., if the
    register renamer finds that all registers in the FU's register file
    are allocated, it can insert a move uop from a register of the target
    FU to an idle FU with enough free registers before
    allocating a register to an uop for the target FU.

    - anton

    These posts have given me something to investigate. Whether it is smaller
    to add to the RAT and reduce the number of comparators in the
    reservation stations OR reduce the RAT.
    More config options coming up.

    Let me see if I understand this. While there may only be one bus being
    monitored, that bus has to originate from the other result busses via a
    mux. So, the result busses are going past the reservation stations which
    then feed into a mux controlled by the FU id which the reservation
    station examines for values. I think I can see where that would make the
    reservation stations smaller. It gets rid of the comparators in the
    reservation stations and replaces them with muxes on the result busses.

    Right, all result busses go to all RSs. Each RS entry.operand watches
    1 (or 0) busses. Any RS entry.operand can watch any result bus.

    Qupls has a slightly different organization. There are a lot of
    functional units. 14 IIRC for a full-blown version, each with four or
    more read ports. But there are only four result busses being examined.

    The result bus is dynamically selected to update the register file.

    I would consider the dynamically selected result bus a mistake. A
    result bus is heavily loaded and needs big drivers. Your design will
    need 4 big drivers per FU instead of 1. And for what gain?

    Whichever set of four results is selected is looked at.

    Qupls has values stored in the reservation stations. There are only 16
    register read ports running to the reservation stations that are used to
    load values. Then the four result busses are also monitored for values to
    load. All of this is still smaller than the RAT, as Qupls is configured
    at the moment.

    How many entries (instructions) per RS?

    I could try changing things so that all 14 (or more) result busses run
    past the reservation stations, but I have a feeling that all the muxes
    for the busses will consume a lot of logic. Muxes are relatively
    expensive in an FPGA. Comparators are less expensive I think.

    Current config (8 units):
    ALU1, ALU2, IMUL, DIV, FMA, FPU, MEM, BRANCH
    versus:
    ALU1, ALU2, ALU3, ALU4, ALU5, ALU6
    MEM1, MEM2, MEM3, FADD, FMUL, Branch
    SFT1, SFT2, SFT3, FMSC, FDIV
    where vertical means they are the same FU#
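
    Reading the table column by column gives the contents of each FU. A small
    Python sketch of that reading follows; the variable names are
    illustrative, and "SHT2" in the post is taken to mean SFT2.

```python
from itertools import zip_longest

# Rows as written in the post; each column is one FU number.
rows = [
    ["ALU1", "ALU2", "ALU3", "ALU4", "ALU5", "ALU6"],
    ["MEM1", "MEM2", "MEM3", "FADD", "FMUL", "Branch"],
    ["SFT1", "SFT2", "SFT3", "FMSC", "FDIV"],   # "SHT2" read as SFT2
]

# Transpose: fus[i] lists the units sharing FU number i + 1 (and its result bus).
fus = [[unit for unit in column if unit is not None]
       for column in zip_longest(*rows)]

for number, units in enumerate(fus, start=1):
    print(f"FU{number}: {', '.join(units)}")
```

    Seen this way there are 17 units but only 6 FU numbers, so only 6 result
    busses and 6 register file write ports are needed.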

    Reservation stations are using about 5k LUTs each.

    14 × 5k = 70k LUTs

    The RAT is about 50k LUTs.


  • From Robert Finch@robfi680@gmail.com to comp.arch on Thu Apr 2 17:50:09 2026
    From Newsgroup: comp.arch

    On 2026-04-02 1:57 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-03-31 2:35 p.m., MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    In the machines on which I participated, we kept additional information
    in <essentially> the RAT--in particular, the identity of the FU which
    will deliver the result. So, reading the RAT gave the Physical Register
    Number (PRN), the FU who delivers a result and whether the result is in
    RF or waiting on FU. We encoded this such that FU and PRN used the same
    bits along with a state, so the whole thing used only 8-bits; 1 for PRN
    (7-bits: 128 Physical Registers) versus FU (3-bits) and tag (4-bits).

    My impression is that in recent CPUs with valueless reservation
    stations the PRN is used in the in-flight instructions and the RAT
    from the start, without needing to change anything in in-flight
    instructions once an instruction that writes a register it depends on
    delivers its results. Maybe they have additional bits for detecting
    that the result is available without having to compare everything.

    There are several ways to do the above {ARN number}, {PRN}, {FU,tag}, ...
    all arriving at the same spot--Data-flow works.

    One disadvantage of this approach is that PRNs are allocated before
    they actually need to store something. For programs that have a lot
    of instructions waiting for some result (of, e.g., a cache miss) to
    become ready, that might be an issue. OTOH, you need those physical
    registers anyway for programs that have a lot of finished instructions
    waiting for committing in the reorder buffer (e.g., due to having to
    wait for an instruction that might trap or mispredict).

    PRNs are required from about 2 cycles after Decode until RoB retirement.
    Certain PRNs may die earlier (over written) and different choices make
    this easier or harder.

    So finding a way to reduce the register needs of the former kind of
    program may not lead to actually reducing the number of registers;
    therefore such a way may not actually be useful.

    Not by enough to count.

    When the operand is latent, FU tells the Reservation Station entry which
    result bus to monitor and which tag to match. This means the RS entries
    are only watching 1 bus each, need only 1 comparator; but over time each
    entry can monitor all result busses.

    Your description inspired an idea: My impression is that having many
    write ports is much more expensive than having more registers. So
    have a register file for each FU, with one write port for each
    FU-specific register file. The total number of physical registers
    would have to be increased to achieve a similar renaming capacity
    across typical workloads, and one probably still needs a similar
    number of read ports as before, but the result might still require
    less area.

    Mc 88120 had 6 "FU"s, 6 write ports. Each FU contained an Integer unit
    (less shift) and whatever the FU was named for. I found this "better"
    than having 12 FUs {6 Int, 3 Mem, 1 FMUL, 1 FADD, 1 Branch} because of
    bus contention between Ints and Others. Not much but enough.

    However, based on register file capacity measurements (e.g., at
    chipsandcheese) it seems that modern microarchitectures differentiate
    at most between GPRs, SIMD registers, and various flags when it comes
    to physical registers. So either the area saving from reduced write
    ports is not that relevant and many-write-ported register files are
    used.

    Or there is a backup mechanism for making use of other register files
    when the FU's register file has no free registers; e.g., if the
    register renamer finds that all registers in the FU's register file
    are allocated, it can insert a move uop from a register of the target
    FU to an idle FU with enough free registers before
    allocating a register to an uop for the target FU.
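    The fallback described above can be sketched roughly as follows. This
    is only an illustration under stated assumptions: the class name
    PerFuRenamer, the choice of the oldest live register as the one to
    move, and picking the donor FU with the most free registers are all
    hypothetical and not taken from any real design. A real renamer could
    only free the moved register once the move uop has completed and the
    map has been updated.

```python
from typing import List, Optional, Tuple

Reg = Tuple[int, int]  # (FU number, register index inside that FU's file)


class PerFuRenamer:
    """Toy model: one register file per FU, each written only by its own FU."""

    def __init__(self, n_fus: int, regs_per_fu: int) -> None:
        self.free: List[List[int]] = [list(range(regs_per_fu)) for _ in range(n_fus)]
        self.live: List[List[int]] = [[] for _ in range(n_fus)]  # oldest first

    def allocate(self, fu: int) -> Tuple[Optional[Reg], List[Tuple[Reg, Reg]]]:
        """Return (destination register, move uops inserted first).

        A destination of None means the renamer has to stall.
        """
        moves: List[Tuple[Reg, Reg]] = []
        if not self.free[fu]:
            # Assumption: the donor is the other FU with the most free registers.
            donors = [f for f in range(len(self.free)) if f != fu and self.free[f]]
            if not donors:
                return None, moves
            donor = max(donors, key=lambda f: len(self.free[f]))
            victim = self.live[fu].pop(0)      # assumption: move the oldest value
            new_reg = self.free[donor].pop()
            self.live[donor].append(new_reg)
            self.free[fu].append(victim)       # simplification: freed immediately
            moves.append(((fu, victim), (donor, new_reg)))
        reg = self.free[fu].pop()
        self.live[fu].append(reg)
        return (fu, reg), moves


renamer = PerFuRenamer(n_fus=2, regs_per_fu=2)
renamer.allocate(0)
renamer.allocate(0)
dest, moves = renamer.allocate(0)  # FU 0 is full, so one move uop is inserted
print(dest, moves)
```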

    - anton

    These posts have given me something to investigate. Whether it is smaller
    to add to the RAT and reduce the number of comparators in the
    reservation stations OR reduce the RAT.
    More config options coming up.

    Let's see if I understand this. While there may only be one bus being
    monitored, that bus has to originate from the other result busses via a
    mux. So, the result busses are going past the reservation stations which
    then feed into a mux controlled by the FU id which the reservation
    station examines for values. I think I can see where that would make the
    reservation stations smaller. It gets rid of the comparators in the
    reservation stations and replaces them with muxes on the result busses.

    Right, all result busses go to all RSs. Each RS entry.operand watches
    1 (or 0) busses. Any RS entry.operand can watch any result bus.
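    A minimal behavioural sketch of that wakeup scheme, assuming a
    software model rather than RTL. The names Operand and snoop are
    hypothetical. Each operand slot holds the number of the one bus it
    was told to watch and the tag to match, so a single comparator per
    operand suffices, with a mux selecting the watched bus.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Operand:
    """One operand slot of a reservation-station entry (hypothetical model)."""
    ready: bool = False
    value: int = 0
    bus: Optional[int] = None  # which result bus to watch; None = watching nothing
    tag: Optional[int] = None  # tag expected on that bus

    def snoop(self, buses: List[Optional[Tuple[int, int]]]) -> None:
        """buses[i] is the (tag, value) driven on result bus i this cycle, or None."""
        if self.ready or self.bus is None:
            return
        driven = buses[self.bus]                          # mux: pick the watched bus
        if driven is not None and driven[0] == self.tag:  # the single comparator
            self.value = driven[1]
            self.ready = True
            self.bus = None


op = Operand(bus=2, tag=37)
op.snoop([(37, 111), None, (5, 222), None])  # tag 37 on bus 0 is never looked at
print(op.ready)
op.snoop([None, None, (37, 999), None])      # matching tag on the watched bus
print(op.ready, op.value)
```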

    Qupls has a slightly different organization. There are a lot of
    functional units. 14 IIRC for a full-blown version, each with four or
    more read ports. But there are only four result busses being examined.

    The result bus is dynamically selected to update the register file.

    I would consider the dynamically selected result bus a mistake. A
    result bus is heavily loaded and needs big drivers. Your design will
    need 4 big drivers per FU instead of 1. And for what gain ??

    An issue is the number of result busses to support all the units.
    There is something like 16 or 18 results (some units can produce two
    results), I thought it would not work to have a result bus for every
    unit. 16 write ports on the register file was not happening. I could not
    see how to reduce things to say 6 busses.

    Four busses were used to minimize the size of the register file, since
    there was a mux anyway. I was not thinking of the driver electronics for running in an FPGA.

    I am not fond of the dynamic selected result bus, either. Maybe it could
    be reduced to eight busses, without dynamic selection.

    Whichever set of four results is selected is looked at.

    Qupls has values stored in the reservation stations. There are only 16
    register read ports running to the reservation stations that are used to
    load values. Then the four result busses are also monitored for values to
    load. All of this is still smaller than the RAT, as Qupls is configured
    at the moment.

    How many entries (instructions) per RS ?

    Qupls is currently configured for one entry per RS. But it is a
    parameter (for each RS). It had to be minimized to fit the FPGA.
    I think the 5k size was for three-entry RS.


    I could try changing things so that all 14 (or more) result busses run
    past the reservations stations, but I have a feeling that all the muxes
    for the busses will consume a lot of logic. Muxes are relatively
    expensive in an FPGA. Comparators are less expensive I think.

    Current config (8 units):
    ALU1, ALU2, IMUL, DIV, FMA, FPU, MEM, BRANCH
    versus:
    ALU1, ALU2, ALU3, ALU4, ALU5, ALU6
    MEM1, MEM2, MEM3, FADD, FMUL, Branch
    SFT1, SFT2, SFT3, FMSC, FDIV,
    where vertical means they are the same FU#


    Okay, I had units separated by latency so there is minimal latency going
    from the unit back to the results/input (feedback paths). Trying to keep performance of dependent instructions good.
    ALU1, ALU2 are single cycle latency. FPU is three cycles versus FMA
    which is five. Most of the units can issue an instruction every clock
    cycle. Some units not in the minimal config may have large latencies and cannot issue every cycle. These include float trig, graphics unit,
    neural net unit.

    Although two ALUs are shown, the FPU can execute ALU instructions too.
    And the ALU can execute the single cycle FPU instructions. I use the
    name SAU (for simple arithmetic unit) because of the crossover. When I
    see ALU I think integer.

    There are four result busses to feed the register file. A larger
    register file may be too much for the current implementation. There is a
    lot of BRAM used for the register file. 1/4 BRAMs in the device.


    Reservation stations are using about 5k LUTs each.

    14×5 = 70K

    The RAT is about 50k LUTs.



    I tried configuring Qupls for 3 entries per RS, and more functional units/functionality, but it turned out to be about 700,000 LUTs.
    I am trying to keep a demo under 200k LUTs.
    When I obtain a larger board it will just be a matter of changing some
    config values.



  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Apr 2 22:25:02 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2026-04-02 1:57 p.m., MitchAlsup wrote:
    -----------------------
    I would consider the dynamically selected result bus a mistake. A
    result bus is heavily loaded and needs big drivers. Your design will
    need 4 big drivers per FU instead of 1. And for what gain ??

    An issue is the number of result busses to support all the units.
    There is something like 16 or 18 results (some units can produce two results), I thought it would not work to have a result bus for every
    unit. 16 write ports on the register file was not happening. I could not
    see how to reduce things to say 6 busses.

    Realistically, you are going to be performing between 2 and 3 I/c
    and thus 4-6 busses are perfectly capable.

    Four busses were used to minimize the size of the register file, since
    there was a mux anyway. I was not thinking of the driver electronics for running in an FPGA.

    I am not fond of the dynamic selected result bus, either. Maybe it could
    be reduced to eight busses, without dynamic selection.

    Whichever set of four results is selected is looked at.

    Qupls has values stored in the reservation stations. There are only 16
    register read ports running to the reservation stations that are used to
    load values. Then the four result busses are also monitored for values to
    load. All of this is still smaller than the RAT, as Qupls is configured
    at the moment.

    How many entries (instructions) per RS ?

    Qupls is currently configured for one entry per RS. But it is a
    parameter (for each RS). It had to be minimized to fit the FPGA.
    I think the 5k size was for three-entry RS.

    Ok, I mean that a RS has both width and depth. Width would be chosen
    to be appropriate for the number of operands any of the attached FUs
    would need (max). So an INT unit would have 2 operands, a Mem unit
    would have 2 register operands and one constant operand (Displacement),
    FMUL/FMAC would have 3, ...

    A RS has depth, so with a ~100 Instruction execution window, and 6 FUs
    one would expect 16 RS.instructions each with 2 or 3 dynamic operands.
    There is no reason to build RSs if you don't have enough/FU to cover the dynamic latency of the critical path.
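    The sizing arithmetic above can be written out as a small sketch. The
    function names and the particular mix of FU widths are assumptions
    made for illustration; only the ~100-instruction window, the 6 FUs,
    and the 2 or 3 dynamic operands per entry come from the text.

```python
from typing import List


def rs_depth(window_size: int, n_fus: int) -> int:
    """Entries per reservation station if the window is split evenly across FUs."""
    return window_size // n_fus


def operand_slots(depth: int, widths: List[int]) -> int:
    """Total operand slots (each needing its own bus watcher) across all RSs."""
    return sum(depth * w for w in widths)


depth = rs_depth(window_size=100, n_fus=6)  # 100 // 6 = 16 entries per RS
# Hypothetical mix: three INT units and two Mem units with 2 dynamic
# operands each, and one FMAC unit with 3.
widths = [2, 2, 2, 2, 2, 3]
print(depth, operand_slots(depth, widths))  # 16 entries, 16 * 13 = 208 slots
```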

    I could try changing things so that all 14 (or more) result busses run
    past the reservations stations, but I have a feeling that all the muxes
    for the busses will consume a lot of logic. Muxes are relatively
    expensive in an FPGA. Comparators are less expensive I think.

    Current config (8 units):
    ALU1, ALU2, IMUL, DIV, FMA, FPU, MEM, BRANCH
    versus:
    ALU1, ALU2, ALU3, ALU4, ALU5, ALU6
    MEM1, MEM2, MEM3, FADD, FMUL, Branch
    SFT1, SFT2, SFT3, FMSC, FDIV,
    where vertical means they are the same FU#


    Okay, I had units separated by latency so there is minimal latency going from the unit back to the results/input (feedback paths). Trying to keep performance of dependent instructions good.
    ALU1, ALU2 are single cycle latency. FPU is three cycles versus FMA
    which is five. Most of the units can issue an instruction every clock
    cycle. Some units not in the minimal config may have large latencies and cannot issue every cycle. These include float trig, graphics unit,
    neural net unit.

    {You will probably have to edit this to see the true ASCII art due to the inherent stupidity of the space eaters.} One Function Unit::

         +----------------------------------------+
         |  +------------------------+            |
         |->|                        |   |\       |
         |->|    long latency FU     |-->| |      |
         |->|                        |   |M|      |
    Rs-->|  +------------------------+   |U| |\   |
         |                               |X|-|D|--|->result bus
         |  +--------+                   | | |/   |
         |->| short  |------------------>| |      |
         |  +--------+                   |/       |
         +----------------------------------------+

    You may even be able to use the <unused> buffering in the long latency
    sub-unit to delay the <already done> short latency calculation. Alternately,
    you could add some buffering between short and long to take up the slack.

    The final gate inside the FU is the large heavily loaded bus driver.

    There will be some kind of internal timing chain in the FU that arbitrates
    the long versus the short(s) and sends tags at the appropriate instant.
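    One possible shape for such a timing chain, sketched as a shift
    register of reserved result-bus cycles. This is an assumption about
    how the arbitration might be modelled, not a description of any real
    FU: the class name ResultSlotChain is hypothetical, and a colliding
    operation is simply held back here instead of having its result
    buffered as suggested above.

```python
class ResultSlotChain:
    """Tracks which future cycles the FU's single result bus is already booked for."""

    def __init__(self, max_latency: int) -> None:
        # slots[k] is True if the bus driver is reserved k cycles from now.
        self.slots = [False] * (max_latency + 1)

    def try_issue(self, latency: int) -> bool:
        """Reserve the bus `latency` cycles ahead; False means a collision."""
        if self.slots[latency]:
            return False  # an earlier op already owns that cycle; hold this one
        self.slots[latency] = True
        return True

    def tick(self) -> bool:
        """Advance one clock; True if the bus driver (and tag) fires this cycle."""
        self.slots = self.slots[1:] + [False]
        return self.slots[0]


chain = ResultSlotChain(max_latency=5)
chain.try_issue(5)          # long-latency op issued at cycle 0
for _ in range(4):
    chain.tick()            # cycles 1..4: nothing on the bus yet
print(chain.try_issue(1))   # short op would land on the same cycle as the long op
print(chain.try_issue(2))   # one cycle later is free
```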

    Although two ALUs are shown, the FPU can execute ALU instructions too.
    And the ALU can execute the single cycle FPU instructions. I use the
    name SAU (for simple arithmetic unit) because of the crossover. When I
    see ALU I think integer.

    When I said ALU above, I meant {ADD, SUB, CMP, FCMP, certain Conversions,
    certain bit twiddling, logic}; that is, most things that fit in 1 cycle
    with forwarding and result bus drive.

    There are four result busses to feed the register file. A larger
    register file may be too much for the current implementation. There is a
    lot of BRAM used for the register file. 1/4 BRAMs in the device.

    I have built (logic design, circuit design, SPICE tuning, layout) a
    6R-6W register file of 128×64-bit entries. The SPICE tuning was most
    "illuminating" as to the limitations of multi-port SRAM-like storage.

    I do not, at this instant in time, think wider than 6R-6W is practicable. {{Just as well since we are only performing ~2.x I/c with 300 instruction execution windows {and cache hierarchy hit rates and latencies}}}



    Reservation stations are using about 5k LUTs each.

    14×5 = 70K

    The RAT is about 50k LUTs.



    I tried configuring Qupls for 3 entries per RS, and more functional units/functionality, but it turned out to be about 700,000 LUTs.
    I am trying to keep a demo under 200k LUTs.
    When I obtain a larger board it will just be a matter of changing some config values.



  • From Robert Finch@robfi680@gmail.com to comp.arch on Thu Apr 2 23:22:06 2026
    From Newsgroup: comp.arch

    On 2026-04-02 6:25 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-04-02 1:57 p.m., MitchAlsup wrote:
    -----------------------
    I would consider the dynamically selected result bus a mistake. A
    result bus is heavily loaded and needs big drivers. Your design will
    need 4 big drivers per FU instead of 1. And for what gain ??

    An issue is the number of result busses to support all the units.
    There is something like 16 or 18 results (some units can produce two
    results), I thought it would not work to have a result bus for every
    unit. 16 write ports on the register file was not happening. I could not
    see how to reduce things to say 6 busses.

    Realistically, you are going to be performing between 2 and 3 I/c
    and thus 4-6 busses are perfectly capable.

    Yeah, that was the other reason there were only four busses. The results
    were being queued in case of a peak more than four. I have reduced
    things now to 12 units. So I am trying 12 write ports and 12 read ports
    and being rid of the dynamic write selection and queues.
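    For comparison, the queued four-bus arrangement being replaced can be
    modelled roughly like this. The function name drain and the example
    traffic are made up for illustration; the sketch only shows how a
    burst of more than four results spills into a queue and is written
    back over the following cycles.

```python
from collections import deque
from typing import Deque, List, Tuple


def drain(results_per_cycle: List[List[str]], n_buses: int = 4) -> Tuple[List[List[str]], int]:
    """Write back at most n_buses results per cycle, queueing any excess.

    Returns the results written in each cycle and the deepest the queue got.
    """
    queue: Deque[str] = deque()
    written: List[List[str]] = []
    max_depth = 0
    for produced in results_per_cycle:
        queue.extend(produced)
        n = min(n_buses, len(queue))
        written.append([queue.popleft() for _ in range(n)])
        max_depth = max(max_depth, len(queue))
    while queue:  # flush whatever is still waiting after the last cycle
        n = min(n_buses, len(queue))
        written.append([queue.popleft() for _ in range(n)])
    return written, max_depth


# Six results in one cycle: four are written at once, two wait a cycle.
print(drain([["a", "b", "c", "d", "e", "f"], ["g"], []]))
```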

    Four busses were used to minimize the size of the register file, since
    there was a mux anyway. I was not thinking of the driver electronics for
    running in an FPGA.

    I am not fond of the dynamic selected result bus, either. Maybe it could
    be reduced to eight busses, without dynamic selection.

    Whichever set of four results is selected is looked at.

    Qupls has values stored in the reservation stations. There are only 16
    register read ports running to the reservation stations that are used to
    load values. Then the four result busses are also monitored for values to
    load. All of this is still smaller than the RAT, as Qupls is configured
    at the moment.

    How many entries (instructions) per RS ?

    Qupls is currently configured for one entry per RS. But it is a
    parameter (for each RS). It had to be minimized to fit the FPGA.
    I think the 5k size was for three-entry RS.

    Ok, I mean that a RS has both width and depth. Width would be chosen
    to be appropriate for the number of operands any of the attached FUs
    would need (max). So an INT unit would have 2 operands, a Mem unit
    would have 2 register operands and one constant operand (Displacement),
    FMUL/FMAC would have 3, ...

    A RS has depth, so with a ~100 Instruction execution window, and 6 FUs
    one would expect 16 RS.instructions each with 2 or 3 dynamic operands.
    There is no reason to build RSs if you don't have enough/FU to cover the dynamic latency of the critical path.

    I could try changing things so that all 14 (or more) result busses run
    past the reservations stations, but I have a feeling that all the muxes
    for the busses will consume a lot of logic. Muxes are relatively
    expensive in an FPGA. Comparators are less expensive I think.

    Current config (8 units):
    ALU1, ALU2, IMUL, DIV, FMA, FPU, MEM, BRANCH
    versus:
    ALU1, ALU2, ALU3, ALU4, ALU5, ALU6
    MEM1, MEM2, MEM3, FADD, FMUL, Branch
    SFT1, SFT2, SFT3, FMSC, FDIV,
    where vertical means they are the same FU#


    Okay, I had units separated by latency so there is minimal latency going
    from the unit back to the results/input (feedback paths). Trying to keep
    performance of dependent instructions good.
    ALU1, ALU2 are single cycle latency. FPU is three cycles versus FMA
    which is five. Most of the units can issue an instruction every clock
    cycle. Some units not in the minimal config may have large latencies and
    cannot issue every cycle. These include float trig, graphics unit,
    neural net unit.

    {You will probably have to edit this to see the true ASCII art due to the inherent stupidity of the space eaters.} One Function Unit::

         +----------------------------------------+
         |  +------------------------+            |
         |->|                        |   |\       |
         |->|    long latency FU     |-->| |      |
         |->|                        |   |M|      |
    Rs-->|  +------------------------+   |U| |\   |
         |                               |X|-|D|--|->result bus
         |  +--------+                   | | |/   |
         |->| short  |------------------>| |      |
         |  +--------+                   |/       |
         +----------------------------------------+

    You may even be able to use the <unused> buffering in the long latency
    sub-unit to delay the <already done> short latency calculation. Alternately,
    you could add some buffering between short and long to take up the slack.

    The final gate inside the FU is the large heavily loaded bus driver.

    There will be some kind of internal timing chain in the FU that arbitrates the long versus the short(s) and sends tags at the appropriate instant.

    ASCII looks good. I did this already for some of the unit/functions. The
    FPU is 3 cycles but some 1- or 2-cycle ops are fed into the same
    pipeline. I was going with 1,3,5 and many for latency, plus some units
    that are removable from the design.

    I still think a short pipeline unit is needed for some common ops to
    execute back-to-back as dependent instructions. Or I guess the
    alternative is to collect a huge number of instructions and amortize the latencies.

    Although two ALUs are shown, the FPU can execute ALU instructions too.
    And the ALU can execute the single cycle FPU instructions. I use the
    name SAU (for simple arithmetic unit) because of the crossover. When I
    see ALU I think integer.

    When I said ALU above, I meant {ADD, SUB, CMP, FCMP, certain Conversions,
    certain bit twiddling, logic}; that is, most things that fit in 1 cycle
    with forwarding and result bus drive.

    Okay. My ALU includes almost the same. Is there cross unit
    forwarding or is it just within the same pipeline? My dispatch is not
    smart enough to dispatch instructions to the same ALU to make use of forwarding.

    The muxes for results forwarding in an FPGA slow the design down. I have
    seen a couple of designs that say it's not worth forwarding results.
    Better to bump up the clock frequency.

    There are four result busses to feed the register file. A larger
    register file may be too much for the current implementation. There is a
    lot of BRAM used for the register file. 1/4 BRAMs in the device.

    I have built (logic design, circuit design, SPICE tuning, layout) a
    6R-6W register file of 128×64-bit entries. The SPICE tuning was most
    "illuminating" as to the limitations of multi-port SRAM-like storage.

    IDK if I would ever get to that level of detail. I am relying on the
    FPGA designers' knowledge. Not planning a custom chip logic design.
    Obviously there are physical limitations for a custom logic design. I
    have recently seen smaller-volume custom chips advertised. Still too
    expensive for a hobbyist.


    I do not, at this instant in time, think wider than 6R-6W is practicable. {{Just as well since we are only performing ~2.x I/c with 300 instruction execution windows {and cache hierarchy hit rates and latencies}}}



    Reservation stations are using about 5k LUTs each.

    14×5 = 70K

    The RAT is about 50k LUTs.



    I tried configuring Qupls for 3 entries per RS, and more functional
    units/functionality, but it turned out to be about 700,000 LUTs.
    I am trying to keep a demo under 200k LUTs.
    When I obtain a larger board it will just be a matter of changing some
    config values.



