Forum: War Ensemble BBS

Deproved RAT

From Robert Finch@robfi680@gmail.com to comp.arch on Sat Mar 28 17:29:47 2026

From Newsgroup: comp.arch

Okay, I have come up with the following scheme for mapping registers in
the RAT to reduce the storage and logic requirements. Probably at the
cost of some performance.

Organize the physical registers into sets. Have only eight physical
registers associated with each pair of ISA registers. That is an average
of four rename registers available for each ISA register. They are
organized in pairs to try and increase the odds of a rename register
being available.

For the register map, store a three-bit index into the group of eight
physical registers for each ISA register. When referenced, the physical
register number is the ISA register number divided by two, concatenated
with the three-bit index.
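
A minimal sketch of that lookup in Python. The names GROUP_BITS and
physical_reg are made up for illustration and are not taken from Qupls.
It assumes a pair of ISA registers is simply two consecutive register
numbers.

```python
GROUP_BITS = 3  # width of the index stored per ISA register in the map

def physical_reg(isa_reg: int, index: int) -> int:
    """Concatenate the pair number (isa_reg // 2) with the 3-bit index."""
    assert 0 <= index < (1 << GROUP_BITS)
    return ((isa_reg >> 1) << GROUP_BITS) | index

# ISA registers 6 and 7 share one group of eight physical registers.
print(physical_reg(6, 0), physical_reg(7, 7))  # 24 31
```

Only the three-bit index has to be stored and muxed; the upper bits come
straight from the ISA register number.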

The difference between this and a flat register map is that only three
bits are required to identify the physical register. So, with 512
physical registers, it requires 1/3 the storage space and 1/3 the muxing.
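
The one-third figure can be checked with a little arithmetic. This
sketch assumes every one of the 512 physical registers belongs to a
group, which gives 128 map entries; that entry count is an inference,
not a number stated above.

```python
PHYS_REGS = 512
GROUP_SIZE = 8                           # physical registers per ISA register pair
ISA_REGS = PHYS_REGS // GROUP_SIZE * 2   # assumed: 64 groups, two ISA registers each

flat_bits = (PHYS_REGS - 1).bit_length()      # full physical register number
grouped_bits = (GROUP_SIZE - 1).bit_length()  # index within the group

flat_total = ISA_REGS * flat_bits
grouped_total = ISA_REGS * grouped_bits
print(flat_bits, grouped_bits, flat_total, grouped_total)  # 9 3 1152 384
```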

For Qupls RAT it works out to 17,300 LUTs instead of 27,600 LUTs.
However, the number of stalls in renaming is potentially increased… IDK
how big the impact is.
--- Synchronet 3.21f-Linux NewsLink 1.2

From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Sat Mar 28 22:11:00 2026

From Newsgroup: comp.arch

In article <10q9h8c$v7qa$1@dont-email.me>, robfi680@gmail.com (Robert
Finch) wrote:

Organize the physical registers into sets. Have only eight physical
registers associated with each pair of ISA registers. That is an
average of four rename registers available for each ISA register.
They are organized in pairs to try and increase the odds of a
rename register being available.

This seems to present the compiler writer with a temptation to make use
of information about the number of rename registers in long expression
sequences. That causes problems when the implementation changes.

John

From Robert Finch@robfi680@gmail.com to comp.arch on Sat Mar 28 19:17:45 2026

From Newsgroup: comp.arch

On 2026-03-28 6:11 p.m., John Dallman wrote:

In article <10q9h8c$v7qa$1@dont-email.me>, robfi680@gmail.com (Robert
Finch) wrote:

Organize the physical registers into sets. Have only eight physical
registers associated with each pair of ISA registers. That is an
average of four rename registers available for each ISA register.
They are organized in pairs to try and increase the odds of a
rename register being available.

This seems to present the compiler writer with a temptation to make use
of information about the number of rename registers in long expression
sequences. That causes problems when the implementation changes.

John

Yes. I think it would only cause performance differences. Performance
should only improve on a better implementation.

It should be possible to trade off the number of registers in a set
against the number of ISA registers sharing it. I tried sixteen physical
registers with four ISA registers, which should lower the number of
stalls. It was only about 1k LUTs more.

I have thought that superscalars were complex enough that people would
not be cycle counting, but measuring instead.

I got the CPU core to fit on the FPGA now, but without a DRAM
controller. Meaning I should be able to get some small demos running
eventually. I had to axe the instruction expander too. Instructions to
micro-ops are now 1:1.


From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Mar 30 21:01:13 2026

From Newsgroup: comp.arch

Robert Finch <robfi680@gmail.com> posted:

Okay, I have come up with the following scheme for mapping registers in
the RAT to reduce the storage and logic requirements. Probably at the
cost of some performance.

Organize the physical registers into sets. Have only eight physical
registers associated with each pair of ISA registers. That is an average
of four rename registers available for each ISA register. They are
organized in pairs to try and increase the odds of a rename register
being available.

For the register map, store a three-bit index into the group of eight
physical registers for each ISA register. When referenced, the physical
register number is the ISA register number divided by two, concatenated
with the three-bit index.

Take what follows with a grain of salt::

In the machines on which I participated, we kept additional information
in <essentially> the RAT--in particular, the identity of the FU which
will deliver the result. So, reading the RAT gave the Physical Register
Number (PRN), the FU who delivers a result and whether the result is in
RF or waiting on FU. We encoded this such that FU and PRN used the same
bits along with a state, so the whole thing used only 8-bits; 1 for PRN
(7-bits: 128 Physical Registers) versus FU (3-bits) and tag (4-bits).
We had 6-FUs and an execution window 16-deep.
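
A sketch of one way such an 8-bit entry could be packed. The bit
positions here are an assumption for illustration (top bit as the state,
FU in bits 6:4, tag in bits 3:0); the post only gives the field widths.

```python
WAITING = 0x80  # assumed: top bit set means the result is still coming from an FU

def encode_in_rf(prn: int) -> int:
    """Result is in the register file: low 7 bits hold the PRN."""
    assert 0 <= prn < 128
    return prn

def encode_waiting(fu: int, tag: int) -> int:
    """Result is pending: the same 7 bits hold a 3-bit FU and a 4-bit tag."""
    assert 0 <= fu < 8 and 0 <= tag < 16
    return WAITING | (fu << 4) | tag

def decode(entry: int):
    if entry & WAITING:
        return ("waiting", (entry >> 4) & 0x7, entry & 0xF)
    return ("in_rf", entry & 0x7F)

print(decode(encode_in_rf(100)))     # ('in_rf', 100)
print(decode(encode_waiting(5, 9)))  # ('waiting', 5, 9)
```

Three FU bits cover the 6 FUs and four tag bits cover the 16-deep window.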

When the operand is latent, FU tells the Reservation Station entry which
result bus to monitor and which tag to match. This means the RS entries
are only watching 1 bus each, need only 1 comparator; but over time each
entry can monitor all result busses.
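
A small behavioural sketch of a reservation station operand that watches
a single bus. The Operand class and snoop function are hypothetical
names; the (tag, value) bus format is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Operand:
    fu: int                       # which result bus to watch (read from the RAT)
    tag: int                      # tag to match on that bus
    value: Optional[int] = None   # filled in once the result is captured

def snoop(op: Operand, busses: list) -> None:
    """busses[i] is None or a (tag, value) pair driven by FU i this cycle."""
    if op.value is not None:
        return
    driven = busses[op.fu]                          # mux: pick the one bus being watched
    if driven is not None and driven[0] == op.tag:  # the single comparator
        op.value = driven[1]

op = Operand(fu=2, tag=9)
snoop(op, [None, (9, 111), (4, 222), None, None, None])  # tag 9 appears on bus 1 only
print(op.value)  # None
snoop(op, [None, None, (9, 333), None, None, None])
print(op.value)  # 333
```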

I also used an associative lookup (ARN in, PRN/{FU,tag} out) with a valid
bit per PRN. The valid bits are written into the history table each issue,
and are read out as Bcnd begins execution, so that if the branch was
mispredicted, the RAT can be recovered by writing 128 valid bits to
RAT.cam--as few as zero cycles of branch recovery. The entries in the
history table have logic per column that can discover that the register is
free even before the result is written to RF. Issue has logic to determine
if the register is overwritten in the same issue cycle, so the PRN does
not get allocated, so the pool is effectively larger.
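
A simplified model of recovery by rewriting valid bits. The CamRat class
is hypothetical. It assumes a PRN is not reallocated while a checkpoint
that covers it is still live, so each CAM row keeps its ARN and only the
valid bits need restoring. The early-free and same-cycle overwrite logic
is left out.

```python
class CamRat:
    def __init__(self, n_prn: int = 128):
        self.arn = [None] * n_prn     # ARN held by each PRN's CAM row
        self.valid = [False] * n_prn  # one valid bit per PRN

    def lookup(self, arn: int):
        # Associative search: the valid row holding this ARN, if any.
        for prn, (a, v) in enumerate(zip(self.arn, self.valid)):
            if v and a == arn:
                return prn
        return None

    def rename(self, arn: int, new_prn: int) -> None:
        old = self.lookup(arn)
        if old is not None:
            self.valid[old] = False   # old mapping superseded; row contents stay
        self.arn[new_prn] = arn
        self.valid[new_prn] = True

    def checkpoint(self):
        return list(self.valid)       # what the history table would store

    def recover(self, saved) -> None:
        self.valid = list(saved)      # one parallel write of all valid bits

rat = CamRat()
rat.rename(arn=5, new_prn=10)
saved = rat.checkpoint()         # taken when the branch issues
rat.rename(arn=5, new_prn=11)    # speculative rename past the branch
print(rat.lookup(5))             # 11
rat.recover(saved)               # branch was mispredicted
print(rat.lookup(5))             # 10
```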

The difference between this and a flat register map is that only three
bits are required to identify the physical register. So, with 512
physical registers, it requires 1/3 the storage space and 1/3 the muxing.

For Qupls RAT it works out to 17,300 LUTs instead of 27,600 LUTs.
However, the number of stalls in renaming is potentially increased… IDK
how big the impact is.


From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Mar 31 08:05:55 2026

From Newsgroup: comp.arch

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

In the machines on which I participated, we kept additional information
in <essentially> the RAT--in particular, the identity of the FU which
will deliver the result. So, reading the RAT gave the Physical Register
Number (PRN), the FU who delivers a result and whether the result is in
RF or waiting on FU. We encoded this such that FU and PRN used the same
bits along with a state, so the whole thing used only 8-bits; 1 for PRN
(7-bits: 128 Physical Registers) versus FU (3-bits) and tag (4-bits).

My impression is that in recent CPUs with valueless reservation
stations the PRN is used in the in-flight instructions and the RAT
from the start, without needing to change anything in in-flight
instructions once an instruction that writes a register it depends on
delivers its results. Maybe they have additional bits for detecting
that the result is available without having to compare everything.

One disadvantage of this approach is that PRNs are allocated before
they actually need to store something. For programs that have a lot
of instructions waiting for some result (of, e.g., a cache miss) to
become ready, that might be an issue. OTOH, you need those physical
registers anyway for programs that have a lot of finished instructions
waiting for committing in the reorder buffer (e.g., due to having to
wait for an instruction that might trap or mispredict).

So finding a way to reduce the register needs of the former kind of
program may not lead to actually reducing the number of registers;
therefore such a way may not actually be useful.

When the operand is latent, FU tells the Reservation Station entry which
result bus to monitor and which tag to match. This means the RS entries
are only watching 1 bus each, need only 1 comparator; but over time each
entry can monitor all result busses.

Your description inspired an idea: My impression is that having many
write ports is much more expensive than having more registers. So
have a register file for each FU, with one write port for each
FU-specific register file. The total number of physical registers
would have to be increased to achieve a similar renaming capacity
across typical workloads, and one probably still needs a similar
number of read ports as before, but the result might still require
less area.

However, based on register file capacity measurements (e.g., at
chipsandcheese) it seems that modern microarchitectures differentiate
at most between GPRs, SIMD registers, and various flags when it comes
to physical registers. So either the area saving from reduced write
ports is not that relevant and many-write-ported register files are
used.

Or there is a backup mechanism for making use of other register files
when the FU's register file has no free registers; e.g., if the
register renamer finds that all registers in the FU's register file
are allocated, it can insert a move uop from a register of the target
FU to an idle FU with enough free registers before
allocating a register to an uop for the target FU.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Mar 31 18:35:53 2026

From Newsgroup: comp.arch

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

In the machines on which I participated, we kept additional information
in <essentially> the RAT--in particular, the identity of the FU which
will deliver the result. So, reading the RAT gave the Physical Register
Number (PRN), the FU who delivers a result and whether the result is in
RF or waiting on FU. We encoded this such that FU and PRN used the same
bits along with a state, so the whole thing used only 8-bits; 1 for PRN
(7-bits: 128 Physical Registers) versus FU (3-bits) and tag (4-bits).

My impression is that in recent CPUs with valueless reservation
stations the PRN is used in the in-flight instructions and the RAT
from the start, without needing to change anything in in-flight
instructions once an instruction that writes a register it depends on
delivers its results. Maybe they have additional bits for detecting
that the result is available without having to compare everything.

There are several ways to do the above {ARN number}, {PRN}, {FU,tag}, ...
all arriving at the same spot--Data-flow works.

One disadvantage of this approach is that PRNs are allocated before
they actually need to store something. For programs that have a lot
of instructions waiting for some result (of, e.g., a cache miss) to
become ready, that might be an issue. OTOH, you need those physical
registers anyway for programs that have a lot of finished instructions
waiting for committing in the reorder buffer (e.g., due to having to
wait for an instruction that might trap or mispredict).

PRNs are required from about 2 cycles after Decode until RoB retirement.
Certain PRNs may die earlier (overwritten) and different choices make
this easier or harder.

So finding a way to reduce the register needs of the former kind of
program may not lead to actually reducing the number of registers;
therefore such a way may not actually be useful.

Not by enough to count.

When the operand is latent, FU tells the Reservation Station entry which
result bus to monitor and which tag to match. This means the RS entries
are only watching 1 bus each, need only 1 comparator; but over time each
entry can monitor all result busses.

Your description inspired an idea: My impression is that having many
write ports is much more expensive than having more registers. So
have a register file for each FU, with one write port for each
FU-specific register file. The total number of physical registers
would have to be increased to achieve a similar renaming capacity
across typical workloads, and one probably still needs a similar
number of read ports as before, but the result might still require
less area.

Mc 88120 had 6 "FU"s, 6 write ports. Each FU contained an Integer unit
(less shift) and whatever the FU was named for. I found this "better"
than having 12 FUs {6 Int, 3 Mem, 1 FMUL, 1 FADD, 1 Branch} because of
bus contention between Ints and Others. Not much but enough.

However, based on register file capacity measurements (e.g., at
chipsandcheese) it seems that modern microarchitectures differentiate
at most between GPRs, SIMD registers, and various flags when it comes
to physical registers. So either the area saving from reduced write
ports is not that relevant and many-write-ported register files are
used.

Or there is a backup mechanism for making use of other register files
when the FU's register file has no free registers; e.g., if the
register renamer finds that all registers in the FU's register file
are allocated, it can insert a move uop from a register of the target
FU to an idle FU with enough free registers before
allocating a register to an uop for the target FU.

- anton


From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Tue Mar 31 20:07:40 2026

From Newsgroup: comp.arch

In article <10q9nir$11j88$1@dont-email.me>, robfi680@gmail.com (Robert
Finch) wrote:

This seems to present the compiler writer with a temptation to
make use of information about the number of rename registers in
long expression sequences. That causes problems when the
implementation changes.

Yes. I think it would only cause performance differences.

Yes, but those can matter a lot.

Performance should only improve on a better implementation.

It should, yes. The ability of customers to chance upon and rely on
pathological cases should not be under-estimated.

I have thought that superscalars were complex enough that people
would not be cycle counting, but measuring instead.

That's generally true.

John

From Robert Finch@robfi680@gmail.com to comp.arch on Thu Apr 2 01:27:11 2026

From Newsgroup: comp.arch

On 2026-03-31 2:35 p.m., MitchAlsup wrote:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

In the machines on which I participated, we kept additional information
in <essentially> the RAT--in particular, the identity of the FU which
will deliver the result. So, reading the RAT gave the Physical Register
Number (PRN), the FU who delivers a result and whether the result is in
RF or waiting on FU. We encoded this such that FU and PRN used the same
bits along with a state, so the whole thing used only 8-bits; 1 for PRN
(7-bits: 128 Physical Registers) versus FU (3-bits) and tag (4-bits).

My impression is that in recent CPUs with valueless reservation
stations the PRN is used in the in-flight instructions and the RAT
from the start, without needing to change anything in in-flight
instructions once an instruction that writes a register it depends on
delivers its results. Maybe they have additional bits for detecting
that the result is available without having to compare everything.

There are several ways to do the above {ARN number}, {PRN}, {FU,tag}, ...
all arriving at the same spot--Data-flow works.

One disadvantage of this approach is that PRNs are allocated before
they actually need to store something. For programs that have a lot
of instructions waiting for some result (of, e.g., a cache miss) to
become ready, that might be an issue. OTOH, you need those physical
registers anyway for programs that have a lot of finished instructions
waiting for committing in the reorder buffer (e.g., due to having to
wait for an instruction that might trap or mispredict).

PRNs are required from about 2 cycles after Decode until RoB retirement.
Certain PRNs may die earlier (overwritten) and different choices make
this easier or harder.

So finding a way to reduce the register needs of the former kind of
program may not lead to actually reducing the number of registers;
therefore such a way may not actually be useful.

Not by enough to count.

When the operand is latent, FU tells the Reservation Station entry which
result bus to monitor and which tag to match. This means the RS entries
are only watching 1 bus each, need only 1 comparator; but over time each
entry can monitor all result busses.

Your description inspired an idea: My impression is that having many
write ports is much more expensive than having more registers. So
have a register file for each FU, with one write port for each
FU-specific register file. The total number of physical registers
would have to be increased to achieve a similar renaming capacity
across typical workloads, and one probably still needs a similar
number of read ports as before, but the result might still require
less area.

Mc 88120 had 6 "FU"s, 6 write ports. Each FU contained an Integer unit
(less shift) and whatever the FU was named for. I found this "better"
than having 12 FUs {6 Int, 3 Mem, 1 FMUL, 1 FADD, 1 Branch} because of
bus contention between Ints and Others. Not much but enough.

However, based on register file capacity measurements (e.g., at
chipsandcheese) it seems that modern microarchitectures differentiate
at most between GPRs, SIMD registers, and various flags when it comes
to physical registers. So either the area saving from reduced write
ports is not that relevant and many-write-ported register files are
used.

Or there is a backup mechanism for making use of other register files
when the FU's register file has no free registers; e.g., if the
register renamer finds that all registers in the FU's register file
are allocated, it can insert a move uop from a register of the target
FU to an idle FU with enough free registers before
allocating a register to an uop for the target FU.

- anton

These posts have given me something to investigate: whether it is
smaller to add to the RAT and reduce the number of comparators in the
reservation stations, or to reduce the RAT.
More config options coming up.

Let's see if I understand this. While there may only be one bus being
monitored, that bus has to originate from the other result busses via a
mux. So, the result busses go past the reservation stations and feed
into a mux, controlled by the FU id, whose output the reservation
station examines for values. I think I can see where that would make the
reservation stations smaller. It gets rid of the comparators in the
reservation stations and replaces them with muxes on the result busses.

Qupls has a slightly different organization. There are a lot of
functional units. 14 IIRC for a full-blown version, each with four or
more read ports. But there are only four result busses being examined.
The result bus is dynamically selected to update the register file.
Whichever set of four results is selected is looked at.
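
A toy model of sharing four result busses among more units. The
selection policy shown (fixed priority by unit name) is purely an
assumption for illustration; nothing above says how Qupls actually picks
the four results.

```python
N_BUSSES = 4

def select_results(pending: dict) -> tuple:
    """pending maps unit name -> result waiting to be written back.
    Returns (granted, still_waiting) for one cycle."""
    names = sorted(pending)  # assumed fixed priority, ordered by name
    granted = {n: pending[n] for n in names[:N_BUSSES]}
    waiting = {n: pending[n] for n in names[N_BUSSES:]}
    return granted, waiting

pending = {"ALU1": 1, "ALU2": 2, "FMA": 3, "IMUL": 4, "MEM": 5}
granted, waiting = select_results(pending)
print(list(granted), list(waiting))  # ['ALU1', 'ALU2', 'FMA', 'IMUL'] ['MEM']
```

Any unit that loses selection has to hold its result for at least
another cycle.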

Qupls has values stored in the reservation stations. There are only 16
register read ports running to the reservation stations that are used to
load values. Then the four result busses are also monitored for values to
load. All of this is still smaller than the RAT, as Qupls is configured
at the moment.

I could try changing things so that all 14 (or more) result busses run
past the reservation stations, but I have a feeling that all the muxes
for the busses will consume a lot of logic. Muxes are relatively
expensive in an FPGA. Comparators are less expensive I think.

Current config (8 units):
ALU1, ALU2, IMUL, DIV, FMA, FPU, MEM, BRANCH

Reservation stations are using about 5k LUTs each.
The RAT is about 50k LUTs.


From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Apr 2 17:57:15 2026

From Newsgroup: comp.arch

Robert Finch <robfi680@gmail.com> posted:

On 2026-03-31 2:35 p.m., MitchAlsup wrote:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

In the machines on which I participated, we kept additional information
in <essentially> the RAT--in particular, the identity of the FU which
will deliver the result. So, reading the RAT gave the Physical Register
Number (PRN), the FU who delivers a result and whether the result is in
RF or waiting on FU. We encoded this such that FU and PRN used the same
bits along with a state, so the whole thing used only 8-bits; 1 for PRN
(7-bits: 128 Physical Registers) versus FU (3-bits) and tag (4-bits).

My impression is that in recent CPUs with valueless reservation
stations the PRN is used in the in-flight instructions and the RAT
from the start, without needing to change anything in in-flight
instructions once an instruction that writes a register it depends on
delivers its results. Maybe they have additional bits for detecting
that the result is available without having to compare everything.

There are several ways to do the above {ARN number}, {PRN}, {FU,tag}, ...
all arriving at the same spot--Data-flow works.

One disadvantage of this approach is that PRNs are allocated before
they actually need to store something. For programs that have a lot
of instructions waiting for some result (of, e.g., a cache miss) to
become ready, that might be an issue. OTOH, you need those physical
registers anyway for programs that have a lot of finished instructions
waiting for committing in the reorder buffer (e.g., due to having to
wait for an instruction that might trap or mispredict).

PRNs are required from about 2 cycles after Decode until RoB retirement.
Certain PRNs may die earlier (overwritten) and different choices make
this easier or harder.

So finding a way to reduce the register needs of the former kind of
program may not lead to actually reducing the number of registers;
therefore such a way may not actually be useful.

Not by enough to count.

When the operand is latent, FU tells the Reservation Station entry which
result bus to monitor and which tag to match. This means the RS entries
are only watching 1 bus each, need only 1 comparator; but over time each
entry can monitor all result busses.

Your description inspired an idea: My impression is that having many
write ports is much more expensive than having more registers. So
have a register file for each FU, with one write port for each
FU-specific register file. The total number of physical registers
would have to be increased to achieve a similar renaming capacity
across typical workloads, and one probably still needs a similar
number of read ports as before, but the result might still require
less area.

Mc 88120 had 6 "FU"s, 6 write ports. Each FU contained an Integer unit
(less shift) and whatever the FU was named for. I found this "better"
than having 12 FUs {6 Int, 3 Mem, 1 FMUL, 1 FADD, 1 Branch} because of
bus contention between Ints and Others. Not much but enough.

However, based on register file capacity measurements (e.g., at
chipsandcheese) it seems that modern microarchitectures differentiate
at most between GPRs, SIMD registers, and various flags when it comes
to physical registers. So either the area saving from reduced write
ports is not that relevant and many-write-ported register files are
used.

Or there is a backup mechanism for making use of other register files
when the FU's register file has no free registers; e.g., if the
register renamer finds that all registers in the FU's register file
are allocated, it can insert a move uop from a register of the target
FU to an idle FU with enough free registers before
allocating a register to an uop for the target FU.

- anton

These posts have given me something to investigate: whether it is
smaller to add to the RAT and reduce the number of comparators in the
reservation stations, or to reduce the RAT.
More config options coming up.

Let's see if I understand this. While there may only be one bus being
monitored, that bus has to originate from the other result busses via a
mux. So, the result busses go past the reservation stations and feed
into a mux, controlled by the FU id, whose output the reservation
station examines for values. I think I can see where that would make the
reservation stations smaller. It gets rid of the comparators in the
reservation stations and replaces them with muxes on the result busses.

Right, all result busses go to all RSs. Each RS entry.operand watches
1 (or 0) busses. Any RS entry.operand can watch any result bus.

Qupls has a slightly different organization. There are a lot of
functional units. 14 IIRC for a full-blown version, each with four or
more read ports. But there are only four result busses being examined.

The result bus is dynamically selected to update the register file.

I would consider the dynamically selected result bus a mistake. A
result bus is heavily loaded and needs big drivers. Your design will
need 4 big drivers per FU instead of 1. And for what gain ??

Whichever set of four results is selected is looked at.

Qupls has values stored in the reservation stations. There are only 16
register read ports running to the reservation stations that are used to
load values. Then the four result busses are also monitored for values to
load. All of this is still smaller than the RAT, as Qupls is configured
at the moment.

How many entries (instructions) per RS ?

I could try changing things so that all 14 (or more) result busses run
past the reservation stations, but I have a feeling that all the muxes
for the busses will consume a lot of logic. Muxes are relatively
expensive in an FPGA. Comparators are less expensive I think.

Current config (8 units):
ALU1, ALU2, IMUL, DIV, FMA, FPU, MEM, BRANCH

versus:
ALU1, ALU2, ALU3, ALU4, ALU5, ALU6
MEM1, MEM2, MEM3, FADD, FMUL, Branch
SFT1, SFT2, SFT3, FMSC, FDIV,
where vertical means they are the same FU#

Reservation stations are using about 5k LUTs each.

14×5 = 70K

The RAT is about 50k LUTs.


From Robert Finch@robfi680@gmail.com to comp.arch on Thu Apr 2 17:50:09 2026

From Newsgroup: comp.arch

On 2026-04-02 1:57 p.m., MitchAlsup wrote:

Robert Finch <robfi680@gmail.com> posted:

On 2026-03-31 2:35 p.m., MitchAlsup wrote:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

In the machines on which I participated, we kept additional information
in <essentially> the RAT--in particular, the identity of the FU which
will deliver the result. So, reading the RAT gave the Physical Register
Number (PRN), the FU who delivers a result and whether the result is in
RF or waiting on FU. We encoded this such that FU and PRN used the same
bits along with a state, so the whole thing used only 8-bits; 1 for PRN
(7-bits: 128 Physical Registers) versus FU (3-bits) and tag (4-bits).

My impression is that in recent CPUs with valueless reservation
stations the PRN is used in the in-flight instructions and the RAT
from the start, without needing to change anything in in-flight
instructions once an instruction that writes a register it depends on
delivers its results. Maybe they have additional bits for detecting
that the result is available without having to compare everything.

There are several ways to do the above {ARN number}, {PRN}, {FU,tag}, ...
all arriving at the same spot--Data-flow works.

One disadvantage of this approach is that PRNs are allocated before
they actually need to store something. For programs that have a lot
of instructions waiting for some result (of, e.g., a cache miss) to
become ready, that might be an issue. OTOH, you need those physical
registers anyway for programs that have a lot of finished instructions
waiting for committing in the reorder buffer (e.g., due to having to
wait for an instruction that might trap or mispredict).

PRNs are required from about 2 cycles after Decode until RoB retirement.
Certain PRNs may die earlier (overwritten) and different choices make
this easier or harder.

So finding a way to reduce the register needs of the former kind of
program may not lead to actually reducing the number of registers;
therefore such a way may not actually be useful.

Not by enough to count.

When the operand is latent, FU tells the Reservation Station entry which
result bus to monitor and which tag to match. This means the RS entries
are only watching 1 bus each, need only 1 comparator; but over time each
entry can monitor all result busses.

Your description inspired an idea: My impression is that having many
write ports is much more expensive than having more registers. So
have a register file for each FU, with one write port for each
FU-specific register file. The total number of physical registers
would have to be increased to achieve a similar renaming capacity
across typical workloads, and one probably still needs a similar
number of read ports as before, but the result might still require
less area.

Mc 88120 had 6 "FU"s, 6 write ports. Each FU contained an Integer unit
(less shift) and whatever the FU was named for. I found this "better"
than having 12 FUs {6 Int, 3 Mem, 1 FMUL, 1 FADD, 1 Branch} because of
bus contention between Ints and Others. Not much but enough.

However, based on register file capacity measurements (e.g., at
chipsandcheese) it seems that modern microarchitectures differentiate
at most between GPRs, SIMD registers, and various flags when it comes
to physical registers. So either the area saving from reduced write
ports is not that relevant and many-write-ported register files are
used.

Or there is a backup mechanism for making use of other register files
when the FU's register file has no free registers; e.g., if the
register renamer finds that all registers in the FU's register file
are allocated, it can insert a move uop from a register of the target
FU to an idle FU with enough free registers before
allocating a register to an uop for the target FU.

- anton

These posts have given me something to investigate: whether it is
smaller to add to the RAT and reduce the number of comparators in the
reservation stations, or to reduce the RAT.
More config options coming up.

Let's see if I understand this. While there may only be one bus being
monitored, that bus has to originate from the other result busses via a
mux. So, the result busses go past the reservation stations and feed
into a mux, controlled by the FU id, whose output the reservation
station examines for values. I think I can see where that would make the
reservation stations smaller. It gets rid of the comparators in the
reservation stations and replaces them with muxes on the result busses.

Right, all result busses go to all RSs. Each RS entry.operand watches
1 (or 0) busses. Any RS entry.operand can watch any result bus.

Qupls has a slightly different organization. There are a lot of
functional units. 14 IIRC for a full-blown version, each with four or
more read ports. But there are only four result busses being examined.

The result bus is dynamically selected to update the register file.

I would consider the dynamically selected result bus a mistake. A
result bus is heavily loaded and needs big drivers. Your design will
need 4 big drivers per FU instead of 1. And for what gain ??

An issue is the number of result busses needed to support all the units.
There are something like 16 or 18 results (some units can produce two
results), and I thought it would not work to have a result bus for every
unit. 16 write ports on the register file was not happening. I could not
see how to reduce things to, say, 6 busses.

Four busses were used to minimize the size of the register file, since
there was a mux anyway. I was not thinking of the driver electronics for running in an FPGA.

I am not fond of the dynamic selected result bus, either. Maybe it could
be reduced to eight busses, without dynamic selection.

Whichever set of four results is selected is looked at.

Qupls has values stored in the reservation stations. There are only 16
register read ports running to the reservation stations that are used to
load values. Then the four result busses are also monitored for values to
load. All of this is still smaller than the RAT, as Qupls is configured
at the moment.

How many entries (instructions) per RS ?

Qupls is currently configured for one entry per RS. But it is a
parameter (for each RS). It had to be minimized to fit the FPGA.
I think the 5k size was for three-entry RS.

I could try changing things so that all 14 (or more) result busses run
past the reservation stations, but I have a feeling that all the muxes
for the busses will consume a lot of logic. Muxes are relatively
expensive in an FPGA. Comparators are less expensive I think.

Current config (8 units):
ALU1, ALU2, IMUL, DIV, FMA, FPU, MEM, BRANCH

versus:
ALU1, ALU2, ALU3, ALU4, ALU5, ALU6
MEM1, MEM2, MEM3, FADD, FMUL, Branch
SFT1, SFT2, SFT3, FMSC, FDIV,
where vertical means they are the same FU#

Okay, I had units separated by latency so there is minimal latency going
from the unit back to the results/input (feedback paths). Trying to keep performance of dependent instructions good.
ALU1, ALU2 are single cycle latency. FPU is three cycles versus FMA
which is five. Most of the units can issue an instruction every clock
cycle. Some units not in the minimal config may have large latencies and cannot issue every cycle. These include float trig, graphics unit,
neural net unit.

Although two ALUs are shown, the FPU can execute ALU instructions too.
And the ALU can execute the single cycle FPU instructions. I use the
name SAU (for simple arithmetic unit) because of the crossover. When I
see ALU I think integer.

There are four result busses to feed the register file. A larger
register file may be too much for the current implementation. There is a
lot of BRAM used for the register file. 1/4 BRAMs in the device.

Reservation stations are using about 5k LUTs each.

14×5 = 70K

The RAT is about 50k LUTs.

I tried configuring Qupls for 3 entries per RS, and more functional units/functionality, but it turned out to be about 700,000 LUTs.
I am trying to keep a demo under 200k LUTs.
When I obtain a larger board it will just be a matter of changing some
config values.

--- Synchronet 3.21f-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Apr 2 22:25:02 2026

From Newsgroup: comp.arch

Robert Finch <robfi680@gmail.com> posted:

On 2026-04-02 1:57 p.m., MitchAlsup wrote:

-----------------------

I would consider the dynamically selected result bus a mistake. A
result bus is heavily loaded and needs big drivers. Your design will
need 4 big drivers per FU instead of 1. And for what gain ??

An issue is the number of result busses to support all the units.
There is something like 16 or 18 results (some units can produce two results), I thought it would not work to have a result bus for every
unit. 16 write ports on the register file was not happening. I could not
see how to reduce things to say 6 busses.

Realistically, you are going to be performing between 2 and 3 I/c
and thus 4-6 busses are perfectly capable.

Four busses were used to minimize the size of the register file, since
there was a mux anyway. I was not thinking of the driver electronics for running in an FPGA.

I am not fond of the dynamic selected result bus, either. Maybe it could
be reduced to eight busses, without dynamic selection.

Whichever set of four results is selected is looked at.

Qupls has values stored in the reservation stations. There are only 16
register read ports running to the reservation stations that are used to
load values. Then the four result busses are also monitored for values to
load. All of this is still smaller than the RAT, as Qupls is configured
at the moment.

How many entries (instructions) per RS ?

Qupls is currently configured for one entry per RS. But it is a
parameter (for each RS). It had to be minimized to fit the FPGA.
I think the 5k size was for three-entry RS.

Ok, I mean that a RS has both width and depth. Width would be chosen
to be appropriate for the number of operands any of the attached FUs
would need (max) So an INT unit would have 2-operands, a Mem unit
would have 2 register operands and one constant operand (Displacement), FMUL/FMAC would have 3, ...

A RS has depth, so with a ~100 Instruction execution window, and 6 FUs
one would expect 16 RS.instructions each with 2 or 3 dynamic operands.
There is no reason to build RSs if you don't have enough/FU to cover the dynamic latency of the critical path.
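
The width and depth arithmetic above can be sketched in Python. This is only a back-of-envelope model: it assumes the execution window is spread evenly over the FUs, and the FU mix in the example call is hypothetical. The operand counts come from the examples in the paragraph above; the function names are made up.

```python
# Operand fields per RS entry, from the examples above.
# MEM counts 2 register operands plus 1 constant (displacement).
OPERANDS = {"INT": 2, "MEM": 3, "FMAC": 3}

def rs_depth(window, num_fus):
    """Entries per RS if the window is spread evenly over the FUs (an assumption)."""
    return window // num_fus

def operand_slots(window, fus):
    """Total operand fields across all RSs for a given FU mix."""
    depth = rs_depth(window, len(fus))
    return sum(depth * OPERANDS[fu] for fu in fus)

depth = rs_depth(100, 6)   # ~100-instruction window over 6 FUs
total = operand_slots(100, ["INT", "INT", "INT", "MEM", "MEM", "FMAC"])
```

With a 100-instruction window and 6 FUs this gives 16 entries per RS, matching the figure quoted above.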

I could try changing things so that all 14 (or more) result busses run
past the reservation stations, but I have a feeling that all the muxes
for the busses will consume a lot of logic. Muxes are relatively
expensive in an FPGA. Comparators are less expensive I think.

Current config (8 units):
ALU1, ALU2, IMUL, DIV, FMA, FPU, MEM, BRANCH

versus:
ALU1, ALU2, ALU3, ALU4, ALU5, ALU6
MEM1, MEM2, MEM3, FADD, FMUL, Branch
SFT1, SFT2, SFT3, FMSC, FDIV,
where vertical means they are the same FU#

Okay, I had units separated by latency so there is minimal latency going from the unit back to the results/input (feedback paths). Trying to keep performance of dependent instructions good.
ALU1, ALU2 are single cycle latency. FPU is three cycles versus FMA
which is five. Most of the units can issue an instruction every clock
cycle. Some units not in the minimal config may have large latencies and cannot issue every cycle. These include float trig, graphics unit,
neural net unit.

{You will probably have to edit this to see the true ASCII art due to the inherent stupidity of the space eaters.} One Function Unit::

     +----------------------------------------+
     |  +------------------------+            |
     |->|                        |  |\        |
     |->|    long latency FU     |->| |       |
     |->|                        |  |M|       |
Rs-->|  +------------------------+  |U| |\    |
     |                              |X|-|D|---|->result bus
     |  +--------+                  | | |/    |
     |->| short  |----------------->| |       |
     |  +--------+                  |/        |
     +----------------------------------------+

You may even be able to use the <unused> buffering in the long latency
sub-unit to delay the <already done> short latency calculation. Alternately,
you could add some buffering between short and long to take up the slack.

The final gate inside the FU is the large heavily loaded bus driver.

There will be some kind of internal timing chain in the FU that arbitrates
the long versus the short(s) and sends tags at the appropriate instant.
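
One way to picture that timing chain is as a reservation table for the FU's single bus driver. The Python sketch below is an illustration under assumptions, not a description of any real design: the policy (a result that finds the bus busy is held one more cycle) and the names `SharedResultBus` and `schedule` are invented here, and the latencies 5 and 1 are just example values.

```python
class SharedResultBus:
    """One bus driver shared by a long and a short sub-unit of one FU."""

    def __init__(self):
        self.slots = {}   # cycle -> name of the op driving the bus in that cycle

    def schedule(self, name, issue_cycle, latency):
        """Reserve the first free bus cycle at or after the op's natural finish."""
        cycle = issue_cycle + latency
        while cycle in self.slots:    # bus busy: hold the result one more cycle
            cycle += 1
        self.slots[cycle] = name
        return cycle

bus = SharedResultBus()
long_done = bus.schedule("fma", issue_cycle=0, latency=5)    # drives the bus in cycle 5
short_done = bus.schedule("add", issue_cycle=4, latency=1)   # also wants cycle 5, so it waits
```

In hardware the same decision would have to be made early enough to send the tag at the right instant; the sketch ignores that and only shows the slot conflict.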

Although two ALUs are shown, the FPU can execute ALU instructions too.
And the ALU can execute the single cycle FPU instructions. I use the
name SAU (for simple arithmetic unit) because of the crossover. When I
see ALU I think integer.

When I said ALU above, I meant {ADD, SUB, CMP, FCMP, certain Conversions, certain bit twiddling, logic}, that is: most things that fit in 1 cycle with forwarding and result bus drive.

There are four result busses to feed the register file. A larger
register file may be too much for the current implementation. There is a
lot of BRAM used for the register file. 1/4 BRAMs in the device.

I have built (logic design, circuit design, SPICE tuning, layout) a
6R-6W register file of 128×64-bit entries. The SPICE tuning was most "illuminating" as to the limitations of multi-port SRAM-like storage.

I do not, at this instant in time, think wider than 6R-6W is practicable. {{Just as well since we are only performing ~2.x I/c with 300 instruction execution windows {and cache hierarchy hit rates and latencies}}}

Reservation stations are using about 5k LUTs each.

14×5 = 70K

The RAT is about 50k LUTs.

I tried configuring Qupls for 3 entries per RS, and more functional units/functionality, but it turned out to be about 700,000 LUTs.
I am trying to keep a demo under 200k LUTs.
When I obtain a larger board it will just be a matter of changing some config values.

--- Synchronet 3.21f-Linux NewsLink 1.2

From Robert Finch@robfi680@gmail.com to comp.arch on Thu Apr 2 23:22:06 2026

From Newsgroup: comp.arch

On 2026-04-02 6:25 p.m., MitchAlsup wrote:

Robert Finch <robfi680@gmail.com> posted:

On 2026-04-02 1:57 p.m., MitchAlsup wrote:

-----------------------

I would consider the dynamically selected result bus a mistake. A
result bus is heavily loaded and needs big drivers. Your design will
need 4 big drivers per FU instead of 1. And for what gain ??

An issue is the number of result busses to support all the units.
There is something like 16 or 18 results (some units can produce two
results), I thought it would not work to have a result bus for every
unit. 16 write ports on the register file was not happening. I could not
see how to reduce things to say 6 busses.

Realistically, you are going to be performing between 2 and 3 I/c
and thus 4-6 busses are perfectly capable.

Yeah, that was the other reason there were only four busses. The results
were being queued in case of a peak more than four. I have reduced
things now to 12 units. So I am trying 12 write ports and 12 read ports
and being rid of the dynamic write selection and queues.
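
For reference, the queueing behaviour being removed can be modelled in a few lines of Python. This is a toy model with made-up inputs, assuming the oldest queued results win a bus first; the function name `drain` and the per-cycle result counts are hypothetical.

```python
from collections import deque

def drain(results_per_cycle, busses=4):
    """Queue depth after each cycle when at most `busses` results leave per cycle."""
    queue = deque()
    depths = []
    for cycle, produced in enumerate(results_per_cycle):
        queue.extend([cycle] * produced)           # remember when each result was produced
        for _ in range(min(busses, len(queue))):
            queue.popleft()                        # oldest results win a bus first
        depths.append(len(queue))
    return depths

depths = drain([2, 6, 5, 1, 0])   # two peak cycles above four results
```

The queue only grows in cycles where more than four results arrive, and it empties again once the rate drops back below four.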

Four busses were used to minimize the size of the register file, since
there was a mux anyway. I was not thinking of the driver electronics for
running in an FPGA.

I am not fond of the dynamic selected result bus, either. Maybe it could
be reduced to eight busses, without dynamic selection.

Whichever set of four results is selected is looked at.

Qupls has values stored in the reservation stations. There are only 16
register read ports running to the reservation stations that are used to
load values. Then the four result busses are also monitored for values to
load. All of this is still smaller than the RAT, as Qupls is configured
at the moment.

How many entries (instructions) per RS ?

Qupls is currently configured for one entry per RS. But it is a
parameter (for each RS). It had to be minimized to fit the FPGA.
I think the 5k size was for three-entry RS.

Ok, I mean that a RS has both width and depth. Width would be chosen
to be appropriate for the number of operands any of the attached FUs
would need (max) So an INT unit would have 2-operands, a Mem unit
would have 2 register operands and one constant operand (Displacement), FMUL/FMAC would have 3, ...

A RS has depth, so with a ~100 Instruction execution window, and 6 FUs
one would expect 16 RS.instructions each with 2 or 3 dynamic operands.
There is no reason to build RSs if you don't have enough/FU to cover the dynamic latency of the critical path.

I could try changing things so that all 14 (or more) result busses run
past the reservation stations, but I have a feeling that all the muxes
for the busses will consume a lot of logic. Muxes are relatively
expensive in an FPGA. Comparators are less expensive I think.

Current config (8 units):
ALU1, ALU2, IMUL, DIV, FMA, FPU, MEM, BRANCH

versus:
ALU1, ALU2, ALU3, ALU4, ALU5, ALU6
MEM1, MEM2, MEM3, FADD, FMUL, Branch
SFT1, SFT2, SFT3, FMSC, FDIV,
where vertical means they are the same FU#

Okay, I had units separated by latency so there is minimal latency going
from the unit back to the results/input (feedback paths). Trying to keep
performance of dependent instructions good.
ALU1, ALU2 are single cycle latency. FPU is three cycles versus FMA
which is five. Most of the units can issue an instruction every clock
cycle. Some units not in the minimal config may have large latencies and
cannot issue every cycle. These include float trig, graphics unit,
neural net unit.

{You will probably have to edit this to see the true ASCII art due to the inherent stupidity of the space eaters.} One Function Unit::

     +----------------------------------------+
     |  +------------------------+            |
     |->|                        |  |\        |
     |->|    long latency FU     |->| |       |
     |->|                        |  |M|       |
Rs-->|  +------------------------+  |U| |\    |
     |                              |X|-|D|---|->result bus
     |  +--------+                  | | |/    |
     |->| short  |----------------->| |       |
     |  +--------+                  |/        |
     +----------------------------------------+

You may even be able to use the <unused> buffering in the long latency sub-unit to delay the <already done> short latency calculation. Alternately, you could add some buffering between short and long to take up the slack.

The final gate inside the FU is the large heavily loaded bus driver.

There will be some kind of internal timing chain in the FU that arbitrates the long versus the short(s) and sends tags at the appropriate instant.

ASCII looks good. I did this already for some of the unit/functions. The
FPU is 3 cycles but some 1 or 2 cycles ops are fed into the same
pipeline. I was going with 1,3,5 and many for latency, plus some units
that are removable from the design.

I still think a short pipeline unit is needed for some common ops to
execute back-to-back as dependent instructions. Or I guess the
alternative is to collect a huge number of instructions and amortize the latencies.

Although two ALUs are shown, the FPU can execute ALU instructions too.
And the ALU can execute the single cycle FPU instructions. I use the
name SAU (for simple arithmetic unit) because of the crossover. When I
see ALU I think integer.

When I said ALU above, I meant {ADD, SUB, CMP, FCMP, certain Conversions, certain bit twiddling, logic} that is :most things that fit in 1 cycle with forwarding and result bus drive.

Okay. My ALU includes almost the same. Is there cross unit
forwarding or is it just within the same pipeline? My dispatch is not
smart enough to dispatch instructions to the same ALU to make use of forwarding.

The muxes for results forwarding in an FPGA slow the design down. I have
seen a couple of designs that say it's not worth forwarding results.
Better to bump up the clock frequency.

There are four result busses to feed the register file. A larger
register file may be too much for the current implementation. There is a
lot of BRAM used for the register file. 1/4 BRAMs in the device.

I have built (logic design, circuit design, SPICE tuning, layout) of
6R-6W register file of 128×64-bit entries. The SPICE tuning was most "illuminating" as to the limitations of multi-port SRAM-like storage.

IDK if I would ever get to that level of detail. I am relying on the
FPGA designers' knowledge. Not planning a custom chip logic design.
Obviously there are physical limitations for a custom logic design. I
have recently seen smaller-volume custom chips advertised. Still too expensive for a hobbyist.

I do not, at this instant in time, think wider than 6R-6W is practicable. {{Just as well since we are only performing ~2.x I/c with 300 instruction execution windows {and cache hierarchy hit rates and latencies}}}

Reservation stations are using about 5k LUTs each.

14×5 = 70K

The RAT is about 50k LUTs.

I tried configuring Qupls for 3 entries per RS, and more functional
units/functionality, but it turned out to be about 700,000 LUTs.
I am trying to keep a demo under 200k LUTs.
When I obtain a larger board it will just be a matter of changing some
config values.

--- Synchronet 3.21f-Linux NewsLink 1.2
