On 1/22/24 9:44 AM, Paul A. Clayton wrote:
[snip]
Obviously an extremely biased workload like the data analysis
workloads targeted by Intel's research chip would probably show
A55 in a better light (though A55 would likely be very inefficient
compared to the research design, I think it used 4-way threaded
in-order cores with limited cache and narrow memory channels [to avoid
64-byte accesses to access 64-bits or less of data]), but
that would not be "fair".
I (finally) found a reference to the Intel research chip. https://ieeexplore.ieee.org/document/10188866
"The Intel Programmable and Integrated Unified Memory Architecture
Graph Analytics Processor" (Sriram Aananthakrishnan et al., 2023)
A PDF of the paper appears to be available at https://heirman.net/papers/aananthakrishnan2023piuma.pdf
Paul A. Clayton wrote:
[snip]
When I looked at the pipeline design presented in the Arm Cortex-
A55 Software Optimization Guide, I was surprised by the design.
Figure 1 (page 10 in Revision R2p0) shows nine execution pipelines
(ALU0, ALU1, MAC, DIV, branch, store, load, FP/Neon MAC &
DIV/SQRT, FP/Neon ALU) and ALU0 and ALU1 have a shift pipeline
stage before an ALU stage (clearly for AArch32).
Almost like an MC88100, which had 5 pipelines.
The separation of MAC and DIV is mildly questionable — from my
very amateur perspective — not supporting dual issue of a MAC-DIV
pair seems very unlikely to hurt performance but the cost may be
trivial.
Many (MANY) MUL-DIV pairs are data dependent. y = i*m/n;
The Chips and Cheese article also indicated that branches are only
resolved at writeback, two cycles later than if branch direction
was resolved in the first execution stage. The difference between
a six stage misprediction penalty and an eight stage one is not
huge, but it seems to indicate a difference in focus.
In an 8-stage pipeline, the 2 cycles of added delay should hurt by
~5%-7%.
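That ~5%-7% figure can be sanity-checked with a back-of-envelope CPI model. All parameters below are illustrative assumptions (dual-issue base CPI of 0.5, one branch per 5 instructions, 10% mispredicted), not measured A55 numbers:

```python
# Back-of-envelope CPI model for the cost of extra cycles of branch
# misprediction penalty. Assumed parameters, chosen only to show that
# plausible numbers land in the quoted range.
def slowdown(branch_freq, mispredict_rate, penalty, extra, base_cpi):
    """Fractional slowdown from growing the penalty by `extra` cycles."""
    cpi_old = base_cpi + branch_freq * mispredict_rate * penalty
    cpi_new = base_cpi + branch_freq * mispredict_rate * (penalty + extra)
    return cpi_new / cpi_old - 1.0

# 6-cycle penalty growing to 8 cycles on an assumed dual-issue core:
print(f"{slowdown(0.2, 0.10, 6, 2, 0.5):.1%}")  # lands in the ~5%-7% range
```

With a higher base CPI (a narrower or stall-prone core) the same 2 cycles cost proportionally less, which is one reason the penalty matters more on wider designs.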
I would have *guessed* that an AGLU (a functional unit providing
address generation and "simple" ALU functions, like AMD's Bobcat?)
would be more area and power efficient than having separate
pipelines, at least for store address generation.
Be careful with assumptions like that. Silicon area with no moving
signals is remarkably power efficient.
Perhaps mildly out-of-order designs (say a little more than the
PowerPC 750) are not actually useful (other than as a starting
point for understanding out-of-order design). I do not understand
why such an intermediate design (between in-order and 30+
scheduling window out-of-order) is not useful. It may be that
It is useful, just not all that much.
going from say 10 to 30 scheduler entries gives so much benefit
for relatively little extra cost (and no design is so precisely
area constrained — even doubling core size would not mean pushing
L1 off-chip, e.g.). I have a lumper taxonomic bias, so I have some
emotional investment in intermediate and mixed designs.
10 does not accommodate much ILP beyond that of a 10 deep pipeline.
30 accommodates L1 cache misses and typical FP latencies.
90 accommodates "almost everything else".
250 accommodates multiple L1 misses with L2 hits and "everything
else".
For something like a smart phone, one or two small cores might be
useful for background activity, tasks whose latency (within a
broad range) is not related to system responsiveness for the user.
For a server expected to run embarrassingly parallel workloads, if
Servers are not expected to run embarrassingly parallel applications,
they are expected to run an embarrassingly large number of essentially
serial applications.
a wimpy core provides sufficient responsiveness, I would expect
most of the cores (possibly even all of the cores) to be wimpy.
There might not be many workloads with such characteristics;
Talk to Google about that....
On 2/25/24 5:22 PM, MitchAlsup1 wrote:
Paul A. Clayton wrote:
[snip]
When I looked at the pipeline design presented in the Arm Cortex-
A55 Software Optimization Guide, I was surprised by the design.
Figure 1 (page 10 in Revision R2p0) shows nine execution pipelines
(ALU0, ALU1, MAC, DIV, branch, store, load, FP/Neon MAC &
DIV/SQRT, FP/Neon ALU) and ALU0 and ALU1 have a shift pipeline
stage before an ALU stage (clearly for AArch32).
Almost like an Mc88100 which had 5 pipelines.
I think I have an incorrect conception of data communication
(forwarding and register-to-functional-unit). I also seem to be
somewhat conflating issue port and functional unit. Forwarding
from nine locations to nine locations — and from the remaining
eight locations to eight locations — adds up quickly (counting a
functional unit as a single target location even though a functional
unit may have three functionally different input operands).
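A naive cost model makes the concern concrete: with full bypassing, every operand input port must compare its source tag against every result that could forward, so comparator count grows quadratically with unit count. The parameters below (2 read ports per unit, 1 result per unit, results forwardable for 2 stages) are assumptions for illustration only:

```python
# Naive tag-comparator count for a full bypass network. Parameters are
# illustrative assumptions, not taken from any real design.
def forward_sources(units, results_per_unit=1, stages_in_flight=2):
    """Results that could be forwarded in a given cycle."""
    return units * results_per_unit * stages_in_flight

def tag_comparators(units, operands_per_unit=2, **kw):
    """Each operand port compares against each forwardable result."""
    return units * operands_per_unit * forward_sources(units, **kw)

for u in (2, 4, 9):
    print(u, "units ->", tag_comparators(u), "comparators")
```

The quadratic growth is one argument for merging functionality into fewer issue ports even when the raw execution hardware is cheap.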
I am used to functionality being merged; e.g., the multiplier also
having a general ALU. Merged functional units would still need to
route the operands to the appropriate functionality, but selecting
the operation path for two operands *seems* simpler than selecting
distinct operands and separate functional unit independently. This
might also be a nomenclature issue.
If one can only begin two operations in a cycle, the generality of
having nine potential paths seems wasteful to me. Having separate
paths for FP/Neon and GPR-using operations makes sense because of
the different register sets (as well as latency/efficiency-
optimized functional units vs. SIMD-optimized functional units;
sharing execution hardware is tempting but there are tradeoffs).
With nine potential issue ports, it seems strange to me that width
is strictly capped at two.
Even though AArch64 does not have My 66000's Virtual Vector Method
to exploit normally underutilized resources, there would be cases
where an extra instruction or two could execute in parallel without
increasing resources significantly. As an outsider, I can only
assume that any benefit did not justify the costs in hardware and
design effort. (With in-order execution, even a nearly free
[hardware] increase in width may not result in improved performance
or efficiency.)
The separation of MAC and DIV is mildly questionable — from my
very amateur perspective — not supporting dual issue of a MAC-DIV
pair seems very unlikely to hurt performance but the cost may be
trivial.
Many (MANY) MUL-DIV pairs are data dependent. y = i*m/n;
I also assume the other operations are usually available for
parallel execution (though this depends somewhat on compiler
optimization for the microarchitecture), so execution of a
multiply and a divide in parallel is probably uncommon.
The FP/Neon section has these operations merged into a functional
unit; I guess — I am not motivated to look this up — that this is
because FP divide/sqrt use the multiplier while integer divide
does not.
The Chips and Cheese article also indicated that branches are only
resolved at writeback, two cycles later than if branch direction
was resolved in the first execution stage. The difference between
a six stage misprediction penalty and an eight stage one is not
huge, but it seems to indicate a difference in focus.
In an 8-stage pipeline, the 2 cycles of added delay should hurt by
~5%-7%.
5% performance loss sounds expensive for a something that *seems*
not terribly expensive to fix.
[snip]
I would have *guessed* that an AGLU (a functional unit providing
address generation and "simple" ALU functions, like AMD's Bobcat?)
would be more area and power efficient than having separate
pipelines, at least for store address generation.
Be careful with assumptions like that. Silicon area with no moving
signals is remarkably power efficient.
There is also the extra forwarding for separate functional units
(and perhaps some extra costs from increased distance), but I
admit that such factors really expose my complete lack of hardware experience. (I am aware of clock gating as a power saving
technique and that "doing nothing" is cheap, but I have no
intuition of the weights of the tradeoffs.)
(I was also very surprised by how much extra state the A55 has:
over 100 extra "registers". Even though these are not all 64-bit
data storage units, this was still a surprising amount of extra
state for a core targeting area efficiency. The storage itself may
not be particularly expensive, but it gives some insight into how
complex even a "simple" implementation can be.)
[snip interesting stuff]
Perhaps mildly out-of-order designs (say a little more than the
PowerPC 750) are not actually useful (other than as a starting
point for understanding out-of-order design). I do not understand
why such an intermediate design (between in-order and 30+
scheduling window out-of-order) is not useful. It may be that
It is useful, just not all that much.
going from say 10 to 30 scheduler entries gives so much benefit
for relatively little extra cost (and no design is so precisely
area constrained — even doubling core size would not mean pushing
L1 off-chip, e.g.). I have a lumper taxonomic bias, so I have some
emotional investment in intermediate and mixed designs.
10 does not accommodate much ILP beyond that of a 10 deep pipeline.
30 accommodates L1 cache misses and typical FP latencies.
90 accommodates "almost everything else".
250 accommodates multiple L1 misses with L2 hits and "everything
else".
Presumably the benefit depends on issue width and load-to-use
latency (pipeline depth, cache capacity, etc.). [For a cheap
"general purpose" processor, not covering FP latencies well may
not be very important.] Better hiding L1 _hit_ latency would seem
to provide a significant fraction of the frequency and ILP benefit
of out-of-order for a smallish core. (Some branch resolution
latency can also be hidden; an in-order core can delay resolution
until writeback of control-dependent instructions, but OoO's extra
buffering facilitates deeper speculation.)
If one has a scheduling window of 90 operations, having only three
issue ports seems imbalanced to me.
Out-of-order execution would also seem to facilitate opportunistic
use of existing functionality. Even just buffering decoded
instructions would seem to allow a 16-byte (aligned) instruction
fetch with two instruction decoders to issue more than two
instructions on some cycles without increasing register port
count, forwarding paths, etc. OoO would further increase the
frequency of being able to do more work with given hardware
resources.
Perhaps there may even be a case for a 1+ wide OoO core, i.e., an
OoO core which sometimes issues more than one instruction in a
cycle.
For something like a smart phone, one or two small cores might be
useful for background activity, tasks whose latency (within a
broad range) is not related to system responsiveness for the user.
For a server expected to run embarrassingly parallel workloads, if
Servers are not expected to run embarrassingly parallel applications,
they are expected to run an embarrassingly large number of essentially
serial applications.
Shared caching of instructions still seems beneficial in "server
workloads" compared to fully general multiprogram workloads. A
database server might even have more sharing, potentially having a
single process (so page table sharing would be more beneficial),
but that seems a less common use.
a wimpy core provides sufficient responsiveness, I would expect
most of the cores (possibly even all of the cores) to be wimpy.
There might not be many workloads with such characteristics;
Talk to Google about that....
Urs Hölzle of Google put out a paper "Brawny cores still beat
wimpy cores, most of the time"(2010). While some of the points —
such as tail latency effects and software development costs —
made in the paper are (in my opinion) quite significant, I thought
the argument significantly flawed. (I even wrote a blog post about
this paper: https://dandelion-watcher.blogspot.com/2012/01/weak-case-against-wimpy-cores.html)
The microservice programming model (motivated, from what I
understand, by problem-size and performance scaling and service
reliability with moderately reliable hardware without requiring
much programming effort to support scaling) may also have
significant implications on microarchitecture.
The design space is also very large. One can have heterogeneity of
wimpy and brawny cores at the rack level, wimpy-only chips within
a heterogeneous package, heterogeneity within a chip, temporal
heterogeneity (SMT and dynamic partitioning of core resources),
etc. Core strength can vary widely and performance balance can be
diverse (e.g., a core with a quarter of the performance of a
brawny core on general tasks might have — with coprocessors,
tightly coupled accelerators, or general microarchitecture —
approximately equal performance for some tasks).
The performance of weaker cores can also be increased by
increasing communication performance within local groups of such
cores. Exploiting this would likely require significant
programming effort, but some of the effort might be automated
(even before AI replaces programmers). This assumes that there is
significant communication that is less temporally local than
within a core (out-of-order execution changes the temporal
proximity of value communication; a result consumer might be
nearby in program order but substantially more distant in
execution order) and that intermediate resource allocation to
intermediate latency/bandwidth communication can be beneficial.
(I also think that there is an opportunity for optimization in the
on-chip network. Optimizing the on-chip network for any-to-any
communication seems less appropriate for many workloads not only
because of the often limited scale of communication but also
because the communication is, I suspect, often specialized.
Getting a network design that is very good for some uses and
adequate for others seems challenging even with software cooperation.
Rings seem really nice for pipeline-style parallelism and some
other uses, crossbars seem nice for small node groups with heavy communication, grids seem to fit large node counts with nearest
neighbor communication (physical modeling?), etc. Channel width,
flit size, channel count also involve tradeoffs. Some
communication does not require sending an entire cache block of
data, but a smaller flit will have more overhead.)
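The ring/crossbar/mesh tradeoff sketched above can be made concrete with idealized average hop counts under uniform random traffic, ignoring contention. The formulas below are textbook approximations, not measurements of any real fabric:

```python
# Idealized average hop counts for simple on-chip topologies under
# uniform random traffic. Approximations for illustration only.
def ring_avg_hops(n):
    """Bidirectional ring: average shortest-path distance ~ n/4."""
    return n / 4

def mesh_avg_hops(n):
    """k x k mesh (n = k*k): expected Manhattan distance between two
    nodes drawn uniformly and independently from the grid."""
    k = round(n ** 0.5)
    return 2 * (k * k - 1) / (3 * k)

def crossbar_avg_hops(n):
    """Single crossbar: every node is one hop from every other."""
    return 1

for n in (16, 64):
    print(n, ring_avg_hops(n), round(mesh_avg_hops(n), 2), crossbar_avg_hops(n))
```

The ring's average distance grows linearly with node count while the mesh's grows only with its square root, which is why rings suit small groups or pipeline-style traffic and meshes suit larger node counts with mostly local communication.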
Paul A. Clayton wrote:
(I was also very surprised by how much extra state the A55 has:
over 100 extra "registers". Even though these are not all 64-bit
data storage units, this was still a surprising amount of extra
state for a core targeting area efficiency. The storage itself may
not be particularly expensive, but it gives some insight into how
complex even a "simple" implementation can be.)
Imagine having to stick all this stuff on a die at 2µ instead of 5nm !!
mitchalsup@aol.com (MitchAlsup1) writes:
Paul A. Clayton wrote:
(I was also very surprised by how much extra state the A55 has:
over 100 extra "registers". Even though these are not all 64-bit
data storage units, this was still a surprising amount of extra
state for a core targeting area efficiency. The storage itself may
not be particularly expensive, but it gives some insight into how
complex even a "simple" implementation can be.)
Imagine having to stick all this stuff on a die at 2µ instead of 5nm !!
I suspect Paul is referring to what ARMv8 calls "System Registers";
despite the name, most are stored in flops, and in the case of
the ID registers, wires (perhaps anded with local e-fuses).
On 3/24/24 4:39 PM, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
Paul A. Clayton wrote:
(I was also very surprised by how much extra state the A55 has:
over 100 extra "registers". Even though these are not all 64-bit
data storage units, this was still a surprising amount of extra
state for a core targeting area efficiency. The storage itself may
not be particularly expensive, but it gives some insight into how
complex even a "simple" implementation can be.)
However, having over 100 seems like a lot. Supporting performance
counters and tracing is also something that would have been nearly
inconceivable for something like the MIPS R2000.
An argument might be made that some designs would have no use for
most of such extra state. Performance monitoring is useful for
software development (and theoretically for OS decisions for
scheduling, core migration, and other functions), but seems likely
to be highly underutilized for typical use. A55 is presumably
large enough that a synthesis-time removal of much of this
functionality would have a tiny effect on total area.
Even for a
microcontroller the area cost might not be problematic.
On 3/24/24 4:39 PM, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
Paul A. Clayton wrote:
(I was also very surprised by how much extra state the A55 has:
over 100 extra "registers". Even though these are not all 64-bit
data storage units, this was still a surprising amount of extra
state for a core targeting area efficiency. The storage itself may
not be particularly expensive, but it gives some insight into how
complex even a "simple" implementation can be.)
However, having over 100 seems like a lot. Supporting performance
counters and tracing is also something that would have been nearly inconceivable for something like the MIPS R2000.
Paul A. Clayton wrote:
On 3/24/24 4:39 PM, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
Paul A. Clayton wrote:
(I was also very surprised by how much extra state the A55 has:
over 100 extra "registers". Even though these are not all 64-bit
data storage units, this was still a surprising amount of extra
state for a core targeting area efficiency. The storage itself may
not be particularly expensive, but it gives some insight into how
complex even a "simple" implementation can be.)
However, having over 100 seems like a lot. Supporting performance
counters and tracing is also something that would have been nearly
inconceivable for something like the MIPS R2000.
Many of these registers are configuration controls that
get set once at boot and never change.
Others are dynamic but not time critical, like debug registers.
Only a small number would be diddled on a regular basis,
like interrupt control.
They don't all need the same access speed -
depending on usage some (most?) can be on "slow" buses
that maybe take multiple clocks to read or write.
On 3/24/24 4:39 PM, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
Paul A. Clayton wrote:
(I was also very surprised by how much extra state the A55 has:
over 100 extra "registers". Even though these are not all 64-bit
data storage units, this was still a surprising amount of extra
state for a core targeting area efficiency. The storage itself may
not be particularly expensive, but it gives some insight into how
complex even a "simple" implementation can be.)
Imagine having to stick all this stuff on a die at 2µ instead of 5nm !!
Yes. (There were also some debug registers, performance monitoring
registers, trace registers, etc.)
I suspect Paul is referring to what ARMv8 calls "System Registers";
despite the name, most are stored in flops, and in the case of
the ID registers, wires (perhaps anded with local e-fuses).
Yes, many of the bits would be implemented as ROM/PROM and many
would presumably be scattered about because they control/interact
with specific functionality. They are similar to I/O device
registers. (I/O devices have also become more complex.)
However, having over 100 seems like a lot. Supporting performance
counters and tracing is also something that would have been nearly
inconceivable for something like the MIPS R2000.
An argument might be made that some designs would have no use for
most of such extra state. Performance monitoring is useful for
software development (and theoretically for OS decisions for
scheduling, core migration, and other functions), but seems likely
to be highly underutilized for typical use.
"Paul A. Clayton" <paaronclayton@gmail.com> writes:
On 3/24/24 4:39 PM, Scott Lurndal wrote:
There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
AArch64 also (optionally) supports statistical profiling and
out-of-band instruction tracing (ETF). The demand from users
is such that all those features are present in most designs.
Scott Lurndal wrote:
"Paul A. Clayton" <paaronclayton@gmail.com> writes:
On 3/24/24 4:39 PM, Scott Lurndal wrote:
There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
AArch64 also (optionally) supports statistical profiling and
out-of-band instruction tracing (ETF). The demand from users
is such that all those features are present in most designs.
My 66000 Architecture defines 8 performance counters at each layer of
the design:: cores get 8 counters, L1s get 8 counters, L3s get 8
counters, Interconnect gets 8 counters, Memory Controller gets 8
counters, PCIe root gets 8 counters--and every instance multiplies
the counters.
There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
AArch64 also (optionally) supports statistical profiling and
out-of-band instruction tracing (ETF). The demand from users
is such that all those features are present in most designs.
scott@slp53.sl.home (Scott Lurndal) writes:
There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
AArch64 also (optionally) supports statistical profiling and
out-of-band instruction tracing (ETF). The demand from users
is such that all those features are present in most designs.
Interesting. I would have expected that the likes of me are few and
far between, and easy to ignore for a big company like ARM, Intel or AMD.
My theory was that the CPU manufacturers put performance monitoring
counters in CPUs in order to understand the performance of real-world programs themselves, and how they should tweak the successor core to
relieve it of bottlenecks.
- anton
scott@slp53.sl.home (Scott Lurndal) writes:
There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
AArch64 also (optionally) supports statistical profiling and
out-of-band instruction tracing (ETF). The demand from users
is such that all those features are present in most designs.
Interesting. I would have expected that the likes of me are few and
far between, and easy to ignore for a big company like ARM, Intel
or AMD.
scott@slp53.sl.home (Scott Lurndal) writes:
There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
AArch64 also (optionally) supports statistical profiling and
out-of-band instruction tracing (ETF). The demand from users
is such that all those features are present in most designs.
Interesting. I would have expected that the likes of me are few and
far between, and easy to ignore for a big company like ARM, Intel or AMD.
My theory was that the CPU manufacturers put performance monitoring
counters in CPUs in order to understand the performance of real-world programs themselves, and how they should tweak the successor core to
relieve it of bottlenecks.
In article <2024Mar25.193535@mips.complang.tuwien.ac.at>,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
scott@slp53.sl.home (Scott Lurndal) writes:
There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
AArch64 also (optionally) supports statistical profiling and
out-of-band instruction tracing (ETF). The demand from users
is such that all those features are present in most designs.
Interesting. I would have expected that the likes of me are few and
far between, and easy to ignore for a big company like ARM, Intel
or AMD.
The question is if "users" to ARM Holdings are actual end-users, or the
SoC manufacturers who build chips incorporating Aarch64 cores. I'd
expect most of the latter to want those features so that they can
understand the performance of their silicon better.
Anton Ertl wrote:
scott@slp53.sl.home (Scott Lurndal) writes:
There is a significant demand for performance monitoring. Note
that in addition to the standard performance monitoring registers,
AArch64 also (optionally) supports statistical profiling and
out-of-band instruction tracing (ETF). The demand from users
is such that all those features are present in most designs.
Interesting. I would have expected that the likes of me are few and
far between, and easy to ignore for a big company like ARM, Intel or AMD.
My theory was that the CPU manufacturers put performance monitoring
counters in CPUs in order to understand the performance of real-world
programs themselves, and how they should tweak the successor core to
relieve it of bottlenecks.
Having reverse engineered the original Pentium EMON counters I got a
meeting with Intel about their next cpu (the PentiumPro), what I was
told about the Pentium was that this chip was the first one which was
too complicated to create/sell an In-Circuit Emulator (ICE) version, so
instead they added a bunch of counters for near-zero overhead monitoring
and depended on a bit-serial read-out when they needed to dump all state
for debugging. (I have forgotten the proper term for that interface! :-( )
The question is if "users" to ARM Holdings are actual end-users, or the
SoC manufacturers who build chips incorporating Aarch64 cores. I'd
expect most of the latter to want those features so that they can
understand the performance of their silicon better.
The biggest demand is from the OS vendors. Hardware folks have
simulation and emulators.
Look at vtune, for example.
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Having reverse engineered the original Pentium EMON counters I got a
meeting with Intel about their next cpu (the PentiumPro), what I was
told about the Pentium was that this chip was the first one which was
too complicated to create/sell an In-Circuit Emulator (ICE) version, so
instead they added a bunch of counters for near-zero overhead monitoring
and depended on a bit-serial read-out when they needed to dump all state
for debugging. (I have forgotten the proper term for that interface! :-( )
Scan chains. The modern interface to scan chains (which we used on
the mainframes in the late 70's/early 80's) is JTAG.
scott@slp53.sl.home (Scott Lurndal) writes:
The biggest demand is from the OS vendors. Hardware folks have
simulation and emulators.
You don't want to use a full-blown microarchitectural emulator for a
long-running program.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
The biggest demand is from the OS vendors. Hardware folks have
simulation and emulators.
You don't want to use a full-blown microarchitectural emulator for a
long-running program.
Generally hardware folks don't run 'long-running programs' when
analyzing performance, they use the emulator for determining latencies,
bandwidths and efficacy of cache coherency algorithms and
cache prefetchers.
Their target is not application analysis.
scott@slp53.sl.home (Scott Lurndal) writes:
Their target is not application analysis.
This sounds like hardware folks that are only concerned with
memory-bound programs.
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
The biggest demand is from the OS vendors. Hardware folks have
simulation and emulators.
You don't want to use a full-blown microarchitectural emulator for a
long-running program.
Generally hardware folks don't run 'long-running programs' when
analyzing performance, they use the emulator for determining latencies,
bandwidths and efficacy of cache coherency algorithms and
cache prefetchers.
Their target is not application analysis.
This sounds like hardware folks that are only concerned with
memory-bound programs.
I OTOH expect that designers of out-of-order (and in-order) cores
analyse the performance of various programs to find out where the
bottlenecks of their microarchitectures are in benchmarks and
applications that people look at to determine which CPU to buy. And
that's why we not only just have PMCs for memory accesses, but also
for branch prediction accuracy, functional unit utilization, scheduler utilization, etc.
- anton