From Newsgroup: comp.arch
Every so often I go back and read my Seminal thread::
"Arguments for a Sane Instruction Set Architecture" -------------------------------------------------------
Today's topic addresses questions asked by Nick McLaren::
From about 1/2 way through the thread::
I pick this up because My 66000 Architecture has changed "a lot" in this
corner since the thread retired (from google.groups).
X) It should be possible to extend the arithmetic (especially multiple precision and different overflow handling), easily and efficiently,
without supervisor involvement. Fixing the interrupt handling would
be a significant help, here.
The CARRY mechanism goes a long way toward providing access to multiprecision
arithmetic. While each step along the calculation is a unique instruction,
each instruction specifies its signedness, so the earlier calculations
will be unsigned, and the final calculation will be {Signed, unSigned};
the {Signed} variants check Overflow when "all the bits did not go
to the result".
CARRY provides an additional input (operand) and an additional output
(result) at the register level, converting k-Operand-1-Result instructions
into (k+1)-Operand-2-Result instructions. In effect, CARRY provides a
list of bits that get concatenated onto (up to) 8 subsequent instructions;
and provides an "accumulator" wired onto the data path, so that the
register CARRY provides can be read-once and written-once, saving
RF accesses on lower end machines.
CARRY operates on ADD, MUL, DIV, SR, SL, INS, FADD, FMUL, ...
CARRY provides the simple notion that if the result delivers all the
bits, then Overflow cannot happen; but if fewer than all the bits are
delivered, the non-delivered bits are checked for significance to
determine Overflow.
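To make the effect concrete, here is a C sketch (not My 66000 code) of what
a CARRY-chained 256-bit add computes: the low three additions are unsigned
and merely propagate the carry bit, while the final, signed addition checks
whether the bits that did not fit in the result are significant.

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustration only: 256-bit signed add built from 64-bit adds.
     * On My 66000 a CARRY prefix would thread the carry bit through the
     * ADD instructions; here the carry is an explicit variable.        */
    typedef struct { uint64_t w[4]; } i256;   /* w[0] = least significant */

    bool add256(i256 *sum, const i256 *a, const i256 *b)
    {
        unsigned carry = 0;
        for (int i = 0; i < 3; i++) {          /* unsigned steps           */
            uint64_t s = a->w[i] + b->w[i] + carry;
            carry = (s < a->w[i]) || (carry && s == a->w[i]);
            sum->w[i] = s;
        }
        /* Final, signed step: Overflow iff both operands share a sign and
         * the delivered result has the other sign (a significant bit was
         * not delivered).                                                */
        uint64_t s = a->w[3] + b->w[3] + carry;
        sum->w[3] = s;
        return (~(a->w[3] ^ b->w[3]) & (a->w[3] ^ s)) >> 63;
    }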
Y) It should be possible for threads/ps/ts/ws to queue short messages
and send 'notifications' (perhaps interrupts) to others in the same
task unit, without supervisor involvement, and preferably without
main memory involvement.
Much has changed in this corner of the My 66000 architecture. In particular::
all interrupts are MSI-X based, and since MSI-X interrupts are small
messages sent to an "interrupt service port" aperture, any thread with
a PTE to an interrupt table can send small messages as if it were an interrupt--BECAUSE it IS.
It is assumed that each Guest OS has its own interrupt table, as does
each HostOS, and each HyperVisor. In addition, there may be other
interrupt tables associated with trusted but unprivileged threads
so that messaging is as efficient as performing an ATOMIC event
and more reliable, too.
My 66000 Interrupt table architecture::
An interrupt table is a 1-page structure in memory that contains
a) a raised bit-vector (64-bits)
b) a higher-originator table pointer (51-bits + 1 valid bit)
c) 64 queue headers {22-bits each--2½ entries per DW}
d) a Free-list on-page 'pointer' (10-bits)
e) 996 messages {42-bits each in a 64-bit DW}
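Taken at face value, the counts sum to 1 + 1 + 26 + 996 = 1024 doublewords,
which suggests one 8 KB page. A C sketch of that layout follows; the exact
placement of the 10-bit free-list pointer and the packing of the 22-bit
queue headers are not specified above, so those details are my assumptions.

    #include <stdint.h>

    /* Sketch of one interrupt table, assumed to occupy a single 8 KB page
     * (1024 doublewords).  Field packing is illustrative, not definitive. */
    struct interrupt_table {
        uint64_t raised;         /* a) 64-bit raised bit-vector, one bit per queue */
        uint64_t higher_origin;  /* b) 51-bit table pointer + 1 valid bit          */
        uint64_t queue_hdrs[26]; /* c) 64 headers, 22 bits each, ~2.5 per DW;      */
                                 /* d) the 10-bit free-list pointer is assumed     */
                                 /*    to live in the spare bits of this area      */
        uint64_t message[996];   /* e) 42-bit messages, one per 64-bit DW          */
    };

    _Static_assert(sizeof(struct interrupt_table) == 8192, "one 8 KB page");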
There is an interrupt service port (in the LLC) that receives MSI-X
messages within the Interrupt Table Aperture, and enQueues them onto
the table. Should this new interrupt raise a priority level higher
than the table currently contains, the Service Port broadcasts the
Raised bit-vector to all cores.
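Behaviorally, the enqueue path might look like the sketch below; I am
assuming the 6-bit priority selects one of the 64 queues and its bit in the
raised vector (higher bit meaning higher priority), and the queue-linking
and broadcast steps are left as abstract, hypothetical helpers.

    #include <stdint.h>

    extern void link_message_onto_queue(unsigned q, uint64_t msg); /* assumed helper */
    extern void broadcast_raised(uint64_t raised_vector);          /* assumed hook   */

    /* Hypothetical sketch of the service port enqueuing one MSI-X message. */
    void service_port_enqueue(uint64_t *raised, unsigned priority, uint64_t message)
    {
        uint64_t before = *raised;

        link_message_onto_queue(priority, message);  /* take a free slot, append */
        *raised |= 1ull << priority;

        /* Broadcast only when this arrival raises a priority level higher
         * than any the table already contained.                            */
        if ((1ull << priority) > before)
            broadcast_raised(*raised);
    }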
Each Core has an interrupt Table control register containing the page-
address of its interrupt table. There is a valid bit in case the core
is not watching for interrupts. When the core sees the broadcast and
verifies that the broadcast is for its table, the core begins interrupt negotiation.
Interrupt Negotiation is a process by which multiple cores bid for the
highest priority pending enabled interrupt. This negotiation can fail
and when it fails, the Fetch-Execute part of the core has not lost a
single cycle of execution. Successful Interrupt Negotiation results in
an interrupt being deQueued out of the table, and a control transfer
to the dispatcher.
The received interrupt message contains
a) CS 2-bits
b) Priority 6-bits
c) ISR 8-bits
d) MSI-X message 32-bits
CS selects the Context of the Software Stack that "takes" the interrupt.
Priority sets the core priority::
all interrupts of lesser priority are instantly Disabled
all interrupts of higher priority remain Enabled
ISR indexes a table of ISR handlers ----------------------------------------------------
A 22-bit queue header contains two 10-bit page offset "pointers",
a bit controlling whether the queue is accepting new arrivals of
interrupts, and a bit controlling whether this queue delivers
queued messages into the raised register. ----------------------------------------------------
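Packed into 22 bits, such a header might look like the sketch below; which
10-bit offset is the head and which is the tail, and the bit ordering, are
my guesses rather than anything stated above.

    #include <stdint.h>

    /* Illustrative layout of one 22-bit queue header (bit positions assumed). */
    struct queue_header {
        uint32_t head    : 10;  /* on-page offset of the first queued message      */
        uint32_t tail    : 10;  /* on-page offset of the last queued message       */
        uint32_t accept  : 1;   /* queue is accepting newly arriving interrupts    */
        uint32_t deliver : 1;   /* queue delivers its messages into the raised reg */
    };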
The Dispatcher arrives with the register file of the Context where
R0 contains the received interrupt message and R1-R15 are temporary
registers for ISR and Dispatch to use at will. CS and Priority
have already been moved to CRs.
The Dispatcher isolates ISR and then calls through the ISR table (2 instructions).
When the ISR returns, the Dispatcher performs the SVR instruction (1 instruction). ----------------------------------------------------
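In C-like terms the Dispatcher's job amounts to the sketch below; the real
thing is two instructions plus SVR, and the positions of the ISR index and
MSI-X payload within R0 are my assumed layout, not the architecture's.

    #include <stdint.h>

    typedef void (*isr_fn)(uint32_t msix_payload);
    extern isr_fn isr_table[256];             /* assumed handler table */

    /* Hypothetical sketch of dispatch.  CS and Priority have already been
     * moved into control registers by the time this "code" runs.         */
    void dispatcher(uint64_t r0 /* received interrupt message */)
    {
        uint32_t payload = (uint32_t)r0;      /* assumed: 32-bit MSI-X message in low bits */
        unsigned isr     = (r0 >> 32) & 0xFF; /* assumed: 8-bit ISR index above it         */

        isr_table[isr](payload);              /* "calls through ISR table"                 */
        /* On return, the real Dispatcher executes SVR to resume the
         * interrupted context.                                           */
    }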
So, for this-thread to send an interrupt to another-thread, this-thread
has to have a PTE translating to another-thread's interrupt table,
and has to have worked out a message protocol whereby a 32-bit message
is all that is needed for another-thread to decipher why it received
this message.
Of course in order to get to the point where the above paragraph is
workable, this-thread has to ask its Guest OS for the PTE and already
have worked out a message structure (SW).
But now, with the PTE in its MMU tables, this-thread can send an interrupt
to another-thread that the GuestOS allows, with a single STD instruction.
AND there is no reason that another-thread is not this-thread. -------------------------------------------------------
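From software's point of view the send side really is one store. A sketch,
where the aperture address comes from the PTE the Guest OS handed out and
the 32-bit payload means whatever the two threads agreed it means (all
names here are hypothetical):

    #include <stdint.h>

    /* Hypothetical send side: one store into the peer's interrupt-table
     * aperture, which this thread's MMU maps via a Guest-OS-granted PTE. */
    void notify_peer(volatile uint64_t *peer_aperture_slot, uint32_t payload)
    {
        /* The hardware treats this store as an MSI-X message and enqueues
         * it on the peer's interrupt table: the "single STD instruction". */
        *peer_aperture_slot = payload;
    }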
The Higher Originator pointer is used when::
a) this interrupt table is not being observed by any core
b) this interrupt table exhausts its free-list.
(a) is used to inform HostOS that one of its VMs needs to get a time
slice as soon as possible.
(b) is used to prevent loss of interrupts. ----------------------------------------------------
Every time the core changes the core-interrupt table pointer, the
interrupt negotiator goes out and reads the interrupt table raised
control register.
Z) There should be some performance counters that provide useful data
for programmers, rather than the hardware designers, and those should
be obtainable efficiently, without supervisor involvement, by the
thread/p/t/w being monitored and any others that are managing it.
A set of counters contains 8 performance counters and a counter-
access Control Register. Since My 66000 supports multi-location
memory-reference instructions, the set can be accessed two ways::
When the CR is accessed as a scalar, it is used as a control register
to the counters, and contains eight 8-bit fields; each field controls
1 counter and specifies which events it counts,...
When the CR is accessed as a multiple, the counters themselves are
accessed. It is easy to LDM all 8 counters in a single instruction,
or to move all 8 counters with an MM instruction,...
{Yes, I remember Scott Lundal dislikes this access structure}
When the Guest OS grants R-- access, the counters can be read by the application
but not modified; readability and writability are independently controllable.
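A C-flavored sketch of the two views follows; the names, and treating the
8-bit control fields purely as event selectors, are my simplifications (on
the machine itself these would be scalar LD/ST versus LDM/STM/MM against
the same counter set):

    #include <stdint.h>
    #include <string.h>

    /* Sketch of one counter set: eight 64-bit counters plus a control
     * register holding eight 8-bit per-counter fields.                  */
    struct counter_set {
        uint64_t counter[8];
        uint64_t control;        /* eight 8-bit fields, one per counter  */
    };

    /* "Scalar" view: program which event counter 'ctr' counts.          */
    static void select_event(struct counter_set *s, int ctr, uint8_t event)
    {
        s->control &= ~(0xFFull << (8 * ctr));
        s->control |=  (uint64_t)event << (8 * ctr);
    }

    /* "Multiple" view: read all eight counters at once, as an LDM or MM
     * instruction would.                                                 */
    static void sample_all(const struct counter_set *s, uint64_t out[8])
    {
        memcpy(out, s->counter, sizeof s->counter);
    }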
As far as performance counters are concerned, each "major resource"
has a set of counters. A Core consists of 4 major resources and
thus has 32 performance counters. There is a list of 36 events a
core {Fetch-Execute} counts, a list of 21 events the L1 cache counts,
a list of 25 events the L2 cache counts, and a final list of 15 events
the Interconnect block counts.
Further out, {L3==LLC, DRAM, HostBridge, PCIeRoot, and individual
PCIe trees} each have their own sets of counters.
Yes, the counters are still very HW-centric--sorry Nick--but it is
HW events that the performance counters count.