Just found a gem on Cray arithmetic, which (rightly) incurred
The Wrath of Kahan:
https://people.eecs.berkeley.edu/~wkahan/CS279/CrayUG.pdf
"Pessimism comes less from the error-analyst's dour personality--- Synchronet 3.21a-Linux NewsLink 1.2
than from his mental model of computer arithmetic."
I also had to look up "equipollent".
I assume many people in this group know this, but for those who
don't, it is well worth reading.
On Sat, 11 Oct 2025 10:32:22 +0000, Thomas Koenig wrote:
Just found a gem on Cray arithmetic, which (rightly) incurred The Wrath
of Kahan:
While the arithmetic on the Cray I was bad enough, this document seems
to focus on some later models in the Cray line, which, like the IBM
System/360 when it first came out, before an urgent retrofit, lacked a
guard digit!
The concluding part of that article had a postscript which said that,
while Cray accepted the importance of fixing the deficiencies in future models, there would be no retrofit to existing ones.
After reading that article, I looked for more information on other
processors with poor arithmetic, and I found that the Intel i860 also had
a branch delay slot, as well as using traps to implement some portions of
the IEEE 754 standard... thus, presumably, being one of the architectures
to inspire the piece about bad architectures from Linus Torvalds recently
quoted here.
Concerning implementing only a part of FP in hardware, and throwing
the rest over the wall to software, Alpha is probably the
best-known example (denormal support only in software) ...
On Mon, 13 Oct 2025 07:39:11 GMT, Anton Ertl wrote:
Concerning implementing only a part of FP in hardware, and throwing
the rest over the wall to software, Alpha is probably the
best-known example (denormal support only in software) ...
The hardware designers took many years -- right through the 1990s, I
think -- to be persuaded that IEEE754 really was worth implementing
in its entirety, that the “too hard” or “too obscure” parts were there
for an important reason, to make programming that much easier, and
should not be skipped.
For many non-obvious parts of 754 it's true. For many other parts, esp.
You’ll notice that Kahan mentioned Apple more than once, as seemingly
his favourite example of a company that took IEEE754 to heart and
implemented it completely in software, where their hardware vendor of
choice at the time (Motorola), skimped a bit on hardware support.
John Savard <quadibloc@invalid.invalid> writes:
After reading that article, I looked for more information on other
processors with poor arithmetic, and I found that the Intel i860 also had
a branch delay slot, as well as using traps to implement some portions of
the IEEE 754 standard... thus, presumably, being one of the architectures
to inspire the piece about bad architectures from Linus Torvalds recently
quoted here.
There never was a Linux port to the i860. There are lots of
architectures with Linux ports that have the properties that Linus
Torvalds mentions. Concerning implementing only a part of FP in
hardware, and throwing the rest over the wall to software, Alpha is
probably the best-known example (denormal support only in software),
and Linus Torvalds worked on it personally. Concerning exposing the
pipeline, MIPS-I has not just branch-delay slots, but also other
limitations. SPARC and HPPA have branch delay slots.
- anton
On Mon, 13 Oct 2025 07:39:11 GMT, Anton Ertl wrote:
Concerning implementing only a part of FP in hardware, and throwing
the rest over the wall to software, Alpha is probably the
best-known example (denormal support only in software) ...
The hardware designers took many years -- right through the 1990s, I think -- to be persuaded that IEEE754 really was worth implementing in its entirety, that the “too hard” or “too obscure” parts were there for an
important reason, to make programming that much easier, and should not be skipped.
You’ll notice that Kahan mentioned Apple more than once, as seemingly his favourite example of a company that took IEEE754 to heart and implemented
it completely in software, where their hardware vendor of choice at the
time (Motorola), skimped a bit on hardware support.
Lawrence D’Oliveiro <ldo@nz.invalid> posted:
On Mon, 13 Oct 2025 07:39:11 GMT, Anton Ertl wrote:
Concerning implementing only a part of FP in hardware, and throwing
the rest over the wall to software, Alpha is probably the
best-known example (denormal support only in software) ...
The hardware designers took many years -- right through the 1990s, I think
-- to be persuaded that IEEE754 really was worth implementing in its
entirety, that the “too hard” or “too obscure” parts were there for an
important reason, to make programming that much easier, and should not be
skipped.
I disagree:: full compliance with IEEE 754-whenever is to make programs
more reliable (more numerically stable) and to give the programmer a
constant programming model (not easier).
You can argue that not having to do ((x-0.5)-0.5) as you did in Hex
did make it easier--but NaNs, infinities, Underflow at the Denorm level
went in the other direction.
You’ll notice that Kahan mentioned Apple more than once, as seemingly his
favourite example of a company that took IEEE754 to heart and implemented
it completely in software, where their hardware vendor of choice at the
time (Motorola), skimped a bit on hardware support.
On 10/13/2025 2:39 AM, Anton Ertl wrote:
John Savard <quadibloc@invalid.invalid> writes:
After reading that article, I looked for more information on other
processors with poor arithmetic, and I found that the Intel i860 also had
a branch delay slot, as well as using traps to implement some portions of
the IEEE 754 standard... thus, presumably, being one of the architectures
to inspire the piece about bad architectures from Linus Torvalds recently
quoted here.
There never was a Linux port to the i860. There are lots of
architectures with Linux ports that have the properties that Linus
Torvalds mentions. Concerning implementing only a part of FP in
hardware, and throwing the rest over the wall to software, Alpha is
probably the best-known example (denormal support only in software),
and Linus Torvalds worked on it personally. Concerning exposing the
pipeline, MIPS-I has not just branch-delay slots, but also other
limitations. SPARC and HPPA have branch delay slots.
From what I can gather, the MIPS chip in the N64 also only did a
partial implementation in hardware, with optional software traps for the rest.
Apparently it can be a problem because modern FPUs don't exactly
recreate N64 behavior, and a lot of the games ran without the traps, so
a lot of the N64 games suffer drift and other issues over time (as the programmers had compensated for the MIPS issues in code rather than via traps).
Though, reading some stuff, implies a predecessor chip (the R4000) had a more functionally complete FPU. So, I guess it is also possible that the R4300 had a more limited FPU to make it cheaper for the embedded market.
Well, in any case, my recent efforts in these areas have been mostly:
Trying to hunt down some remaining bugs involving RVC in the CPU core;
RVC is seemingly "the gift that keeps on giving" in this area.
(The more dog-chewed the encoding, the harder it is to find bugs)
Going from just:
"Doing weak/crappy FP in hardware"
To:
"Trying to do less crappy FPU via software traps".
A "mostly traps only" implementation of Binary128.
Doesn't exactly match the 'Q' extension, but that is OK.
I sorta suspect not many people are going to implement Q either.
As I see it though, if the overall cost of the traps remains below 1%,
it is mostly OK.
Though, ATM the FDIV and FSQRT traps for Binary128 are almost slow
enough to justify turning them into a syscall like handler. Though, in
this case would likely overlap it with the Page-Fault handler (fallback
path for the TLB Miss handler, which is also being used here for FPU emulation).
Partial issue is mostly that one doesn't want to remain in an interrupt handler for too long because this blocks any other interrupts,
On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D’Oliveiro wrote:
The hardware designers took many years -- right through the 1990s, I
think -- to be persuaded that IEEE754 really was worth implementing in
its entirety, that the “too hard” or “too obscure” parts were there for
an important reason, to make programming that much easier, and should
not be skipped.
I disagree:: full compliance with IEEE 754-whenever is to make programs
more reliable (more numerically stable) and to give the programmer a
constant programming model (not easier).
You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did
make it easier--but NaNs, infinities, Underflow at the Denorm level went
in the other direction.
On Mon, 13 Oct 2025 09:05:18 -0000 (UTC)
Lawrence D’Oliveiro <ldo@nz.invalid> wrote:
The hardware designers took many years -- right through the 1990s,
I think -- to be persuaded that IEEE754 really was worth
implementing in its entirety, that the “too hard” or “too obscure”
parts were there for an important reason,
It took many years to figure it out for *DEC* hardware designers.
Was there any other general-purpose RISC vendor that suffered from
similar denseness?
You’ll notice that Kahan mentioned Apple more than once, as
seemingly his favourite example of a company that took IEEE754 to
heart and implemented it completely in software, where their
hardware vendor of choice at the time (Motorola), skimped a bit on
hardware support.
According to my understanding, Motorola suffered from being early
adopters, similarly to Intel. They implemented 754 before the
standard was finished and later on were in a difficult position of
conflict between compatibility with the standard vs compatibility with
previous generations. Moto is less forgivable than Intel, because
while they were also early adopters, they were not nearly as early.
BGB <cr88192@gmail.com> posted:
On 10/13/2025 2:39 AM, Anton Ertl wrote:
John Savard <quadibloc@invalid.invalid> writes:
After reading that article, I looked for more information on other
processors with poor arithmetic, and I found that the Intel i860 also had
a branch delay slot, as well as using traps to implement some portions of
the IEEE 754 standard... thus, presumably, being one of the architectures
to inspire the piece about bad architectures from Linus Torvalds recently
quoted here.
There never was a Linux port to the i860. There are lots of
architectures with Linux ports that have the properties that Linus
Torvalds mentions. Concerning implementing only a part of FP in
hardware, and throwing the rest over the wall to software, Alpha is
probably the best-known example (denormal support only in software),
and Linus Torvalds worked on it personally. Concerning exposing the
pipeline, MIPS-I has not just branch-delay slots, but also other
limitations. SPARC and HPPA have branch delay slots.
From what I can gather, the MIPS chip in the N64 also only did a
partial implementation in hardware, with optional software traps for the
rest.
Apparently it can be a problem because modern FPUs don't exactly
recreate N64 behavior, and a lot of the games ran without the traps, so
a lot of the N64 games suffer drift and other issues over time (as the
programmers had compensated for the MIPS issues in code rather than via
traps).
And this is why FP wants high quality implementation.
Though, reading some stuff, implies a predecessor chip (the R4000) had a
more functionally complete FPU. So, I guess it is also possible that the
R4300 had a more limited FPU to make it cheaper for the embedded market.
Well, in any case, my recent efforts in these areas have been mostly:
Trying to hunt down some remaining bugs involving RVC in the CPU core;
RVC is seemingly "the gift that keeps on giving" in this area.
(The more dog-chewed the encoding, the harder it is to find bugs)
Going from just:
"Doing weak/crappy FP in hardware"
To:
"Trying to do less crappy FPU via software traps".
A "mostly traps only" implementation of Binary128.
Doesn't exactly match the 'Q' extension, but that is OK.
I sorta suspect not many people are going to implement Q either.
Do it right or don't do it at all.
As I see it though, if the overall cost of the traps remains below 1%,
it is mostly OK.
While I can agree with the sentiment, the emulation overhead makes this
very hard to achieve indeed.
Though, ATM the FDIV and FSQRT traps for Binary128 are almost slow
enough to justify turning them into a syscall like handler. Though, in
this case would likely overlap it with the Page-Fault handler (fallback
path for the TLB Miss handler, which is also being used here for FPU
emulation).
Partial issue is mostly that one doesn't want to remain in an interrupt
handler for too long because this blocks any other interrupts,
At the time of control arrival, interrupts are already reentrant in
My 66000. A higher priority interrupt will take control from the
lower priority interrupt.
On Mon, 13 Oct 2025 13:12:12 +0300, Michael S wrote:
On Mon, 13 Oct 2025 09:05:18 -0000 (UTC)
Lawrence D’Oliveiro <ldo@nz.invalid> wrote:
The hardware designers took many years -- right through the 1990s,
I think -- to be persuaded that IEEE754 really was worth
implementing in its entirety, that the “too hard” or “too obscure”
parts were there for an important reason,
It took many years to figure it out for *DEC* hardware designers.
Was there any other general-purpose RISC vendor that suffered from
similar denseness?
I thought they all did, just about.
You’ll notice that Kahan mentioned Apple more than once, as
seemingly his favourite example of a company that took IEEE754 to
heart and implemented it completely in software, where their
hardware vendor of choice at the time (Motorola), skimped a bit on
hardware support.
According to my understanding, Motorola suffered from being early
adopters, similarly to Intel. They implemented 754 before the
standard was finished and later on were in a difficult position of
conflict between compatibility with the standard vs compatibility with
previous generations. Moto is less forgivable than Intel, because
while they were also early adopters, they were not nearly as early.
Let’s see, the Motorola 68881 came out in 1984 <https://en.wikipedia.org/wiki/Motorola_68881>, while the first
release of IEEE754 dates from two years before <https://en.wikipedia.org/wiki/IEEE_754>.
On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:
On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D’Oliveiro wrote:
The hardware designers took many years -- right through the 1990s, I
think -- to be persuaded that IEEE754 really was worth implementing in
its entirety, that the “too hard” or “too obscure” parts were there for
an important reason, to make programming that much easier, and should
not be skipped.
I disagree:: full compliance with IEEE 754-whenever is to make programs
more reliable (more numerically stable) and to give the programmer a
constant programming model (not easier).
As a programmer, I count all that under my definition of “easier”.
You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did
make it easier--but NaNs, infinities, Underflow at the Denorm level went
in the other direction.
NaNs and infinities allow you to propagate certain kinds of pathological results right through to the end of the calculation, in a mathematically consistent way.
Denormals -- aren’t they called “subnormals” now? -- are also about making
things easier. Providing graceful underflow means a gradual loss of
precision as you get too close to zero, instead of losing all the bits at once and going straight to zero. It’s about the principle of least surprise.
Again, all that helps to make things easier for programmers --
particularly those of us whose expertise of numerics is not on a level
with Prof Kahan.
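A small C sketch of what graceful underflow buys (illustrative only, not
from any of the posts): starting at the smallest normal double, repeated
halving walks through the subnormal range one lost bit at a time instead
of snapping straight to zero.

#include <stdio.h>
#include <float.h>

int main(void) {
    double x = DBL_MIN;               /* 2^-1022, smallest normal double */
    for (int i = 0; i < 60 && x != 0.0; i++) {
        printf("%2d: %a\n", i, x);    /* %a shows the exact binary value */
        x /= 2.0;                     /* each step sheds one bit of precision */
    }
    return 0;                         /* last nonzero value printed is 0x1p-1074 */
}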
Circa 1981 there were the Weitek chips. Wikipedia doesn't say if the
early ones were 754 compatible, but later chips from 1986 intended for
the 386 were compatible, and they seem to have been used by many
(Motorola, Intel, Sun, PA-RISC, ...)
I see the benefits of NaNs - sometimes you have bad data, and it can be
useful to have a representation for that. The defined "viral" nature of
NaNs means that you can write your code in a streamlined fashion,
knowing that if a NaN goes in, a NaN comes out - you don't have to have
checks and conditionals in the middle of your calculations.
And what are you doing where it is acceptable to lose some precision
with those numbers, but not to give up and say things have gone badly
wrong (a NaN or infinity, or underflow signal)?
I have a lot of
difficulty imagining a situation where denormals would be helpful and
you haven't got a major design issue with your code
perhaps
calculations should be re-arranged, algorithms changed, or you should be
using an arithmetic format with greater range (switch from single to
double, double to quad, or use something more advanced).
David Brown <david.brown@hesbynett.no> writes:
I see the benefits of NaNs - sometimes you have bad data, and it can be
useful to have a representation for that. The defined "viral" nature of
NaNs means that you can write your code in a streamlined fashion,
knowing that if a NaN goes in, a NaN comes out - you don't have to have
checks and conditionals in the middle of your calculations.
Unfortunately, there are no NaFs (not a flag), and if there were, how
would an IF behave? As a consequence, a<b can have a different result
than !(a>=b) if a or b can be a NaN. That's quite contrary to what programmers tend to expect. So NaNs have their pitfalls.
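A minimal sketch of that pitfall in C (assuming only C99 <math.h>):

#include <math.h>
#include <stdio.h>

int main(void) {
    double a = NAN, b = 1.0;
    /* every ordered comparison involving a NaN is false, so the two
       spellings below are not equivalent */
    printf("a < b    : %d\n", a < b);        /* 0 */
    printf("!(a >= b): %d\n", !(a >= b));    /* 1 */
    if (a < b)  printf("a<b branch taken\n");
    else        printf("else branch taken\n");   /* this one runs */
    return 0;
}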
Would it be better to trap if a NaN is compared with an ordinary
comparison operator, and to use special NaN-aware comparison operators
when that is actually intended?
And what are you doing where it is acceptable to lose some precision
with those numbers, but not to give up and say things have gone badly
wrong (a NaN or infinity, or underflow signal)?
The usual alternative to denormals is not NaN or Infinity (of course
not), or a trap (I assume that's what you mean with "signal"), but 0.
I have a lot of
difficulty imagining a situation where denormals would be helpful and
you haven't got a major design issue with your code
The classical example is the assumption that a<b is equivalent to
a-b<0. It holds if denormals are implemented and fails on
flush-to-zero.
Basically, with denormals more of the usual assumptions hold.
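A sketch of that classical example, assuming an x86 machine where the SSE
flush-to-zero bit can be toggled through <xmmintrin.h>; with gradual
underflow both tests agree, with flush-to-zero they no longer do.

#include <stdio.h>
#include <float.h>
#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE (x86/SSE only) */

int main(void) {
    volatile double a = DBL_MIN;        /* 2^-1022, smallest normal */
    volatile double b = 1.5 * DBL_MIN;  /* a - b is the subnormal -2^-1023 */

    printf("gradual underflow: a<b=%d  a-b<0=%d\n", a < b, (a - b) < 0.0);

    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);  /* flush subnormal results to 0 */
    printf("flush-to-zero:     a<b=%d  a-b<0=%d\n", a < b, (a - b) < 0.0);
    return 0;
}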
perhaps
calculations should be re-arranged, algorithms changed, or you should be
using an arithmetic format with greater range (switch from single to
double, double to quad, or use something more advanced).
The first two require more knowledge about FP than many programmers
have, all just to avoid some hardware cost. Not a good idea in any
area where the software crisis* is relevant. The last increases the
resource usage much more than proper support for denormals.
* The Wikipedia article on the software crisis does not give a useful
definition for deciding whether there is a software crisis or not,
and it does not even mention the symptom that was mentioned first
when I learned about the software crisis (in 1986): The cost of
software exceeds the cost of hardware. So that's my decision
criterion: If the software cost is higher than the hardware cost,
the software crisis is relevant; and in the present context, it
means that expending hardware to reduce the cost of software is
justified. Denormal numbers are such a feature.
- anton
On 14/10/2025 09:51, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
I see the benefits of NaNs - sometimes you have bad data, and it can be
useful to have a representation for that. The defined "viral" nature of
NaNs means that you can write your code in a streamlined fashion,
knowing that if a NaN goes in, a NaN comes out - you don't have to have
checks and conditionals in the middle of your calculations.
Unfortunately, there are no NaFs (not a flag), and if there were, how
would an IF behave? As a consequence, a<b can have a different result
than !(a>=b) if a or b can be a NaN. That's quite contrary to what
programmers tend to expect. So NaNs have their pitfalls.
I entirely agree. If you have a type that has some kind of non-value,
and it might contain that representation, you have to take that into
account in your code. It's much the same thing as having a pointer that
could be a null pointer.
The usual alternative to denormals is not NaN or Infinity (of course
not), or a trap (I assume that's what you mean with "signal"), but 0.
Sure. My thoughts with NaN are that it might be appropriate for a
floating point model (not IEEE) to return a NaN in circumstances where
IEEE says the result is a denormal - I think that might have been a more
useful result.
And my mention of infinity is because often when people
have a very small value but are very keen on it not being zero, it is
because they intend to divide by it and want to avoid division by zero
(and thus infinity).
The classical example is the assumption that a<b is equivalent to
a-b<0. It holds if denormals are implemented and fails on
flush-to-zero.
Basically, with denormals more of the usual assumptions hold.
OK. (I like that aspect of signed integer overflow being UB - more of
your usual assumptions hold.)
However, if "a" or "b" could be a NaN or an infinity, does that
equivalence still hold?
Are you thinking of this equivalence as something the compiler would do
in optimisation, or something programmers would use when writing their code?
I fully agree on both these points. However, I can't help feeling that
if you are seeing denormals, you are unlikely to be getting results from
your code that are as accurate as you had expected - your calculations
are numerically unstable. Denormals might give you slightly more leeway
before everything falls apart, but only a tiny amount.
On 14/10/2025 04:27, Lawrence D’Oliveiro wrote:
On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:
On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D’Oliveiro wrote: >>>
The hardware designers took many years -- right through the 1990s, I>>>> think -- to be persuaded that IEEE754 really was worth implementing in
its entirety, that the “too hard†or “too obscure†parts
were there for
an important reason, to make programming that much easier, and should
not be skipped.
I disagree:: full compliance with IEEE 754-whenever is to make programs
more reliable (more numerically stable) and to give the programmer a
constant programming model (not easier).
As a programmer, I count all that under my definition of “easier”.
You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did
make it easier--but NaNs, infinities, Underflow at the Denorm level went
in the other direction.
NaNs and infinities allow you to propagate certain kinds of pathological
results right through to the end of the calculation, in a mathematically
consistent way.
Denormals -- aren’t they called “subnormals” now? -- are also
about making
things easier. Providing graceful underflow means a gradual loss of
precision as you get too close to zero, instead of losing all the bits at
once and going straight to zero. It’s about the principle of least
surprise.
Again, all that helps to make things easier for programmers --
particularly those of us whose expertise of numerics is not on a level
with Prof Kahan.
I see the benefits of NaNs - sometimes you have bad data, and it can be useful to have a representation for that. The defined "viral" nature of NaNs means that you can write your code in a streamlined fashion,
knowing that if a NaN goes in, a NaN comes out - you don't have to have checks and conditionals in the middle of your calculations.
But I find it harder to understand why denormals or subnormals are going
to be useful. Ultimately, your floating point code is approximating arithmetic on real numbers. Where are you getting your real numbers,
and what calculations are you doing on them, that mean you are getting
results that have such a dynamic range that you are using denormals? And
what are you doing where it is acceptable to lose some precision with
those numbers, but not to give up and say things have gone badly wrong
(a NaN or infinity, or underflow signal)? I have a lot of difficulty
imagining a situation where denormals would be helpful and you haven't
got a major design issue with your code - perhaps calculations should be
re-arranged, algorithms changed, or you should be using an arithmetic
format with greater range (switch from single to double, double to quad,
or use something more advanced).
David Brown <david.brown@hesbynett.no> writes:
I see the benefits of NaNs - sometimes you have bad data, and it can be
useful to have a representation for that. The defined "viral" nature of
NaNs means that you can write your code in a streamlined fashion,
knowing that if a NaN goes in, a NaN comes out - you don't have to have
checks and conditionals in the middle of your calculations.
Unfortunately, there are no NaFs (not a flag), and if there were, how
would an IF behave? As a consequence, a<b can have a different result
than !(a>=b) if a or b can be a NaN. That's quite contrary to what programmers tend to expect. So NaNs have their pitfalls.
David Brown <david.brown@hesbynett.no> writes:
On 14/10/2025 09:51, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
I see the benefits of NaNs - sometimes you have bad data, and it can be
useful to have a representation for that. The defined "viral" nature of
NaNs means that you can write your code in a streamlined fashion,
knowing that if a NaN goes in, a NaN comes out - you don't have to have
checks and conditionals in the middle of your calculations.
Unfortunately, there are no NaFs (not a flag), and if there were, how
would an IF behave? As a consequence, a<b can have a different result
than !(a>=b) if a or b can be a NaN. That's quite contrary to what
programmers tend to expect. So NaNs have their pitfalls.
I entirely agree. If you have a type that has some kind of non-value,
and it might contain that representation, you have to take that into
account in your code. It's much the same thing as having a pointer that
could be a null pointer.
Not really:
* Null pointers don't materialize spontaneously as results of
arithmetic operations. They are stored explicitly by the
programmer, making the programmer much more aware of their
existence.
* Programmers are trained to check for null pointers. And if they
forget such a check, the result usually is that the program traps,
usually soon after the place where the check should have been. With
a NaN you just silently execute the wrong branch of an IF, and later
you wonder what happened.
* The most common use for null pointers is terminating a linked list
or other recursive data structure. Programmers are trained to deal
with the terminating case in their code.
The usual alternative to denormals is not NaN or Infinity (of course
not), or a trap (I assume that's what you mean with "signal"), but 0.
Sure. My thoughts with NaN are that it might be appropriate for a
floating point model (not IEEE) to return a NaN in circumstances where
IEEE says the result is a denormal - I think that might have been a more
useful result.
When a denormal is generated, an underflow "exception" happens (IEEE "exceptions" are not traps). You can set your FPU to trap on a
certain kind of exception. Maybe you can also set it up such that it produces a NaN instead. I doubt that many people would find that
useful, however.
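In C terms (a sketch assuming the default, non-trapping floating-point
environment of C99 <fenv.h>): the underflow "exception" is just a sticky
flag the program can poll afterwards.

#include <stdio.h>
#include <float.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON

int main(void) {
    volatile double x = DBL_MIN;    /* smallest normal double */
    feclearexcept(FE_ALL_EXCEPT);
    volatile double y = x / 3.0;    /* tiny and inexact: result is subnormal */
    printf("y = %a\n", y);
    printf("FE_UNDERFLOW raised: %d\n", fetestexcept(FE_UNDERFLOW) != 0);
    return 0;                       /* execution just continues; nothing trapped */
}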
And my mention of infinity is because often when people
have a very small value but are very keen on it not being zero, it is
because they intend to divide by it and want to avoid division by zero
(and thus infinity).
Denormals don't help much here. IEEE doubles cannot represent 2^1024,
but denormals allow representing positive numbers down to 2^-1074.
So, with denormal numbers, the absolute value of your dividend must be
less than 2^-50 to produce a non-infinite result where flush-to-zero
would have produced an infinity.
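Putting rough numbers on that (a sketch assuming C11's DBL_TRUE_MIN;
2^-50 is just DBL_MAX times the smallest subnormal):

#include <stdio.h>
#include <float.h>

int main(void) {
    volatile double tiny = DBL_TRUE_MIN;   /* 2^-1074, smallest subnormal */
    /* dividing by the smallest subnormal stays finite only for dividends
       below roughly 2^-50; larger dividends overflow to infinity anyway */
    printf("%g\n", 0x1p-51 / tiny);        /* 2^1023, still finite */
    printf("%g\n", 1.0 / tiny);            /* inf */
    return 0;
}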
The classical example is the assumption that a<b is equivalent to
a-b<0. It holds if denormals are implemented and fails on
flush-to-zero.
Basically, with denormals more of the usual assumptions hold.
OK. (I like that aspect of signed integer overflow being UB - more of
your usual assumptions hold.)
Not mine. An assumption that I like is that the associative law
holds. It holds with -fwrapv, but not with overflow-is-undefined.
I fail to see how declaring any condition undefined behaviour would
increase any guarantees.
However, if "a" or "b" could be a NaN or an infinity, does that
equivalence still hold?
Yes.
If any of them is a NaN, the result is false for either comparison
(because a-b would be NaN, and because the result of any comparison
with a NaN is false).
For infinity there are a number of cases
1) inf<noninf (false) vs. inf-noninf=inf<0 (false)
2) -inf<noninf (true) vs. -inf-noninf=-inf<0 (true)
3) noninf<inf (true) vs. noninf-inf=-inf<0 (true)
4) noninf<-inf (false) vs. noninf--inf=inf<0 (false)
5) inf<inf (false) vs. inf-inf=NaN<0 (false)
6) -inf<-inf (false) vs. -inf--inf=NaN<0 (false)
7) inf<-inf (false) vs. inf--inf=inf<0 (false)
8) -inf<inf (true) vs. -inf-inf=-inf<0 (true)
The most interesting case here is 5), because it means that a<=b is
not equivalent to a-b<=0, even with denormal numbers.
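Case 5) is easy to check directly (a two-line sketch):

#include <math.h>
#include <stdio.h>

int main(void) {
    double a = INFINITY, b = INFINITY;
    printf("a <= b     : %d\n", a <= b);          /* 1 */
    printf("a - b <= 0 : %d\n", (a - b) <= 0.0);  /* 0, because inf - inf is NaN */
    return 0;
}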
Are you thinking of this equivalence as something the compiler would do
in optimisation, or something programmers would use when writing their code?
I was thinking about what programmers might use when writing their
code. For compilers, having that equivalence may occasionally be
helpful for producing better code, but if it does not hold, the
compiler will just not use such an equivalence (once the compiler is debugged).
This is an example from Kahan that stuck in my mind, because it
appeals to me as a programmer. He has also given other examples that
don't do that for me, but may appeal to a mathematician, physicist or chemist.
I fully agree on both these points. However, I can't help feeling that
if you are seeing denormals, you are unlikely to be getting results from
your code that are as accurate as you had expected - your calculations
are numerically unstable. Denormals might give you slightly more leeway
before everything falls apart, but only a tiny amount.
I think the nicer properties (such as the equivalence mentioned above)
is the more important benefit. And if you take a different branch of
an IF-statement if you have a flush-to-zero FPU, you can easily get a completely bogus result when the denormal case would still have had
enough accuracy by far.
On 10/13/2025 4:53 PM, MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
On 10/13/2025 2:39 AM, Anton Ertl wrote:
John Savard <quadibloc@invalid.invalid> writes:
After reading that article, I looked for more information on other
processors with poor arithmetic, and I found that the Intel i860 also had
a branch delay slot, as well as using traps to implement some portions of
the IEEE 754 standard... thus, presumably, being one of the architectures
to inspire the piece about bad architectures from Linus Torvalds recently
quoted here.
There never was a Linux port to the i860. There are lots of
architectures with Linux ports that have the properties that Linus
Torvalds mentions. Concerning implementing only a part of FP in
hardware, and throwing the rest over the wall to software, Alpha is
probably the best-known example (denormal support only in software),
and Linus Torvalds worked on it personally. Concerning exposing the
pipeline, MIPS-I has not just branch-delay slots, but also other
limitations. SPARC and HPPA have branch delay slots.
From what I can gather, the MIPS chip in the N64 also only did a
partial implementation in hardware, with optional software traps for the
rest.
Apparently it can be a problem because modern FPUs don't exactly
recreate N64 behavior, and a lot of the games ran without the traps, so
a lot of the N64 games suffer drift and other issues over time (as the
programmers had compensated for the MIPS issues in code rather than via
traps).
And this is why FP wants high quality implementation.
From what I gather, it was a combination of Binary32 with DAZ/FTZ and truncate rounding. Then, with emulators running instead on hardware with denormals and RNE.
But, the result was that the games would work correctly on the original hardware, but in the emulators things would drift; like things like
moving platforms gradually creeping away from the origin, etc.
Though, reading some stuff, implies a predecessor chip (the R4000) had a
more functionally complete FPU. So, I guess it is also possible that the
R4300 had a more limited FPU to make it cheaper for the embedded market.
Well, in any case, my recent efforts in these areas have been mostly:
Trying to hunt down some remaining bugs involving RVC in the CPU core;
RVC is seemingly "the gift that keeps on giving" in this area.
(The more dog-chewed the encoding, the harder it is to find bugs)
Going from just:
"Doing weak/crappy FP in hardware"
To:
"Trying to do less crappy FPU via software traps".
A "mostly traps only" implementation of Binary128.
Doesn't exactly match the 'Q' extension, but that is OK.
I sorta suspect not many people are going to implement Q either.
Do it right or don't do it at all.
?...
The traps route sorta worked OK in a lot of the MIPS era CPUs.
But, it will be opt-in via an FPSCR flag.
If the flag is not set, it will not trap.
Or, is the argument here that sticking with weaker not-quite IEEE FPU is preferable to using trap handlers.
For Binary128, real HW support is not likely to happen. The main reason
to consider trap-only Binary128 is more because it has less code
footprint than using runtime calls.
On 14/10/2025 04:27, Lawrence D’Oliveiro wrote:
On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:
On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D’Oliveiro wrote:
The hardware designers took many years -- right through the 1990s, I
think -- to be persuaded that IEEE754 really was worth implementing in
its entirety, that the “too hard” or “too obscure” parts were there for
an important reason, to make programming that much easier, and should
not be skipped.
I disagree:: full compliance with IEEE 754-whenever is to make programs
more reliable (more numerically stable) and to give the programmer a
constant programming model (not easier).
As a programmer, I count all that under my definition of “easier”.
You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did
make it easier--but NaNs, infinities, Underflow at the Denorm level went
in the other direction.
NaNs and infinities allow you to propagate certain kinds of pathological results right through to the end of the calculation, in a mathematically consistent way.
Denormals -- aren’t they called “subnormals” now? -- are also about making
things easier. Providing graceful underflow means a gradual loss of precision as you get too close to zero, instead of losing all the bits at once and going straight to zero. It’s about the principle of least surprise.
Again, all that helps to make things easier for programmers --
particularly those of us whose expertise of numerics is not on a level
with Prof Kahan.
I see the benefits of NaNs - sometimes you have bad data, and it can be useful to have a representation for that. The defined "viral" nature of NaNs means that you can write your code in a streamlined fashion,
knowing that if a NaN goes in, a NaN comes out - you don't have to have checks and conditionals in the middle of your calculations.
But I find it harder to understand why denormals or subnormals are going
to be useful.
Ultimately, your floating point code is approximating arithmetic on real numbers.
Where are you getting your real numbers,
and what calculations are you doing on them, that mean you are getting results that have such a dynamic range that you are using denormals?
And what are you doing where it is acceptable to lose some precision
with those numbers, but not to give up and say things have gone badly
wrong (a NaN or infinity, or underflow signal)? I have a lot of
difficulty imagining a situation where denormals would be helpful and
you haven't got a major design issue with your code - perhaps
calculations should be re-arranged, algorithms changed, or you should be using an arithmetic format with greater range (switch from single to
double, double to quad, or use something more advanced).
David Brown <david.brown@hesbynett.no> writes:
I see the benefits of NaNs - sometimes you have bad data, and it can be
useful to have a representation for that. The defined "viral" nature of
NaNs means that you can write your code in a streamlined fashion,
knowing that if a NaN goes in, a NaN comes out - you don't have to have
checks and conditionals in the middle of your calculations.
Unfortunately, there are no NaFs (not a flag), and if there were, how
would an IF behave? As a consequence, a<b can have a different result
than !(a>=b) if a or b can be a NaN. That's quite contrary to what programmers tend to expect. So NaNs have their pitfalls.
Would it be better to trap if a NaN is compared with an ordinary
comparison operator, and to use special NaN-aware comparison operators
when that is actually intended?
And what are you doing where it is acceptable to lose some precision
with those numbers, but not to give up and say things have gone badly
wrong (a NaN or infinity, or underflow signal)?
The usual alternative to denormals is not NaN or Infinity (of course
not), or a trap (I assume that's what you mean with "signal"), but 0.
I have a lot of
difficulty imagining a situation where denormals would be helpful and
you haven't got a major design issue with your code
The classical example is the assumption that a<b is equivalent to
a-b<0. It holds if denormals are implemented and fails on
flush-to-zero.
Basically, with denormals more of the usual assumptions hold.
perhaps
calculations should be re-arranged, algorithms changed, or you should be
using an arithmetic format with greater range (switch from single to
double, double to quad, or use something more advanced).
The first two require more knowledge about FP than many programmers
have,
all just to avoid some hardware cost. Not a good idea in any
area where the software crisis* is relevant.
David Brown <david.brown@hesbynett.no> posted:
Ultimately, your floating point code is approximating
arithmetic on real numbers.
Don't make me laugh.
The associative law holds fine with UB on overflow,
Well, I think that if your values are getting small enough to make
denormal results, your code is at least questionable.
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
[...]
Languages have not kept up with NaNs, needing IF-THEN-ELSE-NAN
semantics.
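Spelled out by hand with C99's unordered-aware comparison macros, the
missing three-way branch might look like this (a sketch, not a proposal
from the thread):

#include <math.h>
#include <stdio.h>

static void classify(double a, double b) {
    if (isless(a, b))              printf("THEN: a < b\n");
    else if (isgreaterequal(a, b)) printf("ELSE: a >= b\n");
    else                           printf("NAN: a and b are unordered\n");
}

int main(void) {
    classify(1.0, 2.0);   /* THEN */
    classify(2.0, 1.0);   /* ELSE */
    classify(NAN, 1.0);   /* NAN  */
    return 0;
}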
Would it be better to trap if a NaN is compared with an ordinary
comparison operator, and to use special NaN-aware comparison operators
when that is actually intended?
You are thinking that FCMP only decodes 6 states {==, !=, <, <=, >, >=}
The first two require more knowledge about FP than many programmers
have,
Don't allow THOSE programmers to program FP codes !!
Get ones that understand the nuances.
BGB <cr88192@gmail.com> posted:
On 10/13/2025 4:53 PM, MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
On 10/13/2025 2:39 AM, Anton Ertl wrote:
John Savard <quadibloc@invalid.invalid> writes:
After reading that article, I looked for more information on other
processors with poor arithmetic, and I found that the Intel i860 also had
a branch delay slot, as well as using traps to implement some portions of
the IEEE 754 standard... thus, presumably, being one of the architectures
to inspire the piece about bad architectures from Linus Torvalds recently
quoted here.
There never was a Linux port to the i860. There are lots of
architectures with Linux ports that have the properties that Linus
Torvalds mentions. Concerning implementing only a part of FP in
hardware, and throwing the rest over the wall to software, Alpha is
probably the best-known example (denormal support only in software),
and Linus Torvalds worked on it personally. Concerning exposing the
pipeline, MIPS-I has not just branch-delay slots, but also other
limitations. SPARC and HPPA have branch delay slots.
From what I can gather, the MIPS chip in the N64 also only did a
partial implementation in hardware, with optional software traps for the
rest.
Apparently it can be a problem because modern FPUs don't exactly
recreate N64 behavior, and a lot of the games ran without the traps, so
a lot of the N64 games suffer drift and other issues over time (as the
programmers had compensated for the MIPS issues in code rather than via
traps).
And this is why FP wants high quality implementation.
From what I gather, it was a combination of Binary32 with DAZ/FTZ and
truncate rounding. Then, with emulators running instead on hardware with
denormals and RNE.
In the above sentence I was talking about your FPU not getting
an infinitely correct result and then rounding to container size.
Not about the other "other" anomalies" many of which can be dealt
with in SW.
But, the result was that the games would work correctly on the original
hardware, but in the emulators things would drift; like things like
moving platforms gradually creeping away from the origin, etc.
Though, reading some stuff, implies a predecessor chip (the R4000) had a
more functionally complete FPU. So, I guess it is also possible that the
R4300 had a more limited FPU to make it cheaper for the embedded market.
Well, in any case, my recent efforts in these areas have been mostly:
Trying to hunt down some remaining bugs involving RVC in the CPU core;
RVC is seemingly "the gift that keeps on giving" in this area.
(The more dog-chewed the encoding, the harder it is to find bugs)
Going from just:
"Doing weak/crappy FP in hardware"
To:
"Trying to do less crappy FPU via software traps".
A "mostly traps only" implementation of Binary128.
Doesn't exactly match the 'Q' extension, but that is OK.
I sorta suspect not many people are going to implement Q either.
Do it right or don't do it at all.
?...
The traps route sorta worked OK in a lot of the MIPS era CPUs.
But, it will be opt-in via an FPSCR flag.
If the flag is not set, it will not trap.
But their combination of HW+SW gets the right answer.
Your multiply does not.
Or, is the argument here that sticking with weaker not-quite IEEE FPU is
preferable to using trap handlers.
The 5-bang instructions as used by HW+SW have to compute the result
to infinite precision and then round to container size.
The paper illustrates CRAY 1,... FP was fast but inaccurate enough
to fund an army of numerical analysts to see if the program was
delivering acceptable results.
IEEE 754 got rid of the army of Numerical Analysts.
But now, nobody remembers how bad it was/can be.
For Binary128, real HW support is not likely to happen. The main reason
to consider trap-only Binary128 is more because it has less code
footprint than using runtime calls.
Nobody is asking for that.
* The Wikipedia article on the software crisis does not give a useful
definition for deciding whether there is a software crisis or not,
and it does not even mention the symptom that was mentioned first
when I learned about the software crisis (in 1986): The cost of
software exceeds the cost of hardware.
Languages have not kept up with NaNs, needing IF-THEN-ELSE-NAN
semantics.
On Tue, 14 Oct 2025 07:51:09 GMT, Anton Ertl wrote:
* The Wikipedia article on the software crisis does not give a useful
definition for deciding whether there is a software crisis or not,
and it does not even mention the symptom that was mentioned first
when I learned about the software crisis (in 1986): The cost of
software exceeds the cost of hardware.
The "crisis" was supposed to do with the shortage of programs to write all >the programs that were needed to solve business and user needs.
By that definition, I don’t think the "crisis" exists any more. It went >away with the rise of very-high-level languages, from about the time of >those such as Tcl/Tk, Perl, Python, PHP and JavaScript.
Lawrence D’Oliveiro <ldo@nz.invalid> writes:
On Tue, 14 Oct 2025 07:51:09 GMT, Anton Ertl wrote:
* The Wikipedia article on the software crisis does not give a
useful definition for deciding whether there is a software crisis
or not, and it does not even mention the symptom that was
mentioned first when I learned about the software crisis (in
1986): The cost of software exceeds the cost of hardware.
The "crisis" was supposed to do with the shortage of programs to
write all the programs that were needed to solve business and user
needs.
I never heard that one. The software project failures, deadline
misses, and cost overruns, and their increasing number was a symptom
that is reflected in the Wikipedia article.
By that definition, I don’t think the "crisis" exists any more. It
went away with the rise of very-high-level languages, from about the
time of those such as Tcl/Tk, Perl, Python, PHP and JavaScript.
Better tools certainly help. One interesting aspect here is that all
the languages you mention are only dynamically typechecked. There has
been quite a bit of work on adding static typechecking to some of
these languages in the last decade or so, and the motivation given for
that is difficulties in large software projects using these languages.
In any case, even with these languages there are still software
projects that fail, miss their deadlines and have overrun their
budget; and to come back to the criterion I mentioned, where software
cost is higher than hardware cost.
Anyway, the relevance for comp.arch is how to evaluate certain
hardware features: If we have a way to make the programmers' jobs
easier at a certain hardware cost, when is it justified to add the
hardware cost? When it affects many programmers and especially if the difficulty that would otherwise be added is outside the expertise of
many of them. Let's look at some cases:
"Closing the semantic gap" by providing instructions like EDIT: Even
with assembly-language programmers, calling a subroutine is hardly
harder. With higher-level languages, such instructions buy nothing.
Denormal numbers: It affects lots of code that deals with FP, and
where many programmers are not well-educated (and even the educated
ones have a harder time when they have to work around their absence).
Hardware without Spectre (e.g., with invisible speculation): There are
two takes here:
1) If you consider Spectre to be a realistically exploitable
vulnerability, you need to protect at least the secret keys against
extraction with Spectre; then you either need such hardware, or you
need to use software mitigations agains all Spectre variants in all
software that runs in processes that have secret keys in their
address space; the latter would be a huge cost that easily
justifies the cost of adding invisible speculation to the hardware.
2) The other take is that Spectre is too hard to exploit to be a
realistic threat and that we do not need to eliminate it or
mitigate it. That's similar to the mainstream opinion on
cache-timing attacks on AES before Dan Bernstein demonstrated that
such attacks can be performed. Except that for Spectre we already
have demonstrations.
- anton
What demonstrations?
David Brown <david.brown@hesbynett.no> writes:
Well, I think that if your values are getting that small enough to make
denormal results, your code is at least questionable.
As Terje Mathiesen wrote, getting close to 0 is standard fare for approximation algorithms, such as Newton-Raphson iteration. Of course
you can terminate the loop while you are still far from the solution,
but that's not going to improve the accuracy of the results.
David Brown <david.brown@hesbynett.no> posted:
On 14/10/2025 04:27, Lawrence D’Oliveiro wrote:
On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:
On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D’Oliveiro wrote:
The hardware designers took many years -- right through the 1990s, I
think -- to be persuaded that IEEE754 really was worth implementing in
its entirety, that the “too hard” or “too obscure” parts were there for
an important reason, to make programming that much easier, and should
not be skipped.
I disagree:: full compliance with IEEE 754-whenever is to make programs
more reliable (more numerically stable) and to give the programmer a
constant programming model (not easier).
As a programmer, I count all that under my definition of “easier”.
You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did
make it easier--but NaNs, infinities, Underflow at the Denorm level went
in the other direction.
NaNs and infinities allow you to propagate certain kinds of pathological
results right through to the end of the calculation, in a mathematically
consistent way.
Denormals -- aren’t they called “subnormals” now? -- are also about making
things easier. Providing graceful underflow means a gradual loss of
precision as you get too close to zero, instead of losing all the bits at
once and going straight to zero. It’s about the principle of least
surprise.
Again, all that helps to make things easier for programmers --
particularly those of us whose expertise of numerics is not on a level
with Prof Kahan.
I see the benefits of NaNs - sometimes you have bad data, and it can be
useful to have a representation for that. The defined "viral" nature of
NaNs means that you can write your code in a streamlined fashion,
knowing that if a NaN goes in, a NaN comes out - you don't have to have
checks and conditionals in the middle of your calculations.
That was true under 754-2008 but we fixed it for 2019: All NaNs
MAX( x, NaN ) is x.
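For what it's worth, C's fmax follows the 754-2008 maxNum rule, which is
exactly the exception to "NaN in, NaN out" being discussed (a small
illustration):

#include <math.h>
#include <stdio.h>

int main(void) {
    double x = 2.0;
    printf("fmax(x, NAN) = %g\n", fmax(x, NAN));  /* 2: the NaN is treated as missing data */
    printf("x + NAN      = %g\n", x + NAN);       /* nan: ordinary arithmetic stays viral */
    return 0;
}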
On 14/10/2025 18:46, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
Well, I think that if your values are getting that small enough to make
denormal results, your code is at least questionable.
As Terje Mathiesen wrote, getting close to 0 is standard fare for
approximation algorithms, such as Newton-Raphson iteration. Of course
you can terminate the loop while you are still far from the solution,
but that's not going to improve the accuracy of the results.
Feel free to correct me if what I write below is wrong - you, Terje, and others here know a lot more about this stuff than I do.
When you write an expression like "x + y" with floating point, ignoring
NaNs and infinities, you can imagine the calculation being done by first getting the mathematical real values from x and y. Then - again in the mathematical real domain - the operation is carried out. Then the
result is truncated or rounded to fit back within the mantissa and
exponent format of the floating point type.
Double precision IEEE format has 53 bits of mantissa and 11 bits of exponent. For normal floating point values, that covers from 10 ^ -308
to 10 ^ +308, or 616 orders of magnitude. (For comparison, the size of
the universe measured in Planck lengths is only about 61 orders of magnitude.)
Denormals let you squeeze a bit more at the lower end here - another 16 orders of magnitude - at the cost of rapidly decreasing precision. They don't stop the inevitable approximation to zero, they just delay it a little.
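The limits being quoted are all available from <float.h> (DBL_TRUE_MIN is
C11); a trivial sketch:

#include <stdio.h>
#include <float.h>

int main(void) {
    printf("DBL_MAX      = %g\n", DBL_MAX);       /* ~1.8e308 */
    printf("DBL_MIN      = %g\n", DBL_MIN);       /* ~2.2e-308, smallest normal */
    printf("DBL_TRUE_MIN = %g\n", DBL_TRUE_MIN);  /* ~4.9e-324, smallest subnormal */
    printf("DBL_EPSILON  = %g\n", DBL_EPSILON);   /* ~2.2e-16, i.e. 53-bit precision */
    return 0;
}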
I am still at a loss to understand how this is going to be useful - when will that small extra margin near zero actually make a difference, in
the real world, with real values? When you are using your
Newton-Raphson iteration to find your function's zeros, what are the circumstances in which you can get a more useful end result if you
continue to 10 ^ -324 instead of treating 10 ^ -308 as zero - especially when these smaller numbers have lower precision?
On 14/10/2025 18:46, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
Well, I think that if your values are getting that small enough to
make denormal results, your code is at least questionable.
As Terje Mathiesen wrote, getting close to 0 is standard fare for approximation algorithms, such as Newton-Raphson iteration. Of
course you can terminate the loop while you are still far from the solution, but that's not going to improve the accuracy of the
results.
Feel free to correct me if what I write below is wrong - you, Terje,
and others here know a lot more about this stuff than I do.
When you write an expression like "x + y" with floating point,
ignoring NaNs and infinities, you can imagine the calculation being
done by first getting the mathematical real values from x and y.
Then - again in the mathematical real domain - the operation is
carried out. Then the result is truncated or rounded to fit back
within the mantissa and exponent format of the floating point type.
Double precision IEEE format has 53 bits of mantissa and 11 bits of exponent. For normal floating point values, that covers from 10 ^
-308 to 10 ^ +308, or 616 orders of magnitude. (For comparison, the
size of the universe measured in Planck lengths is only about 61
orders of magnitude.)
Denormals let you squeeze a bit more at the lower end here - another
16 orders of magnitude - at the cost of rapidly decreasing precision.
They don't stop the inevitable approximation to zero, they just
delay it a little.
I am still at a loss to understand how this is going to be useful -
when will that small extra margin near zero actually make a
difference, in the real world, with real values? When you are using
your Newton-Raphson iteration to find your function's zeros, what are
the circumstances in which you can get a more useful end result if
you continue to 10 ^ -324 instead of treating 10 ^ -308 as zero -
especially when these smaller numbers have lower precision?
I realise there are plenty of numerical calculations in which errors
"build up", such as simulating non-linear systems over time, and
there you are looking to get as high an accuracy as you can in the intermediary steps so that you can continue for longer. But even
there, denormals are not going to give you more than a tiny amount
extra.
(There are, of course, mathematical problems which deal with values
or precisions far outside anything of relevance to the physical
world, but if you are dealing with those kinds of tasks then IEEE
floating point is not going to do the job anyway.)
David Brown wrote:
On 14/10/2025 18:46, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
Well, I think that if your values are getting that small enough
to make denormal results, your code is at least questionable.
As Terje Mathiesen wrote, getting close to 0 is standard fare for
approximation algorithms, such as Newton-Raphson iteration. Of
course you can terminate the loop while you are still far from the
solution, but that's not going to improve the accuracy of the
results.
Feel free to correct me if what I write below is wrong - you,
Terje, and others here know a lot more about this stuff than I do.
When you write an expression like "x + y" with floating point,
ignoring NaNs and infinities, you can imagine the calculation being
done by first getting the mathematical real values from x and y.
Then - again in the mathematical real domain - the operation is
carried out. Then the result is truncated or rounded to fit back
within the mantissa and exponent format of the floating point type.
Double precision IEEE format has 53 bits of mantissa and 11 bits of exponent. For normal floating point values, that covers from 10 ^
-308 to 10 ^ +308, or 616 orders of magnitude. (For comparison,
the size of the universe measured in Planck lengths is only about
61 orders of magnitude.)
Denormals let you squeeze a bit more at the lower end here -
another 16 orders of magnitude - at the cost of rapidly decreasing precision. They don't stop the inevitable approximation to zero,
they just delay it a little.
I am still at a loss to understand how this is going to be useful -
when will that small extra margin near zero actually make a
difference, in the real world, with real values? When you are
using your Newton-Raphson iteration to find your function's zeros,
what are the circumstances in which you can get a more useful end
result if you continue to 10 ^ -324 instead of treating 10 ^ -308
as zero - especially when these smaller numbers have lower
precision?
Please note that I have NOT personally observed this, but I have been
told from people I trust (on the 754 working group) that at least
some zero-seeking algorithms will stabilize on an exact value, if and
only if you have subnormals, otherwise it is possible to wobble back
& forth between two neighboring results.
I.e. they differ by exactly one ulp.
As I noted, I have not been bitten by this particular issue, one of
the reaons being that I tend to not write infinite loops inside
functions, instead I'll pre-calculate how many (typically NR)
iterations should be needed.
Terje
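One way to see why gradual underflow matters for that kind of stabilization: with subnormals, any two distinct doubles have a representable, non-zero difference, so x > y really does imply x - y > 0; with flush-to-zero, the difference of two very small neighbours can come out as exactly 0. A minimal C sketch (values chosen purely for illustration, default rounding assumed):

#include <stdio.h>
#include <math.h>
#include <float.h>

int main(void)
{
    double x = 1.5 * DBL_MIN;   /* 1.5 times the smallest normal double */
    double y = DBL_MIN;
    double d = x - y;           /* exactly 0.5 * DBL_MIN: a subnormal   */

    printf("x > y     : %d\n", x > y);     /* 1 */
    printf("x - y > 0 : %d\n", d > 0.0);   /* 1 with subnormals; 0 if flushed to zero */
    printf("x - y is subnormal: %d\n", fpclassify(d) == FP_SUBNORMAL);
    return 0;
}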
On Wed, 15 Oct 2025 12:36:17 +0200
David Brown <david.brown@hesbynett.no> wrote:
On 14/10/2025 18:46, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
Well, I think that if your values are getting that small enough to
make denormal results, your code is at least questionable.
As Terje Mathiesen wrote, getting close to 0 is standard fare for
approximation algorithms, such as Newton-Raphson iteration. Of
course you can terminate the loop while you are still far from the
solution, but that's not going to improve the accuracy of the
results.
Feel free to correct me if what I write below is wrong - you, Terje,
and others here know a lot more about this stuff than I do.
When you write an expression like "x + y" with floating point,
ignoring NaNs and infinities, you can imagine the calculation being
done by first getting the mathematical real values from x and y.
Then - again in the mathematical real domain - the operation is
carried out. Then the result is truncated or rounded to fit back
within the mantissa and exponent format of the floating point type.
Double precision IEEE format has 53 bits of mantissa and 11 bits of exponent. For normal floating point values, that covers from about 10^-308 to 10^+308, or roughly 616 orders of magnitude. (For comparison, the
size of the universe measured in Planck lengths is only about 61
orders of magnitude.)
Denormals let you squeeze a bit more at the lower end here - another
16 orders of magnitude - at the cost of rapidly decreasing precision.
They don't stop the inevitable approximation to zero, they just
delay it a little.
I am still at a loss to understand how this is going to be useful -
when will that small extra margin near zero actually make a
difference, in the real world, with real values? When you are using
your Newton-Raphson iteration to find your function's zeros, what are
the circumstances in which you can get a more useful end result if
you continue to 10 ^ -324 instead of treating 10 ^ -308 as zero -
especially when these smaller numbers have lower precision?
I realise there are plenty of numerical calculations in which errors
"build up", such as simulating non-linear systems over time, and
there you are looking to get as high an accuracy as you can in the
intermediary steps so that you can continue for longer. But even
there, denormals are not going to give you more than a tiny amount
extra.
(There are, of course, mathematical problems which deal with values
or precisions far outside anything of relevance to the physical
world, but if you are dealing with those kinds of tasks then IEEE
floating point is not going to do the job anyway.)
I don't think that I agree with Anton's point, at least as formulated.
Yes, subnormals improve precision of Newton-Raphson and such*, but only
when the numbers involved in calculations are below 2**-971, which does
not happen very often. What is more important is that *when* it happens, naively written implementations of such algorithms still converge. Without subnormals (or without expert provisions) there is a big chance that they would not converge at all. That happens mostly because IEEE-754 preserves the following intuitive invariant:
when x > y, then x - y > 0.
Without subnormals, e.g. with VAX float formats that are otherwise pretty good, this invariant does not hold.
* - I personally prefer to illustrate it with the chord-and-tangent root-finding algorithm, which can be used for any type of function as long as you have proved that on the section of interest there is no change of sign of its first and second derivatives. Maybe because I was taught this algorithm at the age of 15. This algo could be called half-Newton.
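For readers who have not met it, a minimal C sketch of that chord-and-tangent ("half-Newton") scheme. It assumes, for illustration, that f is increasing and convex on [lo, hi] with f(lo) < 0 < f(hi); then the chord step approaches the root from below and the tangent step from above, so the bracket shrinks from both sides (for other sign combinations of f' and f'' the roles of the two ends swap). The fixed iteration count and names are purely illustrative.

#include <math.h>

/* Chord-and-tangent ("half-Newton") sketch.  Assumes f(lo) < 0 < f(hi)
   and that f' > 0 and f'' > 0 on [lo, hi].                            */
static double chord_tangent(double (*f)(double), double (*fp)(double),
                            double lo, double hi, int iters)
{
    for (int i = 0; i < iters; i++) {
        double fl = f(lo), fh = f(hi);
        lo = lo - fl * (hi - lo) / (fh - fl);   /* chord (secant) step   */
        hi = hi - fh / fp(hi);                  /* tangent (Newton) step */
    }
    return 0.5 * (lo + hi);
}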
On Wed, 15 Oct 2025 05:55:40 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Lawrence D’Oliveiro <ldo@nz.invalid> writes:
On Tue, 14 Oct 2025 07:51:09 GMT, Anton Ertl wrote:
* The Wikipedia article on the software crisis does not give a useful definition for deciding whether there is a software crisis or not, and it does not even mention the symptom that was mentioned first when I learned about the software crisis (in 1986): The cost of software exceeds the cost of hardware.
The "crisis" was supposed to do with the shortage of programs to write all the programs that were needed to solve business and user needs.
I never heard that one. The software project failures, deadline misses, and cost overruns, and their increasing number was a symptom that is reflected in the Wikipedia article.
By that definition, I don’t think the "crisis" exists any more. It went away with the rise of very-high-level languages, from about the time of those such as Tcl/Tk, Perl, Python, PHP and JavaScript.
Better tools certainly help. One interesting aspect here is that all the languages you mention are only dynamically typechecked. There has been quite a bit of work on adding static typechecking to some of these languages in the last decade or so, and the motivation given for that is difficulties in large software projects using these languages.
In any case, even with these languages there are still software projects that fail, miss their deadlines and have overrun their budget; and to come back to the criterion I mentioned, where software cost is higher than hardware cost.
Anyway, the relevance for comp.arch is how to evaluate certain
hardware features: If we have a way to make the programmers' jobs
easier at a certain hardware cost, when is it justified to add the
hardware cost? When it affects many programmers and especially if the
difficulty that would otherwise be added is outside the expertise of
many of them. Let's look at some cases:
"Closing the semantic gap" by providing instructions like EDIT: Even
with assembly-language programmers, calling a subroutine is hardly
harder. With higher-level languages, such instructions buy nothing.
Denormal numbers: It affects lots of code that deals with FP, and
where many programmers are not well-educated (and even the educated
ones have a harder time when they have to work around their absence).
Hardware without Spectre (e.g., with invisible speculation): There are
two takes here:
1) If you consider Spectre to be a realistically exploitable
vulnerability, you need to protect at least the secret keys against
extraction with Spectre; then you either need such hardware, or you
need to use software mitigations against all Spectre variants in all
software that runs in processes that have secret keys in their
address space; the latter would be a huge cost that easily
justifies the cost of adding invisible speculation to the hardware.
2) The other take is that Spectre is too hard to exploit to be a realistic threat and that we do not need to eliminate it or mitigate it. That's similar to the mainstream opinion on cache-timing attacks on AES before Dan Bernstein demonstrated that such attacks can be performed. Except that for Spectre we already have demonstrations.
- anton
What demonstrations?
The demonstration that I would consider realistic should be from JS running on a browser released after 2018-01-28.
I'm of the strong opinion that at least Spectre Variant 1 (Bound Check Bypass) should not be mitigated in hardware.
W.r.t. Variant 2 (Branch Target Injection) I am less categorical. That is, I rather prefer it not mitigated on the hardware that I use, because I am sure that in no situation is it a realistic threat for me. However, it is harder to prove that it is not a realistic threat to anybody. And since the HW mitigation has a smaller performance impact than that of Variant 1, if some CPU vendors decide to mitigate Variant 2, I would not call them spineless idiots because of it. I'd call them "slick businessmen", which in my book is less derogatory.
I had an idea on how to eliminate Bound Check Bypass.
I intend to have range-check-and-fault instructions like
CHKLTU value_Rs1, limit_Rs2
value_Rs1, #limit_imm
throws an overflow fault exception if value register >= unsigned limit.
(The unsigned >= check also catches negative signed integer values).
It can be used to check an array index before use in a LD/ST, e.g.
CHKLTU index_Rs, limit_Rs
LD Rd, [base_Rs, index_Rs*scale]
The problem is that there is no guarantee that an OoO cpu will execute
the CHKLTU instruction before using the index register in the LD/ST.
My idea is for the CHKcc instruction to copy the test value to a dest register when the check is successful. This makes the dest value register write-dependent on successfully passing the range check,
and blocks the subsequent LD from using the index until validated.
CHKLTU index_R2, index_R1, limit_R3
LD R4, [base_R5, index_R2*scale]
Because there is no branch, there is no way to speculate around the check (but load value speculation could negate this fix).
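A software-only analogue of the same trick (this is not the proposed CHK instruction, just the general idea of turning a control dependency into a data dependency; it is loosely modelled on the mask-based index clamping of Linux's array_index_nospec, with illustrative names):

#include <stddef.h>
#include <stdint.h>

/* Make the index used by the load a data-flow result of the bounds
   check, so a mispredicted branch cannot let the load run ahead with
   an arbitrary out-of-range index.                                    */
static inline size_t clamp_index(size_t idx, size_t size)
{
    size_t ok   = (size_t)(idx < size);   /* 1 if in range, else 0              */
    size_t mask = (size_t)0 - ok;         /* all-ones or all-zeros              */
    return idx & mask;                    /* out-of-range speculates to index 0 */
}

int32_t load_checked(const int32_t *table, size_t size, size_t idx)
{
    if (idx >= size)
        return -1;                        /* architectural bounds check */
    return table[clamp_index(idx, size)]; /* index depends on the check */
}

Whether the compiler keeps that comparison branch-free is of course not guaranteed by the C standard, which is why the kernel version is written far more defensively.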
Most people would say:: "When it adds performance" AND the compiler can
use it. Some would add: "from unmodified source code"; but I am a little wishy-washy on the last clause.
I might note that SIMD obeys none of the 3 conditions.
On Wed, 15 Oct 2025 21:09:27 GMT, MitchAlsup wrote:
Most people would say:: "When it adds performance" AND the compiler can
use it. Some would add: "from unmodified source code"; but I am a little wishy-washy on the last clause.
I might note that SIMD obeys none of the 3 conditions.
I believe GCC can do auto-vectorization in some situations.
But the RISC-V folks still think Cray-style long vectors are better than SIMD, if only because it preserves the “R” in “RISC”.
On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:
On Wed, 15 Oct 2025 03:45:31 -0000 (UTC), Lawrence D’Oliveiro wrote:
The "crisis" was supposed to do with the shortage of programs to write
all the programs that were needed to solve business and user needs.
By that definition, I don’t think the "crisis" exists any more. It went >> away with the rise of very-high-level languages, from about the time of
those such as Tcl/Tk, Perl, Python, PHP and JavaScript.
Better tools certainly help. One interesting aspect here is that all
the languages you mention are only dynamically typechecked.
Correct. That does seem to be a key part of what “very-high-level” means.
There has been quite a bit of work on adding static typechecking to some
of these languages in the last decade or so, and the motivation given
for that is difficulties in large software projects using these
languages.
What we’re seeing here is a downward creep, as those very-high-level languages (Python and JavaScript, particularly) are encroaching into the territory of the lower levels. Clearly they must still have some
advantages over those languages that already inhabit the lower levels, otherwise we might as well use the latter.
In any case, even with these languages there are still software projects that fail, miss their deadlines and have overrun their budget ...
I’m not aware of such; feel free to give an example of some large Python project, for example, which has exceeded its time and/or budget. The key point about using such a very-high-level language is you can do a lot in just a few lines of code.
But the RISC-V folks still think Cray-style long vectors are better
than SIMD, if only because it preserves the “R” in “RISC”.
The R in RISC-V comes from "student _R_esearch".
Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
vice versa)--they simply represent different ways of shooting yourself
in the foot.
No ISA with more than 200 instructions deserves the RISC mantra.
On Wed, 15 Oct 2025 21:42:32 -0000 (UTC), Lawrence D’Oliveiro wrote:
What we’re seeing here is a downward creep, as those very-high-level
languages (Python and JavaScript, particularly) are encroaching into
the territory of the lower levels. Clearly they must still have some
advantages over those languages that already inhabit the lower levels,
otherwise we might as well use the latter.
There is a pernicious trap:: once an application written in a VHLL is acclaimed by the masses--it instantly falls into the trap where "users
want more performance":: something the VHLL cannot provide until they.........
45 years ago it was LISP, you wrote the application in LISP to figure
out the required algorithms and once you got it working, you rewrote it
in a high-performance language (FORTRAN or C) so it was usably fast.
On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:
But the RISC-V folks still think Cray-style long vectors are better
than SIMD, if only because it preserves the “R” in “RISC”.
The R in RISC-V comes from "student _R_esearch".
“Reduced Instruction Set Computing”. That was what every single primer on the subject said, right from the 1980s onwards.
Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
vice versa)--they simply represent different ways of shooting yourself
in the foot.
The primary design criterion, as I understood it, was to avoid filling up
the instruction opcode space with a combinatorial explosion. (Or sequence
of combinatorial explosions, when you look at the wave after wave of SIMD extensions in x86 and elsewhere.)
Also there might be some pipeline benefits in having longer vector
operands ... I’ll bow to your opinion on that.
No ISA with more than 200 instructions deserves the RISC mantra.
There you go ... agreeing with me about what the “R” stands for.
Terje Mathisen <terje.mathisen@tmsw.no> posted:
----------------------
Please note that I have NOT personally observed this, but I have been
told from people I trust (on the 754 working group) that at least some
zero-seeking algorithms will stabilize on an exact value, if and only if
you have subnormals, otherwise it is possible to wobble back & forth
between two neighboring results.
I know of several Newton-Raphson-iterations that converge faster and
more accurately using reciprocal-SQRT() than the equivalent algorithm
using SQRT() directly in NR-iteration.
I.e. they differ by exactly one ulp.
In my cases, the RSQRT() was 1 or 2 iterations faster and 2 ULP more accurate. I don't know of a case oscillating at 1 ULP due to arithmetic anomalies.
EricP <ThatWouldBeTelling@thevillage.com> posted:
---------------------------
What demonstrations?
The demonstration that I would consider realistic should be from JS running on a browser released after 2018-01-28.
I'm of the strong opinion that at least Spectre Variant 1 (Bound Check Bypass) should not be mitigated in hardware.
W.r.t. Variant 2 (Branch Target Injection) I am less categorical. That is, I rather prefer it not mitigated on the hardware that I use, because I am sure that in no situation is it a realistic threat for me. However, it is harder to prove that it is not a realistic threat to anybody. And since the HW mitigation has a smaller performance impact than that of Variant 1, if some CPU vendors decide to mitigate Variant 2, I would not call them spineless idiots because of it. I'd call them "slick businessmen", which in my book is less derogatory.
I had an idea on how to eliminate Bound Check Bypass.
I intend to have range-check-and-fault instructions like
CHKLTU value_Rs1, limit_Rs2
value_Rs1, #limit_imm
throws an overflow fault exception if value register >= unsigned limit.
(The unsigned >= check also catches negative signed integer values).
It can be used to check an array index before use in a LD/ST, e.g.
CHKLTU index_Rs, limit_Rs
LD Rd, [base_Rs, index_Rs*scale]
The problem is that there is no guarantee that an OoO cpu will execute
the CHKLTU instruction before using the index register in the LD/ST.
Yes, order in OoO is sanity-impairing.
But, what you do know is that CHKx will be performed before LD can
retire. _AND_ if your µA does not update µA state prior to retire,
you can be as OoO as you like and still not be Spectré sensitive.
One of the things recently put into My 66000 is that AGEN detects
overflow and raises PageFault.
My idea is for the CHKcc instruction to copy the test value to a dest
register when the check is successful. This makes the dest value register
write-dependent on successfully passing the range check,
and blocks the subsequent LD from using the index until validated.
CHKLTU index_R2, index_R1, limit_R3
LD R4, [base_R5, index_R2*scale]
If you follow my rule above this is unnecessary, but it may be less
painful than holding back state update until retire.
Because there is no branch, there is no way to speculate around the check
(but load value speculation could negate this fix).
x86 has cases (like Shift by 0) where HW predicts that CFLAGs are set
and µfaults when shift count == 0 and prevents setting of CFLAGS.
You "COULD" do something similar at µA level.
Michael S <already5chosen@yahoo.com> writes:
The demonstration that I would consider realistic should be from JS
running on browser released after 2018-01-28.
You apparently only consider attacks through the browser as relevant. Netspectre demonstrates a completely remote attack, i.e., without a
browser.
As for the browsers, AFAIK they tried to make Spectre leak less by
making the clock less precise. That does not stop Spectre, it only
makes data extraction using the clock slower. Moreover, there are
ways to work around that by running a timing loop, i.e., instead of
the clock you use the current count of the counted loop.
I'm of strong opinion that at least Spectre Variant 1 (Bound Check
Bypass) should not be mitigated in hardware.
What do you mean with "mitigated in hardware"? The answers to
hardware vulnerabilities are to either fix the hardware (for Spectre "invisible speculation" looks the most promising to me), or to leave
the hardware vulnerable and mitigate the vulnerability in software
(possibly supported by hardware or firmware changes that do not fix
the vulnerability).
So do you not want it to be fixed in hardware, or not mitigated in
software? As long as the hardware is not fixed, you may not have a
choice in the latter, unless you use an OS you write yourself. AFAIK
you can disable the software mitigations in the Linux kernel, but the development cost of these mitigations still has to be paid, and any
slowdowns that result from organizing the code such that enabling the mitigations is possible will still be there even with the mitigations disabled.
So if you are against hardware fixes, you will pay for software
mitigations, in development cost (possibly indirectly) and in
performance.
More info on the topic:
Fix Spectre in Hardware! Why and How https://repositum.tuwien.at/bitstream/20.500.12708/210758/1/Ertl-2025-Fix%20Spectre%20in%20Hardware%21%20Why%20and%20How-smur.pdf
- anton
On 16/10/2025 07:44, Lawrence D’Oliveiro wrote:
On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:
But the RISC-V folks still think Cray-style long vectors are better
than SIMD, if only because it preserves the “R” in “RISC”.
The R in RISC-V comes from "student _R_esearch".
“Reduced Instruction Set Computing”. That was what every single primer on
the subject said, right from the 1980s onwards.
Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
vice versa)--they simply represent different ways of shooting yourself
in the foot.
The primary design criterion, as I understood it, was to avoid filling up
the instruction opcode space with a combinatorial explosion. (Or sequence
of combinatorial explosions, when you look at the wave after wave of SIMD
extensions in x86 and elsewhere.)
I believe another aim is to have the same instructions work on different hardware. With SIMD, you need different code if your processor can add
4 ints at a time, or 8 ints, or 16 ints - it's all different
instructions using different SIMD registers. With the vector style instructions in RISC-V, the actual SIMD registers and implementation are
not exposed to the ISA and you have the same code no matter how wide the actual execution units are. I have no experience with this (or much experience with SIMD), but that seems like a big win to my mind. It is akin to letting the processor hardware handle multiple instructions in parallel in superscalar CPUs, rather than Itanium EPIC coding.
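To make the contrast concrete (my sketch, not anything from the posts above): the first routine is width-agnostic and a compiler may auto-vectorize it to whatever the target offers, including a vector-length-agnostic RVV loop, while the second is hard-wired to 4 x int32 per step and has to be rewritten for every wider SIMD generation.

#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 */

/* Width-agnostic: the same source for any vector width the compiler targets. */
void add_generic(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* Width-specific: fixed at 4 x int32 (SSE2); AVX2 or AVX-512 versions would
   be different code using different registers and intrinsics.                */
void add_sse2(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi32(va, vb));
    }
    for (; i < n; i++)   /* scalar tail */
        dst[i] = a[i] + b[i];
}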
Also there might be some pipeline benefits in having longer vector
operands ... I’ll bow to your opinion on that.
No ISA with more than 200 instructions deserves the RISC mantra.
There you go ... agreeing with me about what the “R” stands for.
I have always thought that RISC parsed as (RI)SC rather than R(IS)C.
That is, the instructions themselves should be simple (the aim was one instruction, one clock cycle) rather than that there necessarily should
be fewer instructions.
MitchAlsup wrote:
Yes, order in OoO is sanity-impairing.
But, what you do know is that CHKx will be performed before LD can
retire. _AND_ if your µA does not update µA state prior to retire,
you can be as OoO as you like and still not be Spectré sensitive.
One of the things recently put into My 66000 is that AGEN detects
overflow and raises PageFault.
My idea is for the CHKcc instruction to copy the test value to a dest
register when the check is successful. This makes the dest value register write-dependent on successfully passing the range check,
and blocks the subsequent LD from using the index until validated.
CHKLTU index_R2, index_R1, limit_R3
LD R4, [base_R5, index_R2*scale]
If you follow my rule above this is unnecessary, but it may be less
painful than holding back state update until retire.
My idea is the same as a SUB instruction with overflow detect,
which I would already have. I like cheap solutions.
But the core idea here, to eliminate a control flow race condition by changing it to a data flow dependency, may be applicable in other areas.
Because there is no branch, there is no way to speculate around the check (but load value speculation could negate this fix).
On second thought, no, load value speculation would not negate this fix.
x86 has cases (like Shift by 0) where HW predicts that CFLAGs are set
and µfaults when shift count == 0 and prevents setting of CFLAGS.
You "COULD" do something similar at µA level.
I'd prefer not to step in that cow pie to begin with.
Then I won't have to spend time cleaning my shoes afterwards.
MitchAlsup wrote:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
----------------------
Please note that I have NOT personally observed this, but I have been
told from people I trust (on the 754 working group) that at least some
zero-seeking algorithms will stabilize on an exact value, if and only if >> you have subnormals, otherwise it is possible to wobble back & forth
between two neighboring results.
I know of several Newton-Raphson-iterations that converge faster and
more accurately using reciprocal-SQRT() than the equivalent algorithm
using SQRT() directly in NR-iteration.
I.e. they differ by exactly one ulp.
In my cases, the RSQRT() was 1 or 2 iterations faster and 2 ULP more accurate. I don't know of a case oscillating at 1 ULP due to arithmetic anomalies.
Interesting! I have also found rsqrt() to be a very good building block,
to the point where if I can only have one helper function (approximate lookup to start the NR), it would be rsqrt, and I would use it for all
of sqrt, fdiv and rsqrt.
Terje
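A hedged sketch of that "rsqrt as the one helper" idea (my own illustration, not Terje's or Mitch's code): a deliberately weak exponent-only seed, a fixed NR step count chosen up front rather than a convergence test in the loop, and sqrt/reciprocal/divide all derived from it. It is not correctly rounded; real code would use a better seed (small table or a hardware estimate instruction) and far fewer steps.

#include <math.h>

/* One Newton-Raphson step toward r = 1/sqrt(a):  r' = r*(1.5 - 0.5*a*r*r). */
static double rsqrt_step(double a, double r)
{
    return r * (1.5 - 0.5 * a * r * r);
}

/* Crude seed within a factor of ~sqrt(2) of 1/sqrt(a), from the exponent only. */
static double rsqrt_seed(double a)
{
    int e;
    (void)frexp(a, &e);            /* a = m * 2^e, m in [0.5, 1) */
    return ldexp(1.0, -e / 2);
}

/* Fixed iteration count chosen up front; 8 steps are enough even for the
   weak seed above.  Assumes a is finite and > 0.                          */
static double my_rsqrt(double a)
{
    double r = rsqrt_seed(a);
    for (int i = 0; i < 8; i++)
        r = rsqrt_step(a, r);
    return r;
}

/* With rsqrt as the single helper, the others follow: */
static double my_sqrt(double a)          { return a * my_rsqrt(a); }
static double my_recip(double b)         { double r = my_rsqrt(b); return r * r; }  /* 1/b, b > 0 */
static double my_div(double x, double b) { return x * my_recip(b); }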
Lawrence D’Oliveiro <ldo@nz.invalid> posted:
On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:
In any case, even with these languages there are still software projects that fail, miss their deadlines and have overrun their budget ...
A lot of these projects were unnecessary. Once someone figured out how to make the (17 kinds of) hammers one needs, there is little need to make a new hammer architecture.
Windows could have stopped at W7, and many MANY people would have been happier... The mouse was more precise in W7 than in W8 ... With a little upgrade for new PCIe architecture along the way rather than redesigning the whole kit and caboodle for tablets and phones, which did not work BTW...
Office application work COULD have STOPPED in 2003, eXcel in 1998, ... and few people would have cared. Many SW projects are driven not by demand for the product, but pushed by companies to make already satisfied users have to upgrade.
Those programmers could have transitioned to new SW projects rather than redesigning the same old thing 8 more times. Presto, there are now enough well-trained SW engineers to tackle the undone SW backlog.
But what gets me is the continual disconnect from actual vector
calculations in source code--causing the compilers to have to solve many memory aliasing issues to use the vector ISA.
If even Intel can't make their crap work well, I am skeptical.
Looking at
The Case for the Reduced Instruction Set Computer, 1980, David Patterson https://dl.acm.org/doi/pdf/10.1145/641914.641917
he never says what defines RISC, just what improved results this *design approach* should achieve.
Also, the V extension doesn't even fit entirely in the opcode, it
depends on additional state held in CSRs.
Most use-cases for longer vectors tend to be matrix-like rather than vector-like. Or, what cases that would appear suited to an 8-element vector are often achieved sufficiently with two vectors.
There are some weaknesses, for example, I mostly ended up dealing with
RGB math by simply repeating the 8-bit values twice within a 16-bit
spot.
BSR and JSR had been modified to allow arbitrary link register, but it
may make sense to reverse this; as Rd other than X0 and X1 is seemingly pretty much never used in practice (so not really worth the logic cost).
On Thu, 16 Oct 2025 16:26:27 -0500, BGB wrote:
If even Intel can't make their crap work well, I am skeptical.
The only CISC architecture to survive the (otherwise universal) transition to RISC was kept afloat through high revenues and high margins, which allowed the company to spend the much higher sums needed to add all the extra millions of transistors necessary to keep performance competitive.
On Thu, 16 Oct 2025 10:24:37 -0400, EricP wrote:
Looking at
The Case for the Reduced Instruction Set Computer, 1980, David
Patterson https://dl.acm.org/doi/pdf/10.1145/641914.641917
he never says what defines RISC, just what improved results this
*design approach* should achieve.
From the beginning, I felt that the much-trumpeted reduction in
instruction set complexity never quite matched up with reality. So I
thought a better name would be “IRSC”, as in “Increased Register Set Computer” -- because the one feature that really did become common
was the larger register sets.
On 10/16/2025 2:04 AM, David Brown wrote:
I believe another aim is to have the same instructions work on
different hardware. With SIMD, you need different code if your
processor can add 4 ints at a time, or 8 ints, or 16 ints - it's all
different instructions using different SIMD registers. With the
vector style instructions in RISC-V, the actual SIMD registers and
implementation are not exposed to the ISA and you have the same code
no matter how wide the actual execution units are. I have no
experience with this (or much experience with SIMD), but that seems
like a big win to my mind. It is akin to letting the processor
hardware handle multiple instructions in parallel in superscaler cpus,
rather than Itanium EPIC coding.
But, there is a problem:
Once you go wider than 2 or 4 elements, cases where wider SIMD brings
more benefit tend to fall off a cliff.
More so, when you go wider, there are new problems:
Vector Masking;
Resource and energy costs of using wider vectors;
...
On Thu, 16 Oct 2025 22:19:21 GMT, MitchAlsup wrote:
But what gets me is the continual disconnect from actual vector
calculations in source code--causing the compilers to have to solve many
memory aliasing issues to use the vector ISA.
Is this why C99 (and later) has the “restrict” qualifier <https://en.cppreference.com/w/c/language/restrict.html>?
On Fri, 17 Oct 2025 06:51:18 -0000 (UTC)
Lawrence D’Oliveiro <ldo@nz.invalid> wrote:
On Thu, 16 Oct 2025 16:26:27 -0500, BGB wrote:
If even Intel can't make their crap work well, I am skeptical.
The only CISC architecture to survive
There are two of them..
the (otherwise universal)
transition to RISC was kept afloat through high revenues and high
margins, which allowed the company to spend the much higher sums
needed to add all the extra millions of transistors necessary to keep
performance competitive.
On 10/17/2025 5:54 AM, Michael S wrote:
On Fri, 17 Oct 2025 06:51:18 -0000 (UTC)
Lawrence D’Oliveiro <ldo@nz.invalid> wrote:
On Thu, 16 Oct 2025 16:26:27 -0500, BGB wrote:
If even Intel can't make their crap work well, I am skeptical.
The only CISC architecture to survive
There are two of them..
AFAIK:
x86 / x86-64: Alive and well in PCs.
6502: Now dead (no more 6502's being made)
65C816: Still holding on (niche), backwards compatible with 6502.
Z80: Dead
M68K: Mostly Dead
NXP ColdFire: Still lives (Simplified M68K).
MSP430: Still Lives (I classify it as a CISC).
IBM S/360: Dead on real HW
Lives on in emulation.
On 10/17/2025 11:00 AM, BGB wrote:
AFAIK:
x86 / x86-64: Alive and well in PCs.
6502: Now dead (no more 6502's being made)
65C816: Still holding on (niche), backwards compatible with 6502.
Z80: Dead
M68K: Mostly Dead
NXP ColdFire: Still lives (Simplified M68K).
MSP430: Still Lives (I classify it as a CISC).
IBM S/360: Dead on real HW
Lives on in emulation.
As I am sure others will verify, the compatible descendants of the S/360
are alive in real hardware. While I expect there haven't been any "new name" customers in a long time, the fact that IBM still introduces new
chips every few years indicates that there is still a market for this architecture, presumably by existing customer's existing workload
growth, and perhaps new applications related to existing ones.
Some of the original BUNCH architectures do live on in emulation (Burroughs, Univac, Honeywell). I believe the other two, CDC and NCR, are dead.
I expect that all of the minicomputer age architectures are dead.
There also were lots of microcomputer "chip" architectures that are dead (National Semi, ATT, Fairchild, etc.), but I don't necessarily attribute that to being overtaken by RISC architectures.
On 10/17/2025 1:49 PM, Stephen Fuld wrote:
As I am sure others will verify, the compatible descendants of the
S/360 are alive in real hardware. While I expect there haven't been
any "new name" customers in a long time, the fact that IBM still
introduces new chips every few years indicates that there is still a
market for this architecture, presumably by existing customer's
existing workload growth, and perhaps new applications related to
existing ones.
OK.
I had thought it was the idea that IBM kept running the original ISA,
but as an emulation layer on top of POWER rather than as the real
hardware level ISA.
On Thu, 16 Oct 2025 15:17:22 -0500, BGB wrote:
Also, the V extension doesn't even fit entirely in the opcode, it
depends on additional state held in CSRs.
I know, you could consider that a cheat in some ways. But on the other
hand, it allows code reuse, by having different (overloaded) function
entry points each do type-specific setup, then all branch to common code
to execute the actual loop bodies.
Most use-cases for longer vectors tend to matrix-like rather than
vector-like. Or, what cases that would appear suited to an 8-element
vector are often achieved sufficiently with two vectors.
Back in the days of Seymour Cray, his machines were getting useful results out of vector lengths up to 64 elements.
Perhaps that was more a substitute for parallel processing.
There are some weaknesses, for example, I mostly ended up dealing with
RGB math by simply repeating the 8-bit values twice within a 16-bit
spot.
Maybe it’s time to look beyond RGB colours. I remember some “Photo” inkjet
printers had 5 or 6 different colour inks, to try to fill out more of the
CIE space. Computer monitors could do the same. Look at the OpenEXR image format that these CG folks like to use: that allows for more than 3 colour components, and each component can be a float -- even single-precision
might not be enough, so they allow for double precision as well.
BSR and JSR had been modified to allow arbitrary link register, but it
may make sense to reverse this; as Rd other than X0 and X1 is seemingly
pretty much never used in practice (so not really worth the logic cost).
POWER/PowerPC has only two registers that are allowed to contain dynamic instruction addresses: LR and CTR. So, a dynamic branch (including
subroutine return) can be BCTR (jump to address in CTR) or BLR (jump to address in LR); and a dynamic subroutine call has to be BCTRL (jump to address in CTR and leave return address in LR).
Hope the attributions are correct.
On Wed, 15 Oct 2025 22:31:32 GMT, MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Lawrence D’Oliveiro <ldo@nz.invalid> posted:
On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:
In any case, even with these languages there are still software projects that fail, miss their deadlines and have overrun their budget ...
A lot of these projects were unnecessary. Once someone figured out how to make the (17 kinds of) hammers one needs, there is little need to make a new hammer architecture.
Windows could have stopped at W7, and many MANY people would have been happier... The mouse was more precise in W7 than in W8 ... With a little upgrade for new PCIe architecture along the way rather than redesigning the whole kit and caboodle for tablets and phones, which did not work BTW...
Office application work COULD have STOPPED in 2003, eXcel in 1998, ... and few people would have cared. Many SW projects are driven not by demand for the product, but pushed by companies to make already satisfied users have to upgrade.
Those programmers could have transitioned to new SW projects rather than redesigning the same old thing 8 more times. Presto, there are now enough well-trained SW engineers to tackle the undone SW backlog.
The problem is that decades of "New & Improved" consumer products have conditioned the public to expect innovation (at minimum new packaging
and/or advertising) every so often.
Bringing it back to computers: consider that a FOSS library which
hasn't seen an update for 2 years likely would be passed over by many
current developers due to concern that the project has been abandoned.
That perception likely would not change even if the author(s)
responded to inquiries, the library was suitable "as is" for the
intended use, and the lack of recent updates can be explained entirely
by a lack of new bug reports.
Why take a chance?
There simply _must_ be a similar project somewhere
else that still is actively under development. Even if it's buggy and unfinished, at least someone is working on it.
YMMV but, as a software developer myself, this attitude makes me sick.
8-(
On Fri, 17 Oct 2025 06:53:16 -0000 (UTC)
Lawrence D’Oliveiro <ldo@nz.invalid> wrote:
On Thu, 16 Oct 2025 10:24:37 -0400, EricP wrote:
Looking at
The Case for the Reduced Instruction Set Computer, 1980, David
Patterson https://dl.acm.org/doi/pdf/10.1145/641914.641917
he never says what defines RISC, just what improved results this
*design approach* should achieve.
From the beginning, I felt that the much-trumpeted reduction in
instruction set complexity never quite matched up with reality. So I
thought a better name would be “IRSC”, as in “Increased Register Set >> Computer” -- because the one feature that really did become common
was the larger register sets.
Larger register sets were common, but not universal.
Load/store architecture (with allowance for exceptions for synchronization primitives that are not expected to be as fast as normal instructions) appears to be universal.
On 10/17/2025 2:03 AM, Lawrence D’Oliveiro wrote:
On Thu, 16 Oct 2025 15:17:22 -0500, BGB wrote:
Also, the V extension doesn't even fit entirely in the opcode, it depends on additional state held in CSRs.
I know, you could consider that a cheat in some ways. But on the other hand, it allows code reuse, by having different (overloaded) function entry points each do type-specific setup, then all branch to common code to execute the actual loop bodies.
The SuperH also did this for the FPU: it didn't have enough encoding space to fit everything, so they sorta used FPU control bits to control which instructions were decoded.
Most use-cases for longer vectors tend to be matrix-like rather than vector-like. Or, what cases that would appear suited to an 8-element vector are often achieved sufficiently with two vectors.
Back in the days of Seymour Cray, his machines were getting useful results out of vector lengths up to 64 elements.
Perhaps that was more a substitute for parallel processing.
Maybe. Just in my own experience, it seems to fizzle out pretty quickly.
It may not count for Cray though, since IIRC their vectors were encoded
as memory-addresses and they were effectively using pipelining tricks
for the vectors.
So, in this case, a truer analog of Cray style vectors would not be
variable width SIMD that can fake large vectors, but rather a mechanism
to stream the vector through a SIMD unit.
Maybe it’s time to look beyond RGB colours. I remember some “Photo” inkjet printers had 5 or 6 different colour inks, to try to fill out more of the CIE space. Computer monitors could do the same. Look at the OpenEXR image format that these CG folks like to use: that allows for more than 3 colour components, and each component can be a float -- even single-precision might not be enough, so they allow for double precision as well.
IME:
The visible difference between RGB555 and RGB24 is small;
The difference between RGB24 and RGB30 is mostly imperceptible;
Though, most modern LCD/LED monitors actually only give around 5 or 6
bits per color channel (unlike the true analog on VGA CRTs, *).
Had noted though that for me, IRL, monitors can't really represent real
life colors. Like, I live in a world where computer displays all have a slight tint (with a similar tint and color distortion also applying to
the output of color laser printers; and a different color distortion for inkjet printers).
I had thought it was the idea that IBM kept running the original ISA, but as an emulation layer on top of POWER rather than as the real hardware level ISA.
Never underestimate the work designers can do when given cubic dollars of budget under which to work.
On 10/17/2025 1:48 AM, Lawrence D’Oliveiro wrote:
On Thu, 16 Oct 2025 22:19:21 GMT, MitchAlsup wrote:
But what gets me is the continual disconnect from actual vector
calculations in source code--causing the compilers to have to solve many >> memory aliasing issues to use the vector ISA.
Is this why C99 (and later) has the “restrict” qualifier <https://en.cppreference.com/w/c/language/restrict.html>?
Ironically, this is also partly why I suspect a C-like language having a "T[]" type distinct from "T*" could be useful, even if the two had the same representation internally (a bare memory pointer):
"T[]" could be safely assumed to never alias.
At least in theory, "restrict" works, when people use it.
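A minimal sketch of what that buys, assuming nothing beyond C99 (the function and names are mine, not from the thread): with both pointers qualified, the compiler may assume they never overlap, so it can keep values in registers and vectorize or software-pipeline the loop without runtime overlap checks.

    #include <stddef.h>

    /* With restrict, the compiler may assume src and dst never overlap,
       so it can reorder loads and stores and vectorize the loop without
       emitting runtime overlap checks. */
    void scale_add(size_t n, float a,
                   const float *restrict src, float *restrict dst)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] += a * src[i];
    }

Without the qualifier, a store to dst[i] could in principle modify a later src[j], so the compiler has to be conservative.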
On 10/17/2025 12:43 PM, BGB wrote:
On 10/17/2025 1:49 PM, Stephen Fuld wrote:
As I am sure others will verify, the compatible descendants of the
S/360 are alive in real hardware. While I expect there haven't been
any "new name" customers in a long time, the fact that IBM still
introduces new chips every few years indicates that there is still a
market for this architecture, presumably by existing customer's
existing workload growth, and perhaps new applications related to
existing ones.
OK.
I had thought it was the idea that IBM kept running the original ISA,
but as an emulation layer on top of POWER rather than as the real
hardware level ISA.
I have heard that idea several times before. I wonder where it came from?
On Fri, 17 Oct 2025 13:59:33 +0300, Michael S wrote:
On Fri, 17 Oct 2025 06:53:16 -0000 (UTC)
Lawrence D’Oliveiro <ldo@nz.invalid> wrote:
From the beginning, I felt that the much-trumpeted reduction in
instruction set complexity never quite matched up with reality. So I
thought a better name would be “IRSC”, as in “Increased Register Set Computer” -- because the one feature that really did become common was the larger register sets.
Larger register sets were common, but not universal.
Where is there an architecture you would class as “RISC”, but did not have
a “large” register set?
(How “large” is “large”? The VAX had 16 registers; was there any RISC
architecture with only that few?)
On Fri, 17 Oct 2025 15:32:39 -0500, BGB wrote:
On 10/17/2025 2:03 AM, Lawrence D’Oliveiro wrote:
The SuperH also did this for the FPU:
On Thu, 16 Oct 2025 15:17:22 -0500, BGB wrote:
Also, the V extension doesn't even fit entirely in the opcode, it
depends on additional state held in CSRs.
I know, you could consider that a cheat in some ways. But on the other
hand, it allows code reuse, by having different (overloaded) function
entry points each do type-specific setup, then all branch to common
code to execute the actual loop bodies.
Didn't have enough encoding space to fit everything, so they sorta used
FPU control bits to control which instructions were decoded.
That was probably not cost-effective for scalar instructions, because it would turn a single operation instruction into multiple instructions for operand type setup followed by the actual operation instruction.
Probably better for vector instructions, where one sequence of operand
type setup lets it then chug away to process a whole sequence of operand tuples in exactly the same way.
Maybe. Just in my own experience, it seems to fizzle out pretty quickly.
Most use-cases for longer vectors tend to be matrix-like rather than
vector-like. Or, cases that would appear suited to an 8-element
vector are often handled sufficiently with two vectors.
Back in the days of Seymour Cray, his machines were getting useful
results out of vector lengths up to 64 elements.
Perhaps that was more a substitute for parallel processing.
Maybe that was just a software thing: the Cray machines had their own architecture(s), which was never carried forward to the new massively-parallel supers, or RISC machines etc. Maybe the parallelism was thought
to render deep pipelines obsolete -- at least in the early years. (*Cough* Pentium 4 *Cough*)
Short-vector SIMD was introduced along an entirely separate evolutionary path, namely that of bringing DSP-style operations into general-purpose CPUs.
It may not count for Cray though, since IIRC their vectors were encoded
as memory-addresses and they were effectively using pipelining tricks
for the vectors.
Certainly if you look at the evolution of Seymour Cray’s designs, explicit vectorization was for him the next stage after implicit pipelining, so the two were bound to have underlying features in common.
So, in this case, a truer analog of Cray style vectors would not be variable width SIMD that can fake large vectors, but rather a mechanism
to stream the vector through a SIMD unit.
But short-vector SIMD can only deal with operands in lockstep. If you
loosen this restriction, then you are back to multiple function units and superscalar execution.
The visible difference between RGB555 and RGB24 is small;
The difference between RGB24 and RGB30 is mostly imperceptible;
Though, most modern LCD/LED monitors actually only give around 5 or 6
bits per color channel (unlike the true analog on VGA CRTs, *).
First of all, we have some “HDR” monitors around now that can output a much greater gradation of brightness levels. These can be used to produce apparent brightnesses greater than 100%.
Secondly, we’re talking about input image formats. Remember that every image-processing step is going to introduce some generational loss due to rounding errors; therefore the higher the quality of the raw input
imagery, the better the quality of the output.
Sure, you may think 64-bit floats must be overkill for this purpose; but these are artists you’re dealing with. ;)
On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:
Short-vector SIMD was introduced along an entirely separate
evolutionary path, namely that of bringing DSP-style operations into
general-purpose CPUs.
MMX was designed to kill off the plug in Modems.
But short-vector SIMD can only deal with operands in lockstep. If you
loosen this restriction, then you are back to multiple function units
and superscalar execution.
Which is a GOOD thing !!
First of all, we have some “HDR” monitors around now that can output a much greater gradation of brightness levels. These can be used to
produce apparent brightnesses greater than 100%.
It is unlikely that monitors will ever get much beyond 11-bits of pixel
depth per color.
Secondly, we’re talking about input image formats. Remember that every
image-processing step is going to introduce some generational loss due
to rounding errors; therefore the higher the quality of the raw input
imagery, the better the quality of the output.
That is why the arithmetic is done in 16-bits.
POWER/PowerPC has only two registers that are allowed to contain dynamic instruction addresses: LR and CTR. So, a dynamic branch (including
subroutine return) can be BCTR (jump to address in CTR) or BLR (jump to address in LR); and a dynamic subroutine call has to be BCTRL (jump to address in CTR and leave return address in LR).
It is unlikely that monitors will ever get much beyond 11-bits of pixel depth per color.
I do not understand why a monitor would go beyond 9 bits. Most people can't see beyond 7 or 8 bits of color component depth.
On Fri, 17 Oct 2025 15:32:39 -0500, BGB wrote:
On 10/17/2025 2:03 AM, Lawrence D’Oliveiro wrote:
The SuperH also did this for the FPU:
On Thu, 16 Oct 2025 15:17:22 -0500, BGB wrote:
Also, the V extension doesn't even fit entirely in the opcode, it
depends on additional state held in CSRs.
I know, you could consider that a cheat in some ways. But on the other
hand, it allows code reuse, by having different (overloaded) function
entry points each do type-specific setup, then all branch to common
code to execute the actual loop bodies.
Didn't have enough encoding space to fit everything, so they sorta used
FPU control bits to control which instructions were decoded.
That was probably not cost-effective for scalar instructions, because it would turn a single operation instruction into multiple instructions for operand type setup followed by the actual operation instruction.
Probably better for vector instructions, where one sequence of operand
type setup lets it then chug away to process a whole sequence of operand tuples in exactly the same way.
Maybe. Just in my own experience, it seems to fizzle out pretty quickly.
Most use-cases for longer vectors tend to be matrix-like rather than
vector-like. Or, cases that would appear suited to an 8-element
vector are often handled sufficiently with two vectors.
Back in the days of Seymour Cray, his machines were getting useful
results out of vector lengths up to 64 elements.
Perhaps that was more a substitute for parallel processing.
Maybe that was just a software thing: the Cray machines had their own architecture(s), which was never carried forward to the new massively-parallel supers, or RISC machines etc. Maybe the parallelism was thought
to render deep pipelines obsolete -- at least in the early years. (*Cough* Pentium 4 *Cough*)
Short-vector SIMD was introduced along an entirely separate evolutionary path, namely that of bringing DSP-style operations into general-purpose
CPUs.
It may not count for Cray though, since IIRC their vectors were encoded
as memory-addresses and they were effectively using pipelining tricks
for the vectors.
Certainly if you look at the evolution of Seymour Cray’s designs, explicit vectorization was for him the next stage after implicit pipelining, so the two were bound to have underlying features in common.
So, in this case, a truer analog of Cray style vectors would not be
variable width SIMD that can fake large vectors, but rather a mechanism
to stream the vector through a SIMD unit.
But short-vector SIMD can only deal with operands in lockstep. If you
loosen this restriction, then you are back to multiple function units and superscalar execution.
Maybe it’s time to look beyond RGB colours. I remember some “Photo” inkjet printers had 5 or 6 different colour inks, to try to fill out
more of the CIE space. Computer monitors could do the same. Look at the
OpenEXR image format that these CG folks like to use: that allows for
more than 3 colour components, and each component can be a float --
even single-precision might not be enough, so they allow for double
precision as well.
IME:
The visible difference between RGB555 and RGB24 is small;
The difference between RGB24 and RGB30 is mostly imperceptible;
Though, most modern LCD/LED monitors actually only give around 5 or 6
bits per color channel (unlike the true analog on VGA CRTs, *).
First of all, we have some “HDR” monitors around now that can output a much greater gradation of brightness levels. These can be used to produce apparent brightnesses greater than 100%.
Secondly, we’re talking about input image formats. Remember that every image-processing step is going to introduce some generational loss due to rounding errors; therefore the higher the quality of the raw input
imagery, the better the quality of the output.
Sure, you may think 64-bit floats must be overkill for this purpose; but these are artists you’re dealing with. ;)
Had noted though that for me, IRL, monitors can't really represent real
life colors. Like, I live in a world where computer displays all have a
slight tint (with a similar tint and color distortion also applying to
the output of color laser printers; and a different color distortion for
inkjet printers).
That is always true; “white” is never truly “white”, which is why those
who work in colour always talk about a “white point” for defining what is meant by “white”, which is the colour of a perfect “black body” emitter at
a specific temperature (typically 5500K or above).
It is unlikely that monitors will ever get much beyond 11-bits of pixel depth per color.
I do not understand why a monitor would go beyond 9 bits. Most people
can't see beyond 7 or 8 bits of color component depth. Keeping the
component depth at 10 bits or less allows colors to fit into 32 bits.
Bits beyond 8 would be for some sea creatures, or viewable with special glasses?
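For the "fits into 32 bits" point, one common packing is 10 bits per channel plus 2 bits of alpha; a sketch (the helper name is mine, and other channel orders exist):

    #include <stdint.h>

    /* Pack three 10-bit colour channels plus 2 bits of alpha into one
       32-bit word, A:2 R:10 G:10 B:10 (one common layout among several). */
    static uint32_t pack_rgb10a2(uint32_t r, uint32_t g, uint32_t b, uint32_t a)
    {
        return ((a & 0x3u)   << 30) |
               ((r & 0x3FFu) << 20) |
               ((g & 0x3FFu) << 10) |
                (b & 0x3FFu);
    }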
Stephen Fuld wrote:
On 10/17/2025 12:43 PM, BGB wrote:
On 10/17/2025 1:49 PM, Stephen Fuld wrote:
As I am sure others will verify, the compatible descendants of the
S/360 are alive in real hardware. While I expect there haven't been any "new name" customers in a long time, the fact that IBM still
introduces new chips every few years indicates that there is still a
market for this architecture, presumably by existing customer's
existing workload growth, and perhaps new applications related to
existing ones.
OK.
I had thought it was the idea that IBM kept running the original ISA,
but as an emulation layer on top of POWER rather than as the real
hardware level ISA.
I have heard that idea several times before. I wonder where it came
from?
The AS400 cpu was replaced by Power and an emulation layer. https://en.wikipedia.org/wiki/IBM_AS/400#The_move_to_PowerPC
The z-series was always a different cpu, but maybe they
shared development groups with Power. The stages of the
z15 core (2019) don't look anything like those of Power10 (2021).
Lawrence D’Oliveiro <ldo@nz.invalid> posted:
On Fri, 17 Oct 2025 13:59:33 +0300, Michael S wrote:
On Fri, 17 Oct 2025 06:53:16 -0000 (UTC)
Lawrence D’Oliveiro <ldo@nz.invalid> wrote:
From the beginning, I felt that the much-trumpeted reduction in
instruction set complexity never quite matched up with reality. So I
thought a better name would be “IRSC”, as in “Increased Register Set Computer” -- because the one feature that really did become common was the larger register sets.
Larger register sets were common, but not universal.
Where is there an architecture you would class as “RISC”, but did not have
a “large” register set?
See Univac 1108
On 10/17/2025 5:20 PM, Lawrence D’Oliveiro wrote:
Probably better for vector instructions, where one sequence of operand
type setup lets it then chug away to process a whole sequence of
operand tuples in exactly the same way.
Yeah, but this works assuming that your vector ops are primarily mapped
to long-running loops.
Maybe that was just a software thing: the Cray machines had their own
architecture(s), which was never carried forward to the new
massively-parallel supers, or RISC machines etc. Maybe the parallelism was
thought to render deep pipelines obsolete -- at least in the early
years. (*Cough* Pentium 4 *Cough*)
I think they were also mostly intended for CFD and FEM simulations and
similar, or stuff that is very regular (running the same math over a
whole lot of elements).
LAPACK has not been updated in decades, yet is as relevant today as
the first day it was available.
I do not understand why a monitor would go beyond 9-bits. Most people
can't see beyond 7 or 8-bits color component depth. Keeping the
component depth 10-bits or less allows colors to fit into 32-bits. Bits beyond 8 would be for some sea creatures or viewable with special
glasses?
BTW, with the AS/400, Power didn't emulate the older S/38 CPU. AS/400
is unusual in having lots of its functionality done in software, so IBM "just" ported that software to Power.
On 17/10/2025 08:48, Lawrence D’Oliveiro wrote:
On Thu, 16 Oct 2025 22:19:21 GMT, MitchAlsup wrote:
But what gets me is the continual disconnect from actual vector
calculations in source code--causing the compilers to have to solve many memory aliasing issues to use the vector ISA.
Is this why C99 (and later) has the “restrict” qualifier
<https://en.cppreference.com/w/c/language/restrict.html>?
"restrict" can significantly improve non-vectored code too, as well as
more "ad-hoc" vectoring of code where the compiler uses general-purpose registers, but interlaces loads, stores and operations to improve pipelining. But it is certainly a very useful qualifier for vector code.
Terje Mathisen <terje.mathisen@tmsw.no> posted:
Interesting! I have also found rsqrt() to be a very good building block,
to the point where if I can only have one helper function (approximate
lookup to start the NR), it would be rsqrt, and I would use it for all
of sqrt, fdiv and rsqrt.
In practice:: RSQRT() is no harder to compute {both HW and SW},
yet:: RSQRT() is more useful::
SQRT(x) = RSQRT(x)*x is 1 pipelined FMUL
RSQRT(x) = 1/SQRT(x) is 1 non-pipelined FDIV
Useful in vector normalization::
some-vector-calculation
-----------------------
SQRT( SUM(x**2,1,n) )
and a host of others.
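A sketch of the idea in C, for illustration only: the seed is the classic integer approximation, refined by two Newton-Raphson steps, and sqrt and reciprocal then fall out as multiplies (the helper names are mine; a real implementation would start from a hardware estimate and refine to full precision):

    #include <stdint.h>
    #include <string.h>

    /* Approximate 1/sqrt(x) for x > 0: the well-known integer seed
       followed by two Newton-Raphson steps, y <- y*(1.5 - 0.5*x*y*y).
       Not correctly rounded; illustration only. */
    static float rsqrtf_approx(float x)
    {
        uint32_t i;
        float y;
        memcpy(&i, &x, sizeof i);
        i = 0x5f3759dfu - (i >> 1);
        memcpy(&y, &i, sizeof y);
        y = y * (1.5f - 0.5f * x * y * y);   /* NR step 1 */
        y = y * (1.5f - 0.5f * x * y * y);   /* NR step 2 */
        return y;
    }

    /* Everything else becomes pipelined multiplies, with no divide: */
    static float sqrt_via_rsqrt(float x)  { return x * rsqrtf_approx(x); }              /* sqrt(x) */
    static float recip_via_rsqrt(float d) { float y = rsqrtf_approx(d); return y * y; } /* 1/d, d > 0 */

Vector normalization is then just the sum of squares followed by one rsqrt and n multiplies.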
Lawrence D’Oliveiro <ldo@nz.invalid> posted:
Short-vector SIMD was introduced along an entirely separate evolutionary
path, namely that of bringing DSP-style operations into general-purpose
CPUs.
MMX was designed to kill off the plug in Modems.
On 10/17/2025 4:52 PM, EricP wrote:
The AS400 cpu was replaced by Power and an emulation layer.
https://en.wikipedia.org/wiki/IBM_AS/400#The_move_to_PowerPC
Yes, sort of. Perhaps because IBM replaced the AS/400 with Power,
someone assumed (incorrectly) that they replaced all their proprietary CPUs with it.
BTW, with the AS/400, Power didn't emulate the older S/38 CPU. AS/400
is unusual in having lots of its functionality done in software, so IBM "just" ported that software to Power. For the other stuff, while there
was a sort of emulation layer, the first time a program was run, it
got silently recompiled to target the new architecture. Or something
like that.
On Fri, 17 Oct 2025 22:29:28 -0400, Robert Finch wrote:
I do not understand why monitor would go beyond 9-bits. Most people
can't see beyond 7 or 8-bits color component depth. Keeping the
component depth 10-bits or less allows colors to fit into 32-bits. Bits
beyond 8 would be for some sea creatures or viewable with special
glasses?
Under ideal conditions (comparing large areas), the human eye can
distinguish about 10 million colours. Round that up to 2**24, and you get
the traditional 8-by-8-by-8 RGB “full colour” space.
However, consider your eye’s ability to adapt to a dynamic range from a
dim room out into bright sunlight. Now imagine trying to simulate some of that in a movie, and you can see why the video images will need more than 8-by-8-by-8 dynamic range.
Where is there an architecture you would class as "RISC", but did not have
a "large" register set?
(How "large" is "large"? The VAX had 16 registers; was there any RISC >architecture with only that few?)
Lawrence D’Oliveiro wrote:
On Fri, 17 Oct 2025 22:29:28 -0400, Robert Finch wrote:
I do not understand why monitor would go beyond 9-bits. Most people
can't see beyond 7 or 8-bits color component depth. Keeping the
component depth 10-bits or less allows colors to fit into 32-bits. Bits
beyond 8 would be for some sea creatures or viewable with special
glasses?
Under ideal conditions (comparing large areas), the human eye can
distinguish about 10 million colours. Round that up to 2**24, and you get
the traditional 8-by-8-by-8 RGB “full colour” space.
10 million is more than what I've heard/seen, but OK:
More interesting is the fact that females tend to have about 10x the
ability to distinguish colors compared to men, due to the fact that the red-green receptors are tied to the X chromosome, and they don't have
to be exactly the same. I know this is true for my wife and me, but on
the other hand I have much better monochrome vision so I can see better
when it is quite dark.
However, consider your eye’s ability to adapt to a dynamic range from a
dim room out into bright sunlight. Now imagine trying to simulate some of
that in a movie, and you can see why the video images will need more than
8-by-8-by-8 dynamic range.
In reality they don't even (really) try. :-)
Many years ago, they even had to shoot all night-time scenes during the
day because the film and cameras didn't have nearly enough dynamic range.
Stephen Fuld wrote:
On 10/17/2025 4:52 PM, EricP wrote:
The AS400 cpu was replaced by Power and an emulation layer.
https://en.wikipedia.org/wiki/IBM_AS/400#The_move_to_PowerPC
Yes, sort of. Perhaps because IBM replaced the AS/400 with power,
someone assumed (incorrectly) that they replaced all their proprietary
CPUs with it.
BTW, with the AS/400, power didn't emulate the older S/38 CPU. AS/400
is unusual in having lots of its functionality done in software, so
IBM "just" ported that software to Power. For the other stuff, while
there was a sort of emulation layer, but the first time a program was
run, it got silently recompiled to target the new architecture. Or
something like that.
I consider AS/400 to be the blueprint for Mill's choice to have a model-portable distribution format that goes through the specializer in order
to be compatible with the actual CPU model it is now running on.
On Sat, 18 Oct 2025 00:42:27 GMT, MitchAlsup wrote:
On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:
First of all, we have some “HDR” monitors around now that can output a much greater gradation of brightness levels. These can be used to
produce apparent brightnesses greater than 100%.
It is unlikely that monitors will ever get much beyond 11-bits of pixel
depth per color.
I think bragging rights alone will see it grow beyond that. Look at tandem OLEDs.
George Neuner <gneuner2@comcast.net> posted:
Hope the attributions are correct.
On Wed, 15 Oct 2025 22:31:32 GMT, MitchAlsup
<user5857@newsgrouper.org.invalid> wrote:
:
Lawrence D’Oliveiro <ldo@nz.invalid> posted:
On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:
In any case, even with these languages there are still software projects
that fail, miss their deadlines and have overrun their budget ...
A lot of these projects were unnecessary. Once someone figured out how to make the (17 kinds of) hammers one needs, there is little need to make a
new hammer architecture.
Windows could have stopped at W7, and many MANY people would have been
happier... The mouse was more precise in W7 than in W8 ... With a little upgrade for new PCIe architecture along the way rather than redesigning
whole kit and caboodle for tablets and phones which did not work BTW...
Office application work COULD have STOPPED in 2003, eXcel in 1998, ...
and few people would have cared. Many SW projects are driven not by demand for the product, but pushed by companies to make already satisfied users
have to upgrade.
Those programmers could have transitioned to new SW projects rather than
redesigning the same old thing 8 more times. Presto, there is now enough
well trained SW engineers to tackle the undone SW backlog.
The problem is that decades of "New & Improved" consumer products have
conditioned the public to expect innovation (at minimum new packaging
and/or advertising) every so often.
Bringing it back to computers: consider that a FOSS library which
hasn't seen an update for 2 years likely would be passed over by many
current developers due to concern that the project has been abandoned.
That perception likely would not change even if the author(s)
responded to inquiries, the library was suitable "as is" for the
intended use, and the lack of recent updates can be explained entirely
by a lack of new bug reports.
LAPACK has not been updated in decades, yet is as relevant today as
the first day it was available.
Most Floating Point Libraries are in a similar position. They were
updated after IEEE 754 became widespread and are as good today as
ever.
{FFT, Tomography, CFD, FEM} have needed no real changes in decades.
Sometimes, Software is "done". You may add things to the package
{like a new crescent wrench} but the old hammer works just as well
today as 30 years ago when you bought it.
Why take a chance?
On the last day of SW support for W10--they (THEY) updated several
things I WANT BACK THE WAY THEY WERE THE DAY BEFORE !!!!!
To the SW vendor, they want to be able to update their SW any time
they want. Yet, the application user wants the same bugs to remain
constant over the duration of the WHOLE FRIGGEN project--because
once you found them and figured a way around them, you don't want
them to reappear somewhere else !!!
There simply _must_ be a similar project somewhere
else that still is actively under development. Even if it's buggy and
unfinished, at least someone is working on it.
I understand--but this bites more often than the conservative approach.
YMMV but, as a software developer myself, this attitude makes me sick.
8-(
I was in a 3-year project where we had to forgo upgrading from SunOS
to Solaris because the SW license model changes would have put us out
of business before project completion.
Lawrence D’Oliveiro <ldo@nz.invalid> writes:
Where is there an architecture you would class as "RISC", but did
not have a "large" register set?
(How "large" is "large"? The VAX had 16 registers; was there any
RISC architecture with only that few?)
The first IBM 801 has 16 registers. ARM A32/T32 has 16 registers (and
shares the VAX's mistake of making the PC accessible as a GPR). RV32E
(and, I think, RV64E) has 16 registers.
- anton
George Neuner <gneuner2@comcast.net> posted:
Hope the attributions are correct.
On Wed, 15 Oct 2025 22:31:32 GMT, MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Lawrence D’Oliveiro <ldo@nz.invalid> posted:
On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:
In any case, even with these languages there are still
software projects that fail, miss their deadlines and have
overrun their budget ...
A lot of these projects were unnecessary. Once someone figured out
how to make the (17 kinds of) hammers one needs, there is little
need to make a new hammer architecture.
Windows could have stopped at W7, and many MANY people would have
been happier... The mouse was more precise in W7 than in W8 ...
With a little upgrade for new PCIe architecture along the way
rather than redesigning whole kit and caboodle for tablets and
phones which did not work BTW...
Office application work COULD have STOPPED in 2003, eXcel in 1998,
... and few people would have cared. Many SW projects are driven
not by demand for the product, but pushed by companies to make
already satisfied users have to upgrade.
Those programmers could have transitioned to new SW projects
rather than redesigning the same old thing 8 more times. Presto,
there is now enough well trained SW engineers to tackle the undone
SW backlog.
The problem is that decades of "New & Improved" consumer products
have conditioned the public to expect innovation (at minimum new
packaging and/or advertising) every so often.
Bringing it back to computers: consider that a FOSS library which
hasn't seen an update for 2 years likely would be passed over by
many current developers due to concern that the project has been
abandoned. That perception likely would not change even if the
author(s) responded to inquiries, the library was suitable "as is"
for the intended use, and the lack of recent updates can be
explained entirely by a lack of new bug reports.
LAPACK has not been updated in decades, yet is as relevant today as
the first day it was available.
Most Floating Point Libraries are in a similar position. They were
updated after IEEE 754 became widespread and are as good today as
ever.
{FFT, Tomography, CFD, FEM} have needed no real changes in decades.
Sometimes, Software is "done". You may add things to the package
{like a new crescent wrench} but the old hammer works just as well
today as 30 years ago when you bought it.
Why take a chance?
On the last day of SW support for W10--they (THEY) updated several
things I WANT BACK THE WAY THEY WERE THE DAY BEFORE !!!!!
To the SW vendor, they want to be able to update their SW any time
they want. Yet, the application user wants the same bugs to remain
constant over the duration of the WHOLE FRIGGEN project--because
once you found them and figured a way around them, you don't want
them to reappear somewhere else !!!
There simply _must_ be a similar project
somewhere else that still is actively under development. Even if
it's buggy and unfinished, at least someone is working on it.
I understand--but this bites more often than the conservative
approach.
YMMV but, as a software developer myself, this attitude makes me
sick. 8-(
I was in a 3-year project where we had to forgo upgrading from SunOS
to Solaris because the SW license model changes would have put us out
of business before project completion.
It is possible that the LAPACK API was not updated in decades, although I'd
expect that even at API level there were at least small additions, if
not changes. But if you are right that the LAPACK implementation was not
updated in decades, then you can be sure that it is either not used by anybody or used by very few people.
Personally, when I need LAPACK-like functionality then I tend to use
BLAS routines either from Intel MKL or from OpenBLAS.
No, old hammer does not work well. Unless you consider delivering
5-10% of possible performance as "working well".
I had thought it was the idea that IBM kept running the original ISA,
but as an emulation layer on top of POWER rather than as the real
hardware level ISA.
I have heard that idea several times before. I wonder where it came from?
The AS400 cpu was replaced by Power and an emulation layer. https://en.wikipedia.org/wiki/IBM_AS/400#The_move_to_PowerPC
The z-series was always a different cpu, but maybe they
shared development groups with Power. The stages of the
z15 core (2019) don't look anything like those of Power10 (2021).
https://www.servethehome.com/wp-content/uploads/2020/08/Hot-Chips-32-IBM-Z15-Processor-Pipeline.jpg
Michael S <already5chosen@yahoo.com> schrieb:
It is possible that the LAPACK API was not updated in decades,
The API of existing LAPACK routines was not changed (AFAIK),
but there were certainly additions. It is also possible to choose
64-bit integers at build time.
although I'd
expect that even at API level there were at least small additions,
if not changes. But if you are right that the LAPACK implementation was
not updated in decades, then you can be sure that it is either not
used by anybody or used by very few people.
It is certainly in use by very many people, if indirectly, for example
by Python or R.
I learned about R the hard way, when a wrong
interface in the C bindings of Lapack surfaced after a long, long
time.
Personally, when I need LAPACK-like functionality then I tend to use
BLAS routines either from Intel MKL or from OpenBLAS.
Different level of application. You use LAPACK when you want to do
things like calculating eigenvalues or singular value decomposition,
see https://www.netlib.org/lapack/lug/node19.html . If you use
BLAS directly, you might want to check if there is a routine
in LAPACK which does what you need to do.
No, old hammer does not work well. Unless you consider delivering
5-10% of possible performance as "working well".
I agree. There is a _lot_ of active research in numerical
algorithms, be it for ODE systems, sparse linear solvers or whatnot.
A lot of that is happening in Julia, actually.
On Fri, 17 Oct 2025 13:59:33 +0300, Michael S wrote:
On Fri, 17 Oct 2025 06:53:16 -0000 (UTC)
Lawrence D’Oliveiro <ldo@nz.invalid> wrote:
From the beginning, I felt that the much-trumpeted reduction in
instruction set complexity never quite matched up with reality. So I
thought a better name would be “IRSC”, as in “Increased Register Set Computer” -- because the one feature that really did become common was the larger register sets.
Larger register sets were common, but not universal.
Where is there an architecture you would class as “RISC”, but did not have
a “large” register set?
(How “large” is “large”? The VAX had 16 registers; was there any RISC
architecture with only that few?)
On 16/10/2025 23:26, BGB wrote:
On 10/16/2025 2:04 AM, David Brown wrote:
On 16/10/2025 07:44, Lawrence D’Oliveiro wrote:
On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:
But the RISC-V folks still think Cray-style long vectors are better than SIMD, if only because it preserves the “R” in “RISC”.
The R in RISC-V comes from "student _R_esearch".
“Reduced Instruction Set Computing”. That was what every single primer on the subject said, right from the 1980s onwards.
Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or vice versa)--they simply represent different ways of shooting yourself in the foot.
The primary design criterion, as I understood it, was to avoid filling up the instruction opcode space with a combinatorial explosion. (Or sequence of combinatorial explosions, when you look at the wave after wave of SIMD extensions in x86 and elsewhere.)
I believe another aim is to have the same instructions work on
different hardware. With SIMD, you need different code if your
processor can add 4 ints at a time, or 8 ints, or 16 ints - it's all
different instructions using different SIMD registers. With the
vector style instructions in RISC-V, the actual SIMD registers and
implementation are not exposed to the ISA and you have the same code
no matter how wide the actual execution units are. I have no
experience with this (or much experience with SIMD), but that seems
like a big win to my mind. It is akin to letting the processor
hardware handle multiple instructions in parallel in superscalar CPUs,
rather than Itanium EPIC coding.
But, there is problem:
Once you go wider than 2 or 4 elements, cases where wider SIMD brings
more benefit tend to fall off a cliff.
More so, when you go wider, there are new problems:
Vector Masking;
Resource and energy costs of using wider vectors;
...
I appreciate that. Often you will either be wanting the operations to
be done on a small number of elements, or you will want to do it for a
large block of N elements which may be determined at run-time. There
are some algorithms, such as in cryptography, where you have sizeable but fixed-size blocks.
When you are dealing with small, fixed-size vectors, x86-style SIMD can
be fine - you can treat your four-element vectors as single objects to
be loaded, passed around, and operated on. But when you have a large run-time count N, it gets a lot more inefficient. First you have to
decide what SIMD extensions you are going to require from the target,
and thus how wide your SIMD instructions will be - say, M elements.
Then you need to loop N / M times, doing M elements at a time. Then you need to handle the remaining N % M elements - possibly using smaller
SIMD operations, possibly doing them with serial instructions (noting
that there might be different details in the implementation of SIMD and serial instructions, especially for floating point).
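A minimal sketch of that N / M pattern, assuming 4-wide SSE as the target (the function name is mine; alignment handling and the floating-point ordering caveat above are ignored):

    #include <stddef.h>
    #include <xmmintrin.h>   /* SSE intrinsics: 4 floats per vector */

    /* dst[i] = a[i] + b[i]: a 4-wide main loop plus a scalar tail
       for the remaining n % 4 elements. */
    void add_arrays(float *dst, const float *a, const float *b, size_t n)
    {
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
        }
        for (; i < n; i++)          /* tail, done serially */
            dst[i] = a[i] + b[i];
    }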
[LAPACK] is certainly in use by very many people, if indirectly, for
example by Python or R.
I don't use either of the two for numerics (I use python for other
tasks). But I use Matlab and Octave. I know for sure that Octave
uses relatively new implementations, and pretty sure that the same
goes for Matlab.
MitchAlsup wrote:
On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:
Short-vector SIMD was introduced along an entirely separate
evolutionary path, namely that of bringing DSP-style operations
into general-purpose CPUs.
MMX was designed to kill off the plug in Modems.
MMX was quite obviously (also) intended for short vectors of
typically 8 and 16-bit elements, it was the enabler for sw DVD
decoding. ZoranDVD was the first to properly handle 30 frames/second
with zero skips, it needed a PentiumMMX-200 to do so.
In many cases one can enlarge data structures to a multiple of the SIMD vector
size (and align them properly). This requires some extra code, but not
too much, and all of it is outside the inner loop. So, there is some waste
due to unused elements, but it is rather small.
Of course, there is still trouble due to different SIMD vector sizes
and/or different SIMD instruction sets.
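A sketch of that enlarge-and-align approach, assuming C11 aligned_alloc and a 32-byte vector width (the helper name is mine):

    #include <stdlib.h>
    #include <string.h>

    #define VEC_BYTES 32   /* e.g. one 8-float AVX vector */

    /* Allocate n floats rounded up to a whole number of vectors and
       aligned to the vector size; the padding elements are zeroed so
       a full-width SIMD loop can run over them harmlessly. */
    static float *alloc_padded(size_t n)
    {
        size_t per_vec = VEC_BYTES / sizeof(float);
        size_t padded  = (n + per_vec - 1) / per_vec * per_vec;
        float *p = aligned_alloc(VEC_BYTES, padded * sizeof(float));  /* C11 */
        if (p)
            memset(p + n, 0, (padded - n) * sizeof(float));
        return p;
    }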
On 18/10/2025 03:05, Lawrence D’Oliveiro wrote:
On Sat, 18 Oct 2025 00:42:27 GMT, MitchAlsup wrote:
On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:
First of all, we have some “HDR” monitors around now that can output a much greater gradation of brightness levels. These can be used to
produce apparent brightnesses greater than 100%.
It is unlikely that monitors will ever get much beyond 11-bits of pixel
depth per color.
I think bragging rights alone will see it grow beyond that. Look at tandem OLEDs.
Like many things, human perception of brightness is not linear - it is somewhat logarithmic. So even though we might not be able to
distinguish anywhere close to 2000 different nuances of one primary
colour, we /can/ perceive a very wide dynamic range. Having a large
number of bits on a linear scale can be more convenient in practice than trying to get accurate non-linear scaling.
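For reference, the usual compromise is to store a non-linear per-channel encoding and convert to linear for arithmetic; a sketch using the standard sRGB transfer curve:

    #include <math.h>

    /* sRGB transfer curve, per channel, values in [0,1].  The non-linear
       encoding spends more code values on dark shades, roughly matching
       the eye's logarithmic-ish brightness response. */
    static double linear_to_srgb(double lin)
    {
        return (lin <= 0.0031308) ? 12.92 * lin
                                  : 1.055 * pow(lin, 1.0 / 2.4) - 0.055;
    }

    static double srgb_to_linear(double s)
    {
        return (s <= 0.04045) ? s / 12.92
                              : pow((s + 0.055) / 1.055, 2.4);
    }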
On Sat, 18 Oct 2025 19:24:21 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Michael S <already5chosen@yahoo.com> schrieb:
It is possible that the LAPACK API was not updated in decades,
The API of existing LAPACK routines was not changed (AFAIK),
but there were certainly additions. It is also possible to choose
64-bit integers at build time.
although I'd
expect that even at API level there were at least small additions,
if not changes. But if you are right that the LAPACK implementation was
not updated in decades, then you can be sure that it is either not
used by anybody or used by very few people.
It is certainly in use by very many people, if indirectly, for example
by Python or R.
Is Python (numpy and scipy, I suppose) or R linked against an
implementation of LAPACK from 30 or 40 years ago, as suggested by Mitch?
Somehow, I don't believe it.
I don't use either of the two for numerics (I use python for other
tasks). But I use Matlab and Octave. I know for sure that Octave uses relatively new implementations, and I'm pretty sure that the same goes
for Matlab.
Personally, when I need LAPACK-like functionality then I tend to use
BLAS routines either from Intel MKL or from OpenBLAS.
Different level of application. You use LAPACK when you want to do
things like calculating eigenvalues or singular value decomposition,
see https://www.netlib.org/lapack/lug/node19.html . If you use
BLAS directly, you might want to check if there is a routine
in LAPACK which does what you need to do.
Higher-level algos I am interested in are mostly our own inventions.
I can look, of course, but the chances that they are present in LAPACK
are very low.
In fact, Even BLAS L3 I don't use all that often (and lower levels
of BLAS never).
Not because the APIs do not match my needs. They typically do. But
because standard implementations are optimized for big or huge matrices.
My needs are medium matrices. A lot of medium matrices.
My own implementations of standard algorithms for medium-sized
matrices, most importantly of Cholesky decomposition, tend to be much
faster than those in OTS BLAS libraries. And preparation of my own
didn't take a lot of time. After all, those are simple algorithms.
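For reference, the unblocked version of that algorithm really is short; a sketch of an in-place lower-triangular Cholesky for a row-major matrix, without the blocking and tuning a library would add (the function name is mine):

    #include <math.h>

    /* In-place Cholesky factorization of a symmetric positive-definite
       n x n matrix stored row-major; on success the lower triangle of a
       holds L with A = L * L^T.  Returns 0 on success, -1 if a leading
       minor is not positive definite. */
    static int cholesky(double *a, int n)
    {
        for (int j = 0; j < n; j++) {
            double d = a[j*n + j];
            for (int k = 0; k < j; k++)
                d -= a[j*n + k] * a[j*n + k];
            if (d <= 0.0)
                return -1;
            a[j*n + j] = sqrt(d);
            for (int i = j + 1; i < n; i++) {
                double s = a[i*n + j];
                for (int k = 0; k < j; k++)
                    s -= a[i*n + k] * a[j*n + k];
                a[i*n + j] = s / a[j*n + j];
            }
        }
        return 0;
    }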
Speaking of Cray, the US Mint are issuing some new $1 coins featuring
various famous persons/things, and one of them has a depiction of the
Cray-1 on it.
From the photo I’ve seen, it’s an overhead view, looking like a
stylized letter C. So I wonder, even with the accompanying legend “CRAY-1 SUPERCOMPUTER”, how many people will realize that’s actually a
picture of the computer?
<https://www.tomshardware.com/tech-industry/new-us-usd1-coins-to-feature-steve-jobs-and-cray-1-supercomputer-us-mints-2026-american-innovation-program-to-memorialize-computing-history>
My guess: Well below 0.1% unless they get told what it is.
It is unlikely that monitors will ever get much beyond 11-bits of pixel depth per color.
I do not understand why a monitor would go beyond 9 bits. Most people
can't see beyond 7 or 8 bits of color component depth. Keeping the
component depth at 10 bits or less allows colors to fit into 32 bits.
Bits beyond 8 would be for some sea creatures, or viewable with special glasses?
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
LAPACK has not been updated in decades, yet is as relevant today as
the first day it was available.
Lapack's basics have not changed, but it is still actively maintained,
with errors being fixed and new features added.
If you look at the most recent major release, you will see that a lot
is going on: https://www.netlib.org/lapack/lapack-3.12.0.html
One important thing seems to be changes to 64-bit integers.
And I love changes like
- B = BB*CS + DD*SN
- C = -AA*SN + CC*CS
+ B = ( BB*CS ) + ( DD*SN )
+ C = -( AA*SN ) + ( CC*CS )
which makes sure that compilers don't emit FMA instructions and
change rounding (which, apparently, reduced accuracy enormously
for one routine).
(According to the Fortran standard, the compiler has to honor
parentheses.)
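The rough C analogue, for comparison, is the standard FP_CONTRACT pragma rather than parentheses; a sketch (the function name is mine, and compiler support for the pragma varies, so some toolchains need an option like -ffp-contract=off instead):

    /* Ask the compiler not to contract a*b + c into an FMA in this file,
       so each product is rounded exactly as written. */
    #pragma STDC FP_CONTRACT OFF

    double rot_c(double aa, double cc, double cs, double sn)
    {
        return -(aa * sn) + (cc * cs);   /* mirrors the LAPACK change quoted above */
    }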
On Fri, 17 Oct 2025 20:54:23 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
No, old hammer does not work well. Unless you consider delivering
5-10% of possible performance as "working well".
Michael S <already5chosen@yahoo.com> posted:
On Fri, 17 Oct 2025 20:54:23 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
No, old hammer does not work well. Unless you consider delivering
5-10% of possible performance as "working well".
Are you suggesting that a brand new #3 ball peen hammer is usefully
better than a 30 YO #3 ball peen hammer ???
On Sat, 18 Oct 2025 10:21:32 +0200, Terje Mathisen wrote:
MitchAlsup wrote:
On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:
Short-vector SIMD was introduced along an entirely separate
evolutionary path, namely that of bringing DSP-style operations
into general-purpose CPUs.
MMX was designed to kill off the plug in Modems.
MMX was quite obviously (also) intended for short vectors of
typically 8 and 16-bit elements, it was the enabler for sw DVD
decoding. ZoranDVD was the first to properly handle 30 frames/second
with zero skips, it needed a PentiumMMX-200 to do so.
I think the initial “killer app” for short-vector SIMD was very much video encoding/decoding, not audio encoding/decoding. Audio was
already easy enough to manage with general-purpose CPUs of the 1990s.
On 19/10/2025 03:17, Lawrence D’Oliveiro wrote:
On Sat, 18 Oct 2025 10:21:32 +0200, Terje Mathisen wrote:
MitchAlsup wrote:
On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:
Short-vector SIMD was introduced along an entirely separate
evolutionary path, namely that of bringing DSP-style operations
into general-purpose CPUs.
MMX was designed to kill off the plug in Modems.
MMX was quite obviously (also) intended for short vectors of
typically 8 and 16-bit elements, it was the enabler for sw DVD
decoding. ZoranDVD was the first to properly handle 30 frames/second
with zero skips, it needed a PentiumMMX-200 to do so.
I think the initial “killer app” for short-vector SIMD was very much
video encoding/decoding, not audio encoding/decoding. Audio was
already easy enough to manage with general-purpose CPUs of the 1990s.
Agreed. But having SIMD made audio processing more efficient, which was
a nice bonus - especially if you wanted more than CD quality audio.
David Brown wrote:
On 19/10/2025 03:17, Lawrence D’Oliveiro wrote:
On Sat, 18 Oct 2025 10:21:32 +0200, Terje Mathisen wrote:
MitchAlsup wrote:
On Fri, 17 Oct 2025 22:20:49 -0000 (UTC), Lawrence D’Oliveiro wrote:
Short-vector SIMD was introduced along an entirely separate
evolutionary path, namely that of bringing DSP-style operations
into general-purpose CPUs.
MMX was designed to kill off the plug in Modems.
MMX was quite obviously (also) intended for short vectors of
typically 8 and 16-bit elements, it was the enabler for sw DVD
decoding. ZoranDVD was the first to properly handle 30 frames/second
with zero skips, it needed a PentiumMMX-200 to do so.
I think the initial “killer app” for short-vector SIMD was very much
video encoding/decoding, not audio encoding/decoding. Audio was
already easy enough to manage with general-purpose CPUs of the 1990s.
Agreed. But having SIMD made audio processing more efficient, which
was a nice bonus - especially if you wanted more than CD quality audio.
Having SIMD available was a key part of making the open source Ogg
Vorbis decoder 3x faster.
It worked on MMX/SSE/SSE2/Altivec.
On Fri, 17 Oct 2025 20:54:23 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
George Neuner <gneuner2@comcast.net> posted:
Hope the attributions are correct.
On Wed, 15 Oct 2025 22:31:32 GMT, MitchAlsup
<user5857@newsgrouper.org.invalid> wrote:
:
Lawrence D’Oliveiro <ldo@nz.invalid> posted:
On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:
In any case, even with these languages there are still
software projects that fail, miss their deadlines and have
overrun their budget ...
A lot of these projects were unnecessary. Once someone figured out
how to make the (17 kinds of) hammers one needs, there is little
need to make a new hammer architecture.
Windows could have stopped at W7, and many MANY people would have
been happier... The mouse was more precise in W7 than in W8 ...
With a little upgrade for new PCIe architecture along the way
rather than redesigning whole kit and caboodle for tablets and
phones which did not work BTW...
Office application work COULD have STOPPED in 2003, eXcel in 1998,
... and few people would have cared. Many SW projects are driven
not by demand for the product, but pushed by companies to make
already satisfied users have to upgrade.
Those programmers could have transitioned to new SW projects
rather than redesigning the same old thing 8 more times. Presto,
there is now enough well trained SW engineers to tackle the undone
SW backlog.
The problem is that decades of "New & Improved" consumer products
have conditioned the public to expect innovation (at minimum new
packaging and/or advertising) every so often.
Bringing it back to computers: consider that a FOSS library which
hasn't seen an update for 2 years likely would be passed over by
many current developers due to concern that the project has been
abandoned. That perception likely would not change even if the
author(s) responded to inquiries, the library was suitable "as is"
for the intended use, and the lack of recent updates can be
explained entirely by a lack of new bug reports.
LAPACK has not been updated in decades, yet is as relevant today as
the first day it was available.
It is possible that the LAPACK API was not updated in decades, although I'd
expect that even at API level there were at least small additions, if
not changes. But if you are right that the LAPACK implementation was not
updated in decades, then you can be sure that it is either not used by anybody or used by very few people.
AFAICS, at the logical level the interface stays the same. There is one significant change: in the old days you were on your own trying to interface
LAPACK from C. Now you can get a C interface.
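A minimal sketch of that C interface in use, assuming the usual lapacke.h header and linking against LAPACKE/LAPACK (the matrix values are mine):

    #include <stdio.h>
    #include <lapacke.h>

    int main(void)
    {
        /* Cholesky-factor a small symmetric positive-definite matrix via
           the LAPACKE C interface (row-major, lower triangle). */
        double a[9] = { 4, 0, 0,
                        2, 5, 0,
                        1, 2, 6 };
        lapack_int info = LAPACKE_dpotrf(LAPACK_ROW_MAJOR, 'L', 3, a, 3);
        if (info != 0) {
            printf("dpotrf failed: info = %d\n", (int)info);
            return 1;
        }
        printf("L[0][0] = %g\n", a[0]);   /* sqrt(4) = 2 */
        return 0;
    }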