Forum: War Ensemble BBS

Re: 80286 protected mode

From David Brown@david.brown@hesbynett.no to comp.arch on Wed Oct 9 10:24:34 2024

From Newsgroup: comp.arch

On 08/10/2024 09:28, Anton Ertl wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

Whereas by the time 286 got out, everybody was wanting flat
memory ala C.

It's interesting that, when C was standardized, the segmentation found
its way into it by disallowing subtracting and comparing between
addresses in different objects.

It is difficult to talk about the timing of features (either things that
are allowed, or things explicitly disallowed) before the standardisation
of C, as there was no single language "C". Different variants supported
by different compilers had different rules.

This disallows performing certain
forms of induction variable elimination by hand. So while flat memory
is C culture so much that you write "flat memory ala C", the
standardized subset of C (what standard C fanatics claim is the only
meaning of "C") actually specifies a segmented memory model.

No, the C standard does not in any sense specify a segmented memory
model. Nor does it specify a non-segmented or flat or contiguous memory.

The nearest it gets is the description of converting between pointers
and integers, where it says that the conversion of a pointer to an
integer might not fit in any integer type, in which case the conversions
are undefined behaviour - but if they /are/ convertible, the intention
is that the value (of type "uintptr_t") should be consistent with "the addressing structure of the execution environment".

The way C is specified is intended to be strong enough to allow
programmers to do all they generally need to do using portable code
(i.e., code that doesn't rely on anything other than standard
behaviour), without unnecessarily restricting the kinds of systems that
can implement C, and without unnecessarily restricting what people can
write in non-portable code.

When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never". Pretty much the
only example people ever give of needing such comparisons is to
implement memmove() efficiently - but you don't need to implement
memmove(), because it is already in the standard library. (Standard
library implementations don't need to be portable, and can rely on
extensions or other compiler-specific features.)

In practice, on all but the most niche or specialised platforms, if you
do feel you need to compare random pointers, you can cast them to
uintptr_t and compare these. That will generally work on segmented, non-contiguous or flat memories.
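A minimal sketch of the uintptr_t approach, assuming a target where the integer value of a pointer reflects its address (true on common flat-memory systems, not guaranteed by the standard). The helper names ptr_before and ptr_in_buffer are made up for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: orders two arbitrary pointers by their integer
   representation. The ordering is implementation-defined and is only
   meaningful where uintptr_t mirrors the target's addressing. */
static bool ptr_before(const void *a, const void *b)
{
    return (uintptr_t)a < (uintptr_t)b;
}

/* Hypothetical helper: true if p points into buf[0 .. len-1].
   Written with integers so that p may belong to an unrelated object. */
static bool ptr_in_buffer(const void *p, const void *buf, size_t len)
{
    assert(buf != NULL);
    uintptr_t ip = (uintptr_t)p;
    uintptr_t ib = (uintptr_t)buf;
    return ip >= ib && ip - ib < len;
}
```

Doing the same test with "p >= buf && p < buf + len" directly on the pointers would be undefined whenever p does not point into buf, which is exactly the case the check is meant to detect.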

An interesting case is the Forth standard. It specifies "contiguous regions", which correspond to objects in C, but in Forth each address
is a cell and can be added, subtracted, compared, etc. irrespective of
where it came from. So Forth really has a flat-memory model. It has
had that since its origins in the 1970s. Some of the 8086
implementations had some extensions for dealing with more than 64KB,
but they were never standardized and are now mostly forgotten.

Forth does not require a flat memory model in the hardware, as far as I
am aware, any more than C does. (I appreciate that your knowledge of
Forth is /vastly/ greater than mine.) A Forth implementation could
interpret part of the address value as the segment or other memory block identifier and part of it as an index into that block, just as a C implementation can.

A flat address model is almost certainly more /efficient/, for C, Forth
and many other languages. But that does not mean a particular model is /required/ or specified by the language.

--- Synchronet 3.20a-Linux NewsLink 1.114

From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Oct 9 16:28:19 2024

From Newsgroup: comp.arch

On Wed, 9 Oct 2024 8:24:34 +0000, David Brown wrote:

On 08/10/2024 09:28, Anton Ertl wrote:

When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never". Pretty much the
only example people ever give of needing such comparisons is to
implement memmove() efficiently - but you don't need to implement
memmove(), because it is already in the standard library. (Standard
library implementations don't need to be portable, and can rely on
extensions or other compiler-specific features.)

Somebody has to write memmove() and they want to use C to do it.

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Oct 9 16:42:38 2024

From Newsgroup: comp.arch

mitchalsup@aol.com (MitchAlsup1) writes:

On Wed, 9 Oct 2024 8:24:34 +0000, David Brown wrote:

On 08/10/2024 09:28, Anton Ertl wrote:

When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never". Pretty much the
only example people ever give of needing such comparisons is to
implement memmove() efficiently - but you don't need to implement
memmove(), because it is already in the standard library. (Standard
library implementations don't need to be portable, and can rely on
extensions or other compiler-specific features.)

Somebody has to write memmove() and they want to use C to do it.

In most every mainstream implementation, memmove() is written
in assembler in order to inject the appropriate prefetches and
follow the recommended instruction usage per the target architecture
software optimization guide.

From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Oct 9 18:10:44 2024

From Newsgroup: comp.arch

David Brown <david.brown@hesbynett.no> schrieb:

When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never".

Sometimes, it is handy to encode certain conditions in pointers,
rather than having only a valid pointer or NULL. A compiler,
for example, might want to store the fact that an error occurred
while parsing a subexpression as a special pointer constant.

Compilers often have the unfair advantage, though, that they can
rely on what application programmers cannot, their implementation
details. (Some do not, such as f2c).
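
One portable way to get a "special pointer constant" is the address of a dedicated static object. This is only a sketch, and not necessarily what a real compiler does; Expr, EXPR_ERROR and the helper names are hypothetical. It relies only on equality comparison, which standard C allows between pointers to different objects:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical expression-tree node, for illustration only. */
typedef struct Expr {
    int kind;
    struct Expr *lhs, *rhs;
} Expr;

/* A single static object whose address marks "parse error".
   Only == and != are used on it, never < or >. */
static Expr parse_error_node;
#define EXPR_ERROR (&parse_error_node)

typedef enum { STATUS_EMPTY, STATUS_ERROR, STATUS_OK } ExprStatus;

/* Three distinguishable outcomes: NULL, the error marker, or a real node. */
static ExprStatus expr_status(const Expr *e)
{
    if (e == NULL)
        return STATUS_EMPTY;
    if (e == EXPR_ERROR)
        return STATUS_ERROR;
    return STATUS_OK;
}

/* Accessors must not be handed NULL or the marker. */
static int expr_kind(const Expr *e)
{
    assert(expr_status(e) == STATUS_OK);
    return e->kind;
}
```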

From David Brown@david.brown@hesbynett.no to comp.arch on Wed Oct 9 22:20:42 2024

From Newsgroup: comp.arch

On 09/10/2024 18:28, MitchAlsup1 wrote:

On Wed, 9 Oct 2024 8:24:34 +0000, David Brown wrote:

On 08/10/2024 09:28, Anton Ertl wrote:

When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never". Pretty much the
only example people ever give of needing such comparisons is to
implement memmove() efficiently - but you don't need to implement
memmove(), because it is already in the standard library. (Standard
library implementations don't need to be portable, and can rely on
extensions or other compiler-specific features.)

Somebody has to write memmove() and they want to use C to do it.

They don't have to write it in standard, portable C. Standard libraries
will, sometimes, use "magic" - they can be in assembly, or use compiler extensions, or target-specific features, or "-fsecret-flag-for-std-lib" compiler flags, or implementation-dependent features, or whatever they want.

You will find that most implementations of memmove() are done by
converting the pointers to an unsigned integer type and comparing those
values. The type chosen may be implementation-dependent, or it may be "uintptr_t" (even if you are using C90 for your code, the library
writers can use C99 for theirs).

Such implementations will not be portable to all systems. They won't
work on a target that has some kind of "fat" pointers or segmented
pointers that can't be translated properly to integers.

That's okay, of course. For targets that have such complications, that standard library function will be written in a different way.
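
A rough sketch of that integer-comparison approach, assuming a flat address space where uintptr_t values order the same way as addresses. The name sketch_memmove is invented to avoid clashing with the real function, and a production version would copy in wider units:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte-wise memmove(); not taken from any real library. */
void *sketch_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));

    /* Non-portable step: the direction is chosen by comparing the
       pointers as integers rather than with the < operator. */
    if ((uintptr_t)d < (uintptr_t)s) {
        /* Destination is lower: copy forwards. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else if ((uintptr_t)d > (uintptr_t)s) {
        /* Destination is higher: copy backwards so that overlapping
           source bytes are read before they are overwritten. */
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    }
    return dst;
}
```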

The avrlibc library used by gcc for the AVR has its memmove()
implemented in assembly for speed, as does musl for some architectures.

There are lots of parts of the standard C library that cannot be written completely in portable standard C. (How would you write a function that handles files? You need non-portable OS calls.) That's why these
things are in the standard library in the first place.


From David Brown@david.brown@hesbynett.no to comp.arch on Wed Oct 9 22:22:16 2024

From Newsgroup: comp.arch

On 09/10/2024 20:10, Thomas Koenig wrote:

David Brown <david.brown@hesbynett.no> schrieb:

When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never".

Sometimes, it is handy to encode certain conditions in pointers,
rather than having only a valid pointer or NULL. A compiler,
for example, might want to store the fact that an error occurred
while parsing a subexpression as a special pointer constant.

Compilers often have the unfair advantage, though, that they can
rely on what application programmers cannot, their implementation
details. (Some do not, such as f2c).

Standard library authors have the same superpowers, so that they can
implement an efficient memmove() even though a pure standard C
programmer cannot (other than by simply calling the standard library
memmove() function!).


From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Oct 9 21:37:30 2024

From Newsgroup: comp.arch

On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

On 09/10/2024 20:10, Thomas Koenig wrote:

David Brown <david.brown@hesbynett.no> schrieb:

When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never".

Sometimes, it is handy to encode certain conditions in pointers,
rather than having only a valid pointer or NULL. A compiler,
for example, might want to store the fact that an error occurred
while parsing a subexpression as a special pointer constant.

Compilers often have the unfair advantage, though, that they can
rely on what application programmers cannot, their implementation
details. (Some do not, such as f2c).

Standard library authors have the same superpowers, so that they can implement an efficient memmove() even though a pure standard C
programmer cannot (other than by simply calling the standard library memmove() function!).

This is more a symptom of bad ISA design/evolution than of libc
writers needing superpowers.

From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Oct 9 14:52:39 2024

From Newsgroup: comp.arch

On 10/9/2024 1:20 PM, David Brown wrote:

On 09/10/2024 18:28, MitchAlsup1 wrote:

On Wed, 9 Oct 2024 8:24:34 +0000, David Brown wrote:

On 08/10/2024 09:28, Anton Ertl wrote:

When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never". Pretty much the
only example people ever give of needing such comparisons is to
implement memmove() efficiently - but you don't need to implement
memmove(), because it is already in the standard library. (Standard
library implementations don't need to be portable, and can rely on
extensions or other compiler-specific features.)

Somebody has to write memmove() and they want to use C to do it.

They don't have to write it in standard, portable C. Standard libraries will, sometimes, use "magic" - they can be in assembly, or use compiler extensions, or target-specific features, or "-fsecret-flag-for-std-lib" compiler flags, or implementation-dependent features, or whatever they
want.

You will find that most implementations of memmove() are done by
converting the pointers to an unsigned integer type and comparing those values. The type chosen may be implementation-dependent, or it may be "uintptr_t" (even if you are using C90 for your code, the library
writers can use C99 for theirs).

Such implementations will not be portable to all systems. They won't
work on a target that has some kind of "fat" pointers or segmented
pointers that can't be translated properly to integers.

That's okay, of course. For targets that have such complications, that standard library function will be written in a different way.

The avrlibc library used by gcc for the AVR has its memmove()
implemented in assembly for speed, as does musl for some architectures.

There are lots of parts of the standard C library that cannot be written completely in portable standard C. (How would you write a function that handles files? You need non-portable OS calls.) That's why these
things are in the standard library in the first place.

I agree with everything you say up until the last sentence. There are
several languages, mostly older ones like Fortran and COBOL, where the
file handling/I/O are defined portably within the language proper, not
in a separate library. It just moves the non-portable stuff from the
library writer (as in C) to the compiler writer (as in Fortran, COBOL, etc.)
--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Oct 10 00:33:41 2024

From Newsgroup: comp.arch

On Wed, 9 Oct 2024 21:52:39 +0000, Stephen Fuld wrote:

On 10/9/2024 1:20 PM, David Brown wrote:

There are lots of parts of the standard C library that cannot be written
completely in portable standard C. (How would you write a function that
handles files?

Do you mean things other than open(), close(), read(), write(), lseek()
??

You need non-portable OS calls.) That's why these
things are in the standard library in the first place.


From David Brown@david.brown@hesbynett.no to comp.arch on Thu Oct 10 08:24:32 2024

From Newsgroup: comp.arch

On 09/10/2024 23:52, Stephen Fuld wrote:

On 10/9/2024 1:20 PM, David Brown wrote:

There are lots of parts of the standard C library that cannot be
written completely in portable standard C. (How would you write a
function that handles files? You need non-portable OS calls.) That's
why these things are in the standard library in the first place.

I agree with everything you say up until the last sentence. There are several languages, mostly older ones like Fortran and COBOL, where the
file handling/I/O are defined portably within the language proper, not
in a separate library. It just moves the non-portable stuff from the library writer (as in C) to the compiler writer (as in Fortran, COBOL,
etc.)

I meant that this is why these features have to be provided, rather than
left for the user to implement themselves. They could also have been
provided in the language itself (as was done in many other languages) -
the point is that you cannot write the file access functions in pure
standard C.


From David Brown@david.brown@hesbynett.no to comp.arch on Thu Oct 10 08:30:37 2024

From Newsgroup: comp.arch

On 10/10/2024 02:33, MitchAlsup1 wrote:

On Wed, 9 Oct 2024 21:52:39 +0000, Stephen Fuld wrote:

On 10/9/2024 1:20 PM, David Brown wrote:

There are lots of parts of the standard C library that cannot be written
completely in portable standard C. (How would you write a function that
handles files?

Do you mean things other than open(), close(), read(), write(), lseek()
??

The C standard library provides functions like fopen(), fclose(),
fwrite(), etc. It provides them because programs often need such functionality, and you cannot write them yourself in portable standard
C. (As Stephen pointed out, C could have had them built into the
language - for many good reasons, C did not go that route.)
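
To illustrate the distinction, here is a small sketch that stays entirely within the standard C library interface (tmpfile, fwrite, fseek, fread, fclose). The helper name roundtrip is hypothetical. The calling code is portable; the implementation of those library functions underneath is not:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes msg to a temporary file, reads it back
   into out (capacity cap), and returns the number of bytes read.
   Returns 0 if the temporary file cannot be created or written. */
size_t roundtrip(const char *msg, char *out, size_t cap)
{
    assert(msg != NULL && out != NULL && cap > 0);

    size_t len = strlen(msg);
    size_t got = 0;
    FILE *f = tmpfile();            /* standard C, no path name needed */

    if (f != NULL) {
        if (fwrite(msg, 1, len, f) == len && fseek(f, 0L, SEEK_SET) == 0)
            got = fread(out, 1, cap - 1, f);
        fclose(f);
    }
    out[got] = '\0';
    return got;
}
```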

The functions you list here are the POSIX names - not the C standard
library names. Those POSIX functions cannot be implemented in portable standard C either if you exclude making wrappers around the standard
library functions.

In both cases - implementing the standard library functions or
implementing the POSIX functions - you need something beyond standard C,
such as a way to call OS APIs.

You need non-portable OS calls.) That's why these
things are in the standard library in the first place.


From David Brown@david.brown@hesbynett.no to comp.arch on Thu Oct 10 08:31:52 2024

From Newsgroup: comp.arch

On 09/10/2024 23:37, MitchAlsup1 wrote:

On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

On 09/10/2024 20:10, Thomas Koenig wrote:

David Brown <david.brown@hesbynett.no> schrieb:

When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never".

Sometimes, it is handy to encode certain conditions in pointers,
rather than having only a valid pointer or NULL. A compiler,
for example, might want to store the fact that an error occurred
while parsing a subexpression as a special pointer constant.

Compilers often have the unfair advantage, though, that they can
rely on what application programmers cannot, their implementation
details. (Some do not, such as f2c).

Standard library authors have the same superpowers, so that they can
implement an efficient memmove() even though a pure standard C
programmer cannot (other than by simply calling the standard library
memmove() function!).

This is more a symptom of bad ISA design/evolution than of libc
writers needing superpowers.

No, it is not. It has absolutely /nothing/ to do with the ISA.


From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Oct 10 18:38:55 2024

From Newsgroup: comp.arch

On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

On 09/10/2024 23:37, MitchAlsup1 wrote:

On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

On 09/10/2024 20:10, Thomas Koenig wrote:

David Brown <david.brown@hesbynett.no> schrieb:

When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never".

Sometimes, it is handy to encode certain conditions in pointers,
rather than having only a valid pointer or NULL. A compiler,
for example, might want to store the fact that an error occurred
while parsing a subexpression as a special pointer constant.

Compilers often have the unfair advantage, though, that they can
rely on what application programmers cannot, their implementation
details. (Some do not, such as f2c).

Standard library authors have the same superpowers, so that they can
implement an efficient memmove() even though a pure standard C
programmer cannot (other than by simply calling the standard library
memmove() function!).

This is more a symptom of bad ISA design/evolution than of libc
writers needing superpowers.

No, it is not. It has absolutely /nothing/ to do with the ISA.

For example, if ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
or desired in the libc call.

Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.

From David Brown@david.brown@hesbynett.no to comp.arch on Thu Oct 10 21:21:20 2024

From Newsgroup: comp.arch

On 10/10/2024 20:38, MitchAlsup1 wrote:

On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

On 09/10/2024 23:37, MitchAlsup1 wrote:

On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

On 09/10/2024 20:10, Thomas Koenig wrote:

David Brown <david.brown@hesbynett.no> schrieb:

When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never".

Sometimes, it is handy to encode certain conditions in pointers,
rather than having only a valid pointer or NULL. A compiler,
for example, might want to store the fact that an error occurred
while parsing a subexpression as a special pointer constant.

Compilers often have the unfair advantage, though, that they can
rely on what application programmers cannot, their implementation
details. (Some do not, such as f2c).

Standard library authors have the same superpowers, so that they can
implement an efficient memmove() even though a pure standard C
programmer cannot (other than by simply calling the standard library
memmove() function!).

This is more a symptom of bad ISA design/evolution than of libc
writers needing superpowers.

No, it is not. It has absolutely /nothing/ to do with the ISA.

For example, if ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
or desired in the libc call.

The existence of a dedicated assembly instruction does not let you write
an efficient memmove() in standard C. That's why I said there was no connection between the two concepts.

For some targets, it can be helpful to write memmove() in assembly or
using inline assembly, rather than in non-portable C (which is the
common case).

Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.

It is not that simple.

There can often be trade-offs between the speed of memmove() and
memcpy() on large transfers, and the overhead in setting things up that
is proportionally more costly for small transfers. Often that can be eliminated when the compiler optimises the functions inline - when the compiler knows the size of the move/copy, it can optimise directly.
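
A typical case of a size known at compile time is a fixed-width load or store done through memcpy(). In this sketch (load_u32 and store_u32 are hypothetical helpers), optimising compilers commonly replace the call with one or two plain move instructions, though that is a quality-of-implementation matter and not something the standard promises:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reads a 32-bit value from any alignment. */
static inline uint32_t load_u32(const void *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);    /* size is a compile-time constant */
    return v;
}

/* Hypothetical helper: writes a 32-bit value at any alignment. */
static inline void store_u32(void *p, uint32_t v)
{
    assert(p != NULL);
    memcpy(p, &v, sizeof v);
}
```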

The use of wider register sizes can help to some extent, but not once
you have reached the width of the internal buses or cache bandwidth.

In general, there will be many aspects of a C compiler's code generator,
its run-time support library, and C standard libraries that can work
better if they are optimised for each new generation of processor.
Sometimes you just need to re-compile the library with a newer compiler
and appropriate flags, other times you need to modify the library source
code. None of this is specific to memmove().

But it is true that you get an easier and more future-proof memmove()
and memcpy() if you have an ISA that supports scalable vector
processing of some kind, such as ARM and RISC-V have, rather than
explicitly sized SIMD registers.


From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Oct 10 20:00:29 2024

From Newsgroup: comp.arch

David Brown <david.brown@hesbynett.no> writes:

On 10/10/2024 20:38, MitchAlsup1 wrote:

On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

For example, if ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
or desired in the libc call.

The existence of a dedicated assembly instruction does not let you write
an efficient memmove() in standard C. That's why I said there was no
connection between the two concepts.

For some targets, it can be helpful to write memmove() in assembly or
using inline assembly, rather than in non-portable C (which is the
common case).

Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.

It is not that simple.

There can often be trade-offs between the speed of memmove() and
memcpy() on large transfers, and the overhead in setting things up that
is proportionally more costly for small transfers. Often that can be
eliminated when the compiler optimises the functions inline - when the
compiler knows the size of the move/copy, it can optimise directly.

The use of wider register sizes can help to some extent, but not once
you have reached the width of the internal buses or cache bandwidth.

In general, there will be many aspects of a C compiler's code generator,
its run-time support library, and C standard libraries that can work
better if they are optimised for each new generation of processor.
Sometimes you just need to re-compile the library with a newer compiler
and appropriate flags, other times you need to modify the library source
code. None of this is specific to memmove().

But it is true that you get an easier and more future-proof memmove()
and memcpy() if you have an ISA that supports scalable vector
processing of some kind, such as ARM and RISC-V have, rather than
explicitly sized SIMD registers.

Note that ARMv8 (via FEAT_MOPS) does offer instructions that handle memcpy
and memset.

They're three-instruction sets; prolog/body/epilog. There are separate
sets for forward vs. forward-or-backward copies.

The prolog instruction preconditions the copy and copies
an IMPDEF portion.

The body instruction performs an IMPDEF portion and
the epilog instruction finalizes the copy.

The three instructions are issued consecutively.

From Michael S@already5chosen@yahoo.com to comp.arch on Thu Oct 10 23:54:15 2024

From Newsgroup: comp.arch

On Thu, 10 Oct 2024 20:00:29 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:

David Brown <david.brown@hesbynett.no> writes:

On 10/10/2024 20:38, MitchAlsup1 wrote:

On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

For example, if ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
or desired in the libc call.

The existence of a dedicated assembly instruction does not let you
write an efficient memmove() in standard C. That's why I said there
was no connection between the two concepts.

For some targets, it can be helpful to write memmove() in assembly
or using inline assembly, rather than in non-portable C (which is
the common case).

Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.

It is not that simple.

There can often be trade-offs between the speed of memmove() and
memcpy() on large transfers, and the overhead in setting things up
that is proportionally more costly for small transfers. Often that
can be eliminated when the compiler optimises the functions inline -
when the compiler knows the size of the move/copy, it can optimise
directly.

The use of wider register sizes can help to some extent, but not
once you have reached the width of the internal buses or cache
bandwidth.

In general, there will be many aspects of a C compiler's code
generator, its run-time support library, and C standard libraries
that can work better if they are optimised for each new generation
of processor. Sometimes you just need to re-compile the library with
a newer compiler and appropriate flags, other times you need to
modify the library source code. None of this is specific to
memmove().

But it is true that you get an easier and more future-proof
memmove() and memcpy() if you have an ISA that supports scalable
vector processing of some kind, such as ARM and RISC-V have, rather
than explicitly sized SIMD registers.

Note that ARMv8 (via FEAT_MOPS) does offer instructions that handle
memcpy and memset.

They're three-instruction sets; prolog/body/epilog. There are
separate sets for forward vs. forward-or-backward copies.

The prolog instruction preconditions the copy and copies
an IMPDEF portion.

The body instruction performs an IMPDEF portion and
the epilog instruction finalizes the copy.

The three instructions are issued consecutively.

People that have more clue about Arm Inc schedule than myself
expect Arm Cortex cores that implement these instructions to be
announced next May and to appear in actual [expensive] phones in 2026.
Which probably means 2027 at best for Neoverse cores.

It's hard to make an educated guess about schedule of other Arm core
designers.


From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Oct 10 21:03:33 2024

From Newsgroup: comp.arch

Michael S <already5chosen@yahoo.com> writes:

On Thu, 10 Oct 2024 20:00:29 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:

David Brown <david.brown@hesbynett.no> writes:

On 10/10/2024 20:38, MitchAlsup1 wrote:

On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

For example, if ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
or desired in the libc call.

The existence of a dedicated assembly instruction does not let you
write an efficient memmove() in standard C. That's why I said there
was no connection between the two concepts.

For some targets, it can be helpful to write memmove() in assembly
or using inline assembly, rather than in non-portable C (which is
the common case).

Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.

It is not that simple.

There can often be trade-offs between the speed of memmove() and
memcpy() on large transfers, and the overhead in setting things up
that is proportionally more costly for small transfers. Often that
can be eliminated when the compiler optimises the functions inline -
when the compiler knows the size of the move/copy, it can optimise
directly.

The use of wider register sizes can help to some extent, but not
once you have reached the width of the internal buses or cache
bandwidth.

In general, there will be many aspects of a C compiler's code
generator, its run-time support library, and C standard libraries
that can work better if they are optimised for each new generation
of processor. Sometimes you just need to re-compile the library with
a newer compiler and appropriate flags, other times you need to
modify the library source code. None of this is specific to
memmove().

But it is true that you get an easier and more future-proof
memmove() and memcpy() if you have an ISA that supports scalable
vector processing of some kind, such as ARM and RISC-V have, rather
than explicitly sized SIMD registers.

Note that ARMv8 (via FEAT_MOPS) does offer instructions that handle
memcpy and memset.

They're three-instruction sets; prolog/body/epilog. There are
separate sets for forward vs. forward-or-backward copies.

The prolog instruction preconditions the copy and copies
an IMPDEF portion.

The body instruction performs an IMPDEF portion and
the epilog instruction finalizes the copy.

The three instructions are issued consecutively.

People that have more clue about Arm Inc schedule than myself
expect Arm Cortex cores that implement these instructions to be
announced next May and to appear in actual [expensive] phones in 2026.
Which probably means 2027 at best for Neoverse cores.

It's hard to make an educated guess about schedule of other Arm core
designers.

In the meantime, they've had "DC ZVA" for the special case of
memset(,0,) since ARMv8.0.

From Brian G. Lucas@bagel99@gmail.com to comp.arch on Thu Oct 10 16:19:31 2024

From Newsgroup: comp.arch

On 10/10/24 2:21 PM, David Brown wrote:
[ SNIP]

The existence of a dedicated assembly instruction does not let you write an efficient memmove() in standard C. That's why I said there was no connection
between the two concepts.

If the compiler generates the memmove instruction, then one doesn't
have to write memmove() in C - it is never called/used.

For some targets, it can be helpful to write memmove() in assembly or using inline assembly, rather than in non-portable C (which is the common case).

Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.

It is not that simple.

There can often be trade-offs between the speed of memmove() and memcpy() on large transfers, and the overhead in setting things up that is proportionally
more costly for small transfers. Often that can be eliminated when the compiler optimises the functions inline - when the compiler knows the size of
the move/copy, it can optimise directly.

The use of wider register sizes can help to some extent, but not once you have
reached the width of the internal buses or cache bandwidth.

In general, there will be many aspects of a C compiler's code generator, its run-time support library, and C standard libraries that can work better if they
are optimised for each new generation of processor. Sometimes you just need to
re-compile the library with a newer compiler and appropriate flags, other times
you need to modify the library source code. None of this is specific to memmove().

But it is true that you get an easier and more future-proof memmove() and memcpy() if you have an ISA that supports scalable vector processing of some
kind, such as ARM and RISC-V have, rather than explicitly sized SIMD registers.

Not applicable.


From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Oct 10 21:30:38 2024

From Newsgroup: comp.arch

On Thu, 10 Oct 2024 19:21:20 +0000, David Brown wrote:

On 10/10/2024 20:38, MitchAlsup1 wrote:

This is more a symptom of bad ISA design/evolution than of libc
writers needing superpowers.

No, it is not. It has absolutely /nothing/ to do with the ISA.

For example, if ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
or desired in the libc call.

The existence of a dedicated assembly instruction does not let you write
an efficient memmove() in standard C.

{
memmove( p, q, size );
}

Where the compiler produces the MM instruction itself. Looks damn
close to standard C to me !!
OR
for( int i = 0; i < size; i++ )
p[i] = q[i];

Which gets compiled to memcpy()--also looks to be standard C.
OR

p_struct = q_struct;

gets compiled to::

memmove( &p_struct, &q_struct, sizeof( q_struct ) );

also looks to be std C.

The thing is you are no longer writing memmove(), you are simply
teaching the compiler to recognize its _use_ cases directly. In
addition, these will always be within spitting distance of as fast
as one can perform those activities.

That's why I said there was no connection between the two concepts.

For some targets, it can be helpful to write memmove() in assembly or
using inline assembly, rather than in non-portable C (which is the
common case).

Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.

It is not that simple.

There can often be trade-offs between the speed of memmove() and
memcpy() on large transfers, and the overhead in setting things up that
is proportionally more costly for small transfers.

Given that we are talking about GBOoO machines here, the several
AGEN units[1,2,3] have plenty of calculation BW to determine order
without wasting cycles getting started.

But given an LBIO machine, the ability to process memory to memory moves
at cache port width is always an advantage except for cases needing
only 1 read or 1 write--if you build the HW with these in mind.

Often that can be eliminated when the compiler optimises the functions inline - when the compiler knows the size of the move/copy, it can optimise directly.

In HW they should always be optimized.

From David Brown@david.brown@hesbynett.no to comp.arch on Fri Oct 11 13:37:03 2024

From Newsgroup: comp.arch

On 10/10/2024 23:19, Brian G. Lucas wrote:

On 10/10/24 2:21 PM, David Brown wrote:
[ SNIP]

The existence of a dedicated assembly instruction does not let you
write an efficient memmove() in standard C. That's why I said there
was no connection between the two concepts.

If the compiler generates the memmove instruction, then one doesn't
have to write memmove() in C - it is never called/used.

The common case is that a good compiler will generate inline code for
some cases - typically known (at compile-time) small sizes - and call a generic library function when the size is not known or is over a certain
size. Then there are some targets where it will always call the library
code, and some where it will always generate inline code.

Even if the compiler /can/ generate inline code, there can be
circumstances when it will not do so - such as if you have not enabled optimisation, or are optimising for size, or using a weaker compiler, or calling the function indirectly.
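As a rough illustration of the two cases described above, here is a minimal sketch. The names copy_fixed, copy_variable and struct packet are made up for the example, and whether either call is actually expanded inline depends on the compiler, target and flags.

```c
#include <stddef.h>
#include <string.h>

struct packet { unsigned char bytes[16]; };

/* The size is a compile-time constant, so an optimising compiler will
   typically expand this into a few loads and stores. */
void copy_fixed(struct packet *dst, const struct packet *src)
{
    memcpy(dst, src, sizeof *dst);
}

/* The size is only known at run time, so this is commonly compiled
   as a call to the library memmove(). */
void copy_variable(void *dst, const void *src, size_t n)
{
    memmove(dst, src, n);
}
```

Comparing the generated assembly for both functions at different optimisation levels is an easy way to see which path a given toolchain takes.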

For some targets, it can be helpful to write memmove() in assembly or
using inline assembly, rather than in non-portable C (which is the
common case).

Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.

It is not that simple.

There can often be trade-offs between the speed of memmove() and
memcpy() on large transfers, and the overhead in setting things up
that is proportionally more costly for small transfers. Often that
can be eliminated when the compiler optimises the functions inline -
when the compiler knows the size of the move/copy, it can optimise
directly.

The use of wider register sizes can help to some extent, but not once
you have reached the width of the internal buses or cache bandwidth.

In general, there will be many aspects of a C compiler's code
generator, its run-time support library, and C standard libraries that
can work better if they are optimised for each new generation of
processor. Sometimes you just need to re-compile the library with a
newer compiler and appropriate flags, other times you need to modify
the library source code. None of this is specific to memmove().

But it is true that you get an easier and more future-proof memmove()
and memcpy() if you have an ISA that supports scalable vector
processing of some kind, such as ARM and RISC-V have, rather than
explicitly sized SIMD registers.

Not applicable.

I don't understand what you mean by that. /What/ is not applicable to
/what/ ?


From David Brown@david.brown@hesbynett.no to comp.arch on Fri Oct 11 14:10:13 2024

From Newsgroup: comp.arch

On 10/10/2024 23:30, MitchAlsup1 wrote:

On Thu, 10 Oct 2024 19:21:20 +0000, David Brown wrote:

On 10/10/2024 20:38, MitchAlsup1 wrote:

This is more a symptom of bad ISA design/evolution than of libc
writers needing superpowers.

No, it is not. It has absolutely /nothing/ to do with the ISA.

For example, if ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
or desired in the libc call.

The existence of a dedicated assembly instruction does not let you write
an efficient memmove() in standard C.

      {
           memmove( p, q, size );
      }

What is that circular reference supposed to do? The whole discussion
has been about the /fact/ that you cannot implement the "memmove"
function in a C standard library using fully portable standard C code.

Do you think you can just write this :

void * memmove(void * s1, const void * s2, size_t n)
{
return memmove(s1, s2, n);
}

in your library's source?

You can implement "memcpy" in portable standard C, using a loop and
array or pointer syntax (somewhat like your loop below, but with the
correct type for the index). But you cannot do so for memmove() because
you cannot identify the direction you need to run your loop in an
efficient and fully portable manner.
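A minimal sketch of the memcpy() half of that claim, assuming nothing beyond standard C. The name my_memcpy is only for illustration; a real library version would be far more elaborate.

```c
#include <stddef.h>

/* Byte-wise copy in portable standard C. As with memcpy(), the
   behaviour is undefined if the two regions overlap. */
void *my_memcpy(void *restrict dst, const void *restrict src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    for (size_t i = 0; i < n; i++)   /* size_t index, not int */
        d[i] = s[i];
    return dst;
}
```

The loop always runs forwards, which is exactly why it cannot serve as memmove(): choosing the direction would need a relational comparison between pointers that may point into different objects.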

It does not matter what the target is - the target is totally irrelevant
for /portable/ standard C code. If the target made a difference, it
would not be portable!

I can't understand why this is causing you difficulty.

Perhaps you simply didn't understand what you wrote a few posts back,
when you claimed that the reason people writing portable standard C code cannot write an efficient memmove() implementation is "a symptom of bad
ISA design".

Where the compiler produces the MM instruction itself. Looks damn
close to standard C to me !!
OR
      for( int i = 0; i < size; i++ )
           p[i] = q[i];

Which gets compiled to memcpy()--also looks to be standard C.
OR

      p_struct = q_struct;

gets compiled to::

      memmove( &p_struct, &q_struct, sizeof( q_struct ) );

also looks to be std C.

Those are standard C, yes. And a good compiler will optimise such code.
And if the target has some kind of scalable vector support or other dedicated instructions for moving or copying memory, it can do a better
job of optimising the code.

That has /nothing/ to do with the point under discussion.

I think you are simply confused about what you are talking about here.
Either you don't know what is meant by writing portable standard C, or
you don't know what is meant by implementing a C standard library, or
you haven't actually been reading the posts you replied to. You seem determined to make the point that /your/ ISA has useful and efficient instructions and features for memory copy functionality, while the x86
ISA does not, and that means /your/ ISA is good design and the x86 ISA
is bad design.

Now, I will fully agree with you that the x86 is not a good design. The modern x86 processor devices are proof that you /can/ polish a turd.
And I fully agree with you that instructions for arbitrary-length vector operations of various sorts (of which memory copying is the simplest) have many advantages over SIMD using fixed-size vector
registers. (ARM and RISC-V also agree with you there.)

But that is all irrelevant to the discussion.


From Michael S@already5chosen@yahoo.com to comp.arch on Fri Oct 11 15:13:17 2024

From Newsgroup: comp.arch

On Fri, 11 Oct 2024 13:37:03 +0200
David Brown <david.brown@hesbynett.no> wrote:

On 10/10/2024 23:19, Brian G. Lucas wrote:

Not applicable.

I don't understand what you mean by that. /What/ is not applicable
to /what/ ?

Brian probably meant to say that it is not applicable to his my66k
LLVM back end.

But I am pretty sure that what you suggest is applicable, but a bad idea
for memcpy/memmove routine that targets Arm+SVE.
Dynamic dispatch based on concrete core features/identification, i.e.
exactly the same mechanism that is done on "non-scalable"
architectures, would provide better performance. And memcpy/memmove is certainly sufficiently important to justify an additional development
effort.


From David Brown@david.brown@hesbynett.no to comp.arch on Fri Oct 11 16:54:13 2024

From Newsgroup: comp.arch

On 11/10/2024 14:13, Michael S wrote:

On Fri, 11 Oct 2024 13:37:03 +0200
David Brown <david.brown@hesbynett.no> wrote:

On 10/10/2024 23:19, Brian G. Lucas wrote:

Not applicable.

I don't understand what you mean by that. /What/ is not applicable
to /what/ ?

Brian probably meant to say that it is not applicable to his my66k
LLVM back end.

But I am pretty sure that what you suggest is applicable, but a bad idea
for memcpy/memmove routine that targets Arm+SVE.
Dynamic dispatch based on concrete core features/identification, i.e.
exactly the same mechanism that is done on "non-scalable"
architectures, would provide better performance. And memcpy/memmove is certainly sufficiently important to justify an additional development
effort.

That explanation helps a little, but only a little. I wasn't suggesting anything - or if I was, it was several posts ago and the context has
long since been snipped. Can you be more explicit about what you think
I was suggesting, and why it might not be a good idea for targeting a
"my66k" ISA? (That is not a processor I have heard of, so you'll have
to give a brief summary of any particular features that are relevant here.)


From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.arch on Fri Oct 11 08:15:29 2024

From Newsgroup: comp.arch

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:

On 10/9/2024 1:20 PM, David Brown wrote:

There are lots of parts of the standard C library that cannot be
written completely in portable standard C. (How would you write
a function that handles files? You need non-portable OS calls.)
That's why these things are in the standard library in the first
place.

I agree with everything you say up until the last sentence. There
are several languages, mostly older ones like Fortran and COBOL,
where the file handling/I/O are defined portably within the
language proper, not in a separate library. It just moves the
non-portable stuff from the library writer (as in C) to the
compiler writer (as in Fortran, COBOL, etc.)

What I think you mean is that I/O and file handling are defined as
part of the language rather than being written in the language.
Assuming that's true, what you're saying is not at odds with what
David said. I/O and so forth cannot be written in unaugmented
standard C without changing the language. Given the language as
it is, these things must be put in the standard library, because
they cannot be provided in the existing language.

There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library. In
particular, it makes for a very clean distinction between two
kinds of implementation, what the C standard calls a freestanding implementation (which excludes most of the library) and a hosted
implementation (which includes the whole library). This facility
is what allows C to run easily on very small processors, because
there is no overhead for non-essential language features. That is
not to say such things couldn't be arranged for Fortran or COBOL,
but it would be harder, because those languages are not designed
to be separable.
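The hosted/freestanding distinction is visible to source code through the standard macro __STDC_HOSTED__ (C99 and later). A minimal sketch follows; board_putc is a hypothetical user-supplied output routine and is only referenced in the freestanding branch.

```c
/* __STDC_HOSTED__ is 1 for a hosted implementation and 0 for a
   freestanding one. */
#if __STDC_HOSTED__
#include <stdio.h>
static void report(const char *msg) { puts(msg); }
#else
/* Freestanding: no <stdio.h> is guaranteed, so output goes through a
   hypothetical board-specific routine. */
extern void board_putc(char c);
static void report(const char *msg) { while (*msg) board_putc(*msg++); }
#endif

int implementation_is_hosted(void)
{
    report("starting up");
    return __STDC_HOSTED__;
}
```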

From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Oct 11 18:55:29 2024

From Newsgroup: comp.arch

On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:

Do you think you can just write this :

void * memmove(void * s1, const void * s2, size_t n)
{
return memmove(s1, s2, n);
}

in your library's source?

.global memmove
memmove:
MM R2,R1,R3
RET

sure !

From David Brown@david.brown@hesbynett.no to comp.arch on Sat Oct 12 00:02:32 2024

From Newsgroup: comp.arch

On 11/10/2024 20:55, MitchAlsup1 wrote:

On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:

Do you think you can just write this :

void * memmove(void * s1, const void * s2, size_t n)
{
    return memmove(s1, s2, n);
}

in your library's source?

      .global memmove
memmove:
      MM     R2,R1,R3
      RET

sure !

You are either totally clueless, or you are trolling. And I know you
are not clueless.

This discussion has become pointless.


From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Oct 11 23:32:20 2024

From Newsgroup: comp.arch

On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:

On 11/10/2024 20:55, MitchAlsup1 wrote:

On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:

Do you think you can just write this :

void * memmove(void * s1, const void * s2, size_t n)
{
    return memmove(s1, s2, n);
}

in your library's source?

      .global memmove
memmove:
      MM     R2,R1,R3
      RET

sure !

You are either totally clueless, or you are trolling. And I know you
are not clueless.

This discussion has become pointless.

The point is that there are a few things that may be hard to do
with {decode, pipeline, calculations, specifications...}; but
because they are so universally needed; these, too, should
"get into ISA".

One good reason to put them in ISA is to preserve the programmers'
efforts over decades, so they don't have to re-write libc every
time a new set of instructions comes out.

Moving an arbitrary amount of memory from point a to point b
happens to fall into that universal need. Setting an arbitrary
amount of memory to a value also falls into that universal
need.

From Brett@ggtgp@yahoo.com to comp.arch on Sat Oct 12 05:06:05 2024

From Newsgroup: comp.arch

MitchAlsup1 <mitchalsup@aol.com> wrote:

On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:

Do you think you can just write this :

void * memmove(void * s1, const void * s2, size_t n)
{
return memmove(s1, s2, n);
}

in your library's source?

.global memmove
memmove:
MM R2,R1,R3
RET

sure !

Can R3 be a const, that causes issues for restartability, but branch
prediction is easier and the code is shorter.

Though I guess forwarding a const is probably a thing today to improve
branch prediction, which is normally HORRIBLE for short branch counts.


From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.arch on Sat Oct 12 05:11:44 2024

From Newsgroup: comp.arch

mitchalsup@aol.com (MitchAlsup1) writes:

On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

On 09/10/2024 20:10, Thomas Koenig wrote:

David Brown <david.brown@hesbynett.no> schrieb:

When would you ever /need/ to compare pointers to different
objects? For almost all C programmers, the answer is "never".

Sometimes, it is handy to encode certain conditions in pointers,
rather than having only a valid pointer or NULL. A compiler,
for example, might want to store the fact that an error occurred
while parsing a subexpression as a special pointer constant.

Compilers often have the unfair advantage, though, that they can
rely on what application programmers cannot, their implementation
details. (Some do not, such as f2c).

Standard library authors have the same superpowers, so that they
can implement an efficient memmove() even though a pure standard C
programmer cannot (other than by simply calling the standard
library memmove() function!).

This is more a symptom of bad ISA design/evolution than of libc
writers needing superpowers.

Throughout this long thread you keep missing the point. Having
different instructions available doesn't change the definition
of the C language. It is possible to write code in standard C
(which means, C that does NOT depend on any internal details of
any implementation) to copy bytes from one place to another with
semantics matching those of memmove(), BUT that code is clunky.
To get a decent implementation of memmove() semantics requires
knowledge of some internal implementation details that are not
part of standard C. Whether those details are part of the
compiler or part of the runtime environment (the library) is
irrelevant - they still aren't part of standard C. Adding new
instructions to the ISA, no matter what those new instructions
are, cannot change that.
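One possible shape for such clunky-but-conforming code is sketched below. It is an illustration under my reading of the standard, not necessarily what was meant above: it uses only equality comparison between pointers, which is defined for any two pointers, and avoids the relational operators, which are not defined for pointers into different objects. The name portable_memmove is made up.

```c
#include <stddef.h>

/* memmove() semantics using only == on pointers. The overlap test costs
   O(n) comparisons, which is the clunky part. */
void *portable_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    int dst_inside_src = 0;

    /* Does dst start strictly inside the source region? */
    for (size_t i = 1; i < n; i++) {
        if (s + i == d) {
            dst_inside_src = 1;
            break;
        }
    }

    if (dst_inside_src) {
        /* Copy backwards so source bytes are read before being overwritten. */
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    } else {
        /* A forward copy is safe in every other case. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    }
    return dst;
}
```

A library implementation would instead compare the two addresses directly, relying on knowledge of the target's address space.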

From David Brown@david.brown@hesbynett.no to comp.arch on Sat Oct 12 17:16:44 2024

From Newsgroup: comp.arch

On 12/10/2024 01:32, MitchAlsup1 wrote:

On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:

On 11/10/2024 20:55, MitchAlsup1 wrote:

On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:

Do you think you can just write this :

void * memmove(void * s1, const void * s2, size_t n)
{
    return memmove(s1, s2, n);
}

in your library's source?

       .global memmove
memmove:
       MM     R2,R1,R3
       RET

sure !

You are either totally clueless, or you are trolling. And I know you
are not clueless.

This discussion has become pointless.

The point is that there are a few things that may be hard to do
with {decode, pipeline, calculations, specifications...}; but
because they are so universally needed; these, too, should
"get into ISA".

One good reason to put them in ISA is to preserve the programmers'
efforts over decades, so they don't have to re-write libc every
time a new set of instructions comes out.

Moving an arbitrary amount of memory from point a to point b
happens to fall into that universal need. Setting an arbitrary
amount of memory to a value also falls into that universal
need.

Again, I have to ask - do you bother to read the posts you reply to?
Are you interested in replying, and engaging in the discussion? Or are
you just looking for a chance to promote your own architecture, no
matter how tenuous the connection might be to other posts?

Again, let me say that I agree with what you are saying - I agree that
an ISA should have instructions that are efficient for what people
actually want to do. I agree that it is a good thing to have
instructions that let performance scale with advances in hardware
ideally without needing changes in compiled binaries, and at least
without needing changes in source code.

I believe there is an interesting discussion to be had here, and I would
enjoy hearing about comparisons of different ways functions like memcpy() and memset() can be implemented in different architectures and optimised for different sizes, or how scalable vector instructions can
work in comparison to fixed-size SIMD instructions.

But at the moment, this potential is lost because you are posting total
shite about implementing memmove() in standard C. It is disappointing
that someone with your extensive knowledge and experience cannot see
this. I am finding it all very frustrating.


From Bernd Linsel@bl1-thispartdoesnotbelonghere@gmx.com to comp.arch on Sat Oct 12 19:26:30 2024

From Newsgroup: comp.arch

On 12.10.24 17:16, David Brown wrote:

[snip rant]

You are aware that this is c.arch, not c.lang.c?
--
Bernd Linsel

From Brian G. Lucas@bagel99@gmail.com to comp.arch on Sat Oct 12 12:36:43 2024

From Newsgroup: comp.arch

On 10/12/24 12:06 AM, Brett wrote:

MitchAlsup1 <mitchalsup@aol.com> wrote:

On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:

Do you think you can just write this :

void * memmove(void * s1, const void * s2, size_t n)
{
return memmove(s1, s2, n);
}

in your library's source?

.global memmove
memmove:
MM R2,R1,R3
RET

sure !

Can R3 be a const, that causes issues for restartability, but branch prediction is easier and the code is shorter.

Yes.
#include <string.h>

void memmoverr(char to[], char fm[], size_t cnt)
{
memmove(to, fm, cnt);
}

void memmoverd(char to[], char fm[])
{
memmove(to, fm, 0x100000000);
}
Yields:
memmoverr: ; @memmoverr
mm r1,r2,r3
ret
memmoverd: ; @memmoverd
mm r1,r2,#4294967296
ret

Though I guess forwarding a const is probably a thing today to improve
branch prediction, which is normally HORRIBLE for short branch counts.


From Brett@ggtgp@yahoo.com to comp.arch on Sat Oct 12 18:17:18 2024

From Newsgroup: comp.arch

Brian G. Lucas <bagel99@gmail.com> wrote:

On 10/12/24 12:06 AM, Brett wrote:

MitchAlsup1 <mitchalsup@aol.com> wrote:

On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:

Do you think you can just write this :

void * memmove(void * s1, const void * s2, size_t n)
{
return memmove(s1, s2, n);
}

in your library's source?

.global memmove
memmove:
MM R2,R1,R3
RET

sure !

Can R3 be a const, that causes issues for restartability, but branch
prediction is easier and the code is shorter.

Yes.
#include <string.h>

void memmoverr(char to[], char fm[], size_t cnt)
{
memmove(to, fm, cnt);
}

void memmoverd(char to[], char fm[])
{
memmove(to, fm, 0x100000000);
}
Yields:
memmoverr: ; @memmoverr
mm r1,r2,r3
ret
memmoverd: ; @memmoverd
mm r1,r2,#4294967296
ret

Excellent!

Though I guess forwarding a const is probably a thing today to improve
branch prediction, which is normally HORRIBLE for short branch counts.

What is the default virtual loop count if the register count is not
available?

Worst case the source and dest are in cache, and the count is 150 cycles
away in memory. So hundreds of chars could be copied until the value is
loaded and that count value could be say 5. Lots of work and time
discarded, so you play the odds, perhaps to the low side and over prefetch
to cover being wrong.


From Brett@ggtgp@yahoo.com to comp.arch on Sat Oct 12 18:33:17 2024

From Newsgroup: comp.arch

David Brown <david.brown@hesbynett.no> wrote:

On 12/10/2024 01:32, MitchAlsup1 wrote:

On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:

On 11/10/2024 20:55, MitchAlsup1 wrote:

On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:

Do you think you can just write this :

void * memmove(void * s1, const void * s2, size_t n)
{
    return memmove(s1, s2, n);
}

in your library's source?

       .global memmove
memmove:
       MM     R2,R1,R3
       RET

sure !

You are either totally clueless, or you are trolling. And I know you
are not clueless.

This discussion has become pointless.

The point is that there are a few things that may be hard to do
with {decode, pipeline, calculations, specifications...}; but
because they are so universally needed; these, too, should
"get into ISA".

One good reason to put them in ISA is to preserve the programmers'
efforts over decades, so they don't have to re-write libc every
time a new set of instructions comes out.

Moving an arbitrary amount of memory from point a to point b
happens to fall into that universal need. Setting an arbitrary
amount of memory to a value also falls into that universal
need.

Again, I have to ask - do you bother to read the posts you reply to?
Are you interested in replying, and engaging in the discussion? Or are
you just looking for a chance to promote your own architecture, no
matter how tenuous the connection might be to other posts?

Again, let me say that I agree with what you are saying - I agree that
an ISA should have instructions that are efficient for what people
actually want to do. I agree that it is a good thing to have
instructions that let performance scale with advances in hardware
ideally without needing changes in compiled binaries, and at least
without needing changes in source code.

I believe there is an interesting discussion to be had here, and I would enjoy hearing about comparisons of different ways functions like memcpy() and memset() can be implemented in different architectures and optimised for different sizes, or how scalable vector instructions can
work in comparison to fixed-size SIMD instructions.

But at the moment, this potential is lost because you are posting total shite about implementing memmove() in standard C. It is disappointing
that someone with your extensive knowledge and experience cannot see
this. I am finding it all very frustrating.

There are only two decisions to make in memcpy, are the copies less than
copy sized aligned, and do the pointers overlap in copy size.

For hardware this simplifies down to perhaps two types of copies, easy and hard.

If you make hard fast, and you will, then two versions is all you need, not
the dozens of choices with 1k of code you need in C.

Often you know which of the two you want at compile time from the pointer
type.

In short, your complaints are wrong-headed in not understanding what
hardware memcpy can do.


From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sat Oct 12 18:32:48 2024

From Newsgroup: comp.arch

On Sat, 12 Oct 2024 5:06:05 +0000, Brett wrote:

MitchAlsup1 <mitchalsup@aol.com> wrote:

On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:

Do you think you can just write this :

void * memmove(void * s1, const void * s2, size_t n)
{
return memmove(s1, s2, n);
}

in your library's source?

.global memmove
memmove:
MM R2,R1,R3
RET

sure !

Can R3 be a const, that causes issues for restartability, but branch prediction is easier and the code is shorter.

The 3rd Operand can, indeed, be a constant.
That causes no restartability problem when you have a place to
store the current count==index, so that when control returns
and you re-execute MM, it sees that x amount has already been
done, and C-X is left.
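A software model of that restart behaviour may make it concrete. This is only a sketch in C, not a description of how My 66000 hardware does it: struct mm_state and mm_step are invented names, the "interrupt" is simulated by a byte budget, and the regions are assumed not to overlap in a way that would need a backward copy.

```c
#include <stddef.h>

/* 'done' plays the role of the saved count/index. */
struct mm_state {
    unsigned char *dst;
    const unsigned char *src;
    size_t count;   /* total bytes requested (C) */
    size_t done;    /* bytes already moved (X) */
};

/* Move at most 'budget' bytes, then return as if interrupted.
   Returns 1 once the whole transfer has completed, 0 otherwise. */
int mm_step(struct mm_state *st, size_t budget)
{
    while (st->done < st->count && budget > 0) {
        st->dst[st->done] = st->src[st->done];
        st->done++;
        budget--;
    }
    return st->done == st->count;
}
```

Calling mm_step() again with the same state resumes where the previous call stopped, so a constant count needs no register of its own to be decremented.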

Though I guess forwarding a const is probably a thing today to improve
branch prediction, which is normally HORRIBLE for short branch counts.

That is what Predication is for.

From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sat Oct 12 18:37:35 2024

From Newsgroup: comp.arch

On Sat, 12 Oct 2024 18:17:18 +0000, Brett wrote:

Brian G. Lucas <bagel99@gmail.com> wrote:

On 10/12/24 12:06 AM, Brett wrote:

MitchAlsup1 <mitchalsup@aol.com> wrote:

On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:

Do you think you can just write this :

void * memmove(void * s1, const void * s2, size_t n)
{
return memmove(s1, s2, n);
}

in your library's source?

.global memmove
memmove:
MM R2,R1,R3
RET

sure !

Can R3 be a const, that causes issues for restartability, but branch
prediction is easier and the code is shorter.

Yes.
#include <string.h>

void memmoverr(char to[], char fm[], size_t cnt)
{
memmove(to, fm, cnt);
}

void memmoverd(char to[], char fm[])
{
memmove(to, fm, 0x100000000);
}
Yields:
memmoverr: ; @memmoverr
mm r1,r2,r3
ret
memmoverd: ; @memmoverd
mm r1,r2,#4294967296
ret

Excellent!

Though I guess forwarding a const is probably a thing today to improve
branch prediction, which is normally HORRIBLE for short branch counts.

What is the default virtual loop count if the register count is not available?

There is always a count available; it can come from a register or an
immediate.

Worst case the source and dest are in cache, and the count is 150 cycles
away in memory. So hundreds of chars could be copied until the value is loaded and that count value could be say 5.

The instruction cannot start until the count is known. You don't start
an FMAC until all 3 operands are ready, either.

Lots of work and time
discarded, so you play the odds, perhaps to the low side and over
prefetch to cover being wrong.


From Brett@ggtgp@yahoo.com to comp.arch on Sun Oct 13 01:25:13 2024

From Newsgroup: comp.arch

MitchAlsup1 <mitchalsup@aol.com> wrote:

On Sat, 12 Oct 2024 18:17:18 +0000, Brett wrote:

Brian G. Lucas <bagel99@gmail.com> wrote:

On 10/12/24 12:06 AM, Brett wrote:

MitchAlsup1 <mitchalsup@aol.com> wrote:

On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:

Do you think you can just write this :

void * memmove(void * s1, const void * s2, size_t n)
{
return memmove(s1, s2, n);
}

in your library's source?

.global memmove
memmove:
MM R2,R1,R3
RET

sure !

Can R3 be a const, that causes issues for restartability, but branch
prediction is easier and the code is shorter.

Yes.
#include <string.h>

void memmoverr(char to[], char fm[], size_t cnt)
{
memmove(to, fm, cnt);
}

void memmoverd(char to[], char fm[])
{
memmove(to, fm, 0x100000000);
}
Yields:
memmoverr: ; @memmoverr
mm r1,r2,r3
ret
memmoverd: ; @memmoverd
mm r1,r2,#4294967296
ret

Excellent!

Though I guess forwarding a const is probably a thing today to improve
branch prediction, which is normally HORRIBLE for short branch counts.

What is the default virtual loop count if the register count is not
available?

There is always a count available; it can come from a register or an immediate.

Worst case the source and dest are in cache, and the count is 150 cycles
away in memory. So hundreds of chars could be copied until the value is
loaded and that count value could be say 5.

The instruction cannot start until the count is known. You don't start
an FMAC until all 3 operands are ready, either.

That simplifies a lot of issues, thanks!

Lots of work and time
discarded, so you play the odds, perhaps to the low side and over
prefetch to cover being wrong.


From Paul A. Clayton@paaronclayton@gmail.com to comp.arch on Sat Oct 12 23:09:27 2024

From Newsgroup: comp.arch

On 10/12/24 2:37 PM, MitchAlsup1 wrote:

On Sat, 12 Oct 2024 18:17:18 +0000, Brett wrote:

[snip]

Worst case the source and dest are in cache, and the count is 150 cycles
away in memory. So hundreds of chars could be copied until the value is
loaded and that count value could be say 5.

The instruction cannot start until the count is known. You don't start
an FMAC until all 3 operands are ready, either.

This is not _strictly_ true. Some ARM implementations start an
FMADD before the addend is available when it is known that it
will be available in time. This allows dependent accumulation
with a latency equal to the ADD part.

One might even be able to start the shift to align addend and
product early as this value is easy to calculate for normal FP
values.

In many microarchitectures, an operation will be scheduled to
execute when an L1 cache hit would be expected to make an operand
available. I.e., the instruction "starts" before the operand is
actually available.

With branch prediction, a branch instruction is "started" before
the condition has been evaluated. Your statement implies that
My 66000 MM implementations will not do such prediction.

In the case of a memory copy, performing rollback of
misspeculation is potentially much easier than in the general case
of a loop with store operations.

Memory copy also facilitates deeper speculation. The source data
can be preserved in memory more readily than arbitrary sequences
of register contents. If both source and destination start points
are known, destination reads can be translated into source reads
within a speculation domain. (The source could also be prefetched
before the destination is known.)

It does seem that My 66000's MM does not completely eliminate the
potential for faster special case software even if every
implementation is perfect. Software might know that the tail part
of a cache block that is not overwritten is dead data. This can
avoid a read for ownership of the last destination block, software
could do a cache block zero for the last block and then copy the
data over that. This special case might apply for appending to a
buffer.

I do not know that adding a MM instruction variant to handle that
special case would be worthwhile.

I am skeptical that all implementations of MM would be perfect,
i.e., perform at least as well as software more specifically
controlling hardware if such control had been provided by the ISA.
E.g., ISA support for byte-masks for stores might not only allow
non-contiguous stores (such as updating more than one field in a
structure while leaving other intermediately placed fields
unchanged) but might have higher performance than a general MM if
the source happened to be replicated in a register.

"Hard cases make bad law" may be generalized to special cases make
bad (general) interfaces. Clean interfaces that can be implemented
almost optimally have advantages over complicated interfaces that
can theoretically handle more cases optimally **if one uses the
proper (highly specific) incantation!!!**

From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Sun Oct 13 10:31:49 2024

From Newsgroup: comp.arch

On 2024-10-12 21:33, Brett wrote:

David Brown <david.brown@hesbynett.no> wrote:

On 12/10/2024 01:32, MitchAlsup1 wrote:

On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:

On 11/10/2024 20:55, MitchAlsup1 wrote:

On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:

Do you think you can just write this :

void * memmove(void * s1, const void * s2, size_t n)
{
    return memmove(s1, s2, n);
}

in your library's source?

       .global memmove
memmove:
       MM     R2,R1,R3
       RET

sure !

You are either totally clueless, or you are trolling. And I know you
are not clueless.

This discussion has become pointless.

The point is that there are a few things that may be hard to do
with {decode, pipeline, calculations, specifications...}; but
because they are so universally needed; these, too, should
"get into ISA".

One good reason to put them in ISA is to preserve the programmers
efforts over decades, so they don't have to re-write libc every-
time a new set of instructions come out.

Moving an arbitrary amount of memory from point a to point b
happens to fall into that universal need. Setting an arbitrary
amount of memory to a value also falls into that universal
need.

Again, I have to ask - do you bother to read the posts you reply to?
Are you interested in replying, and engaging in the discussion? Or are
you just looking for a chance to promote your own architecture, no
matter how tenuous the connection might be to other posts?

Again, let me say that I agree with what you are saying - I agree that
an ISA should have instructions that are efficient for what people
actually want to do. I agree that it is a good thing to have
instructions that let performance scale with advances in hardware
ideally without needing changes in compiled binaries, and at least
without needing changes in source code.

I believe there is an interesting discussion to be had here, and I would
enjoy hearing about comparisons of the different ways functions like
memcpy() and memset() can be implemented in different architectures and
optimised for different sizes, or how scalable vector instructions can
work in comparison to fixed-size SIMD instructions.

But at the moment, this potential is lost because you are posting total
shite about implementing memmove() in standard C. It is disappointing
that someone with your extensive knowledge and experience cannot see
this. I am finding it all very frustrating.

[ snip discussion of HW ]

In short your complaints are wrong headed in not understanding what
hardware memcpy can do.

I think your reply proves David's complaint: you did not read, or did
not understand, what David is frustrated about. The true fact that David
is defending is that memmove() cannot be implemented "efficiently" in
/standard/ C source code, on /any/ HW, because it would require
comparing /C pointers/ that point to potentially different /C objects/,
which is not defined behavior in standard C, whether compiled to machine
code, or executed by an interpreter of C code, or executed by a human
programmer performing what was called "desk testing" in the 1960s.

Obviously memmove() can be implemented efficiently in non-standard C
where such pointers can be compared, or by sequences of ordinary ALU
instructions, or by dedicated instructions such as Mitch's MM, and David
is not disputing that. But Mitch seems not to understand or not to see
the issue about standard C vs memmove().


From Michael S@already5chosen@yahoo.com to comp.arch on Sun Oct 13 10:56:20 2024

From Newsgroup: comp.arch

On Sat, 12 Oct 2024 18:32:48 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:

On Sat, 12 Oct 2024 5:06:05 +0000, Brett wrote:

MitchAlsup1 <mitchalsup@aol.com> wrote:

On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:

Do you think you can just write this :

void * memmove(void * s1, const void * s2, size_t n)
{
return memmove(s1, s2, n);
}

in your library's source?

.global memmove
memmove:
MM R2,R1,R3
RET

sure !

Can R3 be a const? That causes issues for restartability, but branch
prediction is easier and the code is shorter.

The 3rd Operand can, indeed, be a constant.
That causes no restartability problem when you have a place to
store the current count==index, so that when control returns
and you re-execute MM, it sees that x amount has already been
done, and C-X is left.

I don't understand this paragraph.
Does constant as a 3rd operand cause restartability problem?
Or does it not?
If it does not, then how?
Do you have a private field in thread state? Saved on stack by
interrupt uCode?
OS people would not like it. They prefer to have full control even when
they don't use it 99.999% of the time.


From Michael S@already5chosen@yahoo.com to comp.arch on Sun Oct 13 12:00:48 2024

From Newsgroup: comp.arch

On Fri, 11 Oct 2024 16:54:13 +0200
David Brown <david.brown@hesbynett.no> wrote:

On 11/10/2024 14:13, Michael S wrote:

On Fri, 11 Oct 2024 13:37:03 +0200
David Brown <david.brown@hesbynett.no> wrote:

On 10/10/2024 23:19, Brian G. Lucas wrote:

Not applicable.

I don't understand what you mean by that. /What/ is not applicable
to /what/ ?

Brian probably meant to say that it is not applicable to his
my66k LLVM back end.

But I am pretty sure that what you suggest is applicable, but a bad
idea for a memcpy/memmove routine that targets Arm+SVE.
Dynamic dispatch based on concrete core features/identification,
i.e. exactly the same mechanism that is done on "non-scalable"
architectures, would provide better performance. And memcpy/memmove
is certainly sufficiently important to justify an additional
development effort.

That explanation helps a little, but only a little. I wasn't
suggesting anything - or if I was, it was several posts ago and the
context has long since been snipped.

You suggested that "scalable" vector extensions are preferable for
memcpy/memmove implementation over "non-scalable" SIMD.

Can you be more explicit about
what you think I was suggesting, and why it might not be a good idea
for targeting a "my66k" ISA? (That is not a processor I have heard
of, so you'll have to give a brief summary of any particular features
that are relevant here.)

The proper spelling appears to be My 66000.
For starters, My 66000 has no SIMD. It does not even have a dedicated FP
register file. Both FP and Int share a common 32x64-bit register space.

More importantly, it has a dedicated instruction with exactly the same
semantics as memmove(). Pretty much the same as ARM64. In both cases the
instruction is defined, but not yet implemented in production silicon.
The difference is that in the case of ARM64 we can be reasonably sure
that it will eventually be implemented in production silicon. Which
means that in at least several out of the multitude of implementations
it will suck.


From Michael S@already5chosen@yahoo.com to comp.arch on Sun Oct 13 12:26:22 2024

From Newsgroup: comp.arch

On Sun, 13 Oct 2024 10:31:49 +0300
Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

On 2024-10-12 21:33, Brett wrote:

David Brown <david.brown@hesbynett.no> wrote:

On 12/10/2024 01:32, MitchAlsup1 wrote:

On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:

On 11/10/2024 20:55, MitchAlsup1 wrote:

On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:

Do you think you can just write this :

void * memmove(void * s1, const void * s2, size_t n)
{
    return memmove(s1, s2, n);
}

in your library's source?

       .global memmove
memmove:
       MM     R2,R1,R3
       RET

sure !

You are either totally clueless, or you are trolling. And I
know you are not clueless.

This discussion has become pointless.

The point is that there are a few things that may be hard to do
with {decode, pipeline, calculations, specifications...}; but
because they are so universally needed; these, too, should
"get into ISA".

One good reason to put them in ISA is to preserve the programmers
efforts over decades, so they don't have to re-write libc every-
time a new set of instructions come out.

Moving an arbitrary amount of memory from point a to point b
happens to fall into that universal need. Setting an arbitrary
amount of memory to a value also falls into that universal
need.

Again, I have to ask - do you bother to read the posts you reply
to? Are you interested in replying, and engaging in the
discussion? Or are you just looking for a chance to promote your
own architecture, no matter how tenuous the connection might be to
other posts?

Again, let me say that I agree with what you are saying - I agree
that an ISA should have instructions that are efficient for what
people actually want to do. I agree that it is a good thing to
have instructions that let performance scale with advances in
hardware ideally without needing changes in compiled binaries, and
at least without needing changes in source code.

I believe there is an interesting discussion to be had here, and I
would enjoy hearing about comparisons of the different ways
functions like memcpy() and memset() can be implemented in
different architectures and optimised for different sizes, or how
scalable vector instructions can work in comparison to fixed-size
SIMD instructions.

But at the moment, this potential is lost because you are posting
total shite about implementing memmove() in standard C. It is
disappointing that someone with your extensive knowledge and
experience cannot see this. I am finding it all very frustrating.

[ snip discussion of HW ]

In short your complaints are wrong headed in not understanding what hardware memcpy can do.

I think your reply proves David's complaint: you did not read, or did
not understand, what David is frustrated about. The true fact that
David is defending is that memmove() cannot be implemented
"efficiently" in /standard/ C source code, on /any/ HW, because it
would require comparing /C pointers/ that point to potentially
different /C objects/, which is not defined behavior in standard C,
whether compiled to machine code, or executed by an interpreter of C
code, or executed by a human programmer performing what was called
"desk testing" in the 1960s.

Obviously memmove() can be implemented efficiently in non-standard C
where such pointers can be compared, or by sequences of ordinary ALU
instructions, or by dedicated instructions such as Mitch's MM, and
David is not disputing that. But Mitch seems not to understand or not
to see the issue about standard C vs memmove().

Sufficiently advanced compiler can recognize patterns and replace them
with built-in sequences.
In case of memmove() the most easily recognizable pattern in 100%
standard C99 appears to be:
void *memmove( void *dest, const void *src, size_t count)
{
    if (count > 0) {
        char tmp[count];
        memcpy(tmp, src, count);
        memcpy(dest, tmp, count);
    }
    return dest;
}
I don't suggest that real implementation in Brian's compiler is like
that. Much more likely his implementation uses non-standard C and looks
approximately like:
void *memmove(void *dest, const void *src, size_t count) {
    return __builtin_memmove(dest, src, count);
}
However, implementing the first variant efficiently is well within
abilities of good compiler.

From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Sun Oct 13 13:33:55 2024

From Newsgroup: comp.arch

On 2024-10-13 12:26, Michael S wrote:

On Sun, 13 Oct 2024 10:31:49 +0300
Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

On 2024-10-12 21:33, Brett wrote:

David Brown <david.brown@hesbynett.no> wrote:

[ snip ]

But at the moment, this potential is lost because you are posting
total shite about implementing memmove() in standard C. It is
disappointing that someone with your extensive knowledge and
experience cannot see this. I am finding it all very frustrating.

[ snip discussion of HW ]

In short your complaints are wrong headed in not understanding what
hardware memcpy can do.

I think your reply proves David's complaint: you did not read, or did
not understand, what David is frustrated about. The true fact that
David is defending is that memmove() cannot be implemented
"efficiently" in /standard/ C source code, on /any/ HW, because it
would require comparing /C pointers/ that point to potentially
different /C objects/, which is not defined behavior in standard C,
whether compiled to machine code, or executed by an interpreter of C
code, or executed by a human programmer performing what was called
"desk testing" in the 1960s.

Obviously memmove() can be implemented efficiently in non-standard C
where such pointers can be compared, or by sequences of ordinary ALU
instructions, or by dedicated instructions such as Mitch's MM, and
David is not disputing that. But Mitch seems not to understand or not
to see the issue about standard C vs memmove().

Sufficiently advanced compiler can recognize patterns and replace them
with built-in sequences.

Sure.

In case of memmove() the most easily recognizable pattern in 100%
standard C99 appears to be:

void *memmove( void *dest, const void *src, size_t count)
{
    if (count > 0) {
        char tmp[count];
        memcpy(tmp, src, count);
        memcpy(dest, tmp, count);
    }
    return dest;
}

Yes.

I don't suggest that real implementation in Brian's compiler is like
that. Much more likely his implementation uses non-standard C and looks
approximately like:
void *memmove(void *dest, const void *src, size_t count) {
    return __builtin_memmove(dest, src, count);
}

However, implementing the first variant efficiently is well within
abilities of good compiler.

Yes, but it is not required by the C standard, so the fact remains that
there is no standard way of implementing memmove() in a way that is
"efficient" in the sense that it ensures that a copy to and from a
temporary will /not/ happen.

In practice, of course, memmove() is implemented in a non-portable way
or by in-line code, as everybody understands.


From David Brown@david.brown@hesbynett.no to comp.arch on Sun Oct 13 12:57:06 2024

From Newsgroup: comp.arch

On 12/10/2024 19:26, Bernd Linsel wrote:

On 12.10.24 17:16, David Brown wrote:

[snip rant]

You are aware that this is c.arch, not c.lang.c?

Absolutely, yes.

But in a thread branch discussing C, details of C are relevant.

I don't expect any random regular here to know "language lawyer" details
of the C standards. I don't expect people here to care about them.
People in comp.lang.c care about them - for people here, the main
interest in C is for programs to run on the computer architectures that
are the real interest.

But if someone engages in a conversation about C, I /do/ expect them to
understand some basics, and I /do/ expect them to read and think about
what other posters write. The point under discussion was that you
cannot implement an efficient "memmove()" function in fully portable
standard C. That's a fact - it is a well-established fact. Another
clear and inarguable fact is that particular ISAs or implementations are
completely irrelevant to fully portable standard C - that is both the
advantage and the disadvantage of writing code in portable standard C.

All I am asking Mitch to do is to understand this, and to stop saying
silly things (such as implementing memmove() by calling memmove(), or
that the /reason/ you can't implement memmove() efficiently in portable
standard C is weaknesses in the x86 ISA), so that we can clear up his
misunderstandings and move on to the more interesting computer
architecture discussions.


From David Brown@david.brown@hesbynett.no to comp.arch on Sun Oct 13 13:58:14 2024

From Newsgroup: comp.arch

On 12/10/2024 20:33, Brett wrote:

David Brown <david.brown@hesbynett.no> wrote:

On 12/10/2024 01:32, MitchAlsup1 wrote:

On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:

On 11/10/2024 20:55, MitchAlsup1 wrote:

On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:

Do you think you can just write this :

void * memmove(void * s1, const void * s2, size_t n)
{
    return memmove(s1, s2, n);
}

in your library's source?

       .global memmove
memmove:
       MM     R2,R1,R3
       RET

sure !

You are either totally clueless, or you are trolling. And I know you
are not clueless.

This discussion has become pointless.

The point is that there are a few things that may be hard to do
with {decode, pipeline, calculations, specifications...}; but
because they are so universally needed; these, too, should
"get into ISA".

One good reason to put them in ISA is to preserve the programmers
efforts over decades, so they don't have to re-write libc every-
time a new set of instructions come out.

Moving an arbitrary amount of memory from point a to point b
happens to fall into that universal need. Setting an arbitrary
amount of memory to a value also falls into that universal
need.

Again, I have to ask - do you bother to read the posts you reply to?
Are you interested in replying, and engaging in the discussion? Or are
you just looking for a chance to promote your own architecture, no
matter how tenuous the connection might be to other posts?

Again, let me say that I agree with what you are saying - I agree that
an ISA should have instructions that are efficient for what people
actually want to do. I agree that it is a good thing to have
instructions that let performance scale with advances in hardware
ideally without needing changes in compiled binaries, and at least
without needing changes in source code.

I believe there is an interesting discussion to be had here, and I would
enjoy hearing about comparisons of the different ways functions like
memcpy() and memset() can be implemented in different architectures and
optimised for different sizes, or how scalable vector instructions can
work in comparison to fixed-size SIMD instructions.

But at the moment, this potential is lost because you are posting total
shite about implementing memmove() in standard C. It is disappointing
that someone with your extensive knowledge and experience cannot see
this. I am finding it all very frustrating.

There are only two decisions to make in memcpy, are the copies less than
copy sized aligned, and do the pointers overlap in copy size.

Are you confused about memcpy() and memmove()? If so, let's clear that
one up from the start. For memcpy(), there are no overlap issues - the
person using it promises that the source and destination areas do not
overlap, and no one cares what might happen if they do. For memmove(),
the areas /may/ overlap, and the copy is done as though the source were
copied first to a temporary area, and then from the temporary area to
the destination.

For memcpy(), there can be several issues to consider for efficient
implementations that can be skipped for a simple loop copying byte for
byte. An efficient implementation will probably want to copy with
larger sizes, such as using 32-bit, 64-bit, or bigger registers. For
some targets, that is only possible for aligned data (and for some,
unaligned accesses may be allowed but emulated by traps, making them
massively slower than byte-by-byte accesses). The best choice of size
will be implementation and target dependent, as will methods of
determining alignment (if that is relevant). I'm guessing that by your
somewhat muddled phrase "are the copies less than copy sized aligned",
you meant something on those lines.

For memmove(), you generally also need to decide if your copy loop
should run upwards or downwards, and that must be done in an
implementation-dependent manner. It is conceivable that for a target
with more complex memory setups - perhaps allowing the same memory to be
accessible in different ways via different segments - that this is not
enough.
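
As a rough illustration of the direction choice described above, here is a minimal sketch in non-portable C. The name my_memmove is hypothetical, and the code assumes that uintptr_t exists and that comparing the converted addresses reflects the real layout of the bytes, which standard C does not promise.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name, chosen to avoid clashing with the library memmove(). */
void *my_memmove(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    /* Implementation-dependent step: assumes a flat address space in which
       the integer values of the pointers order the same way as the bytes. */
    if ((uintptr_t)d < (uintptr_t)s) {
        for (size_t i = 0; i < n; i++)    /* dest below src: copy upwards */
            d[i] = s[i];
    } else if ((uintptr_t)d > (uintptr_t)s) {
        for (size_t i = n; i > 0; i--)    /* dest above src: copy downwards */
            d[i - 1] = s[i - 1];
    }
    /* dest == src: nothing to do */
    return dest;
}
```

A real library version would copy in larger units and would probably be written per target; the sketch only shows where the implementation-dependent comparison enters.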

For hardware this simplifies down to perhaps two types of copies, easy and hard.

For most targets, yes.

If you make hard fast, and you will, then two versions is all you need, not the dozens of choices with 1k of code you need in C.

That makes little sense. What "1k of code" do you need in C?
Implementations of memcpy() and memmove() are implementation and
target-specific, not general portable standard C. There is no single C
implementation of these functions.

It is an obvious truism that if you have hardware instructions that can
implement an efficient memcpy() and/or memmove() on a target, then the
implementation-specific implementations of these functions on that
target will be small, simple and efficient.

Often you know which of the two you want at compile time from the pointer type.

In short your complaints are wrong headed in not understanding what
hardware memcpy can do.

What complaints? I haven't made any complaints about implementing these
functions.


From David Brown@david.brown@hesbynett.no to comp.arch on Sun Oct 13 14:10:20 2024

From Newsgroup: comp.arch

On 13/10/2024 11:00, Michael S wrote:

On Fri, 11 Oct 2024 16:54:13 +0200
David Brown <david.brown@hesbynett.no> wrote:

On 11/10/2024 14:13, Michael S wrote:

On Fri, 11 Oct 2024 13:37:03 +0200
David Brown <david.brown@hesbynett.no> wrote:

On 10/10/2024 23:19, Brian G. Lucas wrote:

Not applicable.

I don't understand what you mean by that. /What/ is not applicable
to /what/ ?

Brian probably meant to say that it is not applicable to his
my66k LLVM back end.

But I am pretty sure that what you suggest is applicable, but a bad
idea for a memcpy/memmove routine that targets Arm+SVE.
Dynamic dispatch based on concrete core features/identification,
i.e. exactly the same mechanism that is done on "non-scalable"
architectures, would provide better performance. And memcpy/memmove
is certainly sufficiently important to justify an additional
development effort.

That explanation helps a little, but only a little. I wasn't
suggesting anything - or if I was, it was several posts ago and the
context has long since been snipped.

You suggested that "scalable" vector extensions are preferable for
memcpy/memmove implementation over "non-scalable" SIMD.

I certainly suggested that they have some advantages, yes. I don't know nearly enough details about implementations and practical usage to know
if scalable vector instructions are /always/ better than non-scalable
SIMD with fixed-size registers, either from the viewpoint of their
efficiency at runtime or their implementation in hardware.

It seems to me that if the compiler knows the size of a memcpy/memmove,
then the best results would probably be achieved by the compiler
inlining the copy using fixed size registers of a suitable size. If it
does not know the size, then I would expect (but I don't know for sure)
that a hardware scalable vector instruction should be more efficient
than using fixed-size registers. If that were not the case, then I
wonder why scalable vector hardware has become popular recently in ISAs.

If you - or someone else - knows enough to say more about this, then I'd
be glad to learn about it.

Can you be more explicit about
what you think I was suggesting, and why it might not be a good idea
for targeting a "my66k" ISA? (That is not a processor I have heard
of, so you'll have to give a brief summary of any particular features
that are relevant here.)

The proper spelling appears to be My 66000.
For starters, My 66000 has no SIMD. It does not even have a dedicated FP
register file. Both FP and Int share a common 32x64-bit register space.

OK.

More importantly, it has a dedicated instruction with exactly the same
semantics as memmove(). Pretty much the same as ARM64. In both cases the
instruction is defined, but not yet implemented in production silicon.
The difference is that in the case of ARM64 we can be reasonably sure
that it will eventually be implemented in production silicon. Which
means that in at least several out of the multitude of implementations
it will suck.

So if I understand you correctly, your argument is that scalable vector
instructions - at least for copying memory - are slow in hardware
implementations, and thus it would be better to simply copy memory in a
loop using larger fixed-size registers? I would find that surprising,
but as I said, I don't know the details of implementations.

(I do know that in the 68k family, the hardware division instruction was dropped for later devices after it was realised that a software division routine was faster than the hardware instruction. So such strange
things have happened.)


From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Oct 13 15:45:37 2024

From Newsgroup: comp.arch

David Brown <david.brown@hesbynett.no> writes:

When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never". Pretty much the
only example people ever give of needing such comparisons is to
implement memmove() efficiently - but you don't need to implement
memmove(), because it is already in the standard library.

When you implement something like, say

vsum(double *a, double *b, double *c, size_t n);

where a, b, and c may point to arrays in different objects, or may
point to overlapping parts of the same object, and the result vector c
in the overlap case should be the same as in the no-overlap case
(similar to memmove()), being able to compare pointers to possibly
different objects also comes in handy.
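
To make the vsum() case concrete, here is a minimal sketch of the kind of overlap test such a routine might start with. The helper name ranges_overlap is hypothetical, and the code is deliberately non-portable: it assumes uintptr_t exists and that the converted values behave like addresses in a flat address space.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: do the n-element arrays at p and q share any bytes?
   Non-portable: relies on uintptr_t values behaving like flat addresses. */
bool ranges_overlap(const double *p, const double *q, size_t n)
{
    uintptr_t a = (uintptr_t)p;
    uintptr_t b = (uintptr_t)q;
    uintptr_t distance = a > b ? a - b : b - a;   /* avoids wrap-around */

    return distance < n * sizeof(double);
}
```

A vsum() built on this could take the plain forward loop when neither input overlaps the result, and fall back to a backward loop or a temporary otherwise.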

Another example is when the programmer uses the address as a key in,
e.g., a binary search tree. And, as you write, casting to intptr_t is
not guaranteed to work by the C standard, either.
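
For the address-as-key case, a comparison callback might look like the sketch below. The name addr_compare is hypothetical. The code uses uintptr_t, which is an optional type, and the ordering it produces is whatever the implementation's pointer-to-integer conversion gives, so this is an illustration of common practice rather than strictly portable C.

```c
#include <stdint.h>

/* Hypothetical ordering callback for a search tree keyed on addresses.
   Returns a negative, zero, or positive value, in the style of strcmp(). */
int addr_compare(const void *p, const void *q)
{
    uintptr_t a = (uintptr_t)p;
    uintptr_t b = (uintptr_t)q;

    return (a > b) - (a < b);
}
```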

An example that probably compares pointers to the same object as far
as the C standard is concerned, but that feels like different objects to
the programmer, is logic variables (in, e.g., a Prolog implementation).
When you have two free variables, and you unify them, in the
implementation one variable points to the other one. Now which should
point to which? The younger variable should point to the older one,
because it will die sooner. How do you know which variable is
younger? You compare the addresses; the variables reside on a stack,
so the younger one is closer to the top.

If that stack is one object as far as the C standard is concerned,
there is no problem with that solution. If the stack is implemented
as several objects (to make it more easily growable; I don't know if there
is a Prolog implementation that does that), you first have to check in
which piece it is (maybe with a binary search), and then possibly
compare within the stack piece at hand.

An interesting case is the Forth standard. It specifies "contiguous
regions", which correspond to objects in C, but in Forth each address
is a cell and can be added, subtracted, compared, etc. irrespective of
where it came from. So Forth really has a flat-memory model. It has
had that since its origins in the 1970s. Some of the 8086
implementations had some extensions for dealing with more than 64KB,
but they were never standardized and are now mostly forgotten.

Forth does not require a flat memory model in the hardware, as far as I
am aware, any more than C does. (I appreciate that your knowledge of
Forth is /vastly/ greater than mine.) A Forth implementation could
interpret part of the address value as the segment or other memory block
identifier and part of it as an index into that block, just as a C
implementation can.

I.e., what you are saying is that one can simulate a flat-memory model
on a segmented memory model. Certainly. In the case of the 8086 (and
even more so on the 286) the costs of that are so high that no
widely-used Forth system went there.

One can also simulate segmented memory (a natural fit for many
programming languages) on flat memory. In this case the cost is much
smaller, plus it gives the maximum flexibility about segment/object
sizes and numbers. That is why flat memory has won.
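
A minimal sketch of simulating segments on flat memory, under the assumption that a "segment" is nothing more than a base pointer plus a byte limit. The struct and function names are hypothetical and do not correspond to any particular language runtime.

```c
#include <stddef.h>

/* Hypothetical descriptor: a segment is a base and a size inside flat memory. */
struct segment {
    unsigned char *base;
    size_t limit;    /* size of the segment in bytes */
};

/* Translate (segment, offset) into a flat pointer; NULL if out of bounds. */
unsigned char *seg_addr(const struct segment *seg, size_t offset)
{
    return offset < seg->limit ? seg->base + offset : NULL;
}
```

The cost is one compare and one add per access, and the number and size of segments is limited only by the flat address space.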

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From Paul A. Clayton@paaronclayton@gmail.com to comp.arch on Sun Oct 13 13:32:32 2024

From Newsgroup: comp.arch

On 10/13/24 3:56 AM, Michael S wrote:

On Sat, 12 Oct 2024 18:32:48 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:

[snip memory copy instruction]

The 3rd Operand can, indeed, be a constant.
That causes no restartability problem when you have a place to
store the current count==index, so that when control returns
and you re-execute MM, it sees that x amount has already been
done, and C-X is left.

I don't understand this paragraph.
Does constant as a 3rd operand cause restartability problem?
Or does it not?
If it does not, then how?
Do you have a private field in thread state? Saved on stack by
interrupt uCode?

The extra state is saved in the context save area (like
for My 66000's extra state for the PREDicate instruction
modifier).

(Of course, restartability could also be provided by using
an ordinary register for the in-progress count even for
immediate counts. The instruction would effectively become a
load immediate and memory copy. Implicit/extra state has
some benefits.)
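
The restart idea can be modelled in software as below. This is only a sketch of the principle (a saved progress count that survives an interruption), not a description of how My 66000 stores or restores that state; copy_state, copy_resumable and the budget parameter are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical model of the saved in-progress state. */
struct copy_state {
    size_t done;    /* bytes already copied */
};

/* Copies at most `budget` bytes, then returns as if interrupted.
   Re-invoking with the same arguments and state resumes where it left off.
   Returns the number of bytes still to copy. Assumes no overlap. */
size_t copy_resumable(unsigned char *dst, const unsigned char *src,
                      size_t count, struct copy_state *st, size_t budget)
{
    while (st->done < count && budget > 0) {
        dst[st->done] = src[st->done];
        st->done++;
        budget--;
    }
    return count - st->done;
}
```

Because the total count never changes, it can be an immediate; only the progress count needs somewhere to live across the interruption.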

OS people would not like it. They prefer to have full control even when
they don't use it 99.999% of the time.

On the other hand, isolating some state and functionality might
facilitate less trust requirements? Some OS people might not like
having the OS be less than fully trusted.

From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Oct 13 21:21:11 2024

From Newsgroup: comp.arch

David Brown wrote:

On 10/10/2024 20:38, MitchAlsup1 wrote:

On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

On 09/10/2024 23:37, MitchAlsup1 wrote:

On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

On 09/10/2024 20:10, Thomas Koenig wrote:

David Brown <david.brown@hesbynett.no> schrieb:

When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never".

Sometimes, it is handy to encode certain conditions in pointers,
rather than having only a valid pointer or NULL. A compiler,
for example, might want to store the fact that an error occurred
while parsing a subexpression as a special pointer constant.

Compilers often have the unfair advantage, though, that they can
rely on what application programmers cannot, their implementation
details. (Some do not, such as f2c).

Standard library authors have the same superpowers, so that they can
implement an efficient memmove() even though a pure standard C
programmer cannot (other than by simply calling the standard library
memmove() function!).

This is more a symptom of bad ISA design/evolution than of libc
writers needing superpowers.

No, it is not. It has absolutely /nothing/ to do with the ISA.

For example, if ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
or desired in the libc call.

The existence of a dedicated assembly instruction does not let you write
an efficient memmove() in standard C. That's why I said there was no connection between the two concepts.

For some targets, it can be helpful to write memmove() in assembly or
using inline assembly, rather than in non-portable C (which is the
common case).

Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.

It is not that simple.

There can often be trade-offs between the speed of memmove() and
memcpy() on large transfers, and the overhead in setting things up that
is proportionally more costly for small transfers. Often that can be
eliminated when the compiler optimises the functions inline - when the
compiler knows the size of the move/copy, it can optimise directly.

What you are missing here David is the fact that Mitch's MM is a single
instruction which does the entire memmove() operation, and has the
inside knowledge about cache (residency at level x? width in
bytes)/memory ranges/access rights/etc needed to do so in a very close
to optimal manner, for both short and long transfers.

I.e. totally removing the need for compiler tricks or wide register
operations.

Also apropos the compiler library issue:

You start by teaching the compiler about the MM instruction, and to
recognize common patterns (just as most compilers already do today), and
then the memmove() calls will usually be inlined.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
--- Synchronet 3.20a-Linux NewsLink 1.114

From Brett@ggtgp@yahoo.com to comp.arch on Sun Oct 13 19:36:04 2024

From Newsgroup: comp.arch

David Brown <david.brown@hesbynett.no> wrote:

On 12/10/2024 19:26, Bernd Linsel wrote:

On 12.10.24 17:16, David Brown wrote:

[snip rant]

You are aware that this is c.arch, not c.lang.c?

Absolutely, yes.

But in a thread branch discussing C, details of C are relevant.

I don't expect any random regular here to know "language lawyer" details
of the C standards. I don't expect people here to care about them.
People in comp.lang.c care about them - for people here, the main
interest in C is for programs to run on the computer architectures that
are the real interest.

But if someone engages in a conversation about C, I /do/ expect them to
understand some basics, and I /do/ expect them to read and think about
what other posters write. The point under discussion was that you
cannot implement an efficient "memmove()" function in fully portable
standard C. That's a fact - it is a well-established fact. Another
clear and inarguable fact is that particular ISAs or implementations are
completely irrelevant to fully portable standard C - that is both the
advantage and the disadvantage of writing code in portable standard C.

All I am asking Mitch to do is to understand this, and to stop saying
silly things (such as implementing memmove() by calling memmove(), or
that the /reason/ you can't implement memmove() efficiently in portable
standard C is weaknesses in the x86 ISA), so that we can clear up his
misunderstandings and move on to the more interesting computer
architecture discussions.

MemMove in C is fundamentally two void pointers and a count of bytes to
move.

C does not care what the alignment of those two void pointers is.

ALUs are so cheap as to be free, a dedicated MM unit can have a shifter
and mask with a buffer, and happily copy aligned chunks from the source
and write aligned chunks to the dest, even though both are odd aligned in
different ways, and overlapping the same buffer.

Note that writes have byte enables, you can write 5 bytes in one go to
cache, to finish off the end of a series of aligned writes.

My 66000 only has one MM instruction because when you throw enough hardware
at the problem, one instruction is all you need.

And it also covers MemCopy, and yes there is a backwards copy version.

I detailed the hardware to do this several years ago on Real World Tech.
And such hardware has been available for many decades in DMA units.

The 6502 based GameBoy had a MemMove DMA unit as it was many times faster copying bytes than the 6502 was, and doubled the overall performance of the GameBoy.

One ring to rule them all.

--- Synchronet 3.20a-Linux NewsLink 1.114

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sun Oct 13 19:43:34 2024

From Newsgroup: comp.arch

Brett <ggtgp@yahoo.com> writes:

David Brown <david.brown@hesbynett.no> wrote:

All I am asking Mitch to do is to understand this, and to stop saying
silly things (such as implementing memmove() by calling memmove(), or
that the /reason/ you can't implement memmove() efficiently in portable
standard C is weaknesses in the x86 ISA), so that we can clear up his
misunderstandings and move on to the more interesting computer
architecture discussions.

<snip>

My 66000 only has one MM instruction because when you throw enough hardware
at the problem, one instruction is all you need.

And it also covers MemCopy, and yes there is a backwards copy version.

I detailed the hardware to do this several years ago on Real World Tech.

Such hardware (memcpy/memmove/memfill) was available in 1965 on the Burroughs medium systems mainframes. In the 80s, support was added for hashing
strings as well.

It's not a new concept. In fact, there were some tricks that could
be used with overlapping source and destination buffers that would
replicate chunks of data.
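
For readers who have not met the replication trick, here is a minimal C sketch of the general idea. It is an illustration only: the post does not say exactly how the Burroughs instructions behaved, and forward_copy and replicate are names invented here. A strictly forward, byte-at-a-time copy whose destination starts a few bytes after its source keeps re-reading bytes it has just written, so a short pattern is repeated along the buffer.

```c
#include <stddef.h>

/* Forward byte-by-byte copy. Unlike memmove(), it deliberately lets
   earlier writes feed later reads when dst overlaps src from above. */
void forward_copy(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Fill buf[0..len) with repetitions of its first `period` bytes by
   copying the buffer onto itself, shifted by `period`. */
void replicate(unsigned char *buf, size_t period, size_t len)
{
    if (period == 0 || len <= period)
        return;
    forward_copy(buf + period, buf, len - period);
}
```

Seeding a buffer with "ab" and calling replicate(buf, 2, 8) leaves "abababab" in it; memmove() on the same arguments would instead preserve the original (stale) source bytes.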
--- Synchronet 3.20a-Linux NewsLink 1.114

From Michael S@already5chosen@yahoo.com to comp.arch on Sun Oct 13 23:01:53 2024

From Newsgroup: comp.arch

On Sun, 13 Oct 2024 19:43:34 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:

Brett <ggtgp@yahoo.com> writes:

David Brown <david.brown@hesbynett.no> wrote:

All I am asking Mitch to do is to understand this, and to stop
saying silly things (such as implementing memmove() by calling
memmove(), or that the /reason/ you can't implement memmove()
efficiently in portable standard C is weaknesses in the x86 ISA),
so that we can clear up his misunderstandings and move on to the
more interesting computer architecture discussions.

<snip>

My 66000 only has one MM instruction because when you throw enough
hardware at the problem, one instruction is all you need.

And it also covers MemCopy, and yes there is a backwards copy
version.

I detailed the hardware to do this several years ago on Real World
Tech.

Such hardware (memcpy/memmove/memfill) was available in 1965 on the
Burroughs medium systems mainframes. In the 80s, support was added
for hashing strings as well.

It's not a new concept. In fact, there were some tricks that could
be used with overlapping source and destination buffers that would
replicate chunks of data.

The difference is that today for strings of certain size, say from 200
bytes to half of your L1D cache, if your precious HW copies less than 50
bytes per clock then people would complain that it is slower than a snail.

--- Synchronet 3.20a-Linux NewsLink 1.114

From Brian G. Lucas@bagel99@gmail.com to comp.arch on Sun Oct 13 15:32:04 2024

From Newsgroup: comp.arch

On 10/13/24 4:26 AM, Michael S wrote:

On Sun, 13 Oct 2024 10:31:49 +0300
Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

On 2024-10-12 21:33, Brett wrote:

David Brown <david.brown@hesbynett.no> wrote:

On 12/10/2024 01:32, MitchAlsup1 wrote:

On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:

On 11/10/2024 20:55, MitchAlsup1 wrote:

On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:

Do you think you can just write this :

void * memmove(void * s1, const void * s2, size_t n)
{
    return memmove(s1, s2, n);
}

in your library's source?

       .global memmove
memmove:
       MM     R2,R1,R3
       RET

sure !

You are either totally clueless, or you are trolling. And I
know you are not clueless.

This discussion has become pointless.

The point is that there are a few things that may be hard to do
with {decode, pipeline, calculations, specifications...}; but
because they are so universally needed; these, too, should
"get into ISA".

One good reason to put them in ISA is to preserve the programmers
efforts over decades, so they don't have to re-write libc every-
time a new set of instructions come out.

Moving an arbitrary amount of memory from point a to point b
happens to fall into that universal need. Setting an arbitrary
amount of memory to a value also falls into that universal
need.

Again, I have to ask - do you bother to read the posts you reply
to? Are you interested in replying, and engaging in the
discussion? Or are you just looking for a chance to promote your
own architecture, no matter how tenuous the connection might be to
other posts?

Again, let me say that I agree with what you are saying - I agree
that an ISA should have instructions that are efficient for what
people actually want to do. I agree that it is a good thing to
have instructions that let performance scale with advances in
hardware ideally without needing changes in compiled binaries, and
at least without needing changes in source code.

I believe there is an interesting discussion to be had here, and I
would enjoy hearing about comparisons of different ways things
functions like memcpy() and memset() can be implemented in
different architectures and optimised for different sizes, or how
scalable vector instructions can work in comparison to fixed-size
SIMD instructions.

But at the moment, this potential is lost because you are posting
total shite about implementing memmove() in standard C. It is
disappointing that someone with your extensive knowledge and
experience cannot see this. I am finding it all very frustrating.

[ snip discussion of HW ]

In short your complaints are wrong headed in not understanding what
hardware memcpy can do.

I think your reply proves David's complaint: you did not read, or did
not understand, what David is frustrated about. The true fact that
David is defending is that memmove() cannot be implemented
"efficiently" in /standard/ C source code, on /any/ HW, because it
would require comparing /C pointers/ that point to potentially
different /C objects/, which is not defined behavior in standard C,
whether compiled to machine code, or executed by an interpreter of C
code, or executed by a human programmer performing what was called
"desk testing" in the 1960s.

Obviously memmove() can be implemented efficiently in non-standard C
where such pointers can be compared, or by sequences of ordinary ALU
instructions, or by dedicated instructions such as Mitch's MM, and
David is not disputing that. But Mitch seems not to understand or not
to see the issue about standard C vs memmove().
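
To make the distinction concrete, here is a minimal sketch of the non-standard approach, assuming a flat address space in which converting pointers to uintptr_t preserves address order. The name my_memmove is hypothetical and the byte loops are for clarity rather than speed. The single comparison that chooses the copy direction is exactly the step that strictly portable C cannot express with `d <= s` when the two pointers refer to different objects.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only; not any particular libc's implementation. */
void *my_memmove(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    /* Non-portable step: ordering the converted integers is
       implementation-defined and assumes a flat address space. */
    if ((uintptr_t)d <= (uintptr_t)s) {
        /* Destination starts at or below the source: copy forwards. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else {
        /* Destination starts above the source: copy backwards so that
           source bytes are read before they can be overwritten. */
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    }
    return dest;
}
```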

A sufficiently advanced compiler can recognize patterns and replace them
with built-in sequences.

In case of memmove() the most easily recognizable pattern in 100%
standard C99 appears to be:

void *memmove(void *dest, const void *src, size_t count)
{
    if (count > 0) {
        char tmp[count];
        memcpy(tmp, src, count);
        memcpy(dest, tmp, count);
    }
    return dest;
}

I don't suggest that real implementation in Brian's compiler is like
that. Much more likely his implementation uses non-standard C and looks
approximately like:

void *memmove(void *dest, const void *src, size_t count) {
    return __builtin_memmove(dest, src, count);
}

Well, something like that. Clang will generate LLVM IR which acts like
a builtin_memmove that the backend can match and emit the MM instruction.

However, implementing the first variant efficiently is well within
the abilities of a good compiler.

--- Synchronet 3.20a-Linux NewsLink 1.114

From David Brown@david.brown@hesbynett.no to comp.arch on Mon Oct 14 15:19:32 2024

From Newsgroup: comp.arch

On 13/10/2024 21:21, Terje Mathisen wrote:

David Brown wrote:

On 10/10/2024 20:38, MitchAlsup1 wrote:

On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

On 09/10/2024 23:37, MitchAlsup1 wrote:

On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

On 09/10/2024 20:10, Thomas Koenig wrote:

David Brown <david.brown@hesbynett.no> schrieb:

When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never".

Sometimes, it is handy to encode certain conditions in pointers,
rather than having only a valid pointer or NULL. A compiler,
for example, might want to store the fact that an error occurred
while parsing a subexpression as a special pointer constant.

Compilers often have the unfair advantage, though, that they can
rely on what application programmers cannot, their implementation
details. (Some do not, such as f2c).

Standard library authors have the same superpowers, so that they can
implement an efficient memmove() even though a pure standard C
programmer cannot (other than by simply calling the standard library
memmove() function!).

This is more a symptom of bad ISA design/evolution than of libc
writers needing superpowers.

No, it is not. It has absolutely /nothing/ to do with the ISA.

For example, if ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
or desired in the libc call.

The existence of a dedicated assembly instruction does not let you
write an efficient memmove() in standard C. That's why I said there
was no connection between the two concepts.

For some targets, it can be helpful to write memmove() in assembly or
using inline assembly, rather than in non-portable C (which is the
common case).

Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.

It is not that simple.

There can often be trade-offs between the speed of memmove() and
memcpy() on large transfers, and the overhead in setting things up
that is proportionally more costly for small transfers. Often that
can be eliminated when the compiler optimises the functions inline -
when the compiler knows the size of the move/copy, it can optimise
directly.

What you are missing here David is the fact that Mitch's MM is a single instruction which does the entire memmove() operation, and has the
inside knowledge about cache (residency at level x? width in
bytes)/memory ranges/access rights/etc needed to do so in a very close
to optimal manner, for both short and long transfers.

I am not missing that at all. And I agree that an advanced hardware MM
instruction could be a very efficient way to implement both memcpy and
memmove. (For my own kind of work, I'd worry about such looping
instructions causing an unbounded increase in interrupt latency, but
that too is solvable given enough hardware effort.)

And I agree that once you have an "MM" (or similar) instruction, you
don't need to re-write the implementation for your memmove() and
memcpy() library functions for every new generation of processors of a
given target family.

What I /don't/ agree with is the claim that you /do/ need to keep
re-writing your implementations all the time. You will /sometimes/ get benefits from doing so, but it is not as simple as Mitch made out.

I.e. totally removing the need for compiler tricks or wide register operations.

Also apropos the compiler library issue:

You start by teaching the compiler about the MM instruction, and to recognize common patterns (just as most compilers already do today), and then the memmove() calls will usually be inlined.

The original compile library issue was that it is impossible to write an efficient memmove() implementation using pure portable standard C. That
is independent of any ISA, any specialist instructions for memory moves,
and any compiler optimisations. And it is independent of the fact that
some good compilers can inline at least some calls to memcpy() and
memmove() today, using whatever instructions are most efficient for the target.

--- Synchronet 3.20a-Linux NewsLink 1.114

From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Oct 14 16:40:26 2024

From Newsgroup: comp.arch

David Brown wrote:

On 13/10/2024 21:21, Terje Mathisen wrote:

David Brown wrote:

On 10/10/2024 20:38, MitchAlsup1 wrote:

On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

On 09/10/2024 23:37, MitchAlsup1 wrote:

On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

On 09/10/2024 20:10, Thomas Koenig wrote:

David Brown <david.brown@hesbynett.no> schrieb:

When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never".

Sometimes, it is handy to encode certain conditions in pointers,
rather than having only a valid pointer or NULL. A compiler,
for example, might want to store the fact that an error occurred
while parsing a subexpression as a special pointer constant.

Compilers often have the unfair advantage, though, that they can
rely on what application programmers cannot, their implementation
details. (Some do not, such as f2c).

Standard library authors have the same superpowers, so that they can
implement an efficient memmove() even though a pure standard C
programmer cannot (other than by simply calling the standard library
memmove() function!).

This is more a symptom of bad ISA design/evolution than of libc
writers needing superpowers.

No, it is not. It has absolutely /nothing/ to do with the ISA.

For example, if ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
or desired in the libc call.

The existence of a dedicated assembly instruction does not let you
write an efficient memmove() in standard C. That's why I said there
was no connection between the two concepts.

For some targets, it can be helpful to write memmove() in assembly or
using inline assembly, rather than in non-portable C (which is the
common case).

Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.

It is not that simple.

There can often be trade-offs between the speed of memmove() and
memcpy() on large transfers, and the overhead in setting things up
that is proportionally more costly for small transfers. Often that
can be eliminated when the compiler optimises the functions inline -
when the compiler knows the size of the move/copy, it can optimise
directly.

What you are missing here David is the fact that Mitch's MM is a
single instruction which does the entire memmove() operation, and has
the inside knowledge about cache (residency at level x? width in
bytes)/memory ranges/access rights/etc needed to do so in a very close
to optimal manner, for both short and long transfers.

I am not missing that at all. And I agree that an advanced hardware MM
instruction could be a very efficient way to implement both memcpy and
memmove. (For my own kind of work, I'd worry about such looping
instructions causing an unbounded increase in interrupt latency, but
that too is solvable given enough hardware effort.)

And I agree that once you have an "MM" (or similar) instruction, you
don't need to re-write the implementation for your memmove() and
memcpy() library functions for every new generation of processors of a
given target family.

What I /don't/ agree with is the claim that you /do/ need to keep
re-writing your implementations all the time. You will /sometimes/ get benefits from doing so, but it is not as simple as Mitch made out.

I.e. totally removing the need for compiler tricks or wide register
operations.

Also apropos the compiler library issue:

You start by teaching the compiler about the MM instruction, and to
recognize common patterns (just as most compilers already do today),
and then the memmove() calls will usually be inlined.

The original compile library issue was that it is impossible to write an efficient memmove() implementation using pure portable standard C. That
is independent of any ISA, any specialist instructions for memory moves,
and any compiler optimisations. And it is independent of the fact that some good compilers can inline at least some calls to memcpy() and
memmove() today, using whatever instructions are most efficient for the target.

David, you and Mitch are among my most cherished writers here on c.arch,
I really don't think any of us really disagree, it is just that we have
been discussing two (mostly) orthogonal issues.

a) memmove/memcpy are so important that people have been spending a lot
of time & effort trying to make them faster, with the complication that
in general they cannot be implemented in pure C (which disallows direct
comparison of arbitrary pointers).

b) Mitch has, like Andy ("Crazy") Glew many years before, realized that
if a cpu architecture actually has an instruction designed to do this
particular job, it behooves cpu architects to make sure that it is in
fact so fast that it obviates any need for tricky coding to replace it.

Ideally, it should be able to copy a single object, up to a cache line
in size, in the same or less time needed to do so manually with a SIMD
512-bit load followed by a 512-bit store (both ops masked to not touch
anything they shouldn't).

REP MOVSB on x86 does the canonical memcpy() operation, originally by
moving single bytes, and this was so slow that we also had REP MOVSW
(moving 16-bit entities) and then REP MOVSD on the 386 and REP MOVSQ on
64-bit cpus.

With a suitable chunk of logic, the basic MOVSB operation could in fact
handle any kinds of alignments and sizes, while doing the actual
transfer at maximum bus speeds, i.e. at least one cache line/cycle for
things already in $L1.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
--- Synchronet 3.20a-Linux NewsLink 1.114

From David Brown@david.brown@hesbynett.no to comp.arch on Mon Oct 14 17:04:28 2024

From Newsgroup: comp.arch

On 13/10/2024 17:45, Anton Ertl wrote:

David Brown <david.brown@hesbynett.no> writes:

When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never". Pretty much the
only example people ever give of needing such comparisons is to
implement memmove() efficiently - but you don't need to implement
memmove(), because it is already in the standard library.

When you implement something like, say

vsum(double *a, double *b, double *c, size_t n);

where a, b, and c may point to arrays in different objects, or may
point to overlapping parts of the same object, and the result vector c
in the overlap case should be the same as in the no-overlap case
(similar to memmove()), being able to compare pointers to possibly
different objects also comes in handy.

OK, I can agree with that - /if/ you need such a function. I'd suggest
that when you are writing code that might call such a function, you've a
very good idea whether you want to do "vec_c = vec_a + vec_b;", or
"vec_c += vec_a;" (that is, "b" and "c" are the same). In other words,
the programmer calling vsum already knows if there are overlaps, and
you'd get the best results if you had different functions for the
separate cases.

It is conceivable that you don't know if there is an overlap, especially
if you are only dealing with parts of arrays rather than full arrays,
but I think such cases will be rare.
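
A minimal sketch of the "different functions for the separate cases" idea, under the assumption that the caller knows which case applies. The names vec_sum and vec_accumulate are hypothetical; `restrict` records the no-overlap promise so that neither function needs to compare pointers at all.

```c
#include <stddef.h>

/* c = a + b, for callers that know c overlaps neither a nor b. */
void vec_sum(double *restrict c, const double *restrict a,
             const double *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* c += a, for the case where the result replaces one of the operands. */
void vec_accumulate(double *restrict c, const double *restrict a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] += a[i];
}
```

The partially overlapping case, where the answer is not known until run time, is the one that would still need some way to order the pointers.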

I do think it would be convenient if there were a fully standard way to compare independent pointers (other than just for equality). Rarely
needing something does not mean /never/ needing it. Since a fully
defined portable method might not be possible (or at least, not
efficiently possible) for some weird targets, and it's a good thing that
C supports weird targets, I think perhaps the ideal would be to have
some feature that exists if and only if you can do sensible comparisons.
This could be an additional <stdint.h> pointer type, or some pointer
compare macros, or a pre-defined macro to say if you can simply use
uintptr_t for the purpose (as you can on most modern C implementations).
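
Part of this can already be approximated today. <stdint.h> defines UINTPTR_MAX only when it provides uintptr_t, so a compile-time check is possible, as in the sketch below. What the macro cannot tell you is whether the integer order matches address order; that remains an assumption about the target. The helper name ptr_before is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#ifndef UINTPTR_MAX
#error "uintptr_t is not provided by this implementation"
#endif

/* Orders two pointers by their converted integer values. The result is
   a consistent total order on those values; it reflects real address
   order only on targets with a flat address space (an assumption). */
bool ptr_before(const void *a, const void *b)
{
    return (uintptr_t)a < (uintptr_t)b;
}
```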

Another example is when the programmer uses the address as a key in,
e.g., a binary search tree. And, as you write, casting to intptr_t is
not guaranteed to work by the C standard, either.

Casting to uintptr_t (why would one want a /signed/ address?) is all you
need for most systems - and for any target where casting to uintptr_t
will not be sufficient here, the type uintptr_t will not exist and you
get a nice, safe hard compile-time error rather than silently UB code.
For uses like this, you don't need to compare pointers - comparing the
integers converted from the pointers is fine. (Imagine a system where
converted addresses consist of a 16-bit segment number and a 16-bit
offset, where the absolute address is the segment number times a scale
factor, plus the offset. You can't easily compare two pointers for real
address ordering by converting them to an integer type, but the result
of casting to uintptr_t is still fine for your binary tree.)
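
The parenthetical can be worked through with a small model. Everything here is illustrative: struct seg_ptr, as_integer and absolute are invented names, the converted integer is assumed to be the segment in the high 16 bits and the offset in the low 16 bits, and the scale factor of 16 is borrowed from 8086 real mode.

```c
#include <stdint.h>

/* Hypothetical segmented pointer: 16-bit segment plus 16-bit offset. */
struct seg_ptr {
    uint16_t segment;
    uint16_t offset;
};

/* The integer a pointer-to-integer cast might yield on such a target. */
uint32_t as_integer(struct seg_ptr p)
{
    return ((uint32_t)p.segment << 16) | p.offset;
}

/* The real address, assuming a scale factor of 16. */
uint32_t absolute(struct seg_ptr p)
{
    return (uint32_t)p.segment * 16u + p.offset;
}
```

Take p = 1000:0100 and q = 0FFF:0200 (hex). The converted integers order p after q (0x10000100 > 0x0FFF0200), while the absolute addresses order p before q (0x10100 < 0x101F0). The integer order is still a consistent total order, which is all a search-tree key needs, provided a given object always converts to the same integer.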

An example that probably compares pointers to the same object as far
as the C standard is concerned, but feel like different objects to the programmer, is logic variables (in, e.g., a Prolog implementation).
When you have two free variables, and you unify them, in the
implementation one variable points to the other one. Now which should
point to which? The younger variable should point to the older one,
because it will die sooner. How do you know which variable is
younger? You compare the addresses; the variables reside on a stack,
so the younger one is closer to the top.

If that stack is one object as far as the C standard is concerned,
there is no problem with that solution. If the stack is implemented
as several objects (to make it easier growable; I don't know if there
is a Prolog implementation that does that), you first have to check in
which piece it is (maybe with a binary search), and then possibly
compare within the stack piece at hand.
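
A minimal sketch of that two-step comparison, with several assumptions: pieces are listed oldest first, the stack grows upward within a piece, and a linear scan stands in for the binary search mentioned above. The names struct piece, find_piece and is_younger are hypothetical. Note that locating the piece still relies on uintptr_t comparisons, so this is not strictly portable either.

```c
#include <stddef.h>
#include <stdint.h>

/* One contiguous chunk of a segmented variable stack. */
struct piece {
    char  *base;
    size_t size;
};

/* Index of the piece containing p, or -1 if none does. */
int find_piece(const struct piece *pieces, int count, const void *p)
{
    uintptr_t v = (uintptr_t)p;
    for (int i = 0; i < count; i++) {
        uintptr_t lo = (uintptr_t)pieces[i].base;
        if (v >= lo && v - lo < pieces[i].size)
            return i;
    }
    return -1;
}

/* Nonzero if variable a was pushed after variable b. */
int is_younger(const struct piece *pieces, int count,
               const void *a, const void *b)
{
    int pa = find_piece(pieces, count, a);
    int pb = find_piece(pieces, count, b);
    if (pa != pb)
        return pa > pb;                       /* later piece is younger */
    return (const char *)a > (const char *)b; /* same object: well defined */
}
```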

My only experience of Prolog was working through a short tutorial
article when I was a teenager - I have no idea about implementations!

But again I come back to the same conclusion - there are situations
where being able to compare addresses can be useful, but it is very rare
for most programmers to ever actually need to do so. And I think it is
good that there is a widely portable way to achieve this, by casting to uintptr_t and comparing those integers. There are things that people
want to do with C programming that can be done with
implementation-specific code, but which cannot be done with fully
portable standard code. While it is always nice if you /can/ use fully portable solutions (while still being clear and efficient), it's okay to
have non-portable code when you need it.

An interesting case is the Forth standard. It specifies "contiguous
regions", which correspond to objects in C, but in Forth each address
is a cell and can be added, subtracted, compared, etc. irrespective of
where it came from. So Forth really has a flat-memory model. It has
had that since its origins in the 1970s. Some of the 8086
implementations had some extensions for dealing with more than 64KB,
but they were never standardized and are now mostly forgotten.

Forth does not require a flat memory model in the hardware, as far as I
am aware, any more than C does. (I appreciate that your knowledge of
Forth is /vastly/ greater than mine.) A Forth implementation could
interpret part of the address value as the segment or other memory block
identifier and part of it as an index into that block, just as a C
implementation can.

I.e., what you are saying is that one can simulate a flat-memory model
on a segmented memory model.

Yes.

Certainly. In the case of the 8086 (and
even more so on the 286) the costs of that are so high that no
widely-used Forth system went there.

OK.

That's much the same as C on segmented targets.

One can also simulate segmented memory (a natural fit for many
programming languages) on flat memory. In this case the cost is much smaller, plus it gives the maximum flexibility about segment/object
sizes and numbers. That is why flat memory has won.

Sure, flat memory is nicer in many ways.

--- Synchronet 3.20a-Linux NewsLink 1.114

From David Brown@david.brown@hesbynett.no to comp.arch on Mon Oct 14 17:19:40 2024

From Newsgroup: comp.arch

On 14/10/2024 16:40, Terje Mathisen wrote:

David Brown wrote:

On 13/10/2024 21:21, Terje Mathisen wrote:

David Brown wrote:

On 10/10/2024 20:38, MitchAlsup1 wrote:

On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

On 09/10/2024 23:37, MitchAlsup1 wrote:

On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

On 09/10/2024 20:10, Thomas Koenig wrote:

David Brown <david.brown@hesbynett.no> schrieb:

When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never".

Sometimes, it is handy to encode certain conditions in pointers,
rather than having only a valid pointer or NULL. A compiler,
for example, might want to store the fact that an error occurred
while parsing a subexpression as a special pointer constant.

Compilers often have the unfair advantage, though, that they can
rely on what application programmers cannot, their implementation
details. (Some do not, such as f2c).

Standard library authors have the same superpowers, so that they can
implement an efficient memmove() even though a pure standard C
programmer cannot (other than by simply calling the standard library
memmove() function!).

This is more a symptom of bad ISA design/evolution than of libc
writers needing superpowers.

No, it is not. It has absolutely /nothing/ to do with the ISA.

For example, if ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
or desired in the libc call.

The existence of a dedicated assembly instruction does not let you
write an efficient memmove() in standard C. That's why I said
there was no connection between the two concepts.

For some targets, it can be helpful to write memmove() in assembly
or using inline assembly, rather than in non-portable C (which is
the common case).

Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.

It is not that simple.

There can often be trade-offs between the speed of memmove() and
memcpy() on large transfers, and the overhead in setting things up
that is proportionally more costly for small transfers. Often that
can be eliminated when the compiler optimises the functions inline -
when the compiler knows the size of the move/copy, it can optimise
directly.

What you are missing here David is the fact that Mitch's MM is a
single instruction which does the entire memmove() operation, and has
the inside knowledge about cache (residency at level x? width in
bytes)/memory ranges/access rights/etc needed to do so in a very
close to optimal manner, for both short and long transfers.

I am not missing that at all. And I agree that an advanced hardware
MM instruction could be a very efficient way to implement both memcpy
and memmove. (For my own kind of work, I'd worry about such looping
instructions causing an unbounded increase in interrupt latency, but
that too is solvable given enough hardware effort.)

And I agree that once you have an "MM" (or similar) instruction, you
don't need to re-write the implementation for your memmove() and
memcpy() library functions for every new generation of processors of a
given target family.

What I /don't/ agree with is the claim that you /do/ need to keep
re-writing your implementations all the time. You will /sometimes/
get benefits from doing so, but it is not as simple as Mitch made out.

I.e. totally removing the need for compiler tricks or wide register
operations.

Also apropos the compiler library issue:

You start by teaching the compiler about the MM instruction, and to
recognize common patterns (just as most compilers already do today),
and then the memmove() calls will usually be inlined.

The original compiler library issue was that it is impossible to write
an efficient memmove() implementation using pure portable standard C.
That is independent of any ISA, any specialist instructions for memory
moves, and any compiler optimisations. And it is independent of the
fact that some good compilers can inline at least some calls to
memcpy() and memmove() today, using whatever instructions are most
efficient for the target.

David, you and Mitch are among my most cherished writers here on c.arch,
I really don't think any of us really disagree, it is just that we have
been discussing two (mostly) orthogonal issues.

I agree. It's a "god dag mann, økseskaft" situation.

I have a huge respect for Mitch, his knowledge and experience, and his willingness to share that freely with others. That's why I have found
this very frustrating.

a) memmove/memcpy are so important that people have been spending a lot
of time & effort trying to make it faster, with the complication that in general it cannot be implemented in pure C (which disallows direct comparison of arbitrary pointers).

Yes.

(Unlike memmove(), memcpy() can be implemented in standard C as a
simple byte-copy loop, without needing to compare pointers. But an
implementation that copies in larger blocks than a byte requires
implementation-dependent behaviour to determine alignments, or it must
rely on unaligned accesses being allowed by the implementation.)
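
As a rough sketch of the byte-copy loop described above: the function name bytewise_memcpy is made up for illustration, and this is not taken from any particular library. It uses only standard C and, like memcpy(), leaves overlapping regions undefined.

```c
#include <stddef.h>

/* Hypothetical name. A byte-at-a-time copy that relies on nothing
   beyond standard C. As with memcpy(), the behaviour is undefined
   if the source and destination regions overlap. */
void *bytewise_memcpy(void *restrict dest, const void *restrict src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    for (size_t i = 0; i < n; i++) {
        d[i] = s[i];    /* indexing only, no pointer comparison */
    }
    return dest;
}
```

Copying in wider units than a byte would need to know something about alignment, which is where the implementation-dependent behaviour comes in.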

b) Mitch has, like Andy ("Crazy") Glew many years before, realized that
if a cpu architecture actually has an instruction designed to do this particular job, it behooves cpu architects to make sure that it is in
fact so fast that it obviates any need for tricky coding to replace it.

Yes.

Ideally, it should be able to copy a single object, up to a cache line
in size, in the same or less time needed to do so manually with a SIMD 512-bit load followed by a 512-bit store (both ops masked to not touch anything it shouldn't)

Yes.

REP MOVSB on x86 does the canonical memcpy() operation, originally by
moving single bytes, and this was so slow that we also had REP MOVSW
(moving 16-bit entities) and then REP MOVSD on the 386 and REP MOVSQ on 64-bit cpus.

With a suitable chunk of logic, the basic MOVSB operation could in fact handle any kinds of alignments and sizes, while doing the actual
transfer at maximum bus speeds, i.e. at least one cache line/cycle for things already in $L1.

I agree on all of that.

I am quite happy with the argument that suitable hardware can do these
basic operations faster than a software loop or the x86 "rep"
instructions. And I fully agree that these would be useful features in general-purpose processors.

My only point of contention is that the existence or lack of such
instructions does not make any difference to whether or not you can
write a good implementation of memcpy() or memmove() in portable
standard C. They would make it easier to write efficient
implementations of these standard library functions for targets that had
such instructions - but that would be implementation-specific code. And
that is one of the reasons that C standard library implementations are
tied to the specific compiler and target, and the writers of these
libraries have "superpowers" and are not limited to standard C.
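
To illustrate what such implementation-specific code might look like, here is a minimal sketch of a memmove() that picks its copy direction by converting the pointers to uintptr_t. The name flat_memmove is hypothetical. The ordering of the converted values is implementation-defined, so this assumes a flat address space in which integer order matches address order; it is not portable standard C.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name. NOT portable: uintptr_t is an optional type, and
   the result of converting a pointer to it is implementation-defined.
   The sketch assumes a flat address space where the converted values
   order the same way as the addresses do. */
void *flat_memmove(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    if ((uintptr_t)d < (uintptr_t)s) {
        /* Destination starts below source: copy forwards. */
        for (size_t i = 0; i < n; i++) {
            d[i] = s[i];
        }
    } else if ((uintptr_t)d > (uintptr_t)s) {
        /* Destination starts above source: copy backwards. */
        for (size_t i = n; i > 0; i--) {
            d[i - 1] = s[i - 1];
        }
    }
    /* Same start address: nothing to do. */
    return dest;
}
```

Each byte is copied once, but only because the code leans on behaviour that the standard leaves to the implementation.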

--- Synchronet 3.20a-Linux NewsLink 1.114

From Michael S@already5chosen@yahoo.com to comp.arch on Mon Oct 14 19:08:56 2024

From Newsgroup: comp.arch

On Mon, 14 Oct 2024 17:19:40 +0200
David Brown <david.brown@hesbynett.no> wrote:

On 14/10/2024 16:40, Terje Mathisen wrote:

David Brown wrote:

On 13/10/2024 21:21, Terje Mathisen wrote:

David Brown wrote:

On 10/10/2024 20:38, MitchAlsup1 wrote:

On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

On 09/10/2024 23:37, MitchAlsup1 wrote:

On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

On 09/10/2024 20:10, Thomas Koenig wrote:

David Brown <david.brown@hesbynett.no> schrieb:

When would you ever /need/ to compare pointers to
different objects?
For almost all C programmers, the answer is "never".

Sometimes, it is handy to encode certain conditions in
pointers, rather than having only a valid pointer or
NULL. A compiler, for example, might want to store the
fact that an error occurred while parsing a subexpression
as a special pointer constant.

Compilers often have the unfair advantage, though, that
they can rely on what application programmers cannot, their
implementation details. (Some do not, such as f2c).

Standard library authors have the same superpowers, so that
they can implement an efficient memmove() even though a pure
standard C programmer cannot (other than by simply calling the
standard library memmove() function!).

This is more a symptom of bad ISA design/evolution than of
libc writers needing superpowers.

No, it is not. It has absolutely /nothing/ to do with the ISA.

For example, if ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
or desired in the libc call.

The existence of a dedicated assembly instruction does not let
you write an efficient memmove() in standard C. That's why I
said there was no connection between the two concepts.

For some targets, it can be helpful to write memmove() in
assembly or using inline assembly, rather than in non-portable C
(which is the common case).

Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.

It is not that simple.

There can often be trade-offs between the speed of memmove() and
memcpy() on large transfers, and the overhead in setting things
up that is proportionally more costly for small transfers.
Often that can be eliminated when the compiler optimises the
functions inline - when the compiler knows the size of the
move/copy, it can optimise directly.

What you are missing here David is the fact that Mitch's MM is a
single instruction which does the entire memmove() operation, and
has the inside knowledge about cache (residency at level x? width
in bytes)/memory ranges/access rights/etc needed to do so in a
very close to optimal manner, for both short and long transfers.

I am not missing that at all. And I agree that an advanced
hardware MM instruction could be a very efficient way to implement
both memcpy and memmove. (For my own kind of work, I'd worry
about such looping instructions causing an unbounded increase in
interrupt latency, but that too is solvable given enough hardware
effort.)

And I agree that once you have an "MM" (or similar) instruction,
you don't need to re-write the implementation for your memmove()
and memcpy() library functions for every new generation of
processors of a given target family.

What I /don't/ agree with is the claim that you /do/ need to keep
re-writing your implementations all the time. You will
/sometimes/ get benefits from doing so, but it is not as simple as
Mitch made out.

I.e. totally removing the need for compiler tricks or wide
register operations.

Also apropos the compiler library issue:

You start by teaching the compiler about the MM instruction, and
to recognize common patterns (just as most compilers already do
today), and then the memmove() calls will usually be inlined.

The original compiler library issue was that it is impossible to
write an efficient memmove() implementation using pure portable
standard C. That is independent of any ISA, any specialist
instructions for memory moves, and any compiler optimisations.
And it is independent of the fact that some good compilers can
inline at least some calls to memcpy() and memmove() today, using
whatever instructions are most efficient for the target.

David, you and Mitch are among my most cherished writers here on
c.arch, I really don't think any of us really disagree, it is just
that we have been discussing two (mostly) orthogonal issues.

I agree. It's a "god dag mann, økseskaft" situation.

I have a huge respect for Mitch, his knowledge and experience, and
his willingness to share that freely with others. That's why I have
found this very frustrating.

a) memmove/memcpy are so important that people have been spending a
lot of time & effort trying to make it faster, with the
complication that in general it cannot be implemented in pure C
(which disallows direct comparison of arbitrary pointers).

Yes.

(Unlike memmove(), memcpy() can be implemented in standard C as a
simple byte-copy loop, without needing to compare pointers. But an
implementation that copies in larger blocks than a byte requires
implementation-dependent behaviour to determine alignments, or it
must rely on unaligned accesses being allowed by the implementation.)

b) Mitch has, like Andy ("Crazy") Glew many years before, realized
that if a cpu architecture actually has an instruction designed to
do this particular job, it behooves cpu architects to make sure
that it is in fact so fast that it obviates any need for tricky
coding to replace it.

Yes.

Ideally, it should be able to copy a single object, up to a cache
line in size, in the same or less time needed to do so manually
with a SIMD 512-bit load followed by a 512-bit store (both ops
masked to not touch anything it shouldn't)

Yes.

REP MOVSB on x86 does the canonical memcpy() operation, originally
by moving single bytes, and this was so slow that we also had REP
MOVSW (moving 16-bit entities) and then REP MOVSD on the 386 and
REP MOVSQ on 64-bit cpus.

With a suitable chunk of logic, the basic MOVSB operation could in
fact handle any kinds of alignments and sizes, while doing the
actual transfer at maximum bus speeds, i.e. at least one cache
line/cycle for things already in $L1.

I agree on all of that.

I am quite happy with the argument that suitable hardware can do
these basic operations faster than a software loop or the x86 "rep" instructions.

No, that's not true. And according to my understanding, that's not what
Terje wrote.
REP MOVSB _is_ an almost ideal instruction for memcpy (modulo minor
details - fixed registers for src, dest, len and the Direction flag in
PSW instead of being part of the opcode).
REP MOVSW/D/Q were introduced because back then processors were small
and stupid. When your processor is big and smart you don't need them
any longer. REP MOVSB is sufficient.
The new Arm64 instructions that are hopefully coming next year are akin
to REP MOVSB rather than to MOVSW/D/Q.
Instructions for memmove, also defined by Arm and by Mitch, are the next
logical step. IMHO, the main gain here is not a measurable improvement
in performance, but a saving in code size when inlined.
Now, is all that a good idea? I am not 100% convinced.
One can argue that the streaming alignment hardware that is necessary
for a first-class implementation of these instructions is useful not
only for memory copy.
So maybe it makes sense to expose this hardware in more generic ways.
Maybe via Load Multiple Register? It was present in Arm's A32/T32,
but didn't make it into ARM64. Or maybe there are even better ways
that I have not thought of.

And I fully agree that these would be useful features
in general-purpose processors.

My only point of contention is that the existence or lack of such instructions does not make any difference to whether or not you can
write a good implementation of memcpy() or memmove() in portable
standard C.

You are moving the goalposts.
One does not need a "good implementation" in the sense you have in mind.
All one needs is an implementation that the compiler's pattern-matching
logic unmistakably recognizes as memmove/memcpy. That is very easily
done in standard C. For memmove, I showed how to do it in one of the
posts below. For memcpy it's very obvious, so no need to show.

They would make it easier to write efficient
implementations of these standard library functions for targets that
had such instructions - but that would be implementation-specific
code. And that is one of the reasons that C standard library
implementations are tied to the specific compiler and target, and the
writers of these libraries have "superpowers" and are not limited to
standard C.

--- Synchronet 3.20a-Linux NewsLink 1.114

From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Oct 14 19:02:51 2024

From Newsgroup: comp.arch

On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

On 13/10/2024 17:45, Anton Ertl wrote:

I do think it would be convenient if there were a fully standard way to compare independent pointers (other than just for equality). Rarely
needing something does not mean /never/ needing it.

OK, take a segmented memory model with 16-bit pointers and a 24-bit
virtual address space. How do you actually compare two segmented
pointers?
--- Synchronet 3.20a-Linux NewsLink 1.114

From Michael S@already5chosen@yahoo.com to comp.arch on Mon Oct 14 22:20:42 2024

From Newsgroup: comp.arch

On Mon, 14 Oct 2024 19:02:51 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:

On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

On 13/10/2024 17:45, Anton Ertl wrote:

I do think it would be convenient if there were a fully standard
way to compare independent pointers (other than just for equality).
Rarely needing something does not mean /never/ needing it.

OK, take a segmented memory model with 16-bit pointers and a 24-bit
virtual address space. How do you actually compare two segmented
pointers?

That's their problem. The rest of the C world shouldn't suffer because
of odd birds.

--- Synchronet 3.20a-Linux NewsLink 1.114

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Oct 14 19:39:41 2024

From Newsgroup: comp.arch

mitchalsup@aol.com (MitchAlsup1) writes:

On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

On 13/10/2024 17:45, Anton Ertl wrote:

I do think it would be convenient if there were a fully standard way to
compare independent pointers (other than just for equality). Rarely
needing something does not mean /never/ needing it.

OK, take a segmented memory model with 16-bit pointers and a 24-bit
virtual address space. How do you actually compare two segmented
pointers?

Depends. On the Burroughs mainframe there could be eight
active segments and the segment number was part of the pointer.

Pointers were 32-bits (actually 8 BCD digits)

S s OOOOOO

Where 'S' was a sign digit (C or D), 's' was the
segment number (0-7) and OOOOOO was the six digit
offset within the segment (500kB/1000kD each).

A particular task (process) could have up to
one million "environments", each environment
could have up to 100 "memory areas" (up to 1000kD)
of which the first eight were loaded into the
processor base/limit registers. Index registers
were 8 digits and were loaded with a pointer as
described above. Operands could optionally select
one of the index registers and the operand address
was treated as an offset to the index register;
there were 7 index registers.

Access to memory areas 8-99 used string instructions
where the pointer was 16 BCD digits:

EEEEEEMM SsOOOOOO

Where EEEEEE was the environment number (0-999999);
environments starting with D00000 were reserved for
the MCP (Operating System). MM was the memory area
number and the remaining eight digits described the
data within the memory area. A subroutine call could
call within a memory area or switch to a new environment.

Memory area 1 was the code region for the segment,
Memory area 0 held the stack and some global variables
and was typically shared by all environments.
Memory areas 2-7 were application dependent and could
be configured to be shared between environments at
link time.
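
A small sketch of how the 8-digit "S s OOOOOO" pointer described above could be taken apart. This is only an illustration: the struct and function names are invented, and it assumes the eight BCD digits are packed into a 32-bit word with the sign digit in the most significant nibble, which the post does not state.

```c
#include <stdint.h>

/* Hypothetical decoded form of the 8-digit pointer "S s OOOOOO". */
typedef struct {
    unsigned sign;      /* sign digit, 0xC or 0xD */
    unsigned segment;   /* segment number, 0-7 */
    uint32_t offset;    /* six-digit decimal offset within the segment */
} bcd_pointer;

/* Assumes one BCD digit per nibble, most significant digit first.
   That packing order is an assumption made for this sketch. */
bcd_pointer decode_bcd_pointer(uint32_t packed)
{
    bcd_pointer p;

    p.sign = (packed >> 28) & 0xFu;
    p.segment = (packed >> 24) & 0xFu;
    p.offset = 0;
    /* Convert the six offset digits from BCD to a binary integer. */
    for (int shift = 20; shift >= 0; shift -= 4) {
        p.offset = p.offset * 10u + ((packed >> shift) & 0xFu);
    }
    return p;
}
```

With the segment number carried in the pointer itself, two pointers into the same segment can be ordered by offset, while ordering across segments needs a rule of its own.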
--- Synchronet 3.20a-Linux NewsLink 1.114

From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Mon Oct 14 23:46:10 2024

From Newsgroup: comp.arch

On Tue, 8 Oct 2024 20:53:00 +0000, MitchAlsup1 wrote:

The Algol family of block structure gave the illusion that flat was less
necessary and it could all be done with lexical addressing
and block scoping rules.

Then malloc() and mmap() came along.

Algol-68 already had heap allocation and flex arrays. (The folks over in MULTICS land were working on mmap.)
--- Synchronet 3.20a-Linux NewsLink 1.114

From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Oct 15 00:14:25 2024

From Newsgroup: comp.arch

On Mon, 14 Oct 2024 19:20:42 +0000, Michael S wrote:

On Mon, 14 Oct 2024 19:02:51 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:

On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

On 13/10/2024 17:45, Anton Ertl wrote:

I do think it would be convenient if there were a fully standard
way to compare independent pointers (other than just for equality).
Rarely needing something does not mean /never/ needing it.

OK, take a segmented memory model with 16-bit pointers and a 24-bit
virtual address space. How do you actually compare two segmented
pointers?

That's their problem. The rest of the C world shouldn't suffer because
of odd birds.

So, you are saying that 286 in its hey-day was/is odd ?!?
--- Synchronet 3.20a-Linux NewsLink 1.114

From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Oct 15 00:15:49 2024

From Newsgroup: comp.arch

On Mon, 14 Oct 2024 19:39:41 +0000, Scott Lurndal wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

On 13/10/2024 17:45, Anton Ertl wrote:

I do think it would be convenient if there were a fully standard way to
compare independent pointers (other than just for equality). Rarely
needing something does not mean /never/ needing it.

OK, take a segmented memory model with 16-bit pointers and a 24-bit
virtual address space. How do you actually compare two segmented
pointers?

Depends. On the Burroughs mainframe there could be eight
active segments and the segment number was part of the pointer.

Pointers were 32-bits (actually 8 BCD digits)

Stick to the question asked. Registers were 16 binary digits,
and segment registers enabled access to 24-bit address space.
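
One way to picture the difficulty, as a simplified sketch rather than a description of any real compiler: on a 286-style machine a pointer is a 16-bit selector plus a 16-bit offset, and an ordering only makes sense after both are turned into 24-bit linear addresses. The table contents and all names below are invented for illustration, and the selector's low three bits (table indicator and privilege level) are simply discarded.

```c
#include <assert.h>
#include <stdint.h>

#define DESCRIPTOR_COUNT 4u

/* Hypothetical segmented pointer: 16-bit selector, 16-bit offset. */
typedef struct {
    uint16_t selector;
    uint16_t offset;
} far_ptr;

/* Made-up stand-in for a descriptor table: one 24-bit base per entry.
   Entries 1 and 2 are deliberately chosen so that they overlap. */
static const uint32_t segment_base[DESCRIPTOR_COUNT] = {
    0x000000u, 0x010000u, 0x01FFF0u, 0x200000u
};

/* Look up the segment base and add the offset, keeping 24 bits. */
uint32_t linear_address(far_ptr p)
{
    unsigned index = p.selector >> 3;   /* drop TI and RPL bits */

    assert(index < DESCRIPTOR_COUNT);
    return (segment_base[index] + p.offset) & 0xFFFFFFu;
}

/* Returns <0, 0 or >0 according to the order of the linear addresses. */
int compare_far(far_ptr a, far_ptr b)
{
    uint32_t la = linear_address(a);
    uint32_t lb = linear_address(b);

    return (la > lb) - (la < lb);
}
```

With these made-up bases, selector 0x08 with offset 0xFFF8 and selector 0x10 with offset 0x0008 name the same byte, so comparing the raw selector:offset pairs would give the wrong answer. Every comparison needs a table lookup, which ordinary user code may not even be permitted to do.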
--- Synchronet 3.20a-Linux NewsLink 1.114

From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Tue Oct 15 05:20:10 2024

From Newsgroup: comp.arch

On Tue, 8 Oct 2024 21:03:40 -0000 (UTC), John Levine wrote:

According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:

If you look at the 8086 manuals, that's clearly what they had in mind.

What I don't get is that the 286's segment stuff was so slow.

It had to load the whole segment descriptor from RAM and possibly
perform some additional setup.

Right, and they appeared not to care or realize it was a performance
problem.

They didn’t expect anybody to make serious use of it.
--- Synchronet 3.20a-Linux NewsLink 1.114

From Michael S@already5chosen@yahoo.com to comp.arch on Tue Oct 15 10:41:41 2024

From Newsgroup: comp.arch

On Tue, 15 Oct 2024 00:14:25 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:

On Mon, 14 Oct 2024 19:20:42 +0000, Michael S wrote:

On Mon, 14 Oct 2024 19:02:51 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:

On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

On 13/10/2024 17:45, Anton Ertl wrote:

I do think it would be convenient if there were a fully standard
way to compare independent pointers (other than just for
equality). Rarely needing something does not mean /never/ needing
it.

OK, take a segmented memory model with 16-bit pointers and a 24-bit
virtual address space. How do you actually compare two segmented
pointers?

That's their problem. The rest of the C world shouldn't suffer
because of odd birds.

So, you are saying that 286 in its hey-day was/is odd ?!?

In its heyday the 80286 was used as a MUCH faster 8088.
286-as-286 was/is an odd creature. I'd dare to say that it had no heyday.

--- Synchronet 3.20a-Linux NewsLink 1.114

From David Brown@david.brown@hesbynett.no to comp.arch on Tue Oct 15 10:53:30 2024

From Newsgroup: comp.arch

On 14/10/2024 18:08, Michael S wrote:

On Mon, 14 Oct 2024 17:19:40 +0200
David Brown <david.brown@hesbynett.no> wrote:

On 14/10/2024 16:40, Terje Mathisen wrote:

David Brown wrote:

On 13/10/2024 21:21, Terje Mathisen wrote:

David Brown wrote:

On 10/10/2024 20:38, MitchAlsup1 wrote:

On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

On 09/10/2024 23:37, MitchAlsup1 wrote:

On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

On 09/10/2024 20:10, Thomas Koenig wrote:

David Brown <david.brown@hesbynett.no> schrieb:

When would you ever /need/ to compare pointers to
different objects?
For almost all C programmers, the answer is "never".

Sometimes, it is handy to encode certain conditions in
pointers, rather than having only a valid pointer or
NULL. A compiler, for example, might want to store the
fact that an error occurred while parsing a subexpression
as a special pointer constant.

Compilers often have the unfair advantage, though, that
they can rely on what application programmers cannot, their
implementation details. (Some do not, such as f2c).

Standard library authors have the same superpowers, so that
they can implement an efficient memmove() even though a pure
standard C programmer cannot (other than by simply calling the
standard library memmove() function!).

This is more a symptom of bad ISA design/evolution than of
libc writers needing superpowers.

No, it is not. It has absolutely /nothing/ to do with the ISA.

For example, if ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
or desired in the libc call.

The existence of a dedicated assembly instruction does not let
you write an efficient memmove() in standard C. That's why I
said there was no connection between the two concepts.

For some targets, it can be helpful to write memmove() in
assembly or using inline assembly, rather than in non-portable C
(which is the common case).

Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.

It is not that simple.

There can often be trade-offs between the speed of memmove() and
memcpy() on large transfers, and the overhead in setting things
up that is proportionally more costly for small transfers.
Often that can be eliminated when the compiler optimises the
functions inline - when the compiler knows the size of the
move/copy, it can optimise directly.

What you are missing here David is the fact that Mitch's MM is a
single instruction which does the entire memmove() operation, and
has the inside knowledge about cache (residency at level x? width
in bytes)/memory ranges/access rights/etc needed to do so in a
very close to optimal manner, for both short and long transfers.

I am not missing that at all. And I agree that an advanced
hardware MM instruction could be a very efficient way to implement
both memcpy and memmove. (For my own kind of work, I'd worry
about such looping instructions causing an unbounded increase in
interrupt latency, but that too is solvable given enough hardware
effort.)

And I agree that once you have an "MM" (or similar) instruction,
you don't need to re-write the implementation for your memmove()
and memcpy() library functions for every new generation of
processors of a given target family.

What I /don't/ agree with is the claim that you /do/ need to keep
re-writing your implementations all the time. You will
/sometimes/ get benefits from doing so, but it is not as simple as
Mitch made out.

I.e. totally removing the need for compiler tricks or wide
register operations.

Also apropos the compiler library issue:

You start by teaching the compiler about the MM instruction, and
to recognize common patterns (just as most compilers already do
today), and then the memmove() calls will usually be inlined.

The original compiler library issue was that it is impossible to
write an efficient memmove() implementation using pure portable
standard C. That is independent of any ISA, any specialist
instructions for memory moves, and any compiler optimisations.
And it is independent of the fact that some good compilers can
inline at least some calls to memcpy() and memmove() today, using
whatever instructions are most efficient for the target.

David, you and Mitch are among my most cherished writers here on
c.arch, I really don't think any of us really disagree, it is just
that we have been discussing two (mostly) orthogonal issues.

I agree. It's a "god dag mann, økseskaft" situation.

I have a huge respect for Mitch, his knowledge and experience, and
his willingness to share that freely with others. That's why I have
found this very frustrating.

a) memmove/memcpy are so important that people have been spending a
lot of time & effort trying to make it faster, with the
complication that in general it cannot be implemented in pure C
(which disallows direct comparison of arbitrary pointers).

Yes.

(Unlike memmove(), memcpy() can be implemented in standard C as a
simple byte-copy loop, without needing to compare pointers. But an
implementation that copies in larger blocks than a byte requires
implementation dependent behaviour to determine alignments, or it
must rely on unaligned accesses being allowed by the implementation.)

b) Mitch has, like Andy ("Crazy") Glew many years before, realized
that if a cpu architecture actually has an instruction designed to
do this particular job, it behooves cpu architects to make sure
that it is in fact so fast that it obviates any need for tricky
coding to replace it.

Yes.

Ideally, it should be able to copy a single object, up to a cache
line in size, in the same or less time needed to do so manually
with a SIMD 512-bit load followed by a 512-bit store (both ops
masked to not touch anything it shouldn't)

Yes.

REP MOVSB on x86 does the canonical memcpy() operation, originally
by moving single bytes, and this was so slow that we also had REP
MOVSW (moving 16-bit entities) and then REP MOVSD on the 386 and
REP MOVSQ on 64-bit cpus.

With a suitable chunk of logic, the basic MOVSB operation could in
fact handle any kinds of alignments and sizes, while doing the
actual transfer at maximum bus speeds, i.e. at least one cache
line/cycle for things already in $L1.

I agree on all of that.

I am quite happy with the argument that suitable hardware can do
these basic operations faster than a software loop or the x86 "rep"
instructions.

No, that's not true. And according to my understanding, that's not what
Terje wrote.
REP MOVSB _is_ an almost ideal instruction for memcpy (modulo minor
details - fixed registers for src, dest, len and the Direction flag in
PSW instead of being part of the opcode).

My understanding of what Terje wrote is that REP MOVSB /could/ be an
efficient solution if it were backed by a hardware block to run well
(i.e., transferring as many bytes per cycle as memory bus bandwidth
allows). But REP MOVSB is /not/ efficient - and rather than making it
work faster, Intel introduced variants with wider fixed sizes.

Could REP MOVSB realistically be improved to be as efficient as the
instructions in ARMv9, RISC-V, and Mitch's "MM" instruction? I don't
know. Intel and AMD have had many decades to do so, so I assume it's
not an easy improvement.

REP MOVSW/D/Q were introduced because back then processors were small
and stupid. When your processor is big and smart you don't need them
any longer. REP MOVSB is sufficient.
The new Arm64 instructions that are hopefully coming next year are akin
to REP MOVSB rather than to MOVSW/D/Q.
Instructions for memmove, also defined by Arm and by Mitch, are the next
logical step. IMHO, the main gain here is not a measurable improvement
in performance, but a saving in code size when inlined.

Now, is all that a good idea?

That's a very important question.

I am not 100% convinced.
One can argue that the streaming alignment hardware that is necessary
for a first-class implementation of these instructions is useful not
only for memory copy.
So maybe it makes sense to expose this hardware in more generic ways.

I believe that is the idea of "scalable vector" instructions as an
alternative philosophy to wide explicit SIMD registers. My expectation
is that SVE implementations will be more effort in the hardware than
SIMD for any specific SIMD-friendly size point (i.e., power-of-two
widths). That usually corresponds to lower clock rates and/or higher
latency and more coordination from extra pipeline stages.

But once you have SVE support in place, then memcpy() and memset() are
just examples of vector operations that you get almost for free when you
have hardware for vector MACs and other operations.

Maybe via Load Multiple Register? It was present in Arm's A32/T32,
but didn't make it into ARM64. Or maybe there are even better ways
that I have not thought of.

And I fully agree that these would be useful features
in general-purpose processors.

My only point of contention is that the existence or lack of such
instructions does not make any difference to whether or not you can
write a good implementation of memcpy() or memmove() in portable
standard C.

You are moving the goalposts.

No, my goalposts have been in the same place all the time. Some others
have been kicking the ball at a completely different set of goalposts,
but I have kept the same point all along.

One does not need a "good implementation" in the sense you have in mind.

Maybe not - but /that/ would be moving the goalposts.

All one needs is an implementation that the compiler's pattern-matching
logic unmistakably recognizes as memmove/memcpy. That is very easily
done in standard C. For memmove, I showed how to do it in one of the
posts below. For memcpy it's very obvious, so no need to show.

But that would /not/ be an efficient implementation of memmove() in
plain portable standard C.

What do I mean by an "efficient" implementation in fully portable
standard C? There are two possible ways to think about that. One is
that the operations on the abstract machine are efficient. The other is
that the code is likely to result in efficient code over a wide range of real-world compilers, options, and targets. And I think it goes without saying that the implementation must not rely on any
implementation-defined behaviour or anything beyond the minimal limits
given in the C standards, and it must not introduce any new real or
potential UB.

Your "memmove()" implementation fails on several counts. It is
inefficient in the abstract machine - it copies everything twice instead
of once. It is inefficient in real-world implementations of all sorts
and countless targets - being efficient for some compilers with some
options on some targets (most of them hypothetical) does /not/ qualify
as an efficient implementation. And quite clearly it risks causing
failures from stack overflow in situations where the user would normally
expect memmove() to function safely (on implementations other than those
few that turn it into efficient object code).
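
To make the double-copy objection concrete, here is a minimal sketch of what such an approach might look like. It is an assumption for illustration only, not the code from the earlier post: the name `twice_memmove` is hypothetical, and the temporary is a variable-length array (an optional feature since C11), which is where the stack-overflow risk for large `n` comes from.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical double-copy memmove: every byte is copied twice,
   once into a temporary and once out of it. */
void *twice_memmove(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    if (n == 0)
        return dst;             /* a zero-length VLA would be undefined */

    unsigned char tmp[n];       /* VLA on the stack: may overflow for large n */
    memcpy(tmp, src, n);        /* first copy: source -> temporary */
    memcpy(dst, tmp, n);        /* second copy: temporary -> destination */
    return dst;
}
```

Unless the compiler recognises the pattern and replaces it, the abstract machine does twice the work and the stack usage grows with `n`.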

They would make it easier to write efficient
implementations of these standard library functions for targets that
had such instructions - but that would be implementation-specific
code. And that is one of the reasons that C standard library
implementations are tied to the specific compiler and target, and the
writers of these libraries have "superpowers" and are not limited to
standard C.

--- Synchronet 3.20a-Linux NewsLink 1.114

From Michael S@already5chosen@yahoo.com to comp.arch on Tue Oct 15 11:59:27 2024

From Newsgroup: comp.arch

On Tue, 8 Oct 2024 21:03:40 -0000 (UTC)
John Levine <johnl@taugh.com> wrote:

According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:

If you look at the 8086 manuals, that's clearly what they had in
mind.

What I don't get is that the 286's segment stuff was so slow.

It had to load the whole segment descriptor from RAM and possibly
perform some additional setup.

Right, and they appeared not to care or realize it was a performance
problem.

They didn't even do obvious things like see if you're reloading the
same value into the segment register and skip the rest of the setup.
Sure, you could put checks in your code and skip the segment load but
that would make your code a lot bigger and uglier.

The question is how the slowness of 80286 segments compares to
contemporaries that used segment-based protected memory.
Wikipedia lists the following machines as examples of segmentation:
- Burroughs B5000 and following Burroughs Large Systems
- GE 645 -> Honeywell 6080
- Prime 400 and successors
- IBM System/38
They also mention S/370, but to me segmentation in S/370 looks very
different and probably not intended for fine-grained protection.

Of those, the Burroughs B5900 looks to me the most comparable to the 80286.


From David Brown@david.brown@hesbynett.no to comp.arch on Tue Oct 15 12:38:40 2024

From Newsgroup: comp.arch

On 14/10/2024 21:02, MitchAlsup1 wrote:

On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

On 13/10/2024 17:45, Anton Ertl wrote:

I do think it would be convenient if there were a fully standard way to
compare independent pointers (other than just for equality). Rarely
needing something does not mean /never/ needing it.

OK, take a segmented memory model with 16-bit pointers and a 24-bit
virtual address space. How do you actually compare two segmented
pointers?

void * p = ...
void * q = ...

uintptr_t pu = (uintptr_t) p;
uintptr_t qu = (uintptr_t) q;

if (pu > qu) {
...
} else if (pu < qu) {
...
} else {
...
}

If your comparison needs to actually match up with the real virtual
addresses, then this will not work. But does that actually matter?

Think about using this comparison for memmove().

Consider where these pointers come from. Maybe they are pointers to
statically allocated data. Then you would expect the segment to be the
same in each case, and the uintptr_t comparison will be fine for
memmove(). Maybe they come from malloc() and are in different segments.
Then the comparison here might not give the same result as a full
virtual address comparison - but that does not matter. If the pointers
came from different mallocs, they could not overlap and memmove() can
run either direction.

The same applies to other uses, such as indexing in a binary search tree
or a hash map - the comparison above will be correct when it matters.
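
As a rough sketch of that idea, a memmove() could pick its copy direction from the uintptr_t comparison. The name `my_memmove` is hypothetical. The code assumes the implementation provides uintptr_t (it is optional) and that, within a single object, the converted integers are ordered the same way as the addresses, which the standard does not guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memmove that chooses its direction by comparing the
   pointers as integers. If the two regions do not overlap, either
   direction is correct, so a "wrong" ordering does no harm. */
void *my_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));

    if ((uintptr_t)d < (uintptr_t)s) {
        /* Destination starts below the source: copy forwards. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else if ((uintptr_t)d > (uintptr_t)s) {
        /* Destination starts above the source: copy backwards. */
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    }
    /* Equal pointers: nothing to copy. */
    return dst;
}
```

This is a byte-at-a-time loop for clarity; it makes no claim to be an efficient implementation in the sense discussed above.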


From Michael S@already5chosen@yahoo.com to comp.arch on Tue Oct 15 14:22:46 2024

From Newsgroup: comp.arch

On Tue, 15 Oct 2024 12:38:40 +0200
David Brown <david.brown@hesbynett.no> wrote:

On 14/10/2024 21:02, MitchAlsup1 wrote:

On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

On 13/10/2024 17:45, Anton Ertl wrote:

I do think it would be convenient if there were a fully standard
way to compare independent pointers (other than just for
equality). Rarely needing something does not mean /never/ needing
it.

OK, take a segmented memory model with 16-bit pointers and a 24-bit
virtual address space. How do you actually compare two segmented
pointers?

void * p = ...
void * q = ...

uintptr_t pu = (uintptr_t) p;
uintptr_t qu = (uintptr_t) q;

if (pu > qu) {
...
} else if (pu < qu) {
...
} else {
...
}

If your comparison needs to actually match up with the real virtual
addresses, then this will not work. But does that actually matter?

Think about using this comparison for memmove().

Consider where these pointers come from. Maybe they are pointers to
statically allocated data. Then you would expect the segment to be
the same in each case, and the uintptr_t comparison will be fine for
memmove(). Maybe they come from malloc() and are in different
segments. Then the comparison here might not give the same result as
a full virtual address comparison - but that does not matter. If the
pointers came from different mallocs, they could not overlap and
memmove() can run either direction.

The same applies to other uses, such as indexing in a binary search
tree or a hash map - the comparison above will be correct when it
matters.

It's all fine for as long as there are no objects bigger than 64KB.
But with 16MB of virtual memory and with several* MB of physical memory
one does want objects that are bigger than 64KB!
---
* https://theretroweb.com/motherboards/s/compaq-deskpro-286e-p-n-001226

From David Brown@david.brown@hesbynett.no to comp.arch on Tue Oct 15 14:09:58 2024

From Newsgroup: comp.arch

On 15/10/2024 13:22, Michael S wrote:

On Tue, 15 Oct 2024 12:38:40 +0200
David Brown <david.brown@hesbynett.no> wrote:

On 14/10/2024 21:02, MitchAlsup1 wrote:

On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

On 13/10/2024 17:45, Anton Ertl wrote:

I do think it would be convenient if there were a fully standard
way to compare independent pointers (other than just for
equality). Rarely needing something does not mean /never/ needing
it.

OK, take a segmented memory model with 16-bit pointers and a 24-bit
virtual address space. How do you actually compare two segmented
pointers?

void * p = ...
void * q = ...

uintptr_t pu = (uintptr_t) p;
uintptr_t qu = (uintptr_t) q;

if (pu > qu) {
...
} else if (pu < qu) {
...
} else {
...
}

If your comparison needs to actually match up with the real virtual
addresses, then this will not work. But does that actually matter?

Think about using this comparison for memmove().

Consider where these pointers come from. Maybe they are pointers to
statically allocated data. Then you would expect the segment to be
the same in each case, and the uintptr_t comparison will be fine for
memmove(). Maybe they come from malloc() and are in different
segments. Then the comparison here might not give the same result as
a full virtual address comparison - but that does not matter. If the
pointers came from different mallocs, they could not overlap and
memmove() can run either direction.

The same applies to other uses, such as indexing in a binary search
tree or a hash map - the comparison above will be correct when it
matters.

It's all fine for as long as there are no objects bigger than 64KB.
But with 16MB of virtual memory and with several* MB of physical memory
one does want objects that are bigger than 64KB!

I don't know how such objects would be allocated and addressed in such a
system. (I didn't do much DOS/Win16 programming, and on the few
occasions when I needed structures bigger than 64KB in total, they were
structured in multiple levels.)

But I would expect that in almost any practical system where you can use
"p++" to step through big arrays, you can also convert the pointer to a
uintptr_t and compare as shown above.

The exceptions would be systems where pointers hold more than just
addresses, such as access control information or bounds that mean they
are larger than the largest integer type on the target.


From Brett@ggtgp@yahoo.com to comp.arch on Tue Oct 15 19:46:23 2024

From Newsgroup: comp.arch

David Brown <david.brown@hesbynett.no> wrote:

On 15/10/2024 13:22, Michael S wrote:

On Tue, 15 Oct 2024 12:38:40 +0200
David Brown <david.brown@hesbynett.no> wrote:

On 14/10/2024 21:02, MitchAlsup1 wrote:

On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

On 13/10/2024 17:45, Anton Ertl wrote:

I do think it would be convenient if there were a fully standard
way to compare independent pointers (other than just for
equality). Rarely needing something does not mean /never/ needing
it.

OK, take a segmented memory model with 16-bit pointers and a 24-bit
virtual address space. How do you actually compare two segmented
pointers?

void * p = ...
void * q = ...

uintptr_t pu = (uintptr_t) p;
uintptr_t qu = (uintptr_t) q;

if (pu > qu) {
...
} else if (pu < qu) {
...
} else {
...
}

If your comparison needs to actually match up with the real virtual
addresses, then this will not work. But does that actually matter?

Think about using this comparison for memmove().

Consider where these pointers come from. Maybe they are pointers to
statically allocated data. Then you would expect the segment to be
the same in each case, and the uintptr_t comparison will be fine for
memmove(). Maybe they come from malloc() and are in different
segments. Then the comparison here might not give the same result as
a full virtual address comparison - but that does not matter. If the
pointers came from different mallocs, they could not overlap and
memmove() can run either direction.

The same applies to other uses, such as indexing in a binary search
tree or a hash map - the comparison above will be correct when it
matters.

It's all fine for as long as there are no objects bigger than 64KB.
But with 16MB of virtual memory and with several* MB of physical memory
one does want objects that are bigger than 64KB!

I don't know how such objects would be allocated and addressed in such a
system. (I didn't do much DOS/Win16 programming, and on the few
occasions when I needed structures bigger than 64KB in total, they were
structured in multiple levels.)

But I would expect that in almost any practical system where you can use
"p++" to step through big arrays, you can also convert the pointer to a
uintptr_t and compare as shown above.

The exceptions would be systems where pointers hold more than just
addresses, such as access control information or bounds that mean they
are larger than the largest integer type on the target.

EGA graphics had more than 64KB of memory, so smart software would group
one or more scan lines into segments for bit mapping the array. A bit
mapper works a scan line at a time, so segment changes were not that
expensive. This was profoundly faster than using pixel pokes and the
other default methods of changing bits.


From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Tue Oct 15 17:26:29 2024

From Newsgroup: comp.arch

There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library.

It goes a bit further: for a general purpose language, any existing
functionality that cannot be written using the language is a sign of
a weakness because it shows that despite being "general purpose" it
fails to cover this specific "purpose".

In an ideal world, it would be better if we could define `malloc` and
`memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.

Stefan

From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Oct 15 21:55:44 2024

From Newsgroup: comp.arch

On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library.

It goes a bit further: for a general purpose language, any existing
functionality that cannot be written using the language is a sign of
a weakness because it shows that despite being "general purpose" it
fails to cover this specific "purpose".

One of the key ways C got into the minds of programmers was that
one could write stuff like printf() in C and NOT need to have it
entirely built-into the language.

In an ideal world, it would be better if we could define `malloc` and
`memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.

malloc() used to be std. K&R C--what dropped it from the std?

Stefan


From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Oct 15 22:05:56 2024

From Newsgroup: comp.arch

mitchalsup@aol.com (MitchAlsup1) writes:

On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library.

It goes a bit further: for a general purpose language, any existing
functionality that cannot be written using the language is a sign of
a weakness because it shows that despite being "general purpose" it
fails to cover this specific "purpose".

One of the key ways C got into the minds of programmers was that
one could write stuff like printf() in C and NOT need to have it
entirely built-into the language.

In an ideal world, it would be better if we could define `malloc` and
`memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.

malloc() used to be std. K&R C--what dropped it from the std?

It still is part of the ISO C standard.

https://pubs.opengroup.org/onlinepubs/9799919799/functions/malloc.html

POSIX adds some extensions (marked 'CX').


From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Oct 16 00:24:07 2024

From Newsgroup: comp.arch

On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library.

It goes a bit further: for a general purpose language, any existing
functionality that cannot be written using the language is a sign of
a weakness because it shows that despite being "general purpose" it
fails to cover this specific "purpose".

One of the key ways C got into the minds of programmers was that
one could write stuff like printf() in C and NOT need to have it
entirely built-into the language.

In an ideal world, it would be better if we could define `malloc` and
`memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.

malloc() used to be std. K&R C--what dropped it from the std?

It still is part of the ISO C standard.

The paragraph with 3 >'s indicates malloc() cannot be written
in std. C. It used to be written in std. K&R C. I am not asking
if it is still in the std libraries, I am asking what happened
to make it impossible to write malloc() in std. C?

https://pubs.opengroup.org/onlinepubs/9799919799/functions/malloc.html

POSIX adds some extensions (marked 'CX').


From David Brown@david.brown@hesbynett.no to comp.arch on Wed Oct 16 09:21:59 2024

From Newsgroup: comp.arch

On 15/10/2024 23:26, Stefan Monnier wrote:

There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library.

It goes a bit further: for a general purpose language, any existing
functionality that cannot be written using the language is a sign of
a weakness because it shows that despite being "general purpose" it
fails to cover this specific "purpose".

In an ideal world, it would be better if we could define `malloc` and
`memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.

I don't see an advantage in being able to implement them in standard C.
I /do/ see an advantage in being able to do so well in non-standard,
implementation-specific C.

The reason why you might want your own special memmove, or your own
special malloc, is that you are doing niche and specialised software.
For example, you might be making real-time software and require specific
time constraints on these functions. In such cases, you are not
interested in writing fully portable software - it will already contain
many implementation-specific features or use compiler extensions.


From David Brown@david.brown@hesbynett.no to comp.arch on Wed Oct 16 09:38:20 2024

From Newsgroup: comp.arch

On 15/10/2024 23:55, MitchAlsup1 wrote:

On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library.

It goes a bit further: for a general purpose language, any existing
functionality that cannot be written using the language is a sign of
a weakness because it shows that despite being "general purpose" it
fails to cover this specific "purpose".

One of the key ways C got into the minds of programmers was that
one could write stuff like printf() in C and NOT need to have it
entirely built-into the language.

It's a very good philosophy in programming language design that the core
language should only contain what it has to contain - if a desired
feature can be put in a library and be equally efficient and convenient
to use, then it should be in the standard library, not the core
language. It is much easier to develop, implement, enhance, adapt, and
otherwise change things in libraries than the core language.

And it is also fine, IMHO, that some things in the standard library need
non-standard C - the standard library is part of the implementation.

In an ideal world, it would be better if we could define `malloc` and
`memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.

malloc() used to be std. K&R C--what dropped it from the std?

The function has always been available in C since the language was
standardised, and AFAIK it was in K&R C. But no one (in authority) ever
claimed it could be implemented purely in standard C. What do you think
has changed?


From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Wed Oct 16 11:18:19 2024

From Newsgroup: comp.arch

There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library.

It goes a bit further: for a general purpose language, any existing
functionality that cannot be written using the language is a sign of
a weakness because it shows that despite being "general purpose" it
fails to cover this specific "purpose".
In an ideal world, it would be better if we could define `malloc` and
`memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.

I don't see an advantage in being able to implement them in standard C.

It means you can likely also implement a related yet different API
without having your code "demoted" to non-standard.
E.g. say if your application wants to use a region/pool/zone-based
memory management.

The fact that malloc can't be implemented in standard C is evidence
that standard C may not be general-purpose enough to accommodate an
application that wants to use a custom-designed allocator.

I don't disagree with you, from a practical perspective:

- in practice, C serves us well for Emacs's GC, even though that can't
be written in standard C.
- it's not like there are lots of other languages out there that offer
you portability together with the ability to define your own `malloc`.

But it's still a weakness, just a fairly minor one.

The reason why you might want your own special memmove, or your own
special malloc, is that you are doing niche and specialised software.

Region/pool/zone-based memory management is common enough that I would
not call it "niche", FWIW, and it's also used in applications that do
want portability (GCC and Apache come to mind).
Can't think of a practical reason to implement my own `memmove`, OTOH.

Stefan

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Oct 16 15:38:47 2024

From Newsgroup: comp.arch

mitchalsup@aol.com (MitchAlsup1) writes:

On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library.

It goes a bit further: for a general purpose language, any existing
functionality that cannot be written using the language is a sign of
a weakness because it shows that despite being "general purpose" it
fails to cover this specific "purpose".

One of the key ways C got into the minds of programmers was that
one could write stuff like printf() in C and NOT need to have it
entirely built-into the language.

In an ideal world, it would be better if we could define `malloc` and
`memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.

malloc() used to be std. K&R C--what dropped it from the std?

It still is part of the ISO C standard.

The paragraph with 3 >'s indicates malloc() cannot be written
in std. C. It used to be written in std. K&R C.

K&R may have been 'de facto' standard C, but not 'de jure'.

Unix V6 malloc used the 'brk' system call to allocate space
for the heap. Later versions used 'sbrk'.

Those are both kernel system calls.


From David Brown@david.brown@hesbynett.no to comp.arch on Wed Oct 16 19:57:03 2024

From Newsgroup: comp.arch

(Please do not snip or omit attributions. There are Usenet standards
for a reason.)

On 16/10/2024 17:18, Stefan Monnier wrote:

There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library.

It goes a bit further: for a general purpose language, any existing
functionality that cannot be written using the language is a sign of
a weakness because it shows that despite being "general purpose" it
fails to cover this specific "purpose".
In an ideal world, it would be better if we could define `malloc` and
`memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.

I don't see an advantage in being able to implement them in standard C.

It means you can likely also implement a related yet different API
without having your code "demoted" to non-standard.

That makes no sense to me. We are talking about implementing standard
library functions. If you want to implement other functions, go ahead.

Or do you mean that it is only possible to implement related functions
(such as memory pools) if you also can implement malloc in fully
portable standard C? That would make a little more sense if it were
true, but it is not. First, you can implement such functions in
implementation-specific code, so you are not hindered from writing the
code you want. Secondly, standard C provides functions such as malloc()
and aligned_alloc() that give you the parts you need - the fact that you
need something outside of standard C to implement malloc() does not
imply that you need those same features to implement your additional
functions.
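
As a minimal sketch of that second point, a region (arena) allocator can be layered on top of malloc() using only standard facilities. The `region` type and the `region_*` names are hypothetical, chosen for illustration; the code assumes C11 for `_Alignof` and `max_align_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical region: one malloc()ed block handed out front to back. */
typedef struct {
    unsigned char *base;   /* start of the block, or NULL */
    size_t size;           /* total bytes in the block */
    size_t used;           /* bytes already handed out */
} region;

/* Returns nonzero on success. */
int region_init(region *r, size_t size)
{
    r->base = malloc(size);
    r->size = r->base ? size : 0;
    r->used = 0;
    return r->base != NULL;
}

/* Bump allocation; each request is rounded up so that every returned
   pointer keeps the alignment malloc() gave to the block itself. */
void *region_alloc(region *r, size_t n)
{
    const size_t align = _Alignof(max_align_t);

    if (r->base == NULL || n > SIZE_MAX - (align - 1))
        return NULL;
    size_t rounded = (n + align - 1) / align * align;

    assert(r->used <= r->size);
    if (rounded > r->size - r->used)
        return NULL;            /* region exhausted */

    void *p = r->base + r->used;
    r->used += rounded;
    return p;
}

/* Frees everything allocated from the region in one step. */
void region_release(region *r)
{
    free(r->base);
    r->base = NULL;
    r->size = r->used = 0;
}
```

Nothing here compares unrelated pointers or relies on implementation-defined behaviour; the only thing borrowed from the implementation is malloc() itself.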

E.g. say if your application wants to use a region/pool/zone-based
memory management.

The fact that malloc can't be implemented in standard C is evidence
that standard C may not be general-purpose enough to accommodate an
application that wants to use a custom-designed allocator.

No, it is not - see above.

And remember how C was designed and how it was intended to be used. The
aim was to be able to write portable code that could be reused on many
systems, and /also/ implementation, OS and target specific code for
maximum efficiency, systems programming, and other non-portable work. A
typical C program combines these - some parts can be fully portable,
other parts are partially portable (such as to any POSIX system, or
targets with 32-bit int and 8-bit char), and some parts may be very
compiler-specific or target specific.

That's not an indication of failure of C for general-purpose
programming. (But I would certainly not suggest that C is the best
choice of language for many "general" programming tasks.)

I don't disagree with you, from a practical perspective:

- in practice, C serves us well for Emacs's GC, even though that can't
be written in standard C.
- it's not like there are lots of other languages out there that offer
you portability together with the ability to define your own `malloc`.

But it's still a weakness, just a fairly minor one.

The reason why you might want your own special memmove, or your own
special malloc, is that you are doing niche and specialised software.

Region/pool/zone-based memory management is common enough that I would
not call it "niche", FWIW, and it's also used in applications that do
want portability (GCC and Apache come to mind).
Can't think of a practical reason to implement my own `memmove`, OTOH.

Stefan


From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Oct 16 20:00:27 2024

From Newsgroup: comp.arch

MitchAlsup1 <mitchalsup@aol.com> schrieb:

The paragraph with 3 >'s indicates malloc() cannot be written
in std. C. It used to be written in std. K&R C. I am not asking
if it is still in the std libraries, I am asking what happened
to make it impossible to write malloc() in std. C?

You need to reserve memory by some way from the operating system,
which is, by necessity, outside of the scope of C (via brk(),
GETMAIN, mmap() or whatever).

But more problematic is the implementation of free() without
knowing how to compare pointers.

From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Oct 16 22:18:49 2024

From Newsgroup: comp.arch

On Wed, 16 Oct 2024 20:00:27 +0000, Thomas Koenig wrote:

MitchAlsup1 <mitchalsup@aol.com> schrieb:

The paragraph with 3 >'s indicates malloc() cannot be written
in std. C. It used to be written in std. K&R C. I am not asking
if it is still in the std libraries, I am asking what happened
to make it impossible to write malloc() in std. C?

You need to reserve memory by some way from the operating system,
which is, by necessity, outside of the scope of C (via brk(),
GETMAIN, mmap() or whatever).

Agreed, but once you HAVE a way of getting memory (by whatever name)
you can write malloc in std. C.

But more problematic is the implementation of free() without
knowing how to compare pointers.

Never wrote a program that actually needs free--I have re-written
programs that used free to avoid using free, though.

From George Neuner@gneuner2@comcast.net to comp.arch on Wed Oct 16 23:06:24 2024

From Newsgroup: comp.arch

On Wed, 16 Oct 2024 15:38:47 GMT, scott@slp53.sl.home (Scott Lurndal)
wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library.

It goes a bit further: for a general purpose language, any existing
functionality that cannot be written using the language is a sign of
a weakness because it shows that despite being "general purpose" it
fails to cover this specific "purpose".

One of the key ways C got into the minds of programmers was that
one could write stuff like printf() in C and NOT need to have it
entirely built-into the language.

In an ideal world, it would be better if we could define `malloc` and
`memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.

malloc() used to be std. K&R C--what dropped it from the std?

It still is part of the ISO C standard.

The paragraph with 3 >'s indicates malloc() cannot be written
in std. C. It used to be written in std. K&R C.

K&R may have been 'de facto' standard C, but not 'de jure'.

Unix V6 malloc used the 'brk' system call to allocate space
for the heap. Later versions used 'sbrk'.

Those are both kernel system calls.

Yes, but malloc() subdivides an already provided space. Because that
space can be treated as a single array of char, and comparing pointers
to elements of the same array is legal, the only thing I can see that
prevents writing malloc() in standard C would be the need to somehow
define the array from the /language's/ POV (not the compiler's) prior
to using it.

Which circles back to why something like

char (*heap)[ULONG_MAX] = ... ;

would/does not satisfy the language's requirement. All the compilers
I have ever seen would have been happy with it, but none of them ever
needed something like it anyway. Conversion to <an integer type> also
would always work, but also was never needed.

I am not a language lawyer - I don't even pretend to understand the
arguments against allowing general pointer comparison.

Aside: I have worked on architectures (DSPs) having disjoint memory
spaces, spaces with differing bit widths, and even spaces where [sans
MMU] the same physical address had multiple logical addresses whose
use depended on the type of access.

I have written allocators and even a GC for such architectures. Never
had a problem convincing C compilers to compare pointers - the only
issue I ever faced was whether the result actually was meaningful to
the program.

From George Neuner@gneuner2@comcast.net to comp.arch on Wed Oct 16 23:32:41 2024

From Newsgroup: comp.arch

On Wed, 16 Oct 2024 09:38:20 +0200, David Brown
<david.brown@hesbynett.no> wrote:

It's a very good philosophy in programming language design that the core
language should only contain what it has to contain - if a desired
feature can be put in a library and be equally efficient and convenient
to use, then it should be in the standard library, not the core
language. It is much easier to develop, implement, enhance, adapt, and
otherwise change things in libraries than the core language.

And it is also fine, IMHO, that some things in the standard library need
non-standard C - the standard library is part of the implementation.

But it is a problem if the library has to be written using a different
compiler. [For this purpose I would consider specifying different
compiler flags to be using a different compiler.]

Why? Because once these things are discovered, many programmers will
see their advantages and lack the discipline to avoid using them for
more general application work.

In an ideal world, it would be better if we could define `malloc` and
`memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.

malloc() used to be std. K&R C--what dropped it from the std?

The function has always been available in C since the language was
standardised, and AFAIK it was in K&R C. But no one (in authority) ever
claimed it could be implemented purely in standard C. What do you think
has changed?


From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.arch on Thu Oct 17 00:40:34 2024

From Newsgroup: comp.arch

Thomas Koenig <tkoenig@netcologne.de> writes:

MitchAlsup1 <mitchalsup@aol.com> schrieb:

The paragraph with 3 >'s indicates malloc() cannot be written
in std. C. It used to be written in std. K&R C. I am not asking
if it is still in the std libraries, I am asking what happened
to make it impossible to write malloc() in std. C ?!?

You need to reserve memory by some way from the operating system,
which is, by necessity, outside of the scope of C (via brk(),
GETMAIN, mmap() or whatever).

Right. And that is why malloc(), or some essential internal component
of malloc(), has to be platform specific, and thus malloc() must be
supplied by the implementation (which means both the compiler and the
standard library).

But more problematic is the implementation of free() without knowing
how to compare pointers.

Once there is a way to get additional memory from whatever underlying environment is there, malloc() and free() can be implemented (and I
believe most often are implemented) without needing to compare
pointers. Note: pointers can be tested for equality without having
to compare them relationally, and testing pointers for equality is
well-defined between any two pointers (which may need to be converted
to 'void *' to avoid a type mismatch).
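
A minimal sketch of that point, under stated assumptions: a fixed-size block pool whose allocate and free paths use only pointer equality tests and integer comparisons, never `<` or `>` on pointers. The names `pool_alloc`, `pool_free`, `BLOCK_SIZE` and `BLOCK_COUNT` are hypothetical, and a static array stands in for memory that a real implementation would obtain from its environment.

```c
#include <stddef.h>

/* Illustrative sizes only. */
#define BLOCK_SIZE  64
#define BLOCK_COUNT 16

typedef union block {
    union block  *next;                 /* link used while the block is free */
    max_align_t   align;                /* forces suitable alignment */
    unsigned char payload[BLOCK_SIZE];  /* storage handed to the caller */
} block;

static block  pool[BLOCK_COUNT];        /* stand-in for memory from the platform */
static size_t pool_used;                /* number of never-used blocks handed out */
static block *free_list;                /* blocks returned through pool_free() */

void *pool_alloc(void)
{
    if (free_list != NULL) {            /* equality test only */
        block *b = free_list;
        free_list = b->next;
        return b;
    }
    if (pool_used < BLOCK_COUNT)        /* integer comparison, not a pointer one */
        return &pool[pool_used++];
    return NULL;                        /* pool exhausted */
}

void pool_free(void *p)
{
    if (p == NULL)
        return;
    block *b = p;                       /* push the block back on the free list */
    b->next = free_list;
    free_list = b;
}
```

A general-purpose malloc() that coalesces adjacent free blocks is a different matter; this sketch only shows that the basic allocate/free cycle does not need relational comparison.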
--- Synchronet 3.20a-Linux NewsLink 1.114

From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.arch on Thu Oct 17 01:18:04 2024

From Newsgroup: comp.arch

mitchalsup@aol.com (MitchAlsup1) writes:

On Wed, 16 Oct 2024 20:00:27 +0000, Thomas Koenig wrote:

MitchAlsup1 <mitchalsup@aol.com> schrieb:

The paragraph with 3 >'s indicates malloc() cannot be written in
standard C. It used to be written in standard K&R C. I am not
asking if it is still in the std libraries, I am asking what
happened to make it impossible to write malloc() in standard C ?!?

You need to reserve memory by some way from the operating system,
which is, by necessity, outside of the scope of C (via brk(),
GETMAIN, mmap() or whatever).

Agreed, but once you HAVE a way of getting memory (by whatever name)
you can write malloc in standard C.

The point is that getting more memory is inherently platform
specific, which is why malloc() must be defined by each particular implementation, and so was put in the standard library.
--- Synchronet 3.20a-Linux NewsLink 1.114

From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.arch on Thu Oct 17 02:48:49 2024

From Newsgroup: comp.arch

mitchalsup@aol.com (MitchAlsup1) writes:

On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library.

It goes a bit further: for a general purpose language, any
existing functionality that cannot be written using the language
is a sign of a weakness because it shows that despite being
"general purpose" it fails to cover this specific "purpose".

One of the key ways C got into the minds of programmers was that
one could write stuff like printf() in C and NOT need to have it
entirely built-into the language.

In an ideal world, it would be better if we could define `malloc`
and `memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.

malloc() used to be standard K&R C--what dropped it from the
standard??

It still is part of the ISO C standard.

The paragraph with 3 >'s indicates malloc() cannot be written in
standard C. It used to be written in standard K&R C.

No, it didn't. In the original book (my copy is from the third
printing of the first edition, copyright 1978), on page 175 there
is a function 'alloc()' that shows how to write a memory allocator.
The code in alloc() calls 'morecore()', described as follows:

The function morecore obtains storage from the operating system.
The details of how this is done of course vary from system to
system. In UNIX, the system entry sbrk() returns a pointer to n
more bytes of storage. [...]

An implementation of morecore() is shown on the next page, and
it indeed uses sbrk() to get more memory. That makes it UNIX
specific, not portable standard C. Both alloc() and morecore()
are part of chapter 8, "The UNIX System Interface".
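
The split described above can be sketched roughly as follows. This is not the book's code: `morecore_sketch` and `alloc_sketch` are hypothetical names, and a static buffer plays the role of the operating system so that the example stays self-contained. On UNIX the platform-specific half would call sbrk() instead.

```c
#include <stddef.h>

/* Platform-specific half (stand-in). A real morecore() would ask the
   operating system for storage; this buffer is an assumption made so the
   sketch can run anywhere. */
static _Alignas(max_align_t) unsigned char fake_os_memory[4096];
static size_t fake_os_used;

static void *morecore_sketch(size_t nbytes)
{
    if (nbytes > sizeof fake_os_memory - fake_os_used)
        return NULL;                       /* the "system" has no more storage */
    void *p = &fake_os_memory[fake_os_used];
    fake_os_used += nbytes;
    return p;
}

/* Portable half: rounds the request up to the strictest alignment and
   passes it on. It never releases storage, unlike a real allocator. */
void *alloc_sketch(size_t nbytes)
{
    const size_t a = _Alignof(max_align_t);
    if (nbytes == 0 || nbytes > (size_t)-1 - (a - 1))
        return NULL;                       /* empty request or overflow */
    size_t rounded = (nbytes + a - 1) / a * a;
    return morecore_sketch(rounded);
}
```

Only `morecore_sketch` would need rewriting for a different platform, which is the sense in which the allocator as a whole is not portable standard C.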

Note also that chapter 7, titled "Input and Output" and describing
the standard library, mentions in section 7.9, "Some Miscellaneous
Functions", the function calloc() as part of the standard library.
(There is no mention of malloc().) The point of having a standard
library is that the functions it contains depend on details of the
underlying OS and thus cannot be written in platform-agnostic code.
Being platform portable is the defining property of "standard C".

(Amusing aside: the entire standard library seems to be covered by
just #include <stdio.h>.)

I am not
asking if it is still in the standard libraries, I am asking what
happened to make it impossible to write malloc() in standard C ?!?

Functions such as sbrk() are not part of the C language. Whether
it's called calloc() or malloc(), memory allocation has always
needed access to some facilities not provided by the C language
itself. The function malloc() is not any more writable in standard
K&R C than it is in standard ISO C (except of course malloc() can
be implemented by using calloc() internally, but that depends on
calloc() being part of the standard library).
--- Synchronet 3.20a-Linux NewsLink 1.114

From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.arch on Thu Oct 17 03:16:13 2024

From Newsgroup: comp.arch

George Neuner <gneuner2@comcast.net> writes:

On Wed, 16 Oct 2024 15:38:47 GMT, scott@slp53.sl.home (Scott Lurndal)
wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

[...]

malloc() used to be standard K&R C--what dropped it from the
standard ??

It still is part of the ISO C standard.

The paragraph with 3 >'s indicates malloc() cannot be written in
standard C. It used to be written in standard K&R C.

K&R may have been 'de facto' standard C, but not 'de jure'.

Unix V6 malloc used the 'brk' system call to allocate space
for the heap. Later versions used 'sbrk'.

Those are both kernel system calls.

Yes, but malloc() subdivides an already provided space.

Not necessarily.

Because that space can be treated as a single array of char,

Not always.
--- Synchronet 3.20a-Linux NewsLink 1.114

From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.arch on Thu Oct 17 03:17:33 2024

From Newsgroup: comp.arch

mitchalsup@aol.com (MitchAlsup1) writes:

On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library.

It goes a bit further: for a general purpose language, any existing
functionality that cannot be written using the language is a sign of
a weakness because it shows that despite being "general purpose" it
fails to cover this specific "purpose".

That is a foolish statement.
--- Synchronet 3.20a-Linux NewsLink 1.114

From David Brown@david.brown@hesbynett.no to comp.arch on Thu Oct 17 16:16:42 2024

From Newsgroup: comp.arch

On 17/10/2024 05:06, George Neuner wrote:

On Wed, 16 Oct 2024 15:38:47 GMT, scott@slp53.sl.home (Scott Lurndal)
wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library.

It goes a bit further: for a general purpose language, any existing
functionality that cannot be written using the language is a sign of
a weakness because it shows that despite being "general purpose" it
fails to cover this specific "purpose".

One of the key ways C got into the minds of programmers was that
one could write stuff like printf() in C and NOT need to have it
entirely built-into the language.

In an ideal world, it would be better if we could define `malloc` and
`memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.

malloc() used to be std. K&R C--what dropped it from the std ??

It still is part of the ISO C standard.

The paragraph with 3 >'s indicates malloc() cannot be written
in std. C. It used to be written in std. K&R C.

K&R may have been 'de facto' standard C, but not 'de jure'.

Unix V6 malloc used the 'brk' system call to allocate space
for the heap. Later versions used 'sbrk'.

Those are both kernel system calls.

Yes, but malloc() subdivides an already provided space. Because that
space can be treated as a single array of char, and comparing pointers
to elements of the same array is legal, the only thing I can see that
prevents writing malloc() in standard C would be the need to somehow
define the array from the /language's/ POV (not the compiler's) prior
to using it.

It is common for malloc() implementations to ask the OS for large chunks
of memory, then subdivide it and pass it out to the application. When
the chunk(s) it has run out, it will ask for more from the OS. You
could reasonably argue that each chunk it gets may be considered a
single unsigned char array, but that is certainly not true for
additional chunks.
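
To make the multi-chunk problem concrete, here is a hedged sketch of finding which chunk a pointer belongs to using only `==`, which is defined for any two pointers. The `chunk` record and both function names are hypothetical. The search is linear in the chunk size, which is exactly why real allocators fall back on implementation-specific pointer ordering instead.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record for one chunk obtained from the OS. */
typedef struct chunk {
    unsigned char *base;   /* first byte of the chunk */
    size_t         size;   /* chunk length in bytes */
    struct chunk  *next;   /* next chunk, or NULL */
} chunk;

/* Portable but O(size): compares p for equality with every byte address
   in the chunk, so it never orders pointers into different objects. */
static bool chunk_contains(const chunk *c, const void *p)
{
    for (size_t i = 0; i < c->size; i++)
        if ((const void *)(c->base + i) == p)
            return true;
    return false;
}

/* Returns the chunk holding p, or NULL if p is in none of them. */
const chunk *find_chunk(const chunk *list, const void *p)
{
    for (; list != NULL; list = list->next)
        if (chunk_contains(list, p))
            return list;
    return NULL;
}
```

Writing `p >= c->base && p < c->base + c->size` instead would be the obvious fast test, but it is only defined when `p` already points into that chunk.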

--- Synchronet 3.20a-Linux NewsLink 1.114

From David Brown@david.brown@hesbynett.no to comp.arch on Thu Oct 17 16:25:01 2024

From Newsgroup: comp.arch

On 17/10/2024 05:32, George Neuner wrote:

On Wed, 16 Oct 2024 09:38:20 +0200, David Brown
<david.brown@hesbynett.no> wrote:

It's a very good philosophy in programming language design that the core
language should only contain what it has to contain - if a desired
feature can be put in a library and be equally efficient and convenient
to use, then it should be in the standard library, not the core
language. It is much easier to develop, implement, enhance, adapt, and
otherwise change things in libraries than the core language.

And it is also fine, IMHO, that some things in the standard library need
non-standard C - the standard library is part of the implementation.

But it is a problem if the library has to be written using a different compiler. [For this purpose I would consider specifying different
compiler flags to be using a different compiler.]

Specifying different flags would technically give you a different /implementation/, but it would not normally be considered a different /compiler/. I see no problem at all if libraries (standard library or otherwise) are compiled with different flags. I can absolutely
guarantee that the flags I use for compiling my application code are not
the same as those used for compiling the static libraries that came with
my toolchains. Using different /compilers/ could be a significant inconvenience, and might mean you lose additional features (such as
link-time optimisation), but as long as the ABI is consistent then they
should work fine.

Why? Because once these things are discovered, many programmers will
see their advantages and lack the discipline to avoid using them for
more general application work.

Really? Have you ever looked at the source code for a library such as
glibc or newlib? Most developers would look at that and quickly shy
away from all the macros, additional compiler-specific attributes,
conditional compilation, and the rest of it. Very, very few would look
into the details to see if there are any "tricks" or "secret" compiler extensions they can copy. And with very few exceptions, all the compiler-specific features will already be documented and available to programmers enthusiastic enough to RTFM.

In an ideal world, it would be better if we could define `malloc` and
`memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.

malloc() used to be std. K&R C--what dropped it from the std ??

The function has always been available in C since the language was
standardised, and AFAIK it was in K&R C. But no one (in authority) ever
claimed it could be implemented purely in standard C. What do you think
has changed?

--- Synchronet 3.20a-Linux NewsLink 1.114

From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.arch on Fri Oct 18 06:00:54 2024

From Newsgroup: comp.arch

Michael S <already5chosen@yahoo.com> writes:

On Mon, 14 Oct 2024 17:19:40 +0200
David Brown <david.brown@hesbynett.no> wrote:

[...]

My only point of contention is that the existence or lack of such
instructions does not make any difference to whether or not you can
write a good implementation of memcpy() or memmove() in portable
standard C.

You are moving a goalpost.

No, he isn't.

One does not need "good implementation" in a sense you have in mind.
All one needs is an implementation that pattern matching logic of
compiler unmistakably recognizes as memmove/memcpy. That is very easily
done in standard C. For memmove, I had shown how to do it in one of the
posts below. For memcpy it's very obvious, so no need to show.

You have misunderstood the meaning of "standard C", which means
code that does not rely on any implementation-specific behavior.
"All one needs is an implementation that ..." already invalidates
the requirement that the code not rely on implementation-specific
behavior.
--- Synchronet 3.20a-Linux NewsLink 1.114

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Oct 18 14:06:17 2024

From Newsgroup: comp.arch

Michael S <already5chosen@yahoo.com> writes:

On Mon, 14 Oct 2024 19:39:41 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

On 13/10/2024 17:45, Anton Ertl wrote:

I do think it would be convenient if there were a fully standard
way to compare independent pointers (other than just for
equality). Rarely needing something does not mean /never/ needing
it.

OK, take a segmented memory model with 16-bit pointers and a 24-bit
virtual address space. How do you actually compare two segmented
pointers ??

Depends. On the Burroughs mainframe there could be eight
active segments and the segment number was part of the pointer.

Pointers were 32-bits (actually 8 BCD digits)

S s OOOOOO

Where 'S' was a sign digit (C or D), 's' was the
segment number (0-7) and OOOOOO was the six digit
offset within the segment (500kB/1000kD each).
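
As an illustration of the 8-digit pointer layout just described, here is a minimal decoding sketch. It assumes one BCD digit per 4-bit nibble, most significant digit first, packed into a 32-bit integer. That packing, and the names `bcd_pointer` and `decode_pointer`, are assumptions made for the example and not a description of the actual hardware encoding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded form of the "S s OOOOOO" pointer. */
typedef struct {
    unsigned sign;      /* sign digit, 0xC or 0xD */
    unsigned segment;   /* segment number, 0-7 */
    uint32_t offset;    /* six-digit decimal offset, 0-999999 */
} bcd_pointer;

/* Returns false if any digit falls outside the ranges given in the text. */
bool decode_pointer(uint32_t raw, bcd_pointer *out)
{
    unsigned sign    = (raw >> 28) & 0xFu;
    unsigned segment = (raw >> 24) & 0xFu;
    uint32_t offset  = 0;

    if (sign != 0xCu && sign != 0xDu)
        return false;
    if (segment > 7u)
        return false;

    /* Accumulate the six offset digits, most significant first. */
    for (int shift = 20; shift >= 0; shift -= 4) {
        unsigned digit = (raw >> shift) & 0xFu;
        if (digit > 9u)
            return false;
        offset = offset * 10u + digit;
    }

    out->sign = sign;
    out->segment = segment;
    out->offset = offset;
    return true;
}
```

Under these assumptions, the raw value `0xC3012345` decodes to segment 3, offset 12345.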

A particular task (process) could have up to
one million "environments", each environment
could have up to 100 "memory areas" (up to 1000kD)
of which the first eight were loaded into the
processor base/limit registers. Index registers
were 8 digits and were loaded with a pointer as
described above. Operands could optionally select
one of the index registers and the operand address
was treated as an offset to the index register;
there were 7 index registers.

Access to memory areas 8-99 use string instructions
where the pointer was 16 BCD digits:

EEEEEEMM SsOOOOOO

Where EEEEEE was the environment number (0-999999);
environments starting with D00000 were reserved for
the MCP (Operating System). MM was the memory area
number and the remaining eight digits described the
data within the memory area. A subroutine call could
call within a memory area or switch to a new environment.

Memory area 1 was the code region for the segment,
Memory area 0 held the stack and some global variables
and was typically shared by all environments.
Memory areas 2-7 were application dependent and could
be configured to be shared between environments at
link time.

What was the size of physical address space ?
I would suppose, more than 1,000,000 words?

It varied based on the generation. In the
1960s, a half megabyte (10^6 digits)
was the limit.

In the 1970s, the architecture supported
10^8 digits, the largest B4800 systems
were shipped with 2 million digits (1MB).
In 1979, the B4900 was introduced supporting
up to 10MB (20 MD), later increased to
20MB/40MD.

In the 1980s, the largest systems (V500)
supported up to 10^9 digits. It
was that generation of machine where the
environment scheme was introduced.

Binaries compiled in 1966 ran on all
generations without recompilation.

There was room in the segmentation structures
for up to 10^18 digit physical addresses
(where the segments were aligned on 10^3
digit boundaries).

Unisys discontinued that line of systems in 1992.
--- Synchronet 3.20a-Linux NewsLink 1.114

From Michael S@already5chosen@yahoo.com to comp.arch on Fri Oct 18 17:34:16 2024

From Newsgroup: comp.arch

On Fri, 18 Oct 2024 14:06:17 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:

Michael S <already5chosen@yahoo.com> writes:

On Mon, 14 Oct 2024 19:39:41 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

On 13/10/2024 17:45, Anton Ertl wrote:

I do think it would be convenient if there were a fully standard
way to compare independent pointers (other than just for
equality). Rarely needing something does not mean /never/
needing it.

OK, take a segmented memory model with 16-bit pointers and a
24-bit virtual address space. How do you actually compare two
segmented pointers ??

Depends. On the Burroughs mainframe there could be eight
active segments and the segment number was part of the pointer.

Pointers were 32-bits (actually 8 BCD digits)

S s OOOOOO

Where 'S' was a sign digit (C or D), 's' was the
segment number (0-7) and OOOOOO was the six digit
offset within the segment (500kB/1000kD each).

A particular task (process) could have up to
one million "environments", each environment
could have up to 100 "memory areas" (up to 1000kD)
of which the first eight were loaded into the
processor base/limit registers. Index registers
were 8 digits and were loaded with a pointer as
described above. Operands could optionally select
one of the index registers and the operand address
was treated as an offset to the index register;
there were 7 index registers.

Access to memory areas 8-99 use string instructions
where the pointer was 16 BCD digits:

EEEEEEMM SsOOOOOO

Where EEEEEE was the environment number (0-999999);
environments starting with D00000 were reserved for
the MCP (Operating System). MM was the memory area
number and the remaining eight digits described the
data within the memory area. A subroutine call could
call within a memory area or switch to a new environment.

Memory area 1 was the code region for the segment,
Memory area 0 held the stack and some global variables
and was typically shared by all environments.
Memory areas 2-7 were application dependent and could
be configured to be shared between environments at
link time.

What was the size of physical address space ?
I would suppose, more than 1,000,000 words?

It varied based on the generation. In the
1960s, a half megabyte (10^6 digits)
was the limit.

In the 1970s, the architecture supported
10^8 digits, the largest B4800 systems
were shipped with 2 million digits (1MB).
In 1979, the B4900 was introduced supporting
up to 10MB (20 MD), later increased to
20MB/40MD.

In the 1980s, the largest systems (V500)
supported up to 10^9 digits. It
was that generation of machine where the
environment scheme was introduced.

Binaries compiled in 1966 ran on all
generations without recompilation.

There was room in the segmentation structures
for up to 10^18 digit physical addresses
(where the segments were aligned on 10^3
digit boundaries).

So, can it be said that at least some of B6500-compatible models
suffered from the same problem as 80286 - the segment of maximal size
didn't cover all linear (or physical) address space?
Or their index register width was increased to accommodate 1e9 digits in
the single segment?

Unisys discontinued that line of systems in 1992.

I thought it lasted longer. My impression was that there were still
hardware implementations (alongside emulation on Xeons) sold up
until 15 years ago.

--- Synchronet 3.20a-Linux NewsLink 1.114

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Oct 18 16:19:08 2024

From Newsgroup: comp.arch

Michael S <already5chosen@yahoo.com> writes:

On Fri, 18 Oct 2024 14:06:17 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:

Michael S <already5chosen@yahoo.com> writes:

On Mon, 14 Oct 2024 19:39:41 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

On 13/10/2024 17:45, Anton Ertl wrote:

I do think it would be convenient if there were a fully standard
way to compare independent pointers (other than just for
equality). Rarely needing something does not mean /never/
needing it.

OK, take a segmented memory model with 16-bit pointers and a
24-bit virtual address space. How do you actually compare two
segmented pointers ??

Depends. On the Burroughs mainframe there could be eight
active segments and the segment number was part of the pointer.

Pointers were 32-bits (actually 8 BCD digits)

S s OOOOOO

Where 'S' was a sign digit (C or D), 's' was the
segment number (0-7) and OOOOOO was the six digit
offset within the segment (500kB/1000kD each).

A particular task (process) could have up to
one million "environments", each environment
could have up to 100 "memory areas" (up to 1000kD)
of which the first eight were loaded into the
processor base/limit registers. Index registers
were 8 digits and were loaded with a pointer as
described above. Operands could optionally select
one of the index registers and the operand address
was treated as an offset to the index register;
there were 7 index registers.

Access to memory areas 8-99 use string instructions
where the pointer was 16 BCD digits:

EEEEEEMM SsOOOOOO

Where EEEEEE was the environment number (0-999999);
environments starting with D00000 were reserved for
the MCP (Operating System). MM was the memory area
number and the remaining eight digits described the
data within the memory area. A subroutine call could
call within a memory area or switch to a new environment.

Memory area 1 was the code region for the segment,
Memory area 0 held the stack and some global variables
and was typically shared by all environments.
Memory areas 2-7 were application dependent and could
be configured to be shared between environments at
link time.

What was the size of physical address space ?
I would suppose, more than 1,000,000 words?

It varied based on the generation. In the
1960s, a half megabyte (10^6 digits)
was the limit.

In the 1970s, the architecture supported
10^8 digits, the largest B4800 systems
were shipped with 2 million digits (1MB).
In 1979, the B4900 was introduced supporting
up to 10MB (20 MD), later increased to
20MB/40MD.

In the 1980s, the largest systems (V500)
supported up to 10^9 digits. It
was that generation of machine where the
environment scheme was introduced.

Binaries compiled in 1966 ran on all
generations without recompilation.

There was room in the segmentation structures
for up to 10^18 digit physical addresses
(where the segments were aligned on 10^3
digit boundaries).

So, can it be said that at least some of B6500-compatible models

No. The systems I described above are from the medium
systems family (B2000/B3000/B4000). The B5000/B6000/B7000
(large) family systems were a completely different stack based
architecture with a 48-bit word size. The Small systems (B1000)
supported task-specific dynamic microcode loading (different
microcode for a cobol app vs. a fortran app).

Medium systems evolved from the Electrodata Datatron and 220 (1954) through
the Burroughs B300 to the Burroughs B3500 by 1965. The B5000
was also developed at the old Electrodata plant in Pasadena
(where I worked in the 80s) - eventually large systems moved
out - the more capable large systems (B7XXX) were designed in Tredyffrin
Pa, the less capable large systems (B5XXX) were designed in Mission Viejo, Ca.

suffered from the same problem as 80286 - the segment of maximal size
didn't cover all linear (or physical) address space?
Or their index register width was increased to accommodate 1e9 digits in
the single segment?

Unisys discontinued that line of systems in 1992.

I thought it lasted longer. My impression was that there were still
hardware implementations (alongside emulation on Xeons) sold up
until 15 years ago.

Large systems still exist today in emulation[*], as do the
former Univac (Sperry 2200) systems. The last medium system
(V380) was retired by the City of Santa Ana in 2010 (almost two
decades after Unisys cancelled the product line) and was moved
to the Living Computer Museum.

City of Santa Ana replaced the single 1980 vintage V380 with
29 Windows servers.

After the merger of Burroughs and Sperry in '86 there were six
different mainframe architectures - by 1990, all but
two (2200 and large systems) had been terminated.

[*] Clearpath Libra https://www.unisys.com/client-education/clearpath-forward-libra-servers/
--- Synchronet 3.20a-Linux NewsLink 1.114

From Vir Campestris@vir.campestris@invalid.invalid to comp.arch on Fri Oct 18 17:38:55 2024

From Newsgroup: comp.arch

On 16/10/2024 08:21, David Brown wrote:

I don't see an advantage in being able to implement them in standard C.
I /do/ see an advantage in being able to do so well in non-standard, implementation-specific C.

The reason why you might want your own special memmove, or your own
special malloc, is that you are doing niche and specialised software.
For example, you might be making real-time software and require specific time constraints on these functions. In such cases, you are not
interested in writing fully portable software - it will already contain
many implementation-specific features or use compiler extensions.

I have a vague feeling that once upon a time I wrote a malloc for an
embedded system. Having only one process it had access to the entire
memory range, and didn't need to talk to the OS. Entirely C is quite
feasible there.

But memmove? On an 80286 it will be using rep movsw, rather than a
software loop, to copy the memory contents to the new location.

_That_ does require assembler, or compiler extensions, not standard C.

Andy
--- Synchronet 3.20a-Linux NewsLink 1.114

From David Brown@david.brown@hesbynett.no to comp.arch on Fri Oct 18 21:45:37 2024

From Newsgroup: comp.arch

On 18/10/2024 18:38, Vir Campestris wrote:

On 16/10/2024 08:21, David Brown wrote:

I don't see an advantage in being able to implement them in standard
C. I /do/ see an advantage in being able to do so well in
non-standard, implementation-specific C.

The reason why you might want your own special memmove, or your own
special malloc, is that you are doing niche and specialised software.
For example, you might be making real-time software and require
specific time constraints on these functions. In such cases, you are
not interested in writing fully portable software - it will already
contain many implementation-specific features or use compiler extensions.

I have a vague feeling that once upon a time I wrote a malloc for an embedded system. Having only one process it had access to the entire
memory range, and didn't need to talk to the OS. Entirely C is quite feasible there.

Sure - but you are not writing portable standard C. You are relying on implementation details, or writing code that is only suitable for a
particular implementation (or set of implementations). It is normal to
write this kind of thing in C, but it is non-portable C. (Or at least,
not fully portable C.)

But memmove? On an 80286 it will be using rep movsw, rather than a
software loop, to copy the memory contents to the new location.

_That_ does require assembler, or compiler extensions, not standard C.

It would normally be written in C, and the compiler will generate the
"rep" assembly. The bit you can't write in fully portable standard C is
the comparison of the pointers so you know which direction to do the
copying.
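
A minimal sketch of the issue, assuming a flat address space in which converting a pointer to `uintptr_t` preserves address order. That assumption is implementation-defined, which is the non-portable step. `memmove_sketch` is a hypothetical name, and whether the loops become "rep movs" or similar is entirely up to the compiler.

```c
#include <stddef.h>
#include <stdint.h>

void *memmove_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Non-portable step: ordering two unrelated pointers. Comparing d < s
       directly is undefined unless both point into the same object, so the
       sketch compares their uintptr_t values instead, which is meaningful
       only where the implementation maps pointers to linear addresses. */
    if ((uintptr_t)d < (uintptr_t)s) {
        for (size_t i = 0; i < n; i++)      /* copy forwards */
            d[i] = s[i];
    } else if ((uintptr_t)d > (uintptr_t)s) {
        for (size_t i = n; i > 0; i--)      /* copy backwards */
            d[i - 1] = s[i - 1];
    }
    return dst;
}
```

Everything except the direction test is ordinary portable C; a fully portable variant would have to avoid the ordering question, for example by copying through a temporary buffer.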

--- Synchronet 3.20a-Linux NewsLink 1.114

From Michael S@already5chosen@yahoo.com to comp.arch on Sat Oct 19 19:46:41 2024

From Newsgroup: comp.arch

On Fri, 18 Oct 2024 16:19:08 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:

Michael S <already5chosen@yahoo.com> writes:

On Fri, 18 Oct 2024 14:06:17 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:

Michael S <already5chosen@yahoo.com> writes:

On Mon, 14 Oct 2024 19:39:41 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

On 13/10/2024 17:45, Anton Ertl wrote:

I do think it would be convenient if there were a fully
standard way to compare independent pointers (other than
just for equality). Rarely needing something does not mean
/never/ needing it.

OK, take a segmented memory model with 16-bit pointers and a
24-bit virtual address space. How do you actually compare two
segmented pointers ??

Depends. On the Burroughs mainframe there could be eight
active segments and the segment number was part of the pointer.

Pointers were 32-bits (actually 8 BCD digits)

S s OOOOOO

Where 'S' was a sign digit (C or D), 's' was the
segment number (0-7) and OOOOOO was the six digit
offset within the segment (500kB/1000kD each).

A particular task (process) could have up to
one million "environments", each environment
could have up to 100 "memory areas" (up to 1000kD)
of which the first eight were loaded into the
processor base/limit registers. Index registers
were 8 digits and were loaded with a pointer as
described above. Operands could optionally select
one of the index registers and the operand address
was treated as an offset to the index register;
there were 7 index registers.

Access to memory areas 8-99 use string instructions
where the pointer was 16 BCD digits:

EEEEEEMM SsOOOOOO

Where EEEEEE was the environment number (0-999999);
environments starting with D00000 were reserved for
the MCP (Operating System). MM was the memory area
number and the remaining eight digits described the
data within the memory area. A subroutine call could
call within a memory area or switch to a new environment.

Memory area 1 was the code region for the segment,
Memory area 0 held the stack and some global variables
and was typically shared by all environments.
Memory areas 2-7 were application dependent and could
be configured to be shared between environments at
link time.

What was the size of physical address space ?
I would suppose, more than 1,000,000 words?

It varied based on the generation. In the
1960s, a half megabyte (10^6 digits)
was the limit.

In the 1970s, the architecture supported
10^8 digits, the largest B4800 systems
were shipped with 2 million digits (1MB).
In 1979, the B4900 was introduced supporting
up to 10MB (20 MD), later increased to
20MB/40MD.

In the 1980s, the largest systems (V500)
supported up to 10^9 digits. It
was that generation of machine where the
environment scheme was introduced.

Binaries compiled in 1966 ran on all
generations without recompilation.

There was room in the segmentation structures
for up to 10^18 digit physical addresses
(where the segments were aligned on 10^3
digit boundaries).

So, can it be said that at least some of B6500-compatible models

No. The systems I described above are from the medium
systems family (B2000/B3000/B4000).

I didn't realize that you were not talking about Large Systems.
I didn't even know that Medium Systems used segmented memory.
Sorry.

The B5000/B6000/B7000
(large) family systems were a completely different stack based
architecture with a 48-bit word size. The Small systems (B1000)
supported task-specific dynamic microcode loading (different
microcode for a cobol app vs. a fortran app).

Medium systems evolved from the Electrodata Datatron and 220 (1954)
through the Burroughs B300 to the Burroughs B3500 by 1965. The B5000
was also developed at the old Electrodata plant in Pasadena
(where I worked in the 80s) - eventually large systems moved
out - the more capable large systems (B7XXX) were designed in
Tredyffrin Pa, the less capable large systems (B5XXX) were designed
in Mission Viejo, Ca.

suffered from the same problem as the 80286 - the segment of maximal size
didn't cover all of the linear (or physical) address space?
Or was their index register width increased to accommodate 1e9 digits
in a single segment?

Unisys discontinued that line of systems in 1992.

I thought it lasted longer. My impression was that there were still
hardware implementations (alongside emulation on Xeons) sold up
until 15 years ago.

Large systems still exist today in emulation[*], as do the
former Univac (Sperry 2200) systems. The last medium system
(V380) was retired by the City of Santa Ana in 2010 (almost two
decades after Unisys cancelled the product line) and was moved
to the Living Computer Museum.

The City of Santa Ana replaced the single 1980-vintage V380 with
29 Windows servers.

After the merger of Burroughs and Sperry in '86 there were six
different mainframe architectures - by 1990, all but
two (2200 and large systems) had been terminated.

[*] Clearpath Libra https://www.unisys.com/client-education/clearpath-forward-libra-servers/

--- Synchronet 3.20a-Linux NewsLink 1.114

From Vir Campestris@vir.campestris@invalid.invalid to comp.arch on Sun Oct 20 21:51:30 2024

From Newsgroup: comp.arch

On 18/10/2024 20:45, David Brown wrote:

On 18/10/2024 18:38, Vir Campestris wrote:

On 16/10/2024 08:21, David Brown wrote:

I don't see an advantage in being able to implement them in standard
C. I /do/ see an advantage in being able to do so well in non-
standard, implementation-specific C.

The reason why you might want your own special memmove, or your own
special malloc, is that you are doing niche and specialised software.
For example, you might be making real-time software and require
specific time constraints on these functions. In such cases, you are
not interested in writing fully portable software - it will already
contain many implementation-specific features or use compiler
extensions.

I have a vague feeling that once upon a time I wrote a malloc for an
embedded system. Having only one process it had access to the entire
memory range, and didn't need to talk to the OS. Entirely C is quite
feasible there.

Sure - but you are not writing portable standard C. You are relying on implementation details, or writing code that is only suitable for a particular implementation (or set of implementations). It is normal to write this kind of thing in C, but it is non-portable C. (Or at least,
not fully portable C.)

Ah, I see your point. Because some implementations will require
communication with the OS there cannot be a truly portable malloc.

But memmove? On an 80286 it will be using rep movsw, rather than a
software loop, to copy the memory contents to the new location.

_That_ does require assembler, or compiler extensions, not standard C.

It would normally be written in C, and the compiler will generate the
"rep" assembly. The bit you can't write in fully portable standard C is the comparison of the pointers so you know which direction to do the copying.

It's a long time since I had to mistrust a compiler so much that I was
pulling the assembler apart. It sounds as though they have got smarter
in the meantime.

I just checked BTW, and you are correct.

Andy

From David Brown@david.brown@hesbynett.no to comp.arch on Mon Oct 21 08:58:05 2024

From Newsgroup: comp.arch

On 20/10/2024 22:51, Vir Campestris wrote:

On 18/10/2024 20:45, David Brown wrote:

On 18/10/2024 18:38, Vir Campestris wrote:

On 16/10/2024 08:21, David Brown wrote:

I don't see an advantage in being able to implement them in standard
C. I /do/ see an advantage in being able to do so well in non-
standard, implementation-specific C.

The reason why you might want your own special memmove, or your own
special malloc, is that you are doing niche and specialised
software. For example, you might be making real-time software and
require specific time constraints on these functions. In such
cases, you are not interested in writing fully portable software -
it will already contain many implementation-specific features or use
compiler extensions.

I have a vague feeling that once upon a time I wrote a malloc for an
embedded system. Having only one process it had access to the entire
memory range, and didn't need to talk to the OS. Entirely C is quite
feasible there.

Sure - but you are not writing portable standard C. You are relying
on implementation details, or writing code that is only suitable for a
particular implementation (or set of implementations). It is normal
to write this kind of thing in C, but it is non-portable C. (Or at
least, not fully portable C.)

Ah, I see your point. Because some implementations will require communication with the OS there cannot be a truly portable malloc.

Yes.

I think /every/ implementation will require communication with the OS,
if there is an OS - otherwise it will need support from other parts of
the toolchain (such as symbols created in a linker script to define the
heap area - that's the typical implementation in small embedded systems).

The nearest you could get to a portable implementation would be using a
local unsigned char array as the heap, but I don't believe that would be
fully correct according to the effective type rules (or the "strict
aliasing" or type-based aliasing rules, if you prefer those terms). It
would also not be good enough for the needs of many programs.

Of course, a fair amount of the code for malloc/free can be written in
fully portable C - and almost all of it can be written in a somewhat
vaguely defined "widely portable C" where you can mask pointer bits to
handle alignment, and other such conveniences.
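
As a rough illustration of the "unsigned char array as the heap" idea combined with pointer-bit masking, here is a minimal bump allocator. It is only a sketch: the names, the heap size and the 16-byte alignment are all assumptions, there is no free(), and it relies on uintptr_t values reflecting address alignment, which is "widely portable" rather than guaranteed. It also does nothing about the effective type concern mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEAP_SIZE  4096u   /* hypothetical heap size */
#define HEAP_ALIGN 16u     /* hypothetical alignment; must be a power of two */

static unsigned char heap[HEAP_SIZE];   /* the array standing in for the heap */
static size_t heap_used;                /* bytes handed out so far, incl. padding */

/* Bump allocator: returns an aligned block or NULL; never frees. */
static void *bump_alloc(size_t n)
{
    uintptr_t base = (uintptr_t)heap;
    uintptr_t cur = base + heap_used;
    /* Round up to the next multiple of HEAP_ALIGN by masking the low bits. */
    uintptr_t aligned = (cur + (HEAP_ALIGN - 1)) & ~(uintptr_t)(HEAP_ALIGN - 1);
    size_t offset = (size_t)(aligned - base);

    assert((HEAP_ALIGN & (HEAP_ALIGN - 1)) == 0);
    if (offset > HEAP_SIZE || n > HEAP_SIZE - offset)
        return NULL;                    /* out of space */
    heap_used = offset + n;
    return heap + offset;
}
```

In a small embedded system the array would typically be replaced by a region bounded by linker-script symbols, as described above; the masking step stays the same.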

But memmove? On an 80286 it will be using rep movsw, rather than a
software loop, to copy the memory contents to the new location.

_That_ does require assembler, or compiler extensions, not standard C.

It would normally be written in C, and the compiler will generate the
"rep" assembly. The bit you can't write in fully portable standard C
is the comparison of the pointers so you know which direction to do
the copying.

It's a long time since I had to mistrust a compiler so much that I was pulling the assembler apart. It sounds as though they have got smarter
in the meantime.

I just checked BTW, and you are correct.

Looking at the generated assembly is usually not a matter of mistrusting
the compiler. One of the reasons I do so is to check that the compiler
can generate efficient object code from my source code, in cases where I
need maximal efficiency. I'd rather not write assembly unless I really
have to!
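
For reference, the pointer comparison discussed earlier in this post can be sketched as follows. This is a hedged illustration, not any library's actual memmove: the name my_memmove is hypothetical, and the code assumes that converting pointers to uintptr_t yields integers that order the same way as the addresses (true on flat-memory targets, not guaranteed by the standard). It is the kind of function one would compile and then inspect, to see whether the loops become "rep movs" or something better.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name, chosen to avoid clashing with the library memmove. */
static void *my_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));

    /* Non-portable step: using < on pointers into different objects is
       undefined in standard C, so compare the converted integers and
       assume they order like addresses. */
    if ((uintptr_t)d < (uintptr_t)s) {
        for (size_t i = 0; i < n; i++)      /* destination below: copy forwards */
            d[i] = s[i];
    } else if ((uintptr_t)d > (uintptr_t)s) {
        for (size_t i = n; i > 0; i--)      /* destination above: copy backwards */
            d[i - 1] = s[i - 1];
    }
    return dst;
}
```

Everything except the two comparisons is ordinary portable C, which is the point being made about where the portability boundary lies.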


From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Oct 21 09:21:42 2024

From Newsgroup: comp.arch

David Brown wrote:

On 20/10/2024 22:51, Vir Campestris wrote:

On 18/10/2024 20:45, David Brown wrote:

On 18/10/2024 18:38, Vir Campestris wrote:

On 16/10/2024 08:21, David Brown wrote:

I don't see an advantage in being able to implement them in
standard C. I /do/ see an advantage in being able to do so well in
non-standard, implementation-specific C.

The reason why you might want your own special memmove, or your own
special malloc, is that you are doing niche and specialised
software. For example, you might be making real-time software and
require specific time constraints on these functions. In such
cases, you are not interested in writing fully portable software -
it will already contain many implementation-specific features or
use compiler extensions.

I have a vague feeling that once upon a time I wrote a malloc for an
embedded system. Having only one process it had access to the entire
memory range, and didn't need to talk to the OS. Entirely C is quite
feasible there.

Sure - but you are not writing portable standard C. You are relying
on implementation details, or writing code that is only suitable for
a particular implementation (or set of implementations). It is
normal to write this kind of thing in C, but it is non-portable C.
(Or at least, not fully portable C.)

Ah, I see your point. Because some implementations will require
communication with the OS there cannot be a truly portable malloc.

Yes.

I think /every/ implementation will require communication with the OS,
if there is an OS - otherwise it will need support from other parts of
the toolchain (such as symbols created in a linker script to define the
heap area - that's the typical implementation in small embedded systems).

The nearest you could get to a portable implementation would be using a local unsigned char array as the heap, but I don't believe that would be fully correct according to the effective type rules (or the "strict aliasing" or type-based aliasing rules, if you prefer those terms). It would also not be good enough for the needs of many programs.

Of course, a fair amount of the code for malloc/free can be written in
fully portable C - and almost all of it can be written in a somewhat
vaguely defined "widely portable C" where you can mask pointer bits to
handle alignment, and other such conveniences.

But memmove? On an 80286 it will be using rep movsw, rather than a
software loop, to copy the memory contents to the new location.

_That_ does require assembler, or compiler extensions, not standard C.

It would normally be written in C, and the compiler will generate the
"rep" assembly. The bit you can't write in fully portable standard
C is the comparison of the pointers so you know which direction to do
the copying.

It's a long time since I had to mistrust a compiler so much that I was
pulling the assembler apart. It sounds as though they have got smarter
in the meantime.

I just checked BTW, and you are correct.

Looking at the generated assembly is usually not a matter of mistrusting
the compiler. One of the reasons I do so is to check that the compiler
can generate efficient object code from my source code, in cases where I need maximal efficiency. I'd rather not write assembly unless I really have to!

For near-light-speed code I used to write it first in C, optimize that,
then I would translate it into (inline) asm and re-optimize based on
having the full cpu architecture available, before in the final stage I
would use the asm experience to tweak the C just enough to let the
compiler generate machine code quite close (90+%) to my best asm, while
still being portable to any cpu with more or less the same capabilities.

One example: When I won an international competition to write the
fastest Pentomino solver (capable of finding all 2339/1010/368/2
solutions of the 6x10/5x12/4x15/3x20 layouts), I also included the
portable C version.

My asm submission was twice as fast as anyone else, while the C version
was still fast enough that a couple of years later I got a prize in the
mail: Someone in France had submitted my C code, with my name & address,
to a similar competition there and it was still faster than anyone
else. :-)

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Oct 21 14:04:42 2024

From Newsgroup: comp.arch

I don't see an advantage in being able to implement them in standard C.

It means you can likely also implement a related yet different API
without having your code "demoted" to non-standard.

That makes no sense to me. We are talking about implementing standard library functions. If you want to implement other functions, go ahead.

No, I'm talking about a very general principle that applies to
languages, libraries, etc...

For example, in Emacs I always try [and don't always succeed] to make
sure that the default behavior for a given functionality can be
implemented using the official API entry points of the underlying
library, because it makes it more likely that whoever wants to replace
that behavior with something else will be able to do it without having
to break abstraction barriers.

Stefan

From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Mon Oct 21 23:17:10 2024

From Newsgroup: comp.arch

On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:

Because some implementations will require
communication with the OS there cannot be a truly portable malloc.

There can if you have a portable OS API. The only serious candidate for
that is POSIX.

From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Oct 21 23:52:59 2024

From Newsgroup: comp.arch

On Mon, 21 Oct 2024 23:17:10 +0000, Lawrence D'Oliveiro wrote:

On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:

Because some implementations will require
communication with the OS there cannot be a truly portable malloc.

There can if you have a portable OS API. The only serious candidate for
that is POSIX.

POSIX is an environment not an OS.

From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Tue Oct 22 01:09:49 2024

From Newsgroup: comp.arch

On Mon, 21 Oct 2024 23:52:59 +0000, MitchAlsup1 wrote:

On Mon, 21 Oct 2024 23:17:10 +0000, Lawrence D'Oliveiro wrote:

On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:

Because some implementations will require communication with the OS
there cannot be a truly portable malloc.

There can if you have a portable OS API. The only serious candidate for
that is POSIX.

POSIX is an environment not an OS.

Guess what the “OS” part of “POSIX” stands for.

From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.arch on Mon Oct 21 18:32:27 2024

From Newsgroup: comp.arch

Terje Mathisen <terje.mathisen@tmsw.no> writes:

[C vs assembly]

For near-light-speed code I used to write it first in C, optimize
that, then I would translate it into (inline) asm and re-optimize
based on having the full cpu architecture available, before in the
final stage I would use the asm experience to tweak the C just
enough to let the compiler generate machine code quite close
(90+%) to my best asm, while still being portable to any cpu with
more or less the same capabilities.

One example: When I won an international competition to write the
fastest Pentomino solver (capable of finding all 2339/1010/368/2
solutions of the 6x10/5x12/4x15/3x20 layouts), I also included the
portable C version.

My asm submission was twice as fast as anyone else, while the C
version was still fast enough that a couple of years later I got a
prize in the mail: Someone in France had submitted my C code,
with my name & address, to a similar competition there and it was
still faster than anyone else. :-)

I hope you will consider writing a book, "Writing Fast Code" (or
something along those lines). The core of the book could be, oh,
let's say between 8 and 12 case studies, starting with a problem
statement and tracing through the process that you followed, or
would follow, with stops along the way showing the code at each
of the different stages.

If you do write such a book I guarantee I will want to buy one.

From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Oct 22 08:27:12 2024

From Newsgroup: comp.arch

Tim Rentsch wrote:

Terje Mathisen <terje.mathisen@tmsw.no> writes:

[C vs assembly]

For near-light-speed code I used to write it first in C, optimize
that, then I would translate it into (inline) asm and re-optimize
based on having the full cpu architecture available, before in the
final stage I would use the asm experience to tweak the C just
enough to let the compiler generate machine code quite close
(90+%) to my best asm, while still being portable to any cpu with
more or less the same capabilities.

One example: When I won an international competition to write the
fastest Pentomino solver (capable of finding all 2339/1010/368/2
solutions of the 6x10/5x12/4x15/3x20 layouts), I also included the
portable C version.

My asm submission was twice as fast as anyone else, while the C
version was still fast enough that a couple of years later I got a
prize in the mail: Someone in France had submitted my C code,
with my name & address, to a similar competition there and it was
still faster than anyone else. :-)

I hope you will consider writing a book, "Writing Fast Code" (or
something along those lines). The core of the book could be, oh,
let's say between 8 and 12 case studies, starting with a problem
statement and tracing through the process that you followed, or
would follow, with stops along the way showing the code at each
of the different stages.

If you do write such a book I guarantee I will want to buy one.

Thank you Tim!

Probably not a book but I would consider writing a series of blog posts similar to that, now that I am about to retire: My wife and I will both
go on "permanent vacation" starting a week before Christmas. :-)

I already know that this will give me more time to work on digital
mapping projects (ref my https://mapant.no/ Norwegian topo map generated
from ~50 TB of LiDAR), but if there's an interest in optimization I
might do that as well.

BTW, I am also open to doing some consulting work, if the problems are interesting enough. :-)

Regards,
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From George Neuner@gneuner2@comcast.net to comp.arch on Tue Oct 22 17:26:06 2024

From Newsgroup: comp.arch

On Tue, 22 Oct 2024 01:09:49 -0000 (UTC), Lawrence D'Oliveiro
<ldo@nz.invalid> wrote:

On Mon, 21 Oct 2024 23:52:59 +0000, MitchAlsup1 wrote:

On Mon, 21 Oct 2024 23:17:10 +0000, Lawrence D'Oliveiro wrote:

On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:

Because some implementations will require communication with the OS
there cannot be a truly portable malloc.

There can if you have a portable OS API. The only serious candidate for
that is POSIX.

POSIX is an environment not an OS.

Guess what the “OS” part of “POSIX” stands for.

It's still just an environment - POSIX defines only an interface, not
an implementation.

From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.arch on Wed Oct 23 07:25:42 2024

From Newsgroup: comp.arch

Terje Mathisen <terje.mathisen@tmsw.no> writes:

Tim Rentsch wrote:

Terje Mathisen <terje.mathisen@tmsw.no> writes:

[C vs assembly]

For near-light-speed code I used to write it first in C, optimize
that, then I would translate it into (inline) asm and re-optimize
based on having the full cpu architecture available, before in the
final stage I would use the asm experience to tweak the C just
enough to let the compiler generate machine code quite close
(90+%) to my best asm, while still being portable to any cpu with
more or less the same capabilities.

One example: When I won an international competition to write the
fastest Pentomino solver (capable of finding all 2339/1010/368/2
solutions of the 6x10/5x12/4x15/3x20 layouts), I also included the
portable C version.

My asm submission was twice as fast as anyone else, while the C
version was still fast enough that a couple of years later I got a
prize in the mail: Someone in France had submitted my C code,
with my name & address, to a similar competition there and it was
still faster than anyone else. :-)

I hope you will consider writing a book, "Writing Fast Code" (or
something along those lines). The core of the book could be, oh,
let's say between 8 and 12 case studies, starting with a problem
statement and tracing through the process that you followed, or
would follow, with stops along the way showing the code at each
of the different stages.

If you do write such a book I guarantee I will want to buy one.

Thank you Tim!

I know from past experience you are good at this. I would love
to hear what you have to say.

Probably not a book but I would consider writing a series of blog
posts similar to that, now that I am about to retire:

You could try writing one blog post a month on the subject. By
this time next year you will have plenty of material and be well
on your way to putting a book together. (First drafts are always
the hardest part...)

My wife and I will both go on "permanent vacation" starting a week
before Christmas. :-)

I'm guessing that permanent vacation will be some mixture of actual
vacation and self-chosen "work". In any case I hope you both enjoy
the time.

P.S. Is the email address in your message a good way to reach you?

From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Oct 23 18:11:57 2024

From Newsgroup: comp.arch

On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

Terje Mathisen <terje.mathisen@tmsw.no> writes:

My wife and I will both go on "permanent vacation" starting a week
before Christmas. :-)

I'm guessing that permanent vacation will be some mixture of actual
vacation and self-chosen "work". In any case I hope you both enjoy
the time.

Just remember, retirement does not mean you "stop working"
it means you "stop working for HIM".

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Oct 23 18:27:06 2024

From Newsgroup: comp.arch

mitchalsup@aol.com (MitchAlsup1) writes:

On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

Terje Mathisen <terje.mathisen@tmsw.no> writes:

My wife and I will both go on "permanent vacation" starting a week
before Christmas. :-)

I'm guessing that permanent vacation will be some mixture of actual
vacation and self-chosen "work". In any case I hope you both enjoy
the time.

Just remember, retirement does not mean you "stop working"
it means you "stop working for HIM".

And start working for "HER". (Honeydew list).

From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Oct 23 21:11:59 2024

From Newsgroup: comp.arch

MitchAlsup1 wrote:

On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

Terje Mathisen <terje.mathisen@tmsw.no> writes:

My wife and I will both go on "permanent vacation" starting a week
before Christmas. :-)

I'm guessing that permanent vacation will be some mixture of actual
vacation and self-chosen "work". In any case I hope you both enjoy
the time.

Just remember, retirement does not mean you "stop working"
it means you "stop working for HIM".

Exactly!

I have unlimited amounts of potential/available mapping work, and I do
want to get back to NTP Hackers.

We recently started (officially) on the 754-2029 revision.

I'm still connected to Mill Computing as well.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Oct 23 21:12:57 2024

From Newsgroup: comp.arch

Scott Lurndal wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

Terje Mathisen <terje.mathisen@tmsw.no> writes:

My wife and I will both go on "permanent vacation" starting a week
before Christmas. :-)

I'm guessing that permanent vacation will be some mixture of actual
vacation and self-chosen "work". In any case I hope you both enjoy
the time.

Just remember, retirement does not mean you "stop working"
it means you "stop working for HIM".

And start working for "HER". (Honeydew list).

My wife does have a small list of things that we (i.e. I) could do when we retire...

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Oct 23 21:09:47 2024

From Newsgroup: comp.arch

Tim Rentsch wrote:

Terje Mathisen <terje.mathisen@tmsw.no> writes:

Tim Rentsch wrote:

Terje Mathisen <terje.mathisen@tmsw.no> writes:

[C vs assembly]

For near-light-speed code I used to write it first in C, optimize
that, then I would translate it into (inline) asm and re-optimize
based on having the full cpu architecture available, before in the
final stage I would use the asm experience to tweak the C just
enough to let the compiler generate machine code quite close
(90+%) to my best asm, while still being portable to any cpu with
more or less the same capabilities.

One example: When I won an international competition to write the
fastest Pentomino solver (capable of finding all 2339/1010/368/2
solutions of the 6x10/5x12/4x15/3x20 layouts), I also included the
portable C version.

My asm submission was twice as fast as anyone else, while the C
version was still fast enough that a couple of years later I got a
prize in the mail: Someone in France had submitted my C code,
with my name & address, to a similar competition there and it was
still faster than anyone else. :-)

I hope you will consider writing a book, "Writing Fast Code" (or
something along those lines). The core of the book could be, oh,
let's say between 8 and 12 case studies, starting with a problem
statement and tracing through the process that you followed, or
would follow, with stops along the way showing the code at each
of the different stages.

If you do write such a book I guarantee I will want to buy one.

Thank you Tim!

I know from past experience you are good at this. I would love
to hear what you have to say.

Probably not a book but I would consider writing a series of blog
posts similar to that, now that I am about to retire:

You could try writing one blog post a month on the subject. By
this time next year you will have plenty of material and be well
on your way to putting a book together. (First drafts are always
the hardest part...)

I'm sure you're right!

My wife and I will both go on "permanent vacation" starting a week
before Christmas. :-)

I'm guessing that permanent vacation will be some mixture of actual
vacation and self-chosen "work". In any case I hope you both enjoy
the time.

P.S. Is the email address in your message a good way to reach you?

Yes, that is my personal domain, so it won't be affected by my retirement.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Oct 23 21:01:01 2024

From Newsgroup: comp.arch

On Wed, 23 Oct 2024 19:11:59 +0000, Terje Mathisen wrote:

MitchAlsup1 wrote:

On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

Terje Mathisen <terje.mathisen@tmsw.no> writes:

My wife and I will both go on "permanent vacation" starting a week
before Christmas. :-)

I'm guessing that permanent vacation will be some mixture of actual
vacation and self-chosen "work". In any case I hope you both enjoy
the time.

Just remember, retirement does not mean you "stop working"
it means you "stop working for HIM".

Exactly!

I have unlimited amounts of potential/available mapping work, and I do
want to get back to NTP Hackers.

We recently started (officially) on the 754-2029 revision.

Are you going to put in something equivalent to quires ??

I'm still connected to Mill Computing as well.

Terje


From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Oct 24 07:39:52 2024

From Newsgroup: comp.arch

MitchAlsup1 wrote:

On Wed, 23 Oct 2024 19:11:59 +0000, Terje Mathisen wrote:

MitchAlsup1 wrote:

On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

Terje Mathisen <terje.mathisen@tmsw.no> writes:

My wife and I will both go on "permanent vacation" starting a week
before Christmas. :-)

I'm guessing that permanent vacation will be some mixture of actual
vacation and self-chosen "work". In any case I hope you both enjoy
the time.

Just remember, retirement does not mean you "stop working"
it means you "stop working for HIM".

Exactly!

I have unlimited amounts of potential/available mapping work, and I do
want to get back to NTP Hackers.

We recently started (officially) on the 754-2029 revision.

Are you going to put in something equivalent to quires ??

I don't know that usage, I thought quires was a typesetting/printing
measure?
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 24 06:55:20 2024

From Newsgroup: comp.arch

Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

Terje Mathisen <terje.mathisen@tmsw.no> writes:

Probably not a book but I would consider writing a series of blog
posts similar to that, now that I am about to retire:

You could try writing one blog post a month on the subject. By
this time next year you will have plenty of material and be well
on your way to putting a book together. (First drafts are always
the hardest part...)

One thing I have thought of is a wiki of optimization techniques that
contains descriptions of the techniques and case studies, but I have
not yet implemented this idea.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From David Brown@david.brown@hesbynett.no to comp.arch on Thu Oct 24 10:00:16 2024

From Newsgroup: comp.arch

On 24/10/2024 08:55, Anton Ertl wrote:

Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

Terje Mathisen <terje.mathisen@tmsw.no> writes:

Probably not a book but I would consider writing a series of blog
posts similar to that, now that I am about to retire:

You could try writing one blog post a month on the subject. By
this time next year you will have plenty of material and be well
on your way to putting a book together. (First drafts are always
the hardest part...)

One thing I have thought of is a wiki of optimization techniques that contains descriptions of the techniques and case studies, but I have
not yet implemented this idea.

Would it make sense to start something under Wikibooks on Wikipedia? I
have no experience with it myself, but it looks to me like a way to have
a collaborative collection of related knowledge. It could provide the structure and framework, saving you (plural) from having to set up a
wiki, blog, or whatever.


From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 24 16:34:45 2024

From Newsgroup: comp.arch

David Brown <david.brown@hesbynett.no> writes:

On 24/10/2024 08:55, Anton Ertl wrote:

One thing I have thought of is a wiki of optimization techniques that
contains descriptions of the techniques and case studies, but I have
not yet implemented this idea.

Would it make sense to start something under Wikibooks on Wikipedia?

Yes, I was thinking about that. In the bookshelf on computer
programming <https://en.wikibooks.org/wiki/Shelf:Computer_programming>
there are two "Books nearing completion" that have "Opti" in the
title:

https://en.wikibooks.org/wiki/Optimizing_Code_for_Speed
https://en.wikibooks.org/wiki/Optimizing_C%2B%2B

Looking at the contents of the former, it's rather short and
high-level, and I don't think it's intended for the kind of project we
have in mind.

The latter is more in the direction I have in mind, but the limitation
to C++ is, well, limiting.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Oct 24 18:32:22 2024

From Newsgroup: comp.arch

On Thu, 24 Oct 2024 5:39:52 +0000, Terje Mathisen wrote:

MitchAlsup1 wrote:

On Wed, 23 Oct 2024 19:11:59 +0000, Terje Mathisen wrote:

MitchAlsup1 wrote:

On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

Terje Mathisen <terje.mathisen@tmsw.no> writes:

My wife and I will both go on "permanent vacation" starting a week
before Christmas. :-)

I'm guessing that permanent vacation will be some mixture of actual
vacation and self-chosen "work". In any case I hope you both enjoy
the time.

Just remember, retirement does not mean you "stop working"
it means you "stop working for HIM".

Exactly!

I have unlimited amounts of potential/available mapping work, and I do
want to get back to NTP Hackers.

We recently started (officially) on the 754-2029 revision.

Are you going to put in something equivalent to quires ??

In posits, a quire is an accumulator with as many binary digits
as to cover max-exponent to min-exponent; so one can accumulate
an essentially unbounded number of sums without loss of precision
--to obtain a sum with a single rounding.

I don't know that usage, I thought quires was a typesetting/printing
measure?

Terje

--- Synchronet 3.20a-Linux NewsLink 1.114

From Vir Campestris@vir.campestris@invalid.invalid to comp.arch on Sun Oct 27 20:42:09 2024

From Newsgroup: comp.arch

On 22/10/2024 00:17, Lawrence D'Oliveiro wrote:

On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:

Because some implementations will require
communication with the OS there cannot be a truly portable malloc.

There can if you have a portable OS API. The only serious candidate for
that is POSIX.

One of the other groups I'm following just for the hell of it is
comp.os.cpm. I'm pretty sure you don't get POSIX in your 64 KB (max).

"cannot be a _truly_ portable" is what I meant. Portable to most machines
is easy - just write for Windows. POSIX will give you a larger subset -
but still a subset.

Andy
--- Synchronet 3.20a-Linux NewsLink 1.114

From Vir Campestris@vir.campestris@invalid.invalid to comp.arch on Sun Oct 27 20:45:09 2024

From Newsgroup: comp.arch

On 23/10/2024 20:12, Terje Mathisen wrote:

My wife does have a small list of things that we (i.e. I) could do when we retire...

Since I retired the garden is looking much better, I've started to win
the odd trophy sailing, most of the house has been redecorated...

But best of all - I've lost 5 kg and been able to stop worrying about my weight!

Andy
--- Synchronet 3.20a-Linux NewsLink 1.114

From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Sun Oct 27 21:04:49 2024

From Newsgroup: comp.arch

On Sun, 27 Oct 2024 20:42:09 +0000, Vir Campestris wrote:

I'm pretty sure you don't get POSIX in your 64 KB (max).

<https://news.ycombinator.com/item?id=34981059>
--- Synchronet 3.20a-Linux NewsLink 1.114

From David Schultz@david.schultz@earthlink.net to comp.arch on Sun Oct 27 17:55:52 2024

From Newsgroup: comp.arch

On 10/27/24 3:42 PM, Vir Campestris wrote:

On 22/10/2024 00:17, Lawrence D'Oliveiro wrote:

On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:

Because some implementations will require
communication with the OS there cannot be a truly portable malloc.

There can if you have a portable OS API. The only serious candidate for
that is POSIX.

One of the other groups I'm following just for the hell of it is
comp.os.cpm. I'm pretty sure you don't get POSIX in your 64 KB (max).

Ignores the 16 bit versions of CP/M: 8086, 68000, Z8000.
--
http://davesrocketworks.com
David Schultz
--- Synchronet 3.20a-Linux NewsLink 1.114

From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Oct 28 11:39:57 2024

From Newsgroup: comp.arch

MitchAlsup1 wrote:

On Thu, 24 Oct 2024 5:39:52 +0000, Terje Mathisen wrote:

MitchAlsup1 wrote:

On Wed, 23 Oct 2024 19:11:59 +0000, Terje Mathisen wrote:

MitchAlsup1 wrote:

On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

Terje Mathisen <terje.mathisen@tmsw.no> writes:

My wife and I will both go on "permanent vacation" starting a week
before Christmas. :-)

I'm guessing that permanent vacation will be some mixture of actual
vacation and self-chosen "work". In any case I hope you both enjoy
the time.

Just remember, retirement does not mean you "stop working"
it means you "stop working for HIM".

Exactly!

I have unlimited amounts of potential/available mapping work, and I do
want to get back to NTP Hackers.

We recently started (officially) on the 754-2029 revision.

Are you going to put in something equivalent to quires ??

In posits, a quire is an accumulator with as many binary digits
as to cover max-exponent to min-exponent; so one can accumulate
an essentially unbounded number of sums without loss of precision
--to obtain a sum with a single rounding.

OK, I have seen and used "Super-accumulator" as the term for those, I
have thought about implementing one in carry-save redundant form, but
that might be more redundancy than really needed?

Having a carry bit for every byte should still make it possible to
handle several additions/cycle, right?

I'm assuming the real cost is in the alignment network needed to route
incoming addends into the right slice.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
--- Synchronet 3.20a-Linux NewsLink 1.114

From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Oct 28 16:30:46 2024

From Newsgroup: comp.arch

MitchAlsup1 <mitchalsup@aol.com> schrieb:

In posits, a quire is an accumulator with as many binary digits
as to cover max-exponent to min-exponent; so one can accumulate
an essentially unbounded number of sums without loss of precision
--to obtain a sum with a single rounding.

Not restricted to posits, I believe (but the term may differ).

At university, I had my programming
courses on a Pascal compiler which implemented
https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
a hardware implementation was on the 4361 as an option
https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility
--- Synchronet 3.20a-Linux NewsLink 1.114

From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Oct 28 10:12:08 2024

From Newsgroup: comp.arch

On 10/28/2024 9:30 AM, Thomas Koenig wrote:

MitchAlsup1 <mitchalsup@aol.com> schrieb:

In posits, a quire is an accumulator with as many binary digits
as to cover max-exponent to min-exponent; so one can accumulate
an essentially unbounded number of sums without loss of precision
--to obtain a sum with a single rounding.

Not restricted to posits, I believe (but the term may differ).

At university, I had my programming
courses on a Pascal compiler which implemented
https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
a hardware implementation was on the 4361 as an option
https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility

Another newer alternative. This came up on my news feed. I haven't
looked at the details at all, so I can't comment on it.

https://arxiv.org/abs/2410.03692
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
--- Synchronet 3.20a-Linux NewsLink 1.114

From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Oct 28 18:14:20 2024

From Newsgroup: comp.arch

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

On 10/28/2024 9:30 AM, Thomas Koenig wrote:

MitchAlsup1 <mitchalsup@aol.com> schrieb:

In posits, a quire is an accumulator with as many binary digits
as to cover max-exponent to min-exponent; so one can accumulate
an essentially unbounded number of sums without loss of precision
--to obtain a sum with a single rounding.

Not restricted to posits, I believe (but the term may differ).

At university, I had my programming
courses on a Pascal compiler which implemented
https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
a hardware implementation was on the 4361 as an option
https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility

Another newer alternative. This came up on my news feed. I haven't
looked at the details at all, so I can't comment on it.

https://arxiv.org/abs/2410.03692

That is about another number representation for AI, trying to squeeze
more AI performance out of few bits.

Personally, I like the approach of doing analog calculation for
the low-accuracy dot products that they do, followed by an A/D
converter. There is a company doing that, but I forget its name.
--- Synchronet 3.20a-Linux NewsLink 1.114

From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Oct 28 15:24:18 2024

From Newsgroup: comp.arch

Thomas Koenig wrote:

MitchAlsup1 <mitchalsup@aol.com> schrieb:

In posits, a quire is an accumulator with as many binary digits
as to cover max-exponent to min-exponent; so one can accumulate
an essentially unbounded number of sums without loss of precision
--to obtain a sum with a single rounding.

Not restricted to posits, I believe (but the term may differ).

At university, I had my programming
courses on a Pascal compiler which implemented
https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
a hardware implementation was on the 4361 as an option
https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility

These would be very large registers. You'd need some way to store and load
these for register spills, fills and task switch, as well as move
and manage them.

Karlsruhe above has a link to
http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

which describes their large accumulators as residing in memory, which
avoids the spill/fill/switch issue but with an obvious performance hit:

"A floating-point accumulator occupies a 168-byte storage area that is
aligned on a 256-byte boundary. An accumulator consists of a four-byte
status area on the left, followed by a 164-byte numeric area."

The operands are specified by virtual address of their in-memory accumulator.

Of course, once you have 168-byte registers people are going to
think of new uses for them.

--- Synchronet 3.20a-Linux NewsLink 1.114

From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Oct 29 06:33:50 2024

From Newsgroup: comp.arch

EricP <ThatWouldBeTelling@thevillage.com> schrieb:

Thomas Koenig wrote:

MitchAlsup1 <mitchalsup@aol.com> schrieb:

In posits, a quire is an accumulator with as many binary digits
as to cover max-exponent to min-exponent; so one can accumulate
an essentially unbounded number of sums without loss of precision
--to obtain a sum with a single rounding.

Not restricted to posits, I believe (but the term may differ).

At university, I had my programming
courses on a Pascal compiler which implemented
https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
a hardware implementation was on the 4361 as an option
https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility

These would be very large registers. You'd need some way to store and load
these for register spills, fills and task switch, as well as move
and manage them.

Karlsruhe above has a link to
http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

which describes their large accumulators as residing in memory, which
avoids the spill/fill/switch issue but with an obvious performance hit:

"A floating-point accumulator occupies a 168-byte storage area that is aligned on a 256-byte boundary. An accumulator consists of a four-byte
status area on the left, followed by a 164-byte numeric area."

The operands are specified by virtual address of their in-memory accumulator.

Makes sense, given the time this was implemented. This was also a
mid-range machine, not a number cruncher. I do not find the
number of cycles that the instructions took.

But this was also for hex floating point. A similar scheme for IEEE
double would need a bit more than 2048 bits, so five AVX-512 registers.

Of course, once you have 168-byte registers people are going to
think of new uses for them.

SIMD from hell? Pretend that a CPU is a graphics card? :-)
--- Synchronet 3.20a-Linux NewsLink 1.114

From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Oct 29 08:07:50 2024

From Newsgroup: comp.arch

Thomas Koenig wrote:

EricP <ThatWouldBeTelling@thevillage.com> schrieb:

Thomas Koenig wrote:

MitchAlsup1 <mitchalsup@aol.com> schrieb:

In posits, a quire is an accumulator with as many binary digits
as to cover max-exponent to min-exponent; so one can accumulate
an essentially unbounded number of sums without loss of precision
--to obtain a sum with a single rounding.

Not restricted to posits, I believe (but the term may differ).

At university, I had my programming
courses on a Pascal compiler which implemented
https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
a hardware implementation was on the 4361 as an option
https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility

These would be very large registers. You'd need some way to store and load
these for register spills, fills and task switch, as well as move
and manage them.

Karlsruhe above has a link to
http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

which describes their large accumulators as residing in memory, which
avoids the spill/fill/switch issue but with an obvious performance hit:

"A floating-point accumulator occupies a 168-byte storage area that is
aligned on a 256-byte boundary. An accumulator consists of a four-byte
status area on the left, followed by a 164-byte numeric area."

The operands are specified by virtual address of their in-memory accumulator.

Makes sense, given the time this was implemented. This was also a
mid-range machine, not a number cruncher. I do not find the
number of cycles that the instructions took.

At the time, memory was just a few clock cycles away from the CPU, so
not really that problematic. Today, such a super-accumulator would stay
in $L1 most of the time, or at least the central, in-use cache line of
it, would do so.

But this was also for hex floating point. A similar scheme for IEEE
double would need a bit more than 2048 bits, so five AVX-512 registers.

With 1312 bits of storage, their fp inputs (hex fp?) must have had a
smaller exponent range than ieee double.

If I was implementing this I would probably want some redundant storage
to limit carry propagation, so maybe 48 bits per 64-bit chunk, in which
case I would need about 2800 bits or 6 of those 512-bit SIMD regs.
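
The sizing above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the accumulator must cover every bit position of an IEEE double, from 2**1023 down to the smallest subnormal 2**-1074, and that each 64-bit chunk carries 48 payload bits with the rest left as carry headroom. The constant names are made up for illustration.

```python
import math

# Assumption: one bit position for every power of two a finite double can touch.
DATA_BITS = 1023 + 1074 + 1      # 2**1023 down to 2**-1074 inclusive
PAYLOAD_PER_CHUNK = 48           # payload bits kept in each 64-bit chunk
CHUNK_BITS = 64
REG_BITS = 512                   # one AVX-512 sized register

chunks = math.ceil(DATA_BITS / PAYLOAD_PER_CHUNK)
storage_bits = chunks * CHUNK_BITS
registers = math.ceil(storage_bits / REG_BITS)

# With 16 spare bits per chunk, roughly 2**16 unsigned additions can land
# in a chunk before its deferred carries have to be propagated.
headroom_adds = 2 ** (CHUNK_BITS - PAYLOAD_PER_CHUNK)

print(DATA_BITS, chunks, storage_bits, registers, headroom_adds)
```

Under these assumptions the result is 44 chunks and 2816 bits of storage, which fits in six 512-bit registers, in line with the "about 2800 bits" estimate.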

SIMD from hell? Pretend that a CPU is a graphics card? :-)

Writing this as a throughput task could make it fit better within a GPU?

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
--- Synchronet 3.20a-Linux NewsLink 1.114

From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Tue Oct 29 14:19:13 2024

From Newsgroup: comp.arch

In posits, a quire is an accumulator with as many binary digits
as to cover max-exponent to min-exponent; so one can accumulate
an essentially unbounded number of sums without loss of precision
--to obtain a sum with a single rounding.

IIUC you can already implement such a thing with standard IEEE
operations, based on the "standard" Knuth approach of computing the
exact result of `a + b` as the sum of x + y where x is the "normal" sum
of a + b (and hence y holds the remaining bits lost to rounding).

I wonder how often this is used in practice.

Intuitively it should be possible to make it reasonably efficient, where
you first compute the "naive" sum but also keep the N remaining numbers representing the bits lost to each of the N roundings. I.e. you take in
a vector "as" of N numbers and return a pair of the "naive" sum plus
a vector of N rounding errors.

Σ as => (round(Σ as), rs)
such that round(Σ as) = the naive IEEE sum of as
and Σ as = round(Σ as) + Σ rs

You can then recursively compute "Σ rs" in the same way. At each step of
the recursion you can compute round(Σ |rs|) to estimate an upper bound
on the remaining error and thus stop when that error is smaller than
1 ULP or somesuch.

AFAICT, if your sum is well-conditioned you should need at most 2 steps
of the recursion, and I suspect you can predict when the next estimated
error will be too small before you start the last recursion, so the last recursion might skip the generation of the last "rs" vector.
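
A minimal Python sketch of the scheme described above, for illustration only. It assumes round-to-nearest IEEE doubles and no overflow, and uses the branch-free TwoSum usually attributed to Knuth. The function names, the `max_passes` cap, and the half-ULP stopping rule are assumptions of this sketch, not something specified in the post; `math.ulp` needs Python 3.9 or later.

```python
import math

def two_sum(a, b):
    """Return (s, e) with s = fl(a + b) and a + b == s + e exactly."""
    s = a + b
    b_virtual = s - a
    a_virtual = s - b_virtual
    return s, (a - a_virtual) + (b - b_virtual)

def sum_with_errors(xs):
    """Naive left-to-right sum plus the rounding error of every addition."""
    total, errs = 0.0, []
    for x in xs:
        total, e = two_sum(total, x)
        errs.append(e)
    return total, errs

def cascaded_sum(xs, max_passes=8):
    """Re-sum the rounding errors until the estimated remainder is negligible."""
    total, errs = sum_with_errors(xs)
    for _ in range(max_passes):
        bound = sum(abs(e) for e in errs)      # rounded estimate of what is left
        if bound <= 0.5 * math.ulp(total):
            break
        correction, errs = sum_with_errors(errs)
        total, e = two_sum(total, correction)
        errs.append(e)
    return total

print(cascaded_sum([1e100, 1.0, -1e100]))      # the naive sum loses the 1.0
print(sum([0.1] * 10), cascaded_sum([0.1] * 10))
```

The invariant is that `total` plus the exact sum of `errs` always equals the exact sum of the inputs, so each pass only moves information from `errs` into `total`. The sketch does not guarantee a correctly rounded result in every case; `math.fsum` in the standard library does, and is a useful reference to compare against.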

Stefan
--- Synchronet 3.20a-Linux NewsLink 1.114

From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Tue Oct 29 14:29:28 2024

From Newsgroup: comp.arch

Thomas Koenig wrote:

EricP <ThatWouldBeTelling@thevillage.com> schrieb:

Thomas Koenig wrote:

MitchAlsup1 <mitchalsup@aol.com> schrieb:

In posits, a quire is an accumulator with as many binary digits
as to cover max-exponent to min-exponent; so one can accumulate
an essentially unbounded number of sums without loss of precision
--to obtain a sum with a single rounding.

Not restricted to posits, I believe (but the term may differ).

At university, I had my programming
courses on a Pascal compiler which implemented
https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
a hardware implementation was on the 4361 as an option
https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility

These would be very large registers. You'd need some way to store and load
these for register spills, fills and task switch, as well as move
and manage them.

Karlsruhe above has a link to
http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

which describes their large accumulators as residing in memory, which
avoids the spill/fill/switch issue but with an obvious performance hit:

"A floating-point accumulator occupies a 168-byte storage area that is
aligned on a 256-byte boundary. An accumulator consists of a four-byte
status area on the left, followed by a 164-byte numeric area."

The operands are specified by virtual address of their in-memory accumulator.

Makes sense, given the time this was implemented. This was also a
mid-range machine, not a number cruncher. I do not find the
number of cycles that the instructions took.

But this was also for hex floating point. A similar scheme for IEEE
double would need a bit more than 2048 bits, so five AVX-512 registers.

Right, something like 2048+52+3 = 2103 bits for data, plus some status bits. For x64 they could overlay it onto AVX-512 register file in groups of 5
and use existing SIMD instructions for management.
That would allow them to pack 3 accumulators into registers z0..z14.

For RISC-V they have the large vector registers, 32 * 256-bits each I think,
so again 3 accumulators.

So it's a plausible proposition.

--- Synchronet 3.20a-Linux NewsLink 1.114

From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Oct 29 19:57:25 2024

From Newsgroup: comp.arch

Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

Thomas Koenig wrote:

EricP <ThatWouldBeTelling@thevillage.com> schrieb:

Thomas Koenig wrote:

MitchAlsup1 <mitchalsup@aol.com> schrieb:

In posits, a quire is an accumulator with as many binary digits
as to cover max-exponent to min-exponent; so one can accumulate
an essentially unbounded number of sums without loss of precision
--to obtain a sum with a single rounding.

Not restricted to posits, I believe (but the term may differ).

At university, I had my programming
courses on a Pascal compiler which implemented
https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
a hardware implementation was on the 4361 as an option
https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility

These would be very large registers. You'd need some way to store and load
these for register spills, fills and task switch, as well as move
and manage them.

Karlsruhe above has a link to
http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

which describes their large accumulators as residing in memory, which
avoids the spill/fill/switch issue but with an obvious performance hit:

"A floating-point accumulator occupies a 168-byte storage area that is
aligned on a 256-byte boundary. An accumulator consists of a four-byte
status area on the left, followed by a 164-byte numeric area."

The operands are specified by virtual address of their in-memory accumulator.

Makes sense, given the time this was implemented. This was also a
mid-range machine, not a number cruncher. I do not find the
number of cycles that the instructions took.

At the time, memory was just a few clock cycles away from the CPU, so
not really that problematic. Today, such a super-accumulator would stay
in $L1 most of the time, or at least the central, in-use cache line of
it, would do so.

But this was also for hex floating point. A similar scheme for IEEE
double would need a bit more than 2048 bits, so five AVX-512 registers.

With 1312 bits of storage, their fp inputs (hex fp?) must have had a
smaller exponent range than ieee double.

IBM format had one sign bit, seven exponent bits and six or fourteen hexadecimal digits for single and double precision, respectively.
(Insert fear and loathing for hex float here).
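
For readers who have not met the format, here is a small decoding sketch for the single-precision layout just described (one sign bit, seven exponent bits, six hexadecimal fraction digits). The excess-64 exponent bias and the reading of the fraction as 0.hhhhhh in base 16 come from the usual System/360 documentation rather than from this post, and the function name is made up for illustration.

```python
def ibm32_to_float(word):
    """Decode a 32-bit IBM hexadecimal float given as an unsigned integer."""
    sign = -1.0 if (word >> 31) & 1 else 1.0
    exponent = (word >> 24) & 0x7F          # 7 bits, power of 16, excess-64
    fraction = word & 0x00FFFFFF            # 6 hex digits, read as 0.hhhhhh
    return sign * (fraction / 16 ** 6) * 16.0 ** (exponent - 64)

print(ibm32_to_float(0x41100000))
print(ibm32_to_float(0xC276A000))
```

Because the exponent scales by powers of 16, a normalized fraction can have up to three leading zero bits, so the effective precision varies with the value. That wobble is a large part of why hex float is disliked.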
--- Synchronet 3.20a-Linux NewsLink 1.114

From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Oct 29 20:21:11 2024

From Newsgroup: comp.arch

On Tue, 29 Oct 2024 19:57:25 +0000, Thomas Koenig wrote:

Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

Thomas Koenig wrote:

EricP <ThatWouldBeTelling@thevillage.com> schrieb:

Thomas Koenig wrote:

MitchAlsup1 <mitchalsup@aol.com> schrieb:

In posits, a quire is an accumulator with as many binary digits
as to cover max-exponent to min-exponent; so one can accumulate
an essentially unbounded number of sums without loss of precision
--to obtain a sum with a single rounding.

Not restricted to posits, I believe (but the term may differ).

At university, I had my programming
courses on a Pascal compiler which implemented
https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
a hardware implementation was on the 4361 as an option
https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility

These would be very large registers. You'd need some way to store and load
these for register spills, fills and task switch, as well as move
and manage them.

Karlsruhe above has a link to
http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

which describes their large accumulators as residing in memory, which
avoids the spill/fill/switch issue but with an obvious performance hit:

"A floating-point accumulator occupies a 168-byte storage area that is
aligned on a 256-byte boundary. An accumulator consists of a four-byte
status area on the left, followed by a 164-byte numeric area."

The operands are specified by virtual address of their in-memory
accumulator.

Makes sense, given the time this was implemented. This was also a
mid-range machine, not a number cruncher. I do not find the
number of cycles that the instructions took.

At the time, memory was just a few clock cycles away from the CPU, so
not really that problematic. Today, such a super-accumulator would stay
in $L1 most of the time, or at least the central, in-use cache line of
it, would do so.

But this was also for hex floating point. A similar scheme for IEEE
double would need a bit more than 2048 bits, so five AVX-512 registers.

With 1312 bits of storage, their fp inputs (hex fp?) must have had a
smaller exponent range than ieee double.

Terje--IEEE is all capitals.

IBM format had one sign bit, seven exponent bits and six or fourteen hexadecimal digits for single and double precision, respectively.

The span of an IEEE double "quire" would be the exponent-2 + fraction.
a) The most significant non-infinity has an exponent of +1023
b) The least significant non-underflow has an exponent of -1023
Leaving a span of 2046 bits plus 52 denormalized bits or 2098-bits
or 262 bytes.

One note: When left in memory, one indexes the accumulator with
the (exponent>>6) and fetches 2 doublewords. A carry out requires
accessing the 3rd doubleword (possibly transitively).
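
A rough software model of such an accumulator, as a sketch only. Python's arbitrary-precision integer stands in for the wide register, scaled so that bit 0 corresponds to 2**-1074. A hardware or C version would instead index 64-bit words by (exponent>>6) as described above. The class and method names are hypothetical.

```python
import math

SCALE_BITS = 1074   # 2**-1074 is the smallest positive IEEE double

class Quire:
    """Exact fixed-point accumulator for finite IEEE doubles (illustrative)."""

    def __init__(self):
        self.acc = 0                         # value == acc * 2**-SCALE_BITS

    def add(self, x):
        if not math.isfinite(x):
            raise ValueError("only finite values can be accumulated")
        n, d = x.as_integer_ratio()          # x == n / d, d a power of two
        self.acc += n * ((1 << SCALE_BITS) // d)

    def to_float(self):
        # int / int true division rounds once, to the nearest double
        return self.acc / (1 << SCALE_BITS)

q = Quire()
for v in [1e100, 1.0, -1e100]:
    q.add(v)
print(q.to_float(), sum([1e100, 1.0, -1e100]))
print((1 << 1023).bit_length() + SCALE_BITS)   # bits needed for one double
```

A single double needs 2098 bit positions in this model, matching the span computed above. A real design would add guard bits on top so that many large values can be summed without overflowing the accumulator.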

(Insert fear and loathing for hex float here).

Heck, watching Kahan's notes on FP problems leaves one in fear of
binary floating point representations.
--- Synchronet 3.20a-Linux NewsLink 1.114

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Oct 29 20:30:12 2024

From Newsgroup: comp.arch

Thomas Koenig <tkoenig@netcologne.de> writes:

Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

Thomas Koenig wrote:

EricP <ThatWouldBeTelling@thevillage.com> schrieb:

Thomas Koenig wrote:

MitchAlsup1 <mitchalsup@aol.com> schrieb:

In posits, a quire is an accumulator with as many binary digits
as to cover max-exponent to min-exponent; so one can accumulate
an essentially unbounded number of sums without loss of precision
--to obtain a sum with a single rounding.

Not restricted to posits, I believe (but the term may differ).

At university, I had my programming
courses on a Pascal compiler which implemented
https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
a hardware implementation was on the 4361 as an option
https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility

These would be very large registers. You'd need some way to store and load
these for register spills, fills and task switch, as well as move
and manage them.

Karlsruhe above has a link to
http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

which describes their large accumulators as residing in memory, which
avoids the spill/fill/switch issue but with an obvious performance hit:

"A floating-point accumulator occupies a 168-byte storage area that is
aligned on a 256-byte boundary. An accumulator consists of a four-byte
status area on the left, followed by a 164-byte numeric area."

The operands are specified by virtual address of their in-memory accumulator.

Makes sense, given the time this was implemented. This was also a
mid-range machine, not a number cruncher. I do not find the
number of cycles that the instructions took.

At the time, memory was just a few clock cycles away from the CPU, so
not really that problematic. Today, such a super-accumulator would stay
in $L1 most of the time, or at least the central, in-use cache line of
it, would do so.

But this was also for hex floating point. A similar scheme for IEEE
double would need a bit more than 2048 bits, so five AVX-512 registers.

With 1312 bits of storage, their fp inputs (hex fp?) must have had a
smaller exponent range than ieee double.

IBM format had one sign bit, seven exponent bits and six or fourteen
hexadecimal digits for single and double precision, respectively.
(Insert fear and loathing for hex float here).

Burroughs Medium systems had four exponent sign bits, eight exponent bits,
four mantissa sign bits, and up to 400 mantissa bits. BCD, so that's an exponent range of -99 to +99 and a 1 to 100 digit mantissa.
--- Synchronet 3.20a-Linux NewsLink 1.114

From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Oct 29 21:27:29 2024

From Newsgroup: comp.arch

MitchAlsup1 <mitchalsup@aol.com> schrieb:

(Insert fear and loathing for hex float here).

Heck, watching Kahan's notes on FP problems leaves one in fear of
binary floating point representations.

True, but... hex float is so much worse.

"Hacker's delight" has some choice words there, and the
author worked for IBM :-)
--- Synchronet 3.20a-Linux NewsLink 1.114
