This is consistent with my observations. There is typically
a speed difference of around 10 per cent between code variants
with stack juggling and with locals. The difference is irrelevant
for the vast majority of tasks.
The gain in readability, on the other hand, is often enormous.
But that's another old, worn-out discussion.
With locals in CPU registers, I expect an even smaller speed
difference. However, I have not yet implemented and tested this.
minforth wrote:
This is consistent with my observations. There is typically
a speed difference of around 10 per cent between code variants
with stack juggling and with locals. The difference is irrelevant
for the vast majority of tasks.
The gain in readability, on the other hand, is often enormous.
But that's another old, worn-out discussion.
With locals in CPU registers, I expect an even smaller speed
difference. However, I have not yet implemented and tested this.
Even without putting them in registers it is faster (Ryzen 7 5800X):
: bench
cr timer-reset #43 fib .elapsed ." ( " . ." )"
cr timer-reset #43 fib2 .elapsed ." ( " . ." )" ;
FORTH> bench
1.658 seconds elapsed. ( 701408733 )
1.564 seconds elapsed. ( 701408733 ) ok
Here is fib2:
: fib2 params| n | n 2 < if 1 else n 1- recurse n 2- recurse + endif ;
FORTH> ' fib2 idis
$014588C0 : fib2
$014588CA pop rbx
$014588CB cmp rbx, 2 b#
$014588CF lea rsi, [rsi #-16 +] qword
$014588D3 mov [rsi] qword, rbx
$014588D6 mov rbx, rcx
$014588D9 jge $014588EB offset NEAR
$014588DF mov rbx, 1 d#
$014588E6 jmp $01458921 offset NEAR
$014588EB mov rbx, [rsi] qword
$014588EE lea rbx, [rbx -1 +] qword
$014588F2 lea rbp, [rbp -8 +] qword
$014588F6 mov [rbp 0 +] qword, $01458903 d#
$014588FE jmp $014588CB offset NEAR
$01458903 mov rbx, [rsi] qword
$01458906 lea rbx, [rbx -2 +] qword
$0145890A lea rbp, [rbp -8 +] qword
$0145890E mov [rbp 0 +] qword, $0145891B d#
$01458916 jmp $014588CB offset NEAR
$0145891B pop rbx
$0145891C pop rdi
$0145891D lea rbx, [rdi rbx*1] qword
$01458921 push rbx
$01458922 lea rsi, [rsi #16 +] qword
$01458926 ;
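For readers without iForth, the comparison can be approximated in standard Forth. This is only a sketch: the thread does not show the stack-based definition that was benchmarked, so `fib-stack` below is an assumption, and both names are hypothetical. `fib-local` mirrors the posted fib2, with the Forth-2012 `{: ... :}` locals syntax in place of iForth's `params|`, and `then` and `2 -` in place of `endif` and `2-`.

```forth
\ Stack-only variant (assumed; not taken from the thread).
: fib-stack ( n -- f )
  dup 2 < if
    drop 1                       \ base case: fib(0) = fib(1) = 1
  else
    dup 1- recurse               \ ( n fib[n-1] )
    swap 2 - recurse             \ ( fib[n-1] fib[n-2] )
    +
  then ;

\ Locals variant, same algorithm as the posted fib2.
: fib-local {: n -- f :}
  n 2 < if
    1
  else
    n 1- recurse  n 2 - recurse  +
  then ;

10 fib-stack .  10 fib-local .   \ both print 89
```

Both words follow the convention of the benchmark above, where an argument below 2 yields 1, so 43 gives the 701408733 shown in the timing output.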
On 10/04/2024 6:54 am, mhx wrote:[..]
minforth wrote:
VFX doesn't put much effort into optimizing locals code. Should they have? Or would that have encouraged users to write code that lacks thought and leave it to the compiler to make efficient? Which is the C approach.
dxf wrote:
On 10/04/2024 6:54 am, mhx wrote:[..]
minforth wrote:
VFX doesn't put much effort into optimizing locals code. Should they have?
Or would that have encouraged users to write code that lacks thought and
leave it to the compiler to make efficient? Which is the C approach.
I wouldn't know what those authors think, believe, or try to enforce.
There has been no effort spent in optimizing iForth's locals. The shown result is simply a byproduct of the architecture of the compiler.
...
On 10/04/2024 3:25 pm, mhx wrote:
dxf wrote:
Do you have a disassembly for Fib1? I ask because NT/Forth which purports
to do exactly that produces notably different results - despite there being little 'stack juggling' to resolve.
I wouldn't know what those authors think, believe, or try to enforce.
There has been no effort spent in optimizing iForth's locals. The shown result is simply a byproduct of the architecture of the compiler.
It is not my task to guide the users, they can do whatever they ask.
I believe my actions are consistent with that.
Locals are completely OK if you want to get stuff done. I use them for
the boring parts.
With locals in CPU registers, I expect an even smaller speed
difference. However, I have not yet implemented and tested this.
r 1->0 third 1->2 >l >l 1->1 dup 1->1mov -$08[r14],r8 mov r15,$10[r13] >l mov $00[r13],r8
2->1 add r14,$08 mov rax,rbp mov rbx,[r14]mov -$08[r13],r15 mov rax,[rbx] lea rbp,-$08[rbp] add r14,$08
mhx wrote:
Locals are completely OK if you want to get stuff done. I use them for
the boring parts.
I think the attitude of "programming to make the compiler happy i.e.
by stack juggling" is a waste of human resources. I can't remember a
single case where a performance bottleneck was fixed by switching
from a local to a non-local code formulation. The difference in speed,
if any, is simply too small.
On 10/04/2024 5:34 pm, minforth wrote:
mhx wrote:
Locals are completely OK if you want to get stuff done. I use them for
the boring parts.
I think the attitude of "programming to make the compiler happy i.e.
by stack juggling" is a waste of human resources. I can't remember a
single case where a performance bottleneck was fixed by switching
from a local to a non-local code formulation. The difference in speed,
if any, is simply too small.
To use forth's stack until such time as it becomes too much for one
suggests a half-heartedness. It begs the question why use forth at all
if one is merely going to toy with it.
However, it's more difficult when control flow is involved, and most native-code compilers register-allocate only in straight-line code.[..]
: 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
: 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;
For comparison, here's what gforth-fast currently gives you:
3dup.2                  3dup.4
third 1->2              dup 1->1
  mov r15,$10[r13]        mov $00[r13],r8
third 2->3                sub r13,$08
  mov r9,$08[r13]       2over 1->3
third 3->1                mov r15,$18[r13]
  mov $00[r13],r8         mov r9,$10[r13]
  sub r13,$18           rot 3->1
  mov $10[r13],r15        mov $00[r13],r15
  mov $08[r13],r9         sub r13,$10
;s 1->1                   mov $08[r13],r9
  mov rbx,[r14]         ;s 1->1
  add r14,$08             mov rbx,[r14]
  mov rax,[rbx]           add r14,$08
  jmp eax                 mov rax,[rbx]
                          jmp eax
Interestingly, 3DUP.2 is essentially the same as the lxf/ntf variant:
load two items from the memory parts of the stack, store three items
to the memory part of the stack, and one stack-pointer update. The differences are:
64-bit vs. 32-bit cells
different way of returning to the caller (;S code vs. ret)
sub vs. lea for the stack-pointer update
different order of loads and stores
different register allocation
But this is just a happy coincidence, and the other versions make it
clear that gforth-fast does not analyse the data flow even in
straight-line code.
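For anyone who wants to try the comparison on their own system, here is a minimal sketch that puts a locals-based formulation next to the two stack-only definitions quoted above. The name `3dup.loc` and its body are assumptions for illustration, not a definition taken from the thread; it uses the Forth-2012 `{: ... :}` syntax.

```forth
\ The two stack-only definitions quoted in the thread.
: 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
: 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;

\ Hypothetical locals variant: c takes the top of stack, then b, then a.
: 3dup.loc {: a b c -- a b c a b c :} a b c a b c ;

1 2 3 3dup.loc   \ leaves 1 2 3 1 2 3 on the data stack
```

All three words have the same stack effect, so any difference in the generated code comes from how the compiler handles each formulation, not from what the word computes.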
The local version loses track of constants.
dxf wrote:
On 10/04/2024 5:34 pm, minforth wrote:
mhx wrote:
Locals are completely OK if you want to get stuff done. I use them for
the boring parts.
I think the attitude of "programming to make the compiler happy i.e.
by stack juggling" is a waste of human resources. I can't remember a
single case where a performance bottleneck was fixed by switching
from a local to a non-local code formulation. The difference in speed,
if any, is simply too small.
To use forth's stack until such time as it becomes too much for one
suggests a half-heartedness. It begs the question why use forth at all
if one is merely going to toy with it.
Yes, of course it's entirely up to you if you don't fully utilise the available possibilities of your tools.
But I also suspect that your Forth
system has no locals at all.
...
To give you another example: Dr Noble's formula translator. Certainly not
a toy of an incapable programmer, as you imply.
On 10/04/2024 8:38 pm, minforth wrote:
dxf wrote:
On 10/04/2024 5:34 pm, minforth wrote:
mhx wrote:
Locals are completely OK if you want to get stuff done. I use them for
the boring parts.
I think the attitude of "programming to make the compiler happy i.e.
by stack juggling" is a waste of human resources. I can't remember a
single case where a performance bottleneck was fixed by switching
from a local to a non-local code formulation. The difference in speed,
if any, is simply too small.
To use forth's stack until such time as it becomes too much for one
suggests a half-heartedness. It begs the question why use forth at all
if one is merely going to toy with it.
Yes, of course it's entirely up to you if you don't fully utilise the
available possibilities of your tools.
In calling locals 'a tool' or 'used for the boring stuff' I'd have the
problem of explaining to colleagues how it is I can use a language in a
way that its creator has dismissed as antithetical.
But I also suspect that your Forth
system has no locals at all.
DX-Forth offers locals as a loadable option. They're implemented
efficiently so as to be credible. I'm not aware of any user that has
availed themselves of it. Indeed I find they come to forth looking
for something that's honestly unique and challenging. To discover
Moore still resolute after all these years provides their role model.
mhx wrote:
The local version loses track of constants.
Interesting aspect. Within the test words all 3dup variants had been
inlined, I assume.
However, inlining semantically identical words with locals is
"not commutative", to grossly abuse the algebraic expression. Hmm...
minforth@gmx.net (minforth) writes:
mhx wrote:
The local version loses track of constants.
Interesting aspect. Within the test words all 3dup variants had been
inlined, I assume.
However, inlining semantically identical words with locals is
"not commutative", to grossly abuse the algebraic expression. Hmm...
I think mathematicians have a word for what you mean, but it's not
"commutative".
Anyway, it's not specific to locals. Everything that loses
information will affect everything that comes afterwards. E.g., in
Gforth we have a literal stack that only represents literals on the
data stack. So anything that moves values from the data stack (e.g.,
to the return stack) will lose the information about the constant.
- anton
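Anton's point about lost information can be illustrated with two words that compute the same result. This is a sketch with hypothetical names; the thread does not show the code any system generates for them. Going by the description above, a literal stack that only tracks literals on the data stack would still know about the 5 in the first word, but not after its detour through the return stack in the second.

```forth
\ The literal stays on the data stack until it is consumed by + .
: add5-direct ( n -- n+5 ) 5 + ;

\ Same result, but the literal is moved to the return stack and back,
\ which is the kind of move described above as losing the constant.
: add5-detour ( n -- n+5 ) 5 >r r> + ;

3 add5-direct .  3 add5-detour .   \ both print 8
```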
To use forth's stack until such time as it becomes too much for one
suggests a half-heartedness. It begs the question why use forth at all
if one is merely going to toy with it.
Moore and Fox discussed that. Making a non-optimal approach more
efficient isn't the answer.