Forum: War Ensemble BBS

Another look at Gforth's locals implementation

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Tue Apr 9 15:59:58 2024

From Newsgroup: comp.lang.forth

30 years ago I wrote a paper about Gforth's locals and also gave some
performance results [ertl94l]. Since then the implementation of Gforth,
and also of its locals, has changed quite a bit; in particular, we have
recently improved the code generated for TO <local>.

So here I look at it again. As a first example, I look at the fib
benchmark (which has an off-by-one error, but that makes little difference):

no locals                     with a local

: fib ( n1 -- n2 )            : fib { n -- n2 }
    dup 2 < if                    n 2 < if
        drop 1                        1
    else                          else
        dup                           n 1- recurse
        1- recurse                    n 2 - recurse
        swap 2 - recurse              +
        +                         then ;
    then ;

For 43 fib the performance results (AMD64 instructions, Skylake
cycles) are:

     no locals    with a local  factor
13_805_561_653  15_012_002_369    1.09
41_475_949_910  48_490_119_217    1.17

So those of you who dislike locals can still point to worse
performance of code with locals on Gforth.
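
A minimal sketch for readers who want to check the off-by-one remark, assuming a Gforth-style system that accepts the { ... } locals syntax used above. The names fib-stack and fib-local are invented here so that both variants can be loaded side by side; the bodies are the ones from the post.

```forth
\ Stack-only variant (body as in the post, renamed for this sketch).
: fib-stack ( n1 -- n2 )
    dup 2 < if
        drop 1
    else
        dup 1- recurse
        swap 2 - recurse
        +
    then ;

\ Locals variant (body as in the post, renamed for this sketch).
: fib-local { n -- n2 }
    n 2 < if
        1
    else
        n 1- recurse
        n 2 - recurse
        +
    then ;

10 fib-stack .   \ prints 89
10 fib-local .   \ prints 89
```

Both variants return 1 for arguments 0 and 1, so the result for n is the textbook Fibonacci number F(n+1): 10 gives 89 rather than 55. The amount of recursive work is the same for both variants, which is why the error does not matter for the comparison.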

Another example is string comparison, where I compared three versions:

\ no locals
: strcmp1 ( addr1 u1 addr2 u2 -- n )
    rot 2dup 2>r min 0 ?do ( a1 a2 )
        over c@ over c@ - dup if
            nip nip 2rdrop unloop exit then
        drop
        char+ swap char+ swap
    loop
    2drop r> r> - ;

\ locals with TO
: strcmp2
    { addr1 u1 addr2 u2 -- n }
    u1 u2 min 0
    ?do
        addr1 c@ addr2 c@ - dup if
            unloop exit then
        drop
        addr1 char+ TO addr1
        addr2 char+ TO addr2
    loop
    u1 u2 - ;

\ locals without TO
: strcmp3
    { addr1 u1 addr2 u2 -- n }
    addr1 addr2
    u1 u2 min 0
    ?do { s1 s2 }
        s1 c@ s2 c@ - dup if
            unloop exit then
        drop
        s1 char+ s2 char+
    loop
    2drop
    u1 u2 - ;
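
A usage sketch, not from the original post, for readers who want to try one of these words. It repeats strcmp2 unchanged and adds a hypothetical word strcmp-demo; the strings are compiled into that definition so the example does not depend on how a system buffers interpreted s" strings. The result is the difference of the first mismatching characters, or the difference of the lengths if one string is a prefix of the other.

```forth
\ strcmp2 as given above (locals with TO), Gforth { ... } syntax assumed.
: strcmp2 { addr1 u1 addr2 u2 -- n }
    u1 u2 min 0 ?do
        addr1 c@ addr2 c@ - dup if
            unloop exit then
        drop
        addr1 char+ TO addr1
        addr2 char+ TO addr2
    loop
    u1 u2 - ;

\ Hypothetical demo word, not part of the post.
: strcmp-demo ( -- )
    s" abc"  s" abd" strcmp2 .    \ 'c' - 'd'   = -1
    s" abc"  s" abc" strcmp2 .    \ equal       =  0
    s" abcd" s" abc" strcmp2 . ;  \ length 4 - 3 = 1

strcmp-demo   \ prints -1 0 1
```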

I am too lazy to benchmark this, but here's the number of
native-instruction bytes in the inner loop:

129 strcmp1
128 strcmp2
149 strcmp3

Here's the code for the body of the loop:

strcmp1                  strcmp2                  strcmp3
over 1->1                @local0 1->1             >l 1->1
mov $00[r13],r8          mov $00[r13],r8          mov rax,rbp
sub r13,$08              sub r13,$08              add r13,$08
mov r8,$10[r13]          mov r8,$00[rbp]          lea rbp,-$08[rbp]
c@ 1->1                  c@ 1->1                  mov -$08[rax],r8
movzx r8d,bytePTR[r8]    movzx r8d,bytePTR[r8]    mov r8,$00[r13]
over 1->2                @local2 1->2             >l @local0 1->1
mov r15,$08[r13]         mov r15,$10[rbp]         @local0
c@ 2->2                  c@ 2->2                  mov rax,rbp
movzx r15d,bytePTR[r15]  movzx r15d,bytePTR[r15]  lea rbp,-$08[rbp]
- 2->1                   - 2->1                   mov -$08[rax],r8
sub r8,r15               sub r8,r15               c@ 1->1
dup 1->2                 dup 1->2                 movzx r8d,bytePTR[r8]
mov r15,r8               mov r15,r8               @local1 1->2
?branch 2->1             ?branch 2->1             mov r15,$08[rbp]
<strcmp1+$A8>            <strcmp2+$C0>            c@ 2->2
add rbx,$68              add rbx,$68              movzx r15d,bytePTR[r15]
mov rax,[rbx]            mov rax,[rbx]            - 2->1
test r15,r15             test r15,r15             sub r8,r15
jnz $7FED3DB2C317        jnz $7FED3DB2C0DC        dup 1->2
jmp eax                  jmp eax                  mov r15,r8
nip 1->1                 unloop 1->1              ?branch 2->1
add r13,$08              add r14,$10              <strcmp3+$E0>
nip 1->1                 lit 1->2                 add rbx,$78
add r13,$08              #32                      mov rax,[rbx]
2rdrop 1->1              sub rbx,$10              test r15,r15
add r14,$10              mov r15,-$08[rbx]        jnz $7FED3DB2C208
unloop 1->1              lp+! 2->1                jmp eax
add r14,$10              add rbp,r15              unloop 1->1
;s 1->1                  ;s 1->1                  add r14,$10
mov rbx,[r14]            mov rbx,[r14]            lit 1->2
add r14,$08              add r14,$08              #48
mov rax,[rbx]            mov rax,[rbx]            sub rbx,$10
jmp eax                  jmp eax                  mov r15,-$08[rbx]
drop 1->1                drop 1->1                lp+! 2->1
mov r8,$08[r13]          mov r8,$08[r13]          add rbp,r15
add r13,$08              add r13,$08              ;s 1->1
char+ 1->1               @local0 1->2             mov rbx,[r14]
add r8,$01               mov r15,$00[rbp]         add r14,$08
swap 1->2                char+ 2->2               mov rax,[rbx]
mov r15,$08[r13]         add r15,$01              jmp eax
add r13,$08              !local0 2->1             drop 1->0
char+ 2->2               mov $00[rbp],r15         @local0 0->1
add r15,$01              @local2 1->2             mov r8,$00[rbp]
swap 2->1                mov r15,$10[rbp]         char+ 1->1
mov $00[r13],r15         char+ 2->2               add r8,$01
sub r13,$08              add r15,$01              @local1 1->1
(loop) 1->1              !local2 2->1             mov $00[r13],r8
<strcmp1+$40>            mov $10[rbp],r15         sub r13,$08
sub rbx,$68              (loop) 1->1              mov r8,$08[rbp]
mov rax,[r14]            <strcmp2+$58>            char+ 1->1
add rax,$01              sub rbx,$68              add r8,$01
cmp $08[r14],rax         mov rax,[r14]            (loop)-lp+!# 1->1
mov [r14],rax            add rax,$01              <strcmp3+$68>
jz $7FED3DB2C36C         cmp $08[r14],rax         #16
mov rax,[rbx]            mov [r14],rax            add rbx,$40
jmp eax                  jz $7FED3DB2C130         mov rax,[r14]
                         mov rax,[rbx]            mov rsi,-$10[rbx]
                         jmp eax                  add rax,$01
                                                  cmp $08[r14],rax
                                                  jz $7FED3DB2C25F
                                                  add rbp,-$08[rbx]
                                                  mov [r14],rax
                                                  mov rbx,rsi
                                                  mov rax,[rsi]
                                                  jmp eax

@InProceedings{ertl94l,
author = "M. Anton Ertl",
title = "Automatic Scoping of Local Variables",
booktitle = "EuroForth~'94 Conference Proceedings",
year = "1994",
address = "Winchester, UK",
pages = "31--37",
url = "http://www.complang.tuwien.ac.at/papers/ertl94l.ps.gz",
abstract = "In the process of lifting the restrictions on using
locals in Forth, an interesting problem poses
itself: What does it mean if a local is defined in a
control structure? Where is the local visible? Since
the user can create every possible control structure
in ANS Forth, the answer is not as simple as it may
seem. Ideally, the local is visible at a place if
the control flow {\em must} pass through the
definition of the local to reach this place. This
paper discusses locals in general, the visibility
problem, its solution, the consequences and the
implementation as well as related programming style
questions."
}

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
--- Synchronet 3.20a-Linux NewsLink 1.114

From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Tue Apr 9 19:43:55 2024

From Newsgroup: comp.lang.forth

This is consistent with my observations. There is typically
a speed difference of around 10 per cent between code variants
with stack juggling and with locals. The difference is irrelevant
for the vast majority of tasks.

The gain in readability, on the other hand, is often enormous.
But that's another old, worn-out discussion.

With locals in CPU registers, I expect an even smaller speed
difference. However, I have not yet implemented and tested this.

From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Tue Apr 9 20:54:47 2024

From Newsgroup: comp.lang.forth

minforth wrote:

This is consistent with my observations. There is typically
a speed difference of around 10 per cent between code variants
with stack juggling and with locals. The difference is irrelevant
for the vast majority of tasks.

The gain in readability, on the other hand, is often enormous.
But that's another old, worn-out discussion.

With locals in CPU registers, I expect an even smaller speed
difference. However, I have not yet implemented and tested this.

Even without putting them in registers it is faster (Ryzen 7 5800X):

: bench
cr timer-reset #43 fib .elapsed ." ( " . ." )"
cr timer-reset #43 fib2 .elapsed ." ( " . ." )" ;

FORTH> bench
1.658 seconds elapsed. ( 701408733 )
1.564 seconds elapsed. ( 701408733 ) ok

Here is fib2:
: fib2 params| n |
n 2 < if 1
else n 1- recurse
n 2- recurse
+
endif ;

FORTH> ' fib2 idis
$014588C0 : fib2
$014588CA pop rbx
$014588CB cmp rbx, 2 b#
$014588CF lea rsi, [rsi #-16 +] qword
$014588D3 mov [rsi] qword, rbx
$014588D6 mov rbx, rcx
$014588D9 jge $014588EB offset NEAR
$014588DF mov rbx, 1 d#
$014588E6 jmp $01458921 offset NEAR
$014588EB mov rbx, [rsi] qword
$014588EE lea rbx, [rbx -1 +] qword
$014588F2 lea rbp, [rbp -8 +] qword
$014588F6 mov [rbp 0 +] qword, $01458903 d#
$014588FE jmp $014588CB offset NEAR
$01458903 mov rbx, [rsi] qword
$01458906 lea rbx, [rbx -2 +] qword
$0145890A lea rbp, [rbp -8 +] qword
$0145890E mov [rbp 0 +] qword, $0145891B d#
$01458916 jmp $014588CB offset NEAR
$0145891B pop rbx
$0145891C pop rdi
$0145891D lea rbx, [rdi rbx*1] qword
$01458921 push rbx
$01458922 lea rsi, [rsi #16 +] qword
$01458926 ;

-marcel

From dxf@dxforth@gmail.com to comp.lang.forth on Wed Apr 10 12:34:28 2024

From Newsgroup: comp.lang.forth

On 10/04/2024 5:43 am, minforth wrote:

This is consistent with my observations. There is typically
a speed difference of around 10 per cent between code variants
with stack juggling and with locals. The difference is irrelevant
for the vast majority of tasks.

The gain in readability, on the other hand, is often enormous.
But that's another old, worn-out discussion.

'Readability' 'stack juggling'. No discussion there - just appeals
to prejudice.

With locals in CPU registers, I expect an even smaller speed
difference. However, I have not yet implemented and tested this.

Moore and Fox discussed that. Making a non-optimal approach more
efficient isn't the answer.


From dxf@dxforth@gmail.com to comp.lang.forth on Wed Apr 10 13:00:48 2024

From Newsgroup: comp.lang.forth

On 10/04/2024 6:54 am, mhx wrote:

minforth wrote:

This is consistent with my observations. There is typically
a speed difference of around 10 per cent between code variants
with stack juggling and with locals. The difference is irrelevant
for the vast majority of tasks.

The gain in readability, on the other hand, is often enormous.
But that's another old, worn-out discussion.

With locals in CPU registers, I expect an even smaller speed
difference. However, I have not yet implemented and tested this.

Even without putting them in registers it is faster (Ryzen 7 5800X):

: bench
cr timer-reset #43 fib .elapsed ." ( " . ." )"
cr timer-reset #43 fib2 .elapsed ." ( " . ." )" ;

FORTH> bench
1.658 seconds elapsed. ( 701408733 )
1.564 seconds elapsed. ( 701408733 ) ok

Here is fib2:
: fib2 params| n |
n 2 < if 1
else n 1- recurse
n 2- recurse
+
endif ;

FORTH> ' fib2 idis
$014588C0 : fib2
$014588CA pop           rbx
$014588CB cmp           rbx, 2 b#
$014588CF lea           rsi, [rsi #-16 +] qword
$014588D3 mov           [rsi] qword, rbx
$014588D6 mov           rbx, rcx
$014588D9 jge           $014588EB offset NEAR
$014588DF mov           rbx, 1 d#
$014588E6 jmp           $01458921 offset NEAR
$014588EB mov           rbx, [rsi] qword
$014588EE lea           rbx, [rbx -1 +] qword
$014588F2 lea           rbp, [rbp -8 +] qword
$014588F6 mov           [rbp 0 +] qword, $01458903 d#
$014588FE jmp           $014588CB offset NEAR
$01458903 mov           rbx, [rsi] qword
$01458906 lea           rbx, [rbx -2 +] qword
$0145890A lea           rbp, [rbp -8 +] qword
$0145890E mov           [rbp 0 +] qword, $0145891B d#
$01458916 jmp           $014588CB offset NEAR
$0145891B pop           rbx
$0145891C pop           rdi
$0145891D lea           rbx, [rdi rbx*1] qword
$01458921 push          rbx
$01458922 lea           rsi, [rsi #16 +] qword
$01458926 ;

VFX doesn't put much effort into optimizing locals code. Should they have?
Or would that have encouraged users to write code that lacks thought and
leave it to the compiler to make efficient? Which is the C approach.


From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Wed Apr 10 05:25:48 2024

From Newsgroup: comp.lang.forth

dxf wrote:

On 10/04/2024 6:54 am, mhx wrote:

minforth wrote:

[..]

VFX doesn't put much effort into optimizing locals code. Should they have? Or would that have encouraged users to write code that lacks thought and leave it to the compiler to make efficient? Which is the C approach.

I wouldn't know what those authors think, believe, or try to enforce.

There has been no effort spent in optimizing iForth's locals. The shown
result is simply a byproduct of the architecture of the compiler.

It is not my task to guide the users, they can do whatever they ask.
I believe my actions are consistent with that.

Locals are completely OK if you want to get stuff done. I use them for
the boring parts.

-marcel

From dxf@dxforth@gmail.com to comp.lang.forth on Wed Apr 10 15:45:06 2024

From Newsgroup: comp.lang.forth

On 10/04/2024 3:25 pm, mhx wrote:

dxf wrote:

On 10/04/2024 6:54 am, mhx wrote:

minforth wrote:

[..]

VFX doesn't put much effort into optimizing locals code. Should they have?
Or would that have encouraged users to write code that lacks thought and
leave it to the compiler to make efficient? Which is the C approach.

I wouldn't know what those authors think, believe, or try to enforce.

There has been no effort spent in optimizing iForth's locals. The shown result is simply a byproduct of the architecture of the compiler.
...

Do you have a disassembly for Fib1? I ask because NT/Forth which purports
to do exactly that produces notably different results - despite there being little 'stack juggling' to resolve.

NT/FORTH (C) 2005 Peter Fälth Version 1.6-983-824 Compiled on 2017-12-03

see fib1
A49E58 40917C 58 C80000 5 normal FIB1

40917C 83FB02 cmp ebx , # 2h
40917F 0F8D0A000000 jge "0040918F"
409185 BB01000000 mov ebx , # 1h
40918A E926000000 jmp "004091B5"
40918F 8BC3 mov eax , ebx
409191 48 dec eax
409192 895DFC mov [ebp-4h] , ebx
409195 8BD8 mov ebx , eax
409197 8D6DFC lea ebp , [ebp-4h]
40919A E8DDFFFFFF call FIB1
40919F 8B4500 mov eax , [ebp]
4091A2 83E802 sub eax , # 2h
4091A5 895D00 mov [ebp] , ebx
4091A8 8BD8 mov ebx , eax
4091AA E8CDFFFFFF call FIB1
4091AF 035D00 add ebx , [ebp]
4091B2 8D6D04 lea ebp , [ebp+4h]
4091B5 C3 ret near

see fib2
A49E70 4091B6 86 C80000 5 normal FIB2

4091B6 83FB02 cmp ebx , # 2h
4091B9 895C24FC mov [esp-4h] , ebx
4091BD 8B5D00 mov ebx , [ebp]
4091C0 8D6D04 lea ebp , [ebp+4h]
4091C3 8D6424FC lea esp , [esp-4h]
4091C7 0F8D10000000 jge "004091DD"
4091CD 895DFC mov [ebp-4h] , ebx
4091D0 BB01000000 mov ebx , # 1h
4091D5 8D6DFC lea ebp , [ebp-4h]
4091D8 E92A000000 jmp "00409207"
4091DD 8B0424 mov eax , [esp]
4091E0 48 dec eax
4091E1 895DFC mov [ebp-4h] , ebx
4091E4 8BD8 mov ebx , eax
4091E6 8D6DFC lea ebp , [ebp-4h]
4091E9 E8C8FFFFFF call FIB2
4091EE 8B0424 mov eax , [esp]
4091F1 83E802 sub eax , # 2h
4091F4 895DFC mov [ebp-4h] , ebx
4091F7 8BD8 mov ebx , eax
4091F9 8D6DFC lea ebp , [ebp-4h]
4091FC E8B5FFFFFF call FIB2
409201 035D00 add ebx , [ebp]
409204 8D6D04 lea ebp , [ebp+4h]
409207 8D642404 lea esp , [esp+4h]
40920B C3 ret near


From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Wed Apr 10 06:58:17 2024

From Newsgroup: comp.lang.forth

dxf wrote:

On 10/04/2024 3:25 pm, mhx wrote:

dxf wrote:

Do you have a disassembly for Fib1? I ask because NT/Forth which purports
to do exactly that produces notably different results - despite there being little 'stack juggling' to resolve.

Here it is:

FORTH> : fib ( n1 -- n2 ) dup 2 < if drop 1 else dup 1- recurse swap 2 - recurse + then ; ok
FORTH> 10 fib . 89 ok
FORTH> see fib
Flags: ANSI
$01340A40 : fib
$01340A4A pop rbx
$01340A4B cmp rbx, 2 b#
$01340A4F jge $01340A61 offset NEAR
$01340A55 mov rbx, 1 d#
$01340A5C jmp $01340A95 offset NEAR
$01340A61 push rbx
$01340A62 lea rbx, [rbx -1 +] qword
$01340A66 lea rbp, [rbp -8 +] qword
$01340A6A mov [rbp 0 +] qword, $01340A77 d#
$01340A72 jmp $01340A4B offset NEAR
$01340A77 pop rbx
$01340A78 pop rdi
$01340A79 push rbx
$01340A7A lea rbx, [rdi -2 +] qword
$01340A7E lea rbp, [rbp -8 +] qword
$01340A82 mov [rbp 0 +] qword, $01340A8F d#
$01340A8A jmp $01340A4B offset NEAR
$01340A8F pop rbx
$01340A90 pop rdi
$01340A91 lea rbx, [rdi rbx*1] qword
$01340A95 push rbx
$01340A96 ;

-marcel

From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Wed Apr 10 07:34:38 2024

From Newsgroup: comp.lang.forth

mhx wrote:

I wouldn't know what those authors think, believe, or try to enforce.

There has been no effort spent in optimizing iForth's locals. The shown result is simply a byproduct of the architecture of the compiler.

It is not my task to guide the users, they can do whatever they ask.
I believe my actions are consistent with that.

Locals are completely OK if you want to get stuff done. I use them for
the boring parts.

I think the attitude of "programming to make the compiler happy i.e.
by stack juggling" is a waste of human resources. I can't remember a
single case where a performance bottleneck was fixed by switching
from a local to a non-local code formulation. The difference in speed,
if any, is simply too small.

In such a case, you first rethink the task and the algorithm and,
if necessary, write a few lines in assembler (or C) after profiling.

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Apr 10 07:00:38 2024

From Newsgroup: comp.lang.forth

minforth@gmx.net (minforth) writes:

With locals in CPU registers, I expect an even smaller speed
difference. However, I have not yet implemented and tested this.

Stack items and locals are just different ways of expressing the same
data flow. Ideally a Forth system maps both expressions to the same
optimal way of expressing the data flow in machine code. There is one
example where this actually happens: Consider:

: 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;
: 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
: 3dup.3 {: a b c :} a b c a b c ;
: 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;

These four ways of expressing 3DUP are all compiled to exactly the
same code by lxf/ntf:

804FC0A 8B4500 mov eax , [ebp]
804FC0D 8945F4 mov [ebp-Ch] , eax
804FC10 8B4504 mov eax , [ebp+4h]
804FC13 8945F8 mov [ebp-8h] , eax
804FC16 895DFC mov [ebp-4h] , ebx
804FC19 8D6DF4 lea ebp , [ebp-Ch]
804FC1C C3 ret near

However, it's more difficult when control flow is involved, and most native-code compilers register-allocate only in straight-line code.
For comparison, here's what gforth-fast currently gives you:

3dup.1              3dup.2              3dup.3              3dup.4
>r 1->0             third 1->2          >l >l 1->1          dup 1->1
mov -$08[r14],r8    mov r15,$10[r13]    >l                  mov $00[r13],r8
sub r14,$08         third 2->3          mov -$08[rbp],r8    sub r13,$08
2dup 0->2           mov r9,$08[r13]     mov rdx,$08[r13]    2over 1->3
mov r8,$10[r13]     third 3->1          mov rax,rbp         mov r15,$18[r13]
mov r15,$08[r13]    mov $00[r13],r8     add r13,$10         mov r9,$10[r13]
i 2->3              sub r13,$18         lea rbp,-$10[rbp]   rot 3->1
mov r9,[r14]        mov $10[r13],r15    mov -$10[rax],rdx   mov $00[r13],r15
-rot 3->2           mov $08[r13],r9     mov r8,$00[r13]     sub r13,$10
mov $00[r13],r9     ;s 1->1             >l @local0 1->1     mov $08[r13],r9
sub r13,$08         mov rbx,[r14]       @local0             ;s 1->1
r> 2->1             add r14,$08         mov rax,rbp         mov rbx,[r14]
mov -$08[r13],r15   mov rax,[rbx]       lea rbp,-$08[rbp]   add r14,$08
sub r13,$10         jmp eax             mov -$08[rax],r8    mov rax,[rbx]
mov $10[r13],r8                         @local1 1->2        jmp eax
mov r8,[r14]                            mov r15,$08[rbp]
add r14,$08                             @local2 2->1
;s 1->1                                 mov -$08[r13],r15
mov rbx,[r14]                           sub r13,$10
add r14,$08                             mov $10[r13],r8
mov rax,[rbx]                           mov r8,$10[rbp]
jmp eax                                 @local0 1->2
                                        mov r15,$00[rbp]
                                        @local1 2->3
                                        mov r9,$08[rbp]
                                        @local2 3->1
                                        mov -$10[r13],r9
                                        sub r13,$18
                                        mov $10[r13],r15
                                        mov $18[r13],r8
                                        mov r8,$10[rbp]
                                        lit 1->2
                                        #24
                                        mov r15,$50[rbx]
                                        lp+! 2->1
                                        add rbp,r15
                                        ;s 1->1
                                        mov rbx,[r14]
                                        add r14,$08
                                        mov rax,[rbx]
                                        jmp eax

Interestingly, 3DUP.2 is essentially the same as the lxf/ntf variant:
load two items from the memory parts of the stack, store three items
to the memory part of the stack, and one stack-pointer update. The
differences are:

* 64-bit vs. 32-bit cells
* different way of returning to the caller (;S code vs. ret)
* sub vs. lea for the stack-pointer update
* different order of loads and stores
* different register allocation

But this is just a happy coincidence, and the other versions make it
clear that gforth-fast does not analyse the data flow even in
straight-line code.
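
A small sketch, not from the original post, for checking that the four definitions really have the same stack effect before comparing their code. It assumes a system that supports the {: ... :} locals syntax; pack6 is a hypothetical helper that folds six single-digit stack items into one number so that the results are easy to compare.

```forth
: 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;
: 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
: 3dup.3 {: a b c :} a b c a b c ;
: 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;

\ Hypothetical helper: fold six single-digit items into one decimal
\ number, working from the top of the stack downwards.
: pack6 ( a b c d e f -- n )
    swap 10 * +
    swap 100 * +
    swap 1000 * +
    swap 10000 * +
    swap 100000 * + ;

1 2 3 3dup.1 pack6 .   \ prints 123123
1 2 3 3dup.2 pack6 .   \ prints 123123
1 2 3 3dup.3 pack6 .   \ prints 123123
1 2 3 3dup.4 pack6 .   \ prints 123123
```

Identical results only show that the words are interchangeable at the source level; whether a given system also generates identical machine code for them is the question the listings above address.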

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023

From dxf@dxforth@gmail.com to comp.lang.forth on Wed Apr 10 18:32:09 2024

From Newsgroup: comp.lang.forth

On 10/04/2024 5:34 pm, minforth wrote:

mhx wrote:

Locals are completely OK if you want to get stuff done. I use them for
the boring parts.

...
I think the attitude of "programming to make the compiler happy i.e.
by stack juggling" is a waste of human resources. I can't remember a
single case where a performance bottleneck was fixed by switching
from a local to a non-local code formulation. The difference in speed,
if any, is simply too small.

To use forth's stack until such time as it becomes too much for one
suggests a half-heartedness. It begs the question why use forth at all
if one is merely going to toy with it.


From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Wed Apr 10 10:38:43 2024

From Newsgroup: comp.lang.forth

dxf wrote:

On 10/04/2024 5:34 pm, minforth wrote:

mhx wrote:

Locals are completely OK if you want to get stuff done. I use them for
the boring parts.

...
I think the attitude of "programming to make the compiler happy i.e.
by stack juggling" is a waste of human resources. I can't remember a
single case where a performance bottleneck was fixed by switching
from a local to a non-local code formulation. The difference in speed,
if any, is simply too small.

To use forth's stack until such time as it becomes too much for one
suggests a half-heartedness. It begs the question why use forth at all
if one is merely going to toy with it.

Yes, of course it's entirely up to you if you don't fully utilise the
available possibilities of your tools. But I also suspect that your Forth system has no locals at all.

To give you another example: Dr Noble's formula translator. Certainly not
a toy of an incapable programmer, as you imply.

From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Wed Apr 10 16:35:04 2024

From Newsgroup: comp.lang.forth

Anton Ertl wrote:

However, it's more difficult when control flow is involved, and most native-code compilers register-allocate only in straight-line code.
For comparison, here's what gforth-fast currently gives you:

[..]

Interestingly, 3DUP.2 is essentially the same as the lxf/ntf variant:
load two items from the memory parts of the stack, store three items
to the memory part of the stack, and one stack-pointer update. The differences are:

..
Nice example. Here is what iForth does:

FORTH> : 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ; ok
FORTH> : 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ; ok
FORTH> : 3dup.3 params| a b c | a b c a b c ; ok
FORTH> : 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ; ok
FORTH> : test.1 3dup.1 .S ; ok
FORTH> : test.2 3dup.2 .S ; ok
FORTH> : test.3 3dup.3 .S ; ok
FORTH> : test.4 3dup.4 .S ; ok
FORTH> see test.1
Flags: ANSI
$01340F80 : test.1
$01340F8A pop rbx
$01340F8B pop rdi
$01340F8C mov rax, [rsp] qword
$01340F90 push rdi
$01340F91 push rbx
$01340F92 push rax
$01340F93 push rdi
$01340F94 push rbx
$01340F95 jmp .S+10 ( $012BEA0A ) offset NEAR
FORTH> see test.2
Flags: ANSI
$01340FC0 : test.2
$01340FCA mov rbx, [rsp #16 +] qword
$01340FCF mov rcx, rbx
$01340FD2 mov rbx, [rsp 8 +] qword
$01340FD7 push rcx
$01340FD8 mov rcx, rbx
$01340FDB mov rbx, [rsp 8 +] qword
$01340FE0 push rcx
$01340FE1 push rbx
$01340FE2 jmp .S+10 ( $012BEA0A ) offset NEAR
FORTH> see test.3
Flags: ANSI
$01341000 : test.3
$0134100A pop rbx
$0134100B pop rdi
$0134100C mov rax, [rsp] qword
$01341010 push rdi
$01341011 push rbx
$01341012 push rax
$01341013 push rdi
$01341014 push rbx
$01341015 jmp .S+10 ( $012BEA0A ) offset NEAR
FORTH> see test.4
Flags: ANSI
$01341040 : test.4
$0134104A pop rbx
$0134104B pop rdi
$0134104C mov rax, [rsp] qword
$01341050 push rdi
$01341051 push rbx
$01341052 push rax
$01341053 push rdi
$01341054 push rbx
$01341055 jmp .S+10 ( $012BEA0A ) offset NEAR

FORTH> ( and finally all together ... )
FORTH> : test.5 1 2 3 3dup.1 3dup.2 3dup.3 3dup.4 .S ;

FORTH> see test.5
Flags: ANSI
$01341080 : test.5
$0134108A push 1 b#
$0134108C push 2 b#
$0134108E push 3 b#
$01341090 push 1 b#
$01341092 push 2 b#
$01341094 push 3 b#
$01341096 mov rbx, [rsp #16 +] qword
$0134109B mov rcx, rbx
$0134109E mov rbx, [rsp 8 +] qword
$013410A3 push rcx
$013410A4 mov rcx, rbx
$013410A7 mov rbx, [rsp 8 +] qword
$013410AC mov rdi, [rsp] qword
$013410B0 push rcx
$013410B1 push rbx
$013410B2 push rdi
$013410B3 push rcx
$013410B4 push rbx
$013410B5 push rdi
$013410B6 push rcx
$013410B7 push rbx
$013410B8 jmp .S+10 ( $012BEA0A ) offset NEAR

-marcel

From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Wed Apr 10 16:50:36 2024

From Newsgroup: comp.lang.forth

The local version loses track of constants. If we reorder test.5 :

FORTH> : test.6 1 2 3 3dup.1 3dup.3 3dup.4 3dup.2 .S ; ok
FORTH> see test.6
Flags: ANSI
$01341100 : test.6
$0134110A push 1 b#
$0134110C push 2 b#
$0134110E push 3 b#
$01341110 push 1 b#
$01341112 push 2 b#
$01341114 push 3 b#
$01341116 push 1 b#
$01341118 push 2 b#
$0134111A push 3 b#
$0134111C push 1 b#
$0134111E push 2 b#
$01341120 push 3 b#
$01341122 mov rbx, [rsp #16 +] qword
$01341127 mov rcx, rbx
$0134112A mov rbx, [rsp 8 +] qword
$0134112F push rcx
$01341130 mov rcx, rbx
$01341133 mov rbx, [rsp 8 +] qword
$01341138 push rcx
$01341139 push rbx
$0134113A jmp .S+10 ( $012BEA0A ) offset NEAR
$0134113F ;

Maybe in vsn 7.0 :--)

-marcel

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Apr 10 17:06:45 2024

From Newsgroup: comp.lang.forth

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
[deleted the versions not of interest in this posting]

: 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
: 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;

...

For comparison, here's what gforth-fast currently gives you:

3dup.2              3dup.4
third 1->2          dup 1->1
mov r15,$10[r13]    mov $00[r13],r8
third 2->3          sub r13,$08
mov r9,$08[r13]     2over 1->3
third 3->1          mov r15,$18[r13]
mov $00[r13],r8     mov r9,$10[r13]
sub r13,$18         rot 3->1
mov $10[r13],r15    mov $00[r13],r15
mov $08[r13],r9     sub r13,$10
;s 1->1             mov $08[r13],r9
mov rbx,[r14]       ;s 1->1
add r14,$08         mov rbx,[r14]
mov rax,[rbx]       add r14,$08
jmp eax             mov rax,[rbx]
                    jmp eax

Interestingly, 3DUP.2 is essentially the same as the lxf/ntf variant:
load two items from the memory parts of the stack, store three items
to the memory part of the stack, and one stack-pointer update. The
differences are:

* 64-bit vs. 32-bit cells
* different way of returning to the caller (;S code vs. ret)
* sub vs. lea for the stack-pointer update
* different order of loads and stores
* different register allocation

But this is just a happy coincidence, and the other versions make it
clear that gforth-fast does not analyse the data flow even in
straight-line code.

3DUP.4 also performs these two loads and three stores (again in a
different order), but it distributes the stack pointer update across
two instructions.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023

From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Wed Apr 10 20:59:49 2024

From Newsgroup: comp.lang.forth

mhx wrote:

The local version loses track of constants.

Interesting aspect. Within the test words all 3dup variants had been
inlined, I assume.
However, inlining semantically identical words with locals is
"not commutative", to grossly abuse the algebraic expression. Hmm...

From dxf@dxforth@gmail.com to comp.lang.forth on Thu Apr 11 12:09:25 2024

From Newsgroup: comp.lang.forth

On 10/04/2024 8:38 pm, minforth wrote:

dxf wrote:

On 10/04/2024 5:34 pm, minforth wrote:

mhx wrote:

Locals are completely OK if you want to get stuff done. I use them for
the boring parts.

...
I think the attitude of "programming to make the compiler happy i.e.
by stack juggling" is a waste of human resources. I can't remember a
single case where a performance bottleneck was fixed by switching
from a local to a non-local code formulation. The difference in speed,
if any, is simply too small.

To use forth's stack until such time as it becomes too much for one
suggests a half-heartedness. It begs the question why use forth at all
if one is merely going to toy with it.

Yes, of course it's entirely up to you if you don't fully utilise the available possibilities of your tools.

In calling locals 'a tool' or 'used for the boring stuff' I'd have the
problem of explaining to colleagues how it is I can use a language in a
way that its creator has dismissed as antithetical.

But I also suspect that your Forth
system has no locals at all.

DX-Forth offers locals as a loadable option. They're implemented
efficiently so as to be credible. I'm not aware of any user that has
availed themselves of it. Indeed I find they come to forth looking
for something that's honestly unique and challenging. To discover
Moore still resolute after all these years provides their role model.


From dxf@dxforth@gmail.com to comp.lang.forth on Thu Apr 11 14:45:36 2024

From Newsgroup: comp.lang.forth

On 10/04/2024 8:38 pm, minforth wrote:

...
To give you another example: Dr Noble's formula translator. Certainly not
a toy of an incapable programmer, as you imply.

While Mr. Noble used locals in his formula translator, they were in fact superfluous. Here's a modified version made some years ago that exchanges locals with VALUEs:

https://pastebin.com/TnHKGNEk
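
To illustrate the kind of rewrite meant here, a minimal sketch that is not taken from the formula translator or from the pastebin version. The words sumsq-local and sumsq-value and the VALUEs a and b are hypothetical names invented for this example; the {: ... :} locals syntax is assumed to be available.

```forth
\ Locals version: a and b exist only for the duration of one call.
: sumsq-local {: a b -- n :}
    a a *  b b *  + ;

\ VALUE version: the same data flow, but a and b are now global.
0 value a
0 value b
: sumsq-value ( a b -- n )
    to b  to a
    a a *  b b *  + ;

3 4 sumsq-local .   \ prints 25
3 4 sumsq-value .   \ prints 25
```

Because the VALUEs are global, a nested or recursive call overwrites them, so the second form is only a drop-in replacement for words that are not re-entered while a and b are still needed.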


From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Thu Apr 11 11:49:02 2024

From Newsgroup: comp.lang.forth

In article <66174655$1@news.ausics.net>, dxf <dxforth@gmail.com> wrote:

On 10/04/2024 8:38 pm, minforth wrote:

dxf wrote:

On 10/04/2024 5:34 pm, minforth wrote:

mhx wrote:

Locals are completely OK if you want to get stuff done. I use them for
the boring parts.

...
I think the attitude of "programming to make the compiler happy i.e.
by stack juggling" is a waste of human resources. I can't remember a
single case where a performance bottleneck was fixed by switching
from a local to a non-local code formulation. The difference in speed,
if any, is simply too small.

To use forth's stack until such time as it becomes too much for one
suggests a half-heartedness. It begs the question why use forth at all
if one is merely going to toy with it.

Yes, of course it's entirely up to you if you don't fully utilise the
available possibilities of your tools.

In calling locals 'a tool' or 'used for the boring stuff' I'd have the
problem of explaining to colleagues how it is I can use a language in a
way that its creator has dismissed as antithetical.

But I also suspect that your Forth
system has no locals at all.

DX-Forth offers locals as a loadable option. They're implemented
efficiently so as to be credible. I'm not aware of any user that has
availed themselves of it. Indeed I find they come to forth looking
for something that's honestly unique and challenging. To discover
Moore still resolute after all these years provides their role model.

ciforth offers LOCAL as a loadable option.
They are implemented without regard for efficiency.
Also they cannot be used recursively.

Groetjes Albert
--
Don't praise the day before the evening. One swallow doesn't make spring.
You must not say "hey" before you have crossed the bridge. Don't sell the
hide of the bear until you shot it. Better one bird in the hand than ten in
the air. First gain is a cat purring. - the Wise from Antrim -

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Apr 13 17:27:15 2024

From Newsgroup: comp.lang.forth

minforth@gmx.net (minforth) writes:

mhx wrote:

The local version loses track of constants.

Interesting aspect. Within the test words all 3dup variants had been
inlined, I assume.
However, inlining semantically identical words with locals is
"not commutative", to grossly abuse the algebraic expression. Hmm...

I think mathematicians have a word for what you mean, but it's not
"commutative".

Anyway, it's not specific to locals. Everything that loses
information will affect everything that comes afterwards. E.g., in
Gforth we have a literal stack that only represents literals on the
data stack. So anything that moves values from the data stack (e.g.,
to the return stack) will lose the information about the constant.

Gforth does not have automatic inlining, so I use manual inlining here
(I also added some constant folding optimizations that are not built
in):

: 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;
: 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;

: compile,-3dup.1 ( xt -- ) drop ]] >r 2dup r@ -rot r> [[ ;
' compile,-3dup.1 optimizes 3dup.1

: compile,-3dup.2 ( xt -- ) drop ]] 2 pick 2 pick 2 pick [[ ;
' compile,-3dup.2 optimizes 3dup.2

: fold3-4 ( xt -- ) 3 ['] 3lits> ['] >3lits ['] >4lits fold-constants ;
' fold3-4 optimizes third

: fold2-4 ( xt -- ) 2 ['] 2lits> ['] >2lits ['] >4lits fold-constants ;
' fold2-4 optimizes 2dup

: foo12 1 2 3 3dup.1 3dup.2 ;
: foo21 1 2 3 3dup.2 3dup.1 ;

Looking at the resulting code for FOO12 and FOO21, we see:

see foo12
: foo12 #1 #2 #3
>r 2dup i -rot r> third third third ; ok
see foo21
: foo21 #1 #2 #3 #1 #2 #3
>r 2dup i -rot r> ; ok

So 3DUP.1 loses information by involving the return stack, while
3DUP.2 only uses the data stack, and manages to work on the constants
when it is COMPILE,d (but I had to add an optimization rule for THIRD
to achieve that; doing the same with 2DUP did not help 3DUP.1,
though).
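
To separate the run-time behaviour from the compile-time question, here
is a minimal sketch in plain Forth (no OPTIMIZES rules) that checks that
both definitions leave the same six cells on the stack. CHECK123 is a
hypothetical helper made up for this illustration; -ROT and PICK are
assumed to be available, as they are in Gforth.

```forth
\ Same definitions as in the post, without the Gforth-specific rules.
: 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;
: 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;

\ Hypothetical helper: true if the six cells on the stack are 1 2 3 1 2 3.
: check123 ( n1 n2 n3 n4 n5 n6 -- flag )
  3 =            \ top cell
  swap 2 = and
  swap 1 = and
  swap 3 = and
  swap 2 = and
  swap 1 = and ;

1 2 3 3dup.1 check123 .   \ prints -1 (true)
1 2 3 3dup.2 check123 .   \ prints -1 (true)
```

Both variants compute the same result; the difference visible in the
SEE output is only in what the compiler can still track as a constant.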

Having a literal-r-stack and a way to track literals through locals
would make all 3DUP variants work the same way when all parameters are
literals.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023

From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Sun Apr 14 13:08:25 2024

From Newsgroup: comp.lang.forth

In article <2024Apr13.192715@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

minforth@gmx.net (minforth) writes:

mhx wrote:

The local version loses track of constants.

Interesting aspect. Within the test words all 3dup variants had been
inlined, I assume.
However, inlining semantically identical words with locals is
"not commutative", to grossly abuse the algebraic expression. Hmm...

I think mathematicians have a word for what you mean, but it's not
"commutative".

Anyway, it's not specific to locals. Everything that loses
information will affect everything that comes afterwards. E.g., in
Gforth we have a literal stack that only represents literals on the
data stack. So anything that moves values from the data stack (e.g.,
to the return stack) will lose the information about the constant.

To cite Andrew Tanenbaum:
"global optimisation and symbolic debugging are each other's
arch enemies"

- anton

--
Don't praise the day before the evening. One swallow doesn't make spring.
You must not say "hey" before you have crossed the bridge. Don't sell the
hide of the bear until you shot it. Better one bird in the hand than ten in
the air. First gain is a cat purring. - the Wise from Antrim -

From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Mon Apr 15 12:34:41 2024

From Newsgroup: comp.lang.forth

dxf <dxforth@gmail.com> writes:

To use forth's stack until such time as it becomes too much for one
suggests a half-heartedness. It begs the question why use forth at all
if one is merely going to toy with it.

If Forth's greatness comes from its extensibility, why resist using that
extensibility to have locals? Forth is a philosophy that has many
attractions, but its stack is an implementation artifact.

From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Mon Apr 15 12:41:51 2024

From Newsgroup: comp.lang.forth

dxf <dxforth@gmail.com> writes:

Moore and Fox discussed that. Making a non-optimal approach more
efficient isn't the answer.

Moore and Fox made CPUs that saved a lot of hardware by not having
registers. If you're on a register machine though, why not use it?
Doing the opposite is far from optimal.

Locals may not have been practical on Moore's stack machines, but
neither was ROT, it seems to me. So no locals and not that much
stackrobatics. He instead used VARIABLEs all over the place, which
burnt extra storage since they often weren't all alive at the same time,
plus gave rise to naming troubles and loss of reentrancy, both sort of
odd for a stack language.
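
The reentrancy point can be illustrated with a small sketch (not code
from Moore or from this thread): a recursive sum that keeps its argument
in a VARIABLE, next to one that keeps it in a local. The names N-VAR,
SUM-VAR and SUM-LOC are made up for this example, and the {: ... :}
locals syntax of Forth-2012 is assumed to be available (Gforth has it).

```forth
variable n-var   \ hypothetical global that stands in for a local

\ Intended result: n + (n-1) + ... + 1, but the state is shared.
: sum-var ( n -- sum )
  n-var !
  n-var @ 0= if 0 exit then
  n-var @ 1- recurse     \ the inner call overwrites n-var
  n-var @ + ;            \ so this fetches 0, not the caller's n

\ Same algorithm with a local: every activation has its own n.
: sum-loc {: n -- sum :}
  n 0= if 0 exit then
  n 1- recurse n + ;

3 sum-var .   \ prints 0
3 sum-loc .   \ prints 6
```

The VARIABLE version would need an explicit save and restore around the
recursive call to work, which is the bookkeeping a local does for you.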
