• Another look at Gforth's locals implementation

    From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Tue Apr 9 15:59:58 2024
    From Newsgroup: comp.lang.forth

    30 years ago I wrote a paper about Gforth's locals and also gave some performance results [ertl94l]. Since then the implementation of
    Gforth, and also of locals has changed quite a bit; in particular,
    recently we have improved the code generated for TO <local>.

    So here I look at it again. As a first example, I look at the fib
    benchmark (which has a one-off error, but that makes little difference):

    no locals with a local

    : fib ( n1 -- n2 ) : fib { n -- n2 }
    dup 2 < if n 2 < if
    drop 1 1
    else else
    dup n 1- recurse
    1- recurse n 2 - recurse
    swap 2 - recurse +
    + then ;
    then ;

    For 43 fib the performance results (AMD64 instructions, Skylake
    cycles) are:

    no locals with a local factor
    13_805_561_653 15_012_002_369 1.09
    41_475_949_910 48_490_119_217 1.17

    So those of you who dislike locals can still point to worse
    performance of code with locals on Gforth.

    Another example is string comparison, where I compared three versions:

    \ no locals
    : strcmp1 ( addr1 u1 addr2 u2 -- n )
    rot 2dup 2>r min 0 ?do ( a1 a2 )
    over c@ over c@ - dup if
    nip nip 2rdrop unloop exit then
    drop
    char+ swap char+ swap
    loop
    2drop r> r> - ;

    \ locals with TO
    : strcmp2
    { addr1 u1 addr2 u2 -- n }
    u1 u2 min 0
    ?do
    addr1 c@ addr2 c@ - dup if
    unloop exit then
    drop
    addr1 char+ TO addr1
    addr2 char+ TO addr2
    loop
    u1 u2 - ;

    \ locals without TO
    : strcmp3
    { addr1 u1 addr2 u2 -- n }
    addr1 addr2
    u1 u2 min 0
    ?do { s1 s2 }
    s1 c@ s2 c@ - dup if
    unloop exit then
    drop
    s1 char+ s2 char+
    loop
    2drop
    u1 u2 - ;

    I am too lazy to benchmark this, but here's the number of
    native-instruction bytes in the inner loop:

    129 strcmp1
    128 strcmp2
    149 strcmp3

    Here's the code for body of the loop:

    strcmp1 strcmp2 strcmp3
    over 1->1 @local0 1->1 >l 1->1
    mov $00[r13],r8 mov $00[r13],r8 mov rax,rbp
    sub r13,$08 sub r13,$08 add r13,$08
    mov r8,$10[r13] mov r8,$00[rbp] lea rbp,-$08[rbp]
    c@ 1->1 c@ 1->1 mov -$08[rax],r8
    movzx r8d,bytePTR[r8] movzx r8d,bytePTR[r8] mov r8,$00[r13]
    over 1->2 @local2 1->2 >l @local0 1->1
    mov r15,$08[r13] mov r15,$10[rbp] @local0
    c@ 2->2 c@ 2->2 mov rax,rbp
    movzx r15d,bytePTR[r15] movzx r15d,bytePTR[r15] lea rbp,-$08[rbp]
    - 2->1 - 2->1 mov -$08[rax],r8
    sub r8,r15 sub r8,r15 c@ 1->1
    dup 1->2 dup 1->2 movzx r8d,bytePTR[r8]
    mov r15,r8 mov r15,r8 @local1 1->2
    ?branch 2->1 ?branch 2->1 mov r15,$08[rbp] <strcmp1+$A8> <strcmp2+$C0> c@ 2->2
    add rbx,$68 add rbx,$68 movzx r15d,bytePTR[r15]
    mov rax,[rbx] mov rax,[rbx] - 2->1
    test r15,r15 test r15,r15 sub r8,r15
    jnz $7FED3DB2C317 jnz $7FED3DB2C0DC dup 1->2
    jmp eax jmp eax mov r15,r8
    nip 1->1 unloop 1->1 ?branch 2->1
    add r13,$08 add r14,$10 <strcmp3+$E0>
    nip 1->1 lit 1->2 add rbx,$78
    add r13,$08 #32 mov rax,[rbx]
    2rdrop 1->1 sub rbx,$10 test r15,r15
    add r14,$10 mov r15,-$08[rbx] jnz $7FED3DB2C208 unloop 1->1 lp+! 2->1 jmp eax
    add r14,$10 add rbp,r15 unloop 1->1
    ;s 1->1 ;s 1->1 add r14,$10
    mov rbx,[r14] mov rbx,[r14] lit 1->2
    add r14,$08 add r14,$08 #48
    mov rax,[rbx] mov rax,[rbx] sub rbx,$10
    jmp eax jmp eax mov r15,-$08[rbx] drop 1->1 drop 1->1 lp+! 2->1
    mov r8,$08[r13] mov r8,$08[r13] add rbp,r15
    add r13,$08 add r13,$08 ;s 1->1
    char+ 1->1 @local0 1->2 mov rbx,[r14]
    add r8,$01 mov r15,$00[rbp] add r14,$08
    swap 1->2 char+ 2->2 mov rax,[rbx]
    mov r15,$08[r13] add r15,$01 jmp eax
    add r13,$08 !local0 2->1 drop 1->0
    char+ 2->2 mov $00[rbp],r15 @local0 0->1
    add r15,$01 @local2 1->2 mov r8,$00[rbp]
    swap 2->1 mov r15,$10[rbp] char+ 1->1
    mov $00[r13],r15 char+ 2->2 add r8,$01
    sub r13,$08 add r15,$01 @local1 1->1
    (loop) 1->1 !local2 2->1 mov $00[r13],r8 <strcmp1+$40> mov $10[rbp],r15 sub r13,$08
    sub rbx,$68 (loop) 1->1 mov r8,$08[rbp]
    mov rax,[r14] <strcmp2+$58> char+ 1->1
    add rax,$01 sub rbx,$68 add r8,$01
    cmp $08[r14],rax mov rax,[r14] (loop)-lp+!# 1->1
    mov [r14],rax add rax,$01 <strcmp3+$68>
    jz $7FED3DB2C36C cmp $08[r14],rax #16
    mov rax,[rbx] mov [r14],rax add rbx,$40
    jmp eax jz $7FED3DB2C130 mov rax,[r14]
    mov rax,[rbx] mov rsi,-$10[rbx]
    jmp eax add rax,$01
    cmp $08[r14],rax
    jz $7FED3DB2C25F
    add rbp,-$08[rbx]
    mov [r14],rax
    mov rbx,rsi
    mov rax,[rsi]
    jmp eax

    @InProceedings{ertl94l,
    author = "M. Anton Ertl",
    title = "Automatic Scoping of Local Variables",
    booktitle = "EuroForth~'94 Conference Proceedings",
    year = "1994",
    address = "Winchester, UK",
    pages = "31--37",
    url = "http://www.complang.tuwien.ac.at/papers/ertl94l.ps.gz",
    abstract = "In the process of lifting the restrictions on using
    locals in Forth, an interesting problem poses
    itself: What does it mean if a local is defined in a
    control structure? Where is the local visible? Since
    the user can create every possible control structure
    in ANS Forth, the answer is not as simple as it may
    seem. Ideally, the local is visible at a place if
    the control flow {\em must} pass through the
    definition of the local to reach this place. This
    paper discusses locals in general, the visibility
    problem, its solution, the consequences and the
    implementation as well as related programming style
    questions."
    }

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Tue Apr 9 19:43:55 2024
    From Newsgroup: comp.lang.forth

    This is consistent with my observations. There is typically
    a speed difference of around 10 per cent between code variants
    with stack juggling and with locals. The difference is irrelevant
    for the vast majority of tasks.

    The gain in readability, on the other hand, is often enormous.
    But that's another old, worn-out discussion.

    With locals in CPU registers, I expect an even smaller speed
    difference. However, I have not yet implemented and tested this.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Tue Apr 9 20:54:47 2024
    From Newsgroup: comp.lang.forth

    minforth wrote:

    This is consistent with my observations. There is typically
    a speed difference of around 10 per cent between code variants
    with stack juggling and with locals. The difference is irrelevant
    for the vast majority of tasks.

    The gain in readability, on the other hand, is often enormous.
    But that's another old, worn-out discussion.

    With locals in CPU registers, I expect an even smaller speed
    difference. However, I have not yet implemented and tested this.

    Even without putting them in registers it is faster (Ryzen 7 5800X):

    : bench
    cr timer-reset #43 fib .elapsed ." ( " . ." )"
    cr timer-reset #43 fib2 .elapsed ." ( " . ." )" ;

    FORTH> bench
    1.658 seconds elapsed. ( 701408733 )
    1.564 seconds elapsed. ( 701408733 ) ok

    Here is fib2:
    : fib2 params| n |
    n 2 < if 1
    else n 1- recurse
    n 2- recurse
    +
    endif ;

    FORTH> ' fib2 idis
    $014588C0 : fib2
    $014588CA pop rbx
    $014588CB cmp rbx, 2 b#
    $014588CF lea rsi, [rsi #-16 +] qword
    $014588D3 mov [rsi] qword, rbx
    $014588D6 mov rbx, rcx
    $014588D9 jge $014588EB offset NEAR
    $014588DF mov rbx, 1 d#
    $014588E6 jmp $01458921 offset NEAR
    $014588EB mov rbx, [rsi] qword
    $014588EE lea rbx, [rbx -1 +] qword
    $014588F2 lea rbp, [rbp -8 +] qword
    $014588F6 mov [rbp 0 +] qword, $01458903 d#
    $014588FE jmp $014588CB offset NEAR
    $01458903 mov rbx, [rsi] qword
    $01458906 lea rbx, [rbx -2 +] qword
    $0145890A lea rbp, [rbp -8 +] qword
    $0145890E mov [rbp 0 +] qword, $0145891B d#
    $01458916 jmp $014588CB offset NEAR
    $0145891B pop rbx
    $0145891C pop rdi
    $0145891D lea rbx, [rdi rbx*1] qword
    $01458921 push rbx
    $01458922 lea rsi, [rsi #16 +] qword
    $01458926 ;

    -marcel
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From dxf@dxforth@gmail.com to comp.lang.forth on Wed Apr 10 12:34:28 2024
    From Newsgroup: comp.lang.forth

    On 10/04/2024 5:43 am, minforth wrote:
    This is consistent with my observations. There is typically
    a speed difference of around 10 per cent between code variants
    with stack juggling and with locals. The difference is irrelevant
    for the vast majority of tasks.

    The gain in readability, on the other hand, is often enormous.
    But that's another old, worn-out discussion.

    'Readability' 'stack juggling'. No discussion there - just appeals
    to prejudice.

    With locals in CPU registers, I expect an even smaller speed
    difference. However, I have not yet implemented and tested this.

    Moore and Fox discussed that. Making a non-optimal approach more
    efficient isn't the answer.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From dxf@dxforth@gmail.com to comp.lang.forth on Wed Apr 10 13:00:48 2024
    From Newsgroup: comp.lang.forth

    On 10/04/2024 6:54 am, mhx wrote:
    minforth wrote:

    This is consistent with my observations. There is typically
    a speed difference of around 10 per cent between code variants
    with stack juggling and with locals. The difference is irrelevant
    for the vast majority of tasks.

    The gain in readability, on the other hand, is often enormous.
    But that's another old, worn-out discussion.

    With locals in CPU registers, I expect an even smaller speed
    difference. However, I have not yet implemented and tested this.

    Even without putting them in registers it is faster (Ryzen 7 5800X):

    : bench
      cr timer-reset #43 fib  .elapsed ."  ( " . ." )"
      cr timer-reset #43 fib2 .elapsed ."  ( " . ." )" ;

    FORTH> bench
    1.658 seconds elapsed. ( 701408733 )
    1.564 seconds elapsed. ( 701408733 ) ok

    Here is fib2:
    : fib2 params| n |        n 2 < if 1           else n 1- recurse                n 2- recurse                +          endif ;

    FORTH> ' fib2 idis
    $014588C0  : fib2
    $014588CA  pop           rbx
    $014588CB  cmp           rbx, 2 b#
    $014588CF  lea           rsi, [rsi #-16 +] qword
    $014588D3  mov           [rsi] qword, rbx
    $014588D6  mov           rbx, rcx
    $014588D9  jge           $014588EB offset NEAR
    $014588DF  mov           rbx, 1 d#
    $014588E6  jmp           $01458921 offset NEAR
    $014588EB  mov           rbx, [rsi] qword
    $014588EE  lea           rbx, [rbx -1 +] qword
    $014588F2  lea           rbp, [rbp -8 +] qword
    $014588F6  mov           [rbp 0 +] qword, $01458903 d#
    $014588FE  jmp           $014588CB offset NEAR
    $01458903  mov           rbx, [rsi] qword
    $01458906  lea           rbx, [rbx -2 +] qword
    $0145890A  lea           rbp, [rbp -8 +] qword
    $0145890E  mov           [rbp 0 +] qword, $0145891B d#
    $01458916  jmp           $014588CB offset NEAR
    $0145891B  pop           rbx
    $0145891C  pop           rdi
    $0145891D  lea           rbx, [rdi rbx*1] qword
    $01458921  push          rbx
    $01458922  lea           rsi, [rsi #16 +] qword
    $01458926  ;

    VFX doesn't put much effort into optimizing locals code. Should they have?
    Or would that have encouraged users to write code that lacks thought and
    leave it to the compiler to make efficient? Which is the C approach.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Wed Apr 10 05:25:48 2024
    From Newsgroup: comp.lang.forth

    dxf wrote:

    On 10/04/2024 6:54 am, mhx wrote:
    minforth wrote:
    [..]
    VFX doesn't put much effort into optimizing locals code. Should they have? Or would that have encouraged users to write code that lacks thought and leave it to the compiler to make efficient? Which is the C approach.

    I wouldn't know what those authors think, believe, or try to enforce.

    There has been no effort spent in optimizing iForth's locals. The shown
    result is simply a byproduct of the architecture of the compiler.

    It is not my task to guide the users, they can do whatever they ask.
    I believe my actions are consistent with that.

    Locals are completely OK if you want to get stuff done. I use them for
    the boring parts.

    -marcel
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From dxf@dxforth@gmail.com to comp.lang.forth on Wed Apr 10 15:45:06 2024
    From Newsgroup: comp.lang.forth

    On 10/04/2024 3:25 pm, mhx wrote:
    dxf wrote:

    On 10/04/2024 6:54 am, mhx wrote:
    minforth wrote:
    [..]
    VFX doesn't put much effort into optimizing locals code.  Should they have? >> Or would that have encouraged users to write code that lacks thought and
    leave it to the compiler to make efficient?  Which is the C approach.

    I wouldn't know what those authors think, believe, or try to enforce.

    There has been no effort spent in optimizing iForth's locals. The shown result is simply a byproduct of the architecture of the compiler.
    ...

    Do you have a disassembly for Fib1? I ask because NT/Forth which purports
    to do exactly that produces notably different results - despite there being little 'stack juggling' to resolve.

    NT/FORTH (C) 2005 Peter Fälth Version 1.6-983-824 Compiled on 2017-12-03

    see fib1
    A49E58 40917C 58 C80000 5 normal FIB1

    40917C 83FB02 cmp ebx , # 2h
    40917F 0F8D0A000000 jge "0040918F"
    409185 BB01000000 mov ebx , # 1h
    40918A E926000000 jmp "004091B5"
    40918F 8BC3 mov eax , ebx
    409191 48 dec eax
    409192 895DFC mov [ebp-4h] , ebx
    409195 8BD8 mov ebx , eax
    409197 8D6DFC lea ebp , [ebp-4h]
    40919A E8DDFFFFFF call FIB1
    40919F 8B4500 mov eax , [ebp]
    4091A2 83E802 sub eax , # 2h
    4091A5 895D00 mov [ebp] , ebx
    4091A8 8BD8 mov ebx , eax
    4091AA E8CDFFFFFF call FIB1
    4091AF 035D00 add ebx , [ebp]
    4091B2 8D6D04 lea ebp , [ebp+4h]
    4091B5 C3 ret near

    see fib2
    A49E70 4091B6 86 C80000 5 normal FIB2

    4091B6 83FB02 cmp ebx , # 2h
    4091B9 895C24FC mov [esp-4h] , ebx
    4091BD 8B5D00 mov ebx , [ebp]
    4091C0 8D6D04 lea ebp , [ebp+4h]
    4091C3 8D6424FC lea esp , [esp-4h]
    4091C7 0F8D10000000 jge "004091DD"
    4091CD 895DFC mov [ebp-4h] , ebx
    4091D0 BB01000000 mov ebx , # 1h
    4091D5 8D6DFC lea ebp , [ebp-4h]
    4091D8 E92A000000 jmp "00409207"
    4091DD 8B0424 mov eax , [esp]
    4091E0 48 dec eax
    4091E1 895DFC mov [ebp-4h] , ebx
    4091E4 8BD8 mov ebx , eax
    4091E6 8D6DFC lea ebp , [ebp-4h]
    4091E9 E8C8FFFFFF call FIB2
    4091EE 8B0424 mov eax , [esp]
    4091F1 83E802 sub eax , # 2h
    4091F4 895DFC mov [ebp-4h] , ebx
    4091F7 8BD8 mov ebx , eax
    4091F9 8D6DFC lea ebp , [ebp-4h]
    4091FC E8B5FFFFFF call FIB2
    409201 035D00 add ebx , [ebp]
    409204 8D6D04 lea ebp , [ebp+4h]
    409207 8D642404 lea esp , [esp+4h]
    40920B C3 ret near

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Wed Apr 10 06:58:17 2024
    From Newsgroup: comp.lang.forth

    dxf wrote:

    On 10/04/2024 3:25 pm, mhx wrote:
    dxf wrote:

    Do you have a disassembly for Fib1? I ask because NT/Forth which purports
    to do exactly that produces notably different results - despite there being little 'stack juggling' to resolve.

    Here it is:

    FORTH> : fib ( n1 -- n2 ) dup 2 < if drop 1 else dup 1- recurse swap 2 - recurse + then ; ok
    FORTH> 10 fib . 89 ok
    FORTH> see fib
    Flags: ANSI
    $01340A40 : fib
    $01340A4A pop rbx
    $01340A4B cmp rbx, 2 b#
    $01340A4F jge $01340A61 offset NEAR
    $01340A55 mov rbx, 1 d#
    $01340A5C jmp $01340A95 offset NEAR
    $01340A61 push rbx
    $01340A62 lea rbx, [rbx -1 +] qword
    $01340A66 lea rbp, [rbp -8 +] qword
    $01340A6A mov [rbp 0 +] qword, $01340A77 d#
    $01340A72 jmp $01340A4B offset NEAR
    $01340A77 pop rbx
    $01340A78 pop rdi
    $01340A79 push rbx
    $01340A7A lea rbx, [rdi -2 +] qword
    $01340A7E lea rbp, [rbp -8 +] qword
    $01340A82 mov [rbp 0 +] qword, $01340A8F d#
    $01340A8A jmp $01340A4B offset NEAR
    $01340A8F pop rbx
    $01340A90 pop rdi
    $01340A91 lea rbx, [rdi rbx*1] qword
    $01340A95 push rbx
    $01340A96 ;

    -marcel
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Wed Apr 10 07:34:38 2024
    From Newsgroup: comp.lang.forth

    mhx wrote:
    I wouldn't know what those authors think, believe, or try to enforce.

    There has been no effort spent in optimizing iForth's locals. The shown result is simply a byproduct of the architecture of the compiler.

    It is not my task to guide the users, they can do whatever they ask.
    I believe my actions are consistent with that.

    Locals are completely OK if you want to get stuff done. I use them for
    the boring parts.

    I think the attitude of "programming to make the compiler happy i.e.
    by stack juggling" is a waste of human resources. I can't remember a
    single case where a performance bottleneck was fixed by switching
    from a local to a non-local code formulation. The difference in speed,
    if any, is simply too small.

    In such a case, you first rethink the task and the algorithm and,
    if necessary, write a few lines in assembler (or C) after profiling.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Apr 10 07:00:38 2024
    From Newsgroup: comp.lang.forth

    minforth@gmx.net (minforth) writes:
    With locals in CPU registers, I expect an even smaller speed
    difference. However, I have not yet implemented and tested this.

    Stack items and locals are just different ways of expressing the same
    data flow. Ideally a Forth system maps both expressions to the same
    optimal way of expressing the data flow in machine code. There is one
    example where this actually happens: Consider:

    : 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;
    : 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
    : 3dup.3 {: a b c :} a b c a b c ;
    : 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;

    These four ways of expressing 3DUP are all commpiled to exactly the
    same code by lxf/ntf:

    804FC0A 8B4500 mov eax , [ebp]
    804FC0D 8945F4 mov [ebp-Ch] , eax
    804FC10 8B4504 mov eax , [ebp+4h]
    804FC13 8945F8 mov [ebp-8h] , eax
    804FC16 895DFC mov [ebp-4h] , ebx
    804FC19 8D6DF4 lea ebp , [ebp-Ch]
    804FC1C C3 ret near

    However, it's more difficult when control flow is involved, and most native-code compilers register-allocate only in straight-line code.
    For comparison, here's what gforth-fast currently gives you:

    3dup.1 3dup.2 3dup.3 3dup.4
    r 1->0 third 1->2 >l >l 1->1 dup 1->1
    mov -$08[r14],r8 mov r15,$10[r13] >l mov $00[r13],r8
    sub r14,$08 third 2->3 mov -$08[rbp],r8 sub r13,$08
    2dup 0->2 mov r9,$08[r13] mov rdx,$08[r13] 2over 1->3
    mov r8,$10[r13] third 3->1 mov rax,rbp mov r15,$18[r13]
    mov r15,$08[r13] mov $00[r13],r8 add r13,$10 mov r9,$10[r13]
    i 2->3 sub r13,$18 lea rbp,-$10[rbp] rot 3->1
    mov r9,[r14] mov $10[r13],r15 mov -$10[rax],rdx mov $00[r13],r15
    -rot 3->2 mov $08[r13],r9 mov r8,$00[r13] sub r13,$10
    mov $00[r13],r9 ;s 1->1 >l @local0 1->1 mov $08[r13],r9
    sub r13,$08 mov rbx,[r14] @local0 ;s 1->1
    2->1 add r14,$08 mov rax,rbp mov rbx,[r14]
    mov -$08[r13],r15 mov rax,[rbx] lea rbp,-$08[rbp] add r14,$08
    sub r13,$10 jmp eax mov -$08[rax],r8 mov rax,[rbx]
    mov $10[r13],r8 @local1 1->2 jmp eax
    mov r8,[r14] mov r15,$08[rbp]
    add r14,$08 @local2 2->1
    ;s 1->1 mov -$08[r13],r15
    mov rbx,[r14] sub r13,$10
    add r14,$08 mov $10[r13],r8
    mov rax,[rbx] mov r8,$10[rbp]
    jmp eax @local0 1->2
    mov r15,$00[rbp]
    @local1 2->3
    mov r9,$08[rbp]
    @local2 3->1
    mov -$10[r13],r9
    sub r13,$18
    mov $10[r13],r15
    mov $18[r13],r8
    mov r8,$10[rbp]
    lit 1->2
    #24
    mov r15,$50[rbx]
    lp+! 2->1
    add rbp,r15
    ;s 1->1
    mov rbx,[r14]
    add r14,$08
    mov rax,[rbx]
    jmp eax

    Interestingly, 3DUP.2 is essentially the same as the lxf/ntf variant:
    load two items from the memory parts of the stack, store three items
    to the memory part of the stack, and one stack-pointer update. The
    differences are:

    64-bit vs. 32-bit cells
    different way of returning to the caller (;S code vs. ret)
    sub vs. lea for the stack-pointer update
    different order of loads and stores
    different register allocation

    But this is just a happy coincidence, and the other versions make it
    clear that gforth-fast does not analyse the data flow even in
    straight-line code.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From dxf@dxforth@gmail.com to comp.lang.forth on Wed Apr 10 18:32:09 2024
    From Newsgroup: comp.lang.forth

    On 10/04/2024 5:34 pm, minforth wrote:
    mhx wrote:

    Locals are completely OK if you want to get stuff done. I use them for
    the boring parts.
    ...
    I think the attitude of "programming to make the compiler happy i.e.
    by stack juggling" is a waste of human resources. I can't remember a
    single case where a performance bottleneck was fixed by switching
    from a local to a non-local code formulation. The difference in speed,
    if any, is simply too small.

    To use forth's stack until such time as it becomes too much for one
    suggests a half-heartedness. It begs the question why use forth at all
    if one is merely going to toy with it.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Wed Apr 10 10:38:43 2024
    From Newsgroup: comp.lang.forth

    dxf wrote:

    On 10/04/2024 5:34 pm, minforth wrote:
    mhx wrote:

    Locals are completely OK if you want to get stuff done. I use them for
    the boring parts.
    ...
    I think the attitude of "programming to make the compiler happy i.e.
    by stack juggling" is a waste of human resources. I can't remember a
    single case where a performance bottleneck was fixed by switching
    from a local to a non-local code formulation. The difference in speed,
    if any, is simply too small.

    To use forth's stack until such time as it becomes too much for one
    suggests a half-heartedness. It begs the question why use forth at all
    if one is merely going to toy with it.

    Yes, of course it's entirely up to you if you don't fully utilise the
    available possibilities of your tools. But I also suspect that your Forth system has no locals at all.

    To give you another example: Dr Noble's formula translator. Certainly not
    a toy of an incapable programmer, as you imply.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Wed Apr 10 16:35:04 2024
    From Newsgroup: comp.lang.forth

    Anton Ertl wrote:

    However, it's more difficult when control flow is involved, and most native-code compilers register-allocate only in straight-line code.
    For comparison, here's what gforth-fast currently gives you:
    [..]
    Interestingly, 3DUP.2 is essentially the same as the lxf/ntf variant:
    load two items from the memory parts of the stack, store three items
    to the memory part of the stack, and one stack-pointer update. The differences are:
    ..
    Nice example. Here is what iForth does:

    FORTH> : 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ; ok
    FORTH> : 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ; ok
    FORTH> : 3dup.3 params| a b c | a b c a b c ; ok
    FORTH> : 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ; ok
    FORTH> : test.1 3dup.1 .S ; ok
    FORTH> : test.2 3dup.2 .S ; ok
    FORTH> : test.3 3dup.3 .S ; ok
    FORTH> : test.4 3dup.4 .S ; ok
    FORTH> see test.1
    Flags: ANSI
    $01340F80 : test.1
    $01340F8A pop rbx
    $01340F8B pop rdi
    $01340F8C mov rax, [rsp] qword
    $01340F90 push rdi
    $01340F91 push rbx
    $01340F92 push rax
    $01340F93 push rdi
    $01340F94 push rbx
    $01340F95 jmp .S+10 ( $012BEA0A ) offset NEAR
    FORTH> see test.2
    Flags: ANSI
    $01340FC0 : test.2
    $01340FCA mov rbx, [rsp #16 +] qword
    $01340FCF mov rcx, rbx
    $01340FD2 mov rbx, [rsp 8 +] qword
    $01340FD7 push rcx
    $01340FD8 mov rcx, rbx
    $01340FDB mov rbx, [rsp 8 +] qword
    $01340FE0 push rcx
    $01340FE1 push rbx
    $01340FE2 jmp .S+10 ( $012BEA0A ) offset NEAR
    FORTH> see test.3
    Flags: ANSI
    $01341000 : test.3
    $0134100A pop rbx
    $0134100B pop rdi
    $0134100C mov rax, [rsp] qword
    $01341010 push rdi
    $01341011 push rbx
    $01341012 push rax
    $01341013 push rdi
    $01341014 push rbx
    $01341015 jmp .S+10 ( $012BEA0A ) offset NEAR
    FORTH> see test.4
    Flags: ANSI
    $01341040 : test.4
    $0134104A pop rbx
    $0134104B pop rdi
    $0134104C mov rax, [rsp] qword
    $01341050 push rdi
    $01341051 push rbx
    $01341052 push rax
    $01341053 push rdi
    $01341054 push rbx
    $01341055 jmp .S+10 ( $012BEA0A ) offset NEAR

    FORTH> ( and finally all together ... )
    FORTH> : test.5 1 2 3 3dup.1 3dup.2 3dup.3 3dup.4 .S ;

    FORTH> see test.5
    Flags: ANSI
    $01341080 : test.4
    $0134108A push 1 b#
    $0134108C push 2 b#
    $0134108E push 3 b#
    $01341090 push 1 b#
    $01341092 push 2 b#
    $01341094 push 3 b#
    $01341096 mov rbx, [rsp #16 +] qword
    $0134109B mov rcx, rbx
    $0134109E mov rbx, [rsp 8 +] qword
    $013410A3 push rcx
    $013410A4 mov rcx, rbx
    $013410A7 mov rbx, [rsp 8 +] qword
    $013410AC mov rdi, [rsp] qword
    $013410B0 push rcx
    $013410B1 push rbx
    $013410B2 push rdi
    $013410B3 push rcx
    $013410B4 push rbx
    $013410B5 push rdi
    $013410B6 push rcx
    $013410B7 push rbx
    $013410B8 jmp .S+10 ( $012BEA0A ) offset NEAR

    -marcel
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Wed Apr 10 16:50:36 2024
    From Newsgroup: comp.lang.forth

    The local version loses track of constants. If we reorder test.5 :

    FORTH> : test.6 1 2 3 3dup.1 3dup.3 3dup.4 3dup.2 .S ; ok
    FORTH> see test.6
    Flags: ANSI
    $01341100 : test.6
    $0134110A push 1 b#
    $0134110C push 2 b#
    $0134110E push 3 b#
    $01341110 push 1 b#
    $01341112 push 2 b#
    $01341114 push 3 b#
    $01341116 push 1 b#
    $01341118 push 2 b#
    $0134111A push 3 b#
    $0134111C push 1 b#
    $0134111E push 2 b#
    $01341120 push 3 b#
    $01341122 mov rbx, [rsp #16 +] qword
    $01341127 mov rcx, rbx
    $0134112A mov rbx, [rsp 8 +] qword
    $0134112F push rcx
    $01341130 mov rcx, rbx
    $01341133 mov rbx, [rsp 8 +] qword
    $01341138 push rcx
    $01341139 push rbx
    $0134113A jmp .S+10 ( $012BEA0A ) offset NEAR
    $0134113F ;

    Maybe in vsn 7.0 :--)

    -marcel
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Apr 10 17:06:45 2024
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    [deleted the versions not of interest in this posting
    : 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
    : 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;
    ...
    For comparison, here's what gforth-fast currently gives you:

    3dup.2 3dup.4
    third 1->2 dup 1->1
    mov r15,$10[r13] mov $00[r13],r8
    third 2->3 sub r13,$08
    mov r9,$08[r13] 2over 1->3
    third 3->1 mov r15,$18[r13]
    mov $00[r13],r8 mov r9,$10[r13]
    sub r13,$18 rot 3->1
    mov $10[r13],r15 mov $00[r13],r15
    mov $08[r13],r9 sub r13,$10
    ;s 1->1 mov $08[r13],r9
    mov rbx,[r14] ;s 1->1
    add r14,$08 mov rbx,[r14]
    mov rax,[rbx] add r14,$08
    jmp eax mov rax,[rbx]
    jmp eax

    Interestingly, 3DUP.2 is essentially the same as the lxf/ntf variant:
    load two items from the memory parts of the stack, store three items
    to the memory part of the stack, and one stack-pointer update. The >differences are:

    64-bit vs. 32-bit cells
    different way of returning to the caller (;S code vs. ret)
    sub vs. lea for the stack-pointer update
    different order of loads and stores
    different register allocation

    But this is just a happy coincidence, and the other versions make it
    clear that gforth-fast does not analyse the data flow even in
    straight-line code.

    3DUP.4 also performs these two loads and three stores (again in a
    different order), but it distributes the stack pointer update across
    two instructions.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Wed Apr 10 20:59:49 2024
    From Newsgroup: comp.lang.forth

    mhx wrote:

    The local version loses track of constants.

    Interesting aspect. Within the test words all 3dup variants had been
    inlined, I assume.
    However inlining semanthically identic words with locals is
    "not commutative", to grossly abuse the algebraic expression. Hmm...
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From dxf@dxforth@gmail.com to comp.lang.forth on Thu Apr 11 12:09:25 2024
    From Newsgroup: comp.lang.forth

    On 10/04/2024 8:38 pm, minforth wrote:
    dxf wrote:

    On 10/04/2024 5:34 pm, minforth wrote:
    mhx wrote:

    Locals are completely OK if you want to get stuff done. I use them for >>>> the boring parts.
    ...
    I think the attitude of "programming to make the compiler happy i.e.
    by stack juggling" is a waste of human resources. I can't remember a
    single case where a performance bottleneck was fixed by switching
    from a local to a non-local code formulation. The difference in speed,
    if any, is simply too small.

    To use forth's stack until such time as it becomes too much for one
    suggests a half-heartedness.  It begs the question why use forth at all
    if one is merely going to toy with it.

    Yes, of course it's entirely up to you if you don't fully utilise the available possibilities of your tools.

    In calling locals 'a tool' or 'used for the boring stuff' I'd have the
    problem of explaining to colleagues how it is I can use a language in a
    way that its creator has dismissed as antithetical.

    But I also suspect that your Forth
    system has no locals at all.

    DX-Forth offers locals as a loadable option. They're implemented
    efficiently so as to be credible. I'm not aware of any user that has
    availed themselves of it. Indeed I find they come to forth looking
    for something that's honestly unique and challenging. To discover
    Moore still resolute after all these years provides their role model.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From dxf@dxforth@gmail.com to comp.lang.forth on Thu Apr 11 14:45:36 2024
    From Newsgroup: comp.lang.forth

    On 10/04/2024 8:38 pm, minforth wrote:
    ...
    To give you another example: Dr Noble's formula translator. Certainly not
    a toy of an incapable programmer, as you imply.

    While Mr. Noble used locals in his formula translator, they were in fact superfluous. Here's a modified version made some years ago that exchanges locals with VALUEs:

    https://pastebin.com/TnHKGNEk

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Thu Apr 11 11:49:02 2024
    From Newsgroup: comp.lang.forth

    In article <66174655$1@news.ausics.net>, dxf <dxforth@gmail.com> wrote:
    On 10/04/2024 8:38 pm, minforth wrote:
    dxf wrote:

    On 10/04/2024 5:34 pm, minforth wrote:
    mhx wrote:

    Locals are completely OK if you want to get stuff done. I use them for >>>>> the boring parts.
    ...
    I think the attitude of "programming to make the compiler happy i.e.
    by stack juggling" is a waste of human resources. I can't remember a
    single case where a performance bottleneck was fixed by switching
    from a local to a non-local code formulation. The difference in speed, >>>> if any, is simply too small.

    To use forth's stack until such time as it becomes too much for one
    suggests a half-heartedness.  It begs the question why use forth at all >>> if one is merely going to toy with it.

    Yes, of course it's entirely up to you if you don't fully utilise the
    available possibilities of your tools.

    In calling locals 'a tool' or 'used for the boring stuff' I'd have the >problem of explaining to colleagues how it is I can use a language in a
    way that its creator has dismissed as antithetical.

    But I also suspect that your Forth
    system has no locals at all.

    DX-Forth offers locals as a loadable option. They're implemented
    efficiently so as to be credible. I'm not aware of any user that has
    availed themselves of it. Indeed I find they come to forth looking
    for something that's honestly unique and challenging. To discover
    Moore still resolute after all these years provides their role model.


    ciforth offers LOCAL as a loadable option.
    They are implemented without regard of efficiency.
    Also they cannot be used recursively.

    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat purring. - the Wise from Antrim -
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Apr 13 17:27:15 2024
    From Newsgroup: comp.lang.forth

    minforth@gmx.net (minforth) writes:
    mhx wrote:

    The local version loses track of constants.

    Interesting aspect. Within the test words all 3dup variants had been
    inlined, I assume.
    However inlining semanthically identic words with locals is
    "not commutative", to grossly abuse the algebraic expression. Hmm...

    I think mathematicians have a word for what you mean, but it's not "commutative".

    Anyway, it's not specific to locals. Everything that loses
    information will affect everything that comes afterwards. E.g., in
    Gforth we have a literal stack that only represents literals on the
    data stack. So anything that moves values from the data stack (e.g.,
    to the return stack) will lose the information about the constant.

    Gforth does not have automatic inlining, so I use manual inlining here
    (I also added some constant folding optimizations that are not built
    in):

    : 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;
    : 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;

    : compile,-3dup.1 ( xt -- ) drop ]] >r 2dup r@ -rot r> [[ ;
    ' compile,-3dup.1 optimizes 3dup.1

    : compile,-3dup.2 ( xt -- ) drop ]] 2 pick 2 pick 2 pick [[ ;
    ' compile,-3dup.2 optimizes 3dup.2

    : fold3-4 ( xt -- ) 3 ['] 3lits> ['] >3lits ['] >4lits fold-constants ;
    ' fold3-4 optimizes third

    : fold2-4 ( xt -- ) 2 ['] 2lits> ['] >2lits ['] >4lits fold-constants ;
    ' fold2-4 optimizes 2dup

    : foo12 1 2 3 3dup.1 3dup.2 ;
    : foo21 1 2 3 3dup.2 3dup.1 ;

    Looking at the resulting code for FOO12 and FOO21, we see:

    see foo12
    : foo12 #1 #2 #3
    >r 2dup i -rot r> third third third ; ok
    see foo21
    : foo21 #1 #2 #3 #1 #2 #3
    >r 2dup i -rot r> ; ok

    So 3DUP.1 loses information by involving the return stack, while
    3DUP.2 only uses the data stack, and manages to work on the constants
    when it is COMPILE,d (but I had to add an optimization rule for THIRD
    to achieve that; doing the same with 2DUP did not help 3DUP.1,
    though).

    Having a literal-r-stack and a way to track literals through locals
    would make all 3DUP variants work the same way when all parameters are literals.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Sun Apr 14 13:08:25 2024
    From Newsgroup: comp.lang.forth

    In article <2024Apr13.192715@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    minforth@gmx.net (minforth) writes:
    mhx wrote:

    The local version loses track of constants.

    Interesting aspect. Within the test words all 3dup variants had been >>inlined, I assume.
    However inlining semanthically identic words with locals is
    "not commutative", to grossly abuse the algebraic expression. Hmm...

    I think mathematicians have a word for what you mean, but it's not >"commutative".

    Anyway, it's not specific to locals. Everything that loses
    information will affect everything that comes afterwards. E.g., in
    Gforth we have a literal stack that only represents literals on the
    data stack. So anything that moves values from the data stack (e.g.,
    to the return stack) will lose the information about the constant.

    To cite Andrew Tannenbaum:
    "global optimisation and symbolic debugging are each others
    arch enemies"

    - anton
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat purring. - the Wise from Antrim -
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Mon Apr 15 12:34:41 2024
    From Newsgroup: comp.lang.forth

    dxf <dxforth@gmail.com> writes:
    To use forth's stack until such time as it becomes too much for one
    suggests a half-heartedness. It begs the question why use forth at all
    if one is merely going to toy with it.

    If Forth's greatness comes from its extensibility, why resist using that extensibility to have locals? Forth is a philosophy that has many
    attractions, but its stack is an implementation artifact.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Mon Apr 15 12:41:51 2024
    From Newsgroup: comp.lang.forth

    dxf <dxforth@gmail.com> writes:
    Moore and Fox discussed that. Making a non-optimal approach more
    efficient isn't the answer.

    Moore and Fox made cpu's that saved a lot of hardware by not having
    registers. If you're on a register machine though, why not use it?
    Doing the opposite is far from optimal.

    Locals may not have been practical on Moore's stack machines, but
    neither was ROT, it seems to me. So no locals and not that much
    stackrobatics. He instead used VARIABLEs all over the place, which
    burnt extra storage since they often weren't all alive at the same time,
    plus gave rise to naming troubles and loss of reentrancy, both sort of
    odd for a stack language.
    --- Synchronet 3.20a-Linux NewsLink 1.114