• Summary: Forth systems where do/?do pushes the loop start address

    From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Mar 9 18:09:57 2024
    From Newsgroup: comp.lang.forth

    dxf <dxforth@gmail.com> writes:
    On 7/03/2024 12:28 am, Ruvim wrote:
    ...
    In SP-Forth v3 and v4 (they generate native code), "DO" pushes three items on the return stack, and among them the address that "LEAVE" then jumps to. ...

    That is the classic implementation as suggested by Bob Berkey - inventor of the Forth-83 DO LOOP. IIUC Anton is asking about systems that push the loop address - not the exit address.

    Correct.

    So, to summarize the answers:

    kForth keeps the loop-back address on the return stack in addition to
    index and limit.

    A number of systems (at least ciForth, VFX, SP-Forth) have followed
    Bob Berkey's suggestion of keeping the loop-exit address on the return
    stack for LEAVE. That is not what I was asking about, but it's also interesting.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Mar 9 18:26:23 2024
    From Newsgroup: comp.lang.forth

    Krishna Myneni <krishna.myneni@ccreweb.org> writes:
    Yes, kForth uses this method. DO pushes three items onto the return
    stack, the two loop parameters, and the virtual instruction pointer.

    \ From ForthVM.cpp

    int CPP_do ()
    {
      // stack: ( -- | generate opcodes for beginning of loop structure )

      pCurrentOps->push_back(OP_PUSH);
      pCurrentOps->push_back(OP_PUSH);
      pCurrentOps->push_back(OP_PUSHIP);

      dostack.push(pCurrentOps->size());
      return 0;
    }

    Thanks. Why do you do it this way? Do you want to break dependence
    chains on the virtual instruction pointer (the reason for the speedup
    in my results)?

    - anton
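
    The scheme described above (DO pushes the two loop parameters plus the virtual instruction pointer, and the loop end jumps back through that saved address) can be sketched with a toy VM. This is a minimal illustration, not kForth's code: the opcodes (LIT, DO, ADD_I, LOOP, HALT), the return-stack layout, and the helper sum_below are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical opcodes for a toy VM; not kForth's actual instruction set.
enum Op { LIT, DO, ADD_I, LOOP, HALT };

// Runs a program and returns the accumulator.
// DO    ( limit start -- ) pushes loop-back address, limit, and index
//       onto the return stack (assumed layout, index on top).
// ADD_I adds the current index to the accumulator.
// LOOP  increments the index and, unless it equals the limit, jumps to the
//       address saved on the return stack; otherwise it drops all three items.
long run(const std::vector<long>& code) {
    std::vector<long> data, rstack;
    long acc = 0;
    std::size_t ip = 0;
    for (;;) {
        switch (code[ip++]) {
        case LIT:
            data.push_back(code[ip++]);
            break;
        case DO: {
            assert(data.size() >= 2);
            long start = data.back(); data.pop_back();
            long limit = data.back(); data.pop_back();
            rstack.push_back(static_cast<long>(ip)); // first instruction of the body
            rstack.push_back(limit);
            rstack.push_back(start);
            break;
        }
        case ADD_I:
            assert(!rstack.empty());
            acc += rstack.back();
            break;
        case LOOP: {
            assert(rstack.size() >= 3);
            long& index = rstack[rstack.size() - 1];
            long limit = rstack[rstack.size() - 2];
            if (++index != limit) {
                // Branch target comes from the return stack, not from the code.
                ip = static_cast<std::size_t>(rstack[rstack.size() - 3]);
            } else {
                rstack.resize(rstack.size() - 3);
            }
            break;
        }
        case HALT:
            return acc;
        }
    }
}

// Toy equivalent of:  limit 0 DO  I +  LOOP   (requires limit > 0)
long sum_below(long limit) {
    assert(limit > 0);
    return run({LIT, limit, LIT, 0, DO, ADD_I, LOOP, HALT});
}
```

    For example, sum_below(5) runs the body for indices 0 through 4 and returns 10. Because LOOP takes its branch target from the return stack, the compiled loop end needs no inline branch operand.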
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Mar 9 18:28:45 2024
    From Newsgroup: comp.lang.forth

    albert@spenarnc.xs4all.nl writes:
    DO LOOP in FIG / ISO say FORTH is a mess anyway. The idea that signed/unsigned numbers can be handled uniformly was cute at the
    time, when you could not spare 10 bytes. In the 50 years no novice
    even dared to try negative indices or negative increments.

    LOOP is fine. +LOOP with negative increment is more problematic
    (that's why Gforth has -LOOP), but it turns out that for running
    backwards through an array, +LOOP with negative increment actually
    works out ok. But Gforth now has MEM-DO..LOOP so you don't need to
    worry about that.

    - anton
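
    +LOOP terminates when the index crosses the boundary between limit-1 and limit, which is why a negative increment behaves asymmetrically yet still suits a backward walk through an array. A minimal sketch of that rule, assuming 64-bit cells; the names plus_loop_exits and visited are hypothetical, and the sign-bit test is one possible formulation rather than any particular system's code:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// True if adding n to index crosses the boundary between limit-1 and limit.
// Unsigned arithmetic is used so that wraparound is well defined.
bool plus_loop_exits(std::int64_t index, std::int64_t limit, std::int64_t n) {
    std::uint64_t x = static_cast<std::uint64_t>(index) -
                      static_cast<std::uint64_t>(limit);
    std::uint64_t un = static_cast<std::uint64_t>(n);
    std::uint64_t y = x + un;
    // The boundary is crossed when x and x+n differ in sign
    // and x and n also differ in sign.
    return (((y ^ x) & (x ^ un)) >> 63) != 0;
}

// Indices seen by:  limit start DO ... n +LOOP
// Intended for small illustrative values only.
std::vector<std::int64_t> visited(std::int64_t limit, std::int64_t start,
                                  std::int64_t n) {
    assert(n != 0);  // a zero increment would never cross the boundary
    std::vector<std::int64_t> out;
    std::int64_t i = start;
    for (;;) {
        out.push_back(i);
        if (plus_loop_exits(i, limit, n)) break;
        i = static_cast<std::int64_t>(static_cast<std::uint64_t>(i) +
                                      static_cast<std::uint64_t>(n));
    }
    return out;
}
```

    Here visited(0, 9, -1) yields 9 down to 0, so a backward walk from the last element to the first includes the first element, whereas visited(10, 0, 1) stops at 9 and never reaches the limit.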
  • From dxf@dxforth@gmail.com to comp.lang.forth on Sun Mar 10 10:43:48 2024
    From Newsgroup: comp.lang.forth

    On 10/03/2024 5:09 am, Anton Ertl wrote:
    dxf <dxforth@gmail.com> writes:
    On 7/03/2024 12:28 am, Ruvim wrote:
    ...
    In SP-Forth v3 and v4 (they generate native code), "DO" pushes three items on the return stack, and among them the address that "LEAVE" then jumps to. ...

    That is the classic implementation as suggested by Bob Berkey - inventor of the Forth-83 DO LOOP. IIUC Anton is asking about systems that push the loop address - not the exit address.

    Correct.

    So, to summarize the answers:

    kForth keeps the loop-back address on the return stack in addition to
    index and limit.

    A number of systems (at least ciForth, VFX, SP-Forth) have followed
    Bob Berkey's suggestion of keeping the loop-exit address on the return
    stack for LEAVE. That is not what I was asking about, but it's also interesting.

    The lack of affirmative responses is unsurprising since a native hard-coded loop is considered 'as good as it gets'. It's difficult to imagine under
    what circumstances a loop address on the stack is faster, but it suggests
    one is starting from an inefficient or compromised base.

  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Tue Mar 12 11:41:15 2024
    From Newsgroup: comp.lang.forth

    dxf <dxforth@gmail.com> writes:
    On 10/03/2024 5:09 am, Anton Ertl wrote:
    It's difficult to imagine under
    what circumstances a loop address on the stack is faster, but it suggests
    one is starting from an inefficient or compromised base.

    The starting point is gforth-fast from June 2023. Here's an example.
    The inner loop of the siev benchmark is:

    0 i c! dup +loop

    The following shows the threaded code intermixed with the native code:

    loop-back address in ...
    ... threaded code               ... return stack
    lit 1->2                        lit 1->2
    #0                              #0
      mov r15,[r14]                   mov r15,[r14]
      add r14,$10                     add r14,$10
    i 2->3                          i 2->3
      mov r9,[rbx]                    mov r9,[rbx]
      add r14,$08                     add r14,$08
    c! 3->1                         c! 3->1
      mov [r9],r15lb                  mov [r9],r15lb
      add r14,$08                     add r14,$08
    dup 1->2                        dup 1->2
      mov r15,r8                      mov r15,r8
      add r14,$08                     add r14,$08
    (+loop) 2->1                    (+loop)-rstack 2->1
    <PRIMES+$108>
      mov rax,[rbx]                   mov rdx,[rbx]
      mov rsi,[r14]                   mov rsi,$10[rbx]
      lea r10,$08[r14]                mov rax,rdx
      mov rdx,rax                     sub rax,$08[rbx]
      sub rdx,$08[rbx]                add rdx,r15
      add rax,r15                     lea rcx,[r15][rax]
      lea rcx,[r15][rdx]              xor rcx,rax
      xor rcx,rdx                     xor rax,r15
      xor rdx,r15                     test rcx,rax
      test rcx,rdx                    js $7F22DC4C075F
      js $7F860CE101F1                mov r14,rsi
      mov [rbx],rax                   mov [rbx],rdx
      mov rcx,[rsi]                   add r14,$08
      lea r14,$08[rsi]                mov rcx,-$08[r14]
      jmp ecx                         jmp ecx

    On Zen3 (Ryzen 5800X) and Tiger Lake (Core i5-1135G7) the return stack
    variant is faster by a factor >2; we also see speedups on other
    processors, but they are smaller. Where do these speedups come from?

    If you look at the updates to r14, which holds the virtual-machine instruction pointer, they are as follows:

    loop-back address in ...
    ... threaded code               ... return stack
    add r14,$10                     add r14,$10
    add r14,$08                     add r14,$08
    add r14,$08                     add r14,$08
    add r14,$08                     add r14,$08
    mov rsi,[r14]                   mov rsi,$10[rbx]
    lea r14,$08[rsi]                mov r14,rsi
                                    add r14,$08

    The crucial difference is that in the left column there is an unbroken dependence chain from the r14 at the end of the previous iteration to
    the r14 at the end of the present iteration; this dependence chain has
    a latency of 9 cycles per iteration on Zen3, meaning that, with enough iterations, the loop takes at least 9 cycles per iteration.

    In the right column r14 at the end of one iteration does not depend on
    r14 at the end of the previous iteration, because the dependence chain
    starts from the instruction "mov rsi,$10[rbx]". This means that the
    loop can be executed faster; on Zen3 and on Tiger Lake, the
    speedup happens to be more than a factor of 2.

    - anton
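
    The dependence-chain argument above can be made concrete with a small latency tally. This is only a sketch: the per-instruction latencies (1 cycle for add, lea, and register mov; 4 cycles for a load) are assumptions chosen to be consistent with the 9-cycle figure quoted for Zen3, not measurements, and the Step type and the function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// One instruction on the path that produces the next value of r14.
struct Step {
    int latency;       // assumed latency in cycles (not measured)
    bool reads_chain;  // true if its input comes from the preceding step
};

// Latency from r14 at the start of an iteration to r14 at its end.
// Returns 0 when some step takes its input from elsewhere (for example a
// load addressed via rbx), because that step cuts the loop-carried chain.
int loop_carried_latency(const std::vector<Step>& steps) {
    int total = 0;
    bool carried = true;
    for (const Step& s : steps) {
        assert(s.latency >= 0);
        if (!s.reads_chain) carried = false;
        if (carried) total += s.latency;
    }
    return carried ? total : 0;
}

// Left column: loop-back address in threaded code.
std::vector<Step> threaded_code_steps() {
    return {
        {1, true},   // add r14,$10
        {1, true},   // add r14,$08
        {1, true},   // add r14,$08
        {1, true},   // add r14,$08
        {4, true},   // mov rsi,[r14]     load addressed by r14
        {1, true},   // lea r14,$08[rsi]
    };
}

// Right column: loop-back address on the return stack.
std::vector<Step> return_stack_steps() {
    return {
        {1, true},   // add r14,$10
        {1, true},   // add r14,$08
        {1, true},   // add r14,$08
        {1, true},   // add r14,$08
        {4, false},  // mov rsi,$10[rbx]  load addressed by rbx, not r14
        {1, true},   // mov r14,rsi
        {1, true},   // add r14,$08
    };
}
```

    Under these assumed latencies, loop_carried_latency(threaded_code_steps()) is 9, while loop_carried_latency(return_stack_steps()) is 0: in the second variant successive iterations are not serialized through r14, so they can overlap.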