• IA64 frcpa

    From Robert Finch@robfi680@gmail.com to comp.arch on Wed Mar 25 19:06:17 2026
    From Newsgroup: comp.arch

    I was studying the IA-64 reciprocal estimate, and it has two source
    operands as if a divide were taking place, then it checks for special
    values for a divide operation, setting a predicate bit if subsequent
    instructions should not execute. PowerPC does not do this, taking only a
    single source operand.

    The IA64 estimate seems a bit unusual to me, but there is a certain
    amount of sense checking for divide issues since most likely a divide
    happens next.

    I am considering replicating something similar, but branching instead
    of using predicates. Seems to me a number of special cases could benefit
    from bypassing the divide code. Values like 1/2, 1/4, etc. are easy to
    calculate exact reciprocals for.
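
    A rough software model of the two-operand idea, in Python for brevity.
    The name recip_estimate, the 9-bit default and the rounding used for the
    estimate are assumptions for illustration, not the IA-64 definition. It
    returns either the finished quotient (flag False, so the caller can
    branch past the divide code) or an approximation of 1/b (flag True).

```python
import math

def recip_estimate(a: float, b: float, bits: int = 9):
    """Hypothetical frcpa-like helper: returns (value, need_refine)."""
    # NaN operands: the quotient is NaN, nothing to refine.
    if math.isnan(a) or math.isnan(b):
        return math.nan, False
    sign = math.copysign(1.0, a) * math.copysign(1.0, b)
    # inf/x and x/0 give infinity; inf/inf and 0/0 are invalid (NaN).
    if math.isinf(a) or b == 0.0:
        if math.isinf(b) or a == 0.0:
            return math.nan, False
        return math.copysign(math.inf, sign), False
    # x/inf and 0/x give a signed zero.
    if math.isinf(b) or a == 0.0:
        return math.copysign(0.0, sign), False
    # Ordinary case: approximate 1/b to roughly `bits` bits.
    # Overflow of the estimate for subnormal b is not handled in this sketch.
    m, e = math.frexp(b)                 # b = m * 2**e, 0.5 <= |m| < 1
    scale = 1 << bits
    approx = round(scale / m) / scale    # 1/m rounded to `bits` fraction bits
    return math.ldexp(approx, -e), True
```

    A caller would test the flag and skip the refinement sequence when it
    is False, which is the branching variant described above.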
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Mar 25 23:53:38 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    > I was studying the IA-64 reciprocal estimate, and it has two source
    > operands as if a divide were taking place, then it checks for special
    > values for a divide operation, setting a predicate bit if subsequent
    > instructions should not execute. PowerPC does not do this, taking only a
    > single source operand.

    In practice, there is ½ a Newton-Raphson iteration difference.
    Figuring out that the FDIV is a power of 2 is a job for the
    compiler 80% of the time.

    > The IA64 estimate seems a bit unusual to me, but there is a certain
    > amount of sense checking for divide issues since most likely a divide
    > happens next.

    IA-64 assumes that there are 2 FMAC units per core, and the FDIV code
    sequence is designed for that. {Markstein has several papers on FDIV,
    FSQRT and the transcendentals for IA-64}

    > I am considering replicating something similar, but branching instead
    > of using predicates. Seems to me a number of special cases could benefit
    > from bypassing the divide code. Values like 1/2, 1/4, etc. are easy to
    > calculate exact reciprocals for.

    They are also easy to find in the divisor and allow taking a different
    sequence through FDIV. So, as I see it, this is something HW should be
    doing. After HW does this, SW can still gain something. It is only when
    HW does nothing that SW alone can add significant performance (remember
    FDIV is typically under 2% of instructions executed.)
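
    For illustration, spotting such a divisor amounts to checking that the
    fraction field is all zeros. A minimal Python sketch follows; the names
    is_power_of_two and divide_fast_path are made up here, and hardware
    would look at the raw bits rather than call frexp.

```python
import math

def is_power_of_two(b: float) -> bool:
    """True when b is +/-2**k: finite, nonzero, fraction bits all zero."""
    if b == 0.0 or math.isinf(b) or math.isnan(b):
        return False
    m, _ = math.frexp(b)           # b = m * 2**e with 0.5 <= |m| < 1
    return abs(m) == 0.5

def divide_fast_path(a: float, b: float):
    """Return a/b via an exponent adjustment, or None if a real divide is needed."""
    if not is_power_of_two(b):
        return None
    _, e = math.frexp(b)           # |b| == 2**(e - 1)
    try:
        q = math.ldexp(a, 1 - e)   # only the exponent of a changes
    except OverflowError:          # Python raises here instead of returning inf
        q = math.copysign(math.inf, a)
    return -q if b < 0 else q
```

    When the scaled result stays in the normal range no rounding occurs, so
    the fast path returns the same value as a full divide.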
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Mar 25 21:28:05 2026
    From Newsgroup: comp.arch

    On 2026-03-25 7:53 p.m., MitchAlsup wrote:

    > Robert Finch <robfi680@gmail.com> posted:

    >> I was studying the IA-64 reciprocal estimate, and it has two source
    >> operands as if a divide were taking place, then it checks for special
    >> values for a divide operation, setting a predicate bit if subsequent
    >> instructions should not execute. PowerPC does not do this, taking only a
    >> single source operand.

    > In practice, there is ½ a Newton-Raphson iteration difference.
    > Figuring out that the FDIV is a power of 2 is a job for the
    > compiler 80% of the time.

    >> The IA64 estimate seems a bit unusual to me, but there is a certain
    >> amount of sense checking for divide issues since most likely a divide
    >> happens next.

    > IA-64 assumes that there are 2 FMAC units per core, and the FDIV code
    > sequence is designed for that. {Markstein has several papers on FDIV,
    > FSQRT and the transcendentals for IA-64}

    My design has two FMAC units. I would rather have the extra FMAC than
    an FDIV unit. I see Markstein has a book from 2000, 25 years ago now.

    >> I am considering replicating something similar, but branching instead
    >> of using predicates. Seems to me a number of special cases could benefit
    >> from bypassing the divide code. Values like 1/2, 1/4, etc. are easy to
    >> calculate exact reciprocals for.

    > They are also easy to find in the divisor and allow taking a different
    > sequence through FDIV. So, as I see it, this is something HW should be
    > doing. After HW does this, SW can still gain something. It is only when
    > HW does nothing that SW alone can add significant performance (remember
    > FDIV is typically under 2% of instructions executed.)

    In my case I am likely implementing FDIV with micro-ops and NR
    iterations. The reciprocal estimate is now good to 9 bits or better, so
    3 NR iterations should work. A branch would just be to the next ISA
    instruction. It was fun getting subnormal results for FRES to work
    correctly.

    One trade-off possible is if a lower precision FDIV takes place it still
    does 3 NR iterations, since there would be just a single translation of
    FDIV to micro-ops for all precisions.
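
    A small Python sketch of why three iterations are enough from a 9-bit
    estimate. The seed function is only a stand-in for the lookup table and
    refine is a hypothetical helper; both are assumptions, not the actual
    micro-op sequence. Each step x = x * (2 - b * x) squares the error
    1 - b*x, so the number of good bits roughly doubles.

```python
import math
from fractions import Fraction

def seed(b: float, bits: int) -> float:
    """Stand-in for a table lookup: 1/b with relative error <= 2**-bits."""
    m, e = math.frexp(b)                  # b = m * 2**e, 0.5 <= |m| < 1
    scale = 1 << bits
    return math.ldexp(round(scale / m) / scale, -e)

def refine(b: float, x: float, steps: int):
    """Newton-Raphson for 1/b; also returns the exact |1 - b*x| after each step."""
    errors = [abs(1 - Fraction(b) * Fraction(x))]
    for _ in range(steps):
        x = x * (2.0 - b * x)             # error e becomes e**2, plus rounding
        errors.append(abs(1 - Fraction(b) * Fraction(x)))
    return x, errors

x, errs = refine(3.0, seed(3.0, 9), 3)
for n, err in enumerate(errs):
    print(n, f"{-math.log2(err):.1f} good bits" if err else "exact")
```

    The sketch uses separate multiply and subtract in double precision; a
    fused multiply-add version would keep more of the intermediate error
    term, which matters for the final rounding.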


  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Mar 26 01:20:10 2026
    From Newsgroup: comp.arch

    On 3/25/2026 4:53 PM, MitchAlsup wrote:

    > Robert Finch <robfi680@gmail.com> posted:

    >> I was studying the IA-64 reciprocal estimate, and it has two source
    >> operands as if a divide were taking place, then it checks for special
    >> values for a divide operation, setting a predicate bit if subsequent
    >> instructions should not execute. PowerPC does not do this, taking only a
    >> single source operand.

    > In practice, there is ½ a Newton-Raphson iteration difference.
    > Figuring out that the FDIV is a power of 2 is a job for the
    > compiler 80% of the time.

    >> The IA64 estimate seems a bit unusual to me, but there is a certain
    >> amount of sense checking for divide issues since most likely a divide
    >> happens next.

    > IA-64 assumes that there are 2 FMAC units per core, and the FDIV code
    > sequence is designed for that. {Markstein has several papers on FDIV,
    > FSQRT and the transcendentals for IA-64}

    >> I am considering replicating something similar, but branching instead
    >> of using predicates. Seems to me a number of special cases could benefit
    >> from bypassing the divide code. Values like 1/2, 1/4, etc. are easy to
    >> calculate exact reciprocals for.

    > They are also easy to find in the divisor and allow taking a different
    > sequence through FDIV. So, as I see it, this is something HW should be
    > doing. After HW does this, SW can still gain something. It is only when
    > HW does nothing that SW alone can add significant performance (remember
    > FDIV is typically under 2% of instructions executed.)

    Fwiw, my code loves nice floating points...

    https://paulbourke.net/fractals/multijulia

    Let me zoom before exploding...
  • From Robert Finch@robfi680@gmail.com to comp.arch on Thu Mar 26 09:16:59 2026
    From Newsgroup: comp.arch

    On 2026-03-26 4:20 a.m., Chris M. Thomasson wrote:
    > On 3/25/2026 4:53 PM, MitchAlsup wrote:

    >> Robert Finch <robfi680@gmail.com> posted:

    >>> I was studying the IA-64 reciprocal estimate, and it has two source
    >>> operands as if a divide were taking place, then it checks for special
    >>> values for a divide operation, setting a predicate bit if subsequent
    >>> instructions should not execute. PowerPC does not do this, taking only a
    >>> single source operand.

    >> In practice, there is ½ a Newton-Raphson iteration difference.
    >> Figuring out that the FDIV is a power of 2 is a job for the
    >> compiler 80% of the time.

    >>> The IA64 estimate seems a bit unusual to me, but there is a certain
    >>> amount of sense checking for divide issues since most likely a divide
    >>> happens next.

    >> IA-64 assumes that there are 2 FMAC units per core, and the FDIV code
    >> sequence is designed for that. {Markstein has several papers on FDIV,
    >> FSQRT and the transcendentals for IA-64}

    >>> I am considering replicating something similar, but branching instead
    >>> of using predicates. Seems to me a number of special cases could benefit
    >>> from bypassing the divide code. Values like 1/2, 1/4, etc. are easy to
    >>> calculate exact reciprocals for.

    >> They are also easy to find in the divisor and allow taking a different
    >> sequence through FDIV. So, as I see it, this is something HW should be
    >> doing. After HW does this, SW can still gain something. It is only when
    >> HW does nothing that SW alone can add significant performance (remember
    >> FDIV is typically under 2% of instructions executed.)

    > Fwiw, my code loves nice floating points...

    > https://paulbourke.net/fractals/multijulia

    > Let me zoom before exploding...

    Beautiful fractals.

    *****

    Hit a divide anomaly.

    Made better use of the BRAM storing reciprocal values. The estimate is
    now good to better than 11 bits. That makes half-precision division
    faster than multiply. Half-precision divides take only two clock cycles.
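
    A back-of-envelope Python sketch of the iteration counts involved. It
    assumes each Newton-Raphson step exactly doubles the number of good bits
    and ignores rounding in the steps and any final correction; the name
    nr_steps_needed is hypothetical.

```python
def nr_steps_needed(seed_bits: int, target_bits: int) -> int:
    """Steps to reach target_bits if every step doubles the good bits."""
    steps, bits = 0, seed_bits
    while bits < target_bits:
        bits *= 2
        steps += 1
    return steps

# Significand widths, hidden bit included, for the IEEE 754 binary formats.
FORMATS = {"half": 11, "single": 24, "double": 53}

for seed_bits in (9, 11):
    counts = {name: nr_steps_needed(seed_bits, p) for name, p in FORMATS.items()}
    print(seed_bits, counts)
```

    Under that model an 11-bit estimate already covers the 11-bit
    half-precision significand with no iterations, while single and double
    still need 2 and 3 steps.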

  • From Robert Finch@robfi680@gmail.com to comp.arch on Thu Mar 26 09:41:25 2026
    From Newsgroup: comp.arch

    On 2026-03-26 9:16 a.m., Robert Finch wrote:
    > On 2026-03-26 4:20 a.m., Chris M. Thomasson wrote:
    >> On 3/25/2026 4:53 PM, MitchAlsup wrote:

    >>> Robert Finch <robfi680@gmail.com> posted:

    >>>> I was studying the IA-64 reciprocal estimate, and it has two source
    >>>> operands as if a divide were taking place, then it checks for special
    >>>> values for a divide operation, setting a predicate bit if subsequent
    >>>> instructions should not execute. PowerPC does not do this, taking
    >>>> only a single source operand.

    >>> In practice, there is ½ a Newton-Raphson iteration difference.
    >>> Figuring out that the FDIV is a power of 2 is a job for the
    >>> compiler 80% of the time.

    >>>> The IA64 estimate seems a bit unusual to me, but there is a certain
    >>>> amount of sense checking for divide issues since most likely a divide
    >>>> happens next.

    >>> IA-64 assumes that there are 2 FMAC units per core, and the FDIV code
    >>> sequence is designed for that. {Markstein has several papers on FDIV,
    >>> FSQRT and the transcendentals for IA-64}

    >>>> I am considering replicating something similar, but branching instead
    >>>> of using predicates. Seems to me a number of special cases could
    >>>> benefit from bypassing the divide code. Values like 1/2, 1/4, etc.
    >>>> are easy to calculate exact reciprocals for.

    >>> They are also easy to find in the divisor and allow taking a different
    >>> sequence through FDIV. So, as I see it, this is something HW should be
    >>> doing. After HW does this, SW can still gain something. It is only when
    >>> HW does nothing that SW alone can add significant performance (remember
    >>> FDIV is typically under 2% of instructions executed.)

    >> Fwiw, my code loves nice floating points...

    >> https://paulbourke.net/fractals/multijulia

    >> Let me zoom before exploding...

    > Beautiful fractals.

    > *****

    > Hit a divide anomaly.

    > Made better use of the BRAM storing reciprocal values. The estimate is
    > now good to better than 11 bits. That makes half-precision division
    > faster than multiply. Half-precision divides take only two clock cycles.

    Just testing. Got to wondering how.


