Forum: War Ensemble BBS

IA64 frcpa

From Robert Finch@robfi680@gmail.com to comp.arch on Wed Mar 25 19:06:17 2026

From Newsgroup: comp.arch

I was studying the IA-64 reciprocal estimate, and it has two source
operands as if a divide were taking place, then it checks for special
values for a divide operation, setting a predicate bit if subsequent
instructions should not execute. PowerPC does not do this, taking only a
single source operand.

The IA-64 estimate seems a bit unusual to me, but it makes a certain
amount of sense to check for divide issues, since most likely a divide
happens next.

I am considering replicating something similar, but branching instead
of using predicates. It seems to me that a number of special cases could
benefit from bypassing the divide code. Values like 1/2, 1/4, etc. are
easy to calculate exact reciprocals for.
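
For the power-of-two case, a small C sketch of the check (pow2_reciprocal is a hypothetical helper, not anything from IA-64 or PowerPC). In hardware this amounts to "fraction bits all zero"; here frexp() stands in for looking at the fields directly, and the reciprocal is formed by negating the exponent.

```c
#include <math.h>
#include <stdbool.h>

/* If |b| is an exact power of two and 1/b is representable, store the
   exact reciprocal in *out and return true. Otherwise return false and
   leave *out untouched, so the caller falls through to the normal path. */
static bool pow2_reciprocal(double b, double *out)
{
    int e;
    if (!isfinite(b) || b == 0.0)
        return false;
    double m = frexp(b, &e);            /* b = m * 2^e, 0.5 <= |m| < 1 */
    if (fabs(m) != 0.5)                 /* fraction bits are not all zero */
        return false;
    /* b = +/-2^(e-1), so 1/b = +/-2^(1-e): only the exponent changes. */
    double r = ldexp(m < 0.0 ? -1.0 : 1.0, 1 - e);
    if (isinf(r))                       /* reciprocal of a tiny subnormal */
        return false;
    *out = r;
    return true;
}
```

With an exact reciprocal in hand, a/b reduces to a single multiply a * r, apart from overflow and underflow of the product itself.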
--- Synchronet 3.21f-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Mar 25 23:53:38 2026

Robert Finch <robfi680@gmail.com> posted:

> I was studying the IA-64 reciprocal estimate, and it has two source
> operands as if a divide were taking place, then it checks for special
> values for a divide operation, setting a predicate bit if subsequent
> instructions should not execute. PowerPC does not do this, taking only a
> single source operand.

In practice, there is ½ a Newton-Raphson iteration difference.
Figuring out that the FDIV is a power of 2 is a job for the
compiler 80% of the time.

> The IA-64 estimate seems a bit unusual to me, but it makes a certain
> amount of sense to check for divide issues, since most likely a divide
> happens next.

IA-64 assumes that there are 2 FMAC units per core, and the FDIV code
sequence is designed for that. {Markstein has several papers on FDIV,
FSQRT and the transcendentals for IA-64}

> I am considering replicating something similar, but branching instead
> of using predicates. It seems to me that a number of special cases could
> benefit from bypassing the divide code. Values like 1/2, 1/4, etc. are
> easy to calculate exact reciprocals for.

They are also easy to find in the divisor, and that allows taking a
different sequence through FDIV. So, as I see it, this is something HW
should be doing. After HW does this, SW can still gain something. It is
only when HW does nothing that SW alone can add significant performance
(remember FDIV is typically under 2% of instructions executed.)

From Robert Finch@robfi680@gmail.com to comp.arch on Wed Mar 25 21:28:05 2026

On 2026-03-25 7:53 p.m., MitchAlsup wrote:

> Robert Finch <robfi680@gmail.com> posted:

>> I was studying the IA-64 reciprocal estimate, and it has two source
>> operands as if a divide were taking place, then it checks for special
>> values for a divide operation, setting a predicate bit if subsequent
>> instructions should not execute. PowerPC does not do this, taking only a
>> single source operand.

> In practice, there is ½ a Newton-Raphson iteration difference.
> Figuring out that the FDIV is a power of 2 is a job for the
> compiler 80% of the time.

>> The IA-64 estimate seems a bit unusual to me, but it makes a certain
>> amount of sense to check for divide issues, since most likely a divide
>> happens next.

> IA-64 assumes that there are 2 FMAC units per core, and the FDIV code
> sequence is designed for that. {Markstein has several papers on FDIV,
> FSQRT and the transcendentals for IA-64}

My design has two FMAC units. I would rather have the extra FMAC
instead of FDIV. I see Markstein has a book from 2000, 25 years ago
now.

>> I am considering replicating something similar, but branching instead
>> of using predicates. It seems to me that a number of special cases could
>> benefit from bypassing the divide code. Values like 1/2, 1/4, etc. are
>> easy to calculate exact reciprocals for.

> They are also easy to find in the divisor, and that allows taking a
> different sequence through FDIV. So, as I see it, this is something HW
> should be doing. After HW does this, SW can still gain something. It is
> only when HW does nothing that SW alone can add significant performance
> (remember FDIV is typically under 2% of instructions executed.)

In my case I am likely implementing FDIV with micro-ops and NR
iterations. The reciprocal estimate is now good to 9 bits or better, so
3 NR iterations should work. A branch would just be to the next ISA instruction. It was fun getting subnormal results for FRES to work
correctly.

One possible trade-off: a lower-precision FDIV would still do 3 NR
iterations, since there would be just a single translation of FDIV to
micro-ops for all precisions.

From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Mar 26 01:20:10 2026

Fwiw, my code loves nice floating points...

https://paulbourke.net/fractals/multijulia

Let me zoom before exploding...

From Robert Finch@robfi680@gmail.com to comp.arch on Thu Mar 26 09:16:59 2026

Beautiful fractals.

*****

Hit a divide anomaly.

Made better use of the BRAM storing reciprocal values. The estimate is
now good to better than 11 bits. That makes half-precision division
faster than multiply. Half-precision divides take only two clock cycles.
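
A back-of-the-envelope C sketch of why the 11-bit estimate matters, assuming the idealized rule that each NR iteration doubles the number of correct bits (rounding error and the final correctly-rounded-quotient step are ignored). nr_iterations_needed is a hypothetical helper. Significand widths are 11 bits for half, 24 for single and 53 for double precision.

```c
/* Idealized count of Newton-Raphson iterations needed to go from an
   estimate with `seed_bits` good bits to at least `target_bits`. */
static int nr_iterations_needed(int seed_bits, int target_bits)
{
    int n = 0;
    while (seed_bits < target_bits) {
        seed_bits *= 2;   /* quadratic convergence: good bits double */
        n++;
    }
    return n;
}
```

Under this model an 11-bit seed needs 0 iterations for half, 2 for single and 3 for double, while a 9-bit seed needs 1, 2 and 3. That fits the half-precision divide becoming little more than a lookup and a multiply, and it shows what a single 3-iteration sequence for all precisions would cost the narrower formats.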

From Robert Finch@robfi680@gmail.com to comp.arch on Thu Mar 26 09:41:25 2026

Just testing. Got to wondering how.
