• Inverted Page Tables

    From Robert Finch@robfi680@gmail.com to comp.arch on Thu Jan 29 10:33:29 2026
    From Newsgroup: comp.arch

    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    I am thinking about using an inverted page table instead of a
    hierarchical page table for the current project. The advantage is that
    the entire page table should be able to fit into BRAM on the FPGA. It
    should then be possible to eliminate the TLB from the design at perhaps
    a small performance cost. It would reduce the LUT cost of the MMU.

    Translations could work directly from the page table.

    The hash function would select a page table group, averaging about 1.2
    lookups per memory access. The hash is the ASID XORed with address bits
    20 to 29, and there are 32k entries in the table.

    Using 64kB pages to keep the page table small. 1GB RAM to manage.
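
    A rough C model of that group selection might look like the following;
    the group count, the exact masking, and the names are illustrative
    assumptions rather than the actual design:

    #include <stdint.h>

    #define NUM_GROUPS 4096u   /* assumed: 32k entries in 8-entry groups */

    /* Select a page table group: ASID XORed with VA bits 20..29. */
    static inline uint32_t ipt_group(uint32_t asid, uint64_t va)
    {
        uint32_t va_bits = (uint32_t)((va >> 20) & 0x3FFu);  /* bits 20..29 */
        return (asid ^ va_bits) & (NUM_GROUPS - 1u);
    }
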
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Jan 29 18:04:44 2026
    From Newsgroup: comp.arch

    On Thu, 29 Jan 2026 10:33:29 -0500
    Robert Finch <robfi680@gmail.com> wrote:

    Have inverted page tables fallen out of favor?

    It depends on whom you ask.
    If you ask LBT then inverted page tables never had an opportunity to
    fall out of his favor, because he has hated them since the very first
    encounter.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Thu Jan 29 13:09:37 2026
    From Newsgroup: comp.arch

    I am thinking about using an inverted page table instead of a hierarchical page table for the current project. The advantage is that the entire page table should be able to fit into BRAM on the FPGA. It should then be possible to eliminate the TLB from the design at perhaps a small performance cost. It would reduce the LUT cost of the MMU.

    If you have the freedom (on the OS side) to use a virtually-indexed
    cache, I'd recommend you try and use a system like:

    PA physical_address_of_virtual_address (VA x)
    {
      if (x == 0)
        return 0;      /* base case: the page holding the start of the
                          page table is assumed to translate to PA 0    */
      else {
        /* the linear page table is assumed to live at virtual address 0,
           so the PTE for page (x / PAGESIZE) sits at this virtual address */
        VA pagetable_entry = x / PAGESIZE * sizeof (VA);
        return *(PA*)pagetable_entry;
      }
    }

    This uses the CPU cache as the TLB and does pagetable walks "from the
    bottom", so its efficiency is basically not affected by the number of
    levels of pagetables.


    === Stefan
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jan 29 21:40:27 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Have inverted page tables fallen out of favor?

    Not so much as: that "they were never really IN Favor".

    And how much OS support
    is there for them?

    Probably just enough to get by if you copy an existing IPT.

    I am thinking about using an inverted page table instead of a
    hierarchical page table for the current project. The advantage is that
    the entire page table should be able to fit into BRAM on the FPGA. It
    should then be possible to eliminate the TLB from the design at perhaps
    a small performance cost. It would reduce the LUT cost of the MMU.

    An alternative is for HW to generate a hash address based on faulting
    VA and make one probe into a 1-16MB TLB stored across the memory hierarchy.
    If the access matches VA the rest of the data is your PTE. If not, then
    you trap to SW to refill the memory-TLB. Rationale: if the hash function
    is not that bad, and the hash table is large, TLB refills end up at low
    cost (delay).
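
    A minimal C sketch of that one-probe, memory-resident TLB (entry layout,
    hash, and names are assumptions for illustration only):

    #include <stdint.h>

    typedef struct {
        uint64_t va_tag;   /* virtual page number (plus ASID) of this entry */
        uint64_t pte;      /* the translation itself                        */
    } mem_tlb_entry;

    extern mem_tlb_entry mem_tlb[];      /* 1-16MB table in ordinary memory */
    extern const uint64_t MEM_TLB_MASK;  /* number of entries minus one     */

    /* One hashed probe; a real MMU would trap to SW on a mismatch. */
    static inline int mem_tlb_lookup(uint64_t vpn, uint64_t *pte_out)
    {
        uint64_t h = (vpn ^ (vpn >> 13)) & MEM_TLB_MASK;   /* assumed hash */
        if (mem_tlb[h].va_tag == vpn) {
            *pte_out = mem_tlb[h].pte;   /* hit: rest of the entry is the PTE */
            return 1;
        }
        return 0;                        /* miss: refill in software */
    }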

    Translations could work directly from the page table.

    The hash function would select a page table group. Averaging about 1.2 lookups per memory access. Using the ASID xor’d with address bits 20 to
    29 as the hash. 32k entries in the table.

    Using 64kB pages to keep the page table small. 1GB RAM to manage.

    But realistically, a HW table-walker with a bit of internal caching supplemented with a large L2-TLB works fine in practice.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Jan 29 17:36:24 2026
    From Newsgroup: comp.arch

    On 1/29/2026 9:33 AM, Robert Finch wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    I am thinking about using an inverted page table instead of a
    hierarchical page table for the current project. The advantage is that
    the entire page table should be able to fit into BRAM on the FPGA. It
    should then be possible to eliminate the TLB from the design at perhaps
    a small performance cost. It would reduce the LUT cost of the MMU.

    Translations could work directly from the page table.

    The hash function would select a page table group. Averaging about 1.2 lookups per memory access. Using the ASID xor’d with address bits 20 to
    29 as the hash. 32k entries in the table.

    Using 64kB pages to keep the page table small. 1GB RAM to manage.


    Inverted page tables kinda have the disadvantage though of having *both*
    the disadvantages of a software managed TLB along with some of the disadvantages of a HW page walker.

    The main argued benefits they have (over plain SW TLB):
    Typically larger (because RAM backed);
    Potentially slightly faster on process switches.
    Less of a "TLB miss storm".

    Common drawbacks with SW TLB:
    Still need to deal with TLB misses (if page isn't in IPT);
    Finite associativity means can't fit whole VAS in an IPT.

    Common drawbacks with HW page walking:
    Still need to access RAM, potentially multiple times, to resolve a miss;
    Loss of flexibility: As with HW page walking, one can't transparently
    deal with multi-level address translation (without, essentially, falling
    back to just treating it like a SW TLB; or having whatever code manages
    the bottom-level IPT also being aware of the use of nested address translation).

    Though, maybe not as bad as with HW page walking, where the page-walker
    also has to be aware of and natively deal with nested translation
    (potentially a very big headache for a HW page walker).



    Big set-associative IPT held in BRAM (vs in Main RAM):
    Basically the same as how I am already doing SW TLB.

    While it could be argued that an IPT can reduce process-switch cost by
    having per-process IPT, with some cleverness this can also be faked on
    top of SW TLB (by modeling the TLB per-process, and then proactively
    injecting TLBE's on process switch).




    One other intermediate idea I had considered was a "Last Level Page
    Table" (see the sketch after this list), where the MMU is given the
    address of the last level of each page table as its own specialized
    TLBE (after any requisite address translation), which can potentially
    have some of the advantages of both SW TLB and HW page walking:
    Retains flexibility and can deal with nested translation;
    Avoids many of the harder parts of a page walker;
    Only needs a single memory access to resolve a request.
    Could have some of the benefits of a HW page-walker:
    Most of the would-be TLB misses can be handled by hardware.
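
    A small C sketch of that last-level lookup (page size, table width,
    entry layout, and the phys_read64 helper are all assumptions, not the
    actual design):

    #include <stdint.h>

    /* The MMU caches, per region, the physical address of that region's
       last-level page table page. */
    typedef struct {
        uint64_t region_tag;   /* which VA region / ASID this entry covers */
        uint64_t table_pa;     /* PA of the last-level page table page     */
    } llpt_entry;

    #define PAGE_SHIFT 14u     /* illustrative page size       */
    #define LEVEL_BITS 10u     /* PTEs per last-level table    */

    /* Resolve a would-be TLB miss with a single memory access. */
    static inline uint64_t llpt_resolve(const llpt_entry *e, uint64_t va,
                                        uint64_t (*phys_read64)(uint64_t pa))
    {
        uint64_t idx = (va >> PAGE_SHIFT) & ((1u << LEVEL_BITS) - 1u);
        return phys_read64(e->table_pa + idx * sizeof(uint64_t));  /* the PTE */
    }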

    Though, admittedly, haven't implemented it as of yet, but mostly because
    in my case TLB miss activity tends to be low enough that it isn't a big
    enough part of the total clock cycle budget (or, IOW, 256x4W already has
    a pretty good hit rate in a world of mostly smallish processes).

    Would have the issue that it can no longer detect success or failure
    within a fixed latency, so would need more complexity to deal with an "in-flight potential miss", either needing a mechanism to signal the
    request to bounce around the ring some more, or a way to temporarily
    take requests off the bus and inject them back later.


    Though, the most likely option being that untranslated VAS's are not
    allowed to leave the L1 ringbus (so would be deflected back into the
    ring until the TLB can deal with them). Does lead to the potential issue
    of bus congestion leading to a scenario where the page-access response
    (from the L2 cache or RAM) can't easily enter back into the L1 ring if
    most of the spots are already full (with in-flight cycling requests).

    Though, then again, for latency:
    L1 D$: 2c
    L1 I$: 2c
    TLB: 3c
    BridgeVcA: 3c
    This is a 64x4W write-through cache that also functions as a bridge.

    So, L1A has ~ 10c ring latency.
    Assuming both L1's fire off 2 requests, this is around a 40% congestion,
    and theoretically should not break down until around 70% (unless the
    latency values have an unlucky fraction, *).

    Goes and checks:
    Needed logic was already in place, though as-is it would print debug
    messages whenever the request is bounced back inside the L1 ring (an untranslated L1 request escaping the L1 ring would be "extra bad", as
    only physically addressed requests are allowed on the L2 ring).

    *: If the L2 ring has a latency that is an integer multiple of the L1
    ring latency, then potentially the same messages can collide endlessly,
    and the bus would effectively be stalled. In theory, could be addressed
    via a mechanism to either fine-adjust or jitter L2 ring latency if needed.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Fri Jan 30 11:16:43 2026
    From Newsgroup: comp.arch

    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Jan 30 11:51:15 2026
    From Newsgroup: comp.arch

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN, so I am not quite following
    the challenge of using them for shared memory? Multiple VPNs mapping to
    the same PPN are possible. Is it the freeing up of physical pages in SW
    that causes an issue?

    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for instance.

    A hierarchical page table is a better solution, but I was looking for something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Jan 30 20:35:14 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPN-> to the
    same PPN are possible. Is it the freeing up of physical pages in SW that cause an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for instance.

    A hierarchical page table is a better solution, but I was looking for something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Jan 30 16:03:39 2026
    From Newsgroup: comp.arch

    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for a TLB.
    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPN-> to the
    same PPN are possible. Is it the freeing up of physical pages in SW that
    cause an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages can
    point to a single physical page using a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is
    walked on a miss instead of walking main memory. Page faults would be
    handled the same way.

    The table is clustered: the hash points to eight entries, which are then
    searched in parallel for the correct one. If not found, a table walk
    begins using quadratic open addressing.
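
    A rough C model of that clustered lookup with quadratic probing (the
    hash, group count, probe limit, and field names are illustrative
    assumptions):

    #include <stdint.h>

    #define CLUSTER     8u      /* entries searched in parallel per group */
    #define NUM_GROUPS  4096u   /* illustrative                           */
    #define MAX_PROBES  30u     /* give up (page fault) after this many   */

    typedef struct { int valid; uint32_t asid; uint64_t vpn, ppn; } ipt_entry;
    extern ipt_entry ipt[NUM_GROUPS][CLUSTER];

    /* One group per step; the inner loop is a parallel compare in HW. */
    static int ipt_lookup(uint32_t asid, uint64_t vpn, uint64_t *ppn_out)
    {
        uint32_t g0 = (uint32_t)(asid ^ vpn) & (NUM_GROUPS - 1u);  /* assumed hash */
        for (uint32_t step = 0; step < MAX_PROBES; step++) {
            uint32_t g = (g0 + step * step) & (NUM_GROUPS - 1u);   /* quadratic */
            for (uint32_t i = 0; i < CLUSTER; i++) {
                if (ipt[g][i].valid && ipt[g][i].asid == asid &&
                    ipt[g][i].vpn == vpn) {
                    *ppn_out = ipt[g][i].ppn;
                    return 1;
                }
            }
        }
        return 0;   /* not found: fall through to the page-fault path */
    }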



    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for instance.
    A hierarchical page table is a better solution, but I was looking for
    something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Jan 30 15:44:45 2026
    From Newsgroup: comp.arch

    On 1/30/2026 3:03 PM, Robert Finch wrote:
    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch  <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea.  Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

        - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for
    a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPN-> to the
    same PPN are possible. Is it the freeing up of physical pages in SW that
    cause an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages can point to a single physical page using a a hash table. Is it just a performance issue? ASID is part of the hash/compare.

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is walked on a miss instead of walking main memory. Page faults would be handled the same way.

    The table is clustered, the hash points to eight entries which are then searched in parallel for the correct one. If not found a table walk
    begins using quadratic open addressing.


    It isn't obvious (at least to me) how what you are describing is much different from a SW managed TLB.


    Main difference I think was that I ended up mostly settling on modulo addressing rather than hashing, but the reasoning here is more subtle:
    while on average hashing is better than modulo addressing, the worst
    case behavior for hashing is worse than the worst case for modulo.

    Effectively, hashing requires twice the associativity to not shoot
    itself in the foot in the worst case, whereas modulo is merely slightly
    worse in the average case.
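
    To make the two indexing schemes concrete, a couple of one-liners (the
    set count and the hash mix are arbitrary examples, not the actual design):

    #include <stdint.h>

    #define TLB_SETS 256u   /* illustrative */

    /* Modulo (direct) indexing: adjacent pages map to adjacent sets. */
    static inline uint32_t tlb_index_modulo(uint64_t vpn)
    {
        return (uint32_t)(vpn & (TLB_SETS - 1u));
    }

    /* Hashed indexing: better average spread, but a pathological working
       set can still pile many pages onto the same set. */
    static inline uint32_t tlb_index_hashed(uint64_t vpn)
    {
        return (uint32_t)((vpn ^ (vpn >> 8) ^ (vpn >> 16)) & (TLB_SETS - 1u));
    }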

    Ironically, also means that while small-scale alignment is often
    preferable, medium and large scale alignment is best minimized (or, IOW,
    ASLR can actually be beneficial for performance by reducing unrelated
    data stomping the same locations in the caches and TLB).


    Though, there are intermediate options, say, for a TLB, XOR'ing the
    index with the ASID for non-global pages, *1; but then this means
    needing to partition the TLB between global and local (halving
    associativity relative to LUT budget), in turn still favoring the use of modulo addressing.


    Granted, a hashing scheme and multiple probing steps could be possible
    to increase effective associativity at the cost of latency (say, in an
    extreme case, using latency tricks to cause a 2-way TLB to pretend to be
    4 or 8 way).

    Though, with something like my ringbus design this would require needing
    to get clever to allow it to work (as it could not translate requests on multiple adjacent slots). Would likely need, say, a small 1-way TLB on
    the front-end, followed by 2 or 3 additional slots to resolve the lookup sequentially if the 1-way TLB misses (where multiple subsequent requests
    that miss on the 1-way TLB and were shadowed by a prior request taking
    up the "multi-way probe" resources, would need to cycle back around the
    ring for another pass through the TLB, *2).


    *1: Or, semi-global in a MMU that lacks true global pages, as the
    existence of true global pages are actively detrimental in some
    scenarios (and it is preferable instead to divide ASIDs into groups, and
    a global page is only global within a group or within related groups;
    allowing for ASIDs to exist which will not see these "global" pages).

    *2: And, while the ringbus continues to seem high-latency and stupid,
    still seems to be one of the better options to balance performance and cheapness (simpler bus designs being painfully slow, and more advanced FIFO/queue based buses being far more complex and expensive...).




    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for
    instance.

    A hierarchical page table is a better solution, but I was looking for
    something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Jan 30 19:23:56 2026
    From Newsgroup: comp.arch

    On 2026-01-30 4:44 p.m., BGB wrote:
    On 1/30/2026 3:03 PM, Robert Finch wrote:
    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch  <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS
    support
    is there for them?

    IVTs are a terrible idea.  Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

        - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for
    a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPN-> to the
    same PPN are possible. Is it the freeing up of physical pages in SW
    that
    cause an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages
    can point to a single physical page using a a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is
    walked on a miss instead of walking main memory. Page faults would be
    handled the same way.

    The table is clustered, the hash points to eight entries which are
    then searched in parallel for the correct one. If not found a table
    walk begins using quadratic open addressing.


    It isn't obvious (at least to me) how what you are describing is much different from a SW managed TLB.

    It is not a lot different. On a translation miss the table is walked
    with HW though instead of SW walking main memory. There are enough
    entries in the table to cover the physical address range. Only page
    faults are handled in SW which should be rare. A page fault is generated
    after 30 steps in the walk though. That is a miss on 240 possible
    locations. 240-way associativity is not enough?
    The page fault handler will likely just report 'out-of-memory' then quit
    the app.


    Main difference I think was that I ended up mostly settling on modulo addressing rather than hashing, but the reasoning here is more subtle:
    while on average hashing is better than modulo addressing, the worst
    case behavior for hashing is worst than the worst case for modulo.

    Effectively, hashing requires twice the associativity to not shoot
    itself in the foot in the worst case, whereas modulo is merely slightly worse in the average case.

    I think clustering gives the table more associativity. Up to eight VAs
    can all hash to the same cluster. Almost as if it were eight-way
    associative.

    I think clustering is key to using an inverted hash-table. I got this
    from studying the PowerPC.


    Ironically, also means that while small-scale alignment is often
    preferable, medium and large scale alignment is best minimized (or, IOW, ASLR can actually be beneficial for performance by reducing unrelated
    data stomping the same locations in the caches and TLB).


    Though, there are intermediate options, say, for a TLB, XOR'ing the
    index with the ASID for non-global pages, *1; but then this means
    needing to partition the TLB between global and local (halving
    associativity relative to LUT budget), in turn still favoring the use of modulo addressing.


    Granted, a hashing scheme and multiple probing steps could be possible
    to increase effective associativity at the cost of latency (say, in an extreme case, using latency tricks to cause a 2-way TLB to pretend to be
    4 or 8 way).

    Not too worried about latency as long as it is reasonable. Every step is another eight ways associative due to clustering.


    Though, with something like my ringbus design this would require needing
    to get clever to allow it to work (as it could not translate requests on multiple adjacent slots). Would likely need, say, a small 1-way TLB on
    the front-end, followed by 2 or 3 additional slots to resolve the lookup sequentially if the 1-way TLB misses (where multiple subsequent requests that miss on the 1-way TLB and were shadowed by a prior request taking
    up the "multi-way probe" resources, would need to cycle back around the
    ring for another pass through the TLB, *2).


    *1: Or, semi-global in a MMU that lacks true global pages, as the
    existence of true global pages are actively detrimental in some
    scenarios (and it is preferable instead to divide ASIDs into groups, and
    a global page is only global within a group or within related groups; allowing for ASIDs to exist which will not see these "global" pages).

    *2: And, while the ringbus continues to seem high-latency and stupid,
    still seems to be one of the better options to balance performance and cheapness (simpler bus designs being painfully slow, and more advanced FIFO/queue based buses being far more complex and expensive...).





    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for
    instance.

    A hierarchical page table is a better solution, but I was looking for
    something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.




    Using a huge page size (256kB) now so the table can be kept small. There
    are three hash tables, one for instructions, two for data. Same as
    triple porting the table. There are still thousands of pages available.





    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Jan 31 01:36:59 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 4:44 p.m., BGB wrote:
    On 1/30/2026 3:03 PM, Robert Finch wrote:
    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch  <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS
    support
    is there for them?

    IVTs are a terrible idea.  Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

        - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPN-> to the
    same PPN are possible. Is it the freeing up of physical pages in SW that
    cause an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages
    can point to a single physical page using a a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is
    walked on a miss instead of walking main memory. Page faults would be
    handled the same way.

    The table is clustered, the hash points to eight entries which are
    then searched in parallel for the correct one. If not found a table
    walk begins using quadratic open addressing.


    It isn't obvious (at least to me) how what you are describing is much different from a SW managed TLB.

    It is not a lot different. On a translation miss the table is walked
    with HW though instead of SW walking main memory. There are enough
    entries in the table to cover the physical address range. Only page
    faults are handled in SW which should be rare. A page fault is generated after 30 steps in the walk though. That is a miss on 240 possible
    locations. 240-way associativity is not enough?
    The page fault handler will like just report 'out-of-memory' then quit
    the app.


    Main difference I think was that I ended up mostly settling on modulo addressing rather than hashing, but the reasoning here is more subtle: while on average hashing is better than modulo addressing, the worst
    case behavior for hashing is worst than the worst case for modulo.

    Effectively, hashing requires twice the associativity to not shoot
    itself in the foot in the worst case, whereas modulo is merely slightly worse in the average case.

    I think clustering give the table more associativity. Up to eight VAs
    can all hash to the same cluster. Almost as if it were eight-way associative.

    I think clustering is key to using an inverted hash-table. I got this
    from studying the PowerPC.


    Ironically, also means that while small-scale alignment is often preferable, medium and large scale alignment is best minimized (or, IOW, ASLR can actually be beneficial for performance by reducing unrelated
    data stomping the same locations in the caches and TLB).


    Though, there are intermediate options, say, for a TLB, XOR'ing the
    index with the ASID for non-global pages, *1; but then this means
    needing to partition the TLB between global and local (halving associativity relative to LUT budget), in turn still favoring the use of modulo addressing.


    Granted, a hashing scheme and multiple probing steps could be possible
    to increase effective associativity at the cost of latency (say, in an extreme case, using latency tricks to cause a 2-way TLB to pretend to be
    4 or 8 way).

    Not too worried about latency as long as it is reasonable. Every step is another eight ways associative due to clustering.

    Be careful !!

    Matrix300 (SPEC89) uses all 4 transposes of DGEMM. The 4th transpose takes
    a TLB miss every 4th Load; or once every 2-cycles on the 6-wide machine.
    A 32-entry fully associative TLB would miss every other cycle, while a 256-entry Direct Mapped TLB would never take a miss.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sat Jan 31 12:31:36 2026
    From Newsgroup: comp.arch

    Robert Finch wrote:
    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for
    a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPN-> to the
    same PPN are possible. Is it the freeing up of physical pages in SW that
    cause an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages can point to a single physical page using a a hash table. Is it just a performance issue? ASID is part of the hash/compare.

    Be careful about ASID's - they aren't just part of the virtual address
    in MMU's like x86. There is a logic aspect in the address compare.
    If the PTE Global flag is set then ASID is NOT included in the VA compare,
    if the PTE Global flag is clear then ASID is included in the VA compare.

    If the G flag is set on OS kernel pages then the effect is to make the
    kernel virtual pages one single address space on all cores with no
    associated ASID, while process virtual spaces reside in their own ASID's.

    The above is easy to implement in a CAM because the Global flag is
    part of the PTE entry. Compare the virtual page number (VPN) and if
    there is a PTE match, then if the PTE.G flag is set take ASID as matched, otherwise check if ASID is matched.

    But if you just tack the ASID onto the VA to get a larger VA and
    use that as an index then you don't get the same logic because
    the PTE.G flag that tells you whether ASID matters for that PTE,
    is on the PTE that you are looking up.
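
    In C terms, the CAM-style compare being described is roughly as follows
    (field names are illustrative):

    #include <stdint.h>

    typedef struct {
        uint64_t vpn;      /* virtual page number tag          */
        uint16_t asid;     /* owning address-space ID          */
        uint8_t  global;   /* PTE.G: match regardless of ASID  */
        uint64_t pte;
    } tlb_entry;

    /* ASID only participates in the compare when the entry is not global. */
    static inline int tlb_match(const tlb_entry *e, uint64_t vpn, uint16_t cur_asid)
    {
        if (e->vpn != vpn)
            return 0;
        return e->global || e->asid == cur_asid;
    }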

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is walked on a miss instead of walking main memory. Page faults would be handled the same way.

    Yes.
    Long ago I read a paper about a port of Linux to a PowerPC with its IVT.
    Linux memory management is based on a hierarchical page table (HPT) so
    rather than rewrite the memory manager they treated the IVT as a large TLB.
    As pages were touched they copied the PTE's from the HPT into IVT.
    When there was a process switch they purged the IVT of all entries
    the same way one might purge a TLB. Later they added ASID's so they
    didn't have to purge IVT on address space switch.

    The code that copied the PTE's from HPT to IVT was very similar
    to a SW managed TLB, as VM Virtual Machines also do if they don't
    have nested page tables.

    Also you can't guarantee that your whole program will fit into a single
    inverted/hash table of a predetermined size because you don't know in
    advance which entries will collide and push other entries out.
    Even if it does fit today, any change to the program could break that.


    The table is clustered, the hash points to eight entries which are then searched in parallel for the correct one. If not found a table walk
    begins using quadratic open addressing.



    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for
    instance.

    A hierarchical page table is a better solution, but I was looking for
    something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.

    What is your MMU currently doing that uses 9000 LUTs?

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sat Jan 31 16:29:07 2026
    From Newsgroup: comp.arch

    EricP wrote:
    Robert Finch wrote:
    ??? I think a hash table has this characteristic. Multiple VA pages
    can point to a single physical page using a a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.

    Be careful about ASID's - they aren't just part of the virtual address
    in MMU's like x86. There is a logic aspect in the address compare.
    If the PTE Global flag is set then ASID is NOT included in the VA compare,
    if the PTE Global flag is clear then ASID is included in the VA compare.

    If the G flag is set on OS kernel pages then the effect is to make the
    kernel virtual pages one single address space on all cores with no
    associated ASID, while process virtual spaces reside in their own ASID's.

    The above is easy to implement in a CAM because the Global flag is
    part of the PTE entry. Compare the virtual page number (VPN) and if
    there is a PTE match, then if the PTE.G flag is set take ASID as matched, otherwise check if ASID is matched.

    But if you just tack the ASID onto the VA to get a larger VA and
    use that as an index then you don't get the same logic because
    the PTE.G flag that tells you whether ASID matters for that PTE,
    is on the PTE that you are looking up.

    One could achieve the same effect as the CAM approach by picking off the
    range of addresses assigned to the OS and assigning them to ASID = 0,
    and having the process range use the current non-zero ASID register.

    If we have a 64-bit VA with 4kB pages = 52-bit virtual page number
    and we added a 12-bit ASID then that gives a 76-bit virtual space
    with a 64-bit AS-VPN to look up. We reserve ASID=0 for the OS,
    1..4095 are for processes.

    The MMU has a register MMU_OS_START that holds the OS unsigned lower
    start address and a comparator. If the VA is >= MMU_OS_START then
    it uses an ASID = 0, otherwise it comes from the MMU_ASID register.

    This 12-bit value is concatenated with the 52-bit VPN to give a
    64-bit combined ASID-VPN index value to lookup in the TLB.

    The OS software that assigns ASIDs also skips assigning 0 to any process.
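
    As a C sketch of that index construction (the register names follow the
    description above; the shift amounts assume the 4kB-page, 52-bit-VPN,
    12-bit-ASID split given there):

    #include <stdint.h>

    extern uint64_t MMU_OS_START;   /* unsigned lower bound of the OS range */
    extern uint16_t MMU_ASID;       /* current process ASID, 1..4095        */

    #define PAGE_SHIFT 12u          /* 4kB pages -> 52-bit VPN              */

    /* Build the 64-bit ASID-VPN lookup index: OS-range addresses get ASID 0. */
    static inline uint64_t asid_vpn_index(uint64_t va)
    {
        uint64_t asid = (va >= MMU_OS_START) ? 0u : (uint64_t)MMU_ASID;
        uint64_t vpn  = va >> PAGE_SHIFT;        /* 52-bit virtual page number */
        return (asid << 52) | vpn;               /* 12-bit ASID ++ 52-bit VPN  */
    }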



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Jan 31 17:03:54 2026
    From Newsgroup: comp.arch

    On 2026-01-31 12:31 p.m., EricP wrote:
    Robert Finch wrote:
    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch  <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS
    support
    is there for them?

    IVTs are a terrible idea.  Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

        - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for
    a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPN-> to the
    same PPN are possible. Is it the freeing up of physical pages in SW
    that
    cause an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages
    can point to a single physical page using a a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.

    Be careful about ASID's - they aren't just part of the virtual address
    in MMU's like x86. There is a logic aspect in the address compare.
    If the PTE Global flag is set then ASID is NOT included in the VA compare,
    if the PTE Global flag is clear then ASID is included in the VA compare.

    If the G flag is set on OS kernel pages then the effect is to make the
    kernel virtual pages one single address space on all cores with no
    associated ASID, while process virtual spaces reside in their own ASID's.

    A global bit is not being used. The address space is divided in two by
    the MSB. The kernel space has the MSB set and is unmapped. Previously I
    used an ASID of zero to represent the kernel address space. The ASID is
    not being used as an index.

    The above is easy to implement in a CAM because the Global flag is
    part of the PTE entry. Compare the virtual page number (VPN) and if
    there is a PTE match, then if the PTE.G flag is set take ASID as matched, otherwise check if ASID is matched.

    But if you just tack the ASID onto the VA to get a larger VA and
    use that as an index then you don't get the same logic because
    the PTE.G flag that tells you whether ASID matters for that PTE,
    is on the PTE that you are looking up.

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is
    walked on a miss instead of walking main memory. Page faults would be
    handled the same way.

    Yes.
    Long ago I read a paper about a port of Linux to a PowerPC with its IVT. Linux memory management is based on a hierarchical page table (HPT) so
    rather that rewrite the memory manager they treated the IVT as a large TLB. As pages were touched they copied the PTE's from the HPT into IVT.
    When there was a process switch they purged the IVT of all entries
    the same way one might purge a TLB. Later they added ASID's so they
    didn't have to purge IVT on address space switch.

    The code that copied the PTE's from HPT to IVT was very similar
    to a SW managed TLB, as VM Virtual Machines also do if they don't
    have nested page tables.

    I am not sure about getting Linux working. I may just use my own OS.
    Skimming through the mmap() code has given me an idea of just how
    complex the software is.

    Also you can't guarantee that your whole program will fit into an single inverted/hash table of a predetermined size because you don't know in
    advance which entries will collide and push other entries out.
    Even if it does fit today, any change to the program could break that.

    Collisions will not push other entries out; they just bounce to a
    different empty entry. If one cannot be found then it is an out-of-memory.

    Testing some. Testing just allocates pages without freeing them so it is
    a bit unrealistic.
    The IVT stopped after 63 bounces at 69% full. (A max of 64 bounces is
    set). Average number of bounces: 1.8 per lookup. This was with random
    ASIDs and addresses.
    The table can be clocked at twice the CPU rate.

    I have been wondering how to collect and use stats to control things.

    The entries in the IVT for ASIDs no longer in use could be cleared automatically by a background HW process.


    The table is clustered, the hash points to eight entries which are
    then searched in parallel for the correct one. If not found a table
    walk begins using quadratic open addressing.



    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for
    instance.

    A hierarchical page table is a better solution, but I was looking for
    something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.

    What is your MMU currently doing that uses 9000 LUTs?

    I tried to over build it originally. It currently has a triple ported L2
    TLB so that L2 TLB entries can be used directly to form addresses on a
    miss. This trims at least one clock cycle off the miss time for L1. The
    L1 TLB is fully associative and triple ported as well. L2 is also split
    into two TLBs one for regular pages and a second TLB for huge pages.
    There is a triple-ported miss queue. The TLB hit logic is a bit scary.
    Then there is the logic to update the accessed and modified flags of the
    TLB entry.

    I have been working on the TLB to make L2 single ported and the miss
    queue single ported as well. That should reduce both the LUT cost and
    BRAM cost. It will reduce performance a little bit if there are
    concurrent misses happening.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Sun Feb 1 15:28:47 2026
    From Newsgroup: comp.arch

    In article <10lji2u$2haq2$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    On 2026-01-30 4:44 p.m., BGB wrote:
    On 1/30/2026 3:03 PM, Robert Finch wrote:

    [ Proposal is a hashed page table, with 2x the entries to cover the entire physical address space, with each hash entry having 8 pages which need to
    be searched before moving to the next entry].

    It isn't obvious (at least to me) how what you are describing is much
    different from a SW managed TLB.

    It is not a lot different. On a translation miss the table is walked
    with HW though instead of SW walking main memory. There are enough
    entries in the table to cover the physical address range. Only page
    faults are handled in SW which should be rare. A page fault is generated
    after 30 steps in the walk though. That is a miss on 240 possible
    locations. 240-way associativity is not enough?
    The page fault handler will like just report 'out-of-memory' then quit
    the app.

    I'm going to guess your physical address space is just 32-bits, otherwise
    this table is huge. I think you said you have 64KB pages (but then you said 256KB pages--which is just too big, so I'll ignore that), which for a
    44-bit physical address space would mean 2^28 entries, with each entry
    being 128-bits, for a 4GB page table. But for 32-bit, it's just 2^16
    entries of 64-bits each.

    A page table which doesn't scale is problematic. Especially if it needs
    OS support. Who wants to invest effort in a dead-end idea?

    Your hashed page table has issues which I think you should address:

    1) Supporting multiple page sizes is a pain. If you have 2MB pages and
    64KB pages, there are 32 64KB pages for each 2MB page. So do you
    lazily load the hashtable only on misses, or load all 32 aliases
    at once?

    2) It's common to just do a single hash lookup, then trap to software. Doing
    probing is a big pain. The problems come from removing entries.
    If you page out a page, it leaves gaps in the table. When you do
    a hardware lookup, and find no valid pages at the first entry, this
    doesn't tell you anything! The obvious solution is ALL hardware
    lookups must do the maximum searching before declaring a hash-table
    miss--which you say is 30 steps, checking 240 entries. I'm not sure
    there's a better solution than that, but this is a pretty big
    penalty for a TLB miss.

    3) It's very complex to efficiently update the hash table. On inserting
    a new entry, where do you put it? On one hand, a "move-to-front"
    strategy seems like a good idea--newly added pages seem like they
    should get lookup priority. But how do you do this? It looks like
    each software handled miss has a TON of work to do. It seems like
    you need to re-hash each entry you consider moving, to make sure it
    stays on its proper hash chain. And keeping the OS page structures
    in sync looks like even more work.

    4) The global nature of the hash table means multiple threads probably need
    a lock to change it. So misses to software have to get a lock before
    making any hashtable changes. This is very expensive. This is
    especially bad when a process ends--all of its translations need to
    be removed, and so all page table misses are blocked while that
    happens. This is bad.

    A simpler approach is a per-process hash table of variable size in memory, sized to at least 2X the number of pages being used. It's indexed by a hash
    of ASID+VA, and does a single probe. It might be useful to do two hash
    tables: one for large pages, and one for small pages (with 2 lookups).
    On any miss, the hash entry to be updated is easy to find for software.
    But: there's effectively no associativity--but just increase the hash table's size if software detects that you are getting conflicts. Pick a good
    hash function to avoid stupid conflicts. I recommend you make the hash function an instruction, so software can calculate it easily (so if you
    reverse bits and do other stuff that may be hard for software to calculate,
    you make it easy).
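
    A minimal C sketch of that per-process, single-probe table (the hash mix
    and field names are placeholders; the point is that software can compute
    the same index the hardware does):

    #include <stdint.h>

    typedef struct { uint64_t va_tag; uint64_t pte; } hpt_entry;

    typedef struct {
        hpt_entry *table;   /* per-process, sized >= 2x the pages in use */
        uint64_t   mask;    /* entries - 1, entries a power of two       */
    } hpt;

    /* The hash the ISA would expose as an instruction; this particular mix
       is only an illustrative placeholder. */
    static inline uint64_t hpt_hash(uint16_t asid, uint64_t vpn)
    {
        uint64_t x = vpn ^ ((uint64_t)asid << 1);
        x ^= x >> 17;
        x *= 0x9E3779B97F4A7C15ull;
        return x;
    }

    /* Single probe: hit returns the PTE, miss traps to software, which
       knows exactly which slot it will need to fill. */
    static inline int hpt_lookup(const hpt *t, uint16_t asid, uint64_t vpn,
                                 uint64_t *pte_out)
    {
        hpt_entry *e = &t->table[hpt_hash(asid, vpn) & t->mask];
        if (e->va_tag == vpn) { *pte_out = e->pte; return 1; }
        return 0;
    }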

    This gives you the benefits of a hardware table walker with less trouble.

    An advantage of hashed page tables is the TLB itself has no "intermediate" entries. This lets the primary TLB hold more useful translations. This
    is not much of an issue if the primary TLB is >= 256 entries, but it's
    really noticeable if the TLB is <= 32 entries. You need to lean in to
    hashed page tables' advantages.

    Kent
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Feb 1 12:49:53 2026
    From Newsgroup: comp.arch

    Robert Finch wrote:
    On 2026-01-31 12:31 p.m., EricP wrote:
    Robert Finch wrote:

    The hash table is only slightly different than a giant TLB. The TLB
    is walked on a miss instead of walking main memory. Page faults would
    be handled the same way.

    Yes.
    Long ago I read a paper about a port of Linux to a PowerPC with its IVT.
    Linux memory management is based on a hierarchical page table (HPT) so
    rather that rewrite the memory manager they treated the IVT as a large
    TLB.
    As pages were touched they copied the PTE's from the HPT into IVT.
    When there was a process switch they purged the IVT of all entries
    the same way one might purge a TLB. Later they added ASID's so they
    didn't have to purge IVT on address space switch.

    The code that copied the PTE's from HPT to IVT was very similar
    to a SW managed TLB, as VM Virtual Machines also do if they don't
    have nested page tables.

    I am not sure about getting Linux working. I may just use my own OS. Skimming through the mmap() code has given me an idea of just how
    complex the software is.

    It was just an example of how one can use a page table to emulate
    any other kind of page table.

    Also you can't guarantee that your whole program will fit into an single
    inverted/hash table of a predetermined size because you don't know in
    advance which entries will collide and push other entries out.
    Even if it does fit today, any change to the program could break that.

    Collisions will not push other entries out, they just bounce to a
    different empty entry. If one cannot be found then it is an out-of-memory.

    Testing some. Testing just allocates pages without freeing them so it is
    a bit unrealistic.
    The IVT stopped after 63 bounces at 69% full. (A max of 64 bounces is
    set). Average number of bounces: 1.8 per lookup. This was with random
    ASIDs and addresses.
    The table can be clocked at twice the CPU rate.

    One might make the hash function controllable so you can optimize the
    bounce length for different programs. Also the max bounce limit should
    be a dynamic HW register and not hard wired.

    I have been wondering how to collect and use stats to control things.

    The entries in the IVT for ASIDs no longer in use could be cleared automatically by a background HW process.

    That implies some kind of ASID allocator to track which are in-use.
    It also implies a whole IVT scan as each ASID is freed,
    or maintaining a pending free list and a scavenger process.

    Simpler to have process creation assign a new ASID by incrementing
    an unsigned up counter that starts at 1. When you hit its max then reset
    to 1 and clear the whole IVT for all non-zero ASID entries in one pass.
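
    That scheme is only a few lines of software; in this C sketch the 12-bit
    ASID width and the ivt_clear_nonzero_asids() helper are assumptions:

    #include <stdint.h>

    #define ASID_MAX 4095u            /* assumed ASID width              */

    static uint32_t next_asid = 1;    /* 0 stays reserved for the kernel */

    extern void ivt_clear_nonzero_asids(void);   /* one pass over the IVT */

    /* Called at process creation: hand out ASIDs from an up-counter and
       wipe the IVT only when the counter wraps. */
    uint32_t alloc_asid(void)
    {
        if (next_asid > ASID_MAX) {
            ivt_clear_nonzero_asids();
            next_asid = 1;
        }
        return next_asid++;
    }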


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Feb 1 16:41:34 2026
    From Newsgroup: comp.arch

    On 2026-02-01 10:28 a.m., Kent Dickey wrote:
    In article <10lji2u$2haq2$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    On 2026-01-30 4:44 p.m., BGB wrote:
    On 1/30/2026 3:03 PM, Robert Finch wrote:

    [ Proposal is a hashed page table, with 2x the entries to cover the entire physical address space, with each hash entry having 8 pages which need to
    be searched before moving to the next entry].

    It isn't obvious (at least to me) how what you are describing is much
    different from a SW managed TLB.

    It is not a lot different. On a translation miss the table is walked
    with HW though instead of SW walking main memory. There are enough
    entries in the table to cover the physical address range. Only page
    faults are handled in SW which should be rare. A page fault is generated
    after 30 steps in the walk though. That is a miss on 240 possible
    locations. 240-way associativity is not enough?
    The page fault handler will like just report 'out-of-memory' then quit
    the app.

    I'm going to guess your physical address space is just 32-bits, otherwise

    Yes, this is for a 29-bit DRAM address space. I was going to use 2^13
    entries with 256kB (or possibly smaller) pages. It is only usable with a
    smaller address space, where the table can fit into BRAM.
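
    For reference, the sizing works out roughly as follows (the 16-byte
    entry width is an assumption, not something stated here):

      29-bit PAS              = 512 MB of DRAM
      512 MB / 256 kB pages   = 2048 physical page frames
      2^13 table entries      = 8192 entries, about 4x the frame count
      8192 entries * 16 B     = 128 kB of BRAM (assuming 16-byte entries)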

    this table is huge. I think you said you have 64KB pages (but then you said 256KB pages--which is just too big, so I'll ignore that), which for a
    44-bit physical address space would mean 2^28 entries, with each entry
    being 128-bits, for a 4GB page table. But for 32-bit, it's just 2^16
    entries of 64-bits each.

    A page table which doesn't scale is problematic. Especially if it needs
    OS support. Who wants to invest effort in a dead-end idea?

    Your hashed page table has issues which I think you should address:

    1) Supporting multiple page sizes is a pain. If you have 2MB pages and
    64KB pages, there are 32 64KB pages for each 2MB page. So do you
    lazily load the hashtable only on misses, or load all 32 aliases
    at once?

    Supporting multiple page sizes is a pain even for a hierarchical table.
    Multiple TLBs are used for that, so maybe multiple hash tables could be
    used. But I was not planning on supporting more than one page size.


    2) It's common to just do a single hash lookup, then trap to software. Doing
    probing is a big pain. The problems come from removing entries.
    If you page out a page, it leaves gaps in the table. When you do
    a hardware lookup, and find no valid pages at the first entry, this
    doesn't tell you anything! The obvious solution is ALL hardware

    Thanks, I had forgotten the issue with removes. Inserts are easy without
    removes. I wonder if the removes could be done in bulk, similar to
    garbage collection: just mark the entry as deleted rather than leaving a
    gap, then have a process that periodically rebuilds the table. For the
    table size I am planning on, garbage collecting it may not be too bad.

    The table probes until it finds the entry, or an empty entry in a group
    where the lookup should be. It stops after a certain number of probes if
    there are no empty entries in the group.

    On a miss, the lookup returns the group it stopped at and a flag saying
    whether that group has an empty slot, so adding a new entry should be
    straightforward.
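
    As a software model of the probe loop (my own guesses for the details:
    8-entry groups, quadratic stepping, and a "deleted" tombstone state per
    the garbage-collection idea above; not the actual RTL):

    /* Software model of the clustered lookup: hash to a group of eight
       entries, search the group (in HW this search is parallel), then
       quadratic-step to the next group, up to a bounce limit.  Sizes,
       hash and entry layout are illustrative only.                     */
    #define GROUP_SIZE   8
    #define NUM_GROUPS   1024              /* illustrative */
    #define MAX_PROBES   30

    enum { EMPTY, VALID, DELETED };        /* DELETED = tombstone, not a gap */

    typedef struct { int state; unsigned asid, vpn, ppn; } pte_t;
    static pte_t table[NUM_GROUPS][GROUP_SIZE];

    /* Returns 1 and *ppn on a hit.  On a miss, *group_out / *has_empty
       tell software which group it stopped at and whether a free slot
       exists there, which is what the insert path needs.               */
    int ivt_lookup(unsigned asid, unsigned vpn, unsigned *ppn,
                   unsigned *group_out, int *has_empty)
    {
        unsigned g = (asid ^ vpn) % NUM_GROUPS;        /* illustrative hash */
        for (unsigned probe = 0; probe < MAX_PROBES; probe++) {
            *group_out = g;
            *has_empty = 0;
            for (int i = 0; i < GROUP_SIZE; i++) {
                pte_t *e = &table[g][i];
                if (e->state == VALID && e->asid == asid && e->vpn == vpn) {
                    *ppn = e->ppn;
                    return 1;                          /* hit */
                }
                if (e->state == EMPTY)
                    *has_empty = 1;
            }
            if (*has_empty)
                return 0;    /* empty slot where the entry should be: miss */
            g = (g + (probe + 1) * (probe + 1)) % NUM_GROUPS;  /* quadratic */
        }
        return 0;            /* bounce limit reached: raise a page fault */
    }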

    lookups must do the maximum searching before declaring a hash-table
    miss--which you say is 30 steps, checking 240 entries. I'm not sure
    there's a better solution than that, but this is a pretty big
    penalty for a TLB miss.

    30 steps (in HW) is about as fast as a divide operation. It only works
    because the table is in BRAM; if it were DRAM it would be many clocks
    per lookup.


    3) It's very complex to efficiently update the hash table. On inserting
    a new entry, where do you put it? On one hand, a "move-to-front"
    strategy seems like a good idea--newly added pages seem like they
    should get lookup priority. But how do you do this? It looks like
    each software handled miss has a TON of work to do. It seems like

    I was not planning on anything that complex. The miss returns the group
    at which the entry should be located, so SW should know where to insert.
    For testing, the test bench fakes inserting pages. The issues arise when
    the table gets full.

    you need to re-hash each entry you consider moving, to make sure it
    stays on its proper hash chain. And keeping the OS page structures
    in sync looks like even more work.

    4) The global nature of the hash table means multiple threads probably need
    a lock to change it. So misses to software have to get a lock before
    making any hashtable changes. This is very expensive. This is
    especially bad when a process ends--all of its translations need to
    be removed, and so all page table misses are blocked while that
    happens. This is bad.

    Definitely an issue. Maybe only a single system thread would be allowed
    to update it.


    A simpler approach is a per-process hash table of variable size in
    memory, sized to at least 2X the number of pages being used. It's
    indexed by a hash of ASID+VA, and does a single probe. It might be
    useful to do two hash tables: one for large pages, and one for small
    pages (with 2 lookups).
    On any miss, the hash entry to be updated is easy to find for software.
    But: there's effectively no associativity--but just increase the hash
    table's size if software detects that you are getting conflicts. Pick a
    good hash function to avoid stupid conflicts. I recommend you make the
    hash function an instruction, so software can calculate it easily (so if
    you reverse bits and do other stuff that may be hard for software to
    calculate, you make it easy).
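
    To make that concrete, a sketch of the single-probe lookup; the hash
    shown is made up (the point of exposing the hash as an instruction is
    that software computes exactly what the hardware does):

    /* Single-probe, per-process hash table along the lines described
       above: power-of-two size, at least 2x the resident page count.
       hash_asid_va() stands in for the proposed hash instruction; the
       mix used here is only an example.                               */
    typedef struct { unsigned asid, vpn, ppn, flags; } hpte_t;

    static unsigned hash_asid_va(unsigned asid, unsigned vpn)
    {
        unsigned h = vpn ^ (asid * 0x9E3779B9u);       /* example mix */
        return h ^ (h >> 16);
    }

    /* One probe: either a hit, or trap to software -- no HW chaining. */
    int hpt_lookup(const hpte_t *table, unsigned size_log2,
                   unsigned asid, unsigned vpn, unsigned *ppn)
    {
        const hpte_t *e =
            &table[hash_asid_va(asid, vpn) & ((1u << size_log2) - 1)];
        if (e->asid == asid && e->vpn == vpn) {
            *ppn = e->ppn;
            return 1;
        }
        return 0;            /* miss: software refills this exact slot */
    }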

    This gives you the benefits of a hardware table walker with less trouble.

    An advantage of hashed page tables is the TLB itself has no
    "intermediate" entries. This lets the primary TLB hold more useful
    translations. This is not much of an issue if the primary TLB is >= 256
    entries, but it's really noticeable if the TLB is <= 32 entries. You
    need to lean into the advantages of hashed page tables.

    Kent

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Feb 1 16:12:34 2026
    From Newsgroup: comp.arch

    On 1/30/2026 7:36 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 4:44 p.m., BGB wrote:
    On 1/30/2026 3:03 PM, Robert Finch wrote:
    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch  <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea.  Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

        - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for
    a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPNs mapping to
    the same PPN are possible. Is it the freeing up of physical pages in SW
    that causes an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages
    can point to a single physical page using a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is
    walked on a miss instead of walking main memory. Page faults would be
    handled the same way.

    The table is clustered, the hash points to eight entries which are
    then searched in parallel for the correct one. If not found a table
    walk begins using quadratic open addressing.


    It isn't obvious (at least to me) how what you are describing is much
    different from a SW managed TLB.

    It is not a lot different. On a translation miss the table is walked
    with HW though instead of SW walking main memory. There are enough
    entries in the table to cover the physical address range. Only page
    faults are handled in SW which should be rare. A page fault is generated
    after 30 steps in the walk though. That is a miss on 240 possible
    locations. 240-way associativity is not enough?
    The page fault handler will likely just report 'out-of-memory' and then
    quit the app.


    Errm, now I am feeling uncertain about how viable implementing this
    would be...


    Sometimes, it ends up being better to implement the minimum complexity mechanism that can work, and only implement a more complex mechanism if
    there is a significant drawback or inefficiency with the simpler
    mechanism that the more complex mechanism could address.


    Like, I can note my use of a custom RP2 compression scheme:
    It is a byte oriented LZ variant:
    Typically slightly better than LZ4 for general data;
    Typically slightly worse than LZ4 for program binaries.


    I have one use-case where I use RP2:
    For compressing voxel chunks in my BT3 3D engine.

    Each chunk is 16*16*16 blocks, with one of several in-RAM storage
    schemes (selection sketch below):
    Single block type (no index);
    2-16 block types, 4-bit index, 2K/chunk;
    17-256 block types, 8-bit index, 4K/chunk;
    257-4096 block types: no index, 32K/chunk.
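
    A sketch of how the storage scheme could be picked from the number of
    distinct block types; the 8-byte full cell behind the 32K figure is my
    inference, not something stated:

    /* Chunk is 16*16*16 = 4096 cells; pick the index width from the
       number of distinct block types.  Sizes match the list above; the
       8-byte full cell behind the 32K figure is an assumption.         */
    enum chunk_mode { MODE_SINGLE, MODE_IDX4, MODE_IDX8, MODE_FULL };

    enum chunk_mode pick_chunk_mode(int ntypes)
    {
        if (ntypes <= 1)   return MODE_SINGLE;  /* no index at all         */
        if (ntypes <= 16)  return MODE_IDX4;    /* 4096 * 4 bits  =  2 KB  */
        if (ntypes <= 256) return MODE_IDX8;    /* 4096 * 8 bits  =  4 KB  */
        return MODE_FULL;                       /* 4096 * 8 bytes = 32 KB  */
    }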

    So, for storage, it flattens out the chunk:
    Simple header;
    List of block types;
    Index data.

    Then RP2 compresses the chunk.

    The previous engine (BT2) had used a LZ+STF+AdRice scheme here, but this
    is slower than a byte-oriented scheme (like RP2 or LZ4), and the
    compression delta is small. So, BT3 originally went with RP2 mostly as
    it was faster.

    Technically, either LZ4 or RP2 could be used, but RP2 notably beats LZ4
    here, as the smallest run encoding ends up being used heavily:
    dddddddd-dlllrrr0:
    0-7 literal bytes, 3-10 matched bytes, distance=0..511
    A distance of 0 is invalid, so it can be used as an escape code.
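
    For illustration, decoding just that 16-bit tag might look as follows;
    the exact bit packing (low bit as the format flag, fields in the order
    shown) is my reading of the notation, not a published spec:

    /* Decode the smallest RP2 pair format, "dddddddd-dlllrrr0".
       Assumed layout: bit 0 = 0 tags this format, bits 1..3 = match
       length - 3, bits 4..6 = literal count, bits 7..15 = distance
       (distance 0 acts as the escape code).                          */
    #include <stdint.h>

    struct rp2_run { int nlit, nmatch, dist, is_escape; };

    struct rp2_run rp2_decode_short(uint16_t tag)
    {
        struct rp2_run r;
        r.nmatch    = ((tag >> 1) & 0x7) + 3;   /* 3..10 matched bytes */
        r.nlit      = (tag >> 4) & 0x7;         /* 0..7 literal bytes  */
        r.dist      = (tag >> 7) & 0x1FF;       /* 0..511              */
        r.is_escape = (r.dist == 0);
        return r;
    }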

    The next encoding size up is 3 bytes: 4-67 match bytes, with up to an 8K
    distance. This seems to be the main difference with program binaries,
    which skew more towards short matches at long distances, which favor LZ4.

    RP2 though has a stronger advantage when the input data is under 8K.
    Though, it is possible that a 35-byte match over a 16K distance might
    have been better on average (possibly with another match type for "long
    match at short distance", or RLE-like patterns).



    I recently evaluated a possible trick of bolting on STF+AdRice after the
    fact (on top of RP2) as a sort of post-compressor entropy stage (jank,
    but limits computational complexity on the encoder side).

    Had then noted a pattern (based on the RP2-compressed size):
    Under 512 bytes: Entropy stage typically makes it bigger.
    This was the dominant size class here.
    1-2K: Entropy helps slightly, but fails to cross a "useful threshold".
    4K+: Mostly unseen.

    Then went and grabbed a bitwise range coder, and tried using this as a
    test, along with something resembling a "mini-LZMA" (LZ + bitwise range
    coder, along similar lines to LZMA).

    Result: none crossed the threshold...
    The threshold: give an additional 20% size reduction over the use of RP2.

    Might at least get used sometimes if the threshold were 5% or 10%, but
    don't really want to spend the comparably steep cost of an entropy coder
    for a mere 5% or 10% gain.

    Why not Huffman (or Deflate)?
    Because, ironically, for small highly compressible data, Deflate
    actually turns out to be worse here than RP2.

    Here, one wants the entropy-coded compressors to be more, "break glass
    in case of compression emergency".



    Though, that said, bolting an LZMA style range coder onto RP2 would
    still appear to be a moderately strong possibility (in terms of
    compression), even if bitwise range coding is almost impractically slow
    (on a desktop PC, one is hard pressed to push more than around 50 MB/sec through a range coder).


    Though, that said, there is sometimes something in my mind that seems to
    be trying to yell out that the LZMA range coder design holds a clue for greatly increasing the efficiency and learning rate of neural nets (but
    the rest of my mind can't figure out how to combine them; and both
    genetic algorithms and back-propagation leave something to be desired...).

    If anything, one could try to go the other way and use a neural-net as a predictive model for a range-coder, but then one would merely be burning
    CPU time.

    Though, did note a phenomenon with backprop neural nets where they were
    using weakly-biased neurons as a sort of memory between training steps
    (in the case of repeating test cases, it would "memorize" the answer and
    then use it to cheat on the following run when given the same inputs).
    Which is in some sense analogous to the normal behavioral pattern in a range-coder.

    Or, basically, the training process was working like "flash cards" and
    when it would get an answer wrong, the training algo would note it and
    then present the same inputs again a few passes later. The NN was then
    effectively remembering the inputs and the output adjustment and then
    repeating them again (without actually learning anything), and the net
    seemingly
    got pretty good at this (not learning the actual task, just sort of
    optimizing itself to be able to cheat at the NN training algorithm's use
    of repetition on wrong answers).


    Where, in my NN experiments, it would appear that near-0 behavior for
    weights may be more relevant than previously expected (and handling them
    in a similar way to DEC/PDP style floating-point may not be ideal). Even
    if it mostly reflects a use-case that is an adverse edge case in NN
    training (but, in a dynamically adapted net, could potentially be a useful/desirable property; working as a sort of working memory; well,
    along with feeding part of the activation outputs back into the inputs,
    etc).

    Well, also in range-coder predictors, the relative adjustment strength
    depends on which direction one is moving and where one is in the space.
    If near 0, then movements away from 0 are rapid, but as one gets
    further away, it gets harder to make an increment (at the upper
    extremes, it becomes impossible to move further). The closer one is
    to 0, the harder it is to approach 0. Though this is likely an
    artifact of the distortion of what linear adjustments look like within a non-linear space.
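
    For concreteness, the standard LZMA-style adaptive bit model behaves
    exactly this way (the step is proportional to the distance from the
    extreme being moved toward); a generic sketch, not my actual code:

    /* LZMA-style adaptive bit model: p is the probability (out of 2048)
       that the next bit is 0.  The step toward either extreme shrinks in
       proportion to the remaining distance, which is the behavior
       described above.                                                  */
    #include <stdint.h>

    #define PROB_ONE    2048u      /* 11-bit fixed point */
    #define ADAPT_SHIFT 5          /* typical LZMA adaptation rate */

    static void bitmodel_update(uint16_t *p, int bit)
    {
        if (bit == 0)
            *p += (uint16_t)((PROB_ONE - *p) >> ADAPT_SHIFT);  /* p moves up   */
        else
            *p -= (uint16_t)(*p >> ADAPT_SHIFT);               /* p moves down */
    }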

    Comparably, the curve of FP values is steeper than that of RC predictors (exponential rather than square).

    Well, some of this could imply using some lookup-tables for weight
    adjustment in the backprop stage (similar to those used for a
    range-coder). Might also need to consider using IEEE-style denormals for
    FP8.

    Still feels as if I am missing something.


    Well, there is still the issue that I needed to run the NN training
    using FP16 weights internally, as directly training the FP8 weights
    wasn't stable enough. So, part of the issue is figuring out stable live
    adaptation on FP8 weights (which does kinda resemble the range-coder
    scenario). Well, and that (post backprop) the weight-update in this case
    is mostly XOR'ing the sign-bits to figure out which direction to adjust
    the weight (likewise, also resembling the RC update scenario), though
    skipping weight adjustment when the error is 0 (and skipping downward
    adjustment when the weight is already 0, etc).
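
    As a toy version of that update rule (sign-magnitude FP8 bytes, one ULP
    per step; which sign bits get XOR'd, and the sign convention, are
    guesses on my part):

    /* Toy version of the described FP8 nudge: sign-magnitude byte
       (bit 7 = sign, bits 0..6 = magnitude).  Direction comes from
       XOR'ing sign bits (here: input activation vs propagated error;
       which signs, and the convention, are guesses).  No update on a
       zero error, and no downward step once the magnitude is zero.   */
    #include <stdint.h>

    static uint8_t fp8_nudge(uint8_t w, uint8_t act, uint8_t err)
    {
        if ((err & 0x7F) == 0)
            return w;                          /* error == 0: skip        */
        uint8_t dir = (uint8_t)((act ^ err) & 0x80);   /* sign of the step */
        uint8_t mag = (uint8_t)(w & 0x7F);
        if (dir == (w & 0x80))                 /* step away from zero      */
            return (mag < 0x7F) ? (uint8_t)(w + 1) : w;
        if (mag == 0)
            return w;                          /* already zero: skip down  */
        return (uint8_t)(w - 1);               /* step toward zero         */
    }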


    Issue though seems to be that while FP8 works well enough for forward inference, the direct jumps between FP8 values are too large for
    backprop to work with effectively, and it essentially ends up needing
    some sub-ULP bits for finer adjustment (leading to the issue of using
    FP16 for the backprop stage; which would limit practicality for a live adaptation net).

    Also, would need to figure out a good way to replace "correct/incorrect" training with something more like "pain/no-pain" adaptation, but with
    "pain driven adaptation" one is left merely to do backprop using the
    inverse of the previous outputs as the error vector, which is far less effective (and not clearly much better than just using genetic
    algorithms as the training process; except that GAs limit scenarios a
    lot more than the use of positive and negative reinforcement as a
    training algorithm).

    But, the downside of positive/negative reinforcement is getting
    something much beyond simplistic insect-like behaviors.

    ...



    Main difference I think was that I ended up mostly settling on modulo
    addressing rather than hashing, but the reasoning here is more subtle:
    while on average hashing is better than modulo addressing, the worst
    case behavior for hashing is worse than the worst case for modulo.

    Effectively, hashing requires twice the associativity to not shoot
    itself in the foot in the worst case, whereas modulo is merely slightly
    worse in the average case.

    I think clustering gives the table more associativity. Up to eight VAs
    can all hash to the same cluster, almost as if it were eight-way
    associative.

    I think clustering is key to using an inverted hash-table. I got this
    from studying the PowerPC.


    Ironically, this also means that while small-scale alignment is often
    preferable, medium and large scale alignment is best minimized (or, IOW,
    ASLR can actually be beneficial for performance by reducing unrelated
    data stomping the same locations in the caches and TLB).


    Though, there are intermediate options, say, for a TLB, XOR'ing the
    index with the ASID for non-global pages, *1; but then this means
    needing to partition the TLB between global and local (halving
    associativity relative to LUT budget), in turn still favoring the use of
    modulo addressing.


    Granted, a hashing scheme and multiple probing steps could be possible
    to increase effective associativity at the cost of latency (say, in an
    extreme case, using latency tricks to cause a 2-way TLB to pretend to be
    4 or 8 way).

    Not too worried about latency as long as it is reasonable. Every step
    adds another eight ways of associativity due to clustering.

    Be careful !!

    Matrix300 (SPEC89) uses all 4 transposes of DGEMM. The 4th transpose takes
    a TLB miss every 4th Load; or once every 2-cycles on the 6-wide machine.
    A 32-entry fully associative TLB would miss every other cycle, while a 256-entry Direct Mapped TLB would never take a miss.


    In my own testing, it seemed for the main (L2) TLB, assuming modulo
    indexing and SW refill:
    1-way: doesn't work for L2 TLB (works well for an L1 TLB though);
    2-way: bare minimum, but can fail in certain edge cases (*1);
    4-way: Works pretty well;
    8-way: "near perfect" on the associativity front.

    In many contexts, there doesn't seem to be much gain in going over 8-way associativity, and only a modest gain vs 4-way.

    *1: There is a risk for cases where both D$ and I$ miss and both cross
    page boundaries in the access. This is less of an issue with
    modulo addressing than with hashing (if this happens with hashed
    indexing and 2-way associativity, one is SOL). It seems to squeeze
    through, if barely, with modulo indexing (because no two adjacent pages
    can stomp the same index).
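
    For clarity, the two indexing options being compared, with an arbitrary
    example hash:

    /* Two ways to pick the TLB set for a virtual page number.  With
       modulo, adjacent VPNs always land in different sets, so a D$/I$
       page-crossing pair cannot all collide; a hashed index gives no
       such guarantee.  The hash here is an arbitrary example.         */
    #define TLB_SETS 64                        /* power of two, example */

    static unsigned tlb_index_modulo(unsigned vpn)
    {
        return vpn % TLB_SETS;                 /* == vpn & (TLB_SETS - 1) */
    }

    static unsigned tlb_index_hashed(unsigned vpn, unsigned asid)
    {
        return (vpn ^ (vpn >> 6) ^ asid) % TLB_SETS;
    }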


    As noted, hashing is a risk here:
    Makes average better at expense of worst case;
    Effectively makes 4-way the bare minimum;
    Paradoxically, while making the average case better, it increases the associativity required to avoid performance loss in adverse cases (hash collisions).

    Some of the same benefits of hashing can also be realized in the OS via strategic use of ASLR (where ASLR tends to benefit TLB performance in
    this case).


    Can note that 8-way is an improvement over 4-way, but the gains are
    comparably smaller. Likewise, 16-way seems to gain very little over 8-way.

    Once the required associativity is present, more benefit is seen mostly
    by making the TLB bigger (if possible, you really want the working set
    to be able to fit into the TLB).


    But, as noted, the "Last Level Page-Table TLB" scheme I had considered previously could greatly expand the working set of the MMU without the
    cost of a full page walker.

    But, more a case of investing resources into fixing an issue that is not
    (yet) a significant performance issue.

    Granted, can note that both my BT3 engine and Quake 3 Arena tend to
    exceed the working-set size of the MMU, so are prone to a higher
    frequency of TLB miss interrupts. But, fixing this still won't make
    Quake3 all that playable on a 50MHz CPU.

    ...



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Feb 3 01:03:36 2026
    From Newsgroup: comp.arch

    After getting bogged down testing the hash table, I decided to shelve it
    for now. And...

    I re-wrote the Qupls4 TLB for hierarchical page tables. It is much
    cleaner code now. The TLB is taking up only about 600 LUTs and 8 BRAMs.
    It is also simpler, being only a single-level TLB. Three copies of the
    TLB would be needed.

    Thinking about adding caching of translation steps to avoid main memory access.
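
    One likely form of this is a small page-walk cache keyed on the upper VA
    bits, so a TLB miss that hits the cache only needs one memory access for
    the leaf PTE. A rough sketch (names and sizes are illustrative, not the
    Qupls4 design):

    /* Tiny page-walk cache: maps the upper VA bits (all levels above
       the leaf) to the physical address of the leaf table page, so a
       TLB miss that hits here needs only one memory access for the
       final PTE.  Names and sizes are illustrative.                  */
    #include <stdint.h>

    #define PWC_ENTRIES 16
    typedef struct { uint64_t va_hi, leaf_pa; int valid; } pwc_t;
    static pwc_t pwc[PWC_ENTRIES];

    static int pwc_lookup(uint64_t va_hi, uint64_t *leaf_pa)
    {
        pwc_t *e = &pwc[va_hi % PWC_ENTRIES];
        if (e->valid && e->va_hi == va_hi) { *leaf_pa = e->leaf_pa; return 1; }
        return 0;
    }

    static void pwc_fill(uint64_t va_hi, uint64_t leaf_pa)
    {
        pwc_t *e = &pwc[va_hi % PWC_ENTRIES];
        e->va_hi = va_hi; e->leaf_pa = leaf_pa; e->valid = 1;
    }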

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Feb 3 01:23:05 2026
    From Newsgroup: comp.arch

    On 2026-02-01 12:49 p.m., EricP wrote:
    Robert Finch wrote:
    On 2026-01-31 12:31 p.m., EricP wrote:
    Robert Finch wrote:

    The hash table is only slightly different than a giant TLB. The TLB
    is walked on a miss instead of walking main memory. Page faults
    would be handled the same way.

    Yes.
    Long ago I read a paper about a port of Linux to a PowerPC with its IVT.
    Linux memory management is based on a hierarchical page table (HPT) so
    rather than rewrite the memory manager they treated the IVT as a
    large TLB.
    As pages were touched they copied the PTEs from the HPT into the IVT.
    When there was a process switch they purged the IVT of all entries
    the same way one might purge a TLB. Later they added ASIDs so they
    didn't have to purge the IVT on an address space switch.

    The code that copied the PTEs from the HPT to the IVT was very similar
    to a SW-managed TLB refill, much as virtual machine monitors do if they
    don't have nested page tables.

    I am not sure about getting Linux working. I may just use my own OS.
    Skimming through the mmap() code has given me an idea of just how
    complex the software is.

    It was just an example of how one can use a page table to emulate
    any other kind of page table.

    Also you can't guarantee that your whole program will fit into a single
    inverted/hash table of a predetermined size because you don't know in
    advance which entries will collide and push other entries out.
    Even if it does fit today, any change to the program could break that.

    Collisions will not push other entries out; they just bounce to a
    different empty entry. If one cannot be found then it is treated as
    out-of-memory.

    Did some testing. The test just allocates pages without freeing them,
    so it is a bit unrealistic.
    The IVT stopped after 63 bounces at 69% full (a max of 64 bounces is
    set). Average number of bounces: 1.8 per lookup. This was with random
    ASIDs and addresses.
    The table can be clocked at twice the CPU rate.

    One might make the hash function controllable so you can optimize the
    bounce length for different programs. Also the max bounce limit should
    be a dynamic HW register and not hard wired.

    The max bounce limit is a HW register. Making the hash function
    controllable is a bit tricky as it is used by HW. It might be easier to
    provide a selection of a few different hashes. The hash function reminds
    me of crypto rotates and combinations.
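
    For example, a selectable family of rotate/XOR style hashes over
    (ASID, VPN) might look like this; the particular mixes are only
    illustrative, not the ones in the actual design:

    /* A selectable family of simple rotate/XOR hashes over (ASID, VPN).
       The three variants are only examples.                            */
    #include <stdint.h>

    static uint32_t rotl32(uint32_t x, int r)
    {
        return (x << r) | (x >> (32 - r));
    }

    static uint32_t ivt_hash(uint32_t asid, uint32_t vpn, int sel, uint32_t mask)
    {
        switch (sel) {
        case 0:  return (vpn ^ asid) & mask;
        case 1:  return (vpn ^ rotl32(asid, 7) ^ (vpn >> 5)) & mask;
        default: return (rotl32(vpn, 13) ^ (asid * 0x9E3779B9u)) & mask;
        }
    }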


    I have been wondering how to collect and use stats to control things.

    The entries in the IVT for ASIDs no longer in use could be cleared
    automatically by a background HW process.

    That implies some kind of ASID allocator to track which are in-use.
    It also implies a whole IVT scan as each ASID is freed,
    or maintain a pending free list and a scavenger process.

    For the hash table a small ASID (e.g. 8 bits) is used. It could be
    tracked with a bitmap.

    If the ASIDs are not reused too quickly the scavenger may not need to be
    fast. An ASID FIFO could be used to allocate / free ASIDs.

    Simpler to have process creation assign a new ASID by incrementing
    an unsigned up-counter that starts at 1. When you hit its max, reset
    to 1 and clear the whole IVT of all non-zero ASID entries in one pass.


    A garbage-collection approach may work for removed entries. It might be
    a complicated state machine to do in HW, though. It could look for
    invalid entries or entries with defunct ASIDs.
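
    A software model of what such a scavenger might do, one entry per step
    so it can run in the background; asid_in_use() stands in for the ASID
    bitmap mentioned above:

    /* Background scavenger sketch: visit one entry per call and
       tombstone entries whose ASID is defunct.  Actual reclaim and
       compaction are left to a periodic rebuild pass, since turning a
       slot back to EMPTY here would break open-addressing probe
       chains.  asid_in_use() is a hypothetical stand-in.             */
    enum { S_EMPTY, S_VALID, S_DELETED };
    typedef struct { int state; unsigned asid, vpn, ppn; } ivt_entry_t;

    #define IVT_ENTRIES 8192
    static ivt_entry_t ivt[IVT_ENTRIES];

    extern int asid_in_use(unsigned asid);      /* hypothetical */

    void ivt_scavenge_step(void)
    {
        static unsigned pos;                    /* current scan position */
        ivt_entry_t *e = &ivt[pos];
        if (e->state == S_VALID && !asid_in_use(e->asid))
            e->state = S_DELETED;               /* tombstone, keep chains */
        pos = (pos + 1) % IVT_ENTRIES;
    }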



    --- Synchronet 3.21b-Linux NewsLink 1.2