Have inverted page tables fallen out of favor?
Have inverted page tables fallen out of favor?
And how much OS support
is there for them?
I am thinking about using an inverted page table instead of a
hierarchical page table for the current project. The advantage is that
the entire page table should be able to fit into BRAM on the FPGA. It
should then be possible to eliminate the TLB from the design at perhaps
a small performance cost. It would reduce the LUT cost of the MMU.
Translations could work directly from the page table.
The hash function would select a page table group, averaging about 1.2 lookups per memory access. The hash is the ASID XOR'd with address bits 20 to 29, with 32k entries in the table.
Using 64kB pages to keep the page table small, with 1GB of RAM to manage.
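For concreteness, a minimal C sketch of the index computation just described; the constants and names are illustrative assumptions based on the figures above, not a definitive implementation:

#include <stdint.h>

/* Illustrative constants taken from the figures above (assumptions):
   32k entries, grouped for the lookup. */
#define TABLE_ENTRIES  (32 * 1024)
#define GROUP_SIZE     8                          /* assumed entries per group */
#define NUM_GROUPS     (TABLE_ENTRIES / GROUP_SIZE)

/* Hash = ASID XOR'd with address bits 20 to 29 (with 64kB pages the page
   offset is bits 0..15, so these are virtual page number bits), folded
   into the group range. */
static inline uint32_t ipt_group(uint32_t asid, uint64_t va)
{
    uint32_t bits20_29 = (uint32_t)((va >> 20) & 0x3FFu);
    return (asid ^ bits20_29) & (NUM_GROUPS - 1u);
}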
In article <10lfuka$1d2c4$1@dont-email.me>,
Robert Finch <robfi680@gmail.com> wrote:
Have inverted page tables fallen out of favor? And how much OS support
is there for them?
IVTs are a terrible idea. Consider the challenge of using them
to share memory between virtual address spaces in a
multiprocessing system, efficiently.
- Dan C.
On 2026-01-30 6:16 a.m., Dan Cross wrote:
In article <10lfuka$1d2c4$1@dont-email.me>,
Robert Finch <robfi680@gmail.com> wrote:
Have inverted page tables fallen out of favor? And how much OS support
is there for them?
IVTs are a terrible idea. Consider the challenge of using them
to share memory between virtual address spaces in a
multiprocessing system, efficiently.
- Dan C.
I may have used misleading terminology, the IVT I refer to is a hash
table based one. I tend to think of them as the same thing. I do not
think anybody would use a plain IVT.
<response to earlier comments>
Is the entire VAS covered by the hierarchical page table system?
With the entire PAS covered by the page table in BRAM it can be walked
in hardware very quickly, one cycle per step as opposed to walking the
page table in DRAM which could be quite slow.
Process switch is handled by including an ASID in the mapping as for a TLB.
For the IVT implementation the table is twice the size needed to cover
the PAS to allow for shared memory pages.
The table just stores VPN -> PPN mappings, so I am not quite following
the challenge of using them for shared memory. Multiple VPNs mapping to the
same PPN are possible. Is it the freeing up of physical pages in SW that causes an issue?
I seem to recall at least one fellow advocating the limited use of
shared memory, using replication instead of shared libraries for instance.
A hierarchical page table is a better solution, but I was looking for something lower cost. My monster 2-level TLB is currently about 9000
LUTs (I have been working to reduce this). The IVT is about 900 LUTs.
Robert Finch <robfi680@gmail.com> posted:
On 2026-01-30 6:16 a.m., Dan Cross wrote:
In article <10lfuka$1d2c4$1@dont-email.me>,
Robert Finch <robfi680@gmail.com> wrote:
Have inverted page tables fallen out of favor? And how much OS support is there for them?
IVTs are a terrible idea. Consider the challenge of using them
to share memory between virtual address spaces in a
multiprocessing system, efficiently.
- Dan C.
I may have used misleading terminology, the IVT I refer to is a hash
table based one. I tend to think of them as the same thing. I do not
think anybody would use a plain IVT.
<response to earlier comments>
Is the entire VAS covered by the hierarchical page table system?
With the entire PAS covered by the page table in BRAM it can be walked
in hardware very quickly, one cycle per step as opposed to walking the
page table in DRAM which could be quite slow.
Process switch is handled by including an ASID in the mapping as for a TLB.
For the IVT implementation the table is twice the size needed to cover
the PAS to allow for shared memory pages.
The table just stores mappings VPN -> PPN so I am not quite following
the challenge to using them for shared memory? Multiple VPN-> to the
same PPN are possible. Is it the freeing up of physical pages in SW that
cause an issue?
Consider mmap(file) in multiple different processes at different VAs.
So, now one has multiple VA pages pointing at one Physical page.
I seem to recall at least one fellow advocating the limited use of
shared memory, using replication instead of shared libraries for instance.
A hierarchical page table is a better solution, but I was looking for
something lower cost. My monster 2-level TLB is currently about 9000
LUTs (I have been working to reduce this). The IVT is about 900 LUTs.
On 2026-01-30 3:35 p.m., MitchAlsup wrote:
Consider mmap(file) in multiple different processes at different VAs.
So, now one has multiple VA pages pointing at one Physical page.
??? I think a hash table has this characteristic. Multiple VA pages can
point to a single physical page using a hash table. Is it just a
performance issue? ASID is part of the hash/compare.
I guess I should take a look at the mmap() code.
The hash table is only slightly different than a giant TLB. The TLB is walked on a miss instead of walking main memory. Page faults would be handled the same way.
The table is clustered, the hash points to eight entries which are then searched in parallel for the correct one. If not found a table walk
begins using quadratic open addressing.
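As a software model of the clustered lookup described here (eight entries per group, quadratic open addressing between groups; the field names and the walk limit are assumptions for the sketch):

#include <stdint.h>
#include <stdbool.h>

#define GROUP_SIZE   8
#define NUM_GROUPS   4096          /* 32k entries / 8 per group */
#define MAX_BOUNCES  30            /* assumed walk limit before declaring a miss */

typedef struct {
    uint32_t asid;
    uint32_t vpn;
    uint32_t ppn;
    bool     valid;
} ipt_entry_t;

extern ipt_entry_t ipt[NUM_GROUPS][GROUP_SIZE];

bool ipt_lookup(uint32_t asid, uint32_t vpn, uint32_t hash, uint32_t *ppn)
{
    for (uint32_t step = 0; step < MAX_BOUNCES; step++) {
        /* Quadratic probe sequence: h, h+1, h+4, h+9, ... (mod groups). */
        uint32_t group = (hash + step * step) & (NUM_GROUPS - 1u);
        for (int way = 0; way < GROUP_SIZE; way++) {   /* searched in parallel in HW */
            ipt_entry_t *e = &ipt[group][way];
            if (e->valid && e->asid == asid && e->vpn == vpn) {
                *ppn = e->ppn;
                return true;              /* translation hit */
            }
        }
    }
    return false;   /* not found within the walk limit: fault to software */
}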
I seem to recall at least one fellow advocating the limited use of
shared memory, using replication instead of shared libraries for
instance.
A hierarchical page table is a better solution, but I was looking for
something lower cost. My monster 2-level TLB is currently about 9000
LUTs (I have been working to reduce this). The IVT is about 900 LUTs.
On 1/30/2026 3:03 PM, Robert Finch wrote:
I guess I should take a look at the mmap() code.
The hash table is only slightly different than a giant TLB. The TLB is
walked on a miss instead of walking main memory. Page faults would be
handled the same way.
The table is clustered, the hash points to eight entries which are
then searched in parallel for the correct one. If not found a table
walk begins using quadratic open addressing.
It isn't obvious (at least to me) how what you are describing is much different from a SW managed TLB.
Main difference I think was that I ended up mostly settling on modulo addressing rather than hashing, but the reasoning here is more subtle:
while on average hashing is better than modulo addressing, the worst
case behavior for hashing is worse than the worst case for modulo.
Effectively, hashing requires twice the associativity to not shoot
itself in the foot in the worst case, whereas modulo is merely slightly worse in the average case.
Ironically, this also means that while small-scale alignment is often
preferable, medium and large scale alignment is best minimized (or, IOW, ASLR can actually be beneficial for performance by reducing unrelated
data stomping the same locations in the caches and TLB).
Though, there are intermediate options, say, for a TLB, XOR'ing the
index with the ASID for non-global pages, *1; but then this means
needing to partition the TLB between global and local (halving
associativity relative to LUT budget), in turn still favoring the use of modulo addressing.
Granted, a hashing scheme and multiple probing steps could be possible
to increase effective associativity at the cost of latency (say, in an extreme case, using latency tricks to cause a 2-way TLB to pretend to be
4 or 8 way).
Though, with something like my ringbus design this would require needing
to get clever to allow it to work (as it could not translate requests on multiple adjacent slots). Would likely need, say, a small 1-way TLB on
the front-end, followed by 2 or 3 additional slots to resolve the lookup sequentially if the 1-way TLB misses (where multiple subsequent requests that miss on the 1-way TLB and were shadowed by a prior request taking
up the "multi-way probe" resources, would need to cycle back around the
ring for another pass through the TLB, *2).
*1: Or, semi-global in a MMU that lacks true global pages, as the
existence of true global pages are actively detrimental in some
scenarios (and it is preferable instead to divide ASIDs into groups, and
a global page is only global within a group or within related groups; allowing for ASIDs to exist which will not see these "global" pages).
*2: And, while the ringbus continues to seem high-latency and stupid,
still seems to be one of the better options to balance performance and cheapness (simpler bus designs being painfully slow, and more advanced FIFO/queue based buses being far more complex and expensive...).
I seem to recall at least one fellow advocating the limited use of
shared memory, using replication instead of shared libraries for
instance.
A hierarchical page table is a better solution, but I was looking for
something lower cost. My monster 2-level TLB is currently about 9000
LUTs (I have been working to reduce this). The IVT is about 900 LUTs.
On 2026-01-30 4:44 p.m., BGB wrote:
It isn't obvious (at least to me) how what you are describing is much different from a SW managed TLB.
It is not a lot different. On a translation miss the table is walked
with HW though instead of SW walking main memory. There are enough
entries in the table to cover the physical address range. Only page
faults are handled in SW which should be rare. A page fault is generated after 30 steps in the walk though. That is a miss on 240 possible
locations. 240-way associativity is not enough?
The page fault handler will likely just report 'out-of-memory' then quit
the app.
Main difference I think was that I ended up mostly settling on modulo addressing rather than hashing, but the reasoning here is more subtle: while on average hashing is better than modulo addressing, the worst
case behavior for hashing is worse than the worst case for modulo.
Effectively, hashing requires twice the associativity to not shoot
itself in the foot in the worst case, whereas modulo is merely slightly worse in the average case.
I think clustering gives the table more associativity. Up to eight VAs
can all hash to the same cluster. Almost as if it were eight-way associative.
I think clustering is key to using an inverted hash-table. I got this
from studying the PowerPC.
Ironically, also means that while small-scale alignment is often preferable, medium and large scale alignment is best minimized (or, IOW, ASLR can actually be beneficial for performance by reducing unrelated
data stomping the same locations in the caches and TLB).
Though, there are intermediate options, say, for a TLB, XOR'ing the
index with the ASID for non-global pages, *1; but then this means
needing to partition the TLB between global and local (halving associativity relative to LUT budget), in turn still favoring the use of modulo addressing.
Granted, a hashing scheme and multiple probing steps could be possible
to increase effective associativity at the cost of latency (say, in an extreme case, using latency tricks to cause a 2-way TLB to pretend to be
4 or 8 way).
Not too worried about latency as long as it is reasonable. Every step adds another eight ways of associativity due to clustering.
Robert Finch wrote:
??? I think a hash table has this characteristic. Multiple VA pages
can point to a single physical page using a hash table. Is it just a
performance issue? ASID is part of the hash/compare.
Be careful about ASID's - they aren't just part of the virtual address
in MMU's like x86. There is a logic aspect in the address compare.
If the PTE Global flag is set then ASID is NOT included in the VA compare,
if the PTE Global flag is clear then ASID is included in the VA compare.
If the G flag is set on OS kernel pages then the effect is to make the
kernel virtual pages one single address space on all cores with no
associated ASID, while process virtual spaces reside in their own ASID's.
The above is easy to implement in a CAM because the Global flag is
part of the PTE entry. Compare the virtual page number (VPN) and if
there is a PTE match, then if the PTE.G flag is set take ASID as matched, otherwise check if ASID is matched.
But if you just tack the ASID onto the VA to get a larger VA and
use that as an index then you don't get the same logic because
the PTE.G flag that tells you whether ASID matters for that PTE,
is on the PTE that you are looking up.
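A minimal C sketch of that compare, assuming a simple entry layout (names here are illustrative):

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t vpn;       /* virtual page number tag          */
    uint32_t asid;      /* address space ID of the mapping  */
    bool     global;    /* PTE.G: ignore ASID when set      */
    bool     valid;
} tlb_entry_t;

/* ASID participates in the compare only when the entry's G flag is clear. */
static inline bool tlb_match(const tlb_entry_t *e, uint64_t vpn, uint32_t asid)
{
    if (!e->valid || e->vpn != vpn)
        return false;
    return e->global || (e->asid == asid);
}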
I guess I should take a look at the mmap() code.
The hash table is only slightly different than a giant TLB. The TLB is
walked on a miss instead of walking main memory. Page faults would be
handled the same way.
Yes.
Long ago I read a paper about a port of Linux to a PowerPC with its IVT. Linux memory management is based on a hierarchical page table (HPT), so rather than rewrite the memory manager they treated the IVT as a large TLB. As pages were touched they copied the PTE's from the HPT into the IVT.
When there was a process switch they purged the IVT of all entries
the same way one might purge a TLB. Later they added ASID's so they
didn't have to purge IVT on address space switch.
The code that copied the PTE's from HPT to IVT was very similar
to a SW managed TLB, as VM Virtual Machines also do if they don't
have nested page tables.
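Roughly, that refill path looks like the sketch below; this is purely illustrative and not the actual Linux/PowerPC code, with hpt_walk() and ivt_insert() standing in as assumed helpers:

#include <stdint.h>
#include <stdbool.h>

/* Assumed helpers: a hierarchical walk kept by the OS, and an insert into
   the hashed/inverted table the hardware actually searches. */
extern uint64_t hpt_walk(uint32_t asid, uint64_t vpn);          /* 0 = not mapped */
extern bool     ivt_insert(uint32_t asid, uint64_t vpn, uint64_t pte);
extern void     page_fault(uint32_t asid, uint64_t vpn);

/* Called on an IVT miss: treat the IVT as a large software-refilled TLB. */
void ivt_refill(uint32_t asid, uint64_t vpn)
{
    uint64_t pte = hpt_walk(asid, vpn);   /* consult the OS's own page table */
    if (pte == 0) {
        page_fault(asid, vpn);            /* genuinely unmapped */
        return;
    }
    if (!ivt_insert(asid, vpn, pte)) {
        /* No free slot within the probe limit: evict an entry or fault. */
        page_fault(asid, vpn);
    }
}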
Also you can't guarantee that your whole program will fit into a single
advance which entries will collide and push other entries out.
Even if it does fit today, any change to the program could break that.
The table is clustered, the hash points to eight entries which are
then searched in parallel for the correct one. If not found a table
walk begins using quadratic open addressing.
I seem to recall at least one fellow advocating the limited use of
shared memory, using replication instead of shared libraries for
instance.
A hierarchical page table is a better solution, but I was looking for
something lower cost. My monster 2-level TLB is currently about 9000
LUTs (I have been working to reduce this). The IVT is about 900 LUTs.
What is your MMU currently doing that uses 9000 LUTs?
On 2026-01-31 12:31 p.m., EricP wrote:
Robert Finch wrote:
The hash table is only slightly different than a giant TLB. The TLB
is walked on a miss instead of walking main memory. Page faults would
be handled the same way.
Yes.
Long ago I read a paper about a port of Linux to a PowerPC with its IVT.
Linux memory management is based on a hierarchical page table (HPT) so
rather than rewrite the memory manager they treated the IVT as a large
TLB.
As pages were touched they copied the PTE's from the HPT into IVT.
When there was a process switch they purged the IVT of all entries
the same way one might purge a TLB. Later they added ASID's so they
didn't have to purge IVT on address space switch.
The code that copied the PTE's from HPT to IVT was very similar
to a SW managed TLB, as VM Virtual Machines also do if they don't
have nested page tables.
I am not sure about getting Linux working. I may just use my own OS.
Skimming through the mmap() code has given me an idea of just how
complex the software is.
Also you can't guarantee that your whole program will fit into a single
inverted/hash table of a predetermined size because you don't know in
advance which entries will collide and push other entries out.
Even if it does fit today, any change to the program could break that.
Collisions will not push other entries out, they just bounce to a
different empty entry. If one cannot be found then it is an out-of-memory.
I did some testing. The test just allocates pages without freeing them, so it is
a bit unrealistic.
The IVT stopped after 63 bounces at 69% full. (A max of 64 bounces is
set). Average number of bounces: 1.8 per lookup. This was with random
ASIDs and addresses.
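Numbers of that kind can be reproduced with a toy fill simulation along these lines (a sketch only; the table geometry and probe sequence are assumed to match the description earlier in the thread):

/* Toy occupancy test: insert random hashes with quadratic probing over
   8-entry groups until an insert exceeds the bounce limit, then report
   fill level and average bounces.  Illustrative only. */
#include <stdio.h>
#include <stdlib.h>

#define GROUPS      4096
#define GROUP_SIZE  8
#define ENTRIES     (GROUPS * GROUP_SIZE)
#define MAX_BOUNCE  64

static unsigned char used[GROUPS][GROUP_SIZE];

int main(void)
{
    unsigned long inserted = 0, bounces = 0;
    for (;;) {
        unsigned hash = (unsigned)rand() & (GROUPS - 1);  /* stands in for ASID^VA */
        int placed = 0;
        for (unsigned step = 0; step < MAX_BOUNCE && !placed; step++) {
            unsigned g = (hash + step * step) & (GROUPS - 1);
            bounces += (step != 0);
            for (int w = 0; w < GROUP_SIZE; w++) {
                if (!used[g][w]) { used[g][w] = 1; placed = 1; break; }
            }
        }
        if (!placed) break;        /* hit the bounce limit: stop the test */
        inserted++;
    }
    printf("full: %.1f%%, avg bounces: %.2f\n",
           100.0 * inserted / ENTRIES,
           (double)bounces / (inserted ? inserted : 1));
    return 0;
}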
The table can be clocked at twice the CPU rate.
I have been wondering how to collect and use stats to control things.
The entries in the IVT for ASIDs no longer in use could be cleared automatically by a background HW process.
In article <10lji2u$2haq2$1@dont-email.me>,
Robert Finch <robfi680@gmail.com> wrote:
On 2026-01-30 4:44 p.m., BGB wrote:
On 1/30/2026 3:03 PM, Robert Finch wrote:
[ Proposal is a hashed page table, with 2x the entries to cover the entire physical address space, with each hash entry having 8 pages which need to
be searched before moving to the next entry].
It isn't obvious (at least to me) how what you are describing is much
different from a SW managed TLB.
It is not a lot different. On a translation miss the table is walked
with HW though instead of SW walking main memory. There are enough
entries in the table to cover the physical address range. Only page
faults are handled in SW which should be rare. A page fault is generated
after 30 steps in the walk though. That is a miss on 240 possible
locations. 240-way associativity is not enough?
The page fault handler will like just report 'out-of-memory' then quit
the app.
I'm going to guess your physical address space is just 32-bits, otherwise
this table is huge. I think you said you have 64KB pages (but then you said 256KB pages--which is just too big, so I'll ignore that), which for a
44-bit physical address space would mean 2^28 entries, with each entry
being 128-bits, for a 4GB page table. But for 32-bit, it's just 2^16
entries of 64-bits each.
A page table which doesn't scale is problematic. Especially if it needs
OS support. Who wants to invest effort in a dead-end idea?
Your hashed page table has issues which I think you should address:
1) Supporting multiple page sizes is a pain. If you have 2MB pages and
64KB pages, there are 32 64KB pages for each 2MB page. So do you
lazily load the hashtable only on misses, or load all 32 aliases
at once?
2) It's common to just do a single hash lookup, then trap to software. Doing
probing is a big pain. The problems come from removing entries.
If you page out a page, it leaves gaps in the table. When you do
a hardware lookup, and find no valid pages at the first entry, this
doesn't tell you anything! The obvious solution is ALL hardware
lookups must do the maximum searching before declaring a hash-table
miss--which you say is 30 steps, checking 240 entries. I'm not sure
there's a better solution than that, but this is a pretty big
penalty for a TLB miss.
3) It's very complex to efficiently update the hash table. On inserting
a new entry, where do you put it? On one hand, a "move-to-front"
strategy seems like a good idea--newly added pages seem like they
should get lookup priority. But how do you do this? It looks like
each software handled miss has a TON of work to do. It seems like
you need to re-hash each entry you consider moving, to make sure it
stays on its proper hash chain. And keeping the OS page structures
in sync looks like even more work.
4) The global nature of the hash table means multiple threads probably need
a lock to change it. So misses to software have to get a lock before
making any hashtable changes. This is very expensive. This is
especially bad when a process ends--all of its translations need to
be removed, and so all page table misses are blocked while that
happens. This is bad.
A simpler approach is a per-process hash table of variable size in memory, sized to at least 2X the number of pages being used. It's indexed by a hash of ASID+VA, and does a single probe. It might be useful to do two hash tables: one for large pages, and one for small pages (with 2 lookups).
On any miss, the hash entry to be updated is easy to find for software.
But: there's effectively no associativity--but just increase the hash table's size if software detects that you are getting conflicts. Pick a good
hash function to avoid stupid conflicts. I recommend you make the hash function an instruction, so software can calculate it easily (so if you reverse bits and do other stuff that may be hard for software to calculate, you make it easy).
This gives you the benefits of a hardware table walker with less trouble.
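A sketch of the per-process, single-probe table being suggested; the sizes, hash, and entry layout are assumptions, the point being that software can compute the same slot the hardware probes:

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t vpn;
    uint64_t pte;       /* PPN plus permission bits */
    bool     valid;
} hpte_t;

typedef struct {
    hpte_t  *table;     /* per-process, >= 2x the pages in use  */
    uint32_t mask;      /* table size - 1 (power of two)        */
    uint32_t asid;
} proc_ptbl_t;

/* Same hash in HW and SW; keeping it simple (or exposing it as an
   instruction) lets the OS compute the slot it must update on a miss. */
static inline uint32_t pt_hash(uint32_t asid, uint64_t vpn, uint32_t mask)
{
    return (uint32_t)(vpn ^ (vpn >> 17) ^ asid) & mask;
}

/* Single probe: hit, or trap to software which fills exactly this slot. */
static inline bool pt_lookup(const proc_ptbl_t *p, uint64_t vpn, uint64_t *pte)
{
    const hpte_t *e = &p->table[pt_hash(p->asid, vpn, p->mask)];
    if (e->valid && e->vpn == vpn) { *pte = e->pte; return true; }
    return false;   /* miss: software handles it and can resize/rehash */
}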
An advantage of hashed page tables is the TLB itself has no "intermediate" entries. This lets the primary TLB hold more useful translations. This
is not much of an issue if the primary TLB is >= 256 entries, but it's
really noticeable if the TLB is <= 32 entries. You need to lean in to hashed page tables' advantages.
Kent
Robert Finch <robfi680@gmail.com> posted:
Not too worried about latency as long as it is reasonable. Every step is
another eight ways associative due to clustering.
Be careful !!
Matrix300 (SPEC89) uses all 4 transposes of DGEMM. The 4th transpose takes
a TLB miss every 4th Load; or once every 2-cycles on the 6-wide machine.
A 32-entry fully associative TLB would miss every other cycle, while a 256-entry Direct Mapped TLB would never take a miss.
Robert Finch wrote:
I am not sure about getting Linux working. I may just use my own OS.
Skimming through the mmap() code has given me an idea of just how
complex the software is.
It was just an example of how one can use a page table to emulate
any other kind of page table.
Collisions will not push other entries out, they just bounce to a
different empty entry. If one cannot be found then it is an out-of-memory.
Testing some. Testing just allocates pages without freeing them so it
is a bit unrealistic.
The IVT stopped after 63 bounces at 69% full. (A max of 64 bounces is
set). Average number of bounces: 1.8 per lookup. This was with random
ASIDs and addresses.
The table can be clocked at twice the CPU rate.
One might make the hash function controllable so you can optimize the
bounce length for different programs. Also the max bounce limit should
be a dynamic HW register and not hard wired.
I have been wondering how to collect and use stats to control things.
The entries in the IVT for ASIDs no longer in use could be cleared
automatically by a background HW process.
That implies some kind of ASID allocator to track which are in-use.
It also implies a whole IVT scan as each ASID is freed,
or maintain a pending free list and a scavenger process.
Simpler to have process creation assign a new ASID by incrementing
an unsigned up counter that starts at 1. When you hit its max then reset
to 1 and clear the whole IVT for all non-zero ASID entries in one pass.
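That allocation scheme is simple enough to sketch (names are illustrative; ivt_flush_nonzero_asids() stands in for whatever clears the table in one pass):

#include <stdint.h>

#define ASID_MAX 0xFFFFu            /* assumed width of the ASID field */

extern void ivt_flush_nonzero_asids(void);   /* assumed: one-pass table clear */

static uint32_t next_asid = 1;      /* 0 assumed reserved for global/kernel use */

/* Assign ASIDs from an up counter; on wrap, flush every non-zero ASID
   entry from the IVT so stale translations cannot be reused. */
uint32_t asid_alloc(void)
{
    if (next_asid > ASID_MAX) {
        ivt_flush_nonzero_asids();
        next_asid = 1;
    }
    return next_asid++;
}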