• Inverted Page Tables

    From Robert Finch@robfi680@gmail.com to comp.arch on Thu Jan 29 10:33:29 2026
    From Newsgroup: comp.arch

    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    I am thinking about using an inverted page table instead of a
    hierarchical page table for the current project. The advantage is that
    the entire page table should be able to fit into BRAM on the FPGA. It
    should then be possible to eliminate the TLB from the design at perhaps
    a small performance cost. It would reduce the LUT cost of the MMU.

    Translations could work directly from the page table.

    The hash function would select a page table group, averaging about 1.2
    lookups per memory access. The hash is the ASID XORed with address bits
    20 to 29, and there are 32k entries in the table.

    Using 64kB pages to keep the page table small. 1GB RAM to manage.
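
    A rough C model of that group selection might look like the following;
    the group count, the exact masking, and the names are illustrative
    assumptions rather than the actual design:

    #include <stdint.h>

    #define NUM_GROUPS 4096u   /* assumed: 32k entries in 8-entry groups */

    /* Select a page table group: ASID XORed with VA bits 20..29. */
    static inline uint32_t ipt_group(uint32_t asid, uint64_t va)
    {
        uint32_t va_bits = (uint32_t)((va >> 20) & 0x3FFu);  /* bits 20..29 */
        return (asid ^ va_bits) & (NUM_GROUPS - 1u);
    }
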
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Jan 29 18:04:44 2026
    From Newsgroup: comp.arch

    On Thu, 29 Jan 2026 10:33:29 -0500
    Robert Finch <robfi680@gmail.com> wrote:

    Have inverted page tables fallen out of favor?

    It depends on whom you ask.
    If you ask LBT then inverted page tables never had an opportunity to
    fall out of his favor, because he has hated them since the very first
    encounter.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Thu Jan 29 13:09:37 2026
    From Newsgroup: comp.arch

    I am thinking about using an inverted page table instead of a hierarchical page table for the current project. The advantage is that the entire page table should be able to fit into BRAM on the FPGA. It should then be possible to eliminate the TLB from the design at perhaps a small performance cost. It would reduce the LUT cost of the MMU.

    If you have the freedom (on the OS side) to use a virtually-indexed
    cache, I'd recommend you try and use a system like:

    PA physical_address_of_virtual_address (VA x)
    {
      if (x == 0)
        return 0;      /* base case: the page holding the start of the
                          page table is assumed to translate to PA 0    */
      else {
        /* the linear page table is assumed to live at virtual address 0,
           so the PTE for page (x / PAGESIZE) sits at this virtual address */
        VA pagetable_entry = x / PAGESIZE * sizeof (VA);
        return *(PA*)pagetable_entry;
      }
    }

    This uses the CPU cache as the TLB and does pagetable walks "from the
    bottom", so its efficiency is basically not affected by the number of
    levels of pagetables.


    === Stefan
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jan 29 21:40:27 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Have inverted page tables fallen out of favor?

    Not so much as: that "they were never really IN Favor".

    And how much OS support
    is there for them?

    Probably just enough to get by if you copy an existing IPT.

    I am thinking about using an inverted page table instead of a
    hierarchical page table for the current project. The advantage is that
    the entire page table should be able to fit into BRAM on the FPGA. It
    should then be possible to eliminate the TLB from the design at perhaps
    a small performance cost. It would reduce the LUT cost of the MMU.

    An alternative is for HW to generate a hash address based on faulting
    VA and make one probe into a 1-16MB TLB stored across the memory hierarchy.
    If the access matches VA the rest of the data is your PTE. If not, then
    you trap to SW to refill the memory-TLB. Rationale: if the hash function
    is not that bad, and the hash table is large, TLB refills end up at low
    cost (delay).
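
    A minimal C sketch of that one-probe, memory-resident TLB (entry layout,
    hash, and names are assumptions for illustration only):

    #include <stdint.h>

    typedef struct {
        uint64_t va_tag;   /* virtual page number (plus ASID) of this entry */
        uint64_t pte;      /* the translation itself                        */
    } mem_tlb_entry;

    extern mem_tlb_entry mem_tlb[];      /* 1-16MB table in ordinary memory */
    extern const uint64_t MEM_TLB_MASK;  /* number of entries minus one     */

    /* One hashed probe; a real MMU would trap to SW on a mismatch. */
    static inline int mem_tlb_lookup(uint64_t vpn, uint64_t *pte_out)
    {
        uint64_t h = (vpn ^ (vpn >> 13)) & MEM_TLB_MASK;   /* assumed hash */
        if (mem_tlb[h].va_tag == vpn) {
            *pte_out = mem_tlb[h].pte;   /* hit: rest of the entry is the PTE */
            return 1;
        }
        return 0;                        /* miss: refill in software */
    }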

    Translations could work directly from the page table.

    The hash function would select a page table group. Averaging about 1.2 lookups per memory access. Using the ASID xor’d with address bits 20 to
    29 as the hash. 32k entries in the table.

    Using 64kB pages to keep the page table small. 1GB RAM to manage.

    But realistically, a HW table-walker with a bit of internal caching supplemented with a large L2-TLB works fine in practice.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Jan 29 17:36:24 2026
    From Newsgroup: comp.arch

    On 1/29/2026 9:33 AM, Robert Finch wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    I am thinking about using an inverted page table instead of a
    hierarchical page table for the current project. The advantage is that
    the entire page table should be able to fit into BRAM on the FPGA. It
    should then be possible to eliminate the TLB from the design at perhaps
    a small performance cost. It would reduce the LUT cost of the MMU.

    Translations could work directly from the page table.

    The hash function would select a page table group. Averaging about 1.2 lookups per memory access. Using the ASID xor’d with address bits 20 to
    29 as the hash. 32k entries in the table.

    Using 64kB pages to keep the page table small. 1GB RAM to manage.


    Inverted page tables kinda have the disadvantage though of having *both*
    the disadvantages of a software managed TLB along with some of the disadvantages of a HW page walker.

    The main argued benefits they have (over plain SW TLB):
    Typically larger (because RAM backed);
    Potentially slightly faster on process switches.
    Less of a "TLB miss storm".

    Common drawbacks with SW TLB:
    Still need to deal with TLB misses (if page isn't in IPT);
    Finite associativity means can't fit whole VAS in an IPT.

    Common drawbacks with HW page walking:
    Still need to access RAM, potentially multiple times, to resolve a miss;
    Loss of flexibility: As with HW page walking, one can't transparently
    deal with multi-level address translation (without, essentially, falling
    back to just treating it like a SW TLB; or having whatever code manages
    the bottom-level IPT also being aware of the use of nested address translation).

    Though, maybe not as bad as with HW page walking, where the page-walker
    also has to be aware of and natively deal with nested translation
    (potentially a very big headache for a HW page walker).



    Big set-associative IPT held in BRAM (vs in Main RAM):
    Basically the same as how I am already doing SW TLB.

    While it could be argued that an IPT can reduce process-switch cost by
    having per-process IPT, with some cleverness this can also be faked on
    top of SW TLB (by modeling the TLB per-process, and then proactively
    injecting TLBE's on process switch).




    One other intermediate idea I had considered was a "Last Level Page
    Table" (see the sketch after this list), where the MMU is given the
    address of the last level of each page table as its own specialized
    TLBE (after any requisite address translation), which can potentially
    have some of the advantages of both SW TLB and HW page walking:
    Retains flexibility and can deal with nested translation;
    Avoids many of the harder parts of a page walker;
    Only needs a single memory access to resolve a request.
    Could have some of the benefits of a HW page-walker:
    Most of the would-be TLB misses can be handled by hardware.
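
    A small C sketch of that last-level lookup (page size, table width,
    entry layout, and the phys_read64 helper are all assumptions, not the
    actual design):

    #include <stdint.h>

    /* The MMU caches, per region, the physical address of that region's
       last-level page table page. */
    typedef struct {
        uint64_t region_tag;   /* which VA region / ASID this entry covers */
        uint64_t table_pa;     /* PA of the last-level page table page     */
    } llpt_entry;

    #define PAGE_SHIFT 14u     /* illustrative page size       */
    #define LEVEL_BITS 10u     /* PTEs per last-level table    */

    /* Resolve a would-be TLB miss with a single memory access. */
    static inline uint64_t llpt_resolve(const llpt_entry *e, uint64_t va,
                                        uint64_t (*phys_read64)(uint64_t pa))
    {
        uint64_t idx = (va >> PAGE_SHIFT) & ((1u << LEVEL_BITS) - 1u);
        return phys_read64(e->table_pa + idx * sizeof(uint64_t));  /* the PTE */
    }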

    Though, admittedly, haven't implemented it as of yet, but mostly because
    in my case TLB miss activity tends to be low enough that it isn't a big
    enough part of the total clock cycle budget (or, IOW, 256x4W already has
    a pretty good hit rate in a world of mostly smallish processes).

    Would have the issue that it can no longer detect success or failure
    within a fixed latency, so would need more complexity to deal with an "in-flight potential miss", either needing a mechanism to signal the
    request to bounce around the ring some more, or a way to temporarily
    take requests off the bus and inject them back later.


    Though, the most likely option being that untranslated VAS's are not
    allowed to leave the L1 ringbus (so would be deflected back into the
    ring until the TLB can deal with them). Does lead to the potential issue
    of bus congestion leading to a scenario where the page-access response
    (from the L2 cache or RAM) can't easily enter back into the L1 ring if
    most of the spots are already full (with in-flight cycling requests).

    Though, then again, for latency:
    L1 D$: 2c
    L1 I$: 2c
    TLB: 3c
    BridgeVcA: 3c
    This is a 64x4W write-through cache that also functions as a bridge.

    So, L1A has ~ 10c ring latency.
    Assuming both L1's fire off 2 requests, this is around a 40% congestion,
    and theoretically should not break down until around 70% (unless the
    latency values have an unlucky fraction, *).

    Goes and checks:
    Needed logic was already in place, though as-is it would print debug
    messages whenever the request is bounced back inside the L1 ring (an untranslated L1 request escaping the L1 ring would be "extra bad", as
    only physically addressed requests are allowed on the L2 ring).

    *: If the L2 ring has a latency that is an integer multiple of the L1
    ring latency, then potentially the same messages can collide endlessly,
    and the bus would effectively be stalled. In theory, could be addressed
    via a mechanism to either fine-adjust or jitter L2 ring latency if needed.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Fri Jan 30 11:16:43 2026
    From Newsgroup: comp.arch

    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Jan 30 11:51:15 2026
    From Newsgroup: comp.arch

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN, so I am not quite following
    the challenge of using them for shared memory? Multiple VPNs mapping to
    the same PPN are possible. Is it the freeing up of physical pages in SW
    that causes an issue?

    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for instance.

    A hierarchical page table is a better solution, but I was looking for something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Jan 30 20:35:14 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPN-> to the
    same PPN are possible. Is it the freeing up of physical pages in SW that cause an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for instance.

    A hierarchical page table is a better solution, but I was looking for something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Jan 30 16:03:39 2026
    From Newsgroup: comp.arch

    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for a TLB.
    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPN-> to the
    same PPN are possible. Is it the freeing up of physical pages in SW that
    cause an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages can
    point to a single physical page using a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is
    walked on a miss instead of walking main memory. Page faults would be
    handled the same way.

    The table is clustered: the hash points to eight entries, which are then
    searched in parallel for the correct one. If not found, a table walk
    begins using quadratic open addressing.
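
    A rough C model of that clustered lookup with quadratic probing (the
    hash, group count, probe limit, and field names are illustrative
    assumptions):

    #include <stdint.h>

    #define CLUSTER     8u      /* entries searched in parallel per group */
    #define NUM_GROUPS  4096u   /* illustrative                           */
    #define MAX_PROBES  30u     /* give up (page fault) after this many   */

    typedef struct { int valid; uint32_t asid; uint64_t vpn, ppn; } ipt_entry;
    extern ipt_entry ipt[NUM_GROUPS][CLUSTER];

    /* One group per step; the inner loop is a parallel compare in HW. */
    static int ipt_lookup(uint32_t asid, uint64_t vpn, uint64_t *ppn_out)
    {
        uint32_t g0 = (uint32_t)(asid ^ vpn) & (NUM_GROUPS - 1u);  /* assumed hash */
        for (uint32_t step = 0; step < MAX_PROBES; step++) {
            uint32_t g = (g0 + step * step) & (NUM_GROUPS - 1u);   /* quadratic */
            for (uint32_t i = 0; i < CLUSTER; i++) {
                if (ipt[g][i].valid && ipt[g][i].asid == asid &&
                    ipt[g][i].vpn == vpn) {
                    *ppn_out = ipt[g][i].ppn;
                    return 1;
                }
            }
        }
        return 0;   /* not found: fall through to the page-fault path */
    }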



    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for instance.
    A hierarchical page table is a better solution, but I was looking for
    something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Jan 30 15:44:45 2026
    From Newsgroup: comp.arch

    On 1/30/2026 3:03 PM, Robert Finch wrote:
    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch  <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea.  Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

        - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for
    a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPN-> to the
    same PPN are possible. Is it the freeing up of physical pages in SW that
    cause an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages can point to a single physical page using a a hash table. Is it just a performance issue? ASID is part of the hash/compare.

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is walked on a miss instead of walking main memory. Page faults would be handled the same way.

    The table is clustered, the hash points to eight entries which are then searched in parallel for the correct one. If not found a table walk
    begins using quadratic open addressing.


    It isn't obvious (at least to me) how what you are describing is much different from a SW managed TLB.


    Main difference I think was that I ended up mostly settling on modulo addressing rather than hashing, but the reasoning here is more subtle:
    while on average hashing is better than modulo addressing, the worst
    case behavior for hashing is worse than the worst case for modulo.

    Effectively, hashing requires twice the associativity to not shoot
    itself in the foot in the worst case, whereas modulo is merely slightly
    worse in the average case.
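
    To make the two indexing schemes concrete, a couple of one-liners (the
    set count and the hash mix are arbitrary examples, not the actual design):

    #include <stdint.h>

    #define TLB_SETS 256u   /* illustrative */

    /* Modulo (direct) indexing: adjacent pages map to adjacent sets. */
    static inline uint32_t tlb_index_modulo(uint64_t vpn)
    {
        return (uint32_t)(vpn & (TLB_SETS - 1u));
    }

    /* Hashed indexing: better average spread, but a pathological working
       set can still pile many pages onto the same set. */
    static inline uint32_t tlb_index_hashed(uint64_t vpn)
    {
        return (uint32_t)((vpn ^ (vpn >> 8) ^ (vpn >> 16)) & (TLB_SETS - 1u));
    }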

    Ironically, also means that while small-scale alignment is often
    preferable, medium and large scale alignment is best minimized (or, IOW,
    ASLR can actually be beneficial for performance by reducing unrelated
    data stomping the same locations in the caches and TLB).


    Though, there are intermediate options, say, for a TLB, XOR'ing the
    index with the ASID for non-global pages, *1; but then this means
    needing to partition the TLB between global and local (halving
    associativity relative to LUT budget), in turn still favoring the use of modulo addressing.


    Granted, a hashing scheme and multiple probing steps could be possible
    to increase effective associativity at the cost of latency (say, in an
    extreme case, using latency tricks to cause a 2-way TLB to pretend to be
    4 or 8 way).

    Though, with something like my ringbus design this would require needing
    to get clever to allow it to work (as it could not translate requests on multiple adjacent slots). Would likely need, say, a small 1-way TLB on
    the front-end, followed by 2 or 3 additional slots to resolve the lookup sequentially if the 1-way TLB misses (where multiple subsequent requests
    that miss on the 1-way TLB and were shadowed by a prior request taking
    up the "multi-way probe" resources, would need to cycle back around the
    ring for another pass through the TLB, *2).


    *1: Or, semi-global in a MMU that lacks true global pages, as the
    existence of true global pages are actively detrimental in some
    scenarios (and it is preferable instead to divide ASIDs into groups, and
    a global page is only global within a group or within related groups;
    allowing for ASIDs to exist which will not see these "global" pages).

    *2: And, while the ringbus continues to seem high-latency and stupid,
    still seems to be one of the better options to balance performance and cheapness (simpler bus designs being painfully slow, and more advanced FIFO/queue based buses being far more complex and expensive...).




    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for
    instance.

    A hierarchical page table is a better solution, but I was looking for
    something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Jan 30 19:23:56 2026
    From Newsgroup: comp.arch

    On 2026-01-30 4:44 p.m., BGB wrote:
    On 1/30/2026 3:03 PM, Robert Finch wrote:
    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch  <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS
    support
    is there for them?

    IVTs are a terrible idea.  Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

        - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for
    a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPN-> to the
    same PPN are possible. Is it the freeing up of physical pages in SW
    that
    cause an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages
    can point to a single physical page using a a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is
    walked on a miss instead of walking main memory. Page faults would be
    handled the same way.

    The table is clustered, the hash points to eight entries which are
    then searched in parallel for the correct one. If not found a table
    walk begins using quadratic open addressing.


    It isn't obvious (at least to me) how what you are describing is much different from a SW managed TLB.

    It is not a lot different. On a translation miss the table is walked
    with HW though instead of SW walking main memory. There are enough
    entries in the table to cover the physical address range. Only page
    faults are handled in SW which should be rare. A page fault is generated
    after 30 steps in the walk though. That is a miss on 240 possible
    locations. 240-way associativity is not enough?
    The page fault handler will likely just report 'out-of-memory' then quit
    the app.


    Main difference I think was that I ended up mostly settling on modulo addressing rather than hashing, but the reasoning here is more subtle:
    while on average hashing is better than modulo addressing, the worst
    case behavior for hashing is worst than the worst case for modulo.

    Effectively, hashing requires twice the associativity to not shoot
    itself in the foot in the worst case, whereas modulo is merely slightly worse in the average case.

    I think clustering gives the table more associativity. Up to eight VAs
    can all hash to the same cluster. Almost as if it were eight-way
    associative.

    I think clustering is key to using an inverted hash-table. I got this
    from studying the PowerPC.


    Ironically, also means that while small-scale alignment is often
    preferable, medium and large scale alignment is best minimized (or, IOW, ASLR can actually be beneficial for performance by reducing unrelated
    data stomping the same locations in the caches and TLB).


    Though, there are intermediate options, say, for a TLB, XOR'ing the
    index with the ASID for non-global pages, *1; but then this means
    needing to partition the TLB between global and local (halving
    associativity relative to LUT budget), in turn still favoring the use of modulo addressing.


    Granted, a hashing scheme and multiple probing steps could be possible
    to increase effective associativity at the cost of latency (say, in an extreme case, using latency tricks to cause a 2-way TLB to pretend to be
    4 or 8 way).

    Not too worried about latency as long as it is reasonable. Every step is another eight ways associative due to clustering.


    Though, with something like my ringbus design this would require needing
    to get clever to allow it to work (as it could not translate requests on multiple adjacent slots). Would likely need, say, a small 1-way TLB on
    the front-end, followed by 2 or 3 additional slots to resolve the lookup sequentially if the 1-way TLB misses (where multiple subsequent requests that miss on the 1-way TLB and were shadowed by a prior request taking
    up the "multi-way probe" resources, would need to cycle back around the
    ring for another pass through the TLB, *2).


    *1: Or, semi-global in a MMU that lacks true global pages, as the
    existence of true global pages are actively detrimental in some
    scenarios (and it is preferable instead to divide ASIDs into groups, and
    a global page is only global within a group or within related groups; allowing for ASIDs to exist which will not see these "global" pages).

    *2: And, while the ringbus continues to seem high-latency and stupid,
    still seems to be one of the better options to balance performance and cheapness (simpler bus designs being painfully slow, and more advanced FIFO/queue based buses being far more complex and expensive...).





    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for
    instance.

    A hierarchical page table is a better solution, but I was looking for
    something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.




    Using a huge page size (256kB) now so the table can be kept small. There
    are three hash tables, one for instructions, two for data. Same as
    triple porting the table. There are still thousands of pages available.





    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Jan 31 01:36:59 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 4:44 p.m., BGB wrote:
    On 1/30/2026 3:03 PM, Robert Finch wrote:
    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch  <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS
    support
    is there for them?

    IVTs are a terrible idea.  Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

        - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPN-> to the
    same PPN are possible. Is it the freeing up of physical pages in SW that
    cause an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages
    can point to a single physical page using a a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is
    walked on a miss instead of walking main memory. Page faults would be
    handled the same way.

    The table is clustered, the hash points to eight entries which are
    then searched in parallel for the correct one. If not found a table
    walk begins using quadratic open addressing.


    It isn't obvious (at least to me) how what you are describing is much different from a SW managed TLB.

    It is not a lot different. On a translation miss the table is walked
    with HW though instead of SW walking main memory. There are enough
    entries in the table to cover the physical address range. Only page
    faults are handled in SW which should be rare. A page fault is generated after 30 steps in the walk though. That is a miss on 240 possible
    locations. 240-way associativity is not enough?
    The page fault handler will like just report 'out-of-memory' then quit
    the app.


    Main difference I think was that I ended up mostly settling on modulo addressing rather than hashing, but the reasoning here is more subtle: while on average hashing is better than modulo addressing, the worst
    case behavior for hashing is worst than the worst case for modulo.

    Effectively, hashing requires twice the associativity to not shoot
    itself in the foot in the worst case, whereas modulo is merely slightly worse in the average case.

    I think clustering give the table more associativity. Up to eight VAs
    can all hash to the same cluster. Almost as if it were eight-way associative.

    I think clustering is key to using an inverted hash-table. I got this
    from studying the PowerPC.


    Ironically, also means that while small-scale alignment is often preferable, medium and large scale alignment is best minimized (or, IOW, ASLR can actually be beneficial for performance by reducing unrelated
    data stomping the same locations in the caches and TLB).


    Though, there are intermediate options, say, for a TLB, XOR'ing the
    index with the ASID for non-global pages, *1; but then this means
    needing to partition the TLB between global and local (halving associativity relative to LUT budget), in turn still favoring the use of modulo addressing.


    Granted, a hashing scheme and multiple probing steps could be possible
    to increase effective associativity at the cost of latency (say, in an extreme case, using latency tricks to cause a 2-way TLB to pretend to be
    4 or 8 way).

    Not too worried about latency as long as it is reasonable. Every step is another eight ways associative due to clustering.

    Be careful !!

    Matrix300 (SPEC89) uses all 4 transposes of DGEMM. The 4th transpose takes
    a TLB miss every 4th Load; or once every 2-cycles on the 6-wide machine.
    A 32-entry fully associative TLB would miss every other cycle, while a 256-entry Direct Mapped TLB would never take a miss.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sat Jan 31 12:31:36 2026
    From Newsgroup: comp.arch

    Robert Finch wrote:
    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea. Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

    - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for
    a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPN-> to the
    same PPN are possible. Is it the freeing up of physical pages in SW that
    cause an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages can point to a single physical page using a a hash table. Is it just a performance issue? ASID is part of the hash/compare.

    Be careful about ASID's - they aren't just part of the virtual address
    in MMU's like x86. There is a logic aspect in the address compare.
    If the PTE Global flag is set then ASID is NOT included in the VA compare,
    if the PTE Global flag is clear then ASID is included in the VA compare.

    If the G flag is set on OS kernel pages then the effect is to make the
    kernel virtual pages one single address space on all cores with no
    associated ASID, while process virtual spaces reside in their own ASID's.

    The above is easy to implement in a CAM because the Global flag is
    part of the PTE entry. Compare the virtual page number (VPN) and if
    there is a PTE match, then if the PTE.G flag is set take ASID as matched, otherwise check if ASID is matched.

    But if you just tack the ASID onto the VA to get a larger VA and
    use that as an index then you don't get the same logic because
    the PTE.G flag that tells you whether ASID matters for that PTE,
    is on the PTE that you are looking up.
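
    In C terms, the CAM-style compare being described is roughly as follows
    (field names are illustrative):

    #include <stdint.h>

    typedef struct {
        uint64_t vpn;      /* virtual page number tag          */
        uint16_t asid;     /* owning address-space ID          */
        uint8_t  global;   /* PTE.G: match regardless of ASID  */
        uint64_t pte;
    } tlb_entry;

    /* ASID only participates in the compare when the entry is not global. */
    static inline int tlb_match(const tlb_entry *e, uint64_t vpn, uint16_t cur_asid)
    {
        if (e->vpn != vpn)
            return 0;
        return e->global || e->asid == cur_asid;
    }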

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is walked on a miss instead of walking main memory. Page faults would be handled the same way.

    Yes.
    Long ago I read a paper about a port of Linux to a PowerPC with its IVT.
    Linux memory management is based on a hierarchical page table (HPT) so
    rather than rewrite the memory manager they treated the IVT as a large TLB.
    As pages were touched they copied the PTE's from the HPT into IVT.
    When there was a process switch they purged the IVT of all entries
    the same way one might purge a TLB. Later they added ASID's so they
    didn't have to purge IVT on address space switch.

    The code that copied the PTE's from HPT to IVT was very similar
    to a SW managed TLB, as VM Virtual Machines also do if they don't
    have nested page tables.

    Also you can't guarantee that your whole program will fit into a single
    inverted/hash table of a predetermined size because you don't know in
    advance which entries will collide and push other entries out.
    Even if it does fit today, any change to the program could break that.


    The table is clustered, the hash points to eight entries which are then searched in parallel for the correct one. If not found a table walk
    begins using quadratic open addressing.



    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for
    instance.

    A hierarchical page table is a better solution, but I was looking for
    something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.

    What is your MMU currently doing that uses 9000 LUTs?

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sat Jan 31 16:29:07 2026
    From Newsgroup: comp.arch

    EricP wrote:
    Robert Finch wrote:
    ??? I think a hash table has this characteristic. Multiple VA pages
    can point to a single physical page using a a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.

    Be careful about ASID's - they aren't just part of the virtual address
    in MMU's like x86. There is a logic aspect in the address compare.
    If the PTE Global flag is set then ASID is NOT included in the VA compare,
    if the PTE Global flag is clear then ASID is included in the VA compare.

    If the G flag is set on OS kernel pages then the effect is to make the
    kernel virtual pages one single address space on all cores with no
    associated ASID, while process virtual spaces reside in their own ASID's.

    The above is easy to implement in a CAM because the Global flag is
    part of the PTE entry. Compare the virtual page number (VPN) and if
    there is a PTE match, then if the PTE.G flag is set take ASID as matched, otherwise check if ASID is matched.

    But if you just tack the ASID onto the VA to get a larger VA and
    use that as an index then you don't get the same logic because
    the PTE.G flag that tells you whether ASID matters for that PTE,
    is on the PTE that you are looking up.

    One could achieve the same effect as the CAM approach by picking off the
    range of addresses assigned to the OS and assigning them to ASID = 0,
    and having the process range use the current non-zero ASID register.

    If we have a 64-bit VA with 4kB pages = 52-bit virtual page number
    and we added a 12-bit ASID then that gives a 76-bit virtual space
    with a 64-bit AS-VPN to look up. We reserve ASID=0 for the OS,
    1..4095 are for processes.

    The MMU has a register MMU_OS_START that holds the OS unsigned lower
    start address and a comparator. If the VA is >= MMU_OS_START then
    it uses an ASID = 0, otherwise it comes from the MMU_ASID register.

    This 12-bit value is concatenated with the 52-bit VPN to give a
    64-bit combined ASID-VPN index value to lookup in the TLB.

    The OS software that assigns ASIDs also skips assigning 0 to any process.
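
    As a C sketch of that index construction (the register names follow the
    description above; the shift amounts assume the 4kB-page, 52-bit-VPN,
    12-bit-ASID split given there):

    #include <stdint.h>

    extern uint64_t MMU_OS_START;   /* unsigned lower bound of the OS range */
    extern uint16_t MMU_ASID;       /* current process ASID, 1..4095        */

    #define PAGE_SHIFT 12u          /* 4kB pages -> 52-bit VPN              */

    /* Build the 64-bit ASID-VPN lookup index: OS-range addresses get ASID 0. */
    static inline uint64_t asid_vpn_index(uint64_t va)
    {
        uint64_t asid = (va >= MMU_OS_START) ? 0u : (uint64_t)MMU_ASID;
        uint64_t vpn  = va >> PAGE_SHIFT;        /* 52-bit virtual page number */
        return (asid << 52) | vpn;               /* 12-bit ASID ++ 52-bit VPN  */
    }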



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Jan 31 17:03:54 2026
    From Newsgroup: comp.arch

    On 2026-01-31 12:31 p.m., EricP wrote:
    Robert Finch wrote:
    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch  <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS
    support
    is there for them?

    IVTs are a terrible idea.  Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

        - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for
    a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPN-> to the
    same PPN are possible. Is it the freeing up of physical pages in SW
    that
    cause an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages
    can point to a single physical page using a a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.

    Be careful about ASID's - they aren't just part of the virtual address
    in MMU's like x86. There is a logic aspect in the address compare.
    If the PTE Global flag is set then ASID is NOT included in the VA compare,
    if the PTE Global flag is clear then ASID is included in the VA compare.

    If the G flag is set on OS kernel pages then the effect is to make the
    kernel virtual pages one single address space on all cores with no
    associated ASID, while process virtual spaces reside in their own ASID's.

    A global bit is not being used. The address space is divided in two by
    the MSB. The kernel space has the MSB set and is unmapped. Previously I
    used an ASID of zero to represent the kernel address space. The ASID is
    not being used as an index.

    The above is easy to implement in a CAM because the Global flag is
    part of the PTE entry. Compare the virtual page number (VPN) and if
    there is a PTE match, then if the PTE.G flag is set take ASID as matched, otherwise check if ASID is matched.

    But if you just tack the ASID onto the VA to get a larger VA and
    use that as an index then you don't get the same logic because
    the PTE.G flag that tells you whether ASID matters for that PTE,
    is on the PTE that you are looking up.

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is
    walked on a miss instead of walking main memory. Page faults would be
    handled the same way.

    Yes.
    Long ago I read a paper about a port of Linux to a PowerPC with its IVT. Linux memory management is based on a hierarchical page table (HPT) so
    rather that rewrite the memory manager they treated the IVT as a large TLB. As pages were touched they copied the PTE's from the HPT into IVT.
    When there was a process switch they purged the IVT of all entries
    the same way one might purge a TLB. Later they added ASID's so they
    didn't have to purge IVT on address space switch.

    The code that copied the PTE's from HPT to IVT was very similar
    to a SW managed TLB, as VM Virtual Machines also do if they don't
    have nested page tables.

    I am not sure about getting Linux working. I may just use my own OS.
    Skimming through the mmap() code has given me an idea of just how
    complex the software is.

    Also you can't guarantee that your whole program will fit into an single inverted/hash table of a predetermined size because you don't know in
    advance which entries will collide and push other entries out.
    Even if it does fit today, any change to the program could break that.

    Collisions will not push other entries out; they just bounce to a
    different empty entry. If one cannot be found then it is an out-of-memory.

    Testing some. Testing just allocates pages without freeing them so it is
    a bit unrealistic.
    The IVT stopped after 63 bounces at 69% full. (A max of 64 bounces is
    set). Average number of bounces: 1.8 per lookup. This was with random
    ASIDs and addresses.
    The table can be clocked at twice the CPU rate.

    I have been wondering how to collect and use stats to control things.

    The entries in the IVT for ASIDs no longer in use could be cleared automatically by a background HW process.


    The table is clustered, the hash points to eight entries which are
    then searched in parallel for the correct one. If not found a table
    walk begins using quadratic open addressing.



    I seem to recall at least one fellow advocating the limited use of
    shared memory, using replication instead of shared libraries for
    instance.

    A hierarchical page table is a better solution, but I was looking for
    something lower cost. My monster 2-level TLB is currently about 9000
    LUTs (I have been working to reduce this). The IVT is about 900 LUTs.

    What is your MMU currently doing that uses 9000 LUTs?

    I tried to over build it originally. It currently has a triple ported L2
    TLB so that L2 TLB entries can be used directly to form addresses on a
    miss. This trims at least one clock cycle off the miss time for L1. The
    L1 TLB is fully associative and triple ported as well. L2 is also split
    into two TLBs one for regular pages and a second TLB for huge pages.
    There is a triple-ported miss queue. The TLB hit logic is a bit scary.
    Then there is the logic to update the accessed and modified flags of the
    TLB entry.

    I have been working on the TLB to make L2 single ported and the miss
    queue single ported as well. That should reduce both the LUT cost and
    BRAM cost. It will reduce performance a little bit if there are
    concurrent misses happening.


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Sun Feb 1 15:28:47 2026
    From Newsgroup: comp.arch

    In article <10lji2u$2haq2$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    On 2026-01-30 4:44 p.m., BGB wrote:
    On 1/30/2026 3:03 PM, Robert Finch wrote:

    [ Proposal is a hashed page table, with 2x the entries to cover the entire physical address space, with each hash entry having 8 pages which need to
    be searched before moving to the next entry].

    It isn't obvious (at least to me) how what you are describing is much
    different from a SW managed TLB.

    It is not a lot different. On a translation miss the table is walked
    with HW though instead of SW walking main memory. There are enough
    entries in the table to cover the physical address range. Only page
    faults are handled in SW which should be rare. A page fault is generated
    after 30 steps in the walk though. That is a miss on 240 possible
    locations. 240-way associativity is not enough?
    The page fault handler will like just report 'out-of-memory' then quit
    the app.

    I'm going to guess your physical address space is just 32-bits, otherwise
    this table is huge. I think you said you have 64KB pages (but then you said 256KB pages--which is just too big, so I'll ignore that), which for a
    44-bit physical address space would mean 2^28 entries, with each entry
    being 128-bits, for a 4GB page table. But for 32-bit, it's just 2^16
    entries of 64-bits each.

    A page table which doesn't scale is problematic. Especially if it needs
    OS support. Who wants to invest effort in a dead-end idea?

    Your hashed page table has issues which I think you should address:

    1) Supporting multiple page sizes is a pain. If you have 2MB pages and
    64KB pages, there are 32 64KB pages for each 2MB page. So do you
    lazily load the hashtable only on misses, or load all 32 aliases
    at once?

    2) It's common to just do a single hash lookup, then trap to software. Doing
    probing is a big pain. The problems come from removing entries.
    If you page out a page, it leaves gaps in the table. When you do
    a hardware lookup, and find no valid pages at the first entry, this
    doesn't tell you anything! The obvious solution is ALL hardware
    lookups must do the maximum searching before declaring a hash-table
    miss--which you say is 30 steps, checking 240 entries. I'm not sure
    there's a better solution than that, but this is a pretty big
    penalty for a TLB miss.

    3) It's very complex to efficiently update the hash table. On inserting
    a new entry, where do you put it? On one hand, a "move-to-front"
    strategy seems like a good idea--newly added pages seem like they
    should get lookup priority. But how do you do this? It looks like
    each software handled miss has a TON of work to do. It seems like
    you need to re-hash each entry you consider moving, to make sure it
    stays on its proper hash chain. And keeping the OS page structures
    in sync looks like even more work.

    4) The global nature of the hash table means multiple threads probably need
    a lock to change it. So misses to software have to get a lock before
    making any hashtable changes. This is very expensive. This is
    especially bad when a process ends--all of its translations need to
    be removed, and so all page table misses are blocked while that
    happens. This is bad.

    A simpler approach is a per-process hash table of variable size in memory, sized to at least 2X the number of pages being used. It's indexed by a hash
    of ASID+VA, and does a single probe. It might be useful to do two hash
    tables: one for large pages, and one for small pages (with 2 lookups).
    On any miss, the hash entry to be updated is easy to find for software.
    But: there's effectively no associativity--but just increase the hash table's size if software detects that you are getting conflicts. Pick a good
    hash function to avoid stupid conflicts. I recommend you make the hash function an instruction, so software can calculate it easily (so if you
    reverse bits and do other stuff that may be hard for software to calculate,
    you make it easy).
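
    A minimal C sketch of that per-process, single-probe table (the hash mix
    and field names are placeholders; the point is that software can compute
    the same index the hardware does):

    #include <stdint.h>

    typedef struct { uint64_t va_tag; uint64_t pte; } hpt_entry;

    typedef struct {
        hpt_entry *table;   /* per-process, sized >= 2x the pages in use */
        uint64_t   mask;    /* entries - 1, entries a power of two       */
    } hpt;

    /* The hash the ISA would expose as an instruction; this particular mix
       is only an illustrative placeholder. */
    static inline uint64_t hpt_hash(uint16_t asid, uint64_t vpn)
    {
        uint64_t x = vpn ^ ((uint64_t)asid << 1);
        x ^= x >> 17;
        x *= 0x9E3779B97F4A7C15ull;
        return x;
    }

    /* Single probe: hit returns the PTE, miss traps to software, which
       knows exactly which slot it will need to fill. */
    static inline int hpt_lookup(const hpt *t, uint16_t asid, uint64_t vpn,
                                 uint64_t *pte_out)
    {
        hpt_entry *e = &t->table[hpt_hash(asid, vpn) & t->mask];
        if (e->va_tag == vpn) { *pte_out = e->pte; return 1; }
        return 0;
    }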

    This gives you the benefits of a hardware table walker with less trouble.

    An advantage of hashed page tables is the TLB itself has no "intermediate" entries. This lets the primary TLB hold more useful translations. This
    is not much of an issue if the primary TLB is >= 256 entries, but it's
    really noticeable if the TLB is <= 32 entries. You need to lean in to
    hashed page tables' advantages.

    Kent
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Feb 1 12:49:53 2026
    From Newsgroup: comp.arch

    Robert Finch wrote:
    On 2026-01-31 12:31 p.m., EricP wrote:
    Robert Finch wrote:

    The hash table is only slightly different than a giant TLB. The TLB
    is walked on a miss instead of walking main memory. Page faults would
    be handled the same way.

    Yes.
    Long ago I read a paper about a port of Linux to a PowerPC with its IVT.
    Linux memory management is based on a hierarchical page table (HPT) so
    rather that rewrite the memory manager they treated the IVT as a large
    TLB.
    As pages were touched they copied the PTE's from the HPT into IVT.
    When there was a process switch they purged the IVT of all entries
    the same way one might purge a TLB. Later they added ASID's so they
    didn't have to purge IVT on address space switch.

    The code that copied the PTE's from HPT to IVT was very similar
    to a SW managed TLB, as VM Virtual Machines also do if they don't
    have nested page tables.

    I am not sure about getting Linux working. I may just use my own OS. Skimming through the mmap() code has given me an idea of just how
    complex the software is.

    It was just an example of how one can use a page table to emulate
    any other kind of page table.

    Also you can't guarantee that your whole program will fit into an single
    inverted/hash table of a predetermined size because you don't know in
    advance which entries will collide and push other entries out.
    Even if it does fit today, any change to the program could break that.

    Collisions will not push other entries out, they just bounce to a
    different empty entry. If one cannot be found then it is an out-of-memory.

    Testing some. Testing just allocates pages without freeing them so it is
    a bit unrealistic.
    The IVT stopped after 63 bounces at 69% full. (A max of 64 bounces is
    set). Average number of bounces: 1.8 per lookup. This was with random
    ASIDs and addresses.
    The table can be clocked at twice the CPU rate.

    One might make the hash function controllable so you can optimize the
    bounce length for different programs. Also the max bounce limit should
    be a dynamic HW register and not hard wired.

    I have been wondering how to collect and use stats to control things.

    The entries in the IVT for ASIDs no longer in use could be cleared automatically by a background HW process.

    That implies some kind of ASID allocator to track which are in-use.
    It also implies a whole IVT scan as each ASID is freed,
    or maintaining a pending free list and a scavenger process.

    Simpler to have process creation assign a new ASID by incrementing
    an unsigned up counter that starts at 1. When you hit its max then reset
    to 1 and clear the whole IVT for all non-zero ASID entries in one pass.
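
    That scheme is only a few lines of software; in this C sketch the 12-bit
    ASID width and the ivt_clear_nonzero_asids() helper are assumptions:

    #include <stdint.h>

    #define ASID_MAX 4095u            /* assumed ASID width              */

    static uint32_t next_asid = 1;    /* 0 stays reserved for the kernel */

    extern void ivt_clear_nonzero_asids(void);   /* one pass over the IVT */

    /* Called at process creation: hand out ASIDs from an up-counter and
       wipe the IVT only when the counter wraps. */
    uint32_t alloc_asid(void)
    {
        if (next_asid > ASID_MAX) {
            ivt_clear_nonzero_asids();
            next_asid = 1;
        }
        return next_asid++;
    }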


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Feb 1 16:41:34 2026
    From Newsgroup: comp.arch

    On 2026-02-01 10:28 a.m., Kent Dickey wrote:
    In article <10lji2u$2haq2$1@dont-email.me>,
    Robert Finch <robfi680@gmail.com> wrote:
    On 2026-01-30 4:44 p.m., BGB wrote:
    On 1/30/2026 3:03 PM, Robert Finch wrote:

    [ Proposal is a hashed page table, with 2x the entries to cover the entire physical address space, with each hash entry having 8 pages which need to
    be searched before moving to the next entry].

    It isn't obvious (at least to me) how what you are describing is much
    different from a SW managed TLB.

    It is not a lot different. On a translation miss the table is walked
    with HW though instead of SW walking main memory. There are enough
    entries in the table to cover the physical address range. Only page
    faults are handled in SW which should be rare. A page fault is generated
    after 30 steps in the walk though. That is a miss on 240 possible
    locations. 240-way associativity is not enough?
    The page fault handler will like just report 'out-of-memory' then quit
    the app.

    I'm going to guess your physical address space is just 32-bits, otherwise

    Yes, this is for a 29-bit DRAM address space. I was going to use 2^13
    entries with 256kB (or possibly smaller) pages. It is only usable with a
    smaller address space, where the table can fit into BRAM.
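
    For reference, the sizing works out roughly as follows (the 16-byte
    entry width is an assumption, not something stated here):

      29-bit PAS              = 512 MB of DRAM
      512 MB / 256 kB pages   = 2048 physical page frames
      2^13 table entries      = 8192 entries, about 4x the frame count
      8192 entries * 16 B     = 128 kB of BRAM (assuming 16-byte entries)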

    this table is huge. I think you said you have 64KB pages (but then you said 256KB pages--which is just too big, so I'll ignore that), which for a
    44-bit physical address space would mean 2^28 entries, with each entry
    being 128-bits, for a 4GB page table. But for 32-bit, it's just 2^16
    entries of 64-bits each.

    A page table which doesn't scale is problematic. Especially if it needs
    OS support. Who wants to invest effort in a dead-end idea?

    Your hashed page table has issues which I think you should address:

    1) Supporting multiple page sizes is a pain. If you have 2MB pages and
    64KB pages, there are 32 64KB pages for each 2MB page. So do you
    lazily load the hashtable only on misses, or load all 32 aliases
    at once?

    Supporting multiple page sizes is a pain even for a hierarchical table.
    Multiple TLBs are used for that, so maybe multiple hash tables could be
    used. But I was not planning on supporting more than one page size.


    2) It's common to just do a single hash lookup, then trap to software. Doing
    probing is a big pain. The problems come from removing entries.
    If you page out a page, it leaves gaps in the table. When you do
    a hardware lookup, and find no valid pages at the first entry, this
    doesn't tell you anything! The obvious solution is ALL hardware

    Thanks, I had forgotten the issue with removes. Inserts are easy without
    removes. I wonder if the removes could be done in bulk, similar to
    garbage collection: just mark the entry as deleted rather than leaving a
    gap, then have a process that periodically rebuilds the table. For the
    table size I am planning on, garbage collecting it may not be too bad.

    The table probes until it finds the entry, or an empty entry in a group
    where the lookup should be. It stops after a certain number of probes if
    there are no empty entries in the group.

    On a miss, the lookup returns the group it stopped at and a flag saying
    whether that group has an empty slot, so adding a new entry should be
    straightforward.
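
    As a software model of the probe loop (my own guesses for the details:
    8-entry groups, quadratic stepping, and a "deleted" tombstone state per
    the garbage-collection idea above; not the actual RTL):

    /* Software model of the clustered lookup: hash to a group of eight
       entries, search the group (in HW this search is parallel), then
       quadratic-step to the next group, up to a bounce limit.  Sizes,
       hash and entry layout are illustrative only.                     */
    #define GROUP_SIZE   8
    #define NUM_GROUPS   1024              /* illustrative */
    #define MAX_PROBES   30

    enum { EMPTY, VALID, DELETED };        /* DELETED = tombstone, not a gap */

    typedef struct { int state; unsigned asid, vpn, ppn; } pte_t;
    static pte_t table[NUM_GROUPS][GROUP_SIZE];

    /* Returns 1 and *ppn on a hit.  On a miss, *group_out / *has_empty
       tell software which group it stopped at and whether a free slot
       exists there, which is what the insert path needs.               */
    int ivt_lookup(unsigned asid, unsigned vpn, unsigned *ppn,
                   unsigned *group_out, int *has_empty)
    {
        unsigned g = (asid ^ vpn) % NUM_GROUPS;        /* illustrative hash */
        for (unsigned probe = 0; probe < MAX_PROBES; probe++) {
            *group_out = g;
            *has_empty = 0;
            for (int i = 0; i < GROUP_SIZE; i++) {
                pte_t *e = &table[g][i];
                if (e->state == VALID && e->asid == asid && e->vpn == vpn) {
                    *ppn = e->ppn;
                    return 1;                          /* hit */
                }
                if (e->state == EMPTY)
                    *has_empty = 1;
            }
            if (*has_empty)
                return 0;    /* empty slot where the entry should be: miss */
            g = (g + (probe + 1) * (probe + 1)) % NUM_GROUPS;  /* quadratic */
        }
        return 0;            /* bounce limit reached: raise a page fault */
    }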

    lookups must do the maximum searching before declaring a hash-table
    miss--which you say is 30 steps, checking 240 entries. I'm not sure
    there's a better solution than that, but this is a pretty big
    penalty for a TLB miss.

    30 steps (in HW) is about as fast as a divide operation. It only works
    because the table is in BRAM; if it were DRAM it would be many clocks
    per lookup.


    3) It's very complex to efficiently update the hash table. On inserting
    a new entry, where do you put it? On one hand, a "move-to-front"
    strategy seems like a good idea--newly added pages seem like they
    should get lookup priority. But how do you do this? It looks like
    each software handled miss has a TON of work to do. It seems like

    I was not planning on anything that complex. The miss returns the group
    at which the entry should be located, so SW should know where to insert.
    For testing, the test bench fakes inserting pages. The issues arise when
    the table gets full.

    you need to re-hash each entry you consider moving, to make sure it
    stays on its proper hash chain. And keeping the OS page structures
    in sync looks like even more work.

    4) The global nature of the hash table means multiple threads probably need
    a lock to change it. So misses to software have to get a lock before
    making any hashtable changes. This is very expensive. This is
    especially bad when a process ends--all of its translations need to
    be removed, and so all page table misses are blocked while that
    happens. This is bad.

    Definitely an issue. Maybe only a single system thread would be allowed
    to update it.


    A simpler approach is a per-process hash table of variable size in
    memory, sized to at least 2X the number of pages being used. It's
    indexed by a hash of ASID+VA, and does a single probe. It might be
    useful to do two hash tables: one for large pages, and one for small
    pages (with 2 lookups).
    On any miss, the hash entry to be updated is easy to find for software.
    But: there's effectively no associativity--but just increase the hash
    table's size if software detects that you are getting conflicts. Pick a
    good hash function to avoid stupid conflicts. I recommend you make the
    hash function an instruction, so software can calculate it easily (so if
    you reverse bits and do other stuff that may be hard for software to
    calculate, you make it easy).
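
    To make that concrete, a sketch of the single-probe lookup; the hash
    shown is made up (the point of exposing the hash as an instruction is
    that software computes exactly what the hardware does):

    /* Single-probe, per-process hash table along the lines described
       above: power-of-two size, at least 2x the resident page count.
       hash_asid_va() stands in for the proposed hash instruction; the
       mix used here is only an example.                               */
    typedef struct { unsigned asid, vpn, ppn, flags; } hpte_t;

    static unsigned hash_asid_va(unsigned asid, unsigned vpn)
    {
        unsigned h = vpn ^ (asid * 0x9E3779B9u);       /* example mix */
        return h ^ (h >> 16);
    }

    /* One probe: either a hit, or trap to software -- no HW chaining. */
    int hpt_lookup(const hpte_t *table, unsigned size_log2,
                   unsigned asid, unsigned vpn, unsigned *ppn)
    {
        const hpte_t *e =
            &table[hash_asid_va(asid, vpn) & ((1u << size_log2) - 1)];
        if (e->asid == asid && e->vpn == vpn) {
            *ppn = e->ppn;
            return 1;
        }
        return 0;            /* miss: software refills this exact slot */
    }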

    This gives you the benefits of a hardware table walker with less trouble.

    An advantage of hashed page tables is the TLB itself has no
    "intermediate" entries. This lets the primary TLB hold more useful
    translations. This is not much of an issue if the primary TLB is >= 256
    entries, but it's really noticeable if the TLB is <= 32 entries. You
    need to lean into the advantages of hashed page tables.

    Kent

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Feb 1 16:12:34 2026
    From Newsgroup: comp.arch

    On 1/30/2026 7:36 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 4:44 p.m., BGB wrote:
    On 1/30/2026 3:03 PM, Robert Finch wrote:
    On 2026-01-30 3:35 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-01-30 6:16 a.m., Dan Cross wrote:
    In article <10lfuka$1d2c4$1@dont-email.me>,
    Robert Finch  <robfi680@gmail.com> wrote:
    Have inverted page tables fallen out of favor? And how much OS support
    is there for them?

    IVTs are a terrible idea.  Consider the challenge of using them
    to share memory between virtual address spaces in a
    multiprocessing system, efficiently.

        - Dan C.

    I may have used misleading terminology, the IVT I refer to is a hash
    table based one. I tend to think of them as the same thing. I do not
    think anybody would use a plain IVT.

    <response to earlier comments>
    Is the entire VAS covered by the hierarchical page table system?

    With the entire PAS covered by the page table in BRAM it can be walked
    in hardware very quickly, one cycle per step as opposed to walking the
    page table in DRAM which could be quite slow.

    Process switch is handled by including an ASID in the mapping as for
    a TLB.

    For the IVT implementation the table is twice the size needed to cover
    the PAS to allow for shared memory pages.


    The table just stores mappings VPN -> PPN so I am not quite following
    the challenge to using them for shared memory? Multiple VPNs mapping to
    the same PPN are possible. Is it the freeing up of physical pages in SW
    that causes an issue?

    Consider mmap(file) in multiple different processes at different VAs.
    So, now one has multiple VA pages pointing at one Physical page.

    ??? I think a hash table has this characteristic. Multiple VA pages
    can point to a single physical page using a hash table. Is it just a
    performance issue? ASID is part of the hash/compare.

    I guess I should take a look at the mmap() code.

    The hash table is only slightly different than a giant TLB. The TLB is
    walked on a miss instead of walking main memory. Page faults would be
    handled the same way.

    The table is clustered, the hash points to eight entries which are
    then searched in parallel for the correct one. If not found a table
    walk begins using quadratic open addressing.


    It isn't obvious (at least to me) how what you are describing is much
    different from a SW managed TLB.

    It is not a lot different. On a translation miss the table is walked
    with HW though instead of SW walking main memory. There are enough
    entries in the table to cover the physical address range. Only page
    faults are handled in SW which should be rare. A page fault is generated
    after 30 steps in the walk though. That is a miss on 240 possible
    locations. 240-way associativity is not enough?
    The page fault handler will likely just report 'out-of-memory' and then
    quit the app.


    Errm, now I am feeling uncertain about how viable implementing this
    would be...


    Sometimes, it ends up being better to implement the minimum complexity mechanism that can work, and only implement a more complex mechanism if
    there is a significant drawback or inefficiency with the simpler
    mechanism that the more complex mechanism could address.


    Like, I can note my use of a custom RP2 compression scheme:
    It is a byte oriented LZ variant:
    Typically slightly better than LZ4 for general data;
    Typically slightly worse than LZ4 for program binaries.


    I have one use-case where I use RP2:
    For compressing voxel chunks in my BT3 3D engine.

    Each chunk is 16*16*16 blocks, with one of several in-RAM storage
    schemes (selection sketch below):
    Single block type (no index);
    2-16 block types, 4-bit index, 2K/chunk;
    17-256 block types, 8-bit index, 4K/chunk;
    257-4096 block types: no index, 32K/chunk.
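
    A sketch of how the storage scheme could be picked from the number of
    distinct block types; the 8-byte full cell behind the 32K figure is my
    inference, not something stated:

    /* Chunk is 16*16*16 = 4096 cells; pick the index width from the
       number of distinct block types.  Sizes match the list above; the
       8-byte full cell behind the 32K figure is an assumption.         */
    enum chunk_mode { MODE_SINGLE, MODE_IDX4, MODE_IDX8, MODE_FULL };

    enum chunk_mode pick_chunk_mode(int ntypes)
    {
        if (ntypes <= 1)   return MODE_SINGLE;  /* no index at all         */
        if (ntypes <= 16)  return MODE_IDX4;    /* 4096 * 4 bits  =  2 KB  */
        if (ntypes <= 256) return MODE_IDX8;    /* 4096 * 8 bits  =  4 KB  */
        return MODE_FULL;                       /* 4096 * 8 bytes = 32 KB  */
    }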

    So, for storage, it flattens out the chunk:
    Simple header;
    List of block types;
    Index data.

    Then RP2 compresses the chunk.

    The previous engine (BT2) had used a LZ+STF+AdRice scheme here, but this
    is slower than a byte-oriented scheme (like RP2 or LZ4), and the
    compression delta is small. So, BT3 originally went with RP2 mostly as
    it was faster.

    Technically, either LZ4 or RP2 could be used, but RP2 notably beats LZ4
    here, as the smallest run encoding ends up being used heavily:
    dddddddd-dlllrrr0:
    0-7 literal bytes, 3-10 matched bytes, distance=0..511
    A distance of 0 is invalid, so it can be used as an escape code.
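
    For illustration, decoding just that 16-bit tag might look as follows;
    the exact bit packing (low bit as the format flag, fields in the order
    shown) is my reading of the notation, not a published spec:

    /* Decode the smallest RP2 pair format, "dddddddd-dlllrrr0".
       Assumed layout: bit 0 = 0 tags this format, bits 1..3 = match
       length - 3, bits 4..6 = literal count, bits 7..15 = distance
       (distance 0 acts as the escape code).                          */
    #include <stdint.h>

    struct rp2_run { int nlit, nmatch, dist, is_escape; };

    struct rp2_run rp2_decode_short(uint16_t tag)
    {
        struct rp2_run r;
        r.nmatch    = ((tag >> 1) & 0x7) + 3;   /* 3..10 matched bytes */
        r.nlit      = (tag >> 4) & 0x7;         /* 0..7 literal bytes  */
        r.dist      = (tag >> 7) & 0x1FF;       /* 0..511              */
        r.is_escape = (r.dist == 0);
        return r;
    }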

    The next encoding size up is 3 bytes: 4-67 match bytes, with up to an 8K
    distance. This seems to be the main difference with program binaries,
    which skew more towards short matches at long distances, which favor LZ4.

    RP2 though has a stronger advantage when the input data is under 8K.
    Though, it is possible that a 35-byte match over a 16K distance might
    have been better on average (possibly with another match type for "long
    match at short distance", or RLE-like patterns).



    I recently evaluated a possible trick of bolting on STF+AdRice after the
    fact (on top of RP2) as a sort of post-compressor entropy stage (jank,
    but limits computational complexity on the encoder side).

    Had then noted a pattern (based on the RP2-compressed size):
    Under 512 bytes: Entropy stage typically makes it bigger.
    This was the dominant size class here.
    1-2K: Entropy helps slightly, but fails to cross a "useful threshold".
    4K+: Mostly unseen.

    Then went and grabbed a bitwise range coder, and tried using this as a
    test, along with something resembling a "mini-LZMA" (LZ + bitwise range
    coder, along similar lines to LZMA).

    Result: none crossed the threshold...
    The threshold: give an additional 20% size reduction over the use of RP2.

    Might at least get used sometimes if the threshold were 5% or 10%, but
    don't really want to spend the comparably steep cost of an entropy coder
    for a mere 5% or 10% gain.

    Why not Huffman (or Deflate)?
    Because, ironically, for small highly compressible data, Deflate
    actually turns out to be worse here than RP2.

    Here, one wants the entropy-coded compressors to be more, "break glass
    in case of compression emergency".



    Though, that said, bolting an LZMA style range coder onto RP2 would
    still appear to be a moderately strong possibility (in terms of
    compression), even if bitwise range coding is almost impractically slow
    (on a desktop PC, one is hard pressed to push more than around 50 MB/sec through a range coder).


    Though, that said, there is sometimes something in my mind that seems to
    be trying to yell out that the LZMA range coder design holds a clue for greatly increasing the efficiency and learning rate of neural nets (but
    the rest of my mind can't figure out how to combine them; and both
    genetic algorithms and back-propagation leave something to be desired...).

    If anything, one could try to go the other way and use a neural-net as a predictive model for a range-coder, but then one would merely be burning
    CPU time.

    Though, did note a phenomenon with backprop neural nets where they were
    using weakly-biased neurons as a sort of memory between training steps
    (in the case of repeating test cases, it would "memorize" the answer and
    then use it to cheat on the following run when given the same inputs).
    Which is in some sense analogous to the normal behavioral pattern in a range-coder.

    Or, basically, the training process was working like "flash cards" and
    when it would get an answer wrong, the training algo would note it and
    then present the same inputs again a few passes later. The NN was then
    effectively remembering the inputs and the output adjustment and then
    repeating them again (without actually learning anything), and the net
    seemingly
    got pretty good at this (not learning the actual task, just sort of
    optimizing itself to be able to cheat at the NN training algorithm's use
    of repetition on wrong answers).


    Where, in my NN experiments, it would appear that near-0 behavior for
    weights may be more relevant than previously expected (and handling them
    in a similar way to DEC/PDP style floating-point may not be ideal). Even
    if it mostly reflects a use-case that is an adverse edge case in NN
    training (but, in a dynamically adapted net, could potentially be a useful/desirable property; working as a sort of working memory; well,
    along with feeding part of the activation outputs back into the inputs,
    etc).

    Well, also in range-coder predictors, the relative adjustment strength
    depends on which direction one is moving and where one is in the space.
    If near 0, then movements away from 0 are rapid, but as one gets
    further away, it gets harder to make an increment (at the upper
    extremes, it becomes impossible to move further). The closer one is
    to 0, the harder it is to approach 0. Though this is likely an
    artifact of the distortion of what linear adjustments look like within a non-linear space.
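
    For concreteness, the standard LZMA-style adaptive bit model behaves
    exactly this way (the step is proportional to the distance from the
    extreme being moved toward); a generic sketch, not my actual code:

    /* LZMA-style adaptive bit model: p is the probability (out of 2048)
       that the next bit is 0.  The step toward either extreme shrinks in
       proportion to the remaining distance, which is the behavior
       described above.                                                  */
    #include <stdint.h>

    #define PROB_ONE    2048u      /* 11-bit fixed point */
    #define ADAPT_SHIFT 5          /* typical LZMA adaptation rate */

    static void bitmodel_update(uint16_t *p, int bit)
    {
        if (bit == 0)
            *p += (uint16_t)((PROB_ONE - *p) >> ADAPT_SHIFT);  /* p moves up   */
        else
            *p -= (uint16_t)(*p >> ADAPT_SHIFT);               /* p moves down */
    }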

    Comparably, the curve of FP values is steeper than that of RC predictors (exponential rather than square).

    Well, some of this could imply using some lookup-tables for weight
    adjustment in the backprop stage (similar to those used for a
    range-coder). Might also need to consider using IEEE-style denormals for
    FP8.

    Still feels as if I am missing something.


    Well, there is still the issue that I needed to run the NN training
    using FP16 weights internally, as directly training the FP8 weights
    wasn't stable enough. So, part of the issue is figuring out stable live
    adaptation on FP8 weights (which does kinda resemble the range-coder
    scenario). Well, and that (post backprop) the weight-update in this case
    is mostly XOR'ing the sign-bits to figure out which direction to adjust
    the weight (likewise, also resembling the RC update scenario), though
    skipping weight adjustment when the error is 0 (and skipping downward
    adjustment when the weight is already 0, etc).
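
    As a toy version of that update rule (sign-magnitude FP8 bytes, one ULP
    per step; which sign bits get XOR'd, and the sign convention, are
    guesses on my part):

    /* Toy version of the described FP8 nudge: sign-magnitude byte
       (bit 7 = sign, bits 0..6 = magnitude).  Direction comes from
       XOR'ing sign bits (here: input activation vs propagated error;
       which signs, and the convention, are guesses).  No update on a
       zero error, and no downward step once the magnitude is zero.   */
    #include <stdint.h>

    static uint8_t fp8_nudge(uint8_t w, uint8_t act, uint8_t err)
    {
        if ((err & 0x7F) == 0)
            return w;                          /* error == 0: skip        */
        uint8_t dir = (uint8_t)((act ^ err) & 0x80);   /* sign of the step */
        uint8_t mag = (uint8_t)(w & 0x7F);
        if (dir == (w & 0x80))                 /* step away from zero      */
            return (mag < 0x7F) ? (uint8_t)(w + 1) : w;
        if (mag == 0)
            return w;                          /* already zero: skip down  */
        return (uint8_t)(w - 1);               /* step toward zero         */
    }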


    Issue though seems to be that while FP8 works well enough for forward inference, the direct jumps between FP8 values are too large for
    backprop to work with effectively, and it essentially ends up needing
    some sub-ULP bits for finer adjustment (leading to the issue of using
    FP16 for the backprop stage; which would limit practicality for a live adaptation net).

    Also, would need to figure out a good way to replace "correct/incorrect" training with something more like "pain/no-pain" adaptation, but with
    "pain driven adaptation" one is left merely to do backprop using the
    inverse of the previous outputs as the error vector, which is far less effective (and not clearly much better than just using genetic
    algorithms as the training process; except that GAs limit scenarios a
    lot more than the use of positive and negative reinforcement as a
    training algorithm).

    But, the downside of positive/negative reinforcement is getting
    something much beyond simplistic insect-like behaviors.

    ...



    Main difference I think was that I ended up mostly settling on modulo
    addressing rather than hashing, but the reasoning here is more subtle:
    while on average hashing is better than modulo addressing, the worst
    case behavior for hashing is worse than the worst case for modulo.

    Effectively, hashing requires twice the associativity to not shoot
    itself in the foot in the worst case, whereas modulo is merely slightly
    worse in the average case.

    I think clustering gives the table more associativity. Up to eight VAs
    can all hash to the same cluster, almost as if it were eight-way
    associative.

    I think clustering is key to using an inverted hash-table. I got this
    from studying the PowerPC.


    Ironically, this also means that while small-scale alignment is often
    preferable, medium and large scale alignment is best minimized (or, IOW,
    ASLR can actually be beneficial for performance by reducing unrelated
    data stomping the same locations in the caches and TLB).


    Though, there are intermediate options, say, for a TLB, XOR'ing the
    index with the ASID for non-global pages, *1; but then this means
    needing to partition the TLB between global and local (halving
    associativity relative to LUT budget), in turn still favoring the use of
    modulo addressing.


    Granted, a hashing scheme and multiple probing steps could be possible
    to increase effective associativity at the cost of latency (say, in an
    extreme case, using latency tricks to cause a 2-way TLB to pretend to be
    4 or 8 way).

    Not too worried about latency as long as it is reasonable. Every step
    adds another eight ways of associativity due to clustering.

    Be careful !!

    Matrix300 (SPEC89) uses all 4 transposes of DGEMM. The 4th transpose takes
    a TLB miss every 4th Load; or once every 2-cycles on the 6-wide machine.
    A 32-entry fully associative TLB would miss every other cycle, while a 256-entry Direct Mapped TLB would never take a miss.


    In my own testing, it seemed for the main (L2) TLB, assuming modulo
    indexing and SW refill:
    1-way: doesn't work for L2 TLB (works well for an L1 TLB though);
    2-way: bare minimum, but can fail in certain edge cases (*1);
    4-way: Works pretty well;
    8-way: "near perfect" on the associativity front.

    In many contexts, there doesn't seem to be much gain in going over 8-way associativity, and only a modest gain vs 4-way.

    *1: There is a risk for cases where both D$ and I$ miss and both cross
    page boundaries in the access. This is less of an issue with
    modulo addressing than with hashing (if this happens with hashed
    indexing and 2-way associativity, one is SOL). It seems to squeeze
    through, if barely, with modulo indexing (because no two adjacent pages
    can stomp the same index).
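
    For clarity, the two indexing options being compared, with an arbitrary
    example hash:

    /* Two ways to pick the TLB set for a virtual page number.  With
       modulo, adjacent VPNs always land in different sets, so a D$/I$
       page-crossing pair cannot all collide; a hashed index gives no
       such guarantee.  The hash here is an arbitrary example.         */
    #define TLB_SETS 64                        /* power of two, example */

    static unsigned tlb_index_modulo(unsigned vpn)
    {
        return vpn % TLB_SETS;                 /* == vpn & (TLB_SETS - 1) */
    }

    static unsigned tlb_index_hashed(unsigned vpn, unsigned asid)
    {
        return (vpn ^ (vpn >> 6) ^ asid) % TLB_SETS;
    }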


    As noted, hashing is a risk here:
    Makes average better at expense of worst case;
    Effectively makes 4-way the bare minimum;
    Paradoxically, while making the average case better, it increases the associativity required to avoid performance loss in adverse cases (hash collisions).

    Some of the same benefits of hashing can also be realized in the OS via strategic use of ASLR (where ASLR tends to benefit TLB performance in
    this case).


    Can note that 8-way is an improvement over 4-way, but the gains are
    comparably smaller. Likewise, 16-way seems to gain very little over 8-way.

    Once the required associativity is present, more benefit is seen mostly
    by making the TLB bigger (if possible, you really want the working set
    to be able to fit into the TLB).


    But, as noted, the "Last Level Page-Table TLB" scheme I had considered previously could greatly expand the working set of the MMU without the
    cost of a full page walker.

    But, more a case of investing resources into fixing an issue that is not
    (yet) a significant performance issue.

    Granted, can note that both my BT3 engine and Quake 3 Arena tend to
    exceed the working-set size of the MMU, so are prone to a higher
    frequency of TLB miss interrupts. But, fixing this still won't make
    Quake3 all that playable on a 50MHz CPU.

    ...



    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Feb 3 01:03:36 2026
    From Newsgroup: comp.arch

    After getting bogged down testing the hash table, I decided to shelve it
    for now. And...

    I re-wrote the Qupls4 TLB for hierarchical page tables. It is much
    cleaner code now. The TLB is taking up only about 600 LUTs and 8 BRAMs.
    It is also simpler, being only a single-level TLB. Three copies of the
    TLB would be needed.

    Thinking about adding caching of translation steps to avoid main memory access.
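
    One likely form of this is a small page-walk cache keyed on the upper VA
    bits, so a TLB miss that hits the cache only needs one memory access for
    the leaf PTE. A rough sketch (names and sizes are illustrative, not the
    Qupls4 design):

    /* Tiny page-walk cache: maps the upper VA bits (all levels above
       the leaf) to the physical address of the leaf table page, so a
       TLB miss that hits here needs only one memory access for the
       final PTE.  Names and sizes are illustrative.                  */
    #include <stdint.h>

    #define PWC_ENTRIES 16
    typedef struct { uint64_t va_hi, leaf_pa; int valid; } pwc_t;
    static pwc_t pwc[PWC_ENTRIES];

    static int pwc_lookup(uint64_t va_hi, uint64_t *leaf_pa)
    {
        pwc_t *e = &pwc[va_hi % PWC_ENTRIES];
        if (e->valid && e->va_hi == va_hi) { *leaf_pa = e->leaf_pa; return 1; }
        return 0;
    }

    static void pwc_fill(uint64_t va_hi, uint64_t leaf_pa)
    {
        pwc_t *e = &pwc[va_hi % PWC_ENTRIES];
        e->va_hi = va_hi; e->leaf_pa = leaf_pa; e->valid = 1;
    }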

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Feb 3 01:23:05 2026
    From Newsgroup: comp.arch

    On 2026-02-01 12:49 p.m., EricP wrote:
    Robert Finch wrote:
    On 2026-01-31 12:31 p.m., EricP wrote:
    Robert Finch wrote:

    The hash table is only slightly different than a giant TLB. The TLB
    is walked on a miss instead of walking main memory. Page faults
    would be handled the same way.

    Yes.
    Long ago I read a paper about a port of Linux to a PowerPC with its IVT.
    Linux memory management is based on a hierarchical page table (HPT) so
    rather than rewrite the memory manager they treated the IVT as a
    large TLB.
    As pages were touched they copied the PTEs from the HPT into the IVT.
    When there was a process switch they purged the IVT of all entries
    the same way one might purge a TLB. Later they added ASIDs so they
    didn't have to purge the IVT on an address space switch.

    The code that copied the PTEs from the HPT to the IVT was very similar
    to a SW-managed TLB refill, much as virtual machine monitors do if they
    don't have nested page tables.

    I am not sure about getting Linux working. I may just use my own OS.
    Skimming through the mmap() code has given me an idea of just how
    complex the software is.

    It was just an example of how one can use a page table to emulate
    any other kind of page table.

    Also you can't guarantee that your whole program will fit into a single
    inverted/hash table of a predetermined size because you don't know in
    advance which entries will collide and push other entries out.
    Even if it does fit today, any change to the program could break that.

    Collisions will not push other entries out; they just bounce to a
    different empty entry. If one cannot be found then it is treated as
    out-of-memory.

    Did some testing. The test just allocates pages without freeing them,
    so it is a bit unrealistic.
    The IVT stopped after 63 bounces at 69% full (a max of 64 bounces is
    set). Average number of bounces: 1.8 per lookup. This was with random
    ASIDs and addresses.
    The table can be clocked at twice the CPU rate.

    One might make the hash function controllable so you can optimize the
    bounce length for different programs. Also the max bounce limit should
    be a dynamic HW register and not hard wired.

    The max bounce limit is a HW register. Making the hash function
    controllable is a bit tricky as it is used by HW. It might be easier to
    provide a selection of a few different hashes. The hash function reminds
    me of crypto rotates and combinations.
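
    For example, a selectable family of rotate/XOR style hashes over
    (ASID, VPN) might look like this; the particular mixes are only
    illustrative, not the ones in the actual design:

    /* A selectable family of simple rotate/XOR hashes over (ASID, VPN).
       The three variants are only examples.                            */
    #include <stdint.h>

    static uint32_t rotl32(uint32_t x, int r)
    {
        return (x << r) | (x >> (32 - r));
    }

    static uint32_t ivt_hash(uint32_t asid, uint32_t vpn, int sel, uint32_t mask)
    {
        switch (sel) {
        case 0:  return (vpn ^ asid) & mask;
        case 1:  return (vpn ^ rotl32(asid, 7) ^ (vpn >> 5)) & mask;
        default: return (rotl32(vpn, 13) ^ (asid * 0x9E3779B9u)) & mask;
        }
    }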


    I have been wondering how to collect and use stats to control things.

    The entries in the IVT for ASIDs no longer in use could be cleared
    automatically by a background HW process.

    That implies some kind of ASID allocator to track which are in-use.
    It also implies a whole IVT scan as each ASID is freed,
    or maintain a pending free list and a scavenger process.

    For the hash table a small ASID (e.g. 8 bits) is used. It could be
    tracked with a bitmap.

    If the ASIDs are not reused too quickly the scavenger may not need to be
    fast. An ASID FIFO could be used to allocate / free ASIDs.

    Simpler to have process creation assign a new ASID by incrementing
    an unsigned up-counter that starts at 1. When you hit its max, reset
    to 1 and clear the whole IVT of all non-zero ASID entries in one pass.


    A garbage-collection approach may work for removed entries. It might be
    a complicated state machine to do in HW, though. It could look for
    invalid entries or entries with defunct ASIDs.
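
    A software model of what such a scavenger might do, one entry per step
    so it can run in the background; asid_in_use() stands in for the ASID
    bitmap mentioned above:

    /* Background scavenger sketch: visit one entry per call and
       tombstone entries whose ASID is defunct.  Actual reclaim and
       compaction are left to a periodic rebuild pass, since turning a
       slot back to EMPTY here would break open-addressing probe
       chains.  asid_in_use() is a hypothetical stand-in.             */
    enum { S_EMPTY, S_VALID, S_DELETED };
    typedef struct { int state; unsigned asid, vpn, ppn; } ivt_entry_t;

    #define IVT_ENTRIES 8192
    static ivt_entry_t ivt[IVT_ENTRIES];

    extern int asid_in_use(unsigned asid);      /* hypothetical */

    void ivt_scavenge_step(void)
    {
        static unsigned pos;                    /* current scan position */
        ivt_entry_t *e = &ivt[pos];
        if (e->state == S_VALID && !asid_in_use(e->asid))
            e->state = S_DELETED;               /* tombstone, keep chains */
        pos = (pos + 1) % IVT_ENTRIES;
    }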



    --- Synchronet 3.21b-Linux NewsLink 1.2