• Design for a persistent memory image for a simple language

    From luserdroog@mijoryx@yahoo.com to comp.lang.misc on Thu Mar 23 11:15:02 2023
    From Newsgroup: comp.lang.misc

    Design for a persistent memory image for a simple language

    This document will attempt to describe the data representation for the
    types in a minimal Lisp-like language, where all the program data --
    indeed the entire state of the interpreter -- can be saved to disk and reloaded. The central idea is to define the in-memory representations
    to incorporate the serialization/deserialization automatically --
    baked in, so to speak.

    The types are fixnum, symbol, list, string, subr, subr2, fsubr, fsubr2.
    Where a symbol is a short handle that associates to an interned string,
    subr and subr2 are C functions that accept one or two arguments,
    and fsubr and fsubr2 are C functions that accept one or two unevaluated
    lists as arguments.

    Using the terminology from David Gudeman,
    ``Representing Type Information in Dynamically Typed Languages'',
    the scheme chosen is a /staged representation/ with the first stage
    tagging the low 2 bits of a 32 bit cell giving a spread of 4 types:

    CONS address
    symbol index
    OBJ address
    NUM number

    where CONS cells contain an address in the remaining bits which
    address is the cell number of the pair of car and cdr parts stored sequentially, NUM contains a 30 bit signed value, symbol contains an
    index into the symbol list, and OBJ is the escape valve to the second
    stage. An OBJ cell contains an address locating the remaining parts
    of its representation.

    At the second stage, an OBJ has a full cell for an extended tag
    and another cell as its payload, this covers the remaining types:

    STRING address
    SUBR index
    SUBR2 index
    FSUBR index
    FSUBR2 index

    where a STRING contains an address of zero terminated string of bytes
    (padded to cell alignment), and where all the function types hold an
    index into the C function table.

    These entities exist in a memory space of 32bit int-sized cells with a
    simple linear allocator (new_address = global.next++) where all the
    addresses described above are the cell number. So accessing a real
    object in memory from this space requires a formula like:

    global.mem[ address ]

    or the similar formula to yield a C pointer to a cell:

    ( global.mem + address )


    Saving/Loading the memory image.

    Since all objects exist in a convenient binary space, the whole
    state of the interpeter can be saved and re-loaded by storing
    the contents of the memory as bytes, along with a few sizes and
    important addresses.

    saved memory descriptor
    cells used <=> global.next
    cells malloced <=> global.mem
    address of env list <=> global.env
    address of symbol list <=> global.syms


    One remaining issue (which I had hoped to solve and then brag about,
    delaying further work on this writeup) is coordinating or synchronizing
    the saved memory image with the version of the C code which will
    work with it.

    I think I need some kind of CRC or hash upon the suite of function
    names that are loaded into the environment. So that the entries in the
    C function table will match the intended meanings of the antecedent
    indices that exist in the memory space.

    Or there may be some (overly) clever way of using git's own hash code
    that identifies the current commit of the codebase. But that may be
    too restrictive.


    The above has been implemented in my tiny Lisp interpreter in golfed C
    as of the current commit.

    https://github.com/luser-dr00g/sexp.c/tree/597d2ca7ad4f66e3f1f836b0325c21d2f3611f55
    https://github.com/luser-dr00g/sexp.c/archive/597d2ca7ad4f66e3f1f836b0325c21d2f3611f55.zip

    I've also written a separate program to analyze the dump file.
    It uses /bin/od to interpret the bytes as little-endian words
    and feeds that into an awk program which builds up an array of
    locations and values which can be traversed to show the layout
    of the environment data in memory. This effort feels like it
    demonstrates some degree of portability or internal consistency,
    as well as offering a separate debugging tool for use with the
    interpreter.

    [I tried posting through gnus but got a 441 error I need to investigate.]
    --- Synchronet 3.20a-Linux NewsLink 1.114