• Compiler IR tree or AST - what node types to include?

    From James Harris@james.harris.1@gmail.com to comp.lang.misc on Wed Nov 23 14:38:57 2022
    From Newsgroup: comp.lang.misc

    Any advice on what node types should be in an IR tree? There seem to be
    many options and I am not sure where to strike the balance.

    My guess is that it's best to keep the number of different node types reasonably small. That should mean less code is needed to walk the tree.
    But when does that number become too small?

    For example, consider expressions.

    There /could/ be one node type for "+" and another for "*", as in a node
    with these fields:

    +
    left
    right

    Alternatively, both operators could be handled by a single node type of "dyadic operation". Such a node would have four fields:

    nodetype = dyadic
    operator (e.g. "+")
    left
    right

    That could be further generalised by having one type of node which would
    be usable for both monadic and dyadic operators:

    nodetype = operation
    operator
    number of arguments: 1 or 2
    arguments

    Further, why limit the number of arguments to 1 or 2? If one were to
    allow an arbitrary number of arguments then such a type of node could
    also be used for function calls.

    In fact, that form is effectively an s expression of the form

    (operator arguments)

    So for expressions, which of the above schemes is best to use for nodes
    in an IR tree?

    (A potential fly in the ointment is whether such a node is suitable for functions which modify some of their parameters.)

    Then there are other (non-expression) nodes.

    To allow for things like code hoisting I think I need the tree initially
    to be at a fairly high level, containing things such as loops and
    selections though it would be lowered later to use labels and
    conditional branches.

    So what other node types are needed? One could have

    program
    function
    identifier: constant or variable or parameter
    assignment
    iteration
    sequential selection
    switch
    break iteration
    break sequential selection
    break switch

    Is that enough? Too many? Too few? What types are missing?

    That's an idea of the kinds of node I am thinking about but I'd
    appreciate your suggestions on what types to use in reality.
    --
    James Harris

    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Dmitry A. Kazakov@mailbox@dmitry-kazakov.de to comp.lang.misc on Wed Nov 23 16:39:43 2022
    From Newsgroup: comp.lang.misc

    On 2022-11-23 15:38, James Harris wrote:
    Any advice on what node types should be in an IR tree? There seem to be
    many options and I am not sure where to strike the balance.

    My guess is that it's best to keep the number of different node types reasonably small. That should mean less code is needed to walk the tree.
    But when does that number become too small?

    For example, consider expressions.

    There /could/ be one node type for "+" and another for "*", as in a node with these fields:

      +
      left
      right

    Alternatively, both operators could be handled by a single node type of "dyadic operation". Such a node would have four fields:

      nodetype = dyadic
      operator (e.g. "+")
      left
      right

    I have an abstract node type with derived node types for expression
    terms like identifiers and literals and for branches. The branch node
    type is like:

    operation - an enumeration
    location - the source code location
    operands - the nodes list (1..n)

    operation can be for example "+" or "call." E.g.

    f(a,b,c)

    produces a node

    operation => call
    operands => f,a,b,c

    a + b + c

    produces

    operation => +
    operands => a,b,c

    (a,b,c)

    produces

    operation => parenthesis
    operands => a,b,c

    etc.

    Further, why limit the number of arguments to 1 or 2? If one were to
    allow an arbitrary number of arguments then such a type of node could
    also be used for function calls.

    In fact, that form is effectively an s expression of the form

      (operator arguments)

    Yes, this is what I use in my parser. Furthermore the parser combines associative operations into n-arguments like + above. Inverse when added
    to also combines - or /. E.g.

    a - b + c

    produces

    operation => +
    operands => a,-b,c

    So what other node types are needed? One could have

      program
      function
      identifier: constant or variable or parameter
      assignment
      iteration
      sequential selection
      switch
      break iteration
      break sequential selection
      break switch

    operator a+b
    call f(a,b)

    Then various ligatures that are not operators, but some meta-operators
    e.g. if you have built-in ranges a..b, components a.b, namespaces a::b
    etc. Usually you cannot treat them as first-class operators, because a
    in a::b might not be an expression.

    I do not create one tree for all program, just for individual
    expressions. IMO there is no need to generate a tree for a switch or a
    loop, unless you have switch-operator and loop-operator, of course.
    --
    Regards,
    Dmitry A. Kazakov
    http://www.dmitry-kazakov.de

    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Bart@bc@freeuk.com to comp.lang.misc on Wed Nov 23 18:33:57 2022
    From Newsgroup: comp.lang.misc

    On 23/11/2022 14:38, James Harris wrote:
    Any advice on what node types should be in an IR tree? There seem to be
    many options and I am not sure where to strike the balance.

    My comments will be about an AST. I don't know whether you are using
    'IR' as a synonym for the AST; I think it's generally used for the next
    stage beyond AST.


    My guess is that it's best to keep the number of different node types reasonably small. That should mean less code is needed to walk the tree.
    But when does that number become too small?

    For example, consider expressions.

    There /could/ be one node type for "+" and another for "*", as in a node with these fields:

      +
      left
      right

    Alternatively, both operators could be handled by a single node type of "dyadic operation". Such a node would have four fields:

      nodetype = dyadic
      operator (e.g. "+")
      left
      right

    The problem is, there are loads of ways to do this, they will all work.
    There is no right or wrong way.

    I use both of the above methods across three active compilers. With the
    first, its easy to dispatch directly the "+" node when you need to deal
    with "add".

    With the second, you have to dispatch first on "dyadic" and then on "operator". But it's not a big deal.


    That could be further generalised by having one type of node which would
    be usable for both monadic and dyadic operators:

      nodetype = operation
      operator
      number of arguments: 1 or 2
      arguments

    Further, why limit the number of arguments to 1 or 2? If one were to
    allow an arbitrary number of arguments then such a type of node could
    also be used for function calls.

    How practical this is depends on your language, the arrangements for
    matching an arbitrary length list to the node.

    Currently I allow 3 operands for one language, and 2 for another.
    Remember this not just about expressions, it has to be able to represent anything.

    For variable-length lists, any of the operands (although usually just
    the first) is interpreted as the root of a linked list. (With some
    compiler versions in dynamic code, each operand could directly be a list.)


    In fact, that form is effectively an s expression of the form

      (operator arguments)

    So for expressions, which of the above schemes is best to use for nodes
    in an IR tree?

    (A potential fly in the ointment is whether such a node is suitable for functions which modify some of their parameters.)

    Then there are other (non-expression) nodes.

    To allow for things like code hoisting I think I need the tree initially
    to be at a fairly high level, containing things such as loops and
    selections though it would be lowered later to use labels and
    conditional branches.

    So what other node types are needed? One could have

      program
      function
      identifier: constant or variable or parameter
      assignment
      iteration
      sequential selection
      switch
      break iteration
      break sequential selection
      break switch

    Is that enough? Too many? Too few? What types are missing?

    My ASTs are unusual (although I only discovered this recently) in that
    they only represent executable code. That is, the bodies of functions,
    or init data for a variable.

    Others use the AST also for non-executable elements, like function declarations. I couldn't see the point of this, unless this is a
    scripting where declarations /are/ executed from the top of the module.

    However even without declaration nodes, I use either 130 or 200 node
    types (depending on whether ADD, SUB etc have dedicated tags, or grouped
    under one BINOP tag). My C compiler uses 80.

    But you must have already done all this in your compiler; didn't that
    use an AST?

    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From James Harris@james.harris.1@gmail.com to comp.lang.misc on Wed Nov 23 19:38:11 2022
    From Newsgroup: comp.lang.misc

    On 23/11/2022 15:39, Dmitry A. Kazakov wrote:
    On 2022-11-23 15:38, James Harris wrote:

    Any advice on what node types should be in an IR tree? There seem to
    be many options and I am not sure where to strike the balance.

    ...

    I have an abstract node type with derived node types for expression
    terms like identifiers and literals and for branches. The branch node
    type is like:

       operation - an enumeration
       location  - the source code location
       operands  - the nodes list (1..n)

    operation can be for example "+" or "call." E.g.

       f(a,b,c)

    produces a node

       operation => call
       operands  => f,a,b,c

    Yes, I might do something which is very much like that for two reasons:

    1. In my language f is just as much an expression as a, b and c.

    2. I may at some point implement different types of call so a 'call'
    node would help with later changes.


       a + b + c

    produces

       operation => +
       operands  => a,b,c

    ...


       a - b + c

    produces

       operation => +
       operands  => a,-b,c

    Both of those are curious. Why have more than two operands?


    So what other node types are needed? One could have

       program
       function
       identifier: constant or variable or parameter
       assignment
       iteration
       sequential selection
       switch
       break iteration
       break sequential selection
       break switch

       operator  a+b
       call      f(a,b)

    Then various ligatures that are not operators, but some meta-operators
    e.g. if you have built-in ranges a..b, components a.b, namespaces a::b
    etc. Usually you cannot treat them as first-class operators, because a
    in a::b might not be an expression.

    I think I understood your post up until that paragraph. What do you mean
    by ligatures and meta operators? And aside from the semantics what's
    wrong with "a in a::b"?


    I do not create one tree for all program, just for individual
    expressions. IMO there is no need to generate a tree for a switch or a
    loop, unless you have switch-operator and loop-operator, of course.

    Again, a curious comment. My sources are naturally hierarchical so a
    tree seems to be a good way to represent them. Are you saying another
    form could be better?
    --
    James Harris


    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Dmitry A. Kazakov@mailbox@dmitry-kazakov.de to comp.lang.misc on Wed Nov 23 21:06:04 2022
    From Newsgroup: comp.lang.misc

    On 2022-11-23 20:38, James Harris wrote:
    On 23/11/2022 15:39, Dmitry A. Kazakov wrote:

        a - b + c

    produces

        operation => +
        operands  => a,-b,c

    Both of those are curious. Why have more than two operands?

    Because it is easier to fold constants and do other optimizations for operations declared commutative. Consider:

    1 - x + 3

    with

    +(1,-x,3)

    you could fold it into

    +(4,-x) == 4-x

    So what other node types are needed? One could have

       program
       function
       identifier: constant or variable or parameter
       assignment
       iteration
       sequential selection
       switch
       break iteration
       break sequential selection
       break switch

        operator  a+b
        call      f(a,b)

    Then various ligatures that are not operators, but some meta-operators
    e.g. if you have built-in ranges a..b, components a.b, namespaces a::b
    etc. Usually you cannot treat them as first-class operators, because a
    in a::b might not be an expression.

    I think I understood your post up until that paragraph. What do you mean
    by ligatures and meta operators? And aside from the semantics what's
    wrong with "a in a::b"?

    Because it might evaluate nothing. Consider

    System::IO::Print

    as an example. You might wish to recognize such things (meta expression)
    in the AST and handle them before semantic analysis. Normally, you have
    System namespace/package/module visible in the context. When not, it is
    a syntax error. Then you look into System names for IO and so on.

    I do not create one tree for all program, just for individual
    expressions. IMO there is no need to generate a tree for a switch or a
    loop, unless you have switch-operator and loop-operator, of course.

    Again, a curious comment. My sources are naturally hierarchical so a
    tree seems to be a good way to represent them. Are you saying another
    form could be better?

    Because you do not need AST to represent control flow statements or declarations. You can generate intermediate code or interpret straight
    away during recursive descend.
    --
    Regards,
    Dmitry A. Kazakov
    http://www.dmitry-kazakov.de

    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From James Harris@james.harris.1@gmail.com to comp.lang.misc on Wed Nov 23 20:29:46 2022
    From Newsgroup: comp.lang.misc

    On 23/11/2022 18:33, Bart wrote:
    On 23/11/2022 14:38, James Harris wrote:

    Any advice on what node types should be in an IR tree? There seem to
    be many options and I am not sure where to strike the balance.

    My comments will be about an AST. I don't know whether you are using
    'IR' as a synonym for the AST; I think it's generally used for the next stage beyond AST.

    Does it matter much? Either is a tree representation of the program so wouldn't they have similar node types, at least initially?

    ...


    The problem is, there are loads of ways to do this, they will all work. There is no right or wrong way.

    OK, that's reassuring!

    ...

    Further, why limit the number of arguments to 1 or 2? If one were to
    allow an arbitrary number of arguments then such a type of node could
    also be used for function calls.

    How practical this is depends on your language, the arrangements for matching an arbitrary length list to the node.

    Why would one want to match a list to the node? Wouldn't a "node type"
    field provide all the matching needed?

    ...

    So what other node types are needed? One could have

       program
       function
       identifier: constant or variable or parameter
       assignment
       iteration
       sequential selection
       switch
       break iteration
       break sequential selection
       break switch

    Is that enough? Too many? Too few? What types are missing?

    My ASTs are unusual (although I only discovered this recently) in that
    they only represent executable code. That is, the bodies of functions,
    or init data for a variable.

    That sounds reasonable to me. Declarations would be stored in the symbol
    table so would probably not need tree nodes.


    Others use the AST also for non-executable elements, like function declarations. I couldn't see the point of this, unless this is a
    scripting where declarations /are/ executed from the top of the module.

    Yes, I allow for forward references such as calling a function which is defined later in the source file without having to declare it before the
    call is seen. That already works for any global (all functions are
    global) but I will need other forward references so I'll have to parse
    some identifiers purely as names and interpret them later.


    However even without declaration nodes, I use either 130 or 200 node
    types (depending on whether ADD, SUB etc have dedicated tags, or grouped under one BINOP tag). My C compiler uses 80.

    Wow, those numbers sound high! Maybe I have been trying to cut the
    numbers down unnecessarily.


    But you must have already done all this in your compiler; didn't that
    use an AST?

    My compiler doesn't currently have any tree at all, although adding one
    has long been in the plan.
    --
    James Harris


    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Bart@bc@freeuk.com to comp.lang.misc on Wed Nov 23 23:16:09 2022
    From Newsgroup: comp.lang.misc

    On 23/11/2022 20:29, James Harris wrote:
    On 23/11/2022 18:33, Bart wrote:

    However even without declaration nodes, I use either 130 or 200 node
    types (depending on whether ADD, SUB etc have dedicated tags, or
    grouped under one BINOP tag). My C compiler uses 80.

    Wow, those numbers sound high! Maybe I have been trying to cut the
    numbers down unnecessarily.

    I'm sure I posted something like this before. But here is the set of
    node tags I currently use in that 130-node compiler, reduced to 125 as
    some I saw were obsolete:

    const
    null
    name
    block
    assem # for inline assembly
    assemmacro
    assemreg
    assemxreg
    assemmem
    strinclude
    andl
    orl
    notl
    istruel
    makelist
    makerange
    makeset
    makedict
    makeslice
    returnmult
    keyword
    keyvalue
    assign
    assignmm # m = multiple
    assignms
    assignmdrem
    copy
    callfn
    cmp
    cmpchain
    bin
    unary
    binto
    unaryto
    incr
    inrev
    inrange
    inset
    clamp
    index
    slice
    dot
    dotindex
    dotslice
    ptr
    addrof
    addroffirst
    convert
    shorten
    autocast
    typepun
    typeconst
    operator
    upper
    bitwidth
    bytesize
    typeof
    typestr
    bitfield
    minvalue
    maxvalue
    cvlineno # compiler variables; these still have dedicated tags cvstrlineno
    cvmodulename
    cvfilename
    cvfunction
    cvdate
    cvtime
    cvversion
    cvtypename
    cvtargetbits
    cvtargetsize
    cvtargetcode
    cvctarget
    cvwindows
    cvlinux
    cvnil
    cvpi
    cvinfinity
    cvtrue
    cvfalse
    whenthen
    fmtitem
    nogap
    space
    callproc
    return
    syscall
    to
    if
    forup
    fordown
    forall
    forallrev
    while
    repeat
    goto
    labeldef
    redo
    next
    exit
    do
    case
    docase
    switch
    doswitch
    swap
    select
    recase
    print
    println
    fprint
    fprintln
    sprint
    sfprint
    read
    readln
    sread
    sreadln
    stop
    eval
    empty
    emitc
    infinity

    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From BGB@cr88192@gmail.com to comp.lang.misc on Sun Nov 27 03:31:59 2022
    From Newsgroup: comp.lang.misc

    On 11/23/2022 12:33 PM, Bart wrote:
    On 23/11/2022 14:38, James Harris wrote:
    Any advice on what node types should be in an IR tree? There seem to
    be many options and I am not sure where to strike the balance.

    My comments will be about an AST. I don't know whether you are using
    'IR' as a synonym for the AST; I think it's generally used for the next stage beyond AST.


    Yes, typically they are different stages.

    Typically the AST is a tree structured representation of the program.

    IR is typically either a stack machine (often also called a "bytecode"
    or "IL");
    Something along the lines of 3AC (Three Address Code) or SSA (Static
    Single Assignment).


    In my compilers, I have mostly ended up going from the AST from a stack machine IL; and then from this IL into a 3AC IR.

    There is a difference in that while an AST is usually tree structured, something like a 3AC IR is typically organized into "basic blocks"
    containing a series of operations operating on temporaries or similar.


    I have mostly ended up keeping the stack IL as:
    AST -> Stack: Easy
    AST -> 3AC/SSA: Not as Easy
    Along with:
    Stack -> 3AC / SSA: Relatively easy;
    Serializing an IR in SSA form: Massive pain.


    As for AST structure, I have tried a few major strategies:

    S-Expressions: Used in my early VM projects (Scheme based);
    Made a return for most of my "BGBScript 1.x" VMs (~ 2006..2013).

    XML Based, used in the first version of BGBScript;
    Still used in BGBCC, which was a fork off the first VM.
    Later on, BGBScript switched back to S-Expressions, using a core VM
    based on my original Scheme interpreter.

    JSON style: Used in my BGBScript2 VM (~ 2015..2018). This technically
    worked well, and had a "good compromise" between XML nodes and
    S-Expressions. But, this VM was short lived (mostly used in my
    "BGBTech2" 3D engine, and mostly died off along with it).

    Where, syntactically:
    BGBScript was along similar lines to JavaScript and ActionScript.
    BGBScript2 was a constrained BGBScript, and mostly went over to a more Java-like syntax (but retains some elements of its predecessors).

    BGBCC was originally a case of, when I was younger, it didn't seem like
    that huge of a stretch to go from a "terrible JavaScript knock off" over
    to trying to compile C (I was pretty wrong about this, but did
    eventually end up with a C compiler...).


    Currently I am working on my "BJX2 CPU ISA" project, which mostly uses
    BGBCC (my C compiler, but also compiles a variant of BGBScript2). In its present form, BGBCC is using a similar structure to that of the dictionary-object approach, but still representing things in terms of
    XML abstractions.

    In this case, nodes consist of a collection of key/value pairs, where:
    Each 'key' is a 16-bit tag (4-bit type, 12-bit symbol ID or index);
    Each 'value' is 64 bits, with a meaning depending on the 4b type.

    Each node only directly holds 14 key/value pairs, if the node needs more
    than this, it expands out B-Tree style, so:
    1 node, 1 level, 14 keys;
    15 nodes, 2 levels, 196 keys;
    211 nodes, 3 levels, 2744 keys;
    ...

    All the keys are kept sorted, so access time is "O(log2 n)".


    Most calls to get/set a key in a node supply both a name and a pointer
    to a variable to cache the ID number. Typically, the ID is primary, but
    the string is used if the ID number hasn't been initialized yet
    (originally, everything used strings, but this was not ideal for
    performance).

    Also originally, nodes were organized into linked lists (with internal linkage), but this has been eliminated.

    But, it sort of awkwardly fakes the original DOM style interface mostly because otherwise I would have needed to rewrite both the parser and
    compiler to use the dictionary objects directly.

    There is some trickery, say, something at the AST level like:
    <if>
    <cond>
    <binary op="&gt;">
    <left> <ref name="x"/> </left>
    <right> <int value="0"/> </right>
    </binary>
    </cond>
    <then> ... </then>
    </if>

    Is internally represented as:
    { tag: #if,
    cond:
    {tag: #binary, op: ">",
    left: {tag: #ref, name: "S"},
    right: {tag: #int, value: 0}}
    then: { ... }
    }

    With named keys holding a node implicitly assumed to represent a
    "virtual node" as far as the XML is concerned.

    There are no arrays, rather array-like cases are handled via
    index-number keys and some additional hackery to deal with nodes with a
    large number of children.

    For nodes which represent a body section containing an ordered list of
    child nodes, these are given a sequentially assigned index number
    (almost always, the newest node is added to the end of the list).


    For a small to moderate number of indexed child nodes, key ranges could be:
    000..3FF: Named Keys
    400..7FF: Radix Space
    800..FFF: Indexed Keys (reverse numbered)

    Where, say, the "radix" hack could be used to significantly extend the
    max number of indexed keys (via dividing the index space into multiple
    levels of nested nodes).


    Though, a case could be made for skipping the whole XML thing...

    A case could also be made for using tagged references as values rather
    than type-tagging the key.


    My guess is that it's best to keep the number of different node types
    reasonably small. That should mean less code is needed to walk the
    tree. But when does that number become too small?

    For example, consider expressions.

    There /could/ be one node type for "+" and another for "*", as in a
    node with these fields:

       +
       left
       right

    Alternatively, both operators could be handled by a single node type
    of "dyadic operation". Such a node would have four fields:

       nodetype = dyadic
       operator (e.g. "+")
       left
       right

    The problem is, there are loads of ways to do this, they will all work. There is no right or wrong way.

    I use both of the above methods across three active compilers. With the first, its easy to dispatch directly the "+" node when you need to deal
    with "add".

    With the second, you have to dispatch first on "dyadic" and then on "operator". But it's not a big deal.


    I used a "binary" node for all of the binary operators.


    That could be further generalised by having one type of node which
    would be usable for both monadic and dyadic operators:

       nodetype = operation
       operator
       number of arguments: 1 or 2
       arguments

    Further, why limit the number of arguments to 1 or 2? If one were to
    allow an arbitrary number of arguments then such a type of node could
    also be used for function calls.

    How practical this is depends on your language, the arrangements for matching an arbitrary length list to the node.

    Currently I allow 3 operands for one language, and 2 for another.
    Remember this not just about expressions, it has to be able to represent anything.

    For variable-length lists, any of the operands (although usually just
    the first) is interpreted as the root of a linked list. (With some
    compiler versions in dynamic code, each operand could directly be a list.)


    In fact, that form is effectively an s expression of the form

       (operator arguments)

    So for expressions, which of the above schemes is best to use for
    nodes in an IR tree?

    (A potential fly in the ointment is whether such a node is suitable
    for functions which modify some of their parameters.)

    Then there are other (non-expression) nodes.

    To allow for things like code hoisting I think I need the tree
    initially to be at a fairly high level, containing things such as
    loops and selections though it would be lowered later to use labels
    and conditional branches.

    So what other node types are needed? One could have

       program
       function
       identifier: constant or variable or parameter
       assignment
       iteration
       sequential selection
       switch
       break iteration
       break sequential selection
       break switch

    Is that enough? Too many? Too few? What types are missing?

    My ASTs are unusual (although I only discovered this recently) in that
    they only represent executable code. That is, the bodies of functions,
    or init data for a variable.

    Others use the AST also for non-executable elements, like function declarations. I couldn't see the point of this, unless this is a
    scripting where declarations /are/ executed from the top of the module.

    However even without declaration nodes, I use either 130 or 200 node
    types (depending on whether ADD, SUB etc have dedicated tags, or grouped under one BINOP tag). My C compiler uses 80.

    But you must have already done all this in your compiler; didn't that
    use an AST?


    My ASTs hold "nearly everything", though for C, things like struct
    bodies and typedefs are folded off into their own special lists.

    --- Synchronet 3.19c-Linux NewsLink 1.113