• Scanning

    From ram@ram@zedat.fu-berlin.de (Stefan Ram) to comp.programming on Thu Jan 19 12:10:36 2023
    From Newsgroup: comp.programming

    Some idle thoughts about scanning (lexical analysis, or
    rather what comes before it) ...

    Let's take a very simple task: This scanner for text files
    has nothing more to do than to return every character,
    except to strip the spaces at the end of a line.

    It is a function "get_next_token" that on each call will
    return the next character from a file to its client (caller),
    except that spaces at the end of a line will be skipped.

    So we read the line and strip the spaces. (One line in
    Python.)
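
    A minimal sketch of that easy, line-based approach, assuming the
    source is any iterable of lines (an open text file qualifies). The
    function name is made up for this illustration:

```python
import io

def strip_trailing_spaces(lines):
    """Yield every character of the input, minus spaces at line ends."""
    for line in lines:
        # Split off the newline (if any) so only the spaces before it go.
        has_eol = line.endswith('\n')
        body = line[:-1] if has_eol else line
        yield from body.rstrip(' ') + ('\n' if has_eol else '')

source = io.StringIO('Howdy   \nthere!')
print(repr(''.join(strip_trailing_spaces(source))))  # 'Howdy\nthere!'
```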

    But how do I know in advance if the line will fit into
    memory?

    Perhaps because of such fears, traditional scanners¹ do not
    read lines or, Heaven forbid, files, but only characters!

    They do not use random access with respect to the text to be
    scanned, but sequential access, although things would be
    easier with random access.

    So how would you do it with this style of programming (never
    reading the whole line into memory)?

    "I read a character. If it's a space, I peek at the next
    character, if that's a space, I start adding spaces to my
    look-ahead buffer. If an EOL is encountered, the look-ahead
    buffer is discarded. Otherwise, I have to start feeding my
    client from the lookahead buffer until the lookahead buffer
    is empty."

    If I am concerned that a line will not fit in memory, how do
    I know that the sequence of spaces at the end of a line will
    fit in memory (the look-ahead buffer)? The look-ahead buffer
    could be replaced by a counter. If you are paranoid, you
    would use a 64-bit counter and check it for overflow!
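
    The counter variant can be sketched as follows, assuming the source
    is an iterator of single characters. The function name is
    hypothetical, and Python integers do not overflow, so the overflow
    check is omitted here:

```python
def strip_trailing_spaces_streaming(chars):
    """Yield characters from `chars`, dropping spaces that precede a newline."""
    pending = 0                      # spaces seen but not yet passed on
    for ch in chars:
        if ch == ' ':
            pending += 1             # hold the space back for now
        elif ch == '\n':
            pending = 0              # the held spaces were trailing: drop them
            yield ch
        else:
            for _ in range(pending): # hand the held spaces back first
                yield ' '
            pending = 0
            yield ch
    # Spaces directly before the end of input are dropped as well.

print(repr(''.join(strip_trailing_spaces_streaming(iter('a  b  \nc ')))))
```

    Only one integer is stored, no matter how long the line or the run
    of spaces is.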

    Is it worth the effort with a look-ahead buffer and
    sequential access? Should you just read a line, assuming
    that a line will always fit into memory, and strip the
    blanks the easy way, i.e., using random access? TIA for any
    comments!

    ¹ An example of a traditional scanner:

    It only ever calls "GetCh", never "GetLine". The code could
    be easier to write by reading a whole line and then just
    using functions that can look at that line using random
    access to get the next symbol (maybe using regular
    expressions). But a traditional scanner carefully only ever
    reads a single character and manages a state.

    PROCEDURE GetSym;

      VAR i : CARDINAL;

    BEGIN
      WHILE ch <= ' ' DO GetCh END;
      IF ch = '/' THEN
        SkipLine;
        WHILE ch <= ' ' DO GetCh END
      END;
      IF (CAP (ch) <= 'Z') AND (CAP (ch) >= 'A') THEN
        i := 0;
        sym := literal;
        REPEAT
          IF i < IdLength THEN
            id [i] := ch;
            INC (i)
          END;
          IF ch > 'Z' THEN sym := ident END;
          GetCh
        ...
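
    For contrast, here is a rough sketch of the line-based alternative
    mentioned above, using a regular expression on a whole line. It only
    mirrors the part of GetSym that is shown. The value of ID_LENGTH,
    the set of characters that may continue a word, and treating "/" as
    a comment start anywhere on the line are all assumptions:

```python
import re

ID_LENGTH = 8   # assumed value; the original IdLength is not shown

# A word of ASCII letters. What else may continue a symbol is an assumption,
# because the end condition of the original REPEAT loop is not shown.
WORD = re.compile(r'[A-Za-z]+')

def symbols(line):
    """Yield (kind, text) pairs for the words on one line."""
    line = line.split('/', 1)[0]          # assume '/' comments out the rest
    for match in WORD.finditer(line):
        word = match.group()
        # All upper case -> literal; any lower-case letter -> ident.
        kind = 'literal' if word.isupper() else 'ident'
        yield kind, word[:ID_LENGTH]      # keep at most ID_LENGTH characters

print(list(symbols('BEGIN count / a comment')))
# [('literal', 'BEGIN'), ('ident', 'count')]
```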


    --- Synchronet 3.20a-Linux NewsLink 1.113
  • From Richard Heathfield@rjh@cpax.org.uk to comp.programming on Thu Jan 19 12:43:45 2023

    On 19/01/2023 12:10 pm, Stefan Ram wrote:
    But how do I know in advance if the line will fit into
    memory?

    man 3 realloc

    This was a perennial comp.lang.c topic back in the day.

    My interface looked (and still looks) like this:

    #define FGDATA_BUFSIZ BUFSIZ /* adjust to taste */
    #define FGDATA_WRDSIZ sizeof("floccinaucinihilipilification")
    #define FGDATA_REDUCE 1

    int fgetline(char **line, size_t *size, size_t maxrecsize,
                 FILE *fp, unsigned int flags, size_t *plen);

    It's easier to use than it might look:

    char *data = NULL; /* where will the data go? NULL is fine */
    size_t size = 0; /* how much space do we have right now? */
    size_t len = 0; /* after call, holds line length */

    while(fgetline(&data, &size, (size_t)-1, stdin, 0, &len) == 0)
    {
    if(len > 0)

    If you want fgetline.c and don't have 20 years of clc archives,
    just yell.
    --
    Richard Heathfield
    Email: rjh at cpax dot org dot uk
    "Usenet is a strange place" - dmr 29 July 1999
    Sig line 4 vacant - apply within

  • From Richard Damon@Richard@Damon-Family.org to comp.programming on Thu Jan 19 07:48:09 2023

    On 1/19/23 7:10 AM, Stefan Ram wrote:
    If I am concerned that a line will not fit in memory, how do
    I know that the sequence of spaces at the end of a line will
    fit in memory (the look-ahead buffer)? The look-ahead buffer
    could be replaced by a counter. If you are paranoid, you
    would use a 64-bit counter and check it for overflow!

    Because of the particulars of this problem, you don't need a look-ahead
    buffer, just a count of the spaces you have seen and the character that
    follows them.

    If you had to handle multiple types of whitespace (an arbitrary mix of
    tabs and spaces, for example), then you would need a buffer, and you
    would need to try to handle the case of that sequence being "too long".

    In general, a parser needs a way to report that the file is too
    "complicated" for it.

    Even the simple counter version has a limit: the point at which the
    counter's type overflows.
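
    A sketch of the buffered variant for mixed tabs and spaces, with an
    explicit limit and an error for input that is "too complicated". The
    exception class, the function name and the default limit of 1024 are
    all hypothetical:

```python
class InputTooComplex(Exception):
    """Raised when the input exceeds a limit the scanner can handle."""

def strip_trailing_blanks(chars, max_run=1024):
    """Yield characters, dropping tabs and spaces that precede a newline."""
    held = []                        # blanks held back, in original order
    for ch in chars:
        if ch in ' \t':
            if len(held) >= max_run:
                raise InputTooComplex(
                    f'more than {max_run} consecutive blanks')
            held.append(ch)
        elif ch == '\n':
            held.clear()             # the held blanks were trailing
            yield ch
        else:
            yield from held          # hand the blanks back unchanged
            held.clear()
            yield ch

print(repr(''.join(strip_trailing_blanks('a \t b\t \nc'))))  # 'a \t b\nc'
```

    The buffer is needed because the blanks that are handed back must
    come out in their original order, which a plain count cannot record.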
  • From ram@ram@zedat.fu-berlin.de (Stefan Ram) to comp.programming on Thu Jan 19 13:38:36 2023

    Richard Heathfield <rjh@cpax.org.uk> writes:
    This was a perennial comp.lang.c topic back in the day.

    But what about writing a scanner in languages with automatic
    memory management where reading a whole line is very simple
    and assuming an input language that limits line length to
    some reasonable value, say, 1,000,000 characters?

    In such a language, would there still be reasons not to
    read the whole line into memory, but to read it char-by-char
    as traditional scanners do?


  • From Dmitry A. Kazakov@mailbox@dmitry-kazakov.de to comp.programming on Thu Jan 19 14:50:58 2023

    On 2023-01-19 13:10, Stefan Ram wrote:

    But how do I know in advance if the line will fit into
    memory?

    No idea; my parser reads the whole source line into the buffer.

    Perhaps because of such fears, traditional scanners¹ do not
    read lines or, Heaven forbid, files, but only characters!

    I think it is more C/UNIX tradition coming from having neither proper
    strings in the language nor lines/records in the filesystem.

    So how would you do it with this style of programming (never
    reading the whole line into memory)?

    By never following this style and never using scanners, lexers,
    tokenizers and other primitive stuff. I do all that in a single pass
    that produces either the code or else the AST.

    "I read a character. If it's a space, I peek at the next
    character, if that's a space, I start adding spaces to my
    look-ahead buffer. If an EOL is encountered, the look-ahead
    buffer is discarded. Otherwise, I have to start feeding my
    client from the lookahead buffer until the lookahead buffer
    is empty."

    Reasonable languages adopt the rule that one blank character is
    equivalent to any number of blank characters, so you could simply pass
    a single space along. Note that you have to annotate tokens with their
    source location anyway (another reason for ditching the scanner
    altogether), so you do not need to care about what this blank was built
    of. Yet another reason not to use a scanner is that the blank can be
    part of a, possibly malformed, comment or literal.
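
    To illustrate the two points about blanks and source locations, here
    is a small sketch that collapses every run of blanks into a single
    space token and annotates each token with its line and column. The
    token shape and all names are assumptions made for this example:

```python
import re

# Either a run of blanks or a run of non-blank characters.
TOKEN = re.compile(r'(?P<blank>[ \t]+)|(?P<word>[^ \t\n]+)')

def tokens(text):
    """Yield (kind, value, line, column); any run of blanks becomes ' '."""
    for line_no, line in enumerate(text.split('\n'), start=1):
        for m in TOKEN.finditer(line):
            kind = m.lastgroup                      # 'blank' or 'word'
            value = ' ' if kind == 'blank' else m.group()
            yield kind, value, line_no, m.start() + 1

for token in tokens('a   b\n  c'):
    print(token)
```

    Because each token carries its own position, nothing downstream has
    to know how many blanks the original run contained.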

    Is it worth the effort with a look-ahead buffer and
    sequential access? Should you just read a line, assuming
    that a line will always fit into memory, and strip the
    blanks the easy way, i.e., using random access?

    My parser works with an abstract source object. The implementation of
    the source object maintains an internal line buffer whose size is a
    parameter. Whether it is set to 1TB or 1024 bytes, the parser does not
    care.
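
    One way such a source object could look, as a hypothetical sketch
    rather than the implementation described above. The class name, its
    methods and the choice of OverflowError are assumptions:

```python
import io

class Source:
    """Hands out one line at a time from a bounded internal buffer."""

    def __init__(self, fp, buffer_size=1024):
        self._fp = fp
        self._buffer_size = buffer_size   # longest line accepted, sans '\n'
        self.line_no = 0
        self.line = ''

    def next_line(self):
        """Load the next line; return False at end of input."""
        # readline(n) returns at most n characters, so memory stays bounded.
        line = self._fp.readline(self._buffer_size + 1)
        if not line:
            return False
        if len(line) > self._buffer_size and not line.endswith('\n'):
            raise OverflowError(f'line {self.line_no + 1} exceeds the buffer')
        self.line_no += 1
        self.line = line.rstrip('\n')
        return True

src = Source(io.StringIO('Howdy   \nthere!'), buffer_size=80)
while src.next_line():
    print(src.line_no, repr(src.line.rstrip(' ')))
```

    The client code only ever sees src.line and src.line_no, so changing
    buffer_size does not affect it.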
    --
    Regards,
    Dmitry A. Kazakov
    http://www.dmitry-kazakov.de

  • From Richard Heathfield@rjh@cpax.org.uk to comp.programming on Thu Jan 19 13:56:29 2023

    On 19/01/2023 1:38 pm, Stefan Ram wrote:
    Richard Heathfield <rjh@cpax.org.uk> writes:
    This was a perennial comp.lang.c topic back in the day.

    But what about writing a scanner in languages with automatic
    memory management where reading a whole line is very simple
    and assuming an input language that limits line length to
    some reasonable value, say, 1,000,000 characters?

    In such a language, would there still be reasons not to
    read the whole line into memory, but to read it char-by-char
    as traditional scanners do?

    There are always reasons, and sometimes they conflict.

    For example, memory management, which should be done by the
    language because it's too important to be left to the programmer,
    and which should be done by the programmer because it's too
    important to be left to the language.

    What are your priorities? Run speed? Speed of development? Code
    re-use? Scalability? Programmer cost? Robustness? Security?

    And what are your constraints?

    I'm not asking for your answer to these questions. I'm just
    pointing out that the answer to your question will depend at
    least in part on the answers to mine.
    --
    Richard Heathfield
    Email: rjh at cpax dot org dot uk
    "Usenet is a strange place" - dmr 29 July 1999
    Sig line 4 vacant - apply within

  • From ram@ram@zedat.fu-berlin.de (Stefan Ram) to comp.programming on Thu Jan 19 14:48:29 2023

    ram@zedat.fu-berlin.de (Stefan Ram) writes:
    Let's take a very simple task: This scanner for text files
    has nothing more to do than to return every character,
    except to strip the spaces at the end of a line.

    Richard said that it matters what I need this for.

    I'd like to implement a tiny markup language similar
    to languages like "Markdown" or "reStructuredText".
    It should ignore spaces at the end of lines.
    I'm going to implement it in Python.

    Here is a first draft of a scanner that strips
    spaces at the end of lines. It works by reading
    single characters from the source.

    For demonstration purposes, I have written spaces
    as underlines "_".

    The demo takes

    Howdy___\nthere!

    as input and outputs

    Howdy\nthere!\n

    . (It also tries to insert '\n' at the end of a
    source when there is no '\n' at the end.)

    The input text is given in the source code via

    input_text = iter( 'Howdy___\nthere!' )

    . What I now need to do next is to write more
    tests in order to find errors. (I avoided using
    classes to make the code a bit easier to read for
    the newsgroup, but the code also will be changed
    soon to use a class definition.)

    Python 3.9

    main.py

    def catcode( ch ):
        # 5 means: "this is a line terminator"
        # 10 means: "this is a blank space"
        # 11 means: "this is a plain character"
        if ch == '\n': return 5
        if ch == ' ': return 10
        if ch == '_': return 10 # for debugging, make "_" a space
        if ch == '\t': return 10
        return 11

    spaces_seen = [] # a buffer for spaces collected
    char_read = '' # a buffer allowing one-character lookahead
    previous = '' # the previous character read by "get_next_character"
    terminated = False # set after the last character of the source was read

    def get_next_character():
        # insert EOL at the end of the last line if missing
        global previous
        global terminated
        global char_read
        if terminated: raise StopIteration
        if char_read:
            ch = char_read; char_read = ''
        else:
            try:
                ch = next( input_text )
            except StopIteration:
                if previous != '' and catcode( previous )!= 5:
                    # if there is no EOL at EOF, insert one
                    ch = '\n'
                    terminated = True
                else:
                    raise StopIteration
        previous = ch
        return ch

    def get_next_token():
        # skip blanks at the end of a line
        global char_read
        global spaces_seen
        while True:
            if not spaces_seen:
                ch = get_next_character()
                if catcode( ch )== 10:
                    spaces_seen =[ ch ]
                    while True:
                        ch = get_next_character()
                        if catcode( ch )== 10:
                            spaces_seen += ch
                        elif catcode( ch )== 5:
                            spaces_seen = []
                            return( 0, ch, 5, f'{spaces_seen=}' )
                        else:
                            char_read = ch
                            break
                else:
                    return( 0, ch, catcode( ch ), f'{spaces_seen=}')
            if spaces_seen:
                ch = spaces_seen.pop( 0 )
                return( 1, ch, catcode( ch ), f'{spaces_seen=}')

    input_text = iter( 'Howdy___\nthere!' )

    def main():
        result = ''
        while True:
            try:
                token = get_next_token()
                result += token[ 1 ]
            except StopIteration:
                break
        print( repr( result ))

    main()

    stdout

    'Howdy\nthere!\n'


  • From Richard Heathfield@rjh@cpax.org.uk to comp.programming on Thu Jan 19 15:06:05 2023

    On 19/01/2023 2:48 pm, Stefan Ram wrote:
    ram@zedat.fu-berlin.de (Stefan Ram) writes:
    Let's take a very simple task: This scanner for text files
    has nothing more to do than to return every character,
    except to strip the spaces at the end of a line.

    Richard said that it matters what I need this for.

    I'd like to implement a tiny markup language

    Okay, BIG job with lots of complicated, so strive to keep each
    part relatively simple if you ever hope to get it working. Do it
    in whatever way comes most natural to your programming style,
    because that's how /you/ can define 'simple'. You're using
    Python, so I guess you're not overly concerned by performance, so
    do it the way you personally find easiest. I'm guessing you'll go
    for line by line and lean on Python's memory management.

    But write this down somewhere: if, further down the line, your
    parser turns out to be too slow and the profiler blames this bit,
    rewriting it to go byte by byte might well be one of the ways you
    could speed it up.
    --
    Richard Heathfield
    Email: rjh at cpax dot org dot uk
    "Usenet is a strange place" - dmr 29 July 1999
    Sig line 4 vacant - apply within

  • From ram@ram@zedat.fu-berlin.de (Stefan Ram) to comp.programming on Thu Jan 19 16:46:08 2023

    Richard Heathfield <rjh@cpax.org.uk> writes:
    Okay, BIG job with lots of complicated, so strive to keep each
    part relatively simple if you ever hope to get it working.

    I know I have to simplify things to make it work. To do this,
    I have broken the development into several iterations.
    During the first iteration, I just want to read a /single/
    paragraph containing only simple words without any markup,
    parse it into an internal format, and then generate two
    output formats from it: HTML and plain text.
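
    A possible shape for that first iteration, as a sketch only. The
    internal format is assumed here to be a plain list of words, and the
    function names are made up:

```python
import html
import textwrap

def parse_paragraph(source):
    """Parse one paragraph into the (assumed) internal format: its words."""
    return source.split()

def to_html(words):
    """Render the paragraph as an HTML <p> element, escaping special chars."""
    return '<p>' + html.escape(' '.join(words)) + '</p>'

def to_text(words, width=72):
    """Render the paragraph as plain text wrapped to `width` columns."""
    return textwrap.fill(' '.join(words), width=width)

words = parse_paragraph('Howdy   \nthere!')
print(to_html(words))   # <p>Howdy there!</p>
print(to_text(words))   # Howdy there!
```

    Keeping the parse step separate from the two writers means that
    later markup only has to extend the internal format and each writer.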

    In the next iteration, I want to extend this to a sequence
    of paragraphs. Still without any real markup.


  • From Ben Bacarisse@ben.usenet@bsb.me.uk to comp.programming on Thu Jan 19 18:08:04 2023

    ram@zedat.fu-berlin.de (Stefan Ram) writes:

    Some idle thoughts about scanning (lexical analysis, or
    rather what comes before it) ...

    Let's take a very simple task: This scanner for text files
    has nothing more to do than to return every character,
    except to strip the spaces at the end of a line.

    It is a function "get_next_token" that on each call will
    return the next character from a file to its client (caller),
    except that spaces at the end of a line will be skipped.

    So we read the line and strip the spaces. (One line in
    Python.)

    But how do I know in advance if the line will fit into
    memory?

    That's a huge assumption! There's no need to read the line just to skip
    spaces at the end. All you need to do is read and count them so you can
    "hand back" the right number of spaces if you don't see a newline
    character.

    But then this is not the real problem, I suspect. You probably want to
    skip spaces and tabs and probably other things at the end of a line.
    Then again, maybe you really want to replace multiple spaces with just
    one at this stage of the processing? That is the trouble with cut-down
    problem statements -- they can have simple solutions that don't apply in
    the real case.

    Mind you, I would try hard to avoid reading a line unless a line is
    really an important structure. You might only need to store the
    largest token.
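
    As an illustration of storing no more than one token, the following
    sketch reads one character at a time and keeps only the word being
    built. Treating whitespace-separated words as the tokens is an
    assumption, and the function name is hypothetical:

```python
def words(chars):
    """Yield whitespace-separated words, reading one character at a time."""
    current = []                 # characters of the token being built
    for ch in chars:
        if ch.isspace():
            if current:          # a blank ends the current token, if any
                yield ''.join(current)
                current.clear()
        else:
            current.append(ch)
    if current:                  # flush the last token at end of input
        yield ''.join(current)

print(list(words(iter('Howdy   \nthere!'))))  # ['Howdy', 'there!']
```

    Memory use is bounded by the longest token rather than by the
    longest line.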
    --
    Ben.
  • From ram@ram@zedat.fu-berlin.de (Stefan Ram) to comp.programming on Fri Jan 20 12:16:57 2023

    ram@zedat.fu-berlin.de (Stefan Ram) writes:
    In the next iteration, I want to extend this to a sequence
    of paragraphs. Still without any real markup.

    (As before, I was not able to shorten all lines of this post
    to the 72 characters which are recommended for Usenet posts,
    so please bear with me while some lines below will exceed
    the length of 72 characters. I do not ignore Usenet customs
    lightly, but only after painstaking consideration.)

    This post is just a report, but contains no questions to the
    group, so please read on only if you are interested in the topic!

    It was a bit difficult for me to figure out how to properly
    do things, so I resorted to reading Chapter 8 of the TeXbook
    where the scanning of TeX is explained. To verify my understanding,
    I wrote small snippets of TeX. For example,

    \tracingscantokens1
    \tracingcommands3
    \tracingonline1
    H

    \tracingscantokens0

    (That is, one line containing only an "H" and then one empty line.)

    gave this output on TeX:

    {the letter H}
    {horizontal mode: the letter H}
    {blank space }
    {\par}

    . This is because TeX converts the first \n (directly after "H")
    to a blank space and the next (directly below "H") to the control
    sequence "\par".

    I then tried to imitate this.

    Here are the test cases I wrote for my code in Python:

    catcode_dict[ '\t' ]= catcode_of_space # repeated here for clarification
    process( 'Howdy___\nthere!' )
    process( ' Howdy___\n there!_' )
    process( 'H__\n\n' )
    process( ' Howdy\n\n there!\n\n' )
    process( ' Howdy\n \n there!\n \n' )
    process( ' Howdy\n\n\n there!\n \n\n' )
    process( 'Howdy\n\t\nthere!' )
    catcode_dict[ '\t' ]= catcode_of_other
    process( 'Howdy\n\t\nthere! (catcode of tab temporarily rededfined to "other")' )
    catcode_dict[ '\t' ]= catcode_of_space
    process( '' )
    process( ' ' )
    process( ' ' )
    process( r''' In a Galaxy, there lived a man.
    He was happy when he was typing
    paragraphs.''' )

    One will see below, that, just like TeX, my scanner ignores
    a tab at the end of a line, when the tab character has been
    given the category of "space character" (as in plain TeX),
    but not when it has been given the category of "other
    character" (as in INITEX).

    The output follows below. Most tests pass, but there is
    still one error. (The error is: When the input is a sequence
    of blanks, it produces [par], but should produce nothing.)
    For demonstration purposes, the underscore "_" was made to
    act like a blank space.

    The actual output of the scanner is a sequence of tokens,
    but it was assembled into a string for the demonstration
    output below.

    The output often ends with one space, because a '\n' is
    added to the end of the input if it's missing, and this
    then is being converted to a space. So, ironically, while
    I set out to strip spaces at the end of lines, I now
    sometimes add them to the end of lines!

    'Howdy___\nthere!' (=input) ==>
    'Howdy there! ' (=output)

    ' Howdy___\n there!_' (=input) ==>
    'Howdy there! ' (=output)

    'H__\n\n' (=input) ==>
    'H [par]' (=output)

    ' Howdy\n\n there!\n\n' (=input) ==>
    'Howdy [par]there! [par]' (=output)

    ' Howdy\n \n there!\n \n' (=input) ==>
    'Howdy [par]there! [par]' (=output)

    ' Howdy\n\n\n there!\n \n\n' (=input) ==>
    'Howdy [par][par]there! [par][par]' (=output)

    'Howdy\n\t\nthere!' (=input) ==>
    'Howdy [par]there! ' (=output)

    'Howdy\n\t\nthere! (catcode of tab temporarily rededfined to "other")' (=input) ==>
    'Howdy \t there! (catcode of tab temporarily rededfined to "other") ' (=output)

    '' (=input) ==>
    '' (=output)

    ' ' (=input) ==>
    '[par]' (=output)

    ' ' (=input) ==>
    '[par]' (=output)

    ' In a Galaxy, there lived a man.\nHe was happy when he was typing\nparagraphs.' (=input) ==>
    'In a Galaxy, there lived a man. He was happy when he was typing paragraphs. ' (=output)
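
    The behaviour listed above can be reproduced with a small
    three-state scanner. This is a sketch, not the code that produced
    the output. The state names N (new line), M (mid-line) and
    S (skipping blanks) follow Chapter 8 of the TeXbook, while the
    function name, the space_chars parameter and the '[par]' marker
    string are assumptions for illustration:

```python
SPACE_CHARS = frozenset({' ', '_', '\t'})   # '_' acts as a blank, as above

def scan(text, space_chars=SPACE_CHARS):
    """Return the demo string for `text`, scanning character by character."""
    out = []
    state = 'N'                      # N: new line, M: mid-line, S: skipping
    if text and not text.endswith('\n'):
        text += '\n'                 # supply the missing final end-of-line
    for ch in text:
        if ch == '\n':
            if state == 'N':
                out.append('[par]')  # a line with no visible text
            elif state == 'M':
                out.append(' ')      # end of line after text acts as a space
            # In state S a space was already emitted; the EOL adds nothing.
            state = 'N'
        elif ch in space_chars:
            if state == 'M':
                out.append(' ')      # first blank after text: one space
                state = 'S'
            # In N or S, blanks are skipped.
        else:
            out.append(ch)
            state = 'M'
    return ''.join(out)

print(repr(scan('Howdy___\nthere!')))   # 'Howdy there! '
print(repr(scan(' ')))                  # '[par]'
```

    Passing space_chars={' ', '_'} gives the tab the role of an "other"
    character, as in the test with the temporarily redefined catcode.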


  • From Noel Duffy@uaigh@icloud.com to comp.programming on Sat Jan 21 11:29:13 2023

    On 21/01/23 01:16, Stefan Ram wrote:
    ram@zedat.fu-berlin.de (Stefan Ram) writes:
    [..]

    The output often ends with one space, because a '\n' is
    added to the end of the input if it's missing, and this
    then is being converted to a space. So, ironically, while
    I set out to strip spaces at the end of lines, I now
    sometimes add them to the end of lines!

    While I don't have any great insight to offer, I did write a small
    markup engine a few years ago (in Object Pascal). What you say above
    brought back memories of struggles I had with my code too. The
    conclusion I came to at the time is that when it comes to things like
    spacing, there are several equally valid ways to do it, and you'll
    probably want different handling for different use-cases, so it's better
    to parameterize it so that users of your code can set which handling
    they prefer. I went with making it a parameter. It's a bit more work but
    the flexibility is usually worth it in the long run.
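
    A sketch of what such parameterization might look like in Python.
    The option names and defaults are invented for this example and are
    not taken from the Object Pascal engine mentioned above:

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class SpacingOptions:
    strip_trailing: bool = True        # drop blanks at the end of each line
    collapse_runs: bool = False        # squeeze runs of blanks to one space
    ensure_final_newline: bool = True  # add '\n' at end of input if missing

def normalise(text, options=SpacingOptions()):
    """Apply the selected spacing rules to `text`."""
    lines = text.split('\n')
    if options.strip_trailing:
        lines = [line.rstrip(' \t') for line in lines]
    if options.collapse_runs:
        lines = [re.sub(r'[ \t]+', ' ', line) for line in lines]
    text = '\n'.join(lines)
    if options.ensure_final_newline and text and not text.endswith('\n'):
        text += '\n'
    return text

print(repr(normalise('Howdy   \nthere!')))  # 'Howdy\nthere!\n'
```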


  • From ram@ram@zedat.fu-berlin.de (Stefan Ram) to comp.programming on Sat Jan 21 15:00:35 2023

    ram@zedat.fu-berlin.de (Stefan Ram) writes:
    The output follows below. Most tests pass, but there is
    still one error. (The error is: When the input is a sequence
    of blanks, it produces [par], but should produce nothing.)

    In this case, the error was not in my code but in my
    assumptions. In fact, TeX behaves exactly this way too.
    When the input is empty, the output is empty (no tokens),
    but when the input is exactly one space, this yields the
    one token "\par". This is because an "\n" is added to
    the last line if it was missing. The space is ignored.
    This gives an "\n" at the start of the line, and a "\n"
    at the start of a line yields the "\par" token.


  • From Spiros Bousbouras@spibou@gmail.com to comp.programming on Sun Jan 22 16:44:44 2023

    On 19 Jan 2023 14:48:29 GMT
    ram@zedat.fu-berlin.de (Stefan Ram) wrote:
    ram@zedat.fu-berlin.de (Stefan Ram) writes:
    Let's take a very simple task: This scanner for text files
    has nothing more to do than to return every character,
    except to strip the spaces at the end of a line.

    Richard said that it matters what I need this for.

    I'd like to implement a tiny markup language similar
    to languages like "Markdown" or "reStructuredText".
    It should ignore spaces at the end of lines.
    I'm going to implement it in Python.

    Does it need to produce output before it has seen all the input? If
    not, then I would read not just a whole line but the whole file (or
    input)! It seems extravagant, but unless you have a realistic scenario
    where you worry that the whole input won't fit into memory, it is
    simplest to read the whole input into memory.
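
    With the whole input in memory, the stripping becomes a single
    substitution. A minimal sketch, assuming `fp` is any readable text
    stream; the function name is made up:

```python
import io
import re

def strip_trailing_spaces_whole(fp):
    """Read all of `fp` into memory and strip spaces at the end of lines."""
    text = fp.read()
    # With re.MULTILINE, '$' matches before each newline and at end of input.
    return re.sub(r' +$', '', text, flags=re.MULTILINE)

print(repr(strip_trailing_spaces_whole(io.StringIO('Howdy   \nthere! '))))
# 'Howdy\nthere!'
```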
  • From V V V V V V V V V V V V V V V V V V@vvvvvvvvaaaaaaaaaaaaaaa@mail.ee to comp.programming on Fri Jan 27 01:46:00 2023

    You are a devil !
    On Thursday, January 19, 2023 at 2:43:51 PM UTC+2, Richard Heathfield wrote:
    [..]