Forum: War Ensemble BBS

Scanning

From ram@ram@zedat.fu-berlin.de (Stefan Ram) to comp.programming on Thu Jan 19 12:10:36 2023

From Newsgroup: comp.programming

Some idle thoughts about scanning (lexical analysis, or
rather what comes before it) ...

Let's take a very simple task: This scanner for text files
has nothing more to do than to return every character,
except to strip the spaces at the end of a line.

It is a function "get_next_token" that on each call will
return the next character from a file to its client (caller),
except that spaces at the end of a line will be skipped.

So we read the line and strip the spaces. (One line in
Python.)
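
A minimal sketch of that line-based approach, assuming the input
arrives as an iterable of lines (an open text file would do) and that
only the space character counts as a blank. The name
chars_without_trailing_spaces is made up for this illustration.

```python
import io
from typing import Iterable, Iterator


def chars_without_trailing_spaces(lines: Iterable[str]) -> Iterator[str]:
    """Yield every character of the input except spaces at line ends."""
    for line in lines:
        body = line.rstrip('\n')        # the line without its terminator
        terminator = line[len(body):]   # '\n', or '' on an unterminated last line
        # The "one line in Python": strip the spaces, then hand out characters.
        yield from body.rstrip(' ') + terminator


if __name__ == '__main__':
    source = io.StringIO('Howdy   \nthere!')
    print(repr(''.join(chars_without_trailing_spaces(source))))
```

Each line is held in memory as a whole, which is exactly the
assumption questioned next.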

But how do I know in advance if the line will fit into
memory?

Perhaps because of such fears, traditional scanners¹ do not
read lines or, Heaven forbid, files, but only characters!

They do not use random access with respect to the text to be
scanned, but sequential access, although things would be
easier with random access.

So how would you do it with this style of programming (never
reading the whole line into memory)?

"I read a character. If it's a space, I peek at the next
character, if that's a space, I start adding spaces to my
look-ahead buffer. If an EOL is encountered, the look-ahead
buffer is discarded. Otherwise, I have to start feeding my
client from the lookahead buffer until the lookahead buffer
is empty."

If I am concerned that a line will not fit in memory, how do
I know that the sequence of spaces at the end of a line will
fit in memory (the look-ahead buffer)? The look-ahead buffer
could be replaced by a counter. If you are paranoid, you
would use a 64-bit counter and check it for overflow!

Is it worth the effort with a look-ahead buffer and
sequential access? Should you just read a line, assuming
that a line will always fit into memory, and strip the
blanks the easy way, i.e., using random access? TIA for any
comments!

¹ An example of a traditional scanner:

It only ever calls "GetCh", never "GetLine". The code could
be easier to write by reading a whole line and then just
using functions that can look at that line using random
access to get the next symbol (maybe using regular
expressions). But a traditional scanner carefully only ever
reads a single character and manages a state.

PROCEDURE GetSym;

  VAR i : CARDINAL;

BEGIN
  WHILE ch <= ' ' DO GetCh END;
  IF ch = '/' THEN
    SkipLine;
    WHILE ch <= ' ' DO GetCh END
  END;
  IF (CAP (ch) <= 'Z') AND (CAP (ch) >= 'A') THEN
    i := 0;
    sym := literal;
    REPEAT
      IF i < IdLength THEN
        id [i] := ch;
        INC (i)
      END;
      IF ch > 'Z' THEN sym := ident END;
      GetCh
  ...

--- Synchronet 3.20a-Linux NewsLink 1.113

From Richard Heathfield@rjh@cpax.org.uk to comp.programming on Thu Jan 19 12:43:45 2023

From Newsgroup: comp.programming

On 19/01/2023 12:10 pm, Stefan Ram wrote:

[..]

man 3 realloc

This was a perennial comp.lang.c topic back in the day.

My interface looked (and still looks) like this:

#define FGDATA_BUFSIZ BUFSIZ /* adjust to taste */
#define FGDATA_WRDSIZ sizeof("floccinaucinihilipilification")
#define FGDATA_REDUCE 1

int fgetline(char **line, size_t *size, size_t maxrecsize,
             FILE *fp, unsigned int flags, size_t *plen);

It's easier to use than it might look:

char *data = NULL; /* where will the data go? NULL is fine */
size_t size = 0; /* how much space do we have right now? */
size_t len = 0; /* after call, holds line length */

while(fgetline(&data, &size, (size_t)-1, stdin, 0, &len) == 0)
{
if(len > 0)

If you want fgetline.c and don't have 20 years of clc archives,
just yell.
--
Richard Heathfield
Email: rjh at cpax dot org dot uk
"Usenet is a strange place" - dmr 29 July 1999
Sig line 4 vacant - apply within

From Richard Damon@Richard@Damon-Family.org to comp.programming on Thu Jan 19 07:48:09 2023

From Newsgroup: comp.programming

On 1/19/23 7:10 AM, Stefan Ram wrote:

[..]

Because of the particulars of this problem, you don't need a look-ahead
buffer, just a count of the spaces you have seen and the character that
follows them.

If you had to handle multiple types of whitespace (an arbitrary mix of
tabs and spaces, for example), then you would need a buffer, and you
would need to handle the case of that sequence being "too long".

In general, a parser needs a way to report that the file is too
"complicated" for it.

Even the simple counter version has a limit: the point at which the
type of the counter overflows.
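
A minimal sketch of the counter version, assuming the space character
is the only blank and the input is any iterable of single characters.
The function name is invented for this illustration. Python integers
do not overflow, so the limit mentioned above would only show up in a
language with fixed-width counters.

```python
from typing import Iterable, Iterator


def strip_trailing_spaces(chars: Iterable[str]) -> Iterator[str]:
    """Pass characters through, dropping runs of spaces that end a line."""
    pending = 0                       # spaces seen but not yet handed on
    for ch in chars:
        if ch == ' ':
            pending += 1              # hold the space back for now
        elif ch == '\n':
            pending = 0               # the run ended the line: discard it
            yield ch
        else:
            for _ in range(pending):  # the run was inside the line: hand it back
                yield ' '
            pending = 0
            yield ch
    # Spaces still pending at end of input ended the last line and are dropped.


if __name__ == '__main__':
    print(repr(''.join(strip_trailing_spaces('Howdy   \nthe  re!  '))))
```

The only state kept between characters is one integer, however long
the line or the run of spaces is.
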

From ram@ram@zedat.fu-berlin.de (Stefan Ram) to comp.programming on Thu Jan 19 13:38:36 2023

From Newsgroup: comp.programming

Richard Heathfield <rjh@cpax.org.uk> writes:

This was a perennial comp.lang.c topic back in the day.

But what about writing a scanner in languages with automatic
memory management where reading a whole line is very simple
and assuming an input language that limits line length to
some reasonable value, say, 1,000,000 characters?

In such a language, would there still be reasons not to
read the whole line into memory, but to read it char-by-char
as traditional scanners do?

From Dmitry A. Kazakov@mailbox@dmitry-kazakov.de to comp.programming on Thu Jan 19 14:50:58 2023

From Newsgroup: comp.programming

On 2023-01-19 13:10, Stefan Ram wrote:

But how do I know in advance if the line will fit into
memory?

No idea, my parser reads the whole source line into the buffer.

Perhaps because of such fears, traditional scanners¹ do not
read lines or, Heaven forbid, files, but only characters!

I think it is more C/UNIX tradition coming from having neither proper
strings in the language nor lines/records in the filesystem.

So how would you do it with this style of programming (never
reading the whole line into memory)?

By never following this style and never using scanners, lexers,
tokenizers and other primitive stuff. I do all that in a single pass
that produces either the code or else the AST.

"I read a character. If it's a space, I peek at the next
character, if that's a space, I start adding spaces to my
look-ahead buffer. If an EOL is encountered, the look-ahead
buffer is discarded. Otherwise, I have to start feeding my
client from the lookahead buffer until the lookahead buffer
is empty."

Reasonable languages deploy the rule that one blank character is
equivalent to any number of blank characters, so you could simply pass
one single space further. Note that you have to annotate tokens with
their source location anyway (another reason for ditching the scanner
altogether). So you do not need to care about what this blank was built
of. And yet another reason not to use a scanner is that the blank can be
part of a, possibly malformed, comment or literal.
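
The "one blank equals many" rule together with source locations could
be sketched roughly as follows. This is only an illustration under
assumptions: space and tab are taken as the blank characters, and the
names Located and collapse_blanks are hypothetical, not taken from any
parser mentioned in this thread.

```python
from typing import Iterable, Iterator, NamedTuple


class Located(NamedTuple):
    text: str
    line: int     # 1-based line of the first source character
    column: int   # 1-based column of the first source character


def collapse_blanks(chars: Iterable[str]) -> Iterator[Located]:
    """Pass on one space per run of blanks, keeping where the run began."""
    line, column = 1, 1
    in_blank_run = False
    for ch in chars:
        if ch in ' \t':
            if not in_blank_run:
                yield Located(' ', line, column)   # one space stands for the run
                in_blank_run = True
        else:
            in_blank_run = False
            yield Located(ch, line, column)
        # Advance the source position past the character just consumed.
        if ch == '\n':
            line, column = line + 1, 1
        else:
            column += 1


if __name__ == '__main__':
    for item in collapse_blanks('a  \tb\nc'):
        print(item)
```

Because every item carries its own location, whatever the blank was
built of can be forgotten without losing the ability to report errors
at the right place.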

Is it worth the effort with a look-ahead buffer and
sequential access? Should you just read a line, assuming
that a line will always fit into memory, and strip the
blanks the easy way, i.e., using random access?

My parser works with an abstract source object. The implementation of
the source object maintains an internal line buffer, whose size is a
parameter. Whether it is set to 1TB or 1024 bytes, the parser does not care.
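
A rough Python sketch of what such a source object might look like.
Everything here is an assumption made for illustration: the class
names, the buffer_size parameter, and the choice to raise an exception
when a line does not fit are invented, not a description of the parser
discussed above.

```python
import io
from typing import Optional, TextIO


class LineTooLong(Exception):
    """Raised when a source line does not fit into the line buffer."""


class Source:
    def __init__(self, stream: TextIO, buffer_size: int = 1024) -> None:
        self._stream = stream
        self._buffer_size = buffer_size   # maximum characters per line, terminator included
        self.line: Optional[str] = None   # the current line, or None before start / at end
        self.line_number = 0

    def next_line(self) -> bool:
        """Load the next line; return False at end of input."""
        # readline(n) returns at most n characters, so asking for one more
        # than the buffer holds shows whether the line would have fit.
        chunk = self._stream.readline(self._buffer_size + 1)
        if chunk == '':
            self.line = None
            return False
        if len(chunk) > self._buffer_size:
            raise LineTooLong(f'line {self.line_number + 1} exceeds {self._buffer_size} characters')
        self.line = chunk
        self.line_number += 1
        return True


if __name__ == '__main__':
    source = Source(io.StringIO('Howdy   \nthere!'), buffer_size=80)
    while source.next_line():
        print(source.line_number, repr(source.line))
```

The client only ever calls next_line and looks at line, so the buffer
size can be changed without touching the parser.
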
--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

From Richard Heathfield@rjh@cpax.org.uk to comp.programming on Thu Jan 19 13:56:29 2023

From Newsgroup: comp.programming

On 19/01/2023 1:38 pm, Stefan Ram wrote:

Richard Heathfield <rjh@cpax.org.uk> writes:

This was a perennial comp.lang.c topic back in the day.

But what about writing a scanner in languages with automatic
memory management where reading a whole line is very simple
and assuming an input language that limits line length to
some reasonable value, say, 1,000,000 characters?

In such a language, would there still be reasons not to
read the whole line into memory, but to read it char-by-char
as traditional scanners do?

There are always reasons, and sometimes they conflict.

For example, memory management, which should be done by the
language because it's too important to be left to the programmer,
and which should be done by the programmer because it's too
important to be left to the language.

What are your priorities? Run speed? Speed of development? Code
re-use? Scalability? Programmer cost? Robustness? Security?

And what are your constraints?

I'm not asking for your answer to these questions. I'm just
pointing out that the answer to your question will depend at
least in part on the answers to mine.
--
Richard Heathfield
Email: rjh at cpax dot org dot uk
"Usenet is a strange place" - dmr 29 July 1999
Sig line 4 vacant - apply within

From ram@ram@zedat.fu-berlin.de (Stefan Ram) to comp.programming on Thu Jan 19 14:48:29 2023

From Newsgroup: comp.programming

ram@zedat.fu-berlin.de (Stefan Ram) writes:

Let's take a very simple task: This scanner for text files
has nothing more to do than to return every character,
except to strip the spaces at the end of a line.

Richard said that it matters what I need this for.

I'd like to implement a tiny markup language similar
to languages like "Markdown" or "reStructuredText".
It should ignore spaces at the end of lines.
I'm going to implement it in Python.

Here is a first draft of a scanner that strips
spaces at the end of lines. It works by reading
single characters from the source.

For demonstration purposes, I have written spaces
as underlines "_".

The demo takes

Howdy___\nthere!

as input and outputs

Howdy\nthere!\n

. (It also tries to insert '\n' at the end of a
source when there is no '\n' at the end.)

The input text is given in the source code via

input_text = iter( 'Howdy___\nthere!' )

. What I now need to do next is to write more
tests in order to find errors. (I avoided using
classes to make the code a bit easier to read for
the newsgroup, but the code also will be changed
soon to use a class definition.)

Python 3.9

main.py

def catcode( ch ):
    # 5 means: "this is a line terminator"
    # 10 means: "this is a blank space"
    # 11 means: "this is a plain character"
    if ch == '\n': return 5
    if ch == ' ': return 10
    if ch == '_': return 10 # for debugging, make "_" a space
    if ch == '\t': return 10
    return 11

spaces_seen = [] # a buffer for spaces collected
char_read = '' # a buffer allowing one-character lookahead
previous = '' # the previous character read by "get_next_character"
terminated = False # set after the last character of the source was read

def get_next_character():
    # insert EOL at the end of the last line if missing
    global previous
    global terminated
    global char_read
    if terminated: raise StopIteration
    if char_read:
        ch = char_read; char_read = ''
    else:
        try:
            ch = next( input_text )
        except StopIteration:
            if previous != '' and catcode( previous )!= 5:
                # if there is no EOL at EOF, insert one
                ch = '\n'
                terminated = True
            else:
                raise StopIteration
    previous = ch
    return ch

def get_next_token():
    # skip blanks at the end of a line
    global char_read
    global spaces_seen
    while True:
        if not spaces_seen:
            ch = get_next_character()
            if catcode( ch )== 10:
                spaces_seen =[ ch ]
                while True:
                    ch = get_next_character()
                    if catcode( ch )== 10:
                        spaces_seen += ch
                    elif catcode( ch )== 5:
                        spaces_seen = []
                        return( 0, ch, 5, f'{spaces_seen=}' )
                    else:
                        char_read = ch
                        break
            else:
                return( 0, ch, catcode( ch ), f'{spaces_seen=}')
        if spaces_seen:
            ch = spaces_seen.pop( 0 )
            return( 1, ch, catcode( ch ), f'{spaces_seen=}')

input_text = iter( 'Howdy___\nthere!' )

def main():
    result = ''
    while True:
        try:
            token = get_next_token()
            result += token[ 1 ]
        except StopIteration:
            break
    print( repr( result ))

main()

stdout

'Howdy\nthere!\n'

From Richard Heathfield@rjh@cpax.org.uk to comp.programming on Thu Jan 19 15:06:05 2023

From Newsgroup: comp.programming

On 19/01/2023 2:48 pm, Stefan Ram wrote:

ram@zedat.fu-berlin.de (Stefan Ram) writes:

Let's take a very simple task: This scanner for text files
has nothing more to do than to return every character,
except to strip the spaces at the end of a line.

Richard said that it matters what I need this for.

I'd like to implement a tiny markup language

Okay, BIG job with lots of complicated, so strive to keep each
part relatively simple if you ever hope to get it working. Do it
in whatever way comes most natural to your programming style,
because that's how /you/ can define 'simple'. You're using
Python, so I guess you're not overly concerned by performance, so
do it the way you personally find easiest. I'm guessing you'll go
for line by line and lean on Python's memory management.

But write this down somewhere: if, further down the line, your
parser turns out to be too slow and the profiler blames this bit,
rewriting it to go byte by byte might well be one of the ways you
could speed it up.
--
Richard Heathfield
Email: rjh at cpax dot org dot uk
"Usenet is a strange place" - dmr 29 July 1999
Sig line 4 vacant - apply within

From ram@ram@zedat.fu-berlin.de (Stefan Ram) to comp.programming on Thu Jan 19 16:46:08 2023

From Newsgroup: comp.programming

Richard Heathfield <rjh@cpax.org.uk> writes:

Okay, BIG job with lots of complicated, so strive to keep each
part relatively simple if you ever hope to get it working.

I know I have to simplify things to make it work. To do this,
I have broken the development into several iterations.
During the first iteration, I just want to read a /single/
paragraph containing only simple words without any markup,
parse it into an internal format, and then generate two
output formats from it: HTML and plain text.

In the next iteration, I want to extend this to a sequence
of paragraphs. Still without any real markup.

From Ben Bacarisse@ben.usenet@bsb.me.uk to comp.programming on Thu Jan 19 18:08:04 2023

From Newsgroup: comp.programming

ram@zedat.fu-berlin.de (Stefan Ram) writes:

Some idle thoughts about scanning (lexical analysis, or
rather what comes before it) ...

Let's take a very simple task: This scanner for text files
has nothing more to do than to return every character,
except to strip the spaces at the end of a line.

It is a function "get_next_token" that on each call will
return the next character from a file to its client (caller),
except that spaces at the end of a line will be skipped.

So we read the line and strip the spaces. (One line in
Python.)

But how do I know in advance if the line will fit into
memory?

That's a huge assumption! There's no need to read the line just to skip
spaces at the end. All you need to do is read and count them so you can
"hand back" the right number of spaces if you don't see a newline
character.

But then this is not the real problem, I suspect. You probably want to
skip spaces and tabs and probably other things at the end of a line.
Then again, maybe you really want to replace multiple spaces with just
one at this stage of the processing? That is the trouble with cut-down
problem statements -- they can have simple solutions that don't apply in
the real case.

Mind you, I would try hard to avoid reading a line unless a line is
really an important structure. You might only need to store the
largest token.
--
Ben.

From ram@ram@zedat.fu-berlin.de (Stefan Ram) to comp.programming on Fri Jan 20 12:16:57 2023

From Newsgroup: comp.programming

ram@zedat.fu-berlin.de (Stefan Ram) writes:

In the next iteration, I want to extend this to a sequence
of paragraphs. Still without any real markup.

(As before, I was not able to shorten all lines of this post
to the 72 characters which are recommended for Usenet posts,
so please bear with me while some lines below will exceed
the length of 72 characters. I do not ignore Usenet customs
lightly, but only after painstaking consideration.)

This post is just a report, but contains no questions to the
group, so please read on only if you are interested in the topic!

It was a bit difficult for me to figure out how to properly
do things, so I resorted to reading Chapter 8 of the TeXbook
where the scanning of TeX is explained. To verify my understanding,
I wrote small snippets of TeX. For example,

\tracingscantokens1
\tracingcommands3
\tracingonline1
H

\tracingscantokens0

(That is, one line containing only an "H" and then one empty line.)

gave this output on TeX:

{the letter H}
{horizontal mode: the letter H}
{blank space }
{\par}

. This is because TeX converts the first \n (directly after "H")
to a blank space and the next (directly below "H") to the control
sequence "\par".

I then tried to imitate this.

Here are the test cases I wrote for my code in Python:

catcode_dict[ '\t' ]= catcode_of_space # repeated here for clarification
process( 'Howdy___\nthere!' )
process( ' Howdy___\n there!_' )
process( 'H__\n\n' )
process( ' Howdy\n\n there!\n\n' )
process( ' Howdy\n \n there!\n \n' )
process( ' Howdy\n\n\n there!\n \n\n' )
process( 'Howdy\n\t\nthere!' )
catcode_dict[ '\t' ]= catcode_of_other
process( 'Howdy\n\t\nthere! (catcode of tab temporarily rededfined to "other")' )
catcode_dict[ '\t' ]= catcode_of_space
process( '' )
process( ' ' )
process( ' ' )
process( r''' In a Galaxy, there lived a man.
He was happy when he was typing
paragraphs.''' )

One will see below that, just like TeX, my scanner ignores
a tab at the end of a line when the tab character has been
given the category of "space character" (as in plain TeX),
but not when it has been given the category of "other
character" (as in INITEX).

The output follows below. Most tests pass, but there is
still one error. (The error is: When the input is a sequence
of blanks, it produces [par], but should produce nothing.)
For demonstration purposes, the underscore "_" was made to
act like a blank space.

The actual output of the scanner is a sequence of tokens,
but it was assembled into a string for the demonstration
output below.

The output often ends with one space, because a '\n' is
added to the end of the input if it's missing, and this
then is being converted to a space. So, ironically, while
I set out to strip spaces at the end of lines, I now
sometimes add them to the end of lines!

'Howdy___\nthere!' (=input) ==>
'Howdy there! ' (=output)

' Howdy___\n there!_' (=input) ==>
'Howdy there! ' (=output)

'H__\n\n' (=input) ==>
'H [par]' (=output)

' Howdy\n\n there!\n\n' (=input) ==>
'Howdy [par]there! [par]' (=output)

' Howdy\n \n there!\n \n' (=input) ==>
'Howdy [par]there! [par]' (=output)

' Howdy\n\n\n there!\n \n\n' (=input) ==>
'Howdy [par][par]there! [par][par]' (=output)

'Howdy\n\t\nthere!' (=input) ==>
'Howdy [par]there! ' (=output)

'Howdy\n\t\nthere! (catcode of tab temporarily rededfined to "other")' (=input) ==>
'Howdy \t there! (catcode of tab temporarily rededfined to "other") ' (=output)

'' (=input) ==>
'' (=output)

' ' (=input) ==>
'[par]' (=output)

' ' (=input) ==>
'[par]' (=output)

' In a Galaxy, there lived a man.\nHe was happy when he was typing\nparagraphs.' (=input) ==>
'In a Galaxy, there lived a man. He was happy when he was typing paragraphs. ' (=output)

From Noel Duffy@uaigh@icloud.com to comp.programming on Sat Jan 21 11:29:13 2023

From Newsgroup: comp.programming

On 21/01/23 01:16, Stefan Ram wrote:

ram@zedat.fu-berlin.de (Stefan Ram) writes:

[..]

The output often ends with one space, because a '\n' is
added to the end of the input if it's missing, and this
then is being converted to a space. So, ironically, while
I set out to strip spaces at the end of lines, I now
sometimes add them to the end of lines!

While I don't have any great insight to offer, I did write a small
markup engine a few years ago (in Object Pascal). What you say above
brought back memories of struggles I had with my code too. The
conclusion I came to at the time is that when it comes to things like
spacing, there are several equally valid ways to do it, and you'll
probably want different handling for different use-cases, so it's better
to parameterize it so that users of your code can set which handling
they prefer. I went with making it a parameter. It's a bit more work but
the flexibility is usually worth it in the long run.
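
In Python, making the handling a parameter might look something like
the sketch below. The policy names, the clean_line function, and its
defaults are all invented here for illustration; they are not taken
from the Object Pascal engine mentioned above.

```python
from enum import Enum


class TrailingSpace(Enum):
    KEEP = 'keep'          # leave the end of the line untouched
    STRIP = 'strip'        # remove all trailing blanks
    COLLAPSE = 'collapse'  # replace a trailing run of blanks with one space


def clean_line(line: str,
               trailing: TrailingSpace = TrailingSpace.STRIP,
               blank_chars: str = ' \t') -> str:
    """Apply the chosen trailing-blank policy to a line without its terminator."""
    if trailing is TrailingSpace.KEEP:
        return line
    stripped = line.rstrip(blank_chars)
    if trailing is TrailingSpace.COLLAPSE and stripped != line:
        return stripped + ' '
    return stripped


if __name__ == '__main__':
    for policy in TrailingSpace:
        print(policy.name, repr(clean_line('Howdy  \t', policy)))
```

Which characters count as blank is a parameter too, so a caller could
treat tabs as ordinary characters by passing blank_chars=' '.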

From ram@ram@zedat.fu-berlin.de (Stefan Ram) to comp.programming on Sat Jan 21 15:00:35 2023

From Newsgroup: comp.programming

ram@zedat.fu-berlin.de (Stefan Ram) writes:

The output follows below. Most tests pass, but there is
still one error. (The error is: When the input is a sequence
of blanks, it produces [par], but should produce nothing.)

In this case, the error was not in my code but in my
assumptions. In fact, TeX behaves exactly this way too.
When the input is empty, the output is empty (no tokens),
but when the input is exactly one space, this yields the
one token "\par". This is because an "\n" is added to
the last line if it was missing. The space is ignored.
This gives an "\n" at the start of the line, and a "\n"
at the start of a line yields the "\par" token.
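
The rules described above can be condensed into a small sketch. It is
not the scanner from the earlier posts, only a simplified line-based
illustration under assumptions: the state names N, M and S follow
Chapter 8 of the TeXbook, the function name tex_like_scan is made up,
the set of blank characters is a parameter, and "[par]" stands in for
the "\par" token as in the demonstration output earlier.

```python
def tex_like_scan(text: str, blanks: str = ' \t') -> str:
    """Return the token stream as one string, writing [par] for a paragraph token."""
    # Add the missing line terminator to the last line.
    if text and not text.endswith('\n'):
        text += '\n'
    out = []
    # After the step above the final element of split() is always '',
    # so dropping it leaves exactly the input lines.
    for line in text.split('\n')[:-1]:
        line = line.rstrip(' ')    # spaces at the end of a line are dropped first
        state = 'N'                # N: new line, M: mid-line, S: skipping blanks
        for ch in line:
            if ch in blanks:
                if state == 'M':   # only the first blank after text becomes a token
                    out.append(' ')
                    state = 'S'
            else:
                out.append(ch)
                state = 'M'
        # Now the end-of-line character itself.
        if state == 'N':
            out.append('[par]')    # nothing but blanks on this line
        elif state == 'M':
            out.append(' ')        # an end of line after text acts like a space
        # In state S the end of line is dropped.
    return ''.join(out)


if __name__ == '__main__':
    for sample in ['', ' ', 'H  \n\n', 'Howdy   \nthere!']:
        print(repr(sample), '==>', repr(tex_like_scan(sample)))
```

With this sketch the empty input gives no tokens, while a single space
gives '[par]', for the reason given above: the space is dropped, the
added "\n" is then met in state N, and that yields the paragraph token.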

From Spiros Bousbouras@spibou@gmail.com to comp.programming on Sun Jan 22 16:44:44 2023

From Newsgroup: comp.programming

On 19 Jan 2023 14:48:29 GMT
ram@zedat.fu-berlin.de (Stefan Ram) wrote:

ram@zedat.fu-berlin.de (Stefan Ram) writes:

Let's take a very simple task: This scanner for text files
has nothing more to do than to return every character,
except to strip the spaces at the end of a line.

Richard said that it matters what I need this for.

I'd like to implement a tiny markup language similar
to languages like "Markdown" or "reStructuredText".
It should ignore spaces at the end of lines.
I'm going to implement it in Python.

Does it need to have functionality where it produces output before it has
seen all the input? If not, then I would not just read a whole line but a
whole file (or input)! It seems extravagant, but unless you have a realistic
scenario where you worry that the whole input won't fit into memory, it is
simplest to read the whole input into memory.
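
With the whole input in memory, the stripping itself could be a single
regular-expression substitution. A minimal sketch, assuming only the
space character is to be removed; the function name is invented for
this illustration.

```python
import re


def strip_trailing_spaces_whole(text: str) -> str:
    """Remove spaces at the end of every line of an input held in memory."""
    # With re.MULTILINE, '$' matches at the end of each line,
    # not only at the end of the whole text.
    return re.sub(r' +$', '', text, flags=re.MULTILINE)


if __name__ == '__main__':
    print(repr(strip_trailing_spaces_whole('Howdy   \nthere!  ')))
```
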

From V V V V V V V V V V V V V V V V V V@vvvvvvvvaaaaaaaaaaaaaaa@mail.ee to comp.programming on Fri Jan 27 01:46:00 2023

From Newsgroup: comp.programming

You are a devil !
On Thursday, January 19, 2023 at 2:43:51 PM UTC+2, Richard Heathfield wrote:

[..]
