• awk functionality not in ksh93

    From Axel Reichert@mail@axel-reichert.de to comp.lang.awk on Thu Feb 17 21:24:59 2022
    From Newsgroup: comp.lang.awk

    Hello,

    following the recent discussion about staying in awk versus promiscuous pipelining, I tried to come up with simple awk idioms that might replace
    my most common external commands (bc, cat, cut, echo, grep, head, nl,
    printf, sed, tac, tail, tr, wc), enable me to stay within awk and thus
    reap the benefits of having access to more sophisticated program logic
    than a simple pipeline.

    To my delight and surprise I discovered quite a lot of features that I overlooked or did not understand before (e.g., I almost never used
    (g)sub, but rather went for sed). Some pages with one-liners, such as

    https://catonmat.net/awk-one-liners-explained-part-one

    also helped to rekindle my appreciation of this great tool. So thanks
    everyone here (especially Janis) for nudging me!

    While searching here and there on the web, I stumbled again across
    ksh93, which was advertised back then with

    The new version of ksh has the functionality of other scripting
    languages such as awk, icon, perl, rexx, and tcl. For this and many
    other reasons, ksh is a much better scripting language than any of the
    other popular shells.

    When I saw that it comes with floating-point arithmetic and arrays (indexed
    and associative), the question occurred to me of how true the rather bold
    claim quoted above is.

    In other words: How feasible is it to stay not within awk (the idea from
    the start of this post), but rather stay within ksh93 solely? Is it so
    powerful that awk could be burned at the stake? I have doubts, but
    neither is my awk knowledge solid enough nor did I ever do serious work
    with ksh93 to judge this from a technical perspective rather than with
    gut feeling.

    So I would be very happy to learn about awk functionality not in ksh93!

    Thanks and best regards

    Axel
    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From gazelle@gazelle@shell.xmission.com (Kenny McCormack) to comp.lang.awk,comp.unix.shell on Thu Feb 17 21:25:17 2022
    From Newsgroup: comp.lang.awk

    In article <87fsohxmuc.fsf@axel-reichert.de>,
    Axel Reichert <mail@axel-reichert.de> wrote:
    ...
    While searching here and there on the web, I stumbled again across
    ksh93, which was advertised back then with

    The new version of ksh has the functionality of other scripting
    languages such as awk, icon, perl, rexx, and tcl. For this and many
    other reasons, ksh is a much better scripting language than any of the
    other popular shells.

    When I saw that it comes with floating-point arithmetic and arrays (indexed
    and associative), the question occurred to me of how true the rather bold
    claim quoted above is.

    I wonder about that. I doubt I'd have to work very hard to come up with something that I can do in GAWK that can't be done in ksh93.

    But what I'm really wondering about is whether it makes sense to switch to
    ksh from bash for my day-to-day shell scripting. I'm pretty familiar and comfortable with bash at this point; I'd rather not switch unless there was
    a good reason.

    The system that I am typing on right now has /usr/bin/ksh as a link to /usr/bin/ksh2020 - which is, presumably, 27 years better than ksh93 (if you
    see what I mean...). I'd like to hear from people knowledgeable on both
    shells as to what kind of advantages ksh has over bash.

    The one that I am aware of is floating point math handled natively in the shell. This is a significant thing, and one I often miss in bash. I see
    no particular reason why it could not be implemented in bash. In fact,
    I've worked up a system where I run "bc" as a co-process that works pretty well. Note that if you Google "floating point in bash", you'll get lots of suggestions to use "bc", but none (AFAICT) mention running it as a
    co-process. That seems to be my innovation.
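
    In case anyone wants to try the idea, here is a minimal sketch of such
    a setup -- bash 4+ "coproc", and just one way to wire it up, not
    necessarily the exact system described above:

        #!/usr/bin/env bash
        coproc BC { bc -l; }                 # bc stays alive for the whole script

        calc() {                             # calc 'expr' -> floating point result
            printf '%s\n' "$1" >&"${BC[1]}"  # feed the expression to bc
            local result
            read -r result <&"${BC[0]}"      # read bc's reply
            printf '%s\n' "$result"
        }

        calc 'scale=4; 22/7'                 # 3.1428
        calc 's(1)'                          # sine, from bc -l's math library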
    --
    Kenny, I'll ask you to stop using quotes of mine as taglines.

    - Rick C Hodgin -

    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Ed Morton@mortonspam@gmail.com to comp.lang.awk on Thu Feb 17 15:36:44 2022
    From Newsgroup: comp.lang.awk

    On 2/17/2022 2:24 PM, Axel Reichert wrote:
    Hello,

    following the recent discussion about staying in awk versus promiscuous pipelining, I tried to come up with simple awk idioms that might replace
    my most common external commands (bc, cat, cut, echo, grep, head, nl,
    printf, sed, tac, tail, tr, wc), enable me to stay within awk and thus
    reap the benefits of having access to more sophisticated program logic
    than a simple pipeline.

    To my delight and surprise I discovered quite a lot of features that I overlooked or did not understand before (e.g., I almost never used
    (g)sub, but rather went for sed). Some pages with one-liners, such as

    https://catonmat.net/awk-one-liners-explained-part-one

    also helped to rekindle my appreciation of this great tool. So thanks everyone here (especially Janis) for nudging me!

    While searching here and there on the web, I stumbled again across
    ksh93, which was advertised back then with

    The new version of ksh has the functionality of other scripting
    languages such as awk, icon, perl, rexx, and tcl. For this and many
    other reasons, ksh is a much better scripting language than any of the
    other popular shells.

    When I saw that it comes with floating-point arithmetic and arrays (indexed
    and associative), the question occurred to me of how true the rather bold
    claim quoted above is.

    In other words: How feasible is it to stay not within awk (the idea from
    the start of this post), but rather stay within ksh93 solely? Is it so powerful that awk could be burned at the stake? I have doubts, but
    neither is my awk knowledge solid enough nor did I ever do serious work
    with ksh93 to judge this from a technical perspective rather than with
    gut feeling.

    So I would be very happy to learn about awk functionality not in ksh93!

    Thanks and best regards

    Axel

    A shell is an environment from which to manipulate (create/destroy)
    files and processes with a language to sequence calls to tools. It is
    not a tool to manipulate text. Awk is a tool to manipulate text. In fact
    awk is the tool that the people who invented shell also invented for
    shell to call to do general purpose text manipulation.

    So no, no matter what language constructs it has, a shell is not
    designed to replace awk and vice-versa. See https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice
    for some information on some aspects of that.

    Ed.
    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Janis Papanagnou@janis_papanagnou@hotmail.com to comp.lang.awk on Thu Feb 17 22:38:29 2022
    From Newsgroup: comp.lang.awk

    On 17.02.2022 21:24, Axel Reichert wrote:
    Hello,

    following the recent discussion about staying in awk versus promiscuous pipelining, I tried to come up with simple awk idioms that might replace
    my most common external commands (bc, cat, cut, echo, grep, head, nl,
    printf, sed, tac, tail, tr, wc), enable me to stay within awk and thus
    reap the benefits of having access to more sophisticated program logic
    than a simple pipeline.

    To my delight and surprise I discovered quite a lot of features that I overlooked or did not understand before (e.g., I almost never used
    (g)sub, but rather went for sed). Some pages with one-liners, such as

    https://catonmat.net/awk-one-liners-explained-part-one

    also helped to rekindle my appreciation of this great tool. So thanks everyone here (especially Janis) for nudging me!

    While searching here and there on the web, I stumbled again across
    ksh93, which was advertised back then with

    The new version of ksh has the functionality of other scripting
    languages such as awk, icon, perl, rexx, and tcl. For this and many
    other reasons, ksh is a much better scripting language than any of the
    other popular shells.

    When I saw that it comes with floating-point arithmetic and arrays (indexed
    and associative), the question occurred to me of how true the rather bold
    claim quoted above is.

    In other words: How feasible is it to stay not within awk (the idea from
    the start of this post), but rather stay within ksh93 solely? Is it so powerful that awk could be burned at the stake? I have doubts, but
    neither is my awk knowledge solid enough nor did I ever do serious work
    with ksh93 to judge this from a technical perspective rather than with
    gut feeling.

    So I would be very happy to learn about awk functionality not in ksh93!

    Glad you took advantage of the discussion here. Though I'd like to
    address some details in your post where you may have misunderstood
    my stance.

    First some personal background: I have been using ksh since the early
    1990s and still use ksh93 on a daily basis, also for scripting. In the
    1990s I also started using awk regularly, and I still use it frequently.
    I combine these tools regularly as well, and I still use the classic
    "Unix tool-chest".

    The point of my previous posts was to be suspicious when starting to
    connect a lot of the primitive tools - you mentioned them above - by
    pipelines especially in _addition_ to one (or more) awk instances in
    that same pipeline.

    I observed that many folks who are used to the Unix tools seem to use
    awk mostly like one of the other primitive Unix tools; often just to
    select fields in a more comfortable way than 'cut', by awk '{print $2}'
    and the like. Sometimes they make use of more features, but still
    without recognizing that awk can simply subsume those Unix tools'
    features. Sometimes it seems that they even know the advanced features
    but are just used to old habits and have difficulty switching.

    Some of the Unix tools, while in principle implementable with awk, I'd
    not implement that way. Examples are 'tac' or 'tail -N' (as opposed to
    'tail +N', which is easy to implement), because the functions of these
    tools require buffering of data. Often these tools are only used at the
    front or rear of a pipeline, so they are easily added per pipe when
    necessary (a sketch of such buffering follows below). In cases where
    you store the input data in awk arrays anyway you can of course also
    omit these tools.
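
    For instance, emulating 'tail -n 5' needs a ring buffer of the last
    five lines (a minimal sketch in awk):

        awk -v n=5 '
            { buf[NR % n] = $0 }                 # remember only the last n lines
            END {
                for (i = NR - n + 1; i <= NR; i++)
                    if (i > 0) print buf[i % n]  # guard: input shorter than n
            }
        ' file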

    Now on to your main topic. You quoted: "ksh is a much better scripting
    language than any of the other popular shells." - This is restricting
    the comparison to _shells_. While I basically agree with it, it's not
    a black-or-white issue. It has a lot of interesting features, many of
    them later borrowed by other shells as well, and many unique concepts.
    It is a highly optimized and speedy shell. You can implement things in
    ksh where you would have escaped to awk (or other tools) with older or
    less powerful shells (you mentioned FP-arithmetic as an example).
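
    A quick taste of the floating-point side (a sketch; assumes a real
    ksh93, whose arithmetic also knows the C math functions):

        typeset -F 4 ratio               # long float, 4 fractional digits
        (( ratio = 22.0 / 7 ))
        print "$ratio"                   # 3.1428
        print "$(( sqrt(2) * sin(1) ))"  # no bc, no awk needed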

    But the basic feature of looping over large data sets is generally not
    fast in shells; here awk (and other tools) typically have advantages.
    Awk code is typically also much clearer. You mentioned associative
    arrays as an example of a ksh feature; all that cryptic shell syntax
    is not something I'd call good syntax - and I say that although I have
    been used to that syntax for decades.
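
    A small side-by-side flavour of what I mean (a sketch; the shell half
    assumes ksh93 or bash 4+, and words.txt is just a stand-in input file
    with one word per line):

        # shell: count word occurrences, then print one count
        typeset -A cnt                   # declaration is required
        while read -r word; do
            (( cnt[$word]++ ))
        done < words.txt
        print "${cnt[foo]}"              # brace/quoting rules apply

        # awk: the same idea; no declarations, no sigils
        awk '{ cnt[$1]++ } END { print cnt["foo"] }' words.txt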

    So your final question - "How feasible is it to stay not within awk
    (the idea from the start of this post), but rather stay within ksh93
    solely?" - I would answer (if not already obvious from what I wrote):
    don't be dogmatic. Stay in ksh93 if it fits, stay in awk if it fits,
    combine both (or add other Unix tools if necessary) if that fits best.

    So much for the feature-based perspective, but there's more that may
    influence your decision; e.g. standards. Do you need your programs to
    be widely portable? Then use the POSIX subset (of shell, awk, and the
    Unix tools).

    Personally I use ksh93 (typically with its non-standard extensions),
    I use mainly standard awk features (that are already powerful enough)
    but I also don't hesitate to use GNU awk features if I need them. And
    of course I also combine both regularly.

    If I had to draw a line between them I'd say (as a rule of thumb): if
    doing data processing my emphasis is on awk, if doing process-based
    automation my emphasis is on ksh93.

    Janis

    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From John McCue@jmccue@magnetar.hsd1.ma.comcast.net to comp.lang.awk,comp.unix.shell on Fri Feb 18 15:35:02 2022
    From Newsgroup: comp.lang.awk

    trimmed followups to comp.lang.awk

    In comp.unix.shell Kenny McCormack <gazelle@shell.xmission.com> wrote:
    <snip>

    But what I'm really wondering about is whether it makes sense to switch to ksh from bash for my day-to-day shell scripting. I'm pretty familiar and comfortable with bash at this point; I'd rather not switch unless there was
    a good reason.

    Depends, will you ever write scripts for any of the BSDs
    or a proprietary UNIX ?

    If so you may want to use ksh. Otherwise it does not
    matter.


    The system that I am typing on right now has /usr/bin/ksh as a link to /usr/bin/ksh2020 - which is, presumably, 27 years better than ksh93 (if you see what I mean...). I'd like to hear from people knowledgeable on both shells as what kind of advantages ksh has over bash.

    YMMV on that one; OpenBSD defaults to a ksh88 clone,
    AIX defaults to ksh88. NetBSD seems to default to a ksh
    look-alike and the same now seems to be true for FreeBSD.

    The one that I am aware of is floating point math handled natively in the shell. This is a significant thing, and one I often miss in bash. I see
    no particular reason why it could not be implemented in bash. In fact,
    I've worked up a system where I run "bc" as a co-process that works pretty well. Note that if you Google "floating point in bash", you'll get lots of suggestions to use "bc", but none (AFAICT) mention running it as a co-process. That seems to be my innovation.

    I use bc(1) when I need floating point math since I write
    shell scripts assuming ksh88.
    --
    [t]csh(1) - "An elegant shell, for a more... civilized age."
    - Paraphrasing Star Wars
    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Axel Reichert@mail@axel-reichert.de to comp.lang.awk on Fri Feb 18 23:34:47 2022
    From Newsgroup: comp.lang.awk

    Ed Morton <mortonspam@gmail.com> writes:

    A shell is an environment from which to manipulate (create/destroy)
    files and processes with a language to sequence calls to tools. It is
    not a tool to manipulate text. Awk is a tool to manipulate text. In
    fact awk is the tool that the people who invented shell also invented
    for shell to call to do general purpose text manipulation.

    So no, no matter what language constructs it has, a shell is not
    designed to replace awk and vice-versa.

    Thanks, that is crystal clear.

    https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice

    Great read, thanks again.

    So it seems I should not overdo minimalism and rather settle for a
    healthy troika of shell (ksh93 probably does not have that many
    benefits over bash that would be relevant for me), text manipulation
    (perl does not have that many benefits over awk that would be relevant
    for me), and editor (emacs has many benefits over vi that are relevant
    for me). (-:

    Axel
    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Axel Reichert@mail@axel-reichert.de to comp.lang.awk on Fri Feb 18 23:49:24 2022
    From Newsgroup: comp.lang.awk

    Janis Papanagnou <janis_papanagnou@hotmail.com> writes:

    The point of my previous posts was to be suspicious when starting to
    connect a lot of the primitive tools - you mentioned them above - by pipelines especially in _addition_ to one (or more) awk instances in
    that same pipeline.

    I do get your point, but to be honest I struggle with converting a
    pipeline of primitive tools to ONE awk script. I managed to replace
    several pipeline parts with a nice awk one-liner individually (even if
    it still feels "unnatural" to me), but that results of course in the
    same number of processes and will not reap the dataflow benefits enabled
    by staying within one awk incarnation.

    Perhaps I will present an example of my struggles over the next days
    and can learn something here about how to integrate this into one
    script.

    [tac, tail -n]

    Often these tools are only used at front or rear of a pipeline, so
    easily added per pipe when necessary. In cases where you store the
    input data anyway in awk arrays you can of course also omit these
    tools.

    Sure.

    You mentioned the associative arrays as an example of a ksh feature;
    all that cryptic shell syntax here is not something I'd think is good
    syntax

    Good to know, and one advantage less over bash.

    Do you need your programs to be widely portable?

    No ...

    Then use the POSIX subset

    ... but I have adopted this mindset already.

    if doing data processing I have an emphasis on awk, if doing process
    based automation my emphasis is on ksh93.

    This summary nicely matches with Ed's. Thanks!

    Axel
    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Janis Papanagnou@janis_papanagnou@hotmail.com to comp.lang.awk on Sat Feb 19 21:56:14 2022
    From Newsgroup: comp.lang.awk

    On 18.02.2022 23:49, Axel Reichert wrote:
    Janis Papanagnou <janis_papanagnou@hotmail.com> writes:

    The point of my previous posts was to be suspicious when starting to
    connect a lot of the primitive tools - you mentioned them above - by
    pipelines especially in _addition_ to one (or more) awk instances in
    that same pipeline.

    I do get your point, but to be honest I struggle with converting a
    pipeline of primitive tools to ONE awk script.

    Ideally you don't "convert" a pipeline, but just use the appropriate
    tool to formulate the task.

    While some conversions can be done literally, knowing the (awk) tool
    and having built a feel for it would be preferable.

    I managed to replace
    several pipeline parts with a nice awk one-liner individually (even if
    it still feels "unnatural" to me), but that results of course in the
    same number of processes and will not reap the dataflow benefits enabled
    by staying within one awk incarnation.

    Let's start with the insight that simple specialized commands can be
    expressed with awk syntax - note that there may be subtle differences
    in one case or another but that doesn't change its general usability
    in practice...

    cat                  { print $0 }   or just 1
    grep pattern         /pattern/
    head -n 5            NR <= 5
    cut -f1              { print $1 }
    tr a-z A-Z           { print toupper($0) }
    sed 's/hi/ho/g'      gsub(/hi/,"ho")
    wc -l                END { print NR }

    (These are just a few examples. But it shows that you can use awk in
    simple ways to achieve elementary standard tasks, and it also shows
    that you use a primitive and coherent standard syntax as opposed to
    many individual commands, options, and many tool-specific syntaxes.)

    Let's continue with a simple composition of basic functions, say,

    cat infile | tail -n +2 | grep 'Error' | cut -f1 | tr a-z A-Z

    awk 'NR>=2 && /Error/ { print toupper($1) }' infile

    Finally get your task further extended by asking questions how to solve
    these tasks in one way (pipes) or another (awk)...

    * What do you do if you want to process multiple input files?
    (seems easy since we used 'cat' already, but...)
    * You want to skip the header line (tail -n +2) for each file!
    (... - oh!)
    (...maybe start using shell loops and handle files individually?
    okay, I feel it's going to get messy!)
    * What if you want to match individual fields?
    (darn!)

    And while you think about a (proliferating) shell script, we change NR
    to FNR, add $3~, and extend the file list...

    awk 'FNR>=2 && $3~/Error/ { print toupper($1) }' infile1 ... infileN

    This is just an ad hoc example of non-trivial yet simple requirements
    and also of how requirements typically evolve, e.g. once you notice that
    the text "Error" can also appear in other fields and should not trigger
    the match; a case that may not be detected until the software goes into
    production use, but a fix will be necessary - and that can be easy or
    cumbersome.


    You mentioned the associative arrays as an example of a ksh feature;
    all that cryptic shell syntax here is not something I'd think is good
    syntax

    Good to know, and one advantage less over bash.

    Erm... - it is shell syntax (bash, ksh, ...) that I consider cryptic
    (compared to awk). Bash and ksh don't differ here; awk has a clearer
    syntax.

    Janis

    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Ben Bacarisse@ben.usenet@bsb.me.uk to comp.lang.awk on Sat Feb 19 22:42:38 2022
    From Newsgroup: comp.lang.awk

    Janis Papanagnou <janis_papanagnou@hotmail.com> writes:

    Let's start with the insight that simple specialized commands can be expressed with awk syntax - note that there may be subtle differences
    in one case or another but that doesn't change its general usability
    in practice...

    cat                  { print $0 }   or just 1
    grep pattern         /pattern/
    head -n 5            NR <= 5
    cut -f1              { print $1 }
    tr a-z A-Z           { print toupper($0) }
    sed 's/hi/ho/g'      gsub(/hi/,"ho")
    wc -l                END { print NR }

    (These are just a few examples. But it shows that you can use awk in
    simple ways to achieve elementary standard tasks, and it also shows
    that you use a primitive and coherent standard syntax as opposed to
    many individual commands, options, and many tool-specific syntaxes.)

    These examples seem to be chosen so as to be particularly easy to write
    in AWK. You could have chosen

    cat -A
    grep -o pattern
    head -n -4
    cut -f5-
    tr n-za-mN-ZA-M a-zA-Z
    sed '/:/s/hi/Hi/'
    wc -c

    And you've shortened the AWK in a few places: sed 's/hi/ho/g' is really

    awk '{gsub(/hi/,"ho");print}'

    and cut -f1 is really

    awk 'BEGIN{FS="\t"}{print $1}'

    (or awk -F$'\t' '{print $1}' if you don't mind a non-standard shell
    construct).

    Let's continue with a simple composition of basic functions, say,

    cat infile | tail -n +2 | grep 'Error' | cut -f1 | tr a-z A-Z

    awk 'NR>=2 && /Error/ { print toupper($1) }' infile

    It could be argued that your presentation is again a little skewed! Why
    the UUOC -- it just makes the pipe look longer? Why pick an example
    where the matching happens before the transform? And it's handy that
    tail -n +2 is easier than tail -2 in AWK. For example

    <infile sed 's/#.*//' | cut -f3 | grep . | tail -5

    is more fiddly in AWK.
    --
    Ben.
    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Janis Papanagnou@janis_papanagnou@hotmail.com to comp.lang.awk on Sun Feb 20 03:44:14 2022
    From Newsgroup: comp.lang.awk

    On 19.02.2022 23:42, Ben Bacarisse wrote:
    Janis Papanagnou <janis_papanagnou@hotmail.com> writes:

    Let's start with the insight that simple specialized commands can be
    expressed with awk syntax - note that there may be subtle differences
    in one case or another but that doesn't change its general usability
    in practice...

    cat                  { print $0 }   or just 1
    grep pattern         /pattern/
    head -n 5            NR <= 5
    cut -f1              { print $1 }
    tr a-z A-Z           { print toupper($0) }
    sed 's/hi/ho/g'      gsub(/hi/,"ho")
    wc -l                END { print NR }

    (These are just a few examples. But it shows that you can use awk in
    simple ways to achieve elementary standard tasks, and it also shows
    that you use a primitive and coherent standard syntax as opposed to
    many individual commands, options, and many tool-specific syntaxes.)

    These examples seem to be chosen so as to be particularly easy to write
    in AWK.

    Yes, exactly; I've taken just a couple of examples, and I've chosen them
    to address a few different (orthogonal) areas that awk can cover, so
    that the combination of them (as building blocks) can easily be seen.

    We can also easily extend them, but my goal was not to provide a more
    or less complete pool or encyclopedia of all possible patterns, rather
    to show the principle. In the posts of another thread I've repeatedly
    pointed out that the combinations of especially the many simple commands
    and applications of the commands are where it's trivial to write awk
    code instead of composing inflexible pipes. (That doesn't exclude that
    the domain of what can be migrated from pipes to awk is much greater.
    And that still doesn't touch the fact that awk also covers functional
    areas where command-pipelines can't even join the game.)

    You could have chosen

    cat -A
    grep -o pattern
    head -n -4
    cut -f5-
    tr n-za-mN-ZA-M a-zA-Z
    sed '/:/s/hi/Hi/'
    wc -c

    Yes, as I had chosen a subset for the ease of understanding the main
    point. I could also have started to choose all sorts of tool-option
    combinations, but where would that lead? I had already pointed out
    that some tool-option combinations (while implementable without much
    effort) are not per se candidates for awk; remember my mention of
    'tail -N' in a previous post.

    (BTW: Some examples in the list are badly chosen, their use arguable;
        cat -A             for scripting? In addition it's broken; e.g. try it with UTF-8/DE
        head -n -4         negative numbers are non-standard, you cannot rely on this
        cut                despite being "specialized", cannot handle sophisticated delimiters
        tr ...             you are using tr for rot13? You have various options with awk
        sed '/:/s/hi/Hi/'  natively supported by awk (pattern/sub)
        wc -c              supported in awk for strings and thus (RS="^$") also for files
    )
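
    (To spell out that last one as a sketch: RS="^$" is a regexp that can
    never match a record separator, so e.g. GNU awk slurps the whole file
    into $0; POSIX leaves a multi-character RS unspecified, so this is not
    portable:

        gawk 'BEGIN { RS = "^$" } END { print length($0) }' file

    Note it counts characters, so it equals wc -c only in single-byte
    locales.)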


    And you've shortened the AWK in a few places: sed 's/hi/ho/g' is really

    awk '{gsub(/hi/,"ho");print}'

    If you have substitutions only on a subset of lines that's correct.
    Then you can write it the way you did, or awk 'gsub(/hi/,"ho")||1'
    But I'm not here for code-golf; the point was functionality provided
    by awk.


    and cut -f1 is really

    awk 'BEGIN{FS="\t"}{print $1}'

    I already spoke of subtle differences, specifically with cut's field
    separator in mind. But that is rather a point against the inflexibility
    of the 'cut' command which, despite being a specialized command, is not
    able to accept anything other than a single character as delimiter;
    not even a constant string, let alone a regular expression (as awk
    does).
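
    A one-line illustration of the difference (a trivial sketch):

        # cut -d accepts only a single character; awk's FS may be a regexp
        printf 'a , b ,c\n' | awk -F' *, *' '{ print $2 }'    # -> b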


    (or awk -F$'\t' '{print $1}' if you don't mind a non-standard shell construct).

    Let's continue with a simple composition of basic functions, say,

    cat infile | tail -n +2 | grep 'Error' | cut -f1 | tr a-z A-Z

    awk 'NR>=2 && /Error/ { print toupper($1) }' infile

    It could be argued that your presentation is again a little skewed!

    From what you wrote thus far it's obvious that you had not understood
    what I was explaining here and aiming at. (Re-read Axel's post where
    he said what makes problems to him; I tried to address exactly that.)

    Why the UUOC -- it just makes the pipe look longer?

    Because I wanted to keep it short, at three steps. I could have made it
    longer by starting without 'cat', then adding the multiple-files input
    and introducing cat,[*] just to show that it will not solve the issue if
    you want to remove, as shown in the example, all initial header lines.

    In other words (and again): it was not intended as code-golf. It was
    about the restrictions of a pipeline construct that evolves through
    refinements of the requirements. And it was to show the other poster
    how one may detect pipeline code that is a candidate for refactoring
    and how to possibly approach the task; how awk scripts evolve and how
    pipeline scripts evolve, and where you will end up in a dead end with
    pipes and where you can continue to refine your code with awk.

    Why pick an example where the matching happens before the transform?

    (I explained the goal of the post already.)

    And it's handy that tail -n +2 is easier than tail -2 in AWK.

    (I explained that already. And that tail -2 is less straightforward,
    though still simple, to implement in awk is something that I pointed
    out in my posts. - But anyway, skipping header lines is a common task
    that I regularly use in scripting, certainly much more often in my
    programs than tail -N, which I mainly use interactively. YMMV.)

    For example

    <infile sed 's/#.*//' | cut -f3 | grep . | tail -5

    is more fiddly in AWK.

    And I am sure both of us can construct dozens of such examples where
    awk will need more characters to implement (more "fiddling"). That
    doesn't change the fact that you obviously still missed the point;
    the point of the explanations in this post, and maybe also the fact
    of the inherent restrictions of pipeline-expressions (if compared to
    awk, specifically).

    Janis

    [*] And no; it is not an option to add the multiple-files list to the
    subsequent tail command, because tail then produces other output that
    you don't want. And no, you cannot rely on tail -q to suppress this
    output, because it's non-standard.

    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Axel Reichert@mail@axel-reichert.de to comp.lang.awk on Sun Feb 20 20:35:37 2022
    From Newsgroup: comp.lang.awk

    Janis Papanagnou <janis_papanagnou@hotmail.com> writes:

    On 18.02.2022 23:49, Axel Reichert wrote:
    Janis Papanagnou <janis_papanagnou@hotmail.com> writes:

    Ideally you don't "convert" a pipeline, but just use the appropriate
    tool to formulate the task.

    Sure, but I have just drafted myself into a boot camp.

    Let's start with the insight that simple specialized commands can be expressed with awk syntax

    This is what I have done for some of my typical pipeline idioms.

    cat infile | tail -n +2 | grep 'Error' | cut -f1 | tr a-z A-Z

    awk 'NR>=2 && /Error/ { print toupper($1) }' infile

    This one was easy.

    And while you think about a (proliferating) shell script, we change NR
    to FNR, add $3~, and extend the file list...

    You are doing good advertising. (-:

    Now for one example where the pipeline solution was a matter of minutes
    without much thinking or trial and error, but where I struggled with a conversion to awk (remember, boot camp):

    A 4-digit number is called "bellied" if the sum of the inner two digits
    is larger than the sum of the outer two digits. How long is the longest uninterrupted sequence of bellied numbers? (This was a bonus home work
    for 12-year olds. Extra points to be earned if the start of this
    sequence is output as well. No computer to be used.)

    The pipeline solution (there are others, shorter, more elegant, but that
    is how it flowed out of my head):

    seq 1000 9999 \
    | sed 's/\(.\)/\1 /g' \
    | awk '{if ($1+$4 < $2+$3) {print 1} else {print 0}}' \
    | tr -d '\n' \
    | tr -s 0 '\n' \
    | sort \
    | tail -n 1 \
    | tr -d '\n' \
    | wc -c

    Get all 4-digit numbers and separate the digits with spaces. Print 1 for bellied numbers, 0 otherwise. Transform into a single line. Squeeze the
    zeros (they serve only to mark the start of a new bellied sequence) and
    make a linebreak from them. After sorting, the last line will contain
    a string of "1"s from the longest bellied sequence. Get rid of the
    final newline and count characters.

    Now some awk replacements for the individual "tubes" of the pipeline:

    - awk 'BEGIN {for (i=1000; i<10000; i++) print i}'
    - awk -F "" '{$1=$1; print}' # non-standard (?) empty FS
    - awk '{if ($1+$4 < $2+$3) {print 1} else {print 0}}'
    - awk 'BEGIN {RS=""; ORS=""} gsub(/\n/, "")'
    - awk 'gsub(/0+/, "\n")'
    - awk '{print | "sort"}'
    But this is kind of cheating. I could of course take the maximum
    number found so far. But that is a different solution, and "sort" is
    a staple part of my typical pipelines.
    - awk 'END {print}'
    - No need for "tr -d '\n' here
    - awk '{print length($0)}'

    Now that the tubes are ready, I would be very interested in the "plumbing"/"welding" together. It took me already quite some time to come
    up with the BEGIN in the initial for loop and I have no idea how to deal
    with several BEGIN addresses or the mangling of the field/record
    separators in one single file bellied.awk, to be run with "awk -f". That
    is what I am aiming for. My best so far:

    awk 'BEGIN {for (i=1000; i<10000; i++) {print i}}' \
    | awk '{FS=""; $1=$1; ORS=""; if ($1+$4<$2+$3) {print 1} else {print 0}}' \
    | awk 'gsub(/0+/, "\n") {print | "sort"}' \
    | awk 'END {print length($0)}'

    Leaving pipes is hard to do ...

    Best regards

    Axel

    P. S.: With apologies to Kaz, who has read this problem some time back
    in comp.lang.lisp, with similarly stupid Lisp questions from me. (-:

    P. P. S.: I understand that an idiomatic awk solution will likely be
    based more on numbers than on strings.
    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From gazelle@gazelle@shell.xmission.com (Kenny McCormack) to comp.lang.awk on Sun Feb 20 21:17:24 2022
    From Newsgroup: comp.lang.awk

    In article <87v8x9baba.fsf_-_@axel-reichert.de>,
    Axel Reichert <mail@axel-reichert.de> wrote:
    ...
    A 4-digit number is called "bellied" if the sum of the inner two digits
    is larger than the sum of the outer two digits. How long is the longest uninterrupted sequence of bellied numbers? (This was a bonus home work
    for 12-year olds. Extra points to be earned if the start of this
    sequence is output as well. No computer to be used.)

    % gawk4 'BEGIN { FIELDWIDTHS="1 1 1 1";for (i=1000; i<10000; i++) { $0=i; if ($1+$4 < $2+$3) A[x] = A[x] " " i;else x++ } for (i in A) if ((l = length(A[i])) > max) { max = l;idx = i } print max,idx,A[idx] }'
    400 283 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
    %

    How'd I do?
    --
    Joni Ernst (2014): Obama should be impeached because 2 people have died of Ebola.
    Joni Ernst (2020): Trump is doing great things, because only 65,000 times as many people have died of COVID-19.

    Josef Stalin (1947): When one person dies, it is a tragedy; when a million die, it is merely statistics.
    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Axel Reichert@mail@axel-reichert.de to comp.lang.awk on Sun Feb 20 22:48:08 2022
    From Newsgroup: comp.lang.awk

    gazelle@shell.xmission.com (Kenny McCormack) writes:

    In article <87v8x9baba.fsf_-_@axel-reichert.de>,
    Axel Reichert <mail@axel-reichert.de> wrote:
    ...
    A 4-digit number is called "bellied" if the sum of the inner two digits
    is larger than the sum of the outer two digits. How long is the longest uninterrupted sequence of bellied numbers? (This was a bonus home work
    for 12-year olds. Extra points to be earned if the start of this
    sequence is output as well. No computer to be used.)

    % gawk4 'BEGIN { FIELDWIDTHS="1 1 1 1";for (i=1000; i<10000; i++) {
    $0=i; if ($1+$4 < $2+$3) A[x] = A[x] " " i;else x++ } for (i in A) if
    ((l = length(A[i])) > max) { max = l;idx = i } print max,idx,A[idx] }'
    400 283 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931
    1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945
    1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959
    1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973
    1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987
    1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
    %

    How'd I do?

    Well, the length of the longest sequence (80, which is indeed from 1920
    to 1999) is nowhere in your output. No, substrings do not count. (-;

    I have no idea where the 400 and 283 come from. I did not know about FIELDWIDTHS, an interesting technique for the "split".

    Overall, yours is more of a number-based solution than my string-based pipeline. It is clear that awk is pretty strong here with the many
    similarities to C, but I was asking more about the capabilities related
    to string processing, even though such a solution certainly is an
    abuse/stretch of the problem at hand (which is of course mathematical).

    Best regards

    Axel
    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Ed Morton@mortonspam@gmail.com to comp.lang.awk on Sun Feb 20 16:26:49 2022
    From Newsgroup: comp.lang.awk

    On 2/20/2022 1:35 PM, Axel Reichert wrote:
    Janis Papanagnou <janis_papanagnou@hotmail.com> writes:

    On 18.02.2022 23:49, Axel Reichert wrote:
    Janis Papanagnou <janis_papanagnou@hotmail.com> writes:

    Ideally you don't "convert" a pipeline, but just use the appropriate
    tool to formulate the task.

    Sure, but I have just drafted myself into a boot camp.

    Let's start with the insight that simple specialized commands can be
    expressed with awk syntax

    This is what I have done for some of my typical pipeline idioms.

    cat infile | tail -n +2 | grep 'Error' | cut -f1 | tr a-z A-Z

    awk 'NR>=2 && /Error/ { print toupper($1) }' infile

    This one was easy.

    And while you think about a (proliferating) shell script, we change NR
    to FNR, add $3~, and extend the file list...

    You are doing good advertising. (-:

    Now for one example where the pipeline solution was a matter of minutes without much thinking or trial and error, but where I struggled with a conversion to awk (remember, boot camp):

    A 4-digit number is called "bellied" if the sum of the inner two digits
    is larger than the sum of the outer two digits. How long is the longest uninterrupted sequence of bellied numbers?

    $ cat tst.awk
    BEGIN { FS="" }
    ($2+$3) > ($1+$4) {
        if ( ++cnt > max ) {
            max = cnt
        }
        next
    }
    { cnt = 0 }
    END { print max+0 }

    $ seq 1000 9999 | awk -f tst.awk
    80

    The above relies on you using an awk version like GNU awk that given a
    null FS splits the string into characters. With any other awk you'd
    delete the BEGIN section and replace

    ($2+$3) > ($1+$4) {

    with

    (substr($0,2,1)+substr($0,3,1)) > (substr($0,1,1)+substr($0,4,4))

    If that's not what you need then please clarify your requirements and
    tell us the expected output.

    Ed.

    (This was a bonus home work
    for 12-year olds. Extra points to be earned if the start of this
    sequence is output as well. No computer to be used.)

    The pipeline solution (there are others, shorter, more elegant, but that
    is how it flowed out of my head):

    seq 1000 9999 \
    | sed 's/\(.\)/\1 /g' \
    | awk '{if ($1+$4 < $2+$3) {print 1} else {print 0}}' \
    | tr -d '\n' \
    | tr -s 0 '\n' \
    | sort \
    | tail -n 1 \
    | tr -d '\n' \
    | wc -c

    Get all 4-digit numbers and separate the digits with spaces. Print 1 for bellied numbers, 0 otherwise. Transform into a single line. Squeeze the
    zeros (they serve only to mark the start of a new bellied sequence) and
    make a linebreak from them. After sorting, the last line will contain
    a string of "1"s from the longest bellied sequence. Get rid of the
    final newline and count characters.

    Now some awk replacements for the individual "tubes" of the pipeline:

    - awk 'BEGIN {for (i=1000; i<10000; i++) print i}'
    - awk -F "" '{$1=$1; print}' # non-standard (?) empty FS
    - awk '{if ($1+$4 < $2+$3) {print 1} else {print 0}}'
    - awk 'BEGIN {RS=""; ORS=""} gsub(/\n/, "")'
    - awk 'gsub(/0+/, "\n")'
    - awk '{print | "sort"}'
    But this is kind of cheating. I could of course take the maximum
    number found so far. But that is a different solution, and "sort" is
    a staple part of my typical pipelines.
    - awk 'END {print}'
    - No need for "tr -d '\n' here
    - awk '{print length($0)}'

    Now that the tubes are ready, I would be very interested in the "plumbing"/"welding" together. It took me already quite some time to come
    up with the BEGIN in the initial for loop and I have no idea how to deal
    with several BEGIN addresses or the mangling of the field/record
    separators in one single file bellied.awk, to be run with "awk -f". That
    is what I am aiming for. My best so far:

    awk 'BEGIN {for (i=1000; i<10000; i++) {print i}}' \
    | awk '{FS=""; $1=$1; ORS=""; if ($1+$4<$2+$3) {print 1} else {print 0}}' \
    | awk 'gsub(/0+/, "\n") {print | "sort"}' \
    | awk 'END {print length($0)}'

    Leaving pipes is hard to do ...

    Best regards

    Axel

    P. S.: With apologies to Kaz, who has read this problem some time back
    in comp.lang.lisp, with similarly stupid Lisp questions from me. (-:

    P. P. S.: I understand that an idiomatic awk solution will likely be
    based more on numbers than on strings.

    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Ed Morton@mortonspam@gmail.com to comp.lang.awk on Sun Feb 20 16:32:30 2022
    From Newsgroup: comp.lang.awk

    On 2/20/2022 4:26 PM, Ed Morton wrote:
    <snip>
    The above relies on you using an awk version like GNU awk that given a
    null FS splits the string into characters. With any other awk you'd
    delete the BEGIN section and replace

        ($2+$3) > ($1+$4) {

    with

        (substr($0,2,1)+substr($0,3,1)) > (substr($0,1,1)+substr($0,4,4))

    That should of course be (1 instead of 4 at the end):

    (substr($0,2,1)+substr($0,3,1)) > (substr($0,1,1)+substr($0,4,1))

    Regards,

    Ed.
    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From gazelle@gazelle@shell.xmission.com (Kenny McCormack) to comp.lang.awk on Sun Feb 20 22:47:01 2022
    From Newsgroup: comp.lang.awk

    In article <87r17xb46f.fsf@axel-reichert.de>,
    Axel Reichert <mail@axel-reichert.de> wrote:
    ...
    Well, the length of the longest sequence (80, which is indeed from 1920
    to 1999) is nowhere in your output. No, substrings do not count. (-;

    Maybe I didn't read exactly what question was to be answered, but the point
    is that it does identify the longest string.

    I have no idea where the 400 and 283 come from. I did not know about FIELDWIDTHS, an interesting technique for the "split".

    400 is the length of the string. I think you can do the math of dividing
    that by 5 to get the result you seek (80).

    283 is the index in the array. It is (sort of) a line number.
    --
    The book "1984" used to be a cautionary tale;
    Now it is a "how-to" manual.
    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Ed Morton@mortonspam@gmail.com to comp.lang.awk on Sun Feb 20 16:54:48 2022
    From Newsgroup: comp.lang.awk

    On 2/20/2022 1:35 PM, Axel Reichert wrote:
    Janis Papanagnou <janis_papanagnou@hotmail.com> writes:

    On 18.02.2022 23:49, Axel Reichert wrote:
    Janis Papanagnou <janis_papanagnou@hotmail.com> writes:

    Ideally you don't "convert" a pipeline, but just use the appropriate
    tool to formulate the task.

    Sure, but I have just drafted myself into a boot camp.

    Let's start with the insight that simple specialized commands can be
    expressed with awk syntax

    This is what I have done for some of my typical pipeline idioms.

    cat infile | tail -n +2 | grep 'Error' | cut -f1 | tr a-z A-Z

    awk 'NR>=2 && /Error/ { print toupper($1) }' infile

    This one was easy.

    And while you think about a (proliferating) shell script, we change NR
    to FNR, add $3~, and extend the file list...

    You are doing good advertising. (-:

    Now for one example where the pipeline solution was a matter of minutes without much thinking or trial and error, but where I struggled with a conversion to awk (remember, boot camp):

    A 4-digit number is called "bellied" if the sum of the inner two digits
    is larger than the sum of the outer two digits. How long is the longest uninterrupted sequence of bellied numbers? (This was a bonus home work
    for 12-year olds. Extra points to be earned if the start of this
    sequence is output as well. No computer to be used.)

    The pipeline solution (there are others, shorter, more elegant, but that
    is how it flowed out of my head):

    seq 1000 9999 \
    | sed 's/\(.\)/\1 /g' \
    | awk '{if ($1+$4 < $2+$3) {print 1} else {print 0}}' \
    | tr -d '\n' \
    | tr -s 0 '\n' \
    | sort \
    | tail -n 1 \
    | tr -d '\n' \
    | wc -c

    Get all 4-digit numbers and separate the digits with spaces. Print 1 for bellied numbers, 0 otherwise. Transform into a single line. Squeeze the
    zeros (they serve only to mark the start of a new bellied sequence) and
    make a linebreak from them. After sorting, the last line will contain
    a string of "1"s from the longest bellied sequence. Get rid of the
    final newline and count characters.

    I'm not saying this is the right way to tackle this problem in awk (see
    my previous script for that), but it sounds like you're looking for a
    comparison of awk string constructs to your above command pipeline, and
    the most similar but still fairly realistic all-awk implementation
    would probably be this, using GNU awk for "sorted_in":

    $ cat tst.awk
    {
        gsub(/./,"& ")
        str = str ( ($2+$3) > ($1+$4) ? 1 : FS )
    }
    END {
        split(str,strs)
        PROCINFO["sorted_in"] = "@val_str_desc"
        for (idx in strs) {
            print length(strs[idx])
            break
        }
    }

    $ seq 1000 9999 | awk -f tst.awk
    80

    Again, if that's not what you're looking for, do tell...

    Ed.

    Now some awk replacements for the individual "tubes" of the pipeline:

    - awk 'BEGIN {for (i=1000; i<10000; i++) print i}'
    - awk -F "" '{$1=$1; print}' # non-standard (?) empty FS
    - awk '{if ($1+$4 < $2+$3) {print 1} else {print 0}}'
    - awk 'BEGIN {RS=""; ORS=""} gsub(/\n/, "")'
    - awk 'gsub(/0+/, "\n")'
    - awk '{print | "sort"}'
    But this is kind of cheating. I could of course take the maximum
    number found so far. But that is a different solution, and "sort" is
    a staple part of my typical pipelines.
    - awk 'END {print}'
    - No need for "tr -d '\n' here
    - awk '{print length($0)}'

    Now that the tubes are ready, I would be very interested in the "plumbing"/"welding" together. It took me already quite some time to come
    up with the BEGIN in the initial for loop and I have no idea how to deal
    with several BEGIN addresses or the mangling of the field/record
    separators in one single file bellied.awk, to be run with "awk -f". That
    is what I am aiming for. My best so far:

    awk 'BEGIN {for (i=1000; i<10000; i++) {print i}}' \
    | awk '{FS=""; $1=$1; ORS=""; if ($1+$4<$2+$3) {print 1} else {print 0}}' \
    | awk 'gsub(/0+/, "\n") {print | "sort"}' \
    | awk 'END {print length($0)}'

    Leaving pipes is hard to do ...

    Best regards

    Axel

    P. S.: With apologies to Kaz, who has read this problem some time back
    in comp.lang.lisp, with similarly stupid Lisp questions from me. (-:

    P. P. S.: I understand that an idiomatic awk solution will likely be
    based more on numbers than on strings.

    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Axel Reichert@mail@axel-reichert.de to comp.lang.awk on Tue Feb 22 18:57:39 2022
    From Newsgroup: comp.lang.awk

    Ed Morton <mortonspam@gmail.com> writes:

    it sounds like you're looking for a comparison of awk string
    constructs to your above command pipeline

    Yes, exactly. Sorry for not making this clearer. Overall it looks like
    the "numerical" solution is simpler in awk, while the "string" solution
    is easier in a pipeline.

    gsub(/./,"& ")

    That's a nice idiom for splitting! And from my understanding it is also
    more portable than FS="".

    Thanks, I learned quite a bit here!

    Axel
    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Axel Reichert@mail@axel-reichert.de to comp.lang.awk on Tue Feb 22 18:59:06 2022
    From Newsgroup: comp.lang.awk

    gazelle@shell.xmission.com (Kenny McCormack) writes:

    400 is the length of the string. I think you can do the math of dividing that by 5 to get the result you seek (80).

    Ah, sure, I missed that.

    Thanks!

    Axel
    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Ed Morton@mortonspam@gmail.com to comp.lang.awk on Tue Feb 22 12:08:20 2022
    From Newsgroup: comp.lang.awk

    On 2/22/2022 11:57 AM, Axel Reichert wrote:
    Ed Morton <mortonspam@gmail.com> writes:

    it sounds like you're looking for a comparison of awk string
    constructs to your above command pipeline

    Yes, exactly. Sorry for not making this clearer. Overall it looks like
    the "numerical" solution is simpler in awk, while the "string" solution
    is easier in a pipeline.

    gsub(/./,"& ")

    That's a nice idiom for splitting! And from my understanding it is also
    more portable than FS="".

    Correct. Field splitting using a null string is undefined behavior per
    POSIX, so some awk variants (e.g. GNU awk) will split the input into
    characters, others will ignore the FS setting, and others still can do
    whatever else they like with it and still be POSIX compliant.

    I only used it though so you could compare that statement to the `sed 's/\(.\)/\1 /g'` in your script, otherwise I'd have used `FS=""` with
    `str = str ( ($2+$3) > ($1+$4) ? 1 : RS )` for the first part of my
    script and then `split(str,strs,RS)` in the END part.
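
    Spelled out, that variant would look roughly like this (an untested
    sketch; I finish with a plain max scan instead of "sorted_in"):

        $ cat tst.awk
        BEGIN { FS = "" }
        { str = str ( ($2+$3) > ($1+$4) ? 1 : RS ) }
        END {
            n = split(str, strs, RS)
            for (i = 1; i <= n; i++)
                if (length(strs[i]) > max)
                    max = length(strs[i])
            print max + 0
        }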


    Thanks, I learned quite a bit here!

    You're welcome.

    Ed.


    Axel

    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Kpop 2GM@jason.cy.kwan@gmail.com to comp.lang.awk on Wed Feb 23 01:18:06 2022
    From Newsgroup: comp.lang.awk


    gsub(/./,"& ")

    That's a nice idiom for splitting! And from my understanding it is also
    more portable than FS="".

    even though the POSIX spec says it's undefined, it's hard to find an awk
    variant that doesn't split it by characters. Even Busybox awk does that
    fine. If you're concerned with portability, perhaps macOS 12.2's
    built-in nawk is more of a concern:

        gecho -e 'one\354\210\267two' \
        | nawk 'BEGIN { FS="" } {print} { print length($0), NF, gsub(".", "&"), match($0, ".$"), split($0, arr, "") }'
        one숷two
        9 7 7 9 9

    Every other awk variant gives you either all 9's in byte mode, or all
    7's in unicode mode (unless you're in gawk -c or gawk -P, where the 2nd
    number becomes a 1), but somehow the latest macOS built-in nawk is in a
    really messed-up state of half-unicode, half-byte mode, resulting in
    absolutely inconsistent behavior depending on how you try to count
    characters.

    length() and split() perform their ops at the byte level. FS splitting
    is at the unicode level. gsub() counts at the unicode level (only
    partially). match(), just doing a straight-up count, is byte-level, but
    even worse:

        match($0, /[^\200-\277]+/)

    The test case has one byte of \210 and another of \267, so the correct
    RSTART:RLENGTH pair is 1:4, yet nawk spits out 1:9, while trying to
    gsub() at the unicode level ALSO results in a major mess:

        gsub(/[\302-\364][\200-\277]+/,"& ")   gives a count of zero

    If you ask me, I won't touch nawk with a 10-ft pole unless absolutely
    necessary.
    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Janis Papanagnou@janis_papanagnou@hotmail.com to comp.lang.awk on Wed Feb 23 13:55:45 2022
    From Newsgroup: comp.lang.awk

    On 20.02.2022 20:35, Axel Reichert wrote:

    awk 'NR>=2 && /Error/ { print toupper($1) }' infile

    This one was easy.

    That was (as explained elsewhere) intended.


    And while you think about a (proliferating) shell script, we change NR
    to FNR, add $3~, and extend the file list...

    You are doing good advertising. (-:

    I don't sell anything. It's just to make obvious what might not be
    obvious. But becoming aware of the possibility of such modifications
    is a key to understanding and overcoming the inherent restrictions of
    other concepts.

    The example, BTW, stems from an awk course, and the audience told me
    that they considered it very enlightening. And that code example had
    been an adaptation of functionality used in practice, only slightly
    adapted for the course; so it was not made up just for this post but
    had (in its original form) an actual application.


    Now for one example where the pipeline solution was a matter of minutes without much thinking or trial and error, but where I struggled with a conversion to awk (remember, boot camp):

    Please note that "conversion" might not be the ideal view for every
    case, one might as well just start the way one is thinking, awk'ish
    (as you did pipe'ish). With the building blocks (pipe'ish: commands [incoherent], awk'ish: language [coherent]) in mind you can take any
    path you prefer. Below you design a task from scratch. You think in
    terms of pipes and so is your solution. But (with standard awk) at
    least the first four lines can easily be done by an awk instance,
    and the last three as well. The 'sort' is "in the way" because sort
    needs and manipulates data in memory (as do the commands I mentioned
    in my previous posts, where I said that while you can also implement
    that in standard awk it's not as simple as with the "more primitive"
    commands, and some popular awks like the GNU one also support sort
    functions, just BTW).


    A 4-digit number is called "bellied" if the sum of the inner two digits
    is larger than the sum of the outer two digits. How long is the longest uninterrupted sequence of bellied numbers? (This was a bonus home work
    for 12-year olds. Extra points to be earned if the start of this
    sequence is output as well. No computer to be used.)

    The pipeline solution (there are others, shorter, more elegant, but that
    is how it flowed out of my head):

    seq 1000 9999 \
    | sed 's/\(.\)/\1 /g' \
    | awk '{if ($1+$4 < $2+$3) {print 1} else {print 0}}' \
    | tr -d '\n' \
    | tr -s 0 '\n' \
    | sort \
    | tail -n 1 \
    | tr -d '\n' \
    | wc -c

    I haven't tried to follow every transformation step but I would have
    applied a few changes (in addition to merging the first few commands
    into one awk instance); I'd simplify the awk command as
        awk '{print ($1+$4 < $2+$3)}'
    and (although tail is optimized in that respect) I'd nonetheless have
    used 'sort -r' and 'head -n 1' instead; the latter has the conceptual
    advantage that you need not "run through the whole data", and the nice
    side effect that you can implement it easily with awk (well, the case
    tail -n N where N=1 is also simple in awk, only N>1 requires memory of
    size N).


    Get all 4-digit numbers and separate the digits with spaces. Print 1 for bellied numbers, 0 otherwise. Transform into a single line. Squeeze the
    zeros (they serve only to mark the start of a new bellied sequence) and
    make a linebreak from them. After sorting, the last line will contain
    a string of "1"s from the longest bellied sequence. Get rid of the
    final newline and count characters.

    My first thought with awk would be along the lines of splitting the
    number with gsub(), building the binary string r=r ($1+$4 < $2+$3),
    splitting the result r by /0+/, and finally finding the maximum. But to
    avoid storing the intermediate data I'd rather optimize the maximum
    search by doing it on the fly, like

    seq 1000 9999 | awk '
    { gsub(/./,"& ") }
    m=($1+$4 < $2+$3) { if (++c>max) max=c }
    !m { c=0 }
    END { print max }
    '

    That way I don't need to think about time or memory issues when sorting
    or searching data sets that are significantly larger.


    Now some awk replacements for the individual "tubes" of the pipeline:

    - awk 'BEGIN {for (i=1000; i<10000; i++) print i}'

    (Ah, okay, you want the numbers also generated by awk. I assumed that
    'seq' was just an example input source that could be in principle any
    process or file as data source. I usually don't hard-code constants,
    but pass them as parameters if not already fed from external sources.)
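
    (A sketch of such a parameterized variant; lo/hi are names I chose:

    awk -v lo=1000 -v hi=9999 'BEGIN { for (i = lo; i <= hi; i++) print i }'

    The -v assignments are standard awk and take effect before BEGIN runs.)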

    For your subsequent individual "translations" I refer to what I said
    above. (Start doing that with the obvious parts 1-4.)

    You want the digits separated:
    gsub(/./,"& ")
    Create a binary string:
    r=r ($1+$4 < $2+$3)
    etc., which may lead to a version like the one I informally described
    above
    seq 1000 9999 | awk '
    { gsub(/./,"& ") }
    { r=r ($1+$4 < $2+$3) }
    { split(r,a,/0+/) }
    END { ...sorting of or maximum in array a... }
    '

    (Note: the three actions in the main part can of course be merged
    under a single pair of curly braces.)
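
    One possible completion of that END part, with a simple maximum
    search over the array (a sketch, not the only way to do it):

    seq 1000 9999 | awk '
    { gsub(/./,"& "); r = r ($1+$4 < $2+$3) }
    END {
        n = split(r, a, /0+/)          # a[] holds the runs of consecutive 1s
        for (i = 1; i <= n; i++)
            if (length(a[i]) > max) max = length(a[i])
        print max
    }'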

    But there are many ways to skin a cat. You can omit the "complex" awk
    sorting code, just create the binary output and pipe the awk part into
    'sort' if you like, or omit sorting and do the maximum determination on
    the fly (as I showed in my code) completely in awk.

    Also the pipe expression can be formulated differently, e.g. with 'uniq'
    (and still using 'sort' [during the code-metamorphosis])

    seq 1000 9999 | awk '
    { gsub(/./,"& ") ; print ($1+$4 < $2+$3) }
    ' | uniq -c | sort -rn |
    awk '$2 { print $1 ; exit }'

    and then, instead of sorting, do the maximum determination, and do it in
    the existing awk code

    seq 1000 9999 | awk '
    { gsub(/./,"& ") ; print ($1+$4 < $2+$3) }
    ' | uniq -c |
    awk '$2 && $1>max { max=$1 } END { print max }'

    You can then also decide to implement 'uniq -c' as part of the first
    awk instance (see the sketch below); then you get a pipe with two
    awks. And if you merge these two, you land somewhere near the
    do-max-on-the-fly-with-a-single-awk version.
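
    A rough sketch of what a 'uniq -c' inside awk could look like (the
    output format is not byte-identical to uniq's, but close enough here):

    awk '
    NR > 1 && $0 != prev { print cnt, prev; cnt = 0 }   # run ended: emit it
    { prev = $0; cnt++ }
    END { if (NR) print cnt, prev }                     # emit the last run
    '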

    - awk -F "" '{$1=$1; print}' # non-standard (?) empty FS
    - awk '{if ($1+$4 < $2+$3) {print 1} else {print 0}}'
    - awk 'BEGIN {RS=""; ORS=""} gsub(/\n/, "")'
    - awk 'gsub(/0+/, "\n")'
    - awk '{print | "sort"}'
    But this is kind of cheating. I could of course take the maximum
    number found so far. But that is a different solution, and "sort" is
    a staple part of my typical pipelines.
    - awk 'END {print}'
    - No need for "tr -d '\n'" here
    - awk '{print length($0)}'

    Now that the tubes are ready, I would be very interested in the
    "plumbing"/"welding" together. It took me already quite some time to
    come up with the BEGIN in the initial for loop, and I have no idea how
    to deal with several BEGIN addresses or the mangling of the
    field/record separators in one single file bellied.awk, to be run with
    "awk -f". That is what I am aiming for. My best so far:

    awk 'BEGIN {for (i=1000; i<10000; i++) {print i}}' \
    | awk '{FS=""; $1=$1; ORS=""; if ($1+$4<$2+$3) {print 1} else {print 0}}' \
    | awk 'gsub(/0+/, "\n") {print | "sort"}' \

    Not bad, but that doesn't work since you want to process the output
    of 'sort'. GNU awk at least supports co-processes, so there you could
    implement that code pattern. But personally I don't find that to be
    elegant since we can also avoid 'sort' in the first place.
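
    For the record, the gawk-only co-process pattern looks roughly like
    this ('|&' and the two-argument close() are GNU awk extensions;
    sorting plain numbers serves only as a demonstration):

    gawk 'BEGIN {
        cmd = "sort"
        for (i = 1000; i < 10000; i++)
            print i |& cmd                 # write side of the co-process
        close(cmd, "to")                   # send EOF so sort can emit output
        while ((cmd |& getline line) > 0)
            print line                     # read side
    }'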

    Janis

    | awk 'END {print length($0)}'

    Leaving pipes is hard to do ...

    Best regards

    Axel

    P. S.: With apologies to Kaz, who has read this problem some time back
    in comp.lang.lisp, with similarly stupid Lisp questions from me. (-:

    P. P. S.: I understand that an idiomatic awk solution will likely be
    based more on numbers than on strings.
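
    Something along those lines, perhaps (an untested sketch; digits
    extracted arithmetically instead of by field splitting):

    awk 'BEGIN {
        for (n = 1000; n <= 9999; n++) {
            d1 = int(n / 1000); d4 = n % 10               # outer digits
            d2 = int(n / 100) % 10; d3 = int(n / 10) % 10 # inner digits
            if (d2 + d3 > d1 + d4) { if (++c > max) max = c }
            else c = 0
        }
        print max
    }'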


    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Axel Reichert@mail@axel-reichert.de to comp.lang.awk on Wed Feb 23 22:44:51 2022
    From Newsgroup: comp.lang.awk

    Janis Papanagnou <janis_papanagnou@hotmail.com> writes:

    On 20.02.2022 20:35, Axel Reichert wrote:

    You are doing good advertising. (-:

    I don't sell anything.

    No worries, just take it as a compliment, like this one: You explain
    very well. Better?

    Please note that "conversion" might not be the ideal view for every
    case

    Sure. But being the son and the brother of (human) language
    translators, methinks learning the vocabulary is the basis. Once you
    start forming your first sentences, you will hopefully get corrected
    by "native" speakers and pick up some more elegant idioms. Which is
    precisely what is happening here. (:

    building blocks (pipe'ish: commands [incoherent], awk'ish: language [coherent])

    I like this analogy. Pipelines have a simplistic grammar and a rich
    but highly irregular vocabulary. awk has a more complex and powerful
    grammar (allowing for more than main clauses), but a more limited
    vocabulary.

    The 'sort' is "in the way"

    That is clear.

    I'd simplify the awk command as
    awk '{print ($1+$4 < $2+$3)}'

    Ah, great, this works in C style. Much better, yes.

    I'd nonetheless have used 'sort -r' and 'head -n 1'

    Understood, great: "sort" does not care about the "-r", because it just
    negates the "predicate", but printing the first line is both cheaper and
    easier for awk. It seems you have learned a thing or two about Big Oh
    notation. (-;

    I tend to neglect this, being used to much larger number crunching
    efforts than the modest 10^4 numbers here. Thanks for reminding me.

    build binary string r=r ($1+$4 < $2+$3)

    Great, another feature that I tend to forget, the default string
    concatenation. Very elegant in combination with the C-style boolean.

    seq 1000 9999 | awk '
    { gsub(/./,"& ") }
    m=($1+$4 < $2+$3) { if (++c>max) max=c }
    !m { c=0 }
    END { print max }
    '

    Great. That kind of stuff I was after and eager to learn!

    (Ah, okay, you want the numbers also generated by awk. I assumed that
    'seq' was just an example input source that could be in principle any
    process or file as data source.

    In general I agree, but here it is one of the conceptual obstacles in
    my way of thinking. So far I believe no post has come up in this
    thread that shows how to "feed" output from the BEGIN block to the
    rest of the awk program. Or did I miss something?

    I usually don't hard-code constants, but pass them as parameters

    Sure, but here we know from the problem description that 1000 and 9999
    are the numbers to go for.

    { split(r,a,/0+/) }

    "split" was also missing in my repertoire ...

    awk '$2 { print $1 ; exit }'

    ... just like "exit".

    Not bad, but that doesn't work since you want to process the output
    of 'sort'

    ... which is why I needed yet another pipe.

    Many thanks again, that helped a lot!

    Axel
    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Axel Reichert@mail@axel-reichert.de to comp.lang.awk on Mon Feb 28 23:51:13 2022
    From Newsgroup: comp.lang.awk

    Kaz Kylheku <480-992-1380@kylheku.com> writes:

    (flow (range 1000 9999)
    (keep-if [chain digits (ap > (+ @2 @3) (+ @1 @4))])
    (partition-if (op neq 1 (- @2 @1)))
    (find-max @1 : len))

    Interesting, how close a Lisp can get to my original pipeline. Thanks!

    Axel
    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Janis Papanagnou@janis_papanagnou@hotmail.com to comp.lang.awk on Tue Mar 1 12:58:11 2022
    From Newsgroup: comp.lang.awk

    On 28.02.2022 23:51, Axel Reichert wrote:
    Kaz Kylheku <480-992-1380@kylheku.com> writes:

    (flow (range 1000 9999)
    (keep-if [chain digits (ap > (+ @2 @3) (+ @1 @4))])
    (partition-if (op neq 1 (- @2 @1)))
    (find-max @1 : len))

    Interesting, how close a Lisp can get to my original pipeline. Thanks!

    I'm unfamiliar with this syntax, but this looks a lot like the general
    approach to tackle the task, the one that I had (informally) described
    with

    seq 1000 9999 | awk '
    { gsub(/./,"& ") }
    { r=r ($1+$4 < $2+$3) }
    { split(r,a,/0+/) }
    END { ...sorting of or maximum in array a... }
    '

    and not so much like this tapeworm (your "original pipeline")

    seq 1000 9999 \
    | sed 's/\(.\)/\1 /g' \
    | awk '{if ($1+$4 < $2+$3) {print 1} else {print 0}}' \
    | tr -d '\n' \
    | tr -s 0 '\n' \
    | sort \
    | tail -n 1 \
    | tr -d '\n' \
    | wc -c

    which is full of primitive text processing tools (tail, several tr
    invocations) that have no counterpart in the high-level Lisp code.

    Opinions obviously differ.

    Janis




    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Janis Papanagnou@janis_papanagnou@hotmail.com to comp.lang.awk on Tue Mar 1 13:20:45 2022
    From Newsgroup: comp.lang.awk

    On 23.02.2022 22:44, Axel Reichert wrote:
    Janis Papanagnou <janis_papanagnou@hotmail.com> writes:
    On 20.02.2022 20:35, Axel Reichert wrote:

    I'd nonetheless have used 'sort -r' and 'head -n 1'

    Understood, great: "sort" does not care about the "-r", because it
    just negates the "predicate", but printing the first line is both
    cheaper and easier for awk. It seems you have learned a thing or two
    about Big Oh notation. (-;

    Since you mention "Big Oh": note that sorting (O(N log N) or O(N^2))
    is more expensive than a linear search (O(N)). In case of the sample
    application that you proposed in this sub-thread it may be practically
    irrelevant (nowadays) to consider that, since we process only 9000
    data lines. But if I want to handle larger data sets, that's another
    reason to prefer a simple maximum determination over sorting. Just
    BTW. (Yes, "Big Oh" is crucial. I've heard that in interviews for
    Google jobs it's also a central and important element.)
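
    The difference shows up directly in the two idioms (a sketch; it
    assumes non-negative numeric input, so the uninitialized max in the
    awk variant starts out harmless):

    sort -rn | head -n 1                             # O(N log N)
    awk '$1 > max { max = $1 } END { print max }'    # one O(N) pass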


    I tend to neglect this, being used to much larger number crunching
    efforts than the modest 10^4 numbers here. Thanks for reminding me.

    build binary string r=r ($1+$4 < $2+$3)

    Great, another feature that I tend to forget, the default string
    concatenation. Very elegant in combination with the C-style boolean.

    seq 1000 9999 | awk '
    { gsub(/./,"& ") }
    m=($1+$4 < $2+$3) { if (++c>max) max=c }
    !m { c=0 }
    END { print max }
    '

    Great. That kind of stuff I was after and eager to learn!

    (Ah, okay, you want the numbers also generated by awk. I assumed that
    'seq' was just an example input source that could be in principle any
    process or file as data source.

    In general I agree, but here it is one of the conceptual obstacles in
    my way of thinking. So far I believe no post has come up in this
    thread that shows how to "feed" output from the BEGIN block to the
    rest of the awk program. Or did I miss something?

    I'm not sure I understand what you wrote here. If you are saying that
    you want to generate the sequence numbers in the BEGIN block and pass
    them to the implicit awk-processing-loop then yes, that's not possible.
    You would then do the whole processing in the BEGIN block using a loop;
    I think I've seen someone in this sub-thread posting such a loop-based awk-solution.
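
    A minimal sketch of the shape of such an all-in-BEGIN solution (my
    formulation; the belly test could equally be done with arithmetic
    on n instead of substr()):

    awk 'BEGIN {
        for (n = 1000; n <= 9999; n++) {
            s = n ""                               # the number as a string
            m = (substr(s,2,1) + substr(s,3,1) > substr(s,1,1) + substr(s,4,1))
            if (m) { if (++c > max) max = c } else c = 0
        }
        print max                                  # longest bellied run
    }'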

    Personally I tend to intuitively separate data generation from the
    data processing, as a consequence of design and usability experience;
    that's why it didn't occur to me (before I saw your post) that it was
    part of the "code puzzle".

    Janis

    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Janis Papanagnou@janis_papanagnou@hotmail.com to comp.lang.awk on Tue Mar 1 13:35:22 2022
    From Newsgroup: comp.lang.awk

    On 23.02.2022 22:44, Axel Reichert wrote:
    Janis Papanagnou <janis_papanagnou@hotmail.com> writes:

    build binary string r=r ($1+$4 < $2+$3)

    (I forgot to comment on that. - A caveat for the general case here.)

    Great, another feature that I tend to forget, the default string
    concatenation. Very elegant in combination with the C-style boolean.

    Since we were talking about "Big Oh"... - Note that in case of large
    strings or data sets such concatenations can be expensive (depending
    on the Awk version you use). GNU Awk has a built-in optimization for
    the x = x ... cases of concatenations, though, so it's cheap here.
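
    One can see the effect with a crude micro-benchmark like the
    following (timings are, of course, implementation-dependent):

    time awk 'BEGIN {
        # repeated self-concatenation: cheap in gawk thanks to the
        # x = x ... optimization, potentially quadratic in other awks
        for (i = 0; i < 500000; i++) s = s "x"
        print length(s)
    }'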

    Janis

    --- Synchronet 3.19c-Linux NewsLink 1.113
  • From Axel Reichert@mail@axel-reichert.de to comp.lang.awk on Tue Mar 1 20:32:25 2022
    From Newsgroup: comp.lang.awk

    Janis Papanagnou <janis_papanagnou@hotmail.com> writes:

    If you are saying that you want to generate the sequence numbers in
    the BEGIN block and pass them to the implicit awk-processing-loop

    Yes.

    then yes, that's not possible.

    O. K., thanks.

    You would then do the whole processing in the BEGIN block using a loop;
    I think I've seen someone in this sub-thread posting such a loop-based awk-solution.

    Yes, you may be right.

    Personally I tend to intuitively separate data generation from the
    data processing, as a consequence of design and usability experience

    Makes sense.

    Best regards

    Axel
    --- Synchronet 3.19c-Linux NewsLink 1.113