• writing a good gsub regexp for matching between two specific characters

    From Bryan@bryanlepore@gmail.com to comp.lang.awk on Sat Mar 11 16:06:09 2023
    From Newsgroup: comp.lang.awk

    I'm using gawk 5.1.0, bash 5.1.16, Ubuntu 22.04.2. I will write and provide a lot of material in case it is useful or there is conflict in the script, but I am trying not to ramble.
    I prepared a test script below, which should be easy to copy/paste into a shell, e.g. bash. I am focused on the gsub regexps, which are obviously contrived to replace all these different strings. The strings vary in the output of another program, but take the general form (attempting a "plain English" version):
    [open apostrophe][the word "path"][maybe an underscore][various digits][end apostrophe]
    I want to take all of that ^^^ and delete it - or equivalently replace it with nothing (ideally), to prepare input to gnuplot as "x,y" or "x y" data - two columns.
    I tried using this type of command :
    gsub("^[a-z]{4}$","TEST") ;
    ... and more, e.g. trying sub and gensub - but did not get far - I am aware of a curly brace escape that is important or not depending on the awk version, so I also tried with \{ and \}.
    I put "TEST" in the present case for testing a few different cases. I wrote this script based on extensive reading of a certain popular online resource and The Awk Programming Language (1988 - maybe time for a newer edition?). This is a useful script because as I find new types of output from the upstream program (a whole other story), I might add new gsub commands to take care of it.
    copy/paste example script:
    echo "\
    {\"path_1234567\"\
    :[`seq -s',' -f '%f' 1 20 `],\
    \"path_123456\"\
    :[`seq -s',' -f '%f' 1 20 `],\
    \"path_1234\"\
    :[`seq -s',' -f '%f' 1 20 `],\
    \"path1234\"\
    :[`seq -s',' -f '%f' 1 20 `]}" | \
    gawk -F, '
    {
    gsub("\{","") ;
    gsub("\}","") ;
    gsub("\]","") ;
    gsub("^[a-z]{4}$","TEST") ;
    gsub("\"[a-z][a-z][a-z][a-z]_[0-9][0-9][0-9][0-9][0-9][0-9][0-9]\":\\\[","TESTSEVEN") ;
    gsub("\"[a-z][a-z][a-z][a-z][0-9][0-9][0-9][0-9][0-9][0-9]\":\\\[","TESTSIX") ;
    gsub("\"[a-z][a-z][a-z][a-z][0-9][0-9][0-9][0-9]\":\\\[","TESTFOURB") ;
    gsub("\"[a-z][a-z][a-z][a-z]_[0-9][0-9][0-9][0-9]\":\\\[","TESTFOURA") ;
    for (i=1;i<=NF;i++)
    {
    printf("%s%s",$i,i%2?",":"\n")
    }
    }'
    ... the last printf thing is perhaps for another post, but (IIUC) matches every 2nd comma and replaces it with a newline. So that's the "x,y" data idea. I hope that is clear - I imagine the regexps in the [a-z][0-9] parts ought to be able to go all into one gsub if I knew the syntax or what to read about.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Sun Mar 12 03:52:27 2023
    From Newsgroup: comp.lang.awk

    First, I cannot really decipher what you actually want to do and
    where your problems are. The usual procedure is to post sample data:
    input data and the corresponding output data at least (not shell
    code that creates the input data). Anyway you find below some hints
    and suggestions...

    On 12.03.2023 01:06, Bryan wrote:
    I'm using gawk 5.1.0, bash 5.1.16, Ubuntu 22.04.2. I will write and
    provide a lot of material in case it is useful or there is conflict
    in the script, but I am trying not to ramble.

    I prepared a test script below - which should be easy to copy/paste
    into a shell, e.g. bash. I am focused on the gsub regexps, which are obviously contrived to replace all these different strings which - as
    they vary from output from another program - take the general form (attempting a "plain English" version):

    [open apostrophe][the word "path"][maybe an underscore][various
    digits][end apostrophe]

    I want to take all of that ^^^ and delete it - or equivalently
    replace it with nothing (ideally), to prepare input to gnuplot as
    "x,y" or "x y" data - two columns.

    I tried using this type of command :

    gsub("^[a-z]{4}$","TEST") ;

    This works for replacing lines that consist _only_ of a sequence of
    four lower case letters with "TEST". gsub() _without_ the ^ and $
    anchors will substitute any occurrence of that pattern on a line.
    You can provide a third argument to gsub() to operate on variables
    or specific fields; in that case the anchors ^ and $ will define
    the beginning and end of that variable or field respectively.
    It is also advantageous to use /.../ syntax for constant patterns
    instead of the string form "...".
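
    A minimal sketch of the points above, using made-up sample lines. The function names are illustrative only, and [a-z]+ stands in for [a-z]{4} so that the sketch also runs on awks without interval expressions:

```shell
# Sketch only: the sample lines and function names are made up for illustration.
anchor_demo() {
  printf '%s\n' 'path' 'path_1234 path' | awk '
  {
    whole = $0
    gsub(/^[a-z]+$/, "TEST", whole)   # anchored: matches only if the whole string is letters
    any = $0
    gsub(/[a-z]+/, "TEST", any)       # unanchored: every run of letters is replaced
    print whole " | " any
  }'
}

field_demo() {
  # With a field as the third argument, ^ and $ refer to the start and end of that field.
  echo 'path_1234 path' | awk '{ gsub(/^[a-z]+$/, "TEST", $2); print }'
}

anchor_demo
field_demo
```

    The anchored form leaves the second sample line untouched because that line contains more than letters, while the field form changes only the second field.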


    ... and more, e.g. trying sub and gensub - but did not get far - I am
    aware of a curly brace escape that is important or not depending on
    the awk version, so I also tried with \{ and \}.

    There's no need to escape these braces.


    I put "TEST" in the present case for testing a few different cases. I
    wrote this script based on extensive reading of a certain popular
    online resource and The Awk Programming Language (1988 - maybe
    time for a newer edition?). This is a useful script because as I find
    new types of output from the upstream program (a whole other story),
    I might add new gsub commands to take care of it.

    copy/paste example script:
    echo "\
    {\"path_1234567\"\
    :[`seq -s',' -f '%f' 1 20 `],\
    \"path_123456\"\
    :[`seq -s',' -f '%f' 1 20 `],\
    \"path_1234\"\
    :[`seq -s',' -f '%f' 1 20 `],\
    \"path1234\"\
    :[`seq -s',' -f '%f' 1 20 `]}" | \
    gawk -F, '
    {
    gsub("\{","") ;
    gsub("\}","") ;
    gsub("\]","") ;
    gsub("^[a-z]{4}$","TEST") ;
    gsub("\"[a-z][a-z][a-z][a-z]_[0-9][0-9][0-9][0-9][0-9][0-9][0-9]\":\\\[","TESTSEVEN") ;
    gsub("\"[a-z][a-z][a-z][a-z][0-9][0-9][0-9][0-9][0-9][0-9]\":\\\[","TESTSIX") ;
    gsub("\"[a-z][a-z][a-z][a-z][0-9][0-9][0-9][0-9]\":\\\[","TESTFOURB") ;
    gsub("\"[a-z][a-z][a-z][a-z]_[0-9][0-9][0-9][0-9]\":\\\[","TESTFOURA") ;
    for (i=1;i<=NF;i++)
    {
    printf("%s%s",$i,i%2?",":"\n")
    }
    }'

    Instead of echo arguments with quotes and newline-escapes I suggest,
    in shell, to use here-documents with this syntax:

    awk '
    # ... your awk program ...
    ...
    ' <<EOT
    your data line 1
    your data line 2
    ...
    EOT

    and with the more contemporary $(...) a line might be

    {"path_1234567":[$(seq -s',' -f '%f' 1 20)], ...

    but I wouldn't call seq many times but only once and assign it to a
    variable and use that repeatedly

    s=$(seq -s',' -f '%f' 1 20)
    awk '
    ...
    ' <<EOT
    {"path_1234567":[${s}], ...
    ...
    EOT

    If you pipe in or redirect other input just omit the code from <<EOT
    onward.
    data_from_some_process | awk '...'
    awk '...' < data_from_some_file

    (But for testing the here-documents have advantages.)


    ... the last printf thing is perhaps for another post, but (IIUC)
    matches every 2nd comma and replaces it with a newline.

    printf doesn't replace anything. It prints every other time a newline
    instead of a comma.
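
    A small sketch of that behaviour, with a made-up input line (the function name pairs is illustrative only):

```shell
# Sketch only: "pairs" is a hypothetical helper name.
pairs() {
  # i % 2 is 1 for odd i, so odd-numbered fields are followed by ","
  # and even-numbered fields by a newline.
  awk -F, '{ for (i = 1; i <= NF; i++) printf("%s%s", $i, i % 2 ? "," : "\n") }'
}

echo '1,2,3,4,5,6' | pairs
```

    With an odd number of fields, the last field is followed by a comma and no newline, so the final "x,y" pair is incomplete.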

    So that's the
    "x,y" data idea. I hope that is clear - I imagine the regexps in the [a-z][0-9] parts ought to be able to go all into one gsub if I knew
    the syntax or what to read about.

    To match more than one regexp for the _same_ replacement you can
    combine them with the | (or) operator. For an example from your
    code above use, e.g., gsub(/{|}|]/, "") to remove those three
    braces/brackets in one expression.
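
    As a sketch, here are two spellings of that single gsub(), applied to a made-up input line. The function names are illustrative only, and the braces are escaped here on the assumption that some awks would otherwise try to read them as interval syntax:

```shell
# Sketch only: function names and the sample line are made up.
strip_alternation() {
  # Alternation: any one of the three alternatives is deleted.
  awk '{ gsub(/\{|\}|\]/, ""); print }'
}

strip_class() {
  # Bracket expression: a "]" placed first inside [...] is taken literally.
  awk '{ gsub(/[]{}]/, ""); print }'
}

echo '{"a":[1,2],"b":[3,4]}' | strip_alternation
echo '{"a":[1,2],"b":[3,4]}' | strip_class
```

    Both forms remove the same three characters; the bracket expression is shorter when every alternative is a single character.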

    But with your samples above you can also use other regexp syntaxes,
    like ? (for optional parts) and use grouping with parentheses (...)
    for longer subexpressions, e.g.
    [a-z][4}_?[0-9]{4}([0-9]{2})?
    for an optional underscore and two optional digits.

    Janis

  • From Bryan@bryanlepore@gmail.com to comp.lang.awk on Sun Mar 12 09:25:50 2023
    From Newsgroup: comp.lang.awk

    Apologies for the `seq` synthetic data, I'll prepare it the better way next time.

    But with your samples above you can also use other regexp syntaxes,
    like ? (for optional parts) and use grouping with parenthesis (...)
    for longer subexpressions, e.g.
    [a-z][4}_?[0-9]{4}([0-9]{2})?
    for an optional underscore and two optional digits.

    This is exactly what I was looking for and it works (I think a typo is in there but let's leave it for now).

    I tried {1-4} to get a range, but it didn't work - is that the idea? so

    [a-z]{4}_?[0-9]{4}([0-9]{1-4})?

    to match any number of digits from 1 to 4?
  • From gazelle@gazelle@shell.xmission.com (Kenny McCormack) to comp.lang.awk on Sun Mar 12 16:49:42 2023
    From Newsgroup: comp.lang.awk

    In article <13052000-b214-4a5f-8b4b-7a3f09986af3n@googlegroups.com>,
    Bryan <bryanlepore@gmail.com> wrote:
    Apologies for the `seq` synthetic data, I'll prepare it the better way
    next time.

    But with your samples above you can also use other regexp syntaxes,
    like ? (for optional parts) and use grouping with parenthesis (...)
    for longer subexpressions, e.g.
    [a-z][4}_?[0-9]{4}([0-9]{2})?
    for an optional underscore and two optional digits.

    This is exactly what I was looking for and it works (I think a typo is
    in there but let's leave it for now).

    I tried {1-4} to get a range, but it didn't work - is that the idea? so

    [a-z]{4}_?[0-9]{4}([0-9]{1-4})?

    to match any number of digits from 1 to 4?

    It is: {1,4}
    --
    "If our country is going broke, let it be from feeding the poor and caring for the elderly. And not from pampering the rich and fighting wars for them."

    --Living Blue in a Red State--
  • From Bryan@bryanlepore@gmail.com to comp.lang.awk on Sun Mar 12 13:11:09 2023
    From Newsgroup: comp.lang.awk

    This is great. My old awk book (Aho, Kernighan, and Weinberger) has a table on p.32 saying :

    "expression [c1-c2] matches any character in the range beginning with c1 and ending with c2."

    ... p.30 has more discussion, and I never saw anything about the comma "," to indicate a range - perhaps this is a strong indication I need to get a better book.

    And, I apologize, but I must say - this discussion reached a good answer in less than 24 hours - even though discussion doesn't "scale", and I can't cast a vote on it.

    IOW Thank you!
  • From Bryan@bryanlepore@gmail.com to comp.lang.awk on Sun Mar 12 13:43:38 2023
    From Newsgroup: comp.lang.awk

    addendum : in writing a separate question about the printf statement, I found a better way to print a newline instead of every 2nd comma from a long string of signed floating points, so I simply share the method here :

    digits=$(seq -s',' -f '%f' -10 10)
    gawk -F, '
    {
    for (i=1;i<=NF;i++)
    {
    printf("%3.6f%s",$i,i%2?",":"\n")
    }
    }' <<EOT
    ${digits}
    EOT

  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Sun Mar 12 22:42:10 2023
    From Newsgroup: comp.lang.awk

    On 12.03.2023 21:11, Bryan wrote:
    This is great. My old awk book (Aho, Kernighan, and Weinberger) has a
    table on p.32 saying :

    "expression [c1-c2] matches any character in the range beginning
    with c1 and ending with c2."

    You are referring to something different here. Slightly simplified:
    [a-z] is a regexp matching any single lowercase letter
    [0-9] any single digit
    [0-9a-fA-F] any hexadecimal digit

    The multiplicity syntax {N}, {N,}, {,M}, {N,M} is not supported by the
    classic awk ("nawk") described in the book by Aho et al. More recent and
    commonly used Awks like GNU awk support it, though. That's why there's
    no mention in that book.
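
    For an awk without interval expressions, one possible workaround is to build the repetition as a string and use it as a dynamic regexp. This is only a sketch: rep() and the shell function name are hypothetical helpers, not part of any awk.

```shell
# Sketch only: rep() and four_then_digits are hypothetical helper names.
four_then_digits() {
  awk '
  function rep(s, n,    out) {       # return s repeated n times
    out = ""
    while (n-- > 0) out = out s
    return out
  }
  # Same meaning as ^[a-z]{4}_?[0-9]{4}$ but spelled out without braces.
  BEGIN { re = "^" rep("[a-z]", 4) "_?" rep("[0-9]", 4) "$" }
  { print $0, ($0 ~ re ? "match" : "no match") }'
}

printf '%s\n' 'path_1234' 'path12' | four_then_digits
```

    The string in re is used with the ~ operator, so no regexp constant with braces is ever parsed.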


    ... p.30 has more discussion, and I never saw anything about the
    comma "," to indicate a range - perhaps this is a strong indication I
    need to get a better book.

    The old book is excellently written and covers everything that makes up
    the power of the awk language. (Don't ignore it or throw it away!)

    But I suggest, especially if you use GNU awk, which supports yet more
    features, getting a copy of Arnold Robbins's "Effective Awk Programming",
    which is based on GNU Awk. (It's also available online in a searchable
    digital form.)

    Janis
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Mon Mar 13 22:03:26 2023
    From Newsgroup: comp.lang.awk

    On 12.03.2023 22:42, Janis Papanagnou wrote:
    On 12.03.2023 21:11, Bryan wrote:
    This is great. My old awk book (Aho, Kernighan, and Weinberger) [...]

    The multiplicity syntax {N}, {N,}, {,M}, {N,M} is not supported by the
    classic awk ("nawk") described in the book by Aho et al. More recent and
    commonly used Awks like GNU awk support it, though. That's why there's
    no mention in that book.

    While that is true for classic awk ("nawk"), Arnold Robbins informed me
    that more recent versions of "nawk" also support this syntax, and have
    done so for years. (Just in case my post was misinterpreted.)

    To my knowledge, though, there are no newer/updated releases of the book
    you mentioned; it is based on the old version of (n)awk, and thus it
    does not describe that (newer) feature. (Which was my point.)

    Janis

  • From Bryan@bryanlepore@gmail.com to comp.lang.awk on Tue Mar 14 06:55:57 2023
    From Newsgroup: comp.lang.awk

    I noticed in the "Computerphile" video with Brian Kernighan - shared on this user group - that a new version of The Awk Book might be in the works as of August 2022.
    Meanwhile, the overnight delivery is in-hand now, and, from page 45:
    "[begin quote]
    {n}
    {n,}
    {n,m}
    One or two numbers inside braces denote an *interval expression*. If there is one number in the braces, the preceding regexp is repeated n times. If there are two numbers separated by a comma, the preceding regexp is repeated n to m times. If [p. 46] there is one number followed by a comma, then the preceding regexp is repeated at least n times:[end quote]"
    ... examples shown are :
    wh{3}y Matches 'whhhy', but not 'why' or 'whhhhy'.
    wh{3,5}y matches 'whhhy', 'whhhhy', or 'whhhhhy' only.
    wh{2,}y matches 'whhy', 'whhhy', and so on.
    There is more.
    Lastly, from the back cover :
    "You have the freedom to copy and modify this GNU manual."
    Glad to support the FSF in this way!
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Wed Mar 15 00:14:30 2023
    From Newsgroup: comp.lang.awk

    On 14.03.2023 14:55, Bryan wrote:
    I noticed in the "Computerphile" video with Brian Kernighan - shared
    on this user group - that a new version of The Awk Book might be in
    the works as of August 2022.

    I cannot find a new version of the original Awk book with Google
    (or other commercial providers). Could you provide a link, please?

    Or are you speaking about Arnold Robbins's book? (Especially since
    below you mention GNU and the FSF.)

    I'm certainly confused by your mention of Brian Kernighan, one of
    the authors of the original book.


    Meanwhile, the overnight delivery is in-hand now, [...] There is
    more.

    Lastly, from the back cover :
    "You have the freedom to copy and modify this GNU manual."

    Glad to support the FSF in this way!

    Janis
  • From Ben Bacarisse@ben.lists@bsb.me.uk to comp.lang.awk on Tue Mar 14 23:46:24 2023
    From Newsgroup: comp.lang.awk

    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:

    On 14.03.2023 14:55, Bryan wrote:
    I noticed in the "Computerphile" video with Brian Kernighan - shared
    on this user group - that a new version of The Awk Book might be in
    the works as of August 2022.

    I cannot find a new version of the original Awk book with Google
    (or other commercial providers). Could you provide a link, please?

    Or are you speaking about Arnold Robbins's book? (Especially since
    below you mention GNU and the FSF.)

    I'm certainly confused by your mention of Brian Kernighan, one of
    the authors of the original book.

    The phrase "might be in the works" means only that there is a possibility
    that a new edition might be in preparation. Is that what's confusing?

    Bryan is clearly talking about a new version of the original book, but
    he is referring to the most vague suggestion that there might, soon, be
    a new edition. As far as I can tell there isn't one, but there could be
    one "in the works" (i.e. in preparation).
    --
    Ben.
  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.lang.awk on Tue Mar 14 16:49:00 2023
    From Newsgroup: comp.lang.awk

    Bryan <bryanlepore@gmail.com> writes:
    I noticed in the "Computerphile" video with Brian Kernighan - shared
    on this user group - that a new version of The Awk Book might be in
    the works as of August 2022.

    Meanwhile, the overnight delivery is in-hand now, and, from page 45:

    "[begin quote]
    {n}
    {n,}
    {n,m}
    One or two numbers inside braces denote an *interval expression*. If
    there is one number in the braces, the preceding regexp is repeated n
    times. If there are two numbers separated by a comma, the preceding
    regexp is repeated n to m times. If [p. 46] there is one number
    followed by a comma, then the preceding regexp is repeated at least n times:[end quote]"

    ... examples shown are :
    wh{3}y Matches 'whhhy', but not 'why' or 'whhhhy'.
    wh{3,5}y matches 'whhhy', 'whhhhy', or 'whhhhhy' only.
    wh{2,}y matches 'whhy', 'whhhy', and so on.

    There is more.

    Lastly, from the back cover :

    "You have the freedom to copy and modify this GNU manual."

    Glad to support the FSF in this way!

    That's the GNU Awk manual. I don't have a printed version, but it
    appears to have the same content as the online manual available by
    typing "info gawk" (if you have the right things installed)
    or at <https://www.gnu.org/software/gawk/manual/gawk.html>.

    "The Awk Book" presumably refers to the original "The AWK Programming
    Language" by Aho, Kernighan, and Weinberger, published in 1988.
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    Working, but not speaking, for XCOM Labs
    void Void(void) { Void(); } /* The recursive call of the void */
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Wed Mar 15 01:22:23 2023
    From Newsgroup: comp.lang.awk

    On 15.03.2023 00:46, Ben Bacarisse wrote:
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:

    On 14.03.2023 14:55, Bryan wrote:
    I noticed in the "Computerphile" video with Brian Kernighan - shared
    on this user group - that a new version of The Awk Book might be in
    the works as of August 2022.

    I cannot find a new version of the original Awk book with Google
    (or other commercial providers). Could you provide a link, please?

    Or are you speaking about Arnold Robbins's book? (Especially since
    below you mention GNU and the FSF.)

    I'm certainly confused by your mention of Brian Kernighan, one of
    the authors of the original book.

    The phrase "might be in the works" means only that there is a possibility
    that a new edition might be in preparation. Is that what's confusing?

    Various things confused me (but not the "in the works" per se):
    - "might be in the works" vs. "the overnight delivery is in-hand now"
    - "GNU" and "FSF" vs. "The [original][commercial] Awk Book"
    - and the date "August 2022" I couldn't assign to both books mentioned


    Bryan is clearly talking about a new version of the original book, but
    he is referring to the most vague suggestion that there might, soon, be
    a new edition. As far as I can tell there isn't one, but there could be
    one "in the works" (i.e. in preparation).

    I am certainly interested in any new version. I read his post as if he
    had already got it, but I didn't find anything online.

    Janis

  • From Bryan@bryanlepore@gmail.com to comp.lang.awk on Wed Mar 15 08:31:02 2023
    From Newsgroup: comp.lang.awk

    I apologize for the confusion!

    I will make a note on the Brian Kernighan video thread - the video I listened to/watched when stuck (not a bad idea, IMHO).
  • From Ed Morton@mortonspam@gmail.com to comp.lang.awk on Wed Mar 15 12:12:09 2023
    From Newsgroup: comp.lang.awk

    On 3/15/2023 10:31 AM, Bryan wrote:
    I apologize for the confusion!

    I will make a note on the Brian Kernighan video thread - the video I listened to/watched when stuck (not a bad idea, IMHO).

    You're posting on usenet, not a forum, so please make sure every post
    has enough context included to make sense stand-alone. Right now you're truncating/removing all context on all of your posts.

    Thanks.