• Preliminary version of new regex matcher for gawk now available

    From arnold@arnold@skeeve.com (Aharon Robbins) to comp.lang.awk on Thu Jul 25 09:44:30 2024
    From Newsgroup: comp.lang.awk

    Hi All.

    I've been working with Mike Haertel (the original author of GNU grep)
    for a number of months now. He is writing a new regexp matcher for
    use in gawk (and other places, as people desire).

    The matcher is avalable on Github: https://github.com/mikehaertel/minrx.
    I have created a branch in the gawk repo that uses it: feature/minrx.

    MinRX is currently written in C++20. Mike will eventually rewrite it
    in C for portability. For the moment, you'll need to use gcc / g++
    to build the branch. I haven't tried to mess with clang / clang++.

    The test suite passes completely.

    The new matcher is the default, so that it will be exercised. The old
    matchers are still available. To use them, set GAWK_GNU_MATCHERS in
    the environment. I will NOT make a formal release with MinRX as long
    as MinRX is still in C++.

    For now, the only way to access the code is via Git:

    git clone https://git.savannah.gnu.org/r/gawk.git
    cd gawk
    git checkout feature/minrx
    ./bootstrap.sh && ./configure && make -j && make check

    If you use gawk, please try this branch out.

    Questions, comments, and *bug reports* are welcome.

    Thanks,

    Arnold
    --
    Aharon (Arnold) Robbins arnold AT skeeve DOT com
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Thu Jul 25 14:05:27 2024
    From Newsgroup: comp.lang.awk

    On 25.07.2024 11:44, Aharon Robbins wrote:
    Hi All.

    I've been working with Mike Haertel (the original author of GNU grep)
    for a number of months now. He is writing a new regexp matcher for
    use in gawk (and other places, as people desire).

    The matcher is avalable on Github: https://github.com/mikehaertel/minrx.
    I have created a branch in the gawk repo that uses it: feature/minrx.

    MinRX is currently written in C++20. Mike will eventually rewrite it
    in C for portability. For the moment, you'll need to use gcc / g++
    to build the branch. I haven't tried to mess with clang / clang++.

    The test suite passes completely.

    The new matcher is the default, so that it will be exercised. The old matchers are still available. To use them, set GAWK_GNU_MATCHERS in
    the environment. I will NOT make a formal release with MinRX as long
    as MinRX is still in C++.

    For now, the only way to access the code is via Git:

    git clone https://git.savannah.gnu.org/r/gawk.git
    cd gawk
    git checkout feature/minrx
    ./bootstrap.sh && ./configure && make -j && make check

    If you use gawk, please try this branch out.

    My system complains about -std=c++20 so I cannot test it. (I think
    I'll wait for a native C release.)


    Questions, comments, and *bug reports* are welcome.

    Well, I skimmed through the txt file on Mike's git page to learn
    about the algorithm; especially the algorithm and its complexity
    is of interest to me. The document was not quite clear about that
    (or at least made me doubt) beyond the general and typical O(N*M) characteristics. One thing I was astonished about was why there's
    a non-deterministic automaton model used (NFSM can be transformed
    into Deterministic FSM); isn't the non-deterministic tree-search
    (where every branch is traversed breadth-first) sub-optimal? Then
    it spoke about a finite number of states, but that is a normal
    characteristic for a "Finite SM". I'm also astonished that while
    he spoke about back-references (something a bit more exotic in
    "RE"s) why he sees problems when applying the ERE based approach
    on BRE subset; my grep supports back-references with BRE and ERE.
    Not supporting back-references - if that's all he wanted to say -
    with his RE algorithm is of course okay (for me, at least, others
    might complain).

    So, at the moment, that all appears still a bit obscure to me.
    But that's just a first impression from an only brief look at it.

    So, yet, only questions. (No comments. No bugs.) :-)

    But wait! I do have one comment. In Mike's git-page description
    he speaks about the goal of this implementation approach. Given
    that goal I don't think it's yet a good time to incorporate that
    algorithm in GNU Awk; it adds some inherent uncertainty of new
    code without providing a gain. Algorithm simplicity is nice but
    as I understand there's not yet performance comparisons done?
    Unless it was a deliberate offer to use GNU Awk as a test bed.
    And "nearly-feature-complete implementation" (section Features)
    is not quite a fruitful marketing concept.
    I also wonder why BSD and GNU extensions are supported but not
    the very useful abbreviations for {some,all} Perl RE shortcuts.

    Just my 2 cents.

    If performance comparison numbers are available I'm certainly
    very interested to hear about them.

    Thanks for reading this (unintentionally) longish post.

    Janis


    Thanks,

    Arnold


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From arnold@arnold@skeeve.com (Aharon Robbins) to comp.lang.awk on Fri Jul 26 07:31:53 2024
    From Newsgroup: comp.lang.awk

    In article <v7tf29$2984r$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    On 25.07.2024 11:44, Aharon Robbins wrote:
    Hi All.

    I've been working with Mike Haertel (the original author of GNU grep)
    for a number of months now. He is writing a new regexp matcher for
    use in gawk (and other places, as people desire).

    [ clipped ]

    My system complains about -std=c++20 so I cannot test it. (I think
    I'll wait for a native C release.)

    That will be a while. It's not hard to build current GCC from scratch
    on a Linux system.

    Questions, comments, and *bug reports* are welcome.

    Well, I skimmed through the txt file on Mike's git page to learn
    about the algorithm; especially the algorithm and its complexity
    is of interest to me. The document was not quite clear about that
    (or at least made me doubt) beyond the general and typical O(N*M) >characteristics.

    You can email Mike directly about the technical stuff. It's mostly
    beyond me. Or, just open an issue on the GitHub and ask questions there.

    I forgot to mention what is likely the most important point about
    the new matcher, which is that it is fully POSIX-compliant. The
    existing GNU matchers are not, and likely never will be. There's at
    least one bug I reported a few years back in the GNU matchers
    that MinRX doesn't have, also.

    This matcher also has advantages for me as the maintainer.

    Algorithm simplicity is nice but as I understand there's not yet
    performance comparisons done?

    They will be done. By the time MinRX is in gawk for real, it will
    be performant, and in C.

    Unless it was a deliberate offer to use GNU Awk as a test bed.
    And "nearly-feature-complete implementation" (section Features)
    is not quite a fruitful marketing concept.

    As far as I'm concerned, it is feature complete. However, it
    doesn't support POSIX BREs.

    I also wonder why BSD and GNU extensions are supported but not
    the very useful abbreviations for {some,all} Perl RE shortcuts.

    Because they're just window dressing. I have no desire to be
    perl compatible. My needs are to be able to do what gawk currently
    does, no more.

    HTH.
    --
    Aharon (Arnold) Robbins arnold AT skeeve DOT com
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Ben Bacarisse@ben@bsb.me.uk to comp.lang.awk on Fri Jul 26 10:57:43 2024
    From Newsgroup: comp.lang.awk

    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:

    Well, I skimmed through the txt file on Mike's git page to learn
    about the algorithm; especially the algorithm and its complexity
    is of interest to me. The document was not quite clear about that
    (or at least made me doubt) beyond the general and typical O(N*M) characteristics. One thing I was astonished about was why there's
    a non-deterministic automaton model used (NFSM can be transformed
    into Deterministic FSM); isn't the non-deterministic tree-search
    (where every branch is traversed breadth-first) sub-optimal?

    The non-deterministic to deterministic transformation yields (at worst)
    an exponential number of states. Keeping track of a set of states is
    usually the preferred method.
    --
    Ben.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From gazelle@gazelle@shell.xmission.com (Kenny McCormack) to comp.lang.awk on Sat Jul 27 16:36:33 2024
    From Newsgroup: comp.lang.awk

    In article <66a350e9$0$706$14726298@news.sunsite.dk>,
    Aharon Robbins <arnold@skeeve.com> wrote:
    ...
    My system complains about -std=c++20 so I cannot test it. (I think
    I'll wait for a native C release.)

    That will be a while. It's not hard to build current GCC from scratch
    on a Linux system.

    I doubt that. I wouldn't have the first clue about how to do it, and I'm certainly no Linux newbie.

    Maybe it (getting/building GCC) should be part of your "bootstrap" script?

    Also, is there an easy way to find out if your current GCC is "good enough" ?

    The system I am typing this on says it has GCC 9.4? Will that work?
    --
    person woman man camera tv
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Luuk@luuk@invalid.lan to comp.lang.awk on Sat Aug 10 15:42:29 2024
    From Newsgroup: comp.lang.awk

    On 27-7-2024 18:36, Kenny McCormack wrote:
    In article <66a350e9$0$706$14726298@news.sunsite.dk>,
    Aharon Robbins <arnold@skeeve.com> wrote:
    ...
    My system complains about -std=c++20 so I cannot test it. (I think
    I'll wait for a native C release.)

    That will be a while. It's not hard to build current GCC from scratch
    on a Linux system.

    I doubt that. I wouldn't have the first clue about how to do it, and I'm certainly no Linux newbie.

    Maybe it (getting/building GCC) should be part of your "bootstrap" script?

    Also, is there an easy way to find out if your current GCC is "good enough" ?

    The system I am typing this on says it has GCC 9.4? Will that work?


    from: https://stackoverflow.com/a/68545455/724039
    C++20 features are available since GCC 8.

    To enable C++20 support, add the command-line parameter

    -std=c++20

    For G++ 9 and earlier use

    -std=c++2a

    Or, to enable GNU extensions in addition to C++20 features, add

    -std=gnu++20


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From gazelle@gazelle@shell.xmission.com (Kenny McCormack) to comp.lang.awk on Sat Aug 10 13:50:14 2024
    From Newsgroup: comp.lang.awk

    In article <nnd$15de9d74$69aba32d@91e4ec1203dca4f9>,
    Luuk <luuk@invalid.lan> wrote:
    ...
    from: https://stackoverflow.com/a/68545455/724039
    C++20 features are available since GCC 8.

    To enable C++20 support, add the command-line parameter

    -std=c++20

    For G++ 9 and earlier use

    -std=c++2a

    Or, to enable GNU extensions in addition to C++20 features, add

    -std=gnu++20

    Thanks. That was very helpful.
    --
    A 70 year old man who watches 6 hours of TV a day, plays a lot of golf
    and seems to always be in Florida is a retiree, not a president.
    --- Synchronet 3.20a-Linux NewsLink 1.114