I accidentally stumbled across the book "The Art of UNIX Programming"
(2004), by Eric S. Raymond. It has a chapter on Awk (about one and a
half page long). I was a bit astonished about quite some statements, valuations, and conclusions. (And not only in the light of a recent
Fosslife article that Arnold informed us about here in c.l.a in May
2021.)
Here are two paragraphs quoted from the book. I'm interested in your opinions.
" The awk language was originally designed to be a small,
expressive special-purpose language for report generation.
Unfortunately, it turns out to have been designed at a bad
spot on the complexity-vs.-power curve. The action language
is noncompact, but the pattern-driven framework it sits
inside keeps it from being generally applicable — that’s the
worst of both worlds. And the new-school scripting languages
can do anything awk can; their equivalent programs are
usually just as readable, if not more so. "
" For a few years after the release of Perl in 1987, awk
remained competitive simply because it had a smaller, faster
implementation. But as the cost of compute cycles and memory
dropped, the economic reasons for favoring a special-purpose
language that was relatively thrifty with both lost their
force. Programmers increasingly chose to do awklike things
with Perl or (later) Python, rather than keep two different
scripting languages in their heads. By the year 2000 awk had
become little more than a memory for most old-school Unix
hackers, and not a particularly nostalgic one. "
Janis
In article <st6udg$k03$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou@hotmail.com> wrote:
I accidentally stumbled across the book "The Art of UNIX Programming"
(2004), by Eric S. Raymond. It has a chapter on Awk (about one and a
half page long). I was a bit astonished about quite some statements,
valuations, and conclusions. (And not only in the light of a recent
Fosslife article that Arnold informed us about here in c.l.a in May
2021.)
Here are two paragraphs quoted from the book. I'm interested in your
opinions.
Obviously, this guy is full of crap.
That's not as uncommon a situation (even in those we are supposed to admire
and hold up as heroes) as we'd like it to be.
In article <st70q4$3r4c6$1@news.xmission.com>,
Kenny McCormack <gazelle@shell.xmission.com> wrote:
In article <st6udg$k03$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou@hotmail.com> wrote:
I accidentally stumbled across the book "The Art of UNIX Programming"
(2004), by Eric S. Raymond. It has a chapter on Awk (about one and a
half page long). I was a bit astonished about quite some statements,
valuations, and conclusions. (And not only in the light of a recent
Fosslife article that Arnold informed us about here in c.l.a in May
2021.)
Here are two paragraphs quoted from the book. I'm interested in your
opinions.
Obviously, this guy is full of crap.
That's not as uncommon a situation (even in those we are supposed to admire
and hold up as heroes) as we'd like it to be.
It's funny in particular, since he mentions the power-complexity curve, and
I always thought that was AWK's main strength - that's the thing I always
liked about it - that it was perfectly situated on that curve.
You can do really
cool things in AWK w/o having to spend lots of time bowing down to the gods of the language. I.e., with AWK, you can sit down and start writing your algorithm w/o having to spend lots of time writing boilerplate code to get started, as with most other languages.
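To illustrate (with a made-up file name and column), a complete awk program
that totals the second column of its input is just the algorithm, nothing
else:

  awk '{ sum += $2 } END { print sum }' data.txt

No main(), no declarations, no imports - you type the rule and you're done.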
The problem really is as it is with everything - ya always gotta push the new stuff. Whether we're talking about books, movies, TV, music, programming languages, whatever. You always have to be pushing the new stuff and disparaging the old. [...]
As someone else has said here, there's a lot to be said for a small
language, but that advantage starts to drain away as soon as you are
forced to bite the bullet of using a bigger one (whatever that really
means). A huge driver of this for Perl was CPAN. Perl had publicly
shared modules so you could knock up something to parse out bits of
HTML, process emails and so on in just an hour or so. And you could
avoid name clashes quite easily. At the time (and maybe to this day)
you shared AWK code by literal copying of text into your script, hoping
that no name clashes would cause trouble.
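For comparison, the nearest thing standard awk has to a shared module is
listing several program files on the command line, which are simply
concatenated (the file names here are invented for illustration):

  awk -f libcsv.awk -f report.awk sales.csv

gawk adds an @include directive on top of that, but with plain -f or
@include all functions and globals still share one flat namespace, so the
name-clash problem does not really go away.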
On 31.01.2022 00:26, Ben Bacarisse wrote:
As someone else has said here, there's a lot to be said for a small
language, but that advantage starts to drain away as soon as you are
forced to bite the bullet of using a bigger one (whatever that really
means). A huge driver of this for Perl was CPAN. Perl had publicly
shared modules so you could knock up something to parse out bits of
HTML, process emails and so on in just an hour or so. And you could
avoid name clashes quite easily. At the time (and maybe to this day)
you shared AWK code by literal copying of text into your script, hoping
that no name clashes would cause trouble.
I am skeptical about that. Aren't you essentially drawing the picture
of "featureitis" - feature-driven language enhancements? It seems to
me that often hype starts and initially fosters a new language, and
fans continue using these languages for things initially unintended,
so that the application domain is expanded step by step, and (if done
in the right way) by libraries, successfully (in a way). The result
appears to be an asymptotic evolution towards general purpose languages,
or something that's intended as one; often ignoring sophisticated design
(Javascript comes to my mind)[*].
But the point is, in my opinion, that
the original intent to be a small language that covers only a special
domain gets lost. The schizophrenic thing - also just my opinion -
is that it seems contrary to the Unix philosophy of separation of duties
and keeping tools small and specialized; incidentally also described
by ESR extensively in that book.
[**] Incidentally GNU Awk opens that path with its Extension Library,
without actually taking it.
[ keep things simple and specialized Unix philosophy ]
This is why keeping AWK simple and narrowly focused is good. But that
will inevitably lead people to find alternatives for some tasks, and
that is a danger (if you want to look at it like a competition) because
it opens the door to using these other alternatives in the future.
On 31.01.2022 18:15, Ben Bacarisse wrote:
[ keep things simple and specialized Unix philosophy ]
This is why keeping AWK simple and narrowly focused is good. But that
will inevitably lead people to find alternatives for some tasks, and
that is a danger (if you want to look at it like a competition) because
it opens the door to using these other alternatives in the future.
Well, I think it's okay to find a language better suited for a given
task. Certainly better than if every [simple] language gets enhanced
only for competition purposes (which is no argument for me).
In that light I cannot understand how ESR came to the statements that
I quoted in my OP.
To my mind, where ESR errs is in thinking of AWK as a scripting language
at all. If you think of it as a tool for manipulating line-oriented
text files to be used alongside Unix's other tools like grep, cut, sort,
uniq then you probably won't mind the space it takes up in your head.
On 2022-02-01, Ben Bacarisse <ben.usenet@bsb.me.uk> wrote:
To my mind, where ESR errs is in thinking of AWK as a scripting language
at all. If you think of it as a tool for manipulating line-oriented
text files to be used alongside Unix's other tools like grep, cut, sort,
uniq then you probably won't mind the space it takes up in your head.
Where ESR errs is believing that Awk is a language he actually knows.
Otherwise he'd know that you can use the "curly brace dialect" without
the pattern-condition framework, other than a BEGIN clause.
function helper()
{
}
function main()
{
helper();
}
BEGIN { main(); }
Awk turns off the pattern-action framework when there are no patterns
and actions other than BEGIN.
Kaz Kylheku <480-992-1380@kylheku.com> writes:
On 2022-02-01, Ben Bacarisse <ben.usenet@bsb.me.uk> wrote:
To my mind, where ESR errs is in thinking of AWK as a scripting language
at all. If you think of it as a tool for manipulating line-oriented
text files to be used alongside Unixes other tool like grep, cut, sort,
uniq then you probably won't mind the space it takes up in your head.
Where ESR errs is believing that Awk is a language he actually knows.
Otherwise he'd know that you can use the "curly brace dialect" without
the pattern-condition framework, other than a BEGIN clause.
function helper()
{
}
function main()
{
helper();
}
BEGIN { main(); }
Awk turns off the pattern-action framework when there are no patterns
and actions other than BEGIN.
I'll take your word that he did not know this.
But how does this weaken
what he was saying? The "curly brace dialect" of AWK is hardly a better
AWK.
It's AWK without the most convenient part (for most tasks).
So I try to encourage new colleagues to use awk as often as I can.
i'm an ultra late-comer to awk - only discovering it in 2017-2018. and
the moment i found it, i realized nearly all else - perl R python java
C# - can be thrown straight into the toilet, if performance is a key
criterion for the task at hand
[....] I started immediately, cobbled something together (awk featured
prominently among other usual suspects, such as tr, sed, cut, grep).
It delivered the desired results before his Python script was finished.
So the final tally was "10 min" versus "> 30 min + 10 min + 10 min".
Once the logic becomes more intricate, I will usually go for Python
though, so I will use awk mostly for command line use, rarely as a file
to be run by "awk -f".
On 09.02.2022 08:49, Axel Reichert wrote:
[ about an ASCII data mangling Python script ]
[....] I started immediately, cobbled something together (awk featured
prominently among other usual suspects, such as tr, sed, cut, grep).
Hmm.. - these four tools are amongst those where I usually say; instead
of connecting and running a lot of such processes use just one instance
of awk. The functions expressed in those tools are - modulo a few edge
cases - basics in Awk and part of its core.
So funny you mentioned "the usual suspects". You can replicate the
following test benchmark, attempting a bare-bones replication of the unix
utility [ wc ], for both GNU wc ("gwc") and BSD wc ("wc"):
— gawk 5.1.1 posts a reasonably competitive time of 31.5 secs,
— mawk 1.3.4's time of 19.3 secs beats GNU wc and is only slightly slower
  than BSD wc, while
— mawk 1.9.9.6's impressive 12.7 secs leaves both in the dust, some 41%
  faster than BSD wc, and a whopping 83% faster than GNU wc.
I wasn't kidding when I said I benchmark awk codes against C binaries
instead of against perl or python.
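For reference, a bare-bones line/word/byte counter of that kind needs only
a few lines of awk (a sketch: "file" is a placeholder, and the byte count
is only right for single-byte encodings and newline-terminated lines):

  awk '{ lines++; words += NF; bytes += length($0) + 1 }
       END { printf "%8d %8d %8d\n", lines, words, bytes }' file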
Janis Papanagnou <janis_papanagnou@hotmail.com> writes:
On 09.02.2022 08:49, Axel Reichert wrote:
[ about an ASCII data mangling Python script ]
[....] I started immediately, cobbled something together (awk featured
prominently among other usual suspects, such as tr, sed, cut, grep).
Hmm.. - these four tools are amongst those where I usually say; instead
of connecting and running a lot of such processes use just one instance
of awk. The functions expressed in those tools are - modulo a few edge
cases - basics in Awk and part of its core.
That sometimes works,
but the trouble is that once you've used AWK's
pattern/action feature once, you can't do so again -- you are stuck
inside the action part. Just the other day I needed to split fields
within a field after finding the lines I wanted.
This was, for me, an obvious case for two processes:
awk -F: '/wanted/ { print $3 }' | awk -F, '...'
but I could have used grep and cut in place of the first AWK. Maybe I'm
just not good at remembering the details of all the key functions,
but I find I use AWK in pipelines quite a lot.
On 2022-02-09, Ben Bacarisse <ben.usenet@bsb.me.uk> wrote:
Janis Papanagnou <janis_papanagnou@hotmail.com> writes:
On 09.02.2022 08:49, Axel Reichert wrote:
[ about an ASCII data mangling Python script ]
[....] I started immediately, cobbled something together (awk featured
prominently among other usual suspects, such as tr, sed, cut, grep).
Hmm.. - these four tools are amongst those where I usually say; instead
of connecting and running a lot of such processes use just one instance
of awk. The functions expressed in those tools are - modulo a few edge
cases - basics in Awk and part of its core.
That sometimes works, but the trouble is that once you've used AWK's
pattern/action once feature, you can't do so again -- you are stuck
inside the action part. Just the other day I needed to split fields
within a filed after finding the lines I wanted. This was, for me, an
obvious case for two processes:
awk -F: '/wanted/ { print $3 }' | awk -F, '...'
You can split $3 into fields by assigning its value to $0, after
tweaking FS for the inner field separator:
$ awk '/wanted/ { FS=","; $0=$3; OFS=":"; $1=$1; print }'
wanted two three,a,b,c <- input
three:a:b:c <- output
You have to save and restore FS to do this repeatedly for
different records of the outer file. Another approach is to
use the split function to populate an array, where the pattern
is an argument (only defaulting to FS if omitted).
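A sketch of that split() approach on the same made-up input (the ":" output
glue is just carried over from the example above):

  $ awk '/wanted/ { n = split($3, part, ",")     # "," given explicitly, FS untouched
                    for (i = 1; i <= n; i++)
                        printf "%s%s", part[i], (i < n ? ":" : "\n")
                  }'
  wanted two three,a,b,c    <- input
  three:a:b:c               <- output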
On 09.02.2022 22:05, Ben Bacarisse wrote:
Janis Papanagnou <janis_papanagnou@hotmail.com> writes:
On 09.02.2022 08:49, Axel Reichert wrote:
[ about an ASCII data mangling Python script ]
[....] I started immediately, cobbled something together (awk featured
prominently among other usual suspects, such as tr, sed, cut, grep).
Hmm.. - these four tools are amongst those where I usually say; instead
of connecting and running a lot of such processes use just one instance
of awk. The functions expressed in those tools are - modulo a few edge
cases - basics in Awk and part of its core.
That sometimes works,
My observation is that it usually works smoothly, and only sometimes
(the edge cases, as I called them above) is it not obviously straightforward,
but then it usually works in just a slightly different way. But it works
generally.
but the trouble is that once you've used AWK's
pattern/action once feature, you can't do so again -- you are stuck
inside the action part. Just the other day I needed to split fields
within a filed after finding the lines I wanted.
You can always simply split() the fields, no need to invoke another
process just for another implicit loop that awk supports.
This was, for me, an obvious case for two processes:
awk -F: '/wanted/ { print $3 }' | awk -F, '...'
I understand the impulse to develop commands that way; that usually
leads to such horrible and inflexible cascades of the tools mentioned
above (cat, sed, grep, cut, head, tail, tr, wc, pr, or yet more awks).
And as soon as you need yet more information from the first instance
this approach needs more workarounds, e.g. passing state information
through the OS level.
Of course there's many ways to skin a cat. I just advocate thinking
about one-process solutions before following the reflex to build
inflexible pipeline cascades.
but I could have used grep and cut in place of the first AWK. Maybe I'm
just not good at remembering the details of all the key functions,
The nice thing about awk - actually already mentioned in context of
the features/complexity vs. power comments - is that you don't need
to memorize a lot;[*] I think awk is terse and compact enough. YMMV.
but I find I use AWK in pipelines quite a lot.
That's how we learned it; pipelining through simple dedicated tools.
I also still do that.
My observation is that whenever a more powerful
tool like awk gets into use, the more primitive tools in the pipeline
can be eliminated, and the whole pipeline then gets refactored, typically
for efficiency, flexibility, robustness, and clarity in design.
I want to close my comment with another aspect; the primitive helper
tools are often restricted and incoherent.[*] In a GNU context you have
additional options that I'm glad to be able to use, but if you want to
stay standard conforming the tools might not "suffice" or usage gets
more bulky. With awk the standard version already supports the powerful
core.
Kaz Kylheku <480-992-1380@kylheku.com> writes:
On 2022-02-09, Ben Bacarisse <ben.usenet@bsb.me.uk> wrote:
Janis Papanagnou <janis_papanagnou@hotmail.com> writes:
On 09.02.2022 08:49, Axel Reichert wrote:
[ about an ASCII data mangling Python script ]
[....] I started immediately, cobbled something together (awk featured
prominently among other usual suspects, such as tr, sed, cut, grep).
Hmm.. - these four tools are amongst those where I usually say; instead
of connecting and running a lot of such processes use just one instance
of awk. The functions expressed in those tools are - modulo a few edge
cases - basics in Awk and part of its core.
That sometimes works, but the trouble is that once you've used AWK's
pattern/action once feature, you can't do so again -- you are stuck
inside the action part. Just the other day I needed to split fields
within a filed after finding the lines I wanted. This was, for me, an
obvious case for two processes:
awk -F: '/wanted/ { print $3 }' | awk -F, '...'
You can split $3 into fields by assigning its value to $0, after
tweaking FS for the inner field separator:
$ awk '/wanted/ { FS=","; $0=$3; OFS=":"; $1=$1; print }'
wanted two three,a,b,c <- input
three:a:b:c <- output
Sure, but you don't get to use pattern/action pairs on the result.
On 2022-02-10, Ben Bacarisse <ben.u...@bsb.me.uk> wrote:
Kaz Kylheku <480-99...@kylheku.com> writes:
On 2022-02-09, Ben Bacarisse <ben.u...@bsb.me.uk> wrote:
Janis Papanagnou <janis_pa...@hotmail.com> writes:
On 09.02.2022 08:49, Axel Reichert wrote:
[ about an ASCII data mangling Python script ]
[....] I started immediately, cobbled something together (awk featured
prominently among other usual suspects, such as tr, sed, cut, grep).
Hmm.. - these four tools are amongst those where I usually say; instead
of connecting and running a lot of such processes use just one instance
of awk. The functions expressed in those tools are - modulo a few edge
cases - basics in Awk and part of its core.
That sometimes works, but the trouble is that once you've used AWK's
pattern/action once feature, you can't do so again -- you are stuck
inside the action part. Just the other day I needed to split fields
within a filed after finding the lines I wanted. This was, for me, an
obvious case for two processes:
awk -F: '/wanted/ { print $3 }' | awk -F, '...'
You can split $3 into fields by assigning its value to $0, after
tweaking FS for the inner field separator:
$ awk '/wanted/ { FS=","; $0=$3; OFS=":"; $1=$1; print }'
wanted two three,a,b,c <- input
three:a:b:c <- output
Sure, but you don't get to use pattern/action pairs on the result.

But that's largely just syntactic sugar for a glorified case statement.
Instead of
/abc/ { ... }
$2 > $3 { ... }
you have to write
if (/abc/) { ... }
if ($2 > $3) { ... }
kind of thing.
one-liner solution to that wanted-three question :
echo 'wanted two three,a,b,c' \
  | [mg]awk '/^wanted/ && gsub(",", substr(":", ($0=$3)~"", 1)) + 1'
three:a:b:c
command 1 is
[ echo "wanted two three,a,b,c" | mawk2 '/wanted/ * gsub(",", substr(":",$_!=($_=$NF),_~_))' ]
three:a:b:c
command 2 is
[ echo "wanted two three,a,b,c" | mawk2 -F, '/wanted/ && ($!_=substr($!_,match($!_,/[^ \t]+$/) ) )' OFS=":" ]
three:a:b:c
On 2022-02-11, Kpop 2GM <jason....@gmail.com> wrote:
one-liner solution to that wanted-three question :
echo 'wanted two three,a,b,c' \
  | [mg]awk '/^wanted/ && gsub(",", substr(":", ($0=$3)~"", 1)) + 1'
three:a:b:c
Are you positively sure that you're taking my example literally enough?
Try this:
sed -e 's/wanted two //' -e 's/,/:/g'
echo "wanted two three,a,b,c" | awk '{print $3}' | tr ',' ':'
Kpop 2GM <jason.cy.kwan@gmail.com> writes:
command 1 is
[ echo "wanted two three,a,b,c" | mawk2 '/wanted/ * gsub(",", substr(":",$_!=($_=$NF),_~_))' ]
three:a:b:c
command 2 is
[ echo "wanted two three,a,b,c" | mawk2 -F, '/wanted/ && ($!_=substr($!_,match($!_,/[^ \t]+$/) ) )' OFS=":" ]
three:a:b:c
And both seem to me horrendously inelegant compared to [...]
Janis Papanagnou <janis_papanagnou@hotmail.com> writes:
I understand the impulse to develop commands that way; that usually
leads to such horrible and inflexible cascades of the tools mentioned
above (cat, sed, grep, cut, head, tail, tr, wc, pr, or yet more awks).
[...]
It seems that like Ben I am a pipeliner, the igniting spark probably
"Opening the software toolbox":
https://www.gnu.org/software/coreutils/manual/html_node/Opening-the-software-toolbox.html
I know that a lot can be done within awk, but it often does not seem to
meet my way of thinking. For example, I might start with a grep. To my surprise it finds many matches, so further processing is called for, say
awk '{print $3}' or similar. At that point, I will NOT replace the grep
with awk '/.../', because it is easier to just add another pipeline
after fetching the command from history using the up arrow. And so on,
adding pipeline after pipeline (which I also can easily relate to
functional programming). Once the whole dataflow is ready, I will
usually not "refactor" the beast, only in glaringly obvious cases/optimizations. I might even have started with a (in hindsight)
Useless Use Of Cat. On the more ambitious side, I well remember how
proud I was when plumbing several xargs into a pipeline:
foo | bar | xargs -i baz {} 333 | quux | xargs fubar
By now this is a common idiom for me on the command line.
But full ACK on passing information from the first instance downstream,
at which point I tend to start using Python. But up to then pipelining
"just flows". That's what they were designed for. (-:
Axel
P. S.: I will keep your advice in memory, though, to avoid my worst
excesses. Point taken.
Janis Papanagnou <janis_papanagnou@hotmail.com> writes:
You can always simply split() the fields, no need to invoke another
process just for another implicit loop that awk supports.
Yes, there's no need, but why worry about it? Maybe I am alone in
thinking processes are cheap.
But more to the point, a pipeline is an elegant, easily understood, and
often natural way to organise a task.
[...]
A pipeline is not the right structure for such tasks, but there are a
huge number of tasks where combining Unix tools is the simplest
solution.
The nice thing about awk - actually already mentioned in context of
the features/complexity vs. power comments - is that you don't need
to memorize a lot;[*] I think awk is terse and compact enough. YMMV.
But since I use pipelines so much, I rarely use split, patsplit, gsub or gensub. I find myself checking their arguments pretty much every time I
use them.
That's how we learned it; pipelining through simple dedicated tools.
I also still do that.
Why? Serious question. It sounds like a dreadful risk based on your
comments above. Doing so "usually leads to such horrible and inflexible
cascades of the tools" when there is no need "to invoke another
process". What makes you sometimes take the risk of horrible cascades
and pay the price of another process?
I ask because it's possible we disagree only on how frequently it should
be done, and about exactly what circumstances warrant it.
What I was addressing is the use of programs with primitive functions
that awk is providing in a simple and consistent way inherently!
the whole pipeline gets then refactored, typically for efficiency,
flexibility, robustness, and clarity in design.
That's where I disagree. I often choose a pipeline because it is the
most robust, flexible and clear design. (I rarely care about efficiency
when doing this sort of thing.)
I want to close my comment with another aspect; the primitive helper
tools are often restricted and incoherent.[*] In GNU context you have
additional options that I'm glad to be able to use, but if you want to
stay standard conforming the tools might not "suffice" or usage gets
more bulky. With awk the standard version supports already the powerful
core.
I agree. That's a shame, but an inevitable cost of piecemeal historical development.
On 10.02.2022 18:33, Axel Reichert wrote:
P. S.: I will keep your advice in memory, though, to avoid my worst
excesses. Point taken.
Don't get me wrong. Pipelines of tools are not "bad" and I also wrote:
"That's how we learned it; pipelining through simple dedicated tools.
I also still do that. [...]"
Your double xargs programming pattern is certainly rare
as soon as a search task gets slightly more complex, say searching for
lines with two matches, /A/&&/B/, I'd use awk.
more clumsy to implement with pipes, e.g. extracting keys from a file
to match records in another file; then I don't even think about how
that (maybe) could be implemented by function compositions with
primitive Unix programs
For a one-shot ad hoc task the greps are okay, and I see that I sometimes
start with a command, then add another piped command to the previous
one (from shell history), and then a third one. But as soon as this
goes into a shell script these commands get optimized.
On 10.02.2022 02:37, Ben Bacarisse wrote:
Janis Papanagnou <janis_papanagnou@hotmail.com> writes:
You can always simply split() the fields, no need to invoke another
process just for another implicit loop that awk supports.
Yes, there's no need, but why worry about it? Maybe I am alone in
thinking processes are cheap.
I just try to avoid unnecessary processes. A dozen is not an issue,
but once you've embedded them in shell loops it might become an issue.
But more to the point, a pipeline is an elegant, easily understood, and
often natural way to organise a task.
Agreed.
[...]
A pipeline is not the right structure for such tasks, but there are a
huge number of tasks where combining Unix tools is the simplest
solution.
Agreed.
The nice thing about awk - actually already mentioned in context of
the features/complexity vs. power comments - is that you don't need
to memorize a lot;[*] I think awk is terse and compact enough. YMMV.
But since I use pipelines so much, I rarely use split, patsplit, gsub or
gensub. I find myself checking their arguments pretty much every time I
use them.
(Some of the mentioned functions are non-standard, GNU Awk'ish.)
Well, if you don't use them regularly you'll have to look up the docs.
Personally I think the [standard] functions are easy to remember, but
okay. Myself, I can easily remember them just by thinking about their
trailing default arguments: split (what, where [, by-what]) -
omitting the "by-what" will use the standard FS, and "what", "where"
is the natural order I'd expect; similarly with gsub:
gsub (what, by-what [, where]) - omitting the "where" will operate on
the whole line, and "what", "by-what" again seem the natural order.
The non-standard GNU functions patsplit() and gensub() I also have to
look up, but I think just because I rarely have a need to use these functions.
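A quick sketch of those trailing defaults on a made-up record (it prints
"3 1 gAmmA alpha-beta-gamma"):

  echo "alpha:beta:gamma" | awk '{
      n = split($0, part, ":")    # "by-what" given: part[1]="alpha", n is 3
      m = split($0, word)         # "by-what" omitted: FS (whitespace) is used, m is 1
      gsub(/a/, "A", part[3])     # "where" given: only part[3] is changed
      gsub(/:/, "-")              # "where" omitted: operates on the whole record $0
      print n, m, part[3], $0
  }'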
That's how we learned it; pipelining through simple dedicated tools.
I also still do that.
Why? Serious question. It sound like a dreadful risk based on your
comments above. Doing is "usually leads to such horrible and inflexible
cascades of the tools" when there is no need "to invoke another
process". What makes you sometimes take the risk of horrible cascades
and pay the price of another process?
I think this is answered in my previous post, my reply to Axel's post.
It's certainly not something I'd call a risk, because I can control it,
I can make the decisions, based on application case, requirements, and expertise.
I ask because it's possible we disagree only on how frequently it should
be done, and about exactly what circumstances warrant it.
I think we should not take the theme too dogmatically or too strictly. To
quote from my other post:
What I was addressing is the use of programs with primitive functions
that awk is providing in a simple and consistent way inherently!
the whole pipeline gets then refactored, typically for efficiency,
flexibility, robustness, and clarity in design.
That's where I disagree. I often choose a pipeline because it is the
most robust, flexible and clear design. (I rarely care about efficiency
when doing this sort of thing.)
We do not disagree concerning the clearness of the pipe concept. It is
just very _primitive_ (an advantage, and a restriction WRT
flexibility).
Janis Papanagnou <janis_papanagnou@hotmail.com> writes:
Yes, sure, I noticed that. I do think that it's mostly about stylistic matters at this point of the discussion. I like to compare
shell/Unix/awk/CLI issues with a language: A former boss was impressed
by what was feasible with these strange "words", which to him sounded
like Greek. He wanted me to save these utterances for others to benefit
from my "words of wisdom". I argued that they were not a "quote" to be
put into some anthology, but a spontaneous sentence formed during "live talk". The point for me was not "memorizing" them (shell script), but
being able to speak.
Of course this analogy is valid only for ad hoc stuff, but that is how I
use them in almost all cases: These are throw-away command lines and
only very rarely do I see the potential for them to be re-used. [...]
as soon as a search task gets slightly more complex, say searching for
lines with two matches, /A/&&/B/, I'd use awk.
grep A ... | grep B
more clumsy to implement with pipes, e.g. extracting keys from a file
to match records in another file; then I don't even think about how
that (maybe) could be implemented by function compositions with
primitive Unix programs
But you do know "join"? An often overlooked gem.
Janis Papanagnou <janis_papanagnou@hotmail.com> writes:
On 10.02.2022 02:37, Ben Bacarisse wrote:
Why? Serious question. It sound like a dreadful risk based on your
comments above. Doing is "usually leads to such horrible and inflexible
cascades of the tools" when there is no need "to invoke another
process". What makes you sometimes take the risk of horrible cascades
and pay the price of another process?
I think this is answered in my previous post, my reply to Axel's post.
It's certainly not something I'd call a risk, because I can control it,
I can make the decisions, based on application case, requirements, and
expertise.
That's an odd answer. Do you think I can't control the risk,
On 12.02.2022 00:32, Ben Bacarisse wrote:
Janis Papanagnou <janis_papanagnou@hotmail.com> writes:
On 10.02.2022 02:37, Ben Bacarisse wrote:
Why? Serious question. It sound like a dreadful risk based on your
comments above. Doing is "usually leads to such horrible and inflexible
cascades of the tools" when there is no need "to invoke another
process". What makes you sometimes take the risk of horrible cascades
and pay the price of another process?
I think this is answered in my previous post, my reply to Axel's post.
It's certainly not something I'd call a risk, because I can control it,
I can make the decisions, based on application case, requirements, and
expertise.
That's an odd answer. Do you think I can't control the risk,
No, Ben. You were (IMO unnecessarily) introducing the (IMO also inappropriate) term "risk".
And I pointed out that I see risks
only in cases where one cannot control the situation or where I
am restricted in any way in my decisions.
I was neither saying nor implying anything about you.
(This discussion got a spin in
an [emotional] direction that I don't want to follow.)
On 11.02.2022 22:45, Axel Reichert wrote:
Janis Papanagnou <janis_papanagnou@hotmail.com> writes:
more clumsy to implement with pipes, e.g. extracting keys from a file
to match records in another file; then I don't even think about how
that (maybe) could be implemented by function compositions with
primitive Unix programs
But you do know "join"? An often overlooked gem.
I know the 'join' command but don't see what that has to do with what I
wrote here. By "function composition" I meant that programs represent functions; tool x does f, tool y does g, and combining tool x and y by,
say, x|y does g o f, where o is the function connector, a composition
of functionality (and code).
Janis Papanagnou <janis_papanagnou@hotmail.com> writes:
On 11.02.2022 22:45, Axel Reichert wrote:
Janis Papanagnou <janis_papanagnou@hotmail.com> writes:
more clumsy to implement with pipes, e.g. extracting keys from a file
to match records in another file; then I don't even think about how
that (maybe) could be implemented by function compositions with
primitive Unix programs
But you do know "join"? An often overlooked gem.
I know the 'join' command but don't see what that has to do with what I
wrote here. By "function composition" I meant that programs represent
functions; tool x does f, tool y does g, and combining tool x and y by,
say, x|y does g o f, where o is the function connector, a composition
of functionality (and code).
foo-1.txt:
foo 1 2 3
Foo 4 5 6 7
FOO 8 9
foo-2.txt:
foo 456
Foo 45 67
FOO 89
To me, the first column seems like a key and the whole line like a
record.
To get something like
foo-joined.txt:
foo 1 2 456
Foo 4 5 45
FOO 8 9 89
would be a typical job for join. Hence my question.
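A sketch of that join call (join needs both inputs sorted on the key field,
so the files are sorted first and the output comes out in sort order rather
than in the original file order; the -o list picks the key plus two fields
from foo-1.txt and one from foo-2.txt):

  sort -k1,1 foo-1.txt > foo-1.srt
  sort -k1,1 foo-2.txt > foo-2.srt
  join -o 1.1,1.2,1.3,2.2 foo-1.srt foo-2.srt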
But we digress from awk. (-:
Axel
Hi, thank you for your outstanding contributions and discussions on
awk.
Working with it for more than 20 years now and still amazed at the
power of this wonderful language!
This is my modest contribution to shed light on a usage too little
documented on the Internet, I named "Record Separator":
https://rosettacode.org/wiki/Search_in_paragraph%27s_text
Olivier Gabathuler
Sure. You join two data sets identified by a common key. But so what?
You have probably been triggered by the formulation of a sample use ("extracting keys from a file to match records in another file") in my
post
I hope I dragged the thread back on topic with the awk samples. ;-)
On 16.02.2022 23:11, olivier gabathuler wrote:
Hi, thank you for your outstanding contributions and discussions on
awk.
Working with it from more than 20 years now and still amazed at the
power of this wonderful language !
This is my modest contribution to shed light on a usage too little
documented on the Internet, I named "Record Separator":
https://rosettacode.org/wiki/Search_in_paragraph%27s_text

I had a peek into the awk code and (unstructured) data sample.
The task is described not very specifically as:
"The goal is to verify the presence of a word or regular expression
within several paragraphs of text (structured or not) and to print
the relevant paragraphs on the standard output."
When I saw the code I first wondered about the definition of a two
newlines output record separator just to define the same as input
separator to the next awk stage. (An indication for a candidate to
be refactored.)
It seems that your code basically extracts from records of blocks
those blocks that contain a specific string. In addition it changes
the data in a subtle way beyond the formulated task description.
Personally my first attempt for such a task would have been simpler
(using awk's multi-line data blocks feature), something like
awk 'BEGIN { RS = "" ; ORS = "\n----------------\n" }
/Traceback/ && /SystemError/
' Traceback.txt
with possible extensions to test for the patterns in specific fields
(by adding FS = "\n") so that the patterns if appearing in the data
won't compromise the correct function.
(Note that the output of above code keeps the matched data intact.)
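One way such a field-wise test might look (just a sketch: with FS = "\n"
every line of a block is a field, and which fields to test for what is of
course dictated by the actual log format - the choices below are made up):

  awk 'BEGIN { RS = "" ; FS = "\n" ; ORS = "\n----------------\n" }
       $1 ~ /Traceback/ {
           for (i = 2; i <= NF; i++)
               if ($i ~ /SystemError/) { print; next }
       }
      ' Traceback.txt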
Yes, features relying on the separators allow interesting solutions.
(In the given case it's arguable whether they've been used sensibly.)
Janis
On Thursday, 17 February 2022 at 03:36:11 UTC+1, Janis Papanagnou wrote:
On 16.02.2022 23:11, olivier gabathuler wrote:
Hi, thank you for your outstanding contributions and discussions on
awk.
Working with it for more than 20 years now and still amazed at the
power of this wonderful language!
This is my modest contribution to shed light on a usage too little
documented on the Internet, I named "Record Separator":
https://rosettacode.org/wiki/Search_in_paragraph%27s_text
I had a peek into the awk code and (unstructured) data sample.
The task is described not very specifically as:
"The goal is to verify the presence of a word or regular expression
within several paragraphs of text (structured or not) and to print
the relevant paragraphs on the standard output."
When I saw the code I first wondered about the definition of a two
newlines output record separator just to define the same as input
separator to the next awk stage. (An indication for a candidate to
be refactored.)
It seems that your code basically extracts from records of blocks
those blocks that contain a specific string. In addition it changes
the data in a subtle way beyond the formulated task description.
Personally my first attempt for such a task would have been simpler
(using awk's multi-line data blocks feature), something like
awk 'BEGIN { RS = "" ; ORS = "\n----------------\n" }
/Traceback/ && /SystemError/
' Traceback.txt
with possible extensions to test for the patterns in specific fields
(by adding FS = "\n") so that the patterns if appearing in the data
won't compromise the correct function.
(Note that the output of above code keeps the matched data intact.)
Yes, features relying on the separators allow interesting solutions.
(In the given case it's arguable whether they've been used sensibly.)
Janis
Olivier Gabathuler
Hi Janis,
thanks for your response :-)
Just to understand, the output with

  awk 'BEGIN { RS = "" ; ORS = "\n----------------\n" }
       /Traceback/ && /SystemError/
      ' Traceback.txt

is:
..
----------------
[Tue Jan 21 16:16:19.250245 2020] [wsgi:error] [pid 6515:tid 3041002528] [remote 10.0.0.12:50757] Traceback (most recent call last):
[Tue Jan 21 16:16:19.252221 2020] [wsgi:error] [pid 6515:tid 3041002528] [remote 10.0.0.12:50757] SystemError: unable to access /home/dir
[Tue Jan 21 16:16:19.249067 2020] [wsgi:error] [pid 6515:tid 3041002528] [remote 10.0.0.12:50757] mod_wsgi (pid=6515): Failed to exec Python script file '/home/pi/RaspBerryPiAdhan/www/sysinfo.wsgi'.
[Tue Jan 21 16:16:19.249609 2020] [wsgi:error] [pid 6515:tid 3041002528] [remote 10.0.0.12:50757] mod_wsgi (pid=6515): Exception occurred processing WSGI script '/home/pi/RaspBerryPiAdhan/www/sysinfo.wsgi'.
----------------
12/01 19:24:57.726 ERROR| log:0072| post-test sysinfo error: 11/01 18:24:57.727 ERROR| traceback:0013| Traceback (most recent call last): 11/01 18:24:57.728 ERROR| traceback:0013| File "/tmp/sysinfo/autoserv-0tMj3m/common_lib/log.py", line 70, in decorated_func 11/01 18:24:57.729 ERROR| traceback:0013| fn(*args, **dargs) 11/01 18:24:57.730 ERROR| traceback:0013| File "/tmp/sysinfo/autoserv-0tMj3m/bin/base_sysinfo.py", line 286, in log_after_each_test 11/01 18:24:57.731 ERROR| traceback:0013| old_packages = set(self._installed_packages) 11/01 18:24:57.731 ERROR| traceback:0013| SystemError: no such file or directory
----------------
..
not exactly the output I expect, but as you said, I was not specific
enough in the description of the output formatting. I will fix that.
In fact I took this example, but in my working life on +10k Linux boxes as sysadmin, I used RS extensively to parse a lot of logs, so..
Olivier G.
On 18.02.2022 19:08, olivier gabathuler wrote:
On Thursday, 17 February 2022 at 03:36:11 UTC+1, Janis Papanagnou wrote:
On 16.02.2022 23:11, olivier gabathuler wrote:
Hi, thank you for your outstanding contributions and discussions on
awk.
Working with it for more than 20 years now and still amazed at the
power of this wonderful language!
This is my modest contribution to shed light on a usage too little
documented on the Internet, I named "Record Separator":
https://rosettacode.org/wiki/Search_in_paragraph%27s_text
I had a peek into the awk code and (unstructured) data sample.
The task is described not very specifically as:
"The goal is to verify the presence of a word or regular expression
within several paragraphs of text (structured or not) and to print
the relevant paragraphs on the standard output."
When I saw the code I first wondered about the definition of a two
newlines output record separator just to define the same as input
separator to the next awk stage. (An indication for a candidate to
be refactored.)
It seems that your code basically extracts from records of blocks
those blocks that contain a specific string. In addition it changes
the data in a subtle way beyond the formulated task description.
Personally my first attempt for such a task would have been simpler
(using awk's multi-line data blocks feature), something like
awk 'BEGIN { RS = "" ; ORS = "\n----------------\n" }
/Traceback/ && /SystemError/
' Traceback.txt
with possible extensions to test for the patterns in specific fields
(by adding FS = "\n") so that the patterns if appearing in the data
won't compromise the correct function.
(Note that the output of above code keeps the matched data intact.)
Yes, features relying on the separators allow interesting solutions.
(In the given case it's arguable whether they've been used sensibly.)
Janis
Olivier Gabathuler
Hi Janis,
thanks for your response :-)
Just to understand, the output with

  awk 'BEGIN { RS = "" ; ORS = "\n----------------\n" }
       /Traceback/ && /SystemError/
      ' Traceback.txt

is:
..
----------------
[Tue Jan 21 16:16:19.250245 2020] [wsgi:error] [pid 6515:tid 3041002528] [remote 10.0.0.12:50757] Traceback (most recent call last):
[Tue Jan 21 16:16:19.252221 2020] [wsgi:error] [pid 6515:tid 3041002528] [remote 10.0.0.12:50757] SystemError: unable to access /home/dir
[Tue Jan 21 16:16:19.249067 2020] [wsgi:error] [pid 6515:tid 3041002528] [remote 10.0.0.12:50757] mod_wsgi (pid=6515): Failed to exec Python script file '/home/pi/RaspBerryPiAdhan/www/sysinfo.wsgi'.
[Tue Jan 21 16:16:19.249609 2020] [wsgi:error] [pid 6515:tid 3041002528] [remote 10.0.0.12:50757] mod_wsgi (pid=6515): Exception occurred processing WSGI script '/home/pi/RaspBerryPiAdhan/www/sysinfo.wsgi'.
----------------
12/01 19:24:57.726 ERROR| log:0072| post-test sysinfo error: 11/01 18:24:57.727 ERROR| traceback:0013| Traceback (most recent call last): 11/01 18:24:57.728 ERROR| traceback:0013| File "/tmp/sysinfo/autoserv-0tMj3m/common_lib/log.py", line 70, in decorated_func 11/01 18:24:57.729 ERROR| traceback:0013| fn(*args, **dargs) 11/01 18:24:57.730 ERROR| traceback:0013| File "/tmp/sysinfo/autoserv-0tMj3m/bin/base_sysinfo.py", line 286, in log_after_each_test 11/01 18:24:57.731 ERROR| traceback:0013| old_packages = set(self._installed_packages) 11/01 18:24:57.731 ERROR| traceback:0013| SystemError: no such file or directory
----------------
..
not exactly the output I expect, but as you said, I was not specific
enough in the description of the output formatting. I will fix that.

Actually I was much more saying and implying. To expand on it ...
From the code and the task description it was unclear whether the
output of your script was just by accident or deliberately beyond
the description on the web page.
If it was differing by accident - as a consequence of a convoluted
design based on the field separators - then the simple code above is an
immediate improvement (in more than one aspect).
If your task was actually to output the matching lines, but these
matching lines should start from the keyword "Traceback" (and the
leading time stamps suppressed), then you can and should formulate
that in a clean way; not only the description but also the code
should be clearly formulated.
A clean awk function is simply substr($0,index($0,"Traceback"))
and the resulting code still clean and comprehensible; instead of
printing the whole record
awk 'BEGIN { RS = "" ; ORS = "\n----------------\n" }
/Traceback/ && /SystemError/ ## this implies: print $0
' Traceback.txt
you just print the desired part
awk 'BEGIN { RS = "" ; ORS = "\n----------------\n" }
/Traceback/ && /SystemError/ {
print substr($0,index($0,"Traceback"))
}
' Traceback.txt
A simple straightforward addition without side-effects or any hard
to follow program logic. No unnecessary awk instances, FS-fiddling,
or anything.
And this code prints at least the same output as the code you posted
on that web page. That code of yours was
awk -v ORS='\n\n' '/SystemError/ { print RS $0 }'
RS="Traceback" Traceback.txt |\
awk -v ORS='\n----------------\n' '/Traceback/' RS="\n\n"
If you think this code is in any way to prefer I'd be interested in
your explanations. - No, not really, that was just rhetorical.
In fact I took this example, but in my working life on +10k Linux boxes as sysadmin, I used RS extensively to parse a lot of logs, so..

In fact, if that posted code you showed here is a characteristic code sample, then I doubt that it's a good idea to spread it to +10k Linux systems.
But that taunt aside; there's nothing wrong in using the awk separators, it's a basic feature any proficient awk authority will [sensibly] use.
It's its pathological or unnecessary use I consider to be problematic.
YMMV.
Janis
Olivier G.
Kpop 2GM <> writes:
i'm an ultra late-comer to awk - only discovering it in 2017-2018. and
the moment i found it, i realized nearly all else - perl R python java
C# - can be thrown straight into the toilet, if performance is a key
criterion for the task at hand

I would rather go for TCW (Total Cost of Wizardry): A competent Python
programmer once consulted me on performance tuning for an (ASCII data
mangling) script he had written (which took him about 30 min). It had
been running for 10 min, with no end in sight according to a monitor on
the (transformed) output. After he had explained the task at hand, I
replied that I would not use Python, but rather some Unix command line
tools. I started immediately, cobbled something together (awk featured
prominently among other usual suspects, such as tr, sed, cut, grep). It
delivered the desired results before his Python script was finished. So
the final tally was "10 min" versus "> 30 min + 10 min + 10 min".
Once the logic becomes more intricate, I will usually go for Python
though, so I will use awk mostly for command line use, rarely as a file
to be run by "awk -f".
I was also a late-comer to this tool. When I started to learn Perl in
the late 90s, I learned that it was a superset of sed and awk (coming
even with conversion scripts), and so I gave the older tools another try
(the "man" pages were completely incomprehensible to me before, I could
not wrap my head around stream processing). Once it clicked, I rarely
used Perl anymore.
Same goes for spreadsheet tools, for which I also seldom feel the need.
Best regards
Axel