On 2/23/2022 3:34 PM, J Naman wrote:
Benchmarks have been mentioned in some recent posts, so I benchmarked IGNORECASE=0 vs. IGNORECASE=1 under GAWK version 5.1.1.
The result is no surprise, 128% difference for this one benchmark.
I am just reporting the quantitative difference. The wall clock time difference
can be non-trivial for processing large files having hundreds of
thousands of lines of text.
The benchmark was for a set of six statements:
four of the form:
if(str~/^text/) {return;}
plus two statements:
if(str~/^[A-Z]+$/) {return;}
if(str~/^[a-z]+$/) {return;}
The results over an aggregate 3 million loops:
Score  Test
817    Avg IGN=1
640    Avg IGN=0
128%   ratio (817/640), i.e. IGN=1 scored about 28% higher
Also, I reran the same test with "? at the beginning
and end of all six regexps. The scores were
not significantly different than the above.
* These Scores are scaled. I was previously warned
not to report actual CPU or clock times for one particular system.
On Wednesday, 23 February 2022 at 16:55:36 UTC-5, Ed Morton wrote:
Were there 128% more matches or some other difference in the matched
strings? Without knowing what the input contained it's hard to know what
those results mean. What were you hoping to test by adding `"?` to the
regexps? Without knowing how IGN=1 compares to the alternative of
`tolower(str) ~ /^[a-z]+$/` I'm not sure what we could actually do with
this information.
Ed.
On 2/23/2022 10:45 PM, J Naman wrote:
Here are six results, scaled: (not surprising to me)
      IC=0  IC=1  1/0
low   108   226   109% longer
Mix   100   228   128% longer
UP    111   225   102% longer
low = "include variable function namespace x"
Mix = "Include Variable Function NameSpace x"
UP = "INCLUDE VARIABLE FUNCTION NAMESPACE X"
function testmatch(str, x){ # all 7 regexps are tested on every call
    if(str~/^include variable function namespace x/) {x++} # lower
    if(str~/^INCLUDE VARIABLE FUNCTION NAMESPACE X/) {x++} # upper
    if(str~/^Include Variable Function NameSpace x/) {x++} # mixed
    if(str~/^iNCLUDE vARIABLE fUNCTION nAMEsPACE X/) {x++} # inverted case
    if(str~/^INclUde varIABLE FUnCtIon NamEspACe X/) {x++} # random case
    if(str~/^[A-Z ]+$/) {x++}
    if(str~/^[a-z ]+$/) {x++}
} #eofunc testmatch(str)
So, worst case, IGNORECASE=1 takes about twice as long. No surprise.
I forced all 7 regexps to be tested on every call because my real data doesn't match very often.
All of my regexps are mixed case and the file data are supposed to be.
tolower() on both input and regexp looks to be no better than
mixed case input to mixed case regexp.
btw: 'random case' is a quirky feature of my editor I never had any use for before.
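For anyone wanting to reproduce the "% longer" column: it appears to be (IC=1 - IC=0) / IC=0, truncated rather than rounded. A minimal sketch in awk, wrapped in a shell function so it can be pasted into a terminal; the name `pct_longer` is made up here and is not from the thread.

```shell
# pct_longer: hypothetical helper, not from the thread.
# Reads "label IC0 IC1" rows on stdin and prints how much longer IC=1 took.
pct_longer() {
    awk '{
        # (IC1 - IC0) * 100 / IC0, truncated toward zero
        printf "%s %d%% longer\n", $1, int(($3 - $2) * 100 / $2)
    }'
}

# The three rows from the scaled results table
printf '%s\n' 'low 108 226' 'Mix 100 228' 'UP 111 225' | pct_longer
```

With the posted scores this gives 109%, 128% and 102%, matching the table.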
On Thursday, 24 February 2022 at 07:51:12 UTC-5, Ed Morton wrote:
Am I right in thinking that by the above you mean your test script is
basically a script that calls that function some large number of times
in a loop with 1 of the stated strings, e.g.
BEGIN {
    low = "include variable function namespace x"
    for (i=1;i<=1000000;i++) testmatch(low)
}
I'm still struggling to understand what we're supposed to **do** with
the above information. I mean if we need to match a regexp against
mixed-case input we have 2 choices:
1) IGNORECASE=1; .. $0 ~ /foo/
2) tolower($0) ~ /foo/
and what we cannot do is just:
3) $0 ~ /foo/
so what can we do with the information that "1" would be slower than "3"
since we can't use "3" for this anyway? If you told us that "1" was
slower than "2" then we could use that information to write scripts
using "2" instead of "1" but I just don't see how the speed of "1" vs
the speed of "3" is something we can act on.
Ed.
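The three options can be put side by side to see which ones match mixed-case input. A minimal sketch, assuming gawk is installed (IGNORECASE is a gawk extension); the function name `three_ways` and the sample strings are made up for illustration. It only shows which options match, not how fast they are.

```shell
# three_ways: hypothetical helper. Prints 1 (match) or 0 (no match) for
# each of the three approaches, applied to the string passed as $1.
three_ways() {
    gawk -v s="$1" 'BEGIN {
        IGNORECASE = 1; a = (s ~ /foo/)          # 1) case-insensitive match
        IGNORECASE = 0; b = (tolower(s) ~ /foo/) # 2) fold the input instead
        c = (s ~ /foo/)                          # 3) plain, case-sensitive
        print a, b, c
    }'
}

three_ways 'Some FOO here'   # prints: 1 1 0
three_ways 'some foo here'   # prints: 1 1 1
```

Options 1 and 2 agree on both inputs, while option 3 only matches when the input already has the same case as the regexp.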
J Naman wrote:
Ed, you're quite right. And I apologize for not telling you what motivated all this. I have files of mixed case text and regexps that are mixed case, e.g. /New York, NY/. I did an @include "foo" that included some function that set IGNORECASE=1 and everything s-l-o-w-e-d down. Once I figured out that IGNORECASE was probably responsible, I benchmarked to see what the times were. Thus, for my data, if and when possible, exact match text to regexp with IGNORECASE=0. As I said, no surprise. Sorry if I have wasted people's time.
John
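One way to stop an included function from changing matching behaviour for the rest of a program is to have it save and restore IGNORECASE around its own match. A minimal sketch, assuming gawk; `imatch` and `scoped_demo` are made-up names, not part of gawk or of the script discussed in the thread.

```shell
scoped_demo() {
    gawk '
    # imatch: hypothetical helper. Case-insensitive match of str against the
    # dynamic regexp re, leaving IGNORECASE as the caller had it.
    function imatch(str, re,    saved, hit) {
        saved = IGNORECASE
        IGNORECASE = 1
        hit = (str ~ re)
        IGNORECASE = saved
        return hit
    }
    BEGIN {
        s = "new york, ny"
        print imatch(s, "^New York, NY")   # helper ignores case
        print (s ~ /^New York, NY/)        # caller is still case-sensitive
        print IGNORECASE + 0               # restored to its original value
    }'
}

scoped_demo
```

The helper reports a match while the caller's own case-sensitive test does not, so the rest of the program keeps running with IGNORECASE=0.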