Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
scott@slp53.sl.home (Scott Lurndal) writes:
Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
The conventional wisdom is the opposite. But here, conventional wisdom fails, because heaps are unlimited while stacks are not.
That's not actually true. The sizes of both are bounded, yes.
It's certainly possible (in POSIX, anyway) for the stack bounds
to be unlimited (given sufficient real memory and/or backing
store) and the heap size to be bounded. See 'setrlimit'.
The sizes of both heaps and stacks are bounded, because
pointers have a fixed number of bits. Certainly these
sizes can be very very large, but they are not unbounded.
I was referring to the term of art used in POSIX, where
unlimited simply means that the operating system doesn't
limit them [.. elaboration ..]
[..various fill algorithms and how they scale..]
One thing that I was not expecting in this bigger picture is the
good performance of the simple recursive algorithm. Of course, not
in its original form, but as a variation with an explicit stack.
For many shapes it has quite a large memory footprint, and despite
that it is not slow. Probably the stack has very good locality of
reference.
[algorithm]
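The posted algorithm itself is elided above. Purely as an illustration of the explicit-stack idea (the names and details below are mine, not the poster's), a minimal 4-connected fill might look like:

```c
#include <stdlib.h>

/* Hypothetical sketch: 4-connected flood fill with an explicit,
   heap-allocated stack instead of recursion.  Pixels are bytes in a
   w*h row-major array. */
static int fill(unsigned char *img, int w, int h,
                int x0, int y0, unsigned char oldc, unsigned char newc)
{
    if (oldc == newc) return 0;

    size_t cap = 64, top = 0;
    int *stk = malloc(cap * 2 * sizeof *stk);
    if (stk == NULL) return -1;
    stk[0] = x0; stk[1] = y0; top = 1;

    while (top > 0) {
        top--;
        int x = stk[top * 2], y = stk[top * 2 + 1];
        if (x < 0 || x >= w || y < 0 || y >= h) continue;
        if (img[(size_t)y * w + x] != oldc) continue;
        img[(size_t)y * w + x] = newc;
        if (top + 4 > cap) {                 /* grow the explicit stack */
            int *p = realloc(stk, cap * 4 * sizeof *stk);
            if (p == NULL) { free(stk); return -1; }
            stk = p; cap *= 2;
        }
        stk[top * 2] = x - 1; stk[top * 2 + 1] = y;     top++;
        stk[top * 2] = x + 1; stk[top * 2 + 1] = y;     top++;
        stk[top * 2] = x;     stk[top * 2 + 1] = y - 1; top++;
        stk[top * 2] = x;     stk[top * 2 + 1] = y + 1; top++;
    }
    free(stk);
    return 0;
}
```

The heap-allocated stack sidesteps the bounded call stack that started this thread, though its worst-case size is still proportional to the number of pixels.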
Michael S <already5chosen@yahoo.com> writes:
[..various fill algorithms and how they scale..]
One thing that I was not expecting in this bigger picture is the
good performance of the simple recursive algorithm. Of course, not
in its original form, but as a variation with an explicit stack.
For many shapes it has quite a large memory footprint, and despite
that it is not slow. Probably the stack has very good locality of
reference.
[algorithm]
You are indeed a very clever fellow. I'm impressed.
Intrigued by your idea, I wrote something along the same lines,
only shorter and (at least for me) a little easier to grok.
If someone is interested I can post it.
I see you have also done a revised algorithm based on the same
idea, but more elaborate (to save on runtime footprint?).
Still working on formulating a response to that one...
On Thu, 28 Mar 2024 23:04:36 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Michael S <already5chosen@yahoo.com> writes:
[..various fill algorithms and how they scale..]
One thing that I was not expecting in this bigger picture is the
good performance of the simple recursive algorithm. Of course, not
in its original form, but as a variation with an explicit stack.
For many shapes it has quite a large memory footprint, and despite
that it is not slow. Probably the stack has very good locality of
reference.
[algorithm]
You are indeed a very clever fellow. I'm impressed.
Yes, the use of switch is clever :(
It more resembles a computed GO TO in old FORTRAN, or indirect jumps
in asm, than an idiomatic C switch. But it is legal* C.
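As a toy example of the general pattern (my own sketch, unrelated to the posted code): each emulated frame stores a small 'phase' code, and a switch on that code acts as the return address of the emulated call:

```c
/* Toy sketch: emulate the recursion
       visit(n) { if (n > 0) { visit(n-1); count++; visit(n-1); } }
   with an explicit stack.  The per-frame 'phase' plays the role of a
   return address, and the switch dispatches on it -- much like a
   computed GO TO. */
static unsigned long visit_count(int n)
{
    enum { CALL, AFTER_FIRST, AFTER_SECOND } phase[64];
    int arg[64];
    int top;
    unsigned long count = 0;

    phase[0] = CALL; arg[0] = n; top = 1;
    while (top > 0) {
        int i = top - 1;
        switch (phase[i]) {
        case CALL:
            if (arg[i] == 0) { top--; break; }
            phase[i] = AFTER_FIRST;                   /* "return point" */
            phase[top] = CALL; arg[top] = arg[i] - 1; top++;
            break;
        case AFTER_FIRST:
            count++;
            phase[i] = AFTER_SECOND;
            phase[top] = CALL; arg[top] = arg[i] - 1; top++;
            break;
        case AFTER_SECOND:
            top--;
            break;
        }
    }
    return count;                 /* equals 2^n - 1 */
}
```

Here visit_count(n) mimics a two-branch recursion without touching the call stack; each case is the continuation after one emulated call returns.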
Intrigued by your idea, I wrote something along the same lines,
only shorter and (at least for me) a little easier to grok.
If someone is interested I can post it.
If non-trivially different, why not?
I see you have also done a revised algorithm based on the same
idea, but more elaborate (to save on runtime footprint?).
Still working on formulating a response to that one...
The original purpose of the enhancement was to amortize the
non-trivial and probably not very fast call-stack emulation logic
over more than one pixel. 2x2 just happens to be the biggest block
that still has very simple in-block recoloring logic. The ~4x
reduction in the size of the auxiliary memory is just a pleasant
side effect.
Exactly the same 4x reduction in memory size could have been
achieved with the single-pixel variant by using a packed array for
the 2-bit state (== trace-back) stack elements. But that would be
as fast as or slower than the original, while the enhanced variant
is robustly faster than the original.
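For reference, the packed-array alternative mentioned here might be sketched like this (hypothetical helper names; four 2-bit entries per octet):

```c
#include <stddef.h>

/* Sketch: 2-bit trace-back states stored four to an octet.  Same 4x
   reduction in auxiliary memory, at the cost of extra shifting and
   masking on every access. */
static void put2(unsigned char *a, size_t i, unsigned v)
{
    unsigned shift = (unsigned)(i & 3) * 2;
    a[i >> 2] = (unsigned char)((a[i >> 2] & ~(3u << shift))
                                | ((v & 3u) << shift));
}

static unsigned get2(const unsigned char *a, size_t i)
{
    return (a[i >> 2] >> ((i & 3) * 2)) & 3u;
}
```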
After implementing the first enhancement I noticed that at 4K size
the timing (per pixel) for a few of my test cases was significantly
worse than at smaller image sizes. So I added another enhancement,
aimed at minimizing cache-thrashing effects, by never looking back
at the immediate parent of the current block. The info about the
location of the parent fitted nicely into the remaining 2 bits of
the stack octet.
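My guess at such an octet layout, as a sketch (the field positions are my assumption, not taken from the posted code):

```c
/* Hypothetical layout of one stack octet: low 2 bits = trace-back
   state, next 2 bits = which neighbor of the current 2x2 block is
   its parent, so the fill can skip re-examining the parent. */
enum { DIR_LEFT, DIR_RIGHT, DIR_UP, DIR_DOWN };

static unsigned char pack_frame(unsigned state, unsigned parent_dir)
{
    return (unsigned char)((state & 3u) | ((parent_dir & 3u) << 2));
}

static unsigned frame_state(unsigned char f)  { return f & 3u; }
static unsigned frame_parent(unsigned char f) { return (f >> 2) & 3u; }
```

When expanding a block, the loop over the four neighbors would then skip the one equal to frame_parent(f).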
The most robust code that I have found so far, performing well with
small pictures as well as with large and huge ones, is a variation
on the same theme of an explicit stack, perhaps more properly called
trace-back. It operates on 2x2 squares instead of individual pixels.
The worst-case auxiliary memory footprint of this variant is rather
big, up to picture_size/4 bytes. The code is *not* simple, but the
complexity appears to be necessary for robust performance with
various shapes and sizes.
[...]
I did program in FORTRAN briefly but don't remember ever using
computed GO TO. And yes, I found that missing semicolon and put it
back. Is there some reason you don't always use -pedantic? I
pretty much always do.
An alternate idea is to use a 64-bit integer for 32 "top of stack"
elements, or up to 32 I should say, and a stack with 64-bit values.
Just an idea, it may not turn out to be useful.
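A sketch of how that might look (my reading of the suggestion, untested against any fill code): up to 32 two-bit elements live in one 64-bit word, which is spilled to, and refilled from, a stack of 64-bit values.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch of the 64-bit-word idea: up to 32 two-bit elements are held
   in 'cur'; when it fills up, the whole word is spilled to 'words'.
   'n' counts the elements currently in 'cur'. */
struct bitstack {
    uint64_t cur;
    unsigned n;          /* 0..32 elements held in cur */
    uint64_t *words;     /* spilled full words, 32 elements each */
    size_t nwords, cap;
};

static int bs_push(struct bitstack *s, unsigned v)
{
    if (s->n == 32) {                    /* spill the full word */
        if (s->nwords == s->cap) {
            size_t ncap = s->cap ? s->cap * 2 : 16;
            uint64_t *p = realloc(s->words, ncap * sizeof *p);
            if (p == NULL) return -1;
            s->words = p; s->cap = ncap;
        }
        s->words[s->nwords++] = s->cur;
        s->cur = 0; s->n = 0;
    }
    s->cur = (s->cur << 2) | (v & 3u);   /* newest element in low bits */
    s->n++;
    return 0;
}

static unsigned bs_pop(struct bitstack *s)
{
    if (s->n == 0) {                     /* refill from spilled words */
        s->cur = s->words[--s->nwords];
        s->n = 32;
    }
    unsigned v = (unsigned)(s->cur & 3u);
    s->cur >>= 2;
    s->n--;
    return v;
}
```

Whether keeping the hot end of the stack in a register beats a plain byte array would need measuring, as the posts here note.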
The few measurements I have done don't show a big difference in
performance between the two methods. But I admit I wasn't paying
close attention, and, like I said, only a few filling patterns were
exercised.
After implementing the first enhancement I noticed that at 4K size
the timing (per pixel) for a few of my test cases was significantly
worse than at smaller image sizes. So I added another enhancement,
aimed at minimizing cache-thrashing effects, by never looking back
at the immediate parent of the current block. The info about the
location of the parent fitted nicely into the remaining 2 bits of
the stack octet.
The idea of not going back to the originator (what you call the
parent) is something I developed independently before looking at
your latest code (and mostly I still haven't). Seems like a good
idea.
Two further comments.
One, the new code is a lot more complicated than the previous
code. I'm not sure the performance gain is worth the cost
in complexity. What kind of speed improvements do you see,
in terms of percent?
Two, and more important, the new algorithm still uses O(NxN) memory
for an N by N pixel field. We really would like to get that down to
O(N) memory (and of course run just as fast). Have you looked into
how that might be done?
Michael S <already5chosen@yahoo.com> writes:
[...]
The most robust code that I have found so far, performing well with
small pictures as well as with large and huge ones, is a variation
on the same theme of an explicit stack, perhaps more properly called
trace-back. It operates on 2x2 squares instead of individual pixels.
The worst-case auxiliary memory footprint of this variant is rather
big, up to picture_size/4 bytes. The code is *not* simple, but the
complexity appears to be necessary for robust performance with
various shapes and sizes.
[...]
I took a cursory look just now, after reading your other, later
posting. I think I have a general sense of it, especially in
conjunction with the explanatory comments.
I'm still hoping to find a method that is both fast and has
good memory use, which is to say O(N) for an NxN pixel field.
Something that would help is to have a library of test cases,
by which I mean patterns to be colored, so that a set of
methods could be tried, and timed, over all the patterns in
the library. Do you have something like that? So far all
my testing has been ad hoc.
Incidentally, it looks like your code assumes X varies more rapidly
than Y, so a "by row" order, whereas my code assumes Y varies more
rapidly than X, a "by column" order.
The difference doesn't matter
as long as the pixel field is square and the test cases either are
symmetric about the X == Y axis or duplicate a non-symmetric pattern
about the X == Y axis. I would like to be able to run comparisons
between different methods and get usable results without having
to jump around because of different orientations. I'm not sure
how to accommodate that.
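One simple way to accommodate it (just a thought, not something from the thread) is to transpose the pixel field before handing it to the other implementation, so both effectively process the same shape:

```c
#include <stddef.h>

/* Sketch: transpose a w x h row-major pixel field into an h x w
   destination, so a "by row" fill and a "by column" fill can be
   timed on effectively the same pattern. */
static void transpose(const unsigned char *src, unsigned char *dst,
                      size_t w, size_t h)
{
    for (size_t y = 0; y < h; y++)
        for (size_t x = 0; x < w; x++)
            dst[x * h + y] = src[y * w + x];
}
```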
On Thu, 28 Mar 2024 23:04:36 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Intrigued by your idea, I wrote something along the same lines,
only shorter and (at least for me) a little easier to grok.
If someone is interested I can post it.
If non-trivially different, why not?
On Fri, 29 Mar 2024 23:58:26 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
One, the new code is a lot more complicated than the previous
code. I'm not sure the performance gain is worth the cost
in complexity. What kind of speed improvements do you see,
in terms of percent?
On my 11-year-old (and not top-of-the-line even then) home PC, for
a 4K image (3840 x 2160) with the cross-in-cross shape that I took
from one of your previous posts, it is 2.43 times faster.
I don't remember how it compares on more modern systems. Anyway,
right now I have no test systems more modern than a 3-year-old
Zen3.
On Sat, 30 Mar 2024 00:54:19 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
[...]
Something that would help is to have a library of test cases,
by which I mean patterns to be colored, so that a set of
methods could be tried, and timed, over all the patterns in
the library. Do you have something like that? So far all
my testing has been ad hoc.
I am not 100% sure about the meaning of 'ad hoc', but I'd guess
that mine are ad hoc too. Below are the shapes that I use apart
from solid rectangles. I run them at 5 sizes: 25x19, 200x200,
1280x720, 1920x1080, 3840x2160. That is certainly not enough for
correctness tests, but I feel that it is sufficient for speed
tests.
[code]
On Fri, 29 Mar 2024 23:58:26 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
I did program in FORTRAN briefly but don't remember ever using
computed GO TO. And yes, I found that missing semicolon and put it
back. Is there some reason you don't always use -pedantic? I
pretty much always do.
Just a habit.
In "real" work, as opposed to hobby, I use gcc almost exclusively
for small embedded targets, and quite often with third-party
libraries in source form. In such an environment, raising the
warning level above -Wall would be counterproductive, because it
would be hard to see the relevant warning behind a wall of false
alarms.
Maybe for hobby work, where I have full control over everything,
switching to -Wpedantic is not a bad idea.
The idea of not going back to the originator (what you call the
parent) is something I developed independently before looking at
your latest code (and mostly I still haven't). Seems like a good
idea.
I call it the principle of Lot's wife.
That is yet another reason not to grow blocks above 2x2. For bigger
blocks it does not apply.
On Sat, 30 Mar 2024 21:15:06 +0300
Michael S <already5chosen@yahoo.com> wrote:
On Fri, 29 Mar 2024 23:58:26 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
One, the new code is a lot more complicated than the previous
code. I'm not sure the performance gain is worth the cost
in complexity. What kind of speed improvements do you see,
in terms of percent?
On my 11-year-old (and not top-of-the-line even then) home PC, for
a 4K image (3840 x 2160) with the cross-in-cross shape that I took
from one of your previous posts, it is 2.43 times faster.
I don't remember how it compares on more modern systems. Anyway,
right now I have no test systems more modern than a 3-year-old
Zen3.
I tested on newer hardware - Intel Coffee Lake (Xeon-E 2176G) and AMD
Zen3 (EPYC 7543P).
Here I no longer see a significant drop in the speed of the 1x1
variant at 4K size, but I still see that the more complicated
variant provides a nice speedup: up to 1.56x on Coffee Lake and up
to 3x on Zen3.