• Re: filling area by color atack safety

    From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.lang.c on Thu Mar 28 22:51:21 2024
    From Newsgroup: comp.lang.c

    scott@slp53.sl.home (Scott Lurndal) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    scott@slp53.sl.home (Scott Lurndal) writes:

    Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:

    The convetional wisdom is the opposite, But here, conventional wisdom >>>>> fails. Because heaps are unlimited while stacks are not.

    That's not actually true. The size of both are bounded, yes.

    It's certainly possible (in POSIX, anyway) for the stack bounds
    to be unlimited (given sufficient real memory and/or backing
    store) and the heap size to be bounded. See 'setrlimit'.

    The sizes of both heaps and stacks are bounded, because
    pointers have a fixed number of bits. Certainly these
    sizes can be very very large, but they are not unbounded.

    I was referring to the term of art used in POSIX, where
    unlimited simply means that the operating system doesn't
    limit them [.. elaboration ..]

    The earlier sentence was confusing, as the sentence construction
    suggested "unlimited" was a general term rather than one with a
    specific meaning in POSIX. In any case thank you for the
    education.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.lang.c on Thu Mar 28 23:04:36 2024
    From Newsgroup: comp.lang.c

    Michael S <already5chosen@yahoo.com> writes:

    [..various fill algorithms and how they scale..]

    One thing that I were not expecting at this bigger pictures, is good performance of simple recursive algorithm. Of course, not of original
    form of it, but of variation with explicit stack.
    For many shapes it has quite large memory footprint and despite that it
    is not slow. Probably the stack has very good locality of reference.

    [algorithm]

    You are indeed a very clever fellow. I'm impressed.

    Intrigued by your idea, I wrote something along the same lines,
    only shorter and (at least for me) a little easier to grok.
    If someone is interested I can post it.

    I see you have also done a revised algorithm based on the same
    idea, but more elaborate (to save on runtime footprint?).
    Still working on formulating a response to that one...
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.lang.c on Fri Mar 29 15:21:41 2024
    From Newsgroup: comp.lang.c

    On Thu, 28 Mar 2024 23:04:36 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Michael S <already5chosen@yahoo.com> writes:

    [..various fill algorithms and how they scale..]

    One thing that I were not expecting at this bigger pictures, is good performance of simple recursive algorithm. Of course, not of
    original form of it, but of variation with explicit stack.
    For many shapes it has quite large memory footprint and despite
    that it is not slow. Probably the stack has very good locality of reference.

    [algorithm]

    You are indeed a very clever fellow. I'm impressed.


    Yes, the use of switch is clever :(
    It more resemble computed GO TO in old FORTRAN or indirect jumps in asm
    than idiomatic C switch. But it is a legal* C.


    Intrigued by your idea, I wrote something along the same lines,
    only shorter and (at least for me) a little easier to grok.
    If someone is interested I can post it.

    If non-trivially different, why not?


    I see you have also done a revised algorithm based on the same
    idea, but more elaborate (to save on runtime footprint?).
    Still working on formulating a response to that one...

    The original purpose of enhancement was to amortize non-trivial and
    probably not very fast call stack emulation logic over more than one
    pixel. 2x2 just happens to be the biggest block that still has very
    simple in-block recoloring logic. ~4x reduction in the size of
    auxiliary memory is just a pleasant side effect.

    Exactly the same 4x reduction in memory size could have been achieved
    with single-pixel variant by using packed array for 2-bit state
    (==trace back) stack elements. But it would be the same or slower than
    original while the enhanced variant is robustly faster than original.

    After implementing the first enhancement I paid attention that at 4K
    size the timing (per pixel) for few of my test cases is significantly
    worse than at smaller images. So, I added another enhancement aiming to minimize cache trashing effects by never looking back at immediate
    parent of current block. The info about location of the parent nicely
    fitted into remaining 2 bits of stack octet.


    ------
    * - the versions I posted are not exactly legal C; they are illegal C non-rejected by gcc. But they can be trivially made into legal C by
    adding semicolon after one of the case labels.



    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.lang.c on Fri Mar 29 23:58:26 2024
    From Newsgroup: comp.lang.c

    Michael S <already5chosen@yahoo.com> writes:

    On Thu, 28 Mar 2024 23:04:36 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Michael S <already5chosen@yahoo.com> writes:

    [..various fill algorithms and how they scale..]

    One thing that I were not expecting at this bigger pictures, is
    good performance of simple recursive algorithm. Of course, not of
    original form of it, but of variation with explicit stack. For
    many shapes it has quite large memory footprint and despite that
    it is not slow. Probably the stack has very good locality of
    reference.

    [algorithm]

    You are indeed a very clever fellow. I'm impressed.

    Yes, the use of switch is clever :(

    In my view the cleverness is how "recursion" is accomplished by a
    means of a combination of using a stack to store a "return address"
    and restoring state by undoing a change rather than storing the
    old value. Using a switch() is just a detail (and to my way of
    thinking how the switch() is done here obscures the basic idea and
    makes the code harder to understand, but never mind that).

    It more resemble computed GO TO in old FORTRAN or indirect jumps
    in asm than idiomatic C switch. But it is a legal* C.

    I did program in FORTRAN briefly but don't remember ever using
    computed GO TO. And yes, I found that missing semicolon and put it
    back. Is there some reason you don't always use -pedantic? I
    pretty much always do.

    Intrigued by your idea, I wrote something along the same lines,
    only shorter and (at least for me) a little easier to grok.
    If someone is interested I can post it.

    If non-trivially different, why not?

    I hope to soon but am unable to right now (and maybe for a week
    or so due to circumstances beyond my control). For sure the
    code is different; whether it is non-trivially different I
    leave for others to judge.

    I see you have also done a revised algorithm based on the same
    idea, but more elaborate (to save on runtime footprint?).
    Still working on formulating a response to that one...

    The original purpose of enhancement was to amortize non-trivial
    and probably not very fast call stack emulation logic over more
    than one pixel. 2x2 just happens to be the biggest block that
    still has very simple in-block recoloring logic. ~4x reduction in
    the size of auxiliary memory is just a pleasant side effect.

    Exactly the same 4x reduction in memory size could have been
    achieved with single-pixel variant by using packed array for 2-bit
    state (==trace back) stack elements. But it would be the same or
    slower than original while the enhanced variant is robustly faster
    than original.

    An alternate idea is to use a 64-bit integer for 32 "top of stack"
    elements, or up to 32 I should say, and a stack with 64-bit values.
    Just an idea, it may not turn out to be useful.

    The few measurements I have done don't show a big difference in
    performance between the two methods. But I admit I wasn't paying
    close attention, and like I said only a few patterns of filling were
    exercised.

    After implementing the first enhancement I paid attention that at
    4K size the timing (per pixel) for few of my test cases is
    significantly worse than at smaller images. So, I added another
    enhancement aiming to minimize cache trashing effects by never
    looking back at immediate parent of current block. The info about
    location of the parent nicely fitted into remaining 2 bits of
    stack octet.

    The idea of not going back to the originator (what you call the
    parent) is something I developed independently before looking at
    your latest code (and mostly I still haven't). Seems like a good
    idea.

    Two further comments.

    One, the new code is a lot more complicated than the previous
    code. I'm not sure the performance gain is worth the cost
    in complexity. What kind of speed improvements do you see,
    in terms of percent?

    Two, and more important, the new algorithm still uses O(NxN) memory
    for an N by N pixel field. We really would like to get that down to
    O(N) memory (and of course run just as fast). Have you looked into
    how that might be done?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.lang.c on Sat Mar 30 00:54:19 2024
    From Newsgroup: comp.lang.c

    Michael S <already5chosen@yahoo.com> writes:

    [...]

    The most robust code that I found so far that performs well both
    with small pictures and with large and huge, is a variation on the
    same theme of explicit stack, may be, more properly called trace
    back. It operates on 2x2 squares instead of individual pixels.

    The worst case auxiliary memory footprint of this variant is rather
    big, up to picture_size/4 bytes. The code is *not* simple, but
    complexity appears to be necessary for robust performance with
    various shapes and sizes.

    [...]

    I took a cursory look just now, after reading your other later
    posting. I think I have a general sense, especially in conjunction
    with the explanatory comments.

    I'm still hoping to find a method that is both fast and has
    good memory use, which is to say O(N) for an NxN pixel field.

    Something that would help is to have a library of test cases,
    by which I mean patterns to be colored, so that a set of
    methods could be tried, and timed, over all the patterns in
    the library. Do you have something like that? So far all
    my testing has been ad hoc.

    Incidentally, it looks like your code assumes X varies more rapidly
    than Y, so a "by row" order, whereas my code assumes Y varies more
    rapidly than X, a "by column" order. The difference doesn't matter
    as long as the pixel field is square and the test cases either are
    symmetric about the X == Y axis or duplicate a non-symmetric pattern
    about the X == Y axis. I would like to be able to run comparisons
    between different methods and get usable results without having
    to jump around because of different orientations. I'm not sure
    how to accommodate that.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.lang.c on Sat Mar 30 21:15:06 2024
    From Newsgroup: comp.lang.c

    On Fri, 29 Mar 2024 23:58:26 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:


    I did program in FORTRAN briefly but don't remember ever using
    computed GO TO. And yes, I found that missing semicolon and put it
    back. Is there some reason you don't always use -pedantic? I
    pretty much always do.



    Just a habit.
    In "real" work, as opposed to hobby, I use gcc almost exclusively for
    small embedded targets and quite often with 3-rd party libraries in
    source form. In such environment rising warnings level above -Wall
    would be counterproductive, because it would be hard to see relevant
    warning behind walls of false alarms.
    May be, for hobby, where I have full control on everything, switching
    to -Wpedantic is not a bad idea.



    An alternate idea is to use a 64-bit integer for 32 "top of stack"
    elements, or up to 32 I should say, and a stack with 64-bit values.
    Just an idea, it may not turn out to be useful.


    That's just a detail of how to do pack/unpack with minimal
    overhead. It does not change the principle that 'packed' version would
    be less memory hungry but on modern PC with GBs of RAM it will not be
    faster than original.
    Memory footprint can directly affect speed when access patterns have
    poor locality or when the rate of access exceeds 10-20 GB/s. In our
    case locality of stack access is very good and the rate of stack
    access, even on ultra fast processor, is less than 1 GB/s.


    The few measurements I have done don't show a big difference in
    performance between the two methods. But I admit I wasn't paying
    close attention, and like I said only a few patterns of filling were exercised.

    After implementing the first enhancement I paid attention that at
    4K size the timing (per pixel) for few of my test cases is
    significantly worse than at smaller images. So, I added another enhancement aiming to minimize cache trashing effects by never
    looking back at immediate parent of current block. The info about
    location of the parent nicely fitted into remaining 2 bits of
    stack octet.

    The idea of not going back to the originator (what you call the
    parent) is something I developed independently before looking at
    your latest code (and mostly I still haven't). Seems like a good
    idea.


    I call it a principle of Lot's wife.
    That is yet another reason to not grow blocks above 2x2.
    For bigger blocks it does not apply.


    Two further comments.

    One, the new code is a lot more complicated than the previous
    code. I'm not sure the performance gain is worth the cost
    in complexity. What kind of speed improvements do you see,
    in terms of percent?


    On my 11 y.o. and not top-of-the-line even then home PC for 4K
    image (3840 x 2160) with cross-in-cross shape that I took from one of
    your previous post, it is 2.43 times faster.
    I don't remember how it compares on more modern systems. Anyway, right
    now I have no test systems more modern than 3 y.o. Zen3.


    Two, and more important, the new algorithm still uses O(NxN) memory
    for an N by N pixel field. We really would like to get that down to
    O(N) memory (and of course run just as fast). Have you looked into
    how that might be done?

    Using this particular principle of not saving (x,y) in auxiliary
    storage, I don't believe that it is possible to have a footprint
    smaller than O(W*H).




    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.lang.c on Sat Mar 30 21:26:57 2024
    From Newsgroup: comp.lang.c

    On Sat, 30 Mar 2024 00:54:19 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Michael S <already5chosen@yahoo.com> writes:

    [...]

    The most robust code that I found so far that performs well both
    with small pictures and with large and huge, is a variation on the
    same theme of explicit stack, may be, more properly called trace
    back. It operates on 2x2 squares instead of individual pixels.

    The worst case auxiliary memory footprint of this variant is rather
    big, up to picture_size/4 bytes. The code is *not* simple, but
    complexity appears to be necessary for robust performance with
    various shapes and sizes.

    [...]

    I took a cursory look just now, after reading your other later
    posting. I think I have a general sense, especially in conjunction
    with the explanatory comments.

    I'm still hoping to find a method that is both fast and has
    good memory use, which is to say O(N) for an NxN pixel field.

    Something that would help is to have a library of test cases,
    by which I mean patterns to be colored, so that a set of
    methods could be tried, and timed, over all the patterns in
    the library. Do you have something like that? So far all
    my testing has been ad hoc.


    I am not 100% sure about the meaning of 'ad hoc', but I'd guess that
    mine are ad hoc too. Below are shapes that I use apart from solid
    rectangles. I run them at 5 sizes: 25x19, 200x200, 1280x720, 1920x1080, 3840x2160. That is certainly not enough for correction tests, but feel
    that it is sufficient for speed tests.

    static void make_standing_snake(
    unsigned char *image,
    int width, int height,
    unsigned char background_c,
    unsigned char pen_c)
    {
    for (int y = 0; y < height; ++y) {
    unsigned char* p = &image[y*width];
    if (y % 2 == 0) {
    memset(p, pen_c, width);
    } else {
    memset(p, background_c, width);
    if (y % 4 == 1)
    p[width-1] = pen_c;
    else
    p[0] = pen_c;
    }
    }
    }

    static void make_prostrate_snake(
    unsigned char *image,
    int width, int height,
    unsigned char background_c,
    unsigned char pen_c)
    {
    memset(image, background_c, sizeof(*image)*width*height);
    // vertical bars
    for (int y = 0; y < height; ++y)
    for (int x = 0; x < width; x += 2)
    image[y*width+x] = pen_c;

    // connect bars at top
    for (int x = 3; x < width; x += 4)
    image[x] = pen_c;

    // connect bars at bottom
    for (int x = 1; x < width; x += 4)
    image[(height-1)*width+x] = pen_c;
    }


    static void make_slalom(
    unsigned char *image,
    int width, int height,
    unsigned char background_c,
    unsigned char pen_c)
    {
    const int n_col = width/3;
    const int n_row = (height-3)/4;

    // top row
    // P B B P P P
    for (int col = 0; col < n_col; ++col) {
    unsigned char c = (col & 1)==0 ? background_c : pen_c;
    image[col*3] = pen_c; image[col*3+1] = c; image[col*3+2] = c;
    }
    for (int x = n_col*3; x < width; ++x)
    image[x] = image[n_col*3-1];

    // main image: consists of 3x4 blocks filled by following pattern
    // P B B
    // P P B
    // B P B
    // P P B
    for (int row = 0; row < n_row; ++row) {
    for (int col = 0; col < n_col; ++col) {
    unsigned char* p = &image[(row*4+1)*width+col*3];
    p[0] = pen_c; p[1] = background_c; p[2] = background_c; p
    += width; p[0] = pen_c; p[1] = pen_c; p[2] =
    background_c; p += width; p[0] = background_c; p[1] = pen_c;
    p[2] = background_c; p += width; p[0] = pen_c; p[1] = pen_c;
    p[2] = background_c; p += width; }
    }

    // near-bottom rows
    // P B B
    for (int y = n_row*4+1; y < height-1; ++y) {
    for (int col = 0; col < n_col; ++col) {
    unsigned char* p = &image[y*width+col*3];
    p[0] = pen_c; p[1] = background_c; p[2] = background_c;
    }
    }

    // bottom row - all P
    // P P P P B B
    unsigned char *b_row = &image[width*(height-1)];
    for (int col = 0; col < n_col; ++col) {
    unsigned char c = (col & 1)==1 ? background_c : pen_c;
    b_row[col*3+0] = pen_c;
    b_row[col*3+1] = c;
    b_row[col*3+2] = c;
    }
    for (int x = n_col*3; x < width; ++x)
    b_row[x] = b_row[n_col*3-1];

    // rightmost columns
    for (int x = n_col*3; x < width; ++x) {
    for (int y = 1; y < height-1; ++y)
    image[y*width+x] = background_c;
    }
    }

    static void make_slalom90(
    unsigned char *image,
    int width, int height,
    unsigned char background_c,
    unsigned char pen_c)
    {
    const int n_col = (width-3)/4;
    const int n_row = height/3;

    // leftmost column
    // P
    // B
    // B
    // P
    // P
    // P
    for (int row = 0; row < n_row; ++row) {
    unsigned char c = (row & 1)==0 ? background_c : pen_c;
    image[(row*3+0)*width] = pen_c;
    image[(row*3+1)*width] = c;
    image[(row*3+2)*width] = c;
    }
    for (int y = n_row*3; y < height; ++y)
    image[y*width] = image[(n_row*3-1)*width];

    // main image: consists of 4x3 blocks filled by following pattern
    // P P B P
    // B P P P
    // B B B B
    for (int row = 0; row < n_row; ++row) {
    for (int col = 0; col < n_col; ++col) {
    unsigned char* p = &image[(row*3*width)+(col*4+1)];
    p[0] = pen_c; p[1] = pen_c; p[2] = background_c;
    p[3] = pen_c; p += width; p[0] = background_c; p[1] = pen_c;
    p[2] = pen_c; p[3] = pen_c; p += width; p[0] = background_c;
    p[1] = background_c; p[2] = background_c; p[3] = background_c; }
    }

    // near-rightmost column
    // P
    // B
    // B
    for (int row = 0; row < n_row; ++row) {
    for (int x = n_col*4+1; x < width-1; ++x) {
    unsigned char* p = &image[row*width*3+x];
    p[0*width] = pen_c;
    p[1*width] = background_c;
    p[2*width] = background_c;
    }
    }

    // rightmost column
    // P
    // P
    // P
    // P
    // B
    // B
    unsigned char *r_col = &image[width-1];
    for (int row = 0; row < n_row; ++row) {
    unsigned char c = (row & 1)==1 ? background_c : pen_c;
    r_col[(row*3+0)*width] = pen_c;
    r_col[(row*3+1)*width] = c;
    r_col[(row*3+2)*width] = c;
    }
    for (int y = n_row*3; y < height; ++y)
    r_col[y*width] = r_col[(n_row*3-1)*width];

    // bottom rows
    for (int y = n_row*3; y < height; ++y) {
    for (int x = 1; x < width-1; ++x)
    image[y*width+x] = background_c;
    }
    }

    static void make_crosss_in_cross(
    unsigned char* image,
    int width,
    int height,
    int xc,
    int yc,
    unsigned char background_c,
    unsigned char pen_c)
    {
    memset(image, pen_c, width*height);

    if (xc > 1 && xc+1 < width-1 && yc > 1 && yc+1 < height-1) {
    memset(&image[(yc-1)*width+1], background_c, xc-1);
    memset(&image[(yc+1)*width+1], background_c, xc-1);
    memset(&image[(yc-1)*width+xc+1], background_c, width-xc-2);
    memset(&image[(yc+1)*width+xc+1], background_c, width-xc-2);
    for (int y = 1; y < yc; ++y) {
    image[y*width+xc-1] = background_c;
    image[y*width+xc+1] = background_c;
    }
    for (int y = yc+1; y < height-1; ++y) {
    image[y*width+xc-1] = background_c;
    image[y*width+xc+1] = background_c;
    }
    }
    }


    Incidentally, it looks like your code assumes X varies more rapidly
    than Y, so a "by row" order, whereas my code assumes Y varies more
    rapidly than X, a "by column" order.

    It is not so much about what I assume as about what is cheaper for
    CPU hardware.

    The difference doesn't matter
    as long as the pixel field is square and the test cases either are
    symmetric about the X == Y axis or duplicate a non-symmetric pattern
    about the X == Y axis. I would like to be able to run comparisons
    between different methods and get usable results without having
    to jump around because of different orientations. I'm not sure
    how to accommodate that.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.lang.c on Sat Mar 30 11:59:03 2024
    From Newsgroup: comp.lang.c

    Michael S <already5chosen@yahoo.com> writes:

    On Thu, 28 Mar 2024 23:04:36 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Intrigued by your idea, I wrote something along the same lines,
    only shorter and (at least for me) a little easier to grok.
    If someone is interested I can post it.

    If non-trivially different, why not?

    Here is the code:

    void
    stack_fill( UI w0, UI h0, UC pixels[], Point p0, UC old, UC new ){
    U64 w = ( assert( w0 > 0 ), w0 );
    U64 h = ( assert( h0 > 0 ), h0 );
    U64 px = ( assert( p0.x < w ), p0.x );
    U64 py = ( assert( p0.y < h ), p0.y );

    UC *x0 = ( assert( pixels ), pixels );
    UC *x = x0 + px*h;
    UC *xm = x0 + h*w - h;

    U64 y0 = 0;
    U64 y = py;
    U64 ym = h-1;

    UC *s0 = malloc( sizeof *s0 );
    UC *s = s0;
    UC *sn = s0 ? s0+1 : s0;

    if( s0 && x[y] == old ) do {
    x[y] = new;
    if( s == sn ){
    U64 s_offset = s - s0;
    U64 n = (sn-s0+1) *3 /2;
    UC *new_s0 = realloc( s0, n * sizeof *new_s0 );

    if( ! new_s0 ) break;
    s0 = new_s0, s = s0 + s_offset, sn = s0 + n;
    }

    if( y < ym && x[y+1] == old ){
    y += 1, *s++ = 2; continue; UNDO_UP:
    y -= 1;
    }
    if( y > y0 && x[y-1] == old ){
    y -= 1, *s++ = 3; continue; UNDO_DOWN:
    y += 1;
    }
    if( x < xm && y[x+h] == old ){
    x += h, *s++ = 0; continue; UNDO_LEFT:
    x -= h;
    }
    if( x > x0 && y[x-h] == old ){
    x -= h, *s++ = 1; continue; UNDO_RIGHT:
    x += h;
    }

    if( s == s0 ) break;

    switch( *--s & 3 ){
    case 0: goto UNDO_LEFT;
    case 1: goto UNDO_RIGHT;
    case 2: goto UNDO_UP;
    case 3: goto UNDO_DOWN;
    }
    } while( 1 );

    free( s0 );
    }
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.lang.c on Sun Mar 31 10:54:38 2024
    From Newsgroup: comp.lang.c

    On Sat, 30 Mar 2024 21:15:06 +0300
    Michael S <already5chosen@yahoo.com> wrote:

    On Fri, 29 Mar 2024 23:58:26 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:


    One, the new code is a lot more complicated than the previous
    code. I'm not sure the performance gain is worth the cost
    in complexity. What kind of speed improvements do you see,
    in terms of percent?


    On my 11 y.o. and not top-of-the-line even then home PC for 4K
    image (3840 x 2160) with cross-in-cross shape that I took from one of
    your previous post, it is 2.43 times faster.
    I don't remember how it compares on more modern systems. Anyway, right
    now I have no test systems more modern than 3 y.o. Zen3.



    I tested on newer hardware - Intel Coffee Lake (Xeon-E 2176G) and AMD
    Zen3 (EPYC 7543P).
    Here I no longer see significant drop in speed of the 1x1 variant at 4K
    size, but I still see that more complicated variant provides nice speed
    up. Up to 1.56x on Coffee Lake and up to 3x on Zen3.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.lang.c on Tue Apr 9 01:00:34 2024
    From Newsgroup: comp.lang.c

    Michael S <already5chosen@yahoo.com> writes:

    On Sat, 30 Mar 2024 00:54:19 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
    [...]
    Something that would help is to have a library of test cases,
    by which I mean patterns to be colored, so that a set of
    methods could be tried, and timed, over all the patterns in
    the library. Do you have something like that? So far all
    my testing has been ad hoc.

    I am not 100% sure about the meaning of 'ad hoc', but I'd guess
    that mine are ad hoc too. Below are shapes that I use apart from
    solid rectangles. I run them at 5 sizes: 25x19, 200x200,
    1280x720, 1920x1080, 3840x2160. That is certainly not enough for
    correction tests, but feel that it is sufficient for speed tests.

    [code]

    I got these, thank you.

    Here is a pattern generating function I wrote for my own
    testing. Disclaimer: slightly changed from my original
    source, hopefully any errors inadvertently introduced can
    be corrected easily. Also, it uses the value 0 for the
    background and the value 1 for the pattern to be colored.

    #include <math.h>
    #include <stddef.h>
    #include <string.h>

    typedef unsigned char Pixel;

    extern void
    ellipse_with_hole( Pixel *field, unsigned w, unsigned h ){
    size_t i, j;
    double wc = w/2, hc = h/2;

    double a = (w > h ? wc : hc) -1;
    double b = (w > h ? hc : wc) -1;

    double b3 = 1+6*b/8;
    double radius = b/2.5;
    double cx = w > h ? wc : b3+1;
    double cy = w > h ? b3+1 : hc;

    double focus = sqrt( a*a - b*b );
    double f1x = w > h ? wc - focus : wc;
    double f1y = w > h ? hc : hc - focus;
    double f2x = w > h ? wc + focus : wc;
    double f2y = w > h ? hc : hc + focus;

    memset( field, 0, w*h );

    for( i = 0; i < w; i++ ){
    for( j = 0; j < h; j++ ){
    double dx = i - cx, dy = j - cy;
    double r2 = radius * radius;
    if( dx * dx + dy*dy <= r2 ) continue;
    double dx1 = i - f1x, dy1 = j - f1y;
    double dx2 = i - f2x, dy2 = j - f2y;
    double sum2 = a*2;
    double d1 = sqrt( dx1*dx1 + dy1*dy1 );
    double d2 = sqrt( dx2*dx2 + dy2*dy2 );
    if( d1 + d2 > 2*a ) continue;
    field[ i+j*w ] = 1;
    }}
    }
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.lang.c on Tue Apr 9 01:55:39 2024
    From Newsgroup: comp.lang.c

    Michael S <already5chosen@yahoo.com> writes:

    On Fri, 29 Mar 2024 23:58:26 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    I did program in FORTRAN briefly but don't remember ever using
    computed GO TO. And yes, I found that missing semicolon and put it
    back. Is there some reason you don't always use -pedantic? I
    pretty much always do.

    Just a habit.
    In "real" work, as opposed to hobby, I use gcc almost exclusively for
    small embedded targets and quite often with 3-rd party libraries in
    source form. In such environment rising warnings level above -Wall
    would be counterproductive, because it would be hard to see relevant
    warning behind walls of false alarms.
    May be, for hobby, where I have full control on everything, switching
    to -Wpedantic is not a bad idea.

    My experience with third party libraries is that sometimes they use
    extensions, probably mostly gcc-isms. Not much to be done in that
    case. Of course turning on -pedantic could be done selectively.

    It might be worth an experiment of turning off -Wall while turning
    on -pedantic to see how big or how little the problem is.

    The idea of not going back to the originator (what you call the
    parent) is something I developed independently before looking at
    your latest code (and mostly I still haven't). Seems like a good
    idea.

    I call it a principle of Lot's wife.
    That is yet another reason to not grow blocks above 2x2.
    For bigger blocks it does not apply.

    Here is an updated version of my "stacking" code. On my test
    system (and I don't even know exactly what CPU it has, probably
    about 5 years old) this code runs about 30% faster than your 2x2
    version, averaged across all patterns and all sizes above the
    smallest ones (25x19 and 19x25).

    #include <assert.h>
    #include <stdlib.h>

    typedef unsigned char UC, Color;
    typedef size_t Index, Count;
    typedef struct { Index x, y; } Point;

    extern Count
    stack_plus( UC *field, Index w, Index h, Point p0, Color old, Color new ){
    Index px = ( assert( p0.x < w ), p0.x );
    Index py = ( assert( p0.y < h ), p0.y );

    Index x0 = 0;
    Index x = px;
    Index xm = w-1;

    UC *y0 = field;
    UC *y = y0 + py*w;
    UC *ym = y0 + h*w - w;

    UC *s0 = malloc( 8 * sizeof *s0 );
    UC *s = s0;
    UC *sn = s0 ? s0+8 : s0;

    Count r = 0;

    if( s0 ) goto START_FOUR;

    while( s != s0 ){
    switch( *--s & 15 ){
    case 0: goto UNDO_START_LEFT;
    case 1: goto UNDO_START_RIGHT;
    case 2: goto UNDO_START_UP;
    case 3: goto UNDO_START_DOWN;

    case 4: goto UNDO_LEFT_DOWN;
    case 5: goto UNDO_LEFT_LEFT;
    case 6: goto UNDO_LEFT_UP;

    case 7: goto UNDO_UP_LEFT;
    case 8: goto UNDO_UP_UP;
    case 9: goto UNDO_UP_RIGHT;

    case 10: goto UNDO_RIGHT_UP;
    case 11: goto UNDO_RIGHT_RIGHT;
    case 12: goto UNDO_RIGHT_DOWN;

    case 13: goto UNDO_DOWN_RIGHT;
    case 14: goto UNDO_DOWN_DOWN;
    case 15: goto UNDO_DOWN_LEFT;
    }

    START_FOUR:
    if( y[x] != old ) continue;
    y[x] = new; r++;
    if( x < xm && y[x+1] == old ){
    x += 1, *s++ = 0; goto START_LEFT; UNDO_START_LEFT:
    x -= 1;
    }
    if( x > x0 && y[x-1] == old ){
    x -= 1, *s++ = 1; goto START_RIGHT; UNDO_START_RIGHT:
    x += 1;
    }
    if( y < ym && x[y+w] == old ){
    y += w, *s++ = 2; goto START_UP; UNDO_START_UP:
    y -= w;
    }
    if( y > y0 && x[y-w] == old ){
    y -= w, *s++ = 3; goto START_DOWN; UNDO_START_DOWN:
    y += w;
    }
    continue;

    START_LEFT:
    y[x] = new; r++;
    if( s == sn ){
    Index s_offset = s - s0;
    Index n = (sn-s0+1) *3 /2;
    UC *new_s0 = realloc( s0, n * sizeof *new_s0 );

    if( ! new_s0 ) break;
    s0 = new_s0, s = s0 + s_offset, sn = s0 + n;
    }
    if( x < xm && y[x+1] == old ){
    x += 1, *s++ = 5; goto START_LEFT; UNDO_LEFT_LEFT:
    x -= 1;
    }
    if( y > y0 && x[y-w] == old ){
    y -= w, *s++ = 4; goto START_DOWN; UNDO_LEFT_DOWN:
    y += w;
    }
    if( y < ym && x[y+w] == old ){
    y += w, *s++ = 6; goto START_UP; UNDO_LEFT_UP:
    y -= w;
    }
    continue;

    START_UP:
    y[x] = new; r++;
    if( s == sn ){
    Index s_offset = s - s0;
    Index n = (sn-s0+1) *3 /2;
    UC *new_s0 = realloc( s0, n * sizeof *new_s0 );

    if( ! new_s0 ) break;
    s0 = new_s0, s = s0 + s_offset, sn = s0 + n;
    }
    if( x < xm && y[x+1] == old ){
    x += 1, *s++ = 7; goto START_LEFT; UNDO_UP_LEFT:
    x -= 1;
    }
    if( x > x0 && y[x-1] == old ){
    x -= 1, *s++ = 9; goto START_RIGHT; UNDO_UP_RIGHT:
    x += 1;
    }
    if( y < ym && x[y+w] == old ){
    y += w, *s++ = 8; goto START_UP; UNDO_UP_UP:
    y -= w;
    }
    continue;

    START_RIGHT:
    y[x] = new; r++;
    if( s == sn ){
    Index s_offset = s - s0;
    Index n = (sn-s0+1) *3 /2;
    UC *new_s0 = realloc( s0, n * sizeof *new_s0 );

    if( ! new_s0 ) break;
    s0 = new_s0, s = s0 + s_offset, sn = s0 + n;
    }
    if( x > x0 && y[x-1] == old ){
    x -= 1, *s++ = 11; goto START_RIGHT; UNDO_RIGHT_RIGHT:
    x += 1;
    }
    if( y < ym && x[y+w] == old ){
    y += w, *s++ = 10; goto START_UP; UNDO_RIGHT_UP:
    y -= w;
    }
    if( y > y0 && x[y-w] == old ){
    y -= w, *s++ = 12; goto START_DOWN; UNDO_RIGHT_DOWN:
    y += w;
    }
    continue;

    START_DOWN:
    y[x] = new; r++;
    if( s == sn ){
    Index s_offset = s - s0;
    Index n = (sn-s0+1) *3 /2;
    UC *new_s0 = realloc( s0, n * sizeof *new_s0 );

    if( ! new_s0 ) break;
    s0 = new_s0, s = s0 + s_offset, sn = s0 + n;
    }
    if( x > x0 && y[x-1] == old ){
    x -= 1, *s++ = 13; goto START_RIGHT; UNDO_DOWN_RIGHT:
    x += 1;
    }
    if( x < xm && y[x+1] == old ){
    x += 1, *s++ = 15; goto START_LEFT; UNDO_DOWN_LEFT:
    x -= 1;
    }
    if( y > y0 && x[y-w] == old ){
    y -= w, *s++ = 14; goto START_DOWN; UNDO_DOWN_DOWN:
    y += w;
    }
    continue;

    }

    return free( s0 ), r;
    }
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.lang.c on Tue Apr 9 02:32:31 2024
    From Newsgroup: comp.lang.c

    Michael S <already5chosen@yahoo.com> writes:

    On Sat, 30 Mar 2024 21:15:06 +0300
    Michael S <already5chosen@yahoo.com> wrote:

    On Fri, 29 Mar 2024 23:58:26 -0700
    Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    One, the new code is a lot more complicated than the previous
    code. I'm not sure the performance gain is worth the cost
    in complexity. What kind of speed improvements do you see,
    in terms of percent?

    On my 11 y.o. and not top-of-the-line even then home PC for 4K
    image (3840 x 2160) with cross-in-cross shape that I took from one of
    your previous post, it is 2.43 times faster.
    I don't remember how it compares on more modern systems. Anyway, right
    now I have no test systems more modern than 3 y.o. Zen3.

    I tested on newer hardware - Intel Coffee Lake (Xeon-E 2176G) and AMD
    Zen3 (EPYC 7543P).
    Here I no longer see significant drop in speed of the 1x1 variant at 4K
    size, but I still see that more complicated variant provides nice speed
    up. Up to 1.56x on Coffee Lake and up to 3x on Zen3.

    On my test system the numbers are closer and also more evenly
    balanced: ratios range from about 0.70 to about 1.40, roughly
    evenly split with the 2x2 version somewhat better. There was
    one outlier at approximately 1.48. More precisely, the ratios
    have an average of 1.06 (which means the 1x1 version is about
    6 percent slower on average), with a standard deviation of 0.21.
    --- Synchronet 3.20a-Linux NewsLink 1.114