• Library for saving an event log in Flash

    From pozz@pozzugno@gmail.com to comp.arch.embedded on Thu Apr 18 16:38:32 2024
    From Newsgroup: comp.arch.embedded

    The request is very common: when certain events of interest occur in the
    system, they must be saved in a log together with their timestamp. The
    log must be saved in an external SPI Flash connected to an MCU. The log
    has a maximum number of events. After the log is completely filled, each
    new event overwrites the oldest one.

    I tried to implement such a library, but I eventually found it's not a
    simple task, especially if you want a reliable library that works even
    when errors occur (for example, when a write fails).

    I started by assigning 5 sectors of the SPI Flash to the log. There are
    256 events in a sector (each event is a fixed-size struct).
    In this scenario, the maximum log size is 1024 events, because the 5th
    sector is kept spare so it can be erased when the write pointer reaches
    the end of a sector.

    The first challenge is how the system can determine, at startup, which
    is the first (newest) event in the log. I solved this by saving a 16-bit
    counter ID with each event. The initialization routine reads all the IDs
    and takes the greatest as the last event.
    However, the log is initially empty, so all the IDs are 0xFFFF, the
    maximum. One solution is to stop reading events when 0xFFFF is read, and
    to wrap the ID around at 0xFFFE instead of 0xFFFF.

    However, there's another problem. What happens after writing 65535 events
    to the log? The ID restarts from 0, so the latest event no longer has the
    greatest ID.

    These are the saved IDs after 65536 events:

        1^ SECT    2^ SECT    3^ SECT    4^ SECT    5^SECT---------->
        0xFB00 ... 0xFC00 ... 0xFD00 ... 0xFE00 ... 0xFF00 ... 0xFFFF

    The rule "newest event has greatest ID" still holds. Now a new event
    is written:

        1^ SECT-------> 2^ SECT   3^ SECT   4^ SECT   5^SECT--------->
        0x0000 0xFB01.. 0xFC00 .. 0xFD00 .. 0xFE00 .. 0xFF00 .. 0xFFFF

    Now the rule doesn't work anymore. The solution I found is to detect the
    discontinuity. The IDs are consecutive, so the initialization routine
    keeps reading while ID(n+1) = ID(n)+1. When there's a gap, the init
    function stops, having found the ID and position of the newest event.
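
    A minimal C sketch of that init scan (names are illustrative, not the
    actual library; the IDs are assumed to have been read into RAM first):

```c
#include <stdint.h>
#include <stddef.h>

#define ID_EMPTY 0xFFFFu   /* erased flash reads as all ones */

/* Find the slot holding the newest event.  IDs are consecutive modulo
 * 0xFFFF, so 0xFFFE wraps to 0x0000 and the erased value 0xFFFF never
 * occurs as a real ID.  The newest entry sits just before the first
 * discontinuity, i.e. the first place where ID(n+1) != ID(n)+1. */
size_t log_find_newest(const uint16_t ids[], size_t n)
{
    if (n == 0 || ids[0] == ID_EMPTY)
        return 0;                            /* empty log */
    for (size_t i = 0; i + 1 < n; i++) {
        uint16_t expected = (uint16_t)((ids[i] + 1) % 0xFFFF);
        if (ids[i + 1] != expected)
            return i;                        /* gap: slot i is the newest */
    }
    return n - 1;
}
```

    With the slots {0x0000, 0xFB01, 0xFB02, ...} from the figure above, the
    gap after slot 0 correctly makes slot 0 the newest entry.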

    But the problems don't stop here. What happens if an event write fails?
    Should I verify the actually written values, by reading them back and
    comparing them with the original data? In a reliable system, yes, I should.

    I was thinking of protecting each event with a CRC, so as to detect at
    any time whether an event is corrupted (badly written). However, the
    possibility of having isolated corrupted events increases the complexity
    of the task a lot.

    Suppose we write the 4th event, with ID=3, at position 4 in the sector.
    Now suppose the write fails. I can try to re-write the same event at
    position 5. Should I use ID=4 or ID=5? At first sight, I think it's
    better to use ID=5. The CRC should detect that a corrupted event is
    stored at position 4.
    After that, two more events are written correctly, with ID=6 and 7.

    Now the application layer wants to read the 4th event of the log
    (counting from the newest). We know that the newest is ID=7, so the 4th
    event is ID=7-4+1=4. However, the event with ID=4 is corrupted, so we
    should check the previous one, ID=3... and so on.
    So we can't have a mathematical function that returns the position of
    an event given its ordinal number from the newest event.
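
    The consequence is that lookup becomes a backward walk rather than a
    formula. A minimal sketch (valid[] standing in for a per-slot CRC
    check; all names illustrative):

```c
#include <stddef.h>

/* Return the slot of the k-th valid event counting back from the
 * newest (k=1 is the newest itself), skipping corrupted slots, or -1
 * if fewer than k valid events exist.  valid[i] models "slot i passes
 * its CRC check"; newest is the slot of the latest entry; n is the
 * ring size (k must be >= 1). */
int log_kth_from_newest(const unsigned char valid[], size_t n,
                        size_t newest, size_t k)
{
    size_t i = newest, seen = 0;
    for (;;) {
        if (valid[i] && ++seen == k)
            return (int)i;
        if (i == (newest + 1) % n)           /* walked the whole ring */
            return -1;
        i = (i == 0) ? n - 1 : i - 1;        /* step backwards */
    }
}
```

    With 8 slots, newest at slot 6 and slot 3 corrupted, the 4th event from
    the newest is found at slot 2, not at the arithmetically expected slot 3.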


    Eventually I think it's better to use a public-domain library, if one exists.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From David Brown@david.brown@hesbynett.no to comp.arch.embedded on Thu Apr 18 21:30:19 2024

    On 18/04/2024 16:38, pozz wrote:
    The request is very common: when certain events of interest occur in the
    system, they must be saved in a log together with their timestamp. The
    log must be saved in an external SPI Flash connected to an MCU. The log
    has a maximum number of events. After the log is completely filled, each
    new event overwrites the oldest one.

    I tried to implement such a library, but I eventually found it's not a
    simple task, especially if you want a reliable library that works even
    when errors occur (for example, when a write fails).

    I started by assigning 5 sectors of the SPI Flash to the log. There are
    256 events in a sector (each event is a fixed-size struct).
    In this scenario, the maximum log size is 1024 events, because the 5th
    sector is kept spare so it can be erased when the write pointer reaches
    the end of a sector.

    The first challenge is how the system can determine, at startup, which
    is the first (newest) event in the log. I solved this by saving a 16-bit
    counter ID with each event. The initialization routine reads all the IDs
    and takes the greatest as the last event.
    However, the log is initially empty, so all the IDs are 0xFFFF, the
    maximum. One solution is to stop reading events when 0xFFFF is read, and
    to wrap the ID around at 0xFFFE instead of 0xFFFF.

    Start at 0xfffe, and count down. Or xor with 0xffff for storage. Or
    wrap at 0xfffe, as you suggest. Or use 32-bit values. Or have another
    way to indicate that the log entry is valid. Or, since you have a
    timestamp, there's no need to track an ID - the timestamp will be
    increasing monotonically.


    However, there's another problem. What happens after writing 65535 events
    to the log? The ID restarts from 0, so the latest event no longer has the
    greatest ID.

    These are the saved IDs after 65536 events:

        1^ SECT    2^ SECT    3^ SECT    4^ SECT    5^SECT---------->
        0xFB00 ... 0xFC00 ... 0xFD00 ... 0xFE00 ... 0xFF00 ... 0xFFFF

    The rule "newest event has greatest ID" still holds. Now a new event
    is written:

        1^ SECT-------> 2^ SECT   3^ SECT   4^ SECT   5^SECT--------->
        0x0000 0xFB01.. 0xFC00 .. 0xFD00 .. 0xFE00 .. 0xFF00 .. 0xFFFF

    Now the rule doesn't work anymore. The solution I found is to detect the
    discontinuity. The IDs are consecutive, so the initialization routine
    keeps reading while ID(n+1) = ID(n)+1. When there's a gap, the init
    function stops, having found the ID and position of the newest event.


    Make your counts from 0 to 256*5 - 1, then wrap. Log entry "n" will be
    at address n * sizeof(log entry), with up to 256 log entries blank.
    Then you don't need to store a number at all.

    But the problems don't stop here. What happens if an event write fails?
    Should I verify the actually written values, by reading them back and
    comparing them with the original data? In a reliable system, yes, I should.

    I was thinking of protecting each event with a CRC, so as to detect at
    any time whether an event is corrupted (badly written). However, the
    possibility of having isolated corrupted events increases the complexity
    of the task a lot.


    An 8-bit or 16-bit CRC is peanuts to calculate and check. Write the
    whole log entry except for a byte or two (whatever is the minimum write
    size for the flashes), which must be something other than 0xff's. Check
    the entry after writing if you really want, then write the final byte or
    two. Then if there's a power-fail in the middle, it's obvious that the
    log entry is bad.
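
    In code, the two-step write might look like this sketch (a RAM array
    stands in for the flash device; a real port would use the SPI driver's
    page-program call; names and layout are illustrative):

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define COMMIT_OK 0xA5u                  /* anything other than 0xFF */

struct log_entry {
    uint8_t payload[12];
    uint8_t crc;
    uint8_t commit;                      /* written last */
};

static uint8_t flash[4096];              /* stand-in for the SPI device */

void log_write_entry(uint32_t addr, const struct log_entry *e)
{
    /* step 1: body (payload + CRC); the commit byte stays erased */
    memcpy(&flash[addr], e, offsetof(struct log_entry, commit));
    /* (read back and verify here if desired) */
    /* step 2: the commit byte marks the entry complete; a power fail
     * before this point leaves it erased, so the entry is clearly bad */
    flash[addr + offsetof(struct log_entry, commit)] = COMMIT_OK;
}

int log_entry_committed(uint32_t addr)
{
    return flash[addr + offsetof(struct log_entry, commit)] == COMMIT_OK;
}
```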

    For SPI NOR flash, the risk of a bad write - other than through
    unexpected resets or power fails - is negligible for most purposes.
    Don't over-complicate things worrying about something that will never
    happen.

    Suppose we write the 4th event, with ID=3, at position 4 in the sector.
    Now suppose the write fails. I can try to re-write the same event at
    position 5. Should I use ID=4 or ID=5? At first sight, I think it's
    better to use ID=5. The CRC should detect that a corrupted event is
    stored at position 4.
    After that, two more events are written correctly, with ID=6 and 7.

    Now the application layer wants to read the 4th event of the log
    (counting from the newest). We know that the newest is ID=7, so the 4th
    event is ID=7-4+1=4. However, the event with ID=4 is corrupted, so we
    should check the previous one, ID=3... and so on.
    So we can't have a mathematical function that returns the position of
    an event given its ordinal number from the newest event.


    Eventually I think it's better to use a public-domain library, if one
    exists.


  • From pozz@pozzugno@gmail.com to comp.arch.embedded on Thu Apr 18 22:36:36 2024

    Il 18/04/2024 21:30, David Brown ha scritto:
    On 18/04/2024 16:38, pozz wrote:
    The request is very common: when certain events of interest occur in the
    system, they must be saved in a log together with their timestamp. The
    log must be saved in an external SPI Flash connected to an MCU. The log
    has a maximum number of events. After the log is completely filled,
    each new event overwrites the oldest one.

    I tried to implement such a library, but I eventually found it's not a
    simple task, especially if you want a reliable library that works even
    when errors occur (for example, when a write fails).

    I started by assigning 5 sectors of the SPI Flash to the log. There are
    256 events in a sector (each event is a fixed-size struct).
    In this scenario, the maximum log size is 1024 events, because the 5th
    sector is kept spare so it can be erased when the write pointer reaches
    the end of a sector.

    The first challenge is how the system can determine, at startup, which
    is the first (newest) event in the log. I solved this by saving a
    16-bit counter ID with each event. The initialization routine reads
    all the IDs and takes the greatest as the last event.
    However, the log is initially empty, so all the IDs are 0xFFFF, the
    maximum. One solution is to stop reading events when 0xFFFF is read
    and to wrap the ID around at 0xFFFE instead of 0xFFFF.

    Start at 0xfffe, and count down.


    And what should I do when the counter reaches zero? It can wrap around up
    to 0xfffe (which is very similar to an increasing counter over
    0x0000-0xFFFE).

    Or xor with 0xffff for storage.  Or
    wrap at 0xfffe, as you suggest.  Or use 32-bit values.  Or have another way to indicate that the log entry is valid.

    I will add a CRC to each entry, and that can be used to validate the
    event. An empty/erased slot filled with 0xFF will not pass CRC validation.
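
    For instance, with a plain bitwise CRC-8 (polynomial 0x07; the entry
    layout here is purely illustrative), an erased slot fails the check
    because the CRC over all-0xFF payload bytes is not itself 0xFF:

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC-8, polynomial x^8 + x^2 + x + 1 (0x07), init 0. */
uint8_t crc8(const uint8_t *p, size_t len)
{
    uint8_t crc = 0;
    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (uint8_t)((crc & 0x80) ? (crc << 1) ^ 0x07 : crc << 1);
    }
    return crc;
}

/* An entry is valid when its last byte matches the CRC of the rest.
 * An erased slot (all 0xFF) fails this check. */
int entry_valid(const uint8_t *slot, size_t len)
{
    return crc8(slot, len - 1) == slot[len - 1];
}
```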


    Or, since you have a
    timestamp, there's no need to track an ID - the timestamp will be
    increasing monotonically.

    I don't want to use timestamps for two reasons:

    - the system wall clock can be changed (the system is isolated)
    - the library I'm writing doesn't know the content of the "events": to
    it, an event is an opaque sequence of bytes.


    However, there's another problem. What happens after writing 65535
    events to the log? The ID restarts from 0, so the latest event no
    longer has the greatest ID.

    These are the saved IDs after 65536 events:

         1^ SECT    2^ SECT    3^ SECT    4^ SECT    5^SECT---------->
         0xFB00 ... 0xFC00 ... 0xFD00 ... 0xFE00 ... 0xFF00 ... 0xFFFF

    The rule "newest event has greatest ID" still holds. Now a new
    event is written:

         1^ SECT-------> 2^ SECT   3^ SECT   4^ SECT   5^SECT--------->
         0x0000 0xFB01.. 0xFC00 .. 0xFD00 .. 0xFE00 .. 0xFF00 .. 0xFFFF

    Now the rule doesn't work anymore. The solution I found is to detect the
    discontinuity. The IDs are consecutive, so the initialization routine
    keeps reading while ID(n+1) = ID(n)+1. When there's a gap, the init
    function stops, having found the ID and position of the newest event.

    Make your counts from 0 to 256*5 - 1, then wrap.  Log entry "n" will be
    at address n * sizeof(log entry), with up to 256 log entries blank. Then
    you don't need to store a number at all.

    What do you mean by log entry "0"? Is it the oldest or the newest? I
    think the oldest, because that formula is imho correct in this case.

    However it doesn't appear correct when the log has rotated, which happens
    after writing 5x256+1 events. In this case the newest entry ("n"=1024)
    is at address 0, not n*sizeof(entry).


    But the problems don't stop here. What happens if an event write
    fails? Should I verify the actually written values, by reading them
    back and comparing them with the original data? In a reliable system,
    yes, I should.

    I was thinking of protecting each event with a CRC, so as to detect at
    any time whether an event is corrupted (badly written). However, the
    possibility of having isolated corrupted events increases the
    complexity of the task a lot.

    An 8-bit or 16-bit CRC is peanuts to calculate and check.

    I know. Here the increased complexity wasn't related to the CRC
    calculation, but to the possibility of having isolated corrupted slots
    in the buffer. Taking these corrupted slots into account isn't so
    simple for me.


    Write the
    whole log entry except for a byte or two (whatever is the minimum write
    size for the flashes), which must be something other than 0xff's.  Check
    the entry after writing if you really want, then write the final byte or
    two.  Then if there's a power-fail in the middle, it's obvious that the
    log entry is bad.

    I don't understand why I should write in two steps. I think I can write
    all the bytes of an entry, including the CRC. After writing, I can read
    back all the bytes and check integrity.


    For SPI NOR flash, the risk of a bad write - other than through
    unexpected resets or power fails - is negligible for most purposes.
    Don't over-complicate things worrying about something that will never happen.

    Indeed, unexpected resets, but mainly power failures, are possible. Rare,
    but possible.


    Suppose we write the 4th event, with ID=3, at position 4 in the sector.
    Now suppose the write fails. I can try to re-write the same event at
    position 5. Should I use ID=4 or ID=5? At first sight, I think it's
    better to use ID=5. The CRC should detect that a corrupted event is
    stored at position 4.
    After that, two more events are written correctly, with ID=6 and 7.

    Now the application layer wants to read the 4th event of the log
    (counting from the newest). We know that the newest is ID=7, so the
    4th event is ID=7-4+1=4. However, the event with ID=4 is corrupted, so
    we should check the previous one, ID=3... and so on.
    So we can't have a mathematical function that returns the position of
    an event given its ordinal number from the newest event.


    Eventually I think it's better to use a public-domain library, if one
    exists.



  • From Don Y@blockedofcourse@foo.invalid to comp.arch.embedded on Fri Apr 19 00:58:04 2024

    On 4/18/2024 7:38 AM, pozz wrote:
    The request is very common: when certain events of interest occur in the
    system, they must be saved in a log together with their timestamp. The log
    must be saved in an external SPI Flash connected to an MCU. The log has a
    maximum number of events. After the log is completely filled, each new
    event overwrites the oldest one.

    A circular buffer.

    I tried to implement such a library, but I eventually found it's not a
    simple task, especially if you want a reliable library that works even
    when errors occur (for example, when a write fails).

    That depends on the sorts of failures that can occur and that are likely
    to occur. If, for example, power fails during a write... (do you know
    how your hardware will handle such an event? even if you THINK you
    have some guarantee that it can't happen -- because you have an early
    warning indication from the power supply?)

    Can a write APPEAR successful and yet the data degrade later (so your
    notion of where the next message should start changes based on whether
    or not you currently KNOW where it should start vs. having to DISCOVER
    where it should start)?

    I started by assigning 5 sectors of the SPI Flash to the log. There are
    256 events in a sector (each event is a fixed-size struct).
    In this scenario, the maximum log size is 1024 events, because the 5th
    sector is kept spare so it can be erased when the write pointer reaches
    the end of a sector.

    You're being overly conservative, expecting new events to occur WHILE you
    are erasing the "extra" sector. If you can cache the events elsewhere
    while the erase is in progress, then you can use the extra sector, too
    (and defer erasing until ALL sectors are full AND another event occurs).

    The first challenge is how the system can determine, at startup, which is
    the first (newest) event in the log. I solved this by saving a 16-bit
    counter ID with each event. The initialization routine reads all the IDs
    and takes the greatest as the last event.
    However, the log is initially empty, so all the IDs are 0xFFFF, the
    maximum. One solution is to stop reading events when 0xFFFF is read and
    to wrap the ID around at 0xFFFE instead of 0xFFFF.

    In your scheme, one sector will be erased or in the process of being
    erased. So, you can (almost) rely on its contents (unless the erase can
    be interrupted by a power event).

    However, there's another problem. What happens after writing 65535 events
    to the log? The ID restarts from 0, so the latest event no longer has the
    greatest ID.

    This is the same as handling a circular buffer, where an address N<M can
    actually represent something that occurs AFTER the data at M. Why do you
    see it as anything different?

    These are the saved IDs after 65536 events:

        1^ SECT    2^ SECT    3^ SECT    4^ SECT    5^SECT---------->
        0xFB00 ... 0xFC00 ... 0xFD00 ... 0xFE00 ... 0xFF00 ... 0xFFFF

    The rule "newest event has greatest ID" still holds. Now a new event is
    written:

        1^ SECT-------> 2^ SECT   3^ SECT   4^ SECT   5^SECT--------->
        0x0000 0xFB01.. 0xFC00 .. 0xFD00 .. 0xFE00 .. 0xFF00 .. 0xFFFF

    Now the rule doesn't work anymore. The solution I found is to detect the
    discontinuity. The IDs are consecutive, so the initialization routine
    keeps reading while ID(n+1) = ID(n)+1. When there's a gap, the init
    function stops, having found the ID and position of the newest event.

    Unless, of course, the entry at n or n+1 is corrupt...

    But the problems don't stop here. What happens if an event write fails?
    Should I verify the actually written values, by reading them back and
    comparing them with the original data? In a reliable system, yes, I should.

    So, you've answered your own question. If you want the log to have
    value, then you have to put effort into ensuring its integrity.

    Note my earlier comment "Can a write APPEAR successful and yet..."

    I was thinking of protecting each event with a CRC, so as to detect at any
    time whether an event is corrupted (badly written). However, the
    possibility of having isolated corrupted events increases the complexity
    of the task a lot.

    You can't assume ANYTHING about a corrupted event.

    Suppose we write the 4th event, with ID=3, at position 4 in the sector.
    Now suppose the write fails. I can try to re-write the same event at
    position 5. Should I use ID=4 or ID=5? At first sight, I think it's better
    to use ID=5. The CRC should detect that a corrupted event is stored at
    position 4.
    After that, two more events are written correctly, with ID=6 and 7.

    Now the application layer wants to read the 4th event of the log (counting
    from the newest). We know that the newest is ID=7, so the 4th event is
    ID=7-4+1=4. However, the event with ID=4 is corrupted, so we should check
    the previous one, ID=3... and so on.
    So we can't have a mathematical function that returns the position of an
    event given its ordinal number from the newest event.

    Eventually I think it's better to use a public-domain library, if one exists.

    You want an existing library to be able to address arbitrary log entry formats on arbitrary media with arbitrary failure modes (hardware and software)?

    The only time you really care about recovering data from the log is when you actually *want* (need) to recover data. (??) Can the log automatically restart with each device reset? (and, if you want to preserve the log,
    have a mechanism that inhibits logging after a "diagnostic reset")

    Assuming there is no external control data for the log (i.e., that you have
    to infer the structure of the log from the log's contents), a generic algorithm is to start at the first memory location that *can* hold the start of a message. Then, look at that piece of memory (taking into account any "wrap" caused by the circular nature of the buffer) and determine if it represents
    a valid log message (e.g., if you store a hash of the message in the message, then the hash must be correct in addition to the format/content of the message).

    If a valid message is encountered, return SUCCESS along with the length of
    the message (so you will know where to start looking for the NEXT message). [You already know where THIS message starts]

    If an invalid message is encountered, return FAIL along with a number of
    bytes that can safely be skipped over to begin searching for the next
    message. In the degenerate case, this is '1'.
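
    As a sketch of that contract, using a toy message format ([length
    byte][payload...][checksum byte]) purely for illustration (a real log
    would use its own framing and hash, and would also handle buffer wrap,
    which is omitted here):

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    int    ok;            /* SUCCESS/FAIL */
    size_t len_or_skip;   /* total message length, or bytes safe to skip */
} scan_result;

/* Examine buf at offset off: report a valid message and its total
 * length, or how far to skip before trying again (degenerate case: 1).
 * The toy checksum is the byte-sum of the length byte and payload. */
scan_result parse_at(const uint8_t *buf, size_t buflen, size_t off)
{
    scan_result fail = { 0, 1 };
    if (off >= buflen || buf[off] == 0)
        return fail;
    size_t len = buf[off];                   /* length byte + payload */
    if (off + len + 1 > buflen)              /* +1 for the checksum */
        return fail;
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (uint8_t)(sum + buf[off + i]);
    if (sum != buf[off + len])
        return fail;
    return (scan_result){ 1, len + 1 };      /* next msg starts here */
}
```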

    Based on YOUR expected failure modes (and your write strategy), you can then walk through the log in search of "good" messages and extract their ID's (sequence numbers).

    [Note that this scheme allows messages to be of variable length]

    If you rely on sequence numbers or "short" timestamps that can wrap in the
    context of the log buffer's length, then you need to use a numbering
    scheme whose period is relatively prime wrt the buffer length; otherwise
    you could encounter sets of messages that appear to start "anywhere".

    Note that your IDs are just abbreviated timestamps; with arbitrary time intervals between them. Using a "system time" that is always going to
    increase monotonically often offers more USEFUL information as it lets
    you decide how events are temporarily related (did event #5 happen
    immediately after event #4? Or, two weeks later??)

    Having a system time that you always reinitialize with the last noted
    value (assuming you can't track time when powered off) means that chronology/sequence is always evident. (you can add events like "User
    set Wall Clock Time to XX:YY:ZZ" so you can map the logged events to
    "real" time for convenience -- is something happening at some particular
    time of day that may be pertinent??)

  • From Don Y@blockedofcourse@foo.invalid to comp.arch.embedded on Fri Apr 19 01:02:47 2024

    On 4/19/2024 12:58 AM, Don Y wrote:
    Note that your IDs are just abbreviated timestamps; with arbitrary time intervals between them.  Using a "system time" that is always going to increase monotonically often offers more USEFUL information as it lets
    you decide how events are temporarily related (did event #5 happen immediately after event #4?  Or, two weeks later??)

    s/temporarily/temporally/

  • From David Brown@david.brown@hesbynett.no to comp.arch.embedded on Fri Apr 19 10:28:46 2024

    On 18/04/2024 22:36, pozz wrote:
    Il 18/04/2024 21:30, David Brown ha scritto:
    On 18/04/2024 16:38, pozz wrote:
    The request is very common: when certain events of interest occur in the
    system, they must be saved in a log together with their timestamp. The
    log must be saved in an external SPI Flash connected to an MCU. The
    log has a maximum number of events. After the log is completely
    filled, each new event overwrites the oldest one.

    I tried to implement such a library, but I eventually found it's not a
    simple task, especially if you want a reliable library that works even
    when errors occur (for example, when a write fails).

    I started by assigning 5 sectors of the SPI Flash to the log. There are
    256 events in a sector (each event is a fixed-size struct).
    In this scenario, the maximum log size is 1024 events, because the
    5th sector is kept spare so it can be erased when the write pointer
    reaches the end of a sector.

    The first challenge is how the system can determine, at startup, which
    is the first (newest) event in the log. I solved this by saving a
    16-bit counter ID with each event. The initialization routine reads
    all the IDs and takes the greatest as the last event.
    However, the log is initially empty, so all the IDs are 0xFFFF, the
    maximum. One solution is to stop reading events when 0xFFFF is read
    and to wrap the ID around at 0xFFFE instead of 0xFFFF.

    Start at 0xfffe, and count down.


    And what should I do when the counter reaches zero? It can wrap around up
    to 0xfffe (which is very similar to an increasing counter over 0x0000-0xFFFE).


    How big are your log entries? How many entries are realistic in the
    lifetime of the system?

    Or xor with 0xffff for storage.  Or
    wrap at 0xfffe, as you suggest.  Or use 32-bit values.  Or have
    another way to indicate that the log entry is valid.

    I will add a CRC for each entry and that can be used to validate the
    event. An empty/erased slot filled with 0xFF will not pass CRC validation.


    That's usually fine.


    Or, since you have a timestamp, there's no need to track an ID - the
    timestamp will be increasing monotonically.

    I don't want to use timestamps for two reasons:

    - the system wall clock can be changed (the system is isolated)
    - the library I'm writing doesn't know the content of "events", for
      it the event is an opaque sequence of bytes.


    OK.


    However, there's another problem. What happens after writing 65535
    events to the log? The ID restarts from 0, so the latest event no
    longer has the greatest ID.

    These are the saved IDs after 65536 events:

         1^ SECT    2^ SECT    3^ SECT    4^ SECT    5^SECT---------->
         0xFB00 ... 0xFC00 ... 0xFD00 ... 0xFE00 ... 0xFF00 ... 0xFFFF

    The rule "newest event has greatest ID" still holds. Now a new
    event is written:

         1^ SECT-------> 2^ SECT   3^ SECT   4^ SECT   5^SECT--------->
         0x0000 0xFB01.. 0xFC00 .. 0xFD00 .. 0xFE00 .. 0xFF00 .. 0xFFFF

    Now the rule doesn't work anymore. The solution I found is to detect the
    discontinuity. The IDs are consecutive, so the initialization routine
    keeps reading while ID(n+1) = ID(n)+1. When there's a gap, the init
    function stops, having found the ID and position of the newest event.

    Make your counts from 0 to 256*5 - 1, then wrap.  Log entry "n" will
    be at address n * sizeof(log entry), with up to 256 log entries blank.
    Then you don't need to store a number at all.

    What do you mean by log entry "0"? Is it the oldest or the newest? I
    think the oldest, because that formula is imho correct in this case.

    However it doesn't appear correct when the log has rotated, which happens
    after writing 5x256+1 events. In this case the newest entry ("n"=1024)
    is at address 0, not n*sizeof(entry).


    (I misread your "5 sectors of an SPI flash chip" as "5 SPI flash chips"
    when first replying. It makes no real difference to what I wrote, but I
    might have used "chip" instead of "sector".)

    You have 256 entries per flash sector, and 5 flash sectors. For the log
    entry number "n" - where "n" is an abstract count that never wraps - your
    index "i" into the flash array is (n % (5*256)). The sector number is
    then (i / 256), and the index into the sector is (i % 256). The
    position in the log is determined directly by the entry number, and you
    don't actually need to store it anywhere.
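
    That arithmetic, written out (entry count → ring index → sector and
    offset; function names are illustrative):

```c
#include <stdint.h>

enum { SECTORS = 5, PER_SECTOR = 256, SLOTS = SECTORS * PER_SECTOR };

/* n is the abstract, never-wrapping entry count; everything else
 * follows from it, so nothing needs to be stored in the entries. */
uint32_t log_index (uint32_t n) { return n % SLOTS; }        /* ring slot */
uint32_t log_sector(uint32_t n) { return log_index(n) / PER_SECTOR; }
uint32_t log_offset(uint32_t n) { return log_index(n) % PER_SECTOR; }
```

    For example, entry 300 lands in sector 1 at offset 44, and entry 1280
    wraps back to slot 0 of sector 0.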

    Think of this a different way - dispense with the log entry numbers
    entirely. When you start up, scan the flash to find the next free slot.
    You do this by looking at slot 0 first. If that is not empty, keep
    scanning until you find a free slot - that's the next free slot. If
    slot 0 is empty, scan until you have non-empty slots, then keep going
    until you get a free one again, and that's the next free slot. If you
    never find a used slot, or fail to find a free slot after the non-free
    slots, then your first free slot is slot 0.
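
    That startup scan can be sketched as follows, with used[] standing in
    for "slot i is not erased" (illustrative only; corrupted slots are not
    handled here):

```c
#include <stddef.h>

/* Find the next free slot in a ring of n slots.  used[i] is nonzero
 * for a written slot.  Mirrors the scan described above: if slot 0 is
 * used, the first free slot after the used run is next; if slot 0 is
 * free, skip the free run and then the used run that follows; if
 * either search falls off the end, the answer wraps back to slot 0. */
size_t log_next_free(const unsigned char used[], size_t n)
{
    size_t i = 0;
    if (used[0]) {
        while (i < n && used[i]) i++;
        return (i < n) ? i : 0;
    }
    while (i < n && !used[i]) i++;           /* leading free run */
    while (i < n && used[i]) i++;            /* then the used run */
    return (i < n) ? i : 0;
}
```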

    Any new logs are then put in this slot, moving forward. If you need to
    read out old logs, move backwards. When storing new logs, as you are
    nearing the end of a flash sector (how near depends on the sector erase
    time and how often events can occur), start the erase of the next sector
    in line.


    But the problems don't stop here. What happens if an event write
    fails? Should I verify the actually written values, by reading them
    back and comparing them with the original data? In a reliable system,
    yes, I should.

    I was thinking of protecting each event with a CRC, so as to detect
    at any time whether an event is corrupted (badly written). However, the
    possibility of having isolated corrupted events increases the complexity
    of the task a lot.

    An 8-bit or 16-bit CRC is peanuts to calculate and check.

    I know. Here the increased complexity wasn't related to the CRC
    calculation, but to the possibility of having isolated corrupted slots
    in the buffer. Taking these corrupted slots into account isn't so
    simple for me.


    Think how such corruption could happen, and its consequences. For most
    event logs, it is simply not going to occur in the lifetime of working products - and if it does, as an isolated error in an event log, it
    doesn't matter significantly. Errors in a sensibly designed SPI NOR
    flash system would be an indication of serious hardware problems such as erratic power supplies, and then the log is the least of your concerns.

    The only thing to consider is a reset or power failure in the middle of writing a log event.


    Write the whole log entry except for a byte or two (whatever is the
    minimum write size for the flashes), which must be something other
    than 0xff's.  Check the entry after writing if you really want, then
    write the final byte or two.  Then if there's a power-fail in the
    middle, it's obvious that the log entry is bad.

    I don't understand why I should write in two steps. I think I can write
    all the bytes of an entry, including the CRC. After writing, I can read
    back all the bytes and check integrity.


    No, you can't. Or to be more accurate, you /can/, but it leaves you
    open to the risk of undetectable errors in the event of a power fail in
    the middle of writing. An 8-bit CRC has a 1 in 256 chance of passing if
    data is partially written.

    If you have enough capacitance on the board to ensure that any write completes, and detection of failure of external power so that you never
    start a log write after power has failed, then it is probably safe to do
    the write as a single action.


    For SPI NOR flash, the risk of a bad write - other than through
    unexpected resets or power fails - is negligible for most purposes.
    Don't over-complicate things worrying about something that will never
    happen.

    Indeed, unexpected resets, but mainly power failures, are possible. Rare,
    but possible.


    Yes.


    Suppose we write the 4th event, with ID=3, at position 4 in the sector.
    Now suppose the write fails. I can try to re-write the same event at
    position 5. Should I use ID=4 or ID=5? At first sight, I think it's
    better to use ID=5. The CRC should detect that a corrupted event is
    stored at position 4.
    After that, two more events are written correctly, with ID=6 and 7.

    Now the higher application wants to read the 4th event of the log
    (starting from the newest). We know that the newest is ID=7, so the
    4th event is ID=7-4+1=4. However the event with ID=4 is corrupted, so
    we should check the previous ID=3... and so on.
    So we can't have a mathematical function that returns the position
    of an event starting from its ordinal number from the newest event.
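    A sketch of the fallback this implies: instead of a direct formula, walk
    backwards from the newest slot and skip slots whose CRC fails.  The
    slot_valid() array here is a stand-in for the real per-slot CRC check:

```c
#include <stdint.h>
#include <stdbool.h>

#define SLOT_COUNT (5 * 256)

/* Stand-in for "read the slot and check its CRC". */
static bool slot_ok[SLOT_COUNT];
static bool slot_valid(uint16_t slot) { return slot_ok[slot]; }

/* Walk backwards from the newest slot, counting only valid entries.
   n = 0 is the newest event; returns -1 when the log runs out. */
static int find_nth_newest(uint16_t newest_slot, uint16_t n)
{
    uint16_t slot = newest_slot;
    for (uint16_t i = 0; i < SLOT_COUNT; i++) {
        if (slot_valid(slot)) {
            if (n == 0)
                return slot;
            n--;
        }
        slot = (slot == 0) ? SLOT_COUNT - 1 : slot - 1;
    }
    return -1;
}
```

    This is O(n) per lookup instead of O(1), which is the price of allowing
    isolated corrupted slots; for a 1280-slot log that is usually acceptable.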


    In the end, I think it's better to use a public domain library, if one
    exists.

    In my experience, these things are simple enough to implement, and
    specific enough to the application and target, that I don't see the need
    of a library.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From pozz@pozzugno@gmail.com to comp.arch.embedded on Fri Apr 19 16:58:35 2024
    From Newsgroup: comp.arch.embedded

    Il 19/04/2024 10:28, David Brown ha scritto:
    On 18/04/2024 22:36, pozz wrote:
    Il 18/04/2024 21:30, David Brown ha scritto:
    On 18/04/2024 16:38, pozz wrote:
    The request is very common: when some events of interest occur in the
    system, they must be saved in a log with the related timestamp. The
    log must be saved in an external SPI Flash connected to an MCU. The
    log has a maximum number of events. After the log is completely
    filled, each new event overwrites the oldest one.

    I tried to implement such a library, but I eventually found it's not a
    simple task, especially if you want a reliable library that works even
    when errors occur (for example, when a write fails).

    I started by assigning 5 sectors of SPI Flash to the log. There are
    256 events in a sector (the event is a fixed struct).
    In this scenario, the maximum log size is 1024 events, because the
    5th sector can be erased when the pointer reaches the end of a sector.
    The first challenge is how the system can understand what is the
    first (newest) event in the log at startup. I solved saving a
    16-bits counter ID to each event. The initialization routine starts
    reading all the IDs and takes the greatest as the last event.
    However initially the log is empty, so all the IDs are 0xFFFF, the
    maximum. One solution is to stop reading events when 0xFFFF is read
    and wrap-around ID at 0xFFFE and not 0xFFFF.

    Start at 0xfffe, and count down.


    And what to do when the counter reaches zero? It can wrap around back
    to 0xfffe (which is very similar to an increasing counter over
    0x0000-0xFFFE).
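    One common alternative, sketched here, is serial-number comparison (the
    technique from RFC 1982): treat "newer" as a relation between two IDs
    rather than looking for the greatest ID.  It keeps working across the
    16-bit wrap as long as the live IDs span less than half the counter
    range, which holds here since the log keeps at most 1280 entries:

```c
#include <stdint.h>
#include <stdbool.h>

/* Serial-number comparison: true when 'a' is newer than 'b', even
   across a 16-bit wrap, valid while the live IDs span < 32768. */
static bool id_newer(uint16_t a, uint16_t b)
{
    return (int16_t)(uint16_t)(a - b) > 0;
}
```

    With this relation, the newest event is simply the one that is "newer"
    than all the others, and the 0x0000-after-0xFFFF discontinuity stops
    being a special case (0xFFFF can still be reserved as the erased value).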


    How big are your log entries?  How many entries are realistic in the lifetime of the system?

    16 bytes per entry. It's difficult to reach 0xFFFF events in the log,
    but why limit our imagination? :-D

    With 20 events per day, the log will be filled after 9 years. That's a
    very long life, but I think I have solved the problem of detecting
    whether the ID has wrapped around.


    Or xor with 0xffff for storage.  Or
    wrap at 0xfffe, as you suggest.  Or use 32-bit values.  Or have
    another way to indicate that the log entry is valid.

    I will add a CRC for each entry and that can be used to validate the
    event. An empty/erased slot filled with 0xFF will not pass CRC
    validation.
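    A sketch of such a check.  One caveat: whether an erased slot fails the
    CRC depends on the polynomial and the stored pattern, so it may be safer
    to detect the all-0xFF slot explicitly before checking the CRC.  The
    CRC-16/CCITT-FALSE choice and the slot layout here are assumptions:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* CRC-16/CCITT-FALSE (poly 0x1021, init 0xFFFF); an assumed choice. */
static uint16_t crc16(const uint8_t *p, size_t len)
{
    uint16_t crc = 0xFFFF;
    while (len--) {
        crc ^= (uint16_t)(*p++) << 8;
        for (int i = 0; i < 8; i++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

enum slot_state { SLOT_EMPTY, SLOT_VALID, SLOT_CORRUPT };

/* Classify a 16-byte slot whose last two bytes hold the stored CRC.
   An all-0xFF slot is reported empty explicitly, rather than relying
   on the erased pattern happening to fail the CRC check. */
static enum slot_state slot_state(const uint8_t slot[16])
{
    bool all_ff = true;
    for (int i = 0; i < 16; i++)
        if (slot[i] != 0xFF) { all_ff = false; break; }
    if (all_ff)
        return SLOT_EMPTY;

    uint16_t stored = (uint16_t)(slot[14] | ((uint16_t)slot[15] << 8));
    return (crc16(slot, 14) == stored) ? SLOT_VALID : SLOT_CORRUPT;
}
```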


    That's usually fine.


    Or, since you have a timestamp, there's no need to track an ID - the
    timestamp will be increasing monotonically.

    I don't want to use timestamps for two reasons:

    - the system wall clock can be changed (the system is isolated)
    - the library I'm writing doesn't know the content of "events", for
       it the event is an opaque sequence of bytes.


    OK.


    However there's another problem. What happens after writing 65535
    events in the log? The ID restarts from 0, so the latest event no
    longer has the greatest ID.

    These are the saved IDs after 65536 events:

         1^ SECT    2^ SECT    3^ SECT    4^ SECT    5^SECT---------->
         0xFB00 ... 0xFC00 ... 0xFD00 ... 0xFE00 ... 0xFF00 ... 0xFFFF
    The rule "newest event has greatest ID" still holds. Now a new
    event is written:

         1^ SECT-------> 2^ SECT   3^ SECT   4^ SECT   5^SECT--------->
         0x0000 0xFB01.. 0xFC00 .. 0xFD00 .. 0xFE00 .. 0xFF00 .. 0xFFFF
    Now the rule doesn't work anymore. The solution I found is to detect
    the discontinuity. The IDs are consecutive, so the initialization
    routine continues reading while ID(n+1)=ID(n)+1. When there's a
    gap, the init function stops: the slot just before the gap holds the
    ID and position of the newest event.
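    The discontinuity scan might look like this, with an array standing in
    for the IDs read back from flash; the wrap at 0xFFFE and the use of
    0xFFFF as the erased value follow the scheme above:

```c
#include <stdint.h>

#define SLOT_COUNT (5 * 256)

static uint16_t ids[SLOT_COUNT];  /* stand-in for the IDs read from flash */

/* The newest entry is the one just before the first discontinuity in
   the ID sequence (IDs wrap at 0xFFFE; 0xFFFF means erased).  An erased
   slot always breaks the sequence, so a partially filled log works too. */
static int find_newest_slot(void)
{
    if (ids[0] == 0xFFFF)
        return -1;                          /* log is empty */
    for (int i = 0; i < SLOT_COUNT - 1; i++) {
        uint16_t next = (ids[i] == 0xFFFE) ? 0 : (uint16_t)(ids[i] + 1);
        if (ids[i + 1] != next)
            return i;                       /* gap: slot i is the newest */
    }
    return SLOT_COUNT - 1;
}
```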

    Make your counts from 0 to 256*5 - 1, then wrap.  Log entry "n" will
    be at address n * sizeof(log entry), with up to 256 log entries
    blank. Then you don't need to store a number at all.

    What do you mean by log entry "0"? Is it the oldest or the newest? I
    think the oldest, because that formula is imho correct in this case.

    However it doesn't appear correct when the log has rotated, which
    happens after writing 5x256+1 events. In this case the newest entry
    ("n"=1280) is at address 0, not n*sizeof(entry).


    (I misread your "5 sectors of an SPI flash chip" as "5 SPI flash chips"
    when first replying.  It makes no real difference to what I wrote, but I might have used "chip" instead of "sector".)

    You have 256 entries per flash sector, and 5 flash sectors. For log
    entry number "n" - where "n" is an abstract count that never wraps -
    your index "i" into the flash array is (n % (5*256)). The sector
    number is then (i / 256), and the index into the sector is (i % 256).
    The position in the log is determined directly by the entry number,
    and you don't actually need to store it anywhere.
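    That mapping could be sketched as below.  Note the explicit parentheses
    around (5*256): in C, n % 5*256 would parse as (n % 5) * 256:

```c
#include <stdint.h>

#define SLOTS_PER_SECTOR 256u
#define SECTOR_COUNT     5u
#define SLOT_COUNT       (SLOTS_PER_SECTOR * SECTOR_COUNT)
#define ENTRY_SIZE       16u   /* bytes per entry, as in the thread */

typedef struct { uint32_t sector; uint32_t offset; } flash_pos_t;

/* Map an ever-growing entry number to a flash location: sector number
   plus byte offset within the sector. */
static flash_pos_t entry_pos(uint32_t n)
{
    uint32_t i = n % SLOT_COUNT;           /* i.e. n % (5*256) */
    flash_pos_t p = { i / SLOTS_PER_SECTOR,
                      (i % SLOTS_PER_SECTOR) * ENTRY_SIZE };
    return p;
}
```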

    Think of this a different way - dispense with the log entry numbers entirely.  When you start up, scan the flash to find the next free slot.
     You do this by looking at slot 0 first.  If that is not empty, keep scanning until you find a free slot - that's the next free slot.  If
    slot 0 is empty, scan until you have non-empty slots, then keep going
    until you get a free one again, and that's the next free slot.  If you never find a used slot, or fail to find a free slot after the non-free slots, then your first free slot is slot 0.

    Any new logs are then put in this slot, moving forward.  If you need to read out old logs, move backwards.  When storing new logs, as you are nearing the end of a flash sector (how near depends on the sector erase
    time and how often events can occur), start the erase of the next sector
    in line.
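    The erase-ahead step might be sketched like this; the margin value and
    the non-blocking erase call are assumptions to be tuned for the real
    part's sector erase time and the maximum event rate:

```c
#include <stdint.h>

#define SLOTS_PER_SECTOR 256u
#define SECTOR_COUNT     5u
#define ERASE_MARGIN     8u   /* assumed margin before the sector end */

/* Stand-in for a hypothetical driver call that starts a non-blocking
   sector erase; here it just records which sector was requested. */
static int erase_started = -1;
static void flash_erase_sector_start(uint32_t sector)
{
    erase_started = (int)sector;
}

/* Call after storing an entry in slot i (0 .. 5*256-1): once the write
   pointer nears the end of a sector, start erasing the next sector so
   it is ready before the first write lands in it. */
static void maybe_erase_ahead(uint32_t i)
{
    if (i % SLOTS_PER_SECTOR == SLOTS_PER_SECTOR - ERASE_MARGIN)
        flash_erase_sector_start(((i / SLOTS_PER_SECTOR) + 1) % SECTOR_COUNT);
}
```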

    Yes, it is what I already do. However I disagree on the formula.

    The higher-layer application requests log entry 0. What is it? The
    newest event. My eventlog library should convert 0 to the slot index
    in the Flash, which is directly related to the Flash address (I don't
    really need the sector number here).

    If the log is empty, event 0 for the application is slot 0 for the
    eventlog library. However, if there are three events in the log, event 0
    for the application is slot 2 for the eventlog library.

    The application doesn't know anything about what I named the ID of the
    event. It's just a number used by the lower-layer eventlog module to
    find the first free slot at startup.
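    That newest-first mapping is a one-liner once the newest slot and the
    entry count are known (both discovered at startup); a sketch, with
    made-up names:

```c
#include <stdint.h>

#define SLOT_COUNT (5 * 256)

/* Map the application's view (k = 0 is the newest event) to a slot
   index, given the slot of the newest entry and how many entries the
   log currently holds.  Returns -1 past the oldest entry. */
static int app_index_to_slot(uint16_t newest_slot, uint16_t count, uint16_t k)
{
    if (k >= count)
        return -1;
    return (int)((newest_slot + SLOT_COUNT - k) % SLOT_COUNT);
}
```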


    But the problems don't stop here. What happens if an event write
    fails? Should I verify the real written values, by reading them
    back and comparing with the original data? In a reliable system,
    yes, I should.

    I was thinking of protecting each event with a CRC, so as to detect
    at any time whether an event is corrupted (badly written). However
    the possibility of having isolated corrupted events increases the
    complexity of the task a lot.

    An 8-bit or 16-bit CRC is peanuts to calculate and check.

    I know. Here the increased complexity wasn't related to the CRC
    calculation, but to the possibility to have isolated corrupted slots
    in the buffer. Taking into account these corrupted slots isn't so
    simple for me.


    Think how such corruption could happen, and its consequences.  For most event logs, it is simply not going to occur in the lifetime of working products - and if it does, as an isolated error in an event log, it
    doesn't matter significantly.  Errors in a sensibly designed SPI NOR
    flash system would be an indication of serious hardware problems such as erratic power supplies, and then the log is the least of your concerns.

    The only thing to consider is a reset or power failure in the middle of writing a log event.

    Yes, I agree with you.


    Write the whole log entry except for a byte or two (whatever is the
    minimum write size for the flashes), which must be something other
    than 0xff's.  Check the entry after writing if you really want, then
    write the final byte or two.  Then if there's a power-fail in the
    middle, it's obvious that the log entry is bad.

    I don't understand why I should write in two steps. I think I can
    write all the bytes of an entry, including the CRC, then read them
    all back and check integrity.


    No, you can't.  Or to be more accurate, you /can/, but it leaves you
    open to the risk of undetectable errors in the event of a power fail in
    the middle of writing.  An 8-bit CRC has a 1 in 256 chance of passing if data is partially written.

    I can use a crc16. We are talking about a power-fail/reset in the
    middle of an event write; this is possible, but rare. I think the
    case of a crc16 reporting a valid slot after a wrong or incomplete
    write is extremely rare.


    If you have enough capacitance on the board to ensure that any write completes, and detection of failure of external power so that you never start a log write after power has failed, then it is probably safe to do
    the write as a single action.

    No, I can't know in advance when the power will fail.


    For SPI NOR flash, the risk of a bad write - other than through
    unexpected resets or power fails - is negligible for most purposes.
    Don't over-complicate things worrying about something that will never
    happen.

    Indeed, unexpected resets, and especially power fails, are possible.
    Rare, but possible.


    Yes.


    Suppose I write the 4th event with ID=3 at position 4 in the
    sector. Now suppose the write fails. I can try to re-write the same
    event at position 5. Should I use ID=4 or ID=5? At first, I think
    it's better to use ID=5. The CRC should detect that a corrupted
    event is stored at position 4.
    After that, two more events are written fine, with ID=6 and ID=7.

    Now the higher application wants to read the 4th event of the log
    (starting from the newest). We know that the newest is ID=7, so the
    4th event is ID=7-4+1=4. However the event with ID=4 is corrupted,
    so we should check the previous ID=3... and so on.
    So we can't have a mathematical function that returns the position
    of an event starting from its ordinal number from the newest event.


    In the end, I think it's better to use a public domain library, if
    one exists.

    In my experience, these things are simple enough to implement, and
    specific enough to the application and target, that I don't see the need
    of a library.



    --- Synchronet 3.20a-Linux NewsLink 1.114