• feature share : a more sensible approach to browsing binary files

    From Kpop 2GM@jason.cy.kwan@gmail.com to comp.lang.awk on Thu Jun 9 05:54:35 2022
    From Newsgroup: comp.lang.awk

    although existing tools like " od " and " xxd " are very powerful in their own right, the information overload sometimes becomes detrimental when one only wants to browse a file at a high level, to gain *TEXTUAL* context and perspective.
    the caret-notation approach, such as in " less ", creates even more of a headache; disassemblers are definitely too low-level for anything meaningful.
    furthermore, binary executables frequently have looooong chains of null bytes or chains of 0xFF ( \377 ), making it easy to get lost when attempting to find textual context, while straight-up URL encoding clumps the encoded hex together with actual alphanumeric characters.
    My goal was to find a different, more sensible approach, without re-inventing the wheel or attempting to replace any of the existing tools.
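The pipeline in this post expects input that is already URL-encoded, and the post leaves the choice of encoder open. As one possible way to get there, here is a minimal sketch using od and awk. The function name urlencode_bytes is made up for illustration, and the choices of "+" for space and uppercase hex are assumptions picked to match what the awk code looks for ( [+] and [0-9A-F] ).

```shell
# urlencode_bytes: hypothetical helper, not part of the original post.
# Reads raw bytes on stdin and writes URL-encoded text on stdout.
# Unreserved characters pass through, space becomes "+",
# and every other byte becomes %XX with uppercase hex digits.
urlencode_bytes() {
    od -An -v -tx1 | awk '
    BEGIN {
        # Map each two-digit lowercase hex value (as printed by od) to its output token.
        for (i = 0; i < 256; i++) {
            hex = sprintf("%02x", i)
            if ((i >= 48 && i <= 57) || (i >= 65 && i <= 90) || (i >= 97 && i <= 122) ||
                i == 45 || i == 46 || i == 95 || i == 126)
                map[hex] = sprintf("%c", i)       # 0-9 A-Z a-z - . _ ~
            else if (i == 32)
                map[hex] = "+"                    # space
            else
                map[hex] = sprintf("%%%02X", i)   # everything else as %XX
        }
    }
    # od prints whitespace-separated hex bytes; emit the token for each one.
    { for (f = 1; f <= NF; f++) printf "%s", map[tolower($f)] }
    '
}
```

Something like `urlencode_bytes < some_binary` would then feed the awk program that follows ( some_binary is a placeholder file name ).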
    —————————— (the leading edge underscores are only for usenet formatting purposes - those aren't part of the code)
    …..< binary executable file, already URL-encoded (using any of your existing preferred methodologies) > …… |
    gawk -be / mawk 'BEGIN {
    __ RS = FS = "^$"    # slurp the whole input as a single record
    __ OFS = ORS = ""
    }
    END {
    __ gsub("([%]00)+", "\n ~~nulls~~ ")    # squeeze runs of NUL bytes
    __ gsub("[%]FF([%]FF)+", "\n ..FF\047s.. ")    # \047 is an apostrophe, safe inside the shell quotes
    __ gsub("[+]", " ")
    __ gsub("[%]", "|")
    __ gsub("[%|]25", "\371")    # park literal percent signs
    __ gsub("[|]", "\367")
    __ gsub("[\367][0-9A-F][0-9A-F]", "\301&\300")
    __ ___ = sprintf("%c", _ = 3 ^ 4 + 5 + 6)    # 92, a backslash
    __ for (_ = 3 + 3 ^ 3 + 3; (_ - 2) < (2 + 3) ^ 3; _++) {    # 33 ("!") through 126 ("~")
    __ __ if (_ != (5 + 2 ^ 5 )) {    # skip 37 ("%")
    __ __ __ gsub(sprintf("\367%.2X", _),
    __ __ __ __ sprintf("%.*s%c", _ == (2 + 6 ^ 2), ___, _))    # escape 38 ("&") for gsub
    __ __ }
    __ }
    __ gsub("\301\367", "_[_")
    __ gsub("\300", "_]_")
    __ gsub("\371", "%")
    __ gsub("[\000-\t\v-\037\177-\377]", "")
    __ print
    }'
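For anyone untangling the arithmetic in the loop above: 3 ^ 4 + 5 + 6 is 92 (backslash), 3 + 3 ^ 3 + 3 is 33 ("!"), the bound (_ - 2) < (2 + 3) ^ 3 stops after 126 ("~"), 5 + 2 ^ 5 is 37 ("%") and 2 + 6 ^ 2 is 38 ("&"). Below is a rough sketch of just that decoding step with the constants written out. It is not a drop-in replacement: it works directly on %XX, uses \001 as the placeholder for a literal percent sign instead of \371, and skips the marker wrapping. The names decode_printable, buf, code and ph are mine.

```shell
# decode_printable: hypothetical helper, not part of the original post.
# Decodes %XX escapes for printable ASCII (33 through 126) and leaves
# every other escape, such as %00 or %FF, still hex-encoded.
decode_printable() {
    LC_ALL=C awk '
    { buf = buf $0 }                      # URL-encoded input has no raw newlines
    END {
        ph = "\001"                       # placeholder for a literal percent sign
        gsub(/%25/, ph, buf)              # park %25 so it cannot form new escapes
        for (code = 33; code <= 126; code++) {
            if (code == 37) continue      # percent sign is handled via the placeholder
            ch = sprintf("%c", code)
            if (code == 38) ch = "\\" ch  # a bare & in a gsub replacement means "matched text"
            gsub(sprintf("%%%02X", code), ch, buf)
        }
        gsub(ph, "%", buf)                # restore literal percent signs last
        print buf
    }'
}
```

Feeding it `Op%5Ftimes%00%2A%26%2541` should give `Op_times%00*&%41`: the printable escapes are decoded, %00 stays encoded, and the parked %25 comes back as a plain percent sign without turning the following "41" into an escape.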
    the basic idea is to keep ascii control bytes and 8-bit bytes hex-encoded, squeeze long chains of NULLs ( \000 ) or xFF's ( \377 ) and treat each squeezed chain as the ORS instead of a new line, while visually padding out the remaining hex so it doesn't visually interfere with actual ascii alphanumerics, and making it appear in a unique fashion so that it wouldn't be misconstrued as regex classes.
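To make the squeezing step concrete, here is a small sketch that runs only the two squeeze substitutions on a synthetic input. The function name squeeze_runs and the variable buf are made up for illustration. One detail worth noticing from the two patterns: any run of %00, even a single one, is squeezed, while %FF is only squeezed when there are two or more in a row.

```shell
# squeeze_runs: hypothetical helper, not part of the original post.
# Collapses runs of %00 and runs of two or more %FF into one marker each,
# starting a fresh output line at every squeezed run.
squeeze_runs() {
    awk '
    { buf = buf $0 }                                   # collect the URL-encoded input
    END {
        gsub(/(%00)+/, "\n ~~nulls~~ ", buf)           # one or more NUL bytes
        gsub(/%FF(%FF)+/, "\n ..FF\047s.. ", buf)      # two or more 0xFF bytes
        print buf
    }'
}
```

With the input `abc%00%00%00def%FF%FF%FFghi%FFjkl`, the three nulls collapse into one ` ~~nulls~~ ` line break, the three FF's collapse into one ` ..FF's.. ` line break, and the lone %FF before "jkl" is left as it is.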
    using gawk's executable as an example, now i can clearly see how the error messages, both internal and external, are laid out, without going to the C source files :
    e.g. Op codes with their related ascii-punctuation character next to them :
    ~~nulls~~ Op_times
    ~~nulls~~ *_]_
    ~~nulls~~ Op_times_i
    ~~nulls~~ Op_quotient
    ~~nulls~~ /_]_
    ~~nulls~~ Op_quotient_i
    ~~nulls~~ Op_mod
    ~~nulls~~ %
    ~~nulls~~ Op_mod_i
    ~~nulls~~ Op_plus
    ~~nulls~~ +_]_
    ~~nulls~~ Op_plus_i
    ~~nulls~~ Op_minus
    ~~nulls~~ -
    or see a whole slew of error messages, both internal and external, grouped together, which helps with identifying discrepancies of approach :
    ~~nulls~~ for loop:_]_ array `_]_%s'_]_ changed size from %ld to %ld during loop execution
    ~~nulls~~ %s:_]_ called with %lu arguments,_]_ expecting at least %lu
    ~~nulls~~ %s:_]_ called with %lu arguments,_]_ expecting no more than %lu
    ~~nulls~~ indirect function call requires a simple scalar value
    ~~nulls~~ `_]_%s'_]_ is not a function,_]_ so it cannot be called indirectly
    ~~nulls~~ function called indirectly through `_]_%s'_]_ does not exist
    ~~nulls~~ function `_]_%s'_]_ not defined
    ~~nulls~~ error reading input file `_]_%s'_]_:_]_ %s
    ~~nulls~~ `_]_nextfile'_]_ cannot be called from a `_]_%s'_]_ rule
    ~~nulls~~ `_]_exit'_]_ cannot be called in the current context
    ~~nulls~~ `_]_next'_]_ cannot be called from a `_]_%s'_]_ rule
    ~~nulls~~ Sorry,_]_ don'_]_t know how to interpret `_]_%s'_]_
    ~~nulls~~ GAWK_STACKSIZE
    ~~nulls~~ Node_illegal
    ~~nulls~~ Node_val
    or sections that appear to be pre-sorted (perhaps related to collation ordering ?), where one can quickly glance and see if anything glaring got accidentally left out :
    ~~nulls~~ _[_E3_]__[_05_]_
    ~~nulls~~ _[_E4_]__[_05_]_
    ~~nulls~~ _[_E5_]__[_05_]_
    ~~nulls~~ _[_E6_]__[_05_]_
    ~~nulls~~ _[_E7_]__[_05_]_
    ~~nulls~~ _[_E8_]__[_05_]_
    ~~nulls~~ _[_E9_]__[_05_]_
    ~~nulls~~ _[_EA_]__[_05_]_
    ~~nulls~~ _[_EB_]__[_05_]_
    ~~nulls~~ _[_EC_]__[_05_]_
    ~~nulls~~ _[_ED_]__[_05_]_
    ~~nulls~~ _[_EE_]__[_05_]_
    ~~nulls~~ _[_EF_]__[_05_]_
    after some trial-and-error, i find " _[_7F_]_ " to be the least intrusive visually: it provides a clear gap from its surrounding bytes/text, without overlapping with the various equivalence/collation/character-class syntaxes.
    while it's still valid regex in its own right, i can't imagine many writing underscore twice in the same class for no reason.
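As a small illustration of the marker style on its own, the sketch below wraps every %XX escape still left in its input as _[_XX_]_ . It is a simplified stand-in, not the placeholder-byte sequence used in the posted code, and it assumes uppercase hex digits and an input with no bare percent signs. The name wrap_hex is made up for illustration.

```shell
# wrap_hex: hypothetical helper, not part of the original post.
# Rewrites each remaining %XX escape as _[_XX_]_ so it stands apart
# from the surrounding plain text.
wrap_hex() {
    awk '
    { buf = buf $0 }
    END {
        gsub(/%[0-9A-F][0-9A-F]/, "_[_&_]_", buf)   # & stands for the matched %XX
        gsub(/_\[_%/, "_[_", buf)                   # drop the percent sign inside the marker
        print buf
    }'
}
```

For example, `GAWK%7F%05x` should come out as `GAWK_[_7F_]__[_05_]_x`, the same shape as the E3 through EF sample above.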
    All that said, it's still a work in progress, as some of the rougher edges still don't look as ideal as i had hoped.
    — The 4Chan Teller