• Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

    From Abdur-Rahmaan Janhangeer@arj.python@gmail.com to comp.lang.python on Mon Sep 30 09:49:21 2024
    From Newsgroup: comp.lang.python

    I don't know if you have tried Polars, but it seems to work well with JSON data:
    import polars as pl
    pl.read_json("file.json")
    Kind Regards,
    Abdur-Rahmaan Janhangeer
    about <https://compileralchemy.github.io/> | blog <https://www.pythonkitchen.com>
    github <https://github.com/Abdur-RahmaanJ>
    Mauritius
    On Mon, Sep 30, 2024 at 8:00 AM Asif Ali Hirekumbi via Python-list <python-list@python.org> wrote:
    Dear Python Experts,

    I am working with the Kenna Application's API to retrieve vulnerability
    data. The API endpoint provides a single, massive JSON file in gzip format, approximately 60 GB in size. Handling such a large dataset in one go is proving to be quite challenging, especially in terms of memory management.

    I am looking for guidance on how to efficiently stream this data and
    process it in chunks using Python. Specifically, I am wondering if there’s a way to use the requests library or any other libraries that would allow
    us to pull data from the API endpoint in a memory-efficient manner.

    Here are the relevant API endpoints from Kenna:

    - Kenna API Documentation
    <https://apidocs.kennasecurity.com/reference/welcome>
    - Kenna Vulnerabilities Export
    <https://apidocs.kennasecurity.com/reference/retrieve-data-export>
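    On the streaming question itself, one memory-friendly pattern is to wrap the raw (still-compressed) response in gzip and pull array items out incrementally with json.JSONDecoder.raw_decode. The sketch below is illustrative, not Kenna-specific: the URL, the auth header, and the assumption that the payload is a top-level JSON array are placeholders to check against the docs above.

```python
import gzip
import io
import json

def iter_array_items(fileobj, read_size=1 << 16):
    """Yield the top-level items of a JSON array from a text stream,
    keeping only the current (possibly incomplete) item in memory.
    Caveat: assumes items are objects/arrays/strings; a bare number
    split across reads could be yielded prematurely (the classic
    streaming-parser pitfall)."""
    dec = json.JSONDecoder()
    buf = ""
    started = False
    while True:
        piece = fileobj.read(read_size)
        buf += piece
        if not started:
            i = buf.find("[")
            if i >= 0:
                buf = buf[i + 1:]
                started = True
        if started:
            while True:
                buf = buf.lstrip(" \t\r\n,")
                if not buf or buf[0] == "]":
                    break
                try:
                    obj, end = dec.raw_decode(buf)
                except ValueError:
                    break  # current item incomplete; read more first
                yield obj
                buf = buf[end:]
        if not piece:
            break

# Against the real API, the undecoded body would be the file object, e.g.:
#   resp = requests.get(EXPORT_URL, headers=AUTH_HEADERS, stream=True)
#   fh = gzip.open(resp.raw, mode="rt", encoding="utf-8")
# (EXPORT_URL and AUTH_HEADERS are placeholders; see the Kenna docs above.)
# Here an in-memory gzip blob stands in for the download:
raw = io.BytesIO(gzip.compress(
    json.dumps([{"id": n} for n in range(5)]).encode()))

with gzip.open(raw, mode="rt", encoding="utf-8") as fh:
    ids = [item["id"] for item in iter_array_items(fh, read_size=8)]

print(ids)  # -> [0, 1, 2, 3, 4]
```

    Each item is parsed and discarded before the next is read, so peak memory is one item plus one read buffer rather than the whole 60 GB file.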

    If anyone has experience with similar use cases or can offer any advice, it would be greatly appreciated.

    Thank you in advance for your help!

    Best regards
    Asif Ali
    --
    https://mail.python.org/mailman/listinfo/python-list

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris Angelico@rosuav@gmail.com to comp.lang.python on Tue Oct 1 03:00:21 2024
    From Newsgroup: comp.lang.python

    On Tue, 1 Oct 2024 at 02:20, Thomas Passin via Python-list <python-list@python.org> wrote:

    > On 9/30/2024 11:30 AM, Barry via Python-list wrote:
    >
    >> On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list <python-list@python.org> wrote:
    >>
    >>> import polars as pl
    >>> pl.read_json("file.json")
    >>
    >> This is not going to work unless the computer has a lot more than 60 GiB of RAM.
    >>
    >> As suggested later, a streaming parser is required.
    >
    > Streaming won't work because the file is gzipped. You have to receive
    > the whole thing before you can unzip it. Once unzipped it will be even
    > larger, and all in memory.

    Streaming gzip is perfectly possible. You may be thinking of PKZip
    which has its EOCD at the end of the file (although it may still be
    possible to stream-decompress if you work at it).
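    Chris's point is easy to demonstrate with the standard library alone: zlib.decompressobj with wbits=31 accepts gzip framing and decompresses chunk by chunk, so the full plaintext never has to be in memory at once. A sketch, with an in-memory buffer standing in for the download:

```python
import gzip
import zlib

# Compressed up front only to simulate the downloaded byte stream.
compressed = gzip.compress(b"x" * 1_000_000)

d = zlib.decompressobj(wbits=31)  # wbits=31: expect a gzip header/trailer
total = 0
for i in range(0, len(compressed), 4096):  # pretend these are HTTP chunks
    total += len(d.decompress(compressed[i:i + 4096]))
total += len(d.flush())

print(total)  # -> 1000000; no step ever held the whole plaintext at once
```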

    ChrisA
  • From Grant Edwards@grant.b.edwards@gmail.com to comp.lang.python on Mon Sep 30 18:54:52 2024
    From Newsgroup: comp.lang.python

    On 2024-09-30, Dan Sommers via Python-list <python-list@python.org> wrote:

    > In Common Lisp, integers can be written in any integer base from two
    > to thirty six, inclusive. So knowing the last digit doesn't tell
    > you whether an integer is even or odd until you know the base
    > anyway.

    I had to think about that for an embarrassingly long time before it
    clicked.
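    A concrete way to see Dan's point, using Python's own base-aware int() purely as an illustration:

```python
# Same digit string, different parity depending on the base:
assert int("11", 3) == 4      # last digit 1, yet the value is even
assert int("11", 10) == 11    # last digit 1, value odd

# In any even base, the last digit alone does determine parity:
assert int("7", 10) % 2 == 1
assert int("F", 16) % 2 == 1
```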
  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.lang.python on Mon Sep 30 18:48:02 2024
    From Newsgroup: comp.lang.python

    2QdxY4RzWzUUiLuE@potatochowder.com writes:
    [...]
    > In Common Lisp, you can write integers as #nnR[digits], where nn is
    > the decimal representation of the base (possibly without a leading
    > zero), the # and the R are literal characters, and the digits are
    > written in the intended base. So the input #16fFFFF is read as the
    > integer 65535.

    Typo: You meant #16RFFFF, not #16fFFFF.
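    For readers more at home in Python than Lisp: int() accepts bases 2 through 36, the same range as Common Lisp's #nnR syntax, so the corrected example maps over directly.

```python
# Common Lisp's (corrected) #16RFFFF corresponds to:
assert int("FFFF", 16) == 65535  # runtime conversion, any base 2..36
assert 0xFFFF == 65535           # hexadecimal literal
assert int("ZZ", 36) == 1295     # base 36: digits 0-9 then A-Z
```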
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
  • From Left Right@olegsivokon@gmail.com to comp.lang.python on Wed Oct 2 08:05:02 2024
    From Newsgroup: comp.lang.python

    > By that definition of "streaming", no parser can ever be streaming,
    > because there will be some constructs that must be read in their
    > entirety before a suitably-structured piece of output can be
    > emitted.

    In the same email you replied to, I gave examples of languages for
    which parsers can be streaming (in general): SCSI or IP. For some
    languages (e.g. everything in the context-free family) streaming
    parsers are _in general_ impossible, because there are pathological
    cases like the one with parsing numbers. But this doesn't mean that
    you cannot come up with a parser that is only useful _sometimes_.
    And, in practice, languages like XML or JSON do well with streaming,
    even though in general it's impossible.
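    The number pathology is easy to demonstrate with Python's own incremental entry point, json.JSONDecoder.raw_decode (an illustration, not a full streaming parser):

```python
import json

dec = json.JSONDecoder()

# An object is known to be complete once its closing brace arrives:
obj, end = dec.raw_decode('{"a": 1} more-to-come')
assert (obj, end) == ({"a": 1}, 8)

# A bare top-level number can never be known to be complete: "12"
# parses successfully even if the sender was about to transmit "125".
num, end = dec.raw_decode("12")
assert (num, end) == (12, 2)
```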

    I'm sorry if this comes as a surprise. On one hand I don't want to
    sound condescending, on the other hand, this is something that you'd
    typically study in automata theory class. Well, not exactly in the
    very same words, but you should be able to figure this stuff out if
    you had that class.