• help: pandas and 2d table

    From jak@nospam@please.ty to comp.lang.python on Fri Apr 12 20:40:09 2024
    From Newsgroup: comp.lang.python

    Hi everyone.
    I state that I don't know anything about 'pandas' but I intuited that
    it could do what I want. I get, through the "read_excel" method, a
    table similar to this:

    obj| foo1 foo2 foo3 foo4 foo5 foo6
    -----------------------------------
    foo1| aa ab zz ad ae af
    |
    foo2| ba bb bc bd zz bf
    |
    foo3| ca zz cc cd ce zz
    |
    foo4| da db dc dd de df
    |
    foo5| ea eb ec zz ee ef
    |
    foo6| fa fb fc fd fe ff


    And I would like to get a result similar to this:

    {
    'zz':[('foo1','foo3'),
    ('foo2','foo5'),
    ('foo3','foo2'),
    ('foo3','foo6'),
    ('foo5','foo4')
    ]
    }

    Would you show me the path, please?
    Thank you in advance.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From ram@ram@zedat.fu-berlin.de (Stefan Ram) to comp.lang.python on Fri Apr 12 19:22:54 2024
    From Newsgroup: comp.lang.python

    jak <nospam@please.ty> wrote or quoted:
    Would you show me the path, please?

    I was not able to read xls here, so I used csv instead; Warning:
    the script will overwrite file "file_20240412201813_tmp_DML.csv"!

    import pandas as pd

    with open( 'file_20240412201813_tmp_DML.csv', 'w' )as out:
    print( '''obj,foo1,foo2,foo3,foo4,foo5,foo6
    foo1,aa,ab,zz,ad,ae,af
    foo2,ba,bb,bc,bd,zz,bf
    foo3,ca,zz,cc,cd,ce,zz
    foo4,da,db,dc,dd,de,df
    foo5,ea,eb,ec,zz,ee,ef
    foo6,fa,fb,fc,fd,fe,ff''', file=out )

    df = pd.read_csv( 'file_20240412201813_tmp_DML.csv' )

    result = {}

    for rownum, row in df.iterrows():
    iterator = row.items()
    _, rowname = next( iterator )
    for colname, value in iterator:
    if value not in result: result[ value ]= []
    result[ value ].append( ( rowname, colname ))

    print( result )
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From jak@nospam@please.ty to comp.lang.python on Sat Apr 13 15:00:35 2024
    From Newsgroup: comp.lang.python

    Stefan Ram ha scritto:
    jak <nospam@please.ty> wrote or quoted:
    Would you show me the path, please?

    I was not able to read xls here, so I used csv instead; Warning:
    the script will overwrite file "file_20240412201813_tmp_DML.csv"!

    import pandas as pd

    with open( 'file_20240412201813_tmp_DML.csv', 'w' )as out:
    print( '''obj,foo1,foo2,foo3,foo4,foo5,foo6
    foo1,aa,ab,zz,ad,ae,af
    foo2,ba,bb,bc,bd,zz,bf
    foo3,ca,zz,cc,cd,ce,zz
    foo4,da,db,dc,dd,de,df
    foo5,ea,eb,ec,zz,ee,ef
    foo6,fa,fb,fc,fd,fe,ff''', file=out )

    df = pd.read_csv( 'file_20240412201813_tmp_DML.csv' )

    result = {}

    for rownum, row in df.iterrows():
    iterator = row.items()
    _, rowname = next( iterator )
    for colname, value in iterator:
    if value not in result: result[ value ]= []
    result[ value ].append( ( rowname, colname ))

    print( result )


    In reality what I wanted to achieve was this:

    what = 'zz'
    result = {what: []}

    for rownum, row in df.iterrows():
    iterator = row.items()
    _, rowname = next(iterator)
    for colname, value in iterator:
    if value == what:
    result[what] += [(rowname, colname)]
    print(result)

    In any case, thank you again for pointing me in the right direction. I
    had lost myself looking for a pandas method that would do this in a
    single shot or almost.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Mats Wichmann@mats@wichmann.us to comp.lang.python on Sat Apr 13 11:07:37 2024
    From Newsgroup: comp.lang.python

    On 4/13/24 07:00, jak via Python-list wrote:
    Stefan Ram ha scritto:
    jak <nospam@please.ty> wrote or quoted:
    Would you show me the path, please?

       I was not able to read xls here, so I used csv instead; Warning:
       the script will overwrite file "file_20240412201813_tmp_DML.csv"!

    import pandas as pd

    with open( 'file_20240412201813_tmp_DML.csv', 'w' )as out:
         print( '''obj,foo1,foo2,foo3,foo4,foo5,foo6
    foo1,aa,ab,zz,ad,ae,af
    foo2,ba,bb,bc,bd,zz,bf
    foo3,ca,zz,cc,cd,ce,zz
    foo4,da,db,dc,dd,de,df
    foo5,ea,eb,ec,zz,ee,ef
    foo6,fa,fb,fc,fd,fe,ff''', file=out )

    df = pd.read_csv( 'file_20240412201813_tmp_DML.csv' )

    result = {}

    for rownum, row in df.iterrows():
         iterator = row.items()
         _, rowname = next( iterator )
         for colname, value in iterator:
             if value not in result: result[ value ]= []
             result[ value ].append( ( rowname, colname ))

    print( result )


    In reality what I wanted to achieve was this:

        what = 'zz'
        result = {what: []}

        for rownum, row in df.iterrows():
            iterator = row.items()
            _, rowname = next(iterator)
            for colname, value in iterator:
                if value == what:
                    result[what] += [(rowname, colname)]
        print(result)

    In any case, thank you again for pointing me in the right direction. I
    had lost myself looking for a pandas method that would do this in a
    single shot or almost.


    doesn't Pandas have a "where" method that can do this kind of thing? Or doesn't it match what you are looking for? Pretty sure numpy does, but
    that's a lot to bring in if you don't need the rest of numpy.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Williams@tjandacw@gmail.com to comp.lang.python on Sat Apr 13 13:14:35 2024
    From Newsgroup: comp.lang.python

    On Sat, Apr 13, 2024 at 1:10 PM Mats Wichmann via Python-list < python-list@python.org> wrote:
    On 4/13/24 07:00, jak via Python-list wrote:

    doesn't Pandas have a "where" method that can do this kind of thing? Or doesn't it match what you are looking for? Pretty sure numpy does, but that's a lot to bring in if you don't need the rest of numpy.

    pandas.DataFrame.where — pandas 2.2.2 documentation (pydata.org)
    <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html#pandas.DataFrame.where>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From ram@ram@zedat.fu-berlin.de (Stefan Ram) to comp.lang.python on Sat Apr 13 18:39:51 2024
    From Newsgroup: comp.lang.python

    Tim Williams <tjandacw@gmail.com> wrote or quoted: >e.where.html#pandas.DataFrame.where>

    Threw together a quick thing with "where", see how it looks.

    import pandas as pd

    # Warning! Will overwrite the file 'file_20240412201813_tmp_DML.csv'!
    with open( 'file_20240412201813_tmp_DML.csv', 'w' )as out:
    print( '''obj,foo1,foo2,foo3,foo4,foo5,foo6
    foo1,aa,ab,zz,ad,ae,af
    foo2,ba,bb,bc,bd,zz,bf
    foo3,ca,zz,cc,cd,ce,zz
    foo4,da,db,dc,dd,de,df
    foo5,ea,eb,ec,zz,ee,ef
    foo6,fa,fb,fc,fd,fe,ff''', file=out )

    # Note the "index_col=0" below, which is important here!
    df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )

    df = df.where( df == 'zz' ).stack().reset_index()
    result ={ 'zz': list( zip( df.iloc[ :, 0 ], df.iloc[ :, 1 ]))}

    print( result )

    Prints here:

    {'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]}

    .
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From jak@nospam@please.ty to comp.lang.python on Sat Apr 13 23:35:57 2024
    From Newsgroup: comp.lang.python

    Stefan Ram ha scritto:
    df = df.where( df == 'zz' ).stack().reset_index()
    result ={ 'zz': list( zip( df.iloc[ :, 0 ], df.iloc[ :, 1 ]))}

    Since I don't know Pandas, I will need a month at least to understand
    these 2 lines of code. Thanks again.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From ram@ram@zedat.fu-berlin.de (Stefan Ram) to comp.lang.python on Sun Apr 14 08:58:16 2024
    From Newsgroup: comp.lang.python

    jak <nospam@please.ty> wrote or quoted:
    Stefan Ram ha scritto:
    df = df.where( df == 'zz' ).stack().reset_index()
    result ={ 'zz': list( zip( df.iloc[ :, 0 ], df.iloc[ :, 1 ]))}
    Since I don't know Pandas, I will need a month at least to understand
    these 2 lines of code. Thanks again.

    Here's a technique to better understand such code:

    Transform it into a program with small statements and small
    expressions with no more than one call per statement if possible.
    (After each litte change check that the output stays the same.)

    import pandas as pd

    # Warning! Will overwrite the file 'file_20240412201813_tmp_DML.csv'!
    with open( 'file_20240412201813_tmp_DML.csv', 'w' )as out:
    print( '''obj,foo1,foo2,foo3,foo4,foo5,foo6
    foo1,aa,ab,zz,ad,ae,af
    foo2,ba,bb,bc,bd,zz,bf
    foo3,ca,zz,cc,cd,ce,zz
    foo4,da,db,dc,dd,de,df
    foo5,ea,eb,ec,zz,ee,ef
    foo6,fa,fb,fc,fd,fe,ff''', file=out )
    # Note the "index_col=0" below, which is important here!
    df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )

    selection = df.where( df == 'zz' )
    selection_stack = selection.stack()
    df = selection_stack.reset_index()
    df0 = df.iloc[ :, 0 ]
    df1 = df.iloc[ :, 1 ]
    z = zip( df0, df1 )
    l = list( z )
    result ={ 'zz': l }
    print( result )

    I suggest to next insert print statements to print each intermediate
    value:

    # Note the "index_col=0" below, which is important here!
    df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )
    print( 'df = \n', type( df ), ':\n"', df, '"\n' )

    selection = df.where( df == 'zz' )
    print( "result of where( df == 'zz' ) = \n", type( selection ), ':\n"',
    selection, '"\n' )

    selection_stack = selection.stack()
    print( 'result of stack() = \n', type( selection_stack ), ':\n"',
    selection_stack, '"\n' )

    df = selection_stack.reset_index()
    print( 'result of reset_index() = \n', type( df ), ':\n"', df, '"\n' )

    df0 = df.iloc[ :, 0 ]
    print( 'value of .iloc[ :, 0 ]= \n', type( df0 ), ':\n"', df0, '"\n' )

    df1 = df.iloc[ :, 1 ]
    print( 'value of .iloc[ :, 1 ] = \n', type( df1 ), ':\n"', df1, '"\n' )

    z = zip( df0, df1 )
    print( 'result of zip( df0, df1 )= \n', type( z ), ':\n"', z, '"\n' )

    l = list( z )
    print( 'result of list( z )= \n', type( l ), ':\n"', l, '"\n' )

    result ={ 'zz': l }
    print( "value of { 'zz': l }= \n", type( result ), ':\n"',
    result, '"\n' )

    print( result )

    Now you can see what each single step does!

    df =
    <class 'pandas.core.frame.DataFrame'> :
    " foo1 foo2 foo3 foo4 foo5 foo6
    obj
    foo1 aa ab zz ad ae af
    foo2 ba bb bc bd zz bf
    foo3 ca zz cc cd ce zz
    foo4 da db dc dd de df
    foo5 ea eb ec zz ee ef
    foo6 fa fb fc fd fe ff "

    result of where( df == 'zz' ) =
    <class 'pandas.core.frame.DataFrame'> :
    " foo1 foo2 foo3 foo4 foo5 foo6
    obj
    foo1 NaN NaN zz NaN NaN NaN
    foo2 NaN NaN NaN NaN zz NaN
    foo3 NaN zz NaN NaN NaN zz
    foo4 NaN NaN NaN NaN NaN NaN
    foo5 NaN NaN NaN zz NaN NaN
    foo6 NaN NaN NaN NaN NaN NaN "

    result of stack() =
    <class 'pandas.core.series.Series'> :
    " obj
    foo1 foo3 zz
    foo2 foo5 zz
    foo3 foo2 zz
    foo6 zz
    foo5 foo4 zz
    dtype: object "

    result of reset_index() =
    <class 'pandas.core.frame.DataFrame'> :
    " obj level_1 0
    0 foo1 foo3 zz
    1 foo2 foo5 zz
    2 foo3 foo2 zz
    3 foo3 foo6 zz
    4 foo5 foo4 zz "

    value of .iloc[ :, 0 ]=
    <class 'pandas.core.series.Series'> :
    " 0 foo1
    1 foo2
    2 foo3
    3 foo3
    4 foo5
    Name: obj, dtype: object "

    value of .iloc[ :, 1 ] =
    <class 'pandas.core.series.Series'> :
    " 0 foo3
    1 foo5
    2 foo2
    3 foo6
    4 foo4
    Name: level_1, dtype: object "

    result of zip( df0, df1 )=
    <class 'zip'> :
    " <zip object at 0x000000000B3B9548> "

    result of list( z )=
    <class 'list'> :
    " [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')] "

    value of { 'zz': l }=
    <class 'dict'> :
    " {'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]} "

    {'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]}

    The script reads a CSV file and stores the data in a Pandas
    DataFrame object named "df". The "index_col=0" parameter tells
    Pandas to use the first column as the index for the DataFrame,
    which is kinda like column headers.

    The "where" creates a new DataFrame selection that contains
    the same data as df, but with all values replaced by NaN (Not
    a Number) except for the values that are equal to 'zz'.

    "stack" returns a Series with a multi-level index created
    by pivoting the columns. Here it gives a Series with the
    row-col-addresses of a all the non-NaN values. The general
    meaning of "stack" might be the most complex operation of
    this script. It's explained in the pandas manual (see there).

    "reset_index" then just transforms this Series back into a
    DataFrame, and ".iloc[ :, 0 ]" and ".iloc[ :, 1 ]" are the
    first and second column, respectively, of that DataFrame. These
    then are zipped to get the desired form as a list of pairs.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From jak@nospam@please.ty to comp.lang.python on Mon Apr 15 08:05:18 2024
    From Newsgroup: comp.lang.python

    Stefan Ram ha scritto:
    jak <nospam@please.ty> wrote or quoted:
    Stefan Ram ha scritto:
    df = df.where( df == 'zz' ).stack().reset_index()
    result ={ 'zz': list( zip( df.iloc[ :, 0 ], df.iloc[ :, 1 ]))}
    Since I don't know Pandas, I will need a month at least to understand
    these 2 lines of code. Thanks again.

    Here's a technique to better understand such code:

    Transform it into a program with small statements and small
    expressions with no more than one call per statement if possible.
    (After each litte change check that the output stays the same.)

    import pandas as pd

    # Warning! Will overwrite the file 'file_20240412201813_tmp_DML.csv'!
    with open( 'file_20240412201813_tmp_DML.csv', 'w' )as out:
    print( '''obj,foo1,foo2,foo3,foo4,foo5,foo6
    foo1,aa,ab,zz,ad,ae,af
    foo2,ba,bb,bc,bd,zz,bf
    foo3,ca,zz,cc,cd,ce,zz
    foo4,da,db,dc,dd,de,df
    foo5,ea,eb,ec,zz,ee,ef
    foo6,fa,fb,fc,fd,fe,ff''', file=out )
    # Note the "index_col=0" below, which is important here!
    df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )

    selection = df.where( df == 'zz' )
    selection_stack = selection.stack()
    df = selection_stack.reset_index()
    df0 = df.iloc[ :, 0 ]
    df1 = df.iloc[ :, 1 ]
    z = zip( df0, df1 )
    l = list( z )
    result ={ 'zz': l }
    print( result )

    I suggest to next insert print statements to print each intermediate
    value:

    # Note the "index_col=0" below, which is important here!
    df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )
    print( 'df = \n', type( df ), ':\n"', df, '"\n' )

    selection = df.where( df == 'zz' )
    print( "result of where( df == 'zz' ) = \n", type( selection ), ':\n"',
    selection, '"\n' )

    selection_stack = selection.stack()
    print( 'result of stack() = \n', type( selection_stack ), ':\n"',
    selection_stack, '"\n' )

    df = selection_stack.reset_index()
    print( 'result of reset_index() = \n', type( df ), ':\n"', df, '"\n' )

    df0 = df.iloc[ :, 0 ]
    print( 'value of .iloc[ :, 0 ]= \n', type( df0 ), ':\n"', df0, '"\n' )

    df1 = df.iloc[ :, 1 ]
    print( 'value of .iloc[ :, 1 ] = \n', type( df1 ), ':\n"', df1, '"\n' )

    z = zip( df0, df1 )
    print( 'result of zip( df0, df1 )= \n', type( z ), ':\n"', z, '"\n' )

    l = list( z )
    print( 'result of list( z )= \n', type( l ), ':\n"', l, '"\n' )

    result ={ 'zz': l }
    print( "value of { 'zz': l }= \n", type( result ), ':\n"',
    result, '"\n' )

    print( result )

    Now you can see what each single step does!

    df =
    <class 'pandas.core.frame.DataFrame'> :
    " foo1 foo2 foo3 foo4 foo5 foo6
    obj
    foo1 aa ab zz ad ae af
    foo2 ba bb bc bd zz bf
    foo3 ca zz cc cd ce zz
    foo4 da db dc dd de df
    foo5 ea eb ec zz ee ef
    foo6 fa fb fc fd fe ff "

    result of where( df == 'zz' ) =
    <class 'pandas.core.frame.DataFrame'> :
    " foo1 foo2 foo3 foo4 foo5 foo6
    obj
    foo1 NaN NaN zz NaN NaN NaN
    foo2 NaN NaN NaN NaN zz NaN
    foo3 NaN zz NaN NaN NaN zz
    foo4 NaN NaN NaN NaN NaN NaN
    foo5 NaN NaN NaN zz NaN NaN
    foo6 NaN NaN NaN NaN NaN NaN "

    result of stack() =
    <class 'pandas.core.series.Series'> :
    " obj
    foo1 foo3 zz
    foo2 foo5 zz
    foo3 foo2 zz
    foo6 zz
    foo5 foo4 zz
    dtype: object "

    result of reset_index() =
    <class 'pandas.core.frame.DataFrame'> :
    " obj level_1 0
    0 foo1 foo3 zz
    1 foo2 foo5 zz
    2 foo3 foo2 zz
    3 foo3 foo6 zz
    4 foo5 foo4 zz "

    value of .iloc[ :, 0 ]=
    <class 'pandas.core.series.Series'> :
    " 0 foo1
    1 foo2
    2 foo3
    3 foo3
    4 foo5
    Name: obj, dtype: object "

    value of .iloc[ :, 1 ] =
    <class 'pandas.core.series.Series'> :
    " 0 foo3
    1 foo5
    2 foo2
    3 foo6
    4 foo4
    Name: level_1, dtype: object "

    result of zip( df0, df1 )=
    <class 'zip'> :
    " <zip object at 0x000000000B3B9548>"

    result of list( z )=
    <class 'list'> :
    " [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]"

    value of { 'zz': l }=
    <class 'dict'> :
    " {'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]}"

    {'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]}

    The script reads a CSV file and stores the data in a Pandas
    DataFrame object named "df". The "index_col=0" parameter tells
    Pandas to use the first column as the index for the DataFrame,
    which is kinda like column headers.

    The "where" creates a new DataFrame selection that contains
    the same data as df, but with all values replaced by NaN (Not
    a Number) except for the values that are equal to 'zz'.

    "stack" returns a Series with a multi-level index created
    by pivoting the columns. Here it gives a Series with the
    row-col-addresses of a all the non-NaN values. The general
    meaning of "stack" might be the most complex operation of
    this script. It's explained in the pandas manual (see there).

    "reset_index" then just transforms this Series back into a
    DataFrame, and ".iloc[ :, 0 ]" and ".iloc[ :, 1 ]" are the
    first and second column, respectively, of that DataFrame. These
    then are zipped to get the desired form as a list of pairs.


    And this is a technique very similar to reverse engineering. Thanks for
    the explanation and examples. All this is really clear and I was able to
    follow it easily because I have already written a version of this code
    in C without any kind of external library that uses the .CSV version of
    the table as data ( 234 code lines :^/ ).


    --- Synchronet 3.20a-Linux NewsLink 1.114