Forum: War Ensemble BBS

help: pandas and 2d table

From jak@nospam@please.ty to comp.lang.python on Fri Apr 12 20:40:09 2024

From Newsgroup: comp.lang.python

Hi everyone.
I state that I don't know anything about 'pandas' but I intuited that
it could do what I want. I get, through the "read_excel" method, a
table similar to this:

obj| foo1 foo2 foo3 foo4 foo5 foo6
-----------------------------------
foo1| aa ab zz ad ae af
|
foo2| ba bb bc bd zz bf
|
foo3| ca zz cc cd ce zz
|
foo4| da db dc dd de df
|
foo5| ea eb ec zz ee ef
|
foo6| fa fb fc fd fe ff

And I would like to get a result similar to this:

{
'zz':[('foo1','foo3'),
('foo2','foo5'),
('foo3','foo2'),
('foo3','foo6'),
('foo5','foo4')
]
}

Would you show me the path, please?
Thank you in advance.

--- Synchronet 3.20a-Linux NewsLink 1.114

From ram@ram@zedat.fu-berlin.de (Stefan Ram) to comp.lang.python on Fri Apr 12 19:22:54 2024

From Newsgroup: comp.lang.python

jak <nospam@please.ty> wrote or quoted:

Would you show me the path, please?

I was not able to read xls here, so I used csv instead; Warning:
the script will overwrite file "file_20240412201813_tmp_DML.csv"!

import pandas as pd

with open( 'file_20240412201813_tmp_DML.csv', 'w' )as out:
print( '''obj,foo1,foo2,foo3,foo4,foo5,foo6
foo1,aa,ab,zz,ad,ae,af
foo2,ba,bb,bc,bd,zz,bf
foo3,ca,zz,cc,cd,ce,zz
foo4,da,db,dc,dd,de,df
foo5,ea,eb,ec,zz,ee,ef
foo6,fa,fb,fc,fd,fe,ff''', file=out )

df = pd.read_csv( 'file_20240412201813_tmp_DML.csv' )

result = {}

for rownum, row in df.iterrows():
iterator = row.items()
_, rowname = next( iterator )
for colname, value in iterator:
if value not in result: result[ value ]= []
result[ value ].append( ( rowname, colname ))

print( result )
--- Synchronet 3.20a-Linux NewsLink 1.114

From jak@nospam@please.ty to comp.lang.python on Sat Apr 13 15:00:35 2024

From Newsgroup: comp.lang.python

Stefan Ram ha scritto:

jak <nospam@please.ty> wrote or quoted:

Would you show me the path, please?

I was not able to read xls here, so I used csv instead; Warning:
the script will overwrite file "file_20240412201813_tmp_DML.csv"!

import pandas as pd

with open( 'file_20240412201813_tmp_DML.csv', 'w' )as out:
print( '''obj,foo1,foo2,foo3,foo4,foo5,foo6
foo1,aa,ab,zz,ad,ae,af
foo2,ba,bb,bc,bd,zz,bf
foo3,ca,zz,cc,cd,ce,zz
foo4,da,db,dc,dd,de,df
foo5,ea,eb,ec,zz,ee,ef
foo6,fa,fb,fc,fd,fe,ff''', file=out )

df = pd.read_csv( 'file_20240412201813_tmp_DML.csv' )

result = {}

for rownum, row in df.iterrows():
iterator = row.items()
_, rowname = next( iterator )
for colname, value in iterator:
if value not in result: result[ value ]= []
result[ value ].append( ( rowname, colname ))

print( result )

In reality what I wanted to achieve was this:

what = 'zz'
result = {what: []}

for rownum, row in df.iterrows():
iterator = row.items()
_, rowname = next(iterator)
for colname, value in iterator:
if value == what:
result[what] += [(rowname, colname)]
print(result)

In any case, thank you again for pointing me in the right direction. I
had lost myself looking for a pandas method that would do this in a
single shot or almost.

--- Synchronet 3.20a-Linux NewsLink 1.114

From Mats Wichmann@mats@wichmann.us to comp.lang.python on Sat Apr 13 11:07:37 2024

From Newsgroup: comp.lang.python

On 4/13/24 07:00, jak via Python-list wrote:

Stefan Ram ha scritto:

jak <nospam@please.ty> wrote or quoted:

Would you show me the path, please?

   I was not able to read xls here, so I used csv instead; Warning:
   the script will overwrite file "file_20240412201813_tmp_DML.csv"!

import pandas as pd

with open( 'file_20240412201813_tmp_DML.csv', 'w' )as out:
     print( '''obj,foo1,foo2,foo3,foo4,foo5,foo6
foo1,aa,ab,zz,ad,ae,af
foo2,ba,bb,bc,bd,zz,bf
foo3,ca,zz,cc,cd,ce,zz
foo4,da,db,dc,dd,de,df
foo5,ea,eb,ec,zz,ee,ef
foo6,fa,fb,fc,fd,fe,ff''', file=out )

df = pd.read_csv( 'file_20240412201813_tmp_DML.csv' )

result = {}

for rownum, row in df.iterrows():
     iterator = row.items()
     _, rowname = next( iterator )
     for colname, value in iterator:
         if value not in result: result[ value ]= []
         result[ value ].append( ( rowname, colname ))

print( result )

In reality what I wanted to achieve was this:

    what = 'zz'
    result = {what: []}

    for rownum, row in df.iterrows():
        iterator = row.items()
        _, rowname = next(iterator)
        for colname, value in iterator:
            if value == what:
                result[what] += [(rowname, colname)]
    print(result)

In any case, thank you again for pointing me in the right direction. I
had lost myself looking for a pandas method that would do this in a
single shot or almost.

doesn't Pandas have a "where" method that can do this kind of thing? Or doesn't it match what you are looking for? Pretty sure numpy does, but
that's a lot to bring in if you don't need the rest of numpy.
--- Synchronet 3.20a-Linux NewsLink 1.114

From Tim Williams@tjandacw@gmail.com to comp.lang.python on Sat Apr 13 13:14:35 2024

From Newsgroup: comp.lang.python

On Sat, Apr 13, 2024 at 1:10 PM Mats Wichmann via Python-list < python-list@python.org> wrote:

On 4/13/24 07:00, jak via Python-list wrote:

doesn't Pandas have a "where" method that can do this kind of thing? Or doesn't it match what you are looking for? Pretty sure numpy does, but that's a lot to bring in if you don't need the rest of numpy.

pandas.DataFrame.where — pandas 2.2.2 documentation (pydata.org)

<https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html#pandas.DataFrame.where>
--- Synchronet 3.20a-Linux NewsLink 1.114

From ram@ram@zedat.fu-berlin.de (Stefan Ram) to comp.lang.python on Sat Apr 13 18:39:51 2024

From Newsgroup: comp.lang.python

Tim Williams <tjandacw@gmail.com> wrote or quoted: >e.where.html#pandas.DataFrame.where>

Threw together a quick thing with "where", see how it looks.

import pandas as pd

# Warning! Will overwrite the file 'file_20240412201813_tmp_DML.csv'!
with open( 'file_20240412201813_tmp_DML.csv', 'w' )as out:
print( '''obj,foo1,foo2,foo3,foo4,foo5,foo6
foo1,aa,ab,zz,ad,ae,af
foo2,ba,bb,bc,bd,zz,bf
foo3,ca,zz,cc,cd,ce,zz
foo4,da,db,dc,dd,de,df
foo5,ea,eb,ec,zz,ee,ef
foo6,fa,fb,fc,fd,fe,ff''', file=out )

# Note the "index_col=0" below, which is important here!
df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )

df = df.where( df == 'zz' ).stack().reset_index()
result ={ 'zz': list( zip( df.iloc[ :, 0 ], df.iloc[ :, 1 ]))}

print( result )

Prints here:

{'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]}

.
--- Synchronet 3.20a-Linux NewsLink 1.114

From jak@nospam@please.ty to comp.lang.python on Sat Apr 13 23:35:57 2024

From Newsgroup: comp.lang.python

Stefan Ram ha scritto:

df = df.where( df == 'zz' ).stack().reset_index()
result ={ 'zz': list( zip( df.iloc[ :, 0 ], df.iloc[ :, 1 ]))}

Since I don't know Pandas, I will need a month at least to understand
these 2 lines of code. Thanks again.
--- Synchronet 3.20a-Linux NewsLink 1.114

From ram@ram@zedat.fu-berlin.de (Stefan Ram) to comp.lang.python on Sun Apr 14 08:58:16 2024

From Newsgroup: comp.lang.python

jak <nospam@please.ty> wrote or quoted:

Stefan Ram ha scritto:

df = df.where( df == 'zz' ).stack().reset_index()
result ={ 'zz': list( zip( df.iloc[ :, 0 ], df.iloc[ :, 1 ]))}

Since I don't know Pandas, I will need a month at least to understand
these 2 lines of code. Thanks again.

Here's a technique to better understand such code:

Transform it into a program with small statements and small
expressions with no more than one call per statement if possible.
(After each litte change check that the output stays the same.)

import pandas as pd

# Warning! Will overwrite the file 'file_20240412201813_tmp_DML.csv'!
with open( 'file_20240412201813_tmp_DML.csv', 'w' )as out:
print( '''obj,foo1,foo2,foo3,foo4,foo5,foo6
foo1,aa,ab,zz,ad,ae,af
foo2,ba,bb,bc,bd,zz,bf
foo3,ca,zz,cc,cd,ce,zz
foo4,da,db,dc,dd,de,df
foo5,ea,eb,ec,zz,ee,ef
foo6,fa,fb,fc,fd,fe,ff''', file=out )
# Note the "index_col=0" below, which is important here!
df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )

selection = df.where( df == 'zz' )
selection_stack = selection.stack()
df = selection_stack.reset_index()
df0 = df.iloc[ :, 0 ]
df1 = df.iloc[ :, 1 ]
z = zip( df0, df1 )
l = list( z )
result ={ 'zz': l }
print( result )

I suggest to next insert print statements to print each intermediate
value:

# Note the "index_col=0" below, which is important here!
df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )
print( 'df = \n', type( df ), ':\n"', df, '"\n' )

selection = df.where( df == 'zz' )
print( "result of where( df == 'zz' ) = \n", type( selection ), ':\n"',
selection, '"\n' )

selection_stack = selection.stack()
print( 'result of stack() = \n', type( selection_stack ), ':\n"',
selection_stack, '"\n' )

df = selection_stack.reset_index()
print( 'result of reset_index() = \n', type( df ), ':\n"', df, '"\n' )

df0 = df.iloc[ :, 0 ]
print( 'value of .iloc[ :, 0 ]= \n', type( df0 ), ':\n"', df0, '"\n' )

df1 = df.iloc[ :, 1 ]
print( 'value of .iloc[ :, 1 ] = \n', type( df1 ), ':\n"', df1, '"\n' )

z = zip( df0, df1 )
print( 'result of zip( df0, df1 )= \n', type( z ), ':\n"', z, '"\n' )

l = list( z )
print( 'result of list( z )= \n', type( l ), ':\n"', l, '"\n' )

result ={ 'zz': l }
print( "value of { 'zz': l }= \n", type( result ), ':\n"',
result, '"\n' )

print( result )

Now you can see what each single step does!

df =
<class 'pandas.core.frame.DataFrame'> :
" foo1 foo2 foo3 foo4 foo5 foo6
obj
foo1 aa ab zz ad ae af
foo2 ba bb bc bd zz bf
foo3 ca zz cc cd ce zz
foo4 da db dc dd de df
foo5 ea eb ec zz ee ef
foo6 fa fb fc fd fe ff "

result of where( df == 'zz' ) =
<class 'pandas.core.frame.DataFrame'> :
" foo1 foo2 foo3 foo4 foo5 foo6
obj
foo1 NaN NaN zz NaN NaN NaN
foo2 NaN NaN NaN NaN zz NaN
foo3 NaN zz NaN NaN NaN zz
foo4 NaN NaN NaN NaN NaN NaN
foo5 NaN NaN NaN zz NaN NaN
foo6 NaN NaN NaN NaN NaN NaN "

result of stack() =
<class 'pandas.core.series.Series'> :
" obj
foo1 foo3 zz
foo2 foo5 zz
foo3 foo2 zz
foo6 zz
foo5 foo4 zz
dtype: object "

result of reset_index() =
<class 'pandas.core.frame.DataFrame'> :
" obj level_1 0
0 foo1 foo3 zz
1 foo2 foo5 zz
2 foo3 foo2 zz
3 foo3 foo6 zz
4 foo5 foo4 zz "

value of .iloc[ :, 0 ]=
<class 'pandas.core.series.Series'> :
" 0 foo1
1 foo2
2 foo3
3 foo3
4 foo5
Name: obj, dtype: object "

value of .iloc[ :, 1 ] =
<class 'pandas.core.series.Series'> :
" 0 foo3
1 foo5
2 foo2
3 foo6
4 foo4
Name: level_1, dtype: object "

result of zip( df0, df1 )=
<class 'zip'> :
" <zip object at 0x000000000B3B9548> "

result of list( z )=
<class 'list'> :
" [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')] "

value of { 'zz': l }=
<class 'dict'> :
" {'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]} "

{'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]}

The script reads a CSV file and stores the data in a Pandas
DataFrame object named "df". The "index_col=0" parameter tells
Pandas to use the first column as the index for the DataFrame,
which is kinda like column headers.

The "where" creates a new DataFrame selection that contains
the same data as df, but with all values replaced by NaN (Not
a Number) except for the values that are equal to 'zz'.

"stack" returns a Series with a multi-level index created
by pivoting the columns. Here it gives a Series with the
row-col-addresses of a all the non-NaN values. The general
meaning of "stack" might be the most complex operation of
this script. It's explained in the pandas manual (see there).

"reset_index" then just transforms this Series back into a
DataFrame, and ".iloc[ :, 0 ]" and ".iloc[ :, 1 ]" are the
first and second column, respectively, of that DataFrame. These
then are zipped to get the desired form as a list of pairs.
--- Synchronet 3.20a-Linux NewsLink 1.114

From jak@nospam@please.ty to comp.lang.python on Mon Apr 15 08:05:18 2024

From Newsgroup: comp.lang.python

Stefan Ram ha scritto:

jak <nospam@please.ty> wrote or quoted:

Stefan Ram ha scritto:

df = df.where( df == 'zz' ).stack().reset_index()
result ={ 'zz': list( zip( df.iloc[ :, 0 ], df.iloc[ :, 1 ]))}

Since I don't know Pandas, I will need a month at least to understand
these 2 lines of code. Thanks again.

Here's a technique to better understand such code:

Transform it into a program with small statements and small
expressions with no more than one call per statement if possible.
(After each litte change check that the output stays the same.)

import pandas as pd

# Warning! Will overwrite the file 'file_20240412201813_tmp_DML.csv'!
with open( 'file_20240412201813_tmp_DML.csv', 'w' )as out:
print( '''obj,foo1,foo2,foo3,foo4,foo5,foo6
foo1,aa,ab,zz,ad,ae,af
foo2,ba,bb,bc,bd,zz,bf
foo3,ca,zz,cc,cd,ce,zz
foo4,da,db,dc,dd,de,df
foo5,ea,eb,ec,zz,ee,ef
foo6,fa,fb,fc,fd,fe,ff''', file=out )
# Note the "index_col=0" below, which is important here!
df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )

selection = df.where( df == 'zz' )
selection_stack = selection.stack()
df = selection_stack.reset_index()
df0 = df.iloc[ :, 0 ]
df1 = df.iloc[ :, 1 ]
z = zip( df0, df1 )
l = list( z )
result ={ 'zz': l }
print( result )

I suggest to next insert print statements to print each intermediate
value:

# Note the "index_col=0" below, which is important here!
df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )
print( 'df = \n', type( df ), ':\n"', df, '"\n' )

selection = df.where( df == 'zz' )
print( "result of where( df == 'zz' ) = \n", type( selection ), ':\n"',
selection, '"\n' )

selection_stack = selection.stack()
print( 'result of stack() = \n', type( selection_stack ), ':\n"',
selection_stack, '"\n' )

df = selection_stack.reset_index()
print( 'result of reset_index() = \n', type( df ), ':\n"', df, '"\n' )

df0 = df.iloc[ :, 0 ]
print( 'value of .iloc[ :, 0 ]= \n', type( df0 ), ':\n"', df0, '"\n' )

df1 = df.iloc[ :, 1 ]
print( 'value of .iloc[ :, 1 ] = \n', type( df1 ), ':\n"', df1, '"\n' )

z = zip( df0, df1 )
print( 'result of zip( df0, df1 )= \n', type( z ), ':\n"', z, '"\n' )

l = list( z )
print( 'result of list( z )= \n', type( l ), ':\n"', l, '"\n' )

result ={ 'zz': l }
print( "value of { 'zz': l }= \n", type( result ), ':\n"',
result, '"\n' )

print( result )

Now you can see what each single step does!

df =
<class 'pandas.core.frame.DataFrame'> :
" foo1 foo2 foo3 foo4 foo5 foo6
obj
foo1 aa ab zz ad ae af
foo2 ba bb bc bd zz bf
foo3 ca zz cc cd ce zz
foo4 da db dc dd de df
foo5 ea eb ec zz ee ef
foo6 fa fb fc fd fe ff "

result of where( df == 'zz' ) =
<class 'pandas.core.frame.DataFrame'> :
" foo1 foo2 foo3 foo4 foo5 foo6
obj
foo1 NaN NaN zz NaN NaN NaN
foo2 NaN NaN NaN NaN zz NaN
foo3 NaN zz NaN NaN NaN zz
foo4 NaN NaN NaN NaN NaN NaN
foo5 NaN NaN NaN zz NaN NaN
foo6 NaN NaN NaN NaN NaN NaN "

result of stack() =
<class 'pandas.core.series.Series'> :
" obj
foo1 foo3 zz
foo2 foo5 zz
foo3 foo2 zz
foo6 zz
foo5 foo4 zz
dtype: object "

result of reset_index() =
<class 'pandas.core.frame.DataFrame'> :
" obj level_1 0
0 foo1 foo3 zz
1 foo2 foo5 zz
2 foo3 foo2 zz
3 foo3 foo6 zz
4 foo5 foo4 zz "

value of .iloc[ :, 0 ]=
<class 'pandas.core.series.Series'> :
" 0 foo1
1 foo2
2 foo3
3 foo3
4 foo5
Name: obj, dtype: object "

value of .iloc[ :, 1 ] =
<class 'pandas.core.series.Series'> :
" 0 foo3
1 foo5
2 foo2
3 foo6
4 foo4
Name: level_1, dtype: object "

result of zip( df0, df1 )=
<class 'zip'> :
" <zip object at 0x000000000B3B9548>"

result of list( z )=
<class 'list'> :
" [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]"

value of { 'zz': l }=
<class 'dict'> :
" {'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]}"

{'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]}

The script reads a CSV file and stores the data in a Pandas
DataFrame object named "df". The "index_col=0" parameter tells
Pandas to use the first column as the index for the DataFrame,
which is kinda like column headers.

The "where" creates a new DataFrame selection that contains
the same data as df, but with all values replaced by NaN (Not
a Number) except for the values that are equal to 'zz'.

"stack" returns a Series with a multi-level index created
by pivoting the columns. Here it gives a Series with the
row-col-addresses of a all the non-NaN values. The general
meaning of "stack" might be the most complex operation of
this script. It's explained in the pandas manual (see there).

"reset_index" then just transforms this Series back into a
DataFrame, and ".iloc[ :, 0 ]" and ".iloc[ :, 1 ]" are the
first and second column, respectively, of that DataFrame. These
then are zipped to get the desired form as a list of pairs.

And this is a technique very similar to reverse engineering. Thanks for
the explanation and examples. All this is really clear and I was able to
follow it easily because I have already written a version of this code
in C without any kind of external library that uses the .CSV version of
the table as data ( 234 code lines :^/ ).

--- Synchronet 3.20a-Linux NewsLink 1.114

Who's Online
Recent Visitors
- Grey Gamer
  Wed May 1 15:05:58 2024
  from Show Low, Az via Telnet
- Grey Gamer
  Wed May 1 11:25:27 2024
  from Show Low, Az via Telnet
- Microbot
  Wed May 1 06:29:14 2024
  from Moore, Ok via Telnet
- Foxyz
  Wed May 1 03:43:46 2024
  from Nantes, France via Telnet

System Info

Sysop:	DaiTengu
Location:	Appleton, WI
Users:	915
Nodes:	10 (1 / 9)
Uptime:	18:20:18
Calls:	12,168
Calls today:	4
Files:	186,520
Messages:	2,233,845

help: pandas and 2d table

Who's Online

Recent Visitors

System Info