`pandas` vs. tidyverse

In base R matrices and dataframes have row name indexes which in my opinion are a bit annoying, because they add another layer of complexity to your data transformation. You naturally have to keep your column names in order and they sort of make you care about maintain sensible row names as well. So in the tidyverse row names have been declared a deprecated feature.

In pandas on the other hand indexing is brought to another level and they let you have multiple layers of row and column indexes. As far as I can tell most data reshaping methods such as melt(), stack(), unstack() and pivot() expect that you assign a unique index to each row.

The advantage of multi indexing is apparently that it is easier to dice your data when you have more than 2 dimensions in you dataframe. However, I have been using R for almost two years now and never needed more than 2 dimensions in my dataframe.

Talking to data scientists that are well-experienced with pandas I learned that hardly no-one seems to be using multi-indexes.

The tidyverse is all about functional programming, pandas not so much. Nevertheless there are some possibilities which I will get into in this post.

pandas documentation

There is a comparison of the pandas and the R synthax in the pandas documentation

Sample Data

When looking at pandas example code. The example usually starts by generating a dtaframe with random values sampled from a normal distribution. Coming from R I am used to use acutal sample data like the iris data set which are part of base R. We can use pydataset to get similar sample data sets in python.

import seaborn as sns
import pandas as pd
import numpy as np

df = sns.load_dataset('iris')

print( df.head() )

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

Indexing

We have a numerical row index and the categorical variable Species with three unique values. In R we would convert this to a factor variable and there is something similar in python but it does not seem to be very pythonic to me.

df.index

RangeIndex(start=0, stop=150, step=1)

df.species.unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

Let’s move Species to the index

df.set_index('species'
             , inplace = True ## modifies existing object
             , append = True ## keeps exisiting index
             )

df.head()

		sepal_length	sepal_width	petal_length	petal_width
	species
0	setosa	5.1	3.5	1.4	0.2
1	setosa	4.9	3.0	1.4	0.2
2	setosa	4.7	3.2	1.3	0.2
3	setosa	4.6	3.1	1.5	0.2
4	setosa	5.0	3.6	1.4	0.2

Let’s move Species back to the columns

df.reset_index( inplace = True
               , level = 'species' ## only reset Species
              )

df.head()

	species	sepal_length	sepal_width	petal_length	petal_width
0	setosa	5.1	3.5	1.4	0.2
1	setosa	4.9	3.0	1.4	0.2
2	setosa	4.7	3.2	1.3	0.2
3	setosa	4.6	3.1	1.5	0.2
4	setosa	5.0	3.6	1.4	0.2

Reshape Data

`stack()` and `unstack()`

Once we set the index we can bring the dataframe from long to wide format

df.set_index('species'
             , inplace = True ## modifies existing object
             , append = True ## keeps exisiting index
             )

df_short = df.stack()  

df_short.head()

   species              
0  setosa   sepal_length    5.1
            sepal_width     3.5
            petal_length    1.4
            petal_width     0.2
1  setosa   sepal_length    4.9
dtype: float64

df_long = df_short.unstack()

df_long.head()

		sepal_length	sepal_width	petal_length	petal_width
	species
0	setosa	5.1	3.5	1.4	0.2
1	setosa	4.9	3.0	1.4	0.2
2	setosa	4.7	3.2	1.3	0.2
3	setosa	4.6	3.1	1.5	0.2
4	setosa	5.0	3.6	1.4	0.2

`melt()` and `pivot_table()`

We can use melt() to select which columns we want to move into the long format. However, in the case of our dataset without an ID column we loose the information which value belongs to which row or individual plant.

df.reset_index( inplace = True
              , level = 'species') 

df_melt = df.melt( id_vars = 'species' )

df_melt.head()

	species	variable	value
0	setosa	sepal_length	5.1
1	setosa	sepal_length	4.9
2	setosa	sepal_length	4.7
3	setosa	sepal_length	4.6
4	setosa	sepal_length	5.0

df['ID'] = list( range( len(df) ) )

df.head()

	species	sepal_length	sepal_width	petal_length	petal_width	ID
0	setosa	5.1	3.5	1.4	0.2	0
1	setosa	4.9	3.0	1.4	0.2	1
2	setosa	4.7	3.2	1.3	0.2	2
3	setosa	4.6	3.1	1.5	0.2	3
4	setosa	5.0	3.6	1.4	0.2	4

df_melt = df.melt( id_vars = ['ID', 'species'] )

df_melt.head()

	ID	species	variable	value
0	0	setosa	sepal_length	5.1
1	1	setosa	sepal_length	4.9
2	2	setosa	sepal_length	4.7
3	3	setosa	sepal_length	4.6
4	4	setosa	sepal_length	5.0

df_pivot_table = df_melt.pivot_table(columns='variable'
                                        , values ='value'
                                        , index = [ 'ID', 'species']
                                        )

df_pivot_table.head()

	variable	petal_length	petal_width	sepal_length	sepal_width
ID	species
0	setosa	1.4	0.2	5.1	3.5
1	setosa	1.4	0.2	4.9	3.0
2	setosa	1.3	0.2	4.7	3.2
3	setosa	1.5	0.2	4.6	3.1
4	setosa	1.4	0.2	5.0	3.6

Functional programming with pandas

Functional prgramming in R for me has two main components. That is the pipe %>% operator that let’s you string a number of functions together without having to come up with names for all the intermediate steps and the purrr package which iterates over columns in a dataframe using them as arguments for a preassigned function whose results are stringed together to form a new column in the same dataframe.

Functional Programming in python is a bit frowned upon and in general it seems to be preferred to build new columns using iterations using for loops. Nevertheless functions like map, filter and apply exist and I find no problem in using them to a similar extend like I would with R

dot `.` and line breaks `\` can be used similar to the R pipe operator `%>%`

There are some implementations for pipes in python but to me they are really ugly because they require you to put the whole pipe sequence in brackets which kind of breaks up the whole python coding style which usually avoids having to put matching openeing and closing brackets spanning over multiple lines. However using the . operator and code line breaks \ signifier we can do something very similar to the pipes in R.

Let’s string together some commands we find in the pandas documentation R vs dplyr section.

df_trans = df.loc[ : , ['species', 'sepal_length', 'sepal_width']] \
    .query('sepal_length >= 4.7') \
    .groupby('species') \
    .mean() \
    .rename( axis = 'columns'
           , mapper = lambda x: 'mean_' + x ) \
    .assign( sepal_ratio_of_means = lambda x: x['mean_sepal_length'] / x['mean_sepal_width'] )
    
df_trans

	mean_sepal_length	mean_sepal_width	sepal_ratio_of_means
species
setosa	5.119512	3.504878	1.460682
versicolor	5.936000	2.770000	2.142960
virginica	6.588000	2.974000	2.215198

I think this reads really beautifully much better than the awkward pipe operator in R. There are some setbacks to this approach though.

there is no ‘.’ operator that allows you to reference the piped object from within the pipe
we have to use the full df[‘column’] reference to call a column vector and cannot use the unquoted column name as in dplyr
if we want to reference a column that has only been created inside the pipe we need to use a lambda function to reference it or start a new pipe.

Using multiple columns to calculate a new one

Using vectorized calculations

df_trans = df \
    .assign( petal_ratio = df['petal_width'] / df['petal_length']
           , sepal_ratio = df['sepal_width'] / df['petal_length']
           ) \
    .assign( overall_ratio = lambda x: x['petal_ratio'] / x['sepal_ratio'] )
    
    
df_trans.head()

	species	sepal_length	sepal_width	petal_length	petal_width	ID	petal_ratio	sepal_ratio	overall_ratio
0	setosa	5.1	3.5	1.4	0.2	0	0.142857	2.500000	0.057143
1	setosa	4.9	3.0	1.4	0.2	1	0.142857	2.142857	0.066667
2	setosa	4.7	3.2	1.3	0.2	2	0.153846	2.461538	0.062500
3	setosa	4.6	3.1	1.5	0.2	3	0.133333	2.066667	0.064516
4	setosa	5.0	3.6	1.4	0.2	4	0.142857	2.571429	0.055556

Using none-vectorized calculations

These would be cases in which we would use either map() or pmap() from the purrr package.

single column

We select the Series we want to transform and use the apply() method.

def range_sepal_length(x):
    if x > 5:
        return 'high'
    elif x > 4.6:
        return 'medium'
    else:
        return 'low'

df_trans = df \
    .assign( sepal_length_ranges = df['sepal_length'].apply(range_sepal_length) )
    
df_trans.head()

	species	sepal_length	sepal_width	petal_length	petal_width	ID	sepal_length_ranges
0	setosa	5.1	3.5	1.4	0.2	0	high
1	setosa	4.9	3.0	1.4	0.2	1	medium
2	setosa	4.7	3.2	1.3	0.2	2	medium
3	setosa	4.6	3.1	1.5	0.2	3	low
4	setosa	5.0	3.6	1.4	0.2	4	medium

multiple columns

We select from the dataframe only the columns that we want to use as arguments to our function call. We use apply() again but this time we change axis to row (axis = 1) which means that it provides all values from a row as arguments to a function call.

df_trans = df \
    .assign( smallest_value = df.loc[:,['sepal_length','sepal_width','petal_length','petal_width']]\
                               .apply( min, axis = 1) )
    
df_trans.head()

	species	sepal_length	sepal_width	petal_length	petal_width	ID	smallest_value
0	setosa	5.1	3.5	1.4	0.2	0	0.2
1	setosa	4.9	3.0	1.4	0.2	1	0.2
2	setosa	4.7	3.2	1.3	0.2	2	0.2
3	setosa	4.6	3.1	1.5	0.2	3	0.2
4	setosa	5.0	3.6	1.4	0.2	4	0.2

Aggregating, Grouping, Summarizing

One of the most commen things to do when working with tables is grouping and summarizing data. In dplyr and also in SQL we define our set of grouping columns and then define a set of new columns that contain aggregates of some of the old columns. There are two ways to do this in pandas

use a single window function on all none-grouped dataframe columns (see exapmple above)
use a set of different window functions on a limited set of columns (one window function per column)

One window function per column

We can either change existing column or create new ones which we have to add as empty columns via assign first. We can use whatever window function is implemented in pandas which we pass as strings in a dictionary or we can write lambda functions.

df_gr = df.loc[ : , ['species', 'sepal_length', 'sepal_width']]  \
    .assign( set_of_lengths = df['sepal_length'] ) \
    .groupby('species') \
    .aggregate( dict( sepal_length = 'mean'
                     , sepal_width = 'median'
                     , set_of_lengths = lambda x: set( round(x, 0) ) ) )
    
df_gr

	sepal_length	sepal_width	set_of_lengths
species
setosa	5.006	3.4	{4.0, 5.0, 6.0}
versicolor	5.936	2.8	{5.0, 6.0, 7.0}
virginica	6.588	3.0	{8.0, 5.0, 6.0, 7.0}

Note that the grouping column is automatically moved to the index, we can revert this as follows:

df_gr.reset_index()

	species	sepal_length	sepal_width	set_of_lengths
0	setosa	5.006	3.4	{4.0, 5.0, 6.0}
1	versicolor	5.936	2.8	{5.0, 6.0, 7.0}
2	virginica	6.588	3.0	{8.0, 5.0, 6.0, 7.0}

Other Data Transformation

There is a comparison of the pandas and the R synthax in the pandas documentation

query()

query() can be used like filter() in R. Alternatively we could use boolean indexing but that gets a bit cumbersome if you want to filter on more than one column.

However the string expression does not tolerate ‘.’ in the column names thus we replace them before.

better_columns = df.columns.str.replace('\.', '_')

better_columns

df.columns = better_columns

df.head()

df.query('petal_width > 0.2 \
          & sepal_length < 5')

	species	sepal_length	sepal_width	petal_length	petal_width	ID
6	setosa	4.6	3.4	1.4	0.3	6
41	setosa	4.5	2.3	1.3	0.3	41
45	setosa	4.8	3.0	1.4	0.3	45
57	versicolor	4.9	2.4	3.3	1.0	57
106	virginica	4.9	2.5	4.5	1.7	106

transform()

transform can be used like the group_by() - mutate() combination in R

df = df.assign( max_petal_len_species = lambda x: x.groupby('species').transform('max')['petal_length'] )
df.head()

	species	sepal_length	sepal_width	petal_length	petal_width	ID	max_petal_len_species
0	setosa	5.1	3.5	1.4	0.2	0	1.9
1	setosa	4.9	3.0	1.4	0.2	1	1.9
2	setosa	4.7	3.2	1.3	0.2	2	1.9
3	setosa	4.6	3.1	1.5	0.2	3	1.9
4	setosa	5.0	3.6	1.4	0.2	4	1.9

df.tail()

	species	sepal_length	sepal_width	petal_length	petal_width	ID	max_petal_len_species
145	virginica	6.7	3.0	5.2	2.3	145	6.9
146	virginica	6.3	2.5	5.0	1.9	146	6.9
147	virginica	6.5	3.0	5.2	2.0	147	6.9
148	virginica	6.2	3.4	5.4	2.3	148	6.9
149	virginica	5.9	3.0	5.1	1.8	149	6.9

reindex()

reindex() can be used to mimic the complete() function in R which can be used to expose implicitly missing values. We can use reindex to write a python version of complete()

def complete(df, cols):
        
    assert all( [ col in df.columns for col in cols] )

    df = df.set_index( cols, append = False )

    df_reset = df.reset_index()

    for i, name in enumerate(df.index.names):
        
        if i == 0:
            df_cross = pd.DataFrame( { name : df_reset[name].unique() } ) \
                .assign( key = 0 )
                
        else:
            df_cross_name = pd.DataFrame( { name : df_reset[name].unique() } ) \
                .assign( key = 0 )
                
            df_cross = df_cross.merge(df_cross_name, on = 'key')

    df_cross = df_cross.drop('key', axis = 1) \
        .set_index( df.index.names )
        
    df = df.reindex( df_cross.index )

    return df

df = pd.DataFrame( dict( numbers = [1,1,2,2,3,3,4,5]
                        , letters = ['A','B','A','B','A','B','A','B']) ) \
    .assign( value = np.random.randn(8) )

df

	numbers	letters	value
0	1	A	-0.866225
1	1	B	0.086730
2	2	A	-1.385008
3	2	B	1.729923
4	3	A	2.299604
5	3	B	0.271672
6	4	A	0.608976
7	5	B	-0.055641

complete(df, ['numbers', 'letters'])

		value
numbers	letters
1	A	-0.866225
1	B	0.086730
2	A	-1.385008
2	B	1.729923
3	A	2.299604
3	B	0.271672
4	A	0.608976
4	B	NaN
5	A	NaN
5	B	-0.055641

Moving from R to python - 2/7 - pandas

Table of Contents

`pandas` vs. tidyverse

pandas documentation

Sample Data

Indexing

Reshape Data

`stack()` and `unstack()`

`melt()` and `pivot_table()`

Functional programming with pandas

dot `.` and line breaks `\` can be used similar to the R pipe operator `%>%`

Using multiple columns to calculate a new one

Using vectorized calculations

Using none-vectorized calculations

single column

multiple columns

Aggregating, Grouping, Summarizing

One window function per column

Other Data Transformation

query()

transform()

reindex()

Dr. Björn Koneswarakantha

Moving from R to python - 2/7 - pandas

Table of Contents

pandas vs. tidyverse

pandas documentation

Sample Data

Indexing

Reshape Data

stack() and unstack()

melt() and pivot_table()

Functional programming with pandas

dot . and line breaks \ can be used similar to the R pipe operator %>%

Using multiple columns to calculate a new one

Using vectorized calculations

Using none-vectorized calculations

single column

multiple columns

Aggregating, Grouping, Summarizing

One window function per column

Other Data Transformation

query()

transform()

reindex()

Dr. Björn Koneswarakantha

`pandas` vs. tidyverse

`stack()` and `unstack()`

`melt()` and `pivot_table()`

dot `.` and line breaks `\` can be used similar to the R pipe operator `%>%`