Moving from R to python - 2/7 - pandas
- 1 of 7: IDE
- 2 of 7: pandas
- 3 of 7: matplotlib and seaborn
- 4 of 7: plotly
- 5 of 7: scikitlearn
- 6 of 7: advanced scikitlearn
- 7 of 7: automated machine learning
Table of Contents
pandas
vs. tidyverse
In base R
matrices and dataframes have row name indexes which in my opinion are a bit annoying, because they add another layer of complexity to your data transformation. You naturally have to keep your column names in order and they sort of make you care about maintain sensible row names as well. So in the tidyverse row names have been declared a deprecated feature.
In pandas
on the other hand indexing is brought to another level and they let you have multiple layers of row and column indexes. As far as I can tell most data reshaping methods such as melt()
, stack()
, unstack()
and pivot()
expect that you assign a unique index to each row.
The advantage of multi indexing is apparently that it is easier to dice your data when you have more than 2 dimensions in you dataframe. However, I have been using R
for almost two years now and never needed more than 2 dimensions in my dataframe.
Talking to data scientists that are well-experienced with pandas
I learned that hardly no-one seems to be using multi-indexes.
The tidyverse is all about functional programming, pandas not so much. Nevertheless there are some possibilities which I will get into in this post.
pandas documentation
There is a comparison of the pandas
and the R
synthax in the pandas
documentation
Sample Data
When looking at pandas
example code. The example usually starts by generating a dtaframe with random values sampled from a normal distribution. Coming from R
I am used to use acutal sample data like the iris data set which are part of base R
. We can use pydataset
to get similar sample data sets in python
.
import seaborn as sns
import pandas as pd
import numpy as np
df = sns.load_dataset('iris')
print( df.head() )
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
Indexing
We have a numerical row index and the categorical variable Species with three unique values. In R
we would convert this to a factor variable and there is something similar in python
but it does not seem to be very pythonic to me.
df.index
RangeIndex(start=0, stop=150, step=1)
df.species.unique()
array(['setosa', 'versicolor', 'virginica'], dtype=object)
Let’s move Species to the index
df.set_index('species'
, inplace = True ## modifies existing object
, append = True ## keeps exisiting index
)
df.head()
sepal_length | sepal_width | petal_length | petal_width | ||
---|---|---|---|---|---|
species | |||||
0 | setosa | 5.1 | 3.5 | 1.4 | 0.2 |
1 | setosa | 4.9 | 3.0 | 1.4 | 0.2 |
2 | setosa | 4.7 | 3.2 | 1.3 | 0.2 |
3 | setosa | 4.6 | 3.1 | 1.5 | 0.2 |
4 | setosa | 5.0 | 3.6 | 1.4 | 0.2 |
Let’s move Species back to the columns
df.reset_index( inplace = True
, level = 'species' ## only reset Species
)
df.head()
species | sepal_length | sepal_width | petal_length | petal_width | |
---|---|---|---|---|---|
0 | setosa | 5.1 | 3.5 | 1.4 | 0.2 |
1 | setosa | 4.9 | 3.0 | 1.4 | 0.2 |
2 | setosa | 4.7 | 3.2 | 1.3 | 0.2 |
3 | setosa | 4.6 | 3.1 | 1.5 | 0.2 |
4 | setosa | 5.0 | 3.6 | 1.4 | 0.2 |
Reshape Data
stack()
and unstack()
Once we set the index we can bring the dataframe from long to wide format
df.set_index('species'
, inplace = True ## modifies existing object
, append = True ## keeps exisiting index
)
df_short = df.stack()
df_short.head()
species
0 setosa sepal_length 5.1
sepal_width 3.5
petal_length 1.4
petal_width 0.2
1 setosa sepal_length 4.9
dtype: float64
df_long = df_short.unstack()
df_long.head()
sepal_length | sepal_width | petal_length | petal_width | ||
---|---|---|---|---|---|
species | |||||
0 | setosa | 5.1 | 3.5 | 1.4 | 0.2 |
1 | setosa | 4.9 | 3.0 | 1.4 | 0.2 |
2 | setosa | 4.7 | 3.2 | 1.3 | 0.2 |
3 | setosa | 4.6 | 3.1 | 1.5 | 0.2 |
4 | setosa | 5.0 | 3.6 | 1.4 | 0.2 |
melt()
and pivot_table()
We can use melt()
to select which columns we want to move into the long format. However, in the case of our dataset without an ID column we loose the information which value belongs to which row or individual plant.
df.reset_index( inplace = True
, level = 'species')
df_melt = df.melt( id_vars = 'species' )
df_melt.head()
species | variable | value | |
---|---|---|---|
0 | setosa | sepal_length | 5.1 |
1 | setosa | sepal_length | 4.9 |
2 | setosa | sepal_length | 4.7 |
3 | setosa | sepal_length | 4.6 |
4 | setosa | sepal_length | 5.0 |
df['ID'] = list( range( len(df) ) )
df.head()
species | sepal_length | sepal_width | petal_length | petal_width | ID | |
---|---|---|---|---|---|---|
0 | setosa | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
1 | setosa | 4.9 | 3.0 | 1.4 | 0.2 | 1 |
2 | setosa | 4.7 | 3.2 | 1.3 | 0.2 | 2 |
3 | setosa | 4.6 | 3.1 | 1.5 | 0.2 | 3 |
4 | setosa | 5.0 | 3.6 | 1.4 | 0.2 | 4 |
df_melt = df.melt( id_vars = ['ID', 'species'] )
df_melt.head()
ID | species | variable | value | |
---|---|---|---|---|
0 | 0 | setosa | sepal_length | 5.1 |
1 | 1 | setosa | sepal_length | 4.9 |
2 | 2 | setosa | sepal_length | 4.7 |
3 | 3 | setosa | sepal_length | 4.6 |
4 | 4 | setosa | sepal_length | 5.0 |
df_pivot_table = df_melt.pivot_table(columns='variable'
, values ='value'
, index = [ 'ID', 'species']
)
df_pivot_table.head()
variable | petal_length | petal_width | sepal_length | sepal_width | |
---|---|---|---|---|---|
ID | species | ||||
0 | setosa | 1.4 | 0.2 | 5.1 | 3.5 |
1 | setosa | 1.4 | 0.2 | 4.9 | 3.0 |
2 | setosa | 1.3 | 0.2 | 4.7 | 3.2 |
3 | setosa | 1.5 | 0.2 | 4.6 | 3.1 |
4 | setosa | 1.4 | 0.2 | 5.0 | 3.6 |
Functional programming with pandas
Functional prgramming in R
for me has two main components. That is the pipe %>%
operator that let’s you string a number of functions together without having to come up with names for all the intermediate steps and the purrr
package which iterates over columns in a dataframe using them as arguments for a preassigned function whose results are stringed together to form a new column in the same dataframe.
Functional Programming in python
is a bit frowned upon and in general it seems to be preferred to build new columns using iterations using for loops. Nevertheless functions like map, filter and apply exist and I find no problem in using them to a similar extend like I would with R
dot .
and line breaks \
can be used similar to the R pipe operator %>%
There are some implementations for pipes in python
but to me they are really ugly because they require you to put the whole pipe sequence in brackets which kind of breaks up the whole python
coding style which usually avoids having to put matching openeing and closing brackets spanning over multiple lines. However using the .
operator and code line breaks \
signifier we can do something very similar to the pipes in R
.
Let’s string together some commands we find in the pandas
documentation R vs dplyr section.
df_trans = df.loc[ : , ['species', 'sepal_length', 'sepal_width']] \
.query('sepal_length >= 4.7') \
.groupby('species') \
.mean() \
.rename( axis = 'columns'
, mapper = lambda x: 'mean_' + x ) \
.assign( sepal_ratio_of_means = lambda x: x['mean_sepal_length'] / x['mean_sepal_width'] )
df_trans
mean_sepal_length | mean_sepal_width | sepal_ratio_of_means | |
---|---|---|---|
species | |||
setosa | 5.119512 | 3.504878 | 1.460682 |
versicolor | 5.936000 | 2.770000 | 2.142960 |
virginica | 6.588000 | 2.974000 | 2.215198 |
I think this reads really beautifully much better than the awkward pipe operator in R
. There are some setbacks to this approach though.
- there is no ‘.’ operator that allows you to reference the piped object from within the pipe
- we have to use the full df[‘column’] reference to call a column vector and cannot use the unquoted column name as in
dplyr
- if we want to reference a column that has only been created inside the pipe we need to use a lambda function to reference it or start a new pipe.
Using multiple columns to calculate a new one
Using vectorized calculations
df_trans = df \
.assign( petal_ratio = df['petal_width'] / df['petal_length']
, sepal_ratio = df['sepal_width'] / df['petal_length']
) \
.assign( overall_ratio = lambda x: x['petal_ratio'] / x['sepal_ratio'] )
df_trans.head()
species | sepal_length | sepal_width | petal_length | petal_width | ID | petal_ratio | sepal_ratio | overall_ratio | |
---|---|---|---|---|---|---|---|---|---|
0 | setosa | 5.1 | 3.5 | 1.4 | 0.2 | 0 | 0.142857 | 2.500000 | 0.057143 |
1 | setosa | 4.9 | 3.0 | 1.4 | 0.2 | 1 | 0.142857 | 2.142857 | 0.066667 |
2 | setosa | 4.7 | 3.2 | 1.3 | 0.2 | 2 | 0.153846 | 2.461538 | 0.062500 |
3 | setosa | 4.6 | 3.1 | 1.5 | 0.2 | 3 | 0.133333 | 2.066667 | 0.064516 |
4 | setosa | 5.0 | 3.6 | 1.4 | 0.2 | 4 | 0.142857 | 2.571429 | 0.055556 |
Using none-vectorized calculations
These would be cases in which we would use either map()
or pmap()
from the purrr
package.
single column
We select the Series we want to transform and use the apply()
method.
def range_sepal_length(x):
if x > 5:
return 'high'
elif x > 4.6:
return 'medium'
else:
return 'low'
df_trans = df \
.assign( sepal_length_ranges = df['sepal_length'].apply(range_sepal_length) )
df_trans.head()
species | sepal_length | sepal_width | petal_length | petal_width | ID | sepal_length_ranges | |
---|---|---|---|---|---|---|---|
0 | setosa | 5.1 | 3.5 | 1.4 | 0.2 | 0 | high |
1 | setosa | 4.9 | 3.0 | 1.4 | 0.2 | 1 | medium |
2 | setosa | 4.7 | 3.2 | 1.3 | 0.2 | 2 | medium |
3 | setosa | 4.6 | 3.1 | 1.5 | 0.2 | 3 | low |
4 | setosa | 5.0 | 3.6 | 1.4 | 0.2 | 4 | medium |
multiple columns
We select from the dataframe only the columns that we want to use as arguments to our function call. We use apply()
again but this time we change axis to row (axis = 1) which means that it provides all values from a row as arguments to a function call.
df_trans = df \
.assign( smallest_value = df.loc[:,['sepal_length','sepal_width','petal_length','petal_width']]\
.apply( min, axis = 1) )
df_trans.head()
species | sepal_length | sepal_width | petal_length | petal_width | ID | smallest_value | |
---|---|---|---|---|---|---|---|
0 | setosa | 5.1 | 3.5 | 1.4 | 0.2 | 0 | 0.2 |
1 | setosa | 4.9 | 3.0 | 1.4 | 0.2 | 1 | 0.2 |
2 | setosa | 4.7 | 3.2 | 1.3 | 0.2 | 2 | 0.2 |
3 | setosa | 4.6 | 3.1 | 1.5 | 0.2 | 3 | 0.2 |
4 | setosa | 5.0 | 3.6 | 1.4 | 0.2 | 4 | 0.2 |
Aggregating, Grouping, Summarizing
One of the most commen things to do when working with tables is grouping and summarizing data. In dplyr
and also in SQL
we define our set of grouping columns and then define a set of new columns that contain aggregates of some of the old columns. There are two ways to do this in pandas
- use a single window function on all none-grouped dataframe columns (see exapmple above)
- use a set of different window functions on a limited set of columns (one window function per column)
One window function per column
We can either change existing column or create new ones which we have to add as empty columns via assign first. We can use whatever window function is implemented in pandas
which we pass as strings in a dictionary or we can write lambda functions.
df_gr = df.loc[ : , ['species', 'sepal_length', 'sepal_width']] \
.assign( set_of_lengths = df['sepal_length'] ) \
.groupby('species') \
.aggregate( dict( sepal_length = 'mean'
, sepal_width = 'median'
, set_of_lengths = lambda x: set( round(x, 0) ) ) )
df_gr
sepal_length | sepal_width | set_of_lengths | |
---|---|---|---|
species | |||
setosa | 5.006 | 3.4 | {4.0, 5.0, 6.0} |
versicolor | 5.936 | 2.8 | {5.0, 6.0, 7.0} |
virginica | 6.588 | 3.0 | {8.0, 5.0, 6.0, 7.0} |
Note that the grouping column is automatically moved to the index, we can revert this as follows:
df_gr.reset_index()
species | sepal_length | sepal_width | set_of_lengths | |
---|---|---|---|---|
0 | setosa | 5.006 | 3.4 | {4.0, 5.0, 6.0} |
1 | versicolor | 5.936 | 2.8 | {5.0, 6.0, 7.0} |
2 | virginica | 6.588 | 3.0 | {8.0, 5.0, 6.0, 7.0} |
Other Data Transformation
There is a comparison of the pandas
and the R
synthax in the pandas
documentation
query()
query()
can be used like filter()
in R
. Alternatively we could use boolean indexing but that gets a bit cumbersome if you want to filter on more than one column.
However the string expression does not tolerate ‘.’ in the column names thus we replace them before.
better_columns = df.columns.str.replace('\.', '_')
better_columns
df.columns = better_columns
df.head()
df.query('petal_width > 0.2 \
& sepal_length < 5')
species | sepal_length | sepal_width | petal_length | petal_width | ID | |
---|---|---|---|---|---|---|
6 | setosa | 4.6 | 3.4 | 1.4 | 0.3 | 6 |
41 | setosa | 4.5 | 2.3 | 1.3 | 0.3 | 41 |
45 | setosa | 4.8 | 3.0 | 1.4 | 0.3 | 45 |
57 | versicolor | 4.9 | 2.4 | 3.3 | 1.0 | 57 |
106 | virginica | 4.9 | 2.5 | 4.5 | 1.7 | 106 |
transform()
transform
can be used like the group_by()
- mutate()
combination in R
df = df.assign( max_petal_len_species = lambda x: x.groupby('species').transform('max')['petal_length'] )
df.head()
species | sepal_length | sepal_width | petal_length | petal_width | ID | max_petal_len_species | |
---|---|---|---|---|---|---|---|
0 | setosa | 5.1 | 3.5 | 1.4 | 0.2 | 0 | 1.9 |
1 | setosa | 4.9 | 3.0 | 1.4 | 0.2 | 1 | 1.9 |
2 | setosa | 4.7 | 3.2 | 1.3 | 0.2 | 2 | 1.9 |
3 | setosa | 4.6 | 3.1 | 1.5 | 0.2 | 3 | 1.9 |
4 | setosa | 5.0 | 3.6 | 1.4 | 0.2 | 4 | 1.9 |
df.tail()
species | sepal_length | sepal_width | petal_length | petal_width | ID | max_petal_len_species | |
---|---|---|---|---|---|---|---|
145 | virginica | 6.7 | 3.0 | 5.2 | 2.3 | 145 | 6.9 |
146 | virginica | 6.3 | 2.5 | 5.0 | 1.9 | 146 | 6.9 |
147 | virginica | 6.5 | 3.0 | 5.2 | 2.0 | 147 | 6.9 |
148 | virginica | 6.2 | 3.4 | 5.4 | 2.3 | 148 | 6.9 |
149 | virginica | 5.9 | 3.0 | 5.1 | 1.8 | 149 | 6.9 |
reindex()
reindex()
can be used to mimic the complete()
function in R
which can be used to expose implicitly missing values. We can use reindex
to write a python version of complete()
def complete(df, cols):
assert all( [ col in df.columns for col in cols] )
df = df.set_index( cols, append = False )
df_reset = df.reset_index()
for i, name in enumerate(df.index.names):
if i == 0:
df_cross = pd.DataFrame( { name : df_reset[name].unique() } ) \
.assign( key = 0 )
else:
df_cross_name = pd.DataFrame( { name : df_reset[name].unique() } ) \
.assign( key = 0 )
df_cross = df_cross.merge(df_cross_name, on = 'key')
df_cross = df_cross.drop('key', axis = 1) \
.set_index( df.index.names )
df = df.reindex( df_cross.index )
return df
df = pd.DataFrame( dict( numbers = [1,1,2,2,3,3,4,5]
, letters = ['A','B','A','B','A','B','A','B']) ) \
.assign( value = np.random.randn(8) )
df
numbers | letters | value | |
---|---|---|---|
0 | 1 | A | -0.866225 |
1 | 1 | B | 0.086730 |
2 | 2 | A | -1.385008 |
3 | 2 | B | 1.729923 |
4 | 3 | A | 2.299604 |
5 | 3 | B | 0.271672 |
6 | 4 | A | 0.608976 |
7 | 5 | B | -0.055641 |
complete(df, ['numbers', 'letters'])
value | ||
---|---|---|
numbers | letters | |
1 | A | -0.866225 |
B | 0.086730 | |
2 | A | -1.385008 |
B | 1.729923 | |
3 | A | 2.299604 |
B | 0.271672 | |
4 | A | 0.608976 |
B | NaN | |
5 | A | NaN |
B | -0.055641 |