Moving from R to python - 6/7 - advanced scikitlearn
- 1 of 7: IDE
- 2 of 7: pandas
- 3 of 7: matplotlib and seaborn
- 4 of 7: plotly
- 5 of 7: scikitlearn
- 6 of 7: advanced scikitlearn
- 7 of 7: automated machine learning
Advanced scikitlearn
In the last post, we saw some of the advantages of scikitlearn, most notably the seamless integration of parallel processing. I was struggling a bit with the fact that scikitlearn only accepts numpy arrays as input, and I was missing the recipes package, which makes initial data transformation in R so much easier. Then I stumbled upon sklearn-pandas, which seamlessly integrates pandas with sklearn without having to worry about numpy arrays, and it supports a pipe based workflow, a sklearn feature I have not started to explore yet.
Apart from sklearn-pandas there are a number of projects that use the syntax and structure of scikit-learn; a collection of them can be found at
- http://contrib.scikit-learn.org/imbalanced-learn/stable/index.html
- http://scikit-learn.org/stable/related_projects.html
sklearn-pandas
The core of this package is the DataFrameMapper class, which maps scikit-learn Transformer classes to specific columns of a dataframe and outputs either a numpy array or a dataframe.
Additionally it provides a CategoricalImputer, which accepts categorical data and which I had to write myself in the last post.
import seaborn as sns
import numpy as np
import pandas as pd
import sklearn
from sklearn_pandas import DataFrameMapper, CategoricalImputer, gen_features
df = sns.load_dataset('titanic')
X = df.copy().drop(['alive','survived'], axis = 'columns')
y = df.survived
# we need to set up transformations for numerical and categorical columns
col_categorical = list( X.select_dtypes(exclude=np.number) )
col_numerical = list( X.select_dtypes(include=np.number) )
#we need to convert to list of lists
col_categorical = [ [x] for x in col_categorical ]
col_numerical = [ [x] for x in col_numerical ]
# we have two ways of passing the classes as a simple list or as list of dicts if we need to pass
# arguments as well
classes_categorical = [ CategoricalImputer, sklearn.preprocessing.LabelBinarizer ]
classes_numerical = [ {'class':sklearn.preprocessing.Imputer, 'strategy' : 'median'}
, sklearn.preprocessing.StandardScaler
]
# now that we have defined the columns and the classes of transformers we can use gen_features
# in order to generate a list of tuples suitable for DataFrameMapper
feature_def = gen_features(
columns = col_categorical
, classes = classes_categorical
)
feature_def_numerical = gen_features(
columns = col_numerical
, classes = classes_numerical
)
feature_def.extend(feature_def_numerical)
# when constructing the mapper we can specify whether we want a dataframe or a numpy array as output
mapper_df = DataFrameMapper( feature_def , df_out = True )
mapper_np = DataFrameMapper( feature_def , df_out = False )
mapped_df = mapper_df.fit_transform( df.copy() )
mapped_np = mapper_np.fit_transform( df.copy() )
print( mapped_np[1:10,1:20] )
mapped_df.head(10)
[[1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0.]
[0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1.]
[0. 0. 1. 1. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0.]]
sex | embarked_C | embarked_Q | embarked_S | class_First | class_Second | class_Third | who_child | who_man | who_woman | ... | deck_G | embark_town_Cherbourg | embark_town_Queenstown | embark_town_Southampton | alone | pclass | age | sibsp | parch | fare | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.827377 | -0.565736 | 0.432793 | -0.473674 | -0.502445 |
1 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | -1.566107 | 0.663861 | 0.432793 | -0.473674 | 0.786845 |
2 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.827377 | -0.258337 | -0.474545 | -0.473674 | -0.488854 |
3 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | -1.566107 | 0.433312 | 0.432793 | -0.473674 | 0.420730 |
4 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.827377 | 0.433312 | -0.474545 | -0.473674 | -0.486337 |
5 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.827377 | -0.104637 | -0.474545 | -0.473674 | -0.478116 |
6 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | -1.566107 | 1.893459 | -0.474545 | -0.473674 | 0.395814 |
7 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.827377 | -2.102733 | 2.247470 | 0.767630 | -0.224083 |
8 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.827377 | -0.181487 | -0.474545 | 2.008933 | -0.424256 |
9 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | -0.369365 | -1.180535 | 0.432793 | -0.473674 | -0.042956 |
10 rows × 27 columns
The results are looking really good, it is almost as convenient as recipes. However, if we wanted to apply a boxcox transformation on top of it, we would have to write our own scikit-learn-like transformer (a minimal sketch follows below). Such a transformer will be added in a future version, so I would not bother writing one at the moment.
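To illustrate what such a transformer could look like, here is a minimal sketch following the scikit-learn fit/transform convention. The class name BoxCoxTransformer and the per-column lambda handling are my own assumptions, and it only works on a numpy array of strictly positive values, which the boxcox transformation requires.
from scipy import stats
from sklearn.base import BaseEstimator, TransformerMixin

class BoxCoxTransformer(BaseEstimator, TransformerMixin):
    # hypothetical transformer, expects a 2D numpy array with strictly positive values

    def fit(self, X, y=None):
        # estimate one lambda per column on the training data
        self.lambdas_ = [ stats.boxcox(X[:, i])[1] for i in range(X.shape[1]) ]
        return self

    def transform(self, X):
        X_t = X.astype(float)
        for i, lmb in enumerate(self.lambdas_):
            X_t[:, i] = stats.boxcox(X_t[:, i], lmbda = lmb)
        return X_t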
To sparse or not to sparse
In the python data world, data is considered to be either sparse or dense, which refers to the number of zeros in a matrix (wiki). Sparse means that you have a lot of them, while dense means the opposite. There is no particular threshold, but we should be aware that some data transformations, like dummy encoding, make our data more sparse. A sparse matrix can be stored in a more memory-efficient format, similar to a compressed image file, and some algorithms can computationally leverage this format to reduce computing time. Lasso and gradient boosting style algorithms seem to be able to profit from the sparse format, while others (neural nets, knn) do not, and some, like randomForest, require the regular dense format and will otherwise raise an error. We can use SciPy to convert a matrix to a sparse format. We can measure the sparsity ratio as follows.
Sparsity Ratio
def sparsity_ratio(X):
    return 1.0 - np.count_nonzero(X) / float(X.shape[0] * X.shape[1])
print('sparsity ratio original data:', round( sparsity_ratio(X), 2) )
print('sparsity ratio transformed data:', round( sparsity_ratio(mapped_np), 2) )
sparsity ratio original data: 0.17
sparsity ratio transformed data: 0.56
The transformation has resulted in a matrix with high sparsity, thus we will test whether we might benefit from converting it to a sparse matrix format.
from scipy import sparse
from time import time
from sklearn.tree import DecisionTreeClassifier
X_sparse = sparse.coo_matrix(mapped_np)
clf_sparse = DecisionTreeClassifier()
clf_dense = DecisionTreeClassifier()
t0 = time()
clf_sparse.fit(X_sparse, y)
print('exec time sparse:', round( time() - t0,3 ) )
t0 = time()
clf_dense.fit(mapped_np, y)
print('exec time dense :', round( time() - t0,3 ) )
exec time sparse: 0.019
exec time dense : 0.008
We can see that our decision tree classifier does not benefit from a sparse data format.
Pipelines
Pipelines are constructs that chain scikit-learn preprocessing steps together, optionally attaching a classifier or a regressor at the end.
We can then use the pipe as we would use a regular model: we can fit it and get predictions, obtain cross-validated performance scores or perform parameter tuning. This has a couple of advantages.
- The code becomes more compact and readable
- Instead of saving multiple transformers (scaling, boxcox ) we can simply store one to apply to future data
- We can tune several steps of the pipeline in one go (for example feature selector + model tuning parameters)
We are going to construct two pipes, one for preprocessing and one for model fitting. It makes sense to separate these two because the first one contains a defined sequence of steps, while the second pipe is the one we are going to use to tune certain parameters via cross validation.
When performing the cross validation, the transformers and estimators in the pipe will be applied after splitting the data into cross validation pairs (see the sketch below). Cross validation is computationally expensive and we only want to use it for steps which are likely to introduce bias and can lead to overfitting, such as feature selection and hyperparameter tuning.
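As a quick illustration of this point: a pipeline can be passed to cross_val_score just like a single estimator, and every step (here the feature selection and the tree) is then refit on each training fold, so no information from the test folds leaks into the preprocessing. This is only a sketch using the already mapped data from above and an arbitrary k.
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.tree import DecisionTreeClassifier

pipe_cv_demo = Pipeline([
    ('feats', SelectKBest(k = 10))            # refit on every training fold
    , ('tree', DecisionTreeClassifier())
])

scores = cross_val_score(pipe_cv_demo, mapped_np, y, cv = 5, scoring = 'roc_auc')
print( round( scores.mean(), 3 ) )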
Preprocessing Pipeline
We are going to apply the sklearn-pandas
dataframe mapper and a low variance feature filter.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import RepeatedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import VarianceThreshold
from scipy import stats
import os
pipe_pre_process = sklearn.pipeline.Pipeline([
('mapper', mapper_np )
, ('feats', VarianceThreshold() )
])
pipe_pre_process
Pipeline(memory=None,
steps=[('mapper', DataFrameMapper(default=False, df_out=False,
features=[(['sex'], [CategoricalImputer(copy=True, missing_values='NaN'), LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]), (['embarked'], [CategoricalImputer(copy=True, missing_values='NaN'), LabelBinarizer(neg_la...h_std=True)])],
input_df=False, sparse=False)), ('feats', VarianceThreshold(threshold=0.0))])
pipe_pre_process.named_steps
{'feats': VarianceThreshold(threshold=0.0),
'mapper': DataFrameMapper(default=False, df_out=False,
features=[(['sex'], [CategoricalImputer(copy=True, missing_values='NaN'), LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]), (['embarked'], [CategoricalImputer(copy=True, missing_values='NaN'), LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]), (['class'], [CategoricalImp...es='NaN', strategy='median', verbose=0), StandardScaler(copy=True, with_mean=True, with_std=True)])],
input_df=False, sparse=False)}
The parameters are saved in a nested dictionary and are named according to the pattern step_name + '__' + argument
pipe_pre_process.get_params()
{'feats': VarianceThreshold(threshold=0.0),
'feats__threshold': 0.0,
'mapper': DataFrameMapper(default=False, df_out=False,
features=[(['sex'], [CategoricalImputer(copy=True, missing_values='NaN'), LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]), (['embarked'], [CategoricalImputer(copy=True, missing_values='NaN'), LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]), (['class'], [CategoricalImp...es='NaN', strategy='median', verbose=0), StandardScaler(copy=True, with_mean=True, with_std=True)])],
input_df=False, sparse=False),
'mapper__default': False,
'mapper__df_out': False,
'mapper__features': [(['sex'],
[CategoricalImputer(copy=True, missing_values='NaN'),
LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]),
(['embarked'],
[CategoricalImputer(copy=True, missing_values='NaN'),
LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]),
(['class'],
[CategoricalImputer(copy=True, missing_values='NaN'),
LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]),
(['who'],
[CategoricalImputer(copy=True, missing_values='NaN'),
LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]),
(['adult_male'],
[CategoricalImputer(copy=True, missing_values='NaN'),
LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]),
(['deck'],
[CategoricalImputer(copy=True, missing_values='NaN'),
LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]),
(['embark_town'],
[CategoricalImputer(copy=True, missing_values='NaN'),
LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]),
(['alone'],
[CategoricalImputer(copy=True, missing_values='NaN'),
LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]),
(['pclass'],
[Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0),
StandardScaler(copy=True, with_mean=True, with_std=True)]),
(['age'],
[Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0),
StandardScaler(copy=True, with_mean=True, with_std=True)]),
(['sibsp'],
[Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0),
StandardScaler(copy=True, with_mean=True, with_std=True)]),
(['parch'],
[Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0),
StandardScaler(copy=True, with_mean=True, with_std=True)]),
(['fare'],
[Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0),
StandardScaler(copy=True, with_mean=True, with_std=True)])],
'mapper__input_df': False,
'mapper__sparse': False,
'memory': None,
'steps': [('mapper', DataFrameMapper(default=False, df_out=False,
features=[(['sex'], [CategoricalImputer(copy=True, missing_values='NaN'), LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]), (['embarked'], [CategoricalImputer(copy=True, missing_values='NaN'), LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]), (['class'], [CategoricalImp...es='NaN', strategy='median', verbose=0), StandardScaler(copy=True, with_mean=True, with_std=True)])],
input_df=False, sparse=False)),
('feats', VarianceThreshold(threshold=0.0))]}
We can set a parameter
pipe_pre_process.set_params(feats__threshold = 0.05)
pipe_pre_process.named_steps.feats
VarianceThreshold(threshold=0.05)
Then we fit the preprocessing pipe to the data
pipe_pre_process.fit(X)
X_proc = pipe_pre_process.fit_transform(X)
X_proc
array([[ 1. , 0. , 0. , ..., 0.43279337,
-0.47367361, -0.50244517],
[ 0. , 1. , 0. , ..., 0.43279337,
-0.47367361, 0.78684529],
[ 0. , 0. , 0. , ..., -0.4745452 ,
-0.47367361, -0.48885426],
...,
[ 0. , 0. , 0. , ..., 0.43279337,
2.00893337, -0.17626324],
[ 1. , 1. , 0. , ..., -0.4745452 ,
-0.47367361, -0.04438104],
[ 1. , 0. , 1. , ..., -0.4745452 ,
-0.47367361, -0.49237783]])
We will be using the preprocessed data in another post, so we are saving it to disk. We are storing it in feather format, a binary columnar format based on Apache Arrow that is much faster than csv when reading from and writing to disk.
import feather
import os
if not os.path.isdir('./data'):
os.mkdir('./data')
df_feather = mapped_df.\
assign( y = y )
feather.write_dataframe(df_feather, './data/mapped_df.feather')
Modelling Pipeline
We will add a feature selection step, which chooses variables based on a univariate test such as a chi-square test (which we cannot use here because it does not accept negative values) or ANOVA, and then fit a decision tree.
pipe_mod = sklearn.pipeline.Pipeline([
('feats', sklearn.feature_selection.SelectKBest( k = 10) )
, ('tree', sklearn.tree.DecisionTreeClassifier() )
])
We can apply the same ‘__’ syntax we used for setting the parameters of the pipe when constructing the dictionary for the randomized hyperparameter search.
param_dist = dict( tree__min_samples_split = stats.randint(2,250)
, tree__min_samples_leaf = stats.randint(1,500)
, tree__min_impurity_decrease = stats.uniform(0,1)
, tree__max_features = stats.uniform(0,1)
, feats__score_func = [sklearn.feature_selection.f_classif ## Anova
, sklearn.feature_selection.mutual_info_classif] ) ## nearest n
n_iter = 500
random_search = RandomizedSearchCV(pipe_mod
, param_dist
, n_iter = n_iter
, scoring = 'roc_auc'
, cv = RepeatedKFold( n_splits = 5, n_repeats = 3 )
, verbose = True
, n_jobs = 4 ## parallel processing
, return_train_score = True
)
random_search.fit(X = X_proc, y = df.survived )
Fitting 15 folds for each of 500 candidates, totalling 7500 fits
[Parallel(n_jobs=4)]: Done 49 tasks | elapsed: 5.8s
[Parallel(n_jobs=4)]: Done 740 tasks | elapsed: 34.9s
[Parallel(n_jobs=4)]: Done 2005 tasks | elapsed: 1.4min
[Parallel(n_jobs=4)]: Done 3737 tasks | elapsed: 2.6min
[Parallel(n_jobs=4)]: Done 5556 tasks | elapsed: 4.3min
[Parallel(n_jobs=4)]: Done 7241 tasks | elapsed: 5.9min
[Parallel(n_jobs=4)]: Done 7493 out of 7500 | elapsed: 6.1min remaining: 0.2s
[Parallel(n_jobs=4)]: Done 7500 out of 7500 | elapsed: 6.1min finished
RandomizedSearchCV(cv=<sklearn.model_selection._split.RepeatedKFold object at 0x000001916A6DC9B0>,
error_score='raise',
estimator=Pipeline(memory=None,
steps=[('feats', SelectKBest(k=10, score_func=<function f_classif at 0x000001916A6ABF28>)), ('tree', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best'))]),
fit_params=None, iid=True, n_iter=500, n_jobs=4,
param_distributions={'tree__min_impurity_decrease': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001916A6DC160>, 'feats__score_func': [<function f_classif at 0x000001916A6ABF28>, <function mutual_info_classif at 0x000001916A6CD2F0>], 'tree__max_features': <scipy.stats._distn_infrastru...tree__min_samples_leaf': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001916A6D9FD0>},
pre_dispatch='2*n_jobs', random_state=None, refit=True,
return_train_score=True, scoring='roc_auc', verbose=True)
random_search.best_estimator_
Pipeline(memory=None,
steps=[('feats', SelectKBest(k=10,
score_func=<function mutual_info_classif at 0x000001916A6CD2F0>)), ('tree', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=0.6995334988533182, max_leaf_nodes=None,
min_impurity_decrease=0.00253...t=47, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best'))])
res = pd.DataFrame( random_search.cv_results_ )
res.sort_values('rank_test_score').head(10)
mean_fit_time | mean_score_time | mean_test_score | mean_train_score | param_feats__score_func | param_tree__max_features | param_tree__min_impurity_decrease | param_tree__min_samples_leaf | param_tree__min_samples_split | params | ... | split7_test_score | split7_train_score | split8_test_score | split8_train_score | split9_test_score | split9_train_score | std_fit_time | std_score_time | std_test_score | std_train_score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
206 | 0.360021 | 0.001642 | 0.851727 | 0.862858 | <function mutual_info_classif at 0x000001916A6... | 0.699533 | 0.00253909 | 33 | 47 | {'feats__score_func': <function mutual_info_cl... | ... | 0.772863 | 0.866083 | 0.847184 | 0.881330 | 0.867859 | 0.867060 | 0.031329 | 0.003972 | 0.036028 | 0.016553 |
88 | 0.004166 | 0.001308 | 0.829953 | 0.838292 | <function f_classif at 0x000001916A6ABF28> | 0.87121 | 0.0133541 | 89 | 210 | {'feats__score_func': <function f_classif at 0... | ... | 0.736742 | 0.809940 | 0.845372 | 0.858085 | 0.798939 | 0.825771 | 0.006909 | 0.003953 | 0.034927 | 0.019111 |
194 | 0.298661 | 0.000000 | 0.811322 | 0.811516 | <function mutual_info_classif at 0x000001916A6... | 0.813777 | 0.0138638 | 124 | 79 | {'feats__score_func': <function mutual_info_cl... | ... | 0.713745 | 0.779844 | 0.815275 | 0.821707 | 0.798939 | 0.825771 | 0.029102 | 0.000000 | 0.032071 | 0.018513 |
230 | 0.435979 | 0.003441 | 0.777916 | 0.776460 | <function mutual_info_classif at 0x000001916A6... | 0.407913 | 0.127602 | 3 | 215 | {'feats__score_func': <function mutual_info_cl... | ... | 0.754735 | 0.786423 | 0.768997 | 0.766081 | 0.750393 | 0.787890 | 0.268832 | 0.005945 | 0.019844 | 0.009008 |
285 | 0.312650 | 0.001333 | 0.777642 | 0.779836 | <function mutual_info_classif at 0x000001916A6... | 0.95935 | 0.0960747 | 234 | 36 | {'feats__score_func': <function mutual_info_cl... | ... | 0.713745 | 0.779844 | 0.779288 | 0.780731 | 0.750393 | 0.787890 | 0.020375 | 0.001885 | 0.025972 | 0.005002 |
347 | 0.003999 | 0.001600 | 0.777642 | 0.779836 | <function f_classif at 0x000001916A6ABF28> | 0.722103 | 0.0547579 | 104 | 224 | {'feats__score_func': <function f_classif at 0... | ... | 0.713745 | 0.779844 | 0.779288 | 0.780731 | 0.750393 | 0.787890 | 0.002529 | 0.001959 | 0.025972 | 0.005002 |
263 | 0.004570 | 0.001416 | 0.777642 | 0.779836 | <function f_classif at 0x000001916A6ABF28> | 0.83097 | 0.0618748 | 211 | 14 | {'feats__score_func': <function f_classif at 0... | ... | 0.713745 | 0.779844 | 0.779288 | 0.780731 | 0.750393 | 0.787890 | 0.006060 | 0.002616 | 0.025972 | 0.005002 |
459 | 0.005208 | 0.000000 | 0.777642 | 0.779836 | <function f_classif at 0x000001916A6ABF28> | 0.93767 | 0.0945291 | 11 | 96 | {'feats__score_func': <function f_classif at 0... | ... | 0.713745 | 0.779844 | 0.779288 | 0.780731 | 0.750393 | 0.787890 | 0.009316 | 0.000000 | 0.025972 | 0.005002 |
224 | 0.002137 | 0.001575 | 0.773051 | 0.774232 | <function f_classif at 0x000001916A6ABF28> | 0.509988 | 0.0569047 | 74 | 12 | {'feats__score_func': <function f_classif at 0... | ... | 0.754735 | 0.786423 | 0.779288 | 0.780731 | 0.750393 | 0.787890 | 0.001999 | 0.003991 | 0.021037 | 0.015873 |
472 | 0.004166 | 0.002083 | 0.771509 | 0.774650 | <function f_classif at 0x000001916A6ABF28> | 0.555193 | 0.0792222 | 173 | 195 | {'feats__score_func': <function f_classif at 0... | ... | 0.713745 | 0.779844 | 0.779288 | 0.780731 | 0.750393 | 0.787890 | 0.006909 | 0.005311 | 0.027898 | 0.013231 |
10 rows × 45 columns
Summary
We can see that our maximum ROC AUC score (0.86) is similar to what we obtained in the last post (0.85), where we took a more manual approach. However, using sklearn-pandas and pipes we were able to write code that is more generalizable and less dependent on the actual dataset. We have more or less written generalizable code for the preprocessing pipe, but the code for the modelling pipe is quite specific to the model that we used. If we wanted to train models of a different type, we would have to manually write pipes for them as well (see the sketch below).
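As a rough sketch of how this could be generalised, we could swap the final estimator of the pipe and its parameter distributions inside a loop. The candidate models and parameter distributions below are just illustrative assumptions, not tuned values.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
from scipy import stats

candidates = dict(
    tree = ( DecisionTreeClassifier(), dict( model__min_samples_split = stats.randint(2, 250) ) )
    , logit = ( LogisticRegression(), dict( model__C = stats.uniform(0.01, 10) ) )
    , forest = ( RandomForestClassifier(), dict( model__n_estimators = stats.randint(50, 500) ) )
)

for name, (model, param_dist) in candidates.items():
    pipe = Pipeline([ ('feats', SelectKBest(k = 10)), ('model', model) ])
    search = RandomizedSearchCV( pipe, param_dist, n_iter = 20
                                 , scoring = 'roc_auc', cv = 5, n_jobs = 4 )
    search.fit(X_proc, y)
    print( name, round( search.best_score_, 3 ) )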