Moving from R to python - 7/7 - automated machine learning
- 1 of 7: IDE
- 2 of 7: pandas
- 3 of 7: matplotlib and seaborn
- 4 of 7: plotly
- 5 of 7: scikitlearn
- 6 of 7: advanced scikitlearn
- 7 of 7: automated machine learning
Table of Contents
Automated Machine Learning
We have seen in the previous post on advanced scikitlearn methods that using pipes in scikitlearn
allows us to write pretty generalizable code, but we still need to tailor our modelling pipeline to the algorithms we want to use. In principle, since scikitlearn
uses a unified syntax, there is no reason why we should not try all the modelling algorithms it supports, and combine each of them with various feature selection and preprocessing steps. The first limiting factor is that we need to define the hyperparameter space to be searched for each model, even when using randomized search, which requires some manual coding. The second limiting factor is computational power: it simply takes a very long time to go through all the possible combinations of pipes that we could build.
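To illustrate the unified syntax point above, here is a small self-contained sketch (not from the original post, and using a synthetic dataset) that swaps different classifiers in and out of the same pipe:
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_classification(n_samples=500, random_state=42)

## every estimator exposes the same fit/predict interface,
## so we can loop over them with identical code
for clf in [LogisticRegression(), DecisionTreeClassifier(), KNeighborsClassifier()]:
    pipe = make_pipeline(StandardScaler(), clf)
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='roc_auc')
    print(type(clf).__name__, round(scores.mean(), 3))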
Packages for automated machine learning take care of the manual work we would otherwise need to do to program the hyperparameter search, and they mitigate the problem of computational power by employing search strategies that preselect pipes that are likely to succeed and that optimise the hyperparameter search so that we do not have to test every single combination.
tpot
tpot
is a data science assistant that iteratively constructs sklearn
pipelines and optimises them using genetic programming algorithms that can optimize multiple criteria simultaneously while minimizing complexity. It uses a package called deap
- Supports regression and classification
- Supports the usual performance metrics
- Is meant to run for hours to days
- We can inspect the process anytime and look at intermediate results
- We can limit the algorithms and the hyperparameter space (not so useful at the moment because we have to specify the whole parameter range and basically end up doing a grid search; see the sketch below)
tpot
can generate python code to reconstruct the best models
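Regarding the option to limit the algorithms and the hyperparameter space: config_dict also accepts a custom dictionary in which each key is the import path of an operator and each value maps hyperparameter names to the values to search. A minimal sketch; the operators and value lists below are purely illustrative:
from tpot import TPOTClassifier

## illustrative restricted search space: one preprocessing step and one classifier
tpot_config = {
    'sklearn.preprocessing.Normalizer': {
        'norm': ['l1', 'l2', 'max']
    },
    'sklearn.linear_model.LogisticRegression': {
        'penalty': ['l1', 'l2'],
        'C': [0.01, 0.1, 1.0, 10.0, 20.0],
        'dual': [False]
    }
}

restricted_optimizer = TPOTClassifier(generations=2
                                      , population_size=2
                                      , config_dict = tpot_config ## instead of a preset like 'TPOT light'
                                      )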
Load data
We use the data prepared in the previous post on advanced scikitlearn methods: basically the Titanic dataset with imputed numerical and categorical variables.
import feather
df = feather.read_dataframe('./data/mapped_df.feather')
y = df['y']
X = df.drop('y', axis = 1)\
    .values ## .as_matrix() is deprecated in newer pandas versions; .values is the replacement
Run
We are using a limited version of tpot
which uses only fast algorithms
from tpot import TPOTClassifier
pipeline_optimizer = TPOTClassifier(generations=2
, population_size=2
, offspring_size = 5
## TPOT will evaluate
## population_size + generations × offspring_size
## pipelines in total
, cv=5
, random_state=42 ## seed
, verbosity=2 ## print progress bar
, n_jobs = 4
, warm_start = True ## allows us to restart
, scoring = 'roc_auc'
, config_dict = 'TPOT light' ## only uses fast algorithms
)
pipeline_optimizer.fit(X,y)
Generation 1 - Current best internal CV score: 0.8616010162631975
Generation 2 - Current best internal CV score: 0.8628532835333793
Best pipeline: LogisticRegression(Normalizer(input_matrix, norm=l1), C=20.0, dual=False, penalty=l1)
TPOTClassifier(config_dict='TPOT light', crossover_rate=0.1, cv=5,
disable_update_check=False, early_stop=None, generations=2,
max_eval_time_mins=5, max_time_mins=None, memory=None,
mutation_rate=0.9, n_jobs=4, offspring_size=5,
periodic_checkpoint_folder=None, population_size=2,
random_state=42, scoring='roc_auc', subsample=1.0, use_dask=False,
verbosity=2, warm_start=True)
pipeline_optimizer.score(X,y)
0.8745406320902439
Export best modelling pipeline as python code
pipeline_optimizer.export('./data/pipe.py')
True
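The exported script is a stand-alone reconstruction of the winning pipe. For the run above it should look roughly like the sketch below; this is based on tpot's export template, which changes between versions, and the data-loading part contains placeholders that we are expected to fill in ourselves:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

## the template expects the outcome column to be labelled 'target'
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'].values, random_state=42)

## best pipe found above: l1 normalization followed by l1-penalized logistic regression
exported_pipeline = make_pipeline(
    Normalizer(norm="l1"),
    LogisticRegression(C=20.0, dual=False, penalty="l1")
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)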
Get the best pipe
pipeline_optimizer.fitted_pipeline_
Pipeline(memory=None,
steps=[('normalizer', Normalizer(copy=True, norm='l1')), ('logisticregression', LogisticRegression(C=20.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False))])
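Since fitted_pipeline_ is an ordinary sklearn Pipeline, we can also persist it and reuse it like any other fitted estimator, for example with joblib (a quick sketch, not part of the original run):
import joblib ## older sklearn versions ship it as sklearn.externals.joblib

## persist the winning pipe found by tpot
joblib.dump(pipeline_optimizer.fitted_pipeline_, './data/best_pipe.joblib')

## reload it later and predict as with any fitted sklearn estimator
best_pipe = joblib.load('./data/best_pipe.joblib')
pred = best_pipe.predict(X)
prob = best_pipe.predict_proba(X)[:, 1] ## class probabilities, e.g. for ROC AUC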
Get all tested pipes
pipeline_optimizer.evaluated_individuals_
{'DecisionTreeClassifier(input_matrix, DecisionTreeClassifier__criterion=entropy, DecisionTreeClassifier__max_depth=2, DecisionTreeClassifier__min_samples_leaf=12, DecisionTreeClassifier__min_samples_split=4)': {'crossover_count': 0,
'generation': 'INVALID',
'internal_cv_score': 0.8557279030479364,
'mutation_count': 2,
'operator_count': 1,
'predecessor': ('DecisionTreeClassifier(input_matrix, DecisionTreeClassifier__criterion=entropy, DecisionTreeClassifier__max_depth=3, DecisionTreeClassifier__min_samples_leaf=12, DecisionTreeClassifier__min_samples_split=4)',)},
'DecisionTreeClassifier(input_matrix, DecisionTreeClassifier__criterion=entropy, DecisionTreeClassifier__max_depth=3, DecisionTreeClassifier__min_samples_leaf=12, DecisionTreeClassifier__min_samples_split=4)': {'crossover_count': 0,
'generation': 'INVALID',
'internal_cv_score': 0.8616010162631975,
'mutation_count': 1,
'operator_count': 1,
'predecessor': ('GaussianNB(input_matrix)',)},
'GaussianNB(input_matrix)': {'crossover_count': 0,
'generation': 0,
'internal_cv_score': 0.816504621285001,
'mutation_count': 0,
'operator_count': 1,
'predecessor': ('ROOT',)},
'KNeighborsClassifier(BernoulliNB(input_matrix, BernoulliNB__alpha=10.0, BernoulliNB__fit_prior=True), KNeighborsClassifier__n_neighbors=21, KNeighborsClassifier__p=1, KNeighborsClassifier__weights=distance)': {'crossover_count': 0,
'generation': 'INVALID',
'internal_cv_score': 0.8499987059406567,
'mutation_count': 1,
'operator_count': 2,
'predecessor': ('KNeighborsClassifier(input_matrix, KNeighborsClassifier__n_neighbors=21, KNeighborsClassifier__p=1, KNeighborsClassifier__weights=distance)',)},
'KNeighborsClassifier(BernoulliNB(input_matrix, BernoulliNB__alpha=100.0, BernoulliNB__fit_prior=True), KNeighborsClassifier__n_neighbors=21, KNeighborsClassifier__p=1, KNeighborsClassifier__weights=distance)': {'crossover_count': 0,
'generation': 'INVALID',
'internal_cv_score': 0.8484891857167135,
'mutation_count': 1,
'operator_count': 2,
'predecessor': ('KNeighborsClassifier(input_matrix, KNeighborsClassifier__n_neighbors=21, KNeighborsClassifier__p=1, KNeighborsClassifier__weights=distance)',)},
'KNeighborsClassifier(input_matrix, KNeighborsClassifier__n_neighbors=21, KNeighborsClassifier__p=1, KNeighborsClassifier__weights=distance)': {'crossover_count': 0,
'generation': 0,
'internal_cv_score': 0.8470927552585381,
'mutation_count': 0,
'operator_count': 1,
'predecessor': ('ROOT',)},
'KNeighborsClassifier(input_matrix, KNeighborsClassifier__n_neighbors=22, KNeighborsClassifier__p=1, KNeighborsClassifier__weights=distance)': {'crossover_count': 0,
'generation': 'INVALID',
'internal_cv_score': 0.846680654239431,
'mutation_count': 1,
'operator_count': 1,
'predecessor': ('KNeighborsClassifier(input_matrix, KNeighborsClassifier__n_neighbors=21, KNeighborsClassifier__p=1, KNeighborsClassifier__weights=distance)',)},
'LogisticRegression(MinMaxScaler(input_matrix), LogisticRegression__C=20.0, LogisticRegression__dual=False, LogisticRegression__penalty=l1)': {'crossover_count': 0,
'generation': 'INVALID',
'internal_cv_score': 0.8584054700315054,
'mutation_count': 2,
'operator_count': 2,
'predecessor': ('LogisticRegression(input_matrix, LogisticRegression__C=20.0, LogisticRegression__dual=False, LogisticRegression__penalty=l1)',)},
'LogisticRegression(Normalizer(input_matrix, Normalizer__norm=l1), LogisticRegression__C=20.0, LogisticRegression__dual=False, LogisticRegression__penalty=l1)': {'crossover_count': 0,
'generation': 'INVALID',
'internal_cv_score': 0.8628532835333793,
'mutation_count': 2,
'operator_count': 2,
'predecessor': ('LogisticRegression(input_matrix, LogisticRegression__C=20.0, LogisticRegression__dual=False, LogisticRegression__penalty=l1)',)},
'LogisticRegression(input_matrix, LogisticRegression__C=20.0, LogisticRegression__dual=False, LogisticRegression__penalty=l1)': {'crossover_count': 0,
'generation': 'INVALID',
'internal_cv_score': 0.8584323502037432,
'mutation_count': 1,
'operator_count': 1,
'predecessor': ('GaussianNB(input_matrix)',)},
'LogisticRegression(input_matrix, LogisticRegression__C=20.0, LogisticRegression__dual=False, LogisticRegression__penalty=l2)': {'crossover_count': 0,
'generation': 'INVALID',
'internal_cv_score': 0.8586414616613588,
'mutation_count': 2,
'operator_count': 1,
'predecessor': ('LogisticRegression(input_matrix, LogisticRegression__C=20.0, LogisticRegression__dual=False, LogisticRegression__penalty=l1)',)}}
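The dictionary is a bit unwieldy to read; a small sketch (not part of the original post) that tabulates the tested pipes and sorts them by their internal CV score:
import pandas as pd

## one row per tested pipe, indexed by the pipe definition string
tested_pipes = pd.DataFrame.from_dict(pipeline_optimizer.evaluated_individuals_
                                      , orient='index')

tested_pipes[['generation', 'operator_count', 'internal_cv_score']]\
    .sort_values('internal_cv_score', ascending=False)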
auto-sklearn
auto-sklearn
uses Bayesian optimization to make more efficient use of the computing time. We will add an example in a later version of this post.
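For orientation, a minimal sketch of what such a call might look like, based on auto-sklearn's documented AutoSklearnClassifier interface (not run here; the time budgets are illustrative):
import autosklearn.classification

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600  ## total search budget in seconds
    , per_run_time_limit=300      ## budget for each single pipeline fit
    , seed=42
)
automl.fit(X, y)
print(automl.sprint_statistics())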
Summary
Even though auto-sklearn still needs to be tested, we could already obtain pretty decent results using tpot
with a ROC AUC score of 0.87, which is higher than in our previous attempts. Normally I would follow this strategy to select a model (sketched in code after the list):
- Test all models on the same cv pairs
- Calculate mean and SEM for the performance metric of each variant
- Look at the model with the best mean
- Select the simplest model whose mean is still within one SEM of the overall best mean
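A sketch of what this could look like in code; the candidate models below are just placeholders, not the pipes tpot actually fitted:
from scipy.stats import sem
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

## the same folds for every candidate so the scores are comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {'logistic regression': LogisticRegression(C=20.0, penalty='l1', solver='liblinear')
              , 'decision tree': DecisionTreeClassifier(max_depth=3)}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
    print(f'{name}: mean = {scores.mean():.4f}, SEM = {sem(scores):.4f}')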
However tpot
does not give us the SEM values, so we cannot apply this rule to the model it presents as the best and compare it to the simpler ones it might have fitted. Given that the tpot
algorithm already minimizes complexity, we should simply accept the best model it returns. We should, however, compare it to simpler models of our own to have a frame of reference, and of course we should check the tpot
model for plausibility.