August 6, 2024
Actual machine learning code usually takes up a minority of your project code [1].
The majority of your code is probably data processing, the rest of the production system, etc.
Here, you can find the shorter part :)
Specifically, this example code tunes a few hyperparameters of an XGBoost
model and estimates its generalization performance, all with two loops of
cross-validation, also known as nested cross-validation.
Key snippets are:
clf = XGBClassifier(...)
# inner cv
gcv = GridSearchCV(clf, params, ...)
# outer cv
results = cross_validate(gcv, X, y, ...)
# estimates of generalization like:
results['test_roc_auc'].mean()
# predict on new data
gcv.best_estimator_.predict(X_new)
See below for full code.
BONUS: Explaining models
There are different philosophies and methods for “explaining” ML models.
Some ideas to get started:
- Explain the final model from gcv.fit(X, y) (see code)
- And/or summarize explanations from all models via cross_validate(..., return_estimator=True)
- Compare explanations of XGB to other models
- Try permutation importance (on train or test folds…?); see the sketch after this list
- Use SHAP values [2], also sketched below
- And a lot more…
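To make those last two ideas concrete, here is a minimal sketch, assuming gcv, X, and y from the full code below. The metric, n_repeats, and the choice to explain on the training data are illustrative rather than prescriptive, and the SHAP portion assumes the shap package is installed (pip install shap):

from sklearn.inspection import permutation_importance

# fit the final model, then permute each feature and measure the drop
# in score (computed on the training data here, for simplicity; a
# held-out fold would give a less optimistic picture)
gcv.fit(X, y)
result = permutation_importance(gcv.best_estimator_, X, y,
                                scoring='roc_auc', n_repeats=10,
                                random_state=99)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}"
          f" +/- {result.importances_std[i]:.3f}")

# SHAP: explain the XGB step of the pipeline on scaled inputs
# (assumes: pip install shap)
import shap

best = gcv.best_estimator_
X_scaled = best.named_steps['standardscaler'].transform(X)
explainer = shap.TreeExplainer(best.named_steps['xgbclassifier'])
shap_values = explainer.shap_values(X_scaled)
shap.summary_plot(shap_values, X_scaled)

The same pattern extends to the per-fold models from cross_validate(..., return_estimator=True): explain each fitted estimator, then summarize across folds.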
Code
Copy/clone from github.com/plpxsk/xgb
Install requirements into your virtual env:
pip install xgboost scikit-learn
Run the code script:
python main.py
… where main.py is:
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier
# train and evaluate with nested cross-validation
inner_cv = 5
outer_cv = 3
inner_metric = 'roc_auc'
outer_metrics = ['roc_auc', 'average_precision']
random_state = 99
n_jobs_1 = 2  # parallelism inside XGBoost
n_jobs_2 = 3  # parallelism across outer CV folds
X, y = make_classification(random_state=random_state)
params = dict(
    xgbclassifier__n_estimators=[100, 500],
    xgbclassifier__max_depth=[1, 3, 5]
)
clf = make_pipeline(
    StandardScaler(),
    XGBClassifier(random_state=random_state, n_jobs=n_jobs_1)
)
# inner cv
gcv = GridSearchCV(clf, params, scoring=inner_metric, cv=inner_cv, n_jobs=1,
                   return_train_score=False)
# outer cv
results = cross_validate(gcv, X, y, scoring=outer_metrics, cv=outer_cv,
                         n_jobs=n_jobs_2, return_train_score=False)
# show estimates of generalization performance of our classifier
print(results['test_roc_auc'].mean(), "ROC AUC")
print(results['test_average_precision'].mean(), "PR AUC")
# final model that can be used in production
# gcv.fit(X, y)
# gcv.best_params_
# predict in production
# gcv.predict(X_new)
# or more explicitly
# gcv.best_estimator_.predict(X_new)
[1] See, e.g., Figure 1 in Sculley, Holt, et al., “Hidden Technical Debt in Machine Learning Systems”, NeurIPS 2015. papers.nips.cc/paper_files/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html
[2] SHAP values: shap.readthedocs.io
References
Code: github.com/plpxsk/xgb
See previous post: What is nested cross-validation (for) and why you should use it
See scikit-learn docs: Plot Nested CV
From Prof. Sebastian Raschka: Code Notebooks