August 6, 2024
Actual machine learning code usually takes up a minority of your project code [1].
The majority of your code is probably data processing, the rest of the production system, etc.
Here, you can find the shorter part :)
Specifically, this example code tunes a few hyperparameters of an XGBoost
model and estimates its generalization performance, all with two loops of
cross-validation, also known as nested cross-validation.
Key snippets are:
clf = XGBClassifier(...)
# inner cv
gcv = GridSearchCV(clf, params, ...)
# outer cv
results = cross_validate(gcv, X, y, ...)
# estimates of generalization like:
results['test_roc_auc'].mean()
# predict on new data
gcv.best_estimator_.predict(X_new)
See below for full code.
BONUS: Explaining models
There are different philosophies and methods for “explaining” ML models.
Some ideas to get started:
- Explain the final model from gcv.fit(X, y) (see code)
- And/or summarize explanations from all models via cross_validate(..., return_estimator=True)
- Compare explanations of XGB to other models
- Try permutation importance (on train or test folds…?); see the sketch after this list
- Use SHAP values [2], also sketched below
- And a lot more…
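To make those last two ideas concrete, here is a minimal sketch, assuming gcv, X, and y from the full code below. The metric, n_repeats, and the choice to explain on the training data are illustrative rather than prescriptive, and the SHAP portion assumes the shap package is installed (pip install shap):

from sklearn.inspection import permutation_importance

# fit the final model, then permute each feature and measure the drop
# in score (computed on the training data here, for simplicity; a
# held-out fold would give a less optimistic picture)
gcv.fit(X, y)
result = permutation_importance(gcv.best_estimator_, X, y,
                                scoring='roc_auc', n_repeats=10,
                                random_state=99)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}"
          f" +/- {result.importances_std[i]:.3f}")

# SHAP: explain the XGB step of the pipeline on scaled inputs
# (assumes: pip install shap)
import shap

best = gcv.best_estimator_
X_scaled = best.named_steps['standardscaler'].transform(X)
explainer = shap.TreeExplainer(best.named_steps['xgbclassifier'])
shap_values = explainer.shap_values(X_scaled)
shap.summary_plot(shap_values, X_scaled)

The same pattern extends to the per-fold models from cross_validate(..., return_estimator=True): explain each fitted estimator, then summarize across folds.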
Code
Copy/clone from github.com/plpxsk/xgb
Install requirements into your virtual env:
pip install xgboost scikit-learn
Run the code script:
python main.py
… where main.py is:
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier
# train and evaluate with nested cross-validation
inner_cv = 5
outer_cv = 3
inner_metric = 'roc_auc'
outer_metrics = ['roc_auc', 'average_precision']
random_state = 99
n_jobs_1 = 2  # parallelism inside XGBoost
n_jobs_2 = 3  # parallelism across outer CV folds
X, y = make_classification(random_state=random_state)
params = dict(
    xgbclassifier__n_estimators=[100, 500],
    xgbclassifier__max_depth=[1, 3, 5]
)
clf = make_pipeline(
    StandardScaler(),
    XGBClassifier(random_state=random_state, n_jobs=n_jobs_1)
)
# inner cv
gcv = GridSearchCV(clf, params, scoring=inner_metric, cv=inner_cv, n_jobs=1,
                   return_train_score=False)
# outer cv
results = cross_validate(gcv, X, y, scoring=outer_metrics, cv=outer_cv,
                         n_jobs=n_jobs_2, return_train_score=False)
# show estimates of generalization performance of our classifier
print(results['test_roc_auc'].mean(), "ROC AUC")
print(results['test_average_precision'].mean(), "PR AUC")
# final model that can be used in production
# gcv.fit(X, y)
# gcv.best_params_
# predict in production
# gcv.predict(X_new)
# or more explicitly
# gcv.best_estimator_.predict(X_new)
[1] See, e.g., Figure 1 in Sculley, Holt, et al., “Hidden Technical Debt in Machine Learning Systems”, NeurIPS 2015. papers.nips.cc/paper_files/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html
[2] SHAP values: shap.readthedocs.io
References
Code: github.com/plpxsk/xgb
See previous post: What is nested cross-validation (for) and why you should use it
See scikit-learn docs: Plot Nested CV
From Prof. Sebastian Raschka: Code Notebooks