Classification with XGBoost
This chapter will introduce you to the fundamental idea behind XGBoost: boosted learners. Once you understand how XGBoost works, you’ll apply it to solve a common classification problem found in industry: predicting whether a customer will stop being a customer (churn) at some point in the future. This is the summary of the lecture "Extreme Gradient Boosting with XGBoost", via datacamp.
Jul 6, 2020 • Chanseok Kang • 6 min read
Introduction
- Supervised Learning
- Relies on labeled data
- Have some understanding of past behavior
- Area Under the ROC Curve (AUC)
- Larger area under the ROC curve = better model
- Features can be either numeric or categorical
- Numeric features should be scaled (Z-scored)
- Categorical features should be encoded (one-hot); see the short preprocessing sketch after this list
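A minimal preprocessing sketch with pandas and scikit-learn, using a made-up two-column frame (the column names here are hypothetical, purely to illustrate Z-scoring and one-hot encoding):

import pandas as pd
from sklearn.preprocessing import StandardScaler

# hypothetical frame: one numeric and one categorical feature
df = pd.DataFrame({'age': [23, 35, 52, 41], 'city': ['NYC', 'SF', 'NYC', 'LA']})

# Z-score the numeric feature
df[['age']] = StandardScaler().fit_transform(df[['age']])

# one-hot encode the categorical feature
df = pd.get_dummies(df, columns=['city'])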
Introducing XGBoost
- What is XGBoost? (eXtreme Gradient Boosting)
- Optimized gradient-boosting machine learning library
- Originally written in C++
- Has APIs in several languages:
- Python, R, Scala, Julia, Java
- Speed and performance
- Core algorithm is parallelizable
- Consistently outperforms single-algorithm methods
- State-of-the-art performance in many ML tasks
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb

plt.rcParams['figure.figsize'] = (10, 5)
XGBoost — Fit/Predict
It’s time to create your first XGBoost model! As Sergey showed you in the video, you can use the scikit-learn .fit() / .predict() paradigm that you are already familiar with to build your XGBoost models, as the xgboost library has a scikit-learn compatible API!
Here, you’ll be working with churn data. This dataset contains imaginary data from a ride-sharing app: user behaviors over their first month of app usage in a set of imaginary cities, as well as whether they were still using the service 5 months after sign-up.
Your goal is to use the first month’s worth of data to predict whether the app’s users will remain users of the service at the 5-month mark. This is a typical setup for a churn prediction problem. To do this, you’ll split the data into training and test sets, fit a small xgboost model on the training set, and evaluate its performance on the test set by computing its accuracy.
churn_data = pd.read_csv('./dataset/churn_data.csv')
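A sketch of that fit/predict workflow on the churn data, assuming the label column is named 'churn' (the actual column name in churn_data.csv may differ):

from sklearn.model_selection import train_test_split

# assumption: the label column is called 'churn'; adjust to the real column name
X = churn_data.drop('churn', axis=1)
y = churn_data['churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# small XGBoost classifier using the scikit-learn API
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, random_state=123)
xg_cl.fit(X_train, y_train)

preds = xg_cl.predict(X_test)
accuracy = float(np.sum(preds == y_test)) / y_test.shape[0]
print("accuracy: %f" % accuracy)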
Get Started with XGBoost
This is a quick-start tutorial with snippets for trying out XGBoost on the demo dataset for a binary classification task.
Links to Other Helpful Resources
- See Installation Guide on how to install XGBoost.
- See Text Input Format on using text format for specifying training/testing data.
- See Tutorials for tips and tutorials.
- See Learning to use XGBoost by Examples for more code examples.
Python
from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# read data
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data['data'], data['target'], test_size=.2)
# create model instance
bst = XGBClassifier(n_estimators=2, max_depth=2, learning_rate=1, objective='binary:logistic')
# fit model
bst.fit(X_train, y_train)
# make predictions
preds = bst.predict(X_test)
R
# load data
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
# fit model
bst <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1,
               nrounds = 2, nthread = 2, objective = "binary:logistic")
# predict
pred <- predict(bst, test$data)
Julia
using XGBoost
# read data
train_X, train_Y = readlibsvm("demo/data/agaricus.txt.train", (6513, 126))
test_X, test_Y = readlibsvm("demo/data/agaricus.txt.test", (1611, 126))
# fit model
num_round = 2
bst = xgboost(train_X, num_round, label=train_Y, eta=1, max_depth=2)
# predict
pred = predict(bst, test_X)
Scala
import ml.dmlc.xgboost4j.scala.DMatrix
import ml.dmlc.xgboost4j.scala.XGBoost

object XGBoostScalaExample {
  def main(args: Array[String]) {
    // read training data, available at xgboost/demo/data
    val trainData = new DMatrix("/path/to/agaricus.txt.train")
    // define parameters
    val paramMap = List(
      "eta" -> 0.1,
      "max_depth" -> 2,
      "objective" -> "binary:logistic").toMap
    // number of iterations
    val round = 2
    // train the model
    val model = XGBoost.train(trainData, paramMap, round)
    // run prediction
    val predTrain = model.predict(trainData)
    // save model to file
    model.saveModel("/local/path/to/model")
  }
}
DataTechNotes
In this tutorial, we’ll use the iris dataset as the classification data. First, we’ll separate data into x and y parts.
iris = load_iris()
x, y = iris.data, iris.target
Then we’ll split them into train and test parts. Here, we’ll extract 15 percent of the dataset as test data.
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)
We’ve loaded the XGBClassifier class from xgboost library above. Now we can define the classifier model.
xgbc = XGBClassifier()
print(xgbc)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)
You can change the classifier model parameters according to your dataset characteristics. Here, we’ve defined it with default parameter values.
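For example, a few commonly tuned parameters can be overridden at construction time; the values below are purely illustrative and not tuned for the iris data (the walkthrough continues with the default xgbc defined above):

# illustrative, untuned values
xgbc_tuned = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.05, subsample=0.8)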
We’ll fit the model with the training data. Next, we’ll check the training accuracy with cross-validation and k-fold methods.
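The fit call itself only appears in the full listing at the end of the post, but it belongs at this point in the walkthrough:

xgbc.fit(xtrain, ytrain)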
scores = cross_val_score(xgbc, xtrain, ytrain, cv=5)
print("Mean cross-validation score: %.2f" % scores.mean())
Mean cross-validation score: 0.94
kfold = KFold(n_splits=10, shuffle=True)
kf_cv_scores = cross_val_score(xgbc, xtrain, ytrain, cv=kfold)
print("K-fold CV average score: %.2f" % kf_cv_scores.mean())
K-fold CV average score: 0.94
Finally, we’ll predict the test data and check the prediction accuracy with a confusion matrix.
ypred = xgbc.predict(xtest)
cm = confusion_matrix(ytest, ypred)
print(cm)
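If you also want a single accuracy number alongside the confusion matrix, scikit-learn’s accuracy_score works on the same predictions (this step is an addition, not part of the original walkthrough):

from sklearn.metrics import accuracy_score
print("Test accuracy: %.2f" % accuracy_score(ytest, ypred))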
In this post, we’ve briefly learned how to classify data with the XGBClassifier class in Python. The full source code is listed below.
from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, KFold

iris = load_iris()
x, y = iris.data, iris.target
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)

xgbc = XGBClassifier()
print(xgbc)

xgbc.fit(xtrain, ytrain)

# cross validation
scores = cross_val_score(xgbc, xtrain, ytrain, cv=5)
print("Mean cross-validation score: %.2f" % scores.mean())

kfold = KFold(n_splits=10, shuffle=True)
kf_cv_scores = cross_val_score(xgbc, xtrain, ytrain, cv=kfold)
print("K-fold CV average score: %.2f" % kf_cv_scores.mean())

ypred = xgbc.predict(xtest)
cm = confusion_matrix(ytest, ypred)
print(cm)
XGBoost Python Feature Walkthrough
This is a collection of examples for using the XGBoost Python package.