Effective Hyperparameter Optimization of LightGBM Classifiers with Optuna
LightGBM is a convenient, fast-to-train, and usually accurate implementation of gradient-boosted trees. In this post I use Optuna, which searches the hyperparameter space with Bayesian optimization methods, together with 5-fold cross validation to obtain a fairly accurate model. I also leverage some seemingly minor but very useful built-in features of the LightGBM library to handle categorical variables.
Note that the dataset used in this post is the Kaggle Tabular Playground Series March 2021 dataset, which can be downloaded at https://www.kaggle.com/c/tabular-playground-series-mar-2021. I am not going to go over the EDA aspects of the problem, because the purpose here is to show how easily Optuna can be used to tune the hyperparameters of LightGBM and yield highly accurate models.
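For context, this is roughly what a hand-rolled Optuna search looks like without the LightGBM integration: you define an objective function, let the default TPE sampler (a Bayesian optimization method) suggest hyperparameters, and score each trial with cross validation. This is only a minimal sketch, and X_df and y are placeholders for the feature matrix and target built later in this post; the rest of the post instead uses the LightGBMTunerCV integration, which automates the whole search.

import lightgbm as lgbm
import optuna
from sklearn.model_selection import cross_val_score

# X_df and y are placeholders for the feature matrix and target built later in this post
def objective(trial):
    params = {
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }
    clf = lgbm.LGBMClassifier(n_estimators=500, **params)
    # mean 5-fold cross-validated AUC for this trial's hyperparameters
    return cross_val_score(clf, X_df, y, cv=5, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")  # maximize AUC
study.optimize(objective, n_trials=50)
print(study.best_params)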
## Import the required packages. These include pandas, numpy, scikit-learn and optuna
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import KFold
import optuna
import optuna.integration.lightgbm as lgb
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
/kaggle/input/tabular-playground-series-mar-2021/sample_submission.csv
/kaggle/input/tabular-playground-series-mar-2021/train.csv
/kaggle/input/tabular-playground-series-mar-2021/test.csv
Read the train and test CSVs into variables and list out the column names.
train = pd.read_csv('/kaggle/input/tabular-playground-series-mar-2021/train.csv')
test = pd.read_csv('/kaggle/input/tabular-playground-series-mar-2021/test.csv')
train.columns.to_list()
['id',
'cat0',
'cat1',
'cat2',
'cat3',
'cat4',
'cat5',
'cat6',
'cat7',
'cat8',
'cat9',
'cat10',
'cat11',
'cat12',
'cat13',
'cat14',
'cat15',
'cat16',
'cat17',
'cat18',
'cont0',
'cont1',
'cont2',
'cont3',
'cont4',
'cont5',
'cont6',
'cont7',
'cont8',
'cont9',
'cont10',
'target']
Convert the categorical columns to the pandas category dtype so that LightGBM can handle the categorical variables natively. Unless you leverage learned embeddings for categorical variables, this fares better than one-hot encoding or label encoding.
conts = ['cont0','cont1','cont2','cont3','cont4','cont5','cont6','cont7','cont8','cont9','cont10']
cats = ['cat0','cat1','cat2','cat3','cat4','cat5','cat6','cat7','cat8','cat9','cat10','cat11','cat12','cat13','cat14','cat15','cat16','cat17','cat18','target']
for c in train.columns:
    col_type = train[c].dtype
    if col_type == 'object' or col_type.name == 'category':
        train[c] = train[c].astype('category')

for c in test.columns:
    col_type = test[c].dtype
    if col_type == 'object' or col_type.name == 'category':
        test[c] = test[c].astype('category')
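A quick optional sanity check (not part of the original code) that the conversion worked: all the cat columns should now report the pandas category dtype.

# Optional sanity check: cat0..cat18 (and any other object columns) should now be 'category'
print(train.select_dtypes(include='category').columns.tolist())
print(test.select_dtypes(include='category').columns.tolist())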
Specify the dependent and independent variables that will go into the LightGBM Dataset object.
features = conts + [c for c in cats if c != 'target']
X = train[features]
Y = train[['target']]
Use the Optuna LightGBM integration (LightGBMTunerCV) to do hyperparameter optimization with 5-fold cross validation. Make sure to pass categorical_feature='auto' so that LightGBM automatically treats the pandas category columns as categorical input features.
from sklearn.model_selection import StratifiedKFold
kfolds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
Notice how I’ve specified auc as the metric.
dtrain = lgb.Dataset(X, Y, categorical_feature='auto')
params = {
    "objective": "binary",
    "metric": "auc",
    "verbosity": -1,
    "boosting_type": "gbdt",
}
tuner = lgb.LightGBMTunerCV(
    params, dtrain, verbose_eval=100, early_stopping_rounds=1000000, folds=kfolds
)
tuner.run()
print("Best score:", tuner.best_score)
best_params = tuner.best_params
print("Best params:", best_params)
print(" Params: ")
for key, value in best_params.items():
    print("    {}: {}".format(key, value))
# Results
Best score: 0.8950986360077963
Best params: {'objective': 'binary', 'metric': 'auc', 'verbosity': -1, 'boosting_type': 'gbdt', 'feature_pre_filter': False, 'lambda_l1': 7.959176411127531, 'lambda_l2': 5.381687699546818e-05, 'num_leaves': 135, 'feature_fraction': 0.4, 'bagging_fraction': 1.0, 'bagging_freq': 0, 'min_child_samples': 20}
Params:
objective: binary
metric: auc
verbosity: -1
boosting_type: gbdt
feature_pre_filter: False
lambda_l1: 7.959176411127531
lambda_l2: 5.381687699546818e-05
num_leaves: 135
feature_fraction: 0.4
bagging_fraction: 1.0
bagging_freq: 0
min_child_samples: 20
Inspect the best cross-validated AUC score.
tuner.best_score
0.8950986360077963
Assign the best params to a variable
params = tuner.best_params
Use these parameters to train a LightGBM model on the entire training dataset
import lightgbm as lgb
id_test = test.id.to_list()
model = lgb.train(params, dtrain, num_boost_round=1000)
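As a possible refinement, one could estimate the number of boosting rounds with lgb.cv and early stopping instead of hard-coding 1000. This is only a sketch; the exact keyword (early_stopping_rounds versus an early-stopping callback) depends on the LightGBM version you are running.

# Optional: estimate num_boost_round by cross validation instead of hard-coding 1000.
# With early stopping, lgb.cv truncates its results at the best iteration.
# (early_stopping_rounds as a keyword assumes an older LightGBM API; newer versions use callbacks.)
cv_results = lgb.cv(params, dtrain, num_boost_round=5000, folds=kfolds,
                    early_stopping_rounds=100)
best_rounds = len(list(cv_results.values())[0])
print("Estimated num_boost_round:", best_rounds)
# One could then retrain with: lgb.train(params, dtrain, num_boost_round=best_rounds)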
Predict on the test set and save the submission file. Make sure to set index=False so the DataFrame index is not written as an extra column.
X_test = test[features]
preds = model.predict(X_test)
resultf = pd.DataFrame()
resultf['id'] = id_test
resultf['target'] = preds
resultf.to_csv('submission.csv',index=False)
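As a final sanity check, reload the submission and confirm it contains only the id and target columns, which is exactly why index=False matters above.

# Confirm the saved file has just the two expected columns
print(pd.read_csv('submission.csv').head())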