Binary Classification with a Bank Churn Dataset

Overview

Today is the seventh day of 30 Kaggle Challenges in 30 Days. I took a day off after six days of continuous posting. Solving the problems, writing the blog posts, and publishing them takes a lot of time, and because the website is new, I discover bugs almost daily, which also eat into my time. Going at this pace, and considering my other commitments, I think this challenge will take more than 30 days, perhaps 36.

Problem Description

Today’s problem is binary classification with a bank churn dataset. The task is to predict whether a customer continues with their account or closes it (i.e., churns). The evaluation metric is the area under the ROC curve between the predicted probability and the observed target. The dataset for this competition was generated from a deep learning model trained on the bank customer churn prediction dataset. Links for both are as follows:

Kaggle Dataset: Season 4, Episode 1
Original Dataset: Bank Customer Churn Prediction
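
Since the metric is computed on predicted probabilities rather than hard labels, here is a tiny illustrative sketch (with made-up numbers, not competition data) of how the score is calculated with scikit-learn:

```python
from sklearn.metrics import roc_auc_score

# Toy example: observed targets and predicted churn probabilities
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.35, 0.80, 0.65, 0.20, 0.55]

# AUC is computed on the probability scores, not on 0/1 predictions
print(roc_auc_score(y_true, y_prob))  # 1.0, since every churner is ranked above every non-churner
```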

Data Description

The training dataset has 14 columns: two ID columns, the target variable ‘Exited’, and 11 features. The size of each dataset is as follows:

Train Shape:	Rows: 165034	Columns: 14
Test Shape:	    Rows: 110023	Columns: 13
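
For reference, a minimal sketch of loading the files and separating the ID columns and the target. The file paths and column names (id, CustomerId, Exited, and categorical columns such as Geography and Gender) are assumptions based on the standard bank-churn schema, not copied from my notebook:

```python
import pandas as pd

# Paths assume the default Kaggle layout for the S4E1 playground competition
train = pd.read_csv("/kaggle/input/playground-series-s4e1/train.csv")
test = pd.read_csv("/kaggle/input/playground-series-s4e1/test.csv")

print(train.shape, test.shape)  # expected: (165034, 14) (110023, 13)

# Keep the target separately and drop the two ID columns from the features.
# Text/categorical columns (e.g., Surname, Geography, Gender) still need
# encoding before being fed to most of the models below.
y = train["Exited"]
X = train.drop(columns=["id", "CustomerId", "Exited"])
test_features = test.drop(columns=["id", "CustomerId"])
```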

Target Distribution

Kaggle S4E1: Distribution of the target variable

The retained class far outnumbers the churned class, which makes the dataset highly imbalanced. We will use stratified folds to validate model performance, which preserves the class ratio in each fold and reduces the impact of the imbalance.
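
Here is a minimal sketch of how such a stratified split can be built, assuming the feature frame X and target series y from the loading step above; each validation fold keeps roughly the same churn rate as the full training set:

```python
from sklearn.model_selection import StratifiedKFold

# Five folds, shuffled with a fixed seed so the split is reproducible
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, valid_idx) in enumerate(skf.split(X, y)):
    # Each validation fold preserves the overall class ratio
    churn_rate = y.iloc[valid_idx].mean()
    print(f"Fold {fold}: {len(valid_idx)} rows, churn rate = {churn_rate:.4f}")
```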

Model Performance

I fitted the following models and ran each on the five folds created from the training dataset. The average scores and training times are as follows:

Model                   Average AUC     Average Time (sec)
LightGBM                0.8893          0.652
Gradient Boosting       0.8885          22.12
CatBoost                0.8885          15.292
XGBoost                 0.8863          0.682
AdaBoost (SAMME.R)      0.8803          6.376
AdaBoost (SAMME)        0.8735          5.536
Logistic Regression     0.8707          0.394
Random Forest           0.8705          16.806
Extra Trees             0.8572          14.394
Bagging                 0.8422          6.86
K-Nearest Neighbors     0.8177          3.58
Decision Tree           0.7022          1.128

Table: Average AUC Score and Average Training Time per Model
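
The comparison above boils down to a loop like the following sketch (not the exact notebook code): fit each candidate model on every fold, score the out-of-fold predictions with ROC AUC, and average the scores and fit times. Only two models are shown for brevity, and the feature matrix X is assumed to be fully numeric, i.e., categoricals already encoded:

```python
import time

import numpy as np
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

models = {
    "LightGBM": LGBMClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models.items():
    aucs, times = [], []
    for train_idx, valid_idx in skf.split(X, y):
        X_tr, X_va = X.iloc[train_idx], X.iloc[valid_idx]
        y_tr, y_va = y.iloc[train_idx], y.iloc[valid_idx]

        start = time.time()
        model.fit(X_tr, y_tr)
        times.append(time.time() - start)

        # Score the positive-class probabilities, as required by the metric
        aucs.append(roc_auc_score(y_va, model.predict_proba(X_va)[:, 1]))

    print(f"{name}: AUC = {np.mean(aucs):.4f}, time = {np.mean(times):.3f}s")
```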

The best average AUC came from the LightGBM model.

Result

I selected the LightGBM model for the final submission.

Kaggle Score: 0.89197
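
For completeness, a rough sketch of how a submission file like this can be produced: refit LightGBM on the full training data and write out the predicted churn probabilities for the test set. The default parameters, the test_features frame, and the id column name are assumptions; the actual notebook may use tuned hyperparameters:

```python
import pandas as pd
from lightgbm import LGBMClassifier

# Refit LightGBM on the full training data before predicting on the test set
final_model = LGBMClassifier(random_state=42)
final_model.fit(X, y)

submission = pd.DataFrame({
    "id": test["id"],
    "Exited": final_model.predict_proba(test_features)[:, 1],
})
submission.to_csv("submission.csv", index=False)
```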

Progress on Challenge

This challenge is taking up all of my time; I barely do anything other than solve Kaggle problems. The first five days were fun, but it is getting repetitive now that the format is fixed. I think I should up my game: instead of just reporting scores and submission results, I should explore certain parts of the code, explain the logic behind them, and give the reasons for doing a task a certain way. The problem with Kaggle competitions is that I spend a lot of time on hyperparameter tuning and feature engineering instead of learning new models and the internal workings of each algorithm. So, from tomorrow onward, I won’t try to improve the score; instead, I will try to understand why specific models work well and why others score poorly.

Links

Notebook: Kaggle Notebook for S4E1
Code: GitHub Repository for Day 7

← Back to Blog