Classification with an Academic Success Dataset
Sep 12, 2024
Overview
This is the second challenge of the 30 Kaggle Challenges in 30 Days series. Yesterday, we worked on a binary classification problem for insurance cross-selling; today, we are tackling a multiclass classification problem to predict students’ academic outcomes.
Kaggle Playground Series, Season 4 Episode 6
The dataset for this competition was generated from a deep learning model trained on the Predict Students’ Dropout and Academic Success dataset. The features are close to, but not exactly the same as, the original. Our task is to build a multiclass classification model to predict students’ dropout and academic success.
This is a 3-category classification problem with a notable imbalance between classes.
Exploratory Data Analysis (EDA)
Dataset Overview
import pandas as pd

train = pd.read_csv('train.csv')
train.shape
(76518, 38)
The train dataset has 76,518 rows and 38 columns. You can run train.head() to get an idea of the various columns, but there are too many to reproduce nicely here. Next, we check the value counts of the target variable.
train.Target.value_counts()
Target
Graduate 36282
Dropout 25296
Enrolled 14940
Name: count, dtype: int64
From the counts, you can see that nearly half (about 47%) of the students graduated, 33% dropped out, and the remaining 20% are still enrolled. There is a clear imbalance in the target feature. Since the evaluation metric is accuracy, imbalanced data can give us a misleading picture of how well the model performs. We will need to handle the imbalance to make good predictions in all categories.
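These proportions can be confirmed with a normalized value count:

train.Target.value_counts(normalize=True)
# Graduate    ~0.47
# Dropout     ~0.33
# Enrolled    ~0.20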
Missing Values
We check for missing values; the train dataset doesn’t have any.
train.isna().sum()
id 0
Marital status 0
Application mode 0
Application order 0
Course 0
Daytime/evening attendance 0
Previous qualification 0
Previous qualification (grade) 0
Nacionality 0
Mother's qualification 0
Father's qualification 0
Mother's occupation 0
Father's occupation 0
Admission grade 0
Displaced 0
Educational special needs 0
Debtor 0
Tuition fees up to date 0
Gender 0
Scholarship holder 0
Age at enrollment 0
International 0
Curricular units 1st sem (credited) 0
Curricular units 1st sem (enrolled) 0
Curricular units 1st sem (evaluations) 0
Curricular units 1st sem (approved) 0
Curricular units 1st sem (grade) 0
Curricular units 1st sem (without evaluations) 0
Curricular units 2nd sem (credited) 0
Curricular units 2nd sem (enrolled) 0
Curricular units 2nd sem (evaluations) 0
Curricular units 2nd sem (approved) 0
Curricular units 2nd sem (grade) 0
Curricular units 2nd sem (without evaluations) 0
Unemployment rate 0
Inflation rate 0
GDP 0
Target 0
dtype: int64
Data Types
Next, we check the data types of the features.
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76518 entries, 0 to 76517
Data columns (total 38 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 76518 non-null int64
1 Marital status 76518 non-null int64
2 Application mode 76518 non-null int64
3 Application order 76518 non-null int64
4 Course 76518 non-null int64
5 Daytime/evening attendance 76518 non-null int64
6 Previous qualification 76518 non-null int64
7 Previous qualification (grade) 76518 non-null float64
8 Nacionality 76518 non-null int64
9 Mother's qualification 76518 non-null int64
10 Father's qualification 76518 non-null int64
11 Mother's occupation 76518 non-null int64
12 Father's occupation 76518 non-null int64
13 Admission grade 76518 non-null float64
14 Displaced 76518 non-null int64
15 Educational special needs 76518 non-null int64
16 Debtor 76518 non-null int64
17 Tuition fees up to date 76518 non-null int64
18 Gender 76518 non-null int64
19 Scholarship holder 76518 non-null int64
20 Age at enrollment 76518 non-null int64
21 International 76518 non-null int64
22 Curricular units 1st sem (credited) 76518 non-null int64
23 Curricular units 1st sem (enrolled) 76518 non-null int64
24 Curricular units 1st sem (evaluations) 76518 non-null int64
25 Curricular units 1st sem (approved) 76518 non-null int64
26 Curricular units 1st sem (grade) 76518 non-null float64
27 Curricular units 1st sem (without evaluations) 76518 non-null int64
28 Curricular units 2nd sem (credited) 76518 non-null int64
29 Curricular units 2nd sem (enrolled) 76518 non-null int64
30 Curricular units 2nd sem (evaluations) 76518 non-null int64
31 Curricular units 2nd sem (approved) 76518 non-null int64
32 Curricular units 2nd sem (grade) 76518 non-null float64
33 Curricular units 2nd sem (without evaluations) 76518 non-null int64
34 Unemployment rate 76518 non-null float64
35 Inflation rate 76518 non-null float64
36 GDP 76518 non-null float64
37 Target 76518 non-null object
dtypes: float64(7), int64(30), object(1)
memory usage: 22.2+ MB
As you can see, all of the features except the target are numerical. This is because the data is already encoded: features like Marital status, Application mode, and Course are originally categorical, and their descriptions are available at the UCI ML Repository.
First, we will use the data directly in the numerical form provided by Kaggle. Later, we can convert the data into its categorical form to do feature engineering.
Our main goal is to build an ML model, so I am not going to do a detailed data analysis here, though one would be helpful if we later do feature engineering. I have prepared plots and detailed statistics on each feature in the Jupyter notebook in case you are interested in going through them.
Handling Imbalance with Stratified K-Fold Cross-Validation
The next task is to create a stratified K-fold cross-validation from the training set to validate the models before building the final model. By validating the models on different folds, we can gain insights into how well the model will perform on unseen data. Since the target variable is highly imbalanced, using a stratified K-fold ensures that each fold maintains the same proportion of the target variable as in the original dataset.
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def create_fold(fold):
    # import data and add a placeholder fold column
    data = pd.read_csv('./input/train.csv')
    data['kfold'] = -1

    # shuffle the rows before assigning folds
    data = data.sample(frac=1).reset_index(drop=True)
    y = data.Target.values

    # assign each validation index to one of the stratified folds
    skf = StratifiedKFold(n_splits=fold)
    for f, (t_, v_) in enumerate(skf.split(X=data, y=y)):
        data.loc[v_, 'kfold'] = f

    data.to_csv('./input/train_folds.csv', index=False)

if __name__ == '__main__':
    create_fold(5)
Feature Scaling
Since every feature has a numerical data type, we scale all of them.
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit on the training data, then apply the same transformation to the test data
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Model Training and Evaluation
We will now try different models and evaluate their performance. The evaluation metric is accuracy.
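Every model below is trained and scored with the same per-fold routine. Here is a minimal sketch of that routine (the run() name and column handling are assumptions based on the folds file created above, not the exact script from the repository):

import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

def run(fold, model):
    # read the folds file written by create_fold()
    df = pd.read_csv('./input/train_folds.csv')

    # rows outside this fold form the training set; the fold itself is validation
    df_train = df[df.kfold != fold].reset_index(drop=True)
    df_valid = df[df.kfold == fold].reset_index(drop=True)

    X_train = df_train.drop(columns=['id', 'Target', 'kfold']).values
    X_valid = df_valid.drop(columns=['id', 'Target', 'kfold']).values
    y_train = df_train.Target.values
    y_valid = df_valid.Target.values

    # fit the scaler on training data only to avoid leaking validation statistics
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_valid = scaler.transform(X_valid)

    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    print(f'Fold={fold}, Accuracy={accuracy_score(y_valid, preds)}')

Each classifier is then evaluated with something like for f in range(5): run(f, LogisticRegression()). (For XGBoost, the string targets must first be label encoded, as noted in the Key Learnings below.)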
Logistic Regression
from sklearn.linear_model import LogisticRegression
Results:
Fold=0, Accuracy=0.8187401986408782
Fold=1, Accuracy=0.814231573444851
Fold=2, Accuracy=0.8159958180867747
Fold=3, Accuracy=0.8163105273475789
Fold=4, Accuracy=0.8166372606678429
Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
Results:
Fold=0, Accuracy=0.7449032932566649
Fold=1, Accuracy=0.7400679560899112
Fold=2, Accuracy=0.7374542603240982
Fold=3, Accuracy=0.7370450238515324
Fold=4, Accuracy=0.7412925570149644
Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
Results:
Fold=0, Accuracy=0.8278881338212232
Fold=1, Accuracy=0.8227914270778881
Fold=2, Accuracy=0.8259278619968635
Fold=3, Accuracy=0.8231719270731229
Fold=4, Accuracy=0.8272887669084493
SVM
from sklearn.svm import SVC
Results:
Fold=0, Accuracy=0.8208311552535285
Fold=1, Accuracy=0.8147543125980136
Fold=2, Accuracy=0.8207004704652379
Fold=3, Accuracy=0.8194471672221133
Fold=4, Accuracy=0.8220610337842253
Gradient Boosting Classifier
from sklearn.ensemble import GradientBoostingClassifier
Results:
Fold=0, Accuracy=0.8295216936748563
Fold=1, Accuracy=0.8273000522739153
Fold=2, Accuracy=0.8282148457919498
Fold=3, Accuracy=0.8248709403384957
Fold=4, Accuracy=0.8280729268770829
XGBoost Classifier
from xgboost import XGBClassifier
Results:
Fold=0, Accuracy=0.8325927861996864
Fold=1, Accuracy=0.8284762153685311
Fold=2, Accuracy=0.8296523784631469
Fold=3, Accuracy=0.8267006469319741
Fold=4, Accuracy=0.8330392733450958
Model Comparison
Here’s a comparison of the models tested:
Model | Average Accuracy
--- | ---
Logistic Regression | 81.64%
Decision Tree Classifier | 74.01%
Random Forest Classifier | 82.59%
SVM | 81.96%
Gradient Boosting | 82.76%
XGBoost | 83.01%
XGBoost gave us the best results, making it the model of choice for final submission.
Final Model and Submission
We selected XGBoost for our final submission. After training on the entire dataset, we achieved a final accuracy score of 83.45%.
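Here is a minimal sketch of that final step (the test.csv and submission file layout are assumptions based on the usual Playground Series format, not the exact notebook code):

import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from xgboost import XGBClassifier

train = pd.read_csv('./input/train.csv')
test = pd.read_csv('./input/test.csv')  # assumed competition test file

# XGBoost needs integer class labels, so encode the string targets
le = LabelEncoder()
y = le.fit_transform(train.Target.values)

X = train.drop(columns=['id', 'Target']).values
X_test = test.drop(columns=['id']).values

# scale with the statistics of the full training set
scaler = StandardScaler()
X = scaler.fit_transform(X)
X_test = scaler.transform(X_test)

model = XGBClassifier()
model.fit(X, y)

# map integer predictions back to the original labels for submission
preds = le.inverse_transform(model.predict(X_test))
pd.DataFrame({'id': test.id, 'Target': preds}).to_csv('submission.csv', index=False)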
You can check the full code in the Kaggle Notebook.
Key Learnings
- Label Encoding for XGBoost: XGBoost does not accept string values in the target variable, so I had to label encode the target values (as shown in the final-model sketch above).
- Automated Model Testing: I implemented a command-line script to automate running multiple models across different folds, which makes it easy to test various models and streamlines experimentation (a minimal sketch follows below). I will write a detailed blog post on this topic, so stay tuned. For now, you can check the code in the GitHub repository.
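To illustrate the second point, here is a minimal sketch of such a command-line runner (the model names, flags, and file layout are assumptions, not the exact script from the repository):

import argparse

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# map short command-line names to model instances
MODELS = {
    'lr': LogisticRegression(max_iter=1000),
    'rf': RandomForestClassifier(),
    'gb': GradientBoostingClassifier(),
    'xgb': XGBClassifier(),
}

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model', required=True, choices=MODELS)
    parser.add_argument('--fold', type=int, required=True)
    args = parser.parse_args()

    # run() is the per-fold routine sketched earlier, assumed to live in this script
    run(args.fold, MODELS[args.model])

It can then be invoked as, for example, python train.py --model xgb --fold 0.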
Conclusion
This was the second day of the 30 Kaggle Challenges in 30 Days series, and we successfully completed a multiclass classification problem. Feel free to try the code from the Kaggle Notebook and share your results!