Examples

This document provides practical examples of how to use the suraj_datalab package. Each example demonstrates a specific feature, complete with sample code and expected outputs.

Example 1: Analyzing a Categorical Feature

Scenario

You have a dataset with a categorical feature, and you want to analyze its distribution with respect to a target variable.

Code

import pandas as pd
from suraj_datalab import analyze

# Sample DataFrame
data = {
    'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'A'],
    'Target': [1, 0, 1, 0, 1, 0, 0, 1]
}
df = pd.DataFrame(data)

# Analyze the categorical feature
distribution = analyze.categorical_feature(df, 'Category', 'Target')
print(distribution)

Expected Output

   Total Count  Total Percentage  0 of Total (%)  1 of Total (%)  0 within Category (%)  1 within Category (%)
A            4              50.0            25.0            75.0                   25.0                   75.0
B            2              25.0            50.0            50.0                   50.0                   50.0
C            2              25.0            75.0            25.0                  100.0                    0.0

A plot showing the distribution of the Category feature by the Target variable will also be generated.
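
The counts and within-category percentages can be cross-checked with plain pandas. This is a minimal sketch, not the package's implementation: it uses only pd.crosstab and value_counts, and the variable names (counts, share, within) are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["A", "B", "A", "C", "B", "A", "C", "A"],
    "Target": [1, 0, 1, 0, 1, 0, 0, 1],
})

# Rows per category and each category's share of the whole dataset
counts = df["Category"].value_counts().sort_index()
share = (counts / len(df) * 100).round(1)

# Percentage of each target class within a category (each row sums to 100)
within = (pd.crosstab(df["Category"], df["Target"], normalize="index") * 100).round(1)

print(counts.to_dict())
print(share.to_dict())
print(within)
```

Both rows of category C have Target 0, so its within-category split is 100 / 0.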

Example 2: Analyzing a Numerical Feature

Scenario

You want to analyze the distribution of a numerical feature, identify outliers, and visualize the data.

Code

import pandas as pd
from suraj_datalab import analyze

# Sample DataFrame
data = {
    'NumericalFeature': [10, 12, 10, 22, 23, 45, 47, 50],
    'Target': [1, 1, 0, 0, 1, 1, 0, 0]
}
df = pd.DataFrame(data)

# Analyze the numerical feature
outliers_df, summary_df = analyze.numerical_feature(df, 'NumericalFeature', 'Target')
print(outliers_df)
print(summary_df)

Expected Output

   Outlier Percentage  Lower Outliers Percentage  Upper Outliers Percentage
0                 25.0                        0.0                       25.0

                count    mean     std   min    25%   50%    75%   max
Overall             8  27.375  17.303  10.0  11.50  22.5  45.50  50.0
Lower_Outliers      0     NaN     NaN   NaN    NaN   NaN    NaN   NaN
Upper_Outliers      2  48.500   2.121  47.0  47.75  48.5  49.25  50.0

Histograms and boxplots for the NumericalFeature will be generated.
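
For a quick sanity check of the summary statistics, the same numbers can be derived with pandas alone. The sketch below assumes Tukey's 1.5 x IQR fences for flagging outliers; that rule is an assumption made here, and the package's own outlier rule may differ, so its outlier counts need not match this check.

```python
import pandas as pd

values = pd.Series([10, 12, 10, 22, 23, 45, 47, 50], name="NumericalFeature")

# Overall summary: sample standard deviation and linearly interpolated quartiles
summary = values.describe()

# Tukey fences (assumed rule): flag points more than 1.5 * IQR beyond the quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

lower_outliers = values[values < lower_fence]
upper_outliers = values[values > upper_fence]
outlier_pct = 100 * (len(lower_outliers) + len(upper_outliers)) / len(values)

print(summary)
print(lower_fence, upper_fence, outlier_pct)
```

With this sample the quartiles are 11.5 and 45.5, which puts the fences at -39.5 and 96.5.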

Example 3: Handling Missing Values

Scenario

You have a dataset with missing values and want to generate a summary.

Code

import pandas as pd
from suraj_datalab import analyze

# Sample DataFrame
data = {
    'Feature1': [1, 2, None, 4, 5],
    'Feature2': [None, 2, 3, 4, None],
    'Feature3': [1, None, 3, None, 5]
}
df = pd.DataFrame(data)

# Get missing values summary
missing_summary = analyze.missing_values(df)
print(missing_summary)

Expected Output

          Missing Count  Missing Percentage  Data Type
Feature1              1                20.0    float64
Feature2              2                40.0    float64
Feature3              2                40.0    float64
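
A comparable summary can be assembled directly from pandas, which is handy for verifying the numbers above. This is a sketch of the idea, not the package's code; the column labels simply mirror the expected output.

```python
import pandas as pd

df = pd.DataFrame({
    "Feature1": [1, 2, None, 4, 5],
    "Feature2": [None, 2, 3, 4, None],
    "Feature3": [1, None, 3, None, 5],
})

# isna() marks missing cells; sum() counts them, mean() gives the fraction per column
missing_summary = pd.DataFrame({
    "Missing Count": df.isna().sum(),
    "Missing Percentage": (df.isna().mean() * 100).round(1),
    "Data Type": df.dtypes.astype(str),
})
print(missing_summary)
```

The columns are float64 because pandas stores None in a numeric column as NaN.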

Example 4: Replacing Rare Categories

Scenario

You need to replace rare categories in your dataset with a specified value.

Code

import pandas as pd
from suraj_datalab.clean import RareCategoryReplacer

# Sample DataFrame
data = {
    'Category': ['A', 'B', 'C', 'A', 'B', 'A', 'C', 'D', 'E', 'F', 'A'],
}
df = pd.DataFrame(data)

# Define the columns to replace rare categories in
columns_to_replace = ['Category']

# Create an instance of the RareCategoryReplacer
replacer = RareCategoryReplacer(columns=columns_to_replace, proportion_threshold=0.2, replacement_value="Others")

# Fit and transform the dataset
df_transformed = replacer.fit_transform(df)
print(df_transformed)

Expected Output

    Category
0          A
1     Others
2     Others
3          A
4     Others
5          A
6     Others
7     Others
8     Others
9     Others
10         A
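
To see why each value is kept or replaced, it helps to compute the category proportions by hand. The sketch below uses plain pandas and assumes that a category counts as rare when its proportion is strictly below the threshold; that comparison is an assumption, not confirmed behavior of RareCategoryReplacer.

```python
import pandas as pd

categories = pd.Series(
    ["A", "B", "C", "A", "B", "A", "C", "D", "E", "F", "A"], name="Category"
)
threshold = 0.2

# Share of rows per category: A = 4/11, B = C = 2/11, D = E = F = 1/11
proportions = categories.value_counts(normalize=True)

# Assumed rule: a category is rare when its share is strictly below the threshold
rare = proportions[proportions < threshold].index
replaced = categories.where(~categories.isin(rare), "Others")

print(proportions.round(3).to_dict())
print(replaced.tolist())
```

B and C each cover 2 of 11 rows (about 0.18), so with a threshold of 0.2 both fall into the rare group alongside D, E, and F. Only A (about 0.36) survives.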

Example 5: Creating K-Folds

Scenario

You want to create K-Folds for cross-validation on a dataset.

Code

from suraj_datalab.fold_creator import create_kfolds

# Create K-Folds for a dataset
kfold_data = create_kfolds(file_path="your_dataset.csv", n_splits=5, shuffle=True, random_state=42)
print(kfold_data.head())

Expected Output

The result is a new CSV file with an additional kfold column that indicates the fold assignment for each row. Because the rows are shuffled, the fold numbers below are illustrative.

   Feature1  Feature2  kfold
0       ...       ...      0
1       ...       ...      1
2       ...       ...      4
3       ...       ...      3
4       ...       ...      2
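
Conceptually, the kfold column can be produced by shuffling the row positions and dealing them into n_splits groups. The following is a plain NumPy/pandas sketch of that idea, not the package's implementation; the small DataFrame stands in for your_dataset.csv and its columns are made up.

```python
import numpy as np
import pandas as pd

# Stand-in for pd.read_csv("your_dataset.csv"); the columns are hypothetical
df = pd.DataFrame({"Feature1": range(10), "Feature2": range(10, 20)})

n_splits = 5
rng = np.random.default_rng(42)

# Shuffle row positions, then split them into n_splits nearly equal groups
shuffled = rng.permutation(len(df))
df["kfold"] = -1
kfold_col = df.columns.get_loc("kfold")
for fold, positions in enumerate(np.array_split(shuffled, n_splits)):
    df.iloc[positions, kfold_col] = fold

print(df["kfold"].value_counts().sort_index().to_dict())
```

Every row ends up in exactly one fold, and fold sizes differ by at most one row.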

Example 6: Stratified K-Folds for Classification

Scenario

You want to create stratified K-Folds for a classification problem to ensure balanced folds.

Code

from suraj_datalab.fold_creator import create_classification_kfolds

# Create Stratified K-Folds for a classification dataset
stratified_classification_data = create_classification_kfolds(
    file_path="your_dataset.csv", target_column="target", n_splits=5, random_state=42
)
print(stratified_classification_data.head())

Expected Output

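The result is a new CSV file with an additional kfold column that indicates the fold assignment for each row, stratified by the target variable so that every fold has a similar class balance. The fold numbers below are illustrative.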

   Feature1  Feature2  target  kfold
0       ...       ...      0      1
1       ...       ...      1      3
2       ...       ...      0      4
3       ...       ...      1      0
4       ...       ...      1      2
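
Stratification means that each fold receives roughly the same share of every class. One simple way to get that, sketched here in plain pandas and NumPy, is to shuffle the rows of each class separately and deal them round-robin across the folds. This illustrates the idea only; it is neither scikit-learn's StratifiedKFold nor necessarily what the package does, and the toy data is made up.

```python
import numpy as np
import pandas as pd

# Imbalanced toy data: 15 rows of class 0 and 5 rows of class 1 (hypothetical)
df = pd.DataFrame({"Feature1": range(20), "target": [0] * 15 + [1] * 5})

n_splits = 5
rng = np.random.default_rng(42)

df["kfold"] = -1
kfold_col = df.columns.get_loc("kfold")
for _, group in df.groupby("target"):
    # Shuffle the row positions of one class, then deal them round-robin across folds
    positions = rng.permutation(df.index.get_indexer(group.index))
    df.iloc[positions, kfold_col] = np.arange(len(positions)) % n_splits

# Rows are folds, columns are classes
fold_table = pd.crosstab(df["kfold"], df["target"])
print(fold_table)
```

Each fold here holds three rows of class 0 and one row of class 1, matching the 3:1 ratio of the full dataset.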

Example 7: Stratified K-Folds for Regression with Sturges' Binning

Scenario

You want to create stratified K-Folds for a regression problem using Sturges' binning method.

Code

from suraj_datalab.fold_creator import create_regression_kfolds

# Create Stratified K-Folds for a regression dataset using 'sturges' binning method
stratified_regression_data = create_regression_kfolds(
    file_path="your_dataset.csv", target_column="target", n_splits=5, binning_method='sturges', random_state=42
)
print(stratified_regression_data.head())

Expected Output

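The result is a new CSV file with an additional kfold column that indicates the fold assignment for each row. The continuous target is first grouped into bins using Sturges' method, and the folds are stratified on those bins. The fold numbers below are illustrative.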

   Feature1  Feature2  target  kfold
0       ...       ...    1.5      1
1       ...       ...    2.1      3
2       ...       ...    3.4      0
3       ...       ...    2.9      4
4       ...       ...    3.8      2
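
Stratifying a regression target usually means turning it into discrete bins first and then stratifying on the bin labels. Sturges' rule sets the number of bins to 1 + log2(n) for n rows. The sketch below shows only the binning step with pd.cut; rounding the bin count up and using equal-width bins are assumptions, and the synthetic target is made up.

```python
import numpy as np
import pandas as pd

# Synthetic continuous target with 64 rows (hypothetical data)
rng = np.random.default_rng(42)
target = pd.Series(rng.normal(loc=3.0, scale=1.0, size=64), name="target")

# Sturges' rule: k = 1 + log2(n); rounding up is an assumption made here
n_bins = int(np.ceil(1 + np.log2(len(target))))

# Equal-width bins; the integer bin label then acts as a class for stratification
bins = pd.cut(target, bins=n_bins, labels=False)

print(n_bins)
print(bins.value_counts().sort_index().to_dict())
```

With 64 rows the rule gives 7 bins. Bins at the tails of the distribution can be sparse, so it is worth checking their sizes against n_splits before stratifying.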

Learn More

For detailed usage instructions, please visit the Usage Guide.

For more information about my work and other projects, or to get in touch, visit my personal website.

If you have any feedback or suggestions, feel free to open an issue on GitHub.