Usage

This document provides a guide on how to use the suraj_datalab package for data analysis.

Installation

Ensure that you have all the necessary dependencies installed by running:

pip install -r requirements.txt

Importing the Package

To use the suraj_datalab package in your Python script, you need to import it as follows:

from suraj_datalab import analyze, clean, fold_creator

Analyzing Categorical Features

The categorical_feature function allows you to analyze the distribution of a categorical feature with respect to a target variable. This function is useful for understanding how different categories are distributed and how they relate to the target variable.

Example Usage

import pandas as pd
from suraj_datalab import analyze

# Load your dataset
df = pd.read_csv('your_dataset.csv')

# Analyze a categorical feature
distribution = analyze.categorical_feature(df, 'feature_name', 'target_name')
print(distribution)

Output

The output is a DataFrame showing:

Total Count
Total Percentage
Percentages for each target class relative to the total
Percentages of each target class within the feature category

A plot is also generated, showing the distribution of the feature by the target variable.

Analyzing Numerical Features

The numerical_feature function allows you to analyze the distribution of a numerical feature, with optional grouping by a target variable. This function provides detailed statistics, including outlier detection.

Example Usage

import pandas as pd
from suraj_datalab import analyze

# Load your dataset
df = pd.read_csv('your_dataset.csv')

# Analyze a numerical feature
outliers_df, summary_df = analyze.numerical_feature(df, 'numerical_feature_name', 'target_name')
print(outliers_df)
print(summary_df)

Output

The function returns:

outliers_df: A DataFrame containing the percentage of outliers.
summary_df: A DataFrame with overall statistics, as well as statistics for lower and upper outliers.

A histogram and boxplot are also generated to visualize the distribution.

Missing Values Summary

The missing_values function generates a summary of missing values in the DataFrame.

Example Usage

import pandas as pd
from suraj_datalab import analyze

# Load your dataset
df = pd.read_csv('your_dataset.csv')

# Get missing values summary
missing_summary = analyze.missing_values(df)
print(missing_summary)

Output

The function returns a DataFrame with the following information:

Count of missing values
Percentage of missing values
Data type of each column that has missing values

Handling Rare Categories in Categorical Features

The RareCategoryReplacer class in clean.py is used to replace rare categories in specified columns of a DataFrame with a replacement value.

Example Usage

import pandas as pd
from suraj_datalab.clean import RareCategoryReplacer

# Load your dataset
df = pd.read_csv('your_dataset.csv')

# Define the columns to replace rare categories in
columns_to_replace = ['feature_1', 'feature_2']

# Create an instance of the RareCategoryReplacer
replacer = RareCategoryReplacer(columns=columns_to_replace, proportion_threshold=0.02, replacement_value="Others")

# Fit and transform the dataset
df_transformed = replacer.fit_transform(df)
print(df_transformed)

Output

The output is a transformed DataFrame where rare categories in the specified columns are replaced with the value "Others".

Creating K-Folds for Cross-Validation

The fold_creator.py script provides functions to create K-Folds for cross-validation, including standard K-Folds, stratified K-Folds for classification, and stratified K-Folds for regression tasks.

Example Usage for K-Folds

from suraj_datalab.fold_creator import create_kfolds

# Create K-Folds for a dataset
kfold_data = create_kfolds(file_path="your_dataset.csv", n_splits=5, shuffle=True, random_state=42)
print(kfold_data)

Example Usage for Stratified K-Folds (Classification)

from suraj_datalab.fold_creator import create_classification_kfolds

# Create Stratified K-Folds for a classification dataset
stratified_classification_data = create_classification_kfolds(
    file_path="your_dataset.csv", target_column="target", n_splits=5, random_state=42
)
print(stratified_classification_data)

Example Usage for Stratified K-Folds (Regression)

from suraj_datalab.fold_creator import create_regression_kfolds

# Create Stratified K-Folds for a regression dataset using 'sturges' binning method
stratified_regression_data = create_regression_kfolds(
    file_path="your_dataset.csv", target_column="target", n_splits=5, binning_method='sturges', random_state=42
)
print(stratified_regression_data)

Output

Each of these functions will return a DataFrame with an additional 'kfold' column, indicating the fold assignment for each row in the dataset. The create_regression_kfolds function offers several methods for binning the target variable before applying stratified K-Folds, such as 'sturges', 'quantile', 'kmeans', and custom binning.

Cross-references

For a general overview and more information about the project, please visit the Project Overview.

Additional Resources

For more details about my work and other projects, visit my personal website.

If you have any questions or run into issues, please check the GitHub repository for additional help or to open an issue.