Common Pitfalls in Supervised Learning

Tutorial 5 of 5

1. Introduction

In this tutorial, we explore common pitfalls in supervised learning, the machine learning paradigm in which a model is trained on labeled examples. Understanding these pitfalls will help you avoid common mistakes, improve your models, and achieve better results.

You will learn about issues such as overfitting, underfitting, data leakage, and biased data, along with practical strategies to mitigate them.

This tutorial assumes a basic understanding of machine learning concepts and Python programming. Familiarity with a library such as scikit-learn is helpful but not required.

2. Step-by-Step Guide

2.1. Overfitting

Overfitting occurs when your model learns the training data too well, memorizing noise and outliers rather than the underlying pattern. This leads to poor performance on unseen data.

To avoid overfitting:

  • Use simpler models with fewer parameters.
  • Regularize your models.
  • Use techniques like cross-validation.
  • Gather more training data.
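As a sketch of the cross-validation point above, scikit-learn's cross_val_score can expose the gap that a training-set score hides. The dataset here is synthetic and purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, illustrative dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

clf = DecisionTreeClassifier(random_state=42)

# Accuracy on the data the tree was trained on (optimistic)
clf.fit(X, y)
train_acc = clf.score(X, y)

# 5-fold cross-validation scores each fold on held-out data,
# giving a more honest estimate of generalization
cv_acc = cross_val_score(clf, X, y, cv=5).mean()

print(f'Training accuracy: {train_acc:.3f}')
print(f'Cross-validated accuracy: {cv_acc:.3f}')
```

The unconstrained tree memorizes its training data, so the cross-validated score comes out noticeably lower, which is exactly the signal you want to see before deploying a model.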

2.2. Underfitting

Underfitting occurs when your model is too simple to capture the underlying structure of the data, so it performs poorly even on the training set.

To avoid underfitting:

  • Use more complex models.
  • Add more features.
  • Reduce regularization.
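The "add more features" point can be sketched with a deliberately simple example. A plain linear model underfits quadratic data, while adding polynomial features (an illustrative choice of extra capacity) fixes it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic nonlinear data: y = x^2 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

# A straight line cannot capture the quadratic relationship (underfit)
linear = LinearRegression().fit(X, y)

# Adding squared features gives the model enough capacity
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print(f'Linear R^2:     {linear.score(X, y):.3f}')
print(f'Polynomial R^2: {poly.score(X, y):.3f}')
```

On this data the linear model's R^2 hovers near zero while the polynomial model fits almost perfectly, illustrating that the cure for underfitting is more capacity, not more regularization.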

2.3. Data Leakage

Data leakage happens when your model is inadvertently exposed to information from the validation or test data. This usually leads to overly optimistic performance estimates.

To avoid data leakage:

  • Fit preprocessing steps (scalers, encoders, imputers) on the training data only, then apply them to the validation and test sets.
  • Split your data into training, validation, and test sets at the beginning of your workflow.
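A minimal leakage-safe sketch of the two points above, assuming a StandardScaler preprocessing step: splitting first and putting the scaler inside a Pipeline guarantees its statistics are computed from training data only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split FIRST, before any preprocessing touches the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# When the pipeline is fitted, the scaler sees only X_train;
# at scoring time it merely transforms X_test with those statistics
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

print(f'Test accuracy: {pipe.score(X_test, y_test):.3f}')
```

The common mistake this avoids is calling `scaler.fit(X)` on the full dataset before splitting, which leaks test-set statistics into training.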

2.4. Biased Data

Biased data can lead to models that unfairly favor certain outcomes or groups.

To avoid biased data:

  • Ensure your data is representative of the problem space.
  • Evaluate model performance across relevant subgroups, and refresh your data when gaps appear.
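One way to sketch the subgroup check is to compare accuracy per group. The group labels below are hypothetical and randomly generated for illustration; in practice they would be a real attribute of your data (e.g. a demographic field):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hypothetical group labels, random here purely for illustration
groups = np.random.default_rng(0).integers(0, 2, size=1000)

X_train, X_test, y_train, y_test, g_train, g_test = train_test_split(
    X, y, groups, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)

# A large accuracy gap between groups can signal biased or
# unrepresentative training data
for g in (0, 1):
    mask = g_test == g
    print(f'Group {g} accuracy: {accuracy_score(y_test[mask], pred[mask]):.3f}')
```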

3. Code Examples

3.1. Overfitting

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Create a simple binary classification dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a decision tree model (prone to overfitting)
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Evaluate the model
train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))

print(f'Training Accuracy: {train_acc*100:.2f}%')
print(f'Test Accuracy: {test_acc*100:.2f}%')

This example illustrates overfitting: the unconstrained decision tree typically reaches near-perfect accuracy on the training data while scoring noticeably lower on the test set.
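As a follow-up sketch, constraining the same tree closes the train-test gap. The max_depth=4 setting here is an illustrative choice, not a tuned value; in practice you would select it via cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Limiting depth regularizes the tree: it can no longer
# memorize individual noisy training points
clf = DecisionTreeClassifier(max_depth=4, random_state=42)
clf.fit(X_train, y_train)

train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))

print(f'Training Accuracy: {train_acc*100:.2f}%')
print(f'Test Accuracy: {test_acc*100:.2f}%')
```

Compared with the unconstrained tree above, the training accuracy drops but the two scores sit much closer together, which is the signature of a better-generalizing model.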

4. Summary

In this tutorial, we've covered common pitfalls in supervised learning, including overfitting, underfitting, data leakage, and biased data. We've also discussed strategies to mitigate these issues.

As next steps, dig deeper into each of these topics and practice identifying and addressing them in real-world scenarios. Additional resources include the scikit-learn documentation and tutorials on Towards Data Science.

5. Practice Exercises

  1. Exercise 1: Train a logistic regression model on the same dataset used above and determine whether it overfits or underfits.

  2. Exercise 2: Create a pipeline that includes data preprocessing steps and a model. Make sure there is no data leakage.

  3. Exercise 3: Evaluate your model for possible bias.

Remember, practice is key in mastering machine learning. Keep exploring and learning!