Advanced Data Wrangling in Python

Tutorial 5 of 5

1. Introduction

This tutorial covers advanced data wrangling techniques in Python. Data wrangling is the process of cleaning and unifying messy, complex data sets so that they are easy to access and analyze. We will use pandas, an open-source data analysis and manipulation library, to handle our data.

By the end of this tutorial, you will learn:
- How to handle missing and duplicate data
- How to apply functions to transform data
- How to reshape and pivot data frames

Prerequisites: A basic understanding of Python programming and familiarity with the pandas library. If you are new to Python or pandas, work through a beginner tutorial first.

2. Step-by-Step Guide

Handling Missing Data

Missing data is a common problem in data sets. Pandas provides several methods to handle it: isnull() and notnull() detect missing values, dropna() removes them, and fillna() replaces them.

# Import pandas and NumPy (np.nan is used below to mark missing values)
import numpy as np
import pandas as pd

# Create a dataframe
df = pd.DataFrame({
   'A': [1, 2, np.nan],
   'B': [5, np.nan, np.nan],
   'C': [1, 2, 3]
})

# Check for missing values: returns a dataframe of booleans,
# True wherever a value is missing
df.isnull()
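
The other methods named above can be sketched on the same small frame. This is a minimal illustration rather than a recommended strategy: the fill value of 0 and the choice of column means are assumptions made for the example, and the right choice depends on your data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': [5, np.nan, np.nan],
    'C': [1, 2, 3]
})

# Count missing values per column
missing_per_column = df.isnull().sum()

# Drop every row that contains at least one missing value
rows_dropped = df.dropna()

# Keep only the columns that have no missing values
complete_columns = df.dropna(axis=1)

# Replace missing values with a constant (0 is an arbitrary choice)
filled_constant = df.fillna(0)

# Replace missing values with the mean of each column
filled_mean = df.fillna(df.mean())
```

None of these calls modifies df; each returns a new dataframe, so assign the result if you want to keep it.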

Removing Duplicates

Duplicate rows can skew your analysis. Use drop_duplicates() to remove them. By default, a row counts as a duplicate only if it matches an earlier row in every column.

# Create a dataframe with duplicates
df = pd.DataFrame({
   'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
   'C': ['small', 'large', 'large', 'small', 'small', 'large', 'large', 'small'],
   'D': [1, 2, 2, 3, 3, 4, 5, 6],
   'E': [2, 4, 5, 5, 6, 6, 8, 9]
})

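# Remove duplicate rows. No two rows above match in every column, so
# compare only columns A and B; the first occurrence of each pair is kept
df.drop_duplicates(subset=['A', 'B'])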
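
To see which rows would be removed before dropping anything, duplicated() returns a boolean mask. The sketch below uses a small hypothetical frame (the column names and values are invented for illustration) and also shows the subset and keep parameters of drop_duplicates().

```python
import pandas as pd

# Hypothetical data: the row at index 2 repeats the row at index 0 exactly
df = pd.DataFrame({
    'name': ['foo', 'bar', 'foo', 'bar'],
    'category': ['small', 'large', 'small', 'small'],
    'value': [1, 2, 1, 3]
})

# Flag rows that repeat an earlier row in every column
is_duplicate = df.duplicated()

# Drop fully duplicated rows, keeping the first occurrence
unique_rows = df.drop_duplicates()

# Treat rows as duplicates when only 'name' matches,
# and keep the last occurrence instead of the first
last_per_name = df.drop_duplicates(subset=['name'], keep='last')
```

The surviving rows keep their original index labels; chain reset_index(drop=True) if you need a fresh 0..n-1 index.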

3. Code Examples

Applying Functions

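The apply family of methods transforms data by running a function over it. Here we use applymap(), which calls a function on every element of a data frame. Since pandas 2.1, applymap() is deprecated in favor of DataFrame.map(), which behaves the same way.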

# Create a dataframe
df = pd.DataFrame({
   'A': [1, 2, 3, 4, 5],
   'B': [10, 20, 30, 40, 50],
   'C': [100, 200, 300, 400, 500]
})

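# Define a function that squares a single value
def square(x):
    return x ** 2

# Apply the function to every element of the dataframe
# (on pandas 2.1 or later, df.map(square) is the non-deprecated spelling)
df = df.applymap(square)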
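
Element-wise application is only one option. As a rough sketch of the alternatives, the example below contrasts vectorized arithmetic, apply() over columns and rows, and Series.map() on a single column. The lambdas are illustrative choices, not part of the tutorial's data.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 200, 300, 400, 500]
})

# Vectorized arithmetic squares every element without a Python-level loop
squared = df ** 2

# apply() with the default axis=0 passes each column to the function as a Series
column_ranges = df.apply(lambda col: col.max() - col.min())

# apply() with axis=1 passes each row instead
row_totals = df.apply(lambda row: row['A'] + row['B'] + row['C'], axis=1)

# Series.map() transforms a single column element by element
a_labels = df['A'].map(lambda x: 'even' if x % 2 == 0 else 'odd')
```

For simple arithmetic such as squaring, the vectorized form is usually both shorter and faster than an element-wise function call.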

Reshaping Data

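Use the melt() method to reshape a data frame from wide to long format. Each melted column is unpivoted into rows of a variable column (holding the original column name) and a value column (holding the cell contents); columns passed as id_vars are kept as identifiers.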

# Create a dataframe
df = pd.DataFrame({
   'A': ['John', 'Boby', 'Mina'],
   'B': ['Masters', 'Graduate', 'Graduate'],
   'C': [27, 23, 21]
})

# Reshape the data: keep column A as the identifier and melt B and C
# into 'variable' and 'value' columns
df.melt(id_vars='A')
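
Pivoting is the reverse operation: it spreads a long frame back into wide format. The sketch below melts and then pivots the same values, using hypothetical descriptive column names (name, degree, age, attribute, value) that are assumptions for readability and do not appear in the example above.

```python
import pandas as pd

# Same values as above, with hypothetical descriptive column names
df = pd.DataFrame({
    'name': ['John', 'Boby', 'Mina'],
    'degree': ['Masters', 'Graduate', 'Graduate'],
    'age': [27, 23, 21]
})

# Wide to long: keep 'name' as the identifier and stack the other columns
long_df = df.melt(id_vars='name', var_name='attribute', value_name='value')

# Long to wide: pivot() reverses the melt, producing one column per attribute
wide_df = long_df.pivot(index='name', columns='attribute', values='value')
```

pivot() requires each index and column pair to be unique. When a pair can repeat, pivot_table() with an aggregation function is the usual alternative.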

4. Summary

We covered advanced data wrangling techniques such as handling missing and duplicate data, applying functions to transform data, and reshaping data frames.

For further learning, consider exploring how to merge and join data frames, handle categorical data, and apply advanced data filtering.

5. Practice Exercises

  1. Create a DataFrame with some missing values and try different methods of handling them.
  2. Remove duplicate data from a DataFrame.
  3. Apply a function to transform a DataFrame.
  4. Reshape a DataFrame using the melt function.

Try to solve these exercises on your own. They will help you understand and remember the techniques better. Happy coding!