This tutorial covers advanced data wrangling techniques in Python. Data wrangling is the process of cleaning and unifying messy, complex data sets so that they are easy to access and analyze. We will use pandas, an open-source data analysis and manipulation library, to handle our data.
By the end of this tutorial, you will learn:
- How to handle missing and duplicate data
- How to apply functions to transform data
- How to reshape and pivot data frames
Prerequisites: Basic understanding of Python programming and familiarity with the Pandas library. If you are new to Python or Pandas, you might want to check beginner tutorials first.
Missing data is a common problem in data sets. Pandas provides several methods to handle it, such as isnull(), notnull(), dropna(), and fillna().
# Import pandas and NumPy (np.nan marks the missing values below)
import numpy as np
import pandas as pd
# Create a dataframe
df = pd.DataFrame({
'A': [1, 2, np.nan],
'B': [5, np.nan, np.nan],
'C': [1, 2, 3]
})
# Check for missing values
df.isnull()
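The other methods named above can be sketched in the same way. This is a minimal example on the same three-column frame; the fill value 0, the threshold of 2, and the variable names are illustrative assumptions rather than recommendations.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': [5, np.nan, np.nan],
    'C': [1, 2, 3],
})

# Count missing values per column
missing_per_column = df.isnull().sum()

# Drop every row that contains at least one missing value
complete_rows = df.dropna()

# Replace missing values with a constant (0 is an arbitrary choice here)
filled = df.fillna(0)

# Keep only rows that have at least two non-missing values
mostly_complete = df.dropna(thresh=2)
```

Each call returns a new DataFrame and leaves df unchanged, so the result has to be assigned if you want to keep it.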
Duplicate data can skew your analysis. Use drop_duplicates() to remove duplicate rows.
# Create a dataframe with duplicates
df = pd.DataFrame({
'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
'C': ['small', 'large', 'large', 'small', 'small', 'large', 'large', 'small'],
'D': [1, 2, 2, 3, 3, 4, 5, 6],
'E': [2, 4, 5, 5, 6, 6, 8, 9]
})
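# Remove rows that are identical across every column (returns a new dataframe).
# No two rows above match in all five columns, so this call keeps all eight rows.
df.drop_duplicates()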
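With no arguments, drop_duplicates() only removes rows that match in every column. When a duplicate is defined by a few key columns, the subset and keep parameters may be more useful. The sketch below assumes, purely for illustration, that columns 'A' and 'B' together identify a record.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'D': [1, 2, 2, 3, 3, 4, 5, 6],
})

# Flag rows whose ('A', 'B') pair already appeared in an earlier row
is_repeat = df.duplicated(subset=['A', 'B'])

# Keep the first row for each ('A', 'B') pair
first_per_pair = df.drop_duplicates(subset=['A', 'B'])

# Keep the last row for each ('A', 'B') pair instead
last_per_pair = df.drop_duplicates(subset=['A', 'B'], keep='last')
```

Checking duplicated() before dropping anything is a cheap way to see how many rows would be removed and whether the chosen key columns make sense.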
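Applying a function is a powerful way to transform data. Here we use applymap(), which applies a function to every element of a dataframe. In pandas 2.1 and later, applymap() is deprecated in favor of DataFrame.map().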
# Create a dataframe
df = pd.DataFrame({
'A': [1, 2, 3, 4, 5],
'B': [10, 20, 30, 40, 50],
'C': [100, 200, 300, 400, 500]
})
# Define a function that squares a single value
def square(x):
    return x ** 2
# Apply the function to every element of the dataframe
df = df.applymap(square)
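Element-wise application is only one option. The sketch below also shows apply() on a whole dataframe, where the function receives each column, and apply() on a single column. The hasattr check is an assumption about how to support both older and newer pandas releases, and the column name 'A_squared' is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
})

def square(x):
    return x ** 2

# Element-wise: use DataFrame.map where it exists (pandas 2.1+),
# otherwise fall back to applymap
elementwise = df.map if hasattr(df, 'map') else df.applymap
squared = elementwise(square)

# Column-wise: apply() passes each column to the function as a Series
column_ranges = df.apply(lambda col: col.max() - col.min())

# Single column: Series.apply() transforms the values of one column
df['A_squared'] = df['A'].apply(square)
```

For simple arithmetic such as squaring, the vectorized expression df ** 2 gives the same result and is usually faster than calling a Python function on each element.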
Use the melt() function to reshape data from wide format to long format.
# Create a dataframe
df = pd.DataFrame({
'A': ['John', 'Boby', 'Mina'],
'B': ['Masters', 'Graduate', 'Graduate'],
'C': [27, 23, 21]
})
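# Reshape the data from wide to long format; with no arguments,
# every column is unpivoted into 'variable' and 'value' rows
df.melt()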
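In practice, melt() is usually called with an identifier column, and pivot() reverses the operation. The following sketch assumes column 'A' holds a unique name per row; the labels 'field' and 'value' are hypothetical names chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['John', 'Boby', 'Mina'],
    'B': ['Masters', 'Graduate', 'Graduate'],
    'C': [27, 23, 21],
})

# Wide to long: keep 'A' as the identifier, stack 'B' and 'C' into rows
long_df = df.melt(id_vars='A', value_vars=['B', 'C'],
                  var_name='field', value_name='value')

# Long to wide: pivot the stacked rows back into one column per field
wide_df = long_df.pivot(index='A', columns='field', values='value')
```

Note that pivot() raises an error if an index and column pair occurs more than once; pivot_table() with an aggregation function is the usual alternative in that case.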
We covered advanced data wrangling techniques such as handling missing and duplicate data, applying functions to transform data, and reshaping data frames.
For further learning, consider exploring how to merge and join data frames, handle categorical data, and filter data with more advanced conditions.
Try to solve these exercises on your own. They will help you understand and remember the techniques better. Happy coding!