Association Analysis

Tutorial 3 of 4

1. Introduction

Goal of the Tutorial

This tutorial aims to explain association rules and how to use them for discovering interesting relationships in your dataset.

Learning Outcomes

By the end of this tutorial, you should be able to:
- Understand the concept of association rules.
- Implement association rule mining using Python.
- Interpret the output of association rule mining.

Prerequisites

  • Basic knowledge of Python programming.
  • Familiarity with data analysis libraries like pandas and numpy.
  • Basic understanding of data mining concepts.

2. Step-by-Step Guide

Understanding Association Rules

Association rule mining is widely used to analyze retail basket or transaction data. Its goal is to identify strong relationships between items in large databases, judged by measures of interestingness such as support, confidence, and lift.

Association rule learning rests on two core measures: support and confidence. Support indicates how frequently the items in a rule appear together in the dataset, while confidence indicates how often the rule holds: given that a transaction contains the antecedent items, how often it also contains the consequent items.

For instance, suppose we have a supermarket dataset and want to find a rule predicting that if a customer buys onions and potatoes, they will also buy burger patties. The support of this rule is the number of transactions containing onions, potatoes, and patties divided by the total number of transactions, while the confidence is the number of transactions containing onions, potatoes, and patties divided by the number of transactions containing onions and potatoes.
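
To make the arithmetic concrete, here is a quick sketch with made-up counts (the numbers are purely illustrative):

# Hypothetical counts for the supermarket example
n_transactions = 1000         # total number of transactions
n_onions_potatoes = 100       # transactions containing onions and potatoes
n_all_three = 40              # transactions containing onions, potatoes, and patties

support = n_all_three / n_transactions        # 0.04: the rule covers 4% of transactions
confidence = n_all_three / n_onions_potatoes  # 0.40: 40% of onion-and-potato buyers also buy patties
print(support, confidence)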

Best Practices and Tips

  • Use your domain knowledge to set the minimum thresholds for support and confidence.
  • Association rules do not imply causality.
  • Association rules can be misleading if not validated with other statistical measures.

3. Code Examples

Example 1: Basic Implementation using the mlxtend library

# Import necessary libraries
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

# Load your data: apriori expects a one-hot encoded DataFrame with one row
# per transaction and one boolean column per item
# data = ...

# Generate frequent itemsets
frequent_itemsets = apriori(data, min_support=0.1, use_colnames=True)

# Generate association rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

# Print rules
print(rules)

In the above code:
- We first import the necessary functions from the mlxtend library.
- We then load our dataset (data).
- We generate frequent itemsets using the apriori function, specifying a minimum support of 0.1 (i.e., itemsets that appear in at least 10% of transactions).
- We generate association rules from the frequent itemsets, using lift as our metric with a minimum threshold of 1; a lift above 1 means the antecedent and consequent co-occur more often than they would if they were independent.
- Finally, we print the generated rules.

The output will be a DataFrame showing the antecedents, consequents, and the computed support, confidence, and lift for each rule (mlxtend also reports additional measures such as leverage and conviction).
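
If your raw data is a list of transactions, with each transaction a list of item names, mlxtend's TransactionEncoder can produce the one-hot encoding that apriori expects. A minimal sketch, using made-up grocery transactions:

# Prepare `data` from raw transactions (illustrative items only)
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

transactions = [
    ['onions', 'potatoes', 'burger patties'],
    ['onions', 'potatoes'],
    ['milk', 'bread'],
    ['onions', 'potatoes', 'burger patties', 'milk'],
]

te = TransactionEncoder()
data = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)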

Example 2: Filtering Rules

# Filter rules by confidence and lift
filtered_rules = rules[(rules['confidence'] > 0.7) & (rules['lift'] > 1.2)]

# Print filtered rules
print(filtered_rules)

In this example, we filter the previously generated rules by confidence and lift, choosing only those with confidence greater than 0.7 and lift greater than 1.2.
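
To inspect the strongest rules first, you can sort the filtered DataFrame using standard pandas operations, for example:

# Sort the filtered rules by lift, strongest first
top_rules = filtered_rules.sort_values('lift', ascending=False)
print(top_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head())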

4. Summary

In this tutorial, we learned about association rules, their measures, and how to implement association rule mining in Python using the mlxtend library.

To further improve your skills in this area, consider exploring different datasets and experimenting with the parameters of the apriori and association_rules functions, as in the sketch below.
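
For example, a quick sweep over support thresholds shows how sensitive the number of frequent itemsets is to that parameter (this assumes data is the one-hot encoded DataFrame from Example 1):

# Sweep min_support to see how the number of frequent itemsets changes
from mlxtend.frequent_patterns import apriori

for min_sup in [0.05, 0.1, 0.2, 0.4]:
    n_itemsets = len(apriori(data, min_support=min_sup, use_colnames=True))
    print(f"min_support={min_sup}: {n_itemsets} frequent itemsets")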

5. Practice Exercises

Exercise 1: Use the mlxtend library to perform association rule mining on a dataset of your choice with a minimum support of 0.2 and a minimum confidence of 0.7.

Exercise 2: Try filtering the rules from Exercise 1 by lift, keeping only those with a lift greater than 1.5.

Solutions and Explanations: The solutions to these exercises will depend on the specific dataset you choose. Remember, the key steps are to generate frequent itemsets using the apriori function, generate association rules using the association_rules function, and filter the rules using logical indexing.
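
As a reference, here is a skeleton covering both exercises; the dataset loading is left open, and the minimum confidence for Exercise 1 can be enforced directly by association_rules:

# Skeleton for Exercises 1 and 2 (supply your own one-hot encoded DataFrame)
from mlxtend.frequent_patterns import apriori, association_rules

# data = ...

# Exercise 1: minimum support of 0.2 and minimum confidence of 0.7
itemsets = apriori(data, min_support=0.2, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)

# Exercise 2: keep only rules with lift greater than 1.5
high_lift_rules = rules[rules['lift'] > 1.5]
print(high_lift_rules)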

For further practice, consider exploring different measures of interestingness, such as leverage and conviction, and how they influence the generated rules.
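
Both measures are already computed in the mlxtend rules DataFrame, so they can be filtered with the same logical-indexing pattern, for example:

# Leverage and conviction are columns in the rules DataFrame
strong_rules = rules[(rules['leverage'] > 0) & (rules['conviction'] > 1.2)]
print(strong_rules)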