This tutorial walks you through performing regression analysis in Python. By the end, you will have a basic understanding of regression analysis and how to implement it with three widely used Python libraries: NumPy, Pandas, and Scikit-learn.
Prerequisites: Basic knowledge of Python programming. Some familiarity with statistics is helpful.
Regression analysis is a predictive modelling technique that investigates the relationship between a dependent (target) variable and one or more independent (predictor) variables. It is used for forecasting, time series modelling, and estimating how the target changes as the predictors change. Note that a fitted regression describes an association; on its own it does not establish a causal effect.
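For the one-predictor case used later in this tutorial, the relationship can be written as a straight line plus an error term. The symbols below ($\beta_0$ for the intercept, $\beta_1$ for the slope, $\varepsilon$ for the error) are notation introduced here for illustration. Ordinary least squares, the method behind Scikit-learn's LinearRegression, picks the line that minimizes the sum of squared residuals:

```latex
y = \beta_0 + \beta_1 x + \varepsilon,
\qquad
(\hat{\beta}_0, \hat{\beta}_1)
  = \arg\min_{\beta_0,\, \beta_1} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2
```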
First of all, we need to install the necessary libraries. You can do this with pip:
pip install numpy pandas matplotlib scikit-learn seaborn
After installation, you can import these libraries as:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import seaborn as seabornInstance
import matplotlib.pyplot as plt
Let's perform a simple linear regression on a small dataset that contains two columns: "Area" and "Price".
# Define simple data
area = [1.2, 2.4, 3.5, 4.6, 5.7]
price = [150, 220, 340, 470, 560]
# Convert to pandas DataFrame
data = pd.DataFrame({'Area': area, 'Price': price})
# Show data
print(data)
The expected output:
   Area  Price
0   1.2    150
1   2.4    220
2   3.5    340
3   4.6    470
4   5.7    560
We can use seaborn to visualize our data.
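plt.figure(figsize=(6, 4))
# histplot replaces distplot, which is deprecated since seaborn 0.11
seabornInstance.histplot(data['Area'], kde=True)
plt.tight_layout()  # adjust the layout after the plot has been drawn
plt.show()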
The output will be a histogram representing the 'Area' column.
The next step is to divide the data into "attributes" and "labels". Attributes are the independent variables while labels are dependent variables whose values are to be predicted. In our dataset, we only have two columns. We want to predict the Price depending upon the Area recorded. Therefore our attribute set will consist of the "Area" column, and the label will be the "Price" column.
X = data['Area'].values.reshape(-1,1)
y = data['Price'].values.reshape(-1,1)
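The reshape(-1, 1) call is there because Scikit-learn estimators expect the attribute matrix X to be two-dimensional, with one row per sample and one column per feature. A minimal sketch of what the reshape does, rebuilding the same toy data so the snippet runs on its own (the name `flat` is introduced here for illustration):

```python
import pandas as pd

# Same toy data as in the tutorial
data = pd.DataFrame({'Area': [1.2, 2.4, 3.5, 4.6, 5.7],
                     'Price': [150, 220, 340, 470, 560]})

flat = data['Area'].values               # 1-D array, one value per row
X = flat.reshape(-1, 1)                  # 2-D array: 5 samples, 1 feature
y = data['Price'].values.reshape(-1, 1)  # 2-D array: 5 samples, 1 target

print(flat.shape, X.shape, y.shape)      # (5,) (5, 1) (5, 1)
```

The -1 tells NumPy to infer the number of rows from the length of the array, so the same line works for a dataset of any size.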
Next, we assign 80% of the data to the training set and the remaining 20% to the test set using the code below.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
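It can help to check how many rows actually land in each set. The sketch below repeats the split on the same five-row toy data and prints the resulting shapes; it is only a sanity check, not part of the modelling itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same toy data as in the tutorial, already in 2-D form
X = np.array([1.2, 2.4, 3.5, 4.6, 5.7]).reshape(-1, 1)
y = np.array([150, 220, 340, 470, 560]).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)  # (4, 1) (1, 1)
```

With only five rows, 20% amounts to a single test row, so any evaluation on this toy dataset is illustrative at best. Fixing random_state makes the split reproducible between runs.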
We have split our data into training and testing sets, and now is finally the time to train our algorithm.
regressor = LinearRegression()
regressor.fit(X_train, y_train)  # train the model on the training set
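Training a linear regression means finding the intercept and slope of the best-fitting line, and the fitted model exposes them as intercept_ and coef_. The following sketch reads them out and, as a cross-check that is not part of the original walkthrough, compares them with a degree-1 least-squares fit from np.polyfit on the same training rows. The names `intercept`, `slope`, `slope_np`, and `intercept_np` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same toy data and split as in the tutorial
X = np.array([1.2, 2.4, 3.5, 4.6, 5.7]).reshape(-1, 1)
y = np.array([150, 220, 340, 470, 560]).reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Flatten before indexing so this works whether y was passed as 1-D or 2-D
intercept = float(np.ravel(regressor.intercept_)[0])
slope = float(np.ravel(regressor.coef_)[0])
print(f"Price = {intercept:.2f} + {slope:.2f} * Area")

# Cross-check: polyfit returns coefficients from highest degree to lowest
slope_np, intercept_np = np.polyfit(X_train.ravel(), y_train.ravel(), 1)
print(np.isclose(slope, slope_np), np.isclose(intercept, intercept_np))
```

The slope is the estimated change in Price for a one-unit increase in Area, and the intercept is the predicted Price when Area is zero.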
Now that we have trained our algorithm, it's time to make some predictions.
y_pred = regressor.predict(X_test)
To compare the actual output values for X_test with the predicted values, execute the following script:
df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})
print(df)
This will print the actual vs predicted prices.
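Beyond eyeballing the table, the prediction error can be summarized with the sklearn.metrics module imported earlier. The sketch below computes the mean absolute error, the mean squared error, and its square root on the test set; the variable names `mae`, `mse`, and `rmse` are introduced here for illustration.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same toy data, split, and model as in the tutorial
X = np.array([1.2, 2.4, 3.5, 4.6, 5.7]).reshape(-1, 1)
y = np.array([150, 220, 340, 470, 560]).reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

mae = metrics.mean_absolute_error(y_test, y_pred)  # average absolute error
mse = metrics.mean_squared_error(y_test, y_pred)   # average squared error
rmse = np.sqrt(mse)                                # same units as Price

print(f"MAE: {mae:.2f}  MSE: {mse:.2f}  RMSE: {rmse:.2f}")
```

Because the test set here holds a single row, MAE and RMSE coincide and a score such as R² is not meaningful. On a realistically sized dataset these metrics give a much more informative picture.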
In this tutorial, we learned how to perform regression analysis in Python using the Scikit-learn library. We started by explaining the basics of regression and then discussed how to divide data into attributes and labels, how to split data into training and testing sets, and how to train a regression algorithm.
Next steps for learning: Explore multiple linear regression, polynomial regression, and logistic regression.
Additional resources:
- Python Machine Learning Tutorial
- Scikit-Learn Documentation
Solutions: These are open-ended problems. The solutions will depend on the dataset you choose. Always remember to visualize your data before making predictions and evaluate your model using metrics like Mean Squared Error (MSE).
Tips for further practice: Try to understand the assumptions behind regression analyses and how to check if your data meets those assumptions.