Harness the Power of Regression with Python: A Beginner's Guide

 


Introduction

A regression is a statistical technique that helps us analyze and understand the relationship between a dependent variable and one or more independent variables. There are many types of regression the most common models are; Linear regression Polynomial regression and Logistic regression.

In this article, I would like to discuss linear regression including simple linear regression and multiple linear regression, and how to implement this algorithm using Python and scikit-learn (a machine learning library for Python).

Linear Regression

Linear regression is a supervised learning that uses and shows the relationship between the data points to draw a straight line through all of them. This line can be used to predict future values.

Simple Linear Regression

This types of regression uses a single feature i.e. independent variable to determine the dependent variable.

let’s use scikit-learn to implement simple Linear Regression and create a model, train it, test it and use the model.

Importing Needed packages & Downloading Data

In this example, I will use fuelconsumption dataset, which contains model-specific fuel consumption ratings and estimated carbon dioxide emissions for new light-duty vehicles for retail sale in Canada.

import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np

%matplotlib inline

df = pd.read_csv("FuelConsumptionCo2.csv")

# take a look at the dataset
df.head()

As you can see this contain a lot of features, so let’s select one i.e ENGINE SIZE as the independent variable and CO2 EMISSION as the dependent variable.

cdf = df[['ENGINESIZE','CO2EMISSIONS']]
cdf.head()

let’s plot each of this features against the Emission, to see how linear their relationship is

plt.scatter(cdf.ENGINESIZE, cdf.CO2EMISSIONS,  color='blue')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()

Creating train/test dataset and Modeling

from sklearn.model_selection import train_test_split
from sklearn import linear_model


X = np.asanyarray(df[["ENGINESIZE"]])
y = np.asanyarray(df[["CO2EMISSIONS"]])

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)

# The coefficients
print ('Coefficients: ', regr.coef_)
print ('Intercept: ',regr.intercept_)

output:
Coefficients: [[39.86432526]]
Intercept: [122.8486243]

Plot output: We can plot the fit line over the data.

plt.scatter(X_train, y_train,  color='blue')
plt.plot(X_train, regr.coef_[0][0]*X_train + regr.intercept_[0], '-r')
plt.xlabel("Engine size")
plt.ylabel("Emission")

Evaluation

from sklearn.metrics import r2_score

y_pred = regr.predict(X_test)

print("Mean absolute error: %.2f" % np.mean(np.absolute(y_pred - y_test)))
print("Residual sum of squares (MSE): %.2f" % np.mean((y_pred - y_test) ** 2))
print("R2-score: %.2f" % r2_score(y_test , y_pred) )

output:
Mean absolute error: 22.78
Residual sum of squares (MSE): 908.26
R2-score: 0.78

As you can see the model accuracy is around 80%

Comments

Popular Posts