dataset date: 19 May 1987
goal: train linear regression models with supervised learning methods to predict the price of a used car, taking the car's details as input
# for data import and data wrangling:
import numpy as np
import pandas as pd
# for exploratory data analysis:
import seaborn as sns
from matplotlib import pyplot as plt
# for test-train split:
from sklearn.model_selection import train_test_split
# for linear regression:
from sklearn.linear_model import LinearRegression
# for K-Folds cross validation and prediction:
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
# for mean-squared-error:
from sklearn.metrics import mean_squared_error
# for saving trained models to disk:
import pickle
set path for data file and load data (the file contains no header row, so header=None is passed)
data_path = 'imports-85.data'
df = pd.read_csv(data_path, header=None)
inspect loaded data
print(df.head(5))
print(df.tail(5))
df.info()
check if the headers are clean; if not, assign the right headers
in this dataset, the headers are provided in a separate file, so they need to be added to the dataframe
header = ['symboling','normalized-losses','make','fuel-type','aspiration','num-of-doors','body-style','drive-wheels','engine-location','wheel-base','length','width','height','curb-weight','engine-type','num-of-cylinders','engine-size','fuel-system','bore','stroke','compression-ratio','horsepower','peak-rpm','city-mpg','highway-mpg','price']
df.columns = header
missing value clean-up:
missing values in this dataset are denoted by '?'; replace them with numpy NaN values and drop the corresponding observations
df.replace('?',np.nan, inplace=True)
df.dropna(inplace=True)
data-type cleanup:
inspect datatypes of each column
print(df.head(5))
print(df.dtypes)
numpy datatypes can be applied to pandas columns; pandas' Object datatype is equivalent to python's str datatype
the following datatype corrections are applied in this write-up:
df['normalized-losses'] = df['normalized-losses'].astype('int')
df['bore'] = df['bore'].astype('float')
df['stroke'] = df['stroke'].astype('float')
df['horsepower'] = df['horsepower'].astype('float')
df['peak-rpm'] = df['peak-rpm'].astype('int')
df['price'] = df['price'].astype('int')
review datatype of each column to confirm datatype conversion
print(df.dtypes)
short statistical summary as a common-sense test for dataset values
print(df.describe(include="all"))
explore data to understand which predictor variables (‘engine-size’, ‘horsepower’, ‘highway-mpg’, etc) have a significant bearing on the target variable (‘price’)
check the correlation coefficients of all variables (of int64 and float64 datatypes) with each other
df.corr(numeric_only=True)
the influence of individual predictors (of int64 and float64 datatypes) on the target variable (price) can be studied with Regression Plots
predictor variables are of two kinds: continuous and categorical
correlation plots for some continuous predictor variables
the regression line above shows only the trend/correlation between the predictor and the target variable, not the actual relationship (avoid mis-interpreting the segment of the regression line that goes below price=0)
price distribution and preliminary statistics visualization for categorical variables
some insights gained from exploratory data analysis:
used cars with larger engine-size and more horsepower cost more; cars more efficient on the highway cost less; these qualities have a strong correlation with a used car's price
the higher the risk rating of the car, the lower the price; similarly, the higher the normalized-losses, the higher the price; these two ratings, however, have a weaker correlation with the price of the used car than engine-size, horsepower and highway fuel efficiency
the symboling variable is categorical in nature, so it will not be analyzed as a continuous variable and will not be used to train a model
convertibles are a lot more expensive than other body styles
turbo-aspirated used cars generally cost more than standard aspiration cars, rear-wheel-drive cars cost more than front-wheel-drive and 4-wheel-drive
mpfi and idi fuel-injection systems generally cost more than other types
# continuous variables correlation co-efficients:
print(df[["engine-size", "price"]].corr())
sns.regplot(x="engine-size", y="price", data=df)
print(df[["horsepower", "price"]].corr())
sns.regplot(x="horsepower", y="price", data=df)
print(df[["highway-mpg", "price"]].corr())
sns.regplot(x="highway-mpg", y="price", data=df)
print(df[["curb-weight", "price"]].corr())
sns.regplot(x="curb-weight", y="price", data=df)
print(df[["symboling", "price"]].corr())
sns.regplot(x="symboling", y="price", data=df)
print(df[["normalized-losses", "price"]].corr())
sns.regplot(x="normalized-losses", y="price", data=df)
# categorical variables
print(df["body-style"].value_counts())
sns.boxplot(x="body-style", y="price", data=df)
print(df["aspiration"].value_counts())
sns.boxplot(x="aspiration", y="price", data=df)
print(df["fuel-system"].value_counts())
sns.boxplot(x="fuel-system", y="price", data=df)
print(df["drive-wheels"].value_counts())
sns.boxplot(x="drive-wheels", y="price", data=df)
print(df["make"].value_counts())
sns.boxplot(x="make", y="price", data=df)
identify all categorical predictor variables and one-hot encode them (pd.get_dummies converts every object-dtype column into dummy variables)
df = pd.get_dummies(df)
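the test-train split itself is not shown in this write-up, but the model fitting below references x_train, y_train, x_test and y_test; a minimal sketch, assuming price is the target and an 80/20 split (the test_size and random_state values are illustrative):
# separate the target variable from the predictors:
y = df[['price']]
x = df.drop('price', axis=1)
# hold out 20% of the observations for out-of-sample evaluation:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)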
engine-size:price
lm_engine_size = LinearRegression()
horsepower:price
lm_horsepower = LinearRegression()
highway-mpg:price
lm_highway_mpg = LinearRegression()
normalized-losses:price
lm_norm_loss = LinearRegression()
(length,width,height,curb-weight):price
lm_build_dim = LinearRegression()
(engine-size,horsepower,city-mpg,highway-mpg):price
lm_car_specs = LinearRegression()
fit the models initialized previously with their respective data series:
## simple linear regression
lm_engine_size.fit(x_train[['engine-size']], y_train)
lm_horsepower.fit(x_train[['horsepower']], y_train)
lm_highway_mpg.fit(x_train[['highway-mpg']], y_train)
lm_norm_loss.fit(x_train[['normalized-losses']], y_train)
## multiple linear regression
lm_build_dim.fit(x_train[['length','width','height','curb-weight']], y_train)
lm_car_specs.fit(x_train[['engine-size','horsepower','city-mpg','highway-mpg']], y_train)
the following are the linear regression intercept-coefficient pairs generated from training the models:
## Engine-Size:Price Linear Regression Model:
print(lm_engine_size.intercept_, lm_engine_size.coef_)
## [-7992.74991339] [[162.92166497]]
## Horsepower:Price Linear Regression Model:
print(lm_horsepower.intercept_, lm_horsepower.coef_)
## [-2306.79714958] [[143.97806991]]
## Highway-MPG:Price Linear Regression Model:
print(lm_highway_mpg.intercept_, lm_highway_mpg.coef_)
## [33820.38434914] [[-694.88845889]]
## Normalized-Losses:Price Linear Regression Model:
print(lm_norm_loss.intercept_, lm_norm_loss.coef_)
## [7414.32607766] [[34.81225022]]
## Build Dimensions:Price Linear Regression Model:
print(lm_build_dim.intercept_, lm_build_dim.coef_)
## [-52378.43476732] [[ -64.92258503 931.64836436 -164.25220771 9.23624461]]
## Car Specs:Price Linear Regression Model:
print(lm_car_specs.intercept_, lm_car_specs.coef_)
## [1531.96721594] [[ 126.16415762 14.17026447 355.30831826 -495.97047484]]
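as a quick usage example, a fitted model can estimate the price of a hypothetical used car; the engine-size value of 130 below is illustrative:
# from the intercept and coefficient above: -7992.75 + 162.92 * 130 ≈ 13187
print(lm_engine_size.predict(pd.DataFrame({'engine-size': [130]})))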
fig: engine-size:price linear estimator model
plt.figure(0)
plt.scatter(x_train[['engine-size']], y_train)
plt.xlabel('engine-size')
plt.ylabel('price')
# capture the axis bounds of the scatter plot:
x_bounds = plt.xlim()
y_bounds = plt.ylim()
print(x_bounds, y_bounds)
# evaluate the fitted line across the x-axis range and overlay it as a dashed line:
x_vals = np.linspace(x_bounds[0], x_bounds[1], num=50)
y_vals = lm_engine_size.intercept_ + lm_engine_size.coef_ * x_vals
print(x_vals, y_vals)
plt.plot(x_vals, y_vals[0], '--')
plt.title('Engine-Size based Linear Price Estimator')
plt.savefig('plots/03-model-engine-size.png')
plt.close()
visual evaluations:
Residual Plots (in-sample evaluation)
fig: residual plots for the simple linear regression models
fig: residual plots for the multiple linear regression models
residual plot insights:
only the engine-size:price model will be further analyzed, as the other models need a separate study to find out which of them work
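the residual-plot code is not shown in this write-up; a minimal sketch for the engine-size:price model using seaborn's residplot (for a well-suited linear model, the residuals should spread randomly around zero):
sns.residplot(x=x_train['engine-size'], y=y_train['price'])
plt.title('engine-size:price residual plot')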
Distribution Plots
below are distribution plots of true vs. predicted values of in-sample engine-size:price pairs
below are distribution plots of true vs. predicted values of out-of-sample engine-size:price pairs
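the code for these distribution plots is not included above; a minimal sketch for the in-sample case (substitute x_test and y_test for the out-of-sample version):
plt.figure()
sns.kdeplot(y_train['price'], label='true values')
sns.kdeplot(lm_engine_size.predict(x_train[['engine-size']]).ravel(), label='predicted values')
plt.title('true vs. predicted price distribution (in-sample)')
plt.legend()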
in-sample R^2 value for engine-size:price:
engine-size:price
print(lm_engine_size.score(x_train[['engine-size']], y_train))
## 0.7310050422003973
out-of-sample R^2 value for engine-size:price:
engine-size:price
print(lm_engine_size.score(x_test[['engine-size']], y_test))
## 0.5637996565118054
in-sample mean-squared-error for engine-size:price:
engine-size:price
mean_squared_error(y_train, lm_engine_size.predict(x_train[['engine-size']]))
## 9848498.42266501
out-of-sample mean-squared-error for engine-size:price:
engine-size:price
mean_squared_error(y_test, lm_engine_size.predict(x_test[['engine-size']]))
## 10707791.23428005
K-Folds cross-validation is used to retrain the model on K in-sample train-test splits and obtain the R^2 value for each split
2-Folds:
# train-test splits - R^2 score:
cross_val_score(lm_engine_size, x_train[['engine-size']], y_train, cv=2)
# [0.72605172 0.57164608]
# R^2 average:
np.mean(cross_val_score(lm_engine_size, x_train[['engine-size']], y_train, cv=2))
# 0.6488488999289039
3-Folds:
(note: unlike the other folds, this run scores against the full dataset rather than the training split)
# train-test splits - R^2 score:
cross_val_score(lm_engine_size, df[['engine-size']], df[['price']], cv=3)
# [0.73736009 0.80002091 0.50708333]
# R^2 average:
np.mean(cross_val_score(lm_engine_size, df[['engine-size']], df[['price']], cv=3))
# 0.6814881082982801
4-Folds:
# train-test splits - R^2 score:
cross_val_score(lm_engine_size, x_train[['engine-size']], y_train, cv=4)
# [0.75718247 0.75754086 0.72910826 0.38780982]
# R^2 average:
np.mean(cross_val_score(lm_engine_size, x_train[['engine-size']], y_train, cv=4))
# 0.6579103522174403
5-Folds:
# train-test splits - R^2 score:
cross_val_score(lm_engine_size, x_train[['engine-size']], y_train, cv=5)
# [0.79241442 0.70705579 0.80316644 0.65735852 0.41465915]
# R^2 average:
np.mean(cross_val_score(lm_engine_size, x_train[['engine-size']], y_train, cv=5))
# 0.6749308645401664
fig: 2-fold trained model predictions vs. true values
fig: 3-fold trained model predictions vs. true values
fig: 4-fold trained model predictions vs. true values
fig: 5-fold trained model predictions vs. true values
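the figures above compare cross-validated predictions against the true prices; a minimal sketch of how such figures may be produced with cross_val_predict (the plot filenames are illustrative):
for k in [2, 3, 4, 5]:
    # out-of-fold predictions for every training observation:
    y_cv = cross_val_predict(lm_engine_size, x_train[['engine-size']], y_train, cv=k)
    plt.figure()
    sns.kdeplot(y_train['price'], label='true values')
    sns.kdeplot(y_cv.ravel(), label='predicted values')
    plt.title(f'{k}-fold trained model predictions vs. true values')
    plt.legend()
    plt.savefig(f'plots/04-kfold-{k}.png')
    plt.close()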
the trained linear model shows the required residual plot behavior (residuals spread randomly around zero), so the linear estimator is a good fit
the estimator suffers from poor accuracy on out-of-sample data, however, as seen in the distribution plots above
training with more data samples, or a more complex model such as a neural network, might provide better performance
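finally, the pickle import at the top of this write-up can be used to save the trained estimator to disk for later reuse; a minimal sketch (the filename is illustrative):
# persist the trained engine-size:price model:
with open('lm_engine_size.pkl', 'wb') as f:
    pickle.dump(lm_engine_size, f)
# reload it later without retraining:
with open('lm_engine_size.pkl', 'rb') as f:
    lm_loaded = pickle.load(f)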