ML case studies 3 - UCI auto data regression performance evaluation

last updated: 03 Jun 2019

Pre-requisite Reading



Goal

Build a multi-variate linear regression model and a small neural net regressor on the cleaned UCI automobile data, then compare their in-sample and out-of-sample accuracy using residual plots, the R^2 score, and mean squared error (MSE).

Python Libraries

### IMPORT ML LIBRARIES: 
import numpy as np
import pandas as pd

import seaborn as sns
from matplotlib import pyplot as plt

from sklearn.linear_model import LinearRegression
import keras

from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error



Data Pre-Processing

### LOAD DATA: 

data_path = '../csv/00-cleaned-up-data.csv'
df = pd.read_csv(data_path)



Feature Selection

### FEATURE SELECTION: 

# keep five predictors plus the price target:
group = [
    'engine-size', 'horsepower', 'city-mpg', 'highway-mpg', 'body-style',
    'price'
]

df = df[group]
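
Of the selected columns, body-style is the only non-numeric feature, so it is the one the one-hot-encoding step below will expand. A quick sanity check (a sketch, assuming the cleaned CSV stores body-style as strings):

# confirm which selected columns are categorical:
print(df.dtypes)
print(df['body-style'].unique())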



One-Hot-Encoding and Normalization

### ONE-HOT-ENCODING: 

# expand the categorical body-style column into indicator columns;
# numeric columns pass through unchanged:
ohe = pd.get_dummies(df)
print(ohe.shape)
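
For intuition, a minimal sketch of what get_dummies does to a hypothetical three-row frame (the values are made up for illustration):

# toy illustration of get_dummies; values are hypothetical:
toy = pd.DataFrame({'body-style': ['sedan', 'hatchback', 'sedan'],
                    'price': [13950, 7609, 16500]})
print(pd.get_dummies(toy))
#    price  body-style_hatchback  body-style_sedan
# 0  13950                     0                 1
# 1   7609                     1                 0
# 2  16500                     0                 1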

### NORMALIZE DATA: 

# split into train and test sets (the original listing omits this step;
# an 80/20 random split is assumed here):
train_data = ohe.sample(frac=0.8, random_state=0)
test_data = ohe.drop(train_data.index)

# separate the target from the features:
train_target = train_data.pop('price')
test_target = test_data.pop('price')

# normalize every feature with statistics computed on the training set
# only, so no information leaks in from the test set:
train_stats = train_data.describe().transpose()
# print(train_stats)

normed_train_data = (train_data - train_stats['mean']) / train_stats['std']
normed_test_data = (test_data - train_stats['mean']) / train_stats['std']

# print(normed_train_data.dtypes)
# print(normed_test_data)
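
A quick check that the scaling behaved as expected: the training features should now have roughly zero mean and unit standard deviation (the test set will deviate slightly, since it was scaled with training statistics):

# sanity check on the normalization:
print(normed_train_data.mean().round(2))
print(normed_train_data.std().round(2))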



Model Initialization and Training


linear regression model

## MULTI-LINEAR REGRESSION: 

# init model:
lm = LinearRegression()

# train model:
lm.fit(normed_train_data, train_target)
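
Because the inputs are normalized, the fitted coefficients are directly comparable across features. A small sketch to inspect them (coef_ and intercept_ are standard scikit-learn attributes):

# inspect the learned weights:
for name, coef in zip(normed_train_data.columns, lm.coef_):
    print(f'{name}: {coef:.2f}')
print(f'intercept: {lm.intercept_:.2f}')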


neural net regression model

## NEURAL NET REGRESSION: 

# MLP model-generating function: a small 32 -> 8 -> 1 funnel with dropout
# between layers and fixed seeds for reproducibility
def build_mlp_model():

    model = keras.Sequential([
        keras.layers.Dense(
            32,
            activation='sigmoid',
            kernel_initializer=keras.initializers.glorot_normal(seed=3),
            input_dim=len(normed_train_data.keys())),
        keras.layers.Dropout(rate=0.25, noise_shape=None, seed=7),
        keras.layers.Dense(8, activation='relu'),
        keras.layers.Dropout(rate=0.001, noise_shape=None, seed=3),
        keras.layers.Dense(1)  # single linear output for regression
    ])

    # Adam with an aggressive initial learning rate tempered by decay;
    # track MAE and MSE during training:
    model.compile(loss='mse',
                  optimizer=keras.optimizers.Adam(lr=0.09,
                                                  beta_1=0.9,
                                                  beta_2=0.999,
                                                  epsilon=None,
                                                  decay=0.03,
                                                  amsgrad=True),
                  metrics=['mae', 'mse'])

    return model


# initialize model and view details:

nn = build_mlp_model()  # standalone Keras model
nn.summary()

# nn model verification: the untrained model should already return one
# prediction per input row

example_batch = normed_train_data[:10]
example_result = nn.predict(example_batch)
print(example_result)
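
The listing above never trains the network, yet the residual and accuracy sections below treat nn as a fitted model. A hedged reconstruction of the missing training step (the epoch count and validation split are assumptions, not from the original):

# train model (assumed step; hyperparameters are guesses):
history = nn.fit(normed_train_data, train_target,
                 epochs=500, validation_split=0.2, verbose=0)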



In-Sample Accuracy


linear model residual

fig: residual plot for linear regression model


neural net residual

fig: residual plot for neural net model


insights

A well-fit model leaves residuals scattered randomly about zero; visible curvature or a funnel shape in either plot points to unmodeled structure or non-constant variance.

### RESIDUALS (in-sample accuracy): 

## MULTI-LINEAR REGRESSION: 
plt.figure(0, figsize=(12, 10))
sns.residplot(lm.predict(normed_train_data), train_target)
plt.savefig('plots/a-resid-plot-multi-lin.png')

## NEURAL NET REGRESSION: 
plt.figure(1, figsize=(12, 10))
sns.residplot(nn.predict(normed_train_data).flatten(), train_target)
plt.savefig('plots/a-resid-plot-nn.png')

### ACCURACY (in-sample test): 

## MULTI-LINEAR REGRESSION: 

print('\nMulti-variate Linear Regression Accuracy Metrics')

# R^2 score:
print('in-sample R^2')
print(r2_score(train_target, lm.predict(normed_train_data)))

# MSE score:
print('in-sample MSE')
print(mean_squared_error(train_target, lm.predict(normed_train_data)))

## NEURAL NET REGRESSION: 

print('\nNeural Net Regression Accuracy Metrics')

# R^2 score:
print('in-sample R^2')
print(r2_score(train_target, nn.predict(normed_train_data).flatten()))

# MSE score:
print('in-sample MSE')
print(mean_squared_error(train_target, nn.predict(normed_train_data).flatten()))
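
For reference, the two scores printed above have their standard definitions, with $y_i$ the true price, $\hat{y}_i$ the model prediction, and $\bar{y}$ the mean of the true prices:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},
\qquad
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

An R^2 near 1 means the model explains most of the variance in price; MSE is reported in squared dollars.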



Out-of-Sample Accuracy


linear model


neural net


comparison

fig: distribution plot for true (black), linear model (cyan) and neural net predictions (yellow)


insights

The closer a model's predicted-price distribution tracks the true distribution, the better it captures the overall spread of prices; systematic offsets or missing modes show where a model over- or under-predicts.

### ACCURACY (out-of-sample test): 

## MULTI-LINEAR REGRESSION: 

print('\nMulti-variate Linear Regression Accuracy Metrics')

# R^2 score:
print('out-of-sample R^2')
print(r2_score(test_target, lm.predict(normed_test_data)))

# MSE score:
print('out-of-sample MSE')
print(mean_squared_error(test_target, lm.predict(normed_test_data)))

## NEURAL NET REGRESSION: 

print('\nNeural Net Regression Accuracy Metrics')

# R^2 score:
print('out-of-sample R^2')
print(r2_score(test_target, nn.predict(normed_test_data).flatten()))

# MSE score:
print('out-of-sample MSE')
print(mean_squared_error(test_target, nn.predict(normed_test_data).flatten()))
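
MSE is in squared dollars, which is hard to read directly; the square root puts the error back into dollars (a small convenience sketch, reusing the numpy import from the top of the post):

# RMSE in dollars:
print('out-of-sample RMSE (linear):',
      np.sqrt(mean_squared_error(test_target, lm.predict(normed_test_data))))
print('out-of-sample RMSE (neural net):',
      np.sqrt(mean_squared_error(test_target, nn.predict(normed_test_data).flatten())))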

## DISTRIBUTION PLOTS: 

plt.figure(2, figsize=(12, 10))

ax1 = sns.distplot(test_target, hist=False, color="k", label='True Values')

ax2 = sns.distplot(lm.predict(normed_test_data),
                   hist=False,
                   color="c",
                   label='Linear Model Prediction',
                   ax=ax1)

ax3 = sns.distplot(nn.predict(normed_test_data).flatten(),
                   hist=False,
                   color="y",
                   label='Neural Net Prediction',
                   ax=ax1)

plt.title(
    "['engine-size', 'horsepower', 'city-mpg', 'highway-mpg', 'body-style']:Price"
)
plt.xlabel('Price (in dollars)')
plt.ylabel('Density')
plt.savefig('plots/b-dist-lin-nn-compare.png')



References


created: 01 Jun 2019
today's track: We Know (Vintage & Morelli Remix) by Boom Jinx & Sound Prank feat. Katrine Stenbekk