dataset date: 19 May 1987
# for data import and data wrangling:
import numpy as np
import pandas as pd
# for exploratory data analysis:
import seaborn as sns
from matplotlib import pyplot as plt
# for statistics (regression-line fits):
import scipy as sp
import scipy.stats
# for neural networks:
import keras
# for reading/writing trained models to disk:
import pickle
rows containing '?' and missing values (NaN and equivalent) are removed; the given dataset has 205 observations which, after clean-up, yields 159 usable data points
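the clean-up script itself is not shown here; a minimal sketch of the step assumed above, using pandas (the raw-file path is a placeholder, while the cleaned-up path matches the file read back in later):
# sketch of the clean-up step assumed above: treat '?' as missing and drop incomplete rows
raw = pd.read_csv('csv/raw-data.csv')   # placeholder path to the raw automobile file
raw = raw.replace('?', np.nan)          # the raw data marks missing values with '?'
raw = raw.dropna()                      # 205 rows -> 159 complete rows
raw.to_csv('csv/00-cleaned-up-data.csv', index=False)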
fig: engine-size vs. price
fig: horsepower vs. price
fig: city-mpg vs. price
fig: highway-mpg vs. price
fig: wheel-base vs. price
fig: length vs. price
fig: width vs. price
fig: height vs. price
fig: curb-weight vs. price
fig: peak-rpm vs. price
fig: bore vs. price
fig: stroke vs. price
fig: compression-ratio vs. price
fig: norm-loss vs. price
fig: body-style vs. price
fig: drive-wheels vs. price
fig: aspiration vs. price
fig: fuel-system vs. price
fig: make vs. price
fig: fuel-type vs. price
fig: number of doors vs. price
fig: number of cylinders vs. price
fig: symboling vs. price
fig: engine-location vs. price
fig: engine-type vs. price
there are a total of 25 predictor variables
given only 159 usable observations to train this neural net, 25 predictors is likely too many features to fit reliably
this calls for feature selection, as is explored next
import cleaned-up data
## unsplit data:
data_path = 'csv/00-cleaned-up-data.csv'
df = pd.read_csv(data_path)
print('number of observations in dataset: ', df.shape[0])
regression plots
#--- engine location value count shows only 1 category; so it is dropped
print(df['engine-location'].value_counts())
df = df.drop(columns=['engine-location'])
#--- correlation plots and regression line equations:
for key in df.keys():
    if key != 'price' and key != 'symboling' and df[key].dtype != 'O':
        print(key)
        slope, intercept, r_value, p_value, std_err = sp.stats.linregress(df[key], df['price'])
        # save plot for visual inspection
        fig = plt.figure()
        sns.regplot(x=key, y="price", data=df)
        plt.title('Slope: ' + str(slope) + '; Intercept: ' + str(intercept))
        plt.savefig('plots/feature-influence/06-a-reg-plot-' + key + '.png')
        plt.close(fig)
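to turn these per-feature regressions into a ranking that supports the feature selection discussed above, the r-values could, for example, be collected and sorted by magnitude; a small sketch of that idea (not part of the original script):
# sketch: rank numeric predictors by the magnitude of their correlation with price
r_values = {}
for key in df.keys():
    if key != 'price' and key != 'symboling' and df[key].dtype != 'O':
        r_values[key] = sp.stats.linregress(df[key], df['price']).rvalue
print(pd.Series(r_values).abs().sort_values(ascending=False))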
box plots
## categorical correlation:
print('value-counts table ------')
# body-style:price
plt.figure(6)
print(df["body-style"].value_counts())
sns.boxplot(x="body-style", y="price", data=df)
plt.savefig('plots/01-b-box-plot-body-style.png')
# aspiration:price
plt.figure(7)
print(df["aspiration"].value_counts())
sns.boxplot(x="aspiration", y="price", data=df)
plt.savefig('plots/01-b-box-plot-aspiration.png')
# fuel-system:price
plt.figure(8)
print(df["fuel-system"].value_counts())
sns.boxplot(x="fuel-system", y="price", data=df)
plt.savefig('plots/01-b-box-plot-fuel-system.png')
# drive-wheels:price
plt.figure(9)
print(df["drive-wheels"].value_counts())
sns.boxplot(x="drive-wheels", y="price", data=df)
plt.savefig('plots/01-b-box-plot-drive-wheels.png')
# make:price
plt.figure(16)  # unique figure id (9 was already used for the drive-wheels plot)
print(df["make"].value_counts())
sns.boxplot(x="make", y="price", data=df)
plt.savefig('plots/01-b-box-plot-make.png')
# num-of-doors:price
plt.figure(10)
print(df["num-of-doors"].value_counts())
sns.boxplot(x="num-of-doors", y="price", data=df)
plt.savefig('plots/01-b-box-plot-num-of-doors.png')
# symboling:price
plt.figure(11)
print(df["symboling"].value_counts())
sns.boxplot(x="symboling", y="price", data=df)
plt.savefig('plots/01-b-box-plot-symboling.png')
# num-of-cylinders:price
plt.figure(12)
print(df["num-of-cylinders"].value_counts())
sns.boxplot(x="num-of-cylinders", y="price", data=df)
plt.savefig('plots/01-b-box-plot-num-of-cylinders.png')
# engine-type:price
plt.figure(13)
print(df["engine-type"].value_counts())
sns.boxplot(x="engine-type", y="price", data=df)
plt.savefig('plots/01-b-box-plot-engine-type.png')
# fuel-type:price
plt.figure(14)
print(df["fuel-type"].value_counts())
sns.boxplot(x="fuel-type", y="price", data=df)
plt.savefig('plots/01-b-box-plot-fuel-type.png')
# engine-location:price -- skipped: this column was dropped earlier (single category only)
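the repeated boxplot-and-save pattern above could equally be driven by a loop; an equivalent sketch (column names taken from the block above):
# sketch: loop-based version of the box plots above
categorical_cols = ['body-style', 'aspiration', 'fuel-system', 'drive-wheels', 'make',
                    'num-of-doors', 'symboling', 'num-of-cylinders', 'engine-type', 'fuel-type']
for col in categorical_cols:
    fig = plt.figure()
    print(df[col].value_counts())
    sns.boxplot(x=col, y='price', data=df)
    plt.savefig('plots/01-b-box-plot-' + col + '.png')
    plt.close(fig)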
leaving out price, the one-hot-encoded categorical variables along with the rest of the continuous predictor variables yield a total of 69 features
the features are narrowed down in the pandas dataframe by selecting only the chosen ones
## extract only chosen features along with price
group1 = np.array(['engine-size','horsepower','city-mpg','highway-mpg','body-style','price'])
df = df[group1]
print(df.keys())
a couple of steps are saved if one-hot-encoding is done before the test-train split
# apply one-hot encoding:
ohe = pd.get_dummies(df)
# print(ohe.keys())
# print(ohe.shape)
### TEST-TRAIN SPLIT:
train_data = ohe.sample(frac=0.8,random_state=0)
test_data = ohe.drop(train_data.index)
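a quick check of the resulting split sizes (not in the original script) confirms the roughly 80/20 division of the 159 cleaned rows:
# sanity check on the 80/20 split
print('train:', train_data.shape, ' test:', test_data.shape)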
even the one-hot-encoded variables are normalized
here, each column is scaled using the mean and standard deviation computed from the training set, and the same training-set statistics are applied to the test set
### NORMALIZE DATA:
train_target = train_data.pop('price')
test_target = test_data.pop('price')
train_stats = train_data.describe().transpose()
normed_train_data = (train_data - train_stats['mean']) / train_stats['std']
normed_test_data = (test_data - train_stats['mean']) / train_stats['std']
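as a sanity check (again not in the original script), the normalized training columns should now have a mean close to 0 and a standard deviation close to 1:
# verify the normalization of the training features
print(normed_train_data.describe().transpose()[['mean', 'std']].round(3))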
fig: intended neural net model
a keras sequential model is initialized with three Dense layers and two Dropout layers. Dense and Dropout are core keras layer types; Dense layers are the basic building blocks of a neural network. the input Dense layer, receiving 9 feature inputs, has 32 nodes, each with a sigmoid activation function; the central hidden Dense layer has 8 nodes, each with a rectified linear unit (ReLU) activation function; and the output Dense layer has just one node to output the continuous value of price, so no activation function is supplied. Dropout layers help to avoid over-fitting; these are sandwiched between the Dense layers.
fig: graph of the generated sequential keras neural net model
### KERAS MODEL:
# setup MLP model generating function
def build_mlp_model():
    model = keras.Sequential([
        keras.layers.Dense(32, activation='sigmoid',
                           kernel_initializer=keras.initializers.glorot_normal(seed=3),
                           input_dim=len(normed_train_data.keys())),
        keras.layers.Dropout(rate=0.25, noise_shape=None, seed=7),
        keras.layers.Dense(8, activation='relu'),
        keras.layers.Dropout(rate=0.001, noise_shape=None, seed=3),
        keras.layers.Dense(1)
    ])
    model.compile(loss='mse',
                  optimizer=keras.optimizers.Adam(lr=0.09, beta_1=0.9, beta_2=0.999,
                                                  epsilon=None, decay=0.03, amsgrad=True),
                  metrics=['mae', 'mse'])
    return model
# initialize model and view details
model = build_mlp_model()
model.summary()
# CLI output of model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 32) 320
_________________________________________________________________
dropout_1 (Dropout) (None, 32) 0
_________________________________________________________________
dense_2 (Dense) (None, 8) 264
_________________________________________________________________
dropout_2 (Dropout) (None, 8) 0
_________________________________________________________________
dense_3 (Dense) (None, 1) 9
=================================================================
Total params: 593
Trainable params: 593
Non-trainable params: 0
_________________________________________________________________
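the model-graph figure referenced earlier can be produced with keras' plotting utility (requires pydot and graphviz); the output path below is an assumption, not necessarily the repo's file name:
# sketch: render the network graph to an image file
keras.utils.plot_model(model, to_file='nn-plots/model-graph.png', show_shapes=True)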
the size of the network is visible in the model.summary() output here; so, the number of features used to train the model is prioritized to a few important ones
## model verification:
example_batch = normed_train_data[:10]
example_result = model.predict(example_batch)
print(example_result)
# CLI output of above print statement
[[ 0.1648832 ]
[ 0.03176637]
[ 0.10069194]
[ 0.18034486]
[-0.02307046]
[-0.00585764]
[ 0.03640913]
[-0.10891403]
[-0.01421728]
[ 0.15022787]]
### TRAIN MODEL:
# function to display training progress in console
class PrintDot(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs):
        print('.', end='')
        if epoch % 100 == 0: print('\nEPOCH: ', epoch, '\n')

EPOCHS = 1000
history = model.fit(
    normed_train_data, train_target,
    epochs=EPOCHS, validation_split=0.2, verbose=0,
    callbacks=[PrintDot()])
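over-fitting across the 1000 epochs could also be curbed with an early-stopping callback; a sketch of how that would look (the patience value is an arbitrary choice, and this variant is not what produced the results below):
# sketch: alternative training call with early stopping on the validation loss
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=50)
history = model.fit(
    normed_train_data, train_target,
    epochs=EPOCHS, validation_split=0.2, verbose=0,
    callbacks=[early_stop, PrintDot()])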
## visualize learning
# plot the learning steps:
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
plt.figure(0)
plt.xlabel('Epoch')
plt.ylabel('Mean Abs Error [USD]')
plt.plot(hist['epoch'], hist['mean_absolute_error'],
label='Train Error')
plt.plot(hist['epoch'], hist['val_mean_absolute_error'],
label = 'Val Error')
plt.legend()
plt.savefig('nn-plots/mae-epoch-history.png')
plt.figure(1)
plt.xlabel('Epoch')
plt.ylabel('Mean Square Error [$USD^2$]')
plt.plot(hist['epoch'], hist['mean_squared_error'],
label='Train Error')
plt.plot(hist['epoch'], hist['val_mean_squared_error'],
label = 'Val Error')
plt.legend()
plt.savefig('nn-plots/mse-epoch-history.png')
fig: mean-absolute-error vs. epoch
fig: mean-squared-error vs. epoch
cython is used to compile this entire neural network pipeline (written in python) to speed up the training passes; the python code ported to cython is in the cython folder of the code repo
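a minimal sketch of how such a compilation step is typically wired up with setuptools and cython (the module name nn_pipeline.pyx is a placeholder, not necessarily the repo's file):
# sketch of a setup.py for compiling the pipeline module with cython
from setuptools import setup
from Cython.Build import cythonize

setup(
    name='car-price-nn',
    ext_modules=cythonize('nn_pipeline.pyx', compiler_directives={'language_level': 3}),
)
# build in place with:  python setup.py build_ext --inplace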
### TEST SET EVALUATION:
loss, mae, mse = model.evaluate(normed_test_data, test_target, verbose=0)
print("\n\nTesting set Mean Abs Error: {:5.2f} USD".format(mae))
# CLI output of print statement above
Testing set Mean-Abs-Error (MAE): 2156.20 USD
fig: scatter plot of error for test dataset
fig: histogram of errors for test dataset
### PREDICTIONS:
test_predictions = model.predict(normed_test_data).flatten()
# plot scatter plot for test data -----------------------
plt.figure(2)
plt.scatter(test_target, test_predictions)
plt.xlabel('True Values [USD]')
plt.ylabel('Predictions [USD]')
plt.axis('equal')
plt.axis('square')
# plt.xlim([0,plt.xlim()[1]])
# plt.ylim([0,plt.ylim()[1]])
# _ = plt.plot([-100, 100], [-100, 100])
plt.savefig('nn-plots/scatter-test-true-vs-predicted.png')
# error distribution plot: ------------------------------
plt.figure(3)
error = test_predictions - test_target
plt.hist(error, bins = 25)
plt.xlabel("Prediction Error [USD]")
_ = plt.ylabel("Count")
plt.savefig('nn-plots/hist-test-vs-predicted-error-distribution.png')
plt.show()
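finally, the pickle import at the top suggests the trained model and its training history are persisted to disk; a minimal sketch of that last step (file names are assumptions):
# sketch: save the trained model (keras-native HDF5 format) and the training history
model.save('models/car-price-mlp.h5')
with open('models/training-history.pkl', 'wb') as f:
    pickle.dump(history.history, f)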