ml case studies 2 - uci auto data (neural nets)

Contents


  1. Dataset Details
  2. Goal
  3. Neural Net Pipeline
  4. Data Pre-processing
  5. Exploratory Data Analysis
  6. Feature Selection
  7. Test-Train Split
  8. Neural Net Setup
  9. Output Verification
  10. Neural Net Training
  11. Evaluation
  12. References

Dataset Details:


dataset headers
  1. symboling: -3, -2, -1, 0, 1, 2, 3
  2. normalized-losses: continuous from 65 to 256
  3. make: alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo
  4. fuel-type: diesel, gas
  5. aspiration: std, turbo
  6. num-of-doors: four, two
  7. body-style: hardtop, wagon, sedan, hatchback, convertible
  8. drive-wheels: 4wd, fwd, rwd
  9. engine-location: front, rear
  10. wheel-base: continuous from 86.6 to 120.9
  11. length: continuous from 141.1 to 208.1
  12. width: continuous from 60.3 to 72.3
  13. height: continuous from 47.8 to 59.8
  14. curb-weight: continuous from 1488 to 4066
  15. engine-type: dohc, dohcv, l, ohc, ohcf, ohcv, rotor
  16. num-of-cylinders: eight, five, four, six, three, twelve, two
  17. engine-size: continuous from 61 to 326
  18. fuel-system: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi
  19. bore: continuous from 2.54 to 3.94
  20. stroke: continuous from 2.07 to 4.17
  21. compression-ratio: continuous from 7 to 23
  22. horsepower: continuous from 48 to 288
  23. peak-rpm: continuous from 4150 to 6600
  24. city-mpg: continuous from 13 to 49
  25. highway-mpg: continuous from 16 to 54
  26. price: continuous from 5118 to 45400

Goal:


Predict a car's price from its remaining specifications using a multi-layer perceptron (MLP) regression model built with keras, and evaluate prediction quality (MAE, MSE) on a held-out test set.

Neural Net Pipeline:


fig: neural net pipeline


python libraries
    # for data import and data wrangling:
    import numpy as np
    import pandas as pd
    
    # for exploratory data analysis:
    import seaborn as sns
    from matplotlib import pyplot as plt

    # for neural networks
    import keras

    # for read/write trained models to disk:
    import pickle



Data Pre-processing:
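

The pre-processing code isn't reproduced in the post; below is a minimal sketch, assuming the raw UCI file imports-85.data (no header row, '?' marking missing values) sits alongside the script:

# column names, in the order listed under Dataset Details above
cols = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
        'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
        'wheel-base', 'length', 'width', 'height', 'curb-weight',
        'engine-type', 'num-of-cylinders', 'engine-size', 'fuel-system',
        'bore', 'stroke', 'compression-ratio', 'horsepower', 'peak-rpm',
        'city-mpg', 'highway-mpg', 'price']

# read the raw file, mapping '?' to NaN
df = pd.read_csv('imports-85.data', names=cols, na_values='?')

# drop rows missing the target, then any remaining incomplete rows
df = df.dropna(subset=['price']).dropna().reset_index(drop=True)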



Exploratory Data Analysis:


continuous numerical variables:


fig: engine-size vs. price


fig: horsepower vs. price


fig: city-mpg vs. price


fig: highway-mpg vs. price


fig: wheel-base vs. price


fig: length vs. price


fig: width vs. price


fig: height vs. price


fig: curb-weight vs. price


fig: peak-rpm vs. price


fig: bore vs. price


fig: stroke vs. price


fig: compression-ratio vs. price


fig: normalized-losses vs. price


categorical variables:


fig: body-style vs. price


fig: drive-wheels vs. price


fig: aspiration vs. price


fig: fuel-system vs. price


fig: make vs. price


fig: fuel-type vs. price


fig: number of doors vs. price


fig: number of cylinders vs. price


fig: symboling vs. price


fig: engine-location vs. price


fig: engine-type vs. price

insights:
plots generation code:
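
The full plotting code isn't shown; here is a minimal sketch of how figures like the above can be generated (output file names are illustrative; df is the cleaned dataframe from the pre-processing step):

# continuous feature vs. price: scatter with a regression fit
plt.figure()
sns.regplot(x='engine-size', y='price', data=df)
plt.savefig('nn-plots/engine-size-vs-price.png')

# categorical feature vs. price: box plot
plt.figure()
sns.boxplot(x='body-style', y='price', data=df)
plt.savefig('nn-plots/body-style-vs-price.png')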

Feature Selection:
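

The selection code isn't reproduced in the post; a minimal sketch of one common approach (not necessarily the author's) is to rank features by correlation with price and pick a subset. The subset below is purely illustrative, chosen so that, after one-hot encoding, it yields the 9 input columns the model summary further down implies:

# rank numeric features by absolute correlation with the target
print(df.select_dtypes('number').corr()['price'].abs().sort_values(ascending=False))

# hypothetical selection: 6 continuous features plus one categorical
# (drive-wheels one-hot encodes to 3 columns -> 9 model inputs)
features = ['engine-size', 'horsepower', 'curb-weight', 'width',
            'city-mpg', 'highway-mpg', 'drive-wheels', 'price']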



Test-Train Split:


one-hot-encoding


test-train split
feature normalization (all three steps are sketched below)
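
A minimal sketch of the three steps above, reusing the hypothetical features list from the feature-selection sketch and producing the variable names (normed_train_data, train_target, normed_test_data, test_target) that the code below expects:

# one-hot encode the categorical column in the selected feature set
dataset = pd.get_dummies(df[features], columns=['drive-wheels'])

# 80/20 train-test split
train_data = dataset.sample(frac=0.8, random_state=0)
test_data = dataset.drop(train_data.index)

# pop the regression target off each split
train_target = train_data.pop('price')
test_target = test_data.pop('price')

# z-score feature normalization, using train-set statistics for both splits
train_stats = train_data.describe().transpose()
normed_train_data = (train_data - train_stats['mean']) / train_stats['std']
normed_test_data = (test_data - train_stats['mean']) / train_stats['std']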

Neural Net Setup:



fig: intended neural net model



fig: graph of the generated sequential keras neural net model


### KERAS MODEL:

# setup MLP model generating function
def build_mlp_model():

    model = keras.Sequential([
        # input layer: 32 sigmoid units, Xavier(glorot)-initialized
        keras.layers.Dense(32, activation='sigmoid',
                           kernel_initializer=keras.initializers.glorot_normal(seed=3),
                           input_dim=len(normed_train_data.keys())),
        keras.layers.Dropout(rate=0.25, noise_shape=None, seed=7),
        # hidden layer: 8 relu units
        keras.layers.Dense(8, activation='relu'),
        keras.layers.Dropout(rate=0.001, noise_shape=None, seed=3),
        # single linear output unit for price regression
        keras.layers.Dense(1)
    ])

    model.compile(loss='mse',
                  optimizer=keras.optimizers.Adam(lr=0.09, beta_1=0.9, beta_2=0.999,
                                                  epsilon=None, decay=0.03, amsgrad=True),
                  metrics=['mae', 'mse'])

    return model

# initialize model and view details
model = build_mlp_model()
model.summary()

# CLI output of model.summary() 
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 32)                320       
_________________________________________________________________
dropout_1 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 8)                 264       
_________________________________________________________________
dropout_2 (Dropout)          (None, 8)                 0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 9         
=================================================================
Total params: 593
Trainable params: 593
Non-trainable params: 0
_________________________________________________________________
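
The model-graph figure above can be generated with keras's plot_model utility (a sketch; requires the pydot and graphviz packages, and isn't necessarily how the original figure was produced):

# render the model architecture to a PNG
from keras.utils import plot_model
plot_model(model, to_file='nn-plots/keras-model-graph.png', show_shapes=True)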



Output Verification:


Before training, push a small batch through the freshly initialized model to confirm the pipeline runs end to end and the output has the expected shape.

## model verification:

example_batch = normed_train_data[:10]
example_result = model.predict(example_batch)
print(example_result)

# CLI output of above print statement
[[ 0.1648832 ]
 [ 0.03176637]
 [ 0.10069194]
 [ 0.18034486]
 [-0.02307046]
 [-0.00585764]
 [ 0.03640913]
 [-0.10891403]
 [-0.01421728]
 [ 0.15022787]]

Neural Net Training:


### TRAIN MODEL:

# function to display training progress in console
class PrintDot(keras.callbacks.Callback):
  def on_epoch_end(self, epoch, logs):
    print('.', end = '')
    if epoch % 100 == 0: print('\nEPOCH: ', epoch, '\n')

EPOCHS = 1000

history = model.fit(
  normed_train_data, train_target,
  epochs=EPOCHS, validation_split = 0.2, verbose=0,
  callbacks=[PrintDot()])

## visualize learning

# plot the learning steps: 
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch

plt.figure(0)
plt.xlabel('Epoch')
plt.ylabel('Mean Abs Error [USD]')
plt.plot(hist['epoch'], hist['mean_absolute_error'],
          label='Train Error')
plt.plot(hist['epoch'], hist['val_mean_absolute_error'],
          label = 'Val Error')
plt.legend()
plt.savefig('nn-plots/mae-epoch-history.png')


plt.figure(1)
plt.xlabel('Epoch')
plt.ylabel('Mean Square Error [$USD^2$]')
plt.plot(hist['epoch'], hist['mean_squared_error'],
          label='Train Error')
plt.plot(hist['epoch'], hist['val_mean_squared_error'],
          label = 'Val Error')
plt.legend()
plt.savefig('nn-plots/mse-epoch-history.png')


fig: mean-absolute-error vs. epoch


fig: mean-squared-error vs. epoch
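

The imports at the top list pickle for writing trained models to disk, though no save step appears in the post. A minimal sketch (file paths are illustrative; keras's own model.save is the more robust route for the model itself, while pickle handles the plain history dict):

# persist the trained model (HDF5) and its training history (pickle)
model.save('mlp-price-model.h5')
with open('mlp-price-history.pkl', 'wb') as f:
    pickle.dump(history.history, f)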


Evaluation:


test set MAE
### TEST SET EVALUATION: 

loss, mae, mse = model.evaluate(normed_test_data, test_target, verbose=0)
print("\n\nTesting set Mean Abs Error: {:5.2f} USD".format(mae))

# CLI output of print statement above
Testing set Mean Abs Error: 2156.20 USD


test set predictions and error

fig: scatter plot of error for test dataset

fig: histogram of errors for test dataset

### PREDICTIONS: 

test_predictions = model.predict(normed_test_data).flatten()

# plot scatter plot for test data -----------------------

plt.figure(2)
plt.scatter(test_target, test_predictions)
plt.xlabel('True Values [USD]')
plt.ylabel('Predictions [USD]')
plt.axis('equal')
plt.axis('square')
# plt.xlim([0,plt.xlim()[1]])
# plt.ylim([0,plt.ylim()[1]])
# _ = plt.plot([-100, 100], [-100, 100])
plt.savefig('nn-plots/scatter-test-true-vs-predicted.png')


# error distribution plot: ------------------------------
plt.figure(3)
error = test_predictions - test_target
plt.hist(error, bins = 25)
plt.xlabel("Prediction Error [USD]")
_ = plt.ylabel("Count")
plt.savefig('nn-plots/hist-test-vs-predicted-error-distribution.png')
plt.show()

References:


  1. UCI Machine Learning Repository, Automobile Data Set: https://archive.ics.uci.edu/ml/datasets/automobile
  2. keras documentation: https://keras.io

created: 29 May 2019
today's track: Rumble Tumble by Del-30