ML Case Studies 1 - UCI Auto Data (Linear Regression)

Contents


  1. Dataset Details
  2. Goal
  3. Python ML Packages
  4. Machine Learning Pipeline
  5. Data Pre-processing
  6. Exploratory Data Analysis
  7. Test-Train Split
  8. Algorithm Setup
  9. Model Fitting
  10. Evaluation
  11. Model Re-train
  12. Conclusions
  13. References

Dataset Details:


dataset headers
  1. symboling: -3, -2, -1, 0, 1, 2, 3
  2. normalized-losses: continuous from 65 to 256
  3. make: alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo
  4. fuel-type: diesel, gas
  5. aspiration: std, turbo
  6. num-of-doors: four, two
  7. body-style: hardtop, wagon, sedan, hatchback, convertible
  8. drive-wheels: 4wd, fwd, rwd
  9. engine-location: front, rear
  10. wheel-base: continuous from 86.6 to 120.9
  11. length: continuous from 141.1 to 208.1
  12. width: continuous from 60.3 to 72.3
  13. height: continuous from 47.8 to 59.8
  14. curb-weight: continuous from 1488 to 4066
  15. engine-type: dohc, dohcv, l, ohc, ohcf, ohcv, rotor
  16. num-of-cylinders: eight, five, four, six, three, twelve, two
  17. engine-size: continuous from 61 to 326
  18. fuel-system: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi
  19. bore: continuous from 2.54 to 3.94
  20. stroke: continuous from 2.07 to 4.17
  21. compression-ratio: continuous from 7 to 23
  22. horsepower: continuous from 48 to 288
  23. peak-rpm: continuous from 4150 to 6600
  24. city-mpg: continuous from 13 to 49
  25. highway-mpg: continuous from 16 to 54
  26. price: continuous from 5118 to 45400

Goal:



Python ML Packages:


    # for data import and data wrangling:
    import numpy as np
    import pandas as pd
    
    # for exploratory data analysis:
    import seaborn as sns
    from matplotlib import pyplot as plt

    # for test-train split:
    from sklearn.model_selection import train_test_split

    # for linear regression:
    from sklearn.linear_model import LinearRegression

    # for K-fold cross-validation and prediction:
    from sklearn.model_selection import cross_val_predict
    from sklearn.model_selection import cross_val_score

    # for mean-squared-error:
    from sklearn.metrics import mean_squared_error

    # for saving trained models to disk:
    import pickle



Machine Learning Pipeline:


Pipeline


Data Pre-processing:


  1. set path for data file and load data

     data_path = 'imports-85.data'
     # the data file has no header row; header=None keeps the first record as data
     df = pd.read_csv(data_path, header=None)
    
  2. inspect loaded data

     print(df.head(5))
     print(df.tail(5))
     df.info()  # info() is a method and prints its summary directly
    
  3. check if headers are clean, if not assign the right headers

    • in this dataset, the headers are provided in a separate file, so they need to be added to the dataframe

        header = ['symboling','normalized-losses','make','fuel-type','aspiration','num-of-doors','body-style','drive-wheels','engine-location','wheel-base','length','width','height','curb-weight','engine-type','num-of-cylinders','engine-size','fuel-system','bore','stroke','compression-ratio','horsepower','peak-rpm','city-mpg','highway-mpg','price']
      
        df.columns = header
      
  4. missing value clean-up:

    • missing values in this dataset are denoted by '?'; replace them with numpy NaN values, then drop the corresponding observations

        df.replace('?',np.nan, inplace=True)
        df.dropna(inplace=True)
      
  5. data-type cleanup:

    • inspect datatypes of each column

        print(df.head(5))
        print(df.dtypes)
      
    • numpy datatypes can be passed to astype()
    • the pandas object dtype corresponds to python's str datatype
    • the following datatype corrections are applied in this write-up

        df['normalized-losses'] = df['normalized-losses'].astype('int')
        df['bore'] = df['bore'].astype('float')
        df['stroke'] = df['stroke'].astype('float')
        df['horsepower'] = df['horsepower'].astype('float')
        df['peak-rpm'] = df['peak-rpm'].astype('int')
        df['price'] = df['price'].astype('int')
      
    • review datatype of each column to confirm datatype conversion

        print(df.dtypes)
      
  6. short statistical summary as a common-sense test for dataset values

         print(df.describe(include="all"))
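
The pre-processing steps above can be tried end to end without the data file. The sketch below is only an illustration: the three-row sample and the `sample` dataframe are hypothetical, cut down to four of the 26 columns, and are not taken from imports-85.data.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row sample in the same header-less CSV layout as
# imports-85.data, reduced to four columns for brevity.
raw = io.StringIO(
    "3,?,alfa-romero,13495\n"
    "2,164,audi,13950\n"
    "1,158,audi,?\n"
)

# header=None keeps the first record as data instead of using it as column names
sample = pd.read_csv(raw, header=None)
sample.columns = ['symboling', 'normalized-losses', 'make', 'price']

# mark '?' as missing, then drop every row that has a missing value
sample = sample.replace('?', np.nan).dropna()

# columns that contained '?' were read as text, so cast them explicitly
sample['normalized-losses'] = sample['normalized-losses'].astype('int')
sample['price'] = sample['price'].astype('int')

print(sample)
print(sample.dtypes)
```

Only the middle row survives here, because the other two each contain a '?'. Dropping rows is the approach this write-up takes; it trades sample size for simplicity.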
    

Exploratory Data Analysis:


continuous numerical variables:
categorical variables:
insights:

some insights gained from exploratory data analysis:

python code

# continuous variables: correlation coefficients

    print(df[["engine-size", "price"]].corr())
    sns.regplot(x="engine-size", y="price", data=df)

    print(df[["horsepower", "price"]].corr())
    sns.regplot(x="horsepower", y="price", data=df)

    print(df[["highway-mpg", "price"]].corr())
    sns.regplot(x="highway-mpg", y="price", data=df)

    print(df[["curb-weight", "price"]].corr())
    sns.regplot(x="curb-weight", y="price", data=df)

    print(df[["symboling", "price"]].corr())
    sns.regplot(x="symboling", y="price", data=df)

    print(df[["normalized-losses", "price"]].corr())
    sns.regplot(x="normalized-losses", y="price", data=df)



# categorical variables

    print(df["body-style"].value_counts())
    sns.boxplot(x="body-style", y="price", data=df)

    print(df["aspiration"].value_counts())
    sns.boxplot(x="aspiration", y="price", data=df)

    print(df["fuel-system"].value_counts())
    sns.boxplot(x="fuel-system", y="price", data=df)

    print(df["drive-wheels"].value_counts())
    sns.boxplot(x="drive-wheels", y="price", data=df)

    print(df["make"].value_counts())
    sns.boxplot(x="make", y="price", data=df)



Test-Train Split:
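
A minimal sketch of the split, using the train_test_split import listed earlier. The names x_train and y_train match the plotting code in the Model Fitting section; everything else is an assumption made for illustration: the synthetic `demo` dataframe, the 25% test share, and random_state=0.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the cleaned dataframe; the slope, offset and noise
# are arbitrary numbers, not results from the UCI data.
rng = np.random.default_rng(0)
engine_size = rng.integers(61, 327, size=100)
demo = pd.DataFrame({
    'engine-size': engine_size,
    'price': 170 * engine_size - 8000 + rng.normal(0, 2000, size=100),
})

# double brackets keep both predictor and target as one-column dataframes
x_data = demo[['engine-size']]
y_data = demo[['price']]

# hold out 25% of the rows for testing (assumed ratio);
# random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.25, random_state=0)

print(x_train.shape, x_test.shape)
```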



Algorithm Setup:


Simple Linear Regression
Multiple Linear Regression
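
The two setups named above differ only in how many predictor columns are passed to the estimator. The sketch below assumes synthetic data with a known linear relation so the fitted coefficients can be checked; `lm_multi` and the `demo` dataframe are hypothetical names, and the choice of curb-weight as the second predictor is only an example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: price is an exact linear function of two predictors,
# with arbitrary coefficients chosen for this illustration.
rng = np.random.default_rng(1)
n = 80
demo = pd.DataFrame({
    'engine-size': rng.uniform(61, 326, n),
    'curb-weight': rng.uniform(1488, 4066, n),
})
demo['price'] = 100.0 * demo['engine-size'] + 5.0 * demo['curb-weight'] + 1000.0

# simple linear regression: a single predictor column
lm_engine_size = LinearRegression()
lm_engine_size.fit(demo[['engine-size']], demo[['price']])

# multiple linear regression: several predictor columns
lm_multi = LinearRegression()
lm_multi.fit(demo[['engine-size', 'curb-weight']], demo[['price']])

# with a one-column dataframe as target, coef_ is 2-D and intercept_ is 1-D
print(lm_engine_size.coef_.shape, lm_multi.coef_.shape)
print(lm_multi.intercept_, lm_multi.coef_)
```

Passing the target as a one-column dataframe is what makes coef_ two-dimensional, which is the shape the `y_vals[0]` indexing in the plotting code of the next section relies on.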

Model Fitting:



fig: engine-size:price linear estimator model

    plt.figure(0)
    plt.scatter(x_train[['engine-size']], y_train)
    plt.xlabel('engine-size')
    plt.ylabel('price')

    x_bounds = plt.xlim()
    y_bounds = plt.ylim()
    print(x_bounds, y_bounds)

    x_vals = np.linspace(x_bounds[0],x_bounds[1],num=50)
    y_vals = lm_engine_size.intercept_ + lm_engine_size.coef_ * x_vals
    print(x_vals, y_vals)

    plt.plot(x_vals, y_vals[0], '--')

    plt.title('Engine-Size based Linear Price Estimator')

    plt.savefig('plots/03-model-engine-size.png')
    plt.close()
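
The package list imports pickle for saving trained models to disk. A minimal sketch of that round trip follows; the toy model, the temporary directory, and the file name are assumptions made so the example can run anywhere.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model fitted on made-up points, standing in for a trained estimator
x = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * x.ravel() + 2.0
model = LinearRegression().fit(x, y)

# write the model to a temporary directory, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'lm_engine_size.pkl')  # hypothetical file name
    with open(model_path, 'wb') as f:
        pickle.dump(model, f)
    with open(model_path, 'rb') as f:
        restored = pickle.load(f)

# the restored model should reproduce the original predictions
same_predictions = np.allclose(model.predict(x), restored.predict(x))
print(same_predictions)
```

Pickled models should only be loaded from trusted sources, and generally with the same scikit-learn version that wrote them.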

Evaluation:


evaluation metrics:


visual evaluation:

Residual Plots (in-sample evaluation)
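
A residual plot shows the difference between true and predicted values against the predictor; for an in-sample check the model is evaluated on the same rows it was fitted on. The sketch below is one way to draw it with matplotlib, assuming synthetic data (arbitrary slope, offset and noise) in place of the engine-size and price columns.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, suitable when figures go to files
from matplotlib import pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data standing in for engine-size (x) and price (y)
rng = np.random.default_rng(2)
x = rng.uniform(61, 326, size=(100, 1))
y = 170.0 * x[:, 0] - 8000.0 + rng.normal(0, 2000, size=100)

lm = LinearRegression().fit(x, y)

# in-sample residuals: true value minus prediction on the training rows
residuals = y - lm.predict(x)

fig, ax = plt.subplots()
ax.scatter(x[:, 0], residuals)
ax.axhline(0, linestyle='--')  # reference line for a perfect prediction
ax.set_xlabel('engine-size')
ax.set_ylabel('residual (true price - predicted price)')
ax.set_title('In-sample residuals')
plt.close(fig)
```

Residuals scattered evenly around zero are consistent with a linear fit; a visible curve or funnel shape would suggest the linear model is missing structure.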


residual plot insights

engine-size:price - model evaluation

Distribution Plots

R2 evaluation:
MSE evaluation:
R2 and MSE insights:
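
A minimal sketch of how both metrics can be computed with the packages imported earlier. The data is synthetic (arbitrary slope, offset and noise), and the 25% test share is an assumption, so the printed numbers say nothing about the auto dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for engine-size (x) and price (y)
rng = np.random.default_rng(3)
x = rng.uniform(61, 326, size=(120, 1))
y = 170.0 * x[:, 0] - 8000.0 + rng.normal(0, 2000, size=120)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)
lm = LinearRegression().fit(x_train, y_train)

# R2: LinearRegression.score returns the coefficient of determination
r2_train = lm.score(x_train, y_train)
r2_test = lm.score(x_test, y_test)

# MSE: mean of the squared differences between true and predicted values
mse_test = mean_squared_error(y_test, lm.predict(x_test))

print('R2 (train):', r2_train)
print('R2 (test): ', r2_test)
print('MSE (test):', mse_test)
```

R2 is unitless, while MSE is in squared price units, so MSE values are only comparable between models that predict the same target.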

Model Re-train:

K-fold Schematic


K-fold Cross-Validation
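
A sketch of K-fold scoring with cross_val_score from the package list, assuming synthetic data and 4 folds. For a regressor the default score is R2, and each fold is held out once while the model is trained on the remaining folds.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data standing in for engine-size (x) and price (y)
rng = np.random.default_rng(4)
x = rng.uniform(61, 326, size=(120, 1))
y = 170.0 * x[:, 0] - 8000.0 + rng.normal(0, 2000, size=120)

# one R2 score per fold; cv=4 is an assumed fold count
scores = cross_val_score(LinearRegression(), x, y, cv=4)

print('fold scores:', scores)
print('mean:', scores.mean(), 'std:', scores.std())
```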


K-fold Predictions

fig: 2-fold trained model predictions vs. true values


fig: 3-fold trained model predictions vs. true values


fig: 4-fold trained model predictions vs. true values


fig: 5-fold trained model predictions vs. true values
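
Predictions like those in the figures above can be produced with cross_val_predict, which returns, for every row, the prediction made by the model that did not see that row during training. The sketch below loops over 2 to 5 folds on synthetic data; the `mse_by_k` dictionary is a hypothetical helper for comparing the fold counts.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict

# Hypothetical data standing in for engine-size (x) and price (y)
rng = np.random.default_rng(5)
x = rng.uniform(61, 326, size=(120, 1))
y = 170.0 * x[:, 0] - 8000.0 + rng.normal(0, 2000, size=120)

mse_by_k = {}
for k in (2, 3, 4, 5):
    # out-of-fold prediction for every row, same length as y
    y_hat = cross_val_predict(LinearRegression(), x, y, cv=k)
    mse_by_k[k] = mean_squared_error(y, y_hat)

for k, mse in mse_by_k.items():
    print(k, 'folds -> MSE', mse)
```

Plotting y_hat against y for each k gives a predictions vs. true values chart; points close to the diagonal indicate good predictions.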


Conclusions:



References:



created: 21 May 2019
today's track: Lights (Bassnectar Remix) by Ellie Goulding