In this project I use my own Regression and DataTable packages to predict CO2 emissions based on car characteristics.
I'll walk you through the whole process of model definition, fine-tuning and evaluation. Feel free to jump straight to the section that interests you.
I developed my own simplified Pandas-like package, data_reader.py, to fetch various kinds of data faster than Pandas can, with the main features needed for regression built in: preparing data for training/testing, splitting data, adding new features, creating combined ones, plotting and more.
In addition, I designed a regression.py package based on the gradient descent technique to train (linear, non-linear, multi-variable) regression models with various model parameters, such as the number of iterations (epoch), the learning rate (alpha) and the regularization coefficient (regul). There is also a logging feature to monitor how the regression model learns and to evaluate its performance.
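To make the training loop concrete, here is a minimal sketch of one pass of batch gradient descent with L2 regularization, following the bias-first coefficient layout you'll see in the fit outputs below. It only illustrates the idea; the function name and layout are mine, not the actual regression.py code.
# Minimal sketch of one regularized batch gradient descent step
# (illustration only, not the actual regression.py implementation)
def gradient_descent_step(w, X, y, alpha, regul):
    m = len(X)
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        # w[0] acts as the bias, the rest are feature weights
        err = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)) - yi
        grad[0] += err
        for j, xj in enumerate(xi):
            grad[j + 1] += err * xj
    w[0] -= alpha * grad[0] / m            # the bias is conventionally not regularized
    for j in range(1, len(w)):
        w[j] -= alpha * (grad[j] + regul * w[j]) / m
    return w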
I'd like to showcase all of these features and techniques in this notebook.
For further package usage, refer to the doc comments in the source code.
For this session, I'm going to use the FuelConsumption.csv file, which contains tabular data on how different car parameters affect CO2 emissions. It might be interesting to try to predict how much CO2 a particular car model will produce.
Import all the necessary libs (yeah, matplotlib is used just to make nice visualizations, and the random lib as well).
import data_reader as dr
import regression.regression as regression
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
I collected helper functions/scripts in helper_methods.py.
from helper_methods import *
To load the data, let's use DataTable (analogous to the DataFrame from Pandas) from data_reader.py.
main_data_table = dr.DataTable("data/FuelConsumption.csv")
This DataTable is well suited for small-to-medium data sizes, because the current implementation stores all data in main memory.
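Conceptually, loading boils down to reading the CSV once and keeping every column as an in-memory list. A toy sketch of the idea (the real DataTable does far more, e.g. feature/target bookkeeping):
import csv

# Toy in-memory column store (illustration only, not the real DataTable)
def load_columns(path):
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    head, body = rows[0], rows[1:]
    return {label: [row[i] for row in body] for i, label in enumerate(head)}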
main_data_table.head # returns labels of each column
['MODELYEAR', 'MAKE', 'MODEL', 'VEHICLECLASS', 'ENGINESIZE', 'CYLINDERS', 'TRANSMISSION', 'FUELTYPE', 'FUELCONSUMPTION_CITY', 'FUELCONSUMPTION_HWY', 'FUELCONSUMPTION_COMB', 'FUELCONSUMPTION_COMB_MPG', 'CO2EMISSIONS']
len(main_data_table) # shows how many rows of data we have
1067
By default, no features or target are pre-defined. Features are the set of parameters we use to train the model; the target is the outcome we want the trained model to predict.
To start, I'm interested in how engine size and fuel consumption affect CO2 emissions.
For that, I select the following features:
# To define feature/s and target:
main_data_table.activate_features(["FUELCONSUMPTION_COMB_MPG", "ENGINESIZE"])
main_data_table.select_target("CO2EMISSIONS")
Feature FUELCONSUMPTION_COMB_MPG was added
Feature ENGINESIZE was added
Target CO2EMISSIONS was added
Let's plot it.
main_data_table.plot(features2target=True)
Our features are now defined as follows:
main_data_table.features
{'FUELCONSUMPTION_COMB_MPG': <data_reader.DataTable._DataColumn at 0x1061c30a0>, 'ENGINESIZE': <data_reader.DataTable._DataColumn at 0x1061c3220>}
We can see that engine size can be fitted with a straight line, which we cannot say about the fuel consumption parameter. To showcase the capabilities of my packages, I'd like to pick the feature with the parabolic curve.
Let's remove the engine size feature from our training set.
main_data_table.deactivate_feature("ENGINESIZE")
Feature ENGINESIZE was disabled from the training set
main_data_table.features
{'FUELCONSUMPTION_COMB_MPG': <data_reader.DataTable._DataColumn at 0x1061c30a0>}
DataTable supports splitting the data into training, cross-validation (for direct model testing and selecting the best one) and test sets (for final model testing).
main_data_table.split_data(0.6, 0.2, shuffle=True)
Shuffle was done
Data was split as follows: 0.6 training set, 0.2 cross-validation set and 0.2 test set
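Under the hood, such a split boils down to shuffling the row indices and slicing them by the given fractions. A minimal sketch assuming a 0.6/0.2/0.2 split (not the actual DataTable code):
import random

# Shuffle row indices and slice them into train/cv/test parts (sketch only)
def split_indices(n, train_frac=0.6, cv_frac=0.2):
    idx = list(range(n))
    random.shuffle(idx)
    n_train, n_cv = int(n * train_frac), int(n * cv_frac)
    return idx[:n_train], idx[n_train:n_train + n_cv], idx[n_train + n_cv:]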
# fetch the freshly split data sets
training_data = main_data_table.get_training_data() # returns (features data, target data)
cv_data = main_data_table.get_cv_data()
testing_data = main_data_table.get_testing_data()
training_data[0][0:5] # feature data
[[27.0], [17.0], [39.0], [30.0], [23.0]]
training_data[1][0:5] # target data
[242.0, 259.0, 166.0, 216.0, 283.0]
data_labels = main_data_table.get_labels() # to get labels of training/cv/testing data arrays
data_labels
(['FUELCONSUMPTION_COMB_MPG'], 'CO2EMISSIONS')
Now the data is prepared for the first trials with 1 feature and 1 target.
regression_model = regression.Regression() # create model as an entity
regression_model.set_labels(data_labels) # set labels of our data into the model
regression_model.set_training_data(training_data[0], training_data[1]) # and point this model to our data
regression_model.set_testing_data(cv_data[0], cv_data[1]) # use the CV set for testing during training
True
regression_model
<regression.Regression at 0x12313cfa0>
In this section I'm going to play around a bit with the model and the data we have. Let's set the following parameters:
regression_model.ROUND_AFTER_COMA = 4 # simplify computation by rounding all results to 4 digits after the decimal point
# randomly initialize the weight coefficients in the range -10...10
regression_model.RANDOM_WEIGHT_INITIALIZATION = 10
# number of iterations: 100
regression_model.epoch = 100
# learning rate 0.5
regression_model.alpha = 0.5
# regularization coefficient is 0
regression_model.regularization = 0
Now we can fit our model on the defined training data.
coeffs = regression_model.fit() # provide model training using defined training data set
Initiated coefficients are [5, -5]
Iteration 26 done
Iteration 51 done
Iteration 76 done
Training is completed with 100 iterations
coeffs # model's trained coefficients
[-9.212008495791602e+256, -2.62457033381171e+258]
cv_features, cv_target = cv_data
training_features, training_target = training_data
plot2d_target2predict(regression_model, cv_features, cv_target,
feature_name="FUELCONSUMPTION_COMB_MPG",
target_name="CO2EMISSIONS",
feature_idx = 0, loc_place="upper left")
Wow, it seems we overshot the global (and local) minimum of the cost function: with such a large step size, every gradient descent update jumps past the minimum, so the coefficients grow without bound instead of converging.
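This divergence is easy to reproduce on a toy function: for f(w) = w^2 the gradient is 2w, and any alpha > 1 lands each step farther from the minimum than the last. A tiny illustration, unrelated to the actual model:
w = 1.0
for _ in range(5):
    w -= 1.5 * 2 * w   # alpha = 1.5 overshoots the minimum at w = 0
    print(w)           # -2.0, 4.0, -8.0, 16.0, -32.0
Let's reduce the learning rate significantly.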
regression_model.alpha = 0.01
coeffs = regression_model.fit()
Initiated coefficients are [-5, -8]
Iteration 26 done
Iteration 51 done
Iteration 76 done
Training is completed with 100 iterations
coeffs
[-1.1685903244157878e+81, -3.3294015080883016e+82]
Better: we just reduced the coefficients from e+256 down to e+81. Still bad :D
Do we need to reduce alpha even more? Let's introduce some regularization as well.
regression_model.alpha = 0.001
regression_model.regularization = 0.2
coeffs = regression_model.fit()
coeffs
Initiated coefficients are [2, 2]
Iteration 26 done
Iteration 51 done
Iteration 76 done
Training is completed with 100 iterations
[5.7349, 8.2525]
Looks much better. Let's try these coefficients.
plot2d_target2predict(regression_model, cv_features, cv_target,
feature_name="FUELCONSUMPTION_COMB_MPG",
target_name="CO2EMISSIONS",
feature_idx = 0, loc_place="upper right")
It seems we haven't fitted the model yet. Increase the epoch count and alpha?
regression_model.alpha = 0.002
regression_model.epoch = 500
coeffs = regression_model.fit()
coeffs
Initiated coefficients are [-7, -9]
Iteration 126 done
Iteration 251 done
Iteration 376 done
Training is completed with 500 iterations
[28.1806, 7.4646]
Still no. How about initializing the coefficients in the range -1...1?
coeffs = regression_model.fit(scaled_coefficients=True)
coeffs
Initiated coefficients are [-0.4, 0.5]
Iteration 126 done
Iteration 251 done
Iteration 376 done
Training is completed with 500 iterations
[33.9735, 7.2613]
This is the moment when we have to scale our data.
plot2d_target2predict(regression_model, cv_features, cv_target,
feature_name="FUELCONSUMPTION_COMB_MPG",
target_name="CO2EMISSIONS",
feature_idx = 0, loc_place="upper right")
We've hit a problem: we cannot select appropriate coefficients/parameters for the model to learn properly. At one moment we overshoot, at another we undershoot. Maybe the data just isn't ready yet?
In this section I will try to create the best 1-feature/1-target model.
Let's scale our data as x = x / max(|X|), so that both the feature and the target lie in the range -1...1. To do so, we apply the following:
main_data_table.max_scaling()
Column MODELYEAR was scaled
Column MAKE was scaled
Column MODEL was scaled
Column VEHICLECLASS was scaled
Column ENGINESIZE was scaled
Column CYLINDERS was scaled
Column TRANSMISSION was scaled
Column FUELTYPE was scaled
Column FUELCONSUMPTION_CITY was scaled
Column FUELCONSUMPTION_HWY was scaled
Column FUELCONSUMPTION_COMB was scaled
Column FUELCONSUMPTION_COMB_MPG was scaled
Column CO2EMISSIONS was scaled
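The scaling itself is one line per column. A minimal sketch of max-abs scaling, not the actual max_scaling implementation:
# Divide every value by the column's maximum absolute value (sketch only)
def max_scale(column):
    peak = max(abs(v) for v in column)
    return [v / peak for v in column] if peak else column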
main_data_table.plot(features2target=True)
Indeed, as we can see, the data was scaled properly. Let's try again with this data.
# regenerate training/cv/testing scaled data
scaled_training_data = main_data_table.get_training_data()
scaled_cv_data = main_data_table.get_cv_data()
scaled_testing_data = main_data_table.get_testing_data()
regression_model.set_training_data(scaled_training_data[0], scaled_training_data[1])
regression_model.set_testing_data(scaled_cv_data[0], scaled_cv_data[1])
True
In addition, I'll enable log writing to showcase how the model learns over time.
regression_model.log_mode(True)
Log mode is enable
Let's try the same config we had the very first time.
regression_model.epoch = 100
regression_model.alpha = 0.5
regression_model.regularization = 0
regression_model.fit()
Initiated coefficients are [-6, -3]
Iteration 26 done
Iteration 51 done
Iteration 76 done
Training is completed with 100 iterations
[0.7472, -0.5055]
cv_features, cv_target = scaled_cv_data
training_features, training_target = scaled_training_data
plot2d_target2predict(regression_model, cv_features, cv_target,
feature_name="FUELCONSUMPTION_COMB_MPG",
target_name="CO2EMISSIONS",
feature_idx = 0, loc_place="upper right")
Much better!! The Regression model has a method to evaluate the trained model, e.g. the mean absolute error metric:
regression_model.evaluation(scaled_cv_data, metric="MAE")
0.0635
Since all values are scaled to at most 1, we can naively say that the model gives almost 94% accuracy.
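For reference, the mean absolute error is just the average absolute difference between predictions and targets; a minimal stand-alone sketch of the metric (the evaluation call above computes the equivalent):
# Mean absolute error over paired predictions and targets (sketch only)
def mae(predictions, targets):
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)

mae([0.5, 0.7], [0.45, 0.8])  # -> 0.075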
Let's check what the logs show:
logs1 = regression_model.get_logs()
logs1
Logs of model settings alpha = 0.5, reg = 0
Let's use the logs to analyze how the loss functions change over time.
plt.title("Raw error")
iterations = [x for x in range(logs1.iterations)]
cost_function_training = logs1.training_cf
cost_function_cv = logs1.testing_cf
plt.scatter(iterations, cost_function_training, label="training loss")
plt.scatter(iterations, cost_function_cv, label="testing loss")
plt.legend(loc="upper right")
plt.xlabel("epoch")
plt.ylabel("Error")
plt.show()
plt.title("Scaled error")
plt.ylim([0, 0.01])
plt.scatter(iterations, cost_function_training, label="training loss")
plt.scatter(iterations, cost_function_cv, label="testing loss")
plt.legend(loc="upper right")
plt.ylabel("Error")
plt.show()
Brilliant. It seems we could increase alpha and the number of iterations (epoch) to improve the model further. But let's move forward.
We can add a new feature to make the model's line look parabolic. How about FUELCONSUMPTION_COMB_MPG^2?
To do so, we create the new feature data set as follows:
main_data_table.add_new_feature("FUELCONSUMPTION_COMB_MPG", power=2)
New created feature FUELCONSUMPTION_COMB_MPG^(2) was added
This FUELCONSUMPTION_COMB_MPG^(2) feature is added to the list of training set
main_data_table.features # shows all enabled features
{'FUELCONSUMPTION_COMB_MPG': <data_reader.DataTable._DataColumn at 0x1061c30a0>, 'FUELCONSUMPTION_COMB_MPG^(2)': <data_reader.DataTable._DataColumn at 0x12349c4c0>}
We could also set power = 0.5, which would give the square root of the selected feature. But this time we are after the power of 2.
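Creating such a feature is just an element-wise power applied to the source column. A minimal sketch of the idea, not the add_new_feature implementation:
# Raise every value of an existing feature column to a power (sketch only)
def powered_feature(column, power=2):
    return [v ** power for v in column]

powered_feature([0.5, 0.8], power=2)      # -> [0.25, 0.64]
powered_feature([0.25, 0.64], power=0.5)  # square root -> [0.5, 0.8]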
# regenerate training/cv/testing scaled data
scaled_training_data = main_data_table.get_training_data()
scaled_cv_data = main_data_table.get_cv_data()
scaled_testing_data = main_data_table.get_testing_data()
cv_features, cv_target = scaled_cv_data
training_features, training_target = scaled_training_data
regression_model.set_training_data(scaled_training_data[0], scaled_training_data[1])
regression_model.set_testing_data(scaled_cv_data[0], scaled_cv_data[1])
True
regression_model.epoch = 100
regression_model.alpha = 0.5
regression_model.regularization = 0
regression_model.fit()
Initiated coefficients are [1, 3, 2]
Iteration 26 done
Iteration 51 done
Iteration 76 done
Training is completed with 100 iterations
[0.313, 0.545, -0.1678]
regression_model.coefficients
[0.313, 0.545, -0.1678]
plot2d_target2predict(regression_model, cv_features, cv_target,
feature_name="FUELCONSUMPTION_COMB_MPG",
target_name="CO2EMISSIONS",
feature_idx = 0, loc_place="upper right")
regression_model.evaluation(scaled_cv_data, metric="MAE")
0.1502
Clearly the model is undertrained. How about increasing the epoch count?
regression_model.epoch = 2000
regression_model.fit()
Initiated coefficients are [-1, -4, -5]
Iteration 501 done
Iteration 1001 done
Iteration 1501 done
Training is completed with 2000 iterations
[0.7873, -0.2674, -0.683]
plot2d_target2predict(regression_model, cv_features, cv_target,
feature_name="FUELCONSUMPTION_COMB_MPG",
target_name="CO2EMISSIONS",
feature_idx = 0, loc_place="upper right")
regression_model.evaluation(scaled_cv_data, metric="MAE")
0.0468
Great! Now it looks more like what we want, but it's still not perfect. At first glance, we could apply some regularization to rein in the squared term.
Let's figure out what we can do.
logs_0alph5_0regul_2000epoch = regression_model.get_logs()
logs1 = logs_0alph5_0regul_2000epoch
plt.title("Raw error")
iterations = [x for x in range(logs1.iterations)]
cost_function_training = logs1.training_cf
cost_function_cv = logs1.testing_cf
plt.scatter(iterations, cost_function_training, label="training loss")
plt.scatter(iterations, cost_function_cv, label="testing loss")
plt.legend(loc="upper right")
plt.xlabel("epoch")
plt.ylabel("Error")
plt.show()
plt.title("Scaled error")
plt.ylim([0, 0.02])
plt.scatter(iterations, cost_function_training, label="training loss")
plt.scatter(iterations, cost_function_cv, label="testing loss")
plt.legend(loc="upper right")
plt.ylabel("Error")
plt.show()
From the loss-function analysis we can see that epoch >= 1500 is not necessary, so it just wastes time. Let's dig deeper into the alpha and regul parameters.
regression_model.epoch = 1500
regression_model.alpha = 0.6
regression_model.regularization = 0.1
regression_model.fit()
Initiated coefficients are [9, -8, -6]
Iteration 376 done
Iteration 751 done
Iteration 1126 done
Training is completed with 1500 iterations
[1.1425, -1.8649, 0.979]
plot2d_target2predict(regression_model, cv_features, cv_target,
feature_name="FUELCONSUMPTION_COMB_MPG",
target_name="CO2EMISSIONS",
feature_idx = 0, loc_place="upper right")
regression_model.evaluation(scaled_cv_data, metric="MAE")
0.0275
regression_model.evaluation(scaled_testing_data, metric="MAE")
0.032
So the final model is
CO2EMISSIONS = 1.1425 - 1.8649 * FUELCONSUMPTION_COMB_MPG + 0.979 * FUELCONSUMPTION_COMB_MPG^2,
which gives an MAE of only 0.0275, i.e. about a 2.75% error. Let's store this model for further use.
model_1_1 = regression_model
model_1_1.coefficients
[1.1425, -1.8649, 0.979]
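As a sanity check, the fitted polynomial can also be evaluated by hand. Remember that both the feature and the target are max-scaled here, so the input and output are in the 0...1 range:
# Evaluate the stored quadratic model on a scaled MPG value (sanity check)
def predict_co2_scaled(mpg_scaled):
    return 1.1425 - 1.8649 * mpg_scaled + 0.979 * mpg_scaled ** 2

predict_co2_scaled(0.5)  # -> 0.4548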
How about adding more features to our model? What minimum MAE might we get? Let's explore.
In this section I'm going to build a more complex model with far more features than before.
Let's create a brand new model to which the additional features will be applied.
model_all = regression.Regression()
We can reuse the same scaled data. Let's add new features.
# current enabled features are:
main_data_table.features
{'FUELCONSUMPTION_COMB_MPG': <data_reader.DataTable._DataColumn at 0x1061c30a0>, 'FUELCONSUMPTION_COMB_MPG^(2)': <data_reader.DataTable._DataColumn at 0x12349c4c0>}
main_data_table.head
['MODELYEAR', 'MAKE', 'MODEL', 'VEHICLECLASS', 'ENGINESIZE', 'CYLINDERS', 'TRANSMISSION', 'FUELTYPE', 'FUELCONSUMPTION_CITY', 'FUELCONSUMPTION_HWY', 'FUELCONSUMPTION_COMB', 'FUELCONSUMPTION_COMB_MPG', 'CO2EMISSIONS', 'FUELCONSUMPTION_COMB_MPG^(2)']
Which features might be suitable for us? Which ones might improve the quality of the model's predictions? Let's plot the data to find out.
main_data_table.plot(all2target=True)
We can see that CO2 emissions depend strongly on engine size, cylinders and fuel consumption in their different variations.
Note that the other plots, with straight lines or a single x-axis value, indicate that the data is not numerical.
main_data_table.activate_features(["ENGINESIZE", "CYLINDERS", "FUELCONSUMPTION_COMB"])
Feature ENGINESIZE was added
Feature CYLINDERS was added
Feature FUELCONSUMPTION_COMB was added
# the list of enabled features is:
main_data_table.features
{'FUELCONSUMPTION_COMB_MPG': <data_reader.DataTable._DataColumn at 0x1061c30a0>, 'FUELCONSUMPTION_COMB_MPG^(2)': <data_reader.DataTable._DataColumn at 0x12349c4c0>, 'ENGINESIZE': <data_reader.DataTable._DataColumn at 0x1061c3220>, 'CYLINDERS': <data_reader.DataTable._DataColumn at 0x1061c3160>, 'FUELCONSUMPTION_COMB': <data_reader.DataTable._DataColumn at 0x1061c32b0>}
# Regenerate new data for training/testing
scaled_training_data = main_data_table.get_training_data()
scaled_cv_data = main_data_table.get_cv_data()
scaled_testing_data = main_data_table.get_testing_data()
data_labels = main_data_table.get_labels()
model_all.set_labels(data_labels)
model_all.set_training_data(scaled_training_data[0], scaled_training_data[1])
model_all.set_testing_data(scaled_cv_data[0], scaled_cv_data[1])
model_all.labels
(['FUELCONSUMPTION_COMB_MPG', 'FUELCONSUMPTION_COMB_MPG^(2)', 'ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB'], 'CO2EMISSIONS')
model_all.set_model_parameters(alpha=0.3, regularization=0.1, epoch=1500)
model_all.RANDOM_WEIGHT_INITIALIZATION = 10
model_all.fit()
Initiated coefficients are [1, 10, -2, -8, 10, 10]
Iteration 376 done
Iteration 751 done
Iteration 1126 done
Training is completed with 1500 iterations
[-5.09, 6.99, -3.54, -8.33, 7.47, 6.67]
cv_features, cv_target = scaled_cv_data[0], scaled_cv_data[1]
features, target = model_all.labels
for feature_idx in range(len(features)):
plot2d_target2predict(model_all, cv_features, cv_target,
feature_name=features[feature_idx],
target_name=target, feature_idx=feature_idx, loc_place="upper right")
model_all.evaluation(scaled_testing_data)
0.36
Well, it can do better! Let's fine-tune the model.
model_all.set_model_parameters(alpha=0.7, regularization=0.1, epoch=1500)
model_all.fit(scaled_coefficients=True)
Initiated coefficients are [0.9, 0.4, -0.5, -0.2, 0.3, 0.7]
Iteration 376 done
Iteration 751 done
Iteration 1126 done
Training is completed with 1500 iterations
[0.37, 0.13, -0.61, -0.24, 0.18, 0.53]
for feature_idx in range(len(features)):
plot2d_target2predict(model_all, cv_features, cv_target,
feature_name=features[feature_idx],
target_name=target, feature_idx=feature_idx, loc_place="upper right")
model_all.evaluation(scaled_testing_data)
0.05
In this work we used gradient descent to train models with different features, learning rates and regularization parameters, to find out how they impact the model's learning process.
We predicted CO2 emissions within a 5% error (relative to the maximum target value) based on the mean absolute error metric, using related features!! This model can be improved even further (at one point I got 4%), but that's out of the scope of this work.
But the interesting thing is that with only the fuel consumption parameter we could predict CO2 emissions within a 3% error! It shows how important feature selection is.