AI: Regression

Regression usage on Fuel Consumption vs CO2 Emission data

In this project I use the Regression and DataTable packages I designed myself to predict CO2 emissions based on car characteristics.

I'll walk you through the process of model definition, fine-tuning, and evaluation. Feel free to jump straight to the section that interests you:

  1. Section 1 - preparation
  2. Section 2 - model config
  3. Section 3 - first try
  4. Section 4 - real game
  5. Section 5 - more features - more fun
  6. Section 6 - results

Introduction

I developed my own simplified Pandas-like package, data_reader.py, which fetches different kinds of data faster than Pandas can and provides the main features needed for regression AI: preparing data for training/testing, splitting data sets, adding new features, creating combined ones, plotting, and many others.

In addition, to strengthen the regression side, I designed a regression.py package based on the gradient descent technique. It trains (linear, non-linear, multi-variable) regression models with various model parameters, such as the number of iterations (epoch), the learning rate (alpha), and the regularization coefficient (regul). Moreover, there is a logging feature to monitor the model's learning and evaluate its performance.
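For reference, the core of gradient descent with L2 regularization boils down to the update rule below. This is a minimal NumPy sketch that mirrors the epoch, alpha, and regul parameters mentioned above; it illustrates the technique, not the actual regression.py implementation.

```python
import numpy as np

def gradient_descent(X, y, epoch=1000, alpha=0.01, regul=0.0):
    """Minimal batch gradient descent for linear regression with L2 regularization.

    Illustrative sketch only -- not the regression.py implementation.
    """
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])  # prepend a bias column
    w = np.zeros(n + 1)                   # coefficients, bias included
    for _ in range(epoch):
        error = Xb @ w - y                # predictions minus targets
        grad = (Xb.T @ error) / m         # gradient of the MSE loss
        grad[1:] += (regul / m) * w[1:]   # L2 penalty (bias not regularized)
        w -= alpha * grad                 # step against the gradient
    return w
```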

I would like to demonstrate all these features and techniques in this notebook.

For additional package usage, refer to the documentation in the source code.

For this session, I'm going to use the FuelConsumtion.csv file, which contains tabular data describing how different car parameters affect CO2 emissions. It might be interesting to try to predict how much CO2 a particular car model will produce.

Section 1 - preparation

Import all the necessary libs (yeah, matplotlib will be used just to make nice visualizations. And the random lib as well).

I collected helper functions/scripts in helper_methods.py.

To load the data, let's use DataTable (analogous to a Pandas DataFrame) from data_reader.py.
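Loading might look roughly like this. The constructor shown here is a hypothetical sketch; the actual DataTable API may differ.

```python
# Hypothetical usage sketch -- the real DataTable constructor may differ.
from data_reader import DataTable

data = DataTable("FuelConsumtion.csv")  # load the whole CSV into memory
```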


This DataTable is well suited for small-to-medium data sizes, because the current implementation keeps all the data in main memory.

Features and target selection

By default, no features or target are pre-defined. Features are the set of parameters we use to train our model; the target is the outcome we want our trained model to predict.

To start with, I'm interested in how engine size and fuel consumption impact CO2 emissions.

For that, I select the following features:
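A hypothetical sketch of that selection: the method names are assumptions, FUELCONSUMPTION_COMB_MPG and CO2EMISSIONS appear later in this notebook, and ENGINESIZE is an assumed column name for engine size.

```python
# Hypothetical sketch -- method names are assumptions, not the real API.
data.set_features(["ENGINESIZE", "FUELCONSUMPTION_COMB_MPG"])
data.set_target("CO2EMISSIONS")
```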

Let's plot it.

Our features are now defined like this:

We can see that engine size can be fitted with a straight line, which we cannot say about the fuel consumption parameter. To showcase the capabilities of my packages, I would like to select the parabolic-looking feature.

Let's delete the engine size feature from our training data set.

Training/CV/Testing data sets

DataTable supports splitting data into training, cross-validation (for direct model testing and selecting the best candidate), and testing sets (for the final model test).
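As a rough illustration of such a split, here is a generic NumPy sketch with an assumed 60/20/20 ratio; it is not DataTable's actual API.

```python
import numpy as np

def split_data(X, y, train=0.6, cv=0.2, seed=42):
    """Shuffle and split arrays into training / cross-validation / testing sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(train * len(X))
    n_cv = int(cv * len(X))
    i_tr, i_cv, i_te = np.split(idx, [n_train, n_train + n_cv])
    return (X[i_tr], y[i_tr]), (X[i_cv], y[i_cv]), (X[i_te], y[i_te])
```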

Now the data is prepared for the first trials with 1 feature and 1 target.

Section 2 - model config
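The configuration cell itself is not reproduced here. A hypothetical sketch of what defining the model might look like: the class name and keyword arguments are assumptions based on the parameters described in the introduction, and the values are placeholders, not the ones actually used.

```python
# Hypothetical sketch -- the real regression.py API may differ.
from regression import Regression

model = Regression(epoch=1000, alpha=0.1, regul=0.0)  # placeholder values
```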

Section 3 - first try

In this section I'm going to play a little with the model and the data we have. Let's set the following parameters:

Now we can fit our model on the prepared training data.
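Continuing the hypothetical sketch from Section 2, the fit call might look like this (the method and variable names are assumptions):

```python
# Hypothetical sketch -- assumed method/variable names.
model.fit(X_train, y_train)  # run gradient descent on the training split
```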

Wow, it seems like we overshot the global (and the local) minimum of our loss function, and the coefficients blew up rapidly. Let's reduce the learning rate significantly.

Better: we reduced the coefficients from e+256 down to e+81. Still bad :D

Do we need to reduce alpha even more? Let's introduce some regularization as well.

Looks much better. Let's try these coefficients.

It seems like we still didn't fit the model. Should we increase epoch and alpha?

Still no luck. How about initializing the starting coefficients in the range -1...1?

This is the moment when we have to scale our data.

We've run into the problem that we cannot select appropriate coefficients/parameters for the model to learn properly: at one moment we overshoot, at another we undershoot. Maybe the data still isn't ready?

Section 4 - real game

In this section I will try to create the best 1-feature, 1-target model.

Let's scale our data as x = x / max(|X|), so that both the feature and the target fall in the range -1...1. To do so, we apply the following:
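In plain NumPy terms, this max-abs scaling amounts to the sketch below (in the notebook itself the scaling goes through my packages):

```python
import numpy as np

def max_abs_scale(x):
    """Scale values into [-1, 1] by dividing by the largest absolute value."""
    return x / np.max(np.abs(x))

# Feature and target are assumed to be NumPy arrays here.
X_scaled = max_abs_scale(X)
y_scaled = max_abs_scale(y)
```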

Indeed, as we can see, the data was scaled properly. Let's try with this data now.

Additionally, I'll enable log writing to show how the model learns over time.

Let's try the same config we had at the very beginning.

Much better!! The Regression model has a method to evaluate the trained model; it uses the mean absolute error (MAE) metric:
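For reference, MAE is the standard formula below, shown as a small NumPy sketch rather than the package's own method:

```python
import numpy as np

def mean_absolute_error(y_pred, y_true):
    """MAE = (1/m) * sum(|y_pred - y_true|)."""
    return np.mean(np.abs(y_pred - y_true))
```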

Since all values are scaled to at most 1, we can naively say that the model is almost 94% accurate.

Let's check what the logs show:

Let's use the logs to analyse how the loss function changes over time.
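A sketch of such a plot with matplotlib; the per-epoch loss values here are dummies standing in for the ones parsed from the logs (the actual log format is not shown here):

```python
import matplotlib.pyplot as plt

# Dummy per-epoch loss values standing in for those parsed from the logs.
losses = [0.50, 0.31, 0.22, 0.17, 0.14, 0.12, 0.11, 0.10]

plt.plot(losses)
plt.xlabel("epoch")
plt.ylabel("loss")
plt.title("Training loss over time")
plt.show()
```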

Brilliant. It seems we could increase alpha and the number of iterations (epoch) to improve the model. But let's move forward.

Square feature

We can add a new feature to make the model's curve parabolic. How about FUELCONSUMPTION_COMB_MPG^2?

To do so, we create a new feature column as follows:
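A hypothetical sketch of that step; the method name is an assumption, but the power parameter is described right below (power=2 squares the feature, power=0.5 takes its square root):

```python
# Hypothetical sketch -- the method name is an assumption.
# Equivalent plain-NumPy operation: x_sq = x ** 2
data.add_feature("FUELCONSUMPTION_COMB_MPG", power=2)
```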

Alternatively, we can set power = 0.5, which gives the square root of the selected feature. But this time we are looking for a power of 2.

We can clearly see that the model is undertrained. How about increasing epoch?

Great! Now it looks more like what we want, but it's still not perfect. At first glance, we could add some regularization to damp the squared term.

Let's figure out what we can do.

From the loss function analysis we can see that epoch >= 1500 is unnecessary and just wastes time. Let's dive deeper into the alpha and regul parameters.
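One simple way to explore them is a small grid search scored on the cross-validation set. This is a generic sketch: the Regression constructor, fit, and predict names are assumptions (as are the candidate values), and it reuses the mean_absolute_error helper sketched earlier.

```python
# Hypothetical grid search sketch -- API names and values are assumptions.
best = None
for alpha in (0.01, 0.03, 0.1, 0.3):
    for regul in (0.0, 0.01, 0.1, 1.0):
        model = Regression(epoch=1500, alpha=alpha, regul=regul)
        model.fit(X_train, y_train)
        mae = mean_absolute_error(model.predict(X_cv), y_cv)  # score on CV set
        if best is None or mae < best[0]:
            best = (mae, alpha, regul)

print("best (MAE, alpha, regul):", best)
```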

1 feature - 1 target model results

The best trained model is:

CO2EMISSIONS = 1.1425 - 1.8649 FUELCONSUMPTION_COMB_MPG + 0.979 FUELCONSUMPTION_COMB_MPG^2

which gives only 0.0275 MAE, or a 2.75% error. Let's store this model for later use.
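One generic way to persist the trained model is Python's standard pickle module (the package may well have its own save mechanism):

```python
import pickle

# Persist the trained model object so it can be reloaded later.
with open("best_1feature_model.pkl", "wb") as f:
    pickle.dump(model, f)
```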

Section 5 - more features - more fun

How about adding more features to our model? What minimum MAE might we get? Let's explore.

In this section I'm going to build a more complex model with way more features than before.

Let's create a brand new model to which the additional features will be applied.

We can reuse the same scaled data. Let's add new features.

Which features might be suitable for us? Which ones might improve the quality of the model's predictions? Let's plot the data to find out.
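A sketch of such exploratory plotting with matplotlib; the data[...] column access is an assumed API, and ENGINESIZE/CYLINDERS are assumed column names for the parameters discussed below:

```python
import matplotlib.pyplot as plt

# Assumed column names and data[...] access -- illustration only.
candidates = ["ENGINESIZE", "CYLINDERS", "FUELCONSUMPTION_COMB_MPG"]
fig, axes = plt.subplots(1, len(candidates), figsize=(12, 4))
for ax, name in zip(axes, candidates):
    ax.scatter(data[name], data["CO2EMISSIONS"], s=5)
    ax.set_xlabel(name)
    ax.set_ylabel("CO2EMISSIONS")
plt.tight_layout()
plt.show()
```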

We can see that CO2 emissions are highly dependent on engine size, cylinders, and the fuel consumption measures in their different variations.

Note that the other plots, the ones showing straight lines or a single x-axis value, indicate that the data in question is not numerical.

Well, it can be better! Let's fine-tune the model.

Section 6 - results

In this work we used the gradient descent algorithm to train models with different features, learning rates, and regularization parameters, to find out how they impact the model's learning process.

We could predict CO2 emissions within a 5% error (relative to the maximum target value) based on the mean absolute error metric, using the related features!! This model can be improved even further (at one point I got 4%), but that's out of the scope of this work.

But the interesting thing is that with only the fuel consumption parameter we could predict CO2 emissions within just 3% error! It shows how important feature selection is.