AI: Logistic Regression

Classification problem with LR: should we give a loan or not?

In this project I use my own Logistic Regression and DataTable packages to predict whether a person with given attributes will pay a loan back.

I'll walk you through the process of model definition, fine-tuning and evaluation. You might want to jump straight to the section that interests you.

Before exploring Logistic Regression, I would recommend starting with Regression here: https://github.com/kotsky/ai-dev/blob/main/regression_workflow.ipynb, because this work is an extension of Regression and is covered only briefly.

  1. Section 1 - preparation
  2. Section 2 - model building
  3. Section 3 - training
  4. Section 4 - evaluation
  5. Section 5 - results

Introduction

I developed my own simplified, Pandas-like package, data_reader.py, to load different kinds of data faster than Pandas does. It contains the main features needed for Logistic Regression: preparing data for training and testing, splitting data, adding new features, creating combined features, plotting, and many others.

In addition, to deepen my knowledge of AI regression models, I designed a logistic_regression.py package based on the sigmoid activation function and gradient descent. It trains a (linear, non-linear, multi-variable) regression model with configurable parameters such as the number of iterations (epoch), the learning rate (alpha), and the regularization coefficient (regul). It also includes a logging feature to monitor how the model learns and to evaluate its performance.
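To make the training technique concrete, here is a minimal, standard-library-only sketch of logistic regression trained with batch gradient descent and an L2 penalty. It is not the code of logistic_regression.py: the function names (`sigmoid`, `train_logistic`, `predict_proba`) and the toy data are assumptions for illustration, and only the parameter names epoch, alpha and regul are borrowed from the text above.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    """Probability of class 1 for a single feature vector x."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def train_logistic(X, y, epoch=1000, alpha=0.1, regul=0.0):
    """Batch gradient descent on the log loss with an L2 penalty (hypothetical helper)."""
    m, n = len(X), len(X[0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epoch):
        grad_w, grad_b = [0.0] * n, 0.0
        for xi, yi in zip(X, y):
            error = predict_proba(weights, bias, xi) - yi
            for j in range(n):
                grad_w[j] += error * xi[j]
            grad_b += error
        # The penalty shrinks the weights; the bias is left unregularized.
        for j in range(n):
            weights[j] -= alpha * (grad_w[j] + regul * weights[j]) / m
        bias -= alpha * grad_b / m
    return weights, bias


# Toy data (made up): one feature, class 1 for the larger values.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
weights, bias = train_logistic(X, y, epoch=2000, alpha=0.5)
low = predict_proba(weights, bias, [0.0])
high = predict_proba(weights, bias, [3.0])
```

A larger regul pulls the weights toward zero, which tends to help when many features are only weakly related to the target.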

I would like to demonstrate all of these features and techniques in this notebook.

For additional package usage, refer to the documentation comments in the source code.

For this session, I'm going to use the /data/loan_train.csv file, which contains tabular data on cases where a person did or did not pay back a loan. It might be interesting to try to predict whether a given person should get a loan.

Section 1 - preparation

image-2.png

Loan_status is going to be our target, the value we try to predict. It has 2 options: PAIDOFF and COLLECTION. DataTable reads the csv file so that words (string data) are converted to numbers, which we can easily convert back.

So, COLLECTION is 0, and PAIDOFF is 1. With that, let's move further. What about features?
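The string-to-number conversion can be sketched as follows. This is not DataTable's implementation: `encode_column` and `decode_column` are hypothetical helpers, and assigning codes in sorted order is an assumption that happens to reproduce the mapping above (COLLECTION is 0, PAIDOFF is 1).

```python
def encode_column(values):
    """Map each distinct string to an integer code, assigned in sorted order."""
    mapping = {label: code for code, label in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def decode_column(codes, mapping):
    """Restore the original strings from their integer codes."""
    reverse = {code: label for label, code in mapping.items()}
    return [reverse[c] for c in codes]


# Made-up sample of the target column.
statuses = ["PAIDOFF", "COLLECTION", "PAIDOFF"]
codes, mapping = encode_column(statuses)
restored = decode_column(codes, mapping)
```

Keeping the mapping next to the encoded column is what makes the conversion reversible.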

From the picture above we can see an interesting relationship between education and age. It makes sense to create a new feature, education * age.

Well, we can see that there is no strongly marked relationship between the features and the target. It might be hard to predict loan repayment with high accuracy. But we will try.
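A combined feature of this kind is just an element-wise product of two existing columns. The sketch below assumes rows stored as dictionaries with already-encoded numeric values; `add_product_feature` and the sample rows are hypothetical and do not reflect the data_reader.py API.

```python
def add_product_feature(rows, first, second, name=None):
    """Return new rows with an extra column holding rows[first] * rows[second]."""
    name = name or f"{first}*{second}"
    return [{**row, name: row[first] * row[second]} for row in rows]


# Made-up rows: education is assumed to be an integer code, age is in years.
rows = [
    {"education": 2, "age": 30},
    {"education": 0, "age": 45},
]
extended = add_product_feature(rows, "education", "age")
```

The original rows are left unchanged, so the new feature can be dropped again if it does not improve the model.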

Section 2 - model building

Section 3 - training

Section 4 - evaluation

Indeed, 0 and 1 are mixed up, so it's going to be hard to predict properly. Let's set the logistic threshold to 0.325, the middle of our plot, and check the resulting errors with a confusion matrix.
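As an illustration of how a threshold turns predicted probabilities into a confusion matrix, here is a small sketch. The function name, the dictionary layout and the sample probabilities are assumptions; only the threshold value 0.325 comes from the text.

```python
def confusion_matrix(probabilities, labels, threshold):
    """Count true/false positives and negatives at the given threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for p, actual in zip(probabilities, labels):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and actual == 1:
            counts["tp"] += 1
        elif predicted == 1 and actual == 0:
            counts["fp"] += 1
        elif predicted == 0 and actual == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


# Made-up model outputs and true labels (1 = PAIDOFF, 0 = COLLECTION).
probabilities = [0.2, 0.3, 0.4, 0.6]
labels = [0, 1, 1, 1]
matrix = confusion_matrix(probabilities, labels, threshold=0.325)
accuracy = (matrix["tp"] + matrix["tn"]) / len(labels)
```

Lowering the threshold trades false negatives for false positives, which is why the choice of threshold matters when the two classes overlap.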

Section 5 - results

Based on the evaluation output above, we predicted with 71% accuracy whether people in the test data set would pay back the loan they had taken.

The F1 score showed the best logistic threshold at 0.27, which we used to find the best precision of 74% and recall of 95%.
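For reference, F1 is the harmonic mean of precision and recall, so plugging in the reported 74% and 95% gives roughly 0.83. A threshold search of the kind described can be sketched as a simple sweep over candidate values. The helper names, the candidate grid and the sample data below are assumptions, not the notebook's actual evaluation code.

```python
def f1_at_threshold(probabilities, labels, threshold):
    """F1 score for class 1 when predicting 1 for p >= threshold."""
    pairs = list(zip(probabilities, labels))
    tp = sum(1 for p, y in pairs if p >= threshold and y == 1)
    fp = sum(1 for p, y in pairs if p >= threshold and y == 0)
    fn = sum(1 for p, y in pairs if p < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(probabilities, labels, candidates):
    """Return the candidate with the highest F1 (the first one on ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(probabilities, labels, t))


# Made-up model outputs and true labels.
probabilities = [0.10, 0.25, 0.30, 0.45, 0.60, 0.80]
labels = [0, 1, 0, 1, 1, 1]
candidates = [i / 100 for i in range(5, 100, 5)]
best = best_threshold(probabilities, labels, candidates)
best_f1 = f1_at_threshold(probabilities, labels, best)

# F1 implied by the precision and recall reported above.
reported_f1 = 2 * 0.74 * 0.95 / (0.74 + 0.95)
```

Choosing the threshold on the same data used for the final report can make the scores look better than they would on unseen data, so a separate validation split is the safer place to run the sweep.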