AI: K-Nearest Neighbors

Classification problem with KNN: should we give a loan or not?

In this project I use my own K-Nearest Neighbors and DataTable packages to predict whether a person with the given attributes will pay a loan back.

I'll walk you through the process of model definition, fine-tuning, and evaluation. Feel free to jump straight to whichever section interests you.

Before exploring this work with KNN, read the Logistic Regression note first, which covers the same data set; this makes it possible to compare the results of two different classification algorithms.

  1. Section 1 - preparation
  2. Section 2 - model building
  3. Section 3 - k-evaluation
  4. Section 4 - results

Introduction

I developed my own simplified Pandas-like package, data_reader.py, to load various kinds of data faster than Pandas does. It contains the main features needed for a KNN model: preparing data for training/testing, splitting data, adding new features, creating combined ones, plotting, and more.
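The data_reader.py source itself is not shown here, but the train/test split it mentions is a standard shuffle-and-cut. A minimal sketch (the function name and signature are my own, not the package's API):

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle rows deterministically, then cut into train and test parts.

    A fixed seed keeps the split reproducible between runs, which matters
    when comparing models (e.g. KNN vs. Logistic Regression) on the same data.
    """
    rng = random.Random(seed)
    shuffled = rows[:]           # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]
```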

In addition, I designed a knn.py package built around a k-size max-heap data structure, which optimizes the time and memory complexity of storing and comparing the nearest points to an unknown data point. It also has a logging feature to monitor the model's training and to evaluate its performance.
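The knn.py implementation is not reproduced here, but the heap idea can be sketched as follows (names and distance metric are my assumptions). A max-heap bounded to k entries keeps the farthest of the current candidates at the root, so each new point either replaces it in O(log k) or is discarded, and memory stays at O(k):

```python
import heapq

def k_nearest(points, query, k):
    """Keep the k closest points to `query` using a bounded max-heap.

    Python's heapq is a min-heap, so distances are negated: the largest
    distance then sits at the root, ready to be evicted.
    """
    heap = []  # entries are (-distance, point)
    for p in points:
        dist = sum((a - b) ** 2 for a, b in zip(p, query)) ** 0.5
        if len(heap) < k:
            heapq.heappush(heap, (-dist, p))
        elif dist < -heap[0][0]:              # closer than current farthest
            heapq.heapreplace(heap, (-dist, p))
    # return points ordered from nearest to farthest
    return [p for _, p in sorted(heap, key=lambda e: -e[0])]
```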

This notebook demonstrates all of these features and techniques.

For additional package usage, refer to the documentation in its source code.

For this session, I'm going to use the /data/loan_train.csv file, which contains tabular data on various cases where a person did or did not pay back a loan. It might be interesting to try to predict whether a given person should get a loan.

Section 1 - preparation


Loan_status is the target we are trying to predict. It has two possible values: PAIDOFF and COLLECTION. DataTable reads the CSV file so that words (string values) are converted to numbers, which can easily be restored later.

So, COLLECTION is 0 and PAIDOFF is 1. With that, let's move on. What about the features?
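DataTable's internals are not shown here, but this reversible word-to-number conversion is ordinary label encoding. A minimal sketch (my own names; I assume alphabetical ordering of the distinct labels, which matches COLLECTION → 0, PAIDOFF → 1):

```python
def encode_labels(values):
    """Encode strings as integers, keeping a mapping to restore them later.

    Integers are assigned in alphabetical order of the distinct labels,
    which in this data set yields COLLECTION -> 0 and PAIDOFF -> 1.
    """
    mapping = {label: i for i, label in enumerate(sorted(set(values)))}
    encoded = [mapping[v] for v in values]
    inverse = {i: label for label, i in mapping.items()}
    return encoded, inverse

codes, inverse = encode_labels(["PAIDOFF", "COLLECTION", "PAIDOFF"])
# inverse[code] restores the original word for any encoded value
```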

Well, we can see no strongly marked relationship between the features and the target, so predicting loan repayment with high accuracy might be hard. But we will try.

Section 2 - model building

Let's create the same model as we had for Logistic Regression, so we can compare their results.
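The real model definition lives in knn.py; as a hedged sketch of plain KNN classification (Euclidean distance and function signature assumed, not the author's exact code), prediction is a majority vote among the k nearest training points:

```python
from collections import Counter

def knn_predict(train_X, train_y, query, k=4):
    """Predict a class label by majority vote among the k nearest neighbors."""
    # Squared Euclidean distance is enough for ranking neighbors.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_X, train_y)
    )
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]
```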

Section 3 - k-evaluation

In this section I explore which value of k might be best for our model's predictions.
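A typical k search simply scores each candidate k on a held-out set and keeps the winner. The sketch below is my own illustration (the prediction function is passed in; none of these names are the knn.py API):

```python
def best_k(train_X, train_y, test_X, test_y, predict, ks=range(1, 16)):
    """Return (best k, all scores) by accuracy on the held-out test set."""
    scores = {}
    for k in ks:
        preds = [predict(train_X, train_y, x, k) for x in test_X]
        scores[k] = sum(p == t for p, t in zip(preds, test_y)) / len(test_y)
    # max over insertion-ordered dict returns the smallest k among ties
    return max(scores, key=scores.get), scores
```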

Section 4 - results

Based on the short search above, k = 4 is roughly the best for our model, achieving 72% accuracy, 74% precision, and 94% recall.
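For reference, all three reported metrics come straight from the confusion-matrix counts, taking PAIDOFF (= 1) as the positive class. A minimal sketch of how they could be computed (my own helper, not the knn.py logging code):

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall for a binary classifier."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall
```

The high recall relative to precision means the model rarely misses an actual PAIDOFF case but does mislabel some COLLECTION cases as PAIDOFF.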