AI: K-Mean

Clustering problem with K-Mean: what kinds of customers do we have?

In this project I use my own K-Mean and DataTable packages to cluster our customers based on their attributes.

I'll walk you through the process of model definition, fine-tuning and evaluation. Feel free to jump straight to the section that interests you.

  1. Section 1 - preparation
  2. Section 2 - model building
  3. Section 3 - K-Mean first try
  4. Section 4 - K-Mean full up
  5. Section 5 - results

Introduction

I developed my own simplified, Pandas-like package, data_reader.py, to load different kinds of data faster than Pandas does. It contains the main features needed for K-Mean, such as preparing data for training/testing, splitting data, adding new features, creating combined features, plotting, and more.

In addition, to extend my knowledge beyond my AI regression model, I designed a kmean.py package. By default it tries 20 different random starting positions for the centroids, runs 5 training iterations for each start, and returns the centroids with the lowest cost function achieved after training.
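To make the restart idea concrete, here is a minimal pure-Python sketch of K-Mean with several random starts. It is not the kmean.py code: the function names, the choice of starting centroids from random data points, and the use of mean squared distance as the cost function are all assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]


def cost(points, centroids, labels):
    """Assumed cost: mean squared distance from each point to its centroid."""
    total = sum(squared_distance(p, centroids[c]) for p, c in zip(points, labels))
    return total / len(points)


def update(points, centroids, labels):
    """Move each centroid to the mean of its points; keep it if it has none."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, c in zip(points, labels) if c == k]
        if members:
            new_centroids.append(tuple(sum(d) / len(members) for d in zip(*members)))
        else:
            new_centroids.append(old)
    return new_centroids


def kmeans_with_restarts(points, k, n_init=20, n_iter=5, seed=0):
    """Run K-Mean from n_init random starts; return the lowest-cost result."""
    rng = random.Random(seed)
    best_cost, best_centroids = float("inf"), None
    for _ in range(n_init):
        centroids = rng.sample(points, k)  # assumption: start from k data points
        for _ in range(n_iter):
            labels = assign(points, centroids)
            centroids = update(points, centroids, labels)
        final_cost = cost(points, centroids, assign(points, centroids))
        if final_cost < best_cost:  # remember only the best start so far
            best_cost, best_centroids = final_cost, centroids
    return best_centroids, best_cost
```

Calling `kmeans_with_restarts(points, k=3, n_init=3, n_iter=5)` on a list of `(x, y)` tuples would mirror the "3 initializations, 5 iterations" setup used later in the notebook. Restarts help because a single random start can settle in a poor local minimum of the cost function.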

I would like to show all these features and techniques in this notebook.

For more on how to use the packages, refer to the Doxygen comments in their source code.

For this session, I'm going to use the /data/Cust_Segmentation.csv file, which contains tabular data about our customers.

Section 1 - preparation

Let's pick a few features for further analysis and K-Mean testing. I'm interested in the relationship between Income and Years Employed.
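For readers without the data_reader.py package, picking two columns out of a CSV file can be sketched with the standard library alone. The helper name `read_two_features` is hypothetical, the column names are assumed to match the feature names used in the text, and the sample rows below are made up purely to show the call.

```python
import csv
import io


def read_two_features(csv_file, x_name, y_name):
    """Return (x, y) float pairs for two named columns, skipping bad rows."""
    points = []
    for row in csv.DictReader(csv_file):
        try:
            points.append((float(row[x_name]), float(row[y_name])))
        except (TypeError, ValueError):
            continue  # empty cell, short row, or non-numeric value
    return points


# Made-up rows standing in for the real file, for illustration only.
sample = io.StringIO("Income,Years Employed,Age\n19,6,41\n100,26,47\n,3,30\n")
demo_points = read_two_features(sample, "Income", "Years Employed")
```

With the real file, the same function would be called inside `with open("/data/Cust_Segmentation.csv", newline="") as f:`. A misspelled column name raises `KeyError` rather than silently returning nothing.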

Cool, let's apply K-Mean with 2 clusters. But first, let's scale our data to the range -1...+1 so that training behaves better: distances between points are only meaningful when both features are on the same scale.
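One common way to reach the -1...+1 range is min-max scaling, sketched below. This is an assumption about the scaling method (the package may use a different formula, such as mean normalization), and the function name is hypothetical.

```python
def scale_to_unit_range(values):
    """Linearly map values so that the minimum becomes -1.0 and the maximum +1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to spread out
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]
```

Each feature column is scaled on its own, for example `scale_to_unit_range([10, 20, 30])` gives `[-1.0, 0.0, 1.0]`. A new customer has to be scaled with the same minimum and maximum as the training data before being compared with the centroids.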

Section 2 - model building

Section 3 - first try

Let's find the best three centroids for our data. We will run 3 random centroid initializations, with 5 learning iterations for each.

The model saves logs during the learning process. It keeps the logs of the best run of the whole training session, judged by the cost function: every new centroid initialization ends with a different final cost, so the model saves the training history and the centroids for the lowest cost achieved. Let's analyze them.

Section 4 - full up

In this section I want to find out which K best minimizes the model's cost function for these 2 features: Income and Years Employed.
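The search over K can be sketched as a loop that trains a model for each candidate K and records the lowest final cost. The code below is a compact stand-in, not the kmean.py implementation: the function names, the random-data-point initialization, and the mean-squared-distance cost are assumptions.

```python
import random


def nearest(point, centroids):
    """Return (index, squared distance) of the centroid closest to point."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    index = min(range(len(centroids)), key=distances.__getitem__)
    return index, distances[index]


def fit_once(points, k, n_iter, rng):
    """One K-Mean run from a random start; returns (centroids, final cost)."""
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        labels = [nearest(p, centroids)[0] for p in points]
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(d) / len(members) for d in zip(*members))
    final_cost = sum(nearest(p, centroids)[1] for p in points) / len(points)
    return centroids, final_cost


def cost_per_k(points, k_values, n_init=20, n_iter=5, seed=0):
    """Map each K to the lowest final cost found over n_init random starts."""
    rng = random.Random(seed)
    return {
        k: min(fit_once(points, k, n_iter, rng)[1] for _ in range(n_init))
        for k in k_values
    }
```

`cost_per_k(points, range(1, 8))` returns a dictionary that can be plotted as cost against K, and `min(costs, key=costs.get)` picks the K with the lowest cost. Keep in mind that with enough restarts and iterations the best achievable cost does not increase as K grows, so the shape of the curve is usually worth reading alongside the minimum.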

Section 5 - results

Let's find out what we just trained!

The cost for each K is the cost function of the best trained model for that number of centroids (K), taken at the last iteration (in our case, the 5th). Based on the picture above, we can see that K = [2, 3, 5] are good choices. The best K based on the lowest cost function is:

Let's load the centroids for K = 5 into our model and predict which cluster a new customer belongs to.

The customer with scaled data [0.06 Income, 0.15 Years Employed] belongs to cluster 4 (zero-based index 3, plus 1).
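Prediction for a new customer comes down to finding the nearest centroid. The sketch below shows that step in isolation; the function name is hypothetical, and the five centroids are made-up values for illustration only, not the centroids trained in this notebook.

```python
def predict_cluster(point, centroids):
    """Return the zero-based index of the centroid nearest to point."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(point, centroids[k])),
    )


# Hypothetical scaled centroids (Income, Years Employed), not the trained ones.
example_centroids = [(-0.8, -0.7), (-0.4, -0.2), (-0.1, 0.6), (0.1, 0.1), (0.7, 0.8)]
index = predict_cluster((0.06, 0.15), example_centroids)
label = index + 1  # clusters are reported starting from 1
```

The function returns a zero-based index, which is why the reported cluster number is the index plus one.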