The Key dynamics of KNN Algorithm

Are you someone who is venturing into KNN machine learning algorithms? Yes, you have clicked the right place. This blog is a quick introduction to the simplest machine language algorithms, KNN which will help you to grasp its key dynamics.

KNN Algorithm, the most used learning algorithm was born out of research done for the armed forces. Two officers of USAF School of Aviation Medicine, namely Hodge and Fix wrote a technical report in 1951 where they introduced the KNN algorithm.

Table of Contents

Introduction

K Nearest Neighbor is one of the fundamental algorithms in machine learning which makes the use of a set of input values to predict output values. It is one of the simplest forms of algorithms mostly used for classification problems as it classifies the data point on the basis of how its neighbour is classified. K in KNN represents the number of the closest neighbors that we have used to classify our new data points.

The main characteristics of KNN algorithm is that it classifies the new data points depending on the similarity measure of the earlier stored data points. For instance, let us consider we have a dataset of mangoes and grapes. KNN will sort out similar features like shape, variety and color and store them. When a new object is found, it will automatically check its similarity with the features available and predict it.

Characteristics of KNN Algorithm

So, we have understood what is meant by KNN Algorithms. Now, let us check out the main characteristics of KNN algorithms which makes it so user-friendly and easily interpretable.

Foremostly, KNN helps to find sample geometric distance. The KNN classifier is commonly based on the Euclidean distance between a test sample and the specified training samples.
KNN is considered to be a lazy learning, non-parametric algorithm because it uses data with several classes to predict the classification of the new sample point. KNN is said as non-parametric since it doesn’t make any assumptions on the data being studied.
Next characteristics is classification decision rule and confusion matrix where classification typically involves partitioning samples into training and testing datasets.
KNN algorithm plays a vital role in feature transformation. The increased performance of a classifier can sometimes be achieved when the feature values are transformed prior to classification analysis. And two commonly used feature transformations involve standardization and fuzzification.
Performance assessment with cross-validation: A basic rule in classification analysis is that class predictions are not made for data samples that are used for training or learning. If class predictions are made for samples used in training or learning, the accuracy will be artificially biased upward. Instead, class predictions are made for samples that are kept out of the training process.

When is KNN employed in our daily problems?

We have properly labeled data which includes just two values. For instance, if we are considering predicting that someone is having diabetes or not, then the final label can be either 1 or 0. It cannot be NaN or -1.
In the political science scenario where classing a political voter to “vote Republican” or “vote Democrat”, or to a “will vote” or “will not vote” in the election.
Next, the data is noise-free. For the diabetes data set, we cannot have a Glucose level as 0.0 or 1000. It’s practically impossible to have such data.
It can also be used in banking systems where it predicts if a person is fit for loan approval. Or sometimes to predict if he or she has similar traits to a defaulter.
KNN is highly preferred for small dataset. It is helpful in calculating credit ratings. It calculates an individual’s credit score by comparing it with persons with similar traits.
Some other areas that make use of the KNN algorithm include Image Processing and recognition, Video Recognition, Handwriting Detection, and Speech Recognition.

Choosing the value of K?

We have grasped a brief knowledge on the characteristics of KNN Algorithm and its applications in the real-time world. Now, let us learn how we can choose the value of K and some important points to consider while choosing it.

While solving real-life problems, we come across many points. Now, the question arises: how to select the value of K?

Choosing the right value of K is known as parameter tuning and it is very essential for better results. Choosing the value of K means we find the square root of the total number of data points available in the dataset.

a. K = sqrt (the total number of data points present).

b. Odd value of K is usually selected to avoid confusion between 2 classes.

How does KNN work? When we have a data set with two kinds of points denoted as Label A and Label B. And we want to figure out the label of a new point in this dataset using kNN, we can simply do it by taking a vote of its k nearest neighbours. K value can be any value in the range between 1 to infinity, but in most practical cases, k is less than 30.
In order to get the right value of K, we should run the KNN algorithm several times with different values of K and finally, select the one that has the least number of errors. The right K value must be able to predict data that it hasn’t seen before accurately.

Working of KNN Algorithm in Machine Learning

Making the blog short and precise, in this last section we will try to understand in a better way the working of KNN algorithm which apply the following steps:

Step 1 – The first step when implementing an algorithm is to have a proper labelled data set. So that we can start it by loading the training and the testing data.

Step 2 – The next comes choosing the nearest data points i.e., the value of K. K can be any integer value.

Step 3 – For each test data, we should do the following things –

Using Euclidean distance, Hamming, or Manhattan to calculate the distance between test data and each row of training. The Euclidean method is the most commonly used method.
Sorting data set in ascending order based on the distance value. Now, from the sorted array, we have to choose the top K rows.
Based on the maximum appearing classes of these rows, it will assign a specific category to the test point.

Step 4 – The algorithm execution is over.

There is no doubt that Companies like Amazon or Netflix make use of the KNN algorithm when recommending books to buy or movies to watch to its customers but this algorithm has some disadvantages too.

Accuracy depends on the quality of the data and so, sometimes using large data makes the prediction problem to be slow.
It is sensitive to the size of the information and irrelevant features.
Moreover, it requires usage of high memory as it needs to store all of the training data and so it can be computationally expensive.

This blog has given you the basics of one of the most popular machine learning algorithms. KNN algorithms can be implemented in our model with the help of programming languages which can be either Python or R. If you are a great learner, then check out the article Understanding the Concept of KNN Algorithm Using R as KNN is a great place to start with for beginners when first learning to build models based on different data sets.

Author Bio:

Senior Data Scientist and Alumnus of IIM- C (Indian Institute of Management – Kolkata) with over 25 years of professional experience Specialized in Data Science, Artificial Intelligence, and Machine Learning. PMP Certified ITIL Expert certified APMG, PEOPLECERT and EXIN Accredited Trainer for all modules of ITIL till Expert Trained over 3000+ professionals across the globe Currently authoring a book on ITIL “ITIL MADE EASY”.

Conducted myriad Project management and ITIL Process consulting engagements in various organizations. Performed maturity assessment, gap analysis and Project management process definition and end to end implementation of Project management best practices