Most Popular Distance Metrics Used in KNN and When to Use Them - KDnuggets

For calculating distances, KNN uses a distance metric from the list of available metrics. Read this article for an overview of these metrics and when they should be considered for use.


By Sarang Anil Gokte, Praxis Business School

Introduction


KNN is one of the most commonly used and simplest algorithms for finding patterns in classification and regression problems. It is a supervised algorithm, also known as a lazy learning algorithm because it defers all computation until prediction time. It works by calculating the distance between a test observation and every observation in the training dataset and then finding the K nearest neighbors of that test observation. This happens for each and every test observation, and that is how it finds similarities in the data. For calculating these distances, KNN uses a distance metric chosen from the list of available metrics.

[Figure: K-nearest neighbor classification example for k=3 and k=7]

Distance Metrics


For the algorithm to work best on a particular dataset, we need to choose the most appropriate distance metric. There are many distance metrics available, but we are only going to discuss a few widely used ones. The Euclidean distance function is the most popular of them all, as it is the default metric in scikit-learn's KNN classifier in Python.
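
As a point of reference, here is a minimal sketch (not from the original article) of how the distance metric is chosen in scikit-learn's KNeighborsClassifier; the iris dataset and hyperparameter values below are illustrative assumptions only:

# Minimal sketch: choosing a distance metric for KNN in scikit-learn.
# Dataset and hyperparameters are illustrative choices, not from the article.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Default: metric="minkowski" with p=2, i.e. Euclidean distance.
knn_euclidean = KNeighborsClassifier(n_neighbors=3)

# Manhattan distance: either metric="manhattan" or metric="minkowski" with p=1.
knn_manhattan = KNeighborsClassifier(n_neighbors=3, metric="manhattan")

for name, model in [("euclidean", knn_euclidean), ("manhattan", knn_manhattan)]:
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))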

So here are some of the distances used:


Minkowski Distance – This is a metric intended for real-valued vector spaces. We can calculate the Minkowski distance only in a normed vector space, which means a space where distances can be represented as vectors that have a length, and lengths cannot be negative.

There are a few conditions that the distance metric must satisfy:

  1. Non-negativity: d(x, y) >= 0
  2. Identity: d(x, y) = 0 if and only if x == y
  3. Symmetry: d(x, y) = d(y, x)
  4. Triangle Inequality: d(x, y) + d(y, z) >= d(x, z)

D(x, y) = ( Σ |x_i - y_i|^p )^(1/p), where the sum runs over all n coordinates i = 1, ..., n

The above formula for Minkowski distance is the generalized form, and we can manipulate it to get different distance metrics.

The p value in the formula can be manipulated to give us different distances (see the short sketch after the list below):

  • p = 1, when p is set to 1 we get Manhattan distance
  • p = 2, when p is set to 2 we get Euclidean distance
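
To make this relationship concrete, here is a small illustrative sketch (an assumption of this write-up, not code from the article) of the generalized Minkowski distance and how p = 1 and p = 2 reduce to the Manhattan and Euclidean distances:

# Illustrative sketch of the generalized Minkowski distance.
import numpy as np

def minkowski(x, y, p):
    # Minkowski distance between two equal-length vectors.
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)

a, b = [4, 4], [1, 1]        # the red and green points used later in the article
print(minkowski(a, b, p=1))  # 6.0     -> Manhattan distance
print(minkowski(a, b, p=2))  # ~4.2426 -> Euclidean distance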


Manhattan Distance – This distance is also known as taxicab distance or city block distance because of the way it is calculated: the distance between two points is the sum of the absolute differences of their Cartesian coordinates.

[Figure: Manhattan (city block) distance between two points on a grid]

As we know we get the formula for Manhattan distance by substituting p=1 in the Minkowski distance formula.

D(x, y) = Σ |x_i - y_i|

Suppose we have two points as shown in the image: the red point (4,4) and the green point (1,1).

Now we have to calculate the distance between them using the Manhattan distance metric.

We will get,

d = |4-1| + |4-1| = 6

This distance is often preferred over Euclidean distance when the data has high dimensionality.
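
As a quick check of the worked example above, assuming SciPy is available, the built-in city block function gives the same answer:

# Manhattan (city block) distance between the red (4,4) and green (1,1) points.
from scipy.spatial.distance import cityblock

print(cityblock([4, 4], [1, 1]))  # 6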


Euclidean Distance – This distance is the most widely used one, as it is the default metric that the scikit-learn library in Python uses for K-Nearest Neighbors. It is a measure of the true straight-line distance between two points in Euclidean space.

[Figure: Euclidean (straight-line) distance between two points]

It can be obtained by setting the value of p equal to 2 in the Minkowski distance metric.

D(x, y) = √( Σ (x_i - y_i)² )

Now suppose we again have two points, the red (4,4) and the green (1,1).

And now we have to calculate the distance using the Euclidean distance metric.

We will get,

d = √((4-1)² + (4-1)²) = √18 ≈ 4.24
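
Again, as a quick check (assuming SciPy is available), the built-in Euclidean function reproduces this value:

# Straight-line distance between the red (4,4) and green (1,1) points.
from scipy.spatial.distance import euclidean

print(round(euclidean([4, 4], [1, 1]), 2))  # 4.24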


Cosine Distance – This distance metric is used mainly to calculate the similarity between two vectors. It is measured by the cosine of the angle between two vectors and determines whether they are pointing in the same direction. It is often used to measure document similarity in text analysis. When used with KNN, this distance gives us a new perspective on a business problem and lets us find hidden information in the data which we didn't see using the two distance metrics above.

It is also used in text analytics to find similarities between two documents based on the number of times a particular set of words appears in them.

The formula for cosine similarity is:

cos θ = (A · B) / (‖A‖ ‖B‖)

Using this formula, we get a value that tells us about the similarity between the two vectors, and 1 - cos θ gives us their cosine distance.

Using this distance we get values between 0 and 1 (for vectors with non-negative components, such as word counts), where 0 means the vectors point in exactly the same direction and 1 means they are not similar at all.
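
Here is a small illustrative sketch (the vectors below are assumptions for demonstration) of cosine distance, using SciPy's cosine function, which returns 1 - cos θ directly:

# Cosine distance between illustrative vectors.
import numpy as np
from scipy.spatial.distance import cosine

a = np.array([1, 1, 0, 1])
b = np.array([2, 2, 0, 2])   # same direction as a, only scaled
c = np.array([0, 0, 1, 0])   # orthogonal to a

print(cosine(a, b))  # ~0.0 -> same direction, maximally similar
print(cosine(a, c))  # 1.0  -> orthogonal, not similar at all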


Jaccard Distance – The Jaccard coefficient is a method of comparison similar to cosine similarity, in that both compare one type of attribute distributed among all the data points. The Jaccard approach looks at the two data sets and finds the incidences where both values are equal to 1. The resulting value reflects how many 1-to-1 matches occur in comparison to the total number of data points, which is similar in spirit to what cosine similarity measures: how frequently a certain attribute occurs in both.

It is extremely sensitive to small sample sizes and may give erroneous results, especially with very small data sets with missing observations.

The formula for the Jaccard index is:

J(A, B) = |A ∩ B| / |A ∪ B|

Jaccard distance is the complement of the Jaccard index and can be found by subtracting the Jaccard index from 1 (that is, from 100%), so the formula for Jaccard distance is:

D(A,B) = 1 – J(A,B)
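
A minimal sketch (the binary vectors are illustrative assumptions) of the Jaccard index and Jaccard distance, treating each vector as the set of positions where the value is 1:

# Jaccard index and distance for two illustrative binary vectors.
a = [1, 1, 0, 1, 0]
b = [1, 0, 0, 1, 1]

A = {i for i, v in enumerate(a) if v == 1}
B = {i for i, v in enumerate(b) if v == 1}

jaccard_index = len(A & B) / len(A | B)   # |A ∩ B| / |A ∪ B| = 2 / 4
jaccard_distance = 1 - jaccard_index

print(jaccard_index)     # 0.5
print(jaccard_distance)  # 0.5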


Hamming Distance – Hamming distance is a metric for comparing two binary data strings. When comparing two binary strings of equal length, Hamming distance is the number of bit positions in which the two bits differ. The Hamming distance method looks at the whole data and finds, position by position, where the data points are similar and dissimilar. The Hamming distance gives the result of how many attributes were different.

This is used mostly when you one-hot encode your data and need to find distances between two binary vectors.

Suppose we have two strings of the same length, “ABCDE” and “AGDDF”, and we want to find the Hamming distance between them. We go letter by letter and check whether each pair of letters matches: the first letters of both strings match, the second letters do not, and so on.

ABCDE and AGDDF

When we are done, we see that only two letter positions (the first and the fourth) match, while three are different. Hence, the Hamming distance here is 3. Note that the larger the Hamming distance between two strings, the more dissimilar those strings are (and vice versa).
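
A minimal sketch (this helper function is an illustration, not part of the article) that reproduces the example above:

# Hamming distance between two equal-length strings.
def hamming_distance(s1, s2):
    # Count the positions at which the corresponding characters differ.
    if len(s1) != len(s2):
        raise ValueError("Strings must be of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance("ABCDE", "AGDDF"))  # 3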





Bio: Sarang Anil Gokte is a Postgraduate Student at Praxis Business School.
