K-Means for beginners

In short, K-Means is an unsupervised machine learning algorithm used for cluster analysis.
To explain what that means, let’s start with the unsupervised part, or rather, compare it to supervised learning.
Supervised learning is used to map a given input to its corresponding output. For example, you have a series of pictures, and each picture has a label describing it. You run those pictures through a machine learning algorithm and compare its output with the labels. If the output is correct, you continue the process. If it’s wrong, you tweak the parameters of the model to get better results in the future.
Unsupervised learning, on the other hand, does not have labels associated with data. Then how do we check if the output is correct? In short, we don’t. The goal of the algorithm is to find patterns in the data. The data is then segmented into clusters with similar attributes. This can range from grouping customers with similar shopping habits to segmenting cells based on their genetic makeup.
So, how does it work?
First, you set the value of k. This is the number of means, and therefore the number of clusters, that the data will be grouped around. (In practice there are methods and algorithms that help you determine k, but for the sake of simplicity let’s say we chose it by hand.)
1. The algorithm randomly generates the means within the data domain. For this example let k = 3, represented by the coloured squares.

2. Each data point is assigned to its nearest mean, and these assignments form the clusters.

3. The centroid of each cluster (the average position of its data points) becomes that cluster’s new mean.

4. Steps 2 and 3 are repeated until we reach convergence (until the clusters stop changing).
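
To make the four steps concrete, here is a minimal sketch in plain Python. It is an illustration under a few assumptions, not a reference implementation: the function name `k_means` and the parameters `max_iters` and `seed` are made up for this example, the points are assumed to be 2-D tuples, and a cluster that ends up empty simply keeps its old mean. Note that the update in step 3 uses the standard K-Means rule, where each mean moves to the average of the points in its cluster.

```python
import math
import random


def k_means(points, k, max_iters=100, seed=0):
    """Cluster 2-D points (tuples) into k groups; returns (means, labels)."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable

    # Step 1: generate k random means inside the bounding box of the data.
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    means = [(rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
             for _ in range(k)]

    labels = None
    for _ in range(max_iters):
        # Step 2: assign every point to its nearest mean.
        new_labels = [min(range(k), key=lambda j: math.dist(p, means[j]))
                      for p in points]

        # Step 4: stop once the clusters no longer change.
        if new_labels == labels:
            break
        labels = new_labels

        # Step 3: move each mean to the centroid of its cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old mean
                means[j] = (sum(p[0] for p in members) / len(members),
                            sum(p[1] for p in members) / len(members))

    return means, labels


# Hypothetical toy data: three loose groups of points.
points = [(1, 1), (1.5, 2), (1, 0.5),
          (8, 8), (9, 9), (8.5, 9.5),
          (1, 9), (2, 8), (1.5, 9.5)]
means, labels = k_means(points, k=3)
print(means)   # final position of each mean
print(labels)  # cluster index assigned to each point
```

Because the starting means are random, the result depends on the seed, and an unlucky start can give a poorer grouping. Running the algorithm a few times with different seeds is a common way around that.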

And it’s as simple as that: we have our clusters!
K-Means is excellent for beginners wanting to get into machine learning. It’s simple, fast (relatively) and easy to understand. There are, of course, hurdles to be overcome such as determining the number of clusters or dealing with the curse of dimensionality, but that’s a topic for another post.