K-Means++ is a clever upgrade to K-Means that fixes its biggest flaw: random initialization.
Instead of picking all k centroids at random, K-Means++:
- Picks the first centroid uniformly at random from the data points.
- For each remaining point x, computes its distance D(x) to the nearest already-chosen centroid.
- Chooses the next centroid from the dataset with probability proportional to D(x)^2.
- Repeats until k centroids are selected.
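The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the function name and array shapes are my own choices:

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """Pick k initial centroids from X (an (n, d) array) via K-Means++ seeding."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # Step 1: first centroid chosen uniformly at random from the data.
    centroids = [X[rng.integers(n)]]
    for _ in range(k - 1):
        C = np.array(centroids)
        # D(x)^2: squared distance from each point to its nearest chosen centroid.
        d2 = np.min(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1), axis=1)
        # Sample the next centroid with probability proportional to D(x)^2.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(n, p=probs)])
    return np.array(centroids)
```

Because points far from every existing centroid get the largest weights, the seeds tend to land in distinct regions of the data rather than clumping together.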
This spreads centroids out more effectively and leads to:
- Faster convergence
- Better clustering quality
- Fewer bad runs
📐 Theoretical Guarantees
Let:
- J_opt = the optimal value of the K-Means objective function
- J_init = the value of the objective after K-Means++ initialization
Then:
E[J_init] ≤ 8(ln k + 2) · J_opt
This means K-Means++ is guaranteed, in expectation, to be within a logarithmic factor of the optimum, and of course far better than random seeding, which carries no such guarantee.
For the full math, read “k-means++: The Advantages of Careful Seeding” by Arthur and Vassilvitskii (2007).
If you’re clustering anything important, skip random initialization. Start with ++ and see where it takes you.
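In practice you rarely need to implement the seeding yourself: scikit-learn's `KMeans` accepts `init="k-means++"` (it is the default). A quick sketch on toy data (the synthetic blobs here are my own example):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated Gaussian blobs as toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

# init="k-means++" requests the careful seeding described above.
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)
```

With good seeds, a single restart is usually enough, but `n_init` lets you keep the best of several runs for extra safety.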