K-Means++ is a clever upgrade to K-Means that fixes its biggest flaw: random initialization.
Instead of picking all k centroids at random, K-Means++:
- Picks the first centroid uniformly at random from the data points.
- For each remaining point x, computes its distance D(x) to the nearest already-chosen centroid.
- Chooses the next centroid from the dataset with probability proportional to D(x)^2.
- Repeats until k centroids are selected.
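The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the function name and array shapes are my own choices:

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """Pick k initial centroids from X (an (n, d) array) via K-Means++ seeding."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # Step 1: first centroid chosen uniformly at random from the data.
    centroids = [X[rng.integers(n)]]
    for _ in range(k - 1):
        C = np.array(centroids)
        # D(x)^2: squared distance from each point to its nearest chosen centroid.
        d2 = np.min(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1), axis=1)
        # Sample the next centroid with probability proportional to D(x)^2.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(n, p=probs)])
    return np.array(centroids)
```

Because points far from every existing centroid get the largest weights, the seeds tend to land in distinct regions of the data rather than clumping together.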
This spreads centroids out more effectively and leads to:
- Faster convergence
- Better clustering quality
- Fewer bad runs
📐 Theoretical Guarantees
Let:
- J_opt = the optimal value of the K-Means objective function
- J_init = the value of the objective after K-Means++ initialization
Then:
E[J_init] ≤ 8(ln k + 2) · J_opt
This means K-Means++ is guaranteed, in expectation, to be within a logarithmic factor of the optimum, and of course far better than random seeding, which carries no such guarantee.
For the full math, read “k-means++: The Advantages of Careful Seeding” by Arthur and Vassilvitskii (2007).
If you’re clustering anything important, skip random initialization. Start with ++ and see where it takes you.
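In practice you rarely need to implement the seeding yourself: scikit-learn's `KMeans` accepts `init="k-means++"` (it is the default). A quick sketch on toy data (the synthetic blobs here are my own example):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated Gaussian blobs as toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

# init="k-means++" requests the careful seeding described above.
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)
```

With good seeds, a single restart is usually enough, but `n_init` lets you keep the best of several runs for extra safety.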