K-Means Clustering

K-Means Clustering is an unsupervised learning algorithm that partitions a dataset into K distinct, non-overlapping clusters, with each data point assigned to the cluster whose mean (centroid) is nearest. The algorithm alternates between assigning points to the closest centroid and recalculating each centroid as the mean of its assigned points, stopping when the assignments no longer change. Operators encounter K-Means in tasks such as quantizing model weights or grouping similar embeddings for retrieval-augmented generation (RAG). It is computationally efficient for large datasets, but the result depends on the initial centroid placement, and the method implicitly assumes roughly spherical clusters of similar size.
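The assign-and-update loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function name kmeans_sketch, the random initialisation from data points, and the two synthetic blobs are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(points, k, n_iters=100, seed=0):
    """Illustrative K-Means loop (hypothetical helper, not a library API)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids as k distinct points drawn at random from the data.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        # An empty cluster keeps its previous centroid.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Synthetic data: two well-separated blobs of 50 points each (assumed for the demo).
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
])
centroids, labels = kmeans_sketch(points, k=2)
```

Because the starting centroids are random, different seeds can give different results on harder data, which is the sensitivity to initial placement mentioned above.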

When building a RAG system, an operator might use K-Means to cluster document embeddings produced by a sentence transformer model. For a corpus of 100,000 documents, setting K=1000 yields clusters of about 100 documents each on average. At query time, the system first identifies the nearest cluster centroid, then searches only within that cluster, so it scores roughly 100 documents (plus the 1,000 centroids) instead of all 100,000. The trade-off sacrifices some recall for speed, because a relevant document can sit in a neighboring cluster that is never searched.
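The arithmetic behind that saving can be checked with a short back-of-the-envelope calculation. The sketch below assumes equal-sized clusters and counts the centroid comparisons as part of the per-query cost; the function name and the n_probe parameter (how many nearest clusters are searched) are hypothetical and chosen only for illustration.

```python
def comparisons_per_query(n_docs, n_clusters, n_probe=1):
    """Rough count of vectors scored per query, assuming equal-sized clusters.

    n_probe is a hypothetical knob: the number of nearest clusters searched.
    """
    avg_cluster_size = n_docs / n_clusters
    # Score every centroid once, then every document in the probed clusters.
    return n_clusters + n_probe * avg_cluster_size

brute_force = 100_000  # scoring every document in the corpus
clustered = comparisons_per_query(n_docs=100_000, n_clusters=1_000)
print(clustered)  # 1100.0
```

Under these assumptions the query scores about 1,100 vectors rather than 100,000. Searching more than one cluster raises the cost proportionally and recovers some of the lost recall.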

In a local RAG workflow using LangChain, an operator imports KMeans from sklearn.cluster to cluster embeddings generated by a local model such as all-MiniLM-L6-v2. The operator sets n_clusters=500 and fits the model on the embeddings. The resulting cluster labels are stored alongside the documents. During retrieval, the query embedding is compared to the centroids, and the documents in the nearest cluster are passed to the LLM for answer generation.
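A minimal sketch of this workflow with scikit-learn might look like the following. Several things here are assumptions for the sake of a self-contained example: random 384-dimensional vectors stand in for real all-MiniLM-L6-v2 embeddings, n_clusters is reduced to 20 so the demo runs quickly, the retrieve helper is hypothetical, and the LangChain and LLM steps are left out.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in data: random vectors in place of real sentence-transformer output.
rng = np.random.default_rng(0)
n_docs, dim = 2000, 384
embeddings = rng.normal(size=(n_docs, dim))
documents = [f"doc-{i}" for i in range(n_docs)]

# Fit K-Means and keep one cluster label per document.
# The text uses n_clusters=500; 20 is used here only to keep the demo small.
kmeans = KMeans(n_clusters=20, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)

def retrieve(query_embedding, top_k=5):
    """Hypothetical helper: search only the cluster nearest to the query."""
    # Compare the query to the centroids and pick the nearest cluster.
    cluster_id = kmeans.predict(query_embedding.reshape(1, -1))[0]
    candidate_ids = np.flatnonzero(labels == cluster_id)
    candidates = embeddings[candidate_ids]
    # Rank the cluster's documents by cosine similarity to the query.
    sims = candidates @ query_embedding / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(query_embedding)
    )
    order = np.argsort(-sims)[:top_k]
    return [documents[i] for i in candidate_ids[order]]

# Use a stored document's own embedding as the query for a quick sanity check.
results = retrieve(embeddings[42])
```

In a real pipeline the returned documents would be placed in the prompt sent to the LLM, and the fitted centroids and labels would be persisted alongside the document store.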

Reviewed by Fredoline Eruo. See our editorial policy.
