How to Implement K-Means Clustering in Python



In this article, you will learn how to implement K-Means clustering in Python.

Implement K-Means Clustering

K-Means Clustering is a popular unsupervised machine learning algorithm that partitions data points into k clusters, assigning each point to the cluster whose centroid (mean) is nearest. Here is a step-by-step guide to implementing K-Means Clustering in Python:

Import the required libraries
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
Load the dataset
data = pd.read_csv('your_dataset.csv')
Prepare the data for clustering
X = data.iloc[:, [feature1_index, feature2_index, ...]].values
Determine the optimal number of clusters (k) using the Elbow method or Silhouette analysis
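wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    # inertia_ is the within-cluster sum of squares (WCSS) for this value of k
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()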

The above code uses the Within-Cluster Sum of Squares (WCSS) metric to compare K-Means Clustering results for values of k from 1 to 10. WCSS is the sum of squared distances between each point and the centroid of its cluster, so it generally decreases as k grows. The Elbow method picks the value of k at the “elbow point” of the WCSS vs. k plot, where adding more clusters stops reducing WCSS substantially.
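
The step heading also mentions Silhouette analysis as an alternative way to choose k. The following is a rough sketch of that idea, assuming scikit-learn's silhouette_score and synthetic data from make_blobs standing in for X. The data, the range of k values, and the variable names are illustrative choices, not part of the original walkthrough.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumption: synthetic data with three well-separated groups stands in for X.
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

silhouette_by_k = {}
# The silhouette score needs at least 2 clusters, so start from k=2.
for k_candidate in range(2, 7):
    model = KMeans(n_clusters=k_candidate, init='k-means++', n_init=10, random_state=0)
    labels = model.fit_predict(X_demo)
    silhouette_by_k[k_candidate] = silhouette_score(X_demo, labels)

# Higher is better: pick the k with the largest average silhouette score.
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(best_k, round(silhouette_by_k[best_k], 3))
```

Unlike the Elbow method, this gives a single number per k to maximize, which can help when the elbow in the WCSS plot is hard to see.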

Fit the K-Means model to the data and predict the cluster labels
kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)

In the above code, we have used the optimal value of k obtained from the Elbow method to fit the K-Means model to the data and predict the cluster labels.
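
To make the fit-and-predict step concrete, here is a small sketch on a made-up six-point array (the data and variable names are illustrative). It checks that fit_predict returns the same labels the fitted model stores in labels_, assigns a new point with predict, and recomputes WCSS by hand to compare it with the model's inertia_ attribute.

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumption: a tiny hand-made dataset with two obvious groups.
X_demo = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                   [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])

kmeans_demo = KMeans(n_clusters=2, init='k-means++', max_iter=300, n_init=10, random_state=0)
labels = kmeans_demo.fit_predict(X_demo)

# fit_predict returns the labels that are also stored on the fitted model.
same_labels = np.array_equal(labels, kmeans_demo.labels_)

# predict assigns new points to the nearest learned centroid.
new_label = kmeans_demo.predict(np.array([[0.5, 0.5]]))[0]

# WCSS by hand: squared distance from each point to its assigned centroid, summed.
assigned_centroids = kmeans_demo.cluster_centers_[labels]
wcss_manual = float(((X_demo - assigned_centroids) ** 2).sum())

print(labels, new_label, wcss_manual, kmeans_demo.inertia_)
```

The numeric label values themselves (0 or 1) are arbitrary; only the grouping they describe is meaningful.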

Visualize the clusters
# Plot each cluster in its own color (matplotlib cycles through colors automatically)
for cluster in range(k):
    plt.scatter(X[y_kmeans == cluster, 0], X[y_kmeans == cluster, 1], s=100, label=f'Cluster {cluster + 1}')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

In the above code, we have visualized the clusters using a scatter plot of the first two columns of X. Each data point is colored by its predicted cluster label, and each cluster's centroid is marked with a larger yellow circle.

Note: Replace feature1_index, feature2_index, your_dataset.csv, and k with appropriate values based on your dataset and requirements.
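
As an illustration of filling in those placeholders, the sketch below builds a small DataFrame in memory instead of reading your_dataset.csv, selects two columns by index, and uses k = 2. The column names, values, and indices are all made up for the example.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Assumption: an in-memory table replaces pd.read_csv('your_dataset.csv').
data_demo = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5, 6],
    'annual_income': [15, 16, 17, 80, 82, 85],
    'spending_score': [20, 25, 22, 90, 88, 95],
})

# Hypothetical choices: columns 1 and 2 are the features, and k is 2.
feature_indices = [1, 2]
k_demo = 2

X_demo = data_demo.iloc[:, feature_indices].values
kmeans_demo = KMeans(n_clusters=k_demo, init='k-means++', max_iter=300, n_init=10, random_state=0)

# Store the predicted cluster label next to each row for inspection.
data_demo['cluster'] = kmeans_demo.fit_predict(X_demo)

print(data_demo)
```

Identifier columns such as the hypothetical customer_id are left out of X here because they carry no similarity information.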