How to Use PCA in Python

09/12/2021

In this article, you will learn how to use PCA in Python.

How to Use PCA

Principal Component Analysis (PCA) is a widely used technique for dimensionality reduction and feature extraction in machine learning. In Python, you can use PCA with the scikit-learn library.

Here’s a simple example of how to apply PCA to a dataset:

from sklearn.decomposition import PCA
import numpy as np

# Load your dataset into a numpy array
data = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# Create a PCA object with the number of components you want to keep
pca = PCA(n_components=1)

# Fit the PCA model to your data
pca.fit(data)

# Project the data onto the first principal component
data_pca = pca.transform(data)

# You can access the explained variance ratio for each component
print(pca.explained_variance_ratio_)

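In this example, we use PCA to reduce the data from 2 dimensions to 1 dimension, while retaining as much of the variance as possible. The explained variance ratio tells us what fraction of the total variance in the data is explained by each component. It is reported as a value between 0 and 1, and for this dataset the single retained component accounts for about 0.992 of the variance.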

In PCA, the goal is to transform the original variables into a new set of uncorrelated variables, called “principal components”, which explain the maximum amount of variation in the data. The first principal component is the direction that explains the largest amount of variance in the data, and each subsequent component explains the largest amount of remaining variance while being orthogonal to the previous components.
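
The properties described above can be checked numerically. The following is a minimal sketch on synthetic data (the random dataset and the variable names are assumptions made for illustration, not part of the original example): it confirms that the principal axes are orthonormal, that the transformed columns are uncorrelated, and that the variances come out in decreasing order.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated data (an assumption, used only for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 1] += 0.8 * X[:, 0]  # make column 1 correlated with column 0
X[:, 2] += 0.5 * X[:, 1]  # make column 2 correlated with column 1

pca = PCA()
scores = pca.fit_transform(X)

# Rows of components_ are the principal axes; they should be orthonormal
gram = pca.components_ @ pca.components_.T
axes_orthonormal = np.allclose(gram, np.eye(3))

# The transformed columns should be uncorrelated:
# the off-diagonal entries of their covariance matrix are (numerically) zero
cov_scores = np.cov(scores, rowvar=False)
off_diag = cov_scores - np.diag(np.diag(cov_scores))
scores_uncorrelated = np.allclose(off_diag, 0.0, atol=1e-10)

# The variance explained by each component is in decreasing order
variances_sorted = bool(np.all(np.diff(pca.explained_variance_) <= 0))

print(axes_orthonormal, scores_uncorrelated, variances_sorted)
```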

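The scikit-learn implementation of PCA uses Singular Value Decomposition (SVD) to perform the transformation. Conceptually, the fit method centers the data (it subtracts each feature’s mean but does not rescale the features) and decomposes the centered matrix. The right singular vectors are the principal axes, stored in components_. They coincide with the eigenvectors of the covariance matrix of the data, and the squared singular values divided by n_samples - 1 are the corresponding eigenvalues, stored in explained_variance_. The transform method centers the data with the learned mean and projects it onto the principal component axes.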
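
To make the link between SVD and the covariance matrix concrete, here is a minimal sketch that reproduces the PCA results by hand with NumPy on the small dataset from the first example. It is an illustration of the underlying math under the assumptions stated in the comments, not a copy of scikit-learn's internal code. Principal axes are only defined up to sign, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA

data = np.array(
    [[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float
)

# Step 1: center the data (subtract each column's mean)
centered = data - data.mean(axis=0)

# Step 2: SVD of the centered matrix; rows of Vt are the principal axes
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Step 3: variances along each axis and their share of the total
manual_variance = S**2 / (data.shape[0] - 1)
manual_ratio = manual_variance / manual_variance.sum()

# Step 4: project the centered data onto the axes
manual_scores = centered @ Vt.T

# Compare with scikit-learn (signs of the axes may differ)
pca = PCA(n_components=2).fit(data)
sk_scores = pca.transform(data)

ratios_match = np.allclose(manual_ratio, pca.explained_variance_ratio_)
axes_match = np.allclose(np.abs(Vt), np.abs(pca.components_))
scores_match = np.allclose(np.abs(manual_scores), np.abs(sk_scores))

# The same variances are the eigenvalues of the covariance matrix
eigvals = np.linalg.eigvalsh(np.cov(data, rowvar=False))[::-1]
eig_match = np.allclose(eigvals, manual_variance)

print(ratios_match, axes_match, scores_match, eig_match)
```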

The n_components parameter controls the number of principal components to keep. If n_components is set to None (the default), all components will be kept. That is min(n_samples, n_features) components, so when there are at least as many samples as features, the transformed data will have the same number of columns as the original data. If n_components is set to an integer k, the transformed data will have k columns, one for each of the k principal components that explain the largest amount of variance in the data.
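
A quick way to see the effect of n_components is to compare output shapes. The sketch below uses random data whose sizes are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))  # 100 samples, 5 features (arbitrary sizes)

# Default: n_components=None keeps every component
full = PCA().fit_transform(X)

# Keep only the two components with the largest variance
reduced = PCA(n_components=2).fit_transform(X)

print(full.shape)     # (100, 5)
print(reduced.shape)  # (100, 2)

# With fewer samples than features, the default keeps
# min(n_samples, n_features) components
X_wide = rng.normal(size=(3, 5))
wide = PCA().fit_transform(X_wide)
print(wide.shape)     # (3, 3)
```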

After fitting the PCA model, you can access the explained variance ratio using the explained_variance_ratio_ attribute. This can be useful to understand how much information is retained by each principal component and to decide how many components to keep.
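
One common way to use this attribute when deciding how many components to keep is to look at the cumulative explained variance and pick the smallest number of components that reaches a target. The following is a sketch under assumptions: the synthetic dataset and the 0.95 threshold are illustrative choices, not recommendations from the original text.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 10 features driven by 3 latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 10))

# Fit with all components, then accumulate the explained variance ratios
pca = PCA(svd_solver="full").fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

threshold = 0.95  # assumed target share of variance to retain
# Index of the first cumulative value that reaches the threshold, plus one
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative.round(3))
print("components to keep:", n_keep)

# scikit-learn can make the same selection directly: a float between 0 and 1
# asks for enough components to explain that share of the variance
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X)
print("selected by PCA:", pca_auto.n_components_)
```

Because the data here is generated from three latent factors, the selected number of components should not exceed three.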

It’s worth noting that PCA is sensitive to the scaling of the features, so it’s usually a good idea to standardize the data before applying PCA. You can use the StandardScaler class from scikit-learn to do this:

from sklearn.preprocessing import StandardScaler

# Standardize the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Apply PCA to the scaled data
pca = PCA(n_components=1)
pca.fit(data_scaled)
data_pca = pca.transform(data_scaled)
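
To illustrate why scaling matters, the sketch below compares PCA on raw and standardized data for two independent features measured on very different scales. The feature names and values are hypothetical, and chaining the two steps with make_pipeline is one convenient option rather than a requirement.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two independent features on very different scales
rng = np.random.default_rng(1)
height_m = rng.normal(1.7, 0.1, size=500)      # metres
income = rng.normal(50_000, 10_000, size=500)  # currency units
X = np.column_stack([height_m, income])

# PCA on the raw data: the large-scale feature dominates the first component
raw_ratio = PCA().fit(X).explained_variance_ratio_

# Standardize first, then apply PCA, chained in a single pipeline
scaled_pca = make_pipeline(StandardScaler(), PCA())
scaled_pca.fit(X)
scaled_ratio = scaled_pca.named_steps["pca"].explained_variance_ratio_

print(raw_ratio)     # first value is very close to 1
print(scaled_ratio)  # the two values are roughly balanced
```

A pipeline also ensures that the scaler fitted on the training data is reused when new data is transformed.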