PCA on Cancer dataset

shekhar pandey
May 16, 2020

Dimensionality reduction for visualization:
We often work with high-dimensional datasets and need to project them into a lower-dimensional space so that we can visualize them, while retaining as much of the original information as possible.

Principal Component Analysis (PCA):
The main idea of PCA is to reduce the dimensionality of a dataset consisting of many variables (i.e. dimensions), while retaining as much of the variation (i.e. the information) present in the dataset as possible. This is done by transforming the variables into a new set of variables, which are known as the principal components.
These principal components retain the variation present in the original variables in an ordered manner, i.e. the first principal component retains the most variation, the second principal component the next most, and so on.
So if we can convert a high-dimensional dataset into 2 or 3 dimensions while retaining around 80% to 90% of the original variation, that really helps with visualization.
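
To make the ordering of principal components concrete, here is a minimal NumPy sketch of the idea on a small synthetic dataset. The data, the variable names, and the eigen-decomposition route are assumptions chosen for illustration; scikit-learn's PCA, used in the rest of this post, computes the components internally with an SVD rather than with this exact code.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 200 samples, 3 features,
# where the first two features are strongly correlated
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Standardize each feature to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigen-decompose the covariance matrix of the standardized data
cov = np.cov(X_std, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; reorder so the largest is first
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Fraction of the total variance captured by each principal component
explained_ratio = eigvals / eigvals.sum()

# Project the data onto the first two principal components
X_2d = X_std @ eigvecs[:, :2]

print(explained_ratio)  # sorted from largest to smallest
print(X_2d.shape)
```

Because two of the three synthetic features are nearly copies of each other, most of the variance should end up in the first component, which is the situation where dropping the later components costs little information.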

# import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import load_breast_cancer

# load dataset
breast_cancer = load_breast_cancer()

type(breast_cancer)
# sklearn.utils.Bunch

# to see detailed description of dataset
print(breast_cancer.DESCR)

breast_cancer.data.shape
# (569, 30): 569 data points (rows) and 30 features (columns)

breast_cancer.target.shape
# (569,): 569 labels, each 0 or 1

raw_data = breast_cancer.data

# standardize each feature to zero mean and unit variance
normalized_data = StandardScaler().fit_transform(raw_data)

# initialize pca with 2 components
pca = PCA(n_components=2)

# fit PCA and project the data onto the 2 principal components
pca_data = pca.fit_transform(normalized_data)

# Variance explained by principal components
print(pca.explained_variance_ratio_)
# [0.44272026 0.18971182]

# Total Variance explained by principal components
total_var = 100 * np.sum(pca.explained_variance_ratio_)
print(f'{total_var:.3}% of total variance is explained by 2 principal components')
# 63.2% of total variance is explained by 2 principal components
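
Two components fall short of the 80% to 90% range mentioned earlier, so it can be useful to check how many components that range would take. The following is a sketch under stated assumptions: the names pca_full, cumulative, target and n_needed are introduced here for illustration, and the 0.90 threshold is an arbitrary choice rather than part of the original analysis.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same preprocessing as above: standardize the 30 features
X = StandardScaler().fit_transform(load_breast_cancer().data)

# Fit PCA keeping all components (n_components defaults to None)
pca_full = PCA().fit(X)

# Running total of the variance explained by the first k components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Illustrative threshold (an assumption): keep 90% of the variance
target = 0.90

# Index of the first cumulative value that reaches the target, plus 1
n_needed = int(np.argmax(cumulative >= target)) + 1

print(f'{n_needed} components explain {100 * cumulative[n_needed - 1]:.1f}% of the variance')
```

For visualization we still keep only 2 components, since more than 3 cannot be plotted directly. When the goal is compression rather than plotting, scikit-learn's PCA also accepts a float between 0 and 1 as n_components to select the number of components by explained variance.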

So, with PCA we reduced a 30-dimensional dataset to 2 dimensions while retaining about 63% of the variance in the original dataset. We can now plot the transformed dataset on a 2-dimensional graph.

# Create dataframe
pca_df = pd.DataFrame(pca_data, columns=['1st_Prin', '2nd_Prin'])

# Map target 0 to Malignant and 1 to Benign
# (chained replace(..., inplace=True) on a column triggers a warning in recent pandas)
pca_df['label'] = pd.Series(breast_cancer.target).map({0: 'Malignant', 1: 'Benign'})

# Check the count of label
pca_df.label.value_counts()

# Benign 357
# Malignant 212
# This count matches with labels as per dataset description

# Create Plot
# Set palette of colors for different labels
pal = dict(Malignant="red", Benign="green")

# FacetGrid returns a grid object, so name it accordingly
grid = (
    sns.FacetGrid(pca_df, hue='label', height=6, palette=pal,
                  hue_order=["Malignant", "Benign"])
    .map(plt.scatter, '1st_Prin', '2nd_Prin')
    .add_legend()
)

plt.show()
