Commit 70a0c07

xhlulu authored and committed
ML Docs: Start pca notebook
1 parent 0b38106 commit 70a0c07

File tree

1 file changed: +135 -0 lines changed

doc/python/ml-pca.md

Lines changed: 135 additions & 0 deletions
---
jupyter:
  jupytext:
    notebook_metadata_filter: all
    text_representation:
      extension: .md
      format_name: markdown
      format_version: '1.1'
    jupytext_version: 1.1.1
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
  language_info:
    codemirror_mode:
      name: ipython
      version: 3
    file_extension: .py
    mimetype: text/x-python
    name: python
    nbconvert_exporter: python
    pygments_lexer: ipython3
    version: 3.7.6
  plotly:
    description: Visualize Principal Component Analysis (PCA) of your high-dimensional
      data with Plotly on Python.
    display_as: ai_ml
    language: python
    layout: base
    name: PCA Visualization
    order: 4
    page_type: example_index
    permalink: python/pca-visualization/
    thumbnail: thumbnail/ml-pca.png
---

## Basic PCA Scatter Plot

This example shows how to visualize the first two principal components of a PCA by reducing the 4-dimensional Iris dataset to 2D. It uses scikit-learn's `PCA`.

```python
import plotly.express as px
from sklearn.decomposition import PCA

df = px.data.iris()
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]

pca = PCA(n_components=2)
components = pca.fit_transform(X)

fig = px.scatter(x=components[:, 0], y=components[:, 1], color=df['species'])
fig.show()
```

## Visualize PCA with `px.scatter_3d`

Just like the basic PCA plot, but this lets you visualize the first three principal components. It additionally displays the total variance explained by those components.

```python
import plotly.express as px
from sklearn.decomposition import PCA

df = px.data.iris()
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]

pca = PCA(n_components=3)
components = pca.fit_transform(X)

total_var = pca.explained_variance_ratio_.sum() * 100

fig = px.scatter_3d(
    x=components[:, 0], y=components[:, 1], z=components[:, 2],
    color=df['species'],
    title=f'Total Explained Variance: {total_var:.2f}%',
    labels={'x': 'PC 1', 'y': 'PC 2', 'z': 'PC 3'},
)
fig.show()
```

## Plot high-dimensional components with `px.scatter_matrix`

If you need to visualize more than 3 dimensions, you can use a scatter plot matrix.

```python
import pandas as pd
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.datasets import load_boston

boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)

pca = PCA(n_components=5)
components = pca.fit_transform(df)

total_var = pca.explained_variance_ratio_.sum() * 100

labels = {str(i): f"PC {i+1}" for i in range(5)}
labels['color'] = 'Median Price'

fig = px.scatter_matrix(
    components,
    color=boston.target,
    dimensions=range(5),
    labels=labels,
    title=f'Total Explained Variance: {total_var:.2f}%',
)
fig.update_traces(diagonal_visible=False)
fig.show()
```

## Plotting explained variance

Often you will want to see how much variance the PCA explains as you increase the number of components, in order to decide how many dimensions to ultimately keep or analyze. This example shows how to quickly plot the cumulative sum of explained variance for a high-dimensional dataset like [Diabetes](https://scikit-learn.org/stable/datasets/index.html#diabetes-dataset).

```python
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

pca = PCA()
pca.fit(df)
exp_var_cumul = np.cumsum(pca.explained_variance_ratio_)

fig = px.area(
    x=range(1, exp_var_cumul.shape[0] + 1),
    y=exp_var_cumul,
    labels={"x": "# Components", "y": "Explained Variance"}
)
fig.show()
```
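Instead of reading the cut-off from the plot, you can let scikit-learn pick it for you: passing a float in (0, 1) as `n_components` keeps the smallest number of components whose cumulative explained variance reaches that threshold (a sketch of that option, not part of this commit):

```python
from sklearn.decomposition import PCA
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()

# A float threshold: keep just enough components to explain >= 95% of variance
pca = PCA(n_components=0.95)
components = pca.fit_transform(diabetes.data)

print(f"{components.shape[1]} of {diabetes.data.shape[1]} components "
      f"explain {pca.explained_variance_ratio_.sum():.1%} of the variance")
```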

## Visualize loadings
