Commit 209dfea

author
xhlulu
committed
ML Docs: Update T-sne and UMAP section
1 parent 2152601 commit 209dfea

1 file changed

+33
-2
lines changed

doc/python/ml-tsne-umap-projections.md

Lines changed: 33 additions & 2 deletions
@@ -34,11 +34,22 @@ jupyter:
3434
thumbnail: thumbnail/tsne-umap-projections.png
3535
---
3636

37+
This page presents various ways to visualize the output of two popular dimensionality reduction techniques, namely [t-distributed stochastic neighbor embedding](https://lvdmaaten.github.io/tsne/) (t-SNE) and [Uniform Manifold Approximation and Projection](https://umap-learn.readthedocs.io/en/latest/index.html) (UMAP). They are useful whenever you want to visualize data with more than two or three features (i.e. dimensions).
38+
39+
We first show how to visualize data with more than three features using the [scatter plot matrix](https://medium.com/plotly/what-is-a-splom-chart-make-scatterplot-matrices-in-python-8dc4998921c3). Then we apply dimensionality reduction techniques to get a 2D/3D representation of our data, and visualize the results with [scatter plots](https://plotly.com/python/line-and-scatter/) and [3D scatter plots](https://plotly.com/python/3d-scatter-plots/).
40+
41+
3742
## Basic t-SNE projections
3843

44+
t-SNE is a popular dimensionality reduction algorithm rooted in probability theory. Simply put, it projects high-dimensional data points (sometimes with hundreds of features) into 2D/3D so that the projected points have a distribution similar to that of the original points. It does this by minimizing the [KL divergence](https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-understanding-kl-divergence-2b382ca2b2a8) between the two distributions.
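
To make the KL divergence less abstract, here is a minimal sketch for two small discrete distributions. The function name `kl_divergence` and the example values are illustrative assumptions, not part of t-SNE's implementation, which applies the same idea to pairwise similarity distributions.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions given as equal-length sequences.

    Illustrative helper (hypothetical name); terms with p_i == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Made-up distributions for illustration only
p = [0.5, 0.3, 0.2]
q_close = [0.45, 0.35, 0.2]  # similar to p
q_far = [0.1, 0.1, 0.8]      # very different from p

kl_close = kl_divergence(p, q_close)
kl_far = kl_divergence(p, q_far)

# The closer the two distributions, the smaller the divergence
print(f"KL(p || q_close) = {kl_close:.4f}")
print(f"KL(p || q_far)   = {kl_far:.4f}")
```

The divergence is zero when both distributions are identical and grows as they differ, which is why minimizing it pulls the low-dimensional layout toward the structure of the original data.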
45+
46+
Compared to a method like Principal Component Analysis (PCA), it takes significantly more time to converge, but often presents significantly better insights when visualized. For example, when projecting the features of flowers, it is able to group them distinctly.
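
The difference in running time can be checked with a rough timing sketch. It assumes scikit-learn is installed and uses `sklearn.datasets.load_iris` as a stand-in for the Iris data used on this page. Absolute timings depend on your machine, so no expected values are shown.

```python
import time

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Assumption: load_iris is used here only as a convenient copy of the Iris data
X = load_iris().data  # 150 samples, 4 features

# Time a 2D PCA projection
start = time.perf_counter()
pca_proj = PCA(n_components=2).fit_transform(X)
pca_seconds = time.perf_counter() - start

# Time a 2D t-SNE projection; random_state fixes the stochastic initialization
start = time.perf_counter()
tsne_proj = TSNE(n_components=2, random_state=0).fit_transform(X)
tsne_seconds = time.perf_counter() - start

print(f"PCA:   {pca_seconds:.3f}s, output shape {pca_proj.shape}")
print(f"t-SNE: {tsne_seconds:.3f}s, output shape {tsne_proj.shape}")
```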
47+
3948

4049
### Visualizing high-dimensional data with `px.scatter_matrix`
4150

51+
First, let's try to visualize every feature of the [Iris dataset](https://archive.ics.uci.edu/ml/datasets/iris), and color everything by the species. We will use the Scatter Plot Matrix ([splom](https://plotly.com/python/splom/)), which lets us plot each feature against every other feature. This is convenient when your dataset has more than 3 dimensions.
52+
4253
```python
4354
import plotly.express as px
4455

@@ -50,6 +61,8 @@ fig.show()
5061

5162
### Project data into 2D with t-SNE and `px.scatter`
5263

64+
Now, let's use the t-SNE algorithm to project the data shown above into two dimensions. Notice how the species form visibly separate groups.
65+
5366
```python
5467
from sklearn.manifold import TSNE
5568
import plotly.express as px
@@ -70,6 +83,8 @@ fig.show()
7083

7184
### Project data into 3D with t-SNE and `px.scatter_3d`
7285

86+
t-SNE can reduce your data to any number of dimensions you want! Here, we show you how to project it to 3D and visualize with a 3D scatter plot.
87+
7388
```python
7489
from sklearn.manifold import TSNE
7590
import plotly.express as px
@@ -125,7 +140,9 @@ fig_3d.show()
125140

126141
## Visualizing image datasets
127142

128-
In the following example, we show how to visualize large image datasets using UMAP. Here, we use [`load_digits`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html), a subset of the famous MNIST dataset that was downsized to 8x8 and flattened to 64 dimensions.
143+
In the following example, we show how to visualize large image datasets using UMAP. Here, we use [`load_digits`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html), scikit-learn's copy of the UCI handwritten digits test set. It resembles the famous [MNIST dataset](http://yann.lecun.com/exdb/mnist/), but its images are 8x8 and flattened to 64 dimensions.
144+
145+
Although there are over 1000 data points and many more dimensions than in the previous example, the projection is still extremely fast. This is because UMAP is optimized for speed, both from a theoretical perspective and in the way it is implemented. Learn more in [this comparison post](https://umap-learn.readthedocs.io/en/latest/benchmarking.html).
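
The size of the dataset can be checked directly before projecting it. This short sketch assumes only that scikit-learn is installed and does not run UMAP itself.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Each row of `data` is one image flattened to a feature vector
n_samples, n_features = digits.data.shape
print(f"{n_samples} samples, {n_features} features")

# The unflattened images are available under `images`
print(f"Image shape: {digits.images[0].shape}")
```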
129146

130147
```python
131148
import plotly.express as px
@@ -146,4 +163,18 @@ fig = px.scatter(
146163
fig.show()
147164
```
148165

149-
### Reference
166+
<!-- #region -->
167+
## Reference
168+
169+
Plotly figures:
170+
* https://plotly.com/python/line-and-scatter/
171+
* https://plotly.com/python/3d-scatter-plots/
172+
* https://plotly.com/python/splom/
173+
174+
175+
Details about algorithms:
176+
* UMAP library: https://umap-learn.readthedocs.io/en/latest/
177+
* t-SNE User guide: https://scikit-learn.org/stable/modules/manifold.html#t-sne
178+
* t-SNE paper: https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
179+
* MNIST: http://yann.lecun.com/exdb/mnist/
180+
<!-- #endregion -->
