ML Docs: Update T-sne and UMAP section

xhlulu · xhlulu · commit 209dfea292d3 · 2020-08-12T19:38:37.000-04:00
diff --git a/doc/python/ml-tsne-umap-projections.md b/doc/python/ml-tsne-umap-projections.md
@@ -34,11 +34,22 @@ jupyter:
     thumbnail: thumbnail/tsne-umap-projections.png
 ---
 
+This page presents various ways to visualize the output of two popular dimensionality reduction techniques, namely the [t-distributed stochastic neighbor embedding](https://lvdmaaten.github.io/tsne/) (t-SNE) and [Uniform Manifold Approximation and Projection](https://umap-learn.readthedocs.io/en/latest/index.html) (UMAP). They are useful whenever you want to visualize data with more than two or three features (i.e. dimensions).
+
+We first show how to visualize data with more than three features using the [scatter plot matrix](https://medium.com/plotly/what-is-a-splom-chart-make-scatterplot-matrices-in-python-8dc4998921c3), then we apply dimensionality reduction techniques to get a 2D/3D representation of our data, and visualize the results with [scatter plots](https://plotly.com/python/line-and-scatter/) and [3D scatter plots](https://plotly.com/python/3d-scatter-plots/).
+
+
 ## Basic t-SNE projections
 
+t-SNE is a popular dimensionality reduction algorithm rooted in probability theory. Simply put, it projects high-dimensional data points (sometimes with hundreds of features) into 2D/3D so that the distribution of similarities between the projected points resembles that of the original data points. It does so by minimizing the [KL divergence](https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-understanding-kl-divergence-2b382ca2b2a8) between the two distributions.
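
To make the quantity being minimized more concrete, here is a minimal sketch of the KL divergence between two small discrete distributions. It is not how t-SNE is implemented: t-SNE builds its distributions from pairwise similarities between points, whereas the function name `kl_divergence` and the three toy distributions below are made up purely for illustration.

```python
import math


def kl_divergence(p, q):
    """Return KL(P || Q) in nats for two discrete distributions of equal length.

    Assumes every q_i is strictly positive wherever p_i is positive.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of entries")
    # Terms with p_i == 0 contribute nothing, by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical distributions: one "original" and two candidate approximations.
original = [0.5, 0.3, 0.2]
close_match = [0.45, 0.35, 0.20]
poor_match = [0.1, 0.1, 0.8]

print("identical:  ", kl_divergence(original, original))
print("close match:", kl_divergence(original, close_match))
print("poor match: ", kl_divergence(original, poor_match))
```

The divergence is zero when the two distributions are identical and grows as they drift apart, which is why driving it down pulls the projected points toward an arrangement that mirrors the original one.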
+
+Compared to a method like Principal Component Analysis (PCA), it takes significantly more time to converge, but it often presents significantly better insights when visualized. For example, when projecting the features of flowers, it is able to group the different species into distinct clusters.
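
A rough way to see the speed difference is to time both projections on the same data. The sketch below assumes scikit-learn is installed and uses its copy of the Iris dataset; the helper `timed_projection` is a hypothetical name introduced here, and the timings will vary from machine to machine.

```python
import time

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# 150 flowers, 4 features each.
X = load_iris().data


def timed_projection(model, data):
    """Fit the model, return the projected data and the elapsed seconds."""
    start = time.perf_counter()
    projected = model.fit_transform(data)
    return projected, time.perf_counter() - start


pca_proj, pca_seconds = timed_projection(PCA(n_components=2), X)
tsne_proj, tsne_seconds = timed_projection(
    TSNE(n_components=2, random_state=0), X
)

print(f"PCA:   {pca_proj.shape} in {pca_seconds:.3f}s")
print(f"t-SNE: {tsne_proj.shape} in {tsne_seconds:.3f}s")
```

Both calls return one 2D coordinate pair per flower, so either result can be passed to `px.scatter` in the same way.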
+
 
 ### Visualizing high-dimensional data with `px.scatter_matrix`
 
+First, let's try to visualize every feature of the [Iris dataset](https://archive.ics.uci.edu/ml/datasets/iris), and color each point by its species. We will use the Scatter Plot Matrix ([splom](https://plotly.com/python/splom/)), which plots each feature against every other feature. This is convenient when your dataset has more than 3 dimensions.
+
 ```python
 import plotly.express as px
 
@@ -50,6 +61,8 @@ fig.show()
 
 ### Project data into 2D with t-SNE and `px.scatter`
 
+Now, let's use the t-SNE algorithm to project the data shown above into two dimensions. Notice how each species forms its own cluster, visibly separated from the others.
+
 ```python
 from sklearn.manifold import TSNE
 import plotly.express as px
@@ -70,6 +83,8 @@ fig.show()
 
 ### Project data into 3D with t-SNE and `px.scatter_3d`
 
+t-SNE is not limited to two dimensions. Here, we show you how to project the data to 3D and visualize it with a 3D scatter plot.
+
 ```python
 from sklearn.manifold import TSNE
 import plotly.express as px
@@ -125,7 +140,9 @@ fig_3d.show()
 
 ## Visualizing image datasets
 
-In the following example, we show how to visualize large image datasets using UMAP. Here, we use [`load_digits`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html), a subset of the famous MNIST dataset that was downsized to 8x8 and flattened to 64 dimensions.
+In the following example, we show how to visualize large image datasets using UMAP. Here, we use [`load_digits`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html), a small handwritten-digit dataset similar in spirit to the famous [MNIST dataset](http://yann.lecun.com/exdb/mnist/), whose 8x8 images are flattened to 64 dimensions.
+
+Although there are over 1000 data points, and many more dimensions than in the previous example, the projection is still very fast. This is because UMAP is optimized for speed, both from a theoretical perspective and in the way it is implemented. Learn more in [this comparison post](https://umap-learn.readthedocs.io/en/latest/benchmarking.html).
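
Before projecting, it can help to confirm the size of the data described above. The following is a small sketch, assuming scikit-learn is installed, that only inspects the dataset and does not run UMAP; the variable names are illustrative.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Each row of `data` is one 8x8 image flattened into 64 pixel values.
n_samples, n_features = digits.data.shape
print(f"{n_samples} images, {n_features} features each")

# Reshaping a row recovers the original 8x8 image.
first_image = digits.data[0].reshape(8, 8)
print("label of first image:", digits.target[0])
print(first_image)
```

The `digits.data` array is what gets passed to the projection, while `digits.target` holds the digit labels that can be used to color the resulting scatter plot.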
 
 ```python
 import plotly.express as px
@@ -146,4 +163,18 @@ fig = px.scatter(
 fig.show()
 ```
 
-### Reference
+<!-- #region -->
+## Reference
+
+Plotly figures:
+* https://plotly.com/python/line-and-scatter/
+* https://plotly.com/python/3d-scatter-plots/
+* https://plotly.com/python/splom/
+
+
+Details about algorithms:
+* UMAP library: https://umap-learn.readthedocs.io/en/latest/
+* t-SNE User guide: https://scikit-learn.org/stable/modules/manifold.html#t-sne
+* t-SNE paper: https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
+* MNIST: http://yann.lecun.com/exdb/mnist/
+<!-- #endregion -->