
Commit 918006b

Pushing the docs to dev/ for branch: master, commit a192e75833b1e8d1999579501f7cb1fd99dffbfe
1 parent 30b06bc commit 918006b

File tree — 1,172 files changed: +4129 −3933 lines


dev/_downloads/36b58500501fbf3f06587ee0039d1985/plot_johnson_lindenstrauss_bound.ipynb

Lines changed: 88 additions & 2 deletions
Large diffs are not rendered by default.

dev/_downloads/9806f0059c4cc6c99c54414e573e6615/plot_johnson_lindenstrauss_bound.py

Lines changed: 85 additions & 84 deletions
@@ -9,86 +9,8 @@
 space while controlling the distortion in the pairwise distances.
 
 .. _`Johnson-Lindenstrauss lemma`: https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma
-
-
-Theoretical bounds
-==================
-
-The distortion introduced by a random projection `p` is asserted by
-the fact that `p` defines an eps-embedding with good probability,
-as defined by:
-
-.. math::
-   (1 - eps) \|u - v\|^2 < \|p(u) - p(v)\|^2 < (1 + eps) \|u - v\|^2
-
-where u and v are any rows taken from a dataset of shape [n_samples,
-n_features] and p is a projection by a random Gaussian N(0, 1) matrix
-with shape [n_components, n_features] (or a sparse Achlioptas matrix).
-
-The minimum number of components that guarantees the eps-embedding is
-given by:
-
-.. math::
-   n\_components >= 4 log(n\_samples) / (eps^2 / 2 - eps^3 / 3)
-
-
-The first plot shows that with an increasing number of samples ``n_samples``,
-the minimal number of dimensions ``n_components`` increases logarithmically
-in order to guarantee an ``eps``-embedding.
-
-The second plot shows that increasing the admissible
-distortion ``eps`` allows drastically reducing the minimal number of
-dimensions ``n_components`` for a given number of samples ``n_samples``.
-
-
-Empirical validation
-====================
-
-We validate the above bounds on the digits dataset or on the 20 newsgroups
-text document (TF-IDF word frequencies) dataset:
-
-- for the digits dataset, some 8x8 gray level pixel data for 500
-  handwritten digit pictures are randomly projected to spaces of various
-  larger numbers of dimensions ``n_components``.
-
-- for the 20 newsgroups dataset, some 500 documents with 100k
-  features in total are projected using a sparse random matrix to smaller
-  euclidean spaces with various values for the target number of dimensions
-  ``n_components``.
-
-The default dataset is the digits dataset. To run the example on the twenty
-newsgroups dataset, pass the --twenty-newsgroups command line argument to this
-script.
-
-For each value of ``n_components``, we plot:
-
-- the 2D distribution of sample pairs with pairwise distances in the original
-  and projected spaces as the x and y axes respectively.
-
-- the 1D histogram of the ratio of those distances (projected / original).
-
-We can see that for low values of ``n_components`` the distribution is wide,
-with many distorted pairs and a skewed distribution (due to the hard
-limit of zero ratio on the left, as distances are always positive),
-while for larger values of ``n_components`` the distortion is controlled
-and the distances are well preserved by the random projection.
-
-
-Remarks
-=======
-
-According to the JL lemma, projecting 500 samples without too much distortion
-will require at least several thousand dimensions, irrespective of the
-number of features of the original dataset.
-
-Hence using random projections on the digits dataset, which only has 64
-features in the input space, does not make sense: it does not allow for
-dimensionality reduction in this case.
-
-On the twenty newsgroups dataset, on the other hand, the dimensionality can be
-decreased from 56436 down to 10000 while reasonably preserving pairwise
-distances.
-
 """
+
 print(__doc__)
 
 import sys
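
For reference, the minimum-dimension bound quoted in this docstring is what the example itself computes through sklearn.random_projection.johnson_lindenstrauss_min_dim. A minimal sketch of the "500 samples need several thousand dimensions" remark (the eps values here are illustrative, not taken from the script):

    from sklearn.random_projection import johnson_lindenstrauss_min_dim

    # Minimum n_components guaranteeing an eps-embedding of 500 samples,
    # per the bound 4 * log(n_samples) / (eps^2 / 2 - eps^3 / 3).
    for eps in (0.1, 0.3, 0.5):
        print("eps=%.1f -> n_components >= %d"
              % (eps, johnson_lindenstrauss_min_dim(n_samples=500, eps=eps)))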
@@ -109,8 +31,30 @@
 else:
     density_param = {'normed': True}
 
-# Part 1: plot the theoretical dependency between n_components_min and
-# n_samples
+##########################################################
+# Theoretical bounds
+# ==================
+# The distortion introduced by a random projection `p` is asserted by
+# the fact that `p` defines an eps-embedding with good probability,
+# as defined by:
+#
+# .. math::
+#    (1 - eps) \|u - v\|^2 < \|p(u) - p(v)\|^2 < (1 + eps) \|u - v\|^2
+#
+# where u and v are any rows taken from a dataset of shape [n_samples,
+# n_features] and p is a projection by a random Gaussian N(0, 1) matrix
+# with shape [n_components, n_features] (or a sparse Achlioptas matrix).
+#
+# The minimum number of components that guarantees the eps-embedding is
+# given by:
+#
+# .. math::
+#    n\_components >= 4 log(n\_samples) / (eps^2 / 2 - eps^3 / 3)
+#
+#
+# The first plot shows that with an increasing number of samples ``n_samples``,
+# the minimal number of dimensions ``n_components`` increases logarithmically
+# in order to guarantee an ``eps``-embedding.
 
 # range of admissible distortions
 eps_range = np.linspace(0.1, 0.99, 5)
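
The eps-embedding inequality above can also be checked directly with plain NumPy. This is a toy verification, not part of the script, with illustratively small sizes; the 1/sqrt(n_components) scaling is folded into the Gaussian matrix so that squared distances are preserved in expectation:

    import numpy as np

    rng = np.random.RandomState(42)
    n_samples, n_features, n_components = 100, 1000, 500

    X = rng.rand(n_samples, n_features)
    # random Gaussian N(0, 1) matrix, scaled by 1 / sqrt(n_components)
    P = rng.normal(size=(n_components, n_features)) / np.sqrt(n_components)
    X_p = np.dot(X, P.T)

    def sq_dists(A):
        # pairwise squared euclidean distances via the Gram-matrix identity
        sq = (A ** 2).sum(axis=1)
        return sq[:, None] + sq[None, :] - 2 * np.dot(A, A.T)

    mask = ~np.eye(n_samples, dtype=bool)  # drop zero self-distances
    ratios = sq_dists(X_p)[mask] / sq_dists(X)[mask]
    # with good probability all ratios lie inside (1 - eps, 1 + eps)
    print("distance ratio range: [%.3f, %.3f]" % (ratios.min(), ratios.max()))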
@@ -128,6 +72,13 @@
 plt.xlabel("Number of observations to eps-embed")
 plt.ylabel("Minimum number of dimensions")
 plt.title("Johnson-Lindenstrauss bounds:\nn_samples vs n_components")
+plt.show()
+
+
+##########################################################
+# The second plot shows that increasing the admissible
+# distortion ``eps`` allows drastically reducing the minimal number of
+# dimensions ``n_components`` for a given number of samples ``n_samples``.
 
 # range of admissible distortions
 eps_range = np.linspace(0.01, 0.99, 100)
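
The bound is cheap to evaluate by hand, and johnson_lindenstrauss_min_dim accepts array-valued eps, which is what the plotting code above relies on. A sketch with illustrative values; the helper should agree with the direct formula up to integer truncation:

    import numpy as np
    from sklearn.random_projection import johnson_lindenstrauss_min_dim

    eps = np.array([0.1, 0.3, 0.5, 0.9])
    # the bound evaluated directly ...
    manual = 4 * np.log(500) / (eps ** 2 / 2 - eps ** 3 / 3)
    # ... and via the scikit-learn helper
    print(johnson_lindenstrauss_min_dim(500, eps=eps))
    print(manual.astype(int))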
@@ -145,17 +96,42 @@
 plt.xlabel("Distortion eps")
 plt.ylabel("Minimum number of dimensions")
 plt.title("Johnson-Lindenstrauss bounds:\nn_components vs eps")
+plt.show()
 
-# Part 2: perform sparse random projection of some digits images which are
-# quite low dimensional and dense or documents of the 20 newsgroups dataset
-# which is both high dimensional and sparse
+##########################################################
+# Empirical validation
+# ====================
+#
+# We validate the above bounds on the digits dataset or on the 20 newsgroups
+# text document (TF-IDF word frequencies) dataset:
+#
+# - for the digits dataset, some 8x8 gray level pixel data for 500
+#   handwritten digit pictures are randomly projected to spaces of various
+#   larger numbers of dimensions ``n_components``.
+#
+# - for the 20 newsgroups dataset, some 500 documents with 100k
+#   features in total are projected using a sparse random matrix to smaller
+#   euclidean spaces with various values for the target number of dimensions
+#   ``n_components``.
+#
+# The default dataset is the digits dataset. To run the example on the twenty
+# newsgroups dataset, pass the --twenty-newsgroups command line argument to
+# this script.
 
 if '--twenty-newsgroups' in sys.argv:
     # Need an internet connection hence not enabled by default
     data = fetch_20newsgroups_vectorized().data[:500]
 else:
     data = load_digits().data[:500]
 
+##########################################################
+# For each value of ``n_components``, we plot:
+#
+# - the 2D distribution of sample pairs with pairwise distances in the
+#   original and projected spaces as the x and y axes respectively.
+#
+# - the 1D histogram of the ratio of those distances (projected / original).
+
 n_samples, n_features = data.shape
 print("Embedding %d samples with dim %d using various random projections"
       % (n_samples, n_features))
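
The empirical part of the script boils down to projecting the data with SparseRandomProjection and comparing pairwise distances before and after. A condensed sketch of that pattern (digits only, one illustrative n_components, random_state added for repeatability):

    from sklearn.datasets import load_digits
    from sklearn.metrics.pairwise import euclidean_distances
    from sklearn.random_projection import SparseRandomProjection

    data = load_digits().data[:500]
    dists = euclidean_distances(data, squared=True).ravel()
    nonzero = dists != 0  # ignore zero-distance self-pairs

    rp = SparseRandomProjection(n_components=300, random_state=42)
    projected_dists = euclidean_distances(
        rp.fit_transform(data), squared=True).ravel()

    rates = projected_dists[nonzero] / dists[nonzero]
    print("mean distance ratio: %.3f (std %.3f)" % (rates.mean(), rates.std()))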
@@ -205,3 +181,28 @@
 # as vertical lines / region
 
 plt.show()
+
+
+##########################################################
+# We can see that for low values of ``n_components`` the distribution is wide,
+# with many distorted pairs and a skewed distribution (due to the hard
+# limit of zero ratio on the left, as distances are always positive),
+# while for larger values of ``n_components`` the distortion is controlled
+# and the distances are well preserved by the random projection.
+
+
+##########################################################
+# Remarks
+# =======
+#
+# According to the JL lemma, projecting 500 samples without too much distortion
+# will require at least several thousand dimensions, irrespective of the
+# number of features of the original dataset.
+#
+# Hence using random projections on the digits dataset, which only has 64
+# features in the input space, does not make sense: it does not allow
+# for dimensionality reduction in this case.
+#
+# On the twenty newsgroups dataset, on the other hand, the dimensionality can
+# be decreased from 56436 down to 10000 while reasonably preserving
+# pairwise distances.

dev/_downloads/scikit-learn-docs.pdf

6.38 KB
Binary file not shown.

dev/_images/iris.png

28 Bytes (−23 Bytes)

0 commit comments
