  9 |   9 | space while controlling the distortion in the pairwise distances.
 10 |  10 |
 11 |  11 | .. _`Johnson-Lindenstrauss lemma`: https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma
 12 |     | -
 13 |     | -
 14 |     | -Theoretical bounds
 15 |     | -==================
 16 |     | -
 17 |     | -The distortion introduced by a random projection `p` is asserted by
 18 |     | -the fact that `p` is defining an eps-embedding with good probability
 19 |     | -as defined by:
 20 |     | -
 21 |     | -.. math::
 22 |     | -   (1 - eps) \|u - v\|^2 < \|p(u) - p(v)\|^2 < (1 + eps) \|u - v\|^2
 23 |     | -
 24 |     | -Where u and v are any rows taken from a dataset of shape [n_samples,
 25 |     | -n_features] and p is a projection by a random Gaussian N(0, 1) matrix
 26 |     | -with shape [n_components, n_features] (or a sparse Achlioptas matrix).
 27 |     | -
 28 |     | -The minimum number of components to guarantees the eps-embedding is
 29 |     | -given by:
 30 |     | -
 31 |     | -.. math::
 32 |     | -   n\_components >= 4 log(n\_samples) / (eps^2 / 2 - eps^3 / 3)
 33 |     | -
 34 |     | -
 35 |     | -The first plot shows that with an increasing number of samples ``n_samples``,
 36 |     | -the minimal number of dimensions ``n_components`` increased logarithmically
 37 |     | -in order to guarantee an ``eps``-embedding.
 38 |     | -
 39 |     | -The second plot shows that an increase of the admissible
 40 |     | -distortion ``eps`` allows to reduce drastically the minimal number of
 41 |     | -dimensions ``n_components`` for a given number of samples ``n_samples``
 42 |     | -
 43 |     | -
 44 |     | -Empirical validation
 45 |     | -====================
 46 |     | -
 47 |     | -We validate the above bounds on the digits dataset or on the 20 newsgroups
 48 |     | -text document (TF-IDF word frequencies) dataset:
 49 |     | -
 50 |     | -- for the digits dataset, some 8x8 gray level pixels data for 500
 51 |     | -  handwritten digits pictures are randomly projected to spaces for various
 52 |     | -  larger number of dimensions ``n_components``.
 53 |     | -
 54 |     | -- for the 20 newsgroups dataset some 500 documents with 100k
 55 |     | -  features in total are projected using a sparse random matrix to smaller
 56 |     | -  euclidean spaces with various values for the target number of dimensions
 57 |     | -  ``n_components``.
 58 |     | -
 59 |     | -The default dataset is the digits dataset. To run the example on the twenty
 60 |     | -newsgroups dataset, pass the --twenty-newsgroups command line argument to this
 61 |     | -script.
 62 |     | -
 63 |     | -For each value of ``n_components``, we plot:
 64 |     | -
 65 |     | -- 2D distribution of sample pairs with pairwise distances in original
 66 |     | -  and projected spaces as x and y axis respectively.
 67 |     | -
 68 |     | -- 1D histogram of the ratio of those distances (projected / original).
 69 |     | -
 70 |     | -We can see that for low values of ``n_components`` the distribution is wide
 71 |     | -with many distorted pairs and a skewed distribution (due to the hard
 72 |     | -limit of zero ratio on the left as distances are always positives)
 73 |     | -while for larger values of n_components the distortion is controlled
 74 |     | -and the distances are well preserved by the random projection.
 75 |     | -
 76 |     | -
 77 |     | -Remarks
 78 |     | -=======
 79 |     | -
 80 |     | -According to the JL lemma, projecting 500 samples without too much distortion
 81 |     | -will require at least several thousands dimensions, irrespective of the
 82 |     | -number of features of the original dataset.
 83 |     | -
 84 |     | -Hence using random projections on the digits dataset which only has 64 features
 85 |     | -in the input space does not make sense: it does not allow for dimensionality
 86 |     | -reduction in this case.
 87 |     | -
 88 |     | -On the twenty newsgroups on the other hand the dimensionality can be decreased
 89 |     | -from 56436 down to 10000 while reasonably preserving pairwise distances.
 90 |     | -
 91 |  12 | """
    |  13 | +
 92 |  14 | print(__doc__)
 93 |  15 |
 94 |  16 | import sys

109 |  31 | else:
110 |  32 |     density_param = {'normed': True}
111 |  33 |
112 |     | -# Part 1: plot the theoretical dependency between n_components_min and
113 |     | -# n_samples
    |  34 | +##########################################################
    |  35 | +# Theoretical bounds
    |  36 | +# ==================
    |  37 | +# The distortion introduced by a random projection `p` is controlled by
    |  38 | +# the fact that `p` defines an eps-embedding with good probability,
    |  39 | +# as defined by:
    |  40 | +#
    |  41 | +# .. math::
    |  42 | +#    (1 - eps) \|u - v\|^2 < \|p(u) - p(v)\|^2 < (1 + eps) \|u - v\|^2
    |  43 | +#
    |  44 | +# where u and v are any rows taken from a dataset of shape [n_samples,
    |  45 | +# n_features] and p is a projection by a random Gaussian N(0, 1) matrix
    |  46 | +# of shape [n_components, n_features] (or a sparse Achlioptas matrix).
    |  47 | +#
    |  48 | +# The minimum number of components required to guarantee the
    |  49 | +# eps-embedding is given by:
    |  50 | +#
    |  51 | +# .. math::
    |  52 | +#    n\_components >= 4 log(n\_samples) / (eps^2 / 2 - eps^3 / 3)
    |  53 | +#
    |  54 | +#
    |  55 | +# The first plot shows that with an increasing number of samples
    |  56 | +# ``n_samples``, the minimal number of dimensions ``n_components``
    |  57 | +# increases logarithmically in order to guarantee an ``eps``-embedding.
114 |  58 |
115 |  59 | # range of admissible distortions
116 |  60 | eps_range = np.linspace(0.1, 0.99, 5)
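The curve in this first plot is the bound above evaluated over a grid of sample
sizes. scikit-learn exposes that bound as
``sklearn.random_projection.johnson_lindenstrauss_min_dim``; the following
standalone sketch (not part of the diff, values chosen for illustration only)
queries it directly::

    # Evaluate the JL bound for a few sample sizes at a fixed admissible
    # distortion eps; the specific values are illustrative only.
    from sklearn.random_projection import johnson_lindenstrauss_min_dim

    for n in (100, 500, 10000):
        print(n, johnson_lindenstrauss_min_dim(n_samples=n, eps=0.5))

The printed minimum dimension grows only slowly with the number of samples,
matching the ``log(n_samples)`` dependence in the formula.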
128 |  72 | plt.xlabel("Number of observations to eps-embed")
129 |  73 | plt.ylabel("Minimum number of dimensions")
130 |  74 | plt.title("Johnson-Lindenstrauss bounds:\nn_samples vs n_components")
    |  75 | +plt.show()
    |  76 | +
    |  77 | +
    |  78 | +##########################################################
    |  79 | +# The second plot shows that increasing the admissible
    |  80 | +# distortion ``eps`` allows one to drastically reduce the minimal number
    |  81 | +# of dimensions ``n_components`` for a given number of samples ``n_samples``.
131 |  82 |
132 |  83 | # range of admissible distortions
133 |  84 | eps_range = np.linspace(0.01, 0.99, 100)
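The ``eps`` dependence can be read off the same helper; a minimal standalone
sketch for a fixed number of samples (again with illustrative values only)::

    # Minimum n_components as a function of the admissible distortion eps,
    # for a fixed number of samples; mirrors what the second plot displays.
    import numpy as np
    from sklearn.random_projection import johnson_lindenstrauss_min_dim

    eps_grid = np.linspace(0.1, 0.9, 5)
    print(johnson_lindenstrauss_min_dim(n_samples=1000, eps=eps_grid))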
145 |  96 | plt.xlabel("Distortion eps")
146 |  97 | plt.ylabel("Minimum number of dimensions")
147 |  98 | plt.title("Johnson-Lindenstrauss bounds:\nn_components vs eps")
    |  99 | +plt.show()
148 | 100 |
149 |     | -# Part 2: perform sparse random projection of some digits images which are
150 |     | -# quite low dimensional and dense or documents of the 20 newsgroups dataset
151 |     | -# which is both high dimensional and sparse
    | 101 | +##########################################################
    | 102 | +# Empirical validation
    | 103 | +# ====================
    | 104 | +#
    | 105 | +# We validate the above bounds on the digits dataset or on the 20 newsgroups
    | 106 | +# text documents (TF-IDF word frequencies) dataset:
    | 107 | +#
    | 108 | +# - for the digits dataset, the 8x8 gray-level pixel data of 500
    | 109 | +#   handwritten digit pictures are randomly projected to spaces of various
    | 110 | +#   larger numbers of dimensions ``n_components``.
    | 111 | +#
    | 112 | +# - for the 20 newsgroups dataset, some 500 documents with 100k
    | 113 | +#   features in total are projected using a sparse random matrix to smaller
    | 114 | +#   Euclidean spaces with various values for the target number of dimensions
    | 115 | +#   ``n_components``.
    | 116 | +#
    | 117 | +# The default dataset is the digits dataset. To run the example on the twenty
    | 118 | +# newsgroups dataset, pass the --twenty-newsgroups command-line argument to
    | 119 | +# this script.
152 | 120 |
153 | 121 | if '--twenty-newsgroups' in sys.argv:
154 | 122 |     # Need an internet connection hence not enabled by default
155 | 123 |     data = fetch_20newsgroups_vectorized().data[:500]
156 | 124 | else:
157 | 125 |     data = load_digits().data[:500]
158 | 126 |
    | 127 | +##########################################################
    | 128 | +# For each value of ``n_components``, we plot:
    | 129 | +#
    | 130 | +# - the 2D distribution of sample pairs, with pairwise distances in the
    | 131 | +#   original and projected spaces as the x- and y-axis respectively;
    | 132 | +#
    | 133 | +# - a 1D histogram of the ratio of those distances (projected / original).
    | 134 | +
159 | 135 | n_samples, n_features = data.shape
160 | 136 | print("Embedding %d samples with dim %d using various random projections"
161 | 137 |       % (n_samples, n_features))
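The projection and distance-ratio measurement described above can also be
sketched in isolation. The snippet below is a minimal standalone version in the
same spirit as the example (assuming scikit-learn is installed; the choice of
``n_components=300`` is arbitrary)::

    # Ratio of pairwise distances after / before a sparse random projection
    # of the digits data; the n_components value is arbitrary.
    from sklearn.datasets import load_digits
    from sklearn.metrics.pairwise import euclidean_distances
    from sklearn.random_projection import SparseRandomProjection

    data = load_digits().data[:500]
    dists = euclidean_distances(data, squared=True).ravel()
    nonzero = dists != 0  # drop self-distances and identical pairs

    projected = SparseRandomProjection(n_components=300,
                                       random_state=0).fit_transform(data)
    projected_dists = euclidean_distances(projected, squared=True).ravel()

    rates = projected_dists[nonzero] / dists[nonzero]
    print(rates.mean(), rates.std())

The mean of ``rates`` should stay close to 1 and its spread should shrink as
``n_components`` grows, which is what the example's histograms visualize.
``SparseRandomProjection`` also accepts scipy sparse input, so the same snippet
applies to the 20 newsgroups TF-IDF matrix.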
205 | 181 | # as vertical lines / region
206 | 182 |
207 | 183 | plt.show()
    | 184 | +
    | 185 | +
    | 186 | +##########################################################
    | 187 | +# We can see that for low values of ``n_components`` the distribution is wide
    | 188 | +# with many distorted pairs and a skewed distribution (due to the hard
    | 189 | +# limit of zero ratio on the left, as distances are always positive)
    | 190 | +# while for larger values of ``n_components`` the distortion is controlled
    | 191 | +# and the distances are well preserved by the random projection.
    | 192 | +
    | 193 | +
    | 194 | +##########################################################
    | 195 | +# Remarks
    | 196 | +# =======
    | 197 | +#
    | 198 | +# According to the JL lemma, projecting 500 samples without too much distortion
    | 199 | +# will require at least several thousand dimensions, irrespective of the
    | 200 | +# number of features of the original dataset.
    | 201 | +#
    | 202 | +# Hence using random projections on the digits dataset, which only has 64
    | 203 | +# features in the input space, does not make sense: it does not allow
    | 204 | +# for dimensionality reduction in this case.
    | 205 | +#
    | 206 | +# On the twenty newsgroups dataset, on the other hand, the dimensionality can
    | 207 | +# be decreased from 56436 down to 10000 while reasonably preserving
    | 208 | +# pairwise distances.
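The first remark can be checked directly against the bound; a small standalone
sketch (the ``eps`` values are illustrative)::

    # For 500 samples, the JL bound runs into the thousands at small eps,
    # which is far more than the 64 features of the digits data and still
    # below the tens of thousands of features of the 20 newsgroups TF-IDF data.
    from sklearn.random_projection import johnson_lindenstrauss_min_dim

    for eps in (0.3, 0.2, 0.1):
        print(eps, johnson_lindenstrauss_min_dim(n_samples=500, eps=eps))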