
Commit 24550a0

Pushing the docs to dev/ for branch: master, commit b3a639ffc2d518b8862c61e4170403a400368571
1 parent c4d5336 commit 24550a0

931 files changed: +3859 additions, -2699 deletions

Two binary files changed (2.87 KB and 2.41 KB); contents not shown.

dev/_downloads/plot_compare_reduction.ipynb

Lines changed: 45 additions & 2 deletions
@@ -15,7 +15,50 @@
 },
 {
 "source": [
-"\n# Selecting dimensionality reduction with Pipeline and GridSearchCV\n\n\nThis example constructs a pipeline that does dimensionality\nreduction followed by prediction with a support vector\nclassifier. It demonstrates the use of GridSearchCV and\nPipeline to optimize over different classes of estimators in a\nsingle CV run -- unsupervised PCA and NMF dimensionality\nreductions are compared to univariate feature selection during\nthe grid search.\n\n"
+"\n# Selecting dimensionality reduction with Pipeline and GridSearchCV\n\n\nThis example constructs a pipeline that does dimensionality\nreduction followed by prediction with a support vector\nclassifier. It demonstrates the use of ``GridSearchCV`` and\n``Pipeline`` to optimize over different classes of estimators in a\nsingle CV run -- unsupervised ``PCA`` and ``NMF`` dimensionality\nreductions are compared to univariate feature selection during\nthe grid search.\n\nAdditionally, ``Pipeline`` can be instantiated with the ``memory``\nargument to memoize the transformers within the pipeline, avoiding fitting\nthe same transformers over and over.\n\nNote that the use of ``memory`` to enable caching becomes interesting when the\nfitting of a transformer is costly.\n\n"
+],
+"cell_type": "markdown",
+"metadata": {}
+},
+{
+"source": [
+"Illustration of ``Pipeline`` and ``GridSearchCV``\n##############################################################################\n This section illustrates the use of a ``Pipeline`` with\n ``GridSearchCV``.\n\n"
+],
+"cell_type": "markdown",
+"metadata": {}
+},
+{
+"execution_count": null,
+"cell_type": "code",
+"source": [
+"# Authors: Robert McGibbon, Joel Nothman, Guillaume Lemaitre\n\nfrom __future__ import print_function, division\n\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import GridSearchCV\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.svm import LinearSVC\nfrom sklearn.decomposition import PCA, NMF\nfrom sklearn.feature_selection import SelectKBest, chi2\n\nprint(__doc__)\n\npipe = Pipeline([\n ('reduce_dim', PCA()),\n ('classify', LinearSVC())\n])\n\nN_FEATURES_OPTIONS = [2, 4, 8]\nC_OPTIONS = [1, 10, 100, 1000]\nparam_grid = [\n {\n 'reduce_dim': [PCA(iterated_power=7), NMF()],\n 'reduce_dim__n_components': N_FEATURES_OPTIONS,\n 'classify__C': C_OPTIONS\n },\n {\n 'reduce_dim': [SelectKBest(chi2)],\n 'reduce_dim__k': N_FEATURES_OPTIONS,\n 'classify__C': C_OPTIONS\n },\n]\nreducer_labels = ['PCA', 'NMF', 'KBest(chi2)']\n\ngrid = GridSearchCV(pipe, cv=3, n_jobs=1, param_grid=param_grid)\ndigits = load_digits()\ngrid.fit(digits.data, digits.target)\n\nmean_scores = np.array(grid.cv_results_['mean_test_score'])\n# scores are in the order of param_grid iteration, which is alphabetical\nmean_scores = mean_scores.reshape(len(C_OPTIONS), -1, len(N_FEATURES_OPTIONS))\n# select score for best C\nmean_scores = mean_scores.max(axis=0)\nbar_offsets = (np.arange(len(N_FEATURES_OPTIONS)) *\n (len(reducer_labels) + 1) + .5)\n\nplt.figure()\nCOLORS = 'bgrcmyk'\nfor i, (label, reducer_scores) in enumerate(zip(reducer_labels, mean_scores)):\n plt.bar(bar_offsets + i, reducer_scores, label=label, color=COLORS[i])\n\nplt.title(\"Comparing feature reduction techniques\")\nplt.xlabel('Reduced number of features')\nplt.xticks(bar_offsets + len(reducer_labels) / 2, N_FEATURES_OPTIONS)\nplt.ylabel('Digit classification accuracy')\nplt.ylim((0, 1))\nplt.legend(loc='upper left')"
+],
+"outputs": [],
+"metadata": {
+"collapsed": false
+}
+},
+{
+"source": [
+"Caching transformers within a ``Pipeline``\n##############################################################################\n It is sometimes worthwhile storing the state of a specific transformer\n since it could be used again. Using a pipeline in ``GridSearchCV`` triggers\n such situations. Therefore, we use the argument ``memory`` to enable caching.\n\n .. warning::\n     Note that this example is, however, only an illustration since for this\n     specific case fitting PCA is not necessarily slower than loading the\n     cache. Hence, use the ``memory`` constructor parameter when the fitting\n     of a transformer is costly.\n\n"
+],
+"cell_type": "markdown",
+"metadata": {}
+},
+{
+"execution_count": null,
+"cell_type": "code",
+"source": [
+"from tempfile import mkdtemp\nfrom shutil import rmtree\nfrom sklearn.externals.joblib import Memory\n\n# Create a temporary folder to store the transformers of the pipeline\ncachedir = mkdtemp()\nmemory = Memory(cachedir=cachedir, verbose=10)\ncached_pipe = Pipeline([('reduce_dim', PCA()),\n ('classify', LinearSVC())],\n memory=memory)\n\n# This time, a cached pipeline will be used within the grid search\ngrid = GridSearchCV(cached_pipe, cv=3, n_jobs=1, param_grid=param_grid)\ndigits = load_digits()\ngrid.fit(digits.data, digits.target)\n\n# Delete the temporary cache before exiting\nrmtree(cachedir)"
+],
+"outputs": [],
+"metadata": {
+"collapsed": false
+}
+},
+{
+"source": [
+"The ``PCA`` fitting is only computed at the evaluation of the first\nconfiguration of the ``C`` parameter of the ``LinearSVC`` classifier. The\nother configurations of ``C`` will trigger the loading of the cached ``PCA``\nestimator data, saving processing time. Therefore, caching the\npipeline using ``memory`` is highly beneficial when fitting\na transformer is costly.\n\n"
 ],
 "cell_type": "markdown",
 "metadata": {}
@@ -24,7 +67,7 @@
 "execution_count": null,
 "cell_type": "code",
 "source": [
-"# Authors: Robert McGibbon, Joel Nothman\n\nfrom __future__ import print_function, division\n\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import GridSearchCV\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.svm import LinearSVC\nfrom sklearn.decomposition import PCA, NMF\nfrom sklearn.feature_selection import SelectKBest, chi2\n\nprint(__doc__)\n\npipe = Pipeline([\n ('reduce_dim', PCA()),\n ('classify', LinearSVC())\n])\n\nN_FEATURES_OPTIONS = [2, 4, 8]\nC_OPTIONS = [1, 10, 100, 1000]\nparam_grid = [\n {\n 'reduce_dim': [PCA(iterated_power=7), NMF()],\n 'reduce_dim__n_components': N_FEATURES_OPTIONS,\n 'classify__C': C_OPTIONS\n },\n {\n 'reduce_dim': [SelectKBest(chi2)],\n 'reduce_dim__k': N_FEATURES_OPTIONS,\n 'classify__C': C_OPTIONS\n },\n]\nreducer_labels = ['PCA', 'NMF', 'KBest(chi2)']\n\ngrid = GridSearchCV(pipe, cv=3, n_jobs=2, param_grid=param_grid)\ndigits = load_digits()\ngrid.fit(digits.data, digits.target)\n\nmean_scores = np.array(grid.cv_results_['mean_test_score'])\n# scores are in the order of param_grid iteration, which is alphabetical\nmean_scores = mean_scores.reshape(len(C_OPTIONS), -1, len(N_FEATURES_OPTIONS))\n# select score for best C\nmean_scores = mean_scores.max(axis=0)\nbar_offsets = (np.arange(len(N_FEATURES_OPTIONS)) *\n (len(reducer_labels) + 1) + .5)\n\nplt.figure()\nCOLORS = 'bgrcmyk'\nfor i, (label, reducer_scores) in enumerate(zip(reducer_labels, mean_scores)):\n plt.bar(bar_offsets + i, reducer_scores, label=label, color=COLORS[i])\n\nplt.title(\"Comparing feature reduction techniques\")\nplt.xlabel('Reduced number of features')\nplt.xticks(bar_offsets + len(reducer_labels) / 2, N_FEATURES_OPTIONS)\nplt.ylabel('Digit classification accuracy')\nplt.ylim((0, 1))\nplt.legend(loc='upper left')\nplt.show()"
+"plt.show()"
 ],
 "outputs": [],
 "metadata": {

dev/_downloads/plot_compare_reduction.py

Lines changed: 61 additions & 6 deletions
@@ -1,4 +1,4 @@
-#!/usr/bin/python
+#!/usr/bin/env python
 # -*- coding: utf-8 -*-
 """
 =================================================================
@@ -7,13 +7,27 @@
 
 This example constructs a pipeline that does dimensionality
 reduction followed by prediction with a support vector
-classifier. It demonstrates the use of GridSearchCV and
-Pipeline to optimize over different classes of estimators in a
-single CV run -- unsupervised PCA and NMF dimensionality
+classifier. It demonstrates the use of ``GridSearchCV`` and
+``Pipeline`` to optimize over different classes of estimators in a
+single CV run -- unsupervised ``PCA`` and ``NMF`` dimensionality
 reductions are compared to univariate feature selection during
 the grid search.
+
+Additionally, ``Pipeline`` can be instantiated with the ``memory``
+argument to memoize the transformers within the pipeline, avoiding fitting
+the same transformers over and over.
+
+Note that the use of ``memory`` to enable caching becomes interesting when the
+fitting of a transformer is costly.
 """
-# Authors: Robert McGibbon, Joel Nothman
+
+###############################################################################
+# Illustration of ``Pipeline`` and ``GridSearchCV``
+###############################################################################
+# This section illustrates the use of a ``Pipeline`` with
+# ``GridSearchCV``.
+
+# Authors: Robert McGibbon, Joel Nothman, Guillaume Lemaitre
 
 from __future__ import print_function, division
 
@@ -49,7 +63,7 @@
 ]
 reducer_labels = ['PCA', 'NMF', 'KBest(chi2)']
 
-grid = GridSearchCV(pipe, cv=3, n_jobs=2, param_grid=param_grid)
+grid = GridSearchCV(pipe, cv=3, n_jobs=1, param_grid=param_grid)
 digits = load_digits()
 grid.fit(digits.data, digits.target)
 
@@ -72,4 +86,45 @@
 plt.ylabel('Digit classification accuracy')
 plt.ylim((0, 1))
 plt.legend(loc='upper left')
+
+###############################################################################
+# Caching transformers within a ``Pipeline``
+###############################################################################
+# It is sometimes worthwhile storing the state of a specific transformer
+# since it could be used again. Using a pipeline in ``GridSearchCV`` triggers
+# such situations. Therefore, we use the argument ``memory`` to enable caching.
+#
+# .. warning::
+#     Note that this example is, however, only an illustration since for this
+#     specific case fitting PCA is not necessarily slower than loading the
+#     cache. Hence, use the ``memory`` constructor parameter when the fitting
+#     of a transformer is costly.
+
+from tempfile import mkdtemp
+from shutil import rmtree
+from sklearn.externals.joblib import Memory
+
+# Create a temporary folder to store the transformers of the pipeline
+cachedir = mkdtemp()
+memory = Memory(cachedir=cachedir, verbose=10)
+cached_pipe = Pipeline([('reduce_dim', PCA()),
+                        ('classify', LinearSVC())],
+                       memory=memory)
+
+# This time, a cached pipeline will be used within the grid search
+grid = GridSearchCV(cached_pipe, cv=3, n_jobs=1, param_grid=param_grid)
+digits = load_digits()
+grid.fit(digits.data, digits.target)
+
+# Delete the temporary cache before exiting
+rmtree(cachedir)
+
+###############################################################################
+# The ``PCA`` fitting is only computed at the evaluation of the first
+# configuration of the ``C`` parameter of the ``LinearSVC`` classifier. The
+# other configurations of ``C`` will trigger the loading of the cached ``PCA``
+# estimator data, saving processing time. Therefore, caching the
+# pipeline using ``memory`` is highly beneficial when fitting
+# a transformer is costly.
+
 plt.show()
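
Note: the mechanism this example leans on is that ``param_grid`` may be a list of dicts, and a pipeline step name ('reduce_dim') can itself be a grid parameter, so whole estimators are swapped in and out during the search. A minimal, self-contained sketch of that mechanism (smaller grids than the example's, purely illustrative):

    from sklearn.datasets import load_digits
    from sklearn.decomposition import NMF, PCA
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    pipe = Pipeline([('reduce_dim', PCA()), ('classify', LinearSVC())])

    # Each dict is expanded independently, so PCA/NMF get ``n_components``
    # while SelectKBest gets ``k`` -- estimator-specific parameters never
    # collide across the two sub-grids.
    param_grid = [
        {'reduce_dim': [PCA(), NMF()], 'reduce_dim__n_components': [2, 8]},
        {'reduce_dim': [SelectKBest(chi2)], 'reduce_dim__k': [2, 8]},
    ]

    digits = load_digits()
    grid = GridSearchCV(pipe, cv=3, param_grid=param_grid)
    grid.fit(digits.data, digits.target)
    print(grid.best_params_)  # reports the winning reducer and its size

This also explains the reshape of ``mean_test_score`` in the plotting code above: candidates come out in ``param_grid`` iteration order, so the flat score array can be folded back into a (C, reducer, n_features) grid before taking the best score over ``C``.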

dev/_downloads/scikit-learn-docs.pdf

31.7 KB (binary file not shown)

Several other binary files changed by a few bytes each (-66, -66, -65, -65, -52 bytes); contents not shown.
