Commit d6336c2

Pushing the docs to dev/ for branch: master, commit 120009b12a1b23ecd39d162a9a4b654e5c5109fd
1 parent a3cd5d0 commit d6336c2

1,113 files changed: +4094 -3609 lines changed

1.67 KB
Binary file not shown.
1.62 KB
Binary file not shown.

dev/_downloads/plot_learning_curve.ipynb

Lines changed: 2 additions & 2 deletions
@@ -15,7 +15,7 @@
1515
"cell_type": "markdown",
1616
"metadata": {},
1717
"source": [
18-
"\n# Plotting Learning Curves\n\n\nOn the left side the learning curve of a naive Bayes classifier is shown for\nthe digits dataset. Note that the training score and the cross-validation score\nare both not very good at the end. However, the shape of the curve can be found\nin more complex datasets very often: the training score is very high at the\nbeginning and decreases and the cross-validation score is very low at the\nbeginning and increases. On the right side we see the learning curve of an SVM\nwith RBF kernel. We can see clearly that the training score is still around\nthe maximum and the validation score could be increased with more training\nsamples.\n\n"
18+
"\n# Plotting Learning Curves\n\nIn the first column, first row the learning curve of a naive Bayes classifier\nis shown for the digits dataset. Note that the training score and the\ncross-validation score are both not very good at the end. However, the shape\nof the curve can be found in more complex datasets very often: the training\nscore is very high at the beginning and decreases and the cross-validation\nscore is very low at the beginning and increases. In the second column, first\nrow we see the learning curve of an SVM with RBF kernel. We can see clearly\nthat the training score is still around the maximum and the validation score\ncould be increased with more training samples. The plots in the second row\nshow the times required by the models to train with various sizes of training\ndataset. The plots in the third row show how much time was required to train\nthe models for each training sizes.\n\n"
1919
]
2020
},
2121
{
@@ -26,7 +26,7 @@
2626
},
2727
"outputs": [],
2828
"source": [
29-
"print(__doc__)\n\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom sklearn.naive_bayes import GaussianNB\nfrom sklearn.svm import SVC\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import learning_curve\nfrom sklearn.model_selection import ShuffleSplit\n\n\ndef plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,\n n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):\n \"\"\"\n Generate a simple plot of the test and training learning curve.\n\n Parameters\n ----------\n estimator : object type that implements the \"fit\" and \"predict\" methods\n An object of that type which is cloned for each validation.\n\n title : string\n Title for the chart.\n\n X : array-like, shape (n_samples, n_features)\n Training vector, where n_samples is the number of samples and\n n_features is the number of features.\n\n y : array-like, shape (n_samples) or (n_samples, n_features), optional\n Target relative to X for classification or regression;\n None for unsupervised learning.\n\n ylim : tuple, shape (ymin, ymax), optional\n Defines minimum and maximum yvalues plotted.\n\n cv : int, cross-validation generator or an iterable, optional\n Determines the cross-validation splitting strategy.\n Possible inputs for cv are:\n - None, to use the default 5-fold cross-validation,\n - integer, to specify the number of folds.\n - :term:`CV splitter`,\n - An iterable yielding (train, test) splits as arrays of indices.\n\n For integer/None inputs, if ``y`` is binary or multiclass,\n :class:`StratifiedKFold` used. If the estimator is not a classifier\n or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.\n\n Refer :ref:`User Guide <cross_validation>` for the various\n cross-validators that can be used here.\n\n n_jobs : int or None, optional (default=None)\n Number of jobs to run in parallel.\n ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.\n ``-1`` means using all processors. 
See :term:`Glossary <n_jobs>`\n for more details.\n\n train_sizes : array-like, shape (n_ticks,), dtype float or int\n Relative or absolute numbers of training examples that will be used to\n generate the learning curve. If the dtype is float, it is regarded as a\n fraction of the maximum size of the training set (that is determined\n by the selected validation method), i.e. it has to be within (0, 1].\n Otherwise it is interpreted as absolute sizes of the training sets.\n Note that for classification the number of samples usually have to\n be big enough to contain at least one sample from each class.\n (default: np.linspace(0.1, 1.0, 5))\n \"\"\"\n plt.figure()\n plt.title(title)\n if ylim is not None:\n plt.ylim(*ylim)\n plt.xlabel(\"Training examples\")\n plt.ylabel(\"Score\")\n train_sizes, train_scores, test_scores = learning_curve(\n estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)\n train_scores_mean = np.mean(train_scores, axis=1)\n train_scores_std = np.std(train_scores, axis=1)\n test_scores_mean = np.mean(test_scores, axis=1)\n test_scores_std = np.std(test_scores, axis=1)\n plt.grid()\n\n plt.fill_between(train_sizes, train_scores_mean - train_scores_std,\n train_scores_mean + train_scores_std, alpha=0.1,\n color=\"r\")\n plt.fill_between(train_sizes, test_scores_mean - test_scores_std,\n test_scores_mean + test_scores_std, alpha=0.1, color=\"g\")\n plt.plot(train_sizes, train_scores_mean, 'o-', color=\"r\",\n label=\"Training score\")\n plt.plot(train_sizes, test_scores_mean, 'o-', color=\"g\",\n label=\"Cross-validation score\")\n\n plt.legend(loc=\"best\")\n return plt\n\n\ndigits = load_digits()\nX, y = digits.data, digits.target\n\n\ntitle = \"Learning Curves (Naive Bayes)\"\n# Cross validation with 100 iterations to get smoother mean test and train\n# score curves, each time with 20% data randomly selected as a validation set.\ncv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)\n\nestimator = 
GaussianNB()\nplot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=4)\n\ntitle = r\"Learning Curves (SVM, RBF kernel, $\\gamma=0.001$)\"\n# SVC is more expensive so we do a lower number of CV iterations:\ncv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)\nestimator = SVC(gamma=0.001)\nplot_learning_curve(estimator, title, X, y, (0.7, 1.01), cv=cv, n_jobs=4)\n\nplt.show()"
29+
"print(__doc__)\n\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom sklearn.naive_bayes import GaussianNB\nfrom sklearn.svm import SVC\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import learning_curve\nfrom sklearn.model_selection import ShuffleSplit\n\n\ndef plot_learning_curve(estimator, title, X, y, axes=None, ylim=None, cv=None,\n n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):\n \"\"\"\n Generate 3 plots: the test and training learning curve, the training\n samples vs fit times curve, the fit times vs score curve.\n\n Parameters\n ----------\n estimator : object type that implements the \"fit\" and \"predict\" methods\n An object of that type which is cloned for each validation.\n\n title : string\n Title for the chart.\n\n X : array-like, shape (n_samples, n_features)\n Training vector, where n_samples is the number of samples and\n n_features is the number of features.\n\n y : array-like, shape (n_samples) or (n_samples, n_features), optional\n Target relative to X for classification or regression;\n None for unsupervised learning.\n\n axes : array of 3 axes, optional (default=None)\n Axes to use for plotting the curves.\n\n ylim : tuple, shape (ymin, ymax), optional\n Defines minimum and maximum yvalues plotted.\n\n cv : int, cross-validation generator or an iterable, optional\n Determines the cross-validation splitting strategy.\n Possible inputs for cv are:\n - None, to use the default 5-fold cross-validation,\n - integer, to specify the number of folds.\n - :term:`CV splitter`,\n - An iterable yielding (train, test) splits as arrays of indices.\n\n For integer/None inputs, if ``y`` is binary or multiclass,\n :class:`StratifiedKFold` used. 
If the estimator is not a classifier\n or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.\n\n Refer :ref:`User Guide <cross_validation>` for the various\n cross-validators that can be used here.\n\n n_jobs : int or None, optional (default=None)\n Number of jobs to run in parallel.\n ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.\n ``-1`` means using all processors. See :term:`Glossary <n_jobs>`\n for more details.\n\n train_sizes : array-like, shape (n_ticks,), dtype float or int\n Relative or absolute numbers of training examples that will be used to\n generate the learning curve. If the dtype is float, it is regarded as a\n fraction of the maximum size of the training set (that is determined\n by the selected validation method), i.e. it has to be within (0, 1].\n Otherwise it is interpreted as absolute sizes of the training sets.\n Note that for classification the number of samples usually have to\n be big enough to contain at least one sample from each class.\n (default: np.linspace(0.1, 1.0, 5))\n \"\"\"\n if axes is None:\n _, axes = plt.subplots(1, 3, figsize=(20, 5))\n\n axes[0].set_title(title)\n if ylim is not None:\n axes[0].set_ylim(*ylim)\n axes[0].set_xlabel(\"Training examples\")\n axes[0].set_ylabel(\"Score\")\n\n train_sizes, train_scores, test_scores, fit_times, _ = \\\n learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,\n train_sizes=train_sizes,\n return_times=True)\n train_scores_mean = np.mean(train_scores, axis=1)\n train_scores_std = np.std(train_scores, axis=1)\n test_scores_mean = np.mean(test_scores, axis=1)\n test_scores_std = np.std(test_scores, axis=1)\n fit_times_mean = np.mean(fit_times, axis=1)\n fit_times_std = np.std(fit_times, axis=1)\n\n # Plot learning curve\n axes[0].grid()\n axes[0].fill_between(train_sizes, train_scores_mean - train_scores_std,\n train_scores_mean + train_scores_std, alpha=0.1,\n color=\"r\")\n axes[0].fill_between(train_sizes, test_scores_mean - test_scores_std,\n 
test_scores_mean + test_scores_std, alpha=0.1,\n color=\"g\")\n axes[0].plot(train_sizes, train_scores_mean, 'o-', color=\"r\",\n label=\"Training score\")\n axes[0].plot(train_sizes, test_scores_mean, 'o-', color=\"g\",\n label=\"Cross-validation score\")\n axes[0].legend(loc=\"best\")\n\n # Plot n_samples vs fit_times\n axes[1].grid()\n axes[1].plot(train_sizes, fit_times_mean, 'o-')\n axes[1].fill_between(train_sizes, fit_times_mean - fit_times_std,\n fit_times_mean + fit_times_std, alpha=0.1)\n axes[1].set_xlabel(\"Training examples\")\n axes[1].set_ylabel(\"fit_times\")\n axes[1].set_title(\"Scalability of the model\")\n\n # Plot fit_time vs score\n axes[2].grid()\n axes[2].plot(fit_times_mean, test_scores_mean, 'o-')\n axes[2].fill_between(fit_times_mean, test_scores_mean - test_scores_std,\n test_scores_mean + test_scores_std, alpha=0.1)\n axes[2].set_xlabel(\"fit_times\")\n axes[2].set_ylabel(\"Score\")\n axes[2].set_title(\"Performance of the model\")\n\n return plt\n\n\nfig, axes = plt.subplots(3, 2, figsize=(10, 15))\n\ndigits = load_digits()\nX, y = digits.data, digits.target\n\n\ntitle = \"Learning Curves (Naive Bayes)\"\n# Cross validation with 100 iterations to get smoother mean test and train\n# score curves, each time with 20% data randomly selected as a validation set.\ncv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)\n\nestimator = GaussianNB()\nplot_learning_curve(estimator, title, X, y, axes=axes[:, 0], ylim=(0.7, 1.01),\n cv=cv, n_jobs=4)\n\ntitle = r\"Learning Curves (SVM, RBF kernel, $\\gamma=0.001$)\"\n# SVC is more expensive so we do a lower number of CV iterations:\ncv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)\nestimator = SVC(gamma=0.001)\nplot_learning_curve(estimator, title, X, y, axes=axes[:, 1], ylim=(0.7, 1.01),\n cv=cv, n_jobs=4)\n\nplt.show()"
3030
]
3131
}
3232
],
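
The updated cell passes `axes=axes[:, 0]` and `axes=axes[:, 1]` to `plot_learning_curve`. A minimal sketch of what those slices contain, assuming only matplotlib and a headless backend; the names `naive_bayes_axes`, `svm_axes` and `fallback_axes` are hypothetical and used here for illustration only:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# Same grid shape as in the updated example: 3 rows, 2 columns.
fig, axes = plt.subplots(3, 2, figsize=(10, 15))

# axes is a 2-D NumPy array of Axes objects. Slicing one column yields the
# three Axes that a single estimator draws into (hypothetical names).
naive_bayes_axes = axes[:, 0]
svm_axes = axes[:, 1]

# The fallback used when axes is None, plt.subplots(1, 3), returns a 1-D
# array of three Axes, so axes[0], axes[1] and axes[2] index the same way.
_, fallback_axes = plt.subplots(1, 3, figsize=(20, 5))

print(axes.shape, naive_bayes_axes.shape, fallback_axes.shape)
# prints: (3, 2) (3,) (3,)

plt.close("all")
```

Because both a column slice and the fallback are 1-D arrays of length 3, the function body can index `axes[0]` to `axes[2]` without caring which of the two it received.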

dev/_downloads/plot_learning_curve.py

Lines changed: 71 additions & 34 deletions
@@ -2,16 +2,18 @@
 ========================
 Plotting Learning Curves
 ========================
-
-On the left side the learning curve of a naive Bayes classifier is shown for
-the digits dataset. Note that the training score and the cross-validation score
-are both not very good at the end. However, the shape of the curve can be found
-in more complex datasets very often: the training score is very high at the
-beginning and decreases and the cross-validation score is very low at the
-beginning and increases. On the right side we see the learning curve of an SVM
-with RBF kernel. We can see clearly that the training score is still around
-the maximum and the validation score could be increased with more training
-samples.
+In the first column, first row the learning curve of a naive Bayes classifier
+is shown for the digits dataset. Note that the training score and the
+cross-validation score are both not very good at the end. However, the shape
+of the curve can be found in more complex datasets very often: the training
+score is very high at the beginning and decreases and the cross-validation
+score is very low at the beginning and increases. In the second column, first
+row we see the learning curve of an SVM with RBF kernel. We can see clearly
+that the training score is still around the maximum and the validation score
+could be increased with more training samples. The plots in the second row
+show the times required by the models to train with various sizes of training
+dataset. The plots in the third row show how much time was required to train
+the models for each training sizes.
 """
 print(__doc__)
 
@@ -24,10 +26,11 @@
 from sklearn.model_selection import ShuffleSplit
 
 
-def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
+def plot_learning_curve(estimator, title, X, y, axes=None, ylim=None, cv=None,
                         n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
     """
-    Generate a simple plot of the test and training learning curve.
+    Generate 3 plots: the test and training learning curve, the training
+    samples vs fit times curve, the fit times vs score curve.
 
     Parameters
     ----------
@@ -45,6 +48,9 @@ def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
         Target relative to X for classification or regression;
         None for unsupervised learning.
 
+    axes : array of 3 axes, optional (default=None)
+        Axes to use for plotting the curves.
+
     ylim : tuple, shape (ymin, ymax), optional
         Defines minimum and maximum yvalues plotted.
 
@@ -79,34 +85,63 @@ def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
         be big enough to contain at least one sample from each class.
         (default: np.linspace(0.1, 1.0, 5))
     """
-    plt.figure()
-    plt.title(title)
+    if axes is None:
+        _, axes = plt.subplots(1, 3, figsize=(20, 5))
+
+    axes[0].set_title(title)
     if ylim is not None:
-        plt.ylim(*ylim)
-    plt.xlabel("Training examples")
-    plt.ylabel("Score")
-    train_sizes, train_scores, test_scores = learning_curve(
-        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
+        axes[0].set_ylim(*ylim)
+    axes[0].set_xlabel("Training examples")
+    axes[0].set_ylabel("Score")
+
+    train_sizes, train_scores, test_scores, fit_times, _ = \
+        learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
+                       train_sizes=train_sizes,
+                       return_times=True)
     train_scores_mean = np.mean(train_scores, axis=1)
     train_scores_std = np.std(train_scores, axis=1)
     test_scores_mean = np.mean(test_scores, axis=1)
     test_scores_std = np.std(test_scores, axis=1)
-    plt.grid()
-
-    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
-                     train_scores_mean + train_scores_std, alpha=0.1,
-                     color="r")
-    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
-                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
-    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
-             label="Training score")
-    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
-             label="Cross-validation score")
-
-    plt.legend(loc="best")
+    fit_times_mean = np.mean(fit_times, axis=1)
+    fit_times_std = np.std(fit_times, axis=1)
+
+    # Plot learning curve
+    axes[0].grid()
+    axes[0].fill_between(train_sizes, train_scores_mean - train_scores_std,
+                         train_scores_mean + train_scores_std, alpha=0.1,
+                         color="r")
+    axes[0].fill_between(train_sizes, test_scores_mean - test_scores_std,
+                         test_scores_mean + test_scores_std, alpha=0.1,
+                         color="g")
+    axes[0].plot(train_sizes, train_scores_mean, 'o-', color="r",
+                 label="Training score")
+    axes[0].plot(train_sizes, test_scores_mean, 'o-', color="g",
+                 label="Cross-validation score")
+    axes[0].legend(loc="best")
+
+    # Plot n_samples vs fit_times
+    axes[1].grid()
+    axes[1].plot(train_sizes, fit_times_mean, 'o-')
+    axes[1].fill_between(train_sizes, fit_times_mean - fit_times_std,
+                         fit_times_mean + fit_times_std, alpha=0.1)
+    axes[1].set_xlabel("Training examples")
+    axes[1].set_ylabel("fit_times")
+    axes[1].set_title("Scalability of the model")
+
+    # Plot fit_time vs score
+    axes[2].grid()
+    axes[2].plot(fit_times_mean, test_scores_mean, 'o-')
+    axes[2].fill_between(fit_times_mean, test_scores_mean - test_scores_std,
+                         test_scores_mean + test_scores_std, alpha=0.1)
+    axes[2].set_xlabel("fit_times")
+    axes[2].set_ylabel("Score")
+    axes[2].set_title("Performance of the model")
+
     return plt
 
 
+fig, axes = plt.subplots(3, 2, figsize=(10, 15))
+
 digits = load_digits()
 X, y = digits.data, digits.target
 
@@ -117,12 +152,14 @@ def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
 cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
 
 estimator = GaussianNB()
-plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=4)
+plot_learning_curve(estimator, title, X, y, axes=axes[:, 0], ylim=(0.7, 1.01),
+                    cv=cv, n_jobs=4)
 
 title = r"Learning Curves (SVM, RBF kernel, $\gamma=0.001$)"
 # SVC is more expensive so we do a lower number of CV iterations:
 cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
 estimator = SVC(gamma=0.001)
-plot_learning_curve(estimator, title, X, y, (0.7, 1.01), cv=cv, n_jobs=4)
+plot_learning_curve(estimator, title, X, y, axes=axes[:, 1], ylim=(0.7, 1.01),
+                    cv=cv, n_jobs=4)
 
 plt.show()
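
The key API change in this diff is the switch to `learning_curve(..., return_times=True)`, which yields fit and score times alongside the scores. The sketch below is not part of the commit; it assumes a scikit-learn version that supports `return_times` and uses only 3 splits (instead of the 100 in the example) to stay fast. It shows the shapes involved and why the example aggregates with `axis=1`:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)

# Assumption for speed: 3 random splits rather than the 100 used in the diff.
cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

# With return_times=True, learning_curve returns five arrays instead of three.
train_sizes, train_scores, test_scores, fit_times, score_times = learning_curve(
    GaussianNB(), X, y, cv=cv, train_sizes=np.linspace(0.1, 1.0, 5),
    return_times=True)

# Scores and times have one row per training size and one column per CV
# split, so axis=1 aggregates over the splits.
fit_times_mean = fit_times.mean(axis=1)
fit_times_std = fit_times.std(axis=1)
test_scores_mean = test_scores.mean(axis=1)

print(train_sizes.shape, fit_times.shape, fit_times_mean.shape)
```

In the diff, the second plot draws `fit_times_mean` against `train_sizes`, and the third draws `test_scores_mean` against `fit_times_mean`, so the third row relates fit time to cross-validation score rather than to training-set size.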

dev/_downloads/scikit-learn-docs.pdf

123 KB
Binary file not shown.

dev/_images/iris.png

0 Bytes
-194 Bytes
-194 Bytes
-521 Bytes
-521 Bytes
