Commit ba169a6 (1 parent: 3fdb804)

Pushing the docs to dev/ for branch: master, commit 0a5af0d2a11c64d59381110f3967acbe7d88a031
File tree: 1,232 files changed (+8415 additions, -3889 deletions)
Binary file not shown.
Lines changed: 97 additions & 0 deletions

@@ -0,0 +1,97 @@
{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Successive Halving Iterations\n\nThis example illustrates how a successive halving search (\n:class:`~sklearn.model_selection.HalvingGridSearchCV` and\n:class:`~sklearn.model_selection.HalvingRandomSearchCV`) iteratively chooses\nthe best parameter combination out of multiple candidates.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import pandas as pd\nfrom sklearn import datasets\nimport matplotlib.pyplot as plt\nfrom scipy.stats import randint\nimport numpy as np\n\nfrom sklearn.experimental import enable_successive_halving  # noqa\nfrom sklearn.model_selection import HalvingRandomSearchCV\nfrom sklearn.ensemble import RandomForestClassifier\n\n\nprint(__doc__)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We first define the parameter space and train a\n:class:`~sklearn.model_selection.HalvingRandomSearchCV` instance.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "rng = np.random.RandomState(0)\n\nX, y = datasets.make_classification(n_samples=700, random_state=rng)\n\nclf = RandomForestClassifier(n_estimators=20, random_state=rng)\n\nparam_dist = {\"max_depth\": [3, None],\n              \"max_features\": randint(1, 11),\n              \"min_samples_split\": randint(2, 11),\n              \"bootstrap\": [True, False],\n              \"criterion\": [\"gini\", \"entropy\"]}\n\nrsh = HalvingRandomSearchCV(\n    estimator=clf,\n    param_distributions=param_dist,\n    factor=2,\n    random_state=rng)\nrsh.fit(X, y)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can now use the `cv_results_` attribute of the search estimator to inspect\nand plot the evolution of the search.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "results = pd.DataFrame(rsh.cv_results_)\nresults['params_str'] = results.params.apply(str)\nresults.drop_duplicates(subset=('params_str', 'iter'), inplace=True)\nmean_scores = results.pivot(index='iter', columns='params_str',\n                            values='mean_test_score')\nax = mean_scores.plot(legend=False, alpha=.6)\n\nlabels = [\n    f'iter={i}\\nn_samples={rsh.n_resources_[i]}\\n'\n    f'n_candidates={rsh.n_candidates_[i]}'\n    for i in range(rsh.n_iterations_)\n]\nax.set_xticklabels(labels, rotation=45, multialignment='left')\nax.set_title('Scores of candidates over iterations')\nax.set_ylabel('mean test score', fontsize=15)\nax.set_xlabel('iterations', fontsize=15)\nplt.tight_layout()\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Number of candidates and amount of resource at each iteration\n\nAt the first iteration, a small amount of resources is used. The resource\nhere is the number of samples that the estimators are trained on. All\ncandidates are evaluated.\n\nAt the second iteration, only the best half of the candidates is evaluated.\nThe number of allocated resources is doubled: candidates are evaluated on\ntwice as many samples.\n\nThis process is repeated until the last iteration, where only 2 candidates\nare left. The best candidate is the candidate that has the best score at the\nlast iteration.\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.5"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
Lines changed: 84 additions & 0 deletions

@@ -0,0 +1,84 @@
"""
Successive Halving Iterations
=============================

This example illustrates how a successive halving search (
:class:`~sklearn.model_selection.HalvingGridSearchCV` and
:class:`~sklearn.model_selection.HalvingRandomSearchCV`) iteratively chooses
the best parameter combination out of multiple candidates.

"""
import pandas as pd
from sklearn import datasets
import matplotlib.pyplot as plt
from scipy.stats import randint
import numpy as np

from sklearn.experimental import enable_successive_halving  # noqa
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.ensemble import RandomForestClassifier


print(__doc__)

# %%
# We first define the parameter space and train a
# :class:`~sklearn.model_selection.HalvingRandomSearchCV` instance.

rng = np.random.RandomState(0)

X, y = datasets.make_classification(n_samples=700, random_state=rng)

clf = RandomForestClassifier(n_estimators=20, random_state=rng)

param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 11),
              "min_samples_split": randint(2, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

rsh = HalvingRandomSearchCV(
    estimator=clf,
    param_distributions=param_dist,
    factor=2,
    random_state=rng)
rsh.fit(X, y)

# %%
# We can now use the `cv_results_` attribute of the search estimator to
# inspect and plot the evolution of the search.

results = pd.DataFrame(rsh.cv_results_)
results['params_str'] = results.params.apply(str)
results.drop_duplicates(subset=('params_str', 'iter'), inplace=True)
mean_scores = results.pivot(index='iter', columns='params_str',
                            values='mean_test_score')
ax = mean_scores.plot(legend=False, alpha=.6)

labels = [
    f'iter={i}\nn_samples={rsh.n_resources_[i]}\n'
    f'n_candidates={rsh.n_candidates_[i]}'
    for i in range(rsh.n_iterations_)
]
ax.set_xticklabels(labels, rotation=45, multialignment='left')
ax.set_title('Scores of candidates over iterations')
ax.set_ylabel('mean test score', fontsize=15)
ax.set_xlabel('iterations', fontsize=15)
plt.tight_layout()
plt.show()

# %%
# Number of candidates and amount of resource at each iteration
# -------------------------------------------------------------
#
# At the first iteration, a small amount of resources is used. The resource
# here is the number of samples that the estimators are trained on. All
# candidates are evaluated.
#
# At the second iteration, only the best half of the candidates is evaluated.
# The number of allocated resources is doubled: candidates are evaluated on
# twice as many samples.
#
# This process is repeated until the last iteration, where only 2 candidates
# are left. The best candidate is the candidate that has the best score at the
# last iteration.
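
The halving schedule described in those closing comments can be read directly off the fitted search object. A minimal sketch (not part of this commit), assuming the script above has run so that `rsh` is fitted; `n_iterations_`, `n_candidates_` and `n_resources_` are the same attributes the plot labels already use:

# Sketch (not part of this commit): print the halving schedule of the
# fitted HalvingRandomSearchCV. With factor=2, the number of candidates
# roughly halves while the sample budget doubles at each iteration.
for i in range(rsh.n_iterations_):
    print(f"iter {i}: {rsh.n_candidates_[i]} candidates trained on "
          f"{rsh.n_resources_[i]} samples")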
Lines changed: 122 additions & 0 deletions

@@ -0,0 +1,122 @@
"""
Comparison between grid search and successive halving
=====================================================

This example compares the parameter search performed by
:class:`~sklearn.model_selection.HalvingGridSearchCV` and
:class:`~sklearn.model_selection.GridSearchCV`.

"""
from time import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.svm import SVC
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.experimental import enable_successive_halving  # noqa
from sklearn.model_selection import HalvingGridSearchCV


print(__doc__)

# %%
# We first define the parameter space for an :class:`~sklearn.svm.SVC`
# estimator, and compute the time required to train a
# :class:`~sklearn.model_selection.HalvingGridSearchCV` instance, as well as a
# :class:`~sklearn.model_selection.GridSearchCV` instance.

rng = np.random.RandomState(0)
X, y = datasets.make_classification(n_samples=1000, random_state=rng)

gammas = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7]
Cs = [1, 10, 100, 1e3, 1e4, 1e5]
param_grid = {'gamma': gammas, 'C': Cs}

clf = SVC(random_state=rng)

tic = time()
gsh = HalvingGridSearchCV(estimator=clf, param_grid=param_grid, factor=2,
                          random_state=rng)
gsh.fit(X, y)
gsh_time = time() - tic

tic = time()
gs = GridSearchCV(estimator=clf, param_grid=param_grid)
gs.fit(X, y)
gs_time = time() - tic

# %%
# We now plot heatmaps for both search estimators.


def make_heatmap(ax, gs, is_sh=False, make_cbar=False):
    """Helper to make a heatmap."""
    results = pd.DataFrame.from_dict(gs.cv_results_)
    results['params_str'] = results.params.apply(str)
    if is_sh:
        # SH dataframe: get mean_test_score values for the highest iter
        scores_matrix = results.sort_values('iter').pivot_table(
            index='param_gamma', columns='param_C',
            values='mean_test_score', aggfunc='last'
        )
    else:
        scores_matrix = results.pivot(index='param_gamma', columns='param_C',
                                      values='mean_test_score')

    im = ax.imshow(scores_matrix)

    ax.set_xticks(np.arange(len(Cs)))
    ax.set_xticklabels(['{:.0E}'.format(x) for x in Cs])
    ax.set_xlabel('C', fontsize=15)

    ax.set_yticks(np.arange(len(gammas)))
    ax.set_yticklabels(['{:.0E}'.format(x) for x in gammas])
    ax.set_ylabel('gamma', fontsize=15)

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    if is_sh:
        iterations = results.pivot_table(index='param_gamma',
                                         columns='param_C', values='iter',
                                         aggfunc='max').values
        for i in range(len(gammas)):
            for j in range(len(Cs)):
                ax.text(j, i, iterations[i, j],
                        ha="center", va="center", color="w", fontsize=20)

    if make_cbar:
        fig.subplots_adjust(right=0.8)
        cbar_ax = fig.add_axes([0.85, 0.15, 0.05, 0.7])
        fig.colorbar(im, cax=cbar_ax)
        cbar_ax.set_ylabel('mean_test_score', rotation=-90, va="bottom",
                           fontsize=15)


fig, axes = plt.subplots(ncols=2, sharey=True)
ax1, ax2 = axes

make_heatmap(ax1, gsh, is_sh=True)
make_heatmap(ax2, gs, make_cbar=True)

ax1.set_title('Successive Halving\ntime = {:.3f}s'.format(gsh_time),
              fontsize=15)
ax2.set_title('GridSearch\ntime = {:.3f}s'.format(gs_time), fontsize=15)

plt.show()

# %%
# The heatmaps show the mean test score of the parameter combinations for an
# :class:`~sklearn.svm.SVC` instance. The
# :class:`~sklearn.model_selection.HalvingGridSearchCV` also shows the
# iteration at which the combinations were last used. The combinations marked
# as ``0`` were only evaluated at the first iteration, while the ones with
# ``5`` are the parameter combinations that are considered the best ones.
#
# We can see that the :class:`~sklearn.model_selection.HalvingGridSearchCV`
# class is able to find parameter combinations that are just as accurate as
# :class:`~sklearn.model_selection.GridSearchCV`, in much less time.
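
To see why the best cells in the left heatmap end up labelled ``5``: the grid above has 7 gammas x 6 Cs = 42 candidates, and with factor=2 the halving rule keeps roughly half of them per iteration. A back-of-the-envelope sketch (not part of this commit; it assumes the sample budget does not cut the schedule short, and the authoritative counts live in `gsh.n_candidates_`):

import math

# Sketch (not part of this commit): replay the halving rule by hand.
# Keeping ceil(n / factor) candidates per step, the 42 grid combinations
# reach the final pair at iteration 5.
n, factor, i = 7 * 6, 2, 0
while True:
    print(f"iter {i}: {n} candidates")
    if n <= factor:
        break
    n = math.ceil(n / factor)
    i += 1
# expected output: 42, 21, 11, 6, 3, 2 candidates at iterations 0-5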
