
Commit 6890547

committed
Pushing the docs to dev/ for branch: main, commit 3a4a1d57372001a53b65c67ef92ba22299e7dc9a
1 parent 318f958 commit 6890547

File tree: 1,319 files changed (+6844 / -5985 lines)

dev/.buildinfo

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: dc88e4ab53657c80bf1735fd689584a6
+config: 76dc20ce210e45daaf85e521f8ca05fb
 tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file not shown.
Lines changed: 221 additions & 90 deletions
@@ -1,122 +1,253 @@
-r"""
+"""
 =====================================
 Multi-class AdaBoosted Decision Trees
 =====================================

-This example reproduces Figure 1 of Zhu et al [1]_ and shows how boosting can
-improve prediction accuracy on a multi-class problem. The classification
-dataset is constructed by taking a ten-dimensional standard normal distribution
-and defining three classes separated by nested concentric ten-dimensional
-spheres such that roughly equal numbers of samples are in each class (quantiles
-of the :math:`\chi^2` distribution).
-
-The performance of the SAMME and SAMME.R [1]_ algorithms are compared. SAMME.R
-uses the probability estimates to update the additive model, while SAMME uses
-the classifications only. As the example illustrates, the SAMME.R algorithm
-typically converges faster than SAMME, achieving a lower test error with fewer
-boosting iterations. The error of each algorithm on the test set after each
-boosting iteration is shown on the left, the classification error on the test
-set of each tree is shown in the middle, and the boost weight of each tree is
-shown on the right. All trees have a weight of one in the SAMME.R algorithm and
-therefore are not shown.
-
-.. [1] J. Zhu, H. Zou, S. Rosset, T. Hastie, "Multi-class AdaBoost", 2009.
+This example shows how boosting can improve the prediction accuracy on a
+multiclass classification problem. It reproduces a similar experiment to the
+one depicted by Figure 1 in Zhu et al [1]_.
+
+The core principle of AdaBoost (Adaptive Boosting) is to fit a sequence of weak
+learners (e.g. Decision Trees) on repeatedly re-sampled versions of the data.
+Each sample carries a weight that is adjusted after each training step, such
+that misclassified samples will be assigned higher weights. The re-sampling
+process with replacement takes into account the weights assigned to each sample.
+Samples with higher weights have a greater chance of being selected multiple
+times in the new data set, while samples with lower weights are less likely to
+be selected. This ensures that subsequent iterations of the algorithm focus on
+the difficult-to-classify samples.
+
+.. topic:: References:
+
+    .. [1] :doi:`J. Zhu, H. Zou, S. Rosset, T. Hastie, "Multi-class AdaBoost."
+       Statistics and its Interface 2.3 (2009): 349-360.
+       <10.4310/SII.2009.v2.n3.a8>`

 """

-# Author: Noel Dawe <[email protected]>
-#
+# Noel Dawe <[email protected]>
 # License: BSD 3 clause

-import matplotlib.pyplot as plt
-
+# %%
+# Creating the dataset
+# --------------------
+# The classification dataset is constructed by taking a ten-dimensional standard
+# normal distribution (:math:`x` in :math:`R^{10}`) and defining three classes
+# separated by nested concentric ten-dimensional spheres such that roughly equal
+# numbers of samples are in each class (quantiles of the :math:`\chi^2`
+# distribution).
 from sklearn.datasets import make_gaussian_quantiles
-from sklearn.ensemble import AdaBoostClassifier
-from sklearn.metrics import accuracy_score
-from sklearn.tree import DecisionTreeClassifier

 X, y = make_gaussian_quantiles(
-    n_samples=13000, n_features=10, n_classes=3, random_state=1
+    n_samples=2_000, n_features=10, n_classes=3, random_state=1
 )

-n_split = 3000
-
-X_train, X_test = X[:n_split], X[n_split:]
-y_train, y_test = y[:n_split], y[n_split:]
+# %%
+# We split the dataset into 2 sets: 70 percent of the samples are used for
+# training and the remaining 30 percent for testing.
+from sklearn.model_selection import train_test_split

-bdt_real = AdaBoostClassifier(
-    DecisionTreeClassifier(max_depth=2), n_estimators=300, learning_rate=1
+X_train, X_test, y_train, y_test = train_test_split(
+    X, y, train_size=0.7, random_state=42
 )

-bdt_discrete = AdaBoostClassifier(
-    DecisionTreeClassifier(max_depth=2),
-    n_estimators=300,
-    learning_rate=1.5,
+# %%
+# Training the `AdaBoostClassifier`
+# ---------------------------------
+# We train the :class:`~sklearn.ensemble.AdaBoostClassifier`. The estimator
+# utilizes boosting to improve the classification accuracy. Boosting is a method
+# designed to train weak learners (i.e. `estimator`) that learn from their
+# predecessor's mistakes.
+#
+# Here, we define the weak learner as a
+# :class:`~sklearn.tree.DecisionTreeClassifier` and set the maximum number of
+# leaves to 8. In a real setting, this parameter should be tuned. We set it to a
+# rather low value to limit the runtime of the example.
+#
+# The `SAMME` algorithm built into the
+# :class:`~sklearn.ensemble.AdaBoostClassifier` then uses the correct or
+# incorrect predictions made by the current weak learner to update the sample
+# weights used for training the consecutive weak learners. Also, the weight of
+# the weak learner itself is calculated based on its accuracy in classifying the
+# training examples. The weight of the weak learner determines its influence on
+# the final ensemble prediction.
+from sklearn.ensemble import AdaBoostClassifier
+from sklearn.tree import DecisionTreeClassifier
+
+weak_learner = DecisionTreeClassifier(max_leaf_nodes=8)
+n_estimators = 300
+
+adaboost_clf = AdaBoostClassifier(
+    estimator=weak_learner,
+    n_estimators=n_estimators,
     algorithm="SAMME",
-)
+    random_state=42,
+).fit(X_train, y_train)
+
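Aside (not part of this commit): once fitted, `adaboost_clf` behaves like any scikit-learn classifier, so a quick sanity check on the held-out data is possible before the more detailed error analysis that follows.

# Quick accuracy check on the test set; the staged analysis below goes further.
print(f"Test accuracy: {adaboost_clf.score(X_test, y_test):.3f}")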
+# %%
+# Analysis
+# --------
+# Convergence of the `AdaBoostClassifier`
+# ***************************************
+# To demonstrate the effectiveness of boosting in improving accuracy, we
+# evaluate the misclassification error of the boosted trees in comparison to two
+# baseline scores. The first baseline score is the `misclassification_error`
+# obtained from a single weak-learner (i.e.
+# :class:`~sklearn.tree.DecisionTreeClassifier`), which serves as a reference
+# point. The second baseline score is obtained from the
+# :class:`~sklearn.dummy.DummyClassifier`, which predicts the most prevalent
+# class in a dataset.
+from sklearn.dummy import DummyClassifier
+from sklearn.metrics import accuracy_score

-bdt_real.fit(X_train, y_train)
-bdt_discrete.fit(X_train, y_train)
+dummy_clf = DummyClassifier()

-real_test_errors = []
-discrete_test_errors = []

-for real_test_predict, discrete_test_predict in zip(
-    bdt_real.staged_predict(X_test), bdt_discrete.staged_predict(X_test)
-):
-    real_test_errors.append(1.0 - accuracy_score(real_test_predict, y_test))
-    discrete_test_errors.append(1.0 - accuracy_score(discrete_test_predict, y_test))
+def misclassification_error(y_true, y_pred):
+    return 1 - accuracy_score(y_true, y_pred)

-n_trees_discrete = len(bdt_discrete)
-n_trees_real = len(bdt_real)

-# Boosting might terminate early, but the following arrays are always
-# n_estimators long. We crop them to the actual number of trees here:
-discrete_estimator_errors = bdt_discrete.estimator_errors_[:n_trees_discrete]
-real_estimator_errors = bdt_real.estimator_errors_[:n_trees_real]
-discrete_estimator_weights = bdt_discrete.estimator_weights_[:n_trees_discrete]
+weak_learners_misclassification_error = misclassification_error(
+    y_test, weak_learner.fit(X_train, y_train).predict(X_test)
+)

-plt.figure(figsize=(15, 5))
+dummy_classifiers_misclassification_error = misclassification_error(
+    y_test, dummy_clf.fit(X_train, y_train).predict(X_test)
+)

-plt.subplot(131)
-plt.plot(range(1, n_trees_discrete + 1), discrete_test_errors, c="black", label="SAMME")
-plt.plot(
-    range(1, n_trees_real + 1),
-    real_test_errors,
-    c="black",
-    linestyle="dashed",
-    label="SAMME.R",
+print(
+    "DecisionTreeClassifier's misclassification_error: "
+    f"{weak_learners_misclassification_error:.3f}"
+)
+print(
+    "DummyClassifier's misclassification_error: "
+    f"{dummy_classifiers_misclassification_error:.3f}"
 )
-plt.legend()
-plt.ylim(0.18, 0.62)
-plt.ylabel("Test Error")
-plt.xlabel("Number of Trees")

-plt.subplot(132)
+# %%
+# After training the :class:`~sklearn.tree.DecisionTreeClassifier` model, the
+# achieved error is lower than the error we would have obtained by always
+# guessing the most frequent class label, as the
+# :class:`~sklearn.dummy.DummyClassifier` does.
+#
+# Now, we calculate the `misclassification_error`, i.e. `1 - accuracy`, of the
+# additive model (:class:`~sklearn.ensemble.AdaBoostClassifier`) at each
+# boosting iteration on the test set to assess its performance.
+#
+# We use :meth:`~sklearn.ensemble.AdaBoostClassifier.staged_predict` that makes
+# as many iterations as the number of fitted estimators (i.e. corresponding to
+# `n_estimators`). At iteration `n`, the predictions of AdaBoost only use the
+# `n` first weak learners. We compare these predictions with the true labels
+# `y_test` and can, therefore, conclude on the benefit (or not) of adding a
+# new weak learner into the chain.
+#
+# We plot the misclassification error for the different stages:
+import matplotlib.pyplot as plt
+import pandas as pd
+
+boosting_errors = pd.DataFrame(
+    {
+        "Number of trees": range(1, n_estimators + 1),
+        "AdaBoost": [
+            misclassification_error(y_test, y_pred)
+            for y_pred in adaboost_clf.staged_predict(X_test)
+        ],
+    }
+).set_index("Number of trees")
+ax = boosting_errors.plot()
+ax.set_ylabel("Misclassification error on test set")
+ax.set_title("Convergence of AdaBoost algorithm")
+
 plt.plot(
-    range(1, n_trees_discrete + 1),
-    discrete_estimator_errors,
-    "b",
-    label="SAMME",
-    alpha=0.5,
+    [boosting_errors.index.min(), boosting_errors.index.max()],
+    [weak_learners_misclassification_error, weak_learners_misclassification_error],
+    color="tab:orange",
+    linestyle="dashed",
 )
 plt.plot(
-    range(1, n_trees_real + 1), real_estimator_errors, "r", label="SAMME.R", alpha=0.5
+    [boosting_errors.index.min(), boosting_errors.index.max()],
+    [
+        dummy_classifiers_misclassification_error,
+        dummy_classifiers_misclassification_error,
+    ],
+    color="c",
+    linestyle="dotted",
 )
-plt.legend()
-plt.ylabel("Error")
-plt.xlabel("Number of Trees")
-plt.ylim((0.2, max(real_estimator_errors.max(), discrete_estimator_errors.max()) * 1.2))
-plt.xlim((-20, len(bdt_discrete) + 20))
-
-plt.subplot(133)
-plt.plot(range(1, n_trees_discrete + 1), discrete_estimator_weights, "b", label="SAMME")
-plt.legend()
-plt.ylabel("Weight")
-plt.xlabel("Number of Trees")
-plt.ylim((0, discrete_estimator_weights.max() * 1.2))
-plt.xlim((-20, n_trees_discrete + 20))
-
-# prevent overlapping y-axis labels
-plt.subplots_adjust(wspace=0.25)
+plt.legend(["AdaBoost", "DecisionTreeClassifier", "DummyClassifier"], loc=1)
 plt.show()
+
+# %%
+# The plot shows the misclassification error on the test set after each
+# boosting iteration. We see that the error of the boosted trees converges to an
+# error of around 0.3 after 50 iterations, indicating a significantly higher
+# accuracy compared to a single tree, as illustrated by the dashed line in the
+# plot.
+#
+# The misclassification error jitters because the `SAMME` algorithm uses the
+# discrete outputs of the weak learners to train the boosted model.
+#
+# The convergence of :class:`~sklearn.ensemble.AdaBoostClassifier` is mainly
+# influenced by the learning rate (i.e. `learning_rate`), the number of weak
+# learners used (`n_estimators`), and the expressivity of the weak learners
+# (e.g. `max_leaf_nodes`).
+
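Aside (not part of this commit): one way to see the influence of `learning_rate` mentioned above is to refit the ensemble with a couple of illustrative values and compare the staged test errors. The sketch below reuses `X_train`, `y_train`, `X_test`, `y_test`, `n_estimators` and `misclassification_error` from the example; the chosen rates are arbitrary, not tuned.

import matplotlib.pyplot as plt

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

for lr in (0.3, 1.0):  # illustrative values only
    clf = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_leaf_nodes=8),
        n_estimators=n_estimators,
        algorithm="SAMME",
        learning_rate=lr,
        random_state=42,
    ).fit(X_train, y_train)
    staged_errors = [
        misclassification_error(y_test, y_pred)
        for y_pred in clf.staged_predict(X_test)
    ]
    plt.plot(
        range(1, len(staged_errors) + 1), staged_errors, label=f"learning_rate={lr}"
    )
plt.xlabel("Number of trees")
plt.ylabel("Misclassification error on test set")
plt.legend()
plt.show()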
+# %%
+# Errors and weights of the Weak Learners
+# ***************************************
+# As previously mentioned, AdaBoost is a forward stagewise additive model. We
+# now focus on understanding the relationship between the attributed weights of
+# the weak learners and their statistical performance.
+#
+# We use the fitted :class:`~sklearn.ensemble.AdaBoostClassifier`'s attributes
+# `estimator_errors_` and `estimator_weights_` to investigate this link.
+weak_learners_info = pd.DataFrame(
+    {
+        "Number of trees": range(1, n_estimators + 1),
+        "Errors": adaboost_clf.estimator_errors_,
+        "Weights": adaboost_clf.estimator_weights_,
+    }
+).set_index("Number of trees")
+
+axs = weak_learners_info.plot(
+    subplots=True, layout=(1, 2), figsize=(10, 4), legend=False, color="tab:blue"
+)
+axs[0, 0].set_ylabel("Train error")
+axs[0, 0].set_title("Weak learner's training error")
+axs[0, 1].set_ylabel("Weight")
+axs[0, 1].set_title("Weak learner's weight")
+fig = axs[0, 0].get_figure()
+fig.suptitle("Weak learner's errors and weights for the AdaBoostClassifier")
+fig.tight_layout()
+
+# %%
+# On the left plot, we show the weighted error of each weak learner on the
+# reweighted training set at each boosting iteration. On the right plot, we show
+# the weights associated with each weak learner, later used to make the
+# predictions of the final additive model.
+#
+# We see that the error of a weak learner is inversely related to its weight.
+# This means that our additive model trusts a weak learner that makes smaller
+# errors (on the training set) more, by increasing its impact on the final
+# decision. Indeed, this is exactly how the base estimators' weights are
+# updated after each iteration in AdaBoost.
+#
+# |details-start| Mathematical details |details-split|
+#
+# The weight associated with a weak learner trained at stage :math:`m` is
+# inversely associated with its misclassification error such that:
+#
+# .. math:: \alpha^{(m)} = \log \frac{1 - err^{(m)}}{err^{(m)}} + \log (K - 1),
+#
+# where :math:`\alpha^{(m)}` and :math:`err^{(m)}` are the weight and the error
+# of the :math:`m`-th weak learner, respectively, and :math:`K` is the number of
+# classes in our classification problem. |details-end|
+#
+# Another interesting observation boils down to the fact that the first weak
+# learners of the model make fewer errors than later weak learners of the
+# boosting chain.
+#
+# The intuition behind this observation is the following: due to the sample
+# reweighting, later classifiers are forced to try to classify more difficult or
+# noisy samples and to ignore already well classified samples. Therefore, the
+# overall error on the training set will increase. That's why the weak learner's
+# weights are built to counter-balance the worse performing weak learners.
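Aside (not part of this commit): the weight formula in the "Mathematical details" block above can be checked numerically against the fitted ensemble. The sketch assumes the example's setup (three classes and `learning_rate` left at its default of 1, which scales the weights in scikit-learn).

import numpy as np

K = 3  # number of classes in this example
err = adaboost_clf.estimator_errors_
recomputed_weights = np.log((1 - err) / err) + np.log(K - 1)
# The difference should be negligible when learning_rate is left at 1.
print(np.abs(recomputed_weights - adaboost_clf.estimator_weights_).max())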

0 commit comments
