-r"""
+"""
=====================================
Multi-class AdaBoosted Decision Trees
=====================================

-This example reproduces Figure 1 of Zhu et al [1]_ and shows how boosting can
-improve prediction accuracy on a multi-class problem. The classification
-dataset is constructed by taking a ten-dimensional standard normal distribution
-and defining three classes separated by nested concentric ten-dimensional
-spheres such that roughly equal numbers of samples are in each class (quantiles
-of the :math:`\chi^2` distribution).
-
-The performance of the SAMME and SAMME.R [1]_ algorithms are compared. SAMME.R
-uses the probability estimates to update the additive model, while SAMME uses
-the classifications only. As the example illustrates, the SAMME.R algorithm
-typically converges faster than SAMME, achieving a lower test error with fewer
-boosting iterations. The error of each algorithm on the test set after each
-boosting iteration is shown on the left, the classification error on the test
-set of each tree is shown in the middle, and the boost weight of each tree is
-shown on the right. All trees have a weight of one in the SAMME.R algorithm and
-therefore are not shown.
-
-.. [1] J. Zhu, H. Zou, S. Rosset, T. Hastie, "Multi-class AdaBoost", 2009.
+This example shows how boosting can improve the prediction accuracy on a
+multi-class classification problem. It reproduces a similar experiment to the
+one depicted by Figure 1 in Zhu et al [1]_.
+
+The core principle of AdaBoost (Adaptive Boosting) is to fit a sequence of weak
+learners (e.g. Decision Trees) on repeatedly re-sampled versions of the data.
+Each sample carries a weight that is adjusted after each training step, such
+that misclassified samples are assigned higher weights. The re-sampling process
+with replacement takes the weights assigned to each sample into account.
+Samples with higher weights have a greater chance of being selected multiple
+times in the new data set, while samples with lower weights are less likely to
+be selected. This ensures that subsequent iterations of the algorithm focus on
+the difficult-to-classify samples.
+
+.. topic:: References:
+
+    .. [1] :doi:`J. Zhu, H. Zou, S. Rosset, T. Hastie, "Multi-class AdaBoost."
+       Statistics and its Interface 2.3 (2009): 349-360.
+       <10.4310/SII.2009.v2.n3.a8>`

"""

-# Author: Noel Dawe <[email protected]>
-#
+
# License: BSD 3 clause

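As an illustration of the weighted re-sampling described in the new docstring above, here is a minimal NumPy sketch. It is not part of the diff; `sample_weight` and `rng` are made-up names for the illustration, and scikit-learn's `AdaBoostClassifier` passes the weights to the weak learner rather than literally materializing a re-sampled dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10

# Uniform weights at the first boosting iteration.
sample_weight = np.full(n_samples, 1 / n_samples)

# Suppose samples 7, 8 and 9 were misclassified: their weights grow.
sample_weight[7:] *= 2.0
sample_weight /= sample_weight.sum()

# Re-sample with replacement, proportionally to the weights: the misclassified
# samples are now more likely to be drawn several times.
indices = rng.choice(n_samples, size=n_samples, replace=True, p=sample_weight)
print(indices)
```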
-import matplotlib.pyplot as plt
-
+# %%
+# Creating the dataset
+# --------------------
+# The classification dataset is constructed by taking a ten-dimensional standard
+# normal distribution (:math:`x` in :math:`R^{10}`) and defining three classes
+# separated by nested concentric ten-dimensional spheres such that roughly equal
+# numbers of samples are in each class (quantiles of the :math:`\chi^2`
+# distribution).
from sklearn.datasets import make_gaussian_quantiles
-from sklearn.ensemble import AdaBoostClassifier
-from sklearn.metrics import accuracy_score
-from sklearn.tree import DecisionTreeClassifier

X, y = make_gaussian_quantiles(
-    n_samples=13000, n_features=10, n_classes=3, random_state=1
+    n_samples=2_000, n_features=10, n_classes=3, random_state=1
)

-n_split = 3000
-
-X_train, X_test = X[:n_split], X[n_split:]
-y_train, y_test = y[:n_split], y[n_split:]
+# %%
+# We split the dataset into 2 sets: 70 percent of the samples are used for
+# training and the remaining 30 percent for testing.
+from sklearn.model_selection import train_test_split

-bdt_real = AdaBoostClassifier(
-    DecisionTreeClassifier(max_depth=2), n_estimators=300, learning_rate=1
+X_train, X_test, y_train, y_test = train_test_split(
+    X, y, train_size=0.7, random_state=42
)

-bdt_discrete = AdaBoostClassifier(
-    DecisionTreeClassifier(max_depth=2),
-    n_estimators=300,
-    learning_rate=1.5,
+# %%
+# Training the `AdaBoostClassifier`
+# ---------------------------------
+# We train the :class:`~sklearn.ensemble.AdaBoostClassifier`. This estimator
+# uses boosting to improve the classification accuracy. Boosting is a method
+# designed to train weak learners (i.e. `estimator`) that learn from their
+# predecessor's mistakes.
+#
+# Here, we define the weak learner as a
+# :class:`~sklearn.tree.DecisionTreeClassifier` and set the maximum number of
+# leaves to 8. In a real setting, this parameter should be tuned. We set it to a
+# rather low value to limit the runtime of the example.
+#
+# The `SAMME` algorithm built into the
+# :class:`~sklearn.ensemble.AdaBoostClassifier` then uses the correct or
+# incorrect predictions made by the current weak learner to update the sample
+# weights used for training the consecutive weak learners. Also, the weight of
+# the weak learner itself is calculated based on its accuracy in classifying the
+# training examples. The weight of the weak learner determines its influence on
+# the final ensemble prediction.
+from sklearn.ensemble import AdaBoostClassifier
+from sklearn.tree import DecisionTreeClassifier
+
+weak_learner = DecisionTreeClassifier(max_leaf_nodes=8)
+n_estimators = 300
+
+adaboost_clf = AdaBoostClassifier(
+    estimator=weak_learner,
+    n_estimators=n_estimators,
    algorithm="SAMME",
-)
+    random_state=42,
+).fit(X_train, y_train)
+
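The sample-weight update described in this comment can be illustrated with a rough, self-contained sketch. `samme_step` is a hypothetical helper written only for this illustration; it is not scikit-learn's implementation and ignores the learning rate, early stopping and edge cases.

```python
import numpy as np


def samme_step(estimator, X, y, sample_weight, n_classes):
    """One conceptual SAMME boosting step: fit the weak learner, compute its
    weighted error, derive its ensemble weight and re-weight the samples."""
    estimator.fit(X, y, sample_weight=sample_weight)
    incorrect = estimator.predict(X) != y

    # Weighted training error of this weak learner.
    err = np.average(incorrect, weights=sample_weight)

    # Weight (influence) of the weak learner in the final ensemble.
    alpha = np.log((1 - err) / err) + np.log(n_classes - 1)

    # Increase the weights of the misclassified samples and re-normalize.
    sample_weight = sample_weight * np.exp(alpha * incorrect)
    sample_weight /= sample_weight.sum()
    return alpha, sample_weight
```

Starting from uniform weights and repeating this step while accumulating the `alpha`-weighted votes of each weak learner is, in essence, what the fitted `adaboost_clf` above does.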
+# %%
+# Analysis
+# --------
+# Convergence of the `AdaBoostClassifier`
+# ***************************************
+# To demonstrate the effectiveness of boosting in improving accuracy, we
+# evaluate the misclassification error of the boosted trees in comparison to two
+# baseline scores. The first baseline score is the `misclassification_error`
+# obtained from a single weak-learner (i.e.
+# :class:`~sklearn.tree.DecisionTreeClassifier`), which serves as a reference
+# point. The second baseline score is obtained from the
+# :class:`~sklearn.dummy.DummyClassifier`, which predicts the most prevalent
+# class in a dataset.
+from sklearn.dummy import DummyClassifier
+from sklearn.metrics import accuracy_score

-bdt_real.fit(X_train, y_train)
-bdt_discrete.fit(X_train, y_train)
+dummy_clf = DummyClassifier()

-real_test_errors = []
-discrete_test_errors = []

-for real_test_predict, discrete_test_predict in zip(
-    bdt_real.staged_predict(X_test), bdt_discrete.staged_predict(X_test)
-):
-    real_test_errors.append(1.0 - accuracy_score(real_test_predict, y_test))
-    discrete_test_errors.append(1.0 - accuracy_score(discrete_test_predict, y_test))
+def misclassification_error(y_true, y_pred):
+    return 1 - accuracy_score(y_true, y_pred)

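A small aside, not part of the diff: the helper above computes the same quantity that `sklearn.metrics.zero_one_loss` returns with its default `normalize=True`, so it could equally be written in terms of that function.

```python
from sklearn.metrics import zero_one_loss

# zero_one_loss(y_true, y_pred) returns the fraction of misclassified samples,
# i.e. exactly 1 - accuracy_score(y_true, y_pred).
assert zero_one_loss([0, 1, 2, 2], [0, 1, 1, 2]) == 0.25
```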
-n_trees_discrete = len(bdt_discrete)
-n_trees_real = len(bdt_real)

-# Boosting might terminate early, but the following arrays are always
-# n_estimators long. We crop them to the actual number of trees here:
-discrete_estimator_errors = bdt_discrete.estimator_errors_[:n_trees_discrete]
-real_estimator_errors = bdt_real.estimator_errors_[:n_trees_real]
-discrete_estimator_weights = bdt_discrete.estimator_weights_[:n_trees_discrete]
+weak_learners_misclassification_error = misclassification_error(
+    y_test, weak_learner.fit(X_train, y_train).predict(X_test)
+)

-plt.figure(figsize=(15, 5))
+dummy_classifiers_misclassification_error = misclassification_error(
+    y_test, dummy_clf.fit(X_train, y_train).predict(X_test)
+)

-plt.subplot(131)
-plt.plot(range(1, n_trees_discrete + 1), discrete_test_errors, c="black", label="SAMME")
-plt.plot(
-    range(1, n_trees_real + 1),
-    real_test_errors,
-    c="black",
-    linestyle="dashed",
-    label="SAMME.R",
+print(
+    "DecisionTreeClassifier's misclassification_error: "
+    f"{weak_learners_misclassification_error:.3f}"
+)
+print(
+    "DummyClassifier's misclassification_error: "
+    f"{dummy_classifiers_misclassification_error:.3f}"
)

-plt.legend()
-plt.ylim(0.18, 0.62)
-plt.ylabel("Test Error")
-plt.xlabel("Number of Trees")

-plt.subplot(132)
+# %%
+# The :class:`~sklearn.tree.DecisionTreeClassifier` trained above achieves a
+# lower error than would be obtained by always guessing the most frequent class
+# label, as the :class:`~sklearn.dummy.DummyClassifier` does.
+#
+# Now, we calculate the `misclassification_error`, i.e. `1 - accuracy`, of the
+# additive model (:class:`~sklearn.ensemble.AdaBoostClassifier`) at each
+# boosting iteration on the test set to assess its performance.
+#
+# We use :meth:`~sklearn.ensemble.AdaBoostClassifier.staged_predict`, which
+# yields as many staged predictions as there are fitted estimators (i.e.
+# corresponding to `n_estimators`). At iteration `n`, the predictions of
+# AdaBoost only use the first `n` weak learners. We compare these predictions
+# with the true labels `y_test` and can therefore assess the benefit (or not)
+# of adding a new weak learner to the chain.
+#
+# We plot the misclassification error for the different stages:
+import matplotlib.pyplot as plt
+import pandas as pd
+
+boosting_errors = pd.DataFrame(
+    {
+        "Number of trees": range(1, n_estimators + 1),
+        "AdaBoost": [
+            misclassification_error(y_test, y_pred)
+            for y_pred in adaboost_clf.staged_predict(X_test)
+        ],
+    }
+).set_index("Number of trees")
+ax = boosting_errors.plot()
+ax.set_ylabel("Misclassification error on test set")
+ax.set_title("Convergence of AdaBoost algorithm")
+
plt.plot(
-    range(1, n_trees_discrete + 1),
-    discrete_estimator_errors,
-    "b",
-    label="SAMME",
-    alpha=0.5,
+    [boosting_errors.index.min(), boosting_errors.index.max()],
+    [weak_learners_misclassification_error, weak_learners_misclassification_error],
+    color="tab:orange",
+    linestyle="dashed",
)
plt.plot(
-    range(1, n_trees_real + 1), real_estimator_errors, "r", label="SAMME.R", alpha=0.5
+    [boosting_errors.index.min(), boosting_errors.index.max()],
+    [
+        dummy_classifiers_misclassification_error,
+        dummy_classifiers_misclassification_error,
+    ],
+    color="c",
+    linestyle="dotted",
)
-plt.legend()
-plt.ylabel("Error")
-plt.xlabel("Number of Trees")
-plt.ylim((0.2, max(real_estimator_errors.max(), discrete_estimator_errors.max()) * 1.2))
-plt.xlim((-20, len(bdt_discrete) + 20))
-
-plt.subplot(133)
-plt.plot(range(1, n_trees_discrete + 1), discrete_estimator_weights, "b", label="SAMME")
-plt.legend()
-plt.ylabel("Weight")
-plt.xlabel("Number of Trees")
-plt.ylim((0, discrete_estimator_weights.max() * 1.2))
-plt.xlim((-20, n_trees_discrete + 20))
-
-# prevent overlapping y-axis labels
-plt.subplots_adjust(wspace=0.25)
+plt.legend(["AdaBoost", "DecisionTreeClassifier", "DummyClassifier"], loc=1)
plt.show()
+
+# %%
+# The plot shows the misclassification error on the test set after each
+# boosting iteration. We see that the error of the boosted trees converges to an
+# error of around 0.3 after 50 iterations, indicating a significantly higher
+# accuracy compared to a single tree, as illustrated by the dashed line in the
+# plot.
+#
+# The misclassification error jitters because the `SAMME` algorithm uses the
+# discrete outputs of the weak learners to train the boosted model.
+#
+# The convergence of :class:`~sklearn.ensemble.AdaBoostClassifier` is mainly
+# influenced by the learning rate (i.e. `learning_rate`), the number of weak
+# learners used (`n_estimators`), and the expressivity of the weak learners
+# (e.g. `max_leaf_nodes`).
+
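To illustrate that last point, a hypothetical follow-up (not part of the example; `adaboost_slow` and `slow_errors` are made-up names) could refit the ensemble with a smaller `learning_rate` and overlay its staged errors on the previous plot:

```python
# Refit the ensemble with a smaller learning rate and compare convergence.
adaboost_slow = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_leaf_nodes=8),
    n_estimators=n_estimators,
    learning_rate=0.3,
    algorithm="SAMME",
    random_state=42,
).fit(X_train, y_train)

slow_errors = [
    misclassification_error(y_test, y_pred)
    for y_pred in adaboost_slow.staged_predict(X_test)
]

ax = boosting_errors.plot()
ax.plot(boosting_errors.index, slow_errors, linestyle="dotted", label="learning_rate=0.3")
ax.set_ylabel("Misclassification error on test set")
ax.legend()
```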
+# %%
+# Errors and weights of the Weak Learners
+# ***************************************
+# As previously mentioned, AdaBoost is a forward stagewise additive model. We
+# now focus on understanding the relationship between the attributed weights of
+# the weak learners and their statistical performance.
+#
+# We use the fitted :class:`~sklearn.ensemble.AdaBoostClassifier`'s attributes
+# `estimator_errors_` and `estimator_weights_` to investigate this link.
+weak_learners_info = pd.DataFrame(
+    {
+        "Number of trees": range(1, n_estimators + 1),
+        "Errors": adaboost_clf.estimator_errors_,
+        "Weights": adaboost_clf.estimator_weights_,
+    }
+).set_index("Number of trees")
+
+axs = weak_learners_info.plot(
+    subplots=True, layout=(1, 2), figsize=(10, 4), legend=False, color="tab:blue"
+)
+axs[0, 0].set_ylabel("Train error")
+axs[0, 0].set_title("Weak learner's training error")
+axs[0, 1].set_ylabel("Weight")
+axs[0, 1].set_title("Weak learner's weight")
+fig = axs[0, 0].get_figure()
+fig.suptitle("Weak learner's errors and weights for the AdaBoostClassifier")
+fig.tight_layout()
+
+# %%
+# On the left plot, we show the weighted error of each weak learner on the
+# reweighted training set at each boosting iteration. On the right plot, we show
+# the weights associated with each weak learner, which are later used to make
+# the predictions of the final additive model.
+#
+# We see that the error of a weak learner is inversely related to its weight:
+# the additive model trusts a weak learner more, by increasing its impact on the
+# final decision, when that learner makes smaller errors on the training set.
+# Indeed, this is exactly the formula used to set the base estimators' weights
+# after each iteration in AdaBoost.
+#
+# |details-start| Mathematical details |details-split|
+#
+# The weight associated with a weak learner trained at stage :math:`m` is
+# inversely associated with its misclassification error such that:
+#
+# .. math:: \alpha^{(m)} = \log \frac{1 - err^{(m)}}{err^{(m)}} + \log (K - 1),
+#
+# where :math:`\alpha^{(m)}` and :math:`err^{(m)}` are the weight and the error
+# of the :math:`m` th weak learner, respectively, and :math:`K` is the number of
+# classes in our classification problem. |details-end|
+#
+# Another interesting observation is that the first weak learners of the model
+# make fewer errors than the later weak learners of the boosting chain.
+#
+# The intuition behind this observation is the following: due to the sample
+# reweighting, later classifiers are forced to try to classify the more
+# difficult or noisy samples and to ignore the already well classified samples.
+# Therefore, the overall error on the training set will increase. That's why the
+# weak learners' weights are built to counter-balance the worse performing weak
+# learners.
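The formula in the details block above can be checked directly against the fitted attributes. This is only a verification sketch under the assumptions of this example (no early stopping occurred and the default `learning_rate=1.0` is used; otherwise the stored weights are scaled by the learning rate):

```python
import numpy as np

K = 3  # number of classes in this example
errors = adaboost_clf.estimator_errors_
expected_weights = np.log((1 - errors) / errors) + np.log(K - 1)

# With the default learning_rate=1.0, the stored weights follow the formula.
np.testing.assert_allclose(adaboost_clf.estimator_weights_, expected_weights)
```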