"""
===============================================
Overview of multiclass training meta-estimators
===============================================

In this example, we discuss the problem of classification when the target
variable is composed of more than two classes. This is called multiclass
classification.

In scikit-learn, all estimators support multiclass classification out of the
box: the most sensible strategy is implemented for the end user. The
:mod:`sklearn.multiclass` module implements various strategies that one can
use for experimenting or for developing third-party estimators that only
support binary classification.

:mod:`sklearn.multiclass` includes OvO/OvR strategies used to train a
multiclass classifier by fitting a set of binary classifiers (the
:class:`~sklearn.multiclass.OneVsOneClassifier` and
:class:`~sklearn.multiclass.OneVsRestClassifier` meta-estimators). This
example will review them.
"""

# %%
# The Yeast UCI dataset
# ---------------------
#
# In this example, we use a UCI dataset [1]_, generally referred to as the
# Yeast dataset. We use the :func:`sklearn.datasets.fetch_openml` function to
# load the dataset from OpenML.
from sklearn.datasets import fetch_openml

X, y = fetch_openml(data_id=181, as_frame=True, return_X_y=True, parser="pandas")

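# %%
# Before looking at the target, we can quickly check the dimensions of the
# feature matrix returned by OpenML:
X.shape
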
# %%
# To know the type of data science problem we are dealing with, we can check
# the target for which we want to build a predictive model.
y.value_counts().sort_index()

# %%
# We see that the target is discrete and composed of 10 classes. We therefore
# deal with a multiclass classification problem.
#
# Strategies comparison
# ---------------------
#
# In the following experiment, we use a
# :class:`~sklearn.tree.DecisionTreeClassifier` and a
# :class:`~sklearn.model_selection.RepeatedStratifiedKFold` cross-validation
# with 3 splits and 5 repetitions.
#
# We compare the following strategies:
#
# * :class:`~sklearn.tree.DecisionTreeClassifier` can handle multiclass
#   classification without needing any special adjustment. It works by
#   recursively partitioning the training data into smaller subsets and
#   predicting the most common class within each subset. By repeating this
#   process, the model can accurately classify input data into multiple
#   different classes.
# * :class:`~sklearn.multiclass.OneVsOneClassifier` trains a set of binary
#   classifiers where each classifier is trained to distinguish between
#   two classes.
# * :class:`~sklearn.multiclass.OneVsRestClassifier` trains a set of binary
#   classifiers where each classifier is trained to distinguish between
#   one class and the rest of the classes.
# * :class:`~sklearn.multiclass.OutputCodeClassifier` trains a set of binary
#   classifiers where each classifier is trained to distinguish between a
#   set of classes and the rest of the classes. The set of classes is defined
#   by a codebook, which is randomly generated in scikit-learn. This method
#   exposes a parameter `code_size` to control the size of the codebook. We
#   set it above one since we are not interested in compressing the class
#   representation but rather in adding redundancy to the encoding. We take a
#   look at the generated codebook right after the cross-validation below.
import pandas as pd

from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.multiclass import (
    OneVsOneClassifier,
    OneVsRestClassifier,
    OutputCodeClassifier,
)
from sklearn.tree import DecisionTreeClassifier

cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=5, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
ovo_tree = OneVsOneClassifier(tree)
ovr_tree = OneVsRestClassifier(tree)
ecoc = OutputCodeClassifier(tree, code_size=2)

cv_results_tree = cross_validate(tree, X, y, cv=cv, n_jobs=2)
cv_results_ovo = cross_validate(ovo_tree, X, y, cv=cv, n_jobs=2)
cv_results_ovr = cross_validate(ovr_tree, X, y, cv=cv, n_jobs=2)
cv_results_ecoc = cross_validate(ecoc, X, y, cv=cv, n_jobs=2)

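# %%
# As a side note, the codebook used by the error-correcting output code
# strategy can be inspected after fitting through the `code_book_` attribute.
# Each class is encoded by one row and, with `code_size=2`, the codebook has
# twice as many columns as there are classes:
ecoc.fit(X, y)
ecoc.code_book_.shape
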
# %%
# We can now compare the statistical performance of the different strategies
# by plotting the distribution of their cross-validated accuracy scores.
from matplotlib import pyplot as plt

scores = pd.DataFrame(
    {
        "DecisionTreeClassifier": cv_results_tree["test_score"],
        "OneVsOneClassifier": cv_results_ovo["test_score"],
        "OneVsRestClassifier": cv_results_ovr["test_score"],
        "OutputCodeClassifier": cv_results_ecoc["test_score"],
    }
)
ax = scores.plot.kde(legend=True)
ax.set_xlabel("Accuracy score")
ax.set_xlim([0, 0.7])
_ = ax.set_title(
    "Density of the accuracy scores for the different multiclass strategies"
)

# %%
# At first glance, we can see that the built-in strategy of the decision tree
# classifier works quite well. The one-vs-one and error-correcting output
# code strategies work even better. However, the one-vs-rest strategy does
# not work as well as the other strategies.
#
# Indeed, these results reproduce findings reported in the literature, as in
# [2]_. However, the story is not as simple as it seems.
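
# %%
# To put numbers on this visual impression, we can aggregate the mean and the
# standard deviation of the cross-validated accuracy scores per strategy:
scores.agg(["mean", "std"]).round(3)

# %%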
#
# The importance of hyperparameter search
# ---------------------------------------
#
# It was later shown in [3]_ that the multiclass strategies show similar
# scores once the hyperparameters of the base classifiers are optimized.
#
# Here we try to reproduce such a result by at least optimizing the depth of
# the base decision tree.
from sklearn.model_selection import GridSearchCV

param_grid = {"max_depth": [3, 5, 8]}
tree_optimized = GridSearchCV(tree, param_grid=param_grid, cv=3)
ovo_tree = OneVsOneClassifier(tree_optimized)
ovr_tree = OneVsRestClassifier(tree_optimized)
ecoc = OutputCodeClassifier(tree_optimized, code_size=2)

cv_results_tree = cross_validate(tree_optimized, X, y, cv=cv, n_jobs=2)
cv_results_ovo = cross_validate(ovo_tree, X, y, cv=cv, n_jobs=2)
cv_results_ovr = cross_validate(ovr_tree, X, y, cv=cv, n_jobs=2)
cv_results_ecoc = cross_validate(ecoc, X, y, cv=cv, n_jobs=2)

scores = pd.DataFrame(
    {
        "DecisionTreeClassifier": cv_results_tree["test_score"],
        "OneVsOneClassifier": cv_results_ovo["test_score"],
        "OneVsRestClassifier": cv_results_ovr["test_score"],
        "OutputCodeClassifier": cv_results_ecoc["test_score"],
    }
)
ax = scores.plot.kde(legend=True)
ax.set_xlabel("Accuracy score")
ax.set_xlim([0, 0.7])
_ = ax.set_title(
    "Density of the accuracy scores for the different multiclass strategies"
)

plt.show()

# %%
# We can see that once the hyperparameters are optimized, all multiclass
# strategies have similar performance, as discussed in [3]_.
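
# %%
# Out of curiosity, we can also check which tree depth the grid search
# selects when it is refit on the full dataset. This refit is only
# illustrative: during the cross-validation above, the depth is re-selected
# independently on each training split.
tree_optimized.fit(X, y)
tree_optimized.best_params_

# %%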
#
# Conclusion
# ----------
#
# We can build some intuition for these results.
#
# First, the reason why one-vs-one and error-correcting output code
# outperform the tree when the hyperparameters are not optimized lies in the
# fact that they ensemble a larger number of classifiers. The ensembling
# improves the generalization performance. This is somewhat similar to why a
# bagging classifier generally performs better than a single decision tree
# when no care is taken to tune the hyperparameters.
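
# %%
# To illustrate the size of each ensemble, we can fit the meta-estimators
# once on the full dataset and count the binary classifiers they train; all
# three expose them through their `estimators_` attribute. One-vs-one trains
# one classifier per pair of classes, one-vs-rest one per class, and the
# output code approach `code_size` times the number of classes.
print(f"Number of classes: {y.nunique()}")
for name, est in [
    ("OneVsOneClassifier", OneVsOneClassifier(tree)),
    ("OneVsRestClassifier", OneVsRestClassifier(tree)),
    ("OutputCodeClassifier", OutputCodeClassifier(tree, code_size=2)),
]:
    est.fit(X, y)
    print(f"{name} trains {len(est.estimators_)} binary classifiers")

# %%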
#
# Then, we see the importance of optimizing the hyperparameters: they should
# be explored systematically when developing predictive models, even if
# techniques such as ensembling help reduce their impact.
#
# Finally, it is important to recall that the estimators in scikit-learn are
# developed with a specific strategy to handle multiclass classification out
# of the box. So for these estimators, there is no need to use different
# strategies. These strategies are mainly useful for third-party estimators
# supporting only binary classification. In all cases, we also showed that
# the hyperparameters should be optimized.
#
# References
# ----------
#
# .. [1] https://archive.ics.uci.edu/ml/datasets/Yeast
#
# .. [2] `"Reducing multiclass to binary: A unifying approach for margin
#    classifiers." Allwein, Erin L., Robert E. Schapire, and Yoram Singer.
#    Journal of Machine Learning Research 1 (2000): 113-141.
#    <https://www.jmlr.org/papers/volume1/allwein00a/allwein00a.pdf>`_.
#
# .. [3] `"In defense of one-vs-all classification." Rifkin, Ryan, and
#    Aldebaro Klautau. Journal of Machine Learning Research 5 (2004):
#    101-141.
#    <https://www.jmlr.org/papers/volume5/rifkin04a/rifkin04a.pdf>`_.