.. currentmodule:: sklearn.preprocessing

The :class:`TargetEncoder` replaces each category of a categorical feature with
the shrunk mean of the target variable for that category. This method is useful
in cases where there is a strong relationship between the categorical feature
and the target. To prevent overfitting, :meth:`TargetEncoder.fit_transform` uses
an internal :term:`cross fitting` scheme to encode the training data to be used
by a downstream model. This scheme involves splitting the data into *k* folds
and encoding each fold with the encodings learnt from the other *k-1* folds.
In this example, we demonstrate the importance of the cross fitting procedure
to prevent overfitting.
"""
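
# %%
# As a minimal illustration of the shrunk mean (a hand-rolled sketch, not
# scikit-learn's exact implementation), the encoding of a category blends the
# per-category target mean with the global target mean, weighted by the
# category count and a smoothing strength (`smooth` below is a made-up value):
import numpy as np

toy_categories = np.array(["a", "a", "a", "b", "b"])
toy_y = np.array([10.0, 12.0, 14.0, 0.0, 2.0])
smooth = 5.0  # assumed smoothing strength, for illustration only

global_mean = toy_y.mean()
for category in np.unique(toy_categories):
    mask = toy_categories == category
    n, category_mean = mask.sum(), toy_y[mask].mean()
    # Blend the per-category mean with the global mean: small categories are
    # pulled more strongly toward the global mean.
    shrunk_mean = (n * category_mean + smooth * global_mean) / (n + smooth)
    print(f"{category}: mean={category_mean:.2f}, shrunk mean={shrunk_mean:.2f}")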

# %%
# Create Synthetic Dataset
# ========================
# For this example, we build a dataset with three categorical features:
#
# * an informative feature with medium cardinality ("informative")
# * an uninformative feature with medium cardinality ("shuffled")
# * an uninformative feature with high cardinality ("near_unique")
#
# First, we generate the informative feature:
import numpy as np

from sklearn.preprocessing import KBinsDiscretizer

n_samples = 50_000

rng = np.random.RandomState(42)
y = rng.randn(n_samples)
noise = 0.5 * rng.randn(n_samples)
n_categories = 100

kbins = KBinsDiscretizer(
    n_bins=n_categories,
    encode="ordinal",
    strategy="uniform",
    random_state=rng,
    subsample=None,
)
X_informative = kbins.fit_transform((y + noise).reshape(-1, 1))

# Remove the linear relationship between y and the bin index by permuting the
# values of X_informative:
permuted_categories = rng.permutation(n_categories)
X_informative = permuted_categories[X_informative.astype(np.int32)]

# %%
# The uninformative feature with medium cardinality is generated by permuting
# the informative feature, removing its relationship with the target:
X_shuffled = rng.permutation(X_informative)

# %%
# The uninformative feature with high cardinality is generated so that it is
# independent of the target variable. We will show that target encoding without
# :term:`cross fitting` will cause catastrophic overfitting for the downstream
# regressor. These high cardinality features are basically unique identifiers
# for samples, which should generally be removed from machine learning datasets.
# In this example, we generate them to show how :class:`TargetEncoder`'s default
# :term:`cross fitting` behavior mitigates the overfitting issue automatically.
X_near_unique_categories = rng.choice(
    int(0.9 * n_samples), size=n_samples, replace=True
).reshape(-1, 1)

# %%
# Finally, we assemble the dataset and perform a train test split:
import pandas as pd

from sklearn.model_selection import train_test_split

X = pd.DataFrame(
    np.concatenate(
        [X_informative, X_shuffled, X_near_unique_categories],
        axis=1,
    ),
    columns=["informative", "shuffled", "near_unique"],
)
y = pd.Series(y, name="target")

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# %%
# Training a Ridge Regressor
# ==========================
# In this section, we train a ridge regressor on the dataset with and without
# encoding and explore the influence of the target encoder with and without the
# internal :term:`cross fitting`. First, we see that the Ridge model trained on
# the raw features will have low performance. This is because we permuted the
# order of the informative feature, meaning `X_informative` is not informative
# when used raw:
import sklearn
from sklearn.linear_model import Ridge

# Configure transformers to output DataFrames
sklearn.set_config(transform_output="pandas")

ridge = Ridge(alpha=1e-6, solver="lsqr", fit_intercept=False)

raw_model = ridge.fit(X_train, y_train)
print("Raw Model score on training set: ", raw_model.score(X_train, y_train))
print("Raw Model score on test set: ", raw_model.score(X_test, y_test))

# %%
# Next, we create a pipeline with the target encoder and ridge model. The
# pipeline uses :meth:`TargetEncoder.fit_transform`, which uses
# :term:`cross fitting`. We see that the model fits the data well and
# generalizes to the test set:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import TargetEncoder

model_with_cf = make_pipeline(TargetEncoder(random_state=0), ridge)
model_with_cf.fit(X_train, y_train)
print("Model with CF on training set: ", model_with_cf.score(X_train, y_train))
print("Model with CF on test set: ", model_with_cf.score(X_test, y_test))
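
# %%
# To make the :term:`cross fitting` scheme concrete, here is a rough,
# self-contained sketch of the idea on a toy column (simplified to plain
# per-category means, without shrinkage), using 2 folds:
from sklearn.model_selection import KFold

toy_x = np.array(["a", "b", "a", "b", "a", "b"])
toy_target = np.array([1.0, 0.0, 3.0, 2.0, 5.0, 4.0])
toy_encoded = np.empty_like(toy_target)
for fit_idx, encode_idx in KFold(n_splits=2).split(toy_x):
    # Learn per-category means on the other fold(s)...
    fold_means = {
        category: toy_target[fit_idx][toy_x[fit_idx] == category].mean()
        for category in np.unique(toy_x[fit_idx])
    }
    # ...and use them to encode the held-out fold.
    toy_encoded[encode_idx] = [fold_means[c] for c in toy_x[encode_idx]]
print(toy_encoded)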

# %%
# The coefficients of the linear model show that most of the weight is on the
# informative feature:
import matplotlib.pyplot as plt
import pandas as pd

plt.rcParams["figure.constrained_layout.use"] = True

coefs_cf = pd.Series(
    model_with_cf[-1].coef_, index=model_with_cf[-1].feature_names_in_
).sort_values()
ax = coefs_cf.plot(kind="barh")
_ = ax.set(
    title="Target encoded with cross fitting",
    xlabel="Ridge coefficient",
    ylabel="Feature",
)

# %%
# While :meth:`TargetEncoder.fit_transform` uses an internal
# :term:`cross fitting` scheme to learn encodings for the training set,
# :meth:`TargetEncoder.transform` itself does not. Instead, it applies the
# encodings learnt from the complete training set to transform the categorical
# features. Thus, we can use :meth:`TargetEncoder.fit` followed by
# :meth:`TargetEncoder.transform` to disable the :term:`cross fitting`. This
# encoding is then passed to the ridge model.
target_encoder = TargetEncoder(random_state=0)
target_encoder.fit(X_train, y_train)
X_train_no_cf_encoding = target_encoder.transform(X_train)
X_test_no_cf_encoding = target_encoder.transform(X_test)

model_no_cf = ridge.fit(X_train_no_cf_encoding, y_train)
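
# %%
# To make the difference concrete, we can check that the cross-fitted training
# encoding returned by :meth:`TargetEncoder.fit_transform` does not match the
# full-data encoding returned by :meth:`TargetEncoder.transform`:
X_train_cf_encoding = TargetEncoder(random_state=0).fit_transform(X_train, y_train)
print(np.allclose(X_train_cf_encoding, X_train_no_cf_encoding))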

# %%
# We evaluate the model that did not use :term:`cross fitting` when encoding and
# see that it overfits:
print(
    "Model without CF on training set: ",
    model_no_cf.score(X_train_no_cf_encoding, y_train),
)
print(
    "Model without CF on test set: ",
    model_no_cf.score(X_test_no_cf_encoding, y_test),
)

# %%
# The ridge model overfits because it assigns much more weight to the
# uninformative, extremely high cardinality ("near_unique") and medium
# cardinality ("shuffled") features than when the model used
# :term:`cross fitting` to encode the features.
coefs_no_cf = pd.Series(
    model_no_cf.coef_, index=model_no_cf.feature_names_in_
).sort_values()
ax = coefs_no_cf.plot(kind="barh")
_ = ax.set(
    title="Target encoded without cross fitting",
    xlabel="Ridge coefficient",
    ylabel="Feature",
)

# %%
# Conclusion
# ==========
# This example demonstrates the importance of :class:`TargetEncoder`'s internal
# :term:`cross fitting`. It is important to use
# :meth:`TargetEncoder.fit_transform` to encode training data before passing it
# to a machine learning model. When a :class:`TargetEncoder` is part of a
# :class:`~sklearn.pipeline.Pipeline` and the pipeline is fitted, the pipeline
# will correctly call :meth:`TargetEncoder.fit_transform` and use
# :term:`cross fitting` when encoding the training data.