
Commit 4630ed6

Pushing the docs to dev/ for branch: main, commit 8f63882bb43db78d3b1684276329512f89ab18bc

1 parent: de814cf

File tree

1,311 files changed, +6118 −6003 lines changed

dev/.buildinfo

Lines changed: 1 addition & 1 deletion

@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: 28b7307cedb45eb961cbd8228e144f2a
+config: ed2dda44dc5ed392cb6fbae664adcacd
 tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file not shown.

dev/_downloads/4f6558a73e0c79834afc005bac34dc13/plot_target_encoder_cross_val.py

Lines changed: 83 additions & 54 deletions
@@ -6,21 +6,26 @@
 .. currentmodule:: sklearn.preprocessing
 
 The :class:`TargetEncoder` replaces each category of a categorical feature with
-the mean of the target variable for that category. This method is useful
+the shrunk mean of the target variable for that category. This method is useful
 in cases where there is a strong relationship between the categorical feature
 and the target. To prevent overfitting, :meth:`TargetEncoder.fit_transform` uses
-an internal cross fitting scheme to encode the training data to be used by a
-downstream model. In this example, we demonstrate the importance of the cross fitting
-procedure to prevent overfitting.
+an internal :term:`cross fitting` scheme to encode the training data to be used
+by a downstream model. This scheme involves splitting the data into *k* folds
+and encoding each fold using the encodings learnt using the other *k-1* folds.
+In this example, we demonstrate the importance of the cross
+fitting procedure to prevent overfitting.
 """
 
 # %%
 # Create Synthetic Dataset
 # ========================
-# For this example, we build a dataset with three categorical features: an informative
-# feature with medium cardinality, an uninformative feature with medium cardinality,
-# and an uninformative feature with high cardinality. First, we generate the informative
-# feature:
+# For this example, we build a dataset with three categorical features:
+#
+# * an informative feature with medium cardinality ("informative")
+# * an uninformative feature with medium cardinality ("shuffled")
+# * an uninformative feature with high cardinality ("near_unique")
+#
+# First, we generate the informative feature:
 import numpy as np
 
 from sklearn.preprocessing import KBinsDiscretizer
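
The "shrunk mean" wording above refers to a per-category target mean that is pulled toward the global target mean, so that rare categories are not encoded with noisy estimates. As a rough illustration, here is a minimal pandas sketch of the idea; the `smooth` weighting only mirrors the intent of :class:`TargetEncoder`'s `smooth` parameter and is not scikit-learn's exact implementation:

import numpy as np
import pandas as pd

def shrunk_mean_encoding(categories, y, smooth=10.0):
    # Per-category target mean, shrunk toward the global mean.
    # Small categories are pulled harder, which limits overfitting.
    df = pd.DataFrame({"cat": categories, "y": y})
    global_mean = df["y"].mean()
    stats = df.groupby("cat")["y"].agg(["mean", "count"])
    shrunk = (stats["count"] * stats["mean"] + smooth * global_mean) / (
        stats["count"] + smooth
    )
    return df["cat"].map(shrunk).to_numpy()

rng = np.random.default_rng(0)
categories = rng.integers(0, 5, size=100)
y = rng.normal(size=100)
print(shrunk_mean_encoding(categories, y)[:5])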
@@ -33,12 +38,16 @@
 n_categories = 100
 
 kbins = KBinsDiscretizer(
-    n_bins=n_categories, encode="ordinal", strategy="uniform", random_state=rng
+    n_bins=n_categories,
+    encode="ordinal",
+    strategy="uniform",
+    random_state=rng,
+    subsample=None,
 )
 X_informative = kbins.fit_transform((y + noise).reshape(-1, 1))
 
-# Remove the linear relationship between y and the bin index by permuting the values of
-# X_informative
+# Remove the linear relationship between y and the bin index by permuting the
+# values of X_informative:
 permuted_categories = rng.permutation(n_categories)
 X_informative = permuted_categories[X_informative.astype(np.int32)]
 
@@ -48,13 +57,13 @@
 X_shuffled = rng.permutation(X_informative)
 
 # %%
-# The uninformative feature with high cardinality is generated so that is independent of
-# the target variable. We will show that target encoding without cross fitting will
-# cause catastrophic overfitting for the downstream regressor. These high cardinality
-# features are basically unique identifiers for samples which should generally be
-# removed from machine learning dataset. In this example, we generate them to show how
-# :class:`TargetEncoder`'s default cross fitting behavior mitigates the overfitting
-# issue automatically.
+# The uninformative feature with high cardinality is generated so that it is
+# independent of the target variable. We will show that target encoding without
+# :term:`cross fitting` will cause catastrophic overfitting for the downstream
+# regressor. These high cardinality features are basically unique identifiers
+# for samples which should generally be removed from machine learning datasets.
+# In this example, we generate them to show how :class:`TargetEncoder`'s default
+# :term:`cross fitting` behavior mitigates the overfitting issue automatically.
 X_near_unique_categories = rng.choice(
     int(0.9 * n_samples), size=n_samples, replace=True
 ).reshape(-1, 1)
@@ -79,9 +88,10 @@
 # ==========================
 # In this section, we train a ridge regressor on the dataset with and without
 # encoding and explore the influence of target encoder with and without the
-# internal cross fitting. First, we see the Ridge model trained on the
-# raw features will have low performance, because the order of the informative
-# feature is not informative:
+# internal :term:`cross fitting`. First, we see the Ridge model trained on the
+# raw features will have low performance. This is because we permuted the order
+# of the informative feature meaning `X_informative` is not informative when
+# raw:
 import sklearn
 from sklearn.linear_model import Ridge
 
@@ -96,15 +106,15 @@
 
 # %%
 # Next, we create a pipeline with the target encoder and ridge model. The pipeline
-# uses :meth:`TargetEncoder.fit_transform` which uses cross fitting. We see that
-# the model fits the data well and generalizes to the test set:
+# uses :meth:`TargetEncoder.fit_transform` which uses :term:`cross fitting`. We
+# see that the model fits the data well and generalizes to the test set:
 from sklearn.pipeline import make_pipeline
 from sklearn.preprocessing import TargetEncoder
 
-model_with_cv = make_pipeline(TargetEncoder(random_state=0), ridge)
-model_with_cv.fit(X_train, y_train)
-print("Model with CV on training set: ", model_with_cv.score(X_train, y_train))
-print("Model with CV on test set: ", model_with_cv.score(X_test, y_test))
+model_with_cf = make_pipeline(TargetEncoder(random_state=0), ridge)
+model_with_cf.fit(X_train, y_train)
+print("Model with CF on train set: ", model_with_cf.score(X_train, y_train))
+print("Model with CF on test set: ", model_with_cf.score(X_test, y_test))
 
 # %%
 # The coefficients of the linear model shows that most of the weight is on the
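
To make the :term:`cross fitting` scheme described in the docstring concrete, the sketch below computes out-of-fold encodings with :class:`~sklearn.model_selection.KFold`: each fold is encoded using shrunk means learnt on the other folds only, so no sample's own target leaks into its encoding. This is an illustration of the scheme under simplified assumptions, not :class:`TargetEncoder`'s exact internals:

import numpy as np
from sklearn.model_selection import KFold

def cross_fit_encode(X_col, y, n_splits=5, smooth=10.0):
    X_col = np.asarray(X_col)
    y = np.asarray(y, dtype=float)
    encoded = np.empty_like(y)
    for train_idx, test_idx in KFold(n_splits=n_splits).split(X_col):
        y_tr, X_tr = y[train_idx], X_col[train_idx]
        global_mean = y_tr.mean()
        # Shrunk per-category means learnt on the training folds only.
        means = {}
        for cat in np.unique(X_tr):
            mask = X_tr == cat
            means[cat] = (y_tr[mask].sum() + smooth * global_mean) / (mask.sum() + smooth)
        # Categories unseen in the training folds fall back to the global mean.
        encoded[test_idx] = [means.get(c, global_mean) for c in X_col[test_idx]]
    return encoded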
@@ -114,49 +124,68 @@
 
 plt.rcParams["figure.constrained_layout.use"] = True
 
-coefs_cv = pd.Series(
-    model_with_cv[-1].coef_, index=model_with_cv[-1].feature_names_in_
+coefs_cf = pd.Series(
+    model_with_cf[-1].coef_, index=model_with_cf[-1].feature_names_in_
 ).sort_values()
-_ = coefs_cv.plot(kind="barh")
+ax = coefs_cf.plot(kind="barh")
+_ = ax.set(
+    title="Target encoded with cross fitting",
+    xlabel="Ridge coefficient",
+    ylabel="Feature",
+)
 
 # %%
-# While :meth:`TargetEncoder.fit_transform` uses an internal cross fitting scheme,
-# :meth:`TargetEncoder.transform` itself does not perform any cross fitting.
-# It uses the aggregation of the complete training set to transform the categorical
-# features. Thus, we can use :meth:`TargetEncoder.fit` followed by
-# :meth:`TargetEncoder.transform` to disable the cross fitting. This encoding
-# is then passed to the ridge model.
+# While :meth:`TargetEncoder.fit_transform` uses an internal
+# :term:`cross fitting` scheme to learn encodings for the training set,
+# :meth:`TargetEncoder.transform` itself does not.
+# It uses the complete training set to learn encodings and to transform the
+# categorical features. Thus, we can use :meth:`TargetEncoder.fit` followed by
+# :meth:`TargetEncoder.transform` to disable the :term:`cross fitting`. This
+# encoding is then passed to the ridge model.
 target_encoder = TargetEncoder(random_state=0)
 target_encoder.fit(X_train, y_train)
-X_train_no_cv_encoding = target_encoder.transform(X_train)
-X_test_no_cv_encoding = target_encoder.transform(X_test)
+X_train_no_cf_encoding = target_encoder.transform(X_train)
+X_test_no_cf_encoding = target_encoder.transform(X_test)
 
-model_no_cv = ridge.fit(X_train_no_cv_encoding, y_train)
+model_no_cf = ridge.fit(X_train_no_cf_encoding, y_train)
 
 # %%
-# We evaluate the model on the non-cross validated encoding and see that it overfits:
+# We evaluate the model that did not use :term:`cross fitting` when encoding and
+# see that it overfits:
 print(
-    "Model without CV on training set: ",
-    model_no_cv.score(X_train_no_cv_encoding, y_train),
+    "Model without CF on training set: ",
+    model_no_cf.score(X_train_no_cf_encoding, y_train),
 )
 print(
-    "Model without CV on test set: ", model_no_cv.score(X_test_no_cv_encoding, y_test)
+    "Model without CF on test set: ",
+    model_no_cf.score(
+        X_test_no_cf_encoding,
+        y_test,
+    ),
 )
 
 # %%
-# The ridge model overfits, because it assigns more weight to the extremely high
-# cardinality feature relative to the informative feature.
-coefs_no_cv = pd.Series(
-    model_no_cv.coef_, index=model_no_cv.feature_names_in_
+# The ridge model overfits because it assigns much more weight to the
+# uninformative extremely high cardinality ("near_unique") and medium
+# cardinality ("shuffled") features than when the model used
+# :term:`cross fitting` to encode the features.
+coefs_no_cf = pd.Series(
+    model_no_cf.coef_, index=model_no_cf.feature_names_in_
 ).sort_values()
-_ = coefs_no_cv.plot(kind="barh")
+ax = coefs_no_cf.plot(kind="barh")
+_ = ax.set(
+    title="Target encoded without cross fitting",
+    xlabel="Ridge coefficient",
+    ylabel="Feature",
+)
 
 # %%
 # Conclusion
 # ==========
-# This example demonstrates the importance of :class:`TargetEncoder`'s internal cross
-# fitting. It is important to use :meth:`TargetEncoder.fit_transform` to encode
-# training data before passing it to a machine learning model. When a
-# :class:`TargetEncoder` is a part of a :class:`~sklearn.pipeline.Pipeline` and the
-# pipeline is fitted, the pipeline will correctly call
-# :meth:`TargetEncoder.fit_transform` and pass the encoding along.
+# This example demonstrates the importance of :class:`TargetEncoder`'s internal
+# :term:`cross fitting`. It is important to use
+# :meth:`TargetEncoder.fit_transform` to encode training data before passing it
+# to a machine learning model. When a :class:`TargetEncoder` is a part of a
+# :class:`~sklearn.pipeline.Pipeline` and the pipeline is fitted, the pipeline
+# will correctly call :meth:`TargetEncoder.fit_transform` and use
+# :term:`cross fitting` when encoding the training data.
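
A quick way to see that the two code paths really differ: on the same training data, the cross-fitted encodings returned by `fit_transform` do not match the full-data encodings returned by `fit` followed by `transform`. A small sketch, reusing `X_train` and `y_train` from the example above:

import numpy as np
from sklearn.preprocessing import TargetEncoder

enc = TargetEncoder(random_state=0)
X_cf = enc.fit_transform(X_train, y_train)             # cross-fitted encodings
X_full = enc.fit(X_train, y_train).transform(X_train)  # full-data encodings
# The two schemes disagree on the training data, which is exactly the point:
print("identical encodings:", np.allclose(X_cf, X_full))  # expected: False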
Binary file not shown.
