Commit 41ca403 (1 parent: ba3cf1c)

Pushing the docs to dev/ for branch: main, commit 94b81ab2e7f9b0170b2d6ba6d84c1cc913367d8b

File tree: 711 files changed (+2477 / -1948 lines)


dev/_downloads/3c9b7bcd0b16f172ac12ffad61f3b5f0/plot_stack_predictors.ipynb

Lines changed: 85 additions & 5 deletions
@@ -26,7 +26,7 @@
    },
    "outputs": [],
    "source": [
-    "print(__doc__)\n\n# Authors: Guillaume Lemaitre <[email protected]>\n# Maria Telenczuk <https://github.com/maikia>\n# License: BSD 3 clause"
+    "# Authors: Guillaume Lemaitre <[email protected]>\n# Maria Telenczuk <https://github.com/maikia>\n# License: BSD 3 clause\n\nprint(__doc__)\n\nfrom sklearn import set_config\nset_config(display='diagram')"
    ]
   },
   {
@@ -51,7 +51,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Make pipeline to preprocess the data\n\n Before we can use Ames dataset we still need to do some preprocessing.\n First, the dataset has many missing values. To impute them, we will exchange\n categorical missing values with the new category 'missing' while the\n numerical missing values with the 'mean' of the column. We will also encode\n the categories with either :class:`~sklearn.preprocessing.OneHotEncoder\n <sklearn.preprocessing.OneHotEncoder>` or\n :class:`~sklearn.preprocessing.OrdinalEncoder\n <sklearn.preprocessing.OrdinalEncoder>` depending for which type of model we\n will use them (linear or non-linear model). To facilitate this preprocessing\n we will make two pipelines.\n You can skip this section if your data is ready to use and does\n not need preprocessing\n\n"
+    "## Make pipeline to preprocess the data\n\n Before we can use Ames dataset we still need to do some preprocessing.\n First, we will select the categorical and numerical columns of the dataset to\n construct the first step of the pipeline.\n\n"
    ]
   },
   {
@@ -62,14 +62,94 @@
    },
    "outputs": [],
    "source": [
-    "from sklearn.compose import make_column_transformer\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import OneHotEncoder\nfrom sklearn.preprocessing import OrdinalEncoder\nfrom sklearn.preprocessing import StandardScaler\n\n\ncat_cols = X.columns[X.dtypes == 'O']\nnum_cols = X.columns[X.dtypes == 'float64']\n\ncategories = [\n    X[column].unique() for column in X[cat_cols]]\n\nfor cat in categories:\n    cat[cat == None] = 'missing'  # noqa\n\ncat_proc_nlin = make_pipeline(\n    SimpleImputer(missing_values=None, strategy='constant',\n                  fill_value='missing'),\n    OrdinalEncoder(categories=categories)\n    )\n\nnum_proc_nlin = make_pipeline(SimpleImputer(strategy='mean'))\n\ncat_proc_lin = make_pipeline(\n    SimpleImputer(missing_values=None,\n                  strategy='constant',\n                  fill_value='missing'),\n    OneHotEncoder(categories=categories)\n)\n\nnum_proc_lin = make_pipeline(\n    SimpleImputer(strategy='mean'),\n    StandardScaler()\n)\n\n# transformation to use for non-linear estimators\nprocessor_nlin = make_column_transformer(\n    (cat_proc_nlin, cat_cols),\n    (num_proc_nlin, num_cols),\n    remainder='passthrough')\n\n# transformation to use for linear estimators\nprocessor_lin = make_column_transformer(\n    (cat_proc_lin, cat_cols),\n    (num_proc_lin, num_cols),\n    remainder='passthrough')"
+    "from sklearn.compose import make_column_selector\n\ncat_selector = make_column_selector(dtype_include=object)\nnum_selector = make_column_selector(dtype_include=np.number)\ncat_selector(X)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "num_selector(X)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Then, we will need to design preprocessing pipelines which depends on the\nending regressor. If the ending regressor is a linear model, one needs to\none-hot encode the categories. If the ending regressor is a tree-based model\nan ordinal encoder will be sufficient. Besides, numerical values need to be\nstandardized for a linear model while the raw numerical data can be treated\nas is by a tree-based model. However, both models need an imputer to\nhandle missing values.\n\nWe will first design the pipeline required for the tree-based models.\n\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "from sklearn.compose import make_column_transformer\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import OrdinalEncoder\n\ncat_tree_processor = OrdinalEncoder(\n    handle_unknown=\"use_encoded_value\", unknown_value=-1)\nnum_tree_processor = SimpleImputer(strategy=\"mean\", add_indicator=True)\n\ntree_preprocessor = make_column_transformer(\n    (num_tree_processor, num_selector), (cat_tree_processor, cat_selector))\ntree_preprocessor"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Then, we will now define the preprocessor used when the ending regressor\nis a linear model.\n\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "from sklearn.preprocessing import OneHotEncoder\nfrom sklearn.preprocessing import StandardScaler\n\ncat_linear_processor = OneHotEncoder(handle_unknown=\"ignore\")\nnum_linear_processor = make_pipeline(\n    StandardScaler(), SimpleImputer(strategy=\"mean\", add_indicator=True))\n\nlinear_preprocessor = make_column_transformer(\n    (num_linear_processor, num_selector), (cat_linear_processor, cat_selector))\nlinear_preprocessor"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Stack of predictors on a single data set\n\n It is sometimes tedious to find the model which will best perform on a given\n dataset. Stacking provide an alternative by combining the outputs of several\n learners, without the need to choose a model specifically. The performance of\n stacking is usually close to the best model and sometimes it can outperform\n the prediction performance of each individual model.\n\n Here, we combine 3 learners (linear and non-linear) and use a ridge regressor\n to combine their outputs together.\n\n Note: although we will make new pipelines with the processors which we wrote\n in the previous section for the 3 learners, the final estimator RidgeCV()\n does not need preprocessing of the data as it will be fed with the already\n preprocessed output from the 3 learners.\n\n"
+    "## Stack of predictors on a single data set\n\n It is sometimes tedious to find the model which will best perform on a given\n dataset. Stacking provide an alternative by combining the outputs of several\n learners, without the need to choose a model specifically. The performance of\n stacking is usually close to the best model and sometimes it can outperform\n the prediction performance of each individual model.\n\n Here, we combine 3 learners (linear and non-linear) and use a ridge regressor\n to combine their outputs together.\n\n .. note::\n    Although we will make new pipelines with the processors which we wrote in\n    the previous section for the 3 learners, the final estimator\n    :class:`~sklearn.linear_model.RidgeCV()` does not need preprocessing of\n    the data as it will be fed with the already preprocessed output from the 3\n    learners.\n\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "from sklearn.linear_model import LassoCV\n\nlasso_pipeline = make_pipeline(linear_preprocessor, LassoCV())\nlasso_pipeline"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "from sklearn.ensemble import RandomForestRegressor\n\nrf_pipeline = make_pipeline(\n    tree_preprocessor, RandomForestRegressor(random_state=42))\nrf_pipeline"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "from sklearn.experimental import enable_hist_gradient_boosting  # noqa\nfrom sklearn.ensemble import HistGradientBoostingRegressor\n\ngbdt_pipeline = make_pipeline(\n    tree_preprocessor, HistGradientBoostingRegressor(random_state=0))\ngbdt_pipeline"
    ]
   },
   {
@@ -80,7 +160,7 @@
    },
    "outputs": [],
    "source": [
-    "from sklearn.experimental import enable_hist_gradient_boosting  # noqa\nfrom sklearn.ensemble import HistGradientBoostingRegressor\nfrom sklearn.ensemble import RandomForestRegressor\nfrom sklearn.ensemble import StackingRegressor\nfrom sklearn.linear_model import LassoCV\nfrom sklearn.linear_model import RidgeCV\n\n\nlasso_pipeline = make_pipeline(processor_lin,\n                               LassoCV())\n\nrf_pipeline = make_pipeline(processor_nlin,\n                            RandomForestRegressor(random_state=42))\n\ngradient_pipeline = make_pipeline(\n    processor_nlin,\n    HistGradientBoostingRegressor(random_state=0))\n\nestimators = [('Random Forest', rf_pipeline),\n              ('Lasso', lasso_pipeline),\n              ('Gradient Boosting', gradient_pipeline)]\n\nstacking_regressor = StackingRegressor(estimators=estimators,\n                                       final_estimator=RidgeCV())"
+    "from sklearn.ensemble import StackingRegressor\nfrom sklearn.linear_model import RidgeCV\n\nestimators = [('Random Forest', rf_pipeline),\n              ('Lasso', lasso_pipeline),\n              ('Gradient Boosting', gbdt_pipeline)]\n\nstacking_regressor = StackingRegressor(\n    estimators=estimators, final_estimator=RidgeCV())\nstacking_regressor"
    ]
   },
   {
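
The notebook diff above replaces manual dtype-based column indexing (X.columns[X.dtypes == 'O']) with make_column_selector, and turns on diagram display of estimators. Here is a minimal, self-contained sketch of how those two pieces behave; the toy DataFrame and its column names below are invented for illustration and are not part of the commit or the Ames data:

import numpy as np
import pandas as pd

from sklearn import set_config
from sklearn.compose import make_column_selector

# Render estimator reprs as HTML diagrams in notebook front-ends.
set_config(display='diagram')

# Toy stand-in for the Ames frame: one categorical, one numeric column.
X_toy = pd.DataFrame({"neighborhood": ["CollgCr", "Veenker", None],
                      "lot_area": [8450.0, 9600.0, np.nan]})

# make_column_selector returns a callable; applied to a DataFrame, it
# yields the list of column names matching the requested dtypes.
cat_selector = make_column_selector(dtype_include=object)
num_selector = make_column_selector(dtype_include=np.number)

print(cat_selector(X_toy))  # ['neighborhood']
print(num_selector(X_toy))  # ['lot_area']

Passing the selector itself (rather than a resolved column list) to make_column_transformer, as the new version of the example does, defers column resolution to fit time, so the same preprocessor definition works on any frame with compatible dtypes.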

dev/_downloads/c6ccb1a9c5f82321f082e9767a2706f3/plot_stack_predictors.py

Lines changed: 69 additions & 71 deletions
@@ -15,12 +15,15 @@
 stacking strategy. Stacking slightly improves the overall performance.
 
 """
-print(__doc__)
 
 # Authors: Guillaume Lemaitre <[email protected]>
 # Maria Telenczuk <https://github.com/maikia>
 # License: BSD 3 clause
 
+print(__doc__)
+
+from sklearn import set_config
+set_config(display='diagram')
 
 # %%
 # Download the dataset
@@ -73,68 +76,56 @@ def load_ames_housing():
 ##############################################################################
 #
 # Before we can use Ames dataset we still need to do some preprocessing.
-# First, the dataset has many missing values. To impute them, we will exchange
-# categorical missing values with the new category 'missing' while the
-# numerical missing values with the 'mean' of the column. We will also encode
-# the categories with either :class:`~sklearn.preprocessing.OneHotEncoder
-# <sklearn.preprocessing.OneHotEncoder>` or
-# :class:`~sklearn.preprocessing.OrdinalEncoder
-# <sklearn.preprocessing.OrdinalEncoder>` depending for which type of model we
-# will use them (linear or non-linear model). To facilitate this preprocessing
-# we will make two pipelines.
-# You can skip this section if your data is ready to use and does
-# not need preprocessing
+# First, we will select the categorical and numerical columns of the dataset to
+# construct the first step of the pipeline.
+
+from sklearn.compose import make_column_selector
+
+cat_selector = make_column_selector(dtype_include=object)
+num_selector = make_column_selector(dtype_include=np.number)
+cat_selector(X)
 
+# %%
+num_selector(X)
+
+# %%
+# Then, we will need to design preprocessing pipelines which depends on the
+# ending regressor. If the ending regressor is a linear model, one needs to
+# one-hot encode the categories. If the ending regressor is a tree-based model
+# an ordinal encoder will be sufficient. Besides, numerical values need to be
+# standardized for a linear model while the raw numerical data can be treated
+# as is by a tree-based model. However, both models need an imputer to
+# handle missing values.
+#
+# We will first design the pipeline required for the tree-based models.
 
 from sklearn.compose import make_column_transformer
 from sklearn.impute import SimpleImputer
 from sklearn.pipeline import make_pipeline
-from sklearn.preprocessing import OneHotEncoder
 from sklearn.preprocessing import OrdinalEncoder
-from sklearn.preprocessing import StandardScaler
-
-
-cat_cols = X.columns[X.dtypes == 'O']
-num_cols = X.columns[X.dtypes == 'float64']
 
-categories = [
-    X[column].unique() for column in X[cat_cols]]
+cat_tree_processor = OrdinalEncoder(
+    handle_unknown="use_encoded_value", unknown_value=-1)
+num_tree_processor = SimpleImputer(strategy="mean", add_indicator=True)
 
-for cat in categories:
-    cat[cat == None] = 'missing'  # noqa
+tree_preprocessor = make_column_transformer(
+    (num_tree_processor, num_selector), (cat_tree_processor, cat_selector))
+tree_preprocessor
 
-cat_proc_nlin = make_pipeline(
-    SimpleImputer(missing_values=None, strategy='constant',
-                  fill_value='missing'),
-    OrdinalEncoder(categories=categories)
-    )
-
-num_proc_nlin = make_pipeline(SimpleImputer(strategy='mean'))
-
-cat_proc_lin = make_pipeline(
-    SimpleImputer(missing_values=None,
-                  strategy='constant',
-                  fill_value='missing'),
-    OneHotEncoder(categories=categories)
-)
-
-num_proc_lin = make_pipeline(
-    SimpleImputer(strategy='mean'),
-    StandardScaler()
-)
+# %%
+# Then, we will now define the preprocessor used when the ending regressor
+# is a linear model.
 
-# transformation to use for non-linear estimators
-processor_nlin = make_column_transformer(
-    (cat_proc_nlin, cat_cols),
-    (num_proc_nlin, num_cols),
-    remainder='passthrough')
+from sklearn.preprocessing import OneHotEncoder
+from sklearn.preprocessing import StandardScaler
 
-# transformation to use for linear estimators
-processor_lin = make_column_transformer(
-    (cat_proc_lin, cat_cols),
-    (num_proc_lin, num_cols),
-    remainder='passthrough')
+cat_linear_processor = OneHotEncoder(handle_unknown="ignore")
+num_linear_processor = make_pipeline(
    StandardScaler(), SimpleImputer(strategy="mean", add_indicator=True))
 
+linear_preprocessor = make_column_transformer(
+    (num_linear_processor, num_selector), (cat_linear_processor, cat_selector))
+linear_preprocessor
 
 # %%
 # Stack of predictors on a single data set
@@ -149,37 +140,44 @@ def load_ames_housing():
 # Here, we combine 3 learners (linear and non-linear) and use a ridge regressor
 # to combine their outputs together.
 #
-# Note: although we will make new pipelines with the processors which we wrote
-# in the previous section for the 3 learners, the final estimator RidgeCV()
-# does not need preprocessing of the data as it will be fed with the already
-# preprocessed output from the 3 learners.
+# .. note::
+#    Although we will make new pipelines with the processors which we wrote in
+#    the previous section for the 3 learners, the final estimator
+#    :class:`~sklearn.linear_model.RidgeCV()` does not need preprocessing of
+#    the data as it will be fed with the already preprocessed output from the 3
+#    learners.
 
+from sklearn.linear_model import LassoCV
 
-from sklearn.experimental import enable_hist_gradient_boosting  # noqa
-from sklearn.ensemble import HistGradientBoostingRegressor
+lasso_pipeline = make_pipeline(linear_preprocessor, LassoCV())
+lasso_pipeline
+
+# %%
 from sklearn.ensemble import RandomForestRegressor
-from sklearn.ensemble import StackingRegressor
-from sklearn.linear_model import LassoCV
-from sklearn.linear_model import RidgeCV
 
+rf_pipeline = make_pipeline(
+    tree_preprocessor, RandomForestRegressor(random_state=42))
+rf_pipeline
 
-lasso_pipeline = make_pipeline(processor_lin,
-                               LassoCV())
+# %%
+from sklearn.experimental import enable_hist_gradient_boosting  # noqa
+from sklearn.ensemble import HistGradientBoostingRegressor
 
-rf_pipeline = make_pipeline(processor_nlin,
-                            RandomForestRegressor(random_state=42))
+gbdt_pipeline = make_pipeline(
+    tree_preprocessor, HistGradientBoostingRegressor(random_state=0))
+gbdt_pipeline
 
-gradient_pipeline = make_pipeline(
-    processor_nlin,
-    HistGradientBoostingRegressor(random_state=0))
+# %%
+from sklearn.ensemble import StackingRegressor
+from sklearn.linear_model import RidgeCV
 
 estimators = [('Random Forest', rf_pipeline),
              ('Lasso', lasso_pipeline),
-              ('Gradient Boosting', gradient_pipeline)]
-
-stacking_regressor = StackingRegressor(estimators=estimators,
-                                       final_estimator=RidgeCV())
+              ('Gradient Boosting', gbdt_pipeline)]
 
+stacking_regressor = StackingRegressor(
+    estimators=estimators, final_estimator=RidgeCV())
+stacking_regressor
 
 # %%
 # Measure and plot the results
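
For readers who want to run the refactored pipeline end to end, the following condensed sketch assembles the pieces shown in both diffs. It assumes scikit-learn >= 0.24 (where OrdinalEncoder gained handle_unknown="use_encoded_value"); fetching "house_prices" from OpenML is an assumed stand-in for the example's load_ames_housing() helper, which lies outside this diff:

import numpy as np

from sklearn import set_config
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

set_config(display='diagram')

# Assumed stand-in for load_ames_housing(), which this diff does not show.
X, y = fetch_openml(name="house_prices", as_frame=True, return_X_y=True)

cat_selector = make_column_selector(dtype_include=object)
num_selector = make_column_selector(dtype_include=np.number)

# Tree-based models: ordinal-encode categories, impute numerics, no scaling.
tree_preprocessor = make_column_transformer(
    (SimpleImputer(strategy="mean", add_indicator=True), num_selector),
    (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     cat_selector))

# Linear models: one-hot encode categories, scale then impute numerics.
linear_preprocessor = make_column_transformer(
    (make_pipeline(StandardScaler(),
                   SimpleImputer(strategy="mean", add_indicator=True)),
     num_selector),
    (OneHotEncoder(handle_unknown="ignore"), cat_selector))

estimators = [
    ('Random Forest', make_pipeline(
        tree_preprocessor, RandomForestRegressor(random_state=42))),
    ('Lasso', make_pipeline(linear_preprocessor, LassoCV())),
    ('Gradient Boosting', make_pipeline(
        tree_preprocessor, HistGradientBoostingRegressor(random_state=0)))]

stacking_regressor = StackingRegressor(
    estimators=estimators, final_estimator=RidgeCV())
stacking_regressor  # rendered as an HTML diagram under set_config above

The example itself goes on to cross-validate this stack in the "Measure and plot the results" section, which is unchanged by this commit.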

Binary files changed (content not shown):

dev/_downloads/scikit-learn-docs.zip (43 Bytes)
dev/_images/binder_badge_logo.png (0 Bytes)
dev/_images/iris.png (0 Bytes)
