
Commit 630ed0c

Pushing the docs to dev/ for branch: main, commit 2fd6e34f2b1788503c649da47d0b2fb7267cbe41
1 parent 8d0afae commit 630ed0c

1,383 files changed: +5462 additions, -5156 deletions


dev/_downloads/26f110ad6cff1a8a7c58b1a00d8b8b5a/plot_column_transformer_mixed_types.ipynb

Lines changed: 9 additions & 9 deletions

@@ -15,7 +15,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"\n# Column Transformer with Mixed Types\n\n.. currentmodule:: sklearn\n\nThis example illustrates how to apply different preprocessing and feature\nextraction pipelines to different subsets of features, using\n:class:`~compose.ColumnTransformer`. This is particularly handy for the\ncase of datasets that contain heterogeneous data types, since we may want to\nscale the numeric features and one-hot encode the categorical ones.\n\nIn this example, the numeric data is standard-scaled after mean-imputation. The\ncategorical data is one-hot encoded via ``OneHotEncoder``, which\ncreates a new category for missing values.\n\nIn addition, we show two different ways to dispatch the columns to the\nparticular pre-processor: by column names and by column data types.\n\nFinally, the preprocessing pipeline is integrated in a full prediction pipeline\nusing :class:`~pipeline.Pipeline`, together with a simple classification\nmodel.\n"
+"\n# Column Transformer with Mixed Types\n\n.. currentmodule:: sklearn\n\nThis example illustrates how to apply different preprocessing and feature\nextraction pipelines to different subsets of features, using\n:class:`~compose.ColumnTransformer`. This is particularly handy for the\ncase of datasets that contain heterogeneous data types, since we may want to\nscale the numeric features and one-hot encode the categorical ones.\n\nIn this example, the numeric data is standard-scaled after mean-imputation. The\ncategorical data is one-hot encoded via ``OneHotEncoder``, which\ncreates a new category for missing values. We further reduce the dimensionality\nby selecting categories using a chi-squared test.\n\nIn addition, we show two different ways to dispatch the columns to the\nparticular pre-processor: by column names and by column data types.\n\nFinally, the preprocessing pipeline is integrated in a full prediction pipeline\nusing :class:`~pipeline.Pipeline`, together with a simple classification\nmodel.\n"
 ]
 },
 {
@@ -37,7 +37,7 @@
 },
 "outputs": [],
 "source": [
-"import numpy as np\n\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.datasets import fetch_openml\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.preprocessing import StandardScaler, OneHotEncoder\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import train_test_split, GridSearchCV\n\nnp.random.seed(0)"
+"import numpy as np\n\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.datasets import fetch_openml\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.preprocessing import StandardScaler, OneHotEncoder\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import train_test_split, RandomizedSearchCV\nfrom sklearn.feature_selection import SelectPercentile, chi2\n\nnp.random.seed(0)"
 ]
 },
 {
@@ -73,7 +73,7 @@
 },
 "outputs": [],
 "source": [
-"numeric_features = [\"age\", \"fare\"]\nnumeric_transformer = Pipeline(\n steps=[(\"imputer\", SimpleImputer(strategy=\"median\")), (\"scaler\", StandardScaler())]\n)\n\ncategorical_features = [\"embarked\", \"sex\", \"pclass\"]\ncategorical_transformer = OneHotEncoder(handle_unknown=\"ignore\")\n\npreprocessor = ColumnTransformer(\n transformers=[\n (\"num\", numeric_transformer, numeric_features),\n (\"cat\", categorical_transformer, categorical_features),\n ]\n)"
+"numeric_features = [\"age\", \"fare\"]\nnumeric_transformer = Pipeline(\n steps=[(\"imputer\", SimpleImputer(strategy=\"median\")), (\"scaler\", StandardScaler())]\n)\n\ncategorical_features = [\"embarked\", \"sex\", \"pclass\"]\ncategorical_transformer = Pipeline(\n steps=[\n (\"encoder\", OneHotEncoder(handle_unknown=\"ignore\")),\n (\"selector\", SelectPercentile(chi2, percentile=50)),\n ]\n)\npreprocessor = ColumnTransformer(\n transformers=[\n (\"num\", numeric_transformer, numeric_features),\n (\"cat\", categorical_transformer, categorical_features),\n ]\n)"
 ]
 },
 {
@@ -206,7 +206,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Using the prediction pipeline in a grid search\n\nGrid search can also be performed on the different preprocessing steps\ndefined in the ``ColumnTransformer`` object, together with the classifier's\nhyperparameters as part of the ``Pipeline``.\nWe will search for both the imputer strategy of the numeric preprocessing\nand the regularization parameter of the logistic regression using\n:class:`~sklearn.model_selection.GridSearchCV`.\n\n"
+"Using the prediction pipeline in a grid search\n\nGrid search can also be performed on the different preprocessing steps\ndefined in the ``ColumnTransformer`` object, together with the classifier's\nhyperparameters as part of the ``Pipeline``.\nWe will search for both the imputer strategy of the numeric preprocessing\nand the regularization parameter of the logistic regression using\n:class:`~sklearn.model_selection.RandomizedSearchCV`. This\nhyperparameter search randomly selects a fixed number of parameter\nsettings configured by `n_iter`. Alternatively, one can use\n:class:`~sklearn.model_selection.GridSearchCV` but the cartesian product of\nthe parameter space will be evaluated.\n\n"
 ]
 },
 {
@@ -217,7 +217,7 @@
 },
 "outputs": [],
 "source": [
-"param_grid = {\n \"preprocessor__num__imputer__strategy\": [\"mean\", \"median\"],\n \"classifier__C\": [0.1, 1.0, 10, 100],\n}\n\ngrid_search = GridSearchCV(clf, param_grid, cv=10)\ngrid_search"
+"param_grid = {\n \"preprocessor__num__imputer__strategy\": [\"mean\", \"median\"],\n \"preprocessor__cat__selector__percentile\": [10, 30, 50, 70],\n \"classifier__C\": [0.1, 1.0, 10, 100],\n}\n\nsearch_cv = RandomizedSearchCV(clf, param_grid, n_iter=10, random_state=0)\nsearch_cv"
 ]
 },
 {
@@ -235,7 +235,7 @@
 },
 "outputs": [],
 "source": [
-"grid_search.fit(X_train, y_train)\n\nprint(\"Best params:\")\nprint(grid_search.best_params_)"
+"search_cv.fit(X_train, y_train)\n\nprint(\"Best params:\")\nprint(search_cv.best_params_)"
 ]
 },
 {
@@ -253,7 +253,7 @@
 },
 "outputs": [],
 "source": [
-"print(f\"Internal CV score: {grid_search.best_score_:.3f}\")"
+"print(f\"Internal CV score: {search_cv.best_score_:.3f}\")"
 ]
 },
 {
@@ -271,7 +271,7 @@
 },
 "outputs": [],
 "source": [
-"import pandas as pd\n\ncv_results = pd.DataFrame(grid_search.cv_results_)\ncv_results = cv_results.sort_values(\"mean_test_score\", ascending=False)\ncv_results[\n [\n \"mean_test_score\",\n \"std_test_score\",\n \"param_preprocessor__num__imputer__strategy\",\n \"param_classifier__C\",\n ]\n].head(5)"
+"import pandas as pd\n\ncv_results = pd.DataFrame(search_cv.cv_results_)\ncv_results = cv_results.sort_values(\"mean_test_score\", ascending=False)\ncv_results[\n [\n \"mean_test_score\",\n \"std_test_score\",\n \"param_preprocessor__num__imputer__strategy\",\n \"param_preprocessor__cat__selector__percentile\",\n \"param_classifier__C\",\n ]\n].head(5)"
 ]
 },
 {
@@ -289,7 +289,7 @@
 },
 "outputs": [],
 "source": [
-"print(\n (\n \"best logistic regression from grid search: %.3f\"\n % grid_search.score(X_test, y_test)\n )\n)"
+"print(\n \"accuracy of the best model from randomized search: \"\n f\"{search_cv.score(X_test, y_test):.3f}\"\n)"
 ]
 }
 ],
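
A note on the new categorical branch in this diff: ``OneHotEncoder`` is now followed by ``SelectPercentile(chi2, percentile=50)``, so only the encoded columns most associated with the target are kept, and the selector therefore needs ``y`` at fit time. Below is a minimal, self-contained sketch of that preprocessor; the tiny dataframe is a made-up stand-in for the Titanic data fetched in the example.

# Sketch of the updated preprocessing, on illustrative toy data.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame(
    {
        "age": [22.0, 38.0, 26.0, None, 35.0, 54.0],
        "fare": [7.25, 71.28, 7.92, 8.05, 53.10, 51.86],
        "embarked": ["S", "C", "S", "S", "S", "C"],
        "sex": ["male", "female", "female", "male", "female", "male"],
    }
)
y = [0, 1, 1, 0, 1, 0]

numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)
# One-hot encode, then keep the 50% of encoded columns most associated with y.
categorical_transformer = Pipeline(
    steps=[
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ("selector", SelectPercentile(chi2, percentile=50)),
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, ["age", "fare"]),
        ("cat", categorical_transformer, ["embarked", "sex"]),
    ]
)
# SelectPercentile is supervised, so the target must be passed when fitting.
print(preprocessor.fit_transform(X, y).shape)  # 6 rows, numeric + selected dummy columns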

dev/_downloads/41973816d3932cd07b75d8825fd2c13d/plot_svm_anova.py

Lines changed: 2 additions & 2 deletions

@@ -26,7 +26,7 @@
 # Create the pipeline
 # -------------------
 from sklearn.pipeline import Pipeline
-from sklearn.feature_selection import SelectPercentile, chi2
+from sklearn.feature_selection import SelectPercentile, f_classif
 from sklearn.preprocessing import StandardScaler
 from sklearn.svm import SVC
 
@@ -35,7 +35,7 @@
 
 clf = Pipeline(
     [
-        ("anova", SelectPercentile(chi2)),
+        ("anova", SelectPercentile(f_classif)),
         ("scaler", StandardScaler()),
         ("svc", SVC(gamma="auto")),
     ]
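
For context on this swap: ``chi2`` only accepts non-negative feature values, while ``f_classif`` (the ANOVA F-test) works on any real-valued features, so it is the safer univariate score before scaling. A small sketch assuming synthetic, partly negative data rather than the dataset used in the example:

# Illustrative sketch: f_classif handles features with negative values,
# where chi2 would raise an error.
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 20))  # real-valued features, some negative
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

clf = Pipeline(
    [
        ("anova", SelectPercentile(f_classif, percentile=10)),  # keep 2 of 20 features
        ("scaler", StandardScaler()),
        ("svc", SVC(gamma="auto")),
    ]
)
print(clf.fit(X, y).score(X, y))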

dev/_downloads/6f4a6a0d8063b616c4aa4db2865de57c/plot_svm_anova.ipynb

Lines changed: 1 addition & 1 deletion

@@ -51,7 +51,7 @@
 },
 "outputs": [],
 "source": [
-"from sklearn.pipeline import Pipeline\nfrom sklearn.feature_selection import SelectPercentile, chi2\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.svm import SVC\n\n# Create a feature-selection transform, a scaler and an instance of SVM that we\n# combine together to have a full-blown estimator\n\nclf = Pipeline(\n [\n (\"anova\", SelectPercentile(chi2)),\n (\"scaler\", StandardScaler()),\n (\"svc\", SVC(gamma=\"auto\")),\n ]\n)"
+"from sklearn.pipeline import Pipeline\nfrom sklearn.feature_selection import SelectPercentile, f_classif\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.svm import SVC\n\n# Create a feature-selection transform, a scaler and an instance of SVM that we\n# combine together to have a full-blown estimator\n\nclf = Pipeline(\n [\n (\"anova\", SelectPercentile(f_classif)),\n (\"scaler\", StandardScaler()),\n (\"svc\", SVC(gamma=\"auto\")),\n ]\n)"
 ]
 },
 {

dev/_downloads/79c38d2f2cb1f2ef7d68e0cc7ea7b4e4/plot_column_transformer_mixed_types.py

Lines changed: 25 additions & 15 deletions

@@ -13,7 +13,8 @@
 
 In this example, the numeric data is standard-scaled after mean-imputation. The
 categorical data is one-hot encoded via ``OneHotEncoder``, which
-creates a new category for missing values.
+creates a new category for missing values. We further reduce the dimensionality
+by selecting categories using a chi-squared test.
 
 In addition, we show two different ways to dispatch the columns to the
 particular pre-processor: by column names and by column data types.
@@ -37,7 +38,8 @@
 from sklearn.impute import SimpleImputer
 from sklearn.preprocessing import StandardScaler, OneHotEncoder
 from sklearn.linear_model import LogisticRegression
-from sklearn.model_selection import train_test_split, GridSearchCV
+from sklearn.model_selection import train_test_split, RandomizedSearchCV
+from sklearn.feature_selection import SelectPercentile, chi2
 
 np.random.seed(0)
 
@@ -77,8 +79,12 @@
 )
 
 categorical_features = ["embarked", "sex", "pclass"]
-categorical_transformer = OneHotEncoder(handle_unknown="ignore")
-
+categorical_transformer = Pipeline(
+    steps=[
+        ("encoder", OneHotEncoder(handle_unknown="ignore")),
+        ("selector", SelectPercentile(chi2, percentile=50)),
+    ]
+)
 preprocessor = ColumnTransformer(
     transformers=[
         ("num", numeric_transformer, numeric_features),
@@ -173,40 +179,46 @@
 # hyperparameters as part of the ``Pipeline``.
 # We will search for both the imputer strategy of the numeric preprocessing
 # and the regularization parameter of the logistic regression using
-# :class:`~sklearn.model_selection.GridSearchCV`.
+# :class:`~sklearn.model_selection.RandomizedSearchCV`. This
+# hyperparameter search randomly selects a fixed number of parameter
+# settings configured by `n_iter`. Alternatively, one can use
+# :class:`~sklearn.model_selection.GridSearchCV` but the cartesian product of
+# the parameter space will be evaluated.
 
 param_grid = {
     "preprocessor__num__imputer__strategy": ["mean", "median"],
+    "preprocessor__cat__selector__percentile": [10, 30, 50, 70],
     "classifier__C": [0.1, 1.0, 10, 100],
 }
 
-grid_search = GridSearchCV(clf, param_grid, cv=10)
-grid_search
+search_cv = RandomizedSearchCV(clf, param_grid, n_iter=10, random_state=0)
+search_cv
 
 # %%
 # Calling 'fit' triggers the cross-validated search for the best
 # hyper-parameters combination:
 #
-grid_search.fit(X_train, y_train)
+search_cv.fit(X_train, y_train)
 
 print("Best params:")
-print(grid_search.best_params_)
+print(search_cv.best_params_)
 
 # %%
 # The internal cross-validation scores obtained by those parameters is:
-print(f"Internal CV score: {grid_search.best_score_:.3f}")
+print(f"Internal CV score: {search_cv.best_score_:.3f}")
 
 # %%
 # We can also introspect the top grid search results as a pandas dataframe:
 import pandas as pd
 
-cv_results = pd.DataFrame(grid_search.cv_results_)
+cv_results = pd.DataFrame(search_cv.cv_results_)
 cv_results = cv_results.sort_values("mean_test_score", ascending=False)
 cv_results[
     [
         "mean_test_score",
         "std_test_score",
         "param_preprocessor__num__imputer__strategy",
+        "param_preprocessor__cat__selector__percentile",
         "param_classifier__C",
     ]
 ].head(5)
@@ -217,8 +229,6 @@
 # not used for hyperparameter tuning.
 #
 print(
-    (
-        "best logistic regression from grid search: %.3f"
-        % grid_search.score(X_test, y_test)
-    )
+    "accuracy of the best model from randomized search: "
+    f"{search_cv.score(X_test, y_test):.3f}"
 )
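
For scale: the updated search space has 2 imputer strategies x 4 selector percentiles x 4 values of C, i.e. 32 combinations, so ``GridSearchCV`` would fit all 32 candidates per CV split while ``RandomizedSearchCV`` with ``n_iter=10`` fits only 10 sampled ones. A tiny sketch of that difference on a synthetic stand-in (parameter names echo the example's pipeline, but the data and classifier are placeholders):

# Candidate-count sketch on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)
clf = Pipeline(
    [("scaler", StandardScaler()), ("classifier", LogisticRegression(max_iter=1000))]
)

param_grid = {"classifier__C": [0.1, 1.0, 10, 100]}

grid_search = GridSearchCV(clf, param_grid, cv=5).fit(X, y)
random_search = RandomizedSearchCV(clf, param_grid, n_iter=2, cv=5, random_state=0).fit(X, y)

# The grid evaluates every combination (4 here); the randomized search only
# evaluates the n_iter sampled settings (2 here).
print(len(grid_search.cv_results_["params"]), len(random_search.cv_results_["params"]))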

dev/_downloads/e38f4849bd47832b7b365f2fa9d31dd6/plot_compare_reduction.ipynb

Lines changed: 23 additions & 1 deletion

@@ -18,6 +18,17 @@
 "\n# Selecting dimensionality reduction with Pipeline and GridSearchCV\n\nThis example constructs a pipeline that does dimensionality\nreduction followed by prediction with a support vector\nclassifier. It demonstrates the use of ``GridSearchCV`` and\n``Pipeline`` to optimize over different classes of estimators in a\nsingle CV run -- unsupervised ``PCA`` and ``NMF`` dimensionality\nreductions are compared to univariate feature selection during\nthe grid search.\n\nAdditionally, ``Pipeline`` can be instantiated with the ``memory``\nargument to memoize the transformers within the pipeline, avoiding to fit\nagain the same transformers over and over.\n\nNote that the use of ``memory`` to enable caching becomes interesting when the\nfitting of a transformer is costly.\n"
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"# Authors: Robert McGibbon\n# Joel Nothman\n# Guillaume Lemaitre"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -33,7 +44,18 @@
 },
 "outputs": [],
 "source": [
-"# Authors: Robert McGibbon, Joel Nothman, Guillaume Lemaitre\n\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import GridSearchCV\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.svm import LinearSVC\nfrom sklearn.decomposition import PCA, NMF\nfrom sklearn.feature_selection import SelectKBest, chi2\n\npipe = Pipeline(\n [\n # the reduce_dim stage is populated by the param_grid\n (\"reduce_dim\", \"passthrough\"),\n (\"classify\", LinearSVC(dual=False, max_iter=10000)),\n ]\n)\n\nN_FEATURES_OPTIONS = [2, 4, 8]\nC_OPTIONS = [1, 10, 100, 1000]\nparam_grid = [\n {\n \"reduce_dim\": [PCA(iterated_power=7), NMF()],\n \"reduce_dim__n_components\": N_FEATURES_OPTIONS,\n \"classify__C\": C_OPTIONS,\n },\n {\n \"reduce_dim\": [SelectKBest(chi2)],\n \"reduce_dim__k\": N_FEATURES_OPTIONS,\n \"classify__C\": C_OPTIONS,\n },\n]\nreducer_labels = [\"PCA\", \"NMF\", \"KBest(chi2)\"]\n\ngrid = GridSearchCV(pipe, n_jobs=1, param_grid=param_grid)\nX, y = load_digits(return_X_y=True)\ngrid.fit(X, y)\n\nmean_scores = np.array(grid.cv_results_[\"mean_test_score\"])\n# scores are in the order of param_grid iteration, which is alphabetical\nmean_scores = mean_scores.reshape(len(C_OPTIONS), -1, len(N_FEATURES_OPTIONS))\n# select score for best C\nmean_scores = mean_scores.max(axis=0)\nbar_offsets = np.arange(len(N_FEATURES_OPTIONS)) * (len(reducer_labels) + 1) + 0.5\n\nplt.figure()\nCOLORS = \"bgrcmyk\"\nfor i, (label, reducer_scores) in enumerate(zip(reducer_labels, mean_scores)):\n plt.bar(bar_offsets + i, reducer_scores, label=label, color=COLORS[i])\n\nplt.title(\"Comparing feature reduction techniques\")\nplt.xlabel(\"Reduced number of features\")\nplt.xticks(bar_offsets + len(reducer_labels) / 2, N_FEATURES_OPTIONS)\nplt.ylabel(\"Digit classification accuracy\")\nplt.ylim((0, 1))\nplt.legend(loc=\"upper left\")\n\nplt.show()"
+"import numpy as np\nimport matplotlib.pyplot as plt\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import GridSearchCV\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.svm import LinearSVC\nfrom sklearn.decomposition import PCA, NMF\nfrom sklearn.feature_selection import SelectKBest, mutual_info_classif\nfrom sklearn.preprocessing import MinMaxScaler\n\nX, y = load_digits(return_X_y=True)\n\npipe = Pipeline(\n [\n (\"scaling\", MinMaxScaler()),\n # the reduce_dim stage is populated by the param_grid\n (\"reduce_dim\", \"passthrough\"),\n (\"classify\", LinearSVC(dual=False, max_iter=10000)),\n ]\n)\n\nN_FEATURES_OPTIONS = [2, 4, 8]\nC_OPTIONS = [1, 10, 100, 1000]\nparam_grid = [\n {\n \"reduce_dim\": [PCA(iterated_power=7), NMF(max_iter=1_000)],\n \"reduce_dim__n_components\": N_FEATURES_OPTIONS,\n \"classify__C\": C_OPTIONS,\n },\n {\n \"reduce_dim\": [SelectKBest(mutual_info_classif)],\n \"reduce_dim__k\": N_FEATURES_OPTIONS,\n \"classify__C\": C_OPTIONS,\n },\n]\nreducer_labels = [\"PCA\", \"NMF\", \"KBest(mutual_info_classif)\"]\n\ngrid = GridSearchCV(pipe, n_jobs=1, param_grid=param_grid)\ngrid.fit(X, y)"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"import pandas as pd\n\nmean_scores = np.array(grid.cv_results_[\"mean_test_score\"])\n# scores are in the order of param_grid iteration, which is alphabetical\nmean_scores = mean_scores.reshape(len(C_OPTIONS), -1, len(N_FEATURES_OPTIONS))\n# select score for best C\nmean_scores = mean_scores.max(axis=0)\n# create a dataframe to ease plotting\nmean_scores = pd.DataFrame(\n mean_scores.T, index=N_FEATURES_OPTIONS, columns=reducer_labels\n)\n\nax = mean_scores.plot.bar()\nax.set_title(\"Comparing feature reduction techniques\")\nax.set_xlabel(\"Reduced number of features\")\nax.set_ylabel(\"Digit classification accuracy\")\nax.set_ylim((0, 1))\nax.legend(loc=\"upper left\")\n\nplt.show()"
 ]
 },
 {
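
Two details worth noting in the rewritten cell: ``MinMaxScaler`` keeps the inputs non-negative, which ``NMF`` requires, and ``mutual_info_classif`` replaces ``chi2`` as the univariate score since it captures general dependence on the target. A trimmed, runnable sketch of the new pipeline structure (a much smaller grid than the example, so it finishes quickly):

# Trimmed sketch of the restructured search: one n_components value, default C.
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF, PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

pipe = Pipeline(
    [
        # MinMaxScaler keeps every feature in [0, 1]; NMF needs non-negative input.
        ("scaling", MinMaxScaler()),
        # the reduce_dim stage is populated by the param_grid below
        ("reduce_dim", "passthrough"),
        ("classify", LinearSVC(dual=False, max_iter=10000)),
    ]
)
param_grid = [
    {"reduce_dim": [PCA(iterated_power=7), NMF(max_iter=1_000)], "reduce_dim__n_components": [8]},
    {"reduce_dim": [SelectKBest(mutual_info_classif)], "reduce_dim__k": [8]},
]
grid = GridSearchCV(pipe, param_grid=param_grid, n_jobs=1)
grid.fit(X, y)
print(grid.best_params_)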
