Commit c46b604

Pushing the docs to dev/ for branch: master, commit 244d118ef77a513b487b95b721179c395cbc1660
1 parent 262c828 commit c46b604

1,207 files changed

+4603
-4175
lines changed

Binary file not shown.

dev/_downloads/b5a4a1546e908b944c14370f9e7e2a25/plot_column_transformer_mixed_types.ipynb

Lines changed: 81 additions & 2 deletions
@@ -15,7 +15,7 @@
1515
"cell_type": "markdown",
1616
"metadata": {},
1717
"source": [
18-
"\n# Column Transformer with Mixed Types\n\n\nThis example illustrates how to apply different preprocessing and\nfeature extraction pipelines to different subsets of features,\nusing :class:`sklearn.compose.ColumnTransformer`.\nThis is particularly handy for the case of datasets that contain\nheterogeneous data types, since we may want to scale the\nnumeric features and one-hot encode the categorical ones.\n\nIn this example, the numeric data is standard-scaled after\nmean-imputation, while the categorical data is one-hot\nencoded after imputing missing values with a new category\n(``'missing'``).\n\nFinally, the preprocessing pipeline is integrated in a\nfull prediction pipeline using :class:`sklearn.pipeline.Pipeline`,\ntogether with a simple classification model.\n"
18+
"\n# Column Transformer with Mixed Types\n\n\nThis example illustrates how to apply different preprocessing and feature\nextraction pipelines to different subsets of features, using\n:class:`sklearn.compose.ColumnTransformer`. This is particularly handy for the\ncase of datasets that contain heterogeneous data types, since we may want to\nscale the numeric features and one-hot encode the categorical ones.\n\nIn this example, the numeric data is standard-scaled after mean-imputation,\nwhile the categorical data is one-hot encoded after imputing missing values\nwith a new category (``'missing'``).\n\nIn addition, we show two different ways to dispatch the columns to the\nparticular pre-processor: by column names and by column data types.\n\nFinally, the preprocessing pipeline is integrated in a full prediction pipeline\nusing :class:`sklearn.pipeline.Pipeline`, together with a simple classification\nmodel.\n"
1919
]
2020
},
2121
{
@@ -26,7 +26,86 @@
2626
},
2727
"outputs": [],
2828
"source": [
29-
"# Author: Pedro Morales <[email protected]>\n#\n# License: BSD 3 clause\n\nimport numpy as np\n\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.datasets import fetch_openml\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.preprocessing import StandardScaler, OneHotEncoder\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import train_test_split, GridSearchCV\n\nnp.random.seed(0)\n\n# Load data from https://www.openml.org/d/40945\nX, y = fetch_openml(\"titanic\", version=1, as_frame=True, return_X_y=True)\n\n# Alternatively X and y can be obtained directly from the frame attribute:\n# X = titanic.frame.drop('survived', axis=1)\n# y = titanic.frame['survived']\n\n# We will train our classifier with the following features:\n# Numeric Features:\n# - age: float.\n# - fare: float.\n# Categorical Features:\n# - embarked: categories encoded as strings {'C', 'S', 'Q'}.\n# - sex: categories encoded as strings {'female', 'male'}.\n# - pclass: ordinal integers {1, 2, 3}.\n\n# We create the preprocessing pipelines for both numeric and categorical data.\nnumeric_features = ['age', 'fare']\nnumeric_transformer = Pipeline(steps=[\n ('imputer', SimpleImputer(strategy='median')),\n ('scaler', StandardScaler())])\n\ncategorical_features = ['embarked', 'sex', 'pclass']\ncategorical_transformer = Pipeline(steps=[\n ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),\n ('onehot', OneHotEncoder(handle_unknown='ignore'))])\n\npreprocessor = ColumnTransformer(\n transformers=[\n ('num', numeric_transformer, numeric_features),\n ('cat', categorical_transformer, categorical_features)])\n\n# Append classifier to preprocessing pipeline.\n# Now we have a full prediction pipeline.\nclf = Pipeline(steps=[('preprocessor', preprocessor),\n ('classifier', LogisticRegression())])\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n\nclf.fit(X_train, y_train)\nprint(\"model score: %.3f\" % clf.score(X_test, y_test))"
29+
"# Author: Pedro Morales <[email protected]>\n#\n# License: BSD 3 clause\n\nimport numpy as np\n\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.datasets import fetch_openml\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.preprocessing import StandardScaler, OneHotEncoder\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import train_test_split, GridSearchCV\n\nnp.random.seed(0)\n\n# Load data from https://www.openml.org/d/40945\nX, y = fetch_openml(\"titanic\", version=1, as_frame=True, return_X_y=True)\n\n# Alternatively X and y can be obtained directly from the frame attribute:\n# X = titanic.frame.drop('survived', axis=1)\n# y = titanic.frame['survived']"
30+
]
31+
},
32+
{
33+
"cell_type": "markdown",
34+
"metadata": {},
35+
"source": [
36+
"Use ``ColumnTransformer`` by selecting column by names\n##############################################################################\n We will train our classifier with the following features:\n\n Numeric Features:\n\n * ``age``: float;\n * ``fare``: float.\n\n Categorical Features:\n\n * ``embarked``: categories encoded as strings ``{'C', 'S', 'Q'}``;\n * ``sex``: categories encoded as strings ``{'female', 'male'}``;\n * ``pclass``: ordinal integers ``{1, 2, 3}``.\n\n We create the preprocessing pipelines for both numeric and categorical data.\n\n"
37+
]
38+
},
39+
{
40+
"cell_type": "code",
41+
"execution_count": null,
42+
"metadata": {
43+
"collapsed": false
44+
},
45+
"outputs": [],
46+
"source": [
47+
"numeric_features = ['age', 'fare']\nnumeric_transformer = Pipeline(steps=[\n ('imputer', SimpleImputer(strategy='median')),\n ('scaler', StandardScaler())])\n\ncategorical_features = ['embarked', 'sex', 'pclass']\ncategorical_transformer = Pipeline(steps=[\n ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),\n ('onehot', OneHotEncoder(handle_unknown='ignore'))])\n\npreprocessor = ColumnTransformer(\n transformers=[\n ('num', numeric_transformer, numeric_features),\n ('cat', categorical_transformer, categorical_features)])\n\n# Append classifier to preprocessing pipeline.\n# Now we have a full prediction pipeline.\nclf = Pipeline(steps=[('preprocessor', preprocessor),\n ('classifier', LogisticRegression())])\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n\nclf.fit(X_train, y_train)\nprint(\"model score: %.3f\" % clf.score(X_test, y_test))"
48+
]
49+
},
50+
{
51+
"cell_type": "markdown",
52+
"metadata": {},
53+
"source": [
54+
"Use ``ColumnTransformer`` by selecting column by data types\n##############################################################################\n When dealing with a cleaned dataset, the preprocessing can be automatic by\n using the data types of the column to decide whether to treat a column as a\n numerical or categorical feature.\n :func:`sklearn.compose.make_column_selector` gives this possibility.\n First, let's only select a subset of columns to simplify our\n example.\n\n"
55+
]
56+
},
57+
{
58+
"cell_type": "code",
59+
"execution_count": null,
60+
"metadata": {
61+
"collapsed": false
62+
},
63+
"outputs": [],
64+
"source": [
65+
"subset_feature = ['embarked', 'sex', 'pclass', 'age', 'fare']\nX = X[subset_feature]"
66+
]
67+
},
68+
{
69+
"cell_type": "markdown",
70+
"metadata": {},
71+
"source": [
72+
"Then, we introspect the information regarding each column data type.\n\n"
73+
]
74+
},
75+
{
76+
"cell_type": "code",
77+
"execution_count": null,
78+
"metadata": {
79+
"collapsed": false
80+
},
81+
"outputs": [],
82+
"source": [
83+
"X.info()"
84+
]
85+
},
86+
{
87+
"cell_type": "markdown",
88+
"metadata": {},
89+
"source": [
90+
"We can observe that the `embarked` and `sex` columns were tagged as\n`category` columns when loading the data with ``fetch_openml``. Therefore, we\ncan use this information to dispatch the categorical columns to the\n``categorical_transformer`` and the remaining columns to the\n``numerical_transformer``.\n\n"
91+
]
92+
},
93+
{
94+
"cell_type": "markdown",
95+
"metadata": {},
96+
"source": [
97+
"<div class=\"alert alert-info\"><h4>Note</h4><p>In practice, you will have to handle yourself the column data type.\n If you want some columns to be considered as `category`, you will have to\n convert them into categorical columns. If you are using pandas, you can\n refer to their documentation regarding `Categorical data\n <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_.</p></div>\n\n"
98+
]
99+
},
100+
{
101+
"cell_type": "code",
102+
"execution_count": null,
103+
"metadata": {
104+
"collapsed": false
105+
},
106+
"outputs": [],
107+
"source": [
108+
"from sklearn.compose import make_column_selector as selector\n\npreprocessor = ColumnTransformer(transformers=[\n ('num', numeric_transformer, selector(dtype_exclude=\"category\")),\n ('cat', categorical_transformer, selector(dtype_include=\"category\"))\n])\n\n# Reproduce the identical fit/score process\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n\nclf.fit(X_train, y_train)\nprint(\"model score: %.3f\" % clf.score(X_test, y_test))"
30109
]
31110
},
32111
{
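Reviewer sketch (not part of the commit): the dtype-based dispatch added in the notebook cells above can be checked on a small frame. The toy data and the names `X_toy`, `y_toy`, `cat_selector`, and `num_selector` are hypothetical stand-ins for the Titanic columns. Two points worth noticing: an integer `pclass` column lands in the numeric branch under dtype selection, whereas the name-based variant lists it as categorical; and rebinding the name `preprocessor` does not change a `Pipeline` that was already built from the earlier object, so this sketch constructs a fresh `clf` around the new preprocessor.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the Titanic subset used in the example.
X_toy = pd.DataFrame({
    "embarked": pd.Series(["C", "S", "Q", "S", "C", "S"], dtype="category"),
    "sex": pd.Series(["female", "male", "male", "female", "male", "female"],
                     dtype="category"),
    "pclass": [1, 3, 2, 3, 1, 2],
    "age": [29.0, float("nan"), 40.0, 18.0, 52.0, 33.0],
    "fare": [211.3, 7.9, 13.0, 8.05, 80.0, 26.0],
})
y_toy = pd.Series([1, 0, 0, 1, 0, 1])

# A selector is a callable: given a DataFrame, it returns matching column names.
cat_selector = selector(dtype_include="category")
num_selector = selector(dtype_exclude="category")
categorical_columns = cat_selector(X_toy)
numeric_columns = num_selector(X_toy)
print(categorical_columns, numeric_columns)

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, num_selector),
    ("cat", categorical_transformer, cat_selector)])

# Build a new pipeline so that the dtype-based preprocessor is the one used.
clf = Pipeline(steps=[("preprocessor", preprocessor),
                      ("classifier", LogisticRegression())])
clf.fit(X_toy, y_toy)
predictions = clf.predict(X_toy)
```

If `pclass` should be one-hot encoded under dtype selection as well, it would first have to be converted to a pandas categorical column, as the note in the diff suggests.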
Binary file not shown.

dev/_downloads/ec7916875965bf7f54b7cfe8e6dc4cc2/plot_column_transformer_mixed_types.py

Lines changed: 74 additions & 22 deletions
@@ -3,21 +3,22 @@
33
Column Transformer with Mixed Types
44
===================================
55
6-
This example illustrates how to apply different preprocessing and
7-
feature extraction pipelines to different subsets of features,
8-
using :class:`sklearn.compose.ColumnTransformer`.
9-
This is particularly handy for the case of datasets that contain
10-
heterogeneous data types, since we may want to scale the
11-
numeric features and one-hot encode the categorical ones.
12-
13-
In this example, the numeric data is standard-scaled after
14-
mean-imputation, while the categorical data is one-hot
15-
encoded after imputing missing values with a new category
16-
(``'missing'``).
17-
18-
Finally, the preprocessing pipeline is integrated in a
19-
full prediction pipeline using :class:`sklearn.pipeline.Pipeline`,
20-
together with a simple classification model.
6+
This example illustrates how to apply different preprocessing and feature
7+
extraction pipelines to different subsets of features, using
8+
:class:`sklearn.compose.ColumnTransformer`. This is particularly handy for the
9+
case of datasets that contain heterogeneous data types, since we may want to
10+
scale the numeric features and one-hot encode the categorical ones.
11+
12+
In this example, the numeric data is standard-scaled after mean-imputation,
13+
while the categorical data is one-hot encoded after imputing missing values
14+
with a new category (``'missing'``).
15+
16+
In addition, we show two different ways to dispatch the columns to the
17+
particular pre-processor: by column names and by column data types.
18+
19+
Finally, the preprocessing pipeline is integrated in a full prediction pipeline
20+
using :class:`sklearn.pipeline.Pipeline`, together with a simple classification
21+
model.
2122
"""
2223

2324
# Author: Pedro Morales <[email protected]>
@@ -43,16 +44,24 @@
4344
# X = titanic.frame.drop('survived', axis=1)
4445
# y = titanic.frame['survived']
4546

47+
###############################################################################
48+
# Use ``ColumnTransformer`` by selecting column by names
49+
###############################################################################
4650
# We will train our classifier with the following features:
51+
#
4752
# Numeric Features:
48-
# - age: float.
49-
# - fare: float.
53+
#
54+
# * ``age``: float;
55+
# * ``fare``: float.
56+
#
5057
# Categorical Features:
51-
# - embarked: categories encoded as strings {'C', 'S', 'Q'}.
52-
# - sex: categories encoded as strings {'female', 'male'}.
53-
# - pclass: ordinal integers {1, 2, 3}.
54-
58+
#
59+
# * ``embarked``: categories encoded as strings ``{'C', 'S', 'Q'}``;
60+
# * ``sex``: categories encoded as strings ``{'female', 'male'}``;
61+
# * ``pclass``: ordinal integers ``{1, 2, 3}``.
62+
#
5563
# We create the preprocessing pipelines for both numeric and categorical data.
64+
5665
numeric_features = ['age', 'fare']
5766
numeric_transformer = Pipeline(steps=[
5867
('imputer', SimpleImputer(strategy='median')),
@@ -78,6 +87,50 @@
7887
clf.fit(X_train, y_train)
7988
print("model score: %.3f" % clf.score(X_test, y_test))
8089

90+
###############################################################################
91+
# Use ``ColumnTransformer`` by selecting column by data types
92+
###############################################################################
93+
# When dealing with a cleaned dataset, the preprocessing can be automatic by
94+
# using the data types of the column to decide whether to treat a column as a
95+
# numerical or categorical feature.
96+
# :func:`sklearn.compose.make_column_selector` gives this possibility.
97+
# First, let's only select a subset of columns to simplify our
98+
# example.
99+
100+
subset_feature = ['embarked', 'sex', 'pclass', 'age', 'fare']
101+
X = X[subset_feature]
102+
103+
###############################################################################
104+
# Then, we introspect the information regarding each column data type.
105+
106+
X.info()
107+
108+
###############################################################################
109+
# We can observe that the `embarked` and `sex` columns were tagged as
110+
# `category` columns when loading the data with ``fetch_openml``. Therefore, we
111+
# can use this information to dispatch the categorical columns to the
112+
# ``categorical_transformer`` and the remaining columns to the
113+
# ``numerical_transformer``.
114+
115+
###############################################################################
116+
# .. note:: In practice, you will have to handle yourself the column data type.
117+
# If you want some columns to be considered as `category`, you will have to
118+
# convert them into categorical columns. If you are using pandas, you can
119+
# refer to their documentation regarding `Categorical data
120+
# <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_.
121+
122+
from sklearn.compose import make_column_selector as selector
123+
124+
preprocessor = ColumnTransformer(transformers=[
125+
('num', numeric_transformer, selector(dtype_exclude="category")),
126+
('cat', categorical_transformer, selector(dtype_include="category"))
127+
])
128+
129+
# Reproduce the identical fit/score process
130+
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
131+
132+
clf.fit(X_train, y_train)
133+
print("model score: %.3f" % clf.score(X_test, y_test))
81134

82135
###############################################################################
83136
# Using the prediction pipeline in a grid search
@@ -89,7 +142,6 @@
89142
# and the regularization parameter of the logistic regression using
90143
# :class:`sklearn.model_selection.GridSearchCV`.
91144

92-
93145
param_grid = {
94146
'preprocessor__num__imputer__strategy': ['mean', 'median'],
95147
'classifier__C': [0.1, 1.0, 10, 100],
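Reviewer sketch (not part of the commit): the keys in `param_grid` follow scikit-learn's nested-parameter convention, where step names are joined with double underscores. The snippet below rebuilds the name-based pipeline from this example, without fitting it, and inspects those keys through `get_params` and `set_params`. The variable names `strategy_key`, `c_key`, `default_strategy`, and `updated_strategy` are illustrative only.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, ["age", "fare"]),
    ("cat", categorical_transformer, ["embarked", "sex", "pclass"])])

clf = Pipeline(steps=[("preprocessor", preprocessor),
                      ("classifier", LogisticRegression())])

# Path: pipeline step -> ColumnTransformer entry -> inner pipeline step -> parameter.
strategy_key = "preprocessor__num__imputer__strategy"
c_key = "classifier__C"

default_strategy = clf.get_params()[strategy_key]

# A grid search sets each candidate combination through the same mechanism.
clf.set_params(**{strategy_key: "mean", c_key: 0.1})
updated_strategy = clf.get_params()[strategy_key]
print(default_strategy, "->", updated_strategy)
```

Listing `sorted(clf.get_params())` is a quick way to find the exact key to use before writing a grid.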

dev/_downloads/scikit-learn-docs.pdf

1.89 KB
Binary file not shown.

dev/_images/iris.png

0 Bytes
