
Commit 477424a

Pushing the docs to dev/ for branch: master, commit bd19a848578ebabe120313ac15db86efa9b133c4
1 parent 5179885 commit 477424a

File tree

1,072 files changed: +4168 additions, -3403 deletions

Binary file (5.18 KB): not shown.
Binary file (4 KB): not shown.
Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Column Transformer with Mixed Types\n\n\nThis example illustrates how to apply different preprocessing and\nfeature extraction pipelines to different subsets of features,\nusing :class:`sklearn.compose.ColumnTransformer`.\nThis is particularly handy for the case of datasets that contain\nheterogeneous data types, since we may want to scale the\nnumeric features and one-hot encode the categorical ones.\n\nIn this example, the numeric data is standard-scaled after\nmean-imputation, while the categorical data is one-hot\nencoded after imputing missing values with a new category\n(``'missing'``).\n\nFinally, the preprocessing pipeline is integrated in a\nfull prediction pipeline using :class:`sklearn.pipeline.Pipeline`,\ntogether with a simple classification model.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Author: Pedro Morales <[email protected]>\n#\n# License: BSD 3 clause\n\nfrom __future__ import print_function\n\nimport pandas as pd\n\nfrom sklearn.compose import make_column_transformer\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.preprocessing import StandardScaler, CategoricalEncoder\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import train_test_split, GridSearchCV\n\n\n# Read data from Titanic dataset.\ntitanic_url = ('https://raw.githubusercontent.com/amueller/'\n               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')\ndata = pd.read_csv(titanic_url)\n\n# We will train our classifier with the following features:\n# Numeric Features:\n# - age: float.\n# - fare: float.\n# Categorical Features:\n# - embarked: categories encoded as strings {'C', 'S', 'Q'}.\n# - sex: categories encoded as strings {'female', 'male'}.\n# - pclass: ordinal integers {1, 2, 3}.\nnumeric_features = ['age', 'fare']\ncategorical_features = ['embarked', 'sex', 'pclass']\n\n# Provisionally, use pd.fillna() to impute missing values for categorical\n# features; SimpleImputer will eventually support strategy=\"constant\".\ndata[categorical_features] = data[categorical_features].fillna(value='missing')\n\n# We create the preprocessing pipelines for both numeric and categorical data.\nnumeric_transformer = make_pipeline(SimpleImputer(), StandardScaler())\ncategorical_transformer = CategoricalEncoder('onehot-dense',\n                                             handle_unknown='ignore')\n\npreprocessing_pl = make_column_transformer(\n    (numeric_features, numeric_transformer),\n    (categorical_features, categorical_transformer),\n    remainder='drop'\n)\n\n# Append classifier to preprocessing pipeline.\n# Now we have a full prediction pipeline.\nclf = make_pipeline(preprocessing_pl, LogisticRegression())\n\nX = data.drop('survived', axis=1)\ny = data.survived.values\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,\n                                                    shuffle=True)\n\nclf.fit(X_train, y_train)\nprint(\"model score: %f\" % clf.score(X_test, y_test))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Using the prediction pipeline in a grid search\n##############################################################################\n Grid search can also be performed on the different preprocessing steps\n defined in the ``ColumnTransformer`` object, together with the classifier's\n hyperparameters as part of the ``Pipeline``.\n We will search for both the imputer strategy of the numeric preprocessing\n and the regularization parameter of the logistic regression using\n :class:`sklearn.model_selection.GridSearchCV`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "param_grid = {\n    'columntransformer__pipeline__simpleimputer__strategy': ['mean', 'median'],\n    'logisticregression__C': [0.1, 1.0, 10],\n}\n\ngrid_search = GridSearchCV(clf, param_grid, cv=10, iid=False)\ngrid_search.fit(X_train, y_train)\n\nprint((\"best logistic regression from grid search: %f\"\n       % grid_search.score(X_test, y_test)))"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.5"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
"""
===================================
Column Transformer with Mixed Types
===================================

This example illustrates how to apply different preprocessing and
feature extraction pipelines to different subsets of features,
using :class:`sklearn.compose.ColumnTransformer`.
This is particularly handy for the case of datasets that contain
heterogeneous data types, since we may want to scale the
numeric features and one-hot encode the categorical ones.

In this example, the numeric data is standard-scaled after
mean-imputation, while the categorical data is one-hot
encoded after imputing missing values with a new category
(``'missing'``).

Finally, the preprocessing pipeline is integrated in a
full prediction pipeline using :class:`sklearn.pipeline.Pipeline`,
together with a simple classification model.
"""

# Author: Pedro Morales <[email protected]>
#
# License: BSD 3 clause

from __future__ import print_function

import pandas as pd

from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, CategoricalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV


# Read data from Titanic dataset.
titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)

# We will train our classifier with the following features:
# Numeric Features:
# - age: float.
# - fare: float.
# Categorical Features:
# - embarked: categories encoded as strings {'C', 'S', 'Q'}.
# - sex: categories encoded as strings {'female', 'male'}.
# - pclass: ordinal integers {1, 2, 3}.
numeric_features = ['age', 'fare']
categorical_features = ['embarked', 'sex', 'pclass']

# Provisionally, use pd.fillna() to impute missing values for categorical
# features; SimpleImputer will eventually support strategy="constant".
data[categorical_features] = data[categorical_features].fillna(value='missing')

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_transformer = make_pipeline(SimpleImputer(), StandardScaler())
categorical_transformer = CategoricalEncoder('onehot-dense',
                                             handle_unknown='ignore')

preprocessing_pl = make_column_transformer(
    (numeric_features, numeric_transformer),
    (categorical_features, categorical_transformer),
    remainder='drop'
)

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = make_pipeline(preprocessing_pl, LogisticRegression())

X = data.drop('survived', axis=1)
y = data.survived.values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    shuffle=True)

clf.fit(X_train, y_train)
print("model score: %f" % clf.score(X_test, y_test))


###############################################################################
# Using the prediction pipeline in a grid search
###############################################################################
# Grid search can also be performed on the different preprocessing steps
# defined in the ``ColumnTransformer`` object, together with the classifier's
# hyperparameters as part of the ``Pipeline``.
# We will search for both the imputer strategy of the numeric preprocessing
# and the regularization parameter of the logistic regression using
# :class:`sklearn.model_selection.GridSearchCV`.


param_grid = {
    'columntransformer__pipeline__simpleimputer__strategy': ['mean', 'median'],
    'logisticregression__C': [0.1, 1.0, 10],
}

grid_search = GridSearchCV(clf, param_grid, cv=10, iid=False)
grid_search.fit(X_train, y_train)

print(("best logistic regression from grid search: %f"
       % grid_search.score(X_test, y_test)))
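
A note on the ``param_grid`` keys above: the double-underscore names follow the step names that ``make_pipeline`` and ``make_column_transformer`` assign automatically (the lowercased class name of each step, so the numeric ``Pipeline`` nested inside the column transformer is addressed as ``pipeline``). A quick sketch for listing every valid key, assuming the step-naming behavior of these helpers at this commit:

# List every tunable parameter of the composed estimator; any key used in
# param_grid must appear in this list. Assumes `clf` from the example script
# above is in scope.
for name in sorted(clf.get_params().keys()):
    print(name)
# Among others, this prints:
#   columntransformer__pipeline__simpleimputer__strategy
#   logisticregression__C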
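
The script also flags its ``pd.fillna()`` call as a provisional workaround until ``SimpleImputer`` supports ``strategy="constant"``. Once that strategy is available (it shipped in scikit-learn 0.20), the categorical imputation could move inside the pipeline itself. A hedged sketch, assuming ``SimpleImputer(strategy='constant', fill_value=...)`` accepts string data as in the released API:

# Hypothetical rewrite of the categorical preprocessing, assuming
# SimpleImputer(strategy='constant') is available: imputation then happens
# inside the pipeline, so the fillna() call on the raw dataframe is no
# longer needed.
categorical_transformer = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing'),
    CategoricalEncoder('onehot-dense', handle_unknown='ignore'),
)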

dev/_downloads/scikit-learn-docs.pdf (910 KB): binary file not shown.

dev/_images/iris.png (0 Bytes)
