Column Transformer with Mixed Types
===================================

-This example illustrates how to apply different preprocessing and
-feature extraction pipelines to different subsets of features,
-using :class:`sklearn.compose.ColumnTransformer`.
-This is particularly handy for the case of datasets that contain
-heterogeneous data types, since we may want to scale the
-numeric features and one-hot encode the categorical ones.
-
-In this example, the numeric data is standard-scaled after
-mean-imputation, while the categorical data is one-hot
-encoded after imputing missing values with a new category
-(``'missing'``).
-
-Finally, the preprocessing pipeline is integrated in a
-full prediction pipeline using :class:`sklearn.pipeline.Pipeline`,
-together with a simple classification model.
+This example illustrates how to apply different preprocessing and feature
+extraction pipelines to different subsets of features, using
+:class:`sklearn.compose.ColumnTransformer`. This is particularly handy for the
+case of datasets that contain heterogeneous data types, since we may want to
+scale the numeric features and one-hot encode the categorical ones.
+
+In this example, the numeric data is standard-scaled after mean-imputation,
+while the categorical data is one-hot encoded after imputing missing values
+with a new category (``'missing'``).
+
+In addition, we show two different ways to dispatch the columns to the
+corresponding preprocessor: by column names and by column data types.
+
+Finally, the preprocessing pipeline is integrated in a full prediction pipeline
+using :class:`sklearn.pipeline.Pipeline`, together with a simple classification
+model.
"""

# Author: Pedro Morales <[email protected]>

# X = titanic.frame.drop('survived', axis=1)
# y = titanic.frame['survived']

+###############################################################################
+# Use ``ColumnTransformer`` by selecting columns by name
+###############################################################################
# We will train our classifier with the following features:
+#
# Numeric Features:
-# - age: float.
-# - fare: float.
+#
+# * ``age``: float;
+# * ``fare``: float.
+#
# Categorical Features:
-# - embarked: categories encoded as strings {'C', 'S', 'Q'}.
-# - sex: categories encoded as strings {'female', 'male'}.
-# - pclass: ordinal integers {1, 2, 3}.
-
+#
+# * ``embarked``: categories encoded as strings ``{'C', 'S', 'Q'}``;
+# * ``sex``: categories encoded as strings ``{'female', 'male'}``;
+# * ``pclass``: ordinal integers ``{1, 2, 3}``.
+#
# We create the preprocessing pipelines for both numeric and categorical data.
+
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))

+###############################################################################
+# Use ``ColumnTransformer`` by selecting columns by data type
+###############################################################################
+# When dealing with a cleaned dataset, the preprocessing can be made automatic
+# by using the data types of the columns to decide whether to treat a column
+# as a numerical or a categorical feature.
+# :func:`sklearn.compose.make_column_selector` gives this possibility.
+# First, let's only select a subset of columns to simplify our example.
+
+subset_feature = ['embarked', 'sex', 'pclass', 'age', 'fare']
+X = X[subset_feature]
+
+###############################################################################
+# Then, we introspect the information regarding each column data type.
+
+X.info()
+
+###############################################################################
+# We can observe that the ``embarked`` and ``sex`` columns were tagged as
+# ``category`` columns when loading the data with ``fetch_openml``. Therefore,
+# we can use this information to dispatch the categorical columns to the
+# ``categorical_transformer`` and the remaining columns to the
+# ``numeric_transformer``.
+
+###############################################################################
+# .. note:: In practice, you will have to handle the column data types
+#    yourself. If you want some columns to be considered as ``category``, you
+#    will have to convert them into categorical columns. If you are using
+#    pandas, you can refer to their documentation regarding `Categorical data
+#    <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_.
+
+from sklearn.compose import make_column_selector as selector
+
+preprocessor = ColumnTransformer(transformers=[
+    ('num', numeric_transformer, selector(dtype_exclude="category")),
+    ('cat', categorical_transformer, selector(dtype_include="category"))
+])
+
+# Reproduce the identical fit/score process
+X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
+
+clf.fit(X_train, y_train)
+print("model score: %.3f" % clf.score(X_test, y_test))

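The dtype conversion mentioned in the note above can be sketched as follows; this is a minimal illustration with a hand-made frame (the column names mirror the Titanic ones, but the data is made up):

```python
import pandas as pd

# A small frame whose string column we want treated as categorical.
df = pd.DataFrame({
    'embarked': ['S', 'C', 'Q'],
    'age': [22.0, 38.0, 26.0],
})

# Convert the string column to the pandas 'category' dtype so that a
# dtype-based selector such as make_column_selector(dtype_include="category")
# would pick it up, while 'age' stays numeric.
df['embarked'] = df['embarked'].astype('category')

print(df.dtypes)
```

After the conversion, `df.dtypes` reports `category` for ``embarked`` and ``float64`` for ``age``, which is exactly the distinction the selector relies on.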
###############################################################################
# Using the prediction pipeline in a grid search

# and the regularization parameter of the logistic regression using
# :class:`sklearn.model_selection.GridSearchCV`.

-
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10, 100],
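The grid-search snippet is cut off above. A self-contained sketch of how such a search typically runs end to end is shown below; it rebuilds the pipeline on synthetic data (the column names and `param_grid` keys follow the example, everything else is illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the Titanic features used in the example.
rng = np.random.RandomState(0)
X = pd.DataFrame({
    'age': rng.uniform(1, 80, 100),
    'fare': rng.uniform(0, 500, 100),
    'sex': rng.choice(['female', 'male'], 100),
})
y = rng.randint(0, 2, 100)

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, ['age', 'fare']),
    ('cat', categorical_transformer, ['sex'])])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])

# Parameter names reach into nested steps via '__': pipeline step ->
# ColumnTransformer entry -> inner pipeline step -> parameter.
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10, 100],
}

grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(X, y)
print("best params:", grid_search.best_params_)
```

The `'step__substep__param'` naming is what lets the search tune the imputer strategy and the classifier's regularization strength jointly, refitting the whole preprocessing chain for each candidate.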