|
33 | 33 | "cell_type": "markdown",
|
34 | 34 | "metadata": {},
|
35 | 35 | "source": [
|
36 |
| - "Use ``ColumnTransformer`` by selecting column by names\n##############################################################################\n We will train our classifier with the following features:\n\n Numeric Features:\n\n * ``age``: float;\n * ``fare``: float.\n\n Categorical Features:\n\n * ``embarked``: categories encoded as strings ``{'C', 'S', 'Q'}``;\n * ``sex``: categories encoded as strings ``{'female', 'male'}``;\n * ``pclass``: ordinal integers ``{1, 2, 3}``.\n\n We create the preprocessing pipelines for both numeric and categorical data.\n\n" |
| 36 | + "Use ``ColumnTransformer`` by selecting column by names\n##############################################################################\n We will train our classifier with the following features:\n\n Numeric Features:\n\n * ``age``: float;\n * ``fare``: float.\n\n Categorical Features:\n\n * ``embarked``: categories encoded as strings ``{'C', 'S', 'Q'}``;\n * ``sex``: categories encoded as strings ``{'female', 'male'}``;\n * ``pclass``: ordinal integers ``{1, 2, 3}``.\n\n We create the preprocessing pipelines for both numeric and categorical data.\n Note that ``pclass`` could either be treated as a categorical or numeric\n feature.\n\n" |
37 | 37 | ]
|
38 | 38 | },
|
39 | 39 | {
|
|
44 | 44 | },
|
45 | 45 | "outputs": [],
|
46 | 46 | "source": [
|
47 |
| - "numeric_features = ['age', 'fare']\nnumeric_transformer = Pipeline(steps=[\n ('imputer', SimpleImputer(strategy='median')),\n ('scaler', StandardScaler())])\n\ncategorical_features = ['embarked', 'sex', 'pclass']\ncategorical_transformer = Pipeline(steps=[\n ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),\n ('onehot', OneHotEncoder(handle_unknown='ignore'))])\n\npreprocessor = ColumnTransformer(\n transformers=[\n ('num', numeric_transformer, numeric_features),\n ('cat', categorical_transformer, categorical_features)])\n\n# Append classifier to preprocessing pipeline.\n# Now we have a full prediction pipeline.\nclf = Pipeline(steps=[('preprocessor', preprocessor),\n ('classifier', LogisticRegression())])\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n\nclf.fit(X_train, y_train)\nprint(\"model score: %.3f\" % clf.score(X_test, y_test))" |
| 47 | + "numeric_features = ['age', 'fare']\nnumeric_transformer = Pipeline(steps=[\n ('imputer', SimpleImputer(strategy='median')),\n ('scaler', StandardScaler())])\n\ncategorical_features = ['embarked', 'sex', 'pclass']\ncategorical_transformer = Pipeline(steps=[\n ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),\n ('onehot', OneHotEncoder(handle_unknown='ignore'))])\n\npreprocessor = ColumnTransformer(\n transformers=[\n ('num', numeric_transformer, numeric_features),\n ('cat', categorical_transformer, categorical_features)])\n\n# Append classifier to preprocessing pipeline.\n# Now we have a full prediction pipeline.\nclf = Pipeline(steps=[('preprocessor', preprocessor),\n ('classifier', LogisticRegression())])\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,\n random_state=0)\n\nclf.fit(X_train, y_train)\nprint(\"model score: %.3f\" % clf.score(X_test, y_test))" |
48 | 48 | ]
|
49 | 49 | },
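| + {
| + "cell_type": "markdown",
| + "metadata": {},
| + "source": [
| + "As noted above, ``pclass`` could instead be treated as a numeric feature.\nA minimal sketch of that alternative grouping (the ``_alt`` names are\nillustrative only):\n\n"
| + ]
| + },
| + {
| + "cell_type": "code",
| + "execution_count": null,
| + "metadata": {
| + "collapsed": false
| + },
| + "outputs": [],
| + "source": [
| + "# Sketch: dispatch ``pclass`` to the numeric pipeline instead of the\n# categorical one (the ``_alt`` names are illustrative).\nnumeric_features_alt = ['age', 'fare', 'pclass']\ncategorical_features_alt = ['embarked', 'sex']\n\npreprocessor_alt = ColumnTransformer(\n    transformers=[\n        ('num', numeric_transformer, numeric_features_alt),\n        ('cat', categorical_transformer, categorical_features_alt)])"
| + ]
| + },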
|
50 | 50 | {
|
|
62 | 62 | },
|
63 | 63 | "outputs": [],
|
64 | 64 | "source": [
|
65 |
| - "from sklearn import set_config\nset_config(display='diagram')\nclf" |
| 65 | + "from sklearn import set_config\n\nset_config(display='diagram')\nclf" |
66 | 66 | ]
|
67 | 67 | },
|
68 | 68 | {
|
|
80 | 80 | },
|
81 | 81 | "outputs": [],
|
82 | 82 | "source": [
|
83 |
| - "subset_feature = ['embarked', 'sex', 'pclass', 'age', 'fare']\nX = X[subset_feature]" |
| 83 | + "subset_feature = ['embarked', 'sex', 'pclass', 'age', 'fare']\nX_train, X_test = X_train[subset_feature], X_test[subset_feature]" |
84 | 84 | ]
|
85 | 85 | },
|
86 | 86 | {
|
|
98 | 98 | },
|
99 | 99 | "outputs": [],
|
100 | 100 | "source": [
|
101 |
| - "X.info()" |
| 101 | + "X_train.info()" |
102 | 102 | ]
|
103 | 103 | },
|
104 | 104 | {
|
|
123 | 123 | },
|
124 | 124 | "outputs": [],
|
125 | 125 | "source": [
|
126 |
| - "from sklearn.compose import make_column_selector as selector\n\npreprocessor = ColumnTransformer(transformers=[\n ('num', numeric_transformer, selector(dtype_exclude=\"category\")),\n ('cat', categorical_transformer, selector(dtype_include=\"category\"))\n])\n\n# Reproduce the identical fit/score process\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n\nclf.fit(X_train, y_train)\nprint(\"model score: %.3f\" % clf.score(X_test, y_test))" |
| 126 | + "from sklearn.compose import make_column_selector as selector\n\npreprocessor = ColumnTransformer(transformers=[\n ('num', numeric_transformer, selector(dtype_exclude=\"category\")),\n ('cat', categorical_transformer, selector(dtype_include=\"category\"))\n])\nclf = Pipeline(steps=[('preprocessor', preprocessor),\n ('classifier', LogisticRegression())])\n\n\nclf.fit(X_train, y_train)\nprint(\"model score: %.3f\" % clf.score(X_test, y_test))" |
| 127 | + ] |
| 128 | + }, |
| 129 | + { |
| 130 | + "cell_type": "markdown", |
| 131 | + "metadata": {}, |
| 132 | + "source": [ |
| 133 | + "The resulting score is not exactly the same as the one from the previous\npipeline becase the dtype-based selector treats the ``pclass`` columns as\na numeric features instead of a categorical feature as previously:\n\n" |
| 134 | + ] |
| 135 | + }, |
| 136 | + { |
| 137 | + "cell_type": "code", |
| 138 | + "execution_count": null, |
| 139 | + "metadata": { |
| 140 | + "collapsed": false |
| 141 | + }, |
| 142 | + "outputs": [], |
| 143 | + "source": [ |
| 144 | + "selector(dtype_exclude=\"category\")(X_train)" |
| 145 | + ] |
| 146 | + }, |
| 147 | + { |
| 148 | + "cell_type": "code", |
| 149 | + "execution_count": null, |
| 150 | + "metadata": { |
| 151 | + "collapsed": false |
| 152 | + }, |
| 153 | + "outputs": [], |
| 154 | + "source": [ |
| 155 | + "selector(dtype_include=\"category\")(X_train)" |
127 | 156 | ]
|
128 | 157 | },
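| + {
| + "cell_type": "markdown",
| + "metadata": {},
| + "source": [
| + "If the dtype-based dispatch should treat ``pclass`` as categorical, like\nthe explicit column lists above, the column can first be cast to a pandas\n``category`` dtype. A minimal sketch (``X_train_cat`` is an illustrative\nname):\n\n"
| + ]
| + },
| + {
| + "cell_type": "code",
| + "execution_count": null,
| + "metadata": {
| + "collapsed": false
| + },
| + "outputs": [],
| + "source": [
| + "# Sketch: casting ``pclass`` to a pandas ``category`` dtype makes the\n# dtype-based selector route it to the categorical pipeline.\nX_train_cat = X_train.copy()\nX_train_cat['pclass'] = X_train_cat['pclass'].astype('category')\nselector(dtype_include=\"category\")(X_train_cat)"
| + ]
| + },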
|
129 | 158 | {
|
|
141 | 170 | },
|
142 | 171 | "outputs": [],
|
143 | 172 | "source": [
|
144 |
| - "param_grid = {\n 'preprocessor__num__imputer__strategy': ['mean', 'median'],\n 'classifier__C': [0.1, 1.0, 10, 100],\n}\n\ngrid_search = GridSearchCV(clf, param_grid, cv=10)\ngrid_search.fit(X_train, y_train)\n\nprint((\"best logistic regression from grid search: %.3f\"\n % grid_search.score(X_test, y_test)))" |
| 173 | + "param_grid = {\n 'preprocessor__num__imputer__strategy': ['mean', 'median'],\n 'classifier__C': [0.1, 1.0, 10, 100],\n}\n\ngrid_search = GridSearchCV(clf, param_grid, cv=10)\ngrid_search" |
| 174 | + ] |
| 175 | + }, |
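| + {
| + "cell_type": "markdown",
| + "metadata": {},
| + "source": [
| + "The ``<step>__<parameter>`` syntax used in ``param_grid`` addresses the\nparameters of nested pipeline steps. Calling ``get_params`` on the pipeline\nis one way to list the valid names for such a grid (shown here as an\nillustrative aside):\n\n"
| + ]
| + },
| + {
| + "cell_type": "code",
| + "execution_count": null,
| + "metadata": {
| + "collapsed": false
| + },
| + "outputs": [],
| + "source": [
| + "# The double-underscore keys below are what ``param_grid`` refers to;\n# list the tunable parameter names of the full pipeline.\nsorted(clf.get_params().keys())"
| + ]
| + },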
| 176 | + { |
| 177 | + "cell_type": "markdown", |
| 178 | + "metadata": {}, |
| 179 | + "source": [ |
| 180 | + "Calling 'fit' triggers the cross-validated search for the best\nhyper-parameters combination:\n\n\n" |
| 181 | + ] |
| 182 | + }, |
| 183 | + { |
| 184 | + "cell_type": "code", |
| 185 | + "execution_count": null, |
| 186 | + "metadata": { |
| 187 | + "collapsed": false |
| 188 | + }, |
| 189 | + "outputs": [], |
| 190 | + "source": [ |
| 191 | + "grid_search.fit(X_train, y_train)\n\nprint(f\"Best params:\")\nprint(grid_search.best_params_)" |
| 192 | + ] |
| 193 | + }, |
| 194 | + { |
| 195 | + "cell_type": "markdown", |
| 196 | + "metadata": {}, |
| 197 | + "source": [ |
| 198 | + "The internal cross-validation scores obtained by those parameters is:\n\n" |
| 199 | + ] |
| 200 | + }, |
| 201 | + { |
| 202 | + "cell_type": "code", |
| 203 | + "execution_count": null, |
| 204 | + "metadata": { |
| 205 | + "collapsed": false |
| 206 | + }, |
| 207 | + "outputs": [], |
| 208 | + "source": [ |
| 209 | + "print(f\"Internal CV score: {grid_search.best_score_:.3f}\")" |
| 210 | + ] |
| 211 | + }, |
| 212 | + { |
| 213 | + "cell_type": "markdown", |
| 214 | + "metadata": {}, |
| 215 | + "source": [ |
| 216 | + "We can also introspect the top grid search results as a pandas dataframe:\n\n" |
| 217 | + ] |
| 218 | + }, |
| 219 | + { |
| 220 | + "cell_type": "code", |
| 221 | + "execution_count": null, |
| 222 | + "metadata": { |
| 223 | + "collapsed": false |
| 224 | + }, |
| 225 | + "outputs": [], |
| 226 | + "source": [ |
| 227 | + "import pandas as pd\n\ncv_results = pd.DataFrame(grid_search.cv_results_)\ncv_results = cv_results.sort_values(\"mean_test_score\", ascending=False)\ncv_results[[\"mean_test_score\", \"std_test_score\",\n \"param_preprocessor__num__imputer__strategy\",\n \"param_classifier__C\"\n ]].head(5)" |
| 228 | + ] |
| 229 | + }, |
| 230 | + { |
| 231 | + "cell_type": "markdown", |
| 232 | + "metadata": {}, |
| 233 | + "source": [ |
| 234 | + "The best hyper-parameters have be used to re-fit a final model on the full\ntraining set. We can evaluate that final model on held out test data that was\nnot used for hyparameter tuning.\n\n\n" |
| 235 | + ] |
| 236 | + }, |
| 237 | + { |
| 238 | + "cell_type": "code", |
| 239 | + "execution_count": null, |
| 240 | + "metadata": { |
| 241 | + "collapsed": false |
| 242 | + }, |
| 243 | + "outputs": [], |
| 244 | + "source": [ |
| 245 | + "print((\"best logistic regression from grid search: %.3f\"\n % grid_search.score(X_test, y_test)))" |
145 | 246 | ]
|
146 | 247 | }
|
147 | 248 | ],
|
|