"## Make pipeline to preprocess the data\n\n Before we can use Ames dataset we still need to do some preprocessing.\n First, the dataset has many missing values. To impute them, we will exchange\n categorical missing values with the new category 'missing' while the\nnumerical missing values with the 'mean' of the column. We will also encode\n the categories with either :class:`~sklearn.preprocessing.OneHotEncoder\n <sklearn.preprocessing.OneHotEncoder>` or\n :class:`~sklearn.preprocessing.OrdinalEncoder\n <sklearn.preprocessing.OrdinalEncoder>` depending for which type of model we\n will use them (linear or non-linear model). To facilitate this preprocessing\n we will make two pipelines.\n You can skip this section if your data is ready to use and does\n not need preprocessing\n\n"
+
"## Make pipeline to preprocess the data\n\n Before we can use Ames dataset we still need to do some preprocessing.\n First, we will select the categorical and numerical columns of the dataset to\nconstruct the first step of the pipeline.\n\n"
]
},
{
@@ -62,14 +62,94 @@
},
"outputs": [],
"source": [
-
"from sklearn.compose import make_column_transformer\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import OneHotEncoder\nfrom sklearn.preprocessing import OrdinalEncoder\nfrom sklearn.preprocessing import StandardScaler\n\n\ncat_cols = X.columns[X.dtypes == 'O']\nnum_cols = X.columns[X.dtypes == 'float64']\n\ncategories = [\n X[column].unique() for column in X[cat_cols]]\n\nfor cat in categories:\n cat[cat == None] = 'missing' # noqa\n\ncat_proc_nlin = make_pipeline(\n SimpleImputer(missing_values=None, strategy='constant',\n fill_value='missing'),\n OrdinalEncoder(categories=categories)\n )\n\nnum_proc_nlin = make_pipeline(SimpleImputer(strategy='mean'))\n\ncat_proc_lin = make_pipeline(\n SimpleImputer(missing_values=None,\n strategy='constant',\n fill_value='missing'),\n OneHotEncoder(categories=categories)\n)\n\nnum_proc_lin = make_pipeline(\n SimpleImputer(strategy='mean'),\n StandardScaler()\n)\n\n# transformation to use for non-linear estimators\nprocessor_nlin = make_column_transformer(\n (cat_proc_nlin, cat_cols),\n (num_proc_nlin, num_cols),\n remainder='passthrough')\n\n# transformation to use for linear estimators\nprocessor_lin = make_column_transformer(\n (cat_proc_lin, cat_cols),\n (num_proc_lin, num_cols),\n remainder='passthrough')"
"Then, we will need to design preprocessing pipelines which depends on the\nending regressor. If the ending regressor is a linear model, one needs to\none-hot encode the categories. If the ending regressor is a tree-based model\nan ordinal encoder will be sufficient. Besides, numerical values need to be\nstandardized for a linear model while the raw numerical data can be treated\nas is by a tree-based model. However, both models need an imputer to\nhandle missing values.\n\nWe will first design the pipeline required for the tree-based models.\n\n"
"## Stack of predictors on a single data set\n\n It is sometimes tedious to find the model which will best perform on a given\n dataset. Stacking provide an alternative by combining the outputs of several\n learners, without the need to choose a model specifically. The performance of\n stacking is usually close to the best model and sometimes it can outperform\n the prediction performance of each individual model.\n\n Here, we combine 3 learners (linear and non-linear) and use a ridge regressor\n to combine their outputs together.\n\n Note: although we will make new pipelines with the processors which we wrote\n in the previous section for the 3 learners, the final estimator RidgeCV()\n does not need preprocessing of the data as it will be fed with the already\n preprocessed output from the 3 learners.\n\n"
+
"## Stack of predictors on a single data set\n\n It is sometimes tedious to find the model which will best perform on a given\n dataset. Stacking provide an alternative by combining the outputs of several\n learners, without the need to choose a model specifically. The performance of\n stacking is usually close to the best model and sometimes it can outperform\n the prediction performance of each individual model.\n\n Here, we combine 3 learners (linear and non-linear) and use a ridge regressor\n to combine their outputs together.\n\n .. note::\n Although we will make new pipelines with the processors which we wrote in\n the previous section for the 3 learners, the final estimator\n :class:`~sklearn.linear_model.RidgeCV()` does not need preprocessing of\n the data as it will be fed with the already preprocessed output from the 3\n learners.\n\n"