Commit bcfffd9

Pushing the docs to dev/ for branch: master, commit 26a1027a830b7b658782229d936ffb84a340caad
1 parent 4021f79 commit bcfffd9

977 files changed: +5901 −3788 lines changed

11.6 KB (binary file not shown)
9.77 KB (binary file not shown)

dev/_downloads/plot_all_scaling.ipynb

Lines changed: 216 additions & 0 deletions
@@ -0,0 +1,216 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Compare the effect of different scalers on data with outliers\n\n\nFeature 0 (median income in a block) and feature 5 (number of households) of\nthe `California housing dataset\n<http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html>`_ have very\ndifferent scales and contain some very large outliers. These two\ncharacteristics make the data difficult to visualize and, more importantly,\nthey can degrade the predictive performance of many machine learning\nalgorithms. Unscaled data can also slow down or even prevent the convergence\nof many gradient-based estimators.\n\nIndeed, many estimators are designed with the assumption that each feature\ntakes values close to zero or, more importantly, that all features vary on\ncomparable scales. In particular, metric-based and gradient-based estimators\noften assume approximately standardized data (centered features with unit\nvariances). A notable exception is decision tree-based estimators, which are\nrobust to arbitrary scaling of the data.\n\nThis example uses different scalers, transformers, and normalizers to bring\nthe data within a pre-defined range.\n\nScalers are linear (or more precisely affine) transformers that differ from\neach other in how they estimate the parameters used to shift and scale each\nfeature.\n\n``QuantileTransformer`` provides a non-linear transformation in which the\ndistances between marginal outliers and inliers are shrunk.\n\nUnlike the previous transformations, normalization refers to a per-sample\ntransformation instead of a per-feature transformation.\n\nThe following code is a bit verbose; feel free to jump directly to the\nanalysis of the results_.\n\n"
]
},
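{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal numerical sketch, added for illustration (not part of the original\nexample): the shift and scale statistics estimated by a fitted scaler can be\ninspected directly, which makes the difference between scalers on data with\nan outlier easy to see. The toy array below is hypothetical.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustrative sketch (not part of the original example): scalers are affine\n# transformers that differ only in the shift/scale statistics they estimate.\n# Toy data with one large outlier, not the housing dataset.\nimport numpy as np\nfrom sklearn.preprocessing import StandardScaler, RobustScaler\n\nX_toy = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])\n\nstd = StandardScaler().fit(X_toy)\nrob = RobustScaler().fit(X_toy)\n\nprint('StandardScaler shift (mean):', std.mean_, 'scale (std):', std.scale_)\nprint('RobustScaler shift (median):', rob.center_, 'scale (IQR):', rob.scale_)"
]
},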
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Author: Raghav RV <[email protected]>\n# Guillaume Lemaitre <[email protected]>\n# Thomas Unterthiner\n# License: BSD 3 clause\n\nfrom __future__ import print_function\n\nimport numpy as np\n\nimport matplotlib as mpl\nfrom matplotlib import pyplot as plt\nfrom matplotlib import cm\n\nfrom sklearn.preprocessing import MinMaxScaler\nfrom sklearn.preprocessing import minmax_scale\nfrom sklearn.preprocessing import MaxAbsScaler\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.preprocessing import RobustScaler\nfrom sklearn.preprocessing import Normalizer\nfrom sklearn.preprocessing import QuantileTransformer\n\nfrom sklearn.datasets import fetch_california_housing\n\nprint(__doc__)\n\ndataset = fetch_california_housing()\nX_full, y_full = dataset.data, dataset.target\n\n# Take only 2 features to make visualization easier\n# Feature 0 has a long tail distribution.\n# Feature 5 has a few very large outliers.\n\nX = X_full[:, [0, 5]]\n\ndistributions = [\n    ('Unscaled data', X),\n    ('Data after standard scaling',\n        StandardScaler().fit_transform(X)),\n    ('Data after min-max scaling',\n        MinMaxScaler().fit_transform(X)),\n    ('Data after max-abs scaling',\n        MaxAbsScaler().fit_transform(X)),\n    ('Data after robust scaling',\n        RobustScaler(quantile_range=(25, 75)).fit_transform(X)),\n    ('Data after quantile transformation (uniform pdf)',\n        QuantileTransformer(output_distribution='uniform')\n        .fit_transform(X)),\n    ('Data after quantile transformation (gaussian pdf)',\n        QuantileTransformer(output_distribution='normal')\n        .fit_transform(X)),\n    ('Data after sample-wise L2 normalizing',\n        Normalizer().fit_transform(X))\n]\n\n# scale the output between 0 and 1 for the colorbar\ny = minmax_scale(y_full)\n\n\ndef create_axes(title, figsize=(16, 6)):\n    fig = plt.figure(figsize=figsize)\n    fig.suptitle(title)\n\n    # define the axis for the first plot\n    left, width = 0.1, 0.22\n    bottom, height = 0.1, 0.7\n    bottom_h = height + 0.15\n    left_h = left + width + 0.02\n\n    rect_scatter = [left, bottom, width, height]\n    rect_histx = [left, bottom_h, width, 0.1]\n    rect_histy = [left_h, bottom, 0.05, height]\n\n    ax_scatter = plt.axes(rect_scatter)\n    ax_histx = plt.axes(rect_histx)\n    ax_histy = plt.axes(rect_histy)\n\n    # define the axis for the zoomed-in plot\n    left = width + left + 0.2\n    left_h = left + width + 0.02\n\n    rect_scatter = [left, bottom, width, height]\n    rect_histx = [left, bottom_h, width, 0.1]\n    rect_histy = [left_h, bottom, 0.05, height]\n\n    ax_scatter_zoom = plt.axes(rect_scatter)\n    ax_histx_zoom = plt.axes(rect_histx)\n    ax_histy_zoom = plt.axes(rect_histy)\n\n    # define the axis for the colorbar\n    left, width = width + left + 0.13, 0.01\n\n    rect_colorbar = [left, bottom, width, height]\n    ax_colorbar = plt.axes(rect_colorbar)\n\n    return ((ax_scatter, ax_histy, ax_histx),\n            (ax_scatter_zoom, ax_histy_zoom, ax_histx_zoom),\n            ax_colorbar)\n\n\ndef plot_distribution(axes, X, y, hist_nbins=50, title=\"\",\n                      x0_label=\"\", x1_label=\"\"):\n    ax, hist_X1, hist_X0 = axes\n\n    ax.set_title(title)\n    ax.set_xlabel(x0_label)\n    ax.set_ylabel(x1_label)\n\n    # The scatter plot\n    colors = cm.plasma_r(y)\n    ax.scatter(X[:, 0], X[:, 1], alpha=0.5, marker='o', s=5, lw=0, c=colors)\n\n    # Remove the top and the right spine for aesthetics\n    # and make a nice axis layout\n    ax.spines['top'].set_visible(False)\n    ax.spines['right'].set_visible(False)\n    ax.get_xaxis().tick_bottom()\n    ax.get_yaxis().tick_left()\n    ax.spines['left'].set_position(('outward', 10))\n    ax.spines['bottom'].set_position(('outward', 10))\n\n    # Histogram for axis X1 (feature 5)\n    hist_X1.set_ylim(ax.get_ylim())\n    hist_X1.hist(X[:, 1], bins=hist_nbins, orientation='horizontal',\n                 color='grey', ec='grey')\n    hist_X1.axis('off')\n\n    # Histogram for axis X0 (feature 0)\n    hist_X0.set_xlim(ax.get_xlim())\n    hist_X0.hist(X[:, 0], bins=hist_nbins, orientation='vertical',\n                 color='grey', ec='grey')\n    hist_X0.axis('off')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two plots will be shown for each scaler/normalizer/transformer. The left\nfigure shows a scatter plot of the full data set, while the right figure\nconsiders only the central 99% of the data set, excluding the marginal\noutliers. In addition, the marginal distributions for each feature are shown\nalongside the scatter plot.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def make_plot(item_idx):\n    title, X = distributions[item_idx]\n    ax_zoom_out, ax_zoom_in, ax_colorbar = create_axes(title)\n    axarr = (ax_zoom_out, ax_zoom_in)\n    plot_distribution(axarr[0], X, y, hist_nbins=200,\n                      x0_label=\"Median Income\",\n                      x1_label=\"Number of households\",\n                      title=\"Full data\")\n\n    # zoom-in\n    zoom_in_percentile_range = (0, 99)\n    cutoffs_X0 = np.percentile(X[:, 0], zoom_in_percentile_range)\n    cutoffs_X1 = np.percentile(X[:, 1], zoom_in_percentile_range)\n\n    non_outliers_mask = (\n        np.all(X > [cutoffs_X0[0], cutoffs_X1[0]], axis=1) &\n        np.all(X < [cutoffs_X0[1], cutoffs_X1[1]], axis=1))\n    plot_distribution(axarr[1], X[non_outliers_mask], y[non_outliers_mask],\n                      hist_nbins=50,\n                      x0_label=\"Median Income\",\n                      x1_label=\"Number of households\",\n                      title=\"Zoom-in\")\n\n    norm = mpl.colors.Normalize(y_full.min(), y_full.max())\n    mpl.colorbar.ColorbarBase(ax_colorbar, cmap=cm.plasma_r,\n                              norm=norm, orientation='vertical',\n                              label='Color mapping for values of y')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\nOriginal data\n-------------\n\nEach transformation is plotted showing two transformed features, with the\nleft plot showing the entire dataset and the right plot zoomed in to show the\ndataset without the marginal outliers. A large majority of the samples are\ncompacted into a small range, [0, 10] for the median income and [0, 6] for\nthe number of households. Note that there are some marginal outliers (some\nblocks have more than 1200 households). Therefore, a specific pre-processing\ncan be very beneficial depending on the application. In the following, we\npresent some insights into the behavior of those pre-processing methods in\nthe presence of marginal outliers.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"make_plot(0)"
]
},
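{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick numerical check, added for illustration (not part of the original\nexample): the raw ranges of the two selected features can be summarized\ndirectly. This relies on ``X`` as defined in the setup code above.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustrative check (not part of the original example): summarize the raw\n# ranges of the two selected features; X comes from the setup cell above.\nimport numpy as np\n\nfor name, col in (('Median income', X[:, 0]),\n                  ('Number of households', X[:, 1])):\n    print('%s: min=%.1f, 99th percentile=%.1f, max=%.1f'\n          % (name, col.min(), np.percentile(col, 99), col.max()))"
]
},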
{
"cell_type": "markdown",
"metadata": {},
"source": [
"StandardScaler\n--------------\n\n``StandardScaler`` removes the mean and scales the data to unit variance.\nHowever, the outliers have an influence when computing the empirical mean and\nstandard deviation, which shrinks the range of the feature values as shown in\nthe left figure below. Note in particular that because the outliers on each\nfeature have different magnitudes, the spread of the transformed data on\neach feature is very different: most of the data lie in the [-2, 4] range for\nthe transformed median income feature while the same data is squeezed in the\nsmaller [-0.2, 0.2] range for the transformed number of households.\n\n``StandardScaler`` therefore cannot guarantee balanced feature scales in the\npresence of outliers.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"make_plot(1)"
]
},
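{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch, added for illustration (not part of the original example),\nof how a single large outlier inflates the empirical standard deviation and\nsqueezes the inliers after standard scaling. The toy arrays are hypothetical.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustrative sketch (not part of the original example): a single large\n# outlier inflates the standard deviation, squeezing the inliers after\n# standard scaling. Toy data, not the housing dataset.\nimport numpy as np\nfrom sklearn.preprocessing import StandardScaler\n\ninliers = np.arange(1.0, 6.0).reshape(-1, 1)  # values 1..5\nwith_outlier = np.vstack([inliers, [[1000.0]]])\n\nprint('std without outlier:', inliers.std())\nprint('std with outlier:', with_outlier.std())\nprint('scaled inliers:',\n      StandardScaler().fit_transform(with_outlier)[:5].ravel())"
]
},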
{
"cell_type": "markdown",
"metadata": {},
"source": [
"MinMaxScaler\n------------\n\n``MinMaxScaler`` rescales the data set such that all feature values are in\nthe range [0, 1], as shown in the right panel below. However, this scaling\ncompresses all the inliers into the narrow range [0, 0.005] for the\ntransformed number of households.\n\nLike ``StandardScaler``, ``MinMaxScaler`` is very sensitive to the presence\nof outliers.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"make_plot(2)"
]
},
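{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch, added for illustration (not part of the original example):\n``MinMaxScaler`` maps each feature through (x - min) / (max - min), so a\nsingle extreme value compresses all the inliers towards 0. The toy array is\nhypothetical.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustrative sketch (not part of the original example): MinMaxScaler maps\n# (x - min) / (max - min) per feature, so one extreme value compresses the\n# inliers towards 0. Toy data, not the housing dataset.\nimport numpy as np\nfrom sklearn.preprocessing import MinMaxScaler\n\nX_toy = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])\nprint(MinMaxScaler().fit_transform(X_toy).ravel())\n# the inliers end up in roughly [0, 0.003]; only the outlier reaches 1.0"
]
},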
{
"cell_type": "markdown",
"metadata": {},
"source": [
"MaxAbsScaler\n------------\n\n``MaxAbsScaler`` differs from the previous scaler in that the absolute\nvalues are mapped to the range [0, 1]. On positive-only data, this scaler\nbehaves similarly to ``MinMaxScaler`` and therefore also suffers from the\npresence of large outliers.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"make_plot(3)"
]
},
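{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch, added for illustration (not part of the original example):\n``MaxAbsScaler`` divides each feature by its maximum absolute value, so on\npositive-only data it behaves much like ``MinMaxScaler``. The toy array is\nhypothetical.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustrative sketch (not part of the original example): MaxAbsScaler\n# divides each feature by its maximum absolute value, so on positive-only\n# data it is as sensitive to a single large outlier as MinMaxScaler.\nimport numpy as np\nfrom sklearn.preprocessing import MaxAbsScaler\n\nX_toy = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])\nprint(MaxAbsScaler().fit_transform(X_toy).ravel())\n# [0.001 0.002 0.003 0.004 1.   ]"
]
},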
{
"cell_type": "markdown",
"metadata": {},
"source": [
"RobustScaler\n------------\n\nUnlike the previous scalers, the centering and scaling statistics of this\nscaler are based on percentiles and are therefore not influenced by a small\nnumber of very large marginal outliers. Consequently, the resulting range of\nthe transformed feature values is larger than for the previous scalers and,\nmore importantly, approximately the same for both features: most of the\ntransformed values lie in a [-2, 3] range as seen in the zoomed-in figure.\nNote that the outliers themselves are still present in the transformed data.\nIf a separate outlier clipping is desirable, a non-linear transformation is\nrequired (see below).\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"make_plot(4)"
]
},
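{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch, added for illustration (not part of the original example):\nthe median and interquartile range used by ``RobustScaler`` barely move when\na large outlier is added, unlike the mean and standard deviation. The toy\narrays are hypothetical.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustrative sketch (not part of the original example): the median and IQR\n# used by RobustScaler barely move when a large outlier is added.\nimport numpy as np\nfrom sklearn.preprocessing import RobustScaler\n\ninliers = np.arange(1.0, 10.0).reshape(-1, 1)  # values 1..9\nwith_outlier = np.vstack([inliers, [[1000.0]]])\n\nfor name, data in (('inliers only', inliers),\n                   ('with outlier', with_outlier)):\n    rs = RobustScaler(quantile_range=(25, 75)).fit(data)\n    print('%s: center (median)=%s, scale (IQR)=%s'\n          % (name, rs.center_, rs.scale_))"
]
},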
{
"cell_type": "markdown",
"metadata": {},
"source": [
"QuantileTransformer (uniform output)\n------------------------------------\n\n``QuantileTransformer`` applies a non-linear transformation such that the\nprobability density function of each feature will be mapped to a uniform\ndistribution. In this case, all the data, including the outliers, will be\nmapped to the range [0, 1], so the outliers can no longer be distinguished\nfrom the inliers.\n\nLike ``RobustScaler``, ``QuantileTransformer`` is robust to outliers in the\nsense that adding or removing outliers in the training set will yield\napproximately the same transformation on held-out data. But contrary to\n``RobustScaler``, ``QuantileTransformer`` will also automatically collapse\nany outlier by setting it to the a priori defined range boundaries (0 and\n1).\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"make_plot(5)"
]
},
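{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch, added for illustration (not part of the original example):\neach value is mapped through the empirical CDF, so the outlier is collapsed\nonto the upper boundary and can no longer be distinguished from the largest\ninlier. The toy array and the small ``n_quantiles`` value are chosen for\nreadability.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustrative sketch (not part of the original example): QuantileTransformer\n# maps each value through the empirical CDF; the outlier collapses onto the\n# boundary value 1. Toy data, not the housing dataset.\nimport numpy as np\nfrom sklearn.preprocessing import QuantileTransformer\n\nX_toy = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])\nqt = QuantileTransformer(n_quantiles=5, output_distribution='uniform')\nprint(qt.fit_transform(X_toy).ravel())\n# [0.   0.25 0.5  0.75 1.  ]"
]
},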
{
"cell_type": "markdown",
"metadata": {},
"source": [
"QuantileTransformer (Gaussian output)\n-------------------------------------\n\n``QuantileTransformer`` has an additional ``output_distribution`` parameter\nthat allows matching a Gaussian distribution instead of a uniform\ndistribution. Note that this non-parametric transformer introduces\nsaturation artifacts for extreme values.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"make_plot(6)"
]
},
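{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch, added for illustration (not part of the original example):\nwith ``output_distribution='normal'`` the empirical CDF is composed with the\ninverse Gaussian CDF, and the most extreme quantiles saturate at finite\nclipped values rather than diverging. The toy array is hypothetical.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustrative sketch (not part of the original example): the Gaussian output\n# saturates at finite values for the extreme quantiles instead of diverging.\nimport numpy as np\nfrom sklearn.preprocessing import QuantileTransformer\n\nX_toy = np.arange(1.0, 101.0).reshape(-1, 1)  # values 1..100\nqt = QuantileTransformer(n_quantiles=100, output_distribution='normal')\nout = qt.fit_transform(X_toy).ravel()\nprint('min=%.3f, max=%.3f' % (out.min(), out.max()))  # saturated tails"
]
},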
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Normalizer\n----------\n\nThe ``Normalizer`` rescales the vector for each sample to have unit norm,\nindependently of the distribution of the samples. This can be seen in both\nfigures below, where all samples are mapped onto the unit circle. In our\nexample the two selected features have only positive values; therefore the\ntransformed data lie only in the positive quadrant. This would not be the\ncase if some original features had a mix of positive and negative values.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"make_plot(7)\nplt.show()"
]
},
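{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch, added for illustration (not part of the original example):\neach sample (row) is divided by its own L2 norm, so every row lands on the\nunit circle and samples that differ only in magnitude become identical. The\ntoy array is hypothetical.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Illustrative sketch (not part of the original example): Normalizer divides\n# each row by its own L2 norm; rows differing only in magnitude coincide.\nimport numpy as np\nfrom sklearn.preprocessing import Normalizer\n\nX_toy = np.array([[3.0, 4.0], [6.0, 8.0], [1.0, 0.0]])\nX_norm = Normalizer(norm='l2').fit_transform(X_toy)\nprint(X_norm)\nprint('row norms:', np.linalg.norm(X_norm, axis=1))  # all 1.0"
]
}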
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
