Commit 3ca53e4

Pushing the docs to dev/ for branch: master, commit 4f710cdd088aa8851e8b049e4faafa03767fda10
1 parent adbcf13 commit 3ca53e4

539 files changed

+2932
-985
lines changed

9.69 KB
Binary file not shown.
7.32 KB
Binary file not shown.
Lines changed: 187 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,187 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "code",
5+
"execution_count": null,
6+
"metadata": {
7+
"collapsed": false
8+
},
9+
"outputs": [],
10+
"source": [
11+
"%matplotlib inline"
12+
]
13+
},
14+
{
15+
"cell_type": "markdown",
16+
"metadata": {},
17+
"source": [
18+
"\n# Effect of transforming the targets in regression model\n\n\nIn this example, we give an overview of the\n:class:`sklearn.preprocessing.TransformedTargetRegressor`. Two examples\nillustrate the benefit of transforming the targets before learning a linear\nregression model. The first example uses synthetic data while the second\nexample is based on the Boston housing data set.\n\n\n"
19+
]
20+
},
21+
{
22+
"cell_type": "code",
23+
"execution_count": null,
24+
"metadata": {
25+
"collapsed": false
26+
},
27+
"outputs": [],
28+
"source": [
29+
"# Author: Guillaume Lemaitre <[email protected]>\n# License: BSD 3 clause\n\nfrom __future__ import print_function, division\n\nimport numpy as np\nimport matplotlib.pyplot as plt\n\nprint(__doc__)"
30+
]
31+
},
32+
{
33+
"cell_type": "markdown",
34+
"metadata": {},
35+
"source": [
36+
"Synthetic example\n##############################################################################\n\n"
37+
]
38+
},
39+
{
40+
"cell_type": "code",
41+
"execution_count": null,
42+
"metadata": {
43+
"collapsed": false
44+
},
45+
"outputs": [],
46+
"source": [
47+
"from sklearn.datasets import make_regression\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.linear_model import RidgeCV\nfrom sklearn.preprocessing import TransformedTargetRegressor\nfrom sklearn.metrics import median_absolute_error, r2_score"
48+
]
49+
},
50+
{
51+
"cell_type": "markdown",
52+
"metadata": {},
53+
"source": [
54+
"A synthetic random regression problem is generated. The targets ``y`` are\nmodified by: (i) translating all targets such that all entries are\nnon-negative and (ii) applying an exponential function to obtain non-linear\ntargets which cannot be fitted using a simple linear model.\n\nTherefore, a logarithmic and an exponential function will be used to\ntransform the targets before training a linear regression model and using it\nfor prediction.\n\n"
55+
]
56+
},
57+
{
58+
"cell_type": "code",
59+
"execution_count": null,
60+
"metadata": {
61+
"collapsed": false
62+
},
63+
"outputs": [],
64+
"source": [
65+
"def log_transform(x):\n    return np.log(x + 1)\n\n\ndef exp_transform(x):\n    return np.exp(x) - 1\n\n\nX, y = make_regression(n_samples=10000, noise=100, random_state=0)\ny = np.exp((y + abs(y.min())) / 200)\ny_trans = log_transform(y)"
66+
]
67+
},
68+
{
69+
"cell_type": "markdown",
70+
"metadata": {},
71+
"source": [
72+
"The following plots illustrate the probability density functions of the target\nbefore and after applying the logarithmic function.\n\n"
73+
]
74+
},
75+
{
76+
"cell_type": "code",
77+
"execution_count": null,
78+
"metadata": {
79+
"collapsed": false
80+
},
81+
"outputs": [],
82+
"source": [
83+
"f, (ax0, ax1) = plt.subplots(1, 2)\n\nax0.hist(y, bins='auto', density=True)\nax0.set_xlim([0, 2000])\nax0.set_ylabel('Probability')\nax0.set_xlabel('Target')\nax0.set_title('Target distribution')\n\nax1.hist(y_trans, bins='auto', density=True)\nax1.set_ylabel('Probability')\nax1.set_xlabel('Target')\nax1.set_title('Transformed target distribution')\n\nf.suptitle(\"Synthetic data\", y=0.035)\nf.tight_layout(rect=[0.05, 0.05, 0.95, 0.95])\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)"
84+
]
85+
},
86+
{
87+
"cell_type": "markdown",
88+
"metadata": {},
89+
"source": [
90+
"First, a linear model is fitted on the original targets. Because of the\nnon-linearity, the trained model is not accurate at prediction time. A\nlogarithmic function is then used to linearize the targets, which allows\nbetter predictions with the same type of linear model, as reported by the\nmedian absolute error (MAE).\n\n"
91+
]
92+
},
93+
{
94+
"cell_type": "code",
95+
"execution_count": null,
96+
"metadata": {
97+
"collapsed": false
98+
},
99+
"outputs": [],
100+
"source": [
101+
"f, (ax0, ax1) = plt.subplots(1, 2, sharey=True)\n\nregr = RidgeCV()\nregr.fit(X_train, y_train)\ny_pred = regr.predict(X_test)\n\nax0.scatter(y_test, y_pred)\nax0.plot([0, 2000], [0, 2000], '--k')\nax0.set_ylabel('Target predicted')\nax0.set_xlabel('True Target')\nax0.set_title('Ridge regression \\n without target transformation')\nax0.text(100, 1750, r'$R^2$=%.2f, MAE=%.2f' % (\n r2_score(y_test, y_pred), median_absolute_error(y_test, y_pred)))\nax0.set_xlim([0, 2000])\nax0.set_ylim([0, 2000])\n\nregr_trans = TransformedTargetRegressor(regressor=RidgeCV(),\n func=log_transform,\n inverse_func=exp_transform)\nregr_trans.fit(X_train, y_train)\ny_pred = regr_trans.predict(X_test)\n\nax1.scatter(y_test, y_pred)\nax1.plot([0, 2000], [0, 2000], '--k')\nax1.set_ylabel('Target predicted')\nax1.set_xlabel('True Target')\nax1.set_title('Ridge regression \\n with target transformation')\nax1.text(100, 1750, r'$R^2$=%.2f, MAE=%.2f' % (\n r2_score(y_test, y_pred), median_absolute_error(y_test, y_pred)))\nax1.set_xlim([0, 2000])\nax1.set_ylim([0, 2000])\n\nf.suptitle(\"Synthetic data\", y=0.035)\nf.tight_layout(rect=[0.05, 0.05, 0.95, 0.95])"
102+
]
103+
},
104+
{
105+
"cell_type": "markdown",
106+
"metadata": {},
107+
"source": [
108+
"Real-world data set\n##############################################################################\n\n"
109+
]
110+
},
111+
{
112+
"cell_type": "markdown",
113+
"metadata": {},
114+
"source": [
115+
"In a similar manner, the Boston housing data set is used to show the impact\nof transforming the targets before learning a model. In this example, the\ntargets to be predicted correspond to the weighted distances to the five\nBoston employment centers.\n\n"
116+
]
117+
},
118+
{
119+
"cell_type": "code",
120+
"execution_count": null,
121+
"metadata": {
122+
"collapsed": false
123+
},
124+
"outputs": [],
125+
"source": [
126+
"from sklearn.datasets import load_boston\nfrom sklearn.preprocessing import QuantileTransformer, quantile_transform\n\ndataset = load_boston()\ntarget = np.array(dataset.feature_names) == \"DIS\"\nX = dataset.data[:, np.logical_not(target)]\ny = dataset.data[:, target].squeeze()\ny_trans = quantile_transform(dataset.data[:, target],\n output_distribution='normal').squeeze()"
127+
]
128+
},
129+
{
130+
"cell_type": "markdown",
131+
"metadata": {},
132+
"source": [
133+
"A :class:`sklearn.preprocessing.QuantileTransformer` is used such that the\ntargets follow a normal distribution before applying a\n:class:`sklearn.linear_model.RidgeCV` model.\n\n"
134+
]
135+
},
136+
{
137+
"cell_type": "code",
138+
"execution_count": null,
139+
"metadata": {
140+
"collapsed": false
141+
},
142+
"outputs": [],
143+
"source": [
144+
"f, (ax0, ax1) = plt.subplots(1, 2)\n\nax0.hist(y, bins='auto', density=True)\nax0.set_ylabel('Probability')\nax0.set_xlabel('Target')\nax0.set_title('Target distribution')\n\nax1.hist(y_trans, bins='auto', density=True)\nax1.set_ylabel('Probability')\nax1.set_xlabel('Target')\nax1.set_title('Transformed target distribution')\n\nf.suptitle(\"Boston housing data: distance to employment centers\", y=0.035)\nf.tight_layout(rect=[0.05, 0.05, 0.95, 0.95])\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)"
145+
]
146+
},
147+
{
148+
"cell_type": "markdown",
149+
"metadata": {},
150+
"source": [
151+
"The effect of the transformer is weaker than on the synthetic data. However,\nthe transformation still decreases the MAE.\n\n"
152+
]
153+
},
154+
{
155+
"cell_type": "code",
156+
"execution_count": null,
157+
"metadata": {
158+
"collapsed": false
159+
},
160+
"outputs": [],
161+
"source": [
162+
"f, (ax0, ax1) = plt.subplots(1, 2, sharey=True)\n\nregr = RidgeCV()\nregr.fit(X_train, y_train)\ny_pred = regr.predict(X_test)\n\nax0.scatter(y_test, y_pred)\nax0.plot([0, 10], [0, 10], '--k')\nax0.set_ylabel('Target predicted')\nax0.set_xlabel('True Target')\nax0.set_title('Ridge regression \\n without target transformation')\nax0.text(1, 9, r'$R^2$=%.2f, MAE=%.2f' % (\n r2_score(y_test, y_pred), median_absolute_error(y_test, y_pred)))\nax0.set_xlim([0, 10])\nax0.set_ylim([0, 10])\n\nregr_trans = TransformedTargetRegressor(\n regressor=RidgeCV(),\n transformer=QuantileTransformer(output_distribution='normal'))\nregr_trans.fit(X_train, y_train)\ny_pred = regr_trans.predict(X_test)\n\nax1.scatter(y_test, y_pred)\nax1.plot([0, 10], [0, 10], '--k')\nax1.set_ylabel('Target predicted')\nax1.set_xlabel('True Target')\nax1.set_title('Ridge regression \\n with target transformation')\nax1.text(1, 9, r'$R^2$=%.2f, MAE=%.2f' % (\n r2_score(y_test, y_pred), median_absolute_error(y_test, y_pred)))\nax1.set_xlim([0, 10])\nax1.set_ylim([0, 10])\n\nf.suptitle(\"Boston housing data: distance to employment centers\", y=0.035)\nf.tight_layout(rect=[0.05, 0.05, 0.95, 0.95])\n\nplt.show()"
163+
]
164+
}
165+
],
166+
"metadata": {
167+
"kernelspec": {
168+
"display_name": "Python 3",
169+
"language": "python",
170+
"name": "python3"
171+
},
172+
"language_info": {
173+
"codemirror_mode": {
174+
"name": "ipython",
175+
"version": 3
176+
},
177+
"file_extension": ".py",
178+
"mimetype": "text/x-python",
179+
"name": "python",
180+
"nbconvert_exporter": "python",
181+
"pygments_lexer": "ipython3",
182+
"version": "3.6.3"
183+
}
184+
},
185+
"nbformat": 4,
186+
"nbformat_minor": 0
187+
}
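
The synthetic example in the notebook above wraps a `RidgeCV` with `func` and `inverse_func`. The sketch below illustrates what that wrapper is expected to do: fit the regressor on the transformed targets and map predictions back through the inverse. It assumes a released scikit-learn in which `TransformedTargetRegressor` is importable from `sklearn.compose` (this commit imports it from `sklearn.preprocessing`). Using `np.log1p`/`np.expm1` in place of the example's `log_transform`/`exp_transform`, and the smaller sample size, are choices made only for this illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor  # assumed import path
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Same target construction as the example, with fewer samples (assumption).
X, y = make_regression(n_samples=500, noise=100, random_state=0)
y = np.exp((y + abs(y.min())) / 200)

# Wrapper: transforms y with log1p before fitting and applies expm1
# to the regressor's predictions.
wrapped = TransformedTargetRegressor(
    regressor=RidgeCV(), func=np.log1p, inverse_func=np.expm1
)
wrapped.fit(X, y)

# Manual equivalent: transform the targets by hand, fit, then invert.
manual = RidgeCV().fit(X, np.log1p(y))
manual_pred = np.expm1(manual.predict(X))

# Compare the two sets of predictions.
same = np.allclose(wrapped.predict(X), manual_pred)
print("wrapper matches manual transform:", same)
```

The wrapper is mostly a convenience: it keeps the forward and inverse transforms attached to the estimator, so metrics computed on `predict` output are in the original target units.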
Lines changed: 205 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,205 @@
1+
#!/usr/bin/env python
2+
# -*- coding: utf-8 -*-
3+
4+
"""
5+
======================================================
6+
Effect of transforming the targets in regression model
7+
======================================================
8+
9+
In this example, we give an overview of the
10+
:class:`sklearn.preprocessing.TransformedTargetRegressor`. Two examples
11+
illustrate the benefit of transforming the targets before learning a linear
12+
regression model. The first example uses synthetic data while the second
13+
example is based on the Boston housing data set.
14+
15+
"""
16+
17+
# Author: Guillaume Lemaitre <[email protected]>
18+
# License: BSD 3 clause
19+
20+
from __future__ import print_function, division
21+
22+
import numpy as np
23+
import matplotlib.pyplot as plt
24+
25+
print(__doc__)
26+
27+
###############################################################################
28+
# Synthetic example
29+
###############################################################################
30+
31+
from sklearn.datasets import make_regression
32+
from sklearn.model_selection import train_test_split
33+
from sklearn.linear_model import RidgeCV
34+
from sklearn.preprocessing import TransformedTargetRegressor
35+
from sklearn.metrics import median_absolute_error, r2_score
36+
37+
###############################################################################
38+
# A synthetic random regression problem is generated. The targets ``y`` are
39+
# modified by: (i) translating all targets such that all entries are
40+
# non-negative and (ii) applying an exponential function to obtain non-linear
41+
# targets which cannot be fitted using a simple linear model.
42+
#
43+
# Therefore, a logarithmic and an exponential function will be used to
44+
# transform the targets before training a linear regression model and using it
45+
# for prediction.
46+
47+
48+
def log_transform(x):
49+
    return np.log(x + 1)
50+
51+
52+
def exp_transform(x):
53+
    return np.exp(x) - 1
54+
55+
56+
X, y = make_regression(n_samples=10000, noise=100, random_state=0)
57+
y = np.exp((y + abs(y.min())) / 200)
58+
y_trans = log_transform(y)
59+
60+
###############################################################################
61+
# The following plots illustrate the probability density functions of the
62+
# target before and after applying the logarithmic function.
63+
64+
f, (ax0, ax1) = plt.subplots(1, 2)
65+
66+
ax0.hist(y, bins='auto', density=True)
67+
ax0.set_xlim([0, 2000])
68+
ax0.set_ylabel('Probability')
69+
ax0.set_xlabel('Target')
70+
ax0.set_title('Target distribution')
71+
72+
ax1.hist(y_trans, bins='auto', density=True)
73+
ax1.set_ylabel('Probability')
74+
ax1.set_xlabel('Target')
75+
ax1.set_title('Transformed target distribution')
76+
77+
f.suptitle("Synthetic data", y=0.035)
78+
f.tight_layout(rect=[0.05, 0.05, 0.95, 0.95])
79+
80+
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
81+
82+
###############################################################################
83+
# First, a linear model is fitted on the original targets. Because of the
84+
# non-linearity, the trained model is not accurate at prediction time. A
85+
# logarithmic function is then used to linearize the targets, which allows
86+
# better predictions with the same type of linear model, as reported by the
87+
# median absolute error (MAE).
88+
89+
f, (ax0, ax1) = plt.subplots(1, 2, sharey=True)
90+
91+
regr = RidgeCV()
92+
regr.fit(X_train, y_train)
93+
y_pred = regr.predict(X_test)
94+
95+
ax0.scatter(y_test, y_pred)
96+
ax0.plot([0, 2000], [0, 2000], '--k')
97+
ax0.set_ylabel('Target predicted')
98+
ax0.set_xlabel('True Target')
99+
ax0.set_title('Ridge regression \n without target transformation')
100+
ax0.text(100, 1750, r'$R^2$=%.2f, MAE=%.2f' % (
101+
    r2_score(y_test, y_pred), median_absolute_error(y_test, y_pred)))
102+
ax0.set_xlim([0, 2000])
103+
ax0.set_ylim([0, 2000])
104+
105+
regr_trans = TransformedTargetRegressor(regressor=RidgeCV(),
106+
                                        func=log_transform,
107+
                                        inverse_func=exp_transform)
108+
regr_trans.fit(X_train, y_train)
109+
y_pred = regr_trans.predict(X_test)
110+
111+
ax1.scatter(y_test, y_pred)
112+
ax1.plot([0, 2000], [0, 2000], '--k')
113+
ax1.set_ylabel('Target predicted')
114+
ax1.set_xlabel('True Target')
115+
ax1.set_title('Ridge regression \n with target transformation')
116+
ax1.text(100, 1750, r'$R^2$=%.2f, MAE=%.2f' % (
117+
    r2_score(y_test, y_pred), median_absolute_error(y_test, y_pred)))
118+
ax1.set_xlim([0, 2000])
119+
ax1.set_ylim([0, 2000])
120+
121+
f.suptitle("Synthetic data", y=0.035)
122+
f.tight_layout(rect=[0.05, 0.05, 0.95, 0.95])
123+
124+
###############################################################################
125+
# Real-world data set
126+
###############################################################################
127+
128+
###############################################################################
129+
# In a similar manner, the Boston housing data set is used to show the impact
130+
# of transforming the targets before learning a model. In this example, the
131+
# targets to be predicted correspond to the weighted distances to the five
132+
# Boston employment centers.
133+
134+
from sklearn.datasets import load_boston
135+
from sklearn.preprocessing import QuantileTransformer, quantile_transform
136+
137+
dataset = load_boston()
138+
target = np.array(dataset.feature_names) == "DIS"
139+
X = dataset.data[:, np.logical_not(target)]
140+
y = dataset.data[:, target].squeeze()
141+
y_trans = quantile_transform(dataset.data[:, target],
142+
output_distribution='normal').squeeze()
143+
144+
###############################################################################
145+
# A :class:`sklearn.preprocessing.QuantileTransformer` is used such that the
146+
# targets follow a normal distribution before applying a
147+
# :class:`sklearn.linear_model.RidgeCV` model.
148+
149+
f, (ax0, ax1) = plt.subplots(1, 2)
150+
151+
ax0.hist(y, bins='auto', density=True)
152+
ax0.set_ylabel('Probability')
153+
ax0.set_xlabel('Target')
154+
ax0.set_title('Target distribution')
155+
156+
ax1.hist(y_trans, bins='auto', density=True)
157+
ax1.set_ylabel('Probability')
158+
ax1.set_xlabel('Target')
159+
ax1.set_title('Transformed target distribution')
160+
161+
f.suptitle("Boston housing data: distance to employment centers", y=0.035)
162+
f.tight_layout(rect=[0.05, 0.05, 0.95, 0.95])
163+
164+
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
165+
166+
###############################################################################
167+
# The effect of the transformer is weaker than on the synthetic data. However,
168+
# the transformation still decreases the MAE.
169+
170+
f, (ax0, ax1) = plt.subplots(1, 2, sharey=True)
171+
172+
regr = RidgeCV()
173+
regr.fit(X_train, y_train)
174+
y_pred = regr.predict(X_test)
175+
176+
ax0.scatter(y_test, y_pred)
177+
ax0.plot([0, 10], [0, 10], '--k')
178+
ax0.set_ylabel('Target predicted')
179+
ax0.set_xlabel('True Target')
180+
ax0.set_title('Ridge regression \n without target transformation')
181+
ax0.text(1, 9, r'$R^2$=%.2f, MAE=%.2f' % (
182+
    r2_score(y_test, y_pred), median_absolute_error(y_test, y_pred)))
183+
ax0.set_xlim([0, 10])
184+
ax0.set_ylim([0, 10])
185+
186+
regr_trans = TransformedTargetRegressor(
187+
    regressor=RidgeCV(),
188+
    transformer=QuantileTransformer(output_distribution='normal'))
189+
regr_trans.fit(X_train, y_train)
190+
y_pred = regr_trans.predict(X_test)
191+
192+
ax1.scatter(y_test, y_pred)
193+
ax1.plot([0, 10], [0, 10], '--k')
194+
ax1.set_ylabel('Target predicted')
195+
ax1.set_xlabel('True Target')
196+
ax1.set_title('Ridge regression \n with target transformation')
197+
ax1.text(1, 9, r'$R^2$=%.2f, MAE=%.2f' % (
198+
    r2_score(y_test, y_pred), median_absolute_error(y_test, y_pred)))
199+
ax1.set_xlim([0, 10])
200+
ax1.set_ylim([0, 10])
201+
202+
f.suptitle("Boston housing data: distance to employment centers", y=0.035)
203+
f.tight_layout(rect=[0.05, 0.05, 0.95, 0.95])
204+
205+
plt.show()
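
The second half of the script passes a `QuantileTransformer` through the `transformer` argument instead of a pair of functions. The sketch below shows that usage in isolation. Because `load_boston` was removed in scikit-learn 1.2, it substitutes a small synthetic data set with a skewed target. The data, the coefficients, and the choice of `n_quantiles` are assumptions made for illustration and are not part of the committed example. It also assumes the `sklearn.compose` import path used in released versions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor  # assumed import path
from sklearn.linear_model import RidgeCV
from sklearn.metrics import median_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import QuantileTransformer

# Hypothetical skewed target: exponential of a linear signal plus noise.
rng = np.random.RandomState(0)
X = rng.normal(size=(400, 3))
y = np.exp(X @ np.array([0.5, -0.3, 0.2]) + 0.1 * rng.normal(size=400))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Baseline: ridge regression on the raw targets.
plain = RidgeCV().fit(X_train, y_train)

# Same regressor, but the targets are mapped to a normal distribution
# before fitting and mapped back when predicting.
wrapped = TransformedTargetRegressor(
    regressor=RidgeCV(),
    transformer=QuantileTransformer(
        n_quantiles=len(y_train),  # keep n_quantiles <= number of samples
        output_distribution="normal",
    ),
).fit(X_train, y_train)

mae_plain = median_absolute_error(y_test, plain.predict(X_test))
mae_wrapped = median_absolute_error(y_test, wrapped.predict(X_test))
print("median absolute error, raw targets:         %.3f" % mae_plain)
print("median absolute error, transformed targets: %.3f" % mae_wrapped)
```

Which variant scores better depends on the data, so the two errors are printed rather than assumed. Both are reported in the original target units because the wrapper inverts the transform inside `predict`.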

dev/_downloads/scikit-learn-docs.pdf

140 KB
Binary file not shown.
