Commit 6596cd2

Pushing the docs to dev/ for branch: master, commit eb9fe80e50f32fb164e9c0f840b45b25b7b31580
1 parent 02187d6 commit 6596cd2

951 files changed (+3673 -2847 lines changed)

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
{
  "nbformat_minor": 0,
  "nbformat": 4,
  "cells": [
    {
      "execution_count": null,
      "cell_type": "code",
      "source": [
        "%matplotlib inline"
      ],
      "outputs": [],
      "metadata": {
        "collapsed": false
      }
    },
    {
      "source": [
        "\n# Importance of Feature Scaling\n\n\nFeature scaling through standardization (or Z-score normalization)\ncan be an important preprocessing step for many machine learning\nalgorithms. Standardization involves rescaling the features such\nthat they have the properties of a standard normal distribution\nwith a mean of zero and a standard deviation of one.\n\nWhile many algorithms (such as SVM, K-nearest neighbors, and logistic\nregression) require features to be normalized, intuitively we can\nthink of Principal Component Analysis (PCA) as being a prime example\nof when normalization is important. In PCA we are interested in the\ncomponents that maximize the variance. If one component (e.g. human\nheight) varies less than another (e.g. weight) because of their\nrespective scales (meters vs. kilos), PCA might determine that the\ndirection of maximal variance more closely corresponds with the\n'weight' axis, if those features are not scaled. As a change in\nheight of one meter can be considered much more important than the\nchange in weight of one kilogram, this is clearly incorrect.\n\nTo illustrate this, PCA is performed comparing the use of data with\n:class:`StandardScaler <sklearn.preprocessing.StandardScaler>` applied,\nto unscaled data. The results are visualized and a clear difference noted.\nLooking at the first principal component of the unscaled set, it can be seen\nthat feature #13 dominates the direction, being a whole two orders of\nmagnitude above the other features. This is contrasted when observing\nthe principal component for the scaled version of the data. In the scaled\nversion, the orders of magnitude are roughly the same across all the features.\n\nThe dataset used is the Wine Dataset available at UCI. This dataset\nhas continuous features that are heterogeneous in scale due to differing\nproperties that they measure (i.e. alcohol content and malic acid).\n\nThe transformed data is then used to train a naive Bayes classifier, and a\nclear difference in prediction accuracies is observed wherein the dataset\nwhich is scaled before PCA vastly outperforms the unscaled version.\n\n\n"
      ],
      "cell_type": "markdown",
      "metadata": {}
    },
    {
      "execution_count": null,
      "cell_type": "code",
      "source": [
        "from __future__ import print_function\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.decomposition import PCA\nfrom sklearn.naive_bayes import GaussianNB\nfrom sklearn import metrics\nimport matplotlib.pyplot as plt\nfrom sklearn.datasets import load_wine\nfrom sklearn.pipeline import make_pipeline\nprint(__doc__)\n\n# Code source: Tyler Lanigan <[email protected]>\n# Sebastian Raschka <[email protected]>\n\n# License: BSD 3 clause\n\nRANDOM_STATE = 42\nFIG_SIZE = (10, 7)\n\n\nfeatures, target = load_wine(return_X_y=True)\n\n# Make a train/test split using 30% test size\nX_train, X_test, y_train, y_test = train_test_split(features, target,\n                                                    test_size=0.30,\n                                                    random_state=RANDOM_STATE)\n\n# Fit to data and predict using pipelined PCA and GNB.\nunscaled_clf = make_pipeline(PCA(n_components=2), GaussianNB())\nunscaled_clf.fit(X_train, y_train)\npred_test = unscaled_clf.predict(X_test)\n\n# Fit to data and predict using pipelined scaling, PCA and GNB.\nstd_clf = make_pipeline(StandardScaler(), PCA(n_components=2), GaussianNB())\nstd_clf.fit(X_train, y_train)\npred_test_std = std_clf.predict(X_test)\n\n# Show prediction accuracies in scaled and unscaled data.\nprint('\\nPrediction accuracy for the normal test dataset with PCA')\nprint('{:.2%}\\n'.format(metrics.accuracy_score(y_test, pred_test)))\n\nprint('\\nPrediction accuracy for the standardized test dataset with PCA')\nprint('{:.2%}\\n'.format(metrics.accuracy_score(y_test, pred_test_std)))\n\n# Extract PCA from pipeline\npca = unscaled_clf.named_steps['pca']\npca_std = std_clf.named_steps['pca']\n\n# Show first principal components\nprint('\\nPC 1 without scaling:\\n', pca.components_[0])\nprint('\\nPC 1 with scaling:\\n', pca_std.components_[0])\n\n# Scale and use PCA on X_train data for visualization.\nscaler = std_clf.named_steps['standardscaler']\nX_train_std = pca_std.transform(scaler.transform(X_train))\n\n# visualize standardized vs. untouched dataset with PCA performed\nfig, (ax1, ax2) = plt.subplots(ncols=2, figsize=FIG_SIZE)\n\n\nfor l, c, m in zip(range(0, 3), ('blue', 'red', 'green'), ('^', 's', 'o')):\n    ax1.scatter(X_train[y_train == l, 0], X_train[y_train == l, 1],\n                color=c,\n                label='class %s' % l,\n                alpha=0.5,\n                marker=m\n                )\n\nfor l, c, m in zip(range(0, 3), ('blue', 'red', 'green'), ('^', 's', 'o')):\n    ax2.scatter(X_train_std[y_train == l, 0], X_train_std[y_train == l, 1],\n                color=c,\n                label='class %s' % l,\n                alpha=0.5,\n                marker=m\n                )\n\nax1.set_title('Training dataset after PCA')\nax2.set_title('Standardized training dataset after PCA')\n\nfor ax in (ax1, ax2):\n    ax.set_xlabel('1st principal component')\n    ax.set_ylabel('2nd principal component')\n    ax.legend(loc='upper right')\n    ax.grid()\n\nplt.tight_layout()\n\nplt.show()"
      ],
      "outputs": [],
      "metadata": {
        "collapsed": false
      }
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 2",
      "name": "python2",
      "language": "python"
    },
    "language_info": {
      "mimetype": "text/x-python",
      "nbconvert_exporter": "python",
      "name": "python",
      "file_extension": ".py",
      "version": "2.7.13",
      "pygments_lexer": "ipython2",
      "codemirror_mode": {
        "version": 2,
        "name": "ipython"
      }
    }
  }
}
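
The markdown cell above describes standardization as rescaling each feature to zero mean and unit standard deviation. As a minimal sketch of what that means in practice (illustrative only, not part of the committed files; the variable names below are made up), the manual z-score and StandardScaler agree on the same Wine data:

import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)

# Manual z-score: subtract the per-feature mean and divide by the
# per-feature standard deviation.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler applies the same transformation and stores the fitted
# statistics so they can be reused on test data.
X_scaled = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_scaled))   # True
print(X_scaled.mean(axis=0).round(6))    # roughly 0 for every feature
print(X_scaled.std(axis=0).round(6))     # roughly 1 for every feature
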
Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
=========================================================
Importance of Feature Scaling
=========================================================

Feature scaling through standardization (or Z-score normalization)
can be an important preprocessing step for many machine learning
algorithms. Standardization involves rescaling the features such
that they have the properties of a standard normal distribution
with a mean of zero and a standard deviation of one.

While many algorithms (such as SVM, K-nearest neighbors, and logistic
regression) require features to be normalized, intuitively we can
think of Principal Component Analysis (PCA) as being a prime example
of when normalization is important. In PCA we are interested in the
components that maximize the variance. If one component (e.g. human
height) varies less than another (e.g. weight) because of their
respective scales (meters vs. kilos), PCA might determine that the
direction of maximal variance more closely corresponds with the
'weight' axis, if those features are not scaled. As a change in
height of one meter can be considered much more important than the
change in weight of one kilogram, this is clearly incorrect.

To illustrate this, PCA is performed comparing the use of data with
:class:`StandardScaler <sklearn.preprocessing.StandardScaler>` applied,
to unscaled data. The results are visualized and a clear difference noted.
Looking at the first principal component of the unscaled set, it can be seen
that feature #13 dominates the direction, being a whole two orders of
magnitude above the other features. This is contrasted when observing
the principal component for the scaled version of the data. In the scaled
version, the orders of magnitude are roughly the same across all the features.

The dataset used is the Wine Dataset available at UCI. This dataset
has continuous features that are heterogeneous in scale due to differing
properties that they measure (i.e. alcohol content and malic acid).

The transformed data is then used to train a naive Bayes classifier, and a
clear difference in prediction accuracies is observed wherein the dataset
which is scaled before PCA vastly outperforms the unscaled version.

"""
from __future__ import print_function
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.pipeline import make_pipeline
print(__doc__)

# Code source: Tyler Lanigan <[email protected]>
# Sebastian Raschka <[email protected]>

# License: BSD 3 clause

RANDOM_STATE = 42
FIG_SIZE = (10, 7)


features, target = load_wine(return_X_y=True)

# Make a train/test split using 30% test size
X_train, X_test, y_train, y_test = train_test_split(features, target,
                                                    test_size=0.30,
                                                    random_state=RANDOM_STATE)

# Fit to data and predict using pipelined PCA and GNB.
unscaled_clf = make_pipeline(PCA(n_components=2), GaussianNB())
unscaled_clf.fit(X_train, y_train)
pred_test = unscaled_clf.predict(X_test)

# Fit to data and predict using pipelined scaling, PCA and GNB.
std_clf = make_pipeline(StandardScaler(), PCA(n_components=2), GaussianNB())
std_clf.fit(X_train, y_train)
pred_test_std = std_clf.predict(X_test)

# Show prediction accuracies in scaled and unscaled data.
print('\nPrediction accuracy for the normal test dataset with PCA')
print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred_test)))

print('\nPrediction accuracy for the standardized test dataset with PCA')
print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred_test_std)))

# Extract PCA from pipeline
pca = unscaled_clf.named_steps['pca']
pca_std = std_clf.named_steps['pca']

# Show first principal components
print('\nPC 1 without scaling:\n', pca.components_[0])
print('\nPC 1 with scaling:\n', pca_std.components_[0])

# Scale and use PCA on X_train data for visualization.
scaler = std_clf.named_steps['standardscaler']
X_train_std = pca_std.transform(scaler.transform(X_train))

# visualize standardized vs. untouched dataset with PCA performed
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=FIG_SIZE)


for l, c, m in zip(range(0, 3), ('blue', 'red', 'green'), ('^', 's', 'o')):
    ax1.scatter(X_train[y_train == l, 0], X_train[y_train == l, 1],
                color=c,
                label='class %s' % l,
                alpha=0.5,
                marker=m
                )

for l, c, m in zip(range(0, 3), ('blue', 'red', 'green'), ('^', 's', 'o')):
    ax2.scatter(X_train_std[y_train == l, 0], X_train_std[y_train == l, 1],
                color=c,
                label='class %s' % l,
                alpha=0.5,
                marker=m
                )

ax1.set_title('Training dataset after PCA')
ax2.set_title('Standardized training dataset after PCA')

for ax in (ax1, ax2):
    ax.set_xlabel('1st principal component')
    ax.set_ylabel('2nd principal component')
    ax.legend(loc='upper right')
    ax.grid()

plt.tight_layout()

plt.show()
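
The example's docstring attributes the poor unscaled result to feature #13 dominating the first principal component because of its much larger scale. A quick way to see where that disparity comes from (an illustrative snippet, not part of the committed example) is to print the per-feature standard deviations of the Wine data:

from sklearn.datasets import load_wine

data = load_wine()

# Spread of each raw feature; without scaling, the feature with the largest
# variance largely determines the direction PCA picks for the first component.
for name, std in zip(data.feature_names, data.data.std(axis=0)):
    print('{:<30s}{:12.2f}'.format(name, std))

# The last feature ('proline') has by far the largest spread, which is why it
# dominates PC 1 when the data is not standardized first.
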

dev/_downloads/scikit-learn-docs.pdf

63.3 KB
Binary file not shown.

0 commit comments
