
Commit 5ebee14

Pushing the docs to dev/ for branch: master, commit d1c52f402c89586ff66c2ec6e86f794da7715a18
1 parent 5efa7d5 commit 5ebee14

File tree

1,104 files changed

+5887
-3096
lines changed

16.5 KB
Binary file not shown.
12.8 KB
Binary file not shown.
Lines changed: 144 additions & 0 deletions
@@ -0,0 +1,144 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%matplotlib inline"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "\n================================================================\nPermutation Importance vs Random Forest Feature Importance (MDI)\n================================================================\n\nIn this example, we will compare the impurity-based feature importance of\n:class:`~sklearn.ensemble.RandomForestClassifier` with the\npermutation importance on the Titanic dataset using\n:func:`~sklearn.inspection.permutation_importance`. We will show that the\nimpurity-based feature importance can inflate the importance of numerical\nfeatures.\n\nFurthermore, the impurity-based feature importance of random forests suffers\nfrom being computed on statistics derived from the training dataset: the\nimportances can be high even for features that are not predictive of the target\nvariable, as long as the model has the capacity to use them to overfit.\n\nThis example shows how to use Permutation Importances as an alternative that\ncan mitigate those limitations.\n\n.. topic:: References:\n\n   .. [1] L. Breiman, \"Random Forests\", Machine Learning, 45(1), 5-32,\n       2001. https://doi.org/10.1023/A:1010933404324\n\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "print(__doc__)\nimport matplotlib.pyplot as plt\nimport numpy as np\n\nfrom sklearn.datasets import fetch_openml\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.inspection import permutation_importance\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.preprocessing import OneHotEncoder"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Data Loading and Feature Engineering\n------------------------------------\nLet's use pandas to load a copy of the Titanic dataset. The following shows\nhow to apply separate preprocessing on numerical and categorical features.\n\nWe further include two random variables that are not correlated in any way\nwith the target variable (``survived``):\n\n- ``random_num`` is a high cardinality numerical variable (as many unique\n  values as records).\n- ``random_cat`` is a low cardinality categorical variable (3 possible\n  values).\n\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "X, y = fetch_openml(\"titanic\", version=1, as_frame=True, return_X_y=True)\nX['random_cat'] = np.random.randint(3, size=X.shape[0])\nX['random_num'] = np.random.randn(X.shape[0])\n\ncategorical_columns = ['pclass', 'sex', 'embarked', 'random_cat']\nnumerical_columns = ['age', 'sibsp', 'parch', 'fare', 'random_num']\n\nX = X[categorical_columns + numerical_columns]\n\nX_train, X_test, y_train, y_test = train_test_split(\n    X, y, stratify=y, random_state=42)\n\ncategorical_pipe = Pipeline([\n    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),\n    ('onehot', OneHotEncoder(handle_unknown='ignore'))\n])\nnumerical_pipe = Pipeline([\n    ('imputer', SimpleImputer(strategy='mean'))\n])\n\npreprocessing = ColumnTransformer(\n    [('cat', categorical_pipe, categorical_columns),\n     ('num', numerical_pipe, numerical_columns)])\n\nrf = Pipeline([\n    ('preprocess', preprocessing),\n    ('classifier', RandomForestClassifier(random_state=42))\n])\nrf.fit(X_train, y_train)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Accuracy of the Model\n---------------------\nBefore inspecting the feature importances, it is important to check that the\nmodel's predictive performance is high enough: there would be little\ninterest in inspecting the important features of a non-predictive model.\n\nHere one can observe that the train accuracy is very high (the forest model\nhas enough capacity to completely memorize the training set) but it can still\ngeneralize well enough to the test set thanks to the built-in bagging of\nrandom forests.\n\nIt might be possible to trade some accuracy on the training set for a\nslightly better accuracy on the test set by limiting the capacity of the\ntrees (for instance by setting ``min_samples_leaf=5`` or\n``min_samples_leaf=10``) so as to limit overfitting while not introducing too\nmuch underfitting.\n\nHowever, let's keep our high capacity random forest model for now so as to\nillustrate some pitfalls with feature importance on variables with many\nunique values.\n\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "print(\"RF train accuracy: %0.3f\" % rf.score(X_train, y_train))\nprint(\"RF test accuracy: %0.3f\" % rf.score(X_test, y_test))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Tree's Feature Importance from Mean Decrease in Impurity (MDI)\n--------------------------------------------------------------\nThe impurity-based feature importance ranks the numerical features as the\nmost important features. As a result, the non-predictive ``random_num``\nvariable is ranked the most important!\n\nThis problem stems from two limitations of impurity-based feature\nimportances:\n\n- impurity-based importances are biased towards high cardinality features;\n- impurity-based importances are computed on training set statistics and\n  therefore do not reflect the ability of a feature to be useful to make\n  predictions that generalize to the test set (when the model has enough\n  capacity).\n\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "ohe = (rf.named_steps['preprocess']\n         .named_transformers_['cat']\n         .named_steps['onehot'])\nfeature_names = ohe.get_feature_names(input_features=categorical_columns)\nfeature_names = np.r_[feature_names, numerical_columns]\n\ntree_feature_importances = (\n    rf.named_steps['classifier'].feature_importances_)\nsorted_idx = tree_feature_importances.argsort()\n\ny_ticks = np.arange(0, len(feature_names))\nfig, ax = plt.subplots()\nax.barh(y_ticks, tree_feature_importances[sorted_idx])\nax.set_yticks(y_ticks)\nax.set_yticklabels(feature_names[sorted_idx])\nax.set_title(\"Random Forest Feature Importances (MDI)\")\nfig.tight_layout()\nplt.show()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "As an alternative, the permutation importances of ``rf`` are computed on a\nheld out test set. This shows that the low cardinality categorical feature,\n``sex``, is the most important feature.\n\nAlso note that both random features have very low importances (close to 0) as\nexpected.\n\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "result = permutation_importance(rf, X_test, y_test, n_repeats=10,\n                                random_state=42, n_jobs=2)\nsorted_idx = result.importances_mean.argsort()\n\nfig, ax = plt.subplots()\nax.boxplot(result.importances[sorted_idx].T,\n           vert=False, labels=X_test.columns[sorted_idx])\nax.set_title(\"Permutation Importances (test set)\")\nfig.tight_layout()\nplt.show()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "It is also possible to compute the permutation importances on the training\nset. This reveals that ``random_num`` gets a significantly higher importance\nranking than when computed on the test set. The difference between those two\nplots is a confirmation that the RF model has enough capacity to use that\nrandom numerical feature to overfit. You can further confirm this by\nre-running this example with an RF constrained by ``min_samples_leaf=10``.\n\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "result = permutation_importance(rf, X_train, y_train, n_repeats=10,\n                                random_state=42, n_jobs=2)\nsorted_idx = result.importances_mean.argsort()\n\nfig, ax = plt.subplots()\nax.boxplot(result.importances[sorted_idx].T,\n           vert=False, labels=X_train.columns[sorted_idx])\nax.set_title(\"Permutation Importances (train set)\")\nfig.tight_layout()\nplt.show()"
+      ]
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": "Python 3",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.6.8"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}
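
The notebook's MDI section states that impurity-based importances are biased towards high cardinality features. One way to see why is the sketch below, which is not taken from the commit: it draws pure-noise features against a random binary target and records the best Gini decrease any single threshold achieves. The helper names (`gini`, `best_impurity_decrease`) are hypothetical, and the 3-valued feature is treated as ordinal rather than one-hot encoded, which is a simplification of the pipeline in the diff. Only the standard library is used.

```python
import random


def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_impurity_decrease(feature, labels):
    """Largest Gini decrease over all splits of the form 'feature <= t'."""
    n = len(labels)
    parent = gini(labels)
    best = 0.0
    # Every distinct value except the largest is a candidate threshold.
    for t in sorted(set(feature))[:-1]:
        left = [y for x, y in zip(feature, labels) if x <= t]
        right = [y for x, y in zip(feature, labels) if x > t]
        children = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - children)
    return best


rng = random.Random(0)
n_samples, n_trials = 100, 50
num_gains, cat_gains = [], []
for _ in range(n_trials):
    y = [rng.randint(0, 1) for _ in range(n_samples)]
    # Noise with (almost surely) as many unique values as samples.
    noise_num = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    # Noise with only 3 unique values.
    noise_cat = [rng.randint(0, 2) for _ in range(n_samples)]
    num_gains.append(best_impurity_decrease(noise_num, y))
    cat_gains.append(best_impurity_decrease(noise_cat, y))

mean_num = sum(num_gains) / n_trials
mean_cat = sum(cat_gains) / n_trials
print("mean best decrease, high cardinality noise: %0.4f" % mean_num)
print("mean best decrease, low cardinality noise:  %0.4f" % mean_cat)
```

Neither feature carries any signal, yet the high cardinality one offers many more candidate thresholds, so the best decrease found by chance tends to be larger. A tree fitted on the training set would therefore tend to credit it with more impurity reduction.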
Lines changed: 177 additions & 0 deletions
@@ -0,0 +1,177 @@
+"""
+================================================================
+Permutation Importance vs Random Forest Feature Importance (MDI)
+================================================================
+
+In this example, we will compare the impurity-based feature importance of
+:class:`~sklearn.ensemble.RandomForestClassifier` with the
+permutation importance on the Titanic dataset using
+:func:`~sklearn.inspection.permutation_importance`. We will show that the
+impurity-based feature importance can inflate the importance of numerical
+features.
+
+Furthermore, the impurity-based feature importance of random forests suffers
+from being computed on statistics derived from the training dataset: the
+importances can be high even for features that are not predictive of the target
+variable, as long as the model has the capacity to use them to overfit.
+
+This example shows how to use Permutation Importances as an alternative that
+can mitigate those limitations.
+
+.. topic:: References:
+
+   .. [1] L. Breiman, "Random Forests", Machine Learning, 45(1), 5-32,
+       2001. https://doi.org/10.1023/A:1010933404324
+"""
+print(__doc__)
+import matplotlib.pyplot as plt
+import numpy as np
+
+from sklearn.datasets import fetch_openml
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.impute import SimpleImputer
+from sklearn.inspection import permutation_importance
+from sklearn.compose import ColumnTransformer
+from sklearn.model_selection import train_test_split
+from sklearn.pipeline import Pipeline
+from sklearn.preprocessing import OneHotEncoder
+
+
+##############################################################################
+# Data Loading and Feature Engineering
+# ------------------------------------
+# Let's use pandas to load a copy of the Titanic dataset. The following shows
+# how to apply separate preprocessing on numerical and categorical features.
+#
+# We further include two random variables that are not correlated in any way
+# with the target variable (``survived``):
+#
+# - ``random_num`` is a high cardinality numerical variable (as many unique
+#   values as records).
+# - ``random_cat`` is a low cardinality categorical variable (3 possible
+#   values).
+X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
+X['random_cat'] = np.random.randint(3, size=X.shape[0])
+X['random_num'] = np.random.randn(X.shape[0])
+
+categorical_columns = ['pclass', 'sex', 'embarked', 'random_cat']
+numerical_columns = ['age', 'sibsp', 'parch', 'fare', 'random_num']
+
+X = X[categorical_columns + numerical_columns]
+
+X_train, X_test, y_train, y_test = train_test_split(
+    X, y, stratify=y, random_state=42)
+
+categorical_pipe = Pipeline([
+    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
+    ('onehot', OneHotEncoder(handle_unknown='ignore'))
+])
+numerical_pipe = Pipeline([
+    ('imputer', SimpleImputer(strategy='mean'))
+])
+
+preprocessing = ColumnTransformer(
+    [('cat', categorical_pipe, categorical_columns),
+     ('num', numerical_pipe, numerical_columns)])
+
+rf = Pipeline([
+    ('preprocess', preprocessing),
+    ('classifier', RandomForestClassifier(random_state=42))
+])
+rf.fit(X_train, y_train)
+
+##############################################################################
+# Accuracy of the Model
+# ---------------------
+# Before inspecting the feature importances, it is important to check that the
+# model's predictive performance is high enough: there would be little
+# interest in inspecting the important features of a non-predictive model.
+#
+# Here one can observe that the train accuracy is very high (the forest model
+# has enough capacity to completely memorize the training set) but it can still
+# generalize well enough to the test set thanks to the built-in bagging of
+# random forests.
+#
+# It might be possible to trade some accuracy on the training set for a
+# slightly better accuracy on the test set by limiting the capacity of the
+# trees (for instance by setting ``min_samples_leaf=5`` or
+# ``min_samples_leaf=10``) so as to limit overfitting while not introducing too
+# much underfitting.
+#
+# However, let's keep our high capacity random forest model for now so as to
+# illustrate some pitfalls with feature importance on variables with many
+# unique values.
+print("RF train accuracy: %0.3f" % rf.score(X_train, y_train))
+print("RF test accuracy: %0.3f" % rf.score(X_test, y_test))
+
+
+##############################################################################
+# Tree's Feature Importance from Mean Decrease in Impurity (MDI)
+# --------------------------------------------------------------
+# The impurity-based feature importance ranks the numerical features as the
+# most important features. As a result, the non-predictive ``random_num``
+# variable is ranked the most important!
+#
+# This problem stems from two limitations of impurity-based feature
+# importances:
+#
+# - impurity-based importances are biased towards high cardinality features;
+# - impurity-based importances are computed on training set statistics and
+#   therefore do not reflect the ability of a feature to be useful to make
+#   predictions that generalize to the test set (when the model has enough
+#   capacity).
+ohe = (rf.named_steps['preprocess']
+         .named_transformers_['cat']
+         .named_steps['onehot'])
+feature_names = ohe.get_feature_names(input_features=categorical_columns)
+feature_names = np.r_[feature_names, numerical_columns]
+
+tree_feature_importances = (
+    rf.named_steps['classifier'].feature_importances_)
+sorted_idx = tree_feature_importances.argsort()
+
+y_ticks = np.arange(0, len(feature_names))
+fig, ax = plt.subplots()
+ax.barh(y_ticks, tree_feature_importances[sorted_idx])
+ax.set_yticks(y_ticks)
+ax.set_yticklabels(feature_names[sorted_idx])
+ax.set_title("Random Forest Feature Importances (MDI)")
+fig.tight_layout()
+plt.show()
+
+
+##############################################################################
+# As an alternative, the permutation importances of ``rf`` are computed on a
+# held out test set. This shows that the low cardinality categorical feature,
+# ``sex``, is the most important feature.
+#
+# Also note that both random features have very low importances (close to 0) as
+# expected.
+result = permutation_importance(rf, X_test, y_test, n_repeats=10,
+                                random_state=42, n_jobs=2)
+sorted_idx = result.importances_mean.argsort()
+
+fig, ax = plt.subplots()
+ax.boxplot(result.importances[sorted_idx].T,
+           vert=False, labels=X_test.columns[sorted_idx])
+ax.set_title("Permutation Importances (test set)")
+fig.tight_layout()
+plt.show()
+
+##############################################################################
+# It is also possible to compute the permutation importances on the training
+# set. This reveals that ``random_num`` gets a significantly higher importance
+# ranking than when computed on the test set. The difference between those two
+# plots is a confirmation that the RF model has enough capacity to use that
+# random numerical feature to overfit. You can further confirm this by
+# re-running this example with an RF constrained by ``min_samples_leaf=10``.
+result = permutation_importance(rf, X_train, y_train, n_repeats=10,
+                                random_state=42, n_jobs=2)
+sorted_idx = result.importances_mean.argsort()
+
+fig, ax = plt.subplots()
+ax.boxplot(result.importances[sorted_idx].T,
+           vert=False, labels=X_train.columns[sorted_idx])
+ax.set_title("Permutation Importances (train set)")
+fig.tight_layout()
+plt.show()
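
For readers who want to see what the permutation importance calls in this script measure, here is a minimal standard-library sketch of the idea: shuffle one column at a time and record the drop in score. It is not scikit-learn's implementation. The function `manual_permutation_importance`, the toy data, and the stand-in `toy_predict` model are all hypothetical, and plain accuracy is assumed as the score.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose prediction matches the label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


def manual_permutation_importance(predict, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling each column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only this column; all other columns stay in place.
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            permuted = [row[:col] + [value] + row[col + 1:]
                        for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data (hypothetical): column 0 determines the label, column 1 is noise.
data_rng = random.Random(42)
rows = [[i % 2, data_rng.random()] for i in range(200)]
labels = [row[0] for row in rows]


def toy_predict(row):
    # Stand-in for a fitted model that only looks at column 0.
    return row[0]


importances = manual_permutation_importance(toy_predict, rows, labels)
print("informative: %0.3f, noise: %0.3f" % tuple(importances))
```

Because the stand-in model never reads the noise column, shuffling it leaves the score unchanged and its importance is exactly zero. The same reasoning explains why the random features score close to zero on the held-out test set in the script above.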
