
Commit 68df614

Pushing the docs to dev/ for branch: master, commit 67c94c7e2a24bf5848ff8dc068849b3d83e3c96a

1 parent: 4168a04

File tree

1,203 files changed (+5157 lines added, -3742 lines removed)

Lines changed: 144 additions & 0 deletions
@@ -0,0 +1,144 @@
{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Post pruning decision trees with cost complexity pruning\n\n\n.. currentmodule:: sklearn.tree\n\nThe :class:`DecisionTreeClassifier` provides parameters such as\n``min_samples_leaf`` and ``max_depth`` to prevent a tree from overfitting. Cost\ncomplexity pruning provides another option to control the size of a tree. In\n:class:`DecisionTreeClassifier`, this pruning technique is parameterized by the\ncost complexity parameter, ``ccp_alpha``. Greater values of ``ccp_alpha``\nincrease the number of nodes pruned. Here we only show the effect of\n``ccp_alpha`` on regularizing the trees and how to choose a ``ccp_alpha``\nbased on validation scores.\n\nSee also :ref:`minimal_cost_complexity_pruning` for details on pruning.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(__doc__)\nimport matplotlib.pyplot as plt\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.datasets import load_breast_cancer\nfrom sklearn.tree import DecisionTreeClassifier"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Total impurity of leaves vs effective alphas of pruned tree\n---------------------------------------------------------------\nMinimal cost complexity pruning recursively finds the node with the \"weakest\nlink\". The weakest link is characterized by an effective alpha, where the\nnodes with the smallest effective alpha are pruned first. To get an idea of\nwhat values of ``ccp_alpha`` could be appropriate, scikit-learn provides\n:func:`DecisionTreeClassifier.cost_complexity_pruning_path` that returns the\neffective alphas and the corresponding total leaf impurities at each step of\nthe pruning process. As alpha increases, more of the tree is pruned, which\nincreases the total impurity of its leaves.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "X, y = load_breast_cancer(return_X_y=True)\nX_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n\nclf = DecisionTreeClassifier(random_state=0)\npath = clf.cost_complexity_pruning_path(X_train, y_train)\nccp_alphas, impurities = path.ccp_alphas, path.impurities"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In the following plot, the maximum effective alpha value is removed, because\nit is the trivial tree with only one node.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "fig, ax = plt.subplots()\nax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle=\"steps-post\")\nax.set_xlabel(\"effective alpha\")\nax.set_ylabel(\"total impurity of leaves\")\nax.set_title(\"Total Impurity vs effective alpha for training set\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Next, we train a decision tree using the effective alphas. The last value\nin ``ccp_alphas`` is the alpha value that prunes the whole tree,\nleaving the tree, ``clfs[-1]``, with one node.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "clfs = []\nfor ccp_alpha in ccp_alphas:\n    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha)\n    clf.fit(X_train, y_train)\n    clfs.append(clf)\nprint(\"Number of nodes in the last tree is: {} with ccp_alpha: {}\".format(\n      clfs[-1].tree_.node_count, ccp_alphas[-1]))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "For the remainder of this example, we remove the last element in\n``clfs`` and ``ccp_alphas``, because it is the trivial tree with only one\nnode. Here we show that the number of nodes and tree depth decrease as alpha\nincreases.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "clfs = clfs[:-1]\nccp_alphas = ccp_alphas[:-1]\n\nnode_counts = [clf.tree_.node_count for clf in clfs]\ndepth = [clf.tree_.max_depth for clf in clfs]\nfig, ax = plt.subplots(2, 1)\nax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle=\"steps-post\")\nax[0].set_xlabel(\"alpha\")\nax[0].set_ylabel(\"number of nodes\")\nax[0].set_title(\"Number of nodes vs alpha\")\nax[1].plot(ccp_alphas, depth, marker='o', drawstyle=\"steps-post\")\nax[1].set_xlabel(\"alpha\")\nax[1].set_ylabel(\"depth of tree\")\nax[1].set_title(\"Depth vs alpha\")\nfig.tight_layout()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Accuracy vs alpha for training and testing sets\n----------------------------------------------------\nWhen ``ccp_alpha`` is set to zero while keeping the other default parameters\nof :class:`DecisionTreeClassifier`, the tree overfits, leading to\n100% training accuracy and 88% testing accuracy. As alpha increases, more\nof the tree is pruned, thus creating a decision tree that generalizes better.\nIn this example, setting ``ccp_alpha=0.015`` maximizes the testing accuracy.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "train_scores = [clf.score(X_train, y_train) for clf in clfs]\ntest_scores = [clf.score(X_test, y_test) for clf in clfs]\n\nfig, ax = plt.subplots()\nax.set_xlabel(\"alpha\")\nax.set_ylabel(\"accuracy\")\nax.set_title(\"Accuracy vs alpha for training and testing sets\")\nax.plot(ccp_alphas, train_scores, marker='o', label=\"train\",\n        drawstyle=\"steps-post\")\nax.plot(ccp_alphas, test_scores, marker='o', label=\"test\",\n        drawstyle=\"steps-post\")\nax.legend()\nplt.show()"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
Binary file not shown.
Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
"""
========================================================
Post pruning decision trees with cost complexity pruning
========================================================

.. currentmodule:: sklearn.tree

The :class:`DecisionTreeClassifier` provides parameters such as
``min_samples_leaf`` and ``max_depth`` to prevent a tree from overfitting. Cost
complexity pruning provides another option to control the size of a tree. In
:class:`DecisionTreeClassifier`, this pruning technique is parameterized by the
cost complexity parameter, ``ccp_alpha``. Greater values of ``ccp_alpha``
increase the number of nodes pruned. Here we only show the effect of
``ccp_alpha`` on regularizing the trees and how to choose a ``ccp_alpha``
based on validation scores.

See also :ref:`minimal_cost_complexity_pruning` for details on pruning.
"""

print(__doc__)
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

###############################################################################
# Total impurity of leaves vs effective alphas of pruned tree
# ---------------------------------------------------------------
# Minimal cost complexity pruning recursively finds the node with the "weakest
# link". The weakest link is characterized by an effective alpha, where the
# nodes with the smallest effective alpha are pruned first. To get an idea of
# what values of ``ccp_alpha`` could be appropriate, scikit-learn provides
# :func:`DecisionTreeClassifier.cost_complexity_pruning_path` that returns the
# effective alphas and the corresponding total leaf impurities at each step of
# the pruning process. As alpha increases, more of the tree is pruned, which
# increases the total impurity of its leaves.
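#
# For reference, the standard definitions behind this pruning scheme (Breiman
# et al., as described in the scikit-learn user guide): the cost-complexity
# measure of a tree T is R_alpha(T) = R(T) + alpha * |T|, where R(T) is the
# total impurity of its leaves and |T| is the number of leaves. The effective
# alpha of an internal node t is the value at which collapsing the subtree
# T_t rooted at t breaks even: alpha_eff(t) = (R(t) - R(T_t)) / (|T_t| - 1).
# Nodes with the smallest alpha_eff are pruned first.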
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

###############################################################################
# In the following plot, the maximum effective alpha value is removed, because
# it is the trivial tree with only one node.
fig, ax = plt.subplots()
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")

###############################################################################
# Next, we train a decision tree using the effective alphas. The last value
# in ``ccp_alphas`` is the alpha value that prunes the whole tree,
# leaving the tree, ``clfs[-1]``, with one node.
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
      clfs[-1].tree_.node_count, ccp_alphas[-1]))
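
###############################################################################
# Note that the alphas returned by the pruning path are the thresholds at
# which the pruned tree actually changes, so fitting at exactly these values
# (rather than on an arbitrary grid) visits each distinct pruned subtree once.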

###############################################################################
# For the remainder of this example, we remove the last element in
# ``clfs`` and ``ccp_alphas``, because it is the trivial tree with only one
# node. Here we show that the number of nodes and tree depth decrease as alpha
# increases.
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1)
ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()

###############################################################################
# Accuracy vs alpha for training and testing sets
# ----------------------------------------------------
# When ``ccp_alpha`` is set to zero while keeping the other default parameters
# of :class:`DecisionTreeClassifier`, the tree overfits, leading to
# 100% training accuracy and 88% testing accuracy. As alpha increases, more
# of the tree is pruned, thus creating a decision tree that generalizes better.
# In this example, setting ``ccp_alpha=0.015`` maximizes the testing accuracy.
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]

fig, ax = plt.subplots()
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train",
        drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test",
        drawstyle="steps-post")
ax.legend()
plt.show()
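
###############################################################################
# The plot above reads the best alpha off the test curve. As a minimal sketch
# of an alternative that keeps the test set untouched until final evaluation,
# ``ccp_alpha`` could instead be chosen by cross-validation on the training
# set with ``GridSearchCV``, reusing the candidate alphas from the pruning
# path computed above.
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                           param_grid={'ccp_alpha': list(ccp_alphas)},
                           cv=5)
grid_search.fit(X_train, y_train)
print("Best ccp_alpha by cross-validation: {:.4f} "
      "(mean CV accuracy: {:.3f})".format(
          grid_search.best_params_['ccp_alpha'], grid_search.best_score_))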
Binary file not shown.

dev/_downloads/scikit-learn-docs.pdf

79.2 KB
Binary file not shown.

dev/_images/iris.png


0 commit comments
