
Commit 9914c4c

Pushing the docs to dev/ for branch: main, commit 2415a1b11dff64ca758145abc32a452f3ee8850d
1 parent 086f75a commit 9914c4c


1,310 files changed: +6906 −5276 lines changed


dev/.buildinfo

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: f95a1e1bdcdb49fd4081558cb02017e9
+config: 5df56ca14f574a7114a1040b9afbd167
 tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file not shown.
Binary file not shown.
Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Comparing Random Forests and Histogram Gradient Boosting models\n\nIn this example we compare the performance of Random Forest (RF) and Histogram\nGradient Boosting (HGBT) models in terms of score and computation time for a\nregression dataset, though **all the concepts presented here apply to\nclassification as well**.\n\nThe comparison is made by varying the parameters that control the number of\ntrees according to each estimator:\n\n- `n_estimators` controls the number of trees in the forest. It's a fixed number.\n- `max_iter` is the maximum number of iterations in a gradient boosting\n based model. The number of iterations corresponds to the number of trees for\n regression and binary classification problems. Furthermore, the actual number\n of trees required by the model depends on the stopping criteria.\n\nHGBT uses gradient boosting to iteratively improve the model's performance by\nfitting each tree to the negative gradient of the loss function with respect to\nthe predicted value. RFs, on the other hand, are based on bagging and use a\nmajority vote to predict the outcome.\n\nFor more information on ensemble models, see the `User Guide <ensemble>`.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Author: Arturo Amor <[email protected]>\n# License: BSD 3 clause"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Load dataset\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.datasets import fetch_california_housing\n\nX, y = fetch_california_housing(return_X_y=True, as_frame=True)\nn_samples, n_features = X.shape"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "HGBT uses a histogram-based algorithm on binned feature values that can\nefficiently handle large datasets (tens of thousands of samples or more) with\na high number of features (see `Why_it's_faster`). The scikit-learn\nimplementation of RF does not use binning and relies on exact splitting, which\ncan be computationally expensive.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(f\"The dataset consists of {n_samples} samples and {n_features} features\")"
      ]
    },
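    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a minimal illustration of this speed difference (a sketch added for clarity,\nnot part of the original example), one can time a single fit of each model on\nthis dataset. The tree budgets below are arbitrary but roughly comparable, and\nabsolute timings depend on the machine.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import time\n\nfrom sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor\n\n# Roughly comparable tree budgets, chosen arbitrarily for illustration.\nfor name, model in {\n    \"Random Forest\": RandomForestRegressor(n_estimators=100, random_state=0),\n    \"Hist Gradient Boosting\": HistGradientBoostingRegressor(max_iter=100, random_state=0),\n}.items():\n    start = time.perf_counter()\n    model.fit(X, y)\n    print(f\"{name}: fit in {time.perf_counter() - start:.2f}s\")"
      ]
    },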
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Compute score and computation times\n\nNotice that many parts of the implementation of\n:class:`~sklearn.ensemble.HistGradientBoostingClassifier` and\n:class:`~sklearn.ensemble.HistGradientBoostingRegressor` are parallelized by\ndefault.\n\nThe implementation of :class:`~sklearn.ensemble.RandomForestRegressor` and\n:class:`~sklearn.ensemble.RandomForestClassifier` can also be run on multiple\ncores by using the `n_jobs` parameter, here set to match the number of\nphysical cores on the host machine. See `parallelism` for more\ninformation.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import joblib\n\nN_CORES = joblib.cpu_count(only_physical_cores=True)\nprint(f\"Number of physical cores: {N_CORES}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Unlike RF, HGBT models offer an early-stopping option (see\n`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_early_stopping.py`)\nto avoid adding new unnecessary trees. Internally, the algorithm uses an\nout-of-sample set to compute the generalization performance of the model at\neach addition of a tree. Thus, if the generalization performance is not\nimproving for more than `n_iter_no_change` iterations, it stops adding trees.\n\nThe other parameters of both models were tuned but the procedure is not shown\nhere to keep the example simple.\n\n"
      ]
    },
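    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The cell below is a small sketch of this early-stopping mechanism (added for\nillustration, not part of the original example): with `early_stopping=True`,\nthe fitted `n_iter_` attribute reports how many trees were actually built out\nof the `max_iter` budget. The parameter values are arbitrary.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.ensemble import HistGradientBoostingRegressor\n\n# Stop once the validation score has not improved for 10 consecutive iterations.\nhgbt_es = HistGradientBoostingRegressor(\n    max_iter=1_000,\n    early_stopping=True,\n    n_iter_no_change=10,\n    validation_fraction=0.1,\n    random_state=0,\n).fit(X, y)\nprint(f\"Trees actually built: {hgbt_es.n_iter_} out of a budget of 1000\")"
      ]
    },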
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import pandas as pd\nfrom sklearn.ensemble import HistGradientBoostingRegressor\nfrom sklearn.ensemble import RandomForestRegressor\nfrom sklearn.model_selection import GridSearchCV, KFold\n\nmodels = {\n    \"Random Forest\": RandomForestRegressor(\n        min_samples_leaf=5, random_state=0, n_jobs=N_CORES\n    ),\n    \"Hist Gradient Boosting\": HistGradientBoostingRegressor(\n        max_leaf_nodes=15, random_state=0, early_stopping=False\n    ),\n}\nparam_grids = {\n    \"Random Forest\": {\"n_estimators\": [10, 20, 50, 100]},\n    \"Hist Gradient Boosting\": {\"max_iter\": [10, 20, 50, 100, 300, 500]},\n}\ncv = KFold(n_splits=4, shuffle=True, random_state=0)\n\nresults = []\nfor name, model in models.items():\n    grid_search = GridSearchCV(\n        estimator=model,\n        param_grid=param_grids[name],\n        return_train_score=True,\n        cv=cv,\n    ).fit(X, y)\n    result = {\"model\": name, \"cv_results\": pd.DataFrame(grid_search.cv_results_)}\n    results.append(result)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        ".. Note::\n   Tuning the `n_estimators` for RF generally results in a waste of computing\n   power. In practice one just needs to ensure that it is large enough so that\n   doubling its value does not lead to a significant improvement of the test\n   score.\n\n## Plot results\nWe can use a [plotly.express.scatter](https://plotly.com/python-api-reference/generated/plotly.express.scatter.html)\nto visualize the trade-off between elapsed computing time and mean test score.\nPassing the cursor over a given point displays the corresponding parameters.\nError bars correspond to one standard deviation as computed in the different\nfolds of the cross-validation.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import plotly.express as px\nimport plotly.colors as colors\nfrom plotly.subplots import make_subplots\n\nfig = make_subplots(\n    rows=1,\n    cols=2,\n    shared_yaxes=True,\n    subplot_titles=[\"Train time vs score\", \"Predict time vs score\"],\n)\nmodel_names = [result[\"model\"] for result in results]\ncolors_list = colors.qualitative.Plotly * (\n    len(model_names) // len(colors.qualitative.Plotly) + 1\n)\n\nfor idx, result in enumerate(results):\n    cv_results = result[\"cv_results\"].round(3)\n    model_name = result[\"model\"]\n    param_name = list(param_grids[model_name].keys())[0]\n    cv_results[param_name] = cv_results[\"param_\" + param_name]\n    cv_results[\"model\"] = model_name\n\n    scatter_fig = px.scatter(\n        cv_results,\n        x=\"mean_fit_time\",\n        y=\"mean_test_score\",\n        error_x=\"std_fit_time\",\n        error_y=\"std_test_score\",\n        hover_data=param_name,\n        color=\"model\",\n    )\n    line_fig = px.line(\n        cv_results,\n        x=\"mean_fit_time\",\n        y=\"mean_test_score\",\n    )\n\n    scatter_trace = scatter_fig[\"data\"][0]\n    line_trace = line_fig[\"data\"][0]\n    scatter_trace.update(marker=dict(color=colors_list[idx]))\n    line_trace.update(line=dict(color=colors_list[idx]))\n    fig.add_trace(scatter_trace, row=1, col=1)\n    fig.add_trace(line_trace, row=1, col=1)\n\n    scatter_fig = px.scatter(\n        cv_results,\n        x=\"mean_score_time\",\n        y=\"mean_test_score\",\n        error_x=\"std_score_time\",\n        error_y=\"std_test_score\",\n        hover_data=param_name,\n    )\n    line_fig = px.line(\n        cv_results,\n        x=\"mean_score_time\",\n        y=\"mean_test_score\",\n    )\n\n    scatter_trace = scatter_fig[\"data\"][0]\n    line_trace = line_fig[\"data\"][0]\n    scatter_trace.update(marker=dict(color=colors_list[idx]))\n    line_trace.update(line=dict(color=colors_list[idx]))\n    fig.add_trace(scatter_trace, row=1, col=2)\n    fig.add_trace(line_trace, row=1, col=2)\n\nfig.update_layout(\n    xaxis=dict(title=\"Train time (s) - lower is better\"),\n    yaxis=dict(title=\"Test R2 score - higher is better\"),\n    xaxis2=dict(title=\"Predict time (s) - lower is better\"),\n    legend=dict(x=0.72, y=0.05, traceorder=\"normal\", borderwidth=1),\n    title=dict(x=0.5, text=\"Speed-score trade-off of tree-based ensembles\"),\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Both HGBT and RF models improve when increasing the number of trees in the\nensemble. However, the scores reach a plateau where adding new trees just\nmakes fitting and scoring slower. The RF model reaches such a plateau earlier\nand can never reach the test score of the largest HGBT model.\n\nNote that the results shown on the above plot can change slightly across runs\nand even more significantly when running on other machines: try to run this\nexample on your own local machine.\n\nOverall, one should often observe that the histogram-based gradient boosting\nmodels uniformly dominate the random forest models in the \"test score vs\ntraining speed trade-off\" (the HGBT curve should be on the top left of the RF\ncurve, without ever crossing). The \"test score vs prediction speed\" trade-off\ncan also be more disputed, but it is most often favorable to HGBT. It is always\na good idea to check both kinds of model (with hyper-parameter tuning) and\ncompare their performance on your specific problem to determine which model is\nthe best fit, but **HGBT almost always offers a more favorable speed-accuracy\ntrade-off than RF**, either with the default hyper-parameters or including the\nhyper-parameter tuning cost.\n\nThere is one exception to this rule of thumb though: when training a\nmulticlass classification model with a large number of possible classes, HGBT\ninternally fits one tree per class at each boosting iteration, while the trees\nused by the RF models are naturally multiclass, which should improve the\nspeed-accuracy trade-off of the RF models in this case.\n\n"
      ]
    },
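    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The cell below sketches this exception on a synthetic problem (added for\nillustration, not part of the original example); the number of classes,\nsamples and tree budgets are arbitrary choices.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import time\n\nfrom sklearn.datasets import make_classification\nfrom sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier\n\n# Synthetic multiclass problem with an arbitrary, fairly large number of classes.\nX_mc, y_mc = make_classification(\n    n_samples=10_000, n_features=20, n_informative=15, n_classes=10, random_state=0\n)\n\nfor name, clf in {\n    \"Random Forest\": RandomForestClassifier(n_estimators=100, random_state=0),\n    \"Hist Gradient Boosting\": HistGradientBoostingClassifier(max_iter=100, random_state=0),\n}.items():\n    start = time.perf_counter()\n    clf.fit(X_mc, y_mc)\n    print(f\"{name}: fit in {time.perf_counter() - start:.2f}s\")"
      ]
    }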
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.16"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
