
Commit a39097c

Pushing the docs to dev/ for branch: main, commit 1dc35498de3567942991696e961ec28c16b3c913
1 parent da801fa commit a39097c

File tree: 1,328 files changed (+7901 / -7738 lines)


dev/.buildinfo

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: afc687cf5f622f1c7392385790b126f5
+config: 8ee13000cd92405fe0369b3f68a054f7
 tags: 645f666f9bcd5a90fca523b33c5a78b7
(Two binary files changed; contents not shown.)

dev/_downloads/8452fc8dfe9850cfdaa1b758e5a2748b/plot_gradient_boosting_early_stopping.ipynb

Lines changed: 45 additions & 6 deletions
@@ -4,7 +4,32 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"\n# Early stopping in Gradient Boosting\n\nGradient boosting is an ensembling technique where several weak learners\n(regression trees) are combined to yield a powerful single model, in an\niterative fashion.\n\nEarly stopping support in Gradient Boosting enables us to find the least number\nof iterations which is sufficient to build a model that generalizes well to\nunseen data.\n\nThe concept of early stopping is simple. We specify a ``validation_fraction``\nwhich denotes the fraction of the whole dataset that will be kept aside from\ntraining to assess the validation loss of the model. The gradient boosting\nmodel is trained using the training set and evaluated using the validation set.\nWhen each additional stage of regression tree is added, the validation set is\nused to score the model. This is continued until the scores of the model in\nthe last ``n_iter_no_change`` stages do not improve by at least `tol`. After\nthat the model is considered to have converged and further addition of stages\nis \"stopped early\".\n\nThe number of stages of the final model is available at the attribute\n``n_estimators_``.\n\nThis example illustrates how the early stopping can used in the\n:class:`~sklearn.ensemble.GradientBoostingClassifier` model to achieve\nalmost the same accuracy as compared to a model built without early stopping\nusing many fewer estimators. This can significantly reduce training time,\nmemory usage and prediction latency.\n"
+"\n# Early stopping in Gradient Boosting\n\nGradient Boosting is an ensemble technique that combines multiple weak\nlearners, typically decision trees, to create a robust and powerful\npredictive model. It does so in an iterative fashion, where each new stage\n(tree) corrects the errors of the previous ones.\n\nEarly stopping is a technique in Gradient Boosting that allows us to find\nthe optimal number of iterations required to build a model that generalizes\nwell to unseen data and avoids overfitting. The concept is simple: we set\naside a portion of our dataset as a validation set (specified using\n`validation_fraction`) to assess the model's performance during training.\nAs the model is iteratively built with additional stages (trees), its\nperformance on the validation set is monitored as a function of the\nnumber of steps.\n\nEarly stopping becomes effective when the model's performance on the\nvalidation set plateaus or worsens (within deviations specified by `tol`)\nover a certain number of consecutive stages (specified by `n_iter_no_change`).\nThis signals that the model has reached a point where further iterations may\nlead to overfitting, and it's time to stop training.\n\nThe number of estimators (trees) in the final model, when early stopping is\napplied, can be accessed using the `n_estimators_` attribute. Overall, early\nstopping is a valuable tool to strike a balance between model performance and\nefficiency in gradient boosting.\n\nLicense: BSD 3 clause\n"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Data Preparation\nFirst we load and prepares the California Housing Prices dataset for\ntraining and evaluation. It subsets the dataset, splits it into training\nand validation sets.\n\n"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"import time\n\nimport matplotlib.pyplot as plt\n\nfrom sklearn.datasets import fetch_california_housing\nfrom sklearn.ensemble import GradientBoostingRegressor\nfrom sklearn.metrics import mean_squared_error\nfrom sklearn.model_selection import train_test_split\n\ndata = fetch_california_housing()\nX, y = data.data[:600], data.target[:600]\n\nX_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Model Training and Comparison\nTwo :class:`~sklearn.ensemble.GradientBoostingRegressor` models are trained:\none with and another without early stopping. The purpose is to compare their\nperformance. It also calculates the training time and the `n_estimators_`\nused by both models.\n\n"
 ]
 },
 {
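The introductory cell added in this hunk describes how early stopping is controlled through `validation_fraction`, `n_iter_no_change`, and `tol`, with the fitted stage count exposed as `n_estimators_`. As a minimal sketch of that mechanism (not part of this commit, using an assumed synthetic dataset rather than the notebook's data):

# Minimal sketch (not from this commit): the early-stopping parameters described
# in the notebook's introduction, applied to a GradientBoostingClassifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of boosting stages
    validation_fraction=0.1,  # fraction of the training data held out internally
    n_iter_no_change=10,      # stop after 10 stages without improvement ...
    tol=1e-4,                 # ... of at least this much in the validation score
    random_state=0,
)
clf.fit(X_train, y_train)
print(clf.n_estimators_)      # stages actually fitted, typically far below 500
print(clf.score(X_test, y_test))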
@@ -15,14 +40,14 @@
 },
 "outputs": [],
 "source": [
-"# Authors: Vighnesh Birodkar <[email protected]>\n# Raghav RV <[email protected]>\n# License: BSD 3 clause\n\nimport time\n\nimport matplotlib.pyplot as plt\nimport numpy as np\n\nfrom sklearn import datasets, ensemble\nfrom sklearn.model_selection import train_test_split\n\ndata_list = [\n datasets.load_iris(return_X_y=True),\n datasets.make_classification(n_samples=800, random_state=0),\n datasets.make_hastie_10_2(n_samples=2000, random_state=0),\n]\nnames = [\"Iris Data\", \"Classification Data\", \"Hastie Data\"]\n\nn_gb = []\nscore_gb = []\ntime_gb = []\nn_gbes = []\nscore_gbes = []\ntime_gbes = []\n\nn_estimators = 200\n\nfor X, y in data_list:\n X_train, X_test, y_train, y_test = train_test_split(\n X, y, test_size=0.2, random_state=0\n )\n\n # We specify that if the scores don't improve by at least 0.01 for the last\n # 10 stages, stop fitting additional stages\n gbes = ensemble.GradientBoostingClassifier(\n n_estimators=n_estimators,\n validation_fraction=0.2,\n n_iter_no_change=5,\n tol=0.01,\n random_state=0,\n )\n gb = ensemble.GradientBoostingClassifier(n_estimators=n_estimators, random_state=0)\n start = time.time()\n gb.fit(X_train, y_train)\n time_gb.append(time.time() - start)\n\n start = time.time()\n gbes.fit(X_train, y_train)\n time_gbes.append(time.time() - start)\n\n score_gb.append(gb.score(X_test, y_test))\n score_gbes.append(gbes.score(X_test, y_test))\n\n n_gb.append(gb.n_estimators_)\n n_gbes.append(gbes.n_estimators_)\n\nbar_width = 0.2\nn = len(data_list)\nindex = np.arange(0, n * bar_width, bar_width) * 2.5\nindex = index[0:n]"
+"params = dict(n_estimators=1000, max_depth=5, learning_rate=0.1, random_state=42)\n\ngbm_full = GradientBoostingRegressor(**params)\ngbm_early_stopping = GradientBoostingRegressor(\n **params,\n validation_fraction=0.1,\n n_iter_no_change=10,\n)\n\nstart_time = time.time()\ngbm_full.fit(X_train, y_train)\ntraining_time_full = time.time() - start_time\nn_estimators_full = gbm_full.n_estimators_\n\nstart_time = time.time()\ngbm_early_stopping.fit(X_train, y_train)\ntraining_time_early_stopping = time.time() - start_time\nestimators_early_stopping = gbm_early_stopping.n_estimators_"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Compare scores with and without early stopping\n\n"
+"## Error Calculation\nThe code calculates the :func:`~sklearn.metrics.mean_squared_error` for both\ntraining and validation datasets for the models trained in the previous\nsection. It computes the errors for each boosting iteration. The purpose is\nto assess the performance and convergence of the models.\n\n"
 ]
 },
 {
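The training cell added in this hunk fits `gbm_full` and `gbm_early_stopping` and records `training_time_full`, `training_time_early_stopping`, `n_estimators_full`, and `estimators_early_stopping`. A small hypothetical follow-up (not in the commit) that prints the comparison which the later bar chart visualizes, assuming those variables are in scope:

# Hypothetical follow-up (not in the commit): textual summary of the comparison,
# reusing the variables defined in the training cell above.
print(f"Without early stopping: {n_estimators_full} trees, {training_time_full:.2f}s")
print(
    f"With early stopping:    {estimators_early_stopping} trees, "
    f"{training_time_early_stopping:.2f}s"
)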
@@ -33,14 +58,14 @@
 },
 "outputs": [],
 "source": [
-"plt.figure(figsize=(9, 5))\n\nbar1 = plt.bar(\n index, score_gb, bar_width, label=\"Without early stopping\", color=\"crimson\"\n)\nbar2 = plt.bar(\n index + bar_width, score_gbes, bar_width, label=\"With early stopping\", color=\"coral\"\n)\n\nplt.xticks(index + bar_width, names)\nplt.yticks(np.arange(0, 1.3, 0.1))\n\n\ndef autolabel(rects, n_estimators):\n \"\"\"\n Attach a text label above each bar displaying n_estimators of each model\n \"\"\"\n for i, rect in enumerate(rects):\n plt.text(\n rect.get_x() + rect.get_width() / 2.0,\n 1.05 * rect.get_height(),\n \"n_est=%d\" % n_estimators[i],\n ha=\"center\",\n va=\"bottom\",\n )\n\n\nautolabel(bar1, n_gb)\nautolabel(bar2, n_gbes)\n\nplt.ylim([0, 1.3])\nplt.legend(loc=\"best\")\nplt.grid(True)\n\nplt.xlabel(\"Datasets\")\nplt.ylabel(\"Test score\")\n\nplt.show()"
+"train_errors_without = []\nval_errors_without = []\n\ntrain_errors_with = []\nval_errors_with = []\n\nfor i, (train_pred, val_pred) in enumerate(\n zip(\n gbm_full.staged_predict(X_train),\n gbm_full.staged_predict(X_val),\n )\n):\n train_errors_without.append(mean_squared_error(y_train, train_pred))\n val_errors_without.append(mean_squared_error(y_val, val_pred))\n\nfor i, (train_pred, val_pred) in enumerate(\n zip(\n gbm_early_stopping.staged_predict(X_train),\n gbm_early_stopping.staged_predict(X_val),\n )\n):\n train_errors_with.append(mean_squared_error(y_train, train_pred))\n val_errors_with.append(mean_squared_error(y_val, val_pred))"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Compare fit times with and without early stopping\n\n"
+"## Visualize Comparision\nIt includes three subplots:\n1. Plotting training errors of both models over boosting iterations.\n2. Plotting validation errors of both models over boosting iterations.\n3. Creating a bar chart to compare the training times and the estimator used\nof the models with and without early stopping.\n\n"
 ]
 },
 {
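The error-calculation cell added in this hunk collects per-stage MSE values via `staged_predict`. As a sketch (not part of the commit) of how the best iteration can be read directly from those lists, assuming `val_errors_with` has been populated as in the cell above:

# Hypothetical sketch: locate the boosting stage with the lowest validation MSE,
# assuming val_errors_with was filled by the staged_predict loop above.
import numpy as np

best_stage = int(np.argmin(val_errors_with))
print(f"Lowest validation MSE at stage {best_stage + 1}: {val_errors_with[best_stage]:.4f}")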
@@ -51,7 +76,21 @@
 },
 "outputs": [],
 "source": [
-"plt.figure(figsize=(9, 5))\n\nbar1 = plt.bar(\n index, time_gb, bar_width, label=\"Without early stopping\", color=\"crimson\"\n)\nbar2 = plt.bar(\n index + bar_width, time_gbes, bar_width, label=\"With early stopping\", color=\"coral\"\n)\n\nmax_y = np.amax(np.maximum(time_gb, time_gbes))\n\nplt.xticks(index + bar_width, names)\nplt.yticks(np.linspace(0, 1.3 * max_y, 13))\n\nautolabel(bar1, n_gb)\nautolabel(bar2, n_gbes)\n\nplt.ylim([0, 1.3 * max_y])\nplt.legend(loc=\"best\")\nplt.grid(True)\n\nplt.xlabel(\"Datasets\")\nplt.ylabel(\"Fit Time\")\n\nplt.show()"
+"fig, axes = plt.subplots(ncols=3, figsize=(12, 4))\n\naxes[0].plot(train_errors_without, label=\"gbm_full\")\naxes[0].plot(train_errors_with, label=\"gbm_early_stopping\")\naxes[0].set_xlabel(\"Boosting Iterations\")\naxes[0].set_ylabel(\"MSE (Training)\")\naxes[0].set_yscale(\"log\")\naxes[0].legend()\naxes[0].set_title(\"Training Error\")\n\naxes[1].plot(val_errors_without, label=\"gbm_full\")\naxes[1].plot(val_errors_with, label=\"gbm_early_stopping\")\naxes[1].set_xlabel(\"Boosting Iterations\")\naxes[1].set_ylabel(\"MSE (Validation)\")\naxes[1].set_yscale(\"log\")\naxes[1].legend()\naxes[1].set_title(\"Validation Error\")\n\ntraining_times = [training_time_full, training_time_early_stopping]\nlabels = [\"gbm_full\", \"gbm_early_stopping\"]\nbars = axes[2].bar(labels, training_times)\naxes[2].set_ylabel(\"Training Time (s)\")\n\nfor bar, n_estimators in zip(bars, [n_estimators_full, estimators_early_stopping]):\n height = bar.get_height()\n axes[2].text(\n bar.get_x() + bar.get_width() / 2,\n height + 0.001,\n f\"Estimators: {n_estimators}\",\n ha=\"center\",\n va=\"bottom\",\n )\n\nplt.tight_layout()\nplt.show()"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"The difference in training error between the `gbm_full` and the\n`gbm_early_stopping` stems from the fact that `gbm_early_stopping` sets\naside `validation_fraction` of the training data as internal validation set.\nEarly stopping is decided based on this internal validation score.\n\n"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Summary\nIn our example with the :class:`~sklearn.ensemble.GradientBoostingRegressor`\nmodel on the California Housing Prices dataset, we have demonstrated the\npractical benefits of early stopping:\n\n- **Preventing Overfitting:** We showed how the validation error stabilizes\nor starts to increase after a certain point, indicating that the model\ngeneralizes better to unseen data. This is achieved by stopping the training\nprocess before overfitting occurs.\n\n- **Improving Training Efficiency:** We compared training times between\nmodels with and without early stopping. The model with early stopping\nachieved comparable accuracy while requiring significantly fewer\nestimators, resulting in faster training.\n\n"
 ]
 }
 ],
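The closing cells added in this hunk note that `gbm_early_stopping` reserves `validation_fraction` of `X_train` as an internal validation set, which explains the gap in training error. A rough back-of-the-envelope check of the sample counts implied by the notebook's splits (not part of the commit, assuming the 600-sample subset used earlier):

# Rough check (not in the commit): sample counts implied by the notebook's splits.
n_total = 600                          # subset taken from California Housing
n_train = int(n_total * 0.8)           # 480 after the 80/20 train/validation split
n_internal_val = int(n_train * 0.1)    # 48 held out by validation_fraction=0.1
n_fit = n_train - n_internal_val       # 432 samples actually used to grow the trees
print(n_train, n_internal_val, n_fit)  # 480 48 432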
