Commit e6c982b

Pushing the docs to dev/ for branch: main, commit 5dd58112fe34267194ed3e94e6c515046d3d0f34
1 parent 6f07167 commit e6c982b

File tree

1,311 files changed: +7322 / -5743 lines

dev/.buildinfo

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: dc9a128ce43f1ecc1882fa542db0c9fc
+config: 86795e0e9a1b9c914eee6483fa78d9c3
 tags: 645f666f9bcd5a90fca523b33c5a78b7

Lines changed: 147 additions & 0 deletions

@@ -0,0 +1,147 @@
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Release Highlights for scikit-learn 1.3\n\n.. currentmodule:: sklearn\n\nWe are pleased to announce the release of scikit-learn 1.3! Many bug fixes\nand improvements were added, as well as some new key features. We detail\nbelow a few of the major features of this release. **For an exhaustive list of\nall the changes**, please refer to the `release notes <changes_1_3>`.\n\nTo install the latest version (with pip)::\n\n    pip install --upgrade scikit-learn\n\nor with conda::\n\n    conda install -c conda-forge scikit-learn\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Metadata Routing\nWe are in the process of introducing a new way to route metadata such as\n``sample_weight`` throughout the codebase, which would affect how\nmeta-estimators such as :class:`pipeline.Pipeline` and\n:class:`model_selection.GridSearchCV` route metadata. While the\ninfrastructure for this feature is already included in this release, the work\nis ongoing and not all meta-estimators support this new feature. You can read\nmore about this feature in the `Metadata Routing User Guide\n<metadata_routing>`. Note that this feature is still under development and\nnot implemented for most meta-estimators.\n\nThird party developers can already start incorporating this into their\nmeta-estimators. For more details, see\n`metadata routing developer guide\n<sphx_glr_auto_examples_miscellaneous_plot_metadata_routing.py>`.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## HDBSCAN: hierarchical density-based clustering\nOriginally hosted in the scikit-learn-contrib repository, :class:`cluster.HDBSCAN`\nhas been adopted into scikit-learn. It's missing a few features from the original\nimplementation which will be added in future releases.\nBy performing a modified version of :class:`cluster.DBSCAN` over multiple epsilon\nvalues simultaneously, :class:`cluster.HDBSCAN` finds clusters of varying densities\nmaking it more robust to parameter selection than :class:`cluster.DBSCAN`.\nMore details in the `User Guide <hdbscan>`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\nfrom sklearn.cluster import HDBSCAN\nfrom sklearn.datasets import load_digits\nfrom sklearn.metrics import v_measure_score\n\nX, true_labels = load_digits(return_X_y=True)\nprint(f\"number of digits: {len(np.unique(true_labels))}\")\n\nhdbscan = HDBSCAN(min_cluster_size=15).fit(X)\nnon_noisy_labels = hdbscan.labels_[hdbscan.labels_ != -1]\nprint(f\"number of clusters found: {len(np.unique(non_noisy_labels))}\")\n\nprint(v_measure_score(true_labels[hdbscan.labels_ != -1], non_noisy_labels))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## TargetEncoder: a new category encoding strategy\nWell suited for categorical features with high cardinality,\n:class:`preprocessing.TargetEncoder` encodes the categories based on a shrunk\nestimate of the average target values for observations belonging to that category.\nMore details in the `User Guide <target_encoder>`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\nfrom sklearn.preprocessing import TargetEncoder\n\nX = np.array([[\"cat\"] * 30 + [\"dog\"] * 20 + [\"snake\"] * 38], dtype=object).T\ny = [90.3] * 30 + [20.4] * 20 + [21.2] * 38\n\nenc = TargetEncoder(random_state=0)\nX_trans = enc.fit_transform(X, y)\n\nenc.encodings_"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Missing values support in decision trees\nThe classes :class:`tree.DecisionTreeClassifier` and\n:class:`tree.DecisionTreeRegressor` now support missing values. For each potential\nthreshold on the non-missing data, the splitter will evaluate the split with all the\nmissing values going to the left node or the right node.\nMore details in the `User Guide <tree_missing_value_support>`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\nfrom sklearn.tree import DecisionTreeClassifier\n\nX = np.array([0, 1, 6, np.nan]).reshape(-1, 1)\ny = [0, 0, 1, 1]\n\ntree = DecisionTreeClassifier(random_state=0).fit(X, y)\ntree.predict(X)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## New display `model_selection.ValidationCurveDisplay`\n:class:`model_selection.ValidationCurveDisplay` is now available to plot results\nfrom :func:`model_selection.validation_curve`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.datasets import make_classification\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import ValidationCurveDisplay\n\nX, y = make_classification(1000, 10, random_state=0)\n\n_ = ValidationCurveDisplay.from_estimator(\n    LogisticRegression(),\n    X,\n    y,\n    param_name=\"C\",\n    param_range=np.geomspace(1e-5, 1e3, num=9),\n    score_type=\"both\",\n    score_name=\"Accuracy\",\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Gamma loss for gradient boosting\nThe class :class:`ensemble.HistGradientBoostingRegressor` supports the\nGamma deviance loss function via `loss=\"gamma\"`. This loss function is useful for\nmodeling strictly positive targets with a right-skewed distribution.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\nfrom sklearn.model_selection import cross_val_score\nfrom sklearn.datasets import make_low_rank_matrix\nfrom sklearn.ensemble import HistGradientBoostingRegressor\n\nn_samples, n_features = 500, 10\nrng = np.random.RandomState(0)\nX = make_low_rank_matrix(n_samples, n_features, random_state=rng)\ncoef = rng.uniform(low=-10, high=20, size=n_features)\ny = rng.gamma(shape=2, scale=np.exp(X @ coef) / 2)\ngbdt = HistGradientBoostingRegressor(loss=\"gamma\")\ncross_val_score(gbdt, X, y).mean()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Grouping infrequent categories in :class:`preprocessing.OrdinalEncoder`\nSimilarly to :class:`preprocessing.OneHotEncoder`, the class\n:class:`preprocessing.OrdinalEncoder` now supports aggregating infrequent categories\ninto a single output for each feature. The parameters to enable the gathering of\ninfrequent categories are `min_frequency` and `max_categories`.\nSee the `User Guide <encoder_infrequent_categories>` for more details.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.preprocessing import OrdinalEncoder\nimport numpy as np\n\nX = np.array(\n    [[\"dog\"] * 5 + [\"cat\"] * 20 + [\"rabbit\"] * 10 + [\"snake\"] * 3], dtype=object\n).T\nenc = OrdinalEncoder(min_frequency=6).fit(X)\nenc.infrequent_categories_"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.16"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}

Lines changed: 156 additions & 0 deletions

@@ -0,0 +1,156 @@
# flake8: noqa
"""
=======================================
Release Highlights for scikit-learn 1.3
=======================================

.. currentmodule:: sklearn

We are pleased to announce the release of scikit-learn 1.3! Many bug fixes
and improvements were added, as well as some new key features. We detail
below a few of the major features of this release. **For an exhaustive list of
all the changes**, please refer to the :ref:`release notes <changes_1_3>`.

To install the latest version (with pip)::

    pip install --upgrade scikit-learn

or with conda::

    conda install -c conda-forge scikit-learn

"""

# %%
# Metadata Routing
# ----------------
# We are in the process of introducing a new way to route metadata such as
# ``sample_weight`` throughout the codebase, which would affect how
# meta-estimators such as :class:`pipeline.Pipeline` and
# :class:`model_selection.GridSearchCV` route metadata. While the
# infrastructure for this feature is already included in this release, the work
# is ongoing and not all meta-estimators support this new feature. You can read
# more about this feature in the :ref:`Metadata Routing User Guide
# <metadata_routing>`. Note that this feature is still under development and
# not implemented for most meta-estimators.
#
# Third party developers can already start incorporating this into their
# meta-estimators. For more details, see
# :ref:`metadata routing developer guide
# <sphx_glr_auto_examples_miscellaneous_plot_metadata_routing.py>`.

# %%
# HDBSCAN: hierarchical density-based clustering
# ----------------------------------------------
# Originally hosted in the scikit-learn-contrib repository, :class:`cluster.HDBSCAN`
# has been adopted into scikit-learn. It's missing a few features from the original
# implementation which will be added in future releases.
# By performing a modified version of :class:`cluster.DBSCAN` over multiple epsilon
# values simultaneously, :class:`cluster.HDBSCAN` finds clusters of varying densities
# making it more robust to parameter selection than :class:`cluster.DBSCAN`.
# More details in the :ref:`User Guide <hdbscan>`.
import numpy as np
from sklearn.cluster import HDBSCAN
from sklearn.datasets import load_digits
from sklearn.metrics import v_measure_score

X, true_labels = load_digits(return_X_y=True)
print(f"number of digits: {len(np.unique(true_labels))}")

hdbscan = HDBSCAN(min_cluster_size=15).fit(X)
non_noisy_labels = hdbscan.labels_[hdbscan.labels_ != -1]
print(f"number of clusters found: {len(np.unique(non_noisy_labels))}")

print(v_measure_score(true_labels[hdbscan.labels_ != -1], non_noisy_labels))

# %%
# TargetEncoder: a new category encoding strategy
# -----------------------------------------------
# Well suited for categorical features with high cardinality,
# :class:`preprocessing.TargetEncoder` encodes the categories based on a shrunk
# estimate of the average target values for observations belonging to that category.
# More details in the :ref:`User Guide <target_encoder>`.
import numpy as np
from sklearn.preprocessing import TargetEncoder

X = np.array([["cat"] * 30 + ["dog"] * 20 + ["snake"] * 38], dtype=object).T
y = [90.3] * 30 + [20.4] * 20 + [21.2] * 38

enc = TargetEncoder(random_state=0)
X_trans = enc.fit_transform(X, y)

enc.encodings_

# %%
# Missing values support in decision trees
# ----------------------------------------
# The classes :class:`tree.DecisionTreeClassifier` and
# :class:`tree.DecisionTreeRegressor` now support missing values. For each potential
# threshold on the non-missing data, the splitter will evaluate the split with all the
# missing values going to the left node or the right node.
# More details in the :ref:`User Guide <tree_missing_value_support>`.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([0, 1, 6, np.nan]).reshape(-1, 1)
y = [0, 0, 1, 1]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
tree.predict(X)

# %%
# New display `model_selection.ValidationCurveDisplay`
# ----------------------------------------------------
# :class:`model_selection.ValidationCurveDisplay` is now available to plot results
# from :func:`model_selection.validation_curve`.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ValidationCurveDisplay

X, y = make_classification(1000, 10, random_state=0)

_ = ValidationCurveDisplay.from_estimator(
    LogisticRegression(),
    X,
    y,
    param_name="C",
    param_range=np.geomspace(1e-5, 1e3, num=9),
    score_type="both",
    score_name="Accuracy",
)

# %%
# Gamma loss for gradient boosting
# --------------------------------
# The class :class:`ensemble.HistGradientBoostingRegressor` supports the
# Gamma deviance loss function via `loss="gamma"`. This loss function is useful for
# modeling strictly positive targets with a right-skewed distribution.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_low_rank_matrix
from sklearn.ensemble import HistGradientBoostingRegressor

n_samples, n_features = 500, 10
rng = np.random.RandomState(0)
X = make_low_rank_matrix(n_samples, n_features, random_state=rng)
coef = rng.uniform(low=-10, high=20, size=n_features)
y = rng.gamma(shape=2, scale=np.exp(X @ coef) / 2)
gbdt = HistGradientBoostingRegressor(loss="gamma")
cross_val_score(gbdt, X, y).mean()

# %%
# Grouping infrequent categories in :class:`preprocessing.OrdinalEncoder`
# -----------------------------------------------------------------------
# Similarly to :class:`preprocessing.OneHotEncoder`, the class
# :class:`preprocessing.OrdinalEncoder` now supports aggregating infrequent categories
# into a single output for each feature. The parameters to enable the gathering of
# infrequent categories are `min_frequency` and `max_categories`.
# See the :ref:`User Guide <encoder_infrequent_categories>` for more details.
from sklearn.preprocessing import OrdinalEncoder
import numpy as np

X = np.array(
    [["dog"] * 5 + ["cat"] * 20 + ["rabbit"] * 10 + ["snake"] * 3], dtype=object
).T
enc = OrdinalEncoder(min_frequency=6).fit(X)
enc.infrequent_categories_
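
The Metadata Routing sections above describe the new per-estimator request API without an accompanying snippet. A minimal sketch of the user-facing side, assuming the ``enable_metadata_routing`` config flag and the generated ``set_fit_request`` / ``get_metadata_routing`` methods that ship with this infrastructure (most meta-estimators do not yet consume these requests in 1.3):

import sklearn
from sklearn.linear_model import LogisticRegression

# Opt in to the still-experimental routing machinery; it is off by default.
sklearn.set_config(enable_metadata_routing=True)

# Declare that this estimator should receive ``sample_weight`` whenever a
# supporting meta-estimator routes fit-time metadata to it.
log_reg = LogisticRegression().set_fit_request(sample_weight=True)

# Inspect the resulting routing information.
print(log_reg.get_metadata_routing())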

dev/_downloads/scikit-learn-docs.zip

75.3 KB
Binary file not shown.
