Commit e6c982b

Pushing the docs to dev/ for branch: main, commit 5dd58112fe34267194ed3e94e6c515046d3d0f34
1 parent 6f07167 commit e6c982b

File tree

1,311 files changed: +7322 / -5743 lines

dev/.buildinfo

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: dc9a128ce43f1ecc1882fa542db0c9fc
+config: 86795e0e9a1b9c914eee6483fa78d9c3
 tags: 645f666f9bcd5a90fca523b33c5a78b7

Lines changed: 147 additions & 0 deletions

@@ -0,0 +1,147 @@
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Release Highlights for scikit-learn 1.3\n\n.. currentmodule:: sklearn\n\nWe are pleased to announce the release of scikit-learn 1.3! Many bug fixes\nand improvements were added, as well as some new key features. We detail\nbelow a few of the major features of this release. **For an exhaustive list of\nall the changes**, please refer to the `release notes <changes_1_3>`.\n\nTo install the latest version (with pip)::\n\n    pip install --upgrade scikit-learn\n\nor with conda::\n\n    conda install -c conda-forge scikit-learn\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Metadata Routing\nWe are in the process of introducing a new way to route metadata such as\n``sample_weight`` throughout the codebase, which would affect how\nmeta-estimators such as :class:`pipeline.Pipeline` and\n:class:`model_selection.GridSearchCV` route metadata. While the\ninfrastructure for this feature is already included in this release, the work\nis ongoing and not all meta-estimators support this new feature. You can read\nmore about this feature in the `Metadata Routing User Guide\n<metadata_routing>`. Note that this feature is still under development and\nnot implemented for most meta-estimators.\n\nThird party developers can already start incorporating this into their\nmeta-estimators. For more details, see\n`metadata routing developer guide\n<sphx_glr_auto_examples_miscellaneous_plot_metadata_routing.py>`.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## HDBSCAN: hierarchical density-based clustering\nOriginally hosted in the scikit-learn-contrib repository, :class:`cluster.HDBSCAN`\nhas been adopted into scikit-learn. It's missing a few features from the original\nimplementation which will be added in future releases.\nBy performing a modified version of :class:`cluster.DBSCAN` over multiple epsilon\nvalues simultaneously, :class:`cluster.HDBSCAN` finds clusters of varying densities\nmaking it more robust to parameter selection than :class:`cluster.DBSCAN`.\nMore details in the `User Guide <hdbscan>`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\nfrom sklearn.cluster import HDBSCAN\nfrom sklearn.datasets import load_digits\nfrom sklearn.metrics import v_measure_score\n\nX, true_labels = load_digits(return_X_y=True)\nprint(f\"number of digits: {len(np.unique(true_labels))}\")\n\nhdbscan = HDBSCAN(min_cluster_size=15).fit(X)\nnon_noisy_labels = hdbscan.labels_[hdbscan.labels_ != -1]\nprint(f\"number of clusters found: {len(np.unique(non_noisy_labels))}\")\n\nprint(v_measure_score(true_labels[hdbscan.labels_ != -1], non_noisy_labels))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## TargetEncoder: a new category encoding strategy\nWell suited for categorical features with high cardinality,\n:class:`preprocessing.TargetEncoder` encodes the categories based on a shrunk\nestimate of the average target values for observations belonging to that category.\nMore details in the `User Guide <target_encoder>`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\nfrom sklearn.preprocessing import TargetEncoder\n\nX = np.array([[\"cat\"] * 30 + [\"dog\"] * 20 + [\"snake\"] * 38], dtype=object).T\ny = [90.3] * 30 + [20.4] * 20 + [21.2] * 38\n\nenc = TargetEncoder(random_state=0)\nX_trans = enc.fit_transform(X, y)\n\nenc.encodings_"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Missing values support in decision trees\nThe classes :class:`tree.DecisionTreeClassifier` and\n:class:`tree.DecisionTreeRegressor` now support missing values. For each potential\nthreshold on the non-missing data, the splitter will evaluate the split with all the\nmissing values going to the left node or the right node.\nMore details in the `User Guide <tree_missing_value_support>`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\nfrom sklearn.tree import DecisionTreeClassifier\n\nX = np.array([0, 1, 6, np.nan]).reshape(-1, 1)\ny = [0, 0, 1, 1]\n\ntree = DecisionTreeClassifier(random_state=0).fit(X, y)\ntree.predict(X)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## New display `model_selection.ValidationCurveDisplay`\n:class:`model_selection.ValidationCurveDisplay` is now available to plot results\nfrom :func:`model_selection.validation_curve`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.datasets import make_classification\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import ValidationCurveDisplay\n\nX, y = make_classification(1000, 10, random_state=0)\n\n_ = ValidationCurveDisplay.from_estimator(\n    LogisticRegression(),\n    X,\n    y,\n    param_name=\"C\",\n    param_range=np.geomspace(1e-5, 1e3, num=9),\n    score_type=\"both\",\n    score_name=\"Accuracy\",\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Gamma loss for gradient boosting\nThe class :class:`ensemble.HistGradientBoostingRegressor` supports the\nGamma deviance loss function via `loss=\"gamma\"`. This loss function is useful for\nmodeling strictly positive targets with a right-skewed distribution.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\nfrom sklearn.model_selection import cross_val_score\nfrom sklearn.datasets import make_low_rank_matrix\nfrom sklearn.ensemble import HistGradientBoostingRegressor\n\nn_samples, n_features = 500, 10\nrng = np.random.RandomState(0)\nX = make_low_rank_matrix(n_samples, n_features, random_state=rng)\ncoef = rng.uniform(low=-10, high=20, size=n_features)\ny = rng.gamma(shape=2, scale=np.exp(X @ coef) / 2)\ngbdt = HistGradientBoostingRegressor(loss=\"gamma\")\ncross_val_score(gbdt, X, y).mean()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Grouping infrequent categories in :class:`preprocessing.OrdinalEncoder`\nSimilarly to :class:`preprocessing.OneHotEncoder`, the class\n:class:`preprocessing.OrdinalEncoder` now supports aggregating infrequent categories\ninto a single output for each feature. The parameters to enable the gathering of\ninfrequent categories are `min_frequency` and `max_categories`.\nSee the `User Guide <encoder_infrequent_categories>` for more details.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.preprocessing import OrdinalEncoder\nimport numpy as np\n\nX = np.array(\n    [[\"dog\"] * 5 + [\"cat\"] * 20 + [\"rabbit\"] * 10 + [\"snake\"] * 3], dtype=object\n).T\nenc = OrdinalEncoder(min_frequency=6).fit(X)\nenc.infrequent_categories_"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.16"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}

Lines changed: 156 additions & 0 deletions

@@ -0,0 +1,156 @@
# flake8: noqa
"""
=======================================
Release Highlights for scikit-learn 1.3
=======================================

.. currentmodule:: sklearn

We are pleased to announce the release of scikit-learn 1.3! Many bug fixes
and improvements were added, as well as some new key features. We detail
below a few of the major features of this release. **For an exhaustive list of
all the changes**, please refer to the :ref:`release notes <changes_1_3>`.

To install the latest version (with pip)::

    pip install --upgrade scikit-learn

or with conda::

    conda install -c conda-forge scikit-learn

"""

# %%
# Metadata Routing
# ----------------
# We are in the process of introducing a new way to route metadata such as
# ``sample_weight`` throughout the codebase, which would affect how
# meta-estimators such as :class:`pipeline.Pipeline` and
# :class:`model_selection.GridSearchCV` route metadata. While the
# infrastructure for this feature is already included in this release, the work
# is ongoing and not all meta-estimators support this new feature. You can read
# more about this feature in the :ref:`Metadata Routing User Guide
# <metadata_routing>`. Note that this feature is still under development and
# not implemented for most meta-estimators.
#
# Third party developers can already start incorporating this into their
# meta-estimators. For more details, see
# :ref:`metadata routing developer guide
# <sphx_glr_auto_examples_miscellaneous_plot_metadata_routing.py>`.

# %%
# HDBSCAN: hierarchical density-based clustering
# ----------------------------------------------
# Originally hosted in the scikit-learn-contrib repository, :class:`cluster.HDBSCAN`
# has been adopted into scikit-learn. It's missing a few features from the original
# implementation which will be added in future releases.
# By performing a modified version of :class:`cluster.DBSCAN` over multiple epsilon
# values simultaneously, :class:`cluster.HDBSCAN` finds clusters of varying densities
# making it more robust to parameter selection than :class:`cluster.DBSCAN`.
# More details in the :ref:`User Guide <hdbscan>`.
import numpy as np
from sklearn.cluster import HDBSCAN
from sklearn.datasets import load_digits
from sklearn.metrics import v_measure_score

X, true_labels = load_digits(return_X_y=True)
print(f"number of digits: {len(np.unique(true_labels))}")

hdbscan = HDBSCAN(min_cluster_size=15).fit(X)
non_noisy_labels = hdbscan.labels_[hdbscan.labels_ != -1]
print(f"number of clusters found: {len(np.unique(non_noisy_labels))}")

print(v_measure_score(true_labels[hdbscan.labels_ != -1], non_noisy_labels))

# %%
# TargetEncoder: a new category encoding strategy
# -----------------------------------------------
# Well suited for categorical features with high cardinality,
# :class:`preprocessing.TargetEncoder` encodes the categories based on a shrunk
# estimate of the average target values for observations belonging to that category.
# More details in the :ref:`User Guide <target_encoder>`.
import numpy as np
from sklearn.preprocessing import TargetEncoder

X = np.array([["cat"] * 30 + ["dog"] * 20 + ["snake"] * 38], dtype=object).T
y = [90.3] * 30 + [20.4] * 20 + [21.2] * 38

enc = TargetEncoder(random_state=0)
X_trans = enc.fit_transform(X, y)

enc.encodings_

# %%
# Missing values support in decision trees
# ----------------------------------------
# The classes :class:`tree.DecisionTreeClassifier` and
# :class:`tree.DecisionTreeRegressor` now support missing values. For each potential
# threshold on the non-missing data, the splitter will evaluate the split with all the
# missing values going to the left node or the right node.
# More details in the :ref:`User Guide <tree_missing_value_support>`.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([0, 1, 6, np.nan]).reshape(-1, 1)
y = [0, 0, 1, 1]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
tree.predict(X)

# %%
# New display `model_selection.ValidationCurveDisplay`
# ----------------------------------------------------
# :class:`model_selection.ValidationCurveDisplay` is now available to plot results
# from :func:`model_selection.validation_curve`.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ValidationCurveDisplay

X, y = make_classification(1000, 10, random_state=0)

_ = ValidationCurveDisplay.from_estimator(
    LogisticRegression(),
    X,
    y,
    param_name="C",
    param_range=np.geomspace(1e-5, 1e3, num=9),
    score_type="both",
    score_name="Accuracy",
)

# %%
# Gamma loss for gradient boosting
# --------------------------------
# The class :class:`ensemble.HistGradientBoostingRegressor` supports the
# Gamma deviance loss function via `loss="gamma"`. This loss function is useful for
# modeling strictly positive targets with a right-skewed distribution.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_low_rank_matrix
from sklearn.ensemble import HistGradientBoostingRegressor

n_samples, n_features = 500, 10
rng = np.random.RandomState(0)
X = make_low_rank_matrix(n_samples, n_features, random_state=rng)
coef = rng.uniform(low=-10, high=20, size=n_features)
y = rng.gamma(shape=2, scale=np.exp(X @ coef) / 2)
gbdt = HistGradientBoostingRegressor(loss="gamma")
cross_val_score(gbdt, X, y).mean()

# %%
# Grouping infrequent categories in :class:`preprocessing.OrdinalEncoder`
# -----------------------------------------------------------------------
# Similarly to :class:`preprocessing.OneHotEncoder`, the class
# :class:`preprocessing.OrdinalEncoder` now supports aggregating infrequent categories
# into a single output for each feature. The parameters to enable the gathering of
# infrequent categories are `min_frequency` and `max_categories`.
# See the :ref:`User Guide <encoder_infrequent_categories>` for more details.
from sklearn.preprocessing import OrdinalEncoder
import numpy as np

X = np.array(
    [["dog"] * 5 + ["cat"] * 20 + ["rabbit"] * 10 + ["snake"] * 3], dtype=object
).T
enc = OrdinalEncoder(min_frequency=6).fit(X)
enc.infrequent_categories_
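
The Metadata Routing sections above describe the new per-estimator request API without an accompanying snippet. A minimal sketch of the user-facing side, assuming the ``enable_metadata_routing`` config flag and the generated ``set_fit_request`` / ``get_metadata_routing`` methods that ship with this infrastructure (most meta-estimators do not yet consume these requests in 1.3):

import sklearn
from sklearn.linear_model import LogisticRegression

# Opt in to the still-experimental routing machinery; it is off by default.
sklearn.set_config(enable_metadata_routing=True)

# Declare that this estimator should receive ``sample_weight`` whenever a
# supporting meta-estimator routes fit-time metadata to it.
log_reg = LogisticRegression().set_fit_request(sample_weight=True)

# Inspect the resulting routing information.
print(log_reg.get_metadata_routing())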

dev/_downloads/scikit-learn-docs.zip

75.3 KB
Binary file not shown.
