From f6afffbfda052709646973c62e1b1565b2d675b7 Mon Sep 17 00:00:00 2001 From: Gael Varoquaux Date: Thu, 22 Oct 2015 15:48:01 +0200 Subject: [PATCH 001/118] ENH: add usecases for transform_y Option A: meta estimators RST formatting Advance the discussion Restructured text layout RST formatting iter --- slep001/discussion.rst | 295 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 295 insertions(+) create mode 100644 slep001/discussion.rst diff --git a/slep001/discussion.rst b/slep001/discussion.rst new file mode 100644 index 0000000..a9333c6 --- /dev/null +++ b/slep001/discussion.rst @@ -0,0 +1,295 @@ +===================================== +Transformers that modify their target +===================================== + +.. topic:: **Summary** + + Transformers implement:: + + self = estimator.fit(X, y=None) + X_transform = estimator.transform(X) + estimator.fit(X, y=None).transform(X) == estimator.fit_transform(X, y) + + Many usecases require modifying y. How do we support this? + +.. sectnum:: + +.. contents:: Table of contents + :depth: 2 + +Rational +========== + +Summary of the contract of transformers +---------------------------------------- + +* .transform(...) returns a data matrix X + +* .transform(...) returns one feature vector for each sample of the input + +* .fit_transform(...) is the same and .fit(...).transform(...) + +Examples of usecases targetted +------------------------------- + +#. Over sampling: + + #. Class rembalancing: over sampling the minority class in + unbalanced dataset + #. Data enhancement (nudgging images for instance) + +#. Under-sampling + + #. Stateless undersampling: Take one sample out of two + #. Stateful undersampling: apply clustering and transform to cluster + centers + #. Coresets: return a smaller number of samples and associated sample + weights + +#. Outlier detection: + + #. Remove outlier from train set + #. Create a special class 'y' for outliers + +#. Completing y: + + #. Missing data imputation on y + #. Semi-supervised learning (related to above) + +#. Data loading / conversion + + #. Pandas in => (X, y) out + #. Filename in => (X, y) with multiple samples (very useful in + combination with online learning) + #. Database query => (X, y) out + +#. Aggregate statistics over multiple samples + + #. Windowing-like functions on time-series + + In a sense, these are dodgy with scikit-learn's cross-validation API + that knows nothing about sample structure. But the refactor of the CV + API is really helping in this regard. + +____ + +These usecases pretty much require breaking the contract of the +Transformer, as detailed above. + +The intuition driving this enhancement proposal is that the more the +data-processing pipeline becomes rich, the more the data grow, the more +the usecases above become important. + +Enhancements proposed +======================= + +Option A: meta-estimators +--------------------------- + +Proposal +........ + +This option advocates that any transformer-like usecase that wants to +modify y or the number of samples should not be a transformer-like but a +specific meta-estimator. 
A core-set object would thus look like: + +* From the user perspective:: + + from sklearn.sample_shrink import BirchCoreSet + from sklearn.ensemble import RandomForest + estimator = BirchCoreSet(RandomForest()) + +* From the developer perspective:: + + class BirchCoreSet(BaseEstimator): + + def fit(self, X, y): + # The logic here is wrong, as we need to handle y: + super(BirchCoreSet, self).fit(X) + X_red = self.subcluster_centers_ + self.estimator_.fit(X_red) + +Benefits +......... + +#. No change to the existing API + +#. The meta-estimator pattern is very powerful, and pretty much anything + is possible. + +Limitations +............ + +The different limitations listed below are variants of the same +conceptual difficulty + +#. It is hard to have mental models and garantees of what a + meta-estimator does, as it is by definition super versatile + + This is both a problem for the beginner, that needs to learn them on + an almost case-by-case basis, and for the advanced user, that needs to + maintain a set of case-specific code + +#. The "estimator heap" problem. + + Here the word heap is used to denote the multiple pipelines and + meta-estimators. It corresponds to what we would naturally call a + "data processing pipeline", but we use "heap" to avoid confusion with + the pipeline object. + + Stacks combining many steps of pipelines and meta-estimators become + very hard to inspect and manipulate, both for the user, and for + pipeline-management (aka "heap-management") code. Currently, these + difficulties are mostly in user code, so we don't see them too much in + scikit-learn. Here are concrete examples + + #. Trying to retrieve coefficients from a models estimated in a + "heap". Solving this problem requires + https://github.com/scikit-learn/scikit-learn/issues/2562#issuecomment-27543186 + (this enhancement proposal is not advocating to solve the problem + above, but pointing it out as an illustration) + + #. DaskLearn has modified the logic of pipeline to expose it as a + computation graph. The reason that it was relatively easy to do is + that there was mostly one object to modify to do the dispatching, + the Pipeline object. + + #. A future, out-of-core "conductor" object to fit a "stack" in out of + core by connecting it to a data-store would need to have a + representation of the stack. For instance, when chaining random + projections with Birch coresets and finally SGD, the user would + need to specify that random projections are stateless, birch needs + to do one pass of the data, and SGD a few. Given this information, + the conductor could orchestrate pull the data from the data source, + and sending it to the various steps. Such an object is much harder + to implement if the various steps are to be combined in a heap. + + Note that this is not a problem in non out-of-core settings, in the + sense that the BirchCoreSet meta-estimator would take care of doing + a pass on the data before feeding it to its sub estimator. + +In conclusion, meta-estimators are harder to comprehend (problem 1) and +write (problem 2). + +That said, we will never get rid of meta estimators. It is a very +powerful pattern. The discussion here is about extending a bit the +estimator API to have a less pressing need for meta-estimators. + +Option B: transformer-like that modify y +------------------------------------------ + +.. note:: Two variants of this option exist: + + 1. Changing the semantics of transformers to modify y and return + something more complex than a data matrix X + + 2. 
Introducing a new type of object + + Their is an emerging consensus for option 2. + +Proposal +......... + +Introduce a `TransformPipe` type of object with the following API +(names are discussed below): + +* `X_new, y_new = estimator.fit_pipe(X, y)` + +* `X_new, y_new = estimator.transform_pipe(X, y)` + +Or: + +* `X_new, y_new, sample_props = estimator.fit_pipe(X, y)` + +* `X_new, y_new, sample_props = estimator.transform_pipe(X, y)` + +Contracts (these are weaker contracts than the transformer: + +* Neither `fit_pipe` nor `transform_pipe` are guarantied to keep the + number of samples unchanged. + +* transform_pipe is not equivalent to .fit_pipe.transform + +Design questions +.................... + +#. Should there be a fit method? + + In such estimators, it may not be a good idea to call fit rather than + fit_pipe (for instance in coreset). + + +#. At test time, how does a pipeline use such an object? + + #. Should there be a transform method used at test time? + + #. What to do with objects that implement both `transform` and + `transform_pipe`? + + For some usecases, test time needs to modify the number of samples + (for instance data loading from a file). However, these will by + construction a problem for eg cross-val-score, as they need to + generate a y_true. It is thus unclear that the data-loading usecases + can be fully integrated in the CV framework (which is not an argument + against enabling them). + + For our CV framework, we need the number of samples to remain constant + (to have correspondence between y_pred and _true). This is an argument + for: + + #. Accepting both transform and transform_pipe + + #. Having the pipeline 'predict' use 'transform' on its + intermediate steps + +#. How do we deal with sample weights and other sample properties + + This discussion feeds in the `sample_props` discussion (that should + be discussed in a different enhancement proposal). + + The suggestion is to have the sample properties as a dictionary of + arrays `sample_props`. + + **Example usecase** useful to think about sample properties: coresets: + given (X, y) return (X_new, y_new, weights) with a much smaller number + of samples. + + This example is interesting because it shows that PipeTransforms can + legitimately create sample properties. + + **Proposed solution**: + + * PipeTransforms always return (X_new, y_new, sample_props) where + sample_props can be an empty dictionary. + + +Naming suggestions +.................. + +In term of name choice, the rational would be to have method names that +are close to 'fit' and 'transform', to make discoverability and +readability of the code easier. + +* Name of the object: + - TransformPipe + - PipeTransformer + - FilterTransform + +* Method to fit and apply on training + - fit_pipe + - pipe_fit + - fit_filter + +* Method to apply on new data (not alway available) + - transform_pipe + - pipe_transform + - transform_filter + +Benefits +......... + +Limitations +............ + + + From 80973135fd95fa915e77e5f6b98f3261ffa5bb0f Mon Sep 17 00:00:00 2001 From: Gael Varoquaux Date: Thu, 22 Oct 2015 21:45:43 +0200 Subject: [PATCH 002/118] mend --- slep001/discussion.rst | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/slep001/discussion.rst b/slep001/discussion.rst index a9333c6..1f46e4d 100644 --- a/slep001/discussion.rst +++ b/slep001/discussion.rst @@ -280,7 +280,7 @@ readability of the code easier. 
- pipe_fit - fit_filter -* Method to apply on new data (not alway available) +* Method to apply on new data (not always available) - transform_pipe - pipe_transform - transform_filter @@ -288,8 +288,24 @@ readability of the code easier. Benefits ......... +* Many usecases listed above will be implemented scikit-learn without a + meta-estimator, and thus will be easy to use (eg in a pipeline). Many + of these are patterns that we should be encouraging. + +* The API being more versatile, it will be easier to create + application-specific code or framework wrappers (ala DaskLearn) that + are scikit-learn compatible, and thus that can be used with the + parameter-selection framework. This will be especially true for ETL + (extract transform and load) pattern. + Limitations ............ +* Introducing new methods, and a new type of estimator object:. There are + probably a total of **3 new methods** that will get introduced by this + enhancement: fit_pipe, transform_pipe, and partial_fit_pipe + +* Cannot solve all possible cases, and thus we will not get rid of + meta-estimators. From 092e54ab02c9be958223061fc8f352fc3f8f47f0 Mon Sep 17 00:00:00 2001 From: Gael Varoquaux Date: Fri, 23 Oct 2015 01:28:16 +0200 Subject: [PATCH 003/118] Address all of @amueller's comments. --- slep001/discussion.rst | 69 +++++++++++++++++++++++++++++------------- 1 file changed, 48 insertions(+), 21 deletions(-) diff --git a/slep001/discussion.rst b/slep001/discussion.rst index 1f46e4d..7720558 100644 --- a/slep001/discussion.rst +++ b/slep001/discussion.rst @@ -10,7 +10,11 @@ Transformers that modify their target X_transform = estimator.transform(X) estimator.fit(X, y=None).transform(X) == estimator.fit_transform(X, y) - Many usecases require modifying y. How do we support this? + Within a chain or processing sequence of estimators, many usecases + require modifying y. How do we support this? + + Doing many of these things is possible "by hand". The question is: + how to avoid writing custom connecting logic. .. sectnum:: @@ -137,14 +141,24 @@ conceptual difficulty "data processing pipeline", but we use "heap" to avoid confusion with the pipeline object. - Stacks combining many steps of pipelines and meta-estimators become + Heaps combining many steps of pipelines and meta-estimators become very hard to inspect and manipulate, both for the user, and for pipeline-management (aka "heap-management") code. Currently, these difficulties are mostly in user code, so we don't see them too much in scikit-learn. Here are concrete examples - #. Trying to retrieve coefficients from a models estimated in a - "heap". Solving this problem requires + #. Trying to retrieve coefficients from a model estimated in a + "heap". Eg: + + * you know there is a lasso in your stack and you want to + get it's coef (in whatever space that resides?): + `pipeline.named_steps['lasso'].coef_` is possible. + + * you want to retrieve the coef of the last step: + `pipeline.steps[-1][1].coef_` is possible. + + With meta estimators this is tricky. + Solving this problem requires https://github.com/scikit-learn/scikit-learn/issues/2562#issuecomment-27543186 (this enhancement proposal is not advocating to solve the problem above, but pointing it out as an illustration) @@ -154,19 +168,22 @@ conceptual difficulty that there was mostly one object to modify to do the dispatching, the Pipeline object. - #. A future, out-of-core "conductor" object to fit a "stack" in out of + #. 
A future, out-of-core "conductor" object to fit a "heap" in out of core by connecting it to a data-store would need to have a - representation of the stack. For instance, when chaining random + representation of the heap. For instance, when chaining random projections with Birch coresets and finally SGD, the user would need to specify that random projections are stateless, birch needs to do one pass of the data, and SGD a few. Given this information, the conductor could orchestrate pull the data from the data source, and sending it to the various steps. Such an object is much harder to implement if the various steps are to be combined in a heap. + Note that the scikit-learn pipeline can only implement a linear + "chain" like set of processing. For instance a One vs All will + never be able to be implemented in a scikit-learn pipeline. - Note that this is not a problem in non out-of-core settings, in the - sense that the BirchCoreSet meta-estimator would take care of doing - a pass on the data before feeding it to its sub estimator. + This is not a problem in non out-of-core settings, in the sense + that the BirchCoreSet meta-estimator would take care of doing a + pass on the data before feeding it to its sub estimator. In conclusion, meta-estimators are harder to comprehend (problem 1) and write (problem 2). @@ -183,9 +200,9 @@ Option B: transformer-like that modify y 1. Changing the semantics of transformers to modify y and return something more complex than a data matrix X - 2. Introducing a new type of object + 2. Introducing new methods (and a new type of object) - Their is an emerging consensus for option 2. + There is an emerging consensus for option 2. Proposal ......... @@ -229,13 +246,22 @@ Design questions For some usecases, test time needs to modify the number of samples (for instance data loading from a file). However, these will by construction a problem for eg cross-val-score, as they need to - generate a y_true. It is thus unclear that the data-loading usecases - can be fully integrated in the CV framework (which is not an argument - against enabling them). + generate a y_true. Indeed, the problem is the following: + + - To measure an error, we need y_true at the level of + 'cross_val_score' or GridSearch + + - y_true is created inside the pipeline by the data-loading object. + + It is thus unclear that the data-loading usecases can be fully + integrated in the CV framework (which is not an argument against + enabling them). + + | - For our CV framework, we need the number of samples to remain constant - (to have correspondence between y_pred and _true). This is an argument - for: + For our CV framework, we need the number of samples to remain + constant: for each y_pred, we need a corresponding y_true. This is an + argument for: #. Accepting both transform and transform_pipe @@ -270,20 +296,21 @@ In term of name choice, the rational would be to have method names that are close to 'fit' and 'transform', to make discoverability and readability of the code easier. -* Name of the object: +* Name of the object (referred in the docs): - TransformPipe - PipeTransformer - - FilterTransform + - TransModifier * Method to fit and apply on training - fit_pipe - pipe_fit - fit_filter + - fit_modify * Method to apply on new data (not always available) - transform_pipe - pipe_transform - - transform_filter + - trans_modify Benefits ......... @@ -301,7 +328,7 @@ Benefits Limitations ............ -* Introducing new methods, and a new type of estimator object:. 
There are +* Introducing new methods, and a new type of estimator object. There are probably a total of **3 new methods** that will get introduced by this enhancement: fit_pipe, transform_pipe, and partial_fit_pipe From d00bb98f729c962b22c2a2da11ed68d7e3914f4b Mon Sep 17 00:00:00 2001 From: Gael Varoquaux Date: Fri, 23 Oct 2015 09:59:25 +0200 Subject: [PATCH 004/118] Make things more explicit address comments by @mblondel and @agramfort --- slep001/discussion.rst | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/slep001/discussion.rst b/slep001/discussion.rst index 7720558..cae041c 100644 --- a/slep001/discussion.rst +++ b/slep001/discussion.rst @@ -195,7 +195,7 @@ estimator API to have a less pressing need for meta-estimators. Option B: transformer-like that modify y ------------------------------------------ -.. note:: Two variants of this option exist: +.. topic:: **Two variants** 1. Changing the semantics of transformers to modify y and return something more complex than a data matrix X @@ -204,6 +204,20 @@ Option B: transformer-like that modify y There is an emerging consensus for option 2. +.. topic:: **`transform` modifying y** + + Option 1 above could be implementing by allowing transform to modify + y. However, the return signature of transform would be unclear. + + Do we modify all transformers to return a y (y=None for unsupervised + transformers that are not given y?). This sounds like leading to code + full of surprised and difficult to maintain from the user perspective. + + We would loose the contract that the number of samples is unchanged by + a transformer. This contract is very useful (eg for model selection: + measuring error for each sample). + + Proposal ......... From 4185bf6de21e0de8c88184a20f55eaee61c23562 Mon Sep 17 00:00:00 2001 From: Gael Varoquaux Date: Fri, 23 Oct 2015 10:03:57 +0200 Subject: [PATCH 005/118] Slight phrasing change --- slep001/discussion.rst | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/slep001/discussion.rst b/slep001/discussion.rst index cae041c..073cd67 100644 --- a/slep001/discussion.rst +++ b/slep001/discussion.rst @@ -206,7 +206,7 @@ Option B: transformer-like that modify y .. topic:: **`transform` modifying y** - Option 1 above could be implementing by allowing transform to modify + Variant 1 above could be implementing by allowing transform to modify y. However, the return signature of transform would be unclear. Do we modify all transformers to return a y (y=None for unsupervised @@ -217,6 +217,7 @@ Option B: transformer-like that modify y a transformer. This contract is very useful (eg for model selection: measuring error for each sample). + For these reasons, we feel new methods are necessary. Proposal ......... From 1a1150174b0772b4636e1225a73ab81963f0f091 Mon Sep 17 00:00:00 2001 From: Gael Varoquaux Date: Fri, 23 Oct 2015 10:12:04 +0200 Subject: [PATCH 006/118] typo --- slep001/discussion.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/slep001/discussion.rst b/slep001/discussion.rst index 073cd67..2ecbb98 100644 --- a/slep001/discussion.rst +++ b/slep001/discussion.rst @@ -211,7 +211,7 @@ Option B: transformer-like that modify y Do we modify all transformers to return a y (y=None for unsupervised transformers that are not given y?). This sounds like leading to code - full of surprised and difficult to maintain from the user perspective. + full of surprises and difficult to maintain from the user perspective. 
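+   As a purely hypothetical sketch of that ambiguity, calling code could
+   no longer rely on a single return value::
+
+      Xt = scaler.transform(X)             # today's contract
+      Xt, yt = resampler.transform(X, y)   # would need a second signature
+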
We would loose the contract that the number of samples is unchanged by a transformer. This contract is very useful (eg for model selection: From eef49f8f2d76999aa8463db9907613417f5130b3 Mon Sep 17 00:00:00 2001 From: Gael Varoquaux Date: Fri, 23 Oct 2015 16:57:08 +0200 Subject: [PATCH 007/118] Add code examples Ping @ogrisel @amueller --- slep001/discussion.rst | 11 +++++ slep001/example_outlier_digits.py | 69 +++++++++++++++++++++++++++++++ slep001/outlier_filtering.py | 32 ++++++++++++++ slep001/subsampler.py | 24 +++++++++++ 4 files changed, 136 insertions(+) create mode 100644 slep001/example_outlier_digits.py create mode 100644 slep001/outlier_filtering.py create mode 100644 slep001/subsampler.py diff --git a/slep001/discussion.rst b/slep001/discussion.rst index 2ecbb98..3a667db 100644 --- a/slep001/discussion.rst +++ b/slep001/discussion.rst @@ -63,6 +63,7 @@ Examples of usecases targetted #. Data loading / conversion #. Pandas in => (X, y) out + #. Images in => patches out #. Filename in => (X, y) with multiple samples (very useful in combination with online learning) #. Database query => (X, y) out @@ -282,6 +283,10 @@ Design questions #. Having the pipeline 'predict' use 'transform' on its intermediate steps + + One option is to modify the scoring framework to be able to handle + these things, the scoring gets the output of the chain of + transform_pipe for y. #. How do we deal with sample weights and other sample properties @@ -350,4 +355,10 @@ Limitations * Cannot solve all possible cases, and thus we will not get rid of meta-estimators. +TODO +==== + +* Implement an example doing outlier filtering + +* Implement an example doing data downsampling diff --git a/slep001/example_outlier_digits.py b/slep001/example_outlier_digits.py new file mode 100644 index 0000000..8b308ba --- /dev/null +++ b/slep001/example_outlier_digits.py @@ -0,0 +1,69 @@ +""" +Small example doing data filtering on digits for t-SNE embedding. 
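+
+Note: this is a sketch accompanying the proposal. It relies on the
+``fit_pipe`` / ``transform_pipe`` methods defined by the helper classes in
+``outlier_filtering.py`` and ``subsampler.py``, which are not part of the
+scikit-learn API.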
+""" +from time import time + +import numpy as np +import matplotlib.pyplot as plt + +from sklearn import manifold, datasets, decomposition, pipeline + +from outlier_filtering import EllipticEnvelopeFilter +from subsampler import SubSampler + +digits = datasets.load_digits() +X = digits.data +y = digits.target +n_samples, n_features = X.shape + + +#---------------------------------------------------------------------- +# Scale and visualize the embedding vectors +def plot_embedding(X, y, title=None): + x_min, x_max = np.min(X, 0), np.max(X, 0) + X = (X - x_min) / (x_max - x_min) + + plt.figure() + plt.subplot(111) + for this_x, this_y in zip(X, y): + plt.text(this_x[0], this_x[1], str(this_y), + color=plt.cm.Set1(this_y / 10.), + fontdict={'weight': 'bold', 'size': 9}) + + plt.xticks([]), plt.yticks([]) + if title is not None: + plt.title(title) + + +print("Computing t-SNE embedding") + +tsne = manifold.TSNE(n_components=2, init='pca', random_state=0) + +subsampler = SubSampler(random_state=1, ratio=.5) + +filtering = EllipticEnvelopeFilter(random_state=1) + +t0 = time() + +# We need a PCA reduction of X because MinCovDet crashes elsewhere +X_pca = decomposition.RandomizedPCA(n_components=30).fit_transform(X) +filtering.fit_pipe(*subsampler.transform_pipe(X_pca)) + +print("Fitting filtering done: %.2fs" % (time() - t0)) + +X_red, y_red = filtering.transform_pipe(X_pca, y) + +X_tsne = tsne.fit_transform(X_red) + +plot_embedding(X_tsne, y_red, + "With outlier_filtering") + + +# Now without outlier_filtering +X_tsne = tsne.fit_transform(X_pca) + +plot_embedding(X_tsne, y, + "Without outlier_filtering") + +plt.show() + diff --git a/slep001/outlier_filtering.py b/slep001/outlier_filtering.py new file mode 100644 index 0000000..9e20db4 --- /dev/null +++ b/slep001/outlier_filtering.py @@ -0,0 +1,32 @@ +from sklearn.base import BaseEstimator +from sklearn.covariance import EllipticEnvelope + + +class EllipticEnvelopeFilter(BaseEstimator): + + def __init__(self, assume_centered=False, + support_fraction=None, contamination=0.1, + random_state=None): + self.assume_centered = assume_centered + self.support_fraction = support_fraction + self.contamination = contamination + self.random_state = random_state + + def fit_pipe(self, X, y=None): + self.elliptic_envelope_ = EllipticEnvelope(**self.get_params()) + self.elliptic_envelope_.fit(X) + return self.transform_pipe(X, y) + + def transform_pipe(self, X, y): + # XXX: sample_props not taken care off + is_inlier = self.elliptic_envelope_.predict(X) == 1 + X_out = X[is_inlier] + if y is None: + y_out = None + else: + y_out = y[is_inlier] + return X_out, y_out + + def transform(self, X, y=None): + return X + diff --git a/slep001/subsampler.py b/slep001/subsampler.py new file mode 100644 index 0000000..02620e4 --- /dev/null +++ b/slep001/subsampler.py @@ -0,0 +1,24 @@ +from sklearn.base import BaseEstimator +from sklearn.utils import check_random_state + + +class SubSampler(BaseEstimator): + + def __init__(self, ratio=.3, random_state=None): + self.ratio = ratio + self.random_state = random_state + self.random_state_ = None + + def transform_pipe(self, X, y=None): + # Awkward situation: random_state_ is set at transform time :) + if self.random_state_ is None: + self.random_state_ = check_random_state(self.random_state) + n_samples, _ = X.shape + random_choice = self.random_state_.random_sample(n_samples) + random_choice = random_choice < self.ratio + X_out = X[random_choice] + y_out = None + if y is not None: + y_out = y[random_choice] + return X_out, y_out 
+ From fdfe4211e1d4714df93fe2185b6d2c2072bb8216 Mon Sep 17 00:00:00 2001 From: Gael Varoquaux Date: Fri, 23 Oct 2015 18:14:41 +0200 Subject: [PATCH 008/118] Integrate discussions with @ogrisel and @amueller --- slep001/discussion.rst | 131 +++++++++++++++++++++++------------------ 1 file changed, 75 insertions(+), 56 deletions(-) diff --git a/slep001/discussion.rst b/slep001/discussion.rst index 3a667db..00c6e62 100644 --- a/slep001/discussion.rst +++ b/slep001/discussion.rst @@ -223,90 +223,109 @@ Option B: transformer-like that modify y Proposal ......... -Introduce a `TransformPipe` type of object with the following API +Introduce a `TransModifier` type of object with the following API (names are discussed below): -* `X_new, y_new = estimator.fit_pipe(X, y)` +* `X_new, y_new = estimator.fit_modify(X, y)` -* `X_new, y_new = estimator.transform_pipe(X, y)` +* `X_new, y_new = estimator.trans_modify(X, y)` Or: -* `X_new, y_new, sample_props = estimator.fit_pipe(X, y)` +* `X_new, y_new, sample_props = estimator.fit_modify(X, y)` -* `X_new, y_new, sample_props = estimator.transform_pipe(X, y)` +* `X_new, y_new, sample_props = estimator.trans_modify(X, y)` Contracts (these are weaker contracts than the transformer: -* Neither `fit_pipe` nor `transform_pipe` are guarantied to keep the +* Neither `fit_modify` nor `trans_modify` are guarantied to keep the number of samples unchanged. -* transform_pipe is not equivalent to .fit_pipe.transform +* `fit_modify` may not exist (questionnable) -Design questions -.................... +Design questions and difficulties +.................................. -#. Should there be a fit method? +Should there be a fit method? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - In such estimators, it may not be a good idea to call fit rather than - fit_pipe (for instance in coreset). +In such estimators, it may not be a good idea to call fit rather than +fit_modify (for instance in coreset). -#. At test time, how does a pipeline use such an object? +How does a pipeline use such an object? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - #. Should there be a transform method used at test time? +In particular at test time? - #. What to do with objects that implement both `transform` and - `transform_pipe`? +#. Should there be a transform method used at test time? - For some usecases, test time needs to modify the number of samples - (for instance data loading from a file). However, these will by - construction a problem for eg cross-val-score, as they need to - generate a y_true. Indeed, the problem is the following: +#. What to do with objects that implement both `transform` and + `trans_modify`? - - To measure an error, we need y_true at the level of - 'cross_val_score' or GridSearch +**Creating y in a pipeline makes error measurement harder** For some +usecases, test time needs to modify the number of samples (for instance +data loading from a file). However, these will by construction a problem +for eg cross-val-score, as in supervised settings, these expect a y_true. +Indeed, the problem is the following: + +- To measure an error, we need y_true at the level of + `cross_val_score` or `GridSearchCV` + +- y_true is created inside the pipeline by the data-loading object. + +It is thus unclear that the data-loading usecases can be fully +integrated in the CV framework (which is not an argument against +enabling them). + +| + +For our CV framework, we need the number of samples to remain +constant: for each y_pred, we need a corresponding y_true. 
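+
+A purely hypothetical sketch of the mismatch (``OutlierFilter`` is a
+made-up step that drops samples at predict time)::
+
+    pipe = make_pipeline(OutlierFilter(), SVC())
+    y_pred = pipe.predict(X_test)       # may return fewer predictions
+    accuracy_score(y_test, y_pred)      # y_test still has the original size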
+ +| + +**Proposal 1**: use transform at `predict` time. - - y_true is created inside the pipeline by the data-loading object. +#. Objects implementing both `transform` and `trans_modify` are valid - It is thus unclear that the data-loading usecases can be fully - integrated in the CV framework (which is not an argument against - enabling them). +#. The pipeline's `predict` method use `transform` on its intermediate + steps - | +The different semantics of `trans_modify` and `transform` can be very useful, +as `transform` keeps untouched the notion of sample, and `y_true`. - For our CV framework, we need the number of samples to remain - constant: for each y_pred, we need a corresponding y_true. This is an - argument for: - - #. Accepting both transform and transform_pipe +| - #. Having the pipeline 'predict' use 'transform' on its - intermediate steps +**Proposal 2** Modify the scoring framework - One option is to modify the scoring framework to be able to handle - these things, the scoring gets the output of the chain of - transform_pipe for y. - -#. How do we deal with sample weights and other sample properties +One option is to modify the scoring framework to be able to handle +these things, the scoring gets the output of the chain of +trans_modify for y. This should rely on clever code in the `score` method +of pipeline. Maybe it should be controlled by a keyword argument on the +pipeline, and turned off by default. - This discussion feeds in the `sample_props` discussion (that should - be discussed in a different enhancement proposal). + +How do we deal with sample weights and other sample properties? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This discussion feeds in the `sample_props` discussion (that should +be discussed in a different enhancement proposal). - The suggestion is to have the sample properties as a dictionary of - arrays `sample_props`. +The suggestion is to have the sample properties as a dictionary of +arrays `sample_props`. - **Example usecase** useful to think about sample properties: coresets: - given (X, y) return (X_new, y_new, weights) with a much smaller number - of samples. +**Example usecase** useful to think about sample properties: coresets: +given (X, y) return (X_new, y_new, weights) with a much smaller number +of samples. - This example is interesting because it shows that PipeTransforms can - legitimately create sample properties. +This example is interesting because it shows that TransModifiers can +legitimately create sample properties. - **Proposed solution**: +**Proposed solution**: - * PipeTransforms always return (X_new, y_new, sample_props) where - sample_props can be an empty dictionary. +TransModifiers always return (X_new, y_new, sample_props) where +sample_props can be an empty dictionary. Naming suggestions @@ -317,20 +336,20 @@ are close to 'fit' and 'transform', to make discoverability and readability of the code easier. * Name of the object (referred in the docs): + - TransModifier - TransformPipe - PipeTransformer - - TransModifier * Method to fit and apply on training + - fit_modify - fit_pipe - pipe_fit - fit_filter - - fit_modify -* Method to apply on new data (not always available) +* Method to apply on new data + - trans_modify - transform_pipe - pipe_transform - - trans_modify Benefits ......... @@ -350,7 +369,7 @@ Limitations * Introducing new methods, and a new type of estimator object. 
There are probably a total of **3 new methods** that will get introduced by this - enhancement: fit_pipe, transform_pipe, and partial_fit_pipe + enhancement: fit_modify, trans_modify, and partial_fit_modify. * Cannot solve all possible cases, and thus we will not get rid of meta-estimators. From 10a82233245ddf061f6504206f340a1e2ab9af84 Mon Sep 17 00:00:00 2001 From: Konstantin Podshumok Date: Sun, 11 Sep 2016 06:35:39 +0300 Subject: [PATCH 009/118] Add initial version of Dynamic Pipelines slep --- slep002/proposal.rst | 558 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 558 insertions(+) create mode 100644 slep002/proposal.rst diff --git a/slep002/proposal.rst b/slep002/proposal.rst new file mode 100644 index 0000000..f19dc8e --- /dev/null +++ b/slep002/proposal.rst @@ -0,0 +1,558 @@ +================= +Dynamic pipelines +================= + +.. topic:: **Summary** + + Create and manipulate pipelines with ease. + +.. sectnum:: + +.. contents:: Table of contents + :depth: 3 + +Goals +===== + +* Being backward-compatible +* Allow interactive pipeline construction (for example in IPython) +* Support adding and replacing parts of pipeline +* Support using steps as label (y's) transformers + + +Design +====== + +Imports +------- + +In addition to `Pipeline` class some additional wrappers are proposed as part of public API:: + + from sklearn.pipeline import (Pipeline, fitted, transformer, predictor + label_transformer, label_predictor, + ignore_transform, ignore_predict) + +Pipeline creation +----------------- + +Backward-compatible +................... + +Of course, old syntax should be supported:: + + pipe = Pipeline(steps=[('name1', estimator1), ('name2', 'estimator2)] + +Proposed default constructor +............................ + +It is not backward-compatible, but it shouldn't break most of old code:: + + pipe = Pipeline() + +It is not yet configured, so trying to use it should fail:: + + >>> pipe.predict(...) + Traceback (most recent call last): + ... + NotFittedError: This Pipeline instance is not fitted yet + + >>> pipe.fit(...) + Traceback (most recent call last): + ... + NotConfiguredError: This Pipeline instance is not configured yet + +Proposed construction from iterable of dicts +............................................ + +Dictionaries emphasize structure:: + + pipe = Pipeline( + steps=[ + {'name1': Estimator1()}, + {'name2': Estimator2()}, + ] + ) + +Every dict should be of length 1:: + + >>> pipe = Pipeline( + ... steps=( + ... {'name1': Estimator1(), + ... 'name2': Estimator2()}, + ... {}, + ... ), + ... ) + Traceback (most recent call last): + ... + TypeError: Wrong step definition + + +Proposed construction from `collections.OrderedDict` +.................................................... + +It is probably the most natural way to create a pipeline:: + + pipe = Pipeline( + collections.OrderedDict([ + ('name1', Estimator1()), + ('name2', Estimator2()), + ]), + ) + +Backward-compatibility notice +----------------------------- + +As user can provide object of any type as `steps` argument to constructor, +there is no way to be 100% compatible, if we are going to maintain our oun +type for `Pipeline.steps`. +But in most cases people provide `list` object as `steps` parameter, so +being backward-compatible with `list` API should be fine. + +Adding estimators +----------------- + +Backward-compatible +................... 
+ +Although not documented, but popular method of modifying (not fitted) pipelines should be supported:: + + pipe.steps.append(['name', estimator]) + +The only difference is that special handler is returned instead of `None`. + +Enhanced: by indexing +..................... + +Using dict-like syntax if very user-friendly:: + + pipe.steps['name'] = estimator + +Enhanced: `add` function +........................ + +Alias to previous two calls:: + + pipe.steps.add('name', estimator) + +And also:: + + pipe.add_estimator('name', estimator) + +Adding estimators with type specification +......................................... + +Estimator types will be discussed later, but some functions belong to this section:: + + pipe.add_estimator('name0', estimator0).mark_fitted() + pipe.add_transformer('name1', estimator1) # never calls .fit (x, y -> x) + pipe.add_predictor('name2', estimator2) # never calls .trasform (x -> y) + pipe.add_label_transformer('name3', estimator3) # (y -> y) + pipe.add_label_predictor('name4', estimator4) # (y -> y) + +Steps (subestimators) access +---------------------------- + +Backward-compatible +................... + +Indexing by number should return `(step, estimator)` pair:: + + >>> pipe.steps[0] + ('name', SomeEstimator(...)) + +Enhanced access via indexing +............................ + +One should be able to retrieve any estimator with indexing by step's name:: + + >>> pipe.steps['mame'] + SomeEstimator(param1=value1, param2=value2) + +Enhanced access via attributes +.............................. + +Dotted access should also work if name of step is valid python name literal +and there is no inference with internal methods:: + + >>> pipe.steps.name + SomeEstimator(param1=value1, param2=value2) + + >>> pipe.steps.get + > + + >>> pipe.add_transformer('my transformer', estimator) + >>> pipe.steps.my transformer + File ... + pipe.steps_.my transformer + ^ + SyntaxError: invalid syntax + +Replacing estimators +-------------------- + +Backward-compatible +................... + +Replacing should only be supported via access to `.steps` attribute. This way there is no ambiguity +with new/old subestimator subtype:: + + pipe = Pipeline(steps=[('name', SomeEstimator())]) + pipe.steps[0] = ('name', AnotherEstimator()) + +Replace via indexing by step name +................................. + +Dict-like behavior can be used too:: + + pipe = Pipeline(steps=[('name', SomeEstimator())]) + pipe.steps['name'] = AnotherEstimator() + +Replace via `replace()` function +................................. + +This way one can obtain special handler:: + + pipe.steps.replace('old_step_name', 'new_step_name', NewEstimator()) + pipe.steps.replace('step_name', 'new_name', SomeEstimator()).mark_transformer() + + +Rename step via `rename()` function +.................................... + +Simple way to change step's name (doesn't affect anything except object representation):: + + pipe.steps.rename('old_name', 'new_name') + +Modifying estimators +-------------------- + +Changing estimator params should only be performed via `pipeline.set_params()`. +If somebody calls `subestimator.set_params()` directly, pipeline object will have +no idea about changed state. There is no easy way to control it, so docs should just +warm users about it. + +On the other hand, there exist not-so-easy way to at least warm users during runtime: +pipeline will have to keep params of all its children and compare them with actual +params during `fit` or `predict` routines and raise a warning if they do not match. 
+This functionality may be implemented as part of some kind of debugging mode. + +Deleting estimators +------------------- + +Backward-compatible +................... + +Backward-compatible way to delete a step is to `del` it via index number:: + + del pipe.steps[2] + +Enhanced indexing +................. + +A little more user-friendly way to remove a step can be achieved +using enhanced indexing:: + + pipe = Pipeline() + est1 = Estimator1() + est2 = Estimator2() + + pipe.steps.add('name1', est1) + pipe.steps.add('name2', est2) + + del pipe.steps['name1'] + del pipe.steps[pipe.steps.index(est2)] + +Using dict/list-like `pop()` functions +...................................... + +Last estimator in a chain can be deleted with any of these calls:: + + >>> pipe.steps.pop() + SomeEstimator() + + >>> pipe.steps.popitem() + ('some_name', SomeEstimator()) + +Likewise, first estimator in the pipeline can be removed with any of these calls:: + + >>> pipe.steps.popfront() + BeginEstimator() + + >>> pipe.steps.popitemfront() + ('begin', BeginEstimator) + +Any step can be removed with `pop(step_name)` or `popitem(step_name)`. + +Fitted flag reset +----------------- + +Internally `Pipeline` object should keep track on whatever it is fitted or not. +It should consider itself fitted if it wasn't modified after: + +* successful call to `.fit`:: + + pipe.fit(...) # Got fitted pipeline if no exception was raised +* construction with list of estimators, all marked as + fitted via `fitted` function:: + + pipe = pipeline.Pipeline(steps=[ + ('name1', fitted(estimator1)), + ('name2', fitted(estimator2)(, + ... + ]) +* adding fitted estimator to fitted pipeline:: + + pipe.steps.append(fitted(estimator1)) + pipe.steps['new_step'] = fitted(estimator2) + pipe.add_transformer('some_key', estimator3).set_fitted() +* renaming step in fitted pipeline +* removing first or last step from fitted pipeline + +Subestimator types +------------------ + +Subestimator type contains information about the way a pipeline +should process a step with that subestimator. +Subestimator type can be specified + +1. By wrapping estimator with subtype constructor call: + * when creating pipeline:: + + Pipeline([ + ('name1', transformer(estimator)), + ('name2', predictor(estimator)), + ('name3', label_transformer(estimator)), + ('name4', label_predictor(estimator)), + ]) + * when adding or replacing a step:: + + pipe.steps.append(['name', label_predictor(estimator]) + pipe.steps.add('name', label_transformer(estimator)) + pipe.add_estimator('name', predictor(estimator)) + pipe.steps.replace('name', transformer(fitted(estimator))) + pipe.steps['name'] = fitted(predictor(estimator)) +3. Using `pipe.add_*` methods:: + + pipe.add_transformer('transformer', Transformer()) + pipe.add_predictor('predictor', Predictor()) + pipe.add_label_transformer('l_transformer', LabelTransformer()) + pipe.add_label_predictor('l_predictor', LabelPredictor()) +2. Using special handler methods:: + + pipe.add_estimator('name1', EstimatorA()).mark_transformer() + pipe.steps.add('name2', EstimatorB()).mark_predictor() + pipe.steps.append(['name3', EstimatorC()]).mark_label_transformer() + pipe.steps.replace('name4', EstimatorD()).mark_label_predictor() + pipe.steps.replace('name4', EstimatorE()).mark('label_transformer') + +Transformer +........... +Is a default type. + +It is processed like this:: + + y_new = y + if fiting: + X_new = step_estimator.fit_transform(X, y) + else: + X_new = step.transform(X, y) + +Predictor +......... 
+ +It is processed like this:: + + X_new = X + if fitting: + y_new = step_estimator.fit_predict(X, y) + else: + y_new = step_estimator.predict(X, y) + +Label transformer +................. + +Processing pseudocode:: + + X_new = X + if fitting: + y_new = step_estimator.fit_transform(y) + else: + y_new = step_estimator.transform(y) + +Label predictor +............... + +Processing pseudocode:: + + X_new = X + if fitting: + y_new = step_estimator.fit_predict(y) + else: + y_new = step_estimator.predict(y) + +Special handlers and wrapper functions +-------------------------------------- + +Assuming estimator is already fitted +.................................... + +to add estimator, that was already fitted to a pipline +one can use fitted function:: + + est = SomeEstimator().fit(some_data) + pipe.steps.add('prefitted', fitted(est)) + +or special hanlder method:: + + pipe.steps.add('prefitted', est).mark_fitted() + # or + pipe.steps.add('prefitted', est).mark('fitted') + +Ignoring estimator during prediction +.................................... + +In some cases we only need to apply estimator only during fit-phase:: + + pipe.add_estimator('sampler', ignore_transform(Sampler())) + # or + pipe.add_estimator('sampler', Sampler()).mark_ignore_transform() + # or + pipe.add_estimator('sampler', Sampler()).mark('ignore_transform') + +If it is `predictor` or `label_predictor`, then one should use `ignore_predict`:: + + pipe.add_estimator('cluster', ignore_predict(predictor(ClusteringEstimator()))) + # or + pipe.add_estimator('cluster', predictor(ClusteringEstimator())).mark_ignore_predict() + # or + pipe.add_estimator('cluster', predictor(ClusteringEstimator())).mark('ignore_predict') + +Setting subestimator type +......................... + +As specified above setting subestimator type can be performed with special handler +or special function call. + +Combining multiple flags +........................ + +All sorts of syntax combinations should be supported:: + + pipe.steps.add('step', fitted(predictor(Estimator()))) + pipe.steps.add('step', predictor(fitted(Estimator()))) + pipe.steps.add('step', predictor(Estimator())).mark_fitted() + pipe.steps.add('step', fitted(Estimator())).mark_predictor() + pipe.steps.add('step', Estimator()).mark_predictor().mark_fitted() + pipe.steps.add('step', Estimator()).mark_fitted().mark_predictor() + pipe.steps.add('step', Estimator()).mark('fitted').mark_predictor() + pipe.steps.add('step', Estimator()).mark('predictor').mark_fitted() + pipe.steps.add('step', Estimator()).mark('predictor').mark('fitted') + pipe.steps.add('step', Estimator()).mark('fitted').mark('predictor') + pipe.steps.add('step', Estimator()).mark('fitted', 'predictor') + pipe.steps.add('step', Estimator()).mark('predictor', 'fitted') + +Type of steps object +-------------------- + +This is internal type, users shouldn'r usualy mess with that. +But public methods should be considered as part of pipeline API. + +Attributes and methods with standard behavior +.............................................. + +Special methods: + +* `__contains__()`, `__getitem__()`, `__setitem__()`, `__delitem__()` +* `__len__()`, `__iter__()` +* `__add__()`, `__iadd__()` + +Methods: + +* `get()`, `index()` +* `extend()`, `insert()` +* `keys()`, `items()`, `values()` +* `clear()`, `pop()`, `popitem()`, `popfront()`, `popitemfront()` + +Non-standard methods +.................... + +* `replace()` +* `rename()` + +Not supported arguments and methods +................................... 
+ +This type provides dict-like and list-like interfaces, +but following methods and attributes are not supported: + +* `fromkeys()` +* `setdefault()` +* `sort()` +* `__mul__()`, `__rmul__()`, `__imul__()` + +Any attempt to use them should fail with `AttributeError` or +`NotImplementedError` + +Thease methods may be not supported: + +* `__ge__()`, `__gt__()` +* `__le__()`, `__lt__()` + +Serialization +------------- + +* Support loading/unpickling pipelines from old scikit-learn versions +* Keep track of API version in `__getstate__` / `picklier`: all future + versions should support unpickling all previous versions of enhanced pipeline +* Serialization of `.steps` attribute (without master pipeline) may be not supported. + +Examples +======== + +Example: remove outliers +------------------------ + +Proposed design allows to do many things, but some of them have to be done in two steps. +But it shouldn't be a problem, as one can make a pipeline with those steps:: + + def make_outlier_remover(bad_value=-1): + outlier_remover = Pipeline() + outlier_remover.steps.add( + 'data', + DropLinesOfXCorrespondingLabel(remove_if=bad_value), + ) + outlier_remover.steps.add( + 'labels', + DropLabelsIf(remove_if=bad_value), + ).mark_label_transformer() + return outlier_remover + +Example: sample dataset +----------------------- +We can use previous example function for this:: + + def make_sampler(percent=75): + sentinel = object() + sampler = Pipeline() + sampler.steps.add( + 'sample', + LabelSomeRowsAs(percent=percent, label=sentinel), + ).mark('predictor', 'ignore_predict') + sampler.steps.add( + 'down', + make_outlier_remover(bad_value=sentinel), + ) + return sampler + +Benefits +======== +* Users can use old code with new pipeline: + usual `__init__`, `set_params`, `get_params`, `fit`, `transform` and `predict` + are the only requirements of subestimators. +* Users can use new pipeline with their old code: + pipeline is stil usual estimator, that supports usual set of methods. +* We finally can transform `y` in a pipeline. + +Drawbacks +========= +Well, it's a lot of code to write and support... From c969d84ded66e69ee57794934e631c7611d7c2b1 Mon Sep 17 00:00:00 2001 From: Konstantin Podshumok Date: Sun, 11 Sep 2016 06:41:47 +0300 Subject: [PATCH 010/118] fix wrong (and manual numbering) --- slep002/proposal.rst | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/slep002/proposal.rst b/slep002/proposal.rst index f19dc8e..ae9e3ce 100644 --- a/slep002/proposal.rst +++ b/slep002/proposal.rst @@ -313,9 +313,10 @@ Subestimator types Subestimator type contains information about the way a pipeline should process a step with that subestimator. -Subestimator type can be specified -1. By wrapping estimator with subtype constructor call: +Subestimator type can be specified: + +* By wrapping estimator with subtype constructor call: * when creating pipeline:: Pipeline([ @@ -331,13 +332,13 @@ Subestimator type can be specified pipe.add_estimator('name', predictor(estimator)) pipe.steps.replace('name', transformer(fitted(estimator))) pipe.steps['name'] = fitted(predictor(estimator)) -3. Using `pipe.add_*` methods:: +* Using `pipe.add_*` methods:: pipe.add_transformer('transformer', Transformer()) pipe.add_predictor('predictor', Predictor()) pipe.add_label_transformer('l_transformer', LabelTransformer()) pipe.add_label_predictor('l_predictor', LabelPredictor()) -2. 
Using special handler methods:: +* Using special handler methods:: pipe.add_estimator('name1', EstimatorA()).mark_transformer() pipe.steps.add('name2', EstimatorB()).mark_predictor() From 9b27c78e9f9440a5ee65339d2a5f5fecf86b2939 Mon Sep 17 00:00:00 2001 From: Thierry Guillemot Date: Wed, 7 Jun 2017 15:48:55 +0200 Subject: [PATCH 011/118] Add sample props specification --- sample_props_spec.md | 259 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 259 insertions(+) create mode 100644 sample_props_spec.md diff --git a/sample_props_spec.md b/sample_props_spec.md new file mode 100644 index 0000000..d5c49b2 --- /dev/null +++ b/sample_props_spec.md @@ -0,0 +1,259 @@ +This is a specification to introduce data information (as `sample_weights`) +during the computation of an estimator methods (`fit`, `score`, ...) based on +the different discussion proposes on issues and PR : + +- [Consistent API for attaching properties to samples #4497]( + https://github.com/scikit-learn/scikit-learn/issues/4497) +- [Acceptance of sample_weights in pipeline.score #7723]( + https://github.com/scikit-learn/scikit-learn/pull/7723) +- [Establish global error state like np.seterr #4660]( + https://github.com/scikit-learn/scikit-learn/issues/4660) +- [Should cross-validation scoring take sample-weights into account? #4632]( + https://github.com/scikit-learn/scikit-learn/issues/4632) +- [Sample properties #4696]( + https://github.com/scikit-learn/scikit-learn/issues/4696) + +Probably related PR: +- [Add feature_extraction.ColumnTransformer #3886]( + https://github.com/scikit-learn/scikit-learn/pull/3886) +- [Categorical split for decision tree #3346]( + https://github.com/scikit-learn/scikit-learn/pull/3346) + +# 1. Requirement + +These requirements are defined from the different issues and PR discussions: + +- User can attach information to samples. +- Must be a DataFrame like object. +- Can be given to `fit`, `score`, `split` and every time user give X. +- Must work with every meta-estimator (`Pipeline, GridSearchCV, + cross_val_score`). +- Can specify what sample property is used by each part of the meta-estimator. +- Must raise an error if not necessary extra information are given to an + estimator. In the case of meta-estimator these errors are not raised. + +Requirement proposed but not used by this specification: +- User can attach feature properties to samples. + +# 2. Definition + +Some estimator in sklearn can change their behavior when an attribute +`sample_props` is provided. `sample_props` is a dictionary +(`pandas.DataFrame` compatible) defining sample properties. 
The example below explains how a `sample_props` can be provided to
+`LogisticRegression` to weight the samples:
+
+```python
+import numpy as np
+from sklearn import datasets
+from sklearn.linear_model import LogisticRegression
+
+digits = datasets.load_digits()
+X = digits.data
+y = digits.target
+
+# Define weights used by sample_props
+weights_fit = np.random.rand(X.shape[0])
+weights_fit /= np.sum(weights_fit)
+weights_score = np.random.rand(X.shape[0])
+weights_score /= np.sum(weights_score)
+
+logreg = LogisticRegression()
+
+# Fit and score a LogisticRegression without sample weights
+logreg = logreg.fit(X, y)
+score = logreg.score(X, y)
+print("Score obtained without applying weights: %f" % score)
+
+# Fit LogisticRegression without sample weights and score with sample weights
+logreg = logreg.fit(X, y)
+score = logreg.score(X, y, sample_props={'weight': weights_score})
+print("Score obtained by applying weights only to score: %f" % score)
+
+# Fit and score a LogisticRegression with sample weights
+logreg = logreg.fit(X, y, sample_props={'weight': weights_fit})
+score = logreg.score(X, y, sample_props={'weight': weights_score})
+print("Score obtained by applying weights to both"
+      " score and fit: %f" % score)
+```
+
+When an estimator expects a mandatory `sample_props`, an error is raised
+for each property not provided. Moreover, **if an unintended property is
+given through `sample_props`, a warning is raised** to signal that the
+result may differ from the one expected. For example, the following code:
+
+```python
+import numpy as np
+from sklearn import datasets
+from sklearn.linear_model import LogisticRegression
+
+digits = datasets.load_digits()
+X = digits.data
+y = digits.target
+weights = np.random.rand(X.shape[0])
+
+logreg = LogisticRegression()
+
+# This instruction will raise the warning
+logreg = logreg.fit(X, y, sample_props={'bad_property': weights})
+```
+
+will **raise the warning message**: "sample_props['bad_property'] is not
+used by `LogisticRegression.fit`. The results obtained may be different
+from the one expected."
+
+We provide the function `sklearn.seterr` in case you want to change the
+behavior of these messages. Even though they are treated as warnings by
+default, we recommend changing the behavior so that they are raised as
+errors. You can do it by adding the following code:
+
+```python
+sklearn.seterr(sample_props="raise")
+```
+
+Please refer to the documentation of `np.seterr` for more information.
+
+# 3. Behavior of `sample_props` for meta-estimators
+
+## 3.1 Common routing scheme
+
+Meta-estimators can also change their behavior when an attribute
+`sample_props` is provided. In that case, `sample_props` is sent to any
+internal estimator and function supporting the `sample_props` attribute.
+In other words, **every property defined in `sample_props` is transmitted
+to each internal function or class supporting `sample_props`**. 
For example in the following +example, the property `weights` is sent through `sample_props` to +`pca.fit_transform` and `logistic.fit`: + +```python +import numpy as np +from sklearn import decomposition, datasets, linear_model +from sklearn.pipeline import Pipeline + +digits = datasets.load_digits() +X = digits.data +y = digits.target + +logistic = linear_model.LogisticRegression() +pca = decomposition.PCA() +pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic),]) + +# Define weights +weights = np.random.rand(X.shape[0]) +weights /= np.sum(weights) + +# weights is send to pca.fit_transform and logistic.fit +pipe.fit(X, sample_props={"weights": weights}) +``` + +**By contrast with the estimator, no warning will be raised by a +meta-estimator if an extra property is sent through `sample_props`.** +Anyway, errors are still raised if a mandatory property is not provided. + +## 3.2 Override common routing scheme + +**You can override the common routing scheme of `sample_props` of nested objects +by defining sample properties of the form `__`.** + +**You can override the common routing scheme of `sample_props` by defining your +own routes through the `routing` attribute of a meta-estimator**. + +**A route defines a way to override the value of a key of `sample_props` by the +value of another key in the same `sample_props`. This modification is done +every time a method compatible with `sample_prop` is called.** + +To illustrate how it works, if you want to send `weights` only to `pca`, +you can define a `sample_prop` with a property `pca__weights`: + +```python +import numpy as np +from sklearn import decomposition, datasets, linear_model +from sklearn.pipeline import Pipeline + +digits = datasets.load_digits() +X = digits.data +y = digits.target + +logistic = linear_model.LogisticRegression() +pca = decomposition.PCA() + +# Create a route using routing +pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic),]) + +# Define weights +weights = np.random.rand(X.shape[0]) +weights /= np.sum(pca_weights) +pca_weights = np.random.rand(X.shape[0]) +pca_weights /= np.sum(pca_weights) + +# Only pca will receive pca_weights as weights +pipe.fit(X, sample_props={'pca__weights': pca_weights}) + +# pca will receive pca_weights and logistic will receive weights as weights +pipe.fit(X, sample_props={'pca__weights': pca_weights, + 'weights': weights}) +``` + +By defining `pca__weights`, we have overridden the property +`weights` for `pca`. On all cases, the property `pca__weights` +will be send to `pca` and `logistic`. + +**Overriding the routing scheme can be subtle and you must +remember the priority of application of each route types**: + +1. Routes applied specifically to a function/estimator: `{'pca__weights': weights}}` +2. 
Routes defined globally: `{'weights': weights}` + +Let's consider the following code to familiarized yourself with the different +routes definitions : + +```python +import numpy as np +from sklearn import datasets +from sklearn.linear_model import SGDClassifier +from sklearn.model_selection import cross_val_score, GridSearchCV, LeaveOneLabelOut + +digits = datasets.load_digits() +X = digits.data +y = digits.target + +# Define the groups used by cross_val_score +cv_groups = np.random.randint(3, size=y.shape) + +# Define the groups used by GridSearchCV +gs_groups = np.random.randint(3, size=y.shape) + +# Define weights used by cross_val_score +weights = np.random.rand(X.shape[0]) +weights /= np.sum(weights) + +# We define the GridSearchCV used by cross_val_score +grid = GridSearchCV(SGDClassifier(), params, cv=LeaveOneLabelOut()) + +# When cross_val_score is called, we send all parameters for internal values +cross_val_score(grid, X, y, cv=LeaveOneLabelOut(), + sample_props={'cv__groups': groups, + 'split__groups': gs_groups, + 'weights': weights}) +``` + +With this code, the `sample_props` sent to each function of `GridSearchCV` and +`cross_val_score` will be: + +| function | `sample_props` | +|:----------------|:-----------------------------------------------------------------------------------------------| +| grid.fit | `{'weights': weights, 'cv__groups': cv_groups, split_groups': gs_groups}` | +| grid.score | `{'weights': weights, 'cv__groups': cv_groups, split_groups': gs_groups}` | +| grid.split | `{'weights': weights, 'groups': gs_groups, 'cv__groups': cv_groups, split_groups': gs_groups}` | +| cross_val_score | `{'weights': weights, 'groups': groups, 'cv__groups': cv_groups, split_groups': gs_groups}` | + + +Thus, these functions receive as `weights` and `groups` properties : + +| function | `weights` | `groups` | +|:----------------|:-------------------|:------------| +| grid.fit | `weights` | `None` | +| grid.score | `weights` | `None` | +| grid.split | `weights` | `gs_groups` | +| cross_val_score | `weights` | `cv_groups` | From 0d4506fa93e24b1a76798fb0e8f0fd7de5a41045 Mon Sep 17 00:00:00 2001 From: Thierry Guillemot Date: Wed, 7 Jun 2017 18:06:18 +0200 Subject: [PATCH 012/118] Add alternative proposition 06.17.17 --- sample_props_spec.md | 94 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 94 insertions(+) diff --git a/sample_props_spec.md b/sample_props_spec.md index d5c49b2..0cfdf73 100644 --- a/sample_props_spec.md +++ b/sample_props_spec.md @@ -18,6 +18,10 @@ Probably related PR: https://github.com/scikit-learn/scikit-learn/pull/3886) - [Categorical split for decision tree #3346]( https://github.com/scikit-learn/scikit-learn/pull/3346) + +Google doc of the sample_prop discussion done during the sklearn day in paris +the 7th June 2017: +https://docs.google.com/document/d/1k8d4vyw87gWODiyAyQTz91Z1KOnYr6runx-N074qIBY/edit # 1. Requirement @@ -257,3 +261,93 @@ Thus, these functions receive as `weights` and `groups` properties : | grid.score | `weights` | `None` | | grid.split | `weights` | `gs_groups` | | cross_val_score | `weights` | `cv_groups` | + + +# 4. Alternative propositions for sample_props (06.17.17) +The meta-estimator says which columns of sample_props they wanted to use. 
+```python +p = make_pipeline( + PCA(n_components=10), + SVC(C=10).with(_=) +) +p.fit(X, y, sample_props={column_name=value}) +``` + +For example : +```python +p = make_pipeline( + PCA(n_components=10), + SVC(C=10).with(fit_weights='weights', score_weights='weights') +) +p.fit(X, y, sample_props={"weights": w}) +``` + +**Other proposals**: +- Olivier suggests to modify `.with(...)` by `.sample_props_mapping(...)`. +- Gael suggests to change the `.with(...)` by a property `with_props=...` like : +```python +p = make_pipeline( + PCA(n_components=10), + SVC(C=10), + with_props={ + 'svc':(_=)} +) +``` + +## 4.1 GridSearch + Pipeline case +Let's consider the case of a `GridSearch` working with a `Pipeline`. +How we definer the `sample_props` on that case ? + +### Alternative 1 +Pass through everything in `GridSearchCV`: +```python +pipe = make_pipeline( + PCA(), SVC(), + with_props={pca__fit_weight: 'my_weights'}}) +GridSearchCV( + pipe, cv=my_cv, + with_props={'cv__groups': "my_groups", '*':'*') +``` + +A more complex example with this solution: +```python +pipe = make_pipeline( + make_union( + CountVectorizer(analyzer='word').with(fit_weight='my_weight'), + CountVectorizer(analyzer='char').with(fit_weight='my_weight')), + SVC()) + +GridSearchCV( + pipe, + cv=my_cv.with(groups='my_groups'), score_weight='my_weight') +``` + +### Alternative 2 +Grid search manage the `sample_props` of all internal variable. +```python +pipe = make_pipeline(PCA(), SVC()) +GridSearchCV( + pipe, cv=my_cv, + with_props={ + 'cv__groups': "my_groups", + 'estimator__pca__fit_weight': "my_weights"), +       }) +``` + +A more complex example with this solution: +```python +pipe = make_pipeline( + make_union( + CountVectorizer(analyzer='word'), + CountVectorizer(analyzer='char')), + SVC()) +GridSearchCV( + pipe, cv=my_cv, + with_props={ + 'cv__groups': "my_groups", + 'estimator__featureunion__countvectorizer-1__fit_weight': "my_weights", + 'estimator__featureunion__countvectorizer-2__fit_weight': "my_weights", + 'score_weight': "my_weights", + } +) +``` From 3989b737642adf5f13a4e37ecfeed69cd592555a Mon Sep 17 00:00:00 2001 From: Andreas Mueller Date: Sat, 8 Dec 2018 14:05:50 -0500 Subject: [PATCH 013/118] Set theme jekyll-theme-slate --- _config.yml | 1 + 1 file changed, 1 insertion(+) create mode 100644 _config.yml diff --git a/_config.yml b/_config.yml new file mode 100644 index 0000000..c741881 --- /dev/null +++ b/_config.yml @@ -0,0 +1 @@ +theme: jekyll-theme-slate \ No newline at end of file From f3cfbc1337fb4c63e8aea5915781b8eb6a1f255a Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Wed, 12 Dec 2018 10:45:07 -0500 Subject: [PATCH 014/118] moved sample_props_spec.md int slep004 --- sample_props_spec.md => slep004/sample_props_spec.md | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename sample_props_spec.md => slep004/sample_props_spec.md (100%) diff --git a/sample_props_spec.md b/slep004/sample_props_spec.md similarity index 100% rename from sample_props_spec.md rename to slep004/sample_props_spec.md From 7fcf7cae18344f5f053eec24d5ab7260d18d9f61 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Wed, 12 Dec 2018 11:39:46 -0500 Subject: [PATCH 015/118] Spinned up shpinx doc --- Makefile | 20 ++ README.rst | 2 - conf.py | 166 +++++++++ index.rst | 27 ++ make.bat | 36 ++ requirements.txt | 2 + slep001/{discussion.rst => proposal.rst} | 21 +- slep002/proposal.rst | 4 +- slep003/proposal.rst | 7 +- slep004/proposal.rst | 406 +++++++++++++++++++++++ slep004/sample_props_spec.md | 353 -------------------- 11 files 
changed, 673 insertions(+), 371 deletions(-) create mode 100644 Makefile create mode 100644 conf.py create mode 100644 index.rst create mode 100644 make.bat create mode 100644 requirements.txt rename slep001/{discussion.rst => proposal.rst} (98%) create mode 100644 slep004/proposal.rst delete mode 100644 slep004/sample_props_spec.md diff --git a/Makefile b/Makefile new file mode 100644 index 0000000..9e1ed7d --- /dev/null +++ b/Makefile @@ -0,0 +1,20 @@ +# Minimal makefile for Sphinx documentation +# + +# You can set these variables from the command line. +SPHINXOPTS = +SPHINXBUILD = sphinx-build +SPHINXPROJ = Scikit-learnenhancementproposals +SOURCEDIR = . +BUILDDIR = build + +# Put it first so that "make" without argument is like "make help". +help: + @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) + +.PHONY: help Makefile + +# Catch-all target: route all unknown targets to Sphinx using the new +# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). +%: Makefile + @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) diff --git a/README.rst b/README.rst index 3bf8cd8..968de36 100644 --- a/README.rst +++ b/README.rst @@ -11,5 +11,3 @@ the rational and usecases that are addressed, the problems and the major possible solution. It should be a summary of the key points that drive the decision, and ideally converge to a draft of an API or object to be implemented in scikit-learn. - - diff --git a/conf.py b/conf.py new file mode 100644 index 0000000..49b8fc9 --- /dev/null +++ b/conf.py @@ -0,0 +1,166 @@ +# -*- coding: utf-8 -*- +# +# Configuration file for the Sphinx documentation builder. +# +# This file does only contain a selection of the most common options. For a +# full list see the documentation: +# http://www.sphinx-doc.org/en/master/config + +# -- Path setup -------------------------------------------------------------- + +# If extensions (or modules to document with autodoc) are in another directory, +# add these directories to sys.path here. If the directory is relative to the +# documentation root, use os.path.abspath to make it absolute, like shown here. +# +# import os +# import sys +# sys.path.insert(0, os.path.abspath('.')) + + +# -- Project information ----------------------------------------------------- + +project = 'Scikit-learn enhancement proposals' +copyright = '2018, scikit-learn community' +author = 'scikit-learn community' + +# The short X.Y version +version = '' +# The full version, including alpha/beta/rc tags +release = '' + + +# -- General configuration --------------------------------------------------- + +# If your documentation needs a minimal Sphinx version, state it here. +# +# needs_sphinx = '1.0' + +# Add any Sphinx extension module names here, as strings. They can be +# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom +# ones. +extensions = [ + 'sphinx.ext.intersphinx', + 'sphinx.ext.mathjax', + 'sphinx.ext.viewcode', +] + +# Add any paths that contain templates here, relative to this directory. +templates_path = ['_templates'] + +# The suffix(es) of source filenames. +# You can specify multiple suffix as a list of string: +# +# source_suffix = ['.rst', '.md'] +source_suffix = '.rst' + +# The master toctree document. +master_doc = 'index' + +# The language for content autogenerated by Sphinx. Refer to documentation +# for a list of supported languages. +# +# This is also used if you do content translation via gettext catalogs. 
+# Usually you set "language" from the command line for these cases. +language = None + +# List of patterns, relative to source directory, that match files and +# directories to ignore when looking for source files. +# This pattern also affects html_static_path and html_extra_path . +exclude_patterns = [] + +# The name of the Pygments (syntax highlighting) style to use. +pygments_style = 'sphinx' + + +# -- Options for HTML output ------------------------------------------------- + +# The theme to use for HTML and HTML Help pages. See the documentation for +# a list of builtin themes. +# +html_theme = 'sphinx_rtd_theme' + +# Theme options are theme-specific and customize the look and feel of a theme +# further. For a list of options available for each theme, see the +# documentation. +# +# html_theme_options = {} + +# Add any paths that contain custom static files (such as style sheets) here, +# relative to this directory. They are copied after the builtin static files, +# so a file named "default.css" will overwrite the builtin "default.css". +html_static_path = ['_static'] + +# Custom sidebar templates, must be a dictionary that maps document names +# to template names. +# +# The default sidebars (for documents that don't match any pattern) are +# defined by theme itself. Builtin themes are using these templates by +# default: ``['localtoc.html', 'relations.html', 'sourcelink.html', +# 'searchbox.html']``. +# +# html_sidebars = {} + + +# -- Options for HTMLHelp output --------------------------------------------- + +# Output file base name for HTML help builder. +htmlhelp_basename = 'Scikit-learnenhancementproposalsdoc' + + +# -- Options for LaTeX output ------------------------------------------------ + +latex_elements = { + # The paper size ('letterpaper' or 'a4paper'). + # + # 'papersize': 'letterpaper', + + # The font size ('10pt', '11pt' or '12pt'). + # + # 'pointsize': '10pt', + + # Additional stuff for the LaTeX preamble. + # + # 'preamble': '', + + # Latex figure (float) alignment + # + # 'figure_align': 'htbp', +} + +# Grouping the document tree into LaTeX files. List of tuples +# (source start file, target name, title, +# author, documentclass [howto, manual, or own class]). +latex_documents = [ + (master_doc, 'Scikit-learnenhancementproposals.tex', 'Scikit-learn enhancement proposals Documentation', + 'scikit-learn community', 'manual'), +] + + +# -- Options for manual page output ------------------------------------------ + +# One entry per manual page. List of tuples +# (source start file, name, description, authors, manual section). +man_pages = [ + (master_doc, 'scikit-learnenhancementproposals', 'Scikit-learn enhancement proposals Documentation', + [author], 1) +] + + +# -- Options for Texinfo output ---------------------------------------------- + +# Grouping the document tree into Texinfo files. List of tuples +# (source start file, target name, title, author, +# dir menu entry, description, category) +texinfo_documents = [ + (master_doc, 'Scikit-learnenhancementproposals', 'Scikit-learn enhancement proposals Documentation', + author, 'Scikit-learnenhancementproposals', 'One line description of project.', + 'Miscellaneous'), +] + + +# -- Extension configuration ------------------------------------------------- + +# -- Options for intersphinx extension --------------------------------------- + +# Example configuration for intersphinx: refer to the Python standard library. 
+intersphinx_mapping = {'https://docs.python.org/': None} diff --git a/index.rst b/index.rst new file mode 100644 index 0000000..db8afc6 --- /dev/null +++ b/index.rst @@ -0,0 +1,27 @@ +.. Scikit-learn enhancement proposals documentation master file, created by + sphinx-quickstart on Wed Dec 12 10:57:18 2018. + You can adapt this file completely to your liking, but it should at least + contain the root `toctree` directive. + +Scikit-learn enhancement proposals +================================== + +This repository is for structured discussions about large modifications or +additions to scikit-learn. + +The discussions must create an "enhancement proposal", similar Python +enhancement proposal, that reflects the major arguments to keep in mind, the +rational and usecases that are addressed, the problems and the major +possible solution. It should be a summary of the key points that drive the +decision, and ideally converge to a draft of an API or object to be +implemented in scikit-learn. + +.. toctree:: + :maxdepth: 1 + :numbered: + :caption: Proposals: + + slep001/proposal + slep002/proposal + slep003/proposal + slep004/proposal diff --git a/make.bat b/make.bat new file mode 100644 index 0000000..55e4c0e --- /dev/null +++ b/make.bat @@ -0,0 +1,36 @@ +@ECHO OFF + +pushd %~dp0 + +REM Command file for Sphinx documentation + +if "%SPHINXBUILD%" == "" ( + set SPHINXBUILD=sphinx-build +) +set SOURCEDIR=. +set BUILDDIR=build +set SPHINXPROJ=Scikit-learnenhancementproposals + +if "%1" == "" goto help + +%SPHINXBUILD% >NUL 2>NUL +if errorlevel 9009 ( + echo. + echo.The 'sphinx-build' command was not found. Make sure you have Sphinx + echo.installed, then set the SPHINXBUILD environment variable to point + echo.to the full path of the 'sphinx-build' executable. Alternatively you + echo.may add the Sphinx directory to PATH. + echo. + echo.If you don't have Sphinx installed, grab it from + echo.http://sphinx-doc.org/ + exit /b 1 +) + +%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% +goto end + +:help +%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% + +:end +popd diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..cbf1e36 --- /dev/null +++ b/requirements.txt @@ -0,0 +1,2 @@ +sphinx +sphinx-rtd-theme diff --git a/slep001/discussion.rst b/slep001/proposal.rst similarity index 98% rename from slep001/discussion.rst rename to slep001/proposal.rst index 00c6e62..18bc070 100644 --- a/slep001/discussion.rst +++ b/slep001/proposal.rst @@ -1,3 +1,5 @@ +.. _slep_001: + ===================================== Transformers that modify their target ===================================== @@ -16,13 +18,11 @@ Transformers that modify their target Doing many of these things is possible "by hand". The question is: how to avoid writing custom connecting logic. -.. sectnum:: - .. contents:: Table of contents :depth: 2 Rational -========== +======== Summary of the contract of transformers ---------------------------------------- @@ -86,10 +86,10 @@ data-processing pipeline becomes rich, the more the data grow, the more the usecases above become important. Enhancements proposed -======================= +===================== Option A: meta-estimators ---------------------------- +------------------------- Proposal ........ @@ -194,7 +194,7 @@ powerful pattern. The discussion here is about extending a bit the estimator API to have a less pressing need for meta-estimators. 
Option B: transformer-like that modify y ------------------------------------------- +---------------------------------------- .. topic:: **Two variants** @@ -244,10 +244,10 @@ Contracts (these are weaker contracts than the transformer: * `fit_modify` may not exist (questionnable) Design questions and difficulties -.................................. +................................. Should there be a fit method? -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In such estimators, it may not be a good idea to call fit rather than fit_modify (for instance in coreset). @@ -352,7 +352,7 @@ readability of the code easier. - pipe_transform Benefits -......... +........ * Many usecases listed above will be implemented scikit-learn without a meta-estimator, and thus will be easy to use (eg in a pipeline). Many @@ -365,7 +365,7 @@ Benefits (extract transform and load) pattern. Limitations -............ +........... * Introducing new methods, and a new type of estimator object. There are probably a total of **3 new methods** that will get introduced by this @@ -380,4 +380,3 @@ TODO * Implement an example doing outlier filtering * Implement an example doing data downsampling - diff --git a/slep002/proposal.rst b/slep002/proposal.rst index ae9e3ce..ba275d8 100644 --- a/slep002/proposal.rst +++ b/slep002/proposal.rst @@ -1,3 +1,5 @@ +.. _slep_002: + ================= Dynamic pipelines ================= @@ -6,8 +8,6 @@ Dynamic pipelines Create and manipulate pipelines with ease. -.. sectnum:: - .. contents:: Table of contents :depth: 3 diff --git a/slep003/proposal.rst b/slep003/proposal.rst index de4ed39..511fcd2 100644 --- a/slep003/proposal.rst +++ b/slep003/proposal.rst @@ -1,3 +1,5 @@ +.. _slep_003: + ====================================== Consistent inspection for transformers ====================================== @@ -8,9 +10,8 @@ Consistent inspection for transformers consistently with ``get_feature_dependence() -> boolean (n_outputs, n_inputs)`` -.. sectnum:: - - :depth: 3 +.. contents:: Table of contents + :depth: 2 Goals ===== diff --git a/slep004/proposal.rst b/slep004/proposal.rst new file mode 100644 index 0000000..d822f5e --- /dev/null +++ b/slep004/proposal.rst @@ -0,0 +1,406 @@ +.. _slep_004: + +================ +Data information +================ + +This is a specification to introduce data information (as +``sample_weights``) during the computation of an estimator methods +(``fit``, ``score``, ...) based on the different discussion proposes on +issues and PR : + +- `Consistent API for attaching properties to samples + #4497 `__ +- `Acceptance of sample\_weights in pipeline.score + #7723 `__ +- `Establish global error state like np.seterr + #4660 `__ +- `Should cross-validation scoring take sample-weights into account? + #4632 `__ +- `Sample properties + #4696 `__ + +Probably related PR: - `Add feature\_extraction.ColumnTransformer +#3886 `__ - +`Categorical split for decision tree +#3346 `__ + +Google doc of the sample\_prop discussion done during the sklearn day in +paris the 7th June 2017: +https://docs.google.com/document/d/1k8d4vyw87gWODiyAyQTz91Z1KOnYr6runx-N074qIBY/edit + +.. contents:: Table of contents + :depth: 2 + +1. Requirement +============== + +These requirements are defined from the different issues and PR +discussions: + +- User can attach information to samples. +- Must be a DataFrame like object. +- Can be given to ``fit``, ``score``, ``split`` and every time user + give X. 
+- Must work with every meta-estimator + (``Pipeline, GridSearchCV, cross_val_score``). +- Can specify what sample property is used by each part of the + meta-estimator. +- Must raise an error if not necessary extra information are given to + an estimator. In the case of meta-estimator these errors are not + raised. + +Requirement proposed but not used by this specification: - User can +attach feature properties to samples. + +2. Definition +============= + +Some estimator in sklearn can change their behavior when an attribute +``sample_props`` is provided. ``sample_props`` is a dictionary +(``pandas.DataFrame`` compatible) defining sample properties. The +example bellow explain how a ``sample_props`` can be provided to +LogisticRegression to weighted the samples: + +.. code:: python + + import numpy as np + from sklearn import datasets + from sklearn.linear_model import LogisticRegression + + digits = datasets.load_digits() + X = digits.data + y = digits.target + + # Define weights used by sample_props + weights_fit = np.random.rand(X.shape[0]) + weights_fit /= np.sum(weights_fit) + weights_score = np.random.rand(X.shape[0]) + weights_score /= np.sum(weights_score) + + logreg = LogisticRegression() + + # Fit and score a LogisticRegression without sample weights + logreg = logreg.fit(X, y) + score = logreg.score(X, y) + print("Score obtained without applying weights: %f" % score) + + # Fit LogisticRegression without sample weights and score with sample weights + logreg = logreg.fit(X, y) + score = logreg.score(X, y, sample_props={'weight': weights_score}) + print("Score obtained by applying weights only to score: %f" % score) + + # Fit and score a LogisticRegression with sample weights + log_reg = logreg.fit(X, y, sample_props={'weight': weights_fit}) + score = logreg.score(X, y, sample_props={'weight': weights_score}) + print("Score obtained by applying weights to both" + " score and fit: %f" % score) + +When an estimator expects a mandatory ``sample_props``, an error is +raised for each property not provided. Moreover if an unintended +properties is given through ``sample_props``, a warning will be +launched to prevent that the result may be different from the one +expected. For example, the following code : + +.. code:: python + + import numpy as np + from sklearn import datasets + from sklearn.cluster import KMeans + from sklearn.pipeline import Pipeline + + digits = datasets.load_digits() + X = digits.data + y = digits.target + weights = np.random.rand(X.shape[0]) + + logreg = LogisticRegression() + + # This instruction will raise the warning + logreg = logreg.fit(X, y, sample_props={'bad_property': weights}) + +will **raise the warning message**: "sample\_props['bad\_property'] is +not used by ``LogisticRegression.fit``. The results obtained may be +different from the one expected." + +We provide the function ``sklearn.seterr`` in the case you want to +change the behavior of theses messages. Even if there are considered as +warnings by default, we recommend to change the behavior to raise as +errors. You can do it by adding the following code: + +.. code:: python + + sklearn.seterr(sample_props="raise") + +Please refer to the documentation of ``np.seterr`` for more information. + +3. Behavior of ``sample_props`` for meta-estimator +================================================== + +3.1 Common routing scheme +------------------------- + +Meta-estimators can also change their behavior when an attribute +``sample_props`` is provided. 
On that case, ``sample_props`` will be +sent to any internal estimator and function supporting the +``sample_props`` attribute. In other terms all the property defined by +``sample_props`` will be transmitted to each internal functions or +classes supporting ``sample_props``. For example in the following +example, the property ``weights`` is sent through ``sample_props`` to +``pca.fit_transform`` and ``logistic.fit``: + +.. code:: python + + import numpy as np + from sklearn import decomposition, datasets, linear_model + from sklearn.pipeline import Pipeline + + digits = datasets.load_digits() + X = digits.data + y = digits.target + + logistic = linear_model.LogisticRegression() + pca = decomposition.PCA() + pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic),]) + + # Define weights + weights = np.random.rand(X.shape[0]) + weights /= np.sum(weights) + + # weights is send to pca.fit_transform and logistic.fit + pipe.fit(X, sample_props={"weights": weights}) + +By contrast with the estimator, no warning will be raised by a +meta-estimator if an extra property is sent through ``sample_props``. +Anyway, errors are still raised if a mandatory property is not provided. + +3.2 Override common routing scheme +---------------------------------- + +You can override the common routing scheme of ``sample_props`` of +nested objects by defining sample properties of the form +``__``. + +You can override the common routing scheme of ``sample_props`` by +defining your own routes through the ``routing`` attribute of a +meta-estimator. + +A route defines a way to override the value of a key of +``sample_props`` by the value of another key in the same +``sample_props``. This modification is done every time a method +compatible with ``sample_prop`` is called. + +To illustrate how it works, if you want to send ``weights`` only to +``pca``, you can define a ``sample_prop`` with a property +``pca__weights``: + +.. code:: python + + import numpy as np + from sklearn import decomposition, datasets, linear_model + from sklearn.pipeline import Pipeline + + digits = datasets.load_digits() + X = digits.data + y = digits.target + + logistic = linear_model.LogisticRegression() + pca = decomposition.PCA() + + # Create a route using routing + pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic),]) + + # Define weights + weights = np.random.rand(X.shape[0]) + weights /= np.sum(pca_weights) + pca_weights = np.random.rand(X.shape[0]) + pca_weights /= np.sum(pca_weights) + + # Only pca will receive pca_weights as weights + pipe.fit(X, sample_props={'pca__weights': pca_weights}) + + # pca will receive pca_weights and logistic will receive weights as weights + pipe.fit(X, sample_props={'pca__weights': pca_weights, + 'weights': weights}) + +By defining ``pca__weights``, we have overridden the property +``weights`` for ``pca``. On all cases, the property ``pca__weights`` +will be send to ``pca`` and ``logistic``. + +Overriding the routing scheme can be subtle and you must remember the +priority of application of each route types: + +1. Routes applied specifically to a function/estimator: + ``{'pca__weights': weights}}`` +2. Routes defined globally: ``{'weights': weights}`` + +Let's consider the following code to familiarized yourself with the +different routes definitions : + +.. 
code:: python + + import numpy as np + from sklearn import datasets + from sklearn.linear_model import SGDClassifier + from sklearn.model_selection import cross_val_score, GridSearchCV, LeaveOneLabelOut + + digits = datasets.load_digits() + X = digits.data + y = digits.target + + # Define the groups used by cross_val_score + cv_groups = np.random.randint(3, size=y.shape) + + # Define the groups used by GridSearchCV + gs_groups = np.random.randint(3, size=y.shape) + + # Define weights used by cross_val_score + weights = np.random.rand(X.shape[0]) + weights /= np.sum(weights) + + # We define the GridSearchCV used by cross_val_score + grid = GridSearchCV(SGDClassifier(), params, cv=LeaveOneLabelOut()) + + # When cross_val_score is called, we send all parameters for internal values + cross_val_score(grid, X, y, cv=LeaveOneLabelOut(), + sample_props={'cv__groups': groups, + 'split__groups': gs_groups, + 'weights': weights}) + +With this code, the ``sample_props`` sent to each function of +``GridSearchCV`` and ``cross_val_score`` will be: + ++-------------+--------------------------------------------------------------+ +| function | ``sample_props`` | ++=============+==============================================================+ +| grid.fit | ``{'weights': weights, 'cv__groups': cv_groups, split_groups | +| | ': gs_groups}`` | ++-------------+--------------------------------------------------------------+ +| grid.score | ``{'weights': weights, 'cv__groups': cv_groups, split_groups | +| | ': gs_groups}`` | ++-------------+--------------------------------------------------------------+ +| grid.split | ``{'weights': weights, 'groups': gs_groups, 'cv__groups': cv | +| | _groups, split_groups': gs_groups}`` | ++-------------+--------------------------------------------------------------+ +| cross\_val\ | ``{'weights': weights, 'groups': groups, 'cv__groups': cv_gr | +| _score | oups, split_groups': gs_groups}`` | ++-------------+--------------------------------------------------------------+ + +Thus, these functions receive as ``weights`` and ``groups`` properties : + ++---------------------+---------------+-----------------+ +| function | ``weights`` | ``groups`` | ++=====================+===============+=================+ +| grid.fit | ``weights`` | ``None`` | ++---------------------+---------------+-----------------+ +| grid.score | ``weights`` | ``None`` | ++---------------------+---------------+-----------------+ +| grid.split | ``weights`` | ``gs_groups`` | ++---------------------+---------------+-----------------+ +| cross\_val\_score | ``weights`` | ``cv_groups`` | ++---------------------+---------------+-----------------+ + +4. Alternative propositions for sample\_props (06.17.17) +======================================================== + +The meta-estimator says which columns of sample\_props they wanted to +use. + +.. code:: python + + p = make_pipeline( + PCA(n_components=10), + SVC(C=10).with(_=) + ) + p.fit(X, y, sample_props={column_name=value}) + +For example : + +.. code:: python + + p = make_pipeline( + PCA(n_components=10), + SVC(C=10).with(fit_weights='weights', score_weights='weights') + ) + p.fit(X, y, sample_props={"weights": w}) + +**Other proposals**: - Olivier suggests to modify ``.with(...)`` by +``.sample_props_mapping(...)``. - Gael suggests to change the +``.with(...)`` by a property ``with_props=...`` like : + +.. 
code:: python + + p = make_pipeline( + PCA(n_components=10), + SVC(C=10), + with_props={ + 'svc':(_=)} + ) + +4.1 GridSearch + Pipeline case +------------------------------ + +Let's consider the case of a ``GridSearch`` working with a ``Pipeline``. +How we definer the ``sample_props`` on that case ? + +Alternative 1 +~~~~~~~~~~~~~ + +Pass through everything in ``GridSearchCV``: + +.. code:: python + + pipe = make_pipeline( + PCA(), SVC(), + with_props={pca__fit_weight: 'my_weights'}}) + GridSearchCV( + pipe, cv=my_cv, + with_props={'cv__groups': "my_groups", '*':'*') + +A more complex example with this solution: + +.. code:: python + + pipe = make_pipeline( + make_union( + CountVectorizer(analyzer='word').with(fit_weight='my_weight'), + CountVectorizer(analyzer='char').with(fit_weight='my_weight')), + SVC()) + + GridSearchCV( + pipe, + cv=my_cv.with(groups='my_groups'), score_weight='my_weight') + +Alternative 2 +~~~~~~~~~~~~~ + +Grid search manage the ``sample_props`` of all internal variable. + +.. code:: python + + pipe = make_pipeline(PCA(), SVC()) + GridSearchCV( + pipe, cv=my_cv, + with_props={ + 'cv__groups': "my_groups", + 'estimator__pca__fit_weight': "my_weights"), +       }) + +A more complex example with this solution: + +.. code:: python + + pipe = make_pipeline( + make_union( + CountVectorizer(analyzer='word'), + CountVectorizer(analyzer='char')), + SVC()) + GridSearchCV( + pipe, cv=my_cv, + with_props={ + 'cv__groups': "my_groups", + 'estimator__featureunion__countvectorizer-1__fit_weight': "my_weights", + 'estimator__featureunion__countvectorizer-2__fit_weight': "my_weights", + 'score_weight': "my_weights", + } + ) diff --git a/slep004/sample_props_spec.md b/slep004/sample_props_spec.md deleted file mode 100644 index 0cfdf73..0000000 --- a/slep004/sample_props_spec.md +++ /dev/null @@ -1,353 +0,0 @@ -This is a specification to introduce data information (as `sample_weights`) -during the computation of an estimator methods (`fit`, `score`, ...) based on -the different discussion proposes on issues and PR : - -- [Consistent API for attaching properties to samples #4497]( - https://github.com/scikit-learn/scikit-learn/issues/4497) -- [Acceptance of sample_weights in pipeline.score #7723]( - https://github.com/scikit-learn/scikit-learn/pull/7723) -- [Establish global error state like np.seterr #4660]( - https://github.com/scikit-learn/scikit-learn/issues/4660) -- [Should cross-validation scoring take sample-weights into account? #4632]( - https://github.com/scikit-learn/scikit-learn/issues/4632) -- [Sample properties #4696]( - https://github.com/scikit-learn/scikit-learn/issues/4696) - -Probably related PR: -- [Add feature_extraction.ColumnTransformer #3886]( - https://github.com/scikit-learn/scikit-learn/pull/3886) -- [Categorical split for decision tree #3346]( - https://github.com/scikit-learn/scikit-learn/pull/3346) - -Google doc of the sample_prop discussion done during the sklearn day in paris -the 7th June 2017: -https://docs.google.com/document/d/1k8d4vyw87gWODiyAyQTz91Z1KOnYr6runx-N074qIBY/edit - -# 1. Requirement - -These requirements are defined from the different issues and PR discussions: - -- User can attach information to samples. -- Must be a DataFrame like object. -- Can be given to `fit`, `score`, `split` and every time user give X. -- Must work with every meta-estimator (`Pipeline, GridSearchCV, - cross_val_score`). -- Can specify what sample property is used by each part of the meta-estimator. 
-- Must raise an error if not necessary extra information are given to an - estimator. In the case of meta-estimator these errors are not raised. - -Requirement proposed but not used by this specification: -- User can attach feature properties to samples. - -# 2. Definition - -Some estimator in sklearn can change their behavior when an attribute -`sample_props` is provided. `sample_props` is a dictionary -(`pandas.DataFrame` compatible) defining sample properties. The example bellow -explain how a `sample_props` can be provided to LogisticRegression to -weighted the samples: - -```python -import numpy as np -from sklearn import datasets -from sklearn.linear_model import LogisticRegression - -digits = datasets.load_digits() -X = digits.data -y = digits.target - -# Define weights used by sample_props -weights_fit = np.random.rand(X.shape[0]) -weights_fit /= np.sum(weights_fit) -weights_score = np.random.rand(X.shape[0]) -weights_score /= np.sum(weights_score) - -logreg = LogisticRegression() - -# Fit and score a LogisticRegression without sample weights -logreg = logreg.fit(X, y) -score = logreg.score(X, y) -print("Score obtained without applying weights: %f" % score) - -# Fit LogisticRegression without sample weights and score with sample weights -logreg = logreg.fit(X, y) -score = logreg.score(X, y, sample_props={'weight': weights_score}) -print("Score obtained by applying weights only to score: %f" % score) - -# Fit and score a LogisticRegression with sample weights -log_reg = logreg.fit(X, y, sample_props={'weight': weights_fit}) -score = logreg.score(X, y, sample_props={'weight': weights_score}) -print("Score obtained by applying weights to both" - " score and fit: %f" % score) -``` - -When an estimator expects a mandatory `sample_props`, an error is raised for -each property not provided. Moreover **if an unintended properties is given -through `sample_props`, a warning will be launched** to prevent that the result -may be different from the one expected. For example, the following code : - -```python -import numpy as np -from sklearn import datasets -from sklearn.cluster import KMeans -from sklearn.pipeline import Pipeline - -digits = datasets.load_digits() -X = digits.data -y = digits.target -weights = np.random.rand(X.shape[0]) - -logreg = LogisticRegression() - -# This instruction will raise the warning -logreg = logreg.fit(X, y, sample_props={'bad_property': weights}) -``` - -will **raise the warning message**: "sample_props['bad_property'] is not used by -`LogisticRegression.fit`. The results obtained may be different from the one -expected." - -We provide the function `sklearn.seterr` in the case you want to change the -behavior of theses messages. Even if there are considered as warnings by -default, we recommend to change the behavior to raise as errors. You can do it -by adding the following code: - -```python -sklearn.seterr(sample_props="raise") -``` - -Please refer to the documentation of `np.seterr` for more information. - -# 3. Behavior of `sample_props` for meta-estimator - -## 3.1 Common routing scheme - -Meta-estimators can also change their behavior when an attribute `sample_props` -is provided. On that case, `sample_props` will be sent to any internal estimator -and function supporting the `sample_props` attribute. In other terms **all the -property defined by `sample_props` will be transmitted to each internal -functions or classes supporting `sample_props`**. 
For example in the following -example, the property `weights` is sent through `sample_props` to -`pca.fit_transform` and `logistic.fit`: - -```python -import numpy as np -from sklearn import decomposition, datasets, linear_model -from sklearn.pipeline import Pipeline - -digits = datasets.load_digits() -X = digits.data -y = digits.target - -logistic = linear_model.LogisticRegression() -pca = decomposition.PCA() -pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic),]) - -# Define weights -weights = np.random.rand(X.shape[0]) -weights /= np.sum(weights) - -# weights is send to pca.fit_transform and logistic.fit -pipe.fit(X, sample_props={"weights": weights}) -``` - -**By contrast with the estimator, no warning will be raised by a -meta-estimator if an extra property is sent through `sample_props`.** -Anyway, errors are still raised if a mandatory property is not provided. - -## 3.2 Override common routing scheme - -**You can override the common routing scheme of `sample_props` of nested objects -by defining sample properties of the form `__`.** - -**You can override the common routing scheme of `sample_props` by defining your -own routes through the `routing` attribute of a meta-estimator**. - -**A route defines a way to override the value of a key of `sample_props` by the -value of another key in the same `sample_props`. This modification is done -every time a method compatible with `sample_prop` is called.** - -To illustrate how it works, if you want to send `weights` only to `pca`, -you can define a `sample_prop` with a property `pca__weights`: - -```python -import numpy as np -from sklearn import decomposition, datasets, linear_model -from sklearn.pipeline import Pipeline - -digits = datasets.load_digits() -X = digits.data -y = digits.target - -logistic = linear_model.LogisticRegression() -pca = decomposition.PCA() - -# Create a route using routing -pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic),]) - -# Define weights -weights = np.random.rand(X.shape[0]) -weights /= np.sum(pca_weights) -pca_weights = np.random.rand(X.shape[0]) -pca_weights /= np.sum(pca_weights) - -# Only pca will receive pca_weights as weights -pipe.fit(X, sample_props={'pca__weights': pca_weights}) - -# pca will receive pca_weights and logistic will receive weights as weights -pipe.fit(X, sample_props={'pca__weights': pca_weights, - 'weights': weights}) -``` - -By defining `pca__weights`, we have overridden the property -`weights` for `pca`. On all cases, the property `pca__weights` -will be send to `pca` and `logistic`. - -**Overriding the routing scheme can be subtle and you must -remember the priority of application of each route types**: - -1. Routes applied specifically to a function/estimator: `{'pca__weights': weights}}` -2. 
Routes defined globally: `{'weights': weights}` - -Let's consider the following code to familiarized yourself with the different -routes definitions : - -```python -import numpy as np -from sklearn import datasets -from sklearn.linear_model import SGDClassifier -from sklearn.model_selection import cross_val_score, GridSearchCV, LeaveOneLabelOut - -digits = datasets.load_digits() -X = digits.data -y = digits.target - -# Define the groups used by cross_val_score -cv_groups = np.random.randint(3, size=y.shape) - -# Define the groups used by GridSearchCV -gs_groups = np.random.randint(3, size=y.shape) - -# Define weights used by cross_val_score -weights = np.random.rand(X.shape[0]) -weights /= np.sum(weights) - -# We define the GridSearchCV used by cross_val_score -grid = GridSearchCV(SGDClassifier(), params, cv=LeaveOneLabelOut()) - -# When cross_val_score is called, we send all parameters for internal values -cross_val_score(grid, X, y, cv=LeaveOneLabelOut(), - sample_props={'cv__groups': groups, - 'split__groups': gs_groups, - 'weights': weights}) -``` - -With this code, the `sample_props` sent to each function of `GridSearchCV` and -`cross_val_score` will be: - -| function | `sample_props` | -|:----------------|:-----------------------------------------------------------------------------------------------| -| grid.fit | `{'weights': weights, 'cv__groups': cv_groups, split_groups': gs_groups}` | -| grid.score | `{'weights': weights, 'cv__groups': cv_groups, split_groups': gs_groups}` | -| grid.split | `{'weights': weights, 'groups': gs_groups, 'cv__groups': cv_groups, split_groups': gs_groups}` | -| cross_val_score | `{'weights': weights, 'groups': groups, 'cv__groups': cv_groups, split_groups': gs_groups}` | - - -Thus, these functions receive as `weights` and `groups` properties : - -| function | `weights` | `groups` | -|:----------------|:-------------------|:------------| -| grid.fit | `weights` | `None` | -| grid.score | `weights` | `None` | -| grid.split | `weights` | `gs_groups` | -| cross_val_score | `weights` | `cv_groups` | - - -# 4. Alternative propositions for sample_props (06.17.17) -The meta-estimator says which columns of sample_props they wanted to use. -```python -p = make_pipeline( - PCA(n_components=10), - SVC(C=10).with(_=) -) -p.fit(X, y, sample_props={column_name=value}) -``` - -For example : -```python -p = make_pipeline( - PCA(n_components=10), - SVC(C=10).with(fit_weights='weights', score_weights='weights') -) -p.fit(X, y, sample_props={"weights": w}) -``` - -**Other proposals**: -- Olivier suggests to modify `.with(...)` by `.sample_props_mapping(...)`. -- Gael suggests to change the `.with(...)` by a property `with_props=...` like : -```python -p = make_pipeline( - PCA(n_components=10), - SVC(C=10), - with_props={ - 'svc':(_=)} -) -``` - -## 4.1 GridSearch + Pipeline case -Let's consider the case of a `GridSearch` working with a `Pipeline`. -How we definer the `sample_props` on that case ? 
- -### Alternative 1 -Pass through everything in `GridSearchCV`: -```python -pipe = make_pipeline( - PCA(), SVC(), - with_props={pca__fit_weight: 'my_weights'}}) -GridSearchCV( - pipe, cv=my_cv, - with_props={'cv__groups': "my_groups", '*':'*') -``` - -A more complex example with this solution: -```python -pipe = make_pipeline( - make_union( - CountVectorizer(analyzer='word').with(fit_weight='my_weight'), - CountVectorizer(analyzer='char').with(fit_weight='my_weight')), - SVC()) - -GridSearchCV( - pipe, - cv=my_cv.with(groups='my_groups'), score_weight='my_weight') -``` - -### Alternative 2 -Grid search manage the `sample_props` of all internal variable. -```python -pipe = make_pipeline(PCA(), SVC()) -GridSearchCV( - pipe, cv=my_cv, - with_props={ - 'cv__groups': "my_groups", - 'estimator__pca__fit_weight': "my_weights"), -       }) -``` - -A more complex example with this solution: -```python -pipe = make_pipeline( - make_union( - CountVectorizer(analyzer='word'), - CountVectorizer(analyzer='char')), - SVC()) -GridSearchCV( - pipe, cv=my_cv, - with_props={ - 'cv__groups': "my_groups", - 'estimator__featureunion__countvectorizer-1__fit_weight': "my_weights", - 'estimator__featureunion__countvectorizer-2__fit_weight': "my_weights", - 'score_weight': "my_weights", - } -) -``` From 12b61c5c78cac0ff4acc6c614b13b6165e075fe4 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Wed, 12 Dec 2018 14:38:24 -0500 Subject: [PATCH 016/118] Added slep_template from nep_template --- index.rst | 6 ++++ slep_template.rst | 77 +++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 83 insertions(+) create mode 100644 slep_template.rst diff --git a/index.rst b/index.rst index db8afc6..9cc30ff 100644 --- a/index.rst +++ b/index.rst @@ -16,6 +16,12 @@ possible solution. It should be a summary of the key points that drive the decision, and ideally converge to a draft of an API or object to be implemented in scikit-learn. +.. toctree:: + :maxdepth: 1 + :caption: Template: + + slep_template + .. toctree:: :maxdepth: 1 :numbered: diff --git a/slep_template.rst b/slep_template.rst new file mode 100644 index 0000000..b05f710 --- /dev/null +++ b/slep_template.rst @@ -0,0 +1,77 @@ +============================== +SLEP Template and Instructions +============================== + +:Author: +:Status: +:Type: +:Created: +:Resolution: (required for Accepted | Rejected | Withdrawn) + +Abstract +-------- + +The abstract should be a short description of what the SLEP will achieve. + + +Detailed description +-------------------- + +This section describes the need for the SLEP. It should describe the +existing problem that it is trying to solve and why this SLEP makes the +situation better. It should include examples of how the new functionality +would be used and perhaps some use cases. + + +Implementation +-------------- + +This section lists the major steps required to implement the SLEP. Where +possible, it should be noted where one step is dependent on another, and which +steps may be optionally omitted. Where it makes sense, each step should +include a link related pull requests as the implementation progresses. + +Any pull requests or developmt branches containing work on this SLEP should +be linked to from here. (A SLEP does not need to be implemented in a single +pull request if it makes sense to implement it in discrete phases). + + +Backward compatibility +---------------------- + +This section describes the ways in which the SLEP breaks backward +compatibility. 
+ + +Alternatives +------------ + +If there were any alternative solutions to solving the same problem, they +should be discussed here, along with a justification for the chosen +approach. + + +Discussion +---------- + +This section may just be a bullet list including links to any discussions +regarding the SLEP: + +- This includes links to mailing list threads or relevant GitHub issues. + + +References and Footnotes +------------------------ + +.. [1] Each SLEP must either be explicitly labeled as placed in the public + domain (see this SLEP as an example) or licensed under the `Open + Publication License`_. + +.. _Open Publication License: https://www.opencontent.org/openpub/ + + +Copyright +--------- + +This document has been placed in the public domain. [1]_ From 516f472d5498cbf1bd44e495051437a94af3b534 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Wed, 12 Dec 2018 14:48:20 -0500 Subject: [PATCH 017/118] Added intershpinx support --- conf.py | 3 +-- slep002/proposal.rst | 3 ++- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/conf.py b/conf.py index 49b8fc9..8ce8f79 100644 --- a/conf.py +++ b/conf.py @@ -162,5 +162,4 @@ # -- Options for intersphinx extension --------------------------------------- -# Example configuration for intersphinx: refer to the Python standard library. -intersphinx_mapping = {'https://docs.python.org/': None} +intersphinx_mapping = {'sklearn': ('http://scikit-learn.org/stable', None)} diff --git a/slep002/proposal.rst b/slep002/proposal.rst index ba275d8..0080354 100644 --- a/slep002/proposal.rst +++ b/slep002/proposal.rst @@ -26,7 +26,8 @@ Design Imports ------- -In addition to `Pipeline` class some additional wrappers are proposed as part of public API:: +In addition to :class:`sklearn.pipeline.Pipeline` class some additional +wrappers are proposed as part of public API:: from sklearn.pipeline import (Pipeline, fitted, transformer, predictor label_transformer, label_predictor, From beba66c8d1284fff55fb9add9dd0665b261c2fc4 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Wed, 12 Dec 2018 16:36:40 -0500 Subject: [PATCH 018/118] Added default_role = 'any' --- conf.py | 2 ++ slep002/proposal.rst | 2 +- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/conf.py b/conf.py index 8ce8f79..3a1548d 100644 --- a/conf.py +++ b/conf.py @@ -68,6 +68,8 @@ # This pattern also affects html_static_path and html_extra_path . exclude_patterns = [] +default_role = 'any' + # The name of the Pygments (syntax highlighting) style to use. 
pygments_style = 'sphinx' diff --git a/slep002/proposal.rst b/slep002/proposal.rst index 0080354..3187432 100644 --- a/slep002/proposal.rst +++ b/slep002/proposal.rst @@ -26,7 +26,7 @@ Design Imports ------- -In addition to :class:`sklearn.pipeline.Pipeline` class some additional +In addition to `Pipeline ` class some additional wrappers are proposed as part of public API:: from sklearn.pipeline import (Pipeline, fitted, transformer, predictor From 3868fc20cc4af5027eeb4dcd44d71109a1eb2064 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Wed, 12 Dec 2018 18:13:14 -0500 Subject: [PATCH 019/118] Used double backticks and removed whitespaces --- slep001/proposal.rst | 67 +++++++++--------- slep002/proposal.rst | 160 +++++++++++++++++++++++-------------------- slep004/proposal.rst | 24 +++---- 3 files changed, 132 insertions(+), 119 deletions(-) diff --git a/slep001/proposal.rst b/slep001/proposal.rst index 18bc070..f365284 100644 --- a/slep001/proposal.rst +++ b/slep001/proposal.rst @@ -14,7 +14,7 @@ Transformers that modify their target Within a chain or processing sequence of estimators, many usecases require modifying y. How do we support this? - + Doing many of these things is possible "by hand". The question is: how to avoid writing custom connecting logic. @@ -71,7 +71,7 @@ Examples of usecases targetted #. Aggregate statistics over multiple samples #. Windowing-like functions on time-series - + In a sense, these are dodgy with scikit-learn's cross-validation API that knows nothing about sample structure. But the refactor of the CV API is really helping in this regard. @@ -135,7 +135,7 @@ conceptual difficulty an almost case-by-case basis, and for the advanced user, that needs to maintain a set of case-specific code -#. The "estimator heap" problem. +#. The "estimator heap" problem. Here the word heap is used to denote the multiple pipelines and meta-estimators. It corresponds to what we would naturally call a @@ -149,17 +149,17 @@ conceptual difficulty scikit-learn. Here are concrete examples #. Trying to retrieve coefficients from a model estimated in a - "heap". Eg: - + "heap". Eg: + * you know there is a lasso in your stack and you want to get it's coef (in whatever space that resides?): - `pipeline.named_steps['lasso'].coef_` is possible. + ``pipeline.named_steps['lasso'].coef_`` is possible. * you want to retrieve the coef of the last step: - `pipeline.steps[-1][1].coef_` is possible. + ``pipeline.steps[-1][1].coef_`` is possible. With meta estimators this is tricky. - Solving this problem requires + Solving this problem requires https://github.com/scikit-learn/scikit-learn/issues/2562#issuecomment-27543186 (this enhancement proposal is not advocating to solve the problem above, but pointing it out as an illustration) @@ -205,11 +205,11 @@ Option B: transformer-like that modify y There is an emerging consensus for option 2. -.. topic:: **`transform` modifying y** +.. topic:: **``transform`` modifying y** Variant 1 above could be implementing by allowing transform to modify - y. However, the return signature of transform would be unclear. - + y. However, the return signature of transform would be unclear. + Do we modify all transformers to return a y (y=None for unsupervised transformers that are not given y?). This sounds like leading to code full of surprises and difficult to maintain from the user perspective. @@ -223,25 +223,25 @@ Option B: transformer-like that modify y Proposal ......... 
-Introduce a `TransModifier` type of object with the following API +Introduce a ``TransModifier`` type of object with the following API (names are discussed below): -* `X_new, y_new = estimator.fit_modify(X, y)` +* ``X_new, y_new = estimator.fit_modify(X, y)`` -* `X_new, y_new = estimator.trans_modify(X, y)` +* ``X_new, y_new = estimator.trans_modify(X, y)`` Or: -* `X_new, y_new, sample_props = estimator.fit_modify(X, y)` +* ``X_new, y_new, sample_props = estimator.fit_modify(X, y)`` -* `X_new, y_new, sample_props = estimator.trans_modify(X, y)` +* ``X_new, y_new, sample_props = estimator.trans_modify(X, y)`` Contracts (these are weaker contracts than the transformer: -* Neither `fit_modify` nor `trans_modify` are guarantied to keep the +* Neither ``fit_modify`` nor ``trans_modify`` are guarantied to keep the number of samples unchanged. -* `fit_modify` may not exist (questionnable) +* ``fit_modify`` may not exist (questionnable) Design questions and difficulties ................................. @@ -260,8 +260,8 @@ In particular at test time? #. Should there be a transform method used at test time? -#. What to do with objects that implement both `transform` and - `trans_modify`? +#. What to do with objects that implement both ``transform`` and + ``trans_modify``? **Creating y in a pipeline makes error measurement harder** For some usecases, test time needs to modify the number of samples (for instance @@ -270,7 +270,8 @@ for eg cross-val-score, as in supervised settings, these expect a y_true. Indeed, the problem is the following: - To measure an error, we need y_true at the level of - `cross_val_score` or `GridSearchCV` + `sklearn.model_selection.cross_val_score` or + `sklearn.model_selection.GridSearchCV` - y_true is created inside the pipeline by the data-loading object. @@ -281,19 +282,19 @@ enabling them). | For our CV framework, we need the number of samples to remain -constant: for each y_pred, we need a corresponding y_true. +constant: for each y_pred, we need a corresponding y_true. | -**Proposal 1**: use transform at `predict` time. - -#. Objects implementing both `transform` and `trans_modify` are valid +**Proposal 1**: use transform at ``predict`` time. + +#. Objects implementing both ``transform`` and ``trans_modify`` are valid -#. The pipeline's `predict` method use `transform` on its intermediate +#. The pipeline's ``predict`` method use ``transform`` on its intermediate steps -The different semantics of `trans_modify` and `transform` can be very useful, -as `transform` keeps untouched the notion of sample, and `y_true`. +The different semantics of ``trans_modify`` and ``transform`` can be very useful, +as ``transform`` keeps untouched the notion of sample, and ``y_true``. | @@ -301,19 +302,19 @@ as `transform` keeps untouched the notion of sample, and `y_true`. One option is to modify the scoring framework to be able to handle these things, the scoring gets the output of the chain of -trans_modify for y. This should rely on clever code in the `score` method +trans_modify for y. This should rely on clever code in the ``score`` method of pipeline. Maybe it should be controlled by a keyword argument on the pipeline, and turned off by default. - + How do we deal with sample weights and other sample properties? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -This discussion feeds in the `sample_props` discussion (that should +This discussion feeds in the ``sample_props`` discussion (that should be discussed in a different enhancement proposal). 
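To make the second ``fit_modify`` signature above concrete, here is a minimal,
hypothetical sketch of a sample-reducing step that returns per-sample weights
along with the reduced data. The class name ``KMeansCoreSet`` is invented for
illustration and is not part of this proposal::

    import numpy as np

    from sklearn.base import BaseEstimator
    from sklearn.cluster import KMeans


    class KMeansCoreSet(BaseEstimator):
        # Hypothetical coreset-like step: replace the training set by
        # cluster centers, weighted by the number of samples they represent.

        def __init__(self, n_clusters=50):
            self.n_clusters = n_clusters

        def fit_modify(self, X, y=None):
            self.kmeans_ = KMeans(n_clusters=self.n_clusters).fit(X)
            labels = self.kmeans_.labels_
            X_new = self.kmeans_.cluster_centers_
            y_new = None
            if y is not None:
                # Mean target per cluster; kept deliberately simple.
                y_new = np.array([np.asarray(y)[labels == k].mean()
                                  for k in range(self.n_clusters)])
            weights = np.bincount(labels, minlength=self.n_clusters)
            return X_new, y_new, {'weights': weights}

A downstream estimator that supports sample weights could then be fitted on
``X_new`` and ``y_new``, with ``weights`` forwarded as the corresponding
sample property.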
The suggestion is to have the sample properties as a dictionary of -arrays `sample_props`. +arrays ``sample_props``. **Example usecase** useful to think about sample properties: coresets: given (X, y) return (X_new, y_new, weights) with a much smaller number @@ -340,7 +341,7 @@ readability of the code easier. - TransformPipe - PipeTransformer -* Method to fit and apply on training +* Method to fit and apply on training - fit_modify - fit_pipe - pipe_fit diff --git a/slep002/proposal.rst b/slep002/proposal.rst index 3187432..232e382 100644 --- a/slep002/proposal.rst +++ b/slep002/proposal.rst @@ -88,8 +88,8 @@ Every dict should be of length 1:: TypeError: Wrong step definition -Proposed construction from `collections.OrderedDict` -.................................................... +Proposed construction from ``collections.OrderedDict`` +...................................................... It is probably the most natural way to create a pipeline:: @@ -103,11 +103,11 @@ It is probably the most natural way to create a pipeline:: Backward-compatibility notice ----------------------------- -As user can provide object of any type as `steps` argument to constructor, +As user can provide object of any type as ``steps`` argument to constructor, there is no way to be 100% compatible, if we are going to maintain our oun -type for `Pipeline.steps`. -But in most cases people provide `list` object as `steps` parameter, so -being backward-compatible with `list` API should be fine. +type for ``Pipeline.steps``. +But in most cases people provide ``list`` object as ``steps`` parameter, so +being backward-compatible with ``list`` API should be fine. Adding estimators ----------------- @@ -115,11 +115,12 @@ Adding estimators Backward-compatible ................... -Although not documented, but popular method of modifying (not fitted) pipelines should be supported:: +Although not documented, but popular method of modifying (not fitted) pipelines +should be supported:: pipe.steps.append(['name', estimator]) -The only difference is that special handler is returned instead of `None`. +The only difference is that special handler is returned instead of ``None``. Enhanced: by indexing ..................... @@ -128,8 +129,8 @@ Using dict-like syntax if very user-friendly:: pipe.steps['name'] = estimator -Enhanced: `add` function -........................ +Enhanced: ``add`` function +.......................... Alias to previous two calls:: @@ -142,7 +143,8 @@ And also:: Adding estimators with type specification ......................................... -Estimator types will be discussed later, but some functions belong to this section:: +Estimator types will be discussed later, but some functions belong to this +section:: pipe.add_estimator('name0', estimator0).mark_fitted() pipe.add_transformer('name1', estimator1) # never calls .fit (x, y -> x) @@ -156,7 +158,7 @@ Steps (subestimators) access Backward-compatible ................... -Indexing by number should return `(step, estimator)` pair:: +Indexing by number should return ``(step, estimator)`` pair:: >>> pipe.steps[0] ('name', SomeEstimator(...)) @@ -177,7 +179,7 @@ and there is no inference with internal methods:: >>> pipe.steps.name SomeEstimator(param1=value1, param2=value2) - + >>> pipe.steps.get > @@ -194,8 +196,8 @@ Replacing estimators Backward-compatible ................... -Replacing should only be supported via access to `.steps` attribute. 
This way there is no ambiguity -with new/old subestimator subtype:: +Replacing should only be supported via access to ``.steps`` attribute. This way +there is no ambiguity with new/old subestimator subtype:: pipe = Pipeline(steps=[('name', SomeEstimator())]) pipe.steps[0] = ('name', AnotherEstimator()) @@ -208,34 +210,37 @@ Dict-like behavior can be used too:: pipe = Pipeline(steps=[('name', SomeEstimator())]) pipe.steps['name'] = AnotherEstimator() -Replace via `replace()` function -................................. +Replace via ``replace()`` function +.................................. This way one can obtain special handler:: pipe.steps.replace('old_step_name', 'new_step_name', NewEstimator()) - pipe.steps.replace('step_name', 'new_name', SomeEstimator()).mark_transformer() + pipe.steps.replace('step_name', 'new_name', + SomeEstimator()).mark_transformer() -Rename step via `rename()` function -.................................... +Rename step via ``rename()`` function +..................................... -Simple way to change step's name (doesn't affect anything except object representation):: +Simple way to change step's name (doesn't affect anything except object +representation):: pipe.steps.rename('old_name', 'new_name') Modifying estimators -------------------- -Changing estimator params should only be performed via `pipeline.set_params()`. -If somebody calls `subestimator.set_params()` directly, pipeline object will have -no idea about changed state. There is no easy way to control it, so docs should just -warm users about it. +Changing estimator params should only be performed via +``pipeline.set_params()``. If somebody calls ``subestimator.set_params()`` +directly, pipeline object will have no idea about changed state. There is no +easy way to control it, so docs should just warm users about it. -On the other hand, there exist not-so-easy way to at least warm users during runtime: -pipeline will have to keep params of all its children and compare them with actual -params during `fit` or `predict` routines and raise a warning if they do not match. -This functionality may be implemented as part of some kind of debugging mode. +On the other hand, there exist not-so-easy way to at least warm users during +runtime: pipeline will have to keep params of all its children and compare them +with actual params during ``fit`` or ``predict`` routines and raise a warning +if they do not match. This functionality may be implemented as part of some +kind of debugging mode. Deleting estimators ------------------- @@ -243,7 +248,7 @@ Deleting estimators Backward-compatible ................... -Backward-compatible way to delete a step is to `del` it via index number:: +Backward-compatible way to delete a step is to ``del`` it via index number:: del pipe.steps[2] @@ -256,56 +261,60 @@ using enhanced indexing:: pipe = Pipeline() est1 = Estimator1() est2 = Estimator2() - + pipe.steps.add('name1', est1) pipe.steps.add('name2', est2) - + del pipe.steps['name1'] del pipe.steps[pipe.steps.index(est2)] -Using dict/list-like `pop()` functions -...................................... +Using dict/list-like ``pop()`` functions +........................................ 
Last estimator in a chain can be deleted with any of these calls:: >>> pipe.steps.pop() SomeEstimator() - + >>> pipe.steps.popitem() ('some_name', SomeEstimator()) -Likewise, first estimator in the pipeline can be removed with any of these calls:: +Likewise, first estimator in the pipeline can be removed with any of these +calls:: >>> pipe.steps.popfront() BeginEstimator() - + >>> pipe.steps.popitemfront() ('begin', BeginEstimator) -Any step can be removed with `pop(step_name)` or `popitem(step_name)`. +Any step can be removed with ``pop(step_name)`` or ``popitem(step_name)``. Fitted flag reset ----------------- -Internally `Pipeline` object should keep track on whatever it is fitted or not. -It should consider itself fitted if it wasn't modified after: +Internally ``Pipeline`` object should keep track on whatever it is fitted or +not. It should consider itself fitted if it wasn't modified after: -* successful call to `.fit`:: +* successful call to ``.fit``:: pipe.fit(...) # Got fitted pipeline if no exception was raised + * construction with list of estimators, all marked as - fitted via `fitted` function:: - + fitted via ``fitted`` function:: + pipe = pipeline.Pipeline(steps=[ ('name1', fitted(estimator1)), ('name2', fitted(estimator2)(, ... ]) + * adding fitted estimator to fitted pipeline:: pipe.steps.append(fitted(estimator1)) pipe.steps['new_step'] = fitted(estimator2) pipe.add_transformer('some_key', estimator3).set_fitted() + * renaming step in fitted pipeline * removing first or last step from fitted pipeline @@ -319,7 +328,7 @@ Subestimator type can be specified: * By wrapping estimator with subtype constructor call: * when creating pipeline:: - + Pipeline([ ('name1', transformer(estimator)), ('name2', predictor(estimator)), @@ -327,13 +336,13 @@ Subestimator type can be specified: ('name4', label_predictor(estimator)), ]) * when adding or replacing a step:: - + pipe.steps.append(['name', label_predictor(estimator]) pipe.steps.add('name', label_transformer(estimator)) pipe.add_estimator('name', predictor(estimator)) pipe.steps.replace('name', transformer(fitted(estimator))) pipe.steps['name'] = fitted(predictor(estimator)) -* Using `pipe.add_*` methods:: +* Using ``pipe.add_*`` methods:: pipe.add_transformer('transformer', Transformer()) pipe.add_predictor('predictor', Predictor()) @@ -403,7 +412,7 @@ one can use fitted function:: est = SomeEstimator().fit(some_data) pipe.steps.add('prefitted', fitted(est)) - + or special hanlder method:: pipe.steps.add('prefitted', est).mark_fitted() @@ -421,7 +430,8 @@ In some cases we only need to apply estimator only during fit-phase:: # or pipe.add_estimator('sampler', Sampler()).mark('ignore_transform') -If it is `predictor` or `label_predictor`, then one should use `ignore_predict`:: +If it is ``predictor`` or ``label_predictor``, then one should use +``ignore_predict``:: pipe.add_estimator('cluster', ignore_predict(predictor(ClusteringEstimator()))) # or @@ -432,8 +442,8 @@ If it is `predictor` or `label_predictor`, then one should use `ignore_predict`: Setting subestimator type ......................... -As specified above setting subestimator type can be performed with special handler -or special function call. +As specified above setting subestimator type can be performed with special +handler or special function call. Combining multiple flags ........................ 
@@ -464,22 +474,22 @@ Attributes and methods with standard behavior Special methods: -* `__contains__()`, `__getitem__()`, `__setitem__()`, `__delitem__()` -* `__len__()`, `__iter__()` -* `__add__()`, `__iadd__()` +* ``__contains__()``, ``__getitem__()``, ``__setitem__()``, ``__delitem__()`` +* ``__len__()``, ``__iter__()`` +* ``__add__()``, ``__iadd__()`` Methods: -* `get()`, `index()` -* `extend()`, `insert()` -* `keys()`, `items()`, `values()` -* `clear()`, `pop()`, `popitem()`, `popfront()`, `popitemfront()` +* ``get()``, ``index()`` +* ``extend()``, ``insert()`` +* ``keys()``, ``items()``, ``values()`` +* ``clear()``, ``pop()``, ``popitem()``, ``popfront()``, ``popitemfront()`` Non-standard methods .................... -* `replace()` -* `rename()` +* ``replace()`` +* ``rename()`` Not supported arguments and methods ................................... @@ -487,26 +497,27 @@ Not supported arguments and methods This type provides dict-like and list-like interfaces, but following methods and attributes are not supported: -* `fromkeys()` -* `setdefault()` -* `sort()` -* `__mul__()`, `__rmul__()`, `__imul__()` +* ``fromkeys()`` +* ``setdefault()`` +* ``sort()`` +* ``__mul__()``, ``__rmul__()``, ``__imul__()`` -Any attempt to use them should fail with `AttributeError` or -`NotImplementedError` +Any attempt to use them should fail with ``AttributeError`` or +``NotImplementedError`` Thease methods may be not supported: -* `__ge__()`, `__gt__()` -* `__le__()`, `__lt__()` +* ``__ge__()``, ``__gt__()`` +* ``__le__()``, ``__lt__()`` Serialization ------------- * Support loading/unpickling pipelines from old scikit-learn versions -* Keep track of API version in `__getstate__` / `picklier`: all future +* Keep track of API version in ``__getstate__`` / ``picklier``: all future versions should support unpickling all previous versions of enhanced pipeline -* Serialization of `.steps` attribute (without master pipeline) may be not supported. +* Serialization of ``.steps`` attribute (without master pipeline) may be not + supported. Examples ======== @@ -514,8 +525,9 @@ Examples Example: remove outliers ------------------------ -Proposed design allows to do many things, but some of them have to be done in two steps. -But it shouldn't be a problem, as one can make a pipeline with those steps:: +Proposed design allows to do many things, but some of them have to be done in +two steps. But it shouldn't be a problem, as one can make a pipeline with +those steps:: def make_outlier_remover(bad_value=-1): outlier_remover = Pipeline() @@ -549,11 +561,11 @@ We can use previous example function for this:: Benefits ======== * Users can use old code with new pipeline: - usual `__init__`, `set_params`, `get_params`, `fit`, `transform` and `predict` - are the only requirements of subestimators. + usual ``__init__``, ``set_params``, ``get_params``, ``fit``, ``transform`` + and ``predict`` are the only requirements of subestimators. * Users can use new pipeline with their old code: pipeline is stil usual estimator, that supports usual set of methods. -* We finally can transform `y` in a pipeline. +* We finally can transform ``y`` in a pipeline. 
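To make these benefits concrete, here is a minimal sketch, assuming the
enhanced ``Pipeline`` and the ``steps`` helpers proposed above; the concrete
estimators are only placeholders, and the new-style calls are the ones this
proposal introduces, not existing API::

    >>> from sklearn.preprocessing import StandardScaler
    >>> from sklearn.linear_model import LogisticRegression

    >>> # old-style construction and the usual estimator API keep working
    >>> pipe = Pipeline(steps=[('scale', StandardScaler()),
    ...                        ('clf', LogisticRegression())])
    >>> pipe.fit(X_train, y_train)

    >>> # new-style access and helpers proposed in this document
    >>> pipe.steps.scale
    StandardScaler()
    >>> pipe.steps.rename('clf', 'classifier')
    >>> pipe.steps['classifier'] = fitted(LogisticRegression())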
Drawbacks ========= diff --git a/slep004/proposal.rst b/slep004/proposal.rst index d822f5e..558cb90 100644 --- a/slep004/proposal.rst +++ b/slep004/proposal.rst @@ -265,7 +265,7 @@ different routes definitions : cross_val_score(grid, X, y, cv=LeaveOneLabelOut(), sample_props={'cv__groups': groups, 'split__groups': gs_groups, - 'weights': weights}) + 'weights': weights}) With this code, the ``sample_props`` sent to each function of ``GridSearchCV`` and ``cross_val_score`` will be: @@ -351,10 +351,10 @@ Pass through everything in ``GridSearchCV``: .. code:: python pipe = make_pipeline( - PCA(), SVC(), + PCA(), SVC(), with_props={pca__fit_weight: 'my_weights'}}) GridSearchCV( - pipe, cv=my_cv, + pipe, cv=my_cv, with_props={'cv__groups': "my_groups", '*':'*') A more complex example with this solution: @@ -364,11 +364,11 @@ A more complex example with this solution: pipe = make_pipeline( make_union( CountVectorizer(analyzer='word').with(fit_weight='my_weight'), - CountVectorizer(analyzer='char').with(fit_weight='my_weight')), + CountVectorizer(analyzer='char').with(fit_weight='my_weight')), SVC()) - + GridSearchCV( - pipe, + pipe, cv=my_cv.with(groups='my_groups'), score_weight='my_weight') Alternative 2 @@ -380,9 +380,9 @@ Grid search manage the ``sample_props`` of all internal variable. pipe = make_pipeline(PCA(), SVC()) GridSearchCV( - pipe, cv=my_cv, + pipe, cv=my_cv, with_props={ - 'cv__groups': "my_groups", + 'cv__groups': "my_groups", 'estimator__pca__fit_weight': "my_weights"),       }) @@ -392,13 +392,13 @@ A more complex example with this solution: pipe = make_pipeline( make_union( - CountVectorizer(analyzer='word'), - CountVectorizer(analyzer='char')), + CountVectorizer(analyzer='word'), + CountVectorizer(analyzer='char')), SVC()) GridSearchCV( - pipe, cv=my_cv, + pipe, cv=my_cv, with_props={ - 'cv__groups': "my_groups", + 'cv__groups': "my_groups", 'estimator__featureunion__countvectorizer-1__fit_weight': "my_weights", 'estimator__featureunion__countvectorizer-2__fit_weight': "my_weights", 'score_weight': "my_weights", From d7aae8d8698f8928873165719073f75150ad9c49 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Thu, 27 Dec 2018 16:23:10 -0500 Subject: [PATCH 020/118] Added placeholders --- index.rst | 29 ++++++++++++++++++++++------- 1 file changed, 22 insertions(+), 7 deletions(-) diff --git a/index.rst b/index.rst index 9cc30ff..5f30250 100644 --- a/index.rst +++ b/index.rst @@ -16,18 +16,33 @@ possible solution. It should be a summary of the key points that drive the decision, and ideally converge to a draft of an API or object to be implemented in scikit-learn. -.. toctree:: - :maxdepth: 1 - :caption: Template: - - slep_template - .. toctree:: :maxdepth: 1 :numbered: - :caption: Proposals: + :caption: Under review slep001/proposal slep002/proposal slep003/proposal slep004/proposal + +.. toctree:: + :maxdepth: 1 + :numbered: + :caption: Accepted + +.. toctree:: + :maxdepth: 1 + :numbered: + :caption: Delayed review + +.. toctree:: + :maxdepth: 1 + :numbered: + :caption: Rejected + +.. toctree:: + :maxdepth: 1 + :caption: Template + + slep_template From b3e6b60d79689e0818fd4d9398c4d92cd65c5b9e Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Thu, 27 Dec 2018 16:38:07 -0500 Subject: [PATCH 021/118] Put existing SLEP into delayed reviews --- index.rst | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/index.rst b/index.rst index 5f30250..0d95b8f 100644 --- a/index.rst +++ b/index.rst @@ -21,11 +21,6 @@ implemented in scikit-learn. 
:numbered: :caption: Under review - slep001/proposal - slep002/proposal - slep003/proposal - slep004/proposal - .. toctree:: :maxdepth: 1 :numbered: @@ -36,6 +31,11 @@ implemented in scikit-learn. :numbered: :caption: Delayed review + slep001/proposal + slep002/proposal + slep003/proposal + slep004/proposal + .. toctree:: :maxdepth: 1 :numbered: From b16d86ec16f0ebdcc80b189078e2f5bfa478e0f3 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Thu, 27 Dec 2018 17:14:52 -0500 Subject: [PATCH 022/118] Added link in README --- README.rst | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/README.rst b/README.rst index 968de36..6028cd2 100644 --- a/README.rst +++ b/README.rst @@ -1,6 +1,6 @@ -===================================== +================================== Scikit-learn enhancement proposals -===================================== +================================== This repository is for structured discussions about large modifications or additions to scikit-learn. @@ -11,3 +11,6 @@ the rational and usecases that are addressed, the problems and the major possible solution. It should be a summary of the key points that drive the decision, and ideally converge to a draft of an API or object to be implemented in scikit-learn. + +The SLEPs are publicly available online on `Read The Docs +`_. \ No newline at end of file From fe7c1473e0dd4ad62ede42d844724e875b5deac8 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Fri, 28 Dec 2018 11:46:55 -0500 Subject: [PATCH 023/118] Added files to be removed for titles to show up --- accepted.rst | 4 ++++ index.rst | 41 ++++++++++++++++++++++------------------- rejected.rst | 4 ++++ under_review.rst | 4 ++++ 4 files changed, 34 insertions(+), 19 deletions(-) create mode 100644 accepted.rst create mode 100644 rejected.rst create mode 100644 under_review.rst diff --git a/accepted.rst b/accepted.rst new file mode 100644 index 0000000..1aeab73 --- /dev/null +++ b/accepted.rst @@ -0,0 +1,4 @@ +Accpeted SLEPs +============== + +Nothing here diff --git a/index.rst b/index.rst index 0d95b8f..81c2099 100644 --- a/index.rst +++ b/index.rst @@ -17,32 +17,35 @@ decision, and ideally converge to a draft of an API or object to be implemented in scikit-learn. .. toctree:: - :maxdepth: 1 - :numbered: - :caption: Under review + :maxdepth: 1 + :caption: Under review + + under_review .. toctree:: - :maxdepth: 1 - :numbered: - :caption: Accepted + :maxdepth: 1 + :caption: Accepted + + accepted .. toctree:: - :maxdepth: 1 - :numbered: - :caption: Delayed review + :maxdepth: 1 + :numbered: + :caption: Delayed review - slep001/proposal - slep002/proposal - slep003/proposal - slep004/proposal + slep001/proposal + slep002/proposal + slep003/proposal + slep004/proposal .. toctree:: - :maxdepth: 1 - :numbered: - :caption: Rejected + :maxdepth: 1 + :caption: Rejected + + rejected .. 
toctree:: - :maxdepth: 1 - :caption: Template + :maxdepth: 1 + :caption: Template - slep_template + slep_template diff --git a/rejected.rst b/rejected.rst new file mode 100644 index 0000000..42799a4 --- /dev/null +++ b/rejected.rst @@ -0,0 +1,4 @@ +Rejected SLEPs +============== + +Nothing here diff --git a/under_review.rst b/under_review.rst new file mode 100644 index 0000000..a5a2d08 --- /dev/null +++ b/under_review.rst @@ -0,0 +1,4 @@ +SLEPs under review +================== + +Nothing here From 93214f5c076bbf7aa4cfbbd056095792b148ef4b Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Fri, 28 Dec 2018 12:23:49 -0500 Subject: [PATCH 024/118] Fixed typo --- accepted.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/accepted.rst b/accepted.rst index 1aeab73..ed7ee4a 100644 --- a/accepted.rst +++ b/accepted.rst @@ -1,4 +1,4 @@ -Accpeted SLEPs +Accepted SLEPs ============== Nothing here From 83cf0643e77014bb0ee39f760d4b6916c7ae4d14 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Tue, 15 Jan 2019 14:50:10 -0500 Subject: [PATCH 025/118] Include README in index.rst --- index.rst | 13 +------------ 1 file changed, 1 insertion(+), 12 deletions(-) diff --git a/index.rst b/index.rst index 81c2099..9713a84 100644 --- a/index.rst +++ b/index.rst @@ -3,18 +3,7 @@ You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. -Scikit-learn enhancement proposals -================================== - -This repository is for structured discussions about large modifications or -additions to scikit-learn. - -The discussions must create an "enhancement proposal", similar Python -enhancement proposal, that reflects the major arguments to keep in mind, the -rational and usecases that are addressed, the problems and the major -possible solution. It should be a summary of the key points that drive the -decision, and ideally converge to a draft of an API or object to be -implemented in scikit-learn. +.. include:: README.rst .. toctree:: :maxdepth: 1 From b5b4a682bda1417a74f8d92b62201bfc20b0ffcb Mon Sep 17 00:00:00 2001 From: Adrin Jalali Date: Wed, 11 Sep 2019 01:46:25 +0200 Subject: [PATCH 026/118] SLEP009: keyword only arguments (#19) --- slep009/proposal.rst | 218 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 218 insertions(+) create mode 100644 slep009/proposal.rst diff --git a/slep009/proposal.rst b/slep009/proposal.rst new file mode 100644 index 0000000..692d4e1 --- /dev/null +++ b/slep009/proposal.rst @@ -0,0 +1,218 @@ +.. _slep_009: + +=============================== +SLEP009: Keyword-only arguments +=============================== + +:Author: Adrin Jalali +:Status: Draft +:Type: Standards Track +:Created: 2019-07-13 + +Abstract +######## + +This proposal discusses the path to gradually forcing users to pass arguments, +or most of them, as keyword arguments only. It talks about the status-quo, and +the motivation to introduce the change. It shall cover the pros and cons of the +change. The original issue starting the discussion is located +`here `_. + +Motivation +########## + +At the moment `sklearn` accepts all arguments both as positional and +keyword arguments. For example, both of the following are valid: + +.. code-block:: python + + # positional arguments + clf = svm.SVC(.1, 'rbf') + # keyword arguments + clf = svm.SVC(C=.1, kernel='rbf') + + +Using keyword arguments has a few benefits: + +- It is more readable. 
+- For models which accept many parameters, especially numerical, it is less + error-prone than positional arguments. Compare these examples: + +.. code-block:: python + + cls = cluster.OPTICS( + min_samples=5, max_eps=inf, metric=’minkowski’, p=2, + metric_params=None, cluster_method=’xi’, eps=None, xi=0.05, + predecessor_correction=True, min_cluster_size=None, algorithm=’auto’, + leaf_size=30, n_jobs=None) + + cls = cluster.OPTICS(5, inf, ’minkowski’, 2, None, ’xi’, None, 0.05, + True, None, ’auto’, 30, None) + + +- It allows adding new parameters closer the other relevant parameters, instead + of adding new ones at the end of the list without breaking backward + compatibility. Right now all new parameters are added at the end of the + signature. Once we move to a keyword only argument list, we can change their + order and put related parameters together. Assuming at some point numpydoc + would support sections for parameters, these groups of parameters would be in + different sections for the documentation to be more readable. Also, note that + we have previously assumed users would pass most parameters by name and have + sometimes considered changes to be backwards compatible when they modified + the order of parameters. For example, user code relying on positional + arguments could break after a deprecated parameter was removed. Accepting + this SLEP would make this requirement explicit. + +Solution +######## + +The official supported way to have keyword only arguments is: + +.. code-block:: python + + def func(arg1, arg2, *, arg3, arg4) + +Which means the function can only be called with `arg3` and `arg4` specified +as keyword arguments: + +.. code-block:: python + + func(1, 2, arg3=3, arg4=4) + +The feature was discussed and the related PEP +`PEP3102 `_ was accepted and +introduced in Python 3.0, in 2006. + +For the change to happen in ``sklearn``, we would need to add the ``*`` where +we want all subsequent parameters to be passed as keyword only. + +Considerations +############## + +We can identify the following main challenges: familiarity of the users with +the syntax, and its support by different IDEs. + +Syntax +------ + +Partly due to the fact that the Scipy/PyData has been supporting Python 2 until +recently, the feature (among other Python 3 features) has seen limited adoption +and the users may not be used to seeing the syntax. The similarity between the +following two definitions may also be confusing to some users: + +.. code-block:: python + + def f(arg1, *arg2, arg3): pass # variable length arguments at arg2 + + def f(arg1, *, arg3): pass # no arguments accepted at * + +However, some other teams are already moving towards using the syntax, such as +``matplotlib`` which has introduced the syntax with a deprecation cycle using a +decorator for this purpose in version 3.1. The related PRs can be found `here +`_ and `here +`_. Soon users will be +familiar with the syntax. + +IDE Support +----------- + +Many users rely on autocomplete and parameter hints of the IDE while coding. +Here is how the hint looks like in two different IDEs. For instance, for the +above function, defined in VSCode, the hint would be shown as: + +.. code-block:: python + + func(arg1, arg2, *, arg3, arg4) + + param arg3 + func(1, 2, |) + +The good news is that the IDE understands the syntax and tells the user it's +the ``arg3``'s turn. But it doesn't say it is a keyword only argument. + +`ipython`, however, suggests all parameters be given with the keyword anyway: + +.. 
code-block:: python + + In [1]: def func(arg1, arg2, *, arg3, arg4): pass + + In [2]: func( + abs() arg3= + all() arg4= + any() ArithmeticError > + arg1= ascii() + arg2= AssertionError + +Scope +##### + +An important open question is which functions/methods and/or parameters should +follow this pattern, and which parameters should be keyword only. We can +identify the following categories of functions/methods: + +- ``__init__``s +- Main methods of the API, *i.e.* ``fit``, ``transform``, etc. +- All other methods, *e.g.* ``SpectralBiclustering.get_submatrix`` +- Functions + +With regard to the common methods of the API, the decision for these methods +should be the same throughout the library in order to keep a consistent +interface to the user. + +This proposal suggests making only *most commonly* used parameters positional. +The *most commonly* used parameters are defined per method or function, to be +defined as either of the following two ways: + +- The set defined and agreed upon by the core developers, which should cover + the *easy* cases. +- A set identified as being in the top 95% of the use cases, using some + automated analysis such as `this one + `_ or `this one + `_. + +This way we would minimize the number of warnings the users would receive, +which minimizes the friction cause by the change. This SLEP does not define +these parameter sets, and the respective decisions shall be made in their +corresponding pull requests. + +Deprecation Path +---------------- + +For a smooth transition, we need an easy deprecation path. Similar to the +decorators developed in ``matplotlib``, a proposed solution is available at +[#13311](https://github.com/scikit-learn/scikit-learn/pull/13311), which +deprecates the usage of positional arguments on selected functions and methods. +With the decorator, the user sees a warning if they pass the designated +keyword-only arguments as positional, and removing the decorator would result +in an error. Examples (borrowing from the PR): + +.. code-block:: python + + @warn_args + def dbscan(X, eps=0.5, *, min_samples=4, metric='minkowski'): + pass + + + class LogisticRegression: + + @warn_args + def __init__(self, penalty='l2', *, dual=False): + + self.penalty = penalty + self.dual = dual + + +Calling ``LogisticRegression('l2', True)`` will result with a +``DeprecationWarning``: + +.. code-block:: bash + + Should use keyword args: dual=True + + +Once the deprecation period is over, we'd remove the decorator and calling +the function/method with the positional arguments after `*` would fail. + +The final decorator solution shall make sure it is well understood by most +commonly used IDEs and editors such as IPython, Jupiter Lab, Emacs, vim, +VSCode, and PyCharm. From f89a31d50de523053c6c251b9d035decfbc3b2ff Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Wed, 11 Sep 2019 09:53:54 +1000 Subject: [PATCH 027/118] Include SLEP009 in Under Review --- index.rst | 4 ++-- under_review.rst | 4 ---- 2 files changed, 2 insertions(+), 6 deletions(-) delete mode 100644 under_review.rst diff --git a/index.rst b/index.rst index 9713a84..e0a8c03 100644 --- a/index.rst +++ b/index.rst @@ -6,10 +6,10 @@ .. include:: README.rst .. toctree:: - :maxdepth: 1 + :maxdepth: 2 :caption: Under review - under_review + SLEP009: Keyword-only arguments (voting until 11 Oct 2019) .. 
toctree:: :maxdepth: 1 diff --git a/under_review.rst b/under_review.rst deleted file mode 100644 index a5a2d08..0000000 --- a/under_review.rst +++ /dev/null @@ -1,4 +0,0 @@ -SLEPs under review -================== - -Nothing here From 8a7d8ee2e1e5b69c5813bcdd902a629c8c76d9ac Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Wed, 11 Sep 2019 10:00:47 +1000 Subject: [PATCH 028/118] Empty commit to trigger rtfd From 9ccfa2ea3c5123b775e0baa96467bed34ff44a9a Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Wed, 11 Sep 2019 11:09:56 +1000 Subject: [PATCH 029/118] Note voting on SLEP009 --- index.rst | 2 +- slep009/proposal.rst | 2 ++ 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/index.rst b/index.rst index e0a8c03..7555e0c 100644 --- a/index.rst +++ b/index.rst @@ -6,7 +6,7 @@ .. include:: README.rst .. toctree:: - :maxdepth: 2 + :maxdepth: 1 :caption: Under review SLEP009: Keyword-only arguments (voting until 11 Oct 2019) diff --git a/slep009/proposal.rst b/slep009/proposal.rst index 692d4e1..1e0a02d 100644 --- a/slep009/proposal.rst +++ b/slep009/proposal.rst @@ -8,6 +8,8 @@ SLEP009: Keyword-only arguments :Status: Draft :Type: Standards Track :Created: 2019-07-13 +:Vote opens: 2019-09-11 +:Vote closes: 2019-10-11 Abstract ######## From a1baba0bd277ec88ed1e1c3842715e597d7c291e Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Tue, 24 Sep 2019 00:34:40 +1000 Subject: [PATCH 030/118] Mark SLEP009 as Accepted --- accepted.rst | 4 ---- index.rst | 4 ++-- slep009/proposal.rst | 5 ++--- under_review.rst | 4 ++++ 4 files changed, 8 insertions(+), 9 deletions(-) delete mode 100644 accepted.rst create mode 100644 under_review.rst diff --git a/accepted.rst b/accepted.rst deleted file mode 100644 index ed7ee4a..0000000 --- a/accepted.rst +++ /dev/null @@ -1,4 +0,0 @@ -Accepted SLEPs -============== - -Nothing here diff --git a/index.rst b/index.rst index 7555e0c..16bb235 100644 --- a/index.rst +++ b/index.rst @@ -9,13 +9,13 @@ :maxdepth: 1 :caption: Under review - SLEP009: Keyword-only arguments (voting until 11 Oct 2019) + under_review .. toctree:: :maxdepth: 1 :caption: Accepted - accepted + slep009/proposal .. 
toctree:: :maxdepth: 1 diff --git a/slep009/proposal.rst b/slep009/proposal.rst index 1e0a02d..c6f8cb0 100644 --- a/slep009/proposal.rst +++ b/slep009/proposal.rst @@ -5,11 +5,10 @@ SLEP009: Keyword-only arguments =============================== :Author: Adrin Jalali -:Status: Draft +:Status: Accepted :Type: Standards Track :Created: 2019-07-13 -:Vote opens: 2019-09-11 -:Vote closes: 2019-10-11 +:Vote opened: 2019-09-11 Abstract ######## diff --git a/under_review.rst b/under_review.rst new file mode 100644 index 0000000..a5a2d08 --- /dev/null +++ b/under_review.rst @@ -0,0 +1,4 @@ +SLEPs under review +================== + +Nothing here From 953457d85c70218adb168519d3d20885a86004a4 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Thu, 7 Nov 2019 09:32:49 -0500 Subject: [PATCH 031/118] SLEP010 n_features_in_ attribute (#22) * slep10 * added copyright * addressed comments * move motivation for solution below * addressed comments * added argument that its ok to break test suite compat * style comments * update vectorizers * added warning instead of exception in common check * now propose validate_data(X, y=None, reset=True) * added note about private API * update SLEP to only make it about n_features_in_ attribute --- slep010/proposal.rst | 109 +++++++++++++++++++++++++++++++++++++++++++ under_review.rst | 5 +- 2 files changed, 113 insertions(+), 1 deletion(-) create mode 100644 slep010/proposal.rst diff --git a/slep010/proposal.rst b/slep010/proposal.rst new file mode 100644 index 0000000..3c4387a --- /dev/null +++ b/slep010/proposal.rst @@ -0,0 +1,109 @@ +.. _slep_010: + +===================================== +SLEP010: ``n_features_in_`` attribute +===================================== + +:Author: Nicolas Hug +:Status: Under review +:Type: Standards Track +:Created: 2019-11-23 + +Abstract +######## + +This SLEP proposes the introduction of a public ``n_features_in_`` attribute +for most estimators (where relevant). + +Motivation +########## + +Knowing the number of features that an estimator expects is useful for +inspection purposes. This is also useful for implementing the feature names +propagation (`SLEP 8 +`_) . For +example any of the scaler can easily create feature names if they know +``n_features_in_``. + +Solution +######## + +The proposed solution is to replace most calls to ``check_array()`` or +``check_X_y()`` by calls to a newly created private method:: + + def _validate_data(self, X, y=None, reset=True, **check_array_params) + ... + +The ``_validate_data()`` method will call ``check_array()`` or +``check_X_y()`` function depending on the ``y`` parameter. + +If the ``reset`` parameter is True (default), the method will set the +``n_feature_in_`` attribute of the estimator, regardless of its potential +previous value. This should typically be used in ``fit()``, or in the first +``partial_fit()`` call. Passing ``reset=False`` will not set the attribute but +instead check against it, and potentially raise an error. This should typically +be used in ``predict()`` or ``transform()``, or on subsequent calls to +``partial_fit``. + +In most cases, the ``n_features_in_`` attribute exists only once ``fit`` has +been called, but there are exceptions (see below). + +A new common check is added: it makes sure that for most estimators, the +``n_features_in_`` attribute does not exist until ``fit`` is called, and +that its value is correct. Instead of raising an exception, this check will +raise a warning for the next two releases. 
This will give downstream +packages some time to adjust (see considerations below). + +Since the introduced method is private, third party libraries are +recommended not to rely on it. + +The logic that is proposed here (calling a stateful method instead of a +stateless function) is a pre-requisite to fixing the dataframe column +ordering issue: with a stateless ``check_array``, there is no way to raise +an error if the column ordering of a dataframe was changed between ``fit`` +and ``predict``. This is however out os scope for this SLEP, which only focuses +on the introduction of the ``n_features_in_`` attribute. + +Considerations +############## + +The main consideration is that the addition of the common test means that +existing estimators in downstream libraries will not pass our test suite, +unless the estimators also have the ``n_features_in_`` attribute. + +The newly introduced checks will only raise a warning instead of an exception +for the next 2 releases, so this will give more time for downstream packages +to adjust. + +There are other minor considerations: + +- In most meta-estimators, the input validation is handled by the + sub-estimator(s). The ``n_features_in_`` attribute of the meta-estimator + is thus explicitly set to that of the sub-estimator, either via a + ``@property``, or directly in ``fit()``. +- Some estimators like the dummy estimators do not validate the input + (the 'no_validation' tag should be True). The ``n_features_in_`` attribute + should be set to None, though this is not enforced in the common check. +- Some estimators expect a non-rectangular input: the vectorizers. These + estimators expect dicts or lists, not a ``n_samples * n_features`` matrix. + ``n_features_in_`` makes no sense here and these estimators just don't have + the attribute. +- Some estimators may know the number of input features before ``fit`` is + called: typically the ``SparseCoder``, where ``n_feature_in_`` is known at + ``__init__`` from the ``dictionary`` parameter. In this case the attribute + is a property and is available right after object instantiation. + +References and Footnotes +------------------------ + +.. [1] Each SLEP must either be explicitly labeled as placed in the public + domain (see this SLEP as an example) or licensed under the `Open + Publication License`_. + +.. _Open Publication License: https://www.opencontent.org/openpub/ + + +Copyright +--------- + +This document has been placed in the public domain. [1]_ diff --git a/under_review.rst b/under_review.rst index a5a2d08..47ee5f8 100644 --- a/under_review.rst +++ b/under_review.rst @@ -1,4 +1,7 @@ SLEPs under review ================== -Nothing here +.. toctree:: + :maxdepth: 1 + + slep010/proposal From c991dce789c773a6f619ee4e3c7fe760dbe6d3bf Mon Sep 17 00:00:00 2001 From: adrinjalali Date: Fri, 6 Dec 2019 19:53:49 +0100 Subject: [PATCH 032/118] initial writeup of the slep --- slep012/proposal.rst | 66 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 66 insertions(+) create mode 100644 slep012/proposal.rst diff --git a/slep012/proposal.rst b/slep012/proposal.rst new file mode 100644 index 0000000..bdde582 --- /dev/null +++ b/slep012/proposal.rst @@ -0,0 +1,66 @@ +.. _slep_012: + +========== +InputArray +========== + +This proposal suggests adding a new data structure, called ``InputArray``, +which wraps a data matrix with some added information about the data. This was +motivated when working on input and output feature names. 
Since we expect the +feature names to be attached to the data given to an estimator, there are a few +approaches we can take: + +- ``pandas`` in, ``pandas`` out: this means we expect the user to give the data + as a ``pandas.DataFrame``, and if so, the transformer would output a + ``pandas.DataFrame`` which also includes the [generated] feature names. This + is not a feasible solution since ``pandas`` plans to move to a per column + representation, which means ``pd.DataFrame(np.asarray(df))`` has two + guaranteed memory copies. +- ``XArray``: we could accept a `pandas.DataFrame``, and use + ``xarray.DataArray`` as the output of transformers, including feature names. + However, ``xarray`` depends on ``pandas``, and uses ``pandas.Series`` to + handle row labels and aligns rows when an operation between two + ``xarray.DataArray`` is done. None of these are favorable for our use-case. + +As a result, we need to have another data structure which we'll use to transfer +data related information (such as feature names), which is lightweight and +doesn't interfere with existing user code. + +A main constraint of this data structure is that is should be backward +compatible, *i.e.* code which expects a ``numpy.ndarray`` as the output of a +transformer, would not break. This SLEP focuses on *feature names* as the only +meta-data attached to the data. Support for other meta-data can be added later. + + +Feature Names +************* + +Feature names are an array of strings aligned with the columns. They can be +``None``. + +Operations +********** + +All usual operations (including slicing through ``__getitem__``) return an +``np.ndarray``. The ``__array__`` method also returns the underlying data, w/o +any modifications. This prevents any unwanted computational overhead as a +result of migrating to this data structure. + +The ``select()`` method will act like a ``__getitem__``, except that it +understands feature names and it also returns an ``InputArray``, with the +corresponding meta-data. + +Sparse Arrays +************* + +All of the above applies to sparse arrays. + +Factory Methods +*************** + +There will be factory methods creating an ``InputArray`` given a +``pandas.DataFrame`` or an ``xarray.DataArray`` or simply an ``np.ndarray`` or +an ``sp.SparseMatrix`` and a given set of feature names. + +An ``InputArray`` can also be converted to a `pandas.DataFrame`` using a +``toDataFrame()`` method. From 0b25e2f5e78ba0b7df6f53a2e0c77d2876134965 Mon Sep 17 00:00:00 2001 From: adrinjalali Date: Sat, 7 Dec 2019 10:03:22 -0800 Subject: [PATCH 033/118] clarify on xarray --- slep012/proposal.rst | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/slep012/proposal.rst b/slep012/proposal.rst index bdde582..6896483 100644 --- a/slep012/proposal.rst +++ b/slep012/proposal.rst @@ -18,9 +18,12 @@ approaches we can take: guaranteed memory copies. - ``XArray``: we could accept a `pandas.DataFrame``, and use ``xarray.DataArray`` as the output of transformers, including feature names. - However, ``xarray`` depends on ``pandas``, and uses ``pandas.Series`` to - handle row labels and aligns rows when an operation between two - ``xarray.DataArray`` is done. None of these are favorable for our use-case. 
+ However, ``xarray`` has a hard dependency on ``pandas``, and uses + ``pandas.Index`` to handle row labels and aligns rows when an operation + between two ``xarray.DataArray`` is done, which can be time consuming, and is + not the semantic expected in ``scikit-learn``; we only expect the number of + rows to be equal, and that the rows always correspond to one another in the + same order. As a result, we need to have another data structure which we'll use to transfer data related information (such as feature names), which is lightweight and From ef37bab9c3e7c9bf95b087f2250a8d2db98b33ce Mon Sep 17 00:00:00 2001 From: adrinjalali Date: Fri, 27 Dec 2019 13:16:45 -0800 Subject: [PATCH 034/118] address more comments --- slep012/proposal.rst | 124 +++++++++++++++++++++++++++++++------------ 1 file changed, 91 insertions(+), 33 deletions(-) diff --git a/slep012/proposal.rst b/slep012/proposal.rst index 6896483..b8dcb88 100644 --- a/slep012/proposal.rst +++ b/slep012/proposal.rst @@ -4,59 +4,74 @@ InputArray ========== -This proposal suggests adding a new data structure, called ``InputArray``, -which wraps a data matrix with some added information about the data. This was -motivated when working on input and output feature names. Since we expect the -feature names to be attached to the data given to an estimator, there are a few -approaches we can take: +This proposal results in a solution to propagating feature names through +transformers, pipelines, and the column transformer. Ideally, we would have:: -- ``pandas`` in, ``pandas`` out: this means we expect the user to give the data - as a ``pandas.DataFrame``, and if so, the transformer would output a - ``pandas.DataFrame`` which also includes the [generated] feature names. This - is not a feasible solution since ``pandas`` plans to move to a per column - representation, which means ``pd.DataFrame(np.asarray(df))`` has two - guaranteed memory copies. -- ``XArray``: we could accept a `pandas.DataFrame``, and use - ``xarray.DataArray`` as the output of transformers, including feature names. - However, ``xarray`` has a hard dependency on ``pandas``, and uses - ``pandas.Index`` to handle row labels and aligns rows when an operation - between two ``xarray.DataArray`` is done, which can be time consuming, and is - not the semantic expected in ``scikit-learn``; we only expect the number of - rows to be equal, and that the rows always correspond to one another in the - same order. + df = pd.readcsv('tabular.csv') + # transforming the data in an arbitrary way + transformer0 = ColumnTransformer(...) + # a pipeline preprocessing the data and then a classifier (or a regressor) + clf = make_pipeline(transfoemer0, ..., SVC()) -As a result, we need to have another data structure which we'll use to transfer -data related information (such as feature names), which is lightweight and -doesn't interfere with existing user code. + # now we can investigate features at each stage of the pipeline + clf[-1].input_feature_names_ + +The feature names are propagated throughout the pipeline and the user can +investigate them at each step of the pipeline. + +This proposal suggests adding a new data structure, called ``InputArray``, +which augments the data array ``X`` with additional meta-data. In this proposal +we assume the feature names (and other potential meta-data) are attached to the +data when passed to an estimator. Alternative solutions are discussed later in +this document. 
A main constraint of this data structure is that is should be backward compatible, *i.e.* code which expects a ``numpy.ndarray`` as the output of a transformer, would not break. This SLEP focuses on *feature names* as the only meta-data attached to the data. Support for other meta-data can be added later. +Backward/NumPy/Pandas Compatibility +*********************************** + +Since currently transformers return a ``numpy`` or a ``scipy`` array, backward +compatibility in this context means the operations which are valid on those +arrays should also be valid on the new data structure. + +All operations are delegated to the *data* part of the container, and the +meta-data is lost immediately after each operation and operations result in a +``numpy.ndarray``. This includes indexing and slicing, *i.e.* to avoid +performance degradation, ``__getitem__`` is not overloaded and if the user +wishes to preserve the meta-data, they shall do so via explicitly calling a +method such as ``select()``. Operations between two ``InpuArray``s will not +try to align rows and/or columns of the two given objects. + +``pandas`` compatibility comes ideally as a ``pd.DataFrame(inputarray)``, for +which ``pandas`` does not provide a clean API at the moment. Alternatively, +``inputarray.todataframe()`` would return a ``pandas.DataFrame`` with the +relevant meta-data attached. Feature Names ************* -Feature names are an array of strings aligned with the columns. They can be -``None``. +Feature names are an object ``ndarray`` of strings aligned with the columns. +They can be ``None``. Operations ********** -All usual operations (including slicing through ``__getitem__``) return an -``np.ndarray``. The ``__array__`` method also returns the underlying data, w/o -any modifications. This prevents any unwanted computational overhead as a -result of migrating to this data structure. +Estimators understand the ``InputArray`` and extract the feature names from the +given data before applying the operations and transformations on the data. -The ``select()`` method will act like a ``__getitem__``, except that it -understands feature names and it also returns an ``InputArray``, with the -corresponding meta-data. +All transformers return an ``InputArray`` with feature names attached to it. +The way feature names are generated is discussed in *SLEP007 - The Style of The +Feature Names*. Sparse Arrays ************* -All of the above applies to sparse arrays. +Ideally sparse arrays follow the same pattern, but since ``scipy.sparse`` does +not provide the kinda of API provided by ``numpy``, we may need to find +compromises. Factory Methods *************** @@ -66,4 +81,47 @@ There will be factory methods creating an ``InputArray`` given a an ``sp.SparseMatrix`` and a given set of feature names. An ``InputArray`` can also be converted to a `pandas.DataFrame`` using a -``toDataFrame()`` method. +``todataframe()`` method. 
+ +``X`` being an ``InputArray``:: + + >>> np.array(X) + >>> X.todataframe() + >>> pd.DataFrame(X) # only if pandas implements the API + +And given ``X`` a ``np.ndarray`` or an ``sp.sparse`` matrix and a set of +feature names, one can make the right ``InputArray`` using:: + + >>> make_inputarray(X, feature_names) + +Alternative Solutions +********************* + +Since we expect the feature names to be attached to the data given to an +estimator, there are a few potential approaches we can take: + +- ``pandas`` in, ``pandas`` out: this means we expect the user to give the data + as a ``pandas.DataFrame``, and if so, the transformer would output a + ``pandas.DataFrame`` which also includes the [generated] feature names. This + is not a feasible solution since ``pandas`` plans to move to a per column + representation, which means ``pd.DataFrame(np.asarray(df))`` has two + guaranteed memory copies. +- ``XArray``: we could accept a `pandas.DataFrame``, and use + ``xarray.DataArray`` as the output of transformers, including feature names. + However, ``xarray`` has a hard dependency on ``pandas``, and uses + ``pandas.Index`` to handle row labels and aligns rows when an operation + between two ``xarray.DataArray`` is done, which can be time consuming, and is + not the semantic expected in ``scikit-learn``; we only expect the number of + rows to be equal, and that the rows always correspond to one another in the + same order. + +As a result, we need to have another data structure which we'll use to transfer +data related information (such as feature names), which is lightweight and +doesn't interfere with existing user code. + +Another alternative to the problem of passing meta-data around is to pass that +as a parameter to ``fit``. This would heavily involve modifying meta-estimators +since they'd need to pass that information, and extract the relevant +information from the estimators to pass that along to the next estimator. Our +prototype implementations showed significant challenges compared to when the +meta-data is attached to the data. From 02ce4db90481709bcff18450e7e8f89fa27764c7 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Sun, 12 Jan 2020 23:41:22 +0100 Subject: [PATCH 035/118] set SLEP10 as accepted (#27) --- index.rst | 1 + slep010/proposal.rst | 2 +- under_review.rst | 10 +++++++--- 3 files changed, 9 insertions(+), 4 deletions(-) diff --git a/index.rst b/index.rst index 16bb235..48e1f99 100644 --- a/index.rst +++ b/index.rst @@ -16,6 +16,7 @@ :caption: Accepted slep009/proposal + slep010/proposal .. toctree:: :maxdepth: 1 diff --git a/slep010/proposal.rst b/slep010/proposal.rst index 3c4387a..a8517c2 100644 --- a/slep010/proposal.rst +++ b/slep010/proposal.rst @@ -5,7 +5,7 @@ SLEP010: ``n_features_in_`` attribute ===================================== :Author: Nicolas Hug -:Status: Under review +:Status: Accepted :Type: Standards Track :Created: 2019-11-23 diff --git a/under_review.rst b/under_review.rst index 47ee5f8..2f1bcd3 100644 --- a/under_review.rst +++ b/under_review.rst @@ -1,7 +1,11 @@ SLEPs under review ================== -.. toctree:: - :maxdepth: 1 +No SLEP is currently under review. - slep010/proposal +.. Uncomment below when a SLEP is under review + +.. .. toctree:: +.. :maxdepth: 1 + +.. 
slepXXX/proposal From 09e12757c38c66ff2356b3ad74e17047b507d198 Mon Sep 17 00:00:00 2001 From: adrinjalali Date: Wed, 12 Feb 2020 15:51:36 +0100 Subject: [PATCH 036/118] SLEP012: n_features_out_ --- .gitignore | 3 +++ slep012/proposal.rst | 63 ++++++++++++++++++++++++++++++++++++++++++++ under_review.rst | 8 +++--- 3 files changed, 70 insertions(+), 4 deletions(-) create mode 100644 slep012/proposal.rst diff --git a/.gitignore b/.gitignore index ba74660..de69825 100644 --- a/.gitignore +++ b/.gitignore @@ -55,3 +55,6 @@ docs/_build/ # PyBuilder target/ + +# Editors +.vscode diff --git a/slep012/proposal.rst b/slep012/proposal.rst new file mode 100644 index 0000000..9ff7a3a --- /dev/null +++ b/slep012/proposal.rst @@ -0,0 +1,63 @@ +.. _slep_012: + +====================================== +SLEP012: ``n_features_out_`` attribute +====================================== + +:Author: Adrin Jalali +:Status: Under Review +:Type: Standards Track +:Created: 2020-02-12 + +Abstract +######## + +This SLEP proposes the introduction of a public ``n_features_out_`` attribute +for most transformers (where relevant). + +Motivation +########## + +Knowing the number of features that a transformer outputs is useful for +inspection purposes. + +Solution +######## + +The proposed solution is for the ``n_features_out_`` attribute to be set once a +call to ``fit`` is done. In most cases the value of ``n_features_out_`` is the +same as some other attribute stored in the transformer, *e.g.* +``n_components_``, and in these cases a ``Mixin`` such as a ``ComponentsMixin`` +can delegate ``n_features_out_`` to those attributes. + +Considerations +############## + +The main consideration is that the addition of the common test means that +existing estimators in downstream libraries will not pass our test suite, +unless the estimators also have the ``n_features_out_`` attribute. + +The newly introduced checks will only raise a warning instead of an exception +for the next 2 releases, so this will give more time for downstream packages +to adjust. + +There are other minor considerations: + +- In most meta-estimators, this is handled by the + sub-estimator(s). The ``n_features_out_`` attribute of the meta-estimator is + thus explicitly set to that of the sub-estimator, either via a ``@property``, + or directly in ``fit()``. +- Some transformers such as ``FunctionTransformer`` may not know the number + of output features. In such cases ``n_features_out_`` is set to ``None``. + +Copyright +--------- + +This document has been placed in the public domain. [1]_ + +References and Footnotes +------------------------ + +.. [1] _Open Publication License: https://www.opencontent.org/openpub/ + + diff --git a/under_review.rst b/under_review.rst index 2f1bcd3..72f95b4 100644 --- a/under_review.rst +++ b/under_review.rst @@ -1,11 +1,11 @@ SLEPs under review ================== -No SLEP is currently under review. +.. No SLEP is currently under review. .. Uncomment below when a SLEP is under review -.. .. toctree:: -.. :maxdepth: 1 +.. toctree:: + :maxdepth: 1 -.. 
slepXXX/proposal + slep012/proposal From 5d9f4049dbcd6b05971a45959894a07f2eff0628 Mon Sep 17 00:00:00 2001 From: adrinjalali Date: Wed, 12 Feb 2020 15:53:40 +0100 Subject: [PATCH 037/118] mention n_features_in_ --- slep012/proposal.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/slep012/proposal.rst b/slep012/proposal.rst index 9ff7a3a..fe16b1a 100644 --- a/slep012/proposal.rst +++ b/slep012/proposal.rst @@ -19,7 +19,7 @@ Motivation ########## Knowing the number of features that a transformer outputs is useful for -inspection purposes. +inspection purposes. This is in conjunction with *SLEP010: ``n_features_in_``*. Solution ######## From 2d5e88f0df95c24f6b721bff6ff5a4a976addf41 Mon Sep 17 00:00:00 2001 From: adrinjalali Date: Thu, 13 Feb 2020 15:47:32 +0100 Subject: [PATCH 038/118] add link to n_features_in_ and mention the new test --- slep012/proposal.rst | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/slep012/proposal.rst b/slep012/proposal.rst index fe16b1a..1c001f3 100644 --- a/slep012/proposal.rst +++ b/slep012/proposal.rst @@ -19,7 +19,8 @@ Motivation ########## Knowing the number of features that a transformer outputs is useful for -inspection purposes. This is in conjunction with *SLEP010: ``n_features_in_``*. +inspection purposes. This is in conjunction with `*SLEP010: ``n_features_in_``* +`_. Solution ######## @@ -30,6 +31,12 @@ same as some other attribute stored in the transformer, *e.g.* ``n_components_``, and in these cases a ``Mixin`` such as a ``ComponentsMixin`` can delegate ``n_features_out_`` to those attributes. +Testing +------- + +A test to the common tests is added to ensure the presence of the attribute or +property after calling ``fit``. + Considerations ############## From 01bab58b8361c3aa032f9bda6c5b3ccc3343fecc Mon Sep 17 00:00:00 2001 From: adrinjalali Date: Thu, 13 Feb 2020 16:22:28 +0100 Subject: [PATCH 039/118] more Nicolas's changes --- slep012/proposal.rst | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/slep012/proposal.rst b/slep012/proposal.rst index 1c001f3..f762b80 100644 --- a/slep012/proposal.rst +++ b/slep012/proposal.rst @@ -50,12 +50,13 @@ to adjust. There are other minor considerations: -- In most meta-estimators, this is handled by the +- In most meta-estimators, this is delegated to the sub-estimator(s). The ``n_features_out_`` attribute of the meta-estimator is thus explicitly set to that of the sub-estimator, either via a ``@property``, or directly in ``fit()``. - Some transformers such as ``FunctionTransformer`` may not know the number - of output features. In such cases ``n_features_out_`` is set to ``None``. + of output features since arbitrary arrays can be passed to `transform`. In + such cases ``n_features_out_`` is set to ``None``. Copyright --------- From da14bcdc20e291aa15128be1fc88e43fa4a37ef9 Mon Sep 17 00:00:00 2001 From: adrinjalali Date: Thu, 13 Feb 2020 16:36:14 +0100 Subject: [PATCH 040/118] most -> some --- slep012/proposal.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/slep012/proposal.rst b/slep012/proposal.rst index f762b80..e39456b 100644 --- a/slep012/proposal.rst +++ b/slep012/proposal.rst @@ -50,7 +50,7 @@ to adjust. There are other minor considerations: -- In most meta-estimators, this is delegated to the +- In some meta-estimators, this is delegated to the sub-estimator(s). 
The ``n_features_out_`` attribute of the meta-estimator is thus explicitly set to that of the sub-estimator, either via a ``@property``, or directly in ``fit()``. From 34a82fd74e0ba0b5b2ef5f631e18cdb3cb775445 Mon Sep 17 00:00:00 2001 From: Adrin Jalali Date: Fri, 14 Feb 2020 10:49:30 +0100 Subject: [PATCH 041/118] Slep007 - feature names, their generation and the API (#17) * initial 007 * fix typo * fix code block * rewrite the SLEP * clarify the flexibility on metaestimators * add motivation and clarifications * apply more comments * address andy's comments * add examples * add redundant prefix example, clarify O(1) issue * put slep under review * address Nicolas's suggestions * Update slep007/proposal.rst Co-Authored-By: Andreas Mueller * change the title * shorted example * address Nicolas's comments, remove onetoone mapping * address Nicolas's comments * trying to address Guillaume's comments * imagine -> include Co-authored-by: Andreas Mueller --- slep007/proposal.rst | 288 +++++++++++++++++++++++++++++++++++++++++++ under_review.rst | 8 +- 2 files changed, 292 insertions(+), 4 deletions(-) create mode 100644 slep007/proposal.rst diff --git a/slep007/proposal.rst b/slep007/proposal.rst new file mode 100644 index 0000000..523b149 --- /dev/null +++ b/slep007/proposal.rst @@ -0,0 +1,288 @@ + .. _slep_007: + +=========================================== +Feature names, their generation and the API +=========================================== + +:Author: Adrin Jalali +:Status: Under Review +:Type: Standards Track +:Created: 2019-04 + +Abstract +######## + +This SLEP proposes the introduction of the ``feature_names_in_`` attribute for +all estimators, and the ``feature_names_out_`` attribute for all transformers. +We here discuss the generation of such attributes and their propagation through +pipelines. Since for most estimators there are multiple ways to generate +feature names, this SLEP does not intend to define how exactly feature names +are generated for all of them. + +Motivation +########## + +``scikit-learn`` has been making it easier to build complex workflows with the +``ColumnTransformer`` and it has been seeing widespread adoption. However, +using it results in pipelines where it's not clear what the input features to +the final predictor are, even more so than before. For example, after fitting +the following pipeline, users should ideally be able to inspect the features +going into the final predictor:: + + + X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True) + + # We will train our classifier with the following features: + # Numeric Features: + # - age: float. + # - fare: float. + # Categorical Features: + # - embarked: categories encoded as strings {'C', 'S', 'Q'}. + # - sex: categories encoded as strings {'female', 'male'}. + # - pclass: ordinal integers {1, 2, 3}. + + # We create the preprocessing pipelines for both numeric and categorical data. + numeric_features = ['age', 'fare'] + numeric_transformer = Pipeline(steps=[ + ('imputer', SimpleImputer(strategy='median')), + ('scaler', StandardScaler())]) + + categorical_features = ['embarked', 'sex', 'pclass'] + categorical_transformer = Pipeline(steps=[ + ('imputer', SimpleImputer(strategy='constant', fill_value='missing')), + ('onehot', OneHotEncoder(handle_unknown='ignore'))]) + + preprocessor = ColumnTransformer( + transformers=[ + ('num', numeric_transformer, numeric_features), + ('cat', categorical_transformer, categorical_features)]) + + # Append classifier to preprocessing pipeline. 
+ # Now we have a full prediction pipeline. + clf = Pipeline(steps=[('preprocessor', preprocessor), + ('classifier', LogisticRegression())]) + + X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) + + clf.fit(X_train, y_train) + + +However, it's impossible to interpret or even sanity-check the +``LogisticRegression`` instance that's produced in the example, because the +correspondence of the coefficients to the input features is basically +impossible to figure out. + +This proposal suggests adding two attributes to fitted estimators: +``feature_names_in_`` and ``feature_names_out_``, such that in the +abovementioned example ``clf[-1].feature_names_in_`` and +``clf[-2].feature_names_out_`` will be:: + + ['num__age', + 'num__fare', + 'cat__embarked_C', + 'cat__embarked_Q', + 'cat__embarked_S', + 'cat__embarked_missing', + 'cat__sex_female', + 'cat__sex_male', + 'cat__pclass_1', + 'cat__pclass_2', + 'cat__pclass_3'] + +Ideally the generated feature names describe how a feature is generated at each +stage of a pipeline. For instance, ``cat__sex_female`` shows that the feature +has been through a categorical preprocessing pipeline, was originally the +column ``sex``, and has been one hot encoded and is one if it was originally +``female``. However, this is not always possible or desirable especially when a +generated column is based on many columns, since the generated feature names +will be too long, for example in ``PCA``. As a rule of thumb, the following +types of transformers may generate feature names which corresponds to the +original features: + +- Leave columns unchanged, *e.g.* ``StandardScaler`` +- Select a subset of columns, *e.g.* ``SelectKBest`` +- create new columns where each column depends on at most one input column, + *e.g* ``OneHotEncoder`` +- Algorithms that create combinations of a fixed number of features, *e.g.* + ``PolynomialFeatures``, as opposed to all of + them where there are many. Note that verbosity considerations and + ``verbose_feature_names`` as explained later can apply here. + +This proposal talks about how feature names are generated and not how they are +propagated. + +verbose_feature_names +********************* + +``verbose_feature_names`` controls the verbosity of the generated feature names +and it can be ``True`` or ``False``. Alternative solutions could include: + +- an integer: fine tuning the verbosity of the generated feature names. +- a ``callable`` which would give further flexibility to the user to generate + user defined feature names. + +These alternatives may be discussed and implemented in the future if deemed +necessary. + +Scope +##### + +The API for input and output feature names includes a ``feature_names_in_`` +attribute for all estimators, and a ``feature_names_out_`` attribute for any +estimator with a ``transform`` method, *i.e.* they expose the generated feature +names via the ``feature_names_out_`` attribute. + +Note that this SLEP also applies to `resamplers +`_ the same way +as transformers. + +Input Feature Names +################### + +The input feature names are stored in a fitted estimator in a +``feature_names_in_`` attribute, and are taken from the given input data, for +instance a ``pandas`` data frame. This attribute will be ``None`` if the input +provides no feature names. + +Output Feature Names +#################### + +A fitted estimator exposes the output feature names through the +``feature_names_out_`` attribute. Here we discuss more in detail how these +feature names are generated. 
Since for most estimators there are multiple ways +to generate feature names, this SLEP does not intend to define how exactly +feature names are generated for all of them. It is instead a guideline on how +they could generally be generated. Furthermore, that specific behavior of a +given estimator may be tuned via the ``verbose_feature_names`` parameter, as +detailed below. + +As detailed bellow, some generated output features names are the same or a +derived from the input feature names. In such cases, if no input feature names +are provided, ``x0`` to ``xn`` are assumed to be their names. + +Feature Selector Transformers +***************************** + +This includes transformers which output a subset of the input features, w/o +changing them. For example, if a ``SelectKBest`` transformer selects the first +and the third features, and no names are provided, the ``feature_names_out_`` +will be ``[x0, x2]``. + +Feature Generating Transformers +******************************* + +The simplest category of transformers in this section are the ones which +generate a column based on a single given column. The generated output column +in this case is a sensible transformation of the input feature name. For +instance, a ``LogTransformer`` can do ``'age' -> 'log(age)'``, and a +``OneHotEncoder`` could do ``'gender' -> 'gender_female', 'gender_fluid', +...``. An alternative is to leave the feature names unchanged when each output +feature corresponds to exactly one input feature. Whether or not to modify the +feature name, *e.g.* ``log(x0)`` vs. ``x0`` may be controlled via the +``verbose_feature_names`` to the constructor. The default value of +``verbose_feature_names`` can be different depending on the transformer. For +instance, ``StandardScaler`` can have it as ``False``, whereas +``LogTransformer`` could have it as ``True`` by default. + +Transformers where each output feature depends on a fixed number of input +features may generate descriptive names as well. For instance, a +``PolynomialTransformer`` on a small subset of features can generate an output +feature name such as ``x[0] * x[2] ** 3``. + +And finally, the transformers where each output feature depends on many or all +input features, generate feature names which has the form of ``name0`` to +``namen``, where ``name`` represents the transformer. For instance, a ``PCA`` +transformer will output ``[pca0, ..., pcan]``, ``n`` being the number of PCA +components. + +Meta-Estimators +*************** + +Meta estimators can choose to prefix the output feature names given by the +estimators they are wrapping or not. + +By default, ``Pipeline`` adds no prefix, *i.e* its ``feature_names_out_`` is +the same as the ``feature_names_out_`` of the last step, and ``None`` if the +last step is not a transformer. + +``ColumnTransformer`` by default adds a prefix to the output feature names, +indicating the name of the transformer applied to them. If a column is in the output +as a part of ``passthrough``, it won't be prefixed since no operation has been +applied on it. + +This is the default behavior, and it can be tuned by constructor parameters if +the meta estimator allows it. For instance, a ``verbose_feature_names=False`` +may indicate that a ``ColumnTransformer`` should not prefix the generated +feature names with the name of the step. 
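The prefixing behaviour described above can be illustrated with a small
sketch; the helper name ``_prefix_feature_names`` and its signature are
assumptions made here for illustration only, not part of the proposed API::

    def _prefix_feature_names(step_name, names, prefix=True):
        # Default ``ColumnTransformer``-like behaviour: prefix each output
        # feature name with the name of the step that produced it.
        # ``prefix=False`` stands in for ``verbose_feature_names=False`` or
        # for a ``passthrough`` column, which keeps its original name.
        if not prefix:
            return list(names)
        return ['{}_{}'.format(step_name, name) for name in names]

    _prefix_feature_names('cat', ['sex_female', 'sex_male'])
    # -> ['cat_sex_female', 'cat_sex_male']
    _prefix_feature_names('remainder', ['age'], prefix=False)
    # -> ['age']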
+ +Examples +######## + +Here we include some examples to demonstrate the behavior of output feature +names:: + + 100 features (no names) -> PCA(n_components=3) + feature_names_out_: [pca0, pca1, pca2] + + + 100 features (no names) -> SelectKBest(k=3) + feature_names_out_: [x2, x17, x42] + + + [f1, ..., f100] -> SelectKBest(k=3) + feature_names_out_: [f2, f17, f42] + + + [cat0] -> OneHotEncoder() + feature_names_out_: [cat0_cat, cat0_dog, ...] + + + [f1, ..., f100] -> Pipeline( + [SelectKBest(k=30), + PCA(n_components=3)] + ) + feature_names_out_: [pca0, pca1, pca2] + + + [model, make, numeric0, ..., numeric100] -> + ColumnTransformer( + [('cat', Pipeline(SimpleImputer(), OneHotEncoder()), + ['model', 'make']), + ('num', Pipeline(SimpleImputer(), PCA(n_components=3)), + ['numeric0', ..., 'numeric100'])] + ) + feature_names_out_: ['cat_model_100', 'cat_model_200', ..., + 'cat_make_ABC', 'cat_make_XYZ', ..., + 'num_pca0', 'num_pca1', 'num_pca2'] + +However, the following examples produce a somewhat redundant feature names, +and hence the relevance of ``verbose_feature_names=False``:: + + [model, make, numeric0, ..., numeric100] -> + ColumnTransformer([ + ('ohe', OneHotEncoder(), ['model', 'make']), + ('pca', PCA(n_components=3), ['numeric0', ..., 'numeric100']) + ]) + feature_names_out_: ['ohe_model_100', 'ohe_model_200', ..., + 'ohe_make_ABC', 'ohe_make_XYZ', ..., + 'pca_pca0', 'pca_pca1', 'pca_pca2'] + +If desired, the user can remove the prefixes:: + + [model, make, numeric0, ..., numeric100] -> + make_column_transformer( + (OneHotEncoder(), ['model', 'make']), + (PCA(n_components=3), ['numeric0', ..., 'numeric100']), + verbose_feature_names=False + ) + feature_names_out_: ['model_100', 'model_200', ..., + 'make_ABC', 'make_XYZ', ..., + 'pca0', 'pca1', 'pca2'] + +Backward Compatibility +###################### + +All estimators should implement the ``feature_names_in_`` and +``feature_names_out_`` API. This is checked in ``check_estimator``, and the +transition is done with a ``FutureWarning`` for at least two versions to give +time to third party developers to implement the API. diff --git a/under_review.rst b/under_review.rst index 2f1bcd3..51d9eab 100644 --- a/under_review.rst +++ b/under_review.rst @@ -1,11 +1,11 @@ SLEPs under review ================== -No SLEP is currently under review. +.. No SLEP is currently under review. .. Uncomment below when a SLEP is under review -.. .. toctree:: -.. :maxdepth: 1 +.. toctree:: + :maxdepth: 1 -.. slepXXX/proposal + slep007/proposal From a4a84eb812afd7e6dc2b81d41777067428558d6f Mon Sep 17 00:00:00 2001 From: adrinjalali Date: Tue, 18 Feb 2020 14:54:03 +0100 Subject: [PATCH 042/118] most -> many --- slep012/proposal.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/slep012/proposal.rst b/slep012/proposal.rst index e39456b..61eb118 100644 --- a/slep012/proposal.rst +++ b/slep012/proposal.rst @@ -26,7 +26,7 @@ Solution ######## The proposed solution is for the ``n_features_out_`` attribute to be set once a -call to ``fit`` is done. In most cases the value of ``n_features_out_`` is the +call to ``fit`` is done. In many cases the value of ``n_features_out_`` is the same as some other attribute stored in the transformer, *e.g.* ``n_components_``, and in these cases a ``Mixin`` such as a ``ComponentsMixin`` can delegate ``n_features_out_`` to those attributes. 
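The ``ComponentsMixin`` mentioned in the hunk above is only named, not
specified; a minimal sketch of the intended delegation, assuming it simply
forwards ``n_features_out_`` to an attribute the transformer already stores
after ``fit`` (here ``n_components_``), could look like::

    class ComponentsMixin:
        # Sketch only: expose ``n_features_out_`` by delegating to an
        # existing fitted attribute.
        @property
        def n_features_out_(self):
            return self.n_components_


    class ToyPCA(ComponentsMixin):
        # Hypothetical stand-in for a transformer such as PCA.
        def fit(self, X, y=None):
            self.n_components_ = 3
            return self

    ToyPCA().fit([[1, 2, 3, 4]]).n_features_out_  # -> 3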
From 48ce7f40175c37fbaa95ed36aec1cd640787bcea Mon Sep 17 00:00:00 2001 From: adrinjalali Date: Tue, 18 Feb 2020 20:52:32 +0100 Subject: [PATCH 043/118] add headers --- slep012/proposal.rst | 8 ++++++++ under_review.rst | 1 + 2 files changed, 9 insertions(+) diff --git a/slep012/proposal.rst b/slep012/proposal.rst index b8dcb88..431bacc 100644 --- a/slep012/proposal.rst +++ b/slep012/proposal.rst @@ -4,6 +4,14 @@ InputArray ========== +:Author: Adrin jalali +:Status: Draft +:Type: Standards Track +:Created: 2019-12-20 + +Motivation +********** + This proposal results in a solution to propagating feature names through transformers, pipelines, and the column transformer. Ideally, we would have:: diff --git a/under_review.rst b/under_review.rst index 51d9eab..ff52d4e 100644 --- a/under_review.rst +++ b/under_review.rst @@ -9,3 +9,4 @@ SLEPs under review :maxdepth: 1 slep007/proposal + slep012/proposal From 3ad63628af0e2f385cf1633b98412fc3bf470f78 Mon Sep 17 00:00:00 2001 From: Thomas J Fan Date: Tue, 18 Feb 2020 15:09:28 -0500 Subject: [PATCH 044/118] CLN Small spelling error in SLEP012 (#34) --- slep012/proposal.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/slep012/proposal.rst b/slep012/proposal.rst index 431bacc..b8390bf 100644 --- a/slep012/proposal.rst +++ b/slep012/proposal.rst @@ -19,7 +19,7 @@ transformers, pipelines, and the column transformer. Ideally, we would have:: # transforming the data in an arbitrary way transformer0 = ColumnTransformer(...) # a pipeline preprocessing the data and then a classifier (or a regressor) - clf = make_pipeline(transfoemer0, ..., SVC()) + clf = make_pipeline(transformer0, ..., SVC()) # now we can investigate features at each stage of the pipeline clf[-1].input_feature_names_ From 58354ba548e880576d55212d24f37d4ca76ad2b2 Mon Sep 17 00:00:00 2001 From: adrinjalali Date: Wed, 19 Feb 2020 15:36:10 +0100 Subject: [PATCH 045/118] rename to slep13 --- {slep012 => slep013}/proposal.rst | 2 +- under_review.rst | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) rename {slep012 => slep013}/proposal.rst (99%) diff --git a/slep012/proposal.rst b/slep013/proposal.rst similarity index 99% rename from slep012/proposal.rst rename to slep013/proposal.rst index 61eb118..649b036 100644 --- a/slep012/proposal.rst +++ b/slep013/proposal.rst @@ -1,4 +1,4 @@ -.. _slep_012: +.. _slep_013: ====================================== SLEP012: ``n_features_out_`` attribute diff --git a/under_review.rst b/under_review.rst index 72f95b4..4385333 100644 --- a/under_review.rst +++ b/under_review.rst @@ -8,4 +8,4 @@ SLEPs under review .. toctree:: :maxdepth: 1 - slep012/proposal + slep013/proposal From 3e7b46037f410b72abc44ef27593c3c125ebc6fa Mon Sep 17 00:00:00 2001 From: adrinjalali Date: Wed, 19 Feb 2020 17:34:57 +0100 Subject: [PATCH 046/118] missed slep12->slep13 --- slep013/proposal.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/slep013/proposal.rst b/slep013/proposal.rst index 649b036..4744aaa 100644 --- a/slep013/proposal.rst +++ b/slep013/proposal.rst @@ -1,7 +1,7 @@ .. 
_slep_013: ====================================== -SLEP012: ``n_features_out_`` attribute +SLEP013: ``n_features_out_`` attribute ====================================== :Author: Adrin Jalali From 25316d97808c00621094a0058a0973dab5cf627f Mon Sep 17 00:00:00 2001 From: Andreas Mueller Date: Wed, 19 Feb 2020 18:50:42 -0500 Subject: [PATCH 047/118] make more explicit that verbose_feature_names is not required --- slep007/proposal.rst | 37 +++++++++++++++++++++++-------------- 1 file changed, 23 insertions(+), 14 deletions(-) diff --git a/slep007/proposal.rst b/slep007/proposal.rst index 523b149..ddef250 100644 --- a/slep007/proposal.rst +++ b/slep007/proposal.rst @@ -111,19 +111,6 @@ original features: This proposal talks about how feature names are generated and not how they are propagated. -verbose_feature_names -********************* - -``verbose_feature_names`` controls the verbosity of the generated feature names -and it can be ``True`` or ``False``. Alternative solutions could include: - -- an integer: fine tuning the verbosity of the generated feature names. -- a ``callable`` which would give further flexibility to the user to generate - user defined feature names. - -These alternatives may be discussed and implemented in the future if deemed -necessary. - Scope ##### @@ -267,7 +254,29 @@ and hence the relevance of ``verbose_feature_names=False``:: 'ohe_make_ABC', 'ohe_make_XYZ', ..., 'pca_pca0', 'pca_pca1', 'pca_pca2'] -If desired, the user can remove the prefixes:: +Extensions +########## + +verbose_feature_names +********************* +By default, transformers will retain existing feature names. In some cases +it might be desireable to allow the feature names to capture that a transformation +(like scaling or log) has taken place. + +To allow for that, ``verbose_feature_names`` can be added as a constructor +parameter to certain transformers to control the verbosity of generated feature +names. The ``verbose_feature_names`` parameter can be ``True`` or ``False``. +Alternative solutions could include: + +- an integer: fine tuning the verbosity of the generated feature names. +- a ``callable`` which would give further flexibility to the user to generate + user defined feature names. + +These alternatives may be discussed and implemented in the future if deemed +necessary. + +In case of the ``ColumnTransformer`` example above ``verbose_feature_names`` +could remove the estimator names, leading to shorter and less redundant names:: [model, make, numeric0, ..., numeric100] -> make_column_transformer( From 0394f6ec8128779bcef600bdfecf9f97b3fb2005 Mon Sep 17 00:00:00 2001 From: Andreas Mueller Date: Thu, 20 Feb 2020 11:34:44 -0500 Subject: [PATCH 048/118] don't be contradictory re defaults Also, don't move all discussion of verbose_feature_names to potential extensions. --- slep007/proposal.rst | 57 ++++++++++++++++---------------------------- 1 file changed, 21 insertions(+), 36 deletions(-) diff --git a/slep007/proposal.rst b/slep007/proposal.rst index ddef250..a127665 100644 --- a/slep007/proposal.rst +++ b/slep007/proposal.rst @@ -139,9 +139,7 @@ A fitted estimator exposes the output feature names through the feature names are generated. Since for most estimators there are multiple ways to generate feature names, this SLEP does not intend to define how exactly feature names are generated for all of them. It is instead a guideline on how -they could generally be generated. 
Furthermore, that specific behavior of a -given estimator may be tuned via the ``verbose_feature_names`` parameter, as -detailed below. +they could generally be generated. As detailed bellow, some generated output features names are the same or a derived from the input feature names. In such cases, if no input feature names @@ -159,17 +157,12 @@ Feature Generating Transformers ******************************* The simplest category of transformers in this section are the ones which -generate a column based on a single given column. The generated output column -in this case is a sensible transformation of the input feature name. For -instance, a ``LogTransformer`` can do ``'age' -> 'log(age)'``, and a -``OneHotEncoder`` could do ``'gender' -> 'gender_female', 'gender_fluid', -...``. An alternative is to leave the feature names unchanged when each output -feature corresponds to exactly one input feature. Whether or not to modify the -feature name, *e.g.* ``log(x0)`` vs. ``x0`` may be controlled via the -``verbose_feature_names`` to the constructor. The default value of -``verbose_feature_names`` can be different depending on the transformer. For -instance, ``StandardScaler`` can have it as ``False``, whereas -``LogTransformer`` could have it as ``True`` by default. +generate a column based on a single given column. These would simply +preserve the input feature names if a single new feature is generated, +such as in ``StandardScaler``, which would mape ``'age'`` to ``'age'``. +If an input feature maps to multiple new +features, a postfix is added, so that ``OneHotEncoder`` might map +``'gender'`` to ``'gender_female'`` ``'gender_fluid'`` etc. Transformers where each output feature depends on a fixed number of input features may generate descriptive names as well. For instance, a @@ -197,11 +190,6 @@ indicating the name of the transformer applied to them. If a column is in the ou as a part of ``passthrough``, it won't be prefixed since no operation has been applied on it. -This is the default behavior, and it can be tuned by constructor parameters if -the meta estimator allows it. For instance, a ``verbose_feature_names=False`` -may indicate that a ``ColumnTransformer`` should not prefix the generated -feature names with the name of the step. - Examples ######## @@ -242,8 +230,7 @@ names:: 'cat_make_ABC', 'cat_make_XYZ', ..., 'num_pca0', 'num_pca1', 'num_pca2'] -However, the following examples produce a somewhat redundant feature names, -and hence the relevance of ``verbose_feature_names=False``:: +However, the following examples produce a somewhat redundant feature names:: [model, make, numeric0, ..., numeric100] -> ColumnTransformer([ @@ -259,21 +246,10 @@ Extensions verbose_feature_names ********************* -By default, transformers will retain existing feature names. In some cases -it might be desireable to allow the feature names to capture that a transformation -(like scaling or log) has taken place. - -To allow for that, ``verbose_feature_names`` can be added as a constructor -parameter to certain transformers to control the verbosity of generated feature -names. The ``verbose_feature_names`` parameter can be ``True`` or ``False``. -Alternative solutions could include: - -- an integer: fine tuning the verbosity of the generated feature names. -- a ``callable`` which would give further flexibility to the user to generate - user defined feature names. - -These alternatives may be discussed and implemented in the future if deemed -necessary. 
+To provide more control over feature names, we could add a boolean +``verbose_feature_names`` constructor argument to certain transformers. +The default would reflect the description above, but changes would allow more verbose +names in some transformers, say having ``StandardScaler`` map ``'age'`` to ``'scale(age)'``. In case of the ``ColumnTransformer`` example above ``verbose_feature_names`` could remove the estimator names, leading to shorter and less redundant names:: @@ -288,6 +264,15 @@ could remove the estimator names, leading to shorter and less redundant names:: 'make_ABC', 'make_XYZ', ..., 'pca0', 'pca1', 'pca2'] +Alternative solutions to a boolean flag could include: + +- an integer: fine tuning the verbosity of the generated feature names. +- a ``callable`` which would give further flexibility to the user to generate + user defined feature names. + +These alternatives may be discussed and implemented in the future if deemed +necessary. + Backward Compatibility ###################### From 6b556cf3489aa468bf865b8f4d80bd418e314a00 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Mon, 24 Feb 2020 05:57:13 -0500 Subject: [PATCH 049/118] Update slep007/proposal.rst --- slep007/proposal.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/slep007/proposal.rst b/slep007/proposal.rst index a127665..93f8c26 100644 --- a/slep007/proposal.rst +++ b/slep007/proposal.rst @@ -159,7 +159,7 @@ Feature Generating Transformers The simplest category of transformers in this section are the ones which generate a column based on a single given column. These would simply preserve the input feature names if a single new feature is generated, -such as in ``StandardScaler``, which would mape ``'age'`` to ``'age'``. +such as in ``StandardScaler``, which would map ``'age'`` to ``'age'``. If an input feature maps to multiple new features, a postfix is added, so that ``OneHotEncoder`` might map ``'gender'`` to ``'gender_female'`` ``'gender_fluid'`` etc. From 0fa0b533c780fbe5c63fc12239017ec7e4073ef3 Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Thu, 5 Mar 2020 13:56:28 +1100 Subject: [PATCH 050/118] DOC Moves under review to top level toc (#40) --- index.rst | 4 +++- slep007/proposal.rst | 6 +++--- slep012/proposal.rst | 24 ++++++++++++------------ under_review.rst | 13 ------------- 4 files changed, 18 insertions(+), 29 deletions(-) delete mode 100644 under_review.rst diff --git a/index.rst b/index.rst index 48e1f99..a68713e 100644 --- a/index.rst +++ b/index.rst @@ -9,7 +9,9 @@ :maxdepth: 1 :caption: Under review - under_review + slep007/proposal + slep012/proposal + slep013/proposal .. toctree:: :maxdepth: 1 diff --git a/slep007/proposal.rst b/slep007/proposal.rst index 93f8c26..1dd9c7c 100644 --- a/slep007/proposal.rst +++ b/slep007/proposal.rst @@ -1,8 +1,8 @@ .. _slep_007: -=========================================== -Feature names, their generation and the API -=========================================== +==================================================== +SLEP007: Feature names, their generation and the API +==================================================== :Author: Adrin Jalali :Status: Under Review diff --git a/slep012/proposal.rst b/slep012/proposal.rst index b8390bf..af4dd78 100644 --- a/slep012/proposal.rst +++ b/slep012/proposal.rst @@ -1,8 +1,8 @@ .. 
_slep_012: -========== -InputArray -========== +======================= +SLEP012: ``InputArray`` +======================= :Author: Adrin jalali :Status: Draft @@ -10,7 +10,7 @@ InputArray :Created: 2019-12-20 Motivation -********** +########## This proposal results in a solution to propagating feature names through transformers, pipelines, and the column transformer. Ideally, we would have:: @@ -39,7 +39,7 @@ transformer, would not break. This SLEP focuses on *feature names* as the only meta-data attached to the data. Support for other meta-data can be added later. Backward/NumPy/Pandas Compatibility -*********************************** +################################### Since currently transformers return a ``numpy`` or a ``scipy`` array, backward compatibility in this context means the operations which are valid on those @@ -59,13 +59,13 @@ which ``pandas`` does not provide a clean API at the moment. Alternatively, relevant meta-data attached. Feature Names -************* +############# Feature names are an object ``ndarray`` of strings aligned with the columns. They can be ``None``. Operations -********** +########## Estimators understand the ``InputArray`` and extract the feature names from the given data before applying the operations and transformations on the data. @@ -75,20 +75,20 @@ The way feature names are generated is discussed in *SLEP007 - The Style of The Feature Names*. Sparse Arrays -************* +############# Ideally sparse arrays follow the same pattern, but since ``scipy.sparse`` does not provide the kinda of API provided by ``numpy``, we may need to find compromises. Factory Methods -*************** +############### There will be factory methods creating an ``InputArray`` given a ``pandas.DataFrame`` or an ``xarray.DataArray`` or simply an ``np.ndarray`` or an ``sp.SparseMatrix`` and a given set of feature names. -An ``InputArray`` can also be converted to a `pandas.DataFrame`` using a +An ``InputArray`` can also be converted to a ``pandas.DataFrame`` using a ``todataframe()`` method. ``X`` being an ``InputArray``:: @@ -103,7 +103,7 @@ feature names, one can make the right ``InputArray`` using:: >>> make_inputarray(X, feature_names) Alternative Solutions -********************* +##################### Since we expect the feature names to be attached to the data given to an estimator, there are a few potential approaches we can take: @@ -114,7 +114,7 @@ estimator, there are a few potential approaches we can take: is not a feasible solution since ``pandas`` plans to move to a per column representation, which means ``pd.DataFrame(np.asarray(df))`` has two guaranteed memory copies. -- ``XArray``: we could accept a `pandas.DataFrame``, and use +- ``XArray``: we could accept a ``pandas.DataFrame``, and use ``xarray.DataArray`` as the output of transformers, including feature names. However, ``xarray`` has a hard dependency on ``pandas``, and uses ``pandas.Index`` to handle row labels and aligns rows when an operation diff --git a/under_review.rst b/under_review.rst deleted file mode 100644 index 44cfc56..0000000 --- a/under_review.rst +++ /dev/null @@ -1,13 +0,0 @@ -SLEPs under review -================== - -.. No SLEP is currently under review. - -.. Uncomment below when a SLEP is under review - -.. 
toctree:: - :maxdepth: 1 - - slep007/proposal - slep012/proposal - slep013/proposal From 1d57fe08a6dad1fa5d45e70daf0dea6281508e62 Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Thu, 18 Jun 2020 21:24:53 +1000 Subject: [PATCH 051/118] Fix ReST errors in SLEP009 (#41) --- slep009/proposal.rst | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/slep009/proposal.rst b/slep009/proposal.rst index c6f8cb0..248c21f 100644 --- a/slep009/proposal.rst +++ b/slep009/proposal.rst @@ -17,7 +17,7 @@ This proposal discusses the path to gradually forcing users to pass arguments, or most of them, as keyword arguments only. It talks about the status-quo, and the motivation to introduce the change. It shall cover the pros and cons of the change. The original issue starting the discussion is located -`here `_. +`here `__. Motivation ########## @@ -110,8 +110,8 @@ following two definitions may also be confusing to some users: However, some other teams are already moving towards using the syntax, such as ``matplotlib`` which has introduced the syntax with a deprecation cycle using a decorator for this purpose in version 3.1. The related PRs can be found `here -`_ and `here -`_. Soon users will be +`__ and `here +`__. Soon users will be familiar with the syntax. IDE Support @@ -151,7 +151,7 @@ An important open question is which functions/methods and/or parameters should follow this pattern, and which parameters should be keyword only. We can identify the following categories of functions/methods: -- ``__init__``s +- ``__init__`` - Main methods of the API, *i.e.* ``fit``, ``transform``, etc. - All other methods, *e.g.* ``SpectralBiclustering.get_submatrix`` - Functions @@ -168,8 +168,8 @@ defined as either of the following two ways: the *easy* cases. - A set identified as being in the top 95% of the use cases, using some automated analysis such as `this one - `_ or `this one - `_. + `__ or `this one + `__. This way we would minimize the number of warnings the users would receive, which minimizes the friction cause by the change. This SLEP does not define From 677293140f8b6961ca274d0dbc6bc99e34934b50 Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Mon, 29 Jun 2020 16:20:04 +1000 Subject: [PATCH 052/118] SLEP006 on Sample Properties (#16) * Starting to draft SLEP006 on Sample Properties * iter * WIP * WIP * a fourth solution and a little more fleshing... still no code examples. 
* Code examples using Solution 4 * A couple of cross-references * WIP * Filling out example code * Note handling of misspelled keys * Note the status quo hacks * new code * Small additions including section on nomenclature * Some more thoughts on backwards compatibility * Note on potential for mixed keys --- conf.py | 5 + index.rst | 1 + requirements.txt | 1 + slep006/cases_opt0a.py | 6 + slep006/cases_opt0b.py | 7 + slep006/cases_opt1.py | 68 +++++++ slep006/cases_opt2.py | 70 +++++++ slep006/cases_opt3.py | 99 +++++++++ slep006/cases_opt4.py | 78 ++++++++ slep006/defs.py | 14 ++ slep006/proposal.rst | 445 +++++++++++++++++++++++++++++++++++++++++ 11 files changed, 794 insertions(+) create mode 100644 slep006/cases_opt0a.py create mode 100644 slep006/cases_opt0b.py create mode 100644 slep006/cases_opt1.py create mode 100644 slep006/cases_opt2.py create mode 100644 slep006/cases_opt3.py create mode 100644 slep006/cases_opt4.py create mode 100644 slep006/defs.py create mode 100644 slep006/proposal.rst diff --git a/conf.py b/conf.py index 3a1548d..bdeceb1 100644 --- a/conf.py +++ b/conf.py @@ -42,6 +42,7 @@ 'sphinx.ext.intersphinx', 'sphinx.ext.mathjax', 'sphinx.ext.viewcode', + 'sphinx_issues', ] # Add any paths that contain templates here, relative to this directory. @@ -165,3 +166,7 @@ # -- Options for intersphinx extension --------------------------------------- intersphinx_mapping = {'sklearn': ('http://scikit-learn.org/stable', None)} + +# -- Sphinx-Issues configuration -- + +issues_github_path = "scikit-learn/scikit-learn" diff --git a/index.rst b/index.rst index a68713e..e5f5718 100644 --- a/index.rst +++ b/index.rst @@ -29,6 +29,7 @@ slep002/proposal slep003/proposal slep004/proposal + slep006/proposal .. toctree:: :maxdepth: 1 diff --git a/requirements.txt b/requirements.txt index cbf1e36..5666abb 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,2 +1,3 @@ sphinx sphinx-rtd-theme +sphinx-issues diff --git a/slep006/cases_opt0a.py b/slep006/cases_opt0a.py new file mode 100644 index 0000000..d0141fa --- /dev/null +++ b/slep006/cases_opt0a.py @@ -0,0 +1,6 @@ +from defs import (accuracy, group_cv, make_scorer, SelectKBest, + LogisticRegressionCV, cross_validate, + make_pipeline, X, y, my_groups, my_weights, + my_other_weights) + +# TODO diff --git a/slep006/cases_opt0b.py b/slep006/cases_opt0b.py new file mode 100644 index 0000000..f543e9b --- /dev/null +++ b/slep006/cases_opt0b.py @@ -0,0 +1,7 @@ +import pandas as pd +from defs import (accuracy, group_cv, make_scorer, SelectKBest, + LogisticRegressionCV, cross_validate, + make_pipeline, X, y, my_groups, my_weights, + my_other_weights) + +# TODO diff --git a/slep006/cases_opt1.py b/slep006/cases_opt1.py new file mode 100644 index 0000000..a8185d3 --- /dev/null +++ b/slep006/cases_opt1.py @@ -0,0 +1,68 @@ +from defs import (accuracy, group_cv, make_scorer, SelectKBest, + LogisticRegressionCV, cross_validate, make_pipeline, X, y, + my_groups, my_weights, my_other_weights) + +# %% +# Case A: weighted scoring and fitting + +lr = LogisticRegressionCV( + cv=group_cv, + scoring='accuracy', +) +cross_validate(lr, X, y, cv=group_cv, + props={'sample_weight': my_weights, 'groups': my_groups}, + scoring='accuracy') + +# Error handling: if props={'sample_eight': my_weights, ...} was passed +# instead, the estimator would fit and score without weight, silently failing. 
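# Sketch only (this class is not part of the proposal's example files): under
# the "pass everything" option a consumer uses the props it knows by name and
# silently disregards every other key, which is why the misspelling above
# fails silently rather than raising.


class SketchWeightConsumer:
    def fit(self, X, y, props=None):
        props = props or {}
        # Only 'sample_weight' is meaningful to this consumer; any other key,
        # including a misspelled 'sample_eight', is simply ignored.
        self.sample_weight_ = props.get('sample_weight')
        return self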
+ +# %% +# Case B: weighted scoring and unweighted fitting + + +class MyLogisticRegressionCV(LogisticRegressionCV): + def fit(self, X, y, props=None): + props = props.copy() + props.pop('sample_weight', None) + super().fit(X, y, props=props) + + +# %% +# Case C: unweighted feature selection + +# Currently feature selection does not handle sample_weight, and as long as +# that remains the case, it will simply ignore the prop passed to it. Hence: + +lr = LogisticRegressionCV( + cv=group_cv, + scoring='accuracy', +) +sel = SelectKBest() +pipe = make_pipeline(sel, lr) +cross_validate(pipe, X, y, cv=group_cv, + props={'sample_weight': my_weights, 'groups': my_groups}, + scoring='accuracy') + +# %% +# Case D: different scoring and fitting weights + +weighted_acc = make_scorer(accuracy) + + +def specially_weighted_acc(est, X, y, props): + props = props.copy() + props['sample_weight'] = 'scoring_weight' + return weighted_acc(est, X, y, props) + + +lr = LogisticRegressionCV( + cv=group_cv, + scoring=specially_weighted_acc, +) +cross_validate(lr, X, y, cv=group_cv, + props={ + 'scoring_weight': my_weights, + 'sample_weight': my_other_weights, + 'groups': my_groups, + }, + scoring=specially_weighted_acc) diff --git a/slep006/cases_opt2.py b/slep006/cases_opt2.py new file mode 100644 index 0000000..4148e66 --- /dev/null +++ b/slep006/cases_opt2.py @@ -0,0 +1,70 @@ +from defs import (group_cv, SelectKBest, LogisticRegressionCV, + cross_validate, make_pipeline, X, y, my_groups, + my_weights, my_other_weights) + +# %% +# Case A: weighted scoring and fitting + +lr = LogisticRegressionCV( + cv=group_cv, + scoring='accuracy', +) +props = {'cv__groups': my_groups, + 'estimator__cv__groups': my_groups, + 'estimator__sample_weight': my_weights, + 'scoring__sample_weight': my_weights, + 'estimator__scoring__sample_weight': my_weights} +cross_validate(lr, X, y, cv=group_cv, + props=props, + scoring='accuracy') + +# error handling: if props={'estimator__sample_eight': my_weights, ...} was +# passed instead, the estimator would raise an error. 
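# Sketch only (assumed helper, not part of the proposal's example files): a
# router under this option forwards to each named child only the props that
# carry the child's prefix, stripping one level of prefix before the call.
# Deeper dunders are left for the child to route further, e.g.
# 'estimator__scoring__sample_weight' -> 'scoring__sample_weight'.


def sketch_route_props(props, child_name):
    prefix = child_name + '__'
    return {key[len(prefix):]: value
            for key, value in props.items()
            if key.startswith(prefix)}

# With the Case A props above, sketch_route_props(props, 'estimator') yields
# {'cv__groups': my_groups, 'sample_weight': my_weights,
#  'scoring__sample_weight': my_weights}.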
+ +# %% +# Case B: weighted scoring and unweighted fitting + +lr = LogisticRegressionCV( + cv=group_cv, + scoring='accuracy', +) +props = {'cv__groups': my_groups, + 'estimator__cv__groups': my_groups, + 'scoring__sample_weight': my_weights, + 'estimator__scoring__sample_weight': my_weights} +cross_validate(lr, X, y, cv=group_cv, + props=props, + scoring='accuracy') + +# %% +# Case C: unweighted feature selection + +lr = LogisticRegressionCV( + cv=group_cv, + scoring='accuracy', +) +pipe = make_pipeline(SelectKBest(), lr) +props = {'cv__groups': my_groups, + 'estimator__logisticregressioncv__cv__groups': my_groups, + 'estimator__logisticregressioncv__sample_weight': my_weights, + 'scoring__sample_weight': my_weights, + 'estimator__scoring__sample_weight': my_weights} +cross_validate(pipe, X, y, cv=group_cv, + props=props, + scoring='accuracy') + +# %% +# Case D: different scoring and fitting weights + +lr = LogisticRegressionCV( + cv=group_cv, + scoring='accuracy', +) +props = {'cv__groups': my_groups, + 'estimator__cv__groups': my_groups, + 'estimator__sample_weight': my_other_weights, + 'scoring__sample_weight': my_weights, + 'estimator__scoring__sample_weight': my_weights} +cross_validate(lr, X, y, cv=group_cv, + props=props, + scoring='accuracy') diff --git a/slep006/cases_opt3.py b/slep006/cases_opt3.py new file mode 100644 index 0000000..5b4b450 --- /dev/null +++ b/slep006/cases_opt3.py @@ -0,0 +1,99 @@ +from defs import (accuracy, make_scorer, SelectKBest, LogisticRegressionCV, + group_cv, cross_validate, make_pipeline, X, y, my_groups, + my_weights, my_other_weights) + +# %% +# Case A: weighted scoring and fitting + +lr = LogisticRegressionCV( + cv=group_cv, + scoring='accuracy', + prop_routing={'cv': ['groups'], + 'scoring': ['sample_weight'], + } + # one question here is whether we need to explicitly route sample_weight + # to LogisticRegressionCV's fitting... +) + +# Alternative syntax, which assumes cv receives 'groups' by default, and that a +# method-based API is provided on meta-estimators: +# lr = LogisticRegressionCV( +# cv=group_cv, +# scoring='accuracy', +# ).add_prop_route(scoring='sample_weight') + +cross_validate(lr, X, y, cv=group_cv, + props={'sample_weight': my_weights, 'groups': my_groups}, + scoring='accuracy', + prop_routing={'estimator': '*', # pass all props + 'cv': ['groups'], + 'scoring': ['sample_weight'], + }) + +# Error handling: if props={'sample_eight': my_weights, ...} was passed +# instead, LogisticRegressionCV would have to identify that a key was passed +# that could not be routed nor used, in order to raise an error. + +# %% +# Case B: weighted scoring and unweighted fitting + +# Here we rename the sample_weight prop so that we can specify that it only +# applies to scoring. +lr = LogisticRegressionCV( + cv=group_cv, + scoring='accuracy', + prop_routing={'cv': ['groups'], + # read the following as "scoring should consume + # 'scoring_weight' as if it were 'sample_weight'." 
+ 'scoring': {'sample_weight': 'scoring_weight'}, + }, +) +cross_validate(lr, X, y, cv=group_cv, + props={'scoring_weight': my_weights, 'groups': my_groups}, + scoring='accuracy', + prop_routing={'estimator': '*', + 'cv': ['groups'], + 'scoring': {'sample_weight': 'scoring_weight'}, + }) + +# %% +# Case C: unweighted feature selection + +lr = LogisticRegressionCV( + cv=group_cv, + scoring='accuracy', + prop_routing={'cv': ['groups'], + 'scoring': ['sample_weight'], + }) +pipe = make_pipeline(SelectKBest(), lr, + prop_routing={'logisticregressioncv': ['sample_weight', + 'groups']}) +cross_validate(lr, X, y, cv=group_cv, + props={'sample_weight': my_weights, 'groups': my_groups}, + scoring='accuracy', + prop_routing={'estimator': '*', + 'cv': ['groups'], + 'scoring': ['sample_weight'], + }) + +# %% +# Case D: different scoring and fitting weights +lr = LogisticRegressionCV( + cv=group_cv, + scoring='accuracy', + prop_routing={'cv': ['groups'], + # read the following as "scoring should consume + # 'scoring_weight' as if it were 'sample_weight'." + 'scoring': {'sample_weight': 'scoring_weight'}, + }, +) +cross_validate(lr, X, y, cv=group_cv, + props={'scoring_weight': my_weights, 'groups': my_groups, + 'fitting_weight': my_other_weights}, + scoring='accuracy', + prop_routing={'estimator': {'sample_weight': 'fitting_weight', + 'scoring_weight': 'scoring_weight', + 'groups': 'groups'}, + 'cv': ['groups'], + 'scoring': {'sample_weight': 'scoring_weight'}, + }) diff --git a/slep006/cases_opt4.py b/slep006/cases_opt4.py new file mode 100644 index 0000000..1d1325c --- /dev/null +++ b/slep006/cases_opt4.py @@ -0,0 +1,78 @@ +from defs import (accuracy, group_cv, make_scorer, SelectKBest, + LogisticRegressionCV, cross_validate, + make_pipeline, X, y, my_groups, my_weights, + my_other_weights) + +# %% +# Case A: weighted scoring and fitting + +# Here we presume that GroupKFold requests `groups` by default. +# We need to explicitly request weights in make_scorer and for +# LogisticRegressionCV. Both of these consumers understand the meaning +# of the key "sample_weight". + +weighted_acc = make_scorer(accuracy, request_props=['sample_weight']) +lr = LogisticRegressionCV( + cv=group_cv, + scoring=weighted_acc, +).set_props_request(['sample_weight']) +cross_validate(lr, X, y, cv=group_cv, + props={'sample_weight': my_weights, 'groups': my_groups}, + scoring=weighted_acc) + +# Error handling: if props={'sample_eight': my_weights, ...} was passed, +# cross_validate would raise an error, since 'sample_eight' was not requested +# by any of its children. + +# %% +# Case B: weighted scoring and unweighted fitting + +# Since LogisticRegressionCV requires that weights explicitly be requested, +# removing that request means the fitting is unweighted. + +weighted_acc = make_scorer(accuracy, request_props=['sample_weight']) +lr = LogisticRegressionCV( + cv=group_cv, + scoring=weighted_acc, +) +cross_validate(lr, X, y, cv=group_cv, + props={'sample_weight': my_weights, 'groups': my_groups}, + scoring=weighted_acc) + +# %% +# Case C: unweighted feature selection + +# Like LogisticRegressionCV, SelectKBest needs to request weights explicitly. +# Here it does not request them. 
+ +weighted_acc = make_scorer(accuracy, request_props=['sample_weight']) +lr = LogisticRegressionCV( + cv=group_cv, + scoring=weighted_acc, +).set_props_request(['sample_weight']) +sel = SelectKBest() +pipe = make_pipeline(sel, lr) +cross_validate(pipe, X, y, cv=group_cv, + props={'sample_weight': my_weights, 'groups': my_groups}, + scoring=weighted_acc) + +# %% +# Case D: different scoring and fitting weights + +# Despite make_scorer and LogisticRegressionCV both expecting a key +# sample_weight, we can use aliases to pass different weights to different +# consumers. + +weighted_acc = make_scorer(accuracy, + request_props={'scoring_weight': 'sample_weight'}) +lr = LogisticRegressionCV( + cv=group_cv, + scoring=weighted_acc, +).set_props_request({'fitting_weight': "sample_weight"}) +cross_validate(lr, X, y, cv=group_cv, + props={ + 'scoring_weight': my_weights, + 'fitting_weight': my_other_weights, + 'groups': my_groups, + }, + scoring=weighted_acc) diff --git a/slep006/defs.py b/slep006/defs.py new file mode 100644 index 0000000..2026c8e --- /dev/null +++ b/slep006/defs.py @@ -0,0 +1,14 @@ +import numpy as np +from sklearn.feature_selection import SelectKBest +from sklearn.linear_model import LogisticRegressionCV +from sklearn.metrics import accuracy +from sklearn.metrics import make_scorer +from sklearn.model_selection import GroupKFold, cross_validate +from sklearn.pipeline import make_pipeline + +N, M = 100, 4 +X = np.random.rand(N, M) +y = np.random.randint(0, 1, size=N) +my_groups = np.random.randint(0, 10, size=N) +my_weights = np.random.rand(N) +my_other_weights = np.random.rand(N) diff --git a/slep006/proposal.rst b/slep006/proposal.rst new file mode 100644 index 0000000..c336e4a --- /dev/null +++ b/slep006/proposal.rst @@ -0,0 +1,445 @@ +.. _slep_006: + +================================ +Routing sample-aligned meta-data +================================ + +:Author: Joel Nothman +:Status: Draft +:Type: Standards Track +:Created: 2019-03-07 + + +Scikit-learn has limited support for information pertaining to each sample +(henceforth "sample properties") to be passed through an estimation pipeline. +The user can, for instance, pass fit parameters to all members of a +FeatureUnion, or to a specified member of a Pipeline using dunder (``__``) +prefixing:: + + >>> from sklearn.pipeline import Pipeline + >>> from sklearn.linear_model import LogisticRegression + >>> pipe = Pipeline([('clf', LogisticRegression())]) + >>> pipe.fit([[1, 2], [3, 4]], [5, 6], + ... clf__sample_weight=[.5, .7]) # doctest: +SKIP + +Several other meta-estimators, such as GridSearchCV, support forwarding these +fit parameters to their base estimator when fitting. + +Desirable features we do not currently support include: + +* passing sample properties (e.g. `sample_weight`) to a scorer used in + cross-validation +* passing sample properties (e.g. `groups`) to a CV splitter in nested cross + validation +* (maybe in scope) passing sample properties (e.g. `sample_weight`) to some + scorers and not others in a multi-metric cross-validation setup +* (likley out of scope) passing sample properties to non-fit methods, for + instance to index grouped samples that are to be treated as a single sequence + in prediction. + +Definitions +----------- + +consumer + An estimator, scorer, splitter, etc., that receives and can make use of + one or more passed props. +key + A label passed along with sample prop data to indicate how it should be + interpreted (e.g. "weight"). 
+router + An estimator or function that passes props on to some other router or + consumer, potentially selecting which props to pass to which destination, + and by what key. + +History +------- + +This version was drafted after a discussion of the issue and potential +solutions at the February 2019 development sprint in Paris. + +Supersedes `SLEP004 +`_ +with greater depth of desiderata and options. + +Primary related issues and pull requests include: + +- :issue:`4497`: Overarching issue, + "Consistent API for attaching properties to samples" + by :user:`GaelVaroquaux` +- :pr:`4696` A first implementation by :user:`amueller` +- `Discussion towards SLEP004 + `__ initiated + by :user:`tguillemot` +- :pr:`9566` Another implementation (solution 3 from this SLEP) + by :user:`jnothman` +- :pr:`16079` Another implementation (solution 4 from this SLEP) + by :user:`adrinjalali` + +Other related issues include: :issue:`1574`, :issue:`2630`, :issue:`3524`, +:issue:`4632`, :issue:`4652`, :issue:`4660`, :issue:`4696`, :issue:`6322`, +:issue:`7112`, :issue:`7646`, :issue:`7723`, :issue:`8127`, :issue:`8158`, +:issue:`8710`, :issue:`8950`, :issue:`11429`, :issue:`12052`, :issue:`15282`, +:issues:`15370`, :issue:`15425`. + +Desiderata +---------- + +We will consider the following aspects to develop and compare solutions: + +Usability + Can the use cases be achieved in succinct, readable code? Can common use + cases be achieved with a simple recipe copy-pasted from a QA forum? +Brittleness + If a property is being routed through a Pipeline, does changing the + structure of the pipeline (e.g. adding a layer of nesting) require rewriting + other code? +Error handling + If the user mistypes the name of a sample property, or misspecifies how it + should be routed to a consumer, will an appropriate exception be raised? +Impact on meta-estimator design + How much meta-estimator code needs to change? How hard will it be to + maintain? +Impact on estimator design + How much will the proposal affect estimator developers? +Backwards compatibility + Can existing behavior be maintained? +Forwards compatibility + Is the solution going to make users' code more + brittle with future changes? (For example, will a user's pipeline change + behaviour radically when sample_weight is implemented on some estimator) +Introspection + If sensible to do so (e.g. for improved efficiency), can a + meta-estimator identify whether its base estimator (recursively) would + handle some particular sample property (e.g. so a meta-estimator can choose + between weighting and resampling, or for automated invariance testing)? + +Keyword arguments vs. a single argument +--------------------------------------- + +Currently, sample properties are provided as keyword arguments to a `fit` +method. In redeveloping sample properties, we can instead accept a single +parameter (named `props` or `sample_props` or `etc`, for example) which maps +string keys to arrays of the same length (a "DataFrame-like"). + +Keyword arguments:: + + >>> gs.fit(X, y, groups=groups, sample_weight=sample_weight) + +Single argument:: + + >>> gs.fit(X, y, prop={'groups': groups, 'weight': weight}) + +While drafting this document, we will assume the latter notation for clarity. + +Advantages of multiple keyword arguments: + +* succinct +* possible to maintain backwards compatible support for sample_weight, etc. +* we do not need to handle cases for whether or not some estimator expects a + `props` argument. 
+ +Advantages of a single argument: + +* we are able to consider kwargs to `fit` that are not sample-aligned, so that + we can add further functionality (some that have been proposed: + `with_warm_start`, `feature_names_in`, `feature_meta`). +* we are able to redefine the default routing of weights etc. without being + concerned by backwards compatibility. +* we can consider the use of keys that are not limited to strings or valid + identifiers (and hence are not limited to using ``_`` as a delimiter). + +Test case setup +--------------- + +Case A +~~~~~~ + +Cross-validate a ``LogisticRegressionCV(cv=GroupKFold(), scoring='accuracy')`` +with weighted scoring and weighted fitting. + +Error handling: what would happen if the user misspelled `sample_weight` as +`sample_eight`? + +Case B +~~~~~~ + +Cross-validate a ``LogisticRegressionCV(cv=GroupKFold(), scoring='accuracy')`` +with weighted scoring and unweighted fitting. + +Case C +~~~~~~ + +Extend Case A to apply an unweighted univariate feature selector in a +``Pipeline``. + +Case D +~~~~~~ + +Different weights for scoring and for fitting in Case A. + +TODO: case involving props passed at test time, e.g. to pipe.transform (???). +TODO: case involving score() method, e.g. not specifying scoring in +cross_val_score when wrapping an estimator with weighted score func ... + +Solution sketches will import these definitions: + +.. literalinclude:: defs.py + +Status quo solution 0a: additional feature +------------------------------------------ + +Without changing scikit-learn, the following hack can be used: + +Additional numeric features representing sample props can be appended to the +data and passed around, being handled specially in each consumer of features +or sample props. + +.. literalinclude:: cases_opt0a.py + +Status quo solution 0b: Pandas Index and global resources +--------------------------------------------------------- + +Without changing scikit-learn, the following hack can be used: + +If `y` is represented with a Pandas datatype, then its index can be used to +access required elements from props stored in a global namespace (or otherwise +made available to the estimator before fitting). This is possible everywhere +that a gold-standard `y` is passed, including fit, split and score. A similar +solution with `X` is also possible for handling predict-time props, if all +Pipeline components retain the original Pandas Index. + +Issues: + +* use of global data source +* requires Pandas data types and indices to be maintained + +.. literalinclude:: cases_opt0b.py + +Solution 1: Pass everything +--------------------------- + +This proposal passes all props to all consumers (estimators, splitters, +scorers, etc). The consumer would optionally use props it is familiar with by +name and disregard other props. + +We may consider providing syntax for the user to control the interpretation of +incoming props: + +* to require that some prop is provided (for an estimator where that prop is + otherwise optional) +* to disregard some provided prop +* to treat a particular prop key as having a certain meaning (e.g. locally + interpreting 'scoring_sample_weight' as 'sample_weight'). + +These constraints would be checked by calling a helper at the consumer. + +Issues: + +* Error handling: if a key is optional in a consumer, no error will be + raised for misspelling. An introspection API might change this, allowing a + user or meta-estimator to check if all keys passed are to be used in at least + one consumer. 
+* Forwards compatibility: newly supporting a prop key in a consumer will change + behaviour. Other than a ChangedBehaviorWarning, I don't see any way around + this. +* Introspection: not inherently supported. Would need an API like + ``get_prop_support(names: List[str]) -> Dict[str, Literal["supported", "required", "ignored"]]``. + +In short, this is a simple solution, but prone to risk. + +.. literalinclude:: cases_opt1.py + + +Solution 2: Specify routes at call +---------------------------------- + +Similar to the legacy behavior of fit parameters in +:class:`sklearn.pipeline.Pipeline`, this requires the user to specify the +path for each "prop" to follow when calling `fit`. For example, to pass +a prop named 'weights' to a step named 'spam' in a Pipeline, you might use +`my_pipe.fit(X, y, props={'spam__weights': my_weights})`. + +SLEP004's syntax to override the common routing scheme falls under this +solution. + +Advantages: + +* Very explicit and robust to misspellings. + +Issues: + +* The user needs to know the deep internal structure, or it is easy to fail to + pass a prop to a specific estimator. +* A corollary is that prop keys need changing when the developer modifies their + estimator structure (see case C). +* This gets especially tricky or impossible where the available routes + change mid-fit, such as where a grid search considers estimators with + different structures. +* We would need to find a different solution for :issue:`2630` where a Pipeline + could not be the base estimator of AdaBoost because AdaBoost expects the base + estimator to accept a fit param keyed 'sample_weight'. +* This may not work if a meta-estimator were to have the role of changing a + prop, e.g. a meta-estimator that passes `sample_weight` corresponding to + balanced classes onto its base estimator. The meta-estimator would need a + list of destinations to pass modified props to, or a list of keys to modify. +* We would need to develop naming conventions for different routes, which may + be more complicated than the current conventions; while a GridSearchCV + wrapping a Pipeline currently takes parameters with keys like + `{step_name}__{prop_name}`, this explicit routing, and conflict with + GridSearchCV routing destinations, implies keys like + `estimator__{step_name}__{prop_name}`. + +.. literalinclude:: cases_opt2.py + + +Solution 3: Specify routes on metaestimators +-------------------------------------------- + +Each meta-estimator is given a routing specification which it must follow in +passing only the required parameters to each of its children. In this context, +a GridSearchCV has children including `estimator`, `cv` and (each element of) +`scoring`. + +Pull request :pr:`9566` and its extension in :pr:`15425` are partial +implementations of this approach. + +A major benefit of this approach is that it may allow only prop routing +meta-estimators to be modified, not prop consumers. + +All consumers would be required to check that + +Issues: + +* Routing may be hard to get one's head around, especially since the prop + support belongs to the child estimator but the parent is responsible for the + routing. +* Need to design an API for specifying routings. +* As in Solution 2, each local destination for routing props needs to be given + a name. +* Every router along the route will need consistent instructions to pass a + specific prop to a consumer. If the prop is optional in the consumer, routing + failures may be hard to identify and debug. 
+* For estimators to be cloned, this routing information needs to be cloned with + it. This implies one of: the routing information be stored as a constructor + paramerter; or `clone` is extended to explicitly copy routing information. + +Possible public syntax: + +Each meta-estimator has a `prop_routing` parameter to encode local routing +rules, and a set of named children which it routes to. In :pr:`9566`, the +`prop_routing` entry for each child may be a white list or black list of +named keys passed to the meta-estimator. + +.. literalinclude:: cases_opt3.py + + +Solution 4: Each child requests +------------------------------- + +Here the meta-estimator provides only what its each of its children requests. +The meta-estimator would also need to request, on behalf of its children, +any prop that descendant consumers require. + +Each object in a situation that could receive props would have a method like +`_get_prop_requests()` which would return a list of prop names (or perhaps a +mapping for more sophisticated use-cases). Group* CV splitters would default to +returning `['groups']`, for example. Estimators supporting weighted fitting +may return `[]` by default, but may have a parameter `request_props` which +may be set to `['weight']` if weight is sought, or perhaps just boolean +parameter `request_weight`. `make_scorer` would have a similar mechanism for +enabling weighted scoring. + +Advantages: + +* This will not need to affect legacy estimators, since no props will be + passed when a props request is not available. +* This does not require defining a new syntax for routing. +* The implementation changes in meta-estimators may be easy to provide via a + helper or two (perhaps even `call_with_props(method, target, props)`). +* Easy to reconfigure what props an estimator gets in a grid search. +* Could make use of existing `**fit_params` syntax rather than introducing new + `props` argument to `fit`. + +Disadvantages: + +* This will require modifying every estimator that may want props, as well as + all meta-estimators. We could provide a mixin or similar to add prop-request + support to a legacy estimator; or `BaseEstimator` could have a + `set_props_request` method (instead of the `request_props` constructor + parameter approach) such that all legacy base estimators are + automatically equipped. +* For estimators to be cloned, this request information needs to be cloned with + it. This implies one of: the request information be stored as a constructor + paramerter; or `clone` is extended to explicitly copy request information. + +Possible public syntax: + +* `BaseEstimator` will have methods `set_props_request` and `get_props_request` +* `make_scorer` will have a `request_props` parameter to set props required by + the scorer. +* `get_props_request` will return a dict. It maps the key that the user + passes to the key that the estimator expects. +* `set_props_request` will accept either such a dict or a sequence `s` to be + interpreted as the identity mapping for all elements in `s` + (`{x: x for x in s}`). It will return `self` to enable chaining. +* `Group*` CV splitters will by default request the 'groups' prop, but its + mapping can be changed with their `set_props_request` method. + +Test cases: + +.. literalinclude:: cases_opt4.py + +Naming +------ + +"Sample props" has become a name understood internally to the Scikit-learn +development team. 
For ongoing usage we have several choices for naming: + +* Sample meta +* Sample properties +* Sample props +* Sample extra + +Proposal +-------- + +Having considered the above solutions, we propose: + +TODO + +* which solution? +* if an estimator requests a prop, must it be not-null? Must it be provided or explicitly passed as None? +* props param or kwargs? +* naming? + +Backward compatibility +---------------------- + +TODO + +TODO: Do we continue to handle sample_weight such that it only gets provided of requested explicitly? Or do we make it requested by default in the future (possibly with a deprecation period)? + +During a deprecation period, fit_params will be handled dually: Keys that are requested will be passed through the new request mechanism, while keys that are not known will be routed using legacy mechanisms. At completion of the deprecation period, the legacy handling will cease. + +Grouped cross validation splitters will request `groups` since they were previously unusable in a nested cross validation context, so this should not often create backwards incompatibilities, except perhaps where a fit param named `groups` served another purpose. + +Discussion +---------- + +One benefit of the explicitness in Solution 4 is that even if it makes use of **kw arguments, it does not preclude keywords arguments serving other purposes in addition. That is, in addition to requesting sample props, a future proposal could allow estimators to request feature metadata or other keys. + +TODO + +References and Footnotes +------------------------ + +.. [1] Each SLEP must either be explicitly labeled as placed in the public + domain (see this SLEP as an example) or licensed under the `Open + Publication License`_. +.. _Open Publication License: https://www.opencontent.org/openpub/ + + +Copyright +--------- + +This document has been placed in the public domain. [1]_ From 745b0166b39a59eee02cba69f0306e91a10c16e7 Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Sun, 23 Aug 2020 21:48:38 +1000 Subject: [PATCH 053/118] Prefix the title by the SLEP code in all cases --- slep001/proposal.rst | 6 +++--- slep002/proposal.rst | 6 +++--- slep003/proposal.rst | 6 +++--- slep004/proposal.rst | 6 +++--- slep006/proposal.rst | 6 +++--- 5 files changed, 15 insertions(+), 15 deletions(-) diff --git a/slep001/proposal.rst b/slep001/proposal.rst index f365284..a335b5f 100644 --- a/slep001/proposal.rst +++ b/slep001/proposal.rst @@ -1,8 +1,8 @@ .. _slep_001: -===================================== -Transformers that modify their target -===================================== +============================================== +SLEP001: Transformers that modify their target +============================================== .. topic:: **Summary** diff --git a/slep002/proposal.rst b/slep002/proposal.rst index 232e382..e2f5901 100644 --- a/slep002/proposal.rst +++ b/slep002/proposal.rst @@ -1,8 +1,8 @@ .. _slep_002: -================= -Dynamic pipelines -================= +========================== +SLEP002: Dynamic pipelines +========================== .. topic:: **Summary** diff --git a/slep003/proposal.rst b/slep003/proposal.rst index 511fcd2..589a168 100644 --- a/slep003/proposal.rst +++ b/slep003/proposal.rst @@ -1,8 +1,8 @@ .. _slep_003: -====================================== -Consistent inspection for transformers -====================================== +=============================================== +SLEP003: Consistent inspection for transformers +=============================================== . 
topic:: **Summary** diff --git a/slep004/proposal.rst b/slep004/proposal.rst index 558cb90..a9992eb 100644 --- a/slep004/proposal.rst +++ b/slep004/proposal.rst @@ -1,8 +1,8 @@ .. _slep_004: -================ -Data information -================ +========================= +SLEP004: Data information +========================= This is a specification to introduce data information (as ``sample_weights``) during the computation of an estimator methods diff --git a/slep006/proposal.rst b/slep006/proposal.rst index c336e4a..1f91a3f 100644 --- a/slep006/proposal.rst +++ b/slep006/proposal.rst @@ -1,8 +1,8 @@ .. _slep_006: -================================ -Routing sample-aligned meta-data -================================ +========================================== +SLEP005: Routing sample-aligned meta-data +========================================== :Author: Joel Nothman :Status: Draft From af2bd64ca1f39cbde1fc254c407ad7a19a9d183b Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Sun, 30 Aug 2020 22:12:20 +1000 Subject: [PATCH 054/118] Fix SLEP number in SLEP006 --- slep006/proposal.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/slep006/proposal.rst b/slep006/proposal.rst index 1f91a3f..3477e65 100644 --- a/slep006/proposal.rst +++ b/slep006/proposal.rst @@ -1,7 +1,7 @@ .. _slep_006: ========================================== -SLEP005: Routing sample-aligned meta-data +SLEP006: Routing sample-aligned meta-data ========================================== :Author: Joel Nothman From 4f17c3a386c3563918f9d54e45c9618666bc33be Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Mon, 31 Aug 2020 18:59:30 +1000 Subject: [PATCH 055/118] Towards completing sample props SLEP006 (#43) * Examples for case 0a * Add variant for solution 4 * Examples for case 0b * Complete some TODOs --- slep006/cases_opt0a.py | 96 +++++++++++++++++++++++++++++++++++++++++- slep006/cases_opt0b.py | 88 +++++++++++++++++++++++++++++++++++++- slep006/cases_opt4b.py | 78 ++++++++++++++++++++++++++++++++++ slep006/proposal.rst | 69 ++++++++++++++++++++++++------ 4 files changed, 314 insertions(+), 17 deletions(-) create mode 100644 slep006/cases_opt4b.py diff --git a/slep006/cases_opt0a.py b/slep006/cases_opt0a.py index d0141fa..e94c89e 100644 --- a/slep006/cases_opt0a.py +++ b/slep006/cases_opt0a.py @@ -1,6 +1,98 @@ -from defs import (accuracy, group_cv, make_scorer, SelectKBest, +import numpy as np + +from defs import (accuracy, group_cv, get_scorer, SelectKBest, LogisticRegressionCV, cross_validate, make_pipeline, X, y, my_groups, my_weights, my_other_weights) -# TODO +# %% +# Case A: weighted scoring and fitting + + +GROUPS_IDX = -1 +WEIGHT_IDX = -2 + + +def unwrap_X(X): + return X[:, -2:] + + +class WrappedGroupCV: + def __init__(self, base_cv, groups_idx=GROUPS_IDX): + self.base_cv = base_cv + self.groups_idx = groups_idx + + def split(self, X, y, groups=None): + groups = X[:, self.groups_idx] + return self.base_cv.split(unwrap_X(X), y, groups=groups) + + def get_n_splits(self, X, y, groups=None): + groups = X[:, self.groups_idx] + return self.base_cv.get_n_splits(unwrap_X(X), y, groups=groups) + + +wrapped_group_cv = WrappedGroupCV(group_cv) + + +class WrappedLogisticRegressionCV(LogisticRegressionCV): + def fit(self, X, y): + return super().fit(unwrap_X(X), y, sample_weight=X[:, WEIGHT_IDX]) + + +acc_scorer = get_scorer('accuracy') + + +def wrapped_weighted_acc(est, X, y, sample_weight=None): + return acc_scorer(est, unwrap_X(X), y, sample_weight=X[:, WEIGHT_IDX]) + + +lr = WrappedLogisticRegressionCV( + 
cv=wrapped_group_cv, + scoring=wrapped_weighted_acc, +).set_props_request(['sample_weight']) +cross_validate(lr, np.hstack([X, my_weights, my_groups]), y, + cv=wrapped_group_cv, + scoring=wrapped_weighted_acc) + +# %% +# Case B: weighted scoring and unweighted fitting + +class UnweightedWrappedLogisticRegressionCV(LogisticRegressionCV): + def fit(self, X, y): + return super().fit(unwrap_X(X), y) + + +lr = UnweightedWrappedLogisticRegressionCV( + cv=wrapped_group_cv, + scoring=wrapped_weighted_acc, +).set_props_request(['sample_weight']) +cross_validate(lr, np.hstack([X, my_weights, my_groups]), y, + cv=wrapped_group_cv, + scoring=wrapped_weighted_acc) + + +# %% +# Case C: unweighted feature selection + +class UnweightedWrappedSelectKBest(SelectKBest): + def fit(self, X, y): + return super().fit(unwrap_X(X), y) + + +lr = WrappedLogisticRegressionCV( + cv=wrapped_group_cv, + scoring=wrapped_weighted_acc, +).set_props_request(['sample_weight']) +sel = UnweightedWrappedSelectKBest() +pipe = make_pipeline(sel, lr) +cross_validate(pipe, np.hstack([X, my_weights, my_groups]), y, + cv=wrapped_group_cv, + scoring=wrapped_weighted_acc) + +# %% +# Case D: different scoring and fitting weights + +SCORING_WEIGHT_IDX = -3 + +# TODO: proceed from here. Note that this change implies the need to add +# a parameter to unwrap_X, since we will now append an additional column to X. diff --git a/slep006/cases_opt0b.py b/slep006/cases_opt0b.py index f543e9b..6af76cd 100644 --- a/slep006/cases_opt0b.py +++ b/slep006/cases_opt0b.py @@ -1,7 +1,91 @@ import pandas as pd -from defs import (accuracy, group_cv, make_scorer, SelectKBest, +from defs import (accuracy, group_cv, get_scorer, SelectKBest, LogisticRegressionCV, cross_validate, make_pipeline, X, y, my_groups, my_weights, my_other_weights) -# TODO +X = pd.DataFrame(X) +MY_GROUPS = pd.Series(my_groups) +MY_WEIGHTS = pd.Series(my_weights) +MY_OTHER_WEIGHTS = pd.Series(my_other_weights) + +# %% +# Case A: weighted scoring and fitting + + +class WrappedGroupCV: + def __init__(self, base_cv): + self.base_cv = base_cv + + def split(self, X, y, groups=None): + return self.base_cv.split(X, y, groups=MY_GROUPS.loc[X.index]) + + def get_n_splits(self, X, y, groups=None): + return self.base_cv.get_n_splits(X, y, groups=MY_GROUPS.loc[X.index]) + + +wrapped_group_cv = WrappedGroupCV(group_cv) + + +class WeightedLogisticRegressionCV(LogisticRegressionCV): + def fit(self, X, y): + return super().fit(X, y, sample_weight=MY_WEIGHTS.loc[X.index]) + + +acc_scorer = get_scorer('accuracy') + + +def wrapped_weighted_acc(est, X, y, sample_weight=None): + return acc_scorer(est, X, y, sample_weight=MY_WEIGHTS.loc[X.index]) + + +lr = WeightedLogisticRegressionCV( + cv=wrapped_group_cv, + scoring=wrapped_weighted_acc, +).set_props_request(['sample_weight']) +cross_validate(lr, X, y, + cv=wrapped_group_cv, + scoring=wrapped_weighted_acc) + +# %% +# Case B: weighted scoring and unweighted fitting + +lr = LogisticRegressionCV( + cv=wrapped_group_cv, + scoring=wrapped_weighted_acc, +).set_props_request(['sample_weight']) +cross_validate(lr, X, y, + cv=wrapped_group_cv, + scoring=wrapped_weighted_acc) + + +# %% +# Case C: unweighted feature selection + +lr = WeightedLogisticRegressionCV( + cv=wrapped_group_cv, + scoring=wrapped_weighted_acc, +).set_props_request(['sample_weight']) +sel = SelectKBest() +pipe = make_pipeline(sel, lr) +cross_validate(pipe, X, y, + cv=wrapped_group_cv, + scoring=wrapped_weighted_acc) + +# %% +# Case D: different scoring and fitting weights + + +def 
other_weighted_acc(est, X, y, sample_weight=None): + return acc_scorer(est, X, y, sample_weight=MY_OTHER_WEIGHTS.loc[X.index]) + + +lr = WeightedLogisticRegressionCV( + cv=wrapped_group_cv, + scoring=other_weighted_acc, +).set_props_request(['sample_weight']) +sel = SelectKBest() +pipe = make_pipeline(sel, lr) +cross_validate(pipe, X, y, + cv=wrapped_group_cv, + scoring=other_weighted_acc) diff --git a/slep006/cases_opt4b.py b/slep006/cases_opt4b.py new file mode 100644 index 0000000..58aeaa4 --- /dev/null +++ b/slep006/cases_opt4b.py @@ -0,0 +1,78 @@ +from defs import (accuracy, group_cv, make_scorer, SelectKBest, + LogisticRegressionCV, cross_validate, + make_pipeline, X, y, my_groups, my_weights, + my_other_weights) + +# %% +# Case A: weighted scoring and fitting + +# Here we presume that GroupKFold requests `groups` by default. +# We need to explicitly request weights in make_scorer and for +# LogisticRegressionCV. Both of these consumers understand the meaning +# of the key "sample_weight". + +weighted_acc = make_scorer(accuracy, request_props=['sample_weight']) +lr = LogisticRegressionCV( + cv=group_cv, + scoring=weighted_acc, +).request_sample_weight(fit=['sample_weight']) +cross_validate(lr, X, y, cv=group_cv, + props={'sample_weight': my_weights, 'groups': my_groups}, + scoring=weighted_acc) + +# Error handling: if props={'sample_eight': my_weights, ...} was passed, +# cross_validate would raise an error, since 'sample_eight' was not requested +# by any of its children. + +# %% +# Case B: weighted scoring and unweighted fitting + +# Since LogisticRegressionCV requires that weights explicitly be requested, +# removing that request means the fitting is unweighted. + +weighted_acc = make_scorer(accuracy, request_props=['sample_weight']) +lr = LogisticRegressionCV( + cv=group_cv, + scoring=weighted_acc, +) +cross_validate(lr, X, y, cv=group_cv, + props={'sample_weight': my_weights, 'groups': my_groups}, + scoring=weighted_acc) + +# %% +# Case C: unweighted feature selection + +# Like LogisticRegressionCV, SelectKBest needs to request weights explicitly. +# Here it does not request them. + +weighted_acc = make_scorer(accuracy, request_props=['sample_weight']) +lr = LogisticRegressionCV( + cv=group_cv, + scoring=weighted_acc, +).request_sample_weight(fit=['sample_weight']) +sel = SelectKBest() +pipe = make_pipeline(sel, lr) +cross_validate(pipe, X, y, cv=group_cv, + props={'sample_weight': my_weights, 'groups': my_groups}, + scoring=weighted_acc) + +# %% +# Case D: different scoring and fitting weights + +# Despite make_scorer and LogisticRegressionCV both expecting a key +# sample_weight, we can use aliases to pass different weights to different +# consumers. 
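+# (Illustrative note: 'scoring_weight' and 'fitting_weight' below are
+# user-chosen alias keys; each consumer still receives the values through its
+# usual `sample_weight` parameter -- only the key used by the caller differs.)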
+ +weighted_acc = make_scorer(accuracy, + request_props={'scoring_weight': 'sample_weight'}) +lr = LogisticRegressionCV( + cv=group_cv, + scoring=weighted_acc, +).request_sample_weight(fit='fitting_weight') +cross_validate(lr, X, y, cv=group_cv, + props={ + 'scoring_weight': my_weights, + 'fitting_weight': my_other_weights, + 'groups': my_groups, + }, + scoring=weighted_acc) diff --git a/slep006/proposal.rst b/slep006/proposal.rst index 3477e65..8b4ca90 100644 --- a/slep006/proposal.rst +++ b/slep006/proposal.rst @@ -79,7 +79,7 @@ Other related issues include: :issue:`1574`, :issue:`2630`, :issue:`3524`, :issue:`4632`, :issue:`4652`, :issue:`4660`, :issue:`4696`, :issue:`6322`, :issue:`7112`, :issue:`7646`, :issue:`7723`, :issue:`8127`, :issue:`8158`, :issue:`8710`, :issue:`8950`, :issue:`11429`, :issue:`12052`, :issue:`15282`, -:issues:`15370`, :issue:`15425`. +:issues:`15370`, :issue:`15425`, :issue:`18028`. Desiderata ---------- @@ -368,6 +368,14 @@ Disadvantages: `set_props_request` method (instead of the `request_props` constructor parameter approach) such that all legacy base estimators are automatically equipped. +* Aliasing is a bit confusing in this design, in that the consumer still + accepts the fit param by its original name (e.g. `sample_weight`) even if it + has a request that specifies a different key given to the router (e.g. + `fit_sample_weight`). This design has the advantage that the handling of + props within a consumer is simple and unchanged; the complexity is in + how it is forwarded the data by the router, but it may be conceptually + difficult for users to understand. (This may be acceptable, as an advanced + feature.) * For estimators to be cloned, this request information needs to be cloned with it. This implies one of: the request information be stored as a constructor paramerter; or `clone` is extended to explicitly copy request information. @@ -389,6 +397,22 @@ Test cases: .. literalinclude:: cases_opt4.py +Extensions and alternatives to the syntax considered while working on +:pr:`16079`: + +* `set_prop_request` and `get_props_request` have lists of props requested + **for each method** i.e. fit, score, transform, predict and perhaps others. +* `set_props_request` could be replaced by a method (or parameter) representing + the routing of each prop that it consumes. For example, an estimator that + consumes `sample_weight` would have a `request_sample_weight` method. One of + the difficulties of this approach is automatically introducing + `request_sample_weight` into classes inheriting from BaseEstimator without + too much magic (e.g. meta-classes, which might be the simplest solution). + +These are demonstrated together in the following: + +.. literalinclude:: cases_opt4b.py + Naming ------ @@ -405,30 +429,49 @@ Proposal Having considered the above solutions, we propose: -TODO +* Solution 4 per :pr:`16079` which will be used to resolve further, specific + details of the solution. +* Props will be known simply as Metadata. +* `**kw` syntax will be used to pass props by key. -* which solution? -* if an estimator requests a prop, must it be not-null? Must it be provided or explicitly passed as None? -* props param or kwargs? -* naming? +TODO: + +* if an estimator requests a prop, must it be not-null? Must it be provided or + explicitly passed as None? Backward compatibility ---------------------- -TODO +Under this proposal, consumer behaviour will be backwards compatible, but +meta-estimators will change their routing behaviour. 
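+
+For illustration only, under the proposed ``**kw`` passing style and the
+provisional request API sketched above (``request_props``,
+``set_props_request`` -- none of these names are final), Case A might read
+roughly as::
+
+    weighted_acc = make_scorer(accuracy_score, request_props=['sample_weight'])
+    lr = LogisticRegressionCV(
+        cv=GroupKFold(),
+        scoring=weighted_acc,
+    ).set_props_request(['sample_weight'])
+    # Group* splitters request `groups` by default; weights must be requested
+    # explicitly by both the scorer and the estimator.
+    cross_validate(lr, X, y, cv=GroupKFold(),
+                   sample_weight=my_weights, groups=my_groups,
+                   scoring=weighted_acc)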
+ +By default, `sample_weight` will not be requested by estimators that support +it. This ensures that addition of `sample_weight` support to an estimator will +not change its behaviour. -TODO: Do we continue to handle sample_weight such that it only gets provided of requested explicitly? Or do we make it requested by default in the future (possibly with a deprecation period)? +During a deprecation period, fit_params will be handled dually: Keys that are +requested will be passed through the new request mechanism, while keys that are +not known will be routed using legacy mechanisms. At completion of the +deprecation period, the legacy handling will cease. -During a deprecation period, fit_params will be handled dually: Keys that are requested will be passed through the new request mechanism, while keys that are not known will be routed using legacy mechanisms. At completion of the deprecation period, the legacy handling will cease. +Similarly, during a deprecation period, `fit_params` in GridSearchCV and +related utilities will be routed to the estimator's `fit` by default, per +incumbent behaviour. After the deprecation period, an error will be raised for +any params not explicitly requested. -Grouped cross validation splitters will request `groups` since they were previously unusable in a nested cross validation context, so this should not often create backwards incompatibilities, except perhaps where a fit param named `groups` served another purpose. +Grouped cross validation splitters will request `groups` since they were +previously unusable in a nested cross validation context, so this should not +often create backwards incompatibilities, except perhaps where a fit param +named `groups` served another purpose. Discussion ---------- -One benefit of the explicitness in Solution 4 is that even if it makes use of **kw arguments, it does not preclude keywords arguments serving other purposes in addition. That is, in addition to requesting sample props, a future proposal could allow estimators to request feature metadata or other keys. - -TODO +One benefit of the explicitness in Solution 4 is that even if it makes use of +`**kw` arguments, it does not preclude keywords arguments serving other +purposes in addition. That is, in addition to requesting sample props, a +future proposal could allow estimators to request feature metadata or other +keys. References and Footnotes ------------------------ From 9f6b8b599b0af9c79ba490120efed0e1b0176c18 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Wed, 2 Sep 2020 12:20:54 -0400 Subject: [PATCH 056/118] minor stuff --- slep006/proposal.rst | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/slep006/proposal.rst b/slep006/proposal.rst index 8b4ca90..1cb4b91 100644 --- a/slep006/proposal.rst +++ b/slep006/proposal.rst @@ -33,7 +33,7 @@ Desirable features we do not currently support include: validation * (maybe in scope) passing sample properties (e.g. `sample_weight`) to some scorers and not others in a multi-metric cross-validation setup -* (likley out of scope) passing sample properties to non-fit methods, for +* (likely out of scope) passing sample properties to non-fit methods, for instance to index grouped samples that are to be treated as a single sequence in prediction. @@ -118,7 +118,7 @@ Keyword arguments vs. a single argument Currently, sample properties are provided as keyword arguments to a `fit` method. 
In redeveloping sample properties, we can instead accept a single -parameter (named `props` or `sample_props` or `etc`, for example) which maps +parameter (named `props` or `sample_props`, for example) which maps string keys to arrays of the same length (a "DataFrame-like"). Keyword arguments:: @@ -127,7 +127,7 @@ Keyword arguments:: Single argument:: - >>> gs.fit(X, y, prop={'groups': groups, 'weight': weight}) + >>> gs.fit(X, y, props={'groups': groups, 'weight': weight}) While drafting this document, we will assume the latter notation for clarity. @@ -204,8 +204,8 @@ Without changing scikit-learn, the following hack can be used: If `y` is represented with a Pandas datatype, then its index can be used to access required elements from props stored in a global namespace (or otherwise made available to the estimator before fitting). This is possible everywhere -that a gold-standard `y` is passed, including fit, split and score. A similar -solution with `X` is also possible for handling predict-time props, if all +that a ground-truth `y` is passed, including fit, split, score, and metrics. +A similar solution with `X` is also possible (except for metrics), if all Pipeline components retain the original Pandas Index. Issues: @@ -253,7 +253,7 @@ In short, this is a simple solution, but prone to risk. Solution 2: Specify routes at call ---------------------------------- -Similar to the legacy behavior of fit parameters in +Similar to the current behavior of fit parameters in :class:`sklearn.pipeline.Pipeline`, this requires the user to specify the path for each "prop" to follow when calling `fit`. For example, to pass a prop named 'weights' to a step named 'spam' in a Pipeline, you might use @@ -268,8 +268,8 @@ Advantages: Issues: -* The user needs to know the deep internal structure, or it is easy to fail to - pass a prop to a specific estimator. +* The user needs to know the nested internal structure, or it is easy to fail + to pass a prop to a specific estimator. * A corollary is that prop keys need changing when the developer modifies their estimator structure (see case C). * This gets especially tricky or impossible where the available routes @@ -321,7 +321,7 @@ Issues: failures may be hard to identify and debug. * For estimators to be cloned, this routing information needs to be cloned with it. This implies one of: the routing information be stored as a constructor - paramerter; or `clone` is extended to explicitly copy routing information. + parameter; or `clone` is extended to explicitly copy routing information. Possible public syntax: @@ -336,12 +336,12 @@ named keys passed to the meta-estimator. Solution 4: Each child requests ------------------------------- -Here the meta-estimator provides only what its each of its children requests. +Here the meta-estimator provides only what each of its children requests. The meta-estimator would also need to request, on behalf of its children, any prop that descendant consumers require. -Each object in a situation that could receive props would have a method like -`_get_prop_requests()` which would return a list of prop names (or perhaps a +Each object that could receive props would have a method like +`get_prop_request()` which would return a list of prop names (or perhaps a mapping for more sophisticated use-cases). Group* CV splitters would default to returning `['groups']`, for example. 
Estimators supporting weighted fitting may return `[]` by default, but may have a parameter `request_props` which @@ -378,7 +378,7 @@ Disadvantages: feature.) * For estimators to be cloned, this request information needs to be cloned with it. This implies one of: the request information be stored as a constructor - paramerter; or `clone` is extended to explicitly copy request information. + parameter; or `clone` is extended to explicitly copy request information. Possible public syntax: From 77d25775398b6b91017a09b16cb2cb862c11f948 Mon Sep 17 00:00:00 2001 From: Nicolas Hug Date: Thu, 3 Sep 2020 17:18:35 -0400 Subject: [PATCH 057/118] put back legacy --- slep006/proposal.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/slep006/proposal.rst b/slep006/proposal.rst index 1cb4b91..d5ad306 100644 --- a/slep006/proposal.rst +++ b/slep006/proposal.rst @@ -253,7 +253,7 @@ In short, this is a simple solution, but prone to risk. Solution 2: Specify routes at call ---------------------------------- -Similar to the current behavior of fit parameters in +Similar to the legacy behavior of fit parameters in :class:`sklearn.pipeline.Pipeline`, this requires the user to specify the path for each "prop" to follow when calling `fit`. For example, to pass a prop named 'weights' to a step named 'spam' in a Pipeline, you might use From e656bf722ff5426385894ee1358bca0b7afc6487 Mon Sep 17 00:00:00 2001 From: Alexandre Gramfort Date: Tue, 16 Feb 2021 21:30:56 +0100 Subject: [PATCH 058/118] Amend / Converge on Slep 6 on sample props aka now as metadata (#50) Co-authored-by: Adrin Jalali Co-authored-by: Joel Nothman --- slep006/cases_opt0a.py | 7 +- slep006/cases_opt0b.py | 4 +- slep006/cases_opt1.py | 16 +- slep006/cases_opt2.py | 18 +- slep006/cases_opt3.py | 22 +- slep006/cases_opt4.py | 26 +-- slep006/cases_opt4b.py | 53 +++-- slep006/defs.py | 4 +- slep006/other.rst | 160 +++++++++++++++ slep006/proposal.rst | 449 +++++++++++++++-------------------------- 10 files changed, 410 insertions(+), 349 deletions(-) create mode 100644 slep006/other.rst diff --git a/slep006/cases_opt0a.py b/slep006/cases_opt0a.py index e94c89e..96a3206 100644 --- a/slep006/cases_opt0a.py +++ b/slep006/cases_opt0a.py @@ -1,9 +1,8 @@ import numpy as np -from defs import (accuracy, group_cv, get_scorer, SelectKBest, +from defs import (GroupKFold, get_scorer, SelectKBest, LogisticRegressionCV, cross_validate, - make_pipeline, X, y, my_groups, my_weights, - my_other_weights) + make_pipeline, X, y, my_groups, my_weights) # %% # Case A: weighted scoring and fitting @@ -31,7 +30,7 @@ def get_n_splits(self, X, y, groups=None): return self.base_cv.get_n_splits(unwrap_X(X), y, groups=groups) -wrapped_group_cv = WrappedGroupCV(group_cv) +wrapped_group_cv = WrappedGroupCV(GroupKFold()) class WrappedLogisticRegressionCV(LogisticRegressionCV): diff --git a/slep006/cases_opt0b.py b/slep006/cases_opt0b.py index 6af76cd..89ae365 100644 --- a/slep006/cases_opt0b.py +++ b/slep006/cases_opt0b.py @@ -1,5 +1,5 @@ import pandas as pd -from defs import (accuracy, group_cv, get_scorer, SelectKBest, +from defs import (get_scorer, SelectKBest, LogisticRegressionCV, cross_validate, make_pipeline, X, y, my_groups, my_weights, my_other_weights) @@ -24,7 +24,7 @@ def get_n_splits(self, X, y, groups=None): return self.base_cv.get_n_splits(X, y, groups=MY_GROUPS.loc[X.index]) -wrapped_group_cv = WrappedGroupCV(group_cv) +wrapped_group_cv = WrappedGroupCV(GroupKFold()) class WeightedLogisticRegressionCV(LogisticRegressionCV): diff --git 
a/slep006/cases_opt1.py b/slep006/cases_opt1.py index a8185d3..94351df 100644 --- a/slep006/cases_opt1.py +++ b/slep006/cases_opt1.py @@ -1,4 +1,4 @@ -from defs import (accuracy, group_cv, make_scorer, SelectKBest, +from defs import (accuracy_score, GroupKFold, make_scorer, SelectKBest, LogisticRegressionCV, cross_validate, make_pipeline, X, y, my_groups, my_weights, my_other_weights) @@ -6,10 +6,10 @@ # Case A: weighted scoring and fitting lr = LogisticRegressionCV( - cv=group_cv, + cv=GroupKFold(), scoring='accuracy', ) -cross_validate(lr, X, y, cv=group_cv, +cross_validate(lr, X, y, cv=GroupKFold(), props={'sample_weight': my_weights, 'groups': my_groups}, scoring='accuracy') @@ -34,19 +34,19 @@ def fit(self, X, y, props=None): # that remains the case, it will simply ignore the prop passed to it. Hence: lr = LogisticRegressionCV( - cv=group_cv, + cv=GroupKFold(), scoring='accuracy', ) sel = SelectKBest() pipe = make_pipeline(sel, lr) -cross_validate(pipe, X, y, cv=group_cv, +cross_validate(pipe, X, y, cv=GroupKFold(), props={'sample_weight': my_weights, 'groups': my_groups}, scoring='accuracy') # %% # Case D: different scoring and fitting weights -weighted_acc = make_scorer(accuracy) +weighted_acc = make_scorer(accuracy_score) def specially_weighted_acc(est, X, y, props): @@ -56,10 +56,10 @@ def specially_weighted_acc(est, X, y, props): lr = LogisticRegressionCV( - cv=group_cv, + cv=GroupKFold(), scoring=specially_weighted_acc, ) -cross_validate(lr, X, y, cv=group_cv, +cross_validate(lr, X, y, cv=GroupKFold(), props={ 'scoring_weight': my_weights, 'sample_weight': my_other_weights, diff --git a/slep006/cases_opt2.py b/slep006/cases_opt2.py index 4148e66..5c63d3d 100644 --- a/slep006/cases_opt2.py +++ b/slep006/cases_opt2.py @@ -1,4 +1,4 @@ -from defs import (group_cv, SelectKBest, LogisticRegressionCV, +from defs import (GroupKFold, SelectKBest, LogisticRegressionCV, cross_validate, make_pipeline, X, y, my_groups, my_weights, my_other_weights) @@ -6,7 +6,7 @@ # Case A: weighted scoring and fitting lr = LogisticRegressionCV( - cv=group_cv, + cv=GroupKFold(), scoring='accuracy', ) props = {'cv__groups': my_groups, @@ -14,7 +14,7 @@ 'estimator__sample_weight': my_weights, 'scoring__sample_weight': my_weights, 'estimator__scoring__sample_weight': my_weights} -cross_validate(lr, X, y, cv=group_cv, +cross_validate(lr, X, y, cv=GroupKFold(), props=props, scoring='accuracy') @@ -25,14 +25,14 @@ # Case B: weighted scoring and unweighted fitting lr = LogisticRegressionCV( - cv=group_cv, + cv=GroupKFold(), scoring='accuracy', ) props = {'cv__groups': my_groups, 'estimator__cv__groups': my_groups, 'scoring__sample_weight': my_weights, 'estimator__scoring__sample_weight': my_weights} -cross_validate(lr, X, y, cv=group_cv, +cross_validate(lr, X, y, cv=GroupKFold(), props=props, scoring='accuracy') @@ -40,7 +40,7 @@ # Case C: unweighted feature selection lr = LogisticRegressionCV( - cv=group_cv, + cv=GroupKFold(), scoring='accuracy', ) pipe = make_pipeline(SelectKBest(), lr) @@ -49,7 +49,7 @@ 'estimator__logisticregressioncv__sample_weight': my_weights, 'scoring__sample_weight': my_weights, 'estimator__scoring__sample_weight': my_weights} -cross_validate(pipe, X, y, cv=group_cv, +cross_validate(pipe, X, y, cv=GroupKFold(), props=props, scoring='accuracy') @@ -57,7 +57,7 @@ # Case D: different scoring and fitting weights lr = LogisticRegressionCV( - cv=group_cv, + cv=GroupKFold(), scoring='accuracy', ) props = {'cv__groups': my_groups, @@ -65,6 +65,6 @@ 'estimator__sample_weight': my_other_weights, 
'scoring__sample_weight': my_weights, 'estimator__scoring__sample_weight': my_weights} -cross_validate(lr, X, y, cv=group_cv, +cross_validate(lr, X, y, cv=GroupKFold(), props=props, scoring='accuracy') diff --git a/slep006/cases_opt3.py b/slep006/cases_opt3.py index 5b4b450..fff317d 100644 --- a/slep006/cases_opt3.py +++ b/slep006/cases_opt3.py @@ -1,12 +1,12 @@ -from defs import (accuracy, make_scorer, SelectKBest, LogisticRegressionCV, - group_cv, cross_validate, make_pipeline, X, y, my_groups, +from defs import (SelectKBest, LogisticRegressionCV, + GroupKFold, cross_validate, make_pipeline, X, y, my_groups, my_weights, my_other_weights) # %% # Case A: weighted scoring and fitting lr = LogisticRegressionCV( - cv=group_cv, + cv=GroupKFold(), scoring='accuracy', prop_routing={'cv': ['groups'], 'scoring': ['sample_weight'], @@ -18,11 +18,11 @@ # Alternative syntax, which assumes cv receives 'groups' by default, and that a # method-based API is provided on meta-estimators: # lr = LogisticRegressionCV( -# cv=group_cv, +# cv=GroupKFold(), # scoring='accuracy', # ).add_prop_route(scoring='sample_weight') -cross_validate(lr, X, y, cv=group_cv, +cross_validate(lr, X, y, cv=GroupKFold(), props={'sample_weight': my_weights, 'groups': my_groups}, scoring='accuracy', prop_routing={'estimator': '*', # pass all props @@ -40,7 +40,7 @@ # Here we rename the sample_weight prop so that we can specify that it only # applies to scoring. lr = LogisticRegressionCV( - cv=group_cv, + cv=GroupKFold(), scoring='accuracy', prop_routing={'cv': ['groups'], # read the following as "scoring should consume @@ -48,7 +48,7 @@ 'scoring': {'sample_weight': 'scoring_weight'}, }, ) -cross_validate(lr, X, y, cv=group_cv, +cross_validate(lr, X, y, cv=GroupKFold(), props={'scoring_weight': my_weights, 'groups': my_groups}, scoring='accuracy', prop_routing={'estimator': '*', @@ -60,7 +60,7 @@ # Case C: unweighted feature selection lr = LogisticRegressionCV( - cv=group_cv, + cv=GroupKFold(), scoring='accuracy', prop_routing={'cv': ['groups'], 'scoring': ['sample_weight'], @@ -68,7 +68,7 @@ pipe = make_pipeline(SelectKBest(), lr, prop_routing={'logisticregressioncv': ['sample_weight', 'groups']}) -cross_validate(lr, X, y, cv=group_cv, +cross_validate(lr, X, y, cv=GroupKFold(), props={'sample_weight': my_weights, 'groups': my_groups}, scoring='accuracy', prop_routing={'estimator': '*', @@ -79,7 +79,7 @@ # %% # Case D: different scoring and fitting weights lr = LogisticRegressionCV( - cv=group_cv, + cv=GroupKFold(), scoring='accuracy', prop_routing={'cv': ['groups'], # read the following as "scoring should consume @@ -87,7 +87,7 @@ 'scoring': {'sample_weight': 'scoring_weight'}, }, ) -cross_validate(lr, X, y, cv=group_cv, +cross_validate(lr, X, y, cv=GroupKFold(), props={'scoring_weight': my_weights, 'groups': my_groups, 'fitting_weight': my_other_weights}, scoring='accuracy', diff --git a/slep006/cases_opt4.py b/slep006/cases_opt4.py index 1d1325c..84c8633 100644 --- a/slep006/cases_opt4.py +++ b/slep006/cases_opt4.py @@ -1,4 +1,4 @@ -from defs import (accuracy, group_cv, make_scorer, SelectKBest, +from defs import (accuracy_score, GroupKFold, make_scorer, SelectKBest, LogisticRegressionCV, cross_validate, make_pipeline, X, y, my_groups, my_weights, my_other_weights) @@ -11,12 +11,12 @@ # LogisticRegressionCV. Both of these consumers understand the meaning # of the key "sample_weight". 
-weighted_acc = make_scorer(accuracy, request_props=['sample_weight']) +weighted_acc = make_scorer(accuracy_score, request_props=['sample_weight']) lr = LogisticRegressionCV( - cv=group_cv, + cv=GroupKFold(), scoring=weighted_acc, ).set_props_request(['sample_weight']) -cross_validate(lr, X, y, cv=group_cv, +cross_validate(lr, X, y, cv=GroupKFold(), props={'sample_weight': my_weights, 'groups': my_groups}, scoring=weighted_acc) @@ -30,12 +30,12 @@ # Since LogisticRegressionCV requires that weights explicitly be requested, # removing that request means the fitting is unweighted. -weighted_acc = make_scorer(accuracy, request_props=['sample_weight']) +weighted_acc = make_scorer(accuracy_score, request_props=['sample_weight']) lr = LogisticRegressionCV( - cv=group_cv, + cv=GroupKFold(), scoring=weighted_acc, ) -cross_validate(lr, X, y, cv=group_cv, +cross_validate(lr, X, y, cv=GroupKFold(), props={'sample_weight': my_weights, 'groups': my_groups}, scoring=weighted_acc) @@ -45,14 +45,14 @@ # Like LogisticRegressionCV, SelectKBest needs to request weights explicitly. # Here it does not request them. -weighted_acc = make_scorer(accuracy, request_props=['sample_weight']) +weighted_acc = make_scorer(accuracy_score, request_props=['sample_weight']) lr = LogisticRegressionCV( - cv=group_cv, + cv=GroupKFold(), scoring=weighted_acc, ).set_props_request(['sample_weight']) sel = SelectKBest() pipe = make_pipeline(sel, lr) -cross_validate(pipe, X, y, cv=group_cv, +cross_validate(pipe, X, y, cv=GroupKFold(), props={'sample_weight': my_weights, 'groups': my_groups}, scoring=weighted_acc) @@ -63,13 +63,13 @@ # sample_weight, we can use aliases to pass different weights to different # consumers. -weighted_acc = make_scorer(accuracy, +weighted_acc = make_scorer(accuracy_score, request_props={'scoring_weight': 'sample_weight'}) lr = LogisticRegressionCV( - cv=group_cv, + cv=GroupKFold(), scoring=weighted_acc, ).set_props_request({'fitting_weight': "sample_weight"}) -cross_validate(lr, X, y, cv=group_cv, +cross_validate(lr, X, y, cv=GroupKFold(), props={ 'scoring_weight': my_weights, 'fitting_weight': my_other_weights, diff --git a/slep006/cases_opt4b.py b/slep006/cases_opt4b.py index 58aeaa4..5f8fd4d 100644 --- a/slep006/cases_opt4b.py +++ b/slep006/cases_opt4b.py @@ -1,4 +1,4 @@ -from defs import (accuracy, group_cv, make_scorer, SelectKBest, +from defs import (accuracy_score, GroupKFold, make_scorer, SelectKBest, LogisticRegressionCV, cross_validate, make_pipeline, X, y, my_groups, my_weights, my_other_weights) @@ -11,16 +11,25 @@ # LogisticRegressionCV. Both of these consumers understand the meaning # of the key "sample_weight". 
-weighted_acc = make_scorer(accuracy, request_props=['sample_weight']) +weighted_acc = make_scorer(accuracy_score, request_metadata=['sample_weight']) +group_cv = GroupKFold() lr = LogisticRegressionCV( cv=group_cv, scoring=weighted_acc, -).request_sample_weight(fit=['sample_weight']) +).request_sample_weight(fit=True) # same as `fit=['sample_weight']` cross_validate(lr, X, y, cv=group_cv, - props={'sample_weight': my_weights, 'groups': my_groups}, + metadata={'sample_weight': my_weights, 'groups': my_groups}, scoring=weighted_acc) -# Error handling: if props={'sample_eight': my_weights, ...} was passed, +# Here lr.get_metadata_request() would return +# {'fit': {'groups': {'groups'}, 'sample_weight': {'sample_weight'}}, +# 'predict': {}, +# 'transform': {}, +# 'score': {}, +# 'split': {}, +# 'inverse_transform': {}} + +# Error handling: if metadata={'sample_eight': my_weights, ...} was passed, # cross_validate would raise an error, since 'sample_eight' was not requested # by any of its children. @@ -30,30 +39,38 @@ # Since LogisticRegressionCV requires that weights explicitly be requested, # removing that request means the fitting is unweighted. -weighted_acc = make_scorer(accuracy, request_props=['sample_weight']) +weighted_acc = make_scorer(accuracy_score, request_metadata=['sample_weight']) lr = LogisticRegressionCV( cv=group_cv, scoring=weighted_acc, -) +).request_sample_weight(fit=False) # if not specified an exception is raised cross_validate(lr, X, y, cv=group_cv, - props={'sample_weight': my_weights, 'groups': my_groups}, + metadata={'sample_weight': my_weights, 'groups': my_groups}, scoring=weighted_acc) +# Here lr.get_metadata_request() would return +# {'fit': {'groups': {'groups'}}, +# 'predict': {}, +# 'transform': {}, +# 'score': {}, +# 'split': {}, +# 'inverse_transform': {}} + # %% # Case C: unweighted feature selection # Like LogisticRegressionCV, SelectKBest needs to request weights explicitly. # Here it does not request them. -weighted_acc = make_scorer(accuracy, request_props=['sample_weight']) +weighted_acc = make_scorer(accuracy_score, request_metadata=['sample_weight']) lr = LogisticRegressionCV( cv=group_cv, scoring=weighted_acc, -).request_sample_weight(fit=['sample_weight']) -sel = SelectKBest() +).request_sample_weight(fit=True) +sel = SelectKBest().request_sample_weight(fit=False) pipe = make_pipeline(sel, lr) cross_validate(pipe, X, y, cv=group_cv, - props={'sample_weight': my_weights, 'groups': my_groups}, + metadata={'sample_weight': my_weights, 'groups': my_groups}, scoring=weighted_acc) # %% @@ -63,16 +80,16 @@ # sample_weight, we can use aliases to pass different weights to different # consumers. 
-weighted_acc = make_scorer(accuracy, - request_props={'scoring_weight': 'sample_weight'}) +weighted_acc = make_scorer(accuracy_score, + request_metadata={'scoring_weight': 'sample_weight'}) lr = LogisticRegressionCV( cv=group_cv, scoring=weighted_acc, ).request_sample_weight(fit='fitting_weight') cross_validate(lr, X, y, cv=group_cv, - props={ - 'scoring_weight': my_weights, - 'fitting_weight': my_other_weights, - 'groups': my_groups, + metadata={ + 'scoring_weight': my_weights, + 'fitting_weight': my_other_weights, + 'groups': my_groups, }, scoring=weighted_acc) diff --git a/slep006/defs.py b/slep006/defs.py index 2026c8e..26c1d6a 100644 --- a/slep006/defs.py +++ b/slep006/defs.py @@ -1,14 +1,14 @@ import numpy as np from sklearn.feature_selection import SelectKBest from sklearn.linear_model import LogisticRegressionCV -from sklearn.metrics import accuracy +from sklearn.metrics import accuracy_score from sklearn.metrics import make_scorer from sklearn.model_selection import GroupKFold, cross_validate from sklearn.pipeline import make_pipeline N, M = 100, 4 X = np.random.rand(N, M) -y = np.random.randint(0, 1, size=N) +y = np.random.randint(0, 2, size=N) my_groups = np.random.randint(0, 10, size=N) my_weights = np.random.rand(N) my_other_weights = np.random.rand(N) diff --git a/slep006/other.rst b/slep006/other.rst new file mode 100644 index 0000000..552e289 --- /dev/null +++ b/slep006/other.rst @@ -0,0 +1,160 @@ +:orphan: + +.. _slep_006_other: + +Alternative solutions to sample-aligned meta-data +================================================= + +This page contains alternative solutions that have been discussed +and finally not considered in the SLEP. + +Solution sketches require these definitions: + +.. literalinclude:: defs.py + +Status quo solution 0a: additional feature +------------------------------------------ + +Without changing scikit-learn, the following hack can be used: + +Additional numeric features representing sample props can be appended to the +data and passed around, being handled specially in each consumer of features +or sample props. + +.. literalinclude:: cases_opt0a.py + +Status quo solution 0b: Pandas Index and global resources +--------------------------------------------------------- + +Without changing scikit-learn, the following hack can be used: + +If `y` is represented with a Pandas datatype, then its index can be used to +access required elements from props stored in a global namespace (or otherwise +made available to the estimator before fitting). This is possible everywhere +that a ground-truth `y` is passed, including fit, split, score, and metrics. +A similar solution with `X` is also possible (except for metrics), if all +Pipeline components retain the original Pandas Index. + +Issues: + +* use of global data source +* requires Pandas data types and indices to be maintained + +.. literalinclude:: cases_opt0b.py + +Solution 1: Pass everything +--------------------------- + +This proposal passes all props to all consumers (estimators, splitters, +scorers, etc). The consumer would optionally use props it is familiar with by +name and disregard other props. + +We may consider providing syntax for the user to control the interpretation of +incoming props: + +* to require that some prop is provided (for an estimator where that prop is + otherwise optional) +* to disregard some provided prop +* to treat a particular prop key as having a certain meaning (e.g. locally + interpreting 'scoring_sample_weight' as 'sample_weight'). 
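+
+For illustration only, such declarations might look like the following
+(hypothetical ``prop_constraints`` parameter -- no such API exists today)::
+
+    est = LogisticRegressionCV(
+        prop_constraints={
+            'sample_weight': 'required',               # must be provided
+            'scoring_sample_weight': 'sample_weight',  # locally re-keyed
+            'groups': 'ignored',                       # disregarded if passed
+        },
+    )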
+ +These constraints would be checked by calling a helper at the consumer. + +Issues: + +* Error handling: if a key is optional in a consumer, no error will be + raised for misspelling. An introspection API might change this, allowing a + user or meta-estimator to check if all keys passed are to be used in at least + one consumer. +* Forwards compatibility: newly supporting a prop key in a consumer will change + behaviour. Other than a ChangedBehaviorWarning, I don't see any way around + this. +* Introspection: not inherently supported. Would need an API like + ``get_prop_support(names: List[str]) -> Dict[str, Literal["supported", "required", "ignored"]]``. + +In short, this is a simple solution, but prone to risk. + +.. literalinclude:: cases_opt1.py + + +Solution 2: Specify routes at call +---------------------------------- + +Similar to the legacy behavior of fit parameters in +:class:`sklearn.pipeline.Pipeline`, this requires the user to specify the +path for each "prop" to follow when calling `fit`. For example, to pass +a prop named 'weights' to a step named 'spam' in a Pipeline, you might use +`my_pipe.fit(X, y, props={'spam__weights': my_weights})`. + +SLEP004's syntax to override the common routing scheme falls under this +solution. + +Advantages: + +* Very explicit and robust to misspellings. + +Issues: + +* The user needs to know the nested internal structure, or it is easy to fail + to pass a prop to a specific estimator. +* A corollary is that prop keys need changing when the developer modifies their + estimator structure (see case C). +* This gets especially tricky or impossible where the available routes + change mid-fit, such as where a grid search considers estimators with + different structures. +* We would need to find a different solution for :issue:`2630` where a Pipeline + could not be the base estimator of AdaBoost because AdaBoost expects the base + estimator to accept a fit param keyed 'sample_weight'. +* This may not work if a meta-estimator were to have the role of changing a + prop, e.g. a meta-estimator that passes `sample_weight` corresponding to + balanced classes onto its base estimator. The meta-estimator would need a + list of destinations to pass modified props to, or a list of keys to modify. +* We would need to develop naming conventions for different routes, which may + be more complicated than the current conventions; while a GridSearchCV + wrapping a Pipeline currently takes parameters with keys like + `{step_name}__{prop_name}`, this explicit routing, and conflict with + GridSearchCV routing destinations, implies keys like + `estimator__{step_name}__{prop_name}`. + +.. literalinclude:: cases_opt2.py + + +Solution 3: Specify routes on metaestimators +-------------------------------------------- + +Each meta-estimator is given a routing specification which it must follow in +passing only the required parameters to each of its children. In this context, +a GridSearchCV has children including `estimator`, `cv` and (each element of) +`scoring`. + +Pull request :pr:`9566` and its extension in :pr:`15425` are partial +implementations of this approach. + +A major benefit of this approach is that it may allow only prop routing +meta-estimators to be modified, not prop consumers. + +All consumers would be required to check that + +Issues: + +* Routing may be hard to get one's head around, especially since the prop + support belongs to the child estimator but the parent is responsible for the + routing. +* Need to design an API for specifying routings. 
+* As in Solution 2, each local destination for routing props needs to be given + a name. +* Every router along the route will need consistent instructions to pass a + specific prop to a consumer. If the prop is optional in the consumer, routing + failures may be hard to identify and debug. +* For estimators to be cloned, this routing information needs to be cloned with + it. This implies one of: the routing information be stored as a constructor + parameter; or `clone` is extended to explicitly copy routing information. + +Possible public syntax: + +Each meta-estimator has a `prop_routing` parameter to encode local routing +rules, and a set of named children which it routes to. In :pr:`9566`, the +`prop_routing` entry for each child may be a white list or black list of +named keys passed to the meta-estimator. + +.. literalinclude:: cases_opt3.py diff --git a/slep006/proposal.rst b/slep006/proposal.rst index d5ad306..10d8fef 100644 --- a/slep006/proposal.rst +++ b/slep006/proposal.rst @@ -4,12 +4,11 @@ SLEP006: Routing sample-aligned meta-data ========================================== -:Author: Joel Nothman +:Author: Joel Nothman, Adrin Jalali, Alex Gramfort :Status: Draft :Type: Standards Track :Created: 2019-03-07 - Scikit-learn has limited support for information pertaining to each sample (henceforth "sample properties") to be passed through an estimation pipeline. The user can, for instance, pass fit parameters to all members of a @@ -23,33 +22,44 @@ prefixing:: ... clf__sample_weight=[.5, .7]) # doctest: +SKIP Several other meta-estimators, such as GridSearchCV, support forwarding these -fit parameters to their base estimator when fitting. +fit parameters to their base estimator when fitting. Yet a number of important +use cases are currently not supported. -Desirable features we do not currently support include: +Features we currently do not support and wish to include: * passing sample properties (e.g. `sample_weight`) to a scorer used in cross-validation * passing sample properties (e.g. `groups`) to a CV splitter in nested cross validation -* (maybe in scope) passing sample properties (e.g. `sample_weight`) to some +* passing sample properties (e.g. `sample_weight`) to some scorers and not others in a multi-metric cross-validation setup -* (likely out of scope) passing sample properties to non-fit methods, for +* passing sample properties to non-fit methods, for instance to index grouped samples that are to be treated as a single sequence in prediction. +The last two items are considered in the API design, yet will not be considered +in the first version of the implementation. + +Naming +------ + +"Sample props" has become a name understood internally to the Scikit-learn +development team. For ongoing usage we propose to use as naming ``metadata``. + Definitions ----------- consumer An estimator, scorer, splitter, etc., that receives and can make use of - one or more passed props. + one or more passed metadata. key - A label passed along with sample prop data to indicate how it should be + A label passed along with sample metadata to indicate how it should be interpreted (e.g. "weight"). -router - An estimator or function that passes props on to some other router or - consumer, potentially selecting which props to pass to which destination, - and by what key. +.. XXX : remove router? +.. router +.. An estimator or function that passes metadata on to some other router or +.. consumer, potentially selecting which metadata to pass to which destination, +.. and by what key. 
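+
+To make these terms concrete, reusing names from the case listings included
+below (``lr``, ``weighted_acc``, ``my_weights``, ``my_groups``) and the draft
+call style of this SLEP (argument names are not final): ``GroupKFold``
+consumes the metadata passed under the key ``'groups'``, the scorer consumes
+``'sample_weight'``, and ``cross_validate`` merely routes both onwards without
+using them itself::
+
+    cross_validate(lr, X, y, cv=GroupKFold(),
+                   metadata={'sample_weight': my_weights, 'groups': my_groups},
+                   scoring=weighted_acc)
+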
History ------- @@ -84,17 +94,18 @@ Other related issues include: :issue:`1574`, :issue:`2630`, :issue:`3524`, Desiderata ---------- -We will consider the following aspects to develop and compare solutions: +The following aspects have been considered to propose the following solution: +.. XXX : maybe phrase this in an affirmative way Usability Can the use cases be achieved in succinct, readable code? Can common use cases be achieved with a simple recipe copy-pasted from a QA forum? Brittleness - If a property is being routed through a Pipeline, does changing the + If a metadata is being routed through a Pipeline, does changing the structure of the pipeline (e.g. adding a layer of nesting) require rewriting other code? Error handling - If the user mistypes the name of a sample property, or misspecifies how it + If the user mistypes the name of a sample metadata, or misspecifies how it should be routed to a consumer, will an appropriate exception be raised? Impact on meta-estimator design How much meta-estimator code needs to change? How hard will it be to @@ -106,47 +117,50 @@ Backwards compatibility Forwards compatibility Is the solution going to make users' code more brittle with future changes? (For example, will a user's pipeline change - behaviour radically when sample_weight is implemented on some estimator) + behaviour radically when `sample_weight` is implemented on some estimator) Introspection If sensible to do so (e.g. for improved efficiency), can a meta-estimator identify whether its base estimator (recursively) would - handle some particular sample property (e.g. so a meta-estimator can choose - between weighting and resampling, or for automated invariance testing)? + handle some particular sample metadata. + +.. (e.g. so a meta-estimator can choose +.. between weighting and resampling, or for automated invariance testing)? Keyword arguments vs. a single argument --------------------------------------- -Currently, sample properties are provided as keyword arguments to a `fit` -method. In redeveloping sample properties, we can instead accept a single -parameter (named `props` or `sample_props`, for example) which maps +Currently, sample metadata are provided as keyword arguments to a `fit` +method. In redeveloping sample metadata, we can instead accept a single +parameter (named `metadata` or `sample_metadata`, for example) which maps string keys to arrays of the same length (a "DataFrame-like"). -Keyword arguments:: +Using single argument:: - >>> gs.fit(X, y, groups=groups, sample_weight=sample_weight) + >>> gs.fit(X, y, metadata={'groups': groups, 'weight': weight}) -Single argument:: +vs. using keyword arguments:: - >>> gs.fit(X, y, props={'groups': groups, 'weight': weight}) + >>> gs.fit(X, y, groups=groups, sample_weight=sample_weight) -While drafting this document, we will assume the latter notation for clarity. +Advantages of a single argument: + +* we would be able to redefine the default routing of weights etc. without being + concerned by backwards compatibility. +* we could consider the use of keys that are not limited to strings or valid + identifiers (and hence are not limited to using ``_`` as a delimiter). +* we could also consider kwargs to `fit` that are not sample-aligned + (e.g. `with_warm_start`, `feature_names_in`, `feature_meta`) without + restricting valid keys for sample metadata. Advantages of multiple keyword arguments: * succinct -* possible to maintain backwards compatible support for sample_weight, etc. 
+* explicit function signatures relying on interpreter checks on calls +* possible to maintain backwards compatible support for `sample_weight`, etc. * we do not need to handle cases for whether or not some estimator expects a - `props` argument. + `metadata` argument. -Advantages of a single argument: - -* we are able to consider kwargs to `fit` that are not sample-aligned, so that - we can add further functionality (some that have been proposed: - `with_warm_start`, `feature_names_in`, `feature_meta`). -* we are able to redefine the default routing of weights etc. without being - concerned by backwards compatibility. -* we can consider the use of keys that are not limited to strings or valid - identifiers (and hence are not limited to using ``_`` as a delimiter). +In this SLEP, we will propose the solution based on keyword arguments. Test case setup --------------- @@ -155,10 +169,10 @@ Case A ~~~~~~ Cross-validate a ``LogisticRegressionCV(cv=GroupKFold(), scoring='accuracy')`` -with weighted scoring and weighted fitting. +with weighted scoring and weighted fitting, while using groups in splitter. -Error handling: what would happen if the user misspelled `sample_weight` as -`sample_eight`? +Error handling: we would guarantee that if the user misspelled `sample_weight` +as `sample_eight` a meaningful error is raised. Case B ~~~~~~ @@ -166,298 +180,169 @@ Case B Cross-validate a ``LogisticRegressionCV(cv=GroupKFold(), scoring='accuracy')`` with weighted scoring and unweighted fitting. +Error handling: if `sample_weight` is required only in scoring and not in fit +of the sub-estimator the user should make explicit that it is not required +by the sub-estimator. + Case C ~~~~~~ Extend Case A to apply an unweighted univariate feature selector in a -``Pipeline``. +``Pipeline``. This allows to check pipelines where only some steps +require a metadata. Case D ~~~~~~ Different weights for scoring and for fitting in Case A. -TODO: case involving props passed at test time, e.g. to pipe.transform (???). -TODO: case involving score() method, e.g. not specifying scoring in -cross_val_score when wrapping an estimator with weighted score func ... - -Solution sketches will import these definitions: - -.. literalinclude:: defs.py - -Status quo solution 0a: additional feature ------------------------------------------- - -Without changing scikit-learn, the following hack can be used: - -Additional numeric features representing sample props can be appended to the -data and passed around, being handled specially in each consumer of features -or sample props. - -.. literalinclude:: cases_opt0a.py +Motivation: You can have groups used in a CV, which contains batches of data as groups, +and then an estimator which takes groups as sensitive attributes to a +fairness related model. Also in a third party library an estimator may have +the same name for a parameter, but with completely different semantics. -Status quo solution 0b: Pandas Index and global resources ---------------------------------------------------------- +.. TODO: case involving props passed at test time, e.g. to pipe.transform + to be considered later -Without changing scikit-learn, the following hack can be used: - -If `y` is represented with a Pandas datatype, then its index can be used to -access required elements from props stored in a global namespace (or otherwise -made available to the estimator before fitting). This is possible everywhere -that a ground-truth `y` is passed, including fit, split, score, and metrics. 
-A similar solution with `X` is also possible (except for metrics), if all -Pipeline components retain the original Pandas Index. - -Issues: - -* use of global data source -* requires Pandas data types and indices to be maintained - -.. literalinclude:: cases_opt0b.py - -Solution 1: Pass everything ---------------------------- - -This proposal passes all props to all consumers (estimators, splitters, -scorers, etc). The consumer would optionally use props it is familiar with by -name and disregard other props. - -We may consider providing syntax for the user to control the interpretation of -incoming props: - -* to require that some prop is provided (for an estimator where that prop is - otherwise optional) -* to disregard some provided prop -* to treat a particular prop key as having a certain meaning (e.g. locally - interpreting 'scoring_sample_weight' as 'sample_weight'). +Case E +~~~~~~ -These constraints would be checked by calling a helper at the consumer. +``LogisticRegression()`` with a weighted ``.score()`` method. -Issues: +Solution sketches will import these definitions: -* Error handling: if a key is optional in a consumer, no error will be - raised for misspelling. An introspection API might change this, allowing a - user or meta-estimator to check if all keys passed are to be used in at least - one consumer. -* Forwards compatibility: newly supporting a prop key in a consumer will change - behaviour. Other than a ChangedBehaviorWarning, I don't see any way around - this. -* Introspection: not inherently supported. Would need an API like - ``get_prop_support(names: List[str]) -> Dict[str, Literal["supported", "required", "ignored"]]``. +.. literalinclude:: defs.py -In short, this is a simple solution, but prone to risk. +The following solution has emerged as the way to move forward, +yet others where considered. See :ref:`slep_006_other`. -.. literalinclude:: cases_opt1.py +Solution: Each consumer requests +-------------------------------- +.. note:: -Solution 2: Specify routes at call ----------------------------------- + This solution was known as solution 4 during the discussions. -Similar to the legacy behavior of fit parameters in -:class:`sklearn.pipeline.Pipeline`, this requires the user to specify the -path for each "prop" to follow when calling `fit`. For example, to pass -a prop named 'weights' to a step named 'spam' in a Pipeline, you might use -`my_pipe.fit(X, y, props={'spam__weights': my_weights})`. +A meta-estimator provides along to its children only what they request. +A meta-estimator needs to request, on behalf of its children, +any metadata that descendant consumers request. -SLEP004's syntax to override the common routing scheme falls under this -solution. +Each object that could receive metadata should have a method called +`get_metadata_request()` which returns a dict that specifies which +metadata is consumed by each of its methods (keys of this dictionary +are therefore method names, e.g. `fit`, `transform` etc.). +Estimators supporting weighted fitting may return `{}` by default, but have a +method called `request_sample_weight` which allows the user to specify +the requested `sample_weight` in each of its methods. -Advantages: +`Group*CV` splitters default to returning `{'split': 'groups'}`. -* Very explicit and robust to misspellings. - -Issues: - -* The user needs to know the nested internal structure, or it is easy to fail - to pass a prop to a specific estimator. 
-* A corollary is that prop keys need changing when the developer modifies their - estimator structure (see case C). -* This gets especially tricky or impossible where the available routes - change mid-fit, such as where a grid search considers estimators with - different structures. -* We would need to find a different solution for :issue:`2630` where a Pipeline - could not be the base estimator of AdaBoost because AdaBoost expects the base - estimator to accept a fit param keyed 'sample_weight'. -* This may not work if a meta-estimator were to have the role of changing a - prop, e.g. a meta-estimator that passes `sample_weight` corresponding to - balanced classes onto its base estimator. The meta-estimator would need a - list of destinations to pass modified props to, or a list of keys to modify. -* We would need to develop naming conventions for different routes, which may - be more complicated than the current conventions; while a GridSearchCV - wrapping a Pipeline currently takes parameters with keys like - `{step_name}__{prop_name}`, this explicit routing, and conflict with - GridSearchCV routing destinations, implies keys like - `estimator__{step_name}__{prop_name}`. - -.. literalinclude:: cases_opt2.py - - -Solution 3: Specify routes on metaestimators --------------------------------------------- - -Each meta-estimator is given a routing specification which it must follow in -passing only the required parameters to each of its children. In this context, -a GridSearchCV has children including `estimator`, `cv` and (each element of) -`scoring`. - -Pull request :pr:`9566` and its extension in :pr:`15425` are partial -implementations of this approach. - -A major benefit of this approach is that it may allow only prop routing -meta-estimators to be modified, not prop consumers. - -All consumers would be required to check that - -Issues: - -* Routing may be hard to get one's head around, especially since the prop - support belongs to the child estimator but the parent is responsible for the - routing. -* Need to design an API for specifying routings. -* As in Solution 2, each local destination for routing props needs to be given - a name. -* Every router along the route will need consistent instructions to pass a - specific prop to a consumer. If the prop is optional in the consumer, routing - failures may be hard to identify and debug. -* For estimators to be cloned, this routing information needs to be cloned with - it. This implies one of: the routing information be stored as a constructor - parameter; or `clone` is extended to explicitly copy routing information. - -Possible public syntax: - -Each meta-estimator has a `prop_routing` parameter to encode local routing -rules, and a set of named children which it routes to. In :pr:`9566`, the -`prop_routing` entry for each child may be a white list or black list of -named keys passed to the meta-estimator. - -.. literalinclude:: cases_opt3.py - - -Solution 4: Each child requests -------------------------------- - -Here the meta-estimator provides only what each of its children requests. -The meta-estimator would also need to request, on behalf of its children, -any prop that descendant consumers require. - -Each object that could receive props would have a method like -`get_prop_request()` which would return a list of prop names (or perhaps a -mapping for more sophisticated use-cases). Group* CV splitters would default to -returning `['groups']`, for example. 
Estimators supporting weighted fitting
-may return `[]` by default, but may have a parameter `request_props` which
-may be set to `['weight']` if weight is sought, or perhaps just boolean
-parameter `request_weight`. `make_scorer` would have a similar mechanism for
-enabling weighted scoring.
+`make_scorer` accepts `request_metadata` as a keyword parameter through
+which the user can specify what metadata is requested.
 
 Advantages:
 
-* This will not need to affect legacy estimators, since no props will be
-  passed when a props request is not available.
-* This does not require defining a new syntax for routing.
-* The implementation changes in meta-estimators may be easy to provide via a
-  helper or two (perhaps even `call_with_props(method, target, props)`).
-* Easy to reconfigure what props an estimator gets in a grid search.
+* This solution does not affect legacy estimators, since no metadata will be
+  passed when a metadata request is not available.
+* The implementation changes in meta-estimators are easy to provide via two
+  helpers ``build_method_metadata_params(children, routing, metadata)``
+  and ``build_router_metadata_request(children, routing)``. Here ``routing``
+  consists of a list of requests between the meta-estimator and its
+  children. Note that this construct will not be visible to scikit-learn
+  users, yet should be understood by third-party developers developing
+  custom meta-estimators.
+* Easy to reconfigure what metadata an estimator gets in a grid search.
 * Could make use of existing `**fit_params` syntax rather than introducing new
-  `props` argument to `fit`.
+  `metadata` argument to `fit`.
 
 Disadvantages:
 
-* This will require modifying every estimator that may want props, as well as
-  all meta-estimators. We could provide a mixin or similar to add prop-request
-  support to a legacy estimator; or `BaseEstimator` could have a
-  `set_props_request` method (instead of the `request_props` constructor
-  parameter approach) such that all legacy base estimators are
-  automatically equipped.
+* This will require modifying every estimator that may want any metadata,
+  as well as all meta-estimators. Yet, this can be achieved with a mixin class
+  to add metadata-request support to a legacy estimator.
 * Aliasing is a bit confusing in this design, in that the consumer still
   accepts the fit param by its original name (e.g. `sample_weight`) even if it
-  has a request that specifies a different key given to the router (e.g.
-  `fit_sample_weight`). This design has the advantage that the handling of
-  props within a consumer is simple and unchanged; the complexity is in
-  how it is forwarded the data by the router, but it may be conceptually
-  difficult for users to understand. (This may be acceptable, as an advanced
-  feature.)
+  has a request that specifies a different key given to the meta-estimator (e.g.
+  `my_sample_weight`). This design has the advantage that the handling of
+  metadata within a consumer is simple and unchanged; the complexity is in
+  how it is forwarded to the sub-estimator by the meta-estimators. While
+  it may be conceptually difficult for users to understand, this may be
+  acceptable, as an advanced feature.
 * For estimators to be cloned, this request information needs to be cloned with
-  it. This implies one of: the request information be stored as a constructor
-  parameter; or `clone` is extended to explicitly copy request information.
- -Possible public syntax: - -* `BaseEstimator` will have methods `set_props_request` and `get_props_request` -* `make_scorer` will have a `request_props` parameter to set props required by - the scorer. -* `get_props_request` will return a dict. It maps the key that the user - passes to the key that the estimator expects. -* `set_props_request` will accept either such a dict or a sequence `s` to be - interpreted as the identity mapping for all elements in `s` - (`{x: x for x in s}`). It will return `self` to enable chaining. -* `Group*` CV splitters will by default request the 'groups' prop, but its - mapping can be changed with their `set_props_request` method. + it. This implies that `clone` needs to be extended to explicitly copy + request information. + +Proposed public syntax: + +* `BaseEstimator` will have a method `get_metadata_request` +* Estimators that can consume `sample_weight` will have a `request_sample_weight` + method available via a mixin. +* `make_scorer` will have a `request_metadata` parameter to specify the requested + metadata by the scorer. +* `get_metadata_request` will return a dict, whose keys are names of estimator + methods (`fit`, `predict`, `transform` or `inverse_transform`) and values are + dictionaries. These dictionaries map the input parameter names to requested + metadata keys. Example: + + >>> estimator.get_metadata_request() + {'fit': {'my_sample_weight': {'sample_weight'}}, 'predict': {}, 'transform': {}, + 'score': {}, 'split': {}, 'inverse_transform': {}} + +* Methods like `request_sample_weight` will have a signature such as: + `request_sample_weight(self, *, fit=None, score=None)` where fit keyword + parameter can be `None`, `True`, `False` or a `str`. `str` allows here + to request a metadata whose name is different from the keyword parameter. + Here ``None`` is a default, and ``False`` has a different semantic which + is that the metadata should not be provided. + +* `Group*` CV splitters will by default request the 'groups' metadata, but its + mapping can be changed with their `set_metadata_request` method. Test cases: -.. literalinclude:: cases_opt4.py - -Extensions and alternatives to the syntax considered while working on -:pr:`16079`: - -* `set_prop_request` and `get_props_request` have lists of props requested - **for each method** i.e. fit, score, transform, predict and perhaps others. -* `set_props_request` could be replaced by a method (or parameter) representing - the routing of each prop that it consumes. For example, an estimator that - consumes `sample_weight` would have a `request_sample_weight` method. One of - the difficulties of this approach is automatically introducing - `request_sample_weight` into classes inheriting from BaseEstimator without - too much magic (e.g. meta-classes, which might be the simplest solution). - -These are demonstrated together in the following: - .. literalinclude:: cases_opt4b.py -Naming ------- - -"Sample props" has become a name understood internally to the Scikit-learn -development team. For ongoing usage we have several choices for naming: - -* Sample meta -* Sample properties -* Sample props -* Sample extra - -Proposal --------- - -Having considered the above solutions, we propose: - -* Solution 4 per :pr:`16079` which will be used to resolve further, specific - details of the solution. -* Props will be known simply as Metadata. -* `**kw` syntax will be used to pass props by key. - -TODO: - -* if an estimator requests a prop, must it be not-null? 
Must it be provided or
-  explicitly passed as None?
+.. note:: if an estimator requests a metadata key, we consider that its value
+   cannot be ``None``.
 
 Backward compatibility
 ----------------------
 
 Under this proposal, consumer behaviour will be backwards compatible, but
-meta-estimators will change their routing behaviour.
+meta-estimators will change their routing behaviour. We will no longer support
+the dunder (`__`) syntax, and will enforce the use of explicit request method calls.
 
 By default, `sample_weight` will not be requested by estimators that support
 it. This ensures that addition of `sample_weight` support to an estimator will
 not change its behaviour.
 
-During a deprecation period, fit_params will be handled dually: Keys that are
-requested will be passed through the new request mechanism, while keys that are
-not known will be routed using legacy mechanisms. At completion of the
-deprecation period, the legacy handling will cease.
+During a deprecation period, `fit_params` using the dunder syntax will still
+work, but will raise deprecation warnings and cannot be combined with the
+new syntax. In other words, it will not be possible to mix the old and new
+behaviour. At completion of the deprecation period, the legacy handling
+will cease.
 
 Similarly, during a deprecation period, `fit_params` in GridSearchCV and
 related utilities will be routed to the estimator's `fit` by default, per
 incumbent behaviour. After the deprecation period, an error will be raised for
-any params not explicitly requested.
+any params not explicitly requested. See following examples:
+
+>>> # This would raise a deprecation warning, that provided metadata
+>>> # is not requested
+>>> GridSearchCV(LogisticRegression()).fit(X, y, sample_weight=sw)
+>>>
+>>> # this would work with no warnings
+>>> GridSearchCV(LogisticRegression().request_sample_weight(
+...     fit=True)
+...   ).fit(X, y, sample_weight=sw)
+>>>
+>>> # This will raise that LR could accept `sample_weight`, but has
+>>> # not been specified by the user
+>>> GridSearchCV(
+...     LogisticRegression(),
+...     scoring=make_scorer(accuracy_score,
+...                         request_metadata=['sample_weight'])
+...   ).fit(X, y, sample_weight=sw)
 
 Grouped cross validation splitters will request `groups` since they were
 previously unusable in a nested cross validation context, so this should not
@@ -467,9 +352,9 @@ named `groups` served another purpose.
 Discussion
 ----------
 
-One benefit of the explicitness in Solution 4 is that even if it makes use of
+One benefit of the explicitness in this proposal is that even if it makes use of
 `**kw` arguments, it does not preclude keyword arguments serving other
-purposes in addition. That is, in addition to requesting sample props, a
+purposes in addition. That is, in addition to requesting sample metadata, a
 future proposal could allow estimators to request feature metadata or other
 keys.

From c0b4e88306e992d1087779c709deb0a6db9916e5 Mon Sep 17 00:00:00 2001
From: Joel Nothman
Date: Wed, 17 Feb 2021 23:35:04 +1100
Subject: [PATCH 059/118] place SLEP006 under review

---
 index.rst            | 2 +-
 slep006/proposal.rst | 5 -----
 2 files changed, 1 insertion(+), 6 deletions(-)

diff --git a/index.rst b/index.rst
index e5f5718..912b3b0 100644
--- a/index.rst
+++ b/index.rst
@@ -9,6 +9,7 @@
    :maxdepth: 1
    :caption: Under review
 
+   slep006/proposal
    slep007/proposal
    slep012/proposal
    slep013/proposal
@@ -29,7 +30,6 @@
    slep002/proposal
    slep003/proposal
    slep004/proposal
-   slep006/proposal
 
 ..
toctree:: :maxdepth: 1 diff --git a/slep006/proposal.rst b/slep006/proposal.rst index 10d8fef..30d6455 100644 --- a/slep006/proposal.rst +++ b/slep006/proposal.rst @@ -55,11 +55,6 @@ consumer key A label passed along with sample metadata to indicate how it should be interpreted (e.g. "weight"). -.. XXX : remove router? -.. router -.. An estimator or function that passes metadata on to some other router or -.. consumer, potentially selecting which metadata to pass to which destination, -.. and by what key. History ------- From dd85acd6daa53d31fbeb29b929a3a7407e330187 Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Wed, 17 Feb 2021 23:36:28 +1100 Subject: [PATCH 060/118] place SLEP006 under review - part 2 --- slep006/proposal.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/slep006/proposal.rst b/slep006/proposal.rst index 30d6455..e88d8d0 100644 --- a/slep006/proposal.rst +++ b/slep006/proposal.rst @@ -5,7 +5,7 @@ SLEP006: Routing sample-aligned meta-data ========================================== :Author: Joel Nothman, Adrin Jalali, Alex Gramfort -:Status: Draft +:Status: Under Review :Type: Standards Track :Created: 2019-03-07 From 7f4216701e83e1ce7d96da19c6d68d006bd6c93c Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Thu, 18 Feb 2021 00:03:35 +1100 Subject: [PATCH 061/118] Fix markup typo --- slep006/proposal.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/slep006/proposal.rst b/slep006/proposal.rst index e88d8d0..563df20 100644 --- a/slep006/proposal.rst +++ b/slep006/proposal.rst @@ -84,7 +84,7 @@ Other related issues include: :issue:`1574`, :issue:`2630`, :issue:`3524`, :issue:`4632`, :issue:`4652`, :issue:`4660`, :issue:`4696`, :issue:`6322`, :issue:`7112`, :issue:`7646`, :issue:`7723`, :issue:`8127`, :issue:`8158`, :issue:`8710`, :issue:`8950`, :issue:`11429`, :issue:`12052`, :issue:`15282`, -:issues:`15370`, :issue:`15425`, :issue:`18028`. +:issue:`15370`, :issue:`15425`, :issue:`18028`. Desiderata ---------- From a5d6121ad13ae091e540cdc8bb1d16af4f49a993 Mon Sep 17 00:00:00 2001 From: Joel Nothman Date: Tue, 23 Feb 2021 19:36:03 +1100 Subject: [PATCH 062/118] An example in the opening section of SLEP006 (#53) * An example in the optening section of SLEP006 * imports * A bit more explicit --- slep006/proposal.rst | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/slep006/proposal.rst b/slep006/proposal.rst index 563df20..0040174 100644 --- a/slep006/proposal.rst +++ b/slep006/proposal.rst @@ -40,6 +40,33 @@ Features we currently do not support and wish to include: The last two items are considered in the API design, yet will not be considered in the first version of the implementation. +This SLEP proposes an API where users can request certain metadata to be +passed to its consumer by the meta-estimator it is wrapped in. + +The following example illustrates the new `request_metadata` parameter for +making scorers, the `request_sample_weight` estimator method, the `metadata` +parameter replacing `fit_params` in `cross_validate`, and the automatic passing +of `groups` to `GroupKFold` to enable nested grouped cross validation. Here, +the user requests that the `sample_weight` metadata key should be passed to a +customised accuracy scorer (although a predefined 'weighted_accuracy' scorer +could be introduced), and to the LogisticRegressionCV. +`GroupKFold` requests `groups` by default. 
+ + >>> from sklearn.metrics import accuracy_score, make_scorer + >>> from sklearn.model_selection import cross_validate, GroupKFold + >>> from sklearn.linear_model import LogisticRegressionCV + >>> weighted_acc = make_scorer(accuracy_score, + ... request_metadata=['sample_weight']) + >>> group_cv = GroupKFold() + >>> lr = LogisticRegressionCV( + ... cv=group_cv, + ... scoring=weighted_acc, + ... ).request_sample_weight(fit=True) + >>> cross_validate(lr, X, y, cv=group_cv, + ... metadata={'sample_weight': my_weights, + ... 'groups': my_groups}, + ... scoring=weighted_acc) + Naming ------ From 584162478d142b95c9779c410e3d85492509fdc6 Mon Sep 17 00:00:00 2001 From: "Thomas J. Fan" Date: Mon, 12 Apr 2021 11:56:39 -0400 Subject: [PATCH 063/118] FIX Removes sphinx warnings --- conf.py | 4 ++-- slep006/proposal.rst | 9 +++++---- slep012/proposal.rst | 2 +- 3 files changed, 8 insertions(+), 7 deletions(-) diff --git a/conf.py b/conf.py index bdeceb1..d75b601 100644 --- a/conf.py +++ b/conf.py @@ -69,7 +69,7 @@ # This pattern also affects html_static_path and html_extra_path . exclude_patterns = [] -default_role = 'any' +default_role = 'literal' # The name of the Pygments (syntax highlighting) style to use. pygments_style = 'sphinx' @@ -91,7 +91,7 @@ # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". -html_static_path = ['_static'] +# html_static_path = ['_static'] # Custom sidebar templates, must be a dictionary that maps document names # to template names. diff --git a/slep006/proposal.rst b/slep006/proposal.rst index 0040174..7cc1008 100644 --- a/slep006/proposal.rst +++ b/slep006/proposal.rst @@ -1,7 +1,7 @@ .. _slep_006: ========================================== -SLEP006: Routing sample-aligned meta-data +SLEP006: Routing sample-aligned meta-data ========================================== :Author: Joel Nothman, Adrin Jalali, Alex Gramfort @@ -119,6 +119,7 @@ Desiderata The following aspects have been considered to propose the following solution: .. XXX : maybe phrase this in an affirmative way + Usability Can the use cases be achieved in succinct, readable code? Can common use cases be achieved with a simple recipe copy-pasted from a QA forum? @@ -247,7 +248,7 @@ Solution: Each consumer requests A meta-estimator provides along to its children only what they request. A meta-estimator needs to request, on behalf of its children, -any metadata that descendant consumers request. +any metadata that descendant consumers request. Each object that could receive metadata should have a method called `get_metadata_request()` which returns a dict that specifies which @@ -352,12 +353,12 @@ any params not explicitly requested. See following examples: >>> # This would raise a deprecation warning, that provided metadata >>> # is not requested >>> GridSearchCV(LogisticRegression()).fit(X, y, sample_weight=sw) ->>> +>>> >>> # this would work with no warnings >>> GridSearchCV(LogisticRegression().request_sample_weight( ... fit=True) ... 
).fit(X, y, sample_weight=sw) ->>> +>>> >>> # This will raise that LR could accept `sample_weight`, but has >>> # not been specified by the user >>> GridSearchCV( diff --git a/slep012/proposal.rst b/slep012/proposal.rst index af4dd78..90c9347 100644 --- a/slep012/proposal.rst +++ b/slep012/proposal.rst @@ -50,7 +50,7 @@ meta-data is lost immediately after each operation and operations result in a ``numpy.ndarray``. This includes indexing and slicing, *i.e.* to avoid performance degradation, ``__getitem__`` is not overloaded and if the user wishes to preserve the meta-data, they shall do so via explicitly calling a -method such as ``select()``. Operations between two ``InpuArray``s will not +method such as ``select()``. Operations between two ``InpuArrays`` will not try to align rows and/or columns of the two given objects. ``pandas`` compatibility comes ideally as a ``pd.DataFrame(inputarray)``, for From eb9b89f9d4769da421d0c86fc4974ed44ffa86cc Mon Sep 17 00:00:00 2001 From: adrinjalali Date: Wed, 20 Oct 2021 17:36:22 +0200 Subject: [PATCH 064/118] change feature_names_out_ to get_feature_names_out() --- slep007/proposal.rst | 60 ++++++++++++++++++++++---------------------- 1 file changed, 30 insertions(+), 30 deletions(-) diff --git a/slep007/proposal.rst b/slep007/proposal.rst index 1dd9c7c..4f411b9 100644 --- a/slep007/proposal.rst +++ b/slep007/proposal.rst @@ -13,7 +13,7 @@ Abstract ######## This SLEP proposes the introduction of the ``feature_names_in_`` attribute for -all estimators, and the ``feature_names_out_`` attribute for all transformers. +all estimators, and the ``get_feature_names_out`` method for all transformers. We here discuss the generation of such attributes and their propagation through pipelines. Since for most estimators there are multiple ways to generate feature names, this SLEP does not intend to define how exactly feature names @@ -72,10 +72,10 @@ However, it's impossible to interpret or even sanity-check the correspondence of the coefficients to the input features is basically impossible to figure out. -This proposal suggests adding two attributes to fitted estimators: -``feature_names_in_`` and ``feature_names_out_``, such that in the +This proposal suggests adding ``feature_names_in_`` attribute and +``get_feature_names_out`` method to fitted estimators: , such that in the abovementioned example ``clf[-1].feature_names_in_`` and -``clf[-2].feature_names_out_`` will be:: +``clf[-2].get_feature_names_out()`` will be:: ['num__age', 'num__fare', @@ -115,9 +115,9 @@ Scope ##### The API for input and output feature names includes a ``feature_names_in_`` -attribute for all estimators, and a ``feature_names_out_`` attribute for any +attribute for all estimators, and a ``get_feature_names_out`` method for any estimator with a ``transform`` method, *i.e.* they expose the generated feature -names via the ``feature_names_out_`` attribute. +names via the ``get_feature_names_out`` method. Note that this SLEP also applies to `resamplers `_ the same way @@ -135,11 +135,11 @@ Output Feature Names #################### A fitted estimator exposes the output feature names through the -``feature_names_out_`` attribute. Here we discuss more in detail how these +``get_feature_names_out`` method. Here we discuss more in detail how these feature names are generated. Since for most estimators there are multiple ways to generate feature names, this SLEP does not intend to define how exactly feature names are generated for all of them. 
It is instead a guideline on how -they could generally be generated. +they could generally be generated. As detailed bellow, some generated output features names are the same or a derived from the input feature names. In such cases, if no input feature names @@ -150,8 +150,8 @@ Feature Selector Transformers This includes transformers which output a subset of the input features, w/o changing them. For example, if a ``SelectKBest`` transformer selects the first -and the third features, and no names are provided, the ``feature_names_out_`` -will be ``[x0, x2]``. +and the third features, and no names are provided, the +``get_feature_names_out`` will be ``[x0, x2]``. Feature Generating Transformers ******************************* @@ -181,9 +181,9 @@ Meta-Estimators Meta estimators can choose to prefix the output feature names given by the estimators they are wrapping or not. -By default, ``Pipeline`` adds no prefix, *i.e* its ``feature_names_out_`` is -the same as the ``feature_names_out_`` of the last step, and ``None`` if the -last step is not a transformer. +By default, ``Pipeline`` adds no prefix, *i.e* its ``get_feature_names_out()`` +is the same as the ``get_feature_names_out()`` of the last step, and ``None`` +if the last step is not a transformer. ``ColumnTransformer`` by default adds a prefix to the output feature names, indicating the name of the transformer applied to them. If a column is in the output @@ -197,26 +197,26 @@ Here we include some examples to demonstrate the behavior of output feature names:: 100 features (no names) -> PCA(n_components=3) - feature_names_out_: [pca0, pca1, pca2] + get_feature_names_out(): [pca0, pca1, pca2] 100 features (no names) -> SelectKBest(k=3) - feature_names_out_: [x2, x17, x42] + get_feature_names_out(): [x2, x17, x42] [f1, ..., f100] -> SelectKBest(k=3) - feature_names_out_: [f2, f17, f42] + get_feature_names_out(): [f2, f17, f42] [cat0] -> OneHotEncoder() - feature_names_out_: [cat0_cat, cat0_dog, ...] + get_feature_names_out(): [cat0_cat, cat0_dog, ...] 
[f1, ..., f100] -> Pipeline( [SelectKBest(k=30), PCA(n_components=3)] ) - feature_names_out_: [pca0, pca1, pca2] + get_feature_names_out(): [pca0, pca1, pca2] [model, make, numeric0, ..., numeric100] -> @@ -226,9 +226,9 @@ names:: ('num', Pipeline(SimpleImputer(), PCA(n_components=3)), ['numeric0', ..., 'numeric100'])] ) - feature_names_out_: ['cat_model_100', 'cat_model_200', ..., - 'cat_make_ABC', 'cat_make_XYZ', ..., - 'num_pca0', 'num_pca1', 'num_pca2'] + get_feature_names_out(): ['cat_model_100', 'cat_model_200', ..., + 'cat_make_ABC', 'cat_make_XYZ', ..., + 'num_pca0', 'num_pca1', 'num_pca2'] However, the following examples produce a somewhat redundant feature names:: @@ -237,9 +237,9 @@ However, the following examples produce a somewhat redundant feature names:: ('ohe', OneHotEncoder(), ['model', 'make']), ('pca', PCA(n_components=3), ['numeric0', ..., 'numeric100']) ]) - feature_names_out_: ['ohe_model_100', 'ohe_model_200', ..., - 'ohe_make_ABC', 'ohe_make_XYZ', ..., - 'pca_pca0', 'pca_pca1', 'pca_pca2'] + get_feature_names_out(): ['ohe_model_100', 'ohe_model_200', ..., + 'ohe_make_ABC', 'ohe_make_XYZ', ..., + 'pca_pca0', 'pca_pca1', 'pca_pca2'] Extensions ########## @@ -260,9 +260,9 @@ could remove the estimator names, leading to shorter and less redundant names:: (PCA(n_components=3), ['numeric0', ..., 'numeric100']), verbose_feature_names=False ) - feature_names_out_: ['model_100', 'model_200', ..., - 'make_ABC', 'make_XYZ', ..., - 'pca0', 'pca1', 'pca2'] + get_feature_names_out(): ['model_100', 'model_200', ..., + 'make_ABC', 'make_XYZ', ..., + 'pca0', 'pca1', 'pca2'] Alternative solutions to a boolean flag could include: @@ -277,6 +277,6 @@ Backward Compatibility ###################### All estimators should implement the ``feature_names_in_`` and -``feature_names_out_`` API. This is checked in ``check_estimator``, and the -transition is done with a ``FutureWarning`` for at least two versions to give -time to third party developers to implement the API. +``get_feature_names_out()`` API. This is checked in ``check_estimator``, and +the transition is done with a ``FutureWarning`` for at least two versions to +give time to third party developers to implement the API. From 1d0e743ffe966d315469453aaff7a58208dfee7b Mon Sep 17 00:00:00 2001 From: "Thomas J. Fan" Date: Mon, 8 Nov 2021 04:26:46 -0500 Subject: [PATCH 065/118] DOC Use verbose_feature_names_out for verbose feature names out (#60) --- slep007/proposal.rst | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/slep007/proposal.rst b/slep007/proposal.rst index 4f411b9..69c7633 100644 --- a/slep007/proposal.rst +++ b/slep007/proposal.rst @@ -106,7 +106,7 @@ original features: - Algorithms that create combinations of a fixed number of features, *e.g.* ``PolynomialFeatures``, as opposed to all of them where there are many. Note that verbosity considerations and - ``verbose_feature_names`` as explained later can apply here. + ``verbose_feature_names_out`` as explained later can apply here. This proposal talks about how feature names are generated and not how they are propagated. @@ -244,21 +244,22 @@ However, the following examples produce a somewhat redundant feature names:: Extensions ########## -verbose_feature_names -********************* +verbose_feature_names_out +************************* + To provide more control over feature names, we could add a boolean -``verbose_feature_names`` constructor argument to certain transformers. 
+``verbose_feature_names_out`` constructor argument to certain transformers. The default would reflect the description above, but changes would allow more verbose names in some transformers, say having ``StandardScaler`` map ``'age'`` to ``'scale(age)'``. -In case of the ``ColumnTransformer`` example above ``verbose_feature_names`` +In case of the ``ColumnTransformer`` example above ``verbose_feature_names_out`` could remove the estimator names, leading to shorter and less redundant names:: [model, make, numeric0, ..., numeric100] -> make_column_transformer( (OneHotEncoder(), ['model', 'make']), (PCA(n_components=3), ['numeric0', ..., 'numeric100']), - verbose_feature_names=False + verbose_feature_names_out=False ) get_feature_names_out(): ['model_100', 'model_200', ..., 'make_ABC', 'make_XYZ', ..., From c7d1b3c115102d5e469cdeee952283cadc8e9b47 Mon Sep 17 00:00:00 2001 From: Guillaume Lemaitre Date: Mon, 29 Nov 2021 13:47:15 +0100 Subject: [PATCH 066/118] VOTE propose vote for SLEP007 (#59) * VOTE propose vote for SLEP007 * VOTE: ending vote --- index.rst | 2 +- slep007/proposal.rst | 4 +++- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/index.rst b/index.rst index 912b3b0..054a5b8 100644 --- a/index.rst +++ b/index.rst @@ -10,7 +10,6 @@ :caption: Under review slep006/proposal - slep007/proposal slep012/proposal slep013/proposal @@ -18,6 +17,7 @@ :maxdepth: 1 :caption: Accepted + slep007/proposal slep009/proposal slep010/proposal diff --git a/slep007/proposal.rst b/slep007/proposal.rst index 69c7633..65b3c24 100644 --- a/slep007/proposal.rst +++ b/slep007/proposal.rst @@ -5,9 +5,11 @@ SLEP007: Feature names, their generation and the API ==================================================== :Author: Adrin Jalali -:Status: Under Review +:Status: Accepted :Type: Standards Track :Created: 2019-04 +:Vote opened: 2021-10-26 +:Vote closed: 2021-11-29 Abstract ######## From c3e352739d85a3490f3ee512df0b42b0f9f813eb Mon Sep 17 00:00:00 2001 From: "Thomas J. Fan" Date: Mon, 29 Nov 2021 08:00:59 -0500 Subject: [PATCH 067/118] ENH Adds the details about the numpy arrays and strings in SLEP007 (#61) --- slep007/proposal.rst | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/slep007/proposal.rst b/slep007/proposal.rst index 65b3c24..78fbe0b 100644 --- a/slep007/proposal.rst +++ b/slep007/proposal.rst @@ -131,17 +131,19 @@ Input Feature Names The input feature names are stored in a fitted estimator in a ``feature_names_in_`` attribute, and are taken from the given input data, for instance a ``pandas`` data frame. This attribute will be ``None`` if the input -provides no feature names. +provides no feature names. The ``feature_names_in_`` attribute is a 1d NumPy +array with object dtype and all elements in the array are strings. Output Feature Names #################### A fitted estimator exposes the output feature names through the -``get_feature_names_out`` method. Here we discuss more in detail how these -feature names are generated. Since for most estimators there are multiple ways -to generate feature names, this SLEP does not intend to define how exactly -feature names are generated for all of them. It is instead a guideline on how -they could generally be generated. +``get_feature_names_out`` method. The output of ``get_feature_names_out`` is a +1d NumPy array with object dtype and all elements in the array are strings. Here +we discuss more in detail how these feature names are generated. 
Since for most +estimators there are multiple ways to generate feature names, this SLEP does not +intend to define how exactly feature names are generated for all of them. It is +instead a guideline on how they could generally be generated. As detailed bellow, some generated output features names are the same or a derived from the input feature names. In such cases, if no input feature names From 0a729cff0b77b261322f68b1ab88a109169c225e Mon Sep 17 00:00:00 2001 From: Christian Lorentzen Date: Wed, 8 Dec 2021 09:59:59 +0100 Subject: [PATCH 068/118] put accepted at the top (#63) --- index.rst | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/index.rst b/index.rst index 054a5b8..e20f05f 100644 --- a/index.rst +++ b/index.rst @@ -5,14 +5,6 @@ .. include:: README.rst -.. toctree:: - :maxdepth: 1 - :caption: Under review - - slep006/proposal - slep012/proposal - slep013/proposal - .. toctree:: :maxdepth: 1 :caption: Accepted @@ -21,6 +13,14 @@ slep009/proposal slep010/proposal +.. toctree:: + :maxdepth: 1 + :caption: Under review + + slep006/proposal + slep012/proposal + slep013/proposal + .. toctree:: :maxdepth: 1 :numbered: From 325951513ce25f251ca2625171d68e975fbdafc8 Mon Sep 17 00:00:00 2001 From: Adrin Jalali Date: Tue, 4 Jan 2022 14:37:50 +0100 Subject: [PATCH 069/118] slep000, slep workflow (#30) * slep000, slep workflow * rename folder * address Guillaume's comments * address Nicolas's comments * in place -> created * accepted -> merged * slight rephrasing of the abstract * clarify the 'it also used...' sentence * Gael's suggestions * revised merge criteria * further clarify initial approval * SLEP -> SLEP draft * address Olivier's comments * add Guillaume's suggestion * Andy's comments --- slep000/pep-0001-process_flow.png | Bin 0 -> 12925 bytes slep000/proposal.rst | 298 ++++++++++++++++++++++++++++++ 2 files changed, 298 insertions(+) create mode 100644 slep000/pep-0001-process_flow.png create mode 100644 slep000/proposal.rst diff --git a/slep000/pep-0001-process_flow.png b/slep000/pep-0001-process_flow.png new file mode 100644 index 0000000000000000000000000000000000000000..0fc8176d242edf8163a1436adcd4afcef5d97802 GIT binary patch literal 12925 zcmajG2T)UA^e&o&8hS6Gg&w6>L3)4?lpWA ziy*y7lOk36&F}y2oA+kkoq3s)oU9~kXYZ49*80A0tt8wt(xat{Rfo98=BHSo@j%;kWYMxyQ2T@KvpkNF5-)ve!7qS?t41= z1={*J00IL8C7nH7eeG<$93(w`oN{(lApigrps%C#ASib`FL=gs^s;;2IR6h5j)4gt zuQqb8E(Xe?F*YauFyy}ZTl2RM-r>t0ICHs+*Rg|{?taj6OJ+zAhMY4o-4zlz7hwtx z3r7!ZtNE@Cob~tT1^+$|YyGVtx0V~So0X-q7Sh&t$}3IN_5Z)QM8t9*%|>jC4KoUg z7Jxj3XA{JKwwkIvIB#7VrC-PSiP8YdK?hXIh5jjL5d>>Y!;5X42~#uB6QqG`7JTxP zxe?T2_OhI7k6c=HBW2}#b%bF=4EUsEZi70PT-CHktBI775K53r+`yU0k^{R%XA@K# z*_Bi_jK-AD0o@|A2_R+aApZYM!57LE4=Y1#IbvId?w#=zBR$Pk>=(Ud4RnX zynuTTIjdSJycoT>r0Kx#J`8VeE)onK8q4{MWPm~J5|LY!;)5e?Vnb}1X=R2KlOB>E zZ))ctY(fuJq`sFFsA0Y;pJgN~!dCk_xz|v7KToP(e9pk-c-Zs0w&{Sr^uQHzK8C#r zTks$_qB71z&N}!CRYUpK^Nq=N0KDQ|3P}&n1eqt)2(vx*dwDeO^xAEhQ-Yg3gEgb` zDA$L0>v$jd2SC0K1gxs0kr;wM5<|9@P3MNWP#T?G zw%S$Q#szOtPrmtD6?U09_|H{HN!>SfSr*McK9KR?;$)C0f7P%t7!SX6O!ZoX+Mg5vI(*(GAmt@bDKi;J)G+;rhhW~O{yv6 zFRY>bH6%q@c#1m=!5Z23E*QZLX}CuR%#$ZuLj(<;#y%dEzO6}=kfjGuM8LP_389Ms z@EZi?E>+14*UgeD4`~!1uScG#abB_+Glvbhbn&?suRy4u3y6JmVWI<>0OCl|NS&xbtgO$Ums*1gT|g^ySS? 
z+)+q+3WcPMPUwhsb#gD$kA@8P^4pOAoyUj;P1t{W60sC&@}Mt_D1_0g6t_wwt-$VLvb*+02aaw;?7u);Wr>*Tg=~E=n1`|0`v#GC^}n-WPS) zv70@5yQNTgnNt9mBdZc04(|K$0~M%DELj(cLDJfulplB|)q{8+l^>X~5}msL$4ORJ zOB`@T<~H7cz!z0N=1!(+cy=`~euv(sgqWW>{QFafG7pt(Bm18<@-`}!f2O^Y(;gr< z@Ki4S7sWnxXWrC=-plL9>CFpX3zY(cAwnan$L<~!v|SZqvBl`Ht)jonp9BVLx&Atz zt~I)T=Dt6<1Z$vQk+FFnVi`kzb%|SOPFpHFjE#=`E=?oIR}U|L_afqJdUrpbCOMEV zeR1ZY9+T1G8;k~Uc9{2v`t0y;wStn;q+Nh$$dL;4s8*_(5yU)oY?Z@|xh|G4Vh1+x zZ+EZ3S5kk{1aWM)^F)iJXkWe3CTe`|Fa=G`^)TkYev|EkGfoNA5^5yiriUosg4v}u zA_Tel?OrA2A(Dd;EP`+mn(bAZ=IQ?GpWoX&S z8OaG`L3M#mh(QwQK|wCKxZqhj%9k~ouX&zhdAHGZwgH#rF3x0g?RAL)>;S@2(UR<- zYv2ITBN5%kC)j+I0cby!mj%>H5ijuQ~8K!2ObE#!BbquQ&AB5=jCNrFj?5_L-~y@R|ki zFwv-6H#K(&uFeaGiGG`J+|QdmKPIRyu;NQB2xL{WVk`sDk(hK_JSpNn=k{l*v+K)| z)?EymzX35#>$Uw2tFHKdxiyT|S;pk7aXGZ_J890y14Ee9%Lgymu$v^xY{Y6(kb^6xy*Z%GCSf@Q8CkHKLLJ0?`Eyw64Oqh9#jm&p zhtl#h?l}2GM%;dT5#Nc*&gG8gtx1ojNy)Q0>7-tnF}3;Se6zw&I$#ptrcHdCd~0VvlX@$hX_b4&mdbg zH$AhL(JIhwTa>}mt<2@MlBcZ8L2Bi0AvQM_X}&or@4xwXnG12Q<;bd`>8eoe!XEFc zH0&alzeYqef4hRcba@W04yDi5q!W=hqeXx=d8=oStHOt6807V5@zJ3-!TMmV5iCja z-d1ztTG$s+SLzGismENExo_+%u7&o)!ZLinj{9rL%z3XBydWioF#znpCCa8gQ1~gG zv_H!Cdwr9b-sIxyo34k z;@BX3xYc?o-xF@M@$7LT8fx*F~wk}k#A=|xiJHcznU^AN+rh&cENx3v6b2=O8#i^s1foS!S%AV3T z!xZ1D(vPqHoIpEdGJ_5tdTq~E6eW$mVu5}b{&UHRukNZtwUV#*edhf7@EZ2}rzZ|F z%+d{8AvqN*(ibnJuB3batPu=O6+=$WJa3^uN?daPQ*M?a8Q@n)|L<_s3T8LX>T&pN zX(0G*8BjMGivF(4+6#kGnvudbY&8M@+-nwdJ{A8a3hej(?e6NQ0)A1I!QbBOb?Zw0jy`T!xlJ{2iSv=IpwQB3y~w`SvsFkBNcYP+=&mPh_gL3A07xftCI` z;b!dlFg;o}TUa&O1K&nahxG@FoPAy^Jc6Z}RLs}IZf}h{TFy%+b%Gz|%)c%QlQEw` zQ|@bhj{NrKU?}LXP&R5eVFE?Ul|36}V7)6Cciv6VS3dK(WxF)F?gJS5=L6dllXo=K zWXsvc`a42v#vafTgTE&hl3!2$4^)YwXTBB|*CY8mg@hL*!9(ZaFQ`M`95R2QPIDci zKcnvS=jJQnSj%Y*NJ9p`rI^vK;M6@WQVCF^YC0czx@);oNd511h&si;2$lH;TsHT3 z3Nf?seHgNr{B`!u6`SgNSN{9C0@a=}>*e6@2jMc{GC4Nf4N(>atsa^R8eTa5*wYmDs2BBKKTF6 zqeP1Nf1r{*PW+1r4{UkQZkGs}OSp1K;8Z9BSruLuQB3odL5oix6~B);-jIIjBvWQA x=gy&D3#?RLPFl%~aewagAM5=8ZeRaiM}_3Kbo=sG5dSIy(APE6se;=?{xA9@T)zMS literal 0 HcmV?d00001 diff --git a/slep000/proposal.rst b/slep000/proposal.rst new file mode 100644 index 0000000..48c9572 --- /dev/null +++ b/slep000/proposal.rst @@ -0,0 +1,298 @@ +.. _slep_000: + +============================== +SLEP000: SLEP and its workflow +============================== + +:Author: Adrin Jalali +:Status: Draft +:Type: Process +:Created: 2020-02-13 + +Abstract +######## + +This SLEP specifies details related to SLEP submission, review, and acceptance +process. + +Motivation +########## + +Without a predefined workflow, the discussions around a SLEP can be long and +consume a lot of energy for both the author(s) and the reviewers. The lack of a +known workflow also results in the SLEPs to take months (if not years) before +it is merged as ``Under Review``. The purpose of this SLEP is to lubricate and +ease the process of working on a SLEP, and make it a more enjoyable and +productive experience. This SLEP borrows the process used in PEPs and NEPs +which means there will be no ``Under Review`` status. + + +What is a SLEP? +############### + +SLEP stands for Scikit-Learn Enhancement Proposal, inspired from Python PEPs or +Numpy NEPs. A SLEP is a design document providing information to the +scikit-learn community, or describing a new feature for scikit-learn or its +processes or environment. The SLEP should provide a concise technical +specification of the proposed solution, and a rationale for the feature. 
+ +We intend SLEPs to be the primary mechanisms for proposing major new features, +for collecting community input on an issue, and for documenting the design +decisions that have gone into scikit-learn. The SLEP author is responsible for +building consensus within the community and documenting dissenting opinions. + +Because the SLEPs are maintained as text files in a versioned repository, their +revision history is the historical record of the feature proposal. + +SLEP Audience +############# + +The typical primary audience for SLEPs are the core developers of +``scikit-learn`` and technical committee, as well as contributors to the +project. However, these documents also serve the purpose of documenting the +changes and decisions to help users understand the changes and why they are +made. The SLEPs are available under `Scikit-learn enhancement proposals +`_. + +SLEP Types +########## + +There are three kinds of SLEPs: + +1. A Standards Track SLEP describes a new feature or implementation for +scikit-learn. + +2. An Informational SLEP describes a scikit-learn design issue, or provides +general guidelines or information to the scikit-learn community, but does not +propose a new feature. Informational SLEPs do not necessarily represent a +scikit-learn community consensus or recommendation, so users and implementers +are free to ignore Informational SLEPs or follow their advice. For instance, an +informational SLEP could be one explaining how people can write a third party +estimator, one to explain the usual process of adding a package to the contrib +org, or what our inclusion criteria are for scikit-learn and +scikit-learn-extra. + +3. A Process SLEP describes a process surrounding scikit-learn, or proposes a +change to (or an event in) a process. Process SLEPs are like Standards Track +SLEPs but apply to areas other than the scikit-learn library itself. They may +propose an implementation, but not to scikit-learn’s codebase; they require +community consensus. Examples include procedures, guidelines, changes to the +decision-making process and the governance document, and changes to the tools +or environment used in scikit-learn development. Any meta-SLEP is also +considered a Process SLEP. + + +SLEP Workflow +############# + +A SLEP starts with an idea, which usually is discussed in an issue or a pull +request on the main repo before submitting a SLEP. It is generally a good idea +for the author of the SLEP to gauge the viability and the interest of the +community before working on a SLEP, mostly to save author's time. + +A SLEP must have one or more champions: people who write the SLEP following the +SLEP template, shepherd the discussions around it, and seek consensus in the +community. + +The proposal should be submitted as a draft SLEP via a GitHub pull request to a +``slepXXX`` directory with the name ``proposal.rst`` where ``XXX`` is an +appropriately assigned three-digit number (e.g., ``slep000/proposal.rst``). The +draft must use the `SLEP — Template and Instructions +`_ +file. + +Once the PR for the SLEP is created, a post should be made to the mailing list +containing the sections up to “Backward compatibility”, with the purpose of +limiting discussion there to usage and impact. Discussion on the pull request +will have a broader scope, also including details of implementation. + +The first draft of the SLEP needs to be approved by at least one core developer +before being merged. Merging the draft does not mean it is accepted or is ready +for the vote. 
To this end, the SLEP draft is reviewed for structure, +formatting, and other errors. Approval criteria are: + +- The draft is sound and complete. The ideas must make technical sense. +- The initial PR reviewer(s) should not consider whether the SLEP seems likely + to be accepted. +- The title of the SLEP draft accurately describes its content. + +Reviewers are generally quite lenient about this initial review, expecting that +problems will be corrected by the further reviewing process. **Note**: Approval +of the SLEP draft is no guarantee that there are no embarrassing mistakes! +Ideally they're avoided, but they can also be fixed later in separate PRs. Once +approved by at least one core developer, the SLEP draft can be merged. +Additional PRs may be made by the champions to update or expand the SLEP, or by +maintainers to set its status, discussion URL, etc. + +Standards Track SLEPs (see bellow) consist of two parts, a design document and +a reference implementation. It is generally recommended that at least a +prototype implementation be co-developed with the SLEP, as ideas that sound +good in principle sometimes turn out to be impractical when subjected to the +test of implementation. Often it makes sense for the prototype implementation +to be made available as PR to the scikit-learn repo (making sure to +appropriately mark the PR as a WIP). + +Review and Resolution +--------------------- + +SLEPs are discussed on the mailing list or the PRs modifying the SLEP. The +possible paths of the status of SLEPs are as follows: + +.. image:: pep-0001-process_flow.png + :alt: SLEP process flow diagram + +All SLEPs should be created with the ``Draft`` status. + +Eventually, after discussion, there may be a consensus that the SLEP should be +accepted – see the next section for details. At this point the status becomes +``Accepted``. + +Once a SLEP has been ``Accepted``, the reference implementation must be +completed. When the reference implementation is complete and incorporated into +the main source code repository, the status will be changed to ``Final``. Since +most SLEPs deal with a part of scikit-learn's API, another way of viewing a +SLEP as ``Final`` is when its corresponding API interface is considered stable. + +To allow gathering of additional design and interface feedback before +committing to long term stability for a feature or API, a SLEP may also be +marked as ``Provisional``. This is short for "Provisionally Accepted", and +indicates that the proposal has been accepted for inclusion in the reference +implementation, but additional user feedback is needed before the full design +can be considered ``Final``. Unlike regular accepted SLEPs, provisionally +accepted SLEPs may still be ``Rejected`` or ``Withdrawn`` even after the +related changes have been included in a scikit-learn release. + +Wherever possible, it is considered preferable to reduce the scope of a +proposal to avoid the need to rely on the ``Provisional`` status (e.g. by +deferring some features to later SLEPs), as this status can lead to version +compatibility challenges in the wider scikit-learn ecosystem. + +A SLEP can also be assigned status ``Deferred``. The SLEP author or a core +developer can assign the SLEP this status when no progress is being made on the +SLEP. + +A SLEP can also be ``Rejected``. Perhaps after all is said and done it was not +a good idea. It is still important to have a record of this fact. 
The
``Withdrawn`` status is similar; it means that the SLEP author themselves has
decided that the SLEP is actually a bad idea, or has accepted that a competing
proposal is a better alternative.

When a SLEP is ``Accepted``, ``Rejected``, or ``Withdrawn``, the SLEP should be
updated accordingly. In addition to updating the status field, at the very
least the ``Resolution`` header should be added with a link to the relevant
thread in the mailing list archives or wherever the discussion happened.

SLEPs can also be ``Superseded`` by a different SLEP, rendering the original
obsolete. The ``Replaced-By`` and ``Replaces`` headers should be added to the
original and new SLEPs respectively.

``Process`` SLEPs may also have a status of ``Active`` if they are never meant
to be completed, e.g. SLEP 1 (this SLEP).

How a SLEP becomes Accepted
---------------------------

A SLEP is ``Accepted`` by the voting mechanism defined in the `governance model
`_. We
need a concrete way to tell whether consensus has been reached. When you think
a SLEP is ready to accept, create a PR changing the status of the SLEP to
``Accepted``, then send an email to the scikit-learn mailing list with a
subject like:

    [VOTE] Proposal to accept SLEP #:

In the body of your email, you should:

- link to the latest version of the SLEP and to the PR accepting the SLEP,

- briefly describe any major points of contention and how they were resolved,

- include a sentence like: “The vote will be closed in a month i.e. on
  <the_date>.”

Generally the SLEP author will be the one to send this email, but anyone can do
it; the important thing is to make sure that everyone knows when a SLEP is on
the verge of acceptance, and give them a final chance to respond.

In general, the goal is to make sure that the community has consensus, not to
provide a rigid policy for people to try to game. When in doubt, err on the
side of asking for more feedback and looking for opportunities to compromise.

If the final comment and voting period passes with the required majority, then
the SLEP can officially be marked ``Accepted``. The ``Resolution`` header
should link to the PR accepting the SLEP.

If the vote does not achieve the required majority, then the SLEP remains in
``Draft`` state, discussion continues as normal, and it can be proposed for
acceptance again later once the objections are resolved.

In unusual cases, at the request of the author, the scikit-learn technical
committee may be asked to decide whether a controversial SLEP is ``Accepted``,
put back to ``Draft`` with additional recommendations on how to reach
consensus, or definitively ``Rejected``. Please refer to the governance doc for
more details.

Maintenance
-----------

In general, Standards Track SLEPs are no longer modified after they have
reached the ``Final`` state, as the code and project documentation are
considered the ultimate reference for the implemented feature. However,
finalized Standards Track SLEPs may be updated as needed.

Process SLEPs may be updated over time to reflect changes to development
practices and other details. The precise process followed in these cases will
depend on the nature and purpose of the SLEP being updated.

Format and Template
-------------------

SLEPs are UTF-8 encoded text files using the `reStructuredText
<http://docutils.sourceforge.net/rst.html>`_ format.
Please see the `SLEP — +Template and Instructions +<https://github.com/scikit-learn/enhancement_proposals/blob/master/slep_template.rst>`_ +file and the `reStructuredTextPrimer +<https://www.sphinx-doc.org/en/stable/rest.html>`_ for more information. We use +`Sphinx <https://www.sphinx-doc.org/en/stable/>`_ to convert SLEPs to HTML for +viewing on the web. + +Header Preamble +--------------- + +Each SLEP must begin with a header preamble. The headers must appear in the +following order. Headers marked with * are optional. All other headers are +required:: + + :Author: <list of authors' real names and optionally, email addresses> + :Status: <Draft | Active | Accepted | Deferred | Rejected | + Withdrawn | Final | Superseded> + :Type: <Standards Track | Informational | Process> + :Created: <date created on, in yyyy-mm-dd format> + * :Requires: <slep numbers> + * :scikit-learn-Version: <version number> + * :Replaces: <slep number> + * :Replaced-By: <slep number> + * :Resolution: <url> + +The Author header lists the names, and optionally the email addresses of all +the authors of the SLEP. The format of the Author header value must be + + Random J. User <address@dom.ain> + +if the email address is included, and just + + Random J. User + +if the address is not given. If there are multiple authors, each should be on a +separate line. + +Copyright +--------- + +This document has been placed in the public domain [1]_. + +References and Footnotes +------------------------ + +.. [1] _Open Publication License: https://www.opencontent.org/openpub/ From c9e74b3e9e243a243414732b4538577d5b95925f Mon Sep 17 00:00:00 2001 From: "Thomas J. Fan" <thomasjpfan@gmail.com> Date: Thu, 20 Jan 2022 13:32:32 -0500 Subject: [PATCH 070/118] Rewrite SLEP006 to be easier to read and vote on (#55) Co-authored-by: Joel Nothman <joel.nothman@gmail.com> Co-authored-by: adrinjalali <adrin.jalali@gmail.com> --- slep006/proposal.rst | 602 +++++++++++++++++-------------------------- 1 file changed, 236 insertions(+), 366 deletions(-) diff --git a/slep006/proposal.rst b/slep006/proposal.rst index 7cc1008..d68a0d8 100644 --- a/slep006/proposal.rst +++ b/slep006/proposal.rst @@ -1,385 +1,255 @@ .. _slep_006: -========================================== -SLEP006: Routing sample-aligned meta-data -========================================== +========================= +SLEP006: Metadata Routing +========================= -:Author: Joel Nothman, Adrin Jalali, Alex Gramfort +:Author: Joel Nothman, Adrin Jalali, Alex Gramfort, Thomas J. Fan :Status: Under Review :Type: Standards Track :Created: 2019-03-07 -Scikit-learn has limited support for information pertaining to each sample -(henceforth "sample properties") to be passed through an estimation pipeline. -The user can, for instance, pass fit parameters to all members of a -FeatureUnion, or to a specified member of a Pipeline using dunder (``__``) -prefixing:: +Abstract +-------- - >>> from sklearn.pipeline import Pipeline - >>> from sklearn.linear_model import LogisticRegression - >>> pipe = Pipeline([('clf', LogisticRegression())]) - >>> pipe.fit([[1, 2], [3, 4]], [5, 6], - ... clf__sample_weight=[.5, .7]) # doctest: +SKIP +This SLEP proposes an API to configure estimators, scorers, and CV splitters to +request metadata when calling methods such as `fit`, `predict`, etc. +Meta-estimators or functions that wrap estimators, scorers, or CV splitters will +use this API to pass in the requested metadata. 
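
To make the abstract concrete, here is a minimal sketch of the two sides of
the API, in the doctest style used throughout this proposal. The spelling of
``fit_requests`` and of the ``props`` argument follows the examples given
later in this proposal; ``X``, ``y`` and ``my_weights`` are placeholder data::

    >>> # a consumer is configured to request metadata for its ``fit`` method
    >>> log_reg = LogisticRegression().fit_requests(sample_weight=True)
    >>> # a router (here ``cross_validate``) passes on the requested metadata
    >>> cross_validate(log_reg, X, y, props={"sample_weight": my_weights})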
-Several other meta-estimators, such as GridSearchCV, support forwarding these -fit parameters to their base estimator when fitting. Yet a number of important -use cases are currently not supported. - -Features we currently do not support and wish to include: - -* passing sample properties (e.g. `sample_weight`) to a scorer used in - cross-validation -* passing sample properties (e.g. `groups`) to a CV splitter in nested cross - validation -* passing sample properties (e.g. `sample_weight`) to some - scorers and not others in a multi-metric cross-validation setup -* passing sample properties to non-fit methods, for - instance to index grouped samples that are to be treated as a single sequence - in prediction. - -The last two items are considered in the API design, yet will not be considered -in the first version of the implementation. - -This SLEP proposes an API where users can request certain metadata to be -passed to its consumer by the meta-estimator it is wrapped in. - -The following example illustrates the new `request_metadata` parameter for -making scorers, the `request_sample_weight` estimator method, the `metadata` -parameter replacing `fit_params` in `cross_validate`, and the automatic passing -of `groups` to `GroupKFold` to enable nested grouped cross validation. Here, -the user requests that the `sample_weight` metadata key should be passed to a -customised accuracy scorer (although a predefined 'weighted_accuracy' scorer -could be introduced), and to the LogisticRegressionCV. -`GroupKFold` requests `groups` by default. - - >>> from sklearn.metrics import accuracy_score, make_scorer - >>> from sklearn.model_selection import cross_validate, GroupKFold - >>> from sklearn.linear_model import LogisticRegressionCV - >>> weighted_acc = make_scorer(accuracy_score, - ... request_metadata=['sample_weight']) - >>> group_cv = GroupKFold() - >>> lr = LogisticRegressionCV( - ... cv=group_cv, - ... scoring=weighted_acc, - ... ).request_sample_weight(fit=True) - >>> cross_validate(lr, X, y, cv=group_cv, - ... metadata={'sample_weight': my_weights, - ... 'groups': my_groups}, - ... scoring=weighted_acc) - -Naming ------- - -"Sample props" has become a name understood internally to the Scikit-learn -development team. For ongoing usage we propose to use as naming ``metadata``. - -Definitions ------------ - -consumer - An estimator, scorer, splitter, etc., that receives and can make use of - one or more passed metadata. -key - A label passed along with sample metadata to indicate how it should be - interpreted (e.g. "weight"). - -History -------- - -This version was drafted after a discussion of the issue and potential -solutions at the February 2019 development sprint in Paris. - -Supersedes `SLEP004 -<https://github.com/scikit-learn/enhancement_proposals/tree/master/slep004>`_ -with greater depth of desiderata and options. 
- -Primary related issues and pull requests include: - -- :issue:`4497`: Overarching issue, - "Consistent API for attaching properties to samples" - by :user:`GaelVaroquaux` -- :pr:`4696` A first implementation by :user:`amueller` -- `Discussion towards SLEP004 - <https://github.com/scikit-learn/enhancement_proposals/pull/6>`__ initiated - by :user:`tguillemot` -- :pr:`9566` Another implementation (solution 3 from this SLEP) - by :user:`jnothman` -- :pr:`16079` Another implementation (solution 4 from this SLEP) - by :user:`adrinjalali` - -Other related issues include: :issue:`1574`, :issue:`2630`, :issue:`3524`, -:issue:`4632`, :issue:`4652`, :issue:`4660`, :issue:`4696`, :issue:`6322`, -:issue:`7112`, :issue:`7646`, :issue:`7723`, :issue:`8127`, :issue:`8158`, -:issue:`8710`, :issue:`8950`, :issue:`11429`, :issue:`12052`, :issue:`15282`, -:issue:`15370`, :issue:`15425`, :issue:`18028`. - -Desiderata ----------- - -The following aspects have been considered to propose the following solution: - -.. XXX : maybe phrase this in an affirmative way - -Usability - Can the use cases be achieved in succinct, readable code? Can common use - cases be achieved with a simple recipe copy-pasted from a QA forum? -Brittleness - If a metadata is being routed through a Pipeline, does changing the - structure of the pipeline (e.g. adding a layer of nesting) require rewriting - other code? -Error handling - If the user mistypes the name of a sample metadata, or misspecifies how it - should be routed to a consumer, will an appropriate exception be raised? -Impact on meta-estimator design - How much meta-estimator code needs to change? How hard will it be to - maintain? -Impact on estimator design - How much will the proposal affect estimator developers? -Backwards compatibility - Can existing behavior be maintained? -Forwards compatibility - Is the solution going to make users' code more - brittle with future changes? (For example, will a user's pipeline change - behaviour radically when `sample_weight` is implemented on some estimator) -Introspection - If sensible to do so (e.g. for improved efficiency), can a - meta-estimator identify whether its base estimator (recursively) would - handle some particular sample metadata. - -.. (e.g. so a meta-estimator can choose -.. between weighting and resampling, or for automated invariance testing)? - -Keyword arguments vs. a single argument ---------------------------------------- - -Currently, sample metadata are provided as keyword arguments to a `fit` -method. In redeveloping sample metadata, we can instead accept a single -parameter (named `metadata` or `sample_metadata`, for example) which maps -string keys to arrays of the same length (a "DataFrame-like"). - -Using single argument:: - - >>> gs.fit(X, y, metadata={'groups': groups, 'weight': weight}) - -vs. using keyword arguments:: - - >>> gs.fit(X, y, groups=groups, sample_weight=sample_weight) - -Advantages of a single argument: - -* we would be able to redefine the default routing of weights etc. without being - concerned by backwards compatibility. -* we could consider the use of keys that are not limited to strings or valid - identifiers (and hence are not limited to using ``_`` as a delimiter). -* we could also consider kwargs to `fit` that are not sample-aligned - (e.g. `with_warm_start`, `feature_names_in`, `feature_meta`) without - restricting valid keys for sample metadata. 
- -Advantages of multiple keyword arguments: - -* succinct -* explicit function signatures relying on interpreter checks on calls -* possible to maintain backwards compatible support for `sample_weight`, etc. -* we do not need to handle cases for whether or not some estimator expects a - `metadata` argument. - -In this SLEP, we will propose the solution based on keyword arguments. - -Test case setup ---------------- - -Case A -~~~~~~ - -Cross-validate a ``LogisticRegressionCV(cv=GroupKFold(), scoring='accuracy')`` -with weighted scoring and weighted fitting, while using groups in splitter. - -Error handling: we would guarantee that if the user misspelled `sample_weight` -as `sample_eight` a meaningful error is raised. - -Case B -~~~~~~ - -Cross-validate a ``LogisticRegressionCV(cv=GroupKFold(), scoring='accuracy')`` -with weighted scoring and unweighted fitting. - -Error handling: if `sample_weight` is required only in scoring and not in fit -of the sub-estimator the user should make explicit that it is not required -by the sub-estimator. - -Case C -~~~~~~ - -Extend Case A to apply an unweighted univariate feature selector in a -``Pipeline``. This allows to check pipelines where only some steps -require a metadata. - -Case D -~~~~~~ - -Different weights for scoring and for fitting in Case A. - -Motivation: You can have groups used in a CV, which contains batches of data as groups, -and then an estimator which takes groups as sensitive attributes to a -fairness related model. Also in a third party library an estimator may have -the same name for a parameter, but with completely different semantics. - -.. TODO: case involving props passed at test time, e.g. to pipe.transform - to be considered later - -Case E -~~~~~~ +Motivation and Scope +-------------------- -``LogisticRegression()`` with a weighted ``.score()`` method. +Scikit-learn has limited support for passing around information that is not +`(X, y)`. For example, to pass `sample_weight` to a step of a `Pipeline`, one +needs to specify the step using dunder (`__`) prefixing:: -Solution sketches will import these definitions: + >>> pipe = Pipeline([..., ('clf', LogisticRegression())]) + >>> pipe.fit(X, y, clf__sample_weight=sample_weight) -.. literalinclude:: defs.py - -The following solution has emerged as the way to move forward, -yet others where considered. See :ref:`slep_006_other`. - -Solution: Each consumer requests --------------------------------- - -.. note:: - - This solution was known as solution 4 during the discussions. - -A meta-estimator provides along to its children only what they request. -A meta-estimator needs to request, on behalf of its children, -any metadata that descendant consumers request. - -Each object that could receive metadata should have a method called -`get_metadata_request()` which returns a dict that specifies which -metadata is consumed by each of its methods (keys of this dictionary -are therefore method names, e.g. `fit`, `transform` etc.). -Estimators supporting weighted fitting may return `{}` by default, but have a -method called `request_sample_weight` which allows the user to specify -the requested `sample_weight` in each of its methods. - -`Group*CV` splitters default to returning `{'split': 'groups'}`. - -`make_scorer` accepts `request_metadata` as keyword parameter through -which user can specify what metadata is requested. - -Advantages: - -* This solution does not affect legacy estimators, since no metadata will be - passed when a metadata request is not available. 
-* The implementation changes in meta-estimators is easy to provide via two - helpers ``build_method_metadata_params(children, routing, metadata)`` - and ``build_router_metadata_request(children, routing)``. Here ``routing`` - consists of a list of requests between the meta-estimator and its - children. Note that this construct will be not visible to scikit-learn - users, yet should be understood by third party developers developping - custom meta-estimators. -* Easy to reconfigure what metadata an estimator gets in a grid search. -* Could make use of existing `**fit_params` syntax rather than introducing new - `metadata` argument to `fit`. - -Disadvantages: - -* This will require modifying every estimator that may want any metadata, - as well as all meta-estimators. Yet, this can be achieved with a mixin class - to add metadata-request support to a legacy estimator. -* Aliasing is a bit confusing in this design, in that the consumer still - accepts the fit param by its original name (e.g. `sample_weight`) even if it - has a request that specifies a different key given to the meta-estimator (e.g. - `my_sample_weight`). This design has the advantage that the handling of - metadata within a consumer is simple and unchanged; the complexity is in - how it is forwarded to the sub-estimator by the meta-estimators. While - it may be conceptually difficult for users to understand, this may be - acceptable, as an advanced feature. -* For estimators to be cloned, this request information needs to be cloned with - it. This implies that `clone` needs to be extended to explicitly copy - request information. - -Proposed public syntax: - -* `BaseEstimator` will have a method `get_metadata_request` -* Estimators that can consume `sample_weight` will have a `request_sample_weight` - method available via a mixin. -* `make_scorer` will have a `request_metadata` parameter to specify the requested - metadata by the scorer. -* `get_metadata_request` will return a dict, whose keys are names of estimator - methods (`fit`, `predict`, `transform` or `inverse_transform`) and values are - dictionaries. These dictionaries map the input parameter names to requested - metadata keys. Example: - - >>> estimator.get_metadata_request() - {'fit': {'my_sample_weight': {'sample_weight'}}, 'predict': {}, 'transform': {}, - 'score': {}, 'split': {}, 'inverse_transform': {}} - -* Methods like `request_sample_weight` will have a signature such as: - `request_sample_weight(self, *, fit=None, score=None)` where fit keyword - parameter can be `None`, `True`, `False` or a `str`. `str` allows here - to request a metadata whose name is different from the keyword parameter. - Here ``None`` is a default, and ``False`` has a different semantic which - is that the metadata should not be provided. - -* `Group*` CV splitters will by default request the 'groups' metadata, but its - mapping can be changed with their `set_metadata_request` method. - -Test cases: - -.. literalinclude:: cases_opt4b.py - -.. note:: if an estimator requests a metadata, we consider that it cannot - be ``None``. +Several other meta-estimators, such as `GridSearchCV`, support forwarding these +fit parameters to their base estimator when fitting. Yet a number of important +use cases are currently not supported: + +* Passing metadata (e.g. `sample_weight`) to a scorer used in cross-validation +* Passing metadata (e.g. `groups`) to a CV splitter in nested cross-validation +* Passing metadata (e.g. 
`sample_weight`) to some scorers and not others in + multi-metric cross-validation. This is also required to handle fairness + related metrics which usually expect one or more sensitive attributes to be + passed to them along with the data. +* Passing metadata to non-`fit` methods. For example, passing group indices for + samples that are to be treated as a single sequence in prediction, or passing + sensitive attributes to `predict` or `transform` of a fairness related + estimator. + +We define the following terms in this proposal: + +* **consumer**: An object that receives and consumes metadata, such as + estimators, scorers, or CV splitters. + +* **router**: An object that passes metadata to a **consumer** or + another **router**. Examples of **routers** include meta-estimators or + functions. (For example `GridSearchCV` or `cross_validate` route sample + weights, cross validation groups, etc. to **consumers**) + +This SLEP proposes to add + +* `get_metadata_routing` to all **consumers** and **routers** + (i.e. all estimators, scorers, and splitters supporting this API) +* `*_requests` to consumers (including estimators, scorers, and CV splitters), + where `*` is a method that requires metadata. (e.g. `fit_requests`, + `score_requests`, `transform_requests`, etc.) + +For example, `fit_requests` configures an estimator to request metadata:: + + >>> log_reg = LogisticRegression().fit_requests(sample_weight=True) + +`get_metadata_routing` are used by **routers** to inspect the metadata needed +by **consumers**. `get_metadata_routing` returns a `MetadataRouter` or a +`MetadataRequest` object that stores and handles metadata routing. See the +draft implementation for more implementation details. + +Detailed description +-------------------- + +This SLEP unlocks many machine learning use cases that were not possible +before. In this section, we will focus on some workflows that are made possible +by this SLEP. + +Nested Grouped Cross Validation +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The following examples demonstrates nested grouped cross validation +where a scorer and an estimator requests `sample_weight` and `GroupKFold` +requests `groups` by default:: + + >>> weighted_acc = make_scorer(accuracy_score).score_request(sample_weight=True) + >>> log_reg = (LogisticRegressionCV(cv=GroupKFold(), scoring=weighted_acc) + ... .fit_requests(sample_weight=True)) + >>> cv_results = cross_validate( + ... log_reg, X, y, + ... cv=GroupKFold(), + ... props={"sample_weight": my_weights, "groups": my_groups}, + ... scoring=weighted_acc) + +To support unweighted fitting and weighted scoring, metadata is set to `False` +in `request_for_fit`:: + + >>> log_reg = (LogisticRegressionCV(cv=group_cv, scoring=weighted_acc) + ... .fit_request(sample_weight=False)) + >>> cross_validate( + ... log_reg, X, y, + ... cv=GroupKFold(), + ... props={'sample_weight': weights, 'groups': groups}, + ... scoring=weighted_acc) + +Unweighted Feature selection +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +**Consumers** that do not accept weights during fitting such as `SelectKBest` +will _not_ be routed weights:: + + >>> log_reg = (LogisticRegressionCV(cv=GroupKFold(), scoring=weighted_acc) + ... 
.fit_requests(sample_weight=True)) + >>> sel = SelectKBest(k=2) + >>> pipe = make_pipeline(sel, log_reg) + >>> pipe.fit(X, y, sample_weight=weights, groups=groups) + +Note that if a **consumer** or a **router** starts accepting and consuming a +certain metadata, the developer API enables developers to raise a warning +and avoid silent behavior changes in users' code. See the draft implementation +for more details. + +Different Scoring and Fitting Weights +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +We can pass different weights for scoring and fitting by using an aliases. In +this example, `scoring_weight` is passed to the scoring and `fitting_weight` +is passed to `LogisticRegressionCV`:: + + >>> weighted_acc = (make_scorer(accuracy_score) + ... .score_requests(sample_weight="scoring_weight")) + >>> log_reg = (LogisticRegressionCV(cv=GroupKFold(), scoring=weighted_acc) + ... .fit_requests(sample_weight="fitting_weight")) + >>> cv_results = cross_validate( + ... log_reg, X, y, + ... cv=GroupKFold(), + ... props={"scoring_weight": my_weights, + ... "fitting_weight": my_other_weights, + ... "groups": my_groups}, + ... scoring=weighted_acc) + +Nested Grouped Cross Validation with SearchCV +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Since `GroupKFold` requests group metadata by default, `GroupKFold` can be +passed to multiple **consumers** to enable nested grouped cross validation. In +this example, both `RandomizedSearchCV` and `cross_validate` sets +`cv=GroupKFold()` which enables grouped CV in the outer loop (`cross_validate`) +and the inner random search:: + + >>> log_reg = LogisticRegression() + >>> distributions = {"C": uniform(loc=0, scale=4), + ... "penalty": ['l2', 'l1']} + >>> random_search = RandomizedSearchCV(log_reg, distributions, cv=GroupKFold()) + >>> cv_results = cross_validate( + ... log_reg, X, y, + ... cv=GroupKFold(), + ... props={"groups": my_groups}) + +Implementation +-------------- + +This SLEP has a draft implementation at :pr:`22083` by :user:`adrinjalali`. The +implementation provides developer utilities that is used by scikit-learn and +available to third-party estimators for adopting this SLEP. Specifically, the +draft implementation makes it easier to define `get_metadata_routing` and +`*_requests` for **consumers** and **routers**. Backward compatibility ---------------------- -Under this proposal, consumer behaviour will be backwards compatible, but -meta-estimators will change their routing behaviour. We will not support anymore -the dunder (`__`) syntax, and enforce the use of explicit request method calls. - -By default, `sample_weight` will not be requested by estimators that support -it. This ensures that addition of `sample_weight` support to an estimator will -not change its behaviour. - -During a deprecation period, fit_params using the dunder syntax will still -work, yet will raise deprecation warnings while preventing the dual use of the -new syntax. In other words it will not be possible to mix both old and new -behaviour. At completion of the deprecation period, the legacy handling -will cease. - -Similarly, during a deprecation period, `fit_params` in GridSearchCV and -related utilities will be routed to the estimator's `fit` by default, per -incumbent behaviour. After the deprecation period, an error will be raised for -any params not explicitly requested. 
See following examples: - ->>> # This would raise a deprecation warning, that provided metadata ->>> # is not requested ->>> GridSearchCV(LogisticRegression()).fit(X, y, sample_weight=sw) ->>> ->>> # this would work with no warnings ->>> GridSearchCV(LogisticRegression().request_sample_weight( -... fit=True) -... ).fit(X, y, sample_weight=sw) ->>> ->>> # This will raise that LR could accept `sample_weight`, but has ->>> # not been specified by the user ->>> GridSearchCV( -... LogisticRegression(), -... scoring=make_scorer(accuracy_score, -... request_metadata=['sample_weight']) -... ).fit(X, y, sample_weight=sw) - -Grouped cross validation splitters will request `groups` since they were -previously unusable in a nested cross validation context, so this should not -often create backwards incompatibilities, except perhaps where a fit param -named `groups` served another purpose. - -Discussion ----------- - -One benefit of the explicitness in this proposal is that even if it makes use of -`**kw` arguments, it does not preclude keywords arguments serving other -purposes in addition. That is, in addition to requesting sample metadata, a -future proposal could allow estimators to request feature metadata or other -keys. +Scikit-learn's meta-estimators will deprecate the dunder (`__`) syntax for +routing and enforce explicit request method calls. During the deprecation +period, using dunder syntax routing and explicit request calls together will +raise an error. + +During the deprecation period, meta-estimators such as `GridSearchCV` will +route `fit_params` to the inner estimators' `fit` by default, but +a deprecation warning is raised:: + + >>> # Deprecation warning, stating that the provided metadata is not requested + >>> GridSearchCV(LogisticRegression(), ...).fit(X, y, sample_weight=sw) + +To avoid the warning, one would need to specify the request in +`LogisticRegressionCV`:: + + >>> grid = GridSearchCV(LogisticRegression().fit_requests(sample_weight=True), ...) + >>> grid.fit(X, y, sample_weight=sw) + +Meta-estimators such as `GridSearchCV` will check that the metadata requested +and will error when metadata is passed in and the inner estimator is +not configured to request it:: + + >>> weighted_acc = make_scorer(accuracy_score).score_request(sample_weight=True) + >>> log_reg = LogisticRegression() + >>> grid = GridSearchCV(log_reg, ..., scoring=weighted_scorer) + >>> + >>> # Raise a TypeError that log_reg is not specified with any routing + >>> # metadata for `sample_weight`, but sample_weight has been passed in to + >>> # `grid.fit`. + >>> grid.fit(X, y, sample_weight=sw) + +To avoid the error, `LogisticRegression` must specify its metadata request by calling +`fit_requests`:: + + >>> # Request sample weights + >>> log_reg_weights = LogisticRegression().fit_requests(sample_weight=True) + >>> grid = GridSearchCV(log_reg_with_weights, ...) + >>> grid.fit(X, , sample_weight=sw) + >>> + >>> # Do not request sample_weights + >>> log_reg_no_weights = LogisticRegression().fit_requests(sample_weight=False) + >>> grid = GridSearchCV(log_reg_no_weights, ...) + >>> grid.fit(X, , sample_weight=sw) + +Third-party estimators will need to adopt this SLEP in order to support metadata +routing, while the dunder syntax is deprecated. Our implementation will provide +developer APIs to trigger warnings and errors as described above to help with +adopting this SLEP. 
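
As a rough sketch of what adoption could look like for a third-party
**consumer**, the estimator only needs to accept the metadata it consumes in
the relevant method. The assumption below is that inheriting from
``BaseEstimator`` provides the request machinery (``fit_requests`` and
``get_metadata_routing``); ``ThirdPartyClassifier`` is a hypothetical name and
the exact developer API is left to the draft implementation::

    from sklearn.base import BaseEstimator, ClassifierMixin

    class ThirdPartyClassifier(ClassifierMixin, BaseEstimator):
        """A consumer: it accepts and uses ``sample_weight`` in ``fit``."""

        def fit(self, X, y, sample_weight=None):
            # Fit using sample_weight; routing metadata to this method is the
            # responsibility of the routers wrapping this estimator.
            ...
            return self

    # Users then opt in explicitly, exactly as with built-in estimators
    # (assuming the request machinery is inherited from ``BaseEstimator``):
    clf = ThirdPartyClassifier().fit_requests(sample_weight=True)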
+ +Alternatives +------------ + +Over the years, there has been many proposed alternatives before we landed +on this SLEP: + +* :pr:`4696` A first implementation by :user:`amueller` +* `Discussion towards SLEP004 + <https://github.com/scikit-learn/enhancement_proposals/pull/6>`__ initiated + by :user:`tguillemot`. +* :pr:`9566` Another implementation (solution 3 from this SLEP) + by :user:`jnothman` +* This SLEP has emerged from many alternatives that is seen at + :ref:`slep_006_other`. + +Discussion & Related work +------------------------- + +This SLEP was drafted based on the discussions of potential solutions +at the February 2019 development sprint in Paris. The overarching issue is +fond at "Consistent API for attaching properties to samples" at :issue:`4497`. + +Related issues and discussions include: :issue:`1574`, :issue:`2630`, +:issue:`3524`, :issue:`4632`, :issue:`4652`, :issue:`4660`, :issue:`4696`, +:issue:`6322`, :issue:`7112`, :issue:`7646`, :issue:`7723`, :issue:`8127`, +:issue:`8158`, :issue:`8710`, :issue:`8950`, :issue:`11429`, :issue:`12052`, +:issue:`15282`, :issue:`15370`, :issue:`15425`, :issue:`18028`. + +One benefit of the explicitness in this proposal is that even if it makes use +of `**kwarg` arguments, it does not preclude keywords arguments serving other +purposes. In addition to requesting sample metadata, a future proposal could +allow estimators to request feature metadata or other keys. References and Footnotes ------------------------ From c583d69437d01881d3d097d9e1bcdc7b2790374c Mon Sep 17 00:00:00 2001 From: Joel Nothman <joel.nothman@gmail.com> Date: Tue, 1 Feb 2022 07:58:36 +1100 Subject: [PATCH 071/118] rst syntax --- slep003/proposal.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/slep003/proposal.rst b/slep003/proposal.rst index 589a168..d4bf10f 100644 --- a/slep003/proposal.rst +++ b/slep003/proposal.rst @@ -4,7 +4,7 @@ SLEP003: Consistent inspection for transformers =============================================== -. topic:: **Summary** +.. topic:: **Summary** Inspect transformers' output shape and dependence on input features consistently with From d8c3f962d695f36e72577cf9570d489823f72dc5 Mon Sep 17 00:00:00 2001 From: Joel Nothman <joel.nothman@gmail.com> Date: Fri, 4 Feb 2022 01:11:22 +1100 Subject: [PATCH 072/118] typos in slep006 --- slep006/proposal.rst | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/slep006/proposal.rst b/slep006/proposal.rst index d68a0d8..d52984a 100644 --- a/slep006/proposal.rst +++ b/slep006/proposal.rst @@ -110,7 +110,7 @@ Unweighted Feature selection will _not_ be routed weights:: >>> log_reg = (LogisticRegressionCV(cv=GroupKFold(), scoring=weighted_acc) - ... .fit_requests(sample_weight=True)) + ... .fit_requests(sample_weight=True)) >>> sel = SelectKBest(k=2) >>> pipe = make_pipeline(sel, log_reg) >>> pipe.fit(X, y, sample_weight=weights, groups=groups) @@ -128,9 +128,9 @@ this example, `scoring_weight` is passed to the scoring and `fitting_weight` is passed to `LogisticRegressionCV`:: >>> weighted_acc = (make_scorer(accuracy_score) - ... .score_requests(sample_weight="scoring_weight")) + ... .score_requests(sample_weight="scoring_weight")) >>> log_reg = (LogisticRegressionCV(cv=GroupKFold(), scoring=weighted_acc) - ... .fit_requests(sample_weight="fitting_weight")) + ... .fit_requests(sample_weight="fitting_weight")) >>> cv_results = cross_validate( ... log_reg, X, y, ... 
cv=GroupKFold(), @@ -206,12 +206,12 @@ To avoid the error, `LogisticRegression` must specify its metadata request by ca >>> # Request sample weights >>> log_reg_weights = LogisticRegression().fit_requests(sample_weight=True) >>> grid = GridSearchCV(log_reg_with_weights, ...) - >>> grid.fit(X, , sample_weight=sw) + >>> grid.fit(X, y, sample_weight=sw) >>> >>> # Do not request sample_weights >>> log_reg_no_weights = LogisticRegression().fit_requests(sample_weight=False) >>> grid = GridSearchCV(log_reg_no_weights, ...) - >>> grid.fit(X, , sample_weight=sw) + >>> grid.fit(X, y, sample_weight=sw) Third-party estimators will need to adopt this SLEP in order to support metadata routing, while the dunder syntax is deprecated. Our implementation will provide From 4ca3a1eda6fb59d8534e7a0b75ab82f2e4a7c369 Mon Sep 17 00:00:00 2001 From: Joel Nothman <joel.nothman@gmail.com> Date: Fri, 4 Feb 2022 01:14:08 +1100 Subject: [PATCH 073/118] SLEP006 typo --- slep006/proposal.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/slep006/proposal.rst b/slep006/proposal.rst index d52984a..7ebcd83 100644 --- a/slep006/proposal.rst +++ b/slep006/proposal.rst @@ -93,7 +93,7 @@ requests `groups` by default:: ... scoring=weighted_acc) To support unweighted fitting and weighted scoring, metadata is set to `False` -in `request_for_fit`:: +in `fit_request`:: >>> log_reg = (LogisticRegressionCV(cv=group_cv, scoring=weighted_acc) ... .fit_request(sample_weight=False)) From 8252e4cd1374d27303462da7085923ebaecd26e4 Mon Sep 17 00:00:00 2001 From: Joel Nothman <joel.nothman@gmail.com> Date: Fri, 4 Feb 2022 01:18:20 +1100 Subject: [PATCH 074/118] SLEP006 typos --- slep006/proposal.rst | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/slep006/proposal.rst b/slep006/proposal.rst index 7ebcd83..0a5d45e 100644 --- a/slep006/proposal.rst +++ b/slep006/proposal.rst @@ -161,7 +161,7 @@ Implementation -------------- This SLEP has a draft implementation at :pr:`22083` by :user:`adrinjalali`. The -implementation provides developer utilities that is used by scikit-learn and +implementation provides developer utilities that are used by scikit-learn and available to third-party estimators for adopting this SLEP. Specifically, the draft implementation makes it easier to define `get_metadata_routing` and `*_requests` for **consumers** and **routers**. @@ -182,12 +182,12 @@ a deprecation warning is raised:: >>> GridSearchCV(LogisticRegression(), ...).fit(X, y, sample_weight=sw) To avoid the warning, one would need to specify the request in -`LogisticRegressionCV`:: +`LogisticRegression`:: >>> grid = GridSearchCV(LogisticRegression().fit_requests(sample_weight=True), ...) >>> grid.fit(X, y, sample_weight=sw) -Meta-estimators such as `GridSearchCV` will check that the metadata requested +Meta-estimators such as `GridSearchCV` will check which metadata is requested, and will error when metadata is passed in and the inner estimator is not configured to request it:: @@ -221,7 +221,7 @@ adopting this SLEP. Alternatives ------------ -Over the years, there has been many proposed alternatives before we landed +Over the years, there have been many proposed alternatives before we landed on this SLEP: * :pr:`4696` A first implementation by :user:`amueller` @@ -230,7 +230,7 @@ on this SLEP: by :user:`tguillemot`. 
* :pr:`9566` Another implementation (solution 3 from this SLEP) by :user:`jnothman` -* This SLEP has emerged from many alternatives that is seen at +* This SLEP has emerged from many alternatives detailed at :ref:`slep_006_other`. Discussion & Related work @@ -238,7 +238,7 @@ Discussion & Related work This SLEP was drafted based on the discussions of potential solutions at the February 2019 development sprint in Paris. The overarching issue is -fond at "Consistent API for attaching properties to samples" at :issue:`4497`. +found at "Consistent API for attaching properties to samples" at :issue:`4497`. Related issues and discussions include: :issue:`1574`, :issue:`2630`, :issue:`3524`, :issue:`4632`, :issue:`4652`, :issue:`4660`, :issue:`4696`, From 03ef2e47c30fd51a32ba8831f8080aeb956a09fc Mon Sep 17 00:00:00 2001 From: Joel Nothman <joel.nothman@gmail.com> Date: Fri, 4 Feb 2022 01:22:05 +1100 Subject: [PATCH 075/118] SLEP000 typo --- slep000/proposal.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/slep000/proposal.rst b/slep000/proposal.rst index 48c9572..dd8c4e4 100644 --- a/slep000/proposal.rst +++ b/slep000/proposal.rst @@ -295,4 +295,4 @@ This document has been placed in the public domain [1]_. References and Footnotes ------------------------ -.. [1] _Open Publication License: https://www.opencontent.org/openpub/ +.. [1] Open Publication License: https://www.opencontent.org/openpub/ From 1ef475b6e0cbfc126499a8a78355464fc1ec1523 Mon Sep 17 00:00:00 2001 From: Joel Nothman <joel.nothman@gmail.com> Date: Fri, 4 Feb 2022 02:51:24 +1100 Subject: [PATCH 076/118] SLEP006: correct terminology use (#66) Here GroupKFold is the consumer. It knows what `groups` means. The things that provide it with `groups` do not know what `groups` means, only that `GroupKFold` requests it. Hence they are routers. --- slep006/proposal.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/slep006/proposal.rst b/slep006/proposal.rst index 0a5d45e..f339ebd 100644 --- a/slep006/proposal.rst +++ b/slep006/proposal.rst @@ -142,9 +142,9 @@ is passed to `LogisticRegressionCV`:: Nested Grouped Cross Validation with SearchCV ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Since `GroupKFold` requests group metadata by default, `GroupKFold` can be -passed to multiple **consumers** to enable nested grouped cross validation. In -this example, both `RandomizedSearchCV` and `cross_validate` sets +Since `GroupKFold` requests group metadata by default, `GroupKFold` instances can +be passed to multiple **routers** to enable nested grouped cross validation. 
In +this example, both `RandomizedSearchCV` and `cross_validate` set `cv=GroupKFold()` which enables grouped CV in the outer loop (`cross_validate`) and the inner random search:: From 939dc553b731d271cdae69d1967477f605c8f893 Mon Sep 17 00:00:00 2001 From: Adrin Jalali <adrin.jalali@gmail.com> Date: Fri, 4 Feb 2022 11:19:55 +0100 Subject: [PATCH 077/118] [VOTE] Accept SLEP000 (#64) * Accept SLEP000 * edit proposal, mark as accepted --- index.rst | 1 + slep000/proposal.rst | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/index.rst b/index.rst index e20f05f..9d3499c 100644 --- a/index.rst +++ b/index.rst @@ -9,6 +9,7 @@ :maxdepth: 1 :caption: Accepted + slep000/proposal slep007/proposal slep009/proposal slep010/proposal diff --git a/slep000/proposal.rst b/slep000/proposal.rst index dd8c4e4..7c7aeda 100644 --- a/slep000/proposal.rst +++ b/slep000/proposal.rst @@ -5,7 +5,7 @@ SLEP000: SLEP and its workflow ============================== :Author: Adrin Jalali -:Status: Draft +:Status: Accepted :Type: Process :Created: 2020-02-13 From c86f619fc2a4779b352fde3a08535d885fc47fc9 Mon Sep 17 00:00:00 2001 From: Adrin Jalali <adrin.jalali@gmail.com> Date: Mon, 21 Feb 2022 13:45:56 +0100 Subject: [PATCH 078/118] ACCEPT: SLEP006 - Metadata Routing (#65) * accept SLEP006 * props -> metadata * Andy's comments * add type of the metadata to the info * method_requests -> set_method_request --- index.rst | 2 +- slep006/proposal.rst | 64 ++++++++++++++++++++++++++------------------ 2 files changed, 39 insertions(+), 27 deletions(-) diff --git a/index.rst b/index.rst index 9d3499c..da549b0 100644 --- a/index.rst +++ b/index.rst @@ -10,6 +10,7 @@ :caption: Accepted slep000/proposal + slep006/proposal slep007/proposal slep009/proposal slep010/proposal @@ -18,7 +19,6 @@ :maxdepth: 1 :caption: Under review - slep006/proposal slep012/proposal slep013/proposal diff --git a/slep006/proposal.rst b/slep006/proposal.rst index f339ebd..80b6b9b 100644 --- a/slep006/proposal.rst +++ b/slep006/proposal.rst @@ -5,7 +5,7 @@ SLEP006: Metadata Routing ========================= :Author: Joel Nothman, Adrin Jalali, Alex Gramfort, Thomas J. Fan -:Status: Under Review +:Status: Accepted :Type: Standards Track :Created: 2019-03-07 @@ -56,18 +56,25 @@ This SLEP proposes to add * `get_metadata_routing` to all **consumers** and **routers** (i.e. all estimators, scorers, and splitters supporting this API) -* `*_requests` to consumers (including estimators, scorers, and CV splitters), - where `*` is a method that requires metadata. (e.g. `fit_requests`, - `score_requests`, `transform_requests`, etc.) +* `set_*_request` to consumers (including estimators, scorers, and CV + splitters), where `*` is a method that requires metadata. (e.g. + `set_fit_request`, `set_score_request`, `set_transform_request`, etc.) -For example, `fit_requests` configures an estimator to request metadata:: +For example, `set_fit_request` configures an estimator to request metadata:: - >>> log_reg = LogisticRegression().fit_requests(sample_weight=True) + >>> log_reg = LogisticRegression().set_fit_request(sample_weight=True) `get_metadata_routing` are used by **routers** to inspect the metadata needed by **consumers**. `get_metadata_routing` returns a `MetadataRouter` or a -`MetadataRequest` object that stores and handles metadata routing. See the -draft implementation for more implementation details. +`MetadataRequest` object that stores and handles metadata routing. 
+`get_metadata_routing` returns enough information for a router to know what +metadata is requested, and whether the metadata is sample aligned or not. See +the draft implementation for more implementation details. + +Note that in the core library nothing is requested by default, except +``groups`` in ``Group*CV`` objects which request the ``groups`` metadata. At +the time of writing this proposal, all metadata requested in the core library +are sample aligned. Detailed description -------------------- @@ -85,11 +92,11 @@ requests `groups` by default:: >>> weighted_acc = make_scorer(accuracy_score).score_request(sample_weight=True) >>> log_reg = (LogisticRegressionCV(cv=GroupKFold(), scoring=weighted_acc) - ... .fit_requests(sample_weight=True)) + ... .set_fit_request(sample_weight=True)) >>> cv_results = cross_validate( ... log_reg, X, y, ... cv=GroupKFold(), - ... props={"sample_weight": my_weights, "groups": my_groups}, + ... metadata={"sample_weight": my_weights, "groups": my_groups}, ... scoring=weighted_acc) To support unweighted fitting and weighted scoring, metadata is set to `False` @@ -100,7 +107,7 @@ in `fit_request`:: >>> cross_validate( ... log_reg, X, y, ... cv=GroupKFold(), - ... props={'sample_weight': weights, 'groups': groups}, + ... metadata={'sample_weight': weights, 'groups': groups}, ... scoring=weighted_acc) Unweighted Feature selection @@ -110,7 +117,7 @@ Unweighted Feature selection will _not_ be routed weights:: >>> log_reg = (LogisticRegressionCV(cv=GroupKFold(), scoring=weighted_acc) - ... .fit_requests(sample_weight=True)) + ... .set_fit_request(sample_weight=True)) >>> sel = SelectKBest(k=2) >>> pipe = make_pipeline(sel, log_reg) >>> pipe.fit(X, y, sample_weight=weights, groups=groups) @@ -128,13 +135,13 @@ this example, `scoring_weight` is passed to the scoring and `fitting_weight` is passed to `LogisticRegressionCV`:: >>> weighted_acc = (make_scorer(accuracy_score) - ... .score_requests(sample_weight="scoring_weight")) + ... .set_score_request(sample_weight="scoring_weight")) >>> log_reg = (LogisticRegressionCV(cv=GroupKFold(), scoring=weighted_acc) - ... .fit_requests(sample_weight="fitting_weight")) + ... .set_fit_request(sample_weight="fitting_weight")) >>> cv_results = cross_validate( ... log_reg, X, y, ... cv=GroupKFold(), - ... props={"scoring_weight": my_weights, + ... metadata={"scoring_weight": my_weights, ... "fitting_weight": my_other_weights, ... "groups": my_groups}, ... scoring=weighted_acc) @@ -155,7 +162,7 @@ and the inner random search:: >>> cv_results = cross_validate( ... log_reg, X, y, ... cv=GroupKFold(), - ... props={"groups": my_groups}) + ... metadata={"groups": my_groups}) Implementation -------------- @@ -164,7 +171,7 @@ This SLEP has a draft implementation at :pr:`22083` by :user:`adrinjalali`. The implementation provides developer utilities that are used by scikit-learn and available to third-party estimators for adopting this SLEP. Specifically, the draft implementation makes it easier to define `get_metadata_routing` and -`*_requests` for **consumers** and **routers**. +`set_*_request` for **consumers** and **routers**. Backward compatibility ---------------------- @@ -184,7 +191,9 @@ a deprecation warning is raised:: To avoid the warning, one would need to specify the request in `LogisticRegression`:: - >>> grid = GridSearchCV(LogisticRegression().fit_requests(sample_weight=True), ...) + >>> grid = GridSearchCV( + ... LogisticRegression().set_fit_request(sample_weight=True), ... + ... 
) >>> grid.fit(X, y, sample_weight=sw) Meta-estimators such as `GridSearchCV` will check which metadata is requested, @@ -200,23 +209,26 @@ not configured to request it:: >>> # `grid.fit`. >>> grid.fit(X, y, sample_weight=sw) -To avoid the error, `LogisticRegression` must specify its metadata request by calling -`fit_requests`:: +To avoid the error, `LogisticRegression` must specify its metadata request by +calling `set_fit_request`:: >>> # Request sample weights - >>> log_reg_weights = LogisticRegression().fit_requests(sample_weight=True) + >>> log_reg_weights = LogisticRegression().set_fit_request(sample_weight=True) >>> grid = GridSearchCV(log_reg_with_weights, ...) >>> grid.fit(X, y, sample_weight=sw) >>> >>> # Do not request sample_weights - >>> log_reg_no_weights = LogisticRegression().fit_requests(sample_weight=False) + >>> log_reg_no_weights = LogisticRegression().set_fit_request(sample_weight=False) >>> grid = GridSearchCV(log_reg_no_weights, ...) >>> grid.fit(X, y, sample_weight=sw) -Third-party estimators will need to adopt this SLEP in order to support metadata -routing, while the dunder syntax is deprecated. Our implementation will provide -developer APIs to trigger warnings and errors as described above to help with -adopting this SLEP. +Note that a meta-estimator will raise an error if the user passes a metadata +which is not requested by any of the child objects of the meta-estimator. + +Third-party estimators will need to adopt this SLEP in order to support +metadata routing, while the dunder syntax is deprecated. Our implementation +will provide developer APIs to trigger warnings and errors as described above +to help with adopting this SLEP. Alternatives ------------ From 174aa633fe093e9a5d17dc771d6ec6c50a492ed9 Mon Sep 17 00:00:00 2001 From: Joel Nothman <joel.nothman@gmail.com> Date: Sat, 19 Mar 2022 23:29:08 +1100 Subject: [PATCH 079/118] Draft of SLEP07: clone override --- slep017/proposal.rst | 131 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 131 insertions(+) create mode 100644 slep017/proposal.rst diff --git a/slep017/proposal.rst b/slep017/proposal.rst new file mode 100644 index 0000000..6ed8ebf --- /dev/null +++ b/slep017/proposal.rst @@ -0,0 +1,131 @@ +============== +Clone Override +============== + +:Author: Joel Nothman +:Status: Draft +:Type: Standards Track +:Created: 2022-03-19 +:Resolution: (required for Accepted | Rejected | Withdrawn) + +Abstract +-------- + +The ability to clone Scikit-learn estimators -- removing any state due to +previous fitting -- is essential to ensuring estimator configurations are +reusable across multiple instances in cross validation. +A centralised implementation of :func:`sklearn.base.clone` regards +an estimator's constructor parameters as the state that should be copied. +This proposal allows for an estimator class to perform further operations +during clone with a ``__sklearn_clone__`` method, which will default to +the current ``clone`` behaviour. + +Detailed description +-------------------- + +Cloning estimators is one way that Scikit-learn ensures that there is no +data leakage across data splits in cross-validation: by only copying an +estimator's configuration, with no data from previous fitting, the +estimator must fit with a cold start. Cloning an estimator often also +occurs prior to parallelism, ensuring that a minimal version of the +estimator -- without a large stored model -- is serialised and distributed. 
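
For concreteness, a minimal illustration of the cold-start contract described
above, using the current ``clone``: constructor parameters survive the copy,
fitted state does not (the toy data below is only for illustration)::

    import numpy as np
    from sklearn.base import clone
    from sklearn.linear_model import LogisticRegression

    est = LogisticRegression(C=0.5).fit(np.array([[0.0], [1.0]]), [0, 1])
    new_est = clone(est)

    assert new_est.C == 0.5               # constructor parameters are copied
    assert not hasattr(new_est, "coef_")  # fitted attributes are dropped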
+
+Cloning is currently governed by the implementation of
+:func:`sklearn.base.clone`, which recursively descends and copies the
+parameters of the passed object. For an estimator, it constructs a new
+instance of the estimator's class, passing to it cloned versions of the
+parameter values returned by its ``get_params``. It then performs some
+sanity checks to ensure that the values passed to the constructor are
+identical to what is then returned by the clone's ``get_params``.
+
+The current equivalence between constructor parameters and what is cloned
+means that whenever an estimator or library developer deems it necessary
+to have further configuration of an estimator reproduced in a clone,
+they must include this configuration as a constructor parameter.
+
+Cases where this need has been raised in Scikit-learn development include:
+
+* ensuring metadata requests are cloned with an estimator
+* ensuring parameter spaces are cloned with an estimator
+* building a simple wrapper that can "freeze" a pre-fitted estimator
+
+The current design also limits the ability for an estimator developer to
+define an exception to the sanity checks (see :issue:`15371`).
+
+This proposal empowers estimator developers to extend the base implementation
+of ``clone`` by providing a ``__sklearn_clone__`` method, which ``clone`` will
+delegate to when available. The default implementation will match current
+``clone`` behaviour. It will be provided through
+``BaseEstimator.__sklearn_clone__``, but also
+provided for estimators not inheriting from :obj:`~sklearn.base.BaseEstimator`.
+
+This shifts the paradigm from ``clone`` being a fixed operation that
+Scikit-learn must be able to perform on an estimator to ``clone`` being a
+behaviour that each Scikit-learn compatible estimator must implement.
+Developers are expected to be responsible in maintaining the fundamental
+properties of cloning.
+
+Implementation
+--------------
+
+Implementing this SLEP will require:
+
+1. Factoring out `clone_parametrized` from `clone`, being the portion of its
+   implementation that handles objects with `get_params`.
+2. Modifying `clone` to call ``__sklearn_clone__`` when available on an
+   object with ``get_params``, or ``clone_parametrized`` when not available.
+3. Defining ``BaseEstimator.__sklearn_clone__`` to call ``clone_parametrized``.
+4. Documenting the above.
+
+Backward compatibility
+----------------------
+
+No breakage.
+
+Alternatives
+------------
+
+Instead of allowing estimators to override the entire clone process,
+the core clone process could be obligatory, with the ability for an
+estimator class to customise additional steps.
+
+One API would allow for an estimator class to provide
+``__sklearn_post_clone__(self, source)`` for operations in addition
+to the core cloning, or ``__sklearn_clone_attrs__`` could be defined
+on a class to specify additional attributes that should be copied for
+that class and its descendants.
+
+Alternative solutions include continuing to force developers into providing
+sometimes-awkward constructor parameters for any clonable material, and
+Scikit-learn core developers having the exceptional ability to extend
+the ``clone`` function as needed.
+
+Discussion
+----------
+
+:issue:`5080` raised the proposal of polymorphism for ``clone`` as the right
+way to provide an object-oriented API, and as a way to enable the
+implementation of wrappers around estimators for model memoisation and
+freezing.
Objections were based on the notion that ``clone`` has a simple +contract, and that "extension to it would open the door to violations of that +contract" [2]_. + +The naming of ``__sklearn_clone__`` was further proposed and discussed in +:issue:`21838`. + +References and Footnotes +------------------------ + +.. [1] Each SLEP must either be explicitly labeled as placed in the public + domain (see this SLEP as an example) or licensed under the `Open + Publication License`_. +.. _Open Publication License: https://www.opencontent.org/openpub/ + +.. [2] `Gael Varoquaux's comments on #5080 in 2015 + <https://github.com/scikit-learn/scikit-learn/issues/5080#issuecomment-127128808>`__ + + +Copyright +--------- + +This document has been placed in the public domain. [1]_ From f140b8a60c50a3ea0a511a813d27f083f3afdd0f Mon Sep 17 00:00:00 2001 From: "Thomas J. Fan" <thomasjpfan@gmail.com> Date: Thu, 26 May 2022 18:51:55 -0400 Subject: [PATCH 080/118] SLEP018 Pandas output for transformers --- index.rst | 1 + slep018/proposal.rst | 139 +++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 140 insertions(+) create mode 100644 slep018/proposal.rst diff --git a/index.rst b/index.rst index da549b0..4b54470 100644 --- a/index.rst +++ b/index.rst @@ -21,6 +21,7 @@ slep012/proposal slep013/proposal + slep018/proposal .. toctree:: :maxdepth: 1 diff --git a/slep018/proposal.rst b/slep018/proposal.rst new file mode 100644 index 0000000..b9e66e4 --- /dev/null +++ b/slep018/proposal.rst @@ -0,0 +1,139 @@ +.. _slep_018: + +======================================================= +SLEP018: Pandas Output for Transformers with set_output +======================================================= + +:Author: Thomas J. Fan +:Status: Draft +:Type: Standards Track +:Created: 2022-05-23 + +Abstract +-------- + +This SLEP proposes a ``set_output`` method to configure the output container of +scikit-learn transformers. + +Detailed description +-------------------- + +Currently, scikit-learn transformers return NumPy ndarrays or SciPy sparse matrices. +This SLEP proposes adding a ``set_output`` method to configure a transformer to output +pandas DataFrames:: + + scalar = StandardScalar().set_output(transform="pandas_or_namedsparse") + scalar.fit(X_df) + + # X_trans_df is a pandas DataFrame + X_trans_df = scalar.transform(X_df) + +For a pipeline, calling ``set_output`` on the pipeline will configure +all steps in the pipeline:: + + num_prep = make_pipeline(SimpleImputer(), StandardScalar(), PCA()) + num_preprocessor.set_output(transform="pandas_or_namedsparse") + + # X_trans_df is a pandas DataFrame + X_trans_df = num_preprocessor.fit_transform(X_df) + +By setting ``transform="pandas"`` calls to ``fit_transform`` will also return a +pandas DataFrame:: + + num_prep = make_pipeline( + SimpleImputer(), + DependsOnPandasInputStandardScalar(), # Depends on Pandas input to train + ) + num_prep.set_output(transform="pandas_or_namedsparse") + + # Pipeline calls ``SimpleImputer.fit_transform`` returning a pandas DataFrame + num_prep.fit(X_df) + +Sparse Data +........... + +The Pandas DataFrame is not suitable to provide column names because it has +performance issues as shown in +`#16772 <https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615423097>`__. +This SLEP proposes a scikit-learn specific sparse container that subclasses SciPy's +sparse matrices. This sparse container includes the sparse data, feature names and +index. 
This enables pipelines with Vectorizers without performance issues:: + + pipe = make_pipeline( + CountVectorizer(), + TfidfTransformer(), + LogisticRegression(solver="liblinear") + ) + + pipe.set_output(transform="pandas_or_namedsparse") + + # feature names for logistic regression + pipe[-1].feature_names_in_ + +Global Configuration +.................... + +This SLEP proposes a global configuration flag that sets the output for +all transformers:: + + import sklearn + sklearn.set_config(transform_output="pandas_or_namedsparse") + +The global default configuration is ``"default"`` where the estimator determines +the output container. + +Implementation +-------------- + +A prototype implementation was created to showcase different use cases for this SLEP, +which is seen in +`this rendered notebook <https://nbviewer.org/github/thomasjpfan/pandas-prototype-demo/blob/main/index.ipynb>`__ +and +`this interactive notebook <https://colab.research.google.com/github/thomasjpfan/pandas-prototype-demo/blob/main/index.ipynb>`__. + + +Backward compatibility +---------------------- + +There are no backward compatibility concerns, because the ``set_output`` method +is a new API. Third party estimators can opt-in to the API by defining +``set_output``. The scikit-learn sparse container is backward compatible because +it is a subclass of SciPy's sparse matrix. + +Alternatives +------------ + +Alternatives to this SLEP includes: + +1. `SLEP014 <https://github.com/scikit-learn/enhancement_proposals/pull/37>`__ + proposes that if the input is a DataFrame than the output is a DataFrame. +2. :ref:`SLEP012 <slep_012>` proposes a custom scikit-learn container + for dense and sparse data that contains feature names. This SLEP + also proposes a custom container for sparse data, but pandas for dense data. +3. Prototype `#20100 <https://github.com/scikit-learn/scikit-learn/pull/20100>`__ + showcases ``array_out="pandas"`` in `transform`. This API + is limited because does not directly support fitting on a pipeline where the + steps requires data frames input. + +Discussion +---------- + +A list of issues discussing Pandas output are: +`#14315 <https://github.com/scikit-learn/scikit-learn/pull/14315>`__, +`#20100 <https://github.com/scikit-learn/scikit-learn/pull/20100>`__, and +`#23001 <https://github.com/scikit-learn/scikit-learn/issueas/23001>`__. + +References and Footnotes +------------------------ + +.. [1] Each SLEP must either be explicitly labeled as placed in the public + domain (see this SLEP as an example) or licensed under the `Open + Publication License`_. + +.. _Open Publication License: https://www.opencontent.org/openpub/ + + +Copyright +--------- + +This document has been placed in the public domain. [1]_ From e708c2a4adb98b94d56291db458477c1300b7e26 Mon Sep 17 00:00:00 2001 From: "Thomas J. 
Fan" <thomasjpfan@gmail.com> Date: Thu, 9 Jun 2022 16:15:39 -0400 Subject: [PATCH 081/118] DOC Reorder for sparse data --- slep018/proposal.rst | 53 +++++++++++++++++++++++--------------------- 1 file changed, 28 insertions(+), 25 deletions(-) diff --git a/slep018/proposal.rst b/slep018/proposal.rst index b9e66e4..a2203dd 100644 --- a/slep018/proposal.rst +++ b/slep018/proposal.rst @@ -22,7 +22,7 @@ Currently, scikit-learn transformers return NumPy ndarrays or SciPy sparse matri This SLEP proposes adding a ``set_output`` method to configure a transformer to output pandas DataFrames:: - scalar = StandardScalar().set_output(transform="pandas_or_namedsparse") + scalar = StandardScalar().set_output(transform="pandas") scalar.fit(X_df) # X_trans_df is a pandas DataFrame @@ -32,7 +32,7 @@ For a pipeline, calling ``set_output`` on the pipeline will configure all steps in the pipeline:: num_prep = make_pipeline(SimpleImputer(), StandardScalar(), PCA()) - num_preprocessor.set_output(transform="pandas_or_namedsparse") + num_preprocessor.set_output(transform="pandas") # X_trans_df is a pandas DataFrame X_trans_df = num_preprocessor.fit_transform(X_df) @@ -44,32 +44,11 @@ pandas DataFrame:: SimpleImputer(), DependsOnPandasInputStandardScalar(), # Depends on Pandas input to train ) - num_prep.set_output(transform="pandas_or_namedsparse") + num_prep.set_output(transform="pandas") # Pipeline calls ``SimpleImputer.fit_transform`` returning a pandas DataFrame num_prep.fit(X_df) -Sparse Data -........... - -The Pandas DataFrame is not suitable to provide column names because it has -performance issues as shown in -`#16772 <https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615423097>`__. -This SLEP proposes a scikit-learn specific sparse container that subclasses SciPy's -sparse matrices. This sparse container includes the sparse data, feature names and -index. This enables pipelines with Vectorizers without performance issues:: - - pipe = make_pipeline( - CountVectorizer(), - TfidfTransformer(), - LogisticRegression(solver="liblinear") - ) - - pipe.set_output(transform="pandas_or_namedsparse") - - # feature names for logistic regression - pipe[-1].feature_names_in_ - Global Configuration .................... @@ -77,7 +56,7 @@ This SLEP proposes a global configuration flag that sets the output for all transformers:: import sklearn - sklearn.set_config(transform_output="pandas_or_namedsparse") + sklearn.set_config(transform_output="pandas") The global default configuration is ``"default"`` where the estimator determines the output container. @@ -123,6 +102,30 @@ A list of issues discussing Pandas output are: `#20100 <https://github.com/scikit-learn/scikit-learn/pull/20100>`__, and `#23001 <https://github.com/scikit-learn/scikit-learn/issueas/23001>`__. +Future Extensions +----------------- + +Sparse Data +........... + +The Pandas DataFrame is not suitable to provide column names because it has +performance issues as shown in +`#16772 <https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615423097>`__. +A possible future extension to this SLEP is to have a ``"pandas_or_namedsparse"`` +option. This option will use a scikit-learn specific sparse container that subclasses SciPy's +sparse matrices. This sparse container includes the sparse data, feature names and +index. 
This enables pipelines with Vectorizers without performance issues:: + + pipe = make_pipeline( + CountVectorizer(), + TfidfTransformer(), + LogisticRegression(solver="liblinear") + ) + pipe.set_output(transform="pandas_or_namedsparse") + + # feature names for logistic regression + pipe[-1].feature_names_in_ + References and Footnotes ------------------------ From 7d1528e778bbd54f4a655ff5fd917e2a49fe7973 Mon Sep 17 00:00:00 2001 From: "Thomas J. Fan" <thomasjpfan@gmail.com> Date: Thu, 9 Jun 2022 17:04:42 -0400 Subject: [PATCH 082/118] DOC Be more explicit about behavior --- slep018/proposal.rst | 24 +++++++++--------------- 1 file changed, 9 insertions(+), 15 deletions(-) diff --git a/slep018/proposal.rst b/slep018/proposal.rst index a2203dd..4bbf5ab 100644 --- a/slep018/proposal.rst +++ b/slep018/proposal.rst @@ -28,8 +28,10 @@ pandas DataFrames:: # X_trans_df is a pandas DataFrame X_trans_df = scalar.transform(X_df) -For a pipeline, calling ``set_output`` on the pipeline will configure -all steps in the pipeline:: +The index of the output DataFrame must match the index of the input. + +For a pipeline, calling ``set_output`` on the pipeline will configure all steps in the +pipeline:: num_prep = make_pipeline(SimpleImputer(), StandardScalar(), PCA()) num_preprocessor.set_output(transform="pandas") @@ -37,17 +39,8 @@ all steps in the pipeline:: # X_trans_df is a pandas DataFrame X_trans_df = num_preprocessor.fit_transform(X_df) -By setting ``transform="pandas"`` calls to ``fit_transform`` will also return a -pandas DataFrame:: - - num_prep = make_pipeline( - SimpleImputer(), - DependsOnPandasInputStandardScalar(), # Depends on Pandas input to train - ) - num_prep.set_output(transform="pandas") - - # Pipeline calls ``SimpleImputer.fit_transform`` returning a pandas DataFrame - num_prep.fit(X_df) +Meta-estimators that support ``set_output`` are required to configure all estimators +by calling ``set_output``. Global Configuration .................... @@ -76,8 +69,9 @@ Backward compatibility There are no backward compatibility concerns, because the ``set_output`` method is a new API. Third party estimators can opt-in to the API by defining -``set_output``. The scikit-learn sparse container is backward compatible because -it is a subclass of SciPy's sparse matrix. +``set_output``. Meta-estimators that define ``set_output`` to configure +it's inner estimators with ``set_output`` should error if any of the inner +estimators do not define ``set_output``. Alternatives ------------ From 1186280ce0e73756f411033943bf295c8003ef4f Mon Sep 17 00:00:00 2001 From: "Thomas J. Fan" <thomasjpfan@gmail.com> Date: Mon, 13 Jun 2022 21:12:42 -0400 Subject: [PATCH 083/118] ENH set_output does nothing for sparse data --- slep018/proposal.rst | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/slep018/proposal.rst b/slep018/proposal.rst index 4bbf5ab..9065d51 100644 --- a/slep018/proposal.rst +++ b/slep018/proposal.rst @@ -28,7 +28,9 @@ pandas DataFrames:: # X_trans_df is a pandas DataFrame X_trans_df = scalar.transform(X_df) -The index of the output DataFrame must match the index of the input. +The index of the output DataFrame must match the index of the input. For this SLEP, +``set_output`` will only configure the output for dense output. If an transformer +returns sparse data, ``set_output`` will not influence the output container. 
For a pipeline, calling ``set_output`` on the pipeline will configure all steps in the pipeline:: @@ -105,8 +107,8 @@ Sparse Data The Pandas DataFrame is not suitable to provide column names because it has performance issues as shown in `#16772 <https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615423097>`__. -A possible future extension to this SLEP is to have a ``"pandas_or_namedsparse"`` -option. This option will use a scikit-learn specific sparse container that subclasses SciPy's +A future extension to this SLEP is to have a ``"pandas_or_namedsparse"`` option. +This option will use a scikit-learn specific sparse container that subclasses SciPy's sparse matrices. This sparse container includes the sparse data, feature names and index. This enables pipelines with Vectorizers without performance issues:: From ac21785df6de37c412ed1515ab2f6cdf3a048b65 Mon Sep 17 00:00:00 2001 From: "Thomas J. Fan" <thomasjpfan@gmail.com> Date: Mon, 13 Jun 2022 21:13:04 -0400 Subject: [PATCH 084/118] DOC Wording --- slep018/proposal.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/slep018/proposal.rst b/slep018/proposal.rst index 9065d51..45b4ee1 100644 --- a/slep018/proposal.rst +++ b/slep018/proposal.rst @@ -29,7 +29,7 @@ pandas DataFrames:: X_trans_df = scalar.transform(X_df) The index of the output DataFrame must match the index of the input. For this SLEP, -``set_output`` will only configure the output for dense output. If an transformer +``set_output`` will only configure the container for dense output. If an transformer returns sparse data, ``set_output`` will not influence the output container. For a pipeline, calling ``set_output`` on the pipeline will configure all steps in the From 326fcbb63af2166bdb0de6760e3bb8b9561b7c35 Mon Sep 17 00:00:00 2001 From: "Thomas J. Fan" <thomasjpfan@gmail.com> Date: Mon, 13 Jun 2022 21:58:31 -0400 Subject: [PATCH 085/118] DOC Reword --- slep018/proposal.rst | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/slep018/proposal.rst b/slep018/proposal.rst index 45b4ee1..b7617e5 100644 --- a/slep018/proposal.rst +++ b/slep018/proposal.rst @@ -28,9 +28,10 @@ pandas DataFrames:: # X_trans_df is a pandas DataFrame X_trans_df = scalar.transform(X_df) -The index of the output DataFrame must match the index of the input. For this SLEP, -``set_output`` will only configure the container for dense output. If an transformer -returns sparse data, ``set_output`` will not influence the output container. +The index of the output DataFrame must match the index of the input. For this +SLEP, ``set_output`` will only configure the output for dense data. If a +transformer returns sparse data, then ``transform`` will error if ``set_output`` +is to "pandas". For a pipeline, calling ``set_output`` on the pipeline will configure all steps in the pipeline:: From d91fb3ad196d0da56582041ea4bbc8f77577d967 Mon Sep 17 00:00:00 2001 From: "Thomas J. Fan" <thomasjpfan@gmail.com> Date: Mon, 13 Jun 2022 22:04:11 -0400 Subject: [PATCH 086/118] DOC Adds set_output validation --- slep018/proposal.rst | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/slep018/proposal.rst b/slep018/proposal.rst index b7617e5..50d61f3 100644 --- a/slep018/proposal.rst +++ b/slep018/proposal.rst @@ -31,7 +31,8 @@ pandas DataFrames:: The index of the output DataFrame must match the index of the input. For this SLEP, ``set_output`` will only configure the output for dense data. 
If a transformer returns sparse data, then ``transform`` will error if ``set_output`` -is to "pandas". +is to "pandas". If a transformer always returns sparse data, then calling +`set_output="pandas"` may raise an error. For a pipeline, calling ``set_output`` on the pipeline will configure all steps in the pipeline:: From 218b76ad5612663ab9408dd4d824df357dda4d05 Mon Sep 17 00:00:00 2001 From: "Thomas J. Fan" <thomasjpfan@gmail.com> Date: Tue, 14 Jun 2022 11:39:08 -0400 Subject: [PATCH 087/118] Update slep018/proposal.rst Co-authored-by: Joel Nothman <joeln@canva.com> --- slep018/proposal.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/slep018/proposal.rst b/slep018/proposal.rst index 50d61f3..bee7388 100644 --- a/slep018/proposal.rst +++ b/slep018/proposal.rst @@ -31,7 +31,7 @@ pandas DataFrames:: The index of the output DataFrame must match the index of the input. For this SLEP, ``set_output`` will only configure the output for dense data. If a transformer returns sparse data, then ``transform`` will error if ``set_output`` -is to "pandas". If a transformer always returns sparse data, then calling +is set to "pandas". If a transformer always returns sparse data, then calling `set_output="pandas"` may raise an error. For a pipeline, calling ``set_output`` on the pipeline will configure all steps in the From 2d234707a6e5f0e806fd092d273de4060cd29b59 Mon Sep 17 00:00:00 2001 From: "Thomas J. Fan" <thomasjpfan@gmail.com> Date: Wed, 22 Jun 2022 17:27:37 -0400 Subject: [PATCH 088/118] CLN Address comments --- slep018/proposal.rst | 91 +++++++++++++++++++++++--------------------- 1 file changed, 47 insertions(+), 44 deletions(-) diff --git a/slep018/proposal.rst b/slep018/proposal.rst index bee7388..e336d5a 100644 --- a/slep018/proposal.rst +++ b/slep018/proposal.rst @@ -7,7 +7,7 @@ SLEP018: Pandas Output for Transformers with set_output :Author: Thomas J. Fan :Status: Draft :Type: Standards Track -:Created: 2022-05-23 +:Created: 2022-06-22 Abstract -------- @@ -18,9 +18,9 @@ scikit-learn transformers. Detailed description -------------------- -Currently, scikit-learn transformers return NumPy ndarrays or SciPy sparse matrices. -This SLEP proposes adding a ``set_output`` method to configure a transformer to output -pandas DataFrames:: +Currently, scikit-learn transformers return NumPy ndarrays or SciPy sparse +matrices. This SLEP proposes adding a ``set_output`` method to configure a +transformer to output pandas DataFrames:: scalar = StandardScalar().set_output(transform="pandas") scalar.fit(X_df) @@ -28,14 +28,16 @@ pandas DataFrames:: # X_trans_df is a pandas DataFrame X_trans_df = scalar.transform(X_df) -The index of the output DataFrame must match the index of the input. For this -SLEP, ``set_output`` will only configure the output for dense data. If a -transformer returns sparse data, then ``transform`` will error if ``set_output`` -is set to "pandas". If a transformer always returns sparse data, then calling -`set_output="pandas"` may raise an error. +The index of the output DataFrame must match the index of the input. If the +transformer does not support ``transform="pandas"``, then it must raise a +``ValueError`` stating that it does not support the feature. -For a pipeline, calling ``set_output`` on the pipeline will configure all steps in the -pipeline:: +For this SLEP, ``set_output`` will only configure the output for dense data. If +the transformer returns sparse data, then ``transform`` will raise a +``ValueError`` if ``set_output(transform="pandas")``. 
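For example, a transformer configured to produce sparse output would be
expected to fail at ``transform`` time; ``OneHotEncoder(sparse=True)`` is used
here only as an illustration of a sparse-output transformer::

    from sklearn.preprocessing import OneHotEncoder

    enc = OneHotEncoder(sparse=True).set_output(transform="pandas")
    enc.fit(X_df)

    # The output would be sparse, so under this proposal transform is
    # expected to raise a ValueError instead of returning a DataFrame.
    enc.transform(X_df)
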
+ +For a pipeline, calling ``set_output`` on the pipeline will configure all steps +in the pipeline:: num_prep = make_pipeline(SimpleImputer(), StandardScalar(), PCA()) num_preprocessor.set_output(transform="pandas") @@ -43,39 +45,38 @@ pipeline:: # X_trans_df is a pandas DataFrame X_trans_df = num_preprocessor.fit_transform(X_df) -Meta-estimators that support ``set_output`` are required to configure all estimators -by calling ``set_output``. +Meta-estimators that support ``set_output`` are required to configure all inner +transformer by calling ``set_output``. If an inner transformer does not define +``set_output``, then an error is raised. Global Configuration .................... -This SLEP proposes a global configuration flag that sets the output for -all transformers:: +This SLEP proposes a global configuration flag that sets the output for all +transformers:: import sklearn sklearn.set_config(transform_output="pandas") -The global default configuration is ``"default"`` where the estimator determines -the output container. +The global default configuration is ``"default"`` where the transformer +determines the output container. Implementation -------------- -A prototype implementation was created to showcase different use cases for this SLEP, -which is seen in -`this rendered notebook <https://nbviewer.org/github/thomasjpfan/pandas-prototype-demo/blob/main/index.ipynb>`__ -and -`this interactive notebook <https://colab.research.google.com/github/thomasjpfan/pandas-prototype-demo/blob/main/index.ipynb>`__. +A prototype implementation was created to showcase different use cases for this +SLEP, which is seen in `this rendered notebook +<https://nbviewer.org/github/thomasjpfan/pandas-prototype-demo/blob/main/index.ipynb>`__ +and `this interactive notebook +<https://colab.research.google.com/github/thomasjpfan/pandas-prototype-demo/blob/main/index.ipynb>`__. Backward compatibility ---------------------- There are no backward compatibility concerns, because the ``set_output`` method -is a new API. Third party estimators can opt-in to the API by defining -``set_output``. Meta-estimators that define ``set_output`` to configure -it's inner estimators with ``set_output`` should error if any of the inner -estimators do not define ``set_output``. +is a new API. Third party transformers can opt-in to the API by defining +``set_output``. Alternatives ------------ @@ -84,21 +85,22 @@ Alternatives to this SLEP includes: 1. `SLEP014 <https://github.com/scikit-learn/enhancement_proposals/pull/37>`__ proposes that if the input is a DataFrame than the output is a DataFrame. -2. :ref:`SLEP012 <slep_012>` proposes a custom scikit-learn container - for dense and sparse data that contains feature names. This SLEP - also proposes a custom container for sparse data, but pandas for dense data. -3. Prototype `#20100 <https://github.com/scikit-learn/scikit-learn/pull/20100>`__ - showcases ``array_out="pandas"`` in `transform`. This API - is limited because does not directly support fitting on a pipeline where the - steps requires data frames input. +2. :ref:`SLEP012 <slep_012>` proposes a custom scikit-learn container for dense + and sparse data that contains feature names. This SLEP also proposes a custom + container for sparse data, but pandas for dense data. +3. Prototype `#20100 + <https://github.com/scikit-learn/scikit-learn/pull/20100>`__ showcases + ``array_out="pandas"`` in `transform`. 
This API is limited because does not + directly support fitting on a pipeline where the steps requires data frames + input. Discussion ---------- -A list of issues discussing Pandas output are: -`#14315 <https://github.com/scikit-learn/scikit-learn/pull/14315>`__, -`#20100 <https://github.com/scikit-learn/scikit-learn/pull/20100>`__, and -`#23001 <https://github.com/scikit-learn/scikit-learn/issueas/23001>`__. +A list of issues discussing Pandas output are: `#14315 +<https://github.com/scikit-learn/scikit-learn/pull/14315>`__, `#20100 +<https://github.com/scikit-learn/scikit-learn/pull/20100>`__, and `#23001 +<https://github.com/scikit-learn/scikit-learn/issueas/23001>`__. Future Extensions ----------------- @@ -107,12 +109,13 @@ Sparse Data ........... The Pandas DataFrame is not suitable to provide column names because it has -performance issues as shown in -`#16772 <https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615423097>`__. +performance issues as shown in `#16772 +<https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615423097>`__. A future extension to this SLEP is to have a ``"pandas_or_namedsparse"`` option. -This option will use a scikit-learn specific sparse container that subclasses SciPy's -sparse matrices. This sparse container includes the sparse data, feature names and -index. This enables pipelines with Vectorizers without performance issues:: +This option will use a scikit-learn specific sparse container that subclasses +SciPy's sparse matrices. This sparse container includes the sparse data, feature +names and index. This enables pipelines with Vectorizers without performance +issues:: pipe = make_pipeline( CountVectorizer(), @@ -128,8 +131,8 @@ References and Footnotes ------------------------ .. [1] Each SLEP must either be explicitly labeled as placed in the public - domain (see this SLEP as an example) or licensed under the `Open - Publication License`_. + domain (see this SLEP as an example) or licensed under the `Open Publication + License`_. .. _Open Publication License: https://www.opencontent.org/openpub/ From eac9f9fde8957b1354580c79ea5957b9e3c3d7dc Mon Sep 17 00:00:00 2001 From: "Thomas J. Fan" <thomasjpfan@gmail.com> Date: Wed, 22 Jun 2022 20:24:48 -0400 Subject: [PATCH 089/118] DOC Link to implementation --- slep018/proposal.rst | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/slep018/proposal.rst b/slep018/proposal.rst index e336d5a..731e782 100644 --- a/slep018/proposal.rst +++ b/slep018/proposal.rst @@ -64,12 +64,7 @@ determines the output container. Implementation -------------- -A prototype implementation was created to showcase different use cases for this -SLEP, which is seen in `this rendered notebook -<https://nbviewer.org/github/thomasjpfan/pandas-prototype-demo/blob/main/index.ipynb>`__ -and `this interactive notebook -<https://colab.research.google.com/github/thomasjpfan/pandas-prototype-demo/blob/main/index.ipynb>`__. - +The implementation of this SLEP is in :pr:`23734`. Backward compatibility ---------------------- From add7a8ae1090c849ecc4cf5f20ee440e8ea5dd03 Mon Sep 17 00:00:00 2001 From: "Thomas J. 
Fan" <thomasjpfan@gmail.com> Date: Tue, 5 Jul 2022 11:42:13 -0400 Subject: [PATCH 090/118] Apply suggestions from code review Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com> --- slep018/proposal.rst | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/slep018/proposal.rst b/slep018/proposal.rst index 731e782..f4b4fd1 100644 --- a/slep018/proposal.rst +++ b/slep018/proposal.rst @@ -12,7 +12,7 @@ SLEP018: Pandas Output for Transformers with set_output Abstract -------- -This SLEP proposes a ``set_output`` method to configure the output container of +This SLEP proposes a ``set_output`` method to configure the output data container of scikit-learn transformers. Detailed description @@ -32,9 +32,10 @@ The index of the output DataFrame must match the index of the input. If the transformer does not support ``transform="pandas"``, then it must raise a ``ValueError`` stating that it does not support the feature. -For this SLEP, ``set_output`` will only configure the output for dense data. If -the transformer returns sparse data, then ``transform`` will raise a -``ValueError`` if ``set_output(transform="pandas")``. +This SLEP's only focus is dense data for ``set_output``. If a transformer returns +sparse data, e.g. `OneHotEncoder(sparse=True), then ``transform`` will raise a +``ValueError`` if ``set_output(transform="pandas")``. Dealing with sparse output +might be the scope of another future SLEP. For a pipeline, calling ``set_output`` on the pipeline will configure all steps in the pipeline:: @@ -44,6 +45,9 @@ in the pipeline:: # X_trans_df is a pandas DataFrame X_trans_df = num_preprocessor.fit_transform(X_df) + + # X_trans_df is again a pandas DataFrame + X_trans_df = num_preprocessor[0].transform(X_df) Meta-estimators that support ``set_output`` are required to configure all inner transformer by calling ``set_output``. If an inner transformer does not define @@ -52,7 +56,7 @@ transformer by calling ``set_output``. If an inner transformer does not define Global Configuration .................... -This SLEP proposes a global configuration flag that sets the output for all +For ease of use, this SLEP proposes a global configuration flag that sets the output for all transformers:: import sklearn @@ -64,7 +68,7 @@ determines the output container. Implementation -------------- -The implementation of this SLEP is in :pr:`23734`. +A possible implementation of this SLEP is worked out in :pr:`23734`. Backward compatibility ---------------------- @@ -99,7 +103,7 @@ A list of issues discussing Pandas output are: `#14315 Future Extensions ----------------- - +For information only! Sparse Data ........... From 05803915075a752548ea2974f0fd6fd792b26ea4 Mon Sep 17 00:00:00 2001 From: "Thomas J. Fan" <thomasjpfan@gmail.com> Date: Tue, 5 Jul 2022 11:41:15 -0400 Subject: [PATCH 091/118] DOC Address comments --- slep018/proposal.rst | 24 ++++++++++++++++++------ 1 file changed, 18 insertions(+), 6 deletions(-) diff --git a/slep018/proposal.rst b/slep018/proposal.rst index f4b4fd1..7a28ba2 100644 --- a/slep018/proposal.rst +++ b/slep018/proposal.rst @@ -65,6 +65,21 @@ transformers:: The global default configuration is ``"default"`` where the transformer determines the output container. 
+The configuration can also be set locally using the ``config_context`` context +manager: + + from sklearn import config_context + with config_context(transform_output="pandas"): + num_prep = make_pipeline(SimpleImputer(), StandardScalar(), PCA()) + num_preprocessor.fit_transform(X_df) + +The following specifies the precedence levels for the three ways to configure +the output container: + +1. Locally configure a transformer: ``transformer.set_output`` +2. Context manager: ``config_context`` +3. Global configuration: ``set_config`` + Implementation -------------- @@ -84,10 +99,7 @@ Alternatives to this SLEP includes: 1. `SLEP014 <https://github.com/scikit-learn/enhancement_proposals/pull/37>`__ proposes that if the input is a DataFrame than the output is a DataFrame. -2. :ref:`SLEP012 <slep_012>` proposes a custom scikit-learn container for dense - and sparse data that contains feature names. This SLEP also proposes a custom - container for sparse data, but pandas for dense data. -3. Prototype `#20100 +2. Prototype `#20100 <https://github.com/scikit-learn/scikit-learn/pull/20100>`__ showcases ``array_out="pandas"`` in `transform`. This API is limited because does not directly support fitting on a pipeline where the steps requires data frames @@ -107,8 +119,8 @@ For information only! Sparse Data ........... -The Pandas DataFrame is not suitable to provide column names because it has -performance issues as shown in `#16772 +The Pandas DataFrame is not suitable to provide column names for sparse data +because it has performance issues as shown in `#16772 <https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615423097>`__. A future extension to this SLEP is to have a ``"pandas_or_namedsparse"`` option. This option will use a scikit-learn specific sparse container that subclasses From 4370736fc07c9f173220e34b9cf171378df6a0f7 Mon Sep 17 00:00:00 2001 From: "Thomas J. Fan" <thomasjpfan@gmail.com> Date: Tue, 5 Jul 2022 11:45:25 -0400 Subject: [PATCH 092/118] DOC Remove future extensions --- slep018/proposal.rst | 26 +------------------------- 1 file changed, 1 insertion(+), 25 deletions(-) diff --git a/slep018/proposal.rst b/slep018/proposal.rst index 7a28ba2..4c6358d 100644 --- a/slep018/proposal.rst +++ b/slep018/proposal.rst @@ -45,7 +45,7 @@ in the pipeline:: # X_trans_df is a pandas DataFrame X_trans_df = num_preprocessor.fit_transform(X_df) - + # X_trans_df is again a pandas DataFrame X_trans_df = num_preprocessor[0].transform(X_df) @@ -113,30 +113,6 @@ A list of issues discussing Pandas output are: `#14315 <https://github.com/scikit-learn/scikit-learn/pull/20100>`__, and `#23001 <https://github.com/scikit-learn/scikit-learn/issueas/23001>`__. -Future Extensions ------------------ -For information only! -Sparse Data -........... - -The Pandas DataFrame is not suitable to provide column names for sparse data -because it has performance issues as shown in `#16772 -<https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615423097>`__. -A future extension to this SLEP is to have a ``"pandas_or_namedsparse"`` option. -This option will use a scikit-learn specific sparse container that subclasses -SciPy's sparse matrices. This sparse container includes the sparse data, feature -names and index. 
This enables pipelines with Vectorizers without performance -issues:: - - pipe = make_pipeline( - CountVectorizer(), - TfidfTransformer(), - LogisticRegression(solver="liblinear") - ) - pipe.set_output(transform="pandas_or_namedsparse") - - # feature names for logistic regression - pipe[-1].feature_names_in_ References and Footnotes ------------------------ From 1d57415c244fbbf546d13fc37acaa699e8b54a47 Mon Sep 17 00:00:00 2001 From: "Thomas J. Fan" <thomasjpfan@gmail.com> Date: Fri, 8 Jul 2022 16:12:35 -0400 Subject: [PATCH 093/118] DOC Adds details about fitted and non-fitted inner transformers --- slep018/proposal.rst | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/slep018/proposal.rst b/slep018/proposal.rst index 4c6358d..76436e6 100644 --- a/slep018/proposal.rst +++ b/slep018/proposal.rst @@ -50,8 +50,12 @@ in the pipeline:: X_trans_df = num_preprocessor[0].transform(X_df) Meta-estimators that support ``set_output`` are required to configure all inner -transformer by calling ``set_output``. If an inner transformer does not define -``set_output``, then an error is raised. +transformer by calling ``set_output``. Specifically all fitted and non-fitted +inner transformers must be configured with ``set_output``. This enables +``transform``'s output to be a DataFrame before and after the meta-estimator is +fitted. If an inner transformer does not define ``set_output``, then an error is +raised. + Global Configuration .................... From 68ada33704fcccb95039b12431b6448d44abe79b Mon Sep 17 00:00:00 2001 From: "Thomas J. Fan" <thomasjpfan@gmail.com> Date: Fri, 8 Jul 2022 16:16:47 -0400 Subject: [PATCH 094/118] DOC Adds details about the pandas choice --- slep018/proposal.rst | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/slep018/proposal.rst b/slep018/proposal.rst index 76436e6..203a42e 100644 --- a/slep018/proposal.rst +++ b/slep018/proposal.rst @@ -115,8 +115,10 @@ Discussion A list of issues discussing Pandas output are: `#14315 <https://github.com/scikit-learn/scikit-learn/pull/14315>`__, `#20100 <https://github.com/scikit-learn/scikit-learn/pull/20100>`__, and `#23001 -<https://github.com/scikit-learn/scikit-learn/issueas/23001>`__. - +<https://github.com/scikit-learn/scikit-learn/issueas/23001>`__. This SLEP +proposes configuring the output to be pandas because it is the DataFrame library +that is most widely used and requested by users. The ``set_output`` can be +extended to support support additional DataFrame libraries in the future. References and Footnotes ------------------------ From 1c11263ef50877efa0c1f2ebcd5e76c1284545f3 Mon Sep 17 00:00:00 2001 From: "Thomas J. Fan" <thomasjpfan@gmail.com> Date: Sun, 17 Jul 2022 10:47:43 -0500 Subject: [PATCH 095/118] CLN Update formatting --- slep018/proposal.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/slep018/proposal.rst b/slep018/proposal.rst index 203a42e..5ceb2ca 100644 --- a/slep018/proposal.rst +++ b/slep018/proposal.rst @@ -70,7 +70,7 @@ The global default configuration is ``"default"`` where the transformer determines the output container. The configuration can also be set locally using the ``config_context`` context -manager: +manager:: from sklearn import config_context with config_context(transform_output="pandas"): From 988450424787d968b8f75c997fd1c319c7bcc3ca Mon Sep 17 00:00:00 2001 From: "Thomas J. 
Fan" <thomasjpfan@gmail.com> Date: Wed, 20 Jul 2022 10:23:11 -0400 Subject: [PATCH 096/118] CLN Fixes formatting in slep 018 (#73) --- slep018/proposal.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/slep018/proposal.rst b/slep018/proposal.rst index 5ceb2ca..ff67c7b 100644 --- a/slep018/proposal.rst +++ b/slep018/proposal.rst @@ -33,7 +33,7 @@ transformer does not support ``transform="pandas"``, then it must raise a ``ValueError`` stating that it does not support the feature. This SLEP's only focus is dense data for ``set_output``. If a transformer returns -sparse data, e.g. `OneHotEncoder(sparse=True), then ``transform`` will raise a +sparse data, e.g. ``OneHotEncoder(sparse=True)``, then ``transform`` will raise a ``ValueError`` if ``set_output(transform="pandas")``. Dealing with sparse output might be the scope of another future SLEP. From 23aced5850089c33d12372ca248fa21cd22823bb Mon Sep 17 00:00:00 2001 From: "Thomas J. Fan" <thomasjpfan@gmail.com> Date: Fri, 19 Aug 2022 19:26:19 -0400 Subject: [PATCH 097/118] VOTE SLEP018 - Pandas Output for Transformers (#72) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Julien Jerphanion <git@jjerphan.xyz> Co-authored-by: Loïc Estève <loic.esteve@ymail.com> --- index.rst | 2 +- slep018/proposal.rst | 31 +++++++++++++++++++------------ 2 files changed, 20 insertions(+), 13 deletions(-) diff --git a/index.rst b/index.rst index 4b54470..c17469b 100644 --- a/index.rst +++ b/index.rst @@ -14,6 +14,7 @@ slep007/proposal slep009/proposal slep010/proposal + slep018/proposal .. toctree:: :maxdepth: 1 @@ -21,7 +22,6 @@ slep012/proposal slep013/proposal - slep018/proposal .. toctree:: :maxdepth: 1 diff --git a/slep018/proposal.rst b/slep018/proposal.rst index ff67c7b..f4b830f 100644 --- a/slep018/proposal.rst +++ b/slep018/proposal.rst @@ -5,7 +5,7 @@ SLEP018: Pandas Output for Transformers with set_output ======================================================= :Author: Thomas J. Fan -:Status: Draft +:Status: Accepted :Type: Standards Track :Created: 2022-06-22 @@ -22,7 +22,7 @@ Currently, scikit-learn transformers return NumPy ndarrays or SciPy sparse matrices. This SLEP proposes adding a ``set_output`` method to configure a transformer to output pandas DataFrames:: - scalar = StandardScalar().set_output(transform="pandas") + scalar = StandardScaler().set_output(transform="pandas") scalar.fit(X_df) # X_trans_df is a pandas DataFrame @@ -37,20 +37,26 @@ sparse data, e.g. ``OneHotEncoder(sparse=True)``, then ``transform`` will raise ``ValueError`` if ``set_output(transform="pandas")``. Dealing with sparse output might be the scope of another future SLEP. -For a pipeline, calling ``set_output`` on the pipeline will configure all steps -in the pipeline:: +For a pipeline, calling ``set_output`` will configure all inner transformers and +does not configure non-transformers. 
This enables the following workflow:: - num_prep = make_pipeline(SimpleImputer(), StandardScalar(), PCA()) - num_preprocessor.set_output(transform="pandas") + log_reg = make_pipeline(SimpleImputer(), StandardScaler(), LogisticRegression()) + log_reg.set_output(transform="pandas") + + # All transformers return DataFrames during fit + log_reg.fit(X_df, y) # X_trans_df is a pandas DataFrame - X_trans_df = num_preprocessor.fit_transform(X_df) + X_trans_df = log_reg[:-1].transform(X_df) # X_trans_df is again a pandas DataFrame - X_trans_df = num_preprocessor[0].transform(X_df) + X_trans_df = log_reg[0].transform(X_df) + + # The classifier contains the feature names in + log_reg[-1].feature_names_in_ Meta-estimators that support ``set_output`` are required to configure all inner -transformer by calling ``set_output``. Specifically all fitted and non-fitted +transformers by calling ``set_output``. Specifically all fitted and non-fitted inner transformers must be configured with ``set_output``. This enables ``transform``'s output to be a DataFrame before and after the meta-estimator is fitted. If an inner transformer does not define ``set_output``, then an error is @@ -74,7 +80,7 @@ manager:: from sklearn import config_context with config_context(transform_output="pandas"): - num_prep = make_pipeline(SimpleImputer(), StandardScalar(), PCA()) + num_prep = make_pipeline(SimpleImputer(), StandardScaler(), PCA()) num_preprocessor.fit_transform(X_df) The following specifies the precedence levels for the three ways to configure @@ -117,8 +123,9 @@ A list of issues discussing Pandas output are: `#14315 <https://github.com/scikit-learn/scikit-learn/pull/20100>`__, and `#23001 <https://github.com/scikit-learn/scikit-learn/issueas/23001>`__. This SLEP proposes configuring the output to be pandas because it is the DataFrame library -that is most widely used and requested by users. The ``set_output`` can be -extended to support support additional DataFrame libraries in the future. +that is most widely used and requested by users. The ``set_output`` API can be +extended to support additional DataFrame libraries and sparse data formats in +the future. References and Footnotes ------------------------ From 14cac6600f9402bc540c66617ac924ddf30a7673 Mon Sep 17 00:00:00 2001 From: Christian Lorentzen <lorentzen.ch@gmail.com> Date: Wed, 14 Sep 2022 19:55:08 +0200 Subject: [PATCH 098/118] Set SLEP009 to Final --- slep009/proposal.rst | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/slep009/proposal.rst b/slep009/proposal.rst index 248c21f..c6ca4e3 100644 --- a/slep009/proposal.rst +++ b/slep009/proposal.rst @@ -5,11 +5,14 @@ SLEP009: Keyword-only arguments =============================== :Author: Adrin Jalali -:Status: Accepted +:Status: Final :Type: Standards Track :Created: 2019-07-13 :Vote opened: 2019-09-11 +Implemented with `v0.23 <https://scikit-learn.org/stable/whats_new/v0.23.html#enforcing-keyword-only-arguments>`__ +and `v1.0.0 <https://scikit-learn.org/stable/whats_new/v1.0.html#enforcing-keyword-only-arguments>`__. 
+ Abstract ######## From 2a93a2976d3840436e3a62ab3a6ed49cc68eb0ea Mon Sep 17 00:00:00 2001 From: Christian Lorentzen <lorentzen.ch@gmail.com> Date: Wed, 14 Sep 2022 20:01:07 +0200 Subject: [PATCH 099/118] Set SLEP010 to Final --- slep010/proposal.rst | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/slep010/proposal.rst b/slep010/proposal.rst index a8517c2..8970b68 100644 --- a/slep010/proposal.rst +++ b/slep010/proposal.rst @@ -5,10 +5,12 @@ SLEP010: ``n_features_in_`` attribute ===================================== :Author: Nicolas Hug -:Status: Accepted +:Status: Final :Type: Standards Track :Created: 2019-11-23 +Implemented with `v0.23 <https://scikit-learn.org/stable/whats_new/v0.23.html?highlight=n_features_in_#id13>`__. + Abstract ######## From 3609255b8b9e42d601e9e89e9c6404d6bd41f5dd Mon Sep 17 00:00:00 2001 From: Christian Lorentzen <lorentzen.ch@gmail.com> Date: Wed, 14 Sep 2022 20:05:58 +0200 Subject: [PATCH 100/118] Set SLEP007 to Final --- slep007/proposal.rst | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/slep007/proposal.rst b/slep007/proposal.rst index 78fbe0b..58905f5 100644 --- a/slep007/proposal.rst +++ b/slep007/proposal.rst @@ -5,12 +5,14 @@ SLEP007: Feature names, their generation and the API ==================================================== :Author: Adrin Jalali -:Status: Accepted +:Status: Final :Type: Standards Track :Created: 2019-04 :Vote opened: 2021-10-26 :Vote closed: 2021-11-29 +Implemented with `v1.0.0 <https://scikit-learn.org/stable/whats_new/v1.0.html#id7>`__. + Abstract ######## From 0e06e05955b18126d882ed7819cd1aacd53873a1 Mon Sep 17 00:00:00 2001 From: Christian Lorentzen <lorentzen.ch@gmail.com> Date: Wed, 14 Sep 2022 20:21:18 +0200 Subject: [PATCH 101/118] Fix version of implementation to v1.1.0 --- slep007/proposal.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/slep007/proposal.rst b/slep007/proposal.rst index 58905f5..7f9185d 100644 --- a/slep007/proposal.rst +++ b/slep007/proposal.rst @@ -11,7 +11,7 @@ SLEP007: Feature names, their generation and the API :Vote opened: 2021-10-26 :Vote closed: 2021-11-29 -Implemented with `v1.0.0 <https://scikit-learn.org/stable/whats_new/v1.0.html#id7>`__. +Implemented with `v1.1.0 <https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_1_0.html#get-feature-names-out-available-in-all-transformers>`__. Abstract ######## From 8ae13e9ba6309993476a08863f61306cd16c2ceb Mon Sep 17 00:00:00 2001 From: Joel Nothman <joel.nothman@gmail.com> Date: Fri, 23 Sep 2022 14:43:57 +1000 Subject: [PATCH 102/118] Extend discussion, address reviews --- slep017/proposal.rst | 77 ++++++++++++++++++++++++++++++++++++-------- 1 file changed, 63 insertions(+), 14 deletions(-) diff --git a/slep017/proposal.rst b/slep017/proposal.rst index 6ed8ebf..53fc931 100644 --- a/slep017/proposal.rst +++ b/slep017/proposal.rst @@ -1,6 +1,6 @@ -============== -Clone Override -============== +================================================== +Clone Override Protocol with ``__sklearn_clone__`` +================================================== :Author: Joel Nothman :Status: Draft @@ -16,8 +16,8 @@ previous fitting -- is essential to ensuring estimator configurations are reusable across multiple instances in cross validation. A centralised implementation of :func:`sklearn.base.clone` regards an estimator's constructor parameters as the state that should be copied. 
-This proposal allows for an estimator class to perform further operations -during clone with a ``__sklearn_clone__`` method, which will default to +This proposal allows for an estimator class to implment custom cloning +functionality with a ``__sklearn_clone__`` method, which will default to the current ``clone`` behaviour. Detailed description @@ -48,6 +48,8 @@ Cases where this need has been raised in Scikit-learn development include: * ensuring metadata requests are cloned with an estimator * ensuring parameter spaces are cloned with an estimator * building a simple wrapper that can "freeze" a pre-fitted estimator +* allowing existing options for using prefitted models in ensembles + to work under cloning The current design also limits the ability for an estimator developer to define an exception to the sanity checks (see :issue:`15371`). @@ -55,15 +57,16 @@ define an exception to the sanity checks (see :issue:`15371`). This proposal empowers estimator developers to extend the base implementation of ``clone`` by providing a ``__sklearn_clone__`` method, which ``clone`` will delegate to when available. The default implementaton will match current -``clone`` behaviour. It will be provied through +``clone`` behaviour. It will be provided through ``BaseEstimator.__sklearn_clone__`` but also provided for estimators not inheriting from :obj:`~sklearn.base.BaseEstimator`. This shifts the paradigm from ``clone`` being a fixed operation that Scikit-learn must be able to perform on an estimator to ``clone`` being a -behaviour that each Scikit-learn compatible estimator must implement. -Developers are expected to be responsible in maintaintaining the fundamental -properties of cloning. +behaviour that each Scikit-learn compatible estimator may implement. +Developers that define ``__sklearn_clone__`` are expected to be responsible +in maintaintaining the fundamental properties of cloning, ordinarily +through use of ``super().__sklearn_clone__``. Implementation -------------- @@ -85,7 +88,7 @@ No breakage. Alternatives ------------ -Insted of allowing estimators to overwrite the entire clone process, +Instead of allowing estimators to overwrite the entire clone process, the core clone process could be obligatory, with the ability for an estimator class to customise additional steps. @@ -106,13 +109,59 @@ Discussion :issue:`5080` raised the proposal of polymorphism for ``clone`` as the right way to provide an object-oriented API, and as a way to enable the implementation of wrappers around estimators for model memoisation and -freezing. Objections were based on the notion that ``clone`` has a simple -contract, and that "extension to it would open the door to violations of that -contract" [2]_. - +freezing. The naming of ``__sklearn_clone__`` was further proposed and discussed in :issue:`21838`. +Making cloning more flexible either enables or simplifies the design and +implementation of several features, including wrapping pre-fitted estimators, +and providing estimator configuration through methods without adding new +constructor arguments (e.g. through mixins). 
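For instance, a model-freezing wrapper becomes only a few lines once cloning
is overridable; the class below is a sketch of the idea, not an API proposed
by this SLEP::

    from sklearn.base import BaseEstimator

    class FrozenEstimator(BaseEstimator):
        """Wrap a fitted estimator so that cloning keeps its fitted state."""

        def __init__(self, estimator):
            self.estimator = estimator

        def __sklearn_clone__(self):
            # Reuse the wrapped, already-fitted estimator instead of
            # reconstructing it from constructor parameters.
            return self

        def fit(self, X, y=None, **fit_params):
            # Nothing to refit: the wrapped estimator is used as-is.
            return self

        def predict(self, X):
            return self.estimator.predict(X)
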
+ +Related issues include: + +- :issue:`6451`, :issue:`8710`, :issue:`19848`: CalibratedClassifierCV with + prefitted base estimator +- :issue:`7382`: VotingClassifier with prefitted base estimator +- :issue:`16748`: Stacking estimator with prefitted base estimator +- :issue:`8370`, :issue:`9464`: generic estimator wrapper for model freezing +- :issue:`5082`: configuring parameter search spaces +- :issue:`16079`: configuring the routing of sample-aligned metadata +- :issue:`16185`: configuring selected parameters to not be deep-copied + +Under the incumbent monolithic clone implementation, designing such additional +per-estimator configuration requires resolving whether to: + +- adjust the monolithic ``clone`` to account for the new configuration + attributes (an option only available to the Scikit-learn core developer + team); +- add constructor attributes for each new configuration option; or +- not clone estimator configurations, and accept that some use cases may not + be possible. + +A more flexible cloning operation provides a simpler pattern for adding new +configuration options through mixins. +It should be noted that adding new capabilities to *all* estimators remains +possible only through modifying the default ``__sklearn_clone__`` +implementation. + +There are, however, notable concerns in relation to this proposal. +Introducing a generic clone handler on each estimator gives a developer +complete freedom to disregard existing conventions regarding parameter +setting and construction in Scikit-learn. +In this vein, objections to :issue:`5080` cited the notion that "``clone`` +has a simple contract," and that "extension to it would open the door to +violations of that contract" [2]_. + +While these objections identify considerable risks, many public libraries +include developers regularly working around Scikit-learn conventions and +contracts, in part because developers are backed into a "design corner", +wherein it is not always obvious how to build an acceptable UX while adhering +to established conventions; in this case, that everything to be cloned must +go into ``__init__``. This proposal paves a road for how developers can +solve functionality UX limitations in the core library, rather than +inviting custom workarounds. + References and Footnotes ------------------------ From 7fd1586e228a6338f4db22dc18f54b5178b8bdb4 Mon Sep 17 00:00:00 2001 From: Joel Nothman <joel.nothman@gmail.com> Date: Fri, 23 Sep 2022 14:47:35 +1000 Subject: [PATCH 103/118] re estimator_checks --- slep017/proposal.rst | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/slep017/proposal.rst b/slep017/proposal.rst index 53fc931..713967a 100644 --- a/slep017/proposal.rst +++ b/slep017/proposal.rst @@ -64,9 +64,12 @@ provided for estimators not inheriting from :obj:`~sklearn.base.BaseEstimator`. This shifts the paradigm from ``clone`` being a fixed operation that Scikit-learn must be able to perform on an estimator to ``clone`` being a behaviour that each Scikit-learn compatible estimator may implement. + Developers that define ``__sklearn_clone__`` are expected to be responsible -in maintaintaining the fundamental properties of cloning, ordinarily -through use of ``super().__sklearn_clone__``. +in maintaintaining the fundamental properties of cloning. Ordinarily, they +can achieve this through use of ``super().__sklearn_clone__``. Core behaviours, +such as constructor parameters being preserved through ``clone`` operations, +can be ensured through estimator checks. 
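To illustrate the pattern, a mixin adding non-constructor configuration could
build on the default behaviour as follows; ``set_search_space`` and
``_search_space`` are hypothetical names, and the mixin is assumed to be
combined with ``BaseEstimator``, which provides the default
``__sklearn_clone__``::

    class SearchSpaceMixin:
        def set_search_space(self, **space):
            self._search_space = space
            return self

        def __sklearn_clone__(self):
            # Start from the default clone (constructor parameters only) ...
            new = super().__sklearn_clone__()
            # ... then carry over the extra, non-constructor configuration.
            if hasattr(self, "_search_space"):
                new._search_space = dict(self._search_space)
            return new
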
Implementation -------------- From 3b9bd20faefff08c6e98d2055fd655e1a3b27f1e Mon Sep 17 00:00:00 2001 From: Andreas Mueller <t3kcit@gmail.com> Date: Mon, 31 Oct 2022 08:49:16 -0700 Subject: [PATCH 104/118] Update slep017/proposal.rst --- slep017/proposal.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/slep017/proposal.rst b/slep017/proposal.rst index 713967a..da63085 100644 --- a/slep017/proposal.rst +++ b/slep017/proposal.rst @@ -16,7 +16,7 @@ previous fitting -- is essential to ensuring estimator configurations are reusable across multiple instances in cross validation. A centralised implementation of :func:`sklearn.base.clone` regards an estimator's constructor parameters as the state that should be copied. -This proposal allows for an estimator class to implment custom cloning +This proposal allows for an estimator class to implement custom cloning functionality with a ``__sklearn_clone__`` method, which will default to the current ``clone`` behaviour. From c76ae375295f6d27fe42ead09adce737394faf2a Mon Sep 17 00:00:00 2001 From: Andreas Mueller <t3kcit@gmail.com> Date: Tue, 1 Nov 2022 23:13:16 -0700 Subject: [PATCH 105/118] MNT add slep 17 to index (#80) --- index.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/index.rst b/index.rst index c17469b..79b0d16 100644 --- a/index.rst +++ b/index.rst @@ -22,6 +22,7 @@ slep012/proposal slep013/proposal + slep017/proposal .. toctree:: :maxdepth: 1 From 58495ffc1aca471430255a4c6ed019b4d184f562 Mon Sep 17 00:00:00 2001 From: Guillaume Lemaitre <g.lemaitre58@gmail.com> Date: Wed, 2 Nov 2022 10:45:51 +0100 Subject: [PATCH 106/118] DOC Prepend the SLEP number to title of SLEP017 --- slep017/proposal.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/slep017/proposal.rst b/slep017/proposal.rst index da63085..fdec790 100644 --- a/slep017/proposal.rst +++ b/slep017/proposal.rst @@ -1,6 +1,6 @@ -================================================== -Clone Override Protocol with ``__sklearn_clone__`` -================================================== +=========================================================== +SLEP017: Clone Override Protocol with ``__sklearn_clone__`` +=========================================================== :Author: Joel Nothman :Status: Draft From f2f74184c2f62ff861bc521de6030c93e4895f89 Mon Sep 17 00:00:00 2001 From: Julien Jerphanion <git@jjerphan.xyz> Date: Fri, 18 Nov 2022 18:11:49 +0100 Subject: [PATCH 107/118] SLEP019: Governance Update - Recognizing Contributions Beyond Code (#74) Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org> Co-authored-by: Andreas Mueller <amueller@microsoft.com> Co-authored-by: Juan Martin Loyola <jmloyola@outlook.com> Co-authored-by: Reshama Shaikh <reshama.stat@gmail.com> Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Chiara Marmo <cmarmo@users.noreply.github.com> Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com> Co-authored-by: Meekail Zain <Micky774@users.noreply.github.com> Co-authored-by: Tim Head <betatim@gmail.com> Co-authored-by: Noa Tamir <noatamir@users.noreply.github.com> --- index.rst | 1 + slep019/proposal.rst | 199 +++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 200 insertions(+) create mode 100644 slep019/proposal.rst diff --git a/index.rst b/index.rst index 79b0d16..a8147b0 100644 --- a/index.rst +++ b/index.rst @@ -23,6 +23,7 @@ slep012/proposal slep013/proposal slep017/proposal + slep019/proposal .. 
toctree:: :maxdepth: 1 diff --git a/slep019/proposal.rst b/slep019/proposal.rst new file mode 100644 index 0000000..2ddd55c --- /dev/null +++ b/slep019/proposal.rst @@ -0,0 +1,199 @@ +.. _slep_019: + +#################################################################### + SLEP019: Governance Update - Recognizing Contributions Beyond Code +#################################################################### + +:Author: Julien Jerphanion <git@jjerphan.xyz>, Gaël Varoquaux <gael.varoquaux@normalesup.org> +:Status: Draft +:Type: Process +:Created: 2022-09-12 + +********** + Abstract +********** + +This SLEP proposes updating the Governance to broaden the notion of +contribution in scikit-learn and to ease subsequent related changes to +the Governance without requiring SLEPs. + +************ + Motivation +************ + +Current state +============= + +The formal decision making process of the scikit-learn project is +limited to a subset of contributors, called Core Developers (also +refered to as Maintainers). Their active and consistent contributions +are recognized by them: + +- being part of scikit-learn organisation on GitHub +- receiving “commit rights” to the repository +- having their Pull Request reviews recognised as authoritative +- having voting rights for the project direction (promoting a + contributor to be a core-developer, approving a SLEP, etc.) + +Core Developers are primarily selected based on their code +contributions. However, there are a lot of other ways to contribute to +the project, and these efforts are currently not recognized [1]_. To +quote Melissa Weber Mendonça [2]_ and Reshama Shaikh [3]_: + +.. epigraph:: + + "When some people join an open source project, they may be asked to contribute + with tasks that will never get them on a path to any sort of official input, + such as voting rights." + +Desired Goal: incrementally adapt the Governance +================================================ + +We need to: + +- value non-coding contributions in the project and acknowledge all + efforts, including those that are not quantified by GitHub users' + activity + +- empower more contributors to effectively participate in the project + without requiring the security responsibilities of tracking code + changes to the main branches. These considerations should lead to the + diversification of contribution paths [4]_. + +Rather than introducing entirely new structure and Governance, we +propose changes to the existing ones which allow for small incremental +modifications over time. + +****************** + Proposed changes +****************** + +Some of the proposed modification have been discussed in the monthly +meetings, on April 25th 2022 [5]_ and September 5th 2022 [6]_. + +Define "Contributions" more broadly +=================================== + +Explicitly define Contributions and emphasize the importance of non-code +contributions in the Governance structure. + +Evolve the Technical Committee into a Steering Committee +======================================================== + +Rename "Technical Committee" to "Steering Committee". + +Define the Steering Committee as a subset of Core Contributors rather +than a subset of Core Developers. + +Create a Triage Team +==================== + +Create a Triage Team which would be given "Write" permissions on GitHub +[7]_ to be able to perform triaging tasks, such as editing issues' +description. 
+ +Define "Core Contributors" +========================== + +Establish all members of the following teams as "Core Contributors": + + - Triage Team + - Communication Team + - Development Team + +A Contributor is promoted to a Core Contributor after being proposed by +at least one existing Core Contributor. The proposal must specify which +Core Team the Contributor will be part of. The promotion is effective +after a vote on the private Core Contributor mailing list which must +last for two weeks and which must reach at least two-thirds positive +majority of the cast votes. + +Extend voting rights +==================== + +Give voting rights to all Core Contributors. + +Simplify subsequent changes to the Governance +============================================= + +Allow changes to the following aspects of the scikit-learn Governance +without requiring a SLEP: + + - additions and changes to Roles' and Teams' scopes + - additions and changes to Roles' and Teams' permissions + +Any changes to the scikit-learn Governance (including ones which do not +require being back by a SLEP) will continue to be subject to the +decision making process [8]_, which includes a vote of the Core +Contributors. + +If subsequent changes to the Governance are proposed through a GitHub +Pull Request (PR): + + - a positive vote is cast by approving the PR (i.e. "Approve" + review) + - a negative vote is cast by requesting changes to the PR (i.e. + "Request changes" review) + +In this case, the vote still has to be announced on the Core +Contributors' mailing list, but the system of Pull Request approvals +will replace a vote on the private Core Contributors' mailing list. + +*********** + Copyright +*********** + +This document has been placed in the public domain [9]_. + +************************** + References and Footnotes +************************** + +.. [1] + + J. -G. Young, A. Casari, K. McLaughlin, M. Z. Trujillo, L. + Hébert-Dufresne and J. P. Bagrow, "Which contributions count? Analysis + of attribution in open source," 2021 IEEE/ACM 18th International + Conference on Mining Software Repositories (MSR), 2021, pp. 242-253, + doi: 10.1109/MSR52588.2021.00036: https://arxiv.org/abs/2103.11007 + +.. [2] + + Contributor experience, diversity and culture in Open Source Projects: + keynote from Melissa Weber Mendonça: + https://2022.pycon.de/program/NVBLKH/ + +.. [3] + + Reshama Shaikh's quote from Melissa Weber Mendonça' keynote: + https://twitter.com/reshamas/status/1513488342767353857 + +.. [4] + + NumPy Newcomer's Hour: an Experiment on Community Building, talk from + Melissa Weber Mendonça: https://www.youtube.com/watch?v=c0XZQbu0xnw + +.. [5] + + scikit-learn April 25th 2022 Developer meeting notes: + https://github.com/scikit-learn/administrative/blob/master/meeting_notes/2022-04-25.md + +.. [6] + + scikit-learn September 5th 2022 Developer meeting notes: + https://github.com/scikit-learn/administrative/blob/master/meeting_notes/2022-09-05.md + +.. [7] + + Permissions for each role, Repository roles for an organization, GitHub + Docs: + https://docs.github.com/en/organizations/managing-access-to-your-organizations-repositories/repository-roles-for-an-organization#permissions-for-each-role + +.. [8] + + Decision Making Process, scikit-learn Governance and Decision-Making: + https://scikit-learn.org/dev/governance.html#decision-making-process + +.. [9] + + Open Publication License: https://www.opencontent.org/openpub/ From 25edba4a37e696e32a1b65497b714c80e09223b7 Mon Sep 17 00:00:00 2001 From: "Thomas J. 
Fan" <thomasjpfan@gmail.com> Date: Tue, 29 Nov 2022 04:35:01 -0500 Subject: [PATCH 108/118] SLEP 014 Pandas in Pandas out (#37) Co-authored-by: Joel Nothman <joeln@canva.com> --- index.rst | 2 +- rejected.rst | 4 - slep014/proposal.rst | 264 +++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 265 insertions(+), 5 deletions(-) delete mode 100644 rejected.rst create mode 100644 slep014/proposal.rst diff --git a/index.rst b/index.rst index a8147b0..1760bb5 100644 --- a/index.rst +++ b/index.rst @@ -39,7 +39,7 @@ :maxdepth: 1 :caption: Rejected - rejected + slep014/proposal .. toctree:: :maxdepth: 1 diff --git a/rejected.rst b/rejected.rst deleted file mode 100644 index 42799a4..0000000 --- a/rejected.rst +++ /dev/null @@ -1,4 +0,0 @@ -Rejected SLEPs -============== - -Nothing here diff --git a/slep014/proposal.rst b/slep014/proposal.rst new file mode 100644 index 0000000..adf8fbc --- /dev/null +++ b/slep014/proposal.rst @@ -0,0 +1,264 @@ +.. _slep_014: + +============================== +SLEP014: Pandas In, Pandas Out +============================== + +:Author: Thomas J Fan +:Status: Rejected +:Type: Standards Track +:Created: 2020-02-18 + +Abstract +######## + +This SLEP proposes using pandas DataFrames for propagating feature names +through ``scikit-learn`` transformers. + +Motivation +########## + +``scikit-learn`` is commonly used as a part of a larger data processing +pipeline. When this pipeline is used to transform data, the result is a +NumPy array, discarding column names. The current workflow for +extracting the feature names requires calling ``get_feature_names`` on the +transformer that created the feature. This interface can be cumbersome when used +together with a pipeline with multiple column names:: + + import pandas as pd + import numpy as np + from sklearn.compose import make_column_transformer + from sklearn.preprocessing import OneHotEncoder, StandardScaler + from sklearn.pipeline import make_pipeline + from sklearn.linear_model import LogisticRegression + + X = pd.DataFrame({'letter': ['a', 'b', 'c'], + 'pet': ['dog', 'snake', 'dog'], + 'num': [1, 2, 3]}) + y = [0, 0, 1] + orig_cat_cols, orig_num_cols = ['letter', 'pet'], ['num'] + + ct = make_column_transformer( + (OneHotEncoder(), orig_cat_cols), (StandardScaler(), orig_num_cols)) + pipe = make_pipeline(ct, LogisticRegression()).fit(X, y) + + cat_names = (pipe['columntransformer'] + .named_transformers_['onehotencoder'] + .get_feature_names(orig_cat_cols)) + + feature_names = np.r_[cat_names, orig_num_cols] + +The ``feature_names`` extracted above corresponds to the features directly +passed into ``LogisticRegression``. As demonstrated above, the process of +extracting ``feature_names`` requires knowing the order of the selected +categories in the ``ColumnTransformer``. Furthemore, if there is feature +selection in the pipeline, such as ``SelectKBest``, the ``get_support`` method +would need to be used to determine column names that were selected. + +Solution +######## + +The pandas DataFrame has been widely adopted by the Python Data ecosystem to +store data with feature names. This SLEP proposes using a DataFrame to +track the feature names as the data is transformed. 
With this feature, the
+API for extracting feature names would be::
+
+    from sklearn import set_config
+    set_config(pandas_in_out=True)
+
+    pipe.fit(X, y)
+    X_trans = pipe[:-1].transform(X)
+
+    X_trans.columns.tolist()
+    ['letter_a', 'letter_b', 'letter_c', 'pet_dog', 'pet_snake', 'num']
+
+This SLEP proposes attaching feature names to the output of ``transform``. In
+the above example, ``pipe[:-1].transform(X)`` propagates the feature names
+through the multiple transformers.
+
+This feature is only available through a soft dependency on pandas. Furthermore,
+it will be opt-in with the configuration flag ``pandas_in_out``. By
+default, ``pandas_in_out`` is set to ``False``, resulting in the output of all
+estimators being an ndarray.
+
+Enabling Functionality
+######################
+
+The following enhancements are **not** a part of this SLEP. These features are
+made possible if this SLEP gets accepted.
+
+1. Allowing estimators to treat columns differently based on name or dtype. For
+   example, the categorical dtype is useful for tree building algorithms.
+
+2. Storing feature names inside estimators for model inspection::
+
+     from sklearn import set_config
+     set_config(store_feature_names_in=True)
+
+     pipe.fit(X, y)
+
+     pipe['logisticregression'].feature_names_in_
+
+3. Extracting the feature names of estimators in meta-estimators::
+
+     from sklearn import set_config
+     set_config(store_feature_names_in=True)
+
+     est = BaggingClassifier(LogisticRegression())
+     est.fit(X, y)
+
+     # Gets the feature names used by an estimator in the ensemble
+     est.estimators_[0].feature_names_in_
+
+For options 2 and 3, the default value of the configuration flag
+``store_feature_names_in`` is ``False``.
+
+Considerations
+##############
+
+Memory copies
+-------------
+
+As noted in `pandas #27211 <https://github.com/pandas-dev/pandas/issues/27211>`_,
+there is no guarantee of a zero-copy round-trip going from numpy
+to a DataFrame. In other words, the following may lead to a memory copy in
+a future version of ``pandas``::
+
+    X = np.array(...)
+    X_df = pd.DataFrame(X)
+    X_again = np.asarray(X_df)
+
+This is an issue for ``scikit-learn`` when estimators are placed into a
+pipeline. For example, consider the following pipeline::
+
+    set_config(pandas_in_out=True)
+    pipe = make_pipeline(StandardScaler(), LogisticRegression())
+    pipe.fit(X, y)
+
+Internally, ``StandardScaler.fit_transform`` will operate on an ndarray and
+wrap the ndarray into a DataFrame as a return value. This will then be
+piped into ``LogisticRegression.fit``, which calls ``check_array`` on the
+DataFrame and may lead to a memory copy in a future version of
+``pandas``. This leads to unnecessary overhead from piping the data from one
+estimator to another.
+
+Sparse matrices
+---------------
+
+Traditionally, ``scikit-learn`` prefers to process sparse matrices in
+the compressed sparse row (CSR) matrix format. The `sparse data structure <https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html>`_ in pandas 1.0 only supports converting directly to
+the coordinate format (COO). Although this format was designed to quickly
+convert to CSR or CSC formats, the conversion process still needs to allocate
+more memory to store the intermediate COO data. This can be an issue with transformers
+such as ``OneHotEncoder``, whose ``transform`` method has been optimized to construct a CSR matrix.
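+
+As a minimal sketch of this concern (it relies on the sparse accessor available
+since pandas 1.0; the exact copy and allocation behaviour is an assumption and
+may vary across versions), the round trip would look like::
+
+    import pandas as pd
+    import scipy.sparse as sp
+
+    # A CSR matrix, similar to what OneHotEncoder.transform produces.
+    X_csr = sp.random(10000, 50, density=0.01, format="csr", random_state=0)
+
+    # Wrap the CSR matrix into a DataFrame with sparse columns.
+    X_df = pd.DataFrame.sparse.from_spmatrix(X_csr)
+
+    # Converting back to scipy only yields COO directly; an extra
+    # conversion (and allocation) is needed to recover the CSR format
+    # that most scikit-learn estimators expect.
+    X_csr_again = X_df.sparse.to_coo().tocsr()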
+
+Backward compatibility
+######################
+
+The ``set_config(pandas_in_out=True)`` global configuration flag will be set to
+``False`` by default to ensure backward compatibility. When this flag is ``False``,
+the output of all estimators will be an ndarray.
+
+Community Adoption
+##################
+
+With the new ``pandas_in_out`` configuration flag, third party libraries may
+need to query the configuration flag to be fully compliant with this SLEP.
+Specifically, "to be fully compliant" entails the following policy:
+
+1. If ``pandas_in_out=False``, then ``transform`` always returns a numpy array.
+2. If ``pandas_in_out=True``, then ``transform`` returns a DataFrame if the
+   input is a DataFrame.
+
+This policy can either be enforced with ``check_estimator`` or not:
+
+- **Enforce**: This increases the maintenance burden of third party libraries.
+  This burden includes checking for the configuration flag, generating feature names, and including pandas as a dependency of their library.
+
+- **Not Enforce**: Currently, third party transformers can return a DataFrame
+  or a numpy array, and this is mostly compatible with ``scikit-learn``. Users with
+  third party transformers would not be able to access the features enabled
+  by this SLEP.
+
+
+Alternatives
+############
+
+This section lists alternative data structures that can be used, with their
+advantages and disadvantages when compared to a pandas DataFrame.
+
+InputArray
+----------
+
+The proposed ``InputArray`` described in
+:ref:`SLEP012 Custom InputArray Data Structure <slep_012>` introduces a new
+data structure for homogeneous data.
+
+Pros
+~~~~
+
+- A thin wrapper around a numpy array or a sparse matrix with a minimal feature
+  set that ``scikit-learn`` can evolve independently.
+
+Cons
+~~~~
+
+- Introduces another data structure for data storage in the PyData ecosystem.
+- Currently, the design only allows for homogeneous data.
+- Increases maintenance responsibilities for ``scikit-learn``.
+
+XArray Dataset
+--------------
+
+`xarray's Dataset <http://xarray.pydata.org/en/stable/data-structures.html#dataset>`_
+is a multi-dimensional version of pandas' DataFrame.
+
+Pros
+~~~~
+
+- Can be used for heterogeneous data.
+
+Cons
+~~~~
+
+- ``scikit-learn`` does not require many of the features Dataset provides.
+- Needs to be converted to a DataArray before it can be converted to a numpy array.
+- The `conversion from a pandas DataFrame to a Dataset <http://xarray.pydata.org/en/stable/pandas.html>`_
+  is not lossless. For example, categorical dtypes in a pandas DataFrame will
+  lose their categorical information when converted to a Dataset.
+- xarray does not have as much adoption as pandas, which increases the learning
+  curve for using Dataset with ``scikit-learn``.
+
+XArray DataArray
+----------------
+
+`xarray's DataArray <http://xarray.pydata.org/en/stable/data-structures.html#dataarray>`_
+is a data structure that stores homogeneous data.
+
+Pros
+~~~~
+
+- xarray guarantees that there will be no copies during round-trips from
+  numpy. (`xarray #3077 <https://github.com/pydata/xarray/issues/3077>`_)
+
+Cons
+~~~~
+
+- Can only be used for homogeneous data.
+- As with XArray's Dataset, DataArray does not have as much adoption as pandas,
+  which increases the learning curve for using DataArray with ``scikit-learn``.
+
+References and Footnotes
+########################
+
+.. [1] Each SLEP must either be explicitly labeled as placed in the public
+   domain (see this SLEP as an example) or licensed under the `Open
+   Publication License`_.
+
+.. _Open Publication License: https://www.opencontent.org/openpub/
+
+
+Copyright
+#########
+
+This document has been placed in the public domain. [1]_
From 221362bf8ba4dd82a9da62c91127b8171d471764 Mon Sep 17 00:00:00 2001
From: "Thomas J. Fan" <thomasjpfan@gmail.com>
Date: Wed, 30 Nov 2022 03:48:09 -0500
Subject: [PATCH 109/118] SLEP015: Feature Names Propagation (#48)

---
 index.rst            |   1 +
 slep015/proposal.rst | 191 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 192 insertions(+)
 create mode 100644 slep015/proposal.rst

diff --git a/index.rst b/index.rst
index 1760bb5..d0d2119 100644
--- a/index.rst
+++ b/index.rst
@@ -40,6 +40,7 @@
    :caption: Rejected
 
    slep014/proposal
+   slep015/proposal
 
 .. toctree::
    :maxdepth: 1
diff --git a/slep015/proposal.rst b/slep015/proposal.rst
new file mode 100644
index 0000000..bea2d8f
--- /dev/null
+++ b/slep015/proposal.rst
@@ -0,0 +1,191 @@
+.. _slep_015:
+
+==================================
+SLEP015: Feature Names Propagation
+==================================
+
+:Author: Thomas J Fan
+:Status: Rejected
+:Type: Standards Track
+:Created: 2020-10-03
+
+Abstract
+########
+
+This SLEP proposes adding the ``get_feature_names_out`` method to all
+transformers and the ``feature_names_in_`` attribute for all estimators.
+The ``feature_names_in_`` attribute is set during ``fit`` if the input, ``X``,
+contains the feature names.
+
+Motivation
+##########
+
+``scikit-learn`` is commonly used as a part of a larger data processing
+pipeline. When this pipeline is used to transform data, the result is a
+NumPy array, discarding column names. The current workflow for
+extracting the feature names requires calling ``get_feature_names`` on the
+transformer that created the feature. This interface can be cumbersome when used
+together with a pipeline with multiple column names::
+
+    import numpy as np
+    import pandas as pd
+    from sklearn.compose import ColumnTransformer
+    from sklearn.preprocessing import OneHotEncoder, StandardScaler
+    from sklearn.pipeline import make_pipeline
+    from sklearn.linear_model import LogisticRegression
+
+    X = pd.DataFrame({'letter': ['a', 'b', 'c'],
+                      'pet': ['dog', 'snake', 'dog'],
+                      'distance': [1, 2, 3]})
+    y = [0, 0, 1]
+    orig_cat_cols, orig_num_cols = ['letter', 'pet'], ['distance']
+
+    ct = ColumnTransformer(
+        [('cat', OneHotEncoder(), orig_cat_cols),
+         ('num', StandardScaler(), orig_num_cols)])
+    pipe = make_pipeline(ct, LogisticRegression()).fit(X, y)
+
+    cat_names = (pipe['columntransformer']
+                 .named_transformers_['cat']
+                 .get_feature_names(orig_cat_cols))
+
+    feature_names = np.r_[cat_names, orig_num_cols]
+
+The ``feature_names`` extracted above corresponds to the features directly
+passed into ``LogisticRegression``. As demonstrated above, the process of
+extracting ``feature_names`` requires knowing the order of the selected
+categories in the ``ColumnTransformer``. Furthermore, if there is feature
+selection in the pipeline, such as ``SelectKBest``, the ``get_support`` method
+would need to be used to infer the column names that were selected.
+
+Solution
+########
+
+This SLEP proposes adding the ``feature_names_in_`` attribute to all estimators
+that will extract the feature names of ``X`` during ``fit``. This will also
+be used for validation during non-``fit`` methods such as ``transform`` or
+``predict``. If ``X`` is not a recognized container with columns, then
+``feature_names_in_`` can be undefined. If ``feature_names_in_`` is undefined,
+then it will not be validated.
+
+Secondly, this SLEP proposes adding ``get_feature_names_out(input_names=None)``
+to all transformers. By default, the input features will be determined by the
+``feature_names_in_`` attribute.
The feature names of a pipeline can then be
+easily extracted as follows::
+
+    pipe[:-1].get_feature_names_out()
+    # ['cat__letter_a', 'cat__letter_b', 'cat__letter_c',
+    #  'cat__pet_dog', 'cat__pet_snake', 'num__distance']
+
+Note that ``get_feature_names_out`` does not require ``input_names``
+because the feature names were stored in the pipeline itself. These
+feature names will be passed to each step's ``get_feature_names_out`` method to
+obtain the output feature names of the ``Pipeline`` itself.
+
+Enabling Functionality
+######################
+
+The following enhancements are **not** a part of this SLEP. These features are
+made possible if this SLEP gets accepted.
+
+1. This SLEP enables us to implement an ``array_out`` keyword argument to
+   all ``transform`` methods to specify the array container returned by
+   ``transform``. An implementation of ``array_out`` requires
+   ``feature_names_in_`` to validate that the names in ``fit`` and
+   ``transform`` are consistent. An implementation of ``array_out`` needs
+   a way to map from the input feature names to output feature names, which is
+   provided by ``get_feature_names_out``.
+
+2. An alternative to ``array_out``: Transformers in a pipeline may wish to have
+   feature names passed in as ``X``. This can be enabled by adding an
+   ``array_input`` parameter to ``Pipeline``::
+
+     pipe = make_pipeline(ct, MyTransformer(), LogisticRegression(),
+                          array_input='pandas')
+
+   In this case, the pipeline will construct a pandas DataFrame to be passed
+   into ``MyTransformer`` and ``LogisticRegression``. The feature names
+   will be constructed by calling ``get_feature_names_out`` as data is passed
+   through the ``Pipeline``. This feature implies that ``Pipeline`` is
+   doing the construction of the DataFrame.
+
+Considerations and Limitations
+##############################
+
+1. The names returned by ``get_feature_names_out`` will be constructed using
+   the name generation specification from :ref:`slep_007`.
+
+2. For a ``Pipeline`` with only one estimator, slicing will not work and one
+   would need to access the feature names directly::
+
+     pipe1 = make_pipeline(StandardScaler(), LogisticRegression())
+     pipe1[:-1].feature_names_in_  # Works
+
+     pipe2 = make_pipeline(LogisticRegression())
+     pipe2[:-1].feature_names_in_  # Does not work
+
+   This is because ``pipe2[:-1]`` results in a pipeline with no steps, which
+   raises an error. We can work around this by allowing pipelines
+   with no steps.
+
+3. ``feature_names_in_`` can be any 1-D ``Sequence``, such as a list or
+   an ndarray.
+
+4. Meta-estimators will delegate the setting and validation of
+   ``feature_names_in_`` to their inner estimators. The meta-estimator will
+   define ``feature_names_in_`` by referencing its inner estimators. For
+   example, the ``Pipeline`` can use ``steps[0].feature_names_in_`` as
+   the input feature names. If the inner estimators do not define
+   ``feature_names_in_``, then the meta-estimator will not define
+   ``feature_names_in_`` either.
+
+Backward compatibility
+######################
+
+1. This SLEP is fully backward compatible with previous versions. With the
+   introduction of ``get_feature_names_out``, ``get_feature_names`` will
+   be deprecated. Note that ``get_feature_names_out``'s signature will
+   always contain ``input_features``, which can be used or ignored. This
+   helps standardize the interface for the get feature names method.
+
+2. The inclusion of a ``get_feature_names_out`` method will not introduce any
+   overhead to estimators.
+
+3. 
The inclusion of a ``feature_names_in_`` attribute will increase the size of + estimators because they would store the feature names. Users can remove + the attribute by calling ``del est.feature_names_in_`` if they want to + remove the feature and disable validation. + +Alternatives +############ + +There have been many attempts to address this issue: + +1. ``array_out`` in keyword parameter in ``transform`` : This approach requires + third party estimators to unwrap and wrap array containers in transform, + which introduces more burden for third party estimator maintainers. + Furthermore, ``array_out`` with sparse data will introduce an overhead when + being passed along in a ``Pipeline``. This overhead comes from the + construction of the sparse data container that has the feature names. + +2. :ref:`slep_007` : ``SLEP007`` introduces a ``feature_names_out_`` attribute + while this SLEP proposes a ``get_feature_names_out`` method to accomplish + the same task. The benefit of the ``get_feature_names_out`` method is that + it can be used even if the feature names were not passed in ``fit`` with a + dataframe. For example, in a ``Pipeline`` the feature names are not passed + through to each step and a ``get_feature_names_out`` method can be used to + get the names of each step with slicing. + +3. :ref:`slep_012` : The ``InputArray`` was developed to work around the + overhead of using a pandas ``DataFrame`` or an xarray ``DataArray``. The + introduction of another data structure into the Python Data Ecosystem, would + lead to more burden for third party estimator maintainers. + + +References and Footnotes +######################## + +.. [1] Each SLEP must either be explicitly labeled as placed in the public + domain (see this SLEP as an example) or licensed under the `Open + Publication License`_. + +.. _Open Publication License: https://www.opencontent.org/openpub/ + + +Copyright +######### + +This document has been placed in the public domain. [1]_ From cbde437dc7b13a6c5d7cd8ecd728b72c81d97020 Mon Sep 17 00:00:00 2001 From: Adrin Jalali <adrin.jalali@gmail.com> Date: Mon, 5 Dec 2022 13:45:26 +0100 Subject: [PATCH 110/118] Add X,y,Y to ignored list (#69) --- slep006/proposal.rst | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/slep006/proposal.rst b/slep006/proposal.rst index 80b6b9b..ddc7c34 100644 --- a/slep006/proposal.rst +++ b/slep006/proposal.rst @@ -76,6 +76,10 @@ Note that in the core library nothing is requested by default, except the time of writing this proposal, all metadata requested in the core library are sample aligned. +Also note that ``X``, ``y``, and ``Y`` input arguments are never automatically +added to the routing mechanism and are always passed into their respective +methods. + Detailed description -------------------- From 4642805fabbaef76ffd83e0b5073f5bf436b9e0b Mon Sep 17 00:00:00 2001 From: Adrin Jalali <adrin.jalali@gmail.com> Date: Wed, 28 Dec 2022 05:05:15 +0100 Subject: [PATCH 111/118] slep019: Add non-core-dev contributors to an ack section (#82) = --- slep019/proposal.rst | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/slep019/proposal.rst b/slep019/proposal.rst index 2ddd55c..89ff535 100644 --- a/slep019/proposal.rst +++ b/slep019/proposal.rst @@ -139,6 +139,19 @@ In this case, the vote still has to be announced on the Core Contributors' mailing list, but the system of Pull Request approvals will replace a vote on the private Core Contributors' mailing list. 
+ +************** +Acknowledgment +************** + +We thank the following people who have helped with discussions during the +development of this SLEP: + +- Lucy Liu: https://github.com/lucyleeow +- Noa Tamir: https://github.com/noatamir +- Reshama Shaikh: https://github.com/reshamas +- Tim Head: https://github.com/betatim + *********** Copyright *********** From 281e2b9315ae43af608028f20804023b98bf2d91 Mon Sep 17 00:00:00 2001 From: Andreas Mueller <t3kcit@gmail.com> Date: Thu, 29 Dec 2022 22:15:38 -0800 Subject: [PATCH 112/118] VOTE SLEP 17 (#79) --- index.rst | 1 + slep017/proposal.rst | 5 +++-- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/index.rst b/index.rst index d0d2119..031a53b 100644 --- a/index.rst +++ b/index.rst @@ -14,6 +14,7 @@ slep007/proposal slep009/proposal slep010/proposal + slep017/proposal slep018/proposal .. toctree:: diff --git a/slep017/proposal.rst b/slep017/proposal.rst index fdec790..da0e508 100644 --- a/slep017/proposal.rst +++ b/slep017/proposal.rst @@ -3,10 +3,11 @@ SLEP017: Clone Override Protocol with ``__sklearn_clone__`` =========================================================== :Author: Joel Nothman -:Status: Draft +:Status: Accepted :Type: Standards Track :Created: 2022-03-19 -:Resolution: (required for Accepted | Rejected | Withdrawn) +:scikit-learn-Version: 1.3.0 +:Resolution: https://github.com/scikit-learn/enhancement_proposals/pull/79 Abstract -------- From ac7f438c2013a64ddd907d8f22bc3a807b20b26e Mon Sep 17 00:00:00 2001 From: Adrin Jalali <adrin.jalali@gmail.com> Date: Mon, 2 Jan 2023 12:07:52 +0100 Subject: [PATCH 113/118] Reject SLEP013 (#36) * accept SLEP013 * add example * reject SLEP13 * move to rejected --- index.rst | 2 +- slep013/proposal.rst | 51 +++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 51 insertions(+), 2 deletions(-) diff --git a/index.rst b/index.rst index 031a53b..4115d93 100644 --- a/index.rst +++ b/index.rst @@ -22,7 +22,6 @@ :caption: Under review slep012/proposal - slep013/proposal slep017/proposal slep019/proposal @@ -40,6 +39,7 @@ :maxdepth: 1 :caption: Rejected + slep013/proposal slep014/proposal slep015/proposal diff --git a/slep013/proposal.rst b/slep013/proposal.rst index 4744aaa..0fb0b2d 100644 --- a/slep013/proposal.rst +++ b/slep013/proposal.rst @@ -5,7 +5,7 @@ SLEP013: ``n_features_out_`` attribute ====================================== :Author: Adrin Jalali -:Status: Under Review +:Status: Rejected :Type: Standards Track :Created: 2020-02-12 @@ -22,6 +22,55 @@ Knowing the number of features that a transformer outputs is useful for inspection purposes. This is in conjunction with `*SLEP010: ``n_features_in_``* <https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep010/proposal.html>`_. +Take the following piece as an example:: + + X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True) + + # We will train our classifier with the following features: + # Numeric Features: + # - age: float. + # - fare: float. + # Categorical Features: + # - embarked: categories encoded as strings {'C', 'S', 'Q'}. + # - sex: categories encoded as strings {'female', 'male'}. + # - pclass: ordinal integers {1, 2, 3}. + + # We create the preprocessing pipelines for both numeric and categorical data. 
+ numeric_features = ['age', 'fare'] + numeric_transformer = Pipeline(steps=[ + ('imputer', SimpleImputer(strategy='median')), + ('scaler', StandardScaler())]) + + categorical_features = ['embarked', 'sex', 'pclass'] + categorical_transformer = Pipeline(steps=[ + ('imputer', SimpleImputer(strategy='constant', fill_value='missing')), + ('onehot', OneHotEncoder(handle_unknown='ignore'))]) + + preprocessor = ColumnTransformer( + transformers=[ + ('num', numeric_transformer, numeric_features), + ('cat', categorical_transformer, categorical_features)]) + + # Append classifier to preprocessing pipeline. + # Now we have a full prediction pipeline. + clf = Pipeline(steps=[('preprocessor', preprocessor), + ('classifier', LogisticRegression())]) + + X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) + + clf.fit(X_train, y_train) + +The user could then inspect the number of features going out from each step:: + + # Total number of output features from the `ColumnTransformer` + clf[0].n_features_out_ + + # Number of features as a result of the numerical pipeline: + clf[0].named_transformers_['num'].n_features_out_ + + # Number of features as a result of the categorical pipeline: + clf[0].named_transformers_['cat'].n_features_out_ + Solution ######## From 60a337d26eabef0baa757877e6da8dc31bfa583d Mon Sep 17 00:00:00 2001 From: "Thomas J. Fan" <thomasjpfan@gmail.com> Date: Fri, 20 Jan 2023 14:26:51 -0500 Subject: [PATCH 114/118] SLEP020: Simplifing Governance Changes (#84) --- index.rst | 1 + slep020/proposal.rst | 62 ++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 63 insertions(+) create mode 100644 slep020/proposal.rst diff --git a/index.rst b/index.rst index 4115d93..792b111 100644 --- a/index.rst +++ b/index.rst @@ -24,6 +24,7 @@ slep012/proposal slep017/proposal slep019/proposal + slep020/proposal .. toctree:: :maxdepth: 1 diff --git a/slep020/proposal.rst b/slep020/proposal.rst new file mode 100644 index 0000000..355b60b --- /dev/null +++ b/slep020/proposal.rst @@ -0,0 +1,62 @@ +.. _slep_020: + +======================================= +SLEP020: Simplifying Governance Changes +======================================= + +:Author: Thomas J Fan +:Status: Draft +:Type: Process +:Created: 2023-01-09 + +Abstract +-------- + +This SLEP proposes to permit governance changes through GitHub Pull Requests, +where a vote will also occur in the Pull Request. + +Detailed description +-------------------- + +Currently, scikit-learn's governance document [2]_ requires an enhancement +proposal to make any changes to the governance document. In this SLEP, we +propose simplifying the process by allowing governance changes through GitHub +Pull Requests. Once the authors are happy with the state of the Pull Request, +they can call for a vote on the mailing list. No changes are allowed until the +vote is closed. A Pull Request approval will count as a positive vote, and a +"Request Changes" review will count as a negative vote. The voting period will +remain one month as stated in the current Governance and Decision-Making +Document [2]_. + +Discussion +---------- + +Members of the scikit-learn community have discussed changing the governance +through :ref:`SLEP019 <slep_019>` in following PRs: + +1. `enhancement_proposals#74 <https://github.com/scikit-learn/enhancement_proposals/pull/74>`__ + proposed updating the Governance to broaden the notion of contribution in scikit-learn. + The draft was approved and merged on 2022-11-18. +2. 
`enhancement_proposals#81 <https://github.com/scikit-learn/enhancement_proposals/pull/81>`__
+   proposed updates to :ref:`SLEP019 <slep_019>`.
+
+:ref:`SLEP019 <slep_019>` also includes the voting change proposed in this SLEP.
+This SLEP's goal is to simplify the process of making governance changes, thus
+enabling the governance structure to evolve more efficiently.
+
+References and Footnotes
+------------------------
+
+.. [1] Each SLEP must either be explicitly labeled as placed in the public
+   domain (see this SLEP as an example) or licensed under the `Open Publication
+   License`_.
+.. [2] `scikit-learn Governance and Decision-Making
+   <https://scikit-learn.org/stable/governance.html#decision-making-process>`__
+
+.. _Open Publication License: https://www.opencontent.org/openpub/
+
+
+Copyright
+---------
+
+This document has been placed in the public domain. [1]_
From 24db8ea36b629b1d4f012b289afcd8bac2bb2b6f Mon Sep 17 00:00:00 2001
From: "Thomas J. Fan" <thomasjpfan@gmail.com>
Date: Wed, 22 Feb 2023 10:10:14 -0500
Subject: [PATCH 115/118] VOTE SLEP 20: Simplifying Governance Changes (#85)

---
 index.rst            | 2 +-
 slep020/proposal.rst | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/index.rst b/index.rst
index 792b111..ff7d43c 100644
--- a/index.rst
+++ b/index.rst
@@ -16,6 +16,7 @@
    slep010/proposal
    slep017/proposal
    slep018/proposal
+   slep020/proposal
 
 .. toctree::
    :maxdepth: 1
@@ -24,7 +25,6 @@
    slep012/proposal
    slep017/proposal
    slep019/proposal
-   slep020/proposal
 
 .. toctree::
    :maxdepth: 1
diff --git a/slep020/proposal.rst b/slep020/proposal.rst
index 355b60b..5037d4e 100644
--- a/slep020/proposal.rst
+++ b/slep020/proposal.rst
@@ -5,9 +5,10 @@ SLEP020: Simplifying Governance Changes
 =======================================
 
 :Author: Thomas J Fan
-:Status: Draft
+:Status: Accepted
 :Type: Process
 :Created: 2023-01-09
+:Resolution: https://github.com/scikit-learn/enhancement_proposals/pull/85
 
 Abstract
 --------
From c1dde7527a1c97b6a41ce180395e90e3f31fd59f Mon Sep 17 00:00:00 2001
From: Adrin Jalali <adrin.jalali@gmail.com>
Date: Wed, 15 Mar 2023 11:30:48 +0100
Subject: [PATCH 116/118] Withdraw SLEP019 (#87)

---
 slep019/proposal.rst | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/slep019/proposal.rst b/slep019/proposal.rst
index 89ff535..b416b90 100644
--- a/slep019/proposal.rst
+++ b/slep019/proposal.rst
@@ -5,7 +5,7 @@
 ####################################################################
 
 :Author: Julien Jerphanion <git@jjerphan.xyz>, Gaël Varoquaux <gael.varoquaux@normalesup.org>
-:Status: Draft
+:Status: Withdrawn
 :Type: Process
 :Created: 2022-09-12
 
@@ -210,3 +210,12 @@ This document has been placed in the public domain [9]_.
 .. [9]
 
    Open Publication License: https://www.opencontent.org/openpub/
+
+
+****
+Note
+****
+
+Since SLEP020 allows the governance to be modified without requiring a SLEP,
+the changes discussed in this SLEP are to be proposed and implemented in
+subsequent PRs that directly change the governance document in the main repository.
From 3f1c7a6b63cf93e5b7ed6f48ea78538f46d1ab51 Mon Sep 17 00:00:00 2001 From: Guillaume Lemaitre <g.lemaitre58@gmail.com> Date: Fri, 1 Mar 2024 18:49:30 +0100 Subject: [PATCH 117/118] MNT add .readthedocs.yaml (#91) --- .readthedocs.yaml | 13 +++++++++++++ 1 file changed, 13 insertions(+) create mode 100644 .readthedocs.yaml diff --git a/.readthedocs.yaml b/.readthedocs.yaml new file mode 100644 index 0000000..3c04e40 --- /dev/null +++ b/.readthedocs.yaml @@ -0,0 +1,13 @@ +version: 2 + +build: + os: ubuntu-22.04 + tools: + python: "3.12" + +sphinx: + configuration: ./conf.py + +python: + install: + - requirements: requirements.txt \ No newline at end of file From 6f219b26de19ca76d56ab8ce9e30d7cf5f456fa4 Mon Sep 17 00:00:00 2001 From: Christian Lorentzen <lorentzen.ch@gmail.com> Date: Mon, 7 Oct 2024 16:53:30 +0200 Subject: [PATCH 118/118] Withdraw SLEP004 and SLEP012 (#93) * Withdraw SLEP012 * Withdraw SLEP004 --- index.rst | 4 ++-- slep004/proposal.rst | 5 +++++ slep012/proposal.rst | 2 +- 3 files changed, 8 insertions(+), 3 deletions(-) diff --git a/index.rst b/index.rst index ff7d43c..9848922 100644 --- a/index.rst +++ b/index.rst @@ -22,7 +22,6 @@ :maxdepth: 1 :caption: Under review - slep012/proposal slep017/proposal slep019/proposal @@ -34,12 +33,13 @@ slep001/proposal slep002/proposal slep003/proposal - slep004/proposal .. toctree:: :maxdepth: 1 :caption: Rejected + slep004/proposal + slep012/proposal slep013/proposal slep014/proposal slep015/proposal diff --git a/slep004/proposal.rst b/slep004/proposal.rst index a9992eb..195076f 100644 --- a/slep004/proposal.rst +++ b/slep004/proposal.rst @@ -4,6 +4,11 @@ SLEP004: Data information ========================= +:Author: Nicolas Hug +:Status: Withdrawn (superseded by :ref:`SLEP006 <slep_006>`) +:Type: Standards Track +:Created: 2018-12-12 + This is a specification to introduce data information (as ``sample_weights``) during the computation of an estimator methods (``fit``, ``score``, ...) based on the different discussion proposes on diff --git a/slep012/proposal.rst b/slep012/proposal.rst index 90c9347..aad4a46 100644 --- a/slep012/proposal.rst +++ b/slep012/proposal.rst @@ -5,7 +5,7 @@ SLEP012: ``InputArray`` ======================= :Author: Adrin jalali -:Status: Draft +:Status: Withdrawn (superseded by :ref:`SLEP007 <slep_007>` and :ref:`SLEP018 <slep_018>`) :Type: Standards Track :Created: 2019-12-20