
Commit dc76927

Pushing the docs to dev/ for branch: main, commit 97e1401822e2b18339d3e7a9497cb682617c1587
1 parent 22ae15b commit dc76927


1,217 files changed: +4301 / -4289 lines changed


dev/_downloads/7012baed63b9a27f121bae611b8285c2/plot_cyclical_feature_engineering.ipynb

Lines changed: 15 additions & 15 deletions
Large diffs are not rendered by default.

dev/_downloads/9fcbbc59ab27a20d07e209a711ac4f50/plot_cyclical_feature_engineering.py

Lines changed: 57 additions & 53 deletions
@@ -64,8 +64,9 @@
 #
 # When reporting performance measure on the test set in the discussion, we
 # instead choose to focus on the mean absolute error that is more
-# intuitive than the (root) mean squared error. Note however that the best
-# models for one metric are also the best for the other in this study.
+# intuitive than the (root) mean squared error. Note, however, that the
+# best models for one metric are also the best for the other in this
+# study.
 y = df["count"] / 1000

 # %%
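
For context, a sketch of what the `evaluate` helper referenced in later hunk headers could look like, reporting cross-validated MAE alongside RMSE; the exact implementation in plot_cyclical_feature_engineering.py may differ:

from sklearn.model_selection import cross_validate


def evaluate(model, X, y, cv):
    # Report both metrics; the discussion above focuses on the more intuitive MAE.
    cv_results = cross_validate(
        model,
        X,
        y,
        cv=cv,
        scoring=["neg_mean_absolute_error", "neg_root_mean_squared_error"],
    )
    mae = -cv_results["test_neg_mean_absolute_error"]
    rmse = -cv_results["test_neg_root_mean_squared_error"]
    print(f"Mean Absolute Error:     {mae.mean():.3f} +/- {mae.std():.3f}")
    print(f"Root Mean Squared Error: {rmse.mean():.3f} +/- {rmse.std():.3f}")
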
@@ -171,11 +172,11 @@
 # let the model know that it should treat those as categorical variables by
 # using a dedicated tree splitting rule. Since we use an ordinal encoder, we
 # pass the list of categorical values explicitly to use a logical order when
-# encoding the categories as integer instead of the lexicographical order. This
-# also has the added benefit of preventing any issue with unknown categories
-# when using cross-validation.
+# encoding the categories as integers instead of the lexicographical order.
+# This also has the added benefit of preventing any issue with unknown
+# categories when using cross-validation.
 #
-# The numerical variable need no preprocessing and, for the sake of simplicity,
+# The numerical variables need no preprocessing and, for the sake of simplicity,
 # we only try the default hyper-parameters for this model:
 from sklearn.pipeline import make_pipeline
 from sklearn.preprocessing import OrdinalEncoder
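
To make the preprocessing concrete, a minimal sketch of such an ordinal-encoding pipeline for a gradient boosting model; the column names and category orderings below are assumptions for illustration, not values taken from the commit:

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

categorical_columns = ["weather", "season", "holiday", "workingday"]  # assumed names
ordinal_encoder = OrdinalEncoder(
    # Explicit category lists give a logical (not lexicographical) integer order
    # and avoid unknown-category errors during cross-validation.
    categories=[
        ["clear", "misty", "rain"],              # assumed labels
        ["spring", "summer", "fall", "winter"],  # assumed labels
        [False, True],
        [False, True],
    ],
)
preprocessor = ColumnTransformer(
    [("categorical", ordinal_encoder, categorical_columns)],
    remainder="passthrough",  # numerical variables are used as-is
)
gbrt_pipeline = make_pipeline(
    preprocessor,
    HistGradientBoostingRegressor(categorical_features=[0, 1, 2, 3]),
)
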
@@ -243,7 +244,7 @@ def evaluate(model, X, y, cv):
 # of a problem for tree-based models as they can learn a non-monotonic
 # relationship between ordinal input features and the target.
 #
-# This is not the case for linear regression model as we will see in the
+# This is not the case for linear regression models as we will see in the
 # following.
 #
 # Naive linear regression
@@ -279,25 +280,26 @@ def evaluate(model, X, y, cv):
 #
 # The performance is not good: the average error is around 14% of the maximum
 # demand. This is more than three times higher than the average error of the
-# gradient boosting model. We can suspect that the naive original encoding of
-# the periodic time-related features might prevent the linear regression model
-# to properly leverage the time information: linear regression does not model
-# non-monotonic relationships between the input features and the target.
-# Non-linear terms have to be engineered in the input.
+# gradient boosting model. We can suspect that the naive original encoding
+# (merely min-max scaled) of the periodic time-related features might prevent
+# the linear regression model to properly leverage the time information: linear
+# regression does not automatically model non-monotonic relationships between
+# the input features and the target. Non-linear terms have to be engineered in
+# the input.
 #
 # For example, the raw numerical encoding of the `"hour"` feature prevents the
 # linear model from recognizing that an increase of hour in the morning from 6
 # to 8 should have a strong positive impact on the number of bike rentals while
-# a increase of similar magnitude in the evening from 18 to 20 should have a
+# an increase of similar magnitude in the evening from 18 to 20 should have a
 # strong negative impact on the predicted number of bike rentals.
 #
 # Time-steps as categories
 # ------------------------
 #
 # Since the time features are encoded in a discrete manner using integers (24
 # unique values in the "hours" feature), we could decide to treat those as
-# categorical variables and ignore any assumption implied by the ordering of
-# the hour values using a one-hot encoding.
+# categorical variables using a one-hot encoding and thereby ignore any
+# assumption implied by the ordering of the hour values.
 #
 # Using one-hot encoding for the time features gives the linear model a lot
 # more flexibility as we introduce one additional feature per discrete time
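
A small illustration of the idea (not taken from the commit): one-hot encoding turns the single ordinal "hour" column into 24 indicator columns, so a linear model can learn an independent weight for every hour of the day:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

hours = np.arange(24).reshape(-1, 1)  # the 24 discrete hour values
one_hot = OneHotEncoder(handle_unknown="ignore").fit_transform(hours)
print(one_hot.shape)  # (24, 24): one indicator column per hour level
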
@@ -317,8 +319,8 @@ def evaluate(model, X, y, cv):

 # %%
 # The average error rate of this model is 10% which is much better than using
-# the original ordinal encoding of the time feature, confirming our intuition
-# that the linear regression model benefit from the added flexibility to not
+# the original (ordinal) encoding of the time feature, confirming our intuition
+# that the linear regression model benefits from the added flexibility to not
 # treat time progression in a monotonic manner.
 #
 # However, this introduces a very large number of new features. If the time of
@@ -330,7 +332,7 @@ def evaluate(model, X, y, cv):
 # benefitting from the non-monotonic expressivity advantages of one-hot
 # encoding.
 #
-# Finally, we also observe than one-hot encoding completely ignores the
+# Finally, we also observe that one-hot encoding completely ignores the
 # ordering of the hour levels while this could be an interesting inductive bias
 # to preserve to some level. In the following we try to explore smooth,
 # non-monotonic encoding that locally preserves the relative ordering of time
@@ -340,7 +342,7 @@ def evaluate(model, X, y, cv):
 # ----------------------
 #
 # As a first attempt, we can try to encode each of those periodic features
-# using a sine and cosine transform with the matching period.
+# using a sine and cosine transformation with the matching period.
 #
 # Each ordinal time feature is transformed into 2 features that together encode
 # equivalent information in a non-monotonic way, and more importantly without
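
A sketch of the sine/cosine encoders this section refers to (the hunk headers below mention a cos_transformer(period) helper); the exact implementation in the example file may differ slightly:

import numpy as np
from sklearn.preprocessing import FunctionTransformer


def sin_transformer(period):
    # Map an ordinal value onto the sine of its angle on a circle with the given period.
    return FunctionTransformer(lambda x: np.sin(x / period * 2 * np.pi))


def cos_transformer(period):
    return FunctionTransformer(lambda x: np.cos(x / period * 2 * np.pi))


# Hour 0 and hour 24 land on the same point of the unit circle:
hours = np.array([[0.0], [12.0], [24.0]])
print(cos_transformer(24).fit_transform(hours).ravel())  # [ 1. -1.  1.]
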
@@ -375,9 +377,9 @@ def cos_transformer(period):
 #
 # Let's use a 2D scatter plot with the hours encoded as colors to better see
 # how this representation maps the 24 hours of the day to a 2D space, akin to
-# some sort of 24 hour version of an analog clock. Note that the "25th" hour is
-# mapped back to the 1st hour because of the periodic nature of the sine/cosine
-# representation.
+# some sort of a 24 hour version of an analog clock. Note that the "25th" hour
+# is mapped back to the 1st hour because of the periodic nature of the
+# sine/cosine representation.
 fig, ax = plt.subplots(figsize=(7, 5))
 sp = ax.scatter(hour_df["hour_sin"], hour_df["hour_cos"], c=hour_df["hour"])
 ax.set(
@@ -420,7 +422,8 @@ def cos_transformer(period):
 #
 # We can try an alternative encoding of the periodic time-related features
 # using spline transformations with a large enough number of splines, and as a
-# result a larger number of expanded features:
+# result a larger number of expanded features compared to the sine/cosine
+# transformation:
 from sklearn.preprocessing import SplineTransformer


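One possible implementation of the periodic_spline_transformer helper referenced in the following hunk headers; the knot placement shown here is an assumption, not necessarily what the example file does:

import numpy as np
from sklearn.preprocessing import SplineTransformer


def periodic_spline_transformer(period, n_splines=None, degree=3):
    if n_splines is None:
        n_splines = period
    n_knots = n_splines + 1  # with periodic extrapolation, the last knot wraps around
    return SplineTransformer(
        degree=degree,
        knots=np.linspace(0, period, n_knots).reshape(n_knots, 1),
        extrapolation="periodic",
        include_bias=True,
    )


# The 24 hour values expand into 12 smooth, periodic basis functions:
hours = np.arange(24).reshape(-1, 1)
print(periodic_spline_transformer(24, n_splines=12).fit_transform(hours).shape)  # (24, 12)
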
@@ -485,8 +488,8 @@ def periodic_spline_transformer(period, n_splines=None, degree=3):
 # ~10% of the maximum demand, which is similar to what we observed with the
 # one-hot encoded features.
 #
-# Qualitative analysis of the impact of features on linear models predictions
-# ---------------------------------------------------------------------------
+# Qualitative analysis of the impact of features on linear model predictions
+# --------------------------------------------------------------------------
 #
 # Here, we want to visualize the impact of the feature engineering choices on
 # the time related shape of the predictions.
@@ -539,13 +542,13 @@ def periodic_spline_transformer(period, n_splines=None, degree=3):
 # %%
 # We can draw the following conclusions from the above plot:
 #
-# - the **raw ordinal time-related features** are problematic because they do
+# - The **raw ordinal time-related features** are problematic because they do
 #   not capture the natural periodicity: we observe a big jump in the
 #   predictions at the end of each day when the hour features goes from 23 back
 #   to 0. We can expect similar artifacts at the end of each week or each year.
 #
-# - as expected, the **trigonometric features** (sine and cosine) do not have
-#   these discontinuities at midnight but the linear regression model fails to
+# - As expected, the **trigonometric features** (sine and cosine) do not have
+#   these discontinuities at midnight, but the linear regression model fails to
 #   leverage those features to properly model intra-day variations.
 #   Using trigonometric features for higher harmonics or additional
 #   trigonometric features for the natural period with different phases could
@@ -557,7 +560,7 @@ def periodic_spline_transformer(period, n_splines=None, degree=3):
 # `extrapolation="periodic"` option enforces a smooth representation between
 # `hour=23` and `hour=0`.
 #
-# - the **one-hot encoded features** behave similarly to the periodic
+# - The **one-hot encoded features** behave similarly to the periodic
 #   spline-based features but are more spiky: for instance they can better
 #   model the morning peak during the week days since this peak lasts shorter
 #   than an hour. However, we will see in the following that what can be an
@@ -592,21 +595,21 @@ def periodic_spline_transformer(period, n_splines=None, degree=3):
 # under-estimate the commuting-related events during the working days.
 #
 # These systematic prediction errors reveal a form of under-fitting and can be
-# explained by the lack of non-additive modeling of the interactions between
-# features (in this case "workingday" and features derived from "hours"). This
-# issue will be addressed in the following section.
+# explained by the lack of interactions terms between features, e.g.
+# "workingday" and features derived from "hours". This issue will be addressed
+# in the following section.

 # %%
 # Modeling pairwise interactions with splines and polynomial features
 # -------------------------------------------------------------------
 #
-# Linear models alone cannot model interaction effects between input features.
-# It does not help that some features are marginally non-linear as is the case
-# with features constructed by `SplineTransformer` (or one-hot encoding or
-# binning).
+# Linear models do not automatically capture interaction effects between input
+# features. It does not help that some features are marginally non-linear as is
+# the case with features constructed by `SplineTransformer` (or one-hot
+# encoding or binning).
 #
 # However, it is possible to use the `PolynomialFeatures` class on coarse
-# grained splined encoded hours to model the "workingday"/"hours" interaction
+# grained spline encoded hours to model the "workingday"/"hours" interaction
 # explicitly without introducing too many new variables:
 from sklearn.preprocessing import PolynomialFeatures
 from sklearn.pipeline import FeatureUnion
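
A hedged sketch of the kind of pairwise interaction described above: coarse spline-encoded hours multiplied by a binary working-day indicator via PolynomialFeatures(interaction_only=True). Array shapes and parameters are illustrative only, not taken from the commit:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures, SplineTransformer

hours = np.tile(np.arange(24.0), 2).reshape(-1, 1)     # two simulated days
workingday = np.repeat([1.0, 0.0], 24).reshape(-1, 1)  # working day vs. week-end

# Coarse periodic spline expansion of the hour feature (8 basis functions).
hour_splines = SplineTransformer(
    degree=3,
    knots=np.linspace(0, 24, 9).reshape(-1, 1),
    extrapolation="periodic",
).fit_transform(hours)

# interaction_only=True keeps products of distinct columns and drops squares,
# so each spline basis function gets its own "working day" interaction term.
interactions = PolynomialFeatures(
    degree=2, interaction_only=True, include_bias=False
).fit_transform(np.hstack([hour_splines, workingday]))
print(interactions.shape)  # (48, 45): 9 marginal columns + 36 pairwise products
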
@@ -644,16 +647,16 @@ def periodic_spline_transformer(period, n_splines=None, degree=3):
 #
 # The previous analysis highlighted the need to model the interactions between
 # `"workingday"` and `"hours"`. Another example of a such a non-linear
-# interactions that we would like to model could be the impact of the rain that
+# interaction that we would like to model could be the impact of the rain that
 # might not be the same during the working days and the week-ends and holidays
 # for instance.
 #
 # To model all such interactions, we could either use a polynomial expansion on
-# all marginal features at once, after their spline-based expansion. However
+# all marginal features at once, after their spline-based expansion. However,
 # this would create a quadratic number of features which can cause overfitting
 # and computational tractability issues.
 #
-# Alternatively we can use the Nyström method to compute an approximate
+# Alternatively, we can use the Nyström method to compute an approximate
 # polynomial kernel expansion. Let us try the latter:
 from sklearn.kernel_approximation import Nystroem

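A minimal sketch of the Nyström step mentioned above; the kernel parameters and the number of components are assumptions, not values from the commit:

import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline

nystroem_ridge = make_pipeline(
    # Low-rank approximation of a degree-2 polynomial kernel: the downstream
    # linear model can then pick up pairwise feature interactions implicitly.
    Nystroem(kernel="poly", degree=2, n_components=300, random_state=0),
    RidgeCV(alphas=np.logspace(-6, 6, 25)),
)
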
@@ -693,11 +696,11 @@ def periodic_spline_transformer(period, n_splines=None, degree=3):


 # %%
-# While one-hot features were competitive with spline-based features when using
-# linear models, this is no longer the case when using a low-rank approximation
-# of a non-linear kernel: this can be explained by the fact that spline
-# features are smoother and allow the kernel approximation to find a more
-# expressive decision function.
+# While one-hot encoded features were competitive with spline-based features
+# when using linear models, this is no longer the case when using a low-rank
+# approximation of a non-linear kernel: this can be explained by the fact that
+# spline features are smoother and allow the kernel approximation to find a
+# more expressive decision function.
 #
 # Let us now have a qualitative look at the predictions of the kernel models
 # and of the gradient boosted trees that should be able to better model
@@ -747,13 +750,13 @@ def periodic_spline_transformer(period, n_splines=None, degree=3):
 # since, by default, decision trees are allowed to grow beyond a depth of 2
 # levels.
 #
-# Here we can observe that the combinations of spline features and non-linear
+# Here, we can observe that the combinations of spline features and non-linear
 # kernels works quite well and can almost rival the accuracy of the gradient
 # boosting regression trees.
 #
-# On the contrary, one-hot time features do not perform that well with the low
-# rank kernel model. In particular they significantly over-estimate the low
-# demand hours more than the competing models.
+# On the contrary, one-hot encoded time features do not perform that well with
+# the low rank kernel model. In particular, they significantly over-estimate
+# the low demand hours more than the competing models.
 #
 # We also observe that none of the models can successfully predict some of the
 # peak rentals at the rush hours during the working days. It is possible that
@@ -791,7 +794,7 @@ def periodic_spline_transformer(period, n_splines=None, degree=3):
 # %%
 # This visualization confirms the conclusions we draw on the previous plot.
 #
-# All models under-estimate the high demand events (working days rush hours),
+# All models under-estimate the high demand events (working day rush hours),
 # but gradient boosting a bit less so. The low demand events are well predicted
 # on average by gradient boosting while the one-hot polynomial regression
 # pipeline seems to systematically over-estimate demand in that regime. Overall
@@ -804,9 +807,10 @@ def periodic_spline_transformer(period, n_splines=None, degree=3):
 # We note that we could have obtained slightly better results for kernel models
 # by using more components (higher rank kernel approximation) at the cost of
 # longer fit and prediction durations. For large values of `n_components`, the
-# performance of the one-hot features would even match the spline features.
+# performance of the one-hot encoded features would even match the spline
+# features.
 #
-# The `Nystroem` + `RidgeCV` classifier could also have been replaced by
+# The `Nystroem` + `RidgeCV` regressor could also have been replaced by
 # :class:`~sklearn.neural_network.MLPRegressor` with one or two hidden layers
 # and we would have obtained quite similar results.
 #
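
The paragraph above mentions MLPRegressor as a drop-in alternative to the Nystroem + RidgeCV stage; a hedged sketch, with hidden layer sizes and other hyper-parameters chosen arbitrarily rather than taken from the example file:

from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

mlp_model = make_pipeline(
    StandardScaler(),  # neural networks are sensitive to feature scaling
    MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=1000, random_state=0),
)
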
@@ -818,7 +822,7 @@ def periodic_spline_transformer(period, n_splines=None, degree=3):
 # flexibility.
 #
 # Finally, in this notebook we used `RidgeCV` because it is very efficient from
-# a computational point of view. However it models the target variable as a
+# a computational point of view. However, it models the target variable as a
 # Gaussian random variable with constant variance. For positive regression
 # problems, it is likely that using a Poisson or Gamma distribution would make
 # more sense. This could be achieved by using
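
The rendered hunk ends mid-sentence here. As one illustration only (not necessarily the estimator the example file goes on to name), a positive target with Poisson-distributed noise could be modelled with a generalized linear model:

from sklearn.linear_model import PoissonRegressor

# Assumed regularization strength, shown purely for illustration.
poisson_model = PoissonRegressor(alpha=1e-6, max_iter=300)
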

dev/_downloads/scikit-learn-docs.zip

59 Bytes
Binary file not shown.
