#
# When reporting performance measures on the test set in the discussion, we
# instead choose to focus on the mean absolute error that is more
- # intuitive than the (root) mean squared error. Note however that the best
- # models for one metric are also the best for the other in this study.
+ # intuitive than the (root) mean squared error. Note, however, that the
+ # best models for one metric are also the best for the other in this
+ # study.
y = df["count"] / 1000

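
# %%
# The kind of helper used to report both metrics could look like the following
# minimal sketch (the actual ``evaluate`` helper defined in this example may
# differ in its details; the scoring strings are standard scikit-learn scorer
# names):
from sklearn.model_selection import cross_validate


def evaluate(model, X, y, cv):
    # Cross-validate with both error metrics and report mean +/- std for each.
    cv_results = cross_validate(
        model,
        X,
        y,
        cv=cv,
        scoring=["neg_mean_absolute_error", "neg_root_mean_squared_error"],
    )
    mae = -cv_results["test_neg_mean_absolute_error"]
    rmse = -cv_results["test_neg_root_mean_squared_error"]
    print(
        f"Mean Absolute Error:     {mae.mean():.3f} +/- {mae.std():.3f}\n"
        f"Root Mean Squared Error: {rmse.mean():.3f} +/- {rmse.std():.3f}"
    )
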
# %%
# let the model know that it should treat those as categorical variables by
# using a dedicated tree splitting rule. Since we use an ordinal encoder, we
# pass the list of categorical values explicitly to use a logical order when
- # encoding the categories as integer instead of the lexicographical order. This
- # also has the added benefit of preventing any issue with unknown categories
- # when using cross-validation.
+ # encoding the categories as integers instead of the lexicographical order.
+ # This also has the added benefit of preventing any issue with unknown
+ # categories when using cross-validation.
#
- # The numerical variable need no preprocessing and, for the sake of simplicity,
+ # The numerical variables need no preprocessing and, for the sake of simplicity,
# we only try the default hyper-parameters for this model:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
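
# %%
# As a concrete illustration of that kind of preprocessing, a pipeline along
# the following lines could be used. This is only a sketch on a toy dataframe:
# the column names, category levels and target values below are placeholders,
# not the dataset of this example.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

toy_df = pd.DataFrame(
    {
        "weather": ["clear", "misty", "rain", "clear"] * 10,
        "season": ["spring", "summer", "fall", "winter"] * 10,
        "temp": [9.8, 14.0, 21.0, 6.5] * 10,
    }
)
toy_target = pd.Series([0.12, 0.30, 0.25, 0.09] * 10)

# Passing the categories explicitly fixes a logical (rather than
# lexicographical) integer encoding and makes the encoder aware of all
# possible levels up front.
ordinal_encoder = OrdinalEncoder(
    categories=[
        ["clear", "misty", "rain"],
        ["spring", "summer", "fall", "winter"],
    ]
)

toy_gbrt_pipeline = make_pipeline(
    ColumnTransformer(
        transformers=[("categorical", ordinal_encoder, ["weather", "season"])],
        remainder="passthrough",
    ),
    # Tell the gradient boosting model which encoded columns are categorical so
    # that it uses its dedicated categorical splitting rule on them.
    HistGradientBoostingRegressor(categorical_features=[0, 1], random_state=42),
)
toy_gbrt_pipeline.fit(toy_df, toy_target)
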
@@ -243,7 +244,7 @@ def evaluate(model, X, y, cv):
# of a problem for tree-based models as they can learn a non-monotonic
# relationship between ordinal input features and the target.
#
- # This is not the case for linear regression model as we will see in the
+ # This is not the case for linear regression models as we will see in the
# following.
#
# Naive linear regression
@@ -279,25 +280,26 @@ def evaluate(model, X, y, cv):
#
# The performance is not good: the average error is around 14% of the maximum
# demand. This is more than three times higher than the average error of the
- # gradient boosting model. We can suspect that the naive original encoding of
- # the periodic time-related features might prevent the linear regression model
- # to properly leverage the time information: linear regression does not model
- # non-monotonic relationships between the input features and the target.
- # Non-linear terms have to be engineered in the input.
+ # gradient boosting model. We can suspect that the naive original encoding
+ # (merely min-max scaled) of the periodic time-related features might prevent
+ # the linear regression model from properly leveraging the time information:
+ # linear regression does not automatically model non-monotonic relationships
+ # between the input features and the target. Non-linear terms have to be
+ # engineered in the input.
#
# For example, the raw numerical encoding of the `"hour"` feature prevents the
# linear model from recognizing that an increase of hour in the morning from 6
# to 8 should have a strong positive impact on the number of bike rentals while
- # a increase of similar magnitude in the evening from 18 to 20 should have a
+ # an increase of similar magnitude in the evening from 18 to 20 should have a
# strong negative impact on the predicted number of bike rentals.
#
# Time-steps as categories
# ------------------------
#
# Since the time features are encoded in a discrete manner using integers (24
# unique values in the "hours" feature), we could decide to treat those as
- # categorical variables and ignore any assumption implied by the ordering of
- # the hour values using a one-hot encoding.
+ # categorical variables using a one-hot encoding and thereby ignore any
+ # assumption implied by the ordering of the hour values.
#
# Using one-hot encoding for the time features gives the linear model a lot
# more flexibility as we introduce one additional feature per discrete time
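
# %%
# A minimal sketch of this idea on toy data (the column names, toy values and
# the regularization grid below are placeholders, not the exact pipeline of
# this example):
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

toy_time_df = pd.DataFrame(
    {
        "hour": [0, 6, 8, 18, 20, 23],
        "temp": [3.0, 5.0, 9.0, 12.0, 10.0, 6.0],
    }
)
toy_demand = pd.Series([0.02, 0.10, 0.40, 0.45, 0.20, 0.05])

one_hot_linear_pipeline = make_pipeline(
    ColumnTransformer(
        transformers=[
            # One column per observed hour level: the linear model can then
            # assign an independent weight to each hour of the day.
            ("one_hot_time", OneHotEncoder(handle_unknown="ignore"), ["hour"]),
        ],
        remainder="passthrough",
    ),
    RidgeCV(alphas=np.logspace(-6, 6, 25)),
)
one_hot_linear_pipeline.fit(toy_time_df, toy_demand)
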
@@ -317,8 +319,8 @@ def evaluate(model, X, y, cv):

# %%
# The average error rate of this model is 10% which is much better than using
- # the original ordinal encoding of the time feature, confirming our intuition
- # that the linear regression model benefit from the added flexibility to not
+ # the original (ordinal) encoding of the time feature, confirming our intuition
+ # that the linear regression model benefits from the added flexibility to not
# treat time progression in a monotonic manner.
#
# However, this introduces a very large number of new features. If the time of
@@ -330,7 +332,7 @@ def evaluate(model, X, y, cv):
# benefitting from the non-monotonic expressivity advantages of one-hot
# encoding.
#
- # Finally, we also observe than one-hot encoding completely ignores the
+ # Finally, we also observe that one-hot encoding completely ignores the
# ordering of the hour levels while this could be an interesting inductive bias
# to preserve to some level. In the following we try to explore smooth,
# non-monotonic encoding that locally preserves the relative ordering of time
@@ -340,7 +342,7 @@ def evaluate(model, X, y, cv):
# ----------------------
#
# As a first attempt, we can try to encode each of those periodic features
- # using a sine and cosine transform with the matching period.
+ # using a sine and cosine transformation with the matching period.
#
# Each ordinal time feature is transformed into 2 features that together encode
# equivalent information in a non-monotonic way, and more importantly without
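
# %%
# Such an encoding can be implemented with small helpers along the following
# lines (a sketch: the ``sin_transformer``/``cos_transformer`` helpers defined
# in this example may differ slightly in form):
import numpy as np
from sklearn.preprocessing import FunctionTransformer


def sin_transformer(period):
    # Map an ordinal feature with the given period onto the sine of its angle.
    return FunctionTransformer(lambda x: np.sin(x / period * 2 * np.pi))


def cos_transformer(period):
    # Map an ordinal feature with the given period onto the cosine of its angle.
    return FunctionTransformer(lambda x: np.cos(x / period * 2 * np.pi))


# For instance, hour 0 and hour 24 get identical (sin, cos) coordinates:
hours = np.arange(25).reshape(-1, 1)
hour_sin = sin_transformer(24).fit_transform(hours)
hour_cos = cos_transformer(24).fit_transform(hours)
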
@@ -375,9 +377,9 @@ def cos_transformer(period):
#
# Let's use a 2D scatter plot with the hours encoded as colors to better see
# how this representation maps the 24 hours of the day to a 2D space, akin to
- # some sort of 24 hour version of an analog clock. Note that the "25th" hour is
- # mapped back to the 1st hour because of the periodic nature of the sine/cosine
- # representation.
+ # some sort of a 24 hour version of an analog clock. Note that the "25th" hour
+ # is mapped back to the 1st hour because of the periodic nature of the
+ # sine/cosine representation.
fig, ax = plt.subplots(figsize=(7, 5))
sp = ax.scatter(hour_df["hour_sin"], hour_df["hour_cos"], c=hour_df["hour"])
ax.set(
@@ -420,7 +422,8 @@ def cos_transformer(period):
#
# We can try an alternative encoding of the periodic time-related features
# using spline transformations with a large enough number of splines, and as a
- # result a larger number of expanded features:
+ # result a larger number of expanded features compared to the sine/cosine
+ # transformation:
from sklearn.preprocessing import SplineTransformer

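
# %%
# The periodic spline encoding can be built with a helper along these lines (a
# sketch of the ``periodic_spline_transformer`` helper used in this example;
# the exact implementation may differ):
import numpy as np
from sklearn.preprocessing import SplineTransformer


def periodic_spline_transformer(period, n_splines=None, degree=3):
    if n_splines is None:
        n_splines = period
    # With periodic extrapolation there is one basis function per knot interval.
    n_knots = n_splines + 1
    return SplineTransformer(
        degree=degree,
        n_knots=n_knots,
        knots=np.linspace(0, period, n_knots).reshape(n_knots, 1),
        extrapolation="periodic",
        include_bias=True,
    )


# For example, encode the hour of the day with 12 periodic spline features:
hours = np.arange(25).reshape(-1, 1)
hour_splines = periodic_spline_transformer(24, n_splines=12).fit_transform(hours)
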
@@ -485,8 +488,8 @@ def periodic_spline_transformer(period, n_splines=None, degree=3):
# ~10% of the maximum demand, which is similar to what we observed with the
# one-hot encoded features.
#
- # Qualitative analysis of the impact of features on linear models predictions
- # ---------------------------------------------------------------------------
+ # Qualitative analysis of the impact of features on linear model predictions
+ # --------------------------------------------------------------------------
#
# Here, we want to visualize the impact of the feature engineering choices on
# the time related shape of the predictions.
@@ -539,13 +542,13 @@ def periodic_spline_transformer(period, n_splines=None, degree=3):
# %%
# We can draw the following conclusions from the above plot:
#
- # - the **raw ordinal time-related features** are problematic because they do
+ # - The **raw ordinal time-related features** are problematic because they do
#   not capture the natural periodicity: we observe a big jump in the
#   predictions at the end of each day when the hour feature goes from 23 back
#   to 0. We can expect similar artifacts at the end of each week or each year.
#
- # - as expected, the **trigonometric features** (sine and cosine) do not have
- #   these discontinuities at midnight but the linear regression model fails to
+ # - As expected, the **trigonometric features** (sine and cosine) do not have
+ #   these discontinuities at midnight, but the linear regression model fails to
#   leverage those features to properly model intra-day variations.
#   Using trigonometric features for higher harmonics or additional
#   trigonometric features for the natural period with different phases could
@@ -557,7 +560,7 @@ def periodic_spline_transformer(period, n_splines=None, degree=3):
#   `extrapolation="periodic"` option enforces a smooth representation between
#   `hour=23` and `hour=0`.
#
- # - the **one-hot encoded features** behave similarly to the periodic
+ # - The **one-hot encoded features** behave similarly to the periodic
#   spline-based features but are more spiky: for instance they can better
#   model the morning peak during the week days since this peak lasts shorter
#   than an hour. However, we will see in the following that what can be an
@@ -592,21 +595,21 @@ def periodic_spline_transformer(period, n_splines=None, degree=3):
# under-estimate the commuting-related events during the working days.
#
# These systematic prediction errors reveal a form of under-fitting and can be
- # explained by the lack of non-additive modeling of the interactions between
- # features (in this case "workingday" and features derived from "hours"). This
- # issue will be addressed in the following section.
+ # explained by the lack of interaction terms between features, e.g.
+ # "workingday" and features derived from "hours". This issue will be addressed
+ # in the following section.

# %%
# Modeling pairwise interactions with splines and polynomial features
# -------------------------------------------------------------------
#
- # Linear models alone cannot model interaction effects between input features.
- # It does not help that some features are marginally non-linear as is the case
- # with features constructed by `SplineTransformer` (or one-hot encoding or
- # binning).
+ # Linear models do not automatically capture interaction effects between input
+ # features. It does not help that some features are marginally non-linear as is
+ # the case with features constructed by `SplineTransformer` (or one-hot
+ # encoding or binning).
#
# However, it is possible to use the `PolynomialFeatures` class on coarse
- # grained splined encoded hours to model the "workingday"/"hours" interaction
+ # grained spline encoded hours to model the "workingday"/"hours" interaction
# explicitly without introducing too many new variables:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import FeatureUnion
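
# %%
# A minimal sketch of this interaction modeling on toy data (the column names,
# number of splines and helper below are illustrative placeholders, not the
# exact pipeline of this example):
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, SplineTransformer

toy_df = pd.DataFrame(
    {
        "hour": np.tile(np.arange(24), 2),
        "workingday": [1] * 24 + [0] * 24,
    }
)

# Coarse periodic spline encoding of the hour (few splines on purpose)...
coarse_hour_splines = SplineTransformer(
    degree=3,
    n_knots=9,
    knots=np.linspace(0, 24, 9).reshape(-1, 1),
    extrapolation="periodic",
    include_bias=True,
)

marginal_features = ColumnTransformer(
    transformers=[
        ("hour_splines", coarse_hour_splines, ["hour"]),
        ("workingday", "passthrough", ["workingday"]),
    ]
)

# ...then pairwise products, so that each hour spline gets a working-day
# specific counterpart without the full quadratic blow-up of all feature pairs.
hour_workday_interaction = make_pipeline(
    marginal_features,
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
)
interaction_features = hour_workday_interaction.fit_transform(toy_df)
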
@@ -644,16 +647,16 @@ def periodic_spline_transformer(period, n_splines=None, degree=3):
#
# The previous analysis highlighted the need to model the interactions between
# `"workingday"` and `"hours"`. Another example of such a non-linear
- # interactions that we would like to model could be the impact of the rain that
+ # interaction that we would like to model could be the impact of the rain that
# might not be the same during the working days and the week-ends and holidays
# for instance.
#
# To model all such interactions, we could either use a polynomial expansion on
- # all marginal features at once, after their spline-based expansion. However
+ # all marginal features at once, after their spline-based expansion. However,
# this would create a quadratic number of features which can cause overfitting
# and computational tractability issues.
#
- # Alternatively we can use the Nyström method to compute an approximate
+ # Alternatively, we can use the Nyström method to compute an approximate
# polynomial kernel expansion. Let us try the latter:
from sklearn.kernel_approximation import Nystroem

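
# %%
# A minimal sketch of plugging a Nystroem approximation of a degree-2
# polynomial kernel into a ridge pipeline (toy data and hyper-parameters are
# arbitrary illustrative choices):
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
toy_features = rng.uniform(size=(200, 10))  # stand-in for spline-expanded features
toy_target = rng.uniform(size=200)

nystroem_ridge_pipeline = make_pipeline(
    # Low-rank approximation of the degree-2 polynomial kernel: it captures
    # pairwise feature interactions without materializing all feature products.
    Nystroem(kernel="poly", degree=2, n_components=50, random_state=0),
    RidgeCV(alphas=np.logspace(-6, 6, 25)),
)
nystroem_ridge_pipeline.fit(toy_features, toy_target)
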
@@ -693,11 +696,11 @@ def periodic_spline_transformer(period, n_splines=None, degree=3):

# %%
- # While one-hot features were competitive with spline-based features when using
- # linear models, this is no longer the case when using a low-rank approximation
- # of a non-linear kernel: this can be explained by the fact that spline
- # features are smoother and allow the kernel approximation to find a more
- # expressive decision function.
+ # While one-hot encoded features were competitive with spline-based features
+ # when using linear models, this is no longer the case when using a low-rank
+ # approximation of a non-linear kernel: this can be explained by the fact that
+ # spline features are smoother and allow the kernel approximation to find a
+ # more expressive decision function.
#
# Let us now have a qualitative look at the predictions of the kernel models
# and of the gradient boosted trees that should be able to better model
@@ -747,13 +750,13 @@ def periodic_spline_transformer(period, n_splines=None, degree=3):
# since, by default, decision trees are allowed to grow beyond a depth of 2
# levels.
#
- # Here we can observe that the combinations of spline features and non-linear
+ # Here, we can observe that the combination of spline features and non-linear
# kernels works quite well and can almost rival the accuracy of the gradient
# boosting regression trees.
#
- # On the contrary, one-hot time features do not perform that well with the low
- # rank kernel model. In particular they significantly over-estimate the low
- # demand hours more than the competing models.
+ # On the contrary, one-hot encoded time features do not perform that well with
+ # the low rank kernel model. In particular, they significantly over-estimate
+ # the low demand hours more than the competing models.
#
# We also observe that none of the models can successfully predict some of the
# peak rentals at the rush hours during the working days. It is possible that
@@ -791,7 +794,7 @@ def periodic_spline_transformer(period, n_splines=None, degree=3):
# %%
# This visualization confirms the conclusions we drew from the previous plot.
#
- # All models under-estimate the high demand events (working days rush hours),
+ # All models under-estimate the high demand events (working day rush hours),
# but gradient boosting a bit less so. The low demand events are well predicted
# on average by gradient boosting while the one-hot polynomial regression
# pipeline seems to systematically over-estimate demand in that regime. Overall
@@ -804,9 +807,10 @@ def periodic_spline_transformer(period, n_splines=None, degree=3):
# We note that we could have obtained slightly better results for kernel models
# by using more components (higher rank kernel approximation) at the cost of
# longer fit and prediction durations. For large values of `n_components`, the
- # performance of the one-hot features would even match the spline features.
+ # performance of the one-hot encoded features would even match the spline
+ # features.
#
- # The `Nystroem` + `RidgeCV` classifier could also have been replaced by
+ # The `Nystroem` + `RidgeCV` regressor could also have been replaced by
# :class:`~sklearn.neural_network.MLPRegressor` with one or two hidden layers
# and we would have obtained quite similar results.
#
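
# %%
# For illustration, such an alternative could be sketched as follows (toy data,
# scaling choice and hyper-parameters are arbitrary; this variant is not tuned
# or reported in this example):
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
toy_features = rng.uniform(size=(200, 10))  # stand-in for the engineered features
toy_target = rng.uniform(size=200)

mlp_pipeline = make_pipeline(
    MinMaxScaler(),
    # Two hidden layers: the non-linear activations let the model capture
    # feature interactions that a plain linear model cannot.
    MLPRegressor(hidden_layer_sizes=(50, 50), max_iter=2000, random_state=0),
)
mlp_pipeline.fit(toy_features, toy_target)
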
@@ -818,7 +822,7 @@ def periodic_spline_transformer(period, n_splines=None, degree=3):
# flexibility.
#
# Finally, in this notebook we used `RidgeCV` because it is very efficient from
- # a computational point of view. However it models the target variable as a
+ # a computational point of view. However, it models the target variable as a
# Gaussian random variable with constant variance. For positive regression
# problems, it is likely that using a Poisson or Gamma distribution would make
# more sense. This could be achieved by using
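
# %%
# One possible way to do this is sketched below; the estimator and settings are
# illustrative choices only, not necessarily the ones this example has in mind:
import numpy as np
from sklearn.linear_model import PoissonRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
toy_features = rng.uniform(size=(200, 10))
toy_counts = rng.poisson(lam=5.0, size=200)  # non-negative, count-like targets

poisson_pipeline = make_pipeline(
    MinMaxScaler(),
    # A generalized linear model with a Poisson distribution and log link is
    # better suited to non-negative, count-like targets than a Gaussian model.
    PoissonRegressor(alpha=1e-2, max_iter=300),
)
poisson_pipeline.fit(toy_features, toy_counts)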
0 commit comments