
Commit ee599ab

Merge pull request #309 from UBC-DSCI/best-params-fit-consistency
Consistency of usage of `best_params_` and `.fit`
2 parents: 72607be + 3eb9de3

File tree: 5 files changed, +61 -59 lines

source/classification1.md

Lines changed: 6 additions & 6 deletions
@@ -1834,24 +1834,24 @@ For the `y` response variable argument, we pass the `unscaled_cancer["Class"]` s
 ```{code-cell} ipython3
 from sklearn.pipeline import make_pipeline

-knn_fit = make_pipeline(preprocessor, knn).fit(
+knn_pipeline = make_pipeline(preprocessor, knn)
+knn_pipeline.fit(
     X=unscaled_cancer,
     y=unscaled_cancer["Class"]
 )
-
-knn_fit
+knn_pipeline
 ```

 As before, the fit object lists the function that trains the model. But now the fit object also includes information about
 the overall workflow, including the standardization preprocessing step.
-In other words, when we use the `predict` function with the `knn_fit` object to make a prediction for a new
+In other words, when we use the `predict` function with the `knn_pipeline` object to make a prediction for a new
 observation, it will first apply the same preprocessing steps to the new observation.
 As an example, we will predict the class label of two new observations:
 one with `Area = 500` and `Smoothness = 0.075`, and one with `Area = 1500` and `Smoothness = 0.1`.

 ```{code-cell} ipython3
 new_observation = pd.DataFrame({"Area": [500, 1500], "Smoothness": [0.075, 0.1]})
-prediction = knn_fit.predict(new_observation)
+prediction = knn_pipeline.predict(new_observation)
 prediction
 ```

@@ -1886,7 +1886,7 @@ asgrid = np.array(np.meshgrid(are_grid, smo_grid)).reshape(2, -1).T
 asgrid = pd.DataFrame(asgrid, columns=["Area", "Smoothness"])

 # use the fit workflow to make predictions at the grid points
-knnPredGrid = knn_fit.predict(asgrid)
+knnPredGrid = knn_pipeline.predict(asgrid)

 # bind the predictions as a new column with the grid points
 prediction_table = asgrid.copy()
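
The pattern this file adopts — build the pipeline in one statement, then call `.fit()` on it as a separate step — can be sketched in isolation. The snippet below is a minimal, self-contained illustration only: the toy data frame and the `StandardScaler` preprocessor are stand-ins, not the book's `unscaled_cancer` data or its column transformer.

```python
# Minimal sketch of the construct-then-fit pattern used above.
# The toy data below is illustrative, not the book's cancer dataset.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({
    "Area": [400, 600, 1400, 1600],
    "Smoothness": [0.07, 0.08, 0.10, 0.11],
    "Class": ["Benign", "Benign", "Malignant", "Malignant"],
})

# Build the pipeline first, then fit it as a separate step. The fitted
# pipeline re-applies the same scaling to any new observations it sees.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn_pipeline.fit(toy[["Area", "Smoothness"]], toy["Class"])

new_observation = pd.DataFrame({"Area": [500, 1500], "Smoothness": [0.075, 0.1]})
print(knn_pipeline.predict(new_observation))
```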

source/classification2.md

Lines changed: 33 additions & 32 deletions
@@ -594,9 +594,10 @@ knn = KNeighborsClassifier(n_neighbors=3)
 X = cancer_train[["Smoothness", "Concavity"]]
 y = cancer_train["Class"]

-knn_fit = make_pipeline(cancer_preprocessor, knn).fit(X, y)
+knn_pipeline = make_pipeline(cancer_preprocessor, knn)
+knn_pipeline.fit(X, y)

-knn_fit
+knn_pipeline
 ```

 ### Predict the labels in the test set
@@ -614,7 +615,7 @@ variables in the output data frame.

 ```{code-cell} ipython3
 cancer_test_predictions = cancer_test.assign(
-    predicted = knn_fit.predict(cancer_test[["Smoothness", "Concavity"]])
+    predicted = knn_pipeline.predict(cancer_test[["Smoothness", "Concavity"]])
 )
 cancer_test_predictions[["ID", "Class", "predicted"]]
 ```
@@ -645,7 +646,7 @@ for the predictors that we originally passed into `predict` when making predicti
 and we provide the actual labels via the `cancer_test["Class"]` series.

 ```{code-cell} ipython3
-cancer_acc_1 = knn_fit.score(
+cancer_acc_1 = knn_pipeline.score(
     cancer_test[["Smoothness", "Concavity"]],
     cancer_test["Class"]
 )
@@ -662,11 +663,9 @@ glue("cancer_acc_1", "{:0.0f}".format(100*cancer_acc_1))

 The output shows that the estimated accuracy of the classifier on the test data
 was {glue:text}`cancer_acc_1`%.
-We can also look at the *confusion matrix* for the classifier
-using the `crosstab` function from `pandas`. A confusion matrix shows how many
-observations of each (actual) label were classified as each (predicted) label.
-The `crosstab` function
-takes two arguments: the actual labels first, then the predicted labels second.
+We can also look at the *confusion matrix* for the classifier
+using the `crosstab` function from `pandas`. The `crosstab` function takes two
+arguments: the actual labels first, then the predicted labels second.

 ```{code-cell} ipython3
 pd.crosstab(
@@ -884,10 +883,11 @@ cancer_subtrain, cancer_validation = train_test_split(
 knn = KNeighborsClassifier(n_neighbors=3)
 X = cancer_subtrain[["Smoothness", "Concavity"]]
 y = cancer_subtrain["Class"]
-knn_fit = make_pipeline(cancer_preprocessor, knn).fit(X, y)
+knn_pipeline = make_pipeline(cancer_preprocessor, knn)
+knn_pipeline.fit(X, y)

 # compute the score on validation data
-acc = knn_fit.score(
+acc = knn_pipeline.score(
     cancer_validation[["Smoothness", "Concavity"]],
     cancer_validation["Class"]
 )
@@ -908,10 +908,10 @@ for i in range(1, 5):
     knn = KNeighborsClassifier(n_neighbors=3)
     X = cancer_subtrain[["Smoothness", "Concavity"]]
     y = cancer_subtrain["Class"]
-    knn_fit = make_pipeline(cancer_preprocessor, knn).fit(X, y)
+    knn_pipeline = make_pipeline(cancer_preprocessor, knn).fit(X, y)

     # compute the score on validation data
-    accuracies.append(knn_fit.score(
+    accuracies.append(knn_pipeline.score(
         cancer_validation[["Smoothness", "Concavity"]],
         cancer_validation["Class"]
     ))
@@ -979,7 +979,6 @@ Since the `cross_validate` function outputs a dictionary, we use `pd.DataFrame`
 dataframe for better visualization.
 Note that the `cross_validate` function handles stratifying the classes in
 each train and validate fold automatically.
-We begin by importing the `cross_validate` function from `sklearn`.

 ```{code-cell} ipython3
 from sklearn.model_selection import cross_validate
@@ -1183,17 +1182,14 @@ format. We will wrap it in a `pd.DataFrame` to make it easier to understand,
 and print the `info` of the result.

 ```{code-cell} ipython3
-accuracies_grid = pd.DataFrame(
-    cancer_tune_grid.fit(
-        cancer_train[["Smoothness", "Concavity"]],
-        cancer_train["Class"]
-    ).cv_results_
+cancer_tune_grid.fit(
+    cancer_train[["Smoothness", "Concavity"]],
+    cancer_train["Class"]
 )
-```
-
-```{code-cell} ipython3
+accuracies_grid = pd.DataFrame(cancer_tune_grid.cv_results_)
 accuracies_grid.info()
 ```
+
 There is a lot of information to look at here, but we are most interested
 in three quantities: the number of neighbors (`param_kneighborsclassifier__n_neighbors`),
 the cross-validation accuracy estimate (`mean_test_score`),
@@ -1224,8 +1220,7 @@ accuracies_grid

 We can decide which number of neighbors is best by plotting the accuracy versus $K$,
 as shown in {numref}`fig:06-find-k`.
-Here we are using the shortcut `point=True`
-to layer a point and line chart.
+Here we are using the shortcut `point=True` to layer a point and line chart.

 ```{code-cell} ipython3
 :tags: [remove-output]
@@ -1254,6 +1249,13 @@ glue("best_acc", "{:.1f}".format(accuracies_grid["mean_test_score"].max()*100))
 Plot of estimated accuracy versus the number of neighbors.
 :::

+We can also obtain the number of neighbours with the highest accuracy programmatically by accessing
+the `best_params_` attribute of the fit `GridSearchCV` object. Note that it is still useful to visualize
+the results as we did above since this provides additional information on how the model performance varies.
+```{code-cell} ipython3
+cancer_tune_grid.best_params_
+```
+
 +++

 Setting the number of
@@ -1303,13 +1305,13 @@ large_cancer_tune_grid = GridSearchCV(
     cv=10
 )

-large_accuracies_grid = pd.DataFrame(
-    large_cancer_tune_grid.fit(
-        cancer_train[["Smoothness", "Concavity"]],
-        cancer_train["Class"]
-    ).cv_results_
+large_cancer_tune_grid.fit(
+    cancer_train[["Smoothness", "Concavity"]],
+    cancer_train["Class"]
 )

+large_accuracies_grid = pd.DataFrame(large_cancer_tune_grid.cv_results_)
+
 large_accuracy_vs_k = alt.Chart(large_accuracies_grid).mark_line(point=True).encode(
     x=alt.X("param_kneighborsclassifier__n_neighbors").title("Neighbors"),
     y=alt.Y("mean_test_score")
@@ -1903,7 +1905,6 @@ n_total = len(names)
 # start with an empty list of selected predictors
 selected = []

-
 # create the pipeline and CV grid search objects
 param_grid = {
     "kneighborsclassifier__n_neighbors": range(1, 61, 5),
@@ -1929,8 +1930,8 @@ for i in range(1, n_total + 1):
     y = cancer_subset["Class"]

     # Find the best K for this set of predictors
-    cancer_model_grid = cancer_tune_grid.fit(X, y)
-    accuracies_grid = pd.DataFrame(cancer_model_grid.cv_results_)
+    cancer_tune_grid.fit(X, y)
+    accuracies_grid = pd.DataFrame(cancer_tune_grid.cv_results_)

     # Store the tuned accuracy for this set of predictors
     accs[j] = accuracies_grid["mean_test_score"].max()
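
All of the classification2.md edits rest on one property of scikit-learn estimators: `.fit()` mutates the object in place and returns that same object, so capturing the return value (as `cancer_model_grid = cancer_tune_grid.fit(X, y)` did) only creates an alias. Below is a self-contained sketch of the convention the book now follows; the synthetic data and the parameter grid are illustrative assumptions, not the book's setup.

```python
# Sketch of the GridSearchCV usage the commit standardizes on: fit the
# object in place, then read cv_results_ and best_params_ from it.
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the cancer training data
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(100, 2)), columns=["Smoothness", "Concavity"])
y = pd.Series(np.where(X["Concavity"] > 0, "Malignant", "Benign"))

cancer_tune_grid = GridSearchCV(
    estimator=make_pipeline(StandardScaler(), KNeighborsClassifier()),
    param_grid={"kneighborsclassifier__n_neighbors": range(1, 20, 2)},
    cv=5,
)

# fit() returns the estimator itself, so there is no need to assign it to a
# new name; the tuning results live on cancer_tune_grid afterwards.
cancer_tune_grid.fit(X, y)
accuracies_grid = pd.DataFrame(cancer_tune_grid.cv_results_)
print(cancer_tune_grid.best_params_)
print(accuracies_grid["mean_test_score"].max())
```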

source/clustering.md

Lines changed: 3 additions & 2 deletions
@@ -752,7 +752,7 @@ total WSSD, since the cluster center (denoted by large shapes with black outline
 the other hand, if we set K greater than 3, the clustering subdivides subgroups of data; this does indeed still
 decrease the total WSSD, but by only a *diminishing amount*. If we plot the total WSSD versus the number of
 clusters, we see that the decrease in total WSSD levels off (or forms an "elbow shape") when we reach roughly
-the right number of clusters ({numref}`toy-kmeans-elbow`)).
+the right number of clusters ({numref}`toy-kmeans-elbow`).

 ```{code-cell} ipython3
 :tags: [remove-cell]
@@ -840,7 +840,8 @@ the random seed in the beginning of this chapter, the clustering will be reprodu
 ```{code-cell} ipython3
 from sklearn.pipeline import make_pipeline

-penguin_clust = make_pipeline(preprocessor, kmeans).fit(penguins)
+penguin_clust = make_pipeline(preprocessor, kmeans)
+penguin_clust.fit(penguins)
 penguin_clust
 ```
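
The clustering change is the same split between construction and fitting. A minimal sketch under assumed inputs — the toy measurements and `StandardScaler` below stand in for the book's `penguins` data and its `preprocessor`:

```python
# Same two-step pattern for a clustering pipeline; the penguin-like toy
# data and the scaler are illustrative assumptions.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

penguins_toy = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 46.5, 50.0, 48.7, 36.7],
    "flipper_length_mm": [181, 186, 195, 217, 222, 192],
})

penguin_clust = make_pipeline(
    StandardScaler(), KMeans(n_clusters=3, n_init=10, random_state=1)
)
penguin_clust.fit(penguins_toy)

# The fitted KMeans step (named "kmeans" by make_pipeline) holds the labels.
print(penguin_clust.named_steps["kmeans"].labels_)
```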

source/regression1.md

Lines changed: 15 additions & 17 deletions
@@ -603,13 +603,13 @@ and rename the parameter column to be more readable.

 ```{code-cell} ipython3
 # fit the GridSearchCV object
-sacr_fit = sacr_gridsearch.fit(
+sacr_gridsearch.fit(
     sacramento_train[["sqft"]], # A single-column data frame
     sacramento_train["price"] # A series
 )

 # Retrieve the CV scores
-sacr_results = pd.DataFrame(sacr_fit.cv_results_)[[
+sacr_results = pd.DataFrame(sacr_gridsearch.cv_results_)[[
     "param_kneighborsregressor__n_neighbors",
     "mean_test_score",
     "std_test_score"
@@ -689,7 +689,7 @@ Note that it is still useful to visualize the results as we did above
 since this provides additional information on how the model performance varies.

 ```{code-cell} ipython3
-sacr_fit.best_params_
+sacr_gridsearch.best_params_
 ```

 +++
@@ -835,7 +835,7 @@ model uses a different default scoring metric than the RMSPE.
 from sklearn.metrics import mean_squared_error

 sacr_preds = sacramento_test.assign(
-    predicted = sacr_fit.predict(sacramento_test)
+    predicted = sacr_gridsearch.predict(sacramento_test)
 )
 RMSPE = mean_squared_error(
     y_true = sacr_preds["price"],
@@ -891,7 +891,7 @@ sqft_prediction_grid = pd.DataFrame({
 })
 # Predict the price for each of the sqft values in the grid
 sacr_preds = sqft_prediction_grid.assign(
-    predicted = sacr_fit.predict(sqft_prediction_grid)
+    predicted = sacr_gridsearch.predict(sqft_prediction_grid)
 )

 # Plot all the houses
@@ -1012,18 +1012,19 @@ param_grid = {
     "kneighborsregressor__n_neighbors": range(1, 50),
 }

-sacr_fit = GridSearchCV(
+sacr_gridsearch = GridSearchCV(
     estimator=sacr_pipeline,
     param_grid=param_grid,
     cv=5,
     scoring="neg_root_mean_squared_error"
-).fit(
-    sacramento_train[["sqft", "beds"]],
-    sacramento_train["price"]
-)
+)
+sacr_gridsearch.fit(
+    sacramento_train[["sqft", "beds"]],
+    sacramento_train["price"]
+)

 # retrieve the CV scores
-sacr_results = pd.DataFrame(sacr_fit.cv_results_)[[
+sacr_results = pd.DataFrame(sacr_gridsearch.cv_results_)[[
     "param_kneighborsregressor__n_neighbors",
     "mean_test_score",
     "std_test_score"
@@ -1035,13 +1036,10 @@ sacr_results = (
     .rename(columns={"param_kneighborsregressor__n_neighbors" : "n_neighbors"})
     .drop(columns=["std_test_score"])
 )
-
 sacr_results["mean_test_score"] = -sacr_results["mean_test_score"]

 # show only the row of minimum RMSPE
-sacr_results[
-    sacr_results["mean_test_score"] == sacr_results["mean_test_score"].min()
-]
+sacr_results.nsmallest(1, "mean_test_score")
 ```

 ```{code-cell} ipython3
@@ -1072,7 +1070,7 @@ to compute the RMSPE.

 ```{code-cell} ipython3
 sacr_preds = sacramento_test.assign(
-    predicted = sacr_fit.predict(sacramento_test)
+    predicted = sacr_gridsearch.predict(sacramento_test)
 )
 RMSPE_mult = mean_squared_error(
     y_true = sacr_preds["price"],
@@ -1109,7 +1107,7 @@ xygrid = np.array(np.meshgrid(xvals, yvals)).reshape(2, -1).T
 xygrid = pd.DataFrame(xygrid, columns=["sqft", "beds"])

 # add prediction
-knnPredGrid = sacr_fit.predict(xygrid)
+knnPredGrid = sacr_gridsearch.predict(xygrid)

 fig = px.scatter_3d(
     sacramento_train,
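
A runnable sketch of the regression tuning flow this file converges on: fit the `GridSearchCV` in place with `scoring="neg_root_mean_squared_error"`, flip the sign of the scores, and pick the best row with `nsmallest`. The synthetic sqft/price data below is an assumption for illustration only, not the Sacramento dataset.

```python
# Sketch of the updated regression1.md flow; synthetic housing-like data.
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
sqft = rng.uniform(500, 3000, size=200)
train = pd.DataFrame({"sqft": sqft,
                      "price": 100 * sqft + rng.normal(0, 20000, size=200)})

sacr_gridsearch = GridSearchCV(
    estimator=make_pipeline(StandardScaler(), KNeighborsRegressor()),
    param_grid={"kneighborsregressor__n_neighbors": range(1, 20)},
    cv=5,
    scoring="neg_root_mean_squared_error",  # sklearn maximizes, so RMSE is negated
)
sacr_gridsearch.fit(train[["sqft"]], train["price"])

sacr_results = pd.DataFrame(sacr_gridsearch.cv_results_)[
    ["param_kneighborsregressor__n_neighbors", "mean_test_score"]
]
# undo the negation so the column really is an RMSE, then take the smallest
sacr_results["mean_test_score"] = -sacr_results["mean_test_score"]
print(sacr_results.nsmallest(1, "mean_test_score"))
```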

source/regression2.md

Lines changed: 4 additions & 2 deletions
@@ -726,7 +726,8 @@ method as usual.

 ```{code-cell} ipython3

-mlm = LinearRegression().fit(
+mlm = LinearRegression()
+mlm.fit(
     sacramento_train[["sqft", "beds"]],
     sacramento_train["price"]
 )
@@ -838,11 +839,12 @@ Unfortunately you have to do this mapping yourself: the coefficients in `mlm.coe
 in the *same order* as the columns of the predictor data frame you used when training.
 So since we used `sacramento_train[["sqft", "beds"]]` when training,
 we have that `mlm.coef_[0]` corresponds to `sqft`, and `mlm.coef_[1]` corresponds to `beds`.
+Once you sort out the correspondence, you can then use those slopes to write a mathematical equation to describe the prediction plane:

 ```{index} plane equation
 ```

-And then use those slopes to write a mathematical equation to describe the prediction plane:
+

 $$\text{house sale price} = \beta_0 + \beta_1\cdot(\text{house size}) + \beta_2\cdot(\text{number of bedrooms}),$$
 where:
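
The regression2.md edit applies the same construct-then-fit split to `LinearRegression`, and the added sentence about coefficient order can be checked directly: `coef_` follows the column order of the training frame. A small sketch with invented data (all numbers are illustrative assumptions):

```python
# Demonstration of the coef_-to-column correspondence; the toy data
# stands in for the Sacramento training set.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
train = pd.DataFrame({
    "sqft": rng.uniform(500, 3000, size=50),
    "beds": rng.integers(1, 6, size=50),
})
train["price"] = 120 * train["sqft"] + 15000 * train["beds"] + rng.normal(0, 10000, size=50)

mlm = LinearRegression()
mlm.fit(train[["sqft", "beds"]], train["price"])

# coef_ is ordered like the training columns:
# mlm.coef_[0] pairs with "sqft", mlm.coef_[1] with "beds".
for name, slope in zip(["sqft", "beds"], mlm.coef_):
    print(f"slope for {name}: {slope:.1f}")
print(f"intercept (beta_0): {mlm.intercept_:.1f}")
```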