
Commit ee599ab

Merge pull request #309 from UBC-DSCI/best-params-fit-consistency
Consistency of usage of `best_params_` and `.fit`
2 parents: 72607be + 3eb9de3

File tree: 5 files changed, +61 -59 lines

source/classification1.md

Lines changed: 6 additions & 6 deletions
@@ -1834,24 +1834,24 @@ For the `y` response variable argument, we pass the `unscaled_cancer["Class"]` s
 ```{code-cell} ipython3
 from sklearn.pipeline import make_pipeline

-knn_fit = make_pipeline(preprocessor, knn).fit(
+knn_pipeline = make_pipeline(preprocessor, knn)
+knn_pipeline.fit(
     X=unscaled_cancer,
     y=unscaled_cancer["Class"]
 )
-
-knn_fit
+knn_pipeline
 ```

 As before, the fit object lists the function that trains the model. But now the fit object also includes information about
 the overall workflow, including the standardization preprocessing step.
-In other words, when we use the `predict` function with the `knn_fit` object to make a prediction for a new
+In other words, when we use the `predict` function with the `knn_pipeline` object to make a prediction for a new
 observation, it will first apply the same preprocessing steps to the new observation.
 As an example, we will predict the class label of two new observations:
 one with `Area = 500` and `Smoothness = 0.075`, and one with `Area = 1500` and `Smoothness = 0.1`.

 ```{code-cell} ipython3
 new_observation = pd.DataFrame({"Area": [500, 1500], "Smoothness": [0.075, 0.1]})
-prediction = knn_fit.predict(new_observation)
+prediction = knn_pipeline.predict(new_observation)
 prediction
 ```

@@ -1886,7 +1886,7 @@ asgrid = np.array(np.meshgrid(are_grid, smo_grid)).reshape(2, -1).T
 asgrid = pd.DataFrame(asgrid, columns=["Area", "Smoothness"])

 # use the fit workflow to make predictions at the grid points
-knnPredGrid = knn_fit.predict(asgrid)
+knnPredGrid = knn_pipeline.predict(asgrid)

 # bind the predictions as a new column with the grid points
 prediction_table = asgrid.copy()
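
The pattern this file adopts — build the pipeline in one statement, then call `.fit()` on it as a separate step — can be sketched in isolation. The snippet below is a minimal, self-contained illustration only: the toy data frame and the `StandardScaler` preprocessor are stand-ins, not the book's `unscaled_cancer` data or its column transformer.

```python
# Minimal sketch of the construct-then-fit pattern used above.
# The toy data below is illustrative, not the book's cancer dataset.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({
    "Area": [400, 600, 1400, 1600],
    "Smoothness": [0.07, 0.08, 0.10, 0.11],
    "Class": ["Benign", "Benign", "Malignant", "Malignant"],
})

# Build the pipeline first, then fit it as a separate step. The fitted
# pipeline re-applies the same scaling to any new observations it sees.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn_pipeline.fit(toy[["Area", "Smoothness"]], toy["Class"])

new_observation = pd.DataFrame({"Area": [500, 1500], "Smoothness": [0.075, 0.1]})
print(knn_pipeline.predict(new_observation))
```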

source/classification2.md

Lines changed: 33 additions & 32 deletions
@@ -594,9 +594,10 @@ knn = KNeighborsClassifier(n_neighbors=3)
 X = cancer_train[["Smoothness", "Concavity"]]
 y = cancer_train["Class"]

-knn_fit = make_pipeline(cancer_preprocessor, knn).fit(X, y)
+knn_pipeline = make_pipeline(cancer_preprocessor, knn)
+knn_pipeline.fit(X, y)

-knn_fit
+knn_pipeline
 ```

 ### Predict the labels in the test set
@@ -614,7 +615,7 @@ variables in the output data frame.

 ```{code-cell} ipython3
 cancer_test_predictions = cancer_test.assign(
-    predicted = knn_fit.predict(cancer_test[["Smoothness", "Concavity"]])
+    predicted = knn_pipeline.predict(cancer_test[["Smoothness", "Concavity"]])
 )
 cancer_test_predictions[["ID", "Class", "predicted"]]
 ```
@@ -645,7 +646,7 @@ for the predictors that we originally passed into `predict` when making predicti
 and we provide the actual labels via the `cancer_test["Class"]` series.

 ```{code-cell} ipython3
-cancer_acc_1 = knn_fit.score(
+cancer_acc_1 = knn_pipeline.score(
     cancer_test[["Smoothness", "Concavity"]],
     cancer_test["Class"]
 )
@@ -662,11 +663,9 @@ glue("cancer_acc_1", "{:0.0f}".format(100*cancer_acc_1))

 The output shows that the estimated accuracy of the classifier on the test data
 was {glue:text}`cancer_acc_1`%.
-We can also look at the *confusion matrix* for the classifier
-using the `crosstab` function from `pandas`. A confusion matrix shows how many
-observations of each (actual) label were classified as each (predicted) label.
-The `crosstab` function
-takes two arguments: the actual labels first, then the predicted labels second.
+We can also look at the *confusion matrix* for the classifier
+using the `crosstab` function from `pandas`. The `crosstab` function takes two
+arguments: the actual labels first, then the predicted labels second.

 ```{code-cell} ipython3
 pd.crosstab(
@@ -884,10 +883,11 @@ cancer_subtrain, cancer_validation = train_test_split(
 knn = KNeighborsClassifier(n_neighbors=3)
 X = cancer_subtrain[["Smoothness", "Concavity"]]
 y = cancer_subtrain["Class"]
-knn_fit = make_pipeline(cancer_preprocessor, knn).fit(X, y)
+knn_pipeline = make_pipeline(cancer_preprocessor, knn)
+knn_pipeline.fit(X, y)

 # compute the score on validation data
-acc = knn_fit.score(
+acc = knn_pipeline.score(
     cancer_validation[["Smoothness", "Concavity"]],
     cancer_validation["Class"]
 )
@@ -908,10 +908,10 @@ for i in range(1, 5):
     knn = KNeighborsClassifier(n_neighbors=3)
     X = cancer_subtrain[["Smoothness", "Concavity"]]
     y = cancer_subtrain["Class"]
-    knn_fit = make_pipeline(cancer_preprocessor, knn).fit(X, y)
+    knn_pipeline = make_pipeline(cancer_preprocessor, knn).fit(X, y)

     # compute the score on validation data
-    accuracies.append(knn_fit.score(
+    accuracies.append(knn_pipeline.score(
         cancer_validation[["Smoothness", "Concavity"]],
         cancer_validation["Class"]
     ))
@@ -979,7 +979,6 @@ Since the `cross_validate` function outputs a dictionary, we use `pd.DataFrame`
 dataframe for better visualization.
 Note that the `cross_validate` function handles stratifying the classes in
 each train and validate fold automatically.
-We begin by importing the `cross_validate` function from `sklearn`.

 ```{code-cell} ipython3
 from sklearn.model_selection import cross_validate
@@ -1183,17 +1182,14 @@ format. We will wrap it in a `pd.DataFrame` to make it easier to understand,
 and print the `info` of the result.

 ```{code-cell} ipython3
-accuracies_grid = pd.DataFrame(
-    cancer_tune_grid.fit(
-        cancer_train[["Smoothness", "Concavity"]],
-        cancer_train["Class"]
-    ).cv_results_
+cancer_tune_grid.fit(
+    cancer_train[["Smoothness", "Concavity"]],
+    cancer_train["Class"]
 )
-```
-
-```{code-cell} ipython3
+accuracies_grid = pd.DataFrame(cancer_tune_grid.cv_results_)
 accuracies_grid.info()
 ```
+
 There is a lot of information to look at here, but we are most interested
 in three quantities: the number of neighbors (`param_kneighborsclassifier__n_neighbors`),
 the cross-validation accuracy estimate (`mean_test_score`),
@@ -1224,8 +1220,7 @@ accuracies_grid

 We can decide which number of neighbors is best by plotting the accuracy versus $K$,
 as shown in {numref}`fig:06-find-k`.
-Here we are using the shortcut `point=True`
-to layer a point and line chart.
+Here we are using the shortcut `point=True` to layer a point and line chart.

 ```{code-cell} ipython3
 :tags: [remove-output]
@@ -1254,6 +1249,13 @@ glue("best_acc", "{:.1f}".format(accuracies_grid["mean_test_score"].max()*100))
 Plot of estimated accuracy versus the number of neighbors.
 :::

+We can also obtain the number of neighbours with the highest accuracy programmatically by accessing
+the `best_params_` attribute of the fit `GridSearchCV` object. Note that it is still useful to visualize
+the results as we did above since this provides additional information on how the model performance varies.
+```{code-cell} ipython3
+cancer_tune_grid.best_params_
+```
+
 +++

 Setting the number of
@@ -1303,13 +1305,13 @@ large_cancer_tune_grid = GridSearchCV(
     cv=10
 )

-large_accuracies_grid = pd.DataFrame(
-    large_cancer_tune_grid.fit(
-        cancer_train[["Smoothness", "Concavity"]],
-        cancer_train["Class"]
-    ).cv_results_
+large_cancer_tune_grid.fit(
+    cancer_train[["Smoothness", "Concavity"]],
+    cancer_train["Class"]
 )

+large_accuracies_grid = pd.DataFrame(large_cancer_tune_grid.cv_results_)
+
 large_accuracy_vs_k = alt.Chart(large_accuracies_grid).mark_line(point=True).encode(
     x=alt.X("param_kneighborsclassifier__n_neighbors").title("Neighbors"),
     y=alt.Y("mean_test_score")
@@ -1903,7 +1905,6 @@ n_total = len(names)
 # start with an empty list of selected predictors
 selected = []

-
 # create the pipeline and CV grid search objects
 param_grid = {
     "kneighborsclassifier__n_neighbors": range(1, 61, 5),
@@ -1929,8 +1930,8 @@ for i in range(1, n_total + 1):
     y = cancer_subset["Class"]

     # Find the best K for this set of predictors
-    cancer_model_grid = cancer_tune_grid.fit(X, y)
-    accuracies_grid = pd.DataFrame(cancer_model_grid.cv_results_)
+    cancer_tune_grid.fit(X, y)
+    accuracies_grid = pd.DataFrame(cancer_tune_grid.cv_results_)

     # Store the tuned accuracy for this set of predictors
     accs[j] = accuracies_grid["mean_test_score"].max()
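
All of the classification2.md edits rest on one property of scikit-learn estimators: `.fit()` mutates the object in place and returns that same object, so capturing the return value (as `cancer_model_grid = cancer_tune_grid.fit(X, y)` did) only creates an alias. Below is a self-contained sketch of the convention the book now follows; the synthetic data and the parameter grid are illustrative assumptions, not the book's setup.

```python
# Sketch of the GridSearchCV usage the commit standardizes on: fit the
# object in place, then read cv_results_ and best_params_ from it.
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the cancer training data
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(100, 2)), columns=["Smoothness", "Concavity"])
y = pd.Series(np.where(X["Concavity"] > 0, "Malignant", "Benign"))

cancer_tune_grid = GridSearchCV(
    estimator=make_pipeline(StandardScaler(), KNeighborsClassifier()),
    param_grid={"kneighborsclassifier__n_neighbors": range(1, 20, 2)},
    cv=5,
)

# fit() returns the estimator itself, so there is no need to assign it to a
# new name; the tuning results live on cancer_tune_grid afterwards.
cancer_tune_grid.fit(X, y)
accuracies_grid = pd.DataFrame(cancer_tune_grid.cv_results_)
print(cancer_tune_grid.best_params_)
print(accuracies_grid["mean_test_score"].max())
```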

source/clustering.md

Lines changed: 3 additions & 2 deletions
@@ -752,7 +752,7 @@ total WSSD, since the cluster center (denoted by large shapes with black outline
 the other hand, if we set K greater than 3, the clustering subdivides subgroups of data; this does indeed still
 decrease the total WSSD, but by only a *diminishing amount*. If we plot the total WSSD versus the number of
 clusters, we see that the decrease in total WSSD levels off (or forms an "elbow shape") when we reach roughly
-the right number of clusters ({numref}`toy-kmeans-elbow`)).
+the right number of clusters ({numref}`toy-kmeans-elbow`).

 ```{code-cell} ipython3
 :tags: [remove-cell]
@@ -840,7 +840,8 @@ the random seed in the beginning of this chapter, the clustering will be reprodu
 ```{code-cell} ipython3
 from sklearn.pipeline import make_pipeline

-penguin_clust = make_pipeline(preprocessor, kmeans).fit(penguins)
+penguin_clust = make_pipeline(preprocessor, kmeans)
+penguin_clust.fit(penguins)
 penguin_clust
 ```
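
The clustering change is the same split between construction and fitting. A minimal sketch under assumed inputs — the toy measurements and `StandardScaler` below stand in for the book's `penguins` data and its `preprocessor`:

```python
# Same two-step pattern for a clustering pipeline; the penguin-like toy
# data and the scaler are illustrative assumptions.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

penguins_toy = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 46.5, 50.0, 48.7, 36.7],
    "flipper_length_mm": [181, 186, 195, 217, 222, 192],
})

penguin_clust = make_pipeline(
    StandardScaler(), KMeans(n_clusters=3, n_init=10, random_state=1)
)
penguin_clust.fit(penguins_toy)

# The fitted KMeans step (named "kmeans" by make_pipeline) holds the labels.
print(penguin_clust.named_steps["kmeans"].labels_)
```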

source/regression1.md

Lines changed: 15 additions & 17 deletions
@@ -603,13 +603,13 @@ and rename the parameter column to be more readable.

 ```{code-cell} ipython3
 # fit the GridSearchCV object
-sacr_fit = sacr_gridsearch.fit(
+sacr_gridsearch.fit(
     sacramento_train[["sqft"]], # A single-column data frame
     sacramento_train["price"] # A series
 )

 # Retrieve the CV scores
-sacr_results = pd.DataFrame(sacr_fit.cv_results_)[[
+sacr_results = pd.DataFrame(sacr_gridsearch.cv_results_)[[
     "param_kneighborsregressor__n_neighbors",
     "mean_test_score",
     "std_test_score"
@@ -689,7 +689,7 @@ Note that it is still useful to visualize the results as we did above
 since this provides additional information on how the model performance varies.

 ```{code-cell} ipython3
-sacr_fit.best_params_
+sacr_gridsearch.best_params_
 ```

 +++
@@ -835,7 +835,7 @@ model uses a different default scoring metric than the RMSPE.
 from sklearn.metrics import mean_squared_error

 sacr_preds = sacramento_test.assign(
-    predicted = sacr_fit.predict(sacramento_test)
+    predicted = sacr_gridsearch.predict(sacramento_test)
 )
 RMSPE = mean_squared_error(
     y_true = sacr_preds["price"],
@@ -891,7 +891,7 @@ sqft_prediction_grid = pd.DataFrame({
 })
 # Predict the price for each of the sqft values in the grid
 sacr_preds = sqft_prediction_grid.assign(
-    predicted = sacr_fit.predict(sqft_prediction_grid)
+    predicted = sacr_gridsearch.predict(sqft_prediction_grid)
 )

 # Plot all the houses
@@ -1012,18 +1012,19 @@ param_grid = {
     "kneighborsregressor__n_neighbors": range(1, 50),
 }

-sacr_fit = GridSearchCV(
+sacr_gridsearch = GridSearchCV(
     estimator=sacr_pipeline,
     param_grid=param_grid,
     cv=5,
     scoring="neg_root_mean_squared_error"
-).fit(
-    sacramento_train[["sqft", "beds"]],
-    sacramento_train["price"]
-)
+)
+sacr_gridsearch.fit(
+    sacramento_train[["sqft", "beds"]],
+    sacramento_train["price"]
+)

 # retrieve the CV scores
-sacr_results = pd.DataFrame(sacr_fit.cv_results_)[[
+sacr_results = pd.DataFrame(sacr_gridsearch.cv_results_)[[
     "param_kneighborsregressor__n_neighbors",
     "mean_test_score",
     "std_test_score"
@@ -1035,13 +1036,10 @@ sacr_results = (
     .rename(columns={"param_kneighborsregressor__n_neighbors" : "n_neighbors"})
     .drop(columns=["std_test_score"])
 )
-
 sacr_results["mean_test_score"] = -sacr_results["mean_test_score"]

 # show only the row of minimum RMSPE
-sacr_results[
-    sacr_results["mean_test_score"] == sacr_results["mean_test_score"].min()
-]
+sacr_results.nsmallest(1, "mean_test_score")
 ```

 ```{code-cell} ipython3
@@ -1072,7 +1070,7 @@ to compute the RMSPE.

 ```{code-cell} ipython3
 sacr_preds = sacramento_test.assign(
-    predicted = sacr_fit.predict(sacramento_test)
+    predicted = sacr_gridsearch.predict(sacramento_test)
 )
 RMSPE_mult = mean_squared_error(
     y_true = sacr_preds["price"],
@@ -1109,7 +1107,7 @@ xygrid = np.array(np.meshgrid(xvals, yvals)).reshape(2, -1).T
 xygrid = pd.DataFrame(xygrid, columns=["sqft", "beds"])

 # add prediction
-knnPredGrid = sacr_fit.predict(xygrid)
+knnPredGrid = sacr_gridsearch.predict(xygrid)

 fig = px.scatter_3d(
     sacramento_train,
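
A runnable sketch of the regression tuning flow this file converges on: fit the `GridSearchCV` in place with `scoring="neg_root_mean_squared_error"`, flip the sign of the scores, and pick the best row with `nsmallest`. The synthetic sqft/price data below is an assumption for illustration only, not the Sacramento dataset.

```python
# Sketch of the updated regression1.md flow; synthetic housing-like data.
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
sqft = rng.uniform(500, 3000, size=200)
train = pd.DataFrame({"sqft": sqft,
                      "price": 100 * sqft + rng.normal(0, 20000, size=200)})

sacr_gridsearch = GridSearchCV(
    estimator=make_pipeline(StandardScaler(), KNeighborsRegressor()),
    param_grid={"kneighborsregressor__n_neighbors": range(1, 20)},
    cv=5,
    scoring="neg_root_mean_squared_error",  # sklearn maximizes, so RMSE is negated
)
sacr_gridsearch.fit(train[["sqft"]], train["price"])

sacr_results = pd.DataFrame(sacr_gridsearch.cv_results_)[
    ["param_kneighborsregressor__n_neighbors", "mean_test_score"]
]
# undo the negation so the column really is an RMSE, then take the smallest
sacr_results["mean_test_score"] = -sacr_results["mean_test_score"]
print(sacr_results.nsmallest(1, "mean_test_score"))
```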

source/regression2.md

Lines changed: 4 additions & 2 deletions
@@ -726,7 +726,8 @@ method as usual.

 ```{code-cell} ipython3

-mlm = LinearRegression().fit(
+mlm = LinearRegression()
+mlm.fit(
     sacramento_train[["sqft", "beds"]],
     sacramento_train["price"]
 )
@@ -838,11 +839,12 @@ Unfortunately you have to do this mapping yourself: the coefficients in `mlm.coe
 in the *same order* as the columns of the predictor data frame you used when training.
 So since we used `sacramento_train[["sqft", "beds"]]` when training,
 we have that `mlm.coef_[0]` corresponds to `sqft`, and `mlm.coef_[1]` corresponds to `beds`.
+Once you sort out the correspondence, you can then use those slopes to write a mathematical equation to describe the prediction plane:

 ```{index} plane equation
 ```

-And then use those slopes to write a mathematical equation to describe the prediction plane:
+

 $$\text{house sale price} = \beta_0 + \beta_1\cdot(\text{house size}) + \beta_2\cdot(\text{number of bedrooms}),$$
 where:
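
The regression2.md edit applies the same construct-then-fit split to `LinearRegression`, and the added sentence about coefficient order can be checked directly: `coef_` follows the column order of the training frame. A small sketch with invented data (all numbers are illustrative assumptions):

```python
# Demonstration of the coef_-to-column correspondence; the toy data
# stands in for the Sacramento training set.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
train = pd.DataFrame({
    "sqft": rng.uniform(500, 3000, size=50),
    "beds": rng.integers(1, 6, size=50),
})
train["price"] = 120 * train["sqft"] + 15000 * train["beds"] + rng.normal(0, 10000, size=50)

mlm = LinearRegression()
mlm.fit(train[["sqft", "beds"]], train["price"])

# coef_ is ordered like the training columns:
# mlm.coef_[0] pairs with "sqft", mlm.coef_[1] with "beds".
for name, slope in zip(["sqft", "beds"], mlm.coef_):
    print(f"slope for {name}: {slope:.1f}")
print(f"intercept (beta_0): {mlm.intercept_:.1f}")
```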