@@ -594,9 +594,10 @@ knn = KNeighborsClassifier(n_neighbors=3)
X = cancer_train[["Smoothness", "Concavity"]]
y = cancer_train["Class"]

- knn_fit = make_pipeline(cancer_preprocessor, knn).fit(X, y)
+ knn_pipeline = make_pipeline(cancer_preprocessor, knn)
+ knn_pipeline.fit(X, y)

- knn_fit
+ knn_pipeline
```

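As an aside, the fitted pipeline's individual steps can be inspected after `fit`. A minimal sketch (relying on the fact that `make_pipeline` names each step after its lowercased class name):

```python
# look up the classifier step of the fitted pipeline by its
# auto-generated name and confirm the number of neighbors
knn_pipeline.named_steps["kneighborsclassifier"].n_neighbors
```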
### Predict the labels in the test set
@@ -614,7 +615,7 @@ variables in the output data frame.

```{code-cell} ipython3
cancer_test_predictions = cancer_test.assign(
-     predicted=knn_fit.predict(cancer_test[["Smoothness", "Concavity"]])
+     predicted=knn_pipeline.predict(cancer_test[["Smoothness", "Concavity"]])
)
cancer_test_predictions[["ID", "Class", "predicted"]]
```
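As a quick check (a sketch reusing `cancer_test_predictions` from the cell above), the fraction of rows where the predicted label matches the true label can be computed directly; this is exactly the accuracy that `score` reports in the next section:

```python
# compare true labels to predictions row by row; the mean of the
# resulting boolean series is the classification accuracy
(cancer_test_predictions["Class"] == cancer_test_predictions["predicted"]).mean()
```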
@@ -645,7 +646,7 @@ for the predictors that we originally passed into `predict` when making predicti
and we provide the actual labels via the `cancer_test["Class"]` series.

```{code-cell} ipython3
- cancer_acc_1 = knn_fit.score(
+ cancer_acc_1 = knn_pipeline.score(
    cancer_test[["Smoothness", "Concavity"]],
    cancer_test["Class"]
)
@@ -662,11 +663,9 @@ glue("cancer_acc_1", "{:0.0f}".format(100*cancer_acc_1))

The output shows that the estimated accuracy of the classifier on the test data
was {glue:text}`cancer_acc_1`%.
- We can also look at the *confusion matrix* for the classifier
- using the `crosstab` function from `pandas`. A confusion matrix shows how many
- observations of each (actual) label were classified as each (predicted) label.
- The `crosstab` function
- takes two arguments: the actual labels first, then the predicted labels second.
+ We can also look at the *confusion matrix* for the classifier
+ using the `crosstab` function from `pandas`. The `crosstab` function takes two
+ arguments: the actual labels first, then the predicted labels second.

```{code-cell} ipython3
pd.crosstab(
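To make the argument order concrete, here is a tiny self-contained sketch with made-up labels (not drawn from the cancer data): the first argument becomes the rows (actual labels) and the second the columns (predicted labels).

```python
import pandas as pd

# made-up labels purely to illustrate crosstab's argument order
actual = pd.Series(["Benign", "Benign", "Malignant", "Malignant"], name="actual")
predicted = pd.Series(["Benign", "Malignant", "Malignant", "Malignant"], name="predicted")

# rows are actual labels, columns are predicted labels
pd.crosstab(actual, predicted)
```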
@@ -884,10 +883,11 @@ cancer_subtrain, cancer_validation = train_test_split(
knn = KNeighborsClassifier(n_neighbors=3)
X = cancer_subtrain[["Smoothness", "Concavity"]]
y = cancer_subtrain["Class"]
- knn_fit = make_pipeline(cancer_preprocessor, knn).fit(X, y)
+ knn_pipeline = make_pipeline(cancer_preprocessor, knn)
+ knn_pipeline.fit(X, y)

# compute the score on validation data
- acc = knn_fit.score(
+ acc = knn_pipeline.score(
    cancer_validation[["Smoothness", "Concavity"]],
    cancer_validation["Class"]
)
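As an aside (a sketch reusing the objects above, not part of the original text), comparing the pipeline's accuracy on the subtraining data it was fit on against the validation accuracy shows why held-out data matters; the former is typically optimistic:

```python
# accuracy on the data the model was fit on (optimistic) versus
# accuracy on the held-out validation data
print(knn_pipeline.score(X, y), acc)
```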
@@ -908,10 +908,10 @@ for i in range(1, 5):
    knn = KNeighborsClassifier(n_neighbors=3)
    X = cancer_subtrain[["Smoothness", "Concavity"]]
    y = cancer_subtrain["Class"]
-     knn_fit = make_pipeline(cancer_preprocessor, knn).fit(X, y)
+     knn_pipeline = make_pipeline(cancer_preprocessor, knn).fit(X, y)

    # compute the score on validation data
-     accuracies.append(knn_fit.score(
+     accuracies.append(knn_pipeline.score(
        cancer_validation[["Smoothness", "Concavity"]],
        cancer_validation["Class"]
    ))
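After the loop, averaging the per-split accuracies (a small sketch, assuming the `accuracies` list built above) gives a more stable estimate than any single split provides:

```python
import numpy as np

# average the validation accuracies collected across the splits
np.mean(accuracies)
```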
@@ -979,7 +979,6 @@ Since the `cross_validate` function outputs a dictionary, we use `pd.DataFrame`
dataframe for better visualization.
Note that the `cross_validate` function handles stratifying the classes in
each train and validate fold automatically.
- We begin by importing the `cross_validate` function from `sklearn`.

```{code-cell} ipython3
from sklearn.model_selection import cross_validate
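For reference, a sketch of how `cross_validate` is typically called with a pipeline like the one above; the `cv=5` and the use of `knn_pipeline` here are assumptions, not necessarily the exact call the chapter makes next:

```python
# run 5-fold cross-validation; the pipeline is refit once per fold and
# the result is a dictionary of arrays (fit times, score times, scores)
cv_results = cross_validate(
    knn_pipeline,
    cancer_train[["Smoothness", "Concavity"]],
    cancer_train["Class"],
    cv=5,
)
pd.DataFrame(cv_results)
```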
@@ -1183,17 +1182,14 @@ format. We will wrap it in a `pd.DataFrame` to make it easier to understand,
and print the `info` of the result.

```{code-cell} ipython3
- accuracies_grid = pd.DataFrame(
-     cancer_tune_grid.fit(
-         cancer_train[["Smoothness", "Concavity"]],
-         cancer_train["Class"]
-     ).cv_results_
+ cancer_tune_grid.fit(
+     cancer_train[["Smoothness", "Concavity"]],
+     cancer_train["Class"]
)
- ```
-
- ```{code-cell} ipython3
+ accuracies_grid = pd.DataFrame(cancer_tune_grid.cv_results_)
accuracies_grid.info()
```
+

There is a lot of information to look at here, but we are most interested
in three quantities: the number of neighbors (`param_kneighborsclassifier__n_neighbors`),
the cross-validation accuracy estimate (`mean_test_score`),
@@ -1224,8 +1220,7 @@ accuracies_grid

We can decide which number of neighbors is best by plotting the accuracy versus $K$,
as shown in {numref}`fig:06-find-k`.
- Here we are using the shortcut `point=True`
- to layer a point and line chart.
+ Here we are using the shortcut `point=True` to layer a point and line chart.

```{code-cell} ipython3
:tags: [remove-output]
@@ -1254,6 +1249,13 @@ glue("best_acc", "{:.1f}".format(accuracies_grid["mean_test_score"].max()*100))
Plot of estimated accuracy versus the number of neighbors.
:::

+ We can also obtain the number of neighbors with the highest accuracy programmatically by accessing
+ the `best_params_` attribute of the fitted `GridSearchCV` object. Note that it is still useful to visualize
+ the results as we did above, since the plot gives additional information about how the model's performance varies.
+ ```{code-cell} ipython3
+ cancer_tune_grid.best_params_
+ ```
+
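Relatedly (an aside, not part of this change): because `GridSearchCV` defaults to `refit=True`, the object also holds a final pipeline refit on the full training set with the best parameters, available via `best_estimator_`:

```python
# the pipeline refit on all of the training data using the best K;
# it can be used directly for prediction on new observations
cancer_tune_grid.best_estimator_
```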
+++
Setting the number of
@@ -1303,13 +1305,13 @@ large_cancer_tune_grid = GridSearchCV(
    cv=10
)

- large_accuracies_grid = pd.DataFrame(
-     large_cancer_tune_grid.fit(
-         cancer_train[["Smoothness", "Concavity"]],
-         cancer_train["Class"]
-     ).cv_results_
+ large_cancer_tune_grid.fit(
+     cancer_train[["Smoothness", "Concavity"]],
+     cancer_train["Class"]
)

+ large_accuracies_grid = pd.DataFrame(large_cancer_tune_grid.cv_results_)
+

large_accuracy_vs_k = alt.Chart(large_accuracies_grid).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors").title("Neighbors"),
    y=alt.Y("mean_test_score")
@@ -1903,7 +1905,6 @@ n_total = len(names)
# start with an empty list of selected predictors
selected = []

-
# create the pipeline and CV grid search objects
param_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 61, 5),
@@ -1929,8 +1930,8 @@ for i in range(1, n_total + 1):
        y = cancer_subset["Class"]

        # Find the best K for this set of predictors
-         cancer_model_grid = cancer_tune_grid.fit(X, y)
-         accuracies_grid = pd.DataFrame(cancer_model_grid.cv_results_)
+         cancer_tune_grid.fit(X, y)
+         accuracies_grid = pd.DataFrame(cancer_tune_grid.cv_results_)

        # Store the tuned accuracy for this set of predictors
        accs[j] = accuracies_grid["mean_test_score"].max()
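As a final aside (a sketch, not part of the original loop): besides the maximum of `mean_test_score`, the `cv_results_` frame also records which number of neighbors achieved it, an alternative to the `best_params_` attribute shown earlier:

```python
# locate the row with the best cross-validation accuracy and read off
# the corresponding number of neighbors
best_row = accuracies_grid["mean_test_score"].idxmax()
accuracies_grid.loc[best_row, "param_kneighborsclassifier__n_neighbors"]
```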