Commit 802d1ef

author
xhlulu
committed
ML Docs: More explanations for the KNN section
1 parent 895231f commit 802d1ef

File tree

1 file changed

+66
-7
lines changed

doc/python/ml-knn.md

Lines changed: 66 additions & 7 deletions
@@ -20,7 +20,7 @@ jupyter:
2020
name: python
2121
nbconvert_exporter: python
2222
pygments_lexer: ipython3
23-
version: 3.7.6
23+
version: 3.7.7
2424
plotly:
2525
description: Visualize scikit-learn's k-Nearest Neighbors (kNN) classification
2626
in Python with Plotly.
@@ -36,13 +36,19 @@ jupyter:
3636

3737
## Basic binary classification with kNN
3838

39-
This section gets us started with displaying basic binary classification using 2D data. We first show how to display training versus testing data using [various marker styles](https://plot.ly/python/marker-style/), then demonstrate how to evaluate a kNN classifier's performance on the **test split** using a continuous color gradient to indicate the model's predicted score.
39+
This section gets us started with displaying basic binary classification using 2D data. We first show how to display training versus testing data using [various marker styles](https://plot.ly/python/marker-style/), then demonstrate how to evaluate our classifier's performance on the **test split** using a continuous color gradient to indicate the model's predicted score.
40+
41+
We will use [Scikit-learn](https://scikit-learn.org/) for training our model and for loading and splitting data. Scikit-learn is a popular Machine Learning (ML) library that offers various tools for creating and training ML algorithms, feature engineering, data cleaning, and evaluating and testing models. It was designed to be accessible, and to work seamlessly with popular libraries like NumPy and Pandas.
42+
43+
We will train a [k-Nearest Neighbors (kNN)](https://scikit-learn.org/stable/modules/neighbors.html) classifier. First, the model records the label of each training sample. Then, whenever we give it a new sample, it will look at the `k` closest samples from the training set to find the most common label, and assign it to our new sample.
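The voting rule described above can be sketched in a few lines of NumPy. This is only an illustration of the idea, not how scikit-learn implements `KNeighborsClassifier`; the helper name `knn_predict` and the toy data are hypothetical, and Euclidean distance is assumed.

```python
import numpy as np
from collections import Counter


def knn_predict(X_train, y_train, x_new, k=3):
    """Hypothetical helper: label x_new by majority vote among its k nearest training samples."""
    # Euclidean distance from x_new to every training sample
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy training set (made up for illustration): two samples of class "0", three of class "1"
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
y_train = np.array(["0", "0", "1", "1", "1"])

pred_near_origin = knn_predict(X_train, y_train, np.array([0.05, 0.1]), k=3)
pred_near_one = knn_predict(X_train, y_train, np.array([1.0, 0.9]), k=3)
```

With `k=3`, the point near the origin has two class-"0" neighbors and one class-"1" neighbor, so the vote goes to "0"; the other point sits among the three class-"1" samples.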
4044

4145

4246
### Display training and test splits
4347

4448

45-
Here, we display all the negative labels as squares, and positive labels as circles. We differentiate the training and test set by adding a dot to the center of test data.
49+
Using Scikit-learn, we first generate synthetic data in the shape of two interleaving half-moons. We then split it into training and test sets. Finally, we display the ground truth labels using [a scatter plot](https://plotly.com/python/line-and-scatter/).
50+
51+
In the graph, we display all the negative labels as squares, and positive labels as circles. We distinguish the test set from the training set by adding a dot to the center of each test marker.
4652

4753
```python
4854
import numpy as np
@@ -52,6 +58,7 @@ from sklearn.datasets import make_moons
5258
from sklearn.model_selection import train_test_split
5359
from sklearn.neighbors import KNeighborsClassifier
5460

61+
# Load and split data
5562
X, y = make_moons(noise=0.3, random_state=0)
5663
X_train, X_test, y_train, y_test = train_test_split(
5764
    X, y.astype(str), test_size=0.25, random_state=0)
@@ -78,10 +85,12 @@ fig.update_traces(
7885
fig.show()
7986
```
8087

81-
### Visualize predictions on test split
88+
### Visualize predictions on test split with [`plotly.express`](https://plotly.com/python/plotly-express/)
89+
8290

91+
Now, we train the kNN model on the same training data displayed in the previous graph. Then, we predict the confidence score of the model for each of the data points in the test set. We will use shapes to denote the true labels, and the color will indicate how confident the model is in its prediction.
8392

84-
Now, we evaluate the model only on the test set. Notice that `px.scatter` only require 1 function call to plot both negative and positive labels, and can additionally set a continuous color scale based on the `y_score` output by our kNN model.
93+
Notice that `px.scatter` requires only one function call to plot both negative and positive labels, and can additionally set a continuous color scale based on the `y_score` output by our kNN model.
8594

8695
```python
8796
import numpy as np
@@ -114,6 +123,56 @@ fig.show()
114123

115124
## Probability Estimates with `go.Contour`
116125

126+
Just like the previous example, we will first train our kNN model on the training set.
127+
128+
Instead of predicting the confidence for the test set only, we can predict a confidence map over the entire area spanned by our dataset, plus a small margin. To do this, we use [`np.meshgrid`](https://numpy.org/doc/stable/reference/generated/numpy.meshgrid.html) to create a grid, where the spacing between neighboring points is set by the `mesh_size` variable.
129+
130+
Then, for each of those points, we will use our model to give a confidence score, and plot it with a [contour plot](https://plotly.com/python/contour-plots/).
131+
132+
```python
133+
import numpy as np
134+
import plotly.express as px
135+
import plotly.graph_objects as go
136+
from sklearn.datasets import make_moons
137+
from sklearn.model_selection import train_test_split
138+
from sklearn.neighbors import KNeighborsClassifier
139+
140+
mesh_size = .02
141+
margin = 0.25
142+
143+
# Load and split data
144+
X, y = make_moons(noise=0.3, random_state=0)
145+
X_train, X_test, y_train, y_test = train_test_split(
146+
    X, y.astype(str), test_size=0.25, random_state=0)
147+
148+
# Create a mesh grid on which we will run our model
149+
x_min, x_max = X[:, 0].min() - margin, X[:, 0].max() + margin
150+
y_min, y_max = X[:, 1].min() - margin, X[:, 1].max() + margin
151+
xrange = np.arange(x_min, x_max, mesh_size)
152+
yrange = np.arange(y_min, y_max, mesh_size)
153+
xx, yy = np.meshgrid(xrange, yrange)
154+
155+
# Create classifier, run predictions on grid
156+
clf = KNeighborsClassifier(15, weights='uniform')
157+
clf.fit(X_train, y_train)
158+
Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
159+
Z = Z.reshape(xx.shape)
160+
161+
162+
# Plot the figure
163+
fig = go.Figure(data=[
164+
    go.Contour(
165+
        x=xrange,
166+
        y=yrange,
167+
        z=Z,
168+
        colorscale='RdBu'
169+
    )
170+
])
171+
fig.show()
172+
```
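To make the grid step in the block above easier to follow, here is a minimal sketch on a tiny 3-by-2 grid. The values and the names `xs`, `ys`, and `scores` are made up for illustration, and a simple sum stands in for the model's `predict_proba` output.

```python
import numpy as np

xs = np.arange(0.0, 1.5, 0.5)   # 3 x-values: 0.0, 0.5, 1.0
ys = np.arange(0.0, 1.0, 0.5)   # 2 y-values: 0.0, 0.5

# meshgrid returns two arrays of shape (len(ys), len(xs)):
# xx holds the x-coordinate of every grid point, yy the y-coordinate
xx, yy = np.meshgrid(xs, ys)

# Flatten both arrays and stack them as columns, giving one (x, y) row per grid point,
# which is the (n_samples, 2) layout a scikit-learn model expects
grid_points = np.c_[xx.ravel(), yy.ravel()]

# Stand-in for per-point model scores (assumption: any function of the points works here)
scores = grid_points.sum(axis=1)

# Reshape the flat scores back to the grid layout so they can be passed as z to a contour plot
Z = scores.reshape(xx.shape)
```

Because `ravel` and `reshape` use the same row-major order, `Z[i, j]` is the score at `(xs[j], ys[i])`, which matches how `go.Contour` reads `x`, `y`, and `z`.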
173+
174+
Now, let's try to combine our `go.Contour` plot with the first scatter plot of our data points, so that we can visually compare the confidence of our model with the true labels.
175+
117176
```python
118177
import numpy as np
119178
import plotly.express as px
@@ -178,9 +237,9 @@ fig.add_trace(
178237
fig.show()
179238
```
180239

181-
## Multi-class prediction confidence with `go.Heatmap`
240+
## Multi-class prediction confidence with [`go.Heatmap`](https://plotly.com/python/heatmaps/)
182241

183-
It is also possible to visualize the prediction confidence of the model using `go.Heatmap`. In this example, you can see how to compute how confident the model is about its prediction at every point in the 2D grid. Here, we define the confidence as the difference between the highest score and the score of the other classes summed, at a certain point.
242+
It is also possible to visualize the prediction confidence of the model using [heatmaps](https://plotly.com/python/heatmaps/). In this example, you can see how to compute how confident the model is about its prediction at every point in the 2D grid. Here, we define the confidence at each point as the difference between the highest class score and the sum of the scores of the other classes.
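As a small worked example of that confidence definition, the sketch below applies it to a hand-written score matrix. The array `proba` is hypothetical; it only mimics the shape of a multi-class `predict_proba` output, with one row per point and one column per class.

```python
import numpy as np

# Hypothetical class scores for three grid points and three classes
# (each row sums to 1, as predict_proba output would)
proba = np.array([
    [0.8, 0.1, 0.1],
    [0.4, 0.4, 0.2],
    [1.0, 0.0, 0.0],
])

# Highest class score at each point
top = proba.max(axis=1)
# Sum of the scores of all the other classes at each point
others = proba.sum(axis=1) - top
# Confidence: highest score minus the summed scores of the other classes
confidence = top - others
```

Under this definition the three points get confidences of 0.6, -0.2, and 1.0. The value drops below zero whenever the top class holds less than half of the total score, so it ranges from -1 to 1 rather than from 0 to 1.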
184243

185244
```python
186245
import numpy as np
