Commit e2de8d4

Adds NER NLP end-to-end example (elastic#2226)
1 parent 1a6831c commit e2de8d4

9 files changed: +14383 -1 lines changed

docs/en/stack/ml/nlp/data/les-miserables-nd.json

Lines changed: 14021 additions & 0 deletions
Large diffs are not rendered by default.

docs/en/stack/ml/nlp/index.asciidoc

Lines changed: 2 additions & 0 deletions
@@ -7,4 +7,6 @@ include::ml-nlp-deploy-models.asciidoc[leveloffset=+1]
 include::ml-nlp-inference.asciidoc[leveloffset=+1]
 include::ml-nlp-apis.asciidoc[leveloffset=+1]
 include::ml-nlp-model-ref.asciidoc[leveloffset=+1]
+include::ml-nlp-examples.asciidoc[leveloffset=+1]
+include::ml-nlp-ner-example.asciidoc[leveloffset=+2]

docs/en/stack/ml/nlp/ml-nlp-examples.asciidoc

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+[[ml-nlp-examples]]
+= Examples
+
+The following pages contain end-to-end examples of how to use the different
+{nlp} tasks in the {stack}.
+
+* <<ml-nlp-ner-example>>

docs/en/stack/ml/nlp/ml-nlp-extract-info.asciidoc

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ These NLP tasks enable you to extract information from your unstructured text:
 == Named entity recognition
 
 The named entity recognition (NER) task can identify and categorize certain
-entities typically proper nouns in your unstructured text. Named entities
+entities - typically proper nouns - in your unstructured text. Named entities
 usually refer to objects in the real world such as persons, locations,
 organizations, and other miscellaneous entities that are consistently referenced
 by a proper name.
docs/en/stack/ml/nlp/ml-nlp-ner-example.asciidoc

Lines changed: 309 additions & 0 deletions

[[ml-nlp-ner-example]]
= How to deploy named entity recognition

++++
<titleabbrev>Named entity recognition</titleabbrev>
++++
:keywords: {ml-init}, {stack}, {nlp}

You can use these instructions to deploy a
<<ml-nlp-ner,named entity recognition (NER)>> model in {es}, test the model, and
add it to an {infer} ingest pipeline. The model that is used in the example is
publicly available on https://huggingface.co/[HuggingFace].


[discrete]
[[ex-ner-requirements]]
== Requirements

include::ml-nlp-shared.asciidoc[tag=nlp-requirements]


[discrete]
[[ex-ner-deploy]]
== Deploy a NER model

include::ml-nlp-shared.asciidoc[tag=nlp-eland-clone-docker-build]

Select a NER model from the
{ml-docs}/ml-nlp-model-ref.html#ml-nlp-model-ref-ner[third-party model reference list].
This example uses an
https://huggingface.co/elastic/distilbert-base-uncased-finetuned-conll03-english[uncased NER model].

Install the model by running the `eland_import_hub_model` command in the Docker
image:

[source,shell]
--------------------------------------------------
docker run -it --rm elastic/eland \
    eland_import_hub_model \
      --cloud-id $CLOUD_ID \
      -u <username> -p <password> \
      --hub-model-id elastic/distilbert-base-uncased-finetuned-conll03-english \
      --task-type ner \
      --start
--------------------------------------------------

Provide an administrator username and password, and replace `$CLOUD_ID` with the
ID of your Cloud deployment. You can copy the Cloud ID from the deployments page
of your Cloud account.
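
After the import completes, you can check that the model is present in the
cluster, for example with the trained model statistics API (shown here as a
sketch; the model ID follows the `elastic__<model-name>` convention that Eland
uses for imported models):

[source,js]
--------------------------------------------------
GET _ml/trained_models/elastic__distilbert-base-uncased-finetuned-conll03-english/_stats
--------------------------------------------------
// NOTCONSOLE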

include::ml-nlp-shared.asciidoc[tag=nlp-start]

include::ml-nlp-shared.asciidoc[tag=nlp-sync]


[discrete]
[[ex-ner-test]]
== Test the NER model

Deployed models can be evaluated in {kib} under **{ml-app}** >
**Trained Models** by selecting the **Test model** action for the respective
model.

[role="screenshot"]
image::images/ml-nlp-ner-test.png[Test trained model UI]

.**Test the model by using the _infer API**
[%collapsible]
====
You can also evaluate your models by using the
{ref}/infer-trained-model-deployment.html[_infer API]. In the following request,
`text_field` is the field name where the model expects to find the input, as
defined in the model configuration. By default, if the model was uploaded via
Eland, the input field is `text_field`.

[source,js]
--------------------------------------------------
POST _ml/trained_models/elastic__distilbert-base-uncased-finetuned-conll03-english/_infer
{
  "docs": [
    {
      "text_field": "Elastic is headquartered in Mountain View, California."
    }
  ]
}
--------------------------------------------------

The API returns a response similar to the following:

[source,js]
--------------------------------------------------
{
  "inference_results": [
    {
      "predicted_value": "[Elastic](ORG&Elastic) is headquartered in [Mountain View](LOC&Mountain+View), [California](LOC&California).",
      "entities": [
        {
          "entity": "elastic",
          "class_name": "ORG",
          "class_probability": 0.9958921231805256,
          "start_pos": 0,
          "end_pos": 7
        },
        {
          "entity": "mountain view",
          "class_name": "LOC",
          "class_probability": 0.9844731508992688,
          "start_pos": 28,
          "end_pos": 41
        },
        {
          "entity": "california",
          "class_name": "LOC",
          "class_probability": 0.9972361009811214,
          "start_pos": 43,
          "end_pos": 53
        }
      ]
    }
  ]
}
--------------------------------------------------
// NOTCONSOLE
====

Using the example text "Elastic is headquartered in Mountain View, California.",
the model finds three entities: an organization "Elastic", and two locations
"Mountain View" and "California".


[discrete]
[[ex-ner-ingest]]
== Add the NER model to an {infer} ingest pipeline

You can perform bulk {infer} on documents as they are ingested by using an
{ref}/inference-processor.html[{infer} processor] in your ingest pipeline. The
novel _Les Misérables_ by Victor Hugo is used as the example data set in the
following steps.
https://github.com/elastic/stack-docs/blob/8.5/docs/en/stack/ml/nlp/data/les-miserables-nd.json[Download]
the novel text split by paragraph as a JSON file, then upload it by using the
{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer].
Give the new index the name `les-miserables` when uploading the file.

Now create an ingest pipeline either in the
{ml-docs}/ml-nlp-inference.html#ml-nlp-inference-processor[Stack management UI]
or by using the API:

[source,js]
--------------------------------------------------
PUT _ingest/pipeline/ner
{
  "description": "NER pipeline",
  "processors": [
    {
      "inference": {
        "model_id": "elastic__distilbert-base-uncased-finetuned-conll03-english",
        "target_field": "ml.ner",
        "field_map": {
          "paragraph": "text_field"
        }
      }
    },
    {
      "script": {
        "lang": "painless",
        "if": "return ctx['ml']['ner'].containsKey('entities')",
        "source": "Map tags = new HashMap(); for (item in ctx['ml']['ner']['entities']) { if (!tags.containsKey(item.class_name)) tags[item.class_name] = new HashSet(); tags[item.class_name].add(item.entity);} ctx['tags'] = tags;"
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "description": "Index document to 'failed-<index>'",
        "field": "_index",
        "value": "failed-{{{ _index }}}"
      }
    },
    {
      "set": {
        "description": "Set error message",
        "field": "ingest.failure",
        "value": "{{_ingest.on_failure_message}}"
      }
    }
  ]
}
--------------------------------------------------

The `field_map` object of the `inference` processor maps the `paragraph` field
in the _Les Misérables_ documents to `text_field` (the name of the field the
model is configured to use). The `target_field` is the name of the field to
write the inference results to.

The `script` processor pulls out the entities and groups them by type. The end
result is lists of people, locations, and organizations detected in the input
text. This Painless script enables you to build visualizations from the fields
that are created.
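
In Python terms, the grouping that the Painless script performs is roughly the
following (a sketch for illustration, run against a mock `ml.ner.entities`
array rather than a live pipeline):

```python
def group_entities_by_type(entities):
    """Group entity values by class_name into deduplicated sets,
    mirroring the `tags` field that the Painless script writes."""
    tags = {}
    for item in entities:
        tags.setdefault(item["class_name"], set()).add(item["entity"])
    return tags

# Mock of the entities array produced by the inference processor.
entities = [
    {"entity": "gillenormand", "class_name": "PER"},
    {"entity": "paris", "class_name": "LOC"},
    {"entity": "gillenormand", "class_name": "PER"},  # duplicate, deduplicated below
]
print(group_entities_by_type(entities))
# {'PER': {'gillenormand'}, 'LOC': {'paris'}}
```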

The purpose of the `on_failure` clause is to record errors. It sets the `_index`
meta field to a new value, so the failed document is stored there instead. It
also sets a new field, `ingest.failure`, to which the error message is written.
{infer-cap} can fail for a number of easily fixable reasons. Perhaps the model
has not been deployed, or the input field is missing in some of the source
documents. By redirecting the failed documents to another index and setting the
error message, those failed inferences are not lost and can be reviewed later.
When the errors are fixed, reindex from the failed index to recover the
unsuccessful requests.
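
For example, after fixing the root cause, a recovery reindex might look like
the following (a sketch only; the `failed-les-miserables-infer` name follows
from the `failed-{{{ _index }}}` pattern that this pipeline sets, so adjust it
to your own failed index):

[source,js]
--------------------------------------------------
POST _reindex
{
  "source": {
    "index": "failed-les-miserables-infer"
  },
  "dest": {
    "index": "les-miserables-infer",
    "pipeline": "ner"
  }
}
--------------------------------------------------
// NOTCONSOLE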

Ingest the text of the novel - the index `les-miserables` - through the pipeline
you created:

[source,js]
--------------------------------------------------
POST _reindex
{
  "source": {
    "index": "les-miserables"
  },
  "dest": {
    "index": "les-miserables-infer",
    "pipeline": "ner"
  }
}
--------------------------------------------------

Take a random paragraph from the source document as an example:

[source,js]
--------------------------------------------------
{
  "paragraph": "Father Gillenormand did not do it intentionally, but inattention to proper names was an aristocratic habit of his.",
  "line": 12700
}
--------------------------------------------------

After the text is ingested through the NER pipeline, find the resulting document
stored in {es}:

[source,js]
--------------------------------------------------
GET /les-miserables-infer/_search
{
  "query": {
    "term": {
      "line": 12700
    }
  }
}
--------------------------------------------------

The request returns the document marked up with one identified person:

[source,js]
--------------------------------------------------
(...)
"paragraph": "Father Gillenormand did not do it intentionally, but inattention to proper names was an aristocratic habit of his.",
"@timestamp": "2020-01-01T17:38:25.000+01:00",
"line": 12700,
"ml": {
  "ner": {
    "predicted_value": "Father [Gillenormand](PER&Gillenormand) did not do it intentionally, but inattention to proper names was an aristocratic habit of his.",
    "entities": [
      {
        "entity": "gillenormand",
        "class_name": "PER",
        "class_probability": 0.9452480789333386,
        "start_pos": 7,
        "end_pos": 19
      }
    ],
    "model_id": "elastic__distilbert-base-uncased-finetuned-conll03-english"
  }
},
"tags": {
  "PER": [
    "gillenormand"
  ]
}
(...)
--------------------------------------------------


[discrete]
[[ex-ner-visual]]
== Visualize results

You can create a tag cloud to visualize your data processed by the {infer}
pipeline. A tag cloud is a visualization that scales words by the frequency at
which they occur. It is a handy tool for viewing the entities found in the data.

In {kib}, open **Stack management** > **{data-sources-cap}**, and create a new
{data-source} from the `les-miserables-infer` index pattern.

Open **Dashboard** and create a new dashboard. Select the
**Aggregation based** > **Tag cloud** visualization type. Choose the new
{data-source} as the source.

Add a new bucket with a term aggregation, select the `tags.PER.keyword` field,
and increase the size to 20.

Optionally, adjust the time selector to cover the data points in the
{data-source} if you selected a time field when creating it.

Update and save the visualization.

[role="screenshot"]
image::images/ml-nlp-tag-cloud.png[alt="Tag cloud created from Les Misérables",align="center"]
docs/en/stack/ml/nlp/ml-nlp-shared.asciidoc

Lines changed: 42 additions & 0 deletions

tag::nlp-eland-clone-docker-build[]
You can use the {eland-docs}[Eland client] to install the {nlp} model. Eland
commands can be run in Docker. First, clone the Eland repository, then create a
Docker image of Eland:

[source,shell]
--------------------------------------------------
git clone git@github.com:elastic/eland.git
cd eland
docker build -t elastic/eland .
--------------------------------------------------

After the build finishes, your Eland Docker client is ready to use.
end::nlp-eland-clone-docker-build[]

tag::nlp-requirements[]
To follow along with the process on this page, you must have:

* an {es} Cloud cluster that is set up properly to use the {ml-features}. Refer
to <<setup>>.

* the {subscriptions}[appropriate subscription] level or the free trial period
activated.

* https://docs.docker.com/get-docker/[Docker] installed.
end::nlp-requirements[]

tag::nlp-start[]
Since the `--start` option is used at the end of the Eland import command, {es}
deploys the model ready to use. If you have multiple models and want to select
which model to deploy, you can use the **{ml-app} > Model Management** user
interface in {kib} to manage the starting and stopping of models.
end::nlp-start[]

tag::nlp-sync[]
Go to the **{ml-app} > Trained Models** page and synchronize your trained
models. A warning message is displayed at the top of the page that says
_"ML job and trained model synchronization required"_. Follow the link to
_"Synchronize your jobs and trained models."_ Then click **Synchronize**. You
can also wait for the automatic synchronization that occurs every hour, or
use the {kibana-ref}/ml-sync.html[sync {ml} objects API].
end::nlp-sync[]

docs/en/stack/ml/nlp/ml-nlp.asciidoc

Lines changed: 1 addition & 0 deletions
@@ -15,5 +15,6 @@ predictions.
 * <<ml-nlp-deploy-models>>
 * <<ml-nlp-inference>>
 * <<ml-nlp-apis>>
+* <<ml-nlp-examples>>
 
 --
