Commit e2de8d4

Adds NER NLP end-to-end example (elastic#2226)
1 parent 1a6831c commit e2de8d4

9 files changed: +14383 -1 lines changed

docs/en/stack/ml/nlp/data/les-miserables-nd.json

Lines changed: 14021 additions & 0 deletions
Large diffs are not rendered by default.

docs/en/stack/ml/nlp/index.asciidoc

Lines changed: 2 additions & 0 deletions
@@ -7,4 +7,6 @@ include::ml-nlp-deploy-models.asciidoc[leveloffset=+1]
 include::ml-nlp-inference.asciidoc[leveloffset=+1]
 include::ml-nlp-apis.asciidoc[leveloffset=+1]
 include::ml-nlp-model-ref.asciidoc[leveloffset=+1]
+include::ml-nlp-examples.asciidoc[leveloffset=+1]
+include::ml-nlp-ner-example.asciidoc[leveloffset=+2]

docs/en/stack/ml/nlp/ml-nlp-examples.asciidoc

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+[[ml-nlp-examples]]
+= Examples
+
+The following pages contain end-to-end examples of how to use the different
+{nlp} tasks in the {stack}.
+
+* <<ml-nlp-ner-example>>

docs/en/stack/ml/nlp/ml-nlp-extract-info.asciidoc

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ These NLP tasks enable you to extract information from your unstructured text:
 == Named entity recognition
 
 The named entity recognition (NER) task can identify and categorize certain
-entities typically proper nouns in your unstructured text. Named entities
+entities - typically proper nouns - in your unstructured text. Named entities
 usually refer to objects in the real world such as persons, locations,
 organizations, and other miscellaneous entities that are consistently referenced
 by a proper name.
docs/en/stack/ml/nlp/ml-nlp-ner-example.asciidoc

Lines changed: 309 additions & 0 deletions

[[ml-nlp-ner-example]]
= How to deploy named entity recognition

++++
<titleabbrev>Named entity recognition</titleabbrev>
++++
:keywords: {ml-init}, {stack}, {nlp}

You can use these instructions to deploy a
<<ml-nlp-ner,named entity recognition (NER)>> model in {es}, test the model, and
add it to an {infer} ingest pipeline. The model that is used in the example is
publicly available on https://huggingface.co/[HuggingFace].


[discrete]
[[ex-ner-requirements]]
== Requirements

include::ml-nlp-shared.asciidoc[tag=nlp-requirements]


[discrete]
[[ex-ner-deploy]]
== Deploy a NER model

include::ml-nlp-shared.asciidoc[tag=nlp-eland-clone-docker-build]

Select a NER model from the
{ml-docs}/ml-nlp-model-ref.html#ml-nlp-model-ref-ner[third-party model reference list].
This example uses an
https://huggingface.co/elastic/distilbert-base-uncased-finetuned-conll03-english[uncased NER model].

Install the model by running the `eland_import_hub_model` command in the Docker
image:

[source,shell]
--------------------------------------------------
docker run -it --rm elastic/eland \
    eland_import_hub_model \
      --cloud-id $CLOUD_ID \
      -u <username> -p <password> \
      --hub-model-id elastic/distilbert-base-uncased-finetuned-conll03-english \
      --task-type ner \
      --start
--------------------------------------------------

Provide an administrator username and password, and replace `$CLOUD_ID` with the
ID of your Cloud deployment. You can copy the Cloud ID from the deployments page
of your Cloud account.
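
After the import completes, you can check that the model is present in the
cluster, for example with the trained model statistics API (shown here as a
sketch; the model ID follows the `elastic__<model-name>` convention that Eland
uses for imported models):

[source,js]
--------------------------------------------------
GET _ml/trained_models/elastic__distilbert-base-uncased-finetuned-conll03-english/_stats
--------------------------------------------------
// NOTCONSOLE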

include::ml-nlp-shared.asciidoc[tag=nlp-start]

include::ml-nlp-shared.asciidoc[tag=nlp-sync]


[discrete]
[[ex-ner-test]]
== Test the NER model

Deployed models can be evaluated in {kib} under **{ml-app}** >
**Trained Models** by selecting the **Test model** action for the respective
model.

[role="screenshot"]
image::images/ml-nlp-ner-test.png[Test trained model UI]

.**Test the model by using the _infer API**
[%collapsible]
====
You can also evaluate your models by using the
{ref}/infer-trained-model-deployment.html[_infer API]. In the following request,
`text_field` is the field name where the model expects to find the input, as
defined in the model configuration. By default, if the model was uploaded via
Eland, the input field is `text_field`.

[source,js]
--------------------------------------------------
POST _ml/trained_models/elastic__distilbert-base-uncased-finetuned-conll03-english/_infer
{
  "docs": [
    {
      "text_field": "Elastic is headquartered in Mountain View, California."
    }
  ]
}
--------------------------------------------------

The API returns a response similar to the following:

[source,js]
--------------------------------------------------
{
  "inference_results": [
    {
      "predicted_value": "[Elastic](ORG&Elastic) is headquartered in [Mountain View](LOC&Mountain+View), [California](LOC&California).",
      "entities": [
        {
          "entity": "elastic",
          "class_name": "ORG",
          "class_probability": 0.9958921231805256,
          "start_pos": 0,
          "end_pos": 7
        },
        {
          "entity": "mountain view",
          "class_name": "LOC",
          "class_probability": 0.9844731508992688,
          "start_pos": 28,
          "end_pos": 41
        },
        {
          "entity": "california",
          "class_name": "LOC",
          "class_probability": 0.9972361009811214,
          "start_pos": 43,
          "end_pos": 53
        }
      ]
    }
  ]
}
--------------------------------------------------
// NOTCONSOLE
====

Using the example text "Elastic is headquartered in Mountain View, California.",
the model finds three entities: an organization "Elastic", and two locations
"Mountain View" and "California".


[discrete]
[[ex-ner-ingest]]
== Add the NER model to an {infer} ingest pipeline

You can perform bulk {infer} on documents as they are ingested by using an
{ref}/inference-processor.html[{infer} processor] in your ingest pipeline. The
novel _Les Misérables_ by Victor Hugo is used as the example data set in the
following steps.
https://github.com/elastic/stack-docs/blob/8.5/docs/en/stack/ml/nlp/data/les-miserables-nd.json[Download]
the novel text split by paragraph as a JSON file, then upload it by using the
{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer].
Give the new index the name `les-miserables` when uploading the file.

Now create an ingest pipeline either in the
{ml-docs}/ml-nlp-inference.html#ml-nlp-inference-processor[Stack management UI]
or by using the API:

[source,js]
--------------------------------------------------
PUT _ingest/pipeline/ner
{
  "description": "NER pipeline",
  "processors": [
    {
      "inference": {
        "model_id": "elastic__distilbert-base-uncased-finetuned-conll03-english",
        "target_field": "ml.ner",
        "field_map": {
          "paragraph": "text_field"
        }
      }
    },
    {
      "script": {
        "lang": "painless",
        "if": "return ctx['ml']['ner'].containsKey('entities')",
        "source": "Map tags = new HashMap(); for (item in ctx['ml']['ner']['entities']) { if (!tags.containsKey(item.class_name)) tags[item.class_name] = new HashSet(); tags[item.class_name].add(item.entity);} ctx['tags'] = tags;"
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "description": "Index document to 'failed-<index>'",
        "field": "_index",
        "value": "failed-{{{ _index }}}"
      }
    },
    {
      "set": {
        "description": "Set error message",
        "field": "ingest.failure",
        "value": "{{_ingest.on_failure_message}}"
      }
    }
  ]
}
--------------------------------------------------

The `field_map` object of the `inference` processor maps the `paragraph` field
in the _Les Misérables_ documents to `text_field` (the name of the field the
model is configured to use). The `target_field` is the name of the field to
write the inference results to.

The `script` processor pulls out the entities and groups them by type. The end
result is lists of people, locations, and organizations detected in the input
text. This Painless script enables you to build visualizations from the fields
that are created.
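
In Python terms, the grouping that the Painless script performs is roughly the
following (a sketch for illustration, run against a mock `ml.ner.entities`
array rather than a live pipeline):

```python
def group_entities_by_type(entities):
    """Group entity values by class_name into deduplicated sets,
    mirroring the `tags` field that the Painless script writes."""
    tags = {}
    for item in entities:
        tags.setdefault(item["class_name"], set()).add(item["entity"])
    return tags

# Mock of the entities array produced by the inference processor.
entities = [
    {"entity": "gillenormand", "class_name": "PER"},
    {"entity": "paris", "class_name": "LOC"},
    {"entity": "gillenormand", "class_name": "PER"},  # duplicate, deduplicated below
]
print(group_entities_by_type(entities))
# {'PER': {'gillenormand'}, 'LOC': {'paris'}}
```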

The purpose of the `on_failure` clause is to record errors. It sets the `_index`
meta field to a new value, so the failed document is stored there instead. It
also sets a new field, `ingest.failure`, to which the error message is written.
{infer-cap} can fail for a number of easily fixable reasons. Perhaps the model
has not been deployed, or the input field is missing in some of the source
documents. By redirecting the failed documents to another index and setting the
error message, those failed inferences are not lost and can be reviewed later.
When the errors are fixed, reindex from the failed index to recover the
unsuccessful requests.
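
For example, after fixing the root cause, a recovery reindex might look like
the following (a sketch only; the `failed-les-miserables-infer` name follows
from the `failed-{{{ _index }}}` pattern that this pipeline sets, so adjust it
to your own failed index):

[source,js]
--------------------------------------------------
POST _reindex
{
  "source": {
    "index": "failed-les-miserables-infer"
  },
  "dest": {
    "index": "les-miserables-infer",
    "pipeline": "ner"
  }
}
--------------------------------------------------
// NOTCONSOLE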

Ingest the text of the novel - the index `les-miserables` - through the pipeline
you created:

[source,js]
--------------------------------------------------
POST _reindex
{
  "source": {
    "index": "les-miserables"
  },
  "dest": {
    "index": "les-miserables-infer",
    "pipeline": "ner"
  }
}
--------------------------------------------------

Take a random paragraph from the source document as an example:

[source,js]
--------------------------------------------------
{
  "paragraph": "Father Gillenormand did not do it intentionally, but inattention to proper names was an aristocratic habit of his.",
  "line": 12700
}
--------------------------------------------------

After the text is ingested through the NER pipeline, find the resulting document
stored in {es}:

[source,js]
--------------------------------------------------
GET /les-miserables-infer/_search
{
  "query": {
    "term": {
      "line": 12700
    }
  }
}
--------------------------------------------------

The request returns the document marked up with one identified person:

[source,js]
--------------------------------------------------
(...)
"paragraph": "Father Gillenormand did not do it intentionally, but inattention to proper names was an aristocratic habit of his.",
"@timestamp": "2020-01-01T17:38:25.000+01:00",
"line": 12700,
"ml": {
  "ner": {
    "predicted_value": "Father [Gillenormand](PER&Gillenormand) did not do it intentionally, but inattention to proper names was an aristocratic habit of his.",
    "entities": [
      {
        "entity": "gillenormand",
        "class_name": "PER",
        "class_probability": 0.9452480789333386,
        "start_pos": 7,
        "end_pos": 19
      }
    ],
    "model_id": "elastic__distilbert-base-uncased-finetuned-conll03-english"
  }
},
"tags": {
  "PER": [
    "gillenormand"
  ]
}
(...)
--------------------------------------------------


[discrete]
[[ex-ner-visual]]
== Visualize results

You can create a tag cloud to visualize your data processed by the {infer}
pipeline. A tag cloud is a visualization that scales words by the frequency at
which they occur. It is a handy tool for viewing the entities found in the data.

In {kib}, open **Stack management** > **{data-sources-cap}**, and create a new
{data-source} from the `les-miserables-infer` index pattern.

Open **Dashboard** and create a new dashboard. Select the
**Aggregation based** > **Tag cloud** visualization type. Choose the new
{data-source} as the source.

Add a new bucket with a term aggregation, select the `tags.PER.keyword` field,
and increase the size to 20.

Optionally, adjust the time selector to cover the data points in the
{data-source} if you selected a time field when creating it.

Update and save the visualization.

[role="screenshot"]
image::images/ml-nlp-tag-cloud.png[alt="Tag cloud created from Les Misérables",align="center"]
docs/en/stack/ml/nlp/ml-nlp-shared.asciidoc

Lines changed: 42 additions & 0 deletions

tag::nlp-eland-clone-docker-build[]
You can use the {eland-docs}[Eland client] to install the {nlp} model. Eland
commands can be run in Docker. First, clone the Eland repository, then create a
Docker image of Eland:

[source,shell]
--------------------------------------------------
git clone git@github.com:elastic/eland.git
cd eland
docker build -t elastic/eland .
--------------------------------------------------

After the build finishes, your Eland Docker client is ready to use.
end::nlp-eland-clone-docker-build[]

tag::nlp-requirements[]
To follow along with the process on this page, you must have:

* an {es} Cloud cluster that is set up properly to use the {ml-features}. Refer
to <<setup>>.

* the {subscriptions}[appropriate subscription] level or the free trial period
activated.

* https://docs.docker.com/get-docker/[Docker] installed.
end::nlp-requirements[]

tag::nlp-start[]
Since the `--start` option is used at the end of the Eland import command, {es}
deploys the model ready to use. If you have multiple models and want to select
which model to deploy, you can use the **{ml-app} > Model Management** user
interface in {kib} to manage the starting and stopping of models.
end::nlp-start[]

tag::nlp-sync[]
Go to the **{ml-app} > Trained Models** page and synchronize your trained
models. A warning message is displayed at the top of the page that says
_"ML job and trained model synchronization required"_. Follow the link to
_"Synchronize your jobs and trained models."_ Then click **Synchronize**. You
can also wait for the automatic synchronization that occurs every hour, or
use the {kibana-ref}/ml-sync.html[sync {ml} objects API].
end::nlp-sync[]

docs/en/stack/ml/nlp/ml-nlp.asciidoc

Lines changed: 1 addition & 0 deletions
@@ -15,5 +15,6 @@ predictions.
 * <<ml-nlp-deploy-models>>
 * <<ml-nlp-inference>>
 * <<ml-nlp-apis>>
+* <<ml-nlp-examples>>
 
 --
