
Commit c68bb3a

Authored by szabosteve, davidkyle, and lcawl
Adds text embedding and vector search end-to-end example (elastic#2229)
Co-authored-by: David Kyle <[email protected]>
Co-authored-by: Lisa Cawley <[email protected]>
1 parent 4986ea0 commit c68bb3a

7 files changed (+182814, -1 lines)

docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv

Lines changed: 182469 additions & 0 deletions
Large diffs are not rendered by default.

docs/en/stack/ml/nlp/index.asciidoc

Lines changed: 1 addition & 0 deletions

@@ -9,4 +9,5 @@ include::ml-nlp-apis.asciidoc[leveloffset=+1]
 include::ml-nlp-model-ref.asciidoc[leveloffset=+1]
 include::ml-nlp-examples.asciidoc[leveloffset=+1]
 include::ml-nlp-ner-example.asciidoc[leveloffset=+2]
+include::ml-nlp-text-emb-vector-search-example.asciidoc[leveloffset=+2]

docs/en/stack/ml/nlp/ml-nlp-examples.asciidoc

Lines changed: 2 additions & 1 deletion

@@ -4,4 +4,5 @@
 The following pages contain end-to-end examples of how to use the different
 {nlp} tasks in the {stack}.

-* <<ml-nlp-ner-example>>
+* <<ml-nlp-ner-example>>
+* <<ml-nlp-text-emb-vector-search-example>>
docs/en/stack/ml/nlp/ml-nlp-text-emb-vector-search-example.asciidoc

Lines changed: 342 additions & 0 deletions
[[ml-nlp-text-emb-vector-search-example]]
= How to deploy a text embedding model and use it with vector search

++++
<titleabbrev>Text embedding and vector search</titleabbrev>
++++
:keywords: {ml-init}, {stack}, {nlp}

You can use these instructions to deploy a
<<ml-nlp-text-embedding,text embedding>> model in {es}, test the model, and
add it to an {infer} ingest pipeline. This workflow enables you to generate
vector representations of text and to perform vector similarity search on the
generated vectors. The model that is used in the example is publicly available
on https://huggingface.co/[HuggingFace].

The example uses a public data set from the
https://microsoft.github.io/msmarco/#ranking[MS MARCO Passage Ranking Task]. It
consists of real questions from the Microsoft Bing search engine and
human-generated answers for them. The example works with a sample of this data
set, uses a model to produce text embeddings, and then runs vector search on
the generated vectors.

[discrete]
[[ex-te-vs-requirements]]
== Requirements

include::ml-nlp-shared.asciidoc[tag=nlp-requirements]


[discrete]
[[ex-te-vs-deploy]]
== Deploy a text embedding model

include::ml-nlp-shared.asciidoc[tag=nlp-eland-clone-docker-build]

Select a text embedding model from the
{ml-docs}/ml-nlp-model-ref.html#ml-nlp-model-ref-text-embedding[third-party model reference list].
This example uses the
https://huggingface.co/sentence-transformers/msmarco-MiniLM-L-12-v3[msmarco-MiniLM-L-12-v3]
sentence-transformers model.

Install the model by running the `eland_import_hub_model` command in the Docker
image:

[source,shell]
--------------------------------------------------
docker run -it --rm elastic/eland \
    eland_import_hub_model \
      --cloud-id $CLOUD_ID \
      -u <username> -p <password> \
      --hub-model-id sentence-transformers/msmarco-MiniLM-L-12-v3 \
      --task-type text_embedding \
      --start
--------------------------------------------------

You need to provide an administrator username and password and replace
`$CLOUD_ID` with the ID of your Cloud deployment. You can copy the Cloud ID
from the deployment page of your Cloud console.
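
As an optional check, you can verify that the import succeeded by calling the
{ref}/get-trained-models.html[get trained models API] with the model ID that
Eland derives from the hub model ID:

[source,js]
--------------------------------------------------
GET _ml/trained_models/sentence-transformers__msmarco-minilm-l-12-v3
--------------------------------------------------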

include::ml-nlp-shared.asciidoc[tag=nlp-start]

include::ml-nlp-shared.asciidoc[tag=nlp-sync]

[discrete]
[[ex-text-emb-test]]
== Test the text embedding model

Deployed models can be evaluated in {kib} under **{ml-app}** >
**Trained Models** by selecting the **Test model** action for the respective
model.

[role="screenshot"]
image::images/ml-nlp-text-emb-test.png[Test trained model UI]

.**Test the model by using the _infer API**
[%collapsible]
====
You can also evaluate your models by using the
{ref}/infer-trained-model-deployment.html[_infer API]. In the following request,
`text_field` is the field name where the model expects to find the input, as
defined in the model configuration. By default, if the model was uploaded via
Eland, the input field is `text_field`.

[source,js]
--------------------------------------------------
POST /_ml/trained_models/sentence-transformers__msmarco-minilm-l-12-v3/_infer
{
  "docs": {
    "text_field": "How is the weather in Jamaica?"
  }
}
--------------------------------------------------

The API returns a response similar to the following:

[source,js]
--------------------------------------------------
{
  "inference_results": [
    {
      "predicted_value": [
        0.39521875977516174,
        -0.3263707458972931,
        0.26809820532798767,
        0.30127981305122375,
        0.502890408039093,
        ...
      ]
    }
  ]
}
--------------------------------------------------
// NOTCONSOLE
====

The result is the predicted dense vector transformed from the example text.
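
If you want to cross-check the vector outside of {es}, a minimal sketch with
the `sentence-transformers` Python package (assuming it is installed locally)
computes the same embedding:

[source,python]
--------------------------------------------------
# Sketch: compute the embedding locally and compare it with the _infer output.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/msmarco-MiniLM-L-12-v3")
embedding = model.encode("How is the weather in Jamaica?")

print(len(embedding))   # 384 dimensions
print(embedding[:5])    # should be close to the values returned by _infer
--------------------------------------------------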


[discrete]
[[ex-text-emb-data]]
== Load data

In this step, you load the data that you later use in an ingest pipeline to get
the embeddings.

The data set `msmarco-passagetest2019-top1000` is a subset of the MS MARCO
Passage Ranking data set used in the testing stage of the 2019 TREC Deep
Learning Track. It contains 200 queries and, for each query, a list of relevant
text passages extracted by a simple information retrieval (IR) system. From that
data set, all unique passages with their IDs have been extracted and put into a
https://github.com/elastic/stack-docs/blob/8.5/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file],
totaling 182469 passages. In the following, this file is used as the example
data set.

Upload the file by using the
{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer].
Name the first column `id` and the second one `text`. The index name is
`collection`. After the upload is done, you can see an index named `collection`
with 182469 documents.

[role="screenshot"]
image::images/ml-nlp-text-emb-data.png[Importing the data]
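
As an optional check, you can confirm the document count of the new index with
the count API:

[source,js]
--------------------------------------------------
GET collection/_count
--------------------------------------------------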

[discrete]
[[ex-text-emb-ingest]]
== Add the text embedding model to an {infer} ingest pipeline

Process the initial data with an
{ref}/inference-processor.html[{infer} processor]. It adds an embedding for each
passage. For this, create a text embedding ingest pipeline and then reindex the
initial data with this pipeline.

Create the ingest pipeline either in the
{ml-docs}/ml-nlp-inference.html#ml-nlp-inference-processor[{stack-manage-app} UI]
or by using the API:

[source,js]
--------------------------------------------------
PUT _ingest/pipeline/text-embeddings
{
  "description": "Text embedding pipeline",
  "processors": [
    {
      "inference": {
        "model_id": "sentence-transformers__msmarco-minilm-l-12-v3",
        "target_field": "text_embedding",
        "field_map": {
          "text": "text_field"
        }
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "description": "Index document to 'failed-<index>'",
        "field": "_index",
        "value": "failed-{{{_index}}}"
      }
    },
    {
      "set": {
        "description": "Set error message",
        "field": "ingest.failure",
        "value": "{{_ingest.on_failure_message}}"
      }
    }
  ]
}
--------------------------------------------------

The passages are in a field named `text`. The `field_map` maps the text to the
field `text_field` that the model expects. The `on_failure` handler is set to
index failures into a different index.
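
Optionally, you can dry-run the pipeline on a sample document by using the
{ref}/simulate-pipeline-api.html[simulate pipeline API] before you reindex the
full data set:

[source,js]
--------------------------------------------------
POST _ingest/pipeline/text-embeddings/_simulate
{
  "docs": [
    {
      "_source": {
        "text": "How is the weather in Jamaica?"
      }
    }
  ]
}
--------------------------------------------------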

Before ingesting the data through the pipeline, create the mappings of the
destination index, in particular for the field `text_embedding.predicted_value`
where the ingest processor stores the embeddings. The msmarco-MiniLM-L-12-v3
model produces embeddings with 384 dimensions, so the `dense_vector` field must
be configured with the same number of dimensions through the `dims` option:

[source,js]
--------------------------------------------------
PUT collection-with-embeddings
{
  "mappings": {
    "properties": {
      "text_embedding.predicted_value": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine"
      },
      "text": {
        "type": "text"
      }
    }
  }
}
--------------------------------------------------

Create the text embeddings by reindexing the data to the
`collection-with-embeddings` index through the {infer} pipeline. The {infer}
ingest processor inserts the embedding vector into each document.

[source,js]
--------------------------------------------------
POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "collection"
  },
  "dest": {
    "index": "collection-with-embeddings",
    "pipeline": "text-embeddings"
  }
}
--------------------------------------------------

The API call returns a task ID that can be used to monitor the progress:

[source,js]
--------------------------------------------------
GET _tasks/<task_id>
--------------------------------------------------

You can also open the model stats UI to follow the progress.

[role="screenshot"]
image::images/ml-nlp-text-emb-reindex.png[Model status UI]

After the reindexing is finished, the documents in the new index contain the
{infer} results: the vector embeddings.
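
As an optional spot check, you can retrieve one enriched document and confirm
that the embedding field is present:

[source,js]
--------------------------------------------------
GET collection-with-embeddings/_search
{
  "size": 1,
  "_source": ["text", "text_embedding.predicted_value"]
}
--------------------------------------------------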


[discrete]
[[ex-text-emb-vect-search]]
== Vector similarity search

To perform vector similarity search, you first need to obtain the text
embedding of your query text. This example uses the
"How is the weather in Jamaica?" query as the input text. The
{ref}/infer-trained-model-deployment.html[_infer API] gives you the embedding
of this query as a dense vector:

[source,js]
--------------------------------------------------
POST /_ml/trained_models/sentence-transformers__msmarco-minilm-l-12-v3/_infer
{
  "docs": {
    "text_field": "How is the weather in Jamaica?"
  }
}
--------------------------------------------------

You can use the resulting dense vector in the `query_vector` of a
{ref}/knn-search.html[kNN search]:

[source,js]
--------------------------------------------------
GET collection-with-embeddings/_search
{
  "knn": {
    "field": "text_embedding.predicted_value",
    "query_vector": [
      0.39521875977516174,
      -0.3263707458972931,
      0.26809820532798767,
      0.30127981305122375,
      (...)
    ],
    "k": 10,
    "num_candidates": 100
  },
  "_source": [
    "id",
    "text"
  ]
}
--------------------------------------------------

As a result, you receive the top 10 documents from the
`collection-with-embeddings` index that are closest in meaning to the query,
sorted by their proximity to it:

[source,js]
--------------------------------------------------
"hits" : [
  {
    "_index" : "collection-with-embeddings",
    "_id" : "47TPtn8BjSkJO8zzKq_o",
    "_score" : 0.94591534,
    "_source" : {
      "id" : 434125,
      "text" : "The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year. Continue Reading."
    }
  },
  {
    "_index" : "collection-with-embeddings",
    "_id" : "3LTPtn8BjSkJO8zzKJO1",
    "_score" : 0.94536424,
    "_source" : {
      "id" : 4498474,
      "text" : "The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year"
    }
  },
  {
    "_index" : "collection-with-embeddings",
    "_id" : "KrXPtn8BjSkJO8zzPbDW",
    "_score" : 0.9432083,
    "_source" : {
      "id" : 190804,
      "text" : "Quick Answer. The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year. Continue Reading"
    }
  },
  (...)
]
--------------------------------------------------

If you want to do a quick verification of the results, follow the steps of the
_Quick verification_ section of
https://www.elastic.co/blog/how-to-deploy-nlp-text-embeddings-and-vector-search#[this blog post].
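
Copying the query vector by hand is tedious. As a sketch of how the two calls
can be chained programmatically (assuming the official `elasticsearch` Python
client, version 8.x, and the index and model names used above), the query
embedding can be fed straight into the kNN search:

[source,python]
--------------------------------------------------
# Sketch: embed the query with _infer, then run a kNN search with the result.
from elasticsearch import Elasticsearch

# Assumption: adjust the endpoint and credentials for your own deployment.
es = Elasticsearch("https://localhost:9200", api_key="<api_key>")

MODEL_ID = "sentence-transformers__msmarco-minilm-l-12-v3"

# Embed the query text with the deployed model.
response = es.ml.infer_trained_model(
    model_id=MODEL_ID,
    docs=[{"text_field": "How is the weather in Jamaica?"}],
)
query_vector = response["inference_results"][0]["predicted_value"]

# Use the resulting dense vector as the query_vector of a kNN search.
results = es.search(
    index="collection-with-embeddings",
    knn={
        "field": "text_embedding.predicted_value",
        "query_vector": query_vector,
        "k": 10,
        "num_candidates": 100,
    },
    source=["id", "text"],
)

for hit in results["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["text"][:80])
--------------------------------------------------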
