Commit 424b0c4

[DOCS] Amends data frame analytics overview and adds resources section (elastic#1726)
* Amends data frame analytics overview.
* Adds metadata to How DFA works page.
* Renames Concepts to Advanced concepts.
* Adds DFA at scale link to Advanced concepts.
* Adds Resources section.
* Changes link on DFA main page.
1 parent 9785216 commit 424b0c4

File tree

7 files changed: +167 -152 lines changed

docs/en/stack/ml/df-analytics/index.asciidoc

Lines changed: 5 additions & 4 deletions
@@ -1,11 +1,11 @@
 include::ml-dfanalytics.asciidoc[]
 
 include::ml-dfa-overview.asciidoc[leveloffset=+1]
-include::ml-supervised-workflow.asciidoc[leveloffset=+2]
-include::ml-dfa-phases.asciidoc[leveloffset=+2]
-include::ml-dfa-scale.asciidoc[leveloffset=+2]
+
 
 include::ml-dfa-concepts.asciidoc[leveloffset=+1]
+include::ml-how-dfa-works.asciidoc[leveloffset=+2]
+include::ml-dfa-scale.asciidoc[leveloffset=+2]
 include::dfa-outlier-detection.asciidoc[leveloffset=+2]
 include::dfa-regression.asciidoc[leveloffset=+2]
 include::dfa-classification.asciidoc[leveloffset=+2]
@@ -25,4 +25,5 @@ include::flightdata-regression.asciidoc[leveloffset=+2]
 include::flightdata-classification.asciidoc[leveloffset=+2]
 include::ml-lang-ident.asciidoc[leveloffset=+2]
 
-include::ml-dfa-limitations.asciidoc[leveloffset=+1]
+include::ml-dfa-resources.asciidoc[leveloffset=+1]
+include::ml-dfa-limitations.asciidoc[leveloffset=+2]

docs/en/stack/ml/df-analytics/ml-dfa-concepts.asciidoc

Lines changed: 5 additions & 3 deletions
@@ -1,10 +1,12 @@
 [role="xpack"]
 [[ml-dfa-concepts]]
-= Concepts
+= Advanced concepts
 
-This section explains the fundamental concepts of the Elastic {ml} {dfanalytics}
-feature and the corresponding {evaluatedf-api}.
+This section explains the more complex concepts of the Elastic {ml}
+{dfanalytics} feature.
 
+* <<ml-dfa-phases>>
+* <<ml-dfa-scale>>
 * <<dfa-outlier-detection>>
 * <<dfa-regression>>
 * <<dfa-classification>>

docs/en/stack/ml/df-analytics/ml-dfa-overview.asciidoc

Lines changed: 138 additions & 0 deletions
@@ -43,3 +43,141 @@ with supervised learning.
 | {regression} | supervised
 | {classification} | supervised
 |===
+
+[discrete]
+[[ml-supervised-workflow]]
+== Introduction to supervised learning
+
+
+Elastic supervised learning enables you to train a {ml} model based on training
+examples that you provide. You can then use your model to make predictions on
+new data. This page summarizes the end-to-end workflow for training, evaluating,
+and deploying a model. It gives a high-level overview of the steps required to
+identify and implement a solution using supervised learning.
+
+The workflow for supervised learning consists of the following stages:
+
+image::images/ml-dfa-lifecycle-diagram.png["Supervised learning workflow"]
+
+These are iterative stages, meaning that after evaluating each step, you might
+need to make adjustments before you move further.
+
+[discrete]
+[[define-problem]]
+=== Define the problem
+
+It’s important to take a moment and think about where {ml} can be most
+impactful. Consider what type of data you have available and what value it
+holds. The better you know the data, the quicker you will be able to create {ml}
+models that generate useful insights. What kinds of patterns do you want to
+discover in your data? What type of value do you want to predict: a category, or
+a numerical value? The answers help you choose the type of analysis that fits
+your use case.
+
+After you identify the problem, consider which of the {ml-features} are most
+likely to help you solve it. Supervised learning requires a data set that
+contains known values that the model can be trained on. Unsupervised learning –
+like {anomaly-detect} or {oldetection} – does not have this requirement.
+
+The {stack} provides the following types of supervised learning:
+
+* {regression}: predicts **continuous, numerical values** like the response time
+of a web request.
+* {classification}: predicts **discrete, categorical values** like whether a
+https://www.elastic.co/blog/machine-learning-in-cybersecurity-training-supervised-models-to-detect-dga-activity[DNS request originates from a malicious or benign ___domain].
+
+
+[discrete]
+[[prepare-transform-data]]
+=== Prepare and transform data
+
+You have defined the problem and selected an appropriate type of analysis. The
+next step is to produce a high-quality data set in {es} with a clear
+relationship to your training objectives. If your data is not already in {es},
+this is the stage where you develop your data pipeline. If you want to learn
+more about how to ingest data into {es}, refer to the
+{ref}/ingest.html[Ingest node documentation].
+
+{regression-cap} and {classification} are supervised {ml} techniques, therefore
+you must supply a labelled data set for training. This is often called the
+"ground truth". The training process uses this information to identify
+relationships among the various characteristics of the data and the predicted
+value. It also plays a critical role in model evaluation.
+
+An important requirement is a data set that is large enough to train a model.
+For example, if you would like to train a {classification} model that decides
+whether an email is spam or not, you need a labelled data set that contains
+enough data points from each possible category to train the model. What counts
+as "enough" depends on various factors like the complexity of the problem or
+the {ml} solution you have chosen. There is no exact number that fits every
+use case; deciding how much data is acceptable is rather a heuristic process
+that might involve iterative trials.
+
+Before you train the model, consider preprocessing the data. In practice, the
+type of preprocessing depends on the nature of the data set. Preprocessing can
+include, but is not limited to, mitigating redundancy, reducing biases, applying
+standards and/or conventions, data normalization, and so on.
+
+{regression-cap} and {classification} require specifically structured source
+data: a two-dimensional tabular data structure. For this reason, you might need
+to {ref}/transforms.html[{transform}] your data to create a {dataframe} which
+can be used as the source for these types of {dfanalytics}.
+
+[discrete]
+[[train-test-iterate]]
+=== Train, test, iterate
+
+After your data is prepared and transformed into the right format, it is time to
+train the model. Training is an iterative process: every iteration is followed
+by an evaluation to see how the model performs.
+
+The first step is defining the features – the relevant fields in the data set –
+that will be used for training the model. By default, all the fields with
+supported types are included in {regression} and {classification} automatically.
+However, you can optionally exclude irrelevant fields from the process. Doing so
+makes a large data set more manageable, reducing the computing resources and
+time required for training.
+
+Next you must define how to split your data into a training and a test set. The
+test set won’t be used to train the model; it is used to evaluate how the model
+performs. There is no optimal percentage that fits all use cases; it depends on
+the amount of data and the time you have to train. For large data sets, you may
+want to start with a low training percent to complete an end-to-end iteration in
+a short time.
+
+During the training process, the training data is fed through the learning
+algorithm. The model predicts the value and compares it to the ground truth,
+then the model is fine-tuned to make the predictions more accurate.
+
+Once the model is trained, you can evaluate how well it predicts previously
+unseen data with the model generalization error. There are further
+evaluation types for both {regression} and {classification} analysis which
+provide metrics about training performance. When you are satisfied with the
+results, you are ready to deploy the model. Otherwise, you may want to adjust
+the training configuration or consider alternative ways to preprocess and
+represent your data.
+
+[discrete]
+[[deploy-model]]
+=== Deploy model
+
+You have trained the model and are satisfied with the performance. The last step
+is to deploy your trained model and start using it on new data.
+
+The Elastic {ml} feature called {infer} enables you to make predictions for new
+data by using the model as a processor in an ingest pipeline, in a continuous
+{transform}, or as an aggregation at search time. When new data comes into your
+ingest pipeline or you run a search on your data with an {infer} aggregation,
+the model is used to infer against the data and make predictions on it.
+
+[discrete]
+[[next-steps]]
+=== Next steps
+
+* Read more about how to {ref}/transforms.html[transform your data] into an
+entity-centric index.
+* Consult the documentation to learn more about <<dfa-regression,regression>>
+and <<dfa-classification,classification>>.
+* Learn how to <<ml-dfanalytics-evaluate,evaluate>> regression and
+classification models.
+* Find out how to deploy your model by using <<ml-inference,inference>>.
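
The "Prepare and transform data" section added above says {regression} and {classification} need a two-dimensional, entity-centric source index, and points to {transform} for getting there. A minimal sketch of that step, not part of this commit; the cluster URL, index names, and field names are assumptions:

# Sketch only (not part of this commit): pivot raw documents into an
# entity-centric, two-dimensional data frame with the stock _transform APIs.
# The cluster URL and the index/field names below are assumptions.
import requests

ES = "http://localhost:9200"

transform = {
    "source": {"index": "ecommerce-orders"},    # hypothetical raw index
    "dest": {"index": "customers-dataframe"},   # one row per customer
    "pivot": {
        "group_by": {
            "customer_id": {"terms": {"field": "customer_id"}}
        },
        "aggregations": {
            "order_count": {"value_count": {"field": "order_id"}},
            "avg_order_value": {"avg": {"field": "taxful_total_price"}}
        }
    }
}

requests.put(f"{ES}/_transform/customers-dataframe", json=transform).raise_for_status()
# Materialize the destination index so it can feed a data frame analytics job.
requests.post(f"{ES}/_transform/customers-dataframe/_start").raise_for_status()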
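
The "Train, test, iterate" section describes choosing features, setting a training percent, and evaluating on the held-out split. A compact sketch of those knobs against the data frame analytics and evaluate APIs; the spam-labelled index, field names, and job ID are hypothetical:

# Sketch only (not part of this commit): create, start, and evaluate a
# classification job. Assumes http://localhost:9200 and a hypothetical
# labelled index "emails-labelled" with an "is_spam" ground-truth field.
import requests

ES = "http://localhost:9200"
JOB = "spam-classifier"

config = {
    "source": {"index": "emails-labelled"},
    "dest": {"index": "emails-labelled-results"},
    "analysis": {
        "classification": {
            "dependent_variable": "is_spam",  # the ground-truth label
            "training_percent": 50            # start low on large data sets
        }
    },
    # Optionally exclude fields that carry no signal, e.g. identifiers.
    "analyzed_fields": {"excludes": ["message_id"]}
}

requests.put(f"{ES}/_ml/data_frame/analytics/{JOB}", json=config).raise_for_status()
requests.post(f"{ES}/_ml/data_frame/analytics/{JOB}/_start").raise_for_status()

# Once the job completes, measure generalization on the held-out test split
# (documents the job marked with ml.is_training == false).
evaluation = {
    "index": "emails-labelled-results",
    "query": {"term": {"ml.is_training": {"value": False}}},
    "evaluation": {
        "classification": {
            "actual_field": "is_spam",
            "predicted_field": "ml.is_spam_prediction",
            "metrics": {"multiclass_confusion_matrix": {}}
        }
    }
}
print(requests.post(f"{ES}/_ml/data_frame/_evaluate", json=evaluation).json())

Starting with a low training_percent keeps the first end-to-end iteration fast; raise it once the configuration looks right.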
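
The "Deploy model" section describes {infer} as a processor in an ingest pipeline. A sketch of that wiring under the same assumptions; the model ID is hypothetical (real IDs can be listed with GET _ml/trained_models on recent versions, or GET _ml/inference on older 7.x):

# Sketch only (not part of this commit): attach a trained model to an ingest
# pipeline so new documents are scored on ingest. The model ID is hypothetical.
import requests

ES = "http://localhost:9200"

pipeline = {
    "processors": [{
        "inference": {
            "model_id": "spam-classifier-0001",         # hypothetical model ID
            "inference_config": {"classification": {}}  # use model defaults
        }
    }]
}
requests.put(f"{ES}/_ingest/pipeline/spam-scoring", json=pipeline).raise_for_status()

# Any document indexed through the pipeline gets a prediction under "ml".
doc = {"subject": "You won!", "body": "Claim your prize now"}
requests.post(f"{ES}/emails-incoming/_doc?pipeline=spam-scoring", json=doc).raise_for_status()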

docs/en/stack/ml/df-analytics/ml-dfa-resources.asciidoc

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+[role="xpack"]
+[[ml-dfa-resources]]
+= Resources
+
+This section contains further resources for using {dfanalytics}.
+
+* <<ml-dfa-limitations>>

docs/en/stack/ml/df-analytics/ml-dfanalytics.asciidoc

Lines changed: 1 addition & 1 deletion
@@ -20,6 +20,6 @@ and the security privileges that are required to use {dfanalytics}.
 * <<ml-dfa-concepts>>
 * <<ml-dfanalytics-apis>>
 * <<dfanalytics-examples>>
-* <<ml-dfa-limitations>>
+* <<ml-dfa-resources>>
 
 --

docs/en/stack/ml/df-analytics/ml-dfa-phases.asciidoc renamed to docs/en/stack/ml/df-analytics/ml-how-dfa-works.asciidoc

Lines changed: 11 additions & 9 deletions
@@ -1,9 +1,13 @@
 [role="xpack"]
 [[ml-dfa-phases]]
 = How a {dfanalytics-job} works
+[subs="attributes"]
 ++++
-<titleabbrev>How it works</titleabbrev>
+<titleabbrev>How {dfanalytics-jobs} work</titleabbrev>
 ++++
+:keywords: {ml-init}, {stack}, {dfanalytics}, advanced
+:description: An explanation of how the {dfanalytics-jobs} work. Every job has \
+four or five main phases depending on its analysis type.
 
 
 A {dfanalytics-job} is essentially a persistent {es} task. During its life
@@ -17,6 +21,7 @@ cycle, it goes through four or five main phases depending on the analysis type:
 
 Let's take a look at the phases one by one.
 
+[discrete]
 [[ml-dfa-phases-reindex]]
 == Reindexing
 
@@ -28,13 +33,15 @@ default settings.
 Once the destination index is built, the {dfanalytics-job} task calls the {es}
 {ref}/docs-reindex.html[Reindex API] to launch the reindexing task.
 
+[discrete]
 [[ml-dfa-phases-load]]
 == Loading data
 
 After the reindexing is finished, the job fetches the needed data from the
 destination index. It converts the data into the format that the analysis
 process expects, then sends it to the analysis process.
 
+[discrete]
 [[ml-dfa-phases-analyze]]
 == Analyzing
 
@@ -54,6 +61,7 @@ in which they identify outliers in the data.
 hyperparameters. See <<hyperparameters,hyperparameter optimization>>.
 . `final_training`: Trains the {ml} model.
 
+[discrete]
 [[ml-dfa-phases-write]]
 == Writing results
 
@@ -63,6 +71,7 @@ ones that have been loaded in the loading data phase are not. The
 {dfanalytics-job} matches the results with the data rows in the destination
 index, merges them, and indexes them back to the destination index.
 
+[discrete]
 [[ml-dfa-phases-inference]]
 == {infer-cap}
 
@@ -72,11 +81,4 @@ set.
 
 
 Finally, after all phases are completed, the task is marked as completed and the
-{dfanalytics-job} stops. Your data is ready to be evaluated.
-
-
-Check the <<ml-dfa-concepts>> section if you'd like to know more about the
-various {dfanalytics} types.
-
-
-Check the <<ml-dfanalytics-evaluate>> section if you are interested in the
-evaluation of the {dfanalytics} results.
+{dfanalytics-job} stops. Your data is ready to be evaluated.
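
The phases described in this file are also visible at runtime: the job stats API reports per-phase progress. A small sketch, reusing the hypothetical job ID from the earlier example; exact phase names vary by analysis type and version:

# Sketch only (not part of this commit): the stats API reports progress_percent
# per phase (e.g. reindexing, loading_data, analyzing, writing_results),
# mirroring the phases above. The job ID is the hypothetical one used earlier.
import requests

ES = "http://localhost:9200"

stats = requests.get(f"{ES}/_ml/data_frame/analytics/spam-classifier/_stats").json()
for phase in stats["data_frame_analytics"][0]["progress"]:
    print(f"{phase['phase']}: {phase['progress_percent']}%")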
