
Commit 649dc48

Adds anomaly detection FAQ items to the Troubleshooting page (elastic#2714) (elastic#2735)
Co-authored-by: István Zoltán Szabó <[email protected]>
1 parent 3e461d8 commit 649dc48

Lines changed: 172 additions & 4 deletions

[role="xpack"]
[[ml-ad-troubleshooting]]
= Troubleshooting {ml} {anomaly-detect} and frequently asked questions
++++
<titleabbrev>Troubleshooting and FAQ</titleabbrev>
++++

Use the information in this section to troubleshoot common problems and find
answers to frequently asked questions.


[discrete]
[[ml-ad-restart-failed-jobs]]
== How to restart failed {anomaly-jobs}

include::ml-restart-failed-jobs.asciidoc[]
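
A common way to do this through the APIs is to force-close the failed job,
re-open it, and restart its datafeed. The following is only a rough sketch
using the official Python client; the endpoint, API key, and the `my-job` and
`datafeed-my-job` names are placeholders, so adapt them to the included
procedure above:

[source,python]
----
from elasticsearch import Elasticsearch

# Placeholder connection details; adjust for your deployment.
es = Elasticsearch("https://localhost:9200", api_key="<api-key>")

# Force-close the failed job, re-open it, then restart its datafeed so that
# analysis resumes.
es.ml.close_job(job_id="my-job", force=True)
es.ml.open_job(job_id="my-job")
es.ml.start_datafeed(datafeed_id="datafeed-my-job")
----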


[discrete]
[[faq-methods]]
== What {ml} methods are used for {anomaly-detect}?

For detailed information, refer to the paper
https://www.ijmlc.org/papers/398-LC018.pdf[Anomaly Detection in Application Performance Monitoring Data]
by Thomas Veasey and Stephen Dodson, as well as our webinars on
https://www.elastic.co/elasticon/conf/2018/sf/the-math-behind-elastic-machine-learning[The Math behind Elastic Machine Learning] and
https://www.elastic.co/elasticon/conf/2017/sf/machine-learning-and-statistical-methods-for-time-series-analysis[Machine Learning and Statistical Methods for Time Series Analysis].

Further papers cited in the C++ code:

* http://arxiv.org/pdf/1109.2378.pdf[Modern hierarchical, agglomerative clustering algorithms]
* https://www.cs.umd.edu/~mount/Projects/KMeans/pami02.pdf[An Efficient k-Means Clustering Algorithm: Analysis and Implementation]
* http://www.stat.columbia.edu/~madigan/PAPERS/techno.pdf[Large-Scale Bayesian Logistic Regression for Text Categorization]
* https://www.cs.cmu.edu/~dpelleg/download/xmeans.pdf[X-means: Extending K-means with Efficient Estimation of the Number of Clusters]


[discrete]
[[faq-features]]
== What are the input features used by the model?

All input features are specified by the user, for example, using
https://www.elastic.co/guide/en/machine-learning/current/ml-functions.html[diverse statistical functions]
like count or mean over the data of interest.
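
For illustration only, a minimal sketch of a job body that specifies such
features is shown below. The bucket span, functions, and field names are
hypothetical values, not a recommended configuration:

[source,python]
----
# Sketch of an anomaly detection job body; all values are illustrative.
job_body = {
    "analysis_config": {
        "bucket_span": "15m",
        "detectors": [
            {"function": "mean", "field_name": "responsetime"},  # mean of a metric field
            {"function": "count"},                                # event rate per bucket
        ],
    },
    "data_description": {"time_field": "@timestamp"},
}
----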


[discrete]
[[faq-data]]
== Does the data used by the model only include customers' data?

Yes. Only the data specified in the {anomaly-job} configuration are used for
detection.


[discrete]
[[faq-output-score]]
== What does the model output score represent? How is it generated and calibrated?

The ensemble model generates a probability value, which is then mapped to an
anomaly severity score between 0 and 100. The lower the probability of the
observed data, the higher the severity score. Refer to
<<ml-ad-explain,this advanced concept doc>> for details. Calibration (also
called normalization) happens on two levels:

. Within the same metric/partition, the scores are renormalized “back in time”
within the window specified by the `renormalization_window_days` parameter.
This is the reason, for example, that both `record_score` and
`initial_record_score` exist (see the sketch after this list).
. Over multiple partitions, scores are renormalized as described in
https://www.elastic.co/blog/changes-to-elastic-machine-learning-anomaly-scoring-in-6-5[this blog post].
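
Both scores can be retrieved with the get records API. A minimal sketch using
the official Python client; the job name, score threshold, and connection
details are placeholders:

[source,python]
----
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="<api-key>")  # placeholders

# Fetch high-scoring records for a hypothetical job. record_score is the
# renormalized score; initial_record_score is the score the record had when
# its bucket was first processed.
resp = es.ml.get_records(job_id="my-job", record_score=75, sort="record_score", desc=True)
for record in resp["records"]:
    print(record["timestamp"], record["initial_record_score"], record["record_score"])
----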


[discrete]
[[faq-model-update]]
== Is the model static or updated periodically?

It is an online model that is updated continuously. Old parts of the model are
pruned out based on the `model_prune_window` parameter (usually 30 days).
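
If you need a different pruning horizon, `model_prune_window` is set in the
job's `analysis_config`. A sketch of the relevant fragment, with illustrative
values only:

[source,python]
----
# Fragment of an anomaly detection job body; values are illustrative.
analysis_config = {
    "bucket_span": "15m",
    "model_prune_window": "30d",  # prune model state older than 30 days
    "detectors": [{"function": "count"}],
}
----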


[discrete]
[[faq-model-performance]]
== Is the performance of the model monitored?

There is a set of benchmarks to monitor the performance of the {anomaly-detect}
algorithms and to ensure no regression occurs as the methods are continuously
developed and refined. They are called "data scenarios" and consist of three
things:

* a dataset (stored as an {es} snapshot),
* a {ml} config ({anomaly-detect}, {dfanalysis}, {transform}, or {infer}),
* an arbitrary set of static assertions (bucket counts, anomaly scores,
accuracy values, and so on).

Performance metrics are collected from every scenario run and persisted in an
Elastic Cloud cluster. This information is then used to track performance over
time and across builds, mainly to detect regressions in both result quality and
compute time.

On the customer side, the situation is different. There is no conventional way
to monitor the model performance because it is unsupervised. Usually,
operationalization of the model output includes one or several of the following
steps:

* Creating alerts for influencers, buckets, or records based on a certain
anomaly score (see the sketch after this list).
* Using the forecasting feature to predict the development of the metric of
interest in the future.
* Using one or a combination of multiple {anomaly-jobs} to identify the
significant anomaly influencers.
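
As an illustrative example of the first point, record results can be queried
directly from the shared results index and fed into whatever alerting mechanism
you use. The index pattern, job name, threshold, and connection details below
are placeholders:

[source,python]
----
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="<api-key>")  # placeholders

# Find record results above a score threshold for a hypothetical job.
resp = es.search(
    index=".ml-anomalies-*",
    query={
        "bool": {
            "filter": [
                {"term": {"job_id": "my-job"}},
                {"term": {"result_type": "record"}},
                {"range": {"record_score": {"gte": 75}}},
            ]
        }
    },
    size=10,
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["timestamp"], hit["_source"]["record_score"])
----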


[discrete]
[[faq-model-accuracy]]
== How to measure the accuracy of the unsupervised {ml} model?

For each record in a given time series, {anomaly-detect} models provide an
anomaly severity score, 95% confidence intervals, and an actual value. This
data is stored in an index and can be retrieved by using the get records API.
With this information, you can use standard measures to assess prediction
accuracy, interval calibration, and so on. {es} aggregations can be used to
compute these statistics.
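
For instance, the per-record `actual` and `typical` values returned by the get
records API can be turned into a simple error statistic. This is only a sketch
under the assumption that the detector produces single-valued fields; the job
name and connection details are placeholders:

[source,python]
----
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="<api-key>")  # placeholders

resp = es.ml.get_records(job_id="my-job", record_score=0)  # no score threshold
errors = []
for record in resp["records"]:
    if "actual" not in record or "typical" not in record:
        continue  # some record types do not carry these fields
    # actual/typical may be returned as single-element arrays.
    actual = record["actual"][0] if isinstance(record["actual"], list) else record["actual"]
    typical = record["typical"][0] if isinstance(record["typical"], list) else record["typical"]
    errors.append(abs(actual - typical))

if errors:
    print("mean absolute error over anomalous records:", sum(errors) / len(errors))
----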

The purpose of {anomaly-detect} is to achieve the best ranking of periods where
an anomaly happened. A practical way to evaluate this is to keep track of real
incidents and see how well they correlate with the predictions of
{anomaly-detect}.


[discrete]
[[faq-model-drift]]
== Can the {anomaly-detect} model experience model drift?

The {es} {anomaly-detect} model continuously learns and adapts to changes in
the time series. These changes can take the form of slow drifts as well as
sudden jumps. Therefore, we take great care to manage the adaptation to
changing data characteristics. There is always a fine trade-off between fitting
anomalous periods (over-fitting) and not learning new normal behavior. The
following are the main approaches Elastic uses to manage this trade-off:

* Learning the optimal decay rate based on measuring the bias in the forecast
and the moments of the error distribution.
* Allowing continuous small drifts in periodic patterns. This is achieved by
continuously minimizing the mean prediction error over the last iteration with
respect to a small bounded time shift.
* If the predictions are significantly wrong over a long period of time, the
algorithm tests whether the time series has undergone a sudden change.
Hypothesis testing is used to test for different types of changes, such as
scaling of values, shifting of values, and large time shifts in periodic
patterns such as daylight saving time.
* Running continuous hypothesis tests on time windows of various lengths to
test for significant evidence of new or changed periodic patterns, and updating
the model if the null hypothesis of unchanged features is rejected.
* Accumulating error statistics on calendar days and continuously testing
whether predictive calendar features need to be added to or removed from the
model.


[discrete]
[[faq-minimum-data]]
== What is the minimum amount of data for an {anomaly-job}?

Elastic {ml} needs a minimum amount of data to be able to build an effective
model for {anomaly-detect}.

* For sampled metrics such as `mean`, `min`, `max`, and `median`, the minimum
amount of data is either eight non-empty bucket spans or two hours, whichever
is greater.
* For all other non-zero/null metrics and count-based quantities, it's four
non-empty bucket spans or two hours, whichever is greater.
* For the `count` and `sum` functions, empty buckets matter and therefore it is
the same as for sampled metrics: eight buckets or two hours.
* For the `rare` function, it's typically around 20 bucket spans. It can be
faster for population models, but it depends on the number of people that
interact per bucket.
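
As a quick illustration of the first rule, the following sketch computes the
minimum history for a hypothetical sampled metric with a 30-minute bucket span:

[source,python]
----
from datetime import timedelta

# Hypothetical example: a sampled metric (for example, mean) with a 30-minute
# bucket span. The minimum is eight non-empty buckets or two hours, whichever
# is greater.
bucket_span = timedelta(minutes=30)
minimum_history = max(8 * bucket_span, timedelta(hours=2))

print(minimum_history)  # 8 * 30 min = 4 hours, which exceeds the 2-hour floor
----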

Rules of thumb:

* more than three weeks for periodic data or a few hundred buckets for
non-periodic data
* at least as much data as you want to forecast


[discrete]
[[faq-data-integrity]]
== Are there any checks or processes to ensure data integrity?

The Elastic {ml} algorithms are designed to work with missing and noisy data
and use denoising and data imputation techniques based on the learned
statistical properties.
