[role="xpack"]
[[ml-ad-troubleshooting]]
= Troubleshooting {ml} {anomaly-detect} and frequently asked questions
++++
<titleabbrev>Troubleshooting and FAQ</titleabbrev>
++++

Use the information in this section to troubleshoot common problems and find
answers for frequently asked questions.


[discrete]
[[ml-ad-restart-failed-jobs]]
== How to restart failed {anomaly-jobs}

include::ml-restart-failed-jobs.asciidoc[]

[discrete]
[[faq-methods]]
== What {ml} methods are used for {anomaly-detect}?

For detailed information, refer to the paper
https://www.ijmlc.org/papers/398-LC018.pdf[Anomaly Detection in Application Performance Monitoring Data]
by Thomas Veasey and Stephen Dodson, as well as our webinars on
https://www.elastic.co/elasticon/conf/2018/sf/the-math-behind-elastic-machine-learning[The Math behind Elastic Machine Learning] and
https://www.elastic.co/elasticon/conf/2017/sf/machine-learning-and-statistical-methods-for-time-series-analysis[Machine Learning and Statistical Methods for Time Series Analysis].

Further papers cited in the C++ code:

* http://arxiv.org/pdf/1109.2378.pdf[Modern hierarchical, agglomerative clustering algorithms]
* https://www.cs.umd.edu/~mount/Projects/KMeans/pami02.pdf[An Efficient k-Means Clustering Algorithm: Analysis and Implementation]
* http://www.stat.columbia.edu/~madigan/PAPERS/techno.pdf[Large-Scale Bayesian Logistic Regression for Text Categorization]
* https://www.cs.cmu.edu/~dpelleg/download/xmeans.pdf[X-means: Extending K-means with Efficient Estimation of the Number of Clusters]


[discrete]
[[faq-features]]
== What are the input features used by the model?

All input features are specified by the user, for example, using
https://www.elastic.co/guide/en/machine-learning/current/ml-functions.html[diverse statistical functions]
like `count` or `mean` over the data of interest.
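
For illustration, the sketch below creates a job whose single input feature is
the mean of a hypothetical `response_time` field. The job name, field names,
and the unsecured local {es} endpoint are assumptions for this example only.

[source,python]
----
# Minimal sketch (illustration only): create an anomaly detection job whose
# single input feature is the mean of a hypothetical "response_time" field.
import requests

job_config = {
    "analysis_config": {
        "bucket_span": "15m",
        "detectors": [{"function": "mean", "field_name": "response_time"}],
    },
    "data_description": {"time_field": "@timestamp"},
}

response = requests.put(
    "http://localhost:9200/_ml/anomaly_detectors/example-response-time",
    json=job_config,
)
print(response.json())
----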


[discrete]
[[faq-data]]
== Does the data used by the model only include customers' data?

Yes. Only the data specified in the {anomaly-job} configuration are used for
detection.


[discrete]
[[faq-output-score]]
== What does the model output score represent? How is it generated and calibrated?

The ensemble model generates a probability value, which is then mapped to an
anomaly severity score between 0 and 100. The lower the probability of the
observed data, the higher the severity score. Refer to this
<<ml-ad-explain,advanced concept doc>> for details. Calibration (also called
normalization) happens on two levels:

. Within the same metric/partition, the scores are renormalized "back in time"
within the window specified by the `renormalization_window_days` parameter.
This is the reason, for example, that both `record_score` and
`initial_record_score` exist.
. Over multiple partitions, scores are renormalized as described in
https://www.elastic.co/blog/changes-to-elastic-machine-learning-anomaly-scoring-in-6-5[this blog post].
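
One way to see the effect of this renormalization is to compare `record_score`
with `initial_record_score` for the same records. The sketch below is an
illustration only; the job name `example-job` and the unsecured local {es}
endpoint are assumptions.

[source,python]
----
# Illustration only: compare initial and renormalized record scores by
# calling the get records API for a hypothetical job.
import requests

resp = requests.get(
    "http://localhost:9200/_ml/anomaly_detectors/example-job/results/records",
    json={"sort": "record_score", "desc": True, "page": {"size": 10}},
)
for record in resp.json().get("records", []):
    print(
        record["timestamp"],
        "initial:", record["initial_record_score"],
        "renormalized:", record["record_score"],
    )
----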


[discrete]
[[faq-model-update]]
== Is the model static or updated periodically?

It's an online model and is updated continuously. Old parts of the model are
pruned out based on the `model_prune_window` parameter (usually 30 days).


[discrete]
[[faq-model-performance]]
== Is the performance of the model monitored?

There is a set of benchmarks to monitor the performance of the {anomaly-detect}
algorithms and to ensure no regression occurs as the methods are continuously
developed and refined. They are called "data scenarios" and consist of three
things:

* a dataset (stored as an {es} snapshot),
* a {ml} config ({anomaly-detect}, {dfanalysis}, {transform}, or {infer}),
* an arbitrary set of static assertions (bucket counts, anomaly scores,
accuracy values, and so on).

Performance metrics are collected from every scenario run and persisted in an
Elastic Cloud cluster. This information is then used to track performance over
time and across builds, mainly to detect any regressions in performance (both
result quality and compute time).

On the customer side, the situation is different. There is no conventional way
to monitor the model performance because it's unsupervised. Usually,
operationalizing the model output includes one or several of the following
steps:

* Creating alerts for influencers, buckets, or records based on a certain
anomaly score, as in the sketch after this list.
* Using the forecasting feature to predict the development of the metric of
interest in the future.
* Using one or a combination of multiple {anomaly-jobs} to identify the
significant anomaly influencers.
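
As an illustration of the first step, a simple external check could
periodically query the bucket results for scores above a chosen threshold. The
job name, threshold value, and unsecured local {es} endpoint in the sketch
below are assumptions only.

[source,python]
----
# Illustration only: list buckets whose anomaly_score is at or above a
# chosen threshold, using the get buckets API for a hypothetical job.
import requests

THRESHOLD = 75

resp = requests.get(
    "http://localhost:9200/_ml/anomaly_detectors/example-job/results/buckets",
    json={"anomaly_score": THRESHOLD, "sort": "timestamp", "desc": True},
)
for bucket in resp.json().get("buckets", []):
    print(f"ALERT: bucket {bucket['timestamp']} scored {bucket['anomaly_score']:.1f}")
----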


[discrete]
[[faq-model-accuracy]]
== How to measure the accuracy of the unsupervised {ml} model?

For each record in a given time series, {anomaly-detect} models provide an
anomaly severity score, 95% confidence intervals, and an actual value. This data
is stored in an index and can be retrieved using the get records API. With this
information, you can use standard measures to assess prediction accuracy,
interval calibration, and so on. {es} aggregations can be used to compute these
statistics.
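
The sketch below illustrates one such measure: the mean absolute difference
between each record's actual value and the model's typical value, computed from
the get records API. The job name and unsecured local {es} endpoint are
assumptions only.

[source,python]
----
# Illustration only: a rough accuracy measure comparing actual values with
# the model's typical values for a hypothetical job's records.
import requests

resp = requests.get(
    "http://localhost:9200/_ml/anomaly_detectors/example-job/results/records",
    json={"sort": "timestamp", "page": {"size": 1000}},
)
records = resp.json().get("records", [])

def first(value):
    # "actual" and "typical" are single-element arrays for most functions.
    return value[0] if isinstance(value, list) else value

errors = [
    abs(first(r["actual"]) - first(r["typical"]))
    for r in records
    if "actual" in r and "typical" in r
]
if errors:
    print("records:", len(errors), "mean absolute error:", sum(errors) / len(errors))
----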

The purpose of {anomaly-detect} is to achieve the best ranking of periods where
an anomaly happened. A practical way to evaluate this is to keep track of real
incidents and see how well they correlate with the predictions of
{anomaly-detect}.


[discrete]
[[faq-model-drift]]
== Can the {anomaly-detect} model experience model drift?

The Elastic {anomaly-detect} model continuously learns and adapts to changes
in the time series. These changes can take the form of slow drifts as well as
sudden jumps. Therefore, we take great care to manage the adaptation to changing
data characteristics. There is always a fine trade-off between fitting anomalous
periods (over-fitting) and not learning new normal behavior. The following are
the main approaches Elastic uses to manage this trade-off:

* Learning the optimal decay rate based on measuring the bias in the forecast
and the moments of the error distribution.
* Allowing continuous small drifts in periodic patterns. This is achieved by
continuously minimizing the mean prediction error over the last iteration with
respect to a small bounded time shift.
* If the predictions are significantly wrong over a long period of time, testing
whether the time series has undergone a sudden change. Hypothesis testing is
used to check for different types of changes, such as scaling of values,
shifting of values, and large time shifts in periodic patterns such as daylight
saving time. (A toy illustration of this idea appears after this list.)
* Running continuous hypothesis tests on time windows of various lengths to test
for significant evidence of new or changed periodic patterns, and updating the
model if the null hypothesis of unchanged features is rejected.
* Accumulating error statistics on calendar days and continuously testing
whether predictive calendar features need to be added to or removed from the
model.
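
The toy example below is not Elastic's implementation; it only illustrates the
general idea of a hypothesis test for one kind of change, a shift in level,
using Welch's t-test between an older and a newer window of values. The values
and the significance threshold are made up, and the example assumes SciPy is
available.

[source,python]
----
# Toy illustration (not Elastic's implementation): test whether a series
# has undergone a level shift by comparing two windows of values.
from scipy.stats import ttest_ind

old_window = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7]
new_window = [12.2, 12.5, 11.9, 12.4, 12.1, 12.3, 12.0, 12.6]

statistic, p_value = ttest_ind(new_window, old_window, equal_var=False)
if p_value < 0.01:
    print(f"evidence of a level shift (p = {p_value:.4f}); adapt the model")
else:
    print(f"no significant shift detected (p = {p_value:.4f})")
----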


[discrete]
[[faq-minimum-data]]
== What is the minimum amount of data for an {anomaly-job}?

Elastic {ml} needs a minimum amount of data to be able to build an effective
model for {anomaly-detect}.

* For sampled metrics such as `mean`, `min`, `max`, and `median`, the minimum
data amount is either eight non-empty bucket spans or two hours, whichever is
greater.
* For all other non-zero/null metrics and count-based quantities, it's four
non-empty bucket spans or two hours, whichever is greater.
* For the `count` and `sum` functions, empty buckets matter and therefore the
requirement is the same as for sampled metrics: eight bucket spans or two hours.
* For the `rare` function, it's typically around 20 bucket spans. It can be
faster for population models, but it depends on the number of people that
interact per bucket.

Rules of thumb:

* more than three weeks for periodic data or a few hundred buckets for
non-periodic data
* at least as much data as you want to forecast
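
The helper below restates these thresholds as a sketch; it adds nothing beyond
the rules listed above and expresses everything in minutes for a given bucket
span.

[source,python]
----
# Sketch restating the minimum-data rules above (all in minutes):
# sampled metrics and count/sum: 8 non-empty buckets or 2 hours,
# other metrics: 4 non-empty buckets or 2 hours (whichever is greater),
# rare: roughly 20 bucket spans.
def minimum_history_minutes(function: str, bucket_span_minutes: int) -> int:
    two_hours = 120
    if function in {"mean", "min", "max", "median", "count", "sum"}:
        return max(8 * bucket_span_minutes, two_hours)
    if function == "rare":
        return 20 * bucket_span_minutes  # rough guideline only
    return max(4 * bucket_span_minutes, two_hours)

# Example: a mean detector with a 15-minute bucket span needs at least
# two hours of data (8 * 15 = 120 minutes).
print(minimum_history_minutes("mean", 15))
----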


[discrete]
[[faq-data-integrity]]
== Are there any checks or processes to ensure data integrity?

The Elastic {ml} algorithms are programmed to work with missing and noisy data
and use denoising and data imputation techniques based on the learned
statistical properties.