
Commit aaddcfd

szabosteve and lcawl authored
Adds page about recovering from a failed job to anomaly detection docs (elastic#1667)
Co-authored-by: Lisa Cawley <[email protected]>
1 parent 6339e77 commit aaddcfd

File tree: 3 files changed, +50 -0 lines changed


docs/en/stack/ml/anomaly-detection/index.asciidoc

Lines changed: 2 additions & 0 deletions
@@ -26,6 +26,8 @@ include::job-tips.asciidoc[leveloffset=+3]
 
 include::stopping-ml.asciidoc[leveloffset=+2]
 
+include::ml-restart-failed-jobs.asciidoc[leveloffset=+2]
+
 include::anomaly-detection-scale.asciidoc[leveloffset=+2]
 
 include::ml-api-quickref.asciidoc[leveloffset=+1]

docs/en/stack/ml/anomaly-detection/ml-configuration.asciidoc

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@ you visualize and explore the results.
 
 * <<create-jobs>>
 * <<stopping-ml>>
+* <<ml-restart-failed-jobs>>
 
 After you learn how to create and stop {anomaly-detect} jobs, you can check the
 <<anomaly-examples>> for more advanced settings and scenarios.
docs/en/stack/ml/anomaly-detection/ml-restart-failed-jobs.asciidoc

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
[role="xpack"]
[[ml-restart-failed-jobs]]
= Restart failed {anomaly-jobs}

If an {anomaly-job} fails, try to restart the job by following the procedure
described below. If the restarted job runs as expected, the problem that caused
the failure was transient and no further investigation is needed. If the job
fails again shortly after the restart, the problem is persistent and needs
further investigation. In this case, find out which node the failed job was
running on by checking the job stats on the **Job management** pane in {kib}.
Then get the logs for that node and look for exceptions and errors whose
messages contain the ID of the {anomaly-job} to better understand the issue.
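
As a minimal sketch, you can also check which node the job was assigned to with
the get {anomaly-job} statistics API; the job ID `my_job` below is only a
placeholder, and the response contains a `node` object while the job is
assigned to a node:

[source,console]
--------------------------------------------------
// Example only: replace my_job with the ID of the failed job.
GET _ml/anomaly_detectors/my_job/_stats
--------------------------------------------------
// TEST[skip]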

If an {anomaly-job} has failed, do the following to recover from the `failed`
state:

. _Force_ stop the corresponding {dfeed} by using the
{ref}/ml-stop-datafeed.html[Stop {dfeed} API] with the `force` parameter set to
`true`. For example, the following request force stops the `my_datafeed`
{dfeed}:
+
--
[source,console]
--------------------------------------------------
POST _ml/datafeeds/my_datafeed/_stop
{
  "force": "true"
}
--------------------------------------------------
// TEST[skip]
--

. _Force_ close the {anomaly-job} by using the
{ref}/ml-close-job.html[Close {anomaly-job} API] with the `force` parameter set
to `true`. For example, the following request force closes the `my_job`
{anomaly-job}:
+
--
[source,console]
--------------------------------------------------
POST _ml/anomaly_detectors/my_job/_close?force=true
--------------------------------------------------
// TEST[skip]
--

. Restart the {anomaly-job} on the **Job management** pane in {kib}.
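
If you prefer to restart the job outside {kib}, a minimal sketch of the
equivalent calls uses the open {anomaly-job} and start {dfeed} APIs; the
`my_job` and `my_datafeed` names are only placeholders:

[source,console]
--------------------------------------------------
// Example only: reopen the job, then start its datafeed again.
POST _ml/anomaly_detectors/my_job/_open

POST _ml/datafeeds/my_datafeed/_start
--------------------------------------------------
// TEST[skip]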
