
Commit aaddcfd

szabosteve and lcawl authored
Adds page about recovering from a failed job to anomaly detection docs (elastic#1667)
Co-authored-by: Lisa Cawley <[email protected]>
1 parent 6339e77 commit aaddcfd

File tree: 3 files changed, +50 -0 lines changed


docs/en/stack/ml/anomaly-detection/index.asciidoc

Lines changed: 2 additions & 0 deletions
@@ -26,6 +26,8 @@ include::job-tips.asciidoc[leveloffset=+3]
 
 include::stopping-ml.asciidoc[leveloffset=+2]
 
+include::ml-restart-failed-jobs.asciidoc[leveloffset=+2]
+
 include::anomaly-detection-scale.asciidoc[leveloffset=+2]
 
 include::ml-api-quickref.asciidoc[leveloffset=+1]

docs/en/stack/ml/anomaly-detection/ml-configuration.asciidoc

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@ you visualize and explore the results.
 
 * <<create-jobs>>
 * <<stopping-ml>>
+* <<ml-restart-failed-jobs>>
 
 After you learn how to create and stop {anomaly-detect} jobs, you can check the
 <<anomaly-examples>> for more advanced settings and scenarios.
docs/en/stack/ml/anomaly-detection/ml-restart-failed-jobs.asciidoc

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
[role="xpack"]
[[ml-restart-failed-jobs]]
= Restart failed {anomaly-jobs}

If an {anomaly-job} fails, try to restart the job by following the procedure
described below. If the restarted job runs as expected, the problem that caused
the failure was transient and no further investigation is needed. If the job
fails again shortly after the restart, the problem is persistent and needs
further investigation. In this case, find out which node the failed job was
running on by checking the job stats on the **Job management** pane in {kib}.
Then get the logs for that node and look for exceptions and errors whose
messages contain the ID of the {anomaly-job} to better understand the issue.
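
As a minimal sketch, you can also check which node the job was assigned to with
the get {anomaly-job} statistics API; the job ID `my_job` below is only a
placeholder, and the response contains a `node` object while the job is
assigned to a node:

[source,console]
--------------------------------------------------
// Example only: replace my_job with the ID of the failed job.
GET _ml/anomaly_detectors/my_job/_stats
--------------------------------------------------
// TEST[skip]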

If an {anomaly-job} has failed, do the following to recover from the `failed`
state:

. _Force_ stop the corresponding {dfeed} by using the
{ref}/ml-stop-datafeed.html[Stop {dfeed} API] with the `force` parameter set to
`true`. For example, the following request force stops the `my_datafeed`
{dfeed}:
+
--
[source,console]
--------------------------------------------------
POST _ml/datafeeds/my_datafeed/_stop
{
  "force": "true"
}
--------------------------------------------------
// TEST[skip]
--

. _Force_ close the {anomaly-job} by using the
{ref}/ml-close-job.html[Close {anomaly-job} API] with the `force` parameter set
to `true`. For example, the following request force closes the `my_job`
{anomaly-job}:
+
--
[source,console]
--------------------------------------------------
POST _ml/anomaly_detectors/my_job/_close?force=true
--------------------------------------------------
// TEST[skip]
--

. Restart the {anomaly-job} on the **Job management** pane in {kib}.
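
If you prefer to restart the job outside {kib}, a minimal sketch of the
equivalent calls uses the open {anomaly-job} and start {dfeed} APIs; the
`my_job` and `my_datafeed` names are only placeholders:

[source,console]
--------------------------------------------------
// Example only: reopen the job, then start its datafeed again.
POST _ml/anomaly_detectors/my_job/_open

POST _ml/datafeeds/my_datafeed/_start
--------------------------------------------------
// TEST[skip]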
