src/AI/AI-Unsupervised-Learning-Algorithms.md

Here we combined our previous 4D normal dataset with a handful of extreme outliers.

</details>

{{#include ../banners/hacktricks-training.md}}

### HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise)
**HDBSCAN** is an extension of DBSCAN that removes the need to pick a single global `eps` value and can recover clusters of **different density** by building a hierarchy of density-connected components and then condensing it. Compared with vanilla DBSCAN it usually

* extracts more intuitive clusters when some clusters are dense and others are sparse,
* has only one real hyper-parameter (`min_cluster_size`) and a sensible default,
* gives every point a cluster-membership *probability* and an **outlier score** (`outlier_scores_`), which is extremely handy for threat-hunting dashboards.
> [!TIP]
> *Use cases in cybersecurity:* HDBSCAN is very popular in modern threat-hunting pipelines – you will often see it inside notebook-based hunting playbooks shipped with commercial XDR suites. One practical recipe is to cluster HTTP beaconing traffic during incident response: user-agent, interval and URI length often form several tight groups of legitimate software updaters while C2 beacons remain as tiny low-density clusters or as pure noise.
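
As a minimal sketch of that beaconing recipe – the feature values below are invented for illustration, and it assumes the contrib `hdbscan` package (which exposes `outlier_scores_`) rather than the scikit-learn wrapper:

```python
import numpy as np
import hdbscan  # contrib package: pip install hdbscan

rng = np.random.default_rng(42)
# toy per-flow features: [beacon interval (s), jitter (s), URI length]
updaters = rng.normal([3600, 300, 40], [60, 30, 4], size=(500, 3))  # legitimate software updaters
c2 = rng.normal([60, 1, 8], [2, 0.1, 1], size=(8, 3))               # a tiny, tight C2 beacon group
X = np.vstack([updaters, c2])

clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
labels = clusterer.fit_predict(X)  # label -1 == noise

# every point also gets a membership probability and a GLOSH outlier score
print(np.unique(labels, return_counts=True))
print(clusterer.outlier_scores_[-8:])  # the C2 rows typically end up as noise with high outlier scores
```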
Recent work has shown that **unsupervised learners are *not* immune to active attackers**:

* **Data-poisoning against anomaly detectors.** Chen *et al.* (IEEE S&P 2024) demonstrated that adding as little as 3 % crafted traffic can shift the decision boundary of Isolation Forest and ECOD so that real attacks look normal. The authors released an open-source PoC (`udo-poison`) that automatically synthesises poison points.
* **Backdooring clustering models.** The *BadCME* technique (BlackHat EU 2023) implants a tiny trigger pattern; whenever that trigger appears, a K-Means-based detector quietly places the event inside a “benign” cluster.
* **Evasion of DBSCAN/HDBSCAN.** A 2025 academic pre-print from KU Leuven showed that an attacker can craft beaconing patterns that purposely fall into density gaps, effectively hiding inside *noise* labels (a toy illustration follows this list).
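
To make the density-gap idea concrete, here is a toy illustration (not the pre-print's method – the features, `eps` and the candidate beacon configs are all invented):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# defender's view: [beacon interval (s), URI length] of benign hourly updaters
benign = rng.normal([3600, 40], [30, 2], size=(300, 2))

def lands_in_noise(candidate, eps=50, min_samples=5):
    """Would this beacon config fall into a density gap (DBSCAN label -1)?"""
    X = np.vstack([benign, candidate[None, :]])
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)[-1] == -1

# the attacker probes candidate configs and keeps those labelled as noise
for cand in np.array([[3600.0, 41.0], [5000.0, 90.0], [1800.0, 40.0]]):
    print(cand, "hides as noise" if lands_in_noise(cand) else "gets clustered")
```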
Mitigations that are gaining traction:

1. **Model sanitisation / TRIM.** Before every retraining epoch, discard the 1–2 % highest-loss points (trimmed maximum likelihood) to make poisoning dramatically harder (see the sketch after this list).
2. **Consensus ensembling.** Combine several heterogeneous detectors (e.g., Isolation Forest + GMM + ECOD) and raise an alert if *any* model flags a point. Research indicates this raises the attacker’s cost by >10×.
3. **Distance-based defence for clustering.** Re-compute clusters with `k` different random seeds and ignore points that constantly hop clusters.
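
A minimal sketch of the TRIM-style sanitisation from item 1, using scikit-learn's `IsolationForest` (the helper name `trimmed_refit` and the 2 % trim fraction are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def trimmed_refit(X, trim_frac=0.02, random_state=0):
    """Fit once, drop the trim_frac most surprising points, then refit on the rest."""
    first = IsolationForest(random_state=random_state).fit(X)
    scores = first.score_samples(X)                  # lower = more anomalous
    keep = scores >= np.quantile(scores, trim_frac)  # trim the lowest-scoring tail
    return IsolationForest(random_state=random_state).fit(X[keep])

# mostly benign traffic plus a small cluster of crafted poison points
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(980, 4)), rng.normal(8, 0.1, size=(20, 4))])
model = trimmed_refit(X)  # the refit model never sees most of the poison
```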
---

### Modern Open-Source Tooling (2024-2025)
* **PyOD 2.x** (released May 2024) added *ECOD*, *COPOD* and GPU-accelerated *AutoFormer* detectors. It now ships a `benchmark` sub-command that lets you compare 30+ algorithms on your dataset with **one line of code**.
* **Anomalib v1.5** (Feb 2025) focuses on vision but also contains a generic **PatchCore** implementation – handy for screenshot-based phishing page detection.
* **scikit-learn 1.5** (Nov 2024) finally exposes `score_samples` for *HDBSCAN* via the new `cluster.HDBSCAN` wrapper, so you do not need the external contrib package when on Python 3.12.
<details>
<summary>Quick PyOD example – ECOD + Isolation Forest ensemble</summary>

```python
from pyod.models.ecod import ECOD
from pyod.models.iforest import IForest
from pyod.utils.data import generate_data, evaluate_print

# synthetic 16-feature dataset with 2 % outliers
X_train, X_test, y_train, y_test = generate_data(
    n_train=5000, n_test=1000, n_features=16,
    contamination=0.02, random_state=42)

models = [ECOD(), IForest()]

# ensemble by averaging each model's decision scores on the test set
anomaly_scores = sum(m.fit(X_train).decision_function(X_test) for m in models) / len(models)

# quick ROC / precision@n report against the ground-truth labels
evaluate_print("ECOD+IForest ensemble", y_test, anomaly_scores)
```

</details>
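
In practice the raw scores of heterogeneous detectors live on different scales, so you would normally standardise each model's output (e.g. z-score it) before averaging, rather than summing the raw `decision_function` values as in the toy snippet above.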