promanomaly

Anomaly detection for Prometheus-compatible metrics, exposed as Prometheus metrics.

promanomaly reads a YAML config, queries a Prometheus-compatible TSDB (VictoriaMetrics by default), runs robust statistical detectors over a rolling window of recent data, and exposes anomaly scores on /metrics. Anomalies become first-class metrics — graphable in Grafana, alertable through Alertmanager, indistinguishable from anything else in your monitoring stack.
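
To make "robust statistical detectors over a rolling window" concrete, here is a minimal sketch of one common robust score, the median/MAD z-score. It is an illustration under assumptions, not promanomaly's actual detector: the function name `robust_z_score` and the sample window are hypothetical.

```python
from statistics import median


def robust_z_score(window, latest):
    """Score `latest` against a rolling window using the median and MAD.

    Hypothetical helper for illustration; not part of promanomaly's API.
    """
    center = median(window)
    # Median absolute deviation: a spread estimate that outliers barely move.
    mad = median(abs(x - center) for x in window)
    if mad == 0:
        # A perfectly flat window: any deviation at all is unbounded.
        return 0.0 if latest == center else float("inf")
    # 1.4826 rescales MAD so it is comparable to a standard deviation
    # when the underlying data is normally distributed.
    return (latest - center) / (1.4826 * mad)


window = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0]
print(robust_z_score(window, 10.0))  # a typical value scores 0.0
print(robust_z_score(window, 25.0))  # a spike scores far from zero
```

Because the median and MAD are barely moved by a single extreme point, one past spike in the window does not inflate the baseline the way a mean and standard deviation would.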

It is the sibling project to promforecast. Where promforecast does forecast-based anomaly detection on signals with learnable seasonality, promanomaly handles everything else: change-points, statistical baselines on signals that aren't reasonably forecastable, and (later) population/cohort divergence. Install one, both, or neither — they share architectural DNA but no runtime dependency.

Why

  • Catches what forecasting misses. Step changes after a deploy, spikes on sparse signals, and series diverging from their cohort — none of which forecast-based deviation alerts pick up reliably.
  • Drop-in. Run helm install next to an existing Prometheus and point it at a long-term TSDB. No production scrape changes required.
  • Speaks Prometheus everywhere. PromQL in, exposition format out. The TSDB is swappable (VictoriaMetrics, Mimir, Thanos).
  • Anomalies are just metrics. Same labels, same retention, same dashboards. Compose with recording rules and alerts you already know.
  • Stateless. No database to operate. Baselines recompute on a schedule; results live in /metrics.
  • Cheap per series. Detectors are O(N log N) on a rolling window with no model fitting — roughly an order of magnitude lighter than a statistical forecast.
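
The "step changes after a deploy" case in the list above can be sketched without any model fitting. The example below compares the median of the most recent points with the median of the earlier baseline, scaled by the baseline's MAD. This is an assumed, simplified stand-in for a change-point detector, not the project's algorithm; `step_change_score` and the `recent` parameter are hypothetical names.

```python
import math
from statistics import median


def step_change_score(series, recent=5):
    """Compare the median of the last `recent` points with the earlier baseline.

    Hypothetical illustration of a level-shift score; not promanomaly's API.
    """
    if len(series) < 2 * recent:
        raise ValueError("need at least 2 * recent points")
    baseline, tail = series[:-recent], series[-recent:]
    base_median = median(baseline)
    # Robust spread of the baseline (median absolute deviation).
    spread = median(abs(x - base_median) for x in baseline)
    shift = median(tail) - base_median
    if spread == 0:
        # Flat baseline: any shift is unbounded, signed by its direction.
        return 0.0 if shift == 0 else math.copysign(math.inf, shift)
    return shift / (1.4826 * spread)


flat = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100]
stepped = [100, 101, 99, 100, 102, 150, 151, 149, 150, 152]
print(step_change_score(flat))     # no level shift
print(step_change_score(stepped))  # large positive score after the step
```

Using medians on both sides means a single spike inside the recent window does not register as a step; only a sustained shift moves the score.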

Quickstart

helm repo add promanomaly https://esops-dev.github.io/promanomaly
helm install promanomaly promanomaly/promanomaly-stack

This installs VictoriaMetrics, the detector, and example dashboards. For an existing TSDB, install only promanomaly/promanomaly.

For a complete day-one starter next to node-exporter, point the detector at examples/configs/node-exporter.yaml — a ready-made multi-group config watching the host golden signals (CPU, memory, filesystem fill + fill-rate, disk I/O, network errors, load) with detectors already matched to each signal's shape. Sibling starters cover kube-state-metrics, cAdvisor, kubelet, and application/middleware signals, and the docs/cookbook/ recipes explain which detector fits which signal. Once running, promanomaly top --target <url> ranks what is anomalous right now by a normalised severity signal — see docs/triage.md.

Editing the YAML config does not require a pod restart: the detector watches the mounted ConfigMap, validates the new config in-place, and rolls back on validation failure (kubectl apply to the ConfigMap is enough). POST /-/reload and SIGHUP trigger the same path. Run promanomaly validate --config path/to/config.yaml in CI to catch schema errors before merging; add --probe --strict to additionally execute every query against the live datasource and fail the build on empty or erroring queries. See docs/operations.md for the full CLI surface and operational guidance.
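
The validate-then-swap behaviour described above can be sketched as follows. This is a hedged illustration of the general pattern (keep the active config unless the candidate validates), not the detector's implementation; `ConfigHolder`, `validate`, and the `groups` key are hypothetical.

```python
class ConfigHolder:
    """Holds the active config and only replaces it with a validated one.

    Hypothetical sketch of a reload-with-rollback pattern.
    """

    def __init__(self, initial, validate):
        validate(initial)  # refuse to start with an invalid config
        self._validate = validate
        self._active = initial

    @property
    def active(self):
        return self._active

    def reload(self, candidate):
        """Return (ok, reason); on failure the previous config stays active."""
        try:
            self._validate(candidate)
        except ValueError as err:
            return False, str(err)
        self._active = candidate
        return True, ""


def validate(cfg):
    # Assumed rule, for illustration only: at least one group must be defined.
    if not cfg.get("groups"):
        raise ValueError("config must define at least one group")


holder = ConfigHolder({"groups": ["cpu"]}, validate)
ok, reason = holder.reload({"groups": []})  # invalid candidate
print(ok, reason)
print(holder.active)  # still the original config
```

The same `reload` entry point could be called from a file watcher, an HTTP handler, or a signal handler, which is one way all three triggers can share a single code path.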

See docs/ for configuration, detector selection, and operational guidance — including docs/patterns.md, the composition cookbook of how anomaly scores plug into the rest of the Prometheus ecosystem (rate-of-change, group rollups, multi-window monitoring, deploy-time silencing, cross-tool joins with promforecast). docs/when-to-use.md walks through the "promforecast, promanomaly, or both?" decision tree, and docs/config-schema.md documents the v1 schema stability commitments. Before going to production, work through the docs/production-checklist.md sizing guide and readiness checklist, and keep the degraded-mode playbook handy for on-call. Running many clusters? The multi-cluster reference architectures spell out the three topologies and their cardinality / network / blast-radius trade-offs.

A self-monitoring example config is bundled at examples/configs/self-monitoring.yaml, and a starter PrometheusRule at examples/alerts/promanomaly-rules.yaml. The full multi-group production reference deployment — values, NetworkPolicy egress, ArgoCD Application, and Flux HelmRepository/HelmRelease — lives at examples/production/.

Feeding anomaly signal to the platform

Anomaly scores are most useful when the rest of your stack can act on them. promanomaly stays the signal provider; the systems that react (autoscalers, rollout controllers, dashboards, long-term storage) stay the decision-makers.

  • Push sinks ship each snapshot beyond /metrics: a Prometheus remote_write sink for long-horizon anomaly history, and a Grafana annotations sink that surfaces firings across every dashboard. See docs/sinks.md.
  • Kubernetes metrics adapter re-serves anomaly signal through the external/custom metrics APIs so HPA and KEDA can consume it. Opt-in via charts/promanomaly-metrics-adapter/; see docs/adapter.md.
  • Reaction recipes — KEDA, HPA, and Argo Rollouts cohort-gated canaries, with the autoscaling guard-rails baked in — live at examples/k8s/.
  • GitOps config validation — a composite GitHub Action validates your config on every PR (schema, cardinality/cost, optional live probe), and promanomaly generate-rules scaffolds a PrometheusRule from it. See docs/gitops.md.

Relationship with promforecast

|          | promforecast                                      | promanomaly                                                            |
|----------|---------------------------------------------------|------------------------------------------------------------------------|
| Question | "What will this look like in N hours?"            | "Is this abnormal right now?"                                          |
| Method   | Fit a model on a long lookback, predict ahead.    | Compute a baseline on a short rolling window, score the present.       |
| Latency  | Minutes-to-hours (next refit).                    | Seconds-to-minutes (next scrape).                                      |
| Best for | Capacity, traffic, KPIs.                          | Bursty, sparse, step-shifting, or fleet-relative signals.              |

Run both for full coverage: forecast-based deviation alerts on capacity metrics, plus change-point and baseline anomaly alerts on error rates, rare events, and per-instance outliers.

Status

The configuration schema is stable as of the first stable release: apiVersion: promanomaly.io/v1 and the values.schema.json $id both move from alpha to v1. The previous promanomaly.io/v1alpha1 alias still loads in byte-identical form, with a deprecation warning. See docs/config-schema.md for the stability commitments, including which changes are breaking and which are not.
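
As a rough illustration of how a loader might accept the deprecated alias while warning about it, consider the sketch below. Only the two apiVersion strings come from this README; the function name `normalise_api_version` and the document shape are assumptions, and the real loader may behave differently.

```python
import warnings

STABLE = "promanomaly.io/v1"
DEPRECATED_ALIAS = "promanomaly.io/v1alpha1"


def normalise_api_version(doc):
    """Accept the stable version, map the alias to it, reject anything else.

    Hypothetical helper; not the project's actual loader.
    """
    version = doc.get("apiVersion")
    if version == STABLE:
        return doc
    if version == DEPRECATED_ALIAS:
        warnings.warn(
            f"{DEPRECATED_ALIAS} is deprecated; use {STABLE}",
            DeprecationWarning,
            stacklevel=2,
        )
        # Everything except apiVersion is carried over unchanged.
        return {**doc, "apiVersion": STABLE}
    raise ValueError(f"unsupported apiVersion: {version!r}")


print(normalise_api_version({"apiVersion": STABLE, "groups": []})["apiVersion"])
```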

Output metric names and label sets are part of the public API and follow Prometheus conventions strictly (no colons — those are reserved for recording rules).
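
The "no colons" rule can be checked mechanically. The classic Prometheus data model allows metric names matching `[a-zA-Z_:][a-zA-Z0-9_:]*`, with colons reserved for recording rules; the sketch below tightens that pattern to exclude colons. The helper name and the sample metric names are hypothetical, not promanomaly's real output names.

```python
import re

# Classic Prometheus metric-name pattern with the colon removed, since
# colons are reserved for user-defined recording rules.
EXPORTER_METRIC_NAME = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_exporter_safe_name(name):
    """Return True if `name` is a valid metric name that contains no colons."""
    return EXPORTER_METRIC_NAME.fullmatch(name) is not None


# Sample names are made up for illustration.
for candidate in ["anomaly_score", "job:anomaly_score:max", "9lives"]:
    print(candidate, is_exporter_safe_name(candidate))
```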

Security

Found a vulnerability? See SECURITY.md. Please do not open public issues for security reports.

Contributing

See CONTRIBUTING.md and the docs/ folder for detector authoring.

License

Apache-2.0. See LICENSE and NOTICE.
