
metrics added for vector reload being stuck #105

Open
akshayakumar-t wants to merge 4 commits into master from OBE-10020_metrics_for_vector_reload_stuck

@akshayakumar-t akshayakumar-t commented May 27, 2026

Summary

Adds three internal gauge metrics to detect when a Vector pod has not picked up a configuration change, either because the reload is stuck (the deadlock scenario described below) or because the new config was invalid and the reload failed.

Context: In our multi-pod deployment, the config file is updated by a sidecar and Vector watches it for changes. We observed that some pods occasionally fail to complete a reload after a config change. Investigation traced this to a known upstream deadlock (vectordotdev/vector#24125): when a sink has queued undeliverable events at reload time, a Pause sent to the upstream fanout creates a circular dependency that prevents the reload from ever completing. These metrics make that condition alertable without requiring a fix to the reload logic itself.

Changes

src/internal_events/process.rs

  • Adds VectorReloadStarted internal event, emitted at the start of every reload attempt
  • Adds three gauge metrics, updated from the new event and from the existing lifecycle events (VectorStarted, VectorReloaded, VectorReloadError, VectorConfigLoadError)

src/topology/controller.rs

  • Emits VectorReloadStarted at the top of TopologyController::reload()
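
As a rough illustration of how the lifecycle events could drive the three gauges, here is a minimal, standalone sketch. It is not the code in this PR: `ReloadGauges`, `LifecycleEvent`, and `apply` are hypothetical names, the gauges are plain fields rather than real internal metrics, and the sketch assumes that `reload_in_progress` is cleared by all three completion events (the description only says "success or failure").

```rust
/// Lifecycle events named in the PR description.
/// `VectorReloadStarted` is the newly added one.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LifecycleEvent {
    VectorStarted,
    VectorReloadStarted,
    VectorReloaded,
    VectorReloadError,
    VectorConfigLoadError,
}

/// Hypothetical in-memory stand-in for the three gauges.
/// All values start at 0.0 via `Default`.
#[derive(Debug, Clone, Copy, PartialEq, Default)]
pub struct ReloadGauges {
    pub config_load_succeeded: f64,
    pub reload_in_progress: f64,
    pub last_reload_started_timestamp_seconds: f64,
}

impl ReloadGauges {
    /// Applies one event. The current time is passed in explicitly so the
    /// sketch stays deterministic and easy to test.
    pub fn apply(&mut self, event: LifecycleEvent, now_unix_secs: f64) {
        match event {
            LifecycleEvent::VectorStarted => {
                self.config_load_succeeded = 1.0;
            }
            LifecycleEvent::VectorReloadStarted => {
                self.reload_in_progress = 1.0;
                self.last_reload_started_timestamp_seconds = now_unix_secs;
            }
            LifecycleEvent::VectorReloaded => {
                self.config_load_succeeded = 1.0;
                // Assumption: a successful reload clears the in-progress flag.
                self.reload_in_progress = 0.0;
            }
            LifecycleEvent::VectorReloadError | LifecycleEvent::VectorConfigLoadError => {
                self.config_load_succeeded = 0.0;
                // Assumption: both failure events clear the in-progress flag.
                self.reload_in_progress = 0.0;
            }
        }
    }
}
```

In the deadlock case none of the completion events is ever emitted after `VectorReloadStarted`, so in this model `reload_in_progress` stays at 1 while the timestamp keeps aging.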

Metrics Reference

All three metrics are exposed via the internal_metrics source and scraped as vector_<name> in Prometheus.

vector_config_load_succeeded (gauge: 0 or 1)

Tracks whether the currently running configuration is valid.

| Value | Meaning |
|-------|---------|
| 1 | Pod started successfully or last reload succeeded |
| 0 | Last reload failed (bad topology or unparseable config file) |

Set to 1 on VectorStarted and VectorReloaded. Set to 0 on VectorReloadError and VectorConfigLoadError.

Alert — pod is running a bad config:

vector_config_load_succeeded == 0

vector_reload_in_progress (gauge: 0 or 1)

Tracks whether a reload is currently in flight. Set to 1 when a reload begins and back to 0 when it completes (success or failure). If it stays at 1, the reload is stuck.

| Value | Meaning |
|-------|---------|
| 1 | A reload was triggered and has not yet completed |
| 0 | No reload in progress |

Alert — reload is stuck (tune threshold to exceed your normal healthy reload time):

vector_reload_in_progress == 1
  and
(time() - vector_last_reload_started_timestamp_seconds) > 120
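
To make the alert condition concrete, the same predicate can be written as a small pure function. This is only an illustrative sketch: `reload_is_stuck` and its parameter names are hypothetical, and in practice Prometheus evaluates the expression above, not this code.

```rust
/// Mirrors the alert expression:
///   vector_reload_in_progress == 1
///   and (time() - vector_last_reload_started_timestamp_seconds) > threshold
/// All timestamps are Unix seconds.
pub fn reload_is_stuck(
    reload_in_progress: f64,
    last_reload_started_secs: f64,
    now_secs: f64,
    threshold_secs: f64,
) -> bool {
    // Gauges are set to exactly 0.0 or 1.0, so an exact comparison is fine here.
    let in_flight = reload_in_progress == 1.0;
    let elapsed = now_secs - last_reload_started_secs;
    in_flight && elapsed > threshold_secs
}
```

With the 120-second threshold, a reload that started at t = 1000 and is still in progress is reported as stuck from t = 1121 onward. A pod whose flag is 0 is never reported, however old its timestamp is.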

vector_last_reload_started_timestamp_seconds (gauge: Unix timestamp)

Records the Unix timestamp (seconds) of the most recent reload attempt. Used alongside vector_reload_in_progress to compute how long a reload has been in progress.

Alert — pods diverged (some reloaded, some did not):

stddev by (job) (vector_last_reload_started_timestamp_seconds) > 60

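This fires when the standard deviation of reload-start timestamps across pods in the same job exceeds 60 seconds, indicating some pods received the config change signal and others did not. Note that stddev is a population standard deviation, not a max-minus-min spread: with two pods, the timestamps must be more than 120 seconds apart before the alert fires.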
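
For intuition about what the stddev threshold means, here is a small worked sketch of the same computation. It assumes the population standard deviation that PromQL's `stddev` aggregation computes. The helper names `population_stddev` and `pods_diverged` are hypothetical, and the timestamps are small made-up values rather than real scrape data.

```rust
/// Population standard deviation (divides by n, not n - 1).
/// Returns `None` for an empty slice.
pub fn population_stddev(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    let variance = values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    Some(variance.sqrt())
}

/// Mirrors `stddev by (job) (...) > threshold` for the pods of one job.
pub fn pods_diverged(reload_started_secs: &[f64], threshold_secs: f64) -> bool {
    population_stddev(reload_started_secs).map_or(false, |sd| sd > threshold_secs)
}
```

Two pods whose reloads started 100 seconds apart have a standard deviation of 50, so `pods_diverged(&[1000.0, 1100.0], 60.0)` is false. At 130 seconds apart the standard deviation is 65 and the check is true. With three pods where one lags by 150 seconds (`[1000.0, 1000.0, 1150.0]`), the standard deviation is about 70.7, which also exceeds the threshold.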


akshayakumar-t and others added 4 commits May 27, 2026 10:22
…of custom helper

Co-Authored-By: Akshaya's Agent <akshaya.kumar+agent@sentinelone.com>
Comment thread src/internal_events/process.rs

@janmejay-s1 janmejay-s1 left a comment


Already reviewed, rubber-stamping.

3 participants