metrics added for vector reload being stuck by akshayakumar-t · Pull Request #105 · Sentinel-One/vector

akshayakumar-t · 2026-05-27T04:52:50Z

Summary

Adds three internal gauge metrics to detect when a Vector pod has not picked up a configuration change, either because the reload is stuck (the deadlock scenario described below) or because the new config was invalid and the reload failed.

Context: In our multi-pod deployment, the config file is updated by a sidecar and Vector watches it for changes. We observed that some pods occasionally fail to complete a reload after a config change. Investigation traced this to a known upstream deadlock (vectordotdev/vector#24125): when a sink has queued undeliverable events at reload time, a Pause sent to the upstream fanout creates a circular dependency that prevents the reload from ever completing. These metrics make that condition alertable without requiring a fix to the reload logic itself.

Changes

src/internal_events/process.rs

Adds VectorReloadStarted internal event, emitted at the start of every reload attempt
Adds three gauge metrics across the existing lifecycle events (VectorStarted, VectorReloaded, VectorReloadError, VectorConfigLoadError)

src/topology/controller.rs

Emits VectorReloadStarted at the top of TopologyController::reload()

Metrics Reference

All three metrics are exposed via the internal_metrics source and scraped as vector_<name> in Prometheus.

`vector_config_load_succeeded` (gauge: 0 or 1)

Tracks whether the currently running configuration is valid.

Value	Meaning
`1`	Pod started successfully or last reload succeeded
`0`	Last reload failed (bad topology or unparseable config file)

Set to 1 on VectorStarted and VectorReloaded. Set to 0 on VectorReloadError and VectorConfigLoadError.

Alert — pod is running a bad config:

vector_config_load_succeeded == 0

`vector_reload_in_progress` (gauge: 0 or 1)

Tracks whether a reload is currently in flight. Set to 1 when a reload begins and back to 0 when it completes (success or failure). If it stays at 1, the reload is stuck.

Value	Meaning
`1`	A reload was triggered and has not yet completed
`0`	No reload in progress

Alert — reload is stuck (tune threshold to exceed your normal healthy reload time):

vector_reload_in_progress == 1
  and
(time() - vector_last_reload_started_timestamp_seconds) > 120
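The same condition can be expressed as a plain predicate, which may help when reasoning about the threshold or unit-testing alert logic outside Prometheus. This is a minimal sketch: `ReloadSample` and `reload_is_stuck` are hypothetical names for illustration and are not part of this PR.

```rust
/// Hypothetical snapshot of the two gauges as scraped from one pod.
/// Not a type from this PR; it only exists for this illustration.
#[derive(Debug, Clone, Copy)]
pub struct ReloadSample {
    /// Value of `vector_reload_in_progress` (0.0 or 1.0).
    pub reload_in_progress: f64,
    /// Value of `vector_last_reload_started_timestamp_seconds`.
    pub last_reload_started_ts: f64,
}

/// Mirrors the alert expression: a reload is in flight AND it started
/// more than `threshold_secs` before `now_secs`.
pub fn reload_is_stuck(sample: ReloadSample, now_secs: f64, threshold_secs: f64) -> bool {
    sample.reload_in_progress == 1.0
        && (now_secs - sample.last_reload_started_ts) > threshold_secs
}
```

The comparison is strict, as in the PromQL above, so a reload that has been running for exactly the threshold does not yet count as stuck.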

`vector_last_reload_started_timestamp_seconds` (gauge: Unix timestamp)

Records the Unix timestamp (seconds) of the most recent reload attempt. Used alongside vector_reload_in_progress to compute how long a reload has been in progress.
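Taken together, the three gauges behave like a small state machine driven by the lifecycle events. The sketch below models those transitions in memory so they can be checked in isolation. It is not the code in `src/internal_events/process.rs`: `Lifecycle`, `ReloadGauges`, and `apply` are hypothetical names, and it assumes that `VectorReloadStarted` is what sets the timestamp and that both error events reset `reload_in_progress` to 0.

```rust
/// Lifecycle events named in this PR, modeled as a plain enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Lifecycle {
    VectorStarted,
    VectorReloadStarted,
    VectorReloaded,
    VectorReloadError,
    VectorConfigLoadError,
}

/// Hypothetical in-memory stand-in for the three gauges.
#[derive(Debug, Clone, Copy, PartialEq, Default)]
pub struct ReloadGauges {
    pub config_load_succeeded: f64,
    pub reload_in_progress: f64,
    pub last_reload_started_timestamp_seconds: f64,
}

impl ReloadGauges {
    /// Applies one lifecycle event; `now_secs` is the current Unix time.
    pub fn apply(&mut self, event: Lifecycle, now_secs: f64) {
        match event {
            Lifecycle::VectorStarted => {
                self.config_load_succeeded = 1.0;
            }
            Lifecycle::VectorReloadStarted => {
                // Assumption: the start event sets both gauges.
                self.reload_in_progress = 1.0;
                self.last_reload_started_timestamp_seconds = now_secs;
            }
            Lifecycle::VectorReloaded => {
                self.config_load_succeeded = 1.0;
                self.reload_in_progress = 0.0;
            }
            // Assumption: either failure marks the reload as finished.
            Lifecycle::VectorReloadError | Lifecycle::VectorConfigLoadError => {
                self.config_load_succeeded = 0.0;
                self.reload_in_progress = 0.0;
            }
        }
    }
}
```

In this model, a deadlocked reload is the case where `VectorReloadStarted` is applied and no completion event ever follows, so `reload_in_progress` stays at 1 while the timestamp stops advancing.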

Alert — pods diverged (some reloaded, some did not):

stddev by (job) (vector_last_reload_started_timestamp_seconds) > 60

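This fires when the standard deviation of reload start timestamps across pods in the same job exceeds 60 seconds, which indicates that some pods received the config change signal and others did not. Because the threshold applies to the standard deviation rather than to the gap between any two pods, the spread needed to trigger it depends on how many pods diverged.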
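PromQL's `stddev` aggregation computes a population standard deviation, so the threshold does not translate one-to-one into a gap in seconds. The sketch below reproduces that calculation to make it easier to pick a threshold for a given pod count; `population_stddev` is a hypothetical helper written for this illustration and is not part of the PR.

```rust
/// Population standard deviation of a set of samples (divides by n, not n - 1).
/// Returns `None` for an empty slice.
pub fn population_stddev(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    let variance = values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    Some(variance.sqrt())
}
```

For two pods the result is half the gap between their timestamps: reload starts at 1000 and 1100 give a standard deviation of 50, which stays below the threshold, while 1000 and 1200 give 100, which exceeds it. With only two pods, the alert therefore needs a gap of more than 120 seconds to fire.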

janmejay-s1

Already reviewed, rubber-stamping.

akshayakumar-t and others added 4 commits May 27, 2026 10:22

metrics added for vector reload being stuck

393cf2a

refactor: use chrono::Utc::now() for reload timestamp metric instead …

b29d19c

…of custom helper Co-Authored-By: Akshaya's Agent <akshaya.kumar+agent@sentinelone.com>

refactor: rename valid_config metric to config_load_succeeded

d4d3503

Co-Authored-By: Akshaya's Agent <akshaya.kumar+agent@sentinelone.com>

refactor: extract metric name string literals into constants

faba7c3

Co-Authored-By: Akshaya's Agent <akshaya.kumar+agent@sentinelone.com>

janmejay approved these changes May 27, 2026

Comment thread src/internal_events/process.rs

janmejay-s1 approved these changes May 27, 2026

akshayakumar-t wants to merge 4 commits into master from OBE-10020_metrics_for_vector_reload_stuck

3 participants
