Device resilience: self-healing encoder with supervision and liveness #17
Merged
Complete the ADR-0005 transient-SRT-loss path. The reconnect core (srt_reconnect state machine, ceracoder.c integration, test_reconnect, Makefile wiring) landed in the prior amended commit due to a concurrent multi-agent 'git add -A' race; this commit adds the remaining pieces:

- Runtime-configurable reconnect window/backoff via environment (CERACODER_RECONNECT_BASE_MS/_MAX_MS/_MAX_ATTEMPTS; 0 = unlimited), fulfilling the 'max attempts: configurable' requirement WITHOUT touching the INI/config struct, so the stable TypeScript bindings are unchanged.
- Fix the reconnect-start log to report the configured backoff cap (reconnect_ctrl.max_backoff_ms) instead of the default constant.
- Ignore the tests/test_reconnect build artifact (matches sibling binaries).

Verified: make test green (24 tests). Real-binary runtime proof against srt-live-transmit: a transient drop reconnects with observed backoff (same PID); a permanent outage exhausts the bounded window and exits non-zero with no core dump, for systemd Restart=on-failure.
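The environment overrides named in the commit message could be read along these lines. This is a minimal sketch, not the code in ceracoder.c: the `reconnect_cfg` struct, the `env_uint` helper, and the fallback values (500 ms, 30000 ms, 10 attempts) are assumptions made for illustration; only the three variable names come from the commit message. Following the commit message's "0 = unlimited", a max-attempts value of 0 is treated as no limit.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical settings struct; the real srt_reconnect layout is not shown here. */
typedef struct {
    unsigned base_backoff_ms;
    unsigned max_backoff_ms;
    unsigned max_attempts; /* 0 = unlimited */
} reconnect_cfg;

/* Parse an unsigned decimal from the environment. Returns `fallback` when the
 * variable is unset, empty, negative, non-numeric, or out of range. */
static unsigned env_uint(const char *name, unsigned fallback)
{
    const char *s = getenv(name);
    if (s == NULL || *s == '\0')
        return fallback;

    char *end = NULL;
    errno = 0;
    long v = strtol(s, &end, 10);
    if (errno != 0 || *end != '\0' || v < 0 || (unsigned long)v > UINT_MAX)
        return fallback;
    return (unsigned)v;
}

/* Build the reconnect settings. The fallback numbers are placeholders,
 * not the project's actual defaults. */
static reconnect_cfg reconnect_cfg_from_env(void)
{
    reconnect_cfg cfg;
    cfg.base_backoff_ms = env_uint("CERACODER_RECONNECT_BASE_MS", 500);
    cfg.max_backoff_ms  = env_uint("CERACODER_RECONNECT_MAX_MS", 30000);
    cfg.max_attempts    = env_uint("CERACODER_RECONNECT_MAX_ATTEMPTS", 10);

    /* Keep the cap at or above the base so the schedule stays monotonic. */
    if (cfg.max_backoff_ms < cfg.base_backoff_ms)
        cfg.max_backoff_ms = cfg.base_backoff_ms;
    return cfg;
}
```

Reading the values once at startup, outside the INI/config struct, is what lets the existing bindings stay untouched.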
ceracoder used to exit the moment it lost its SRT connection — one network blip and the stream was dead until something restarted the process. This changes that.
The encoder now handles transient SRT loss internally. When the connection drops, it enters a reconnect loop with exponential backoff rather than exiting. If the connection genuinely can't be recovered after exhausting retries, it exits cleanly so systemd can restart it — but short interruptions are absorbed without the operator ever noticing.
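The backoff schedule described above can be sketched as a small helper that returns the next delay or reports that the retry budget is spent. The names `backoff_state`, `backoff_next`, and `backoff_reset` are hypothetical, and the doubling rule is an assumption about what "exponential backoff" means here; the actual srt_reconnect state machine may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical retry state; not the real srt_reconnect structure. */
typedef struct {
    uint32_t base_ms;        /* delay before the first retry */
    uint32_t max_backoff_ms; /* cap on any single delay */
    uint32_t max_attempts;   /* 0 = unlimited */
    uint32_t attempt;        /* retries handed out so far */
} backoff_state;

/* Store the next delay in *delay_ms and return true, or return false once
 * the attempt budget is exhausted (the caller would then exit non-zero). */
static bool backoff_next(backoff_state *st, uint32_t *delay_ms)
{
    if (st->max_attempts != 0 && st->attempt >= st->max_attempts)
        return false;

    /* Double the base once per previous attempt, stopping at the cap.
     * A 64-bit accumulator keeps the doubling from overflowing. */
    uint64_t delay = st->base_ms;
    for (uint32_t i = 0; i < st->attempt && delay < st->max_backoff_ms; i++)
        delay *= 2;
    if (delay > st->max_backoff_ms)
        delay = st->max_backoff_ms;

    if (st->attempt < UINT32_MAX)
        st->attempt++;
    *delay_ms = (uint32_t)delay;
    return true;
}

/* Call after a successful reconnect so the next drop starts from base_ms. */
static void backoff_reset(backoff_state *st)
{
    st->attempt = 0;
}
```

With a base of 100 ms, a cap of 1000 ms, and five attempts, this yields delays of 100, 200, 400, 800, and 1000 ms, after which `backoff_next` returns false and the caller would take the clean-exit path.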
Alongside that, we've added a frame-production liveness signal. Previously, supervision could only tell whether the process was alive — not whether it was actually encoding. Now ceracoder tracks whether frames are advancing through the pipeline, and feeds that signal into the systemd watchdog ping. A process that's alive but silently stalled (what we call a zombie encode) will now fail its watchdog and get restarted automatically.
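One way to gate the watchdog ping on frame progress is a counter compared between checks. The sketch below is an illustration only: the `liveness` type and both functions are hypothetical names, and a real multi-threaded encoder would likely need an atomic counter rather than the plain field shown here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical liveness tracker; not the structure used in ceracoder.c. */
typedef struct {
    uint64_t frames_produced; /* bumped by the encode path for every frame */
    uint64_t last_seen;       /* counter value at the previous check */
} liveness;

/* Called from the encode path each time a frame leaves the pipeline. */
static void liveness_on_frame(liveness *lv)
{
    lv->frames_produced++;
}

/* Returns true if at least one frame was produced since the previous check.
 * The supervision loop would ping systemd only when this is true, e.g.
 *     if (liveness_check(&lv)) sd_notify(0, "WATCHDOG=1");
 * so a stalled pipeline stops pinging and the watchdog timeout fires. */
static bool liveness_check(liveness *lv)
{
    bool advanced = lv->frames_produced != lv->last_seen;
    lv->last_seen = lv->frames_produced;
    return advanced;
}
```

The check interval should be comfortably shorter than the unit's watchdog timeout, so that a single slow interval does not cause a restart.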
The systemd service unit has been hardened to match: it uses sd_notify so systemd knows when the encoder is genuinely ready, a watchdog timeout that kills hung processes, and crash-loop damping so a rapid restart cycle backs off rather than spinning.
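For orientation, the three behaviours map onto standard systemd directives roughly as follows. This fragment is a sketch, not the unit shipped in this PR: the `ExecStart` path and every numeric value are placeholders chosen for illustration.

```ini
[Unit]
Description=ceracoder encoder (illustrative fragment)
# Crash-loop damping: stop restarting after 5 starts within 60 seconds.
StartLimitIntervalSec=60
StartLimitBurst=5

[Service]
# Readiness is signalled by the process via sd_notify("READY=1").
Type=notify
NotifyAccess=main
# Hypothetical install path.
ExecStart=/usr/local/bin/ceracoder
# The process is killed if no WATCHDOG=1 ping arrives within this window.
WatchdogSec=10
# Restart on non-zero exit, signals, and watchdog timeouts.
Restart=on-failure
RestartSec=2
```

`Restart=on-failure` covers both the clean non-zero exit after a failed reconnect window and a watchdog kill, so a single restart policy handles both paths.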
Together these changes mean the encoder is self-healing for the common failure cases — transient network loss, pipeline stalls, and process hangs — without requiring any operator intervention.