Device resilience: self-healing encoder with supervision and liveness #17

Merged: andrescera merged 5 commits into main from feat/device-stable-core on Jun 6, 2026

@andrescera

ceracoder used to exit the moment it lost its SRT connection — one network blip and the stream was dead until something restarted the process. This changes that.

The encoder now handles transient SRT loss internally. When the connection drops, it enters a reconnect loop with exponential backoff rather than exiting. If the connection genuinely can't be recovered after exhausting retries, it exits cleanly so systemd can restart it — but short interruptions are absorbed without the operator ever noticing.
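As a rough illustration of the behaviour described above, here is a minimal sketch of a bounded reconnect loop with exponential backoff. Every name in it (`backoff_t`, `backoff_next_delay_ms`, `reconnect_loop`, and the `try_connect`/`sleep_ms` callbacks) is hypothetical and chosen for illustration; it is not the actual `srt_reconnect` state machine from this PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types and names; not ceracoder's real reconnect API. */
typedef struct {
    uint32_t base_ms;       /* delay before the first retry */
    uint32_t max_ms;        /* cap on any single delay */
    uint32_t max_attempts;  /* 0 = retry forever */
    uint32_t attempt;       /* retries made since the connection dropped */
} backoff_t;

/* Delay before the next retry: base_ms doubled once per past attempt,
 * clamped to max_ms. The 64-bit accumulator avoids overflow. */
static uint32_t backoff_next_delay_ms(const backoff_t *b)
{
    assert(b != NULL);
    uint64_t delay = b->base_ms;
    for (uint32_t i = 0; i < b->attempt && i < 32 && delay < b->max_ms; i++)
        delay *= 2;
    return delay > b->max_ms ? b->max_ms : (uint32_t)delay;
}

/* True once the bounded retry window has been used up. */
static bool backoff_exhausted(const backoff_t *b)
{
    return b->max_attempts != 0 && b->attempt >= b->max_attempts;
}

/* Retry until the link returns or the window is exhausted.
 * Returns 0 on reconnect, 1 if the caller should exit non-zero
 * and leave the restart to the supervisor. */
static int reconnect_loop(backoff_t *b,
                          bool (*try_connect)(void *ctx), void *ctx,
                          void (*sleep_ms)(uint32_t ms))
{
    b->attempt = 0;
    while (!backoff_exhausted(b)) {
        sleep_ms(backoff_next_delay_ms(b));
        b->attempt++;
        if (try_connect(ctx))
            return 0;
    }
    return 1;
}
```

With `base_ms = 500`, `max_ms = 4000` and `max_attempts = 3`, this sketch would wait 500 ms, 1000 ms and 2000 ms before giving up. Passing the connect and sleep steps as callbacks keeps the loop testable without a real SRT socket.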

Alongside that, we've added a frame-production liveness signal. Previously, supervision could only tell whether the process was alive — not whether it was actually encoding. Now ceracoder tracks whether frames are advancing through the pipeline, and feeds that signal into the systemd watchdog ping. A process that's alive but silently stalled (what we call a zombie encode) will now fail its watchdog and get restarted automatically.
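One way to picture the liveness gate is a frame counter that is sampled on a timer, with the watchdog ping sent only if the counter moved. The sketch below assumes that design; `liveness_t` and its functions are hypothetical names, and the real ceracoder implementation may track progress differently. In a real build the ping itself would be `sd_notify(0, "WATCHDOG=1")` from `<systemd/sd-daemon.h>`; it is left as a comment here so the sketch compiles without libsystemd.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical liveness tracker; names are illustrative only. */
typedef struct {
    uint64_t frames_produced;       /* bumped by the encode path per frame */
    uint64_t frames_at_last_check;  /* snapshot taken by the watchdog timer */
} liveness_t;

/* Called by the pipeline each time a frame is emitted.
 * A multi-threaded encoder would want an atomic counter here. */
static void liveness_on_frame(liveness_t *l)
{
    assert(l != NULL);
    l->frames_produced++;
}

/* Called from a periodic timer whose interval is shorter than WatchdogSec.
 * Returns true only if at least one frame advanced since the last call. */
static bool liveness_should_ping(liveness_t *l)
{
    assert(l != NULL);
    bool advanced = l->frames_produced != l->frames_at_last_check;
    l->frames_at_last_check = l->frames_produced;
    /* Caller: if (advanced) sd_notify(0, "WATCHDOG=1");
     * If not, the ping is withheld and systemd's watchdog timeout fires. */
    return advanced;
}
```

The point of gating the ping is that a process which is alive but producing no frames stops looking healthy to systemd, so the zombie-encode case ends in an automatic restart.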

The systemd service unit has been hardened to match: it uses sd_notify so systemd knows when the encoder is genuinely ready, a watchdog timeout that kills hung processes, and crash-loop damping so a rapid restart cycle backs off rather than spinning.
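For readers unfamiliar with these systemd features, a unit fragment along the following lines would express the three behaviours. This is an illustrative sketch, not the unit shipped in this PR: the binary path and every numeric value are placeholder assumptions, and `RestartSteps=`/`RestartMaxDelaySec=` require systemd 254 or newer.

```ini
[Unit]
Description=ceracoder encoder (illustrative fragment)
# Stop restarting if the service fails too often in a short window.
StartLimitIntervalSec=300
StartLimitBurst=10

[Service]
# Hypothetical install path.
ExecStart=/usr/local/bin/ceracoder
# Unit counts as started only after the process sends READY=1 via sd_notify.
Type=notify
NotifyAccess=main
# Kill and restart the process if no WATCHDOG=1 ping arrives in time.
WatchdogSec=10
Restart=on-failure
# Crash-loop damping: restart delay grows from RestartSec to RestartMaxDelaySec.
RestartSec=1
RestartSteps=5
RestartMaxDelaySec=60
```

The watchdog ping interval in the encoder should be comfortably shorter than `WatchdogSec=`; systemd's own guidance is to ping at roughly half the configured timeout.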

Together these changes mean the encoder is self-healing for the common failure cases — transient network loss, pipeline stalls, and process hangs — without requiring any operator intervention.

Complete the ADR-0005 transient-SRT-loss path. The reconnect core
(srt_reconnect state machine, ceracoder.c integration, test_reconnect,
Makefile wiring) landed in the prior amended commit due to a concurrent
multi-agent 'git add -A' race; this commit adds the remaining pieces:

- Runtime-configurable reconnect window/backoff via environment
  (CERACODER_RECONNECT_BASE_MS/_MAX_MS/_MAX_ATTEMPTS; 0 = unlimited),
  fulfilling the 'max attempts: configurable' requirement WITHOUT touching
  the INI/config struct, so the stable TypeScript bindings are unchanged.
- Fix the reconnect-start log to report the configured backoff cap
  (reconnect_ctrl.max_backoff_ms) instead of the default constant.
- Ignore the tests/test_reconnect build artifact (matches sibling binaries).
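The environment-based configuration in the first bullet could be read roughly as follows. The three variable names come from this commit message; the helper names (`env_u32`, `reconnect_cfg_t`, `reconnect_cfg_from_env`) and the default values are assumptions made for illustration, since the actual defaults are not stated here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Read an unsigned decimal from the environment. Falls back to `fallback`
 * when the variable is unset, empty, negative, malformed, or out of range. */
static uint32_t env_u32(const char *name, uint32_t fallback)
{
    assert(name != NULL);
    const char *s = getenv(name);
    if (s == NULL || *s == '\0' || strchr(s, '-') != NULL)
        return fallback;
    errno = 0;
    char *end = NULL;
    unsigned long v = strtoul(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0' || v > UINT32_MAX)
        return fallback;
    return (uint32_t)v;
}

/* Hypothetical config holder; kept separate from the INI/config struct. */
typedef struct {
    uint32_t base_ms;
    uint32_t max_ms;
    uint32_t max_attempts;  /* 0 = unlimited, per the commit message */
} reconnect_cfg_t;

/* Placeholder defaults (500 / 8000 / 10) are assumptions, not ceracoder's. */
static reconnect_cfg_t reconnect_cfg_from_env(void)
{
    reconnect_cfg_t c;
    c.base_ms      = env_u32("CERACODER_RECONNECT_BASE_MS", 500);
    c.max_ms       = env_u32("CERACODER_RECONNECT_MAX_MS", 8000);
    c.max_attempts = env_u32("CERACODER_RECONNECT_MAX_ATTEMPTS", 10);
    if (c.max_ms < c.base_ms)
        c.max_ms = c.base_ms;  /* keep the cap at or above the first delay */
    return c;
}
```

Falling back to a default on malformed input, rather than aborting, keeps a typo in the service environment from taking the encoder down at startup.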

Verified: make test green (24 tests). Real-binary runtime proof against
srt-live-transmit -- transient drop reconnects with observed backoff
(same PID), permanent outage exhausts the bounded window and exits
non-zero with no core dump for systemd Restart=on-failure.

@andrescera merged commit c059eee into main on Jun 6, 2026 (3 checks passed)
@andrescera deleted the feat/device-stable-core branch on June 6, 2026, 11:01