feat: add healthcheck to execution service and wait condition for consensus #1005

Open
erhnysr wants to merge 1 commit into base:main from erhnysr:feat/docker-healthcheck-execution-client

erhnysr commented Apr 11, 2026

Problem

When running docker compose up, the node service (consensus client) starts immediately after the execution container is created — not after it's actually ready to serve requests. On first boot or after a restart with a large database, the execution client can take 30–120 seconds before its JSON-RPC becomes available. During this window the consensus service repeatedly fails to connect and enters a crash-loop.

This is a frequently reported issue in the #🛠|node-operators Discord channel.

Solution

  • Add a healthcheck to the execution service that polls eth_syncing via JSON-RPC. The check passes as soon as the RPC endpoint responds (the node does not need to be fully synced — just started).
  • Change depends_on on the node service to condition: service_healthy so the consensus client only starts once the execution client's RPC is live.
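
A minimal sketch of how the two changes above could look in the compose file. The service names `execution` and `node` and the four timing values come from this description. The probe command, the port `8545`, and the presence of `curl` in the execution image are assumptions made for illustration and may differ from the actual diff.

```yaml
# Illustrative fragment only; merge into the existing service definitions.
# ASSUMPTIONS (not confirmed by this PR text): the execution client's
# JSON-RPC listens on localhost:8545 inside the container, and curl is
# installed in the image.
services:
  execution:
    healthcheck:
      test:
        - CMD-SHELL
        - >-
          curl -sf -X POST -H 'Content-Type: application/json'
          --data '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}'
          http://localhost:8545 || exit 1
      interval: 30s      # re-poll every 30 seconds
      timeout: 10s       # single-request timeout
      retries: 5         # mark unhealthy after 5 consecutive failures
      start_period: 60s  # grace window for slow DB init on first boot

  node:
    depends_on:
      execution:
        # Start the consensus client only once the probe above succeeds.
        condition: service_healthy
```

With `curl -sf`, the probe succeeds on any successful HTTP response from the RPC endpoint, regardless of whether `eth_syncing` reports the node as synced, which matches the intent stated above.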

Healthcheck parameters

| Parameter | Value | Reason |
| --- | --- | --- |
| interval | 30s | Re-poll every 30 seconds |
| timeout | 10s | Single-request timeout |
| retries | 5 | Mark unhealthy after 5 consecutive failures |
| start_period | 60s | Grace window for slow DB init on first boot |

Testing

Verified with CLIENT=reth and CLIENT=geth. The node service now waits correctly on fresh starts and after docker compose restart execution.

Backwards compatibility

No changes to .env files or entrypoints. Existing deployments require no migration.

…sensus

Previously the consensus node would start immediately after the execution
client container was created, without waiting for its JSON-RPC to become
available. On first boot or after a restart, this caused the consensus
service to crash-loop while the execution client was still initialising.

Changes:
- Add healthcheck to execution service that polls eth_syncing via JSON-RPC.
  The check passes as soon as the RPC endpoint responds, confirming the
  client is fully booted (node does not need to be fully synced).
- Change depends_on on the node service to condition: service_healthy so
  the consensus client only starts once the execution client is ready.

Healthcheck parameters:
  interval: 30s     - re-poll every 30 seconds
  timeout:  10s     - single-request timeout
  retries:  5       - mark unhealthy after 5 consecutive failures
  start_period: 60s - grace window for slow database init on first boot

Backwards-compatible: no changes to .env files or entrypoints required.
@cb-heimdall
Collaborator

🟡 Heimdall Review Status

Requirement Status More Info
Reviews 🟡 0/1
Denominator calculation
1 if user is bot 0
1 if user is external 0
2 if repo is sensitive 0
From .codeflow.yml 1
Additional review requirements
Max 0
0
From CODEOWNERS 0
Global minimum 0
Max 1
1
1 if commit is unverified 1
Sum 2
