Skip to content

feat(blockchain): warn instead of failing on unclosed justified divergence#480

Merged
pablodeymo merged 2 commits into
mainfrom
fix/warn-on-justified-divergence
Jun 29, 2026
Merged

feat(blockchain): warn instead of failing on unclosed justified divergence#480
pablodeymo merged 2 commits into
mainfrom
fix/warn-on-justified-divergence

Conversation

@MegaRedHand

@MegaRedHand MegaRedHand commented Jun 29, 2026

Copy link
Copy Markdown
Collaborator

What

produce_block_with_signatures checks that a produced block's justified
slot does not lag the store's justified checkpoint (leanSpec #595). When it
did, the function returned StoreError::JustifiedDivergenceNotClosed, which
aborted block production for that slot.

This PR removes that error variant and replaces the hard failure with a
warn-level log, then publishes the block anyway.

Why

Aborting production on this condition halts the chain rather than recovering
from it. The proposer cannot re-justify the head because validators source
from the store-justified checkpoint, so every subsequent proposal hits the
same divergence and block production freezes indefinitely (observed as a
"did not converge" stall).

Publishing a block whose justified slot lags by one fork is the lesser evil:
the chain keeps making progress and pool attestations get a chance to close
the divergence on later slots. The divergence is now surfaced as a warning
instead of silently halting the node.

This mirrors the upstream spec change in
leanEthereum/leanSpec#1168.

Changes

  • crates/blockchain/src/store.rs: replace the return Err(JustifiedDivergenceNotClosed) with a warn! log; remove the now-unused StoreError::JustifiedDivergenceNotClosed variant.
  • crates/blockchain/state_transition/src/lib.rs: reword a test doc comment that referenced the removed error type by name.

Testing

  • cargo build -p ethlambda-blockchain
  • cargo clippy -p ethlambda-blockchain --all-targets ✅ (no warnings)

References

…gence

When a produced block's justified slot lagged the store's justified
checkpoint, `produce_block_with_signatures` returned
`JustifiedDivergenceNotClosed`, which aborted block production entirely.
In practice this halts the chain: the proposer cannot re-justify the head
because validators source from the store-justified checkpoint, so every
subsequent proposal hits the same divergence and freezes block production.

Halting production is worse than publishing a block whose justified slot
lags by one fork: the latter still lets the chain make progress and gives
pool attestations a chance to close the divergence on later slots. Replace
the hard error with a warn-level log and publish the block anyway.
@github-actions

Copy link
Copy Markdown

🤖 Kimi Code Review

Review: PR #480 - Liveness fix for justified checkpoint divergence

The change correctly prioritizes chain liveness over strict adherence to the ideal justified checkpoint invariant. Allowing block production to continue with a warning—rather than halting with an error—is the correct consensus-layer behavior.

crates/blockchain/src/store.rs

  1. Lines 828–840: The warning log is well-structured with sufficient context (slot, proposer, block_justified_slot, store_justified_slot). Consider also logging the difference (store_justified_slot - post_checkpoints.justified.slot) to help operators quantify the divergence severity.

  2. Lines 945–954: Clean removal of the obsolete JustifiedDivergenceNotClosed error variant. Verify that no other crates (e.g., RPC handlers or metrics) match on this variant; otherwise, this is a breaking API change requiring coordination.

  3. Logic correctness: The comment accurately reflects the trade-off. Publishing a block with a lagging justified checkpoint is safe—the block is still valid, and subsequent attestations can advance justification. Freezing block production is a definitive liveness failure.

crates/blockchain/state_transition/src/lib.rs

  1. Line 713: Comment update accurately reflects the behavioral change from error to warning.

Consensus considerations

  • This aligns with the Ethereum philosophy: never stall block production due to local store state inconsistencies. The fork choice rule (LMD GHOST / 3SF-mini) will still handle the block correctly; it simply won't advance justification until the next slot.
  • Monitor frequency of this warning in production—repeated occurrences may indicate network partitioning or attestation pool starvation.

Verdict: LGTM. Safe liveness improvement with proper observability.


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions

Copy link
Copy Markdown

🤖 Codex Code Review

No findings.

The behavioral change in crates/blockchain/src/store.rs is narrow: it stops treating a lagging post_checkpoints.justified.slot as a hard build failure and downgrades it to a warning. Given the current import path only latches latest_justified monotonically upward (store.rs) and the block-builder/state-transition regressions already cover the original within-block justification regression (block_builder.rs, state_transition/src/lib.rs), I don’t see a concrete correctness or safety bug introduced by this diff.

Residual risk: there is no store-level regression test for the new “warn and continue” path at store.rs, so if build_block ever really fails to close the divergence again, this behavior change is only covered by operator logs. I also could not run the Rust tests here because cargo/rustup attempts to write under a read-only home directory in this environment.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@greptile-apps

greptile-apps Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR changes justified-divergence handling during block production. The main changes are:

  • Logs unclosed justified divergence instead of returning an error.
  • Publishes the produced block after the warning.
  • Removes the unused JustifiedDivergenceNotClosed error variant.
  • Updates a nearby test comment to avoid the removed error name.

Confidence Score: 5/5

This looks safe to merge after considering the warning-only behavior.

  • The changed path intentionally keeps block production moving when the fixed-point loop does not close the divergence.
  • No blocking issue was found in the removed error variant or the comment-only test update.
  • The remaining note is a liveness concern around repeated lagging proposals.

Important Files Changed

Filename Overview
crates/blockchain/src/store.rs Changes block production from hard-failing on lagging justified checkpoints to warning and continuing, and removes the old error variant.
crates/blockchain/state_transition/src/lib.rs Updates a test comment to describe the freeze scenario without naming the removed error variant.
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
crates/blockchain/src/store.rs:835-845
**Lagging Justification Still Publishes**

When `post_checkpoints.justified.slot` stays behind `store.latest_justified().slot`, this path now publishes the block without changing the block or surfacing the divergence to the caller. The local import path only advances the store checkpoint on a higher justified slot, so repeated proposals in this state can keep producing blocks that do not close justification and leave finality stalled instead of making the failure actionable.

Reviews (1): Last reviewed commit: "feat(blockchain): warn instead of failin..." | Re-trigger Greptile

Comment on lines 835 to 845
let store_justified_slot = store.latest_justified().slot;
if post_checkpoints.justified.slot < store_justified_slot {
return Err(StoreError::JustifiedDivergenceNotClosed {
block_justified_slot: post_checkpoints.justified.slot,
warn!(
%slot,
proposer = validator_index,
block_justified_slot = post_checkpoints.justified.slot,
store_justified_slot,
});
"Produced block justified slot is behind store justified slot; \
fixed-point attestation loop did not converge"
);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Lagging Justification Still Publishes

When post_checkpoints.justified.slot stays behind store.latest_justified().slot, this path now publishes the block without changing the block or surfacing the divergence to the caller. The local import path only advances the store checkpoint on a higher justified slot, so repeated proposals in this state can keep producing blocks that do not close justification and leave finality stalled instead of making the failure actionable.

Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/blockchain/src/store.rs
Line: 835-845

Comment:
**Lagging Justification Still Publishes**

When `post_checkpoints.justified.slot` stays behind `store.latest_justified().slot`, this path now publishes the block without changing the block or surfacing the divergence to the caller. The local import path only advances the store checkpoint on a higher justified slot, so repeated proposals in this state can keep producing blocks that do not close justification and leave finality stalled instead of making the failure actionable.

How can I resolve this? If you propose a fix, please make it concise.

@github-actions

Copy link
Copy Markdown

🤖 Claude Code Review

Now I have everything I need. Let me write up the review.


Review: feat(blockchain): warn instead of failing on unclosed justified divergence

This PR converts a hard block-production abort (return Err(JustifiedDivergenceNotClosed)) into a warn-and-continue when the produced block's latest_justified slot lags the store's. The motivation is sound: an infinite abort loop is strictly worse than publishing a suboptimal block, and the root-cause fix for the underlying regression (non-monotonic latest_justified within a block) already landed in state_transition. The cleanup is complete — no dangling references to the removed error variant exist anywhere in the codebase. Three findings are worth addressing before merge.


crates/blockchain/src/lib.rs:500lean_block_building_failures_total no longer fires on justified divergence

Severity: Low (observability)

Previously, JustifiedDivergenceNotClosed was an Err return → the caller's else branch at line 499 fired → metrics::inc_block_building_failures() was incremented. Now the function returns Ok, so the counter stays silent while the node continuously publishes lagging-justified blocks.

lean_block_building_failures_total is incremented in nine other places in propose_block (lines 500, 517, 527, 536, 545, 557, 587, 600, 608, 646) for every other failure mode. If this counter is used for alerting, a node stuck in a persistent divergence loop — exactly the scenario the PR's warning text describes — will show zero failures and appear healthy.

Suggestion: Either add metrics::inc_block_building_failures() inside the new warn! block (so the counter still signals degraded operation), or add a dedicated counter for divergence events (e.g. lean_block_justified_divergence_total). The latter is more precise.


crates/blockchain/src/store.rs:830 — comment misattributes the round loop to build_block

Severity: Low (comment accuracy)

The comment says:

The fixed-point loop in build_block is expected to incorporate pool attestations that close any divergence inherited from a minority fork

build_block has no loop. Its own inline doc (line 94 of block_builder.rs) is explicit: "ethlambda runs the STF once after selection." The round loop is in select_attestations at line 165 (for _round in 0..MAX_ATTESTATIONS_DATA), called by build_block. The comment's attribution causes a reader debugging divergence to look for a loop in build_block that does not exist.

Suggestion: Change line 830–831 to:

// The greedy round loop in `select_attestations` (called by `build_block`)
// is expected to incorporate pool attestations that close any

crates/blockchain/src/store.rs:835 — new warn-and-continue path has no unit test

Severity: Low (test coverage)

produce_block_with_signatures is not called from any test — only from propose_block in the integration path. The old error variant was also untested at this level, so this PR does not introduce the gap, but it does change observable behavior in a way that is straightforward to cover: the scenario (store's latest_justified is ahead of the post-build block's justified slot) is already wired up in the store test infrastructure. produce_attestation_data_sources_from_head_state_not_store (line 1250) constructs exactly the right off_head_justified store state. A companion test calling produce_block_with_signatures in that setup and asserting is_ok() would pin the new contract and catch any future re-hardening of the check.


What looks good

The warn fields (%slot, proposer = validator_index, block_justified_slot, store_justified_slot) follow the CLAUDE.md temporal → identity → context ordering. The enum cleanup is complete — no match arms or pattern uses of JustifiedDivergenceNotClosed remain anywhere. The state_transition doc comment update is accurate and necessary. The _post_checkpoints discard in lib.rs is pre-existing and unaffected by this change. The overall design decision (liveness over protocol purity) is well-argued in the PR body.


Automated review by Claude (Anthropic) · sonnet · custom prompt

Comment thread crates/blockchain/state_transition/src/lib.rs Outdated
@pablodeymo pablodeymo merged commit b181dca into main Jun 29, 2026
2 checks passed
@pablodeymo pablodeymo deleted the fix/warn-on-justified-divergence branch June 29, 2026 16:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants