fix(audit): keep disk pressure out of postponed state#146

Open
mateeullahmalik wants to merge 6 commits into master from fix/audit-storagefull-disk-state

@mateeullahmalik
Contributor

Summary

Permanent chain-side fix for the disk pressure state-machine conflict between audit host requirements and STORAGE_FULL.

The audit module remains the state-transition owner, but disk pressure is no longer treated as an audit_host_requirements POSTPONED reason. Disk capacity now stays on the STORAGE_FULL path driven by audit epoch HostReport disk usage.

Behavior:

  • selfHostViolatesMinimums ignores disk and only checks non-storage host metrics.
  • selfHostCompliant ignores disk for POSTPONED recovery eligibility.
  • POSTPONED recovery chooses STORAGE_FULL when the same epoch self HostReport still exceeds supernode.max_storage_usage_percent; otherwise it recovers to ACTIVE through the existing supernode keeper path.
  • STORAGE_FULL remains protected from disk-driven POSTPONED behavior.
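
The behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the keeper code from this PR: the `State`, `HostReport`, and `Params` types, the integer percent fields, and the function names `violatesMinimums` and `recoveryTarget` are all assumptions made for the example, and the real recovery path also depends on conditions not modeled here (missing reports, peer port checks, and so on).

```go
package main

import "fmt"

// State is a hypothetical stand-in for the supernode state enum.
type State string

const (
	Active      State = "ACTIVE"
	Postponed   State = "POSTPONED"
	StorageFull State = "STORAGE_FULL"
)

// HostReport is a simplified, hypothetical self report for a single epoch.
type HostReport struct {
	CPUUsagePercent  int
	MemUsagePercent  int
	DiskUsagePercent int
}

// Params groups the thresholds used below. Field names are illustrative.
type Params struct {
	MinCPUFreePercent      int
	MinMemFreePercent      int
	MaxStorageUsagePercent int
}

// violatesMinimums checks only non-storage metrics.
// Disk usage is deliberately ignored so it can never cause POSTPONED.
func violatesMinimums(r HostReport, p Params) bool {
	cpuFree := 100 - r.CPUUsagePercent
	memFree := 100 - r.MemUsagePercent
	return cpuFree < p.MinCPUFreePercent || memFree < p.MinMemFreePercent
}

// recoveryTarget picks the state a POSTPONED node would move to,
// given its self report from the same epoch.
func recoveryTarget(r HostReport, p Params) State {
	if violatesMinimums(r, p) {
		return Postponed // CPU/mem failures still block recovery
	}
	if r.DiskUsagePercent > p.MaxStorageUsagePercent {
		return StorageFull // disk pressure stays on the STORAGE_FULL path
	}
	return Active
}

func main() {
	p := Params{MinCPUFreePercent: 20, MinMemFreePercent: 20, MaxStorageUsagePercent: 90}
	highDisk := HostReport{CPUUsagePercent: 10, MemUsagePercent: 10, DiskUsagePercent: 95}
	fmt.Println(violatesMinimums(highDisk, p), recoveryTarget(highDisk, p))
}
```

The point of the split is that the host-minimum predicate never sees disk, so the only place disk usage can influence state is the comparison against the maximum storage usage threshold.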

Invariant Table

| Field / Behavior | Valid Range / Contract | Enforcement Points | Test Coverage |
| --- | --- | --- | --- |
| Disk pressure state ownership | Disk pressure must not cause audit_host_requirements POSTPONED | selfHostViolatesMinimums, selfHostCompliant | TestEnforceEpochEnd_DiskPressureDoesNotPostponeActive, empty-active-set self-compliance test updated to CPU |
| POSTPONED recovery target | Recovered node with disk > max storage returns as STORAGE_FULL, not ACTIVE | recoverSupernodeFromPostponed | TestEnforceEpochEnd_RecoversPostponedToStorageFullWhenDiskStillHigh |
| Non-storage recovery behavior | CPU/mem host failures still block recovery | selfHostCompliant | TestEnforceEpochEnd_EmptyActiveSet_NonCompliantSelf_NoRecover |

Cosmos / Upgrade Notes

No proto changes, no store key changes, no migration required. This is deterministic keeper logic over existing epoch reports and existing supernode state.

Tests

  • go test -count=1 ./x/audit/v1/keeper
  • go test -count=1 ./x/audit/v1/... ./x/supernode/v1/keeper
  • git diff --check

Risk / Rollback

Risk is limited to audit EndBlock state policy. CPU/mem host-requirement postponement, missing reports, peer port checks, action-finalization evidence, and storage-truth enforcement remain intact. Rollback is reverting this PR before release if needed.

…pressure invariant

The PR removes disk from selfHostViolatesMinimums / selfHostCompliant so
disk pressure is owned exclusively by the STORAGE_FULL transition path.
TestAuditEmptyActiveSetBootstrap_NonCompliantHostStaysPostponed was
written under the old invariant (disk-non-compliance → stays POSTPONED)
and started observing POSTPONED → STORAGE_FULL after this PR landed in CI.

Mirrors the unit-level refactor of TestEnforceEpochEnd_EmptyActiveSet_NonCompliantSelf_NoRecover
already in this PR (disk → cpu swap).

Changes:
- Swap the violating metric in
  TestAuditEmptyActiveSetBootstrap_NonCompliantHostStaysPostponed from
  disk_usage_percent=95 / MinDiskFreePercent=20 to
  cpu_usage_percent=95 / MinCpuFreePercent=20. The intent of the test
  (self-compliance still gates the bootstrap exception) is preserved on
  a metric that still postpones.
- Add TestAuditEmptyActiveSetBootstrap_DiskPressureGoesToStorageFull,
  the end-to-end mirror of unit-level
  TestEnforceEpochEnd_RecoversPostponedToStorageFullWhenDiskStillHigh:
  empty active set + POSTPONED self + disk > MaxStorageUsagePercent →
  recovers to STORAGE_FULL (not ACTIVE, not stuck POSTPONED).
- Replace helper setAuditParamsForFastEpochsWithMinDiskFree with
  setAuditParamsForFastEpochsWithMinCpuFree (the old helper had a single
  caller and contradicted the new invariant).
- Add auditHostReportWithCpuUsageJSON; keep auditHostReportWithDiskUsageJSON
  for the new STORAGE_FULL test.
Contributor

Copilot AI left a comment

Pull request overview

This PR updates audit epoch-end enforcement so disk pressure no longer participates in audit_host_requirements POSTPONED logic, and instead remains solely on the STORAGE_FULL state path (including during POSTPONED recovery).

Changes:

  • Remove disk usage from host-minimum POSTPONED checks (selfHostViolatesMinimums, selfHostCompliant) and steer POSTPONED recovery to STORAGE_FULL when the same-epoch self HostReport still exceeds supernode.max_storage_usage_percent.
  • Update unit/system tests to cover disk-pressure invariants (no disk-driven POSTPONED; POSTPONED recovery targets STORAGE_FULL when disk remains high).
  • Remove v1.11.1 upgrade-time enforcement of an audit min_disk_free_percent floor and update associated docs/changelog notes.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| x/audit/v1/README.md | Updates module docs to reflect disk no longer being a POSTPONED criterion. |
| x/audit/v1/POSTPONEMENT_RULES.md | Updates postponement criteria docs to exclude disk and reference STORAGE_FULL. |
| x/audit/v1/keeper/enforcement.go | Adjusts host-minimum predicates and adds POSTPONED recovery logic that can transition to STORAGE_FULL. |
| x/audit/v1/keeper/enforcement_storagefull_transition_test.go | Adds/updates unit tests for POSTPONED recovery routing to STORAGE_FULL and disk invariants. |
| x/audit/v1/keeper/enforcement_empty_active_set_test.go | Updates empty-active-set recovery gate test to use CPU (non-storage) rather than disk. |
| tests/systemtests/audit_test_helpers_test.go | Renames/repurposes genesis mutator + adds CPU host-report JSON helper for non-storage compliance tests. |
| tests/systemtests/audit_empty_active_set_bootstrap_test.go | Updates bootstrap non-compliance test to CPU and adds system test covering disk-pressure POSTPONED→STORAGE_FULL. |
| CHANGELOG.md | Removes mention of enforcing a min_disk_free_percent floor in v1.11.1 notes. |
| app/upgrades/v1_11_1/upgrade.go | Removes audit min_disk_free_percent floor logic from the v1.11.1 upgrade handler. |
| app/upgrades/v1_11_1/upgrade_test.go | Removes tests that only covered the removed disk-free-percent floor helper. |
| app/upgrades/upgrades.go | Updates upgrade table comment to remove the disk-free-percent floor note for v1.11.1. |

Comments suppressed due to low confidence (2)

x/audit/v1/README.md:94

  • The README still frames recovery as POSTPONED -> ACTIVE, but this PR introduces a recovery path that can transition POSTPONED -> STORAGE_FULL when disk usage remains above supernode.max_storage_usage_percent. Please update this section header/text to reflect both possible recovery targets so operators don’t infer that POSTPONED always returns to ACTIVE.
- **Self Report minimum failures** (CPU/mem free% thresholds),
- **Peer port thresholds**: a required port is treated as CLOSED if peer observations meet `peer_port_postpone_threshold_percent`, and this happens for `consecutive_epochs_to_postpone` consecutive epochs.

### Recovery (`POSTPONED -> ACTIVE`)


Comment thread x/audit/v1/POSTPONEMENT_RULES.md
Comment thread x/audit/v1/keeper/enforcement.go
Comment thread x/audit/v1/keeper/enforcement.go Outdated
Comment thread x/audit/v1/POSTPONEMENT_RULES.md
Comment thread x/audit/v1/keeper/enforcement.go
Comment thread x/audit/v1/keeper/enforcement.go
@mateeullahmalik mateeullahmalik self-assigned this May 29, 2026