fix(audit): keep disk pressure out of postponed state#146
Open
mateeullahmalik wants to merge 6 commits into
…pressure invariant

The PR removes disk from selfHostViolatesMinimums / selfHostCompliant so disk pressure is owned exclusively by the STORAGE_FULL transition path. TestAuditEmptyActiveSetBootstrap_NonCompliantHostStaysPostponed was written under the old invariant (disk non-compliance → stays POSTPONED) and started observing POSTPONED → STORAGE_FULL after this PR landed in CI. This change mirrors the unit-level refactor of TestEnforceEpochEnd_EmptyActiveSet_NonCompliantSelf_NoRecover already in this PR (disk → cpu swap).

Changes:
- Swap the violating metric in TestAuditEmptyActiveSetBootstrap_NonCompliantHostStaysPostponed from disk_usage_percent=95 / MinDiskFreePercent=20 to cpu_usage_percent=95 / MinCpuFreePercent=20. The intent of the test (self-compliance still gates the bootstrap exception) is preserved on a metric that still postpones.
- Add TestAuditEmptyActiveSetBootstrap_DiskPressureGoesToStorageFull, the end-to-end mirror of the unit-level TestEnforceEpochEnd_RecoversPostponedToStorageFullWhenDiskStillHigh: empty active set + POSTPONED self + disk > MaxStorageUsagePercent → recovers to STORAGE_FULL (not ACTIVE, not stuck POSTPONED).
- Replace helper setAuditParamsForFastEpochsWithMinDiskFree with setAuditParamsForFastEpochsWithMinCpuFree (the old helper had a single caller and contradicted the new invariant).
- Add auditHostReportWithCpuUsageJSON; keep auditHostReportWithDiskUsageJSON for the new STORAGE_FULL test.
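For readers without the test helpers at hand, the sketch below shows one way a host-report JSON helper of this kind might look. It is an illustration only: the `hostReport` struct, its exact field set, and the function name `hostReportWithCPUUsageJSON` are assumptions, and the real `auditHostReportWithCpuUsageJSON` in `tests/systemtests` may build a different payload. Only the field names `cpu_usage_percent` and `disk_usage_percent` come from the commit message.

```go
package main

import "encoding/json"

// hostReport is a hypothetical, trimmed-down self report. The real
// HostReport message very likely carries more fields than these two.
type hostReport struct {
	CPUUsagePercent  float64 `json:"cpu_usage_percent"`
	DiskUsagePercent float64 `json:"disk_usage_percent"`
}

// hostReportWithCPUUsageJSON (hypothetical name) returns a JSON host report
// that reports the given CPU usage and leaves disk usage at zero, so a test
// can violate a CPU minimum without touching the STORAGE_FULL path.
func hostReportWithCPUUsageJSON(cpuUsagePercent float64) (string, error) {
	raw, err := json.Marshal(hostReport{CPUUsagePercent: cpuUsagePercent})
	if err != nil {
		return "", err
	}
	return string(raw), nil
}
```

Keeping the CPU and disk helpers separate lets each system test violate exactly one metric, which is what makes the POSTPONED and STORAGE_FULL outcomes distinguishable.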
Pull request overview
This PR updates audit epoch-end enforcement so disk pressure no longer participates in audit_host_requirements POSTPONED logic, and instead remains solely on the STORAGE_FULL state path (including during POSTPONED recovery).
Changes:
- Remove disk usage from host-minimum POSTPONED checks (`selfHostViolatesMinimums`, `selfHostCompliant`) and steer POSTPONED recovery to `STORAGE_FULL` when the same-epoch self HostReport still exceeds `supernode.max_storage_usage_percent`.
- Update unit/system tests to cover disk-pressure invariants (no disk-driven POSTPONED; POSTPONED recovery targets `STORAGE_FULL` when disk remains high).
- Remove v1.11.1 upgrade-time enforcement of an audit `min_disk_free_percent` floor and update associated docs/changelog notes.
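The routing described in the first change can be sketched roughly as follows. This is a simplified model, not the keeper code: the `HostReport`, `Params`, and `State` types and both function names are hypothetical, and the real recovery path also depends on conditions not modeled here (for example peer port observations and missing reports).

```go
package main

// HostReport is a hypothetical stand-in for the same-epoch self report.
type HostReport struct {
	CPUUsagePercent  float64
	MemUsagePercent  float64
	DiskUsagePercent float64
}

// Params is a hypothetical stand-in for the relevant audit/supernode params.
type Params struct {
	MinCPUFreePercent      float64
	MinMemFreePercent      float64
	MaxStorageUsagePercent float64
}

// State models only the three supernode states discussed in this PR.
type State int

const (
	StateActive State = iota
	StatePostponed
	StateStorageFull
)

// violatesMinimums follows the idea behind selfHostViolatesMinimums after
// this PR: only non-storage metrics (CPU, memory) are checked, disk is ignored.
func violatesMinimums(r HostReport, p Params) bool {
	cpuFree := 100 - r.CPUUsagePercent
	memFree := 100 - r.MemUsagePercent
	return cpuFree < p.MinCPUFreePercent || memFree < p.MinMemFreePercent
}

// recoveryTarget picks where a POSTPONED node goes at epoch end in this model:
// it stays POSTPONED while CPU/memory minimums are violated, moves to
// STORAGE_FULL if disk usage still exceeds the storage maximum, and
// otherwise returns to ACTIVE.
func recoveryTarget(r HostReport, p Params) State {
	if violatesMinimums(r, p) {
		return StatePostponed
	}
	if r.DiskUsagePercent > p.MaxStorageUsagePercent {
		return StateStorageFull
	}
	return StateActive
}
```

Under these assumptions, a report with 95% disk usage and healthy CPU/memory never yields `StatePostponed`; disk pressure can only surface as `StateStorageFull`.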
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| x/audit/v1/README.md | Updates module docs to reflect disk no longer being a POSTPONED criterion. |
| x/audit/v1/POSTPONEMENT_RULES.md | Updates postponement criteria docs to exclude disk and reference STORAGE_FULL. |
| x/audit/v1/keeper/enforcement.go | Adjusts host-minimum predicates and adds POSTPONED recovery logic that can transition to STORAGE_FULL. |
| x/audit/v1/keeper/enforcement_storagefull_transition_test.go | Adds/updates unit tests for POSTPONED recovery routing to STORAGE_FULL and disk invariants. |
| x/audit/v1/keeper/enforcement_empty_active_set_test.go | Updates empty-active-set recovery gate test to use CPU (non-storage) rather than disk. |
| tests/systemtests/audit_test_helpers_test.go | Renames/repurposes genesis mutator + adds CPU host-report JSON helper for non-storage compliance tests. |
| tests/systemtests/audit_empty_active_set_bootstrap_test.go | Updates bootstrap non-compliance test to CPU and adds system test covering disk-pressure POSTPONED→STORAGE_FULL. |
| CHANGELOG.md | Removes mention of enforcing a min_disk_free_percent floor in v1.11.1 notes. |
| app/upgrades/v1_11_1/upgrade.go | Removes audit min_disk_free_percent floor logic from the v1.11.1 upgrade handler. |
| app/upgrades/v1_11_1/upgrade_test.go | Removes tests that only covered the removed disk-free-percent floor helper. |
| app/upgrades/upgrades.go | Updates upgrade table comment to remove the disk-free-percent floor note for v1.11.1. |
Comments suppressed due to low confidence (2)
x/audit/v1/README.md:94
- The README still frames recovery as `POSTPONED -> ACTIVE`, but this PR introduces a recovery path that can transition `POSTPONED -> STORAGE_FULL` when disk usage remains above `supernode.max_storage_usage_percent`. Please update this section header/text to reflect both possible recovery targets so operators don't infer that POSTPONED always returns to ACTIVE.
- **Self Report minimum failures** (CPU/mem free% thresholds),
- **Peer port thresholds**: a required port is treated as CLOSED if peer observations meet `peer_port_postpone_threshold_percent`, and this happens for `consecutive_epochs_to_postpone` consecutive epochs.
### Recovery (`POSTPONED -> ACTIVE`)
j-rafique
approved these changes
May 30, 2026
Summary
Permanent chain-side fix for the disk pressure state-machine conflict between audit host requirements and STORAGE_FULL.
The audit module remains the state-transition owner, but disk pressure is no longer treated as an `audit_host_requirements` POSTPONED reason. Disk capacity now stays on the STORAGE_FULL path driven by audit epoch HostReport disk usage.

Behavior:
- `selfHostViolatesMinimums` ignores disk and only checks non-storage host metrics.
- `selfHostCompliant` ignores disk for POSTPONED recovery eligibility.
- POSTPONED recovery targets `STORAGE_FULL` when the same-epoch self HostReport still exceeds `supernode.max_storage_usage_percent`; otherwise it recovers to ACTIVE through the existing supernode keeper path.

Invariant Table
| Code path | Test |
|---|---|
| `selfHostViolatesMinimums`, `selfHostCompliant` (`audit_host_requirements` POSTPONED) | `TestEnforceEpochEnd_DiskPressureDoesNotPostponeActive`, empty-active-set self-compliance test updated to CPU |
| `recoverSupernodeFromPostponed` | `TestEnforceEpochEnd_RecoversPostponedToStorageFullWhenDiskStillHigh` |
| `selfHostCompliant` | `TestEnforceEpochEnd_EmptyActiveSet_NonCompliantSelf_NoRecover` |

Cosmos / Upgrade Notes
No proto changes, no store key changes, no migration required. This is deterministic keeper logic over existing epoch reports and existing supernode state.
Tests
- `go test -count=1 ./x/audit/v1/keeper`
- `go test -count=1 ./x/audit/v1/... ./x/supernode/v1/keeper`
- `git diff --check`

Risk / Rollback
Risk is limited to audit EndBlock state policy. CPU/mem host-requirement postponement, missing reports, peer port checks, action-finalization evidence, and storage-truth enforcement remain intact. To roll back, revert this PR before release if needed.