fix(wireguard): prevent bandwidth overcounting by nmnmcc · Pull Request #6728 · nymtech/nym

nmnmcc · 2026-05-04T02:46:44Z

Summary

Fix WireGuard bandwidth overcounting in several reconnect / re-registration paths.

The main issue was that repeated AddPeer requests for the same WireGuard public key could create multiple active PeerHandles. Each handle observed the same cumulative kernel peer counters and deducted bandwidth independently, which could multiply reported usage. Re-adding a peer could also seed the cached counters from zero, causing historical kernel traffic to be charged again.

This PR also fixes storage sync races that could double-apply a pending bandwidth delta, and fixes PSK persistence/update behavior during peer refresh.

Changes

Make PeerController::handle_add_request idempotent for already-active peers.
- Reuses the existing active peer manager instead of spawning another PeerHandle.
- Syncs existing storage state before replacing the bandwidth manager.
- Updates allowed IPs in place.
- Rolls back newly allocated IPs if kernel peer configuration fails.
- Releases the old IP pair after a successful IP change.
Seed new WireGuard peer cache from current kernel counters when the peer already exists in the kernel.
- Prevents recharging historical rx_bytes + tx_bytes on first poll after re-add.
Make credential bandwidth storage sync atomic per client.
- Adds a sync lock.
- Uses take_delta_since_sync / restore-on-error instead of read-then-clear.
- Preserves deltas accumulated while storage sync is in progress.
- Prevents two concurrent syncs from applying the same delta twice.
Fix WireGuard PSK update path.
- Updating PSK no longer removes the active peer first.
- Active peer state is refreshed through the idempotent add path.
Persist WireGuard peer PSK in gateway storage.
- insert_peer now inserts and updates the psk column.

Tests

Added regression tests covering the overcounting cases:

client_bandwidth::old_read_only_sync_could_apply_the_same_delta_twice
- Reproduces the old duplicate delta flush behavior and verifies the fixed take-delta behavior.
client_bandwidth::resync_preserves_delta_accumulated_during_storage_sync
- Verifies that deltas accumulated during storage sync are not lost.
client_bandwidth::failed_sync_restores_reserved_delta
- Verifies that failed storage sync restores the reserved delta.
peer_controller::duplicate_add_for_active_peer_is_idempotent
- Verifies that duplicate add requests for the same active public key do not break IP allocation or spawn duplicate accounting state.
peer_storage_manager::seeding_cache_from_empty_peer_recounts_all_existing_kernel_traffic
- Reproduces the historical-kernel-counter recount issue and verifies that seeding from kernel state avoids immediate overcharging.
wireguard_peers::insert_peer_persists_psk_on_insert_and_update
- Verifies PSK persistence on insert and update.

Verification

Ran:

cargo test -p nym-credential-verification client_bandwidth --lib
cargo test -p nym-wireguard --features mock peer_controller --lib
cargo test -p nym-wireguard --features mock peer_storage_manager --lib
cargo test -p nym-gateway-storage --lib insert_peer_persists_psk_on_insert_and_update
cargo test -p nym-gateway --lib wireguard

This change is

Summary by CodeRabbit

New Features
- Duplicate peer addition requests are now handled gracefully as idempotent refresh operations, avoiding unnecessary reconnections.
Bug Fixes
- PSK updates no longer disconnect active WireGuard peers; connections are preserved during the update process.
- Bandwidth synchronization now properly recovers from storage failures by restoring pending deltas.
- Improved IP allocation handling during peer updates and removals.

vercel · 2026-05-04T02:46:52Z

@nmnmcc is attempting to deploy a commit to the nyx-network Team on Vercel.

A member of the Team first needs to authorize it.

github-actions · 2026-05-04T02:46:53Z

Thank you for making this first PR

coderabbitai · 2026-05-04T02:47:00Z

📝 Walkthrough

Walkthrough

This PR introduces bandwidth synchronization atomicity and idempotent peer lifecycle management. A new sync_lock and delta-preservation mechanism in ClientBandwidth enable atomic sync operations with rollback on failure. Peer controllers now handle duplicate adds idempotently, PSK persistence is added to storage, and allowed_ips is wrapped for concurrent mutation.

Changes

Bandwidth Synchronization and Peer Lifecycle Atomicity

Layer / File(s)	Summary
Sync Primitives `common/credential-verification/src/client_bandwidth.rs`	`ClientBandwidth` gains `sync_lock` field and new async methods: `sync_guard()`, `take_delta_since_sync()`, `restore_delta_since_sync()`. `resync_bandwidth_with_storage()` now preserves accumulated `bytes_delta_since_sync` instead of clearing it.
Atomic Storage Sync `common/credential-verification/src/bandwidth_storage_manager.rs`	`sync_storage_bandwidth()` acquires sync guard, takes delta, and matches on storage result; failed updates restore delta before returning error. `expire_bandwidth()` acquires guard before resetting state.
Peer Storage & State `common/gateway-storage/src/wireguard_peers.rs`, `common/wireguard/src/peer_handle.rs`	`WgPeerManager::insert_peer()` now stores and updates PSK alongside allowed IPs and client ID. `SharedBandwidthStorageManager` wraps `allowed_ips` in `Arc<RwLock<>>` and provides async `allowed_ips()` and `set_allowed_ips()` accessors.
Peer Lifecycle Idempotency `common/wireguard/src/peer_controller/mod.rs`, `gateway/src/node/wireguard/new_peer_registration/lp.rs`	`handle_add_request()` detects and idempotently handles duplicate adds by recomputing IP allocation, reconfiguring peer if needed, and reusing bandwidth entry. `update_peer_psk()` now queries and preserves active peers, updating PSK in-memory and re-adding instead of removing. `remove_peer()` queries peer before removing storage entries.
Initialization & Refactoring `common/wireguard/src/peer_controller/mod.rs`	Non-duplicate add path probes kernel host interface data to seed cached peer state. `ip_to_key()` refactored from closure chain to explicit loop for clarity.
Tests & Validation `common/credential-verification/src/client_bandwidth.rs`, `common/gateway-storage/src/wireguard_peers.rs`, `common/wireguard/src/peer_controller/mod.rs`, `common/wireguard/src/peer_storage_manager.rs`	Delta preservation across resync, delta restoration on sync failure, PSK persistence on insert/update, and duplicate-add idempotency tests added; consumed bandwidth calculation test validates initialization from empty peer.

Sequence Diagram

sequenceDiagram
    participant PeerCtrl as PeerController
    participant ClientBW as ClientBandwidth
    participant BWSyncMgr as BandwidthStorageManager
    participant Storage as Storage
    participant WGKernel as WireGuard Kernel

    Note over PeerCtrl,WGKernel: Duplicate Add (Idempotent Refresh)
    PeerCtrl->>PeerCtrl: Detect existing peer
    PeerCtrl->>PeerCtrl: Compute IP allocation from old IPs
    PeerCtrl->>WGKernel: configure_peer(peer)
    WGKernel-->>PeerCtrl: OK
    PeerCtrl->>BWSyncMgr: sync_storage_bandwidth()
    BWSyncMgr->>ClientBW: sync_guard()
    ClientBW-->>BWSyncMgr: OwnedMutexGuard
    BWSyncMgr->>ClientBW: take_delta_since_sync()
    ClientBW-->>BWSyncMgr: delta (atomically reset)
    BWSyncMgr->>Storage: increase_bandwidth(delta)
    alt Storage Success
        Storage-->>BWSyncMgr: updated
        BWSyncMgr->>ClientBW: resync_bandwidth_with_storage(updated)
        ClientBW-->>BWSyncMgr: OK (delta preserved)
    else Storage Failure
        Storage-->>BWSyncMgr: Error
        BWSyncMgr->>ClientBW: restore_delta_since_sync(delta)
        BWSyncMgr-->>PeerCtrl: Error
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

A hop through sync and state so fair,
Where deltas dance with utmost care—
Restore on fail, preserve through time,
Idempotent peers, each one sublime! 🐰✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The pull request title 'fix(wireguard): prevent bandwidth overcounting' directly and clearly summarizes the main change: addressing a bandwidth overcounting issue in WireGuard.
Docstring Coverage	✅ Passed	Docstring coverage is 80.56% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Review rate limit: 7/8 reviews remaining, refill in 7 minutes and 30 seconds.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

common/wireguard/src/peer_controller/mod.rs (2)

217-229: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Delay IP release until the kernel peer is actually removed.

This now returns the IP pair to the pool before wg_api.remove_peer(key) succeeds. If the kernel removal fails, that IP can be handed to a new peer while the stale peer is still attached, and this method has already dropped its storage/bandwidth tracking state.

Suggested fix

-        if let Some(peer) = stored_peer
-            && let Some(ip_pair) = allocated_ip_pair(&peer)
-        {
-            self.ip_pool.release(ip_pair)
-        }
-
         let ret = self.wg_api.remove_peer(key);
         if ret.is_err() {
             nym_metrics::inc!("wg_peer_removal_failed");
             error!(
                 "Wireguard peer could not be removed from wireguard kernel module. Process should be restarted so that the interface is reset."
             );
         } else {
+            if let Some(peer) = stored_peer
+                && let Some(ip_pair) = allocated_ip_pair(&peer)
+            {
+                self.ip_pool.release(ip_pair)
+            }
             nym_metrics::inc!("wg_peer_removal_success");
         }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@common/wireguard/src/peer_controller/mod.rs` around lines 217 - 229, The code
currently removes storage and returns the IP (remove_wireguard_peer,
bw_storage_managers.remove, ip_pool.release) before calling wg_api.remove_peer;
change the flow so that you first call wg_api.remove_peer(key) and only on
success remove persistent state and release the IP: call
self.wg_api.remove_peer(key) first, and if it returns Ok then call
self.ecash_verifier.storage().remove_wireguard_peer(&key.to_string()).await?,
self.bw_storage_managers.remove(key), and if let Some(peer) = stored_peer && let
Some(ip_pair) = allocated_ip_pair(&peer) then self.ip_pool.release(ip_pair); if
wg_api.remove_peer fails, return the error without altering storage or releasing
the IP.

338-370: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Rollback the kernel peer if confirm_allocation fails.

By Line 369 the peer is already configured in the kernel, but an error from confirm_allocation returns before the manager is inserted or the handle is spawned. That leaves a live, untracked peer behind with no controller state to meter or remove it.

Suggested fix

         let mut handle = PeerHandle::new(
             peer.public_key.clone(),
             self.host_information.clone(),
             cached_peer_manager,
             bandwidth_storage_manager.clone(),
             self.request_tx.clone(),
             self.upgrade_mode.clone(),
             &self.shutdown_token,
         );
         let public_key = peer.public_key.clone();
-        self.ip_pool.confirm_allocation(ip_pair)?;
+        if let Err(err) = self.ip_pool.confirm_allocation(ip_pair) {
+            let _ = self.wg_api.remove_peer(&peer.public_key);
+            nym_metrics::inc!("wg_peer_addition_failed");
+            return Err(err);
+        }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@common/wireguard/src/peer_controller/mod.rs` around lines 338 - 370, After
configuring the kernel peer via self.wg_api.configure_peer(peer) but before
returning on a failure of self.ip_pool.confirm_allocation(ip_pair), perform a
rollback by removing the kernel peer (e.g. call
self.wg_api.remove_peer(&public_key) or the appropriate removal method) so the
kernel does not retain an untracked peer; attempt the removal and log any
removal error (or increment a metric) but return the original confirm_allocation
error to the caller. Ensure you use the already-cloned public_key variable,
await the removal if the API is async, and keep the existing metrics/return
behavior (i.e., removal errors should not mask the original confirm_allocation
error).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@common/wireguard/src/peer_controller/mod.rs`:
- Around line 306-316: The current code replaces the existing ClientBandwidth
instance stored in existing_bandwidth_storage_manager by assigning *guard =
Self::generate_bandwidth_manager(...), which creates a new
ClientBandwidth/sync_lock pair and leads to split in-memory balances between old
and new verifiers (see handle_get_client_bandwidth_by_key /
handle_query_verifier_by_key). Instead of swapping the whole manager, update the
existing manager in-place: call sync_storage_bandwidth() on the guard, then
mutate the fields of the existing ClientBandwidth (or expose a method like
update_from or refresh_verifier on ClientBandwidth) to replace/refresh its
verifier/storage handle and balances based on the result of
generate_bandwidth_manager(...) so that the sync_lock and existing clones remain
valid and all requests continue to authorize against the same in-memory state.
Ensure you keep the same instance behind
existing_bandwidth_storage_manager.inner() and only update its internals.

In `@gateway/src/node/wireguard/new_peer_registration/lp.rs`:
- Around line 23-38: Summary: storage PSK is updated before the live kernel
peer, risking divergence if kernel update fails; fix by reading the current
active peer/PSK first and performing a compensating rollback on storage if
kernel update fails. Change lp.rs so you call
self.peer_manager.check_active_peer(peer).await? and
self.peer_manager.query_peer(peer).await? to capture the existing active_peer
and its current preshared_key before calling
self.ecash_verifier.storage().update_peer_psk(...). Then after computing
encoded_psk and calling update_peer_psk, attempt to update the kernel peer (set
active_peer.preshared_key = Some(psk) and call
self.peer_manager.add_peer(active_peer).await?); if add_peer returns Err, call
self.ecash_verifier.storage().update_peer_psk(&peer.to_string(),
old_psk_option.map(|s| s.as_str())) .await to roll back to the previous PSK (and
handle the None case to clear it). Ensure you handle the case where query_peer()
returned None by either not updating storage or treating old_psk as None for
proper rollback.

---

Outside diff comments:
In `@common/wireguard/src/peer_controller/mod.rs`:
- Around line 217-229: The code currently removes storage and returns the IP
(remove_wireguard_peer, bw_storage_managers.remove, ip_pool.release) before
calling wg_api.remove_peer; change the flow so that you first call
wg_api.remove_peer(key) and only on success remove persistent state and release
the IP: call self.wg_api.remove_peer(key) first, and if it returns Ok then call
self.ecash_verifier.storage().remove_wireguard_peer(&key.to_string()).await?,
self.bw_storage_managers.remove(key), and if let Some(peer) = stored_peer && let
Some(ip_pair) = allocated_ip_pair(&peer) then self.ip_pool.release(ip_pair); if
wg_api.remove_peer fails, return the error without altering storage or releasing
the IP.
- Around line 338-370: After configuring the kernel peer via
self.wg_api.configure_peer(peer) but before returning on a failure of
self.ip_pool.confirm_allocation(ip_pair), perform a rollback by removing the
kernel peer (e.g. call self.wg_api.remove_peer(&public_key) or the appropriate
removal method) so the kernel does not retain an untracked peer; attempt the
removal and log any removal error (or increment a metric) but return the
original confirm_allocation error to the caller. Ensure you use the
already-cloned public_key variable, await the removal if the API is async, and
keep the existing metrics/return behavior (i.e., removal errors should not mask
the original confirm_allocation error).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 89bbaac1-63ce-4234-ba8a-b92a3e017d87

📥 Commits

Reviewing files that changed from the base of the PR and between d23a42f and 24b3d02.

📒 Files selected for processing (7)

common/credential-verification/src/bandwidth_storage_manager.rs
common/credential-verification/src/client_bandwidth.rs
common/gateway-storage/src/wireguard_peers.rs
common/wireguard/src/peer_controller/mod.rs
common/wireguard/src/peer_handle.rs
common/wireguard/src/peer_storage_manager.rs
gateway/src/node/wireguard/new_peer_registration/lp.rs

fix(wireguard): prevent bandwidth overcounting

24b3d02

coderabbitai Bot reviewed May 4, 2026

View reviewed changes

Comment thread common/wireguard/src/peer_controller/mod.rs

Comment thread gateway/src/node/wireguard/new_peer_registration/lp.rs

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(wireguard): prevent bandwidth overcounting#6728

fix(wireguard): prevent bandwidth overcounting#6728
nmnmcc wants to merge 1 commit intonymtech:developfrom
nmnmcc:wireguard-bandwidth-overcount-fix

nmnmcc commented May 4, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

vercel Bot commented May 4, 2026

Uh oh!

github-actions Bot commented May 4, 2026

Uh oh!

coderabbitai Bot commented May 4, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

nmnmcc commented May 4, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Tests

Verification

Summary by CodeRabbit

Uh oh!

vercel Bot commented May 4, 2026

Uh oh!

github-actions Bot commented May 4, 2026

Uh oh!

coderabbitai Bot commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

nmnmcc commented May 4, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 4, 2026 •

edited

Loading