Skip to content

fix(wireguard): prevent bandwidth overcounting#6728

Open
nmnmcc wants to merge 1 commit intonymtech:developfrom
nmnmcc:wireguard-bandwidth-overcount-fix
Open

fix(wireguard): prevent bandwidth overcounting#6728
nmnmcc wants to merge 1 commit intonymtech:developfrom
nmnmcc:wireguard-bandwidth-overcount-fix

Conversation

@nmnmcc
Copy link
Copy Markdown

@nmnmcc nmnmcc commented May 4, 2026

Summary

Fix WireGuard bandwidth overcounting in several reconnect / re-registration paths.

The main issue was that repeated AddPeer requests for the same WireGuard public key could create multiple active PeerHandles. Each handle observed the same cumulative kernel peer counters and deducted bandwidth independently, which could multiply reported usage. Re-adding a peer could also seed the cached counters from zero, causing historical kernel traffic to be charged again.

This PR also fixes storage sync races that could double-apply a pending bandwidth delta, and fixes PSK persistence/update behavior during peer refresh.

Changes

  • Make PeerController::handle_add_request idempotent for already-active peers.

    • Reuses the existing active peer manager instead of spawning another PeerHandle.
    • Syncs existing storage state before replacing the bandwidth manager.
    • Updates allowed IPs in place.
    • Rolls back newly allocated IPs if kernel peer configuration fails.
    • Releases the old IP pair after a successful IP change.
  • Seed new WireGuard peer cache from current kernel counters when the peer already exists in the kernel.

    • Prevents recharging historical rx_bytes + tx_bytes on first poll after re-add.
  • Make credential bandwidth storage sync atomic per client.

    • Adds a sync lock.
    • Uses take_delta_since_sync / restore-on-error instead of read-then-clear.
    • Preserves deltas accumulated while storage sync is in progress.
    • Prevents two concurrent syncs from applying the same delta twice.
  • Fix WireGuard PSK update path.

    • Updating PSK no longer removes the active peer first.
    • Active peer state is refreshed through the idempotent add path.
  • Persist WireGuard peer PSK in gateway storage.

    • insert_peer now inserts and updates the psk column.

Tests

Added regression tests covering the overcounting cases:

  • client_bandwidth::old_read_only_sync_could_apply_the_same_delta_twice

    • Reproduces the old duplicate delta flush behavior and verifies the fixed take-delta behavior.
  • client_bandwidth::resync_preserves_delta_accumulated_during_storage_sync

    • Verifies that deltas accumulated during storage sync are not lost.
  • client_bandwidth::failed_sync_restores_reserved_delta

    • Verifies that failed storage sync restores the reserved delta.
  • peer_controller::duplicate_add_for_active_peer_is_idempotent

    • Verifies that duplicate add requests for the same active public key do not break IP allocation or spawn duplicate accounting state.
  • peer_storage_manager::seeding_cache_from_empty_peer_recounts_all_existing_kernel_traffic

    • Reproduces the historical-kernel-counter recount issue and verifies that seeding from kernel state avoids immediate overcharging.
  • wireguard_peers::insert_peer_persists_psk_on_insert_and_update

    • Verifies PSK persistence on insert and update.

Verification

Ran:

cargo test -p nym-credential-verification client_bandwidth --lib
cargo test -p nym-wireguard --features mock peer_controller --lib
cargo test -p nym-wireguard --features mock peer_storage_manager --lib
cargo test -p nym-gateway-storage --lib insert_peer_persists_psk_on_insert_and_update
cargo test -p nym-gateway --lib wireguard

This change is Reviewable

Summary by CodeRabbit

  • New Features

    • Duplicate peer addition requests are now handled gracefully as idempotent refresh operations, avoiding unnecessary reconnections.
  • Bug Fixes

    • PSK updates no longer disconnect active WireGuard peers; connections are preserved during the update process.
    • Bandwidth synchronization now properly recovers from storage failures by restoring pending deltas.
    • Improved IP allocation handling during peer updates and removals.

@vercel
Copy link
Copy Markdown

vercel Bot commented May 4, 2026

@nmnmcc is attempting to deploy a commit to the nyx-network Team on Vercel.

A member of the Team first needs to authorize it.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 4, 2026

Thank you for making this first PR

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 4, 2026

📝 Walkthrough

Walkthrough

This PR introduces bandwidth synchronization atomicity and idempotent peer lifecycle management. A new sync_lock and delta-preservation mechanism in ClientBandwidth enable atomic sync operations with rollback on failure. Peer controllers now handle duplicate adds idempotently, PSK persistence is added to storage, and allowed_ips is wrapped for concurrent mutation.

Changes

Bandwidth Synchronization and Peer Lifecycle Atomicity

Layer / File(s) Summary
Sync Primitives
common/credential-verification/src/client_bandwidth.rs
ClientBandwidth gains sync_lock field and new async methods: sync_guard(), take_delta_since_sync(), restore_delta_since_sync(). resync_bandwidth_with_storage() now preserves accumulated bytes_delta_since_sync instead of clearing it.
Atomic Storage Sync
common/credential-verification/src/bandwidth_storage_manager.rs
sync_storage_bandwidth() acquires sync guard, takes delta, and matches on storage result; failed updates restore delta before returning error. expire_bandwidth() acquires guard before resetting state.
Peer Storage & State
common/gateway-storage/src/wireguard_peers.rs, common/wireguard/src/peer_handle.rs
WgPeerManager::insert_peer() now stores and updates PSK alongside allowed IPs and client ID. SharedBandwidthStorageManager wraps allowed_ips in Arc<RwLock<>> and provides async allowed_ips() and set_allowed_ips() accessors.
Peer Lifecycle Idempotency
common/wireguard/src/peer_controller/mod.rs, gateway/src/node/wireguard/new_peer_registration/lp.rs
handle_add_request() detects and idempotently handles duplicate adds by recomputing IP allocation, reconfiguring peer if needed, and reusing bandwidth entry. update_peer_psk() now queries and preserves active peers, updating PSK in-memory and re-adding instead of removing. remove_peer() queries peer before removing storage entries.
Initialization & Refactoring
common/wireguard/src/peer_controller/mod.rs
Non-duplicate add path probes kernel host interface data to seed cached peer state. ip_to_key() refactored from closure chain to explicit loop for clarity.
Tests & Validation
common/credential-verification/src/client_bandwidth.rs, common/gateway-storage/src/wireguard_peers.rs, common/wireguard/src/peer_controller/mod.rs, common/wireguard/src/peer_storage_manager.rs
Delta preservation across resync, delta restoration on sync failure, PSK persistence on insert/update, and duplicate-add idempotency tests added; consumed bandwidth calculation test validates initialization from empty peer.

Sequence Diagram

sequenceDiagram
    participant PeerCtrl as PeerController
    participant ClientBW as ClientBandwidth
    participant BWSyncMgr as BandwidthStorageManager
    participant Storage as Storage
    participant WGKernel as WireGuard Kernel

    Note over PeerCtrl,WGKernel: Duplicate Add (Idempotent Refresh)
    PeerCtrl->>PeerCtrl: Detect existing peer
    PeerCtrl->>PeerCtrl: Compute IP allocation from old IPs
    PeerCtrl->>WGKernel: configure_peer(peer)
    WGKernel-->>PeerCtrl: OK
    PeerCtrl->>BWSyncMgr: sync_storage_bandwidth()
    BWSyncMgr->>ClientBW: sync_guard()
    ClientBW-->>BWSyncMgr: OwnedMutexGuard
    BWSyncMgr->>ClientBW: take_delta_since_sync()
    ClientBW-->>BWSyncMgr: delta (atomically reset)
    BWSyncMgr->>Storage: increase_bandwidth(delta)
    alt Storage Success
        Storage-->>BWSyncMgr: updated
        BWSyncMgr->>ClientBW: resync_bandwidth_with_storage(updated)
        ClientBW-->>BWSyncMgr: OK (delta preserved)
    else Storage Failure
        Storage-->>BWSyncMgr: Error
        BWSyncMgr->>ClientBW: restore_delta_since_sync(delta)
        BWSyncMgr-->>PeerCtrl: Error
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

A hop through sync and state so fair,
Where deltas dance with utmost care—
Restore on fail, preserve through time,
Idempotent peers, each one sublime! 🐰✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title 'fix(wireguard): prevent bandwidth overcounting' directly and clearly summarizes the main change: addressing a bandwidth overcounting issue in WireGuard.
Docstring Coverage ✅ Passed Docstring coverage is 80.56% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 7/8 reviews remaining, refill in 7 minutes and 30 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
common/wireguard/src/peer_controller/mod.rs (2)

217-229: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Delay IP release until the kernel peer is actually removed.

This now returns the IP pair to the pool before wg_api.remove_peer(key) succeeds. If the kernel removal fails, that IP can be handed to a new peer while the stale peer is still attached, and this method has already dropped its storage/bandwidth tracking state.

Suggested fix
-        if let Some(peer) = stored_peer
-            && let Some(ip_pair) = allocated_ip_pair(&peer)
-        {
-            self.ip_pool.release(ip_pair)
-        }
-
         let ret = self.wg_api.remove_peer(key);
         if ret.is_err() {
             nym_metrics::inc!("wg_peer_removal_failed");
             error!(
                 "Wireguard peer could not be removed from wireguard kernel module. Process should be restarted so that the interface is reset."
             );
         } else {
+            if let Some(peer) = stored_peer
+                && let Some(ip_pair) = allocated_ip_pair(&peer)
+            {
+                self.ip_pool.release(ip_pair)
+            }
             nym_metrics::inc!("wg_peer_removal_success");
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@common/wireguard/src/peer_controller/mod.rs` around lines 217 - 229, The code
currently removes storage and returns the IP (remove_wireguard_peer,
bw_storage_managers.remove, ip_pool.release) before calling wg_api.remove_peer;
change the flow so that you first call wg_api.remove_peer(key) and only on
success remove persistent state and release the IP: call
self.wg_api.remove_peer(key) first, and if it returns Ok then call
self.ecash_verifier.storage().remove_wireguard_peer(&key.to_string()).await?,
self.bw_storage_managers.remove(key), and if let Some(peer) = stored_peer && let
Some(ip_pair) = allocated_ip_pair(&peer) then self.ip_pool.release(ip_pair); if
wg_api.remove_peer fails, return the error without altering storage or releasing
the IP.

338-370: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Rollback the kernel peer if confirm_allocation fails.

By Line 369 the peer is already configured in the kernel, but an error from confirm_allocation returns before the manager is inserted or the handle is spawned. That leaves a live, untracked peer behind with no controller state to meter or remove it.

Suggested fix
         let mut handle = PeerHandle::new(
             peer.public_key.clone(),
             self.host_information.clone(),
             cached_peer_manager,
             bandwidth_storage_manager.clone(),
             self.request_tx.clone(),
             self.upgrade_mode.clone(),
             &self.shutdown_token,
         );
         let public_key = peer.public_key.clone();
-        self.ip_pool.confirm_allocation(ip_pair)?;
+        if let Err(err) = self.ip_pool.confirm_allocation(ip_pair) {
+            let _ = self.wg_api.remove_peer(&peer.public_key);
+            nym_metrics::inc!("wg_peer_addition_failed");
+            return Err(err);
+        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@common/wireguard/src/peer_controller/mod.rs` around lines 338 - 370, After
configuring the kernel peer via self.wg_api.configure_peer(peer) but before
returning on a failure of self.ip_pool.confirm_allocation(ip_pair), perform a
rollback by removing the kernel peer (e.g. call
self.wg_api.remove_peer(&public_key) or the appropriate removal method) so the
kernel does not retain an untracked peer; attempt the removal and log any
removal error (or increment a metric) but return the original confirm_allocation
error to the caller. Ensure you use the already-cloned public_key variable,
await the removal if the API is async, and keep the existing metrics/return
behavior (i.e., removal errors should not mask the original confirm_allocation
error).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@common/wireguard/src/peer_controller/mod.rs`:
- Around line 306-316: The current code replaces the existing ClientBandwidth
instance stored in existing_bandwidth_storage_manager by assigning *guard =
Self::generate_bandwidth_manager(...), which creates a new
ClientBandwidth/sync_lock pair and leads to split in-memory balances between old
and new verifiers (see handle_get_client_bandwidth_by_key /
handle_query_verifier_by_key). Instead of swapping the whole manager, update the
existing manager in-place: call sync_storage_bandwidth() on the guard, then
mutate the fields of the existing ClientBandwidth (or expose a method like
update_from or refresh_verifier on ClientBandwidth) to replace/refresh its
verifier/storage handle and balances based on the result of
generate_bandwidth_manager(...) so that the sync_lock and existing clones remain
valid and all requests continue to authorize against the same in-memory state.
Ensure you keep the same instance behind
existing_bandwidth_storage_manager.inner() and only update its internals.

In `@gateway/src/node/wireguard/new_peer_registration/lp.rs`:
- Around line 23-38: Summary: storage PSK is updated before the live kernel
peer, risking divergence if kernel update fails; fix by reading the current
active peer/PSK first and performing a compensating rollback on storage if
kernel update fails. Change lp.rs so you call
self.peer_manager.check_active_peer(peer).await? and
self.peer_manager.query_peer(peer).await? to capture the existing active_peer
and its current preshared_key before calling
self.ecash_verifier.storage().update_peer_psk(...). Then after computing
encoded_psk and calling update_peer_psk, attempt to update the kernel peer (set
active_peer.preshared_key = Some(psk) and call
self.peer_manager.add_peer(active_peer).await?); if add_peer returns Err, call
self.ecash_verifier.storage().update_peer_psk(&peer.to_string(),
old_psk_option.map(|s| s.as_str())) .await to roll back to the previous PSK (and
handle the None case to clear it). Ensure you handle the case where query_peer()
returned None by either not updating storage or treating old_psk as None for
proper rollback.

---

Outside diff comments:
In `@common/wireguard/src/peer_controller/mod.rs`:
- Around line 217-229: The code currently removes storage and returns the IP
(remove_wireguard_peer, bw_storage_managers.remove, ip_pool.release) before
calling wg_api.remove_peer; change the flow so that you first call
wg_api.remove_peer(key) and only on success remove persistent state and release
the IP: call self.wg_api.remove_peer(key) first, and if it returns Ok then call
self.ecash_verifier.storage().remove_wireguard_peer(&key.to_string()).await?,
self.bw_storage_managers.remove(key), and if let Some(peer) = stored_peer && let
Some(ip_pair) = allocated_ip_pair(&peer) then self.ip_pool.release(ip_pair); if
wg_api.remove_peer fails, return the error without altering storage or releasing
the IP.
- Around line 338-370: After configuring the kernel peer via
self.wg_api.configure_peer(peer) but before returning on a failure of
self.ip_pool.confirm_allocation(ip_pair), perform a rollback by removing the
kernel peer (e.g. call self.wg_api.remove_peer(&public_key) or the appropriate
removal method) so the kernel does not retain an untracked peer; attempt the
removal and log any removal error (or increment a metric) but return the
original confirm_allocation error to the caller. Ensure you use the
already-cloned public_key variable, await the removal if the API is async, and
keep the existing metrics/return behavior (i.e., removal errors should not mask
the original confirm_allocation error).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 89bbaac1-63ce-4234-ba8a-b92a3e017d87

📥 Commits

Reviewing files that changed from the base of the PR and between d23a42f and 24b3d02.

📒 Files selected for processing (7)
  • common/credential-verification/src/bandwidth_storage_manager.rs
  • common/credential-verification/src/client_bandwidth.rs
  • common/gateway-storage/src/wireguard_peers.rs
  • common/wireguard/src/peer_controller/mod.rs
  • common/wireguard/src/peer_handle.rs
  • common/wireguard/src/peer_storage_manager.rs
  • gateway/src/node/wireguard/new_peer_registration/lp.rs

Comment thread common/wireguard/src/peer_controller/mod.rs
Comment thread gateway/src/node/wireguard/new_peer_registration/lp.rs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant