fix(wireguard): prevent bandwidth overcounting#6728
fix(wireguard): prevent bandwidth overcounting#6728nmnmcc wants to merge 1 commit intonymtech:developfrom
Conversation
|
@nmnmcc is attempting to deploy a commit to the nyx-network Team on Vercel. A member of the Team first needs to authorize it. |
|
Thank you for making this first PR |
📝 WalkthroughWalkthroughThis PR introduces bandwidth synchronization atomicity and idempotent peer lifecycle management. A new ChangesBandwidth Synchronization and Peer Lifecycle Atomicity
Sequence DiagramsequenceDiagram
participant PeerCtrl as PeerController
participant ClientBW as ClientBandwidth
participant BWSyncMgr as BandwidthStorageManager
participant Storage as Storage
participant WGKernel as WireGuard Kernel
Note over PeerCtrl,WGKernel: Duplicate Add (Idempotent Refresh)
PeerCtrl->>PeerCtrl: Detect existing peer
PeerCtrl->>PeerCtrl: Compute IP allocation from old IPs
PeerCtrl->>WGKernel: configure_peer(peer)
WGKernel-->>PeerCtrl: OK
PeerCtrl->>BWSyncMgr: sync_storage_bandwidth()
BWSyncMgr->>ClientBW: sync_guard()
ClientBW-->>BWSyncMgr: OwnedMutexGuard
BWSyncMgr->>ClientBW: take_delta_since_sync()
ClientBW-->>BWSyncMgr: delta (atomically reset)
BWSyncMgr->>Storage: increase_bandwidth(delta)
alt Storage Success
Storage-->>BWSyncMgr: updated
BWSyncMgr->>ClientBW: resync_bandwidth_with_storage(updated)
ClientBW-->>BWSyncMgr: OK (delta preserved)
else Storage Failure
Storage-->>BWSyncMgr: Error
BWSyncMgr->>ClientBW: restore_delta_since_sync(delta)
BWSyncMgr-->>PeerCtrl: Error
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 7/8 reviews remaining, refill in 7 minutes and 30 seconds.Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
common/wireguard/src/peer_controller/mod.rs (2)
217-229:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDelay IP release until the kernel peer is actually removed.
This now returns the IP pair to the pool before
wg_api.remove_peer(key)succeeds. If the kernel removal fails, that IP can be handed to a new peer while the stale peer is still attached, and this method has already dropped its storage/bandwidth tracking state.Suggested fix
- if let Some(peer) = stored_peer - && let Some(ip_pair) = allocated_ip_pair(&peer) - { - self.ip_pool.release(ip_pair) - } - let ret = self.wg_api.remove_peer(key); if ret.is_err() { nym_metrics::inc!("wg_peer_removal_failed"); error!( "Wireguard peer could not be removed from wireguard kernel module. Process should be restarted so that the interface is reset." ); } else { + if let Some(peer) = stored_peer + && let Some(ip_pair) = allocated_ip_pair(&peer) + { + self.ip_pool.release(ip_pair) + } nym_metrics::inc!("wg_peer_removal_success"); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@common/wireguard/src/peer_controller/mod.rs` around lines 217 - 229, The code currently removes storage and returns the IP (remove_wireguard_peer, bw_storage_managers.remove, ip_pool.release) before calling wg_api.remove_peer; change the flow so that you first call wg_api.remove_peer(key) and only on success remove persistent state and release the IP: call self.wg_api.remove_peer(key) first, and if it returns Ok then call self.ecash_verifier.storage().remove_wireguard_peer(&key.to_string()).await?, self.bw_storage_managers.remove(key), and if let Some(peer) = stored_peer && let Some(ip_pair) = allocated_ip_pair(&peer) then self.ip_pool.release(ip_pair); if wg_api.remove_peer fails, return the error without altering storage or releasing the IP.
338-370:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRollback the kernel peer if
confirm_allocationfails.By Line 369 the peer is already configured in the kernel, but an error from
confirm_allocationreturns before the manager is inserted or the handle is spawned. That leaves a live, untracked peer behind with no controller state to meter or remove it.Suggested fix
let mut handle = PeerHandle::new( peer.public_key.clone(), self.host_information.clone(), cached_peer_manager, bandwidth_storage_manager.clone(), self.request_tx.clone(), self.upgrade_mode.clone(), &self.shutdown_token, ); let public_key = peer.public_key.clone(); - self.ip_pool.confirm_allocation(ip_pair)?; + if let Err(err) = self.ip_pool.confirm_allocation(ip_pair) { + let _ = self.wg_api.remove_peer(&peer.public_key); + nym_metrics::inc!("wg_peer_addition_failed"); + return Err(err); + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@common/wireguard/src/peer_controller/mod.rs` around lines 338 - 370, After configuring the kernel peer via self.wg_api.configure_peer(peer) but before returning on a failure of self.ip_pool.confirm_allocation(ip_pair), perform a rollback by removing the kernel peer (e.g. call self.wg_api.remove_peer(&public_key) or the appropriate removal method) so the kernel does not retain an untracked peer; attempt the removal and log any removal error (or increment a metric) but return the original confirm_allocation error to the caller. Ensure you use the already-cloned public_key variable, await the removal if the API is async, and keep the existing metrics/return behavior (i.e., removal errors should not mask the original confirm_allocation error).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@common/wireguard/src/peer_controller/mod.rs`:
- Around line 306-316: The current code replaces the existing ClientBandwidth
instance stored in existing_bandwidth_storage_manager by assigning *guard =
Self::generate_bandwidth_manager(...), which creates a new
ClientBandwidth/sync_lock pair and leads to split in-memory balances between old
and new verifiers (see handle_get_client_bandwidth_by_key /
handle_query_verifier_by_key). Instead of swapping the whole manager, update the
existing manager in-place: call sync_storage_bandwidth() on the guard, then
mutate the fields of the existing ClientBandwidth (or expose a method like
update_from or refresh_verifier on ClientBandwidth) to replace/refresh its
verifier/storage handle and balances based on the result of
generate_bandwidth_manager(...) so that the sync_lock and existing clones remain
valid and all requests continue to authorize against the same in-memory state.
Ensure you keep the same instance behind
existing_bandwidth_storage_manager.inner() and only update its internals.
In `@gateway/src/node/wireguard/new_peer_registration/lp.rs`:
- Around line 23-38: Summary: storage PSK is updated before the live kernel
peer, risking divergence if kernel update fails; fix by reading the current
active peer/PSK first and performing a compensating rollback on storage if
kernel update fails. Change lp.rs so you call
self.peer_manager.check_active_peer(peer).await? and
self.peer_manager.query_peer(peer).await? to capture the existing active_peer
and its current preshared_key before calling
self.ecash_verifier.storage().update_peer_psk(...). Then after computing
encoded_psk and calling update_peer_psk, attempt to update the kernel peer (set
active_peer.preshared_key = Some(psk) and call
self.peer_manager.add_peer(active_peer).await?); if add_peer returns Err, call
self.ecash_verifier.storage().update_peer_psk(&peer.to_string(),
old_psk_option.map(|s| s.as_str())) .await to roll back to the previous PSK (and
handle the None case to clear it). Ensure you handle the case where query_peer()
returned None by either not updating storage or treating old_psk as None for
proper rollback.
---
Outside diff comments:
In `@common/wireguard/src/peer_controller/mod.rs`:
- Around line 217-229: The code currently removes storage and returns the IP
(remove_wireguard_peer, bw_storage_managers.remove, ip_pool.release) before
calling wg_api.remove_peer; change the flow so that you first call
wg_api.remove_peer(key) and only on success remove persistent state and release
the IP: call self.wg_api.remove_peer(key) first, and if it returns Ok then call
self.ecash_verifier.storage().remove_wireguard_peer(&key.to_string()).await?,
self.bw_storage_managers.remove(key), and if let Some(peer) = stored_peer && let
Some(ip_pair) = allocated_ip_pair(&peer) then self.ip_pool.release(ip_pair); if
wg_api.remove_peer fails, return the error without altering storage or releasing
the IP.
- Around line 338-370: After configuring the kernel peer via
self.wg_api.configure_peer(peer) but before returning on a failure of
self.ip_pool.confirm_allocation(ip_pair), perform a rollback by removing the
kernel peer (e.g. call self.wg_api.remove_peer(&public_key) or the appropriate
removal method) so the kernel does not retain an untracked peer; attempt the
removal and log any removal error (or increment a metric) but return the
original confirm_allocation error to the caller. Ensure you use the
already-cloned public_key variable, await the removal if the API is async, and
keep the existing metrics/return behavior (i.e., removal errors should not mask
the original confirm_allocation error).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 89bbaac1-63ce-4234-ba8a-b92a3e017d87
📒 Files selected for processing (7)
common/credential-verification/src/bandwidth_storage_manager.rscommon/credential-verification/src/client_bandwidth.rscommon/gateway-storage/src/wireguard_peers.rscommon/wireguard/src/peer_controller/mod.rscommon/wireguard/src/peer_handle.rscommon/wireguard/src/peer_storage_manager.rsgateway/src/node/wireguard/new_peer_registration/lp.rs
Summary
Fix WireGuard bandwidth overcounting in several reconnect / re-registration paths.
The main issue was that repeated
AddPeerrequests for the same WireGuard public key could create multiple activePeerHandles. Each handle observed the same cumulative kernel peer counters and deducted bandwidth independently, which could multiply reported usage. Re-adding a peer could also seed the cached counters from zero, causing historical kernel traffic to be charged again.This PR also fixes storage sync races that could double-apply a pending bandwidth delta, and fixes PSK persistence/update behavior during peer refresh.
Changes
Make
PeerController::handle_add_requestidempotent for already-active peers.PeerHandle.Seed new WireGuard peer cache from current kernel counters when the peer already exists in the kernel.
rx_bytes + tx_byteson first poll after re-add.Make credential bandwidth storage sync atomic per client.
take_delta_since_sync/ restore-on-error instead of read-then-clear.Fix WireGuard PSK update path.
Persist WireGuard peer PSK in gateway storage.
insert_peernow inserts and updates thepskcolumn.Tests
Added regression tests covering the overcounting cases:
client_bandwidth::old_read_only_sync_could_apply_the_same_delta_twiceclient_bandwidth::resync_preserves_delta_accumulated_during_storage_syncclient_bandwidth::failed_sync_restores_reserved_deltapeer_controller::duplicate_add_for_active_peer_is_idempotentpeer_storage_manager::seeding_cache_from_empty_peer_recounts_all_existing_kernel_trafficwireguard_peers::insert_peer_persists_psk_on_insert_and_updateVerification
Ran:
This change is
Summary by CodeRabbit
New Features
Bug Fixes