fix(d2d): make device-to-device discovery work on multi-homed / multicast-hostile networks #48
armwaheed wants to merge 1 commit into
…cast-hostile networks

A partner integration reported that an agent host could not discover or invoke a device on the same LAN: discovery returned [] and invoke failed with "no responders", with both hosts in insecure zero-infra D2D mode. Reproduced on a multi-homed robot (private motor-control NIC + Wi-Fi + docker0) talking to a multi-homed laptop (Wi-Fi + AWDL/VPN interfaces), on a LAN where multicast provably works, isolating two code defects:

1. agent-tools connection.py silently dropped out of D2D presence discovery into registry mode whenever an explicit ZENOH_CONNECT endpoint was set. Against a zero-infra device (no registry) that timed out and returned [] unless the caller also set DEVICE_CONNECT_DISCOVERY_MODE=d2d. connect() now mirrors mcp.bridge._is_d2d_mode: "auto" (default) means D2D whenever the backend is Zenoh, including with ZENOH_CONNECT, so the reliable unicast path (the recommended workaround when multicast is blocked) works with no extra env var. Opt into the registry with DEVICE_CONNECT_DISCOVERY_MODE=infra.

2. The Zenoh adapter now honours ZENOH_MULTICAST_INTERFACE (and a multicast_interface kwarg) to pin multicast scouting to a NIC. On multi-homed hosts Zenoh's default interface="auto" can bind the scout to the wrong interface, so peers on the same LAN intermittently never form a session. Pinning to the LAN interface makes zero-config discovery deterministic.

Unit tests added for both fixes; full edge suite and the agent-tools connection/adapter suites pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@atsyplikhin, please review
Nice fix, aligning

One correctness note worth flagging

The default discovery path now flips to D2D for any Zenoh backend, including with
Suggest calling this out in the PR description/changelog as a behavior change for infra Zenoh users, and deciding whether the device side should get the same

DX suggestions
Scope

I'd keep this PR tight: fold in #1, #3-cheap (log interfaces), the unknown-mode warning, and the README box now; split #2 (fast-fail), #4 (doctor), and interface auto-select into a follow-up so they get proper tests. The PR already gives an expert the knobs; #1 + #4 are what tell a first-timer which knob to turn.
Make D2D discovery work on multi-homed / multicast-hostile networks
Summary
A partner integration reported that an agent host could not discover or invoke a device on the same LAN: `discover_devices()` returned `[]` and `invoke_device()` failed with `no responders` after a 30 s timeout. Both hosts were on the same subnet, in insecure zero-infrastructure D2D (device-to-device) mode, and both correctly auto-selected the Zenoh peer backend.

The initial read was "this is the network layer, not a code bug." That is half true: D2D discovery rides Zenoh's UDP multicast scouting, which is genuinely fragile on real robot/Wi-Fi networks. But we reproduced the identical symptom on a LAN where multicast provably works, which isolates two real, fixable code defects. This PR fixes both.
Root cause
D2D mode has no registry service; devices announce presence and peers discover each other over Zenoh. Two things break this on real-world multi-homed hosts:
1. The reliable unicast escape hatch was booby-trapped. When multicast is unreliable, the documented workaround is to pin a direct unicast link with `ZENOH_CONNECT=tcp/<device>:7447`. But the agent's `connect()` path treated any explicit endpoint as "infrastructure mode" and routed discovery to a registry service that a zero-infra device doesn't run, so it timed out and returned `[]` unless the caller also knew to set `DEVICE_CONNECT_DISCOVERY_MODE=d2d`. The MCP bridge (`mcp/bridge.py::_is_d2d_mode`) already had the correct logic; `connection.py` simply diverged from it. That inconsistency is the bug.
2. Multicast scouting bound to the wrong NIC. The Zenoh adapter never set a scouting interface, so Zenoh's default `interface: "auto"` can bind the scout to any of the NICs a host happens to have. Robots and laptops are heavily multi-homed (a private motor-control NIC, `docker0`, VPN `utun*`, Apple `awdl0`/`llw0`, …), so two peers on the same LAN intermittently never form a session.

The fix
| File | Change |
| --- | --- |
| `agent-tools/connection.py` | Mirrors `mcp/bridge.py`: `auto` (default) = D2D whenever the backend is Zenoh, including with `ZENOH_CONNECT`. The unicast endpoint is kept so the agent connects straight to the peer while still discovering it via presence. Opt into the registry with `DEVICE_CONNECT_DISCOVERY_MODE=infra`. |
| `edge/messaging/zenoh_adapter.py` | `ZENOH_MULTICAST_INTERFACE` env var (and `multicast_interface` kwarg) pins multicast scouting to a chosen NIC (name or IP) so zero-config discovery is deterministic on multi-homed hosts. |

Both changes are backward compatible: default behavior is unchanged when neither `ZENOH_MULTICAST_INTERFACE` nor `ZENOH_CONNECT` is set, and non-Zenoh backends are unaffected.

How to reproduce
Two hosts on one LAN, both insecure D2D (the device is the repo's quick-start `sensor` demo, `device_id=sensor-001`; the agent is the agent-tools client). No physical sensor is required; any two hosts reproduce it.

Decisive test (pin a direct unicast link, bypassing multicast entirely):
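A minimal sketch of the environment for that test, written as a Python helper so it can be reused from a launcher script. The variable names and port 7447 come from this PR; the helper `unicast_test_env`, its `role` argument, and the example address are hypothetical, and how each process is launched is left out because it depends on the local checkout.

```python
import os


def unicast_test_env(role, device_host, base_env=None):
    """Return a process environment for one side of the unicast test.

    Hypothetical helper: only the variable names and port 7447 come from
    the PR text; the function itself is illustrative.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Both hosts run in insecure zero-infra D2D mode.
    env["DEVICE_CONNECT_ALLOW_INSECURE"] = "true"
    # Drop any inherited override so the default discovery path is exercised.
    env.pop("DEVICE_CONNECT_DISCOVERY_MODE", None)
    if role == "device":
        # The device listens for a direct TCP link on all interfaces.
        env["ZENOH_LISTEN"] = "tcp/0.0.0.0:7447"
    elif role == "agent":
        # The agent dials the device directly, so multicast is not involved.
        env["ZENOH_CONNECT"] = f"tcp/{device_host}:7447"
    else:
        raise ValueError(f"unknown role: {role!r}")
    return env


# Example with a placeholder address; substitute the device's LAN IP.
agent_env = unicast_test_env("agent", "192.0.2.10", base_env={})
print(agent_env["ZENOH_CONNECT"])  # tcp/192.0.2.10:7447
```

Each dictionary could be passed as `env=` to `subprocess.run` when starting the device or the agent. If discovery still returns `[]` with this link in place, multicast is unlikely to be the cause.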
Differential diagnosis — speculative causes eliminated
Reproduced with two hosts on one Wi-Fi LAN: a multi-homed robot (private motor-control NIC on `eth0`, Wi-Fi on `wlan0`, `docker0`) and a multi-homed laptop (`en0` Wi-Fi plus `awdl0`/`llw0`/`utun*`/`bridge0`).

- `DEVICE_CONNECT_ALLOW_INSECURE=true` (auth bypassed on both)
- `ZenohAdapter` on both hosts; dumped connected peer ZIDs + presence
- `ZENOH_CONNECT` only: discovery timed out (no responders), `DISCOVERED: []`
- `interface=auto` vs pinned, multicast-only: `auto` intermittently fails to form a session; pinned is deterministic

Fix validation (on hardware)
| Scenario | Before | After |
| --- | --- | --- |
| `ZENOH_CONNECT=tcp/<device>:7447` only (no `DISCOVERY_MODE`) | discovery timed out (no responders), `DISCOVERED: []` | `DISCOVERED: [sensor-001]`, invoke → `{temperature: 22.5, humidity: 45}` |
| `ZENOH_MULTICAST_INTERFACE` pinned to the LAN NIC | | |

Tests
- `device-connect-edge` unit (excl. fuzz)
- `test_zenoh_adapter.py`
- `test_connection_unit.py`

New tests assert: an explicit `ZENOH_CONNECT` keeps D2D mode and retains the unicast endpoint; `DISCOVERY_MODE=infra` opts out; and `ZENOH_MULTICAST_INTERFACE` / the `multicast_interface` kwarg pin the scout interface.

Customer-facing guidance (TL;DR for the field)
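On a robot/Wi-Fi network where D2D discovery returns `[]`:

1. Pin multicast scouting to the LAN NIC: `ZENOH_MULTICAST_INTERFACE=wlan0`.
2. If multicast is blocked, pin a direct unicast link. Device: `ZENOH_LISTEN=tcp/0.0.0.0:7447`; agent: `ZENOH_CONNECT=tcp/<device-ip>:7447`. After this PR the agent no longer needs a second magic env var.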
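
The two knobs above can be sketched in a few lines of Python. This illustrates the rules described in this PR and is not the code in `connection.py` or `zenoh_adapter.py`: the function names, the `backend` argument, the treatment of unrecognised mode values as `auto`, and the mapping onto Zenoh configuration keys are all assumptions.

```python
import json
import os


def is_d2d_mode(backend, env=None):
    """Decide whether discovery should use D2D presence (illustrative)."""
    env = os.environ if env is None else env
    mode = env.get("DEVICE_CONNECT_DISCOVERY_MODE", "auto").strip().lower()
    if mode == "d2d":
        return True
    if mode == "infra":
        return False
    # "auto" (assumed to also cover unrecognised values): D2D whenever the
    # backend is Zenoh. ZENOH_CONNECT is deliberately not consulted, so an
    # explicit unicast endpoint no longer flips discovery to the registry.
    return backend == "zenoh"


def zenoh_overrides(env=None):
    """Map the env vars named in this PR to Zenoh config keys (assumed mapping)."""
    env = os.environ if env is None else env
    overrides = {}
    iface = env.get("ZENOH_MULTICAST_INTERFACE")
    if iface:
        # Pin multicast scouting to one NIC instead of Zenoh's default "auto".
        overrides["scouting/multicast/interface"] = json.dumps(iface)
    connect = env.get("ZENOH_CONNECT")
    if connect:
        # Keep the unicast endpoint so the agent dials the peer directly.
        overrides["connect/endpoints"] = json.dumps([connect])
    listen = env.get("ZENOH_LISTEN")
    if listen:
        overrides["listen/endpoints"] = json.dumps([listen])
    return overrides


# Placeholder address and interface name, for illustration only.
example_env = {
    "ZENOH_CONNECT": "tcp/192.0.2.10:7447",
    "ZENOH_MULTICAST_INTERFACE": "wlan0",
}
print(is_d2d_mode("zenoh", example_env))  # True
print(zenoh_overrides(example_env))
```

Each key/value pair is in the JSON form that zenoh-python's `Config.insert_json5(key, value)` accepts, so an adapter could apply the overrides in a loop; whether the real adapter does it this way is not shown in this PR.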