fix(d2d): make device-to-device discovery work on multi-homed / multicast-hostile networks #48

Open
armwaheed wants to merge 1 commit into arm:main from armwaheed:deep-robotics

@armwaheed

Make D2D discovery work on multi-homed / multicast-hostile networks

Summary

A partner integration reported that an agent host could not discover or invoke a device on the same LAN: discover_devices() returned [] and invoke_device() failed with no responders after a 30 s timeout. Both hosts were on the same subnet, in insecure zero-infrastructure D2D (device-to-device) mode, and both correctly auto-selected the Zenoh peer backend.

The initial read was "this is the network layer, not a code bug." Half true — D2D discovery rides Zenoh's UDP multicast scouting, which is genuinely fragile on real robot/Wi-Fi networks. But we reproduced the identical symptom on a LAN where multicast provably works, which isolates two real, fixable code defects. This PR fixes both.

Root cause

D2D mode has no registry service; devices announce presence and peers discover each other over Zenoh. Two things break this on real-world multi-homed hosts:

  1. The reliable unicast escape hatch was booby-trapped. When multicast is unreliable, the documented workaround is to pin a direct unicast link with ZENOH_CONNECT=tcp/<device>:7447. But the agent's connect() path treated any explicit endpoint as "infrastructure mode" and routed discovery to a registry service that a zero-infra device doesn't run — so it timed out and returned [] unless the caller also knew to set DEVICE_CONNECT_DISCOVERY_MODE=d2d. The MCP bridge (mcp/bridge.py::_is_d2d_mode) already had the correct logic; connection.py simply diverged from it. That inconsistency is the bug.

  2. Multicast scouting bound to the wrong NIC. The Zenoh adapter never set a scouting interface, so Zenoh's default interface:"auto" roulettes across whatever NICs a host has. Robots and laptops are heavily multi-homed (a private motor-control NIC, docker0, VPN utun*, Apple awdl0/llw0, …), so two peers on the same LAN intermittently never form a session.

The fix

| File | Change |
| --- | --- |
| agent-tools/connection.py | Discovery mode now mirrors mcp/bridge.py: auto (default) = D2D whenever the backend is Zenoh, including with ZENOH_CONNECT. The unicast endpoint is kept so the agent connects straight to the peer while still discovering it via presence. Opt into the registry with DEVICE_CONNECT_DISCOVERY_MODE=infra. |
| edge/messaging/zenoh_adapter.py | New ZENOH_MULTICAST_INTERFACE env var (and multicast_interface kwarg) pins multicast scouting to a chosen NIC (name or IP) so zero-config discovery is deterministic on multi-homed hosts. |
| READMEs | Documented both knobs + a "Multi-homed hosts & unreliable multicast" section. |

Both changes are backward compatible: default behavior is unchanged when neither the env var nor ZENOH_CONNECT is set, and non-Zenoh backends are unaffected.
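
The mode selection described in the table can be sketched roughly as follows. This is a minimal illustration, not the code in connection.py: the function name `resolve_discovery_mode`, the `backend` string values, and the assumption that non-Zenoh backends keep using the registry are hypothetical. Only the env var name and the `auto` / `d2d` / `infra` values come from this PR.

```python
import os
from typing import Mapping, Optional


def resolve_discovery_mode(backend: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return "d2d" or "infra" for a messaging backend (hypothetical helper)."""
    env = os.environ if env is None else env
    mode = env.get("DEVICE_CONNECT_DISCOVERY_MODE", "auto").strip().lower()

    if mode == "infra":
        return "infra"  # explicit opt-in to the registry
    if mode == "d2d":
        return "d2d"    # explicit opt-in to presence discovery

    # "auto" (or an unrecognised value): D2D whenever the backend is Zenoh.
    # ZENOH_CONNECT is deliberately not consulted here; it only chooses the
    # transport endpoint, not the discovery mode.
    # Assumption: non-Zenoh backends keep using the registry.
    return "d2d" if backend == "zenoh" else "infra"
```

With this shape, `resolve_discovery_mode("zenoh", {"ZENOH_CONNECT": "tcp/192.0.2.10:7447"})` yields `"d2d"`, which is the case that previously needed the extra env var.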

How to reproduce

Two hosts on one LAN, both insecure D2D (the device is the repo's quick-start sensor demo, device_id=sensor-001; the agent is the agent-tools client). No physical sensor required — any two hosts reproduce it.

# Device:
DEVICE_CONNECT_ALLOW_INSECURE=true python sensor.py
# Agent:
DEVICE_CONNECT_ALLOW_INSECURE=true python client.py
#   discover_devices(device_type="sensor") -> []
#   invoke_device("sensor-001", "get_reading") -> "no responders" (30 s)

Decisive test — pin a direct unicast link, bypassing multicast entirely:

# Device:
ZENOH_LISTEN=tcp/0.0.0.0:7447 DEVICE_CONNECT_ALLOW_INSECURE=true python sensor.py
# Agent — BEFORE this PR you also needed DEVICE_CONNECT_DISCOVERY_MODE=d2d or it returned []:
ZENOH_CONNECT=tcp/<device-ip>:7447 DEVICE_CONNECT_ALLOW_INSECURE=true python client.py

Differential diagnosis — speculative causes eliminated

Reproduced with two hosts on one Wi-Fi LAN: a multi-homed robot (private motor-control NIC on eth0, Wi-Fi on wlan0, docker0) and a multi-homed laptop (en0 Wi-Fi plus awdl0/llw0/utun*/bridge0).

| # | Hypothesis | How it was tested | Observation | Verdict |
| --- | --- | --- | --- | --- |
| 1 | Misconfigured credentials / auth | Both ends run DEVICE_CONNECT_ALLOW_INSECURE=true (auth bypassed on both) | No auth in the path | ❌ Eliminated |
| 2 | Wrong backend / not really in D2D | Compared device startup logs | Both auto-select Zenoh D2D peer mode; healthy, identical logs | ❌ Eliminated |
| 3 | Multicast physically blocked on the LAN | Raw Zenoh pub/sub, device → agent | 16/16 messages delivered | ❌ Eliminated (multicast works here) |
| 4 | Zenoh session never forms (real adapter) | Ran the real ZenohAdapter on both hosts; dumped connected peer ZIDs + presence | Session forms; presence flows both ways | ❌ Eliminated as a blanket transport failure |
| 5 | Agent leaves D2D when given an explicit endpoint | Ran agent with ZENOH_CONNECT only | Routed to registry → discovery timed out (no responders), DISCOVERED: [] | Root cause #1 |
| 6 | Multicast scout binds the wrong NIC (multi-homed) | interface=auto vs pinned, multicast-only | auto intermittently fails to form a session; pinned is deterministic | Root cause #2 |

Fix validation (on hardware)

| Scenario | Before | After |
| --- | --- | --- |
| Agent with ZENOH_CONNECT=tcp/<device>:7447 only (no DISCOVERY_MODE) | discovery timed out (no responders), DISCOVERED: [] | DISCOVERED: [sensor-001], invoke → {temperature: 22.5, humidity: 45} |
| Multicast-only with ZENOH_MULTICAST_INTERFACE pinned to the LAN NIC | flaky / no session | session + presence both directions |

Tests

| Suite | Result |
| --- | --- |
| device-connect-edge unit (excl. fuzz) | 529 passed, 1 skipped |
| test_zenoh_adapter.py | 72 passed (incl. 2 new) |
| test_connection_unit.py | 26 passed (incl. 4 new) |

New tests assert: an explicit ZENOH_CONNECT keeps D2D mode and retains the unicast endpoint; DISCOVERY_MODE=infra opts out; and ZENOH_MULTICAST_INTERFACE / the multicast_interface kwarg pin the scout interface.

Customer-facing guidance (TL;DR for the field)

On a robot/Wi-Fi network where D2D discovery returns []:

  • Multi-homed host? Pin the LAN NIC: ZENOH_MULTICAST_INTERFACE=wlan0.
  • Multicast blocked (AP/client isolation, managed switch)? Use a direct unicast link — device: ZENOH_LISTEN=tcp/0.0.0.0:7447; agent: ZENOH_CONNECT=tcp/<device-ip>:7447. After this PR the agent no longer needs a second magic env var.
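
For readers wondering what these knobs touch underneath, here is a hedged sketch of how the three env vars could map onto Zenoh configuration keys. The helper name `zenoh_overrides`, the kwarg-over-env precedence, and the single-endpoint handling are assumptions for illustration; the adapter's real code may differ. The keys `scouting/multicast/interface`, `connect/endpoints`, and `listen/endpoints` are standard Zenoh config paths, and values are JSON-encoded because zenoh-python typically applies such pairs via `Config.insert_json5(key, value)` (verify against your installed version).

```python
import json
import os
from typing import Dict, Mapping, Optional


def zenoh_overrides(env: Optional[Mapping[str, str]] = None,
                    multicast_interface: Optional[str] = None) -> Dict[str, str]:
    """Map the env vars from this PR to Zenoh config keys (hypothetical helper).

    Returns {config_key: json_encoded_value}; nothing is sent to Zenoh here.
    """
    env = os.environ if env is None else env
    overrides: Dict[str, str] = {}

    # Assumed precedence: an explicit kwarg wins over the environment.
    iface = multicast_interface or env.get("ZENOH_MULTICAST_INTERFACE")
    if iface:
        # NIC name or IP; leaving this unset keeps Zenoh's default of "auto".
        overrides["scouting/multicast/interface"] = json.dumps(iface)

    connect = env.get("ZENOH_CONNECT")
    if connect:
        # Direct unicast link to the peer, bypassing multicast scouting.
        overrides["connect/endpoints"] = json.dumps([connect])

    listen = env.get("ZENOH_LISTEN")
    if listen:
        overrides["listen/endpoints"] = json.dumps([listen])

    return overrides
```

The agent side of the unicast workaround would then produce a `connect/endpoints` entry only, while the multi-homed case produces a `scouting/multicast/interface` entry only.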

…cast-hostile networks

A partner integration reported that an agent host could not discover or
invoke a device on the same LAN: discovery returned [] and invoke failed
with "no responders", with both hosts in insecure zero-infra D2D mode.

Reproduced on a multi-homed robot (private motor-control NIC + Wi-Fi +
docker0) talking to a multi-homed laptop (Wi-Fi + AWDL/VPN interfaces),
on a LAN where multicast provably works — isolating two code defects:

1. agent-tools connection.py silently dropped out of D2D presence
   discovery into registry mode whenever an explicit ZENOH_CONNECT
   endpoint was set. Against a zero-infra device (no registry) that
   timed out and returned [] unless the caller also set
   DEVICE_CONNECT_DISCOVERY_MODE=d2d. connect() now mirrors
   mcp.bridge._is_d2d_mode: "auto" (default) means D2D whenever the
   backend is Zenoh, including with ZENOH_CONNECT, so the reliable
   unicast path (the recommended workaround when multicast is blocked)
   works with no extra env var. Opt into the registry with
   DEVICE_CONNECT_DISCOVERY_MODE=infra.

2. The Zenoh adapter now honours ZENOH_MULTICAST_INTERFACE (and a
   multicast_interface kwarg) to pin multicast scouting to a NIC. On
   multi-homed hosts Zenoh's default interface="auto" can bind the scout
   to the wrong interface, so peers on the same LAN intermittently never
   form a session. Pinning to the LAN interface makes zero-config
   discovery deterministic.

Unit tests added for both fixes; full edge suite and the agent-tools
connection/adapter suites pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@armwaheed

Author

@atsyplikhin, please review

@atsyplikhin

Collaborator

Nice fix — aligning connection.py with mcp/bridge.py::_is_d2d_mode is the right call, and the ZENOH_MULTICAST_INTERFACE knob targets the real root cause. Verified the core behavior (explicit ZENOH_CONNECT + D2D presence discovery) works end-to-end on a two-process repro. A few suggestions, mostly aimed at developer experience for the first-run-on-a-robot/Wi-Fi case.

One correctness note worth flagging

The default discovery path now flips to D2D for any Zenoh backend, including with ZENOH_CONNECT — but the device side (device.py:469) was not changed to match. A device started with ZENOH_CONNECT stays _d2d_mode=False, registers with the registry, and never starts the presence announcer (gated at device.py:1734).

  • Partner's case (device in pure D2D, agent pins ZENOH_CONNECT to reach it): works
  • Infra deployment (router + registry, both sides using ZENOH_CONNECT): the agent now defaults to presence discovery while the device only registered with the registry → agent finds nothing unless it sets DEVICE_CONNECT_DISCOVERY_MODE=infra.

Suggest calling this out in the PR description/changelog as a behavior change for infra Zenoh users, and deciding whether the device side should get the same auto treatment for symmetry (identical env → identical behavior on both ends).
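
The asymmetry can be made concrete with a small model. This is a simplified sketch of the behavior described above, assuming a Zenoh backend on both ends; `agent_mode`, `device_mode`, and `can_discover` are hypothetical names, and the real condition in device.py may consult more settings than ZENOH_CONNECT.

```python
from typing import Mapping


def agent_mode(env: Mapping[str, str]) -> str:
    """Agent after this PR: "auto" means D2D on Zenoh, with or without ZENOH_CONNECT."""
    mode = env.get("DEVICE_CONNECT_DISCOVERY_MODE", "auto").lower()
    return "infra" if mode == "infra" else "d2d"


def device_mode(env: Mapping[str, str]) -> str:
    """Device (unchanged): an explicit ZENOH_CONNECT means registry mode.

    Simplified model: _d2d_mode stays False when ZENOH_CONNECT is set, so the
    presence announcer never starts.
    """
    return "infra" if env.get("ZENOH_CONNECT") else "d2d"


def can_discover(agent_env: Mapping[str, str], device_env: Mapping[str, str]) -> bool:
    """Discovery only works when both ends use the same mechanism."""
    return agent_mode(agent_env) == device_mode(device_env)


router = {"ZENOH_CONNECT": "tcp/192.0.2.1:7447"}  # placeholder router address

# Partner's case: pure-D2D device, agent pins a unicast link to it.
partner_ok = can_discover({"ZENOH_CONNECT": "tcp/192.0.2.10:7447"}, {})

# Infra deployment: both ends point at a router; the agent now defaults to D2D.
infra_ok = can_discover(router, router)

# Same deployment with the agent opting back into the registry.
infra_fixed = can_discover({**router, "DEVICE_CONNECT_DISCOVERY_MODE": "infra"}, router)
```

Under this model `partner_ok` and `infra_fixed` are true while `infra_ok` is false, which is the behavior change for infra Zenoh users.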

DX suggestions

  1. Make 0-peer discovery self-diagnosing (highest leverage). Today discover_devices() returns a silent [] and, 30s later, invoke returns no responders — with no hint why. At the wait_for_peers 0-result path, emit something actionable, e.g.:

    D2D discovery found 0 peers after 3.0s via Zenoh multicast.
      Interfaces seen: en0 (192.168.19.17), docker0 (172.17.0.1), utun3 (10.2.0.2)
      Multicast is likely blocked or scouting the wrong NIC. Try:
        - ZENOH_MULTICAST_INTERFACE=en0      (pin scouting to your LAN NIC)
        - ZENOH_CONNECT=tcp/<device-ip>:7447 (skip multicast, direct link)
    

    This puts the knowledge from the README into the runtime, where nobody has to go find it.

  2. Fail fast instead of the 30s dead-wait. If the target device isn't in the presence table, raise immediately (device 'X' not found among N discovered peers) rather than blocking the full request_timeout. Distinguishing "no Zenoh session" from "session up, device absent" is the difference between a network fix and a config fix.

  3. Remove the NIC guesswork. ZENOH_MULTICAST_INTERFACE still requires the dev to know their interface name. Cheap win: log candidate interfaces + IPs at connect time so the value to set is right there. Follow-up: auto-select the interface whose subnet matches the default route / the ZENOH_CONNECT peer and pin it (mirrors the strategy robot DDS configs already use).

  4. Ship a doctor preflight (e.g. python -m device_connect_agent_tools doctor) that enumerates NICs, checks 7446/udp + 7447/tcp, fires one multicast scout to see if any peer answers, and prints the exact env vars to set. Same pattern as tailscale/docker/kubectl — turns "it doesn't work" tickets into self-service.

  5. Smaller polish:

    • Warn on unknown DEVICE_CONNECT_DISCOVERY_MODE values (e.g. someone types registry) instead of silently falling through to D2D.
    • Plumb multicast_interface as a first-class DeviceRuntime(...) / connect() kwarg, not only an adapter kwarg/env.
    • Add an inline "returned [] / no responders?" troubleshooting box in the quick-start, and ship the sensor.py + client.py repro as official examples.
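
As a rough illustration of suggestion 1, the zero-peer hint could be assembled like this. `zero_peer_hint` is a hypothetical helper, not an existing function in the repo; enumerating NICs is left to the caller, and recommending the first interface in the list is a placeholder heuristic (the caller is assumed to order candidates with the likely LAN NIC first).

```python
from typing import Iterable, Tuple


def zero_peer_hint(elapsed_s: float, interfaces: Iterable[Tuple[str, str]]) -> str:
    """Build an actionable message for the 0-peer discovery path (hypothetical)."""
    ifaces = list(interfaces)  # (name, address) pairs supplied by the caller
    seen = ", ".join(f"{name} ({addr})" for name, addr in ifaces) or "none"

    lines = [
        f"D2D discovery found 0 peers after {elapsed_s:.1f}s via Zenoh multicast.",
        f"  Interfaces seen: {seen}",
        "  Multicast is likely blocked or scouting the wrong NIC. Try:",
    ]
    if ifaces:
        # Placeholder heuristic: suggest the first candidate interface.
        lines.append(
            f"    - ZENOH_MULTICAST_INTERFACE={ifaces[0][0]}  (pin scouting to your LAN NIC)"
        )
    lines.append("    - ZENOH_CONNECT=tcp/<device-ip>:7447  (skip multicast, direct link)")
    return "\n".join(lines)


if __name__ == "__main__":
    print(zero_peer_hint(3.0, [("en0", "192.168.19.17"), ("docker0", "172.17.0.1")]))
```

Keeping the formatter free of I/O makes it easy to unit-test and to reuse from both the discovery path and a future doctor command.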

Scope

I'd keep this PR tight: fold in #1, #3-cheap (log interfaces), the unknown-mode warning, and the README box now; split #2 (fast-fail), #4 (doctor), and interface auto-select into a follow-up so they get proper tests. The PR already gives an expert the knobs — #1 + #4 are what tell a first-timer which knob to turn.
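
Of the items proposed for this PR, the unknown-mode warning is the smallest. A minimal sketch, assuming the three mode values named in this thread are the full set; the helper name `read_discovery_mode`, the logger name, and the fallback to `auto` are illustrative choices rather than existing code.

```python
import logging
from typing import Mapping

logger = logging.getLogger("device_connect.discovery")  # hypothetical logger name

KNOWN_MODES = ("auto", "d2d", "infra")


def read_discovery_mode(env: Mapping[str, str]) -> str:
    """Normalise DEVICE_CONNECT_DISCOVERY_MODE, warning on unknown values."""
    raw = env.get("DEVICE_CONNECT_DISCOVERY_MODE", "auto")
    mode = raw.strip().lower()
    if mode not in KNOWN_MODES:
        # Make the typo visible instead of silently falling through to D2D.
        logger.warning(
            "Unknown DEVICE_CONNECT_DISCOVERY_MODE=%r; expected one of %s. "
            "Falling back to 'auto'.",
            raw,
            ", ".join(KNOWN_MODES),
        )
        return "auto"
    return mode
```

Someone who types `registry` would then see a warning naming the accepted values rather than an unexplained empty discovery result.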
