fix(d2d): make device-to-device discovery work on multi-homed / multicast-hostile networks by armwaheed · Pull Request #48 · arm/device-connect

armwaheed · 2026-06-03T05:53:29Z

Make D2D discovery work on multi-homed / multicast-hostile networks

Summary

A partner integration reported that an agent host could not discover or invoke a device on the same LAN: discover_devices() returned [] and invoke_device() failed with "no responders" after a 30 s timeout. Both hosts were on the same subnet, in insecure zero-infrastructure D2D (device-to-device) mode, and both correctly auto-selected the Zenoh peer backend.

The initial read was "this is the network layer, not a code bug." Half true — D2D discovery rides Zenoh's UDP multicast scouting, which is genuinely fragile on real robot/Wi-Fi networks. But we reproduced the identical symptom on a LAN where multicast provably works, which isolates two real, fixable code defects. This PR fixes both.

Root cause

D2D mode has no registry service; devices announce presence and peers discover each other over Zenoh. Two things break this on real-world multi-homed hosts:

1. The reliable unicast escape hatch was booby-trapped. When multicast is unreliable, the documented workaround is to pin a direct unicast link with ZENOH_CONNECT=tcp/<device>:7447. But the agent's connect() path treated any explicit endpoint as "infrastructure mode" and routed discovery to a registry service that a zero-infra device doesn't run, so it timed out and returned [] unless the caller also knew to set DEVICE_CONNECT_DISCOVERY_MODE=d2d. The MCP bridge (mcp/bridge.py::_is_d2d_mode) already had the correct logic; connection.py simply diverged from it. That inconsistency is the bug.
2. Multicast scouting bound to the wrong NIC. The Zenoh adapter never set a scouting interface, so Zenoh's default interface:"auto" picks unpredictably among whatever NICs a host has. Robots and laptops are heavily multi-homed (a private motor-control NIC, docker0, VPN utun*, Apple awdl0/llw0, …), so two peers on the same LAN intermittently never form a session.

The fix

| File | Change |
| --- | --- |
| `agent-tools/connection.py` | Discovery mode now mirrors `mcp/bridge.py`: `auto` (default) = D2D whenever the backend is Zenoh, including with `ZENOH_CONNECT`. The unicast endpoint is kept so the agent connects straight to the peer while still discovering it via presence. Opt into the registry with `DEVICE_CONNECT_DISCOVERY_MODE=infra`. |
| `edge/messaging/zenoh_adapter.py` | New `ZENOH_MULTICAST_INTERFACE` env var (and `multicast_interface` kwarg) pins multicast scouting to a chosen NIC (name or IP) so zero-config discovery is deterministic on multi-homed hosts. |
| READMEs | Documented both knobs + a "Multi-homed hosts & unreliable multicast" section. |

Both changes are backward compatible: default behavior is unchanged when neither the env var nor ZENOH_CONNECT is set, and non-Zenoh backends are unaffected.
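
For reviewers, the new default can be read as the decision rule below. This is a minimal sketch of the behavior described in the table, not the code in `connection.py`: `resolve_discovery_mode` is a hypothetical helper, `"other"` stands in for any non-Zenoh backend, and the assumption that non-Zenoh backends keep using the registry is mine.

```python
import os
from typing import Mapping, Optional


def resolve_discovery_mode(backend: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return "d2d" or "infra" (hypothetical helper, not the repo's API)."""
    env = os.environ if env is None else env
    mode = env.get("DEVICE_CONNECT_DISCOVERY_MODE", "auto").strip().lower()

    if mode == "infra":
        # Explicit opt-in to the registry, with or without ZENOH_CONNECT.
        return "infra"
    if mode == "d2d":
        return "d2d"

    # "auto" (and, in this sketch, any unrecognised value): D2D whenever the
    # backend is Zenoh. ZENOH_CONNECT is deliberately not consulted here; the
    # endpoint is still used for the unicast link itself.
    # Assumption: non-Zenoh backends keep using the registry.
    return "d2d" if backend.strip().lower() == "zenoh" else "infra"


if __name__ == "__main__":
    pinned = {"ZENOH_CONNECT": "tcp/192.0.2.10:7447"}
    print(resolve_discovery_mode("zenoh", pinned))
    print(resolve_discovery_mode("zenoh", {**pinned, "DEVICE_CONNECT_DISCOVERY_MODE": "infra"}))
```

The point of the sketch is that the presence of `ZENOH_CONNECT` no longer influences the mode decision; only the backend and `DEVICE_CONNECT_DISCOVERY_MODE` do.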

How to reproduce

Two hosts on one LAN, both insecure D2D (the device is the repo's quick-start sensor demo, device_id=sensor-001; the agent is the agent-tools client). No physical sensor required — any two hosts reproduce it.

# Device:
DEVICE_CONNECT_ALLOW_INSECURE=true python sensor.py
# Agent:
DEVICE_CONNECT_ALLOW_INSECURE=true python client.py
#   discover_devices(device_type="sensor") -> []
#   invoke_device("sensor-001", "get_reading") -> "no responders" (30 s)

Decisive test — pin a direct unicast link, bypassing multicast entirely:

# Device:
ZENOH_LISTEN=tcp/0.0.0.0:7447 DEVICE_CONNECT_ALLOW_INSECURE=true python sensor.py
# Agent — BEFORE this PR you also needed DEVICE_CONNECT_DISCOVERY_MODE=d2d or it returned []:
ZENOH_CONNECT=tcp/<device-ip>:7447 DEVICE_CONNECT_ALLOW_INSECURE=true python client.py

Differential diagnosis — speculative causes eliminated

Reproduced with two hosts on one Wi-Fi LAN: a multi-homed robot (private motor-control NIC on eth0, Wi-Fi on wlan0, docker0) and a multi-homed laptop (en0 Wi-Fi plus awdl0/llw0/utun*/bridge0).

| # | Hypothesis | How it was tested | Observation | Verdict |
| --- | --- | --- | --- | --- |
| 1 | Misconfigured credentials / auth | Both ends run `DEVICE_CONNECT_ALLOW_INSECURE=true` (auth bypassed on both) | No auth in the path | ❌ Eliminated |
| 2 | Wrong backend / not really in D2D | Compared device startup logs | Both auto-select Zenoh D2D peer mode; healthy, identical logs | ❌ Eliminated |
| 3 | Multicast physically blocked on the LAN | Raw Zenoh pub/sub, device → agent | 16/16 messages delivered | ❌ Eliminated (multicast works here) |
| 4 | Zenoh session never forms (real adapter) | Ran the real `ZenohAdapter` on both hosts; dumped connected peer ZIDs + presence | Session forms; presence flows both ways | ❌ Eliminated as a blanket transport failure |
| 5 | Agent leaves D2D when given an explicit endpoint | Ran agent with `ZENOH_CONNECT` only | Routed to registry → `discovery timed out (no responders)`, `DISCOVERED: []` | ✅ Root cause #1 |
| 6 | Multicast scout binds the wrong NIC (multi-homed) | `interface=auto` vs pinned, multicast-only | `auto` intermittently fails to form a session; pinned is deterministic | ✅ Root cause #2 |
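
Row 3 was tested with raw Zenoh pub/sub. For anyone who wants a Zenoh-independent check of whether multicast crosses the LAN on a specific NIC, a plain UDP probe along these lines might do. It is a different technique from the one in the table, and the group address and port are arbitrary test values chosen for the sketch, not Zenoh's scouting address.

```python
import socket
import struct
import time

GROUP = "239.255.42.99"  # hypothetical admin-scoped test group (not Zenoh's)
PORT = 50000             # hypothetical test port


def membership_request(group: str, iface_ip: str) -> bytes:
    """Build the ip_mreq payload that joins `group` on the NIC owning `iface_ip`."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(iface_ip))


def listen(iface_ip: str, count: int = 16, timeout_s: float = 10.0) -> int:
    """Run on the agent host: return how many probes arrived on the chosen NIC."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(GROUP, iface_ip))
    sock.settimeout(timeout_s)  # applies to each recvfrom() call
    received = 0
    try:
        while received < count:
            sock.recvfrom(1024)
            received += 1
    except socket.timeout:
        pass  # stop counting once the link goes quiet
    finally:
        sock.close()
    return received


def send(iface_ip: str, count: int = 16) -> None:
    """Run on the device host: send `count` probes out of the chosen NIC."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF, socket.inet_aton(iface_ip))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)  # stay on the LAN
    try:
        for i in range(count):
            sock.sendto(f"probe-{i}".encode(), (GROUP, PORT))
            time.sleep(0.1)
    finally:
        sock.close()
```

Start `listen("<agent-lan-ip>")` on one host, then `send("<device-lan-ip>")` on the other. A count well below 16 on the LAN NIC would point at the network rather than at the code paths in rows 5 and 6.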

Fix validation (on hardware)

| Scenario | Before | After |
| --- | --- | --- |
| Agent with `ZENOH_CONNECT=tcp/<device>:7447` only (no `DISCOVERY_MODE`) | `discovery timed out (no responders)`, `DISCOVERED: []` | `DISCOVERED: [sensor-001]`, invoke → `{temperature: 22.5, humidity: 45}` |
| Multicast-only with `ZENOH_MULTICAST_INTERFACE` pinned to the LAN NIC | flaky / no session | session + presence both directions |

Tests

| Suite | Result |
| --- | --- |
| `device-connect-edge` unit (excl. fuzz) | 529 passed, 1 skipped |
| `test_zenoh_adapter.py` | 72 passed (incl. 2 new) |
| `test_connection_unit.py` | 26 passed (incl. 4 new) |
`test_connection_unit.py`	26 passed (incl. 4 new)

New tests assert: an explicit ZENOH_CONNECT keeps D2D mode and retains the unicast endpoint; DISCOVERY_MODE=infra opts out; and ZENOH_MULTICAST_INTERFACE / the multicast_interface kwarg pin the scout interface.

Customer-facing guidance (TL;DR for the field)

On a robot/Wi-Fi network where D2D discovery returns []:

- Multi-homed host? Pin the LAN NIC: ZENOH_MULTICAST_INTERFACE=wlan0.
- Multicast blocked (AP/client isolation, managed switch)? Use a direct unicast link. Device: ZENOH_LISTEN=tcp/0.0.0.0:7447; agent: ZENOH_CONNECT=tcp/<device-ip>:7447. After this PR the agent no longer needs a second magic env var.
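
To make the two knobs above concrete, here is a rough sketch of how these environment variables could translate into Zenoh configuration keys. The key paths (`scouting/multicast/interface`, `connect/endpoints`, `listen/endpoints`) follow Zenoh's published configuration layout as I understand it; `zenoh_overrides` is a hypothetical helper, each variable is assumed to hold a single value, and the actual mapping in `zenoh_adapter.py` may differ.

```python
import json
import os
from typing import Dict, Mapping, Optional


def zenoh_overrides(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Map env vars to {config key: JSON5 value} pairs (hypothetical helper)."""
    env = os.environ if env is None else env
    overrides: Dict[str, str] = {}

    iface = env.get("ZENOH_MULTICAST_INTERFACE")
    if iface:
        # NIC name or IP; replaces the default "auto" scouting interface.
        overrides["scouting/multicast/interface"] = json.dumps(iface)

    connect = env.get("ZENOH_CONNECT")
    if connect:
        # Assumption: a single endpoint such as tcp/<device-ip>:7447.
        overrides["connect/endpoints"] = json.dumps([connect])

    listen = env.get("ZENOH_LISTEN")
    if listen:
        overrides["listen/endpoints"] = json.dumps([listen])

    return overrides


if __name__ == "__main__":
    for key, value in zenoh_overrides({"ZENOH_MULTICAST_INTERFACE": "wlan0"}).items():
        # With zenoh-python installed, each pair would presumably be applied via
        # zenoh.Config().insert_json5(key, value) before opening the session.
        print(key, "=", value)
```

Keeping the mapping as plain key/value pairs makes it easy to log exactly what was pinned at connect time.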

…cast-hostile networks

A partner integration reported that an agent host could not discover or invoke a device on the same LAN: discovery returned [] and invoke failed with "no responders", with both hosts in insecure zero-infra D2D mode.

Reproduced on a multi-homed robot (private motor-control NIC + Wi-Fi + docker0) talking to a multi-homed laptop (Wi-Fi + AWDL/VPN interfaces), on a LAN where multicast provably works, isolating two code defects:

1. agent-tools connection.py silently dropped out of D2D presence discovery into registry mode whenever an explicit ZENOH_CONNECT endpoint was set. Against a zero-infra device (no registry) that timed out and returned [] unless the caller also set DEVICE_CONNECT_DISCOVERY_MODE=d2d. connect() now mirrors mcp.bridge._is_d2d_mode: "auto" (default) means D2D whenever the backend is Zenoh, including with ZENOH_CONNECT, so the reliable unicast path (the recommended workaround when multicast is blocked) works with no extra env var. Opt into the registry with DEVICE_CONNECT_DISCOVERY_MODE=infra.

2. The Zenoh adapter now honours ZENOH_MULTICAST_INTERFACE (and a multicast_interface kwarg) to pin multicast scouting to a NIC. On multi-homed hosts Zenoh's default interface="auto" can bind the scout to the wrong interface, so peers on the same LAN intermittently never form a session. Pinning to the LAN interface makes zero-config discovery deterministic.

Unit tests added for both fixes; full edge suite and the agent-tools connection/adapter suites pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

armwaheed · 2026-06-03T06:03:15Z

@atsyplikhin, please review

atsyplikhin · 2026-06-04T19:55:00Z

Nice fix — aligning connection.py with mcp/bridge.py::_is_d2d_mode is the right call, and the ZENOH_MULTICAST_INTERFACE knob targets the real root cause. Verified the core behavior (explicit ZENOH_CONNECT + D2D presence discovery) works end-to-end on a two-process repro. A few suggestions, mostly aimed at developer experience for the first-run-on-a-robot/Wi-Fi case.

One correctness note worth flagging

The default discovery path now flips to D2D for any Zenoh backend, including with ZENOH_CONNECT — but the device side (device.py:469) was not changed to match. A device started with ZENOH_CONNECT stays _d2d_mode=False, registers with the registry, and never starts the presence announcer (gated at device.py:1734).

- Partner's case (device in pure D2D, agent pins ZENOH_CONNECT to reach it): works ✅
- Infra deployment (router + registry, both sides using ZENOH_CONNECT): the agent now defaults to presence discovery while the device only registered with the registry → agent finds nothing unless it sets DEVICE_CONNECT_DISCOVERY_MODE=infra.

Suggest calling this out in the PR description/changelog as a behavior change for infra Zenoh users, and deciding whether the device side should get the same auto treatment for symmetry (identical env → identical behavior on both ends).
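
The asymmetry can be stated as a small truth table. The sketch below is a simplification for discussion, assuming a Zenoh backend on both ends: `agent_mode` models the agent after this PR, `device_mode` models the unchanged device side as described above (an explicit `ZENOH_CONNECT` means registry registration and no presence announcer), and both function names are hypothetical. Any handling of `DEVICE_CONNECT_DISCOVERY_MODE` on the device side is not modelled.

```python
from typing import Mapping


def agent_mode(env: Mapping[str, str]) -> str:
    """Agent after this PR: D2D unless the registry is explicitly requested."""
    mode = env.get("DEVICE_CONNECT_DISCOVERY_MODE", "auto").lower()
    return "infra" if mode == "infra" else "d2d"


def device_mode(env: Mapping[str, str]) -> str:
    """Device as reviewed: an explicit ZENOH_CONNECT keeps it out of D2D."""
    return "infra" if env.get("ZENOH_CONNECT") else "d2d"


def can_discover(agent_env: Mapping[str, str], device_env: Mapping[str, str]) -> bool:
    """Discovery only works when both ends use the same mechanism."""
    return agent_mode(agent_env) == device_mode(device_env)


if __name__ == "__main__":
    router = {"ZENOH_CONNECT": "tcp/192.0.2.1:7447"}
    print("partner case:", can_discover({"ZENOH_CONNECT": "tcp/192.0.2.10:7447"}, {}))
    print("infra, default agent:", can_discover(router, router))
    print("infra, agent opts in:", can_discover({**router, "DEVICE_CONNECT_DISCOVERY_MODE": "infra"}, router))
```

If the device side adopted the same `auto` rule, the second case would match without any extra variable on the agent.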

DX suggestions

1. Make 0-peer discovery self-diagnosing (highest leverage). Today discover_devices() returns a silent [] and, 30 s later, invoke returns no responders, with no hint why. At the wait_for_peers 0-result path, emit something actionable, e.g.:

D2D discovery found 0 peers after 3.0s via Zenoh multicast.
  Interfaces seen: en0 (192.168.19.17), docker0 (172.17.0.1), utun3 (10.2.0.2)
  Multicast is likely blocked or scouting the wrong NIC. Try:
    - ZENOH_MULTICAST_INTERFACE=en0      (pin scouting to your LAN NIC)
    - ZENOH_CONNECT=tcp/<device-ip>:7447 (skip multicast, direct link)

This puts the knowledge from the README into the runtime, where nobody has to go find it.

2. Fail fast instead of the 30 s dead-wait. If the target device isn't in the presence table, raise immediately (device 'X' not found among N discovered peers) rather than blocking the full request_timeout. Distinguishing "no Zenoh session" from "session up, device absent" is the difference between a network fix and a config fix.
3. Remove the NIC guesswork. ZENOH_MULTICAST_INTERFACE still requires the dev to know their interface name. Cheap win: log candidate interfaces + IPs at connect time so the value to set is right there. Follow-up: auto-select the interface whose subnet matches the default route / the ZENOH_CONNECT peer and pin it (mirrors the strategy robot DDS configs already use).
4. Ship a doctor preflight (e.g. python -m device_connect_agent_tools doctor) that enumerates NICs, checks 7446/udp + 7447/tcp, fires one multicast scout to see if any peer answers, and prints the exact env vars to set. Same pattern as tailscale/docker/kubectl: it turns "it doesn't work" tickets into self-service.
5. Smaller polish:
- Warn on unknown DEVICE_CONNECT_DISCOVERY_MODE values (e.g. someone types registry) instead of silently falling through to D2D.
- Plumb multicast_interface as a first-class DeviceRuntime(...) / connect() kwarg, not only an adapter kwarg/env.
- Add an inline "returned [] / no responders?" troubleshooting box in the quick-start, and ship the sensor.py + client.py repro as official examples.
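
To illustrate the self-diagnosing message and the fail-fast check from the list above, something like the following could work. It is only a sketch: `zero_peer_hint`, `require_present` and `DeviceNotFoundError` are hypothetical names, the interface list is supplied by the caller (how the runtime enumerates NICs is left open), and the presence table is modelled as a plain dict keyed by device id.

```python
from typing import Iterable, Mapping, Tuple


class DeviceNotFoundError(LookupError):
    """Raised when the target is absent from the presence table (hypothetical)."""


def zero_peer_hint(elapsed_s: float, interfaces: Iterable[Tuple[str, str]]) -> str:
    """Format an actionable message for the 0-peer discovery path."""
    seen = ", ".join(f"{name} ({ip})" for name, ip in interfaces) or "none"
    return (
        f"D2D discovery found 0 peers after {elapsed_s:.1f}s via Zenoh multicast.\n"
        f"  Interfaces seen: {seen}\n"
        "  Multicast is likely blocked or scouting the wrong NIC. Try:\n"
        "    - ZENOH_MULTICAST_INTERFACE=<nic>     (pin scouting to your LAN NIC)\n"
        "    - ZENOH_CONNECT=tcp/<device-ip>:7447 (skip multicast, direct link)"
    )


def require_present(device_id: str, presence: Mapping[str, object]) -> None:
    """Fail immediately instead of waiting out the request timeout."""
    if device_id not in presence:
        raise DeviceNotFoundError(
            f"device '{device_id}' not found among {len(presence)} discovered peers"
        )


if __name__ == "__main__":
    print(zero_peer_hint(3.0, [("en0", "192.168.19.17"), ("docker0", "172.17.0.1")]))
    try:
        require_present("sensor-001", {})
    except DeviceNotFoundError as exc:
        print(exc)
```

Raising a distinct exception type lets callers tell "session up, device absent" apart from a transport timeout.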

Scope

I'd keep this PR tight: fold in #1, #3-cheap (log interfaces), the unknown-mode warning, and the README box now; split #2 (fast-fail), #4 (doctor), and interface auto-select into a follow-up so they get proper tests. The PR already gives an expert the knobs — #1 + #4 are what tell a first-timer which knob to turn.
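
For the interface auto-select follow-up, one possible shape is to match the `ZENOH_CONNECT` peer against each candidate NIC's subnet. This is a sketch under assumptions: `pick_interface` and `peer_ip_from_endpoint` are hypothetical helpers, the endpoint parser only handles the IPv4 `tcp/<ip>:<port>` form used in this thread, and the prefix lengths in the usage example are invented for illustration.

```python
import ipaddress
from typing import Iterable, Optional, Tuple


def peer_ip_from_endpoint(endpoint: str) -> str:
    """Extract the host part of an endpoint like tcp/192.168.19.40:7447 (IPv4 only)."""
    _, _, address = endpoint.partition("/")
    host, _, _ = address.rpartition(":")
    return host


def pick_interface(candidates: Iterable[Tuple[str, str]], peer_ip: str) -> Optional[str]:
    """Return the first NIC whose subnet contains the peer, or None.

    `candidates` are (name, "ip/prefix") pairs supplied by the caller.
    """
    peer = ipaddress.ip_address(peer_ip)
    for name, cidr in candidates:
        network = ipaddress.ip_interface(cidr).network
        if peer.version == network.version and peer in network:
            return name
    return None


if __name__ == "__main__":
    nics = [
        ("docker0", "172.17.0.1/16"),
        ("utun3", "10.2.0.2/32"),
        ("en0", "192.168.19.17/24"),
    ]
    peer = peer_ip_from_endpoint("tcp/192.168.19.40:7447")
    print(pick_interface(nics, peer))
```

Returning `None` when nothing matches leaves room to fall back to the default route or to Zenoh's own `auto` behavior.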

armwaheed mentioned this pull request Jun 3, 2026

Device Connect D2D: discovery returns [] / invoke 'no responders' on multi-homed / multicast-hostile networks armwaheed/device-connect#1

Open

armwaheed mentioned this pull request Jun 5, 2026

Device Connect D2D: discovery returns [] / invoke 'no responders' on multi-homed / multicast-hostile networks #51

Open

armwaheed wants to merge 1 commit into arm:main from armwaheed:deep-robotics
