RFC: dmsg over UDP — async store-and-forward semantic + peer overlay #3005

@0pcom

Summary

Today dmsg runs over long-lived yamux-multiplexed TCP between visors and dmsg servers. Every dmsg-addressed message requires both endpoints to be live on the same TCP/yamux session at delivery time — there is no queueing, no store-and-forward, no path to bypass an unreachable dmsg server. This issue proposes exploring a UDP-based dmsg transport alongside the existing TCP one.

The interesting unlock isn't UDP-vs-TCP itself — it's the asynchronous messaging semantic that UDP makes natural.

Motivation

1. Store-and-forward (the big one)

A UDP-based dmsg session can queue datagrams server-side when the recipient visor is offline, then deliver on reconnect — the way email / push notifications / Matrix already work. Today's TCP/yamux model is more like IRC: both ends must be live on the same session or the message simply doesn't get delivered.

Use cases this unlocks:

  • Mobile / intermittent-connectivity visors that aren't online 24/7
  • Asynchronous APIs (skychat group fanout, deferred TPD publish, reward / metrics submission with retry-on-reconnect)
  • Push-style notifications without a separate push subsystem
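A minimal in-memory sketch of the store-and-forward semantic described above, to make the behavior concrete. Everything here is hypothetical and not part of dmsg: `PK`, `Mailbox`, `Send`, `Connect` and `Disconnect` are illustrative names, and a real server would need persistence, size limits and authentication.

```go
package main

import (
	"fmt"
	"sync"
)

// PK stands in for a visor public key (hypothetical type for this sketch).
type PK string

// Mailbox queues payloads for recipients that are currently offline.
type Mailbox struct {
	mu      sync.Mutex
	online  map[PK]bool
	pending map[PK][][]byte
}

func NewMailbox() *Mailbox {
	return &Mailbox{online: map[PK]bool{}, pending: map[PK][][]byte{}}
}

// Send reports true if the recipient is online (the caller forwards the
// payload directly); otherwise it queues a copy and reports false.
func (m *Mailbox) Send(to PK, payload []byte) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.online[to] {
		return true
	}
	m.pending[to] = append(m.pending[to], append([]byte(nil), payload...))
	return false
}

// Connect marks pk online and returns its backlog in arrival order.
func (m *Mailbox) Connect(pk PK) [][]byte {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.online[pk] = true
	backlog := m.pending[pk]
	delete(m.pending, pk)
	return backlog
}

// Disconnect marks pk offline so later sends are queued again.
func (m *Mailbox) Disconnect(pk PK) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.online, pk)
}

func main() {
	mb := NewMailbox()
	mb.Send("visor-b", []byte("queued while offline"))
	for _, p := range mb.Connect("visor-b") {
		fmt.Printf("delivered on reconnect: %s\n", p)
	}
}
```

The sketch drains the backlog as soon as the recipient connects; whether delivery should instead wait for an explicit pull is one of the design choices the proposal leaves open.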

2. Peer dmsg overlay (NAT traversal)

UDP enables hole-punching, so visors could relay dmsg for each other in addition to the dedicated dmsg servers. That removes the dmsg-server-as-SPOF without adding TCP-NAT complexity. dmsg becomes a true peer overlay where any reachable visor can serve as a stepping stone.

3. Avoiding yamux head-of-line blocking

A slow consumer on one yamux stream backs up the whole TCP session today. A datagram-style dmsg has natural per-message independence (or QUIC-style independent substreams) — one slow sink doesn't drag down the rest.

4. RST-injection / firewall-TCP-RST resistance

Some censoring middleboxes inject TCP RST. A UDP-based dmsg sidesteps that class of attack, similar to how QUIC was motivated for the open web.

Sketch

Two layering options:

(A) UDP with KCP/app-layer retransmit, mirroring SUDPH

  • Reuse the existing KCP-on-UDP code that SUDPH transports already use
  • Maps well onto current dmsg framing
  • Watch for the failure modes PR #3003 just fixed (no liveness detection over KCP-on-UDP → silent dead conns post-AR-restart)
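One way to guard against silent dead connections on a UDP-based transport is an idle-timeout tracker fed by received packets or keepalives. The sketch below is an assumption about how that could look, not a description of what PR #3003 does; `Liveness`, `idleTimeout` and the `at` helper are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// idleTimeout is an arbitrary value chosen for this sketch.
const idleTimeout = 3 * time.Second

// Liveness records when the peer was last heard from.
type Liveness struct {
	lastSeen time.Time
	timeout  time.Duration
}

func NewLiveness(now time.Time, timeout time.Duration) *Liveness {
	return &Liveness{lastSeen: now, timeout: timeout}
}

// Touch is called whenever any packet (data or keepalive) arrives.
func (l *Liveness) Touch(now time.Time) { l.lastSeen = now }

// Dead reports whether the peer has been silent for longer than the timeout.
func (l *Liveness) Dead(now time.Time) bool {
	return now.Sub(l.lastSeen) > l.timeout
}

// at builds a deterministic timestamp so the example does not depend on
// the wall clock.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

func main() {
	l := NewLiveness(at(0), idleTimeout)
	l.Touch(at(2))                            // keepalive received at t=2s
	fmt.Println(l.Dead(at(4)), l.Dead(at(6))) // false true
}
```

A caller would check `Dead` on a timer and tear down or re-register the session when it returns true, instead of waiting for a write to fail.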

(B) QUIC-based dmsg

  • TLS-equivalent built-in (we'd use noise instead), multiplexed substreams, congestion control done well, RST-immune
  • Heavier dep (quic-go), but the protocol is mature and the API is reasonable
  • Stream-oriented surface so the existing yamux-based dmsg client doesn't need rewriting — just the transport swaps

In both cases the visor↔dmsg-server protocol gains:

  • Per-message acks (so store-and-forward can know what was delivered)
  • A retrieval API for the recipient to pull queued messages on reconnect
  • Optional TTL on queued messages
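To illustrate what these three additions might look like on the wire, here is a hedged sketch of a frame header carrying a message ID (for acks and retrieval) and an optional TTL. The layout, the `Frame` type, the frame-type constants and the 13-byte header are all assumptions made for this example; nothing here reflects the current dmsg framing.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// FrameType distinguishes the hypothetical frame kinds.
type FrameType uint8

const (
	FrameData  FrameType = iota + 1 // payload for a recipient; may be queued
	FrameAck                        // confirms delivery of MsgID
	FrameFetch                      // recipient asks for its queued messages
)

// Frame is an illustrative wire unit: 1-byte type, 8-byte message ID,
// 4-byte TTL in seconds (0 means no expiry), then the payload.
type Frame struct {
	Type    FrameType
	MsgID   uint64
	TTLSecs uint32
	Payload []byte
}

const headerLen = 1 + 8 + 4

var ErrShortFrame = errors.New("frame shorter than header")

// Marshal encodes the frame with big-endian integer fields.
func (f Frame) Marshal() []byte {
	buf := make([]byte, headerLen+len(f.Payload))
	buf[0] = byte(f.Type)
	binary.BigEndian.PutUint64(buf[1:9], f.MsgID)
	binary.BigEndian.PutUint32(buf[9:13], f.TTLSecs)
	copy(buf[headerLen:], f.Payload)
	return buf
}

// Unmarshal decodes a frame, copying the payload out of b.
func Unmarshal(b []byte) (Frame, error) {
	if len(b) < headerLen {
		return Frame{}, ErrShortFrame
	}
	f := Frame{
		Type:    FrameType(b[0]),
		MsgID:   binary.BigEndian.Uint64(b[1:9]),
		TTLSecs: binary.BigEndian.Uint32(b[9:13]),
	}
	if len(b) > headerLen {
		f.Payload = append([]byte(nil), b[headerLen:]...)
	}
	return f, nil
}

func main() {
	wire := Frame{Type: FrameData, MsgID: 42, TTLSecs: 60, Payload: []byte("hi")}.Marshal()
	f, err := Unmarshal(wire)
	fmt.Println(f.MsgID, f.TTLSecs, string(f.Payload), err)
}
```

With an ID on every data frame, the server can drop a queued message once the matching ack arrives, and a fetch on reconnect can return whatever is still unacked and unexpired.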

Tradeoffs

  • Complexity: the KCP route reinvents pieces of TCP imperfectly; QUIC pulls in a non-trivial dep. The bug fixed in PR #3003 ("fix(sudph): detect a dead AR connection so visors re-register after an AR restart") is fresh evidence that UDP-on-skywire is real engineering, not a free transport swap.
  • dmsg's current design is server-relay-centric, so UDP gains less for relay-only than it would for true p2p. If we don't pursue the peer-overlay angle (motivation 2), TCP's reliability is hard to beat for server-relay.
  • Existing SUDPH already provides UDP-based skywire connectivity for direct p2p. The question is whether dmsg's control-plane / messaging layer should also have a UDP option, or whether SUDPH's data-plane role is enough.
  • Backwards compatibility: dmsg-servers and dmsg-clients would need to negotiate the transport. The cleanest path is a per-server capability flag, with clients preferring UDP+QUIC when both ends support it.
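The capability-flag negotiation in the last point could be sketched as a bitmask intersection with a fixed preference order. `Caps`, `CapTCP`, `CapUDP` and `Choose` are hypothetical names for this example, and `CapUDP` stands in for whichever UDP option (KCP or QUIC) ends up being chosen.

```go
package main

import "fmt"

// Caps is a bitmask of transports an endpoint supports (illustrative only).
type Caps uint8

const (
	CapTCP Caps = 1 << iota // existing yamux-over-TCP transport
	CapUDP                  // proposed UDP-based transport
)

// Choose picks the preferred transport both sides support: UDP first,
// then TCP. It reports false when there is no common transport.
func Choose(client, server Caps) (Caps, bool) {
	common := client & server
	switch {
	case common&CapUDP != 0:
		return CapUDP, true
	case common&CapTCP != 0:
		return CapTCP, true
	default:
		return 0, false
	}
}

func main() {
	// A UDP-capable client talking to a TCP-only server falls back to TCP.
	t, ok := Choose(CapTCP|CapUDP, CapTCP)
	fmt.Println(t == CapTCP, ok) // true true
}
```

Older servers that never advertise the UDP bit keep working unchanged, since the intersection then only ever contains TCP.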

Open questions

  1. Is store-and-forward (motivation 1) interesting enough on its own to justify the work, even without the peer-overlay angle?
  2. Should this be a new dmsg version (dmsg/v2) or a parallel transport that existing dmsg can fall back through?
  3. KCP-on-UDP (reusing SUDPH machinery) vs QUIC — anyone strongly prefer one?
  4. Who else has thought about this — pointers to prior discussion / related work appreciated.

cc folks running visors in intermittent-connectivity environments — your use cases would shape this most.
