fix(#1775, #1489): path quality gating in onContactPathRecv + 3x flood ACK retry #2569

Open

thecharge wants to merge 2 commits into meshcore-dev:dev from thecharge:fix/1775-1489-path-gating-flood-ack-reliability
Conversation

@thecharge

Reliability Changes: Path Quality Gating & Flood ACK Retry

Related issues: #1775, #1489


Change 1 — Path Quality Gating in onContactPathRecv

The Bug

BaseChatMesh::onContactPathRecv unconditionally overwrote the stored
out_path with any newly-arriving path, regardless of the quality of the
stored path or the quality of the incoming one.

In an RF mesh with multipath propagation, flood path-return packets can
arrive from multiple routes in quick succession. The first to arrive wins
— which is not necessarily the shortest route. More critically, a
longer-hop path arriving shortly after an established short-hop path
silently replaced the working route with a worse one.

Consequence (from issue #1775):

  • A stable, proven direct route (e.g. 1 hop) could be replaced by a
    suboptimal multipath duplicate (e.g. 3 hops) arriving 50–200 ms later.
  • Subsequent direct messages then travel via the longer route, increasing
    airtime, collision probability, and delivery failure rate.
  • The user sees intermittent "works / doesn't work" behaviour with the same
    peer even when the mesh topology has not changed.

The Fix

onContactPathRecv now applies a stickiness window before accepting a
path replacement:

  1. If the stored path is younger than PATH_STICKINESS_WINDOW_SECS
    (default 600 s / 10 min) and the incoming path has more hops than
    the stored one, the stored path is kept.
  2. Any embedded ACK or response carried inside the path packet is still
    processed regardless (so the sender's ACK timeout is cancelled correctly).
  3. The stored path is always replaced when it is stale, when it is the
    same or shorter hop count, or when no path was previously known — ensuring
    the node still adapts to topology changes.

A new field out_path_timestamp (uint32_t, zero-init) on ContactInfo
records the RTC time at which the current out_path was last accepted.
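The three rules above reduce to a single predicate. Below is a minimal sketch under assumed names — the function, its parameters, and the sentinel value for "no prior path" are illustrative, not the exact MeshCore symbols:

```cpp
#include <cstdint>

#ifndef PATH_STICKINESS_WINDOW_SECS
#define PATH_STICKINESS_WINDOW_SECS 600  // seconds a fresh path is protected
#endif

// Hypothetical predicate: returns true when the stored path should be kept
// and the incoming one rejected (rule 1); returns false in every case where
// the stored path should be replaced (rule 3).
bool keepStoredPath(uint8_t stored_hops, uint8_t incoming_hops,
                    uint32_t stored_timestamp, uint32_t now_secs) {
  if (stored_timestamp == 0) return false;  // no prior path known: accept
  bool fresh = (now_secs - stored_timestamp) < PATH_STICKINESS_WINDOW_SECS;
  // Keep only while fresh AND the newcomer is strictly longer; a same-length
  // or shorter incoming path always replaces the stored one.
  return fresh && incoming_hops > stored_hops;
}
```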

Delivery Success Tracking

A second new field path_ack_count (uint8_t, zero-init, saturates at 255)
is incremented in onAckRecv whenever an ACK arrives via the stored direct
path (not flood). This provides a cheap per-contact signal of proven
delivery that can be used in future heuristics (e.g. giving a higher
stickiness weight to a path that has successfully delivered messages).

The counter is reset to zero whenever a new path is accepted, so it always
reflects delivery history on the current path.

Configurable Knobs

#define                      Default   Description
PATH_STICKINESS_WINDOW_SECS  600       Seconds a fresh path is protected from longer-hop replacement

Override before including BaseChatMesh.h or via build flags.


Change 2 — Flood ACK Reliability (sendAckTo)

The Bug

When a direct message arrived via flood routing (i.e. the recipient had no
stored direct path to the sender), sendAckTo sent exactly one flood
ACK packet.

LoRa RF environments are inherently lossy. A single ACK transmission:

  • Can collide with other traffic on the channel.
  • Can be lost due to transient interference or fading.
  • Has no retry mechanism at the MAC layer.

When the ACK is lost the sender must wait for its full timeout (several
seconds) before attempting to retransmit the message, burning airtime and
battery, and degrading the user experience. In practice users reported
that doubling or tripling ACK transmissions (already possible via
getExtraAckTransmitCount() for direct-path ACKs) dramatically improved
perceived reliability (issue #1489).

The Fix

sendAckTo now sends three independent flood ACK packets at staggered
delays when out_path_len == OUT_PATH_UNKNOWN:

Attempt   Delay
1st       200 ms
2nd       800 ms
3rd       2000 ms

Each copy is a separately-scheduled, independent RF transmission.
If the first copy is lost the second (and third) have a fresh chance of
reaching the sender through a congestion-free window.
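The staggered scheduling can be sketched as below. floodAckDelayMs and sendAckCopyAt are hypothetical names standing in for the firmware's delay table and delayed-send mechanism; the delay values come from the table above:

```cpp
#include <cstdint>
#include <cstddef>

#ifndef FLOOD_ACK_RETRY_COUNT
#define FLOOD_ACK_RETRY_COUNT 3
#endif
#ifndef TXT_ACK_DELAY
#define TXT_ACK_DELAY 200
#endif

// Delay (ms) of the i-th flood ACK copy; the 800/2000 entries mirror the
// fixed values described above.
uint32_t floodAckDelayMs(size_t attempt) {
  static const uint32_t flood_ack_delays[FLOOD_ACK_RETRY_COUNT] =
      { TXT_ACK_DELAY, 800, 2000 };
  return flood_ack_delays[attempt];
}

// Hypothetical scheduling loop: sendAckCopyAt() stands in for whatever the
// firmware uses to queue an independent, delayed RF transmission.
void scheduleFloodAcks(void (*sendAckCopyAt)(uint32_t delay_ms)) {
  for (size_t i = 0; i < FLOOD_ACK_RETRY_COUNT; i++) {
    sendAckCopyAt(floodAckDelayMs(i));
  }
}
```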

Deduplication on the Receiver Side

Duplicate suppression is handled by the pre-existing MeshTables::hasSeen()
mechanism at every node (including the destination):

  • If a copy arrives and the node has already seen the same packet hash, it
    is discarded immediately — the ACK is not processed twice.
  • If the first copy was lost (never received), neither the intermediate
    repeaters nor the destination have seen it, so the second copy propagates
    normally.
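The hasSeen() contract described above can be modelled with a toy table (the real MeshTables uses a fixed-size structure, not a std::set; this is illustrative only):

```cpp
#include <cstdint>
#include <set>

// Toy stand-in for MeshTables::hasSeen(): records packet hashes and reports
// whether a hash was already recorded.
struct ToySeenTable {
  std::set<uint32_t> seen;
  // Returns true if this packet hash was seen before (drop as duplicate);
  // otherwise records it and returns false (process and forward normally).
  bool hasSeen(uint32_t packet_hash) {
    return !seen.insert(packet_hash).second;  // insert fails => duplicate
  }
};
```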

No protocol or wire-format changes are required. The feature is fully
backward-compatible with older firmware nodes, which simply forward the
extra ACK packets without needing to understand the retry intent.

Configurable Knobs

#define                Default   Description
FLOOD_ACK_RETRY_COUNT  3         Number of independent flood ACK sends
TXT_ACK_DELAY          200       Delay of the first ACK (ms)

The 2nd and 3rd delays (800 ms, 2000 ms) are currently fixed in the
flood_ack_delays array inside sendAckTo. They can be made configurable
via additional #defines if needed.


Files Changed

File                          Nature of change
src/helpers/ContactInfo.h     Added out_path_timestamp and path_ack_count fields
src/helpers/BaseChatMesh.cpp  onContactPathRecv — path gating logic; sendAckTo — 3× flood retry; onAckRecv — delivery tracking

Files Added

File                                        Purpose
test/test_reliability/test_path_gating.cpp  Unit tests for path gating condition
test/test_reliability/test_flood_ack.cpp    Unit tests for flood ACK scheduling

Running the Unit Tests

# Run the new reliability tests on native platform
pio test -e native_reliability

# Run the existing utility tests (unchanged)
pio test -e native

thecharge added 2 commits May 16, 2026 00:12
…ontactPathRecv + 3x flood ACK retry

- onContactPathRecv: don't replace a fresh stored path (<10 min old) with a
  longer-hop incoming path; prevents multipath duplicates from silently
  downgrading a working direct route (fixes meshcore-dev#1775)
- ContactInfo: add out_path_timestamp (path freshness) and path_ack_count
  (cheap delivery success counter, reset on path change)
- sendAckTo: send flood ACK 3x at 200/800/2000 ms staggered delays when no
  direct path is known; dedup handled by existing MeshTables::hasSeen()
  (fixes meshcore-dev#1489)
- platformio.ini: add native_reliability env with GoogleTest unit tests
- test/test_path_gating: 13 unit tests covering gating logic edge cases
- test/test_flood_ack: 11 unit tests covering retry count, delays, direct path
…backup flood ACK

Root causes of 'sender sees FAILED, recipient sees message':

1. Mesh::createPathReturn() omitted the random nonce when extra_len > 0
   (path + embedded ACK case).  AES-ECB is deterministic: without a nonce
   every PATH+ACK for the same message produced an identical ciphertext and
   thus an identical calculatePacketHash().  hasSeen() at intermediate nodes
   treated every retransmission as a duplicate and dropped it silently, so
   the single PATH+ACK was the only chance to deliver the ACK.
   Fix: always append the 4-byte random nonce regardless of extra presence.

2. BaseChatMesh::onPeerDataRecv(): when a flood message was received, only
   one PATH+ACK packet was sent.  If that packet was lost over RF the sender
   had no fallback and showed the message as failed.
   Fix: after the PATH+ACK, also call sendAckTo() to send standalone flood
   ACKs at staggered delays (200/800/2000 ms) as an independent backup.

3. PATH_STICKINESS_WINDOW_SECS reduced 600 s -> 30 s.  A 10-minute window
   prevented B from learning A's updated return path for up to 10 minutes,
   causing ACKs to travel via a stale direct route that no longer works.
   30 s is long enough to reject multipath duplicates (50-200 ms apart) but
   short enough to adapt to topology changes.
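Root cause 1 above can be illustrated with a toy hash: without a nonce, two PATH+ACK packets for the same message hash identically and every retransmission is dropped by hasSeen(); with a random nonce appended, each copy hashes differently. The hash function and buffer layout below are illustrative stand-ins, not MeshCore's actual calculatePacketHash() or wire format:

```cpp
#include <cstdint>
#include <cstring>
#include <cstddef>

// Toy FNV-1a hash standing in for calculatePacketHash().
uint32_t toyPacketHash(const uint8_t* data, size_t len) {
  uint32_t h = 2166136261u;
  for (size_t i = 0; i < len; i++) { h ^= data[i]; h *= 16777619u; }
  return h;
}

// With the fix, a 4-byte random nonce is always appended, so two PATH+ACK
// packets carrying the same payload hash differently and survive hasSeen().
uint32_t hashWithNonce(const uint8_t* payload, size_t len, uint32_t nonce) {
  uint8_t buf[64];
  memcpy(buf, payload, len);
  memcpy(buf + len, &nonce, sizeof nonce);  // append nonce unconditionally
  return toyPacketHash(buf, len + sizeof nonce);
}
```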

Tests: 29 tests pass (15 path-gating + 14 flood-ack, 3 new backup-ACK cases)