
TODO: Rewrite Cleanup Summary

Purpose

Finish the rewrite to a production-ready state with:

  • typed Level 2 APIs and internal buffer management
  • green Linux and Windows validation
  • trustworthy benchmark generation and reporting
  • realistic hardening, stress, and coverage gates

TL;DR

  • The rewrite itself is in good shape. Linux is green, Windows tests are green, and POSIX/Windows benchmark floors are green.
  • The Windows benchmark blocker is now explained and fixed with concrete evidence. The remaining work is not about core correctness regressions; it is about coverage completeness, raising coverage thresholds, Windows validation parity, and one deferred Windows managed-server stress investigation.

Architecture Correction

  • The intended public model is service-oriented, not plugin-oriented.
  • Clients connect to a service kind, not to a specific plugin/process.
  • One service endpoint serves one request kind only.
  • Examples of service kinds:
    • cgroups-snapshot
    • ip-to-asn
    • pid-traffic
  • Startup order is intentionally asynchronous:
    • providers may start late
    • providers may restart or disappear
    • enrichments are optional
    • clients must tolerate absence and reconnect from their normal loop
  • The current generic multi-method server surface is now recognized as design drift and must be corrected before Netdata integration.
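The intended client posture described above can be sketched as a small loop. This is an illustrative assumption, not the library's real API: `dialService`, `collectOnce`, and the retry shape are made-up names showing how a client names a service kind, tolerates absence, and retries from its normal loop.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sketch of the service-oriented client model: a client
// connects to a service kind (not a specific plugin/process), treats
// absence as normal, and reconnects from its regular collection loop.
type conn struct{ kind string }

var errUnavailable = errors.New("service unavailable")

func dialService(kind string) (*conn, error) {
	// In the real library this would resolve whichever provider is
	// currently serving this request kind (e.g. "cgroups-snapshot");
	// the sketch always reports absence, as a late-starting provider would.
	return nil, errUnavailable
}

func collectOnce(kind string) {
	c, err := dialService(kind)
	if err != nil {
		// Enrichment is optional: absence is tolerated, not fatal.
		fmt.Printf("%s not available, will retry next cycle\n", kind)
		return
	}
	_ = c // issue the one request kind this endpoint serves
}

func main() {
	for i := 0; i < 3; i++ {
		collectOnce("cgroups-snapshot")
		time.Sleep(10 * time.Millisecond) // the client's normal loop
	}
}
```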

Current Focus (2026-03-24)

  • user decision:

    • the remaining Windows benchmark variation and full-suite flake must be explained, and fixed where possible, before Netdata integration
    • Costa explicitly decided that this is a hard blocker:
      • we must find the root cause of the remaining Windows full-suite benchmark instability
      • no exceptions, no integration before it is explained
    • rationale from the user:
      • the benchmark smoke may hide a production breakdown risk
      • we must not accept unexplained variation or call it noise
    • implication:
      • do not add bounded retries as a workaround
      • do not refresh the checked-in Windows benchmark artifacts until the root cause is identified
      • the next work is:
        • root-cause analysis of the c->rust np-pipeline-d16 full-suite failure in the larger suite context
        • root-cause analysis of the remaining snapshot-shm variation after the Rust Named Pipe hot-path fix
        • repeated clean official Windows full-suite reruns with unique CSV paths after the fixes, to verify that the benchmark harness is now stable enough to trust
    • current status:
      • blocker satisfied on 2026-03-24
      • the two concrete causes were:
        • overflow-prone QueryPerformanceCounter conversion in the Windows C and Go benchmark drivers
        • heavy WMI process scanning in the Windows benchmark runner CPU fallback
      • both were fixed
      • the checked-in Windows benchmark artifacts were refreshed only after two clean official full reruns completed with:
        • 201 rows
        • 0 duplicate keys
        • 0 zero-throughput rows
  • next benchmark task after the first Windows Rust hot-path fix:

    • purpose:
      • remove the remaining blocker to refreshing the checked-in Windows benchmark artifacts
      • determine whether the c->rust np-pipeline-d16 full-suite failure is:
        • a benchmark-runner orchestration bug
        • a flaky startup/readiness race
        • or a real transport/protocol issue
    • facts already established:
      • the full official-style Windows rerun into /tmp/plugin-ipc-investigate/bench-427907b.csv wrote 200 rows with 0 duplicate keys and 0 zero-throughput rows recorded in the CSV
      • that rerun still failed before completion because the suite printed:
        • Invalid zero throughput from c pipeline client for rust server
      • the suspected failing pair did not reproduce in 5/5 direct reruns:
        • c pipeline client against the Rust Named Pipe server succeeded every time
        • measured throughputs:
          • 241869
          • 243570
          • 249819
          • 249393
          • 243381
      • implication:
        • the current evidence points to a full-suite flake or orchestration issue
        • it does not currently point to a deterministic regression from the Windows Rust send-buffer optimization
    • facts now established from the root-cause work:
      • the full debug rerun of the official Windows suite completed cleanly with 201 measurements
      • the old c->rust np-pipeline-d16 failure did not reproduce in that full rerun
      • bounded official-runner replay of blocks 1..4 also completed cleanly and produced a healthy snapshot-shm rust->c max-throughput row:
        • 550336
      • isolated snapshot-shm rust->c reruns are consistently much faster than the single bad row:
        • 484608 to 553481
      • implication:
        • the single 178078 snapshot-shm rust->c row is not a stable property of the pair or of the official suite prefix up to block 4
        • the best current explanation is a transient host-level stall during that particular run, not a deterministic transport/protocol bug
    • concrete remaining code-backed target from this investigation:
      • Windows C benchmark snapshot server still rebuilds the 16 cgroup names and paths on every request in:
        • bench/drivers/c/bench_windows.c
      • this is inconsistent with:
        • POSIX C:
          • bench/drivers/c/bench_posix.c
        • Windows Go:
          • bench/drivers/go/main_windows.go
        • Windows Rust:
          • bench/drivers/rust/src/bench_windows.rs
      • implication:
        • the remaining stable snapshot-shm spread against the C server is at least partly benchmark-driver overhead, not library behavior
    • plan for this pass:
      • mirror the existing POSIX C snapshot-template precompute in the Windows C benchmark driver
      • rerun the official Windows benchmark blocks that include snapshot-shm
      • compare the rows against C server before deciding whether there is still a deeper runtime/library issue
  • benchmark investigation before Netdata integration:

    • purpose:
      • explain the largest remaining interop throughput asymmetries before we integrate this into Netdata
      • use this to decide whether there are still hidden robustness or transport-state risks in the cross-language hot paths
    • scope for this pass:
      • Windows snapshot-shm
      • Windows shm-batch-ping-pong
      • Windows np-pipeline-batch-d16
      • compare them against the matching POSIX scenarios to separate Windows-specific issues from generic language/runtime differences
    • expected output:
      • facts from the checked-in benchmark artifacts
      • exact code paths involved in each slow pair
      • working theories for the throughput gaps
      • recommendation on whether to fix now or proceed with guarded Netdata integration
    • facts established from the checked-in Windows benchmark artifacts:
      • each benchmark row is one server process paired with one client process:
        • tests/run-windows-bench.sh:231
        • tests/run-windows-bench.sh:415
        • tests/run-windows-bench.sh:454
        • tests/run-windows-bench.sh:480
        • tests/run-windows-bench.sh:617
      • this means worker-count defaults are not the primary explanation for the max-throughput rows, because these are not multi-client saturation tests
      • the largest bad spreads are Windows-specific and cluster by server implementation, especially Rust servers:
        • snapshot-shm:
          • slowest go->rust: 246709
          • fastest c->go: 1036379
          • spread: 4.20x
        • shm-batch-ping-pong:
          • slowest go->rust: 12959829
          • fastest c->c: 56949157
          • spread: 4.39x
        • np-pipeline-batch-d16:
          • slowest rust->rust: 14153205
          • fastest c->c: 38068732
          • spread: 2.69x
      • the matching POSIX spreads are much smaller:
        • snapshot-shm: 1.69x
        • shm-batch-ping-pong: 1.80x
        • uds-pipeline-batch-d16: 1.96x
      • simple Windows scenarios do not show the same Rust-server collapse:
        • shm-ping-pong stays fairly tight:
          • slowest go->go: 1737335
          • fastest c->go: 2551798
        • snapshot-baseline also stays tight:
          • slowest go->c: 15944
          • fastest c->go: 17907
      • implication:
        • this does not look like a generic Rust implementation problem
        • it also does not look like a raw WinSHM transport problem by itself
        • it appears when the Windows server is doing larger response assembly / batch handling, and especially when the Rust server is also on the Named Pipe send hot path
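The spread figures quoted in this section are fastest/slowest ratios over the max-throughput rows of one scenario. A minimal sketch of that computation, seeded with the Windows snapshot-shm numbers above:

```go
package main

import "fmt"

// spread returns the slowest and fastest client->server pairs of one
// scenario's max-throughput rows, plus the fastest/slowest ratio that
// this document reports as "spread".
func spread(rows map[string]float64) (slowPair, fastPair string, ratio float64) {
	for pair, v := range rows {
		if fastPair == "" || v > rows[fastPair] {
			fastPair = pair
		}
		if slowPair == "" || v < rows[slowPair] {
			slowPair = pair
		}
	}
	return slowPair, fastPair, rows[fastPair] / rows[slowPair]
}

func main() {
	// Windows snapshot-shm rows from the checked-in pre-fix artifact.
	rows := map[string]float64{
		"go->rust": 246709,
		"c->go":    1036379,
	}
	slow, fast, r := spread(rows)
	fmt.Printf("slowest %s, fastest %s, spread %.2fx\n", slow, fast, r) // 4.20x
}
```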
    • exact code-path differences already confirmed:
      • Rust Windows Named Pipe send allocates a fresh Vec for every message:
        • src/crates/netipc/src/transport/windows.rs:401
      • Go Windows Named Pipe send reuses a session scratch buffer:
        • src/go/pkg/netipc/transport/windows/pipe.go:458
      • C Windows Named Pipe send uses stack storage for small messages and heap only when needed:
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:188
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:723
      • Rust batch benchmark server leaves max_response_payload_bytes at BENCH_BATCH_BUF_SIZE:
        • bench/drivers/rust/src/bench_windows.rs:254
      • Go batch benchmark server explicitly doubles the response payload limit:
        • bench/drivers/go/main_windows.go:207
        • bench/drivers/go/main_windows.go:327
      • C batch benchmark server also gives the server a doubled response buffer:
        • bench/drivers/c/bench_windows.c:330
        • bench/drivers/c/bench_windows.c:346
      • Rust managed server defaults to 8 workers:
        • src/crates/netipc/src/service/cgroups.rs:998
        • src/crates/netipc/src/service/cgroups.rs:1004
      • Go Windows managed server runs a single accept loop and handles the accepted session directly:
        • src/go/pkg/netipc/service/cgroups/client_windows.go:500
        • src/go/pkg/netipc/service/cgroups/client_windows.go:539
        • src/go/pkg/netipc/service/cgroups/client_windows.go:596
      • C Windows benchmark servers pass explicit worker counts at init:
        • single-request server path: bench/drivers/c/bench_windows.c:286
        • batch server path: bench/drivers/c/bench_windows.c:349
    • working theories:
      • theory 1:
        • Rust Windows Named Pipe send-side allocation is a real hot-path cost and is the strongest code-level explanation for the poor np-pipeline-batch-d16 Rust-server rows
      • theory 2:
        • the bad snapshot-shm and shm-batch-ping-pong Rust-server rows are not explained by raw WinSHM alone, because shm-ping-pong is fine
        • the more likely area is Windows-specific cost in larger response assembly / batch handling / copy behavior on the Rust server side
      • theory 3:
        • worker-count differences are real implementation differences, but they are unlikely to be the main cause of the current max-throughput interop rows because each measurement is still one server process paired with one client process
    • current recommendation before Netdata integration:
      • do one focused Rust-on-Windows performance pass before broad integration
      • first target:
        • remove the per-message allocation from src/crates/netipc/src/transport/windows.rs:401
      • second target:
        • re-check the Windows Rust snapshot and batch response hot paths after that change
      • only after those reruns decide whether the remaining Windows interop gaps are acceptable for guarded integration or still need another optimization pass
    • first optimization pass completed:
      • implemented:

        • src/crates/netipc/src/transport/windows.rs
        • raw_send_msg() now reuses a per-session scratch buffer instead of allocating a fresh Vec for every Windows Named Pipe send
        • NpSession now owns send_buf
      • local safety validation:

        • cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1
        • result: 294/294 passing
      • clean Windows validation environment used:

        • a fresh temp clone on win11: /tmp/plugin-ipc-investigate
        • correct native toolchain environment:
          • PATH=/c/Users/costa/.cargo/bin:/c/Program Files/Go/bin:/mingw64/bin:$PATH
          • MSYSTEM=MINGW64
          • CC=/mingw64/bin/gcc
          • CXX=/mingw64/bin/g++
      • important evidence from the clean win11 rerun:

        • the temporary CSV at /tmp/plugin-ipc-investigate/bench-hotpath.csv ended up with interleaved duplicate rows from more than one benchmark writer
        • evidence:
          • total rows: 384
          • expected full 3x3 matrix: 201
          • duplicate keys existed for many (scenario, client, server, target_rps) combinations
        • implication:
          • the raw CSV cannot be consumed as-is
          • the safe interpretation for this investigation is to keep the last row per (scenario, client, server, target_rps) key
          • this matches the live stream from the final completed run in the reused SSH session
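The keep-last rule can be sketched as follows; the row shape and field names are illustrative, but the key is exactly the (scenario, client, server, target_rps) tuple described above:

```go
package main

import "fmt"

type row struct {
	scenario, client, server, targetRPS string
	throughput                          float64
}

// dedupKeepLast keeps only the last row per key, preserving the
// first-seen order of keys, which matches the "trust the final
// completed run" interpretation used for the corrupted CSV.
func dedupKeepLast(rows []row) []row {
	idx := map[string]int{}
	out := []row{}
	for _, r := range rows {
		key := r.scenario + "|" + r.client + "|" + r.server + "|" + r.targetRPS
		if i, ok := idx[key]; ok {
			out[i] = r // later duplicate overwrites the earlier row
			continue
		}
		idx[key] = len(out)
		out = append(out, r)
	}
	return out
}

func main() {
	rows := []row{
		{"snapshot-shm", "c", "go", "0", 900000},  // stale interleaved row
		{"snapshot-shm", "c", "go", "0", 1036379}, // final run's row, kept
	}
	for _, r := range dedupKeepLast(rows) {
		fmt.Println(r.scenario, r.client, r.server, r.throughput)
	}
}
```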
      • measured impact on the three target scenarios, comparing the checked-in Windows CSV against the clean win11 rerun after deduping by keep-last:

        • snapshot-shm:
          • before:
            • fastest c->go: 1036379
            • slowest go->rust: 246709
            • spread: 4.20x
          • after:
            • fastest c->rust: 1192436
            • slowest go->c: 466278
            • spread: 2.56x
        • shm-batch-ping-pong:
          • before:
            • fastest c->c: 56949157
            • slowest go->rust: 12959829
            • spread: 4.39x
          • after:
            • fastest c->rust: 55907488
            • slowest go->go: 37762867
            • spread: 1.48x
        • np-pipeline-batch-d16:
          • before:
            • fastest c->c: 38068732
            • slowest rust->rust: 14153205
            • spread: 2.69x
          • after:
            • fastest c->rust: 42592457
            • slowest rust->go: 32392057
            • spread: 1.31x
      • extra controls from the same clean rerun:

        • np-pipeline-d16 stayed tight:
          • before: 1.18x
          • after: 1.13x
        • snapshot-baseline stayed tight:
          • before: 1.12x
          • after: 1.15x
        • np-batch-ping-pong also tightened:
          • before: 1.44x
          • after: 1.14x
      • strongest conclusion from the evidence:

        • the Rust Windows Named Pipe per-message allocation was a real hot-path cost
        • it was a major contributor to the suspicious Windows Rust interop collapse
        • after the fix, the worst Windows interop spreads are no longer clustered around Rust servers
        • snapshot-shm still shows moderate variation, but it is no longer the same pathology as the old Rust-server collapse
      • practical implication for Netdata integration:

        • this first performance fix explains and removes most of the previously suspicious Windows Rust interop asymmetry
        • before broad integration, the official checked-in Windows benchmark artifacts should be rerun from a clean non-overlapping win11 workspace so the repo records the fixed numbers directly
      • follow-up artifact rerun:

        • reran the full Windows suite again from the existing validated temp workspace with a unique output path:
          • /tmp/plugin-ipc-investigate/bench-427907b.csv
        • good facts from that rerun:
          • rows written: 200
          • duplicate keys: 0
          • zero-throughput rows recorded in the CSV: 0
          • implication:
            • this rerun did not suffer from the earlier interleaved-writer corruption
        • one failure still occurred in the full-suite driver:
          • c->rust on np-pipeline-d16
          • the suite printed:
            • Invalid zero throughput from c pipeline client for rust server
          • that made the CSV incomplete by one row and therefore not suitable to replace the checked-in artifact yet
        • targeted follow-up on the suspected failing pair:
          • reran the same logical pairing (C pipeline client against the Rust Named Pipe server) 5 times directly in the validated temp workspace
          • all 5/5 targeted reruns succeeded
          • measured throughputs:
            • 241869
            • 243570
            • 249819
            • 249393
            • 243381
        • conclusion from that evidence:
          • the full-suite c->rust np-pipeline-d16 failure is currently a flake, not a reproduced deterministic regression from the send-buffer optimization
          • the hot-path performance explanation still stands
          • the benchmark artifact refresh is blocked only by this remaining Windows full-suite flake, not by the interop throughput issue that motivated the investigation
      • additional bounded reproduction after the failed artifact rerun:

        • reran just the full NP pipeline block logic in isolation, with:
          • the same shared RUN_DIR model as the full suite
          • the same fixed service-name pattern:
            • pipeline-${server_lang}-${client_lang}
          • the same 0.2s post-ready sleep
          • the same 0.5s inter-pair sleep
        • results:
          • 2/2 whole pipeline-block rounds passed
          • c->rust specifically passed in both rounds:
            • round 1: 254199
            • round 2: 237909
        • implication:
          • the flake does not reproduce from the pipeline block alone
          • it appears only in the larger full-suite context, after the earlier benchmark groups
          • the remaining blocker is therefore most likely a suite-level orchestration flake, not a deterministic C-vs-Rust pipeline incompatibility
      • full debug rerun with preserved raw-output instrumentation:

        • reran the full official Windows suite again from the same validated temp workspace, using the debug runner that:
          • prints raw pipeline output on parse/throughput failure
          • preserves RUN_DIR on failure
        • outcome:
          • completed cleanly with 201 measurements
          • did not reproduce the earlier c->rust np-pipeline-d16 failure
        • updated clean max-throughput spreads from that completed run:
          • snapshot-shm:
            • slowest rust->c: 178078
            • fastest c->rust: 1236344
            • spread: 6.94x
          • shm-batch-ping-pong:
            • slowest go->rust: 20859299
            • fastest c->c: 58418230
            • spread: 2.80x
          • np-pipeline-d16:
            • slowest rust->go: 231500
            • fastest go->rust: 265755
            • spread: 1.15x
          • np-pipeline-batch-d16:
            • slowest rust->c: 22814639
            • fastest c->rust: 42047527
            • spread: 1.84x
        • implication:
          • the old broad Rust-server Named Pipe collapse is gone after the send-buffer fix
          • the largest remaining anomaly is now snapshot-shm rust->c, not np-pipeline-d16
      • isolated follow-up on the new snapshot-shm rust->c outlier:

        • isolated snapshot-shm reruns show the pair is not inherently slow:
          • rust->c isolated, unique run_dir per run:
            • 543392
            • 535990
            • 524335
          • rust->c isolated, shared RUN_DIR immediately after c->c:
            • 553481
            • 548110
            • 528533
            • 550328
            • 484608
          • controls from the same isolated runs:
            • c->rust: 1209183 to 1296249
            • rust->rust: 1041573 to 1308501
            • go->c: 454735 to 490505
        • implication:
          • the 178078 full-suite rust->c result is not a stable property of the pair itself
          • simple shared-RUN_DIR reuse and the immediately preceding c->c row are not enough to reproduce it
      • bounded prefix reproductions that did not reproduce the snapshot-shm rust->c slowdown:

        • after the full snapshot-baseline block, snapshot-shm rust->c still measured 540849
        • after the full shm-ping-pong block, snapshot-shm rust->c still measured 543552
        • implication:
          • the remaining contamination is not explained by:
            • service-name reuse alone
            • shared RUN_DIR alone
            • the preceding snapshot-baseline block alone
            • the preceding shm-ping-pong block alone
          • the next honest root-cause target is the larger combined prefix of the official suite, not an isolated transport pair
      • concrete Windows C benchmark-driver fix:

        • mirrored the existing POSIX C snapshot-template precompute in:
          • bench/drivers/c/bench_windows.c
        • the Windows C snapshot server no longer rebuilds the same 16 cgroup names and paths on every request
        • it now precomputes them once with:
          • InitOnceExecuteOnce()
        • implication:
          • this removes benchmark-driver overhead that was unique to the Windows C snapshot server
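The precompute-once idea can be sketched in Go with sync.Once as a stand-in for the Windows C InitOnceExecuteOnce() call; the name and path formats below are invented for illustration and are not the driver's real strings:

```go
package main

import (
	"fmt"
	"sync"
)

// Build the 16 cgroup names and paths once, on first use, instead of
// rebuilding them on every snapshot request.
var (
	tmplOnce   sync.Once
	cgroupName [16]string
	cgroupPath [16]string
)

func snapshotTemplate() ([16]string, [16]string) {
	tmplOnce.Do(func() { // runs at most once, even under concurrency
		for i := range cgroupName {
			cgroupName[i] = fmt.Sprintf("cgroup-%02d", i)
			cgroupPath[i] = fmt.Sprintf("/sys/fs/cgroup/bench/cgroup-%02d", i)
		}
	})
	return cgroupName, cgroupPath
}

func main() {
	// Every request now reads the precomputed template; only the
	// first call pays the formatting cost.
	names, paths := snapshotTemplate()
	fmt.Println(names[0], paths[15])
}
```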
      • exact official rerun after the Windows C snapshot fix:

        • reran the full official Windows suite to:
          • /tmp/plugin-ipc-investigate/bench-full-after-c-snapshot-fix.csv
        • outcome:
          • completed cleanly with 201 measurements
          • no zero-throughput rows
          • no duplicate keys
          • the earlier c->rust np-pipeline-d16 failure did not reproduce
        • updated max-throughput spreads from the clean full rerun:
          • snapshot-shm:
            • slowest go->go: 860567
            • fastest rust->rust: 1290354
            • spread: 1.50x
          • shm-batch-ping-pong:
            • slowest go->go: 38594327
            • fastest c->c: 56333291
            • spread: 1.46x
          • np-pipeline-d16:
            • slowest go->go: 229507
            • fastest rust->rust: 270223
            • spread: 1.18x
          • np-pipeline-batch-d16:
            • slowest rust->go: 32250435
            • fastest c->rust: 41361971
            • spread: 1.28x
        • before/after comparison against the checked-in pre-fix Windows artifact:
          • snapshot-shm improved from 4.20x spread to 1.50x
          • shm-batch-ping-pong improved from 4.39x to 1.46x
          • np-pipeline-batch-d16 improved from 2.69x to 1.28x
        • implication:
          • the meaningful stable Windows interop variation is now explained and largely fixed
          • the repository no longer shows the old cross-language Rust-server collapse pattern on Windows
          • the earlier one-off c->rust np-pipeline-d16 full-suite failure remains unreproduced after extensive bounded reruns and one clean official rerun
          • the best current explanation for that one event is a transient host-level stall or suite-level transient, not a deterministic transport/protocol bug
      • clean full-suite soak rerun after the fix:

        • reran the full official Windows suite from a fresh clean win11 clone at commit 2aa62b7
        • the first soak run failed again, but with a different and more useful symptom:
          • runner warning:
            • Invalid zero throughput from go client for shm-batch-ping-pong
          • exact missing row in the partial CSV:
            • shm-batch-ping-pong,go,go,1000
          • partial CSV facts:
            • total rows: 200
            • only missing shm-batch-ping-pong row:
              • go->go @ 1000
        • implication:
          • there is still a real Windows benchmark instability after the snapshot fix
          • it is no longer centered on np-pipeline-d16
          • the current best target is now:
            • Go client to Go server
            • WinSHM batch ping-pong
            • 1000 req/s
      • bounded reproduction after the first soak failure:

        • reran blocks 5..6 only from the same clean clone with the debug runner
        • outcome:
          • passed cleanly with 72 measurements
          • the previously missing row was present:
            • shm-batch-ping-pong,go,go,1000
        • implication:
          • block 5 (np-batch-ping-pong) is not sufficient to trigger the failure
          • the contaminating prefix, if real, is earlier in the suite:
            • blocks 1..4
            • or a larger accumulated prefix that includes them
      • bounded reproduction with a longer prefix:

        • reran blocks 3..6 from the same clean clone with the debug runner
        • outcome:
          • failed again, but with a different concrete symptom
          • warning:
            • c client failed for shm-batch-ping-pong (exit 124)
          • exact failing row:
            • shm-batch-ping-pong
            • client c
            • server go
            • 10000 req/s
          • preserved run dir:
            • /tmp/netipc-bench-170103
          • preserved files showed:
            • client stderr empty
            • Go server stdout contained READY and later SERVER_CPU_SEC=0.015625
        • implication:
          • the benchmark flake is not just a reporting issue
          • there is a real live-suite Windows SHM batch instability involving the Go server
          • the server largely sat idle and auto-stopped, which is consistent with:
            • no request being observed on the server side
            • or client/server ending up on different SHM state
          • simple stale leftover objects are not sufficient to explain it:
            • rerunning the exact same row afterward with the same RUN_DIR and same service name passed immediately
      • bounded reproduction with the same longer prefix, second rerun:

        • reran blocks 3..6 again from the same clean clone with the improved debug runner
        • outcome:
          • passed cleanly with 108 measurements
          • no missing rows
          • the previously failing row was present:
            • shm-batch-ping-pong c->go @ 10000: 4983532
          • the previously missing row was present:
            • shm-batch-ping-pong go->go @ 1000: 497521
        • important performance facts from that same successful run:
          • shm-batch-ping-pong with Go server at max rate was still materially slower than the surrounding rows:
            • c->go @ max: 43048873
            • rust->go @ max: 13817950
            • go->go @ max: 26847702
          • same-scenario controls in the same run were much higher:
            • c->c @ max: 51187534
            • c->rust @ max: 51850988
            • rust->c @ max: 49334739
            • rust->rust @ max: 45365604
        • implication:
          • the remaining Windows benchmark problem is not yet a deterministic timeout reproducer
          • but there is now stronger evidence of a real live-suite performance collapse centered on the Go Windows SHM batch server path
          • working theory:
            • the occasional timeout / zero-throughput failures are the extreme tail of the same degradation, not a separate phenomenon
      • isolated row checks after the second 3..6 rerun:

        • isolated c->go WinSHM batch max-rate reruns were stable:
          • 41626243
          • 40759709
          • 39025091
          • 38260019
          • 42359213
        • isolated rust->go WinSHM batch max-rate reruns were also stable:
          • 36652632
          • 37585653
          • 37291592
          • 37543755
          • 39968087
        • isolated go->go WinSHM batch max-rate reruns showed a new concrete failure mode:
          • first 4 runs were normal:
            • 34735352
            • 33976434
            • 34603041
            • 34953529
          • fifth run printed a bogus success line:
            • shm-batch-ping-pong,go,go,0,15.700,40.700,115.900,0.0,0.0,0.0
            • return code was still 0
            • paired Go server CPU from the same run was 4.531250 sec
        • implication:
          • there is now a direct isolated reproducer of the zero-throughput symptom
          • the strongest current target is the Go Windows benchmark client timing/accounting path, not the transport alone
          • working theory:
            • nowNS() in bench/drivers/go/main_windows.go is vulnerable to bad wall-time conversion because it computes counter * 1e9 / qpcFreq in 64-bit arithmetic; the intermediate product overflows signed 64-bit once the counter passes ~9.2e9 ticks, roughly 15 minutes of uptime at a typical 10 MHz QPC frequency
            • that overflow can corrupt throughput and CPU percentages without necessarily corrupting short per-request latency samples, whose counter deltas stay small
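The overflow mechanism and the standard fix can be sketched in Go; the function names are illustrative, not the driver's actual code. Splitting whole seconds from the remainder keeps every intermediate product inside int64 range:

```go
package main

import "fmt"

const qpcFreq = 10_000_000 // typical Windows QPC frequency: 10 MHz

// naiveNS is the broken pattern: counter * 1e9 overflows signed
// 64-bit once counter exceeds ~9.2e9 ticks (~15 min at 10 MHz).
func naiveNS(counter int64) int64 {
	return counter * 1_000_000_000 / qpcFreq
}

// safeNS is the overflow-safe conversion: convert whole seconds and
// the sub-second remainder separately, so no intermediate product
// can exceed int64 range for any realistic counter value.
func safeNS(counter int64) int64 {
	sec := counter / qpcFreq
	rem := counter % qpcFreq
	return sec*1_000_000_000 + rem*1_000_000_000/qpcFreq
}

func main() {
	counter := int64(36_000 * qpcFreq) // 10 hours of machine uptime
	fmt.Println(naiveNS(counter))      // wrapped garbage after overflow
	fmt.Println(safeNS(counter))       // 36000000000000 ns = 10 h
}
```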
      • Windows benchmark timing fix and bounded rerun:

        • applied an overflow-safe QueryPerformanceCounter conversion in:
          • bench/drivers/go/main_windows.go
          • bench/drivers/c/bench_windows.c
        • direct isolated go->go WinSHM batch max reruns after the fix:
          • no more bogus throughput=0 success rows in 10 reruns
          • measured range: 15161017 to 35283655
        • reran bounded blocks 3..6 after the fix:
          • outcome:
            • passed cleanly with 108 measurements
            • no zero-throughput abort
            • no timeout
          • important positive result:
            • the old WinSHM batch collapses were gone in this rerun:
              • shm-batch-ping-pong rust->go @ max: 39542183
              • shm-batch-ping-pong go->go @ max: 38745288
              • shm-batch-ping-pong c->go @ max: 44211168
          • remaining concrete issue:
            • one real NP batch max-rate outlier remains:
              • np-batch-ping-pong c->go @ max: 3143676
              • surrounding same-block controls were all around 7.5M..8.1M
        • implication:
          • the broken wall-time conversion was a real benchmark bug and explains at least part of the original blocker
          • it does not explain everything
          • the next honest target is now the remaining np-batch-ping-pong c->go @ max outlier under suite conditions
      • no-WMI bounded control after the timing fix:

        • reran blocks 3..5 from the same clean clone with:
          • NIPC_SKIP_SERVER_CPU_FALLBACK=1
        • outcome:
          • passed cleanly with 72 measurements
          • the previous suite-only Named Pipe batch outlier disappeared
          • exact max-rate rows were all in the expected band:
            • np-batch-ping-pong c->c @ max: 8390004
            • np-batch-ping-pong rust->c @ max: 8179574
            • np-batch-ping-pong go->c @ max: 7959801
            • np-batch-ping-pong c->rust @ max: 8522949
            • np-batch-ping-pong rust->rust @ max: 8112289
            • np-batch-ping-pong go->rust @ max: 7675526
            • np-batch-ping-pong c->go @ max: 7501699
            • np-batch-ping-pong rust->go @ max: 6929177
            • np-batch-ping-pong go->go @ max: 7217936
        • implication:
          • the remaining suite-only throughput collapse is strongly tied to the benchmark runner's Windows CPU fallback, not the transport path itself
          • the specific suspect is server_cpu_seconds() in tests/run-windows-bench.sh
          • that helper currently does an expensive PowerShell / WMI scan of all bench_windows* processes by command line, even though the runner already knows the exact server PID
          • the next honest fix is:
            • keep the real timing fix in the C and Go benchmark clients
            • replace the WMI process scan with a direct per-PID CPU query, or otherwise remove the heavy fallback from the normal suite path
      • PID-only fallback attempt and MSYS PID mapping:

        • replaced the heavy WMI scan locally with a direct Get-Process -Id $pid fallback
        • reran bounded blocks 3..5
        • outcome:
          • throughput stayed healthy
          • but server CPU columns became 0.000
        • direct evidence from the bounded CSV:
          • np-batch-ping-pong c->go @ max stayed healthy at 7705810
          • but server_cpu_pct was 0.000
        • root cause:
          • the Bash background PID is an MSYS PID, not the real Windows PID
          • direct probe on win11 showed:
            • shell PID 191671
            • mapped WINPID 21056 from ps -W
        • implication:
          • we can remove the heavy WMI process scan without losing server CPU data
          • but the runner must first translate the MSYS shell PID to the real Windows PID
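The MSYS-to-Windows PID translation the runner needs can be sketched as a small `ps -W` parser. This is a minimal sketch, assuming the usual MSYS2 `ps -W` column order (PID, PPID, PGID, WINPID, ...); the sample output reuses the PIDs from the win11 probe above and is otherwise illustrative:

```shell
#!/usr/bin/env sh
# msys_to_winpid PS_OUTPUT MSYS_PID
# Prints the WINPID column for the row whose first column matches MSYS_PID.
# Assumption: `ps -W` puts WINPID in the fourth column, as MSYS2 does today.
msys_to_winpid() {
    printf '%s\n' "$1" | awk -v pid="$2" '
        $1 == pid { print $4; found = 1 }
        END { exit found ? 0 : 1 }'
}

# Captured-style sample using the PIDs from the win11 probe above.
sample='      PID    PPID    PGID     WINPID   TTY         UID    STIME COMMAND
   191671       1  191671      21056 pty0      197609 12:00:00 /usr/bin/bash'

msys_to_winpid "$sample" 191671   # -> 21056
```

With the WINPID in hand, the per-PID CPU query can target the real Windows process instead of scanning all `bench_windows*` processes.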
      • final runner fix and official reruns:

        • replaced the heavy WMI scan with a direct per-process CPU query after translating the MSYS shell PID to the real Windows PID via ps -W
        • bounded rerun of blocks 3..5 after the WINPID fix:
          • passed cleanly
          • non-lookup server_cpu_pct columns were populated again
          • example:
            • snapshot-baseline c->go @ max
            • client_cpu_pct=22.2
            • server_cpu_pct=26.250
            • total_cpu_pct=48.450
        • first clean official full rerun after both fixes:
          • CSV: /tmp/plugin-ipc-soak-results-2aa62b7/full-runner-fix.csv
          • facts:
            • 201 rows
            • 0 duplicate keys
            • 0 zero-throughput rows
          • remaining anomaly:
            • one isolated low row:
              • np-batch-ping-pong rust->go @ 10000 = 2823793
        • focused replay of blocks 1..5:
          • CSV: /tmp/plugin-ipc-soak-results-2aa62b7/blocks1-5-rerun.csv
          • facts:
            • 144 rows
            • the low row did not reproduce
            • np-batch-ping-pong rust->go @ 10000 = 4992606
        • second clean official full rerun after both fixes:
          • CSV: /tmp/plugin-ipc-soak-results-2aa62b7/full-runner-fix-2.csv
          • facts:
            • 201 rows
            • 0 duplicate keys
            • 0 zero-throughput rows
            • the earlier low row did not reproduce
            • np-batch-ping-pong rust->go @ 10000 = 4992690
          • max-throughput spread summary from the final checked-in CSV:
            • snapshot-shm: 1.49x
            • shm-batch-ping-pong: 1.57x
            • np-pipeline-batch-d16: 1.33x
        • implication:
          • the benchmark smoke was real
          • we found and fixed concrete causes instead of masking them
          • current evidence does not support a remaining live full-suite benchmark breakdown
          • benchmark generation and reporting are trustworthy again for the checked-in Windows matrix
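The CSV facts quoted in these reruns (row counts, duplicate keys, zero-throughput rows, max-throughput spread) are all mechanical one-pass checks. A minimal awk sketch over an invented simplified schema (`scenario,pair,batch,throughput`; the real checked-in CSV layout may differ), seeded with three of the max-rate rows quoted earlier:

```shell
#!/usr/bin/env sh
# Hypothetical mini-CSV; the rates reuse three np-batch-ping-pong rows above.
csv='scenario,pair,batch,throughput
np-batch-ping-pong,c->c,max,8390004
np-batch-ping-pong,rust->c,max,8179574
np-batch-ping-pong,go->go,max,7217936'

# rows / duplicate-key / zero-throughput / max-spread summary in one pass.
summary=$(printf '%s\n' "$csv" | awk -F, 'NR > 1 {
    key = $1 FS $2 FS $3
    if (seen[key]++) dup++
    if ($4 + 0 == 0) zero++
    if ($4 + 0 > max) max = $4 + 0
    if (min == 0 || $4 + 0 < min) min = $4 + 0
}
END { printf "rows=%d dup=%d zero=%d spread=%.2fx", NR - 1, dup, zero, max / min }')
printf '%s\n' "$summary"   # -> rows=3 dup=0 zero=0 spread=1.16x
```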

Resolved User Decision

1. Netdata integration timing after the benchmark investigation

Historical decision context before the final benchmark fixes:

Evidence:

  • checked-in benchmarks and validation are already strong:
    • Linux is green
    • Windows tests are green
    • POSIX and Windows benchmark floors are green
  • the remaining meaningful performance caveat is now narrow:
    • large Windows interop throughput asymmetries cluster around Rust servers in:
      • snapshot-shm
      • shm-batch-ping-pong
      • np-pipeline-batch-d16
  • one concrete Rust Windows hot-path cost is already confirmed:
    • per-message allocation in src/crates/netipc/src/transport/windows.rs:401

Options:

  • A

    • start guarded Netdata integration now, behind a feature flag, and keep the Windows performance work in parallel
    • pros:
      • fastest path to real integration feedback
      • low risk if rollout is Linux-first or explicitly guarded
    • implications:
      • Windows interop performance caveat remains open during early integration
    • risks:
      • a slow Rust Windows server path may survive into the first integrated rollout
  • B

    • do one focused Rust-on-Windows optimization pass first, then integrate
    • pros:
      • highest confidence with limited extra work
      • directly targets the remaining unexplained asymmetry before integration
    • implications:
      • integration waits for one short benchmark / optimization cycle
    • risks:
      • if the first fix is not enough, one more investigation slice may still be needed
  • C

    • stop integration work until Linux/Windows parity is much closer in chaos, hardening, and stress
    • pros:
      • strongest validation story before rollout
    • implications:
      • much slower path to integration
    • risks:
      • delays real Netdata integration feedback for issues that may not affect the first guarded rollout

Recommendation:

  • 1. B
    • reason:
      • the remaining concern is now specific and actionable, not broad and unknown
      • one focused Windows Rust hot-path pass is the best trade-off before integrating this into Netdata

Decision made by user:

  • before Netdata integration, explain the remaining interop performance variation and fix it where the evidence is strong enough

  • implication:

    • Netdata integration is intentionally blocked on this focused performance/robustness pass
    • the next engineering work should optimize the measured hot paths first, then rerun the affected benchmark scenarios on Windows
  • resolution status:

    • satisfied for the benchmark blocker
    • the earlier Windows benchmark variation and full-suite flake are now explained and fixed
    • the remaining integration caveats are now the separate validation-parity and deferred Windows managed-server stress items, not unexplained benchmark instability
  • current verified Windows C state after the latest clean win11 rerun:

    • the real bash tests/run-coverage-c-windows.sh 90 flow now completes end to end again on clean win11
    • exact measured Windows C result from the real script:
      • total: 93.9%
      • src/libnetdata/netipc/src/service/netipc_service_win.c: 92.0%
      • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c: 95.3%
      • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c: 95.9%
    • evidence:
      • test_win_service_guards.exe: 134 passed, 0 failed
      • test_win_service_guards_extra.exe: 93 passed, 0 failed
      • test_win_service_extra.exe: 81 passed, 0 failed
      • the remaining Windows C subset then passed one-by-one under ctest --timeout 60
      • the final script summary reported 93.9% total and all tracked files above 90%
    • implication:
      • measured Windows C is still honestly above the shared 90% gate on current main
      • the aggregate Windows C script is trustworthy again on the validated win11 workflow
  • next honest ordinary Windows C target:

    • I tested the next most plausible service-side ordinary branch directly:
      • added a HYBRID idle-before-first-request test in tests/fixtures/c/test_win_service_guards.c
      • rebuilt the clean win11 coverage tree
      • reran test_win_service_guards.exe
      • checked gcov on src/libnetdata/netipc/src/service/netipc_service_win.c
    • hard evidence:
      • src/libnetdata/netipc/src/service/netipc_service_win.c:661 executed
      • src/libnetdata/netipc/src/service/netipc_service_win.c:662 remained uncovered
      • direct gcov excerpt:
        • 661: 36*
        • 662: #####
        • 663: 36
    • implication:
      • the naive HYBRID idle-timeout idea is not enough to hit the continue branch honestly
      • the remaining easy ordinary Windows C targets are now sparse
      • the remaining Windows C misses are increasingly:
        • allocation-only paths
        • handshake send-failure paths that were already shown to be unstable or non-deterministic on real win11
        • deeper timing-sensitive paths that need more than a simple fixture tweak
      • next ordinary service targets from source review:
        • src/libnetdata/netipc/src/service/netipc_service_win.c:179
        • these are still normal state-validation branches, not allocation-failure-only paths
    • exact clean win11 validation on the extended guard tree:
      • test_win_service_guards.exe: 164 passed, 0 failed
      • direct gcov on netipc_service_win.c proved:
        • src/libnetdata/netipc/src/service/netipc_service_win.c:147: covered
        • src/libnetdata/netipc/src/service/netipc_service_win.c:179: covered
    • implication after this slice:
      • the WinSHM client send/receive guard paths are no longer missing ordinary service coverage
      • the remaining netipc_service_win.c misses are increasingly failure-only branches, fixed-size encode guards, or low-level allocation paths
    • next service target after this:
      • src/libnetdata/netipc/src/service/netipc_service_win.c:159
    • why:
      • it is still ordinary transport-state mapping, not an allocation-only branch
      • it only needs a hybrid client call where nipc_win_shm_send() returns a non-OK status
    • non-goals for this follow-up:
      • nipc_win_shm_send() internal allocation / mapping failures
      • fake low-memory paths
      • trying to revive the dead session-array growth branch
    • exact clean win11 validation on the extended guard tree:
      • test_win_service_guards.exe: 167 passed, 0 failed
      • direct gcov on netipc_service_win.c proved:
        • src/libnetdata/netipc/src/service/netipc_service_win.c:159: covered
    • implication after this slice:
      • transport_send() SHM path in netipc_service_win.c is now fully covered
      • the remaining ordinary service-file misses are now mostly retry / handler / raw transport failure mappings, not the SHM send/receive wrapper itself
    • exact clean win11 targeted validation on the extended guard tree:
      • test_win_service_guards.exe: 194 passed, 0 failed
      • direct gcov on netipc_service_win.c proved:
        • src/libnetdata/netipc/src/service/netipc_service_win.c:534: covered
        • src/libnetdata/netipc/src/service/netipc_service_win.c:543: covered
        • src/libnetdata/netipc/src/service/netipc_service_win.c:611: covered
      • same targeted gcov summary on the clean coverage build:
        • src/libnetdata/netipc/src/service/netipc_service_win.c: 92.04% of 779
    • implication after this slice:
      • the ordinary batch send / receive failure mappings are no longer missing service coverage
      • the ordinary string raw-call failure propagation is no longer missing service coverage
      • the remaining netipc_service_win.c misses are now mostly:
        • fixed-size or pre-sized encode guards
        • allocation / low-level failure paths
        • branches that need a different coverage harness than the current deterministic HYBRID fake server
    • source-backed classification of the tempting remaining service targets:
      • src/libnetdata/netipc/src/service/netipc_service_win.c:517
        • not an honest ordinary target for increment batch
        • evidence:
          • caller pre-sizes req_buf_size as count * (8 + NIPC_INCREMENT_PAYLOAD_SIZE) + 64
          • NIPC_INCREMENT_PAYLOAD_SIZE is 8
          • nipc_batch_builder_add() overflows only when packed batch data exceeds the provided buffer
          • for this exact call shape, the request buffer has deterministic slack beyond the batch builder's real need
      • src/libnetdata/netipc/src/service/netipc_service_win.c:603
        • not an honest ordinary target for string reverse
        • evidence:
          • caller computes req_buf_size = NIPC_STRING_REVERSE_HDR_SIZE + request_len + 1
          • nipc_string_reverse_encode() returns 0 only when buf_len < NIPC_STRING_REVERSE_HDR_SIZE + request_len + 1
          • after the caller's own size guard passes, this encode-failure branch is structurally guarded away
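The "deterministic slack" argument behind the :517 classification is plain arithmetic and can be checked directly. A minimal sketch, taking the pre-size formula from the evidence above and assuming the packed per-item need is exactly the 8-byte overhead plus the 8-byte payload (an assumption about the batch builder's internal layout):

```shell
#!/usr/bin/env sh
# req_buf_size = count * (8 + NIPC_INCREMENT_PAYLOAD_SIZE) + 64, per the log.
payload_size=8   # NIPC_INCREMENT_PAYLOAD_SIZE
for count in 1 16 1024; do
    buf=$(( count * (8 + payload_size) + 64 ))
    need=$(( count * (8 + payload_size) ))   # assumed packed-batch need
    slack=$(( buf - need ))
    printf 'count=%d buf=%d need=%d slack=%d\n' "$count" "$buf" "$need" "$slack"
done
# Slack is a constant 64 bytes for every count, so the builder-overflow
# branch at :517 can never fire for this exact call shape.
```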
    • next honest Windows C work after this:
      • stop grinding netipc_service_win.c as if it still had cheap ordinary misses
      • move to the remaining transport-file ordinary branches or raise the C gate only after a fresh full clean rerun
    • fresh next transport target from the current clean win11 coverage build:
      • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c
      • fresh uncovered-line scan still shows an ordinary chunked receive cluster at:
        • 959-972
        • 986-992
      • why this is still honest ordinary work:
        • these are protocol-validation and peer-behavior branches in nipc_np_receive()
        • they can be driven by deterministic fake-server continuation packets
        • they do not require Win32 fault injection
      • non-goals for the next slice:
        • malloc / realloc / CreateNamedPipeW / CreateFileW failure paths
        • handshake send-failure races at 324 and 500
        • SetNamedPipeHandleState() failure at 649-650
    • exact clean win11 targeted validation on the extended Named Pipe tree:
      • test_named_pipe.exe: 195 passed, 0 failed
      • direct gcov on netipc_named_pipe.c proved:
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:959-960
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:964-965
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:971-972
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:986-987
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:991-992
      • same targeted gcov summary on the clean coverage build:
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c: 95.35% of 473
    • implication after this slice:
      • the ordinary Named Pipe chunked receive error cluster is no longer missing coverage
      • the remaining netipc_named_pipe.c misses are now mostly:
        • allocation-only paths
        • Win32 API failure paths
        • handshake send-failure races already shown to be non-deterministic as ordinary tests
      • the next honest Windows C step is no longer "add another easy Named Pipe protocol test"
      • the next honest Windows C step is:
        • rerun the full clean Windows C coverage flow to refresh the aggregate numbers
        • then decide whether the C gate should move above 90%
    • latest blocker from the attempted fresh aggregate rerun:
      • the repo's own tests/run-coverage-c-windows.sh 90 still times out on the first direct run of test_win_service_guards.exe
      • exact clean win11 evidence:
        • the script exits with 124 inside test_win_service_guards.exe
        • the log reaches the typed-dispatch section and stops after:
          • missing-string raw send ok
      • critical counter-evidence on the same coverage build:
        • an immediate direct rerun on the later non-coverage debug build had passed with:
          • 194 passed, 0 failed
        • but the same measurement on the real coverage build did not finish within 180s
        • the timed direct coverage-build log stopped much earlier, at:
          • raw unknown-method send ok
      • implication:
        • the current blocker is not just a too-tight 120s script timeout
        • the coverage-built test_win_service_guards.exe itself is too large / unstable for a single bounded direct run
      • next implementation step:
        • split the late dispatch / cache / drain portion of test_win_service_guards.exe into another bounded coverage-only executable
        • keep the new executable in the direct-run section of tests/run-coverage-c-windows.sh
      • non-goals for this stabilization slice:
        • retry-only or timeout-only fixes without reducing the coverage-built executable's scope
        • claiming a fresh aggregate Windows C number before the stabilized flow passes
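Several slices in this log lean on the same direct gcov uncovered-line scan. gcov marks a never-executed line with `#####` in the count column (and suffixes partially covered counts with `*`), so the scan is a small filter. The sample text below mirrors the 661-663 excerpt above:

```shell
#!/usr/bin/env sh
# Minimal uncovered-line scan over gcov annotated output.
# Format per line: "<count>:<lineno>:<source>"; "#####" means never executed.
gcov_sample='      36*:  661:    if (cond) {
    #####:  662:        continue;
       36:  663:    }'

uncovered=$(printf '%s\n' "$gcov_sample" |
    awk -F: '$1 ~ /#####/ { gsub(/ /, "", $2); print $2 }')
printf '%s\n' "$uncovered"   # -> 662
```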
  • next ordinary Windows WinSHM timeout-loop follow-up:

    • the previous Named Pipe chunk-receive follow-up is no longer considered an honest ordinary target with the current fake-server harness
    • concrete clean win11 evidence:
      • direct gcov after the deep batch-validation slice showed:
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:960: already covered
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:965: uncovered
      • the first short-chunk attempt returned:
        • nipc_np_send() -> NIPC_NP_ERR_DISCONNECTED
      • two deeper malformed-chunk variants also failed at the same stage:
        • bad chunk header variant
        • bad chunk payload-length variant
      • implication:
        • the current fake server closes early enough that the client is measuring a send/close race
        • this does not honestly prove the receive-loop branch in 957-965
    • decision from the evidence:
      • stop grinding this Named Pipe path as an ordinary deterministic target
      • keep the already-pushed deep batch-validation coverage and move to a cleaner target
    • next deterministic target:
      • inspect the existing WinSHM timeout/zero-timeout tests against:
        • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c:666-685
      • verify on a clean win11 clone which of these lines are still actually uncovered under gcov
      • only add tests if the current timeout harness truly misses them
    • exact clean win11 validation on the modified tree:
      • a new deterministic test pre-populates the hybrid response slot and sets client.spin_tries = 0
      • targeted ctest --test-dir build-windows-coverage-c --output-on-failure -R "^test_win_shm$": pass
      • direct gcov on netipc_win_shm.c proved:
        • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c:674: covered
        • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c:685: still uncovered
    • conclusion from this slice:
      • 674 was a real ordinary branch and is now covered honestly
      • 685 is not a good ordinary target with the current API surface
      • it requires the timeout budget to expire before the first WaitForSingleObject() call even starts, which is a timing-only condition rather than a normal protocol or transport behavior
    • next honest ordinary WinSHM targets after this:
      • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c:374
        • client_attach() to a nonexistent service should deterministically return NIPC_WIN_SHM_ERR_OPEN_MAPPING
      • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c:452
        • manual HYBRID mapping with no events should deterministically fail the first OpenEventW()
      • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c:460
        • manual HYBRID mapping with only the request event should deterministically fail the second OpenEventW()
      • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c:781-787
        • nipc_win_shm_cleanup_stale() is a public no-op and should simply be executed once
    • exact clean win11 validation on the extended tree:
      • targeted ctest --test-dir build-windows-coverage-c --output-on-failure -R "^test_win_shm$": pass
      • direct gcov on netipc_win_shm.c proved:
        • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c:374: covered
        • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c:452: covered
        • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c:460: covered
        • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c:674: covered
        • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c:781-787: covered
    • implication after this slice:
      • the remaining visible netipc_win_shm.c misses are now dominated by create/map/event fault paths and one likely unreachable name-buffer guard
      • WinSHM ordinary deterministic coverage is close to exhausted
    • non-goals for this follow-up:
      • more Named Pipe handshake timing tricks
      • allocation-only paths
      • fault-injection-only paths
  • next ordinary Windows Named Pipe deep batch-validation follow-up:

    • fresh clean win11 direct gcov after the latest chunked-batch slice reports:
      • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:1005: covered
      • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:1007: covered
      • but inside validate_batch() the deeper packed-area path is still not reached:
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:860-864
    • implication:
      • the current malformed chunked batch only proves post-assembly rejection
      • it still fails at the earlier short-directory guard, not inside the real directory validator
    • planned deterministic work:
      • inspect nipc_batch_dir_validate() and craft a chunked batch payload with:
        • payload_len >= dir_aligned
        • invalid directory offsets/lengths inside the aligned directory
      • keep the first packet small enough to force the chunked receive path
    • exact clean win11 validation on the modified tree:
      • targeted build + ctest --test-dir build --output-on-failure -R "^test_named_pipe$": pass
      • direct coverage-build test_named_pipe.exe + gcov on netipc_named_pipe.c:
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:860: covered
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:861: covered
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:864: covered
    • nuance:
      • the crafted payload also still exercises the earlier protocol-return site:
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:858
      • but now the packed-area validator path is proven too, so the deeper branch is no longer a gap
    • implication:
      • the chunked post-assembly batch-validation path is now covered honestly end-to-end
    • non-goals for this follow-up:
      • allocation-failure-only chunk buffer paths
      • handshake timing tricks
      • in-flight growth failure paths
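The invalid-directory payload described above exercises the same kind of bounds check modeled here. This is a toy model only: the `offset:length` entry encoding and all sizes are invented for illustration and do not reflect the real NIPC batch wire format.

```shell
#!/usr/bin/env sh
# check_dir PAYLOAD_LEN DIR_LEN ENTRY...  (ENTRY is "offset:length" into the
# packed area that follows the aligned directory). Rejects any entry whose
# end falls outside the payload, which is what an invalid directory triggers.
check_dir() {
    payload_len=$1; dir_len=$2; shift 2
    for entry in "$@"; do
        off=${entry%%:*}
        len=${entry##*:}
        if [ $(( dir_len + off + len )) -gt "$payload_len" ]; then
            echo "protocol-error"
            return 1
        fi
    done
    echo "ok"
}

check_dir 64 16 0:8 8:8              # -> ok
check_dir 64 16 0:8 40:32 || true    # -> protocol-error (16+40+32 > 64)
```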
  • next ordinary Windows Named Pipe chunked-batch validation follow-up:

    • fresh clean win11 direct gcov after the latest connect-validation slice still reports the chunked completion path uncovered:
      • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:1005: reached only when a chunked payload fully assembles
      • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:1007: still not covered
    • implication:
      • the remaining ordinary miss is not basic request/response validation anymore
      • it is the post-assembly validate_batch() rejection for a malformed chunked batch payload
    • planned deterministic test:
      • fake server sends a chunked batch response with item_count = 2, a payload larger than one pipe packet, and an invalid batch directory
      • client should assemble all chunks and return NIPC_NP_ERR_PROTOCOL
    • exact clean win11 validation on the modified tree:
      • targeted build + ctest --test-dir build --output-on-failure -R "^test_named_pipe$": pass
      • direct coverage-build test_named_pipe.exe + gcov on netipc_named_pipe.c:
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:1005: covered
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:1007: covered
    • nuance:
      • this malformed chunked payload still fails inside validate_batch() at the short-directory guard:
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:858
      • it does not reach the deeper packed-area path yet:
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:860-864
    • implication:
      • the post-assembly validate_batch() rejection path is now covered honestly
      • any remaining 860-864 work needs a different malformed payload, not more of the same short-directory case
    • non-goals for this follow-up:
      • allocation-failure-only chunk buffer paths
      • handshake send timing tricks
      • in-flight growth failure paths
  • next ordinary Windows Named Pipe connect validation follow-up:

    • client-handshake send recheck outcome:
      • the attempted fake-server "closes before HELLO" follow-up did not produce a stable direct raw_send() failure on clean win11
      • observed outcome during targeted reruns:
        • NIPC_NP_ERR_RECV
      • implication:
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:324 is not an honest deterministic target with the current fake-ACK harness
        • do not keep grinding that branch as if it were ordinary
    • next cheap deterministic miss from source review:
      • nonexistent service connect rejection in nipc_np_connect():
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:644
    • planned deterministic test:
      • call nipc_np_connect() on a unique service name with no listener and assert NIPC_NP_ERR_CONNECT
    • exact clean win11 validation on the modified tree:
      • targeted build + ctest --test-dir build --output-on-failure -R "^test_named_pipe$": pass
      • direct coverage-build test_named_pipe.exe + gcov on netipc_named_pipe.c:
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:644: covered
    • implication:
      • the direct no-listener connect rejection is now covered honestly
    • non-goals for this follow-up:
      • server-side ACK send timing tricks
      • allocation-failure-only paths
      • in-flight growth failure paths
  • next ordinary Windows Named Pipe validation follow-up:

    • handshake-send recheck outcome:
      • the attempted "close after HELLO" follow-up did not produce a stable raw_send() failure on clean win11
      • observed outcomes during targeted reruns:
        • NIPC_NP_ERR_ACCEPT
        • NIPC_NP_ERR_PROTOCOL
      • implication:
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:500 is not an honest deterministic target with the current fake-handshake harness
        • do not keep grinding that branch as if it were ordinary
    • next cheap deterministic misses from source review:
      • bad derived pipe name in nipc_np_listen():
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:537
      • bad derived pipe name in nipc_np_connect():
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:630
      • ConnectNamedPipe() failure on a closed-but-non-null listener handle:
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:579
    • planned deterministic tests:
      • overlong service name rejected by nipc_np_listen()
      • overlong service name rejected by nipc_np_connect()
      • close a successfully created listener handle, then call nipc_np_accept() and assert NIPC_NP_ERR_ACCEPT
    • exact clean win11 validation on the modified tree:
      • targeted build + ctest --test-dir build --output-on-failure -R "^test_named_pipe$": pass
      • direct coverage-build test_named_pipe.exe + gcov on netipc_named_pipe.c:
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:537: covered
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:579: covered
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:630: covered
    • implication:
      • the cheap argument/closed-handle validation misses are now covered honestly
    • non-goals for this follow-up:
      • handshake send-failure timing tricks
      • allocation-failure-only paths
      • in-flight growth failure paths
  • next ordinary Windows Named Pipe preferred-profile follow-up:

    • next cheap deterministic success-path miss:
      • preferred-profile selection when preferred_intersection != 0:
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:452
    • planned deterministic test:
      • real client/server handshake with both peers setting preferred_profiles = NIPC_PROFILE_BASELINE
      • assert both accepted sessions select NIPC_PROFILE_BASELINE
    • new evidence from clean win11 coverage-build validation:
      • the pre-existing "peer closes before HELLO" test can return either:
        • NIPC_NP_ERR_RECV
        • or NIPC_NP_ERR_ACCEPT
      • reason:
        • under slower coverage instrumentation, the fake client can disconnect early enough for ConnectNamedPipe() to fail before server_handshake() reaches its receive path
      • implication:
        • that table-driven test should accept both valid disconnect outcomes instead of treating ACCEPT as a regression
    • exact clean win11 validation on the modified tree:
      • targeted build + ctest --test-dir build --output-on-failure -R "^test_named_pipe$": pass
      • direct coverage-build test_named_pipe.exe + gcov on netipc_named_pipe.c:
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:452: covered
    • implication:
      • the preferred-profile success-path selection branch is now covered honestly
    • non-goals for this follow-up:
      • handshake send failure at src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:498-500
      • allocation-failure-only paths
  • latest Windows Named Pipe negotiation follow-up:

    • deterministic table-driven cases added in:
      • tests/fixtures/c/test_named_pipe.c
        • fake ACK server sends a valid HELLO_ACK with transport_status = NIPC_STATUS_UNSUPPORTED
        • fake HELLO client sends a valid HELLO with supported_profiles = 0
    • exact clean win11 validation on the modified tree:
      • targeted build + ctest --test-dir build --output-on-failure -R "^test_named_pipe$": pass
      • direct coverage-build test_named_pipe.exe + gcov on netipc_named_pipe.c:
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:345: covered
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:435: covered
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:436: covered
    • implication:
      • the remaining cheap Windows Named Pipe handshake negotiation rejections are now covered honestly
  • latest Windows Named Pipe handshake-disconnect follow-up:

    • deterministic table-driven cases added in:
      • tests/fixtures/c/test_named_pipe.c
        • fake ACK server accepts HELLO and closes before sending any HELLO_ACK
        • fake HELLO client connects and closes before sending any HELLO
    • exact clean win11 validation on the modified tree:
      • targeted build + ctest --test-dir build --output-on-failure -R "^test_named_pipe$": pass
      • direct coverage-build test_named_pipe.exe + gcov on netipc_named_pipe.c:
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:330: covered
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:396: covered
    • implication:
      • both handshake receive-side disconnect branches are now covered honestly with the existing fake-handshake harness
  • latest Windows Named Pipe zero-byte follow-up:

    • deterministic test added in:
      • tests/fixtures/c/test_named_pipe.c
        • fake server sends a valid HELLO_ACK, then a zero-byte pipe message, and the client maps the receive to NIPC_NP_ERR_RECV
    • exact clean win11 validation on the modified tree:
      • targeted build + ctest --test-dir build --output-on-failure -R "^test_named_pipe$": pass
      • direct coverage-build test_named_pipe.exe + gcov on netipc_named_pipe.c:
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:233: covered
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:234: covered
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:235: covered
    • implication:
      • the raw_recv() zero-byte branch is now covered honestly and proven deterministic on win11
  • latest Windows SHM server-disconnect follow-up:

    • deterministic test added in:
      • tests/fixtures/c/test_win_shm.c
        • HYBRID server receive after client close, asserting local_req_seq advances
    • exact clean win11 validation on the modified tree:
      • targeted build + ctest --test-dir build --output-on-failure -R "^test_win_shm$": pass
      • direct coverage-build test_win_shm.exe + gcov on netipc_win_shm.c:
        • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c:701: covered
        • file-specific gcov result after the targeted run:
          • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c: 92.97%
    • implication:
      • the HYBRID server-role disconnect sequence-advance branch is now covered honestly
  • latest deterministic Windows SHM receive slice:

    • deterministic tests added in:
      • tests/fixtures/c/test_win_shm.c
        • HYBRID client timeout_ms = 0 receive with a delayed real server sender
        • BUSYWAIT server receive after client close, asserting local_req_seq advances
    • exact clean win11 validation on the modified tree:
      • targeted build + ctest --test-dir build --output-on-failure -R "^test_win_shm$": pass
      • isolated ctest --test-dir build --output-on-failure -j1 --timeout 60 -R "^test_win_service$": pass
      • direct coverage-build test_win_shm.exe + gcov on netipc_win_shm.c:
        • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c:680: covered
        • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c:744: covered
        • file-specific gcov result after the targeted run:
          • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c: 92.70%
    • important honesty note:
      • the clean full parallel win11 ctest --test-dir build --output-on-failure -j4 still hit the old noisy test_win_service timeout tail in the handler-failure block
      • the clean full Windows C coverage script still timed out later in test_win_service_extra.exe
      • neither timeout is in the modified test_win_shm slice, so the authoritative signal for this slice is the targeted test_win_shm pass plus direct gcov on netipc_win_shm.c
  • current deterministic Windows SHM receive slice:

    • purpose:
      • cover the remaining ordinary nipc_win_shm_receive() branches without fake fault injection
    • exact target lines from fresh clean win11 gcov:
      • HYBRID client receive infinite-wait path:
        • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c:680
      • BUSYWAIT server-role disconnect sequence advance:
        • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c:744
    • planned deterministic tests:
      • HYBRID client timeout_ms=0 receive with a delayed real server sender
      • BUSYWAIT server receive after client close, asserting local_req_seq advances
    • non-goals for this slice:
      • Win32 create/open-event fault injection
      • spurious-wake deadline-expiry tricks around src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c:685
      • allocation-failure-only branches
    • latest authoritative slice:
      • latest Windows Named Pipe chunked-reuse slice:
        • deterministic test added:
          • a second large chunked round-trip on the same client session now proves the client reuses the already-grown receive buffer instead of reallocating it again
        • exact win11 validation on the modified tree:
          • bash tests/run-coverage-c-windows.sh 90: pass
          • test_named_pipe inside the clean coverage build: pass
        • current measured Windows C result:
          • total: 92.2%
          • src/libnetdata/netipc/src/service/netipc_service_win.c: 91.3%
          • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c: 92.4%
          • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c: 94.1%
        • implication:
          • the client chunked receive-buffer reuse fast-path in ensure_recv_buf() is now covered honestly
          • the remaining cheap Named Pipe ordinary targets are getting sparse
      • latest Windows C guard + protocol stabilization slice:
        • root-cause fixes applied:
          • the hybrid attach mismatch fake server now creates the wrong SHM profile from the start instead of mutating the region after creation
          • the hybrid attach guard now waits for terminal DISCONNECTED
          • the missing-string internal-error coverage case was moved from test_win_service_guards_extra.exe into the already-stable test_win_service_guards.exe
        • exact win11 validation on the modified tree:
          • test_win_service_guards.exe: 150 passed, 0 failed
          • test_win_service_guards_extra.exe: 33 passed, 0 failed
          • bash tests/run-coverage-c-windows.sh 90: pass
          • ctest --test-dir build --output-on-failure -j4: 28/28 passing
        • current measured Windows C result:
          • total: 92.1%
          • src/libnetdata/netipc/src/service/netipc_service_win.c: 91.4%
          • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c: 92.2%
          • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c: 93.5%
        • implication:
          • the Windows C 90% gate remains green after the new deterministic Named Pipe response-protocol tests
          • the dedicated coverage-only Windows guard harness is trustworthy again on the exact modified tree
      • latest C threshold verification:
        • Linux C was re-run locally and remains safely above the next shared threshold step
        • Windows C was re-run on win11 at the shared 90% gate after the guard-harness stabilization slice
        • measured result:
          • Linux C total: 94.1%
          • Windows C total: 92.2%
          • Windows C file breakdown:
            • src/libnetdata/netipc/src/service/netipc_service_win.c: 91.3%
            • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c: 92.4%
            • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c: 94.1%
        • implication:
          • the shared Linux/Windows C gate can now move from 85% to 90%
          • the dedicated Windows C coverage-only harness is the correct place for the extra Windows service-guard tests
          • the Windows C script is trustworthy again only when the following run as separate bounded direct executables before the generic ctest loop:
            • test_win_service_guards.exe
            • test_win_service_guards_extra.exe
            • test_win_service_extra.exe
    • latest ordinary Windows SHM transport slice:
      • fresh win11 gcov evidence before the slice showed the cheapest deterministic wins in:
        • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c (90.5%)
      • deterministic tests added:
        • HYBRID client-attach bad-param path when event-name object construction overflows:
          • exercised with a manually created valid HYBRID mapping so client_attach() reaches the event-name builder instead of failing earlier in OpenFileMappingW
        • HYBRID receive timeout / disconnect sequence tracking
        • BUSYWAIT receive timeout / disconnect sequence tracking
        • client-side oversized response handling returning NIPC_WIN_SHM_ERR_MSG_TOO_LARGE
      • validated result on the exact modified win11 tree:
        • targeted test_win_shm.exe: 91 passed, 0 failed
        • normal ctest --test-dir build --output-on-failure -j4: 28/28 passing
        • netipc_win_shm.c raised from 90.5% to 93.5%
        • one transient test_win_service_guards.exe timeout was seen on the first post-threshold rerun, but it did not reproduce on an isolated rerun or on the next full script rerun
    • next C threshold step:
      • with Linux C at 94.1% and Windows C at 92.2%, plus every tracked Windows C file above 90%, the next honest shared gate is 90%
      • non-goals for this threshold step:
        • Win32 fault-injection-only paths
        • service-layer malloc / realloc / _beginthreadex failures
        • impossible fixed-size encode guards like req_len == 0 in constant-size request paths
    • next ordinary Windows C target after the 90% gate raise:
      • fresh clean win11 gcov says netipc_service_win.c is still the lowest tracked file at 91.4%, but most of its misses are now:
        • allocation failure cleanup
        • fixed-size encode guards
        • session-array growth failures
      • the cheaper deterministic ordinary targets are now back in:
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c (91.8%)
      • strongest ordinary candidates from the current uncovered lines:
        • zero-byte disconnect handling in raw_recv():
          • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:233-235
        • fake-server HELLO_ACK send failure path:
          • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:498-500
        • client in-flight limit rejection in nipc_np_send():
          • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:734-738
        • short first-packet / bad decoded header protocol rejection in nipc_np_receive():
          • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:878-889
        • chunked receive path where ensure_recv_buf() returns an error:
          • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:929-931
      • non-goals for this slice:
        • CreateNamedPipeW / CreateFileW / SetNamedPipeHandleState fault-injection
        • chunk-buffer allocation failures
        • peer-close timing tricks that only sometimes hit a line
    • next ordinary C target after the 85% gate raise:
      • Windows C is no longer blocked by netipc_service_win.c
      • the next weakest tracked Windows C files are now:
        • src/libnetdata/netipc/src/service/netipc_service_win.c (90.1%)
        • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c (91.6%)
      • implication:
        • the next honest ordinary C gains should come from either:
          • remaining deterministic Windows Named Pipe branches
          • or ordinary Windows service retry / teardown paths
        • target files:
          • tests/fixtures/c/test_named_pipe.c
          • tests/fixtures/c/test_win_service_guards.c
        • do not spend time on Win32 fault-injection-only branches yet
    • current in-progress slice:
      • add only deterministic L1 Windows transport tests that match fresh gcov gaps
      • focus areas:
        • nipc_win_shm_server_create() / nipc_win_shm_client_attach() bad-parameter and validation branches
        • nipc_np_accept() / nipc_np_connect() / nipc_np_receive() bad-parameter and invalid-handle guards
        • only attempt handshake/protocol-path additions if the existing test harness already supports them cleanly
      • non-goals for this slice:
        • fault-injection-only Win32 failure paths
        • more service-layer coverage-only tests
    • latest Windows C L1 transport slice status:
      • important correction:
        • the first remote win11 runs in this slice were against the old remote tree and must be ignored
        • reason:
          • the local edits in:
            • tests/fixtures/c/test_named_pipe.c
            • tests/fixtures/c/test_win_shm.c
          • had not yet been copied to win11
        • after syncing the edited files to win11, the targeted validation on the real modified tree is:
          • test_named_pipe: pass
          • test_win_shm: pass
      • deterministic tests added in this slice:
        • Named Pipe:
          • null config / null out checks for nipc_np_connect() and nipc_np_listen()
          • null argument checks for nipc_np_accept()
          • null / invalid-handle checks for nipc_np_send() and nipc_np_receive()
          • null-pointer no-op close checks
        • Windows SHM:
          • null run_dir / null service_name validation for server create and client attach
          • long run_dir hash-overflow validation
          • long service-name object-name overflow validation
          • HYBRID-only event-name overflow validation
          • direct public nipc_win_shm_send() / nipc_win_shm_receive() bad-parameter checks
      • measured result on the real modified win11 coverage build:
        • direct gcov on the generated .gcno files reports:
          • netipc_service_win.c: 90.1% (702/779)
          • netipc_named_pipe.c: 91.8% (434/473)
          • netipc_win_shm.c: 91.6% (339/370)
          • implied combined total across the 3 tracked C Windows files:
            • 90.9% (1475/1622)
        • implication:
          • the ordinary Windows C transport tests did materially raise the two transport files
          • the remaining ordinary Windows C gaps are now much more concentrated in:
            • Named Pipe disconnect / send / limit / chunk-error branches
            • true Win32 failure paths
    • latest handshake / disconnect follow-up:
      • added fake-peer Windows Named Pipe tests for:
        • client HELLO_ACK protocol rejection
        • server HELLO protocol rejection
        • receive after peer disconnect
        • chunk-index validation failure
      • facts:
        • test_named_pipe passes on win11
        • the full bash tests/run-coverage-c-windows.sh 85 run now completes cleanly on win11
        • direct test_win_service_guards.exe runs complete with 142 passed, 0 failed
      • implication:
        • the earlier timeout seen during one intermediate rerun did not reproduce cleanly
        • the Windows C coverage script is currently trustworthy again on the real modified tree
    • current in-progress slice:
      • keep working in tests/fixtures/c/test_named_pipe.c
      • target only deterministic ordinary branches that match the fresh win11 gcov output:
        • nipc_np_receive() response payload limit rejection
        • nipc_np_receive() response batch item-count limit rejection
        • validate_batch() short / invalid batch directory rejection
        • nipc_np_send() zero chunk-budget guard
      • non-goals for this slice:
        • allocation-failure-only branches
        • CreateNamedPipeW / SetNamedPipeHandleState / CreateFileW fault-injection branches
        • any test that needs flaky peer-close timing just to hit a line
    • latest deterministic Named Pipe validation follow-up:
      • added fake-peer response tests for:
        • oversized response payload rejection
        • excessive batch item-count rejection
        • short batch-directory rejection
        • zero chunk-budget send rejection
      • measured result on win11:
        • netipc_service_win.c: 90.1% (702/779)
        • netipc_named_pipe.c: 91.8% (434/473)
        • netipc_win_shm.c: 91.6% (339/370)
        • combined total: 90.9% (1475/1622)
      • validation:
        • test_named_pipe: pass
        • bash tests/run-coverage-c-windows.sh 85: pass
      • implication:
        • netipc_named_pipe.c is no longer the gating Windows C file
        • the next honest ordinary Windows C target is now netipc_service_win.c
    • latest deterministic Windows service-coverage slice:
      • moved from Windows Named Pipe transport follow-up into deterministic netipc_service_win.c coverage
      • evidence from the fresh win11 gcov output:
        • server_typed_dispatch() still misses ordinary branches at:
          • string-reverse success path (src/libnetdata/netipc/src/service/netipc_service_win.c:836)
          • missing snapshot handler (src/libnetdata/netipc/src/service/netipc_service_win.c:843)
          • default unknown-method rejection (src/libnetdata/netipc/src/service/netipc_service_win.c:850)
        • server init / bookkeeping still misses ordinary paths at:
          • long run_dir truncation (src/libnetdata/netipc/src/service/netipc_service_win.c:936)
          • long service_name truncation (src/libnetdata/netipc/src/service/netipc_service_win.c:943)
        • cache / teardown still misses ordinary paths at:
          • next_power_of_2() non-minimum branch (src/libnetdata/netipc/src/service/netipc_service_win.c:1267)
          • hash-table collision probe in lookup (src/libnetdata/netipc/src/service/netipc_service_win.c:1456)
          • drain-timeout forced close path (src/libnetdata/netipc/src/service/netipc_service_win.c:1173)
      • ordinary targets for this slice:
        • add direct typed-handler coverage-only tests for:
          • string-reverse success
          • missing increment/snapshot handler failure
          • unknown method mapping to internal error
        • add cache refresh tests with enough items and controlled collisions to hit:
          • next_power_of_2() for n >= 16
          • collision probe during lookup
        • if stable, add a short-timeout drain test that forces the CancelIoEx() branch
      • non-goals for this slice:
        • calloc / realloc / _beginthreadex / WinSHM create fault-injection branches
        • peer-close timing tricks that only sometimes hit the line
        • any regression to the normal win11 ctest path
      • validation fact:
        • test_win_service_guards.exe passes on win11 in:
          • direct targeted runs
          • the exact ctest-subset + guarded timeout 120 .../test_win_service_guards.exe reproduction
          • the full bash tests/run-coverage-c-windows.sh 85 path
        • implication:
          • the earlier wedge was a script-launch reliability issue, not a proven library/test correctness issue
          • the coverage script now launches the guard executable under a bounded timeout and fails explicitly if it hangs
    • latest Windows guard-test blocker diagnosis:
      • fresh win11 reruns after the deterministic Named Pipe response-protocol slice showed the new Named Pipe test is not the blocker:
        • targeted test_named_pipe.exe: 120 passed, 0 failed
      • the real failing point is again:
        • tests/fixtures/c/test_win_service_guards.c:test_hybrid_attach_failure_disconnects()
      • concrete evidence:
        • direct win11 rerun of test_win_service_guards.exe failed the two assertions:
          • hybrid attach failure leaves client not ready
          • hybrid attach failure maps to DISCONNECTED
        • then the executable later timed out
      • root cause from code review:
        • the fake server currently creates a HYBRID SHM region and only then mutates its header profile to BUSYWAIT
        • file/lines:
          • tests/fixtures/c/test_win_service_guards.c:483-497
        • implication:
          • the client can sometimes attach successfully before the post-create mutation becomes visible
        • the assertion is also too eager:
          • the test assumes a single nipc_client_refresh() is enough, while the real Windows client performs bounded SHM attach retries inside client_try_connect()
          • file/lines:
            • src/libnetdata/netipc/src/service/netipc_service_win.c:91-127
            • src/libnetdata/netipc/src/service/netipc_service_win.c:376-404
      • fix approach for the next slice:
        • make the fake server create the mismatched BUSYWAIT region from the start for the bad-profile mode
        • then wait for the client to reach terminal DISCONNECTED instead of assuming one refresh call is enough
    • latest Windows extra-guard blocker diagnosis:
      • after fixing the hybrid attach race, the full win11 coverage script still timed out in:
        • test_win_service_guards_extra.exe
      • concrete evidence:
        • the executable stalls in test_missing_string_handler_returns_internal_error()
        • the last successful log line is:
          • missing-string raw send ok
      • code evidence:
        • the hanging case is implemented in:
          • tests/fixtures/c/test_win_service_guards_extra.c:489-538
        • the same pattern already exists and runs stably in the main guard executable for:
          • unknown method
          • missing increment handler
          • missing snapshot handler
          • tests/fixtures/c/test_win_service_guards.c:860-999
      • implication:
        • this is a harness-placement problem, not evidence that the service branch is fundamentally untestable
      • fix approach for the next slice:
        • move the missing-string internal-error case into test_win_service_guards.c
        • remove it from test_win_service_guards_extra.c
        • keep the extra executable focused on the worker-limit / destroy / send-failure cases that already complete reliably under gcov
    • next ordinary Windows Named Pipe target after the guard-harness stabilization:
      • fresh win11 gcov after the fixed 90% coverage rerun reports:
        • netipc_named_pipe.c: 92.2% (436/473)
      • important correction:
        • the nipc_np_send() NIPC_NP_ERR_LIMIT_EXCEEDED branch at src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:738 is not an ordinary in-flight-limit policy branch here
        • code evidence:
          • inflight_add() returns -2 only on realloc() failure
          • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:251-255
        • implication:
          • do not waste time pretending this is a normal deterministic coverage target
      • next honest deterministic targets:
        • client chunked receive-buffer reuse fast-path in ensure_recv_buf():
          • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:835-836
        • if deterministic on win11, successful zero-byte disconnect handling in raw_recv():
          • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c:233-235
    • next ordinary Windows SHM receive target:
      • fresh win11 gcov after the latest Named Pipe slice still reports ordinary wait / disconnect misses in:
        • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c:666-685
        • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c:744
      • chosen deterministic targets for the next slice:
        • HYBRID client receive with timeout_ms = 0 and a delayed real server sender
          • purpose:
            • cover the infinite-wait path (wait_ms = INFINITE) without using fake fault injection
        • BUSYWAIT server-side receive after client close
          • purpose:
            • cover the server-role disconnect branch that advances local_req_seq
      • non-goals for this slice:
        • Win32 create/open-event fault injection
        • spurious-wake deadline-expiry tricks unless they prove deterministic on win11
    • next ordinary Windows C target after the stabilized service slice:
      • fresh post-fix gcov evidence from netipc_service_win.c shows the remaining ordinary branches are now concentrated in:
        • worker-limit rejection in server_run():
          • src/libnetdata/netipc/src/service/netipc_service_win.c:1045-1049
        • active-session join / cleanup in server_destroy():
          • src/libnetdata/netipc/src/service/netipc_service_win.c:1225-1230
        • cache refresh failure on malformed snapshot rebuild:
          • src/libnetdata/netipc/src/service/netipc_service_win.c:1414-1415
        • a few service-loop disconnect / send-failure paths:
          • src/libnetdata/netipc/src/service/netipc_service_win.c:797-798
      • implications:
        • netipc_named_pipe.c and netipc_win_shm.c are no longer the best ordinary targets
        • the next honest Windows C gains should come from more deterministic service-coverage tests in tests/fixtures/c/test_win_service_guards.c
      • non-goals for the next slice:
        • malloc / calloc / realloc / _beginthreadex fault-injection
        • WinSHM mapping / event creation failures
        • peer-close timing tricks that only sometimes hit a line
      • fresh execution-order fact from the new service follow-up:
        • on win11, coverage-build test_win_service.exe passes standalone under:
          • ctest --test-dir build-windows-coverage-c --output-on-failure -V -j1 -R "^test_win_service$" --timeout 60
        • the dedicated guard executable also stops being trustworthy when it is run after the coverage subset
        • standalone test_win_service_extra.exe also passes cleanly, which points to the grouped ctest -R ... coverage invocation itself as the unstable layer
        • the only clean-build guard failures left are the old missing-string-handler assertions inside the mixed client-guard test; the equivalent dedicated missing-increment and missing-snapshot service cases already pass
        • implication:
          • this is currently a coverage-order interaction, not a proven correctness bug in test_win_service.exe or the guard executable
          • the Windows C coverage script should run the coverage-relevant Windows C tests one-by-one in an explicit known-good order, instead of relying on grouped ctest -R ... invocations
          • the missing-string-handler raw check should move into the dedicated typed-dispatch test block so it uses its own clean service instance
    • follow-up from the first Windows C service fix attempt:
      • adding the new client guard tests directly into tests/fixtures/c/test_win_service_extra.c did raise Windows C coverage in the coverage build
      • but that same edit introduced a real side effect in the normal win11 build:
        • test_win_service_extra.exe hangs in the ordinary build/ ctest path
        • the same executable still passes in the coverage build
      • implication:
        • the new ordinary guard tests should live in a dedicated Windows C coverage-only executable
        • the default ctest executable test_win_service_extra.exe should stay on its previously stable path
      • implemented resolution:
        • added tests/fixtures/c/test_win_service_guards.c
        • built it as test_win_service_guards.exe
        • kept it out of the default ctest inventory
        • ran it only from tests/run-coverage-c-windows.sh
    • decision made by Costa:
      • raise the Go coverage gate from 85% to 90%
      • keep the Go coverage gate policy identical on Linux and Windows
    • implementation implication of that decision:
      • update:
        • tests/run-coverage-go.sh
        • tests/run-coverage-go-windows.sh
      • refresh the active coverage docs to reflect the new enforced Go threshold
      • Linux and win11 must both pass the new 90% gate on the exact current tree
    • verified result after applying the Go gate change:
      • Linux Go: 95.8%
      • Windows Go (win11): 96.7%
      • implication:
        • the shared Linux/Windows Go gate can now safely move to 90%
    • decision made by Costa:
      • raise the Rust coverage gate from 80% to 90%
      • keep the Rust coverage gate policy identical on Linux and Windows
    • implementation implication of that decision:
      • update:
        • tests/run-coverage-rust.sh
        • tests/run-coverage-rust-windows.sh
      • refresh the active coverage docs to reflect the new enforced Rust threshold
      • revalidate Linux locally
      • fresh win11 rerun now verifies Windows Rust coverage at 93.68%
    • latest narrow ordinary deterministic Rust follow-up is complete:
      • completed targets:
        • direct UdsListener::accept() failure on a closed listener fd
        • ShmContext::owner_alive() with cached generation 0 skipping generation mismatch checks
        • ShmContext::receive() waking successfully with a finite timeout budget
      • measured result:
        • Linux Rust total moved from 98.52% to 98.57%
        • src/transport/posix.rs moved from 97.35% to 97.50%
        • src/transport/shm.rs moved from 96.04% to 96.20%
        • Rust lib tests moved from 291/291 to 294/294
      • implication:
        • the ordinary deterministic Rust targets are close to exhausted; the remaining gains are very small
    • current Windows C split validation status:
      • first win11 rebuild of the split harness fails at compile time in:
        • tests/fixtures/c/test_win_service_guards.c:851
      • concrete compiler error:
        • error: 'service' undeclared (first use in this function)
      • code-review fact:
        • the old missing-string-handler raw check was moved into test_string_dispatch_missing_handlers_and_unknown_method()
        • but that move was only partial: the block now references service without its own local service/server setup
        • this is a test-harness revert bug, not a service-layer regression
      • implication:
        • restore the dedicated missing-string service case cleanly before any further win11 runtime validation
    • follow-up evidence after restoring the dedicated missing-string service case:
      • win11 normal build:
        • test_win_service_guards.exe passes standalone with 149 passed, 0 failed
      • win11 coverage build:
        • the same executable still wedges only in the dedicated missing-string service case
        • the stall point is reproducible:
          • log stops after PASS: missing-string raw send ok
        • the small dedicated test_win_service_guards_extra.exe still passes cleanly with 33 passed, 0 failed
      • implication:
        • this is still a coverage-build harness stability issue, not a proven netipc_service_win.c missing-string dispatch bug
        • the next fix should move the missing-string dedicated service case out of the old large guard executable and into the small extra guard executable
    • current split follow-up:
      • while removing the missing-string block from tests/fixtures/c/test_win_service_guards.c, the old guard file picked up a local brace mismatch
      • concrete win11 compiler errors:
        • tests/fixtures/c/test_win_service_guards.c:935:5: error: expected identifier or '(' before '{' token
        • tests/fixtures/c/test_win_service_guards.c:981:1: error: expected identifier or '(' before '}' token
      • implication:
        • fix the local syntax regression first, then rerun the guard split validation
    • fresh clean-build result after fixing the syntax regression and moving missing-string into the extra guard binary:
      • fresh win11 coverage build now gets through:
        • old large guard executable: 140 passed, 0 failed
        • new small extra guard executable: 42 passed, 0 failed
        • per-test loop through:
          • test_protocol
          • interop_codec
          • fuzz_protocol_30s
          • test_named_pipe
          • test_named_pipe_interop
          • test_win_shm
          • test_win_service
      • it then stalls when the loop reaches:
        • ctest --test-dir build-windows-coverage-c --output-on-failure -j1 -R "^test_win_service_extra$"
      • code-review fact:
        • the per-test loop currently has no explicit ctest --timeout
      • implication:
        • add an explicit per-test timeout to the Windows C coverage loop
        • then verify whether test_win_service_extra is only a bounded slow/hung test in this position, or whether it needs a separate known-good order
    • follow-up from the fresh clean-build loop:
      • test_win_service_extra is not merely unbounded in this position; a timeout alone would not save it
      • the fresh clean coverage loop fails it concretely after 72.82 sec
      • captured log stops in:
        • --- Cache refresh rebuilds / linear lookup ---
      • implication:
        • test_win_service_extra should be treated like the guard executables:
          • run it as a separate bounded direct executable in a known-good position
          • remove it from the generic per-test ctest loop
        • keep a per-test ctest --timeout for the remaining loop entries anyway
    • final validation fact for this slice:
      • fresh clean win11 coverage run now completes successfully with:
        • netipc_service_win.c: 91.4%
        • netipc_named_pipe.c: 91.8%
        • netipc_win_shm.c: 90.5%
        • total: 91.3%
      • a later full parallel win11 ctest --test-dir build -j4 run had one noisy slow tail on test_win_service
      • isolated rerun immediately after that:
        • ctest --test-dir build --output-on-failure -j1 --timeout 60 -R "^test_win_service$"
        • result: pass in 0.28 sec
      • implication:
        • there is no evidence that this coverage-only slice introduced a normal-suite regression
    • latest Rust coverage position:
      • the remaining Rust misses are now even more concentrated in non-ordinary territory
      • cheap deterministic gains still exist, but they are now very small
    • next ordinary deterministic Rust review should treat the remaining misses as:
      • src/service/cgroups.rs
        • remaining misses are now mostly fixed-size encode guards, listener teardown edges, send-break timing, or already-tested branches that llvm-cov still maps as uncovered
        • recommendation:
          • do not grind these blindly
          • only add tests if they exercise a clearly ordinary deterministic path
      • src/transport/posix.rs
        • the remaining misses in this file are mostly:
          • socket/listen/connect probe syscall failures
          • structurally unreachable zero-arm math
      • src/transport/shm.rs
        • one still-possible but low-value ordinary target remains in the receive path:
          • immediate timeout before any futex wait completes
        • the remaining misses in this file are otherwise mostly:
          • ftruncate / mmap / fstat failure branches
          • impossible CString conversion failures for directory entries
          • cleanup corner cases already exercised but still mapped sparsely
    • Linux Rust coverage collection is now standardized on cargo-llvm-cov, matching Windows Rust coverage policy
    • Linux Rust now excludes Windows-tagged Rust files from the Linux total:
      • src/service/cgroups_windows_tests.rs
      • src/transport/windows.rs
      • src/transport/win_shm.rs
    • removed the old tarpaulin-only Linux drift from the default Linux Rust script
    • Linux Unix Rust service tests are now split out of src/service/cgroups.rs into:
      • src/service/cgroups_unix_tests.rs
    • reason:
      • cargo-llvm-cov counts inline #[cfg(test)] code inside the production file
      • that made valid new tests lower the reported runtime coverage of src/service/cgroups.rs
    • latest ordinary Unix Rust service slice added deterministic coverage for:
      • managed-server recovery after malformed short UDS request
      • managed-server recovery after malformed UDS header
      • managed-server recovery after peer-close during UDS response send
      • managed-server recovery after malformed short SHM request
      • managed-server recovery after malformed SHM header
      • poll_fd() readable and deterministic EINTR handling
    • latest ordinary Linux Rust SHM slice added deterministic coverage for:
      • cleanup_stale() on missing run dir
      • cleanup_stale() ignoring unrelated and non-UTF8 entries
      • check_shm_stale() recovering zero-generation stale files
    • latest Linux Rust transport follow-up:
      • Unix transport tests were split out of:
        • src/transport/posix.rs
        • src/transport/shm.rs
      • into:
        • src/transport/posix_tests.rs
        • src/transport/shm_tests.rs
      • reason:
        • same as the earlier Unix service split
        • keep runtime coverage honest by avoiding inline #[cfg(test)] code inside the production transport files
      • measured effect on the kept transport split:
        • Linux Rust total moved from 98.70% to 98.47%
        • src/transport/posix.rs moved from 99.00% to 97.35%
        • src/transport/shm.rs moved from 96.85% to 95.71%
      • next deterministic targets on top of that split:
        • check_shm_stale() open-failure cleanup
        • check_shm_stale() mmap-failure cleanup
        • cleanup_stale() mmap-failure cleanup
    • result after adding those 3 ordinary SHM stale-cleanup tests:
      • Rust lib tests: 291/291 passing
      • Linux Rust total moved from 98.47% to 98.52%
      • src/transport/shm.rs moved from 95.71% to 96.04%
    • latest protocol follow-up finding:
      • src/protocol/increment.rs, src/protocol/string_reverse.rs, and src/protocol/cgroups.rs still had inline #[cfg(test)] modules
      • they were split out experimentally for the same reason as the Unix service split
      • measured effect on the experimental protocol split:
        • Rust lib tests stay 291/291 passing
        • Linux Rust total moved from 98.52% down to 98.49%
        • src/protocol/increment.rs now reports 95.83%
        • src/protocol/string_reverse.rs now reports 97.83%
        • src/protocol/cgroups.rs now reports 99.64%
      • implication:
        • the protocol split does not currently buy enough honest runtime signal to justify the lower total on its own
        • this is now a real coverage-accounting decision point, not just an implementation detail
    • decision made by Costa:
      • keep the Unix Rust transport split
      • revert the Rust protocol split
      • keep the new deterministic SHM stale-cleanup tests
    • implementation implication of that decision:
      • restore inline tests in:
        • src/protocol/increment.rs
        • src/protocol/string_reverse.rs
        • src/protocol/cgroups.rs
      • remove the experimental protocol-only test files:
        • src/protocol/increment_tests.rs
        • src/protocol/string_reverse_tests.rs
        • src/protocol/cgroups_tests.rs
    • current result after applying Costa's decision:
      • keep the transport split
      • keep the new deterministic SHM stale-cleanup tests
      • revert the protocol split
  • Latest verified Linux Rust result:

    • bash tests/run-coverage-rust.sh 80
    • tool on this host: cargo-llvm-cov
    • total: 98.57% (3998/4056 executed, 58 missed)
    • key files:
      • service/cgroups.rs: 98.28% (802/816)
      • transport/posix.rs: 97.50% (663/680)
      • transport/shm.rs: 96.20% (583/606)
    • exact validated state after the latest Rust slice:
      • cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1: 294/294 passing
      • /usr/bin/ctest --test-dir build --output-on-failure -j4: 37/37 passing
  • Latest verified Linux C result:

    • bash tests/run-coverage-c.sh
    • total: 94.1%
    • key files:
      • netipc_protocol.c: 98.7%
      • netipc_uds.c: 92.9% (434/467)
      • netipc_shm.c: 95.1% (346/364)
      • netipc_service.c: 92.1% (734/797)
  • Latest verified test results for this slice:

    • bash tests/run-coverage-rust.sh 80: passing
    • cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1: 294/294 passing
    • /usr/bin/ctest --test-dir build --output-on-failure -j4: 37/37 passing
  • Immediate next target:

    • Linux C ordinary deterministic coverage is starting to saturate
    • fresh review and the latest gcov output say:
      • netipc_shm.c remaining lines are mostly OS-failure / timeout / path-length territory
      • the remaining netipc_protocol.c BAD_ITEM_COUNT lines are size_t overflow guards and are not reachable on this 64-bit host from a uint32_t item_count
      • some remaining netipc_uds.c and netipc_protocol.c lines still report uncovered even though direct public tests already exercise the corresponding bad-param / bad-kind paths
      • the remaining netipc_service.c holes are now mostly encode-guard, allocation, signal, peer-close timing, or session-allocation / thread-creation territory
    • recommendation:
      • stop grinding Linux C for now
      • switch the next ordinary deterministic slice back to Linux Rust coverage
      • use the current C state as the new baseline when raising thresholds
  • Fresh Linux Rust baseline on the exact current tree:

    • superseded by the new cargo-llvm-cov Linux baseline above
    • old tarpaulin baseline is historical only and should not be treated as the active Linux Rust total anymore
    • Linux-side ordinary candidates still visible in the report:
      • src/service/cgroups.rs
        • remaining likely special or low-value branches:
          • fixed-size encode guards:
            • 189
            • 202
            • 221
            • 252
          • listener loop / teardown edges:
            • 1050
            • 1062
          • remaining transport break paths:
            • 1431
            • 1445
            • 1552
            • 1563
          • poll_fd() residual lines after the new readable / EINTR tests:
            • 1597
            • 1598
            • 1611
            • 1613
      • src/transport/posix.rs
        • remaining gaps are now mostly:
          • syscall / listener creation failures:
            • 226
            • 532
            • 550-555
            • 830
          • structurally unreachable zero-arm math:
            • 298
            • 427
          • test-only panic lines in Rust transport tests:
            • 2485
            • 3016
            • 3173
            • 3234
            • 3288
            • 3346
      • src/transport/shm.rs
        • remaining gaps are now mostly:
          • raw OS failure branches:
            • 245-250
            • 264-269
            • 335-336
            • 356-357
            • 963-964
          • deadline-expired receive before futex wait:
            • 601
          • stale / cleanup corner cases:
            • 722
            • 755-756
          • sparsely mapped but already-exercised receive/copy path:
            • 635
    • explicit non-goals for the next Rust slice:
      • fixed-size encode guards:
        • src/service/cgroups.rs: 189, 202, 221, 252
      • chunk-count zero-arm lines that are structurally unreachable in the chunked path:
        • src/transport/posix.rs: 298, 427
      • raw socket / listen / bind / syscall-failure branches:
        • src/transport/posix.rs: 226, 532, 550, 552, 554-555, 577, 830
      • Windows-tagged files are now excluded from the Linux Rust total by the default Linux script
  • Note:

    • the older slice notes below are historical context
    • they are no longer the authoritative current state
    • one new layering fact is now explicit:
      • malformed batch directories on POSIX UDS are rejected by L1 before the managed Rust L2 loop can return INTERNAL_ERROR
      • the honest ordinary coverage path for that branch is Linux SHM, not UDS

Decision Needed (2026-03-24): Linux Rust Coverage Collection

  • Status:

    • implemented
    • Linux default Rust coverage now uses cargo-llvm-cov
    • Linux default Rust coverage now excludes Windows-tagged Rust files from the Linux total
    • the historical evidence below explains why this decision was made
  • Background:

    • Linux Rust coverage is now the next honest bottleneck after the recent C and Go gains.
    • The current Linux script auto-picks cargo-llvm-cov when available, otherwise falls back to cargo-tarpaulin:
      • tests/run-coverage-rust.sh
    • On this machine, only cargo-tarpaulin is installed:
      • command -v cargo-llvm-cov -> empty
      • command -v cargo-tarpaulin -> /home/costa/.cargo/bin/cargo-tarpaulin
    • The latest verified Linux Rust result is therefore coming from tarpaulin:
      • bash tests/run-coverage-rust.sh 80
      • total: 90.76% (1886/2078)
    • Evidence from the current docs and report:
      • README.md
      • COVERAGE-EXCLUSIONS.md
      • Windows-tagged Rust files are still counted in the Linux total on this host:
        • src/service/cgroups_windows_tests.rs
        • src/transport/windows.rs
        • src/transport/win_shm.rs
  • Official tool facts:

    • cargo-llvm-cov supports:
      • total gating with --fail-under-lines
      • file filtering with --ignore-filename-regex
      • summary-only reporting
    • source:
      • https://github.com/taiki-e/cargo-llvm-cov
    • tarpaulin supports file exclusion and code exclusion, but on Linux its default backend is still ptrace, and the project documents backend-dependent accuracy differences.
    • source:
      • https://github.com/xd009642/tarpaulin
  • Open-source examples already reviewed:

    • /opt/baddisk/monitoring/openobserve/openobserve/coverage.sh
      • uses cargo llvm-cov
      • uses --ignore-filename-regex
    • /opt/baddisk/monitoring/clickhouse/rust_vendor/aws-lc-rs-1.13.3/Makefile
      • uses cargo llvm-cov
      • uses --fail-under-lines
      • uses --ignore-filename-regex
  • Facts that matter for the decision:

    • Linux and Windows Rust coverage policy already uses the same nominal threshold (80%), but the collection method is inconsistent.
    • Windows Rust is already using native cargo-llvm-cov in:
      • tests/run-coverage-rust-windows.sh
    • The remaining Linux Rust total is increasingly polluted by:
      • Windows-tagged files counted on Linux
      • helper / test-module lines
      • fault-injection / syscall-failure paths
  • Decision options:

    • Option A
      • Keep Linux on tarpaulin by default and continue adding ordinary tests only.
      • Pros:
        • smallest script change
        • no new tool install on Linux
      • Implications:
        • Linux and Windows Rust measurement stay inconsistent
        • Linux totals continue to include Windows-tagged files on this machine
      • Risks:
        • more time spent chasing non-Linux noise instead of real Linux gaps
        • harder to compare Linux vs Windows Rust coverage honestly
    • Option B
      • Keep the current auto-detect script, but add Linux-side excludes so tarpaulin stops counting Windows-tagged files.
      • Pros:
        • smaller change than a full tool switch
        • keeps existing local workflow
      • Implications:
        • Linux still depends on whichever tool happens to be installed
        • output semantics still differ between hosts
      • Risks:
        • two developers can get different Linux Rust totals from the same tree
        • the policy remains harder to reason about
    • Option C
      • Standardize Linux Rust on cargo-llvm-cov, matching Windows, and use an explicit ignore regex for Windows-tagged files in the Linux run.
      • Pros:
        • same Rust coverage tool family on Linux and Windows
        • honest Linux totals focused on Linux-relevant Rust code
        • built-in gating and cleaner summary/report flow
      • Implications:
        • Linux script behavior changes
        • local Linux coverage now requires cargo-llvm-cov
      • Risks:
        • one-time tool-install cost on Linux
        • report numbers will shift, so docs and the current baseline must be refreshed
  • Recommendation:

    • Option C
    • Reason:
      • it is the cleanest way to make Linux and Windows Rust coverage policy genuinely consistent
      • it removes the current “same threshold, different measurement semantics” drift
      • it prevents wasting more effort on Windows-only lines while we are trying to improve Linux coverage
  • Decision made by Costa:

    • Option C
    • implement Linux Rust coverage with cargo-llvm-cov
    • use an explicit Linux-side ignore regex for Windows-tagged files
    • refresh the Linux Rust baseline and sync the docs after the switch
  • Result after the follow-up Unix test-file split:

    • service/cgroups.rs no longer contains the Unix test module inline
    • the Unix tests now live in src/service/cgroups_unix_tests.rs
    • the exact verified Linux Rust rerun after the split is:
      • total: 98.70%
      • service/cgroups.rs: 98.28%
      • transport/posix.rs: 99.00%
      • transport/shm.rs: 96.85%
    • exact verified Linux regression-test results after the split:
      • cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1: 279/279 passing
      • /usr/bin/ctest --test-dir build --output-on-failure -j4: 37/37 passing
  • Current execution slice after the Linux cargo-llvm-cov switch:

    • keep the next Rust work on Linux only
    • focus on deterministic service/cgroups.rs gaps that still count in the new Linux total:
      • managed-server loop break paths:
        • 1050
        • 1062
        • 1421
        • 1425
        • 1431
        • 1445
        • 1552
        • 1563
      • still-counted inline test/helper branches that are cheap and deterministic:
        • 1946
        • 1957
        • 2166
        • 2183
        • 2463
        • 2480
    • explicit non-goals for this slice:
      • fixed-size encode guards in typed APIs
      • raw syscall / mmap / bind fault-injection paths
      • poll_fd() branches that need unreliable signal timing unless a deterministic reproducer is found
      • multiline llvm-cov line-mapping artifacts like the already-tested chunk_index mismatch formatting line in transport/posix.rs
    • new fact discovered during this slice:
      • adding more inline tests inside src/service/cgroups.rs can lower the measured file coverage under cargo-llvm-cov, even when the new tests are valid and all pass
      • this is now fixed by moving the Unix tests into src/service/cgroups_unix_tests.rs
      • the coverage regression from inline test growth no longer applies to the runtime file
    • decision made by Costa:
      • move the Linux Rust service tests out of src/service/cgroups.rs
      • mirror the existing split-file test pattern already used by the Windows Rust service tests
  • Current execution slice after a36cf6e:

    • stay on Linux Rust only
    • keep only ordinary deterministic targets in scope:
      • src/service/cgroups.rs
        • raw response-envelope mismatch guards in the typed request-buffer paths:
          • 550
          • 587
          • 626
        • Linux managed-server SHM-upgrade rejection:
          • 1090
          • 1230
        • direct helper branches that are still deterministic:
          • 1594-1598
          • 1613
      • src/transport/posix.rs
        • chunk-index mismatch formatting path:
          • 452-453
        • direct helper / fallback branches that can be hit without syscall fault injection:
          • 671
          • 742 only if peer-close produces a deterministic send failure
    • explicit non-goals for this slice:
      • fixed-size encode guards in typed APIs (189, 202, 221, 252)
      • test-helper panic / timeout lines (1919, 1922, 2024, 2058, 2116, 2132-2133)
      • raw socket/listen/accept creation failure branches (226, 532, 550-555, 577, 830)
  • Current execution slice after e0a0f7d:

    • switch from Rust to C
    • next ordinary target is src/libnetdata/netipc/src/service/netipc_service.c
    • fresh evidence from bash tests/run-coverage-c.sh 82:
      • total: 90.5%
      • netipc_protocol.c: 98.7%
      • netipc_uds.c: 89.7%
      • netipc_shm.c: 91.2%
      • netipc_service.c: 86.6%
    • keep only ordinary deterministic C service targets in scope:
      • client typed-call branches:
        • default client buffer sizing (33, 41)
        • empty batch fast-path (515)
        • request-buffer overflow / truncation for batch and string-reverse (519, 608)
        • SHM short / malformed response handling (188, 191, 195, 246, 248, 250, 556-560, 622)
      • Linux SHM negotiation failure branches:
        • client attach failure after handshake (121-124)
        • server-side SHM create failure on negotiated sessions (1113-1118)
      • typed dispatch ordinary branches:
        • missing typed handlers for increment / string-reverse / snapshot (693-716)
    • explicit non-goals for this slice:
      • malloc / calloc / realloc failure paths (373-381, 803-805, 999, 1125, 1139, 1161)
      • raw socket / listen / accept / thread-create failures in L1-managed code
      • any branch that needs fault injection instead of a normal public test
    • first deterministic implementation subset for this slice:
      • tests/fixtures/c/test_service.c
        • client init defaults + long-string truncation
        • empty increment-batch fast-path
        • tiny request-buffer overflow for increment-batch and string-reverse
        • negotiated SHM obstruction that forces:
          • server-side SHM create rejection
          • client-side SHM attach failure after handshake
      • tests/fixtures/c/test_hardening.c
        • typed server with partial / missing handler tables so the managed typed dispatch covers:
          • missing increment handler
          • missing string-reverse handler
          • missing snapshot handler
    • deferred to the next C slice unless this subset leaves them clearly ordinary:
      • SHM malformed-response envelope coverage for:
        • short response
        • bad decoded header
        • wrong kind / code / message_id / item_count on SHM responses
    • fresh measured result after the first deterministic C subset:
      • bash tests/run-coverage-c.sh 82
      • total: 91.7%
      • netipc_service.c: 89.6% (714/797)
      • exact wins from the first subset:
        • client init defaults + truncation now covered
        • empty increment-batch fast-path now covered
        • tiny request-buffer overflow guards for batch and string-reverse now covered
        • typed dispatch missing-handler branches now covered
        • negotiated SHM obstruction now covers both:
          • server-side SHM create rejection
          • client-side SHM attach failure after handshake
    • next ordinary C subset from the fresh uncovered list:
      • typed-server success paths in server_typed_dispatch():
        • increment dispatch call (696)
        • string-reverse dispatch call (704)
        • snapshot dispatch call (712)
        • default snapshot_max_items == 0 path (678)
      • SHM fixed-size send-buffer overflow on the increment path:
        • transport_send() overflow (149)
        • do_increment_attempt() propagating do_raw_call() error (483)
      • cheap server-init ordinary guards:
        • worker_count normalization (970)
        • server run_dir / service_name truncation paths (976, 982)
  • Overall theme of this phase: coverage parity and documentation honesty, not emergency benchmark or transport fixes.

  • Current execution slice after f4fdc10:

    • continue only with the remaining Linux-ordinary Rust targets from the earlier 88.98% tarpaulin rerun
    • exact next scope for this slice:
      • src/service/cgroups.rs
        • retry-second-failure branches in raw_call_with_retry_request_buf() and raw_batch_call_with_retry_request_buf()
        • Linux negotiated SHM attach-failure path in try_connect()
        • SHM short-message rejection in transport_receive()
        • remaining managed-server batch failure branches if they are still reachable without synthetic hooks
      • src/transport/posix.rs
        • remaining ordinary helper / handshake branches from the fresh uncovered-line list
        • do not chase raw socket creation or short-write failure paths in this slice
      • src/transport/shm.rs
        • only if a still-ordinary stale-cleanup / stale-open path remains after direct review
    • explicit non-goals for this slice:
      • Windows-tagged Rust lines still counted by tarpaulin
      • raw syscall / mmap / ftruncate / fstat fault-injection paths
      • deferred Windows managed-server retry / shutdown behavior
  • Current execution slice after the latest Linux Rust ordinary follow-up:

    • completed the next ordinary Rust transport / cache slice and revalidated Linux end-to-end
    • latest ordinary Rust additions:
      • src/transport/posix.rs
        • real payload-limit rejection
        • non-chunked invalid batch-directory validation
        • chunk total_message_len mismatch
        • chunk chunk_payload_len mismatch
      • src/service/cgroups.rs
        • cache malformed-item refresh preserves the old snapshot cache
      • tests/test_service_interop.sh
        • fixed the real POSIX service-interop readiness bug by waiting for the socket path after READY
    • exact Linux Rust result for that earlier verified rerun:
      • bash tests/run-coverage-rust.sh 80
      • current tool on this host: tarpaulin
      • total at that point: 88.98%
      • key files:
        • src/service/cgroups.rs: 623/664
        • src/transport/posix.rs: 377/401
        • src/transport/shm.rs: 346/375
    • final validation for this slice:
      • cargo test --lib -- --test-threads=1: 247/247 passing
      • /usr/bin/ctest --test-dir build --output-on-failure -j1 -R ^test_service_interop$ --repeat until-fail:10: passing
      • cmake --build build -j4: passing
      • /usr/bin/ctest --test-dir build --output-on-failure -j4: 37/37 passing
    • current implication:
      • Linux Rust is still improving, but the ordinary gains are now smaller
      • the remaining Linux Rust total is increasingly concentrated in:
        • retry-second-failure paths
        • Linux negotiated SHM attach failure
        • SHM short-message rejection
        • a few managed-server batch failure branches
        • Windows-tagged lines still counted by tarpaulin
        • and real syscall / timeout / race territory
  • Current execution slice after the latest Linux Rust ordinary-coverage pass:

    • completed the first direct Linux Rust follow-up after the POSIX Go transport/service cleanup
    • added ordinary Rust L2 SHM service coverage for:
      • snapshot
      • increment
      • string-reverse
      • increment-batch
      • malformed response envelopes and helper bounds
    • added direct Linux Rust transport coverage for:
      • short UDS packets
      • non-chunked batch-directory underflow
      • chunk message-id mismatch
      • live-server bind() rejection
      • SHM live-region rejection
      • SHM short-file / undersized-region attach failures
      • SHM invalid-entry cleanup and no-deadline receive behavior
    • exact Linux Rust result for that earlier verified rerun:
      • bash tests/run-coverage-rust.sh 80
      • current tool on this host: tarpaulin
      • total at that point: 88.98%
      • key files:
        • src/service/cgroups.rs: 623/664
        • src/transport/posix.rs: 377/401
        • src/transport/shm.rs: 346/375
    • final validation for this slice:
      • cargo test --lib -- --test-threads=1: 247/247 passing
      • cmake --build build -j4: passing
      • /usr/bin/ctest --test-dir build --output-on-failure -j4: 37/37 passing
    • implication:
      • Linux Rust is no longer sitting at the old 80.85% floor
      • the remaining Rust total is now a mix of:
        • still-ordinary helper / validation branches
        • Windows-tagged lines still counted by tarpaulin
        • and real syscall / timeout / race territory
      • one exact layering fact is now proven:
        • on POSIX baseline, bad response message_id does not reach the L2 envelope checks
        • UdsSession::receive() rejects it first as UnknownMsgId, and transport_receive() maps that to NipcError::Truncated
    • next exact Linux Rust ordinary targets from the fresh rerun:
      • src/service/cgroups.rs
        • retry-once second-failure paths still missing in:
          • raw_call_with_retry_request_buf()
          • raw_batch_call_with_retry_request_buf()
        • remaining ordinary service branches:
          • negotiated SHM attach failure in try_connect() on Linux
          • SHM short-message rejection in transport_receive()
          • baseline batch response message_id mismatch is not a remaining L2 target, because L1 rejects it first
        • remaining ordinary server-loop branches:
          • malformed batch request item
          • batch builder add failure
          • SHM response send failure
        • remaining ordinary cache branch:
          • malformed snapshot item preserves old cache
      • src/transport/posix.rs
        • remaining ordinary malformed receive branches:
          • payload limit exceeded
          • non-final / final chunk payload-length and total-length mismatches
          • chunked batch-directory packed-area validation failure
        • remaining ordinary handshake / helper branches:
          • default supported-profile baseline branches
          • listener accept() cleanup on handshake failure is now covered
          • stale-recovery live-server probe path is still worth one direct test if it can be driven without races
        • remaining ordinary listener / helper branches:
          • listen(2) failure after successful bind is not ordinary
          • raw socket creation and short-write failures remain special-infrastructure
      • src/transport/shm.rs
        • remaining ordinary stale / recovery utility branches:
          • cleanup_stale() mmap-failure / bad-open cleanup if they can be reproduced with ordinary filesystem objects
          • check_shm_stale() open-failure cleanup if it can be driven without fault injection
        • not the next target:
          • ftruncate, mmap, fstat, and arch-specific cpu_relax() branches still look like special-infrastructure territory
  • Current execution slice after the latest POSIX Go UDS / SHM stability pass:

    • revalidated the exact current Linux / POSIX Go transport package coverage from the real module root
    • current package result:
      • transport/posix total: 93.8%
      • transport/posix/shm_linux.go: 91.9%
      • transport/posix/uds.go: 95.6%
    • current verified POSIX UDS function-level coverage:
      • Receive(): 97.8%
      • Listen(): 81.0%
      • detectPacketSize(): 100.0%
      • rawSendMsg(): 83.3%
      • connectAndHandshake(): 93.2%
      • serverHandshake(): 95.3%
    • completed the next ordinary POSIX UDS coverage slice
    • validated ordinary raw UDS tests for:
      • client Send() initialization of the first in-flight request set
      • non-chunked batch-directory underflow rejection
      • chunked batch-directory validation after full payload reassembly
      • detectPacketSize() fallback and live-fd success behavior
    • discovered one real POSIX SHM transport test-harness race while rerunning the package under coverage:
      • TestShmDirectRoundtrip and related tests still used fixed service names plus blind 50ms sleeps before ShmClientAttach()
      • under coverage slowdown this caused both:
        • attach-before-create failures (SHM open failed: ... no such file or directory)
        • and later server-side futex-wait timeouts
    • fixed the SHM transport package race honestly:
      • replaced blind sleeps with attach-ready waiting
      • moved the live SHM roundtrip tests to unique per-test service names
      • verified the package with go test -count=5 ./pkg/netipc/transport/posix
    • reviewed the remaining uds.go uncovered blocks against the real code and the existing raw UDS edge-test helpers
    • checked the official Linux manual pages for recvmsg() / MSG_TRUNC on AF_UNIX sequenced-packet sockets:
      • verified that record boundaries and truncation behavior are explicit for AF_UNIX datagram / sequenced-packet sockets
      • implication:
        • the next honest ordinary coverage should come from malformed packet sequences and real protocol states
        • not from pretending POSIX UDS behaves like a byte-stream transport
    • current split of remaining POSIX UDS gaps:
      • ordinary testable now:
        • non-chunked batch directory underflow / invalidation in Receive()
        • chunked final batch-directory validation in Receive()
        • client-side Send() branch where inflightIDs starts nil
        • possibly one small detectPacketSize() fallback helper case if it can be driven without fault injection
      • likely special-infrastructure later:
        • Connect() / Listen() raw socket, bind, and listen syscall failures
        • short writes in rawSendMsg() and handshake send paths
        • zero-length or syscall-failure handshake receive paths
        • most ShmServerCreate() / ShmClientAttach() remaining Ftruncate, Mmap, Dup, and Stat failures
    • next target:
      • review whether any remaining low-level POSIX transport gaps are still ordinary:
        • rawSendMsg()
        • Listen()
        • connectAndHandshake()
        • serverHandshake()
      • classify the remainder honestly into:
        • still ordinary
        • or special-infrastructure / syscall-failure territory
      • latest line-by-line classification from the current local rerun:
        • still ordinary:
          • Listen() bind failure when the run directory does not exist
          • client handshake peer disconnect before HELLO_ACK
          • server handshake peer disconnect before HELLO
        • not ordinary:
          • raw socket creation failures
          • short writes in rawSendMsg() and handshake send paths
          • forced listen(2) failure after a successful bind
    • follow-up validation after the low-level UDS slice exposed and fixed two more real Unix Go harness bugs:
      • TestUnixServerRejectsSessionAtWorkerCapacity
        • failing symptom before the fix:
          • first client did not occupy the only worker slot
        • evidence:
          • the readiness probe in startServerWithWorkers() used waitUnixServerReady()
          • that helper performs a real connection / handshake probe
          • for the workers=1 capacity test, this probe could consume the only worker slot briefly before the real test client connected
        • fix:
          • added a socket-ready startup helper for this test instead of a full handshake probe
      • TestNonRequestTerminatesSession
        • failing symptom before the fix:
          • repeated isolated runs later failed at the “server should still be alive after bad client” assertion
        • evidence:
          • the test used a one-shot raw posix.Connect(...)
          • and later checked recovery with a single verifyClient.Refresh()
        • fix:
          • raw connect now retries readiness
          • the recovery check now uses the existing retry-style client readiness helper
    • final validation of the slice:
      • go test -count=20 -run '^TestUnixServerRejectsSessionAtWorkerCapacity$' ./pkg/netipc/service/cgroups: passing
      • go test -count=20 -run '^TestNonRequestTerminatesSession$' ./pkg/netipc/service/cgroups: passing
      • bash tests/run-coverage-go.sh 90: passing
      • /usr/bin/ctest --test-dir build --output-on-failure -j4: 37/37 passing
    • next exact low-level transport classification from the fresh cover profile:
      • transport/posix/uds.go
        • remaining uncovered ordinary-looking paths are effectively exhausted
        • current uncovered lines are concentrated in:
          • raw socket creation failure in Connect() / Listen()
          • short writes in rawSendMsg()
          • handshake send / recv syscall failures and short writes
          • forced listen(2) failure after successful bind
        • implication:
          • uds.go is now mostly special-infrastructure territory
      • transport/posix/shm_linux.go
        • remaining possibly ordinary/testable:
          • ShmReceive() deadline-expired timeout branch with no publisher
          • ShmClientAttach() malformed-file follow-ups only if they can be driven with ordinary files instead of syscall fault injection
        • likely special-infrastructure later:
          • Ftruncate, Mmap, and Dup failures in ShmServerCreate()
          • Stat, Mmap, and Dup failures in ShmClientAttach() when they need syscall fault injection
    • completed the next direct POSIX SHM guard slice:
      • added the missing ShmSend() signal-add guard
      • added the missing spin-phase ShmReceive() msg_len load guard
      • revalidated the transport package with go test -count=20 ./pkg/netipc/transport/posix
      • current result after the slice:
        • transport/posix total: 93.8%
        • transport/posix/shm_linux.go: 91.9%
        • ShmSend(): 96.6%
        • ShmReceive(): 96.2%
      • implication:
        • the remaining shm_linux.go gaps are even more concentrated in syscall-failure, impossible ordering, or timeout-orchestration territory
    • next ordinary Linux Go service slice selected from the fresh service/cgroups cover profile:
      • verified current uncovered targets in service/cgroups/client.go
      • do not chase the fixed-size encode guard branches first:
        • CallSnapshot() request encode
        • CallIncrement() request encode
        • CallStringReverse() encode
        • CallIncrementBatch() fixed-size item encode
        • these guards are effectively unreachable with the current exact-size caller buffers
      • current ordinary targets selected for the next pass:
        • tryConnect() default StateDisconnected path for non-classified connect errors
        • pollFd() invalid-fd / hangup handling
        • single-item response overflow in handleSession()
        • negotiated SHM create failure in Run() while keeping the server healthy for later sessions
      • evidence:
        • current uncovered line groups are at:
          • client.go:381-382
          • client.go:576-577
          • client.go:611-615
          • client.go:707-710
          • client.go:830
        • local poll(2) documentation check confirms:
          • POLLHUP reports peer hangup
          • POLLNVAL reports invalid fd
        • implication:
          • direct pollFd() tests are honest ordinary coverage, not synthetic protocol cheating
    • completed the next Linux Go ordinary service slice:
      • covered tryConnect() default StateDisconnected mapping with an invalid service name
      • covered direct pollFd() hangup / invalid-fd handling with real Unix pipe descriptors
      • covered single-item response overflow and client recovery
      • covered short SHM request termination and bad SHM header termination while proving the server remains healthy for later sessions
      • verified the new tests with go test -count=20
      • current result after the slice:
        • service/cgroups/client.go: 95.9%
        • Run(): 94.7%
        • handleSession(): 92.9%
        • tryConnect(): 100.0%
      • important finding:
        • targeted line coverage now confirms the negotiated SHM create-failure branch in Run() is covered by the obstructed first-session test
        • evidence from a direct -run '^TestUnixShmCreateFailureKeepsServerHealthy$' cover profile:
          • client.go:611-615 executed
        • implication:
          • remove this branch from the “unresolved” bucket
    • next remaining Linux Go service classification after the fresh rerun:
      • handleSession() ordinary SHM malformed-request branches are no longer the main gap
      • current remaining uncovered line groups from the fresh full-package rerun:
        • client.go:189-191
        • client.go:218-220
        • client.go:244-246
        • client.go:284-289
        • client.go:576-577
        • client.go:585-586
        • client.go:665
        • client.go:707-710
        • client.go:765-767
        • client.go:780-786
        • client.go:830
        • client.go:845
      • likely non-ordinary / invariant-bound:
        • fixed-size encode guards in typed client calls
        • single-dispatch responseLen > len(respBuf) guard for the existing typed methods
        • msgBuf growth path, because it is already pre-sized from MaxResponsePayloadBytes + HeaderSize
        • ShmReceive() non-timeout error in the server loop, because the live server-side context keeps the atomic offsets in-bounds
        • listener poll / accept error branches in Run()
        • peer-close response send failure on POSIX sequenced-packet sockets unless a deterministic reproduction exists
        • pollFd() raw syscall-failure / unexpected-revents fallthrough paths
    • fresh Linux Rust coverage measurement from the current machine:
      • bash tests/run-coverage-rust.sh 80
      • current tool on this host: tarpaulin
      • current result: 90.66%
      • current largest uncovered Rust files from the report:
        • src/service/cgroups.rs: 686/710
        • src/transport/posix.rs: 388/401
        • src/transport/shm.rs: 347/375
      • implication:
        • Linux Rust is now the next biggest ordinary coverage target, not Linux Go
    • direct uncovered-line extraction from src/crates/netipc/cobertura.xml confirms a mixed picture:
      • a real part of the missing service/cgroups.rs coverage is Linux-ordinary
      • another real part is Windows-only code counted inside the shared file by tarpaulin
      • concrete evidence:
        • Linux-ordinary gaps in service/cgroups.rs:
          • SHM L2 client send/receive paths: 645-658, 695-709, 749-758
          • SHM-managed server request/response paths: 1418-1428, 1538-1551, 1571
          • response envelope checks for typed calls / batch calls: 544, 547, 550, 581, 584, 587, 590, 620, 623, 626, 632
          • dispatch_single() missing-handler and derived-zero-capacity paths: 912, 921, 937, 946, 949
          • poll_fd() EINTR / unexpected-revents fallthrough: 1594-1596, 1598, 1613
          • cache lossy-conversion / malformed-item preservation: 1711, 1716, 1728-1729
        • Windows-only or Linux-non-testable groups inside the same file:
          • Windows try_connect() / WinSHM path: 364-407, 665-730, 1123-1253, 1260-1396
          • fixed-size encode guards in typed calls: 189, 202, 221, 252
          • helper overflow guards and readiness wait-loop sleeps: 1876, 1945, 1979, 2663
      • transport/posix.rs still has ordinary Linux gaps:
        • packet_size too small: 289
        • short packet / negotiated-limit checks: 347, 361, 392
        • chunk-header mismatch checks: 440, 448, 457, 460, 465, 468
        • live-server stale detection / listener conflict: 526, 836
        • handshake rejection/truncation branches: 930, 941, 949, 1004, 1057
      • transport/shm.rs still has ordinary Linux gaps:
        • live-server stale-region rejection in server_create(): 227-229
        • short-file / undersized-region attach failures: 341-342, 428-431
        • zero-timeout deadline branch in receive(): 581, 601, 609
        • cleanup_stale() invalid-entry cleanup branches: 729, 736-737, 763-764
      • working theory:
        • the next honest Linux Rust gains should come first from real Linux SHM service coverage and direct malformed transport tests
        • after that, the remaining Linux total will need a tooling review, because tarpaulin is still counting Windows-tagged lines in the shared Rust library total
    • next execution slice for Linux Rust:
      • add real L2 SHM service tests in service/cgroups.rs
        • snapshot / increment / string-reverse / batch over SHM
        • bad-kind / bad-code / bad-message-id / bad-item-count response validation on the SHM path
        • direct dispatch_single() and snapshot_max_items() tests for the remaining ordinary helper branches
      • add direct POSIX UDS malformed transport tests in transport/posix.rs
        • packet too short
        • limit exceeded
        • batch-directory overflow
        • chunk-header mismatch
        • live-server stale detection
        • handshake rejection / truncation branches
      • add direct POSIX SHM stale / attach / timeout tests in transport/shm.rs
        • live-server stale recovery rejection
        • undersized file / undersized mapping rejection
        • zero-timeout receive branch
        • invalid-entry cleanup paths
  • Current execution slice after the Windows Go parity expansion:

    • completed the next Linux / POSIX Go SHM service follow-up slice
    • validated ordinary POSIX SHM service tests for:
      • attach failure
      • normal SHM roundtrip
      • malformed batch request recovery
      • batch handler failure -> refresh
      • batch response overflow -> refresh
    • completed the next direct POSIX SHM transport guard slice
    • validated direct transport tests for:
      • invalid service-name entry guards
      • ShmSend() bad-parameter guards
      • ShmReceive() bad-parameter and timeout paths
      • ShmCleanupStale() missing-directory and unrelated-file branches
    • completed the next direct POSIX SHM raw-response slice
    • validated direct raw SHM service tests for:
      • doRawCall() bad message_id
      • batch bad message_id
      • malformed batch payload
      • snapshot dispatch with derived zero-capacity buffer
    • completed the next Linux / POSIX Go ordinary server-loop slice
    • validated ordinary POSIX server-loop tests for:
      • worker-capacity rejection
      • idle peer disconnect
      • non-request termination
      • truncated raw request recovery
    • fixed one real Unix Go test-harness issue exposed by coverage slowdown:
      • baseline / SHM / stress helpers were still using blind sleeps before clients raced Refresh()
      • they now wait for a real successful POSIX handshake instead of just waiting for the socket path to appear
    • completed the next Linux / POSIX Go SHM transport obstruction slice
    • validated ordinary POSIX SHM filesystem-obstruction tests for:
      • unreadable stale-file recovery in checkShmStale()
      • non-empty directory stale entry in checkShmStale()
      • ShmServerCreate() retry-create failure when stale recovery cannot remove the target obstruction
    • reclassified raw malformed POSIX SHM request recovery (short, bad header, unexpected kind) out of the ordinary bucket:
      • all three block in ShmReceive(..., 30000) today
      • they belong to timeout-behavior / special-infrastructure work unless POSIX SHM timeout control becomes testable
    • completed the next Windows Go ordinary-coverage pass on win11
    • validated the new Windows-only Go transport edge tests directly with native go test
    • synced the TODO and coverage docs to the latest Windows Go numbers
    • discovered one real Go Windows shutdown bug during the next service-coverage pass:
      • idle Server.Stop() can hang because windows.Listener.Close() does not wake a blocked Accept() when no client has connected yet
      • the C Windows transport already solves this with a loopback wake-connect on the pipe name before closing the listener handle
    • fixed the exact-head Windows Rust state-test startup race under parallel ctest
    • fixed the matching service-interop client readiness race across the C, Rust, and Go service interop fixtures on both POSIX and Windows
    • reviewed the real win11 Go coverage profiles for both service/cgroups and transport/windows
    • fixed the real Go Windows listener shutdown bug:
      • windows.Listener.Close() now mirrors the C transport and performs a loopback wake-connect before closing the listener handle
      • this unblocks a blocked Accept() reliably, so idle managed Server.Stop() no longer hangs
    • validated the new Windows Go idle-stop and malformed-response tests directly with native go test
    • next target:
      • keep raising the relaxed coverage gates toward 100%
      • current result:
        • malformed-response tests raised service/cgroups.rs
        • WinSHM edge-case tests raised transport/win_shm.rs
        • Windows named-pipe transport tests raised transport/windows.rs into the mid-90% range
        • WinSHM service tests and stricter malformed batch/snapshot tests raised Go service/cgroups/client_windows.go above 90%
        • the latest Windows Go transport edge tests plus the listener shutdown fix raised:
          • transport/windows/pipe.go to 97.1%
          • transport/windows/shm.go to 92.9%
          • transport/windows package total to 95.2%
          • service/cgroups/client_windows.go to 96.7%
          • service/cgroups package total to 96.5%
          • Windows Go total to 96.7%
        • Windows Go no longer has a weak transport package
        • exact uncovered Go functions on win11 are now known:
          • doRawCall (100.0%)
          • CallSnapshot (94.1%)
          • CallStringReverse (93.8%)
          • CallIncrementBatch (95.5%)
          • transportReceive (100.0%)
          • Run (91.7%)
          • handleSession (95.0%)
        • facts from the uncovered blocks:
          • the ordinary Windows Go L2 service targets in client_windows.go were pushed much further and are no longer the main gap
          • Windows named-pipe transport edge handling is now broadly covered
          • the recent honest coverage gains came from real malformed transport tests and WinSHM edge tests, not from exclusions
          • some malformed named-pipe response cases never reach L2 validation because the Windows session layer rejects them first
          • raw malformed WinSHM requests now also cover the real managed-server SHM session teardown and reconnect path
        • split of remaining Go gaps:
          • ordinary testable now:
            • Windows Go ordinary coverage is no longer the main gap
            • next honest Go target is Linux / POSIX:
              • service/cgroups/client.go (94.3%)
              • transport/posix/shm_linux.go (90.6%)
              • transport/posix/uds.go (92.0%)
            • keep the deferred managed-server retry/shutdown investigation separate from ordinary coverage
          • likely requires special orchestration later:
            • fixed-size encode / builder overflow guards in client_windows.go that the current scratch sizing makes unreachable in normal calls
            • client_windows.go SHM server-create, defensive response-length, msg-buffer growth, and SHM send failure paths
            • transport-level malformed response MessageID and some response-envelope corruptions that are rejected below L2 on named pipes
            • rare managed-server retry/shutdown races already tracked separately
      • keep focusing on ordinary testable branches first, not the deferred managed-server retry/shutdown investigation
  • Verified current Windows coverage state on 2026-03-24:

    • C:
      • src/libnetdata/netipc/src/service/netipc_service_win.c (90.1%)
      • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c (91.8%)
      • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c (91.6%)
      • total: 90.9%
      • status: the script now passes the Linux-matching per-file 85% gate
    • Go:
      • total: 96.7%
      • package coverage:
        • service/cgroups: 96.5%
        • transport/windows: 95.2%
      • key files:
        • service/cgroups/client_windows.go: 96.7%
        • service/cgroups/types.go: 100.0%
        • transport/windows/pipe.go: 97.1%
        • transport/windows/shm.go: 92.9%
      • status:
        • passes the Linux-matching 90% target
        • the noninteractive exit problem is fixed
        • first-class Windows Go CTest targets now exist for service/cache coverage parity
        • latest added WinSHM service tests, malformed-response tests, and transport edge tests increased both client_windows.go and the Windows transport package materially
        • the idle managed Server.Stop() hang on Windows is fixed and covered
        • direct raw WinSHM tests now cover the Windows-only L2 branches that named pipes reject below L2
        • the latest create / attach edge tests materially raised the remaining ordinary Windows Go transport file
        • the latest raw I/O, handshake, Listen(), chunked batch, and disconnect tests pushed pipe.go above 97% and Windows Go total to 96.7%
    • Rust:
      • validated workflow: cargo-llvm-cov + rustup component add llvm-tools-preview
      • measured with Windows-native unit tests + Rust interop ctests, with Rust bin / benchmark noise excluded from the report:
        • src/service/cgroups.rs: 83.83% line coverage
        • src/transport/windows.rs: 94.43% line coverage
        • src/transport/win_shm.rs: 88.27% line coverage
        • total line coverage: 93.68%
      • implication: Windows Rust coverage is now real and useful, but one retry/shutdown test is still intentionally ignored pending the separate managed-server investigation
  • Approved next sequence:

    • document the new Windows Go numbers honestly in the TODO and coverage docs
    • align Windows C and Go default thresholds with the already-used Linux defaults
    • after that, keep raising the relaxed coverage gates toward 100%
    • resolved during the Windows Go parity pass:
      • Windows Go CTest commands now execute reliably on win11
      • the fix was to define the tests as direct go test commands and let CTest inject CGO_ENABLED=0 via test environment properties
      • current validated Windows CTest inventory is now 28 tests, not 26

Recorded Decision

1. Windows Rust coverage gate policy

Facts:

  • The validated Windows Rust workflow now reports:
    • total line coverage: 93.68%
    • src/service/cgroups.rs: 83.83%
    • src/transport/windows.rs: 94.43%
    • src/transport/win_shm.rs: 88.27%
  • cargo-llvm-cov has a built-in total-line gate via --fail-under-lines, but not a built-in per-file gate.
  • The current Windows C script enforces per-file gates on the exact Windows C files it cares about.
  • The current Windows Go script enforces only a total-package threshold.
  • One Windows Rust retry/shutdown test is still intentionally ignored because it belongs to the separate managed-server investigation.

User decision (2026-03-23):

  • Windows Rust coverage policy should match Linux Rust coverage policy unless there is a proven technical reason for divergence.
  • Do not invent a Windows-only coverage policy if the real issue is just script drift.

Implementation consequence:

  • The Linux and Windows Rust coverage scripts must enforce the same total-threshold policy.
  • Costa later raised the shared Rust threshold to 90% on both Linux and Windows.

2. Cross-platform test-framework parity expectation

User requirement (2026-03-23):

  • Linux and Windows should have similar validation scope across all implementations.
  • This includes:
    • unit and integration coverage
    • interoperability tests
    • fuzz / chaos style validation where technically possible
    • benchmarks
    • interop benchmarks

Implication:

  • Before increasing coverage further, the repository needs an honest parity review of Linux vs Windows validation scope.
  • Any meaningful Windows-vs-Linux gaps must be documented clearly in this TODO instead of being hidden behind partial scripts.

3. Current execution order

User direction (2026-03-23):

  • Proceed with the ordinary testable Windows Go coverage targets first.
  • Do not jump to special-infrastructure branches before the ordinary remaining branches are exhausted.

4. README summary refresh

User direction (2026-03-23):

  • Replace the old README.md with a concise, trustworthy summary for team handoff.
  • The README must explain:
    • design and architecture
    • the specs and where they live
    • API levels
    • language interoperability
    • performance
    • testing, coverage, and validation scope
  • The README should be something the team can reasonably trust about features, performance, reliability, and robustness.

Implementation consequence:

  • The README must be based on the current measured repo state, not on stale claims.
  • Any claim about performance, reliability, robustness, interoperability, or validation must be traceable to checked-in docs, benchmark artifacts, or current test / coverage workflows.

Status:

  • Completed.
  • README.md now summarizes the current design, specifications, API levels, interoperability model, checked-in benchmark results, and validated test / coverage state for team handoff.

Summary Of Work Done

  • Normalized the public specifications so Level 2 is clearly typed-only and transport/buffer details remain internal.
  • Aligned the implementation with the typed Level 2 direction across C, Rust, and Go.
  • Fixed the verified SHM attach race where clients could accept partially initialized region headers.
  • Removed verified Rust Level 2 hot-path allocations and corrected benchmark distortions from synthetic per-request snapshot rebuilding.
  • Fixed Windows benchmark implementation bugs, including:
    • SHM batch crash in the C benchmark driver
    • named-pipe pipeline+batch behavior at depth 16
    • Windows benchmark timing/reporting bugs
  • Made both benchmark generators fail closed on stale or malformed CSV input.
  • Regenerated benchmark artifacts from fresh reruns instead of trusting stale checked-in files.
  • Repaired the broken follow-up hardening/coverage pass by:
    • replacing the non-self-contained test_hardening
    • wiring Windows stress into ctest
    • fixing the broken coverage script error handling
    • validating the Windows coverage scripts on win11
  • Replaced the stale top-level README.md with a factual repository summary for team handoff, based on the current checked-in specs, benchmark reports, and validated Linux / Windows test and coverage results.

Current Verified State

Linux

  • cmake --build build -j4: passing
  • /usr/bin/ctest --test-dir build --output-on-failure -j4: 37/37 passing
  • test_service_interop stabilization:
    • exact repeated validation with /usr/bin/ctest --test-dir build --output-on-failure -j1 -R ^test_service_interop$ --repeat until-fail:10: passing
    • implication:
      • the previous Rust server -> C client client: not ready failure was a real interop-fixture startup race
      • the POSIX service interop harness now also waits for the socket path after READY, because the Go and Rust fixtures emit READY just before entering server.Run()
  • POSIX benchmarks:
    • 201 rows
    • report regenerates successfully
    • configured POSIX floors pass

Linux Coverage

Verified on 2026-03-23:

  • C:
    • bash tests/run-coverage-c.sh
    • result: 94.1%
    • current threshold: 85%
  • Go:
    • bash tests/run-coverage-go.sh
    • result: 95.8%
    • current threshold: 90%
  • Rust:
    • bash tests/run-coverage-rust.sh
    • result: 98.57%
    • current threshold: 90%

Important fact:

  • The C coverage script was fixed during this pass.
    • it now runs the extra C binaries it was already building (test_chaos, test_hardening, test_ping_pong, test_stress)
    • it no longer exits with status 141 (SIGPIPE) caused by grep | head under pipefail

Windows (win11)

Verified on 2026-03-23:

  • cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo: passing
  • cmake --build build -j4: passing
  • ctest --test-dir build --output-on-failure -j4:
    • current verified state: 28/28 passing
    • note:
      • exact-head validation after the Windows Rust coverage additions exposed one real Windows test-isolation bug in the Rust state tests
      • failing case: service::cgroups::windows_tests::test_client_incompatible_windows
      • symptom under full ctest -j4: the first immediate refresh() could see Disconnected instead of the expected terminal state because the spawned server was not always fully listening yet
      • evidence:
        • isolated rerun with ctest --test-dir build --output-on-failure -j1 -R ^test_protocol_rust$ passed
        • exact same tree under full ctest --test-dir build --output-on-failure -j4 failed once with left: Disconnected, right: Incompatible
      • fix:
        • the Windows Rust auth-failure and incompatible tests now wait for the target client state instead of assuming one immediate refresh is sufficient
      • final verification:
        • exact win11 rerun after the fix passed 28/28 under full ctest --test-dir build --output-on-failure -j4
      • one attempted rerun failed only because ctest and cargo llvm-cov clean --workspace were mistakenly run in parallel on the same win11 tree
      • that failure was invalid test orchestration, not a product regression

Important facts:

  • The Go fuzz tests are now serialized in CTest with RESOURCE_LOCK.
    • This fixed the previous go_FuzzDecodeCgroupsResponse timeout on win11.
  • The current exact head was revalidated again after the coverage work.
    • ctest --test-dir build --output-on-failure -j4: 28/28 passing after the Rust Windows state-test startup-race fix
  • test_service_win_interop stabilization:
    • exact repeated validation with ctest --test-dir build --output-on-failure -j1 -R ^test_service_win_interop$ --repeat until-fail:10: passing
    • implication:
      • the Windows service interop clients had the same one-refresh startup race pattern as POSIX
      • the fixture behavior is now aligned across C, Rust, and Go
  • test_win_stress is now wired and validated.
    • Current default scope is only the validated WinSHM lifecycle repetition.
    • The managed-service stress subcases were intentionally removed from the default Windows ctest path because Windows managed-server shutdown under stress still needs a separate investigation.
  • Windows Go parity improved:
    • test_named_pipe_go
    • test_service_win_go
    • test_cache_win_go
    • all three now execute successfully via ctest on win11

Windows Benchmarks

  • Windows benchmark matrix:
    • 201 rows
    • report regenerates successfully
    • configured Windows floors pass
  • Windows benchmark reporting is trustworthy for client/server scenarios:
    • 0 zero-throughput rows
    • 0 non-lookup rows with server_cpu_pct=0
    • 0 non-lookup rows with p50_us=0
    • the only server_cpu_pct=0 rows are the 3 lookup rows, which is correct

Windows Coverage

The scripts are now real and validated on win11.

Current measured results:

  • C:

    • latest clean win11 coverage build:
      • the raw bash tests/run-coverage-c-windows.sh 90 path completed end to end
    • coverage result: 93.9%
    • per-file:
      • netipc_service_win.c: 92.0%
      • netipc_named_pipe.c: 95.3%
      • netipc_win_shm.c: 95.9%
    • status:
      • passes the Linux-matching 90% target, including the per-file gate
      • the dedicated coverage-only guard executables remain stable under bounded timeout 120
      • the old first-run coverage instability is fixed by the test_win_service_guards.exe / test_win_service_guards_extra.exe split
  • Go:

    • bash tests/run-coverage-go-windows.sh 90
    • coverage result: 96.7%
    • package coverage:
      • protocol: 99.5%
      • service/cgroups: 96.5%
      • transport/windows: 95.2%
    • status:
      • reported above the Linux-matching 90% target
      • focused helper tests plus the listener shutdown fix raised:
        • transport/windows/pipe.go to 97.1%
        • transport/windows/shm.go to 92.9%
        • transport/windows package total to 95.2%
        • service/cgroups/types.go to 100.0%
        • service/cgroups/client_windows.go to 96.7%
      • first-class Windows Go CTest targets are now real and passing on win11
      • the idle managed Server.Stop() hang is fixed and covered
      • raw WinSHM tests now cover the Windows-only doRawCall() / transportReceive() branches that named pipes cannot reach honestly
      • malformed raw WinSHM request tests now also cover the real SHM server-side teardown / reconnect path

Important facts:

  • TestPipePipelineChunked in the Go Windows transport package is intentionally skipped.

    • Reason: with the current single-session API and tiny pipe buffers, the chunked full-duplex pipelining case deadlocks in WriteFile() on both sides.
    • This is a real limitation of the current API/test shape, not a flaky timeout to ignore.
  • The Windows C service coverage harness was trimmed to keep ctest trustworthy.

    • The broken-session retry and cache subcases need a smaller dedicated Windows-only harness.
    • Keeping them in the monolithic test_win_service.exe caused intermittent deadlocks and poisoned full-suite validation.
  • Windows C coverage now includes test_win_service.exe again, but it no longer relies on that executable alone for the extra deterministic service guard branches.

    • The coverage script runs the normal C coverage subset, which includes test_win_service.exe, and then separately runs test_win_service_guards.exe under timeout 120.
    • Reason: the dedicated guard executable isolates the extra service-only branches without risking the ordinary ctest inventory.
  • The Windows Go coverage script no longer stalls in noninteractive ssh.

    • Root cause was the script's own slow shell post-processing, not MSYS / SSH.
    • The per-file aggregation now uses one awk pass and exits cleanly on win11.
  • Rust:

    • validated tool choice:
      • cargo-llvm-cov
      • rustup component add llvm-tools-preview
    • validated script:
      • bash tests/run-coverage-rust-windows.sh
    • current measured report from win11 with Windows-native Rust L2/L3 unit tests + Rust interop ctests, after excluding Rust bin / benchmark noise from the report:
      • service/cgroups.rs: 83.83% line coverage
      • transport/windows.rs: 94.43% line coverage
      • transport/win_shm.rs: 88.27% line coverage
      • total: 93.68% line coverage
    • status:
      • the workflow is real and scripted
      • the report is now meaningful for the Windows Rust service path too
      • the script should enforce the same 90% total threshold policy as Linux Rust
      • the named-pipe transport file is no longer the weak Windows Rust target
      • the remaining Rust work is broader coverage raising plus the deferred shutdown/retry investigation
      • one Windows retry/shutdown test is intentionally ignored because it belongs to the separate managed-server shutdown investigation
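
The one-awk-pass fix for the Go coverage aggregation mentioned above can be sketched as follows. This is an illustrative reimplementation, not the real run-coverage-go-windows.sh code: the `aggregate` name is hypothetical, the input format is a stand-in for `go tool cover -func` style lines, and the naive per-file average differs from Go's statement-weighted totals. The point is grouping per-file percentages in a single pass instead of re-scanning the profile once per file, which is what made the old shell post-processing slow.

```shell
# Illustrative single-pass per-file aggregation (hypothetical helper name).
# Input lines look roughly like `file.go:55: FuncName 90.9%`.
aggregate() {
  awk '{
    split($1, p, ":")            # "file.go:55:" -> file.go
    gsub(/%/, "", $NF)           # "90.9%" -> 90.9
    sum[p[1]] += $NF
    cnt[p[1]]++
  }
  END {
    for (f in sum) printf "%s %.1f\n", f, sum[f] / cnt[f]
  }' "$@" | sort
}

printf '%s\n' \
  'transport/posix/uds.go:55: Connect 90.0%' \
  'transport/posix/uds.go:90: Send 100.0%' \
  'service/cgroups/client.go:40: Run 86.8%' | aggregate
# prints:
#   service/cgroups/client.go 86.8
#   transport/posix/uds.go 95.0
```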

Not Remaining

  • No active Linux test failure
  • No active Windows test failure
  • No active POSIX benchmark floor failure
  • No active Windows benchmark floor failure
  • No active Windows benchmark reporting bug
  • No active stale benchmark artifact problem
  • No active Windows C coverage regression

Windows Handoff (win11)

This is the verified workflow for another agent to build, test, and benchmark on Windows.

High-level workflow

  1. Develop locally.
  2. Push the branch or commit.
  3. ssh win11
  4. Reset or pull on win11.
  5. Build and validate on win11.
  6. Copy benchmark artifacts back only if Windows benchmarks were rerun.

Repo and shell entrypoint

ssh win11
cd ~/src/plugin-ipc.git

Important facts:

  • The win11 repo is disposable.
  • If it gets dirty or confusing, it is acceptable to clean it there.
  • The login shell may start as MSYSTEM=MSYS; use the toolchain environment below before building.

Known-good Windows toolchain environment

export PATH="/c/Users/costa/.cargo/bin:/c/Program Files/Go/bin:/mingw64/bin:$PATH"
export MSYSTEM=MINGW64
export CC=/mingw64/bin/gcc
export CXX=/mingw64/bin/g++

Sanity check:

type -a cargo go gcc g++ cmake ninja gcov

Expected shape:

  • cargo first from /c/Users/costa/.cargo/bin
  • go first from /c/Program Files/Go/bin
  • gcc / g++ / gcov from /mingw64/bin

Clean reset on win11 if needed

Use this only on win11, not in the local working repo:

git fetch origin
git reset --hard origin/main
git clean -fd
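
Because this reset destroys local state, a hypothetical guard along these lines keeps it from ever running in the local working repo. The `win11` hostname value is an assumption about how that machine identifies itself; adjust it before use.

```shell
# Hypothetical safety wrapper: only allow the destructive reset on the
# disposable win11 clone. The "win11" hostname check is an assumption.
guarded_reset() {
  if [ "$(uname -n)" != "win11" ]; then
    echo "refusing: hard reset is only for the disposable win11 clone"
    return 1
  fi
  git fetch origin &&
  git reset --hard origin/main &&
  git clean -fd
}

guarded_reset || true   # on any other machine this prints the refusal
```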

Configure and build on Windows

cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo
cmake --build build -j4

Current expected result:

  • build passes

Full Windows test pass

ctest --test-dir build --output-on-failure -j4

Current expected result:

  • 28/28 tests passing

Important note:

  • The Go fuzz tests are serialized with RESOURCE_LOCK go_fuzz_tests.
  • test_win_stress currently validates only WinSHM lifecycle repetition in the default path.

Full Windows benchmark pass

bash tests/run-windows-bench.sh benchmarks-windows.csv 5
bash tests/generate-benchmarks-windows.sh benchmarks-windows.csv benchmarks-windows.md

Current expected result:

  • 201 CSV rows
  • generator passes
  • all configured Windows floors pass
  • optional diagnostic mode for investigation without weakening publish mode:
    • NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh ...
    • behavior:
      • the publish run still fails closed
      • the first failure remains authoritative
      • each failed row is rerun once in an isolated diagnostic subdirectory under the preserved RUN_DIR
      • side-by-side evidence is written to:
        • ${RUN_DIR}/diagnostics-summary.txt
      • diagnostic reruns never write publish rows into the benchmark CSV
  • trust methodology now enforced by the runner:
    • each published row is the median of 5 measured samples by default
    • fixed-rate rows use the CLI duration:
      • 5s in the command above
    • most max-throughput rows use NIPC_BENCH_MAX_DURATION, default:
      • 10s
    • np-pipeline-batch-d16 @ max uses NIPC_BENCH_PIPELINE_BATCH_MAX_DURATION, default:
      • 20s
    • with 5 samples, one low and one high throughput sample are trimmed before the stability check
    • the remaining stable core must contain at least 3 samples and stay within:
      • max/min <= 1.35
    • if the stable core exceeds that spread, the runner fails closed instead of publishing the row
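
The trimming and spread rules above can be sketched as a small gate. This is an illustrative reimplementation, not the code in tests/run-windows-bench.sh; `stability_gate` and its output format are hypothetical, but the trim, minimum-core, and max/min <= 1.35 checks match the methodology described above.

```shell
# Illustrative per-row stability gate (not the actual runner code).
# Takes the measured throughput samples for one row as arguments.
stability_gate() {
  local sorted n
  sorted=($(printf '%s\n' "$@" | sort -n))
  n=${#sorted[@]}

  # Trim one low and one high sample before the stability check.
  local core=("${sorted[@]:1:n-2}")

  # The stable core must keep at least 3 samples.
  if (( ${#core[@]} < 3 )); then
    echo "FAIL: stable core too small"
    return 1
  fi

  local min=${core[0]} max=${core[${#core[@]}-1]}

  # Spread gate: max/min <= 1.35, otherwise fail closed (no row published).
  if awk -v lo="$min" -v hi="$max" 'BEGIN { exit !(hi / lo <= 1.35) }'; then
    echo "publish median: ${sorted[n/2]}"   # median of all measured samples
  else
    echo "FAIL: spread too wide (min=$min max=$max)"
    return 1
  fi
}

stability_gate 100 105 110 115 120   # prints: publish median: 110
```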

Windows coverage scripts

bash tests/run-coverage-c-windows.sh
bash tests/run-coverage-go-windows.sh 90
bash tests/run-coverage-rust-windows.sh 90

Current expected result:

  • bash tests/run-coverage-c-windows.sh
    • current clean-coverage measurement is 93.9%
    • all tracked Windows C files are above 90%
    • the full raw script now completes end to end on the validated win11 workflow
  • bash tests/run-coverage-go-windows.sh 90
    • currently reports 96.7%
  • bash tests/run-coverage-rust-windows.sh 90
    • currently reports 93.68%
    • should now enforce the same 90% total threshold used by Linux Rust
    • key remaining gap is no longer missing service coverage; it is raising coverage further and finishing the separate retry/shutdown investigation
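
The 90% total-threshold policy could be enforced with a gate along these lines. `coverage_gate` is an illustrative name, and parsing the actual cargo-llvm-cov report total is deliberately left out; this only shows the fail-below-floor shape shared with the Linux Rust script.

```shell
# Illustrative total-coverage gate mirroring the Linux Rust policy.
# Usage: coverage_gate <measured-total-percent> <minimum-percent>
coverage_gate() {
  if awk -v t="$1" -v min="$2" 'BEGIN { exit !(t >= min) }'; then
    echo "coverage OK: $1% >= $2%"
  else
    echo "coverage FAIL: $1% < $2%"
    return 1
  fi
}

coverage_gate 93.68 90   # current Windows Rust total against the 90% floor
```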

Copy benchmark artifacts back to the local repo

scp win11:~/src/plugin-ipc.git/benchmarks-windows.csv /home/costa/src/plugin-ipc.git/benchmarks-windows.csv
scp win11:~/src/plugin-ipc.git/benchmarks-windows.md /home/costa/src/plugin-ipc.git/benchmarks-windows.md

Known pitfalls and fixes

  • Do not use MSYS2 cargo or go.
  • Do not trust a stale build/ directory after major changes.
  • If a benchmark or manual test was interrupted, check for stale processes by exact image name before rebuilding:
tasklist //FI "IMAGENAME eq test_win_stress.exe"
tasklist //FI "IMAGENAME eq bench_windows_c.exe"
tasklist //FI "IMAGENAME eq bench_windows_go.exe"
tasklist //FI "IMAGENAME eq bench_windows.exe"
  • Kill only exact PIDs:
taskkill //PID <pid> //T //F
  • The Windows C coverage script must pass real Windows compiler paths to CMake.
    • It now uses cygpath -m "$(command -v gcc)".
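
For context on why cygpath -m matters: CMake on Windows expects mixed-style paths like C:/msys64/mingw64/bin/gcc.exe, not MSYS POSIX paths like /mingw64/bin/gcc. The helper below only mimics that conversion for two common prefixes so the effect is visible without Windows; real scripts must call cygpath itself, and the C:/msys64 install root here is an assumption.

```shell
# Demonstration only: mimic `cygpath -m` for two common MSYS2 prefixes.
# Real scripts should call cygpath; C:/msys64 is an assumed install root.
to_mixed_path() {
  case "$1" in
    /c/*)       echo "C:${1#/c}" ;;      # /c/Users/... -> C:/Users/...
    /mingw64/*) echo "C:/msys64$1" ;;    # assumes the default msys64 root
    *)          echo "$1" ;;
  esac
}

to_mixed_path /mingw64/bin/gcc
# prints: C:/msys64/mingw64/bin/gcc
# a real script would then pass the converted path to CMake, e.g.
#   cmake ... -DCMAKE_C_COMPILER="$(cygpath -m "$(command -v gcc)")"
```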

Remaining Work Plan

1. Coverage program is still incomplete

Facts:

  • Linux coverage scripts are working and pass their current lowered thresholds.
  • Windows coverage docs now match the measured numbers from 2026-03-24.
  • Windows C coverage currently passes:
    • total: 93.9%
    • netipc_service_win.c: 92.0%
    • netipc_named_pipe.c: 95.3%
    • netipc_win_shm.c: 95.9%
  • Windows Go coverage currently reports 96.7%.
  • Linux Go coverage currently reports 95.8%; the remaining ordinary gaps are now reduced to a much smaller POSIX transport/service residue.
  • Rust Windows coverage now has a validated workflow with meaningful service coverage.

Required next work:

  1. Keep the deferred Windows retry/shutdown investigation separate from the normal coverage gate
  2. Start raising the relaxed coverage thresholds toward 100%
  3. Immediate next pass:
    • stop treating Windows Go as the main ordinary Go target
    • review the Linux / POSIX Go gaps and classify them honestly:
      • ordinary testable
      • or genuinely fault-injection / Win32-failure territory
    • keep managed-server shutdown / retry behavior handled separately from ordinary coverage
    • keep Linux and Windows Go validation parity honest
  4. Current execution slice (2026-03-23):
    • inspect the remaining weak Linux Go and Rust service paths function-by-function
    • add tests only for real ordinary uncovered logic, not for branches that already require orchestration or fault injection
    • re-measure on the active platform before deciding whether to continue on Go or switch to the next parity gap
    • immediate implementation focus for the just-finished UDS slice:
      • bring Linux Go service tests closer to the existing Windows raw malformed-response coverage
      • add ordinary UDS-based L2 tests for:
        • malformed response envelopes
        • malformed typed payloads
        • transport-without-session safety
        • reconnect after a poisoned nil-session transport state
        • idle stop / unsupported dispatch helpers
      • use the real POSIX listener/session transport for these tests, not synthetic mocks
    • current function-level evidence from bash tests/run-coverage-go.sh 90:
      • service/cgroups/client.go
        • Refresh: 100.0%
        • doRawCall: 100.0%
        • CallSnapshot: 94.1%
        • CallIncrement: 92.9%
        • CallStringReverse: 93.8%
        • CallIncrementBatch: 95.5%
        • transportReceive: 100.0%
        • dispatchSingle: 100.0%
        • Run: 86.8%
        • handleSession: 90.6%
        • result of the latest Unix raw malformed-response parity slice:
          • service/cgroups/client.go moved from 81.4% to 88.0%
        • result of the latest POSIX service follow-up slice:
          • service/cgroups/client.go moved from 87.7% to 90.2%
          • Refresh() and transportReceive() are now fully covered
        • result of the latest POSIX SHM service follow-up slice:
          • service/cgroups/client.go moved from 90.2% to 92.3%
          • tryConnect() is now 94.7%
          • handleSession() moved to 89.4%
        • result of the latest direct POSIX SHM raw-response slice:
          • service/cgroups/client.go moved from 92.3% to 93.4%
          • doRawCall() is now 100.0%
          • CallIncrementBatch() moved to 95.5%
          • dispatchSingle() is now 100.0%
        • result of the latest Linux / POSIX server-loop slice:
          • service/cgroups/client.go moved from 93.4% to 94.3%
          • Run() moved to 86.8%
          • handleSession() moved to 90.6%
      • transport/posix/shm_linux.go
        • result of the latest ordinary SHM slice:
          • file moved from 77.5% to 86.7%
        • result of the latest POSIX SHM service follow-up slice:
          • file moved from 86.7% to 87.5%
        • result of the latest direct POSIX SHM transport slice:
          • file moved from 87.5% to 90.6%
        • result of the latest POSIX SHM obstruction slice:
          • file moved from 90.6% to 91.4%
        • result of the latest direct POSIX SHM guard slice:
          • file moved from 91.4% to 91.9%
          • ShmSend() moved to 96.6%
          • ShmReceive() moved to 96.2%
        • OwnerAlive: 100.0%
        • ShmServerCreate: 79.2%
        • ShmClientAttach: 82.7%
        • ShmSend: 93.1%
        • ShmReceive: 94.9%
        • ShmCleanupStale: 100.0%
        • checkShmStale: 92.6%
      • transport/posix/uds.go
        • result of the latest ordinary UDS slice:
          • file moved from 83.7% to 92.0%
        • result of the latest focused UDS follow-up slice:
          • file moved from 92.0% to 95.6%
        • Connect: 90.9%
        • Send: 100.0%
        • sendInner: 94.3%
        • Receive: 97.8%
        • Listen: 81.0%
        • Accept: 100.0%
        • detectPacketSize: 100.0%
        • rawSendMsg: 83.3%
        • rawRecv: 100.0%
        • connectAndHandshake: 93.2%
        • serverHandshake: 95.3%
      • implication:
        • the next honest ordinary target is still Linux Go, but no longer the ordinary Receive() / Send() / helper work in transport/posix/uds.go
    • next ordinary target:
      • start with the remaining low-risk Linux Go service gaps:
        • service/cgroups/types.go is now done (100.0%)
        • review whether the remaining service/cgroups/client.go paths are still ordinary:
          • Run
          • handleSession
        • current verified service/cgroups profile on the latest local slice:
          • Run: 86.8%
          • handleSession: 90.6%
          • pollFd: 85.7%
        • concrete remaining ordinary branches from the current HTML profile:
          • handleSession():
            • response send failure after peer close (session.Send(...) error)
        • branches that still do not look ordinary from the current profile:
          • Run():
            • listener poll error / Accept() error while still running
            • negotiated SHM upgrade create failure
          • handleSession():
            • SHM short/bad-header receive paths that currently block in ShmReceive(..., 30000) without extra timeout control
            • len(msgBuf) < msgLen growth path, because msgBuf is already sized from MaxResponsePayloadBytes
            • peer-close send failure on Unix packet sockets, because the ordinary delayed-close reproduction still did not trigger session.Send(...) failure in this slice
        • current execution slice:
          • inspect the remaining client.go and shm_linux.go uncovered blocks line-by-line
          • add only ordinary POSIX tests for:
            • handleSession() server-side protocol / batching branches still reachable with normal clients or raw POSIX sessions
            • the remaining ShmServerCreate() / ShmClientAttach() / checkShmStale() paths that are still reachable without fault injection
          • do not chase:
            • listener/socket syscall failures
            • forced short writes
            • rare kernel timing races that already look like special orchestration territory
      • then decide whether the remaining low-level POSIX SHM / UDS gaps are still ordinary or already special-infrastructure territory
      • keep Windows Go low-level branches documented, but no longer treat them as the first ordinary target
      • do not treat low-level OS failure or fault-injection branches as ordinary test targets
      • remaining uds.go likely non-ordinary / special-infrastructure territory:
        • short-write SendmsgN
        • socket / bind / listen syscall failures
        • hello / hello-ack short writes
        • next-level kernel timing races around disconnect during send
      • current shm_linux.go ordinary candidates from the merged profile:
        • ShmServerCreate
        • ShmClientAttach
        • ShmCleanupStale
        • checkShmStale
      • latest line-by-line fact check in shm_linux.go:
        • completed in the latest obstruction slice:
          • checkShmStale() invalid-file open failure (filesystem obstruction / unreadable stale entry)
          • checkShmStale() directory-entry Mmap failure
          • ShmServerCreate() retry-create final failure after stale recovery when the target path is still obstructed by a non-file entry
        • likely already special-infrastructure:
          • Ftruncate, Mmap, Dup, and f.Stat() failures
          • atomic-load bounds failures after a successful Mmap
          • ShmClientAttach() Dup / Mmap / Stat failure branches
      • immediate follow-up after the SHM slice:
        • move the tiny Handler.snapshotMaxItems() coverage from the Windows-only test file into a shared Go test file so Linux covers service/cgroups/types.go too
        • status:
          • completed
          • service/cgroups/types.go is now 100.0%
      • concrete next ordinary POSIX service cases:
        • Refresh() from StateBroken with a successful reconnect
          • status: completed
        • Run() invalid service name returning the listener error directly
          • status: completed
        • SHM-side transportReceive():
          • receive error -> ErrTruncated
          • short message -> ErrTruncated
          • bad header -> decode error
          • status: completed
        • latest POSIX SHM service follow-up:
          • port the existing Windows SHM service recovery/error tests to POSIX SHM where the transport semantics match:
            • malformed batch request
            • batch handler failure -> refresh
            • batch response overflow -> refresh
          • status:
            • completed for:
              • malformed batch request
              • batch handler failure -> refresh
              • batch response overflow -> refresh
            • not ordinary today for:
              • malformed short request
              • malformed header request
              • unexpected request kind
          • evidence:
            • all three non-ordinary cases block in ShmReceive(..., 30000) inside service/cgroups/client.go
            • they are therefore timeout-behavior / special-infrastructure cases, not cheap ordinary unit tests
        • latest direct POSIX SHM ordinary target:
          • add transport-level tests for:
            • invalid service-name guards in ShmServerCreate() / ShmClientAttach()
            • ShmSend() / ShmReceive() bad-parameter guards
            • short-backing-slice defensive errors
            • cheap timeout paths with millisecond waits
            • ShmCleanupStale() non-existent-directory and unrelated-file branches
          • status:
            • completed
          • result:
            • transport/posix/shm_linux.go moved from 87.5% to 90.6%
        • possible server capacity test if one session can be held open deterministically without introducing timing flake

2. Cross-platform validation parity is only partial

Facts:

  • Linux currently registers 37 CTest tests:
    • /usr/bin/ctest --test-dir build -N
  • Windows currently registers 28 CTest tests:
    • ctest --test-dir build -N on win11
  • Parity is reasonably good for:
    • protocol fuzzing:
      • C standalone fuzz target and Go fuzz targets are defined before platform splits in CMakeLists.txt
    • cross-language transport / L2 / L3 interop:
      • POSIX UDS / SHM / service / cache interop on Linux
      • Named Pipe / WinSHM / service / cache interop on Windows
    • benchmark matrices
  • Parity is not good yet for:
    • chaos testing:
      • Linux has test_chaos
      • Windows has no equivalent CTest target
    • hardening:
      • Linux has test_hardening
      • Windows has no equivalent CTest target
    • stress:
      • Linux has C, Go, and Rust stress targets
      • Windows currently has only test_win_stress and its default scope is intentionally narrow
    • single-language Rust / Go Windows CTest coverage:
      • Linux has direct Rust and Go service / transport test targets in CTest
      • Windows still relies more on coverage scripts and interop passes than on first-class Rust / Go CTest targets

Brutal truth:

  • The repository is not yet at the Linux/Windows parity you expect.
  • It is strongest on benchmarks and interop.
  • It is weakest on Windows chaos, hardening, and multi-language stress coverage.

Required next work:

  1. Decide which missing Windows parity items are mandatory for the production gate
  2. Add Windows equivalents where technically possible
  3. Document clearly where exact parity is impossible because the transports themselves differ (UDS / POSIX SHM vs Named Pipe / WinSHM)

3. Windows managed-server stress is only partially covered

Facts:

  • The original multi-client and typed-service stress subcases were not reliable in default Windows ctest.
  • They exposed a real separate investigation area around Windows managed-server shutdown under stress.

Required next work:

  • investigate Windows managed-server shutdown behavior under stressed live sessions
  • reintroduce managed-service stress subtests only after they are stable and diagnostically useful

4. Final production gate is still open

Required next work:

  • finish the coverage program honestly
  • rerun external multi-agent review against the final state
  • get final user approval

Deferred Future Work (Not Part Of The Current Red Gate)

  • Rust file-size discipline:
    • src/crates/netipc/src/service/cgroups.rs
    • src/crates/netipc/src/protocol/mod.rs
    • src/crates/netipc/src/transport/posix.rs
    • These files are still too large and should eventually be split by concern.
  • Native-endian optimization:
    • the separate endianness-removal / native-byte-order optimization remains a future performance task
    • it is not part of the current production-readiness gate
  • Historical phase notes:
    • the old per-phase and per-feature TODO files are being retired in favor of:
      • this active summary/plan
      • TODO-plugin-ipc.history.md as the historical transcript