Skip to content

DeltaFosB/sequitur

Repository files navigation

Sequitur HFT Matching Engine

Sequitur is a deterministic, ultra-low latency C++ matching engine designed for high-frequency algorithmic trading (HFT). Built with strict hardware sympathy, the project investigates the limits of single-threaded deterministic execution, lock-free concurrency, zero-allocation memory management, and decoupled real-time observability to achieve sub-microsecond "Tick-to-Trade" latencies.

The architecture isolates a pure Price-Time Priority limit order book inside a dedicated, single-threaded execution core, completely removing the overhead of operating system synchronization, context switches, and core-to-core memory contention during high-throughput market events.


Performance and Micro-Architectural Analysis

Comprehensive macro and micro performance profiling runs were conducted on an isolated deterministic core (bypassing network and I/O) to establish the absolute hardware baseline of the C++ math. The benchmarking harness utilized a strict alternating "ping-pong" sequence (Buy/Sell) to violently exercise the order annihilation logic and object pools at maximum velocity.

To eliminate the Heisenberg Observer Effect (where the telemetry framework alters the speed of the code being measured), time tracking is performed completely out-of-band using Structural Macro-Workload Partitioning, grouping transactions into un-instrumented blocks of 1,000 orders to amortize system clock taxes.

Key Performance Indicators (KPIs)

Metric Baseline (Atomic-Bound Engine) Optimized Engine (Shared-Nothing Monolithic Core) Engineering Impact and Insight
Peak Throughput 30.80M OPS 63,911,090.00 OPS +107.5% Increase in max transaction processing bandwidth under pristine cache alignment.
Mean Average Latency 32.46 ns 15.7138 ns 51.6% Latency Reduction achieved by stripping atomic pipeline synchronization and structural out-of-band offloading.
Mean Tail Latency (P99) 36.44 ns 18.2045 ns 50.0% Reduction in tail volatility, safely shattering the initial sub-20ns production target.
Hardware Jitter ($\sigma$) 7.82 ns 1.0042 ns Strict hardware-level execution pinning via SCHED_FIFO isolates the hot path from scheduler drift.
Memory Allocation O(1) O(1) Zero runtime heap allocations (malloc/new) prevents fragmentation micro-stalls.

End-to-End Telemetry Evaluation

The table below documents the empirical evolution of Sequitur's architectural performance characteristics, verified over multiple iterative engineering phases across identical macro test footprints (1,000,000 total orders):

Telemetry Phase / Layout Mean Round-Trip P99 Tail Latency Tracking Footprint Hardware Jitter ($\sigma$) Micro-Architectural Trade-off & Observation
Decoupled SPSC Pipeline 15.71 ns 18.20 ns 64 Bytes / Packet 1.00 ns Production Baseline. Telemetry is deferred out-of-band via a lock-free SPSC queue. CPU pipeline lookahead is maximized.
Macro-Bucket Partitioning 23.53 ns 25.60 ns 8 KB 1.51 ns Monolithic baseline with out-of-band macro-timing. Opens compiler loop vectorization options.
Atomic-Bound OrderBook 32.46 ns 36.44 ns 8 KB 7.82 ns Implements relaxed atomics inside the hot path. Forces hardware instruction taxes via Read-Modify-Write (RMW) cycle pipeline bubbles.
Individual Vector Tracking 39.95 ns 62.41 ns 8 MB 3.44 ns Suffers severe data cache thrashing. The massive 8MB telemetry array continuously evicts active order book nodes to RAM.
Individual Histogram Sort 61.73 ns 55.40 ns 80 KB 8.26 ns Eliminates cache pollution, but forces an un-amortized 36 ns Linux vDSO system clock read penalty onto every single order.

The Core Systems Hypotheses Verified

By implementing an automated hardware profiling suite via low-level kernel diagnostics, we isolated and resolved three major latency paradoxes within the engine:

Hypothesis 1: The Multi-Threaded Cache-Line Bouncing Trap (False Sharing)

Our baseline atomic tracking test added a mandatory 8.5 ns penalty on a single isolated core. Had this memory footprint been accessed concurrently by an external gateway or publisher thread, the hardware cache coherency protocol (MESI) would have triggered a continuous stream of cache invalidations. This would cause the shared 64-byte cache lines to bounce across CPU sockets, crashing matching loop throughput straight into the 150+ ns range. Shifting to an asymmetric SPSCQueue layout with alignas(64) padding on internal read/write pointers permanently shields Core 3 from Core 4's cache line polling.

Hypothesis 2: Microscopic Instrumentation vs. Pipeline Freedom

Placing high-resolution clock reads directly around individual order submissions injects an incompressible 30–36 ns vDSO clock read tax. Furthermore, these timekeeping wrappers act as a rigid serialization barrier, halting the CPU's Out-of-Order (OoO) execution engine. Passing telemetry asynchronous MetricsPacket chunks via lock-free rings drops the active telemetry tax to a negligible 0.00 ns inside the engine block, leaving the hardware pipeline completely free to optimize loop math, maximize register re-use, and apply loop unrolling.

Hypothesis 3: The Cold Start Tax (First-Touch Page Fault Boundary)

Initial profiling passes consistently revealed an isolated performance spike on Run 01 (~23.70 ns) before stabilizing into the ~14.1-16.3 ns band on subsequent runs. This is tracked back to Linux's lazy memory page allocation mechanism and cold CPU execution pipelines. While the 20,000-order warm-up loop primes the CPU instruction cache, the main run exercises wider price indices, triggering minor kernel page faults to map physical RAM to the vast pre-allocated array boundaries.


Engine Architecture

graph TD
    Client((Market Client)) -->|Line Wire Data| GoGateway[Go Network Subsystem]
    
    subgraph "Concurrency Bridge (Inter-Process / Inter-Thread)"
        GoGateway -->|Direct Protocol Unpacking| RingBuffer{Bounded Ring Buffer}
    end
    
    subgraph "Deterministic Core (Single Thread - Isolated Core 3)"
        RingBuffer -->|Zero-Lock Zero-Copy Pop| Engine[Matching Engine]
        Engine -->|O1 Acquire| Pool[Contiguous Object Pool]
        Engine -->|Price-Time Match| Book[Order Book]
        Book -->|Double-Linked Update| Matrix[Array Price Matrix]
    end
    
    subgraph "Asynchronous Telemetry Pipeline"
        Engine -->|Push Telemetry Packet| MetricsQueue{SPSC Metrics Queue}
        MetricsQueue -.->|Zero-Copy Pop Reference| MetricsWorker[Metrics Worker Thread]
        MetricsWorker -->|Structured JSON via RAM Disk| Vector[Vector Telemetry Agent]
        Vector -->|Real-Time Non-Blocking Ingress| Loki[Grafana Loki Container]
        Loki -->|Dynamic Multi-Percentile Plot| Grafana[Live Grafana UI]
    end
    
    subgraph "Hardware Target"
        Matrix -.->|Warm L1/L2 Cache Locality| CPU[Isolated Physical Silicon]
    end

Loading

1. Memory Architecture: Zero-Allocation Contiguous Object Pool

A strictly pre-allocated, continuous memory arena designed to entirely bypass the operating system's heap manager (malloc/new) during active trading matching.

  • Mechanism: Maintains a contiguous array of Order structs and an embedded stack of free indices. Custom acquire() and release() methods recycle raw pointer addresses in deterministic $O(1)$ time.
  • Benefit: Completely eliminates heap fragmentation and runtime lock contention. The complete execution state fits perfectly within the CPU core's dedicated L1/L2 data cache regime, enabling 1 to 4 ns memory lookups.

2. Concurrency Model: Single-Threaded, Shared-Nothing Monolithic Core

To protect the ultra-low latency execution path, the MatchingEngine and OrderBook operate on a single thread under a strict shared-nothing paradigm.

  • Mechanism: Core metrics like total_trades and total_volume utilize plain integer types. No mutexes, memory fences, or std::atomic variables exist within the matching loop structures.
  • Benefit: Bypasses all multi-threaded cache-line ownership fights. A single physical CPU core entirely owns the memory pool and double-linked list allocations, preventing instruction-level pipeline serialization stalls.

3. Telemetry: Decoupled Out-of-Band Real-Time Metrics Pipeline

Observability is entirely decoupled from the execution path using an asynchronous, non-blocking pipeline to extract real-time metrics without injecting measurement distortion or cache pollution into the engine.

  • Engine Layer: The matching core constructs a stack-allocated 64-byte MetricsPacket tracking engine states and raw cycle durations measured via assembly rdtsc. This packet is pushed out-of-band via a lock-free SPSCQueue using absolute distance index calculations, keeping the active telemetry tax at a flat 0.00 ns on the trading thread.
  • Ingress Layer: A background MetricsWorker thread consumes the queue on an independent core, flattening the structures into a structured JSON entry string, streaming them directly into /dev/shm/sequitur/telemetry.log (a Linux memory-resident RAM disk) to guarantee zero physical disk I/O blocking.
  • Aggregation & Visualization Layer: A local Vector data agent leverages inotify watches to stream the appended JSON lines instantly into a Grafana Loki instance. Invalid non-JSON startup/shutdown lines are intercepted and dropped at the Vector VRL remapping tier to ensure deterministic parsing. Grafana consumes the Loki data source to render sub-microsecond latency distributions (P50, P99, P99.9), memory allocation bounds, and throughput timelines live in production.

Build and Run

Prerequisites

  • Compiler: Toolchain fully compliant with C++20 (GCC 10+ or Clang 11+)
  • Build System: CMake 3.20 or higher
  • Containerization: Docker & Docker Compose (for Grafana/Loki stack provisioning)
  • Operating System: Linux environment with sudo access (mandatory for real-time scheduling priority adjustments and CPU core binding via chrt and taskset)

Compilation

To compile the matching engine, test suites, and optimized benchmarking targets with full compiler native optimization enabled:

mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build . --target benchmarks
cmake --build . --target unit_tests

Complete Telemetry Pipeline Orchestration

The entire infrastructure stack—including Docker containers, log clearouts, Vector data stream states, and the performance-isolated simulation engine loop—is fully automated via a single root orchestrator script.

To launch the unified pipeline, execute the following command from the project root:

sudo ./run_pipeline.sh

Under the Hood Lifecycle:

The orchestration script manages the system state sequentially to ensure no race conditions occur during system initialization:

  1. Privilege Validation: Asserts the shell session contains root execution access, preventing failure when requesting scheduling overrides.
  2. RAM Disk Allocation: Pre-allocates and clears out the memory-backed file descriptor path inside /dev/shm/sequitur/ to establish zero-I/O streaming bounds.
  3. Container Infrastructure Initialization: Launches the underlying Loki and Grafana container virtualization layer in detached mode (-d).
  4. Telemetry Agent Reset: Kills any orphaned logging agents, purges the local checkpoint cache tracking directory (.vector_data/), and spawns the Vector logging service attached to the local vector.toml layout guidelines.
  5. Hyper-Thread Isolated Loop Execution: Pins the compiled executable context to Physical CPU Core 3 via taskset -c 3 and updates the process scheduling model to the maximum Linux Real-Time FIFO priority (chrt -f 99). This locks out the standard kernel thread manager, keeping the execution path un-preempted by background system tasks.

Once the script displays that the pipeline is active, navigate your browser to http://localhost:3000 to view the Sequitur Real-Time Hardware Performance Monitor live. To shut down the simulation and safely pull down all container networks, send an interrupt signal (Ctrl + C) to the running script window.


Engineering Roadmap and Future Extensions

  • Phase 5: Go Network Gateway Integration (Production Ingress): Integration of a high-throughput, concurrent Go-based network subsystem. The Go layer natively handles extreme socket connection scaling and custom wire protocols, directly unpacking line-wire data to stream raw, structured ingress objects across the boundary ring buffer to the C++ core.
  • Hardware Acceleration Interfacing: Investigating hardware-offloaded network ingestion layers (such as a DPDK kernel bypass or an FPGA network tap) to apply sub-nanosecond wire timestamps directly to incoming packets before handing them off to the Go processing ring.

About

A deterministic, ultra-low latency C++ matching engine for HFT, featuring lock-free concurrency, zero-allocation memory, and sub-microsecond execution.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors