Skip to content

Latest commit

 

History

History
497 lines (377 loc) · 24.1 KB

File metadata and controls

497 lines (377 loc) · 24.1 KB

AGENTS.md — Pathrex

Project Overview

Pathrex is a Rust library and CLI tool for benchmarking queries on edge-labeled graphs constrained by regular languages and context-free languages. It uses SuiteSparse:GraphBLAS (via LAGraph) for sparse Boolean matrix operations and decomposes a graph by edge label into one Boolean adjacency matrix per label.

Repository Layout

pathrex/
├── Cargo.toml                  # Crate manifest (edition 2024)
├── build.rs                    # Links LAGraph + LAGraphX; optionally regenerates FFI bindings
├── src/
│   ├── lib.rs                  # Modules: formats, graph, rpq, sparql, utils, lagraph_sys
│   ├── main.rs                 # Binary entry point (placeholder)
│   ├── lagraph_sys.rs          # FFI module — includes generated bindings
│   ├── lagraph_sys_generated.rs# Bindgen output (checked in, regenerated in CI)
│   ├── utils.rs                # Public helpers: CountingBuilder, CountOutput, VecSource,
│   │                           #   grb_ok! and la_ok! macros, build_graph
│   ├── graph/
│   │   ├── mod.rs              # Core traits (GraphBuilder, GraphDecomposition, GraphSource,
│   │   │                       #   Backend, Graph<B>), error types, RAII wrappers, GrB init
│   │   └── inmemory.rs         # InMemory marker, InMemoryBuilder, InMemoryGraph
│   ├── rpq/
│   │   ├── mod.rs              # RpqEvaluator (assoc. Result), RpqQuery, Endpoint, PathExpr, RpqError
│   │   ├── nfarpq.rs           # NfaRpqEvaluator (LAGraph_RegularPathQuery)
│   │   └── rpqmatrix.rs        # Matrix-plan RPQ evaluator
│   ├── sparql/
│   │   └── mod.rs              # parse_rpq / extract_rpq → RpqQuery (spargebra)
│   └── formats/
│       ├── mod.rs              # FormatError enum, re-exports
│       ├── csv.rs              # Csv<R> — CSV → Edge iterator (CsvConfig, ColumnSpec)
│       ├── mm.rs               # MatrixMarket directory loader (vertices.txt, edges.txt, *.txt)
│       └── nt.rs               # NTriples<R> — N-Triples → Edge iterator (full predicate IRI labels)
├── tests/
│   ├── inmemory_tests.rs       # Integration tests for InMemoryBuilder / InMemoryGraph
│   ├── mm_tests.rs             # Integration tests for MatrixMarket format
│   ├── nfarpq_tests.rs         # Integration tests for NfaRpqEvaluator
│   └── rpqmatrix_tests.rs      # Integration tests for matrix-plan RPQ evaluator
├── deps/
│   └── LAGraph/                # Git submodule (SparseLinearAlgebra/LAGraph)
└── .github/workflows/ci.yml   # CI: build GraphBLAS + LAGraph, cargo build & test

Build & Dependencies

System prerequisites

Dependency Purpose
SuiteSparse:GraphBLAS Sparse matrix engine (libgraphblas)
LAGraph Graph algorithm library on top of GraphBLAS (liblagraph)
cmake Building LAGraph from source
libclang-dev / clang Required by bindgen when regenerate-bindings feature is active

Building

# Ensure submodules are present
git submodule update --init --recursive

# Build and install SuiteSparse:GraphBLAS system-wide
git clone --depth 1 https://github.com/DrTimothyAldenDavis/GraphBLAS.git
cd GraphBLAS && make compact && sudo make install && cd ..

# Build LAGraph inside the submodule (no system-wide install required)
cd deps/LAGraph && make && cd ../..

# Build pathrex
cargo build

# Run tests
LD_LIBRARY_PATH=deps/LAGraph/build/src:deps/LAGraph/build/experimental:/usr/local/lib cargo test

How build.rs handles linking

build.rs performs two jobs:

  1. Native linking. It emits six Cargo directives:

    • cargo:rustc-link-lib=dylib=graphblas — dynamically links libgraphblas.
    • cargo:rustc-link-search=native=/usr/local/lib — adds the system GraphBLAS install path to the native library search path.
    • cargo:rustc-link-lib=dylib=lagraph — dynamically links liblagraph.
    • cargo:rustc-link-search=native=deps/LAGraph/build/src — adds the submodule's core build output to the native library search path.
    • cargo:rustc-link-lib=dylib=lagraphx — dynamically links liblagraphx (experimental algorithms).
    • cargo:rustc-link-search=native=deps/LAGraph/build/experimental — adds the experimental build output to the native library search path.

    LAGraph does not need to be installed system-wide; building the submodule in deps/LAGraph/ is sufficient for compilation and linking. SuiteSparse:GraphBLAS must be installed system-wide (sudo make install).

    At runtime the OS dynamic linker (ld.so) does not use Cargo's link search paths — it only consults LD_LIBRARY_PATH, rpath, and the system library cache. Set LD_LIBRARY_PATH=/usr/local/lib after a system-wide LAGraph install, or include the submodule build paths if not installing system-wide.

  2. Optional FFI binding regeneration (feature regenerate-bindings). When the feature is active, regenerate_bindings() runs bindgen against deps/LAGraph/include/LAGraph.h and deps/LAGraph/include/LAGraphX.h (always from the submodule — no system path search), plus GraphBLAS.h (searched in /usr/local/include/suitesparse and /usr/include/suitesparse). The generated Rust file is written to src/lagraph_sys_generated.rs. Only a curated allowlist of GraphBLAS/LAGraph types and functions is exposed (see the allowlist_* calls in build.rs).

Feature flags

Feature Effect
regenerate-bindings Runs bindgen at build time to regenerate src/lagraph_sys_generated.rs from LAGraph.h, LAGraphX.h (both from deps/LAGraph/include) and GraphBLAS.h. Without this feature the checked-in bindings are used as-is.

Pre-generated FFI bindings

The file src/lagraph_sys_generated.rs is checked into version control. CI regenerates it with --features regenerate-bindings. Do not hand-edit this file.

Architecture & Key Abstractions

Edge

Edge is the universal currency between format parsers and graph builders: { source: String, target: String, label: String }.

GraphSource trait

GraphSource<B> is implemented by any data source that knows how to feed itself into a specific [GraphBuilder]:

Csv<R>, MatrixMarket, and NTriples<R> implement GraphSource<InMemoryBuilder> (see src/graph/inmemory.rs), so they can be passed to [GraphBuilder::load] and [Graph::try_from].

GraphBuilder trait

GraphBuilder accumulates edges and produces a GraphDecomposition:

InMemoryBuilder also exposes lower-level helpers outside the trait:

Backend trait & Graph<B> handle

Backend associates a marker type with a concrete builder/graph pair:

pub trait Backend {
    type Graph: GraphDecomposition;
    type Builder: GraphBuilder<Graph = Self::Graph>;
}

Graph<B> is a zero-sized handle parameterised by a Backend:

InMemory is the concrete backend marker type.

GraphDecomposition trait

GraphDecomposition is the read-only query interface:

InMemoryBuilder / InMemoryGraph

InMemoryBuilder is the primary GraphBuilder implementation. It collects edges in RAM, then build() calls GraphBLAS to create one GrB_Matrix per label via COO format, wraps each in an LAGraph_Graph, and returns an InMemoryGraph.

Multiple CSV sources can be chained with repeated .load() calls; all edges are merged into a single graph.

Node ID representation: Internally, InMemoryBuilder uses HashMap<usize, String> for id_to_node (changed from Vec<String> to support sparse/pre-assigned IDs from MatrixMarket). The set_node_map() method allows bulk-installing a node mapping, which is used by the MatrixMarket loader.

Format parsers

Three built-in parsers are available, each yielding Iterator<Item = Result<Edge, FormatError>> and pluggable into GraphBuilder::load() via GraphSource<InMemoryBuilder> (see src/graph/inmemory.rs). CSV and MatrixMarket edge loaders are available:

Csv<R>

Csv<R> parses delimiter-separated edge files.

Configuration is via CsvConfig:

Field Default Description
source_column Index(0) Column for the source node (by index or name)
target_column Index(1) Column for the target node
label_column Index(2) Column for the edge label
has_header true Whether the first row is a header
delimiter b',' Field delimiter byte

ColumnSpec is either Index(usize) or Name(String). Name-based lookup requires has_header: true.

MatrixMarket directory format

MatrixMarket loads an edge-labeled graph from a directory with:

  • vertices.txt — one line per node: <node_name> <1-based-index> on disk; get_node_id returns the matching 0-based matrix index
  • edges.txt — one line per label: <label_name> <1-based-index> (selects n.txt)
  • <n>.txt — MatrixMarket adjacency matrix for label with index n

Names in mapping files may be written with SPARQL-style angle brackets (e.g. <Article1>). parse_index_map strips a single pair of surrounding </> so dictionary keys match short labels (Article1), aligning with IRIs after RpqQuery::strip_base on SPARQL-derived queries.

The loader uses LAGraph_MMRead to parse each .txt file into a GrB_Matrix, then wraps it in an LAGraph_Graph. Vertex indices from vertices.txt are converted to 0-based and installed via InMemoryBuilder::set_node_map().

Helper functions:

MatrixMarket implements GraphSource<InMemoryBuilder> in src/graph/inmemory.rs (see the impl at line 215): vertices.txt maps are converted from 1-based file indices to 0-based matrix ids before set_node_map; edges.txt indices are unchanged for n.txt lookup.

NTriples<R>

NTriples<R> parses W3C N-Triples RDF files using oxttl and oxrdf. Each triple (subject, predicate, object) becomes an Edge where:

  • source — subject IRI or blank-node ID (_:label).
  • target — object IRI or blank-node ID; triples whose object is an RDF literal yield Err(FormatError::LiteralAsNode) (callers may filter these out).
  • label — predicate IRI, transformed by LabelExtraction:
Variant Behaviour
LocalName (default) Fragment (#name) or last path segment of the predicate IRI
FullIri Full predicate IRI string

Constructors:

SPARQL parsing (src/sparql/mod.rs)

The sparql module uses the spargebra crate to parse SPARQL 1.1 query strings and build a pathrex-native RpqQuery for RPQ evaluators.

Supported query form: SELECT queries with exactly one triple or property path pattern in the WHERE clause. Relative IRIs such as <knows> require a BASE declaration (or PREFIX / full IRIs). Example:

BASE <http://example.org/>
SELECT ?x ?y WHERE { ?x <knows>/<likes>* ?y . }

Key public items:

  • parse_rpq(sparql) — parses a SPARQL string with SparqlParser and returns an RpqQuery.
  • extract_rpq(query) — validates a parsed [spargebra::Query] is a SELECT with a single path pattern and returns an RpqQuery. Use this when you construct a custom SparqlParser (e.g. with prefix declarations) and call parse_query yourself.
  • ExtractError — error enum for extraction failures (NotSelect, NotSinglePath, UnsupportedSubject, UnsupportedObject, VariablePredicate). Converts to RpqError::Extract via #[from].

Call RpqQuery::strip_base when graph edge labels are short names and the parsed query contains full IRIs sharing a common prefix.

The module handles spargebra's desugaring of sequence paths (?x <a>/<b>/<c> ?y) from a chain of BGP triples back into a single path expression.

RPQ evaluation (src/rpq/)

The rpq module provides an abstraction for evaluating Regular Path Queries (RPQs) over edge-labeled graphs using GraphBLAS/LAGraph.

Key public items:

  • EndpointVariable(String) or Named(String) (IRI string).
  • PathExprLabel, Sequence, Alternative, ZeroOrMore, OneOrMore, ZeroOrOne.
  • RpqQuery{ subject, path, object } using the types above; strip_base(&mut self, base) removes a shared IRI prefix from named endpoints and labels.
  • RpqEvaluator — trait with associated type Result and evaluate(query, graph) taking &RpqQuery and [GraphDecomposition], returning Result<Self::Result, RpqError>. Each concrete evaluator exposes its own output type (see below).
  • RpqError — unified error type for RPQ parsing and evaluation: Parse (SPARQL syntax), Extract (query extraction), UnsupportedPath, VertexNotFound, and Graph (wraps GraphError for label-not-found and GraphBLAS/LAGraph failures).

NfaRpqResult wraps a [GraphblasVector] of reachable target vertices. When the subject is a variable, every vertex is used as a source and LAGraph_RegularPathQuery returns the union of targets — individual (source, target) pairs are not preserved.

RpqMatrixEvaluator (src/rpq/rpqmatrix.rs)

RpqMatrixEvaluator compiles [PathExpr] into a Boolean matrix plan over label adjacency matrices and runs [LAGraph_RPQMatrix]. It returns RpqMatrixResult: the path-relation nnz plus a [GraphblasMatrix] duplicate of the result matrix (full reachability relation for the path). Subject/object do not filter the matrix; a named subject is only validated to exist. Bound objects are not supported yet ([RpqError::UnsupportedPath]). NTriples<R> parses W3C N-Triples RDF files using oxttl and oxrdf. Each triple (subject, predicate, object) becomes an Edge where:

  • source — subject IRI or blank-node ID (_:label).
  • target — object IRI or blank-node ID; triples whose object is an RDF literal yield Err(FormatError::LiteralAsNode) (callers may filter these out).
  • label — full predicate IRI string (including fragment #… when present).

Constructor:

SPARQL parsing (src/sparql/mod.rs)

The rpq module provides an abstraction for evaluating Regular Path Queries (RPQs) over edge-labeled graphs using GraphBLAS/LAGraph.

Key public items:

  • EndpointVariable(String) or Named(String) (IRI string).
  • PathExprLabel, Sequence, Alternative, ZeroOrMore, OneOrMore, ZeroOrOne.
  • RpqQuery{ subject, path, object } using the types above; strip_base(&mut self, base) removes a shared IRI prefix from named endpoints and labels.
  • RpqEvaluator — trait with associated type Result and evaluate(query, graph) taking &RpqQuery and [GraphDecomposition], returning Result<Self::Result, RpqError>. Each concrete evaluator exposes its own output type (see below).
  • RpqError — unified error type for RPQ parsing and evaluation: Parse (SPARQL syntax), Extract (query extraction), UnsupportedPath, VertexNotFound, and Graph (wraps GraphError for label-not-found and GraphBLAS/LAGraph failures).

NfaRpqEvaluator (src/rpq/nfarpq.rs)

NfaRpqEvaluator implements [RpqEvaluator] by:

  1. Converting a [PathExpr] into an Nfa via Thompson's construction (Nfa::from_path_expr()).
  2. Eliminating ε-transitions via epsilon closure (NfaBuilder::epsilon_closure()).
  3. Building one LAGraph_Graph per NFA label transition (Nfa::build_lagraph_matrices()).
  4. Calling [LAGraph_RegularPathQuery] with the NFA matrices, data-graph matrices, start/final states, and source vertices.

type Result = NfaRpqResult ([GraphblasVector] of reachable targets).

Supported path operators match [PathExpr] variants above. Reverse and NegatedPropertySet from SPARQL map to [RpqError::UnsupportedPath] when they appear in extracted paths.

Subject/object resolution: [Endpoint::Variable] means "all vertices"; [Endpoint::Named] resolves to a single vertex via GraphDecomposition::get_node_id().

NfaRpqResult wraps a [GraphblasVector] of reachable target vertices. When the subject is a variable, every vertex is used as a source and LAGraph_RegularPathQuery returns the union of targets — individual (source, target) pairs are not preserved.

RpqMatrixEvaluator (src/rpq/rpqmatrix.rs)

RpqMatrixEvaluator compiles [PathExpr] into a Boolean matrix plan over label adjacency matrices and runs [LAGraph_RPQMatrix]. It returns RpqMatrixResult: the path-relation nnz plus a [GraphblasMatrix] duplicate of the result matrix (full reachability relation for the path). Subject/object do not filter the matrix; a named subject is only validated to exist. Bound objects are not supported yet ([RpqError::UnsupportedPath]).

FFI layer

lagraph_sys exposes raw C bindings for GraphBLAS and LAGraph. Safe Rust wrappers live in graph::mod:

Macros & helpers (src/utils.rs)

Two #[macro_export] macros handle FFI error mapping:

  • grb_ok!(expr) — evaluates a GraphBLAS call inside unsafe, maps the i32 return to Result<(), GraphError::GraphBlas(info)>.
  • la_ok!(fn::path(args…)) — evaluates a LAGraph call, automatically appending the required *mut i8 message buffer, and maps failure to GraphError::LAGraph(info, msg).

A convenience function is also provided:

  • build_graph(edges) — builds an InMemoryGraph from a slice of (&str, &str, &str) triples (source, target, label). Used by integration tests.

Coding Conventions

  • Rust edition 2024.
  • Error handling via thiserror derive macros; three main error enums: GraphError, FormatError, and RpqError.
  • FormatError converts into GraphError via #[from] FormatError on the GraphError::Format variant.
  • GraphError converts into RpqError via #[from] GraphError on the RpqError::Graph variant, enabling ? propagation in evaluators.
  • Unsafe FFI calls are confined to lagraph_sys, graph/mod.rs, graph/inmemory.rs, rpq/nfarpq.rs. All raw pointers are wrapped in RAII types that free resources on drop.
  • unsafe impl Send + Sync is provided for LagraphGraph, GraphblasVector, and GraphblasMatrix because GraphBLAS handles are thread-safe after init.
  • Unit tests live in #[cfg(test)] mod tests blocks inside each module. Integration tests that need GraphBLAS live in tests/inmemory_tests.rs, tests/mm_tests.rs, tests/nfarpq_tests.rs.

Testing

# Run all tests (LAGraph installed system-wide)
LD_LIBRARY_PATH=/usr/local/lib cargo test --verbose

# If LAGraph is NOT installed system-wide (only built in the submodule):
LD_LIBRARY_PATH=deps/LAGraph/build/src:deps/LAGraph/build/experimental:/usr/local/lib cargo test --verbose

Tests in src/graph/mod.rs use CountingBuilder / CountOutput / VecSource from src/utils.rs — these do not call into GraphBLAS and run without native libraries.

Tests in src/formats/csv.rs and src/formats/nt.rs are pure Rust and need no native dependencies.

Tests in src/sparql/mod.rs are pure Rust and need no native dependencies.

Tests in src/rpq/nfarpq.rs (NFA construction unit tests) are pure Rust and need no native dependencies.

Tests in src/graph/inmemory.rs, tests/inmemory_tests.rs, tests/mm_tests.rs, tests/nfarpq_tests.rs, and tests/rpqmatrix_tests.rs call real GraphBLAS/LAGraph and require the native libraries to be present.

CI

The GitHub Actions workflow (.github/workflows/ci.yml) runs on every push and PR across stable, beta, and nightly toolchains:

  1. Checks out with submodules: recursive.
  2. Installs cmake, libclang-dev, clang.
  3. Builds and installs SuiteSparse:GraphBLAS from source (sudo make install).
  4. Builds and installs LAGraph from the submodule (sudo make install).
  5. cargo build --features regenerate-bindings — rebuilds FFI bindings.
  6. LD_LIBRARY_PATH=/usr/local/lib cargo test --verbose — runs the full test suite.