Reference implementation for the experiment framework in the paper Leveraging Neural Graph Compilers in Machine Learning Research for Edge-Cloud Systems.
Benchmarks neural graph compilers under a uniform interface. It measures latency, throughput, and CPU/RAM/GPU utilization for off-the-shelf architectures and for synthetic blocks swept over width and depth, then derives the batch-scaling metrics used in the paper. Compilation runs locally on the target device; no hardware simulators are used.
Massive kudos to my student Carmen Walser for implementing the initial core for the compiler configuration and the testbed.
If you face any issues, just send a mail to a.furutanpey@coovally.ai, and I'll gladly try to carve out the time to help you out.
```bash
./setup.sh
```

Creates `.venv` and installs the base backends (identity, TorchScript, ONNX Runtime) and the analysis. Requires Python 3.10+. Vendor backends are installed per device:

```bash
.venv/bin/pip install '.[openvino]'   # CPU servers
.venv/bin/pip install '.[cuda]'       # CUDA devices (TensorRT)
```

Apache TVM (0.18.dev) is built separately, see https://tvm.apache.org/docs/install.
```bash
./run.sh <experiment> <compiler> [device]
```

`device` is a label used to organize output. Results are written to `results/<device>/<compiler>/<run>.csv`, one tidy row per run; existing files are skipped so an interrupted sweep resumes. Pass `--quick` for a fast smoke run (batch sizes 1 and 8).
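The skip-if-exists behavior could look roughly like the sketch below. It is an illustration under stated assumptions, not the repository's code: `result_path`, `run_pending`, and the `run_fn` callback are hypothetical names, and only the `results/<device>/<compiler>/<run>.csv` layout is taken from the description above.

```python
from pathlib import Path
from typing import Callable, Iterable


def result_path(root: Path, device: str, compiler: str, run: str) -> Path:
    """Build <root>/<device>/<compiler>/<run>.csv (layout from the text above)."""
    return root / device / compiler / f"{run}.csv"


def run_pending(
    root: Path,
    device: str,
    compiler: str,
    runs: Iterable[str],
    run_fn: Callable[[str], str],
) -> list[str]:
    """Execute only the runs that have no CSV yet; return the names that ran.

    run_fn is a hypothetical callback that performs one run and returns the
    CSV text (header plus one row) to store.
    """
    executed = []
    for run in runs:
        path = result_path(root, device, compiler, run)
        if path.exists():
            continue  # finished in an earlier, possibly interrupted, sweep
        path.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temporary file first so a crash mid-write never leaves
        # a partial CSV that a later sweep would mistake for a finished run.
        tmp = path.with_name(path.name + ".tmp")
        tmp.write_text(run_fn(run))
        tmp.replace(path)
        executed.append(run)
    return executed
```

Calling `run_pending` a second time with the same run names does no work, which is what makes re-invoking a sweep after an interruption cheap.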
```bash
./run.sh architectures tensorrt gpu
./run.sh conv_blocks openvino xeon
./run.sh mha_blocks identity orin --quick
```

| experiment | models |
|---|---|
| architectures | ResNet / EfficientNet / DeiT / Swin / ConvNeXt, 3 sizes each (Table IV) |
| conv_blocks | stacked Conv-BatchNorm-ReLU, swept over width and depth (Table V) |
| mha_blocks | stacked self-attention + ReLU, swept over width and depth (Table V) |
| conv2d | a single convolution |
| fully_connected | a single linear layer |
| self_attention | a single multi-head attention layer |

| compiler | role |
|---|---|
| identity | PyTorch dynamic graph (baseline) |
| torchscript | software-level optimization |
| onnxruntime | software-level optimization |
| openvino | vendor-specific (Intel CPU) |
| tensorrt | vendor-specific (NVIDIA GPU) |
| tvm | vendor-agnostic, with AutoTVM tuning |

The experiment grid (widths, depths, batch sizes) lives in configs/experiments.yaml;
the defaults reproduce the configurations reported in the paper. Per-compiler settings
(device, precision, threads, optimization level) live in configs/<compiler>.yaml.
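As a rough illustration of how a grid of widths, depths, and batch sizes expands into individual runs, consider the sketch below. The key names (`widths`, `depths`, `batch_sizes`), the values, and the `expand` helper are assumptions made for the example; the actual schema of configs/experiments.yaml may differ, and a plain dictionary stands in for the parsed YAML.

```python
from itertools import product

# Stand-in for one experiment family parsed from configs/experiments.yaml.
# Key names and values are assumptions for illustration, not the real file.
grid = {
    "widths": [64, 128],
    "depths": [1, 4],
    "batch_sizes": [1, 8],
}


def expand(name: str, grid: dict[str, list[int]]) -> list[dict]:
    """Return one run description per (width, depth, batch size) combination."""
    return [
        {"experiment": name, "width": w, "depth": d, "batch_size": b}
        for w, d, b in product(grid["widths"], grid["depths"], grid["batch_sizes"])
    ]


runs = expand("conv_blocks", grid)
print(len(runs))  # 2 widths * 2 depths * 2 batch sizes = 8
print(runs[0])
```

Because the grid is a Cartesian product, adding one more width or depth multiplies the number of runs, which is worth keeping in mind before editing the defaults.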
We provide the experiment traces in resources/traces/results_raw.zip.
The paper uses three devices (Table I). Run on each, with the compilers that device supports:
```bash
./sweep.sh gpu identity torchscript onnxruntime tensorrt tvm    # RTX 4070 server
./sweep.sh xeon identity torchscript onnxruntime openvino tvm   # Xeon CPU server
./sweep.sh orin identity torchscript tensorrt                   # Jetson Orin Nano
```

Each experiment is repeated 100 times after 10 warmup iterations. Library versions used in the paper (Table II): PyTorch 2.4.1, ONNX Runtime 1.19.2, TensorRT 10.4.0, OpenVINO 2024.3.0, Apache TVM 0.18.dev, timm 1.0.15, CUDA 12.5, cuDNN 9.3.0.
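The repetition scheme (10 untimed warmup iterations followed by 100 timed ones) can be sketched as below. This is a minimal stand-in rather than the contents of measure.py: `time_inference`, the `step` callable, and the returned field names are hypothetical, and system-metric monitoring is left out. Throughput is derived as batch size divided by mean latency.

```python
import statistics
import time
from typing import Callable


def time_inference(
    step: Callable[[], object],
    batch_size: int,
    warmup: int = 10,
    repeats: int = 100,
) -> dict[str, float]:
    """Time `step` after discarding warmup iterations.

    `step` is a hypothetical zero-argument callable that runs one forward
    pass on a batch of `batch_size` samples.
    """
    for _ in range(warmup):
        step()  # untimed: lets caches, allocators, and lazy init settle

    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        step()
        latencies.append(time.perf_counter() - start)

    mean_s = statistics.fmean(latencies)
    return {
        "latency_ms_mean": mean_s * 1e3,
        "latency_ms_median": statistics.median(latencies) * 1e3,
        "throughput_samples_per_s": batch_size / mean_s,
    }


if __name__ == "__main__":
    # Dummy workload standing in for a compiled model's forward pass.
    stats = time_inference(lambda: sum(range(10_000)), batch_size=8)
    print(stats)
```

On a GPU, kernels are launched asynchronously, so `step` would need to block until the device has finished (in PyTorch, by calling `torch.cuda.synchronize()`) for the wall-clock time to be meaningful.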
```
ngraphbench/
compilers/          one backend per compiler behind a common interface
models.py           architectures (timm) and synthetic Conv/MHA blocks
experiments.py      expands the grid into individual runs
measure.py          timed inference loop and system-metric monitoring
run.py              CLI: run one experiment family with one compiler
results.py          tidy result schema
analysis/           load runs and traces, compute derived metrics (no plotting)
configs/            experiment grid and per-compiler settings
resources/traces/   released measurement traces
```
Comments in the source point to the corresponding sections and tables of the paper.
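For readers who want to inspect results without the analysis package, the sketch below gathers every result file into a list of rows using only the standard library. It assumes nothing beyond the `results/<device>/<compiler>/<run>.csv` layout; `load_runs` is a hypothetical helper, and the column names depend on whatever header the CSV files carry.

```python
import csv
from pathlib import Path


def load_runs(root: Path) -> list[dict[str, str]]:
    """Collect every <root>/<device>/<compiler>/<run>.csv into a list of rows.

    Device, compiler, and run name are taken from the path. The remaining
    columns come from the CSV header (no schema is assumed), and a column
    in the file with the same name takes precedence over the path value.
    """
    rows = []
    for path in sorted(root.glob("*/*/*.csv")):
        device, compiler = path.parts[-3], path.parts[-2]
        with path.open(newline="") as fh:
            for row in csv.DictReader(fh):
                rows.append(
                    {"device": device, "compiler": compiler, "run": path.stem, **row}
                )
    return rows


if __name__ == "__main__":
    for row in load_runs(Path("results")):
        print(row)
```

All values are returned as strings, so numeric columns need an explicit conversion before computing any derived metric.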
[[TPDS]]
```bibtex
@article{furutanpey2025leveraging,
  title={Leveraging Neural Graph Compilers in Machine Learning Research for Edge-Cloud Systems},
  author={Furutanpey, Alireza and Walser, Carmen and Raith, Philipp and Frangoudis, Pantelis A and Dustdar, Schahram},
  journal={arXiv preprint arXiv:2504.20198},
  year={2025}
}
```