Priority: P2 · Confidence: medium · Effort: large
Problem
The engine runs on a single GPU. Users with multiple GPUs cannot pool them, and the keyspace (2^40) is embarrassingly parallel — ideal for linear multi-GPU scaling.
Suggested fix
Data-parallel partitioning at the launcher level:
- Enumerate all devices for the active back-end.
- Split the
[start_key, end_key) range into N contiguous shards (one per device), or use a work-queue so faster GPUs pull more shards (better load balancing on heterogeneous setups).
- Each device runs the existing kernel on its shard; collect each device's best
(score, key) and reduce host-side.
- Extend the
.progress checkpoint to record per-shard offsets so a multi-GPU run resumes correctly.
The single-best atomicMax packing already used per-device makes the per-device result collection trivial; the new work is device enumeration, shard scheduling, and the host-side reduce.
Why it matters
Near-linear throughput scaling — the highest-value performance item after the engine is consolidated. Independent of the abstraction-layer refactor; can land on the current CUDA path first.
Priority: P2 · Confidence: medium · Effort: large
Problem
The engine runs on a single GPU. Users with multiple GPUs cannot pool them, and the keyspace (2^40) is embarrassingly parallel — ideal for linear multi-GPU scaling.
Suggested fix
Data-parallel partitioning at the launcher level:
[start_key, end_key)range into N contiguous shards (one per device), or use a work-queue so faster GPUs pull more shards (better load balancing on heterogeneous setups).(score, key)and reduce host-side..progresscheckpoint to record per-shard offsets so a multi-GPU run resumes correctly.The single-best
atomicMaxpacking already used per-device makes the per-device result collection trivial; the new work is device enumeration, shard scheduling, and the host-side reduce.Why it matters
Near-linear throughput scaling — the highest-value performance item after the engine is consolidated. Independent of the abstraction-layer refactor; can land on the current CUDA path first.