Multi-GPU support: data-parallel keyspace partitioning

**Priority: P2 · Confidence: medium · Effort: large**

### Problem
The engine runs on a single GPU. Users with multiple GPUs cannot pool them, and the keyspace (2^40) is embarrassingly parallel — ideal for linear multi-GPU scaling.

### Suggested fix
Data-parallel partitioning at the launcher level:
- Enumerate all devices for the active back-end.
- Split the `[start_key, end_key)` range into N contiguous shards (one per device), or use a work-queue so faster GPUs pull more shards (better load balancing on heterogeneous setups).
- Each device runs the existing kernel on its shard; collect each device's best `(score, key)` and reduce host-side.
- Extend the `.progress` checkpoint to record per-shard offsets so a multi-GPU run resumes correctly.

The single-best `atomicMax` packing already used per-device makes the per-device result collection trivial; the new work is device enumeration, shard scheduling, and the host-side reduce.

### Why it matters
Near-linear throughput scaling — the highest-value performance item after the engine is consolidated. Independent of the abstraction-layer refactor; can land on the current CUDA path first.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multi-GPU support: data-parallel keyspace partitioning #39

Problem

Suggested fix

Why it matters

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Multi-GPU support: data-parallel keyspace partitioning #39

Description

Problem

Suggested fix

Why it matters

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions