Skip to content

Multi-GPU support: data-parallel keyspace partitioning #39

Description

@felixx-sp

Priority: P2 · Confidence: medium · Effort: large

Problem

The engine runs on a single GPU. Users with multiple GPUs cannot pool them, and the keyspace (2^40) is embarrassingly parallel — ideal for linear multi-GPU scaling.

Suggested fix

Data-parallel partitioning at the launcher level:

  • Enumerate all devices for the active back-end.
  • Split the [start_key, end_key) range into N contiguous shards (one per device), or use a work-queue so faster GPUs pull more shards (better load balancing on heterogeneous setups).
  • Each device runs the existing kernel on its shard; collect each device's best (score, key) and reduce host-side.
  • Extend the .progress checkpoint to record per-shard offsets so a multi-GPU run resumes correctly.

The single-best atomicMax packing already used per-device makes the per-device result collection trivial; the new work is device enumeration, shard scheduling, and the host-side reduce.

Why it matters

Near-linear throughput scaling — the highest-value performance item after the engine is consolidated. Independent of the abstraction-layer refactor; can land on the current CUDA path first.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions