Skip to content

feat(oceanbase): add multi-process loader, configurable index/partition params, and HNSW_BQ cosine support #769

Open
wyfanxiao wants to merge 2 commits into
zilliztech:mainfrom
wyfanxiao:develop
Open

feat(oceanbase): add multi-process loader, configurable index/partition params, and HNSW_BQ cosine support #769
wyfanxiao wants to merge 2 commits into
zilliztech:mainfrom
wyfanxiao:develop

Conversation

@wyfanxiao
Copy link
Copy Markdown
Contributor

Hi team! 👋

This PR enhances the OceanBase client and adds a multi-process data loader to improve load throughput for SQL-based vector DBs.

Summary

  • Add MultiprocessInsertRunner for parallel data loading across worker processes, bypassing GIL limitation for SQL-based vector DBs
  • OceanBase: declare thread_safe=False, auto-switch to multi-process loader when --load-concurrency>1
  • Add --load-processes CLI option for explicit multi-process control
  • Add --create-index-parallel CLI option (default 16) for configurable index build parallelism
  • Add --extra-info-max-size CLI option (default 32, set 0 to omit) for HNSW index
  • Add --partitions CLI option for KEY partitioning (default 0, no partition)
  • HNSW_BQ: remove forced L2 for cosine metric, now supports cosine natively
  • Improve concurrent_runner log message for thread_safe=False fallback
  • pyproject.toml: add pyyaml dependency, fix packages.find to include all subpackages

We've been using these changes in our OceanBase benchmark testing and they've been working well. Would love to get this merged so it can benefit other users too. Thanks in advance for taking the time to review! Happy to address any feedback. 🙏

…W_BQ cosine support

- Add MultiprocessInsertRunner for parallel data loading across worker processes
- OceanBase: declare thread_safe=False, auto-switch to multi-process loader
- Add --load-processes CLI option for explicit multi-process control
- Add --create-index-parallel CLI option (default 16)
- Add --extra-info-max-size CLI option (default 32, set 0 to omit)
- Add --partitions CLI option for KEY partitioning (default 0, no partition)
- HNSW_BQ: remove forced L2 for cosine, now supports cosine natively
- Improve concurrent_runner log message for thread_safe=False fallback
- pyproject.toml: add pyyaml dependency, fix packages.find to include all subpackages
@sre-ci-robot
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wyfanxiao
To complete the pull request process, please assign xuanyang-cn after the PR has been reviewed.
You can assign the PR to them by writing /assign @xuanyang-cn in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wyfanxiao wyfanxiao changed the title feat(oceanbase): multi-process loader, configurable index params, HNS… feat(oceanbase): add multi-process loader, configurable index/partition params, and HNSW_BQ cosine support Apr 29, 2026
@wyfanxiao
Copy link
Copy Markdown
Contributor Author

Hi @XuanYang-cn. Friendly ping on this PR when you have a moment — happy to adjust based on feedback.

Copy link
Copy Markdown
Collaborator

@XuanYang-cn XuanYang-cn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. The OceanBase-specific improvements look useful, especially the configurable index parallelism, partitions, extra_info_max_size, and the HNSW_BQ cosine update.

I do have concerns about merging the shared MultiprocessInsertRunner part as-is, because it changes the benchmark insert procedure rather than only tuning the OceanBase client.

My main concern is benchmark correctness and fairness. For clients that declare
thread_safe=False, this PR auto-switches from the existing single-worker fallback to an N-process loader when load_concurrency > 1. That may be a useful direction, but it changes the benchmark contract: some DBs may load through threads, some through one connection, and some through multiple independent processes/connections. We need to be explicit about whether that is considered fair and comparable.

Also, I don’t think thread_safe=False automatically means process-safe. It only tells us one client/connection should not be shared across threads. Multiprocess loading additionally assumes the client object is safely picklable, each child process can reinitialize an independent connection, concurrent inserts into the same table are supported, failures do not leave partial-success benchmark results, and inserted row counts can be verified before optimize/search.

There are also a few concrete correctness issues in the current runner implementation:

  1. MultiprocessInsertRunner can report partial load success. On the normal path it sends sentinels and waits a fixed join window, then terminates workers that are still alive. After that it returns counter.value without checking that inserted == produced. If the final batches take longer than the join window, the benchmark can continue to optimize/search a partially loaded table.

  2. load_timeout is not enforced while the producer is blocked on a full queue. The timeout check only runs after a successful enqueue. If workers hang or become very slow, _put_with_interruptible_wait can loop indefinitely as long as at least one worker is still alive.

  3. The timeout path calls PerformanceTimeoutError(msg), but PerformanceTimeoutError.__init__ currently takes no message argument. That raises TypeError instead of the intended timeout exception, so the case would be marked as a generic failure instead of the existing timeout/out-of-range path.

Given this, could we split this PR?

My preferred path is:

  1. Keep this PR focused on the OceanBase client/config/CLI changes, which are relatively contained and can likely move forward quickly after smaller review comments are addressed.
  2. Move MultiprocessInsertRunner, --load-processes, and the auto-switch behavior into a follow-up PR. (I can take the follow-up from there: decide whether multiprocess insert should be opt-in, auto-enabled only for selected clients, or generalized as the default loader strategy, and add the tests/metadata needed to preserve benchmark fairness.)

I’m open to the multiprocess direction, but I think it is larger than the OceanBase client change and needs dedicated verification before becoming part of the shared benchmark execution path.

@wyfanxiao
Copy link
Copy Markdown
Contributor Author

Hi @XuanYang-cn. Thanks for the thorough review and great suggestions! I've split this into two PRs as you recommended:

Happy to continue the discussion on the multi-process PR. Thanks again for the detailed feedback!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants