feat(oceanbase): add multi-process loader, configurable index/partition params, and HNSW_BQ cosine support #769
wyfanxiao wants to merge 2 commits into
Conversation
…W_BQ cosine support
- Add MultiprocessInsertRunner for parallel data loading across worker processes
- OceanBase: declare thread_safe=False, auto-switch to multi-process loader
- Add --load-processes CLI option for explicit multi-process control
- Add --create-index-parallel CLI option (default 16)
- Add --extra-info-max-size CLI option (default 32, set 0 to omit)
- Add --partitions CLI option for KEY partitioning (default 0, no partition)
- HNSW_BQ: remove forced L2 for cosine, now supports cosine natively
- Improve concurrent_runner log message for thread_safe=False fallback
- pyproject.toml: add pyyaml dependency, fix packages.find to include all subpackages
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: wyfanxiao. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Hi @XuanYang-cn. Friendly ping on this PR when you have a moment — happy to adjust based on feedback.
XuanYang-cn
left a comment
Thanks for the PR. The OceanBase-specific improvements look useful, especially the configurable index parallelism, partitions, extra_info_max_size, and the HNSW_BQ cosine update.
I do have concerns about merging the shared MultiprocessInsertRunner part as-is, because it changes the benchmark insert procedure rather than only tuning the OceanBase client.
My main concern is benchmark correctness and fairness. For clients that declare
thread_safe=False, this PR auto-switches from the existing single-worker fallback to an N-process loader when load_concurrency > 1. That may be a useful direction, but it changes the benchmark contract: some DBs may load through threads, some through one connection, and some through multiple independent processes/connections. We need to be explicit about whether that is considered fair and comparable.
Also, I don’t think thread_safe=False automatically means process-safe. It only tells us one client/connection should not be shared across threads. Multiprocess loading additionally assumes that:
- the client object is safely picklable,
- each child process can reinitialize an independent connection,
- concurrent inserts into the same table are supported,
- failures do not leave partial-success benchmark results, and
- inserted row counts can be verified before optimize/search.
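The reinitialize-per-process requirement can be sketched in a few lines. This is illustrative only, not the PR's implementation: `FakeClient`, `load`, and the DSN are made-up placeholders. The point is that only a picklable config crosses the process boundary; each worker builds its own client in a `Pool` initializer.

```python
# Sketch (assumed names, not the repo's API): each worker process builds
# its own client in a Pool initializer, so the client object itself never
# needs to be picklable -- only its configuration does.
from multiprocessing import Pool

class FakeClient:
    """Stand-in for a DB client that must not be shared across processes."""
    def __init__(self, dsn: str):
        self.dsn = dsn  # connecting would happen here in a real client

    def insert(self, batch: list) -> int:
        return len(batch)  # pretend every row was written

_client = None  # process-local client, one per worker

def _init_worker(dsn: str) -> None:
    """Runs once in each child; reconnects independently per process."""
    global _client
    _client = FakeClient(dsn)

def _insert_batch(batch: list) -> int:
    return _client.insert(batch)

def load(batches, dsn: str, processes: int = 2) -> int:
    with Pool(processes, initializer=_init_worker, initargs=(dsn,)) as pool:
        return sum(pool.map(_insert_batch, batches))

if __name__ == "__main__":
    print(load([[1, 2, 3], [4, 5]], dsn="mysql://example", processes=2))
```

Whether this pattern is fair across clients is exactly the contract question raised above, since each process holds its own connection.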
There are also a few concrete correctness issues in the current runner implementation:
- `MultiprocessInsertRunner` can report partial load success. On the normal path it sends sentinels and waits a fixed join window, then terminates workers that are still alive. After that it returns `counter.value` without checking that `inserted == produced`. If the final batches take longer than the join window, the benchmark can continue to optimize/search a partially loaded table.
- `load_timeout` is not enforced while the producer is blocked on a full queue. The timeout check only runs after a successful enqueue. If workers hang or become very slow, `_put_with_interruptible_wait` can loop indefinitely as long as at least one worker is still alive.
- The timeout path calls `PerformanceTimeoutError(msg)`, but `PerformanceTimeoutError.__init__` currently takes no message argument. That raises `TypeError` instead of the intended timeout exception, so the case would be marked as a generic failure instead of taking the existing timeout/out-of-range path.
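The first two issues can be addressed with a deadline-aware producer loop and a final count check before declaring success. The sketch below is illustrative only; `LoadTimeoutError`, `produce`, and `finalize` are hypothetical names, not the PR's actual runner code.

```python
# Sketch of the two suggested fixes (assumed names, not the PR's code):
# 1) re-check the deadline on every short put() attempt, so a hung
#    consumer cannot block the producer past load_timeout;
# 2) refuse to report success unless inserted == produced.
import queue
import time

class LoadTimeoutError(Exception):
    """A timeout exception that accepts a message (unlike the reported
    PerformanceTimeoutError.__init__, which takes none)."""

def produce(batches, q: queue.Queue, deadline: float) -> int:
    produced = 0
    for batch in batches:
        while True:
            # Deadline is checked before each attempt, not only after a
            # successful enqueue.
            if time.monotonic() > deadline:
                raise LoadTimeoutError(
                    f"load exceeded deadline after {produced} batches"
                )
            try:
                q.put(batch, timeout=1.0)  # short, interruptible wait
                break
            except queue.Full:
                continue  # loop back and re-check the deadline
        produced += 1
    return produced

def finalize(produced_rows: int, inserted_rows: int) -> int:
    # Never continue to optimize/search on a partially loaded table.
    if inserted_rows != produced_rows:
        raise RuntimeError(
            f"partial load: inserted {inserted_rows} of {produced_rows} rows"
        )
    return inserted_rows
```

The same deadline check would also bound the sentinel/join phase, instead of a fixed join window followed by terminating live workers.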
Given this, could we split this PR?
My preferred path is:
- Keep this PR focused on the OceanBase client/config/CLI changes, which are relatively contained and can likely move forward quickly after smaller review comments are addressed.
- Move `MultiprocessInsertRunner`, `--load-processes`, and the auto-switch behavior into a follow-up PR. (I can take the follow-up from there: decide whether multiprocess insert should be opt-in, auto-enabled only for selected clients, or generalized as the default loader strategy, and add the tests/metadata needed to preserve benchmark fairness.)
I’m open to the multiprocess direction, but I think it is larger than the OceanBase client change and needs dedicated verification before becoming part of the shared benchmark execution path.
Hi @XuanYang-cn. Thanks for the thorough review and great suggestions! I've split this into two PRs as you recommended:
Happy to continue the discussion on the multi-process PR. Thanks again for the detailed feedback!
Hi team! 👋
This PR enhances the OceanBase client and adds a multi-process data loader to improve load throughput for SQL-based vector DBs.
Summary
- `MultiprocessInsertRunner` for parallel data loading across worker processes, bypassing the GIL limitation for SQL-based vector DBs
- `thread_safe=False` clients auto-switch to the multi-process loader when `--load-concurrency` > 1
- `--load-processes` CLI option for explicit multi-process control
- `--create-index-parallel` CLI option (default 16) for configurable index build parallelism
- `--extra-info-max-size` CLI option (default 32, set 0 to omit) for HNSW index
- `--partitions` CLI option for KEY partitioning (default 0, no partition)
- `concurrent_runner` log message for the `thread_safe=False` fallback
- `pyproject.toml`: add `pyyaml` dependency, fix `packages.find` to include all subpackages

We've been using these changes in our OceanBase benchmark testing and they've been working well. Would love to get this merged so it can benefit other users too. Thanks in advance for taking the time to review! Happy to address any feedback. 🙏
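For illustration, the options and defaults listed above could be modeled with stdlib `argparse`. This is not the project's actual CLI wiring (which may use a different framework); the parser below simply mirrors the names and defaults from the summary.

```python
# Illustrative parser only (assumed wiring, not the PR's code): the option
# names and defaults mirror the PR summary above.
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(prog="vectordbbench-example")
    p.add_argument("--load-processes", type=int, default=None,
                   help="worker processes for multi-process loading")
    p.add_argument("--create-index-parallel", type=int, default=16,
                   help="index build parallelism (default 16)")
    p.add_argument("--extra-info-max-size", type=int, default=32,
                   help="extra-info size for HNSW; 0 omits it (default 32)")
    p.add_argument("--partitions", type=int, default=0,
                   help="KEY partitions; 0 means no partitioning (default 0)")
    return p

if __name__ == "__main__":
    args = build_parser().parse_args(["--partitions", "4"])
    print(args.partitions)  # 4
```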