feat(oceanbase): add multi-process loader, configurable index/partition params, and HNSW_BQ cosine support #769
wyfanxiao wants to merge 2 commits into
Conversation
…W_BQ cosine support
- Add MultiprocessInsertRunner for parallel data loading across worker processes
- OceanBase: declare thread_safe=False, auto-switch to multi-process loader
- Add --load-processes CLI option for explicit multi-process control
- Add --create-index-parallel CLI option (default 16)
- Add --extra-info-max-size CLI option (default 32, set 0 to omit)
- Add --partitions CLI option for KEY partitioning (default 0, no partition)
- HNSW_BQ: remove forced L2 for cosine, now supports cosine natively
- Improve concurrent_runner log message for thread_safe=False fallback
- pyproject.toml: add pyyaml dependency, fix packages.find to include all subpackages
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: wyfanxiao. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Hi @XuanYang-cn. Friendly ping on this PR when you have a moment — happy to adjust based on feedback.
XuanYang-cn
left a comment
Thanks for the PR. The OceanBase-specific improvements look useful, especially the configurable index parallelism, partitions, extra_info_max_size, and the HNSW_BQ cosine update.
I do have concerns about merging the shared MultiprocessInsertRunner part as-is, because it changes the benchmark insert procedure rather than only tuning the OceanBase client.
My main concern is benchmark correctness and fairness. For clients that declare
thread_safe=False, this PR auto-switches from the existing single-worker fallback to an N-process loader when load_concurrency > 1. That may be a useful direction, but it changes the benchmark contract: some DBs may load through threads, some through one connection, and some through multiple independent processes/connections. We need to be explicit about whether that is considered fair and comparable.
Also, I don’t think thread_safe=False automatically means process-safe. It only tells us one client/connection should not be shared across threads. Multiprocess loading additionally assumes that:
- the client object is safely picklable,
- each child process can reinitialize an independent connection,
- concurrent inserts into the same table are supported,
- failures do not leave partial-success benchmark results, and
- inserted row counts can be verified before optimize/search.
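The reinitialize-per-process requirement can be sketched in a few lines. This is illustrative only, not the PR's implementation: `FakeClient`, `load`, and the DSN are made-up placeholders. The point is that only a picklable config crosses the process boundary; each worker builds its own client in a `Pool` initializer.

```python
# Sketch (assumed names, not the repo's API): each worker process builds
# its own client in a Pool initializer, so the client object itself never
# needs to be picklable -- only its configuration does.
from multiprocessing import Pool

class FakeClient:
    """Stand-in for a DB client that must not be shared across processes."""
    def __init__(self, dsn: str):
        self.dsn = dsn  # connecting would happen here in a real client

    def insert(self, batch: list) -> int:
        return len(batch)  # pretend every row was written

_client = None  # process-local client, one per worker

def _init_worker(dsn: str) -> None:
    """Runs once in each child; reconnects independently per process."""
    global _client
    _client = FakeClient(dsn)

def _insert_batch(batch: list) -> int:
    return _client.insert(batch)

def load(batches, dsn: str, processes: int = 2) -> int:
    with Pool(processes, initializer=_init_worker, initargs=(dsn,)) as pool:
        return sum(pool.map(_insert_batch, batches))

if __name__ == "__main__":
    print(load([[1, 2, 3], [4, 5]], dsn="mysql://example", processes=2))
```

Whether this pattern is fair across clients is exactly the contract question raised above, since each process holds its own connection.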
There are also a few concrete correctness issues in the current runner implementation:
- `MultiprocessInsertRunner` can report partial load success. On the normal path it sends sentinels and waits a fixed join window, then terminates workers that are still alive. After that it returns `counter.value` without checking that `inserted == produced`. If the final batches take longer than the join window, the benchmark can continue to optimize/search a partially loaded table.
- `load_timeout` is not enforced while the producer is blocked on a full queue. The timeout check only runs after a successful enqueue. If workers hang or become very slow, `_put_with_interruptible_wait` can loop indefinitely as long as at least one worker is still alive.
- The timeout path calls `PerformanceTimeoutError(msg)`, but `PerformanceTimeoutError.__init__` currently takes no message argument. That raises `TypeError` instead of the intended timeout exception, so the case would be marked as a generic failure instead of taking the existing timeout/out-of-range path.
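The first two issues can be addressed with a deadline-aware producer loop and a final count check before declaring success. The sketch below is illustrative only; `LoadTimeoutError`, `produce`, and `finalize` are hypothetical names, not the PR's actual runner code.

```python
# Sketch of the two suggested fixes (assumed names, not the PR's code):
# 1) re-check the deadline on every short put() attempt, so a hung
#    consumer cannot block the producer past load_timeout;
# 2) refuse to report success unless inserted == produced.
import queue
import time

class LoadTimeoutError(Exception):
    """A timeout exception that accepts a message (unlike the reported
    PerformanceTimeoutError.__init__, which takes none)."""

def produce(batches, q: queue.Queue, deadline: float) -> int:
    produced = 0
    for batch in batches:
        while True:
            # Deadline is checked before each attempt, not only after a
            # successful enqueue.
            if time.monotonic() > deadline:
                raise LoadTimeoutError(
                    f"load exceeded deadline after {produced} batches"
                )
            try:
                q.put(batch, timeout=1.0)  # short, interruptible wait
                break
            except queue.Full:
                continue  # loop back and re-check the deadline
        produced += 1
    return produced

def finalize(produced_rows: int, inserted_rows: int) -> int:
    # Never continue to optimize/search on a partially loaded table.
    if inserted_rows != produced_rows:
        raise RuntimeError(
            f"partial load: inserted {inserted_rows} of {produced_rows} rows"
        )
    return inserted_rows
```

The same deadline check would also bound the sentinel/join phase, instead of a fixed join window followed by terminating live workers.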
Given this, could we split this PR?
My preferred path is:
- Keep this PR focused on the OceanBase client/config/CLI changes, which are relatively contained and can likely move forward quickly after smaller review comments are addressed.
- Move `MultiprocessInsertRunner`, `--load-processes`, and the auto-switch behavior into a follow-up PR. (I can take the follow-up from there: decide whether multiprocess insert should be opt-in, auto-enabled only for selected clients, or generalized as the default loader strategy, and add the tests/metadata needed to preserve benchmark fairness.)
I’m open to the multiprocess direction, but I think it is larger than the OceanBase client change and needs dedicated verification before becoming part of the shared benchmark execution path.
Hi @XuanYang-cn. Thanks for the thorough review and great suggestions! I've split this into two PRs as you recommended:
Happy to continue the discussion on the multi-process PR. Thanks again for the detailed feedback!
Hi team! 👋
This PR enhances the OceanBase client and adds a multi-process data loader to improve load throughput for SQL-based vector DBs.
Summary
- `MultiprocessInsertRunner` for parallel data loading across worker processes, bypassing the GIL limitation for SQL-based vector DBs
- `thread_safe=False` clients auto-switch to the multi-process loader when `--load-concurrency` > 1
- `--load-processes` CLI option for explicit multi-process control
- `--create-index-parallel` CLI option (default 16) for configurable index build parallelism
- `--extra-info-max-size` CLI option (default 32, set 0 to omit) for HNSW index
- `--partitions` CLI option for KEY partitioning (default 0, no partition)
- `concurrent_runner` log message for the `thread_safe=False` fallback
- `pyproject.toml`: add `pyyaml` dependency, fix `packages.find` to include all subpackages

We've been using these changes in our OceanBase benchmark testing and they've been working well. Would love to get this merged so it can benefit other users too. Thanks in advance for taking the time to review! Happy to address any feedback. 🙏
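For illustration, the options and defaults listed above could be modeled with stdlib `argparse`. This is not the project's actual CLI wiring (which may use a different framework); the parser below simply mirrors the names and defaults from the summary.

```python
# Illustrative parser only (assumed wiring, not the PR's code): the option
# names and defaults mirror the PR summary above.
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(prog="vectordbbench-example")
    p.add_argument("--load-processes", type=int, default=None,
                   help="worker processes for multi-process loading")
    p.add_argument("--create-index-parallel", type=int, default=16,
                   help="index build parallelism (default 16)")
    p.add_argument("--extra-info-max-size", type=int, default=32,
                   help="extra-info size for HNSW; 0 omits it (default 32)")
    p.add_argument("--partitions", type=int, default=0,
                   help="KEY partitions; 0 means no partitioning (default 0)")
    return p

if __name__ == "__main__":
    args = build_parser().parse_args(["--partitions", "4"])
    print(args.partitions)  # 4
```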