
feat(seekdb): add SeekDB backend and HNSW benchmark support#770

Merged
XuanYang-cn merged 1 commit into
zilliztech:mainfrom
liuhao6741:main
May 11, 2026


@liuhao6741
Contributor

Summary

Introduce SeekDB as a first-class benchmark target: a MySQL-protocol vector database (OceanBase-style SQL and session variables). The integration supports standard performance workflows and StreamingPerformanceCase-style fixed-rate inserts by combining an upfront HNSW vector index with per-thread connection handling in the rate-based insert runner.

On each benchmark session, SeekDB.init() now applies OceanBase-style tenant system parameters (ALTER SYSTEM SET memory_limit = "0M" and ALTER SYSTEM SET cpu_count = 0) so that the engine does not cap the workload's memory or CPU while the connection is in use, matching operator guidance for unconstrained resource tests. memory_limit takes the engine-required string size form (e.g. "0M"), not a bare integer.
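The session setup can be pictured as an ordered list of statements issued right after connecting. The sketch below only builds that list and sends nothing to a server. The helper name `session_setup_statements` is hypothetical, and the statement text is taken from this description (including the HNSW session variables listed under seekdb.py), not from the merged code.

```python
from typing import List, Optional

# Query timeout quoted in this PR description, in microseconds (600 s).
OB_QUERY_TIMEOUT_US = 600_000_000


def session_setup_statements(ef_search: Optional[int] = None) -> List[str]:
    """Return the statements to run, in order, after connecting.

    Hypothetical helper: pass ef_search only for HNSW cases.
    """
    statements = [
        # memory_limit takes a string size such as "0M", not a bare integer.
        'ALTER SYSTEM SET memory_limit = "0M"',
        "ALTER SYSTEM SET cpu_count = 0",
    ]
    if ef_search is not None:
        # OceanBase-style session variables, applied for HNSW cases only.
        statements.append(f"SET ob_hnsw_ef_search={int(ef_search)}")
        statements.append(f"SET ob_query_timeout={OB_QUERY_TIMEOUT_US}")
    return statements
```

A caller would presumably execute each returned string on the freshly opened cursor, in order.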

After bulk load, optimize() runs SELECT VERSION(), parses the embedded seekdb-vX.Y.Z… token (e.g. "5.7.25-OceanBase seekdb-v1.3.0.0"), and when the parsed version is >= 1.3.0 executes CALL dbms_index_manager.refresh() to align index metadata; older or unrecognized version strings skip the call.
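The version gate can be illustrated with a small, self-contained sketch. The function names and the exact regular expression are assumptions made for illustration. Only the `seekdb-v` token, the sample version string, and the 1.3.0 threshold come from this description.

```python
import re
from typing import Optional, Tuple

# Matches the "seekdb-vX.Y.Z" token embedded in the SELECT VERSION() string.
_SEEKDB_VERSION_RE = re.compile(r"seekdb-v(\d+)\.(\d+)\.(\d+)")
_MIN_REFRESH_VERSION = (1, 3, 0)


def parse_seekdb_version(version_string: str) -> Optional[Tuple[int, int, int]]:
    """Extract (major, minor, patch), or None if no seekdb-v token is present."""
    match = _SEEKDB_VERSION_RE.search(version_string)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch)


def should_refresh_index(version_string: str) -> bool:
    """True when the parsed version is >= 1.3.0; unrecognized strings skip the call."""
    version = parse_seekdb_version(version_string)
    return version is not None and version >= _MIN_REFRESH_VERSION
```

Comparing integer tuples avoids the pitfall of comparing version strings lexically, where "1.10.0" would sort before "1.3.0".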

New files (vectordb_bench/backend/clients/seekdb/)

  • seekdb.py — SeekDB VectorDB implementation

    • Connects via mysql.connector using host/port/user/password/database from SeekDBConfig (Pydantic SecretStr surfaced as plain strings in a TypedDict for the driver).
    • Declares thread_safe=False so ConcurrentInsertRunner does not multiplex multiple threads on one client (mysql.connector connections are not thread-safe). Streaming inserts instead rely on rate_runner’s per-task shallow copy + init() pattern (see below).
    • Lifecycle: __init__ connects temporarily, drops and recreates the table and creates the vector index when drop_old is true, then disconnects.
    • init() context manager reconnects, sets autocommit=1, then applies system parameters for reproducible benchmarking under default tenant limits:
      • ALTER SYSTEM SET memory_limit = "0M"
      • ALTER SYSTEM SET cpu_count = 0
    • When the case uses HNSW, init() then sets the search session parameters:
      • SET ob_hnsw_ef_search=<ef_search>: SeekDB uses OceanBase-style names, not a generic hnsw_ef_search variable.
      • SET ob_query_timeout=600000000: raises the default query timeout (microseconds) so ANN queries under concurrent load are less likely to hit short default tenant limits (e.g. 10s).
    • Schema: CREATE TABLE … ORGANIZATION HEAP with INT id and VECTOR(dim) embedding column, as required for vector workloads on this engine.
    • Index: CREATE VECTOR INDEX … WITH (distance=cosine|l2|inner_product, type=HNSW, m=…, ef_construction=…) so the HNSW structure exists before bulk or streaming load (mirrors Milvus-style “index then insert” for concurrent read/write phases).
    • optimize(): SELECT VERSION(), parse seekdb-v… with regex, compare tuple to (1, 3, 0); if >= 1.3.0 run CALL dbms_index_manager.refresh(), else log and return. Requires an active cursor (task_runner calls optimize inside db.init()).
    • insert_embeddings: batched multi-row INSERT with vector literals; default batch size 256 (SEEKDB_DEFAULT_LOAD_BATCH_SIZE).
    • prepare_filter / search_embedding: NonFilter, numeric >= on id, and string equality on id; search uses ORDER BY <metric_func>(embedding, query) APPROXIMATE LIMIT k with metric function names aligned to MetricType (cosine_distance, l2_distance, negative_inner_product).
  • config.py — SeekDBConfig (DBConfig), SeekDBHNSWConfig (DBCaseConfig)

    • HNSW index_param() drives CREATE VECTOR INDEX WITH clause.
    • search_param() exposes ef_search for session SET ob_hnsw_ef_search.
    • _seekdb_case_config maps IndexType.HNSW to SeekDBHNSWConfig for DB enum integration.
  • cli.py — Click command SeekDBHNSW (registered as seekdbhnsw)

    • SeekDBTypedDict extends CommonTypedDict with --host, --user, --password (default SEEKDB_PASSWORD env or empty), --database, --port.
    • SeekDBHNSWTypedDict adds HNSWFlavor3 (--m, --ef-construction, --ef-search).
    • Invokes shared cli.run() with DB.SeekDB so behavior matches other databases (no SeekDB-specific streaming CLI flags in the shared framework).
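The SQL shapes described for seekdb.py (index creation, batched multi-row inserts, and approximate search) can be sketched as plain string builders. This is an illustration under stated assumptions, not the merged implementation: every function name and the table and index names are hypothetical, the `ON table(embedding)` part of the index statement fills the portion the description elides and is a guess, and a real client should validate identifiers or use the driver's parameter binding.

```python
from typing import List, Sequence

# Distance option -> metric function, as listed in the PR description.
DISTANCE_FUNCS = {
    "cosine": "cosine_distance",
    "l2": "l2_distance",
    "inner_product": "negative_inner_product",
}
DEFAULT_LOAD_BATCH_SIZE = 256  # mirrors SEEKDB_DEFAULT_LOAD_BATCH_SIZE


def vector_literal(vec: Sequence[float]) -> str:
    """Render a vector as a bracketed, comma-separated literal."""
    return "[" + ",".join(str(float(x)) for x in vec) + "]"


def create_index_sql(table: str, index: str, distance: str,
                     m: int, ef_construction: int) -> str:
    """Build the HNSW index statement (the ON clause is an assumption)."""
    if distance not in DISTANCE_FUNCS:
        raise ValueError(f"unsupported distance: {distance}")
    return (
        f"CREATE VECTOR INDEX {index} ON {table}(embedding) "
        f"WITH (distance={distance}, type=HNSW, "
        f"m={int(m)}, ef_construction={int(ef_construction)})"
    )


def batched_insert_sql(table: str, ids: Sequence[int],
                       vectors: Sequence[Sequence[float]],
                       batch_size: int = DEFAULT_LOAD_BATCH_SIZE) -> List[str]:
    """Split rows into multi-row INSERT statements of at most batch_size rows."""
    rows = [f"({int(i)}, '{vector_literal(v)}')" for i, v in zip(ids, vectors)]
    return [
        f"INSERT INTO {table} (id, embedding) VALUES "
        + ", ".join(rows[start:start + batch_size])
        for start in range(0, len(rows), batch_size)
    ]


def search_sql(table: str, distance: str, query: Sequence[float], k: int) -> str:
    """Build the ORDER BY <metric_func>(embedding, query) APPROXIMATE LIMIT k query."""
    func = DISTANCE_FUNCS[distance]
    return (
        f"SELECT id FROM {table} "
        f"ORDER BY {func}(embedding, '{vector_literal(query)}') "
        f"APPROXIMATE LIMIT {int(k)}"
    )
```

Keeping the distance-to-function mapping in one dictionary means the index option and the search function cannot drift apart for a given metric.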

Plumbing

  • vectordb_bench/backend/clients/__init__.py

    • DB.SeekDB enum value.
    • DB.init_cls branch returning SeekDB client class.
    • DB.config_cls branch returning SeekDBConfig.
    • DB.case_config_cls branch using _seekdb_case_config for HNSW.
  • vectordb_bench/cli/vectordbbench.py

    • Import and cli.add_command(SeekDBHNSW).

Rate runner (StreamingPerformanceCase / fixed-rate inserts)

  • vectordb_bench/backend/runner/rate_runner.py
    • Import shallow copy() in addition to deepcopy().
    • New branch for db.name == "SeekDB": build a thread-local client with copy(db), clear _conn and _cursor so the new object does not share live sockets, then with db_copy.init(): _insert_embeddings(...).
    • Rationale: (1) deepcopy fails or is unsafe on open mysql.connector sockets when worker processes are spawned; (2) sharing one connection across threads violates connector thread-safety and caused streaming insert issues.
    • Other databases keep existing deepcopy or direct-insert behavior.
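The copy-then-reconnect pattern can be demonstrated without a database. In the sketch below, `FakeConnection`, `FakeClient`, `insert_in_worker`, and `run_workers` are all hypothetical stand-ins. Only the steps (shallow `copy`, clearing `_conn` and `_cursor`, then inserting inside `init()`) follow the description above.

```python
import threading
from contextlib import contextmanager
from copy import copy


class FakeConnection:
    """Stand-in for a driver connection that must not be shared across threads."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class FakeClient:
    """Hypothetical client exposing the _conn/_cursor attributes named above."""

    def __init__(self, table):
        self.table = table
        self._conn = None
        self._cursor = None

    @contextmanager
    def init(self):
        # Open a connection owned by this object only, and always release it.
        self._conn = FakeConnection()
        self._cursor = object()
        try:
            yield
        finally:
            self._conn.close()
            self._conn = None
            self._cursor = None

    def insert_embeddings(self, ids):
        if self._conn is None:
            raise RuntimeError("insert called outside init()")
        return len(ids)


def insert_in_worker(db, ids, results):
    db_copy = copy(db)       # shallow copy: config is shared, nothing is deep-copied
    db_copy._conn = None     # drop references to the parent's live connection
    db_copy._cursor = None
    with db_copy.init():     # each worker opens and closes its own connection
        results.append(db_copy.insert_embeddings(ids))


def run_workers(db, batches):
    results = []
    threads = [threading.Thread(target=insert_in_worker, args=(db, batch, results))
               for batch in batches]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because the copy is shallow and its connection attributes are reassigned rather than mutated, the parent object's own connection is left open and untouched while the workers run.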

Documentation for operators

Example invocation (Python 3.11+ recommended for this repo):

python -m vectordb_bench.cli.vectordbbench seekdbhnsw \
  --case-type StreamingPerformanceCase \
  --host <host> --port 2881 --user root --password '' \
  --database vectordbbench \
  --m 16 --ef-construction 200 --ef-search 64

Ensure the target database exists and mysql-connector-python is installed. The SeekDB user must be allowed to execute ALTER SYSTEM if init-time tuning is required; otherwise connection setup may fail at init().

Notes / non-goals

  • Only HNSW is registered in _seekdb_case_config; other index types are not wired for SeekDB in this change set.
  • No changes to shared cli.py CommonTypedDict or get_custom_case_config for streaming-specific knobs; StreamingPerformanceCase uses framework defaults when launched from CLI.

@liuhao6741
Contributor Author

Hi @XuanYang-cn, I have fixed the lint error. Can you approve the workflow and review the PR? Thank you very much.

@liuhao6741
Contributor Author

Hi @XuanYang-cn, I have fixed the lint error. Can you approve the workflow and review the PR? Thank you very much.

Sorry, I made a typo. I have fixed it.

Both checks pass:

make lint ✅ black + ruff all pass
make unittest ✅ 1 passed (downloading the dataset took about 6 minutes; it will be cached afterwards)

Collaborator

@XuanYang-cn XuanYang-cn left a comment


Requesting changes for the label-filter correctness issue and the missing optional dependency.

Comment thread vectordb_bench/backend/clients/seekdb/seekdb.py Outdated
Comment thread vectordb_bench/backend/clients/seekdb/seekdb.py
Add a new vector database backend for SeekDB, connecting via
mysql-connector-python over the MySQL wire protocol.

Key components:
- seekdb.py: VectorDB implementation with heap-organized table,
  HNSW vector index, and version-aware optimize() that calls
  dbms_index_manager.refresh() on SeekDB >= 1.3.0
- config.py: DBConfig with host/port/user/password/database and
  SeekDBHNSWConfig with m/ef_construction/ef_search parameters
- cli.py: Click command `SeekDBHNSW` for command-line benchmarks

Registration:
- Add SeekDB to the DB enum in backend/clients/__init__.py with
  lazy imports for init_cls, config_cls, and case_config_cls
- Register SeekDBHNSW CLI command in cli/vectordbbench.py
- Add seekdb optional dependency in pyproject.toml
  (pip install vectordb-bench[seekdb])

Filter support:
- NonFilter and NumGE (id >= N) filters are supported
- StrEqual (label filter) is intentionally excluded since the
  table schema only has id and embedding columns

Thread safety:
- mysql.connector is not thread-safe (thread_safe = False).
  ConcurrentInsertRunner uses max_workers=1 accordingly
- rate_runner.py handles SeekDB specially: copies the db object,
  resets the connection, and calls init() per worker thread

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
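
The filter support described in this commit message can be sketched as a small WHERE-clause builder. The function name and the filter-kind strings (`"none"`, `"num_ge"`) are hypothetical. The real client works with the benchmark framework's filter types, which are not reproduced here.

```python
from typing import Optional


def where_clause(filter_kind: str, value: Optional[int] = None) -> str:
    """Return the WHERE fragment for a supported filter (hypothetical helper)."""
    if filter_kind == "none":
        # NonFilter: search the whole table.
        return ""
    if filter_kind == "num_ge":
        # NumGE: numeric lower bound on the id column.
        if value is None:
            raise ValueError("num_ge requires a value")
        return f"WHERE id >= {int(value)}"
    # Label (string-equality) filters are rejected: the table has only
    # id and embedding columns, so there is no label column to match on.
    raise NotImplementedError(f"unsupported filter: {filter_kind}")
```

Raising on unsupported filters, instead of silently matching on id, keeps a label-filter case from reporting results for a query it did not actually run.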
@liuhao6741
Contributor Author

Requesting changes for the label-filter correctness issue and the missing optional dependency.

Hi @XuanYang-cn, I have fixed the two issues. Please review it again. Thanks very much.

@sre-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liuhao6741, XuanYang-cn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@XuanYang-cn XuanYang-cn merged commit c2a6f85 into zilliztech:main May 11, 2026
4 checks passed

4 participants