optimize: Optimize batch query performance #2982
lokidundun wants to merge 7 commits into apache:master
Conversation
This CI failure is unrelated to the changes in this PR. The PR focuses on optimizing RocksDB batch query performance, and the failing build check does not involve the code modified here.

Already reran CI (you can also check that the tests pass in a local env).
Temporarily use super.queryByIds() instead of getByIds() for batch version support.

@imbajin could you please take another look when convenient ❤️
Pull request overview
This PR introduces batched backend fetching for queryVerticesByIds() to reduce overhead when querying many vertex ids (e.g., g.V(id1, id2, ...)) by splitting backend id lookups into multiple IdQuery requests.
Changes:
- Collect backend-only vertex ids during `queryVerticesByIds()` and issue backend queries in batches using `query.batch_size`.
- Add per-batch capacity checks before executing backend queries.
- Minor formatting adjustment in RocksDB table iterator helper signature.
Reviewed changes
Copilot reviewed 1 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| hugegraph-server/hugegraph-core/src/main/java/org/apache/hugegraph/backend/tx/GraphTransaction.java | Batch backend id queries in queryVerticesByIds() instead of building a single IdQuery for all backend ids. |
| hugegraph-server/hugegraph-rocksdb/src/main/java/org/apache/hugegraph/backend/store/rocksdb/RocksDBTable.java | Formatting-only change to method parameter indentation. |
```diff
 if (!backendIds.isEmpty()) {
-    // Query from backend store
-    query.mustSortByInput(false);
-    Iterator<HugeVertex> it = this.queryVerticesFromBackend(query);
-    QueryResults.fillMap(it, vertices);
+    final int batch = this.batchSize > 0 ? this.batchSize : backendIds.size();
+    for (int i = 0; i < backendIds.size(); i += batch) {
+        int end = Math.min(i + batch, backendIds.size());
+        IdQuery query = new IdQuery(type);
```
The PR title says this is a RocksDB optimization, but the functional change here is backend-agnostic batching in GraphTransaction (affecting all stores, especially RPC-based ones). Please align the PR title (or add RocksDB-specific changes) so the intent matches what’s actually being modified.
```diff
@@ -799,17 +800,30 @@ protected Iterator<Vertex> queryVerticesByIds(Object[] vertexIds, boolean adjace
         // Found from local tx
         vertices.put(vertex.id(), vertex);
     } else {
-        // Prepare to query from backend store
-        query.query(id);
+        // store the IDs queried from backend
+        backendIds.add(id);
     }
```
This change allocates backendIds as a second full list of ids (in addition to ids), which doubles memory usage for large g.V(id1,id2,...) calls (up to Query.DEFAULT_CAPACITY). Consider batching/issuing backend IdQuery requests incrementally during the main loop (flush when reaching batch size) to avoid retaining all backend ids at once.
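The incremental flushing this comment suggests can be sketched as follows. This is a minimal standalone model, not HugeGraph code: `IncrementalBatcher` and its `flush` callback are hypothetical stand-ins for accumulating backend-only ids and issuing one backend `IdQuery` per full batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Flush backend lookups incrementally instead of retaining a second full
// id list; 'flush' stands in for issuing one backend IdQuery.
class IncrementalBatcher<T> {
    private final int batchSize;
    private final Consumer<List<T>> flush;
    private final List<T> pending = new ArrayList<>();

    IncrementalBatcher(int batchSize, Consumer<List<T>> flush) {
        this.batchSize = batchSize;
        this.flush = flush;
    }

    // Add one backend-only id; flush as soon as the batch fills, so at most
    // batchSize ids are held at any time.
    void add(T id) {
        this.pending.add(id);
        if (this.pending.size() >= this.batchSize) {
            this.flushPending();
        }
    }

    // Flush the final partial batch after the main id loop finishes.
    void finish() {
        if (!this.pending.isEmpty()) {
            this.flushPending();
        }
    }

    private void flushPending() {
        this.flush.accept(new ArrayList<>(this.pending));
        this.pending.clear();
    }
}
```

With this shape, the main loop in `queryVerticesByIds()` would call `add(id)` where it currently collects into `backendIds`, and `finish()` once after the loop, keeping peak memory bounded by one batch.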
```java
final int batch = this.batchSize > 0 ? this.batchSize : backendIds.size();
for (int i = 0; i < backendIds.size(); i += batch) {
    int end = Math.min(i + batch, backendIds.size());
    IdQuery query = new IdQuery(type);
    for (int j = i; j < end; j++) {
        Id id = backendIds.get(j);
        query.query(id);
    }
```
With batching, duplicated ids that fall into different batches will trigger repeated backend reads/RPCs for the same id. You can keep the output behavior (duplicates preserved via ids) while deduplicating backend fetches (e.g., track a seen-set for backendIds or build per-batch unique ids) to avoid redundant backend queries.
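The dedup-while-preserving-output idea can be modelled outside HugeGraph like this. `fetchAll` and its `fetch` callback are hypothetical names standing in for the backend lookup; the point is that each distinct id hits the backend once, while the returned list keeps duplicates and input order, matching the documented `g.V(id1, id1, ...)` behavior.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class DedupFetch {
    // Fetch values for ids, hitting the backend ('fetch') at most once per
    // distinct id; the returned list still preserves duplicates and order.
    static <K, V> List<V> fetchAll(List<K> ids, Function<K, V> fetch) {
        Map<K, V> cache = new LinkedHashMap<>();
        for (K id : ids) {
            cache.computeIfAbsent(id, fetch);   // one backend call per distinct id
        }
        List<V> results = new ArrayList<>(ids.size());
        for (K id : ids) {
            results.add(cache.get(id));         // duplicates preserved in output
        }
        return results;
    }
}
```

In the PR, the `cache` role is effectively played by the `vertices` map keyed by id, so only the "send each distinct id to the backend once" part needs a seen-set.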
```java
query.mustSortByInput(false);
Iterator<HugeVertex> it = this.queryVerticesFromBackend(query);
QueryResults.fillMap(it, vertices);
final int batch = this.batchSize > 0 ? this.batchSize : backendIds.size();
```
`final int batch = this.batchSize > 0 ? this.batchSize : backendIds.size();` is effectively always `this.batchSize`, because `query.batch_size` is configured with a minimum of 1 (see `CoreOptions.QUERY_BATCH_SIZE`). Removing the dead fallback branch will simplify the code.
```diff
-final int batch = this.batchSize > 0 ? this.batchSize : backendIds.size();
+final int batch = this.batchSize;
```
```diff
 if (!backendIds.isEmpty()) {
-    // Query from backend store
-    query.mustSortByInput(false);
-    Iterator<HugeVertex> it = this.queryVerticesFromBackend(query);
-    QueryResults.fillMap(it, vertices);
+    final int batch = this.batchSize > 0 ? this.batchSize : backendIds.size();
+    for (int i = 0; i < backendIds.size(); i += batch) {
+        int end = Math.min(i + batch, backendIds.size());
+        IdQuery query = new IdQuery(type);
+        for (int j = i; j < end; j++) {
+            Id id = backendIds.get(j);
+            query.query(id);
+        }
+        // Single batch capacity check
+        Query.checkForceCapacity(query.idsSize());
+
+        // Query from backend store
+        query.mustSortByInput(false);
+        Iterator<HugeVertex> it = this.queryVerticesFromBackend(query);
+        QueryResults.fillMap(it, vertices);
+    }
 }
```
The new multi-batch path isn’t covered by tests. Please add a unit/integration test that exercises queryVerticesByIds() with vertexIds.length > query.batch_size, including (1) duplicates across a batch boundary and (2) mixed local-tx + backend ids, to ensure results and NotFoundException behavior remain unchanged.
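A test along these lines could be built on a small standalone model of the batch split before wiring it into the real HugeGraph harness. `BatchBoundaryCheck.split` is a hypothetical helper mirroring the loop in `queryVerticesByIds()`; the invariant checked is that batching never changes result order or drops duplicates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BatchBoundaryCheck {
    // Standalone model of the batch split in queryVerticesByIds():
    // ids are cut into chunks of at most batchSize, preserving input order.
    static <T> List<List<T>> split(List<T> ids, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += batchSize) {
            int end = Math.min(i + batchSize, ids.size());
            batches.add(new ArrayList<>(ids.subList(i, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        // Duplicate id "b" straddles a batch boundary with batchSize = 2
        List<String> ids = Arrays.asList("a", "b", "b", "c", "a");
        List<List<String>> batches = split(ids, 2);
        // Flattening the batches must reproduce the input, duplicates included
        List<String> flat = new ArrayList<>();
        batches.forEach(flat::addAll);
        if (!flat.equals(ids)) {
            throw new AssertionError("batching changed order or duplicates");
        }
    }
}
```

A real regression test would additionally mix local-tx ids with backend ids and assert that `NotFoundException` is still raised for missing ids when `checkMustExist` is set.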
```java
// NOTE: allowed duplicated vertices if query by duplicated ids
List<Id> ids = InsertionOrderUtil.newList();
Map<Id, HugeVertex> vertices = new HashMap<>(vertexIds.length);
Set<Id> fetchedIds = InsertionOrderUtil.newSet();
```
With `fetchedIds`, non-adjacent duplicate ids are now also deduplicated globally, whereas the old logic's `IdQuery.query()` only collapsed adjacent duplicate ids. The returned results should still preserve duplicates, but the actual backend access path has changed. Please add a regression test covering at least these combinations: more ids than `query.batch_size`, duplicate ids across a batch boundary, and a missing id with `checkMustExist`, so that future changes here don't silently drift the semantics.
```java
Map<Id, HugeVertex> vertices = new HashMap<>(vertexIds.length);
Set<Id> fetchedIds = InsertionOrderUtil.newSet();
IdQuery batchQuery = null;
final int batchSize = this.batchSize;
```
Once this batching lives in the generic GraphTransaction layer, it affects all backends, not just the RPC backends mentioned in issue #2674.
Take RocksDB as an example: its current queryByIds() still expands the query id by id and does not actually use multi-get, so forcibly splitting into multiple IdQuery batches by query.batch_size may only add extra query/iterator round trips. Consider pushing this batching strategy down into the specific backends, or at least gating it by feature/store type so it only applies to RPC backends like HBase/HStore, to avoid turning a targeted optimization into a global behavior change.
PS: as a follow-up we should make RocksDB use the native multi-get API (this should be an earlier TODO)

Purpose of the PR
Main Changes
This PR improves the performance of Gremlin queries like g.V(id1, id2, ...) when using RPC‑based backends such as HBase and HStore.
Previously, all vertex ids were either queried one by one or packed into a single large IdQuery, which led to many small RPC calls and poor latency in real production workloads.
Now, the ids to be queried from the backend are batched by QUERY_BATCH_SIZE, and each batch is issued as a separate IdQuery to the backend, significantly reducing RPC overhead while keeping behavior unchanged.
Method:
• `protected Iterator<Vertex> queryVerticesByIds(Object[] vertexIds, boolean adjacentVertex, boolean checkMustExist, HugeType type)`
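Stripped of HugeGraph types, the batching described above amounts to the following loop. This is a sketch: `queryInBatches` and its `executeBatch` callback are hypothetical stand-ins for building one `IdQuery` per chunk and querying the backend.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

class IdBatching {
    // Mirror of the loop added to queryVerticesByIds(): backend ids are
    // issued in chunks of batchSize; 'executeBatch' stands in for building
    // one IdQuery and querying the backend. Returns the number of backend
    // queries issued.
    static <T> int queryInBatches(List<T> backendIds, int batchSize,
                                  Consumer<List<T>> executeBatch) {
        int issued = 0;
        int batch = batchSize > 0 ? batchSize : backendIds.size();
        for (int i = 0; i < backendIds.size(); i += batch) {
            int end = Math.min(i + batch, backendIds.size());
            executeBatch.accept(backendIds.subList(i, end));
            issued++;
        }
        return issued;
    }

    public static void main(String[] args) {
        List<Integer> ids = Arrays.asList(1, 2, 3, 4, 5);
        int issued = queryInBatches(ids, 2, b -> System.out.println("batch: " + b));
        System.out.println("issued " + issued + " backend queries");
    }
}
```

With QUERY_BATCH_SIZE = 2 and five backend ids, three backend queries are issued instead of five single-id lookups or one oversized query.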
Verifying these changes
Does this PR potentially affect the following parts?
Documentation Status
- Doc - TODO
- Doc - Done
- Doc - No Need