Add HNSW Layered Index Support#2148
Conversation
- Add base node IDs for sequential access. - Scattered writes happen only in deserialization step using host memory.
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
| } else if (conf.at("hierarchy") == "gpu_layered_on_disk" || | ||
| conf.at("hierarchy") == "gpu_layered" || conf.at("hierarchy") == "layered") { | ||
| hnsw_params.hierarchy = cuvs::neighbors::hnsw::HnswHierarchy::GPU_LAYERED_ON_DISK; |
There was a problem hiding this comment.
Are these all just synonyms for GPU_LAYERED_ON_DISK?
There was a problem hiding this comment.
Yes, I've removed them to better align with the other formats.
| "Layered HNSW artifact '%s' does not exist.", | ||
| src_artifact.c_str()); | ||
|
|
||
| copy_file_overwrite(src_artifact, std::filesystem::path(file)); |
There was a problem hiding this comment.
Why do we copy here instead of moving (like in the cagra_ace block below?)
There was a problem hiding this comment.
Good catch, thanks. I've changed this and reuse the helper in the cagra_ace block.
| inline auto json_parse_double(const std::string& json, const std::string& key) -> double | ||
| { | ||
| auto pos = json_find_key(json, key); | ||
| auto colon = json.find(':', pos); | ||
| RAFT_EXPECTS(colon != std::string::npos, "Malformed JSON near key '%s'", key.c_str()); | ||
| auto begin = json.find_first_of("0123456789-.", colon + 1); | ||
| auto end = json.find_first_not_of("0123456789-.eE+", begin); | ||
| RAFT_EXPECTS(begin != std::string::npos, "Malformed double JSON value for key '%s'", key.c_str()); | ||
| return std::stod(json.substr(begin, end - begin)); | ||
| } | ||
|
|
||
| inline auto json_parse_layer_field(const std::string& object, const std::string& key) -> size_t | ||
| { | ||
| return json_parse_size(object, key); | ||
| } | ||
|
|
There was a problem hiding this comment.
If we use JSON here - why don't we use a common json utility for parsing? This whole section feels very verbose for a standard. Is there a cleaner alternative?
There was a problem hiding this comment.
This would pull the JSON dependency (nlohmann/json) into libcuvs. It is currently only used in the benchmarking. I've changed to a binary header format instead given the simplicity. Let me know what you think.
| const auto current_node_bytes = current_batch_size * sizeof(IdxT); | ||
| const auto current_link_bytes = current_batch_size * base_link_row_bytes; | ||
| cuvs::util::write_large_file(output_fd, | ||
| base_node_buffer.data(), | ||
| current_node_bytes, | ||
| base_nodes_offset + source_start * sizeof(IdxT)); | ||
| cuvs::util::write_large_file(output_fd, | ||
| base_link_buffer.data(), | ||
| current_link_bytes, | ||
| base_links_offset + source_start * base_link_row_bytes); |
There was a problem hiding this comment.
I see that the graph connections / links are shifted to reflect the original ordering, but I don't see the rows/nodes themself being reordered. Is this intended?
There was a problem hiding this comment.
Yes. The idea is to write sequentially and read scatter into memory. I've added a comment.
| template <typename T, typename IdxT, typename Callback> | ||
| void build_hnsw_upper_layer_graphs( | ||
| raft::resources const& res, | ||
| raft::host_matrix_view<const T, int64_t, raft::row_major> promoted_dataset, | ||
| const hnsw_level_plan& plan, | ||
| size_t M, | ||
| cuvs::distance::DistanceType metric, | ||
| Callback&& callback) |
There was a problem hiding this comment.
Good to have this extracted as utility.
| for (int64_t batch_idx = 0; batch_idx < static_cast<int64_t>(current_batch_size); | ||
| ++batch_idx) { | ||
| const auto row = batch_start + static_cast<size_t>(batch_idx); | ||
| node_buffer[batch_idx] = static_cast<IdxT>(hierarchy.order[start_idx + row]); | ||
| auto* link_row = link_buffer.data() + batch_idx * metadata.upper_link_row_bytes; | ||
| hnswlib::linklistsizeint list_count = | ||
| static_cast<hnswlib::linklistsizeint>(layer.degree); | ||
| std::memcpy(link_row, &list_count, sizeof(list_count)); | ||
| auto* dst = reinterpret_cast<IdxT*>(link_row + sizeof(hnswlib::linklistsizeint)); | ||
| if (layer.degree > 0) { | ||
| auto* src = host_neighbors.data_handle() + row * layer.degree; | ||
| for (size_t j = 0; j < layer.degree; ++j) { | ||
| dst[j] = static_cast<IdxT>(hierarchy.order[src[j] + start_idx]); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Ok, here we seem to write in in both the modified data order (not original) and by level connections instead of by row. I assume this gets re-ordered upon deserialize.
There was a problem hiding this comment.
Yes. I've added a comment.
| auto ll0 = appr_algo->get_linklist0(node_id); | ||
| memcpy(ll0, | ||
| base_link_buffer.data() + batch_idx * metadata.base_link_row_bytes, | ||
| metadata.base_link_row_bytes); |
There was a problem hiding this comment.
Ah ok - this is where we re-order upon insertion.
| auto ll = appr_algo->get_linklist(node_id, layer.level); | ||
| memcpy(ll, | ||
| link_buffer.data() + batch_idx * metadata.upper_link_row_bytes, | ||
| metadata.upper_link_row_bytes); | ||
| } |
There was a problem hiding this comment.
I see, this way we can re-arrange the levels upon load as well. I guess this works as long as we can provide our own deserialize-loader and don't have to mimic the serialized file structure of the original format.
But is this really the use-case? Do we eventually need to support a disk-layered+dataset->disk conversion?
There was a problem hiding this comment.
Yes, this minimizes the I/O for reading and constructing the HNSW index in memory. However, the same approach can be used for the memory bound case where we need to support the disk-layered+dataset->disk conversion you've mentioned. I'm working on a follow-up PR that enables it. The idea is to reorder the base links to original ID space and interleave the dataset vectors by either:
- Simple scatter if we have enough host memory.
- File-baked mmap: probably very slow.
- Write
[id][link row]records to temporary files in buckets. Replay each bucket and store in small per-bucket scatter buffer. Thus, the base section needs to be written and re-read but disk I/O is sequential. This seems to be the most promising approach when memory constrained. Let me know if you have other ideas please.
There was a problem hiding this comment.
I've added #2241 on top of this that implements strategies (1) and (3).
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR adds GPU_LAYERED_ON_DISK support for HNSW, including public API changes, layered artifact serialization and deserialization, benchmark save handling, ACE build updates, tests, and a new example. ChangesLayered HNSW disk-backed flow
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
- hnsw.hpp: adopt upstream's serialize_to_hnswlib_batched / from_inmem core and auto-selecting build(); re-apply layered helpers, serialize_to_layered_hnsw_from_disk, deserialize_layered_hnsw, and GPU_LAYERED_ON_DISK routing in from_cagra/serialize/ deserialize/build. Tighten the in-memory spill guard to NONE/GPU only. - cagra_build.cuh: take upstream's refined ACE memory-limit clamping and logging. - bench: adopt upstream hnsw_params.M; forward ace_* overrides via graph_build_params. - ACE tests: keep both the layered and in-memory disk-spill test cases. - examples: keep both HNSW_ACE_LAYERED and HNSW_OPENAI examples.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cpp/src/neighbors/detail/hnsw.hpp (2)
2396-2405: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winHIGH: Validate
enterpoint_nodebefore storing it in hnswlib state.Have you considered range-checking
metadata.enterpoint_nodeagainst[0, n_rows)during deserialize? A malformed layered artifact can currently survive load with an invalid entrypoint and then walk hnswlib with an out-of-bounds internal id on the first search.Suggested fix
RAFT_EXPECTS(metadata.n_rows > 0, "Layered HNSW artifact must contain at least one row"); RAFT_EXPECTS(metadata.dim > 0, "Layered HNSW artifact must contain at least one dimension"); + RAFT_EXPECTS(metadata.enterpoint_node >= 0 && + static_cast<size_t>(metadata.enterpoint_node) < metadata.n_rows, + "Layered HNSW artifact enterpoint_node (%d) is outside [0, %zu)", + metadata.enterpoint_node, + metadata.n_rows); RAFT_EXPECTS(static_cast<size_t>(dim) == metadata.dim, "Layered HNSW artifact dim (%zu) does not match requested dim (%d)", metadata.dim, dim);As per coding guidelines,
Input validation must check for negative or invalid dimensions, null pointers, and invalid parameter combinations before GPU operations.Also applies to: 2499-2504
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/neighbors/detail/hnsw.hpp` around lines 2396 - 2405, Add a deserialize-time validation for metadata.enterpoint_node in the layered HNSW load path before it is written into hnswlib state. In the same checks near the existing RAFT_EXPECTS guards in the layered HNSW deserialization logic, verify that enterpoint_node is within [0, metadata.n_rows) and reject malformed artifacts early, using the relevant deserialize/load helper and the hnswlib state initialization code that stores the entrypoint.Source: Coding guidelines
727-735: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winHIGH: Reject
.npydatasets whose dtype or memory order does not match the layered artifact.Have you considered carrying the
.npyheader’s dtype/layout throughopen_npy_file()and validating it here? Right now only the shape is checked, so a same-shaped but wrong-typed or Fortran-ordered dataset will be streamed as row-majorTbytes and silently corrupt the reconstructed index. As per coding guidelines,Data layout (row-major vs column-major) must be explicitly verified and handled in memory accessandData format parameters (row-major vs column-major, memory layout) must be explicit in function signatures and documentation; ambiguous data layout assumptions should be clarified or eliminated.Also applies to: 2428-2459
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/neighbors/detail/hnsw.hpp` around lines 727 - 735, open_npy_file() currently returns only the shape, so layered_artifact::load_from_npz() can accept same-shaped but wrong-typed or Fortran-ordered .npy data and corrupt the index. Update open_npy_file() to carry the numpy header’s dtype and layout/stride metadata alongside shape, then validate those fields in the layered artifact loading path before streaming bytes into T. Use the existing symbols open_npy_file(), npy_file, and layered_artifact::load_from_npz() to enforce that only the expected row-major dtype/layout is accepted, and reject mismatches early.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@cpp/src/neighbors/detail/hnsw.hpp`:
- Around line 2396-2405: Add a deserialize-time validation for
metadata.enterpoint_node in the layered HNSW load path before it is written into
hnswlib state. In the same checks near the existing RAFT_EXPECTS guards in the
layered HNSW deserialization logic, verify that enterpoint_node is within [0,
metadata.n_rows) and reject malformed artifacts early, using the relevant
deserialize/load helper and the hnswlib state initialization code that stores
the entrypoint.
- Around line 727-735: open_npy_file() currently returns only the shape, so
layered_artifact::load_from_npz() can accept same-shaped but wrong-typed or
Fortran-ordered .npy data and corrupt the index. Update open_npy_file() to carry
the numpy header’s dtype and layout/stride metadata alongside shape, then
validate those fields in the layered artifact loading path before streaming
bytes into T. Use the existing symbols open_npy_file(), npy_file, and
layered_artifact::load_from_npz() to enforce that only the expected row-major
dtype/layout is accepted, and reject mismatches early.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b15a3474-efb8-43fe-8b06-6a6283890e30
📒 Files selected for processing (10)
cpp/bench/ann/src/cuvs/cuvs_cagra_hnswlib.cucpp/bench/ann/src/cuvs/cuvs_cagra_hnswlib_wrapper.hcpp/src/neighbors/detail/cagra/cagra_build.cuhcpp/src/neighbors/detail/hnsw.hppcpp/tests/neighbors/ann_hnsw_ace.cuhcpp/tests/neighbors/ann_hnsw_ace/test_float_uint32_t.cucpp/tests/neighbors/ann_hnsw_ace/test_half_uint32_t.cucpp/tests/neighbors/ann_hnsw_ace/test_int8_t_uint32_t.cucpp/tests/neighbors/ann_hnsw_ace/test_uint8_t_uint32_t.cuexamples/cpp/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (6)
- cpp/tests/neighbors/ann_hnsw_ace/test_float_uint32_t.cu
- examples/cpp/CMakeLists.txt
- cpp/bench/ann/src/cuvs/cuvs_cagra_hnswlib.cu
- cpp/tests/neighbors/ann_hnsw_ace.cuh
- cpp/bench/ann/src/cuvs/cuvs_cagra_hnswlib_wrapper.h
- cpp/src/neighbors/detail/cagra/cagra_build.cuh
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/include/cuvs/util/numpy_dtype.hpp`:
- Around line 24-35: The public API in numpy_dtype.hpp is missing required
Doxygen coverage, so add complete Doxygen comments for numpy_dtype_string,
make_numpy_header_from_dtype, and make_numpy_header_string, including parameter
descriptions, return values, and any side effects/contract details; place the
docs directly above the function declarations and ensure the public header
exposes the serialized-header behavior clearly for downstream users.
- Around line 24-35: These new public helpers expose stateful STL types in the
cuVS API, so either move numpy_dtype_string and make_numpy_header_from_dtype out
of the public header or change their interfaces to use stateless inputs instead
of std::string and std::vector<size_t>. Update the API surface in
cpp/include/cuvs/util/numpy_dtype.hpp and the related helpers around the same
area to rely on POD-style parameters, raft::resources, pointers, or
mdspan-compatible inputs so the public-header contract is preserved.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 80b27030-35d3-4c2c-9a98-ec15b4faa45a
📒 Files selected for processing (10)
cpp/include/cuvs/util/file_io.hppcpp/include/cuvs/util/numpy_dtype.hppcpp/src/neighbors/brute_force_serialize.cucpp/src/neighbors/detail/cagra/cagra_serialize.cuhcpp/src/neighbors/detail/hnsw.hppcpp/src/neighbors/ivf_flat/ivf_flat_serialize.cuhcpp/src/neighbors/ivf_sq/ivf_sq_serialize.cuhcpp/src/neighbors/mg/snmg.cuhcpp/src/util/serialize_validation.hppexamples/cpp/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (2)
- examples/cpp/CMakeLists.txt
- cpp/src/neighbors/detail/hnsw.hpp
…ed-index # Conflicts: # cpp/include/cuvs/util/file_io.hpp # cpp/src/neighbors/brute_force_serialize.cu # cpp/src/neighbors/detail/cagra/cagra_serialize.cuh # cpp/src/neighbors/ivf_flat/ivf_flat_serialize.cuh # cpp/src/neighbors/mg/snmg.cuh
The CAGRA graph built by the disk-backed ACE algorithm partitions the dataset. Thus, the CAGRA graph uses the reordered index space. Building a HNSW index using
hnsw::from_cagrauses the reordered dataset and CAGRA graph. Downstream consumers building an HNSW index would therefore require the reordered dataset, which is typically large when requiring the disk-backed ACE algorithm. Thus, building only the layers of the HNSW index without the dataset and moving this to the search node can minimize the network transfers for downstream consumers if they have the original dataset locally available. Thehnsw::deserializestep then takes the layered index and combines it with the local dataset to form a hnswlib compatible search index.Artifact Layout
Layered HNSW Serialization
The layered serializer creates
hnsw_index.cuvsfrom the disk-backed ACE graph..cuvsfile and write the fixed header and metadata.levelssequentially.dataset_mapping.npysequentially intoreordered_to_original.cagra_graph.npysource-sequentially in ACE reordered row order.base_nodes[row] = reordered_to_original[ace_reordered_row]base_links[row]upper_nodesandupper_linkswith node IDs and neighbor IDs converted back to original IDs.This keeps remapping, link padding, and upper-layer KNN work on the build node.
Deserialization
The search node reads:
hnsw_index.cuvsindex_params.dataset_pathThe loader:
levelssequentially.levels[original_id]base_nodesandbase_linkssequentially.get_linklist0(base_node_id).upper_nodesandupper_linkssequentially by layer.get_linklist(node_id, level).The search node does no graph remapping, no level generation, no link padding, and no KNN work.
Disk Access Patterns
Build node:
reordered_dataset.npyandaugmented_dataset.npy.cagra_graph.npywhen creating the final layered artifact.hnsw_index.cuvs.Search node:
hnsw_index.cuvs.Runtime Requirements
Only
hnsw_index.cuvsis copied to the search node. ACE temporary files remain build-node-only.The search node must have the original dataset in original row order and must provide that path through
index_params.dataset_path.Misc
Unifies the logging format of the ACE algorithm.