feat: support core Lumina index build#347
Conversation
leaves12138
left a comment
There was a problem hiding this comment.
Thanks for the work here. The overall shape looks good to me: the builder validates the table/index prerequisites up front, plans shards by contiguous row-id ranges, reads with _ROW_ID and strict vector validation, and the index-only commit path preserves row tracking while adding overlap checks for global index files.
A couple of follow-ups I think are worth addressing/clarifying before merge:
-
Snapshot/concurrency semantics.
LuminaIndexBuildBuilder::executeplans and extracts vectors from the latest snapshot it observed, then commits index-only messages throughTableCommit, which resolves against whatever snapshot is latest at commit time. The commit-side overlap check protects against another global-index commit, but it does not seem to verify that the underlying data snapshot/ranges are still the same. If a concurrent data commit rewrites/removes files for the same row ranges before the index-only commit lands, can the new index manifest be attached to a newer snapshot whose data no longer matches the built Lumina files? If this is intentionally safe because these tables are append-only + data-evolution row ranges are immutable, could you add a short comment/test or otherwise document that assumption? Otherwise we may need to carry the source snapshot id into the commit and reject conflicting data changes. -
Temporary file cleanup on failure.
build_index_fileremoves the local temp file only aftercopy_local_file_to_outputsucceeds. Ifpretrain,insert,dump, or the copy fails, the temp file can be left behind. This is minor, but a scope guard or cleanup-on-error helper would make repeated failed builds safer.
Also noted that the only full execute test is ignored because it needs the native Lumina library. That is understandable, but if we cannot enable it in CI, it would be helpful to keep as much non-native coverage as possible around the metadata/index-only commit path and to document how the ignored test is expected to be run manually.
@leaves12138 Thanks for the careful review. I addressed the follow-ups in the latest local changes:
|
Purpose
Rust already supports reading Lumina global indexes, but it did not have a core path to build Lumina index files from table data and commit them into the index manifest. Index-only commits were also skipped, so the generated index metadata could not create a new snapshot.
Brief change log
LuminaIndexBuildBuilderonTable.global-index.row-count-per-shard.ARRAY<FLOAT>vectors with strict_ROW_IDcontinuity validation.index/lumina-global-index-*.index.IndexFileMetawith canonicalindex_type = "lumina"and global row range metadata.Tests
API and Format
Documentation