Skip to content

feat: support core Lumina index build#347

Open
QuakeWang wants to merge 2 commits into
apache:mainfrom
QuakeWang:lumina-core-build
Open

feat: support core Lumina index build#347
QuakeWang wants to merge 2 commits into
apache:mainfrom
QuakeWang:lumina-core-build

Conversation

@QuakeWang
Copy link
Copy Markdown
Contributor

Purpose

Rust already supports reading Lumina global indexes, but it did not have a core path to build Lumina index files from table data and commit them into the index manifest. Index-only commits were also skipped, so the generated index metadata could not create a new snapshot.

Brief change log

  • Add LuminaIndexBuildBuilder on Table.
  • Plan Lumina shards by global row id and global-index.row-count-per-shard.
  • Extract ARRAY<FLOAT> vectors with strict _ROW_ID continuity validation.
  • Build Lumina index files under index/lumina-global-index-*.index.
  • Commit Lumina IndexFileMeta with canonical index_type = "lumina" and global row range metadata.
  • Support metadata-only index commits and reject overlapping global index ranges.
  • Fail fast for unsupported primary-key and deletion-vector tables.

Tests

API and Format

Documentation

Copy link
Copy Markdown

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work here. The overall shape looks good to me: the builder validates the table/index prerequisites up front, plans shards by contiguous row-id ranges, reads with _ROW_ID and strict vector validation, and the index-only commit path preserves row tracking while adding overlap checks for global index files.

A couple of follow-ups I think are worth addressing/clarifying before merge:

  1. Snapshot/concurrency semantics. LuminaIndexBuildBuilder::execute plans and extracts vectors from the latest snapshot it observed, then commits index-only messages through TableCommit, which resolves against whatever snapshot is latest at commit time. The commit-side overlap check protects against another global-index commit, but it does not seem to verify that the underlying data snapshot/ranges are still the same. If a concurrent data commit rewrites/removes files for the same row ranges before the index-only commit lands, can the new index manifest be attached to a newer snapshot whose data no longer matches the built Lumina files? If this is intentionally safe because these tables are append-only + data-evolution row ranges are immutable, could you add a short comment/test or otherwise document that assumption? Otherwise we may need to carry the source snapshot id into the commit and reject conflicting data changes.

  2. Temporary file cleanup on failure. build_index_file removes the local temp file only after copy_local_file_to_output succeeds. If pretrain, insert, dump, or the copy fails, the temp file can be left behind. This is minor, but a scope guard or cleanup-on-error helper would make repeated failed builds safer.

Also noted that the only full execute test is ignored because it needs the native Lumina library. That is understandable, but if we cannot enable it in CI, it would be helpful to keep as much non-native coverage as possible around the metadata/index-only commit path and to document how the ignored test is expected to be run manually.

@QuakeWang
Copy link
Copy Markdown
Contributor Author

Thanks for the work here. The overall shape looks good to me: the builder validates the table/index prerequisites up front, plans shards by contiguous row-id ranges, reads with _ROW_ID and strict vector validation, and the index-only commit path preserves row tracking while adding overlap checks for global index files.

A couple of follow-ups I think are worth addressing/clarifying before merge:

  1. Snapshot/concurrency semantics. LuminaIndexBuildBuilder::execute plans and extracts vectors from the latest snapshot it observed, then commits index-only messages through TableCommit, which resolves against whatever snapshot is latest at commit time. The commit-side overlap check protects against another global-index commit, but it does not seem to verify that the underlying data snapshot/ranges are still the same. If a concurrent data commit rewrites/removes files for the same row ranges before the index-only commit lands, can the new index manifest be attached to a newer snapshot whose data no longer matches the built Lumina files? If this is intentionally safe because these tables are append-only + data-evolution row ranges are immutable, could you add a short comment/test or otherwise document that assumption? Otherwise we may need to carry the source snapshot id into the commit and reject conflicting data changes.
  2. Temporary file cleanup on failure. build_index_file removes the local temp file only after copy_local_file_to_output succeeds. If pretrain, insert, dump, or the copy fails, the temp file can be left behind. This is minor, but a scope guard or cleanup-on-error helper would make repeated failed builds safer.

Also noted that the only full execute test is ignored because it needs the native Lumina library. That is understandable, but if we cannot enable it in CI, it would be helpful to keep as much non-native coverage as possible around the metadata/index-only commit path and to document how the ignored test is expected to be run manually.

@leaves12138 Thanks for the careful review.

I addressed the follow-ups in the latest local changes:

  1. LuminaIndexBuildBuilder::execute now commits through TableCommit::commit_if_latest_snapshot, carrying the source snapshot id observed during planning/extraction. TableCommit validates the latest snapshot id on each retry before committing the index-only snapshot, so a concurrent data or index commit will reject the stale Lumina build instead of attaching it to a newer snapshot.

  2. build_index_file now uses a small temp-file guard. The local Lumina dump file is cleaned up on failures from native pretrain / insert / dump / copy, and explicitly cleaned after a successful copy.

  3. I kept the native execute E2E ignored, but documented the exact manual run command next to the test. I also added the ignored native builder E2E to the Linux integration CI path after lumina-data is installed, matching the existing optional Vortex-style CI coverage pattern.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants