Skip to content

[core] Fix orphan files clean deleting data files during concurrent snapshot expiration#7715

Open
heye1005 wants to merge 1 commit into
apache:masterfrom
heye1005:fix-orphan-clean-concurrent-expiration
Open

[core] Fix orphan files clean deleting data files during concurrent snapshot expiration#7715
heye1005 wants to merge 1 commit into
apache:masterfrom
heye1005:fix-orphan-clean-concurrent-expiration

Conversation

@heye1005
Copy link
Copy Markdown
Contributor

Purpose

Fix #7710.

LocalOrphanFilesClean.clean() collects used files by reading all snapshots via safelyGetAllSnapshots(). However, this method silently catches FileNotFoundException — if concurrent snapshot expiration deletes all snapshots between listing and reading, usedFiles ends up empty, and all candidate data files get deleted as orphans.

This patch skips orphan file deletion when usedFiles is empty to prevent accidental data loss.

Tests

Added LocalOrphanFilesCleanTest#testSkipCleaningWhenAllSnapshotsDeleted.

@JingsongLi
Copy link
Copy Markdown
Contributor

Hi @heye1005 It is better to fix it reading full ref files for latest snapshot.

@heye1005 heye1005 force-pushed the fix-orphan-clean-concurrent-expiration branch 3 times, most recently from a91ce0d to 5d525d4 Compare May 1, 2026 10:21
@heye1005
Copy link
Copy Markdown
Contributor Author

heye1005 commented May 1, 2026

Hi @heye1005 It is better to fix it reading full ref files for latest snapshot.

Thanks for the review! Updated the PR based on your suggestion.

The root cause here is that safelyGetAllSnapshots() lists the snapshot directory first, then reads each file. If expiration deletes the snapshot file (along with its manifest-list) between these two steps, the read silently fails and returns an empty list — all old data files then get treated as orphans.

This PR has two changes:

  1. Replaced the list-then-read in safelyGetAllSnapshots() with latestSnapshot() + ID range traversal. Expiration never deletes the latest snapshot, so the snapshot file read is safe. But the manifest-list of that snapshot can still be deleted before we read it.

  2. Added collectLatestSnapshotFiles() — re-reads the latest snapshot right before deletion and removes its referenced files from candidates. By then expiration is usually done, so the current latest snapshot's manifest-list is intact. This is applied to Local, Flink, and Spark subclasses.

This doesn't 100% eliminate the race — if expiration happens to delete the latest snapshot's manifest-list during the pre-deletion check, it would still fail. But that window is just milliseconds, so practically very unlikely.

To fully solve this, I think we'd need either some form of locking between orphan clean and expiration, or delay manifest-list deletion (similar to how orphan clean already requires data files to be older than olderThanMillis). Happy to discuss further.

@heye1005 heye1005 force-pushed the fix-orphan-clean-concurrent-expiration branch from 5d525d4 to 1d66c8e Compare May 1, 2026 10:30
Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: [core] Fix orphan files clean deleting data files during concurrent snapshot expiration

Overall Assessment

The defense-in-depth approach (modifying safelyGetAllSnapshots + adding collectLatestSnapshotFiles as a second barrier) is a reasonable strategy for narrowing the race window.

Correctness Concerns

1. Residual race still exists without a hard guard

The TOCTOU window is narrowed but not eliminated. Consider adding an explicit empty-set guard: if usedFiles is empty after collecting from all branches, abort with a warning rather than proceeding to delete. This is simple, cheap, and guarantees no data loss regardless of timing.

2. Broad exception swallowing in safelyGetAllSnapshots

The outer catch (Exception e) catches all Exception types. If latestSnapshot() throws due to corruption, permissions, or network timeout, it silently proceeds with an empty snapshot set. Consider catching only FileNotFoundException and letting other exceptions propagate.

3. collectLatestSnapshotFiles failure path

If it fails for a branch (manifest file deleted by expiration), the catch-all means that branch's latest snapshot files are NOT protected.

Performance

4. Redundant manifest reading -- collectLatestSnapshotFiles re-reads the latest snapshot's entire manifest tree. For large tables this is not trivial.

Minor

5. The test verifies sequential (expire then clean) but not true concurrent race. A clarifying comment would help.

Summary

Clear improvement. Main recommendation: add an explicit empty-set guard (if (usedFiles.isEmpty()) abort) as an absolute safety net.

@heye1005 heye1005 force-pushed the fix-orphan-clean-concurrent-expiration branch from 1d66c8e to 2febd22 Compare May 24, 2026 15:11
@heye1005
Copy link
Copy Markdown
Contributor Author

Thanks for thorough review, @JingsongLi.
Fix by your feedback. only retain add an explicit empty-set guard, and tests adjusted. PTAL.

… snapshot expiration

The clean() flow could end up with an empty usedFiles set when a concurrent
snapshot expiration races with snapshot/manifest reads in the cleaner,
which would then mistakenly treat live manifest/data files as orphans and
delete them.

Add a safety net in LocalOrphanFilesClean.clean(): if usedFiles is empty
while candidate files are present, abort the run with a warning instead
of proceeding to deletion. Add two tests covering the post-expiration
sequential path and the empty-usedFiles abort path.

This closes apache#7710.
@heye1005 heye1005 force-pushed the fix-orphan-clean-concurrent-expiration branch from 2febd22 to 80870aa Compare May 24, 2026 15:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] remove_orphan_files may accidentally delete data in multitasking concurrent scenarios

2 participants