[core] Fix orphan files clean deleting data files during concurrent snapshot expiration#7715
[core] Fix orphan files clean deleting data files during concurrent snapshot expiration#7715heye1005 wants to merge 1 commit into
Conversation
|
Hi @heye1005 It is better to fix it reading full ref files for latest snapshot. |
a91ce0d to
5d525d4
Compare
Thanks for the review! Updated the PR based on your suggestion. The root cause here is that This PR has two changes:
This doesn't 100% eliminate the race — if expiration happens to delete the latest snapshot's manifest-list during the pre-deletion check, it would still fail. But that window is just milliseconds, so practically very unlikely. To fully solve this, I think we'd need either some form of locking between orphan clean and expiration, or delay manifest-list deletion (similar to how orphan clean already requires data files to be older than |
5d525d4 to
1d66c8e
Compare
JingsongLi
left a comment
There was a problem hiding this comment.
Review: [core] Fix orphan files clean deleting data files during concurrent snapshot expiration
Overall Assessment
The defense-in-depth approach (modifying safelyGetAllSnapshots + adding collectLatestSnapshotFiles as a second barrier) is a reasonable strategy for narrowing the race window.
Correctness Concerns
1. Residual race still exists without a hard guard
The TOCTOU window is narrowed but not eliminated. Consider adding an explicit empty-set guard: if usedFiles is empty after collecting from all branches, abort with a warning rather than proceeding to delete. This is simple, cheap, and guarantees no data loss regardless of timing.
2. Broad exception swallowing in safelyGetAllSnapshots
The outer catch (Exception e) catches all Exception types. If latestSnapshot() throws due to corruption, permissions, or network timeout, it silently proceeds with an empty snapshot set. Consider catching only FileNotFoundException and letting other exceptions propagate.
3. collectLatestSnapshotFiles failure path
If it fails for a branch (manifest file deleted by expiration), the catch-all means that branch's latest snapshot files are NOT protected.
Performance
4. Redundant manifest reading -- collectLatestSnapshotFiles re-reads the latest snapshot's entire manifest tree. For large tables this is not trivial.
Minor
5. The test verifies sequential (expire then clean) but not true concurrent race. A clarifying comment would help.
Summary
Clear improvement. Main recommendation: add an explicit empty-set guard (if (usedFiles.isEmpty()) abort) as an absolute safety net.
1d66c8e to
2febd22
Compare
|
Thanks for thorough review, @JingsongLi. |
… snapshot expiration The clean() flow could end up with an empty usedFiles set when a concurrent snapshot expiration races with snapshot/manifest reads in the cleaner, which would then mistakenly treat live manifest/data files as orphans and delete them. Add a safety net in LocalOrphanFilesClean.clean(): if usedFiles is empty while candidate files are present, abort the run with a warning instead of proceeding to deletion. Add two tests covering the post-expiration sequential path and the empty-usedFiles abort path. This closes apache#7710.
2febd22 to
80870aa
Compare
Purpose
Fix #7710.
LocalOrphanFilesClean.clean()collects used files by reading all snapshots viasafelyGetAllSnapshots(). However, this method silently catchesFileNotFoundException— if concurrent snapshot expiration deletes all snapshots between listing and reading,usedFilesends up empty, and all candidate data files get deleted as orphans.This patch skips orphan file deletion when
usedFilesis empty to prevent accidental data loss.Tests
Added
LocalOrphanFilesCleanTest#testSkipCleaningWhenAllSnapshotsDeleted.