Skip to content

fix: fail backup early when CSI VolumeSnapshot error persists in async poll#9727

Open
priyansh17 wants to merge 8 commits into
velero-io:mainfrom
priyansh17:user/priyansh/issue-9674
Open

fix: fail backup early when CSI VolumeSnapshot error persists in async poll#9727
priyansh17 wants to merge 8 commits into
velero-io:mainfrom
priyansh17:user/priyansh/issue-9674

Conversation

@priyansh17
Copy link
Copy Markdown
Collaborator

Please add a summary of your change

  • This bug can be reproduced by triggering a scenario where cloud snapshot provisioning fails (e.g., make Azure Disk run out of available capacity or permissions).
  • The problematic code path is in pkg/backup/actions/csi/volumesnapshot_action.go, in the Progress() method of volumeSnapshotBackupItemAction:
if boolptr.IsSetToTrue(vs.Status.ReadyToUse) {
    // ... continue to check VSC
} else if vs.Status.Error != nil {
    errorMessage := ""
    if vs.Status.Error.Message != nil {
        errorMessage = *vs.Status.Error.Message
    }
    p.log.Warnf("VolumeSnapshot has a temporary error %s. Snapshot controller will retry later.", errorMessage)
    return progress, nil  // <-- Returns NOT completed, no error propagated
}
  • This returns progress.Completed=false even for permanent errors, so the backup controller keeps polling Progress() for the full itemOperationTimeout (could be hours), instead of failing immediately.
  • The synchronous code path (WaitUntilVSCHandleIsReady) properly fails on timeout, but async Progress() does not.

The fix is to wait for a certain period of time before failing it/ skipping it.

Does your change fix a particular issue?

Fixes #(issue)
#9674

Please indicate you've done the following:

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 15, 2026

Codecov Report

❌ Patch coverage is 48.14815% with 42 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
pkg/controller/backup_controller.go 6.89% 27 Missing ⚠️
pkg/util/csi/volume_snapshot.go 23.52% 10 Missing and 3 partials ⚠️
pkg/cmd/cli/schedule/create.go 0.00% 1 Missing ⚠️
pkg/cmd/server/server.go 0.00% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@kaovilai
Copy link
Copy Markdown
Collaborator

fyi this is in draft, when ready for review mark it as ready

@priyansh17 priyansh17 marked this pull request as ready for review May 4, 2026 03:06
@priyansh17 priyansh17 requested review from blackpiglet and removed request for Lyndon-Li and anshulahuja98 May 4, 2026 03:06
priyansh17 added 8 commits May 9, 2026 16:24
…CSISnapshotTimeout

Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
…ation

Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
…r handling in VolumeSnapshot progress

Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
…istent errors in VolumeSnapshot

Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
@blackpiglet blackpiglet force-pushed the user/priyansh/issue-9674 branch from ea1dd85 to db661d0 Compare May 9, 2026 08:24
@blackpiglet
Copy link
Copy Markdown
Contributor

@priyansh17
Do we need a snapshot error timeout timer per backup?
If not, I suggest to use a global timer instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants