Skip to content

fix(operations): rebuild no-backup instances via fresh PVCs#10295

Open
weicao wants to merge 8 commits into
mainfrom
lily/rebuild-nobackup-main
Open

fix(operations): rebuild no-backup instances via fresh PVCs#10295
weicao wants to merge 8 commits into
mainfrom
lily/rebuild-nobackup-main

Conversation

@weicao
Copy link
Copy Markdown
Contributor

@weicao weicao commented May 27, 2026

Problem

No-backup in-place RebuildInstance has no backup data to restore. The old flow still created temporary PVCs and a helper pod, then tried to hand off a helper PV to the source PVC. In environments where that helper PV is not visible to the controller, the operation can fail with can not found the pv by the pvc ....

For no-backup rebuild, the intended result is a fresh source PVC from the current workload template, not a restored helper PV handoff.

Changes

  • Route no-backup in-place RebuildInstance through dynamic source PVC reprovisioning instead of temporary helper pod/PVC creation.
  • Record the original source PVC UID and volume name on the OpsRequest before deleting the source PVC, so re-entry can distinguish the old PVC from the recreated PVC.
  • Wait for the workload-created source PVC to be a new UID and Bound, then delete the target pod and wait for the rebuilt pod to become ready.
  • Reject dynamically recreated source PVCs whose workload/template labels do not match the expected source PVC identity.
  • If both old and new spec.volumeName are visible, reject rebinding to the old PV.
  • Keep the backup rebuild path on the existing helper-PV handoff flow.

Validation

  • KUBEBUILDER_ASSETS="$(.../setup-envtest-release-0.21 use 1.26.1 -p path)" go test ./pkg/operations -run TestAPIs -ginkgo.focus "test rebuild instance with no backup" -count=1 -v
  • KUBEBUILDER_ASSETS="$(.../setup-envtest-release-0.21 use 1.26.1 -p path)" go test ./pkg/operations -run TestAPIs -count=1
  • go test -c ./pkg/operations
  • git diff --check

Fixes #10293.

@weicao weicao requested review from a team and wangyelei as code owners May 27, 2026 04:03
@apecloud-bot
Copy link
Copy Markdown
Collaborator

Auto Cherry-pick Instructions

Usage:
  - /nopick: Not auto cherry-pick when PR merged.
  - /pick: release-x.x [release-x.x]: Auto cherry-pick to the specified branch when PR merged.

Example:
  - /nopick
  - /pick release-1.1

@github-actions github-actions Bot added the size/L Denotes a PR that changes 100-499 lines. label May 27, 2026
@weicao weicao added the kind/bug Something isn't working label May 27, 2026
@weicao
Copy link
Copy Markdown
Contributor Author

weicao commented May 27, 2026

/pick release-1.0 release-1.1

@apecloud-bot apecloud-bot added pick-1.0 Auto cherry-pick to release-1.0 when PR merged pick-1.1 Auto cherry-pick to release-1.1 when PR merged labels May 27, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 27, 2026

Codecov Report

❌ Patch coverage is 62.85714% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.76%. Comparing base (bd81433) to head (b6cfcf2).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
pkg/operations/rebuild_instance_inplace.go 61.76% 9 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10295      +/-   ##
==========================================
- Coverage   52.83%   52.76%   -0.08%     
==========================================
  Files         533      533              
  Lines       61213    61250      +37     
==========================================
- Hits        32343    32319      -24     
- Misses      25621    25674      +53     
- Partials     3249     3257       +8     
Flag Coverage Δ
unittests 52.76% <62.85%> (-0.08%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions github-actions Bot added size/XL Denotes a PR that changes 500-999 lines. and removed size/L Denotes a PR that changes 100-499 lines. labels May 27, 2026
@wangyelei
Copy link
Copy Markdown
Contributor

When does this scenario occur? Can it be reproduced?
The old flow still created temporary PVCs and a helper pod, then tried to hand off a helper PV to the source PVC. In environments where that helper PV is not visible to the controller, the operation can fail with can not found the pv by the pvc.

@weicao
Copy link
Copy Markdown
Contributor Author

weicao commented May 27, 2026

Yes. This occurs in the no-backup in-place RebuildInstance path when the old source PVC is being removed/recreated, but the legacy code still tries to run the backup-style helper PVC/PV handoff.

The failure condition is: helper pod finishes, then getRestoredPV() tries to find a PV for the tmp PVC. If tmpPVC.spec.volumeName is empty and no PV is visible with the tmp-PVC label, the OpsRequest fails with can not found the pv by the pvc .... For no-backup rebuild there is no restored data to hand off, so creating tmp PVCs/PVs is unnecessary and fragile.

It was reproduced twice with a MySQL semisync test on the release-1.0/1.0.3 line:

  • 2 replicas.
  • Delete the secondary pod and its PVC.
  • Submit RebuildInstance with inPlace=true and no backup.
  • Both runs failed at the same OpsRequest step with the tmp-PVC/PV lookup error.
  • The preserved failure state had the source PVC already Terminating with pvc-protection, while the target pod was still initializing.

This PR changes only the no-backup path: it stops creating helper tmp PVCs/pods, records the old source PVC identity, releases the old target pod first, waits for the workload to recreate the same-name source PVC from the current template, and then waits for the rebuilt pod. The backup/helper-PV path is left unchanged.

@weicao
Copy link
Copy Markdown
Contributor Author

weicao commented May 27, 2026

Clarification to the previous explanation: the old failure should be described as a restored-PV lookup failure, not as proof that the helper PV could never be created.

The old code path is:

  1. rebuildInstancePVByPod() creates the tmp PVCs and helper pod, then waits for the helper pod to become Succeeded.
  2. The next step calls getRestoredPV() for each tmp PVC.
  3. getRestoredPV() resolves the PV either from tmpPVC.spec.volumeName, or by listing PVs with operations.kubeblocks.io/rebuild-tmp-pvc=<tmpPVC name>.
  4. The observed failure means that, at that point, both lookup routes failed: tmpPVC.spec.volumeName was empty and no labeled PV was found by the controller.

So the reproduced symptom is clear, but the old helper-PV path root cause is narrower than my earlier wording: we did not prove that the PV was impossible to create; we proved that the controller could not resolve a transferable restored PV through the old lookup contract in that no-backup rebuild run.

This does not change the current PR direction. For no-backup rebuild, there is no backup data to restore or transfer, so requiring the backup-style tmp PVC/helper PV handoff is unnecessary. This PR removes that dependency only for the no-backup path and instead rebuilds through workload-owned source PVC deletion/recreation. The backup path continues to use the helper/restored-PV handoff.

@wangyelei
Copy link
Copy Markdown
Contributor

wangyelei commented May 27, 2026

Your modification approach has issues. Here's what I suggest:
if the rebuildInstance backup is not needed, then the inplaceRebuildHelper is unnecessary. You can just set setInstanceNodeSelectorForRebuild in advance, then delete the PVC, and finally delete the pod. After that, just wait for the new pod to become ready

@weicao
Copy link
Copy Markdown
Contributor Author

weicao commented May 27, 2026

Updated in 1703dc26a to follow the suggested no-backup order.

What changed:

  • For no-backup rebuild, when the old pod still exists, the node-selector intent is written before releasing source PVCs.
  • Old source PVC deletion is now requested before deleting the target pod. The code still does not clear pvc-protection; Kubernetes/workload ownership controls the actual release.
  • A replacement same-name PVC is still accepted only after the old target pod is absent, and only after UID / old volume / workload-template validation passes.
  • Backup/helper-PV rebuild path is unchanged.

The implementation still uses the existing helper receiver as the local holder for target pod / component / PVC-template context, but the no-backup path does not create helper tmp PVCs, helper pods, or helper PVs.

Local validation passed on 1703dc26a:

  • focused no-backup + existing-PVC-negative envtest: PASS
  • full go test ./pkg/operations -run TestAPIs -count=1: PASS
  • go test -c ./pkg/operations: PASS
  • git diff --check: PASS

GitHub CI is rerunning on the new head.

@wangyelei
Copy link
Copy Markdown
Contributor

Suggestions:

In rebuild_instance.go, do not modify anything else. Only modify the getPVCMapAndVolumes function within prepareInplaceRebuildHelper: if the backup is empty, do not obtain volumes and tmpPVC.

Modify the rebuildSourcePVCsByDynamicProvision logic as follows:

First, call setInstanceNodeSelectorForRebuild.

Then, based on the sourcePVC in pvcMap, delete it if it exists.

Finally, call deleteTargetPodForRebuild to delete the Pod.

Set progressDetail.Message = waitingForInstanceReadyMessage.

Do not change anything else.

@github-actions github-actions Bot added size/L Denotes a PR that changes 100-499 lines. and removed size/XL Denotes a PR that changes 500-999 lines. labels May 27, 2026
@weicao
Copy link
Copy Markdown
Contributor Author

weicao commented May 27, 2026

Updated in c7f310cc to follow the narrower suggestion.

What changed:

  • getPVCMapAndVolumes(..., noBackup=true) now only builds the source PVC map; it does not build helper volumes, volumeMounts, or tmp PVCs.
  • rebuildSourcePVCsByDynamicProvision now does only: set node selector -> delete existing source PVCs from pvcMap -> delete target pod -> set progressDetail.Message = waitingForInstanceReadyMessage.
  • Removed the extra source-PVC identity / absent-state / dynamic-wait validation logic from the previous patch.
  • Backup/helper-PV rebuild path remains unchanged.

One small guard remains in rebuildInstanceInPlace: after the target pod has been deleted and progress is already waitingForInstanceReadyMessage, a temporary Pod NotFound is treated as waiting instead of returning an error before the replacement pod appears.

Local validation passed on c7f310cc:

  • focused no-backup envtest: PASS
  • full go test ./pkg/operations -run TestAPIs -count=1: PASS
  • go test -c ./pkg/operations: PASS
  • git diff --check: PASS

GitHub CI is rerunning on the new head.

Comment thread pkg/operations/rebuild_instance.go Outdated
Comment thread pkg/operations/rebuild_instance_inplace.go Outdated
@weicao
Copy link
Copy Markdown
Contributor Author

weicao commented May 28, 2026

Updated in 552efac04 for the two review comments.

Changes:

  • Removed the extra target pod lookup in rebuildInstanceInPlace; the target pod is resolved by prepareInplaceRebuildHelper.
  • Reused deleteSourcePVCForRebuild in the no-backup dynamic source PVC path and removed the extra dynamic-provision delete helper.
  • Updated the no-backup test expectations for the requested delete helper behavior, including the pre-deleting PVC case.

Local validation:

  • KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./pkg/operations -run TestAPIs -count=1: PASS
  • go test -c ./pkg/operations: PASS
  • git diff --check: PASS

GitHub CI is running on the new head.

@weicao weicao requested a review from wangyelei May 28, 2026 02:40
@wangyelei
Copy link
Copy Markdown
Contributor

replace deleteSourcePVCForRebuild with

if sourcePVC.DeletionTimestamp == nil {
if err := intctrlutil.BackgroundDeleteObject(cli, reqCtx.Ctx, sourcePVC); err != nil {
return err
}

@weicao
Copy link
Copy Markdown
Contributor Author

weicao commented May 28, 2026

Updated in b6cfcf24 for the latest review comment.

Changes:

  • In the no-backup dynamic PVC path, replaced deleteSourcePVCForRebuild with the requested inline guarded background delete: only call BackgroundDeleteObject when sourcePVC.DeletionTimestamp == nil.
  • Updated the no-backup test to verify old source PVCs keep pvc-protection until the workload/test releases them.

Local validation:

  • KUBEBUILDER_ASSETS="/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64" go test ./pkg/operations -run TestAPIs -count=1: PASS
  • go test -c ./pkg/operations: PASS
  • git diff --check: PASS

GitHub CI is running on the new head.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug Something isn't working pick-1.0 Auto cherry-pick to release-1.0 when PR merged pick-1.1 Auto cherry-pick to release-1.1 when PR merged size/L Denotes a PR that changes 100-499 lines.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MySQL RebuildInstance no-backup fails when helper PV is not visible

3 participants