fix: keep recoverable ops failures in progress by weicao · Pull Request #10299 · apecloud/kubeblocks

weicao · 2026-05-27T12:38:28Z

What happened

An Elasticsearch restart OpsRequest was marked Failed while the target Pod later became Ready and the Elasticsearch cluster stayed green.

The preserved evidence shows the Pod did have a real failed/waiting history during recreate:

Error: failed to sync configmap cache: timed out waiting for the condition

That history should remain visible on the Pod / InstanceSet side. The bug is that restart Ops used the recoverable historical failure as an immediate terminal Ops result, even though the component had not become Failed and the instance could still recover.

Fixes #10300

Root cause

Rolling Ops progress counted a failed progress detail as completed and then finalized the whole OpsRequest as Failed as soon as all expected instances had either succeeded or failed.

That is too eager for restart-style rolling operations: a recreated Pod can have a real temporary failure signal and then become available later. If the component is still not Failed, Ops should keep observing instead of freezing that historical failure as the final operation result.

Fix

Do not change Pod failure detection.

Instead, make component Ops progress more robust:

failed progress details are still recorded;
when a component is not in Failed, failed progress details do not count toward completed progress;
the OpsRequest remains running so a later recovered instance can overwrite the failed detail with Succeed;
if the component is actually Failed, existing terminal failure behavior is preserved.

Validation

Local validation:

go test ./pkg/controllerutil -run TestPodIsReady -count=1
KUBEBUILDER_ASSETS='/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64' go test ./pkg/operations -run TestAPIs -count=1
git diff --check

The new envtest covers the contract: a restart Ops with a failed progress detail stays Running while the component is still Running, reports 2/3, and then succeeds once the instance recovers. The existing failed-component test still expects Ops Failed.

Evidence boundary: the ES field occurrence is N=1. This PR fixes the deterministic Ops progress contract that turned a recoverable failure history into a terminal Ops failure. It is not a release-ready claim; ES needs fresh exact-head runtime validation after this new head is sideloaded.

apecloud-bot · 2026-05-27T12:38:45Z

Auto Cherry-pick Instructions

Usage:
  - /nopick: Not auto cherry-pick when PR merged.
  - /pick: release-x.x [release-x.x]: Auto cherry-pick to the specified branch when PR merged.

Example:
  - /nopick
  - /pick release-1.1

weicao · 2026-05-27T12:40:50Z

Supersedes #10298. The patch commit is unchanged (403be1e); #10298 was closed because its branch name was rejected by PR pre-check. This PR uses the valid bugfix/es-pod-waiting-failure branch.

codecov · 2026-05-27T12:45:10Z

Codecov Report

❌ Patch coverage is 90.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.86%. Comparing base (bd81433) to head (d84aa80).
⚠️ Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
pkg/operations/ops_comp_helper.go	90.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #10299      +/-   ##
==========================================
+ Coverage   52.83%   52.86%   +0.03%     
==========================================
  Files         533      533              
  Lines       61213    61235      +22     
==========================================
+ Hits        32343    32374      +31     
+ Misses      25621    25609      -12     
- Partials     3249     3252       +3

Flag	Coverage Δ
unittests	`52.86% <90.00%> (+0.03%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

weicao requested review from a team and leon-ape as code owners May 27, 2026 12:38

github-actions Bot added the size/L Denotes a PR that changes 100-499 lines. label May 27, 2026

weicao added the kind/bug Something isn't working label May 27, 2026

weicao mentioned this pull request May 27, 2026

fix: avoid terminal pod failure for transient waiting states #10298

Closed

weicao added the nopick Not auto cherry-pick when PR merged label May 27, 2026

weicao force-pushed the bugfix/es-pod-waiting-failure branch from 403be1e to baa2459 Compare May 28, 2026 06:33

weicao requested a review from wangyelei as a code owner May 28, 2026 06:33

weicao changed the title ~~fix: ignore transient pod waiting states~~ fix: keep recoverable ops failures in progress May 28, 2026

fix: avoid terminalizing recoverable ops failures

d84aa80

weicao force-pushed the bugfix/es-pod-waiting-failure branch from baa2459 to d84aa80 Compare May 28, 2026 06:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: keep recoverable ops failures in progress#10299

fix: keep recoverable ops failures in progress#10299
weicao wants to merge 1 commit into
mainfrom
bugfix/es-pod-waiting-failure

weicao commented May 27, 2026 •

edited

Loading

Uh oh!

apecloud-bot commented May 27, 2026

Uh oh!

weicao commented May 27, 2026

Uh oh!

codecov Bot commented May 27, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

weicao commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What happened

Root cause

Fix

Validation

Uh oh!

apecloud-bot commented May 27, 2026

Uh oh!

weicao commented May 27, 2026

Uh oh!

codecov Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

weicao commented May 27, 2026 •

edited

Loading

codecov Bot commented May 27, 2026 •

edited

Loading