Skip to content

fix: keep recoverable ops failures in progress#10299

Open
weicao wants to merge 1 commit into
mainfrom
bugfix/es-pod-waiting-failure
Open

fix: keep recoverable ops failures in progress#10299
weicao wants to merge 1 commit into
mainfrom
bugfix/es-pod-waiting-failure

Conversation

@weicao
Copy link
Copy Markdown
Contributor

@weicao weicao commented May 27, 2026

What happened

An Elasticsearch restart OpsRequest was marked Failed while the target Pod later became Ready and the Elasticsearch cluster stayed green.

The preserved evidence shows the Pod did have a real failed/waiting history during recreate:

Error: failed to sync configmap cache: timed out waiting for the condition

That history should remain visible on the Pod / InstanceSet side. The bug is that restart Ops used the recoverable historical failure as an immediate terminal Ops result, even though the component had not become Failed and the instance could still recover.

Fixes #10300

Root cause

Rolling Ops progress counted a failed progress detail as completed and then finalized the whole OpsRequest as Failed as soon as all expected instances had either succeeded or failed.

That is too eager for restart-style rolling operations: a recreated Pod can have a real temporary failure signal and then become available later. If the component is still not Failed, Ops should keep observing instead of freezing that historical failure as the final operation result.

Fix

Do not change Pod failure detection.

Instead, make component Ops progress more robust:

  • failed progress details are still recorded;
  • when a component is not in Failed, failed progress details do not count toward completed progress;
  • the OpsRequest remains running so a later recovered instance can overwrite the failed detail with Succeed;
  • if the component is actually Failed, existing terminal failure behavior is preserved.

Validation

Local validation:

go test ./pkg/controllerutil -run TestPodIsReady -count=1
KUBEBUILDER_ASSETS='/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64' go test ./pkg/operations -run TestAPIs -count=1
git diff --check

The new envtest covers the contract: a restart Ops with a failed progress detail stays Running while the component is still Running, reports 2/3, and then succeeds once the instance recovers. The existing failed-component test still expects Ops Failed.

Evidence boundary: the ES field occurrence is N=1. This PR fixes the deterministic Ops progress contract that turned a recoverable failure history into a terminal Ops failure. It is not a release-ready claim; ES needs fresh exact-head runtime validation after this new head is sideloaded.

@weicao weicao requested review from a team and leon-ape as code owners May 27, 2026 12:38
@github-actions github-actions Bot added the size/L Denotes a PR that changes 100-499 lines. label May 27, 2026
@apecloud-bot
Copy link
Copy Markdown
Collaborator

Auto Cherry-pick Instructions

Usage:
  - /nopick: Not auto cherry-pick when PR merged.
  - /pick: release-x.x [release-x.x]: Auto cherry-pick to the specified branch when PR merged.

Example:
  - /nopick
  - /pick release-1.1

@weicao weicao added the kind/bug Something isn't working label May 27, 2026
@weicao weicao added the nopick Not auto cherry-pick when PR merged label May 27, 2026
@weicao
Copy link
Copy Markdown
Contributor Author

weicao commented May 27, 2026

Supersedes #10298. The patch commit is unchanged (403be1e); #10298 was closed because its branch name was rejected by PR pre-check. This PR uses the valid bugfix/es-pod-waiting-failure branch.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 27, 2026

Codecov Report

❌ Patch coverage is 90.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.86%. Comparing base (bd81433) to head (d84aa80).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
pkg/operations/ops_comp_helper.go 90.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10299      +/-   ##
==========================================
+ Coverage   52.83%   52.86%   +0.03%     
==========================================
  Files         533      533              
  Lines       61213    61235      +22     
==========================================
+ Hits        32343    32374      +31     
+ Misses      25621    25609      -12     
- Partials     3249     3252       +3     
Flag Coverage Δ
unittests 52.86% <90.00%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@weicao weicao force-pushed the bugfix/es-pod-waiting-failure branch from 403be1e to baa2459 Compare May 28, 2026 06:33
@weicao weicao requested a review from wangyelei as a code owner May 28, 2026 06:33
@weicao weicao changed the title fix: ignore transient pod waiting states fix: keep recoverable ops failures in progress May 28, 2026
@weicao weicao force-pushed the bugfix/es-pod-waiting-failure branch from baa2459 to d84aa80 Compare May 28, 2026 06:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug Something isn't working nopick Not auto cherry-pick when PR merged size/L Denotes a PR that changes 100-499 lines.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Transient pod waiting state can fail restart OpsRequest

2 participants