fix: surface HPA errors on ScaledObject #7669

Merged
wozniakjan merged 2 commits into kedacore:main from wozniakjan:surface_hpa_errors_on_scaledobject
May 12, 2026
Conversation

@wozniakjan
Member

KEDA's ScaledObject currently reports Ready=True even when its managed HPA cannot fetch metrics (the HPA shows <unknown> targets with ScalingActive=False). Users see a healthy-looking ScaledObject while scaling is silently broken; the only way to discover the problem is to inspect the HPA manually. Possible causes include a broken metrics adapter, a broken APIService (apiregistration.k8s.io), or invalid certificates for the gRPC connection between the operator and the metrics adapter.

HPA broken with adapter not running
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: keda-hpa-nginx
  namespace: keda
spec:
  maxReplicas: 10
  metrics:
  - external:
      metric:
        name: s0-configmap-mock-metric-metric-value
        selector:
          matchLabels:
            scaledobject.keda.sh/name: nginx
      target:
        averageValue: "5"
        type: AverageValue
    type: External
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx
status:
  conditions:
  - lastTransitionTime: "2026-04-20T08:49:50Z"
    message: the HPA controller was able to get the target's current scale
    reason: SucceededGetScale
    status: "True"
    type: AbleToScale
  - lastTransitionTime: "2026-04-20T10:26:40Z"
    message: 'the HPA was unable to compute the replica count: unable to get external
      metric keda/s0-configmap-mock-metric-metric-value/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name:
      nginx,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics
      from external metrics API: the server is currently unable to handle the request
      (get s0-configmap-mock-metric-metric-value.external.metrics.k8s.io)'
    reason: FailedGetExternalMetric
    status: "False"
    type: ScalingActive
  - lastTransitionTime: "2026-04-20T08:49:50Z"
    message: the desired count is within the acceptable range
    reason: DesiredWithinRange
    status: "False"
    type: ScalingLimited
  currentMetrics:
  - type: ""
  currentReplicas: 4
  desiredReplicas: 4
  lastScaleTime: "2026-04-20T08:49:50Z"
Former SO status reporting when HPA is broken
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: nginx
  namespace: keda
spec:
  maxReplicaCount: 10
  minReplicaCount: 1
  scaleTargetRef:
    kind: Deployment
    name: nginx
  triggers:
  - metadata:
      key: metric-value
      resourceKind: ConfigMap
      resourceName: mock-metric
      targetValue: "5"
    type: kubernetes-resource
status:
  authenticationsTypes: ""
  conditions:
  - message: ScaledObject is defined correctly and is ready for scaling
    reason: ScaledObjectReady                                       # <-- imho this is incorrect
    status: "True"
    type: Ready
  - message: Scaling is performed because triggers are active
    reason: ScalerActive                                            # <- debatable whether active should be true?
    status: "True"
    type: Active
  - message: No fallbacks are active on this scaled object
    reason: NoFallbackFound
    status: "False"
    type: Fallback
  - status: "False"
    type: Paused
  externalMetricNames:
  - s0-configmap-mock-metric-metric-value
  hpaName: keda-hpa-nginx
  lastActiveTime: "2026-04-20T10:25:31Z"
  originalReplicaCount: 1
  scaleTargetGVKR:
    group: apps
    kind: Deployment
    resource: deployments
    version: v1
  scaleTargetKind: apps/v1.Deployment
  triggersActivity:
    s0-configmap-mock-metric-metric-value:
      isActive: true
  triggersTypes: kubernetes-resource
Proposed SO status reporting when HPA is broken
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: nginx
  namespace: keda
spec:
  maxReplicaCount: 10
  minReplicaCount: 1
  scaleTargetRef:
    kind: Deployment
    name: nginx
  triggers:
  - metadata:
      key: metric-value
      resourceKind: ConfigMap
      resourceName: mock-metric
      targetValue: "5"
    type: kubernetes-resource
status:
  authenticationsTypes: ""
  conditions:
  - message: 'ScaledObject is configured correctly but HPA is not healthy: FailedGetExternalMetric'
    reason: HPAMetricsUnavailable                                    # <- no longer reported as ready
    status: "False"
    type: Ready
  - message: Scaling is performed because triggers are active
    reason: ScalerActive                                             # <- still reports as active
    status: "True"
    type: Active
  - message: No fallbacks are active on this scaled object
    reason: NoFallbackFound
    status: "False"
    type: Fallback
  - status: "False"
    type: Paused
  externalMetricNames:
  - s0-configmap-mock-metric-metric-value
  hpaName: keda-hpa-nginx
  lastActiveTime: "2026-04-20T10:27:01Z"
  originalReplicaCount: 1
  scaleTargetGVKR:
    group: apps
    kind: Deployment
    resource: deployments
    version: v1
  scaleTargetKind: apps/v1.Deployment
  triggersActivity:
    s0-configmap-mock-metric-metric-value:
      isActive: true
  triggersTypes: kubernetes-resource

Checklist

  • I have verified that my change is according to the deprecations & breaking changes policy
  • Tests have been added (if applicable)
  • Ensure make generate-scalers-schema has been run to update any outdated generated files
  • Changelog has been updated and is aligned with our changelog requirements, only when the change impacts end users
  • Commits are signed with Developer Certificate of Origin (DCO - learn more)

Fixes #7649

@wozniakjan wozniakjan requested a review from a team as a code owner April 20, 2026 10:33
@github-actions

Thank you for your contribution! 🙏

Please understand that we will do our best to review your PR and give you feedback as soon as possible, but please bear with us if it takes a little longer than expected.

While you are waiting, make sure to:

  • Add an entry in our changelog in alphabetical order and link related issue
  • Update the documentation, if needed
  • Add unit & e2e tests for your changes
  • GitHub checks are passing
  • Is the DCO check failing? Here is how you can fix DCO issues

Once the initial tests are successful, a KEDA member will ensure that the e2e tests are run. Once the e2e tests have been successfully completed, the PR may be merged at a later date. Please be patient.

Learn more about our contribution guide.

@keda-automation keda-automation requested a review from a team April 20, 2026 10:33
@snyk-io

snyk-io Bot commented Apr 20, 2026

Snyk checks have passed. No issues have been found so far.

| Status | Scan Engine | Critical | High | Medium | Low | Total (0) |
| --- | --- | --- | --- | --- | --- | --- |
| | Open Source Security | 0 | 0 | 0 | 0 | 0 issues |

Contributor

Copilot AI left a comment

Pull request overview

Surfaces managed HPA health problems on the ScaledObject by adjusting the Ready condition when the HPA cannot compute metrics (e.g., ScalingActive=False).

Changes:

  • Add HPA health evaluation during RequestScale and aggregate it into the ScaledObject Ready condition.
  • Introduce new Ready condition reasons: HPAMetricsUnavailable and ScalingDegraded.
  • Add unit tests for HPA health evaluation and update CHANGELOG.md.
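
The health evaluation summarized in the list above can be sketched in Go as follows. This is a minimal, self-contained illustration, not the code from this PR: the `Condition` type and the `readyFromHPA` helper are hypothetical stand-ins, and the sketch only covers the `ScalingActive=False` case shown in the PR description. It also assumes that an HPA without a `ScalingActive=False` condition counts as healthy. When `ScalingDegraded` applies is not described in this thread, so it is left out.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for a Kubernetes
// status condition (type, status, reason, message).
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// readyFromHPA derives a ScaledObject Ready condition from the HPA's
// conditions. Hypothetical helper: it only handles ScalingActive=False,
// the case shown in the PR description.
func readyFromHPA(hpaConditions []Condition) Condition {
	for _, c := range hpaConditions {
		if c.Type == "ScalingActive" && c.Status == "False" {
			// Surface the HPA's own reason in the ScaledObject message.
			return Condition{
				Type:    "Ready",
				Status:  "False",
				Reason:  "HPAMetricsUnavailable",
				Message: "ScaledObject is configured correctly but HPA is not healthy: " + c.Reason,
			}
		}
	}
	// Assumption: no ScalingActive=False condition means the HPA is healthy.
	return Condition{
		Type:    "Ready",
		Status:  "True",
		Reason:  "ScaledObjectReady",
		Message: "ScaledObject is defined correctly and is ready for scaling",
	}
}

func main() {
	// Conditions taken from the broken HPA example in the PR description.
	hpa := []Condition{
		{Type: "AbleToScale", Status: "True", Reason: "SucceededGetScale"},
		{Type: "ScalingActive", Status: "False", Reason: "FailedGetExternalMetric"},
		{Type: "ScalingLimited", Status: "False", Reason: "DesiredWithinRange"},
	}
	ready := readyFromHPA(hpa)
	fmt.Printf("%s=%s (%s): %s\n", ready.Type, ready.Status, ready.Reason, ready.Message)
}
```

The Active condition is untouched here, matching the proposed status in the description, where Active stays True while Ready becomes False.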

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| pkg/scaling/executor/scale_scaledobjects.go | Adds HPA health reading (via autoscaling/v2 HPA status) and Ready-condition aggregation in RequestScale. |
| pkg/scaling/executor/scale_scaledobjects_test.go | Adds tests for HPA health logic and Ready-condition aggregation behavior. |
| apis/keda/v1alpha1/condition_types.go | Adds new ScaledObject Ready condition reason constants used by the executor. |
| CHANGELOG.md | Documents the user-facing behavior change for ScaledObject Ready condition vs HPA health. |

Comment thread pkg/scaling/executor/scale_scaledobjects_test.go Outdated
Comment thread pkg/scaling/executor/scale_scaledobjects_test.go Outdated
Comment thread pkg/scaling/executor/scale_scaledobjects_test.go Outdated
Comment thread pkg/scaling/executor/scale_scaledobjects.go
Comment thread pkg/scaling/executor/scale_scaledobjects_test.go Outdated
@wozniakjan wozniakjan force-pushed the surface_hpa_errors_on_scaledobject branch 2 times, most recently from c52be98 to db67926 Compare April 20, 2026 11:09
@wozniakjan
Member Author

wozniakjan commented Apr 20, 2026

/run-e2e
Update: You can check the progress here

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Comment thread pkg/scaling/executor/scale_scaledobjects.go
Comment thread pkg/scaling/executor/scale_scaledobjects.go Outdated
Comment thread pkg/scaling/executor/scale_scaledobjects_test.go Outdated
@wozniakjan wozniakjan force-pushed the surface_hpa_errors_on_scaledobject branch from db67926 to 333abcb Compare April 20, 2026 13:23
@wozniakjan
Member Author

wozniakjan commented Apr 20, 2026

/run-e2e
Update: You can check the progress here

@rickbrouwer rickbrouwer mentioned this pull request Apr 20, 2026
22 tasks
Comment thread pkg/scaling/executor/scale_scaledobjects.go Outdated
@wozniakjan wozniakjan force-pushed the surface_hpa_errors_on_scaledobject branch from 333abcb to 0a53c7e Compare April 21, 2026 13:08
@keda-automation keda-automation requested a review from a team April 21, 2026 13:08
@wozniakjan
Member Author

wozniakjan commented Apr 21, 2026

/run-e2e
Update: You can check the progress here

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment thread pkg/scaling/executor/scale_scaledobjects.go
Comment thread pkg/scaling/executor/scale_scaledobjects.go
@wozniakjan
Member Author

wozniakjan commented Apr 22, 2026

/run-e2e
Update: You can check the progress here

Comment thread pkg/scaling/executor/scale_scaledobjects.go
@keda-automation keda-automation requested a review from a team April 25, 2026 07:36
@rickbrouwer rickbrouwer force-pushed the surface_hpa_errors_on_scaledobject branch from b04e496 to 0a53c7e Compare April 25, 2026 07:41
Member

@JorTurFer JorTurFer left a comment

I like this idea, but I'm not sure whether checking this here makes sense. This will check the HPA on each controller loop, which means an extra request to the API (I know that it's cached).

The HPA is already added to the ScaledObjectController's reconciliation loop:

Owns(&autoscalingv2.HorizontalPodAutoscaler{}, builder.WithPredicates(
	predicate.Or(
		predicate.LabelChangedPredicate{},
		predicate.AnnotationChangedPredicate{},
		kedacontrollerutil.HPASpecChangedPredicate{},
	))).

Does it make sense to include the change of these fields in predicates and manage changes via watch instead of regular checks?

@wozniakjan
Member Author

> Does it make sense to include the change of these fields in predicates and manage changes via watch instead of regular checks?

I dug into the history (more on issue #7649): apparently that is how KEDA handled it before, and it caused overly frequent reconciliations and unnecessary condition flapping. In my testing, an HPA can emit enough events to trigger extremely frequent reconciles; 200 SOs in a cluster with trigger-happy HPAs were enough to effectively take KEDA down.

> which means an extra request to the API (I know that it's cached)

Exactly: because this uses the cached client, the HPA is read from memory, so there are 0 extra API calls.
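
The trade-off discussed above can be illustrated with a toy model in Go. This is only a sketch under stated assumptions, not KEDA or controller-runtime code: `hpaCache`, `onWatchEvent`, `get`, and `simulate` are hypothetical names, and the cache is a plain map standing in for a watch-fed in-memory store. It contrasts one reconcile per HPA status event (the watch-driven approach) with one in-memory read per scale-loop tick (the approach taken here).

```go
package main

import (
	"fmt"
	"sync"
)

// hpaCache is a toy stand-in for a watch-fed cache: watch events update
// it in the background, and reads are served from memory.
type hpaCache struct {
	mu            sync.RWMutex
	scalingActive map[string]bool
}

func newHPACache() *hpaCache {
	return &hpaCache{scalingActive: map[string]bool{}}
}

// onWatchEvent records the latest HPA state without triggering a reconcile.
func (c *hpaCache) onWatchEvent(name string, active bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.scalingActive[name] = active
}

// get returns the latest known state from memory; no API request is made.
func (c *hpaCache) get(name string) (active, found bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	active, found = c.scalingActive[name]
	return
}

// simulate feeds a sequence of HPA status events into the cache, then runs
// a number of scale-loop ticks. It returns how many reconciles a
// watch-driven design would have run (one per event), how many cached
// reads the polling design performed (one per tick), and the last state
// the polling design observed.
func simulate(events []bool, ticks int) (watchReconciles, polledChecks int, lastSeen bool) {
	cache := newHPACache()
	for _, e := range events {
		cache.onWatchEvent("keda-hpa-nginx", e)
		watchReconciles++ // a status predicate would enqueue a reconcile here
	}
	for i := 0; i < ticks; i++ {
		if active, ok := cache.get("keda-hpa-nginx"); ok {
			lastSeen = active
		}
		polledChecks++
	}
	return
}

func main() {
	// A flapping HPA: six status changes between two scale-loop ticks.
	flapping := []bool{true, false, true, false, true, false}
	w, p, last := simulate(flapping, 2)
	fmt.Printf("watch-driven reconciles: %d, cached reads: %d, last ScalingActive: %v\n", w, p, last)
}
```

In this model the number of cached reads depends only on the loop frequency, not on how often the HPA status changes, which is the property the comment above relies on.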

@wozniakjan
Member Author

wozniakjan commented Apr 28, 2026

/run-e2e
Update: You can check the progress here

@keda-automation keda-automation requested a review from a team April 28, 2026 15:34
@wozniakjan wozniakjan force-pushed the surface_hpa_errors_on_scaledobject branch from 00d8b85 to 6cfc4f4 Compare April 29, 2026 13:08
@wozniakjan
Member Author

wozniakjan commented Apr 29, 2026

/run-e2e
Update: You can check the progress here

@wozniakjan wozniakjan dismissed rickbrouwer’s stale review April 29, 2026 13:08

thanks! fixed, ptal

@rickbrouwer rickbrouwer added Awaiting/2nd-approval (This PR needs one more approval review) and required:keda-v2.20 labels Apr 29, 2026
@rickbrouwer
Member

rickbrouwer commented May 4, 2026

/run-e2e
Update: You can check the progress here

@rickbrouwer rickbrouwer added ok-to-merge (This PR can be merged) and removed Awaiting/2nd-approval (This PR needs one more approval review) labels May 4, 2026
@rickbrouwer rickbrouwer added the merge-conflict (This PR has a merge conflict) label May 7, 2026
wozniakjan and others added 2 commits May 11, 2026 12:53
Signed-off-by: Jan Wozniak <wozniak.jan@gmail.com>
Co-authored-by: Rick Brouwer <rickbrouwer@gmail.com>
Signed-off-by: Jan Wozniak <wozniak.jan@gmail.com>
@wozniakjan wozniakjan force-pushed the surface_hpa_errors_on_scaledobject branch from 6cfc4f4 to a491504 Compare May 11, 2026 10:54
@wozniakjan
Member Author

wozniakjan commented May 11, 2026

/run-e2e
Update: You can check the progress here

@keda-automation keda-automation requested a review from a team May 11, 2026 10:54
@wozniakjan wozniakjan enabled auto-merge (squash) May 11, 2026 10:54
@semgrep-code-kedacore

Semgrep found 12 context-todo findings:

Consider to use well-defined context

@rickbrouwer rickbrouwer removed the merge-conflict (This PR has a merge conflict) label May 11, 2026
@wozniakjan
Member Author

wozniakjan commented May 11, 2026

/run-e2e ^tests/(internals|secret-providers|scalers/(datadog|[^d][^/]*))/.*_test.go$

skip dynatrace - its CI secret DYNATRACE_HOST returns 404, unrelated to this PR

Update: You can check the progress here

@wozniakjan
Member Author

wozniakjan commented May 12, 2026

/run-e2e internal
Update: You can check the progress here

The only thing that failed in #7669 (comment) was the dynatrace tests (they have an unrelated issue: a 404 when putting metrics to the project). In #7669 (comment) I probably abused the regex too much and the tests weren't even triggered, so this is a last sanity check before merging, running only the internal e2e tests.

@wozniakjan wozniakjan merged commit 44daf4f into kedacore:main May 12, 2026
25 checks passed
Development

Successfully merging this pull request may close these issues.

HPA status in ScaledObject

5 participants