Monitoring API: remove maximum retention days prometheus by marioferh · Pull Request #2851 · openshift/api

marioferh · 2026-05-20T14:34:33Z

Remove Maximum validation and documentation for durationInDays (365)
and sizeInGiB (16384) on Prometheus Retention. We do not have enough
customer or operational data to pick defensible API-level ceilings;
arbitrary caps could block valid large-cluster configurations or
misalign with what CMO/Prometheus can actually support.

openshift-merge-bot · 2026-05-20T14:34:36Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

openshift-ci · 2026-05-20T14:34:41Z

Hello @marioferh! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

coderabbitai · 2026-05-20T14:34:52Z

Caution

Review failed

Failed to post review comments

📝 Walkthrough

Walkthrough

This PR updates the Retention.DurationInDays kubebuilder markers to enforce a minimum of 1 day and removes the previous maximum of 365 days. The generated ClusterMonitoring CRD schema was updated to remove the maximum constraints for both spec.prometheusConfig.retention.durationInDays and spec.prometheusConfig.retention.sizeInGiB, leaving their formats and minimums intact.

Suggested reviewers

JoelSpeed

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Microshift Test Compatibility	⚠️ Warning	New Ginkgo tests in tests/suite_test.go generate test cases for unavailable MicroShift APIs (machine.openshift.io, config.openshift.io, apiserver.openshift.io) without MicroShift protection.	Add [apigroup:...] tags for unavailable API groups or [Skipped:MicroShift] labels, or guard tests with exutil.IsMicroShiftCluster() checks.

✅ Passed checks (11 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: removing maximum retention constraints from Prometheus configuration, though it could be more specific about both fields affected.
Description check	✅ Passed	The description clearly explains what was changed (removal of maximum validation for durationInDays and sizeInGiB) and provides solid rationale about insufficient data for API-level ceilings.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	The PR contains no test files with dynamic names. Ginkgo test names use stable values (CRD metadata, static filenames, feature gates) from fmt.Sprintf, not pods/UUIDs/timestamps.
Test Structure And Quality	✅ Passed	PR contains no Ginkgo test code. It only modifies API definitions and CRD manifests for Prometheus retention constraints. Custom check for test structure/quality is not applicable.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	No new Ginkgo e2e tests were added. PR only modifies CRD validation annotations and schema for Prometheus retention fields in API type definitions and manifests, not test files.
Topology-Aware Scheduling Compatibility	✅ Passed	PR only modifies Prometheus retention field validation (removing max bounds), not scheduling constraints. No pod affinity, topology constraints, or operator deployment changes introduced.
Ote Binary Stdout Contract	✅ Passed	PR does not violate OTE Binary Stdout Contract; test suite properly routes logging via GinkgoWriter in BeforeSuite/AfterSuite, no direct stdout writes in process-level code.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	No Ginkgo e2e tests are added in this PR; changes are limited to CRD validation annotations and YAML manifest updates.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

Error: build linters: unable to load custom analyzer "kubeapilinter": tools/_output/bin/kube-api-linter.so, plugin: not implemented
The command is terminated due to an error: build linters: unable to load custom analyzer "kubeapilinter": tools/_output/bin/kube-api-linter.so, plugin: not implemented

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2026-05-20T14:35:15Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign everettraven for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

marioferh · 2026-05-20T14:57:41Z

@simonpasquier @danielmellado

danielmellado · 2026-05-20T15:05:09Z

While this should be ok not having a tombstone as we're just loosening limits, the openapi.json diff includes unrelated changes (NetworkObservabilitySpec, registries descriptions, softirqs collector, KMS default) which are likely from other PRs that merged into master since the branch was created. Could you rebase on latest master so the openapi.json diff only shows the PR changes? This would avoid confusion during review and prevent accidentally shipping stale generated data.

Remove Maximum validation and documentation for durationInDays (365) and sizeInGiB (16384) on Prometheus Retention. We do not have enough customer or operational data to pick defensible API-level ceilings; arbitrary caps could block valid large-cluster configurations or misalign with what CMO/Prometheus can actually support. Signed-off-by: Mario Fernandez <mariofer@redhat.com>

danielmellado · 2026-05-20T15:57:42Z

Hey @marioferh, as commented before we go with just removing the max constraints, I think we should also address the granularity issue that Ayoub and Simon raised.

Right now durationInDays: int32 can't express sub-day retention like 15h or 24h (which is the default for UWM), and sizeInGiB: int32 can't express sub-GiB sizes like 500MiB. The CMO ConfigMap and the Prometheus Operator CRD both accept freeform Prometheus duration/size strings, so customers migrating from retention: "15h" would have no equivalent in the new CRD.

I think we should push back on the integer-only convention for these two fields and switch to string fields with the same Prometheus pattern validation that the Prometheus Operator CRD uses:

duration: string with pattern ^(0|(([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?)$
size: string with pattern ^(0|([0-9]*[.])?[0-9]+((K|M|G|T|E|P)i?)?B)$
This way CMO can pass the values straight through to the Prometheus CR without any lossy conversion, and all existing ConfigMap values remain valid. The patterns are well-defined and fully validatable at admission time, so we're not losing any safety vs. int32.

@everettraven I know we discussed avoiding duration strings back in #2653, but this comes from the Prometheus Operator CR itself, which validates retention with ^(0|(([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?)$ and size with (^0|([0-9]*[.])?[0-9]+((K|M|G|T|E|P)i?)?B)$ (https://github.com/openshift/cluster-monitoring-operator/blob/0ce8e4c0bce00381802c25d92ed238775139ee82/assets/monitoring-operator/00_0prometheus-custom-resource-definition.yaml). CMO passes these values straight through to the Prometheus CR, so using durationInDays: int32 and sizeInGiB: int32 introduces a lossy conversion layer that breaks existing configs like retention: "15h" or retentionSize: "500MiB" that customers can set today in the ConfigMap. Would you be open to using string fields with these same Prometheus patterns for this specific case? They're fully validatable at admission time via the regex, and since this is v1alpha1 behind a feature gate we can still change it.

Worst case, as a fallback, we may set hours, but wanted to get back your thoughts on this! Thanks!

openshift-ci · 2026-05-20T16:43:51Z

@marioferh: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

everettraven · 2026-05-20T19:12:15Z

@@ -4152,9 +4150,7 @@ spec:
                          When the limit is reached, Prometheus will delete oldest data first.
                          When omitted, no size limit is enforced and Prometheus uses available PersistentVolume capacity.
                          Minimum value is 1 GiB.
-                          Maximum value is 16384 GiB (16 TiB).
                        format: int32
-                        maximum: 16384


We strongly recommend having reasonable upper bounds here. If nothing else, at least a maximum of 2,147,483,647 (the int32 maximum).

It is also worth noting that once you set a maximum and it GAs, we won't let you decrease it without strong evidence that it does not break user workflows.

It is usually easier for us to increase maximum constraints as needed than to add one or decrease it.

Also noting here that 2,147,483,647 GiB is ~ 2047 PiB. That seems pretty high...

everettraven · 2026-05-20T19:15:21Z

                          Minimum value is 1 day.
-                          Maximum value is 365 days (1 year).
                        format: int32
-                        maximum: 365


Just a note, a maximum of 2,147,483,647 here is equivalent to ~58835 centuries. Seems like an unreasonably high number that would never be set.

Without more context, I would think it to be reasonable to increase this to something like 3650 to account for ~10 years of retention. That seems pretty high to me for metrics, but 🤷

everettraven · 2026-05-21T12:42:22Z

Hey @marioferh, as commented before we go with just removing the max constraints, I think we should also address the granularity issue that Ayoub and Simon raised.

Right now durationInDays: int32 can't express sub-day retention like 15h or 24h (which is the default for UWM), and sizeInGiB: int32 can't express sub-GiB sizes like 500MiB. The CMO ConfigMap and the Prometheus Operator CRD both accept freeform Prometheus duration/size strings, so customers migrating from retention: "15h" would have no equivalent in the new CRD.

I think we should push back on the integer-only convention for these two fields and switch to string fields with the same Prometheus pattern validation that the Prometheus Operator CRD uses:

duration: string with pattern ^(0|(([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?)$ size: string with pattern ^(0|([0-9]*[.])?[0-9]+((K|M|G|T|E|P)i?)?B)$ This way CMO can pass the values straight through to the Prometheus CR without any lossy conversion, and all existing ConfigMap values remain valid. The patterns are well-defined and fully validatable at admission time, so we're not losing any safety vs. int32.

@everettraven I know we discussed avoiding duration strings back in #2653, but this comes from the Prometheus Operator CR itself, which validates retention with ^(0|(([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?)$ and size with (^0|([0-9]*[.])?[0-9]+((K|M|G|T|E|P)i?)?B)$ (https://github.com/openshift/cluster-monitoring-operator/blob/0ce8e4c0bce00381802c25d92ed238775139ee82/assets/monitoring-operator/00_0prometheus-custom-resource-definition.yaml). CMO passes these values straight through to the Prometheus CR, so using durationInDays: int32 and sizeInGiB: int32 introduces a lossy conversion layer that breaks existing configs like retention: "15h" or retentionSize: "500MiB" that customers can set today in the ConfigMap. Would you be open to using string fields with these same Prometheus patterns for this specific case? They're fully validatable at admission time via the regex, and since this is v1alpha1 behind a feature gate we can still change it.

Worst case, as a fallback, we may set hours, but wanted to get back your thoughts on this! Thanks!

We strongly encourage not using durations in favor of a unit-based integer, but if you folks feel strongly that this is the best UX for your end-users I'm not going to prevent you from doing so. I'm not the expert on your user base here, you folks are :).

That being said, if the concern is solely that you cannot reasonably express smaller unit values that are likely to occur, you can change the unit you are using for the integer field to the smallest reasonable unit. For example, durationInSeconds and sizeInBytes.

Even if this is in v1alpha1 we still have to abide by tombstoning rules if we change the serialization of the fields unless we want to move this to a v1alpha2.

openshift-ci Bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 20, 2026

openshift-ci Bot requested review from JoelSpeed and everettraven May 20, 2026 14:35

marioferh force-pushed the fix_prometheus_retention_days branch 2 times, most recently from 80e0c7d to 03a1de9 Compare May 20, 2026 14:57

marioferh force-pushed the fix_prometheus_retention_days branch from 03a1de9 to 854e0ad Compare May 20, 2026 15:23

everettraven reviewed May 20, 2026

View reviewed changes

Conversation

marioferh commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-merge-bot Bot commented May 20, 2026

Uh oh!

openshift-ci Bot commented May 20, 2026

Uh oh!

coderabbitai Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

openshift-ci Bot commented May 20, 2026

Uh oh!

marioferh commented May 20, 2026

Uh oh!

danielmellado commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

danielmellado commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-ci Bot commented May 20, 2026

Uh oh!

everettraven May 20, 2026

Choose a reason for hiding this comment

Uh oh!

everettraven May 20, 2026

Choose a reason for hiding this comment

Uh oh!

everettraven May 20, 2026

Choose a reason for hiding this comment

Uh oh!

everettraven commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

marioferh commented May 20, 2026 •

edited

Loading

coderabbitai Bot commented May 20, 2026 •

edited

Loading

danielmellado commented May 20, 2026 •

edited

Loading

danielmellado commented May 20, 2026 •

edited

Loading