Skip to content

Monitoring API: remove maximum retention days prometheus#2851

Open
marioferh wants to merge 1 commit into
openshift:masterfrom
marioferh:fix_prometheus_retention_days
Open

Monitoring API: remove maximum retention days prometheus#2851
marioferh wants to merge 1 commit into
openshift:masterfrom
marioferh:fix_prometheus_retention_days

Conversation

@marioferh
Copy link
Copy Markdown
Contributor

@marioferh marioferh commented May 20, 2026

Remove Maximum validation and documentation for durationInDays (365)
and sizeInGiB (16384) on Prometheus Retention. We do not have enough
customer or operational data to pick defensible API-level ceilings;
arbitrary caps could block valid large-cluster configurations or
misalign with what CMO/Prometheus can actually support.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 20, 2026

Hello @marioferh! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 20, 2026

Caution

Review failed

Failed to post review comments

📝 Walkthrough

Walkthrough

This PR updates the Retention.DurationInDays kubebuilder markers to enforce a minimum of 1 day and removes the previous maximum of 365 days. The generated ClusterMonitoring CRD schema was updated to remove the maximum constraints for both spec.prometheusConfig.retention.durationInDays and spec.prometheusConfig.retention.sizeInGiB, leaving their formats and minimums intact.

Suggested reviewers

  • JoelSpeed
🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Microshift Test Compatibility ⚠️ Warning New Ginkgo tests in tests/suite_test.go generate test cases for unavailable MicroShift APIs (machine.openshift.io, config.openshift.io, apiserver.openshift.io) without MicroShift protection. Add [apigroup:...] tags for unavailable API groups or [Skipped:MicroShift] labels, or guard tests with exutil.IsMicroShiftCluster() checks.
✅ Passed checks (11 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: removing maximum retention constraints from Prometheus configuration, though it could be more specific about both fields affected.
Description check ✅ Passed The description clearly explains what was changed (removal of maximum validation for durationInDays and sizeInGiB) and provides solid rationale about insufficient data for API-level ceilings.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed The PR contains no test files with dynamic names. Ginkgo test names use stable values (CRD metadata, static filenames, feature gates) from fmt.Sprintf, not pods/UUIDs/timestamps.
Test Structure And Quality ✅ Passed PR contains no Ginkgo test code. It only modifies API definitions and CRD manifests for Prometheus retention constraints. Custom check for test structure/quality is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No new Ginkgo e2e tests were added. PR only modifies CRD validation annotations and schema for Prometheus retention fields in API type definitions and manifests, not test files.
Topology-Aware Scheduling Compatibility ✅ Passed PR only modifies Prometheus retention field validation (removing max bounds), not scheduling constraints. No pod affinity, topology constraints, or operator deployment changes introduced.
Ote Binary Stdout Contract ✅ Passed PR does not violate OTE Binary Stdout Contract; test suite properly routes logging via GinkgoWriter in BeforeSuite/AfterSuite, no direct stdout writes in process-level code.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No Ginkgo e2e tests are added in this PR; changes are limited to CRD validation annotations and YAML manifest updates.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

Error: build linters: unable to load custom analyzer "kubeapilinter": tools/_output/bin/kube-api-linter.so, plugin: not implemented
The command is terminated due to an error: build linters: unable to load custom analyzer "kubeapilinter": tools/_output/bin/kube-api-linter.so, plugin: not implemented


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 20, 2026
@openshift-ci openshift-ci Bot requested review from JoelSpeed and everettraven May 20, 2026 14:35
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 20, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign everettraven for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@marioferh marioferh force-pushed the fix_prometheus_retention_days branch 2 times, most recently from 80e0c7d to 03a1de9 Compare May 20, 2026 14:57
@marioferh
Copy link
Copy Markdown
Contributor Author

@danielmellado
Copy link
Copy Markdown
Contributor

danielmellado commented May 20, 2026

While this should be ok not having a tombstone as we're just loosening limits, the openapi.json diff includes unrelated changes (NetworkObservabilitySpec, registries descriptions, softirqs collector, KMS default) which are likely from other PRs that merged into master since the branch was created. Could you rebase on latest master so the openapi.json diff only shows the PR changes? This would avoid confusion during review and prevent accidentally shipping stale generated data.

Remove Maximum validation and documentation for durationInDays (365)
and sizeInGiB (16384) on Prometheus Retention. We do not have enough
customer or operational data to pick defensible API-level ceilings;
arbitrary caps could block valid large-cluster configurations or
misalign with what CMO/Prometheus can actually support.

Signed-off-by: Mario Fernandez <mariofer@redhat.com>
@marioferh marioferh force-pushed the fix_prometheus_retention_days branch from 03a1de9 to 854e0ad Compare May 20, 2026 15:23
@danielmellado
Copy link
Copy Markdown
Contributor

danielmellado commented May 20, 2026

Hey @marioferh, as commented before we go with just removing the max constraints, I think we should also address the granularity issue that Ayoub and Simon raised.

Right now durationInDays: int32 can't express sub-day retention like 15h or 24h (which is the default for UWM), and sizeInGiB: int32 can't express sub-GiB sizes like 500MiB. The CMO ConfigMap and the Prometheus Operator CRD both accept freeform Prometheus duration/size strings, so customers migrating from retention: "15h" would have no equivalent in the new CRD.

I think we should push back on the integer-only convention for these two fields and switch to string fields with the same Prometheus pattern validation that the Prometheus Operator CRD uses:

duration: string with pattern ^(0|(([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?)$
size: string with pattern ^(0|([0-9]*[.])?[0-9]+((K|M|G|T|E|P)i?)?B)$
This way CMO can pass the values straight through to the Prometheus CR without any lossy conversion, and all existing ConfigMap values remain valid. The patterns are well-defined and fully validatable at admission time, so we're not losing any safety vs. int32.

@everettraven I know we discussed avoiding duration strings back in #2653, but this comes from the Prometheus Operator CR itself, which validates retention with ^(0|(([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?)$ and size with (^0|([0-9]*[.])?[0-9]+((K|M|G|T|E|P)i?)?B)$ (https://github.com/openshift/cluster-monitoring-operator/blob/0ce8e4c0bce00381802c25d92ed238775139ee82/assets/monitoring-operator/00_0prometheus-custom-resource-definition.yaml). CMO passes these values straight through to the Prometheus CR, so using durationInDays: int32 and sizeInGiB: int32 introduces a lossy conversion layer that breaks existing configs like retention: "15h" or retentionSize: "500MiB" that customers can set today in the ConfigMap. Would you be open to using string fields with these same Prometheus patterns for this specific case? They're fully validatable at admission time via the regex, and since this is v1alpha1 behind a feature gate we can still change it.

Worst case, as a fallback, we may set hours, but wanted to get back your thoughts on this! Thanks!

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 20, 2026

@marioferh: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Comment on lines 4143 to -4157
@@ -4152,9 +4150,7 @@ spec:
When the limit is reached, Prometheus will delete oldest data first.
When omitted, no size limit is enforced and Prometheus uses available PersistentVolume capacity.
Minimum value is 1 GiB.
Maximum value is 16384 GiB (16 TiB).
format: int32
maximum: 16384
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We strongly recommend having reasonable upper bounds here. If nothing else, at least a maximum of 2,147,483,647 (the int32 maximum).

It is also worth noting that once you set a maximum and it GAs, we won't let you decrease it without strong evidence that it does not break user workflows.

It is usually easier for us to increase maximum constraints as needed than to add one or decrease it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also noting here that 2,147,483,647 GiB is ~ 2047 PiB. That seems pretty high...

Minimum value is 1 day.
Maximum value is 365 days (1 year).
format: int32
maximum: 365
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note, a maximum of 2,147,483,647 here is equivalent to ~58835 centuries. Seems like an unreasonably high number that would never be set.

Without more context, I would think it to be reasonable to increase this to something like 3650 to account for ~10 years of retention. That seems pretty high to me for metrics, but 🤷

@everettraven
Copy link
Copy Markdown
Contributor

Hey @marioferh, as commented before we go with just removing the max constraints, I think we should also address the granularity issue that Ayoub and Simon raised.

Right now durationInDays: int32 can't express sub-day retention like 15h or 24h (which is the default for UWM), and sizeInGiB: int32 can't express sub-GiB sizes like 500MiB. The CMO ConfigMap and the Prometheus Operator CRD both accept freeform Prometheus duration/size strings, so customers migrating from retention: "15h" would have no equivalent in the new CRD.

I think we should push back on the integer-only convention for these two fields and switch to string fields with the same Prometheus pattern validation that the Prometheus Operator CRD uses:

duration: string with pattern ^(0|(([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?)$ size: string with pattern ^(0|([0-9]*[.])?[0-9]+((K|M|G|T|E|P)i?)?B)$ This way CMO can pass the values straight through to the Prometheus CR without any lossy conversion, and all existing ConfigMap values remain valid. The patterns are well-defined and fully validatable at admission time, so we're not losing any safety vs. int32.

@everettraven I know we discussed avoiding duration strings back in #2653, but this comes from the Prometheus Operator CR itself, which validates retention with ^(0|(([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?)$ and size with (^0|([0-9]*[.])?[0-9]+((K|M|G|T|E|P)i?)?B)$ (https://github.com/openshift/cluster-monitoring-operator/blob/0ce8e4c0bce00381802c25d92ed238775139ee82/assets/monitoring-operator/00_0prometheus-custom-resource-definition.yaml). CMO passes these values straight through to the Prometheus CR, so using durationInDays: int32 and sizeInGiB: int32 introduces a lossy conversion layer that breaks existing configs like retention: "15h" or retentionSize: "500MiB" that customers can set today in the ConfigMap. Would you be open to using string fields with these same Prometheus patterns for this specific case? They're fully validatable at admission time via the regex, and since this is v1alpha1 behind a feature gate we can still change it.

Worst case, as a fallback, we may set hours, but wanted to get back your thoughts on this! Thanks!

We strongly encourage not using durations in favor of a unit-based integer, but if you folks feel strongly that this is the best UX for your end-users I'm not going to prevent you from doing so. I'm not the expert on your user base here, you folks are :).

That being said, if the concern is solely that you cannot reasonably express smaller unit values that are likely to occur, you can change the unit you are using for the integer field to the smallest reasonable unit. For example, durationInSeconds and sizeInBytes.

Even if this is in v1alpha1 we still have to abide by tombstoning rules if we change the serialization of the fields unless we want to move this to a v1alpha2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants