Monitoring API: remove maximum retention days prometheus#2851
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Hello @marioferh! Some important instructions when contributing to openshift/api: |
|
Caution Review failedFailed to post review comments 📝 WalkthroughWalkthroughThis PR updates the Retention.DurationInDays kubebuilder markers to enforce a minimum of 1 day and removes the previous maximum of 365 days. The generated ClusterMonitoring CRD schema was updated to remove the Suggested reviewers
🚥 Pre-merge checks | ✅ 11 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.12.2)Error: build linters: unable to load custom analyzer "kubeapilinter": tools/_output/bin/kube-api-linter.so, plugin: not implemented Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
80e0c7d to
03a1de9
Compare
|
While this should be ok not having a tombstone as we're just loosening limits, the openapi.json diff includes unrelated changes (NetworkObservabilitySpec, registries descriptions, softirqs collector, KMS default) which are likely from other PRs that merged into master since the branch was created. Could you rebase on latest master so the openapi.json diff only shows the PR changes? This would avoid confusion during review and prevent accidentally shipping stale generated data. |
Remove Maximum validation and documentation for durationInDays (365) and sizeInGiB (16384) on Prometheus Retention. We do not have enough customer or operational data to pick defensible API-level ceilings; arbitrary caps could block valid large-cluster configurations or misalign with what CMO/Prometheus can actually support. Signed-off-by: Mario Fernandez <mariofer@redhat.com>
03a1de9 to
854e0ad
Compare
|
Hey @marioferh, as commented before we go with just removing the max constraints, I think we should also address the granularity issue that Ayoub and Simon raised. Right now durationInDays: int32 can't express sub-day retention like 15h or 24h (which is the default for UWM), and sizeInGiB: int32 can't express sub-GiB sizes like 500MiB. The CMO ConfigMap and the Prometheus Operator CRD both accept freeform Prometheus duration/size strings, so customers migrating from retention: "15h" would have no equivalent in the new CRD. I think we should push back on the integer-only convention for these two fields and switch to string fields with the same Prometheus pattern validation that the Prometheus Operator CRD uses: duration: string with pattern ^(0|(([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?)$ @everettraven I know we discussed avoiding duration strings back in #2653, but this comes from the Prometheus Operator CR itself, which validates retention with ^(0|(([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?)$ and size with (^0|([0-9]*[.])?[0-9]+((K|M|G|T|E|P)i?)?B)$ (https://github.com/openshift/cluster-monitoring-operator/blob/0ce8e4c0bce00381802c25d92ed238775139ee82/assets/monitoring-operator/00_0prometheus-custom-resource-definition.yaml). CMO passes these values straight through to the Prometheus CR, so using durationInDays: int32 and sizeInGiB: int32 introduces a lossy conversion layer that breaks existing configs like retention: "15h" or retentionSize: "500MiB" that customers can set today in the ConfigMap. Would you be open to using string fields with these same Prometheus patterns for this specific case? They're fully validatable at admission time via the regex, and since this is v1alpha1 behind a feature gate we can still change it. Worst case, as a fallback, we may set hours, but wanted to get back your thoughts on this! Thanks! |
|
@marioferh: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
| @@ -4152,9 +4150,7 @@ spec: | |||
| When the limit is reached, Prometheus will delete oldest data first. | |||
| When omitted, no size limit is enforced and Prometheus uses available PersistentVolume capacity. | |||
| Minimum value is 1 GiB. | |||
| Maximum value is 16384 GiB (16 TiB). | |||
| format: int32 | |||
| maximum: 16384 | |||
There was a problem hiding this comment.
We strongly recommend having reasonable upper bounds here. If nothing else, at least a maximum of 2,147,483,647 (the int32 maximum).
It is also worth noting that once you set a maximum and it GAs, we won't let you decrease it without strong evidence that it does not break user workflows.
It is usually easier for us to increase maximum constraints as needed than to add one or decrease it.
There was a problem hiding this comment.
Also noting here that 2,147,483,647 GiB is ~ 2047 PiB. That seems pretty high...
| Minimum value is 1 day. | ||
| Maximum value is 365 days (1 year). | ||
| format: int32 | ||
| maximum: 365 |
There was a problem hiding this comment.
Just a note, a maximum of 2,147,483,647 here is equivalent to ~58835 centuries. Seems like an unreasonably high number that would never be set.
Without more context, I would think it to be reasonable to increase this to something like 3650 to account for ~10 years of retention. That seems pretty high to me for metrics, but 🤷
We strongly encourage not using durations in favor of a unit-based integer, but if you folks feel strongly that this is the best UX for your end-users I'm not going to prevent you from doing so. I'm not the expert on your user base here, you folks are :). That being said, if the concern is solely that you cannot reasonably express smaller unit values that are likely to occur, you can change the unit you are using for the integer field to the smallest reasonable unit. For example, Even if this is in v1alpha1 we still have to abide by tombstoning rules if we change the serialization of the fields unless we want to move this to a v1alpha2. |
Remove Maximum validation and documentation for durationInDays (365)
and sizeInGiB (16384) on Prometheus Retention. We do not have enough
customer or operational data to pick defensible API-level ceilings;
arbitrary caps could block valid large-cluster configurations or
misalign with what CMO/Prometheus can actually support.