feat(aro-hcp): add periodic Grafana datasource cleanup job (AROSLSRE-1138)#80947

Open
cssjr wants to merge 2 commits into openshift:main from cssjr:aroslsre-1138-grafana-cleanup-periodic

Conversation

@cssjr cssjr commented Jun 24, 2026

Summary

  • Adds a monthly Prow periodic job that runs grafanactl clean datasources and grafanactl clean fixup-datasources against the DEV Grafana instance (arohcp-dev in resource group global, subscription 1d3378d3-...)
  • Removes orphaned Prometheus datasources left by personal dev environments
  • Follows the existing cleanup-sweeper step pattern in the step registry
  • Reports failures to #aro-hcp-failures-dev Slack channel

New files

  • ci-operator/step-registry/aro-hcp/deprovision/grafana-datasources/ — step registry entry (commands script, ref YAML, metadata, OWNERS)

Modified files

  • ci-operator/config/Azure/ARO-HCP/Azure-ARO-HCP-main__periodic-cleanup.yaml — added clean-grafana-datasources entry with monthly cron (0 6 1 * *)
  • ci-operator/jobs/Azure/ARO-HCP/Azure-ARO-HCP-main-periodics.yaml — auto-regenerated via make jobs
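
For readers less familiar with cron syntax, the following is a minimal sketch that labels the five fields of the schedule quoted above. The helper name `describe_cron` is hypothetical and is not part of this PR.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Split a standard five-field cron expression and label each field.
# describe_cron is an illustrative helper, not something this PR adds.
describe_cron() {
  local minute hour dom month dow
  read -r minute hour dom month dow <<< "$1"
  echo "minute=${minute} hour=${hour} day-of-month=${dom} month=${month} day-of-week=${dow}"
}

describe_cron '0 6 1 * *'
# prints: minute=0 hour=6 day-of-month=1 month=* day-of-week=*
```

Read this way, the job fires at minute 0 of hour 6 on day 1 of every month, regardless of weekday.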

Context

  • Jira: AROSLSRE-1138
  • One-time manual cleanup removed ~3,450 orphaned datasources on 2026-06-08
  • Growth source is personal dev environments only (CI no longer creates datasources)
  • grafanactl lives in Azure/ARO-Tools, entry point in Azure/ARO-HCP

Test plan

  • pj-rehearse validates the job can be scheduled and run
  • Verify grafanactl builds and authenticates in the Prow container
  • Confirm both clean datasources and clean fixup-datasources execute successfully
  • After merge, manually trigger and verify via Prow UI

🤖 Generated with Claude Code

Summary by CodeRabbit

This PR adds a new monthly Prow periodic job to the ARO-HCP CI infrastructure that automatically cleans up orphaned Grafana datasources from the DEV Grafana instance (arohcp-dev).

What changes (practically):

  • New periodic job: Updates ci-operator/config/Azure/ARO-HCP/Azure-ARO-HCP-main__periodic-cleanup.yaml to schedule a clean-grafana-datasources run on the 1st day of every month at 06:00 UTC (0 6 1 * *). The job reports both failures and errors to the #aro-hcp-failures-dev Slack channel.
  • New step-registry deprovision step: Adds a new step-registry entry under ci-operator/step-registry/aro-hcp/deprovision/grafana-datasources/ that:
    • Builds grafanactl, then runs grafanactl clean datasources followed by grafanactl clean fixup-datasources.
    • Authenticates to Azure using service principal credentials sourced from cluster-mounted vault secrets, including setting AZURE_TOKEN_CREDENTIALS=prod (so DefaultAzureCredential selects environment-based credentials).
    • Targets the ARO-HCP DEV Grafana resources (via GRAFANA_RESOURCE_GROUP and GRAFANA_NAME) and removes stale/orphaned Managed_Prometheus_* datasources that are not backed by a live Azure Monitor Workspace.
    • Defines resource requests and the credential mount used to access the cluster secrets, plus registers step metadata and OWNERS for review/approval.
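
The ordering of the two cleanup passes described above could look roughly like the sketch below. It is not the script from this PR: the function name `run_cleanup_passes` is hypothetical, and any flags the real commands need (for example, to select the Grafana instance from GRAFANA_RESOURCE_GROUP and GRAFANA_NAME) are left out because they are not shown in this conversation.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Run both cleanup passes in order; the second pass runs only if the first
# succeeds. The binary is a parameter so the function can be exercised with
# a stub. Flags required by the real tool are intentionally omitted.
run_cleanup_passes() {
  local ctl="${1:-grafanactl}"
  echo "Running: ${ctl} clean datasources"
  "${ctl}" clean datasources || return 1
  echo "Running: ${ctl} clean fixup-datasources"
  "${ctl}" clean fixup-datasources
}
```

Calling `run_cleanup_passes` with no argument assumes a `grafanactl` binary is already on the PATH; passing a path to a freshly built binary works the same way.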

Regenerated CI wiring:

  • ci-operator/jobs/Azure/ARO-HCP/Azure-ARO-HCP-main-periodics.yaml is updated automatically via make jobs to include the new periodic definition.

Impact:

  • Prevents ongoing growth of orphaned Prometheus datasources caused by personal development environments. A prior one-time manual cleanup on 2026-06-08 removed ~3,450 orphaned datasources; this automation keeps them from accumulating again on the DEV Grafana instance without manual intervention.

…1138)

Add a monthly Prow periodic that runs grafanactl clean datasources and
clean fixup-datasources against the DEV Grafana instance to remove
orphaned Prometheus datasources left by personal dev environments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
openshift-ci Bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Jun 24, 2026

openshift-ci Bot commented Jun 24, 2026

Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai Bot commented Jun 24, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 1e19f6ca-fdd9-4bc6-8782-6c532e260959

📥 Commits

Reviewing files that changed from the base of the PR and between 0f029a8 and 4fa2718.

📒 Files selected for processing (1)
  • ci-operator/step-registry/aro-hcp/deprovision/grafana-datasources/aro-hcp-deprovision-grafana-datasources-commands.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • ci-operator/step-registry/aro-hcp/deprovision/grafana-datasources/aro-hcp-deprovision-grafana-datasources-commands.sh

Walkthrough

A new step-registry entry (aro-hcp-deprovision-grafana-datasources) is added with a bash script that authenticates to Azure, builds a grafanactl binary, and runs two cleanup passes against Azure Managed Grafana. A monthly periodic CI job (clean-grafana-datasources) is registered to invoke this step.

Changes

ARO HCP Grafana Datasource Cleanup

| Layer | File(s) | Summary |
| --- | --- | --- |
| Step registry: ref, commands, metadata, and owners | ci-operator/step-registry/aro-hcp/deprovision/grafana-datasources/aro-hcp-deprovision-grafana-datasources-ref.yaml, ...commands.sh, ...ref.metadata.json, OWNERS | Defines the new deprovisioning step: ref YAML configures resource requests (100m CPU/300Mi), a cluster-secrets-aro-hcp-dev credentials mount, and env vars (VAULT_SECRET_PROFILE, GRAFANA_RESOURCE_GROUP, GRAFANA_NAME); the command script reads Azure SP credentials from the profile directory, authenticates via az login, builds grafanactl, then runs clean datasources and clean fixup-datasources; metadata JSON and OWNERS list approvers/reviewers. |
| Periodic cleanup job wiring | ci-operator/config/Azure/ARO-HCP/Azure-ARO-HCP-main__periodic-cleanup.yaml | Adds the clean-grafana-datasources test entry with cron 0 6 1 * *, failure reporting to #aro-hcp-failures-dev on failure and error states, and a single step referencing aro-hcp-deprovision-grafana-datasources. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

lgtm, approved, rehearsals-ack, jira/valid-reference

🚥 Pre-merge checks | ✅ 15
✅ Passed checks (15 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat(aro-hcp): add periodic Grafana datasource cleanup job (AROSLSRE-1138)' directly and accurately summarizes the main change: adding a periodic Grafana datasource cleanup job to ARO-HCP's Prow configuration. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | This PR contains no Ginkgo tests. All modified files are CI/CD configuration (YAML), scripts (shell), and metadata files. The check is not applicable to non-test code. |
| Test Structure And Quality | ✅ Passed | PR contains no Ginkgo test code; check is not applicable to CI configuration and infrastructure files being added. |
| Microshift Test Compatibility | ✅ Passed | This PR adds Prow CI configuration and cleanup scripts, not Ginkgo e2e tests. The check only applies when new e2e tests are added, so it is not applicable here. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | This PR adds no Ginkgo e2e tests. It only adds CI infrastructure files (YAML configs, bash scripts, OWNERS), so the SNO compatibility check is not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | This PR modifies only Prow ci-operator configuration and test step registry entries (CI/CD system files), not deployment manifests, operator code, or controllers deployed to user clusters. The topolo... |
| Ote Binary Stdout Contract | ✅ Passed | PR does not add any OTE binaries or Go test code. It only adds Prow job configurations and a bash script for job execution, which are not subject to the OTE Binary Stdout Contract. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | This PR adds CI/CD infrastructure files (Prow job configs, bash scripts, metadata) but no Ginkgo e2e tests; the custom check only applies to Ginkgo e2e tests. |
| No-Weak-Crypto | ✅ Passed | PR introduces only configuration and bash scripts for Grafana datasource cleanup. No weak cryptographic algorithms (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or non-... |
| Container-Privileges | ✅ Passed | No privileged container configurations found: no privileged: true, hostPID/Network/IPC, SYS_ADMIN capabilities, or allowPrivilegeEscalation settings detected in any YAML or bash files. |
| No-Sensitive-Data-In-Logs | ✅ Passed | Script uses --output none on az login and only logs non-sensitive progress messages; credentials loaded from secure files and never echoed or logged. |

openshift-ci Bot commented Jun 24, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cssjr
Once this PR has been reviewed and has the lgtm label, please assign geoberle for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

cssjr commented Jun 24, 2026

Author

/pj-rehearse

@openshift-merge-bot

Contributor

@cssjr: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

The Azure SDK's DefaultAzureCredential with RequireAzureTokenCredentials
requires AZURE_TOKEN_CREDENTIALS to select credential sources. Setting
it to "prod" enables EnvironmentCredential (AZURE_CLIENT_ID/SECRET/TENANT),
which is how all ARO-HCP Prow steps authenticate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
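
As an illustration of the authentication setup this commit message describes, here is a minimal sketch that exports the variables EnvironmentCredential reads and then logs the Azure CLI in with the same service principal. It is not the script from this PR: the function names and the secret file names (`client-id`, `client-secret`, `tenant`) are assumptions, and the real vault profile may use different names.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Export the variables EnvironmentCredential reads, taken from files in a
# mounted secret directory. The file names below are hypothetical; use
# whatever names the vault profile actually provides.
load_sp_credentials() {
  local dir="$1"
  AZURE_CLIENT_ID="$(<"${dir}/client-id")"
  AZURE_CLIENT_SECRET="$(<"${dir}/client-secret")"
  AZURE_TENANT_ID="$(<"${dir}/tenant")"
  # "prod" is the value this PR sets so that DefaultAzureCredential selects
  # environment-based credentials.
  AZURE_TOKEN_CREDENTIALS="prod"
  export AZURE_CLIENT_ID AZURE_CLIENT_SECRET AZURE_TENANT_ID AZURE_TOKEN_CREDENTIALS
}

# Log the Azure CLI in as the service principal without printing account
# details to the job log.
azure_cli_login() {
  az login --service-principal \
    --username "${AZURE_CLIENT_ID}" \
    --password "${AZURE_CLIENT_SECRET}" \
    --tenant "${AZURE_TENANT_ID}" \
    --output none
}
```

A step script would call `load_sp_credentials "${PROFILE_DIR}"` followed by `azure_cli_login`, where `PROFILE_DIR` is a placeholder for the mounted profile directory.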

cssjr commented Jun 24, 2026

Author

/pj-rehearse

@openshift-merge-bot

Contributor

@cssjr: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-merge-bot

Contributor

[REHEARSALNOTIFIER]
@cssjr: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

| Test name | Repo | Type | Reason |
| --- | --- | --- | --- |
| periodic-ci-Azure-ARO-HCP-main-periodic-cleanup-clean-grafana-datasources | N/A | periodic | Periodic changed |

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

cssjr marked this pull request as ready for review on June 24, 2026 02:14
openshift-ci Bot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Jun 24, 2026
openshift-ci Bot requested review from geoberle and mmazur on June 24, 2026 02:14

openshift-ci Bot commented Jun 24, 2026

Contributor

@cssjr: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/step-registry-metadata | 4fa2718 | link | true | /test step-registry-metadata |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
