[DO NOT MERGE] chore(emergency-pause): port staging RPC pacing/retries + workflow alerting to prod#1843
[DO NOT MERGE] chore(emergency-pause): port staging RPC pacing/retries + workflow alerting to prod#1843mirooon wants to merge 3 commits into
Conversation
…erting to prod Follow-up to #1828. Brings the prod emergency-pause script and workflow to parity with the staging mirror that was hardened in that PR. Same defensive behavior, prod-specific values preserved (networks.json enumeration, "production" env, PROD slack wording, DiamondPauser team membership enforced, secrets.PRIV_KEY_PAUSER_WALLET). script/utils/diamondEMERGENCYPauseGitHub.sh: - Add rpcCallWithRetry helper (5 attempts, 3s back-off, clean stdout via mktemp, stderr→stdout fallback for universalCall's merged stream) - Tighten pre-pause classifier to ^0x[0-9a-fA-F]{40}$ (was 0x* glob) - Pre-pause check fails closed on unknown state after retries — no pauseDiamond() send on undetermined pause state - Wrap balance + pauserWallet() reads in retry helper - Validate balance is decimal and compare via bc (bash 64-bit arith overflows above ~9.2 ETH in wei) - Send-loop exhaustion falls through to on-chain truth instead of short-circuiting to failure (handles parallel-pauser race) - Final pause check inspects response content for DiamondIsPaused selector with retries — no longer relies on $? alone (read-after-write lag) - Rename local PRIVATE_KEY → PAUSER_PRIVATE_KEY to avoid shadowing the .env PRIVATE_KEY global (per .agents/rules/300-bash.md); thread through to universalCast send call - main(): PID-based exit aggregation (broken jobs -p after no-arg wait) - main(): wait on printStatus PIDs so the summary block isn't truncated - printStatus(): retry on unclear responses; remove dead RPC_URL/RPC_KEY assignments (universalCast resolves its own RPC) .github/workflows/diamondEmergencyPause.yml: - Add actions: read permission (needed for the jobs API call below) - New "Resolve current job URL" step calls gh api to build a deep-link to the current job's log view; falls back to run-level URL - Slack step: if: ${{ always() }} so the alert fires on any exit code (per-network isolation now lets the script return 1 with partial success — on-call must still be paged) - Slack payload switched YAML→JSON so the inline ternary for status emoji interpolates without YAML quoting headaches - Slack message body now includes Success ✅ / Failed ❌ status and the job deep-link URL - Renamed step "Send Reminder to Slack SC-general Channel" → "Notify Slack SC-general channel of pause execution" Validated by staging E2E in #1828 (cycles A/B/C). This PR is a verbatim behavioral port; only the prod-specific strings differ. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
WalkthroughEnhances the emergency-pause workflow to post job-specific deep-link Slack notifications and hardens the pause script with an rpcCallWithRetry helper, retry-on-unclear-read checks, optimistic pause sends plus final owner()-based verification, and deterministic parallel PID/exit-code aggregation. ChangesEmergency Pause Workflow and Script Resilience
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/diamondEmergencyPause.yml:
- Around line 102-106: The Slack notification step named "Notify Slack
SC-general channel of pause execution" currently runs with if: ${{ always() }}
and uses wording that states the pause "was just executed on PROD" even when
prechecks or input/team checks failed; update the message text in that step (and
the analogous step at lines referenced by the reviewer) to use neutral phrasing
such as "pause attempt triggered on PROD" or "pause attempted on PROD — status:
<success|failure>" and include a status indicator so the notification reflects
whether the pause completed or failed rather than asserting it executed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 8cacebfa-9601-447a-933a-dff8458b39d2
📒 Files selected for processing (2)
.github/workflows/diamondEmergencyPause.ymlscript/utils/diamondEMERGENCYPauseGitHub.sh
…ript invocation) - Bump oven-sh/setup-bun pin from 0c5077e5 (# v2) to 3d267786 (# v2.1.2) to match the version already in use by diamondEmergencyPauseStaging.yml. - Update the tspascoal/get-user-teams-membership version comment from `# v3` to `# v3.0.0` for parity with staging (SHA was already identical). - Fix typo in workflow_dispatch warning: PROUCTION → PRODUCTION. - Run the pause script via `bash script/utils/...` instead of `./script/utils/...` so both workflows invoke their respective pause scripts the same way (and we don't depend on the executable bit being preserved across checkouts). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CodeRabbit (cloud) flagged that the Slack message text asserts "was just executed on PROD" even when the workflow fails before the pause script runs - e.g. wrong `warning` input, non-DiamondPauser actor. Because the Slack step uses `if: always()`, those precheck failures still produce a message that reads as if an on-chain pause happened. Keep the always() firing behavior (still want on-call paged on any trigger, including precheck failures) but switch the verb to "triggered" so the message is honest about what actually happened. The status emoji (Success ✅ / Failed ❌) and the deep-link URL still carry the truth so on-call can disambiguate "real pause executed" from "someone fat-fingered the warning input" with one click. Staging carries the same wording today; intentionally left out of scope for this PR (the TEST TEST TEST prefix on staging already signals "don't act on this raw"). Will be aligned in a follow-up cleanup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/diamondEmergencyPause.yml (1)
1-4:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd the required workflow header comment block at file start.
Line 1 starts immediately with
name:. The workflow is missing the mandated header block (purpose, triggers, key behaviors, known limitations) required for this file class.Proposed minimal fix
+## Purpose: Emergency pause all production diamonds and notify on-call with a deep link to job logs. +## Triggers: Manual `workflow_dispatch` with explicit operator acknowledgement. +## Key behaviors: Validates operator/team, executes pause script, always sends Slack status notification. +## Known limitations: Not E2E-tested in prod; partial per-network failures can still produce overall failure status. name: EMERGENCY >> Pause all PROD diamondsAs per coding guidelines, “Each workflow file must start with a header comment block (purpose, triggers, key behaviors, known limitations).”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/diamondEmergencyPause.yml around lines 1 - 4, Add a header comment block at the very top of the workflow file describing purpose, triggers, key behaviors, and known limitations; place it above the existing "name: EMERGENCY >> Pause all PROD diamonds" line and reference the workflow trigger ("on: workflow_dispatch") in the header so readers know what starts this workflow and any important caveats or operational notes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In @.github/workflows/diamondEmergencyPause.yml:
- Around line 1-4: Add a header comment block at the very top of the workflow
file describing purpose, triggers, key behaviors, and known limitations; place
it above the existing "name: EMERGENCY >> Pause all PROD diamonds" line and
reference the workflow trigger ("on: workflow_dispatch") in the header so
readers know what starts this workflow and any important caveats or operational
notes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: abb3e64b-1d43-4082-8224-6b187ccd95de
📒 Files selected for processing (1)
.github/workflows/diamondEmergencyPause.yml
Which Linear task belongs to this PR?
N/A — operational follow-up to #1828 (port the staging emergency-pause hardening to prod).
Why did I implement it this way?
Background
#1828 hardened the staging emergency-pause flow against ~12 separate gaps that would compromise incident response — see that PR's description for the full enumeration and the E2E test plan (Cycles A/B/C on
bsc/arbitrum/optimism/base). The merged staging script + workflow now embody the agreed behavior.That PR's description was explicit about scope, and the Slack review thread reinforced it: staging is the lower-stakes preview that lets us prove the bytes, and the prod mirror must reach parity so that "tested in staging" means "tested the exact bytes that will run in prod." This PR is that parity port.
Scope
Verbatim behavioral port of the staging script + workflow to their prod counterparts. Only prod-specific values differ:
STAGING_NETWORKS=("bsc" "arbitrum" "optimism" "base")networks.json(preserved)getContractAddressFromDeploymentLogsenv"staging""production"universalCast "send"env"staging""production"secrets.PRIV_KEY_PAUSER_WALLET_STAGINGsecrets.PRIV_KEY_PAUSER_WALLET(preserved)DiamondPauserteam checkTEST TEST TEST ATTENTION … STAGING (bsc/arbitrum/optimism/base) …ATTENTION … on PROD …… diamondEMERGENCYPauseStagingGitHub completed… diamondEMERGENCYPause completedEverything else — the
rpcCallWithRetryhelper, the tightened pre-pause regex, fail-closed semantics,bc-based balance comparison, PID-based exit aggregation, send-loop fall-through to on-chain truth, content-based final pause check,PAUSER_PRIVATE_KEYrename to avoid.envshadowing,if: always()on the Slack step, the newResolve current job URLstep + deep-link URL, the status-emoji ternary, and the JSON payload format — is copied byte-for-byte from the staging file.What changes (matched 1:1 to #1828's "Solution" list)
script/utils/diamondEMERGENCYPauseGitHub.sh:rpcCallWithRetryhelper (5 attempts, 3s back-off, clean stdout viamktempfor stderr).^0x[0-9a-fA-F]{40}$(was a0x*glob).$RPC_MAX_ATTEMPTS— nopauseDiamond()send on unknown state.DiamondIsPausedselector instead of relying oncast's exit code.^[0-9]+$then compares viabc(bash 64-bit arithmetic overflows above ~9.2 ETH in wei).main()(jobs -pafter no-argwaitreturns empty and silently swallows per-network failures).local PRIVATE_KEY→local PAUSER_PRIVATE_KEYto avoid shadowing the.envglobal (per.agents/rules/300-bash.md); threaded through to theuniversalCast "send"invocation.printStatus()retries unclear responses; removes deadRPC_URL/RPC_KEYassignments..github/workflows/diamondEmergencyPause.yml:if: ${{ always() }}on the Slack step (the script can now legitimately exit 1 with partial success, and on-call still needs to be paged).Success ✅/Failed ❌via${{ job.status == 'success' && 'Success ✅' || 'Failed ❌' }}.Resolve current job URLstep calls the jobs API, filters by${{ github.job }}, falls back to the run-level URL on failure. Required addingactions: readto the workflow permissions.Send Reminder to Slack SC-general Channel→Notify Slack SC-general channel of pause execution. (Slack payload also switched YAML → JSON so the inline ternary interpolates without YAML quoting headaches.)Testing
This PR is intentionally not E2E-tested against prod diamonds — we cannot exercise prod the way #1828 exercised staging (Cycle B deliberately broke an RPC and verified per-network isolation by partially pausing 3 of 4 staging diamonds — not a thing we can replicate on prod). Two layers of confidence instead:
bash -n script/utils/diamondEMERGENCYPauseGitHub.shis clean. Prettier on the workflow file is clean./pr-ready(local CodeRabbit,--type committed --base origin/main) returned No findings.If a regression is going to bite, it will bite the next time on-call dispatches the staging workflow — and we'll catch it there before the prod workflow gets dispatched in anger. That is precisely why we did staging first.
Checklist before requesting a review
/pr-ready(local CodeRabbit) on this branch and resolved (or explicitly documented) all findings — see.agents/commands/pr-ready.mdChecklist for reviewer (DO NOT DEPLOY and contracts BEFORE CHECKING THIS!!!)