Skip to content

[DO NOT MERGE] chore(emergency-pause): port staging RPC pacing/retries + workflow alerting to prod#1843

Open
mirooon wants to merge 3 commits into
mainfrom
chore/emergency-pause-prod-rpc-pacing
Open

[DO NOT MERGE] chore(emergency-pause): port staging RPC pacing/retries + workflow alerting to prod#1843
mirooon wants to merge 3 commits into
mainfrom
chore/emergency-pause-prod-rpc-pacing

Conversation

@mirooon
Copy link
Copy Markdown
Contributor

@mirooon mirooon commented May 25, 2026

Which Linear task belongs to this PR?

N/A — operational follow-up to #1828 (port the staging emergency-pause hardening to prod).

Why did I implement it this way?

Background

#1828 hardened the staging emergency-pause flow against ~12 separate gaps that would compromise incident response — see that PR's description for the full enumeration and the E2E test plan (Cycles A/B/C on bsc/arbitrum/optimism/base). The merged staging script + workflow now embody the agreed behavior.

That PR's description was explicit about scope, and the Slack review thread reinforced it: staging is the lower-stakes preview that lets us prove the bytes, and the prod mirror must reach parity so that "tested in staging" means "tested the exact bytes that will run in prod." This PR is that parity port.

Scope

Verbatim behavioral port of the staging script + workflow to their prod counterparts. Only prod-specific values differ:

Aspect Staging (already shipped in #1828) Prod (this PR)
Networks iterated hardcoded STAGING_NETWORKS=("bsc" "arbitrum" "optimism" "base") all keys from networks.json (preserved)
getContractAddressFromDeploymentLogs env "staging" "production"
universalCast "send" env "staging" "production"
Workflow secret secrets.PRIV_KEY_PAUSER_WALLET_STAGING secrets.PRIV_KEY_PAUSER_WALLET (preserved)
DiamondPauser team check commented out for staging enforced for prod (preserved)
Slack message text TEST TEST TEST ATTENTION … STAGING (bsc/arbitrum/optimism/base) … ATTENTION … on PROD …
Script trailing log line … diamondEMERGENCYPauseStagingGitHub completed … diamondEMERGENCYPause completed

Everything else — the rpcCallWithRetry helper, the tightened pre-pause regex, fail-closed semantics, bc-based balance comparison, PID-based exit aggregation, send-loop fall-through to on-chain truth, content-based final pause check, PAUSER_PRIVATE_KEY rename to avoid .env shadowing, if: always() on the Slack step, the new Resolve current job URL step + deep-link URL, the status-emoji ternary, and the JSON payload format — is copied byte-for-byte from the staging file.

What changes (matched 1:1 to #1828's "Solution" list)

script/utils/diamondEMERGENCYPauseGitHub.sh:

  1. rpcCallWithRetry helper (5 attempts, 3s back-off, clean stdout via mktemp for stderr).
  2. Pre-pause classifier tightened to ^0x[0-9a-fA-F]{40}$ (was a 0x* glob).
  3. Pre-pause check fails closed after $RPC_MAX_ATTEMPTS — no pauseDiamond() send on unknown state.
  4. Final pause check inspects response content for DiamondIsPaused selector instead of relying on cast's exit code.
  5. Balance check validates ^[0-9]+$ then compares via bc (bash 64-bit arithmetic overflows above ~9.2 ETH in wei).
  6. PID-based exit aggregation in main() (jobs -p after no-arg wait returns empty and silently swallows per-network failures).
  7. Send-loop exhaustion falls through to the on-chain final check — handles the parallel-pauser / runbook race.
  8. local PRIVATE_KEYlocal PAUSER_PRIVATE_KEY to avoid shadowing the .env global (per .agents/rules/300-bash.md); threaded through to the universalCast "send" invocation.
  9. printStatus() retries unclear responses; removes dead RPC_URL/RPC_KEY assignments.

.github/workflows/diamondEmergencyPause.yml:

  1. if: ${{ always() }} on the Slack step (the script can now legitimately exit 1 with partial success, and on-call still needs to be paged).
  2. Status indicator in the message body — Success ✅ / Failed ❌ via ${{ job.status == 'success' && 'Success ✅' || 'Failed ❌' }}.
  3. Job-deep-link URL in the message body — new Resolve current job URL step calls the jobs API, filters by ${{ github.job }}, falls back to the run-level URL on failure. Required adding actions: read to the workflow permissions.
  4. Step renamed Send Reminder to Slack SC-general ChannelNotify Slack SC-general channel of pause execution. (Slack payload also switched YAML → JSON so the inline ternary interpolates without YAML quoting headaches.)

Testing

This PR is intentionally not E2E-tested against prod diamonds — we cannot exercise prod the way #1828 exercised staging (Cycle B deliberately broke an RPC and verified per-network isolation by partially pausing 3 of 4 staging diamonds — not a thing we can replicate on prod). Two layers of confidence instead:

  1. chore(emergency-pause): RPC pacing/retries + staging EmergencyPauseFacet redeploy #1828's E2E cycles validate the exact same code. This is a verbatim port. A diff against the staging script shows only the prod-specific value changes called out in the Scope table; same for the workflow.
  2. Local syntax/format gates pass. bash -n script/utils/diamondEMERGENCYPauseGitHub.sh is clean. Prettier on the workflow file is clean. /pr-ready (local CodeRabbit, --type committed --base origin/main) returned No findings.

If a regression is going to bite, it will bite the next time on-call dispatches the staging workflow — and we'll catch it there before the prod workflow gets dispatched in anger. That is precisely why we did staging first.

Checklist before requesting a review

Checklist for reviewer (DO NOT DEPLOY and contracts BEFORE CHECKING THIS!!!)

  • I have checked that any arbitrary calls to external contracts are validated and or restricted
  • I have checked that any privileged calls (i.e. storage modifications) are validated and or restricted
  • I have ensured that any new contracts have had AT A MINIMUM 1 preliminary audit conducted on by <company/auditor>

…erting to prod

Follow-up to #1828. Brings the prod emergency-pause script and workflow
to parity with the staging mirror that was hardened in that PR. Same
defensive behavior, prod-specific values preserved (networks.json
enumeration, "production" env, PROD slack wording, DiamondPauser team
membership enforced, secrets.PRIV_KEY_PAUSER_WALLET).

script/utils/diamondEMERGENCYPauseGitHub.sh:
- Add rpcCallWithRetry helper (5 attempts, 3s back-off, clean stdout
  via mktemp, stderr→stdout fallback for universalCall's merged stream)
- Tighten pre-pause classifier to ^0x[0-9a-fA-F]{40}$ (was 0x* glob)
- Pre-pause check fails closed on unknown state after retries — no
  pauseDiamond() send on undetermined pause state
- Wrap balance + pauserWallet() reads in retry helper
- Validate balance is decimal and compare via bc (bash 64-bit arith
  overflows above ~9.2 ETH in wei)
- Send-loop exhaustion falls through to on-chain truth instead of
  short-circuiting to failure (handles parallel-pauser race)
- Final pause check inspects response content for DiamondIsPaused
  selector with retries — no longer relies on $? alone (read-after-write
  lag)
- Rename local PRIVATE_KEY → PAUSER_PRIVATE_KEY to avoid shadowing the
  .env PRIVATE_KEY global (per .agents/rules/300-bash.md); thread
  through to universalCast send call
- main(): PID-based exit aggregation (broken jobs -p after no-arg wait)
- main(): wait on printStatus PIDs so the summary block isn't truncated
- printStatus(): retry on unclear responses; remove dead RPC_URL/RPC_KEY
  assignments (universalCast resolves its own RPC)

.github/workflows/diamondEmergencyPause.yml:
- Add actions: read permission (needed for the jobs API call below)
- New "Resolve current job URL" step calls gh api to build a deep-link
  to the current job's log view; falls back to run-level URL
- Slack step: if: ${{ always() }} so the alert fires on any exit code
  (per-network isolation now lets the script return 1 with partial
  success — on-call must still be paged)
- Slack payload switched YAML→JSON so the inline ternary for status
  emoji interpolates without YAML quoting headaches
- Slack message body now includes Success ✅ / Failed ❌ status and
  the job deep-link URL
- Renamed step "Send Reminder to Slack SC-general Channel" →
  "Notify Slack SC-general channel of pause execution"

Validated by staging E2E in #1828 (cycles A/B/C). This PR is a verbatim
behavioral port; only the prod-specific strings differ.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 25, 2026

Review Change Stack

Walkthrough

Enhances the emergency-pause workflow to post job-specific deep-link Slack notifications and hardens the pause script with an rpcCallWithRetry helper, retry-on-unclear-read checks, optimistic pause sends plus final owner()-based verification, and deterministic parallel PID/exit-code aggregation.

Changes

Emergency Pause Workflow and Script Resilience

Layer / File(s) Summary
Workflow notifications with job context
.github/workflows/diamondEmergencyPause.yml
Adds actions: read permission, fixes warning text, pins updated action SHAs, resolves the current job ID via gh api to construct a job-specific logs URL (fallback to run URL), and replaces the previous Slack reminder with an always() notification that posts actor, status emoji, and the resolved job URL.
RPC retry helper and variable safety
script/utils/diamondEMERGENCYPauseGitHub.sh
Adds retry/pacing constants and rpcCallWithRetry helper (separates stdout/stderr for diagnostics); updates header comment and renames local pauser private key variable to avoid shadowing .env.
Pauser address derivation
script/utils/diamondEMERGENCYPauseGitHub.sh
Derives the pauser address using the renamed local pauser private key variable via the cast wallet address invocation.
Pre-pause state and balance verification
script/utils/diamondEMERGENCYPauseGitHub.sh
Reworks pre-pause checks to retry owner() on unclear RPC outputs, adds numeric-validated balance reads via retry helper, and reads pauserWallet() with exhaustion-aware error handling.
Pause execution with canonical final verification
script/utils/diamondEMERGENCYPauseGitHub.sh
Performs optimistic pauseDiamond() sends inside a retry loop without immediate hard-failure, then runs a separate canonical verification loop calling owner() repeatedly to detect the paused selector/DiamondIsPaused, failing only after retries are exhausted while capturing the last observed response.
Status reporting with retry on unclear response
script/utils/diamondEMERGENCYPauseGitHub.sh
Updates printStatus() to retry owner() when responses are unclear and adjusts success/error reporting to distinguish non-paused owner addresses from paused selector/DiamondIsPaused confirmations.
Parallel execution with PID tracking and exit aggregation
script/utils/diamondEMERGENCYPauseGitHub.sh
Modifies main() to launch handleNetwork in background while collecting PIDs, waits each PID to aggregate exit codes deterministically, and awaits all parallel printStatus jobs before exiting.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • lifinance/contracts#1831: Shares on-chain probe usage (owner(), pauserWallet(), DiamondIsPaused) and pause-status checking logic; closely related verification behavior.
  • lifinance/contracts#1828: Introduces a similar rpcCallWithRetry and owner()-based retry-on-unclear-read patterns used here (staging reference).
  • lifinance/contracts#829: Previously modified owner()-based paused-state detection and printStatus logic; directly related to these script changes.

Suggested labels

AuditRequired

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description check ✅ Passed The description is comprehensive, covering Linear task, implementation rationale, scope, detailed changelist, testing approach, and all required checklist items with clear justifications.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title accurately summarizes the main change: porting emergency-pause staging improvements (RPC pacing/retries and workflow alerting) to production.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch chore/emergency-pause-prod-rpc-pacing

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/diamondEmergencyPause.yml:
- Around line 102-106: The Slack notification step named "Notify Slack
SC-general channel of pause execution" currently runs with if: ${{ always() }}
and uses wording that states the pause "was just executed on PROD" even when
prechecks or input/team checks failed; update the message text in that step (and
the analogous step at lines referenced by the reviewer) to use neutral phrasing
such as "pause attempt triggered on PROD" or "pause attempted on PROD — status:
<success|failure>" and include a status indicator so the notification reflects
whether the pause completed or failed rather than asserting it executed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 8cacebfa-9601-447a-933a-dff8458b39d2

📥 Commits

Reviewing files that changed from the base of the PR and between d97f142 and d55bbf1.

📒 Files selected for processing (2)
  • .github/workflows/diamondEmergencyPause.yml
  • script/utils/diamondEMERGENCYPauseGitHub.sh

Comment thread .github/workflows/diamondEmergencyPause.yml
mirooon and others added 2 commits May 25, 2026 23:40
…ript invocation)

- Bump oven-sh/setup-bun pin from 0c5077e5 (# v2) to 3d267786 (# v2.1.2)
  to match the version already in use by diamondEmergencyPauseStaging.yml.
- Update the tspascoal/get-user-teams-membership version comment from `# v3`
  to `# v3.0.0` for parity with staging (SHA was already identical).
- Fix typo in workflow_dispatch warning: PROUCTION → PRODUCTION.
- Run the pause script via `bash script/utils/...` instead of
  `./script/utils/...` so both workflows invoke their respective pause
  scripts the same way (and we don't depend on the executable bit being
  preserved across checkouts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CodeRabbit (cloud) flagged that the Slack message text asserts "was just
executed on PROD" even when the workflow fails before the pause script
runs - e.g. wrong `warning` input, non-DiamondPauser actor. Because the
Slack step uses `if: always()`, those precheck failures still produce a
message that reads as if an on-chain pause happened.

Keep the always() firing behavior (still want on-call paged on any
trigger, including precheck failures) but switch the verb to "triggered"
so the message is honest about what actually happened. The status emoji
(Success ✅ / Failed ❌) and the deep-link URL still carry the truth so
on-call can disambiguate "real pause executed" from "someone fat-fingered
the warning input" with one click.

Staging carries the same wording today; intentionally left out of scope
for this PR (the TEST TEST TEST prefix on staging already signals "don't
act on this raw"). Will be aligned in a follow-up cleanup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/diamondEmergencyPause.yml (1)

1-4: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add the required workflow header comment block at file start.

Line 1 starts immediately with name:. The workflow is missing the mandated header block (purpose, triggers, key behaviors, known limitations) required for this file class.

Proposed minimal fix
+## Purpose: Emergency pause all production diamonds and notify on-call with a deep link to job logs.
+## Triggers: Manual `workflow_dispatch` with explicit operator acknowledgement.
+## Key behaviors: Validates operator/team, executes pause script, always sends Slack status notification.
+## Known limitations: Not E2E-tested in prod; partial per-network failures can still produce overall failure status.
 name: EMERGENCY >> Pause all PROD diamonds

As per coding guidelines, “Each workflow file must start with a header comment block (purpose, triggers, key behaviors, known limitations).”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/diamondEmergencyPause.yml around lines 1 - 4, Add a header
comment block at the very top of the workflow file describing purpose, triggers,
key behaviors, and known limitations; place it above the existing "name:
EMERGENCY >> Pause all PROD diamonds" line and reference the workflow trigger
("on: workflow_dispatch") in the header so readers know what starts this
workflow and any important caveats or operational notes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In @.github/workflows/diamondEmergencyPause.yml:
- Around line 1-4: Add a header comment block at the very top of the workflow
file describing purpose, triggers, key behaviors, and known limitations; place
it above the existing "name: EMERGENCY >> Pause all PROD diamonds" line and
reference the workflow trigger ("on: workflow_dispatch") in the header so
readers know what starts this workflow and any important caveats or operational
notes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: abb3e64b-1d43-4082-8224-6b187ccd95de

📥 Commits

Reviewing files that changed from the base of the PR and between f210cdb and 0e076aa.

📒 Files selected for processing (1)
  • .github/workflows/diamondEmergencyPause.yml

@mirooon mirooon marked this pull request as ready for review May 25, 2026 22:02
@mirooon mirooon changed the title chore(emergency-pause): port staging RPC pacing/retries + workflow alerting to prod [DO NOT MERGE] chore(emergency-pause): port staging RPC pacing/retries + workflow alerting to prod May 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants