[DO NOT MERGE] chore(emergency-pause): port staging RPC pacing/retries + workflow alerting to prod by mirooon · Pull Request #1843 · lifinance/contracts

mirooon · 2026-05-25T18:31:15Z

Which Linear task belongs to this PR?

N/A — operational follow-up to #1828 (port the staging emergency-pause hardening to prod).

Why did I implement it this way?

Background

#1828 hardened the staging emergency-pause flow against ~12 separate gaps that would compromise incident response — see that PR's description for the full enumeration and the E2E test plan (Cycles A/B/C on bsc/arbitrum/optimism/base). The merged staging script + workflow now embody the agreed behavior.

That PR's description was explicit about scope, and the Slack review thread reinforced it: staging is the lower-stakes preview that lets us prove the bytes, and the prod mirror must reach parity so that "tested in staging" means "tested the exact bytes that will run in prod." This PR is that parity port.

Scope

Verbatim behavioral port of the staging script + workflow to their prod counterparts. Only prod-specific values differ:

Aspect	Staging (already shipped in #1828)	Prod (this PR)
Networks iterated	hardcoded `STAGING_NETWORKS=("bsc" "arbitrum" "optimism" "base")`	all keys from `networks.json` (preserved)
`getContractAddressFromDeploymentLogs` env	`"staging"`	`"production"`
`universalCast "send"` env	`"staging"`	`"production"`
Workflow secret	`secrets.PRIV_KEY_PAUSER_WALLET_STAGING`	`secrets.PRIV_KEY_PAUSER_WALLET` (preserved)
`DiamondPauser` team check	commented out for staging	enforced for prod (preserved)
Slack message text	`TEST TEST TEST ATTENTION … STAGING (bsc/arbitrum/optimism/base) …`	`ATTENTION … on PROD …`
Script trailing log line	`… diamondEMERGENCYPauseStagingGitHub completed`	`… diamondEMERGENCYPause completed`

Everything else — the rpcCallWithRetry helper, the tightened pre-pause regex, fail-closed semantics, bc-based balance comparison, PID-based exit aggregation, send-loop fall-through to on-chain truth, content-based final pause check, PAUSER_PRIVATE_KEY rename to avoid .env shadowing, if: always() on the Slack step, the new Resolve current job URL step + deep-link URL, the status-emoji ternary, and the JSON payload format — is copied byte-for-byte from the staging file.

What changes (matched 1:1 to #1828's "Solution" list)

script/utils/diamondEMERGENCYPauseGitHub.sh:

rpcCallWithRetry helper (5 attempts, 3s back-off, clean stdout via mktemp for stderr).
Pre-pause classifier tightened to ^0x[0-9a-fA-F]{40}$ (was a 0x* glob).
Pre-pause check fails closed after $RPC_MAX_ATTEMPTS — no pauseDiamond() send on unknown state.
Final pause check inspects response content for DiamondIsPaused selector instead of relying on cast's exit code.
Balance check validates ^[0-9]+$ then compares via bc (bash 64-bit arithmetic overflows above ~9.2 ETH in wei).
PID-based exit aggregation in main() (jobs -p after no-arg wait returns empty and silently swallows per-network failures).
Send-loop exhaustion falls through to the on-chain final check — handles the parallel-pauser / runbook race.
local PRIVATE_KEY → local PAUSER_PRIVATE_KEY to avoid shadowing the .env global (per .agents/rules/300-bash.md); threaded through to the universalCast "send" invocation.
printStatus() retries unclear responses; removes dead RPC_URL/RPC_KEY assignments.

.github/workflows/diamondEmergencyPause.yml:

if: ${{ always() }} on the Slack step (the script can now legitimately exit 1 with partial success, and on-call still needs to be paged).
Status indicator in the message body — Success ✅ / Failed ❌ via ${{ job.status == 'success' && 'Success ✅' || 'Failed ❌' }}.
Job-deep-link URL in the message body — new Resolve current job URL step calls the jobs API, filters by ${{ github.job }}, falls back to the run-level URL on failure. Required adding actions: read to the workflow permissions.
Step renamed Send Reminder to Slack SC-general Channel → Notify Slack SC-general channel of pause execution. (Slack payload also switched YAML → JSON so the inline ternary interpolates without YAML quoting headaches.)

Testing

This PR is intentionally not E2E-tested against prod diamonds — we cannot exercise prod the way #1828 exercised staging (Cycle B deliberately broke an RPC and verified per-network isolation by partially pausing 3 of 4 staging diamonds — not a thing we can replicate on prod). Two layers of confidence instead:

chore(emergency-pause): RPC pacing/retries + staging EmergencyPauseFacet redeploy #1828's E2E cycles validate the exact same code. This is a verbatim port. A diff against the staging script shows only the prod-specific value changes called out in the Scope table; same for the workflow.
Local syntax/format gates pass. bash -n script/utils/diamondEMERGENCYPauseGitHub.sh is clean. Prettier on the workflow file is clean. /pr-ready (local CodeRabbit, --type committed --base origin/main) returned No findings.

If a regression is going to bite, it will bite the next time on-call dispatches the staging workflow — and we'll catch it there before the prod workflow gets dispatched in anger. That is precisely why we did staging first.

Checklist before requesting a review

I have performed a self-review of my code
This pull request is as small as possible and only tackles one problem
I have run /pr-ready (local CodeRabbit) on this branch and resolved (or explicitly documented) all findings — see .agents/commands/pr-ready.md
I have added tests that cover the functionality / test the bug (chore(emergency-pause): RPC pacing/retries + staging EmergencyPauseFacet redeploy #1828's E2E cycles validate the identical code path; prod cannot be E2E-tested without pausing prod diamonds)
For new facets: I have checked all points from this list: https://www.notion.so/lifi/New-Facet-Contract-Checklist-157f0ff14ac78095a2b8f999d655622e
I have updated any required documentation

Checklist for reviewer (DO NOT DEPLOY and contracts BEFORE CHECKING THIS!!!)

I have checked that any arbitrary calls to external contracts are validated and or restricted
I have checked that any privileged calls (i.e. storage modifications) are validated and or restricted
I have ensured that any new contracts have had AT A MINIMUM 1 preliminary audit conducted on by <company/auditor>

…erting to prod Follow-up to #1828. Brings the prod emergency-pause script and workflow to parity with the staging mirror that was hardened in that PR. Same defensive behavior, prod-specific values preserved (networks.json enumeration, "production" env, PROD slack wording, DiamondPauser team membership enforced, secrets.PRIV_KEY_PAUSER_WALLET). script/utils/diamondEMERGENCYPauseGitHub.sh: - Add rpcCallWithRetry helper (5 attempts, 3s back-off, clean stdout via mktemp, stderr→stdout fallback for universalCall's merged stream) - Tighten pre-pause classifier to ^0x[0-9a-fA-F]{40}$ (was 0x* glob) - Pre-pause check fails closed on unknown state after retries — no pauseDiamond() send on undetermined pause state - Wrap balance + pauserWallet() reads in retry helper - Validate balance is decimal and compare via bc (bash 64-bit arith overflows above ~9.2 ETH in wei) - Send-loop exhaustion falls through to on-chain truth instead of short-circuiting to failure (handles parallel-pauser race) - Final pause check inspects response content for DiamondIsPaused selector with retries — no longer relies on $? alone (read-after-write lag) - Rename local PRIVATE_KEY → PAUSER_PRIVATE_KEY to avoid shadowing the .env PRIVATE_KEY global (per .agents/rules/300-bash.md); thread through to universalCast send call - main(): PID-based exit aggregation (broken jobs -p after no-arg wait) - main(): wait on printStatus PIDs so the summary block isn't truncated - printStatus(): retry on unclear responses; remove dead RPC_URL/RPC_KEY assignments (universalCast resolves its own RPC) .github/workflows/diamondEmergencyPause.yml: - Add actions: read permission (needed for the jobs API call below) - New "Resolve current job URL" step calls gh api to build a deep-link to the current job's log view; falls back to run-level URL - Slack step: if: ${{ always() }} so the alert fires on any exit code (per-network isolation now lets the script return 1 with partial success — on-call must still be paged) - Slack payload switched YAML→JSON so the inline ternary for status emoji interpolates without YAML quoting headaches - Slack message body now includes Success ✅ / Failed ❌ status and the job deep-link URL - Renamed step "Send Reminder to Slack SC-general Channel" → "Notify Slack SC-general channel of pause execution" Validated by staging E2E in #1828 (cycles A/B/C). This PR is a verbatim behavioral port; only the prod-specific strings differ. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-25T18:31:28Z

Walkthrough

Enhances the emergency-pause workflow to post job-specific deep-link Slack notifications and hardens the pause script with an rpcCallWithRetry helper, retry-on-unclear-read checks, optimistic pause sends plus final owner()-based verification, and deterministic parallel PID/exit-code aggregation.

Changes

Emergency Pause Workflow and Script Resilience

Layer / File(s)	Summary
Workflow notifications with job context `.github/workflows/diamondEmergencyPause.yml`	Adds `actions: read` permission, fixes warning text, pins updated action SHAs, resolves the current job ID via `gh api` to construct a job-specific logs URL (fallback to run URL), and replaces the previous Slack reminder with an `always()` notification that posts actor, status emoji, and the resolved job URL.
RPC retry helper and variable safety `script/utils/diamondEMERGENCYPauseGitHub.sh`	Adds retry/pacing constants and `rpcCallWithRetry` helper (separates stdout/stderr for diagnostics); updates header comment and renames local pauser private key variable to avoid shadowing `.env`.
Pauser address derivation `script/utils/diamondEMERGENCYPauseGitHub.sh`	Derives the pauser address using the renamed local pauser private key variable via the `cast wallet address` invocation.
Pre-pause state and balance verification `script/utils/diamondEMERGENCYPauseGitHub.sh`	Reworks pre-pause checks to retry `owner()` on unclear RPC outputs, adds numeric-validated balance reads via retry helper, and reads `pauserWallet()` with exhaustion-aware error handling.
Pause execution with canonical final verification `script/utils/diamondEMERGENCYPauseGitHub.sh`	Performs optimistic `pauseDiamond()` sends inside a retry loop without immediate hard-failure, then runs a separate canonical verification loop calling `owner()` repeatedly to detect the paused selector/`DiamondIsPaused`, failing only after retries are exhausted while capturing the last observed response.
Status reporting with retry on unclear response `script/utils/diamondEMERGENCYPauseGitHub.sh`	Updates `printStatus()` to retry `owner()` when responses are unclear and adjusts success/error reporting to distinguish non-paused owner addresses from paused selector/`DiamondIsPaused` confirmations.
Parallel execution with PID tracking and exit aggregation `script/utils/diamondEMERGENCYPauseGitHub.sh`	Modifies `main()` to launch `handleNetwork` in background while collecting PIDs, waits each PID to aggregate exit codes deterministically, and awaits all parallel `printStatus` jobs before exiting.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

lifinance/contracts#1831: Shares on-chain probe usage (owner(), pauserWallet(), DiamondIsPaused) and pause-status checking logic; closely related verification behavior.
lifinance/contracts#1828: Introduces a similar rpcCallWithRetry and owner()-based retry-on-unclear-read patterns used here (staging reference).
lifinance/contracts#829: Previously modified owner()-based paused-state detection and printStatus logic; directly related to these script changes.

Suggested labels

AuditRequired

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description check	✅ Passed	The description is comprehensive, covering Linear task, implementation rationale, scope, detailed changelist, testing approach, and all required checklist items with clear justifications.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title accurately summarizes the main change: porting emergency-pause staging improvements (RPC pacing/retries and workflow alerting) to production.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch chore/emergency-pause-prod-rpc-pacing

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/diamondEmergencyPause.yml:
- Around line 102-106: The Slack notification step named "Notify Slack
SC-general channel of pause execution" currently runs with if: ${{ always() }}
and uses wording that states the pause "was just executed on PROD" even when
prechecks or input/team checks failed; update the message text in that step (and
the analogous step at lines referenced by the reviewer) to use neutral phrasing
such as "pause attempt triggered on PROD" or "pause attempted on PROD — status:
<success|failure>" and include a status indicator so the notification reflects
whether the pause completed or failed rather than asserting it executed.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 8cacebfa-9601-447a-933a-dff8458b39d2

📥 Commits

Reviewing files that changed from the base of the PR and between d97f142 and d55bbf1.

📒 Files selected for processing (2)

.github/workflows/diamondEmergencyPause.yml
script/utils/diamondEMERGENCYPauseGitHub.sh

…ript invocation) - Bump oven-sh/setup-bun pin from 0c5077e5 (# v2) to 3d267786 (# v2.1.2) to match the version already in use by diamondEmergencyPauseStaging.yml. - Update the tspascoal/get-user-teams-membership version comment from `# v3` to `# v3.0.0` for parity with staging (SHA was already identical). - Fix typo in workflow_dispatch warning: PROUCTION → PRODUCTION. - Run the pause script via `bash script/utils/...` instead of `./script/utils/...` so both workflows invoke their respective pause scripts the same way (and we don't depend on the executable bit being preserved across checkouts). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CodeRabbit (cloud) flagged that the Slack message text asserts "was just executed on PROD" even when the workflow fails before the pause script runs - e.g. wrong `warning` input, non-DiamondPauser actor. Because the Slack step uses `if: always()`, those precheck failures still produce a message that reads as if an on-chain pause happened. Keep the always() firing behavior (still want on-call paged on any trigger, including precheck failures) but switch the verb to "triggered" so the message is honest about what actually happened. The status emoji (Success ✅ / Failed ❌) and the deep-link URL still carry the truth so on-call can disambiguate "real pause executed" from "someone fat-fingered the warning input" with one click. Staging carries the same wording today; intentionally left out of scope for this PR (the TEST TEST TEST prefix on staging already signals "don't act on this raw"). Will be aligned in a follow-up cleanup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

.github/workflows/diamondEmergencyPause.yml (1)
1-4: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add the required workflow header comment block at file start.

Line 1 starts immediately with name:. The workflow is missing the mandated header block (purpose, triggers, key behaviors, known limitations) required for this file class.
Proposed minimal fix
+## Purpose: Emergency pause all production diamonds and notify on-call with a deep link to job logs.
+## Triggers: Manual `workflow_dispatch` with explicit operator acknowledgement.
+## Key behaviors: Validates operator/team, executes pause script, always sends Slack status notification.
+## Known limitations: Not E2E-tested in prod; partial per-network failures can still produce overall failure status.
 name: EMERGENCY >> Pause all PROD diamonds
As per coding guidelines, “Each workflow file must start with a header comment block (purpose, triggers, key behaviors, known limitations).”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/diamondEmergencyPause.yml around lines 1 - 4, Add a header
comment block at the very top of the workflow file describing purpose, triggers,
key behaviors, and known limitations; place it above the existing "name:
EMERGENCY >> Pause all PROD diamonds" line and reference the workflow trigger
("on: workflow_dispatch") in the header so readers know what starts this
workflow and any important caveats or operational notes.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In @.github/workflows/diamondEmergencyPause.yml:
- Around line 1-4: Add a header comment block at the very top of the workflow
file describing purpose, triggers, key behaviors, and known limitations; place
it above the existing "name: EMERGENCY >> Pause all PROD diamonds" line and
reference the workflow trigger ("on: workflow_dispatch") in the header so
readers know what starts this workflow and any important caveats or operational
notes.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: abb3e64b-1d43-4082-8224-6b187ccd95de

📥 Commits

Reviewing files that changed from the base of the PR and between f210cdb and 0e076aa.

📒 Files selected for processing (1)

.github/workflows/diamondEmergencyPause.yml

coderabbitai Bot reviewed May 25, 2026

View reviewed changes

Comment thread .github/workflows/diamondEmergencyPause.yml

mirooon and others added 2 commits May 25, 2026 23:40

coderabbitai Bot reviewed May 25, 2026

View reviewed changes

mirooon marked this pull request as ready for review May 25, 2026 22:02

mirooon changed the title ~~chore(emergency-pause): port staging RPC pacing/retries + workflow alerting to prod~~ [DO NOT MERGE] chore(emergency-pause): port staging RPC pacing/retries + workflow alerting to prod May 25, 2026

lifi-action-bot added the AuditNotRequired label May 25, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DO NOT MERGE] chore(emergency-pause): port staging RPC pacing/retries + workflow alerting to prod#1843

[DO NOT MERGE] chore(emergency-pause): port staging RPC pacing/retries + workflow alerting to prod#1843
mirooon wants to merge 3 commits into
mainfrom
chore/emergency-pause-prod-rpc-pacing

mirooon commented May 25, 2026

Uh oh!

coderabbitai Bot commented May 25, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mirooon commented May 25, 2026

Which Linear task belongs to this PR?

Why did I implement it this way?

Background

Scope

What changes (matched 1:1 to #1828's "Solution" list)

Testing

Checklist before requesting a review

Checklist for reviewer (DO NOT DEPLOY and contracts BEFORE CHECKING THIS!!!)

Uh oh!

coderabbitai Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented May 25, 2026 •

edited

Loading