feat: for offsite audits, read brand presence execution data from PostgREST (LLMO-4307)#2389
feat: for offsite audits, read brand presence execution data from PostgREST (LLMO-4307)#2389rarescheseli wants to merge 19 commits into
Conversation
|
This PR will trigger no release when merged. |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
solaris007
left a comment
There was a problem hiding this comment.
Hey @rarescheseli,
This PR directly addresses one of the key findings from the aem.live traffic spike investigation - adding the brandalf guard to brand-presence-enrichment.js that was missing. The overall approach is solid: PostgREST as primary source, feature-flag-gated rollout, legacy fallback, and thorough test coverage. Good work on the module extraction and the legacy format mapping - changing one layer at a time is the right transitional strategy.
Strengths
- Clean extraction of
isBrandalfEnabledandresolveOrganizationIdForSiteinto sharedbrandalf-utils.jswith improved null-safety over the inline original - Well-designed org ID resolution chain:
site.getOrganizationId()-> explicit fallback ->Site.findById()lookup - All PostgREST interactions use parameterized queries via supabase-js fluent API - no injection risk. Every query filters by both
organization_idANDsite_idfor tenant isolation - Fail-closed design throughout - any PostgREST failure returns
null, cleanly triggering the legacy path mapExecutionsToLegacyBrandPresenceRowsreshapes DB results into the existing{ Sources, Region, Topics, Prompt, Category }shape so downstream consumers stay untouched- Thorough test coverage across all new modules:
brandalf-utils.test.jscovers all 15 branches,offsite-brand-presence-postgrest.test.jscovers pagination, chunking, null inputs, and error paths - Safe rollout via feature flag - can be enabled per-org without risk to non-migrated sites
Issues
Important (Should Fix)
1. Fallback to aem.live for brandalf-enabled sites defeats the traffic reduction goal
In both offsiteBrandPresenceRunner and computeTopicsFromBrandPresence, when isBrandalfOrg is true but loadBrandPresenceDataFromPostgrest returns null (no data, empty sources, or error), the code falls through to the legacy aem.live file-fetch path. But the whole motivation for this PR is that migrated sites generate 404s against aem.live. The fallback re-creates the exact problem.
For brandalf-enabled sites, "no data in PostgREST" should mean "no brand presence data exists for this period" - not "try fetching from the old path that will 404." The fallback should only apply when isBrandalfOrg is false.
Recommendation: after the PostgREST attempt, if isBrandalfOrg is true, return an empty/no-data result instead of falling through to the aem.live path.
2. Feature flag HTTP call per audit invocation - no caching
isBrandalfEnabled issues a fresh HTTP request to /organizations/{id}/feature-flags on every audit invocation. At 212K invocations/day (and growing), that's 212K round trips to the SpaceCat API just for flag checks. A module-level Map keyed by organizationId with a 5-minute TTL would cut this traffic by orders of magnitude.
3. Source query fan-out scales poorly
fetchSourcesForExecutionIds chunks execution IDs into groups of 50 and queries them sequentially. A site with 5000 executions yields 100 sequential PostgREST round trips. Options: increase EXECUTION_ID_CHUNK_SIZE to 200-500 (Postgres handles .in() with hundreds of UUIDs comfortably), use Promise.all with a concurrency limiter, or query brand_presence_sources by the same (org_id, site_id, date_range) filter to avoid the ID fan-out entirely.
4. Duplicated PostgREST-first orchestration logic
The same sequence (resolve org ID -> check brandalf flag -> load from PostgREST -> fallback) is repeated nearly verbatim in offsiteBrandPresenceRunner and computeTopicsFromBrandPresence. If the fallback strategy changes (e.g., fixing issue 1 above), both sites must be updated in lockstep. Consider extracting a shared loadBrandPresenceWithFallback(siteId, context) function.
5. No successful multi-page pagination test
The pagination test for fetchExecutionsBatched creates a 5000-row first batch, then the second batch errors. There is no test where two pages both succeed and results are accumulated correctly. Add a test with a first batch of exactly EXECUTION_FETCH_BATCH_SIZE rows and a second batch of fewer rows, then assert the combined count.
Minor (Nice to Have)
6. 'all' -> 'chatgpt-paid' mapping is undocumented - In BRAND_PRESENCE_DB_MODEL_BY_PROVIDER, all: 'chatgpt-paid' is semantically surprising. A one-line comment explaining why would help future maintainers.
7. Wasted topicMap computation in handler.js PostgREST path - extractUrlsAndTopics populates topicMap but it goes out of scope unused - only allUrls is consumed downstream.
8. computeTopicsFromBrandPresence doesn't receive the site object - Only siteId is passed, forcing resolveOrganizationIdForSite to do a Site.findById() lookup. All callers (cited, reddit, youtube handlers) already have site in scope. Threading it through would eliminate one DB round trip per invocation.
Recommendations
- Issue 1 is the highest-priority fix - it directly determines whether this PR achieves its goal of reducing aem.live traffic for migrated sites
- Issues 2-4 can ship in a fast-follow if the team prefers, but the caching (issue 2) will matter at current scale
- The companion PR adobe/spacecat-api-service#2229 (lowering default proxy limit to 5K) is already approved - together these two PRs address the top recommendations from the traffic spike analysis
Assessment
Ready to merge? With fixes
The code is well-structured, well-tested, and directly addresses the aem.live traffic spike root cause. The one issue that should be fixed before merge is the fallback behavior for brandalf-enabled sites (issue 1) - without it, migrated sites will still generate 404 traffic against aem.live, which is the problem this PR is meant to solve. The remaining issues (caching, fan-out, duplication, pagination test) are important for operational health and can be addressed in a fast-follow.
Next Steps
- Fix the brandalf-enabled fallback (issue 1) - short-circuit to empty result instead of falling through to aem.live
- Add the successful multi-page pagination test (issue 5)
- Consider the caching, fan-out, and duplication improvements (issues 2-4) as fast-follow
| * Gets the ISO week number and year for the previous two weeks. | ||
| * @returns {Array<{ week: number, year: number }>} Previous two weeks (most recent first) | ||
| */ | ||
| function getPreviousWeeks() { |
There was a problem hiding this comment.
This block was moved to offsite-brand-presence-enrichment.js
solaris007
left a comment
There was a problem hiding this comment.
Hey @rarescheseli,
All five Important items from the prior review are resolved, most exceeding what was asked:
Strengths (previously flagged, now resolved)
- Item 1 (brandalf fallback):
loadBrandPresenceDatareturnsnullfor brandalf orgs when PostgREST returns no data. The legacy file path is reachable only whenisBrandalfOrg === false. Fix is in the right place. - Item 3 (source query fan-out): Exceeded expectations. Replaced the N/50 chunked fan-out with a single PostgREST FK-embedded query (
brand_presence_sources(source_urls(url))) plus keyset pagination on(execution_date, id). Both the round-trip pattern and the offset performance cliff are gone. - Item 4 (duplicated orchestration):
loadBrandPresenceDatain enrichment.js is now the single canonical entry point. BothoffsiteBrandPresenceRunnerandcomputeTopicsFromBrandPresencedelegate to it. handler.js dropped 270 lines, gained 36. - Item 5 (site threading):
computeTopicsFromBrandPresence(siteId, context, site)now accepts the site entity. All three callers (cited, reddit, youtube) forward it, avoiding redundantSite.findById(). - Pagination coverage: New test
accumulates rows across two successful pages via keyset cursorseeds a full first page (5000 rows) plus a partial second page (3 rows) and asserts correct accumulation. Prior gap closed. - Orphan-source rows eliminated structurally: The embedded FK join means orphan source rows cannot exist by construction, not by guard code.
- Observability bonus:
USER_AGENT = 'Offsite Audits - Spacecat/1.0'and&source=offsite-auditsquery param added to outbound calls.
Issues
Important (Should Fix)
1. Stale log message in loadBrandPresenceDataFromPostgrest catch block. src/utils/offsite-brand-presence-postgrest.js (~line 177) still reads "Falling back to file-backed brand presence for site ${siteId}". With the no-fallback behavior for brandalf orgs, this message is now actively misleading on every PostgREST failure - ops will read it as "we fell back to files" when in fact the function returns null and no fallback occurs. Reword to e.g. "PostgREST query failed for site ${siteId}: ${error.message}" and update the corresponding test matcher. One-line fix with real incident-response value.
2. No test pins the site-parameter forwarding through computeTopicsFromBrandPresence. The signature change to accept site is the mechanism that avoids the extra org lookup (item 5), but no test passes a non-undefined site and asserts that resolveOrganizationIdForSite received it. A regression that drops the forwarding (e.g., someone reintroduces a 2-arg call) would not be caught. Add one test that stubs resolveOrganizationIdForSite, calls with a fakeSite, and asserts .firstCall.args[0].site === fakeSite.
Minor (Nice to Have)
- Fail-open on brandalf check blip (pre-existing, impact increased).
isBrandalfEnabledreturnsfalsefor non-2xx responses and thrown errors. InloadBrandPresenceData, that means a feature-flag endpoint outage silently routes a brandalf-enabled site to the legacy aem.live path - the exact behavior item 1 was meant to prevent. Distinguishing "flag is false" from "flag check failed" (e.g., returnnull/throw for failure) would close this hole. Acceptable as a tracked follow-up, not blocking. isBrandalfEnabledcaching (item 2, not addressed). Still one HTTP call per audit invocation. Post-consolidation the call fires once per audit rather than once per provider, so the rate is reduced. Acceptable as a tracked follow-up.'all' -> 'chatgpt-paid'mapping undocumented. A two-line comment would prevent future confusion.fetchExecutionsWithSourceshas no max-iteration guard. AMAX_PAGESconstant (e.g., 50 = 250K executions) would prevent a pathological condition from looping the Lambda to timeout.- Keyset cursor interpolation trust assumption implicit. Values are ISO dates and UUIDs (safe today). A one-line comment noting this prevents future schema-change surprises.
- Cross-year boundary test coverage narrowed. Old handler test exercised full QI-fetch flow with
[{w:1,y:2026},{w:52,y:2025}]. New test mocks at theloadBrandPresenceDataboundary. Add one cross-year case in the enrichment layer to retain that coverage.
Assessment
Ready to merge? With fixes.
The prior CHANGES_REQUESTED items are all resolved, with the source-query refactor substantially exceeding what was asked. The two remaining Important items are a one-line log reword and a small test addition. The fail-open and caching concerns are pre-existing, their impact is reduced by this PR's consolidation, and they have a clean follow-up path.
Next Steps
- Reword the stale catch-block log in
loadBrandPresenceDataFromPostgrestand update the test matcher. - Add one test pinning
siteparameter forwarding throughcomputeTopicsFromBrandPresence. - File a follow-up for the fail-open brandalf check and the caching concern so they do not get lost.
| return null; | ||
| } | ||
|
|
||
| const rows = mapExecutionsToLegacyBrandPresenceRows(executions); |
There was a problem hiding this comment.
Important: This log message still reads "Falling back to file-backed brand presence" but with the new no-fallback behavior for brandalf orgs, no fallback occurs - the function returns null. Reword to e.g. PostgREST query failed for site ${siteId}: ${error.message} and update the test matcher. One-line fix, real incident-response value.
solaris007
left a comment
There was a problem hiding this comment.
Hey @rarescheseli,
All prior findings addressed cleanly in the two new commits.
Strengths (previously flagged, now resolved)
- Stale log message -
loadBrandPresenceDataFromPostgrestcatch block now reads"PostgREST query failed for site ${siteId}: ${error.message}"and the test matcher is updated to match. No more misleading "fallback" wording during incidents. - Site parameter forwarding test - new test "forwards the provided site object to organization resolution" stubs
resolveOrganizationIdForSiteand assertssite: fakeSitewas passed through. Closes the regression-window for the item-5 fix. - Fail-open on brandalf check (improved beyond what was asked) -
isBrandalfEnablednow returnsnull(instead offalse) for missing env, non-2xx responses, non-array payloads, and thrown errors.loadBrandPresenceDatashort-circuits withisBrandalfOrg === nulland skips both PostgREST and the legacy aem.live path. This closes the silent-fallback hole that the item-1 fix would otherwise leave open during a feature-flag endpoint outage. 'all' -> 'chatgpt-paid'mapping documented with a one-line comment.MAX_EXECUTION_FETCH_PAGES = 50guard added tofetchExecutionsWithSourceswiththrowon overrun and a corresponding test.- Keyset cursor trust assumption annotated with a comment noting the ISO-date and UUID dependency.
- Cross-year boundary explicitly tested with
[{w:1,y:2026},{w:52,y:2025}].
Assessment
Ready to merge? Yes.
Each prior item has a matching code change and a matching test. The brandalf null-return change is a particularly nice touch - the prior reviews flagged the fail-open as a separate concern and this PR resolves it tighter than the workaround we discussed.
Summary:
Please ensure your pull request adheres to the following guidelines:
Related Issues
Thanks for contributing!