Skip to content

feat: for offsite audits, read brand presence execution data from PostgREST (LLMO-4307)#2389

Open
rarescheseli wants to merge 19 commits into
mainfrom
rcheseli/read-from-postgrest-then-fallback-to-json
Open

feat: for offsite audits, read brand presence execution data from PostgREST (LLMO-4307)#2389
rarescheseli wants to merge 19 commits into
mainfrom
rcheseli/read-from-postgrest-then-fallback-to-json

Conversation

@rarescheseli
Copy link
Copy Markdown
Contributor

@rarescheseli rarescheseli commented Apr 15, 2026

Summary:

  • Add PostgREST as the primary data source for brand presence execution data in offsite audits, falling back to the existing SharePoint file-backed approach when no DB rows are found
  • Extract isBrandalfEnabled and resolveOrganizationIdForSite into a shared brandalf-utils.js module, reused by the offsite handler, enrichment, and customer analysis flows
  • Rename brand-presence-enrichment.js to offsite-brand-presence-enrichment.js for consistency with related modules

Please ensure your pull request adheres to the following guidelines:

  • make sure to link the related issues in this description
  • when merging / squashing, make sure the fixed issue references are visible in the commits, for easy compilation of release notes
  • If data sources for any opportunity has been updated/added, please update the wiki for same opportunity.

Related Issues

Thanks for contributing!

@github-actions
Copy link
Copy Markdown

This PR will trigger no release when merged.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@rarescheseli rarescheseli marked this pull request as ready for review April 16, 2026 11:58
@rarescheseli rarescheseli changed the title feat: read from Postgrest brand presence execution data and fallback to share-point data if there's no data in DB (LLMO-4307) feat: read from PostgREST brand presence execution data and fallback to share-point data if there's no data in DB (LLMO-4307) Apr 16, 2026
@rarescheseli rarescheseli changed the title feat: read from PostgREST brand presence execution data and fallback to share-point data if there's no data in DB (LLMO-4307) feat: for offsite audits, read brand presence execution data from PostgREST (LLMO-4307) Apr 16, 2026
@solaris007 solaris007 added the enhancement New feature or request label Apr 17, 2026
Copy link
Copy Markdown
Member

@solaris007 solaris007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @rarescheseli,

This PR directly addresses one of the key findings from the aem.live traffic spike investigation - adding the brandalf guard to brand-presence-enrichment.js that was missing. The overall approach is solid: PostgREST as primary source, feature-flag-gated rollout, legacy fallback, and thorough test coverage. Good work on the module extraction and the legacy format mapping - changing one layer at a time is the right transitional strategy.

Strengths

  • Clean extraction of isBrandalfEnabled and resolveOrganizationIdForSite into shared brandalf-utils.js with improved null-safety over the inline original
  • Well-designed org ID resolution chain: site.getOrganizationId() -> explicit fallback -> Site.findById() lookup
  • All PostgREST interactions use parameterized queries via supabase-js fluent API - no injection risk. Every query filters by both organization_id AND site_id for tenant isolation
  • Fail-closed design throughout - any PostgREST failure returns null, cleanly triggering the legacy path
  • mapExecutionsToLegacyBrandPresenceRows reshapes DB results into the existing { Sources, Region, Topics, Prompt, Category } shape so downstream consumers stay untouched
  • Thorough test coverage across all new modules: brandalf-utils.test.js covers all 15 branches, offsite-brand-presence-postgrest.test.js covers pagination, chunking, null inputs, and error paths
  • Safe rollout via feature flag - can be enabled per-org without risk to non-migrated sites

Issues

Important (Should Fix)

1. Fallback to aem.live for brandalf-enabled sites defeats the traffic reduction goal

In both offsiteBrandPresenceRunner and computeTopicsFromBrandPresence, when isBrandalfOrg is true but loadBrandPresenceDataFromPostgrest returns null (no data, empty sources, or error), the code falls through to the legacy aem.live file-fetch path. But the whole motivation for this PR is that migrated sites generate 404s against aem.live. The fallback re-creates the exact problem.

For brandalf-enabled sites, "no data in PostgREST" should mean "no brand presence data exists for this period" - not "try fetching from the old path that will 404." The fallback should only apply when isBrandalfOrg is false.

Recommendation: after the PostgREST attempt, if isBrandalfOrg is true, return an empty/no-data result instead of falling through to the aem.live path.

2. Feature flag HTTP call per audit invocation - no caching

isBrandalfEnabled issues a fresh HTTP request to /organizations/{id}/feature-flags on every audit invocation. At 212K invocations/day (and growing), that's 212K round trips to the SpaceCat API just for flag checks. A module-level Map keyed by organizationId with a 5-minute TTL would cut this traffic by orders of magnitude.

3. Source query fan-out scales poorly

fetchSourcesForExecutionIds chunks execution IDs into groups of 50 and queries them sequentially. A site with 5000 executions yields 100 sequential PostgREST round trips. Options: increase EXECUTION_ID_CHUNK_SIZE to 200-500 (Postgres handles .in() with hundreds of UUIDs comfortably), use Promise.all with a concurrency limiter, or query brand_presence_sources by the same (org_id, site_id, date_range) filter to avoid the ID fan-out entirely.

4. Duplicated PostgREST-first orchestration logic

The same sequence (resolve org ID -> check brandalf flag -> load from PostgREST -> fallback) is repeated nearly verbatim in offsiteBrandPresenceRunner and computeTopicsFromBrandPresence. If the fallback strategy changes (e.g., fixing issue 1 above), both sites must be updated in lockstep. Consider extracting a shared loadBrandPresenceWithFallback(siteId, context) function.

5. No successful multi-page pagination test

The pagination test for fetchExecutionsBatched creates a 5000-row first batch, then the second batch errors. There is no test where two pages both succeed and results are accumulated correctly. Add a test with a first batch of exactly EXECUTION_FETCH_BATCH_SIZE rows and a second batch of fewer rows, then assert the combined count.

Minor (Nice to Have)

6. 'all' -> 'chatgpt-paid' mapping is undocumented - In BRAND_PRESENCE_DB_MODEL_BY_PROVIDER, all: 'chatgpt-paid' is semantically surprising. A one-line comment explaining why would help future maintainers.

7. Wasted topicMap computation in handler.js PostgREST path - extractUrlsAndTopics populates topicMap but it goes out of scope unused - only allUrls is consumed downstream.

8. computeTopicsFromBrandPresence doesn't receive the site object - Only siteId is passed, forcing resolveOrganizationIdForSite to do a Site.findById() lookup. All callers (cited, reddit, youtube handlers) already have site in scope. Threading it through would eliminate one DB round trip per invocation.

Recommendations

  • Issue 1 is the highest-priority fix - it directly determines whether this PR achieves its goal of reducing aem.live traffic for migrated sites
  • Issues 2-4 can ship in a fast-follow if the team prefers, but the caching (issue 2) will matter at current scale
  • The companion PR adobe/spacecat-api-service#2229 (lowering default proxy limit to 5K) is already approved - together these two PRs address the top recommendations from the traffic spike analysis

Assessment

Ready to merge? With fixes

The code is well-structured, well-tested, and directly addresses the aem.live traffic spike root cause. The one issue that should be fixed before merge is the fallback behavior for brandalf-enabled sites (issue 1) - without it, migrated sites will still generate 404 traffic against aem.live, which is the problem this PR is meant to solve. The remaining issues (caching, fan-out, duplication, pagination test) are important for operational health and can be addressed in a fast-follow.

Next Steps

  1. Fix the brandalf-enabled fallback (issue 1) - short-circuit to empty result instead of falling through to aem.live
  2. Add the successful multi-page pagination test (issue 5)
  3. Consider the caching, fan-out, and duplication improvements (issues 2-4) as fast-follow

Comment thread src/offsite-brand-presence/handler.js Outdated
Comment thread src/utils/offsite-brand-presence-enrichment.js Outdated
* Gets the ISO week number and year for the previous two weeks.
* @returns {Array<{ week: number, year: number }>} Previous two weeks (most recent first)
*/
function getPreviousWeeks() {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block was moved to offsite-brand-presence-enrichment.js

Copy link
Copy Markdown
Member

@solaris007 solaris007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @rarescheseli,

All five Important items from the prior review are resolved, most exceeding what was asked:

Strengths (previously flagged, now resolved)

  • Item 1 (brandalf fallback): loadBrandPresenceData returns null for brandalf orgs when PostgREST returns no data. The legacy file path is reachable only when isBrandalfOrg === false. Fix is in the right place.
  • Item 3 (source query fan-out): Exceeded expectations. Replaced the N/50 chunked fan-out with a single PostgREST FK-embedded query (brand_presence_sources(source_urls(url))) plus keyset pagination on (execution_date, id). Both the round-trip pattern and the offset performance cliff are gone.
  • Item 4 (duplicated orchestration): loadBrandPresenceData in enrichment.js is now the single canonical entry point. Both offsiteBrandPresenceRunner and computeTopicsFromBrandPresence delegate to it. handler.js dropped 270 lines, gained 36.
  • Item 5 (site threading): computeTopicsFromBrandPresence(siteId, context, site) now accepts the site entity. All three callers (cited, reddit, youtube) forward it, avoiding redundant Site.findById().
  • Pagination coverage: New test accumulates rows across two successful pages via keyset cursor seeds a full first page (5000 rows) plus a partial second page (3 rows) and asserts correct accumulation. Prior gap closed.
  • Orphan-source rows eliminated structurally: The embedded FK join means orphan source rows cannot exist by construction, not by guard code.
  • Observability bonus: USER_AGENT = 'Offsite Audits - Spacecat/1.0' and &source=offsite-audits query param added to outbound calls.

Issues

Important (Should Fix)

1. Stale log message in loadBrandPresenceDataFromPostgrest catch block. src/utils/offsite-brand-presence-postgrest.js (~line 177) still reads "Falling back to file-backed brand presence for site ${siteId}". With the no-fallback behavior for brandalf orgs, this message is now actively misleading on every PostgREST failure - ops will read it as "we fell back to files" when in fact the function returns null and no fallback occurs. Reword to e.g. "PostgREST query failed for site ${siteId}: ${error.message}" and update the corresponding test matcher. One-line fix with real incident-response value.

2. No test pins the site-parameter forwarding through computeTopicsFromBrandPresence. The signature change to accept site is the mechanism that avoids the extra org lookup (item 5), but no test passes a non-undefined site and asserts that resolveOrganizationIdForSite received it. A regression that drops the forwarding (e.g., someone reintroduces a 2-arg call) would not be caught. Add one test that stubs resolveOrganizationIdForSite, calls with a fakeSite, and asserts .firstCall.args[0].site === fakeSite.

Minor (Nice to Have)

  • Fail-open on brandalf check blip (pre-existing, impact increased). isBrandalfEnabled returns false for non-2xx responses and thrown errors. In loadBrandPresenceData, that means a feature-flag endpoint outage silently routes a brandalf-enabled site to the legacy aem.live path - the exact behavior item 1 was meant to prevent. Distinguishing "flag is false" from "flag check failed" (e.g., return null/throw for failure) would close this hole. Acceptable as a tracked follow-up, not blocking.
  • isBrandalfEnabled caching (item 2, not addressed). Still one HTTP call per audit invocation. Post-consolidation the call fires once per audit rather than once per provider, so the rate is reduced. Acceptable as a tracked follow-up.
  • 'all' -> 'chatgpt-paid' mapping undocumented. A two-line comment would prevent future confusion.
  • fetchExecutionsWithSources has no max-iteration guard. A MAX_PAGES constant (e.g., 50 = 250K executions) would prevent a pathological condition from looping the Lambda to timeout.
  • Keyset cursor interpolation trust assumption implicit. Values are ISO dates and UUIDs (safe today). A one-line comment noting this prevents future schema-change surprises.
  • Cross-year boundary test coverage narrowed. Old handler test exercised full QI-fetch flow with [{w:1,y:2026},{w:52,y:2025}]. New test mocks at the loadBrandPresenceData boundary. Add one cross-year case in the enrichment layer to retain that coverage.

Assessment

Ready to merge? With fixes.

The prior CHANGES_REQUESTED items are all resolved, with the source-query refactor substantially exceeding what was asked. The two remaining Important items are a one-line log reword and a small test addition. The fail-open and caching concerns are pre-existing, their impact is reduced by this PR's consolidation, and they have a clean follow-up path.

Next Steps

  1. Reword the stale catch-block log in loadBrandPresenceDataFromPostgrest and update the test matcher.
  2. Add one test pinning site parameter forwarding through computeTopicsFromBrandPresence.
  3. File a follow-up for the fail-open brandalf check and the caching concern so they do not get lost.

return null;
}

const rows = mapExecutionsToLegacyBrandPresenceRows(executions);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: This log message still reads "Falling back to file-backed brand presence" but with the new no-fallback behavior for brandalf orgs, no fallback occurs - the function returns null. Reword to e.g. PostgREST query failed for site ${siteId}: ${error.message} and update the test matcher. One-line fix, real incident-response value.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch! Addressed it in 827daa8.

Copy link
Copy Markdown
Member

@solaris007 solaris007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @rarescheseli,

All prior findings addressed cleanly in the two new commits.

Strengths (previously flagged, now resolved)

  • Stale log message - loadBrandPresenceDataFromPostgrest catch block now reads "PostgREST query failed for site ${siteId}: ${error.message}" and the test matcher is updated to match. No more misleading "fallback" wording during incidents.
  • Site parameter forwarding test - new test "forwards the provided site object to organization resolution" stubs resolveOrganizationIdForSite and asserts site: fakeSite was passed through. Closes the regression-window for the item-5 fix.
  • Fail-open on brandalf check (improved beyond what was asked) - isBrandalfEnabled now returns null (instead of false) for missing env, non-2xx responses, non-array payloads, and thrown errors. loadBrandPresenceData short-circuits with isBrandalfOrg === null and skips both PostgREST and the legacy aem.live path. This closes the silent-fallback hole that the item-1 fix would otherwise leave open during a feature-flag endpoint outage.
  • 'all' -> 'chatgpt-paid' mapping documented with a one-line comment.
  • MAX_EXECUTION_FETCH_PAGES = 50 guard added to fetchExecutionsWithSources with throw on overrun and a corresponding test.
  • Keyset cursor trust assumption annotated with a comment noting the ISO-date and UUID dependency.
  • Cross-year boundary explicitly tested with [{w:1,y:2026},{w:52,y:2025}].

Assessment

Ready to merge? Yes.

Each prior item has a matching code change and a matching test. The brandalf null-return change is a particularly nice touch - the prior reviews flagged the fail-open as a separate concern and this PR resolves it tighter than the workaround we discussed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants