
Markdown web scraper #5548

Open
shatfield4 wants to merge 5 commits into master from feat/markdown-web-scraping
Open

Markdown web scraper#5548
shatfield4 wants to merge 5 commits into
masterfrom
feat/markdown-web-scraping

Conversation

Collaborator

@shatfield4 shatfield4 commented Apr 28, 2026

Pull Request Type

  • ✨ feat (New feature)
  • 🐛 fix (Bug fix)
  • ♻️ refactor (Code refactoring without changing behavior)
  • 💄 style (UI style changes)
  • 🔨 chore (Build, CI, maintenance)
  • 📝 docs (Documentation updates)

Relevant Issues

resolves #

Description

  • Switches the web scraper from `innerText` to `innerHTML` and converts the HTML to markdown using `node-html-markdown`
  • Fixes LLMs hallucinating links: previously we stripped all links from the page and kept only the text
  • This PR preserves links and tables in markdown format, giving LLMs much better context for tasks like deep research or chained agent actions that navigate to other links
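The change above can be sketched with a toy converter. This is an illustration only: the PR uses the `node-html-markdown` package for the real conversion, and `toyHtmlToMarkdown` below is a hypothetical regex stand-in that just shows why markdown output keeps the link context that `innerText` throws away:

```javascript
// Toy stand-in for node-html-markdown, for illustration only:
// converts <a> tags to [text](href) so URLs survive into the LLM context.
// The real PR uses the node-html-markdown package, not this regex.
function toyHtmlToMarkdown(html) {
  return html
    .replace(/<a\s+[^>]*href="([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi, '[$2]($1)')
    .replace(/<\/?p[^>]*>/gi, '\n') // paragraph breaks
    .replace(/<[^>]+>/g, '')        // drop any remaining tags
    .replace(/\n{2,}/g, '\n')
    .trim();
}

const html = '<p>See the <a href="https://docs.example.com">docs</a> page.</p>';
console.log(toyHtmlToMarkdown(html));
// -> See the [docs](https://docs.example.com) page.
// With innerText the href would be lost entirely.
```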

Investigation:

  • markdown scrape was inflating tokens ~87% on Framer-built sites: Framer renders every responsive breakpoint in the same DOM, so `innerHTML` captures three copies of every paragraph and link, while the old `innerText` skipped CSS-hidden ones
  • added a DOM visibility filter in the puppeteer evaluate that drops `display:none`, `visibility:hidden`, and `aria-hidden` subtrees before serializing, so the markdown matches what was actually on screen
  • added `stripHiddenAttrs` as a safety net for the fetch fallback path, where computed styles aren't available
  • added `stripEmptyAnchors` to drop empty anchors, since a link with no anchor text wastes tokens
  • tried a markdown-level dedupe pass; it worked great on Framer (anythingllm.com went from +87% to +1.8%) but collapsed 30+ legitimately distinct brand entries on a portfolio site that shared short repeated labels like "Branding", "Deck", and "3 Weeks"
  • tested dedupe on a non-Framer site (stripe.com): it saved only 40 tokens out of 2,800 with zero unique URLs lost, so it was basically idle
  • conclusion: dedupe was a Framer-specific bandaid with real false-positive risk on listing/portfolio pages; the DOM visibility filter is the correct fix and catches breakpoint duplicates at the source
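As a hedged sketch of the two fallback helpers named above (the function names come from this PR, but the regexes and signatures here are assumptions, not the actual implementation):

```javascript
// Hypothetical sketches of the fallback-path helpers. Real implementations
// live in collector/utils; these regexes are assumptions, can over-match
// (e.g. class names containing "hidden"), and miss nested same-name tags.
// The DOM visibility filter in the puppeteer evaluate remains the primary fix.

// Drop subtrees marked hidden via attributes -- used on the fetch fallback
// path, where no browser is running and computed styles aren't available.
function stripHiddenAttrs(html) {
  const hiddenSubtree =
    /<(\w+)[^>]*(?:aria-hidden="true"|\bhidden\b|display:\s*none|visibility:\s*hidden)[^>]*>[\s\S]*?<\/\1>/gi;
  return html.replace(hiddenSubtree, '');
}

// Drop markdown links with no anchor text, e.g. [](https://example.com),
// since they spend tokens without giving the LLM anything to anchor on.
function stripEmptyAnchors(markdown) {
  return markdown.replace(/\[\s*\]\([^)]*\)/g, '');
}
```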

Visuals (if applicable)

Additional Information

Developer Validations

  • I ran yarn lint from the root of the repo & committed changes
  • Relevant documentation has been updated (if applicable)
  • I have tested my code functionality
  • Docker build succeeds locally

@shatfield4 shatfield4 self-assigned this Apr 28, 2026
@shatfield4 shatfield4 marked this pull request as ready for review April 30, 2026 01:14
Contributor

@angelplusultra angelplusultra left a comment


just a nit

Comment thread: collector/utils/htmlToMarkdown/index.js
Contributor

@angelplusultra angelplusultra left a comment


LGTM.

Reminder: There's an upstream issue with the web scraper tool that needs investigating. If you scrape a website and then ask the agent to scrape a link found on that website, you receive an error:

```
Error: Converting circular structure to JSON
    --> starting at object with constructor 'Anthropic'
    |     property 'completions' -> object with constructor 'Completions'
    --- property '_client' closes the circle
```
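For context on the trace above: this is Node's standard failure when `JSON.stringify` meets a self-referencing object, here suggesting the Anthropic SDK client (whose `completions` sub-resource holds a `_client` back-reference) is being serialized into a tool result. A minimal repro plus a common WeakSet-replacer guard, as an illustrative sketch rather than the actual fix:

```javascript
// Minimal repro of the error: a simplified shape where a sub-resource
// holds a back-reference to its parent client, like the SDK does.
const client = { name: 'Anthropic' };
client.completions = { _client: client }; // _client closes the circle

let threw = false;
try {
  JSON.stringify(client);
} catch (err) {
  threw = true; // TypeError: Converting circular structure to JSON
}

// One common guard (an illustration, not necessarily the right fix here):
// replace repeated object references using a WeakSet-based replacer.
function safeStringify(value) {
  const seen = new WeakSet();
  return JSON.stringify(value, (key, val) => {
    if (typeof val === 'object' && val !== null) {
      if (seen.has(val)) return '[Circular]';
      seen.add(val);
    }
    return val;
  });
}

console.log(safeStringify(client));
// -> {"name":"Anthropic","completions":{"_client":"[Circular]"}}
```

The cleaner upstream fix is likely to avoid putting the client object into the serialized tool result at all.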



3 participants