
Markdown web scraper #5548

Open
shatfield4 wants to merge 5 commits into master from feat/markdown-web-scraping
Open

Markdown web scraper#5548
shatfield4 wants to merge 5 commits into
masterfrom
feat/markdown-web-scraping

Conversation

Collaborator

@shatfield4 shatfield4 commented Apr 28, 2026

Pull Request Type

  • ✨ feat (New feature)
  • 🐛 fix (Bug fix)
  • ♻️ refactor (Code refactoring without changing behavior)
  • 💄 style (UI style changes)
  • 🔨 chore (Build, CI, maintenance)
  • 📝 docs (Documentation updates)

Relevant Issues

resolves #

Description

  • Switches the web scraper from `innerText` to `innerHTML` and converts the HTML to markdown using `node-html-markdown`
  • Fixes LLMs hallucinating links: previously we stripped all links from the page and kept only the text
  • This PR preserves links and tables in markdown format, giving LLMs much better context for tasks like deep research or chained agent actions that navigate to other links
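The change above can be sketched with a toy converter. This is an illustration only: the PR uses the `node-html-markdown` package for the real conversion, and `toyHtmlToMarkdown` below is a hypothetical regex stand-in that just shows why markdown output keeps the link context that `innerText` throws away:

```javascript
// Toy stand-in for node-html-markdown, for illustration only:
// converts <a> tags to [text](href) so URLs survive into the LLM context.
// The real PR uses the node-html-markdown package, not this regex.
function toyHtmlToMarkdown(html) {
  return html
    .replace(/<a\s+[^>]*href="([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi, '[$2]($1)')
    .replace(/<\/?p[^>]*>/gi, '\n') // paragraph breaks
    .replace(/<[^>]+>/g, '')        // drop any remaining tags
    .replace(/\n{2,}/g, '\n')
    .trim();
}

const html = '<p>See the <a href="https://docs.example.com">docs</a> page.</p>';
console.log(toyHtmlToMarkdown(html));
// -> See the [docs](https://docs.example.com) page.
// With innerText the href would be lost entirely.
```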

Investigation:

  • markdown scrape was inflating tokens ~87% on Framer-built sites: Framer renders every responsive breakpoint in the same DOM, so `innerHTML` captures three copies of every paragraph and link, while the old `innerText` skipped CSS-hidden ones
  • added a DOM visibility filter in the puppeteer evaluate that drops `display:none`, `visibility:hidden`, and `aria-hidden` subtrees before serializing, so the markdown matches what was actually on screen
  • added `stripHiddenAttrs` as a safety net for the fetch fallback path, where computed styles aren't available
  • added `stripEmptyAnchors` to drop empty anchors, since a link with no anchor text wastes tokens
  • tried a markdown-level dedupe pass; it worked great on Framer (anythingllm.com went from +87% to +1.8%) but collapsed 30+ legitimately distinct brand entries on a portfolio site that shared short repeated labels like "Branding", "Deck", and "3 Weeks"
  • tested dedupe on a non-Framer site (stripe.com): it saved only 40 tokens out of 2,800 with zero unique URLs lost, so it was basically idle
  • conclusion: dedupe was a Framer-specific bandaid with real false-positive risk on listing/portfolio pages; the DOM visibility filter is the correct fix and catches breakpoint duplicates at the source
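As a hedged sketch of the two fallback helpers named above (the function names come from this PR, but the regexes and signatures here are assumptions, not the actual implementation):

```javascript
// Hypothetical sketches of the fallback-path helpers. Real implementations
// live in collector/utils; these regexes are assumptions, can over-match
// (e.g. class names containing "hidden"), and miss nested same-name tags.
// The DOM visibility filter in the puppeteer evaluate remains the primary fix.

// Drop subtrees marked hidden via attributes -- used on the fetch fallback
// path, where no browser is running and computed styles aren't available.
function stripHiddenAttrs(html) {
  const hiddenSubtree =
    /<(\w+)[^>]*(?:aria-hidden="true"|\bhidden\b|display:\s*none|visibility:\s*hidden)[^>]*>[\s\S]*?<\/\1>/gi;
  return html.replace(hiddenSubtree, '');
}

// Drop markdown links with no anchor text, e.g. [](https://example.com),
// since they spend tokens without giving the LLM anything to anchor on.
function stripEmptyAnchors(markdown) {
  return markdown.replace(/\[\s*\]\([^)]*\)/g, '');
}
```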

Visuals (if applicable)

Additional Information

Developer Validations

  • I ran yarn lint from the root of the repo & committed changes
  • Relevant documentation has been updated (if applicable)
  • I have tested my code functionality
  • Docker build succeeds locally

@shatfield4 shatfield4 self-assigned this Apr 28, 2026
@shatfield4 shatfield4 marked this pull request as ready for review April 30, 2026 01:14
Contributor

@angelplusultra angelplusultra left a comment


just a nit

Comment thread: collector/utils/htmlToMarkdown/index.js
Contributor

@angelplusultra angelplusultra left a comment


LGTM.

Reminder: There's an upstream issue with the web scraper tool that needs investigating. If you scrape a website and then ask the agent to scrape a link found on that website, you receive an error:

```
Error: Converting circular structure to JSON
    --> starting at object with constructor 'Anthropic'
    |     property 'completions' -> object with constructor 'Completions'
    --- property '_client' closes the circle
```
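For context on the trace above: this is Node's standard failure when `JSON.stringify` meets a self-referencing object, here suggesting the Anthropic SDK client (whose `completions` sub-resource holds a `_client` back-reference) is being serialized into a tool result. A minimal repro plus a common WeakSet-replacer guard, as an illustrative sketch rather than the actual fix:

```javascript
// Minimal repro of the error: a simplified shape where a sub-resource
// holds a back-reference to its parent client, like the SDK does.
const client = { name: 'Anthropic' };
client.completions = { _client: client }; // _client closes the circle

let threw = false;
try {
  JSON.stringify(client);
} catch (err) {
  threw = true; // TypeError: Converting circular structure to JSON
}

// One common guard (an illustration, not necessarily the right fix here):
// replace repeated object references using a WeakSet-based replacer.
function safeStringify(value) {
  const seen = new WeakSet();
  return JSON.stringify(value, (key, val) => {
    if (typeof val === 'object' && val !== null) {
      if (seen.has(val)) return '[Circular]';
      seen.add(val);
    }
    return val;
  });
}

console.log(safeStringify(client));
// -> {"name":"Anthropic","completions":{"_client":"[Circular]"}}
```

The cleaner upstream fix is likely to avoid putting the client object into the serialized tool result at all.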



3 participants