
feat(examples): vision-first document RAG (ColQwen2.5 + Florence-2-DocVQA)#178

Open
svonava wants to merge 1 commit into main from daniel/vision-doc-rag-example


svonava commented May 22, 2026

Summary

  • New example under examples/vision-doc-rag/. ColQwen2.5 ranks pages by reading them as images, and Florence-2-FT-DocVQA reads the top page to produce a textual answer. OCR never enters the score path, so charts, tables, screenshots, and layout cues survive end-to-end.
  • Multi-tenant from the start: every page carries a client tag, and queries are scoped by a Python filter before MaxSim. The same corpus serves multiple tenants with no per-tenant index.
  • Optional Qwen/Qwen3-VL-Reranker-2B second stage stays in the visual modality (off by default — gated on a cluster-side bugfix).
  • Self-contained: 12 synthetic pages across 3 fictional clients, a PIL renderer that turns each entry into a PNG, FastAPI server, minimal UI that shows the page image alongside the answer.
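
The tenant scoping and MaxSim ranking described in the summary can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the example's code: the `Page` dataclass, the `maxsim` and `search` functions, and the tenant names `acme` and `globex` are hypothetical stand-ins (the PR itself calls `sie_sdk.scoring.maxsim`), and the 2-dimensional vectors stand in for ColQwen2.5's 128-dimensional embeddings.

```python
from dataclasses import dataclass


@dataclass
class Page:
    page_id: str
    client: str                      # tenant tag carried by every page
    multivector: list[list[float]]   # one embedding per page patch/token


def maxsim(query_mv, doc_mv):
    """Late-interaction score: for each query vector take its best
    dot-product match among the document vectors, then sum those maxima."""
    return sum(
        max(sum(q * d for q, d in zip(q_vec, d_vec)) for d_vec in doc_mv)
        for q_vec in query_mv
    )


def search(query_mv, pages, client, top_k=1):
    # Scope to the tenant first, so other tenants' pages are never scored.
    scoped = [p for p in pages if p.client == client]
    ranked = sorted(
        scoped, key=lambda p: maxsim(query_mv, p.multivector), reverse=True
    )
    return ranked[:top_k]


# Hypothetical toy corpus: two tenants, tiny hand-written multivectors.
pages = [
    Page("acme-1", "acme", [[1.0, 0.0], [0.0, 1.0]]),
    Page("acme-2", "acme", [[0.5, 0.5]]),
    Page("globex-1", "globex", [[1.0, 1.0]]),
]
query = [[1.0, 0.0], [0.0, 1.0]]
top = search(query, pages, client="acme")
```

Filtering before scoring means a page from another tenant can never reach the ranked list, however well it matches the query, and it keeps the MaxSim cost proportional to the tenant's own page count rather than the whole corpus.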

SIE features

| Stage | Model | Primitive |
| --- | --- | --- |
| Retrieval | vidore/colqwen2.5-v0.2 | encode (multivector, image + text) |
| Ranking | client-side | sie_sdk.scoring.maxsim |
| Rerank (optional) | Qwen/Qwen3-VL-Reranker-2B | score |
| Answer | mynkchaudhry/Florence-2-FT-DocVQA | extract with instruction=<question> |
| OCR snippet (UI only) | mynkchaudhry/Florence-2-FT-DocVQA | extract |

Project layout

examples/vision-doc-rag/
├── README.md
├── config.yaml
├── data/
│   ├── fetch_dataset.py    # synthetic 3-tenant corpus → pages.json
│   └── render_pages.py     # pages.json → PNG screenshots
├── python/
│   ├── ingest.py           # encode every page → multivectors.npz
│   ├── search.py           # CLI demo: 4 scoped queries with timings
│   ├── server.py           # FastAPI /api/search?q=&client=
│   └── requirements.txt
└── static/
    └── index.html          # tenant selector + query box + answer card

Test plan

  • data/fetch_dataset.py generates 12 pages across 3 tenants
  • data/render_pages.py renders 12 PNGs (1024×1280) via PIL with a fallback font path
  • First page encode via vidore/colqwen2.5-v0.2 returns a [~740, 128] multivector on the dev cluster (verified before a cluster-side wedge took the worker out — see notes)
  • sie_sdk.scoring.maxsim(query_mv, [doc_mv]) returns the expected high score for a matched page
  • BAAI/bge-reranker-v2-m3 returns sensible scores on text inputs (sanity test, separate from the visual reranker)
  • End-to-end ingest + search against a healthy SIE cluster (blocked on the cluster recovery that is still in progress; internal context)
  • DocVQA instruction=<question> returns a focused answer rather than an OCR dump (depends on the cluster's Florence-2 adapter routing the task token correctly)

Notes for reviewers

The visual reranker is configured but disabled by default in config.yaml. There's a known cluster-side adapter issue where JSON image inputs are not base64-decoded before reaching the preprocessor; once that lands, flip search.visual_rerank: true and the second stage runs in the same modality as retrieval.
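
To make the adapter issue concrete: an image sent inside a JSON body travels as base64 text and has to be decoded back to bytes before an image preprocessor can read it. The sketch below shows that round trip using only the standard library. The `{"image": ...}` payload shape and both function names are assumptions for illustration, not the cluster adapter's actual schema.

```python
import base64
import json

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def encode_request(image_bytes: bytes) -> str:
    # JSON cannot carry raw bytes, so the client base64-encodes the image.
    return json.dumps({"image": base64.b64encode(image_bytes).decode("ascii")})


def decode_request(body: str) -> bytes:
    # The server must reverse the encoding before preprocessing;
    # validate=True rejects input containing non-base64 characters.
    payload = json.loads(body)
    return base64.b64decode(payload["image"], validate=True)


fake_png = PNG_MAGIC + b"placeholder-pixels"  # header only, not a real image
body = encode_request(fake_png)
raw_field = json.loads(body)["image"]  # what a preprocessor sees if decoding is skipped
decoded = decode_request(body)
```

If the decode step is skipped, the preprocessor receives the ASCII string in `raw_field` instead of PNG bytes, which is consistent with the failure mode described above.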

The synthetic corpus is intentionally domain-mixed (engineering runbooks, HR policies, finance procedures) so queries clearly disambiguate by tenant and the visual layout matters more than keyword overlap.
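
As a rough picture of what a domain-mixed, tenant-tagged corpus could look like, here is a minimal sketch. Everything specific in it is an assumption: the tenant names, the record fields, and the even split of 4 pages per tenant are hypothetical, and the real `data/fetch_dataset.py` may structure `pages.json` differently.

```python
import json

# Hypothetical tenant names; the PR only states that there are 3 fictional
# clients covering engineering runbooks, HR policies, and finance procedures.
TENANTS = {
    "client-a": "Engineering runbook",
    "client-b": "HR policy",
    "client-c": "Finance procedure",
}
PAGES_PER_TENANT = 4  # assumption: 12 pages split evenly across 3 tenants


def build_corpus():
    pages = []
    for client, domain in TENANTS.items():
        for n in range(1, PAGES_PER_TENANT + 1):
            pages.append({
                "id": f"{client}-{n:02d}",
                "client": client,  # tenant tag used to scope queries
                "title": f"{domain} {n}",
                "body": f"Placeholder text for page {n} of {client}.",
            })
    return pages


corpus = build_corpus()
pages_json = json.dumps(corpus, indent=2)  # what a pages.json could contain
```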

…2-DocVQA

A multi-tenant retrieval + QA example that keeps OCR out of the score path.
Pages are encoded as images with ColQwen2.5, MaxSim ranks them via late
interaction, and Florence-2-FT-DocVQA reads the top page to produce a
textual answer. An optional Qwen3-VL-Reranker-2B second stage stays in the
visual modality so layout cues survive both ranking stages.

Exercises encode + extract (and score when enabled). Includes a synthetic
3-tenant corpus, a PIL renderer that turns each entry into a PNG, a FastAPI
server, and a minimal UI that shows the page image alongside the answer.