feat(examples): vision-first document RAG (ColQwen2.5 + Florence-2-DocVQA) by svonava · Pull Request #178 · superlinked/sie

svonava · 2026-05-22T04:04:58Z

Summary

New example under examples/vision-doc-rag/. ColQwen2.5 ranks pages by reading them as images; Florence-2-FT-DocVQA then reads the top-ranked page and produces a textual answer. OCR never enters the score path, so charts, tables, screenshots, and layout cues survive end-to-end.
Multi-tenant from the start: every page carries a client tag, and queries are scoped by a Python-side filter before MaxSim. The same corpus serves multiple tenants with no per-tenant index.
Optional Qwen/Qwen3-VL-Reranker-2B second stage stays in the visual modality (off by default, pending a cluster-side bugfix).
Self-contained: 12 synthetic pages across 3 fictional clients, a PIL renderer that turns each entry into a PNG, FastAPI server, minimal UI that shows the page image alongside the answer.
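
The tenant scoping described above can be sketched in plain NumPy. This is an illustration under stated assumptions, not the example's actual code: the `Page` record, the `search_scoped` helper, and the tenant names `acme` and `globex` are hypothetical, and the `maxsim` function here is a local reimplementation of late-interaction scoring rather than `sie_sdk.scoring.maxsim`. Pages are filtered by client tag first, and only the surviving pages are scored.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Page:
    page_id: str
    client: str      # tenant tag carried by every page (hypothetical field name)
    mv: np.ndarray   # multivector, shape [n_tokens, dim]


def maxsim(query_mv: np.ndarray, doc_mv: np.ndarray) -> float:
    """Late-interaction score: for each query token, take the best-matching
    document token (max dot product), then sum over query tokens."""
    sims = query_mv @ doc_mv.T           # [n_query_tokens, n_doc_tokens]
    return float(sims.max(axis=1).sum())


def search_scoped(query_mv, pages, client, top_k=3):
    """Filter by tenant first, then rank only that tenant's pages."""
    scoped = [p for p in pages if p.client == client]
    scored = [(p.page_id, maxsim(query_mv, p.mv)) for p in scoped]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:top_k]


# Toy 2-dimensional "embeddings" so the scores are easy to follow by hand.
pages = [
    Page("acme-1", "acme", np.array([[1.0, 0.0], [0.0, 1.0]])),
    Page("acme-2", "acme", np.array([[0.0, 1.0]])),
    Page("globex-1", "globex", np.array([[1.0, 0.0], [1.0, 0.0]])),
]
query = np.array([[1.0, 0.0], [0.0, 1.0]])
results = search_scoped(query, pages, client="acme")
```

Because the filter runs before scoring, a page from another tenant can never appear in the results, however well it matches the query.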

SIE features

| Stage | Model | Primitive |
| --- | --- | --- |
| Retrieval | `vidore/colqwen2.5-v0.2` | `encode` (multivector, image + text) |
| Ranking | client-side | `sie_sdk.scoring.maxsim` |
| Rerank (optional) | `Qwen/Qwen3-VL-Reranker-2B` | `score` |
| Answer | `mynkchaudhry/Florence-2-FT-DocVQA` | `extract` with `instruction=<question>` |
| OCR snippet (UI only) | `mynkchaudhry/Florence-2-FT-DocVQA` | `extract` |

Project layout

examples/vision-doc-rag/
├── README.md
├── config.yaml
├── data/
│   ├── fetch_dataset.py    # synthetic 3-tenant corpus → pages.json
│   └── render_pages.py     # pages.json → PNG screenshots
├── python/
│   ├── ingest.py           # encode every page → multivectors.npz
│   ├── search.py           # CLI demo: 4 scoped queries with timings
│   ├── server.py           # FastAPI /api/search?q=&client=
│   └── requirements.txt
└── static/
    └── index.html          # tenant selector + query box + answer card

Test plan

data/fetch_dataset.py generates 12 pages across 3 tenants
data/render_pages.py renders 12 PNGs (1024×1280) via PIL with a fallback font path
First page encode via vidore/colqwen2.5-v0.2 returns a [~740, 128] multivector on the dev cluster (verified before a cluster-side wedge took the worker out; see notes)
sie_sdk.scoring.maxsim(query_mv, [doc_mv]) returns the expected high score for a matched page
BAAI/bge-reranker-v2-m3 returns sensible scores on text inputs (sanity test, separate from the visual reranker)
End-to-end ingest + search against a healthy SIE cluster (blocked on the cluster recovery currently in progress)
DocVQA instruction=<question> returns a focused answer rather than an OCR dump (depends on the cluster's Florence-2 adapter routing the task token correctly)
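
The rendering step in the test plan (1024×1280 PNGs via PIL with a fallback font path) can be sketched as follows. This is not the contents of `data/render_pages.py`: the font candidate paths, the `load_font` and `render_page` helpers, and the sample text are placeholders chosen for illustration.

```python
from PIL import Image, ImageDraw, ImageFont

PAGE_SIZE = (1024, 1280)  # width x height, as stated in the test plan

# Hypothetical candidate paths; adjust for the fonts installed on your machine.
FONT_CANDIDATES = [
    "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf",
    "/System/Library/Fonts/Helvetica.ttc",
    "C:/Windows/Fonts/arial.ttf",
]


def load_font(size: int = 28):
    """Return the first TrueType font that loads, else PIL's built-in font."""
    for path in FONT_CANDIDATES:
        try:
            return ImageFont.truetype(path, size)
        except OSError:
            continue  # font missing or unreadable; try the next candidate
    return ImageFont.load_default()


def render_page(title: str, body: str) -> Image.Image:
    """Draw a title and body text onto a white 1024x1280 canvas."""
    img = Image.new("RGB", PAGE_SIZE, "white")
    draw = ImageDraw.Draw(img)
    draw.text((64, 64), title, fill="black", font=load_font(40))
    draw.multiline_text((64, 160), body, fill="black", font=load_font(28))
    return img


# Placeholder content; the real corpus comes from pages.json.
page = render_page("Expense policy", "Receipts are required\nfor every claim.")
```

Calling `page.save("page.png")` would write the PNG to disk. The fallback keeps the script usable on machines where none of the listed fonts exist, at the cost of smaller, plainer text.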

Notes for reviewers

The visual reranker is configured but disabled by default in config.yaml. There's a known cluster-side adapter issue where JSON image inputs are not base64-decoded before reaching the preprocessor. Once the fix for that lands, set search.visual_rerank: true and the second stage will run in the same modality as retrieval.
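
To make the missing step concrete, here is a small sketch of base64-decoding an image field from a JSON payload before handing the bytes to an image preprocessor. The payload shape, the `image` field name, and the `decode_image_field` helper are assumptions; the cluster's actual wire format is not shown in this PR.

```python
import base64
import json


def decode_image_field(payload_json: str, field: str = "image") -> bytes:
    """Parse a JSON payload and return the raw bytes of a base64 image field."""
    payload = json.loads(payload_json)
    # validate=True rejects non-base64 characters instead of silently skipping them.
    return base64.b64decode(payload[field], validate=True)


# Round trip with the 8-byte PNG signature standing in for a real image.
png_signature = b"\x89PNG\r\n\x1a\n"
wire = json.dumps({"image": base64.b64encode(png_signature).decode("ascii")})
decoded = decode_image_field(wire)
```

Without this decode step, a preprocessor would receive ASCII text rather than image bytes, which matches the failure mode described above.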

The synthetic corpus is intentionally domain-mixed (engineering runbooks, HR policies, finance procedures) so queries clearly disambiguate by tenant and the visual layout matters more than keyword overlap.

…2-DocVQA A multi-tenant retrieval + QA example that keeps OCR out of the score path. Pages are encoded as images with ColQwen2.5, MaxSim ranks them via late interaction, and Florence-2-FT-DocVQA reads the top page to produce a textual answer. An optional Qwen3-VL-Reranker-2B second stage stays in the visual modality so layout cues survive both ranking stages. Exercises encode + extract (and score when enabled). Includes a synthetic 3-tenant corpus, a PIL renderer that turns each entry into a PNG, a FastAPI server, and a minimal UI that shows the page image alongside the answer.

svonava wants to merge 1 commit into main from daniel/vision-doc-rag-example
