Add user_vision_size in VLM's get_specializations for chunked embedding in vLLM v1 by quic-xiyushi · Pull Request #996 · quic/efficient-transformers

quic-xiyushi · 2026-05-18T18:52:41Z

Background

vLLM v1 introduces chunked prefill with an encoder cache: the model runner processes each request in scheduled token windows. For each window:

Overlap detection determines which multimodal embeddings are required
Only the relevant sub-tensors are gathered and used

From a QEfficient perspective, this is analogous to:

Identifying the image indices for the current chunk, and
Gathering the corresponding vision embeddings accordingly
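The overlap-and-gather step above can be sketched as follows. This is a minimal illustration, not vLLM's or QEfficient's actual code: `ImageSpan`, `gather_chunk_embeddings`, and the plain-list embeddings are hypothetical names chosen for the example, and each image is assumed to occupy one contiguous run of placeholder tokens in the prompt.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ImageSpan:
    """Hypothetical record: where one image's placeholder tokens sit in the prompt."""
    offset: int  # prompt index of the first placeholder token
    length: int  # number of placeholder tokens (= number of embedding rows)


def gather_chunk_embeddings(
    spans: List[ImageSpan],
    embeddings: List[list],
    chunk_start: int,
    chunk_end: int,
) -> List[Tuple[int, list]]:
    """Return (image_index, rows) for every image overlapping [chunk_start, chunk_end)."""
    gathered = []
    for idx, (span, emb) in enumerate(zip(spans, embeddings)):
        # Intersect the scheduled token window with this image's placeholder range.
        lo = max(chunk_start, span.offset)
        hi = min(chunk_end, span.offset + span.length)
        if lo >= hi:
            continue  # no overlap: this image is not needed for the current chunk
        # Translate prompt positions into row indices of this image's embedding.
        gathered.append((idx, emb[lo - span.offset : hi - span.offset]))
    return gathered


# Two images: 6 rows starting at token 4, and 4 rows starting at token 20.
spans = [ImageSpan(offset=4, length=6), ImageSpan(offset=20, length=4)]
embeddings = [
    [[r, r + 0.5] for r in range(6)],
    [[100 + r, 100 + r + 0.5] for r in range(4)],
]
chunk = gather_chunk_embeddings(spans, embeddings, chunk_start=8, chunk_end=22)
```

For the window [8, 22), only the last two rows of the first image and the first two rows of the second image are gathered; a window that falls entirely between the two images gathers nothing.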

Additionally, v1 eliminates the strict prefill/decode distinction, allowing:

Prefill chunks of one request to be interleaved with decode steps of other requests

However, reusing the full vision embedding for every prefill chunk, which is all that QEfficient currently supports:

Breaks the intended vLLM v1 design
Introduces unnecessary overhead due to repeated large set_buffer calls
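To give a rough sense of the overhead, the sketch below counts how many vision-embedding rows are passed to the runtime across all prefill chunks of one request. The function name and the numbers are hypothetical and only illustrate the scaling; they are not measurements from QEfficient.

```python
import math


def vision_rows_set(prompt_len: int, chunk_len: int, rows_per_call: int) -> int:
    """Total vision-embedding rows set across all prefill chunks of one request."""
    num_chunks = math.ceil(prompt_len / chunk_len)
    return num_chunks * rows_per_call


# Hypothetical request: 4096 prompt tokens, 128-token chunks, 2048 vision rows in total.
full_every_chunk = vision_rows_set(4096, 128, rows_per_call=2048)  # full embedding each time
chunk_sized = vision_rows_set(4096, 128, rows_per_call=128)        # buffer sized to the chunk
```

Under these assumed numbers, setting the full embedding on every chunk moves 65536 rows, while a chunk-sized buffer moves 4096.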

Proposal

This PR proposes adding user_vision_size to get_specializations for VLM models, with the following scope:

✅ Apply to all current VLM models
❌ Exclude mllama because of its cross-attention design
❌ Exclude molmo (not yet supported in vLLM on QAIC, and it is unclear how the team plans to support it)

Models onboarded by QEfficient in the future should enable user_vision_size in the same way.
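As a rough sketch of the idea, the function below builds language-model specializations in which the vision-embedding input is sized by an optional `user_vision_size` rather than by the full embedding length. The function name, its parameters, and the dictionary keys are assumptions made for illustration; they do not reproduce the actual `get_specializations` signature of any QEfficient model.

```python
from typing import Dict, List, Optional


def build_lang_specializations(
    batch_size: int,
    prefill_seq_len: int,
    ctx_len: int,
    full_vision_size: int,
    user_vision_size: Optional[int] = None,
) -> List[Dict[str, int]]:
    """Hypothetical helper: return prefill and decode specializations.

    When user_vision_size is given, it replaces the full vision embedding
    length, so a caller can pass only the rows needed by the current chunk.
    """
    vision_size = full_vision_size if user_vision_size is None else user_vision_size
    prefill = {
        "batch_size": batch_size,
        "seq_len": prefill_seq_len,
        "ctx_len": ctx_len,
        "vision_size": vision_size,  # assumed key name
    }
    decode = {
        "batch_size": batch_size,
        "seq_len": 1,
        "ctx_len": ctx_len,
        "vision_size": vision_size,  # assumed key name
    }
    return [prefill, decode]


# Assumed usage for chunked prefill: size the vision input to the prefill window.
specs = build_lang_specializations(
    batch_size=1,
    prefill_seq_len=128,
    ctx_len=4096,
    full_vision_size=2048,
    user_vision_size=128,
)
```

Leaving `user_vision_size` unset in this sketch falls back to the full embedding length, which corresponds to the behaviour described in the Background section.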

Benefits

This change enables:

✅ In vLLM v1, align vision embedding size with prefill sequence length and enable efficient chunked prefill for multimodal inputs
✅ Better alignment with vLLM v1 scheduling model
✅ Reduced overhead from repeated large buffer updates
✅ Easier support for multi-resolution input and multiple images per request

Release Plan

🚫 Not intended for 1.21 release
✅ Target release: 1.22

This draft PR is currently based on the 1.21.6 branch because multi-resolution and multi-frame support for Qwen2.5-VL / Qwen3-VL is available only in that branch, and these features are prerequisites for this change.

Next Steps

Once multi-resolution and multi-frame support are available in the QEfficient main branch, this PR will be rebased and migrated to the main branch.

Add user_vision_size in VLM get_specializations for chunked embedding…

43cdf0d

… in vLLM v1 Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>

quic-xiyushi marked this pull request as draft May 18, 2026 18:53

quic-xiyushi wants to merge 1 commit into quic:release/v1.21.6 from quic-xiyushi:user_vision_size
