Add user_vision_size in VLM's get_specializations for chunked embedding in vLLM v1#996

Draft
quic-xiyushi wants to merge 1 commit into quic:release/v1.21.6 from quic-xiyushi:user_vision_size

Conversation

@quic-xiyushi
Contributor

Background

vLLM v1 introduces chunked prefill with encoder cache, where the model runner processes requests in scheduled token windows. For each window:

  • Overlap detection determines which multimodal embeddings are required

  • Only the relevant sub-tensors are gathered and used

From a QEfficient perspective, this is analogous to:

  • Identifying the image indices for the current chunk, and
  • Gathering the corresponding vision embeddings accordingly
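
The overlap detection and gather steps described above can be sketched roughly as follows. This is a minimal illustration, not vLLM's or QEfficient's actual code: `ImageSpan`, `overlapping_rows`, and `gather_chunk_embeddings` are hypothetical names, and plain Python lists stand in for embedding tensors.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class ImageSpan:
    """Hypothetical record of where one image's placeholder tokens sit in the prompt."""
    start: int   # prompt index of the image's first placeholder token
    length: int  # number of embedding rows (placeholder tokens) for the image


def overlapping_rows(
    spans: Sequence[ImageSpan], chunk_start: int, chunk_end: int
) -> List[Tuple[int, int, int]]:
    """Return (image_index, row_start, row_end) for every image that
    overlaps the scheduled token window [chunk_start, chunk_end)."""
    hits = []
    for idx, span in enumerate(spans):
        lo = max(span.start, chunk_start)
        hi = min(span.start + span.length, chunk_end)
        if lo < hi:  # non-empty intersection with the window
            # Convert prompt positions to row offsets inside this image's embedding.
            hits.append((idx, lo - span.start, hi - span.start))
    return hits


def gather_chunk_embeddings(embeddings, spans, chunk_start, chunk_end):
    """Collect only the embedding rows needed by the current chunk."""
    rows = []
    for idx, row_start, row_end in overlapping_rows(spans, chunk_start, chunk_end):
        rows.extend(embeddings[idx][row_start:row_end])
    return rows
```

For example, with one image occupying prompt positions 2 to 5 and another occupying 10 to 12, a chunk covering positions 4 to 11 needs only the last two rows of the first image and the first two rows of the second.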

Additionally, v1 eliminates the strict prefill/decode distinction, allowing:

  • Prefill chunks of one request to be interleaved with decode steps of other requests

However, if we reuse the full vision embedding for every prefill chunk, which is what QEfficient currently supports, this:

  • Breaks the intended vLLM v1 design
  • Introduces unnecessary overhead due to repeated large set_buffer calls
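
To give a rough sense of the overhead, the sketch below counts how many embedding rows would be pushed to the device across all prefill chunks in each scheme. It assumes the vision buffer is set once per chunk and counts rows rather than bytes; the function name and the numbers in the usage note are illustrative, not measurements.

```python
import math


def rows_moved(prompt_len: int, vision_rows: int, prefill_seq_len: int):
    """Compare rows written per request when the full vision embedding is
    re-sent for every chunk versus a per-chunk slice.

    Returns (num_chunks, full_reuse_rows, chunked_upper_bound_rows).
    """
    num_chunks = math.ceil(prompt_len / prefill_seq_len)
    # Full reuse: every chunk receives the entire vision embedding.
    full_reuse_rows = num_chunks * vision_rows
    # Chunked: a chunk can reference at most prefill_seq_len placeholder
    # tokens, so its vision slice never exceeds that many rows.
    chunked_upper_bound_rows = num_chunks * min(vision_rows, prefill_seq_len)
    return num_chunks, full_reuse_rows, chunked_upper_bound_rows
```

With a hypothetical 4096-token prompt, 3000 vision rows, and a prefill sequence length of 128, `rows_moved(4096, 3000, 128)` reports 32 chunks, 96000 rows for full reuse, and at most 4096 rows for per-chunk slices.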

Proposal

This PR proposes adding user_vision_size to get_specializations for VLM models, with the following scope:

✅ Apply to all current VLM models
❌ Exclude mllama due to its cross-attention
❌ Exclude molmo (not yet supported in vLLM on QAIC; it is not yet clear how the team will support it)

Models onboarded by QEfficient in the future should enable user_vision_size in the same way.
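
As a rough illustration of the idea, the sketch below shows how a user-supplied vision size could override a model's default when building language-side specializations. It is not the signature or return shape of QEfficient's `get_specializations`: the function name `build_lang_specializations`, the `default_vision_size` parameter, and the dictionary keys are all assumptions made for this example.

```python
from typing import Dict, List, Optional


def build_lang_specializations(
    batch_size: int,
    prefill_seq_len: int,
    ctx_len: int,
    default_vision_size: int,
    user_vision_size: Optional[int] = None,
) -> List[Dict[str, int]]:
    """Hypothetical helper: pick the vision embedding size used by the
    language model's prefill and decode specializations."""
    # Fall back to the model's full vision embedding size when the caller
    # does not request a specific size.
    vision_size = default_vision_size if user_vision_size is None else user_vision_size
    if vision_size <= 0:
        raise ValueError("vision size must be a positive integer")

    prefill = {
        "batch_size": batch_size,
        "seq_len": prefill_seq_len,
        "ctx_len": ctx_len,
        "vision_size": vision_size,  # assumed key name
    }
    decode = {
        "batch_size": batch_size,
        "seq_len": 1,  # decode processes one token per step
        "ctx_len": ctx_len,
        "vision_size": vision_size,
    }
    return [prefill, decode]
```

Under these assumptions, passing `user_vision_size=128` with `prefill_seq_len=128` would size the vision input to match one prefill chunk, while omitting it keeps the existing full-size behavior.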

Benefits

This change enables:

✅ Vision embedding size aligned with the prefill sequence length in vLLM v1, allowing efficient chunked prefill for multimodal inputs
✅ Better alignment with vLLM v1 scheduling model
✅ Reduced overhead from repeated large buffer updates
✅ Easier support for multi-resolution input and multiple images per request

Release Plan

🚫 Not intended for 1.21 release
✅ Target release: 1.22

This draft PR is currently based on the 1.21.6 branch because multi-resolution and multi-frame support for Qwen2.5-VL / Qwen3-VL is only available in that branch, and these features are prerequisites for this change.

Next Steps

Once multi-resolution and multi-frame support are available in the QEfficient main branch, this PR will be rebased and migrated to the main branch.

… in vLLM v1

Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>
@quic-xiyushi quic-xiyushi marked this pull request as draft May 18, 2026 18:53