Qwen2.5-VL is a vision-language model that supports multi-image understanding, dynamic image resolution, and spatial reasoning through Multimodal Rotary Position Embedding (MRoPE).

Overview

Qwen2.5-VL models are state-of-the-art vision-language models from Alibaba Cloud that excel at:
  • Multi-image understanding: Process and reason across multiple images
  • Dynamic resolution: Handle images of varying sizes and aspect ratios
  • 3D positional encoding: MRoPE for better spatial understanding
  • Long context: Support for extended context lengths
Qwen2.5-VL encodes each token's position in three dimensions (temporal, height, and width) rather than as a single sequence index, so the language model can tell where image patches sit relative to one another.

Architecture Details

Multimodal Rotary Position Embedding (MRoPE)

Qwen2.5-VL uses MRoPE to encode positional information in three dimensions:
# Position IDs shape: [3, batch_size, sequence_length]
# Dimension 0: Temporal dimension
# Dimension 1: Height dimension  
# Dimension 2: Width dimension
This 3D encoding allows the model to:
  • Better understand spatial relationships in images
  • Handle dynamic image resolutions
  • Process multi-image inputs with proper position awareness

Model Components

The Qwen2.5-VL architecture consists of:
  1. Vision Encoder: Processes images into visual features
    • Patch embedding for image tokenization
    • Vision attention layers
    • Patch merger for feature aggregation
  2. Language Model: Core text generation model
    • Modified attention with MRoPE
    • Grouped Query Attention (GQA)
    • RMS Layer Normalization (always computed in FP32)
  3. Vision Pipeline: Multi-stage processing
    Image → Patch Embed → Vision Attention → Patch Merger → Embeddings
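
To make the Grouped Query Attention item above more concrete, the following minimal sketch shows how several query heads share one key/value head. The head counts (28 query heads, 4 key/value heads) are taken from the published Qwen2.5-VL-7B-Instruct configuration and should be treated as assumptions for other model sizes; the function name is illustrative and is not part of ONNX Runtime GenAI.

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int = 28, num_kv_heads: int = 4) -> int:
    """Return the index of the key/value head that a query head attends with."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    if not 0 <= q_head < num_q_heads:
        raise ValueError("q_head out of range")
    # Consecutive query heads form a group that shares one KV head.
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


# Assumed 7B layout: 28 query heads mapped onto 4 KV heads (7 per group).
mapping = [kv_head_for_query_head(h) for h in range(28)]
print(mapping)
```

Because only the key/value heads are cached during generation, sharing them across groups shrinks the KV cache by the group size (7x under the assumed layout).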
    

Building Qwen2.5-VL Models

Qwen2.5-VL requires specific versions of dependencies. Follow the installation steps carefully.

Prerequisites

1. Install ONNX Runtime

# Install nightly ONNX Runtime GenAI for CUDA
pip install -i https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ \
  --pre onnxruntime-genai-cuda

# Uninstall stable ONNX Runtime
pip uninstall -y onnxruntime-gpu

# Install nightly ONNX Runtime
pip install -i https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ \
  --pre onnxruntime-gpu
2. Install PyTorch

# Ensure PyTorch >= 2.7.0
pip install torch==2.7.0
3. Install Additional Dependencies

pip install transformers
pip install pillow
pip install numpy==1.26.4  # Must be < 2.0.0
pip install --pre onnxscript

Model Export

1. Download Base Model

mkdir -p qwen2.5-vl-7b-instruct
cd qwen2.5-vl-7b-instruct

# Download from Hugging Face
huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct \
  --local-dir ./pytorch
2. Export to ONNX

# Use the builder from ONNX Runtime GenAI source
from onnxruntime_genai.models.builder import create_model

create_model(
    model_name="Qwen/Qwen2.5-VL-7B-Instruct",
    input_path="./pytorch",      # Local copy of the Hugging Face checkpoint
    output_dir="./onnx-cuda",    # Where the ONNX model and configs are written
    precision="fp16",
    execution_provider="cuda",
    cache_dir="./cache"
)

Precision Options

# FP16 for CUDA (recommended for most GPUs)
python3 -m onnxruntime_genai.models.builder \
  --model_name Qwen/Qwen2.5-VL-7B-Instruct \
  --output ./onnx-fp16 \
  --precision fp16 \
  --execution_provider cuda

Using Qwen2.5-VL

Basic Usage

import onnxruntime_genai as og

# Load model
config = og.Config("./qwen2.5-vl-7b-instruct/onnx-cuda")
model = og.Model(config)
processor = model.create_multimodal_processor()
tokenizer = og.Tokenizer(model)

# Load image
images = og.Images.open("image.jpg")

# Create prompt
prompt = "Describe this image in detail."

# Process inputs
inputs = processor(prompt, images=images)

# Generate response
params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

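# Decode incrementally so characters that span several tokens print correctly
tokenizer_stream = tokenizer.create_stream()

print("Response: ", end="", flush=True)
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)
print()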

Multi-Image Processing

Qwen2.5-VL excels at reasoning across multiple images:
import onnxruntime_genai as og

# Load multiple images
images = og.Images.open(
    "product_front.jpg",
    "product_back.jpg",
    "product_side.jpg"
)

# Ask comparative question
prompt = """
Analyze these three product images:
1. What are the key features visible from different angles?
2. Are there any defects or quality issues?
3. How would you rate the product packaging?
"""

inputs = processor(prompt, images=images)

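# Generate detailed analysis
params = og.GeneratorParams(model)
params.set_search_options(
    max_length=4096,
    do_sample=True,  # temperature and top_p only take effect when sampling is enabled
    temperature=0.7,
    top_p=0.9
)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

# Decode incrementally so characters that span several tokens print correctly
tokenizer_stream = tokenizer.create_stream()
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)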

Chat Conversation with Images

import json
import onnxruntime_genai as og

# Initialize model
config = og.Config("./qwen2.5-vl-7b-instruct/onnx-cuda")
model = og.Model(config)
processor = model.create_multimodal_processor()
tokenizer = og.Tokenizer(model)

# Chat history
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant that can see images."
    },
    {
        "role": "user",
        "content": "What objects can you identify in this image?"
    }
]

# Load image
images = og.Images.open("scene.jpg")

# Convert messages to a prompt using the model's chat template
messages_json = json.dumps(messages)
prompt = tokenizer.apply_chat_template(messages_json, add_generation_prompt=True)

# Process and generate
inputs = processor(prompt, images=images)

params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

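# Decode incrementally so characters that span several tokens are reassembled correctly
tokenizer_stream = tokenizer.create_stream()

response = ""
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    token_text = tokenizer_stream.decode(new_token)
    response += token_text
    print(token_text, end="", flush=True)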

# Add assistant response to history
messages.append({"role": "assistant", "content": response})

Image Preprocessing

Automatic Resolution Handling

Qwen2.5-VL automatically handles various image resolutions:
# High resolution image
high_res = og.Images.open("high_resolution_4k.jpg")

# Low resolution image  
low_res = og.Images.open("thumbnail_128x128.jpg")

# Both are automatically preprocessed to optimal resolution
inputs_high = processor("Analyze this high-res image", images=high_res)
inputs_low = processor("Analyze this thumbnail", images=low_res)

Grid Dimensions

The model uses grid-based image processing with temporal, height, and width dimensions:
# Grid dimensions are calculated automatically from the image size.
# For a 1024x768 (width x height) image with patch size 14:
# - Temporal: 1 (single image)
# - Height: 768 / 14 ≈ 55 patches
# - Width: 1024 / 14 ≈ 73 patches
# The processor may resize the image first, so the exact counts can differ slightly.
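
The grid arithmetic above can be made exact with a short sketch. It assumes the resizing rule of the reference Hugging Face preprocessing for Qwen2-VL/Qwen2.5-VL: each side is rounded to a multiple of 28 (patch size 14 times a 2x2 spatial merge), and the image is rescaled if its area falls outside a pixel budget. The `min_pixels` and `max_pixels` defaults below are assumptions; an exported model's preprocessor configuration may use different limits, and the ONNX Runtime GenAI processor may differ in detail.

```python
import math

PATCH_SIZE = 14
MERGE_SIZE = 2                      # 2x2 patches are merged into one token
FACTOR = PATCH_SIZE * MERGE_SIZE    # each side becomes a multiple of 28


def smart_resize(height, width, factor=FACTOR,
                 min_pixels=56 * 56, max_pixels=28 * 28 * 1280):
    """Round both sides to multiples of `factor`, keeping area within the budget."""
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too small: enlarge both sides by the same ratio, rounding up.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


def grid_thw(height, width):
    """Return (temporal, height, width) patch counts for a single image."""
    h_bar, w_bar = smart_resize(height, width)
    return 1, h_bar // PATCH_SIZE, w_bar // PATCH_SIZE


def num_image_tokens(height, width):
    """Number of image tokens the language model sees after the patch merger."""
    t, gh, gw = grid_thw(height, width)
    return t * gh * gw // (MERGE_SIZE ** 2)


# An image 768 pixels high and 1024 pixels wide:
print(grid_thw(768, 1024))          # (1, 54, 74)
print(num_image_tokens(768, 1024))  # 999
```

Under these assumptions the image is resized to 756x1036 pixels (height x width) before patching, which is why the patch counts are even numbers rather than the rough quotients 55 and 73.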

Advanced Features

Custom Position IDs

For advanced use cases, you can work with 3D position IDs:
import torch

# Example: Create custom 3D position IDs
# Shape: [3, batch_size, sequence_length]
batch_size = 1
sequence_length = 100

# Temporal, Height, Width dimensions
position_ids = torch.zeros((3, batch_size, sequence_length), dtype=torch.int64)

# For text tokens, all three dimensions share the same sequential IDs.
# The 1-D range broadcasts across the temporal, height, and width rows.
position_ids[:] = torch.arange(sequence_length, dtype=torch.int64)

# For image patches, dimensions vary based on spatial location
# (automatically handled by the processor)
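
For image patches, a minimal sketch of how the three rows diverge is shown below. It follows the scheme used by the reference Hugging Face implementation as I understand it: image tokens share one temporal index, take their row and column as height and width offsets, and the text that follows resumes from the largest index used so far plus one. The function name is hypothetical, the grid is the post-merge grid (after 2x2 patch merging), and plain Python lists stand in for tensors.

```python
def mrope_position_ids(num_text_before, grid_h, grid_w, num_text_after):
    """Build [temporal, height, width] position lists for: text, one image, text."""
    t_ids, h_ids, w_ids = [], [], []

    # Leading text: all three dimensions advance together.
    for i in range(num_text_before):
        t_ids.append(i)
        h_ids.append(i)
        w_ids.append(i)

    # Image tokens: constant temporal index, row/column offsets for height/width.
    start = num_text_before
    for row in range(grid_h):
        for col in range(grid_w):
            t_ids.append(start)
            h_ids.append(start + row)
            w_ids.append(start + col)

    # Trailing text resumes after the largest position used by the image.
    next_pos = start + max(grid_h, grid_w)
    for i in range(num_text_after):
        t_ids.append(next_pos + i)
        h_ids.append(next_pos + i)
        w_ids.append(next_pos + i)

    return [t_ids, h_ids, w_ids]


# Two text tokens, a 2x3 image grid, then two more text tokens.
ids = mrope_position_ids(2, 2, 3, 2)
for name, row in zip(("temporal", "height", "width"), ids):
    print(f"{name:>8}: {row}")
```

Wrapping each list in an extra batch dimension gives the `[3, batch_size, sequence_length]` layout described above.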

Vision Pipeline Components

The individual vision pipeline stages are not invoked directly; they run automatically when inputs are processed and generation starts:
# The vision pipeline consists of:
# 1. Patch Embedding: Image -> Patches
# 2. Vision Attention: Process patches with attention
# 3. Patch Merger: Merge patches to reduce sequence length

# These are automatically orchestrated by the processor
images = og.Images.open("image.jpg")
inputs = processor(prompt, images=images)

# The processor handles:
# - Patch extraction from images
# - Window-based attention reordering
# - Spatial merge operations
# - Final embedding generation

Performance Optimization

Choose the right execution provider for your hardware:
config = og.Config("./qwen2.5-vl-7b-instruct/onnx-cuda")

# For NVIDIA GPUs
config.clear_providers()
config.append_provider("cuda")

# For DirectML (AMD/Intel)
# config.append_provider("dml")

model = og.Model(config)
Different precisions offer different trade-offs:

Precision  Speed  Memory  Accuracy   Hardware
FP32       Slow   High    Best       All
FP16       Fast   Medium  Good       Modern GPUs
BF16       Fast   Medium  Very Good  A100, H100

LayerNorm and RoPE are always computed in FP32 internally for numerical stability, regardless of model precision.
Process multiple images efficiently:
# Single batch with multiple images
images = og.Images.open("img1.jpg", "img2.jpg", "img3.jpg")
prompt = "Analyze all these images together."
inputs = processor(prompt, images=images)

# Batch size is automatically determined from inputs
generator = og.Generator(model, params)
generator.set_inputs(inputs)
For large images or long contexts:
# Monitor token count
generator.set_inputs(inputs)
total_tokens = generator.token_count()
print(f"Total input tokens: {total_tokens}")

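# Adjust max_length based on available memory.
# max_length counts prompt tokens plus generated tokens.
if total_tokens > 2048:
    max_new_tokens = 1024  # Reduce for large inputs
else:
    max_new_tokens = 2048

# Search options should be set before the generator is created,
# so build a new generator with the updated limit.
params.set_search_options(max_length=total_tokens + max_new_tokens)
generator = og.Generator(model, params)
generator.set_inputs(inputs)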

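To relate `max_length` to GPU memory, a rough estimate of the key/value cache size can help. The sketch below assumes the published Qwen2.5-VL-7B-Instruct language-model shape (28 layers, 4 key/value heads, head dimension 128) and an FP16 cache; these values and the function name are assumptions, and the estimate ignores model weights, activations, and the vision encoder.

```python
def kv_cache_bytes(total_tokens, num_layers=28, num_kv_heads=4,
                   head_dim=128, bytes_per_value=2, batch_size=1):
    """Estimate KV cache size: keys and values for every layer and token."""
    # Factor 2 accounts for storing both a key and a value vector.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return batch_size * total_tokens * per_token


for tokens in (2048, 4096, 32768):
    mib = kv_cache_bytes(tokens) / (1024 ** 2)
    print(f"{tokens:>6} tokens -> about {mib:.0f} MiB of KV cache")
```

Under these assumptions each token costs 56 KiB of cache, so the cache grows linearly with `max_length` and is usually small next to the FP16 weights until contexts become very long.
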
Implementation Details

RoPE Computation

Qwen2.5-VL uses a custom MRoPE implementation:
# Pseudo-code for MRoPE logic
# 1. Calculate dynamic RoPE caches from 3D position_ids
cos_cache, sin_cache = make_dynamic_rope_caches(position_ids)
# Shape: [3, batch_size, sequence_length, head_dim]

# 2. Flatten and reorder based on MRoPE sections
flat_cos, flat_sin = make_mrope_flattened_caches(cos_cache, sin_cache)
# Shape: [batch_size * sequence_length, head_dim / 2]

# 3. Apply rotation to Q and K
q_rotated = apply_mrope_rotation(q, flat_cos, flat_sin)
k_rotated = apply_mrope_rotation(k, flat_cos, flat_sin)

# 4. Grouped Query Attention
output = grouped_query_attention(q_rotated, k_rotated, v)

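The pseudo-code above leaves the "MRoPE sections" step abstract. As a minimal sketch of that step, the code below computes the rotation angles for a single token: the rotary frequencies are split into three sections, and each section reads its position from the temporal, height, or width row. The section sizes (16, 24, 24), head dimension 128, and base 1,000,000 are taken from the published Qwen2.5-VL-7B-Instruct configuration and should be treated as assumptions; the function names are illustrative and are not the builder's functions.

```python
HEAD_DIM = 128
ROPE_THETA = 1_000_000.0
MROPE_SECTION = (16, 24, 24)  # temporal, height, width; sums to HEAD_DIM // 2


def axis_per_frequency(sections=MROPE_SECTION):
    """For each rotary frequency, return which position row (0=t, 1=h, 2=w) it uses."""
    axes = []
    for axis, width in enumerate(sections):
        axes.extend([axis] * width)
    return axes


def inverse_frequencies(head_dim=HEAD_DIM, theta=ROPE_THETA):
    """Standard RoPE inverse frequencies, one per pair of channels."""
    return [theta ** (-2 * j / head_dim) for j in range(head_dim // 2)]


def mrope_angles(position, sections=MROPE_SECTION, head_dim=HEAD_DIM, theta=ROPE_THETA):
    """Rotation angles for one token whose position is a (t, h, w) triple."""
    if sum(sections) != head_dim // 2:
        raise ValueError("sections must sum to head_dim // 2")
    axes = axis_per_frequency(sections)
    freqs = inverse_frequencies(head_dim, theta)
    return [position[axis] * freq for axis, freq in zip(axes, freqs)]


# A text token has identical t/h/w positions, so MRoPE reduces to ordinary RoPE.
text_angles = mrope_angles((7, 7, 7))
# An image token mixes three different positions across the sections.
image_angles = mrope_angles((3, 5, 9))
print(len(text_angles), image_angles[0], image_angles[16], image_angles[40])
```

The cosine and sine of these angles are what the flattened caches hold; they are then applied to Q and K exactly as in ordinary rotary embeddings.
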
Layer Normalization

# Qwen2.5-VL uses RMSNorm with forced FP32 computation
# Regardless of model precision (FP16/BF16), normalization uses FP32

# This is configured automatically in the builder:
layernorm_attrs = {
    "cast": {
        "use_fp32": True,      # Compute in FP32
        "root_input": True,    # Cast input to FP32
        "skip_input": True,    # Cast skip connection to FP32
        "output_0": True,      # Cast output back to model dtype
        "output_3": True       # Cast residual output back
    }
}
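
For reference, the computation those casts wrap is ordinary RMS normalization. The sketch below shows only the math, using Python floats (which are 64-bit) rather than FP32 tensors; the epsilon default of 1e-6 is an assumption based on the published Qwen2.5 configuration, and the function is illustrative rather than the builder's implementation.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned weight."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale * w for v, w in zip(x, weight)]


hidden = [300.0, -400.0, 250.0, 120.0]
print(rms_norm(hidden, [1.0] * len(hidden)))
```

The intermediate squares are the reason for the higher-precision path: 300 squared is 90,000, which already exceeds the largest finite FP16 value (65,504), so accumulating the mean square in FP16 could overflow even for moderate activations.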

Troubleshooting

If you see NumPy 2.0 compatibility errors:
# Uninstall numpy 2.0+
pip uninstall -y numpy

# Install compatible version
pip install numpy==1.26.4
Qwen2.5-VL requires PyTorch >= 2.7.0:
# Check version
python -c "import torch; print(torch.__version__)"

# Upgrade if needed
pip install --upgrade torch==2.7.0
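
Both version constraints can also be checked from Python before exporting. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the simplified version parsing (numeric components only, local suffixes such as `+cu126` ignored) is an assumption that may not cover every release naming scheme.

```python
from importlib import metadata


def parse_version(text):
    """Turn '2.7.0+cu126' into (2, 7, 0), ignoring non-numeric suffixes."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad so that '2.7' compares equal to '2.7.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def satisfies(installed, minimum=None, below=None):
    """Check minimum <= installed < below for whichever bounds are given."""
    version = parse_version(installed)
    if minimum is not None and version < parse_version(minimum):
        return False
    if below is not None and version >= parse_version(below):
        return False
    return True


# Constraints stated in this guide: NumPy < 2.0.0 and PyTorch >= 2.7.0.
REQUIREMENTS = {"numpy": {"below": "2.0.0"}, "torch": {"minimum": "2.7.0"}}

if __name__ == "__main__":
    for name, bounds in REQUIREMENTS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            print(f"{name}: not installed")
            continue
        status = "OK" if satisfies(installed, **bounds) else "incompatible"
        print(f"{name} {installed}: {status}")
```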
If you encounter position_ids shape errors:
# Verify position_ids shape is [3, batch_size, sequence_length]
# This is handled automatically by the processor

# If manually creating inputs, ensure correct shape:
import torch

seq_len = 100  # Example sequence length
position_ids = torch.zeros((3, 1, seq_len), dtype=torch.int64)
For very high resolution images:
# The model automatically handles resolution
# But you can pre-resize large images if needed:

from PIL import Image

img = Image.open("very_large_image.jpg")
max_size = 2048
if max(img.size) > max_size:
    img.thumbnail((max_size, max_size), Image.LANCZOS)
    img.save("resized.jpg")

images = og.Images.open("resized.jpg")

Example: Document Understanding

import onnxruntime_genai as og
import argparse

def analyze_document(image_path, task="summarize"):
    """Analyze a document image with Qwen2.5-VL."""
    
    # Load model
    config = og.Config("./qwen2.5-vl-7b-instruct/onnx-cuda")
    model = og.Model(config)
    processor = model.create_multimodal_processor()
    tokenizer = og.Tokenizer(model)
    
    # Load document
    images = og.Images.open(image_path)
    
    # Create task-specific prompt
    prompts = {
        "summarize": "Summarize the key points from this document.",
        "extract": "Extract all important information including names, dates, and numbers.",
        "translate": "Translate the text in this document to English.",
        "ocr": "Extract all text from this document."
    }
    
    prompt = prompts.get(task, prompts["summarize"])
    
    # Process
    inputs = processor(prompt, images=images)
    
    # Generate with appropriate parameters
    params = og.GeneratorParams(model)
    params.set_search_options(
        max_length=4096,
        do_sample=True,   # temperature and top_p only take effect when sampling is enabled
        temperature=0.3,  # Lower temperature for factual tasks
        top_p=0.8,
        repetition_penalty=1.1
    )
    
    generator = og.Generator(model, params)
    generator.set_inputs(inputs)
    
    print(f"\nTask: {task}")
    print(f"Document: {image_path}")
    print("\nResult:")
    print("-" * 50)
    
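    # Decode incrementally so characters that span several tokens print correctly
    tokenizer_stream = tokenizer.create_stream()
    while not generator.is_done():
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        print(tokenizer_stream.decode(new_token), end="", flush=True)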
    
    print("\n" + "-" * 50)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--image", required=True, help="Document image path")
    parser.add_argument("--task", default="summarize",
                       choices=["summarize", "extract", "translate", "ocr"],
                       help="Analysis task")
    args = parser.parse_args()
    
    analyze_document(args.image, args.task)

Next Steps

Phi Vision Models

Explore Microsoft’s Phi vision models

Gemma Vision Models

Learn about Google’s Gemma vision models

Model Optimization

Optimize inference performance

Custom Models

Build custom vision models