Microsoft’s Phi vision models are compact yet powerful multi-modal models that combine visual understanding with language capabilities. ONNX Runtime GenAI supports Phi-3 Vision, Phi-3.5 Vision, and Phi-4 Multi-Modal models.

Supported Models

Phi-3 Vision

128k context length vision model for image understanding

Phi-3.5 Vision

Enhanced vision capabilities with improved accuracy

Phi-4 Multi-Modal

Latest model supporting both vision and audio inputs

Model Architecture

Phi vision models are multi-modal models consisting of several internal components:
  • Vision Encoder: Processes images and extracts visual features
  • Image Embedding: Converts visual features into embeddings compatible with the language model
  • Language Model: Core transformer model for text generation
  • Fusion Layers: Combine visual and text embeddings
For ONNX Runtime GenAI, each internal component is exported as a separate ONNX model for optimal performance.
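Because each component is exported separately, a build output folder should contain several .onnx files rather than one. The following is a minimal sketch for listing them, assuming the files sit directly in the output folder; the helper name list_onnx_components is hypothetical and not part of ONNX Runtime GenAI.

```python
from pathlib import Path


def list_onnx_components(model_dir):
    """Return (file name, size in MiB) for each .onnx file directly inside model_dir."""
    folder = Path(model_dir)
    if not folder.is_dir():
        # Nothing has been built into this folder yet
        return []
    components = []
    for path in sorted(folder.glob("*.onnx")):
        size_mib = path.stat().st_size / (1024 * 1024)
        components.append((path.name, round(size_mib, 2)))
    return components
```

Calling list_onnx_components("./cpu") after the build step gives a quick view of which components were produced and how large each one is.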

Building Phi Vision Models

Phi-3 Vision (128k Context)

1. Download PyTorch Model

# Create workspace
mkdir -p phi3-vision-128k-instruct/pytorch
cd phi3-vision-128k-instruct/pytorch

# Download from Hugging Face
huggingface-cli download microsoft/Phi-3-vision-128k-instruct --local-dir .
2. Download Modified Files

cd ..
huggingface-cli download microsoft/Phi-3-vision-128k-instruct-onnx \
  --include onnx/* --local-dir .
3. Replace Modeling Files

# Replace config (flash_attention_2 -> eager)
rm pytorch/config.json
mv onnx/config.json pytorch/

# Replace modified modeling file
rm pytorch/modeling_phi3_v.py
mv onnx/modeling_phi3_v.py pytorch/

# Add ONNX export helper
mv onnx/image_embedding_phi3_v_for_onnx.py pytorch/

# Move builder script
mv onnx/builder.py .
rm -rf onnx/
4. Build ONNX Models

python3 builder.py \
  --input ./pytorch \
  --output ./cpu \
  --precision fp32 \
  --execution_provider cpu
5. Add Configuration Files

Add the required JSON configuration files (genai_config.json and processor_config.json) to the build output folder.
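Once the configuration files are in the output folder, a quick check can catch a missing or malformed file before the model is loaded. This is a sketch only: the file names are the ones this guide mentions under Fine-Tuning Support, and check_config_files is a hypothetical helper.

```python
import json
from pathlib import Path

# File names taken from the "Update Configurations" step of this guide
REQUIRED_CONFIGS = ("genai_config.json", "processor_config.json")


def check_config_files(model_dir, required=REQUIRED_CONFIGS):
    """Map each required file name to 'ok', 'missing', or 'invalid JSON'."""
    status = {}
    for name in required:
        path = Path(model_dir) / name
        if not path.is_file():
            status[name] = "missing"
            continue
        try:
            json.loads(path.read_text(encoding="utf-8"))
            status[name] = "ok"
        except json.JSONDecodeError:
            status[name] = "invalid JSON"
    return status
```

The check only confirms that the files exist and parse as JSON. It does not validate their contents against the model.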

Using Phi Vision Models

Basic Image Understanding

import onnxruntime_genai as og

# Load model
config = og.Config("./phi3-vision-128k-instruct/cuda")
model = og.Model(config)
processor = model.create_multimodal_processor()
tokenizer = og.Tokenizer(model)

# Load image
images = og.Images.open("image.jpg")

# Create prompt
prompt = "<|user|>\n<|image_1|>\nWhat is shown in this image?<|end|>\n<|assistant|>\n"

# Process inputs
inputs = processor(prompt, images=images)

# Generate response
params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)

generator = og.Generator(model, params)
generator.set_inputs(inputs)

print("Response: ", end="", flush=True)
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer.decode(new_token), end="", flush=True)
print()

Multi-Image Processing

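import onnxruntime_genai as og

# Reuses model, processor, tokenizer, and params from the basic example

# Load multiple images
images = og.Images.open("image1.jpg", "image2.jpg", "image3.jpg")

# Reference each image in the prompt by its 1-based index.
# The prompt is built explicitly so it starts with <|user|> and has no stray newlines.
prompt = (
    "<|user|>\n"
    "<|image_1|>\n<|image_2|>\n<|image_3|>\n"
    "Compare these three images and describe their similarities and differences."
    "<|end|>\n<|assistant|>\n"
)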

# Process and generate
inputs = processor(prompt, images=images)
generator = og.Generator(model, params)
generator.set_inputs(inputs)

while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer.decode(new_token), end="", flush=True)
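With several images, it is easy for the image tags in the prompt to fall out of step with the images actually loaded. The sketch below checks that a prompt references images 1 through N once each. It assumes every loaded image should be referenced exactly once, which the source does not state as a requirement; check_image_tags is a hypothetical helper.

```python
import re

# Matches tags such as <|image_1|> and captures the index
IMAGE_TAG = re.compile(r"<\|image_(\d+)\|>")


def check_image_tags(prompt, num_images):
    """Raise ValueError unless the prompt references images 1..num_images once each."""
    found = sorted(int(index) for index in IMAGE_TAG.findall(prompt))
    expected = list(range(1, num_images + 1))
    if found != expected:
        raise ValueError(f"prompt references images {found}, expected {expected}")
    return True
```

Running check_image_tags(prompt, 3) before calling the processor turns a silent mismatch into an explicit error.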

Chat Template Integration

import json
import onnxruntime_genai as og

# Create messages with image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Apply chat template (the messages are passed as a JSON string)
if hasattr(tokenizer, 'apply_chat_template'):
    prompt = tokenizer.apply_chat_template(messages=json.dumps(messages), add_generation_prompt=True)
else:
    # Manual template
    prompt = f"<|user|>\n<|image_1|>\n{messages[0]['content'][1]['text']}<|end|>\n<|assistant|>\n"

images = og.Images.open("image.jpg")
inputs = processor(prompt, images=images)
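The manual fallback above handles only a single user message with one image. A slightly more general sketch is shown below. It assumes the same tag layout as the manual template and numbers image placeholders in the order they appear; render_phi3_prompt is a hypothetical helper, not a library function.

```python
def render_phi3_prompt(messages, add_generation_prompt=True):
    """Render chat messages into the <|role|> ... <|end|> layout used in this guide."""
    parts = []
    image_index = 0
    for message in messages:
        content = message["content"]
        if isinstance(content, str):
            # Allow plain-string content as shorthand for a single text item
            content = [{"type": "text", "text": content}]
        lines = []
        for item in content:
            if item["type"] == "image":
                image_index += 1  # image tags are numbered from 1 across the whole chat
                lines.append(f"<|image_{image_index}|>")
            elif item["type"] == "text":
                lines.append(item["text"])
            else:
                raise ValueError(f"unsupported content type: {item['type']!r}")
        parts.append(f"<|{message['role']}|>\n" + "\n".join(lines) + "<|end|>\n")
    if add_generation_prompt:
        parts.append("<|assistant|>\n")
    return "".join(parts)
```

For the messages list in the example, this produces the same string as the manual template.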

Image Input Handling

Supported Image Formats

Phi vision models support common image formats:
  • JPEG/JPG
  • PNG
  • BMP
  • TIFF
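A small pre-filter can separate files with a supported extension from the rest before any image is opened. This is a sketch: the extension set mirrors the list above, with .tif added as an assumed alias for TIFF, and split_by_support is a hypothetical helper.

```python
from pathlib import Path

# Extensions for the formats listed above; ".tif" is an assumed alias for TIFF
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"}


def split_by_support(paths):
    """Split paths into (supported, rejected) based on the file extension only."""
    supported, rejected = [], []
    for path in paths:
        if Path(path).suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(str(path))
        else:
            rejected.append(str(path))
    return supported, rejected
```

The check looks at the extension only, so a mislabeled file can still pass it.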

Image Preprocessing

The processor automatically handles:
  1. Resizing: Images are resized to the model’s expected dimensions
  2. Normalization: Pixel values are normalized
  3. Patch Extraction: Images are divided into patches
  4. Embedding: Visual patches are converted to embeddings

Image Resolution

# Phi-3 Vision supports high-resolution images
# The model automatically handles various resolutions
images = og.Images.open("high_res_image.jpg")  # Automatically preprocessed

Advanced Usage

Batch Processing

import onnxruntime_genai as og

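# Process multiple image-text pairs one after another (sequentially, not as a single batched call).
# Reuses model, processor, tokenizer, and params from the basic example.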
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
prompts = [
    "<|user|>\n<|image_1|>\nDescribe this<|end|>\n<|assistant|>\n",
    "<|user|>\n<|image_1|>\nWhat do you see?<|end|>\n<|assistant|>\n",
    "<|user|>\n<|image_1|>\nAnalyze this image<|end|>\n<|assistant|>\n"
]

for img_path, prompt in zip(image_paths, prompts):
    images = og.Images.open(img_path)
    inputs = processor(prompt, images=images)
    
    generator = og.Generator(model, params)
    generator.set_inputs(inputs)
    
    print(f"Processing {img_path}:")
    while not generator.is_done():
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        print(tokenizer.decode(new_token), end="", flush=True)
    print("\n")
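In a long run, one unreadable image should not stop the remaining pairs from being processed. The sketch below wraps the per-pair work in a hypothetical generate callable (for example, a function containing the loop body above) and collects results and errors separately; run_sequential is not part of ONNX Runtime GenAI.

```python
def run_sequential(pairs, generate):
    """Call generate(image_path, prompt) for each pair and keep going on failure.

    Returns (results, errors), both keyed by image path.
    """
    results, errors = {}, {}
    for image_path, prompt in pairs:
        try:
            results[image_path] = generate(image_path, prompt)
        except Exception as exc:
            # Record the failure and continue with the next pair
            errors[image_path] = str(exc)
    return results, errors
```

Note that zip stops at the shorter of its inputs, so it is worth confirming that image_paths and prompts have the same length before building the pairs.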

Custom Generation Parameters

params = og.GeneratorParams(model)
params.set_search_options(
    max_length=4096,           # Maximum total sequence length (prompt plus generated tokens)
    do_sample=True,            # Enable sampling
    top_p=0.9,                 # Nucleus sampling
    top_k=50,                  # Top-k sampling
    temperature=0.7,           # Sampling temperature
    repetition_penalty=1.1     # Penalize repetition
)
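When search options come from user input or a config file, it can help to sanity-check them before passing them on. The ranges below are conventional assumptions for these sampling parameters, not limits documented by ONNX Runtime GenAI, and validate_search_options is a hypothetical helper.

```python
def validate_search_options(options):
    """Return options unchanged if every known key is in a plausible range."""
    checks = {
        "max_length": lambda v: isinstance(v, int) and v > 0,
        "top_k": lambda v: isinstance(v, int) and v >= 1,
        "top_p": lambda v: 0.0 < v <= 1.0,
        "temperature": lambda v: v > 0.0,
        "repetition_penalty": lambda v: v > 0.0,
    }
    problems = [
        f"{key}={options[key]!r}"
        for key, is_valid in checks.items()
        if key in options and not is_valid(options[key])
    ]
    if problems:
        raise ValueError("invalid search options: " + ", ".join(problems))
    return options
```

The validated dictionary can then be unpacked into the call, as in params.set_search_options(**validate_search_options(options)).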

Performance Optimization

Choose the right precision for your hardware:
  • FP32: Best accuracy, slower, works on all devices
  • FP16: Good balance, requires GPU with FP16 support
  • INT4: Fastest, smallest memory footprint, slight accuracy loss
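To make the memory trade-off concrete, the weight footprint can be estimated from the parameter count and the bytes stored per parameter. This is a rough sketch: it ignores quantization metadata, activations, and the KV cache, and the parameter count you pass in is your own assumption (the Phi-3 Vision model card lists roughly 4.2 billion).

```python
# Approximate storage per parameter for each precision
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int4": 0.5}


def estimate_weight_gib(num_params, precision):
    """Estimate the weight storage in GiB for a model with num_params parameters."""
    if precision not in BYTES_PER_PARAM:
        raise ValueError(f"unknown precision: {precision!r}")
    return num_params * BYTES_PER_PARAM[precision] / 1024**3


# Compare precisions for an assumed parameter count
for precision in ("fp32", "fp16", "int4"):
    print(precision, round(estimate_weight_gib(4_200_000_000, precision), 2), "GiB")
```

The estimate is a lower bound on what the model needs at run time, since image embeddings and long contexts add to it.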
# Build with FP16 precision for the CUDA execution provider
python3 builder.py --input ./pytorch --output ./cuda \
  --precision fp16 --execution_provider cuda
To run on a GPU, select the CUDA execution provider when loading the model:
config = og.Config("./model/cuda")
config.clear_providers()
config.append_provider("cuda")
model = og.Model(config)
For large images or long sequences:
# Monitor token count
generator = og.Generator(model, params)
generator.set_inputs(inputs)

input_tokens = generator.token_count()
print(f"Input tokens (including image): {input_tokens}")

# Cap the number of newly generated tokens
max_new_tokens = 1024
generated = 0

while not generator.is_done() and generated < max_new_tokens:
    generator.generate_next_token()
    generated += 1
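Since image tokens count toward the input, it is useful to know how much room is left for the response. The sketch below assumes that max_length covers the prompt plus the generated tokens; new_token_budget is a hypothetical helper.

```python
def new_token_budget(max_length, input_tokens, max_new_tokens=None):
    """Return how many tokens can still be generated within max_length."""
    remaining = max_length - input_tokens
    if remaining <= 0:
        raise ValueError(
            f"input uses {input_tokens} tokens, which leaves no room within max_length={max_length}"
        )
    if max_new_tokens is None:
        return remaining
    # Respect both the overall limit and the caller's own cap
    return min(remaining, max_new_tokens)
```

For example, with max_length=2048 and an input of 1800 tokens, a requested cap of 1024 new tokens is reduced to the 248 that still fit.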

Fine-Tuning Support

You can use your own fine-tuned Phi vision models:
1. Fine-tune with PyTorch

Fine-tune the model using your preferred training framework.
2. Replace Weights

# After downloading the base model files
# Replace the *.safetensors files with your fine-tuned weights
cp /path/to/finetuned/*.safetensors ./phi3-vision-128k-instruct/pytorch/
3. Build ONNX Models

python3 builder.py --input ./pytorch --output ./cuda \
  --precision fp16 --execution_provider cuda
4. Update Configurations

Modify genai_config.json and processor_config.json if your fine-tuning changed model architecture or tokenizer.

Troubleshooting

If an image fails to load:
import os
import onnxruntime_genai as og

# Verify image path exists
image_path = "image.jpg"
if not os.path.exists(image_path):
    raise FileNotFoundError(f"Image not found: {image_path}")

# Load with error handling
try:
    images = og.Images.open(image_path)
except Exception as e:
    print(f"Error loading image: {e}")
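A file can exist and still fail to load if its contents do not match its extension. As a sketch, the leading bytes of the file can be compared against the well-known signatures of the four formats listed earlier; sniff_image_format is a hypothetical helper and does not verify that the rest of the file is valid.

```python
# Well-known leading bytes ("magic numbers") for each format
SIGNATURES = {
    "jpeg": (b"\xff\xd8\xff",),
    "png": (b"\x89PNG\r\n\x1a\n",),
    "bmp": (b"BM",),
    "tiff": (b"II*\x00", b"MM\x00*"),  # little-endian and big-endian variants
}


def sniff_image_format(path):
    """Return 'jpeg', 'png', 'bmp', 'tiff', or None based on the file's first bytes."""
    with open(path, "rb") as handle:
        header = handle.read(8)
    for name, prefixes in SIGNATURES.items():
        if header.startswith(prefixes):
            return name
    return None
```

A None result for a file named image.jpg suggests the file is truncated, corrupted, or in a different format than its name implies.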
If you encounter OOM errors:
  1. Reduce image resolution before processing
  2. Use INT4 quantization instead of FP16
  3. Reduce max_length parameter
  4. Process images one at a time instead of batching
# Lower the maximum sequence length (prompt plus generated tokens)
params.set_search_options(max_length=1024)  # Instead of 4096
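For the first suggestion, the target size can be computed so that the aspect ratio is preserved. The sketch below covers only the arithmetic: the max_side value is your own choice rather than a documented limit, fit_within is a hypothetical helper, and the actual resizing would be done with an image library of your choice.

```python
def fit_within(width, height, max_side):
    """Return (width, height) scaled down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        # Already small enough; never upscale
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Saving the downscaled copy to disk and opening that file instead keeps the rest of the pipeline unchanged.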
If you see flash attention errors:
# Verify config.json has eager attention
grep _attn_implementation pytorch/config.json
# Should show: "_attn_implementation": "eager"
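The same check can be done from Python, which avoids depending on how the JSON file is formatted. This is a minimal sketch that assumes _attn_implementation is a top-level key in config.json, as the grep output above suggests; attention_implementation is a hypothetical helper.

```python
import json


def attention_implementation(config_path):
    """Return the value of _attn_implementation from a config.json, or None if absent."""
    with open(config_path, encoding="utf-8") as handle:
        config = json.load(handle)
    return config.get("_attn_implementation")
```

If attention_implementation("pytorch/config.json") returns anything other than "eager", repeat the Replace Modeling Files step.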

Example Application

Here’s a complete example script for document analysis:
import onnxruntime_genai as og
import argparse

def analyze_document(image_path, question):
    # Load model
    config = og.Config("./phi3-vision-128k-instruct/cuda")
    model = og.Model(config)
    processor = model.create_multimodal_processor()
    tokenizer = og.Tokenizer(model)
    
    # Load document image
    images = og.Images.open(image_path)
    
    # Create prompt
    prompt = f"<|user|>\n<|image_1|>\n{question}<|end|>\n<|assistant|>\n"
    
    # Process
    inputs = processor(prompt, images=images)
    
    # Generate
    params = og.GeneratorParams(model)
    params.set_search_options(
        max_length=2048,
        do_sample=True,
        top_p=0.9,
        temperature=0.7
    )
    
    generator = og.Generator(model, params)
    generator.set_inputs(inputs)
    
    response = ""
    while not generator.is_done():
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        token_text = tokenizer.decode(new_token)
        response += token_text
        print(token_text, end="", flush=True)
    print()
    
    return response

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--image", required=True, help="Path to document image")
    parser.add_argument("--question", required=True, help="Question about the document")
    args = parser.parse_args()
    
    analyze_document(args.image, args.question)

Next Steps

Qwen Vision

Explore Qwen’s advanced vision models

Gemma Vision

Learn about Google’s Gemma vision models

Whisper Audio

Add audio processing capabilities

Model Quantization

Optimize models with quantization