Models - ONNX Runtime GenAI

Supported Model Architectures

ONNX Runtime GenAI runs decoder-only and encoder-decoder transformer architectures that have been exported to ONNX format with the appropriate optimizations.

Decoder-Only Models

The following architectures are supported in the current source code:

Language Models

AMD OLMo - AMD’s open language model
ChatGLM - Bilingual conversational AI model
DeepSeek - Advanced reasoning models
ERNIE 4.5 - Baidu’s enhanced language model
Fara - Emerging architecture
Gemma - Google’s lightweight models
gpt-oss - OpenAI’s open-weight GPT models
Granite - IBM’s enterprise models
InternLM2 - InternLM series models
Llama - Meta’s Llama family (2, 3, 3.1, etc.)
Mistral - Mistral AI models
Nemotron - NVIDIA’s language models
Phi - Microsoft’s small language models
Qwen - Alibaba’s Qwen series
SmolLM3 - Compact language models

Vision-Language Models

Phi-3 Vision - Microsoft’s multimodal model
Qwen-VL / Qwen2.5-VL - Vision-language models with image understanding

Audio Models

Whisper - OpenAI’s speech recognition model

Model Type Detection

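The model type is set in the model.type field of genai_config.json and determines which implementation is used. The // comments in the snippet are annotations for the reader; standard JSON has no comment syntax.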

{
  "model": {
    "type": "gpt2",  // or "llama", "phi", "whisper", etc.
    // ...
  }
}

The supported types are defined in src/models/model_type.h.
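
As a minimal sketch of the lookup described above, the following reads model.type from a genai_config.json using only the Python standard library. The helper name read_model_type is hypothetical and not part of ONNX Runtime GenAI, and the temporary directory stands in for a real model directory.

```python
import json
import tempfile
from pathlib import Path


def read_model_type(model_dir):
    """Return the "type" string from <model_dir>/genai_config.json.

    Hypothetical helper for illustration; not part of the onnxruntime_genai API.
    """
    config_path = Path(model_dir) / "genai_config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)  # Python's json module rejects // comments
    return config["model"]["type"]


# Stand-in model directory that contains only a config file.
with tempfile.TemporaryDirectory() as tmp:
    config_text = '{"model": {"type": "llama"}}'
    (Path(tmp) / "genai_config.json").write_text(config_text, encoding="utf-8")
    detected_type = read_model_type(tmp)

print(detected_type)
```

A check like this can be useful before loading, for example to decide whether a multimodal processor is needed.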

Model Configuration

Models are configured through a genai_config.json file that specifies the model architecture, token IDs, and runtime settings.

Configuration File Structure

The configuration schema is defined in src/config.h. A minimal example:

{
  "model": {
    "type": "gpt2",
    "pad_token_id": 50256,
    "bos_token_id": 50256,
    "eos_token_id": 50256,
    "vocab_size": 50257,
    "context_length": 1024,
    "decoder": {
      "filename": "decoder_model.onnx",
      "num_hidden_layers": 12,
      "num_key_value_heads": 12,
      "head_size": 64
    }
  }
}

Key Configuration Sections

Model Section

From src/config.h:102:

type: Model architecture identifier
pad_token_id: Padding token ID
eos_token_id: End-of-sequence token ID(s), given either as a single value or as an array
bos_token_id: Beginning-of-sequence token ID
vocab_size: Size of the vocabulary
context_length: Maximum sequence length the model supports
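
Because eos_token_id may be a single value or an array, code that inspects the config has to handle both shapes. The sketch below shows one way to normalize the field; the function names are hypothetical and are not taken from the library.

```python
def normalize_eos_token_ids(model_section):
    """Return eos_token_id as a list, whether the config holds an int or an array.

    Hypothetical helper for illustration only.
    """
    eos = model_section["eos_token_id"]
    if isinstance(eos, int):
        return [eos]
    return list(eos)


def is_end_of_sequence(token_id, model_section):
    """True if token_id is one of the configured end-of-sequence tokens."""
    return token_id in normalize_eos_token_ids(model_section)


single = {"eos_token_id": 50256}
multiple = {"eos_token_id": [1, 2]}

print(normalize_eos_token_ids(single))
print(normalize_eos_token_ids(multiple))
```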

Decoder Section

From src/config.h:217:

filename: Path to the ONNX model file
num_hidden_layers: Number of transformer layers
num_key_value_heads: Number of KV heads (for GQA/MQA)
num_attention_heads: Number of attention heads
head_size: Dimension of each attention head
session_options: ORT session configuration
inputs/outputs: Custom input/output name mappings
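
The decoder fields above are enough for a rough estimate of KV cache memory. The sketch below assumes one key tensor and one value tensor per layer, each holding num_key_value_heads x sequence_length x head_size elements per batch entry, stored in FP16 (2 bytes per element). These are assumptions for illustration; the actual allocation depends on the model, data type, and execution provider.

```python
def kv_cache_bytes(num_hidden_layers, num_key_value_heads, head_size,
                   sequence_length, bytes_per_element=2, batch_size=1):
    """Estimate KV cache size in bytes (assumes FP16 by default)."""
    elements_per_tensor = batch_size * num_key_value_heads * sequence_length * head_size
    bytes_per_layer = 2 * elements_per_tensor * bytes_per_element  # key + value
    return num_hidden_layers * bytes_per_layer


# Values from the gpt2 example config, with the cache sized for context_length.
estimate = kv_cache_bytes(
    num_hidden_layers=12,
    num_key_value_heads=12,
    head_size=64,
    sequence_length=1024,
)
print(f"{estimate} bytes = {estimate / 2**20:.1f} MiB")  # 37748736 bytes = 36.0 MiB
```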

Session Options

From src/config.h:80:

"session_options": {
  "log_severity_level": 3,
  "enable_profiling": "profile_output.json",
  "graph_optimization_level": 99,  // ORT_ENABLE_ALL
  "provider_options": [
    {
      "cuda": {
        "device_id": "0",
        "enable_cuda_graph": "1",
        "gpu_mem_limit": "4294967296"
      }
    }
  ]
}
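
The provider_options field is a list of single-key objects, one per execution provider. The following sketch shows how such a structure could be inspected and edited with plain Python before writing it back to genai_config.json. The helper names and the "example_ep" provider are hypothetical; option values are kept as strings to match the example above.

```python
session_options = {
    "log_severity_level": 3,
    "provider_options": [
        {"cuda": {"device_id": "0", "enable_cuda_graph": "1"}},
    ],
}


def provider_names(options):
    """List the execution provider names in configuration order."""
    return [name for entry in options.get("provider_options", []) for name in entry]


def set_provider_option(options, provider, key, value):
    """Set one option, adding the provider entry if it is not present yet."""
    entries = options.setdefault("provider_options", [])
    for entry in entries:
        if provider in entry:
            entry[provider][key] = str(value)
            return
    entries.append({provider: {key: str(value)}})


set_provider_option(session_options, "cuda", "device_id", 1)
set_provider_option(session_options, "example_ep", "some_option", "on")  # hypothetical provider
print(provider_names(session_options))
```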

Multimodal Models

For vision-language models, additional configuration sections are required:

{
  "model": {
    "type": "phi3v",
    "vision": {
      "filename": "vision_encoder.onnx",
      "inputs": {
        "pixel_values": "pixel_values",
        "image_sizes": "image_sizes"
      },
      "outputs": {
        "image_features": "image_features"
      }
    },
    "embedding": {
      "filename": "embedding_model.onnx",
      "inputs": {
        "input_ids": "input_ids",
        "image_features": "image_features"
      },
      "outputs": {
        "embeddings": "inputs_embeds"
      }
    },
    "decoder": {
      "filename": "decoder_model.onnx",
      "inputs": {
        "embeddings": "inputs_embeds"
      }
    }
  }
}
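
Each inputs and outputs entry in the example maps a logical name (the key) to a tensor name in the ONNX graph (the value). The sketch below summarizes each component of the phi3v example in that form. The helper names are hypothetical, and the vision, embedding, decoder ordering is assumed from the example rather than taken from the library.

```python
model_section = {
    "type": "phi3v",
    "vision": {
        "filename": "vision_encoder.onnx",
        "inputs": {"pixel_values": "pixel_values", "image_sizes": "image_sizes"},
        "outputs": {"image_features": "image_features"},
    },
    "embedding": {
        "filename": "embedding_model.onnx",
        "inputs": {"input_ids": "input_ids", "image_features": "image_features"},
        "outputs": {"embeddings": "inputs_embeds"},
    },
    "decoder": {
        "filename": "decoder_model.onnx",
        "inputs": {"embeddings": "inputs_embeds"},
    },
}


def graph_name(section, component, direction, logical_name):
    """Look up the ONNX tensor name configured for a logical input or output."""
    return section[component][direction][logical_name]


def describe(section, order=("vision", "embedding", "decoder")):
    """Summarize each component: file, graph input names, graph output names."""
    lines = []
    for component in order:
        part = section[component]
        inputs = sorted(part.get("inputs", {}).values())
        outputs = sorted(part.get("outputs", {}).values())
        lines.append(f"{component}: {part['filename']} inputs={inputs} outputs={outputs}")
    return lines


for line in describe(model_section):
    print(line)
```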

Model Loading and Management

Loading a Model

Models are loaded through the Model::Create API (defined in src/generators.h:163). In Python, loading goes through the og.Model constructor:

import onnxruntime_genai as og

# Load from directory containing genai_config.json
model = og.Model('path/to/model')

# Get model information
model_type = model.get_type()
device_type = model.get_device_type()

Advanced Model Loading

For more control over model loading, create an og.Config, adjust its execution providers, and pass it to og.Model:

import onnxruntime_genai as og

# Create custom config
config = og.Config('path/to/model')

# Modify execution providers
config.clear_providers()
config.append_provider('cuda')
config.set_provider_option('cuda', 'device_id', '0')

# Load model with config
model = og.Model(config)

Loading from Memory

Models can be loaded from memory buffers instead of files. With the C++ API:

auto config = OgaConfig::Create("path/to/config");

// Add model data from memory
std::vector<std::byte> model_data = LoadModelFromSource();
config->AddModelData("decoder_model.onnx", model_data);

auto model = OgaModel::Create(*config);

Model Input/Output Naming

ONNX Runtime GenAI refers to model inputs and outputs by name. The defaults come from src/config.h:14 and can be overridden per model. A %d in a name is a placeholder for the layer index.

Default Names

// Decoder inputs
"input_ids"                 // Token IDs
"attention_mask"            // Attention mask
"position_ids"              // Position IDs
"past_key_values.%d.key"    // KV cache keys
"past_key_values.%d.value"  // KV cache values

// Decoder outputs
"logits"                    // Output logits
"present.%d.key"            // Updated KV cache keys
"present.%d.value"          // Updated KV cache values

Custom Naming

You can override default names in the config:

{
  "model": {
    "decoder": {
      "inputs": {
        "input_ids": "tokens",
        "past_names": "cache_%d"  // Combined key/value cache
      },
      "outputs": {
        "logits": "scores",
        "present_names": "new_cache_%d"
      }
    }
  }
}
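
The naming scheme can be illustrated with two small steps: pick the override if one is configured, otherwise the default, and expand any %d template once per layer. The sketch below is not the library's implementation. The helper names are hypothetical, and layer indices are assumed to start at 0.

```python
# Defaults taken from the list of default names above.
DEFAULT_NAMES = {
    "input_ids": "input_ids",
    "attention_mask": "attention_mask",
    "position_ids": "position_ids",
    "logits": "logits",
}


def resolve_name(overrides, logical_name):
    """Return the configured tensor name, falling back to the default."""
    return overrides.get(logical_name, DEFAULT_NAMES[logical_name])


def expand_per_layer(template, num_layers):
    """Expand a %d template into one tensor name per layer (0-based, assumed)."""
    return [template % layer for layer in range(num_layers)]


custom_inputs = {"input_ids": "tokens", "past_names": "cache_%d"}

print(resolve_name(custom_inputs, "input_ids"))
print(resolve_name(custom_inputs, "attention_mask"))
print(expand_per_layer("past_key_values.%d.key", 2))
print(expand_per_layer(custom_inputs["past_names"], 2))
```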

Model Optimization and Quantization

Quantization Support

ONNX Runtime GenAI supports various quantization formats:

INT4 - 4-bit integer quantization (RTN, AWQ)
INT8 - 8-bit integer quantization
FP16 - Half precision floating point
FP32 - Full precision floating point

Quantized models are created using the model builder tools (see src/python/py/models/builder.py).

Graph Optimizations

Set the optimization level in the decoder session options:

{
  "model": {
    "decoder": {
      "session_options": {
        "graph_optimization_level": 99  // ORT_ENABLE_ALL
      }
    }
  }
}

Optimization levels:

0 - ORT_DISABLE_ALL
1 - ORT_ENABLE_BASIC
2 - ORT_ENABLE_EXTENDED
99 - ORT_ENABLE_ALL (recommended)

CUDA Graph Capture

For the CUDA execution provider, enable graph capture for better performance:

{
  "search": {
    "past_present_share_buffer": true
  },
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "cuda": {
              "enable_cuda_graph": "1"
            }
          }
        ]
      }
    }
  }
}

Requirements (from src/generators.h:96):

enable_cuda_graph must be enabled in the config
num_beams must be 1, unless the model is Whisper
The CUDA execution provider must be in use
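
The three requirements can be expressed as a small check over a parsed config. The sketch below assumes the decoder section is nested under model, as in the configuration file structure shown earlier, and that the model type string for Whisper is "whisper". The function cuda_graph_problems is hypothetical and does not reproduce the logic in src/generators.h.

```python
def cuda_graph_problems(config, num_beams):
    """Return a list of unmet CUDA graph requirements (empty if all are met)."""
    model = config.get("model", {})
    session_options = model.get("decoder", {}).get("session_options", {})

    cuda_options = None
    for entry in session_options.get("provider_options", []):
        if "cuda" in entry:
            cuda_options = entry["cuda"]

    problems = []
    if cuda_options is None:
        problems.append("CUDA execution provider is not configured")
    elif cuda_options.get("enable_cuda_graph") != "1":
        problems.append("enable_cuda_graph is not enabled")
    if num_beams != 1 and model.get("type") != "whisper":
        problems.append("num_beams must be 1 unless the model is Whisper")
    return problems


config = {
    "model": {
        "type": "llama",
        "decoder": {
            "session_options": {
                "provider_options": [{"cuda": {"enable_cuda_graph": "1"}}]
            }
        },
    }
}

print(cuda_graph_problems(config, num_beams=1))
print(cuda_graph_problems(config, num_beams=4))
```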

Pipeline Models

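Some models are split across multiple ONNX files that run as a pipeline (from src/config.h:269). Each stage lists its own inputs and outputs, along with flags such as run_on_prompt, run_on_token_gen, and is_lm_head: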

{
  "model": {
    "decoder": {
      "pipeline": [
        {
          "filename": "stage1.onnx",
          "model_id": "stage1",
          "inputs": ["input_ids", "attention_mask"],
          "outputs": ["hidden_states"],
          "run_on_prompt": true,
          "run_on_token_gen": true
        },
        {
          "filename": "stage2.onnx",
          "model_id": "stage2",
          "inputs": ["hidden_states"],
          "outputs": ["logits"],
          "is_lm_head": true
        }
      ]
    }
  }
}
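
One way to read the stage flags is as a filter over the pipeline for each phase of generation. The sketch below selects the stages for prompt processing and for token generation, and finds the stage marked as the language model head. Treating a missing flag as true is an assumption made here for illustration, and the helper names are hypothetical.

```python
pipeline = [
    {
        "filename": "stage1.onnx",
        "model_id": "stage1",
        "inputs": ["input_ids", "attention_mask"],
        "outputs": ["hidden_states"],
        "run_on_prompt": True,
        "run_on_token_gen": True,
    },
    {
        "filename": "stage2.onnx",
        "model_id": "stage2",
        "inputs": ["hidden_states"],
        "outputs": ["logits"],
        "is_lm_head": True,
    },
]


def stages_for_phase(stages, phase):
    """Return model_ids of stages that run in the given phase.

    phase is "prompt" or "token_gen"; a missing flag is treated as True (assumed).
    """
    flag = {"prompt": "run_on_prompt", "token_gen": "run_on_token_gen"}[phase]
    return [stage["model_id"] for stage in stages if stage.get(flag, True)]


def lm_head_stage(stages):
    """Return the model_id of the stage marked is_lm_head, or None."""
    for stage in stages:
        if stage.get("is_lm_head", False):
            return stage["model_id"]
    return None


print(stages_for_phase(pipeline, "prompt"))
print(stages_for_phase(pipeline, "token_gen"))
print(lm_head_stage(pipeline))
```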

Model State Management

The State class (from src/models/model.h:24) manages model execution:

Maintains ORT session and I/O bindings
Manages KV cache lifecycle
Handles adapter switching (for LoRA)
Supports continuous decoding (rewinding)

struct State {
  virtual DeviceSpan<float> Run(int total_length, 
                                DeviceSpan<int32_t>& next_tokens,
                                DeviceSpan<int32_t> next_indices = {}) = 0;
  virtual void RewindTo(size_t index);  // For continuous decoding
  virtual OrtValue* GetInput(const char* name);
  virtual OrtValue* GetOutput(const char* name);
};
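
The idea behind RewindTo can be shown with a toy model in Python: the state keeps one cache entry per processed token, and rewinding truncates both the token history and the cache so generation can continue from an earlier position. This is a conceptual sketch only. ToyState and its methods are hypothetical and do not mirror the C++ State class.

```python
class ToyState:
    """Toy stand-in for a decoding state with a per-token KV cache."""

    def __init__(self):
        self.tokens = []
        self.kv_cache = []  # one placeholder entry per processed token

    def append(self, token_id):
        """Process one token: record it and add its cache entry."""
        self.kv_cache.append(f"kv[{len(self.tokens)}]")
        self.tokens.append(token_id)

    def rewind_to(self, index):
        """Discard everything at position index and later."""
        if not 0 <= index <= len(self.tokens):
            raise ValueError(f"index {index} is outside 0..{len(self.tokens)}")
        del self.tokens[index:]
        del self.kv_cache[index:]


state = ToyState()
for token_id in [11, 22, 33, 44]:
    state.append(token_id)

state.rewind_to(2)   # keep the first two tokens and their cache entries
state.append(55)     # continue decoding from the rewound position
print(state.tokens, state.kv_cache)
```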

Next Steps

Generation - Learn about generation strategies and parameters
KV Cache - Understand KV cache management
Install - Install ONNX Runtime GenAI
Model Builder - Build and optimize your own models
