
Supported Model Architectures

ONNX Runtime GenAI supports a wide range of decoder-only and encoder-decoder transformer architectures. The library is designed to work with models exported to ONNX format with the appropriate optimizations.

Decoder-Only Models

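The following architectures are currently supported, according to the source code. The last three entries (Phi-3 Vision, Qwen-VL / Qwen2.5-VL, and Whisper) are multimodal or speech models rather than decoder-only text models: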
  • AMD OLMo - AMD’s open language model
  • ChatGLM - Bilingual conversational AI model
  • DeepSeek - Advanced reasoning models
  • ERNIE 4.5 - Baidu’s enhanced language model
  • Fara - Emerging architecture
  • Gemma - Google’s lightweight models
  • gpt-oss - Open source GPT variants
  • Granite - IBM’s enterprise models
  • InternLM2 - InternLM series models
  • Llama - Meta’s Llama family (2, 3, 3.1, etc.)
  • Mistral - Mistral AI models
  • Nemotron - NVIDIA’s language models
  • Phi - Microsoft’s small language models
  • Qwen - Alibaba’s Qwen series
  • SmolLM3 - Compact language models
  • Phi-3 Vision - Microsoft’s multimodal model
  • Qwen-VL / Qwen2.5-VL - Vision-language models with image understanding
  • Whisper - OpenAI’s speech recognition model

Model Type Detection

The model type is specified in genai_config.json and determines which implementation is used:
{
  "model": {
    "type": "gpt2",  // or "llama", "phi", "whisper", etc.
    // ...
  }
}
The supported types are defined in src/models/model_type.h.
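As a minimal sketch of how a tool might inspect this field before loading a model, the type can be read with Python's standard json module. The helper name read_model_type is hypothetical and not part of the library. Python's json module rejects // comments, so the sketch assumes a comment-free file.

```python
import json
import tempfile
from pathlib import Path


def read_model_type(model_dir: str) -> str:
    """Return model.type from genai_config.json in model_dir (hypothetical helper)."""
    config_path = Path(model_dir) / "genai_config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config["model"]["type"]


# Demonstrate against a throwaway directory holding a tiny config.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "genai_config.json").write_text(
        json.dumps({"model": {"type": "llama"}}), encoding="utf-8"
    )
    detected_type = read_model_type(tmp)

print(detected_type)
```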

Model Configuration

Models are configured through a genai_config.json file that specifies the model architecture, token IDs, and runtime settings.

Configuration File Structure

The configuration schema is defined in src/config.h. A minimal example looks like this:
{
  "model": {
    "type": "gpt2",
    "pad_token_id": 50256,
    "bos_token_id": 50256,
    "eos_token_id": 50256,
    "vocab_size": 50257,
    "context_length": 1024,
    "decoder": {
      "filename": "decoder_model.onnx",
      "num_hidden_layers": 12,
      "num_key_value_heads": 12,
      "head_size": 64
    }
  }
}

Key Configuration Sections

Model Section

From src/config.h:102:
  • type: Model architecture identifier
  • pad_token_id: Padding token ID
  • eos_token_id: End-of-sequence token ID(s) - can be single value or array
  • bos_token_id: Beginning-of-sequence token ID
  • vocab_size: Size of the vocabulary
  • context_length: Maximum sequence length the model supports
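Because eos_token_id may be a single value or an array, code that consumes the config usually has to normalize it first. The sketch below shows one way to do that. The function names are hypothetical and the token IDs are illustrative values only.

```python
from typing import List, Union


def normalize_eos_token_ids(eos_token_id: Union[int, List[int]]) -> List[int]:
    """Return eos_token_id as a list, whether the config gave one ID or several."""
    if isinstance(eos_token_id, int):
        return [eos_token_id]
    return list(eos_token_id)


def is_end_of_sequence(token_id: int, eos_token_id: Union[int, List[int]]) -> bool:
    """True when token_id matches any configured end-of-sequence ID."""
    return token_id in normalize_eos_token_ids(eos_token_id)


single = normalize_eos_token_ids(50256)               # scalar form
multiple = normalize_eos_token_ids([128001, 128009])  # array form (illustrative IDs)
```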

Decoder Section

From src/config.h:217:
  • filename: Path to the ONNX model file
  • num_hidden_layers: Number of transformer layers
  • num_key_value_heads: Number of KV heads (for GQA/MQA)
  • num_attention_heads: Number of attention heads
  • head_size: Dimension of each attention head
  • session_options: ORT session configuration
  • inputs/outputs: Custom input/output name mappings
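These fields are enough to estimate KV cache memory. The sketch below is a back-of-the-envelope calculation, not the library's allocator. It assumes one key tensor and one value tensor per layer, each shaped [batch, num_key_value_heads, sequence_length, head_size], stored in FP16 (2 bytes per element).

```python
def kv_cache_bytes(num_hidden_layers: int,
                   num_key_value_heads: int,
                   head_size: int,
                   sequence_length: int,
                   bytes_per_element: int = 2,
                   batch_size: int = 1) -> int:
    """Estimate KV cache size under the layout assumptions stated above."""
    elements_per_tensor = (
        batch_size * num_key_value_heads * sequence_length * head_size
    )
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * num_hidden_layers * elements_per_tensor * bytes_per_element


# Values from the example configuration: 12 layers, 12 KV heads,
# head size 64, context length 1024.
gpt2_cache = kv_cache_bytes(12, 12, 64, 1024)
print(gpt2_cache, "bytes =", gpt2_cache / 2**20, "MiB")
```

For the example configuration this comes to 37,748,736 bytes, or 36 MiB, at full context length. Fewer KV heads (GQA/MQA) shrink the cache proportionally.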

Session Options

From src/config.h:80:
"session_options": {
  "log_severity_level": 3,
  "enable_profiling": "profile_output.json",
  "graph_optimization_level": 99,  // ORT_ENABLE_ALL
  "provider_options": [
    {
      "cuda": {
        "device_id": "0",
        "enable_cuda_graph": "1",
        "gpu_mem_limit": "4294967296"
      }
    }
  ]
}
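Provider option values in the example above are strings, including numeric ones such as gpu_mem_limit. As a hedged sketch, a small generator script can build the section and do the string conversion in one place. The helper name cuda_provider_options is hypothetical.

```python
import json

GIB = 1024 ** 3  # bytes in one gibibyte


def cuda_provider_options(device_id: int,
                          enable_cuda_graph: bool,
                          gpu_mem_limit_gib: int) -> dict:
    """Build a CUDA provider entry with every value encoded as a string."""
    return {
        "cuda": {
            "device_id": str(device_id),
            "enable_cuda_graph": "1" if enable_cuda_graph else "0",
            "gpu_mem_limit": str(gpu_mem_limit_gib * GIB),
        }
    }


session_options = {
    "log_severity_level": 3,
    "provider_options": [cuda_provider_options(0, True, 4)],
}
print(json.dumps(session_options, indent=2))
```

A 4 GiB limit produces the same "4294967296" string used in the example.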

Multimodal Models

For vision-language models, additional configuration sections are required:
{
  "model": {
    "type": "phi3v",
    "vision": {
      "filename": "vision_encoder.onnx",
      "inputs": {
        "pixel_values": "pixel_values",
        "image_sizes": "image_sizes"
      },
      "outputs": {
        "image_features": "image_features"
      }
    },
    "embedding": {
      "filename": "embedding_model.onnx",
      "inputs": {
        "input_ids": "input_ids",
        "image_features": "image_features"
      },
      "outputs": {
        "embeddings": "inputs_embeds"
      }
    },
    "decoder": {
      "filename": "decoder_model.onnx",
      "inputs": {
        "embeddings": "inputs_embeds"
      }
    }
  }
}
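In the example above, each stage's output tensor name reappears as an input name of the next stage (image_features, then inputs_embeds). The sketch below checks that wiring for a parsed config. It assumes the values in inputs/outputs are the graph tensor names and that they should match across stages. missing_links is a hypothetical helper.

```python
def missing_links(producer: dict, consumer: dict) -> set:
    """Tensor names the producer emits that the consumer does not take as input."""
    produced = set(producer.get("outputs", {}).values())
    consumed = set(consumer.get("inputs", {}).values())
    return produced - consumed


model = {
    "vision": {
        "inputs": {"pixel_values": "pixel_values", "image_sizes": "image_sizes"},
        "outputs": {"image_features": "image_features"},
    },
    "embedding": {
        "inputs": {"input_ids": "input_ids", "image_features": "image_features"},
        "outputs": {"embeddings": "inputs_embeds"},
    },
    "decoder": {"inputs": {"embeddings": "inputs_embeds"}},
}

vision_gap = missing_links(model["vision"], model["embedding"])
embedding_gap = missing_links(model["embedding"], model["decoder"])
print(vision_gap, embedding_gap)  # both empty when the stages line up
```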

Model Loading and Management

Loading a Model

In C++, models are loaded through Model::Create (defined in src/generators.h:163). The Python bindings expose this as og.Model:
import onnxruntime_genai as og

# Load from directory containing genai_config.json
model = og.Model('path/to/model')

# Get model information (exposed as read-only properties in the Python bindings)
model_type = model.type
device_type = model.device_type
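Before loading, it can help to confirm that the directory contains what the config refers to. The following pre-flight check is a sketch using only the standard library. missing_model_files is a hypothetical helper, and it assumes each model section names its ONNX file under a "filename" key as in the examples on this page.

```python
import json
import tempfile
from pathlib import Path


def missing_model_files(model_dir: str) -> list:
    """List files that the config refers to but that are absent from model_dir."""
    root = Path(model_dir)
    config_path = root / "genai_config.json"
    if not config_path.is_file():
        return ["genai_config.json"]
    model = json.loads(config_path.read_text(encoding="utf-8"))["model"]
    missing = []
    for section in model.values():
        # Only sections such as "decoder" or "vision" carry a filename.
        if isinstance(section, dict) and "filename" in section:
            if not (root / section["filename"]).is_file():
                missing.append(section["filename"])
    return missing


with tempfile.TemporaryDirectory() as tmp:
    no_config = missing_model_files(tmp)
    config = {"model": {"type": "gpt2", "decoder": {"filename": "decoder_model.onnx"}}}
    (Path(tmp) / "genai_config.json").write_text(json.dumps(config), encoding="utf-8")
    no_decoder = missing_model_files(tmp)
    (Path(tmp) / "decoder_model.onnx").write_bytes(b"")  # placeholder file
    complete = missing_model_files(tmp)
```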

Advanced Model Loading

For more control over model loading:
import onnxruntime_genai as og

# Create custom config
config = og.Config('path/to/model')

# Modify execution providers
config.clear_providers()
config.append_provider('cuda')
config.set_provider_option('cuda', 'device_id', '0')

# Load model with config
model = og.Model(config)

Loading from Memory

ONNX model data can be supplied from memory buffers instead of files. The configuration is still created from a path, as in this C++ example:
auto config = OgaConfig::Create("path/to/config");

// Add model data from memory
std::vector<std::byte> model_data = LoadModelFromSource();
config->AddModelData("decoder_model.onnx", model_data);

auto model = OgaModel::Create(*config);

Model Input/Output Naming

ONNX Runtime GenAI uses a flexible naming system for model inputs and outputs (from src/config.h:14):

Default Names

// Decoder inputs
"input_ids"              // Token IDs
"attention_mask"         // Attention mask
"position_ids"           // Position IDs
"past_key_values.%d.key" // KV cache keys
"past_key_values.%d.value" // KV cache values

// Decoder outputs
"logits"                 // Output logits
"present.%d.key"        // Updated KV cache keys
"present.%d.value"      // Updated KV cache values

Custom Naming

You can override default names in the config:
{
  "model": {
    "decoder": {
      "inputs": {
        "input_ids": "tokens",
        "past_names": "cache_%d"  // Combined key/value cache
      },
      "outputs": {
        "logits": "scores",
        "present_names": "new_cache_%d"
      }
    }
  }
}
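The naming scheme combines two ideas: overrides replace defaults key by key, and names containing %d are expanded once per layer. The sketch below illustrates both with plain dictionaries. It is not the library's resolution code, and the dictionary keys used for the default cache templates are labels chosen for this example.

```python
DEFAULT_DECODER_INPUTS = {
    "input_ids": "input_ids",
    "attention_mask": "attention_mask",
    "position_ids": "position_ids",
    # Labels for the per-layer templates are assumptions for illustration.
    "past_key_names": "past_key_values.%d.key",
    "past_value_names": "past_key_values.%d.value",
}


def resolve_input_names(overrides: dict) -> dict:
    """Overlay config overrides on the defaults; untouched names keep defaults."""
    return {**DEFAULT_DECODER_INPUTS, **overrides}


def expand_per_layer(template: str, num_hidden_layers: int) -> list:
    """Substitute each layer index into a %d template."""
    return [template % layer for layer in range(num_hidden_layers)]


resolved = resolve_input_names({"input_ids": "tokens", "past_names": "cache_%d"})
combined_cache = expand_per_layer(resolved["past_names"], 3)
default_keys = expand_per_layer(resolved["past_key_names"], 2)
print(resolved["input_ids"], combined_cache, default_keys)
```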

Model Optimization and Quantization

Quantization Support

ONNX Runtime GenAI supports models in several precisions and quantization formats:
  • INT4 - 4-bit integer quantization (RTN, AWQ)
  • INT8 - 8-bit integer quantization
  • FP16 - Half precision floating point
  • FP32 - Full precision floating point
Quantized models are created using the model builder tools (see src/python/py/models/builder.py).

Graph Optimizations

Set optimization level in session options:
{
  "decoder": {
    "session_options": {
      "graph_optimization_level": 99  // ORT_ENABLE_ALL
    }
  }
}
Optimization levels:
  • 0 - ORT_DISABLE_ALL
  • 1 - ORT_ENABLE_BASIC
  • 2 - ORT_ENABLE_EXTENDED
  • 99 - ORT_ENABLE_ALL (recommended)

CUDA Graph Capture

For the CUDA execution provider, enabling graph capture can improve performance:
{
  "search": {
    "past_present_share_buffer": true
  },
  "decoder": {
    "session_options": {
      "provider_options": [
        {
          "cuda": {
            "enable_cuda_graph": "1"
          }
        }
      ]
    }
  }
}
Requirements (from src/generators.h:96):
  • enable_cuda_graph must be enabled in the config
  • num_beams must be 1, unless the model is Whisper
  • The CUDA execution provider must be in use
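The three requirements can be restated as a single predicate. This is only a sketch of the conditions listed above, not the check performed in src/generators.h. The function name and its parameters are hypothetical.

```python
def cuda_graph_allowed(provider: str,
                       enable_cuda_graph: str,
                       num_beams: int,
                       model_type: str) -> bool:
    """Return True when all three listed requirements hold."""
    enabled_in_config = enable_cuda_graph == "1"  # provider options are strings
    on_cuda = provider == "cuda"
    search_ok = num_beams == 1 or model_type == "whisper"
    return enabled_in_config and on_cuda and search_ok


greedy_llama = cuda_graph_allowed("cuda", "1", 1, "llama")
beam_llama = cuda_graph_allowed("cuda", "1", 4, "llama")
beam_whisper = cuda_graph_allowed("cuda", "1", 4, "whisper")
print(greedy_llama, beam_llama, beam_whisper)
```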

Pipeline Models

Some models use a pipeline architecture with multiple ONNX files (from src/config.h:269):
{
  "model": {
    "decoder": {
      "pipeline": [
        {
          "filename": "stage1.onnx",
          "model_id": "stage1",
          "inputs": ["input_ids", "attention_mask"],
          "outputs": ["hidden_states"],
          "run_on_prompt": true,
          "run_on_token_gen": true
        },
        {
          "filename": "stage2.onnx",
          "model_id": "stage2",
          "inputs": ["hidden_states"],
          "outputs": ["logits"],
          "is_lm_head": true
        }
      ]
    }
  }
}
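In a pipeline, each stage consumes tensors produced by earlier stages. The sketch below checks that every stage input is available when the stage runs and selects the stages for a given phase. Both helpers are hypothetical. Treating a missing run_on_prompt or run_on_token_gen flag as true is an assumption made for this example.

```python
def unresolved_inputs(pipeline: list, model_inputs: list) -> dict:
    """Map model_id to inputs that no earlier stage or model input provides."""
    available = set(model_inputs)
    problems = {}
    for stage in pipeline:
        missing = [name for name in stage["inputs"] if name not in available]
        if missing:
            problems[stage["model_id"]] = missing
        available.update(stage["outputs"])
    return problems


def stages_for(pipeline: list, phase: str) -> list:
    """Return model_ids that run in the phase; absent flags count as True (assumption)."""
    return [stage["model_id"] for stage in pipeline if stage.get(phase, True)]


pipeline = [
    {"model_id": "stage1", "inputs": ["input_ids", "attention_mask"],
     "outputs": ["hidden_states"], "run_on_prompt": True, "run_on_token_gen": True},
    {"model_id": "stage2", "inputs": ["hidden_states"],
     "outputs": ["logits"], "is_lm_head": True},
]

wiring_problems = unresolved_inputs(pipeline, ["input_ids", "attention_mask"])
prompt_stages = stages_for(pipeline, "run_on_prompt")
print(wiring_problems, prompt_stages)
```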

Model State Management

The State class (from src/models/model.h:24) manages model execution:
  • Maintains ORT session and I/O bindings
  • Manages KV cache lifecycle
  • Handles adapter switching (for LoRA)
  • Supports continuous decoding (rewinding)
struct State {
  virtual DeviceSpan<float> Run(int total_length, 
                                DeviceSpan<int32_t>& next_tokens,
                                DeviceSpan<int32_t> next_indices = {}) = 0;
  virtual void RewindTo(size_t index);  // For continuous decoding
  virtual OrtValue* GetInput(const char* name);
  virtual OrtValue* GetOutput(const char* name);
};

Next Steps

  • Generation - Learn about generation strategies and parameters
  • KV Cache - Understand KV cache management
  • Install - Install ONNX Runtime GenAI
  • Model Builder - Build and optimize your own models