High-Performance Generative AI with ONNX Runtime
Run LLMs and multi-modal models on any device with ease. A complete inference library with optimized KV cache management, sampling strategies, and hardware acceleration.
Quick Start
Get running with your first generative AI model in minutes
Download a model
Run inference
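As a rough illustration of what the "Run inference" step can look like from Python, here is a minimal sketch of a token-by-token generation loop. It assumes a recent release of the `onnxruntime-genai` package (older releases may expose a different generator interface) and a model folder that has already been downloaded. The function name `stream_completion` and the path `path/to/model` are placeholders for illustration, not names from the library or this page.

```python
from typing import Iterator


def stream_completion(model_dir: str, prompt: str, max_length: int = 256) -> Iterator[str]:
    """Yield decoded text pieces for `prompt`, one generated token at a time.

    Hypothetical helper: assumes `pip install onnxruntime-genai` and that
    `model_dir` points at an already downloaded model folder.
    """
    # Imported lazily so the module loads even where the package is missing.
    import onnxruntime_genai as og

    model = og.Model(model_dir)
    tokenizer = og.Tokenizer(model)
    stream = tokenizer.create_stream()  # incremental detokenizer

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length)

    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode(prompt))

    # Generate until the model emits an end token or max_length is reached.
    while not generator.is_done():
        generator.generate_next_token()
        token = generator.get_next_tokens()[0]
        yield stream.decode(token)


# Example usage (requires the package and a real model folder):
# for piece in stream_completion("path/to/model", "What is ONNX Runtime?"):
#     print(piece, end="", flush=True)
```

Writing the helper as a generator lets the caller print text as it arrives instead of waiting for the full completion. Most chat models also expect their own prompt template around the user text, which this sketch leaves out.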
Explore by Topic
Dive deep into core concepts, guides, and API references
Core Concepts
Multi-Modal
Hardware Acceleration
Python API
C++ API
Model Builder
Key Features
Everything you need to deploy generative AI at scale
Multi-Language Support
Use Python, C++, C#, C, or Java bindings, all backed by the same high-performance core
20+ Model Architectures
Llama, Phi, Gemma, Qwen, Mistral, Whisper, and more out of the box
Multi-Modal Ready
Vision and audio models with built-in preprocessing and feature extraction
Advanced Decoding
Constrained decoding, beam search, Multi-LoRA, and continuous decoding
Ready to Build?
Start deploying high-performance generative AI models on any device with ONNX Runtime GenAI