OME: Revolutionizing LLM Infrastructure with Model-Driven Architecture

by: The Oracle Team, Jul 08, 2025


The Tale of Two Teams: Why Model Serving Is Broken

In any large organization deploying LLMs, two distinct teams emerge with conflicting needs:

The ML Engineers spend months benchmarking models, experimenting with serving technologies, and crafting optimal deployment strategies. Each model demands different configurations—tensor parallelism for Llama 70B, expert parallelism for DeepSeek V3/R1, specialized settings for multimodal models. The parameters are endless: batch sizes, KV cache configurations, quantization levels. Worse, these configurations shift dramatically across GPU types (H100 vs A100 vs L40S).

The Production Engineers and Data Scientists just want to deploy models. They shouldn’t need to understand the intricacies of tensor parallelism or why a particular model needs 4 GPUs with NVLink. They have customers waiting, applications to build, and business value to deliver.

This gap creates a fundamental problem: MLEs need a way to encode their hard-won serving knowledge into reusable blueprints. Production teams need to deploy models without becoming distributed systems experts. The missing link? A system that understands models as first-class citizens.

The Birth of OME

The Oracle Cloud Infrastructure (OCI) GenAI team faced this exact challenge at scale. Supporting numerous models across diverse GPU hardware, they watched deployment cycles stretch into months. Each new model meant:

  • Weeks of MLE experimentation to find optimal configurations
  • Complex documentation that production teams struggled to follow
  • Deployment failures due to misconfiguration
  • Inability to reuse knowledge across similar models

The breakthrough came from a simple insight: The model itself should drive the deployment.

A Llama model isn’t just a file—it contains metadata about its architecture, parameter count, and requirements. By making the system model-aware rather than deployment-driven, they could bridge the gap between ML expertise and production simplicity.

This led to OME (Open Model Engine): a Kubernetes operator that treats models as first-class resources. The results were dramatic:

  • Model onboarding time: Months → Days
  • Configuration errors: Dramatically reduced
  • MLE knowledge: Captured and reused automatically
  • Production deployment: Simple YAML with just a few lines

But here’s what makes it revolutionary: the model-driven architecture makes it easy to encode and reuse sophisticated deployment strategies:

  • Multi-node serving: Deploy massive models like DeepSeek V3 (685B) across multiple nodes with a simple configuration
  • Prefill-decode disaggregation: Separate compute-intensive prefill from memory-bound decode, with each component scaling independently
  • Flexible architectures: Both prefill and decode can run in single-node or multi-node configurations based on your needs
  • Serverless deployment: Scale-to-zero for cost efficiency when models aren’t in use
  • Business-driven scaling: Complex autoscaling based on KV cache, tokens/second, latency targets, or any custom metric

The model-driven approach doesn’t constrain you—it liberates you. Because OME understands models deeply, it can support any deployment pattern your MLEs design while keeping the interface simple for production teams.

Enter OME: A Kubernetes-native platform where models become first-class citizens. Let’s explore how OME’s architecture transforms the chaos of LLM deployment into an elegant, scalable system that serves everyone from ML researchers to production engineers.

The OME Architecture: Models at the Center

[Figure: OME architecture overview (ome-architecture.svg)]

Layer 1: Kubernetes API Layer

While users—MLEs, data scientists, production engineers, and applications—interact with OME through simple interfaces, the real magic happens in the Kubernetes API layer below.

Custom Resources - The Foundation of Model-Driven Architecture

At the heart of OME lies its Custom Resource Definitions (CRDs), which transform Kubernetes from a generic container orchestrator into an ML platform. These aren’t just configuration files—they’re the language through which you express your ML requirements.

BaseModel/ClusterBaseModel: Models as First-Class Citizens

What is a Base Model?

A Base Model in OME is a Kubernetes resource that represents a foundation AI model (like GPT, Llama, or Mistral) that you want to use for inference workloads. Think of it as a blueprint that tells OME where to find your model, how to download it, and where to store it on your cluster nodes.

When you create a BaseModel resource, OME automatically handles the complex process of downloading the model files, parsing the model’s configuration to understand its capabilities, and making it available across your cluster nodes where AI workloads can use it.

BaseModel vs ClusterBaseModel

OME provides two types of model resources:

  • BaseModel is namespace-scoped, meaning it exists within a specific Kubernetes namespace. If you create a BaseModel in the “team-a” namespace, only workloads in that namespace can use it. This is perfect for team-specific models or when you want to isolate model access.
  • ClusterBaseModel is cluster-scoped, meaning it’s available to workloads in any namespace across your entire cluster. This is ideal for organization-wide models that multiple teams need to access, like a shared Llama-3 model that everyone uses.

Both types use exactly the same specification format—the only difference is their visibility scope.

Traditional platforms treat models as static files to be downloaded and mounted. OME revolutionizes this by making models intelligent, versioned resources that understand their own requirements:

apiVersion: ome.io/v1beta1
kind: ClusterBaseModel
metadata:
  name: llama-3-70b-instruct
spec:
  vendor: meta
  modelType: llama
  modelArchitecture: LlamaForCausalLM
  modelParameterSize: "70B"
  quantization: fp16
  storage:
    storageUri: "hf://meta-llama/Llama-3.3-70B-Instruct"
    path: "/models/llama-3.3-70b"
    nodeSelector:
      gpu.memory: "80Gi"  # Only download to nodes with sufficient GPU memory
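
The namespace-scoped variant uses exactly the same fields. A minimal sketch of a team-scoped BaseModel, with an illustrative model choice and values:

apiVersion: ome.io/v1beta1
kind: BaseModel
metadata:
  name: mistral-7b-instruct        # illustrative model
  namespace: team-a                # only workloads in team-a can use it
spec:
  vendor: mistralai
  modelType: mistral
  modelArchitecture: MistralForCausalLM
  modelParameterSize: "7B"
  storage:
    storageUri: "hf://mistralai/Mistral-7B-Instruct-v0.3"
    path: "/models/mistral-7b-instruct"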

When you create a BaseModel resource, OME’s control plane and data plane components work together to make the model available across your cluster. The BaseModel CRD acts as the declarative specification, while the actual work of downloading, parsing, and distributing models happens in the data plane through the Model Agent.

ServingRuntime: The Brain of Runtime Selection

ClusterServingRuntime is a cluster-scoped resource that manages the runtime environment for model serving. It defines the Pod templates used to serve one or more models, along with key information such as the runtime’s container image and the model formats and architectures it supports. Other runtime settings can be conveyed through environment variables in the container specification.

These CRDs provide flexibility and extensibility, letting users quickly define or customize reusable runtimes without modifying any controller code or any resources in the controller namespace. The only difference between ServingRuntime and ClusterServingRuntime is that the former is namespace-scoped and the latter is cluster-scoped.

apiVersion: ome.io/v1beta1
kind: ClusterServingRuntime
metadata:
  name: sglang-llama-70b
spec:
  supportedModelFormats:
    - modelFormat:
        name: safetensors
      modelArchitecture: LlamaForCausalLM
      modelSizeRange:
        min: "65B"
        max: "75B"
      autoSelect: true
      priority: 100

Full runtime specifications for advanced deployments can be found in the OME repository.

ServingRuntimes define how to serve different model types, with the actual runtime selection logic handled by the control plane when you create an InferenceService.

Advanced Deployment Architectures

ServingRuntimes serve as blueprints for how router, engine, and decoder components are deployed. Each component (except the router) can be configured for single-node, serverless, or multi-node deployment. This flexibility enables cutting-edge serving patterns such as prefill-decode (PD) disaggregated serving, the state of the art for high-performance LLM serving at scale.

This isn’t just incremental improvement—it’s a fundamental advancement in serving architecture that OME makes accessible through simple runtime configuration.

InferenceService: Orchestrating Model Deployments and Ingress

An InferenceService is the central Kubernetes resource in OME that orchestrates the complete lifecycle of model serving. It acts as a declarative specification that describes how you want your AI models deployed, scaled, and served across your cluster.

Think of InferenceService as the “deployment blueprint” for your AI workloads. It brings together models (defined by BaseModel/ClusterBaseModel), runtimes (defined by ServingRuntime/ClusterServingRuntime), and infrastructure configuration to create a complete serving solution. It is the resource that ties models, runtimes, standard Kubernetes services, ingress, scheduling, autoscaling, and access controls together into a complete serving fleet.

Architecture Overview

OME uses a component-based architecture where InferenceService can be composed of multiple specialized components:

  • Model: References the AI model to serve (BaseModel/ClusterBaseModel)
  • Runtime: References the serving runtime environment (ServingRuntime/ClusterServingRuntime)
  • Engine: Main inference component, typically an OpenAI-compatible server that handles request processing, tool parsing, and model backend operations
  • Decoder: Optional component for disaggregated serving (prefill-decode separation)
  • Router: A standalone high-performance component that enables data parallelism across inference instances, supporting advanced load balancing algorithms (cache-aware, power of two, random, round robin) and acting as a specialized load balancer for prefill-decode disaggregated serving architectures

apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: production-chat-service
spec:
  model:
    name: llama-3-70b-instruct
  engine:
    minReplicas: 2
    maxReplicas: 10
  decoder:  # Only created for disaggregated deployments
    minReplicas: 4
    maxReplicas: 20
  router:  # Optional routing layer for optimized request distribution
    minReplicas: 2

This component architecture enables sophisticated optimizations impossible with monolithic deployments:

  • Independent Scaling: Scale compute-heavy prefill separately from memory-bound decode
  • Resource Optimization: Routers don’t need GPUs, saving precious accelerator resources
  • Failure Isolation: Component failures don’t bring down the entire service
  • Performance Tuning: Each component can be optimized for its specific workload

BenchmarkJob: Performance Testing as a First-Class Operation

OME is the only platform that treats performance testing as a core primitive:

apiVersion: ome.io/v1beta1
kind: BenchmarkJob
metadata:
  name: llama-70b-production-benchmark
spec:
  # Target service to benchmark
  endpoint:
    inferenceService:
      name: llama-chat-optimized
  outputLocation:
    storageUri: "oci://n/benchmark-results/b/prod/o/llama-70b-bench"

This isn’t just about running load tests. BenchmarkJob provides the following (an illustrative extended spec follows the list):

  • genai-bench integration: Industry-standard benchmarking tool
  • Realistic traffic patterns: Normal distributions, fixed patterns, long-context scenarios
  • Comprehensive metrics: Tokens/second, TTFT, latency percentiles
  • Multi-cloud storage: Results stored for historical analysis
  • Service metadata tracking: GPU types, engine versions for fair comparisons
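
As an illustration of how a fuller benchmark might be expressed, the sketch below adds traffic-scenario and concurrency settings to the minimal spec above. The field names beyond endpoint and outputLocation (task, trafficScenarios, numConcurrency, maxRequestsPerRun) are assumptions for illustration, not the authoritative schema:

apiVersion: ome.io/v1beta1
kind: BenchmarkJob
metadata:
  name: llama-70b-sweep                     # illustrative name
spec:
  endpoint:
    inferenceService:
      name: llama-chat-optimized
  task: text-to-text                        # illustrative genai-bench task name
  trafficScenarios:
    - "D(100,100)"                          # fixed 100-token input/output
    - "N(480,240)/(300,150)"                # normally distributed long-context traffic
  numConcurrency: [1, 8, 32, 64]            # concurrency sweep
  maxRequestsPerRun: 2000
  outputLocation:
    storageUri: "oci://n/benchmark-results/b/prod/o/llama-70b-sweep"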

Admission Webhooks: Validation and Mutation

OME’s admission webhooks act as gatekeepers in the API layer (a registration sketch follows the list):

  1. Validating Webhooks ensure model-runtime compatibility before resources are created, preventing runtime failures
  2. Mutating Webhooks inject optimal configurations based on model characteristics
  3. Pod Mutating Webhooks handle complex scenarios like:
    • RDMA configuration for multi-node deployments
    • GPU affinity rules for optimal memory bandwidth
    • Security contexts for model encryption
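
A sketch of how such a validating webhook is registered with the Kubernetes API server; the webhook, service, and path names here are illustrative rather than the exact manifests OME installs:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: inferenceservice-validator          # illustrative name
webhooks:
  - name: inferenceservice.ome.io           # illustrative webhook name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail                     # reject requests if validation cannot run
    rules:
      - apiGroups: ["ome.io"]
        apiVersions: ["v1beta1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["inferenceservices"]
    clientConfig:
      service:
        name: ome-webhook-service           # illustrative service name
        namespace: ome-system               # illustrative namespace
        path: /validate-inferenceservices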

Layer 2: Control Plane - The Orchestrator

The control plane is where OME’s core decision-making lives. This isn’t just CRUD operations on Kubernetes resources—it’s a sophisticated system that makes optimal decisions based on model characteristics, hardware availability, and business requirements.

OME Controller Manager: The Orchestration Brain

The controller manager coordinates all OME operations with a reconciliation loop that’s aware of ML-specific concerns.

Runtime Selection Algorithm

When you deploy a model through an InferenceService, the controller:

  • Matches model characteristics against all available ServingRuntimes
  • Scores each runtime based on compatibility and optimization potential
  • Uses Model Size Range matching—when multiple runtimes support a model, OME selects the one with the closest size range for optimal performance (see the example after this list)
  • Handles edge cases like quantized models or models requiring specific GPU features
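
To make the size-range rule concrete, consider a second, broader runtime for the same architecture (the name and priority below are illustrative). For the 70B Llama model defined earlier, OME would prefer the sglang-llama-70b runtime because its 65B-75B range is the tighter fit:

apiVersion: ome.io/v1beta1
kind: ClusterServingRuntime
metadata:
  name: sglang-llama-generic        # illustrative catch-all runtime
spec:
  supportedModelFormats:
    - modelFormat:
        name: safetensors
      modelArchitecture: LlamaForCausalLM
      modelSizeRange:
        min: "7B"
        max: "200B"                 # matches 70B, but is a looser fit than 65B-75B
      autoSelect: true
      priority: 50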

Component Orchestration

The InferenceService controller orchestrates multiple components based on your deployment requirements:

// Simplified reconciliation logic showing component-based orchestration
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // 1. Fetch the InferenceService and determine its deployment mode
    isvc := &omev1beta1.InferenceService{}
    if err := r.Get(ctx, req.NamespacedName, isvc); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }
    deploymentMode := r.inferDeploymentMode(isvc)

    // 2. Select optimal runtime if not specified
    if isvc.Spec.Runtime == nil {
        isvc.Spec.Runtime = r.selectOptimalRuntime(isvc.Spec.Model, deploymentMode)
    }

    // 3. Reconcile components based on deployment mode
    switch deploymentMode {
    case PDDisaggregated:
        // Deploy separate engine (prefill) and decoder components
        r.reconcileRouter(isvc)  // Cache-aware routing
        r.reconcileEngine(isvc)  // Prefill processing
        r.reconcileDecoder(isvc) // Token generation

    case MultiNode:
        // Deploy using LeaderWorkerSet for distributed serving
        r.reconcileMultiNodeComponents(isvc)

    default:
        // Standard single-component deployment
        r.reconcileEngine(isvc) // Handles both prefill and decode
        if isvc.Spec.Router != nil {
            r.reconcileRouter(isvc) // Optional routing layer
        }
    }

    return ctrl.Result{}, nil
}

Deployment Mode Decision Logic

The controller automatically determines the optimal deployment pattern:

  • RawDeployment: Single engine for models that fit on one node
  • PDDisaggregated: Separate prefill/decode for high-throughput scenarios
  • MultiNode: Distributed serving for massive models (e.g., DeepSeek V3 685B)
  • Serverless: Scale-to-zero for cost optimization (via Knative integration)

Layer 3: Data Plane - Where Models Come to Life

The data plane is where OME’s architectural decisions deliver real value. This layer handles the actual model serving with sophisticated optimizations.

Model Agent: Model Distribution

The Model Agent is OME’s data plane component responsible for making models available across your cluster. When you create a BaseModel resource, the Model Agent springs into action:

What Makes Model Distribution Powerful:

  1. Automatic Model Parsing: Downloads and parses the model’s config.json and safetensors file headers, extracting architecture details, parameter counts, supported features, and optimal serving configurations. No more manual specification of model characteristics.
  2. Multi-Cloud Storage Abstraction: The hf:// prefix in your BaseModel isn’t just syntactic sugar. OME supports multiple storage backends with a unified interface. Switch from HuggingFace to OCI Object Storage by changing one line—no code modifications needed.
  3. Node-Aware Distribution: Models aren’t blindly copied everywhere. The Model Agent runs as a DaemonSet, honoring node selectors and affinity rules, only downloading models to nodes that match your specifications. This saves precious NVMe space and reduces download times.
  4. Lifecycle Management: Models are tracked, versioned, and health-checked. If a node goes down, OME ensures model availability on other nodes. When you delete a BaseModel, cleanup happens automatically across all nodes.

[Figure: Model Agent architecture (mode-agent.svg)]

The Scout-Gopher Architecture

OME’s Model Agent employs a sophisticated producer-consumer pattern:

1. Scout Component: The Distribution Layer

The Scout acts as the brain of model distribution, continuously monitoring the Kubernetes API for BaseModel and ClusterBaseModel resources.

  • Node-Aware Filtering: Scout evaluates node selectors and affinity rules, ensuring models are only downloaded to appropriate nodes.
  • Graceful Deletion Handling: When models are deleted, Scout ensures complete cleanup across all nodes before releasing resources, preventing orphaned multi-gigabyte files.

2. Gopher Component: The Task Engine

Gopher is the consumer side of the pattern: it executes the download and cleanup tasks that Scout produces, pulling model files from the configured storage backend onto each node.

Storage Backend Performance:

  • OCI Object Storage: Achieves GB/s download speeds through parallel chunk downloads and 20-thread concurrency. A 140GB Llama 3 70B model downloads in minutes.
  • HuggingFace Hub: Production-grade Golang client with automatic retry, rate limit handling, and resume support for interrupted downloads.
  • Unified Interface: Switch between storage providers by changing one URI prefix—no code changes needed (see the snippet below).
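
For instance, pointing the ClusterBaseModel shown earlier at OCI Object Storage instead of HuggingFace only changes the storage URI; the OCI namespace, bucket, and object names below are illustrative:

storage:
  # HuggingFace Hub
  storageUri: "hf://meta-llama/Llama-3.3-70B-Instruct"
  # OCI Object Storage equivalent (illustrative namespace/bucket/object)
  # storageUri: "oci://n/my-tenancy/b/model-weights/o/llama-3.3-70b"
  path: "/models/llama-3.3-70b"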

3. Model Configuration Parser

The parser automatically extracts model metadata from config.json and safetensors files, determining exact parameter counts and capabilities. This eliminates manual configuration for 30+ supported model architectures.
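
For example, the relevant fields of a Llama 3 70B config.json look roughly like this (abridged):

{
  "architectures": ["LlamaForCausalLM"],
  "hidden_size": 8192,
  "num_hidden_layers": 80,
  "num_attention_heads": 64,
  "num_key_value_heads": 8,
  "torch_dtype": "bfloat16",
  "vocab_size": 128256
}

From these fields plus the safetensors tensor headers, the parser derives the parameter count and the architecture string used for runtime matching.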

4. State Management & Cleanup

OME provides self-healing state management through:

  • ConfigMap Reconciliation: Automatically recreates deleted ConfigMaps from an internal cache, ensuring model states are never lost
  • Node Labels: Enable pod scheduling decisions with labels like models.ome.io/basemodel.llama-3-70b=Ready (see the example after this list)
  • Finalizer-Based Cleanup: Ensures complete model removal across all nodes before deletion, even handling node failures gracefully
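
These node labels can be consumed with ordinary Kubernetes scheduling. A hedged sketch, assuming the model is exposed to pods from the node-local path defined in the ClusterBaseModel example (pod name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: llama-smoke-test                    # illustrative
spec:
  nodeSelector:
    models.ome.io/basemodel.llama-3-70b: "Ready"   # label from the bullet above
  containers:
    - name: check
      image: busybox:1.36
      command: ["ls", "/models/llama-3.3-70b"]
      volumeMounts:
        - name: model
          mountPath: /models/llama-3.3-70b
          readOnly: true
  volumes:
    - name: model
      hostPath:
        path: /models/llama-3.3-70b         # node-local path from the ClusterBaseModel example
        type: Directory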

The Result: Production-Grade Model Management

This architecture delivers capabilities unmatched by traditional approaches:

  • Scale: Tested with large multi-gigabyte models, supporting multiple nodes downloading multiple models simultaneously
  • Efficiency: Models download once per node, not per pod—saving petabytes of bandwidth
  • Reliability: Self-healing ConfigMaps, automatic retries, and graceful error handling ensure models are always available
  • Performance: GB/s download speeds with OCI Object Storage, 20x faster than naive implementations
  • Intelligence: Automatic model understanding eliminates manual configuration errors

Inference Workloads

Based on your InferenceService specification, OME deploys different components optimized for specific workload patterns, including PD-disaggregated serving, multi-node serving, standard deployment, and serverless deployment.

  • Engine: used in all deployments; the inference server (prefill in PD mode, full inference otherwise). Key optimizations: compute optimization, batch processing, tensor parallelism.
  • Decoder: used in PD-disaggregated deployments only; generates tokens using the KV cache handed off by the engine. Key optimizations: memory bandwidth optimization, efficient cache management.
  • Router: used when specified or in PD mode; provides optimized request distribution. Key optimizations: cache-aware routing, connection pooling, health monitoring.
  • Ingress: created automatically; provides external API access. Key optimizations: TLS termination, rate limiting, request routing.

The beauty of this architecture is its flexibility—start with a simple engine-only deployment and progressively adopt advanced patterns as your needs grow.
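
An engine-only starting point can be just a few lines; this sketch reuses only fields shown in the earlier example, with an illustrative name and replica counts:

apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: simple-chat-service        # illustrative name
spec:
  model:
    name: llama-3-70b-instruct     # the ClusterBaseModel defined earlier
  engine:
    minReplicas: 1
    maxReplicas: 4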

Layer 4: External Integrations - Ecosystem Power

OME doesn’t reinvent the wheel—it deeply integrates with the Kubernetes ecosystem:

  • Kueue for gang scheduling of multi-pod workloads
  • LeaderWorkerSet for resilient multi-node deployments
  • KEDA for advanced custom metrics-based autoscaling (sketched below)
  • K8s Gateway API for sophisticated traffic routing
  • Gateway API Inference Extension for standardized inference endpoints
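
As an illustration of metrics-driven scaling through KEDA, a ScaledObject can target an OME-managed engine Deployment with a Prometheus trigger. The Deployment name, metric, query, and threshold below are assumptions for the sketch, not values OME generates:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: chat-engine-scaler                   # illustrative name
spec:
  scaleTargetRef:
    name: production-chat-service-engine     # illustrative engine Deployment name
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090    # illustrative Prometheus endpoint
        query: sum(rate(generated_tokens_total[1m]))        # illustrative tokens/second metric
        threshold: "4000"                                   # scale out above ~4k tokens/s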

SGLang: First-Class Runtime Support

SGLang is the primary runtime in OME, with deep native integration that showcases OME’s model-driven architecture capabilities.

Native Router Integration

OME provides native integration with SGLang’s router component, implementing:

Kubernetes Service Discovery: The router automatically discovers engine and decoder pods through Kubernetes APIs, adjusting to scaling events and pod lifecycle changes without manual intervention.

Least-Privilege RBAC: Each router receives minimal permissions—only the ability to list, get, and watch pods in its namespace. This prevents cross-tenant information leakage while enabling dynamic discovery.
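
A minimal sketch of the kind of namespaced Role this describes (names are illustrative, not the exact objects OME creates):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: sglang-router-discovery    # illustrative name
  namespace: team-a                # the router's own namespace
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]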

Load Balancing Capabilities

OME supports SGLang’s advanced load balancing strategies:

For PD-Disaggregated Deployments: The router distributes requests between prefill (engine) and decode (decoder) components, maintaining KV cache coherency and optimizing throughput.

For Standard Deployments: Cache-aware load balancing tracks KV cache state across workers using RadixAttention, routing requests to workers most likely to have relevant cached prefixes.

Deployment Flexibility

Through OME’s ServingRuntime configurations, SGLang supports:

  • Single-node serving for models that fit on one GPU
  • Multi-node serving with tensor/pipeline parallelism for large models
  • PD-disaggregated serving separating compute-intensive prefill from memory-bound decode
  • Expert parallelism for MoE models like DeepSeek V3

The integration between OME and SGLang demonstrates how a model-driven architecture enables sophisticated serving patterns while maintaining operational simplicity.

Production-Grade Features: Built for Scale

Native Benchmarking

OME includes BenchmarkJob as a core CRD, enabling systematic performance testing with realistic traffic patterns, concurrency sweeps, and automated result storage. This allows teams to compare different model configurations, runtime settings, and hardware choices with standardized, reproducible benchmarks.

Multi-LoRA Serving: One Model, Many Adapters

Deploy a single base model with multiple LoRA adapters for different use cases.

Each request can specify which adapter to use, enabling multi-tenant serving with a single deployment.

High-Performance Serving at Scale

Prefill-Decode Disaggregation: Separate compute-intensive prefill from memory-bound decode operations, achieving 2.5x throughput improvement for mixed workloads.

Multi-Node Serving: Seamlessly scale models across multiple nodes with RDMA support, enabling deployment of frontier models like DeepSeek V3 (685B parameters).

GB/s Download Speeds: Native OCI Object Storage integration delivers multi-gigabyte per second download speeds, getting models into production faster.

Enterprise Security: Defense in Depth

Optional Model Double Encryption: OME supports double encryption for models, integrating with OCI KMS and OCI Vault for enhanced security when required.

Fine-Grained Access Control: RBAC policies control who can deploy which models, with namespace isolation and audit logging.

Secure Multi-Tenancy: Complete isolation between different teams’ models and serving infrastructure.

Real-World Impact: From Months to Days

OME is currently powering production workloads at Oracle Cloud Infrastructure (OCI), where it has transformed their LLM operations:

Operational Transformation at Scale

Before OME:

  • Model onboarding: Months of manual configuration
  • Runtime selection: Trial and error approach
  • Benchmarking: Ad-hoc, inconsistent processes
  • Serving configuration: Specialized knowledge required

With OME:

  • Model onboarding: Days with automated workflows
  • Runtime selection: Automatic and optimal
  • Benchmarking: Standardized and reproducible
  • Serving configuration: Declarative and simple

The impact has been dramatic—what previously required months of specialized engineering effort now happens in days with standardized, automated processes. Teams can focus on evaluating model capabilities rather than wrestling with infrastructure complexity.

Closing the Gap: How OME Bridges Two Worlds

Remember the two teams we started with? The ML Engineers perfecting model serving, and the Production Engineers trying to deploy models quickly?

OME closes this gap through its model-driven architecture:

For ML Engineers: Your optimizations are captured in ServingRuntimes. Your benchmarking results guide future deployments. Your expertise becomes reusable infrastructure.

For Production Teams: Deploy any model with a simple InferenceService. The platform handles runtime selection, scaling, and optimization automatically.

For Organizations: Standardized workflows, reduced time to production, and the ability to leverage cutting-edge serving techniques without deep expertise.

The model-driven approach transforms the chaos of LLM deployment into a systematic, scalable process. By making models first-class citizens, OME enables both innovation and operational excellence.

The Path Forward: Challenges Ahead

While OME has transformed model serving at OCI, significant challenges remain as we build toward a truly universal LLM platform:

Accelerator-Aware Runtime Selection

Currently, mixed GPU clusters (H100, A100, L40S) require separate runtime configurations for each accelerator type, leading to runtime proliferation and operational complexity. We’re working on AcceleratorClass abstractions to enable single runtimes that adapt to available hardware automatically.

Multi-Cloud Support

OME’s current implementation is tightly coupled to OCI. As organizations increasingly adopt multi-cloud strategies, we need to decouple provider-specific logic and create a unified interface that supports AWS, Azure, GCP, and on-premises deployments without code duplication.

Multi-Cluster Management

Future architectures will support a single management cluster running OME that can orchestrate model deployments across multiple worker clusters, enabling federation across regions and cloud providers.

The journey to simplify LLM deployment continues. Each challenge presents an opportunity to make model serving more accessible, efficient, and powerful for organizations worldwide.

Join the Revolution

OME represents a paradigm shift from deployment-driven to model-driven infrastructure. By abstracting complexity while exposing powerful capabilities, it enables organizations to focus on delivering AI value rather than wrestling with infrastructure.

Get Involved:

Acknowledgments

We would like to express our heartfelt gratitude to the following teams and collaborators:

SGLang Community — Yineng Zhang, Ying Sheng, Lianmin Zheng — for their groundbreaking work on SGLang and continuous collaboration on making it the flagship runtime for OME.

Oracle Cloud Infrastructure Team — Simo Lin, Chang Su, Beiwen Guo, Yifeng Liu, David Nahm, Feng Gao, Frank Zhou, Weiwei Zheng, Arthur Cheng, Chao Yang, Varun Shenoy, Jinguo Zhang, Wei Gao, Jun Qian, Jingqiao Zhang and colleagues — for driving the vision of model-driven infrastructure and validating OME in production at scale.

Thank you all for your invaluable support and collaboration.


OME is transforming how organizations deploy and manage LLMs at scale. Join leading enterprises already using OME to power their AI infrastructure with model-driven simplicity and cutting-edge performance.