Distributed Inference Runtime
Distributed inference across multiple runtime engines, serving multiple models, on Kubernetes.
LLM-aware load balancing
Load balancing that is aware of the LLMs running on each node, their current load, and their context history, routing each inference request to the optimal node.
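To make the idea concrete, here is a minimal sketch of what LLM-aware routing could look like. It assumes hypothetical node metadata (`LoadedModels`, `ActiveReqs`, `Capacity`, `WarmContexts`) rather than the runtime's actual API: nodes that already have the model loaded and the request's context warm in cache score higher, and heavily loaded nodes score lower.

```go
package main

import (
	"fmt"
	"math"
)

// Node describes one inference node as this sketch assumes it is reported:
// which models it has loaded, its current load, and which conversation
// contexts (by ID) are still warm in its cache. These fields are illustrative.
type Node struct {
	Name         string
	LoadedModels map[string]bool
	ActiveReqs   int
	Capacity     int
	WarmContexts map[string]bool
}

// pickNode scores each candidate node for a request and returns the best one.
func pickNode(nodes []Node, model, contextID string) (string, error) {
	best, bestScore := "", math.Inf(-1)
	for _, n := range nodes {
		if !n.LoadedModels[model] {
			continue // skip nodes that would first have to load the model
		}
		score := 1.0 - float64(n.ActiveReqs)/float64(n.Capacity) // prefer lightly loaded nodes
		if n.WarmContexts[contextID] {
			score += 0.5 // bonus for context/cache affinity
		}
		if score > bestScore {
			best, bestScore = n.Name, score
		}
	}
	if best == "" {
		return "", fmt.Errorf("no node has model %q loaded", model)
	}
	return best, nil
}

func main() {
	nodes := []Node{
		{Name: "node-a", LoadedModels: map[string]bool{"llama-3-8b": true}, ActiveReqs: 7, Capacity: 10, WarmContexts: map[string]bool{}},
		{Name: "node-b", LoadedModels: map[string]bool{"llama-3-8b": true}, ActiveReqs: 3, Capacity: 10, WarmContexts: map[string]bool{"chat-42": true}},
	}
	target, _ := pickNode(nodes, "llama-3-8b", "chat-42")
	fmt.Println("route request to:", target) // node-b: less loaded and has the context warm
}
```

In practice the score would combine more signals (queue depth, GPU memory headroom, prefix-cache hit rates), but the shape of the decision is the same: filter by model availability, then rank by load and context affinity.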
Orchestration of multiple models and formats
Manage which models are available, control replica counts, and scale and distribute models automatically or manually, with full support for multiple model formats.
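A small sketch of the orchestration side, again using hypothetical names (`ModelDeployment`, `desiredReplicas`) rather than the runtime's real objects: each served model carries a format plus replica bounds, and an autoscaling rule picks a replica count from observed load.

```go
package main

import "fmt"

// ModelDeployment is a hypothetical orchestration record: which model to
// serve, in what format, and how far its replicas may scale.
type ModelDeployment struct {
	Name        string
	Format      string // e.g. "safetensors", "gguf", "onnx"
	MinReplicas int
	MaxReplicas int
}

// desiredReplicas is a simple autoscaling rule: one replica per
// targetPerReplica in-flight requests, clamped to the deployment's bounds.
// Setting MinReplicas == MaxReplicas pins the model to a fixed (manual) count.
func desiredReplicas(d ModelDeployment, inflight, targetPerReplica int) int {
	want := (inflight + targetPerReplica - 1) / targetPerReplica // ceiling division
	if want < d.MinReplicas {
		return d.MinReplicas
	}
	if want > d.MaxReplicas {
		return d.MaxReplicas
	}
	return want
}

func main() {
	d := ModelDeployment{Name: "llama-3-8b", Format: "safetensors", MinReplicas: 1, MaxReplicas: 8}
	fmt.Println(desiredReplicas(d, 35, 10)) // 4 replicas for 35 in-flight requests
}
```

On Kubernetes this kind of record would typically live in a custom resource, with a controller reconciling the actual replica count toward the desired one; the snippet only illustrates the scaling decision itself.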