Inferonomics

Use cases

Inference at scale, training jobs, and hybrid enterprise—with verified stack blueprints.

Core use cases

Inference at scale

For ML platform and product teams that need to serve models in production with predictable latency and cost.

Benefits

  • Managed inference endpoints with health checks and rolling updates
  • Autoscaling, including scale-to-zero, so capacity tracks traffic
  • Low idle cost: with scale-to-zero, you pay only while traffic is served
  • Multiple backends via verified stack blueprints (vLLM, Triton, custom)
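The scale-to-zero behavior above can be sketched as a simple sizing rule: with no traffic, run zero replicas; otherwise, size for the observed request rate. This is an illustrative sketch, not InferoFabric's actual autoscaler — the function name and parameters are assumptions.

```python
import math

def desired_replicas(rps: float, target_rps_per_replica: float,
                     max_replicas: int, scale_to_zero: bool = True) -> int:
    """Pick a replica count from the observed request rate (rps).

    With scale-to-zero enabled, an idle endpoint drops to 0 replicas,
    so no idle cost accrues; otherwise at least one replica stays warm.
    """
    if rps == 0:
        return 0 if scale_to_zero else 1
    needed = math.ceil(rps / target_rps_per_replica)
    return min(max(needed, 1), max_replicas)
```

The trade-off is cold-start latency: an endpoint scaled to zero must pull the model back onto a GPU before serving the first request, which is why keeping one warm replica (`scale_to_zero=False`) is the usual choice for latency-sensitive traffic.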

Related features

Inference endpoints
Autoscaling
Verified blueprints
Cost caps

Training jobs

For data science and ML teams running distributed training that need reliability and fair sharing of resources.

Benefits

  • Checkpoint durability: persist to object storage and resume from last state
  • Quotas and fair-share scheduling so teams get predictable capacity
  • Preemption-aware scheduling (roadmap): use spot/preemptible with automatic resume
  • Single control plane for both inference and training workloads
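The checkpoint-durability pattern above boils down to: write each checkpoint atomically, then resume from the newest one found. A minimal sketch, using a local directory to stand in for object storage (paths, file names, and the JSON format are assumptions for illustration):

```python
import glob
import json
import os

def save_checkpoint(state: dict, step: int, ckpt_dir: str) -> str:
    """Persist training state at a given step.

    Writes to a temp file, then renames: the rename is atomic, so a crash
    mid-write can never leave a torn checkpoint behind.
    """
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"ckpt-{step:08d}.json")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)
    return path

def load_latest(ckpt_dir: str):
    """Resume from the newest checkpoint, or start fresh at step 0."""
    paths = sorted(glob.glob(os.path.join(ckpt_dir, "ckpt-*.json")))
    if not paths:
        return 0, {}
    with open(paths[-1]) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]
```

Zero-padding the step in the file name makes a lexicographic sort agree with numeric order, so "latest" is just the last entry — the same property object-storage key listings give you.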

Related features

Training jobs
Checkpoints
Quotas
Preemption (roadmap)

Hybrid enterprise

For enterprises that need data locality, region and zone controls, and on-prem-first scheduling with cloud burst.

Benefits

  • Data locality: place workloads where your data lives (on-prem or specific region)
  • Region and zone controls via placement policies
  • On-prem first: use your GPU clusters before bursting to cloud
  • Single pane of glass across all environments
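The placement logic described above — honor a region constraint, fill on-prem capacity first, burst to cloud only when needed — can be sketched as a filter-then-rank pass over candidate sites. The site schema and function name here are hypothetical, not InferoFabric's placement API:

```python
def pick_site(sites, required_region=None, prefer_on_prem=True):
    """Choose a site for a workload.

    1. Filter: drop sites with no free GPUs or in the wrong region.
    2. Rank: on-prem sites first (when preferred), then most free capacity.
    Returns the winning site's name, or None if nothing fits.
    """
    candidates = [
        s for s in sites
        if s["free_gpus"] > 0
        and (required_region is None or s["region"] == required_region)
    ]
    if not candidates:
        return None
    # False sorts before True, so on-prem sites come first when preferred.
    candidates.sort(key=lambda s: (not (s["on_prem"] and prefer_on_prem),
                                   -s["free_gpus"]))
    return candidates[0]["name"]
```

Cloud burst falls out naturally: when the on-prem site has no free GPUs, it is filtered out in step 1 and the cloud site in the same region wins the ranking.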

Related features

Hybrid placement
On-prem first
Placement policies
Multi-cloud

Blueprint spotlight

Verified stacks you can deploy in minutes, each with pre-validated hardware targets and runtime versions.

vLLM inference

High-throughput LLM serving with PagedAttention and continuous batching.

Hardware target

NVIDIA GPU (A100, H100, L4, T4)

Runtime versions

  • vLLM 0.4.x
  • CUDA 12.x
  • Python 3.10+
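Once deployed, a vLLM server of this generation exposes an OpenAI-compatible HTTP API, so clients talk to it with plain JSON over `/v1/completions`. A stdlib-only sketch of building such a request — the base URL and model name are placeholders, not values this blueprint guarantees:

```python
import json
from urllib import request

def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 128,
                             temperature: float = 0.7) -> request.Request:
    """Build a POST request for vLLM's OpenAI-compatible completions route."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending it requires a running endpoint, e.g.:
# with request.urlopen(build_completion_request(
#         "http://localhost:8000", "your-model", "Hello")) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```

Because the wire format matches OpenAI's, existing OpenAI client libraries can also be pointed at the endpoint by overriding their base URL.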

Triton Inference Server

Multi-framework inference (TensorRT, ONNX, PyTorch) with dynamic batching.

Hardware target

NVIDIA GPU (Ampere or newer)

Runtime versions

  • Triton 2.40+
  • CUDA 12.x
  • cuDNN 8.x
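Dynamic batching is the key throughput lever here: the server holds individual requests briefly so it can run them through the GPU as one batch. A simplified sketch of that queueing policy — this illustrates the trade-off Triton's dynamic batcher tunes (batch size vs. queueing delay), not Triton's actual implementation:

```python
def form_batches(arrivals, max_batch_size, max_delay):
    """Group (arrival_time, request_id) pairs into batches.

    A batch is flushed when it reaches max_batch_size, or when a new
    request arrives after the oldest queued one has waited past
    max_delay. Larger batches raise GPU utilization; a tighter delay
    bound keeps tail latency down.
    """
    batches, current = [], []
    for t, req in arrivals:
        if current and t - current[0][0] > max_delay:
            batches.append([r for _, r in current])
            current = []
        current.append((t, req))
        if len(current) == max_batch_size:
            batches.append([r for _, r in current])
            current = []
    if current:
        batches.append([r for _, r in current])
    return batches
```

With a burst of near-simultaneous requests, the batcher fills a full batch immediately; a straggler arriving later goes out alone rather than waiting for company.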

ComfyUI

Stable Diffusion and image generation workflows with node-based UI.

Hardware target

NVIDIA GPU (8GB+ VRAM)

Runtime versions

  • ComfyUI latest
  • PyTorch 2.x
  • CUDA 12.x
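The node-based UI mentioned above is, under the hood, a directed graph: each node runs once all of its inputs have produced outputs. A minimal sketch of that dependency-ordered evaluation, with made-up node names standing in for real ComfyUI nodes:

```python
def run_graph(nodes, edges):
    """Execute a node graph in dependency order.

    nodes: {name: callable taking that node's inputs}
    edges: {name: [names of nodes feeding it]}
    Each node is evaluated exactly once and its output cached, so
    shared upstream nodes (a common prompt, a seed) run a single time.
    """
    results = {}

    def evaluate(name):
        if name not in results:
            inputs = [evaluate(dep) for dep in edges.get(name, [])]
            results[name] = nodes[name](*inputs)
        return results[name]

    for name in nodes:
        evaluate(name)
    return results
```

In a real image workflow the callables would be model loaders, samplers, and decoders; the caching is what lets you tweak one downstream node and re-run without recomputing the whole graph.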

Whisper transcription

OpenAI Whisper for speech-to-text at scale with batch and streaming.

Hardware target

NVIDIA GPU (T4, L4, A10)

Runtime versions

  • Whisper (large-v3)
  • faster-whisper / CTranslate2
  • Python 3.10+
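Batch transcription at scale usually starts by splitting long recordings into windows, since Whisper models process roughly 30 seconds of audio at a time. A sketch of that windowing with a small overlap so words at chunk boundaries are not dropped — the window and overlap values are illustrative defaults, not this blueprint's settings:

```python
def chunk_windows(duration_s, window_s=30.0, overlap_s=1.0):
    """Split a recording into (start, end) windows for batch transcription.

    Adjacent windows overlap by overlap_s so a word straddling a
    boundary appears whole in at least one chunk and can be
    deduplicated when the transcripts are stitched back together.
    """
    if duration_s <= window_s:
        return [(0.0, duration_s)]
    windows, start = [], 0.0
    step = window_s - overlap_s
    while start < duration_s:
        windows.append((start, min(start + window_s, duration_s)))
        start += step
    return windows
```

Each window can then be transcribed independently, which is what makes the workload batch-friendly: chunks fan out across GPU workers and results are merged in timestamp order.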

Find your use case

Tell us about your workloads and we will show you how InferoFabric fits.