— by Philipp Renoth, Adrian Knorr
Our vision of local AI inference at ConSol
We follow a local‑first AI strategy—because when it comes to performance, data sovereignty, and cost control, nothing beats running models close to where they’re needed. And since we’re long-time OpenShift experts, it was only natural for us to bring AI workloads directly onto Red Hat OpenShift. In this post, we’ll share how we combined GPT-OSS with vLLM to run high‑performance local AI inference, while leveraging the flexibility and robustness of OpenShift.
GPU Workstation
Why host GPT-OSS for local AI?
Running GPT-OSS on your own infrastructure comes with several compelling benefits:
- Cost control – No per-token billing. Instead, you plan your hardware investment and operational costs up front, saving money in high‑load scenarios.
- Performance – Full control of your GPUs and inference parameters, predictable latency, and no API throughput caps.
- Data privacy – Sensitive data never leaves your environment, ensuring compliance and regulatory alignment.
- Independence – You’re not locked into a single vendor’s roadmap or pricing model.
We chose vLLM for inference because of its efficient memory management, high throughput, and stability, which make GPT-OSS practical for local GPU hosting.
Prerequisites & Setup
Before running GPT-OSS with vLLM, the right hardware and infrastructure are key:
Hardware
NVIDIA RTX PRO 6000 Blackwell GPU (96 GB VRAM) – its large capacity and ultra-fast GDDR7 memory allow running gpt-oss:120b efficiently while minimizing out-of-memory risks.
Infrastructure
Red Hat OpenShift together with the NVIDIA GPU Operator automatically discovers and configures GPU resources; the operator installs the required drivers and libraries for GPU acceleration in containers.
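In practice, configuring the operator typically comes down to creating a ClusterPolicy custom resource after installing it from OperatorHub. The following is only a minimal sketch; the field names come from the operator's CRD, but defaults and additional options vary by operator version:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  driver:
    enabled: true        # install the NVIDIA driver on GPU nodes
  toolkit:
    enabled: true        # container toolkit so pods can access GPUs
  devicePlugin:
    enabled: true        # exposes nvidia.com/gpu as a schedulable resource
  dcgmExporter:
    enabled: true        # GPU metrics for monitoring

Once the policy is reconciled, GPU nodes advertise the nvidia.com/gpu resource that the Deployment below requests.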
Runtime
We use the prebuilt vllm-openai container image with Python, CUDA libraries, vLLM, and the OpenAI-compatible API already included. This ensures reproducibility and avoids version conflicts.
Deploy GPT-OSS on vLLM
On OpenShift, a Kubernetes Deployment provisions one replica of the inference server, labeled consistently for easy management. The container runs the prebuilt docker.io/vllm/vllm-openai image, requesting one NVIDIA GPU, 32 GiB of memory, and 0.5 CPU cores. The NVIDIA GPU Operator ensures correct scheduling.
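As a rough sketch of that Deployment (names, labels, and the image tag are illustrative assumptions, not our exact manifest):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpt-oss-vllm
  labels:
    app: gpt-oss-vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpt-oss-vllm
  template:
    metadata:
      labels:
        app: gpt-oss-vllm
    spec:
      containers:
        - name: vllm
          image: docker.io/vllm/vllm-openai:latest   # pin a specific tag in production
          ports:
            - containerPort: 8000                    # vLLM's OpenAI-compatible API
          resources:
            requests:
              cpu: "500m"
              memory: 32Gi
              nvidia.com/gpu: "1"
            limits:
              memory: 32Gi
              nvidia.com/gpu: "1"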
Health probes check the /health endpoint for readiness, liveness, and startup, ensuring stability even when loading 100B+ parameter models.
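All three probes can point at the same endpoint; only the thresholds differ. A sketch with illustrative values (the startup probe needs enough budget for loading the 120B weights):

startupProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 120      # up to ~20 minutes for model loading
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 30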
Storage combines two volume types (see the sketch after this list):
- PersistentVolumeClaim for model weights
- EmptyDir volumes for runtime caches
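Wired into the pod spec, that could look roughly like this (claim name and mount paths are assumptions):

volumes:
  - name: models
    persistentVolumeClaim:
      claimName: gpt-oss-models    # holds the downloaded model weights
  - name: cache
    emptyDir: {}                   # scratch space for runtime caches

...and in the container:

volumeMounts:
  - name: models
    mountPath: /models
  - name: cache
    mountPath: /.cache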
Startup arguments load the model (openai/gpt-oss-120b) and set GPU memory utilization to 97%. We raised this limit from vLLM's default of 90% (chosen for stability) to 97% because about 9 GB of GPU memory otherwise sat idle, giving us additional headroom without risking out-of-memory errors. Tolerations ensure scheduling on GPU-enabled nodes. The result: a GPU-backed, containerized vLLM server managed by Red Hat OpenShift, ready for scalable GPT-OSS inference.
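For illustration, the corresponding container arguments and tolerations might look like this; the exact flag form depends on the entrypoint version in the image, and nvidia.com/gpu as the taint key is an assumption about how the GPU nodes are tainted:

args:
  - --model
  - openai/gpt-oss-120b
  - --gpu-memory-utilization
  - "0.97"
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule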
Benchmark Testing
We benchmarked gpt-oss:120b with inference-benchmarker, focusing on realistic load rather than parameter tuning.
Test Setup
- Tool: inference-benchmarker
- Dataset: share_gpt_turns.json with short and long requests
- Load profile: six 30-min runs at 5, 10, 20, 50, 100, and 200 virtual users
- GPU: NVIDIA RTX PRO 6000 Blackwell, 96 GB VRAM
- Stability: >5 hours runtime at nearly 100% GPU utilization, with no failed requests
Key Results
- Throughput: Nearly 3M tokens in 30 minutes at 200 virtual users (~1,666 tokens/s). That easily supports real-time streaming for dozens of users (slow ~2.5 tokens/s, fast ~6–7 tokens/s).
- Latency: time to first token (TTFT) ~16 s at p50 and inter-token latency ~89 ms, so once a response starts it streams fluently.
- Memory: VRAM peaked at 97% yet remained stable across all tests.
Takeaway: Even with mostly default settings, GPT-OSS on vLLM scales from 5 to 200 users, sustaining high throughput and predictable latency.
Lessons Learned & Best Practices for local AI
Performance Benchmarks
Benchmarks revealed that our GPU does not scale linearly when serving only a few requests: high concurrency is required to reach the advertised throughput, and the achieved numbers depend heavily on the context-window size. In practice, our colleagues will not fire off huge queries every second; they will typically ask general questions, request summaries, or check offers, which results in far smaller batches on average.
Egress Isolation
Egress Isolation was a challenge: GPT‑OSS (via openai‑harmony) fetches tiktoken files by default, causing unwanted outbound calls. While model weights live on a PersistentVolume, we needed to enforce offline behavior.
Fix: set environment variables inside the container:
env:
  - name: HF_HUB_OFFLINE
    value: "1"
  - name: TIKTOKEN_RS_CACHE_DIR
    value: /.cache/tiktoken-rs-cache
With this, we enabled the OpenShift EgressFirewall and ran vLLM in an outbound-isolated environment.
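A deny-all EgressFirewall for the inference namespace could look like this (a minimal sketch; with OVN-Kubernetes the resource must be named default, and the namespace is illustrative):

apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default
  namespace: gpt-oss              # namespace running the vLLM Deployment
spec:
  egress:
    - type: Deny
      to:
        cidrSelector: 0.0.0.0/0   # block all outbound traffic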
Best Practice: Explicitly set the target directory for downloaded cache files so they persist, and enable offline mode early to surface hidden network dependencies and ensure reliable startup in restricted networks.
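One way to do that is to mount the cache path onto persistent storage so files fetched once (before switching to offline mode) survive pod restarts; reusing the model-weights PVC via a subPath, as below, is just one option and an assumption on our part:

volumeMounts:
  - name: models
    mountPath: /.cache/tiktoken-rs-cache
    subPath: tiktoken-rs-cache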
Need help with local AI on Red Hat OpenShift?
If you’re interested in adopting a local‑first AI strategy like the one described in this post, our team at ConSol is ready to assist. We combine deep expertise in OpenShift and AI to deliver tailored, production‑ready inference solutions that meet your performance, cost, and data‑sovereignty requirements.
Whether you need an architecture review, a proof‑of‑concept deployment, or full‑scale consulting & managed services, reach out to us. Contact our OpenShift experts or learn more about our AI consulting offerings – we’ll help you turn local AI from a concept into a reliable business asset.