7 posts with tag #vllm

Architecture diagram: LLM inference pipeline on OVH Managed Kubernetes Service

vllm llm kubernetes ovh gpu inference vscode continue zed agents openwebui owncloud

LLM Inference on OVH MKS: Connect IDEs and Web UIs

Connect Continue.dev, Zed, Cline, Open WebUI, and ownCloud Infinite Scale to a self-hosted vLLM endpoint on OVH MKS. Per-client setup guide. Part 5 of 6.

2026-06-02 · 7 min

vllm llm kubernetes ovh gpu terraform ansible istio inference

LLM Inference on OVH MKS: Terraform, Ansible, and Deployment

Provision an OVH MKS GPU node pool with Terraform, deploy vLLM, Istio, and cert-manager with Ansible, and walk through a first deployment. Part 2 of 6.

2026-06-02 · 8 min

Architecture diagram: LiteLLM API gateway in front of LLM inference services

vllm llm kubernetes ovh litellm gateway inference

LLM Inference on OVH MKS: LiteLLM API Gateway

LiteLLM gateway on top of vLLM: per-user API keys, budget limits, and automatic fallback to commercial APIs when the local GPU node is cold. Part 6 of 6.

2026-06-02 · 8 min

vllm llm kubernetes ovh gpu istio inference guide

LLM Inference on OVH MKS: The Complete Guide

Index and reading guide for a six-part series on self-hosting LLM inference on OVH MKS — vLLM, GPU node pools, Terraform, observability, clients, and a gateway.

2026-06-02 · 13 min

vllm llm kubernetes ovh gpu istio inference

LLM Inference on OVH MKS: Introduction

When to self-host an LLM on Kubernetes, why vLLM, and what the stack looks like on OVH MKS. Covers use cases, cost framing, and architecture. Part 1 of 6.

2026-06-02 · 10 min

Diagram: Prometheus, Grafana, and KEDA observability stack for LLM inference

vllm llm kubernetes ovh gpu prometheus grafana keda observability autoscaling inference

LLM Inference on OVH MKS: Prometheus, Grafana, and KEDA

Scrape vLLM and DCGM metrics with kube-prometheus-stack, visualise TTFT and tokens/s in Grafana, and autoscale to zero with KEDA. Part 4 of 6.

2026-06-02 · 8 min

vllm llm kubernetes ovh gpu quantization awq openai inference huggingface

LLM Inference on OVH MKS: Models, AWQ, and OpenAI API

Which models fit on a 16 GB GPU, why AWQ is required for 7B+ models on the RTX5000-28, and how to use the OpenAI-compatible API from Python. Part 3 of 6.

2026-06-02 · 9 min