Self-hosting LLMs in 2026 — cost math for Indian teams
A real cost comparison: running Llama 4, Qwen 3, and DeepSeek on Indian VPS hardware vs API pricing. With quantised models and the right GPU, self-hosting pays back at surprisingly low volume.
The question comes up in nearly every client conversation now: should we self-host the LLM or use an API? The answer, as always, is "it depends." But in 2026, the dependency list is shorter and more favourable to self-hosting than most teams realise.
This is the honest cost comparison we use internally, based on actual hardware pricing in India as of May 2026, actual model benchmarks, and our own experience running Llama 4 (via vLLM) and DeepSeek-V3 (via llama.cpp) on rented and owned hardware.
The numbers that matter
API pricing (per million tokens, May 2026):
| Model | Input ($/1M) | Output ($/1M) |
|---|---|---|
| GPT-5 (OpenAI) | $2.50 | $10.00 |
| Claude Sonnet 4.6 (Anthropic) | $3.00 | $15.00 |
| DeepSeek-V3 (API) | $0.27 | $1.10 |
| Gemini 2.5 Pro (Google) | $1.25 | $10.00 |
Indian VPS GPU pricing (monthly, May 2026):
| GPU | VRAM | Approx. rent (₹/month) |
|---|---|---|
| RTX A4000 | 16 GB | ₹28,000 |
| L40S | 48 GB | ₹52,000 |
| A100 | 80 GB | ₹1,00,000+ |
| RTX 4090 | 24 GB | ₹35,000 |
A ₹28,000/month A4000 running a quantised 8B-12B parameter model can serve roughly 500-800 requests per hour at reasonable latency. At DeepSeek API pricing, 500 requests averaging 2,000 output tokens each (1M output tokens) cost about $1.10. At 10,000 requests a day (300,000 a month, or 600M output tokens), that is roughly $660/month in API fees, or about ₹56,000 at roughly ₹85 to the dollar: twice the cost of the GPU. Break-even sits near 5,000 requests a day (about $330, or ₹28,000, a month), comfortably inside what one A4000 can serve.
The crossover point is lower than most people assume.
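The break-even arithmetic can be sketched in a few lines of Python. The rental, price, and throughput figures come from this article; the exchange rate of ₹85 to the dollar and the 30-day month are assumptions, and the function names are illustrative only.

```python
# Figures from the article; the FX rate and month length are assumptions.
GPU_RENT_INR = 28_000            # RTX A4000, per month
OUTPUT_PRICE_USD_PER_M = 1.10    # DeepSeek-V3 API, output tokens
TOKENS_PER_REQUEST = 2_000       # average output tokens per request
INR_PER_USD = 85.0               # assumed exchange rate
DAYS_PER_MONTH = 30              # assumed billing month


def monthly_api_cost_inr(requests_per_day: float) -> float:
    """API spend in rupees for a month of traffic at the given daily volume."""
    tokens = requests_per_day * DAYS_PER_MONTH * TOKENS_PER_REQUEST
    return tokens / 1_000_000 * OUTPUT_PRICE_USD_PER_M * INR_PER_USD


def break_even_requests_per_day() -> float:
    """Daily request volume at which API spend equals the GPU rent."""
    cost_per_request_inr = monthly_api_cost_inr(1) / DAYS_PER_MONTH
    return GPU_RENT_INR / DAYS_PER_MONTH / cost_per_request_inr


def gpu_daily_capacity(requests_per_hour: float) -> float:
    """Requests one GPU can serve in a day if it runs around the clock."""
    return requests_per_hour * 24


if __name__ == "__main__":
    print(f"break-even: {break_even_requests_per_day():,.0f} requests/day")
    print(f"capacity at 500 req/h: {gpu_daily_capacity(500):,.0f} requests/day")
```

Comparing the break-even volume with the daily capacity is the useful check: self-hosting only pays back if the GPU can actually serve the traffic that justifies it.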
The model fit
For self-hosting on a single A4000 or L40S:
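Llama 4 Scout (17B active parameters, 109B total across 16 experts; 4-bit quant): Only the active parameters count towards compute per token, but all 109B must be stored. A 4-bit quant is roughly 55-60 GB, so it does not fit in 16 GB VRAM on its own; it needs expert offloading to system RAM or a larger GPU. Strong general-purpose model. Good at structured output, decent at Hindi and code-mixed Hinglish. Our default for document classification and data extraction pipelines.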
Qwen 3 (8B/14B, 4-bit quant): The 14B Q4 fits in 16 GB. Competitive with Llama 4 on reasoning, slightly worse on instruction following. We use it for simpler agent tasks where throughput matters more than precision.
DeepSeek-V3 (671B MoE, 4-bit quant via llama.cpp): Only ~37B active parameters per token. The Q4 weights are still roughly 400 GB, so on an L40S only a fraction sits in the 48 GB of VRAM and the rest is offloaded to system RAM, which the host must have in matching quantity. We see 15-20 tokens/second, with quality competitive with GPT-5 at roughly 1/8 the API cost for high-volume workloads. We run this for internal agent workflows.
Phi-4 (14B): Microsoft's latest. Strong at reasoning, weaker at creative tasks. Good for structured extraction and classification. Runs comfortably on an A4000.
What does not fit at this budget tier:
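- Full-precision Llama 4 Maverick (17B active, about 400B total across 128 experts). The BF16 weights alone are roughly 800 GB, so it needs a multi-GPU node, not two L40S or one A100.
- Full-precision DeepSeek-V3. The native FP8 weights are roughly 700 GB, far beyond a single 80 GB A100; this also needs a multi-GPU node.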
- GPT-5-tier models. The frontier is still API-only for SMB budgets.
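A quick way to sanity-check model fit is to estimate weight size from parameter count and bits per weight. The sketch below uses the total parameter counts quoted in this section; the 4.5 bits per weight for a 4-bit quant and the 15% VRAM headroom for KV cache and activations are assumptions, not measured values. For MoE models the total parameter count decides storage, not the active count.

```python
# Rough sizing only: real model files add metadata, and the KV cache and
# activations need headroom on top of the weights.
BITS_PER_WEIGHT_Q4 = 4.5   # assumed average for a 4-bit quant
HEADROOM_FRACTION = 0.15   # assumed share of VRAM kept free for KV cache etc.

MODELS_BILLION_PARAMS = {"Qwen 3 14B": 14, "Phi-4": 14, "DeepSeek-V3": 671}
GPUS_VRAM_GB = {"RTX A4000": 16, "L40S": 48}


def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the stored weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion: float, vram_gb: float,
                 bits_per_weight: float = BITS_PER_WEIGHT_Q4) -> bool:
    """True if the weights fit in VRAM after reserving the headroom."""
    usable_gb = vram_gb * (1 - HEADROOM_FRACTION)
    return weight_size_gb(params_billion, bits_per_weight) <= usable_gb


if __name__ == "__main__":
    for model, params in MODELS_BILLION_PARAMS.items():
        size = weight_size_gb(params, BITS_PER_WEIGHT_Q4)
        fits = [g for g, vram in GPUS_VRAM_GB.items() if fits_in_vram(params, vram)]
        print(f"{model}: ~{size:.0f} GB at Q4, fits fully in: {fits or 'none'}")
```

Whatever does not fit in VRAM has to be offloaded to system RAM, which is slower and needs a host with enough memory to hold the remainder.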
The tooling layer
Our standard stack:
vLLM (v0.8+, May 2026): The default for models that fit in VRAM with a margin. Continuous batching, PagedAttention, OpenAI-compatible API.
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --quantization fp8 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90 \
    --port 8000
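llama.cpp (b4415+, May 2026): For GGUF quants of DeepSeek and Qwen. The server mode is OpenAI-compatible, KV-cache quantisation (Q8_0) reduces memory, and the community moves fast. The DeepSeek command below keeps the attention and shared layers on the GPU (-ngl 99) and pins the MoE expert tensors to system RAM with --override-tensor (-ot), a flag that only recent builds expose, so check llama-server --help on yours.
./llama-server -m deepseek-v3-Q4_K_M.gguf -ngl 99 -ot "exps=CPU" -c 8192 --port 8080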
LiteLLM proxy: A thin proxy in front of both vLLM and llama.cpp instances. Single OpenAI-compatible endpoint, usage tracking per API key, cost attribution, rate limiting, and fallback routing.
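Fallback routing is the part of the proxy that is easiest to picture in code. The sketch below is not LiteLLM's API or configuration; it is a hand-rolled illustration using only the standard library. It assumes both servers run locally on the ports used in the commands above and expose the OpenAI-style /v1/chat/completions route. The function names and the model string are placeholders.

```python
import json
import urllib.request

# Assumed local endpoints, tried in order of preference.
ENDPOINTS = [
    "http://localhost:8000/v1/chat/completions",  # vLLM
    "http://localhost:8080/v1/chat/completions",  # llama.cpp server
]


def post_json(url, payload, timeout=30.0):
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def chat_with_fallback(messages, model, endpoints=ENDPOINTS, send=post_json):
    """Try each endpoint in order and return the first successful reply text."""
    payload = {"model": model, "messages": messages}
    errors = []
    for url in endpoints:
        try:
            reply = send(url, payload)
            return reply["choices"][0]["message"]["content"]
        except (OSError, KeyError, IndexError, ValueError) as exc:
            # Network failures, timeouts, and malformed replies all fall through
            # to the next endpoint.
            errors.append(f"{url}: {exc!r}")
    raise RuntimeError("all endpoints failed: " + "; ".join(errors))
```

Passing `send` as a parameter keeps the routing logic testable without a running server. A real proxy adds what this sketch leaves out: per-key usage tracking, rate limiting, and cost attribution.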
When self-hosting wins
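The crossover math: if your monthly API bill exceeds the monthly GPU rental cost, self-hosting is cheaper. In India as of May 2026, a ₹28,000/month A4000 is about $330 at roughly ₹85 to the dollar. Counting input tokens only, at the DeepSeek API's $0.27/M that is roughly 1.2 billion input tokens per month, or about 40 million per day. Against GPT-5 at $2.50/M it is roughly 130 million per month, or about 4.4 million per day. Output tokens cost four to eight times more than input tokens in the table above, so output-heavy workloads cross over earlier.
A single agent workflow that processes 1,000 documents a day, each with 5,000 tokens of context (the document plus system prompt), burns through 5 million input tokens daily, or 150 million a month. One such workflow already crosses the line against GPT-5 pricing; against DeepSeek's input pricing it takes about eight.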
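The crossover rule (API bill versus GPU rent) can be checked for any price in the API table with a short script. This is a sketch that counts input tokens only, using the input prices and the A4000 rent quoted in this article; the ₹85 exchange rate and 30-day month are assumptions, and the function names are illustrative.

```python
GPU_RENT_INR = 28_000   # RTX A4000, per month (from the GPU table)
INR_PER_USD = 85.0      # assumed exchange rate
DAYS_PER_MONTH = 30     # assumed billing month

# Input prices in USD per million tokens, from the API table.
INPUT_PRICE_USD_PER_M = {
    "GPT-5": 2.50,
    "Claude Sonnet 4.6": 3.00,
    "DeepSeek-V3": 0.27,
    "Gemini 2.5 Pro": 1.25,
}


def crossover_million_tokens(price_usd_per_m: float,
                             gpu_rent_inr: float = GPU_RENT_INR) -> float:
    """Monthly input tokens (in millions) at which API spend equals GPU rent."""
    return gpu_rent_inr / INR_PER_USD / price_usd_per_m


def workflow_million_tokens(docs_per_day: int, tokens_per_doc: int) -> float:
    """Monthly input tokens (in millions) consumed by one workflow."""
    return docs_per_day * tokens_per_doc * DAYS_PER_MONTH / 1_000_000


if __name__ == "__main__":
    monthly = workflow_million_tokens(docs_per_day=1_000, tokens_per_doc=5_000)
    for model, price in INPUT_PRICE_USD_PER_M.items():
        threshold = crossover_million_tokens(price)
        workflows = threshold / monthly
        print(f"{model}: crossover at {threshold:,.0f}M tokens/month "
              f"(~{workflows:.1f} such workflows)")
```

Output tokens are ignored here, so the thresholds are upper bounds; adding output pricing moves every crossover lower.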
Data residency requirements (RBI circulars, healthcare data, client NDAs) make self-hosting on an Indian VPS the only option for some workflows. Latency-sensitive multi-step agent workflows (10+ inference calls per task) also benefit: a sub-50ms network round trip to a local server, versus 200-400ms to OpenAI's US endpoints, saves 1.5-3.5 seconds of pure network wait over ten sequential calls.
When the API still wins
Intermittent usage. If your workload is 500 requests a day and nothing on weekends, a ₹28,000/month GPU is wasted. Pay the API's ₹3,000/month and skip the ops.
Frontier model quality. GPT-5 and Claude Sonnet 4.6 are still better on complex reasoning, multi-step planning, and long-context tasks. Pay for the quality when the task needs it.
Team size of one. Self-hosting adds ops work: model updates, CUDA compatibility, monitoring, restart loops. The ₹50,000/month API bill is expensive but has zero maintenance.
What we actually run
- Internal agent workflows: DeepSeek-V3 Q4 on an L40S via llama.cpp. Fixed cost, good quality, acceptable latency.
- Client-facing features: DeepSeek API or GPT-5 API. We do not self-host client-facing inference because the ops risk is on us.
- Experiments and evals: Qwen 3 14B on an A4000 via vLLM. Cheap, fast, iterate on prompts before promoting to production.
The split is economic, not ideological. The question is "which model, which hosting, for which workload?"
Self-hosting crosses over at lower volume than you think. But the ops work is real and the quality gap to frontier models is still real. The answer is a mix. Always has been.