Cost of Running LLMs

Mar 20, 2026·2 min read

Generated by AI from multiple sources. Always verify critical information.

TL;DR

LLM costs come from two sources: API pricing (pay per token with closed models) or compute costs (GPU hosting for open models). Both can spiral quickly at scale. Smart caching, model routing, and prompt optimization can cut costs by 5-10x without sacrificing quality.

What Happened

LLM pricing follows a simple model for API-based services: you pay per input token and per output token, with output tokens typically costing 3-5x more than input tokens. GPT-4o costs about $2.50 per million input tokens; Claude 3.5 Sonnet is similar. Smaller models like GPT-4o-mini cost 10-20x less.
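The per-token model reduces to simple arithmetic. The sketch below estimates the cost of one request from its token counts. The $2.50 input rate is the GPT-4o figure quoted above; the $10.00 output rate is an assumption chosen to sit inside the 3-5x range, and the function name and token counts are illustrative only.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the dollar cost of one request given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Assumed rates: $2.50 per million input tokens (from the text) and
# $10.00 per million output tokens (an assumed 4x multiplier).
cost = request_cost(input_tokens=1_500, output_tokens=500,
                    input_price_per_m=2.50, output_price_per_m=10.00)
print(f"${cost:.5f} per request")
```

In this example the output tokens are a quarter of the total token count but more than half of the cost, which is why capping response length is often worth as much as trimming prompts.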

For self-hosted models, costs come from GPU compute. Running a 70B parameter model at 16-bit precision requires at least 2x A100 GPUs ($4-8/hour on cloud). Smaller models (7-13B parameters) run on single GPUs or even consumer hardware with quantization.
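A rough way to see where the GPU counts come from is to estimate the memory needed for the model weights alone. The sketch below assumes 80 GB of memory per GPU (the larger A100 variant) and ignores the KV cache, activations, and framework overhead, so it gives a lower bound rather than a sizing guide. The function names are illustrative.

```python
import math


def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_param / 8


def min_gpus(params_billions: float, bits_per_param: int,
             gpu_memory_gb: float = 80.0) -> int:
    """Lower bound on GPU count; assumes 80 GB per GPU and weights only."""
    return math.ceil(weight_memory_gb(params_billions, bits_per_param)
                     / gpu_memory_gb)


print(weight_memory_gb(70, 16), min_gpus(70, 16))  # 70B model, 16-bit weights
print(weight_memory_gb(7, 4), min_gpus(7, 4))      # 7B model, 4-bit quantized
```

Under these assumptions a 70B model needs 140 GB for weights at 16 bits, which is more than one 80 GB GPU can hold, while a 4-bit 7B model needs about 3.5 GB and fits on consumer hardware.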

The hidden cost multiplier is iteration. During development, you'll call the API thousands of times testing prompts. In production, features like regeneration, multi-step agents, and long conversations multiply costs. A seemingly cheap $0.01/request feature costs $10,000/day at 1 million requests per day.
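The scaling arithmetic can be sketched as a one-line function. The $0.01 and 1 million figures come from the text; the multiplier of 4 LLM calls per user request is a hypothetical value standing in for agents or regenerations.

```python
def daily_cost(cost_per_call: float, requests_per_day: int,
               calls_per_request: float = 1.0) -> float:
    """Daily spend when each user request triggers one or more LLM calls."""
    return cost_per_call * requests_per_day * calls_per_request


base = daily_cost(0.01, 1_000_000)
# Hypothetical: a multi-step agent that makes 4 LLM calls per user request.
agent = daily_cost(0.01, 1_000_000, calls_per_request=4)
print(f"single call: ${base:,.0f}/day, 4-step agent: ${agent:,.0f}/day")
```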

So What?

Cost optimization is a core engineering skill for AI products, not an afterthought. The biggest levers are: model routing (use the cheapest model that works for each task), caching (identical or similar queries don't need fresh LLM calls), prompt optimization (shorter prompts = lower costs), and batching (group requests where possible).
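Two of these levers, routing and caching, can be combined in a few lines. The following is a minimal sketch, not a production design: the model names are placeholders rather than real model IDs, the task categories are assumptions, and `fake_call_model` stands in for whatever API client is actually used. The cache here matches identical prompts only; a semantic cache would compare embeddings instead of hashes.

```python
import hashlib
from typing import Callable, Dict

# Hypothetical model tiers; these are placeholders, not real model IDs.
CHEAP_MODEL = "small-model"
FULL_MODEL = "large-model"

# Task types assumed to be easy enough for the cheap tier.
CHEAP_TASKS = {"classification", "extraction"}


def route(task_type: str) -> str:
    """Pick the cheapest tier assumed to handle the task."""
    return CHEAP_MODEL if task_type in CHEAP_TASKS else FULL_MODEL


class CachedClient:
    """Wraps a model-calling function with routing and an exact-match cache."""

    def __init__(self, call_model: Callable[[str, str], str]) -> None:
        self._call_model = call_model
        self._cache: Dict[str, str] = {}
        self.llm_calls = 0  # number of calls that actually reached the model

    def complete(self, task_type: str, prompt: str) -> str:
        model = route(task_type)
        key = hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.llm_calls += 1
            self._cache[key] = self._call_model(model, prompt)
        return self._cache[key]


def fake_call_model(model: str, prompt: str) -> str:
    """Stand-in for a real API call so the sketch runs offline."""
    return f"[{model}] response to: {prompt}"


client = CachedClient(fake_call_model)
client.complete("classification", "Is this email spam?")
client.complete("classification", "Is this email spam?")  # served from cache
client.complete("generation", "Write a product description.")
print(client.llm_calls)
```

Three requests result in two model calls, and only one of those goes to the expensive tier.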

Pricing is dropping rapidly: about 10x per year for equivalent capability. At that rate, what costs $100 today will cost about $10 next year. But usage often grows faster than prices drop, so optimization still matters.
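The interaction between falling prices and rising usage is easy to check numerically. In the sketch below the 10x price drop is the figure from the text, while the 20x usage growth is a purely hypothetical number chosen to illustrate the point.

```python
def projected_spend(spend_today: float, price_drop_factor: float,
                    usage_growth_factor: float, years: int = 1) -> float:
    """Spend after `years`, with prices falling and usage growing each year."""
    return spend_today * (usage_growth_factor / price_drop_factor) ** years


# Same usage, 10x cheaper: spend falls from $100 to about $10.
flat_usage = projected_spend(100.0, price_drop_factor=10, usage_growth_factor=1)
# Hypothetical 20x usage growth outpaces the 10x price drop: spend doubles.
fast_growth = projected_spend(100.0, price_drop_factor=10, usage_growth_factor=20)
print(f"flat usage: ${flat_usage:.2f}, 20x usage: ${fast_growth:.2f}")
```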

Now What?

Track cost per feature, not just total spend — find your expensive paths

Implement semantic caching: if a very similar question was asked recently, return the cached answer

Use model routing: GPT-4o-mini or Claude Haiku for classification/extraction, full models for generation

Set budget alerts at 50%, 80%, and 100% of your monthly target
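The first and last items above can be sketched together as a small in-memory tracker. This is an illustration under assumptions: the class name, feature names, and dollar amounts are hypothetical, and a real system would persist the data and reset it each month.

```python
from collections import defaultdict
from typing import Dict, List, Set

ALERT_THRESHOLDS = (0.5, 0.8, 1.0)  # 50%, 80%, 100% of the monthly target


class CostTracker:
    """Accumulates spend per feature and reports newly crossed thresholds."""

    def __init__(self, monthly_budget: float) -> None:
        self.monthly_budget = monthly_budget
        self.by_feature: Dict[str, float] = defaultdict(float)
        self._fired: Set[float] = set()

    def total(self) -> float:
        return sum(self.by_feature.values())

    def record(self, feature: str, cost: float) -> List[float]:
        """Add a cost to a feature; return thresholds crossed for the first time."""
        self.by_feature[feature] += cost
        used = self.total() / self.monthly_budget
        crossed = [t for t in ALERT_THRESHOLDS
                   if used >= t and t not in self._fired]
        self._fired.update(crossed)
        return crossed

    def most_expensive(self) -> str:
        """Name of the feature with the highest accumulated spend."""
        return max(self.by_feature, key=self.by_feature.get)


tracker = CostTracker(monthly_budget=1000.0)
for feature, cost in [("search", 300.0), ("chat", 250.0), ("chat", 300.0)]:
    for threshold in tracker.record(feature, cost):
        print(f"ALERT: {threshold:.0%} of budget used after '{feature}'")
print("most expensive feature:", tracker.most_expensive())
```

Each threshold fires once, so the alert at 50% is not repeated when spending later passes 80%.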
