What is a token?

A token is roughly 4 characters or 0.75 words of English text, so a 1,000-word article is approximately 1,333 tokens. Most API providers charge separately for input (prompt) tokens and output (completion) tokens.
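As a rough illustration of these rules of thumb, the sketch below converts word and character counts into token estimates. The function names are hypothetical, and the ratios are only the English-language approximations quoted here, not a real tokenizer; actual counts depend on the model and language.

```python
def tokens_from_words(word_count: int) -> int:
    """Estimate tokens from an English word count (~0.75 words per token)."""
    return round(word_count / 0.75)


def tokens_from_chars(char_count: int) -> int:
    """Estimate tokens from an English character count (~4 characters per token)."""
    return round(char_count / 4)


if __name__ == "__main__":
    # A 1,000-word article, and a 6,000-character block of text.
    print(tokens_from_words(1_000))
    print(tokens_from_chars(6_000))
```

Treat the result as a budgeting estimate only; for billing-grade numbers, count with the provider's own tokenizer.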

Are these prices exact?

Prices shown are approximate 2026 rates and may vary. Providers frequently update pricing, offer volume discounts, prompt caching (50–90% off cached input), and batch API rates (~50% off for non-real-time). Check the provider's current pricing page for exact rates.

How do I reduce AI API costs?

Use smaller/cheaper models when possible (Haiku vs Opus), enable prompt caching for repeated context, minimize prompt size, use batch APIs for non-real-time tasks, cap output tokens, and consider fine-tuning a smaller model for specialized tasks.

Why is output more expensive than input?

Generating tokens requires running the model one forward pass per token, while input tokens are processed in parallel. Output is typically 3–5× more expensive than input, which is why capping max_tokens and asking for concise responses is a high-impact optimization.

What is prompt caching and how much does it save?

Prompt caching stores the model's processed representation of a large repeated prefix (system prompt, docs, examples) so subsequent requests pay only 10–50% of normal input cost for that block. Savings are highest for apps with stable system prompts or RAG contexts that don't change every request.
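To see how the cached share of a prompt drives the overall saving, here is a minimal sketch. The function name and parameters are hypothetical; `cached_price_fraction` is the share of the normal input price paid for cached tokens (0.10 to 0.50 in the range quoted above). It assumes every request hits the cache and ignores any cache-write surcharge or cache expiry that a provider may apply.

```python
def input_savings_fraction(cached_tokens: int, fresh_tokens: int,
                           cached_price_fraction: float = 0.10) -> float:
    """Fraction of the input bill saved when `cached_tokens` are billed at
    `cached_price_fraction` of the normal rate and `fresh_tokens` at full price."""
    total = cached_tokens + fresh_tokens
    if total == 0:
        return 0.0
    # Cached tokens count at a reduced weight; fresh tokens at full weight.
    billed_equivalent = cached_tokens * cached_price_fraction + fresh_tokens
    return 1 - billed_equivalent / total


if __name__ == "__main__":
    # 3,000-token stable prefix plus a 100-token user message.
    print(f"{input_savings_fraction(3_000, 100, 0.10):.1%}")
    print(f"{input_savings_fraction(3_000, 100, 0.50):.1%}")
```

The saving applies to input cost only; output tokens are billed at the normal rate regardless of caching.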

When does it make sense to self-host vs use an API?

Hosted APIs are cheaper below ~$10k/month because you don't pay for idle GPUs. Above that, dedicated inference on rented GPUs (or services like AWS Bedrock provisioned throughput) can be 30–60% cheaper but requires ops investment. Frontier model quality is also still hard to match with open-weight models for many tasks.

How many tokens are in a typical prompt?

A short user message: 20–100 tokens. A system prompt: 500–3,000 tokens. A RAG context window: 2,000–20,000 tokens. A full conversation history at turn 10: 5,000–50,000 tokens. Always measure with the provider's tokenizer (tiktoken for OpenAI, Anthropic's SDK for Claude) — token counts vary by model and language.

AI API Cost Calculator

Calculate the cost of using AI model APIs for your application or project. Select your model, specify token usage per request and daily volume, and see your estimated daily, monthly, and annual costs. Compare pricing across GPT-4o, Claude Sonnet, Claude Haiku, Gemini, and more.

AI API costs surprise nearly every team that ships an LLM-powered feature. A demo chatbot that costs $2 a day during development can balloon to $20,000 a month at scale, not because anything went wrong, but because token usage compounds across users, retries, system prompts, retrieval context, and conversation history. At list prices, the cost difference between a frontier model like Claude Opus and a fast model like GPT-4o Mini can be roughly 100× for the same task, and most teams pick a model long before they understand the unit economics.

This calculator estimates daily, monthly, and annual API spend based on your model choice, token sizes, and request volume. Use it before you build to size the budget, during build to compare model alternatives, and after launch to model the impact of scaling user counts or prompt engineering changes. Pricing assumptions reflect typical 2026 list rates; expect real costs to vary with batch discounts, prompt caching, and any enterprise agreements you negotiate.

The biggest cost lever is rarely the model alone; it's the architecture. Caching identical system prompts, using smaller models for routing decisions, and trimming retrieval context typically cuts spend 40–80%, often with little or no quality loss. Use this calculator to find the model that fits the workload, then attack token volume.

Inputs

AI Model

Input Tokens per Request

Average tokens sent to the model per request

Output Tokens per Request

Average tokens generated by the model per request

Requests per Day

Results

Cost per Request

$0.007500

Daily Cost

$0.75

Monthly Cost

$22.50

Annual Cost

$273.75

Input vs Output Cost (Monthly)

Cost Over Time

Last updated: May 28, 2026

Formula

**Per-request cost:**

cost = (input_tokens / 1,000,000) × input_price + (output_tokens / 1,000,000) × output_price

API providers price per million tokens (MTok). Input tokens are everything you send (system prompt, conversation history, user message, retrieved context). Output tokens are what the model generates.

**Daily / monthly / annual cost:**

- daily = cost_per_request × requests_per_day
- monthly = daily × 30.44
- annual = daily × 365.25

**Approximate 2026 list pricing (per million tokens, input / output):**

| Model | Input $/MTok | Output $/MTok |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o Mini | $0.15 | $0.60 |
| Claude Sonnet | $3.00 | $15.00 |
| Claude Haiku | $0.80 | $4.00 |
| Claude Opus | $15.00 | $75.00 |
| Gemini 2.0 Flash | $0.10 | $0.40 |
| Gemini 2.0 Pro | $1.25 | $5.00 |
| Llama 3.1 70B (hosted) | $0.60 | $0.60 |
| Mistral Large | $2.00 | $6.00 |

**Token rough conversions:**

- 1 token ≈ 4 English characters ≈ 0.75 words
- 1,000 tokens ≈ ¾ of a page of prose
- A novel: 100k–150k tokens
- A typical RAG context window: 4k–32k tokens

**Cost-reduction multipliers to look up before committing:**

- **Prompt caching**: 50–90% off cached input tokens (Anthropic, OpenAI, Google all offer it)
- **Batch APIs**: 50% off for non-real-time jobs
- **Provisioned throughput**: flat-rate billing if your volume is predictable
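The formulas above can be sketched in a few lines of Python. The function names and the `PRICES` dictionary are illustrative, not the calculator's actual implementation; the prices are a subset of the approximate list rates in the table, and the 30.44 and 365.25 multipliers are the ones stated in the formula.

```python
# Approximate list prices from the table above: (input, output) in USD per million tokens.
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "GPT-4o Mini": (0.15, 0.60),
    "Claude Sonnet": (3.00, 15.00),
    "Claude Haiku": (0.80, 4.00),
    "Claude Opus": (15.00, 75.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in USD, priced per million tokens."""
    input_price, output_price = PRICES[model]
    return (input_tokens / 1_000_000) * input_price + (output_tokens / 1_000_000) * output_price


def projected_costs(model: str, input_tokens: int, output_tokens: int,
                    requests_per_day: int) -> dict:
    """Per-request, daily, monthly, and annual cost for a steady workload."""
    per_request = request_cost(model, input_tokens, output_tokens)
    daily = per_request * requests_per_day
    return {
        "per_request": per_request,
        "daily": daily,
        "monthly": daily * 30.44,   # average days per month
        "annual": daily * 365.25,   # average days per year
    }


if __name__ == "__main__":
    for label, value in projected_costs("GPT-4o", 1_000, 500, 100).items():
        print(f"{label}: ${value:,.4f}")
```

Swapping the model name in the call is the quickest way to see the spread between options for the same token profile.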

How to use this calculator

Pick the model that matches your task. Start with the smallest model that meets your quality bar; upgrade only when needed.
Estimate input tokens per request. Include system prompt + conversation history + retrieved context + user message — not just the user message.
Estimate output tokens. If you cap max_tokens, use that; otherwise estimate from typical responses (a paragraph is ~150 tokens).
Enter realistic request volume. Daily active users × requests per session × turns per conversation.
Multiply by a 1.2–1.5x fudge factor for retries, eval runs, and traffic spikes.
Run the calculation again with a cheaper model to see the spread. The difference is usually larger than people expect.
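The volume estimate from the steps above (users × requests per session × turns per conversation, then a fudge factor) can be sketched as follows. The function and parameter names are hypothetical, and the 1.3 default is simply a midpoint of the suggested 1.2–1.5 range.

```python
def daily_requests(daily_active_users: int, requests_per_session: float,
                   turns_per_conversation: float, fudge_factor: float = 1.3) -> int:
    """Estimated billable requests per day, padded for retries, evals, and spikes."""
    if fudge_factor < 1.0:
        raise ValueError("fudge_factor should be at least 1.0")
    base = daily_active_users * requests_per_session * turns_per_conversation
    return round(base * fudge_factor)


if __name__ == "__main__":
    # 1,000 daily users, 2 requests per session, 6 turns each, padded by 1.5x.
    print(daily_requests(1_000, 2, 6, fudge_factor=1.5))
```

Feed the result into the "Requests per Day" input rather than the unpadded figure.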

Worked examples

Customer support chatbot

**Scenario:** You're building a Claude Sonnet support bot. System prompt + docs context: 3,000 input tokens. Average user query: 100 tokens. Average response: 400 tokens. 500 conversations/day, 6 messages per conversation.

**Calculation:** Input per turn = 3,000 + 100 = 3,100 tokens. Output = 400. Per-turn cost = (3,100/1M × $3) + (400/1M × $15) = $0.0093 + $0.0060 = $0.0153. Per conversation (6 turns) = $0.092. Daily = 500 × $0.092 = $46. Monthly = ~$1,400. Annual = ~$16,800.

**Result:** Annual API cost: ~$16,800. With prompt caching on the docs context (90% off the 3,000-token block), that block bills as the equivalent of 300 tokens, so the per-turn cost falls to (400/1M × $3) + (400/1M × $15) = $0.0012 + $0.0060 = $0.0072, or ~$7,900/year. Same answer quality at roughly half the cost; caching is usually the highest-leverage change you can make.
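A minimal sketch that recomputes this scenario from its stated inputs, with and without caching. The function name is hypothetical; it assumes every turn hits the cache, applies the discount only to the 3,000-token context block, and ignores any cache-write surcharge.

```python
INPUT_PRICE = 3.00    # USD per million input tokens (Claude Sonnet list rate above)
OUTPUT_PRICE = 15.00  # USD per million output tokens


def annual_cost(context_tokens: int, query_tokens: int, response_tokens: int,
                turns_per_day: int, cached_price_fraction: float = 1.0) -> float:
    """Annual spend; `cached_price_fraction` is the share of the input price
    paid for the context block (1.0 = no caching, 0.10 = 90% off)."""
    billed_input = context_tokens * cached_price_fraction + query_tokens
    per_turn = (billed_input / 1_000_000) * INPUT_PRICE \
        + (response_tokens / 1_000_000) * OUTPUT_PRICE
    return per_turn * turns_per_day * 365.25


if __name__ == "__main__":
    turns_per_day = 500 * 6  # 500 conversations, 6 turns each
    uncached = annual_cost(3_000, 100, 400, turns_per_day)
    cached = annual_cost(3_000, 100, 400, turns_per_day, cached_price_fraction=0.10)
    print(f"uncached: ${uncached:,.0f}/year")
    print(f"cached:   ${cached:,.0f}/year")
```

Because output tokens are not discounted, the total saving is smaller than the 90% discount on the cached block alone.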

Model-comparison sticker shock

**Scenario:** You have a content classification task: 800 input tokens, 50 output tokens, 50,000 requests/day. You're deciding between Claude Opus and GPT-4o Mini.

**Calculation:**

- Opus: (800/1M × $15) + (50/1M × $75) = $0.012 + $0.00375 = $0.01575 per request × 50,000 = $787/day = $23,975/month.
- Mini: (800/1M × $0.15) + (50/1M × $0.60) = $0.00012 + $0.00003 = $0.00015 per request × 50,000 = $7.50/day = $228/month.

**Result:** Opus costs 100× more for this workload ($288k/year vs $2.7k/year). For classification, GPT-4o Mini is almost certainly within accuracy tolerance. Always evaluate smaller models on your task before paying frontier prices.
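The same comparison as a short sketch, using the two models' list rates from the pricing table. The function name is hypothetical.

```python
# (input, output) USD per million tokens, from the approximate list pricing above.
OPUS = (15.00, 75.00)
MINI = (0.15, 0.60)


def daily_cost(prices: tuple, input_tokens: int, output_tokens: int,
               requests_per_day: int) -> float:
    """Daily spend in USD for a fixed per-request token profile."""
    input_price, output_price = prices
    per_request = (input_tokens / 1_000_000) * input_price \
        + (output_tokens / 1_000_000) * output_price
    return per_request * requests_per_day


if __name__ == "__main__":
    opus = daily_cost(OPUS, 800, 50, 50_000)
    mini = daily_cost(MINI, 800, 50, 50_000)
    print(f"Opus: ${opus:,.2f}/day, Mini: ${mini:,.2f}/day, ratio: {opus / mini:.0f}x")
```

The ratio depends on the input/output mix: output-heavy workloads widen the gap, because the output price ratio between these two models is larger than the input ratio.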

RAG application with retrieval bloat

**Scenario:** A retrieval-augmented Q&A app: top-20 chunks at 500 tokens each = 10,000 retrieval tokens + 200 user query + 500 output, on GPT-4o, 10,000 requests/day.

**Calculation:** Per request: (10,200/1M × $2.50) + (500/1M × $10) = $0.0255 + $0.0050 = $0.0305. Daily = 10,000 × $0.0305 = $305. Monthly = $9,283. Annual = $111,400.

**Result:** Trimming retrieval from top-20 to top-5 (a 7,500-token reduction) drops the per-request cost to (2,700/1M × $2.50) + $0.0050 = $0.01175 and the annual cost to about $42,900, a saving of roughly $68k, usually with little quality loss because most signal is in the top results. Retrieval engineering pays for itself fast at scale.
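To explore how the number of retrieved chunks moves the bill, the scenario can be parameterized by top-k. This is a sketch with a hypothetical function name; the defaults are the scenario's stated values and the GPT-4o list rates from the pricing table.

```python
INPUT_PRICE = 2.50    # USD per million input tokens (GPT-4o list rate above)
OUTPUT_PRICE = 10.00  # USD per million output tokens


def annual_cost_for_top_k(top_k: int, chunk_tokens: int = 500, query_tokens: int = 200,
                          output_tokens: int = 500, requests_per_day: int = 10_000) -> float:
    """Annual spend when each request sends `top_k` retrieved chunks plus the query."""
    input_tokens = top_k * chunk_tokens + query_tokens
    per_request = (input_tokens / 1_000_000) * INPUT_PRICE \
        + (output_tokens / 1_000_000) * OUTPUT_PRICE
    return per_request * requests_per_day * 365.25


if __name__ == "__main__":
    for k in (20, 10, 5):
        print(f"top-{k}: ${annual_cost_for_top_k(k):,.0f}/year")
```

Cost falls linearly with top-k, so the useful question is where answer quality starts to drop, which only an evaluation on your own data can answer.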

When to use this calculator

**Use this calculator when:**

- **Budgeting a new AI feature**: before you commit to a model, model the unit economics at expected scale.
- **Comparing model choices**: the same task can vary 50–100× in cost between models. Run the math before locking in.
- **Planning fundraising or pricing**: cost-of-goods on AI features directly affects your gross margin and what you can charge.
- **Optimizing an existing pipeline**: identify whether the bottleneck is input tokens (cache or trim), output tokens (cap max_tokens), or volume (rate-limit or batch).
- **Negotiating with finance**: a credible per-user cost number lets you defend the AI line item.
- **Choosing between hosted and self-hosted**: when API costs cross ~$10k–30k/month, dedicated inference (vLLM, TGI, SageMaker) may break even.

**Patterns that drive cost up:**

- Long system prompts repeated every turn (cache them)
- Multi-shot examples in the prompt (consider fine-tuning instead)
- Multi-turn conversations with full history (summarize older turns)
- Streaming + cancelled requests (you still pay for what was generated)
- Tool use loops that retry on bad parses
- Evaluation runs hitting prod models instead of mocks

**Patterns that drive cost down:**

- Prompt caching (largest single lever for any RAG or agent app)
- Smaller models with structured output mode
- Batch API for nightly classification/summarization jobs
- Local embedding models (free) for retrieval, premium model only for the final generation
- Hard caps on output tokens

Common mistakes to avoid

Estimating input tokens from the user message only. The system prompt, examples, conversation history, and retrieval context are all billed as input every turn.
Forgetting that output tokens typically cost 3–5× as much as input tokens. Encouraging shorter responses (or capping max_tokens) often beats reducing the prompt.
Ignoring conversation history. By turn 10 of a chat, you're sending all prior turns as input on every request. Per-request input grows linearly with the turn number, so the cumulative input cost of a conversation grows roughly quadratically with its length unless you summarize or truncate.
Pricing at list rates when you should price at cached rates. Anthropic, OpenAI, and Google all support prompt caching with 50–90% input discounts.
Sizing for average load instead of peak. AI features often have spiky usage (Monday mornings, product launches); budget for the 95th percentile.
Picking the model on a demo prompt without running real evals. A model that's 5× cheaper but 2% less accurate might be perfectly fine for your task — or catastrophic, depending on what the task is.
Forgetting eval and dev costs. Running 1,000 prompts through Opus during a model comparison can cost more than a month of production usage at a smaller model.
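The conversation-history mistake above is easy to underestimate, so here is a small sketch of how billed input accumulates when the full history is resent each turn. The function name and the token sizes in the example are assumptions chosen for illustration.

```python
def cumulative_input_tokens(turns: int, system_tokens: int,
                            user_tokens: int, reply_tokens: int) -> int:
    """Total input tokens billed over a conversation when every request
    resends the system prompt plus all prior user and assistant messages."""
    total = 0
    history = 0  # tokens from completed turns
    for _ in range(turns):
        # Each request sends: system prompt + full history + the new user message.
        total += system_tokens + history + user_tokens
        history += user_tokens + reply_tokens
    return total


if __name__ == "__main__":
    ten = cumulative_input_tokens(10, 1_000, 100, 400)
    twenty = cumulative_input_tokens(20, 1_000, 100, 400)
    print(ten, twenty, round(twenty / ten, 2))
```

Doubling the conversation length more than doubles the billed input, which is why summarizing or truncating older turns matters for long chats.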
