CalcMountain

AI API Cost Calculator

Calculate the cost of using AI model APIs for your application or project. Select your model, specify token usage per request and daily volume, and see your estimated daily, monthly, and annual costs. Compare pricing across GPT-4o, Claude Sonnet, Claude Haiku, Gemini, and more.

AI API costs surprise nearly every team that ships an LLM-powered feature. A demo chatbot that costs $2 a day during development can balloon to $20,000 a month at scale, not because anything went wrong, but because token usage compounds across users, retries, system prompts, retrieval context, and conversation history. At the list prices used here, the cost difference between a frontier model like Claude Opus and a fast model like GPT-4o Mini is roughly 100× for the same task, and most teams pick a model long before they understand the unit economics.

This calculator estimates daily, monthly, and annual API spend based on your model choice, token sizes, and request volume. Use it before you build to size the budget, during build to compare model alternatives, and after launch to model the impact of scaling user counts or prompt engineering changes. Pricing assumptions reflect typical 2026 list rates; expect real costs to vary with batch discounts, prompt caching (cached-input pricing), and any enterprise agreements you negotiate.

The biggest cost lever is almost never the model; it's the architecture. Caching identical system prompts, using smaller models for routing decisions, and trimming retrieval context can cut spend 40–80%, often with little or no quality loss. Use this calculator to find the model that fits the workload, then attack token volume.

Inputs

Average tokens sent to the model per request

Average tokens generated by the model per request

Results

Cost per Request

$0.007500

Daily Cost

$0.75

Monthly Cost

$22.50

Annual Cost

$273.75

Input vs Output Cost (Monthly)

Cost Over Time

Formula

**Per-request cost:**

cost = (input_tokens / 1,000,000) × input_price + (output_tokens / 1,000,000) × output_price

API providers price per million tokens (MTok). Input tokens are everything you send (system prompt, conversation history, user message, retrieved context). Output tokens are what the model generates.

**Daily / monthly / annual cost:**

- daily = cost_per_request × requests_per_day
- monthly = daily × 30.44
- annual = daily × 365.25

**Approximate 2026 list pricing (per million tokens, input / output):**

| Model | Input $/MTok | Output $/MTok |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o Mini | $0.15 | $0.60 |
| Claude Sonnet | $3.00 | $15.00 |
| Claude Haiku | $0.80 | $4.00 |
| Claude Opus | $15.00 | $75.00 |
| Gemini 2.0 Flash | $0.10 | $0.40 |
| Gemini 2.0 Pro | $1.25 | $5.00 |
| Llama 3.1 70B (hosted) | $0.60 | $0.60 |
| Mistral Large | $2.00 | $6.00 |

**Rough token conversions:**

- 1 token ≈ 4 English characters ≈ 0.75 words
- 1,000 tokens ≈ 750 words, or about 1.5 single-spaced pages of prose
- A novel: 100k–150k tokens
- A typical RAG context: 4k–32k tokens

**Cost-reduction multipliers to look up before committing:**

- **Prompt caching**: 50–90% off cached input tokens (Anthropic, OpenAI, and Google all offer it)
- **Batch APIs**: 50% off for non-real-time jobs
- **Provisioned throughput**: flat-rate billing if your volume is predictable
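The formulas above translate directly into a few lines of Python. This is a minimal sketch, not the calculator's actual implementation: the function names are hypothetical, and the prices are a subset of the approximate list rates in the table, not live provider pricing.

```python
# Approximate list prices from the table above, in USD per million tokens,
# stored as (input, output). Illustrative values, not live provider rates.
PRICES_PER_MTOK = {
    "GPT-4o": (2.50, 10.00),
    "GPT-4o Mini": (0.15, 0.60),
    "Claude Sonnet": (3.00, 15.00),
    "Claude Haiku": (0.80, 4.00),
    "Claude Opus": (15.00, 75.00),
}

DAYS_PER_MONTH = 30.44   # average month length used by the formula
DAYS_PER_YEAR = 365.25   # average year length used by the formula


def cost_per_request(model, input_tokens, output_tokens):
    """USD cost of one request at list price."""
    input_price, output_price = PRICES_PER_MTOK[model]
    return (input_tokens / 1_000_000) * input_price + (
        output_tokens / 1_000_000
    ) * output_price


def projected_costs(model, input_tokens, output_tokens, requests_per_day):
    """Per-request, daily, monthly, and annual cost in USD."""
    per_request = cost_per_request(model, input_tokens, output_tokens)
    daily = per_request * requests_per_day
    return {
        "per_request": per_request,
        "daily": daily,
        "monthly": daily * DAYS_PER_MONTH,
        "annual": daily * DAYS_PER_YEAR,
    }


# Hypothetical workload: 2,000 input tokens, 300 output tokens, 1,000 requests/day.
for period, usd in projected_costs("Claude Haiku", 2_000, 300, 1_000).items():
    print(f"{period:<12} ${usd:,.4f}")
```

Keeping the prices in one dictionary makes it easy to swap in negotiated or cached rates without touching the arithmetic.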

How to use this calculator

  1. Pick the model that matches your task. Start with the smallest model that meets your quality bar; upgrade only when needed.
  2. Estimate input tokens per request. Include system prompt + conversation history + retrieved context + user message, not just the user message.
  3. Estimate output tokens. If you cap max_tokens, use that; otherwise estimate from typical responses (a paragraph is ~150 tokens).
  4. Enter realistic request volume: daily active users × conversations per user per day × turns per conversation.
  5. Multiply by a 1.2–1.5× fudge factor for retries, eval runs, and traffic spikes.
  6. Run the calculation again with a cheaper model to see the spread. The difference is usually larger than people expect.
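The steps above can be sketched as a short script. Everything here is an illustrative assumption: the helper names are hypothetical, the workload numbers are made up, and the volume formula reads step 4 as users × conversations per user per day × turns per conversation.

```python
# Approximate list prices (USD per million tokens, input/output) for the two
# models being compared; taken from this page's table, not live rates.
PRICES_PER_MTOK = {
    "Claude Sonnet": (3.00, 15.00),
    "Claude Haiku": (0.80, 4.00),
}


def input_tokens_per_request(system_prompt, history, retrieved_context, user_message):
    """Step 2: everything sent to the model counts as input."""
    return system_prompt + history + retrieved_context + user_message


def requests_per_day(daily_active_users, conversations_per_user,
                     turns_per_conversation, fudge_factor=1.3):
    """Steps 4 and 5: raw volume times a fudge factor for retries and spikes."""
    return (daily_active_users * conversations_per_user
            * turns_per_conversation * fudge_factor)


def daily_cost(model, input_tokens, output_tokens, requests):
    """Daily USD cost at list price for the given volume."""
    input_price, output_price = PRICES_PER_MTOK[model]
    per_request = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return per_request * requests


# Hypothetical workload.
tokens_in = input_tokens_per_request(800, 1_200, 2_000, 100)  # 4,100 tokens
tokens_out = 300                                             # step 3
volume = requests_per_day(2_000, 1.5, 4)                      # fudge factor 1.3

# Step 6: rerun with a cheaper model to see the spread.
for model in PRICES_PER_MTOK:
    print(f"{model:<14} ${daily_cost(model, tokens_in, tokens_out, volume):,.2f}/day")
```

Varying `fudge_factor` between 1.2 and 1.5 gives a quick low and high bound for the same workload.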

Worked examples

Customer support chatbot

**Scenario:** You're building a Claude Sonnet support bot. System prompt + docs context: 3,000 input tokens. Average user query: 100 tokens. Average response: 400 tokens. 500 conversations/day, 6 messages per conversation.

**Calculation:** Input per turn = 3,000 + 100 = 3,100 tokens. Output = 400. Per-turn cost = (3,100/1M × $3) + (400/1M × $15) = $0.0093 + $0.0060 = $0.0153. Per conversation (6 turns) = $0.092. Daily = 500 × $0.092 = $46. Monthly = ~$1,400. Annual = ~$16,800. This treats every turn as a fresh 3,100-token request and ignores accumulated conversation history.

**Result:** Annual API cost: ~$16,800. With prompt caching on the docs context (90% off the 3,000-token block), per-turn input cost falls from $0.0093 to $0.0012, the per-turn total becomes $0.0072, and the annual cost drops to ~$7,900. Same answer quality at roughly half the cost; what remains is mostly output tokens, which caching does not discount. Caching is usually the highest-leverage change you can make.
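The caching arithmetic for this scenario can be checked with a short sketch. It assumes a flat 90% discount on the cached 3,000-token block and ignores cache-write surcharges and cache expiry, which real providers may apply; the function names are hypothetical.

```python
# Claude Sonnet list prices from this page, USD per million tokens.
SONNET_INPUT_PER_MTOK = 3.00
SONNET_OUTPUT_PER_MTOK = 15.00


def per_turn_cost(cacheable_tokens, fresh_input_tokens, output_tokens,
                  cache_discount=0.0):
    """Cost of one turn; cache_discount applies only to the cacheable block."""
    cached = cacheable_tokens / 1_000_000 * SONNET_INPUT_PER_MTOK * (1 - cache_discount)
    fresh = fresh_input_tokens / 1_000_000 * SONNET_INPUT_PER_MTOK
    output = output_tokens / 1_000_000 * SONNET_OUTPUT_PER_MTOK
    return cached + fresh + output


def annual_cost(turn_cost, turns_per_conversation=6, conversations_per_day=500):
    """Scale a per-turn cost to a year using the 365.25-day convention."""
    return turn_cost * turns_per_conversation * conversations_per_day * 365.25


uncached = per_turn_cost(3_000, 100, 400)
cached = per_turn_cost(3_000, 100, 400, cache_discount=0.9)

print(f"uncached: ${uncached:.4f}/turn, ${annual_cost(uncached):,.0f}/year")
print(f"cached:   ${cached:.4f}/turn, ${annual_cost(cached):,.0f}/year")
```

Because the 400 output tokens are never discounted, the saving is bounded by the input share of the bill, however aggressive the cache discount.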

Model-comparison sticker shock

**Scenario:** You have a content classification task: 800 input tokens, 50 output tokens, 50,000 requests/day. You're deciding between Claude Opus and GPT-4o Mini.

**Calculation:** Opus: (800/1M × $15) + (50/1M × $75) = $0.012 + $0.00375 = $0.01575 per request × 50,000 = $787.50/day = ~$23,970/month. Mini: (800/1M × $0.15) + (50/1M × $0.60) = $0.00012 + $0.00003 = $0.00015 per request × 50,000 = $7.50/day = $228/month.

**Result:** Opus costs about 105× more for this workload ($288k/year vs $2.7k/year). For classification, GPT-4o Mini is almost certainly within accuracy tolerance. Always evaluate smaller models on your task before paying frontier prices.
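To see the full spread rather than a two-model comparison, the same workload can be priced against every row of the pricing table. This is an illustrative sketch using the page's approximate list rates; `rank_models` is a hypothetical helper, and cost alone says nothing about accuracy on your task.

```python
# Approximate list prices from this page, USD per million tokens (input, output).
PRICES_PER_MTOK = {
    "GPT-4o": (2.50, 10.00),
    "GPT-4o Mini": (0.15, 0.60),
    "Claude Sonnet": (3.00, 15.00),
    "Claude Haiku": (0.80, 4.00),
    "Claude Opus": (15.00, 75.00),
    "Gemini 2.0 Flash": (0.10, 0.40),
    "Gemini 2.0 Pro": (1.25, 5.00),
    "Llama 3.1 70B (hosted)": (0.60, 0.60),
    "Mistral Large": (2.00, 6.00),
}


def rank_models(input_tokens, output_tokens, requests_per_day):
    """Return (model, daily_cost_usd) pairs sorted from cheapest to most expensive."""
    rows = []
    for model, (input_price, output_price) in PRICES_PER_MTOK.items():
        per_request = (input_tokens * input_price
                       + output_tokens * output_price) / 1_000_000
        rows.append((model, per_request * requests_per_day))
    return sorted(rows, key=lambda row: row[1])


# The classification workload from this example.
ranking = rank_models(800, 50, 50_000)
cheapest = ranking[0][1]
for model, cost in ranking:
    print(f"{model:<24} ${cost:>9.2f}/day  {cost / cheapest:6.1f}x the cheapest")
```

A ranking like this is a starting shortlist; the candidates still need to be evaluated on real task data before one is chosen.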

RAG application with retrieval bloat

**Scenario:** A retrieval-augmented Q&A app: top-20 chunks at 500 tokens each = 10,000 retrieval tokens + 200 user query + 500 output, on GPT-4o, 10,000 requests/day.

**Calculation:** Per request: (10,200/1M × $2.50) + (500/1M × $10) = $0.0255 + $0.0050 = $0.0305. Daily = 10,000 × $0.0305 = $305. Monthly = ~$9,280. Annual = ~$111,400.

**Result:** Trimming retrieval from top-20 to top-5 (a 7,500-token reduction, leaving 2,700 input tokens) cuts the per-request cost to $0.01175 and the annual cost to ~$42,900. That is a saving of about $68,500, usually with little quality loss because most of the signal is in the top results. Retrieval engineering pays for itself fast at scale.
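The effect of the retrieval depth on annual cost can be sketched as a function of top-k. The function name and defaults are assumptions taken from this scenario (500-token chunks, 200-token query, 500-token answer, GPT-4o list prices), and the sketch says nothing about answer quality at each depth.

```python
# GPT-4o list prices from this page, USD per million tokens.
GPT4O_INPUT_PER_MTOK = 2.50
GPT4O_OUTPUT_PER_MTOK = 10.00


def rag_annual_cost(top_k, tokens_per_chunk=500, query_tokens=200,
                    output_tokens=500, requests_per_day=10_000):
    """Annual USD cost when top_k retrieved chunks are sent with every request."""
    input_tokens = top_k * tokens_per_chunk + query_tokens
    per_request = (input_tokens * GPT4O_INPUT_PER_MTOK
                   + output_tokens * GPT4O_OUTPUT_PER_MTOK) / 1_000_000
    return per_request * requests_per_day * 365.25


baseline = rag_annual_cost(20)
for top_k in (20, 10, 5):
    cost = rag_annual_cost(top_k)
    print(f"top-{top_k:<2} ${cost:>10,.0f}/year  saves ${baseline - cost:>9,.0f}")
```

Pairing each row with a retrieval-quality metric from your own evals shows where trimming stops being free.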

When to use this calculator

**Use this calculator when:**

- **Budgeting a new AI feature**: before you commit to a model, model the unit economics at expected scale.
- **Comparing model choices**: the same task can vary 50–100× in cost between models. Run the math before locking in.
- **Planning fundraising or pricing**: cost-of-goods on AI features directly affects your gross margin and what you can charge.
- **Optimizing an existing pipeline**: identify whether the bottleneck is input tokens (cache or trim), output tokens (cap max_tokens), or volume (rate-limit or batch).
- **Negotiating with finance**: a credible per-user cost number lets you defend the AI line item.
- **Choosing between hosted and self-hosted**: when API costs cross ~$10k–30k/month, dedicated inference (vLLM, TGI, SageMaker) may break even.

**Patterns that drive cost up:**

- Long system prompts repeated every turn (cache them)
- Multi-shot examples in the prompt (consider fine-tuning instead)
- Multi-turn conversations with full history (summarize older turns)
- Streaming + cancelled requests (you still pay for what was generated)
- Tool use loops that retry on bad parses
- Evaluation runs hitting prod models instead of mocks

**Patterns that drive cost down:**

- Prompt caching (largest single lever for any RAG or agent app)
- Smaller models with structured output mode
- Batch API for nightly classification/summarization jobs
- Local embedding models (free) for retrieval, premium model only for the final generation
- Hard caps on output tokens

Common mistakes to avoid

  • Estimating input tokens from the user message only. The system prompt, examples, conversation history, and retrieval context are all billed as input every turn.
  • Forgetting that output tokens typically cost 3–5× as much per token as input tokens. Encouraging shorter responses (or capping max_tokens) often beats reducing the prompt.
  • Ignoring conversation history. By turn 10 of a chat, you're sending all prior turns as input on every request. Per-request input grows linearly with the turn count, so the total cost of a conversation grows roughly quadratically without summarization.
  • Pricing at list rates when you should price at cached rates. Anthropic, OpenAI, and Google all support prompt caching with 50–90% input discounts.
  • Sizing for average load instead of peak. AI features often have spiky usage (Monday mornings, product launches); budget for the 95th percentile.
  • Picking the model on a demo prompt without running real evals. A model that's 5× cheaper but 2% less accurate might be perfectly fine for your task — or catastrophic, depending on what the task is.
  • Forgetting eval and dev costs. Running 1,000 prompts through Opus during a model comparison can cost more than a month of production usage on a smaller model.
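The conversation-history mistake is easiest to see by counting tokens. The sketch below sums the input tokens billed over a whole conversation when every turn resends the full history. The token sizes are hypothetical defaults, and `history_cap` is a crude stand-in for summarizing or truncating older turns, not a real summarization strategy.

```python
def cumulative_input_tokens(turns, system_tokens=500, user_tokens=100,
                            assistant_tokens=400, history_cap=None):
    """Total input tokens billed across a conversation of `turns` turns.

    Each turn sends the system prompt, the prior history, and the new user
    message. If history_cap is set, at most that many history tokens are sent.
    """
    total = 0
    history = 0
    for _ in range(turns):
        sent_history = history if history_cap is None else min(history, history_cap)
        total += system_tokens + sent_history + user_tokens
        # After the turn, the user message and the reply join the history.
        history += user_tokens + assistant_tokens
    return total


for turns in (5, 10, 20):
    full = cumulative_input_tokens(turns)
    capped = cumulative_input_tokens(turns, history_cap=2_000)
    print(f"{turns:>2} turns: {full:>7,} input tokens uncapped, {capped:>6,} capped")
```

Doubling the conversation from 10 to 20 turns nearly quadruples the uncapped total, while the capped total grows close to linearly.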
