The answer, before anything else: Claude’s API in mid-2026 is a RESTful interface at https://api.anthropic.com offering three active model tiers (Opus 4.7, Sonnet 4.6, Haiku 4.5), eight official SDKs, a 50%-cost batch API, prompt caching that reads at 10% of base input price, server-executed web search and code execution tools, and Claude Code a free agentic coding tool bundled with the $17/month Pro plan.
If you came for a quick summary, that’s it. The rest of this post unpacks each piece with numbers pulled directly from Anthropic’s own documentation.
Model Pricing: What You Actually Pay (May 2026)
Pricing is per million tokens. One token is roughly 0.75 of an English word. Input tokens are what you send; output tokens are what Claude generates.
| Model | Input (per MTok) | Output (per MTok) | Context Window | Max Output | Knowledge Cutoff |
|---|---|---|---|---|---|
| Claude Opus 4.7 | $5.00 | $25.00 | 1M tokens | 128K tokens | January 2026 |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1M tokens | 64K tokens | August 2026 |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K tokens | 64K tokens | February 2026 |
Batch processing halves all of these. Opus 4.7 drops to $2.50/$12.50, Sonnet 4.6 to $1.50/$7.50, Haiku 4.5 to $0.50/$2.50. Batches complete within an hour and accept up to 100,000 requests or 256 MB.
Prompt caching adds two cost dimensions:
- Cache writes cost 1.25x the base input price for a 5-minute TTL, 2x for a 1-hour TTL.
- Cache reads cost only 10% of the base input price. On Sonnet 4.6, that’s $0.30/MTok for a cache hit versus $3.00/MTok for fresh input.
“Cache reads cost significantly less than uncached input tokens, so reaching the minimum can reduce costs for frequently reused prompts.” Anthropic Prompt Caching docs
A 100,000-token system prompt cached and reused across 100 requests costs $0.50 to write once (Opus 4.7, 5-min TTL) and $0.05 per read thereafter. Without caching, those same 100 requests would charge $5.00 each in input tokens. The math is not subtle.
Platform-specific pricing also matters. US-only inference adds a 1.1x multiplier. Claude Managed Agents (currently beta) charges $0.08 per session-hour on top of standard token rates. Web search costs $10 per 1,000 searches. Code execution in the sandboxed environment is free for the first 50 hours per day per organization, then $0.05 per hour per container.
The Models: Which One Does What
-
Claude Opus 4.7 (
claude-opus-4-7) Anthropic’s most capable generally available model. Built for complex reasoning and agentic coding with what the company calls “a step-change improvement” over Opus 4.6. Uses adaptive thinking (thinking: {type: "adaptive"}) rather than manual extended thinking. 1M token context window with a new tokenizer. Available on the direct API, Claude Platform on AWS, Amazon Bedrock, Vertex AI, and Microsoft Foundry. -
Claude Sonnet 4.6 (
claude-sonnet-4-6) The speed-intelligence sweet spot. Supports both adaptive thinking (recommended) and manual extended thinking (deprecated but functional). 1M token context window. Anthropic positions it for coding, agents, and enterprise workflows where you need Opus-level behavior at 60% of the input cost. -
Claude Haiku 4.5 (
claude-haiku-4-5) Fastest model, near-frontier intelligence at $1/$5. 200K context window. Designed for high-throughput tasks like classification, content moderation, and lightweight chat. No adaptive thinking support; manual extended thinking works on it.
Legacy models still available: Opus 4.6 ($5/$25), Sonnet 4.5 ($3/$15), Opus 4.5 ($5/$25), Opus 4.1 ($15/$75 the expensive one to migrate away from). Opus 4 and Sonnet 4 are deprecated and retire June 15, 2026.
Claude Mythos Preview is a separate research-preview model for defensive cybersecurity workflows under Project Glasswing. Invitation-only, no self-serve signup.
Official SDKs: 8 Languages, One API Surface
Anthropic publishes first-party SDKs that handle authentication headers, retry logic, streaming, and typed request/response interfaces:
- Python
pip install anthropic. Python 3.9+. Sync and async clients, Pydantic models. - TypeScript
npm install @anthropic-ai/sdk. Node.js 20+, also works in Deno, Bun, and browsers. - Java Maven/Gradle,
com.anthropic:anthropic-java:2.33.0. Java 8+. Builder pattern, CompletableFuture async. - Go
go get github.com/anthropics/anthropic-sdk-go. Go 1.23+. Context-based cancellation, functional options. - C#
dotnet add package Anthropic. .NET Standard 2.0+.IChatClientintegration. - Ruby
bundle add anthropic. Ruby 3.2.0+. Sorbet types, streaming helpers. - PHP
composer require anthropic-ai/sdk. PHP 8.1.0+. Value objects, builder pattern. - CLI (ant)
brew install anthropics/tap/ant. Shell scripting, typed flags, response transforms.
Every SDK supports the same core operations: Messages API, streaming, prompt caching, tool use, batch processing, and beta features via the beta namespace. GitHub repos live at github.com/anthropics/anthropic-sdk-{language}.
Claude Code: The Agentic Coding Layer
Claude Code is an agentic coding tool that reads your codebase, edits files, runs terminal commands, and integrates with git, GitHub, GitLab, Slack, Jira, and Google Drive via the Model Context Protocol (MCP). It is free with any paid Claude subscription ($17/month Pro billed annually, or $20/month).
Available surfaces:
- Terminal CLI Native install via
curl -fsSL https://claude.ai/install.sh | bashon macOS/Linux, orirm https://claude.ai/install.ps1 | iexon Windows PowerShell. Also available via Homebrew (brew install --cask claude-code) and WinGet (winget install Anthropic.ClaudeCode). - VS Code extension Inline diffs, @-mentions, plan review.
- JetBrains plugin IntelliJ, PyCharm, WebStorm.
- Desktop app Standalone app with visual diff review, multiple parallel sessions, scheduled tasks.
- Web
claude.ai/code, no local setup, sessions persist across devices.
Key capabilities:
- CLAUDE.md A project-level markdown file that Claude Code reads at session start for coding standards, architecture decisions, and review checklists.
- Hooks Shell commands that fire before/after Claude actions (auto-format after file edits, lint before commits).
- Skills Package repeatable workflows as
/review-pror/deploy-stagingthat teams share. - Sub-agents Multiple Claude Code agents working different parts of a task in parallel, coordinated by a lead agent.
- Background agents Run sessions in parallel from one screen.
- Agent SDK Build custom agents with full control over orchestration, tool access, and permissions.
- Routines Scheduled tasks that run on Anthropic-managed infrastructure, triggerable via API or GitHub events.
- Remote Control Continue a local session from a phone or browser.
- Channels Push events from Telegram, Discord, iMessage, or custom webhooks into a session.
Key API Features: What’s Actually Available
Messages API (POST /v1/messages) The core endpoint. Send a system prompt and message array, receive a response. Supports text, images (vision), streaming via SSE, extended/adaptive thinking, and tool use.
Message Batches API (POST /v1/messages/batches) Submit up to 100,000 requests or 256 MB for asynchronous processing at 50% of standard pricing. Results available for 29 days. Supports vision, tool use (including server tools), multi-turn conversations, and extended thinking. Does not support streaming, Fast mode, or Threads.
Token Counting API (POST /v1/messages/count_tokens) Count tokens before sending to manage costs and rate limits.
Prompt Caching Two modes: automatic caching (one cache_control field at request top level, system moves the breakpoint as conversations grow) and explicit caching (place cache_control on individual content blocks, up to 4 breakpoints). Minimum cacheable prompt length varies by model (1,024 tokens for Sonnet/Opus 4.1+, 4,096 for Opus 4.7/Haiku 4.5). Cache pre-warming supported via max_tokens: 0 requests. Workspace-level cache isolation as of February 5, 2026.
Tool Use Three categories:
- User-defined tools (client-executed) You define the JSON schema, Claude decides when to call, your code executes, you return results. The canonical loop: while
stop_reason == "tool_use", execute tools, continue conversation. - Anthropic-schema tools (client-executed)
bash,text_editor,computer,memory. Trained-in schemas for reliable calling; your code still does the execution. - Server-executed tools
web_search(type:web_search_20260305),code_execution,web_fetch,tool_search. Anthropic’s infrastructure runs these; you see results directly without handling the execution loop.
Files API (beta) Upload files up to 500 MB for reuse across multiple API calls (POST /v1/files).
Claude Managed Agents (beta) The Agents API (POST /v1/agents) for reusable, versioned agent configs. The Sessions API (POST /v1/sessions) for stateful agent sessions in managed cloud containers. The Environments API for container templates. The Skills API for custom agent skills.
Extended / Adaptive Thinking Opus 4.7 uses adaptive thinking (thinking: {type: "adaptive"}) with an effort parameter. Manual extended thinking (thinking: {type: "enabled", budget_tokens: N}) is not accepted on Opus 4.7 (returns 400). Sonnet 4.6 supports both; manual mode is deprecated. Display can be "summarized" or "omitted" (faster time-to-first-text-token in streaming).
Service Tiers Standard (default), Priority (committed spend, guaranteed throughput), and Batch (50% discount, asynchronous).
Architecture Decisions That Matter
Authentication Every request needs x-api-key (your Console API key), anthropic-version (e.g., 2023-06-01), and content-type: application/json. Production systems should use Workload Identity Federation (Authorization: Bearer <short-lived-token>) instead of long-lived API keys. API keys are scoped to Workspaces, which you can use to segment spends by use case.
Rate limits Organized into usage tiers that increase automatically with API consumption. Each tier specifies spend limits (maximum monthly cost) and rate limits (RPM and TPM). View current limits in the Console at /settings/limits. The token bucket algorithm governs enforcement. For Priority Tier with committed spend, contact sales.
Streaming vs. synchronous Streaming uses SSE (server-sent events) with a structured event flow: message_start ? content_block_start ? content_block_delta (text_delta, input_json_delta, thinking_delta, signature_delta) ? content_block_stop ? message_delta ? message_stop. Use streaming for user-facing features where time-to-first-token matters. Use synchronous for background processing where reliability matters more and 128K+ output tokens aren’t needed.
Context window management Opus 4.7 and Sonnet 4.6 share a 1M token context window (roughly 750,000 words). Haiku 4.5 has 200K tokens. In long multi-turn conversations, implement sliding windows, summarization, or retrieval-augmented approaches before hitting the limit. The prompt cache system already handles this for you if you use automatic caching: old content reads from cache, only new messages get fresh input token charges.
Stop reasons Check stop_reason on every response. "end_turn" means Claude finished naturally. "tool_use" means Claude wants to call a tool (enter the agentic loop). "max_tokens" means the output was truncated increase max_tokens or reduce the task scope. "stop_sequence" means a custom stop sequence was matched. "pause_turn" is specific to server-executed tools hitting an iteration limit re-send the conversation to continue.
Cost Optimization Patterns
- Cache aggressively. System prompts, tool definitions, and large context documents should always be cached. The 1-hour TTL (2x write cost) is worth it for prompts used less frequently than every 5 minutes but more than hourly.
- Batch non-interactive workloads. Any task where users aren’t waiting for a response document processing, bulk analysis, scheduled reporting, large-scale evaluations should use the Batch API at half price.
- Right-size the model. Don’t default to Opus 4.7 for everything. Classification tasks, content moderation, and simple Q&A can run on Haiku 4.5 at 1/5th the input cost and 1/5th the output cost.
- Set
display: "omitted"on thinking. You pay for full thinking tokens regardless of display mode, but omitted display skips streaming thinking tokens, reducing time-to-first-text-token without changing the cost. - Use
max_tokens: 0for cache pre-warming. Fire a request with your system prompt andmax_tokens: 0before users arrive. The cache gets written; the first real user request reads it. No output tokens to bill. - Monitor
usagein every response. Theinput_tokens,cache_read_input_tokens,cache_creation_input_tokens, andoutput_tokensfields tell you exactly where your money is going.
FAQ
What’s the difference between the Claude API and cloud platform access (Bedrock, Vertex AI, Microsoft Foundry)?
The Claude API gives you direct access to the latest models and features with Anthropic billing. Cloud platforms integrate with your provider’s IAM and billing but may lag on feature availability and model releases. Claude Managed Agents is currently only available on the direct API and Claude Platform on AWS. If you have existing cloud commitments, check each platform’s feature page before committing.
How do I use Claude Code without paying for the API separately?
Claude Code is included with a Claude Pro subscription ($17/month billed annually, $20/month monthly). No separate API key is needed. The Pro plan also gives you Claude Cowork, Claude for Microsoft 365, Claude for Microsoft Outlook, web search, extended thinking, and unlimited projects.
Does prompt caching work inside batch requests?
Yes, and the discounts stack. A batch request that reads from cache pays 50% of the cache-read rate. However, since batch requests are processed asynchronously and concurrently, cache hits are best-effort (observed rates range from 30% to 98%). Use the 1-hour cache TTL inside batches for better hit rates.
Can I fine-tune Claude models?
Anthropic does not offer fine-tuning for Claude 4 models. Prompt engineering, system prompts, prompt caching, and tool use are the recommended customization paths. If you need behavior that prompt engineering cannot achieve, contact Anthropic sales to discuss options.
Which SDK should I use?
All eight official SDKs support the same API surface. Pick the language your team already uses. The Python and TypeScript SDKs are the most mature and get feature updates first, but the Java, Go, C#, Ruby, and PHP SDKs have reached parity for core features. The CLI (ant) is excellent for scripting and CI/CD.
How do I handle errors?
Non-streaming errors return standard HTTP status codes with JSON error bodies ({"type": "error", "error": {"type": "overloaded_error", "message": "..."}}). Streaming errors arrive as SSE error events. Distinguish between retryable errors (529 overloaded, 5xx server errors use exponential backoff) and non-retryable errors (400 validation, 401 authentication, 403 permission, 404 not found, 413 request too large fix the request, don’t retry).
Sources
- Anthropic API Overview Complete API reference, SDKs, and authentication.
- Claude Pricing Page Real-time pricing for all models, service tiers, and features.
- Claude Models Overview Model comparison table, capabilities, context windows, knowledge cutoffs.
- Prompt Caching Guide Automatic/explicit caching, pricing, 1-hour TTL, cache pre-warming.
- Batch Processing Guide Batches API, pricing table, supported features, result retrieval.
- Extended Thinking Guide Adaptive vs. manual thinking, display modes, streaming thinking.
- Tool Use Documentation Architecture, client vs. server tools, agentic loop.
- Claude Code Documentation Installation, features, MCP, hooks, skills, Agent SDK.
- Client SDKs Reference Installation, quickstart, language-specific documentation.
- Anthropic Console Workbench, API keys, workspaces, usage monitoring.