Methodology
How your cost estimate is calculated
The Goldfinch Agent Cost Grader produces pre-deployment cost scenarios based on published provider pricing, agent-type benchmarks, and your architecture choices. This page explains the model in full.
1. What your inputs represent
The grader collects seven categories of input. Each maps directly to a cost variable in the model.
| Input | What it controls in the model |
|---|---|
| Agent type(s) | Selects the token benchmark profile (input and output token ranges per task, conservative to upside). A support agent has a different profile than a code agent. |
| Tasks per month | Scales the per-task cost to a monthly total and annual projection. Ranges are used (not point estimates) to reflect the variability of agent workloads. |
| Model / provider | Sets the per-token pricing from published provider documentation. The model tier is the single largest driver of absolute cost — a 10× difference between tiers is common. |
| Caching policy | Applies a reduction multiplier to input token costs. Partial caching reduces average input billing by ~18%. Aggressive caching by ~45%. Source: Anthropic and OpenAI pricing documentation, April 2026. |
| Tool calls and retrieval rate | Adds an overhead multiplier. Tool calls generate additional input tokens (function definitions, results) and output tokens (call payloads). Overhead ranges from +5% (rarely) to +35% (almost always). |
| Human-in-the-loop rate | Adds a cost premium representing review overhead — latency buffers, retry patterns on rejected completions, and additional passes. Ranges from +3% (low HITL) to +15% (high HITL). |
| Retry configuration | Sets the Agentic Resource Exhaustion risk factor and the ARE variance buffer applied to your budget envelope. No retry controls is the highest-risk configuration. |
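The table above can be sketched as a per-task cost function. The structure (a caching discount on input tokens, then tool-call and HITL multipliers) follows the model described; the function name, the specific prices, and the example token counts are illustrative, not the grader's actual values.

```python
def per_task_cost(input_tokens, output_tokens,
                  input_price_per_mtok, output_price_per_mtok,
                  cache_discount=0.0,   # 0.18 partial, 0.45 aggressive
                  tool_overhead=0.0,    # 0.05 (rarely) .. 0.35 (almost always)
                  hitl_premium=0.0):    # 0.03 (low HITL) .. 0.15 (high HITL)
    """Illustrative per-task cost. Prices are per million tokens."""
    input_cost = input_tokens / 1e6 * input_price_per_mtok * (1 - cache_discount)
    output_cost = output_tokens / 1e6 * output_price_per_mtok
    base = input_cost + output_cost
    return base * (1 + tool_overhead) * (1 + hitl_premium)

# Hypothetical: 20k input / 2k output tokens at $3 / $15 per Mtok,
# partial caching, moderate tool use, low human review.
cost = per_task_cost(20_000, 2_000, 3.0, 15.0,
                     cache_discount=0.18,
                     tool_overhead=0.15,
                     hitl_premium=0.03)
```

Multiplying the result by the tasks-per-month range yields the monthly total and annual projection described above.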
2. Pricing sources
All per-token pricing is sourced directly from provider pricing pages. We do not use aggregator sites, community estimates, or third-party databases. Prices are updated manually when providers publish changes.
Sources — April 2026
Reasoning models carry a 2× output token multiplier, consistent with provider documentation describing how thinking tokens are billed as output tokens.
Batch API pricing is available at approximately 50% of synchronous pricing for OpenAI and Anthropic — the grader surfaces this as a Shift-Left Costing lever where applicable.
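The two pricing adjustments above can be expressed as a small helper. This is a sketch under the stated assumptions: the reasoning multiplier is modeled as a simple 2× on the output price, and batch pricing as a flat 50% of synchronous pricing; the function and its flags are hypothetical, not a provider API.

```python
def effective_prices(input_price, output_price,
                     reasoning=False, batch=False):
    """Apply the two documented adjustments to published per-Mtok prices:
    reasoning models bill thinking tokens as output (modeled as a 2x
    output multiplier), and batch APIs run at roughly 50% of
    synchronous pricing."""
    if reasoning:
        output_price *= 2      # thinking tokens billed as output tokens
    if batch:
        input_price *= 0.5     # ~50% of synchronous pricing
        output_price *= 0.5
    return input_price, output_price
```

Note that the two levers compound: a reasoning model run through a batch API pays the doubled output rate at the halved batch price.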
3. Agent type benchmarks
Each agent type carries a token benchmark profile — a range of input and output token counts per task, from conservative to upside.
| Agent type | Profile rationale |
|---|---|
| Support agent | Customer-facing, moderate context window, structured outputs. Relatively predictable per-task token volume. |
| Document agent | High input token volume — full document ingestion per task. Output is typically structured and constrained. |
| Code agent | High output token volume — code generation produces longer completions. Input includes context (files, instructions, history). |
| Ops agent | Multi-step workflows with tool calls. Moderate per-step token count but higher task counts and retry exposure. |
| Research agent | Highest token volume — iterative retrieval, large context windows, extensive output synthesis. |
When multiple agent types are selected, the model blends the benchmark profiles proportionally. The blended benchmark is shown in the grader before you confirm your estimate.
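The proportional blend can be sketched as a weighted average over each profile's conservative and upside endpoints. The weights, profiles, and example numbers here are hypothetical; only the blending rule reflects the description above.

```python
def blend_profiles(profiles, weights):
    """Blend per-agent-type token benchmark profiles proportionally.
    Each profile is a (conservative, upside) token count per task;
    weights are each type's share of monthly tasks."""
    total = sum(weights)
    return tuple(
        sum(p[i] * w for p, w in zip(profiles, weights)) / total
        for i in range(2)
    )

# Hypothetical: 70% support tasks, 30% code tasks (input-token ranges).
blended = blend_profiles([(4_000, 12_000), (8_000, 30_000)], [0.7, 0.3])
```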
4. The three scenario bands
The grader always produces three cost scenarios: Rainy Day, Baseline, and Blue Sky. A single-point estimate is not produced, because it would misrepresent the uncertainty inherent in pre-deployment agent economics.
The budget envelope is the Baseline monthly cost plus an ARE variance buffer — a percentage added to account for cost unpredictability introduced by your retry and architecture configuration. The buffer ranges from 10% to 35% depending on your ARE risk level.
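The envelope calculation reduces to one line. The 10% and 35% endpoints come from the text above; the intermediate mapping of risk level to buffer percentage is an illustrative assumption, not the grader's published table.

```python
def budget_envelope(baseline_monthly, are_risk):
    """Budget envelope = Baseline monthly cost plus an ARE variance
    buffer. The 10%-35% endpoints are documented; the 'medium' value
    is an assumed midpoint."""
    buffers = {"low": 0.10, "medium": 0.20, "high": 0.35}
    return baseline_monthly * (1 + buffers[are_risk])
```

For example, a $10,000 Baseline at high ARE risk yields a $13,500 budget envelope.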
5. Agentic Resource Exhaustion (ARE) risk score
The ARE risk score reflects exposure to uncontrolled cost escalation from agent behavior. See the full ARE definition for background. The score is determined by retry configuration, tool call rate, and human review rate.
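A classification sketch follows. The grader's actual scoring rules are not published here; this only encodes the stated directionality — no retry controls is the highest-risk configuration, and heavy tool use with little human review pushes risk upward. Thresholds are invented for illustration.

```python
def are_risk_level(retry_controls, tool_call_rate, review_rate):
    """Illustrative ARE risk classification. Rates are fractions of
    tasks (0.0-1.0). Thresholds are assumptions, not the grader's."""
    if not retry_controls:
        return "high"      # documented: highest-risk configuration
    if tool_call_rate > 0.5 and review_rate < 0.1:
        return "medium"    # heavy tool use, little human oversight
    return "low"
```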
6. Shift-Left Costing readiness score
The Shift-Left Costing (SLC) readiness score (0–100) measures how much of your agent's cost profile has been considered and governed at the architecture stage — before deployment.
The score rewards five policy choices: prompt caching adoption, model tier selection discipline, retry and spend controls, output verbosity constraints, and appropriate human review routing.
A higher SLC score correlates with a narrower spread between your Rainy Day and Blue Sky scenarios — meaning your cost model is more predictable and your budget conversations with finance are more defensible.
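As a minimal sketch, the five policy choices can be scored as levers on a 0–1 scale and averaged into the 0–100 readiness score. Equal weighting is an assumption; the grader's actual weights are not published here.

```python
def slc_score(caching, tier_discipline, spend_controls,
              verbosity_limits, review_routing):
    """Illustrative SLC readiness score (0-100). Each argument is a
    0.0-1.0 adoption level for one of the five policy choices above.
    Equal weighting is assumed for illustration."""
    levers = [caching, tier_discipline, spend_controls,
              verbosity_limits, review_routing]
    return round(100 * sum(levers) / len(levers))
```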
7. Known limitations
This model does not account for egress, storage, or infrastructure costs. Pricing for vector databases, object storage, compute, and network egress is not included. For most agent programs, LLM API costs are the dominant variable cost, but at scale, storage and infrastructure costs become material.
Multi-agent coordination overhead is not modeled at the individual agent level. If you are running orchestrated multi-agent pipelines, inter-agent context exchange can multiply tokens per task by 10–50× compared to single-agent execution. The tool call overhead multiplier partially captures this, but orchestration cost is not explicitly modeled.
Pricing changes without notice. Providers revise their pricing with some frequency. We update the grader's pricing data manually when changes are published, but there may be a lag between a provider update and our model reflecting it. The report notes the pricing date.
Behavioral variability is real. Agent token consumption is behavior-driven, not seat-based. Actual costs depend on how users interact with your agents, what tasks are submitted, and how the agent reasons through them. Pre-deployment estimates will diverge from actuals — the goal is a defensible planning range, not a precise forecast.
Score your agent costs before you deploy
Free. No account required. Takes 3 minutes.