The Hidden Cost of AI Agents: Tracing Tokens, Tool Calls, and Retries in TypeScript

#ai #agents #agentaichallenge #agentskills

Divyanshu Lohani · Jun 3, 2026 · 16 min read

AI agents do not become expensive all at once.

They become expensive one small decision at a time.

One extra routing call. One confidence check. One retry. One tool failure that triggers another LLM request. One formatting agent that exists because it felt cleaner during the first design.

Individually, these calls look harmless. Together, they can turn a simple support request into a chain of model calls, tool executions, retries, and post-processing steps that nobody can easily explain from logs alone.

That is the problem I wanted to explore with a small TypeScript project: a cost-aware customer support agent that routes incoming requests, validates actions, calls internal tools, and generates a final response.

The goal was not just to build another agent demo. The goal was to answer a more practical engineering question:

Where is the LLM spend actually going?

The Problem With Agent Cost Visibility

Most teams start with simple logs.

code
[10:14:02] RouterAgent: classified request as ORDER_CHANGE
[10:14:03] OrderAgent: fetched order details
[10:14:05] OrderAgent: generated response
[10:14:06] ResponseAgent: formatted final message

This is useful, but only at the surface level.

It tells us that something happened. It does not tell us how expensive the interaction was. It does not show how many LLM calls happened inside each agent. It does not clearly separate tool calls from model calls. It does not show whether retries happened. It does not reveal whether an agent made multiple calls for work that could have been handled by a cheaper model, a cache, or a simple rule.

That becomes a real problem when the system grows.

Consider a typical customer message:

"I need to change my shipping address for order #12345."

On the surface, this looks like a single, narrow intent — change a shipping address. Any developer looking at a product backlog would estimate this as a fast, cheap operation. The reality at runtime is often quite different.

A simple version of the agent flow might look like this:

RouterAgent classifies the request
OrderAgent fetches order details
OrderAgent validates whether the address can be changed
OrderAgent generates a confirmation message
ResponseAgent formats the final response

That looks reasonable until you inspect the runtime behavior.

The router might make one LLM call to classify the query, another call to verify confidence, and a third call if confidence is low. The order agent might use an LLM for validation even when the order state is clearly readable from the database. The response agent might use another LLM just to rewrite a message that could have been generated from a template.

Suddenly, one support request is no longer one AI interaction. It is a small execution graph — and that graph has a cost attached to every edge.

If you cannot see that graph, you cannot optimize it. And if you cannot optimize it during development, those costs compound silently in production.
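To make the idea of a costed execution graph concrete, here is a minimal sketch of a trace tree with a cost rollup. The TraceNode shape and the totalCost and countLlmCalls helpers are illustrative assumptions, not the format of any particular tracing library, and the cost figures are made up for the example.

```typescript
// Hypothetical node shape for illustration; not a real library's trace format.
type TraceNode = {
  name: string;
  kind: "agent" | "llm" | "tool";
  ownCost: number; // cost incurred by this node itself, excluding children
  children: TraceNode[];
};

// Total cost of a subtree: the node's own cost plus everything beneath it.
function totalCost(node: TraceNode): number {
  return node.children.reduce((sum, child) => sum + totalCost(child), node.ownCost);
}

// Number of model calls anywhere in a subtree.
function countLlmCalls(node: TraceNode): number {
  const self = node.kind === "llm" ? 1 : 0;
  return node.children.reduce((count, child) => count + countLlmCalls(child), self);
}

// Small helper to build a leaf node for a model call.
const llmCall = (name: string, ownCost: number): TraceNode => ({
  name,
  kind: "llm",
  ownCost,
  children: [],
});

// Made-up costs, shaped like the three-call router described above.
const router: TraceNode = {
  name: "router-agent",
  kind: "agent",
  ownCost: 0,
  children: [
    llmCall("classify-intent", 0.002),
    llmCall("verify-confidence", 0.002),
    llmCall("reclassify-intent", 0.003),
  ],
};

console.log(countLlmCalls(router)); // 3
console.log(totalCost(router).toFixed(4)); // 0.0070
```

Because the rollup is recursive, the same two functions answer the question at any level: for one agent, for one phase, or for the whole request.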

Building a Cost-Aware Agent Workflow

For this project, I built the workflow around three basic ideas:

First, every meaningful agent phase should have a name. Anonymous execution is the enemy of cost visibility. When a step has no identity, there is no way to associate its cost with a design decision.

Second, every LLM call should capture token usage, model name, purpose, and estimated cost — not just a response string. The response content is what the application cares about. The metadata is what engineering should care about.

Third, the full interaction should be inspectable as one execution tree, not scattered across flat logs. A tree preserves the parent-child relationships between agents, tool calls, and model calls. It makes retries visible as siblings, not mysteries.

Here is a simplified wrapper around an LLM client that puts these ideas into practice:

code
import { step } from "agent-inspect";
import { OpenAI } from "openai";

// Price per one million tokens for a given model.
type ModelRate = {
  inputPerMillion: number;
  outputPerMillion: number;
};

// Everything needed to attribute cost and latency to a single LLM call.
type LLMCallMetadata = {
  model: string;
  purpose: string;
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
  estimatedCost: number;
  durationMs: number;
};

const MODEL_RATES: Record<string, ModelRate> = {
  "gpt-4.1-mini": {
    inputPerMillion: 0.4,
    outputPerMillion: 1.6,
  },
  "gpt-4.1": {
    inputPerMillion: 2,
    outputPerMillion: 8,
  },
};

export class CostAwareLLMClient {
  private openai: OpenAI;

  constructor(apiKey: string) {
    this.openai = new OpenAI({ apiKey });
  }

  async chat(params: {
    model: string;
    purpose: string;
    messages: Array<{ role: "system" | "user" | "assistant"; content: string }>;
  }): Promise<{ content: string; metadata: LLMCallMetadata }> {
    // step.llm() makes this call a named node in the execution tree.
    return step.llm(`llm:${params.purpose}`, async () => {
      const startedAt = Date.now();

      const response = await this.openai.chat.completions.create({
        model: params.model,
        messages: params.messages,
      });

      // Usage can be missing on some responses, so default each count to 0.
      const usage = response.usage;
      const promptTokens = usage?.prompt_tokens ?? 0;
      const completionTokens = usage?.completion_tokens ?? 0;
      const totalTokens = usage?.total_tokens ?? 0;

      const metadata: LLMCallMetadata = {
        model: params.model,
        purpose: params.purpose,
        promptTokens,
        completionTokens,
        totalTokens,
        estimatedCost: this.estimateCost(
          params.model,
          promptTokens,
          completionTokens
        ),
        durationMs: Date.now() - startedAt,
      };

      return {
        content: response.choices[0]?.message?.content ?? "",
        metadata,
      };
    });
  }

  private estimateCost(
    model: string,
    promptTokens: number,
    completionTokens: number
  ) {
    const rate = MODEL_RATES[model];

    // Unknown models are reported as zero cost, so keep MODEL_RATES in sync
    // with the models actually in use.
    if (!rate) {
      return 0;
    }

    const inputCost = (promptTokens / 1_000_000) * rate.inputPerMillion;
    const outputCost = (completionTokens / 1_000_000) * rate.outputPerMillion;

    return inputCost + outputCost;
  }
}

Note on pricing: The exact rates in MODEL_RATES should always come from the provider's latest pricing page. Model pricing changes frequently, sometimes without prominent announcements. The important part is the pattern: every LLM call returns both the content and the metadata needed to understand cost. The pricing table is just a configuration concern.
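As a worked example of the arithmetic, the sketch below prices one hypothetical call (1,200 prompt tokens, 300 completion tokens) on both models, using the rates from the MODEL_RATES table above. The token counts and the estimateCallCost name are assumptions for illustration. Unlike the wrapper, this variant returns null for an unknown model so that a missing rate is distinguishable from a free call; that is a design choice, not a requirement.

```typescript
type ModelRate = { inputPerMillion: number; outputPerMillion: number };

// Rates copied from the article's MODEL_RATES table; check them against
// the provider's current pricing before relying on the results.
const RATES: Record<string, ModelRate> = {
  "gpt-4.1-mini": { inputPerMillion: 0.4, outputPerMillion: 1.6 },
  "gpt-4.1": { inputPerMillion: 2, outputPerMillion: 8 },
};

// Hypothetical helper: returns null when no rate is configured for the model.
function estimateCallCost(
  model: string,
  promptTokens: number,
  completionTokens: number
): number | null {
  const rate = RATES[model];
  if (!rate) return null;
  const inputCost = (promptTokens / 1_000_000) * rate.inputPerMillion;
  const outputCost = (completionTokens / 1_000_000) * rate.outputPerMillion;
  return inputCost + outputCost;
}

// Same hypothetical call on both models: 1,200 prompt + 300 completion tokens.
const large = estimateCallCost("gpt-4.1", 1200, 300);
const mini = estimateCallCost("gpt-4.1-mini", 1200, 300);

console.log(large?.toFixed(5)); // 0.00480
console.log(mini?.toFixed(5)); // 0.00096
console.log(estimateCallCost("unknown-model", 1200, 300)); // null
```

At these rates the same call costs five times as much on the larger model, which is why the choice of model per step matters as much as the token count.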

The step.llm() boundary does something more than just wrap the call. It makes the model invocation a named, structured node inside the larger agent execution tree — which means you can later ask, "which step in which agent caused this spend?" and get a real answer.

Wrapping the Agent Run

Next, I wrapped the entire support workflow with inspectRun(). This is the outermost boundary — the container that gives the whole interaction a single, inspectable identity.

code
import { inspectRun, step } from "agent-inspect";

type SupportRequest = {
  customerId: string;
  message: string;
};

export async function handleSupportRequest(
  request: SupportRequest,
  agents: {
    router: RouterAgent;
    order: OrderAgent;
    refund: RefundAgent;
    response: ResponseAgent;
  }
) {
  return inspectRun(
    "support-agent-request",
    async () => {
      const route = await step("route-request", async () => {
        return agents.router.classify(request.message);
      });

      const result = await step(`handle-${route.intent.toLowerCase()}`, async () => {
        if (route.intent === "ORDER_CHANGE") {
          return agents.order.handle(request);
        }

        if (route.intent === "REFUND") {
          return agents.refund.handle(request);
        }

        return {
          type: "GENERAL",
          message: "This request can be handled by the general support flow.",
        };
      });

      return step("prepare-final-response", async () => {
        return agents.response.format(result);
      });
    },
    {
      traceDir: "./.agent-inspect",
    }
  );
}

This structure enforces something that most agent frameworks skip: a single entry point with named phases. Each call to step() is a named boundary. Each step.llm() inside an agent is a named model call. The result is an execution trace that mirrors your architecture — not just a raw sequence of events.

This is where the article's main idea becomes concrete. The code is not just instrumented for debugging failures. It is instrumented to understand cost behavior before failures occur. The trace answers questions like: where did time go? Where did tokens go? Did the expensive work happen in the right phase?
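One way to answer "where did tokens go?" in code is to collect the metadata returned by each chat() call and group it by purpose. The sketch below assumes the caller gathers those records into an array, which the article does not show. The CallRecord type is a subset of LLMCallMetadata, and the summarizeByPurpose helper and sample numbers are hypothetical.

```typescript
// Subset of the LLMCallMetadata fields needed for a per-purpose summary.
type CallRecord = {
  purpose: string;
  totalTokens: number;
  estimatedCost: number;
};

type PurposeSummary = {
  calls: number;
  totalTokens: number;
  estimatedCost: number;
};

// Hypothetical helper: aggregates call count, tokens, and cost per purpose.
function summarizeByPurpose(calls: CallRecord[]): Record<string, PurposeSummary> {
  const summary: Record<string, PurposeSummary> = {};
  for (const call of calls) {
    const entry = summary[call.purpose] ?? { calls: 0, totalTokens: 0, estimatedCost: 0 };
    entry.calls += 1;
    entry.totalTokens += call.totalTokens;
    entry.estimatedCost += call.estimatedCost;
    summary[call.purpose] = entry;
  }
  return summary;
}

// Made-up records standing in for metadata collected across two requests.
const collected: CallRecord[] = [
  { purpose: "classify-intent", totalTokens: 180, estimatedCost: 0.0005 },
  { purpose: "verify-routing-confidence", totalTokens: 200, estimatedCost: 0.0006 },
  { purpose: "classify-intent", totalTokens: 190, estimatedCost: 0.0005 },
];

const byPurpose = summarizeByPurpose(collected);
console.log(byPurpose["classify-intent"].calls); // 2
console.log(byPurpose["classify-intent"].totalTokens); // 370
```

Grouping by purpose rather than by model is deliberate here: the purpose ties the spend back to a design decision, which is what makes it reviewable.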

The RouterAgent Problem

In the first version of the project, the router was doing too much.

code
import { step } from "agent-inspect";

// CostAwareLLMClient is the wrapper class defined earlier in the article.
class RouterAgent {
  constructor(private llm: CostAwareLLMClient) {}

  async classify(message: string) {
    return step("router-agent", async () => {
      // LLM call 1: classify the intent with the high-capability model.
      const classification = await step("classify-intent", async () => {
        const { content, metadata } = await this.llm.chat({
          model: "gpt-4.1",
          purpose: "classify-intent",
          messages: [
            {
              role: "system",
              content:
                "Classify the support request as ORDER_CHANGE, REFUND, SHIPPING, PRODUCT, or GENERAL.",
            },
            {
              role: "user",
              content: message,
            },
          ],
        });

        return {
          intent: content.trim(),
          cost: metadata.estimatedCost,
        };
      });

      // LLM call 2: runs on every request, not only the uncertain ones.
      const confidence = await step("verify-classification-confidence", async () => {
        const { content, metadata } = await this.llm.chat({
          model: "gpt-4.1",
          purpose: "verify-routing-confidence",
          messages: [
            {
              role: "user",
              content: `How confident are you that this request is ${classification.intent}? Return a number between 0 and 1.\n\n${message}`,
            },
          ],
        });

        return {
          // Number() yields NaN for non-numeric text, and NaN < 0.8 is false,
          // so such replies skip reclassification below.
          score: Number(content),
          cost: metadata.estimatedCost,
        };
      });

      // LLM call 3: only when the reported confidence is low.
      if (confidence.score < 0.8) {
        return step("reclassify-with-context", async () => {
          const { content } = await this.llm.chat({
            model: "gpt-4.1",
            purpose: "reclassify-intent",
            messages: [
              {
                role: "system",
                content:
                  "Reclassify the request with additional context. Return only the intent.",
              },
              {
                role: "user",
                content: `Previous intent: ${classification.intent}\nConfidence: ${confidence.score}\nRequest: ${message}`,
              },
            ],
          });

          return {
            intent: content.trim(),
            source: "reclassification",
          };
        });
      }

      return {
        intent: classification.intent,
        source: "initial-classification",
      };
    });
  }
}

This looked reasonable at first. The router was being careful. It classified, then verified, then reclassified when uncertain. This pattern is common in agent design — confidence thresholds feel responsible, like the system is checking its own work.

But careful is not always cheap. And in this case, careful was doing something more problematic: it was front-loading cost onto every request, including the easy ones.

For every request, the router performed classification with a high-capability model, followed by a confidence check with the same model. Whenever the reported score came back below 0.8, it performed reclassification as a third call. That meant the very first step in the system could consume the majority of the per-request budget before any actual business logic ran.

This is the kind of issue that is easy to miss in flat logs. A log line that says RouterAgent: classified request does not show that the router made two or three model calls internally. It hides the retry. It hides the model version. It hides the cost. An execution tree does not.

Inspecting the Trace

After running a few sample support requests, I inspected the local trace using the CLI:

code
npx agent-inspect list --dir ./.agent-inspect

npx agent-inspect view <run-id> --dir ./.agent-inspect

The output made the structural issue immediately visible:

code
support-agent-request                                      [7.8s] ✓
├─ route-request                                           [3.9s] ✓
│  └─ router-agent                                         [3.8s] ✓
│     ├─ classify-intent                                   [1.4s] ✓
│     │  └─ llm:classify-intent                            [1.3s] ✓
│     ├─ verify-classification-confidence                  [1.1s] ✓
│     │  └─ llm:verify-routing-confidence                  [1.0s] ✓
│     └─ reclassify-with-context                           [1.2s] ✓
│        └─ llm:reclassify-intent                          [1.1s] ✓
├─ handle-order_change                                     [2.6s] ✓
│  └─ order-agent                                          [2.5s] ✓
│     ├─ fetch-order                                       [0.2s] ✓
│     ├─ validate-address-change                           [0.8s] ✓
│     │  └─ llm:validate-order-change                      [0.7s] ✓
│     └─ generate-confirmation                             [1.4s] ✓
│        └─ llm:generate-customer-response                 [1.3s] ✓
└─ prepare-final-response                                  [1.1s] ✓
   └─ response-agent                                       [1.0s] ✓
      └─ llm:format-final-response                         [0.9s] ✓

The numbers tell a clear story. Of the 7.8 seconds total, 3.9 seconds, exactly half, were spent before a single line of business logic ran. The router, whose job is to label an intent, took more wall-clock time than the order handling itself (3.9s versus 2.6s) and made three of the run's six model calls.
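The split can be checked with a few lines of arithmetic. This sketch takes the three top-level phase durations from the trace above and prints each phase's share of the 7.8-second run; the shareOfTotal helper is only an illustration, not part of any tool.

```typescript
// Top-level phase durations in seconds, read off the trace above.
const phases = [
  { name: "route-request", seconds: 3.9 },
  { name: "handle-order_change", seconds: 2.6 },
  { name: "prepare-final-response", seconds: 1.1 },
];

// Duration of the root span. It is slightly more than the sum of the
// phases (7.6s), the difference being time spent outside the three steps.
const totalSeconds = 7.8;

function shareOfTotal(seconds: number, total: number): number {
  return seconds / total;
}

for (const phase of phases) {
  const percent = (shareOfTotal(phase.seconds, totalSeconds) * 100).toFixed(1);
  console.log(`${phase.name}: ${percent}%`);
}
// route-request: 50.0%
// handle-order_change: 33.3%
// prepare-final-response: 14.1%
```

Routing and final formatting together account for roughly 64% of the run, while the phase that does the actual order work accounts for about a third.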

The trace also surfaced two secondary observations:

The response-agent was making a model call purely for formatting, a job that did not require a model at all.
The validate-address-change step called an LLM only after fetch-order had run. That is the right order, but it is only visible in the tree. In flat logs, you cannot tell whether deterministic data was available to constrain the model prompt.

The trace changed the conversation. Instead of saying "the support agent is expensive," I could now say: "the router is making three LLM calls on some requests, and the response agent is consuming model tokens for formatting." That is a different kind of engineering discussion. One is a budget complaint. The other is an actionable observation with a root cause attached.

Optimization 1: Use Rules Before Models

The first fix was simple: do not call an LLM when the intent is obvious from the message itself.

Routing is fundamentally a classification problem. For a large enough slice of real-world support requests, the classification is deterministic — a message that contains an order number and the phrase "shipping address" is almost certainly an ORDER_CHANGE. No model needs to make that call.

code
function classifyWithRules(message: string) {
  const normalized = message.toLowerCase();

  if (
    /order\s*#?\d+/i.test(message) &&
    (normalized.includes("shipping address") ||
      normalized.includes("delivery address") ||
      normalized.includes("change address"))
  ) {
    return {
      intent: "ORDER_CHANGE",
      confidence: 0.95,
      source: "rule",
    };
  }

  if (
    normalized.includes("refund") ||
    normalized.includes("money back") ||
    normalized.includes("cancel my order")
  ) {
    return {
      intent: "REFUND",
      confidence: 0.9,
      source: "rule",
    };
  }

  return null;
}
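Whether rules cover "a large enough slice" of traffic is something that can be measured rather than assumed. The sketch below computes the share of sample messages that a rule-based classifier routes without a model call. The ruleHitRate helper, the stand-in classifier, and the sample messages are all hypothetical; in the project, classifyWithRules would be passed in instead of the stand-in.

```typescript
type RuleResult = { intent: string; confidence: number; source: string } | null;

// Share of messages that the classifier routes without falling back to a model.
function ruleHitRate(
  messages: string[],
  classify: (message: string) => RuleResult
): number {
  if (messages.length === 0) return 0;
  const hits = messages.filter((message) => classify(message) !== null).length;
  return hits / messages.length;
}

// Stand-in classifier so the sketch runs on its own.
const standInClassifier = (message: string): RuleResult =>
  /refund|money back/i.test(message)
    ? { intent: "REFUND", confidence: 0.9, source: "rule" }
    : null;

// Made-up sample; a real measurement would use historical support messages.
const sampleMessages = [
  "I want a refund for my last order",
  "Can I get my money back?",
  "Where is my package?",
  "Do you ship to Canada?",
];

console.log(ruleHitRate(sampleMessages, standInClassifier)); // 0.5
```

A hit rate like this only says how often the rules fire, not how often they are right, so it is worth pairing with a spot check of the matched messages.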

Then the router uses rules first and falls back to the LLM only when the message is genuinely ambiguous:

code
// Method on RouterAgent
async classify(message: string) {
  return step("router-agent", async () => {
    const ruleBasedRoute = await step("rule-based-routing", async () => {
      return classifyWithRules(message);
    });

    if (ruleBasedRoute) {
      return ruleBasedRoute;
    }

    return step("llm-routing-fallback", async () => {
      const { content } = await this.llm.chat({
        model: "gpt-4.1-mini",
        purpose: "classify-intent",
        messages: [
          {
            role: "system",
            content:
              "Classify this support request as ORDER_CHANGE, REFUND, SHIPPING, PRODUCT, or GENERAL. Return only the label.",
          },
          {
            role: "user",
            content: message,
          },
        ],
      });

      return {
        intent: content.trim(),
        confidence: 0.75,
        source: "llm",
      };
    });
  });
}

Two things changed here beyond just adding regex checks.

First, the LLM fallback now uses gpt-4.1-mini instead of gpt-4.1. Routing is a short, low-complexity classification task. There is no reason to run a high-capability model on it. If the message is genuinely hard to classify, the mini model will still get it right most of the time — and if it does not, the agent's downstream error handling can manage the rare misroute far more cheaply than running the expensive model on every request.

Second, the confidence-check loop is gone entirely. Instead of asking the model whether it was confident, the architecture now defines confidence at the source: rule-based matches carry explicit confidence scores, and the LLM fallback carries a fixed lower score. Confidence checking as a second LLM call is a design smell — it outsources a system-level decision to the model, at the model's cost.

This is not about replacing AI with rules everywhere. It is about using the right tool at the right point in the workflow. If a deterministic rule can safely route obvious cases, the model should not be charged for that decision.
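One detail worth guarding in the fallback path: the router returns the model's text as the intent, and the orchestrator compares that string against exact labels. A reply with different casing or trailing punctuation would silently fall through to the general flow. The sketch below normalizes the raw reply against the five labels from the prompt. The parseIntent name and the choice to map anything unrecognized to GENERAL are assumptions, not part of the original project.

```typescript
// The five labels named in the routing prompt.
const INTENTS = ["ORDER_CHANGE", "REFUND", "SHIPPING", "PRODUCT", "GENERAL"] as const;
type Intent = (typeof INTENTS)[number];

// Hypothetical helper: normalizes a raw model reply to a known label.
// Assumption: anything unrecognized is routed to GENERAL.
function parseIntent(raw: string): Intent {
  // Uppercase, then drop everything except letters and underscores.
  const normalized = raw.trim().toUpperCase().replace(/[^A-Z_]/g, "");
  const known = (INTENTS as readonly string[]).indexOf(normalized) !== -1;
  return known ? (normalized as Intent) : "GENERAL";
}

console.log(parseIntent(" refund.\n")); // REFUND
console.log(parseIntent("order_change")); // ORDER_CHANGE
console.log(parseIntent("I think this is a refund request")); // GENERAL
```

This keeps the decision deterministic: the model proposes a label, and plain code decides whether that label is acceptable.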

Optimization 2: Make Model Choice a Per-Step Decision

The second improvement was model selection — and the realization that treating model choice as a workflow-level constant is an architectural mistake.

Different steps in an agent workflow have fundamentally different quality requirements. A customer-facing response that explains why an address change was rejected needs care, nuance, and natural language quality. A routing label that picks between five categories needs accuracy, not eloquence. A validation result that checks whether a timestamp is in the past needs neither.

Applying the same model everywhere conflates these requirements. It uses the budget of a high-capability model for work that a cheaper model handles just as well.

The fix is to make model selection explicit and purpose-driven:

code
const MODEL_BY_PURPOSE = {
  "classify-intent": "gpt-4.1-mini",
  "validate-order-change": "gpt-4.1-mini",
  "generate-customer-response": "gpt-4.1",
  "summarize-internal-tool-result": "gpt-4.1-mini",
} as const;

function modelForPurpose(purpose: keyof typeof MODEL_BY_PURPOSE) {
  return MODEL_BY_PURPOSE[purpose];
}

Then each LLM call is tied explicitly to a purpose — and the model follows from the purpose:

code
const { content, metadata } = await this.llm.chat({
  model: modelForPurpose("validate-order-change"),
  purpose: "validate-order-change",
  messages: [
    {
      role: "system",
      content:
        "Validate whether this address change is allowed based on the order state.",
    },
    {
      role: "user",
      content: JSON.stringify({
        order,
        requestedChange,
      }),
    },
  ],
});

This table is also documentation. When a new developer reads MODEL_BY_PURPOSE, they immediately understand which steps are considered high-stakes (warranting a stronger model) and which are treated as routine. That is a design decision that is now visible in the code, not buried in a prompt file or tribal knowledge.

In the trace, this also gives more actionable signal. When a run is expensive, you can distinguish between: the expensive model was used in the right place for the right reason, versus the expensive model was used for a step that did not require it. Without purpose-tagged model selection, those two cases look identical in a cost report.
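With a purpose-to-model table in place, that distinction can be checked mechanically. The sketch below flags recorded calls whose model differs from the one the table assigns to their purpose. The findModelMismatches helper, the CallRecord shape, and the sample records are hypothetical; the table entries are copied from MODEL_BY_PURPOSE above.

```typescript
// Copied from the article's MODEL_BY_PURPOSE table.
const EXPECTED_MODEL: Record<string, string> = {
  "classify-intent": "gpt-4.1-mini",
  "validate-order-change": "gpt-4.1-mini",
  "generate-customer-response": "gpt-4.1",
  "summarize-internal-tool-result": "gpt-4.1-mini",
};

type CallRecord = { purpose: string; model: string; estimatedCost: number };

// Hypothetical helper: returns calls that used a different model than the
// table assigns. Purposes missing from the table are ignored.
function findModelMismatches(calls: CallRecord[]): CallRecord[] {
  return calls.filter((call) => {
    const expected = EXPECTED_MODEL[call.purpose];
    return expected !== undefined && expected !== call.model;
  });
}

// Made-up records: the first call uses the larger model for routing.
const recorded: CallRecord[] = [
  { purpose: "classify-intent", model: "gpt-4.1", estimatedCost: 0.0048 },
  { purpose: "generate-customer-response", model: "gpt-4.1", estimatedCost: 0.0061 },
  { purpose: "validate-order-change", model: "gpt-4.1-mini", estimatedCost: 0.0009 },
];

const mismatches = findModelMismatches(recorded);
console.log(mismatches.map((call) => call.purpose)); // [ 'classify-intent' ]
```

A check like this could run in a test suite against recorded traces, so an accidental model upgrade on a routine step shows up in review rather than on the invoice.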

Optimization 3: Remove the Formatting Agent

The third issue was the ResponseAgent, and it is the most instructive problem because it came from a design philosophy, not a specific technical choice.

The first version of the architecture treated every phase as an agent. That made the design look clean and consistent — every step was an agent, every agent had a name, every agent was wired into the orchestrator the same way. Architecturally, it felt principled.

At runtime, it was wasteful.

The ResponseAgent existed to produce a consistently formatted customer-facing message. But when the actual output of the OrderAgent is already structured — it contains a customer name, an order ID, a new address — formatting that data into a natural-sounding sentence is a template problem, not a model problem.

code
function formatAddressChangeConfirmation(input: {
  customerName: string;
  orderId: string;
  newAddress: string;
}) {
  return `Hi ${input.customerName}, your shipping address for order ${input.orderId} has been updated to ${input.newAddress}.`;
}

The template produces output that is deterministic, consistent, and free. The model call produced output that was variable, slightly inconsistent across runs, and expensive relative to the complexity of the task.

The trap here is aesthetic. Agent-based architectures look elegant when every component is symmetrical. But symmetry at the design level does not translate to efficiency at the runtime level. Some steps in a workflow are genuinely model-shaped — they require language understanding, reasoning over ambiguous inputs, or generating novel text. Others are data-shaped — they take structured inputs and produce structured or templated outputs.

Modeling a data-shaped step as an LLM call is not just wasteful. It is also less reliable — template output is deterministic and testable in a way that model output is not.
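The testability claim is easy to demonstrate. Because the template is a pure function, an exact-match assertion is enough, which is not something a model-generated message can offer. The sketch repeats the template function so that it runs on its own; the customer name, order ID, and address are made-up sample values.

```typescript
function formatAddressChangeConfirmation(input: {
  customerName: string;
  orderId: string;
  newAddress: string;
}): string {
  return `Hi ${input.customerName}, your shipping address for order ${input.orderId} has been updated to ${input.newAddress}.`;
}

// Made-up sample values.
const confirmation = formatAddressChangeConfirmation({
  customerName: "Ada",
  orderId: "12345",
  newAddress: "1 Example Street",
});

const expectedConfirmation =
  "Hi Ada, your shipping address for order 12345 has been updated to 1 Example Street.";

// Same input, same output, every run.
console.log(confirmation === expectedConfirmation); // true
```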

The updated trace reflected the change immediately:

code
support-agent-request                                      [3.2s] ✓
├─ route-request                                           [0.1s] ✓
│  └─ router-agent                                         [0.1s] ✓
│     └─ rule-based-routing                                [0.1s] ✓
├─ handle-order_change                                     [2.8s] ✓
│  └─ order-agent                                          [2.7s] ✓
│     ├─ fetch-order                                       [0.2s] ✓
│     ├─ validate-address-change                           [0.8s] ✓
│     │  └─ llm:validate-order-change                      [0.7s] ✓
│     └─ generate-confirmation                             [1.6s] ✓
│        └─ llm:generate-customer-response                 [1.5s] ✓
└─ prepare-final-response                                  [0.1s] ✓
   └─ template-format-response                             [0.1s] ✓

Total request time dropped from 7.8 seconds to 3.2 seconds — a 59% reduction — without any change to the actual intelligence of the system. The model calls that remained were the ones doing genuinely model-shaped work: generating a natural-language confirmation message that requires understanding context, and validating an order change that requires reasoning about business rules.

The important part is not just that the run became faster or cheaper. The important part is that the trace made the architectural waste visible before it became a production cost problem.

Why Execution Trees Help With Cost Debugging

Cost problems in AI systems are rarely isolated to one line of code.

They usually come from the shape of the workflow — and shape is exactly what flat logs cannot show.

A retry policy may be too aggressive, triggering an LLM call after every tool failure instead of only after structured retries are exhausted. A router may be overthinking simple cases that deterministic logic could handle cheaply. A formatting step may be using a model when a template would produce better-tested output at zero cost. A validation step may be calling the model before fetching enough deterministic data to constrain the prompt — wasting tokens on an under-informed decision. A tool failure may trigger a second model call that masks the real root cause in the logs.
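As one illustration of the first case, the sketch below shows a bounded retry wrapper for a tool call that reports how many attempts it took. The idea is that a flaky tool is retried deterministically, and only when the wrapper finally throws would the caller consider involving a model. The withRetry helper, its options, and the flaky stand-in tool are hypothetical and are not part of agent-inspect.

```typescript
type RetryResult<T> = { value: T; attempts: number };

// Hypothetical helper: retries an async operation up to maxAttempts times
// with exponential backoff, and reports the number of attempts used.
async function withRetry<T>(
  operation: () => Promise<T>,
  options: { maxAttempts: number; baseDelayMs: number }
): Promise<RetryResult<T>> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    try {
      return { value: await operation(), attempts: attempt };
    } catch (error) {
      lastError = error;
      if (attempt < options.maxAttempts) {
        // Backoff doubles each time: baseDelayMs, 2x, 4x, ...
        const delayMs = options.baseDelayMs * 2 ** (attempt - 1);
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // All attempts failed: surface the last error to the caller.
  throw lastError;
}

// Stand-in tool that fails twice, then succeeds.
let fetchCalls = 0;
async function flakyFetchOrder(): Promise<{ orderId: string }> {
  fetchCalls += 1;
  if (fetchCalls < 3) throw new Error("temporary failure");
  return { orderId: "12345" };
}

withRetry(flakyFetchOrder, { maxAttempts: 3, baseDelayMs: 0 }).then((result) => {
  console.log(result.attempts); // 3
});
```

Recording the attempt count alongside the step is what would make a retry visible in a trace instead of being folded into a single log line.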

Each of these is an architectural problem. None of them appear in a cost invoice. None of them are obvious in flat logs. All of them become visible in an execution tree.

The reason is structural. Flat logs record events in sequence. An execution tree records relationships. It shows which LLM call belongs to which agent. It shows whether a tool call happened before or after a model call — and whether that ordering was intentional. It shows which branches ran in parallel and which were sequential. It shows where retries happened and what triggered them. It shows whether the expensive call was part of routing, validation, generation, or cleanup.

That structure matters because cost optimization is not only a token problem. Reducing prompt length by 15% is a micro-optimization. Eliminating an unnecessary routing loop is a structural optimization. The execution tree is how you find structural optimizations.

Local-First Tracing Fits the Development Loop

There are strong hosted observability platforms for production AI systems — LangSmith, Arize, Helicone, and others — and they are valuable once a team needs dashboards, evaluations, collaboration, alerting, and long-term analysis across thousands of runs.

But during development, the requirements are different. The developer is not trying to monitor a fleet of agents. They are trying to understand one run. They changed the routing logic. They want to know whether that change made the call cheaper or more expensive. They want to see the trace, compare it to the previous run, and keep iterating.

A hosted observability dashboard introduces friction into that loop. There is authentication, a data pipeline, a UI to navigate. Those things are appropriate in production. During design iteration, they slow down the feedback cycle without adding proportional value.

That is where a local-first tool like agent-inspect fits naturally. It writes traces to a local directory. It surfaces them through a terminal command. It supports named boundaries — inspectRun(), step(), step.llm(), step.tool() — that map directly onto the agent architecture. The result is a readable execution tree that makes agent behavior easier to reason about without leaving the development environment.

The comparison to think about is code coverage tooling. A developer does not send code to a hosted service to see which lines were executed. They run the tests locally and inspect the output. The same principle applies to agent cost visibility during design iteration.

For this project, local-first tracing was enough to answer the practical question: which parts of the agent are actually costing money? That question has to be answered before any optimization is worth doing. Getting it wrong means spending engineering time on the wrong problems.

Practical Lessons

Instrument before optimizing. Without execution visibility, it is easy to guess wrong about where the cost is coming from. You may spend time reducing prompt length while the real issue is an unnecessary routing loop that adds two model calls before the prompt is even constructed. The trace tells you where to look. Guessing does not.

Treat model selection as an architectural decision. A single agent workflow can and should use different models for classification, validation, summarization, and final response generation. The right model for a customer-facing message is not the right model for a routing label. Making this explicit in a purpose-to-model mapping turns a runtime cost driver into a visible, reviewable configuration.

Use deterministic logic where it is safe. Rules, templates, caches, and structured tool outputs can remove unnecessary model calls without making the system less intelligent. The key question for each step is: does this step require language understanding and reasoning, or does it require deterministic computation over structured data? The answer should determine whether a model is involved at all.

Traces are useful beyond failures. Most observability tooling is set up to catch errors. But execution traces are equally useful for understanding cost, latency, retry behavior, and design complexity — even in runs that succeed. A successful run that costs three times what it should is still a problem. The trace shows it. The success status does not.

Keep local debugging simple during design iteration. Not every development workflow needs a production observability stack on day one. The cost of integrating hosted tooling early is overhead at exactly the time when iteration speed matters most. A local trace and a terminal view are often enough to improve the design significantly before the system is ready to be monitored at scale.

Final Thoughts

AI agents are not expensive only because models are expensive.

They are expensive because agent workflows multiply model calls. A single user request passes through a router, a validator, a tool handler, a formatter, and possibly a retry handler — and each of those phases may invoke an LLM, sometimes more than once. The cost is not on the label. It is inside the orchestration.

This multiplication is not inherently bad. Some agent workflows genuinely require multiple model calls to do their job well. The problem is when the multiplication is accidental — when model calls accumulate through design decisions that nobody revisited after the first prototype, through retry policies that are too broad, through formatting steps that felt convenient, through confidence checks that nobody questioned.

Execution-level visibility is what turns that accidental complexity into something that can be reasoned about. When you can see the run as a tree, you can stop treating cost as a monthly invoice problem and start treating it as an engineering feedback loop. You can point to a specific step, ask whether it should exist, ask whether it should use the model it is using, and make a decision based on evidence rather than assumption.

For TypeScript teams building AI agents, agent-inspect gives a lightweight way to inspect that loop locally. It does not need to replace production observability. It sits earlier in the workflow — during design — where developers are still shaping the architecture and the decisions are cheapest to change.

And in many cases, that is exactly where the biggest cost savings begin: not in prompt tuning or model compression, but in the question of whether a given model call should exist at all.

NPM library: agent-inspect on npm
GitHub repository: rajudandigam/agent-inspect