Case study·2026.03.14·7 min read·#tael

Debugging a flaky research agent with checkpoint replay

A research agent passed a thousand evals and then started failing once a week in production. Here's how we used Tael and Chidori's checkpoint replay to find the bug without spending another token.

[Illustration: a bird in flight trailing afterimages of itself.]

A team running a research agent on Chidori shipped a new prompt on a Wednesday. The eval suite passed — a thousand recorded sessions replayed clean. Production looked fine for two days. Then, on Friday afternoon, the agent started returning empty answers for roughly one in fifty queries. By Monday it was one in twenty. By Tuesday they paged us.

This is a short writeup of how we found it. The interesting part is not the bug itself — the bug is small and dumb, as bugs usually are. The interesting part is that we never reran the agent against a live model. Every step of the investigation was a replay against the recorded session log.

The shape of the failure

First pass through Tael:

$ tael query traces \
    --service research-agent \
    --status error \
    --last 24h \
    --format table

Nothing. The traces weren't marked as errors — the agent was returning {"answer": ""} with a 200. So we filtered on output shape instead:

$ tael query traces \
    --service research-agent \
    --where 'output.answer == ""' \
    --last 24h \
    --format json | tael stats group-by attrs.input.depth

Every empty answer had depth="deep". None of the depth="standard" runs were affected. That's a useful constraint — the agent has a branch that triggers an extra round of follow-up search at depth=deep, and that branch was only added in last Wednesday's release.
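For orientation, the deep branch looks roughly like this. This is a Python-flavored sketch with invented names, not the team's actual Starlark code:

def research(question, depth):
    # Fan out the initial searches (three queries in parallel).
    results = fan_out_search(question)
    if depth == "deep":
        # Added in last Wednesday's release: one extra round of
        # follow-up search, driven by a prompt over the results so far.
        results += follow_up_round(question, results)
    return answer(question, results)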

Pulling a representative session

Tael keeps the session ID on every span, and Chidori keeps the full call log keyed by that ID. We grabbed three:

$ tael query traces \
    --where 'output.answer == "" && attrs.input.depth == "deep"' \
    --last 24h --limit 3 --format json \
    | jq -r '.[].attrs.session_id' \
    | xargs -I{} chidori session export {} --out ./sessions/

Three session files, ~80 KB each, each one a full record of every host call the agent made. No replaying against the live model. No inference cost. We could now reproduce the bug on a laptop, on an airplane, with no API keys configured.

Replaying with traces

First, re-run one of them and confirm the failure reproduces:

$ chidori replay ./sessions/01HXX...json --emit-otlp
$ tael get trace $(tael query traces --last 5m --limit 1 -o id)

Same empty output. Good: the replay is faithful, so the bug is in the agent code or the call log, not in the production environment.

Pulling the trace into Tael's timeline view, the failing run had a clear shape. The agent fanned out three search queries in parallel, got three sets of results back, then called a follow-up prompt that produced... nothing. No error. No retry. Just an empty string.
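The shape, as a rough schematic of the timeline rather than Tael's actual rendering:

research-agent (root span)
├── search #1 ─┐
├── search #2  ├─ parallel fan-out, all three returned results
├── search #3 ─┘
└── prompt follow_up → ""   (no error, no retry)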

The smoking gun

We replayed again with the prompt span expanded:

$ chidori replay ./sessions/01HXX...json \
    --inspect prompt:follow_up

That dumped the rendered prompt as the model saw it. The template was something like:

Given the original question and the search results below,
ask up to 3 follow-up questions to fill gaps.

Original question: {{question}}

Search results:
{% for r in results %}
- {{r.title}}: {{r.snippet}}
{% endfor %}

And the rendered version, on the failing session:

Given the original question and the search results below,
ask up to 3 follow-up questions to fill gaps.

Original question: What's the recent literature on...

Search results:

Empty. The results list was being passed as a list of dicts, each with a title and a snippet, except on the deep-search code path, which was passing a list of strings from a different tool. The Jinja loop bound r to a string, r.title resolved to nothing, and the entire bullet list rendered as blanks. The model got an empty result set and dutifully reported nothing to follow up on. Empty answer. 200 OK.
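The silent part of that is default Jinja behavior: an undefined attribute renders as an empty string rather than raising. A minimal repro of the mechanism using Python's jinja2 (whatever engine Chidori embeds presumably behaves the same way; snippet stands in for any attribute a bare string doesn't have):

from jinja2 import Environment, StrictUndefined, Template

t = Template("- {{ r.snippet }}")
print(t.render(r={"snippet": "a dict renders fine"}))  # "- a dict renders fine"
print(t.render(r="a bare string"))                     # "- " (Undefined renders as "")

# StrictUndefined turns the same mistake into a loud render-time error:
strict = Environment(undefined=StrictUndefined).from_string("- {{ r.snippet }}")
strict.render(r="a bare string")  # raises UndefinedError

If the team's engine supports an equivalent of StrictUndefined, flipping it on would have turned this production failure into a loud exception instead of a plausible-looking empty answer.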

The bug was about thirty lines into the new depth=deep branch — a tool that returned list[str] instead of list[Result]. The eval suite didn't catch it because every recorded session was on the standard depth path, which used the original tool with the original return shape.

The fix, validated against history

Two-line fix in the agent: normalize the deep-search results into the same struct shape before passing them in. The interesting part is what happened next.
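Sketched in Python rather than the agent's actual Starlark, with Result as our stand-in name for the struct the template expects:

from dataclasses import dataclass

@dataclass
class Result:
    title: str
    snippet: str

def normalize(raw: list) -> list[Result]:
    # The deep-search tool returns bare strings; coerce them into the
    # same struct shape the standard tool already produces. Which field
    # a bare string belongs in depends on that tool's output; treating
    # it as the snippet is illustrative.
    return [r if isinstance(r, Result) else Result(title="", snippet=str(r))
            for r in raw]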

We replayed all three failing sessions against the patched agent:

$ for s in ./sessions/*.json; do
    chidori replay "$s" --agent agents/researcher.star
  done | tael ingest --tag patched

All three produced non-empty answers, with the same search results, against the same recorded model outputs. No additional inference. No staging environment. We then ran the existing thousand-session eval suite to make sure the standard-depth path hadn't regressed:

$ chidori eval ./sessions/eval-suite/ --agent agents/researcher.star
1000/1000 passed (replay; 0 LLM calls; 4.2s)

Total time from page to PR: about 35 minutes. Total inference cost of the investigation: zero. We added the three previously broken sessions to the eval suite as new fixtures, so the next time someone changes the prompt template the failure mode is caught before it ships.
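Adding those fixtures was nothing fancier than copying the exported session files into the directory the eval command reads (paths as in the commands above):

$ cp ./sessions/*.json ./sessions/eval-suite/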

What made this easy

Nothing about the agent or the bug was special. The thing that made the investigation tractable was that the production runtime and the laptop debugger were the same runtime, replaying the same call log, producing the same results down to the byte. The fancy part of debugging — “why does this only happen sometimes?” — collapsed into a normal kind of debugging where you have a failing input and a deterministic program and you just read the code until you find the mistake.

That's the bar we want for agents. Boring, deterministic, replayable. The clever part should be the agent. The runtime should be a tool you don't have to think about.