Tracing with Tael

Chidori emits a full OpenTelemetry trace for every agent run — one parent span per run, one child span per host function call, with model names, token counts, durations, and error status attached as attributes. Point it at Tael and every run becomes queryable from a CLI that returns structured JSON — built for agents like Claude Code to inspect their own telemetry.

Why Tael

  • CLI-first, JSON-native. Every command prints structured JSON by default; tables are opt-in. An agent can run tael query traces --status error --format json and parse the result directly.
  • Single binary. cargo install tael-server — no Docker, no cluster.
  • Zero SDK lock-in. Ingests standard OTLP/gRPC on port 4317 — if you ever want to swap backends, you change one env var.
  • Trace comments. Agents can annotate failing traces with tael comment add — useful for audit trails and collaborative on-call work.
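
Because every command is JSON by default, an agent can consume tael output with nothing but a JSON parser. A minimal Python sketch; the output schema here (a list of objects with trace_id and status fields) is an assumption for illustration, not documented tael output:

```python
import json

def error_trace_ids(cli_output: str) -> list[str]:
    """Parse the JSON an agent would get back from
    `tael query traces --status error --format json`.
    The list-of-objects shape is an assumption for illustration."""
    return [t["trace_id"] for t in json.loads(cli_output) if t.get("status") == "error"]

# In practice cli_output would come from subprocess.run([...]).stdout;
# a canned sample keeps the sketch self-contained.
sample = '[{"trace_id": "a1b2", "status": "error"}, {"trace_id": "c3d4", "status": "ok"}]'
print(error_trace_ids(sample))  # ['a1b2']
```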

Install Tael

cargo install tael-server
cargo install tael-cli

Start the server (OTLP on :4317, REST API on :7701):

tael-server

Wire Chidori up

Two environment variables and every run is instrumented. No code changes.

export OTEL_EXPORTER_OTLP_ENDPOINT=http://127.0.0.1:4317
export OTEL_SERVICE_NAME=my-agent         # optional; defaults to "app-agent"

Run an agent:

chidori run agents/researcher.star --input question="What is Rust?"

Spans arrive in Tael within the batch-export window (a few seconds by default).

What you'll see

For every run Chidori emits:

  • One parent span named agent.run <agent_name> with attributes:
    • agent.name
    • agent.run_id
  • One child span per host function call, named host.<function> (e.g. host.prompt, host.tool, host.http, host.memory). Each child span carries:
    • call.seq — the sequence number shared with Chidori's checkpoint log
    • call.function
    • call.duration_ms
    • gen_ai.request.model — for prompt calls
    • gen_ai.usage.input_tokens / gen_ai.usage.output_tokens — for prompt calls
    • OTEL Status::Error(message) when the host function raised
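
Since token usage rides on the spans as gen_ai.* attributes, per-run token accounting becomes a one-pass fold over a trace's span list. A Python sketch; the span dicts below model an assumed JSON shape, but the attribute keys are the ones listed above:

```python
# Sum token usage across the prompt spans of one run. Non-prompt spans simply
# lack the gen_ai.* keys, so .get(..., 0) skips them.
spans = [
    {"name": "host.prompt",
     "attributes": {"gen_ai.usage.input_tokens": 320, "gen_ai.usage.output_tokens": 85}},
    {"name": "host.tool", "attributes": {}},
    {"name": "host.prompt",
     "attributes": {"gen_ai.usage.input_tokens": 410, "gen_ai.usage.output_tokens": 120}},
]

total_in = sum(s["attributes"].get("gen_ai.usage.input_tokens", 0) for s in spans)
total_out = sum(s["attributes"].get("gen_ai.usage.output_tokens", 0) for s in spans)
print(total_in, total_out)  # 730 205
```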

Span kinds follow the OTEL semantic conventions: prompt, http, tool, agent, exec, and memory are CLIENT; everything else is INTERNAL.
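
That rule is small enough to express directly, which is handy when post-processing exported spans. A hypothetical helper; the INTERNAL fallback for non-host spans is an assumption, since the parent span's kind isn't stated here:

```python
# CLIENT host functions per the convention above; everything else is INTERNAL.
CLIENT_FUNCTIONS = {"prompt", "http", "tool", "agent", "exec", "memory"}

def span_kind(span_name: str) -> str:
    """Map a span name like 'host.prompt' to its OTEL span kind."""
    if span_name.startswith("host."):
        fn = span_name.split(".", 1)[1]
        return "CLIENT" if fn in CLIENT_FUNCTIONS else "INTERNAL"
    return "INTERNAL"  # assumption: non-host spans (e.g. agent.run) treated as INTERNAL

print(span_kind("host.prompt"))  # CLIENT
```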

Querying

# All recent traces
tael query traces --last 1h --format table

# Everything that errored in the last 10 minutes
tael query traces --status error --last 10m

# Slow prompt calls
tael query traces --operation host.prompt --min-duration 500ms

# Pull the full span hierarchy for one run
tael get trace <trace-id>

# Aggregate health over a window: service-level error rate, top ops, log/metric volume
tael summarize --last 15m
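
If you want a rollup like tael summarize but computed over raw query output, the core is a short fold. A sketch; the span records model an assumed JSON shape, not documented tael output:

```python
from collections import Counter

# Hypothetical span records pulled from a query; field names are assumptions.
spans = [
    {"name": "host.prompt", "status": "ok"},
    {"name": "host.prompt", "status": "error"},
    {"name": "host.tool", "status": "ok"},
    {"name": "host.prompt", "status": "ok"},
]

ops = Counter(s["name"] for s in spans)          # operation volume
errors = sum(1 for s in spans if s["status"] == "error")
print(ops.most_common(1), f"error rate {errors / len(spans):.0%}")
# [('host.prompt', 3)] error rate 25%
```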

Correlating with the Chidori call log

The call.seq attribute on each OTEL span is the same sequence number Chidori writes into the checkpoint JSON for that call. So a production trace in Tael can be linked to a local checkpoint file by matching agent.run_id and call.seq:

# In Tael: find the failing run
tael query traces --status error --last 1h --format json | jq '.[0].trace_id'

# Get the full span hierarchy and attributes
tael get trace <trace-id> | jq '.spans[] | {seq: .attributes["call.seq"], fn: .name, err: .status.message}'

# Locally: replay the exact same run from its checkpoint — zero LLM cost
chidori resume agents/researcher.star <run_id>

The checkpoint has the deterministic inputs and the full call log; Tael has the wall-clock performance and cross-service context. Use both.
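
The join itself is mechanical: match checkpoint call-log entries to Tael spans on call.seq. A Python sketch; both record shapes here are assumptions for illustration, not the real checkpoint or span JSON:

```python
# Hypothetical checkpoint call log (local) and span list (from Tael).
checkpoint_calls = [
    {"seq": 1, "function": "prompt", "result": "ok"},
    {"seq": 2, "function": "tool", "result": "TimeoutError"},
]
spans = [
    {"name": "host.prompt", "attributes": {"call.seq": 1}, "duration_ms": 812},
    {"name": "host.tool", "attributes": {"call.seq": 2}, "duration_ms": 5003},
]

# Index spans by sequence number, then annotate each call with its wall-clock cost.
by_seq = {s["attributes"]["call.seq"]: s for s in spans}
joined = [
    {**call, "duration_ms": by_seq[call["seq"]]["duration_ms"]}
    for call in checkpoint_calls
    if call["seq"] in by_seq
]
print(joined[1])  # the slow, failing tool call with its duration attached
```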

Trace comments from an agent

Tael lets agents annotate their own traces. This pairs naturally with Chidori's human-in-the-loop and try_call patterns — when an agent recovers from an error, it can record why for future audits:

def agent(event):
    result = try_call(lambda: tool("flaky_api"))
    if result.error:
        run_id = env("CHIDORI_RUN_ID")   # surfaced automatically in serve mode
        http("POST", "http://localhost:7701/comments", json = {
            "trace_id":  run_id,
            "author":    "my-agent",
            "body":      "Retried after flaky_api timeout; succeeded on fallback.",
        })
        result = prompt("Fallback answer:\n" + event["body"]["question"])
    return {"answer": result}
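
Outside of Starlark, the same comment can be posted with any HTTP client. A Python sketch that builds the request the example above sends; the /comments endpoint and field names come from that example, not from a published API spec:

```python
import json
from urllib import request

def comment_request(trace_id: str, author: str, body: str,
                    base: str = "http://localhost:7701") -> request.Request:
    """Build (but don't send) the POST the agent example issues."""
    payload = json.dumps({"trace_id": trace_id, "author": author, "body": body})
    return request.Request(
        f"{base}/comments",
        data=payload.encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = comment_request("a1b2c3", "my-agent", "Retried after flaky_api timeout.")
print(req.full_url, req.get_method())  # http://localhost:7701/comments POST
```

Calling request.urlopen(req) would actually send it, assuming tael-server is running on :7701.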

Alternatives

Because Chidori emits standard OTLP, you can point OTEL_EXPORTER_OTLP_ENDPOINT at anything OTLP-compatible:

  • Jaeger — http://localhost:4317 after starting the all-in-one image.
  • Grafana Tempo — same endpoint format; visualize in Grafana.
  • OpenTelemetry Collector — fan traces out to multiple backends simultaneously.
  • Honeycomb / Datadog / New Relic — use their OTLP ingest URL with the appropriate auth header via OTEL_EXPORTER_OTLP_HEADERS.

Troubleshooting

  • No spans appearing. Check that tael-server is running and listening on :4317. Set APP_AGENT_OTEL_DEBUG=1 when running Chidori to print OTLP flush/shutdown errors.
  • Spans lagging. The default batch exporter buffers for up to a few seconds. For one-shot CLI runs, Chidori flushes before exiting, so you shouldn't need to wait.
  • Short runs missing trailing spans. If the process exits before the final batch is flushed, spans can be dropped. Chidori calls force_flush + shutdown automatically at exit; if you're still losing spans, confirm the process isn't panicking or aborting before those exit hooks run.
