# RAG — Retrieval-Augmented Generation
RAG grounds an agent's answers in your own documents instead of the LLM's parametric memory. In Chidori, RAG comes down to two host functions: `memory()` for built-in vector storage, and `tool()` for custom retrieval over your own data source.
## Strategy

- Ingest — chunk documents and store them with `memory("store", ...)` or in your own vector DB.
- Retrieve — at query time, call `memory("search", ...)` or a custom tool to fetch relevant chunks.
- Generate — pass the retrieved chunks into `prompt()` with clear instructions to cite them.
## Option A — built-in memory()

For small-to-medium corpora, Chidori's built-in memory is enough. It supports key-value and semantic search in the same store.
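Both access paths go through the same call. A minimal sketch of the two modes; note that the exact-key lookup op name (`"get"`) is an assumption here, not confirmed API, so check the `memory()` reference for the real name:

```python
# Same store, two access paths.
memory("store", key = "faq:refunds", value = "Refunds are processed within 14 days.")

# Exact-key lookup. The "get" op name is an assumption, not confirmed API.
policy = memory("get", key = "faq:refunds")

# Semantic search over everything stored.
hits = memory("search", query = "how long do refunds take?", top_k = 3)
```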
### Ingest
A one-shot agent that chunks and stores a document:
`agents/ingest.star`

```python
def agent(document, source_id):
    # Chunk naively — 800-char windows stepping by 700 chars, i.e. 100 chars
    # of overlap. Swap in a smarter splitter for production.
    chunks = [document[i:i+800] for i in range(0, len(document), 700)]
    for idx, chunk in enumerate(chunks):
        memory("store",
            key = source_id + ":" + str(idx),
            value = chunk,
            metadata = {"source": source_id, "idx": idx},
        )
    return {"stored_chunks": len(chunks), "source": source_id}
```
### Retrieve + generate

`agents/rag.star`

```python
config(model = "claude-sonnet")

def agent(question, top_k = 5):
    results = memory("search", query = question, top_k = top_k)
    context = "\n\n---\n\n".join([r["value"] for r in results])
    citations = [{"source": r["metadata"]["source"], "idx": r["metadata"]["idx"]} for r in results]
    answer = prompt(
        template("prompts/rag.jinja", question = question, context = context),
        temperature = 0.2,
    )
    return {"answer": answer, "citations": citations}
```

`prompts/rag.jinja`
```jinja
You are a research assistant. Answer the question **only** from the context below.
If the context doesn't contain the answer, say so instead of guessing.

## Context
{{ context }}

## Question
{{ question }}

## Answer
```

## Option B — custom retrieval tool
For production corpora, wrap your own vector DB as a tool. Only the retrieval step changes; the prompt template and generation stay the same.
`tools/retrieve.star`

```python
def retrieve(query, top_k = 5, namespace = "default"):
    """Retrieve the top_k most relevant document chunks for the query."""
    resp = http("POST", "http://vectors.internal:8080/search",
        json = {"query": query, "top_k": top_k, "namespace": namespace},
    )
    return resp["matches"]
```
config(model = "claude-sonnet")
def agent(question, namespace = "default"):
matches = tool("retrieve", query = question, top_k = 5, namespace = namespace)
context = "\n\n---\n\n".join([m["text"] for m in matches])
answer = prompt(
template("prompts/rag.jinja", question = question, context = context),
temperature = 0.2,
)
return {
"answer": answer,
"sources": [{"id": m["id"], "score": m["score"]} for m in matches],
}Letting the LLM retrieve on demand
Instead of retrieving unconditionally, expose the retrieval tool to the LLM and let it decide when (and with what queries) to search:
```python
answer = prompt(
    "Answer this question, searching our knowledge base as needed:\n" + question,
    tools = ["retrieve"],
    max_turns = 4,
)
```

This is especially useful for multi-hop questions where the LLM needs to refine its query based on what it found.
## Tips
- Keep `temperature` low — 0.0–0.3 — when generating from retrieved context; you want faithfulness, not creativity.
- Always store `metadata` alongside each chunk so citations survive retrieval.
- Use `parallel()` for multi-query RAG — if you generate several search queries, fan them out concurrently (see the sketch after this list).
- Replay matters for evals — check a checkpoint of each canonical question into git, then diff new traces to catch regressions when you swap models or retrievers.
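A sketch of the multi-query pattern from the `parallel()` tip, in a hypothetical `agents/rag_multiquery.star`. The exact `parallel()` signature isn't documented here, so this assumes a map-style call, `parallel(fn, items)`, that runs `fn` over each item concurrently and returns the results in order; adjust to the real API.

```python
config(model = "claude-sonnet")

def _search(query):
    return memory("search", query = query, top_k = 3)

def agent(question):
    # Ask the model for a few alternative phrasings of the question.
    queries = prompt(
        "Rewrite this question as 3 short search queries, one per line:\n" + question,
        temperature = 0.7,
    ).splitlines()

    # Fan the searches out concurrently (assumed map-style parallel() signature).
    result_sets = parallel(_search, queries + [question])

    # Merge and deduplicate chunks across the fan-out, keyed by source + index.
    seen = {}
    for results in result_sets:
        for r in results:
            key = r["metadata"]["source"] + ":" + str(r["metadata"]["idx"])
            seen[key] = r

    context = "\n\n---\n\n".join([r["value"] for r in seen.values()])
    return {
        "answer": prompt(
            template("prompts/rag.jinja", question = question, context = context),
            temperature = 0.2,
        ),
    }
```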