Peaky Peek is a local-first agent debugger with replay, failure memory, smart highlights, and drift detection. Think of it as a flight recorder for AI — it captures everything that happens during execution so you can replay, inspect, and understand failures. This course teaches you how it works under the hood.
When an AI agent goes wrong — gives a wrong answer, loops forever, or silently fails — you're usually left guessing. You can see the input and the output, but everything in between is a mystery.
Peaky Peek solves this by recording the event trail: every LLM request, every tool call, every decision the agent made, and every error. It's like having a DVR for your AI's thought process.
The famous "bug" story dates to 1947, when engineers working on the Harvard Mark II found a real moth jammed in a relay and taped it into the logbook with the note "First actual case of bug being found." (The term "bug" was already engineering slang — that was the joke.) Debugging has always been about seeing what's happening — Peaky Peek just makes the invisible visible.
Before Peaky Peek can record anything, it needs a vocabulary — a set of event types that describe every possible thing an AI agent might do. Here's that vocabulary, straight from the codebase:
class EventType(StrEnum):
    AGENT_START = "agent_start"
    AGENT_END = "agent_end"
    TOOL_CALL = "tool_call"
    TOOL_RESULT = "tool_result"
    LLM_REQUEST = "llm_request"
    LLM_RESPONSE = "llm_response"
    DECISION = "decision"
    ERROR = "error"
    CHECKPOINT = "checkpoint"
    SAFETY_CHECK = "safety_check"
    REFUSAL = "refusal"
    POLICY_VIOLATION = "policy_violation"
A dictionary of everything an AI agent can do. Each line is a type of event — a category of "thing that happened."
AGENT_START/END — The agent woke up or went to sleep.
TOOL_CALL/RESULT — The agent used a tool (like a web search) and got a result back.
LLM_REQUEST/RESPONSE — The agent asked the AI brain a question and got an answer.
DECISION — The agent made a choice (e.g., "I'll search instead of calculate").
ERROR — Something broke.
SAFETY_CHECK, REFUSAL, POLICY_VIOLATION — The agent hit a guardrail.
This enum is the foundation of the entire system. Every event that Peaky Peek captures is typed as one of these. Without this shared vocabulary, the SDK wouldn't know what to record, and the frontend wouldn't know how to display it.
Peaky Peek has three main parts that work together like departments in a hospital — each with a distinct role, but all serving the same patient.
The SDK works closest to the patient — your AI agent. It instruments your code to capture events as they happen, then ships them to the server. Think of it as the wearable heart monitor.
The server receives events from the SDK, stores them in a SQLite database, and serves them to the frontend when asked. Built with FastAPI.
A React + TypeScript app that visualizes the trace data. Decision trees, timelines, replay controls — this is where you actually see what happened.
Click "Play All" to watch the three components explain their roles in their own words.
The SDK uses Python dataclasses — lightweight containers that hold event data. No heavy framework needed — just clean, typed data structures.
The server uses FastAPI for lightning-fast request handling and SQLAlchemy to talk to the database. Async everything — no waiting around.
The frontend combines React for the UI structure with D3.js for the decision tree visualization. D3 draws the tree nodes and connections on an SVG canvas.
All data stays local in SQLite. Real-time updates use Server-Sent Events (SSE) — simpler than WebSockets and perfect for one-directional streaming.
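An SSE stream is just text: each event is a "data:" line terminated by a blank line. Here's a minimal sketch of the frame format (the payload shape is illustrative, not the server's actual schema):

```python
import json

def sse_frame(payload: dict) -> str:
    """Format one Server-Sent Events frame: a data: line plus a blank line."""
    return f"data: {json.dumps(payload)}\n\n"

frame = sse_frame({"event_type": "tool_call", "name": "weather_api"})
```

That simplicity is the appeal over WebSockets: no binary framing, no bidirectional protocol — the browser's built-in EventSource just reads lines off a long-lived HTTP response.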
Like a wildlife documentary crew, Peaky Peek offers three levels of observation — from a motion-activated camera to a full production crew following every move.
The simplest way to trace your agent: add one word above your function. That's it. The @trace decorator wraps your agent function and automatically records everything that happens inside.
@trace
async def my_agent(prompt: str) -> str:
    return await llm_call(prompt)
@trace — "Wrap this function with a recording device."
async def my_agent — This is the function being watched. The name "my_agent" becomes the session name.
return await llm_call — The function runs normally. The decorator watches from the outside, recording everything without changing the function's behavior.
Under the hood, the decorator does something clever: it creates a context manager that starts a tracing session when the function begins and flushes all recorded events when the function returns. Here's the real implementation:
def trace(func=None, *, name=None, framework="custom"):
    _ensure_initialized()

    def decorator(fn):
        agent_name = name or fn.__qualname__

        @wraps(fn)
        async def wrapper(*args, **kwargs):
            ctx = TraceContext(agent_name=agent_name,
                               framework=framework)
            async with ctx:
                if ctx._session_start_event:
                    ctx.set_parent(ctx._session_start_event.id)
                return await fn(*args, **kwargs)
        return wrapper

    if func is not None:
        return decorator(func)
    return decorator
_ensure_initialized() — Make sure the SDK is set up (but don't force the user to do it — do it automatically).
agent_name = name or fn.__qualname__ — Use the provided name, or figure out the function's name automatically. "my_agent" → session name is "my_agent".
async with ctx — Start tracing. Everything inside this block gets recorded.
return await fn(*args, **kwargs) — Run the original function and pass through whatever it returns. The tracing is invisible to the function.
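To make the pattern concrete, here's a toy version of the decorator — not the SDK's real TraceContext, just the same start/run/flush shape boiled down to a list of tuples:

```python
import asyncio
from functools import wraps

events = []  # a real SDK would buffer these and flush them to the server

def trace(fn):
    """Toy version of @trace: record session start and end around the call."""
    @wraps(fn)
    async def wrapper(*args, **kwargs):
        events.append(("agent_start", fn.__qualname__))
        try:
            return await fn(*args, **kwargs)
        finally:
            # try/finally mirrors the real context manager's guarantee:
            # agent_end is recorded even if the function raises.
            events.append(("agent_end", fn.__qualname__))
    return wrapper

@trace
async def my_agent(prompt: str) -> str:
    return prompt.upper()

result = asyncio.run(my_agent("hello"))
# → result is "HELLO", and events now holds an agent_start and an agent_end
```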
When you need more control — recording specific decisions, tool calls, and results — use trace_session(). It gives you a context object you can use to manually record events.
async with trace_session("weather_agent") as ctx:
    await ctx.record_decision(
        reasoning="User asked for weather",
        confidence=0.9,
        chosen_action="call_weather_api",
    )
    result = await weather_api("Seattle")
    await ctx.record_tool_result(
        "weather_api", result={"temp": 52}
    )
trace_session("weather_agent") — Start a recording session named "weather_agent".
record_decision — Manually note: "I decided to call the weather API because the user asked about weather. I'm 90% confident this is the right move."
await weather_api — Actually call the tool (the real work happens here).
record_tool_result — Note the result: "The weather API said 52 degrees."
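The same shape can be sketched with contextlib.asynccontextmanager — a toy trace_session whose context object just buffers the events it's asked to record (the class and method bodies here are illustrative, not the SDK's):

```python
import asyncio
from contextlib import asynccontextmanager

class _Ctx:
    """Toy stand-in for the session context: buffers manually recorded events."""
    def __init__(self):
        self.events = []

    async def record_decision(self, **kwargs):
        self.events.append(("decision", kwargs))

    async def record_tool_result(self, tool_name, result):
        self.events.append(("tool_result", {"tool": tool_name, "result": result}))

@asynccontextmanager
async def trace_session(name):
    ctx = _Ctx()
    try:
        yield ctx
    finally:
        pass  # the real SDK flushes ctx.events to the server here

async def demo():
    async with trace_session("weather_agent") as ctx:
        await ctx.record_decision(reasoning="User asked for weather",
                                  confidence=0.9,
                                  chosen_action="call_weather_api")
        await ctx.record_tool_result("weather_api", result={"temp": 52})
    return ctx.events

recorded = asyncio.run(demo())
# → recorded holds a "decision" event followed by a "tool_result" event
```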
The most hands-off approach: set one environment variable and Peaky Peek automatically patches popular AI frameworks at runtime. Zero code changes required.
Set PEAKY_PEEK_AUTO_PATCH=true in the environment before running your agent
Import agent_debugger_sdk.auto_patch — this activates the patches
Peaky Peek silently wraps OpenAI, Anthropic, LangChain, PydanticAI, CrewAI, AutoGen, and LlamaIndex calls
"Auto-patching" works by replacing framework functions with instrumented versions at runtime — like swapping out a regular lightbulb for one with a built-in camera. The agent still works exactly the same, but now every call is being recorded. The original function is preserved and called normally; the patching just adds a thin recording layer on top.
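The lightbulb-with-a-camera idea is ordinary monkey-patching. A self-contained sketch, using a fake client class rather than a real framework:

```python
class FakeClient:
    """Stand-in for a framework client (illustrative, not a real SDK)."""
    def complete(self, prompt):
        return f"echo: {prompt}"

recorded = []

def patch_complete(client_cls):
    original = client_cls.complete          # keep a reference to the original
    def instrumented(self, prompt):
        recorded.append(("llm_request", prompt))
        result = original(self, prompt)     # call through unchanged
        recorded.append(("llm_response", result))
        return result
    client_cls.complete = instrumented      # swap in the recording layer

patch_complete(FakeClient)
out = FakeClient().complete("hi")
# → out is "echo: hi", and recorded holds the request/response pair
```

The key detail is that `original` is captured before the swap, so the instrumented version can always call through to the real behavior.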
You're building a customer support agent that uses LangChain. Your manager says "We need to debug this agent but I don't want developers changing any code." Which approach do you recommend?
Like a package traveling through the postal service — created, labeled, shipped, stored, and finally delivered to the person who needs to read it.
Click "Next Step" to trace a single event from creation in your agent's code to display in the browser.
Events don't just teleport from the SDK to the server. They travel over HTTP — and networks are unreliable. The transport layer handles delivery with automatic retries, like a delivery truck that keeps trying if no one answers the door.
async def _send_with_retry(self, *, method,
                           path, payload, context,
                           on_delivery_failure=None):
    last_error = None
    backoff = INITIAL_BACKOFF_SECONDS
    for attempt in range(MAX_RETRIES + 1):
        try:
            await self._execute_request(
                method=method, path=path,
                payload=payload)
            return
        except Exception as exc:
            last_error, should_retry = \
                self._classify_error(exc)
            if not should_retry:
                break
            if attempt < MAX_RETRIES:
                await asyncio.sleep(backoff)
                backoff *= BACKOFF_MULTIPLIER
    if last_error is not None:
        callback = on_delivery_failure \
            or self._on_delivery_failure
        if callback is not None:
            callback(last_error)
backoff = 0.5 seconds — Start with a short wait between retries.
for attempt in range(4) — Try up to 4 times (initial + 3 retries).
_execute_request — Actually send the HTTP request to the server.
_classify_error — Is this a temporary glitch (retry) or a permanent problem (give up)?
await asyncio.sleep(backoff) — Wait before retrying. Each wait is twice as long as the last (0.5s, 1s, 2s).
callback(last_error) — If all retries fail, notify the developer so they know events are being lost.
Exponential backoff — doubling the wait time between retries (0.5s, 1s, 2s) — is one of the most important patterns in distributed systems. If a server is overwhelmed, hammering it with rapid retries makes things worse. Backing off gives the server breathing room to recover. This pattern appears everywhere: database connections, API calls, even TCP packets on the internet.
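With the constants the text implies (initial 0.5s, multiplier 2, three retries — assumed values), the whole wait schedule falls out of a three-line loop:

```python
INITIAL_BACKOFF_SECONDS = 0.5  # assumed values matching the text's 0.5s, 1s, 2s
BACKOFF_MULTIPLIER = 2
MAX_RETRIES = 3

def backoff_schedule():
    """Waits between the initial attempt and each of the retries."""
    waits, backoff = [], INITIAL_BACKOFF_SECONDS
    for _ in range(MAX_RETRIES):
        waits.append(backoff)
        backoff *= BACKOFF_MULTIPLIER
    return waits

# → backoff_schedule() returns [0.5, 1.0, 2.0]
```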
The Peaky Peek server returns a 500 error (server crashed) when the SDK tries to send an event. The SDK retries with exponential backoff: 0.5s, 1s, 2s. All retries fail. What happens to the event?
The frontend is like NASA's mission control — a dashboard where you can watch your agent's execution in real time, zoom into failures, and replay key moments.
Every node in the decision tree is color-coded so you can spot problems at a glance. Here's the color map from the actual component code:
const NODE_COLORS = {
  trace_root: 'var(--node-default)',
  agent_start: 'var(--node-session)',
  agent_end: 'var(--node-session)',
  llm_request: 'var(--node-llm)',
  llm_response: 'var(--node-llm)',
  tool_call: 'var(--node-tool)',
  tool_result: 'var(--node-tool)',
  decision: 'var(--node-decision)',
  error: 'var(--node-risk)',
  checkpoint: 'var(--node-checkpoint)',
  safety_check: 'var(--node-decision)',
  refusal: 'var(--node-risk)',
  policy_violation: 'var(--node-risk)',
  behavior_alert: 'var(--node-decision)',
}
agent_start/end — Session nodes (one color for the whole session lifecycle).
llm_request/response — AI brain calls (another color, so LLM activity clusters visually).
tool_call/result — Tool usage (distinct from LLM calls).
decision — Agent reasoning points (yet another color for decisions).
error, refusal, policy_violation — All red/risk color. Problems jump out immediately.
Node size also matters — it's based on the event's importance score. A failed tool call will be both red (error color) and large (high importance). A routine agent start will be small and neutrally colored. This dual encoding (color + size) lets you scan a tree with hundreds of nodes and instantly find the trouble spots.
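The actual component does this mapping in D3; here's a language-agnostic sketch of one reasonable linear mapping from importance to radius (the min/max radii are made-up numbers, not the component's):

```python
def node_radius(importance: float, min_r: float = 4, max_r: float = 16) -> float:
    """Scale node radius linearly with the 0-to-1 importance score."""
    return min_r + (max_r - min_r) * importance

# A routine agent_start (importance ~0.2) stays small;
# a failed tool call (~0.9) dominates the tree visually.
small = node_radius(0.2)
large = node_radius(0.9)
```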
The WhyButton is one of the most powerful features in the frontend. One click triggers a server-side analysis that returns a structured diagnosis. Here's how it works:
const handleFetch = async () => {
  setStatus('loading')
  try {
    const result = await getAnalysis(sessionId)
    const explanations =
      result?.analysis?.failure_explanations ?? []
    if (explanations.length === 0) {
      setNoFailures(true)
      return
    }
    setExplanation(explanations[0])
    setStatus('loaded')
  } catch (err) {
    // defaults before classifying the error
    let message = 'Analysis failed.'
    let isNetwork = false
    if (err instanceof TypeError
        && err.message.includes('fetch')) {
      message = 'Network error.'
      isNetwork = true
    } else if (err instanceof Error) {
      const errStr = err.message.toLowerCase()
      if (errStr.includes('404'))
        message = 'Session not found.'
      else if (errStr.includes('500'))
        message = 'Server error.'
    }
    setErrorInfo({ message, isNetwork })
    setStatus('error')
  }
}
setStatus('loading') — Show a spinner so the user knows something is happening.
await getAnalysis(sessionId) — Ask the server to analyze this session's failures.
explanations.length === 0 — No failures found? Tell the user "all clear."
setExplanation(explanations[0]) — Show the top diagnosis with confidence score, symptoms, and evidence.
catch (err) — If the analysis fails, give a helpful error message instead of a cryptic stack trace. Network errors get a "Retry" button.
You're debugging an agent that sometimes gives wrong answers. You open the decision tree and see a large red node surrounded by smaller blue and teal nodes. What does this tell you?
Like the safety systems in a car — seatbelts, airbags, and a check-engine light — Peaky Peek has multiple layers of protection to handle things going wrong.
Not all failures are equal. A momentary network blip deserves a retry; an authentication failure does not. The transport layer classifies every error into two categories:
class TransientError(TransportError):
    """Error resolved by retrying
    (e.g., network timeout, 5xx)."""
    pass

class PermanentError(TransportError):
    """Error NOT resolved by retrying
    (e.g., 4xx auth failure)."""
    pass

def _classify_error(self, exc):
    if isinstance(exc, httpx.TimeoutException):
        return TransientError(...), True
    if isinstance(exc, httpx.NetworkError):
        return TransientError(...), True
    if isinstance(exc, TransientError):
        return exc, True
    if isinstance(exc, PermanentError):
        return exc, False
    return PermanentError(...), False
TransientError — "This might fix itself." Network timeouts, server overloads, temporary hiccups. Worth retrying.
PermanentError — "This won't fix itself." Wrong API key, bad request format, endpoint doesn't exist. Retrying is pointless.
_classify_error — Examines the exception and decides: retry (True) or give up (False). Unknown errors default to permanent — safer to fail fast than loop forever.
Retrying permanent errors is a classic beginner mistake. If the server says "403 Forbidden" (you don't have permission), retrying 100 times won't help — you'll just get "403 Forbidden" 100 times. Always classify errors before deciding whether to retry. This is why the code defaults unknown errors to permanent: an unexpected error is more likely to be a real bug than a transient glitch.
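The status-code half of that rule is easy to sketch on its own: 5xx means the server had a problem (worth retrying), 4xx means the request itself is wrong (retrying is pointless). A minimal standalone version, with illustrative exception classes rather than the SDK's:

```python
class TransportError(Exception):
    pass

class TransientError(TransportError):
    """Might fix itself: server overload, temporary hiccup."""

class PermanentError(TransportError):
    """Won't fix itself: bad credentials, malformed request."""

def classify_status(code: int):
    """Map an HTTP status to (error, should_retry) using the 5xx/4xx rule."""
    if 500 <= code < 600:
        return TransientError(f"HTTP {code}"), True
    return PermanentError(f"HTTP {code}"), False

# → classify_status(503) says retry; classify_status(403) says give up
```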
Not every event is equally interesting. A successful tool call is routine; a failed safety check is a big deal. The ImportanceScorer assigns each event a score from 0 to 1, which drives highlighting, replay filtering, and the decision tree node sizes.
def score(self, event: TraceEvent) -> float:
    base_scores = {
        EventType.ERROR: 0.9,
        EventType.POLICY_VIOLATION: 0.92,
        EventType.BEHAVIOR_ALERT: 0.88,
        EventType.REFUSAL: 0.85,
        EventType.SAFETY_CHECK: 0.75,
        EventType.DECISION: 0.7,
        EventType.CHECKPOINT: 0.6,
        EventType.TOOL_RESULT: 0.5,
        EventType.LLM_RESPONSE: 0.5,
        EventType.TOOL_CALL: 0.4,
        EventType.AGENT_START: 0.2,
    }
    score = base_scores.get(event.event_type, 0.3)
    score += self._score_tool_result(event)
    score += self._score_llm_response(event)
    score += self._score_duration(event)
    return min(score, 1.0)
base_scores — Starting scores for each event type. Errors and policy violations start high (0.9+); routine events start low (0.2).
_score_tool_result — Bonus points if a tool call failed (adds 0.4). Failed tools are almost always worth investigating.
_score_llm_response — Bonus for expensive LLM calls (cost > $0.01). Costly calls that produce bad results deserve attention.
_score_duration — Bonus for slow events (> 1 second). Performance issues are bugs too.
min(score, 1.0) — Cap at 1.0. No event can be "more than 100% important."
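Here's a worked example of the additive scoring, using a few base scores from the snippet above. The failure bonus (0.4) comes from the text; the slow-event bonus (0.2) is an assumed value for illustration:

```python
# Subset of the base scores from the snippet above.
BASE = {"tool_result": 0.5, "agent_start": 0.2, "error": 0.9}

def score(event_type: str, failed: bool = False, slow: bool = False) -> float:
    """Toy version of the additive scorer: base score plus bonuses, capped at 1.0."""
    s = BASE.get(event_type, 0.3)  # unknown types default to 0.3
    if failed:
        s += 0.4   # failed tools are almost always worth investigating
    if slow:
        s += 0.2   # assumed bonus for events slower than 1 second
    return min(s, 1.0)

# A failed tool result: 0.5 base + 0.4 bonus = 0.9 — red AND large in the tree.
# A failed, slow error event would exceed 1.0 and gets capped.
```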
This transport code has a subtle bug that could cause events to be lost. Click the line you think is wrong:
for attempt in range(MAX_RETRIES + 1):
    try:
        await self._execute_request(method=method, path=path, payload=payload)
        return
    except Exception as exc:
        last_error, should_retry = self._classify_error(exc)
        if not should_retry:
            break
        if attempt < MAX_RETRIES:
            await asyncio.sleep(backoff)
            backoff *= BACKOFF_MULTIPLIER
Hint: Think about what happens if _classify_error itself throws an exception. What would last_error be at the end of the loop?
Like a city planner surveying the whole map — now that you've walked every street, let's zoom out and see how the neighborhoods connect.
Every event in Peaky Peek inherits from the same dataclass. Here's the foundation that every single event builds on:
@dataclass(kw_only=True)
class TraceEvent:
    id: str = field(
        default_factory=lambda: str(uuid.uuid4()))
    session_id: str = ""
    parent_id: str | None = None
    event_type: EventType = EventType.AGENT_START
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    name: str = ""
    data: dict[str, Any] = field(
        default_factory=dict)
    metadata: dict[str, Any] = field(
        default_factory=dict)
    importance: float = 0.5
    upstream_event_ids: list[str] = field(
        default_factory=list)
id — A unique fingerprint for this event (auto-generated UUID). No two events share an ID.
session_id — Which agent run does this belong to? Groups all events from a single execution.
parent_id — What event caused this one? This creates the tree structure — events form a family tree, not a flat list.
event_type — What kind of event is this? Uses the EventType enum from Module 1.
timestamp — When did this happen? Auto-set to the current time in UTC.
importance — How interesting is this? (0 to 1). Defaults to 0.5 (neutral). The scorer adjusts this.
upstream_event_ids — What other events influenced this one? Enables causal chain analysis.
The parent_id field is what makes Peaky Peek's event tree possible. Without it, you'd have a flat list of events — like reading a book with all the sentences shuffled. With parent_id, events form a tree structure: the agent starts (root), makes decisions (branches), calls tools (sub-branches), and handles errors (red leaves). This is the same pattern used in distributed tracing systems like Jaeger and Zipkin.
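Reconstructing that family tree from a flat list is a one-pass grouping on parent_id. A sketch with three hand-made events (plain dicts standing in for TraceEvent):

```python
from collections import defaultdict

# A flat event list as it might come back from the server (illustrative data).
events = [
    {"id": "a", "parent_id": None, "name": "agent_start"},
    {"id": "b", "parent_id": "a", "name": "decision"},
    {"id": "c", "parent_id": "b", "name": "tool_call"},
]

# Group child IDs under each parent; roots land under the None key.
children = defaultdict(list)
for e in events:
    children[e["parent_id"]].append(e["id"])

# → children[None] is ["a"]: the agent_start is the root,
#   with the decision and tool_call hanging off it in a chain.
```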
All data stays on your machine in a SQLite file. No accounts, no cloud, no data leaving your computer. Your AI agent's prompts and responses can contain sensitive data — keeping it local isn't just a feature, it's a security requirement.
The SDK, server, and transport are all async. AI agents spend most of their time waiting — for LLM responses, for tool calls, for network requests. Async code handles this waiting efficiently without blocking, so tracing adds near-zero overhead.
@trace for quick wins, trace_session for control, auto-patch for zero-touch. Each level trades control for convenience. This layered design means you can start simple and graduate to more detailed tracing as your needs grow — without rewriting anything.
Everything is an event. Events flow through a pipeline: capture, buffer, score, persist, serve. This makes the system extensible — add a new step to the pipeline (like anomaly detection) without touching the event capture code.
You want to add a new feature to Peaky Peek: "cost alerts" — notify the developer when a single agent session exceeds $1 in LLM costs. Based on everything you've learned, which files would you need to touch?