An AI agent debugger that captures every decision, tool call, and LLM interaction as a queryable timeline — so you can debug agents like distributed systems, not black boxes.
When you ask an AI agent to do something — book a flight, write code, analyze data — a lot happens behind the scenes. The agent makes decisions, calls tools, talks to LLMs, and occasionally refuses unsafe actions.
When something goes wrong — the agent loops forever, picks the wrong tool, or gives a nonsensical answer — you're stuck. Traditional debugging tools can't see inside the agent's reasoning.
OpenTelemetry sees infrastructure metrics, not reasoning chains. LangSmith is SaaS-first with no local option. Sentry captures errors, not the reasoning that led to them. Peaky Peek is agent-decision-aware and local-first by default.
Peaky Peek wraps your agent in a context manager that captures every action as an event. These events form a timeline you can search, replay, and analyze.
Here's the core idea in code — three lines to start debugging any agent:
from agent_debugger_sdk import TraceContext, init

init()

async with TraceContext(agent_name="my_agent") as ctx:
    await ctx.record_decision(
        reasoning="User asked for weather data",
        confidence=0.85,
        chosen_action="call_weather_tool",
    )
    await ctx.record_tool_call("weather_api", {"location": "Seattle"})
Pull in the two things we need: TraceContext (the recorder) and init (the setup).
Initialize Peaky Peek. This sets up the connection to the local server.
Open the recording session — like pressing "record" on a camera.
Log a decision: the agent thought about weather data, was 85% confident, and chose to call the weather tool.
Log a tool call: the agent called the weather_api with "Seattle" as the location.
See every reasoning step as an interactive tree. Click nodes to inspect why the agent chose each action.
Time-travel through agent execution. Play, pause, step, and seek to any point in the trace.
Find specific events across all sessions. Search by keyword, filter by event type.
Full safety trail: policy → tool guard → block → violation → refusal.
Three layers work together: the SDK captures events, the API stores and serves them, and the frontend lets you explore.
The recording equipment. You add two lines to your agent code, and the SDK captures every decision, tool call, LLM interaction, and checkpoint.
The control room. Receives events from the SDK, stores them in SQLite, serves them via REST and real-time streaming, and handles search and replay.
The viewing gallery. Decision trees, timeline replays, tool inspectors, safety audit trails, and live monitoring — all rendered in your browser.
The three layers communicate through two channels: REST API for on-demand data, and SSE (Server-Sent Events) for live streaming.
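The SSE channel delivers events as newline-delimited frames on the wire. Here's a minimal sketch of parsing that format — the JSON payload shape shown is illustrative, not the server's exact schema:

```python
import json

def parse_sse_lines(lines):
    """Parse Server-Sent Events wire format into JSON payloads.

    Each `data:` line carries a chunk of the payload; a blank line
    terminates the frame.
    """
    buffer = []
    for line in lines:
        if line.startswith("data:"):
            buffer.append(line[len("data:"):].strip())
        elif line == "" and buffer:
            yield json.loads("\n".join(buffer))
            buffer = []

# Frames as they might appear on the wire (payload shape is illustrative):
stream = [
    'data: {"event_type": "tool_call", "name": "weather_api"}',
    "",
    'data: {"event_type": "decision", "name": "choose_tool"}',
    "",
]
events = list(parse_sse_lines(stream))
```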
Every agent action becomes an event. Events form a tree, flow through a pipeline, and end up on your screen.
Peaky Peek models agent execution as a hierarchy — like a family tree for your agent's actions:
The container for everything. One user request = one session. It tracks total tokens, cost, tool calls, and whether it succeeded or failed.
The full event timeline for a session. Every event gets a sequence number so you can replay them in order.
Any observable action. Each event has a type (decision, tool_call, error...), a parent ID for nesting, and an importance score.
A state snapshot for time-travel debugging. Restore from a checkpoint to replay the agent from that exact point.
The EventType enum defines 14 event types covering everything that can happen during agent execution. Here are the key ones:
The agent chose an action. Includes reasoning, confidence score, and evidence.
The agent invoked an external tool and got a result back. The call/result pair shows what was asked and what was returned.
What was sent to the AI model and what came back. Includes token counts, cost, and the full conversation.
Something went wrong. Includes the error type, message, and full stack trace.
The agent's safety guardrails in action. Did it pass a check? Refuse an action? Violate a policy?
State snapshots, speaker turns in multi-agent chats, and detected anomalies.
Every event is a dataclass with a shared set of fields. Here's the foundation:
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass(kw_only=True)
class TraceEvent:
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    session_id: str = ""
    parent_id: str | None = None
    event_type: EventType = EventType.AGENT_START  # EventType: the project's event enum
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    name: str = ""
    data: dict[str, Any] = field(default_factory=dict)
    importance: float = 0.5
    upstream_event_ids: list[str] = field(default_factory=list)
This is a data blueprint. Every event automatically gets a unique ID and timestamp when it's created.
Each event belongs to a session (so you can group all events from one agent run together).
parent_id creates the tree structure — a tool_result can point back to the tool_call that triggered it.
event_type tells us what kind of thing happened — a decision, tool call, error, etc.
data is a catch-all dictionary for event-specific information (different for each event type).
importance is a score from 0-1 that determines how prominently this event shows up in the UI.
The parent_id field is what makes events form a tree. A tool_result's parent is the tool_call. A decision that led to a tool call becomes its grandparent. This is how the Decision Tree visualization knows what connects to what.
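The parent-child linking above can be sketched with a simple index — a minimal illustration using plain dicts in place of TraceEvent instances (the `id`/`parent_id` field names match the dataclass; everything else is illustrative):

```python
from collections import defaultdict

def build_tree(events):
    """Index events by parent_id so children can be looked up per node.

    Events with parent_id=None are roots. Assumes each event carries
    `id` and `parent_id` fields, as TraceEvent does.
    """
    children = defaultdict(list)
    for ev in events:
        children[ev["parent_id"]].append(ev["id"])
    return children

events = [
    {"id": "d1", "parent_id": None},    # decision (root)
    {"id": "tc1", "parent_id": "d1"},   # tool_call under the decision
    {"id": "tr1", "parent_id": "tc1"},  # tool_result under the call
]
tree = build_tree(events)
```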
Click through each step to follow an event from creation to your screen.
The central piece of the SDK. TraceContext manages the entire lifecycle of a debugging session.
When you write async with TraceContext(...) as ctx:, three things happen silently. Here's the setup code:
async def __aenter__(self) -> TraceContext:
    _current_session_id.set(self.session_id)
    _current_parent_id.set(None)
    _event_sequence.set(0)
    _current_context.set(self)

    from agent_debugger_sdk.config import get_config
    config = get_config()
    if config.mode == "cloud" and config.api_key:
        from agent_debugger_sdk.transport import HttpTransport
        self._transport = HttpTransport(
            config.endpoint, config.api_key
        )

    start_event = TraceEvent(
        session_id=self.session_id,
        event_type=EventType.AGENT_START,
        name="session_start",
        data={"agent_name": self.session.agent_name},
        importance=0.2,
    )
    await self._emit_event(start_event)
    return self
Set context variables (ContextVars) so any code inside the session can find the active TraceContext.
If we're in "cloud mode" (user provided an API key), create an HTTP transport to send events to a remote server.
Create the very first event — a "session_start" marker — and emit it through the pipeline.
Return the context so the developer can use it with as ctx:
The ContextVar pattern is crucial. Regular global variables would leak between concurrent agent sessions. ContextVars ensure that when two agents run at the same time, each one's events go to the right session.
When TraceContext emits an event, it doesn't just store it locally. It sends the event through a pipeline with three destinations:
persist_event
Save to SQLite so events survive server restarts and can be searched later
EventBuffer.publish
Push to an in-memory buffer that feeds live SSE streams to the frontend
session_update_hook
Update the session's running totals (tokens, cost, tool call count)
This pipeline is connected when the server starts up, using configure_event_pipeline():
configure_event_pipeline(
    buffer,
    persist_event=_services.persist_event,
    persist_checkpoint=_services.persist_checkpoint,
    persist_session_start=_services.persist_session_start,
    persist_session_update=_services.persist_session_update,
)
Wire up the pipeline: events go to the live buffer AND to the database. Checkpoints get their own persistence path. Session start/update each get their own hooks.
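A toy sketch of the fan-out idea (not the library's actual wiring): one emit call drives all three destinations, with plain functions standing in for the SQLite writer, the EventBuffer, and the session-totals hook.

```python
persisted, buffered, totals = [], [], {"tool_calls": 0}

def persist_event(event):          # stand-in for the SQLite write
    persisted.append(event)

def publish(event):                # stand-in for EventBuffer.publish
    buffered.append(event)

def session_update_hook(event):    # stand-in for running-total updates
    if event["event_type"] == "tool_call":
        totals["tool_calls"] += 1

def emit(event):
    # One event, three destinations -- the pipeline's core shape.
    for hook in (persist_event, publish, session_update_hook):
        hook(event)

emit({"event_type": "tool_call", "name": "weather_api"})
emit({"event_type": "decision", "name": "choose_tool"})
```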
Imagine you're playing a video game and you reach a tricky boss fight. You save your game before the fight so if you die, you can reload and try a different strategy. That's exactly what checkpoints do for agent debugging.
You create a checkpoint at any point during execution by calling ctx.create_checkpoint(). It captures the agent's current state — its memory, its position in the task, whatever the developer decides is important.
The developer decides when to create checkpoints. This is intentional — only the developer knows which moments are worth saving. The importance score (0-1) helps the UI prioritize which checkpoints to show first when there are many.
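The mechanics can be sketched with a toy checkpoint store — an illustration of the save/restore idea, not the SDK's implementation:

```python
import copy

class CheckpointStore:
    """Snapshot the agent's state dict so a replay can restore it later."""

    def __init__(self):
        self._snapshots: dict[str, dict] = {}

    def create(self, name: str, state: dict, importance: float = 0.5):
        # Deep-copy so later mutations don't corrupt the snapshot.
        self._snapshots[name] = copy.deepcopy(state)

    def restore(self, name: str) -> dict:
        return copy.deepcopy(self._snapshots[name])

store = CheckpointStore()
state = {"step": 3, "memory": ["found flight", "checked price"]}
store.create("before_booking", state, importance=0.8)

state["step"] = 7                        # agent goes down a bad path...
state = store.restore("before_booking")  # ...and rewinds to the save point
```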
Watch how the system components talk to each other when a checkpoint is created:
Hundreds of events fire during a single session. How does Peaky Peek figure out which ones matter?
Every event gets a score from 0.0 to 1.0. The ImportanceScorer looks at the event type first, then adjusts based on context:
def score(self, event: TraceEvent) -> float:
    base_scores: dict[EventType, float] = {
        EventType.ERROR: 0.9,
        EventType.REFUSAL: 0.85,
        EventType.BEHAVIOR_ALERT: 0.88,
        EventType.POLICY_VIOLATION: 0.92,
        EventType.DECISION: 0.7,
        EventType.SAFETY_CHECK: 0.75,
        EventType.CHECKPOINT: 0.6,
        EventType.TOOL_CALL: 0.4,
        EventType.LLM_REQUEST: 0.3,
        EventType.AGENT_START: 0.2,
    }
    score = base_scores.get(event.event_type, 0.3)
    if event.event_type == EventType.DECISION:
        confidence = float(_event_value(event, "confidence", 0.5))
        score += 0.3 * abs(0.5 - confidence) * 2
    return min(score, 1.0)
Start with a base score for each event type. Errors and policy violations are inherently important (0.9+). Session starts are routine (0.2).
For decisions, boost the score if the agent was uncertain. A confidence of 0.1 or 0.9 is more interesting than 0.5 — extreme confidence (or extreme uncertainty) is a signal worth investigating.
Cap at 1.0 — no event can be "more than maximally important."
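Plugging numbers into the decision branch makes the boost concrete — a tiny standalone version of the same formula:

```python
def decision_score(confidence: float) -> float:
    # Decision base score is 0.7; the boost rewards distance from 0.5.
    return min(0.7 + 0.3 * abs(0.5 - confidence) * 2, 1.0)

# A very confident decision (0.9) gets 0.7 + 0.3 * 0.4 * 2 = 0.94,
# while a perfectly hedged one (0.5) gets no boost at all.
high = decision_score(0.9)
mid = decision_score(0.5)
```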
The LiveMonitor watches events as they stream in and derives alerts from patterns it detects. Think of it as a security camera with AI — it doesn't just record, it notices when something's off.
The same tool called 3+ times in a row. Detected by checking the last three tool_call events — if they all target the same tool, that's a loop.
The agent bounces between two actions — like A→B→A→B. The detect_oscillation function looks for repeating patterns of length 2-4.
2+ safety events (refusals, violations, failed checks) in the recent window. The agent is being pushed against its limits.
The agent's last two decisions had different chosen_actions. A sudden strategy change mid-execution is worth flagging.
The oscillation detection algorithm is elegantly simple. Here's the core logic:
def detect_oscillation(
    events: list[TraceEvent], window: int = 10
) -> OscillationAlert | None:
    # Work on (event_type, name) pairs from the recent window
    # (reconstructed line -- the original snippet elided how
    # `sequence` is built).
    sequence = [(e.event_type, e.name) for e in events[-window:]]
    for pattern_len in [2, 3, 4]:
        if len(sequence) < pattern_len * 2:
            continue
        pattern = sequence[:pattern_len]
        repeats = 1
        for i in range(pattern_len,
                       len(sequence) - pattern_len + 1,
                       pattern_len):
            if sequence[i:i + pattern_len] == pattern:
                repeats += 1
        if repeats >= 2:
            return OscillationAlert(
                pattern="->".join(p[1] for p in pattern),
                repeat_count=repeats,
                severity=min(1.0, repeats / 3.0),
            )
    return None
Look at the recent events (last 10 by default).
Try to find repeating patterns of length 2, 3, or 4 events.
If a pattern repeats at least twice (e.g., search→read→search→read), that's oscillation.
Severity scales with how many times it repeated — 2 repeats is notable, 3+ is alarming.
The live monitor produces two kinds of alerts. Captured alerts come from events the SDK explicitly recorded (like behavior_alert events). Derived alerts are detected by the monitor itself — like a tool loop that no one explicitly flagged, but the pattern is obvious in hindsight.
What if you could debug an agent without changing a single line of its code?
Most AI agents are built on LLM client libraries — the OpenAI SDK, the Anthropic SDK, or frameworks like LangChain. These libraries all have a central function where LLM calls happen.
Peaky Peek's auto-patch system monkey-patches these central functions to wrap them with tracing. The original function still runs — the patch just adds event recording around it.
Each LLM library gets its own adapter that knows how to extract tracing data from that library's specific response format. The OpenAI adapter knows how to read OpenAI's response object, the Anthropic adapter knows Anthropic's format, and so on.
Here's the shared implementation that all adapters use. When an agent makes an LLM call, this wrapper fires instead of the original function:
def _call_sync(self, original, self_client, *args, **kwargs):
    session_id = get_or_create_session(
        self._transport, self._config.agent_name, self.name
    )
    request_id = self._emit_request(kwargs, session_id)
    response = None  # keeps the finally block safe if the call raises
    start = time.perf_counter()
    try:
        response = original(self_client, *args, **kwargs)
        return response
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        self._emit_response(
            response, request_id, session_id, duration_ms
        )
Get or create a session for this agent so we know which session to attribute events to.
Emit a "request" event BEFORE the call — capture what the agent is asking the LLM.
Start a timer, then call the REAL original function. The finally block ensures we always measure time, even if the call crashes.
Emit a "response" event AFTER the call — capture what the LLM said back and how long it took.
Return the original response unchanged — the agent has no idea we were recording.
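The patching technique itself is plain Python. Here's a self-contained sketch of wrap, record, and restore — with a fake client standing in for a real SDK, so none of the names below belong to Peaky Peek or any LLM library:

```python
import time

class FakeLLMClient:                  # stand-in for a real SDK client
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"

recorded = []

def patch(cls):
    original = cls.complete
    def wrapper(self, prompt):
        start = time.perf_counter()
        try:
            return original(self, prompt)   # the real call still runs
        finally:
            recorded.append({
                "prompt": prompt,
                "duration_ms": (time.perf_counter() - start) * 1000,
            })
    cls.complete = wrapper
    return original                   # keep a handle for unpatching

original = patch(FakeLLMClient)
result = FakeLLMClient().complete("hi")   # behaves as before, now traced
FakeLLMClient.complete = original         # unpatch: restore the original
```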
The PatchRegistry is the control center. It holds all adapters and applies them in one shot:
def apply(self, config, names=None):
    for adapter in self._adapters:
        if names and adapter.name not in names:
            continue
        if not adapter.is_available():
            continue
        try:
            adapter.patch(config)
            self._patched.append(adapter)
        except Exception:
            logger.warning("Failed to patch")
For each registered adapter (OpenAI, Anthropic, etc.)...
If the user specified specific adapters by name, skip the rest.
Check if the library is actually installed — skip if not.
Apply the patch. If it fails, log a warning but keep going — never crash the agent because tracing failed.
Notice the try/except in the apply loop. If patching OpenAI fails for some reason, Peaky Peek doesn't crash your agent — it just logs a warning and moves on. The debugging tool should never break the thing it's debugging.
Design decisions, the transport layer, and how to extend the system.
Data stays on your machine by default. No external telemetry, no cloud dependency. Your agent's prompts and reasoning never leave your disk unless you explicitly configure it.
Every failure path in the SDK is wrapped in try/except. If event persistence fails, the agent keeps running. If transport fails, events buffer locally. The debugger is a passenger, not a driver.
The SDK doesn't import collector modules at the top level. This avoids circular dependencies — you can use the SDK without the server, and the server can use the SDK without the collector.
The HTTP transport retries transient errors (timeouts, 5xx) with exponential backoff: 0.5s, 1s, 2s. Permanent errors (4xx) fail immediately. This prevents cascading failures.
The HttpTransport class is responsible for sending events to the server in cloud mode. It classifies errors into two types:
TransientError
Temporary problems — timeouts, server errors (5xx), network issues. These get retried with exponential backoff (0.5s, 1s, 2s).
PermanentError
Fatal problems — auth failures (401, 403), not found (404). No point retrying — the result won't change.
This matters because agent execution is time-sensitive. If the debugger's transport layer goes into an infinite retry loop, it blocks the agent. The 3-retry cap with backoff ensures fast failure.
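The retry policy can be sketched as follows. The exception names and delays mirror the text, but `send_with_retry` and the injectable `sleep` are illustrative, not the real HttpTransport API:

```python
class TransientError(Exception): ...
class PermanentError(Exception): ...

def send_with_retry(send, sleep=lambda s: None, delays=(0.5, 1.0, 2.0)):
    # Try once, then up to len(delays) retries; None marks "no retries left".
    for delay in (*delays, None):
        try:
            return send()
        except PermanentError:
            raise              # 4xx: retrying won't change the answer
        except TransientError:
            if delay is None:
                raise          # retries exhausted: fail fast
            sleep(delay)       # 0.5s, 1s, 2s exponential backoff

attempts = {"n": 0}
def flaky():
    # Fails twice with a timeout, then succeeds.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("timeout")
    return "ok"

result = send_with_retry(flaky)
```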
Peaky Peek is designed to be extensible at every layer:
Define a new dataclass in core/events/, register it in EVENT_TYPE_REGISTRY, and update the UI types in frontend/src/types/index.ts.
Create a new adapter in auto_patch/adapters/, extend BaseAdapter, implement is_available(), patch(), and unpatch(). Register it in the PatchRegistry.
Create a new file in api/, define Pydantic schemas in schemas.py, and register the router in main.py. Update the frontend client.ts to call it.
Create a new module in collector/, add a detection function, and integrate it into the analysis pipeline. The LiveMonitor already shows how to derive alerts from patterns.
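As a sketch of the adapter extension point, here is a skeleton with the three methods named above. The class shape and the "mylib" target are assumptions for illustration, not the real BaseAdapter API:

```python
import importlib.util

class MyLibAdapter:
    """Hypothetical adapter skeleton -- method names follow the text."""
    name = "mylib"

    def is_available(self) -> bool:
        # Only patch if the target library is actually installed.
        return importlib.util.find_spec("mylib") is not None

    def patch(self, config) -> None:
        # Wrap mylib's central call function with event recording here.
        ...

    def unpatch(self) -> None:
        # Restore the original function here.
        ...

adapter = MyLibAdapter()
available = adapter.is_available()  # False unless "mylib" is installed
```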
You now understand how an AI agent debugger works — from the event pipeline to the intelligence layer. Next time your agent loops forever, you'll know exactly where to look.