# Debugging

Inspect houses, agents, threads, and streams against local or deployed arbe. Most lookups go through the regular surface (`thread`, `http`); `arbe debug` is the small set that bypasses permissions or env resolution. Scripted proofs per surface live in [`tests/README.md`](../../tests/README.md).

## Testing layers

Five ways to verify a change — pick by what you touched. Only the unit layer runs offline.

| Layer | Run | Needs |
|---|---|---|
| Unit / integration | `bun run test` (scope: `--filter '@arbe/<pkg>'`) | nothing |
| HTTP CRUD contract | `bun run tests/http-crud-proof.ts` | dev server + `arbe auth login` |
| Env-bound dispatch e2e | `bun run scripts/remote-dispatch.ts [<env>] [--local]`; manual runbook: [remote-dispatch prompt](../../tests/remote-dispatch-prompt.md) | env + sandbox + a house bot |
| LLM prompt suite | hand an agent [`tests/README.md`](../../tests/README.md) | running stack |
| Browser / UI | `agent-browser` via `arbe-dev` profile (below) | dev server |

For browser proofs, use the checked-in `agent-browser.json`: it selects the `arbe-dev` session and limits navigation to `localhost` / `127.0.0.1`. Dev www is usually `http://localhost:8888`. Password login is collapsed on `/login` — open `http://localhost:8888/login?method=password` before `agent-browser auth login arbe-dev` (or `auth save` with that URL). First-time setup (selectors, shared robot account, failure modes) lives in [`packages/skills/browser-testing/SKILL.md`](../../packages/skills/browser-testing/SKILL.md) → "Auth profile"; `agent-browser auth list` tells you whether the profile exists yet. Keep UI testing on the website; don't fall back to the CLI when the website is the thing under test.

```sh
arbe --local debug env       # resolved URLs + LOCAL/REMOTE scope (--local must precede subcommand)
arbe --local auth whoami     # token validates against the resolved backend
arbe http /api/me            # raw API; --jq filter, --status, -v stderr banner; stdin body for writes
arbe env diagnose <env>      # preflight: can this env dispatch now?
arbe sandbox diagnose <env>  # runtime/topology: sprite health + worker reachability
arbe x -s <sandbox> -- bash -lc 'tail ~/arbe-pi-runner.log; tail ~/pi-run.log'  # detached-run forensics (daytona id by default; --runtime sprite for a sprite name)
arbe thread diagnose <id>    # classify last-dispatch stage; exits 2 on failed, 3 on stalled
arbe thread entries list <id> --follow            # tail raw entries via permission-checked proxy
arbe thread entries create <ref> "<text>"         # thread/agent ref; triggers dispatch
arbe thread entries create <ref> --payload '{...}' # any ArbeThreadPayload variant
arbe thread entries read <id>                     # tail + render pi text; exits on dispatch terminals
durable-stream read arbe-thread-<id> --offset -1 --live   # raw, bypasses proxy
```

Two auth contexts: the user token (`arbe auth login`) for permission-checked calls, and `DURABLE_STREAMS_SECRET` (in `apps/www/.env.local`) for raw stream reads.

- **Footgun:** `arbe debug env` warns when www is `--local` but data (Supabase / Electric / streams) points at prod — local writes hit prod data.
- `localhost` + connection refused → `bun run dev` not running.
- 401 on `auth whoami` → `arbe auth login`.
- No version/commit endpoint on www yet — check the Cloudflare Pages dashboard for deploy SHA.
- **Stale binary:** the installed `arbe` can lag the repo; `arbe --version` prints the baked commit. When you run the compiled binary from inside a checkout whose HEAD differs, the passive update check warns on stderr. To be safe while debugging a fix, run flows from source (`bun apps/cli/src/cli.ts …`) or rebuild (`arbe upgrade` / `bun run --filter '@arbe/cli' build`).

`arbe http` honours `--local`/`APP_URL`, writes the body to stdout and the banner (with `-v`) to stderr — `| jq` and `set -e` work without rituals. On non-2xx, `--jq` is skipped and the raw error body lands on stderr so the failure stays visible. `arbe thread diagnose` checks stream setup, agent presence, and trigger modes; dispatch failures land on the thread as `signal.dispatch.failed`. For live worker logs, `wrangler tail` on `apps/www` shows bracketed `[dispatch.gate]` lines (gate verdicts; gate cost now lands in `usage_events`, not the log) and `[dispatch.turn]` lines (per bot turn: `tools=` advertised, `rounds=` tool-loop iterations, `calls=` handlers run). Stream names are `arbe-thread-<thread-id>`. Same `offset` / `limit` / `live` semantics work through the app proxy: `GET /api/threads/<id>/stream?offset=-1&live=1`.

Where to look: [observability](./observability.md) maps the four layers (run state, usage, lifecycle, CF logs) to the questions they answer. Money lives in `usage_events` + PostHog ([analytics → usage](./analytics.md#usage--money)).

Checking spend — both sinks, joined by `trace_id`:

```sh
# the ledger (run from packages/)
bunx supabase db query --linked "select created_at, seam, key_source, cost_usd from usage_events order by created_at desc limit 20" -o table

# PostHog (needs query:read on your personal API key; \$ stops the shell from eating $ai_*)
posthog-cli exp query run "select event, properties.\$ai_total_cost_usd from events where event = '\$ai_generation' order by timestamp desc limit 10"
```

If a bot replied but neither sink has a `gate`/`reply` row, the server is running old code — restart it. If PostHog has the event but Postgres doesn't, the insert failed (usually RLS: `usage_events` only accepts the service-role client).

**Common flows.** *Something broken — app or stream service?* `arbe debug env` → `arbe http /api/me` (fails: www down or token stale) → write a chat entry on a known thread; if the write succeeds but the entry doesn't appear, the stream service is the suspect; `curl -s -o /dev/null -w "%{http_code}" $DURABLE_STREAMS_URL` returning `404`/`405` = up. *Local dev hangs.* HMR feels like dial-up, network shows `live=true` requests pending forever — HTTP/1.1 connection cap from Electric long-polls; run the Caddy h2 proxy and hit `https://localhost:8443` ([system/development](./development.md)). *DB migrations / verify scripts.* See [system/supabase](./supabase.md).
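The stream-service probe in the first flow can be wrapped so the verdict is explicit. A minimal sketch: `classify_stream_probe` is a hypothetical helper, not part of `arbe`, and the `down` / `unknown` branches are assumptions (the note above only states that `404`/`405` means up).

```sh
# Hypothetical helper: turn the probe's HTTP status into a verdict.
classify_stream_probe() {
  case "$1" in
    404|405) echo up ;;       # the service answered, which the flow above reads as "up"
    000)     echo down ;;     # curl prints 000 when it never received an HTTP response
    *)       echo unknown ;;  # assumption: anything else needs a manual look
  esac
}

# Usage (needs DURABLE_STREAMS_URL in the environment):
#   classify_stream_probe "$(curl -s -o /dev/null -w '%{http_code}' "$DURABLE_STREAMS_URL")"
```

Keeping the classification separate from the `curl` call means the same function works on a status code pasted from a log.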

**Website perf.** Reach for `apps/www/scripts/profile-routes.ts` first — `bun run scripts/profile-routes.ts` (from `apps/www`) drives `agent-browser` through a list of routes, captures `PerformanceResourceTiming` + the CDP network log, and prints a markdown table of FCP / settle-time / slowest `/api/` calls per route. It auto-logs-in via the `arbe-dev` profile, so make sure that profile exists (`agent-browser auth list`) and `bun run dev` is up on `:8888`. Defaults to `/houses /account /account/tokens`; pass `/path` args to override, `--json` for machine output, `--no-login` to skip the auth step. Reach for `agent-browser vitals --json` for a one-off Core Web Vitals reading on the current page, and `agent-browser trace start|stop` or `profiler start|stop` when you need a DevTools-grade timeline. The harness deliberately runs against the dev server — prod numbers are different (no vite module graph, edge cache, real network) but the bottleneck order is usually preserved.

Bot replies run in-process from `packages/core/dispatch/`, fired by `POST /api/threads/:id/entries` through `waitUntil`. Tail dispatch via `wrangler tail` on `apps/www`. Tool calls surface two ways: as `pi.tool_result` entries on the thread (`arbe thread entries list <id>` — `--json` shows `payload.message.toolName` / `content` / `isError`) and as `[dispatch.turn] … tools= rounds= calls=` log lines. How the loop works and how to add a tool: [dispatch](./dispatch.md#tool-calling).

## Local dispatch (in-process) — reading a rally

The thread stream IS the trail (post-arbe-70f5). A healthy rally reads `chat → signal.dispatch.started → pi.assistant` per firing bot, recursing on each reply, ending in `signal.dispatch.skipped reason=debounced` (the last reply re-triggered nobody) + `completed`. `list` renders pi text; `--json` gives raw `[{type,text}]`. A bot that never replies when its trigger was plainly met = broken dispatch; the `skipped` `reason` is the discriminator:

- `no_targets` — thread has no bot besides the trigger author.
- `no_api_key` — the required LLM provider key is unset (`OPENROUTER_API_KEY` for the default model ref, or a direct-provider key for explicit direct refs); dispatch never started. The skip lands on the stream immediately after the chat entry.
- `debounced` — bot-rally cooldown suppressed every candidate. **Only bot-authored triggers can debounce** — human turns reset cooldown (arbe-9ec2). A `debounced` skip ends every healthy rally; not a fault.
- `filtered` — the gate/turn ran and yielded nothing: ambient gate said no, a fired bot chose not to reply, or the ambient cap hit.
- `signal.dispatch.failed` after `started` — the bot turn errored. Provider-returned failures (for example Anthropic quota/billing/auth errors encoded as `stopReason: error`) are promoted to this visible signal instead of being stored as empty assistant replies.

Both `debounced` and `filtered` mean "no reply" but are opposite diagnoses — debounced never consulted the gate, filtered did. **A should-have-triggered message coming back `debounced` implicates cooldown / `dispatch.ts` `triggerAuthorIsBot`, not the gate.** Gate verdicts aren't on the thread — only `wrangler tail` `[dispatch.gate]` shows them. Rules engine: `packages/core/dispatch/selection.ts` (pure, unit-tested). The disposable `pingpong` team ([tests/_pingpong-fixture.md](../../tests/_pingpong-fixture.md)) exercises both modes: `@pinger go` (mention) and `marco` (ambient) each rally deterministically.
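The skip reasons above map to different places to look. A rough lookup, assuming you already have the `reason` string from the `signal.dispatch.skipped` entry; `where_to_look` is a hypothetical helper and the hint wording is ours, not CLI output.

```sh
# Hypothetical helper: map a skip reason to the first thing to check.
where_to_look() {
  case "$1" in
    no_targets) echo "thread members: no bot besides the trigger author" ;;
    no_api_key) echo "provider key: unset, dispatch never started" ;;
    debounced)  echo "cooldown: gate was never consulted" ;;
    filtered)   echo "gate or turn: ran and yielded nothing" ;;
    *)          echo "unrecognized reason: $1"; return 1 ;;
  esac
}

# Usage:
#   where_to_look debounced
```

A `debounced` result on a message that should have triggered points at cooldown, so checking `[dispatch.gate]` lines first would be wasted effort.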

## Env-bound dispatch

An env-bound thread dispatches in-process like any other (read the rally above); the only addition is the bot's `run_command` tool reaching the sandbox. Debugging splits in two — is the rally healthy, and did the sandbox answer?

Preflight. `arbe env diagnose <env>` is the one-stop readiness check before dispatch: house capability, model-specific provider key, required env secrets, then live sandbox health. Pass `-m <provider/model>` to check the same model a thread will use.

Rally. `arbe thread diagnose <id>` reads the row + stream once and prints the stage, the last-turn timeline, and a remediation hint; exits 0 on ok/idle/running, 2 on failed, 3 on stalled (so `arbe thread diagnose <id> && …` composes). Pure classifier: [apps/cli/src/thread-diagnose.ts](../../apps/cli/src/thread-diagnose.ts).
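Because the exit code carries the stage, a script can branch on it without parsing output. A small sketch; `stage_from_exit` is a hypothetical helper, and the fallback branch for other codes is an assumption.

```sh
# Hypothetical helper: name the stage behind an `arbe thread diagnose` exit code.
stage_from_exit() {
  case "$1" in
    0) echo "ok/idle/running" ;;
    2) echo "failed" ;;
    3) echo "stalled" ;;
    *) echo "unexpected exit $1" ;;  # assumption: not a documented diagnose code
  esac
}

# Usage:
#   arbe thread diagnose "$id"; stage_from_exit "$?"
```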

Sandbox. `arbe sandbox diagnose <env-or-sandbox>` stays runtime/topology focused: worker drift, sprite reachability, sprite→worker stream-proxy reachability, pi install, and local session files. The `run_command` call lands on the thread as a `pi.tool_result` — `arbe thread entries list <id> --json` shows `payload.message.toolName`/`content`/`isError`. Poke the sandbox directly with `arbe x -s <sandbox> -- <argv…>` (daytona id by default — get one from `arbe sandbox list --runtime daytona`; `--runtime sprite` for a sprite name). Daytona execs run the same house-scoped path as `run_command`, so a stopped box is woken first. No remote shell — wrap pipes, `$?`, or `~` in `-- bash -lc '…'`. For lower-level pokes use the raw Sprites CLI `sprite x -s <sprite> -- …`, adding `--http-post` to match the dispatch transport (HTTP, not WSS).

End-to-end: `bun run scripts/remote-dispatch.ts [<env-ref>] [--local]` binds a thread to an env, plants a sandbox-only nonce via `arbe x -e`, @mentions a bot to read it back, and asserts the reply carries the nonce and is authored by that bot. Manual version with preflight + continuation + DX notes: [tests/remote-dispatch-prompt.md](../../tests/remote-dispatch-prompt.md).

A thread left `running` without a terminal self-heals on read via `reconcileStuckThread` (`packages/core/threads.ts`) — cold rows nobody reads stay stuck.

Stale sandbox pointer (env switched/deleted but commands still fail). Symptom: you change a thread's env — or delete its old env and bind a new one — yet `run_command` keeps hitting the old box. The thread's `environment_id` updates fine; the live pointer is `threads.sandbox_id`, and `ensureThreadSandbox` only re-resolves it when the row is `dead`/unprovisioned **or built from a different env**. Deleting an env is a soft delete (`environments.deleted_at`) that never tombstones the env's `sandboxes` row, so the row stays `live` with a `provider_ref`. Check it: `select environment_id, sandbox_id from threads where id=…`, then `select environment_id, status from sandboxes where id=<sandbox_id>` — a mismatch between the two `environment_id`s is the bug. Unblock a thread by clearing the stale pointer (`update threads set sandbox_id = null where id=…`); the next turn re-resolves via `findSandboxByEnvironment` onto the bound env's live box.
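The mismatch check above reduces to comparing the two `environment_id` values. A sketch that takes them as arguments, however you fetched them; `pointer_state` is a hypothetical helper, and treating an empty second argument as "no sandbox resolved yet" is an assumption.

```sh
# Hypothetical helper.
#   $1 = threads.environment_id
#   $2 = sandboxes.environment_id for the thread's sandbox_id (empty if sandbox_id is null)
pointer_state() {
  if [ -z "$2" ]; then
    echo unset   # nothing to compare; the next turn resolves a sandbox
  elif [ "$1" = "$2" ]; then
    echo ok      # pointer matches the bound env
  else
    echo stale   # the bug described above: sandbox built from a different env
  fi
}

# Usage:
#   pointer_state "$thread_env_id" "$sandbox_env_id"
```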

Detached coding runs post their work onto a child thread via the same `arbe-pi-runner` on both runtimes (daytona default, sprite legacy): a pi extension mirrors `pi.*` directly and the runner posts the terminal on exit (see [daytona runtime](./sandbox-daytona.md)). Healthy streaming is `pi.chunk* → pi.assistant → signal.thread.status_changed → signal.dispatch.completed`. Stages and forensics live with [sandbox-sprite](./sandbox-sprite.md#detached-work-arbe-pi-runner); `thread-diagnose` carries remote-stage branches. Start with the child thread, not the parent: all signals post there. If a phase stays silent, diagnose it rather than waiting out a 60-90 s timeout. In agent tooling, prefer `arbe thread entries list <child>` snapshots over `--follow`: `--follow` never returns, so it blocks the caller indefinitely.

When a detached child looks stuck, `arbe x -s <sandbox> -- bash -lc 'tail ~/arbe-pi-runner.log; tail ~/pi-run.log'` pulls the runner launch log and pi's transcript. Full command set lives with [sandbox-sprite](./sandbox-sprite.md#detached-work-arbe-pi-runner).

## Workflow conductor

Runs sitting `pending` mean the conductor is down. `arbe wf runs` says so itself: it leads with the conductor heartbeat (`conductor … ok, polled Ns ago`, fed by the 30s reconcile sweep upserting `wf_conductors`) and warns loudly when the beat is >90s old or runnable runs sit unclaimed. The daemon runs on Fly.io as app `arbe-workflow-conductor` (one always-on `shared-cpu-1x` machine in `ams`, restart policy `always`): logs via `fly logs -a arbe-workflow-conductor`, restart via `fly apps restart arbe-workflow-conductor`, deploy via `fly deploy --ha=false` from `apps/workflow-conductor/` (Dockerfile + fly.toml live there; secrets `DATABASE_URL`/`CONDUCTOR_SECRET` are set on the app, `APP_URL` defaults to prod). It moved off the arbe1 sprite because sprites pause when idle (see below); the stopped sprite service definition still exists on arbe1 as a fallback.
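The warning rule above can be restated as a predicate. A sketch, not the CLI's actual code: `conductor_state` is a hypothetical helper, and counting any unclaimed runnable run as a warning (with no minimum wait) is an assumption.

```sh
# Hypothetical helper.
#   $1 = seconds since the conductor last polled
#   $2 = number of runnable runs sitting unclaimed (defaults to 0)
conductor_state() {
  if [ "$1" -gt 90 ] || [ "${2:-0}" -gt 0 ]; then
    echo warn   # heartbeat older than 90s, or runs are waiting for a claim
  else
    echo ok
  fi
}

# Usage:
#   conductor_state 12 0
```

With a 30s sweep feeding the heartbeat, the 90s threshold tolerates two missed beats before warning.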

Failure modes seen in the wild (2026-06):

- **Why it left the sprite: sprites pause when idle.** No HTTP requests + no active sessions → the VM pauses, and a running `sprite-env` service does NOT keep it awake, despite `sprite create`'s lifecycle notes. A paused VM freezes the daemon's event loop wholesale: no polls, no timers, no heartbeat, TCP connections to the pooler wedge (bytes stuck in send-q → `EAUTHTIMEOUT`/`Query read timeout` on resume). Any `sprite x` against the host wakes it and everything bursts back to life — so the daemon always "recovered" the moment anyone investigated, and looked healthy whenever watched. Diagnose: `sprite api /v1/sprites/<name>` shows `status: "warm"` and `last_running_at` = your last poke; the conductor logs `[loop] event loop frozen for ~Ns` right after each wake. Moral: never host an always-on daemon on a sprite.
- **Conductor wedged by a dead pg connection.** Hardened: the pool has connect/query timeouts, fatal pool/connection errors exit the process (supervisor restarts it), and a 30s `select 1` watchdog (two strikes → exit) catches silent poll-loop death. Note the watchdog cannot fire while the VM itself is paused — see above.
- **Run stuck `sleeping` after a lost terminal signal.** The bot's reply is in the run thread but no `signal.dispatch.completed` follows (www isolate evicted between reply and signal publish). The done-check's primary evidence is that signal, so inline finish and reconcile both saw "still open" — runs were stranded forever. The done-check now has fallback evidence: the bot's newest `pi.assistant` entry with a final `stopReason` (`stop`/`length`/`error`/`aborted`), once past a 60s grace, counts as the turn's end, so the reconcile sweep self-heals these within ~90 seconds. Diagnose a stuck run: `/workflows` debug pane shows `Awaiting yes` while the thread shows a finished bot reply; conductor log repeats `[reconcile] emitted=0 checked=N`. Manual unstick (also works on old deploys): reply anything in the run thread — the new bot turn's terminal satisfies the done-check.
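The fallback evidence in the last bullet can be sketched as a predicate over the bot's newest `pi.assistant` entry. This illustrates the rule as described, not the conductor's code: `turn_ended_by_fallback` is a hypothetical helper, and treating exactly 60s as past the grace is an assumption.

```sh
# Hypothetical helper: succeeds (exit 0) when the fallback evidence holds.
#   $1 = stopReason of the bot's newest pi.assistant entry
#   $2 = age of that entry in seconds
turn_ended_by_fallback() {
  case "$1" in
    stop|length|error|aborted) ;;  # the final stop reasons listed above
    *) return 1 ;;                 # any other value is treated as not final
  esac
  [ "$2" -ge 60 ]                  # only counts once the 60s grace has passed
}

# Usage:
#   turn_ended_by_fallback stop 75 && echo "reconcile should close this run"
```

A 60s grace plus one 30s reconcile sweep gives the ~90 second self-heal window quoted above.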

Code: `apps/cli/src/http.ts` + `commands/http.ts` (`arbe http`), `apps/cli/src/debug.ts` + `commands/debug.ts`, `apps/cli/src/auth.ts` (token storage), `apps/www/src/routes/api/threads/[id]/stream/+server.ts` (owner-only stream proxy), `apps/www/scripts/profile-routes.ts` (route perf harness).

Every failed command, surprising flag, and "had to read the source" is DX signal — capture it; `tests/README.md` collects findings into prompt-shaped reports.
