Debugging
Inspect houses, agents, threads, and streams against local or deployed arbe. Most lookups go through the regular surface (thread, http); arbe debug is the small set that bypasses permissions or env resolution. Scripted proofs per surface live in tests/README.md.
Testing layers
Five ways to verify a change — pick by what you touched. Only the unit layer runs offline.
| Layer | Run | Needs |
|---|---|---|
| Unit / integration | bun run test (scope: --filter '@arbe/<pkg>') | nothing |
| HTTP CRUD contract | bun run tests/http-crud-proof.ts | dev server + arbe auth login |
| Env-bound dispatch e2e | bun run scripts/remote-dispatch.ts [<env>] [--local]; manual runbook: remote-dispatch prompt | env + sandbox + a house bot |
| LLM prompt suite | hand an agent tests/README.md | running stack |
| Browser / UI | agent-browser via arbe-dev profile (below) | dev server |
For browser proofs, use the checked-in agent-browser.json: it selects the arbe-dev session and limits navigation to localhost / 127.0.0.1. Dev www is usually http://localhost:8888. Password login is collapsed on /login — open http://localhost:8888/login?method=password before agent-browser auth login arbe-dev (or auth save with that URL). First-time setup (selectors, shared robot account, failure modes) lives in packages/skills/browser-testing/SKILL.md → “Auth profile”; agent-browser auth list tells you whether the profile exists yet. Keep UI testing on the website; don’t fall back to the CLI when the website is the thing under test.
    arbe --local debug env  # resolved URLs + LOCAL/REMOTE scope (--local must precede subcommand)
    arbe --local auth whoami  # token validates against the resolved backend
    arbe http /api/me  # raw API; --jq filter, --status, -v stderr banner; stdin body for writes
    arbe env diagnose <env>  # preflight: can this env dispatch now?
    arbe sandbox diagnose <env>  # runtime/topology: sprite health + worker reachability
    arbe x -s <sandbox> -- bash -lc 'tail ~/arbe-pi-runner.log; tail ~/pi-run.log'  # detached-run forensics (daytona id by default; --runtime sprite for a sprite name)
    arbe thread diagnose <id>  # classify last-dispatch stage; exits 2 on failed, 3 on stalled
    arbe thread entries list <id> --follow  # tail raw entries via permission-checked proxy
    arbe thread entries create <ref> "<text>"  # thread/agent ref; triggers dispatch
    arbe thread entries create <ref> --payload '{...}'  # any ArbeThreadPayload variant
    arbe thread entries read <id>  # tail + render pi text; exits on dispatch terminals
    durable-stream read arbe-thread-<id> --offset -1 --live  # raw, bypasses proxy

Two auth contexts: the user token (arbe auth login) for permission-checked calls, and DURABLE_STREAMS_SECRET (in apps/www/.env.local) for raw stream reads.
- Footgun: arbe debug env warns when www is --local but data (Supabase / Electric / streams) points at prod; local writes hit prod data.
- localhost + connection refused → bun run dev not running.
- 401 on auth whoami → arbe auth login.
- No version/commit endpoint on www yet; check the Cloudflare Pages dashboard for deploy SHA.
- Stale binary: the installed arbe can lag the repo; arbe --version prints the baked commit. When you run the compiled binary from inside a checkout whose HEAD differs, passive update check warns on stderr. To be safe while debugging a fix, run flows from source (bun apps/cli/src/cli.ts …) or rebuild (arbe upgrade/bun run --filter '@arbe/cli' build).
arbe http honours --local/APP_URL, writes the body to stdout and the banner (with -v) to stderr — | jq and set -e work without rituals. On non-2xx, --jq is skipped and the raw error body lands on stderr so the failure stays visible. arbe thread diagnose checks stream setup, agent presence, and trigger modes; dispatch failures land on the thread as signal.dispatch.failed. For live worker logs, wrangler tail on apps/www shows bracketed [dispatch.gate] lines (gate verdicts; gate cost now lands in usage_events, not the log) and [dispatch.turn] lines (per bot turn: tools= advertised, rounds= tool-loop iterations, calls= handlers run). Stream names are arbe-thread-<thread-id>. Same offset / limit / live semantics work through the app proxy: GET /api/threads/<id>/stream?offset=-1&live=1.
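As a small illustration of the naming and proxy conventions above, the sketch below builds the stream name and the proxy URL for a thread. The helper names streamName and proxyStreamUrl, the limit query parameter name, and the default base URL are assumptions made for this example; only the arbe-thread-<thread-id> pattern and the /api/threads/<id>/stream path with offset and live come from this page.

```typescript
// Hypothetical helpers for illustration; they are not part of the arbe codebase.
// Only the name pattern and the proxy path/query come from the text above.

/** Durable stream name for a thread: arbe-thread-<thread-id>. */
function streamName(threadId: string): string {
  return `arbe-thread-${threadId}`;
}

/** App-proxy URL for the same stream (the permission-checked route). */
function proxyStreamUrl(
  threadId: string,
  opts: { offset?: number; limit?: number; live?: boolean } = {},
  baseUrl: string = "http://localhost:8888", // assumed dev www address
): string {
  const params: string[] = [];
  if (opts.offset !== undefined) params.push(`offset=${opts.offset}`);
  if (opts.limit !== undefined) params.push(`limit=${opts.limit}`); // parameter name assumed
  if (opts.live) params.push("live=1");
  const query = params.length > 0 ? `?${params.join("&")}` : "";
  return `${baseUrl}/api/threads/${encodeURIComponent(threadId)}/stream${query}`;
}

console.log(streamName("abc123"));
console.log(proxyStreamUrl("abc123", { offset: -1, live: true }));
```

The raw stream name is what durable-stream read expects; the URL is what arbe http or a browser would hit through the app.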
Where to look: observability maps the four layers (run state, usage, lifecycle, CF logs) to the questions they answer. Money lives in usage_events + PostHog (analytics → usage).
Checking spend — both sinks, joined by trace_id:
    # the ledger (run from packages/)
    bunx supabase db query --linked "select created_at, seam, key_source, cost_usd from usage_events order by created_at desc limit 20" -o table

    # PostHog (needs query:read on your personal API key; \$ stops the shell from eating $ai_*)
    posthog-cli exp query run "select event, properties.\$ai_total_cost_usd from events where event = '\$ai_generation' order by timestamp desc limit 10"

If a bot replied but neither sink has a gate/reply row, the server is running old code; restart it. If PostHog has the event but Postgres doesn’t, the insert failed (usually RLS: usage_events only accepts the service-role client).
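The two-sink triage above can be written down as a tiny decision function. This is only a sketch: diagnoseSpend and its result labels are hypothetical names, and the case where the ledger has a row but PostHog does not is not covered by this page, so it is returned as such rather than guessed at.

```typescript
// Hypothetical triage helper. The two sinks are the usage_events ledger
// (Postgres) and PostHog, matched by trace_id as described above.
type SpendDiagnosis =
  | "ok"
  | "server-running-old-code"
  | "ledger-insert-failed"
  | "not-covered-here";

function diagnoseSpend(inLedger: boolean, inPostHog: boolean): SpendDiagnosis {
  if (inLedger && inPostHog) return "ok";
  // Bot replied but neither sink has a gate/reply row: restart the server.
  if (!inLedger && !inPostHog) return "server-running-old-code";
  // PostHog has the event, Postgres does not: usually RLS on usage_events.
  if (inPostHog) return "ledger-insert-failed";
  // Ledger row without a PostHog event: this page gives no guidance.
  return "not-covered-here";
}

console.log(diagnoseSpend(false, true));
```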
Common flows:
- Something broken: app or stream service? arbe debug env → arbe http /api/me (fails: www down or token stale) → write a chat entry on a known thread; if the write succeeds but the entry doesn’t appear, the stream service is the suspect; curl -s -o /dev/null -w "%{http_code}" $DURABLE_STREAMS_URL returning 404/405 = up.
- Local dev hangs. HMR feels like dial-up, network shows live=true requests pending forever: HTTP/1.1 connection cap from Electric long-polls; run the Caddy h2 proxy and hit https://localhost:8443 (system/development).
- DB migrations / verify scripts. See system/supabase.
Website perf. Reach for apps/www/scripts/profile-routes.ts first — bun run scripts/profile-routes.ts (from apps/www) drives agent-browser through a list of routes, captures PerformanceResourceTiming + the CDP network log, and prints a markdown table of FCP / settle-time / slowest /api/ calls per route. It auto-logs-in via the arbe-dev profile, so make sure that profile exists (agent-browser auth list) and bun run dev is up on :8888. Defaults to /houses /account /account/tokens; pass /path args to override, --json for machine output, --no-login to skip the auth step. Reach for agent-browser vitals --json for a one-off Core Web Vitals reading on the current page, and agent-browser trace start|stop or profiler start|stop when you need a DevTools-grade timeline. The harness deliberately runs against the dev server — prod numbers are different (no vite module graph, edge cache, real network) but the bottleneck order is usually preserved.
Bot replies run in-process from packages/core/dispatch/, fired by POST /api/threads/:id/entries through waitUntil. Tail dispatch via wrangler tail on apps/www. Tool calls surface two ways: as pi.tool_result entries on the thread (arbe thread entries list <id> — --json shows payload.message.toolName / content / isError) and as [dispatch.turn] … tools= rounds= calls= log lines. How the loop works and how to add a tool: dispatch.
Local dispatch (in-process) — reading a rally
The thread stream IS the trail (post-arbe-70f5). A healthy rally reads chat → signal.dispatch.started → pi.assistant per firing bot, recursing on each reply, ending in signal.dispatch.skipped reason=debounced (the last reply re-triggered nobody) + completed. list renders pi text; --json gives raw [{type,text}]. A bot that never replies when its trigger was plainly met = broken dispatch; the skipped reason is the discriminator:
- no_targets: thread has no bot besides the trigger author.
- no_api_key: the required LLM provider key is unset (OPENROUTER_API_KEY for the default model ref, or a direct-provider key for explicit direct refs); dispatch never started. The skip lands on the stream immediately after the chat entry.
- debounced: bot-rally cooldown suppressed every candidate. Only bot-authored triggers can debounce; human turns reset cooldown (arbe-9ec2). A debounced skip ends every healthy rally; not a fault.
- filtered: the gate/turn ran and yielded nothing (ambient gate said no, a fired bot chose not to reply, or the ambient cap hit).
- signal.dispatch.failed after started: the bot turn errored. Provider-returned failures (for example Anthropic quota/billing/auth errors encoded as stopReason: error) are promoted to this visible signal instead of being stored as empty assistant replies.
Both debounced and filtered mean “no reply” but are opposite diagnoses — debounced never consulted the gate, filtered did. A should-have-triggered message coming back debounced implicates cooldown / dispatch.ts triggerAuthorIsBot, not the gate. Gate verdicts aren’t on the thread — only wrangler tail [dispatch.gate] shows them. Rules engine: packages/core/dispatch/selection.ts (pure, unit-tested). The disposable pingpong team (tests/_pingpong-fixture.md) exercises both modes: @pinger go (mention) and marco (ambient) each rally deterministically.
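To make the discriminator concrete, here is a minimal sketch that turns a skipped reason into a triage hint. The function explainSkip and its hint strings are hypothetical; the real rules live in packages/core/dispatch/selection.ts, which this sketch does not reproduce. The reason names and the "only bot-authored triggers can debounce" rule are taken from the text above.

```typescript
// Hypothetical triage helper; not the rules engine itself.
type SkipReason = "no_targets" | "no_api_key" | "debounced" | "filtered";

function explainSkip(reason: SkipReason, triggerAuthorIsBot: boolean): string {
  switch (reason) {
    case "no_targets":
      return "thread has no bot besides the trigger author";
    case "no_api_key":
      return "required LLM provider key is unset; dispatch never started";
    case "debounced":
      // Human turns reset cooldown, so a debounce on a human turn is suspicious.
      return triggerAuthorIsBot
        ? "cooldown ended the rally; expected, gate not consulted"
        : "unexpected on a human turn; inspect cooldown / triggerAuthorIsBot, not the gate";
    case "filtered":
      return "gate or turn ran and declined; check [dispatch.gate] in wrangler tail";
  }
}

console.log(explainSkip("debounced", false));
```

Passing the trigger author's kind alongside the reason is the design point: the same debounced value is healthy after a bot reply and a bug signal after a human message.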
Env-bound dispatch
An env-bound thread dispatches in-process like any other (read the rally above); the only addition is the bot’s run_command tool reaching the sandbox. Debugging splits in two — is the rally healthy, and did the sandbox answer?
Preflight. arbe env diagnose <env> is the one-stop readiness check before dispatch: house capability, model-specific provider key, required env secrets, then live sandbox health. Pass -m <provider/model> to check the same model a thread will use.
Rally. arbe thread diagnose <id> reads the row + stream once and prints the stage, the last-turn timeline, and a remediation hint; exits 0 on ok/idle/running, 2 on failed, 3 on stalled (so arbe thread diagnose <id> && … composes). Pure classifier: apps/cli/src/thread-diagnose.ts.
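The exit-code contract is easy to mirror in a wrapper script. The sketch below is an illustration only, not the classifier in apps/cli/src/thread-diagnose.ts: the Stage type and both function names are assumptions, and only the stage names and the 0/2/3 codes come from the sentence above.

```typescript
// Hypothetical mirror of the documented exit codes.
type Stage = "ok" | "idle" | "running" | "failed" | "stalled";

/** Map a diagnosed stage to the documented process exit code. */
function exitCodeFor(stage: Stage): number {
  switch (stage) {
    case "failed":
      return 2;
    case "stalled":
      return 3;
    default:
      return 0; // ok, idle, running
  }
}

/** Reverse direction, for a script that only sees the exit code. */
function describeExit(code: number): string {
  if (code === 0) return "ok, idle, or running";
  if (code === 2) return "failed";
  if (code === 3) return "stalled";
  return "unexpected exit code"; // anything else is outside the documented contract
}

console.log(exitCodeFor("stalled"), describeExit(2));
```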
Sandbox. arbe sandbox diagnose <env-or-sandbox> stays runtime/topology focused: worker drift, sprite reachability, sprite→worker stream-proxy reachability, pi install, and local session files. The run_command call lands on the thread as a pi.tool_result — arbe thread entries list <id> --json shows payload.message.toolName/content/isError. Poke the sandbox directly with arbe x -s <sandbox> -- <argv…> (daytona id by default — get one from arbe sandbox list --runtime daytona; --runtime sprite for a sprite name). Daytona execs run the same house-scoped path as run_command, so a stopped box is woken first. No remote shell — wrap pipes, $?, or ~ in -- bash -lc '…'. For lower-level pokes use the raw Sprites CLI sprite x -s <sprite> -- …, adding --http-post to match the dispatch transport (HTTP, not WSS).
End-to-end: bun run scripts/remote-dispatch.ts [<env-ref>] [--local] binds a thread to an env, plants a sandbox-only nonce via arbe x -e, @mentions a bot to read it back, and asserts the reply carries the nonce and is authored by that bot. Manual version with preflight + continuation + DX notes: tests/remote-dispatch-prompt.md.
A thread left running without a terminal self-heals on read via reconcileStuckThread (packages/core/threads.ts) — cold rows nobody reads stay stuck.
Stale sandbox pointer (env switched/deleted but commands still fail). Symptom: you change a thread’s env — or delete its old env and bind a new one — yet run_command keeps hitting the old box. The thread’s environment_id updates fine; the live pointer is threads.sandbox_id, and ensureThreadSandbox only re-resolves it when the row is dead/unprovisioned or built from a different env. Deleting an env is a soft delete (environments.deleted_at) that never tombstones the env’s sandboxes row, so the row stays live with a provider_ref. Check it: select environment_id, sandbox_id from threads where id=…, then select environment_id, status from sandboxes where id=<sandbox_id> — a mismatch between the two environment_ids is the bug. Unblock a thread by clearing the stale pointer (update threads set sandbox_id = null where id=…); the next turn re-resolves via findSandboxByEnvironment onto the bound env’s live box.
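The two-query check above reduces to one comparison. A minimal sketch, assuming the rows have already been fetched: the interfaces below model only the columns named in this section, and isStaleSandboxPointer is a hypothetical name, not a function in the codebase.

```typescript
// Row shapes limited to the columns mentioned above; everything else is omitted.
interface ThreadRow {
  environment_id: string | null;
  sandbox_id: string | null;
}

interface SandboxRow {
  environment_id: string | null;
  status: string;
}

/** True when the thread is pinned to a sandbox built from a different env. */
function isStaleSandboxPointer(thread: ThreadRow, sandbox: SandboxRow | null): boolean {
  // No pointer (or no row for it): nothing is pinned, so there is no mismatch to report.
  if (thread.sandbox_id === null || sandbox === null) return false;
  return thread.environment_id !== sandbox.environment_id;
}

const thread: ThreadRow = { environment_id: "env-new", sandbox_id: "sbx-1" };
const sandbox: SandboxRow = { environment_id: "env-old", status: "running" };
console.log(isStaleSandboxPointer(thread, sandbox));
```

When this returns true, clearing threads.sandbox_id as described above is the unblock; the sketch deliberately ignores sandboxes.status, since the stale row stays live.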
Detached coding runs post their work onto a child thread via the same arbe-pi-runner on both runtimes (daytona default, sprite legacy): a pi extension mirrors pi.* directly and the runner posts the terminal on exit (see daytona runtime). Healthy streaming is pi.chunk* → pi.assistant → signal.thread.status_changed → signal.dispatch.completed. Stages and forensics live with sandbox-sprite; thread-diagnose carries remote-stage branches. Start with the child thread, not the parent — all signals post there. If a phase stays silent, diagnose it rather than waiting out a 60-90 s timeout. In agent tooling, prefer arbe thread entries list <child> snapshots over --follow; --follow streams forever and never returns, blocking the caller indefinitely.
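To find which phase went silent from a snapshot of the child thread, one option is to walk the entry types in order. This is a sketch under assumptions: firstSilentPhase is a hypothetical helper, pi.chunk* is read as "any entry type starting with pi.chunk", and unrelated entry types in between are tolerated rather than treated as errors.

```typescript
// Terminal phases of healthy streaming, in the order given above.
const EXPECTED: string[] = [
  "pi.assistant",
  "signal.thread.status_changed",
  "signal.dispatch.completed",
];

/** Returns the first phase not yet seen, or null when the run looks complete. */
function firstSilentPhase(entryTypes: string[]): string | null {
  let next = 0;
  for (const t of entryTypes) {
    if (next < EXPECTED.length && t === EXPECTED[next]) next++;
  }
  if (next === EXPECTED.length) return null;
  if (next === 0) {
    // Nothing terminal yet: distinguish "streaming" from "never started".
    const sawChunk = entryTypes.some((t) => t.indexOf("pi.chunk") === 0);
    return sawChunk ? "pi.assistant" : "pi.chunk";
  }
  return EXPECTED[next] ?? null;
}

console.log(firstSilentPhase(["pi.chunk", "pi.chunk", "pi.assistant"]));
```

Feed it the type field of each entry from a arbe thread entries list <child> --json snapshot; the exact JSON field name is not specified on this page.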
When a detached child looks stuck, arbe x -s <sandbox> -- bash -lc 'tail ~/arbe-pi-runner.log; tail ~/pi-run.log' pulls the runner launch log and pi’s transcript. Full command set lives with sandbox-sprite.
Workflow conductor
Runs sitting pending mean the conductor is down. arbe wf runs says so itself — it leads with the conductor heartbeat (conductor … ok, polled Ns ago, fed by the 30s reconcile sweep upserting wf_conductors) and warns loudly when the beat is >90s old or runnable runs sit unclaimed. The daemon runs on Fly.io as app arbe-workflow-conductor (one always-on shared-cpu-1x machine in ams, restart policy always): logs via fly logs -a arbe-workflow-conductor, restart via fly apps restart arbe-workflow-conductor, deploy via fly deploy --ha=false from apps/workflow-conductor/ (Dockerfile + fly.toml live there; secrets DATABASE_URL/CONDUCTOR_SECRET are set on the app, APP_URL defaults to prod). It moved off the arbe1 sprite because sprites pause when idle (see below); the stopped sprite service definition still exists on arbe1 as a fallback.
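The two warning conditions that arbe wf runs reports can be sketched as a pure check. The names below (ConductorStatus, conductorWarnings) and the message wording are hypothetical and do not reflect the CLI's actual output; only the 90s threshold and the "runnable runs sit unclaimed" condition come from the paragraph above.

```typescript
// Hypothetical status shape: a heartbeat timestamp plus a count of unclaimed runs.
interface ConductorStatus {
  lastPolledAtMs: number; // epoch milliseconds of the last heartbeat
  unclaimedRunnableRuns: number;
}

const MAX_BEAT_AGE_SEC = 90;

function conductorWarnings(status: ConductorStatus, nowMs: number): string[] {
  const warnings: string[] = [];
  const ageSec = Math.floor((nowMs - status.lastPolledAtMs) / 1000);
  if (ageSec > MAX_BEAT_AGE_SEC) {
    warnings.push(`heartbeat is ${ageSec}s old (limit ${MAX_BEAT_AGE_SEC}s)`);
  }
  if (status.unclaimedRunnableRuns > 0) {
    warnings.push(`${status.unclaimedRunnableRuns} runnable run(s) sit unclaimed`);
  }
  return warnings;
}

console.log(conductorWarnings({ lastPolledAtMs: 0, unclaimedRunnableRuns: 2 }, 120000));
```

With a 30s sweep feeding the heartbeat, a 90s limit tolerates two missed beats before warning.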
Failure modes seen in the wild (2026-06):
- Why it left the sprite: sprites pause when idle. No HTTP requests + no active sessions → the VM pauses, and a running sprite-env service does NOT keep it awake, despite sprite create’s lifecycle notes. A paused VM freezes the daemon’s event loop wholesale: no polls, no timers, no heartbeat, TCP connections to the pooler wedge (bytes stuck in send-q → EAUTHTIMEOUT / Query read timeout on resume). Any sprite x against the host wakes it and everything bursts back to life, so the daemon always “recovered” the moment anyone investigated, and looked healthy whenever watched. Diagnose: sprite api /v1/sprites/<name> shows status: "warm" and last_running_at = your last poke; the conductor logs [loop] event loop frozen for ~Ns right after each wake. Moral: never host an always-on daemon on a sprite.
- Conductor wedged by a dead pg connection. Hardened: the pool has connect/query timeouts, fatal pool/connection errors exit the process (supervisor restarts it), and a 30s select 1 watchdog (two strikes → exit) catches silent poll-loop death. Note the watchdog cannot fire while the VM itself is paused (see above).
- Run stuck sleeping after a lost terminal signal. The bot’s reply is in the run thread but no signal.dispatch.completed follows (www isolate evicted between reply and signal publish). The done-check’s primary evidence is that signal, so inline finish and reconcile both saw “still open” and runs were stranded forever. The done-check now has fallback evidence: the bot’s newest pi.assistant entry with a final stopReason (stop/length/error/aborted), once past a 60s grace, counts as the turn’s end, so the reconcile sweep self-heals these within ~90 seconds. Diagnose a stuck run: /workflows debug pane shows Awaiting yes while the thread shows a finished bot reply; conductor log repeats [reconcile] emitted=0 checked=N. Manual unstick (also works on old deploys): reply anything in the run thread; the new bot turn’s terminal satisfies the done-check.
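The done-check's primary and fallback evidence, described in the last failure mode above, can be sketched as follows. This is not the conductor's implementation: the Entry shape, field names, and turnLooksDone are assumptions, and the sketch presumes the entries passed in belong to the current turn and to the run's bot.

```typescript
// Hypothetical entry shape; only type and stopReason values come from the text above.
interface Entry {
  type: string;
  createdAtMs: number; // epoch milliseconds
  stopReason?: string;
}

const FINAL_STOP_REASONS: string[] = ["stop", "length", "error", "aborted"];
const GRACE_MS = 60 * 1000;

function turnLooksDone(entries: Entry[], nowMs: number): boolean {
  // Primary evidence: the terminal signal itself.
  for (const e of entries) {
    if (e.type === "signal.dispatch.completed") return true;
  }
  // Fallback evidence: newest pi.assistant with a final stopReason, past the grace period.
  let newest: Entry | null = null;
  for (const e of entries) {
    if (e.type === "pi.assistant" && (newest === null || e.createdAtMs > newest.createdAtMs)) {
      newest = e;
    }
  }
  if (newest === null || newest.stopReason === undefined) return false;
  const isFinal = FINAL_STOP_REASONS.indexOf(newest.stopReason) !== -1;
  return isFinal && nowMs - newest.createdAtMs > GRACE_MS;
}

const reply: Entry = { type: "pi.assistant", createdAtMs: 0, stopReason: "stop" };
console.log(turnLooksDone([reply], 30000), turnLooksDone([reply], 61000));
```

The grace period keeps the fallback from racing a terminal signal that is merely late rather than lost.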
Code: apps/cli/src/http.ts + commands/http.ts (arbe http), apps/cli/src/debug.ts + commands/debug.ts, apps/cli/src/auth.ts (token storage), apps/www/src/routes/api/threads/[id]/stream/+server.ts (owner-only stream proxy), apps/www/scripts/profile-routes.ts (route perf harness).
Every failed command, surprising flag, and “had to read the source” is DX signal — capture it; tests/README.md collects findings into prompt-shaped reports.