Debugging

Inspect houses, agents, threads, and streams against local or deployed arbe. Most lookups go through the regular surface (thread, http); arbe debug is the small set that bypasses permissions or env resolution. Scripted proofs per surface live in tests/README.md.

Testing layers

Five ways to verify a change — pick by what you touched. Only the unit layer runs offline.

| Layer | Run | Needs |
| --- | --- | --- |
| Unit / integration | `bun run test` (scope: `--filter '@arbe/<pkg>'`) | nothing |
| HTTP CRUD contract | `bun run tests/http-crud-proof.ts` | dev server + `arbe auth login` |
| Env-bound dispatch e2e | `bun run scripts/remote-dispatch.ts [<env>] [--local]`; manual runbook: remote-dispatch prompt | env + sandbox + a house bot |
| LLM prompt suite | hand an agent `tests/README.md` | running stack |
| Browser / UI | `agent-browser` via `arbe-dev` profile (below) | dev server |

For browser proofs, use the checked-in agent-browser.json: it selects the arbe-dev session and limits navigation to localhost / 127.0.0.1. Dev www is usually http://localhost:8888. Password login is collapsed on /login — open http://localhost:8888/login?method=password before agent-browser auth login arbe-dev (or auth save with that URL). First-time setup (selectors, shared robot account, failure modes) lives in packages/skills/browser-testing/SKILL.md → “Auth profile”; agent-browser auth list tells you whether the profile exists yet. Keep UI testing on the website; don’t fall back to the CLI when the website is the thing under test.

Repair localhost TLS

net::ERR_CERT_DATE_INVALID or a missing localhost.key → stop and restart bun run dev; Caddy issues a fresh leaf automatically. If it persists, move ~/.local/share/caddy/certificates/local/localhost aside and restart; bun run dev:trust is only for an untrusted issuer.

arbe --local debug env       # resolved URLs + LOCAL/REMOTE scope (--local must precede subcommand)
arbe --local auth whoami     # token validates against the resolved backend
arbe http /api/me            # raw API; --jq filter, --status, -v stderr banner; stdin body for writes
arbe env diagnose <env>      # preflight: can this env dispatch now?
arbe sandbox diagnose <env>  # runtime/topology: sprite health + worker reachability
arbe x -s <sandbox> -- bash -lc 'tail ~/arbe-pi-runner.log; tail ~/pi-run.log'  # detached-run forensics (daytona id by default; --runtime sprite for a sprite name)
arbe thread diagnose <id>    # classify last-dispatch stage; exits 2 on failed, 3 on stalled
arbe thread entries list <id> --follow            # tail raw entries via permission-checked proxy
arbe thread entries create <ref> "<text>"         # thread/agent ref; triggers dispatch
arbe thread entries create <ref> --payload '{...}' # any ArbeThreadPayload variant
arbe thread entries read <id>                     # tail + render pi text; exits on dispatch terminals
durable-stream read arbe-thread-<id> --offset -1 --live   # raw, bypasses proxy

Two auth contexts: the user token (arbe auth login) for permission-checked calls, and DURABLE_STREAMS_SECRET (in apps/www/.env.local) for raw stream reads.

Footgun: arbe debug env warns when www is --local but data (Supabase / Electric / streams) points at prod — local writes hit prod data.
localhost + connection refused → bun run dev not running.
401 on auth whoami → arbe auth login.
No version/commit endpoint on www yet — check the Cloudflare Pages dashboard for deploy SHA.
Stale binary: the installed arbe can lag the repo; arbe --version prints the baked commit. When you run the compiled binary from inside a checkout whose HEAD differs, the passive update check warns on stderr. To be safe while debugging a fix, run flows from source (bun apps/cli/src/cli.ts …) or rebuild (arbe upgrade / bun run --filter '@arbe/cli' build).

arbe http honours --local/APP_URL and writes the body to stdout and the banner (with -v) to stderr, so | jq and set -e work without rituals. On non-2xx, --jq is skipped and the raw error body lands on stderr so the failure stays visible.

arbe thread diagnose checks stream setup, agent presence, and trigger modes; dispatch failures land on the thread as signal.dispatch.failed. For live worker logs, wrangler tail on apps/www shows bracketed [dispatch.gate] lines (gate verdicts; gate cost now lands in usage_events, not the log) and [dispatch.turn] lines (per bot turn: tools= advertised, rounds= tool-loop iterations, calls= handlers run).

Stream names are arbe-thread-<thread-id>. The same offset / limit / live semantics work through the app proxy: GET /api/threads/<id>/stream?offset=-1&live=1.
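The naming rule and the proxy route can be captured in two small helpers. This is a minimal sketch: the function names and the `ReadOptions` shape are hypothetical, and `limit` as a query-parameter name is an assumption (the text only says the limit semantics carry over). Only the `arbe-thread-<thread-id>` pattern and the `offset` / `live` parameters come from the docs.

```typescript
// Hypothetical helpers, not part of the arbe codebase.
interface ReadOptions {
  offset?: number; // -1 is the value used in the examples above
  limit?: number;  // assumed parameter name
  live?: boolean;
}

// Raw durable-stream name for a thread.
function streamName(threadId: string): string {
  return `arbe-thread-${threadId}`;
}

// Path of the permission-checked app proxy for the same stream.
function proxyStreamPath(threadId: string, opts: ReadOptions = {}): string {
  const parts: string[] = [];
  if (opts.offset !== undefined) parts.push(`offset=${opts.offset}`);
  if (opts.limit !== undefined) parts.push(`limit=${opts.limit}`);
  if (opts.live) parts.push("live=1");
  const query = parts.length > 0 ? `?${parts.join("&")}` : "";
  return `/api/threads/${encodeURIComponent(threadId)}/stream${query}`;
}
```

The resulting path can be handed to arbe http, which keeps the read on the permission-checked side rather than going through DURABLE_STREAMS_SECRET.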

Where to look: observability maps the four layers (run state, usage, lifecycle, CF logs) to the questions they answer. Money lives in usage_events + PostHog (analytics → usage).

Checking spend — both sinks, joined by trace_id:

# the ledger (run from packages/)
bunx supabase db query --linked "select created_at, seam, key_source, cost_usd from usage_events order by created_at desc limit 20" -o table

# PostHog (needs query:read on your personal API key; \$ stops the shell from eating $ai_*)
posthog-cli exp query run "select event, properties.\$ai_total_cost_usd from events where event = '\$ai_generation' order by timestamp desc limit 10"

If a bot replied but neither sink has a gate/reply row, the server is running old code — restart it. If PostHog has the event but Postgres doesn’t, the insert failed (usually RLS: usage_events only accepts the service-role client).
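The two-sink reasoning above reduces to a small decision function. The sketch below is illustrative only: the type and function names are hypothetical, and the three booleans stand for what you observed by hand after running the two queries.

```typescript
// Hypothetical classifier for the "both sinks" check; not arbe code.
interface SinkObservation {
  botReplied: boolean;      // a bot reply is visible on the thread
  postgresHasRow: boolean;  // usage_events has a gate/reply row for the turn
  posthogHasEvent: boolean; // PostHog has the matching event
}

type SpendDiagnosis =
  | "ok"
  | "server-running-old-code"
  | "postgres-insert-failed"
  | "unclassified";

function diagnoseSpend(o: SinkObservation): SpendDiagnosis {
  if (o.postgresHasRow && o.posthogHasEvent) return "ok";
  // Bot replied but neither sink recorded it: restart the server.
  if (o.botReplied && !o.postgresHasRow && !o.posthogHasEvent) {
    return "server-running-old-code";
  }
  // PostHog only: the insert failed, usually RLS (service-role client required).
  if (o.posthogHasEvent && !o.postgresHasRow) return "postgres-insert-failed";
  return "unclassified";
}
```

Anything that falls through to `unclassified` (for example a Postgres row with no PostHog event) is not covered by the text above and needs a closer look.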

Common flows.

Something broken: app or stream service? arbe debug env → arbe http /api/me (fails: www down or token stale) → write a chat entry on a known thread; if the write succeeds but the entry doesn’t appear, the stream service is the suspect. curl -s -o /dev/null -w "%{http_code}" $DURABLE_STREAMS_URL returning 404/405 means the service is up.

Local dev hangs. HMR feels like dial-up and the network tab shows live=true requests pending forever: that is the HTTP/1.1 connection cap from Electric long-polls. Run the Caddy h2 proxy and hit https://localhost:8443 (system/development).

DB migrations / verify scripts. See system/supabase.

Website perf. Reach for apps/www/scripts/profile-routes.ts first — bun run scripts/profile-routes.ts (from apps/www) drives agent-browser through a list of routes, captures PerformanceResourceTiming + the CDP network log, and prints a markdown table of FCP / settle-time / slowest /api/ calls per route. It auto-logs-in via the arbe-dev profile, so make sure that profile exists (agent-browser auth list) and bun run dev is up on :8888. Defaults to /houses /account /account/tokens; pass /path args to override, --json for machine output, --no-login to skip the auth step. Reach for agent-browser vitals --json for a one-off Core Web Vitals reading on the current page, and agent-browser trace start|stop or profiler start|stop when you need a DevTools-grade timeline. The harness deliberately runs against the dev server — prod numbers are different (no vite module graph, edge cache, real network) but the bottleneck order is usually preserved.

Bot replies run in-process from packages/core/dispatch/. POST /api/threads/:id/entries spawns the turn as an Absurd reply-turn task (queue dispatch) rather than firing it into waitUntil — see “Reply turns are Absurd tasks” below; /api/wf/step still uses waitUntil. Tail dispatch via wrangler tail on apps/www. Tool calls surface two ways: as pi.tool_result entries on the thread (arbe thread entries list <id> — --json shows payload.message.toolName / content / isError) and as [dispatch.turn] … tools= rounds= calls= log lines. How the loop works and how to add a tool: dispatch.

Local dispatch (in-process) — reading a rally

The thread stream IS the trail (post-arbe-70f5). A healthy rally reads chat → signal.dispatch.started → pi.assistant per firing bot, recursing on each reply, ending in signal.dispatch.skipped reason=debounced (the last reply re-triggered nobody) + completed. list renders pi text; --json gives raw [{type,text}]. A bot that never replies when its trigger was plainly met = broken dispatch; the skipped reason is the discriminator:

no_targets — thread has no bot besides the trigger author.
no_api_key — the required LLM provider key is unset (OPENROUTER_API_KEY for the default model ref, or a direct-provider key for explicit direct refs); dispatch never started. The skip lands on the stream immediately after the chat entry.
debounced — bot-rally cooldown suppressed every candidate. Only bot-authored triggers can debounce — human turns reset cooldown (arbe-9ec2). A debounced skip ends every healthy rally; not a fault.
filtered — the gate/turn ran and yielded nothing: ambient gate said no, a fired bot chose not to reply, or the ambient cap hit.
signal.dispatch.failed after started — the bot turn errored. Provider-returned failures (for example Anthropic quota/billing/auth errors encoded as stopReason: error) are promoted to this visible signal instead of being stored as empty assistant replies.

Both debounced and filtered mean “no reply” but are opposite diagnoses — debounced never consulted the gate, filtered did. A should-have-triggered message coming back debounced implicates cooldown / dispatch.ts triggerAuthorIsBot, not the gate. Gate verdicts aren’t on the thread — only wrangler tail [dispatch.gate] shows them. Rules engine: packages/core/dispatch/selection.ts (pure, unit-tested). The disposable pingpong team (tests/_pingpong-fixture.md) exercises both modes: @pinger go (mention) and marco (ambient) each rally deterministically.
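As a reading aid, the skip reasons can be turned into a lookup that names the first suspect. This is a sketch under stated assumptions, not the logic in selection.ts: the `Suspect` labels and `readSkip` are hypothetical, and treating a debounced skip on a human-authored trigger as suspicious is an inference from "only bot-authored triggers can debounce".

```typescript
// Hypothetical reading aid for signal.dispatch.skipped reasons; not arbe code.
type SkipReason = "no_targets" | "no_api_key" | "debounced" | "filtered";

type Suspect =
  | "thread-membership" // no bot besides the trigger author
  | "provider-key"      // required LLM provider key unset
  | "cooldown"          // dispatch.ts triggerAuthorIsBot / cooldown, not the gate
  | "gate-or-turn"      // gate said no, bot declined, or ambient cap hit
  | "none";             // healthy end of a rally

function readSkip(reason: SkipReason, triggerAuthorIsBot: boolean): Suspect {
  switch (reason) {
    case "no_targets":
      return "thread-membership";
    case "no_api_key":
      return "provider-key";
    case "debounced":
      // A bot-authored trigger being debounced is how every healthy rally ends.
      return triggerAuthorIsBot ? "none" : "cooldown";
    case "filtered":
      // The gate ran; its verdict is only visible in wrangler tail [dispatch.gate].
      return "gate-or-turn";
  }
}
```

For example, `readSkip("debounced", false)` points at cooldown handling, while `readSkip("filtered", false)` sends you to the gate lines in the worker log.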

Reply turns are Absurd tasks (was: stranded in-process turn)

A chat reply turn used to strand silently. It was fired into waitUntil, which Cloudflare cancels ~30s after the response with no throw, so a slow LLM call left chat → signal.dispatch.started on the stream with no terminal, ever. The thread stream, the one witness, stayed blank, and that blankness was not evidence the call didn’t happen. That class is closed for chat (arbe-6e54): POST /entries now calls dispatch_spawn (queue dispatch, task reply-turn, params {threadId, entryId}); the conductor (Fly app arbe-workflow-conductor, the same daemon workflows use; fly logs -a arbe-workflow-conductor) claims it and holds one POST to /api/dispatch/run open for the whole turn, which runs the turn in-process and returns only once it settles. /api/wf/step hasn’t flipped yet (stage 2B): it still fires into waitUntil, raced against a 25s deadline (raceDispatchDeadline) that at least writes a failure signal, but with no automatic retry.

Start with arbe thread diagnose <id>: it still prints the turn’s trace: id and flips its verdict to stalled once a turn is >2 min quiet. Post-flip, stalled no longer means lost: it means “check the task tables”, because a legitimately slow turn shows a live claim below while the model thinks. Query the task and run tables directly. There is no unified absurd.tasks; each queue has its own pair of tables:

# run from packages/ — state, attempts, params for recent reply-turn tasks
bunx supabase db query --linked "select task_id, state, attempts, params->>'threadId' thread_id, params->>'entryId' entry_id, enqueue_at from absurd.t_dispatch where task_name='reply-turn' order by enqueue_at desc limit 20" -o table

# the claim — who has it, when the lease expires, why a run failed
bunx supabase db query --linked "select r.run_id, r.state, r.claimed_by, r.claim_expires_at, r.failure_reason from absurd.r_dispatch r join absurd.t_dispatch t on t.task_id = r.task_id where t.task_name='reply-turn' order by r.created_at desc limit 20" -o table

Three caps, smallest to largest, and what firing each looks like:

10min (/api/dispatch/run’s own TURN_CAP_MS) — writes signal.dispatch.failed, returns 200; the task completes (completed_payload.outcome: 'failed'), no retry.
12min (the conductor’s request abort in driveDispatchRun) — the fetch itself never got a response, so the run throws and Absurd retries per dispatch_spawn’s bounded backoff (5 attempts, exponential).
15min (claimTimeout, the outer bound) — the conductor process died mid-turn; the lease expires and the same task redelivers to a fresh claim.
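The three caps only work as described if each outer bound is strictly larger than the one inside it. The sketch below encodes the table above and checks that nesting. The `Cap` shape, the `rerun` labels and both function names are hypothetical; the durations and outcomes are taken from the list.

```typescript
// Hypothetical encoding of the three caps; not arbe code.
const MINUTE_MS = 60_000;

interface Cap {
  name: string;
  ms: number;
  firesWhen: string;
  rerun: "none" | "retry" | "redelivery";
}

const CAPS: Cap[] = [
  { name: "TURN_CAP_MS", ms: 10 * MINUTE_MS, firesWhen: "the turn itself runs long", rerun: "none" },
  { name: "conductor request abort", ms: 12 * MINUTE_MS, firesWhen: "the fetch never got a response", rerun: "retry" },
  { name: "claimTimeout", ms: 15 * MINUTE_MS, firesWhen: "the conductor process died mid-turn", rerun: "redelivery" },
];

// True when every cap is strictly larger than the one before it.
function capsAreNested(caps: Cap[]): boolean {
  let previous = -Infinity;
  for (const cap of caps) {
    if (cap.ms <= previous) return false;
    previous = cap.ms;
  }
  return true;
}

// Names of the caps whose deadline has passed after elapsedMs.
function capsElapsed(elapsedMs: number): string[] {
  return CAPS.filter((cap) => elapsedMs >= cap.ms).map((cap) => cap.name);
}
```

If a future change pushed the inner cap past 12 minutes, `capsAreNested(CAPS)` would go false, and a merely slow turn would start triggering retries instead of a single visible failure.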

Redelivery is why the idempotency guard matters: every signal.dispatch.* now carries entryId, so if a terminal already landed for this entry the redelivered attempt returns {skipped: true} instead of re-running the turn. To confirm an LLM call actually ran (vs. never got that far), join usage_events/PostHog on the turn’s trace_id as before — minted per turn, stamped on signal.dispatch.started, every recordUsage row, every PostHog dispatch.*/$ai_generation event (see observability):

bunx supabase db query --linked "select created_at, seam, cost_usd from usage_events where trace_id='<trace>' order by created_at" -o table   # run from packages/
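The idempotency guard described above can be pictured as a check against the stream before the turn runs. This is a minimal sketch, not the real handler: `StreamEntry`, `terminalAlreadyLanded` and `runTurnOnce` are hypothetical names, the location of `entryId` under `payload` is assumed, and treating completed, failed and skipped as the terminal set is an assumption.

```typescript
// Hypothetical sketch of the redelivery guard; not arbe code.
interface StreamEntry {
  type: string;
  payload: { entryId?: string };
}

// Assumed terminal set for a dispatch.
const TERMINAL_TYPES = [
  "signal.dispatch.completed",
  "signal.dispatch.failed",
  "signal.dispatch.skipped",
];

function terminalAlreadyLanded(entries: StreamEntry[], entryId: string): boolean {
  return entries.some(
    (e) => TERMINAL_TYPES.includes(e.type) && e.payload.entryId === entryId,
  );
}

type TurnResult = { skipped: true } | { skipped: false; result: string };

// Runs the turn unless a terminal for this entry is already on the stream.
function runTurnOnce(
  entries: StreamEntry[],
  entryId: string,
  runTurn: () => string,
): TurnResult {
  if (terminalAlreadyLanded(entries, entryId)) return { skipped: true };
  return { skipped: false, result: runTurn() };
}
```

The point of keying on `entryId` rather than on the thread is that a later chat entry on the same thread still gets its own turn, while a redelivered task for an already-settled entry does nothing.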

Env-bound dispatch

An env-bound thread dispatches in-process like any other (read the rally above); the only addition is the bot’s run_command tool reaching the sandbox. Debugging splits in two — is the rally healthy, and did the sandbox answer?

Preflight. arbe env diagnose <env> is the one-stop readiness check before dispatch: house capability, model-specific provider key, required env secrets, then live sandbox health. Pass -m <provider/model> to check the same model a thread will use.

Rally. arbe thread diagnose <id> reads the row + stream once and prints the stage, the last-turn timeline, and a remediation hint; exits 0 on ok/idle/running, 2 on failed, 3 on stalled (so arbe thread diagnose <id> && … composes). Pure classifier: apps/cli/src/thread-diagnose.ts.
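When a script needs more than the `&&` composition, the exit codes can be mapped back to a verdict. The helper names below are hypothetical; only the 0 / 2 / 3 mapping comes from the text, and any other code is treated as unknown rather than guessed at.

```typescript
// Hypothetical helper for scripts that wrap `arbe thread diagnose`; not arbe code.
type DiagnoseVerdict = "healthy" | "failed" | "stalled" | "unknown";

function verdictFromExitCode(code: number): DiagnoseVerdict {
  switch (code) {
    case 0:
      return "healthy"; // ok, idle, or running
    case 2:
      return "failed";
    case 3:
      return "stalled";
    default:
      return "unknown"; // not documented above; inspect the output by hand
  }
}

// Mirrors `arbe thread diagnose <id> && …`: continue only on exit 0.
function shouldContinue(code: number): boolean {
  return verdictFromExitCode(code) === "healthy";
}
```

A wrapper can branch on `stalled` to go check the task tables and on `failed` to read the signal.dispatch.failed entry, instead of treating every non-zero exit the same way.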

Sandbox. arbe sandbox diagnose <env-or-sandbox> stays runtime/topology focused: worker drift, sprite reachability, sprite→worker stream-proxy reachability, pi install, and local session files. The run_command call lands on the thread as a pi.tool_result — arbe thread entries list <id> --json shows payload.message.toolName/content/isError. Poke the sandbox directly with arbe x -s <sandbox> -- <argv…> (daytona id by default — get one from arbe sandbox list --runtime daytona; --runtime sprite for a sprite name). Daytona execs run the same house-scoped path as run_command, so a stopped box is woken first. No remote shell — wrap pipes, $?, or ~ in -- bash -lc '…'. For lower-level pokes use the raw Sprites CLI sprite x -s <sprite> -- …, adding --http-post to match the dispatch transport (HTTP, not WSS).

End-to-end: bun run scripts/remote-dispatch.ts [<env-ref>] [--local] binds a thread to an env, plants a sandbox-only nonce via arbe x -s <sandbox>, @mentions a bot to read it back, and asserts the reply carries the nonce and is authored by that bot. Manual version with preflight + continuation + DX notes: tests/remote-dispatch-prompt.md.

A thread left running without a terminal self-heals on read via reconcileStuckThread (packages/core/threads.ts) — cold rows nobody reads stay stuck.

Stale sandbox pointer (env switched/deleted but commands still fail). Symptom: you change a thread’s env — or delete its old env and bind a new one — yet run_command keeps hitting the old box. The thread’s environment_id updates fine; the live pointer is threads.sandbox_id, and ensureThreadSandbox only re-resolves it when the row is dead/unprovisioned or built from a different env. Deleting an env is a soft delete (environments.deleted_at) that never tombstones the env’s sandboxes row, so the row stays live with a provider_ref. Check it: select environment_id, sandbox_id from threads where id=…, then select environment_id, status from sandboxes where id=<sandbox_id> — a mismatch between the two environment_ids is the bug. Unblock a thread by clearing the stale pointer (update threads set sandbox_id = null where id=…); the next turn re-resolves via findSandboxByEnvironment onto the bound env’s live box.
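The mismatch check can be written down as a comparison of the two rows returned by those selects. A minimal sketch: the row types carry only the columns named above, and `hasStaleSandboxPointer` is a hypothetical name, not a function in the codebase.

```typescript
// Hypothetical check over the two query results; not arbe code.
interface ThreadRow {
  id: string;
  environment_id: string | null;
  sandbox_id: string | null;
}

interface SandboxRow {
  id: string;
  environment_id: string | null;
  status: string;
}

// True when the thread still points at a sandbox built from a different env.
function hasStaleSandboxPointer(thread: ThreadRow, sandbox: SandboxRow | null): boolean {
  if (thread.sandbox_id === null || sandbox === null) return false; // nothing to be stale
  if (sandbox.id !== thread.sandbox_id) {
    throw new Error("sandbox row does not belong to this thread's pointer");
  }
  return thread.environment_id !== sandbox.environment_id;
}
```

When this returns true, clearing `threads.sandbox_id` as described above is the unblock; when it returns false, the pointer is not the problem and the sandbox itself deserves the next look.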

Detached coding runs post their work onto a child thread via the same arbe-pi-runner on both runtimes (daytona default, sprite legacy): a pi extension mirrors pi.* directly and the runner posts the terminal on exit (see daytona runtime). Healthy streaming is pi.chunk* → pi.assistant → signal.thread.status_changed → signal.dispatch.completed. Stages and forensics live with sandbox-sprite; thread-diagnose carries remote-stage branches. Start with the child thread, not the parent: all signals post there. If a phase stays silent, diagnose it rather than waiting out a 60-90 s timeout. In agent tooling, prefer arbe thread entries list <child> snapshots over --follow, which never returns and blocks the caller indefinitely.

When a detached child looks stuck, arbe x -s <sandbox> -- bash -lc 'tail ~/arbe-pi-runner.log; tail ~/pi-run.log' pulls the runner launch log and pi’s transcript. Full command set lives with sandbox-sprite.

Workflow conductor

Runs sitting pending mean the conductor is down. arbe wf runs says so itself — it leads with the conductor heartbeat (conductor … ok, polled Ns ago, fed by the 30s reconcile sweep upserting wf_conductors) and warns loudly when the beat is >90s old or runnable runs sit unclaimed. The daemon runs on Fly.io as app arbe-workflow-conductor (one always-on shared-cpu-1x machine in ams, restart policy always): logs via fly logs -a arbe-workflow-conductor, restart via fly apps restart arbe-workflow-conductor, deploy via fly deploy --ha=false from apps/workflow-conductor/ (Dockerfile + fly.toml live there; secrets DATABASE_URL/CONDUCTOR_SECRET are set on the app, APP_URL defaults to prod). It moved off the arbe1 sprite because sprites pause when idle (see below); the stopped sprite service definition still exists on arbe1 as a fallback.
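The two warning conditions that arbe wf runs reports can be sketched as a pure function over a heartbeat snapshot. The names and the snapshot shape are hypothetical, and the warning strings are illustrative rather than the CLI's actual output; only the 90s threshold and the unclaimed-runs condition come from the text.

```typescript
// Hypothetical sketch of the heartbeat check; not arbe code.
const HEARTBEAT_STALE_MS = 90_000; // ">90s old" from the description above

interface ConductorSnapshot {
  lastBeatAtMs: number;          // timestamp of the newest wf_conductors beat
  unclaimedRunnableRuns: number; // runnable runs nobody has claimed
}

function conductorWarnings(snapshot: ConductorSnapshot, nowMs: number): string[] {
  const warnings: string[] = [];
  const ageMs = nowMs - snapshot.lastBeatAtMs;
  if (ageMs > HEARTBEAT_STALE_MS) {
    warnings.push(`heartbeat is ${Math.round(ageMs / 1000)}s old`);
  }
  if (snapshot.unclaimedRunnableRuns > 0) {
    warnings.push(`${snapshot.unclaimedRunnableRuns} runnable run(s) unclaimed`);
  }
  return warnings;
}
```

With a 30s sweep feeding the beat, a 90s threshold tolerates two missed sweeps before warning, which keeps a single slow sweep from raising a false alarm.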

Failure modes seen in the wild (2026-06):

Why it left the sprite: sprites pause when idle. No HTTP requests + no active sessions → the VM pauses, and a running sprite-env service does NOT keep it awake, despite sprite create’s lifecycle notes. A paused VM freezes the daemon’s event loop wholesale: no polls, no timers, no heartbeat, TCP connections to the pooler wedge (bytes stuck in send-q → EAUTHTIMEOUT/Query read timeout on resume). Any sprite x against the host wakes it and everything bursts back to life — so the daemon always “recovered” the moment anyone investigated, and looked healthy whenever watched. Diagnose: sprite api /v1/sprites/<name> shows status: "warm" and last_running_at = your last poke; the conductor logs [loop] event loop frozen for ~Ns right after each wake. Moral: never host an always-on daemon on a sprite.
Conductor wedged by a dead pg connection. Hardened: the pool has connect/query timeouts, fatal pool/connection errors exit the process (supervisor restarts it), and a 30s select 1 watchdog (two strikes → exit) catches silent poll-loop death. Note the watchdog cannot fire while the VM itself is paused — see above.
Run stuck sleeping after a lost terminal signal. The bot’s reply is in the run thread but no signal.dispatch.completed follows (www isolate evicted between reply and signal publish). The done-check’s primary evidence is that signal, so inline finish and reconcile both saw “still open” — runs were stranded forever. The done-check now has fallback evidence: the bot’s newest pi.assistant entry with a final stopReason (stop/length/error/aborted), once past a 60s grace, counts as the turn’s end, so the reconcile sweep self-heals these within ~90 seconds. Diagnose a stuck run: /workflows debug pane shows Awaiting yes while the thread shows a finished bot reply; conductor log repeats [reconcile] emitted=0 checked=N. Manual unstick (also works on old deploys): reply anything in the run thread — the new bot turn’s terminal satisfies the done-check.
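The fallback evidence in the last failure mode can be sketched as follows. This is an illustration under assumptions, not the conductor's done-check: the `RunEntry` shape, its field names and `turnLooksDone` are hypothetical, and the primary check is simplified to "any signal.dispatch.completed in the supplied entries".

```typescript
// Hypothetical sketch of the done-check with fallback evidence; not arbe code.
const FINAL_STOP_REASONS = ["stop", "length", "error", "aborted"];
const GRACE_MS = 60_000; // the 60s grace from the description above

interface RunEntry {
  type: string;
  authorId: string;
  createdAtMs: number;
  stopReason?: string;
}

function turnLooksDone(entries: RunEntry[], botId: string, nowMs: number): boolean {
  // Primary evidence: the terminal signal itself.
  if (entries.some((e) => e.type === "signal.dispatch.completed")) return true;

  // Fallback evidence: the bot's newest pi.assistant entry.
  const newest = entries
    .filter((e) => e.type === "pi.assistant" && e.authorId === botId)
    .sort((a, b) => b.createdAtMs - a.createdAtMs)[0];
  if (newest === undefined || newest.stopReason === undefined) return false;

  const isFinal = FINAL_STOP_REASONS.includes(newest.stopReason);
  const pastGrace = nowMs - newest.createdAtMs >= GRACE_MS;
  return isFinal && pastGrace;
}
```

The grace period exists so that a reply whose terminal signal is merely in flight is not declared finished early; only once the signal is plausibly lost does the reply itself count as the end of the turn.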

Code: apps/cli/src/http.ts + commands/http.ts (arbe http), apps/cli/src/debug.ts + commands/debug.ts, apps/cli/src/auth.ts (token storage), apps/www/src/routes/api/threads/[id]/stream/+server.ts (owner-only stream proxy), apps/www/scripts/profile-routes.ts (route perf harness).

Every failed command, surprising flag, and “had to read the source” is DX signal — capture it; tests/README.md collects findings into prompt-shaped reports.