
Daytona runtime

delegate_task (in @arbe/core/dispatch/agent-tools) is arbe’s one coding tool. It runs a coding agent inside ctx.sandbox. ctx.sandbox is one interface with two implementations, sprite and daytona; environment.runtime picks one per environment, default daytona.

This doc covers the daytona implementation. It runs pi in a Daytona sandbox, mirrors pi’s session into one durable arbe thread, and resumes a run from that thread. The code is in packages/sandbox/src/daytona/. Each layer has a runnable proof at proofs/prove-*.ts (run bun --env-file=.env proofs/<file> from that directory).

The shape

The thread is the only durable handle, and the source of truth. A thread is a durable stream of entries. Everything else is a cache you can lose — pi’s session files on disk, the in-process sandbox handle, any real-time watcher — and every cache writes to the thread or reads from it.

The parts:

  • thread — the durable stream and source of truth. Entry types: chat, pi.*, signal.*.
  • finished-signal — the signal.thread.* entry that ends a run. Status is one of status_changed:completed, pi_failed, or pi_session_orphaned.
  • pi — the coding-agent process.
  • CodingAgent — the harness descriptor (binary, file extension, how a run ends) that selects pi, codex, or claude-code. Not arbe’s Agent, which is the persona (model plus instructions) composed onto a CodingAgent.
  • arbe-pi-runner — starts pi inside the sandbox, owns pi’s exit code, and posts the finished-signal when pi exits. One per harness; lives in src/runner.ts.
  • the mirror — a pi extension that posts pi.* entries plus a heartbeat every ~5s. One extension, posting directly; no relay process.
  • stream-proxy — a Supabase Edge Function the sandbox posts to because it can’t reach the CF worker. Covered under egress below.
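The entry vocabulary in the list above can be sketched as types. This is a minimal illustration, not the shapes in @arbe/core/entries: the field names follow the write-path body ({ id, ts, authorId?, payload }), while payload.status and the epoch-millisecond ts are assumptions made for the example.

```typescript
// Hypothetical shapes: only the type prefixes and status names come from the doc.
type FinishedStatus = "status_changed:completed" | "pi_failed" | "pi_session_orphaned";

interface ThreadEntry {
  id: string;
  ts: number; // assumed: epoch milliseconds
  authorId?: string;
  payload: { type: string; status?: string };
}

const FINISHED_STATUSES: readonly FinishedStatus[] = [
  "status_changed:completed",
  "pi_failed",
  "pi_session_orphaned",
];

// Classify an entry by its type prefix: chat, pi.*, or signal.*.
function kindOf(entry: ThreadEntry): "chat" | "pi" | "signal" | "unknown" {
  const t = entry.payload.type;
  if (t === "chat") return "chat";
  if (t.startsWith("pi.")) return "pi";
  if (t.startsWith("signal.")) return "signal";
  return "unknown";
}

// Return the finished status if this entry ends a run, else null.
// Assumption: the status is carried in payload.status.
function finishedStatus(entry: ThreadEntry): FinishedStatus | null {
  if (!entry.payload.type.startsWith("signal.thread.")) return null;
  const match = FINISHED_STATUSES.find((s) => s === entry.payload.status);
  return match ?? null;
}
```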

A run, end to end:

launchCodingAgent(pi, sandbox, task, { thread? })    spawns a session (or uses thread),
  │ provision + fire DETACHED                        returns the thread id at once
╭──────────────────── Daytona sandbox ────────────────────╮
│ arbe-pi-runner ─▶ pi ─▶ tool calls                      │
│ owns exit code    ╰─ pi extension (the mirror)          │
│ posts terminal on exit                                  │
╰─────────────────────────────┬───────────────────────────╯
                              │ pi.* + heartbeat, via scoped JWT
                    edge fn (stream-proxy)
╭───────────────────── thread ──────────────────────╮
│ source of truth · entries: chat | pi.* | signal.* │
│ a finished-signal ends a run                      │
╰───────────────────────────────────────────────────╯

A run is never silently terminal. Whoever is still alive when the run ends posts the finished-signal:

the mirror        soft error (pi emits stopReason:"error")     inside pi
arbe-pi-runner    pi crashed, sandbox alive (has exit code)    → pi_failed
pull-confirm      the sandbox itself died (no exit code)       → pi_session_orphaned

The heartbeat makes “dead” decidable from “still thinking” using the thread alone, so no watcher is needed.
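As a sketch of how the thread alone can separate "dead" from "still thinking": with a heartbeat roughly every 5s, silence longer than some threshold means nothing inside the sandbox is posting any more. The function name, the entry shape, and the threshold parameter below are hypothetical; the real check lives in the reconcile path.

```typescript
type Liveness = "finished" | "alive" | "silent";

// Hypothetical view of a mirrored entry: when it was posted and whether it is
// a finished-signal. Heartbeats count as ordinary entries here.
interface MirroredEntry {
  ts: number; // assumed: epoch milliseconds
  finished: boolean;
}

// Decide liveness from thread contents only. `launchedAtMs` covers a run that
// has not mirrored anything yet, so the silence clock starts at launch.
function classifyRun(
  entries: MirroredEntry[],
  launchedAtMs: number,
  nowMs: number,
  orphanThresholdMs: number,
): Liveness {
  if (entries.some((e) => e.finished)) return "finished";
  const lastSeen = entries.reduce((latest, e) => Math.max(latest, e.ts), launchedAtMs);
  return nowMs - lastSeen > orphanThresholdMs ? "silent" : "alive";
}
```

A long model turn still produces heartbeats, so it reads as "alive"; only a stream with no entries at all past the threshold reads as "silent".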

The thread’s Postgres row is a lazy cache of the stream. Dispatch claims the child idle → running when it fires the runner (launchOnDaytona posts the status_changed readiness milestone), and that claim arms the read-path reconcile (reconcileStuckThread, run on every thread GET): it adopts the stream’s finished-signal onto the row, or — when the stream has gone silent past the orphan threshold — flips the row to failed with pi_session_orphaned. Without the claim the row stays idle and reconcile does nothing, so a run that dies in bootstrap, before pi ever mirrors, would hang dark forever. For a stuck running thread, reconcile also asks Daytona whether the box still exists, so a dead box is settled at once instead of waiting out the threshold — see box lifecycle.
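The reconcile rules in this paragraph can be sketched as a pure function over the row and a view of the stream. This is an illustration under assumptions, not reconcileStuckThread itself: the row statuses other than idle, running, and failed, the StreamView shape, and the mapping of pi_failed onto a failed row are all hypothetical.

```typescript
type RowStatus = "idle" | "running" | "completed" | "failed";
type Finished = "status_changed:completed" | "pi_failed" | "pi_session_orphaned";

interface Row {
  status: RowStatus;
  reason?: string;
}

// Hypothetical summary of what the stream and the provider currently say.
interface StreamView {
  finished: Finished | null; // finished-signal found on the stream, if any
  silentMs: number;          // time since the last entry
  boxGone: boolean;          // provider confirmed the sandbox no longer exists
}

function reconcileRow(row: Row, stream: StreamView, orphanThresholdMs: number): Row {
  // Without the idle -> running claim, reconcile has nothing to act on.
  if (row.status !== "running") return row;
  // Adopt the stream's finished-signal onto the row.
  if (stream.finished === "status_changed:completed") return { status: "completed" };
  if (stream.finished !== null) return { status: "failed", reason: stream.finished };
  // No finished-signal: a confirmed-dead box settles at once, silence settles after the threshold.
  if (stream.boxGone || stream.silentMs > orphanThresholdMs) {
    return { status: "failed", reason: "pi_session_orphaned" };
  }
  return row;
}
```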

When the reconciled thread is a delegated child (a delegate_task run, parent.kind:'thread'), that same reconcile notifies the parent: it posts a signal.thread.child_finished plus a brain-authored chat carrying the child’s result onto the parent thread, with no dispatch re-fire. See dispatch.

Resume rebuilds the same handles from the thread, not from disk. pi’s disk --continue is a cache that goes stale once the thread is continued on another device. The pi session format and what reconstruction can and can’t recover are worked out in resume notes.

same sandbox     { resume }             pi --continue <disk cache>                  L5
fresh sandbox    { resume, hydrate }    rebuild session from thread, upload,        L7
                                        pi --session <file> (no disk --continue)

pi’s session is one portable JSONL file, but the mirror posts a lossy projection of it: user turns become chat, assistant turns become pi.assistant, and tree ids, labels, and model entries are dropped. So a fresh-sandbox resume can’t replay the thread directly. It rebuilds a loadable linear session with pi.reconstructSession and loads it with --session (not --continue, which scans a directory filtered by cwd and raced by mtime). The prompt is canonical on the thread: run() posts it as a chat entry before driving pi, so a rebuild reads it from the thread.
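The choice between the two resume paths can be sketched as a small planner. The ResumePlan shape and function name are hypothetical; only the --continue and --session flags and the hydrate step come from the text above.

```typescript
// Hypothetical plan: whether to rebuild the session from the thread first,
// and which pi arguments to use.
interface ResumePlan {
  hydrate: boolean;
  piArgs: string[];
}

// Same sandbox: the disk cache is still valid, so pi can continue from it.
// Fresh sandbox: rebuild a linear session from the thread, upload it, and
// load that exact file instead of letting --continue scan a directory.
function planResume(sameSandbox: boolean, rebuiltSessionPath: string): ResumePlan {
  return sameSandbox
    ? { hydrate: false, piArgs: ["--continue"] }
    : { hydrate: true, piArgs: ["--session", rebuiltSessionPath] };
}
```

Passing an explicit file sidesteps the cwd filter and mtime race that a directory scan is exposed to.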

Library (src/)

  • thread.ts — the write side. openThread() mints a scoped grant and returns the stream endpoint; create, post, and read delegate to @arbe/core/entries.
  • sandbox.ts — the compute side. createSandbox() returns a fresh Daytona sandbox with .exec and .provision(agent).
  • coding-agent.ts — the pi CodingAgent descriptor.
  • decide-pi-outcome.ts — decidePiOutcome maps pi’s stopReason and exit code to an outcome plus the signal.thread.* entries to post.
  • run.ts — run(), the synchronous host driver: provision, drive pi, read the thread back; the host posts the terminal entry. Used by the proofs and cli.ts.
  • launch-coding-agent.ts — launchCodingAgent(), the daytona body of delegate_task: spawn a session (or use a given thread), fire arbe-pi-runner detached, return the thread id at once. The CF worker can’t block for minutes, so dispatch needs the detached shape. Its default thread is an openThread() grant and stream with no DB row; the parented child row with environmentId is core createThread’s job.
  • runner.ts — arbe-pi-runner, running in the sandbox: owns pi’s exit code, reads the thread back, runs decidePiOutcome, and posts the terminal entry on exit.

run and launchCodingAgent are two callers of the same handles. run blocks on exec and the host posts the terminal; launchCodingAgent returns at once and the in-sandbox runner posts the terminal. Both feed decidePiOutcome the same two inputs, the thread and the exit code. A second harness (codex, claude-code) is a second CodingAgent with its own decideMessageType; the orchestrator does not change.
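Because both callers feed the same two inputs, the decision can be pictured as a pure function. The sketch below is not decidePiOutcome: the function name is hypothetical, it returns only the status rather than the entries to post, and treating a soft error as pi_failed is an assumption. The orphaned case is left out because it has no exit code and is settled by pull-confirm instead.

```typescript
type RunOutcome = "status_changed:completed" | "pi_failed";

// `lastStopReason` is read back from the thread (the mirror posts it);
// `exitCode` is owned by whoever ran pi: the host in run(), the runner when detached.
function decideOutcomeSketch(lastStopReason: string | null, exitCode: number): RunOutcome {
  // pi crashed while the sandbox stayed alive, so an exit code exists.
  if (exitCode !== 0) return "pi_failed";
  // ASSUMPTION: a clean exit after a soft error still counts as a failure.
  if (lastStopReason === "error") return "pi_failed";
  return "status_changed:completed";
}
```

Keeping the function free of I/O is what lets the blocking driver and the detached runner share it unchanged.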

Egress and env

Daytona egress is whitelist-only and matched by domain. The sandbox can reach npm, github*, cloudflare.com, and our Supabase host *.supabase.co (plus the other default-allowlisted services: package managers, git hosts, container registries, LLM APIs). It cannot reach the CF worker arbe.0sk.ar, workers.dev, or any other host, and a tunnel doesn’t help because the block is by domain. At our org tier the restriction cannot be overridden per sandbox; lifting it means Daytona tier 3 (~400 EUR/mo), which we’ve decided against. So: the open web is reachable only via a proxy on an allowlisted host, if a workload ever truly needs it — none does today, since the harness itself only needs the allowlist (arbe-5783). The signature of a blocked host: HTTPS gets a TLS reset (curl exit 35, 000 status), plain HTTP gets a proxy 403.
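The blocked-host signature can be turned into a small classifier for probe results gathered inside the sandbox. This is a heuristic sketch with hypothetical names; a 403 over plain HTTP could in principle also come from the origin, so "blocked" here means "matches the signature described above".

```typescript
type EgressVerdict = "blocked" | "reachable" | "unknown";

// Hypothetical record of one curl probe: the scheme used, curl's exit code,
// and the reported HTTP status (0 when curl prints 000, i.e. no response).
interface ProbeResult {
  scheme: "http" | "https";
  curlExit: number;
  httpStatus: number;
}

function classifyEgress(probe: ProbeResult): EgressVerdict {
  // HTTPS to a blocked host: TLS reset, curl exit 35, status 000.
  if (probe.scheme === "https" && probe.curlExit === 35 && probe.httpStatus === 0) return "blocked";
  // Plain HTTP to a blocked host: the egress proxy answers 403.
  if (probe.scheme === "http" && probe.httpStatus === 403) return "blocked";
  if (probe.curlExit === 0 && probe.httpStatus > 0) return "reachable";
  return "unknown";
}
```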

That is why the stream-write path runs through a shim. *.supabase.co is reachable, so the proxy is deployed as a Supabase Edge Function at supabase/functions/stream-proxy. It is a dumb transport proxy: it forwards the request (method, path, query, body, and the original Authorization: Bearer <scoped-jwt>) verbatim to the CF worker at arbe.0sk.ar/api/stream, which owns all JWT verification, thread-scope enforcement, and secret-swap logic. The edge fn requires one secret — ARBE_WORKER_URL=https://arbe.0sk.ar/api/stream — and no longer holds DURABLE_STREAMS_SECRET. Deploy with bunx supabase functions deploy stream-proxy (config pins verify_jwt=false).
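A minimal sketch of the "dumb transport" idea: the proxy only rewrites the destination and passes everything else through, including the scoped JWT. The types and function name are hypothetical, and how the deployed function derives the path suffix from its own route is an assumption.

```typescript
// Hypothetical, framework-free view of the incoming request.
interface IncomingRequest {
  method: string;
  path: string;          // assumed: path suffix after the function's own route prefix
  query: string;         // raw query string without the leading "?"
  authorization: string; // original "Bearer <scoped-jwt>" header value
  body: string | null;
}

interface UpstreamRequest {
  url: string;
  method: string;
  headers: Record<string, string>;
  body: string | null;
}

// Build the request to send to the CF worker. No token parsing, no scope
// checks, no secrets: verification stays with the worker.
function buildUpstreamRequest(req: IncomingRequest, workerUrl: string): UpstreamRequest {
  const base = workerUrl.replace(/\/+$/, "");
  const query = req.query ? `?${req.query}` : "";
  return {
    url: `${base}${req.path}${query}`,
    method: req.method,
    headers: { Authorization: req.authorization },
    body: req.body,
  };
}
```

The result would be handed to fetch and the worker's response returned unchanged, which is why the function needs only ARBE_WORKER_URL and no signing secret.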

Two more sandbox facts: pi installs in-sandbox with npm i -g @earendil-works/pi-coding-agent@0.78.0 (node is present via language: 'typescript'), and the sandbox runs as a non-root user, so upload artifacts to a HOME-relative path, not /root.

The mirror’s write path:

mint:   mintStreamWriteJwt(houseId, threadId, DURABLE_STREAMS_SECRET) -> { jwt, expiresAt }   (2h TTL)
write:  POST <url>/api/stream/arbe-thread-<threadId>            Authorization: Bearer <jwt>
        body = { id, ts, authorId?, payload }                   (payload: chat | pi.* | signal.*)
read:   GET  <url>/api/stream/arbe-thread-<threadId>?offset=0   Authorization: Bearer <jwt>   (NDJSON)

A scoped token can only touch its own thread; a cross-thread write returns 403.
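The write and read calls above can be sketched as request builders. The helper names and the HttpRequest shape are hypothetical, and the Content-Type header and numeric ts are assumptions; the URL layout, body fields, and Authorization header follow the write path as listed.

```typescript
interface StreamEntry {
  id: string;
  ts: number; // assumed: epoch milliseconds
  authorId?: string;
  payload: { type: string }; // chat | pi.* | signal.*
}

interface HttpRequest {
  method: "GET" | "POST";
  url: string;
  headers: Record<string, string>;
  body?: string;
}

function threadStreamUrl(baseUrl: string, threadId: string): string {
  return `${baseUrl.replace(/\/+$/, "")}/api/stream/arbe-thread-${threadId}`;
}

// POST one entry to the thread's stream using the scoped JWT.
function buildWrite(baseUrl: string, threadId: string, jwt: string, entry: StreamEntry): HttpRequest {
  return {
    method: "POST",
    url: threadStreamUrl(baseUrl, threadId),
    headers: { Authorization: `Bearer ${jwt}`, "Content-Type": "application/json" },
    body: JSON.stringify(entry),
  };
}

// GET the stream from an offset; the response is NDJSON, one entry per line.
function buildRead(baseUrl: string, threadId: string, jwt: string, offset = 0): HttpRequest {
  return {
    method: "GET",
    url: `${threadStreamUrl(baseUrl, threadId)}?offset=${offset}`,
    headers: { Authorization: `Bearer ${jwt}` },
  };
}
```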

The pi extension reads its config from these environment variables (via readPiThreadMirrorEnv in packages/sandbox/src/pi-extension/thread-mirror.ts). The sandbox must export the same canonical names dispatch uses:

Env var                      Required   Meaning
ARBE_THREAD_ID               yes        target thread
ARBE_STREAM_URL              yes        stream-write base URL: the Supabase edge fn from the sandbox; the CF worker only off-sandbox
ARBE_STREAM_TOKEN            yes        scoped stream:write JWT (the jwt from mintStreamWriteJwt)
ARBE_AUTHOR_ID               no         author stamped on entries
ARBE_PI_MIRROR_NEXT_INDEX    no         resume offset
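A sketch of reading that table into a typed config, failing fast on a missing required variable. This is not readPiThreadMirrorEnv: the function name, the MirrorEnv shape, and the default resume offset of 0 are assumptions.

```typescript
interface MirrorEnv {
  threadId: string;
  streamUrl: string;
  streamToken: string;
  authorId?: string;
  nextIndex: number; // assumed default: 0 when ARBE_PI_MIRROR_NEXT_INDEX is unset
}

function readMirrorEnvSketch(env: Record<string, string | undefined>): MirrorEnv {
  const required = (name: string): string => {
    const value = env[name];
    if (!value) throw new Error(`missing required env var ${name}`);
    return value;
  };

  const rawIndex = env["ARBE_PI_MIRROR_NEXT_INDEX"];
  const nextIndex = rawIndex === undefined ? 0 : Number(rawIndex);
  if (!Number.isInteger(nextIndex) || nextIndex < 0) {
    throw new Error(`ARBE_PI_MIRROR_NEXT_INDEX must be a non-negative integer, got "${rawIndex}"`);
  }

  const authorId = env["ARBE_AUTHOR_ID"];
  return {
    threadId: required("ARBE_THREAD_ID"),
    streamUrl: required("ARBE_STREAM_URL"),
    streamToken: required("ARBE_STREAM_TOKEN"),
    ...(authorId ? { authorId } : {}),
    nextIndex,
  };
}
```

Taking the environment as a plain record keeps the reader testable without touching the real process environment.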

Box lifecycle

A sandbox is a machine, with a lifecycle independent of any run. pi is one process inside it; many threads, commands, and runs share a box, and the box outlives any single run (schema-v2 req 4 & 6). A pi terminal ends a process, never the machine — box teardown is never keyed on a run’s terminal, and never on a wall-clock timeout against the run.

There is no run timeout. A coding agent runs until it finishes; ARBE_PI_TIMEOUT is a ~3-day runaway guard for a hung process, not a run length.

Reaping is arbe-owned and idle-based. A box stays up while any thread on it (threads.sandbox_id) is running; once none is, the reconcile/prune sweep (reconcileStuckThread / pruneStuckThreads) deletes it — the same seam that clears stale threads. When a run finishes there, reapSandboxIfIdle deletes the box (via an injected reapBox, so @arbe/core stays provider-free) and tombstones its row. Resume re-resolves to the environment’s live box, or makes a fresh one, so idle means delete; a stopped box buys nothing. The discriminator is sandboxes.ephemeral: delegate_task boxes are ephemeral=true and reapable, while an environment’s shared box (inline run_command) is ephemeral=false and never touched.
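The idle-based rule reduces to a predicate over the box row and the threads that point at it. The row shapes and function name below are hypothetical stand-ins for the sandboxes and threads tables; this is not reapSandboxIfIdle.

```typescript
interface SandboxRow {
  id: string;
  ephemeral: boolean; // true for delegate_task boxes, false for an environment's shared box
}

interface ThreadRow {
  sandboxId: string;
  status: "idle" | "running" | "completed" | "failed";
}

// A box is reapable only if it is ephemeral and no thread on it is running.
// Nothing here looks at run duration or a run's terminal entry.
function shouldReap(box: SandboxRow, threads: ThreadRow[]): boolean {
  if (!box.ephemeral) return false;
  return !threads.some((t) => t.sandboxId === box.id && t.status === "running");
}
```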

Death is pull-confirmed, never webhook-driven — one org-wide webhook is a forged-event risk against other houses’ boxes. confirmSandboxLiveness(sandboxId, probeBox) asks Daytona about a house-scoped provider_ref through that house’s own runtime: a missing box (404) or terminal-fault state (error/build_failed/removing) tombstones the sandboxes row to dead and orphans the threads on it; a live/unknown verdict is a no-op, so a network blip never tombstones a real box. It rides the reaper’s two seams: the read-path/prune reconcile (the current thread’s box) and arbe sandbox list --reconcile (the cold-row sweep — there is no cron). probeBox is injected by the worker, mirroring reapBox.
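The verdict rule can be sketched as a mapping from a probe result to an action. The Probe union and function name are hypothetical; the 404 case and the three terminal-fault states come from the paragraph above.

```typescript
// Hypothetical probe result: the provider said the box is missing (404),
// reported a state, or the probe itself failed (network blip, timeout).
type Probe =
  | { kind: "missing" }
  | { kind: "state"; state: string }
  | { kind: "probe_failed" };

type LivenessAction = "tombstone" | "no-op";

const TERMINAL_FAULT_STATES: ReadonlySet<string> = new Set(["error", "build_failed", "removing"]);

// Only positive evidence of death tombstones the row. Anything live or
// unknown is a no-op, so a failed probe never kills a real box.
function livenessAction(probe: Probe): LivenessAction {
  if (probe.kind === "missing") return "tombstone";
  if (probe.kind === "state" && TERMINAL_FAULT_STATES.has(probe.state)) return "tombstone";
  return "no-op";
}
```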

Daytona’s own auto-stop/delete is the dumb backstop. createSandbox sets autoStopInterval from the runaway guard plus a buffer (Daytona counts only API calls as activity, and a detached run makes none, so the clock runs from launch — the value must clear the guard or it would stop a live run) and autoDeleteInterval: 0. This catches an abandoned box (worker dead, runner crashed) after ~3 days; it is not the primary reaper.
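The arithmetic behind the backstop is simple enough to sketch. The function name, the millisecond unit for the guard, the buffer parameter, and the assumption that Daytona's intervals are expressed in minutes are all hypothetical here; check them against createSandbox before relying on the numbers.

```typescript
interface BackstopIntervals {
  autoStopInterval: number;   // assumed unit: minutes
  autoDeleteInterval: number; // 0, as set by createSandbox per the text above
}

// The stop clock effectively starts at launch for a detached run, so the
// interval must be strictly longer than the runaway guard.
function backstopIntervals(runawayGuardMs: number, bufferMinutes: number): BackstopIntervals {
  if (runawayGuardMs <= 0 || bufferMinutes <= 0) {
    throw new Error("guard and buffer must be positive");
  }
  const guardMinutes = Math.ceil(runawayGuardMs / 60000);
  return { autoStopInterval: guardMinutes + bufferMinutes, autoDeleteInterval: 0 };
}
```

For a 3-day guard (4320 minutes) and a one-hour buffer, this yields an autoStopInterval of 4380.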

Spawned boxes are labelled arbe.house / arbe.thread / arbe.environment on create, so arbe sandbox list --runtime daytona shows which run owns a box. Driving the broader lifecycle is the Daytona epic (arbe-7959).

Formerly an open question, now answered (2026-06): the egress whitelist cannot include arbe.0sk.ar at our org tier. Tier 1/2 restrictions are not overridable, and tier 3/4 (networkAllowList, max 10 IPv4 CIDRs) is out of budget. So all sandbox-to-thread traffic flows through the edge function permanently, and the two proxies must be kept from drifting.

See sprite runtime, dispatch, environments, secrets, runtime.