Workflows

A workflow is a recipe; a run is a conversation. The recipe — an agent plus an ordered list of steps — is a row a house owns. Spawning it opens a fresh thread and the run plays out there: each step is a message posted into the thread, the agent’s reply is the work, and when the reply lands the next step goes out. Between steps, nothing runs anywhere.

Four step kinds (@arbe/core/schemas/workflow):

run_command: the bot runs a shell command in the thread’s sandbox;
dispatch_task: a natural-language prompt the bot acts on with its normal tools (e.g. dispatching a pi coding agent);
sleep: a durable pause for a set number of seconds; nothing posts to the thread and no process is held;
human_gate: posts a prompt and parks until a human replies in the thread.
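
The four kinds can be pictured as a discriminated union. This is a minimal sketch, not the contents of @arbe/core/schemas/workflow: kind, name, command, and prompt appear in the recipe example later in this guide, while the seconds field on sleep, the prompt field on human_gate, and the describeStep helper are assumptions made for illustration.

```typescript
// Hypothetical shapes; the real schema lives in @arbe/core/schemas/workflow.
type Step =
  | { kind: "run_command"; name: string; command: string }
  | { kind: "dispatch_task"; name: string; prompt: string }
  | { kind: "sleep"; name: string; seconds: number } // field name assumed
  | { kind: "human_gate"; name: string; prompt: string }; // field name assumed

// Summarize what a step puts into the thread, if anything.
function describeStep(step: Step): string {
  switch (step.kind) {
    case "run_command":
      return `run in sandbox: ${step.command}`;
    case "dispatch_task":
      return `prompt the bot: ${step.prompt}`;
    case "sleep":
      return `pause ${step.seconds}s (nothing posted)`;
    case "human_gate":
      return `ask a human: ${step.prompt}`;
  }
}

const exampleSteps: Step[] = [
  { kind: "run_command", name: "build", command: "npm run build" },
  { kind: "sleep", name: "wait", seconds: 3600 },
  { kind: "human_gate", name: "approve", prompt: "Ship it?" },
];
const summaries: string[] = exampleSteps.map(describeStep);
```

Because the union is closed, adding a fifth kind would make the switch fail type-checking until it is handled, which is one way to keep step handling exhaustive.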

A step’s intent is natural language and the bot’s tools do the work, so a workflow can do anything an agent can — and a new workflow is a row, not a deploy.

A step’s command/prompt may carry {{path}} placeholders, rendered per run against the run’s payload — a JSON object the trigger supplies at spawn (arbe wf spawn <id> --payload '{"branch":"main"}', or payload on POST /api/wf). The payload is run-scoped, not part of the recipe, so editing it never touches the row; cron and bare spawns pass {}. An unresolved {{path}} fails the step early with that path named, rather than posting a half-blank instruction.
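
The rendering rule can be sketched in a few lines. This is an illustration under assumptions, not the engine's renderer: the dotted-path lookup, the JSON stringification of non-string values, and the wording of the error are all hypothetical. The one behavior taken from the text is that an unresolved placeholder fails with the path named.

```typescript
type Payload = Record<string, unknown>;

// Look up a dotted path such as "repo.branch" in the payload (path syntax assumed).
function lookup(payload: Payload, path: string): unknown {
  let current: unknown = payload;
  for (const key of path.split(".")) {
    if (
      current === null ||
      typeof current !== "object" ||
      !Object.prototype.hasOwnProperty.call(current, key)
    ) {
      return undefined;
    }
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Replace every {{path}}; throw on the first path the payload cannot resolve.
function renderTemplate(template: string, payload: Payload): string {
  return template.replace(
    /\{\{\s*([\w.]+)\s*\}\}/g,
    (_match: string, path: string): string => {
      const value = lookup(payload, path);
      if (value === undefined || value === null) {
        throw new Error(`unresolved placeholder: {{${path}}}`);
      }
      return typeof value === "string" ? value : JSON.stringify(value);
    },
  );
}

const rendered = renderTemplate("git checkout {{branch}}", { branch: "main" });
```

Here rendered is "git checkout main", while rendering the same template against {} throws an error that names branch instead of posting a half-blank command.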

What workflows add to threads is time and durability:

a run can fire on its own (schedule or event), with no one watching;
it can sleep for hours or days, or gate on a human reply, at zero compute and holding no process open;
every finished step is checkpointed, so a crashed run resumes where it left off and never repeats a finished step;
a retried run keeps its identity — same run id, the attempts count ticks up.
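
The checkpoint and retry behavior in the list above can be sketched with an in-memory stand-in. Everything here is hypothetical (RunState, attempt, the step functions); the real engine persists checkpoints durably rather than in a Map. The sketch only shows the shape of the guarantee: finished steps are skipped on retry, and the run object keeps its identity while its attempt count rises.

```typescript
type StepFn = () => string;

// Hypothetical in-memory stand-in for the engine's durable checkpoint store.
class RunState {
  attempts = 0;
  readonly checkpoints = new Map<string, string>(); // step name -> result
}

// Execute steps in order, skipping any whose name already has a checkpoint.
function attempt(run: RunState, steps: Array<[string, StepFn]>): void {
  run.attempts += 1;
  for (const [name, fn] of steps) {
    if (run.checkpoints.has(name)) continue; // finished earlier, not repeated
    run.checkpoints.set(name, fn()); // a throw here leaves earlier checkpoints intact
  }
}

let sweepCalls = 0;
let failOnce = true;
const steps: Array<[string, StepFn]> = [
  ["sweep", () => { sweepCalls += 1; return "200 0.12s /"; }],
  ["report", () => {
    if (failOnce) { failOnce = false; throw new Error("crash"); }
    return "all is well";
  }],
];

const run = new RunState();
try {
  attempt(run, steps); // first attempt crashes inside "report"
} catch {
  // the crash is expected in this sketch
}
attempt(run, steps); // second attempt skips "sweep" and re-runs "report"
```

After the second attempt, sweep has executed once, report has a checkpoint, and run.attempts is 2.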

Author and watch them in the console at /workflows: create at /workflows/new, edit at /workflows/<id>/edit, delete from the workflows page. Spawn a workflow, follow the live stepper, and reply to a gate in the embedded run thread. The same surface is GET/POST /api/workflows and GET/PATCH/DELETE /api/workflows/<id>. From the CLI:

printf '%s' '{"name":"nightly","steps":[]}' | arbe wf create --stdin  # create a recipe
arbe wf spawn <workflow_id>   # fire a run
arbe wf runs                  # recent runs, newest first (+ conductor heartbeat)
arbe wf show <run_id>         # one run: steps + events (short-id prefix ok)
arbe wf cancel <run_id>       # stop a run in place — no-op if already terminal
arbe wf proof <workflow_id>   # spawn + verify it completes — e2e health check, ~30s–3m

Each run gets its own thread, auto-named <workflow> <YYYY-MM-DD> — a nightly workflow opens a new one every night. The thread is the run’s log and its control surface: open it, read it, interrupt it, reply to unblock a gate.
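
The naming pattern is simple enough to sketch. The threadName helper is hypothetical, and taking the date in UTC is an assumption; the text only gives the <workflow> <YYYY-MM-DD> shape.

```typescript
// Build "<workflow> <YYYY-MM-DD>"; using the UTC calendar date is an assumption.
function threadName(workflow: string, at: Date): string {
  return `${workflow} ${at.toISOString().slice(0, 10)}`;
}

const exampleName = threadName(
  "nightly-route-health",
  new Date(Date.UTC(2026, 5, 1, 5, 0)), // 1 June 2026, 05:00 UTC
);
```

With those inputs, exampleName is "nightly-route-health 2026-06-01".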

A real recipe, end to end — sweep the website’s public routes every night and report (POST /api/workflows):

{
  "houseId": "<house>",
  "agentId": "<bot>",
  "name": "nightly-route-health",
  "schedule": "0 5 * * *",
  "steps": [
    {
      "kind": "run_command",
      "name": "sweep",
      "command": "for p in / /about /login /guide; do curl -s -o /dev/null -w \"%{http_code} %{time_total}s $p\\n\" \"https://arbe.0sk.ar$p\"; done"
    },
    {
      "kind": "dispatch_task",
      "name": "report",
      "prompt": "The previous step swept the public routes (status, total time, path per line). Write a one-paragraph health report: flag any non-200 or any route slower than 1.5s as a regression; otherwise say all is well and name the slowest route."
    }
  ]
}

Every night a fresh thread opens, the sandbox runs the sweep, the bot reads the numbers and writes the verdict — the thread is the report. Add a human_gate step after report and the run parks until someone replies; that’s an approval flow, same recipe shape.
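
In the recipe the bot applies the report rule in natural language. Spelled out as code, the same rule looks roughly like the sketch below. The parseLine and summarize helpers are hypothetical; the line shape follows the curl -w format in the sweep step, and the 200 and 1.5s thresholds come from the report prompt.

```typescript
interface RouteSample {
  status: number;
  seconds: number;
  path: string;
}

// Parse one line shaped like "200 0.123s /about".
function parseLine(line: string): RouteSample | null {
  const m = /^(\d{3}) ([\d.]+)s (\S+)$/.exec(line.trim());
  if (m === null) return null;
  return { status: Number(m[1]), seconds: Number(m[2]), path: m[3] };
}

// Flag non-200 or slow routes and find the slowest route overall.
function summarize(output: string, slowSeconds = 1.5) {
  const samples = output
    .split("\n")
    .map((line) => parseLine(line))
    .filter((s): s is RouteSample => s !== null);
  const regressions = samples.filter(
    (s) => s.status !== 200 || s.seconds > slowSeconds,
  );
  const slowest = samples.reduce<RouteSample | null>(
    (a, b) => (a === null || b.seconds > a.seconds ? b : a),
    null,
  );
  return { regressions, slowest };
}

const report = summarize(
  "200 0.120000s /\n200 1.800000s /about\n000 0.010000s /login\n",
);
```

For that sample output, /about (too slow) and /login (status 000) are flagged as regressions, and /about is the slowest route.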

Caveat (2026-06): sandbox egress is allowlisted at our Daytona tier — github, npm, and LLM APIs work, but arbitrary hosts (including arbe.0sk.ar) get a TLS reset, so this exact sweep returns 000s today (arbe-5783). The recipe exists as nightly-route-health, unscheduled until egress is solved.

Five gotchas:

a run_command step is one bot turn, so it must finish inside the pi turn cap (~2 min). A longer command loses the turn; the run then fails with a reason telling you to restructure it (arbe-5041), rather than parking silently. For slow work (installs, builds, suites), background it across three steps: the first starts the job with nohup … & echo started and writes its output to a log, a sleep step waits, and a third step tails the log;
a step’s name is its checkpoint identity — renaming a step makes it run again, and names must be unique within a workflow;
steps are snapshotted at spawn, so editing or deleting a recipe never touches a run already in flight (historical runs carry their own snapshot);
runs sitting pending mean the conductor daemon is down; arbe wf runs and the console’s health badge both say so;
a spawned run gets a bounded retry budget (max 5 attempts, exponential backoff from 10s up to 300s) rather than retrying forever — retryable step failures re-run that step in place while earlier steps stay cached; permanent failures fail fast instead of burning attempts on the same bot turn. If a run still needs to be stopped before it exhausts its attempts, arbe wf cancel <run_id> (or DELETE /api/wf?run=<id>) cancels it in place.
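
The retry budget in the last gotcha can be sketched as a delay function. The limits (5 attempts, 10s base, 300s cap) come from the text; the doubling factor and the retryDelay helper are assumptions, since the text does not say how fast the backoff grows.

```typescript
const MAX_ATTEMPTS = 5;
const BASE_SECONDS = 10;
const CAP_SECONDS = 300;

// Delay in seconds before retrying after the given failed attempt (1-based),
// or null once the attempt budget is spent. The growth factor is assumed.
function retryDelay(failedAttempt: number, factor = 2): number | null {
  if (failedAttempt >= MAX_ATTEMPTS) return null;
  return Math.min(CAP_SECONDS, BASE_SECONDS * factor ** (failedAttempt - 1));
}

const delays = [1, 2, 3, 4, 5].map((n) => retryDelay(n));
```

With doubling, delays is [10, 20, 40, 80, null], so the 300s cap is never reached inside five attempts; a steeper factor (for example 4) would hit the cap on the fourth retry.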

Triggers

Every trigger does one thing: call wf_spawn(id, payload). So there is one door, not a growing list of trigger types — cron is the door with no caller, a manual spawn is a human calling it, and an inbound webhook (future) is an external system calling it with its body as payload. A new integration is “point a caller at the door and map its body into the recipe’s {{...}}”, not new engine code.
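
A sketch of the one-door idea, under assumptions: wfSpawn stands in for wf_spawn(id, payload), and the in-memory queue, the run id format, and the three caller helpers are hypothetical. The point is only that each trigger differs in who calls and what payload it passes, not in engine code.

```typescript
type Payload = Record<string, unknown>;

interface Run {
  runId: string;
  workflowId: string;
  payload: Payload;
}

const queue: Run[] = []; // hypothetical stand-in for the durable run queue

// Stand-in for wf_spawn(id, payload): every trigger ends up here.
function wfSpawn(workflowId: string, payload: Payload = {}): Run {
  const run: Run = { runId: `run-${queue.length + 1}`, workflowId, payload };
  queue.push(run);
  return run;
}

// Cron: no caller, so the payload is empty.
const fromCron = (workflowId: string): Run => wfSpawn(workflowId);

// Manual spawn: a human supplies the payload.
const fromHuman = (workflowId: string, payload: Payload): Run =>
  wfSpawn(workflowId, payload);

// Inbound webhook (future, per the text): the request body becomes the payload.
const fromWebhook = (workflowId: string, body: string): Run =>
  wfSpawn(workflowId, JSON.parse(body) as Payload);

const cronRun = fromCron("wf-1");
const humanRun = fromHuman("wf-1", { branch: "main" });
const hookRun = fromWebhook("wf-1", '{"branch":"release"}');
```

All three runs land in the same queue through the same function, which is why a new integration is a new caller rather than a new trigger type.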

A schedule is part of the recipe: set schedule on the workflow row to a cron expression (UTC) and each slot spawns a run with an empty payload, exactly as a bare arbe wf spawn does. Clear it to stop; an invalid expression is rejected at write time. A database trigger mirrors the schedule column into one pg_cron job per workflow (wf:<id>), so there is no scheduler daemon: Postgres fires, the conductor executes.

Under the hood

Absurd — Postgres-native durable execution — is the engine, driven by the conductor (apps/workflow-conductor), a daemon that polls the queue. Upstream engine reference: Absurd README. The wrapper is one word each way, and engine vocabulary stays below this line:

arbe says	Absurd says
run	task (the stable identity)
attempt	run (one try at it)
step	checkpoint
sleep / gate	`sleepFor` / `awaitEvent`

Design and decisions: thinking/durable-workflows.