# Durable workflows

Status: engine complete, schedule live. Proven live (2026-06-09): the engine, the conductor daemon, the least-privilege role, read + spawn, the **send seam**, the **one-bot invariant** (`createThread` seeds the driving bot `triggerMode: 'always'`), the **finish seam** both ways — inline in `/api/wf/step` and the **reconcile backstop** (`POST /api/wf/reconcile`, called by the conductor on a 30s timer) — plus **durable sleep** and the **human-gate** step. **Proof 1**: conductor killed mid-run; the step event emitted while it was dead; restart resumed exactly-once; thread reads clean. **Proof 2**: a run slept 20s at zero compute, parked on a human gate, and continued seconds after a human replied — released by the sweep, which is the gate's *only* path (a human reply fires no workflow hook). **Proof 3 (schedule)**: a workflow with `schedule = '* * * * *'` fired at five consecutive minute slots, each spawning a run the conductor completed in one attempt; clearing the column removed the cron job. **Operational (2026-06-09)**: the conductor runs as a supervised sprite service on `arbe1` (auto-restart on crash + boot, kill-proven), heartbeats via the reconcile sweep, and `arbe wf runs` reports its liveness — closing arbe-5793. **Console (arbe-ea32, 2026-06-12)**: the `/workflows` playground — create a recipe, spawn, live stepper (active / sleep / gate / failed per step), conductor health badge, Absurd-internals debug pane (checkpoints, events, await state), embedded run thread. **Thread-driven**: the conductor never runs tools; it sends an instruction into the run thread and the bound bot does the work. This doc is the design and the decisions as built.

## Why

Now that we have agents, threads, tools, and sandboxes (`dispatch_task`, pi inside a coding agent), the missing piece is **time and durability**: a thing that wakes on a schedule or an event, runs a sequence of steps that may span minutes or days, survives restarts, and can pause to wait for a human. Cron jobs but composed of agent work, and observable as conversation.

arbe has an unfair advantage over Temporal / Inngest / pg_durable: **every run is already a thread you can watch, interrupt, and talk to.** We don't get opaque background jobs — we get durable workflows that are also threads. Human-in-the-loop is the part generic engines are worst at, and for us it's free: "wait for a human reply" is just waiting on a stream entry.

## Goal — done when

A user can describe a recurring job **once** — e.g. "every morning run this command suite in a sandbox, and if a check fails have pi investigate and post findings to a thread; wait for my 👍 before taking the next action" — close the laptop, and trust that:

- it **fires on its own** (schedule or event), with no one watching;
- each run is a **thread** they can open, read, and interrupt;
- it **survives restarts** — a crash mid-run resumes from the last finished step, never re-runs a completed one, never silently drops;
- it can **sleep for hours or days and block on a human** without holding any process open;
- every step is just `run_command` or `dispatch_task` — **no new language to learn.**

One line: *a scheduled-or-triggered chain of `run_command` / `dispatch_task` steps that runs unattended, resumes after crashes, can wait on time or a human, and is watchable as a thread the whole way through.*

The sharp test that proves it: kill the conductor halfway through a run, bring it back, and the workflow finishes correctly — exactly once per step — and the thread reads like nothing went wrong.

Note: examples here avoid repository / GitHub actions on purpose — we have not wired GitHub tokens into arbe yet. The model applies to repos too once that exists.

## Decision — adopt Absurd

Use [**Absurd**](https://github.com/earendil-works/absurd) as the durable-execution engine via its TypeScript SDK (`absurd-sdk`). Absurd is Postgres-native durable execution: a SQL schema holds the engine (queue, checkpoints, retries, sleep, events) and a thin SDK (~2K LOC) is a client over it. We write a small daemon (the conductor) that maps Absurd's primitives onto arbe's two tool calls.

Key clarification that settled it: **Absurd is not a layer on pgmq — it ships its own queue.** So pgmq / Supabase Queues and Absurd are alternatives at the same layer, not a stack. We pick Absurd because it gives us the step journal, sleep, and events for free; pgmq would give us only the mailbox and leave the durable layer to hand-roll.

It also stays inside arbe's no-DSL line: Absurd steps are **ordinary code around checkpoints** (`ctx.step(...)`), not a SQL DSL like pg_durable. It re-runs around cached checkpoints rather than doing deterministic replay.

Naming: arbe surfaces speak **workflow / run / step / attempt** and never leak engine words (task, queue, checkpoint, lease). Our *run* is Absurd's *task* — the stable identity that survives retries; Absurd's own "run" (one try) surfaces as *attempt*. The `wf_*` RPCs key on it (migration `20260609080000`). User-facing articulation: [docs/workflows.md](../workflows.md).

Mapping Absurd → arbe:

| Need | Absurd | arbe wrapping |
|------|--------|---------------|
| Workflow definition | `app.registerTask({name}, handler)` | a `workflow` record |
| A run | `app.spawn(name, params, {idempotencyKey})` | a run thread (child room) |
| A step | `ctx.step('name', () => …)` | **send** an instruction into the run thread; the bound bot runs `run_command` / `dispatch_task` |
| Durable sleep | `ctx.sleepFor` / `ctx.sleepUntil` | — |
| Wait for human | `ctx.awaitEvent(...)` | a step whose done-check is "a human replied" |
| Step done | `ctx.awaitEvent('wf.step.done:<runId>:<step>')` | **finish**: the step's done-check passes → www emits (inline + reconcile) |
| Schedule | pg_cron → `absurd.spawn_task(...)` | cron entry per workflow |
| Exactly-once-ish | `startWorker()` daemon + step caching | `apps/workflow-conductor` on `arbe1` |
| Agent loop | `beginStep`/`completeStep` over a message log | one `dispatch_task` / pi turn-loop |

The agent-loop fit is exact: Absurd's [AI-agent pattern](https://github.com/earendil-works/absurd/blob/main/docs/patterns/pi-ai-agent.md) appends each `message_end` to a durable step log and resumes the loop when the last message isn't the assistant's — i.e. when it's waiting on a human. That is `dispatch_task` / pi, durable.

License is permissive (confirmed); we pin `absurd-sdk@0.4.0` and the matching schema (`sql/absurd.sql` @ tag `0.4.0`) checked into our migrations.

## The hard constraint: no new primitive, no DSL

This is the design line, straight from [`what-not-to-build`](./what-not-to-build.md) and [`primitives`](./primitives.md):

- **No workflow DSL.** pg_durable is the cautionary example — a whole SQL DSL (`~>`, `&`, `?>`, `df.loop`, `df.race`). That is exactly what arbe says not to build. Activation policies + permissions + tools already compose into workflows.
- **A workflow is a record** (`type: 'workflow'`), not a seventh primitive. A run is a thread (a child room). Steps are dispatches and stream entries. We already have these.
- **Durability is operational, like signals.** The step queue / timers / resume tokens are the system witnessing its own execution — closer to the signals plane than to content. They are not a new content type and not stored as stream messages.

The model to copy is Armin Ronacher's [Absurd Workflows](https://lucumr.pocoo.org/2025/11/3/absurd-workflows/): minimal, Postgres-only, `SELECT ... FOR UPDATE SKIP LOCKED`, steps journaled and replayed, **agent loops as a single self-iterating step** rather than a static DAG. That last point matters for us: an agent doing N turns is one durable step that checkpoints between turns, not a graph we have to author.

## Mechanism: Absurd's schema + a daemon we own

Absurd's SQL schema is the durable substrate — same DB local and prod, no extra infra; the hand-rolled durable table done well, with a TS client. Steps cache their result (exactly-once-per-step), `sleepFor`/`sleepUntil` give durable sleep without holding a process, `awaitEvent`/`emitEvent` are the wait-for-human/-signal mechanism, and a crashed worker's lease expires so the task is reclaimed on restart (steps stay idempotent via caching + `idempotencyKey`).

**Schedule ≠ run** — two jobs people conflate. The *schedule* is pure SQL (`cron.schedule(...)` → `select absurd.spawn_task(...)`, enqueues a row, no process). The *run* needs a JS runtime to execute the TS handler, so we run a **long-poll daemon**: `apps/workflow-conductor` runs `app.startWorker()` on the `arbe1` sprite over its `absurd_worker` connection. Nothing calls *in*; it only calls *out* to Postgres (claim work) and www (send steps). We rejected a `pg_net` → `/tick` model (an endpoint, its auth, the extension, cold-start hops for a few cents of always-on); the daemon's poll loop keeps the sprite warm, and the lease reclaims if it ever pauses.

## The engine and the run thread

A workflow run lives in two durable substrates at once, and the design *is* the bridge between them:

- **The engine — Absurd, in our Postgres.** The durable program: queue, step journal, sleep, events, leases. The generic `arbe-workflow` handler runs here, executed by the **conductor** (`apps/workflow-conductor`) over its `absurd_worker` connection. The engine decides *what step is next* and survives crashes.
- **The run thread — entries in durable streams.** Where the work happens and is watchable: the System agent's instruction, the bound bot's replies, the dispatch signals. It *does the work* and *narrates it*.

One run, two faces — an Absurd **task** and a **run thread** sharing one identity: `runId` is the Absurd task id, and `wf_run_threads` maps it to the thread. Each run gets its own fresh run thread — a nightly workflow opens a new thread every night; the thread is per *run*, not per *workflow*.

The engine and the run thread meet at exactly **two seams**:

- **Send (engine → run thread).** Per step, the conductor's `ctx.step(…)` calls www `POST /api/wf/step`; www posts the instruction into the run thread authored by the house **System** agent, and the bound bot takes a **turn** with its own `run_command` / `dispatch_task` tools. *Built and proven.*
- **Finish (run thread → engine).** When the turn ends, www emits `wf.step.done:<runId>:<step>`; the conductor — parked on `ctx.awaitEvent` for exactly that key — wakes and resumes to the next step. *Built: inline (`finishStep` in `/api/wf/step`) over the pure done-check (`@arbe/core/workflow-steps`). The reconcile backstop remains.*

Between the seams the conductor is **suspended**: `awaitEvent` frees the worker slot, so a run can sleep for days or block on a human while holding no process open. Crash-resume is free — every finished step is a cached checkpoint, so a restarted run never repeats work.

### The done-check: a step finishes by reading the run thread

The fact that shapes everything: **turn-end lives only on the external stream, never in Postgres.** A bound bot's turn ends with `signal.dispatch.completed`/`failed` *appended to the run thread's stream*; it flips no Postgres row (only the detached `dispatch_task`/sandbox path emits `signal.thread.status_changed`, so the existing `findLatestTerminal` reconciler never fires here and `threads.status` stays `idle`). There is no row-change for a Database Webhook to ride. The only way to know a turn ended is the **done-check**: read the run thread's tail and run `stepOutcome(tail, step) → completed | failed | null`. What it looks for varies by step kind — the `findLatest*` helpers are its internals:

- a **command / dispatch step** — the bound bot's `signal.dispatch.{completed,failed}` is the latest terminal: `findLatestDispatchTerminal(tail, botId)`;
- a **human-gate step** — a *human* replied after the instruction: `findLatestHumanReply(tail, afterTs)`;
- *(future)* **silence** — no terminal and no activity past a threshold → `failed`.

Same shape every time: the step parks on `awaitEvent`, and the bridge emits `wf.step.done:<awaitKey>` once that step's done-check passes. This is Absurd's own external-completion pattern — a step suspends on an event named for the thing it waits on, and something outside emits when that thing is done.

### Two ways to finish: inline (primary) + reconcile (backstop)

Reading the stream needs the stream token and service role — which www holds and the conductor deliberately does not. So the finish runs in www, two ways:

- **Inline (primary).** The send request already triggered the turn: `submitThreadEntries` hands back a `dispatch` promise that resolves when the in-process turn finishes. `/api/wf/step` awaits it, reads the tail, runs the done-check, and emits — all in the request it already owns. Lowest latency, no extra process.
- **Reconcile (backstop).** A www instance can be evicted mid-turn (`waitUntil` is not durable), losing the inline emit. So a sweep — `POST /api/wf/reconcile`, called by the **conductor on a 30s timer** — lists runs still `awaiting`, reads each tail, and emits any missed completion. **Load-bearing, not optional**: it is also the human-gate's only completion path (a human reply fires no workflow hook). We chose the conductor over `pg_cron → pg_net`: the daemon already holds the secret, and a dead conductor means no progress regardless, so the coupling costs nothing — and no HTTP plumbing or secrets land in Postgres.

They can't conflict: Absurd events are **first-write-wins**, so whichever emits `wf.step.done:<awaitKey>` first resolves the step and the other is a no-op. It's the same "stream is truth, materialize a missed terminal" shape as `pruneStuckThreads` — a sibling, not a reuse (different signal, different action).

### The foundational invariant: one run thread, one bot, always-reply

"The turn ended" is only well-defined because a run thread is bound to **exactly one worker bot that always replies.** Two reacting bots make turn-end ambiguous; a relevance/mention gate makes it never fire — the one live failure we hit (`signal.dispatch.skipped: filtered`, until the bot was @-mentioned). So the run thread is created single-bot with the worker bot forced to `triggerMode: 'always'`, enforced structurally at thread creation — not by convention. Without this the bridge is built on sand.

### State of record: `wf_run_threads`

The mapping table is the bridge's source of truth — what the backstop reads, since the conductor's state is only in-process and the sweep has no memory:

| column | role |
|---|---|
| `run_id` (pk) | the Absurd task id; one row per run |
| `thread_id` | the run thread |
| `worker_bot_id` | the one bot whose terminal = turn-end (so the sweep needn't read `workflows`) |
| `await_key` | the **current** step's key (`<runId>:<step>`), upserted on every send |
| `awaiting` | true while parked on this step's turn |
| `last_emitted_key` | idempotency / drift guard for the sweep |
| `updated_at` | sweep eligibility; leaves room for a silence-timeout |

Steps are strictly sequential per run (the conductor blocks on each `awaitEvent`), so there's never more than one live `await_key` per run — the sequential invariant *is* the concurrency control, no locking. `await_key` is authoritative: every send upserts the current key (plus `worker_bot_id`, `awaiting: true`), and the inline finish clears `awaiting` after it emits — migration `20260609090000`. The backstop only has to read this table and run the same done-check.

### Runtime flow

```diagram
  pg_cron ──▶ spawn_task('arbe-workflow', { workflowId, steps, payload })
                  │
  ── ENGINE · Absurd / Postgres ──────────────────────────────────
                  ▼
       conductor (startWorker daemon, pg-only) — for each step:
         1. ctx.step ───────────────── SEND ──────────▶ POST /api/wf/step
         2. ctx.awaitEvent('wf.step.done:<runId>:<step>')   ◀── parks here
                                                              ▲
  ── RUN THREAD · threads / durable streams ──────────────────┼───
                                                              │
       run thread (one bound bot, always-reply):              │
         System posts the instruction → bot takes a turn →    │
         signal.dispatch.{completed,failed} on the stream tail │
                  │                                            │
       FINISH = done-check over the tail ─────────────────────┘
         • inline    (primary)   /api/wf/step awaits its dispatch promise
         • reconcile (backstop)  pg_cron → /api/wf/reconcile sweeps awaiting runs
         → wf_emit_event('wf.step.done:<runId>:<step>', { outcome })  [first-write-wins]
```

The conductor stays thin: only its `absurd_worker` pg connection (zero `public` access), `APP_URL`, and a shared `CONDUCTOR_SECRET`. All privileged arbe work — open the run thread, post as the house's `kind:'system'` "System" agent (not the worker bot, which the dispatcher excludes as the trigger author; not a human, to keep attribution honest), trigger dispatch, emit the terminal event — lives in www. A new workflow is a row, not a deploy.

### Status & the two gaps (closed)

The two things that blocked the SDK are solved. **Raw Postgres connection:** a least-privilege `absurd_worker` role (migration `20260609020000`) with DML on the `absurd` tables, `execute` on its functions, and zero `public` privileges, reached over the Supabase **session pooler** (port 5432, custom role as `absurd_worker.<project-ref>`); the secret lives only in the sprite's `.env`. **No long-lived worker:** `app.startWorker()` on `arbe1` (poll + lease, no endpoint, no `pg_net`). Schema rides our migrations (`absurd.sql` @ `0.4.0`); upgrades use `absurdctl migrate --dump-sql` → commit → push. `pg_cron` is enabled; `absurd.cleanup_all_queues()` is a TODO.

**Proven end to end:** `spawn_task('ping')` → the daemon auto-runs it → `completed`, zero poke; `wf_spawn` → run enqueued as `arbe-workflow`, inspectable via `arbe wf runs` / `wf show`; and `POST /api/wf/step` → a run thread opens with the instruction authored by **System**, the bound bot runs it, `signal.dispatch.completed` lands — the full thread-driven step.

### Operate it

The daemon auto-runs anything spawned, so the loop is spawn + watch:

```bash
arbe --local wf spawn <workflowId>  # wf_spawn enqueues an arbe-workflow run
arbe --local wf runs                # daemon picks it up; watch state advance
arbe --local wf show <run_id>       # steps (checkpoints) + events
```

Two gotchas: the `absurd` schema isn't on PostgREST, so inspect through the `wf_*` RPCs / `arbe wf`, never the management API; and runs sitting `pending` mean the conductor is down — `arbe wf runs` says so itself (it leads with the conductor heartbeat and warns when the beat is >90s old or runnable runs sit unclaimed; `wf_health` RPC, fed by the reconcile sweep upserting `wf_conductors`). The conductor is a supervised sprite service on `arbe1` (`sprite-env services get workflow-conductor`) that auto-restarts on crash and on boot — kill-`-9` proven 2026-06-09. Ops one-liners: [debugging.md](../system/debugging.md#workflow-conductor).

## Triggers — one door, not a plane

The sharpening (2026-06-13): triggers don't grow like steps do. A step is *data* — a new workflow is a row. A trigger *kind* is an integration — auth, ingest, a payload schema — so each one is real work and there will be few. Don't model `triggers[]` as an open plugin plane; model the **one thing they all bottom out in**: a call to `wf_spawn(id, payload)`. Cron is that door with no caller; a manual spawn is a human calling it; an inbound webhook is an external system calling it with its body as payload. "More triggers" then costs nothing — there aren't more triggers, just more callers of the same door, and the GitHub/Linear-specific part (map *their* body into *these* steps) is per-workflow data, not engine code.

What the door carries is the only genuinely new mechanism: a **payload**. A cron run starts empty; an event run starts with context (the PR, the issue). Steps reach it via `{{path}}` placeholders the conductor renders before posting each instruction (`render()` in the conductor; migration `20260613150000` snapshots the payload into spawn params beside `steps`). An unresolved path fails the step early with the path named — silent-"" drift is a debugging tax we refuse. Payload is run-scoped (off the recipe row), so editing a recipe never disturbs a run in flight, same as steps.

`run_command` templates like everything else — no escaping. Within the recipe-author trust boundary that's no new surface: a house member can already run anything in a sandbox. The provenance that *would* matter is the external caller, so when the inbound webhook lands the control is **webhook-caller auth** (signed/scoped URL), not payload sanitization — that keeps escaping out of the hot path.

Sources, in build order — **schedule** (built, migration `20260609120000`: a `schedule` column mirrored to one pg_cron job `wf:<id>` → `select public.wf_spawn('<id>')`, validated UTC at write time, one firing per slot so no idempotency key); **events / signals** (a thread entry, tool result, or activation relays to `emitEvent`; foreshadowed by activation policies); **inbound webhook** (a stable URL → `wf_spawn` with the body as payload — the one seam left to build, and once built it *is* the events router too; unlike schedule it *does* need an idempotency key, derived from the caller's delivery id, since webhooks retry — see [thinking/workflow-triggers](./workflow-triggers.md)).

The **reconcile sweep** rides the conductor's own 30s timer (not pg_cron/pg_net — see the finish seam above) and emits any completion the inline path missed. **Database Webhooks are *not* used for turn-end**: the terminal lives on the external stream, not a Postgres row, so there's nothing for a webhook to ride — why the earlier "webhook relay" plan was dropped. Where execution lives: the conductor sends an instruction; the bound bot runs the tools. Cron only feeds and sweeps; don't scatter step logic into Edge Functions.

## Authoring: data vs. agent-driven

Key constraint from `deployments.md`: Absurd tasks are **registered handlers keyed by name**, and a worker that meets an unregistered name just defers it (two-phase rollout: workers first, producers second). So we do **not** register a task per user-workflow — that would make every new workflow a deploy.

Instead: **one generic `arbe-workflow` handler.** It reads the workflow record (steps as data) and, per step, posts the instruction into the run thread via www and `ctx.awaitEvent`s the turn's terminal event. The workflow id is the task params. `wf_spawn` **snapshots the steps into the spawn params**, so the conductor never reads `public.workflows` mid-run — a run executes the plan as it was at spawn time, and editing a recipe never changes a run in flight. The step vocabulary (`run_command` / `dispatch_task` / `sleep` / `human_gate`) is the discriminated union in `@arbe/core/schemas/workflow` — an app-level invariant, not a DB constraint. This collapses both surfaces:

- **Declarative recipe** — the record's steps are data the generic handler interprets. No deploy per workflow.
- **Agent-driven** — the agent loop is itself a step inside the same handler (the `pi-ai-agent` pattern). The thread is still the program; the handler is fixed.

Keep the engine dumb and the authoring layer swappable (cognitive layer, per [`thesis`](./thesis.md)). New *handler shapes* follow the two-phase rollout; new *workflows* never do.

## Use cases by user type

- **Solo dev / indie** — nightly check: wake → sandbox → run a command suite → if something fails, dispatch pi to investigate → post findings to a thread for morning review.
- **Eng team / lead** — gated job: run a validation step in a sandbox, agent drafts a summary, **pause for a human 👍 in the thread**, then proceed to the next action. Flakiness hunter: run a check N times in a sandbox, file a task with the offenders.
- **Researcher / analyst** — recurring data pull → analysis thread → cited weekly summary. The thread is the deliverable.
- **Ops / SRE** — scheduled health probes; on anomaly, spin a sandbox, run diagnostics, escalate by @-mentioning a human and waiting.
- **Non-technical / PM** — Monday competitive scan or inbox digest; a workflow that watches a condition and, when it trips, opens a setup thread humans join.

## Open questions

- **A dropped terminal hangs a step (mostly closed).** `signal.dispatch.failed` is best-effort — it can be swallowed after retries (`dispatch/signals.ts`). The done-check now falls back to the durable fact that survives a lost signal: the bot's final `pi.assistant` entry and its `stopReason` (`lostTerminalOutcome` in `@arbe/core/workflow-steps`). A turn that ended `stop`/`length` past a grace window reads as completed; `error`/`aborted` as failed; an over-cap `toolUse` turn stalled as the thread tail past `TOOL_USE_STALL_MS` fails the run with a reason pointing at backgrounding (arbe-5041). Still open: a step that produces no terminal *and* no final assistant entry at all — a general silence timeout keyed on `wf_run_threads.updated_at` would cover that, not yet built.
- **Idempotency for side-effecting steps.** The lease auto-extends on each checkpoint write, but a stalled worker still allows brief overlapping execution — a `dispatch_task` could fire twice. `ctx.step` caching covers in-task retries; genuinely external side effects need a key derived from `ctx.taskID` (Absurd's guidance in `concepts.md`).
- **Step log vs. stream.** Absurd's step/checkpoint log is operational (its own tables); the run's *narration* — "ran checks", "waiting for your 👍" — belongs as entries on the run thread. Keep the two cleanly separated.

## References

- [Absurd](https://github.com/earendil-works/absurd) — the engine we're adopting. [concepts](https://github.com/earendil-works/absurd/blob/main/docs/concepts.md), [database/migrations](https://github.com/earendil-works/absurd/blob/main/docs/database.md), [TS SDK](https://github.com/earendil-works/absurd/blob/main/docs/sdks/typescript.md), [deployments/rollout](https://github.com/earendil-works/absurd/blob/main/docs/patterns/deployments.md), [AI-agent pattern](https://github.com/earendil-works/absurd/blob/main/docs/patterns/pi-ai-agent.md), [cron pattern](https://github.com/earendil-works/absurd/blob/main/docs/patterns/cron.md), [comparison](https://github.com/earendil-works/absurd/blob/main/docs/comparison.md).
- [Absurd Workflows](https://lucumr.pocoo.org/2025/11/3/absurd-workflows/) — Armin Ronacher's post that frames the minimal-Postgres-durable-execution thesis.
- [Supabase Queues / pgmq quickstart](https://supabase.com/docs/guides/queues/quickstart.md) — the queue substrate we considered and did *not* pick (Absurd ships its own queue).
- [microsoft/pg_durable USER_GUIDE](https://github.com/microsoft/pg_durable/blob/main/USER_GUIDE.md) — powerful, but a SQL DSL; the thing we are deliberately *not* building.

**See:** [thinking/primitives](./primitives.md), [thinking/what-not-to-build](./what-not-to-build.md), [thinking/thesis](./thesis.md).
