House budget design

Status: shipped — per worker-key seam enforcement (worker-budget.ts); migrations 20260611140000 + 20260611150000 + 20260612120000 (default/backfill) + 20260612130000 (usage split). Prerequisite: arbe-d9b7 ledger (usage_events) + item-4 finally-block tally (errored/stalled turns now record spend)

Implementation gotcha (don’t relearn this): budget_check must be SECURITY DEFINER. The gate calls it through the request user’s RLS-scoped client, and usage_events has RLS enabled with no select policy — an invoker-rights function sees zero rows there and the sum returns 0, so the cap never trips. Definer rights (with a pinned empty search_path) let the trusted gate read the whole ledger. resolve_scope_context is SECURITY DEFINER for the same reason. See migration 20260611150000.


Grounding: what the ledger actually looks like

usage_events is an append-only table (migration 20260611100000_usage_events.sql):

house_id, agent_id, thread_id, trace_id,
capability (llm|sandbox|file_index|…),
seam (gate|reply|sandbox_provision|sandbox_exec|…),
key_source (worker|house|env),
model_ref, input_tokens, output_tokens,
cache_read_tokens, cache_write_tokens,
cost_usd,
created_at

Index: (house_id, created_at desc) — the gate query shape below hits it.

recordUsage() (packages/core/usage.ts) is fire-and-forget into Postgres + PostHog. It never throws, never blocks the caller. It is called at three seams:

  • reply: runBotTurn in a finally block, from the caller-owned tally; fires whether the turn finished clean, threw, or returned an error. (dispatch.ts lines 806–831)
  • gate: ambient relevance gate before a turn fires. (dispatch.ts lines 269–283)
  • sandbox_provision / sandbox_exec: sandbox tools (sandbox-tools.ts)

flushUsage() drains all in-flight writes in the outermost dispatch finally so workerd doesn’t drop them when the response returns.
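
The fire-and-forget plus flush pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in packages/core/usage.ts: the UsageEvent shape, the Sink type, and the inFlight set are assumptions made for the example, and the real recordUsage() also writes to PostHog.

```typescript
// Hypothetical shapes for illustration; the real ledger row has more columns.
type UsageEvent = { houseId: string; seam: string; costUsd: number };
type Sink = (event: UsageEvent) => Promise<void>;

// Writes that have been started but have not settled yet.
const inFlight = new Set<Promise<void>>();

// Starts the write and returns immediately. Never throws, never blocks.
function recordUsage(event: UsageEvent, sink: Sink): void {
  const write: Promise<void> = Promise.resolve()
    // Calling the sink inside then() turns a synchronous throw into a rejection.
    .then(() => sink(event))
    .catch(() => {
      // Swallow: a ledger failure must not break the caller's turn.
    })
    .then(() => {
      inFlight.delete(write);
    });
  inFlight.add(write);
}

// Awaited once in the outermost finally so pending writes are not dropped.
async function flushUsage(): Promise<void> {
  // Snapshot the set first: writes remove themselves as they settle.
  await Promise.all(Array.from(inFlight));
}
```

Because each tracked promise swallows its own failure, flushUsage() cannot reject; the trade-off is that a failed write is silent unless the sink itself logs it.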

The existing usage_house_overview RPC aggregates over usage_events in a single SQL pass (migration 20260611130000). by_capability, by_model, and by_thread sections are already used by the usage UI. No cached counter table exists today.

keySource attribution matters: worker = arbe pays; house/env = the house is using its own BYOK. A cap on arbe-funded spend (key_source = 'worker') means something different from a cap on total spend. Policy must say which.
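
To make the distinction concrete, here is a small sketch that splits a set of ledger rows into arbe-funded and BYOK spend. The UsageRow type and the splitSpend name are hypothetical; only the key_source values and the cost_usd column come from the ledger described above.

```typescript
type KeySource = "worker" | "house" | "env";

// Hypothetical subset of a usage_events row: only the fields this split needs.
interface UsageRow {
  key_source: KeySource;
  cost_usd: number;
}

// Sums spend separately for arbe-funded rows (worker) and BYOK rows (house/env).
function splitSpend(rows: UsageRow[]): { arbeFundedUsd: number; byokUsd: number } {
  let arbeFundedUsd = 0;
  let byokUsd = 0;
  for (const row of rows) {
    if (row.key_source === "worker") {
      arbeFundedUsd += row.cost_usd;
    } else {
      byokUsd += row.cost_usd;
    }
  }
  return { arbeFundedUsd, byokUsd };
}
```

A cap compared against arbeFundedUsd ignores whatever the house spends on its own keys; a cap compared against the sum of both does not.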


Decision 1 — Policy model

Options

| Shape | Description |
| --- | --- |
| Monthly rolling | Cap resets on a calendar-month boundary. Most natural billing analogy. |
| One-time / lifetime | Bucket of credit that depletes and never refills without manual action. |
| Both (pick per house) | A cap_type field + cap_amount; default null = uncapped. |

Recommendation: one-time (lifetime) cap for v1, stored as a single nullable column on houses. Monthly is a later add-on.

Rationale:

  • One-time is the simplest correct thing and it removes a whole decision: with no reset window there is no calendar-vs-rolling question (OQ-2 dissolves), no created_at >= window_start clause in the gate query, and no reset cron/trigger ever. The gate is just “total arbe-funded spend so far vs the cap.”
  • A refill is a manual bump of the column. At alpha (no users but us) that is the right amount of machinery, not a gap.
  • Monthly is a strict superset: when a real billing-cycle need appears, add a window_start back to the query (and a cap_period field) without changing the column’s meaning. Nothing here forecloses it.
  • A nullable column on houses is the smallest correct change: no new table, no new FK, no join in the enforcement query. The house row already has id, name, created_at; a spend_cap_usd numeric(12,6) column fits naturally.

Decided (OQ-1) — cap arbe-funded spend only. v1 enforces against key_source = 'worker' rows (turns arbe subsidises). BYOK spend (key_source IN ('house','env')) is the house’s own provider bill and is never capped by default. The gate query therefore filters key_source = 'worker'.

Forward-compat: a house may later want to cap its own BYOK spend too. The single nullable spend_cap_usd column does not preclude this — when we add it, introduce a cap_scope field (worker default vs all) rather than changing the existing column’s meaning. Out of scope for v1; noted so the column choice stays compatible.

Resolved (OQ-2) — no reset window in v1. The cap is one-time/lifetime, so there is no anchor to choose. If monthly is added later, the anchor question returns then (calendar-month date_trunc('month', now()) vs rolling now() - interval '30 days') — not now.


Decision 2 — Enforcement seam

Why exact pre-dispatch costing is impractical

A turn’s cost is only known after the model call returns (usage tokens are in the response, not the request). The gate call itself also spends tokens. There is no way to know the exact cost before starting a turn.

Options

| Seam | Description | Trade-off |
| --- | --- | --- |
| Pre-dispatch gate | Refuse to start a turn if the running total already exceeds the cap. | May refuse a turn that would have cost $0.001 while the house is $0.0001 over; still lets the current in-flight turn complete first. |
| Post-turn block | Finish the turn, block the next one. | Lets one turn overshoot the cap by its own cost. For a $1 cap and a $0.05 turn, that is at most a 5% overshoot. |

Recommendation: pre-dispatch gate (check before starting a new turn, not after).

Rationale: the check is “is the running total already over the cap?”, not “will this turn put us over?”. The answer is known from committed usage_events rows before the turn starts. A house that is over budget at turn-start gets a clear refusal. A house that goes over during a turn has already committed the spend — accepting that one overshoot is unavoidable (and is bounded by a single turn’s cost). The pre-dispatch check is the right place to insert the guard: createThreadDispatcher loads scope.houseId before the bot loop, and a single RPC call there answers the question.

The enforcement seam is dispatch.ts just after resolveScopeContext returns scope.houseId (line ~165), before the bot classification loop. The gate is synchronous from dispatch’s perspective (one await), and an over-budget result short-circuits dispatch with a signal.dispatch.skipped whose reason is budget_exceeded.
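
The decision the gate makes once the total is known can be sketched as a pure function. This is an illustration under assumptions, not the code in worker-budget.ts: the GateResult type, the budgetGate name, and the exact object shape of the skipped signal are hypothetical. The >= comparison and the null = uncapped rule follow the rest of this document.

```typescript
// Assumed shape of the skipped signal; the real entry may carry more fields.
type SkippedSignal = { type: "signal.dispatch.skipped"; reason: "budget_exceeded" };

type GateResult =
  | { allowed: true }
  | { allowed: false; signal: SkippedSignal; totalCostUsd: number; spendCapUsd: number };

// totalCostUsd: committed arbe-funded spend read from the ledger before the turn.
// spendCapUsd: the house's cap; null means the house is uncapped.
function budgetGate(totalCostUsd: number, spendCapUsd: number | null): GateResult {
  if (spendCapUsd === null) {
    return { allowed: true };
  }
  if (totalCostUsd >= spendCapUsd) {
    return {
      allowed: false,
      signal: { type: "signal.dispatch.skipped", reason: "budget_exceeded" },
      totalCostUsd,
      spendCapUsd,
    };
  }
  return { allowed: true };
}
```

The function only asks whether the house is already at or over the cap. It never estimates the cost of the turn about to run, which is why one turn of overshoot remains possible.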


Decision 3 — Checking the cap cheaply

No cached counter exists. The right query is a live aggregate over (house_id, created_at), hitting the existing index.

Query shape (RPC)

-- v1: lifetime arbe-funded spend total for a house (no time window).
-- Proposed as a new RPC: budget_check(p_house_id).
select coalesce(sum(cost_usd), 0) as total_cost_usd
from public.usage_events
where house_id = p_house_id
and key_source = 'worker'; -- cap arbe-subsidised spend only (OQ-1)

No created_at bound in v1 — the cap is lifetime (OQ-2 resolved). When monthly is added later, a p_window_start param returns and the clause becomes and created_at >= p_window_start.

This is a single index scan on (house_id, created_at desc) keyed by house_id — the index already exists. For the alpha phase with a small event count per house, this is cheap enough to call on every dispatch turn. At scale a materialized running counter is the upgrade path; no need to add it now.

Alternative: cached counter

A house_spend table with a current_month_usd column, incremented by a Postgres trigger on usage_events insert. Pros: O(1) read. Cons: trigger complexity, drift risk if triggers misfire, and the counter needs to reset on month rollover (a cron or a conditional in the trigger). The live aggregate is simpler and correct-by-construction. Defer the cache until the aggregate proves too slow.


Decision 4 — Failure UX

When the gate query shows total_cost_usd >= spend_cap_usd:

Agent sees (on the stream):

signal.dispatch.skipped reason: budget_exceeded

This is already the standard skipped shape; budget_exceeded is a new reason value. No new entry type needed.

Human sees (in the UI):

A thread status chip or inline notice: “Budget limit reached — this house has spent $X of its $Y cap. Turns paused until an owner raises the cap.” (The cap is one-time/lifetime in v1, so there is no reset to wait for — only a manual bump.) The exact copy is a product decision; the data needed (total_cost_usd, spend_cap_usd) is available from the gate check result.
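
As a rough sketch of how a client might build that notice from the two numbers in the gate result (the budgetNotice name and the two-decimal rounding are assumptions, and the wording is a placeholder for whatever copy product chooses):

```typescript
// Builds the human-facing notice from the gate check result.
function budgetNotice(totalCostUsd: number, spendCapUsd: number): string {
  // Two decimal places is an assumed display precision, not a product decision.
  const usd = (amount: number): string => `$${amount.toFixed(2)}`;
  return (
    `Budget limit reached: this house has spent ${usd(totalCostUsd)} ` +
    `of its ${usd(spendCapUsd)} cap. Turns paused until an owner raises the cap.`
  );
}
```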

Agent that attempted to reply:

The bot turn never starts. No signal.dispatch.started is emitted. The skipped signal carries the reason, so a client rendering dispatch signals can surface it to the bot’s author without embedding budget details in the thread narrative.

No new error code is needed in @arbe/errors for MVP; the existing signal.dispatch.skipped reason field is the observable surface. An ArbeError subclass is the right extension if the HTTP API needs to return a structured error (e.g. POST /api/threads/:id/entries → 402 Payment Required), but that is a separate surface decision.


Decision 5 — Worker-hang residual

What item-4 fixed

Before item-4, an errored or stalled turn might not flush its token tally. The fix wraps runBotTurn in a finally that calls recordUsage() with the caller-owned tally, so the spend is recorded even if the turn threw or returned an error final.
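
The shape of that fix, reduced to its essentials, looks something like the sketch below. It is synchronous for brevity, whereas the real runBotTurn is awaited; the Tally fields and the runTurnWithTally name are hypothetical.

```typescript
// Hypothetical tally shape; the caller owns it and the turn mutates it as it goes.
interface Tally {
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
}

// Runs a turn and records whatever was tallied, even if the turn throws.
function runTurnWithTally(
  turn: (tally: Tally) => void,
  record: (tally: Tally) => void,
): void {
  const tally: Tally = { inputTokens: 0, outputTokens: 0, costUsd: 0 };
  try {
    turn(tally);
  } finally {
    // Runs on success and on throw; the error still propagates afterwards.
    record({ ...tally });
  }
}
```

Because the tally lives outside the turn, spend accumulated before a failure is still visible in the finally block. This only helps while the process stays alive long enough to run it.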

What remains: worker-process kill (no finally)

If the Cloudflare Worker process is killed mid-turn — OOM, platform-forced shutdown, waitUntil budget exhausted — the finally never runs. The tally is in memory and is lost. The model call may have already streamed tokens to the provider, but the ledger row is never written.

Implication for budget accuracy: The cap check reads usage_events. A killed-process turn’s tokens are not in usage_events. The house appears to have spent less than it actually did. A house at $0.98 of a $1.00 cap could have one killed turn whose $0.05 of spend is missing, making the next turn appear safe when it is not.

Recommendation: accept this as a known residual with documented bounds, not a blocker.

Rationale:

  • Worker kills are rare on Cloudflare Workers (the platform is designed to avoid them; waitUntil gives 30 seconds for background work).
  • The overshoot is bounded by one turn’s cost per killed turn. For modest caps ($5–$50) and typical turn costs ($0.01–$0.10), that is 0.02–2% of the cap.
  • A 10% safety buffer on the cap (enforce at 90% of stated cap) comfortably absorbs an overshoot of that size without requiring infra changes.
  • The right long-term fix is a provider-side webhook or usage API (Anthropic/OpenRouter reports actual charges, not what we recorded) — that is a separate workstream.
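
The bounds in this rationale are simple arithmetic, sketched here so they can be rechecked if cap sizes or turn costs change. Both function names are hypothetical.

```typescript
// Overshoot from one unrecorded turn, as a percentage of the cap.
function overshootPercent(turnCostUsd: number, capUsd: number): number {
  return (turnCostUsd / capUsd) * 100;
}

// Threshold actually enforced when a safety buffer is held back from the stated cap.
// bufferFraction = 0.1 means "enforce at 90% of the stated cap".
function enforcedCap(statedCapUsd: number, bufferFraction: number): number {
  return statedCapUsd * (1 - bufferFraction);
}
```

The smallest case, a $0.01 turn against a $50 cap, is 0.02%; the largest, a $0.10 turn against a $5 cap, is 2%. A 10% buffer on a $5 cap enforces at $4.50, which leaves room for several lost turns at the top of that cost range.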

Design note: document the residual in the enforcement comment so the next engineer does not misread a missing event as a ledger bug.


Schema change (shipped)

alter table public.houses
add column spend_cap_usd numeric(12,6);
-- null = uncapped (legacy rows created before the default cap)
-- positive value = one-time/lifetime cap in USD

createHouse stamps DEFAULT_HOUSE_SPEND_CAP_USD (5, in packages/core/houses.ts) on every new house. The cap does not renew; bump it manually during alpha.

Plus the RPC budget_check(p_house_id text) returns numeric from Decision 3 — no p_window_start in v1 (lifetime cap). It must be SECURITY DEFINER (see the gotcha note at the top). Both live in migrations 20260611140000 (column + RPC) and 20260611150000 (definer fix).


Summary of recommendations

| Decision | Recommendation |
| --- | --- |
| Policy model | One-time/lifetime cap (v1), nullable spend_cap_usd column on houses; monthly is a later add-on |
| Cap scope | arbe-funded only (key_source = 'worker'); BYOK uncapped (decided, OQ-1) |
| Reset window | None in v1 (lifetime cap); OQ-2 dissolved |
| Enforcement seam | Pre-dispatch gate, after resolveScopeContext, before bot loop |
| Check method | Live aggregate RPC; no cached counter yet |
| Failure UX | signal.dispatch.skipped { reason: 'budget_exceeded' } + UI surface |
| Hang residual | Accept as known; add 10% safety buffer on cap if needed |

Open questions for the human

OQ-1 — Cap scope: DECIDED. v1 caps key_source = 'worker' (arbe-subsidised) spend only; BYOK is the house’s own bill and stays uncapped. A future opt-in BYOK cap is anticipated via a cap_scope field — see Decision 1.

OQ-2 — Reset anchor: DISSOLVED. v1 uses a one-time/lifetime cap, so there is no reset window to anchor. The calendar-vs-rolling question only returns if monthly budgets are added later.