Testing
How we write tests, so they catch real regressions and survive honest refactors. A test that breaks when behaviour is unchanged is a liability, not a safety net. Read before adding or editing a test.
The litmus: would this test still pass if you rewrote the implementation more cleanly but kept behaviour identical? No → it tests the code’s shape (call order, exact strings, mock choreography). Assert the outcome instead, or delete it.
Test the outcome, not the choreography
Our worst tests hand-roll a fake of a stateful dependency (usually Supabase) and assert on a
recorded call-log — they re-implement the database to assert the code talks to the database
the way the code talks to the database. Passes by construction, breaks on any refactor,
catches nothing a type error wouldn’t. This is the parallel universe, and it is gone. The worked
way out (arbe-473b): agents.test.ts, apps/www/src/lib/server/api-key-db.test.ts
(projectApiKeys), and core/threads.test.ts — each pulled its decisions into pure functions or a
narrow owned store interface (ThreadStore) and deleted the fake. The last holdout,
packages/core/narrated-entity.test.ts (a 200-line postgrest re-implementation), had little pure
logic and was nothing but thin orchestration over the boundary — so we deleted it rather than port
the fake or stand up a DB double for it (arbe-344b, closed). If that glue breaks, we catch it by
using arbe, not by re-faking postgrest.
    expect(calls[0]).toMatchObject({ table: 'agents', op: 'select' }) // no
    expect(selects).toEqual(['id, name, created_at, last_used_at']) // no
Instead, in order:
- Extract the logic from the boundary. What these tests want to check is usually pure:
  which fields get written, how a row maps to a domain type, what status fires. Pull it into
  a function and test plain inputs → outputs, no fake. Models:
  core/agents.test.ts (which fields a write carries, who may edit: buildAgentInsert /
  canEditAgentDecision / the row mappers), core/threads.test.ts (the stuck-thread reconcile
  decision: decideStuckThreadReconciliation with an injected now, plus childFinishedText /
  buildChildFinishedEntries; the row update + narration routing left over is checked against
  the narrow ThreadStore double’s boundary payloads, not a DB), core/client.test.ts (status
  transitions), dispatch/build-messages.test.ts (entries → messages). The shared
  envelope → ArbeError translation is tested once in utils/supabase-rows.test.ts so no write
  module re-fakes “an insert error becomes server.internal”.
- Prove SQL/RPC/RLS with a SQL proof, not a JS fake. The few contracts that genuinely live in
  the database (resolve_scope, where-clause predicates, cascade deletes, RLS) are covered by
  the begin … rollback proofs in packages/supabase/tests/verify-*.sql, run against a linked
  dev DB (same DB local and prod). We deliberately do not run them as an automated JS suite:
  a JS DB double would mean either supabase start Docker on every bun run test, or re-faking
  postgrest, which is the parallel universe. Don’t reach for a builder fake; if the logic is
  pure, do option 1, and if it’s only thin orchestration over the boundary, leave it to
  dogfooding (we deleted narrated-entity.test.ts on exactly this call, arbe-344b).
- Never make a calls/selects log the assertion. A side-effect at a real external boundary may
  be worth asserting; then assert the payload, not that the call happened.
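
A minimal sketch of "extract the logic from the boundary", under stated assumptions: the names NewAgentInput, AgentInsertRow and buildAgentInsertSketch are hypothetical stand-ins, since this guide does not show the real buildAgentInsert signature. The point is only the shape: the decision about which fields a write carries becomes a pure function with an injected clock, so the test needs no database fake and no call log.

```typescript
// Hypothetical types and names, for illustration only.
interface NewAgentInput {
  name: string;
  ownerId: string;
}

interface AgentInsertRow {
  name: string;
  owner_id: string;
  created_at: string;
}

// Pure decision: which fields the write carries. No client, no fake, no call log.
function buildAgentInsertSketch(input: NewAgentInput, now: Date): AgentInsertRow {
  return {
    name: input.name.trim(),
    owner_id: input.ownerId,
    created_at: now.toISOString(),
  };
}

// Plain input → output. The injected clock keeps the result deterministic.
const row = buildAgentInsertSketch(
  { name: "  scout  ", ownerId: "user-1" },
  new Date("2024-01-02T03:04:05.000Z"),
);
console.log(row);
```

In a real test file the three fields of row would be asserted with the package's runner; what matters is that the assertion reads the payload, not a record of which calls were made.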
Mock the boundary, not the collaborator
Mock what you can’t run: network (Tenor, sandbox HTTP, Daytona), the clock, incidental
filesystem. Mocking your own modules is a smell — a test that stubs a stack of collaborators
only proves the wiring passes args through, and stays green when every collaborator is broken. If
a unit needs that much mocking, the seam is wrong: test a level up where the real collaborators
run, or extract the pure decision it’s making — as dispatch/agent-tools.test.ts now does, pulling
the launch-env contract and the exec summary into pure functions it tests with no mock, leaving the
mocked tool tests as thin wiring checks.
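
As a sketch of mocking the boundary rather than the collaborator: the only thing faked below is the clock, while the two helper functions run for real. All names (Clock, parseCooldownMs, isEligible, shouldDispatch) are hypothetical and not taken from the repo.

```typescript
// The boundary we cannot run deterministically: the clock.
type Clock = () => number;

// Real collaborators. The test runs them as-is, nothing stubs them.
function parseCooldownMs(setting: string): number {
  const match = /^(\d+)s$/.exec(setting);
  if (!match) throw new Error(`unsupported cooldown: ${setting}`);
  return Number(match[1]) * 1000;
}

function isEligible(lastRunAtMs: number, nowMs: number, cooldownMs: number): boolean {
  return nowMs - lastRunAtMs >= cooldownMs;
}

// Unit under test: composes the real collaborators, takes the clock as a parameter.
function shouldDispatch(setting: string, lastRunAtMs: number, clock: Clock): boolean {
  return isEligible(lastRunAtMs, clock(), parseCooldownMs(setting));
}

// Only the clock is replaced, so a bug in either collaborator would fail these checks.
const justBeforeCooldown = shouldDispatch("30s", 1_000, () => 30_999);
const exactlyAtCooldown = shouldDispatch("30s", 1_000, () => 31_000);
console.log({ justBeforeCooldown, exactlyAtCooldown });
```

Had parseCooldownMs and isEligible been stubbed instead, the same checks would only show that shouldDispatch forwards its arguments.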
Prefer real fixtures of real data
Our strongest tests replay captured wire data: core/pi/events.test.ts and
pi/transcript.test.ts decode real pi JSONL; sandbox/src/sprites-http.test.ts feeds real
byte frames. When testing a decoder, parser, projector, or mapper, commit a real sample as a
fixture — a synthesized “valid” input only proves the happy path you imagined.
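
A sketch of the fixture-replay shape, with loud caveats: WireEvent and decodeJsonl are hypothetical, and the inline sample is an invented placeholder, not captured data. In the repo the string would be read from a committed sample file. The placeholder includes the kind of detail a hand-written "valid" input tends to miss: a CRLF line ending and a blank line mid-stream.

```typescript
// Hypothetical event shape, for illustration only.
interface WireEvent {
  type: string;
  text?: string;
}

// Decodes JSONL: one JSON object per line, tolerating CRLF endings and blank lines.
function decodeJsonl(raw: string): WireEvent[] {
  const events: WireEvent[] = [];
  for (const line of raw.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // skip blank and trailing lines
    const parsed: unknown = JSON.parse(trimmed);
    if (
      typeof parsed === "object" &&
      parsed !== null &&
      typeof (parsed as { type?: unknown }).type === "string"
    ) {
      events.push(parsed as WireEvent);
    }
  }
  return events;
}

// Placeholder standing in for a captured fixture file.
const sample = '{"type":"start"}\r\n{"type":"delta","text":"hi"}\n\n{"type":"end"}\n';
const decoded = decodeJsonl(sample);
console.log(decoded.map((event) => event.type));
```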
Don’t restate, don’t pad
- Generated strings: assert meaning, not surface. For a SQL where clause or other wire string,
  the contract is which rows it selects, not its spacing or AND-order. Assert the predicate it
  embeds (toContain), or run it against rows once a DB double lands (arbe-473b); a reformat
  must not break the test. Reserve an exact-string toBe for a string with no incidental
  spelling: a single predicate, a sentinel, a serialized error (errors/arbe-error.test.ts),
  where every character is load-bearing. electric-shape.test.ts shows both, and tests the
  injection-rejection contract its builders own.
- Redundant ≠ padding. A toHaveBeenCalledWith / not.toHaveBeenCalled check that an outcome
  assertion already implies is still worth keeping when it names a documented behavior: an
  avoided side-effect (“resolves without ever fetching members”), a boundary payload, a
  short-circuit. Cut only bare call-count/order checks that pin mechanics nothing in the
  contract depends on. When in doubt, leave a passing test alone.
- Don’t test what you don’t own: Schema.parse(validInput) tests Zod; a re-typed constant map
  tests nothing; render(x, {tty:false}) === x restates an early return. Test our refinements
  and rejection rules, not the library. Delete the rest on sight.
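
To make "assert meaning, not surface" concrete, here is a small sketch. buildWhere is a hypothetical builder written for this example and is not the code behind electric-shape.test.ts. The checks look for each embedded predicate, so reordering or respacing the clause would not break them, and the rejection rule gets its own check because the builder owns it.

```typescript
// Hypothetical where-clause builder, for illustration only.
function buildWhere(filters: { orgId: string; archived: boolean }): string {
  // Rejection rule owned by the builder: only plain identifiers may be embedded.
  if (!/^[A-Za-z0-9-]+$/.test(filters.orgId)) {
    throw new Error("invalid orgId");
  }
  const predicates = [`org_id = '${filters.orgId}'`, `archived = ${filters.archived}`];
  return predicates.join(" AND ");
}

const where = buildWhere({ orgId: "org-1", archived: false });

// Meaning: each predicate is present, whatever order or spacing joins them.
const hasOrgPredicate = where.includes("org_id = 'org-1'");
const hasArchivedPredicate = where.includes("archived = false");

// Contract: an injection attempt is rejected instead of being embedded.
let rejectedInjection = false;
try {
  buildWhere({ orgId: "x' OR 1=1 --", archived: false });
} catch {
  rejectedInjection = true;
}
console.log({ hasOrgPredicate, hasArchivedPredicate, rejectedInjection });
```

With a test runner, the two includes checks correspond to toContain on each predicate; an exact-string comparison on the whole clause would fail the moment the predicates were reordered.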
What good looks like
Pure inputs → asserted outputs, or real fixture → decoded result. Named for the behaviour
(cooldown + human trigger resets eligibility, not classify works). Aimed at logic that can
break: decisions, mappers, parsers, state machines, concurrency, shell-escaping. No mock
unless a real boundary forces it. Copy from: errors/arbe-error.test.ts,
core/pi/events.test.ts, cmd/src/parse.test.ts, dispatch/build-messages.test.ts,
dispatch/selection.test.ts, apps/cli/src/task/task.test.ts (real fs + concurrent
subprocesses), sandbox/src/daytona/decide-pi-outcome.test.ts.
Cut noise, not coverage
Fewer, sharper means shrinking restated scaffolding, not distinct behaviours. A long test file is
usually thorough, not bloated — before you delete, check whether the length is repeated cases
(keep them) or repeated fixture boilerplate (factor it). When the same envelope or setup is
hand-rolled across cases, pull it into a small builder and keep each case’s asserted data inline,
so the test still shows at a glance what input yields what output — chat-stream.test.ts
collapsed ~17 projector cases this way without dropping one. Two traps: don’t hide the value under
test inside the builder, and drop fixture fields no assertion reads — they’re noise that buries the
one thing the case is about.
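
A sketch of the builder pattern described above, using hypothetical names (Envelope, envelope, project) rather than anything from chat-stream.test.ts. The builder supplies only the fields the type requires; the values each case is about stay inline next to the expected output.

```typescript
// Hypothetical envelope shape, for illustration only.
interface Envelope {
  id: string;
  threadId: string;
  kind: "text" | "tool";
  body: string;
}

// Small builder: required boilerplate lives here, the value under test never does.
function envelope(overrides: Partial<Envelope>): Envelope {
  return { id: "e1", threadId: "t1", kind: "text", body: "", ...overrides };
}

// Hypothetical projector under test.
function project(e: Envelope): string {
  return e.kind === "tool" ? `[tool] ${e.body}` : e.body;
}

// Each case shows at a glance what input yields what output.
const cases: Array<[Envelope, string]> = [
  [envelope({ kind: "text", body: "hello" }), "hello"],
  [envelope({ kind: "tool", body: "ls" }), "[tool] ls"],
];

const results = cases.map(([input, expected]) => project(input) === expected);
console.log(results);
```

If the builder defaulted body to "hello", the first case would still pass but would no longer show what it tests, which is the first trap named above.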
Don’t backfill a test per function, and don’t delete a real behaviour to shorten a file. When in doubt, write the one that would have caught a real past bug; a regression test with a one-line note on what broke is gold.
Mechanics
- bun run test (scope: bun run --filter '@arbe/core' test). Never bare bun test; it bypasses
  the package script.
- Two runners, deliberately split by runtime: packages/* + apps/www on vitest (vi.mock),
  apps/cli on bun:test (mock). Match the runner already in the package: the boundary is the
  runtime, not preference. The CLI is a bun program to its core (Bun.spawn / Bun.file,
  import.meta.dir, and import … with { type: 'file' } asset-embedding for the compiled
  binary); bun:test runs that natively with zero config, while vitest can’t without a vite
  asset plugin + Bun shim, and even then it runs production bun code on a non-bun runtime.
  Evaluated and closed in arbe-883f.
- Co-locate foo.test.ts next to foo.ts; shared fixtures in a __fixtures__ dir beside the tests
  that use them.