
Testing

How we write tests, so they catch real regressions and survive honest refactors. A test that breaks when behaviour is unchanged is a liability, not a safety net. Read before adding or editing a test.

The litmus: would this test still pass if you rewrote the implementation cleaner but kept behaviour identical? No → it tests the code’s shape (call order, exact strings, mock choreography). Assert the outcome instead, or delete it.

Test the outcome, not the choreography

Our worst tests hand-roll a fake of a stateful dependency (usually Supabase) and assert on a recorded call-log: they re-implement the database to assert that the code talks to the database the way the code talks to the database. Such a test passes by construction, breaks on any refactor, and catches nothing a type error wouldn’t. This is the parallel universe, and it is gone. The worked way out (arbe-473b): core/agents.test.ts, apps/www/src/lib/server/api-key-db.test.ts (projectApiKeys), and core/threads.test.ts each pulled its decisions into pure functions or a narrow owned store interface (ThreadStore) and deleted the fake. The last holdout, packages/core/narrated-entity.test.ts (a 200-line postgrest re-implementation), had little pure logic and was only thin orchestration over the boundary, so we deleted it rather than port the fake or stand up a DB double for it (arbe-344b, closed). If that glue breaks, we catch it by using arbe, not by re-faking postgrest.

expect(calls[0]).toMatchObject({ table: 'agents', op: 'select' }) // no
expect(selects).toEqual(['id, name, created_at, last_used_at']) // no

Instead, in order:

  1. Extract the logic from the boundary. What these tests want to check is usually pure — which fields get written, how a row maps to a domain type, what status fires. Pull it into a function and test plain inputs → outputs, no fake. Models: core/agents.test.ts (which fields a write carries, who may edit — buildAgentInsert / canEditAgentDecision / the row mappers), core/threads.test.ts (the stuck-thread reconcile decision — decideStuckThreadReconciliation with an injected now, plus childFinishedText / buildChildFinishedEntries; the row update + narration routing left over is checked against the narrow ThreadStore double’s boundary payloads, not a DB), core/client.test.ts (status transitions), dispatch/build-messages.test.ts (entries → messages). The shared envelope → ArbeError translation is tested once in utils/supabase-rows.test.ts so no write module re-fakes “an insert error becomes server.internal”.
  2. Prove SQL/RPC/RLS with a SQL proof, not a JS fake. The few contracts that genuinely live in the database (resolve_scope, where-clause predicates, cascade deletes, RLS) are covered by the begin … rollback proofs in packages/supabase/tests/verify-*.sql, run against a linked dev DB (same DB local and prod). We deliberately do not run them as an automated JS suite: a JS DB double would mean either supabase start Docker on every bun run test, or re-faking postgrest, which is the parallel universe again. Don’t reach for a builder fake: if the logic is pure, do step 1, and if it’s only thin orchestration over the boundary, leave it to dogfooding (we deleted narrated-entity.test.ts on exactly this call, arbe-344b).
  3. Never make a calls/selects log the assertion. A side-effect at a real external boundary may be worth asserting — then assert the payload, not that the call happened.
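
A minimal sketch of step 1, under stated assumptions: AgentRow, Agent, toAgent, buildInsert and canEdit are hypothetical illustrations, not the real buildAgentInsert / canEditAgentDecision signatures. The decisions live in pure functions, so the checks are plain inputs to outputs and never touch a client or a fake. node:assert is used only so the snippet runs standalone; in the repo the same checks would presumably be expect calls in the package's runner.

```typescript
import { deepStrictEqual, strictEqual } from "node:assert/strict";

// Hypothetical row and domain shapes, for illustration only.
interface AgentRow {
  id: string;
  name: string;
  owner_id: string;
  created_at: string;
}

interface Agent {
  id: string;
  name: string;
  ownerId: string;
  createdAt: Date;
}

// Pure mapper: database row -> domain type.
function toAgent(row: AgentRow): Agent {
  return {
    id: row.id,
    name: row.name,
    ownerId: row.owner_id,
    createdAt: new Date(row.created_at),
  };
}

// Pure decision: which fields a write carries.
function buildInsert(input: { name: string; ownerId: string }): { name: string; owner_id: string } {
  return { name: input.name.trim(), owner_id: input.ownerId };
}

// Pure decision: who may edit.
function canEdit(agent: Agent, userId: string): boolean {
  return agent.ownerId === userId;
}

// Plain inputs -> asserted outputs. No fake, no call log.
const agent = toAgent({
  id: "a1",
  name: "scout",
  owner_id: "u1",
  created_at: "2024-01-02T03:04:05.000Z",
});

strictEqual(agent.ownerId, "u1");
strictEqual(agent.createdAt.toISOString(), "2024-01-02T03:04:05.000Z");
deepStrictEqual(buildInsert({ name: "  scout ", ownerId: "u1" }), { name: "scout", owner_id: "u1" });
strictEqual(canEdit(agent, "u1"), true);
strictEqual(canEdit(agent, "u2"), false);
```

A rewrite of how the row reaches the database leaves every one of these assertions untouched, which is the litmus from the top of this page.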

Mock the boundary, not the collaborator

Mock what you can’t run: network (Tenor, sandbox HTTP, Daytona), the clock, incidental filesystem. Mocking your own modules is a smell — a test that stubs a stack of collaborators only proves the wiring passes args through, and stays green when every collaborator is broken. If a unit needs that much mocking, the seam is wrong: test a level up where the real collaborators run, or extract the pure decision it’s making — as dispatch/agent-tools.test.ts now does, pulling the launch-env contract and the exec summary into pure functions it tests with no mock, leaving the mocked tool tests as thin wiring checks.
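
One way this can look, as a sketch rather than the real code: the decision takes the clock as an argument, and the only double is a narrow store interface the code owns. Thread, decideStuck, StuckThreadStore, reconcile and the ten-minute threshold are all hypothetical; the real decideStuckThreadReconciliation and ThreadStore may be shaped differently. node:assert stands in for the package's runner so the snippet runs on its own.

```typescript
import { deepStrictEqual } from "node:assert/strict";

// Hypothetical thread shape; timestamps are epoch milliseconds.
interface Thread {
  id: string;
  status: "running" | "done";
  lastActivityAt: number;
}

type Decision = { kind: "leave" } | { kind: "fail"; reason: string };

const STUCK_AFTER_MS = 10 * 60000; // hypothetical threshold

// Pure decision with the clock injected: no Date.now() inside, so no clock mock.
function decideStuck(thread: Thread, now: number): Decision {
  if (thread.status !== "running") return { kind: "leave" };
  const idleMs = now - thread.lastActivityAt;
  return idleMs > STUCK_AFTER_MS
    ? { kind: "fail", reason: `no activity for ${Math.floor(idleMs / 60000)}m` }
    : { kind: "leave" };
}

// Narrow interface the code owns: the only thing a test needs to double.
interface StuckThreadStore {
  markFailed(id: string, payload: { reason: string }): Promise<void>;
}

// Thin orchestration: decide, then write at the boundary.
async function reconcile(store: StuckThreadStore, thread: Thread, now: number): Promise<void> {
  const decision = decideStuck(thread, now);
  if (decision.kind === "fail") {
    await store.markFailed(thread.id, { reason: decision.reason });
  }
}

// Double that keeps the payloads crossing the boundary, not a generic call log.
function recordingStore() {
  const failed: Array<{ id: string; reason: string }> = [];
  const store: StuckThreadStore = {
    async markFailed(id, payload) {
      failed.push({ id, reason: payload.reason });
    },
  };
  return { store, failed };
}

async function demo(): Promise<void> {
  const t0 = 1700000000000;
  const thread: Thread = { id: "t1", status: "running", lastActivityAt: t0 };

  // The decision is tested with no mock at all.
  deepStrictEqual(decideStuck(thread, t0 + 10 * 60000), { kind: "leave" });
  deepStrictEqual(decideStuck(thread, t0 + 11 * 60000), { kind: "fail", reason: "no activity for 11m" });

  // Thin wiring check: assert the payload, not that a call happened.
  const { store, failed } = recordingStore();
  await reconcile(store, thread, t0 + 11 * 60000);
  deepStrictEqual(failed, [{ id: "t1", reason: "no activity for 11m" }]);
}

demo().catch((err) => {
  console.error(err);
  process.exitCode = 1;
});
```

The wiring check stays one assertion long because everything that can actually be wrong lives in decideStuck.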

Prefer real fixtures of real data

Our strongest tests replay captured wire data: core/pi/events.test.ts and pi/transcript.test.ts decode real pi JSONL; sandbox/src/sprites-http.test.ts feeds real byte frames. When testing a decoder, parser, projector, or mapper, commit a real sample as a fixture — a synthesized “valid” input only proves the happy path you imagined.
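
A sketch of the mechanics only. The sample below is a hand-written placeholder, not real pi output, and the Event shape and decodeJsonl are hypothetical; in a real test the input would be a captured file read from a __fixtures__ dir. The placeholder imitates the kind of detail a capture tends to carry and a synthesized input tends to omit: CRLF line endings, blank lines, and an event type the decoder does not know.

```typescript
import { deepStrictEqual, strictEqual } from "node:assert/strict";

// Hypothetical event shape: known text events, everything else kept as unknown.
type Event = { type: "text"; text: string } | { type: "unknown"; raw: unknown };

function isTextEvent(v: unknown): v is { type: "text"; text: string } {
  if (typeof v !== "object" || v === null) return false;
  const rec = v as Record<string, unknown>;
  return rec.type === "text" && typeof rec.text === "string";
}

// Decode JSONL: one JSON value per non-blank line, tolerant of CRLF.
function decodeJsonl(input: string): Event[] {
  const events: Event[] = [];
  for (const line of input.split(/\r?\n/)) {
    if (line.trim() === "") continue;
    const raw: unknown = JSON.parse(line);
    events.push(isTextEvent(raw) ? { type: "text", text: raw.text } : { type: "unknown", raw });
  }
  return events;
}

// Placeholder standing in for a committed capture (NOT real wire data).
const sample = [
  '{"type":"text","text":"hello"}',
  "",
  '{"type":"tool_call","name":"ls"}',
  '{"type":"text","text":"done"}',
  "",
].join("\r\n");

const events = decodeJsonl(sample);
strictEqual(events.length, 3);
deepStrictEqual(events.map((e) => e.type), ["text", "unknown", "text"]);
deepStrictEqual(events[0], { type: "text", text: "hello" });
```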

Don’t restate, don’t pad

  • Generated strings: assert meaning, not surface. For a SQL where clause or other wire string, the contract is which rows it selects, not its spacing or AND-order. Assert the predicate it embeds (toContain), or run it against rows once a DB double lands (arbe-473b) — a reformat must not break the test. Reserve an exact-string toBe for a string with no incidental spelling: a single predicate, a sentinel, a serialized error (errors/arbe-error.test.ts), where every character is load-bearing. electric-shape.test.ts shows both, and tests the injection-rejection contract its builders own.
  • Redundant ≠ padding. A toHaveBeenCalledWith / not.toHaveBeenCalled check that an outcome assertion already implies is still worth keeping when it names a documented behavior — an avoided side-effect (“resolves without ever fetching members”), a boundary payload, a short-circuit. Cut only bare call-count/order checks that pin mechanics nothing in the contract depends on. When in doubt, leave a passing test alone.
  • Don’t test what you don’t own: Schema.parse(validInput) tests Zod; a re-typed constant map tests nothing; render(x, {tty:false}) === x restates an early return. Test our refinements and rejection rules, not the library. Delete the rest on sight.
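
To illustrate the generated-strings rule, here is a hedged sketch with a hypothetical buildWhere builder (not the electric-shape builders, whose API is not shown here). It asserts the predicates the clause embeds, keeps an exact match for the single-predicate case where every character is load-bearing, and tests the rejection rule the builder owns. ok(includes) plays the role of toContain so the snippet runs without a test runner.

```typescript
import { ok, strictEqual, throws } from "node:assert/strict";

// Hypothetical rule: column names are lowercase identifiers, nothing else.
const COLUMN = /^[a-z_][a-z0-9_]*$/;

// Quote a string literal, doubling embedded single quotes.
function quote(value: string): string {
  return `'${value.replace(/'/g, "''")}'`;
}

// Hypothetical builder: equality predicates joined with AND.
function buildWhere(filters: Record<string, string>): string {
  const predicates = Object.entries(filters).map(([column, value]) => {
    if (!COLUMN.test(column)) throw new Error(`invalid column: ${column}`);
    return `${column} = ${quote(value)}`;
  });
  return predicates.join(" AND ");
}

const where = buildWhere({ project_id: "p1", status: "open" });

// Meaning: each predicate is present. Reordering the predicates keeps this green.
ok(where.includes("project_id = 'p1'"));
ok(where.includes("status = 'open'"));

// Exact match only where every character is load-bearing: one predicate, one escape.
strictEqual(buildWhere({ id: "o'brien" }), "id = 'o''brien'");

// The rejection rule is ours, so it gets a test.
throws(() => buildWhere({ "id; drop table x": "1" }), /invalid column/);
```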

What good looks like

Pure inputs → asserted outputs, or real fixture → decoded result. Named for the behaviour (cooldown + human trigger resets eligibility, not classify works). Aimed at logic that can break: decisions, mappers, parsers, state machines, concurrency, shell-escaping. No mock unless a real boundary forces it. Copy from: errors/arbe-error.test.ts, core/pi/events.test.ts, cmd/src/parse.test.ts, dispatch/build-messages.test.ts, dispatch/selection.test.ts, apps/cli/src/task/task.test.ts (real fs + concurrent subprocesses), sandbox/src/daytona/decide-pi-outcome.test.ts.

Cut noise, not coverage

“Fewer, sharper” means shrinking restated scaffolding, not dropping distinct behaviours. A long test file is usually thorough, not bloated: before you delete, check whether the length is repeated cases (keep them) or repeated fixture boilerplate (factor it). When the same envelope or setup is hand-rolled across cases, pull it into a small builder and keep each case’s asserted data inline, so the test still shows at a glance what input yields what output; chat-stream.test.ts collapsed ~17 projector cases this way without dropping one. Two traps: don’t hide the value under test inside the builder, and do drop fixture fields no assertion reads, since they’re noise that buries the one thing the case is about.
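
A small sketch of the builder pattern described above, with hypothetical names throughout (Envelope, Payload, project and envelope are illustrations, not the chat-stream.test.ts code). The builder owns only the incidental envelope fields; the value under test and the expected output stay inline in each case.

```typescript
import { strictEqual } from "node:assert/strict";

// Hypothetical wire shapes.
type Payload =
  | { kind: "text"; text: string }
  | { kind: "status"; status: "running" | "done" };

interface Envelope {
  id: string;
  threadId: string;
  seq: number;
  payload: Payload;
}

// Hypothetical unit under test: envelope -> display line.
function project(e: Envelope): string {
  return e.payload.kind === "text" ? e.payload.text : `[${e.payload.status}]`;
}

// Builder: fills the fields no assertion reads, takes the payload as an argument
// so the value under test is never hidden in here.
let nextSeq = 0;
function envelope(payload: Payload): Envelope {
  nextSeq += 1;
  return { id: `e${nextSeq}`, threadId: "t1", seq: nextSeq, payload };
}

// Each case shows its input and expected output on one line.
const cases: Array<{ name: string; input: Payload; expected: string }> = [
  { name: "text passes through", input: { kind: "text", text: "hi" }, expected: "hi" },
  { name: "running status is bracketed", input: { kind: "status", status: "running" }, expected: "[running]" },
  { name: "done status is bracketed", input: { kind: "status", status: "done" }, expected: "[done]" },
];

for (const c of cases) {
  strictEqual(project(envelope(c.input)), c.expected, c.name);
}
```

No case was dropped to get here: the three behaviours are still three cases, and only the repeated envelope literal moved.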

Don’t backfill a test per function, and don’t delete a real behaviour to shorten a file. When in doubt, write the one that would have caught a real past bug; a regression test with a one-line note on what broke is gold.

Mechanics

  • bun run test (scope: bun run --filter '@arbe/core' test). Never bare bun test — it bypasses the package script.
  • Two runners, deliberately split by runtime: packages/* + apps/www on vitest (vi.mock), apps/cli on bun:test (mock). Match the runner already in the package — the boundary is the runtime, not preference. The CLI is a bun program to its core (Bun.spawn/Bun.file, import.meta.dir, and import … with { type: 'file' } asset-embedding for the compiled binary); bun:test runs that natively with zero config, vitest can’t without a vite asset plugin + Bun shim and still runs production bun code on a non-bun runtime. Evaluated and closed in arbe-883f.
  • Co-locate foo.test.ts next to foo.ts; shared fixtures in a __fixtures__ dir beside the tests that use them.