# Testing

How we write tests, so they catch real regressions and survive honest refactors. A test that
breaks when behaviour is unchanged is a liability, not a safety net. Read before adding or
editing a test.

**The litmus:** would this test still pass if you rewrote the implementation cleaner but kept
behaviour identical? No → it tests the code's shape (call order, exact strings, mock
choreography). Assert the outcome instead, or delete it.

## Test the outcome, not the choreography

Our worst tests hand-roll a fake of a stateful dependency (usually Supabase) and assert on a
recorded call-log — they re-implement the database to assert the code talks to the database
the way the code talks to the database. Passes by construction, breaks on any refactor,
catches nothing a type error wouldn't. We call this the parallel universe, and it is gone. The worked
examples of the way out (arbe-473b): `core/agents.test.ts`, `apps/www/src/lib/server/api-key-db.test.ts`
(`projectApiKeys`), and `core/threads.test.ts` — each pulled its decisions into pure functions or a
narrow owned store interface (`ThreadStore`) and deleted the fake. The last holdout,
`packages/core/narrated-entity.test.ts` (a 200-line postgrest re-implementation), had almost no pure
logic and was otherwise thin orchestration over the boundary, so we deleted it rather than port
the fake or stand up a DB double for it (arbe-344b, closed). If that glue breaks, we catch it by
using arbe, not by re-faking postgrest.

```ts
expect(calls[0]).toMatchObject({ table: 'agents', op: 'select' })   // no
expect(selects).toEqual(['id, name, created_at, last_used_at'])     // no
```

Instead, in order:

1. **Extract the logic from the boundary.** What these tests want to check is usually pure —
   which fields get written, how a row maps to a domain type, what status fires. Pull it into
   a function and test plain inputs → outputs, no fake. Models: `core/agents.test.ts`
   (which fields a write carries, who may edit — `buildAgentInsert` / `canEditAgentDecision` /
   the row mappers), `core/threads.test.ts` (the stuck-thread reconcile decision —
   `decideStuckThreadReconciliation` with an injected `now`, plus `childFinishedText` /
   `buildChildFinishedEntries`; the row update + narration routing left over is checked against the
   narrow `ThreadStore` double's boundary payloads, not a DB), `core/client.test.ts` (status
   transitions), `dispatch/build-messages.test.ts`
   (entries → messages). The shared envelope → `ArbeError` translation is tested once in
   `utils/supabase-rows.test.ts` so no write module re-fakes "an insert error becomes
   server.internal".
2. **Prove SQL/RPC/RLS with a SQL proof, not a JS fake.** The few contracts that genuinely live in
   the database (`resolve_scope`, where-clause predicates, cascade deletes, RLS) are covered by the
   `begin … rollback` proofs in `packages/supabase/tests/verify-*.sql`, run against a linked dev DB
   (same DB local and prod). We deliberately do not run them as an automated JS suite: a JS DB
   double would mean either running `supabase start` (Docker) on every `bun run test`, or re-faking
   postgrest (the parallel universe). Don't reach for a builder fake; if the logic is pure, do step 1, and
   if it's only thin orchestration over the boundary, leave it to dogfooding (we deleted
   `narrated-entity.test.ts` on exactly this call — arbe-344b).
3. **Never** make a `calls`/`selects` log the assertion. A side-effect at a real external
   boundary may be worth asserting — then assert the payload, not that the call happened.
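
A minimal sketch of step 1, under stated assumptions: `decideStuck`, the `Thread` shape, and the five-minute threshold are hypothetical stand-ins, not the real `decideStuckThreadReconciliation` signature. It shows the shape of the extraction: the decision takes plain data plus an injected `now`, so the test needs no fake and no clock mock.

```ts
// Hypothetical names and threshold, for illustration only.
type Thread = { id: string; status: 'running' | 'done' | 'failed'; lastHeartbeatAt: number }

type Decision =
  | { action: 'leave' }
  | { action: 'markFailed'; silentForMs: number }

const STUCK_AFTER_MS = 5 * 60_000 // assumed threshold

// Pure decision: reads no database and no clock. `now` is passed in.
function decideStuck(thread: Thread, now: number): Decision {
  if (thread.status !== 'running') return { action: 'leave' }
  const silentForMs = now - thread.lastHeartbeatAt
  return silentForMs > STUCK_AFTER_MS
    ? { action: 'markFailed', silentForMs }
    : { action: 'leave' }
}

// In a vitest file the assertion is plain input -> output:
//   expect(decideStuck({ id: 't1', status: 'running', lastHeartbeatAt: 0 }, 600_000))
//     .toEqual({ action: 'markFailed', silentForMs: 600_000 })
```

The caller that loads the row and writes the update stays thin; whatever is left over there is either checked through a narrow owned interface or left to dogfooding, as described above.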

## Mock the boundary, not the collaborator

Mock what you can't run: network (Tenor, sandbox HTTP, Daytona), the clock, incidental
filesystem. Mocking your *own* modules is a smell — a test that stubs a stack of collaborators
only proves the wiring passes args through, and stays green when every collaborator is broken. If
a unit needs that much mocking, the seam is wrong: test a level up where the real collaborators
run, or extract the pure decision it's making — as `dispatch/agent-tools.test.ts` now does, pulling
the launch-env contract and the exec summary into pure functions it tests with no mock, leaving the
mocked tool tests as thin wiring checks.
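
As a rough sketch of where the seam goes (all names, the file format, and the required keys here are assumptions, not the real launch-env code): the unit takes the filesystem read as a parameter, and its own collaborators run for real, so a bug in either one turns the test red.

```ts
// Hypothetical module; file format and key names are assumptions.
type ReadText = (path: string) => string // the boundary: incidental filesystem

// Own collaborator: runs for real in the test.
function parseEnvLines(text: string): Record<string, string> {
  const env: Record<string, string> = {}
  for (const line of text.split('\n')) {
    const trimmed = line.trim()
    if (trimmed === '' || trimmed.startsWith('#')) continue
    const eq = trimmed.indexOf('=')
    if (eq > 0) env[trimmed.slice(0, eq)] = trimmed.slice(eq + 1)
  }
  return env
}

// Own collaborator: runs for real in the test.
function missingKeys(env: Record<string, string>, required: string[]): string[] {
  return required.filter((key) => !(key in env))
}

// Unit under test: only the file read is injected.
function loadLaunchEnv(path: string, readText: ReadText) {
  const env = parseEnvLines(readText(path))
  return { env, missing: missingKeys(env, ['API_URL', 'API_KEY']) }
}

// A test replaces the boundary with canned text and asserts the outcome:
//   const result = loadLaunchEnv('.env', () => 'API_URL=https://x.test\n')
//   expect(result.missing).toEqual(['API_KEY'])
```

Stubbing `parseEnvLines` and `missingKeys` instead would only prove that `loadLaunchEnv` forwards arguments.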

## Prefer real fixtures of real data

Our strongest tests replay captured wire data: `core/pi/events.test.ts` and
`pi/transcript.test.ts` decode real pi JSONL; `sandbox/src/sprites-http.test.ts` feeds real
byte frames. When testing a decoder, parser, projector, or mapper, commit a real sample as a
fixture — a synthesized "valid" input only proves the happy path you imagined.
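
To illustrate, a sketch with a hypothetical decoder and event shape. The inline `capture` string is an invented placeholder, not real pi JSONL; in a real test it would be read byte for byte from a committed file under `__fixtures__`. Real captures tend to carry what a hand-written sample omits: CRLF endings, blank lines, a truncated final line.

```ts
// Hypothetical decoder and event shape, for illustration only.
type WireEvent = { type: string; text?: string }

function decodeJsonl(raw: string): { events: WireEvent[]; skipped: number } {
  const events: WireEvent[] = []
  let skipped = 0
  for (const line of raw.split(/\r?\n/)) {
    if (line.trim() === '') continue
    try {
      const value: unknown = JSON.parse(line)
      const hasType =
        typeof value === 'object' &&
        value !== null &&
        typeof (value as { type?: unknown }).type === 'string'
      if (hasType) events.push(value as WireEvent)
      else skipped++ // valid JSON, but not an event
    } catch {
      skipped++ // truncated or non-JSON line
    }
  }
  return { events, skipped }
}

// Placeholder standing in for a committed capture (invented content).
const capture =
  '{"type":"message","text":"hi"}\r\n' +
  '\n' +
  '{"type":"tool_call"}\n' +
  '{"type":"mess' // stream cut mid-line

// In a vitest file:
//   expect(decodeJsonl(capture)).toMatchObject({ skipped: 1 })
//   expect(decodeJsonl(capture).events.map((e) => e.type)).toEqual(['message', 'tool_call'])
```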

## Don't restate, don't pad

- **Generated strings: assert meaning, not surface.** For a SQL `where` clause or other wire
  string, the contract is which rows it selects, not its spacing or `AND`-order. Assert the
  predicate it embeds (`toContain`), or prove it against rows in a SQL proof
  (`packages/supabase/tests/verify-*.sql`); a reformat must not break the test. Reserve an exact-string `toBe` for a string with no
  incidental spelling: a single predicate, a sentinel, a serialized error
  (`errors/arbe-error.test.ts`), where every character is load-bearing. `electric-shape.test.ts`
  shows both, and tests the injection-rejection contract its builders own.
- **Redundant ≠ padding.** A `toHaveBeenCalledWith` / `not.toHaveBeenCalled` check that an
  outcome assertion already implies is still worth keeping when it names a documented behaviour:
  an avoided side-effect ("resolves without ever fetching members"), a boundary payload, a
  short-circuit. Cut only bare call-count/order checks that pin mechanics nothing in the contract
  depends on. When in doubt, leave a passing test alone.
- **Don't test what you don't own:** `Schema.parse(validInput)` tests Zod; a re-typed
  constant map tests nothing; `render(x, {tty:false}) === x` restates an early return. Test
  our *refinements* and *rejection* rules, not the library. Delete the rest on sight.
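
The first bullet might look like this in practice. `buildWhere`, its column names, and its quoting rule are hypothetical, not the `electric-shape` builders; the sketch only contrasts an assertion on surface with assertions on meaning and on the rejection rule.

```ts
// Hypothetical builder; column names and quoting rule are assumptions.
const IDENTIFIER = /^[a-z_][a-z0-9_]*$/

function buildWhere(filters: Record<string, string>): string {
  const parts = Object.entries(filters).map(([column, value]) => {
    if (!IDENTIFIER.test(column)) throw new Error(`invalid column: ${column}`)
    return `"${column}" = '${value.replace(/'/g, "''")}'` // double any single quote
  })
  return parts.join(' AND ')
}

const where = buildWhere({ org_id: 'o1', status: 'open' })

// Surface (brittle: breaks on reordering or respacing):
//   expect(where).toBe(`"org_id" = 'o1' AND "status" = 'open'`)
// Meaning (each predicate is present, whatever the order):
//   expect(where).toContain(`"org_id" = 'o1'`)
//   expect(where).toContain(`"status" = 'open'`)
// Rejection rule the builder owns:
//   expect(() => buildWhere({ 'id; drop table x': '1' })).toThrow()
```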

## What good looks like

Pure inputs → asserted outputs, or real fixture → decoded result. Named for the behaviour
(`cooldown + human trigger resets eligibility`, not `classify works`). Aimed at logic that can
break: decisions, mappers, parsers, state machines, concurrency, shell-escaping. No mock
unless a real boundary forces it. Copy from: `errors/arbe-error.test.ts`,
`core/pi/events.test.ts`, `cmd/src/parse.test.ts`, `dispatch/build-messages.test.ts`,
`dispatch/selection.test.ts`, `apps/cli/src/task/task.test.ts` (real fs + concurrent
subprocesses), `sandbox/src/daytona/decide-pi-outcome.test.ts`.

## Cut noise, not coverage

"Fewer, sharper" means shrinking restated scaffolding, not cutting distinct behaviours. A long test file is
usually thorough, not bloated — before you delete, check whether the length is repeated *cases*
(keep them) or repeated *fixture boilerplate* (factor it). When the same envelope or setup is
hand-rolled across cases, pull it into a small builder and keep each case's asserted data inline,
so the test still shows at a glance what input yields what output — `chat-stream.test.ts`
collapsed ~17 projector cases this way without dropping one. Two traps: don't hide the value under
test inside the builder, and drop fixture fields no assertion reads — they're noise that buries the
one thing the case is about.
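
A sketch of that builder pattern, with a hypothetical `Envelope` shape and `project` function (not the real `chat-stream` projector). The builder carries only the fields no case is about; each case passes the fields under test inline, next to the expected output.

```ts
// Hypothetical envelope and projector, for illustration only.
type Envelope = {
  id: string
  threadId: string
  createdAt: string
  kind: 'text' | 'tool'
  payload: string
}

function project(e: Envelope): string {
  return e.kind === 'tool' ? `[tool] ${e.payload}` : e.payload
}

// Builder: defaults for the scaffolding, overrides for what the case is about.
function envelope(overrides: Partial<Envelope>): Envelope {
  return {
    id: 'e1',
    threadId: 't1',
    createdAt: '2024-01-01T00:00:00Z',
    kind: 'text',
    payload: '',
    ...overrides,
  }
}

// Each case shows input and expected output at a glance.
const cases: Array<[Partial<Envelope>, string]> = [
  [{ kind: 'text', payload: 'hello' }, 'hello'],
  [{ kind: 'tool', payload: 'ls -la' }, '[tool] ls -la'],
]

// In a vitest file:
//   it.each(cases)('projects %j', (input, expected) => {
//     expect(project(envelope(input))).toBe(expected)
//   })
```

Note that `kind: 'text'` is still spelled out in the first case even though it is the builder default: the value under test stays visible in the case.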

Don't backfill a test per function, and don't delete a real behaviour to shorten a file. When in
doubt, write the one that would have caught a real past bug; a regression test with a one-line note
on what broke is gold.

## Mechanics

- `bun run test` (scope: `bun run --filter '@arbe/core' test`). Never bare `bun test` — it
  bypasses the package script.
- Two runners, deliberately split by runtime: `packages/*` + `apps/www` on vitest (`vi.mock`),
  `apps/cli` on bun:test (`mock`). Match the runner already in the package — the boundary is the
  runtime, not preference. The CLI is a bun program to its core (`Bun.spawn`/`Bun.file`,
  `import.meta.dir`, and `import … with { type: 'file' }` asset-embedding for the compiled
  binary); `bun:test` runs that natively with zero config, whereas vitest can't without a vite asset
  plugin plus a `Bun` shim, and even then it would run production bun code on a non-bun runtime. Evaluated and
  closed in arbe-883f.
- Co-locate `foo.test.ts` next to `foo.ts`; shared fixtures in a `__fixtures__` dir beside the
  tests that use them.