AI at Authoring Time, Determinism at Execution Time
Status: RFC — Not Yet Implemented. This document describes an aspirational pattern for AI-assisted test generation. The codebase currently uses traditional Playwright tests. See also: ADR-009 (Agent Harness Architecture).
Problem
Using AI to interpret and execute test specifications at runtime is expensive, slow, and non-deterministic. Each test run incurs LLM costs, and identical specs can produce different test behavior across runs.
Symptoms:
- Test suite costs scale linearly with run frequency (LLM tokens per run)
- Same test spec produces different results on consecutive runs
- Test execution takes minutes per scenario (LLM latency per step)
- Flaky tests caused by LLM interpretation variance, not app bugs
Root Cause
Conflating two distinct concerns:
- Authoring — Understanding requirements, deciding what to test, writing test logic (benefits from AI judgment)
- Execution — Running tests against the app, asserting outcomes (needs determinism and speed)
When AI drives execution, every run pays the authoring cost again. The AI re-interprets specs, re-decides on selectors, and re-generates assertions — introducing variance at every step.
Solution
Split the pipeline into AI-assisted authoring and deterministic execution:
AI Authoring Phase (one-time cost) discover → NL specs → Planner → Generator → .spec.ts files
Deterministic Execution Phase (per-run, no AI) npx playwright test → results + report
AI Maintenance Phase (as-needed) audit → drift detection → Healer → fixed .spec.ts files
The Three Layers
| Layer | When AI Runs | Output | Durability |
|---|---|---|---|
| Discovery | Once per feature | NL spec (.md) | Permanent — survives UI changes |
| Generation | Once per spec change | .spec.ts file | Regenerable from NL spec |
| Execution | Every run | Test results | Ephemeral |
NL Specs as Durable Contract
Natural Language specs are the source of truth — not the generated test code:
```md
---
title: Workspace Page CRUD
area: workspace
priority: P0
persona: returning-user
---

### 1. Create a new page
User clicks "+ New Page", fills title "Test Page", clicks Create.
**Expected:** Page appears in sidebar tree and editor opens with the title.
```
This spec survives:
- UI redesigns (selectors change, spec stays)
- Framework upgrades (Playwright API changes, spec stays)
- Backend refactors (API shapes change, spec stays)
Generated .spec.ts files are disposable — regenerate them from the spec when they break.
Role Separation
| Role | Tool | Purpose |
|---|---|---|
| Discover (Sentinel) | Claude | Analyze app, author NL specs |
| Plan (Playwright) | Planner Agent | Convert NL spec → test plan |
| Generate (Playwright) | Generator Agent | Convert plan → .spec.ts code |
| Execute (Playwright) | npx playwright test | Run tests deterministically |
| Heal (Playwright) | Healer Agent | Fix broken selectors/assertions |
| Audit (Sentinel) | Claude | Detect spec ↔ test ↔ app drift |
Cost Model
| Approach | Per-Run Cost | Deterministic? | Speed |
|---|---|---|---|
| AI interprets specs at runtime | $0.10–1.00/test | No | ~30s/test |
| AI generates code, npx executes | $0/run | Yes | ~0.5s/test |
| AI heals on failure only | $0.05–0.50/failure | N/A | On-demand |
Implementation Notes
- NL specs live in `specs/` and are version-controlled (they're the durable artifact)
- Generated `.spec.ts` files live in `tests/` and are also version-controlled (for review and debugging)
- The Healer runs only when tests fail, not on every run
- Audit runs periodically (not per-commit) to detect drift between specs, tests, and app
Prevention
When to Apply This Pattern
- Any AI-driven test system where per-run costs matter
- Systems where test determinism is important (CI gates, pre-release checks)
- Large test suites (100+ tests) where LLM latency would dominate execution time
When NOT to Apply
- Exploratory testing (AI judgment at runtime is the point)
- One-off investigations (authoring overhead isn’t justified)
- Tests that require adaptive behavior (the app changes between steps)
Warning Signs You Need This
- Test suite takes 10+ minutes due to LLM calls
- Monthly LLM bill scales with test run frequency
- Same test produces different pass/fail results on identical app state
References
- `tests/e2e/specs/` — NL spec examples
- `tests/e2e/tests/` — Generated `.spec.ts` files
- `tests/e2e/.claude/agents/` — Playwright Test Agent definitions (Planner, Generator, Healer)
- `docs/reference/qa-testing.md` — Full architecture documentation