
AI at Authoring Time, Determinism at Execution Time

Status: RFC — Not Yet Implemented. This document describes an aspirational pattern for AI-assisted test generation. The codebase currently uses traditional Playwright tests. See also: ADR-009 (Agent Harness Architecture).

Problem

Using AI to interpret and execute test specifications at runtime is expensive, slow, and non-deterministic. Each test run incurs LLM costs, and identical specs can produce different test behavior across runs.

Symptoms:

  • Test suite costs scale linearly with run frequency (LLM tokens per run)
  • Same test spec produces different results on consecutive runs
  • Test execution takes minutes per scenario (LLM latency per step)
  • Flaky tests caused by LLM interpretation variance, not app bugs

Root Cause

Conflating two distinct concerns:

  1. Authoring — Understanding requirements, deciding what to test, writing test logic (benefits from AI judgment)
  2. Execution — Running tests against the app, asserting outcomes (needs determinism and speed)

When AI drives execution, every run pays the authoring cost again. The AI re-interprets specs, re-decides on selectors, and re-generates assertions — introducing variance at every step.

Solution

Split the pipeline into AI-assisted authoring and deterministic execution:

  • AI Authoring Phase (one-time cost): discover → NL specs → Planner → Generator → .spec.ts files
  • Deterministic Execution Phase (per-run, no AI): npx playwright test → results + report
  • AI Maintenance Phase (as-needed): audit → drift detection → Healer → fixed .spec.ts files

The Three Layers

| Layer | When AI Runs | Output | Durability |
| --- | --- | --- | --- |
| Discovery | Once per feature | NL spec (.md) | Permanent — survives UI changes |
| Generation | Once per spec change | .spec.ts file | Regenerable from NL spec |
| Execution | Every run | Test results | Ephemeral |

NL Specs as Durable Contract

Natural Language specs are the source of truth — not the generated test code:

```markdown
---
title: Workspace Page CRUD
area: workspace
priority: P0
persona: returning-user
---
### 1. Create a new page
User clicks "+ New Page", fills title "Test Page", clicks Create.
**Expected:** Page appears in sidebar tree and editor opens with the title.
```

This spec survives:

  • UI redesigns (selectors change, spec stays)
  • Framework upgrades (Playwright API changes, spec stays)
  • Backend refactors (API shapes change, spec stays)

Generated .spec.ts files are disposable — regenerate them from the spec when they break.
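To make the durable-contract idea concrete, here is a minimal sketch of how a Planner-style tool might parse such a spec into metadata and steps. The `parseSpec` helper and the `NlSpec` shape are illustrative assumptions, not part of the codebase:

```typescript
// Illustrative only: parse a minimal NL spec (YAML-ish frontmatter between
// "---" markers, plus "### "-prefixed step headings) into structured data.
interface NlSpec {
  meta: Record<string, string>;
  steps: string[];
}

function parseSpec(raw: string): NlSpec {
  const meta: Record<string, string> = {};
  const steps: string[] = [];
  let inFrontmatter = false;
  for (const line of raw.split("\n")) {
    if (line.trim() === "---") {
      inFrontmatter = !inFrontmatter; // toggle at each frontmatter fence
      continue;
    }
    if (inFrontmatter) {
      const idx = line.indexOf(":");
      if (idx > 0) meta[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
    } else if (line.startsWith("### ")) {
      steps.push(line.slice(4).trim());
    }
  }
  return { meta, steps };
}

const spec = parseSpec(`---
title: Workspace Page CRUD
area: workspace
priority: P0
---
### 1. Create a new page
User clicks "+ New Page".`);

console.log(spec.meta.title); // "Workspace Page CRUD"
console.log(spec.steps[0]);   // "1. Create a new page"
```

Because the spec stays structured and selector-free, the Generator can be rerun against it at any time to produce fresh .spec.ts code.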

Role Separation

| Role | Tool | Purpose |
| --- | --- | --- |
| Discover (Sentinel) | Claude | Analyze app, author NL specs |
| Plan (Playwright) | Planner Agent | Convert NL spec → test plan |
| Generate (Playwright) | Generator Agent | Convert plan → .spec.ts code |
| Execute (Playwright) | npx playwright test | Run tests deterministically |
| Heal (Playwright) | Healer Agent | Fix broken selectors/assertions |
| Audit (Sentinel) | Claude | Detect spec ↔ test ↔ app drift |

Cost Model

| Approach | Per-Run Cost | Deterministic? | Speed |
| --- | --- | --- | --- |
| AI interprets specs at runtime | $0.10–1.00/test | No | ~30s/test |
| AI generates code, npx executes | $0/run | Yes | ~0.5s/test |
| AI heals on failure only | $0.05–0.50/failure | N/A | On-demand |
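A quick break-even sketch using illustrative figures in the table's range (the $0.50 defaults below are assumptions, picked from the middle of the quoted band): runtime interpretation scales with every run, while one-time generation amortizes.

```typescript
// Cumulative LLM cost of N runs of a single test under each approach.
// Per-run and authoring figures are illustrative, not measured.
function runtimeAiCost(runs: number, perRun = 0.5): number {
  return runs * perRun; // every run pays the interpretation cost again
}

function generatedCost(authoringCost = 0.5): number {
  return authoringCost; // one-time authoring; execution itself is LLM-free
}

console.log(runtimeAiCost(100)); // 50   — $50 after 100 runs
console.log(generatedCost());    // 0.5  — one-time $0.50, regardless of runs
```

Under these assumptions the generated approach wins after the second run of a single test, and the gap grows linearly with run frequency.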

Implementation Notes

  • NL specs live in specs/ and are version-controlled (they’re the durable artifact)
  • Generated .spec.ts files live in tests/ and are also version-controlled (for review and debugging)
  • The Healer runs only when tests fail — not on every run
  • Audit runs periodically (not per-commit) to detect drift between specs, tests, and app
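One way the audit step could detect spec ↔ test drift cheaply, before involving an LLM, is to stamp each generated file with a hash of the spec it came from and recompute on audit. The header format and helper names below are assumptions, not the actual convention:

```typescript
import { createHash } from "node:crypto";

// Hypothetical convention: the Generator stamps each .spec.ts with a
// truncated SHA-256 of its source NL spec; audit recomputes and compares.
function specHash(specText: string): string {
  return createHash("sha256").update(specText).digest("hex").slice(0, 12);
}

function stampHeader(specText: string): string {
  return `// generated-from-spec: ${specHash(specText)}`;
}

function isDrifted(generatedHeader: string, currentSpecText: string): boolean {
  return generatedHeader !== stampHeader(currentSpecText);
}

const nlSpec = "### 1. Create a new page";
const header = stampHeader(nlSpec);
console.log(isDrifted(header, nlSpec));               // false — spec unchanged
console.log(isDrifted(header, nlSpec + " (edited)")); // true  — regenerate
```

A hash mismatch only flags that the spec changed after generation; deciding whether the generated test still covers it is where the AI-driven audit comes in.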

Prevention

When to Apply This Pattern

  • Any AI-driven test system where per-run costs matter
  • Systems where test determinism is important (CI gates, pre-release checks)
  • Large test suites (100+ tests) where LLM latency would dominate execution time

When NOT to Apply

  • Exploratory testing (AI judgment at runtime is the point)
  • One-off investigations (authoring overhead isn’t justified)
  • Tests that require adaptive behavior (the app changes between steps)

Warning Signs You Need This

  • Test suite takes 10+ minutes due to LLM calls
  • Monthly LLM bill scales with test run frequency
  • Same test produces different pass/fail results on identical app state

References

  • tests/e2e/specs/ — NL spec examples
  • tests/e2e/tests/ — Generated .spec.ts files
  • tests/e2e/.claude/agents/ — Playwright Test Agent definitions (Planner, Generator, Healer)
  • docs/reference/qa-testing.md — Full architecture documentation
