
AI at Authoring Time, Determinism at Execution Time

Status: RFC — Not Yet Implemented. This document describes an aspirational pattern for AI-assisted test generation. The codebase currently uses traditional Playwright tests. See also: ADR-009 (Agent Harness Architecture).

Problem

Using AI to interpret and execute test specifications at runtime is expensive, slow, and non-deterministic. Each test run incurs LLM costs, and identical specs can produce different test behavior across runs.

Symptoms:

  • Test suite costs scale linearly with run frequency (LLM tokens per run)
  • Same test spec produces different results on consecutive runs
  • Test execution takes minutes per scenario (LLM latency per step)
  • Flaky tests caused by LLM interpretation variance, not app bugs

Root Cause

Conflating two distinct concerns:

  1. Authoring — Understanding requirements, deciding what to test, writing test logic (benefits from AI judgment)
  2. Execution — Running tests against the app, asserting outcomes (needs determinism and speed)

When AI drives execution, every run pays the authoring cost again. The AI re-interprets specs, re-decides on selectors, and re-generates assertions — introducing variance at every step.

Solution

Split the pipeline into AI-assisted authoring and deterministic execution:

  • AI Authoring Phase (one-time cost): discover → NL specs → Planner → Generator → .spec.ts files
  • Deterministic Execution Phase (per-run, no AI): npx playwright test → results + report
  • AI Maintenance Phase (as-needed): audit → drift detection → Healer → fixed .spec.ts files

The Three Layers

| Layer | When AI Runs | Output | Durability |
| --- | --- | --- | --- |
| Discovery | Once per feature | NL spec (.md) | Permanent — survives UI changes |
| Generation | Once per spec change | .spec.ts file | Regenerable from NL spec |
| Execution | Every run | Test results | Ephemeral |

NL Specs as Durable Contract

Natural Language specs are the source of truth — not the generated test code:

```markdown
---
title: Workspace Page CRUD
area: workspace
priority: P0
persona: returning-user
---
### 1. Create a new page
User clicks "+ New Page", fills title "Test Page", clicks Create.
**Expected:** Page appears in sidebar tree and editor opens with the title.
```

This spec survives:

  • UI redesigns (selectors change, spec stays)
  • Framework upgrades (Playwright API changes, spec stays)
  • Backend refactors (API shapes change, spec stays)

Generated .spec.ts files are disposable — regenerate them from the spec when they break.
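To make the durable-contract idea concrete, here is a minimal sketch of how a Planner-style tool might parse such a spec into metadata and steps. The `parseSpec` helper and the `NlSpec` shape are illustrative assumptions, not part of the codebase:

```typescript
// Illustrative only: parse a minimal NL spec (YAML-ish frontmatter between
// "---" markers, plus "### "-prefixed step headings) into structured data.
interface NlSpec {
  meta: Record<string, string>;
  steps: string[];
}

function parseSpec(raw: string): NlSpec {
  const meta: Record<string, string> = {};
  const steps: string[] = [];
  let inFrontmatter = false;
  for (const line of raw.split("\n")) {
    if (line.trim() === "---") {
      inFrontmatter = !inFrontmatter; // toggle at each frontmatter fence
      continue;
    }
    if (inFrontmatter) {
      const idx = line.indexOf(":");
      if (idx > 0) meta[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
    } else if (line.startsWith("### ")) {
      steps.push(line.slice(4).trim());
    }
  }
  return { meta, steps };
}

const spec = parseSpec(`---
title: Workspace Page CRUD
area: workspace
priority: P0
---
### 1. Create a new page
User clicks "+ New Page".`);

console.log(spec.meta.title); // "Workspace Page CRUD"
console.log(spec.steps[0]);   // "1. Create a new page"
```

Because the spec stays structured and selector-free, the Generator can be rerun against it at any time to produce fresh .spec.ts code.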

Role Separation

| Role | Tool | Purpose |
| --- | --- | --- |
| Discover (Sentinel) | Claude | Analyze app, author NL specs |
| Plan (Playwright) | Planner Agent | Convert NL spec → test plan |
| Generate (Playwright) | Generator Agent | Convert plan → .spec.ts code |
| Execute (Playwright) | npx playwright test | Run tests deterministically |
| Heal (Playwright) | Healer Agent | Fix broken selectors/assertions |
| Audit (Sentinel) | Claude | Detect spec ↔ test ↔ app drift |

Cost Model

| Approach | Per-Run Cost | Deterministic? | Speed |
| --- | --- | --- | --- |
| AI interprets specs at runtime | $0.10–1.00/test | No | ~30s/test |
| AI generates code, npx executes | $0/run | Yes | ~0.5s/test |
| AI heals on failure only | $0.05–0.50/failure | N/A | On-demand |
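A quick break-even sketch using illustrative figures in the table's range (the $0.50 defaults below are assumptions, picked from the middle of the quoted band): runtime interpretation scales with every run, while one-time generation amortizes.

```typescript
// Cumulative LLM cost of N runs of a single test under each approach.
// Per-run and authoring figures are illustrative, not measured.
function runtimeAiCost(runs: number, perRun = 0.5): number {
  return runs * perRun; // every run pays the interpretation cost again
}

function generatedCost(authoringCost = 0.5): number {
  return authoringCost; // one-time authoring; execution itself is LLM-free
}

console.log(runtimeAiCost(100)); // 50   — $50 after 100 runs
console.log(generatedCost());    // 0.5  — one-time $0.50, regardless of runs
```

Under these assumptions the generated approach wins after the second run of a single test, and the gap grows linearly with run frequency.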

Implementation Notes

  • NL specs live in specs/ and are version-controlled (they’re the durable artifact)
  • Generated .spec.ts files live in tests/ and are also version-controlled (for review and debugging)
  • The Healer runs only when tests fail — not on every run
  • Audit runs periodically (not per-commit) to detect drift between specs, tests, and app
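One way the audit step could detect spec ↔ test drift cheaply, before involving an LLM, is to stamp each generated file with a hash of the spec it came from and recompute on audit. The header format and helper names below are assumptions, not the actual convention:

```typescript
import { createHash } from "node:crypto";

// Hypothetical convention: the Generator stamps each .spec.ts with a
// truncated SHA-256 of its source NL spec; audit recomputes and compares.
function specHash(specText: string): string {
  return createHash("sha256").update(specText).digest("hex").slice(0, 12);
}

function stampHeader(specText: string): string {
  return `// generated-from-spec: ${specHash(specText)}`;
}

function isDrifted(generatedHeader: string, currentSpecText: string): boolean {
  return generatedHeader !== stampHeader(currentSpecText);
}

const nlSpec = "### 1. Create a new page";
const header = stampHeader(nlSpec);
console.log(isDrifted(header, nlSpec));               // false — spec unchanged
console.log(isDrifted(header, nlSpec + " (edited)")); // true  — regenerate
```

A hash mismatch only flags that the spec changed after generation; deciding whether the generated test still covers it is where the AI-driven audit comes in.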

Prevention

When to Apply This Pattern

  • Any AI-driven test system where per-run costs matter
  • Systems where test determinism is important (CI gates, pre-release checks)
  • Large test suites (100+ tests) where LLM latency would dominate execution time

When NOT to Apply

  • Exploratory testing (AI judgment at runtime is the point)
  • One-off investigations (authoring overhead isn’t justified)
  • Tests that require adaptive behavior (the app changes between steps)

Warning Signs You Need This

  • Test suite takes 10+ minutes due to LLM calls
  • Monthly LLM bill scales with test run frequency
  • Same test produces different pass/fail results on identical app state

References

  • tests/e2e/specs/ — NL spec examples
  • tests/e2e/tests/ — Generated .spec.ts files
  • tests/e2e/.claude/agents/ — Playwright Test Agent definitions (Planner, Generator, Healer)
  • docs/reference/qa-testing.md — Full architecture documentation
