$ cat ~/posts/playwright-ai-test-explosion.md
STRATEGY 24 May 2026 · ~6 min read · 367 words

When AI can write every test, what ships to CI is the job.

AI-generated Playwright tests now show a flake rate under 1.5%. The new problem is test explosion, and coverage intent is still yours to define.

Tim Stacey
lead quality engineer · @timjstacey

AI-generated Playwright tests show a flake rate under 1.5% when teams use role-based locators and structured output, per TestDino’s generation benchmarks. The generation problem is solved.

The new problem is volume

Currents.dev calls it test explosion: agents can generate coverage for every route, form, and edge case your app exposes. Your team then decides which of those belong in CI, which are redundant, and what your coverage signal means beyond a raw test count. A suite that triples overnight is not three times the confidence.
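One way to make that decision concrete is to tag each generated test with the behavior it claims to cover, then count behaviors instead of files. The sketch below assumes a hypothetical `GeneratedTest` shape with `route` and `intent` fields; none of it is a Playwright or Currents.dev API, and real deduplication would need something smarter than string matching.

```typescript
// Hypothetical metadata attached to each generated spec. The field names
// are assumptions for illustration, not part of any real tool's API.
interface GeneratedTest {
  file: string;
  route: string;  // the page the test exercises
  intent: string; // the behavior it claims to verify
}

interface TriageResult {
  keep: GeneratedTest[];      // first test seen for each route + intent pair
  redundant: GeneratedTest[]; // later tests repeating an already-covered pair
  distinctBehaviors: number;  // the coverage signal: behaviors, not test count
}

function triage(tests: GeneratedTest[]): TriageResult {
  const seen: Record<string, boolean> = {};
  const keep: GeneratedTest[] = [];
  const redundant: GeneratedTest[] = [];

  tests.forEach((t) => {
    // Normalize case and whitespace so trivially different labels collide.
    const key = t.route + "::" + t.intent.trim().toLowerCase();
    if (seen[key]) {
      redundant.push(t);
    } else {
      seen[key] = true;
      keep.push(t);
    }
  });

  return { keep, redundant, distinctBehaviors: keep.length };
}

const generated: GeneratedTest[] = [
  { file: "checkout.ai.spec.ts", route: "/checkout", intent: "rejects expired card" },
  { file: "checkout-2.ai.spec.ts", route: "/checkout", intent: "Rejects expired card" },
  { file: "checkout-3.ai.spec.ts", route: "/checkout", intent: "applies coupon" },
];

// Three tests in, two distinct behaviors out.
const triaged = triage(generated);
```

Here the suite grew by three files but the coverage grew by two behaviors, which is the number worth reporting.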

A gate that treats generated code like written code

The production pattern holding up is reliability gating. AI drafts the test, an engineer reviews the PR, and the test holds in CI for five to ten passing runs before it earns a merge-gate slot:

playwright.config.ts (entry in the projects array)
// New AI-drafted specs run in a quarantine project first.
// Promote to the blocking suite only after 5-10 green runs.
{ name: 'quarantine', testMatch: /.*\.ai\.spec\.ts/, retries: 0 }

Same bar as code you wrote by hand. A test that cannot stay green for a week does not get to block a deploy.
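The promotion rule itself is small enough to write down. This is a minimal sketch, assuming "five to ten passing runs" means consecutive green runs counted back from the most recent one; the `RunRecord` shape and the `canPromote` helper are hypothetical, and your CI provider's history format will differ.

```typescript
// Hypothetical run record. In practice this comes from your CI history.
interface RunRecord {
  spec: string;
  passed: boolean;
}

// Counts consecutive passes for one spec, newest run first.
// Assumption: any failure resets the streak to zero.
function greenStreak(runs: RunRecord[], spec: string): number {
  let streak = 0;
  for (let i = runs.length - 1; i >= 0; i--) {
    if (runs[i].spec !== spec) continue; // ignore other specs' runs
    if (!runs[i].passed) break;          // streak ends at the latest failure
    streak++;
  }
  return streak;
}

function canPromote(runs: RunRecord[], spec: string, required: number = 5): boolean {
  return greenStreak(runs, spec) >= required;
}

// Ordered oldest first.
const ciRuns: RunRecord[] = [
  { spec: "cart.ai.spec.ts", passed: false },
  { spec: "login.ai.spec.ts", passed: true },
  { spec: "cart.ai.spec.ts", passed: true },
  { spec: "cart.ai.spec.ts", passed: true },
  { spec: "login.ai.spec.ts", passed: true },
  { spec: "cart.ai.spec.ts", passed: true },
  { spec: "cart.ai.spec.ts", passed: true },
  { spec: "cart.ai.spec.ts", passed: true },
  { spec: "login.ai.spec.ts", passed: false },
];

const cartReady = canPromote(ciRuns, "cart.ai.spec.ts");   // true: 5 greens since its last failure
const loginReady = canPromote(ciRuns, "login.ai.spec.ts"); // false: latest run failed
```

Raising `required` to 10 keeps `cart.ai.spec.ts` in quarantine too, which is the stricter end of the range.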

Coverage intent is the hard part

AI asserts what shows on screen. It cannot tell whether that outcome is correct without a human-defined pass/fail rule. Generating a checkout-flow test takes seconds; defining what a correct checkout looks like takes someone who knows the business rule and writes it down. A £0.00 total could be a valid free order or a pricing bug, and only your domain knowledge answers that.
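Writing the rule down can be as literal as the sketch below. The `Order` shape and the rule itself (a total is correct only when it equals the subtotal minus the discount, floored at zero) are assumptions for illustration; your real rule comes from whoever owns pricing.

```typescript
// Hypothetical order shape. Amounts are in pence to avoid float rounding.
interface LineItem {
  sku: string;
  unitPricePence: number;
  quantity: number;
}

interface Order {
  items: LineItem[];
  discountPence: number;
  totalPence: number; // what the checkout page displayed
}

type Verdict = "correct" | "pricing-bug";

// Assumed business rule: total = max(0, subtotal - discount).
function judgeTotal(order: Order): Verdict {
  const subtotal = order.items.reduce(
    (sum, item) => sum + item.unitPricePence * item.quantity,
    0
  );
  const expected = Math.max(0, subtotal - order.discountPence);
  return order.totalPence === expected ? "correct" : "pricing-bug";
}

// Both orders render "£0.00" on screen.
const freeOrder: Order = {
  items: [{ sku: "TSHIRT", unitPricePence: 1500, quantity: 1 }],
  discountPence: 1500, // a full-value coupon explains the zero
  totalPence: 0,
};

const brokenOrder: Order = {
  items: [{ sku: "TSHIRT", unitPricePence: 1500, quantity: 1 }],
  discountPence: 0, // nothing explains the zero
  totalPence: 0,
};

const freeVerdict = judgeTotal(freeOrder);     // "correct"
const brokenVerdict = judgeTotal(brokenOrder); // "pricing-bug"
```

A generated test that only asserts the displayed total passes on both orders. The rule is what separates them.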

Agents amplify whatever foundation you give them. With inconsistent locators, they generate ten failing tests where one used to live, and brittle fixtures get exercised at scale before anyone notices.
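A cheap way to see the foundation an agent is building on is to scan generated specs for raw selectors before they reach review. This is a rough sketch: `getByRole` and `locator` are real Playwright methods, but the regex scan is an assumption about how the generated source is written, not a parser, and `scanLocators` is a hypothetical helper.

```typescript
interface LocatorReport {
  roleBased: number;      // count of getByRole(...) calls
  rawSelectors: string[]; // CSS or XPath strings passed to locator(...)
}

// Text-level scan only. It will miss selectors built from variables.
function scanLocators(source: string): LocatorReport {
  const roleBased = (source.match(/\.getByRole\(/g) || []).length;

  const rawSelectors: string[] = [];
  const rawCall = /\.locator\(\s*(['"])(.*?)\1/g;
  let m: RegExpExecArray | null;
  while ((m = rawCall.exec(source)) !== null) {
    rawSelectors.push(m[2]);
  }

  return { roleBased, rawSelectors };
}

const generatedSpec = [
  "await page.getByRole('button', { name: 'Pay now' }).click();",
  "await page.locator('#root > div:nth-child(3) button').click();",
  "await page.locator(\"//form/div[2]/input\").fill('4242');",
].join("\n");

// One role-based locator, two raw selectors worth a second look.
const locatorReport = scanLocators(generatedSpec);
```

A spec where `rawSelectors` outnumbers `roleBased` is a candidate for regeneration with stricter instructions before it ever enters quarantine.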

Spend the recovered time on intent

The 2026 ecosystem survey and BuildBetter’s guide both arrive at the same place: generation is cheap, judgment is not. Roundups from QA Wolf and Qate compare the tools, but the Playwright release notes make the foundation point concrete: role-based locators and structured output are what keep generated tests stable.

Your AI tools can draft a full suite by morning. What earns a slot in CI, and what a green run is allowed to mean, stays your call.

$ echo "EOF · thanks for reading"
Written by
Tim Stacey
Lead quality engineer. Writes about testing strategy.