Test Suite Health
What's the suite's flake rate, measured honestly — not "what we admit to," but "what shows up in CI rerun logs"?
A suite with a flake rate above 10% cannot benefit from AI until the foundation is stabilized.
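One way to measure the rate from rerun logs rather than from memory is sketched below. It assumes each log entry can be reduced to a (commit, test, passed) record, and it counts a test as flaky on a commit when it both failed and passed there. The `Attempt` record and `flake_rate` function are hypothetical names, and other definitions of flake rate are equally defensible.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Attempt(NamedTuple):
    """One execution of one test on one commit (hypothetical record shape)."""
    commit: str
    test: str
    passed: bool


def flake_rate(attempts: Iterable[Attempt]) -> float:
    """Fraction of (commit, test) pairs that both failed and passed."""
    outcomes: Dict[Tuple[str, str], Set[bool]] = defaultdict(set)
    for attempt in attempts:
        outcomes[(attempt.commit, attempt.test)].add(attempt.passed)
    if not outcomes:
        return 0.0
    # Same code, different results: that pair is counted as flaky.
    flaky = sum(1 for seen in outcomes.values() if seen == {True, False})
    return flaky / len(outcomes)
```

A test that only ever fails on a commit is not counted as flaky here, so consistent failures do not inflate the number.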
How are selectors written today — IDs, CSS, XPath, data-test attributes, role-based, mixed?
Selector strategy determines whether AI self-healing is a 1-day or a 6-month project.
How is synchronization handled — explicit waits, polling, hardcoded sleeps, framework auto-wait, mix?
Hardcoded sleeps are the second-largest source of flake and the easiest fix.
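The usual replacement for a hardcoded sleep is a bounded polling wait. The following is a minimal, framework-independent sketch; `wait_until` and its default timeout and interval are illustrative choices, and most browser automation frameworks ship their own equivalent that should be preferred where available.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(condition: Callable[[], T], timeout: float = 5.0,
               interval: float = 0.1) -> T:
    """Poll `condition` until it returns a truthy value or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

Unlike `time.sleep(5)`, this returns as soon as the condition holds and fails with an explicit error when it never does.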
What percentage of your tests have meaningful, behavior-level assertions vs. "the page loaded without throwing"?
Assertion-light tests inflate coverage metrics without catching bugs. AI cannot fix this.
When a test fails on main, how long does it typically take to triage — minutes, hours, "we look at it tomorrow"?
Triage time is the single biggest opportunity for an AI failure-classifier.
Architecture & Tooling
What CI/CD platform runs the tests, and how parallelized is the suite today?
If parallelization is poor, AI test generation will produce a suite that's too slow to run.
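To estimate what parallelization would buy, per-test durations can be packed onto workers and the slowest worker taken as the wall-clock time. The sketch below uses greedy longest-first assignment, which is a common heuristic and not guaranteed optimal; `shard_tests`, `wall_clock_seconds`, and the sample durations are hypothetical.

```python
import heapq
from typing import Dict, List


def shard_tests(durations: Dict[str, float], workers: int) -> List[List[str]]:
    """Assign tests to workers, longest first, always to the least-loaded worker."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    heap = [(0.0, i) for i in range(workers)]  # (current load, worker index)
    buckets: List[List[str]] = [[] for _ in range(workers)]
    for name, secs in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
        load, i = heapq.heappop(heap)
        buckets[i].append(name)
        heapq.heappush(heap, (load + secs, i))
    return buckets


def wall_clock_seconds(durations: Dict[str, float], buckets: List[List[str]]) -> float:
    """The suite finishes when the most heavily loaded worker finishes."""
    return max(sum(durations[name] for name in bucket) for bucket in buckets)
```

For example, four tests taking 5, 4, 3, and 2 seconds need 14 seconds serially and 7 seconds on two workers under this assignment.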
What does your test data strategy look like — fixtures, factories, shared staging DB, full reset between runs?
Shared mutable state breaks AI-generated tests faster than anything else.
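A factory that hands each test its own uniquely keyed record is one common way to avoid shared mutable state. The sketch below is illustrative only: the `User` shape, the `make_user` helper, and the `example.test` addresses are assumptions, not part of any particular framework.

```python
import itertools
from dataclasses import dataclass, field
from typing import List

_sequence = itertools.count(1)  # process-wide counter for unique keys


@dataclass
class User:
    email: str
    name: str = "Test User"
    # default_factory gives every instance its own list, never a shared one
    roles: List[str] = field(default_factory=list)


def make_user(**overrides) -> User:
    """Build a fresh user with a unique email; callers override only what matters."""
    values = {"email": f"user{next(_sequence)}@example.test"}
    values.update(overrides)
    return User(**values)
```

Because no two calls return the same email or share a `roles` list, tests built on this helper do not collide when run in parallel or in a different order.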
Where do tests run — local Docker, cloud grid (BrowserStack/Sauce/LambdaTest), self-hosted Selenium grid, Playwright cloud?
The execution environment determines which AI tooling can integrate without re-architecting infrastructure.
How do you handle visual regressions today — pixel diffs, ignored, manual review, none?
Visual checks are the highest-leverage AI add-on. If a baseline already exists, integration is easy.
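For reference, a plain pixel diff against a baseline can be as small as the sketch below. It assumes images are already decoded into rows of RGB tuples, and `diff_ratio` and its `tolerance` parameter are hypothetical names. Real tools add anti-aliasing handling and region masking that this omits.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def diff_ratio(baseline: Image, candidate: Image, tolerance: int = 0) -> float:
    """Fraction of pixels where any channel differs by more than `tolerance`."""
    if len(baseline) != len(candidate) or any(
        len(rb) != len(rc) for rb, rc in zip(baseline, candidate)
    ):
        raise ValueError("images must have identical dimensions")
    total = sum(len(row) for row in baseline)
    if total == 0:
        return 0.0
    changed = sum(
        1
        for rb, rc in zip(baseline, candidate)
        for pb, pc in zip(rb, rc)
        if any(abs(x - y) > tolerance for x, y in zip(pb, pc))
    )
    return changed / total
```

A test would then fail when the ratio exceeds a threshold the team has agreed on, which is the number an AI-based visual check would replace or refine.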
How are page objects / fixtures organized — Page Object Model, screenplay pattern, ad-hoc, none?
Architecture determines how cleanly AI-generated tests will integrate with what you have.
Coverage & Authoring
How do you decide what to test? Risk-based, requirements traceability, "what we always tested," gut feel?
If there's no method, an AI coverage analyzer will surface dozens of unaddressed risks.
When a new feature ships, who writes the tests — the dev who built it, a dedicated SDET, QA after the fact, no one?
Test ownership tells you where AI authoring tools should plug in.
What percentage of your test cases live in code vs. test management tools (TestRail, Zephyr, Xray, qTest, spreadsheets)?
Manual cases are prime candidates for AI-assisted automation conversion.
How much of the application is covered by accessibility, performance, and security tests today?
These three categories are where AI checks deliver the most surprising value.
When was the last time you deleted a test? What was the criterion?
Teams that never delete have suites full of dead weight that AI evaluation will need to prune.
Operations & Failure Modes
When the suite fails on main, who gets paged, what's the SLA, and is there a rollback path?
If failures aren't blocking, the suite isn't really protecting anything.
How do you currently distinguish "real bug" from "flaky test" from "infra problem" in a failure?
This is exactly the workflow an AI triage agent automates.
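Before automating this workflow, it helps to write down the rules a human already applies. The sketch below is a deliberately naive rule-based baseline: the log patterns, the `Verdict` labels, and the `classify` function are illustrative assumptions, and a real classifier would need patterns drawn from the team's own failure history.

```python
import re
from enum import Enum


class Verdict(Enum):
    INFRA = "infra problem"
    FLAKY = "flaky test"
    LIKELY_BUG = "likely real bug"


# Illustrative signatures of environment trouble; not an exhaustive list.
INFRA_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (
        r"connection refused",
        r"could not resolve host",
        r"no space left on device",
    )
]


def classify(log: str, passed_on_rerun: bool) -> Verdict:
    """Apply the triage rules in priority order: infra, then flake, then bug."""
    if any(pattern.search(log) for pattern in INFRA_PATTERNS):
        return Verdict.INFRA
    if passed_on_rerun:
        return Verdict.FLAKY
    return Verdict.LIKELY_BUG
```

Comparing an AI triage agent against a baseline like this shows whether it adds anything beyond pattern matching and rerun results.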
What's the suite's runtime — and how does that compare to your team's tolerance for waiting on a PR?
If runtime exceeds tolerance, developers are probably skipping tests, and AI must shrink the suite, not grow it.
Do you have observability into test execution — duration trends, flake trends, coverage trends, failure clustering?
Without telemetry, there is no way to measure AI's impact later.
What's the worst test in your suite — the one everyone curses but nobody fixes? Why hasn't it been fixed?
The honest answer to this question reveals more about your readiness than any survey.
AI-Specific Readiness
Have you experimented with AI-generated tests yet — Copilot, Cursor, prompt-driven test authoring? What worked, what didn't?
Real experience beats opinion, and this question surfaces what has already been tried and abandoned.
Are there parts of the codebase or test data that cannot be sent to a third-party AI service? Where are the boundaries?
The answer determines whether you need self-hosted models, redaction layers, or air-gapped tooling.
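As an illustration of what a redaction layer does, the sketch below masks email addresses and simple key-value secrets before text leaves the boundary. The two patterns and the `redact` function are assumptions chosen for brevity; they are far from exhaustive and would not be sufficient as a real control.

```python
import re

# Illustrative patterns only; real redaction needs a reviewed, broader rule set.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET = re.compile(r"(?i)\b(api[_-]?key|token|password)\b(\s*[:=]\s*)(\S+)")


def redact(text: str) -> str:
    """Mask secret values first, then email addresses."""
    text = SECRET.sub(lambda m: f"{m.group(1)}{m.group(2)}<REDACTED>", text)
    return EMAIL.sub("<EMAIL>", text)
```

Running every prompt, log excerpt, and fixture through a function like this gives a single place to audit what a third-party service can see.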
If an AI proposed a change to a test selector overnight and committed it to a PR, who reviews it and what do they look for?
The answer defines the human-in-the-loop pattern before vendors define it for you.
What would you need to see in a 2-week pilot to recommend expanding AI across the whole suite?
If the team can't answer this, the pilot won't have an exit criterion and will drag on indefinitely.