Testing AI
Engineering Confidence in AI Systems
A working draft for people building, shipping, buying, governing, or validating AI systems. This book is about moving past one-run demos and into evidence: samples, slices, traces, costs, failures, risk, and judgment.
For builders
Test RAG, agents, tool calls, MCP integrations, generated code, prompts, and model-backed workflows like real software.
For leaders
Learn what evidence to ask for before release: uncertainty, cost, rollback thresholds, severe failures, and ownership.
For reviewers
Comment on the draft. Feedback that materially improves the book earns a free copy of the final book and, if you want it, credit in the acknowledgements.
Concepts Covered
The book is intentionally broad because AI quality is no longer one skill. It is statistics, product judgment, automation, security, governance, and systems engineering.
Draft Preview
A sample of the 110 draft chapters currently in the book.
- The Next Generation Tester Will Measure Uncertainty
- What Makes a System Non-Deterministic?
- From Exact Assertions to Evaluation Criteria
- LLM as a Judge
- Variance: Not All Differences Are Bugs
- Sampling: One Run Tells You Almost Nothing
- Confidence Intervals: Saying "About" Like a Professional
- Comparing Versions With T-Tests
- P-Values: Evidence, Not Permission
- Metamorphic Testing
- Rare Failure Hunting
- Reproducibility: Logging the Right Things
- Release Gates for Non-Deterministic Systems
- Rubrics That Actually Work
- Adversarial and Red-Team Sampling
- NDCG for Search Relevance
- Stop Chasing High-Water Marks
- Evals and Benchmarks
- Testing a Chatbot
- Seeing Inside Models With Interpretability Tools
- RAG Evaluation
- Agent Trajectory Scoring
- Canary, Shadow, and Rollback Strategy
- Cost and Token Budget Testing
- Validation Is the Hard Part of AI-Generated Code
- Halting, Gödel, and the Limits of Testing AI-Generated Code
- Testing Whether AI Is Dangerous
- Quality as a Horizontal Layer
- The Last Engineers Standing
- The Future: Validation Becomes the Main Work
- Anti-Patterns: Treating the Judge as Truth
- Token Efficiency, Model Choice, and Business Value
- How to Read an AI Eval Report
- Aesthetic Judgment of AI Output
- Eval Case Examples for Prompts, Chatbots, and LLM Inputs
- Agentic Frameworks vs. Parameterized Workflows