Testing AI

Engineering Confidence in AI Systems

A working draft for people building, shipping, buying, governing, or validating AI systems. This book is about moving past one-run demos and into evidence: samples, slices, traces, costs, failures, risk, and judgment.

For builders

Test RAG, agents, tool calls, MCP integrations, generated code, prompts, and model-backed workflows like real software.

For leaders

Learn what evidence to ask for before release: uncertainty, cost, rollback thresholds, severe failures, and ownership.

For reviewers

Comment on the draft. Feedback that materially improves the book earns a free copy of the final book and, optionally, credit in the acknowledgements.

Concepts Covered

The book is intentionally broad because AI quality is no longer one skill. It is statistics, product judgment, automation, security, governance, and systems engineering.

evals, sampling, confidence intervals, LLM judges, RAG evaluation, agent tracing, token cost, rubrics, human raters, bias testing, NDCG, t-tests, p-values, canary releases, MCP, Ollama, privacy, generated code, red teaming, observability, synthetic data, production traces, prompt versioning, metamorphic testing, aesthetic judgment, data contracts, rollback thresholds, chatbot testing, multimodal AI, dangerous capability evals, quality as a layer

Draft Preview

A sample of the 110 draft chapters currently in the book.

The Next Generation Tester Will Measure Uncertainty
What Makes a System Non-Deterministic?
From Exact Assertions to Evaluation Criteria
LLM as a Judge
Variance: Not All Differences Are Bugs
Sampling: One Run Tells You Almost Nothing
Confidence Intervals: Saying "About" Like a Professional
Comparing Versions With T-Tests
P-Values: Evidence, Not Permission
Metamorphic Testing
Rare Failure Hunting
Reproducibility: Logging the Right Things
Release Gates for Non-Deterministic Systems
Rubrics That Actually Work
Adversarial and Red-Team Sampling
NDCG for Search Relevance
Stop Chasing High-Water Marks
Evals and Benchmarks
Testing a Chatbot
Seeing Inside Models With Interpretability Tools
RAG Evaluation
Agent Trajectory Scoring
Canary, Shadow, and Rollback Strategy
Cost and Token Budget Testing
Validation Is the Hard Part of AI-Generated Code
Halting, Gödel, and the Limits of Testing AI-Generated Code
Testing Whether AI Is Dangerous
Quality as a Horizontal Layer
The Last Engineers Standing
The Future: Validation Becomes the Main Work
Anti-Patterns: Treating the Judge as Truth
Token Efficiency, Model Choice, and Business Value
How to Read an AI Eval Report
Aesthetic Judgment of AI Output
Eval Case Examples for Prompts, Chatbots, and LLM Inputs
Agentic Frameworks vs. Parameterized Workflows