Testing AI

Engineering Confidence in AI Systems

A working draft for people building, shipping, buying, governing, or validating AI systems. This book is about moving past one-run demos and into evidence: samples, slices, traces, costs, failures, risk, and judgment.

For builders

Test RAG, agents, tool calls, MCP integrations, generated code, prompts, and model-backed workflows like real software.

For leaders

Learn what evidence to ask for before release: uncertainty, cost, rollback thresholds, severe failures, and ownership.

For reviewers

Comment on the draft. Feedback that materially improves the book earns a free copy of the final book and, optionally, credit in the acknowledgements.

Concepts Covered

The book is intentionally broad because AI quality is no longer one skill. It is statistics, product judgment, automation, security, governance, and systems engineering.

evals, sampling, confidence intervals, LLM judges, RAG evaluation, agent tracing, token cost, rubrics, human raters, bias testing, NDCG, t-tests, p-values, canary releases, MCP, Ollama, privacy, generated code, red teaming, observability, synthetic data, production traces, prompt versioning, metamorphic testing, aesthetic judgment, data contracts, rollback thresholds, chatbot testing, multimodal AI, dangerous capability evals, quality as a layer

Draft Preview

A sample of the 110 draft chapters currently in the book.

  1. The Next Generation Tester Will Measure Uncertainty
  2. What Makes a System Non-Deterministic?
  3. From Exact Assertions to Evaluation Criteria
  4. LLM as a Judge
  5. Variance: Not All Differences Are Bugs
  6. Sampling: One Run Tells You Almost Nothing
  7. Confidence Intervals: Saying "About" Like a Professional
  8. Comparing Versions With T-Tests
  9. P-Values: Evidence, Not Permission
  10. Metamorphic Testing
  11. Rare Failure Hunting
  12. Reproducibility: Logging the Right Things
  13. Release Gates for Non-Deterministic Systems
  14. Rubrics That Actually Work
  15. Adversarial and Red-Team Sampling
  16. NDCG for Search Relevance
  17. Stop Chasing High-Water Marks
  18. Evals and Benchmarks
  19. Testing a Chatbot
  20. Seeing Inside Models With Interpretability Tools
  21. RAG Evaluation
  22. Agent Trajectory Scoring
  23. Canary, Shadow, and Rollback Strategy
  24. Cost and Token Budget Testing
  25. Validation Is the Hard Part of AI-Generated Code
  26. Halting, Gödel, and the Limits of Testing AI-Generated Code
  27. Testing Whether AI Is Dangerous
  28. Quality as a Horizontal Layer
  29. The Last Engineers Standing
  30. The Future: Validation Becomes the Main Work
  31. Anti-Patterns: Treating the Judge as Truth
  32. Token Efficiency, Model Choice, and Business Value
  33. How to Read an AI Eval Report
  34. Aesthetic Judgment of AI Output
  35. Eval Case Examples for Prompts, Chatbots, and LLM Inputs
  36. Agentic Frameworks vs. Parameterized Workflows