---
name: tai-theme-evals-statistics-and-judges
description: 'Use the Testing AI theme Evals, Statistics, and Judges to plan, review, or teach related AI quality work. Applies concepts and techniques from the book to testing AI, AI-generated software, and non-deterministic systems when relevant.'
---

# Evals, Statistics, and Judges

Skill name: `tai-theme-evals-statistics-and-judges`

Based on **Testing AI: Engineering Confidence in AI Systems** by **Jason Arbon**.

## Theme Purpose

Use these approaches when designing rubrics, statistical comparisons, LLM judges, benchmark reports, search relevance metrics, F-scores, and high-water-mark defenses.

Apply these concepts when testing AI, AI-generated software, model-backed features, agents, search, chatbots, RAG systems, generated code, dynamic interfaces, or other software whose behavior can vary across runs, users, data, tools, or time.

## How To Use This Theme

- Identify the behavior, capability, risk, or release decision being evaluated.
- Choose the relevant concepts below and turn them into concrete eval cases, samples, traces, checks, rubrics, metrics, or release gates.
- Prefer evidence that supports a decision: ship, canary, hold, rollback, or collect more samples.
- Report by slices and severe failures when averages hide risk.
- Preserve enough evidence that another person or agent can understand what was tested, how it was measured, and why the recommendation follows.
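
As one way to picture the slice-reporting step above, here is a minimal sketch, not a prescribed implementation. The record keys (`slice`, `passed`, `severe`), the `min_pass_rate` threshold, and the sample data are hypothetical and invented for illustration.

```python
from collections import defaultdict


def overall_pass_rate(results):
    """Fraction of all cases that passed, ignoring slices."""
    results = list(results)
    return sum(1 for r in results if r["passed"]) / len(results)


def slice_report(results, min_pass_rate=0.9):
    """Summarize pass rate and severe-failure count per slice.

    `results` is an iterable of dicts with the hypothetical keys
    'slice' (str), 'passed' (bool), and 'severe' (bool).
    """
    by_slice = defaultdict(list)
    for r in results:
        by_slice[r["slice"]].append(r)

    report = {}
    for name, rows in by_slice.items():
        n = len(rows)
        rate = sum(1 for r in rows if r["passed"]) / n
        severe = sum(1 for r in rows if r["severe"])
        report[name] = {
            "n": n,
            "pass_rate": rate,
            "severe_failures": severe,
            # Flag a slice if it misses the threshold or has any severe failure.
            "flag": rate < min_pass_rate or severe > 0,
        }
    return report


# Invented sample data: a large healthy slice and a small unhealthy one.
results = (
    [{"slice": "english", "passed": True, "severe": False}] * 95
    + [{"slice": "english", "passed": False, "severe": False}] * 5
    + [{"slice": "spanish", "passed": True, "severe": False}] * 6
    + [{"slice": "spanish", "passed": False, "severe": True}] * 4
)

report = slice_report(results)
print(f"overall pass rate: {overall_pass_rate(results):.3f}")
for name, row in sorted(report.items()):
    print(name, row)
```

In this invented data the overall pass rate clears the threshold while the smaller slice does not, which is the situation the bullet on slices and severe failures warns about.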

## Concepts And Techniques To Apply

- Design evals as measurement systems with known populations, cases, labels, rubrics, judges, and failure categories.
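- Use paired comparisons, t-tests, confidence intervals, p-values, power analysis, and minimum detectable effect only when their assumptions fit, for example that both variants ran on the same cases before pairing, or that samples are independent and large enough for the chosen test.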
- Distinguish statistical significance from product significance, safety significance, and business significance.
- Use LLM-as-a-judge workflows with calibration sets, human review, disagreement queues, and judge versioning.
- Measure search relevance with graded relevance, ranking metrics such as NDCG, F-scores for precision/recall tradeoffs, and query-slice reports.
- Avoid high-water-mark chasing by reporting distributions and uncertainty, not only best observed runs; the best of several runs overstates what a typical run will deliver.
- Treat benchmarks and eval suites as imperfect instruments that require inspection, provenance, and known limitations.
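
Several of the techniques above can be sketched with the Python standard library alone. The block below is an illustrative sketch under stated assumptions, not the book's method: it swaps in a percentile bootstrap for the paired comparison (so it does not rely on t-test assumptions), uses the linear-gain form of NDCG (other variants use `2**rel - 1`), and computes the ideal ranking only from the labels of the retrieved list. All function names, parameters, and scores are hypothetical.

```python
import math
import random
from statistics import mean


def paired_bootstrap_ci(baseline, candidate, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-case score difference.

    Assumes baseline[i] and candidate[i] were scored on the same eval case.
    Returns (mean_difference, lower_bound, upper_bound).
    """
    if len(baseline) != len(candidate):
        raise ValueError("paired comparison needs equal-length score lists")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    n = len(diffs)
    boot = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    lo = boot[int((alpha / 2) * n_boot)]
    hi = boot[int((1 - alpha / 2) * n_boot) - 1]
    return mean(diffs), lo, hi


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(gains, k):
    """NDCG@k with linear gain; the ideal is the same labels sorted descending."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def f_beta(precision, recall, beta=1.0):
    """F-score; beta > 1 weights recall more, beta < 1 weights precision more."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Invented per-case scores for two model versions on the same eight cases.
baseline = [0.6, 0.7, 0.5, 0.8, 0.6, 0.7, 0.9, 0.4]
candidate = [0.7, 0.7, 0.6, 0.9, 0.6, 0.8, 0.9, 0.6]

mean_diff, lo, hi = paired_bootstrap_ci(baseline, candidate)
print(f"mean difference {mean_diff:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
print(f"NDCG@4 for labels [3, 0, 2, 1]: {ndcg_at_k([3, 0, 2, 1], 4):.3f}")
print(f"F2 at precision 0.5, recall 1.0: {f_beta(0.5, 1.0, beta=2):.3f}")
```

Reporting the interval alongside the mean difference keeps the comparison from collapsing into a single best-looking number, and whether the interval excludes zero is a separate question from whether the difference matters for the product.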

## Reporting Guidance

- State what was tested and what population the evidence represents.
- Explain uncertainty, missing coverage, severe failures, and known blind spots.
- Connect findings to a concrete decision or next action.
- Use topic-specific chapter skills only when deeper detail is needed; this theme skill should stand alone as practical guidance.
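
When a report needs to explain uncertainty around a pass rate, one common option is a Wilson score interval. The sketch below is a minimal illustration that assumes independent pass/fail cases; the function name and the example counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(passes, n, confidence=0.95):
    """Wilson score interval for a pass rate of `passes` out of `n` cases.

    Assumes each case is an independent pass/fail trial.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Same observed pass rate, very different amounts of evidence.
small = wilson_interval(45, 50)
large = wilson_interval(4500, 5000)
print(f"45/50 passed:     [{small[0]:.3f}, {small[1]:.3f}]")
print(f"4500/5000 passed: [{large[0]:.3f}, {large[1]:.3f}]")
```

Both samples show the same observed pass rate, but the smaller one supports a much wider range of true rates, which is worth stating next to the headline number.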