---
name: tai-ch030-multiple-comparisons-and-false-discoveries
description: 'Apply chapter 30 of Testing AI, Multiple Comparisons and False Discoveries, as a workflow for evaluating AI and non-deterministic systems. Use for test planning, eval design, quality review, release evidence, examples, or coaching related to multiple comparisons and false discoveries.'
---

# Multiple Comparisons and False Discoveries

Skill name: `tai-ch030-multiple-comparisons-and-false-discoveries`

Based on **Testing AI: Engineering Confidence in AI Systems** by **Jason Arbon**.

## Purpose

The more slices, variants, and metrics you inspect, the more likely it is that at least one
result looks real purely by chance.

## Use This Workflow

- Identify the AI behavior or release decision being evaluated.
- Define realistic cases, slices, unacceptable outcomes, and evidence needed for confidence.
- Choose measurements that match the risk: rubric scores, samples, intervals, traces, human review, deterministic checks, or production monitors.
- Report uncertainty, severe failures, and decision impact instead of only a pass/fail result.

## Key Guidance

Multiple comparisons are a trap in AI evaluation. If you compare many prompts, many models, many
categories, and many metrics, some result will look impressive by chance. For example, testing
20 prompt variants and picking the one with the best p-value is not the same as proving that
variant is truly best. You may have selected noise.
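
To put a number on the 20-variant example, the sketch below computes the chance of seeing at least one false positive when none of the variants is actually better. It assumes the tests are independent and all use the same significance level, which real evals often only approximate. The function name `familywise_error_rate` is illustrative and does not come from the book.

```python
def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Probability of at least one false positive across independent tests.

    Assumes every null hypothesis is true and the tests are independent.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if num_tests < 0:
        raise ValueError("num_tests must be non-negative")
    # Chance that all tests stay quiet is (1 - alpha) ** num_tests.
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # 20 prompt variants, each tested at the usual 0.05 level.
    print(f"{familywise_error_rate(0.05, 20):.2f}")  # prints 0.64
```

Under these assumptions, roughly two times out of three at least one of the 20 variants clears the 0.05 bar even when no variant is a real improvement.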

## Apply The Approach

Create representative cases, score them with explicit criteria, review severe failures separately, report uncertainty, and connect the evidence to a concrete decision.

## Expert Notes

At expert level, separate exploratory analysis, confirmatory analysis, and monitoring. Use
holdout sets, preregistered primary metrics, adjusted significance thresholds (for example
Bonferroni), or false-discovery-rate methods (for example Benjamini-Hochberg) when many
comparisons are part of the process.
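
As one possible illustration of adjusted thresholds and false-discovery-rate control, the sketch below implements the Bonferroni correction and the Benjamini-Hochberg procedure in plain Python. These two methods are chosen here as common examples; the book does not prescribe them. The function names and the p-values are hypothetical, and Benjamini-Hochberg as written assumes independent (or positively dependent) tests.

```python
from typing import List, Sequence


def bonferroni_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Reject only p-values at or below alpha divided by the number of tests."""
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m
    return [p <= threshold for p in p_values]


def benjamini_hochberg_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Benjamini-Hochberg step-up procedure, controlling FDR at level alpha."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at or below alpha * k / m.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            reject[idx] = True
    return reject


if __name__ == "__main__":
    # Hypothetical p-values from ten slice-level comparisons.
    p_values = [0.041, 0.001, 0.212, 0.008, 0.06, 0.039, 0.216, 0.042, 0.074, 0.205]
    print(sum(p <= 0.05 for p in p_values))          # unadjusted: 5 look significant
    print(sum(bonferroni_reject(p_values)))          # Bonferroni: 1 survives
    print(sum(benjamini_hochberg_reject(p_values)))  # Benjamini-Hochberg: 2 survive
```

Bonferroni limits the chance of any false positive and is the stricter of the two; Benjamini-Hochberg instead limits the expected share of false positives among the results you flag, which tends to suit exploratory screening across many slices. Either way, a flagged result is better treated as a candidate to confirm on a holdout set than as a conclusion.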
