---
name: tai-ch147-testing-systems-that-may-be-smarter-than-us
description: 'Apply chapter 147 of Testing AI, Testing Systems That May Be Smarter Than Us, as a workflow for evaluating AI and non-deterministic systems. Use for test planning, eval design, quality review, release evidence, examples, or coaching related to testing systems that may be smarter than us.'
---

# Testing Systems That May Be Smarter Than Us

Skill name: `tai-ch147-testing-systems-that-may-be-smarter-than-us`

Based on **Testing AI: Engineering Confidence in AI Systems** by **Jason Arbon**.

## Purpose

At some point, the testing problem becomes: how does the less capable evaluator test the more
capable system?

## Use This Workflow

- Identify the AI behavior or release decision being evaluated.
- Define realistic cases, slices, unacceptable outcomes, and evidence needed for confidence.
- Choose measurements that match the risk: rubric scores, samples, intervals, traces, human review, deterministic checks, or production monitors.
- Report uncertainty, severe failures, and decision impact instead of only a pass/fail result.
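
The "report uncertainty" step above can be sketched as a confidence interval on a pass rate. This is a minimal illustration, not a method taken from the book: it assumes each case has a simple pass/fail outcome, uses the standard Wilson score interval, and the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if total == 0:
        # No evidence at all: the pass rate could be anything.
        return (0.0, 1.0)
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


# Hypothetical run: 8 of 10 sampled cases passed.
low, high = wilson_interval(8, 10)
print(f"pass rate 0.80, 95% interval [{low:.2f}, {high:.2f}]")
```

With only ten cases the interval is wide, which is the point of reporting it: a bare "80% pass" hides how little the sample actually establishes.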

## Key Guidance

Most software testing assumes the tester can understand the system well enough to judge it.
Frontier AI strains that assumption. If a system is better than humans at code, science,
persuasion, planning, or strategy, then ordinary human review becomes a weaker check, because
the reviewer may be unable to tell a correct output from a subtly wrong or persuasive one.

## Apply The Approach

Create representative cases, score them with explicit criteria, review severe failures separately, report uncertainty, and connect the evidence to a concrete decision.
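
One way to make "explicit criteria" and "review severe failures separately" concrete is sketched below. The rubric, its point values, the 0.8 threshold, and the names `RUBRIC`, `CaseResult`, and `decide` are all assumptions made for illustration; a real evaluation would define its own criteria and decision rule.

```python
from dataclasses import dataclass

# Hypothetical rubric: criterion name -> points awarded when the criterion is met.
RUBRIC = {"correct": 5, "grounded": 3, "safe": 2}
MAX_POINTS = sum(RUBRIC.values())


@dataclass
class CaseResult:
    case_id: str
    checks: dict[str, bool]  # criterion name -> whether it was met
    severe: bool = False     # flagged as an unacceptable outcome


def score(result: CaseResult) -> float:
    """Fraction of rubric points earned by one case, between 0.0 and 1.0."""
    earned = sum(pts for name, pts in RUBRIC.items() if result.checks.get(name, False))
    return earned / MAX_POINTS


def decide(results: list[CaseResult], min_mean: float = 0.8) -> dict:
    """Tie the evidence to a decision; severe cases block regardless of the mean."""
    severe_ids = [r.case_id for r in results if r.severe]
    mean = sum(score(r) for r in results) / len(results) if results else 0.0
    if severe_ids:
        decision = "block"
    elif mean >= min_mean:
        decision = "ship"
    else:
        decision = "investigate"
    return {"mean_score": mean, "severe_case_ids": severe_ids, "decision": decision}
```

Severe cases are listed by ID rather than averaged in, so a high mean score cannot bury a single unacceptable outcome.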

## Expert Notes

At expert level, the future testing stack must combine empirical evals, containment, adversarial
institutions, scalable oversight, interpretability, provable constraints, monitoring, incident
learning, and humility. The goal is not perfect knowledge. The goal is enough independent
evidence and bounded power that trust does not depend on the system marking its own homework.
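
The idea that trust should not rest on the system marking its own homework can be sketched as a gate that ignores self-produced evidence. This is a rough sketch under stated assumptions: the `Evidence` record, the `independent_support` function, and the rule of requiring at least two distinct kinds of independent evidence are hypothetical choices, not prescriptions from the book.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    source: str             # who produced the verdict, e.g. "human-review"
    kind: str               # e.g. "empirical-eval", "monitoring", "interpretability"
    supports_release: bool


def independent_support(
    evidence: list[Evidence], system_under_test: str, min_kinds: int = 2
) -> bool:
    """True only if enough distinct kinds of evidence, none produced by the
    system under test, support release and no independent evidence opposes it."""
    independent = [e for e in evidence if e.source != system_under_test]
    if any(not e.supports_release for e in independent):
        return False
    kinds = {e.kind for e in independent}
    return len(kinds) >= min_kinds


# Hypothetical usage: the model's own self-report does not count toward the gate.
evidence = [
    Evidence("model-x", "self-report", True),
    Evidence("human-review", "empirical-eval", True),
    Evidence("prod-monitor", "monitoring", True),
]
print(independent_support(evidence, system_under_test="model-x"))
```

Filtering by source is a deliberately crude notion of independence; in practice an evaluator built from the same model family may also need to be treated as non-independent.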