---
name: tai-ch064-agent-trajectory-scoring
description: 'Apply chapter 64 of Testing AI, Agent Trajectory Scoring, as a workflow for evaluating AI and non-deterministic systems. Use for test planning, eval design, quality review, release evidence, examples, or coaching related to agent trajectory scoring.'
---

# Agent Trajectory Scoring

Skill name: `tai-ch064-agent-trajectory-scoring`

Based on **Testing AI: Engineering Confidence in AI Systems** by **Jason Arbon**.

## Purpose

For agents, the final answer is only one part of quality. The path the agent took to reach it matters too.

## Use This Workflow

- Identify the AI behavior or release decision being evaluated.
- Define realistic cases, slices, unacceptable outcomes, and evidence needed for confidence.
- Choose measurements that match the risk: rubric scores, samples, intervals, traces, human review, deterministic checks, or production monitors.
- Report uncertainty, severe failures, and decision impact instead of only a pass/fail result.
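
As a minimal sketch of the last step, the snippet below reports a pass rate with a Wilson score interval and lists severe failures separately instead of collapsing everything into one pass/fail flag. The names `CaseResult`, `wilson_interval`, and `summarize` are hypothetical, and the 95% level (`z = 1.96`) is an assumption, not something the chapter prescribes.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class CaseResult:
    case_id: str
    passed: bool
    severe: bool = False  # hypothetical flag: an unacceptable outcome occurred


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if n == 0:
        return (0.0, 1.0)  # no evidence, so the rate is unconstrained
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def summarize(results: list[CaseResult]) -> dict:
    """Report the rate, its uncertainty, and severe failures side by side."""
    n = len(results)
    passes = sum(r.passed for r in results)
    return {
        "n": n,
        "pass_rate": passes / n if n else None,
        "interval": wilson_interval(passes, n),
        # Severe failures are listed by case so they can be reviewed one by one.
        "severe_failures": [r.case_id for r in results if r.severe],
    }
```

With only a handful of cases the interval is wide, which is the point: the report shows how little the sample supports, and one severe case stays visible even when the pass rate looks healthy.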

## Key Guidance

Agent trajectory scoring evaluates the steps an agent took: plan quality, tool selection, tool
arguments, permission checks, intermediate state, recovery, and final answer. For example, an
agent may eventually answer correctly after calling three unnecessary tools, exposing private
data in a tool argument, and ignoring a failed permission check. A score based only on the
final answer would miss all three problems.
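
The example above can be sketched as a small scorer that walks the recorded steps and flags path-level problems independently of the final answer. This is an illustration under stated assumptions: the `Step` record, the tool names, the `private_markers` substring check, and the finding labels are all hypothetical, and a real privacy check would need more than substring matching.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Step:
    tool: str
    args: dict = field(default_factory=dict)
    permission_check_passed: bool = True
    executed: bool = True


def score_trajectory(steps, final_answer_correct, needed_tools, private_markers):
    """Collect (step_index, finding) pairs for path-level problems."""
    findings = []
    for i, step in enumerate(steps):
        if step.tool not in needed_tools:
            findings.append((i, "unnecessary_tool_call"))
        # Naive check (assumption): any marker substring in any argument value.
        if any(m in str(v) for v in step.args.values() for m in private_markers):
            findings.append((i, "private_data_in_argument"))
        # The step ran even though its permission check had failed.
        if step.executed and not step.permission_check_passed:
            findings.append((i, "ignored_failed_permission_check"))
    return {
        "final_answer_correct": final_answer_correct,
        "findings": findings,
        "trajectory_ok": final_answer_correct and not findings,
    }


# Hypothetical trajectory mirroring the example in the text.
steps = [
    Step("web_search", {"q": "refund policy"}),
    Step("calendar_lookup", {"day": "today"}),
    Step("weather", {"city": "Oslo"}),
    Step("crm_lookup", {"note": "ssn=000-00-0000"}),
    Step("issue_refund", {"order": "A1"}, permission_check_passed=False),
]
report = score_trajectory(
    steps,
    final_answer_correct=True,
    needed_tools={"crm_lookup", "issue_refund"},
    private_markers=("ssn=",),
)
```

Here `report["final_answer_correct"]` is true while `report["trajectory_ok"]` is false, which is the gap a final-answer score alone would hide.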

## Apply The Approach

Create representative cases, score them with explicit criteria, review severe failures separately, report uncertainty, and connect the evidence to a concrete decision.

## Expert Notes

At expert level, trajectory scoring should use structured traces, span-level rubrics,
side-effect logs, permission matrices, tool contract checks, and severity rules that can block
release even when the final answer sounds acceptable.
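
One way to picture a severity rule that overrides a good final answer is the sketch below. It assumes a permission matrix keyed by role, tool contracts expressed as required argument names, and a two-level severity map. The roles, tools, severity labels, and the `0.8` threshold are illustrative assumptions, not values from the book.

```python
from __future__ import annotations

# Hypothetical permission matrix: which tools each role may call.
PERMISSION_MATRIX = {
    "support_agent": {"search_kb", "read_ticket"},
    "billing_agent": {"search_kb", "read_ticket", "issue_refund"},
}
# Hypothetical tool contracts: argument names each tool requires.
TOOL_CONTRACTS = {
    "search_kb": {"query"},
    "read_ticket": {"ticket_id"},
    "issue_refund": {"order_id", "amount"},
}
# Severity rules (assumed): a "blocker" stops release regardless of answer quality.
SEVERITY = {"permission_violation": "blocker", "contract_violation": "major"}


def check_span(role: str, span: dict) -> list[str]:
    """Check one tool-call span against the matrix and the tool contract."""
    issues = []
    tool = span["tool"]
    if tool not in PERMISSION_MATRIX.get(role, set()):
        issues.append("permission_violation")
    required = TOOL_CONTRACTS.get(tool)
    # Unknown tools and missing required arguments both count as contract violations.
    if required is None or not required <= set(span.get("args", {})):
        issues.append("contract_violation")
    return issues


def release_gate(role, spans, final_answer_score, threshold=0.8):
    """Block on any blocker-severity issue, or on a low final-answer score."""
    issues = [issue for span in spans for issue in check_span(role, span)]
    blockers = [i for i in issues if SEVERITY[i] == "blocker"]
    return {
        "issues": issues,
        "blocked_by_severity": bool(blockers),
        "blocked": bool(blockers) or final_answer_score < threshold,
    }
```

For instance, a `support_agent` trace that includes an `issue_refund` span is blocked even with a final-answer score of 0.95, because the permission violation is a blocker. The same trace under `billing_agent` only records a major contract issue for the missing `amount` argument.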
