---
name: tai-theme-production-ai-systems-and-agents
description: 'Use the Testing AI theme Production AI Systems and Agents to plan, review, or teach related AI quality work. Applies concepts and techniques from the book to testing AI, AI-generated software, and non-deterministic systems when relevant.'
---

# Production AI Systems and Agents

Skill name: `tai-theme-production-ai-systems-and-agents`

Based on **Testing AI: Engineering Confidence in AI Systems** by **Jason Arbon**.

## Theme Purpose

Use this theme when validating production AI systems. It covers tracing, RAG, synthetic data, trace mining, versioning, agents, rollouts, cost, multimodal behavior, and contracts.

Apply these concepts when testing AI, AI-generated software, model-backed features, agents, search, chatbots, RAG systems, generated code, dynamic interfaces, or other software whose behavior can vary across runs, users, data, tools, or time.

## How To Use This Theme

- Identify the behavior, capability, risk, or release decision being evaluated.
- Choose the relevant concepts below and turn them into concrete eval cases, samples, traces, checks, rubrics, metrics, or release gates.
- Prefer evidence that supports a decision: ship, canary, hold, rollback, or collect more samples.
- Report results by slice and call out severe failures separately, because averages can hide risk.
- Preserve enough evidence that another person or agent can understand what was tested, how it was measured, and why the recommendation follows.
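
The steps above can be sketched as a small release gate that turns per-slice evidence into one of the listed decisions. This is a minimal illustration, not a method prescribed by the book: the `SliceResult` structure, the `recommend` function, and every threshold are hypothetical. Rollback is left out because it applies to a change that is already live.

```python
from dataclasses import dataclass


@dataclass
class SliceResult:
    """Eval outcome for one slice (hypothetical structure for illustration)."""
    name: str
    passed: int
    total: int
    severe_failures: int


def recommend(slices, min_samples=50, ship_rate=0.95, canary_rate=0.90):
    """Map per-slice evidence to a decision string.

    All thresholds are placeholder assumptions; tune them per product.
    """
    # Any severe failure blocks release regardless of the averages.
    if any(s.severe_failures > 0 for s in slices):
        return "hold"
    # Too little evidence in any slice: do not decide yet.
    if not slices or any(s.total < max(min_samples, 1) for s in slices):
        return "collect more samples"
    # Gate on the worst slice so a strong average cannot hide a weak one.
    worst = min(s.passed / s.total for s in slices)
    if worst >= ship_rate:
        return "ship"
    if worst >= canary_rate:
        return "canary"
    return "hold"


slices = [
    SliceResult("english", passed=98, total=100, severe_failures=0),
    SliceResult("spanish", passed=91, total=100, severe_failures=0),
]
print(recommend(slices))
```

Here the pooled pass rate is 94.5%, but the gate looks at the weaker slice (91%), so the sketch recommends a canary instead of a full ship.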

## Concepts And Techniques To Apply

- Trace every prompt, retrieval call, tool call, model response, judge result, token cost, latency, error, retry, and final outcome.
- For RAG, separate retrieval quality from generation quality: hit rate, context precision and recall, freshness, grounding, and citation faithfulness.
- Mine production traces into eval cases by sampling, clustering, anonymizing, labeling, and promoting high-value failures into regression suites.
- Version prompts, policies, tools, retrieval indexes, rubrics, judges, model routes, and release configs together.
- Score agent trajectories, not only final answers: plan, tool choice, arguments, permissions, recovery, side effects, and final response.
- Use shadow mode, canaries, traffic slices, rollback thresholds, monitoring windows, and post-release trace review.
- Optimize quality per dollar across model family, model size, latency, token use, privacy, security, regional hosting, and business continuity.
- Use contracts for schemas, tool calls, source documents, citations, refusals, logs, and generated outputs.
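
As one example from the list above, retrieval quality for RAG can be scored separately from generation by comparing retrieved document IDs against labeled relevant IDs. The sketch below assumes simple set-based definitions of precision and recall and a per-query "hit" when at least one relevant document is retrieved; some tools use rank-weighted variants instead. The function names and document IDs are hypothetical.

```python
def retrieval_metrics(retrieved, relevant):
    """Score one query.

    retrieved: list of document IDs returned by the retriever.
    relevant:  set of document IDs labeled relevant for the query.
    Set-based definitions are an assumption made for this sketch.
    """
    retrieved_set = set(retrieved)
    hits = retrieved_set & set(relevant)
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return {"hit": bool(hits), "precision": precision, "recall": recall}


def hit_rate(cases):
    """Fraction of queries where at least one relevant document was retrieved.

    cases: list of (retrieved, relevant) pairs.
    """
    if not cases:
        return 0.0
    return sum(retrieval_metrics(r, rel)["hit"] for r, rel in cases) / len(cases)


cases = [
    (["d1", "d2", "d3", "d4"], {"d2", "d5"}),  # one of two relevant docs found
    (["d7", "d8"], {"d9"}),                    # retrieval miss
]
for retrieved, relevant in cases:
    print(retrieval_metrics(retrieved, relevant))
print("hit rate:", hit_rate(cases))
```

Keeping these numbers apart from grounding and citation checks makes it easier to tell whether a bad answer came from missing context or from the generator ignoring good context.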

## Reporting Guidance

- State what was tested and what population the evidence represents.
- Explain uncertainty, missing coverage, severe failures, and known blind spots.
- Connect findings to a concrete decision or next action.
- Use topic-specific chapter skills only when deeper detail is needed; this theme skill should stand alone as practical guidance.
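
One way to make the uncertainty in a report concrete is to attach an interval to each slice's pass rate. The sketch below uses the Wilson score interval, a standard choice for proportions that this document does not itself name, so treat it as an assumption; the `wilson_interval` helper and the slice counts are hypothetical.

```python
import math


def wilson_interval(passed, total, z=1.96):
    """Approximate 95% Wilson score interval for a pass rate.

    Returns (low, high). With no samples, the rate is unknown, so the
    sketch returns the widest possible interval.
    """
    if total == 0:
        return (0.0, 1.0)
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# Hypothetical slice counts: (passed, total).
report = {"english": (180, 200), "spanish": (18, 20)}
for name, (passed, total) in report.items():
    low, high = wilson_interval(passed, total)
    print(f"{name}: {passed}/{total} passed, interval {low:.2f} to {high:.2f}")
```

Both slices show the same 90% pass rate, but the smaller slice has a much wider interval, which is the kind of missing coverage the report should state explicitly.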
