---
name: tai-theme-bias-raters-data-and-practical-tools
description: 'Use the Testing AI theme Bias, Raters, Data, and Practical Tools to plan, review, or teach related AI quality work. Applies concepts and techniques from the book to testing AI, AI-generated software, and non-deterministic systems when relevant.'
---

# Bias, Raters, Data, and Practical Tools

Skill name: `tai-theme-bias-raters-data-and-practical-tools`

Based on **Testing AI: Engineering Confidence in AI Systems** by **Jason Arbon**.

## Theme Purpose

Use these approaches when building practical AI quality workflows that involve chatbots, bias testing, raters and labelers, labeler-demographic audits, tools such as Promptfoo/TestFoo and Hugging Face, and private local models.

Apply these concepts when testing AI, AI-generated software, model-backed features, agents, search, chatbots, RAG systems, generated code, dynamic interfaces, or other software whose behavior can vary across runs, users, data, tools, or time.

## How To Use This Theme

- Identify the behavior, capability, risk, or release decision being evaluated.
- Choose the relevant concepts below and turn them into concrete eval cases, samples, traces, checks, rubrics, metrics, or release gates.
- Prefer evidence that supports a decision: ship, canary, hold, rollback, or collect more samples.
- Report results by slice and call out severe failures separately, because averages can hide risk.
- Preserve enough evidence that another person or agent can understand what was tested, how it was measured, and why the recommendation follows.
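
The slice-level reporting described above can be sketched roughly as follows. This is a minimal illustration, not a method taken from the book: the result keys (`slice`, `passed`, `severe`), the helper names, and the sample data are all assumptions made up for the example.

```python
from collections import defaultdict


def summarize(rows):
    """Return count, pass rate, and severe-failure count for a list of results."""
    if not rows:
        raise ValueError("cannot summarize an empty result list")
    return {
        "n": len(rows),
        "pass_rate": sum(r["passed"] for r in rows) / len(rows),
        "severe": sum(r["severe"] for r in rows),
    }


def slice_report(results):
    """Summarize results overall and per slice.

    Each result is a dict with hypothetical keys:
    'slice' (str), 'passed' (bool), 'severe' (bool).
    """
    by_slice = defaultdict(list)
    for r in results:
        by_slice[r["slice"]].append(r)
    return {
        "overall": summarize(results),
        "slices": {name: summarize(rows) for name, rows in by_slice.items()},
    }


# Made-up illustration data, not real measurements.
results = (
    [{"slice": "en", "passed": True, "severe": False}] * 18
    + [{"slice": "en", "passed": False, "severe": False}] * 2
    + [{"slice": "es", "passed": True, "severe": False}] * 2
    + [{"slice": "es", "passed": False, "severe": True}] * 3
)

report = slice_report(results)
print(report["overall"])
for name, stats in sorted(report["slices"].items()):
    print(name, stats)
```

In this made-up data the overall pass rate is 0.8, while the `es` slice passes only 0.4 of its cases and carries all three severe failures. That gap is the kind of risk an average alone would hide.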

## Concepts And Techniques To Apply

- Use bias cases, identity swaps, counterfactuals, cultural context, language coverage, accessibility cases, and subgroup slices.
- Measure rater agreement and disagreement, rubric clarity, and labeler quality, and decide when expert review is required.
- Audit labeler demographics, domain expertise, incentives, fatigue, language, culture, and context, and check whether the labeler pool matches the product risk.
- Build chatbot and LLM-input evals with realistic prompts, policy boundaries, allowed cases, disallowed cases, edge cases, and security cases.
- Use tools such as Promptfoo/TestFoo, Hugging Face, Ollama, and local models when they fit privacy, security, or workflow needs.
- Handle internal, proprietary, HIPAA-like, or sensitive data with local execution, redaction, access control, and audit trails.
- Use production traces and synthetic cases together while watching for synthetic bias and unrealistic coverage.
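
Two of the techniques above, identity swaps and rater agreement, can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the template, the identity terms, the function names, and the rater labels are hypothetical, and Cohen's kappa is one common agreement statistic chosen here for illustration rather than one the source prescribes.

```python
from collections import Counter


def identity_swaps(template, identities):
    """Fill one prompt template with each identity term.

    The resulting prompts differ only in the identity term, so differences
    in model output across them are candidates for bias review.
    """
    return [
        {"identity": term, "prompt": template.format(identity=term)}
        for term in identities
    ]


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of items where the two raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical template and identity terms.
cases = identity_swaps(
    "Write a short reference letter for {identity} applying to a nursing job.",
    ["a 25-year-old man", "a 25-year-old woman", "a 60-year-old woman"],
)

# Made-up labels from two raters on the same eight outputs.
rater_a = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
rater_b = ["pass", "fail", "fail", "fail", "pass", "pass", "pass", "pass"]
kappa = cohens_kappa(rater_a, rater_b)
print(len(cases), round(kappa, 3))
```

A low kappa may point to an unclear rubric as much as to weak raters, so it is worth reading the disagreeing items before drawing conclusions. For more than two raters or for ordinal scales, a different statistic would be needed.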

## Reporting Guidance

- State what was tested and what population the evidence represents.
- Explain uncertainty, missing coverage, severe failures, and known blind spots.
- Connect findings to a concrete decision or next action.
- Use topic-specific chapter skills only when deeper detail is needed; this theme skill should stand alone as practical guidance.
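
One way to make the uncertainty in a reported pass rate concrete is a confidence interval. The sketch below uses the Wilson score interval, which is a standard choice for proportions but an assumption here, not something the source specifies. The function name and the sample counts are hypothetical.

```python
import math


def wilson_interval(passes, total, z=1.96):
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if total <= 0 or not 0 <= passes <= total:
        raise ValueError("need total > 0 and 0 <= passes <= total")
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    )
    return center - half_width, center + half_width


# Made-up example: 45 passing cases out of 50 sampled.
low, high = wilson_interval(45, 50)
print(f"pass rate 0.90, interval {low:.3f} to {high:.3f}")
```

With only 50 samples the interval is wide, which is useful context when the decision is between shipping and collecting more samples. The interval describes sampling uncertainty only; it says nothing about slices or populations that were never sampled.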