---
name: tai-ch002-what-makes-a-system-non-deterministic
description: 'Apply chapter 2 of Testing AI, "What Makes a System Non-Deterministic?", as a workflow for evaluating AI and non-deterministic systems. Use for test planning, eval design, quality review, release evidence, examples, or coaching related to the sources of non-determinism.'
---

# What Makes a System Non-Deterministic?

Skill name: `tai-ch002-what-makes-a-system-non-deterministic`

Based on **Testing AI: Engineering Confidence in AI Systems** by **Jason Arbon**.

## Purpose

Before testers can evaluate unpredictable systems, they need to understand where the
unpredictability comes from and which variation actually matters.

## Use This Workflow

- Identify the AI behavior or release decision being evaluated.
- Define realistic cases, slices, unacceptable outcomes, and evidence needed for confidence.
- Choose measurements that match the risk: rubric scores, samples, intervals, traces, human review, deterministic checks, or production monitors.
- Report uncertainty, severe failures, and decision impact instead of only a pass/fail result.
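
As one way to make the last step concrete, the sketch below reports a pass rate with an approximate 95% Wilson score interval and lists severe failures separately. The result fields (`case_id`, `passed`, `severe`) and the function names are hypothetical, chosen only for illustration; the book does not prescribe this format.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple:
    """Approximate Wilson score interval for a pass rate (z=1.96 is ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def summarize(results: list) -> dict:
    """Summarize eval results; each result is a dict with hypothetical keys
    'case_id', 'passed', and an optional 'severe' flag."""
    total = len(results)
    passes = sum(1 for r in results if r["passed"])
    # Severe failures are listed by id so they get reviewed individually
    # instead of disappearing into the aggregate rate.
    severe = [r["case_id"] for r in results if r.get("severe")]
    low, high = wilson_interval(passes, total)
    return {
        "pass_rate": passes / total,
        "interval": (low, high),
        "severe_failures": severe,
    }
```

A report built this way says "the pass rate is probably between these bounds, and these specific cases failed badly", which supports a release decision better than a single pass/fail flag.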

## Key Guidance

Non-determinism means repeated runs can produce different behavior, even when the input looks
the same. That can happen because of model sampling, personalization, ranking experiments,
timing, cache state, tool calls, retrieved data, or hidden production context. For example, an
LLM may choose different words, a search system may reorder equivalent results, and a
distributed service may process two events in different orders. Some of that variation is
harmless. Some of it changes whether the result is correct.
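
A minimal sketch of that distinction, assuming each repeated run has already been reduced to its raw `text` and an extracted `answer` (both field names, and the extraction step itself, are hypothetical):

```python
def classify_variation(runs: list) -> str:
    """Classify how repeated runs of the same input differ.

    Each run is a dict with hypothetical keys:
      'text'   - the raw output
      'answer' - the fact or decision extracted from it
    """
    if not runs:
        raise ValueError("need at least one run")
    distinct_texts = {r["text"] for r in runs}
    distinct_answers = {r["answer"] for r in runs}
    if len(distinct_texts) == 1:
        return "identical"
    if len(distinct_answers) == 1:
        # Wording changed but the extracted answer did not.
        return "wording_only"
    # The extracted answer itself changed between runs.
    return "substantive"


runs = [
    {"text": "The refund window is 30 days.", "answer": "30 days"},
    {"text": "You have 30 days to request a refund.", "answer": "30 days"},
    {"text": "Refunds are accepted within 14 days.", "answer": "14 days"},
]
print(classify_variation(runs))
```

In practice the hard part is the extraction step, which may need a rubric, a parser, or human review; the sketch only shows why wording-level and answer-level variation should be counted separately.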

Do not treat non-determinism as one simple category. Some stochastic systems can be
made mostly reproducible by fixing the random seed, data snapshot, configuration, and
runtime. LLM-based products are trickier: even with low temperature, they may vary
because the served model, safety layer, tool output, retrieval context, hidden state,
floating-point or hardware behavior, or platform routing changed. The testing strategy
depends on which kind of variation you are trying to control.
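
The first kind of variation can be illustrated with a toy example. The sketch below uses Python's standard `random.Random` to stand in for a stochastic component such as a ranking tie-breaker; `sample_ranking` is a hypothetical name, not an API from the book or any product.

```python
import random
from typing import Optional


def sample_ranking(items: list, seed: Optional[int] = None) -> list:
    """Return the items in a random order.

    With a fixed seed the order is reproducible within the same Python
    runtime; with seed=None it can differ on every call.
    """
    rng = random.Random(seed)  # private generator, no shared global state
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled


results = ["doc-a", "doc-b", "doc-c", "doc-d"]

# Same seed, same input: the two orders match.
assert sample_ranking(results, seed=7) == sample_ranking(results, seed=7)

# The input list is left untouched, so the data snapshot stays stable too.
assert results == ["doc-a", "doc-b", "doc-c", "doc-d"]
```

Pinning a seed like this only controls randomness the test owns. It does nothing about a changed served model, safety layer, or routing path, which is why the LLM-product case usually needs logging and repeated sampling rather than a seed alone.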

## Apply The Approach

Create representative cases, score them with explicit criteria, review severe failures separately, report uncertainty, and connect the evidence to a concrete decision.

## Expert Notes

Technically, testers should separate sources of randomness from sources of state and sources of
platform change. Model temperature, random seeds, ranking tie-breakers, async timing, retrieval
snapshots, feature flags, user profiles, tool outputs, provider model versions, safety filters,
hardware/runtime paths, and hidden product context should be logged independently because each
one creates a different debugging path.
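
One possible shape for that logging, sketched under assumptions: the `RunContext` fields below are a hypothetical subset of the factors listed above, and the grouping into randomness, state, and platform is this sketch's reading of the paragraph, not a schema from the book.

```python
from dataclasses import dataclass, asdict
from typing import Optional, Tuple


@dataclass(frozen=True)
class RunContext:
    """Hypothetical per-run record; extend with whatever your system exposes."""
    temperature: float
    seed: Optional[int]
    retrieval_snapshot: str
    feature_flags: Tuple[str, ...]
    user_profile_id: str
    model_version: str
    safety_filter_version: str


# Which debugging path each field belongs to (an assumed grouping).
CATEGORY = {
    "temperature": "randomness",
    "seed": "randomness",
    "retrieval_snapshot": "state",
    "feature_flags": "state",
    "user_profile_id": "state",
    "model_version": "platform",
    "safety_filter_version": "platform",
}


def changed_factors(a: RunContext, b: RunContext) -> dict:
    """Group the fields that differ between two runs by category."""
    da, db = asdict(a), asdict(b)
    changed = {}
    for name, category in CATEGORY.items():
        if da[name] != db[name]:
            changed.setdefault(category, []).append(name)
    return changed
```

When two runs of the same case disagree, comparing their contexts this way narrows the investigation: an empty result points toward sampling noise or an unlogged factor, while a `platform` entry points toward a provider-side change.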
