How to Evaluate a Claude Code Skill


After building my ELI5 skill, I needed a way to know if it actually worked. Not just “does it run,” but “is the output actually better with the skill than without it?” That’s a different kind of question than what a normal test answers.

Evals vs. Tests

In traditional software testing, you write an assertion like assert result == 42. It’s deterministic — same input, same output, pass or fail. You run it in CI, it either goes green or it doesn’t.

Evals are different. When you’re evaluating LLM output, the same prompt can produce different responses every time. There’s no single correct answer to “explain a database index to a 5-year-old.” Instead, you’re checking whether the response has certain qualities — is the tone right, are there analogies, is the jargon gone?

A typical Python test:

def test_add():
    assert add(2, 3) == 5  # deterministic, one correct answer

An eval assertion:

"Uses at least one concrete, child-friendly analogy (toys, animals, playground, picture books)"

You can’t check that with ==. You need a judge — and in this case, the judge is another LLM.

The Workflow

I set up an eval pipeline for the ELI5 skill that runs each test case in two configurations (with the skill and without it), then grades both outputs.

                    ┌─────────────────────┐
                    │   Load evals.json   │
                    │ test cases+asserts  │
                    └─────────┬───────────┘
                              │
                              ▼
                    ┌─────────────────────┐
                    │   Build configs     │
                    │ --a/--b or default  │
                    └─────────┬───────────┘
                              │
                              ▼
                      ┌───────────────┐
                      │   each test   │◄──────────────┐
                      └───┬───────┬───┘               │
                          │       │                   │
                ┌─────────┘       └─────────┐         │
                │                           │         │
                ▼                           ▼         │
     ┌───────────────────┐     ┌───────────────────┐  │
     │   Run Config A    │     │   Run Config B    │  │
     │  claude -p (skill)│     │  claude -p (base) │  │
     └─────────┬─────────┘     └─────────┬─────────┘  │
               │                         │            │
               └────────┐   ┌────────────┘            │
                        │   │                         │
                        ▼   ▼                         │
                    ┌─────────────────────┐           │
                    │  Save responses     │           │
                    │ response.md+timing  │───────────┘
                    └─────────┬───────────┘  next test
                              │
                              ▼
                    ┌─────────────────────┐
                    │ Grade each response │
                    │  LLM judge -> PASS  │
                    └─────────┬───────────┘
                              │
                              ▼
                    ┌─────────────────────┐
                    │ Pass rate summary   │
                    │ A:100% B:83% +16.7% │
                    └─────────────────────┘

The key idea: every test runs twice (A and B), and both outputs are graded against the same assertions. This gives you a direct comparison.

Defining Assertions

Test cases and assertions live in a JSON file. One example:

{
  "id": 0,
  "name": "explain-db-index-age5",
  "prompt": "ELI5 what a database index is",
  "audience": "Age 5 (default)",
  "assertions": [
    "Contains no technical jargon (query, B-tree, schema, SQL, optimize)",
    "Uses at least one concrete, child-friendly analogy (toys, animals, playground)",
    "Sentences are short and simple, averaging under 15 words per sentence",
    "Tone is warm, playful, and appropriate for a 5-year-old"
  ]
}

There are four assertions for this test case: no jargon, uses analogies, short sentences, and appropriate tone. Each one describes a quality of a good response, not an exact expected value. They’re written in natural language because they’ll be evaluated by an LLM, not by a string comparison.

Running the A/B Eval

The eval script uses claude -p to run each prompt in two modes. With the skill, the prompt includes the skill file:

def run_test(eval_case, outdir, configs):
    prompt = eval_case["prompt"]

    for config in configs:
        config_dir = outdir / config["name"]
        config_dir.mkdir(parents=True, exist_ok=True)

        start = time.time()
        if config["skill_path"]:
            response = run_claude(
                f"Read the skill at {config['skill_path']} first, "
                f"then follow its instructions. Task: {prompt}"
            )
        else:
            response = run_claude(prompt)
        elapsed = time.time() - start

        # Save the response and how long the run took
        (config_dir / "response.md").write_text(response)
        (config_dir / "timing.txt").write_text(f"{elapsed:.1f}s")

Config A points to the skill file. Config B runs the same prompt with no skill — just Claude’s default behavior. You can also A/B test two different skill versions:

python run-evals.py --a skills/eli5/SKILL.md --b ~/experiments/SKILL-v2.md
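The post doesn't show run_claude itself. A minimal sketch, assuming the Claude Code CLI is on your PATH and using print mode (claude -p); the cmd parameter is my addition so the function can be exercised without the real CLI:

```python
import subprocess

def run_claude(prompt, cmd=("claude", "-p"), timeout=300):
    """Run a prompt through the Claude Code CLI in print mode and return stdout."""
    result = subprocess.run(
        [*cmd, prompt],
        capture_output=True, text=True, timeout=timeout,
    )
    # Surface CLI failures loudly instead of silently grading empty output
    result.check_returncode()
    return result.stdout.strip()
```

The timeout matters in practice: a single eval run makes many CLI calls, and one hung call shouldn't stall the whole pipeline.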

LLM-as-Judge

This is where evals diverge most from traditional tests. Instead of assert x == y, each response is sent to another LLM call that acts as a grader:

def grade_response(response_file, assertions):
    response_content = response_file.read_text()
    assertions_text = "\n".join(
        f"{i + 1}. {a}" for i, a in enumerate(assertions)
    )
    return run_claude(
        f"""You are a strict grader. Grade this response against each assertion.

RESPONSE:
{response_content}

ASSERTIONS:
{assertions_text}

For each assertion, output exactly one line:
PASS|<number>|<brief evidence>
or
FAIL|<number>|<brief evidence>

Output ONLY those lines. No other text."""
    )

The grader reads the response, checks it against each assertion, and returns a structured verdict. The output looks like:

PASS|1|No technical terms found — uses "book" and "list" instead
PASS|2|Uses analogy of a huge book with thousands of pages
PASS|3|Average sentence length ~10 words
FAIL|4|Tone is informative but not particularly warm or playful
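Because the grader is told to emit only those pipe-delimited lines, turning its output back into a score is a few lines of parsing. A sketch (parse_grades is my name, not from the eval script):

```python
def parse_grades(grader_output):
    """Parse PASS|n|evidence lines into (passed, total, details)."""
    details = []
    for line in grader_output.strip().splitlines():
        verdict, number, evidence = line.split("|", 2)
        details.append((verdict == "PASS", int(number), evidence))
    passed = sum(1 for ok, _, _ in details if ok)
    return passed, len(details), details
```

For the example output above, this yields 3 passes out of 4. In a real pipeline you'd also want to handle a judge that ignores the format instruction, since that happens occasionally.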

What This Means in Practice

The LLM judge introduces variance. I ran the same eval twice and got different scores — the skill scored 10/12 one time and 11/12 the next. The judge might be stricter on “warm tone” in one run and more lenient in another. That’s the nature of probabilistic grading.

This is the fundamental difference from tests:

|           | Tests                  | Evals                          |
|-----------|------------------------|--------------------------------|
| Output    | Deterministic          | Probabilistic                  |
| Assertion | assert x == 5          | "Uses child-friendly analogies" |
| Judge     | Code (exact match)     | LLM (interpretation)           |
| Reruns    | Same result every time | May vary between runs          |
| Goal      | Correctness            | Quality measurement            |

Because of this, a single eval run isn’t enough. Run it multiple times and look at the trend. If the skill consistently scores 80-90% and the baseline scores 40-50%, that’s a real signal even if individual runs fluctuate.
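One lightweight way to look at the trend is to aggregate pass rates across runs. A sketch (these helper names are mine, not from the eval script):

```python
from statistics import mean

def summarize_runs(run_scores):
    """run_scores: list of (passed, total) pairs, one per eval run.

    Returns (mean_rate, min_rate, max_rate) so you can see both the
    central tendency and how much the judge's variance moves the score.
    """
    rates = [passed / total for passed, total in run_scores]
    return mean(rates), min(rates), max(rates)
```

For the two runs mentioned above, 10/12 and 11/12, this gives a mean pass rate of 87.5% with a spread of about 8 points, which is exactly the kind of run-to-run noise you should expect from an LLM judge.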


The eval script and test cases are in the ELI5 repo. If you’re building a skill and want to know whether it’s actually doing something, this is one way to find out.
