Claude Code Skill Description Auto Trigger Matching

The first skill I ever wrote was called helper. Its description said "helps with code." I spent an entire afternoon wondering why Claude Code never invoked it — no errors, no warnings, just silence — until I realized the matcher had no signal to latch onto. Every other skill in my project had verbs, nouns, trigger phrases that named the work; mine had a shrug. The fix wasn't a clever algorithm or a hidden flag. It was rewriting one sentence so the matcher could tell, in under a hundred tokens, when my skill was the right tool and when it absolutely wasn't.

This article rebuilds that lesson from scratch. We will scaffold a .claude/skills/ directory, register a baseline skill, deliberately collide it with a second skill to expose ambiguity, refactor the descriptions with action verbs and scope qualifiers, plant anti-pattern skills that should never trigger, and finally build a prompt-matrix harness that scores precision and recall across the whole set. By the end you will have a small repo, a checklist of description heuristics, and a test harness you can point at any future skill before you ship it. The single rule under all of it: a skill description is not documentation for humans, it is a routing contract for a matcher — write it like a router, not like a README.

This is for engineers who already use Claude Code and want their skills to fire on the right prompts and stay quiet on the wrong ones, without resorting to guesswork or hand-edited keyword lists.

Step 1: Scaffolding the Skill Manifest Lab

Before we can study how Claude Code's skill auto-trigger matching reacts to wording changes, we need a stable harness that owns one real skill manifest and a parser we control. This step builds that harness from an empty folder — a .claude/skills/greet/SKILL.md stub, a small skill_loader module that parses frontmatter, and a pytest suite that confirms the manifest is discoverable and well-formed.

The point is to lock in the invariants that every later step will rely on. Once the loader returns a populated Skill object for greet, every wording change we make in subsequent steps becomes a measurable variable rather than a guess about file layout.

Setup

Create a Python project skeleton next to a .claude/ directory that mirrors the real Claude Code layout:

codebase/
├── pyproject.toml
├── .claude/
│   └── skills/
│       └── greet/
│           └── SKILL.md
├── src/
│   └── skill_loader.py
└── tests/
    ├── __init__.py
    └── test_skill_loader.py

We need no third-party runtime dependencies: the loader will parse the tiny YAML-ish frontmatter by hand to keep the surface area honest. The only tool we add is pytest, invoked via the standard Python toolchain. pyproject.toml registers src/ on the path and points the test runner at tests/; the pythonpath setting it relies on requires pytest 7.0 or newer.

[project]
name = "skill-trigger-lab"
version = "0.1.0"
description = "Companion code for studying how Claude Code matches skill descriptions to user prompts."
requires-python = ">=3.10"
dependencies = []

[tool.pytest.ini_options]
testpaths = ["tests"]
pythonpath = ["src"]
addopts = "-q"

Implementation

The placeholder skill ships with the smallest possible frontmatter — a name and a description written in the same shape Claude Code expects. The body says outright that it is an anchor for later experiments, so future readers do not mistake it for working behaviour.

---
name: greet
description: Send a friendly greeting. Use when the user wants to say hello, greet someone by name, or open a conversation warmly.
---

# Greet

A placeholder skill used as the anchor for later auto-trigger experiments.

The body intentionally stays empty of behaviour. Subsequent steps will
rewrite the `description` field above to study how wording changes
Claude Code's skill auto-trigger matching.

The Skill dataclass captures exactly what later experiments will compare: the declared name, the description string we will mutate, the body text, and the source path for diagnostics. Freezing it forces every change to go through load_skills, which keeps the experiment reproducible.

from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Skill:
    name: str
    description: str
    body: str
    path: Path


def load_skills(project_root: Path) -> list[Skill]:
    skills_dir = project_root / ".claude" / "skills"
    if not skills_dir.is_dir():
        return []
    manifests = sorted(skills_dir.glob("*/SKILL.md"))
    return [_parse_skill_file(p) for p in manifests]
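
A minimal sketch of what freezing buys us, using a stand-in Skill with the same four fields. The field values below are illustrative, not read from disk, and dataclasses.replace is shown only as one standard-library way to derive a variant without touching the baseline.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from pathlib import Path


@dataclass(frozen=True)
class Skill:
    name: str
    description: str
    body: str
    path: Path


# Illustrative values only; a real record would come from load_skills.
baseline = Skill(
    name="greet",
    description="Send a friendly greeting.",
    body="",
    path=Path(".claude/skills/greet/SKILL.md"),
)

# Direct mutation is rejected, so a description cannot drift mid-experiment.
try:
    baseline.description = "helps with code"
    mutated = True
except FrozenInstanceError:
    mutated = False

# A variant is a new object; the baseline stays intact for comparison.
variant = replace(baseline, description="Say hi.")
print(mutated, "|", baseline.description, "->", variant.description)
```

Any experiment that wants a different description therefore has to build a new record, which keeps the before and after values side by side.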

Frontmatter parsing is split into three tiny helpers (_split_frontmatter, _find_frontmatter_end, and _parse_simple_yaml) so no single function nests beyond two levels of control flow. Each helper owns one decision: where the YAML block starts, where it ends, and how to read a key: value line. A missing or unterminated --- delimiter raises ValueError, which the test suite locks down.

def _parse_simple_yaml(lines: list[str]) -> dict[str, str]:
    result: dict[str, str] = {}
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue
        result[key.strip()] = _unquote(value.strip())
    return result

The test module asserts the four invariants we care about for this step. The .claude/skills/ directory must exist, load_skills must discover at least one entry, every entry must be a populated Skill, and the greet description must already contain a trigger phrase like greet or hello. Two failure-mode tests round it off: an empty project returns an empty list, and a malformed manifest raises rather than silently returning garbage.

def test_greet_skill_description_mentions_a_trigger_phrase():
    by_name = {s.name: s for s in load_skills(REPO_ROOT)}
    greet = by_name["greet"]
    text = greet.description.lower()
    assert "greet" in text or "hello" in text

Verification

python3 -m pytest

.......                                                                  [100%]
7 passed in 0.03s

Seven green tests is the gate. They prove the manifest is discoverable on disk, the loader returns a typed Skill with the expected fields, the placeholder description carries the wording we will mutate later, and the loader fails loudly when frontmatter is missing.

What we built

We now have a self-contained Python package whose only job is to read .claude/skills/*/SKILL.md and hand back a frozen Skill record. The directory layout matches what Claude Code itself scans, so the same manifest we test against is the one a real Claude Code session would load.

The greet skill exists as a deliberate stub. It carries the minimal name plus a description that already mentions both the skill's verb (greet) and a natural-language alias (hello) — the kind of dual-cue wording we will dissect in later steps.

The test suite encodes the invariants we want to hold across every future change. If a later edit accidentally breaks the frontmatter, removes the skill, or strips the trigger phrases, pytest flips red before we publish anything misleading.

With the harness locked down, the next step can focus on the real research question: what happens to auto-trigger behaviour when we replace the description with progressively vaguer or more specific wording? Every variation will run against the same loader, the same directory layout, and the same passing baseline established here.

Repository

The state of the code after this step: 273008d

Step 2: Probing the Baseline Description with a Token-Overlap Trigger Matcher

Step 1 left us with a passing loader and a single placeholder skill on disk. That gives us a manifest we can read, but it does not yet tell us whether the skill's description would fire on a realistic prompt. This step closes that gap by writing a deterministic stand-in for Claude Code's skill auto-trigger matching and pinning down the baseline behaviour as a tested invariant.

The shape of the experiment matters. We freeze the greet description exactly as it shipped from step 1, build a token-overlap matcher we fully control, and curate two short prompt corpora — four prompts that should fire and four prompts that should not. From this step onward, every change to the description gets measured against the same matcher and the same prompts, so any swing in behaviour is wording-driven rather than tooling noise.

Setup

No new third-party dependencies. The two new source modules and the new test module slot into the layout established in step 1:

codebase/
├── src/
│   ├── skill_loader.py          # from step 1
│   ├── trigger_matcher.py       # NEW
│   └── sample_prompts.py        # NEW
└── tests/
    ├── test_skill_loader.py     # from step 1
    └── test_trigger_matcher.py  # NEW

The matcher lives in src/trigger_matcher.py so it can import the Skill dataclass from skill_loader without circular dependencies. The prompt corpus sits in its own sample_prompts module so future steps can extend the corpus without touching matcher logic. pyproject.toml already exposes src/ on pythonpath, so the new test module imports both with no extra wiring.

Implementation

The matcher's job is to answer one question per prompt: which skill — if any — would fire on this wording? We model that as a token-overlap score. A TriggerMatch record carries the skill, the integer overlap score, and the exact terms that overlapped, so a failing test can tell us why a match did or did not happen.

@dataclass(frozen=True)
class TriggerMatch:
    skill: Skill
    score: int
    matched_terms: tuple[str, ...]

Tokenisation is intentionally boring: tokenize lowercases and splits on anything that is not [a-z0-9], and keywords then drops English stopwords plus single-character tokens. We keep the stopword list small and visible at the top of the module so the next step can reason about which words in a description actually contribute signal. Because keywords returns a set, terms & prompt_tokens in the scorer is a clean set intersection; the frozenset only guarantees that the stopword list itself cannot be mutated at runtime.

import re

WORD_RE = re.compile(r"[a-z0-9]+")

STOPWORDS = frozenset({
    "a", "an", "and", "or", "the", "to", "of", "in", "on", "by",
    "for", "with", "use", "when", "user", "wants", "want", "someone",
    "is", "be", "this", "that", "it", "as", "at", "from", "you",
    "your", "can", "please", "would", "could", "should", "do",
})


def tokenize(text: str) -> list[str]:
    return WORD_RE.findall(text.lower())


def keywords(text: str) -> set[str]:
    return {w for w in tokenize(text) if w not in STOPWORDS and len(w) > 1}
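
To make "which words contribute signal" concrete, here is a small sketch that runs the same tokenise-and-filter logic over the step 1 greet description. The constant name BASELINE is introduced here for illustration only.

```python
import re

WORD_RE = re.compile(r"[a-z0-9]+")

# Same stopword list as the module excerpt above.
STOPWORDS = frozenset({
    "a", "an", "and", "or", "the", "to", "of", "in", "on", "by",
    "for", "with", "use", "when", "user", "wants", "want", "someone",
    "is", "be", "this", "that", "it", "as", "at", "from", "you",
    "your", "can", "please", "would", "could", "should", "do",
})


def keywords(text: str) -> set[str]:
    return {w for w in WORD_RE.findall(text.lower())
            if w not in STOPWORDS and len(w) > 1}


BASELINE = (
    "Send a friendly greeting. Use when the user wants to say hello, "
    "greet someone by name, or open a conversation warmly."
)

signal = sorted(keywords(BASELINE))
print(signal)
# ['conversation', 'friendly', 'greet', 'greeting', 'hello',
#  'name', 'open', 'say', 'send', 'warmly']
```

Ten of the description's twenty-one words survive. Note that there is no stemming: greet and greeting are separate keywords, and a prompt containing warm would not match warmly.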

Scoring is then a two-line set intersection, wrapped in a sort. score_prompt returns one row per skill so callers can inspect ties and near-misses, while best_match is the thin convenience layer that picks the top row and rejects zero-overlap matches. Splitting the two keeps each function under two control-flow levels and leaves room for future steps to add their own ranking strategies without touching the public surface.

def score_prompt(prompt: str, skills: Iterable[Skill]) -> list[TriggerMatch]:
    prompt_tokens = set(tokenize(prompt))
    matches: list[TriggerMatch] = []
    for skill in skills:
        terms = keywords(skill.description)
        hits = sorted(terms & prompt_tokens)
        matches.append(
            TriggerMatch(skill=skill, score=len(hits), matched_terms=tuple(hits))
        )
    matches.sort(key=lambda m: m.score, reverse=True)
    return matches


def best_match(prompt: str, skills: Iterable[Skill]) -> TriggerMatch | None:
    ranked = score_prompt(prompt, skills)
    if not ranked:
        return None
    top = ranked[0]
    if top.score == 0:
        return None
    return top

The prompt corpus is what turns the matcher into an experiment. Four positive prompts paraphrase the greet intent in different surface forms ("say hello", "greet Maria", "open the conversation", "send a quick greeting"). Four negative prompts cover unrelated developer requests — weather, Rust builds, TCP internals, a refactor — so we can detect false-positive bleed when later steps make the description vaguer.

POSITIVE_GREET_PROMPTS: tuple[str, ...] = (
    "Can you say hello to my new teammate?",
    "Please greet Maria warmly before the meeting starts.",
    "Open the conversation with a friendly hello.",
    "Send a quick greeting to the channel.",
)

NEGATIVE_PROMPTS: tuple[str, ...] = (
    "What is the weather tomorrow in Tokyo?",
    "Compile the Rust binary and run the integration suite.",
    "Explain how TCP congestion control reacts to packet loss.",
    "Refactor the invoice service to use dependency injection.",
)
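
As a worked check of the corpus, the sketch below scores all eight prompts against the step 1 greet description. The overlap helper is hypothetical: it performs the same intersection score_prompt computes, reduced to a single description so the example stays self-contained.

```python
import re

WORD_RE = re.compile(r"[a-z0-9]+")
STOPWORDS = frozenset({
    "a", "an", "and", "or", "the", "to", "of", "in", "on", "by",
    "for", "with", "use", "when", "user", "wants", "want", "someone",
    "is", "be", "this", "that", "it", "as", "at", "from", "you",
    "your", "can", "please", "would", "could", "should", "do",
})

BASELINE = (
    "Send a friendly greeting. Use when the user wants to say hello, "
    "greet someone by name, or open a conversation warmly."
)

PROMPTS = (
    "Can you say hello to my new teammate?",
    "Please greet Maria warmly before the meeting starts.",
    "Open the conversation with a friendly hello.",
    "Send a quick greeting to the channel.",
    "What is the weather tomorrow in Tokyo?",
    "Compile the Rust binary and run the integration suite.",
    "Explain how TCP congestion control reacts to packet loss.",
    "Refactor the invoice service to use dependency injection.",
)


def overlap(description: str, prompt: str) -> list[str]:
    # Hypothetical helper: description keywords that also occur in the prompt.
    terms = {w for w in WORD_RE.findall(description.lower())
             if w not in STOPWORDS and len(w) > 1}
    return sorted(terms & set(WORD_RE.findall(prompt.lower())))


hits = {prompt: overlap(BASELINE, prompt) for prompt in PROMPTS}
for prompt, terms in hits.items():
    print(len(terms), terms, prompt)
```

The four positive prompts score 2, 2, 4, and 2; the four negative prompts score 0. The last negative prompt contains use, which also appears in the description, but it is a stopword and so contributes nothing.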

Tests parametrise over both corpora. The positive battery asserts that the baseline greet description fires on every paraphrase, that the winning skill is greet, and that at least one term overlapped. The negative battery asserts that best_match returns None on every unrelated prompt — the description must not pick up signal from words like compile, weather, or refactor.

@pytest.mark.parametrize("prompt", POSITIVE_GREET_PROMPTS)
def test_baseline_greet_triggers_on_positive_prompts(prompt, skills):
    match = best_match(prompt, skills)
    assert match is not None, f"baseline greet failed to fire on: {prompt!r}"
    assert match.skill.name == "greet"
    assert match.score >= 1
    assert match.matched_terms


@pytest.mark.parametrize("prompt", NEGATIVE_PROMPTS)
def test_baseline_greet_stays_quiet_on_unrelated_prompts(prompt, skills):
    match = best_match(prompt, skills)
    assert match is None, (
        f"baseline greet wrongly fired on unrelated prompt: {prompt!r}"
    )

Six more focused tests pin down the matcher's mechanical contract. Among other things, they check that tokenize lowercases and splits on punctuation, that keywords drops stopwords and one-character tokens, that score_prompt returns one row per skill sorted descending, and that matched terms come back lowercase and alphabetically sorted. Locking these down means a regression in the matcher itself cannot be silently blamed on the description.

Verification

python3 -m pytest

.....................                                                    [100%]
21 passed in 0.08s

Twenty-one passes — the seven loader tests from step 1 plus fourteen new matcher tests (eight parametrised prompts and six contract checks). The positive battery confirms the baseline greet description fires across paraphrased prompts; the negative battery confirms it does not bleed onto unrelated developer requests. From here, any future regression in either direction will surface as a coloured row in this output.

What we built

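We now own a small, deterministic stand-in for the part of Claude Code that decides whether a skill description matches a user prompt. It is a proxy rather than a reimplementation: in Claude Code the model itself reads the descriptions and decides which skill to invoke, so token overlap only approximates that judgement. Because the stand-in lives in our repository, we can read every decision it makes, log every overlapping term, and trust that its behaviour will not drift between runs.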

Around the matcher sits a frozen baseline. The greet description from step 1 is unchanged, but it now has eight prompts of measured behaviour bolted onto it: four where it fires, four where it stays quiet. That fire/quiet split is the actual experimental control for the rest of the article.

The matcher's surface area is deliberately thin. tokenize, keywords, score_prompt, and best_match together fit in roughly fifty lines, and each function obeys the two-level nesting limit. The next step can rewrite the greet description as aggressively as it wants without ever needing to change matcher code.

That sets up the question step 3 will answer: what happens to the eight passing assertions if we strip the description down to a bare verb, or pad it with generic engineering vocabulary? The matcher will tell us, prompt by prompt, exactly which terms carried the trigger.

Repository

The state of the code after this step: 73b92d4

Step 3: Adding an Overlapping Welcome Skill to Surface Trigger Collisions

Step 2 left us with a single skill, a deterministic matcher, and eight prompts that pinned down a clean baseline: greet fires on greeting paraphrases and stays silent on unrelated developer chatter. A skill manifest with one entry can never expose a collision, though, because there is nothing for it to collide with. This step adds the missing variable.

We register a second skill, welcome, whose description deliberately reuses the same content words as greet — hello, greet, friendly, open, conversation. Then we extend the prompt corpus with onboarding-intent prompts and a small set of ambiguous prompts, and add tests that lock in exactly how the matcher behaves when two descriptions have overlapping vocabulary.

Setup

No new third-party dependencies. The deltas are one new manifest file plus edits to the prompt corpus and the test modules; the layout from step 2 stays put:

codebase/
├── .claude/
│   └── skills/
│       ├── greet/
│       │   └── SKILL.md          # unchanged from step 1
│       └── welcome/
│           └── SKILL.md          # NEW
├── src/
│   ├── skill_loader.py           # unchanged
│   ├── trigger_matcher.py        # unchanged
│   └── sample_prompts.py         # EDITED
└── tests/
    └── test_trigger_matcher.py   # EDITED

The greet manifest is intentionally untouched. We are isolating one variable at a time, and the variable for this step is "what happens when a second description joins the directory." Reusing the existing loader and matcher means any behavioural change observed here is wording-driven, not tooling-driven.

Implementation

The new manifest writes the overlap into the description on purpose. It targets welcoming a newcomer, but it leans on the same surface vocabulary greet already owns. That is the whole point — we want the matcher to see two skills competing for the same tokens.

---
name: welcome
description: Welcome a new user with a friendly hello. Use when the user wants to greet a newcomer joining the team or open a warm onboarding conversation.
---

# Welcome

A second placeholder skill whose description deliberately overlaps with
`greet`. The shared vocabulary ("hello", "greet", "friendly", "open",
"conversation") surfaces auto-trigger ambiguity that later steps will
disentangle.
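
The overlap can be measured directly. This sketch intersects the keyword sets of the two descriptions as they stand in this step; GREET and WELCOME are constant names introduced here for illustration.

```python
import re

WORD_RE = re.compile(r"[a-z0-9]+")
STOPWORDS = frozenset({
    "a", "an", "and", "or", "the", "to", "of", "in", "on", "by",
    "for", "with", "use", "when", "user", "wants", "want", "someone",
    "is", "be", "this", "that", "it", "as", "at", "from", "you",
    "your", "can", "please", "would", "could", "should", "do",
})


def keywords(text: str) -> set[str]:
    return {w for w in WORD_RE.findall(text.lower())
            if w not in STOPWORDS and len(w) > 1}


GREET = (
    "Send a friendly greeting. Use when the user wants to say hello, "
    "greet someone by name, or open a conversation warmly."
)
WELCOME = (
    "Welcome a new user with a friendly hello. Use when the user wants to "
    "greet a newcomer joining the team or open a warm onboarding conversation."
)

shared = sorted(keywords(GREET) & keywords(WELCOME))
only_welcome = sorted(keywords(WELCOME) - keywords(GREET))
print("shared:", shared)
print("welcome only:", only_welcome)
# shared: ['conversation', 'friendly', 'greet', 'hello', 'open']
# welcome only: ['joining', 'new', 'newcomer', 'onboarding', 'team', 'warm', 'welcome']
```

Five of welcome's twelve keywords are also greet keywords, so any prompt built mostly from those five words gives the matcher nothing to separate the two skills.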

Next we extend sample_prompts.py with two new corpora. POSITIVE_WELCOME_PROMPTS carries onboarding-flavoured wording — new hire, onboarding, newest team member — so we can check that the welcome-specific terms actually pull the win. AMBIGUOUS_PROMPTS deliberately reuses prompts from the original positive set whose vocabulary is symmetric across both descriptions.

POSITIVE_WELCOME_PROMPTS: tuple[str, ...] = (
    "Welcome the new hire to our team.",
    "Open a warm onboarding conversation for our newcomer.",
    "Greet our newest team member as they onboard.",
)

AMBIGUOUS_PROMPTS: tuple[str, ...] = (
    "Can you say hello to my new teammate?",
    "Open the conversation with a friendly hello.",
)

The test module then encodes three new invariants on top of the step 2 suite. First, the loader now returns both skills, so we assert the discovered name set is exactly {"greet", "welcome"}. Second, the baseline greet must still win on its original four positive prompts, because adding a second skill must not regress the prior step's contract. On the two prompts that reappear in AMBIGUOUS_PROMPTS it wins only because the sort is stable and greet is loaded first, which is exactly the fragility the collision tests expose. Third, welcome must win on the onboarding-flavoured prompts, where the discriminating tokens (onboarding, newcomer, team) sit only in its description.

def test_step3_loads_both_overlapping_skills(skills):
    names = {s.name for s in skills}
    assert names == {"greet", "welcome"}


@pytest.mark.parametrize("prompt", POSITIVE_WELCOME_PROMPTS)
def test_welcome_triggers_on_onboarding_intent_prompts(prompt, skills):
    match = best_match(prompt, skills)
    assert match is not None, f"welcome failed to fire on: {prompt!r}"
    assert match.skill.name == "welcome", (
        f"expected welcome to win on {prompt!r}, got {match.skill.name}"
    )
    assert match.score >= 1

The interesting test, the one that turns this step into a real experiment, is the collision check. For each ambiguous prompt, both skills must end up with the same nonzero overlap score, and they must be the top two ranked entries. A passing run is the matcher telling us, in plain numbers, that the two descriptions are indistinguishable on this wording.

@pytest.mark.parametrize("prompt", AMBIGUOUS_PROMPTS)
def test_overlapping_skills_collide_with_tied_scores(prompt, skills):
    ranked = score_prompt(prompt, skills)
    top_two = ranked[:2]
    assert top_two[0].score > 0
    assert top_two[0].score == top_two[1].score
    names = {m.skill.name for m in top_two}
    assert names == {"greet", "welcome"}
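
To see which prompts actually collide, the sketch below scores the four original greet prompts against both step 3 descriptions. The scores and tied_leaders helpers are hypothetical conveniences, not part of the repository's matcher.

```python
import re

WORD_RE = re.compile(r"[a-z0-9]+")
STOPWORDS = frozenset({
    "a", "an", "and", "or", "the", "to", "of", "in", "on", "by",
    "for", "with", "use", "when", "user", "wants", "want", "someone",
    "is", "be", "this", "that", "it", "as", "at", "from", "you",
    "your", "can", "please", "would", "could", "should", "do",
})

DESCRIPTIONS = {
    "greet": "Send a friendly greeting. Use when the user wants to say "
             "hello, greet someone by name, or open a conversation warmly.",
    "welcome": "Welcome a new user with a friendly hello. Use when the user "
               "wants to greet a newcomer joining the team or open a warm "
               "onboarding conversation.",
}

PROMPTS = (
    "Can you say hello to my new teammate?",
    "Please greet Maria warmly before the meeting starts.",
    "Open the conversation with a friendly hello.",
    "Send a quick greeting to the channel.",
)


def keywords(text: str) -> set[str]:
    return {w for w in WORD_RE.findall(text.lower())
            if w not in STOPWORDS and len(w) > 1}


def scores(prompt: str) -> dict[str, int]:
    # Overlap count per skill, mirroring what score_prompt records.
    prompt_tokens = set(WORD_RE.findall(prompt.lower()))
    return {name: len(keywords(text) & prompt_tokens)
            for name, text in DESCRIPTIONS.items()}


def tied_leaders(prompt: str) -> list[str]:
    # Hypothetical helper: every skill sharing the top nonzero score.
    table = scores(prompt)
    best = max(table.values())
    return sorted(n for n, s in table.items() if s == best and best > 0)


for prompt in PROMPTS:
    print(scores(prompt), tied_leaders(prompt), prompt)
```

The first and third prompts tie at 2-2 and 4-4. The second and fourth go to greet outright (2-1 and 2-0), which is why only two of the four original prompts appear in AMBIGUOUS_PROMPTS.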

The final new test makes the ambiguity uncomfortably explicit. We feed the same ambiguous prompt to score_prompt twice — once with the skill list in load order, once reversed — and assert that the winning skill changes even though the score does not. The matcher is not "picking smarter"; it is silently falling back to list order, and the test pins that down so a later step cannot quietly paper over it.

def test_ambiguous_winner_is_decided_only_by_load_order(skills):
    prompt = "Open the conversation with a friendly hello."
    ranked_forward = score_prompt(prompt, skills)
    ranked_reverse = score_prompt(prompt, list(reversed(list(skills))))
    assert ranked_forward[0].score == ranked_reverse[0].score
    assert ranked_forward[0].skill.name != ranked_reverse[0].skill.name
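
The reason the winner flips is that Python's sort is stable, including with reverse=True: rows with equal keys keep their input order. A minimal sketch, using plain (name, score) tuples as stand-ins for TriggerMatch records:

```python
# Two rows with a tied score, in load order and in reversed order.
forward = [("greet", 4), ("welcome", 4)]
reverse = list(reversed(forward))


def rank(rows: list[tuple[str, int]]) -> list[tuple[str, int]]:
    # Same call shape as score_prompt: sort by score, highest first.
    return sorted(rows, key=lambda row: row[1], reverse=True)


print(rank(forward)[0][0], rank(reverse)[0][0])  # greet welcome
```

Nothing in the sort key distinguishes the two rows, so whichever arrives first stays first.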

Verification

python3 -m pytest

..............................                                           [100%]
30 passed in 0.09s

Thirty green tests. Step 1 contributed seven loader tests, step 2 contributed fourteen matcher tests, and step 3 adds nine net additions: two new loader tests that register the welcome stub and assert its description overlaps greet vocabulary, the both-skills-loaded matcher assertion, three onboarding-intent parametrised passes, two ambiguous-prompt collision passes, and the load-order tiebreaker. The four negative-prompt passes from step 2 still hold — they now cover both skills instead of just greet, so the baseline silence on unrelated prompts is preserved as the manifest grows.

What we built

We now have two skills in the manifest directory instead of one, and a test suite that distinguishes four regimes rather than two. There are prompts where greet alone wins, prompts where welcome alone wins, prompts where the two tie, and prompts where neither fires. That four-way split is the actual experimental surface for the rest of the article.

The collision tests are the structural payoff. They prove, mechanically, that two well-meaning descriptions sharing common verbs become indistinguishable on common phrasings. The matcher does not crash, but it also cannot choose; it falls back to whichever skill the loader happened to enumerate first.

The load-order test is the uncomfortable receipt for that fallback. Reversing the skill list flips the winner without changing the score. In this lab load_skills sorts the manifests, so the tie always goes to the alphabetically first directory; any matcher that breaks ties by raw enumeration order would be just as fragile as the order of os.listdir on a real filesystem. Naming that fragility as a test, rather than a footnote, makes it impossible to forget.

With the collision pinned down, step 4 has a concrete failure to fix. The question becomes: which wording changes to the two descriptions move the tied prompts back into clean wins — and at what cost to the negative-prompt corpus? Every candidate rewrite will now be measured against the same thirty assertions.

Repository

The state of the code after this step: 2574f13

Key commits to step through:

273008d — step 1: scaffold the skill manifest lab
73b92d4 — step 2: token-overlap matcher and baseline corpus
2574f13 — step 3: overlapping welcome skill surfaces trigger collisions

Step 4: Breaking the Trigger Collision with Action Verbs, Explicit Phrases, and Scope Qualifiers

Step 3 left the suite in an uncomfortable state. Two ambiguous prompts — "Can you say hello to my new teammate?" and "Open the conversation with a friendly hello." — produced perfect ties between greet and welcome, and the winner flipped whenever the load order flipped. The matcher was not picking; the filesystem was.

This step fixes the descriptions themselves. We rewrite both manifests around three deliberate moves: concrete action verbs that name the operation, explicit trigger phrases that mirror the wording a user would actually type, and scope qualifiers that mark where each skill stops. The behavioural target is sharp: the previously tied prompts must now have a strict winner whose identity does not depend on directory enumeration order.

Setup

No new dependencies, no new files. The deltas are surgical edits to two existing manifests and the supporting test fixtures:

codebase/
├── .claude/
│   └── skills/
│       ├── greet/
│       │   └── SKILL.md          # description REWRITTEN
│       └── welcome/
│           └── SKILL.md          # description REWRITTEN
├── src/
│   ├── skill_loader.py           # unchanged
│   ├── trigger_matcher.py        # unchanged
│   └── sample_prompts.py         # renamed AMBIGUOUS_PROMPTS + resolutions
└── tests/
    ├── test_skill_loader.py      # asserts on the refactor's shape
    └── test_trigger_matcher.py   # collision tests inverted into win tests

The matcher source is untouched on purpose. Step 3 already proved that the scoring algorithm is deterministic and that the ambiguity lives in the descriptions, not the code. Keeping trigger_matcher.py frozen means any change we observe in the test outcomes is provably wording-driven.

Implementation

The rewrite of greet/SKILL.md swaps the generic step 1 wording for a description that does three jobs at once. The action verbs are concrete (send, drop, reply, greet), the trigger phrases echo natural user wording, and the closing sentence carves out the scope.

---
name: greet
description: Send a brief friendly greeting like hi or hello. Use when the user wants to greet a teammate, send a quick hi, drop a casual salutation, or reply with a friendly hello. Reserved for acknowledging existing contacts.
---

Two design choices are worth naming. First, the verb cluster (send / drop / reply / greet) gives the matcher multiple lexical entry points without inflating the description with synonyms it does not actually mean. Second, the scope qualifier "reserved for acknowledging existing contacts" deliberately exiles the onboarding semantics: the user is not new, the contact already exists. For a reader, human or model, that phrase marks where greet stops. Our token-overlap stand-in cannot read it as an exclusion, so in this lab the measurable separation comes from the two descriptions no longer sharing any keywords.

The welcome/SKILL.md rewrite is its mirror. We strip every token the two descriptions previously shared (hello, greet, friendly, open, conversation). The word warm can stay because the old greet description only ever said warmly, which the matcher treats as a different token, and newcomer stays because it is onboarding-coded. The verbs and phrases now lean entirely on the onboarding scope.

---
name: welcome
description: Onboard a newcomer with a warm welcome message. Use when the user wants to welcome a new hire, onboard a recently joined team member, or post an onboarding introduction for fresh arrivals.
---

The asymmetry is intentional. greet keeps the casual greeting vocabulary; welcome keeps the onboarding vocabulary; nothing sits in both bowls. The "for fresh arrivals" tail performs the same scope-qualifier role as greet's "existing contacts" — it tells the matcher this skill is for first-contact moments only.

Next we update sample_prompts.py to reflect the new contract. The old AMBIGUOUS_PROMPTS becomes PREVIOUSLY_AMBIGUOUS_PROMPTS (the prompts have not changed; only our claim about them has), and we add a resolution table that pins down which skill should now win each formerly-tied prompt.

PREVIOUSLY_AMBIGUOUS_PROMPTS: tuple[str, ...] = (
    "Can you say hello to my new teammate?",
    "Open the conversation with a friendly hello.",
)

PREVIOUSLY_AMBIGUOUS_PROMPT_RESOLUTIONS: tuple[tuple[str, str], ...] = (
    ("Can you say hello to my new teammate?", "greet"),
    ("Open the conversation with a friendly hello.", "greet"),
)

The resolution choices may look surprising: both prompts route to greet, not welcome. The reason is that the prompt words that still score (hello and teammate in the first prompt, friendly and hello in the second) all belong to greet's description after the rewrite, while none of welcome's discriminating tokens (onboard, hire, joined, arrivals) appear in the prompts at all. The user mentioning "new" earns welcome a single point on the first prompt, which is not enough against greet's two; the rewrite ensures the matcher demands stronger onboarding signal before firing welcome.
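
The resolution table can be checked by hand. The sketch below scores both formerly tied prompts against the rewritten descriptions and compares the winner with the table; the hits helper and the report list are hypothetical names used only for this illustration.

```python
import re

WORD_RE = re.compile(r"[a-z0-9]+")
STOPWORDS = frozenset({
    "a", "an", "and", "or", "the", "to", "of", "in", "on", "by",
    "for", "with", "use", "when", "user", "wants", "want", "someone",
    "is", "be", "this", "that", "it", "as", "at", "from", "you",
    "your", "can", "please", "would", "could", "should", "do",
})

DESCRIPTIONS = {
    "greet": "Send a brief friendly greeting like hi or hello. Use when the "
             "user wants to greet a teammate, send a quick hi, drop a casual "
             "salutation, or reply with a friendly hello. Reserved for "
             "acknowledging existing contacts.",
    "welcome": "Onboard a newcomer with a warm welcome message. Use when the "
               "user wants to welcome a new hire, onboard a recently joined "
               "team member, or post an onboarding introduction for fresh "
               "arrivals.",
}

RESOLUTIONS = (
    ("Can you say hello to my new teammate?", "greet"),
    ("Open the conversation with a friendly hello.", "greet"),
)


def hits(description: str, prompt: str) -> list[str]:
    # Hypothetical helper: description keywords that also occur in the prompt.
    terms = {w for w in WORD_RE.findall(description.lower())
             if w not in STOPWORDS and len(w) > 1}
    return sorted(terms & set(WORD_RE.findall(prompt.lower())))


report = []
for prompt, expected in RESOLUTIONS:
    table = {name: hits(text, prompt) for name, text in DESCRIPTIONS.items()}
    winner = max(table, key=lambda name: len(table[name]))
    report.append((expected, winner, table))
    print(winner, table, prompt)
```

The first prompt resolves 2-1 (hello and teammate against new) and the second 2-0 (friendly and hello against nothing), so both wins are strict.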

The test module then inverts step 3's collision tests into strict-win tests. The parametrised case that previously demanded top.score == runner_up.score now demands top.score > runner_up.score, and a separate routing test checks each prompt's expected winner against the resolution table.

@pytest.mark.parametrize("prompt", PREVIOUSLY_AMBIGUOUS_PROMPTS)
def test_step4_previously_tied_prompts_now_have_a_strict_winner(prompt, skills):
    ranked = score_prompt(prompt, skills)
    top, runner_up = ranked[0], ranked[1]
    assert top.score > 0, f"expected nonzero top score on {prompt!r}"
    assert top.score > runner_up.score, (
        f"step 4 refactor should break the tie on {prompt!r}; got "
        f"{top.skill.name}={top.score} vs {runner_up.skill.name}={runner_up.score}"
    )

The load-order test from step 3 — which previously asserted that reversing the skill list flipped the winner — is the most diagnostic update of all. We rewrite it to demand the opposite: the winner must be the same regardless of enumeration order, because score now decides outright instead of falling through to list position.

def test_step4_winner_is_stable_across_load_order(skills):
    prompt = "Open the conversation with a friendly hello."
    ranked_forward = score_prompt(prompt, skills)
    ranked_reverse = score_prompt(prompt, list(reversed(list(skills))))
    assert ranked_forward[0].skill.name == ranked_reverse[0].skill.name, (
        "step 4 refactor should make the winner depend on score, not load order"
    )
    assert ranked_forward[0].score == ranked_reverse[0].score

We also add five structural assertions in test_skill_loader.py that lock the rewrite's shape — not just its behaviour — into the suite. They check that the two descriptions share zero content keywords, that greet carries at least three action verbs, that greet declares a scope qualifier, that welcome leans on onboarding vocabulary, and that welcome has dropped every collision term. Those tests turn the three design moves of this step into machine-enforced invariants that a future careless rewording cannot quietly violate.
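
A minimal sketch of what such shape checks might look like, written as plain assertions rather than the repository's actual tests. The ACTION_VERBS, ONBOARDING_TERMS, and COLLISION_TERMS sets and the "reserved for" substring check are assumptions drawn from the prose above, not the names the suite really uses.

```python
import re

WORD_RE = re.compile(r"[a-z0-9]+")
STOPWORDS = frozenset({
    "a", "an", "and", "or", "the", "to", "of", "in", "on", "by",
    "for", "with", "use", "when", "user", "wants", "want", "someone",
    "is", "be", "this", "that", "it", "as", "at", "from", "you",
    "your", "can", "please", "would", "could", "should", "do",
})


def keywords(text: str) -> set[str]:
    return {w for w in WORD_RE.findall(text.lower())
            if w not in STOPWORDS and len(w) > 1}


GREET = (
    "Send a brief friendly greeting like hi or hello. Use when the user "
    "wants to greet a teammate, send a quick hi, drop a casual salutation, "
    "or reply with a friendly hello. Reserved for acknowledging existing "
    "contacts."
)
WELCOME = (
    "Onboard a newcomer with a warm welcome message. Use when the user wants "
    "to welcome a new hire, onboard a recently joined team member, or post "
    "an onboarding introduction for fresh arrivals."
)

# Assumed rule sets, taken from the wording of this step.
ACTION_VERBS = {"send", "drop", "reply", "greet"}
ONBOARDING_TERMS = {"onboard", "onboarding", "newcomer", "hire"}
COLLISION_TERMS = {"hello", "greet", "friendly", "open", "conversation"}

shared = keywords(GREET) & keywords(WELCOME)
verbs_present = ACTION_VERBS & keywords(GREET)
has_scope_qualifier = "reserved for" in GREET.lower()
onboarding_present = ONBOARDING_TERMS & keywords(WELCOME)
leaked = COLLISION_TERMS & keywords(WELCOME)

assert not shared, f"descriptions share keywords: {sorted(shared)}"
assert len(verbs_present) >= 3
assert has_scope_qualifier
assert onboarding_present
assert not leaked, f"welcome still carries collision terms: {sorted(leaked)}"
print("all five shape checks hold")
```

Putting the offending words in the assertion message is what lets a failing check point at the exact rule a rewording broke.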

Verification

python3 -m pytest

....................................                                     [100%]
36 passed in 0.10s

Thirty-six green tests, up from thirty at the end of step 3. The six net additions split cleanly along the design moves: five new shape tests in test_skill_loader.py (no-shared-keywords, action-verb presence, scope-qualifier presence, onboarding-vocabulary presence, collision-term absence) and one new strict-winner routing test in test_trigger_matcher.py. The two previously-tied parametrised passes now assert top.score > runner_up.score instead of equality, and the load-order test asserts identical winners instead of flipped winners: same test count, inverted expectation.

What we built

We turned the step 3 collision into a forced choice. Every previously-tied ambiguous prompt now ranks one skill strictly above the other, and the strict winner stays the same when the skill list is enumerated in reverse. The matcher is finally picking on score signal rather than coasting on load order.

The structural tests are the second deliverable. They encode the three design moves — action verbs, explicit trigger phrases, scope qualifiers — as runnable invariants on the description text itself. A future contributor who reverts to a vague one-liner will not just see a behavioural regression; they will see five named tests fail with messages that point at the exact rule they broke.

The third, quieter deliverable is the resolution table. By forcing ourselves to name which skill should win each previously-tied prompt, we made the scope decision explicit. Both prompts routing to greet is not an accident of vocabulary; it is a design call about where the boundary between greeting and onboarding lives.

What stays open is the false-positive question. The current suite still uses the same four negative prompts from step 2 and confirms neither skill fires on them — but the rewrite has expanded greet's surface vocabulary, and we have not yet stress-tested it against prompts that mention greeting verbs in unrelated contexts. Step 5 will close that gap by introducing anti-pattern skills with deliberately vague descriptions and watching how the matcher handles them.

Repository

The state of the code after this step: 9741051

Key commits to step through:

273008d — step 1: scaffold the skill manifest lab
73b92d4 — step 2: token-overlap matcher and baseline corpus
2574f13 — step 3: overlapping welcome skill surfaces trigger collisions
9741051 — step 4: action verbs, explicit phrases, and scope qualifiers break the collision

Step 5: Stress-Testing Precision with Vague, Broad, and Buzzword Anti-Pattern Skills

Step 4 closed the collision between greet and welcome by rewriting both descriptions around action verbs, explicit trigger phrases, and scope qualifiers. The strict-winner suite proved the rewrite worked on the prompts we already had — but it never asked the harder question: what happens when a bad description sits in the registry alongside the good ones? A matcher that fires on a vague competitor is not actually picking — it is leaking.

This step plants three anti-pattern skills — vague_helper, broad_handler, and buzzword_skill — into .claude/skills/ and bolts a thirty-two-case false-positive suite onto the existing matcher. The behavioural target is precise: every concrete positive prompt still routes to greet or welcome, every anti-pattern scores exactly zero on those prompts, and the four negative prompts from step 2 still trigger nothing at all.

Setup

No production code changes. The deltas are three new skill manifests, an extended prompt module, and a brand-new test file dedicated to false-positive controls:

codebase/
├── .claude/
│   └── skills/
│       ├── greet/SKILL.md          # unchanged
│       ├── welcome/SKILL.md        # unchanged
│       ├── vague_helper/SKILL.md   # NEW — anti-pattern 1
│       ├── broad_handler/SKILL.md  # NEW — anti-pattern 2
│       └── buzzword_skill/SKILL.md # NEW — anti-pattern 3
├── src/
│   ├── skill_loader.py             # unchanged
│   ├── trigger_matcher.py          # unchanged
│   └── sample_prompts.py           # + ANTI_PATTERN_SKILL_NAMES, CONCRETE_POSITIVE_PROMPTS
└── tests/
    ├── test_skill_loader.py        # unchanged
    ├── test_trigger_matcher.py     # one assertion loosened from equality to subset
    └── test_anti_patterns.py       # NEW — 32 tests, all precision-focused

The loader and matcher stay frozen for the same reason as step 4: if we add bad descriptions and change nothing else, any precision wobble we observe is provably caused by the new manifests rather than by hidden algorithmic drift. The single edit to test_trigger_matcher.py is a relaxation, not new behaviour — test_step3_loads_both_overlapping_skills switches from names == {"greet", "welcome"} to a subset check, because the registry now contains five skills and the original equality would no longer hold.

Implementation

Each anti-pattern is a different kind of bad description. The first, vague_helper, leans on abstraction: every content word is a generic noun or verb that no real user would type into a prompt.

---
name: vague_helper
description: A general-purpose helper that does useful things when called.
---

The token-overlap matcher we built in step 2 scores by intersection of content keywords. A description whose keywords are general, purpose, helper, useful, things, called has nothing to intersect with — real prompts contain verbs like greet, welcome, onboard, not the meta-vocabulary of skill authoring. The skill is structurally unable to win, and the test suite uses that fact as a hard invariant rather than a soft hope.
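To make the zero-overlap claim concrete, here is a minimal sketch of intersection scoring. The tokeniser, the stop-word list, the length floor, and the sample prompt are all assumptions made for illustration; the real keywords helper and score_prompt live in trigger_matcher.py and may differ in detail.

```python
import re

# Hypothetical stop-word list and length floor; the real matcher's values may differ.
STOP_WORDS = frozenset({"a", "an", "the", "that", "does", "when", "to", "of", "and", "for"})
MIN_TOKEN_LENGTH = 3


def content_keywords(text: str) -> set[str]:
    """Lower-case the text, split on non-letters, drop stop-words and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if len(t) >= MIN_TOKEN_LENGTH and t not in STOP_WORDS}


def overlap_score(prompt: str, description: str) -> int:
    """Score is the number of content keywords shared by prompt and description."""
    return len(content_keywords(prompt) & content_keywords(description))


vague = "A general-purpose helper that does useful things when called."
prompt = "greet the new teammate in the channel"  # hypothetical user prompt

print(sorted(content_keywords(vague)))
# ['called', 'general', 'helper', 'purpose', 'things', 'useful']
print(overlap_score(prompt, vague))
# 0
```

Under these assumptions the description yields the same six keywords listed above, and none of them can appear in a prompt that names an actual greeting task.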

The second anti-pattern, broad_handler, inverts the failure mode by overclaiming scope:

---
name: broad_handler
description: Handles any kind of input across many situations and contexts.
---

handles, any, kind, input, across, many, situations, contexts — every word is about coverage rather than action. The matcher does not reward authorial confidence; it rewards lexical overlap with the user's prompt. By choosing a description that names no operation a user would request, we get a control that scores zero by construction.

The third anti-pattern, buzzword_skill, is the marketing-deck failure mode — confident, register-heavy, and entirely content-free:

---
name: buzzword_skill
description: An intelligent, smart, advanced solution leveraging cutting-edge capabilities to deliver value.
---

intelligent, smart, advanced, solution, leveraging, cutting, edge, capabilities, deliver, value — ten tokens that all sound impressive and zero that overlap with anything a real user types. The three anti-patterns together span the failure landscape: under-specified (vague), over-scoped (broad), and over-styled (buzzword).

We then extend sample_prompts.py with two pieces of shared vocabulary the new test file needs:

ANTI_PATTERN_SKILL_NAMES: frozenset[str] = frozenset({
    "vague_helper",
    "broad_handler",
    "buzzword_skill",
})

CONCRETE_POSITIVE_PROMPTS: tuple[str, ...] = (
    POSITIVE_GREET_PROMPTS + POSITIVE_WELCOME_PROMPTS
)

ANTI_PATTERN_SKILL_NAMES is the registry of skills we expect never to win, and CONCRETE_POSITIVE_PROMPTS is the concatenation of every well-formed greet and welcome prompt from earlier steps. The naming reflects the assertion shape: every concrete positive prompt is something a real user actually types, and no anti-pattern is allowed to outrank a real skill on those inputs.

The new test file tests/test_anti_patterns.py is the heart of the step. It parametrises ten distinct invariants over the three anti-pattern skills and the seven concrete positive prompts, producing the thirty-two-test precision suite.

@pytest.mark.parametrize("name", sorted(ANTI_PATTERN_SKILL_NAMES))
def test_step5_anti_pattern_descriptions_have_no_action_verbs(name, skills_by_name):
    text = skills_by_name[name].description.lower()
    forbidden_verbs = ("send", "drop", "reply", "onboard", "welcome", "greet", "post")
    leaked = [v for v in forbidden_verbs if v in text]
    assert leaked == [], (
        f"anti-pattern {name!r} must NOT borrow concrete action verbs; leaked {leaked!r}"
    )

This is a shape test. It encodes the rule "an anti-pattern stops being an anti-pattern the moment it borrows a real action verb" as a runnable assertion. A future contributor cannot quietly soften a bad description into a half-good one without the suite turning red.

The complementary behaviour tests then prove the anti-patterns actually fail to fire:

@pytest.mark.parametrize("prompt", CONCRETE_POSITIVE_PROMPTS)
def test_step5_anti_patterns_score_zero_on_concrete_positive_prompts(prompt, skills):
    ranked = score_prompt(prompt, skills)
    anti_matches = [m for m in ranked if m.skill.name in ANTI_PATTERN_SKILL_NAMES]
    for m in anti_matches:
        assert m.score == 0, (
            f"anti-pattern {m.skill.name!r} stole signal on {prompt!r}: "
            f"score={m.score} matched_terms={m.matched_terms!r}"
        )


@pytest.mark.parametrize("prompt", CONCRETE_POSITIVE_PROMPTS)
def test_step5_anti_patterns_never_win_against_real_skills(prompt, skills):
    match = best_match(prompt, skills)
    assert match is not None, f"expected a real skill to win on {prompt!r}"
    assert match.skill.name not in ANTI_PATTERN_SKILL_NAMES, (
        f"anti-pattern {match.skill.name!r} wrongly beat the real skills on "
        f"{prompt!r}"
    )

The first assertion is the strict version: every anti-pattern scores exactly zero on every concrete positive prompt. The second is the operational version: even if some future token-collision sneaks a non-zero score in, the anti-pattern must never beat the real winner. Holding both lines is what turns "the matcher seems to ignore bad descriptions" into a guarantee.

Regression tests close the loop by re-running step 4's resolutions with the anti-patterns loaded. The one that pins the previously tied prompts reads:

def test_step5_greet_still_wins_previously_ambiguous_prompts_with_anti_patterns_loaded(
    skills,
):
    for prompt, expected in PREVIOUSLY_AMBIGUOUS_PROMPT_RESOLUTIONS:
        match = best_match(prompt, skills)
        assert match.skill.name == expected, (
            f"step 5 expects {expected!r} to keep winning {prompt!r} even with "
            f"anti-patterns in the registry; got {match.skill.name!r}"
        )

If adding three skills to the registry changed the resolution of a previously-tied prompt, the precision claim would be hollow — the anti-patterns would be reshaping the leaderboard even while losing. Pinning the step 4 routing keeps the new manifests honest.

Verification

python3 -m pytest

....................................................................   [100%]
68 passed in 0.17s

Sixty-eight green tests, up from thirty-six at the end of step 4. The thirty-two-test delta is the new test_anti_patterns.py file in its entirety: nine shape tests across the three anti-patterns, fourteen score-zero and never-win parametrisations across the seven concrete positive prompts, four negative-prompt assertions, three regression checks on the step 4 resolutions and positive routing, plus the three positive-routing parametrisations and the registry-size guard. The only edit to an earlier test is the relaxed subset assertion in test_trigger_matcher.py described under Setup; step 4's strict-winner suite still passes unchanged with the anti-patterns sitting in the loader.

What we built

We added three deliberately weak skill manifests that span the realistic failure modes for description authoring: under-specified (vague_helper), over-scoped (broad_handler), and over-styled (buzzword_skill). Each one is a runnable counter-example — not commentary in a markdown body, but an actual SKILL.md that the loader walks past on every test run.

We also built a thirty-two-test precision harness that turns the implicit claim "bad descriptions do not fire" into a hard invariant. The suite checks both the shape of the descriptions (no action verbs, no "use when" trigger phrases, no overlap with the real skills' keyword sets) and the behaviour on real prompts (score zero, never win, do not break step 4 routing).

The deeper payoff is the precision-versus-recall framing this step makes explicit. Earlier steps cared about recall — will the right skill fire when I ask? — and step 4 added the disambiguation question — which of two right skills wins? Step 5 finally pays attention to precision: do the wrong skills stay quiet? The three anti-patterns give us calibrated noise to measure precision against.

What stays open is the trade-off curve itself. We have proven the matcher is precise on seven concrete prompts and four negative prompts, but we have not yet mapped its behaviour across a larger prompt matrix that would expose marginal cases. Step 6 will build that matrix and score the matcher's precision and recall as numbers, not asserts.

Repository

The state of the code after this step: 2e1e38f

Key commits to step through:

273008d — step 1: scaffold the skill manifest lab
73b92d4 — step 2: token-overlap matcher and baseline corpus
2574f13 — step 3: overlapping welcome skill surfaces trigger collisions
9741051 — step 4: action verbs, explicit phrases, and scope qualifiers break the collision
2e1e38f — step 5: vague, broad, and buzzword anti-patterns prove matcher precision

Step 6: Scoring Trigger Precision and Recall with a Prompt-Matrix Harness

Step 5 closed the precision gap by planting three deliberately weak skill manifests in the registry and proving, with hard asserts, that they never outscored the real winners on seven concrete positive prompts. That suite answered a binary question well: do the bad skills stay quiet, yes or no. It still left the engineering question wide open: how precise is the matcher, and how complete is its recall, when scored against the whole labelled corpus at once?

This step builds a small evaluation harness, prompt_matrix.py plus test_prompt_matrix.py, that drives a labelled prompt matrix through the existing matcher and reports per-skill precision and recall as numbers. The harness defines a PromptCase schema pairing each prompt with the skill that should fire (or None for the negative prompts), tallies true positives, false positives, and false negatives per skill, and renders a scoreboard. The real skills score 1.00 / 1.00; the anti-patterns score n/a because they record zero true positives, zero false positives, and zero false negatives, which leaves both ratios with a zero denominator. That undefined score is itself a result worth reporting.

Setup

Two new files, no edits to the existing modules:

codebase/
├── .claude/skills/                # unchanged (5 skills)
├── src/
│   ├── skill_loader.py            # unchanged
│   ├── trigger_matcher.py         # unchanged
│   ├── sample_prompts.py          # unchanged
│   └── prompt_matrix.py           # NEW — harness + scoreboard renderer
└── tests/
    ├── test_skill_loader.py       # unchanged
    ├── test_trigger_matcher.py    # unchanged
    ├── test_anti_patterns.py      # unchanged
    └── test_prompt_matrix.py      # NEW — 16 tests, harness + matrix coverage

The harness reuses three tuples already exported by sample_prompts: POSITIVE_GREET_PROMPTS, POSITIVE_WELCOME_PROMPTS, and NEGATIVE_PROMPTS. Keeping those as the single source of truth means the matrix never silently drifts from the prompts that earlier tests are still asserting against. If a later change grows the corpus, every test file picks it up in one place.

Implementation

We start with the fixture schema. A PromptCase is the smallest unit of labelled data the harness understands: a prompt string plus the name of the skill that should win, or None if no skill should fire at all.

@dataclass(frozen=True)
class PromptCase:
    prompt: str
    expected_skill: str | None

Freezing the dataclass makes cases hashable and prevents tests from mutating shared fixtures by accident. The expected_skill: str | None shape is the important detail — it lets one labelled corpus describe both positive prompts (where some named skill must win) and negative prompts (where the right answer is silence). Negative labels are not a separate type or a separate function call; they are the same PromptCase with expected_skill=None.

The other dataclass, SkillMetrics, is the per-skill result we will eventually print.

@dataclass(frozen=True)
class SkillMetrics:
    skill_name: str
    true_positives: int
    false_positives: int
    false_negatives: int
    precision: float | None
    recall: float | None

precision and recall are typed float | None rather than float because both ratios are genuinely undefined when their denominator is zero. That case is normal, not exceptional, for an anti-pattern: it never fires, so tp + fp is zero, and no prompt names it as the expected skill, so tp + fn is zero too. Coercing None to 0.0 would make a never-firing skill look identical to a wrong-firing skill, which is the exact distinction the scoreboard needs to preserve.

Building the default matrix is a one-liner per source set:

def build_default_matrix() -> tuple[PromptCase, ...]:
    cases: list[PromptCase] = []
    cases.extend(PromptCase(p, "greet") for p in POSITIVE_GREET_PROMPTS)
    cases.extend(PromptCase(p, "welcome") for p in POSITIVE_WELCOME_PROMPTS)
    cases.extend(PromptCase(p, None) for p in NEGATIVE_PROMPTS)
    return tuple(cases)

Four greet prompts, three welcome prompts, four negative prompts — eleven labelled cases total. Returning a tuple (not a list) means callers cannot mutate the canonical matrix after construction, which keeps every test independent of every other test's ordering.

The scoring step needs three small helpers so the public functions stay flat. Per the codebase nesting rule — at most two levels of if/elif/else per function — the classification logic gets its own helper rather than living inside a loop:

def _classify(
    expected: str | None, winner: str | None, target: str
) -> tuple[int, int, int]:
    if expected == target and winner == target:
        return 1, 0, 0
    if expected != target and winner == target:
        return 0, 1, 0
    if expected == target and winner != target:
        return 0, 0, 1
    return 0, 0, 0

_classify reads as four exhaustive cases against one specific target skill: a true positive (expected and got the target), a false positive (got the target but shouldn't have), a false negative (should have got the target but didn't), and an irrelevant case (neither expected nor produced — counts for nothing). The fourth branch is the one that lets a single matrix score every skill independently: prompts that concern other skills are simply ignored when scoring this skill.

The second helper, _safe_ratio, is the source of the None-on-zero behaviour:

def _safe_ratio(numerator: int, denominator: int) -> float | None:
    if denominator == 0:
        return None
    return numerator / denominator

Precision divides by tp + fp; recall divides by tp + fn. When both are zero — as they are for every anti-pattern under the current corpus — the harness returns None, which propagates through the dataclass and prints as n/a in the scoreboard. That is structurally honest reporting: we are not claiming the skill scored 100% precision and 0% recall; we are admitting the matrix didn't exercise it.

evaluate_skill ties the helpers together with a straight accumulator loop:

def evaluate_skill(
    skill_name: str,
    cases: Iterable[PromptCase],
    skills: Sequence[Skill],
) -> SkillMetrics:
    tp = fp = fn = 0
    for case in cases:
        winner = _winner_name(case.prompt, skills)
        dtp, dfp, dfn = _classify(case.expected_skill, winner, skill_name)
        tp += dtp
        fp += dfp
        fn += dfn
    return SkillMetrics(
        skill_name=skill_name,
        true_positives=tp,
        false_positives=fp,
        false_negatives=fn,
        precision=_safe_ratio(tp, tp + fp),
        recall=_safe_ratio(tp, tp + fn),
    )

_winner_name is a one-line wrapper around best_match from step 3 that strips the SkillMatch envelope down to a name string. That keeps evaluate_skill free of None-handling for "no skill won" — the comparison winner == target simply evaluates to False when winner is None, which is exactly the desired semantics.

evaluate_all is a fan-out over every loaded skill so the caller doesn't have to spell out every name:

def evaluate_all(
    cases: Iterable[PromptCase],
    skills: Sequence[Skill],
) -> dict[str, SkillMetrics]:
    case_list = list(cases)
    return {s.name: evaluate_skill(s.name, case_list, skills) for s in skills}

We materialise cases into a list once because the iterable is consumed five times — once per loaded skill. Forgetting that detail would let a generator-based caller silently get one populated row and four empty ones, which is the kind of bug that survives many code reviews.
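The exhaustion bug is easy to reproduce without the harness. In this sketch count_cases stands in for evaluate_skill, and lazy_cases is a hypothetical generator-based caller; both names are illustrative only.

```python
from typing import Iterable, Iterator


def count_cases(cases: Iterable[str]) -> int:
    """Stand-in for evaluate_skill: simply consumes the iterable once."""
    return sum(1 for _ in cases)


def lazy_cases() -> Iterator[str]:
    """Hypothetical caller that hands over a generator instead of a list."""
    yield "prompt one"
    yield "prompt two"


skill_names = ["greet", "welcome", "vague_helper"]

# Without materialising: the first skill drains the generator, the rest see nothing.
gen = lazy_cases()
without_list = {name: count_cases(gen) for name in skill_names}
print(without_list)  # {'greet': 2, 'welcome': 0, 'vague_helper': 0}

# With list(...): every skill sees every case.
case_list = list(lazy_cases())
with_list = {name: count_cases(case_list) for name in skill_names}
print(with_list)  # {'greet': 2, 'welcome': 2, 'vague_helper': 2}
```

No exception is raised in the first variant, which is why the failure is silent.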

Finally, format_report renders the dict into a fixed-width scoreboard so the harness output is readable in a terminal without piping through a formatter:

def format_report(metrics_by_name: dict[str, SkillMetrics]) -> str:
    header = f"{'skill':<20} {'precision':>10} {'recall':>10} {'tp':>4} {'fp':>4} {'fn':>4}"
    rows = [header, "-" * len(header)]
    for name in sorted(metrics_by_name):
        rows.append(_format_row(metrics_by_name[name]))
    return "\n".join(rows)

Sorting by name keeps the row order deterministic across runs, which matters for snapshot-style tests and for diffing two scoreboards across step boundaries. _format_ratio prints "n/a" for None and a two-decimal float otherwise — so a real skill's 1.0 shows as 1.00 and an anti-pattern's undefined precision shows as n/a.

The test file tests/test_prompt_matrix.py covers three layers. The first layer asserts the matrix itself is built correctly — every known prompt appears once, every expected label shows up, and the per-label counts match the source tuples. The second layer asserts the per-skill metrics: greet and welcome both achieve 1.0 precision and 1.0 recall on the eleven-case matrix, and each anti-pattern produces tp == fp == fn == 0 with precision is None and recall is None. The third layer is a set of micro-cases that prove the classifier itself can detect a synthetic false positive, a synthetic false negative, and an unrelated case — so a future regression in _classify shows up immediately rather than hiding behind the well-behaved default matrix.

Verification

python3 -m pytest

........................................................................ [ 85%]
............                                                             [100%]
84 passed in 0.25s

Eighty-four green tests, up from sixty-eight at the end of step 5. The sixteen-test delta lines up exactly with the new test_prompt_matrix.py file: three matrix-shape tests, two end-to-end per-skill scoreboards for greet and welcome, three parametrised n/a-precision-and-recall asserts across the three anti-patterns, one full-report shape test, one rendered-report snapshot check, and six micro-tests that exercise the classifier on synthetic single-case matrices. No prior test file required edits — the harness reads the same sample_prompts constants the rest of the suite already depends on.

What we built

We turned step 5's binary precision claim ("the anti-patterns stay quiet") into a measurable scoreboard that reports precision and recall as numbers per skill, against a labelled eleven-case matrix shared with the existing tests.

The harness draws a clean line between the three classifier outcomes the matcher can produce — a true positive, a false positive, a false negative — and the fourth outcome that is not an outcome at all: a prompt that concerns some other skill. That fourth branch is what lets one matrix score every skill independently without inflating either side of the ratio with prompts that were never aimed at the skill under evaluation.

The most visible payoff is the n/a reporting for the anti-patterns. We could have hidden zero-firing skills behind a 0.0 precision and a 0.0 recall, and the suite would still have passed — but those numbers would have implied "this skill exists and got everything wrong" when the truth is closer to "this skill exists and was never exercised by the corpus." Typing the ratios as float | None keeps that honesty in the dataclass, the formatter, and the asserts all the way through.

What stays open is the corpus itself. The matrix is small (eleven cases), and the precision and recall numbers will only ever be as informative as the prompts behind them. Adversarial near-miss prompts that a token-overlap matcher is likely to mis-route are the natural next addition, so that there is something other than a flat 1.00 to optimise against. Step 7 leaves the corpus as it is and instead captures why the real descriptions score well.

Repository

The state of the code after this step: c0bdd84

Key commits to step through:

273008d — step 1: scaffold the skill manifest lab
73b92d4 — step 2: token-overlap matcher and baseline corpus
2574f13 — step 3: overlapping welcome skill surfaces trigger collisions
9741051 — step 4: action verbs, explicit phrases, and scope qualifiers break the collision
2e1e38f — step 5: vague, broad, and buzzword anti-patterns prove matcher precision
c0bdd84 — step 6: prompt-matrix harness scores per-skill precision and recall

Step 7: Distilling a Reusable Description Checklist and Shipping the Polished Skill Set

Step 6 left us with a precision and recall scoreboard that printed 1.00 / 1.00 for the real skills and n/a / n/a for the anti-patterns. Those numbers proved the matcher behaved well on the labelled corpus, but they did not capture why the real descriptions worked — that knowledge still lived only in our heads and in the prose of the previous steps. The next reader who writes a brand-new skill cannot run the prompt-matrix harness on it, because the harness scores against a fixed labelled corpus that does not yet contain their prompts.

This final step lifts those implicit rules into an executable description checklist — description_checklist.py plus test_description_checklist.py — that grades any skill manifest against six structural criteria: enough action verbs, an explicit trigger phrase, an explicit scope qualifier, enough content keywords, no buzzword vocabulary, and no meta-scope vocabulary. The polished greet and welcome manifests pass clean; the three anti-patterns fail on every rule that matches their pathology. The deliverable is a tool, not just an article.

Setup

One new source file, one new test file, and a small touch-up to both real skill manifests:

codebase/
├── .claude/skills/
│   ├── greet/SKILL.md            # polished — now passes the checklist
│   ├── welcome/SKILL.md          # polished — now passes the checklist
│   ├── broad_handler/SKILL.md    # unchanged — still fails the checklist
│   ├── buzzword_skill/SKILL.md   # unchanged — still fails the checklist
│   └── vague_helper/SKILL.md     # unchanged — still fails the checklist
├── src/
│   ├── skill_loader.py           # unchanged
│   ├── trigger_matcher.py        # unchanged
│   ├── sample_prompts.py         # unchanged
│   ├── prompt_matrix.py          # unchanged
│   └── description_checklist.py  # NEW — six structural rules + report renderer
└── tests/
    ├── test_skill_loader.py        # unchanged
    ├── test_trigger_matcher.py     # unchanged
    ├── test_anti_patterns.py       # unchanged
    ├── test_prompt_matrix.py       # unchanged
    └── test_description_checklist.py  # NEW — 33 tests, grader coverage

No new third-party dependencies. The checklist module imports the existing keywords helper from trigger_matcher so the content-keyword rule uses the exact same tokeniser the matcher itself uses — a description that registers as "dense" to the matcher must also register as "dense" to the grader, by construction.

Implementation

The grader starts with five immutable vocabulary sets and two integer thresholds. These are the rules made data — not buried inside if chains — so a future author can read them as a table and decide which ones their domain needs to override.

ACTION_VERBS: frozenset[str] = frozenset({
    "send", "drop", "reply", "greet", "onboard", "welcome",
    "post", "say", "share", "ping", "compose", "introduce",
    "answer", "salute", "acknowledge",
})

TRIGGER_PHRASES: tuple[str, ...] = (
    "use when", "trigger when", "invoke when",
    "call when", "fire when",
)

SCOPE_QUALIFIERS: tuple[str, ...] = (
    "reserved for", "skip when", "skip if", "do not use",
    "only use when", "limited to", "exclusively for", "exception:",
)

BUZZWORDS: frozenset[str] = frozenset({
    "intelligent", "smart", "advanced", "solution",
    "leveraging", "leverage", "cutting", "edge",
    "capabilities", "deliver", "value", "innovative",
    "synergy", "seamless", "robust", "powerful",
})

META_SCOPE_TERMS: frozenset[str] = frozenset({
    "kind", "situations", "contexts", "things",
    "scenarios", "stuff", "items", "handles",
    "purpose", "general",
})

MIN_ACTION_VERBS: int = 2
MIN_CONTENT_KEYWORDS: int = 5

The split between frozenset and tuple is intentional. Single-word vocabularies live in frozensets because each entry is checked on its own against the description with a word-boundary lookup, so ordering carries no meaning and duplicates collapse automatically. Multi-word phrases like "use when" and "reserved for" live in tuples because they are only ever scanned in sequence for a substring match. Both shapes are immutable, and both fail loudly if a test tries to mutate them.

The two dataclasses are kept deliberately small. The grader emits one ChecklistResult per skill, and that result either passes (no issues) or carries a tuple of human-readable issue strings.

@dataclass(frozen=True)
class ChecklistResult:
    skill_name: str
    passed: bool
    issues: tuple[str, ...]

Freezing the dataclass and typing issues as tuple[str, ...] rather than list[str] means a passing skill always equals every other passing skill of the same name, and tests can assert on issue tuples directly without worrying about identity or order drift between runs. The boolean passed is technically derivable from issues == () — but keeping it as an explicit field lets call sites filter passes versus fails without re-implementing the predicate everywhere.
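The equality and filtering behaviour described here can be checked with a few lines. The sketch redeclares ChecklistResult as listed above; the issue string on the failing result is copied from the trigger-phrase rule's message, and the choice of skills is illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistResult:
    skill_name: str
    passed: bool
    issues: tuple[str, ...]


first = ChecklistResult("greet", True, ())
second = ChecklistResult("greet", True, ())
failing = ChecklistResult(
    "vague_helper",
    False,
    ("needs an explicit trigger phrase (e.g. 'use when ...')",),
)

print(first == second)  # True: compared field by field
print(first is second)  # False: two distinct objects
print(len({first, second, failing}))  # 2: equal results hash alike

# The explicit `passed` field keeps call-site filters short.
failed_names = [r.skill_name for r in (first, failing) if not r.passed]
print(failed_names)  # ['vague_helper']
```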

The public entry point evaluate_description calls six small _check_* helpers in sequence and concatenates their issue lists. This is the only place the rules are composed; everywhere else they live as independent functions.

def evaluate_description(skill: Skill) -> ChecklistResult:
    text = skill.description.lower()
    issues: list[str] = []
    issues.extend(_check_action_verbs(text))
    issues.extend(_check_trigger_phrase(text))
    issues.extend(_check_scope_qualifier(text))
    issues.extend(_check_keyword_density(text))
    issues.extend(_check_buzzwords(text))
    issues.extend(_check_meta_scope_terms(text))
    return ChecklistResult(
        skill_name=skill.name,
        passed=not issues,
        issues=tuple(issues),
    )

Each helper is a flat function with at most one if and no nested conditionals — the codebase's two-levels-of-nesting rule is satisfied trivially. Splitting six rules into six functions also means a future "rule 7" — say, a length cap on the description — drops in as another _check_* line in evaluate_description and another vocabulary constant at the top, without touching any existing helper.

The word-boundary lookup helper is the one shared primitive across rules that need single-word checks.

def _word_present(word: str, text: str) -> bool:
    return re.search(rf"\b{re.escape(word)}\b", text) is not None

def _find_words(words: Iterable[str], text: str) -> set[str]:
    return {w for w in words if _word_present(w, text)}

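re.escape defends against any rule vocabulary that ever contains regex metacharacters: a future entry such as c++ would be treated as literal text rather than as quantifier syntax. The \b anchors carry one caveat for such entries, because \b only matches between a word character and a non-word character, so a term that ends in punctuation would never match when followed by a space or the end of the text and would need a different right-hand anchor. For the purely alphabetic vocabularies used here, the \b anchors prevent the buzzword "smart" from matching the substring inside "smartphone", which would otherwise cause false positives on legitimately specific descriptions.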

The action-verb rule asserts the description names at least MIN_ACTION_VERBS distinct concrete behaviours.

def _check_action_verbs(text: str) -> list[str]:
    found = _find_words(ACTION_VERBS, text)
    if len(found) >= MIN_ACTION_VERBS:
        return []
    return [
        f"needs at least {MIN_ACTION_VERBS} concrete action verbs "
        f"(found {sorted(found)!r})"
    ]

Two is the right floor, not one. A description with a single action verb often degenerates into "X does Y" — the verb hugs the skill name and the rest of the sentence is filler. Two verbs force the author to spell out at least one alternative phrasing a user might reach for, which is the lever that makes the matcher's token overlap actually fire across paraphrases.

The trigger-phrase and scope-qualifier rules are substring tests, not word-boundary tests — phrases are multi-word and naturally bracketed by whitespace already.

def _check_trigger_phrase(text: str) -> list[str]:
    if any(p in text for p in TRIGGER_PHRASES):
        return []
    return ["needs an explicit trigger phrase (e.g. 'use when ...')"]

def _check_scope_qualifier(text: str) -> list[str]:
    if any(q in text for q in SCOPE_QUALIFIERS):
        return []
    return [
        "needs an explicit scope qualifier "
        "(e.g. 'reserved for ...', 'skip if ...')"
    ]

These two rules are what step 4 found by trial-and-error and step 5 validated against anti-patterns: a description that signals when to fire and when not to fire gives the matcher the two boundaries it needs to win against vague competitors. The grader now refuses to let a future author skip either signal accidentally.
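As a quick illustration of the two substring rules, the sketch below runs them against one description that states both boundaries and one that states neither. The phrase tuples are copied from the constants above; the well-formed description is hypothetical and is not the text of the real greet manifest, and has_any is an illustrative helper name.

```python
TRIGGER_PHRASES = (
    "use when", "trigger when", "invoke when",
    "call when", "fire when",
)
SCOPE_QUALIFIERS = (
    "reserved for", "skip when", "skip if", "do not use",
    "only use when", "limited to", "exclusively for", "exception:",
)


def has_any(phrases: tuple[str, ...], description: str) -> bool:
    """Case-insensitive substring test, as the two rules above apply it."""
    text = description.lower()
    return any(p in text for p in phrases)


# Hypothetical description that names when to fire and when to stay quiet.
bounded = (
    "Send a short greeting or reply with a hello. Use when a user says hi "
    "in chat. Skip if the message asks for onboarding steps."
)
# The broad_handler anti-pattern from step 5.
unbounded = "Handles any kind of input across many situations and contexts."

for label, description in (("bounded", bounded), ("unbounded", unbounded)):
    print(
        label,
        has_any(TRIGGER_PHRASES, description),
        has_any(SCOPE_QUALIFIERS, description),
    )
# bounded True True
# unbounded False False
```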

The keyword-density rule reuses the matcher's own tokeniser so the grader and the matcher never disagree about what counts as a "content keyword".

def _check_keyword_density(text: str) -> list[str]:
    count = len(keywords(text))
    if count >= MIN_CONTENT_KEYWORDS:
        return []
    return [
        f"needs at least {MIN_CONTENT_KEYWORDS} content keywords "
        f"(found {count})"
    ]
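
To make the count concrete, here is a sketch of what a content-keyword tokeniser of this kind might look like. The matcher's real keywords() function is defined in an earlier step and is not reproduced here; the stop-word list, the three-character minimum, and the name keywords_sketch are all assumptions made for illustration.

```python
import re

# Assumed values; the matcher's actual stop-words and length cut-off may differ.
STOP_WORDS_SKETCH = frozenset({"a", "an", "and", "for", "the", "to", "with", "when"})
MIN_TOKEN_LENGTH_SKETCH = 3


def keywords_sketch(text: str) -> set[str]:
    """Lowercase, split on non-alphanumerics, drop stop-words and short tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {
        t
        for t in tokens
        if len(t) >= MIN_TOKEN_LENGTH_SKETCH and t not in STOP_WORDS_SKETCH
    }


short = keywords_sketch("Helps with code.")
longer = keywords_sketch(
    "Send a greeting and reply to an existing contact when the user says hello."
)
```

Under these assumptions the three-word description keeps only two content keywords, while the longer one keeps eight.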

Five keywords is the threshold below which a description is genuinely too short to differentiate itself. Because keywords() already strips stop-words and short tokens, five real content words is a meaningful floor rather than a sentence-length one.

The buzzword and meta-scope rules are symmetric: instead of asserting a vocabulary is present, they assert two specific vocabularies are absent.

def _check_buzzwords(text: str) -> list[str]:
    leaked = sorted(_find_words(BUZZWORDS, text))
    if not leaked:
        return []
    return [f"uses buzzword vocabulary: {leaked!r}"]

def _check_meta_scope_terms(text: str) -> list[str]:
    leaked = sorted(_find_words(META_SCOPE_TERMS, text))
    if not leaked:
        return []
    return [f"leans on meta-scope vocabulary: {leaked!r}"]

The two failure messages quote the exact leaked vocabulary so the author can see which word the grader flagged. Sorting the leaked words keeps the message stable across runs — important because tests assert on the exact issue strings.
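
The stability point is easy to check by hand. In this small sketch the helper name and the extra buzzwords "powerful" and "seamless" are hypothetical; only the message format is taken from _check_buzzwords above.

```python
def buzzword_issue_sketch(leaked: set[str]) -> str:
    """Build the issue string from a set of leaked words, sorted for stability."""
    return f"uses buzzword vocabulary: {sorted(leaked)!r}"


# Sets carry no guaranteed iteration order, so the same words can arrive
# in a different sequence from one run to the next.
first = buzzword_issue_sketch({"smart", "powerful", "seamless"})
second = buzzword_issue_sketch({"seamless", "smart", "powerful"})
```

Both calls produce the same string, which is what lets a test compare against an exact message.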

Two fan-out helpers, evaluate_all and format_checklist_report, mirror the shape of the step 6 prompt-matrix module so a downstream caller can drive both graders the same way. The renderer prints one row per skill with status and issues — readable in a terminal, diffable across commits.

def evaluate_all(skills: Iterable[Skill]) -> dict[str, ChecklistResult]:
    return {s.name: evaluate_description(s) for s in skills}

def format_checklist_report(results: dict[str, ChecklistResult]) -> str:
    header = f"{'skill':<20} {'status':<6} issues"
    rows = [header, "-" * 72]
    for name in sorted(results):
        rows.append(_format_row(results[name]))
    return "\n".join(rows)
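
The renderer leans on a _format_row helper and a ChecklistResult type that are not reproduced in this section. A minimal sketch of how they could fit the header's column widths follows; the field names, the PASS/FAIL labels, and the "-" placeholder for an empty issue list are assumptions, not the repository's definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistResultSketch:
    """Hypothetical result shape: a skill name plus a tuple of issue strings."""

    name: str
    issues: tuple[str, ...] = ()

    @property
    def passed(self) -> bool:
        return not self.issues


def format_row_sketch(result: ChecklistResultSketch) -> str:
    """Render one row using the same widths as the header (20 and 6)."""
    status = "PASS" if result.passed else "FAIL"
    issues = "; ".join(result.issues) or "-"
    return f"{result.name:<20} {status:<6} {issues}"


rows = [
    format_row_sketch(ChecklistResultSketch("greet")),
    format_row_sketch(
        ChecklistResultSketch("helper", ("needs an explicit trigger phrase",))
    ),
]
```

Joining several issues on one line keeps each skill to a single row, which is what makes the report easy to diff between commits.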

The polish pass on the two real manifests is short and targeted. The greet description already carried "send", "drop", "reply", "greet" as action verbs from step 4, plus "use when" as a trigger phrase. We added the explicit scope qualifier "reserved for acknowledging existing contacts" so the scope-qualifier rule passes without changing any matcher behaviour. The welcome description gained the symmetric scope qualifier "skip if the addressee already belongs to the company" — fresh vocabulary on purpose, so the step 4 collision tokens stay stripped.

The test file tests/test_description_checklist.py covers four layers. The first asserts the polished greet and welcome manifests pass with empty issue tuples. The second asserts every anti-pattern fails — and uses parametrised tests over ANTI_PATTERN_SKILL_NAMES to assert which specific rules each anti-pattern fails: action verbs, trigger phrase, scope qualifier, plus the buzzword or meta-scope vocabulary that matches its pathology. The third layer covers the evaluate_all and format_checklist_report glue. The fourth layer drops synthetic _fake_skill fixtures into the grader to prove each rule fires independently on a minimal description — so a future regression in one rule cannot hide behind another rule that happens to also flag the same manifest.

Verification

python3 -m pytest

........................................................................ [ 61%]
.............................................                            [100%]
117 passed in 0.28s

One hundred seventeen green tests, up from eighty-four at the end of step 6. The thirty-three-test delta is exactly test_description_checklist.py: two passing-real-skill tests, three parametrised-fail tests across the three anti-patterns, three vocabulary-leak tests, three action-verb-gap tests, three trigger-phrase-gap tests, three scope-qualifier-gap tests, one evaluate_all shape test, one report-render test, one full polished-set partition test, six vocabulary-constant tests, two passing-result-shape tests, one failing-result-uniqueness test, and four synthetic-description tests covering minimal-pass plus three single-rule failures. No earlier test file required edits — the polish on the two manifests preserved every existing matcher and prompt-matrix invariant.

What we built

We turned the implicit "how to write a good Claude Code skill description" knowledge from steps 1-6 into an executable grader that any author can run against any manifest, without needing to ship a labelled prompt corpus alongside it. The six rules — action verbs, trigger phrase, scope qualifier, keyword density, no buzzwords, no meta-scope — are the structural fingerprint of a description the auto-trigger matcher can lean on.

The polished greet and welcome manifests are the proof that the checklist is satisfiable in practice. Both pass with zero issues, and the prompt-matrix scoreboard from step 6 still reports 1.00 / 1.00 on the original eleven-case corpus — the polish strengthened the descriptions structurally without disturbing any matcher behaviour. The three anti-patterns continue to fail in the matcher and now fail in the grader, with specific, named issues per skill.

The grader is small enough to read in one sitting and tabular enough to extend. Adding a seventh rule means adding a _check_* helper and (optionally) another vocabulary constant at the top of the file. Adapting the rules for a different domain — say, a research assistant skill where "summarise" and "compare" are the natural action verbs — is one edit to the ACTION_VERBS frozenset, not a refactor of the grader.
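
The domain swap can be sketched as follows. The research vocabulary and the _sketch names are hypothetical, and the helper takes the vocabulary as a parameter purely so the example is self-contained; in the grader itself the change would be an edit to the ACTION_VERBS constant.

```python
import re

MIN_ACTION_VERBS = 2  # the floor argued for earlier in this step

# Hypothetical vocabulary for a research-assistant skill.
RESEARCH_ACTION_VERBS = frozenset({"summarise", "compare", "cite", "extract"})


def find_words_sketch(words: frozenset[str], text: str) -> set[str]:
    """Whole-word, literal lookup in the style of _find_words."""
    return {w for w in words if re.search(rf"\b{re.escape(w)}\b", text)}


def check_action_verbs_sketch(
    text: str, verbs: frozenset[str] = RESEARCH_ACTION_VERBS
) -> list[str]:
    """Same rule as _check_action_verbs, with a swappable vocabulary."""
    found = find_words_sketch(verbs, text)
    if len(found) >= MIN_ACTION_VERBS:
        return []
    return [
        f"needs at least {MIN_ACTION_VERBS} concrete action verbs "
        f"(found {sorted(found)!r})"
    ]


ok = check_action_verbs_sketch("summarise and compare papers on a topic")
gap = check_action_verbs_sketch("summarise papers")
```

The first description passes with an empty list; the second returns a single issue naming 'summarise' as the only verb found.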

What we deliberately did not build: a CLI wrapper. The grader is a library function, callable from a pre-commit hook, a CI step, an editor lint, or a one-off REPL session. A reader can wire it into whichever workflow makes sense for their plugin — that integration is the obvious follow-up, but the rules themselves are the deliverable.

Repository

The state of the code after this step: 585ab50

Key commits to step through:

273008d — step 1: scaffold the skill manifest lab
73b92d4 — step 2: token-overlap matcher and baseline corpus
2574f13 — step 3: overlapping welcome skill surfaces trigger collisions
9741051 — step 4: action verbs, explicit phrases, and scope qualifiers break the collision
2e1e38f — step 5: vague, broad, and buzzword anti-patterns prove matcher precision
c0bdd84 — step 6: prompt-matrix harness scores per-skill precision and recall
585ab50 — step 7: reusable description checklist grades the polished skill set

Full source at https://github.com/vytharion/claude-code-skill-description-auto-trigger-matching.

Walk the lessons by stepping through the git commits in the repo — each major step has its own commit you can git checkout and rerun.