claudeplugins.
claudeplugins38 min read

Claude Code Subagent — When to Delegate vs Keep in Main Context

Build a 5-step project exploring when Claude Code subagent delegation pays off: monolithic agent → context pressure → delegation threshold → isolation patterns → production guardrails.

What we're doing this step

Before we can talk meaningfully about "when to delegate work to a Claude Code subagent," we need a baseline that has nothing to delegate — a single agent that handles every step of a task inside one shared transcript. That is what real production agents look like before they grow up: one loop, one model, one growing context window. In this step we build that monolithic agent in plain Python. The agent runs a tool-use loop against a deterministic stub model so the behavior is testable without burning real Claude tokens. The point is not to imitate the SDK byte-for-byte; the point is to expose the thing that will eventually hurt — every tool call's full result is appended to the main transcript, and that transcript is what the model re-reads on every turn. That growth is exactly the cost we will measure, then mitigate via subagent context isolation in later steps.

Setup

Create a small Python package under codebase/. The repository for this article ships the following layout once step 1 lands:

codebase/
├── pyproject.toml
├── monolithic_agent/
│   ├── __init__.py
│   ├── agent.py
│   └── model.py
└── tests/
    └── test_monolithic_agent.py

We deliberately avoid pulling in the real claude-agent-sdk for this baseline. The mechanics we care about — a turn loop, tool dispatch, and a single transcript that everything lands in — are stack-agnostic. A stubbed model lets us assert on transcript shape without flaky network calls. The project metadata is intentionally small:

[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "delegate-demo"
version = "0.1.0"
description = "Companion code for the subagent context-isolation tutorial"
requires-python = ">=3.9"

[tool.setuptools.packages.find]
where = ["."]
include = ["monolithic_agent*"]

[tool.pytest.ini_options]
testpaths = ["tests"]
addopts = "-ra -q"

No third-party runtime dependencies. Pytest is the only thing we need, and it usually comes pre-installed in modern dev environments.

Implementation

Two modules carry the weight: model.py describes the moving parts of a turn (text, tool calls, stop signal) and agent.py runs the loop.

model.py defines the data classes the agent passes around plus a StubModel that hands back pre-baked turns so tests are deterministic:

@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class ModelTurn:
    text: str = ""
    tool_calls: List[ToolCall] = field(default_factory=list)
    stop: bool = False


@dataclass
class StubModel:
    """A scripted model used in tests to mimic a real LLM."""

    script: List[ModelTurn]
    seen_transcripts: List[List[dict]] = field(default_factory=list)
    _cursor: int = 0

    def next_turn(self, transcript: List[dict]) -> ModelTurn:
        snapshot = [dict(entry) for entry in transcript]
        self.seen_transcripts.append(snapshot)
        turn = self.script[self._cursor]
        self._cursor += 1
        return turn

The crucial detail is seen_transcripts.append(snapshot). Every time the stub is asked for the next turn, it records exactly what the model would have re-read this turn. That is the artifact we will measure in later steps to prove that a monolithic agent re-pays for prior tool results on every subsequent turn — and that a delegating architecture does not.

agent.py wires the loop together. The agent calls the model, records the assistant text and tool calls, dispatches any tool calls, and repeats until the model returns stop=True or max_turns is exceeded:

def run(self, task: str) -> AgentResult:
    transcript: List[dict] = [{"role": "user", "content": task}]
    final_text = ""
    for turn_index in range(self.max_turns):
        turn = self.model.next_turn(transcript)
        self._record_assistant(transcript, turn)
        if turn.stop:
            final_text = turn.text
            break
        self._dispatch_tools(transcript, turn)
    else:
        raise RuntimeError(
            f"MonolithicAgent.run exceeded max_turns={self.max_turns}"
        )
    return AgentResult(
        final_text=final_text,
        transcript=transcript,
        turns=self.model.calls,
    )

A few intentional design choices, each tied to what we will measure later:

  • One transcript. Tool calls and tool results are appended to the same transcript list the model reads on the next turn. There is no scratchpad, no side conversation, no summarization step. This is the worst case on purpose.
  • context_tokens as a character proxy. Real tokenizers vary, but for the trend we only need a monotonic signal that grows with transcript size. AgentResult.context_tokens sums the character length of every entry. It is honest about being a proxy; it is not honest if you compare it to OpenAI's tiktoken output. We will not.
  • max_turns guard. Any tool-using loop without an upper bound is a memory leak waiting for a budget alert. Three lines of code save real money in real systems.
  • Helper splits to respect nesting limits. run calls _record_assistant, _dispatch_tools, and _invoke_tool, none of which exceed the project's "max two levels of nested if" rule. The architecture pressure here is real: a monolithic agent's loop body tends to balloon, and breaking it up early is the simplest way to keep refactors painless when step 4 splits the work into a subagent.

Tests live in tests/test_monolithic_agent.py and assert four behaviors: final synthesis is returned, the tool result lands in the main transcript, transcript size grows with tool-result size, unknown tools degrade to an error string instead of crashing, and max_turns triggers a clear RuntimeError.

Test it

python3 -m pytest -v
============================= test session starts ==============================
platform darwin -- Python 3.9.6, pytest-8.4.2, pluggy-1.6.0
rootdir: /codebase
configfile: pyproject.toml
testpaths: tests
collected 5 items

tests/test_monolithic_agent.py .....                                     [100%]

============================== 5 passed in 0.01s ===============================

All five tests pass. The key assertion to internalize is in test_context_grows_with_tool_results: a 5,000-character tool result causes result.context_tokens to grow by at least 5,000. This is the behavior we will measure under realistic load in step 2 and refactor away in step 4.

What we got

A small, fully tested baseline agent that captures the essence of Claude Code's monolithic loop: research and synthesis share one transcript, every tool result is paid for forever, and the transcript itself is observable from the outside so we can measure context growth. From here we have everything we need to bloat the context on purpose (step 2), measure it (step 3), and discover the delegation boundary by splitting work into an isolated subagent (step 4).

Repository

The companion code for this article: https://github.com/vytharion/claude-code-subagent-context-isolation-when-to-delegate

The state of the code after this step: 666e4cc

Key commits to step through:

  • 666e4cc — step 1: minimal monolithic agent + transcript-growth tests

What we're doing this step

Step 1 left us with a monolithic agent that can read one file via a read_file tool and synthesise a one-line answer. That is too small a workload to expose the failure mode we care about. Real Claude Code sessions get expensive when the model is asked to scan a repository — open a handful of source files, read each in full, then write a summary that takes every one of those bodies into account. In that flow, each file body is appended to the same transcript and re-read by the model on every subsequent turn. The cost is not just "tool result + final answer"; it is "tool result × number of remaining turns," and that product is exactly what subagent delegation lets us cut in step 4. So in this step we build a realistic-looking multi-file repo, script a scan-and-summarise loop over it, and add tests that bake the growth pattern into assertions. We do not yet measure the bloat with charts or tokenisers — that is step 3. The deliverable for step 2 is just an honest reproduction of the load pattern.

Setup

We do not need any new dependencies. Everything lives inside the existing monolithic_agent package next to the agent from step 1. After this step, the package looks like this:

codebase/
├── pyproject.toml
├── monolithic_agent/
│   ├── __init__.py
│   ├── agent.py
│   ├── model.py
│   └── repo_scan.py            ← new in step 2
└── tests/
    ├── test_monolithic_agent.py
    └── test_repo_scan.py       ← new in step 2

The new repo_scan.py module owns three things: an in-memory "demo repo" (a dict[str, str] of file paths to bodies), a helper that scripts a StubModel to walk that repo file-by-file, and two tiny cost-measurement helpers that we will lean on in step 3. The new test file exercises each behavior end-to-end so the article's claims about transcript shape are mechanically enforced.

Exporting the new helpers through monolithic_agent/__init__.py keeps callers from having to know which submodule a symbol lives in:

from monolithic_agent.agent import MonolithicAgent, AgentResult
from monolithic_agent.model import StubModel, ModelTurn, ToolCall
from monolithic_agent.repo_scan import (
    build_demo_repo,
    build_scan_script,
    cumulative_context_chars,
    transcript_size_per_turn,
)

Implementation

build_demo_repo() returns a hard-coded dict[str, str] describing a five-file Python project: a README.md, a pyproject.toml, three source modules (loader.py, parser.py, pipeline.py), and a small test_pipeline.py. The bodies are real-looking Python, not random filler — but they are deliberately chunky so the cost of carrying them through the loop is visible without needing thousand-line fixtures:

def build_demo_repo() -> Dict[str, str]:
    return {
        "README.md": (
            "# delegate-demo\n\n"
            "A toy project used by a tutorial on Claude Code subagent context\n"
            "isolation. ...\n"
        ),
        "pyproject.toml": (
            "[build-system]\n"
            'requires = ["setuptools>=61"]\n'
            "..."
        ),
        "src/loader.py": "...",
        "src/parser.py": "...",
        "src/pipeline.py": "...",
        "tests/test_pipeline.py": "...",
    }

Why an in-memory dict rather than tmp_path fixtures on disk? Two reasons. First, the article is about transcript economics, not filesystem mechanics — pushing bytes through a real open() would add noise without changing the conclusion. Second, the existing make_read_tool helper from step 1 already accepts an arbitrary file map, so swapping the map is the only thing we have to do to drive a "realistic" scan.

The scripted scan loop is the second piece of repo_scan.py. build_scan_script(paths, final_text) returns a StubModel that issues one read_file tool call per path, accompanied by a short narrative line, and finally an assistant turn with stop=True carrying the summary:

def build_scan_script(
    paths: Sequence[str],
    final_text: str = "Summary complete.",
) -> StubModel:
    turns: List[ModelTurn] = []
    for path in paths:
        turns.append(
            ModelTurn(
                text=f"Reading {path} to feed the summary.",
                tool_calls=[ToolCall(name="read_file", arguments={"path": path})],
            )
        )
    turns.append(ModelTurn(text=final_text, stop=True))
    return StubModel(script=turns)

This shape matters because it mirrors how a real Claude Code agent actually behaves on a "read this repo" prompt: a turn that says what it is about to do, followed by the tool call, followed eventually by a synthesis turn. The scripted version is deterministic, but every snapshot the stub records inside seen_transcripts is exactly the context the real model would have ingested on that turn.

The third piece is the pair of cost helpers, which step 3 will plot but step 2 already exercises through tests:

def cumulative_context_chars(seen_transcripts: List[List[dict]]) -> int:
    total = 0
    for snapshot in seen_transcripts:
        for entry in snapshot:
            total += len(entry.get("content", ""))
    return total


def transcript_size_per_turn(seen_transcripts: List[List[dict]]) -> List[int]:
    return [
        sum(len(entry.get("content", "")) for entry in snapshot)
        for snapshot in seen_transcripts
    ]

cumulative_context_chars is the headline number we care about. On a scan of N files of average size S, the final synthesis turn has to carry roughly N * S characters in the transcript, but the total work the model did over the loop is closer to N * (N+1) / 2 * S because each file body was re-read on every later turn. That super-linear growth is the cost subagent delegation eliminates — step 4 will isolate the scan into a subagent whose transcript never makes it back into the parent context.

tests/test_repo_scan.py pins down seven properties of the new setup. The most important three are:

  • Every file body lands in the shared transcript. test_every_file_lands_in_the_main_transcript walks result.transcript and asserts each tool_result entry carries the body for one of the demo files. This is the "one transcript" invariant from step 1 applied at realistic scale.
  • The final turn re-reads everything. test_final_synthesis_turn_re_reads_every_prior_tool_result reaches into model.seen_transcripts[-1] (the snapshot the synthesis turn was given) and asserts every file body is in there. That is the receipt for the super-linear cost.
  • Cumulative cost grows super-linearly. test_cumulative_read_cost_grows_super_linearly asserts cumulative_context_chars(...) > 2 * repo_chars. A single re-read would already be 2 * repo_chars; we cross that threshold easily with five files plus a synthesis turn.

Two further tests check less dramatic but still useful invariants: per-turn transcript size is monotonic non-decreasing, and the turn count matches files + 1. Both will make regressions during the step-4 refactor loud rather than silent.

Test it

python3 -m pytest tests/test_repo_scan.py -v
============================= test session starts ==============================
platform darwin -- Python 3.9.6, pytest-8.4.2, pluggy-1.6.0
rootdir: /codebase
configfile: pyproject.toml
collected 7 items

tests/test_repo_scan.py::test_demo_repo_has_enough_files_to_be_realistic PASSED [ 14%]
tests/test_repo_scan.py::test_every_file_lands_in_the_main_transcript PASSED [ 28%]
tests/test_repo_scan.py::test_context_tokens_scales_with_repo_size PASSED [ 42%]
tests/test_repo_scan.py::test_final_synthesis_turn_re_reads_every_prior_tool_result PASSED [ 57%]
tests/test_repo_scan.py::test_cumulative_read_cost_grows_super_linearly PASSED [ 71%]
tests/test_repo_scan.py::test_transcript_size_per_turn_is_monotonic_non_decreasing PASSED [ 85%]
tests/test_repo_scan.py::test_turn_count_matches_files_plus_one_synthesis PASSED [100%]

============================== 7 passed in 0.01s ===============================

All seven tests pass. The two assertions worth re-reading slowly are the cumulative-cost test (proves the loop carries the same bytes through the model many times) and the final-turn re-read test (proves the synthesis turn cannot avoid that cost). They are mechanically what makes the rest of the article true.

What we got

A faithful reproduction of the "scan a small repo and summarise it" workload that bloats a Claude Code agent's main context in practice. Every file body in the demo repo lands in the shared transcript, every later turn re-reads every prior tool result, and the cumulative cost is already strictly greater than two full repo passes by the time synthesis runs. We also exposed two tiny helpers — cumulative_context_chars and transcript_size_per_turn — that step 3 will use to put numbers on this growth without changing the agent. With this baseline locked behind tests, step 3 can focus purely on measurement and step 4 on the delegation refactor that makes the measurement curve flatten.

Repository

The companion code for this article: https://github.com/vytharion/claude-code-subagent-context-isolation-when-to-delegate

The state of the code after this step: 83ca35f

Key commits to step through:

  • 666e4cc — step 1: minimal monolithic agent + transcript-growth tests
  • 83ca35f — step 2: realistic multi-file repo-scan workload + cost helpers

What we're doing this step

Step 2 left us with a monolithic agent that scans a small repo and a stub model that records every transcript snapshot the model would have ingested. That instrumentation is rich, but it is not yet decision-useful. Looking at a list of growing snapshot lengths does not tell you whether the loop is wasteful, by how much, or which specific tool call you should hoist into a subagent first. Step 3 is the measurement step: we reduce those snapshots to four headline numbers (peak context, cumulative re-read cost, average re-read multiplier, turn count), rank every tool result by the bytes a subagent refactor would erase, and then walk that ranked list to find the smallest set of delegations that hit a savings target. The output is the article's core claim turned into code — a function the operator can call to ask "given this trace, where is the delegation boundary?" — and a test suite that pins down the answer for the demo repo. We are not refactoring the agent yet; step 4 owns that. Step 3's deliverable is purely diagnostic.

Setup

No new runtime dependencies. We add one production module and one test file to the existing layout:

codebase/
├── pyproject.toml
├── monolithic_agent/
│   ├── __init__.py
│   ├── agent.py
│   ├── model.py
│   ├── repo_scan.py
│   └── context_measurement.py        ← new in step 3
└── tests/
    ├── test_monolithic_agent.py
    ├── test_repo_scan.py
    └── test_context_measurement.py   ← new in step 3

context_measurement.py reuses the two helpers exported by repo_scan.py (cumulative_context_chars and transcript_size_per_turn) so the new module is a thin analysis layer, not a parallel implementation. Re-exporting the new symbols through monolithic_agent/__init__.py keeps the surface area discoverable:

from monolithic_agent.context_measurement import (
    ContextMeasurement,
    DelegationBoundary,
    DelegationCandidate,
    identify_delegation_boundary,
    measure_context,
    rank_delegation_candidates,
)

The three dataclasses are deliberately small. ContextMeasurement is the headline reading, DelegationCandidate is one ranked tool result, and DelegationBoundary packages the smallest top-K of candidates that reaches a savings target. Keeping them as plain @dataclass records makes them trivial to assert on in tests and trivial to serialise later if we want a JSON report.

Implementation

Three public functions and three dataclasses do the work. We start with the headline reducer, measure_context:

def measure_context(seen_transcripts: List[List[dict]]) -> ContextMeasurement:
    if not seen_transcripts:
        return ContextMeasurement(
            peak_chars=0,
            cumulative_chars=0,
            reread_multiplier=0.0,
            turn_count=0,
        )
    sizes = transcript_size_per_turn(seen_transcripts)
    cumulative = cumulative_context_chars(seen_transcripts)
    peak = max(sizes)
    multiplier = cumulative / peak if peak else 0.0
    return ContextMeasurement(
        peak_chars=peak,
        cumulative_chars=cumulative,
        reread_multiplier=multiplier,
        turn_count=len(seen_transcripts),
    )

The reread_multiplier is the most useful single number for the article's argument: it answers "how many full transcripts did we effectively pay for, on average?" A monolithic loop with no delegation will always have reread_multiplier ≥ 1.0; the bigger it gets, the more re-work the loop is doing. We compute it as cumulative / peak rather than as a count of turns because what we care about is the byte cost, not the turn count — a synthesis turn that re-reads ten file bodies should weigh ten times as much as a single tool call that returned an empty result.

Next, rank_delegation_candidates walks the final transcript, picks out every tool_result entry, and computes the bytes that would disappear if that entry had lived in a subagent instead of the main loop:

def rank_delegation_candidates(
    seen_transcripts: List[List[dict]],
) -> List[DelegationCandidate]:
    if not seen_transcripts:
        return []
    final_transcript = max(seen_transcripts, key=len)
    candidates: List[DelegationCandidate] = []
    for index, entry in enumerate(final_transcript):
        if entry.get("role") != "tool_result":
            continue
        candidates.append(_candidate_for(seen_transcripts, index, entry))
    candidates.sort(key=lambda c: c.waste_chars, reverse=True)
    return candidates

The cost model is intentionally simple. The agent only ever appends to the transcript, so an entry at index p of the final transcript appears in every snapshot whose length exceeds p. We use that index-arithmetic to count appearances rather than comparing payload content (which would mis-count if two tool results happened to return the same body). later_rereads = appearances - 1 because the first appearance is the work that has to happen somewhere — even a subagent has to read the file once. The bytes that disappear are exactly the re-reads: waste_chars = payload_chars * later_rereads.

Why rank by waste_chars rather than by payload_chars? Because the cost we want to cut is super-linear in the loop. A small payload that gets re-read ten times is more expensive than a fat payload that gets re-read once. Ranking by raw payload size would have us delegate the wrong tool calls and miss most of the savings.

The third function, identify_delegation_boundary, walks the ranked list and stops as soon as the cumulative savings meets the requested target:

def identify_delegation_boundary(
    seen_transcripts: List[List[dict]],
    target_savings_ratio: float = 0.5,
) -> DelegationBoundary:
    if not 0.0 <= target_savings_ratio <= 1.0:
        raise ValueError(
            f"target_savings_ratio must be in [0.0, 1.0], got {target_savings_ratio}"
        )
    ranked = rank_delegation_candidates(seen_transcripts)
    measurement = measure_context(seen_transcripts)
    target_chars = target_savings_ratio * measurement.cumulative_chars
    chosen: List[DelegationCandidate] = []
    saved = 0
    for candidate in ranked:
        if saved >= target_chars:
            break
        chosen.append(candidate)
        saved += candidate.waste_chars
    achieved_ratio = (
        saved / measurement.cumulative_chars if measurement.cumulative_chars else 0.0
    )
    return DelegationBoundary(
        candidates=chosen,
        target_savings_ratio=target_savings_ratio,
        achieved_savings_chars=saved,
        achieved_savings_ratio=achieved_ratio,
    )

A few choices worth flagging:

  • Greedy is optimal here. Because the cost model is linear in waste_chars and candidates are independent, taking the largest waste first always minimises the count of delegations needed to clear the target. No knapsack search.
  • Stop the first time we cross the line. We check saved >= target_chars at the top of the loop so the returned list is the smallest prefix that reaches the target, not the smallest prefix that exceeds it by the largest margin. The test test_identify_delegation_boundary_returns_smallest_sufficient_prefix pins this: dropping the last candidate must put us back under the target.
  • achieved_savings_ratio is honest about shortfalls. If even delegating every tool result fails to clear the target (e.g. a target of 1.0 against a loop that already runs lean), the boundary returns the full list and achieved_savings_ratio reports how close we got rather than silently lying about meeting the goal.
  • Range-checked input. target_savings_ratio outside [0.0, 1.0] is almost always a typo; raising early beats producing a nonsense boundary that the operator then has to debug.

Two small private helpers (_candidate_for, _appearance_count, _short_label) keep the public functions inside the project's "max two levels of nested if" rule. _short_label truncates the first line of a payload to forty characters so a DelegationCandidate is identifiable in logs without dumping a full file body into the operator's terminal.

The test file covers thirteen properties of the new module. The ones worth re-reading slowly:

  • test_reread_multiplier_grows_with_scan_size — a one-file scan has a smaller reread_multiplier than a five-file scan, which is the empirical version of the article's central claim.
  • test_earliest_tool_result_has_the_most_rereads — the file the agent read first is the one that gets re-read on every subsequent turn, so it should always lead the waste ranking when payloads are similar in size. This is the geometric intuition behind delegation: the earlier the tool call, the bigger the win from isolating it.
  • test_candidate_waste_equals_payload_times_later_rereads — the cost formula is mechanically enforced; if someone refactors _candidate_for and breaks the multiplication, the test fails loudly.
  • test_identify_delegation_boundary_meets_the_savings_target plus test_identify_delegation_boundary_returns_smallest_sufficient_prefix — the boundary actually reaches the target and does not overshoot by more than one candidate's worth of savings.
  • test_measurement_uses_step2_helpers_consistently — re-implementing peak_chars and cumulative_chars by hand using the step-2 helpers must match the new module's output. This is the seam between step 2 and step 3: if the analysis layer ever drifts from the raw helpers, the test goes red.

Test it

python3 -m pytest tests/test_context_measurement.py -v
============================= test session starts ==============================
platform darwin -- Python 3.9.6, pytest-8.4.2, pluggy-1.6.0
rootdir: /codebase
configfile: pyproject.toml
collected 13 items

tests/test_context_measurement.py::test_measure_context_returns_zeroed_struct_when_no_turns_recorded PASSED [  7%]
tests/test_context_measurement.py::test_measure_context_reports_peak_cumulative_multiplier_and_turn_count PASSED [ 15%]
tests/test_context_measurement.py::test_reread_multiplier_grows_with_scan_size PASSED [ 23%]
tests/test_context_measurement.py::test_rank_delegation_candidates_lists_one_per_tool_result PASSED [ 30%]
tests/test_context_measurement.py::test_rank_delegation_candidates_sorted_by_waste_descending PASSED [ 38%]
tests/test_context_measurement.py::test_candidate_waste_equals_payload_times_later_rereads PASSED [ 46%]
tests/test_context_measurement.py::test_earliest_tool_result_has_the_most_rereads PASSED [ 53%]
tests/test_context_measurement.py::test_top_candidate_label_matches_first_line_of_payload PASSED [ 61%]
tests/test_context_measurement.py::test_identify_delegation_boundary_meets_the_savings_target PASSED [ 69%]
tests/test_context_measurement.py::test_identify_delegation_boundary_returns_smallest_sufficient_prefix PASSED [ 76%]
tests/test_context_measurement.py::test_identify_delegation_boundary_zero_target_returns_no_candidates PASSED [ 84%]
tests/test_context_measurement.py::test_identify_delegation_boundary_rejects_out_of_range_ratio PASSED [ 92%]
tests/test_context_measurement.py::test_measurement_uses_step2_helpers_consistently PASSED [100%]

============================== 13 passed in 0.01s ==============================

All thirteen tests pass. The two assertions that earn their keep loudest are test_reread_multiplier_grows_with_scan_size (the central claim is now mechanical) and test_earliest_tool_result_has_the_most_rereads (the geometry of the delegation boundary is now mechanical, not hand-wavy). Running the full suite alongside step 1 and step 2 confirms the new module has not broken any prior invariant:

python3 -m pytest -v
============================= test session starts ==============================
platform darwin -- Python 3.9.6, pytest-8.4.2, pluggy-1.6.0
rootdir: /codebase
configfile: pyproject.toml
testpaths: tests
collected 25 items

tests/test_context_measurement.py .............                          [ 52%]
tests/test_monolithic_agent.py .....                                     [ 72%]
tests/test_repo_scan.py .......                                          [100%]

============================== 25 passed in 0.02s ==============================

What we got

A diagnostic layer that turns the raw per-turn snapshots from step 2 into the four numbers a maintainer actually needs (peak, cumulative, re-read multiplier, turn count), a ranked list of tool results scored by the bytes a subagent refactor would erase, and a boundary picker that returns the smallest set of delegations needed to hit a savings target. The earliest tool calls dominate the ranking, the boundary is the smallest sufficient prefix, and every claim is locked behind a test. That is the data we need to argue, in step 4, that pulling the repo-scan loop into an isolated subagent is the right delegation cut — not a guess about what feels expensive, but a specific list of tool results with specific dollar-equivalent byte savings.

Repository

The companion code for this article: https://github.com/vytharion/claude-code-subagent-context-isolation-when-to-delegate

The state of the code after this step: 3f72a41

Key commits to step through:

  • 666e4cc — step 1: minimal monolithic agent + transcript-growth tests
  • 83ca35f — step 2: realistic multi-file repo-scan workload + cost helpers
  • 3f72a41 — step 3: context measurement + ranked delegation-boundary picker

What we're doing this step

Step 3 left us with a number: the earliest tool results in the monolithic loop carried the biggest re-read tax, and a small handful of read_file calls accounted for most of the wasted bytes. Step 4 is where we act on that. We build a Subagent abstraction that mirrors how Claude Code's Task tool spawns a child — its own conversation, its own tools, its own transcript — and a parent-facing task tool that lets the main agent delegate work by name. We wire the canonical repo_scanner subagent through that boundary, run the same "summarise this repo" task we used in step 2, and check, mechanically, that the file bodies never touch the parent transcript while the parent still gets a usable summary back. The deliverable is a run_with_delegation entry point plus a DelegatingAgentResult that separates parent context from isolated context so the next step can put a multiplier on the win. This step is the architectural change the article has been building toward: not a measurement, not a heuristic, the actual refactor.

Setup

No new runtime dependencies. We add two production modules and one test file:

codebase/
├── pyproject.toml
├── monolithic_agent/
│   ├── __init__.py
│   ├── agent.py
│   ├── model.py
│   ├── repo_scan.py
│   ├── context_measurement.py
│   ├── subagent.py                  ← new in step 4
│   └── delegating_agent.py          ← new in step 4
└── tests/
    ├── test_monolithic_agent.py
    ├── test_repo_scan.py
    ├── test_context_measurement.py
    └── test_subagent.py             ← new in step 4

subagent.py defines the child-side primitives (Subagent, SubagentResult, and the make_task_tool factory that mints a parent-facing tool). delegating_agent.py is the wiring module — it owns run_with_delegation, the DelegatingAgentResult shape, and the canonical build_repo_scanner_subagent_factory used by both the tests and the next step's comparison. We re-export everything through monolithic_agent/__init__.py so the test file can pull the whole surface from one import path:

from monolithic_agent import (
    DelegatingAgentResult,
    Subagent,
    SubagentResult,
    build_repo_scanner_subagent_factory,
    make_task_tool,
    run_with_delegation,
)

The split between subagent.py and delegating_agent.py is deliberate. The first file knows nothing about repos, summaries, or the demo task — it just describes what a delegated child is and how to expose one to a parent agent. The second file is the application-specific glue: which subagent the parent has access to, how its tools are wired, what its final report looks like. Keeping that seam clean means the Subagent class can be reused for any other delegation in the codebase (a "code-review" subagent, a "doc-writer" subagent) without dragging the repo-scanning specifics along.

Implementation

The child side is small. A SubagentResult is the trace a delegated subagent returns to its caller, and a Subagent is just a MonolithicAgent with a name and call-scoped state:

@dataclass
class SubagentResult:
    name: str
    final_text: str
    transcript: List[dict]
    turns: int


@dataclass
class Subagent:
    name: str
    model: StubModel
    tools: Dict[str, ToolFn] = field(default_factory=dict)
    max_turns: int = 20

    def run(self, prompt: str) -> SubagentResult:
        inner = MonolithicAgent(
            model=self.model,
            tools=self.tools,
            max_turns=self.max_turns,
        )
        agent_result = inner.run(prompt)
        return SubagentResult(
            name=self.name,
            final_text=agent_result.final_text,
            transcript=agent_result.transcript,
            turns=agent_result.turns,
        )

The crucial design choice lives in SubagentResult: the parent will only ever see final_text. The full transcript is kept on the result so tests and analysis tooling can audit what was hidden, but that data never flows back through the parent's context. That asymmetry — child-knows-everything, parent-knows-the-summary — is the entire point of the delegation boundary, and packaging it as a dataclass is what lets step 5 prove the asymmetry held with a single assertion.

The parent-facing task tool is a closure over a factory registry plus a write-only log:

def make_task_tool(
    factories: Dict[str, SubagentFactory],
    log: List[SubagentResult],
) -> ToolFn:
    def _task(arguments: dict) -> str:
        name = arguments.get("subagent", "")
        prompt = arguments.get("prompt", "")
        factory = factories.get(name)
        if factory is None:
            return f"ERROR: unknown subagent {name}"
        subagent = factory()
        result = subagent.run(prompt)
        log.append(result)
        return result.final_text

    return _task

Three properties worth flagging:

  • The factory is called per invocation, not once at registration. Each call returns a brand-new Subagent with its own StubModel, so cursor state never leaks between subagent runs. That mirrors how Claude Code's Task tool starts a fresh subagent for every call rather than reusing a single long-lived one.
  • The log lives outside the closure's return path. The tool's contract with the parent is a single string; everything else (turn count, full child transcript, name) goes to the log via side effect. That keeps the parent transcript a pure projection of the boundary and lets the wiring layer collect audit data without leaking it into the model's context.
  • Unknown names degrade to an error string. Matching make_read_tool's contract: tools don't raise into the agent loop. A typo'd subagent arg should look like a tool failure the model can react to, not a crash the operator has to chase.

The wiring layer is delegating_agent.py. run_with_delegation is a thin shim that turns a parent model + a subagent registry into a DelegatingAgentResult:

def run_with_delegation(
    parent_model: StubModel,
    parent_tools: Dict[str, ToolFn],
    subagent_factories: Dict[str, SubagentFactory],
    task: str,
    max_turns: int = 20,
) -> DelegatingAgentResult:
    log: List[SubagentResult] = []
    tools = dict(parent_tools)
    tools["task"] = make_task_tool(subagent_factories, log)
    agent = MonolithicAgent(model=parent_model, tools=tools, max_turns=max_turns)
    parent_result = agent.run(task)
    return DelegatingAgentResult(
        parent_result=parent_result,
        subagent_results=log,
    )

The parent stays a plain MonolithicAgent. The only architectural change vs step 2 is that one of its tools — task — spawns a context-isolated subagent. Holding everything else equal is what makes step 5's comparison clean: if the parent had switched to a different runner, any context drop could be attributed to either the new runner or the delegation boundary; with this shape, the only variable is the boundary itself.

DelegatingAgentResult exposes two derived numbers that step 5 will lean on:

@dataclass
class DelegatingAgentResult:
    parent_result: AgentResult
    subagent_results: List[SubagentResult]

    @property
    def parent_context_chars(self) -> int:
        return sum(
            len(entry.get("content", ""))
            for entry in self.parent_result.transcript
        )

    @property
    def isolated_context_chars(self) -> int:
        return sum(
            sum(len(entry.get("content", "")) for entry in sub.transcript)
            for sub in self.subagent_results
        )

Two separate accumulators, deliberately. parent_context_chars is the cost the article wants to minimise — bytes the model has to re-read on the parent loop. isolated_context_chars is where that cost went — the bytes the subagent ate so the parent didn't have to. Keeping them as two properties (rather than a single "total bytes" number) lets the tests assert the right invariant: not that the work disappeared, but that it moved across the boundary.

Finally, the canonical subagent factory:

def build_repo_scanner_subagent_factory(
    repo: Dict[str, str],
    final_text: str = "Repo summarised.",
) -> SubagentFactory:
    paths = list(repo)

    def factory() -> Subagent:
        return Subagent(
            name="repo_scanner",
            model=build_scan_script(paths, final_text=final_text),
            tools={"read_file": make_read_tool(repo)},
        )

    return factory

This is the seam to step 2. We reuse build_scan_script and make_read_tool unchanged — the file-by-file scan that bloated the monolithic loop now runs inside the subagent, with the subagent's own transcript catching every read_file body. The parent never sees the script; it only sees the final_text the subagent returns.

The test file pins eleven properties. The four that earn the most are:

  • test_parent_transcript_does_not_contain_any_file_body — for every body in the demo repo, assert it is not a substring of the concatenated parent transcript. This is the privacy guarantee of the delegation boundary stated as a single mechanical check.
  • test_subagent_transcript_carries_every_file_body — the work didn't vanish, it moved. The subagent transcript still contains every file body it had to read. If a future refactor accidentally short-circuited the subagent, this test would catch it because the parent transcript would still pass while the subagent transcript would be empty.
  • test_delegation_shrinks_parent_context_versus_monolithic — run both architectures on the same task and assert delegating.parent_context_chars * 4 < mono_result.context_tokens. A 4× headroom is loud enough that the comparison won't flake on small changes to the demo repo, and step 5 will sharpen this into a precise ratio.
  • test_subagent_factory_yields_fresh_instance_each_call — calling the factory twice returns two different Subagent instances with two different StubModel instances. This is the property that prevents accidental state-sharing across delegated runs.

The remaining tests cover the smaller invariants — that the parent only sees final_text as the task tool result, that the subagent records its own name and turn count, that the log entries arrive in order, that unknown subagent names return an error string instead of raising, and that the only parent transcript entry mentioning the subagent name is the single tool_call line.

Test it

python3 -m pytest tests/test_subagent.py -v
============================= test session starts ==============================
platform darwin -- Python 3.9.6, pytest-8.4.2, pluggy-1.6.0
rootdir: /codebase
configfile: pyproject.toml
collected 11 items

tests/test_subagent.py::test_run_with_delegation_returns_delegating_agent_result PASSED [  9%]
tests/test_subagent.py::test_parent_only_sees_subagent_summary_as_tool_result PASSED [ 18%]
tests/test_subagent.py::test_parent_transcript_does_not_contain_any_file_body PASSED [ 27%]
tests/test_subagent.py::test_subagent_transcript_carries_every_file_body PASSED [ 36%]
tests/test_subagent.py::test_subagent_records_its_own_name_and_turn_count PASSED [ 45%]
tests/test_subagent.py::test_delegation_shrinks_parent_context_versus_monolithic PASSED [ 54%]
tests/test_subagent.py::test_isolated_context_absorbs_the_file_reading_cost PASSED [ 63%]
tests/test_subagent.py::test_make_task_tool_returns_error_string_for_unknown_subagent PASSED [ 72%]
tests/test_subagent.py::test_subagent_factory_yields_fresh_instance_each_call PASSED [ 81%]
tests/test_subagent.py::test_task_tool_logs_every_subagent_invocation_in_order PASSED [ 90%]
tests/test_subagent.py::test_parent_tool_call_argument_is_the_only_subagent_signal_in_parent PASSED [100%]

============================== 11 passed in 0.01s ==============================

All eleven tests pass. The two assertions doing the heaviest lifting are test_parent_transcript_does_not_contain_any_file_body (the boundary actually isolates) and test_delegation_shrinks_parent_context_versus_monolithic (the boundary actually pays for itself). Running the full suite confirms no regression against steps 1–3:

python3 -m pytest -v
============================= test session starts ==============================
platform darwin -- Python 3.9.6, pytest-8.4.2, pluggy-1.6.0
rootdir: /codebase
configfile: pyproject.toml
testpaths: tests
collected 36 items

tests/test_context_measurement.py .............                          [ 36%]
tests/test_monolithic_agent.py .....                                     [ 50%]
tests/test_repo_scan.py .......                                          [ 69%]
tests/test_subagent.py ...........                                       [100%]

============================== 36 passed in 0.03s ==============================

Thirty-six tests now lock the architecture down: the monolithic agent still works (step 1), the demo repo still bloats it (step 2), the measurement layer still ranks the right tool calls (step 3), and the delegation boundary now isolates the heaviest of those tool calls behind a Task-style subagent without altering anything else.

What we got

A Subagent primitive that wraps a MonolithicAgent with a name and a call-scoped result, a make_task_tool factory that lets the parent invoke any registered subagent by name, a run_with_delegation runner that wires the parent and the registry together, and a DelegatingAgentResult that separates parent context from isolated context so the win is measurable. We reused step 2's build_scan_script and make_read_tool unchanged — the file-by-file scan that bloated the monolithic loop now runs inside the repo_scanner subagent, the parent only sees the final summary as a single tool result, and eleven tests pin down that the file bodies never leak across the boundary. Step 5 picks this up: we plug the measurement layer from step 3 into both runs, put a number on the savings, and distill the result into a reusable when-to-delegate heuristic.

Repository

The companion code for this article: https://github.com/vytharion/claude-code-subagent-context-isolation-when-to-delegate

The state of the code after this step: a0a3496

Key commits to step through:

  • 666e4cc — step 1: minimal monolithic agent + transcript-growth tests
  • 83ca35f — step 2: realistic multi-file repo-scan workload + cost helpers
  • 3f72a41 — step 3: context measurement + ranked delegation-boundary picker
  • a0a3496 — step 4: Task-style subagent + delegating runner that isolates the repo scan

What we're doing this step

Step 4 moved the file-by-file repo scan behind a Task-style subagent and tested that the parent transcript no longer carried any file bodies. That is a qualitative win — the boundary holds — but the article still needs the quantitative one: how much smaller is the parent's context, exactly, and at what point would the refactor not have been worth it? Step 5 answers both questions in code. We run the step 1/step 2 monolithic agent and the step 4 delegating agent on the same demo repo, feed their per-turn snapshots into a new compare_footprints reducer, and surface a DelegationComparison object that pairs the two ContextMeasurement records plus the isolated subagent cost. Then we encode the lesson learned across the previous four steps as a three-rule should_delegate heuristic — minimum cumulative bytes, minimum re-read multiplier, minimum saveable fraction — so a future caller staring at a noisy trace can ask "is it worth pulling this into a subagent?" and get a yes/no with a reason string attached. The deliverable is the article's last code-bearing step: a side-by-side comparison module, a tunable decision rule, and seventeen tests that pin every threshold and every comparison invariant.

Setup

No new runtime dependencies. We add two production modules, one test file, and a small tweak to subagent.py so the comparison module can read subagent-side snapshots:

codebase/
├── pyproject.toml
├── monolithic_agent/
│   ├── __init__.py
│   ├── agent.py
│   ├── model.py
│   ├── repo_scan.py
│   ├── context_measurement.py
│   ├── subagent.py                  ← updated in step 5
│   ├── delegating_agent.py
│   ├── comparison.py                ← new in step 5
│   └── heuristic.py                 ← new in step 5
└── tests/
    ├── test_monolithic_agent.py
    ├── test_repo_scan.py
    ├── test_context_measurement.py
    ├── test_subagent.py
    └── test_when_to_delegate.py     ← new in step 5

The subagent tweak adds a seen_transcripts field on SubagentResult and has Subagent.run copy them off the stub model after the child finishes. That is the only edit to step 4 code, and it is purely additive — the field defaults to an empty list, every existing test still passes, and the new field is what lets compare_footprints measure the isolated cost without re-running the subagent.

Everything new is re-exported through monolithic_agent/__init__.py:

from monolithic_agent import (
    DelegationComparison,
    DelegationRecommendation,
    compare_footprints,
    recommend_for_run,
    should_delegate,
)

Splitting comparison.py and heuristic.py follows the same seam as step 4: one module describes raw measurement (what the run cost), the other describes a decision (should we have delegated). Keeping them separate means the comparison reducer is reusable in contexts that have their own heuristic, and the heuristic is reusable on top of any measurement source — not just our stub model's snapshots.

Implementation

compare_footprints is intentionally a thin function. It accepts three sets of per-turn snapshots — the monolithic run's, the delegating parent's, and a list-of-lists for every spawned subagent — and reduces them to one DelegationComparison:

def compare_footprints(
    monolithic_seen_transcripts: List[List[dict]],
    parent_seen_transcripts: List[List[dict]],
    subagent_seen_transcripts: List[List[List[dict]]],
) -> DelegationComparison:
    monolithic = measure_context(monolithic_seen_transcripts)
    parent = measure_context(parent_seen_transcripts)
    isolated = _combined_isolated(subagent_seen_transcripts)
    return DelegationComparison(
        monolithic=monolithic,
        parent=parent,
        isolated=isolated,
    )

The shape carries three derived ratios:

@dataclass
class DelegationComparison:
    monolithic: ContextMeasurement
    parent: ContextMeasurement
    isolated: ContextMeasurement

    @property
    def parent_peak_savings_ratio(self) -> float: ...

    @property
    def parent_cumulative_savings_ratio(self) -> float: ...

    @property
    def isolated_share_of_total(self) -> float: ...

parent_peak_savings_ratio and parent_cumulative_savings_ratio are the numbers the article exists to surface — the fraction of the monolithic loop's worst-turn footprint and total re-read cost that the delegation boundary actually erased from the parent. isolated_share_of_total is the audit-side companion: it confirms the work moved across the boundary rather than vanished. A future refactor that accidentally short-circuited the subagent would push isolated_share_of_total toward zero while the parent ratios stayed high, which is a much louder failure than a single number on its own.

The subagent snapshots are flattened before measurement. That choice matters: a delegation that spawned three small subagents should not get to advertise the smallest of their peak sizes as "the isolated peak." We treat the isolated context as the union of every subagent transcript the parent ever caused to exist, so the peak is the worst single subagent turn and the cumulative is the sum of bytes re-read across all of them. That mirrors how a real refactor pays for delegation — many small fresh contexts instead of one ever-growing shared one.

heuristic.py codifies the lesson learned. Three named constants set the defaults:

DEFAULT_MIN_CUMULATIVE_CHARS = 2000
DEFAULT_MIN_REREAD_MULTIPLIER = 1.5
DEFAULT_MIN_BOUNDARY_SAVINGS = 0.3

Each one corresponds to a failure mode the previous steps surfaced. 2000 cumulative chars is the floor below which the loop is simply too small for the refactor overhead to pay back. A reread_multiplier of 1.5 is the threshold at which the monolithic loop is, on average, re-reading more than one and a half full transcripts — anything below that and the loop is not actually wasting much, even if it is verbose. A boundary_savings_ratio of 0.3 means the ranked candidates from step 3 collectively cover at least thirty per cent of the cumulative cost — a smaller saveable fraction means even an ideal delegation cut would not cover its own cost. Each constant is overrideable so a caller working on a different workload can re-tune the gates without forking the module.

should_delegate evaluates the three rules independently and returns a DelegationRecommendation that lists every threshold the measurement passed and every gate that vetoed delegation:

def should_delegate(
    measurement: ContextMeasurement,
    boundary: Optional[DelegationBoundary] = None,
    *,
    min_cumulative_chars: int = DEFAULT_MIN_CUMULATIVE_CHARS,
    min_reread_multiplier: float = DEFAULT_MIN_REREAD_MULTIPLIER,
    min_boundary_savings_ratio: float = DEFAULT_MIN_BOUNDARY_SAVINGS,
) -> DelegationRecommendation: ...

Three design choices worth flagging:

  • All three gates must pass. Any single failed gate vetoes the recommendation. That asymmetry is deliberate: false-positive delegations cost real maintenance time, while false-negative skips just leave the loop slightly verbose. The heuristic should be conservative.
  • boundary is optional. A caller who only has a measurement (no ranked candidates) gets a two-rule decision; supplying a boundary adds the third gate. The function never errors on a missing argument — it just downgrades to a weaker recommendation.
  • The reason string is reconstructable from the rules lists. triggered_rules and failed_rules are the structured form, reason is the human-readable summary, and the two are always consistent. Callers can log the structured form and render the string, or vice versa, without re-running the heuristic.

recommend_for_run is the convenience entry-point: hand it a stub model's seen_transcripts and it threads measure_context, identify_delegation_boundary, and should_delegate together so the caller only has to make one function call.

The test file pins seventeen properties. The ones worth re-reading slowly:

  • test_compare_footprints_shows_parent_cumulative_savings + test_compare_footprints_shows_parent_peak_savings — both ratios clear 0.5 on the demo repo. Half the monolithic loop's cost was paying for re-reads that delegation now erases.
  • test_compare_footprints_isolated_carries_the_offloaded_cost — the work did not vanish; isolated_share_of_total > 0.5 confirms the majority of the byte-cost now lives behind the delegation boundary. Step 4's test_isolated_context_absorbs_the_file_reading_cost proved this on its own subagent log; step 5's version proves it relative to the parent's new (smaller) cost.
  • test_should_delegate_returns_true_for_full_repo_scan vs test_should_delegate_returns_false_for_tiny_single_file_run — the heuristic correctly recommends delegation for the article's central workload and correctly refuses to recommend it for a one-file scan that would not pay back the refactor.
  • test_should_delegate_respects_supplied_boundary_savings_floor — supplying a degenerate boundary (no candidates, zero savings) vetoes delegation even though the cumulative and re-read gates would otherwise pass. This is the rule that prevents the heuristic from recommending refactors that have no actual byte-savings target.
  • test_should_delegate_thresholds_are_caller_tunable — cranking min_cumulative_chars above the run's actual cost flips the decision to "keep monolithic." This is what makes the constants defaults rather than magic numbers.
  • test_should_delegate_uses_documented_default_thresholds — sanity check that the public constants are the values the article and the README cite. If someone re-tunes the floors without updating the docs, this test fails loudly.

The remaining tests cover the smaller invariants — the empty-subagent edge case for the comparison reducer, the zero-measurement case for the heuristic, the per-turn snapshot count on SubagentResult, the monotonic growth of subagent snapshot sizes, the boundary-veto path for recommend_for_run, and the consistency between triggered_rules, failed_rules, and the reason string.

Test it

python3 -m pytest tests/test_when_to_delegate.py -v
============================= test session starts ==============================
platform darwin -- Python 3.9.6, pytest-8.4.2, pluggy-1.6.0
rootdir: /codebase
configfile: pyproject.toml
collected 17 items

tests/test_when_to_delegate.py::test_subagent_result_now_exposes_seen_transcripts PASSED [  5%]
tests/test_when_to_delegate.py::test_subagent_seen_transcripts_grow_monotonically PASSED [ 11%]
tests/test_when_to_delegate.py::test_compare_footprints_returns_delegation_comparison PASSED [ 17%]
tests/test_when_to_delegate.py::test_compare_footprints_shows_parent_cumulative_savings PASSED [ 23%]
tests/test_when_to_delegate.py::test_compare_footprints_shows_parent_peak_savings PASSED [ 29%]
tests/test_when_to_delegate.py::test_compare_footprints_isolated_carries_the_offloaded_cost PASSED [ 35%]
tests/test_when_to_delegate.py::test_compare_footprints_handles_empty_subagent_list PASSED [ 41%]
tests/test_when_to_delegate.py::test_should_delegate_returns_true_for_full_repo_scan PASSED [ 47%]
tests/test_when_to_delegate.py::test_should_delegate_reason_string_lists_triggered_rules PASSED [ 52%]
tests/test_when_to_delegate.py::test_should_delegate_returns_false_for_tiny_single_file_run PASSED [ 58%]
tests/test_when_to_delegate.py::test_should_delegate_respects_supplied_boundary_savings_floor PASSED [ 64%]
tests/test_when_to_delegate.py::test_should_delegate_thresholds_are_caller_tunable PASSED [ 70%]
tests/test_when_to_delegate.py::test_should_delegate_uses_documented_default_thresholds PASSED [ 76%]
tests/test_when_to_delegate.py::test_should_delegate_handles_zero_measurement PASSED [ 82%]
tests/test_when_to_delegate.py::test_recommend_for_run_recommends_delegation_for_full_repo_scan PASSED [ 88%]
tests/test_when_to_delegate.py::test_recommend_for_run_rejects_tiny_run PASSED [ 94%]
tests/test_when_to_delegate.py::test_recommend_for_run_zero_target_vetos_via_boundary_gate PASSED [100%]

============================== 17 passed in 0.02s ==============================

All seventeen tests pass. The two assertions doing the heaviest lifting are test_compare_footprints_shows_parent_cumulative_savings (the central article claim is now a number, not a vibe) and test_should_delegate_returns_true_for_full_repo_scan paired with test_should_delegate_returns_false_for_tiny_single_file_run (the heuristic correctly picks a side on both the workload that motivated the article and the workload that would have been a false positive). Running the full suite confirms no regression against steps 1–4:

python3 -m pytest -v
============================= test session starts ==============================
platform darwin -- Python 3.9.6, pytest-8.4.2, pluggy-1.6.0
rootdir: /codebase
configfile: pyproject.toml
testpaths: tests
collected 53 items

tests/test_context_measurement.py .............                          [ 24%]
tests/test_monolithic_agent.py .....                                     [ 33%]
tests/test_repo_scan.py .......                                          [ 47%]
tests/test_subagent.py ...........                                       [ 67%]
tests/test_when_to_delegate.py .................                         [100%]

============================== 53 passed in 0.05s ==============================

Fifty-three tests now lock the full architecture down: the monolithic agent loop still runs research and synthesis in one transcript (step 1), the demo repo still bloats it file-by-file (step 2), the measurement layer still ranks the right re-reads (step 3), the delegation boundary still isolates the heaviest of those re-reads (step 4), and the comparison plus heuristic modules now turn the win into a number plus a yes/no rule callers can re-apply on new workloads.

What we got

A DelegationComparison shape that pairs the monolithic and delegating runs into one object with three savings ratios, a compare_footprints reducer that builds it from three sets of per-turn snapshots, a should_delegate heuristic that gates a yes/no recommendation behind three independently tunable thresholds, a recommend_for_run convenience entry-point that threads measurement + boundary + heuristic together, and seventeen tests that pin every threshold and every comparison invariant. The heuristic is conservative on purpose: all three gates must pass for it to recommend delegation, the constants are public so callers can re-tune them per workload, and the reason string is reconstructable from the structured triggered_rules and failed_rules lists so logs stay machine-parsable. The article now has its closing claim in code form — given a noisy monolithic run, here is how to measure the cost, here is how to spot the boundary, and here is the rule you would apply to decide whether the Task-style refactor is worth doing — and a future caller staring at a different workload can re-run the same reducers without re-deriving any of the previous five steps.

Repository

The companion code for this article: https://github.com/vytharion/claude-code-subagent-context-isolation-when-to-delegate

The state of the code after this step: 6072ac9

Key commits to step through:

  • 666e4cc — step 1: minimal monolithic agent + transcript-growth tests
  • 83ca35f — step 2: realistic multi-file repo-scan workload + cost helpers
  • 3f72a41 — step 3: context measurement + ranked delegation-boundary picker
  • a0a3496 — step 4: Task-style subagent + delegating runner that isolates the repo scan
  • 6072ac9 — step 5: side-by-side footprint comparison + tunable when-to-delegate heuristic