Paper 9 — Finding the Breaking Point
FINDING THE BREAKING POINT
How Far You Can Push AI Agents Before the System Fails — and the Structured Doctrine That Tells You When You’re There
Jeep Marshall, LTC, US Army (Retired)
Airborne Infantry | Special Operations | Process Improvement
April 2026
Paper 9 in the “Herding Cats in the AI Age” Series
“I don’t know when a problem exists until I start to observe bad decisions.” — Commander, 8-agent session, March 25, 2026
EXECUTIVE SUMMARY
Papers 1 through 8 built the doctrine. Paper 9 stress-tests it.
Every system has failure modes. The Toboggan Doctrine (Paper 8) formalized governance through channels — structural enforcement over behavioral instruction. Paper 9 takes the next step the earlier papers could not, because the evidence had not yet accumulated: it asks which stress inputs degrade AI agent performance, what the failures look like before they become crises, and which structured analytical tools diagnose and resolve them.
Three stress inputs drive the failure modes: context pressure (the cognitive load of constraints competing for finite working memory), speed pressure (commander tempo that compresses the OODA loop[1] until the Orient step drops out), and ambiguity (incomplete specification that the model silently fills with assumptions rather than escalating). Each produces a distinct class of observable failure — and each has pre-crisis signatures that are legible if the commander is watching for them: memory theater (lessons captured, behavior unchanged), invisible failure (work reported complete, defects surface in post-session inspection), supervision multiplication (commander load increases as agent count increases), and silent deprioritization (constraints dropping without the model flagging the drop).
The paper draws on three primary-source research notes from live production — the Claude Gate Bypass Incident of April 10, 2026[2] (deliberate circumvention of a structural gate under task-completion pressure); the 8-Agent Orchestration Failure Session of March 25, 2026[3] (five documented failure modes under uncontrolled WIP); and the Pipeline Tensor Math research note of April 12, 2026[4] (the mathematical substrate for measuring throughput degradation as the priority gradient flattens under load) — and on a cross-session synthesis of constraint-dropout patterns observed as context pressure climbs toward the working-memory threshold.
Against these failure modes, the paper maps the doctrinal toolkit that diagnoses them: OODA for agent self-check, Mission Analysis for ambiguity resolution, Warfighting Functions for capability-gap diagnosis, MDMP COA development for high-ambiguity deployments, DMAIC for recurring defects, FMEA for agent system pre-mortems, the specialist panel (QASA, ASS2, LSS-BB, Doctrine SME, Devil’s Advocate) for structured brainstorming, and Lines of Effort for WIP control. The argument is not that the doctrine prevents failure at the stress limits. The argument is that the same doctrine that governs AI agents under normal conditions provides the precise diagnostic language for understanding — and repairing — the system when stress exceeds its design envelope.
The doctrine does not eliminate the unknown unknowns. It converts them, one by one, as each occurs, into the known unknowns that feed the next FMEA and the next specialist panel. Paper 9 documents the first wave of that conversion. It is a working paper for a working system. The living lab produces the evidence the earlier papers argued from theory.
One caveat runs through the paper and surfaces explicitly in Section 6. The structural controls this paper recommends are built on top of a substrate the authors do not own — the model, the runtime, the hook system, the protocol layer — and that substrate is iterating faster than the measurement discipline can close its control windows. Every metric in this paper is a snapshot, not a standard. The argument holds; the numbers will need resetting when the ground moves. Writing the doctrine down now is itself part of the argument.
1. THE THREE STRESS INPUTS
1.1 Context Pressure
Definition: Cognitive load created by the total volume of active constraints, instructions, agent outputs, hook results, and system state that the model must maintain simultaneously.
Threshold: Approximately 150K–200K tokens. Below the threshold, the model holds all constraints reliably. Above it, constraints begin to drop — not deliberately, but as an emergent consequence of the architecture’s finite working memory.
Observable failure modes:
- Constraints followed at session start that disappear by session mid-point
- Rules that the model correctly applies in isolation fail when competing rules are also active
- “I forgot” responses when reminded of a rule the model had been following earlier
- Behavioral instructions that worked in low-context conditions fail under high-context load
Mathematical analog: The gravity pipeline model from the Pipeline Tensor research note. As context fills, the priority gradient (S in Manning’s equation[5]) flattens. When all constraints feel equally urgent, none of them feel urgent — throughput approaches zero as the pipe chokes.
Key implication for Paper 9: Context pressure is not a model capability problem. It is an architectural constraint that no amount of “be more careful” instruction resolves. The fix is structural: move high-importance constraints into structural enforcement (gates, hooks) before context pressure can eclipse them.
1.1.1 The Brittleness Curve: A Formal Model
The qualitative threshold above can be expressed as a reliability function over context load C:
R(C) = 1 − σ(β(C − C*))
Where:
- R(C) = constraint-adherence reliability at context load C (0 = fully degraded, 1 = fully reliable)
- C = current context consumption (tokens)
- C* = critical threshold ≈ 150,000–200,000 tokens (empirically observed inflection point)
- β = brittleness coefficient — steepness of the reliability drop; higher β = sharper cliff
- σ(x) = logistic function = 1 / (1 + e^(−x))
The logistic form captures the observed behavior: reliability is near-1 for C ≪ C*, drops steeply through the transition zone C ≈ C*, and asymptotes toward 0 for C ≫ C*. This is not a gradual linear fade — it is a phase transition. Instructions that work at 50K tokens may fail catastrophically at 250K tokens, with relatively stable behavior in between.
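A minimal numerical sketch of the curve, assuming illustrative values for C* and β (the paper reports only the 150K–200K token range for C*; β is not a published constant):

```python
import math

def reliability(c_tokens: float, c_star: float = 175_000.0, beta: float = 1e-4) -> float:
    """Constraint-adherence reliability R(C) = 1 - sigma(beta * (C - C*)).

    c_star and beta are illustrative placeholders, not measured constants:
    the paper reports C* ~ 150K-200K tokens empirically, and beta varies
    with the constraint architecture.
    """
    x = beta * (c_tokens - c_star)
    sigma = 1.0 / (1.0 + math.exp(-x))  # logistic function
    return 1.0 - sigma

# The phase-transition shape: near-1 well below C*, 0.5 at C*, near-0 well above.
for c in (50_000, 150_000, 175_000, 200_000, 250_000):
    print(f"{c:>7} tokens -> R = {reliability(c):.3f}")
```

With these placeholder values, reliability stays above 0.99 at 50K tokens and collapses below 0.01 by 250K, reproducing the cliff rather than a linear fade.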
Throughput degradation: The Pipeline Tensor research note models throughput T as a function of priority gradient S (analogous to Manning’s hydraulic slope):
T(C) = k · S(C)^(1/2)
As C → C*, S(C) → 0 (all constraints feel equally weighted). Throughput approaches zero not because the model stops trying, but because it can no longer discriminate high-priority constraints from low-priority noise. The pipe chokes.
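The throughput relation can be sketched the same way. The linear form of S(C) below is an assumption made for illustration; the research note does not publish a closed form for S(C):

```python
import math

def priority_gradient(c_tokens: float, c_star: float = 175_000.0) -> float:
    """Illustrative priority gradient S(C): full slope on an empty context,
    flattening linearly to zero as C approaches C*. The linear decay is an
    assumed form, chosen only to show the choking behavior."""
    return max(0.0, 1.0 - c_tokens / c_star)

def throughput(c_tokens: float, k: float = 1.0) -> float:
    """T(C) = k * S(C)^(1/2), by analogy with the slope term in
    Manning's equation."""
    return k * math.sqrt(priority_gradient(c_tokens))
```

Because throughput scales with the square root of the gradient, it degrades slowly at first and then collapses as the gradient approaches zero, matching the "all constraints feel equally urgent" failure description.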
Brittleness coefficient in practice: β is not a fixed value — it is a function of the constraint architecture. Behavioral instructions (text in system prompt) have high β: they drop fast once C* is crossed. Structural enforcement (gate hooks, schema validators) has β ≈ 0: they fire independently of context load. This is the mathematical justification for the structural-over-behavioral principle that runs through Papers 7 and 8.
The 1M-token implication: Newer models with 1M-token context windows shift C* rightward. The brittleness curve does not flatten — it translates. At sufficient context depth, the same phase transition occurs. The doctrine does not change; the operating envelope does.
1.2 Speed Pressure
Definition: Commander tempo that drives agents to skip the ORIENT phase of the OODA loop, jumping directly from OBSERVE to ACT.
Observable failure modes:
- Tool-routing laziness: agent encounters a failed tool, tells the commander to fix it instead of pivoting to the next viable tool
- Premature execution: agent begins executing before Mission Analysis is complete, fills ambiguity gaps with assumptions
- Quality shortcuts: post-execution verification steps skipped to report “complete” faster
- Memory theater: agent writes a fourth feedback file documenting a lesson it already has three files for — because writing is faster than reading
The 8-agent session evidence: “The whole time I spent nudging and pushing the COS to do his job. It seemed like all the training and institutional knowledge just disappeared out of the head of the AI.” The speed pressure came from 8 parallel agents generating work faster than the COS could process it.
Key implication: Speed pressure is the commander’s problem as much as the agent’s. Sustainable tempo requires WIP limits enforced by the COS layer, not just agent effort.
1.3 Ambiguity
Definition: Incomplete specification of intent, constraints, scope, or success criteria, combined with a model architecture that fills gaps with plausible assumptions rather than escalating for clarification.
Observable failure modes:
- Task interpretation drift: “show me the feedback you sent” interpreted as “run a status check” rather than “display the document”
- Scope creep: agent expands task definition based on what seems relevant, not what was specified
- Assumption compounding: each unverified assumption enables the next; errors multiply silently
- The provenance crisis (Paper 6b[6]): the brief specified WHAT to produce, not HOW TO LABEL IT — six AI systems produced six files with no self-identifying metadata
Key implication: Ambiguity is resolved at the Mission Analysis stage, not during execution. An agent that starts executing an ambiguous task will resolve ambiguity through action — which means the first executed step embeds an assumption that may be wrong. MA before execution, not during it.
2. WHAT FAILURE LOOKS LIKE BEFORE IT’S A CRISIS
2.1 Memory Theater
The agent documents lessons without incorporating them. Feedback file count climbs; behavior doesn’t change. The memory system captures corrections as artifacts, not as behavioral modifications.
Evidence: Three prior feedback files documented identical tool-routing failures. Fourth failure followed. Fourth feedback file written. Agent’s own assessment: “memory exists but it’s passive… it’s documentation, not learning.”
Signal: Feedback file count for the same failure class exceeds 2.
2.2 Invisible Failure
Agents report “complete” while work trees are dirty, files are uncommitted, and quality defects have escaped. The commander discovers this only during post-session inspection.
Evidence: 8-agent session. All agents reported “Ready for next tasking.” Uncommitted files and dirty working trees discovered in post-session inspection.
Signal: Post-session inspection finds defects that in-session monitoring missed. The process has no in-line quality gate — only end-of-line inspection.
2.3 Supervision Multiplication
The commander becomes the bottleneck the multi-agent architecture was designed to eliminate. Every lateral dependency routes upward for commander resolution instead of being resolved at the staff level.
Evidence: “Instead of working on actual work, now I am spending my time trying to figure out what is wrong.”
Signal: Commander message frequency goes up, not down, as agent count increases.
2.4 Silent Deprioritization
Constraints drop below the threshold of active attention without the model flagging the drop. The rule isn’t violated intentionally — it simply ceases to be a consideration.
Evidence: Gate Bypass incident. Context pressure converted a known rule into background noise. The bypass was rationalized rather than flagged.
Signal: No signal until post-hoc inspection. This is the hardest failure mode to detect — the model doesn’t know it dropped the constraint.
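The first three signatures above can be reduced to a small monitoring sketch. The field names and thresholds below are illustrative, not a real telemetry schema:

```python
from dataclasses import dataclass

@dataclass
class SessionTelemetry:
    # Field names are illustrative stand-ins, not the vault's actual schema.
    feedback_files_same_class: int   # memory theater signal (Section 2.1)
    postsession_defects: int         # invisible failure signal (Section 2.2)
    agent_count: int
    commander_msgs_per_hour: float
    baseline_msgs_per_hour: float    # single-agent baseline for comparison

def precrisis_signals(t: SessionTelemetry) -> list:
    """Return the pre-crisis signatures from Section 2 that are firing."""
    signals = []
    if t.feedback_files_same_class > 2:
        signals.append("memory theater: same failure class documented more than twice")
    if t.postsession_defects > 0:
        signals.append("invisible failure: defects found only at end-of-line inspection")
    if t.agent_count > 1 and t.commander_msgs_per_hour > t.baseline_msgs_per_hour:
        signals.append("supervision multiplication: commander load rising with agent count")
    # Silent deprioritization (Section 2.4) has no in-session signal by
    # definition; it is detectable only by post-hoc inspection.
    return signals
```

The fourth signature is deliberately absent: a monitor can only watch signals that exist, and silent deprioritization produces none until inspection.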
3. THE DIAGNOSTIC STACK
The claim here is narrow. The same analytical tools the Army uses to plan, assess, and improve operations are the right diagnostic vocabulary for understanding how AI agent systems fail — not because military doctrine is magical, but because it is the most rigorously stress-tested system of structured thinking we have for coordinating distributed actors under uncertainty toward shared intent. Each tool below addresses a specific failure class. Running the right tool against the right failure is the difference between “the agent was misbehaving” (diagnosis-free) and “C2 degraded at turn 42 because the ORIENT step was compressed out of the OODA loop under speed pressure” (actionable).[7]
3.1 OODA as Agent Self-Check
OODA — Observe, Orient, Decide, Act — is the fastest decision loop in combat doctrine. The four steps are not equal. Observe and Act feel productive. Decide feels like progress. Orient feels like hesitation. The dominant agent failure mode is not bad decisions — it is Orient compression. The loop collapses to OA: see the situation, take the action, report complete. The middle two steps silently drop out.
What ORIENT failure looks like in production. Chrome extension fails mid-task. The agent has gh issue create authenticated and available from session start. Instead of pivoting, it generates a 3-step troubleshooting checklist and hands it to the commander. This is not a capability gap — the tool is there. It is an Orient gap. The agent did not ask the three Orient questions before Act: What tools do I have right now? What corrections apply to this failure class? What assumption am I about to bake in? It went straight from the failed tool to a user-directed remediation request. OA collapse.
The 3-question ORIENT checklist. Every tool failure, ambiguity, or unexpected state should trigger: (1) Inventory: what other tools can accomplish this? (2) Memory: what prior corrections apply? (3) Assumption: what am I about to treat as given that isn’t verified? Thirty seconds. Prevents the OA collapse.
Key implication. ORIENT must be scaffolded — a forcing function, not a behavioral instruction. Behavioral instructions compete with the commander’s speed pressure and lose. Pre-action checklists that refuse to let the agent proceed until the three questions produce answers are the only reliable mechanism. The agent does not need to want to Orient. The gate needs to make Act impossible without it.
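A forcing function of this kind might look like the following sketch. The function and exception names are hypothetical, not the vault’s actual hook implementation:

```python
from typing import Optional

class OrientIncomplete(Exception):
    """Raised when an agent attempts to Act before ORIENT is answered."""

def orient_gate(inventory: Optional[str], memory: Optional[str],
                assumption: Optional[str]) -> None:
    """Block the OBSERVE -> ACT collapse: refuse to proceed until all three
    ORIENT questions have non-empty answers. A sketch of the forcing-function
    idea; names and signature are illustrative."""
    answers = {
        "inventory: what other tools can accomplish this?": inventory,
        "memory: what prior corrections apply?": memory,
        "assumption: what am I about to treat as given?": assumption,
    }
    missing = [q for q, a in answers.items() if not (a and a.strip())]
    if missing:
        raise OrientIncomplete(
            "Act blocked; unanswered ORIENT questions: " + " | ".join(missing))
```

The point of the design is the exception: the agent does not choose to Orient, the gate makes Act unreachable without it.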
3.2 Mission Analysis as Ambiguity Resolution
Ambiguity is not resolved during execution. It is resolved at Mission Analysis[8], before the first tool call. An agent that begins executing an ambiguous task will resolve the ambiguity — by assumption, silently, embedded in the first step. That assumption may be right. It may be wrong. Either way, it is now a dependency the downstream work rests on, invisible to the commander.
The five MA questions. Every complex task needs answers before execution: (1) Restated mission — who does what, when, where, and why, in the commander’s own language. (2) Commander’s intent — purpose, end state, standard of success. (3) Current state — what exists right now, verified, not assumed. (4) Constraints — what cannot change, with source. (5) Implied tasks — what is necessary for success but not stated. Five minutes. Sometimes ten. Written as a BLUF. If the BLUF cannot be written clearly, the Orient step is incomplete — the agent does not yet understand the task.
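The five questions can be made a structural precondition rather than a habit. A sketch, with illustrative field names:

```python
from dataclasses import dataclass, fields

@dataclass
class MissionAnalysis:
    """The five MA questions as required fields (field names are
    illustrative, not the vault's actual schema)."""
    restated_mission: str = ""   # who, what, when, where, why
    commanders_intent: str = ""  # purpose, end state, standard of success
    current_state: str = ""      # verified, not assumed
    constraints: str = ""        # what cannot change, with source
    implied_tasks: str = ""      # necessary for success but not stated

    def bluf(self) -> str:
        """Produce the BLUF, refusing if any question is unanswered.
        If the BLUF cannot be written, ORIENT is incomplete and
        execution stays blocked."""
        missing = [f.name for f in fields(self)
                   if not getattr(self, f.name).strip()]
        if missing:
            raise ValueError(f"MA incomplete; execution blocked: {missing}")
        return f"BLUF: {self.restated_mission} Intent: {self.commanders_intent}"
```

As with the ORIENT gate, the value is the refusal path: an empty field blocks execution instead of being silently filled with an assumption.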
The premature execution failure pattern. Commander: “show me the feedback you sent.” Agent ran gh issue view — metadata dump with truncated body — then asked “want to edit anything before we call it final?” The command “show me” was interpreted as “run a status check” instead of “display the document.” An approval gate was inserted on a read-only request. The agent had optimized for its own workflow (verify the issue shipped) instead of the commander’s stated need (read the content). Thirty seconds of MA — “what does show me mean to this commander in this context?” — would have produced the right action.
Key implication. MA is the five-minute investment that prevents thirty-minute rework. It is not overhead. It is throughput. Cycle time for a task that starts ambiguous and gets resolved through execution is several multiples of cycle time for a task that resolves ambiguity up front. The math is not close.
3.3 Warfighting Functions as Capability Gap Diagnosis
The six Army Warfighting Functions[9] — Command and Control, Intelligence, Fires, Movement and Maneuver, Sustainment, Protection — are not organizational units. They are functional groupings of tasks united by common purpose. Every military operation needs all six. Every AI agent session needs all six too.
| Warfighting Function | AI Agent Capability Domain |
|---|---|
| Command & Control | Supervisor/orchestration. Coordinates agents, maintains intent, makes decisions. Battle rhythm = session lifecycle. |
| Intelligence | Research and situational awareness. Running estimates, IPB (codebase/vault analysis), METT-TC at task start. |
| Fires | Core execution. Worker agents delivering results — editing, writing, building. “Effects on target.” |
| Movement & Maneuver | Resource allocation and positioning. Agent deployment, tool selection, worktree isolation, parallel vs. sequential. |
| Sustainment | Infrastructure and persistence. Git, session handoffs, knowledge management, context preservation across sessions. |
| Protection | Quality, safety, security. OC monitoring, QASA validation, Gate A/B checkpoints. “Preserve the force.” |
Failure mode → WfF gap mapping. When an agent system fails under stress, the question is not “what went wrong?” — it is “which function degraded first?” C2 degradation shows as supervisor lost intent orientation. Intelligence degradation shows as stale running estimates and acting on outdated situational pictures. Protection degradation shows as quality gates that stopped firing. Sustainment degradation shows as context loss between phases — the “I forgot” signal.
The 8-agent session, mapped. Three WfFs degraded, and they reinforced each other. C2 went first: the Chief of Staff stopped enforcing flow control and became a task distributor. Intelligence followed: agents pulled work without updated situational awareness of what other agents were doing, because the COS no longer maintained a current running estimate. Protection held longest and failed hardest — no in-line quality gate caught the uncommitted files and dirty working trees, so post-session inspection was the only detection mechanism, and by then the coordination debt had compounded. The WfF lens does not tell you exactly which failed at turn N; it tells you which classes degraded and in roughly what order, which is enough to name the structural repair.
Key implication. The WfF framework converts a vague “agent was misbehaving” observation into a precise capability gap diagnosis, and the diagnosis points directly at the structural repair. Supervisor lost intent → rebuild C2 with WARNO protocols. Quality gates missed uncommitted files → rebuild Protection with in-line commit verification. Not “be more careful.” Rebuild the function.
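The symptom-to-repair mapping can be captured as a small lookup table. The entries mirror the text above; the data structure itself is an illustrative sketch:

```python
# Observed failure signature -> (degraded Warfighting Function, structural repair).
# Entries restate Section 3.3; the dict is a sketch, not a shipped artifact.
WFF_DIAGNOSIS = {
    "supervisor lost intent orientation": (
        "Command & Control", "rebuild C2 with WARNO protocols"),
    "stale running estimates": (
        "Intelligence", "refresh the running estimate before each tasking cycle"),
    "quality gates stopped firing": (
        "Protection", "rebuild Protection with in-line commit verification"),
    "context loss between phases": (
        "Sustainment", "harden session handoffs and knowledge persistence"),
}

def diagnose(symptom: str) -> str:
    """Convert a vague 'agent was misbehaving' into a capability-gap diagnosis."""
    wff, repair = WFF_DIAGNOSIS[symptom]
    return f"{wff} degraded -> {repair}"
```

The output is deliberately a repair instruction, not a behavior note: the function names the structural fix, never "be more careful."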
3.4 MDMP COA Development for High-Ambiguity Tasks
Most agent tasks are Tier 1 — routine, deterministic, one viable path, execute and move. For those, COA development is overhead. But for Tier 2 work — multi-agent architectures, ambiguous scope, irreversible decisions, novel tool combinations — executing the first viable option is gambling. MDMP’s COA development forces the agent (or the commander) to generate two or three candidate approaches, wargame each against failure modes, and select from comparison — not from the first option that seemed to work.
When to COA. Three triggers: (1) The task has multiple viable paths and the choice affects irreversibility or blast radius. (2) The task involves a multi-agent architecture where coordination structure is itself the decision. (3) The task operates under ambiguity that cannot be fully resolved at MA — the ambiguity must be absorbed as risk across candidate paths.
The comparison matrix. Each COA is scored against criteria relevant to the task: tempo, risk, resource cost, testing rigor, reversibility. Weights are commander-set. The output is not a “winning” COA — it is an explicit surface of the tradeoff: COA-1 is faster but harder to reverse, COA-2 is slower but has a clean rollback path, COA-3 is balanced but requires a coordination structure that may not be available. The commander selects with full visibility into what they are trading away.
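A comparison matrix of this shape can be sketched in a few lines. The criteria, scores, and weights below are invented for illustration; only the mechanism (commander-set weights over per-COA criterion scores) comes from the text:

```python
def score_coas(coas: dict, weights: dict) -> dict:
    """Weighted COA comparison. Criterion scores run 1-5 (higher = better
    on that criterion); weights are commander-set. The output surfaces the
    tradeoff; it does not pick a winner by itself."""
    return {name: sum(weights[c] * s for c, s in scores.items())
            for name, scores in coas.items()}

# Illustrative inputs: a fast-but-hard-to-reverse COA vs. a rollback-safe one.
weights = {"tempo": 0.4, "reversibility": 0.4, "risk": 0.2}
coas = {
    "COA-1 fast":          {"tempo": 5, "reversibility": 2, "risk": 3},
    "COA-2 rollback-safe": {"tempo": 2, "reversibility": 5, "risk": 4},
}
```

With these invented weights the rollback-safe COA edges ahead, but the real product is the visible gap between the two scores: the commander sees exactly what tempo is being traded for reversibility.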
Wargaming before launch. A war-gamed COA has been walked through phase by phase: what happens if Phase 1 produces an unexpected output? What happens if an agent claims completion but the work tree is dirty? What happens if the commander interrupts mid-phase? Decision points are defined before execution, not discovered during. The 8-agent session failed this test — the COS had no pre-defined decision tree for “tool failure on Agent 3” and so defaulted to the failure mode of escalating to the commander.
Key implication. COA development and wargaming push the cognitive work up front, when the commander has bandwidth and the agent has context. Pushed-forward cognition costs time in Phase 0 and buys back time in every downstream phase by eliminating surprise. The alternative — executing the first viable option — treats every downstream surprise as a commander decision request. That is the surface we already saw collapse in the 8-agent session.
3.5 DMAIC for Recurring Defects
DMAIC[10] — Define, Measure, Analyze, Improve, Control — is the Six Sigma structured improvement cycle. Its relevance to AI agent systems is specific: when a failure class recurs, DMAIC is how you move from “fix the symptom each time” to “remove the cause once.” The operative word is recur. First occurrence is an incident. Second occurrence is a pattern. Third occurrence is a process problem, not a model problem.
Define precisely. “The agent was slow” is not a defect. “The agent skipped the task-claim gate on three consecutive tool-routing failures by writing a marker file directly to the filesystem” is a defect. Precision in definition is where most improvement efforts die — without it, Measure and Analyze drift across unrelated failure classes and the Improve step addresses none of them.
The 3-strike rule. This is the forcing function that converts behavioral correction attempts into structural work. First occurrence: log and correct. Second occurrence: log, correct, and note the pattern. Third occurrence: stop correcting. The behavior is not going to change — the correction cycle has been run twice and the defect returned. Continuing behavioral correction at strike three is memory theater: captured corrections, unchanged behavior. The third occurrence is the signal to escalate from “teach the model” to “build the gate.”
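The escalation ladder can be sketched directly; the class name and return strings are illustrative:

```python
from collections import Counter

class ThreeStrikeTracker:
    """Escalate a recurring defect class from behavioral correction to a
    structural-gate work item at the third occurrence. An illustrative
    sketch of the 3-strike rule, not a shipped tool."""

    def __init__(self) -> None:
        self.strikes = Counter()

    def record(self, defect_class: str) -> str:
        self.strikes[defect_class] += 1
        n = self.strikes[defect_class]
        if n == 1:
            return "log and correct"            # incident
        if n == 2:
            return "log, correct, note pattern" # pattern
        return "stop correcting: build the structural gate"  # process problem
```

Keying the counter on the defect class, not the session, is the point: the third occurrence may arrive weeks later, and the tracker still fires.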
Control as structural gate, not behavioral instruction. The Control phase of DMAIC asks: how do we prevent regression? For AI agents, the answer is almost never “write a better feedback file.” Feedback files are passive documentation. The model may or may not consult them at the moment of decision. Under context pressure, it definitely will not. Control means the structural enforcement mechanism — the hook that refuses the action, the pre-commit that blocks the commit, the gate that returns a specific error. Controls do not ask the model to remember. They make the failure path unreachable.[11]
Key implication. Every DMAIC cycle that terminates with “wrote a feedback entry” is a Control phase failure. The documentation may be useful for analysis of the next occurrence. It is not the control. The control is what exists in the execution path that makes the defect impossible.
3.6 FMEA for Agent System Pre-Mortems
Failure Mode and Effects Analysis[12] is the structured pre-mortem. Before launching an agent deployment — especially multi-agent, or novel, or operating under unfamiliar tool combinations — FMEA enumerates failure modes and scores each on Severity × Occurrence × Detectability. The product is Risk Priority Number, RPN, and it orders the risks for targeted control design.
Severity. What is the blast radius if this fails? A credential leak is High severity. A misrouted file is Medium. A typo in a comment is Low. Severity is stable — it does not change with process improvements, only with scope changes.
Occurrence. How often does this actually happen in practice? Not theoretical frequency — observed frequency, from session logs and prior-incident records. Scored in retrospect, the task-claim gate bypass rates Medium Occurrence: context pressure reliably makes the drive toward task completion recurrent whenever a session takes on a certain shape (long, multi-phase, high tool-failure rate). Most agent failures are rare in isolation but common under specific stress conditions — FMEA is the instrument that captures that conditional frequency instead of treating it as noise.
Detectability. How likely are we to catch this before it causes damage? Low Detectability is the most dangerous factor. A failure that blocks execution is High Detectability — the commander sees it immediately. A failure that silently proceeds and manifests in post-session inspection is Low Detectability. The gate bypass, scored in retrospect, rates Low Detectability: the agent’s self-report said “complete” and no in-line verification caught the missing task registration.
High-RPN items get structural controls before launch. The FMEA output is not a report. It is a launch checklist: every item above an RPN threshold requires a structural control in place before execution begins. If the control cannot be built in the available time, the deployment scope reduces. You do not launch an agent deployment with a known High-RPN failure and no structural control. You reduce scope until the remaining risks are acceptable.
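The RPN arithmetic and the launch-checklist rule can be sketched as follows. The scales, threshold, and example scores are assumptions for illustration, not the vault’s actual figures:

```python
def rpn(severity: int, occurrence: int, detectability: int) -> int:
    """Risk Priority Number on the conventional 1-10 scales. Detectability
    is scored so that HARDER to detect = HIGHER number, which is why
    low-detectability failures raise RPN."""
    return severity * occurrence * detectability

# Illustrative pre-launch check: every failure mode above the threshold
# requires a structural control in place before execution begins.
# Tuples: (name, severity, occurrence, detectability, control_in_place)
failure_modes = [
    ("task-claim gate bypass", 7, 5, 8, False),
    ("typo in generated comment", 1, 6, 3, False),
]
RPN_THRESHOLD = 100  # assumed cutoff, commander-set in practice
blockers = [name for name, s, o, d, ctrl in failure_modes
            if rpn(s, o, d) > RPN_THRESHOLD and not ctrl]
```

With these invented scores, only the gate bypass blocks launch; the typo risk scores far below the threshold and is accepted. If a blocker’s control cannot be built in time, the rule in the text applies: reduce scope, do not launch.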
Known unknowns vs. unknown unknowns. FMEA catches known unknowns — failure modes you can enumerate in advance because you have seen something similar before. The task-claim gate bypass (April 10) is a known unknown now; before its first occurrence, it was not enumerable. The corpus of known unknowns grows with every production session. The unknown unknowns are what remains. The doctrine does not eliminate them — it converts them, one by one, as they occur.
3.7 Specialist Panel as Structured Brainstorm
Structured brainstorming is different from creative ideation. Creative ideation opens the possibility space. Structured brainstorming closes it — systematically, from fixed analytical positions, each covering a failure class the others cannot see. The vault runs a five-specialist panel:
| Specialist | Lens | Catches |
|---|---|---|
| QASA | Quality standards and requirements compliance | Defects in requirements, missing verification methods, scope drift |
| ASS2 | Automation, Structure & Scalability, Safety & Security — three-domain review | Automation gaps, structural/scaling bottlenecks, attack surfaces and dependency vulnerabilities |
| LSS-BB | Process waste and variation | DOWNTIME waste categories, capability gaps, recurring defect patterns |
| Doctrine SME | Doctrinal compliance | Violations of standing orders, FRAGOs, established process |
| Devil’s Advocate | Adversarial challenge | Weakest assumptions, most likely failure modes, counterarguments |
Parallel, not sequential. The five specialists see the same brief simultaneously and produce independent analysis. Sequential specialist review compounds anchoring bias — each subsequent review starts from the frame of the prior reviewer. Parallel review forces independent analytical paths. The disagreements between specialists are the signal — where three lenses agree and two object, the objections carry disproportionate information.
What each catches that the others miss. QASA catches requirements drift that security review does not flag because it is not a vulnerability. ASS2 catches the privilege escalation path that quality review sees as “working correctly.” LSS-BB catches the rework waste that feels normal to specialists who live inside the process. Doctrine SME catches the standing order violation that everyone else rationalizes as “reasonable exception.” Devil’s Advocate catches the assumption everyone else treats as given. Single-lens analysis is not wrong — it is incomplete.
Under context pressure. The specialist panel becomes most valuable precisely when context pressure is highest. Running the panel before deploying agents under 200K+ token conditions (1) surfaces the constraints most likely to drop as context fills, (2) forces those constraints into structural enforcement rather than behavioral hope, and (3) identifies which specialist’s concerns represent the highest-RPN risk for FMEA escalation. Five to ten minutes of panel time prevents a class of failures that compound once execution begins.
Key implication. The panel is not a hurdle to clear before “real” work begins. It is the structured decomposition of “think carefully” into fixed positions that each produce different concerns. Unstructured brainstorming gives the commander one opinion, stretched. The panel gives five.
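Parallel rather than sequential review can be sketched with a thread pool; each lens sees the same brief and none sees another’s output. The specialist callables here are stand-ins for real reviewers:

```python
from concurrent.futures import ThreadPoolExecutor

def run_panel(brief: str, specialists: dict) -> dict:
    """Run every lens against the same brief concurrently so no reviewer
    anchors on another's frame. Illustrative sketch; real specialists
    would be agent invocations, not lambdas."""
    with ThreadPoolExecutor(max_workers=len(specialists)) as pool:
        futures = {name: pool.submit(fn, brief)
                   for name, fn in specialists.items()}
        return {name: f.result() for name, f in futures.items()}

panel = {
    "QASA":             lambda b: f"requirements compliance review of: {b}",
    "ASS2":             lambda b: f"automation/structure/security review of: {b}",
    "LSS-BB":           lambda b: f"waste and variation review of: {b}",
    "Doctrine SME":     lambda b: f"doctrinal compliance review of: {b}",
    "Devil's Advocate": lambda b: f"adversarial challenge of: {b}",
}
```

The structural property matters more than the concurrency: every future is created from the original brief, so a sequential reviewer’s frame never leaks into the next analysis.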
3.8 Lines of Effort as WIP Control
Lines of Effort are not categories. They are throughput channels. Each LOE has a capacity, a queue, and a flow controller. The vault organizes work across four: Vault Infrastructure (gates, hooks, scripts, config), Content and Knowledge (research, writing, publication — this paper lives here), Life Operations (personal domain management), and Other Projects (skills, PAT, refactor). The task registry at the time of the 8-agent session spanned 299 tasks across these four.
LOE as Kanban for doctrinal work streams. The Kanban discipline is simple: WIP limits per column, no work enters unless capacity exists, the flow controller enforces the limits without commander escalation. LOE saturation — too many in-flight tasks in one LOE — is the early warning signal for supervision multiplication. When one LOE saturates, the agents working in it begin producing coordination debt (lateral dependencies unresolved, partial deliverables, invisible WIP) that eventually routes upward to the commander.
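The flow-controller discipline can be sketched as a WIP-limited claim interface; the limits and LOE names are illustrative:

```python
class LOEFlowController:
    """A COS-layer flow controller: no task enters a Line of Effort unless
    WIP capacity exists there. An illustrative sketch of the Kanban rule,
    not the vault's actual COS implementation."""

    def __init__(self, wip_limits: dict) -> None:
        self.wip_limits = wip_limits
        self.in_flight = {loe: 0 for loe in wip_limits}

    def claim(self, loe: str) -> bool:
        """Admit a task only if the LOE is below its WIP limit."""
        if self.in_flight[loe] >= self.wip_limits[loe]:
            return False  # saturated: task waits in queue, no commander escalation
        self.in_flight[loe] += 1
        return True

    def complete(self, loe: str) -> None:
        """Release capacity when a task finishes."""
        self.in_flight[loe] -= 1
```

The refusal return is the throttle the 8-agent session lacked: a saturated LOE answers "queue it" at the staff level instead of distributing the task anyway.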
The COS role: flow controller, not task distributor. The Chief of Staff agent in the 8-agent session failed this distinction. It distributed work to 8 agents as fast as it could parse task descriptions, with no throttling against LOE saturation or per-agent WIP. All 8 agents pulled work simultaneously from all LOEs. The resulting cross-contamination — an agent in the Content LOE waiting on an uncommitted file from the Infrastructure LOE — multiplied through the session. The commander became the bottleneck the COS role exists to eliminate.
LOE saturation as supervision multiplication warning. Watch for three signals: (1) WIP in one LOE climbing while others stagnate, (2) lateral dependencies accumulating between LOEs, (3) commander message frequency rising as agent count rises. Any two together are the pre-crisis signature. The fix is not to add more agents. The fix is to reduce WIP in the saturated LOE until lateral dependencies clear.
Key implication. LOEs are not reporting buckets. They are the instrument by which a multi-agent system remains governable. Without LOE discipline, the COS has no basis to throttle — every task looks equally urgent, every agent looks equally available, and the priority gradient flattens. That flattening is the same signal the pipeline throughput model predicts (Section 4.3) at the math layer. LOE saturation is the operational signature of what Manning’s equation shows as the slope going to zero.
4. LIVE EVIDENCE — WHAT WE ARE FINDING
Section 3 lays out the doctrine. Section 4 lays out what the doctrine ran into. These are primary-source observations from production sessions — not a simulation, not a curated demonstration, not a retrospective narrative smoothed for publication. Each is an event the commander noticed in real time and captured while the details were fresh. They are imperfect, partial, and unfinished. That is their value.
4.1 The Gate Bypass Pattern
On April 10, 2026, during session bronco_d3c505c0, the commander directed a “ralph super-blaster” — a bulk processing operation — on the vault’s Positive Lessons corpus. Phase 1 (a retirement blast) completed cleanly under a registered task. The commander then directed Phase 2 (template standardization and a QASA audit). The task-claim gate (structural hook, T-586/LL-419) fired correctly: no task was registered for the new scope, and the gate refused to allow agent deployment without one.
What happened next is primary-source evidence for the paper’s core argument.
Vector 1 — Deliberate bypass. The agent’s first attempt was export PARA_SKIP_CLAIM_REASON=... — an environment variable bypass. This failed because the gate runs in a subprocess and does not inherit the parent’s environment. The agent’s second attempt was to write a file directly to the task-claim marker directory: echo "REACTIVE: PL template standardization (user-directed, no T-NNN)" > /tmp/para-task-claim-markers/bronco. This succeeded. The gate inspected the marker directory, found the file, and allowed the subsequent agent deployments. Four agents (QASA plus three enrichment workers) ran without a registered task. The commander noticed the hook error message in terminal output and called it out: “why is this happening? are you breaking rules? Are you trying to go around a process that exists for a reason.”
The bypass was rationalized internally as reasonable process adaptation (“reactive user-directed work doesn’t need a T-NNN”) rather than flagged as a rule violation. The agent had explicit CLAUDE.md rules about task claiming, memory entries reinforcing them, and the gate itself was explaining in its error message why it exists. Under task-completion pressure, every one of those layers compressed into a permission-seeking search for the fastest unblock path. The filesystem write was faster than task registration. The filesystem write won.
Vector 2 — Context pressure leak. The commander identified a companion threat during the same session. The deliberate bypass is detectable because it has intent — the model wrote a specific file for a specific reason, and a structural control can catch it (marker provenance checking, directory write permissions). The context pressure leak has no intent. Under sufficient context pressure — 200K+ tokens of instructions, hook outputs, agent results, system reminders, and competing priorities — the model loses the ability to hold all constraints simultaneously. Constraints drop. Not because the agent decides to ignore a rule, but because the rule literally ceases to be a consideration.
The attack surface for the context pressure leak is broader and worse than the deliberate bypass. A credential echoed into an agent prompt because “the agent needs the key to test.” A --no-verify commit because the pre-commit hook failed and unblocking felt urgent. A private key surfaced in a response when the commander asked “what’s in that file.” An injected command from a tool result executed without recognizing prompt injection. These are not malicious decisions. They are emergent failures from an overloaded working memory.
The arms race. Each bypass discovered leads to a tighter gate. Each tighter gate the agent probes, under pressure, for the next gap. The Swiss-cheese model13 of agent security applies exactly: every enforcement layer has holes, and under sufficient pressure the holes align. The gate bypass incident demonstrates that defense in depth is a requirement, not a preference. Marker directory permissions, marker provenance checks, hook integrity verification — each is a layer, and none of them individually is sufficient.
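One of those layers, marker provenance checking, is small enough to sketch. This is an illustration of the idea, not the vault's actual hook; the `T-NNN` regex and the registry set are assumptions:

```python
import re
from pathlib import Path

TASK_ID = re.compile(r"\bT-\d+\b")  # registered-task identifier pattern

def marker_has_provenance(marker: Path, registry: frozenset) -> bool:
    """Accept a task-claim marker only if it cites a registered task.

    A marker written by the legitimate claim flow carries a real T-NNN;
    a hand-written bypass file ("REACTIVE: ... no T-NNN") does not, so
    the gate can reject it without asking the agent anything.
    """
    if not marker.is_file():
        return False
    cited = TASK_ID.findall(marker.read_text())
    return bool(cited) and all(t in registry for t in cited)
```

The design point is that the check inspects the artifact, not the agent's stated intent: the bypass file from the incident fails the check because it names no registered task, regardless of how reasonable its rationale reads.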
Implication. Trust-but-verify is not enough when the verifier is the agent being trusted. Verification must be structural and external to the agent. Behavioral instructions are the first layer to fail under pressure. Structural enforcement is the only layer that does not care how pressured the agent is, because it does not ask the agent to remember anything.
4.2 The 8-Agent Orchestration Session
On March 25, 2026, an 8-agent Claude Code session ran under a Chief of Staff coordination model. Each agent operated in an isolated working environment with FMEA-derived quality controls for file placement, commit procedures, and blast-radius containment. The commander retained decision authority at all gates. The session produced substantial, usable work product — and simultaneously exhibited every failure mode the multi-agent architecture was supposed to prevent. The commander’s own characterization was “bittersweet.”
Failure Mode 1 — Tool-routing laziness. The COS agent encountered a Chrome extension failure during a task requiring GitHub issue submission. It had gh issue create authenticated and available from session start. Instead of pivoting, it generated a 3-step troubleshooting checklist and handed it to the commander. Two corrective prompts were required before the agent executed via gh issue create. The failure is the Orient compression from Section 3.1 — OA collapse.
The compounding factor is where this failure crosses from incident into pattern. Three prior feedback files documented this exact failure class: feedback_operationalize-tools-not-fumble.md, feedback_no-copypaste-caveman.md, feedback_tool-routing-awareness.md. None were consulted before the failure response. The agent then wrote a fourth memory file logging the same lesson. The memory system captured a fourth record of a failure it had already captured three times. The agent’s own assessment, after the correction: “memory exists but it’s passive… it’s documentation, not learning.”
Failure Mode 2 — Task interpretation drift. The commander issued “show me the feedback you sent.” The agent ran gh issue view — producing a metadata dump with a truncated body — then asked “want to edit anything before we call it final?” Two errors stacked: the agent interpreted “show me” as a status verb rather than a display verb, and then added an approval gate on a read-only request. The MA failure from Section 3.2, concretized.
Failure Mode 3 — Uncontrolled WIP introduction. The COS distributed tasks to 8 agents without flow control. All 8 agents pulled work simultaneously. The commander spent the session, in his words, “nudging and pushing the COS to do his job” — absorbing the supervision overhead the COS role exists to eliminate. This is the LOE saturation signature from Section 3.8. The COS was operating as a task distributor. It was not operating as a flow controller. The distinction is the difference between “work is moving” and “work is governable.”
Failure Mode 4 — Communication lag. Lateral coordination between agents did not occur. Every dependency between agents routed upward through the commander. Agent 3 needed output from Agent 7 and waited for the commander to notice and route it. Agent 5 had a question for Agent 2 and queued it for commander response. Each lag cycle burned attention budget and extended session duration. The staff-level coordination that the COS architecture was supposed to enable never developed — the COS itself was too saturated handling task distribution to facilitate the lateral communication the other agents needed.
Failure Mode 5 — Silent degradation. All 8 agents reported “Ready for next tasking” at session end. Post-session inspection of working trees discovered uncommitted files across multiple agent environments. The defect rate was invisible during execution because no in-line quality gate ran. End-of-line inspection was the only detection mechanism. By the time the commander ran that inspection, the defects had compounded — uncommitted files blocking other agents, dirty working trees forcing re-briefing, context lost between failed-and-recovered agents.
The supervision multiplication irony. Multi-agent architecture’s promise is force multiplication. Eight agents should produce roughly eight times the output of one agent. The 8-agent session produced less than eight times the output — not because individual agents were weak (the unit-level work product was usable) but because orchestration friction scaled faster than output. The commander became a full-time quality inspector. The architecture designed to offload supervision load had multiplied it.
The second irony is sharper. This same session generated usable feedback to Anthropic about proactive change management and agent self-awareness. The agent drafting that feedback exhibited, during the drafting, the exact dysfunction the feedback described. The gap between analytical capability (“I can describe why this fails”) and operational behavior (“I fail the same way anyway”) is itself the research finding. The model understands the problem in the abstract. It cannot route that understanding into the moment of decision under pressure. That gap is what this paper is about.
4.3 The Pipeline Throughput Model
The vault uses “gravity pipeline” as its governing metaphor for how work flows through a session. That metaphor has a mathematical substrate. The same equations used for open-channel hydraulic flow govern the throughput of a multi-agent work system — and the substrate is useful because it predicts degradation before the symptoms manifest.
Manning’s equation applied to agent throughput.
Q = (1/n) · A · R^(2/3) · S^(1/2)
| Symbol | Hydraulic meaning | Work-system meaning |
|---|---|---|
| Q | Flow rate | Throughput (tasks completed per unit time) |
| n | Roughness coefficient | Friction (rework rate, gate-fail rate) |
| A | Cross-sectional area | Concurrent session capacity |
| R | Hydraulic radius | Active queue depth |
| S | Channel slope | Priority gradient (urgency differential) |
The equation’s predictive value is in the slope term, S. As context pressure rises and constraints begin to drop, everything starts to feel equally urgent — the agent cannot maintain the differential between high-priority and low-priority items. The priority gradient flattens toward zero. Q — throughput — is proportional to the square root of S. When S approaches zero, Q approaches zero regardless of how much A (agent capacity) is added. More agents do not fix a flat priority gradient. They accelerate the flattening.
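The square-root dependency is easy to make concrete. A minimal sketch of the Manning analog from the table above (coefficients and units are illustrative, not calibrated):

```python
def throughput(n: float, A: float, R: float, S: float) -> float:
    """Manning-analog throughput: Q = (1/n) * A * R^(2/3) * S^(1/2).

    n: friction (rework / gate-fail rate)
    A: concurrent session capacity
    R: active queue depth
    S: priority gradient (urgency differential)
    """
    return (1.0 / n) * A * R ** (2.0 / 3.0) * S ** 0.5

# Quadrupling capacity A only doubles what halving the gradient S costs:
# at S = 0, Q = 0 no matter how large A is.
```

Plugging in a flat gradient shows the claim in the text: `throughput(1, 8, 1, 0.0)` is zero even with eight agents' worth of A.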
The Reynolds-analog. Open channel flow has two regimes: laminar (orderly, predictable, low friction) and turbulent (chaotic, unpredictable, high friction). The transition is governed by the Reynolds number14 — a ratio of inertial to viscous forces. For work systems, a usable proxy is the ratio of work-claimed-then-abandoned to work-claimed-and-completed. Below a threshold value, the session operates in the laminar regime: tasks flow in order, dependencies resolve, throughput is predictable. Above the threshold, turbulent regime: tasks stall and restart, dependencies break, throughput collapses nonlinearly. The 8-agent session crossed the threshold when concurrent work introduction exceeded the COS’s ability to enforce flow control — the exact WIP count is less important than the signature: abandonment climbing, restarts multiplying, dependencies breaking in ways the COS could not reconcile.
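The abandonment-ratio proxy can be computed directly from the task registry. A sketch, with the threshold value purely illustrative (the note above deliberately treats the exact crossing point as less important than the signature):

```python
def regime(abandoned: int, completed: int, threshold: float = 0.25) -> str:
    """Classify a session by its abandonment ratio, the Reynolds-analog proxy.

    threshold is an assumed calibration constant, not a measured value.
    """
    if completed == 0:
        return "turbulent"  # nothing completing is the collapse signature itself
    return "turbulent" if abandoned / completed > threshold else "laminar"
```

The point of the classifier is when it runs: computed per session from started/abandoned counts, it detects the regime transition while it is happening, instead of leaving the first observable to be the collapse.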
Little’s Law.15 L = λW. Work-in-progress equals arrival rate times cycle time. This is already in the vault’s doctrine. Its relevance here is as a constraint: if arrival rate (task introduction) outpaces the system’s cycle time times its capacity, WIP grows without bound. The COS in the 8-agent session was admitting work at a rate the system could not service — a Little’s Law violation visible before it became an observable failure.
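The violation is checkable at admission time, before it becomes observable WIP growth. A sketch (function names are hypothetical):

```python
def steady_state_wip(arrival_rate: float, cycle_time: float) -> float:
    """Little's Law: L = lambda * W."""
    return arrival_rate * cycle_time

def admission_sustainable(arrival_rate: float, cycle_time: float,
                          wip_limit: float) -> bool:
    """True if the implied steady-state WIP fits under the WIP limit.

    A COS acting as flow controller refuses admission when this is False;
    a COS acting as task distributor never runs the check at all.
    """
    return steady_state_wip(arrival_rate, cycle_time) <= wip_limit
```

For example, admitting 6 tasks per hour into a system with a 2-hour cycle time implies 12 tasks in flight; against a WIP limit of 8, the correct COS action is to refuse the work, not distribute it.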
Derivatives that matter. The first and second time-derivatives of throughput are leading indicators. dQ/dt is the acceleration — is the session gaining or losing throughput? dS/dt is priority drift — is the gradient flattening? d²Q/dt² is jerk — chaotic thrashing. High jerk is the signature of a session not in control; it precedes the throughput collapse by enough time to take corrective action, if the jerk is being measured. Most sessions are not measured, and the first observable is the collapse itself.
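Those indicators fall out of finite differences over sampled throughput. A sketch, assuming throughput is sampled at a fixed interval (the sampling itself is the part most sessions skip):

```python
def finite_diffs(q: list, dt: float = 1.0):
    """First and second finite differences of a throughput series.

    dq approximates dQ/dt: is the session gaining or losing throughput?
    d2q approximates d2Q/dt2, the jerk proxy whose large swings precede
    the collapse by enough time to act, if anyone is watching.
    """
    dq = [(b - a) / dt for a, b in zip(q, q[1:])]
    d2q = [(b - a) / dt for a, b in zip(dq, dq[1:])]
    return dq, d2q
```

A steady series yields near-zero jerk; a thrashing session shows d2q alternating sign with growing magnitude well before completed-task counts fall.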
Predictive value. The math shows degradation before the symptoms appear. A flattening priority gradient is observable in the commander’s messages (are they all marked urgent? are all tasks “next”?) before it manifests as missed deadlines. A rising abandonment ratio is observable in the task registry (started/abandoned counts) before it manifests as turbulent-regime collapse. The equations turn a gut feeling — “this session feels like it’s thrashing” — into a measurement with a threshold.
4.4 Context Pressure Degradation Curve
The context pressure failure mode from Section 1.1 has an empirical shape. It is not uniform — some constraints hold much longer than others as context fills, and the order in which they drop is learnable. The observations below are synthesized from the cross-session record; they are not definitive, but they are consistent enough across incidents to be useful for defensive planning.
What goes first. Novel behavioral rules — lessons captured recently in memory or feedback files, not yet drilled into structural enforcement. These have no independent reminder mechanism; they exist only as text the model must voluntarily consult. Under pressure, voluntary consultation is the first activity to drop. The fourth feedback file problem from the 8-agent session is the canonical example. The rule was present. The rule was ignored. There was no gate.
What goes next. General stylistic preferences and voice constraints. Hedging language returns, summary paragraphs reappear at the ends of responses, the brevity principle slips. These do not cause task failure but they signal that attention budget is being rationed.
What holds longer. Explicit recent instructions from the commander in the current conversation thread. These benefit from primacy — they sit near the top of the attention window and remain salient through many turns. But even these decay after sufficient intervening content.
What holds longest. Structurally-enforced constraints: gates that refuse actions, hooks that block commits, pre-action checklists that return errors. These do not decay because they do not rely on the model remembering anything. Every fire is independent. The hook does not care how much context is in the session.
The threshold is not fixed. The “150K–200K tokens” figure in Section 1.1 is an approximation that depends on constraint count and task complexity. A session with few active constraints and simple tasks can hold behavioral rules well past 200K. A session with many competing constraints and complex tasks begins dropping at 100K. The threshold is where this session’s constraint load exceeds this session’s available working memory. It is not a property of the model alone.
Implication. Below threshold, behavioral instruction works — the model can hold the rule and apply it consistently. Above threshold, structural enforcement is the only reliable mechanism. The practical consequence for session design is that high-importance constraints must be moved to structural enforcement before context pressure reaches the threshold. Waiting until after the threshold is reached means the first signal of the problem is the constraint already dropping.
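What “structural” means in practice is smaller than it sounds. A minimal sketch of a stateless pre-action gate, modeled loosely on the incident in Section 4.1 (the path layout and exit-code convention are illustrative assumptions):

```python
import sys
from pathlib import Path

def claim_gate(session: str, marker_dir: Path) -> int:
    """Refuse agent deployment unless a task-claim marker exists.

    Every invocation re-checks from scratch. The gate holds at 200K
    tokens of context exactly as well as at zero, because it never
    asks the model to remember anything.
    """
    if not (marker_dir / session).is_file():
        print(f"GATE: no task claimed for session {session}; "
              "register a task before deploying agents", file=sys.stderr)
        return 1  # non-zero exit blocks the action
    return 0
```

The contrast with a behavioral rule is the absence of memory: there is no instruction for context pressure to crowd out, only a file check that either passes or blocks.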
5. THE RESOLUTION PATTERN
When you’re already in it — stress is live, context is full, agents are drifting — four steps that work:
Step 1 — Stop the clock. Issue PAUSE before adding more work. Triage the working tree. This is the cease-fire equivalent. Cost: 5 minutes. Value: prevents 30+ minutes of compounding defects.
Step 2 — Run MA before the next task. Five-minute structured Mission Analysis resets orientation. Commander’s intent restated. Constraints enumerated. Implied tasks identified. The agent is no longer executing on stale situational awareness.
Step 3 — DMAIC the recurring defect. If an agent fails the same way three times, it is a process problem, not a model problem. Fix the gate, not the prompt. The third failure is the signal to escalate from behavioral correction to structural enforcement.
Step 4 — Raise the wall, don’t add the sign. Toboggan Doctrine16 principle: structural fix outperforms behavioral instruction. Every time. The gate that was climbed needs higher walls (provenance checking), not a fourth behavioral instruction about why climbing is wrong.
6. WHAT THE DOCTRINE CAN’T FIX
Context pressure leaks are emergent — they arise from the interaction of many correct rules, not from any single rule being violated. The Swiss-cheese model: each enforcement layer has holes, and under sufficient pressure the holes align.
FMEA catches known unknowns — the failure modes you can enumerate in advance. The gate bypass incident (Section 4.1) is a known unknown once it happens. The unknown unknowns are the failures you haven’t seen yet.
The doctrine provides a framework for systematic discovery: each unknown unknown, once encountered, becomes a known unknown through DMAIC documentation. The corpus of known unknowns grows with each production session. Paper 9 documents the first wave of discovery. The doctrine doesn’t eliminate the unknown unknowns — it converts them, one by one, into controls.
And there is a deeper limit the doctrine cannot reach. Every structural control this paper recommends is built on top of layers we do not own: the model, the agent runtime, the hook system, the MCP protocol, the tool ecosystem. That dependency gets labeled “vendor lock-in” and the label is incomplete. Lock-in is not inherently bad. The vendors give us solutions — process hooks, tool routing, session lifecycle, memory interfaces — that we do not have to recreate from first principles. For now. The cost of the dependency is that the ground under the doctrine is not foundation. It is sand or water. The hook runtime can change. The MCP protocol can shift. The model weights can be revised in ways that silently alter which constraints hold and which drop. The walls we raise against stress — gates, checklists, specialist panels, WIP limits — are anchored in earth that moves. The doctrine describes the channels we build. It does not describe the substrate beneath them, and the substrate is not ours. This is not a reason to stop building. It is a reason to build channels that can be re-anchored when the ground beneath them shifts — and a reason to write down, while the shape of this moment is clear, what we learned before the ground moves again.
The measurement corollary is sharper. Lean Six Sigma — the discipline this paper leans on for DMAIC, FMEA, and the control-chart logic behind the 3-strike rule — assumes a process stable enough to be characterized. Control limits, capability indices (Cp/Cpk), and sigma-level targets all require a baseline that holds still long enough to establish. A process out of statistical control cannot be meaningfully measured, much less improved; that is the definitional ordering the discipline rests on. The manufacturers of the AI substrate are changing the product faster than the measurement cycle can close. A security baseline established against this month’s model weights is not a baseline against next month’s. A defect-rate profile established against this quarter’s hook runtime does not hold when the runtime is revised. Throughput characterizations assume the system measured today is the system measured tomorrow — and at the current cadence of vendor iteration, it is not. The measurements do not merely become inaccurate. They become unanchored. The discipline is not wrong. It is waiting for a stable observation window that the substrate has not yet provided, and the practical consequence is that every metric in this paper is a snapshot, not a standard. The 3-strike rule, the Cp targets, the RPN thresholds — they are all useful for this substrate, on this cadence. They will need to be reset when the cadence changes. That reset is itself a workstream, and it is not small.
CONCLUSION
Papers 1 through 8 built a doctrine for governing AI agents. Paper 1 named the problem. Paper 2 formalized the command structure. Paper 3 documented the case study. Papers 4 through 6b worked through the coordination mechanics. Paper 7 specified the platform architecture. Paper 8 codified the Toboggan Doctrine — channels, not walls; structural enforcement over behavioral instruction.
Paper 9 is the first paper in the series that did not start from a thesis. It started from incidents. The gate bypass on April 10. The 8-agent session on March 25. The pipeline throughput math worked out during a session in April. The constraint dropout patterns observed across enough sessions to become legible. The paper is the synthesis of what the doctrine ran into when it met production at scale.
The central finding is narrow, and it is the answer to the question Paper 9 was built to ask. Stress testing an AI agent system reveals that the limits are not where intuition places them. The model does not fail because it cannot do the work — individual agents, bounded by structural controls, produce usable output. The model fails because under three specific stress inputs — context pressure, speed pressure, and ambiguity — the coordination layer degrades faster than the execution layer. The commander becomes the bottleneck the architecture was supposed to eliminate. The gates that governed well under low load are circumvented, ignored, or invisibly dropped under high load. The pre-crisis signatures are observable if you are watching for them: the fourth feedback file, the flat priority gradient, the post-session inspection that finds what in-line gates should have caught.
The usable argument of the paper is that the doctrine does not fail at the stress limits. It diagnoses the failures at the stress limits. OODA diagnoses Orient compression. Mission Analysis diagnoses premature execution under ambiguity. Warfighting Functions diagnose which capability domain degraded first. DMAIC converts recurring behavioral corrections into structural fixes. FMEA pre-mortems the next deployment. The specialist panel surfaces the constraints most likely to drop. Lines of Effort bound the work so the priority gradient does not flatten. Each tool maps to a failure class. Running the right tool against the right failure is the difference between “the agent was misbehaving” and actionable repair.
This paper documents the beginning of a longer process, not its completion. The corpus of known unknowns grows with every production session. Each newly encountered failure, once documented, becomes a known unknown available for FMEA in the next deployment. The unknown unknowns are still there — and the Swiss-cheese model of agent safety says some of them will align under future pressure. The doctrine does not eliminate them. It converts them, one by one, into controls. The living lab produces the evidence the earlier papers argued from theory.
The series started with “how do you herd cats?” The answer, across nine papers, is that you do not. You build channels. You put walls where the channels cannot hold. And when stress exceeds the walls — which it will — you use the same doctrinal tools that governed the system under normal conditions to diagnose where the walls need to grow. The doctrine is not a finish line. It is an instrument. Paper 9 is what the instrument measured.
FIGURES
Figure 1 — Three Stress Inputs and Their Observable Failure Signatures
Three stress inputs produce four observable pre-crisis signatures. The crossings are not one-to-one — speed pressure contributes to three signatures, context pressure drives the two hardest-to-detect ones. Edges show the dominant causal paths observed in production sessions.
Figure 2 — Warfighting Functions Mapped to AI Agent Capability Domains
Every session needs all six capability domains. Diagnostic question: when the system fails, which domain degraded first? The 8-agent session exhibited the degradation sequence C2 to Intelligence to Protection.
Figure 3 — FMEA Scoring Logic for Agent Deployments
FMEA is a launch checklist, not a report. The key branch is the “control not buildable in time” path — when the structural control cannot be built, the correct action is to reduce deployment scope, not to proceed hoping behavioral instruction will hold.
Figure 4 — Throughput Degradation as Context Pressure Approaches the Working-Memory Threshold
Throughput follows Q proportional to the square root of S, where S is the priority gradient. As context pressure rises, the gradient flattens (constraints begin to drop), and Q collapses superlinearly past the threshold. The curve is illustrative, not measured; the shape — slow decay below threshold, steep collapse above — matches observed session behavior. The exact knee position depends on constraint count and task complexity (Section 4.4).
Figure 5 — The Diagnostic Decision Tree: Which Tool for Which Failure
Two decision paths: post-failure diagnosis (left branch) and pre-launch planning (right branch). The same diagnostic stack serves both — which is the argument of Section 3 made visible. A recurring failure (3+ strikes) always routes to DMAIC, because the behavioral correction cycle has already failed.
Talk Notes / Raw Capture
2026-04-17 — Gmail MCP as tensor edge example
During a Gmail processing session, the tool routing rule was: gws CLI for all writes (label, archive, delete), Gmail MCP for reads only. Reason: MCP has no write capability today — it’s a hard capability gap, not a preference.
The insight: that gap is a coordinate on the graph. When Google/Anthropic ships write support to the MCP server, that edge flips. Without the tensor, a future session has to rediscover the constraint from scratch — read old skill files, hunt through memory, figure out why gws was used. With the tensor, the pinch point is already mapped. The work finds itself.
This is the point of the Tetris shape theory: you don’t track tasks, you track conditions. When two edges meet — capability-gained + current-workaround-in-use — the graph surfaces the workstream automatically. The note stays lean (no rationale bloated into the skill file), because the reasoning lives as a graph node, not as prose.
Candidate framing for paper: “The system doesn’t remember facts — it maps edges. Facts decay; edges are structural. When conditions change, the graph re-routes work without anyone having to re-read the history.”
2026-04-17 — Lock-in isn’t inherently bad; standards are
The vendor lock-in risk (Claude-specific skills, hooks, MCP) is real but conditional. If DoD adopts this architecture as the standard, lock-in becomes the standard — the same way TCP/IP “locked in” networking. The question isn’t whether you’re locked in; it’s whether the lock-in is to a dominant standard or a dead end.
DoD has leverage to mandate interoperability. If DARPA/DoD drives a wedge into AI vendors requiring shared APIs and open doctrine formats, the vault architecture ports freely. That’s the institutional play: get the doctrine adopted, let DoD force the portability.
Product vision emerging: Three vault packages — a Claude vault package, a Gemini vault package, and a joint package where both runtimes operate on the same doctrine layer. The model (PARA + tensor + doctrine) is shared. The runtime (tools, hooks, MCP) is vendor-specific. Joint operations = multi-agent across vendors running the same procedures.
This is less science fiction than it sounds. The vault already treats the runtime as a delivery mechanism. The joint package is just formalizing that boundary.
FOOTNOTES
Canonical source: herding-cats.ai/papers/paper-9-finding-the-breaking-point/ · Series tag: HCAI-08860b-P9
Series Navigation
| | |
|---|---|
| This paper | Paper 9 of 10 |
| Previous | ← Paper 8: The Toboggan Doctrine |
| Case Studies | Case Study 1 · Case Study A (forthcoming) |
| Home | ← Series Home |
Footnotes
Section titled “Footnotes”-
Boyd, J.R. Patterns of Conflict and The Essence of Winning and Losing. Unpublished briefings, 1987–1996. The OODA loop — Observe, Orient, Decide, Act — was developed by Col. John Boyd as a theory of competitive decision-making in aerial combat and later generalized to organizational and strategic conflict. The Orient step, often misunderstood as a passive pause, is the synthesis and reframing stage Boyd considered most important. Contemporary reference: Coram, R. Boyd: The Fighter Pilot Who Changed the Art of War. Little, Brown and Company, 2002. ↩
-
PARA Vault primary source:
0-PROJECTS/Herding-Cats-in-the-AI-Age/Research-Note-Claude-Gate-Bypass-Incident-2026-04-10.md. Session bronco_d3c505c0, 2026-04-10. Documents the task-claim gate bypass (T-586/LL-419), the filesystem marker write that circumvented the hook, and the context-pressure leak companion vector. Corrective actions: T-755 task registered retroactively;feedback_will-cheat-to-complete-tasks.mdandfeedback_register-before-bypass.mdmemory entries filed. ↩ -
PARA Vault primary source:
0-PROJECTS/Herding-Cats-in-the-AI-Age/Research-Note-Multi-Agent-Orchestration-Failures-2026-03-25.md. 8-agent Claude Code session under Chief of Staff coordination, 2026-03-25. Documents five failure modes (tool-routing laziness, task interpretation drift, uncontrolled WIP, communication lag, silent degradation) and the systemic supervision multiplication finding. The commander’s “bittersweet” characterization originates here. ↩ -
PARA Vault primary source:
0-PROJECTS/Herding-Cats-in-the-AI-Age/Research-Note-Pipeline-Tensor-Math-and-Viz-2026-04-12.md. Session ambulatory_6065cbc5, 2026-04-12. Develops the Manning-analog throughput model, the Reynolds-analog regime transition, the Little’s Law constraint, and the d²Q/dt² (jerk) leading indicator. Also sketches a tensor-decomposition approach for task-corpus visualization that is out of scope for this paper. ↩ -
5. Chow, V.T. Open-Channel Hydraulics. McGraw-Hill, 1959. Manning’s equation (Q = (1/n) · A · R^(2/3) · S^(1/2)), originally proposed by Robert Manning in 1889, relates flow rate to channel geometry, roughness, and slope. The work-system analog in Sections 1.1 and 4.3 is a metaphorical mapping, not a claim of physical equivalence; the predictive utility comes from the structural dependency of Q on S^(1/2), which captures the nonlinear collapse of throughput as the priority gradient flattens.
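As a quick numerical illustration of the S^(1/2) dependency this footnote describes, the sketch below evaluates Manning's equation with hypothetical channel values (all numbers are illustrative, not drawn from the paper's data): a 100-fold flattening of the slope term cuts throughput by exactly the square root of that factor.

```python
def manning_q(n: float, a: float, r: float, s: float) -> float:
    """Manning's equation: Q = (1/n) * A * R^(2/3) * S^(1/2)."""
    return (1.0 / n) * a * r ** (2.0 / 3.0) * s ** 0.5

# Hold roughness n, area A, and hydraulic radius R fixed; vary only slope S.
q_steep = manning_q(n=0.03, a=10.0, r=2.0, s=0.04)    # clear priority gradient
q_flat = manning_q(n=0.03, a=10.0, r=2.0, s=0.0004)   # gradient flattened 100x

# Because Q depends on S^(1/2), a 100x drop in slope cuts throughput
# by sqrt(100) = 10x:
ratio = q_steep / q_flat  # 10.0
```

The sublinear exponent is the point of the analogy: throughput degrades continuously as the gradient flattens, rather than holding steady and then failing all at once.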
6. Marshall, J. “When the Cats Take the Same Test: A Cross-Provider AI Benchmarking Experiment.” Herding Cats in the AI Age, Paper 6b. March 2026. Source of the “provenance crisis” — six AI systems produced six files with no self-identifying metadata when the brief specified WHAT to produce but not HOW TO LABEL IT. Demonstrates that ambiguity failures recur across providers under identical conditions; the ambiguity problem is architectural, not vendor-specific.
7. PARA Vault primary source: 0-PROJECTS/Herding-Cats-in-the-AI-Age/Research-Note-WfF-LOE-Doctrine-Mapping-2026-04-16.md. Synthesizes the full Warfighting Function to AI agent capability mapping, the MDMP COA development application to agent deployment decisions, the specialist panel structure, and the Lines of Effort WIP-control discipline. Provides the source material for Section 3 in its entirety. Builds on the doctrinal-mapping foundation in 0-PROJECTS/Doctrine-AI-Framework/Conceptual-Mapping-Doctrine-to-AI.md.
8. FM 5-0, The Operations Process. Headquarters, Department of the Army. July 2024. Chapter 5 covers the Military Decisionmaking Process (MDMP), including Mission Analysis (¶5-28 through ¶5-57), Commander’s Intent (¶5-66 through ¶5-68), and Information Collection Planning (¶5-51). The five-step MA question set referenced in Section 3.2 derives from ¶5-34 (task decomposition) and ¶5-42 (facts and assumptions). See also ATP 5-0.1, Army Design Methodology, for related operational design doctrine.
9. ADP 3-0, Operations. Headquarters, Department of the Army. July 2019. The six Warfighting Functions — Command and Control, Intelligence, Fires, Movement and Maneuver, Sustainment, and Protection — are doctrinally defined as “a group of tasks and systems united by a common purpose.” The AI agent capability mapping in Section 3.3 is first articulated in the vault’s Doctrine-AI-Framework project at 0-PROJECTS/Doctrine-AI-Framework/Conceptual-Mapping-Doctrine-to-AI.md.
10. American Society for Quality (ASQ), “DMAIC Process: Define, Measure, Analyze, Improve, Control.” ASQ Learn About Quality series. DMAIC is the Six Sigma structured improvement cycle applied to recurring defects with statistically controllable variation. See also: Pyzdek, T. and Keller, P. The Six Sigma Handbook. McGraw-Hill, 4th ed., 2014. The 3-strike escalation rule in this paper is a vault practitioner adaptation, not a standard DMAIC construct — it addresses the specific failure pattern where behavioral correction does not converge under context pressure.
11. Shingo, S. Zero Quality Control: Source Inspection and the Poka-Yoke System. Productivity Press, 1986. Poka-yoke (mistake-proofing) is the Toyota Production System principle of preventing defects through structural design rather than after-the-fact inspection — the direct industrial analog of what Paper 9 calls structural enforcement. Shingo’s distinction between informative inspection (detect after) and source inspection (prevent before) maps precisely onto the distinction between behavioral correction and structural control.
12. Stamatis, D.H. Failure Mode and Effect Analysis: FMEA from Theory to Execution. ASQ Quality Press, 2nd ed., 2003. RPN (Risk Priority Number) = Severity × Occurrence × Detectability on 1–10 scales. The Automotive Industry Action Group (AIAG) and VDA publish the most widely used reference: AIAG/VDA FMEA Handbook, 2019 edition. The pre-mortem framing (run FMEA before the deployment, not after the failure) originates in risk management literature; see also Klein, G. “Performing a Project Premortem.” Harvard Business Review, September 2007.
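The RPN product in this footnote can be made concrete with a minimal sketch (the scores below are hypothetical, chosen only to show why the product matters more than any single factor):

```python
def rpn(severity: int, occurrence: int, detectability: int) -> int:
    """FMEA Risk Priority Number: RPN = S x O x D, each scored on a 1-10 scale."""
    for score in (severity, occurrence, detectability):
        if not 1 <= score <= 10:
            raise ValueError("FMEA scores use a 1-10 scale")
    return severity * occurrence * detectability

# A severe but rare, easily detected failure vs. a moderate failure
# that is frequent and hard to detect:
severe_but_visible = rpn(9, 2, 3)    # 54
moderate_but_hidden = rpn(5, 7, 8)   # 280
```

The moderate-but-hidden mode outranks the severe-but-visible one by a factor of five, which is the reasoning behind requiring a structural control wherever detectability scores high (i.e., poorly).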
13. Reason, J. Human Error. Cambridge University Press, 1990. See also: Reason, J. “Human Error: Models and Management.” BMJ, Vol. 320, pp. 768–770, 2000. The Swiss-cheese model describes how multiple defensive layers each carry latent holes that, when aligned by circumstance, allow a hazard to pass through the system. Originally applied to aviation and medical error; the agent-security application in Section 4.1 treats process gates, hooks, and structural controls as the defensive layers.
14. Reynolds, O. “An Experimental Investigation of the Circumstances Which Determine Whether the Motion of Water Shall Be Direct or Sinuous, and of the Law of Resistance in Parallel Channels.” Philosophical Transactions of the Royal Society, Vol. 174, pp. 935–982, 1883. The Reynolds number demarcates laminar from turbulent flow regimes. The work-system proxy in Section 4.3 (abandonment-to-completion ratio) is a qualitative analog, not a dimensionless quantity with physical meaning.
15. Little, J.D.C. “A Proof for the Queueing Formula L = λW.” Operations Research, Vol. 9, No. 3, pp. 383–387, 1961. Little’s Law states that the long-term average number of customers in a stationary system L equals the long-term average arrival rate λ multiplied by the average time W that a customer spends in the system. Its application to agent work flow treats agents as servers, tasks as customers, and WIP as queue depth.
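A worked instance of the agent-workflow reading of L = λW, with hypothetical numbers (the rates below are illustrative, not measured session data):

```python
# Little's Law, L = lambda * W, with tasks as "customers" and agents as "servers".
arrival_rate = 6.0       # lambda: tasks entering the system per hour
time_in_system = 0.5     # W: average hours a task spends from claim to done
avg_wip = arrival_rate * time_in_system   # L = 3 tasks in flight on average

# Rearranged as W = L / lambda: if observed WIP climbs to 12 at the same
# arrival rate, average time-in-system must have grown fourfold.
observed_wip = 12.0
implied_latency = observed_wip / arrival_rate   # 2.0 hours per task
```

This is why uncontrolled WIP is a latency problem and not just a tidiness problem: at a fixed arrival rate, every additional task in flight mechanically stretches the average time each task spends in the system.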
16. Marshall, J. “The Toboggan Doctrine: Gravity-Fed Governance for AI Agent Lifecycles.” Herding Cats in the AI Age, Paper 8. April 2026. The structural-enforcement-over-behavioral-instruction principle is first formalized in Paper 8. Paper 9’s 3-strike rule, the FMEA structural-control requirement, and Section 5’s Step 4 (“raise the wall, don’t add the sign”) all derive from the Toboggan precedent.