
Paper 10 — The Stance: Why This Series Exists

An Innovation Thesis for Governing AI Agents at Scale


Jeep Marshall, LTC, US Army (Retired) · Airborne Infantry | Special Operations | Process Improvement · April 2026

The Herding Cats in the AI Age series is a practitioner’s research program assembled over seventy-five days of continuous multi-agent operations inside a single knowledge vault — 270+ Claude Code sessions, 498 registered tasks, 8,000+ git commits, 398 documented lessons learned. The thesis of the series, named plainly here so readers can decide whether to keep going: AI does not need more intelligence. It needs doctrine, process discipline, and quality assurance. Nine papers follow that claim into specific corners — what the problem looks like, why civilian AI teams keep failing at it, and which long-established military and industrial frameworks already solved it. This paper is the front door. It states the stance, names the problem, introduces the approach, and points readers to the nine papers that develop each piece in depth. If you read only one paper in this series, read this one first.

Every few years an industry convinces itself that the problem it keeps failing to solve will yield to a smarter version of the same thing it has been trying. AI is in that moment now.

Gartner projects that 40% of enterprise agentic AI projects will be canceled by the end of 2027 due to rising costs, unclear value, and weak risk control.1 A Deloitte survey of 3,235 global leaders found that only one in five companies has mature governance for AI agents. The market is projected to surge from $7.8 billion to over $52 billion by 2030, and most of that money will be spent on agents that organizations cannot reliably govern. The industry’s response has been to ask for larger models, longer context windows, better reasoning traces, and more guardrails.

This series takes a different stance.

The models are not the problem. A frontier AI system today clears every cognitive bar that the 2015 research community said it would take thirty years to reach. The constraint is not raw capability. The constraint is coordination — getting one agent to finish what another agent started, getting a team of agents to agree on what “done” means, getting the same agent to produce the same quality output on Tuesday that it produced on Monday, getting any of it to hold up when the user steps away for a day.

The military solved this problem in the 1790s. Industrial manufacturers solved it again in the 1950s. Both solutions are documented, battle-tested, and freely available. Neither one was cited in the research papers that the current agentic AI platforms were built on. The AI industry is, in the most literal sense, rediscovering things other disciplines already know.

The stance of this series is that the rediscovery is wasteful and the evidence supports borrowing rather than inventing. Specifically:

  1. Military doctrine (FM 5-0 Planning, ADP 6-0 Mission Command, ATP 5-0.1 MDMP, the OODA loop) provides a coordination language for autonomous agents operating under a shared commander’s intent. This is the exact problem agentic AI is trying to solve. The doctrine is eighty years old and has been stress-tested at scales the AI industry has never touched.

  2. Lean Six Sigma (DMAIC, FMEA, Cp/Cpk, process sigma, poka-yoke) provides a quality framework for measuring and improving repeatable processes. Agentic AI is a repeatable process. Measuring it the way a Toyota plant measures a stamping line reveals defect modes that no amount of model evaluation benchmarks will surface.

  3. Information theory and systems engineering (Shannon, Little’s Law, Reynolds number, Swiss-cheese models) provide the mathematical grounding for reasoning about signal, throughput, and failure propagation in multi-agent systems.
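The quality metrics named above have standard textbook definitions that are easy to apply to agent sessions. A minimal sketch, assuming session-level defect counts as the input (the 12-of-270 example figures are illustrative, echoing the session counts the series reports, not a published measurement):

```python
from statistics import NormalDist

def process_sigma(defects: int, opportunities: int) -> float:
    """Convert a defect count into a short-term process sigma level.

    Uses the conventional DPMO-to-sigma mapping, including the 1.5-sigma
    shift assumed in standard Six Sigma practice.
    """
    dpmo = defects / opportunities * 1_000_000
    yield_fraction = 1 - dpmo / 1_000_000
    # z-score of the observed yield, plus the conventional 1.5-sigma shift
    return NormalDist().inv_cdf(yield_fraction) + 1.5

def cpk(mean: float, std: float, lsl: float, usl: float) -> float:
    """Process capability index: distance to the nearer spec limit
    in units of three standard deviations."""
    return min(usl - mean, mean - lsl) / (3 * std)

# Illustrative: 12 defective sessions out of 270 -> roughly 3.2 sigma,
# i.e. an ordinary industrial process, nowhere near Six Sigma quality.
sigma = process_sigma(12, 270)
```

The point of the sketch is the translation, not the arithmetic: once a session is treated as a unit of production with countable defect opportunities, the whole Six Sigma toolkit applies unchanged.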
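Little's Law, one of the tools named in point 3, is a one-line relationship (L = λW: items in the system equal arrival rate times average time in system), but it does real work when applied to a task registry. A hedged sketch using the series' own headline figures (498 tasks, 75 days); the 3-day average cycle time is an assumed value for illustration:

```python
def littles_law_wip(arrival_rate: float, avg_time_in_system: float) -> float:
    """Little's Law, L = lambda * W: the average number of items
    concurrently in the system, given arrival rate and cycle time."""
    return arrival_rate * avg_time_in_system

# 498 registered tasks over 75 days ~ 6.64 tasks/day arriving.
# If a task averages 3 days from registration to close (assumed),
# then ~20 tasks are in flight at any given moment.
wip = littles_law_wip(498 / 75, 3.0)
```

That in-flight count is exactly the quantity the handoff failures in the next section are about: twenty concurrent tasks means twenty running estimates that must survive a session boundary.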

None of this is speculative. Each paper in the series documents a specific mechanism, a specific experiment, or a specific outcome from applying these frameworks to live AI operations. The papers are written from the inside of a working system, not from a whiteboard. When a paper claims that a technique produces a measurable result, the result is a count of commits, a process sigma value, or a percentage of successful runs — not a demonstration video.

The coordination failure in civilian AI has four recurring signatures. They appear in every case study in this series and every post-mortem the author has worked through during the research program.

Signature one: the agent that forgets the mission. Given a clear goal at session start, the agent drifts mid-execution and ends the session having worked on something adjacent but not what was asked. The military calls this a failure of mission command; Lean Six Sigma calls it scope creep. Both disciplines have specific countermeasures. The civilian AI response has been to ask the model to summarize the goal more often.

Signature two: the agent that cannot hand off to the next agent. One session produces an artifact; the next session inherits it and cannot tell what was shipped, what was deferred, or what assumptions were in play. In doctrine this is a failure of the running estimate. In Six Sigma it is a broken SIPOC. In civilian AI it is solved by writing longer README files.

Signature three: the agent that ships defects past every review. Self-validation passes, the supervisor signs off, the after action review (AAR) grades the session green — and then twenty-six hours later the production system crashes because a field was typed as a string instead of an integer. In military doctrine this is covered by the rehearsal and backbrief process. In quality engineering it is the fundamental reason for design reviews and FMEA. In civilian AI the current fix is usually “add another eval.”

Signature four: the team that never learns. The same defect shows up in three separate sessions, each time discovered fresh, each time logged as a novel finding, each time promised a fix that never lands. This is the absence of a continuous improvement loop. CPI solved this in manufacturing. The after action review solved this in the Army. Civilian AI currently has no standard mechanism for it.

Every paper in this series is an application of one or more of those existing remedies to one or more of those signatures. The series is not a catalog of new inventions. It is a map of what already works.

The research program that produced this series is a single practitioner’s Obsidian vault — a PARA-method knowledge base that doubles as the operational surface for a multi-agent AI team. Claude is the current implementation; other tools join at defined programming points. The vault is not a demo. It is the author’s working second brain, used daily for memory assistance, caregiving coordination, writing, and operational work. Everything the papers describe was invented because the author needed it to function.

That constraint matters. A coordination framework that only works on a benchmark is not a coordination framework. The frameworks described in this series are the ones that survived contact with real work: writing papers while managing an inbox, running a research program while coordinating medical appointments, orchestrating agents while maintaining a household. The doctrine proved itself by keeping the lights on, not by winning an eval leaderboard.

The governance architecture that emerged has a name — the Toboggan Doctrine (Paper 8) — and a shape. Agents enter the channel loaded with templates, knowledge wells, and pre-made decisions. Gravity pulls them downhill through a pre-execution gate, an execution phase, and a completion gate. At the bottom, an after action review captures lessons that feed back up the continuous improvement loop, updating the templates for the next session. The system improves itself without dedicated improvement effort. The factory worker does not push the template; the template pushes the factory worker.
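The channel's shape can be pictured in a few lines of code. This is a hedged sketch of the structure the paragraph describes — pre-execution gate, execution phase, completion gate, AAR feedback — not the series' actual implementation; the `Template` fields, gate checks, and function names are all illustrative stand-ins:

```python
from dataclasses import dataclass, field

@dataclass
class Template:
    checklist: list[str]                               # pre-made decisions loaded at entry
    lessons: list[str] = field(default_factory=list)   # AAR feedback, carried forward

def pre_execution_gate(template: Template, mission: str) -> bool:
    # Gate 1: refuse to start without a stated mission and a loaded template.
    return bool(mission) and bool(template.checklist)

def completion_gate(artifact: dict) -> list[str]:
    # Gate 2: collect concrete defects instead of trusting self-validation.
    return [key for key, value in artifact.items() if value is None]

def run_channel(template: Template, mission: str, execute) -> Template:
    if not pre_execution_gate(template, mission):
        raise ValueError("session rejected at the pre-execution gate")
    artifact = execute(mission)                        # execution phase
    defects = completion_gate(artifact)
    # After action review: defects feed back into the template, so the
    # next session enters the channel already knowing this failure mode.
    template.lessons.extend(f"check field: {d}" for d in defects)
    return template

# One pass down the toboggan: the session ships a draft with a missing
# review field, and the template inherits the lesson for next time.
template = Template(checklist=["load commander's intent"])
template = run_channel(template, "draft section 3",
                       lambda m: {"draft": "draft text", "review": None})
```

Note where the improvement lives: `run_channel` mutates the template, not the agent. The next session is pushed by an updated template without anyone scheduling an improvement effort — which is the claim the metaphor compresses.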

That metaphor — the factory worker pushed along a channel by the template, rather than the worker pushing the template — is the compression of the whole series into one image. Every paper is an application of it, an ablation study on it, or a stress test against it.

Each paper stands alone; the series is designed so a reader can enter through whichever one matches their current problem. The recommended path for a first read is the one listed here, but any order works.

Paper 1 — The Super-Intelligent Five-Year-Old names the problem. Current frontier AI clears cognitive benchmarks that a decade ago would have been ranked as superhuman, and still produces the behavioral profile of a capable, unsupervised child. The fix is not more intelligence. The fix is doctrine — the same doctrine that turns capable but individually unreliable humans into reliable teams.

Paper 2 — The Digital Battle Staff traces the coordination framework civilian AI lacks to its origin. Napoleon’s 1795 headquarters and SOCOM’s 2026 agentic AI experiments converged on the same architecture: a staff of specialists coordinating under a shared commander’s intent. The paper shows what the AI industry is missing by showing what the military built instead.

Paper 3 — The PARA Experiment is the first full field report. One practitioner, one knowledge vault, thirty-three days, 1,768 git commits. Twelve of fourteen predicted failure modes appeared within seventy-two hours of multi-agent introduction. This paper is where the series earns its empirical credibility — the failure modes are named before they appear, and then they appear.

Paper 4 — The Creative Middleman is a case study on what breaks without doctrine. Adobe Firefly routes its own users’ prompts to competitors’ models because Firefly cannot render readable text. The paper dissects the coordination failure that produced that outcome and generalizes it to the whole class of AI systems that are thin orchestration layers over unreliable components.

Paper 5 — When the Cats Talk to Each Other is the first AI-to-AI experiment. Two frontier systems with opposing design philosophies are placed in structured dialogue. The output is a formal coordination framework that neither model produced alone. The paper demonstrates that structured conversation between agents — under doctrine — can produce artifacts no single agent can.

Paper 6 — When the Cats Form a Team extends the experiment. Four frontier AI systems are assigned military staff roles (G1, G2, G3, G5) and given a shared commander’s intent. They produce six strategic insights that any solo agent missed. The paper is the most direct empirical demonstration in the series that the staff architecture works.

Paper 6b — When the Cats Take the Same Test is the quality-variance companion. Six AI systems receive identical Commander’s Intent for the same design task. The resulting experimental designs vary wildly in quality. The paper is a rebuke to the industry assumption that “frontier model” is a meaningful quality tier — and a reminder that process controls the output, not the model.

Paper 7 — MDMP Platform Blueprint is the platform spec. It takes the doctrine the earlier papers prove out and translates it into an opinionated system design — a conversational, doctrine-structured platform for multi-agent AI decision-making that works for both a ROTC cadet learning planning and an enterprise commander managing a product launch. This is the paper to hand an engineering lead who has asked, “fine, what would you actually build?”

Paper 8 — The Toboggan Doctrine is the governance synthesis. Template-driven channels outperform hook-based walls. The paper consolidates findings from the entire research program into a single governance framework compatible with OWASP’s Agentic Top 10 and Microsoft’s Agent Governance Toolkit — but inverts their assumption that governance means more enforcement layers. Instead: build the channel, let gravity do the work.

Paper 9 — Finding the Breaking Point stress-tests the doctrine. Where do channels fail? Under what load does the toboggan break? The paper documents the failure modes, the recovery mechanisms, and the adjustments needed to keep the architecture honest at scale. If the first eight papers are the case for the doctrine, this one is the case for its limits.

The series is not a sales pitch and not an academic treatise. It is a field report from a working system, written for practitioners who already know their AI coordination is failing and have run out of patience for advice that amounts to “try a larger model.”

If you lead an AI team inside an enterprise, the papers will give you a doctrine-based language for describing what is going wrong and a specific, already-proven remedy for each failure class.

If you are a researcher in agentic AI, the papers will give you empirical data from a production multi-agent system that you can measure your own ideas against.

If you are a practitioner building for yourself, the papers will give you a toolkit. The vault that produced this research is small enough for one person to operate and large enough to coordinate a real workload. The doctrine scales down to a single user on a single laptop and up to an enterprise staff. The code, the templates, and the lessons-learned corpus are public.

The stance of this series, stated one last time: AI does not need more intelligence. It needs doctrine, process discipline, and quality assurance — all three of which existed before the current generation of models was trained, all three of which work when applied, and all three of which the civilian AI industry has so far declined to adopt.

Build the channel. Let gravity work. Measure the results. Report back.

Canonical source: herding-cats.ai/papers/paper-10-the-stance/ · Series tag: HCAI-b3a051-P10

  1. Gartner, “Predicts 2026: AI Agents Drive Productivity Gains but Face Governance Challenges,” November 2025. The 40% cancellation figure is drawn from the firm’s enterprise AI adoption tracker and covers the twenty-four-month window ending December 2027.