Three-Volume Curriculum

The Agent Engineering Trilogy

Three nested layers around an LLM call — prompt, context, harness — and what their composition produces. Across three courses, one argument: AI engineering work now pays off farther from the raw prompt.

Format: Three one-semester graduate courses + capstone
Modules: 36 + synthesis
Pacing: One module per week (lecture / paper / lab)

Diagram (nested layers): Prompt (innermost) · Context (⊃ Prompt) · Harness (⊃ Context) · Agent (+ Environment)

Foundations

Why prompt, context, and harness belong in one curriculum

The terminology matters because it points to the right failure layer: elicitation, context construction, or runtime control.

Why This Matters

Most real AI system failures are misdiagnosed. A team blames the prompt when the retrieved context is stale. They blame the model when the tool surface is too broad, the verifier is missing, or the loop has no clean stopping condition. They ask for a larger model when the real problem is that the surrounding system is making the model solve the wrong task.

That diagnosis problem is why the trilogy matters. Prompt engineering was the obvious first discipline because early LLM work happened one inference at a time: write a better instruction, get a better answer. Production agent systems retrieve, compress, route, call tools, inspect state, ask for approval, retry, recover, and terminate. The engineering work has moved from phrasing a request to controlling the operating conditions around repeated model calls.

This curriculum teaches that movement as a diagnostic stack. Prompts shape a single inference. Contexts shape what the model can know and attend to at each inference. Harnesses shape what the whole system can do over time. Students who learn the stack can debug systems without superstition: they can ask whether the failure is elicitation, context construction, or control flow, then improve the layer that actually failed.

Three Nested Layers Around an LLM Call

Prompt engineering is the craft of the input string itself — instructions, few-shot exemplars, role framing, output format cues, CoT triggers. The unit is one inference's text and the target is elicitation. As frontier models have grown less sensitive to phrasing, the marginal value here has become more situational and more dependent on the surrounding system.

Context engineering moves the unit up to the entire window as a managed artifact: retrieved documents, distilled memory, tool descriptions, compacted history, examples drawn from a bank. Token budgeting, RAG, semantic caching, memory hierarchies, and context compression all live here. The model still receives one inference window, but code assembles that window.
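
To make "code assembles that window" concrete, here is a minimal sketch of budgeted window assembly. The `Piece` type, the priority convention (lower number means more important), and the whitespace token counter are all assumptions made for illustration; a real system would use the model's own tokenizer and a richer selection policy.

```python
from dataclasses import dataclass


@dataclass
class Piece:
    label: str     # e.g. "system", "retrieved", "history" (hypothetical labels)
    text: str
    priority: int  # assumed convention: lower number = more important


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: whitespace-separated words."""
    return len(text.split())


def assemble_window(pieces: list[Piece], budget: int) -> str:
    """Greedily keep the most important pieces whose text fits the budget,
    then emit the kept pieces in their original order."""
    chosen: set[int] = set()
    used = 0
    for idx in sorted(range(len(pieces)), key=lambda i: pieces[i].priority):
        cost = count_tokens(pieces[idx].text)
        if used + cost <= budget:
            chosen.add(idx)
            used += cost
    return "\n\n".join(
        f"[{pieces[i].label}]\n{pieces[i].text}"
        for i in range(len(pieces))
        if i in chosen
    )
```

The point of the sketch is the division of labor: the model still sees one string, but which material reaches that string is decided by ordinary code that can be tested and versioned.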

Harness engineering is the outermost shell — the program that runs the model in a loop. It owns control flow: when to invoke, which tools to expose, how to route and verify outputs, when to retry, when to spawn subagents, when to terminate, how to sandbox side effects. The harness is deterministic code with the LLM as a subroutine, and prompts, contexts, tool calls, skill loads, and verifiers are all harness state. Claude Code's agent loop, Cursor's editing logic, Devin's planner, Proceda — these are harnesses.
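
A minimal sketch of that control flow, under stated assumptions: the model is any callable from the transcript to an action dictionary, and the action shape (`{"tool": ..., "args": ...}` or `{"final": ...}`) is invented here for illustration rather than taken from any of the products named above. `scripted_model` is a stub standing in for an LLM call.

```python
from typing import Callable

# Assumed interface: a model maps the transcript so far to an action dict.
Model = Callable[[list[str]], dict]


def run_harness(model: Model, tools: dict[str, Callable[..., str]],
                task: str, max_steps: int = 5) -> str:
    transcript = [f"TASK: {task}"]
    for _ in range(max_steps):              # termination: hard step cap
        action = model(transcript)
        if "final" in action:               # termination: model declares done
            return action["final"]
        name = action.get("tool")
        if name not in tools:               # verification: reject unknown tools
            transcript.append(f"ERROR: unknown tool {name!r}")
            continue
        try:
            result = tools[name](**action.get("args", {}))
        except Exception as exc:            # recovery: feed the error back
            result = f"ERROR: {exc}"
        transcript.append(f"{name} -> {result}")
    return "STOPPED: step budget exhausted"


def scripted_model(transcript: list[str]) -> dict:
    """Stub policy: call `add` once, then report the tool result."""
    if len(transcript) == 1:
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"final": transcript[-1].split("-> ")[1]}


answer = run_harness(scripted_model, {"add": lambda a, b: str(a + b)}, "add 2 and 3")
```

Everything except the `model(...)` call is deterministic code, which is why the step cap, the tool whitelist, and the error path can be unit-tested without a model in the loop.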

The progression is one of scope and determinism. Prompts are nondeterministic strings; contexts are deterministically assembled but consumed nondeterministically; harnesses are deterministic programs that compose model calls into reliable systems. As model capability rises, reliability, cost, and task completion become properties of the harness more than properties of the prompt.

Context Engineering Is a Harness Subroutine

Every step of the loop, the harness has to decide what window to present to the next inference: which tool descriptions to include (loading all of them blows the budget; route-then-load is the standard move), which prior turns to keep verbatim vs. summarize, which retrieved docs to inject, which scratchpad/memory entries are still live, which skill files (SKILL.md and friends) to load. Those decisions are context engineering, and they're made anew on every iteration. A harness without context engineering would just concatenate history until the window overflows.
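
The route-then-load move can be sketched in a few lines. The tool catalog, the stopword list, and the keyword-overlap router below are hypothetical; production routers typically use embeddings or a cheap model call, but the shape is the same: pick a few tools first, then pay tokens only for their descriptions.

```python
# Hypothetical catalog; real descriptions are far longer, which is the problem.
TOOL_CATALOG = {
    "search_flights": "Find flights between two airports on a date.",
    "book_hotel": "Reserve a hotel room in a city for given nights.",
    "convert_currency": "Convert an amount between two currencies.",
    "get_weather": "Return the weather forecast for a city.",
}

STOPWORDS = {"a", "an", "the", "in", "on", "for", "of", "is", "what"}


def keywords(text: str) -> set[str]:
    """Lowercased words with trailing punctuation and stopwords removed."""
    return {w.strip(".,?") for w in text.lower().split()} - STOPWORDS


def route(query: str, catalog: dict[str, str], k: int = 2) -> list[str]:
    """Rank tools by keyword overlap with the query; keep at most k with overlap > 0."""
    q = keywords(query)

    def overlap(name: str) -> int:
        return len(q & keywords(catalog[name]))

    ranked = sorted(catalog, key=overlap, reverse=True)
    return [name for name in ranked[:k] if overlap(name) > 0]


def load_tool_block(names: list[str], catalog: dict[str, str]) -> str:
    """Render only the selected tool descriptions for the next window."""
    return "\n".join(f"- {n}: {catalog[n]}" for n in names)


selected = route("What is the weather forecast in Lisbon?", TOOL_CATALOG)
```

Because the harness reruns `route` on every iteration, the tool block in the window can change as the trajectory changes, which is the sense in which context engineering is a harness subroutine.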

The containment is strict: harness ⊃ context ⊃ prompt. Context engineering can serve a single one-shot call, as in a classical RAG pipeline. Harness engineering begins when a program manages repeated model calls, state, tools, and termination.

At higher layers, context engineering becomes a policy: what to load depends on the agent's trajectory. That's where just-in-time skill loading, progressive disclosure of tools, and dynamic compaction live, and it's also where many current reliability gains are coming from.

Agents, Scaffolds, and Harnesses

An agent, in the modern sense, is the (model, harness) pair operating over an environment. The model supplies the policy — token-level decisions about what to say or which tool to call. The harness supplies everything else needed to turn those decisions into sustained, goal-directed behavior over time: the loop, the tools, the memory, the verifiers, the termination conditions, the recovery logic. Without the harness, the system reduces to a chat completion. Without the model, it reduces to a workflow engine. The agent is the composition.

This gives the curriculum a clean arc. Prompt engineering teaches students to shape a single inference. Context engineering teaches them to shape what the model sees across many inferences, treating the window as a managed resource. Harness engineering teaches them to wrap those inferences in a control program that closes the loop with an environment. The agent definition then follows naturally: an agent is a harness whose context-engineering policy and tool-use trajectory are conditioned on its own prior outputs and on environmental feedback. The progression mirrors the move from open-loop to closed-loop control.

One useful framing: the three layers correspond to three different loci of programming. In prompt engineering, the artifact is text. In context engineering, the artifact is a data structure (the assembled window) plus the code that assembles it. In harness engineering, the artifact is a program with the model as a callable. Each layer subsumes the previous and adds determinism on the outside while preserving stochasticity on the inside. Agents appear when the outer program is structured as a perception-action loop.

Scaffold vs. harness

They overlap heavily but the connotations differ, and the drift in usage tracks a real shift in what people are building.

Scaffold came out of the early AutoGPT / BabyAGI / ReAct era and carried the metaphor faithfully: temporary supporting structure around a model with limited planning, memory, and recovery. The connotation was compensatory: add a planner, a memory store, a reflection step. The implicit promise was that better models would eventually let the scaffolding come down, the way construction scaffolding comes down once a building can stand on its own. METR and ARC Evals popularized scaffold in the eval context for roughly this reason: they wanted to measure raw model capability and treated the surrounding code as a confound to be controlled.

Harness has displaced it over the last year or so, and the change in metaphor is doing real work. A harness is permanent apparatus for channeling a powerful system toward useful work; the metaphor points to engines, reactors, and other systems that need durable control surfaces. The surrounding program becomes a first-class engineering artifact whose design determines what the system can do. That matches the empirical record: scaling models alone has left surrounding code in place, and better models have raised the ceiling for well-designed loops. Claude Code, Cursor, Devin, and the SOP-Bench-style systems are harnesses in this sense.

Scaffold and harness remain near-synonyms in many uses, but the shift in framing requires explicit treatment. The scaffold framing predicts that capability gains erode the surrounding code's value; the harness framing predicts that capability gains increase the value of good surrounding code because a more capable model can be pointed at harder problems by a better-designed loop. Which framing turns out to be right is one of the live empirical questions in the field.

The Emerging Literature

The interesting feature is that the field hasn't fully settled — there are at least three competing structural framings even among the canonical pieces. Here is the lineage.

Context engineering — well-established canon

The term has a clean origin story. Tobi Lütke (X, June 18, 2025) argued that context engineering described the core skill more accurately than prompt engineering because the task is to provide enough situational material for the model to solve the problem. Andrej Karpathy amplified the term a week later (X, June 25, 2025) with the context-window formulation that became canonical. Simon Willison codified it on his blog three days later and predicted, correctly, that it would have sticking power. The Lütke-Karpathy-Willison triad is the citation chain most commonly used for the term's popularization.

The institutional follow-up: Anthropic's Effective context engineering for AI agents (Sept 2025), Lance Martin's Context Engineering for Agents on the LangChain blog (the write/select/compress/isolate taxonomy), and Philipp Schmid's Context Engineering Part 2 (2025).

Harness engineering — much more recent, more contested

This one is six months old and the term entered the mainstream through a specific sequence:

  1. Anthropic, Effective harnesses for long-running agents (Nov 26, 2025). First use of "harness engineering" from a frontier lab in a substantive piece.
  2. OpenAI, Harness engineering: leveraging Codex in an agent-first world (Feb 11, 2026). The post that put the term in everyone's mouth. The three-engineer, million-line codebase, zero-typed-code anecdote became the field's reference point.
  3. LangChain, Improving deep agents with harness engineering (Feb 17, 2026). The Terminal Bench 2.0 result — same GPT-5.2-Codex, score moves from 52.8% to 66.5% by changing only the harness, agent climbs from rank 30 to top 5. This became the headline empirical claim.
  4. Vivek Trivedy / LangChain, The Anatomy of an Agent Harness (March 10, 2026). The clearest definitional piece, the source of the canonical model-plus-harness formula, and the best single primary source on the harness-vs-agent relation.
  5. Birgitta Böckeler, Harness engineering for coding agent users. Deserves separate treatment because it uses a different framing: a user harness for coding agents as a specific form of context engineering, organized around feedforward guides, feedback sensors, computational vs. inferential controls, and regulation categories such as maintainability, architecture fitness, and behavior. This is a minority view, but it is the most theoretically careful.
  6. Avi Chawla, The Anatomy of an Agent Harness (April 2026). A different essay with the same title — synthesizes Anthropic, OpenAI, LangChain, and Perplexity into eleven harness components. Recovers Beren Millidge's 2023 model-as-CPU, context-as-RAM, harness-as-OS analogy as the structural backbone.

The all-three-together pieces

These are the articles that try to draw the exact map this curriculum has been building:

On the harness-vs-agent question specifically

The clearest direct treatments:

Agent = Model + Harness. The agent is the emergent behavior; the harness is the machinery producing it.

Three observations

First. Almost none of this literature is academic. Natural-Language Agent Harnesses and the Agent Harness for Large Language Model Agents survey (April 2026) are useful academic and preprint entry points, but much of the active literature is still engineering blogs from the labs and the framework companies. The publication path remains unusually open.

Second. Definitional instability is substantive. The nested-stack framing (prompt ⊂ context ⊂ harness) is dominant, but there are at least three serious challengers: Böckeler's "user harness as a specific form of context engineering," PrivOcto's "context is a subset of harness", and a less-articulated view in which all three are parallel disciplines that overlap. The framing this curriculum uses, three nested layers with the agent as the (model, harness) composition operating over an environment, is the most common reading, with competing readings still active. Böckeler and Trivedy give students the cleanest contrast between the main positions.

Third. The trilogy framing remains nonstandard, though it lines up with the literature's current direction. The closest published analog is Avi Chawla's Anatomy of an Agent Harness piece, which uses the same three-concentric-rings framing. A careful unification piece could still claim open ground, especially if it addresses the Böckeler position and the Natural-Language Agent Harnesses challenge to the layer boundary. The field has the vocabulary but no single canonical reference that unifies the three layers carefully.

Volume I

Prompt Engineering Curriculum

The completing volume of the trilogy. The same one-semester scope and graduate-level pacing, but with a different relationship to its subject than the other two courses. Prompt engineering is older, more empirical, more saturated with folk knowledge, and — as the field has matured — increasingly absorbed into the layers above it. A modern version teaches what those techniques were responses to, which ones survived the move to capable models, and where the discipline goes once context and harness engineering own more of the system behavior. That arc is the spine of the course.

Two framing readings for lecture one

The canonical survey — Schulhoff et al., The Prompt Report: A Systematic Survey of Prompt Engineering Techniques — is a PRISMA-style review of 1,565 papers, producing a vocabulary of 33 terms and a taxonomy of 58 text-only techniques (plus 40 multimodal). It's the single most useful reference document for the syllabus and belongs early as a map of the territory. The companion site is the navigation aid.

The historical framing pairs Schluntz & Zhang's Building Effective Agents with Anthropic's Effective context engineering for AI agents (revisited from the other two courses); together they make the case that prompt engineering is the first layer of a larger stack. Week one establishes why the discipline exists and why its scope has shifted.

Module 01

Foundations: Prompting as the Original Interface

Define what counts as a prompt and trace how the field arrived at the term. Cover the pre-GPT-3 prompting literature (cloze prompts, PET) briefly, then the discontinuity introduced by in-context learning.

Canonical readings

  1. Schulhoff et al., The Prompt Report. Vocabulary and taxonomy chapters.
  2. Liu et al., Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in NLP (2021). The survey of the early prompt-based learning literature, useful for showing that prompting started as a fine-tuning alternative before it became an end-user practice.
  3. Schick & Schütze, Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference (PET), EACL 2021. Assigned for historical grounding.
  4. DAIR.AI, Prompt Engineering Guide. The field's most-used practitioner reference, useful as continuous background.

Module 02

In-Context Learning as the Substrate

Prompt engineering depends on in-context learning, so the discipline's possibilities and limits track ICL's properties. Cover the founding empirical observation, the mechanistic theory, and the unsettling results about what makes ICL work, many of which surprise practitioners.

Canonical readings

  1. Brown et al., Language Models are Few-Shot Learners, NeurIPS 2020. The GPT-3 paper.
  2. Min et al., Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?, EMNLP 2022. Required critical reading — the result that label correctness in demonstrations matters less than label distribution changes the interpretation of every few-shot prompt.
  3. Olsson et al., In-context Learning and Induction Heads (Anthropic, 2022). The mechanistic account.
  4. Xie et al., An Explanation of In-Context Learning as Implicit Bayesian Inference, ICLR 2022.

Module 03

Few-Shot Prompting and Exemplar Design

The most-used technique in the entire syllabus. Cover exemplar selection, ordering effects, the calibration problem, and the smaller-but-real returns as models improve.

Canonical readings

  1. Lu et al., Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity, ACL 2022. The order-sensitivity result: the same exemplars in different orders can swing accuracy by tens of points.
  2. Zhao et al., Calibrate Before Use: Improving Few-Shot Performance of Language Models, ICML 2021. The recency, majority-label, and common-token biases that few-shot prompts induce, and how to correct for them.
  3. Liu et al., What Makes Good In-Context Examples for GPT-3? (2021). The retrieval-based exemplar selection paper.

Lab: Replicate a small order-sensitivity experiment on a current model. The result is usually weaker than it was in 2022, which is itself a lesson in how the field has evolved.
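
One possible starting skeleton for the lab, offered as a sketch rather than a prescribed design: evaluate the same exemplars in every ordering and report the spread. The prompt template and the `predict` interface are assumptions, and `recency_biased_stub` is a stand-in that simply copies the last exemplar's label so the code runs without a model; students would replace it with a real model call.

```python
from itertools import permutations
from statistics import mean, pstdev


def build_prompt(exemplars, query: str) -> str:
    """Few-shot sentiment prompt (hypothetical template)."""
    shots = "\n".join(f"Review: {x}\nSentiment: {y}" for x, y in exemplars)
    return f"{shots}\nReview: {query}\nSentiment:"


def order_sensitivity(exemplars, eval_set, predict) -> dict:
    """Accuracy for every exemplar ordering; `predict` maps a prompt to a label."""
    accuracies = []
    for ordering in permutations(exemplars):
        correct = sum(predict(build_prompt(ordering, x)) == y for x, y in eval_set)
        accuracies.append(correct / len(eval_set))
    return {"min": min(accuracies), "max": max(accuracies),
            "mean": mean(accuracies), "std": pstdev(accuracies)}


def recency_biased_stub(prompt: str) -> str:
    """Stand-in for a model call: copies the label of the last exemplar."""
    return prompt.rsplit("Sentiment: ", 1)[1].split("\n")[0]
```

The number of orderings grows factorially with the exemplar count, so beyond five or six exemplars it is more practical to sample a fixed number of random permutations than to enumerate them all.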

Module 04

Chain-of-Thought: The Technique That Reshaped the Field

CoT is the canonical example of a prompt technique that elicits a model capability. Cover the original paper, the scaling result that gave it weight, and a close reading of why it works.

Canonical readings

  1. Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, NeurIPS 2022. The founding paper.
  2. Wang et al., Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters, ACL 2023. The decomposition of CoT's contribution into "what part of the prompt actually matters."
  3. Madaan & Yazdanbakhsh, Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango (2022). Adjacent critical reading.

Module 05

Zero-Shot Reasoning and Decomposition

Once CoT showed exemplars weren't strictly necessary, a small zoo of zero-shot reasoning techniques followed. Treat them as a family: each is a different prompted control structure over the model's reasoning trajectory.

Canonical readings

  1. Kojima et al., Large Language Models are Zero-Shot Reasoners, NeurIPS 2022. "Let's think step by step."
  2. Zhou et al., Least-to-Most Prompting Enables Complex Reasoning in Large Language Models, ICLR 2023. Decomposition before solving.
  3. Wang et al., Plan-and-Solve Prompting, ACL 2023. The zero-shot upgrade.
  4. Yao et al., Tree of Thoughts: Deliberate Problem Solving with Large Language Models, NeurIPS 2023. Branching over linear CoT.
  5. Besta et al., Graph of Thoughts (2023). For completeness; assign as optional.

Module 06

Self-Consistency, Sampling, and Ensembles

When a single sampled trajectory is unreliable, the next move is to draw many and aggregate. This module is where prompt engineering starts pressing against the boundary with harness engineering: the techniques here are inference-time control structures around the same prompt.

Canonical readings

  1. Wang et al., Self-Consistency Improves Chain-of-Thought Reasoning in Language Models, ICLR 2023. The canonical paper; majority-vote over sampled CoT trajectories.
  2. Madaan et al., Self-Refine: Iterative Refinement with Self-Feedback, NeurIPS 2023. Same-model critique-and-revise as a prompting pattern.
  3. Huang et al., Large Language Models Cannot Self-Correct Reasoning Yet, ICLR 2024. The critical counterpoint; without external feedback, self-refinement often degrades.

Lab: Implement self-consistency on a small reasoning benchmark, vary the sample count, and plot the accuracy/cost curve. The lab gives students a real sense of when ensembling pays for itself.
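
A minimal sketch of the voting step and the accuracy-versus-samples sweep. The sampler here is a stub that returns the right answer with an assumed probability of 0.5 and otherwise picks among three wrong answers; in the lab it would be replaced by "sample a chain of thought at nonzero temperature and parse its final answer." All names are hypothetical.

```python
import random
from collections import Counter


def self_consistency(sample_answer, n_samples: int) -> str:
    """Majority vote over the final answers of independently sampled trajectories."""
    votes = Counter(sample_answer() for _ in range(n_samples))
    return votes.most_common(1)[0][0]


def make_noisy_sampler(correct: str, p_correct: float, rng: random.Random):
    """Stub for one sampled trajectory; the error model is an assumption."""
    wrong = ["41", "43", "44"]

    def sample() -> str:
        return correct if rng.random() < p_correct else rng.choice(wrong)

    return sample


def accuracy_curve(sample_counts, trials: int = 200, seed: int = 0) -> dict:
    """Accuracy of the voted answer for each sample count, over repeated trials."""
    rng = random.Random(seed)
    sampler = make_noisy_sampler("42", p_correct=0.5, rng=rng)
    return {
        n: sum(self_consistency(sampler, n) == "42" for _ in range(trials)) / trials
        for n in sample_counts
    }


curve = accuracy_curve([1, 5, 15])
```

Cost scales linearly with the sample count while accuracy saturates, so plotting `curve` against `n` is enough to see where additional samples stop paying for themselves.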

Module 07

System Prompts, Roles, and Formatting

The "stylistic" half of prompt engineering: how role assignments, output-format directives, delimiter conventions, and structural cues shape behavior. Empirically real, often overstated by practitioners.

Canonical readings

  1. Sclar et al., Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design (2024). The unsettling result that trivial formatting changes (e.g., choice of separator) can swing benchmark accuracy by 76 percentage points on some models. Required reading.
  2. Zheng et al., When "A Helpful Assistant" Is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of Large Language Models (2024). Required reading as a corrective to persona folklore.
  3. Anthropic, Prompt engineering overview. The current practitioner guidance from a frontier lab.
  4. OpenAI, Prompt engineering guide. The parallel guidance from another.

Module 08

Structured Output and Constrained Decoding

Machine-consumed output needs more than polite JSON instructions. Cover the spectrum from "ask nicely for JSON" to logit-masking with grammars.

Canonical readings

  1. Willard & Louf, Efficient Guided Generation for Large Language Models (2023). The Outlines paper — FSM-based constrained decoding.
  2. Geng et al., JSONSchemaBench: A Rigorous Benchmark of Structured Outputs for Language Models (2025). Comparative evaluation of Outlines, Guidance, XGrammar, OpenAI structured outputs, Gemini.
  3. Tam et al., Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models, EMNLP 2024. The cautionary result: constraining format can degrade reasoning quality on the same task. Assign together with the Outlines paper to keep students honest.
  4. Anthropic, Tool use overview. Function calling as the production manifestation.
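
At the "ask nicely" end of that spectrum, the usual safety net is parse, validate, and re-ask. The sketch below shows that pattern only; it is not constrained decoding, which masks logits during generation instead of checking afterwards. The schema, the retry message, and the `call_model` interface are assumptions for illustration.

```python
import json

REQUIRED = {"name": str, "age": int}  # hypothetical target schema


def validate(raw: str) -> dict:
    """Parse raw model text and check required keys and types.
    Raises ValueError on any failure (json.JSONDecodeError is a ValueError)."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    for key, expected in REQUIRED.items():
        if not isinstance(data.get(key), expected):
            raise ValueError(f"field {key!r} missing or not {expected.__name__}")
    return data


def generate_structured(call_model, prompt: str, max_attempts: int = 3) -> dict:
    """Ask, validate, and on failure re-ask with the validation error appended."""
    message = prompt
    for _ in range(max_attempts):
        raw = call_model(message)
        try:
            return validate(raw)
        except ValueError as err:
            message = f"{prompt}\n\nYour last reply was invalid ({err}). Reply with JSON only."
    raise RuntimeError("no valid output within the attempt budget")
```

Grammar-based decoding removes the retry loop by making invalid output unreachable, at the cost the Tam et al. reading warns about; the retry loop leaves generation unconstrained and pays in extra calls instead.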

Module 09

Adversarial Prompting: Jailbreaks and Injection

Cover both attacker- and defender-side prompting. The defender side belongs in a prompt engineering course because the system-prompt design space is where most production defenses live (or fail).

Canonical readings

  1. Perez & Ribeiro, Ignore Previous Prompt: Attack Techniques for Language Models (2022). The original prompt-injection paper.
  2. Wei, Haghtalab & Steinhardt, Jailbroken: How Does LLM Safety Training Fail?, NeurIPS 2023. The clearest theoretical account; the "competing objectives" and "mismatched generalization" categories are the right vocabulary.
  3. Greshake et al., Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection, AISec 2023. The IPI paper, which frames the attack surface as data flow through the application.
  4. Zou et al., Universal and Transferable Adversarial Attacks on Aligned Language Models (2023). GCG — gradient-based suffix attacks, the white-box bound on what prompt-level defenses can stop.
  5. Schulhoff et al., Ignore This Title and HackAPrompt, EMNLP 2023. The largest crowd-sourced jailbreak dataset; useful for labs.

Module 10

Automatic Prompt Optimization

The case against hand-authored prompts, treated as a discipline in its own right. The arc from APE to GEPA tracks the same intellectual move the rest of the field has made: from natural-language artifacts to programs that produce them.

Lab: Take a hand-tuned prompt from Module 4 or 5 and optimize it with at least two different methods (e.g., APE + DSPy's MIPRO), reporting accuracy delta and search cost.
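
The core loop shared by search-based optimizers can be sketched as "score candidates on a dev set, keep the best." This is a simplified illustration, not an implementation of APE, MIPRO, or GEPA: candidate generation by a model is omitted and the candidates are assumed to be given. `stub_model` stands in for an LLM call, and the returned call count is one simple way to report search cost.

```python
def score(instruction: str, dev_set, run_model) -> float:
    """Fraction of dev examples answered correctly under this instruction."""
    hits = sum(run_model(instruction, x) == y for x, y in dev_set)
    return hits / len(dev_set)


def select_best(candidates, dev_set, run_model):
    """Score every candidate; return (best instruction, its accuracy, model calls spent)."""
    scored = [(score(c, dev_set, run_model), c) for c in candidates]
    best_acc, best = max(scored, key=lambda pair: pair[0])
    return best, best_acc, len(candidates) * len(dev_set)


def stub_model(instruction: str, x: int) -> int:
    """Stand-in for an LLM call: doubles the input only when told to."""
    return 2 * x if "double" in instruction.lower() else x
```

For the lab report, the accuracy delta is the best candidate's score minus the hand-tuned baseline's score on a held-out split; scoring both on the dev set used for selection would overstate the gain.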

Module 11

Multimodal Prompting

Prompt engineering for images, code, and audio. The discipline rhymes with text prompting but has distinctive failure modes that warrant a dedicated module.

Canonical readings

  1. Zhou et al., Conditional Prompt Learning for Vision-Language Models (CoCoOp), CVPR 2022.
  2. Khattak et al., MaPLe: Multi-modal Prompt Learning, CVPR 2023.
  3. Liu et al., Visual Instruction Tuning (LLaVA), NeurIPS 2023. The instruction-tuning side that conditions how vision-language models respond to prompts.
  4. Oppenlaender, A Taxonomy of Prompt Modifiers for Text-to-Image Generation (2022). Image prompting is its own folk-knowledge tradition; this is the closest thing to a systematic account.
  5. Schulhoff et al., The Prompt Report. Revisit the multimodal taxonomy chapter.

Module 12

Coda: After Prompt Engineering

The pedagogically honest closing module. As models have improved, the marginal return on careful prompt wording has become more situational; many techniques in this syllabus are best understood as responses to an era when models needed more explicit scaffolding to reason. The skills that endure are the ones that compose into something larger — exemplar curation feeds into context engineering, CoT becomes a unit of harness control flow, automatic optimization is the bridge to programmatic systems. End the course with this synthesis.

Canonical readings

  1. Anthropic, Effective context engineering for AI agents. The explicit thesis that prompt engineering's center of gravity has moved.
  2. Schluntz & Zhang, Building Effective Agents — revisit with attention to the "find the simplest solution first" framing.
  3. Anthropic Research, Tracing the thoughts of a large language model (2025). Mechanistic interpretability is starting to give principled explanations for why prompts have the effects they do; a glimpse of what eventually replaces folk practice.
  4. Karpathy, Software 2.0 (2017). Reassigned in this context to motivate the framing that a discipline migrates upward as its underlying substrate matures.

Final Lab: Students take a single complex task they prompt-engineered earlier in the semester, refactor it as (a) a DSPy program and (b) a small harness in the style of the harness course, and write up which version produced better, more reliable, more maintainable results. The lab is the bridge between this course and the other two.

A Few Pedagogical Notes

This course is best taught with a note of intellectual honesty about its subject. A 2022-style prompt engineering course would have spent much more time on phrasing patterns; a 2026 version spends most of its time on the techniques that turned out to compose into something larger and on the empirical results that taught the field its limits. Four readings anchor that critical posture: the Sclar et al. formatting-sensitivity paper, the Min et al. demonstrations paper, the Huang et al. self-correction paper, and the Tam et al. format-restriction paper. Each undermines a piece of folk wisdom that practitioners still repeat.

The organizing question for the beginning and end of the course: which of these techniques would survive a hypothetical model that was perfectly calibrated and never needed reasoning scaffolding? Working through that question is how students learn to distinguish techniques that exploited capability gaps (most of Modules 4–7) from techniques that exploit something more structural about how transformers consume sequences (Modules 8 and 10). The first kind ages; the second kind compounds. Students who can tell them apart will be useful builders for the next several years regardless of what specific models they're working with.

Finally, the same continuous reading used in the context course applies here. Simon Willison's blog is the running practitioner narrative, and his prompt-injection tag in particular is the best ongoing chronicle of how the adversarial side of prompt engineering is actually playing out in production.

Volume II

Context Engineering Curriculum

Parallel in structure to the harness curriculum, scoped to one semester. Where the harness course studies programs that wrap models, this course studies what the model sees on every step — the assembly, management, retrieval, and degradation properties of the token sequence itself. The two courses are complementary; ideally a student takes context engineering first, since the harness lectures presuppose it.

Two cross-cutting framings for lecture one

The finite-resource frame — Anthropic's Effective context engineering for AI agents — treats every token in the window as competing for finite attention, with the engineering question being which configuration of those tokens most reliably produces the desired behavior.

The theoretical frame — Mei et al., A Survey of Context Engineering for Large Language Models — gives the field a formal taxonomy across retrieval/generation, processing, and management, and grounds the discipline as something distinct from prompt engineering. The companion repo, Awesome-Context-Engineering, is the running bibliography for the rest of the course.

Module 01

Foundations: From Prompt to Context

Establish the conceptual shift: prompt engineering optimizes a string, context engineering optimizes the entire assembled window as a managed artifact, and the latter is the natural progression once production systems start having state, retrieval, and tool outputs.

Canonical readings

  1. Anthropic, Effective context engineering for AI agents. The single best entry point.
  2. Lance Martin / LangChain, Context Engineering for Agents. The clearest practitioner taxonomy — write / select / compress / isolate.
  3. Mei et al., A Survey of Context Engineering for Large Language Models (2025). Read the taxonomy chapter; the rest is reference.
  4. Karpathy's tweet thread on context engineering (2025), assigned as the field's informal definitional moment.

Module 02

In-Context Learning as the Substrate

Why context engineering works at all. ICL is the mechanism that lets information injected at inference time change model behavior without weight updates; understanding its mechanics determines what context-engineering interventions are even plausible.

Canonical readings

  1. Brown et al., Language Models are Few-Shot Learners, NeurIPS 2020. The GPT-3 paper, and the founding empirical observation.
  2. Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, NeurIPS 2022. The simplest demonstration that context shape changes output.
  3. Xie et al., An Explanation of In-Context Learning as Implicit Bayesian Inference, ICLR 2022. The canonical theoretical lens.
  4. Olsson et al., In-context Learning and Induction Heads (Anthropic, 2022). The mechanistic story.
  5. Min et al., Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?, EMNLP 2022. The unsettling result: label correctness matters less than label distribution.

Module 03

Retrieval-Augmented Generation

RAG is the canonical context-engineering operation: programmatically inject relevant external text into the window before the model sees the query. Cover the original architecture, the dense-retrieval substrate, and the modern survey.
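
The operation itself fits in a few lines. The sketch below substitutes a toy bag-of-words similarity for the dense retriever the module covers, purely so that it runs with the standard library; the prompt template and function names are assumptions.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a dense encoder."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]


def build_rag_prompt(query: str, corpus: list[str], k: int = 2) -> str:
    """Inject the retrieved documents into the window ahead of the question."""
    docs = retrieve(query, corpus, k)
    context = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(docs))
    return f"Answer using only the context.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"
```

Swapping `embed` and `cosine` for an encoder and a vector index changes retrieval quality but not the structure: retrieve, format, inject, then call the model once.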

Module 04

Long-Context Behavior and Its Failures

The empirical realities that constrain the design space. Larger windows alone add capacity; position effects, attention dilution, and graceful-degradation patterns determine how useful that capacity is. This module is where students stop thinking "just put it all in."

Canonical readings

  1. Liu et al., Lost in the Middle: How Language Models Use Long Contexts, TACL 2024. The U-shaped position curve.
  2. Hong, Troynikov & Huber (Chroma), Context Rot: How Increasing Input Tokens Impacts LLM Performance (2025). The systematic study across 18 frontier models. Reproducible toolkit on GitHub.
  3. Kamradt, Needle in a Haystack. The benchmark that started the long-context evaluation discourse.
  4. Hsieh et al., RULER: What's the Real Context Size of Your Long-Context Language Models? (2024). The more rigorous successor to NIAH.

Lab: Replicate a small-scale context-rot experiment on a model the students can call, plotting accuracy vs. context length on a fixed task.
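
A possible skeleton for the lab, in the needle-in-a-haystack style: plant one fact at varying depths in filler text of varying length and measure how often it is recovered. This is a sketch under assumptions, not the Chroma toolkit. `forgetful_stub` is a stand-in that only "reads" the last 50 lines of its input, so the harness can be checked before spending on real model calls.

```python
def build_haystack(needle: str, filler: str, n_filler: int, depth: float) -> str:
    """Place `needle` among `n_filler` copies of `filler`; depth 0.0 = start, 1.0 = end."""
    lines = [filler] * n_filler
    lines.insert(round(depth * n_filler), needle)
    return "\n".join(lines)


def accuracy_by_length(ask_model, lengths, depths, needle, answer, question, filler) -> dict:
    """For each context length, the fraction of needle depths answered correctly."""
    results = {}
    for n in lengths:
        hits = 0
        for d in depths:
            prompt = f"{build_haystack(needle, filler, n, d)}\n\n{question}"
            hits += answer in ask_model(prompt)
        results[n] = hits / len(depths)
    return results


def forgetful_stub(prompt: str) -> str:
    """Stand-in for a model: only sees the last 50 lines of its input."""
    visible = prompt.split("\n")[-50:]
    return next((line for line in visible if "code is" in line), "unknown")
```

Repeated identical filler is the easiest case for a real model; using varied distractor text that is topically close to the needle makes the task considerably harder.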

Module 05

Memory and State Representations

What gets kept across turns, in what form, where? This module covers the cognitive-science-inspired typology and the canonical implementations. Distinguish carefully from the harness course's treatment, which focuses on when memory is loaded; here the question is how it's represented.

Canonical readings

  1. Sumers et al., Cognitive Architectures for Language Agents (CoALA) (2024). The clearest framing of working / episodic / semantic / procedural memory in LLM systems.
  2. Park et al., Generative Agents: Interactive Simulacra of Human Behavior, UIST 2023. The memory stream with importance/recency/relevance scoring.
  3. Packer et al., MemGPT: Towards LLMs as Operating Systems (2024). Memory hierarchy via virtual context management.
  4. Zhong et al., MemoryBank: Enhancing Large Language Models with Long-Term Memory (2024). Forgetting curves as a design primitive.
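
The importance/recency/relevance scoring in the Generative Agents memory stream can be sketched as a weighted sum. This is an illustrative reading, not the authors' code: the decay constant, the equal weights, the pre-scored `importance` field, and the toy two-dimensional embeddings are all assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    importance: float          # assumed pre-scored into [0, 1]
    hours_since_access: float  # time since the memory was last retrieved
    embedding: list


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score(memory: Memory, query_embedding, decay=0.995, weights=(1.0, 1.0, 1.0)) -> float:
    """Weighted sum of recency (exponential decay), importance, and relevance."""
    w_recency, w_importance, w_relevance = weights
    recency = decay ** memory.hours_since_access
    relevance = cosine(memory.embedding, query_embedding)
    return w_recency * recency + w_importance * memory.importance + w_relevance * relevance


def retrieve(memories, query_embedding, k=2):
    """Return the k highest-scoring memories for the query."""
    ranked = sorted(memories, key=lambda m: score(m, query_embedding), reverse=True)
    return ranked[:k]
```

Changing the weights is the quickest way to see the representation question this module poses: the same store yields different working context depending on how the three signals trade off.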
Module 06

Compression, Compaction, and the Brevity-Bias Problem

When the window fills up, something has to give. Cover summarization-based compaction, the failure modes (brevity bias, context collapse), and the recent push toward incremental, structured updates that preserve detail.

Canonical readings

  1. Zhang et al., Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models (2025). The ACE paper — required for its critical observation about context collapse (an 18k-token context dropping to 122 tokens in a single rewrite pass) and its grow-and-refine alternative.
  2. Kang et al., ACON: Optimizing Context Compression for Long-Horizon LLM Agents (2025). Compression specifically tuned to agent trajectories.
  3. Anthropic, Effective context engineering for AI agents — revisit, now reading the compaction and summarization sections closely.
  4. Mejba, Claude Code 1M Context: How I Stop Context Rot (2026). Practitioner-level, but useful because it grounds the abstractions in a tool many students have used.
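
The contrast between monolithic rewriting and incremental, structured updates can be made concrete with a small playbook that only accepts deltas. This is a sketch in the spirit of the grow-and-refine idea, not the ACE implementation: exact-match deduplication stands in for semantic deduplication, and the pruning rule and counter names are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Bullet:
    text: str
    helpful: int = 0
    harmful: int = 0


@dataclass
class Playbook:
    bullets: dict = field(default_factory=dict)
    _next_id: int = 0

    def apply_delta(self, add=(), helpful=(), harmful=()):
        """Grow: append new bullets and bump counters; existing text is never rewritten."""
        seen = {b.text.strip().lower() for b in self.bullets.values()}
        for text in add:
            key = text.strip().lower()
            if key in seen:
                continue  # exact-match dedupe (assumption; stands in for semantic dedupe)
            self.bullets[f"b{self._next_id}"] = Bullet(text)
            self._next_id += 1
            seen.add(key)
        for bullet_id in helpful:
            self.bullets[bullet_id].helpful += 1
        for bullet_id in harmful:
            self.bullets[bullet_id].harmful += 1

    def refine(self, margin=2):
        """Refine: drop only bullets whose harmful count exceeds helpful by the margin."""
        self.bullets = {
            i: b for i, b in self.bullets.items() if b.harmful - b.helpful < margin
        }

    def render(self) -> str:
        return "\n".join(f"[{i}] {b.text}" for i, b in self.bullets.items())
```

Because no operation regenerates the whole context, a single bad pass can remove at most the bullets it explicitly votes against, which is the property that guards against collapse.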
Module 07

Tool and Skill Descriptions as Context

Tool descriptions and skill files are context that the model pays for on every step. Cover the tool-catalog explosion problem, routing-before-loading, and the skill-as-progressive-disclosure pattern.

Canonical readings

  1. Anthropic Engineering, Writing effective tools for AI agents. The internal observation that ~58 tools can consume ~55k tokens drives much of the design space.
  2. Anthropic Engineering, Equipping agents for the real world with Agent Skills. Progressive disclosure framing.
  3. Composio, Tool Calling Explained: The Core of AI Agents (2026). Catalog-scale problem with concrete numbers.
  4. Model Context Protocol specification. Dynamic tool discovery as a context-engineering enabler.
Module 08

KV Caching and the Prefix-Invariance Constraint

The production constraint that shapes everything in agentic context engineering: a single token's worth of variation in a long prefix invalidates the KV cache for the whole suffix. Once students see this, many design choices in production agents become legible as cache-preserving moves, even when theory would suggest a different optimum.

Canonical readings

  1. Anthropic, Prompt caching documentation. The economics: cache reads at 10% of base cost, writes at 125%, so a cached prefix breaks even on its first cache hit (the second request that uses it).
  2. Manus / Yichao Ji, Context Engineering for AI Agents: Lessons from Building Manus (2025). The 100:1 input-to-output ratio observation, and KV-cache hit rate as "the north star" of production agents.
  3. Bala Priya C, The Complete Guide to Inference Caching in LLMs (2026). The three-tier story (KV / prefix / semantic) for grounding.
  4. Gim et al., Prompt Cache: Modular Attention Reuse for Low-Latency Inference (2024). The foundational paper.
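
The caching economics reduce to a short calculation. The sketch below uses the multipliers quoted in the first reading (reads at 0.10, writes at 1.25 of base input cost) and assumes the cache entry never expires between requests, which real deployments cannot take for granted.

```python
def relative_cost(requests: int, read: float = 0.10, write: float = 1.25):
    """Cost of sending one prefix `requests` times, in units of one uncached pass.

    Returns (uncached_cost, cached_cost). Assumes the entry stays warm throughout.
    """
    if requests < 1:
        raise ValueError("requests must be at least 1")
    uncached = float(requests)
    cached = write + read * (requests - 1)  # one write, then reads
    return uncached, cached


def break_even_requests(read: float = 0.10, write: float = 1.25) -> int:
    """Smallest number of requests at which caching is strictly cheaper."""
    n = 1
    while True:
        uncached, cached = relative_cost(n, read, write)
        if cached < uncached:
            return n
        n += 1
```

With these multipliers a single request costs more with caching (1.25 vs. 1.0), and two requests already cost less (1.35 vs. 2.0). At an agent's typical input-heavy ratio the gap widens on every subsequent step.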
Module 09

Programmatic and Declarative Context Construction

The case against hand-authored prompts. Treat the prompt as the artifact, but produced by a program: signatures, modules, and optimizers that compile high-level specifications into the actual text the model sees.

Lab: Take a hand-authored prompt from an earlier module and re-express it as a DSPy program; compile against a small dataset and compare.
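
Before reaching for a framework, it can help to see the idea in miniature: a declarative signature plus a function that compiles it into prompt text. The sketch below is framework-free and deliberately not DSPy's API; `Signature`, `compile_prompt`, and `select_demos` are hypothetical names, and the top-k demo selection is a stand-in for a real optimizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    instructions: str
    inputs: tuple
    outputs: tuple


def compile_prompt(sig: Signature, demos, **values) -> str:
    """Render instructions, worked demos, and the new inputs into one prompt string."""
    missing = set(sig.inputs) - set(values)
    if missing:
        raise KeyError(f"missing inputs: {sorted(missing)}")
    lines = [sig.instructions, ""]
    for demo in demos:
        for name in sig.inputs + sig.outputs:
            lines.append(f"{name}: {demo[name]}")
        lines.append("")
    for name in sig.inputs:
        lines.append(f"{name}: {values[name]}")
    lines.append(f"{sig.outputs[0]}:")  # leave the first output field open for the model
    return "\n".join(lines)


def select_demos(candidates, metric, k=2):
    """Stand-in optimizer: keep the k demos that score best under a dev-set metric."""
    return sorted(candidates, key=metric, reverse=True)[:k]
```

The point of the exercise is that the prompt text becomes an output of the program: changing the demo set or the field list changes the artifact without anyone editing a string by hand.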
Module 10

Context as Evolving Playbook

The frontier framing: context as a structured artifact that accumulates, refines, and prunes over time. Connect ACE to Reflexion-style verbal RL and to the skill systems emerging in production.

Canonical readings

  1. Zhang et al., Agentic Context Engineering — revisit, now reading the generation/reflection/curation workflow closely.
  2. Suzgun et al., Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory (2025). The precursor to ACE, with a useful framing of test-time learning as a context-modification operation.
  3. Shinn et al., Reflexion: Language Agents with Verbal Reinforcement Learning, NeurIPS 2023. Read here for its memory mechanics: verbal feedback as a learnable context substrate.
  4. Anthropic, Equipping agents for the real world with Agent Skills — revisit. The skill file is the production manifestation of "context as evolving playbook."
Module 11

Production Case Studies

Long-form practitioner reports that reward close reading. Each compresses more useful design lessons than any single paper in the syllabus.

Canonical readings

  1. Manus / Yichao Ji, Context Engineering for AI Agents: Lessons from Building Manus (2025). Four framework rewrites distilled to a handful of principles: maximize KV-cache hit rate, mask actions instead of removing them, externalize state to the file system, recite goals to fight lost-in-the-middle, leave failed actions in context.
  2. Philipp Schmid, Context Engineering for AI Agents: Part 2 (2025). The follow-up that integrates Manus, LangChain, and Anthropic's current thinking — particularly useful for the "pre-rot threshold" framing.
  3. Anthropic's multi-agent research system post — re-read with attention to the per-subagent context strategy as well as the orchestration pattern.
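
Two of the Manus principles, maximizing KV-cache hit rate and masking actions instead of removing them, interact in a way a short experiment makes visible. The sketch below is an illustration under assumptions: the tool names, the state labels, and the prefix layout are hypothetical, and how the mask is enforced at decoding time is out of scope.

```python
import json
import os

TOOLS = [{"name": "browser_open"}, {"name": "shell_exec"}, {"name": "file_write"}]


def render_prefix(tools) -> str:
    """Serialize deterministically so identical tool sets yield identical prefixes."""
    return "SYSTEM: You are an agent.\nTOOLS: " + json.dumps(tools, sort_keys=True)


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading substring, a proxy for reusable cached tokens."""
    return len(os.path.commonprefix([a, b]))


def allowed_actions(tools, state: str) -> set:
    """Hypothetical state-dependent mask over a tool list that itself never changes."""
    names = {t["name"] for t in tools}
    if state == "awaiting_user_reply":
        return set()
    if state == "browsing":
        return {n for n in names if n.startswith("browser_")}
    return names


base = render_prefix(TOOLS)
removed = render_prefix([t for t in TOOLS if t["name"] != "browser_open"])
masked = render_prefix(TOOLS)  # same definitions; the restriction lives in the mask
```

Removing a tool changes the serialized prefix near its start, so everything after that point would need to be recomputed, while masking leaves the prefix byte-identical and moves the restriction into `allowed_actions`.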
Module 12

Evaluation

How to measure context engineering. Distinct from agent-level benchmarks; the evaluations here isolate the contribution of the context-construction pipeline.

Canonical readings

  1. Wu et al., LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory, ICLR 2025. The benchmark that drives much of the memory and compression literature; used in both Chroma's context rot study and the ACE paper.
  2. Trivedi et al., AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents, ACL 2024. The agent benchmark most sensitive to context-engineering quality.
  3. Hsieh et al., RULER — revisit for evaluation methodology.
  4. Zhu et al., Establishing Best Practices for Building Rigorous Agentic Benchmarks (2025). The same critical reading as in the harness course, assigned again because students need to internalize that flawed benchmark design can misstate agent performance in this field by up to 100% in relative terms.
Final Lab: Students take their semester project (a context-engineering pipeline of their choice: RAG, compaction, programmatic prompt construction, or a combination) and submit it against a held-out LongMemEval split, reporting variance across at least five runs.

A Few Pedagogical Notes

The two courses share a tight conceptual relationship: context engineering studies the artifact, the harness course studies the program that produces it. Most students will find the conceptual handoff cleanest if they take this course first. The throughline from week one is cost: every token in the window is paid for in money (input tokens), latency (prefill time), attention budget (the lost-in-the-middle problem), and cache invalidation (the Manus argument). Almost every design pattern in the syllabus is legible as a response to one of those four costs.

The syllabus resists organizing itself around RAG. RAG was the canonical context-engineering operation in 2020–2023, but the field has moved decisively toward (a) agentic settings where retrieval is one of many context-shaping operations and (b) long-horizon context management. RAG is Module 3, but Modules 4–12 are largely about what happens after the initial retrieve-and-stuff.

A recurring blog complements the formal readings: Simon Willison's tag for "context engineering" tracks the field in real time and provides the most accessible running narrative. Treat it as background reading throughout the course.

Volume III

Harness Engineering Curriculum

A twelve-module curriculum that builds from first principles to the current research frontier, scoped to a one-semester graduate course. Each module is sized for roughly a week (one lecture on theory, one on a paper or codebase, one lab). The readings are ordered by priority — the first one or two in each list are the canonical entry points; the rest extend the discussion.

Two cross-cutting framings for lecture one

The externalization frame — Zhou et al.'s recent survey, Externalization in LLM Agents — treats memory, skills, protocols, and the harness as cognitive artifacts that offload internal model burden onto deterministic infrastructure. This is the cleanest theoretical lens for the whole course.

The harness-as-product frame — Anthropic's Building Effective Agents (Schluntz & Zhang, with reference implementations), MongoDB's The Agent Harness, and LangChain's The Anatomy of an Agent Harness — establishes that the LLM is a small fraction of any deployed system and that the surrounding code is a first-class engineering artifact.

Module 01

Foundations: From Prompt to Context to Harness

Establish the three layers, the agent definition, the workflow/agent distinction, and the scaffold-vs-harness terminological shift. Set up the throughline: as model capability rises, engineering responsibility moves up the stack.

Canonical readings

  1. Schluntz & Zhang, Building Effective Agents, Anthropic (2024). The single best entry point.
  2. Simon Willison, Notes on Building Effective Agents (2024). Useful gloss on the terminology question.
  3. Zhou et al., Externalization in LLM Agents (2026). Theoretical framing for the course.
  4. Karpathy, Software 2.0 (2017). Assigned for the historical analogy. Best read alongside Sculley et al., Hidden Technical Debt in Machine Learning Systems (NeurIPS 2015), whose central observation is that ML code is a tiny fraction of a production ML system.
Module 02

The Agent Loop: ReAct and Its Descendants

The canonical perception-action loop. Cover the original formulation, its few-shot prompting roots, and the move to tool-call APIs as first-class loop primitives.

Canonical readings

  1. Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models, ICLR 2023. The foundational paper.
  2. Kim et al., An LLM Compiler for Parallel Function Calling, ICML 2024. Plan-and-execute as a DAG, with a reported latency speedup of up to ~3.7× over sequential ReAct.
  3. Yao et al., Tree of Thoughts: Deliberate Problem Solving with Large Language Models, NeurIPS 2023. The branching generalization.
Lab: Implement a minimal ReAct loop in ~100 lines, no framework. The lab makes the loop's brevity and the surrounding engineering visible.
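
A possible skeleton for the lab, kept well under the line budget. It is a sketch, not a reference solution: the `Action: tool[input]` text format follows the style of the ReAct paper's prompts, while `scripted_llm` and `calculator` are hypothetical stand-ins so the loop runs without a model.

```python
import re

ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*?)\]")


def react(question: str, llm, tools: dict, max_steps: int = 5) -> str:
    """Alternate model steps and tool observations until finish[...] or the budget ends."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        match = ACTION_RE.search(step)
        if match is None:
            transcript += "Observation: no valid action; use Action: tool[input]\n"
            continue
        name, arg = match.groups()
        if name == "finish":
            return arg
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool {name!r}"
        transcript += f"Observation: {observation}\n"
    return "stopped: step budget exhausted"


def scripted_llm(transcript: str) -> str:
    """Hypothetical stand-in model: calls the calculator once, then finishes."""
    if "Observation: 42" in transcript:
        return "Thought: I have the result.\nAction: finish[42]"
    return "Thought: I should compute this.\nAction: calculator[6*7]"


def calculator(expression: str) -> str:
    """Toy tool: multiplies two integers written as 'a*b'."""
    left, right = expression.split("*")
    return str(int(left) * int(right))


answer = react("What is 6*7?", scripted_llm, {"calculator": calculator})
```

Everything the rest of the course adds (error contracts, context policy, approvals, tracing) attaches to one of the few lines inside the `for` loop.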
Module 03

Tool Use as the Action Space

Tool calls are the agent's action interface; their design dominates reliability. Cover schema design, description-as-prompt, error contracts, tool catalog scaling, and the MCP shift from static tools to dynamic discovery.

Canonical readings

  1. Anthropic, Tool Use Overview (current docs).
  2. Kiran Prakash / martinfowler.com, Function calling using LLMs (2025). Best concise treatment, including MCP.
  3. Model Context Protocol specification. The protocol itself.
  4. Bala Priya C, The Roadmap to Mastering Tool Calling in AI Agents (2026). Failure modes and catalog scaling.
Lab: Build a tool registry with three failure modes (bad args, missing resource, timeout) and observe how error message phrasing affects recovery.
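
One possible starting point for the registry, assuming tools are plain Python callables and that each failure is returned to the model as a structured error rather than raised. The error codes, hint wording, and demo tools are hypothetical; the hint strings are the part students should vary when studying recovery.

```python
import concurrent.futures as cf
import time


class ToolRegistry:
    def __init__(self, timeout_s: float = 1.0):
        self._tools = {}
        self._timeout_s = timeout_s
        self._pool = cf.ThreadPoolExecutor(max_workers=4)

    def register(self, name: str, fn) -> None:
        self._tools[name] = fn

    def call(self, name: str, **kwargs) -> dict:
        """Run a tool and map each failure mode to an error the model can act on."""
        if name not in self._tools:
            return {"ok": False, "error": "unknown_tool",
                    "hint": f"available tools: {sorted(self._tools)}"}
        future = self._pool.submit(self._tools[name], **kwargs)
        try:
            return {"ok": True, "result": future.result(timeout=self._timeout_s)}
        except cf.TimeoutError:
            # The worker thread keeps running; a real harness needs cancellation.
            return {"ok": False, "error": "timeout", "hint": "retry with a narrower request"}
        except (TypeError, ValueError) as exc:
            return {"ok": False, "error": "bad_args", "hint": str(exc)}
        except (KeyError, FileNotFoundError) as exc:
            return {"ok": False, "error": "missing_resource", "hint": f"not found: {exc}"}


def read_doc(doc_id: str) -> str:
    """Demo tool: looks up a document in a tiny in-memory store."""
    return {"a": "alpha"}[doc_id]


def slow() -> str:
    """Demo tool: takes longer than the registry's timeout in the usage below."""
    time.sleep(0.3)
    return "done"


registry = ToolRegistry(timeout_s=0.05)
registry.register("read_doc", read_doc)
registry.register("slow", slow)
```

Returning errors as data keeps the loop alive and puts the recovery decision where the lab wants it: in how the model reads the `hint` field.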
Module 04

Planning and Orchestration Patterns

The taxonomy of how multiple model calls compose. The Anthropic patterns (prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer) provide the vocabulary; the multi-agent research paper shows them in production.

Canonical readings

  1. Anthropic, How we built our multi-agent research system (2025). Required.
  2. Schluntz & Zhang patterns (revisit Module 1 reading with focus on the pattern catalog).
  3. Reference implementations: claude-cookbooks/patterns/agents and the Pydantic AI port.
  4. Fountaincity, Anthropic's Multi-Agent Blueprint: What Production Adds (2026). The cost and SLA realities the original post elides.
Module 05

Memory and State Externalization

The harness is responsible for everything the model cannot remember. Cover the cognitive-science-inspired typology (working / episodic / semantic / procedural), retrieval policies, and the recent push beyond static RAG toward continuum memory.

Canonical readings

  1. Park et al., Generative Agents: Interactive Simulacra of Human Behavior, UIST 2023. The reference architecture: memory stream + reflection + retrieval scoring.
  2. Sumers et al., Cognitive Architectures for Language Agents (CoALA) (2024). The clearest theoretical taxonomy.
  3. Continuum Memory Architectures for Long-Horizon LLM Agents (2026). Critique of stateless RAG.
  4. Packer et al., MemGPT: Towards LLMs as Operating Systems (2024). Memory hierarchy via virtual context management.
Module 06

Context Engineering Inside the Harness

Context engineering as the per-step subroutine of the harness. Compaction, just-in-time loading, tool-description budgeting, progressive disclosure, skill loading.

Canonical readings

  1. Anthropic, Effective context engineering for AI agents (and follow-up posts on context rot and skill systems).
  2. Lance Martin / LangChain, Context Engineering for Agents. The clearest practitioner taxonomy (write / select / compress / isolate).
  3. Anthropic Engineering, Writing effective tools for AI agents — including the observation that ~58 tools can consume ~55k tokens, which forces routing-before-loading.
Lab: Implement progressive tool disclosure with a router model selecting from N tool clusters before the main agent sees descriptions.
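
A minimal sketch of routing-before-loading for the lab. The cluster names, tool descriptions, and one-line summaries are hypothetical, and keyword overlap stands in for the router model the lab asks for; only the routed clusters' descriptions reach the main agent's context.

```python
TOOL_CLUSTERS = {
    "files": {
        "read_file": "Read a UTF-8 text file from the workspace.",
        "write_file": "Write text to a file in the workspace.",
    },
    "web": {
        "fetch_url": "Download a web page and return its text.",
        "search_web": "Run a web search and return result snippets.",
    },
    "calendar": {
        "list_events": "List calendar events in a date range.",
    },
}

# Short summaries are all the router sees; full descriptions load only after routing.
CLUSTER_SUMMARIES = {
    "files": "local file reading and writing",
    "web": "web search and page fetching",
    "calendar": "calendar events and scheduling",
}


def route(task: str, summaries: dict, k: int = 1) -> list:
    """Hypothetical router: rank clusters by word overlap with the task."""
    words = set(task.lower().split())

    def overlap(cluster: str) -> int:
        return len(words & set(summaries[cluster].lower().split()))

    return sorted(summaries, key=overlap, reverse=True)[:k]


def visible_tools(task: str, k: int = 1) -> dict:
    """Tool descriptions the main agent will see for this task."""
    tools = {}
    for cluster in route(task, CLUSTER_SUMMARIES, k):
        tools.update(TOOL_CLUSTERS[cluster])
    return tools
```

The measurable quantity is the token cost of `visible_tools(task)` against the cost of loading every cluster, traded off against how often the router hides a tool the task needed.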
Module 07

Reflection, Self-Correction, and Inner Loops

Inner loops that revise outputs before the harness commits them. The critical readings keep the pattern honest: self-correction without external feedback often fails, so the harness usually needs verifiers.

Module 08

Safety, Sandboxing, and Permissions

Once tools execute real code, blast radius dominates design. Cover prompt injection (direct and indirect), tool permission models, capability gating, and approval contracts.

Module 09

Human-in-the-Loop and Approval Contracts

When the loop deliberately yields control. Cover the design space (HITL vs HOTL, tool-level vs request-level approval, suspension vs approval) and the durable-execution problem (waits of hours/days).

Canonical readings

  1. Anthropic Building Effective Agents, "Combining and customizing these patterns" section.
  2. Mastra, Human-in-the-Loop: When to Use Agent Approval (2026). Clean taxonomy of approval vs. suspension.
  3. LangChain, Human-in-the-loop middleware. The four decision types (approve / edit / reject / respond).
  4. Cloudflare, Agents SDK Human in the Loop patterns. Five patterns including durable approvals via Workflows.
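
Tool-level approval with the four decision types named above can be sketched as a gate in front of tool execution. This is a generic illustration, not the API of LangChain, Mastra, or Cloudflare; the reviewer here is a synchronous callable, so the durable-execution problem (waits of hours or days) is deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    APPROVE = "approve"
    EDIT = "edit"
    REJECT = "reject"
    RESPOND = "respond"


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class Review:
    decision: Decision
    edited_args: Optional[dict] = None
    message: str = ""


def gated_call(call: ToolCall, tools: dict, needs_approval: set, reviewer) -> str:
    """Execute a tool call, pausing for human review when the tool is gated."""
    if call.name in needs_approval:
        review = reviewer(call)
        if review.decision is Decision.REJECT:
            return f"rejected by reviewer: {review.message}"
        if review.decision is Decision.RESPOND:
            return f"reviewer replied instead of running the tool: {review.message}"
        if review.decision is Decision.EDIT:
            call = ToolCall(call.name, review.edited_args or call.args)
    return tools[call.name](**call.args)
```

Rejections and replies come back as observations, so the model can adjust its plan; whether that string is phrased as an error or as guidance is itself a design decision worth testing.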
Module 10

Observability and Tracing

Traces are required for harness debugging. Cover span hierarchy, OpenTelemetry as the substrate, and the OpenInference semantic conventions that specialize OTel for LLM workloads.

Canonical readings

  1. OpenInference Specification. The semantic conventions reference.
  2. Arize, Phoenix. OSS reference implementation; spin one up in lab.
  3. Anthropic's multi-agent post, "Production reliability and engineering challenges" section, which discusses non-deterministic debugging and how to make agent execution legible.
Lab: Instrument the Module 2 ReAct agent end-to-end with OpenInference, then deliberately introduce a bug and find it in the trace.
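
Before wiring in a real tracing stack, the span hierarchy itself can be sketched in plain Python. The `Tracer` below is a teaching stand-in and not the OpenTelemetry or OpenInference API; the kind labels (AGENT, LLM, TOOL) are meant to echo the vocabulary those conventions use, and the attribute names are assumptions.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    kind: str
    parent: Optional[str]
    attributes: dict = field(default_factory=dict)
    start: float = 0.0
    end: float = 0.0


class Tracer:
    def __init__(self):
        self.spans = []   # every span, in start order
        self._stack = []  # currently open spans, innermost last

    @contextmanager
    def span(self, name: str, kind: str, **attributes):
        """Open a span nested under whichever span is currently active."""
        parent = self._stack[-1].name if self._stack else None
        current = Span(name, kind, parent, dict(attributes), time.perf_counter())
        self.spans.append(current)
        self._stack.append(current)
        try:
            yield current
        except Exception as exc:
            current.attributes["error"] = repr(exc)  # record, then re-raise
            raise
        finally:
            current.end = time.perf_counter()
            self._stack.pop()


tracer = Tracer()
with tracer.span("run", "AGENT"):
    with tracer.span("step-1", "LLM", prompt_tokens=12):
        pass
    try:
        with tracer.span("calculator", "TOOL"):
            raise ValueError("boom")
    except ValueError:
        pass
```

Reading `tracer.spans` afterward shows the failed tool call as a child of the run span with an `error` attribute, which is the shape of evidence the lab asks students to find in a real trace.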
Module 11

Evaluation and Benchmarks

How to measure a harness. Cover end-to-end task benchmarks, the variance problem, the rigor problem, and tool-level evaluation as distinct from end-to-end.

Canonical readings

  1. Jimenez et al., SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, ICLR 2024. The canonical coding agent benchmark.
  2. Yao et al., τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains (Sierra, 2024). The pass^k reliability metric is the important contribution.
  3. Mialon et al., GAIA: A Benchmark for General AI Assistants, ICLR 2024.
  4. Zhou et al., WebArena: A Realistic Web Environment for Building Autonomous Agents, ICLR 2024.
  5. Zhu et al., Establishing Best Practices for Building Rigorous Agentic Benchmarks (2025). Required critical reading: documents how flaws in benchmark design can misstate agent performance by up to 100% in relative terms, inflating leaderboard claims.
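
The pass^k idea is easy to compute once each task has been run several times. The sketch below uses the combinatorial estimator C(c, k) / C(n, k) for a task with c successes in n trials, averaged over tasks; it reflects this sketch's reading of the τ-bench metric, and the function names are hypothetical.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimated probability that k independent runs of one task all succeed."""
    if not 0 < k <= trials:
        raise ValueError("k must satisfy 0 < k <= trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(per_task, k: int) -> float:
    """Average pass^k over tasks; per_task is a list of (successes, trials) pairs."""
    return sum(pass_hat_k(c, n, k) for c, n in per_task) / len(per_task)
```

For a task that succeeds in 8 of 10 runs, `pass_hat_k(8, 10, 1)` is 0.8 while `pass_hat_k(8, 10, 2)` is 28/45, about 0.62: the metric falls as k grows unless the agent is consistent, which is what distinguishes it from a single-run success rate.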
Module 12

Case Study and Frontier: Claude Code and Self-Optimizing Harnesses

End with two contemporary anchors. The Claude Code source exposure (March 2026) made a production harness publicly readable; the Meta-Harness paper raised the harness itself to a first-class optimization target.

Canonical readings

  1. WaveSpeed, Claude Code Agent Harness: Architecture Breakdown (2026).
  2. VILA-Lab, Dive-into-Claude-Code — a curated index of community analyses of the leak, including the systematic "Horse Book" mdBook framing it as a harness engineering case study.
  3. Lee et al., Meta-Harness: End-to-End Optimization of Model Harnesses (Stanford, 2026). The harness as a search target. The 6× same-model performance gap from harness alone is the headline number.
  4. Zhang et al., General Modular Harness for LLM Agents in Multi-Turn Gaming Environments (2025). Ablations that quantify the contribution of perception / memory / reasoning modules separately.
Final Lab: Students take their semester-long harness, instrument it on a small Meta-Harness-style outer loop, and report whether automated edits beat their hand-tuned version.
Addendum

Specialized Harness Engineering

A closing bridge from general-purpose agent harnesses to domain-specific execution systems. Specialized harness engineering asks what happens when the task family is known well enough that the stable structure can move out of the model and into the runtime.

General-purpose harnesses maximize breadth: broad goals, broad tools, open-ended context gathering, and loose stopping conditions. Specialized harnesses live in the productive middle between deterministic workflows and fully open-ended agents. They assume the work has a recognizable skeleton — steps, branches, allowed actions, approvals, outputs, and failure states — but still needs model judgment for language understanding, fuzzy matching, exception handling, summarization, and local decisions.

The core move is to change the task the model sees. The runtime owns typed state, scoped tools, domain-specific context routing, policy gates, verification checks, structured outputs, and human review surfaces. The model receives a local objective with the smallest useful context and action surface. The result is usually cheaper, more inspectable, easier to audit, and easier to improve than asking a frontier model to improvise the whole procedure from a monolithic prompt.

Proceda's SOP execution architecture is the concrete case study. An SOP is parsed into explicit steps; each step receives only its own instructions and relevant tool schemas; the model advances by calling control tools such as complete_step() or request_clarification(); the executor manages transitions, approval gates, context trimming, output extraction, circuit breakers, and trace events. The SOP-Bench lesson is that procedural reliability can come from the harness absorbing planning and state-management burden, even with a smaller model.
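
The division of labor described above can be sketched as a small executor. This is not Proceda's code: only the control-tool names `complete_step` and `request_clarification` come from the description, while the `Step` and `SOPExecutor` classes, the turn budget, and the model's call signature are assumptions made for illustration.

```python
from dataclasses import dataclass, field

CONTROL_TOOLS = ("complete_step", "request_clarification")


@dataclass
class Step:
    instructions: str
    tools: tuple = ()


@dataclass
class SOPExecutor:
    steps: list
    max_turns_per_step: int = 3  # circuit breaker against a stuck step
    index: int = 0
    outputs: list = field(default_factory=list)
    events: list = field(default_factory=list)

    def run(self, model) -> str:
        """Advance through the SOP; the model only ever sees the current step."""
        while self.index < len(self.steps):
            step = self.steps[self.index]
            for _ in range(self.max_turns_per_step):
                # model(instructions, tool_names) -> (action, payload) is assumed
                action, payload = model(step.instructions, step.tools + CONTROL_TOOLS)
                self.events.append((self.index, action))  # trace event
                if action == "complete_step":
                    self.outputs.append(payload)
                    self.index += 1
                    break
                if action == "request_clarification":
                    return f"paused at step {self.index}: {payload}"
            else:
                return f"aborted at step {self.index}: turn budget exhausted"
        return "completed"
```

The executor, not the model, decides which step is active, when to stop, and what gets recorded, so the model's task shrinks to a local decision with a small action surface.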

Canonical readings

  1. Haldar, Specialized Harness Engineering (2026). The conceptual frame: specialized harnesses as the broad middle between deterministic workflows and general-purpose agents.
  2. Haldar, Anatomy of a SOTA Agentic SOP-Execution Engine (2026). The Proceda architecture: SOP as state machine, control tools, context trimming, approval gates, structured extraction, bounded execution, and event traces.
Lab: Take the Module 2 ReAct loop and specialize it for one bounded SOP-like task family: define typed state, scoped tools, explicit completion and clarification tools, validation checks, and a task-level evaluation metric.

A Few Pedagogical Notes

The course works best if labs accumulate one codebase across modules. By Module 12 students have an instrumented, sandboxed, multi-agent harness with memory and approvals, and they end by submitting it to automated optimization. The arc mirrors the field's actual move: from hand-authored loops to optimized infrastructure.

SWE-bench, τ-bench, and at least one custom domain benchmark appear from the start, so students re-evaluate the same harness across modules and see numbers move, plateau, or regress as they add components. The Module 11 readings on benchmark rigor land much harder when students have lived with their own numbers all semester.

Two readings recur throughout: Simon Willison's blog as a running practitioner narrative, and Lilian Weng's LLM-Powered Autonomous Agents (2023) as the historical reference point most of the field still cites.

Capstone

From Harness to Agent

A culminating module designed for students who have completed all three courses. It is a synthesis lecture, a closing reading set, and a final lab that makes the trilogy's argument legible as a single arc.

The Composition

The capstone starts with the definition, stated as cleanly as it can be:

An agent is the (model, harness) pair operating over an environment.

Everything in the three courses has been preparation for understanding why that definition is non-trivial. The model is what the prompt engineering course studied: a fixed function from a context to a distribution over next tokens. The harness is what the harness engineering course studied: a program that wraps that function in a loop, with tools, memory, control flow, and termination logic. The environment is whatever the harness reaches into through its tools — a file system, a browser, a database, an API surface, a human collaborator. The context engineering course studied the interface between model and harness: how the harness assembles, manages, and refreshes the token sequence the model sees on every step.

The definition has sharp consequences. Agent names the whole composition: model, harness, and environment together. A frontier model dropped into the wrong harness produces a worse agent than a weaker model in a well-designed one; the Meta-Harness paper's six-fold same-model performance gap is the empirical anchor for that claim. The environment is part of the definition: changing the environment changes what the agent is, as well as what it does. Improving any one component in isolation can still leave the agent worse, because the composition has dynamics that none of the parts have alone.

The clearest theoretical articulation of this composition is Sumers et al., Cognitive Architectures for Language Agents (CoALA) (2024), which formalizes language agents in terms of memory modules, action space, and decision procedure. It is the integrative reading for this lecture, with its diagrams providing the structure. Russell & Norvig, Artificial Intelligence: A Modern Approach, Chapter 2: Intelligent Agents supplies the older percept-action-environment vocabulary the field eventually rediscovered.

The historical arc: scaffold to harness

The terminological shift from "scaffold" to "harness" is the field's most legible signal that its understanding of what it's building has changed. The shift deserves careful treatment because it encodes a real intellectual claim, beyond style preference.

The scaffold era ran roughly from early 2023 through mid-2024. Its canonical artifacts are AutoGPT (Toran Bruce Richards, March 2023) and BabyAGI (Yohei Nakajima, April 2023), each a few hundred lines of Python that wrapped GPT-4 in a planning-and-execution loop with a vector store for memory. The framing was construction-site scaffolding: temporary structure around something still unable to stand on its own. The implicit prediction was that scaffolding would become unnecessary as models improved, the way construction scaffolding is removed once a building can bear its own weight. METR (formerly ARC Evals) adopted the term partly for this reason: its evaluators wanted to measure raw model capability and treated the surrounding code as a confound to be controlled.

The empirical record complicated that prediction. Models got much better over 2023–2026, while surrounding code became larger, more sophisticated, and more clearly a primary determinant of system behavior. Claude Code, Cursor, Devin, and the production systems studied in the harness course are first-class infrastructure. Their design choices — tool granularity, context compaction policy, approval contracts, sub-agent spawning rules — determine what the system can do.

The current era is the harness era. The metaphor is permanent apparatus that channels a powerful thing toward useful work: engines, reactors, and other systems that need durable control surfaces. Better models raise the value of good harness design. The same physical artifact — a loop, a tool registry, a memory store, a context manager — gets reinterpreted as the thing being built. The Externalization survey makes the same claim theoretically: memory, skills, protocols, and the harness itself are cognitive artifacts that offload internal model burden onto deterministic infrastructure, and that offloading is a permanent design feature.

The close reading compares Anthropic's Building Effective Agents, which almost never uses the word "scaffold," uses "agent" carefully and narrowly, and treats the surrounding code as the locus of design decisions, against AutoGPT's original README. The shift is visible directly in the prose.

An open empirical question anchors the section: was the scaffold framing wrong, or was it premature? The Meta-Harness result and the harness-engineering literature suggest scaffolds persist across the capability levels seen so far. But some future model may benefit less from external memory, external tools, or external control flow, at which point the harness may become more confound than capability. No compelling evidence settles it. The harness framing is currently winning because it matches the available data. The exercise asks students to argue both sides.

How the three layers compose into an agent

This is the integrative move the trilogy was building toward:

  • A prompt is a fragment — an instruction, an exemplar, a tool description, a system message. In an agent, every prompt-engineered artifact ends up embedded in the harness as a template, a tool schema, or a piece of guidance that the harness emits to the model at the right moment. Prompt engineering is the discipline of designing those fragments. Most of what the prompt engineering course studied lives inside an agent as static assets the harness assembles from.
  • A context is the entire token sequence the model sees on a given step — the prompt fragments plus everything the harness chose to include on this particular iteration. Context engineering is the discipline of policy: which fragments, which prior outputs, which retrieved documents, which tool descriptions, which compacted memory, all subject to finite-resource constraints and degradation curves. In an agent, context engineering is the per-step subroutine of the harness, run anew on every iteration.
  • A harness is the program that runs the model in a loop, calls tools, applies the context policy, manages state, and decides when to stop. Harness engineering is the discipline of building that program.

An agent is what emerges when a harness operates over an environment in pursuit of a goal. The composition has properties no individual layer has:

  • Trajectory-dependence. The same agent on the same task can take different paths through the environment depending on stochastic choices, environmental responses, and which retrieved or compacted artifacts the context policy surfaced. Reliability becomes a distribution, which is why τ-bench's pass^k metric is the right way to measure production agents.
  • Closed-loop failure modes. Agents exhibit pathologies absent from single inferences: loops, goal drift, capability erosion under context rot, sub-agent cascade failures. These emerge from the dynamics of the composition, beyond what a prompt, context window, or static harness diagram can show.
  • Emergent capability above the parts. The Manus team's observation that they "rebuilt the framework four times" while keeping the model fixed is evidence that agent capability is largely a harness-and-context property. The Meta-Harness six-fold gap is the same observation, formalized.

The clearest single reading on this synthesis is Lilian Weng's LLM-Powered Autonomous Agents (2023), now somewhat dated but still the cleanest narrative bridge from "LLM" to "agent" in the literature. Wang et al., A Survey on Large Language Model based Autonomous Agents (2024) supplies breadth, and Wang et al., Voyager: An Open-Ended Embodied Agent with Large Language Models (2024) supplies the cleanest case study of an agent that does the integrative thing — accumulating skills the harness can later retrieve, blurring the line between context, harness, and model.

Where the frontier is heading

The course ends on three live questions.

First, the convergence of the three layers under automated optimization. Meta-Harness suggests harnesses can be optimized end-to-end. GEPA and DSPy suggest prompts can be. ACE and Dynamic Cheatsheet suggest contexts can. The natural endpoint is a co-optimized stack where the boundaries between layers blur because all three are being adjusted to a shared task signal. The Anthropic skills framing — SKILL.md files loaded by the harness on demand — is one production-grade instantiation: a skill is simultaneously a prompt fragment, a context-loading policy, and a piece of harness behavior. Students who internalize this stop seeing prompt/context/harness as separable disciplines and start seeing them as a single optimization surface.

Second, agents in environments that contain other agents. Almost every reading in the trilogy assumes the environment is passive — a file system, an API, a webpage. As multi-agent systems become routine, the environment increasingly contains other agents whose harnesses are pursuing their own goals. The classical multi-agent systems literature (Wooldridge, An Introduction to MultiAgent Systems, 2009) is suddenly relevant again, and the harness designs that worked against passive environments often misbehave against active ones. This is one of the most likely sources of the next round of "harness engineering" failure modes.

Third, what scaffolding's prediction would even have looked like, if true. As a closing exercise, have students articulate what evidence would convince them that the scaffold framing was right: some capability threshold past which the surrounding code becomes more confound than capability. The exercise sharpens their thinking about the relationship between model capability and system capability, and inoculates them against the easy assumption that improvements in either are improvements in both.

Final integrative lab

Students bring forward the artifact each course produced — the prompts from the prompt course, the context-engineering pipeline from the context course, the harness from the harness course — and compose them into a single agent attacking a non-trivial task in a non-trivial environment. AppWorld or a custom domain works well. The deliverable is a paper about the system: students must report which design decisions belonged to which layer, which decisions cut across layers (and therefore which discipline's vocabulary fails to capture them), where the composition introduced behaviors none of the layers predicted, and what their pass^k variance looks like over at least ten runs.

The paper is the trilogy's exit exam. Students who can write it well will understand something most working practitioners miss: prompt, context, and harness engineering name adjacent regions of a single design surface, and the agent sits on top of all three.

A Closing Note for the Final Lecture

The trilogy was structured this way because the field arrived at it this way, painfully and in roughly this order. Prompt engineering came first because it was the cheapest way to get useful behavior out of a black-box model. Context engineering followed once context windows grew large enough that their contents became the bottleneck. Harness engineering became respectable only once it was clear that capable models still needed sustained, deterministic infrastructure around them to produce reliable systems.

The next layer — whatever it turns out to be — will get its own name once the field knows what to call it. The test for students is whether they can guess what that name will be. Correct guesses are most likely to come from building the thing before the field names it.