The Composition
The capstone starts with the definition, stated as cleanly as it can be:
An agent is the (model, harness) pair operating over an environment.
Everything in the three courses has been preparation for understanding why that definition is non-trivial. The model is what the prompt engineering course studied: a fixed function from a context to a distribution over next tokens. The harness is what the harness engineering course studied: a program that wraps that function in a loop, with tools, memory, control flow, and termination logic. The environment is whatever the harness reaches into through its tools — a file system, a browser, a database, an API surface, a human collaborator. The context engineering course studied the interface between model and harness: how the harness assembles, manages, and refreshes the token sequence the model sees on every step.
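The definition can be made concrete with a toy sketch. Everything below is illustrative and hypothetical: the names `Model`, `Environment`, `Harness`, `scripted_model`, and `read_tool` are not from any of the readings, and the "model" is collapsed from a distribution over next tokens to a function that returns one action string. The point is only to show the three roles as separate objects, with the environment reachable solely through tools.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in: a real model maps a context to a distribution over
# next tokens; here that is collapsed to a single action string per call.
Model = Callable[[str], str]


@dataclass
class Environment:
    """Whatever the tools reach into; here, a toy in-memory file system."""
    files: Dict[str, str] = field(default_factory=dict)


Tool = Callable[[Environment, str], str]


@dataclass
class Harness:
    """Wraps the model in a loop with tools, memory, and termination logic."""
    model: Model
    tools: Dict[str, Tool]
    max_steps: int = 10

    def run(self, goal: str, env: Environment) -> List[str]:
        memory: List[str] = []
        for _ in range(self.max_steps):  # termination: step budget
            context = "\n".join([f"GOAL: {goal}", *memory])
            action = self.model(context)
            if action == "stop":  # termination: model-declared
                break
            name, _, arg = action.partition(" ")
            observation = self.tools[name](env, arg)
            memory.append(f"{action} -> {observation}")
        return memory


def scripted_model(context: str) -> str:
    # Toy policy: read the notes once, then stop.
    return "stop" if "read notes.txt ->" in context else "read notes.txt"


def read_tool(env: Environment, path: str) -> str:
    return env.files.get(path, "<missing>")


harness = Harness(model=scripted_model, tools={"read": read_tool})
trace = harness.run("summarize notes", Environment({"notes.txt": "hello"}))
empty_trace = harness.run("summarize notes", Environment())
print(trace)        # ['read notes.txt -> hello']
print(empty_trace)  # ['read notes.txt -> <missing>']
```

The same model and harness produce different trajectories over the two environments, which is the sense in which the agent is the composition and not any single part.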
The definition has sharp consequences. "Agent" names the whole composition: model, harness, and environment together. A frontier model dropped into the wrong harness produces a worse agent than a weaker model in a well-designed one; the Meta-Harness paper's six-fold same-model performance gap is the empirical anchor for that claim. The environment is part of the definition: changing the environment changes what the agent is, as well as what it does. Improving any one component in isolation can still leave the agent worse, because the composition has dynamics that none of the parts have alone.
The clearest theoretical articulation of this composition is Sumers et al., Cognitive Architectures for Language Agents (CoALA) (2024), which formalizes language agents in terms of memory modules, action space, and decision procedure. It is the integrative reading for this lecture, and its diagrams provide the lecture's structure. Russell & Norvig, Artificial Intelligence: A Modern Approach, Chapter 2: Intelligent Agents supplies the older percept-action-environment vocabulary that the field eventually rediscovered.
The historical arc: scaffold to harness
The terminological shift from "scaffold" to "harness" is the field's most legible signal that its understanding of what it's building has changed. The shift deserves careful treatment because it encodes a real intellectual claim, not just a stylistic preference.
The scaffold era ran roughly from early 2023 through mid-2024. Its canonical artifacts are AutoGPT (Toran Bruce Richards, March 2023) and BabyAGI (Yohei Nakajima, April 2023), each a few hundred lines of Python that wrapped GPT-4 in a planning-and-execution loop with a vector store for memory. The framing was construction-site scaffolding: temporary structure around something still unable to stand on its own. The implicit prediction was that scaffolding would become unnecessary as models improved, the way construction scaffolding is removed once a building can bear its own weight. METR (formerly ARC Evals) adopted the term partly for this reason: it wanted to measure raw model capability and treated the surrounding code as a confound to be controlled.
The empirical record complicated that prediction. Models got much better over 2023–2026, while surrounding code became larger, more sophisticated, and more clearly a primary determinant of system behavior. Claude Code, Cursor, Devin, and the production systems studied in the harness course are first-class infrastructure. Their design choices — tool granularity, context compaction policy, approval contracts, sub-agent spawning rules — determine what the system can do.
The current era is the harness era. The metaphor is a permanent apparatus that channels a powerful thing toward useful work, as with engines, reactors, and other systems that need durable control surfaces. Better models raise the value of good harness design rather than lowering it. The same artifact (a loop, a tool registry, a memory store, a context manager) is reinterpreted as the thing being built. The Externalization survey makes the same claim theoretically: memory, skills, protocols, and the harness itself are cognitive artifacts that offload internal model burden onto deterministic infrastructure, and that offloading is a permanent design feature.
The close reading sets Anthropic's Building Effective Agents against AutoGPT's original README. The Anthropic essay almost never uses the word "scaffold," uses "agent" carefully and narrowly, and treats the surrounding code as the locus of design decisions. The shift is visible directly in the prose.
An open empirical question anchors the section: was the scaffold framing wrong, or was it premature? The Meta-Harness result and the harness-engineering literature suggest that the surrounding code has stayed essential across the capability levels seen so far. But some future model may benefit less from external memory, external tools, or external control flow, at which point the harness may become more confound than capability. No compelling evidence settles the question yet. The harness framing is currently winning because it matches the available data. The exercise asks students to argue both sides.
How the three layers compose into an agent
This is the integrative move the trilogy was building toward:
- A prompt is a fragment — an instruction, an exemplar, a tool description, a system message. In an agent, every prompt-engineered artifact ends up embedded in the harness as a template, a tool schema, or a piece of guidance that the harness emits to the model at the right moment. Prompt engineering is the discipline of designing those fragments. Most of what the prompt engineering course studied lives inside an agent as static assets the harness assembles from.
- A context is the entire token sequence the model sees on a given step — the prompt fragments plus everything the harness chose to include on this particular iteration. Context engineering is the discipline of policy: which fragments, which prior outputs, which retrieved documents, which tool descriptions, which compacted memory, all subject to finite-resource constraints and degradation curves. In an agent, context engineering is the per-step subroutine of the harness, run anew on every iteration.
- A harness is the program that runs the model in a loop, calls tools, applies the context policy, manages state, and decides when to stop. Harness engineering is the discipline of building that program.
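The layering above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any production system: `SYSTEM`, `TOOL_DOCS`, `count_tokens`, and `assemble_context` are hypothetical names, the token count is a crude whitespace proxy, and the policy (keep the static fragments, then fit the newest history entries into the budget) is just one simple choice among many.

```python
from typing import List

# Static assets: prompt-engineered fragments the harness assembles from.
SYSTEM = "You are a careful coding agent."
TOOL_DOCS = "read(path): return file contents."


def count_tokens(text: str) -> int:
    # Crude whitespace proxy; a real harness would use the model's tokenizer.
    return len(text.split())


def assemble_context(history: List[str], budget: int) -> str:
    """Context policy, run anew on every step.

    Always keeps the static fragments, then adds the newest history
    entries that still fit inside the token budget.
    """
    fixed = [SYSTEM, TOOL_DOCS]
    used = sum(count_tokens(fragment) for fragment in fixed)
    kept: List[str] = []
    for entry in reversed(history):  # walk newest first
        cost = count_tokens(entry)
        if used + cost > budget:
            break
        kept.append(entry)
        used += cost
    return "\n".join(fixed + kept[::-1])  # restore chronological order


history = ["step one output", "step two output", "step three output"]
context = assemble_context(history, budget=16)
print(context)
```

With a budget of 16 proxy tokens, the two static fragments use 10 and only the two most recent history entries fit, so the oldest entry is dropped. A harness would call `assemble_context` inside its loop before each model call; swapping the policy changes what the model sees without touching either the fragments or the loop.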
An agent is what emerges when a model-and-harness pair operates over an environment in pursuit of a goal. The composition has properties no individual layer has:
- Trajectory-dependence. The same agent on the same task can take different paths through the environment depending on stochastic choices, environmental responses, and which retrieved or compacted artifacts the context policy surfaced. Reliability becomes a distribution over runs, which is why τ-bench's pass^k metric, the probability that all k independent attempts at a task succeed, suits production agents better than a single-run success rate.
- Closed-loop failure modes. Agents exhibit pathologies absent from single inferences: loops, goal drift, capability erosion under context rot, sub-agent cascade failures. These emerge from the dynamics of the composition, beyond what a prompt, context window, or static harness diagram can show.
- Emergent capability above the parts. The Manus team's observation that they "rebuilt the framework four times" while keeping the model fixed is evidence that agent capability is largely a harness-and-context property. The Meta-Harness six-fold gap is the same observation, formalized.
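The pass^k metric mentioned under trajectory-dependence is easy to compute once each task has been run several times. The sketch below assumes the estimator commonly attributed to the τ-bench paper: for a task with c successes in n trials, the chance that k trials drawn without replacement all succeed is C(c, k) / C(n, k), averaged over tasks. The function names are hypothetical, and the formula should be checked against the paper before it is used for reported results.

```python
from math import comb
from typing import List


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the probability that k trials of one task all succeed.

    Uses C(successes, k) / C(trials, k); assumed to match tau-bench's pass^k.
    """
    if not 0 < k <= trials:
        raise ValueError("k must satisfy 0 < k <= trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must satisfy 0 <= successes <= trials")
    return comb(successes, k) / comb(trials, k)


def mean_pass_hat_k(per_task_successes: List[int], trials: int, k: int) -> float:
    """Average the per-task estimate over a benchmark's tasks."""
    estimates = [pass_hat_k(c, trials, k) for c in per_task_successes]
    return sum(estimates) / len(estimates)


# One task, 8 successes in 10 runs: reliability falls quickly as k grows.
for k in (1, 2, 4, 8):
    print(k, round(pass_hat_k(8, 10, k), 4))
```

An agent that succeeds 80% of the time on a single attempt drops to 28/45 (about 0.62) at k = 2 and to 1/45 at k = 8 under this estimator, which is why a single-run success rate overstates how dependable the agent is across repeated use.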
The clearest single reading on this synthesis is Lilian Weng's LLM-Powered Autonomous Agents (2023), now somewhat dated but still the cleanest narrative bridge from "LLM" to "agent" in the literature. Wang et al., A Survey on Large Language Model based Autonomous Agents (2024) supplies breadth, and Wang et al., Voyager: An Open-Ended Embodied Agent with Large Language Models (2024) supplies the cleanest case study of an agent that does the integrative thing — accumulating skills the harness can later retrieve, blurring the line between context, harness, and model.
Where the frontier is heading
The course ends on three live questions.
First, the convergence of the three layers under automated optimization. Meta-Harness suggests harnesses can be optimized end-to-end. GEPA and DSPy suggest prompts can be. ACE and Dynamic Cheatsheet suggest contexts can. The natural endpoint is a co-optimized stack where the boundaries between layers blur because all three are being adjusted to a shared task signal. The Anthropic skills framing — SKILL.md files loaded by the harness on demand — is one production-grade instantiation: a skill is simultaneously a prompt fragment, a context-loading policy, and a piece of harness behavior. Students who internalize this stop seeing prompt/context/harness as separable disciplines and start seeing them as a single optimization surface.
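To see how one artifact can sit in all three layers at once, consider a toy sketch of on-demand skill loading. It is hypothetical throughout and is not Anthropic's implementation: the `Skill` fields, the `build_context` function, and the idea that a short description stays in context while the full body loads only when requested are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Skill:
    name: str
    description: str  # short prompt fragment, assumed always visible
    body: str         # full instructions, assumed loaded only on demand


def build_context(task: str, skills: List[Skill], requested: List[str]) -> str:
    """Harness behavior: list every skill, but expand only requested ones."""
    index = [f"- {s.name}: {s.description}" for s in skills]
    loaded = [f"## {s.name}\n{s.body}" for s in skills if s.name in requested]
    return "\n".join(["Available skills:", *index, *loaded, f"Task: {task}"])


skills = [
    Skill("pdf", "Fill PDF forms.", "Locate each field, then write its value."),
    Skill("sql", "Query the warehouse.", "Prefer read-only queries."),
]
ctx = build_context("fill in the tax form", skills, requested=["pdf"])
print(ctx)
```

The description is a prompt fragment, the `requested` filter is a context-loading policy, and `build_context` is harness behavior. An optimizer that edits a skill is therefore adjusting all three layers in a single step.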
Second, agents in environments that contain other agents. Almost every reading in the trilogy assumes the environment is passive — a file system, an API, a webpage. As multi-agent systems become routine, the environment increasingly contains other agents whose harnesses are pursuing their own goals. The classical multi-agent systems literature (Wooldridge, An Introduction to MultiAgent Systems, 2009) is suddenly relevant again, and the harness designs that worked against passive environments often misbehave against active ones. This is one of the most likely sources of the next round of "harness engineering" failure modes.
Third, what the world would look like if the scaffold framing's prediction were true. As a closing exercise, have students articulate what evidence would convince them that the scaffold framing was right: some capability threshold past which the surrounding code becomes more confound than capability. The exercise sharpens their thinking about the relationship between model capability and system capability, and inoculates them against the easy assumption that improvements in either are improvements in both.
Final integrative lab
Students bring forward the artifact each course produced — the prompts from the prompt course, the context-engineering pipeline from the context course, the harness from the harness course — and compose them into a single agent attacking a non-trivial task in a non-trivial environment. AppWorld or a custom domain works well. The deliverable is a paper about the system: students must report which design decisions belonged to which layer, which decisions cut across layers (and therefore which discipline's vocabulary fails to capture them), where the composition introduced behaviors none of the layers predicted, and what their pass^k variance looks like over at least ten runs.
The paper is the trilogy's exit exam. Students who can write it well will understand something most working practitioners miss: prompt, context, and harness engineering name adjacent regions of a single design surface, and the agent sits on top of all three.
A Closing Note for the Final Lecture
The trilogy was structured this way because the field arrived at it this way, painfully and in roughly this order. Prompt engineering came first because it was the cheapest way to get useful behavior out of a black-box model. Context engineering followed once context windows grew large enough that their contents became the bottleneck. Harness engineering became respectable only once it was clear that capable models still needed sustained, deterministic infrastructure around them to produce reliable systems.
The next layer — whatever it turns out to be — will get its own name once the field knows what to call it. The test for students is whether they can guess what that name will be. Correct guesses are most likely to come from building the thing before the field names it.