Frontier models (and, in their wake, open-source models) will keep dissolving most general failure modes in agents. What’s left is specification failure: the agent does the wrong thing because it was never told the right thing, or was told something contradictory. Enterprise teams are dealing with growing contexts and prompts as agents take on more complex work, and making that material usable for agents is hard. We believe that the specification from which agent context is derived must be a first-class artifact: a spec you can author, correct, and retrieve from is a durable asset that compounds. We study search and learning algorithms that attack this directly on the τ³-bench banking agent suite, the FinanceBench benchmark, and the Harvey Legal Agent Benchmark, across two open / open-weight backbones.
- The Context Engine is a retrieval sidecar that scales test-time compute on retrieval instead of handing the model raw tools. By investing offline/test-time compute to survey the data, build its own structures and artifacts, and tailor search to the task, it returns fewer, better tokens to the model.
- It lifts retrieval F1 on both backbones, and with it the action-check pass rate: on Mistral Large 3 recall and precision rise together, while on the stronger-retriever GPT-OSS 120B it trades recall for ~11× precision and cuts wasted work from agentic search tool calls.
- Meta-distillation builds on the Context Engine by distilling reusable procedural guidance from solved trajectories, in the token space. This lets feedback and reward signals propagate and improve both the context itself and the search mechanism used to retrieve it.
- On both models, the action-check pass rate improves significantly, exceeding the gains from the Context Engine.
The problem: specification failure
As base models improve, the failures that survive are increasingly not about raw capability. The model can read, plan, and call tools; what it lacks is the organization’s specific, often unwritten, specification of correct behavior: the rules, edge cases, company style, tool contracts, and worked examples that determine whether an action is right here and now. That knowledge lives in docs, prompts, people’s heads, and scattered traces. Nobody can audit it, and corrections to it don’t durably stick or scale.
We call this specification failure, and we think it is the dominant remaining failure mode for enterprise agents. The thesis behind Kinesthetic is that the response is not a bigger model or a cleverer prompt, but treating context, the spec, as a first-class artifact with intentionally-designed interfaces for its human and agent users: something you index, retrieve from, measure, and correct, with a clear authority gradient from human-authored ground truth down through derived, regenerable machinery.
Vanilla agentic search and context engineering fail to capture the nuance of the data they work with. Giving an agent generic search tools with no understanding of what the data contains, how it’s structured, or how it should be interpreted for a given scenario is like giving an explorer a compass but no map: the agent can wander forever and bloat the context (and exhaust the budget) with redundant search calls. We also believe that past trace data (or any historical data about the task) is vastly underutilized in applied settings. Your agents should be able to learn by doing.
This post explores foundational research for infrastructure around this failure mode. We isolate a staged approach: first assembling the optimal context offline, second retrieving precise information for the agent at runtime, and third distilling reusable procedure from experience. Each plays into the others to create the ultimate context layer for an agent.
The setup
We use three benchmarks to showcase our learning infrastructure’s ability to improve agent performance, each highlighting a specific improvement Kinesthetic can offer.
- Benchmark
- The τ³-bench banking knowledge benchmark (Shi et al., 2026) tests a model in a knowledge-retrieval customer-service domain with configurable RAG pipelines, document search, embeddings, and agentic shell-based search. Tasks are long-horizon customer simulations where certain actions must be taken by both agent and customer for success; we set partial rewards as an extra hill-climbable metric per iteration. We view this as our holy grail: the combined power of improved context and continual learning in the token space.
- Data splits
- Split into train and a hold-out eval set via K-means clustering over google/embeddinggemma-300m embeddings, so train is representative of the benchmark's task-type distribution.
- Backbones
- Mistral Large 3 (non-reasoning) and GPT-OSS 120B (reasoning), neither SOTA in any regard. Sonnet 4.6 as the user/customer simulator.
- Configs
- All-tools baseline vs. AWS AgentCore Memory baseline vs. Context Engine (CE) vs. Meta-distillation (MD).
- Baseline
- The all-tools baseline gives the agent 3 search tools: BM25 keyword search, dense embedding search, and agentic shell-based search.
- Metrics
- Success, action-check pass rate (Action r+w), retrieval recall / precision, and turns.
- Benchmark
- FinanceBench is a test suite for open-book financial QA. The open-source sample has 150 annotated questions about publicly traded companies with answers and evidence strings (plus the document and page the answer lives in). We use it only to assess retrieval, so we use just the question and the ground-truth document + page number.
- Data splits
- Used purely as a retrieval dataset: paired questions and ground-truth document/page numbers to score retrieval accuracy.
- Backbones
- Mistral Large 3 as the Context Engine model.
- Configs
- Context Engine (CE) search vs. vanilla dense retrieval.
- Baseline
- Vanilla dense embedding search with OpenAI text-embedding-3-large embeddings.
- Metrics
- Retrieval recall, precision, and F1 score.
- Benchmark
- The Harvey Legal Agent Benchmark (LAB) assesses LLM agents on realistic legal work. Each task consists of an instruction, documents, and rubrics, and is run via a vanilla harness that executes the LLM in an agentic tool-calling loop and grades the deliverable with LLM-as-judge. Procedural knowledge is paramount: we aim to show meta-distillation can automate learning that would otherwise need post-training or manual context curation.
- Data splits
- Split into train and a hold-out eval set via K-means clustering over google/embeddinggemma-300m embeddings, so train is representative of the task-type distribution. For experiment budget reasons, we constrained this experiment to 44 test tasks within the insurance, energy natural resources, and immigration family groups from the benchmark's pre-given task splits.
- Backbones
- Mistral Large 3 as the student; an Opus 4.5 teacher supplies the offline trajectories MD distills from; Sonnet 4.6 is the LLM-as-judge. (CE on Harvey is future work: a quick pass found much of each document's context is irrelevant to a given task, i.e. context rot.)
- Configs
- Meta-distillation (MD) vs. the base LAB harness, to instill procedural learnings (from contrastive methods) in the token space.
- Baseline
- The base harness provided by the LAB benchmark, no guidance.
- Metrics
- Task Success Score, criteria pass rate, and turns.
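The cluster-based train / hold-out split described above can be sketched as follows. This is a minimal illustration and not the code used in the experiments: the pure-Python `kmeans`, the `cluster_stratified_split` helper, the 30% hold-out fraction, and the toy 2-D vectors are all assumptions made here for illustration. In practice the vectors would be the google/embeddinggemma-300m embeddings of each task.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def cluster_stratified_split(task_ids, embeddings, k, eval_fraction=0.3, seed=0):
    """Hold out a share of every cluster so train still covers each task type."""
    labels = kmeans(embeddings, k, seed=seed)
    rng = random.Random(seed)
    train, held_out = [], []
    for c in range(k):
        members = [t for t, lab in zip(task_ids, labels) if lab == c]
        rng.shuffle(members)
        # Always leave at least one member of a non-empty cluster in train.
        n_eval = min(round(len(members) * eval_fraction), max(len(members) - 1, 0))
        held_out.extend(members[:n_eval])
        train.extend(members[n_eval:])
    return train, held_out
```

Sampling the hold-out set per cluster, rather than uniformly over all tasks, is what keeps rare task types from ending up only in eval.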
The domain-specific data
The first stage is to organize the data so any agentic search method has an easier time finding what it needs. Search methods are often benchmarked against flat document structures, with no offline compute spent understanding or organizing the data. At Kinesthetic, we believe most text-based data can be organized better for agents.
The context engine prepares for itself offline, surveying the ground truth and building its own structures, features, and artifacts to make searching easier, so the runtime payload stays small. This can look like hierarchical directory structuring, relational graphs, generated summaries, or even net-new document synthesis that combines procedures across documents.
The Context Engine
The second stage is improving over vanilla search. In the context engine, a retrieval sidecar sits beside the agent, not inline in its tool loop. Vanilla all-tools search (embedding, BM25, grep) isn’t optimal: the model pays for every token it reads, in context and in cost. The sidecar tailors search to the task and dynamically scales test-time compute on retrieval, returning fewer, better tokens, drawing on the structures the offline survey already built.
What does the work underneath is a novel retrieval architecture that fuses several search modalities into one multi-agent loop (exact shell-style search, dense embedding search, graph search, etc.), all running over those structures. This architecture was inspired by the learnings and results from the Direct Corpus Interaction (DCI) paradigm (Li et al., 2026) and Recursive Language Models (RLM; Zhang et al., 2025).
The engine reasons about what kind of lookup each part of a task needs and composes the modalities accordingly, rather than firing all of them blindly. Different questions want different tools: some want the exact token, some the nearest meaning, some the document three references away. Routing them well, against an already-organized corpus, is where the extra retrieval compute turns into fewer, better tokens.
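To make the routing idea concrete, here is a deliberately simple sketch. It is not the engine's actual router, which reasons about each sub-question rather than pattern-matching: the `Lookup` type, the regular expression, the cue phrases, and the three modality names are hypothetical stand-ins chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Lookup:
    modality: str  # "exact", "dense", or "graph"
    query: str


# Hypothetical cues: identifiers or quoted strings suggest an exact-match lookup.
_IDENTIFIER = re.compile(r'\b[A-Z]{2,}-?\d+\b|"[^"]+"')
# Hypothetical cues: relational phrasing suggests following references between documents.
_REFERENCE_CUES = ("refers to", "see also", "as defined in", "linked", "depends on")


def route(sub_question):
    """Pick one search modality for one sub-question."""
    lowered = sub_question.lower()
    if _IDENTIFIER.search(sub_question):
        return Lookup("exact", sub_question)
    if any(cue in lowered for cue in _REFERENCE_CUES):
        return Lookup("graph", sub_question)
    # Default: nearest-meaning search over embeddings.
    return Lookup("dense", sub_question)


def plan(sub_questions):
    """Compose a retrieval plan instead of firing every modality at every question."""
    return [route(q) for q in sub_questions]
```

The point of the sketch is the shape of the decision, one modality per sub-question, not the heuristics themselves.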
The Context Engine’s retrieval system implies a strict authority gradient. The ground-truth spec is the source of truth; everything the engine derives from it (indexes, structures, assembled context) is disposable and regenerable. A human correction to the spec flows down and rebuilds the rest, so the derived machinery never hardens into a second source of truth.
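One way to picture the authority gradient is a derived store that is only ever rebuilt from the spec, never edited directly. The sketch below assumes the spec is a mapping from document name to text and uses a toy inverted index as the derived artifact; `spec_fingerprint`, `build_index`, and `DerivedStore` are illustrative names, not part of the product.

```python
import hashlib
from dataclasses import dataclass, field


def spec_fingerprint(spec):
    """Stable hash of the human-authored spec (document name -> text)."""
    digest = hashlib.sha256()
    for name in sorted(spec):
        digest.update(name.encode("utf-8") + b"\0")
        digest.update(spec[name].encode("utf-8") + b"\0")
    return digest.hexdigest()


def build_index(spec):
    """Derived artifact: a toy inverted index from term to document names."""
    index = {}
    for name, text in spec.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(name)
    return index


@dataclass
class DerivedStore:
    fingerprint: str = ""
    index: dict = field(default_factory=dict)

    def sync(self, spec):
        """Rebuild everything derived if the spec changed; return True on rebuild."""
        current = spec_fingerprint(spec)
        if current == self.fingerprint:
            return False
        # The old index is discarded wholesale, never patched in place.
        self.index = build_index(spec)
        self.fingerprint = current
        return True
```

Because the index is a pure function of the spec, a human correction to the spec is the only way its contents can change.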
Results
Analysis
- Mistral: recall and precision rise together: recall 48% → 64%, precision 18% → 38%. The Context Engine’s retrieval doesn’t trade recall for precision; both improve, giving the agent concise, relevant context.
- GPT-OSS (not shown): focus, not recall. Against a backbone that already over-retrieves (recall 0.87, precision 0.035, ~33 turns/task of grepping), the engine lowers recall to 0.51 but buys ~11× the precision (0.035 → 0.40), an F1 jump from 0.07 to 0.45, and cuts wasted work to ~20 turns/task. The value here isn’t finding more; it’s finding only the right things so the model stops grepping and starts acting.
- FinanceBench corroborates on pure retrieval. Page-level F1 climbs from vanilla’s best 0.29 (at k=1) to 0.55, and vanilla can only match the engine’s recall by flooding k=10, collapsing precision to 0.09. At equal recall the engine is ~4–5× more precise; doc-level F1 0.68 → 0.83.
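The precision, recall, and F1 figures above follow the standard set-based definitions. The short sketch below states them and rechecks the GPT-OSS arithmetic from the reported aggregates. The function names and the page identifiers in the usage note are illustrative, and an F1 computed from aggregate precision and recall need not equal a per-task average, so treat this as a consistency check only.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(retrieved, relevant):
    """Set-based retrieval metrics for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall, f1_from(precision, recall)


# Reported GPT-OSS aggregates: baseline vs. Context Engine.
baseline_f1 = f1_from(precision=0.035, recall=0.87)
engine_f1 = f1_from(precision=0.40, recall=0.51)
precision_gain = 0.40 / 0.035
```

Here `baseline_f1` rounds to 0.07, `engine_f1` rounds to 0.45, and `precision_gain` is about 11.4, matching the figures quoted for GPT-OSS. For a single query, `precision_recall_f1(["p3", "p7", "p9", "p12"], ["p3", "p4"])` gives precision 0.25 and recall 0.5.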
Meta-distillation & continual learning
The third and final stage is where we begin to see the learning loop close. Where CE curates declarative context, MD captures procedural knowledge: it distills reusable how-to guidance from solved trajectories and feedback, in the token space, so it composes with whatever the backbone emits rather than depending on a model’s reasoning machinery.
The raw material is trajectories (and accompanying traces, KB data, feedback, etc.) where a stronger teacher solved a task the student got wrong. From each of those, meta-distillation extracts a compact, reusable how-to: when the situation applies, the procedure that worked, the decision criteria and context the teacher was implicitly using, and a paired right-way/wrong-way example. Crucially, the guides talk about context and tools by their role in the workflow, so the procedure transfers to new tasks. Related guides on the same theme then consolidate into a single playbook: a shared spine plus the specifics.
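The guide-and-playbook structure described above might be represented like this. The field names mirror the components listed in the text, but the `Guide` and `Playbook` types and the `consolidate` function are assumptions for illustration. In particular, matching shared steps by exact string equality is a stand-in for whatever consolidation the real pipeline performs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Guide:
    theme: str
    applies_when: str        # when the situation applies
    procedure: tuple         # ordered steps, naming tools by workflow role
    decision_criteria: str   # what the teacher was implicitly checking
    right_example: str
    wrong_example: str


@dataclass(frozen=True)
class Playbook:
    theme: str
    spine: tuple      # steps shared by every guide on the theme
    specifics: tuple  # (applies_when, extra steps) for each guide


def consolidate(guides):
    """Merge guides on the same theme into a shared spine plus specifics."""
    by_theme = defaultdict(list)
    for guide in guides:
        by_theme[guide.theme].append(guide)
    playbooks = []
    for theme, group in by_theme.items():
        shared = set(group[0].procedure).intersection(*(g.procedure for g in group[1:]))
        # Keep the spine in the order the first guide lists it.
        spine = tuple(step for step in group[0].procedure if step in shared)
        specifics = tuple(
            (g.applies_when, tuple(s for s in g.procedure if s not in shared))
            for g in group
        )
        playbooks.append(Playbook(theme, spine, specifics))
    return playbooks
```

Two dispute-handling guides that both start with "verify identity" and "look up the transaction" would collapse into one playbook whose spine holds those two steps, with each guide's remaining step kept under its own trigger.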
The reason this generalizes is that it’s distillation in token space. The output is plain procedural text, not a weight update: there’s no reward model and no training run, so it composes with whatever the backbone already emits. It’s a close cousin of recent self-distillation work that turns a model’s own feedback- or demonstration-conditioned predictions into a learning signal (Hübotter et al., 2026; Shenfeld et al., 2026), except we keep the distillate in the token space the agent reads at inference, rather than folding it back into weights. That’s why the same distilled procedure lifted both a non-reasoning model and a reasoning one: it’s additive to whatever the model and the retrieval layer were already doing, not a replacement for either. At inference the relevant guidance is pulled in for the situation at hand and kept stable while the agent works through it, so the agent carries procedure instead of re-deriving it every run.
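A rough sketch of the inference-time step: pick the guidance relevant to the situation once, then pin it in the prompt for the rest of the episode. The post does not specify how relevance is decided, so the lexical-overlap scoring, the stopword list, and the function names below are assumptions made for illustration.

```python
import re

# Hypothetical stopword list so common words do not count as a match.
_STOPWORDS = {"a", "an", "the", "to", "of", "on", "in", "for", "and", "or", "is"}


def _terms(text):
    return set(re.findall(r"[a-z]+", text.lower())) - _STOPWORDS


def select_guidance(situation, guides, top_k=2):
    """guides maps a trigger description to guidance text; returns the best matches."""
    scored = []
    for trigger, guidance in guides.items():
        overlap = len(_terms(situation) & _terms(trigger))
        if overlap:
            scored.append((overlap, trigger, guidance))
    # Highest overlap first; ties broken by trigger text for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [guidance for _, _, guidance in scored[:top_k]]


def build_system_prompt(base_prompt, situation, guides):
    """Select guidance once per episode so it stays stable while the agent works."""
    selected = select_guidance(situation, guides)
    if not selected:
        return base_prompt
    lines = "\n".join(f"- {g}" for g in selected)
    return f"{base_prompt}\n\nProcedural guidance:\n{lines}"
```

Because the output is plain text appended to the prompt, the same guidance can be handed to a reasoning or a non-reasoning backbone without any change to the model.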
The format has shown its utility through iterations of an outer meta-training loop (inspired by GEPA and DSPy), but it also points to something larger. Every guide is built from a trajectory a teacher actually won, which makes it labeled procedural data: the same loop that improves the agent today is quietly producing the policy data you’d train your own model on tomorrow.
Results
Analysis
- Mistral: Action checks reach 45.7% on held-out tasks, topping the Context Engine’s 41.5%. The gain is carried as compact procedure in context rather than as retrieval compute at query time, at a fraction of the cost (~$0.16 vs. ~$0.45 agent cost per task).
- GPT-OSS: Meta-distillation pushes the action score to 16.8%, the best of any arm, beating the Mistral-transfer buffer (12.7%), AgentCore (13.6%), the Context Engine (15.0%), and the baseline (6.0%), at the lowest agent cost. It keeps the strong native retrieval (recall ~0.71) and adds procedure on top rather than replacing it.
- It beats managed episodic memory, and the handicap runs the wrong way. On Mistral, AgentCore Memory lands near the all-tools baseline (30.4% vs. ~28%), far below MD (45.7%); on GPT-OSS it’s competitive but still loses to meta-distillation. AgentCore was also fed the stronger Opus 4.5 memory-writer while MD used Sonnet, so the gap isn’t a generation-model artifact.
- On Harvey, the lift has exactly the right shape: rescue the weak, hold the strong. Tasks the bare student is near-helpless on jump +0.369 (0.006 → 0.375, from can’t do it to competent); tasks it already handles stay flat (−0.001). Several blank failures become near-solved (draft-markup-of-EPC-contract 0.00 → 0.95). And guidance makes the weak model more reliable: the first-attempt no-deliverable rate halves (11.4% → 4.5%).
- Generality across model classes and domains. The same distilled procedure helped a non-reasoning backbone (Mistral) and a reasoning one (GPT-OSS), and the method carried from banking support to legal work. Working in token space is exactly why: the intervention isn’t tied to any one model’s reasoning machinery.
A foundation for self-improvement
The three stages aren’t separate tricks; they compose. The survey structures the corpus, the Context Engine retrieves precise declarative context over it, and meta-distillation carries the procedural knowledge for using that context. Run the engine and the distillate together on Mistral and they stack: the combined arm edges past both individual methods at once, capturing the union of what each contributes rather than averaging them. We aren’t claiming recursive self-improvement yet; what this shows is the foundation for it, and for continual learning, built where it should start: in the context infrastructure.
Results
- The first step toward a self-improving loop. Stacked today, the combined arm is only marginally ahead of meta-distillation alone (46.2% vs. 45.7%), and that’s exactly the next thread to pull: in future experiments the distillate should improve the search itself, so procedure mined from solved work feeds back into how the Context Engine retrieves, not just into what the agent does with what it retrieves. That’s the step that closes the loop. The pieces are already in place: the survey structures the corpus, the engine retrieves better context, better context produces more solved trajectories (optimal policy data), and those trajectories distill back into procedure that improves the agent again. Run over a team’s live traffic, that loop becomes self-reinforcing: better context yields more optimal rollouts, a better distillate, and better context again, all anchored to a human-authored spec at the top of the authority gradient. That is the foundation we think recursive self-improvement and continual learning should be built on.
Why this matters
Continual learning foundations
We see this as a necessary first step toward continual learning. By pushing the ceiling of a base model’s ability closer to the optimal policy, all downstream RL benefits, and sampled rollouts have a better chance of producing a high reward. These methods also work with small amounts of annotated data (here, provided by the benchmark) that mirror a realistic deployment. Even with weaker evals like LLM-as-judge (no full verifiers as in RLVR), you can automatically improve performance with orders of magnitude less labeling work than sample-hungry methods (SFT, RL). And the specification is fully human-readable and separated from the harness implementation, so interactions are ergonomic for humans as well as the agent: no more wrestling with trace data, annotations, index updates, and test infrastructure just to change the spec for an edge case.
Kinesthetic architecture
There’s a clean architecture underneath, and it’s the one the product is built on: human-authored ground truth is authoritative; a derived engine turns it into runtime context; and transient, model-specific harness features adapt it to a given backbone and shrink as models improve. Nothing downstream of ground truth is a second source of truth; it’s all regenerable from the spec.
From corrections to policy data
This experimental setup shows what you can do with a small sample of trajectories that have been labeled or annotated against human-authored or human-verified success criteria. In the real world, where annotations are expensive and scarce, it shows how valuable a single successful demonstration of a task is. By improving the context layer, you produce more optimal policy data (rollouts/trajectories that solve a task), which in turn creates data for distillation. This loop can run continuously over your team’s live traffic and eventually produce high-quality data for a post-trained model you own: specified by your experts rather than rented from a frontier vendor.
Future work
- Extend the experiment setup: multi-trial runs, a retrieval-compute scaling curve, ablations of the individual Context Engine and meta-distillation components and processes, and scaling to additional benchmarks.
- Extend Meta-Distillation Exploration: What to distill, how aggressively to filter, and how well performance gains scale with teacher and data quality. Additionally, characterize how reasoning vs. non-reasoning models consume the same procedure, and turn compounded corrections (from annotators or LLM-as-judge) into trainable, on-policy data to fully close the loop.
- Frontier Expansion: Both methods were deliberately tested on open / open-weight backbones, and applying the same context engine and meta-distillation on top of a frontier model is the natural next step to see whether they push SOTA higher rather than only closing the gap to it.
- Partner with us on real-world data: The cleanest test of this work isn’t another benchmark: it’s a design partner’s own traffic. There, the action-check gains we measured can convert into end-to-end success inside a harness the partner already owns and tunes, and every correction along the way compounds into the context layer as a durable, ownable asset. If you’re running enterprise agents and feeling specification failure, get in touch.
References
- Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, et al. GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning. arXiv preprint arXiv:2507.19457, 2025.
- Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, et al. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv preprint arXiv:2310.03714, 2023.
- Quan Shi, Alexandra Zytek, Pedram Razavi, Karthik Narasimhan, and Victor Barres. τ-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge. arXiv preprint arXiv:2603.04370, 2026.
- Jonas Hübotter, Frederike Lübeck, Lejs Behric, et al. Reinforcement Learning via Self-Distillation. arXiv preprint arXiv:2601.20802, 2026.
- Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-Distillation Enables Continual Learning. arXiv preprint arXiv:2601.19897, 2026.
- Zhuofeng Li, Haoxiang Zhang, Cong Wei, et al. Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction. arXiv preprint arXiv:2605.05242, 2026.
- Alex L. Zhang, Tim Kraska, and Omar Khattab. Recursive Language Models. arXiv preprint arXiv:2512.24601, 2025.