
Specifying the agent: closing the gap to frontier with a context engine and meta-distillation

Anthony Le · Mac Broido · Kinesthetic Research · June 2026

TL;DR

Frontier models (and, in their wake, open-source models) will keep dissolving most general failure modes in agents. What’s left is specification failure: the agent does the wrong thing because it was never told the right thing, or was told something contradictory. Enterprise teams face ever-growing contexts and prompts as agents take on more complex work, and making that material usable for agents is hard. We believe the specification from which agent context is derived must be a first-class artifact: a spec you can author, correct, and retrieve from is a durable asset that compounds. We study search and learning algorithms that attack this directly on the τ³-bench banking agent suite, the FinanceBench benchmark, and the Harvey Legal Agent Benchmark, across two open / open-weight backbones.

The Context Engine is a retrieval sidecar that scales test-time compute on retrieval instead of handing the model raw tools. By investing offline/test-time compute to survey the data, build its own structures and artifacts, and tailor search to the task, it returns fewer, better tokens to the model.
- It lifts retrieval F1 on both backbones, and with it the action-check pass rate: on Mistral Large 3 recall and precision rise together, while on the stronger-retriever GPT-OSS 120B it trades recall for ~11× precision and cuts wasted work from agentic search tool calls.
Meta-distillation adds onto the context engine by distilling reusable procedural guidance from solved trajectories, in the token space. This lets feedback and reward signals propagate and improve both the context itself and the search mechanism used to retrieve it.
- On both models, the action-check pass rate improves substantially, exceeding the gains from the Context Engine alone.

τ³-bench · banking

Both levers convert spec into correct action

Action-check pass rate. Each bar starts at its backbone's all-tools baseline (muted) and stacks the gain the method adds on top.

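Legend: All-tools baseline · + Context Engine · + Meta-distillation · + CE & MD (Mistral). Y-axis: action-check pass rate, 0–50%.

Mistral Large 3 (non-reasoning · base 28.2%): + Context Engine 41.5% · + Meta-distillation 45.7% · + CE & MD 46.2%
GPT-OSS 120B (reasoning · base 6.0%): + Context Engine 15.0% · + Meta-distillation 16.8%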

Action-check pass rate (Action r+w). Mistral: baseline & CE pooled over hold-out data (n=64); meta-distillation and the combined CE+MD arm on the val split (n=32). GPT-OSS: val hold-out (n=31–32); MD is the backbone-native buffer distilled from GPT-OSS's own trajectories.

FinanceBench + Harvey LAB

The same two levers, two more domains

Beyond banking, each method carries its headline result into a domain that isolates it: retrieval quality on financial filings, procedural learning on legal work.

FinanceBench · Context Engine

Page-level retrieval F1: vanilla dense search vs. the Context Engine

0.29 → 0.55 (best vanilla → CE)

~4–5× more precise at equal recall. The planner, entity graph, and shell do the work; bespoke chunking didn't beat whole pages.

Harvey LAB · Meta-distillation

Rubric pass rate: weak Mistral student, no guidance vs. distilled how-tos

0.35 → 0.45 (+0.10 over baseline)

Rescue the weak, hold the strong. Tasks the bare student couldn't do jump +0.37 (0.01 → 0.38); tasks it already handled stay flat.

Both from offline-mined teacher work, no weight updates. Full breakdowns in Stage 2 (Context Engine) and Stage 3 (meta-distillation) below.

The problem: specification failure

As base models improve, the failures that survive are increasingly not about raw capability. The model can read, plan, and call tools; what it lacks is the organization’s specific, often unwritten, specification of correct behavior: the rules, edge cases, company style, tool contracts, and worked examples that determine whether an action is right here and now. That knowledge lives in docs, prompts, people’s heads, and scattered traces. Nobody can audit it, and corrections to it don’t durably stick or scale.

We call this specification failure, and we think it is the dominant remaining failure mode for enterprise agents. The thesis behind Kinesthetic is that the response is not a bigger model or a cleverer prompt, but treating context, the spec, as a first-class artifact with intentionally-designed interfaces for its human and agent users: something you index, retrieve from, measure, and correct, with a clear authority gradient from human-authored ground truth down through derived, regenerable machinery.

Vanilla agentic search and context engineering fail to capture the nuance of the data they work with. Giving an agent generic search tools with no understanding of what the data contains, how it’s structured, or how it should be interpreted for a given scenario is like handing an explorer a compass but no map: the agent can wander forever and bloat the context (and exhaust the budget) with redundant search calls. Additionally, we believe that past trace data (or any historical data of the task) is vastly underutilized in the applied setting. Your agents should be able to learn by doing.

This post explores foundational research for infrastructure around this failure mode. We isolate a staged approach: first assembling the optimal context offline, second retrieving precise information for the agent at runtime, and third distilling reusable procedure from experience. Each plays into the others to create the ultimate context layer for an agent.

The setup

We use three benchmarks to showcase our learning infrastructure’s ability to improve agent performance, each highlighting a specific improvement Kinesthetic can offer.

τ³ Banking Knowledge: Customer support · the holy grail

long-horizon sims · RAG / shell search · Mistral + GPT-OSS

Benchmark: The τ³-bench banking knowledge benchmark (Shi et al., 2026) tests a model in a knowledge-retrieval customer-service domain with configurable RAG pipelines, document search, embeddings, and agentic shell-based search. Tasks are long-horizon customer simulations where certain actions must be taken by both agent and customer for success; we set partial rewards as an extra hill-climbable metric per iteration. We view this as our holy grail: the combined power of improved context and continual learning in the token space.
Data splits: Split into train and a hold-out eval set via K-means clustering over google/embeddinggemma-300m embeddings, so train is representative of the benchmark's task-type distribution.
Backbones: Mistral Large 3 (non-reasoning) and GPT-OSS 120B (reasoning), neither SOTA in any regard. Sonnet 4.6 as the user/customer simulator.
Configs: All-tools baseline vs. AWS AgentCore Memory baseline vs. Context Engine (CE) vs. Meta-distillation (MD).
Baseline: The all-tools baseline gives the agent 3 search tools: BM25 keyword search, dense embedding search, and agentic shell-based search.
Metrics: Success, action-check pass rate (Action r+w), retrieval recall / precision, and turns.
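
The cluster-based split described under "Data splits" can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the experiments: the 2-D vectors stand in for google/embeddinggemma-300m embeddings, and the helper names (`kmeans`, `stratified_split`), the farthest-point initialisation, and the 50% hold-out fraction are all hypothetical choices.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; returns one cluster id per point."""
    # Deterministic farthest-point initialisation (an illustrative choice).
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def stratified_split(task_ids, labels, holdout_frac=0.5, seed=0):
    """Hold out the same fraction of every cluster so both splits share the task-type mix."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for task_id, label in zip(task_ids, labels):
        by_cluster[label].append(task_id)
    train, holdout = [], []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        rng.shuffle(members)
        n_holdout = round(len(members) * holdout_frac)
        holdout.extend(members[:n_holdout])
        train.extend(members[n_holdout:])
    return train, holdout


# Toy 2-D "embeddings" forming two obvious task types (placeholder data).
embeddings = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.1, 0.2],
              [5.0, 5.1], [5.1, 5.0], [5.2, 5.1], [5.1, 5.2]]
task_ids = [f"task-{i}" for i in range(len(embeddings))]
labels = kmeans(embeddings, k=2)
train, holdout = stratified_split(task_ids, labels)
```

Sampling within clusters, rather than over the whole pool, is what keeps a small hold-out set from missing a rare task type entirely.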

FinanceBench: Context retrieval, isolated

150 Qs · 84 filings · page-level retrieval · Mistral

Benchmark: A test suite for open-book financial QA. The open-source sample has 150 annotated questions about publicly traded companies with answers and evidence strings (plus the document and page the answer lives in). We use it only to assess retrieval, so we use just the question and the ground-truth document + page number.
Data splits: Used purely as a retrieval dataset: paired questions and ground-truth document/page numbers to score retrieval accuracy.
Backbones: Mistral Large 3 as the Context Engine model.
Configs: Context Engine (CE) search vs. vanilla dense retrieval.
Baseline: Vanilla dense embedding search with OpenAI text-embedding-3-large embeddings.
Metrics: Retrieval recall, precision, and F1 score.
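
Page-level retrieval scoring of the kind listed under "Metrics" can be sketched like this. It is an assumed, simplified scorer rather than the evaluation code behind the reported numbers; the file names, page numbers, and the `retrieval_scores` helper are hypothetical.

```python
def retrieval_scores(retrieved, gold):
    """Precision, recall, and F1 for one question over (document, page) pairs."""
    retrieved_set, gold_set = set(retrieved), set(gold)
    hits = len(retrieved_set & gold_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical question whose evidence lives on a single page.
gold = [("ACME_2022_10K.pdf", 47)]

# Hypothetical dense-search ranking: the gold page appears at rank 2.
ranked = [
    ("ACME_2022_10K.pdf", 12),
    ("ACME_2022_10K.pdf", 47),
    ("ACME_2021_10K.pdf", 47),
    ("ACME_2022_10K.pdf", 48),
    ("ACME_2022_10K.pdf", 3),
]

top1 = retrieval_scores(ranked[:1], gold)   # misses the gold page entirely
top3 = retrieval_scores(ranked[:3], gold)   # finds it, with two wasted pages
top5 = retrieval_scores(ranked[:5], gold)   # same recall, lower precision

# A retriever that stops early returns fewer pages for the same recall.
self_limited = retrieval_scores(
    [("ACME_2022_10K.pdf", 47), ("ACME_2022_10K.pdf", 48)], gold
)
```

In this toy case, growing k past the rank of the gold page cannot raise recall any further, so every extra page only lowers precision; that is the trade-off a fixed top-k retriever cannot escape.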

Harvey LAB: Legal work, procedural

44 test tasks · LLM-rubric scored · Mistral student · Opus teacher

Benchmark: The LAB benchmark assesses LLM agents on realistic legal work. Each task is an instruction, documents, and rubrics, run via a vanilla harness that executes the LLM in an agentic tool-calling loop and grades the deliverable with LLM-as-judge. Procedural knowledge is paramount: we aim to show meta-distillation can automate learning that would otherwise need post-training or manual context curation.
Data splits: Split into train and a hold-out eval set via K-means clustering over google/embeddinggemma-300m embeddings, so train is representative of the task-type distribution. For budget reasons, we constrained this experiment to 44 test tasks within the insurance, energy natural resources, and immigration family groups from the benchmark’s pre-given task splits.
Backbones: Mistral Large 3 as the student; an Opus 4.5 teacher supplies the offline trajectories MD distills from; Sonnet 4.6 is the LLM-as-judge. (CE on Harvey is future work: a quick pass found much of each document's context is irrelevant to a given task, i.e. context rot.)
Configs: Meta-distillation (MD) vs. the base LAB harness, to instill procedural learnings (from contrastive methods) in the token space.
Baseline: The base harness provided by the LAB benchmark, no guidance.
Metrics: Task Success Score, criteria pass rate, and turns.

Stage 1 · τ³-bench, FinanceBench

The domain-specific data

The first stage is to organize the data so any agentic search method has an easier time finding what it needs. Search methods are often benchmarked against flat document structures, with no offline compute spent understanding or organizing the data. At Kinesthetic, we believe most text-based data can be organized better for agents.

The context engine prepares itself offline, surveying the ground truth and building its own structures, features, and artifacts to make searching easier, so the runtime payload stays small. This can look like hierarchical directory structuring, relational graphs, generated summaries, or even net-new document synthesis that combines procedures across documents.

τ³-bench · banking knowledge

From a flat pile of docs to a navigable structure

Offline survey

Before: flat · ~799 docs

green_account.md, bnpl_gold.md, diamond_vault.md, silver_saver.md, virtual_card.md, qr_transfers.md, ecocard.md, sweep_program.md, split_the_bill.md, evergreen.md, platinum_reserve.md, crypto_cash_back.md, support_codes.md, rho_bank_plus.md, blue_account.md

After: surveyed & structured

banking_knowledge/
  checking_accounts/
    blue_account/ (+ entities)
    evergreen_account/
    … 9 more tiers
  credit_cards/
    diamond_elite_card/
    … virtual_card_management/
  buy_now_pay_later/ (6 tiers)
  savings_accounts/ (9 tiers)

The survey also emits:
- Relation graphs: links between related docs
- Summaries: per-doc & per-cluster
- Entity tags: extracted & attached to docs

This is one possible way the survey could organize a flat directory of ~799 documents into a tiered hierarchy, not the exact structure we produced. Alongside the hierarchy it derives other artifacts too, all of which the agentic loop can route its shell / dense / graph lookups over.
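
To make the derived artifacts concrete, here is a minimal sketch of two of them, entity tags and a relation graph, built from a flat set of documents. Everything in it is an assumption for illustration: the document text is invented placeholder content (only the file names echo the figure), the fixed `ENTITY_VOCAB` stands in for whatever extraction the real survey performs, and "shares an entity" is just one possible definition of a relation.

```python
from collections import defaultdict
from itertools import combinations

# Placeholder documents: invented text, not the benchmark's actual content.
docs = {
    "blue_account.md": "Blue Account is a checking account with overdraft protection.",
    "evergreen.md": "Evergreen is a checking account that waives the monthly fee.",
    "bnpl_gold.md": "BNPL Gold is a buy now pay later plan with a monthly fee.",
}

# Hypothetical entity vocabulary; a real survey would extract these itself.
ENTITY_VOCAB = ["checking account", "overdraft protection", "monthly fee", "buy now pay later"]


def tag_entities(text, vocab):
    """Return the vocabulary entries that appear in the text (case-insensitive)."""
    lowered = text.lower()
    return sorted(entity for entity in vocab if entity in lowered)


def build_artifacts(docs, vocab):
    """Derive entity tags, an entity -> docs index, and a doc -> related docs graph."""
    entity_tags = {name: tag_entities(text, vocab) for name, text in docs.items()}
    entity_index = defaultdict(set)
    for name, tags in entity_tags.items():
        for tag in tags:
            entity_index[tag].add(name)
    # Two documents are related here when they share at least one entity.
    relation_graph = defaultdict(set)
    for names in entity_index.values():
        for a, b in combinations(sorted(names), 2):
            relation_graph[a].add(b)
            relation_graph[b].add(a)
    return entity_tags, dict(entity_index), dict(relation_graph)


entity_tags, entity_index, relation_graph = build_artifacts(docs, ENTITY_VOCAB)
```

All three outputs are computed purely from the source documents, so they can be thrown away and rebuilt whenever the underlying corpus changes.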

Stage 2 · τ³-bench, FinanceBench

The Context Engine

The second stage is improving over vanilla search. In the context engine, a retrieval sidecar sits beside the agent, not inline in its tool loop. Vanilla all-tools search (embedding, BM25, grep) isn’t optimal: the model pays for every token it reads, in context and in cost. The sidecar tailors search to the task and dynamically scales test-time compute on retrieval, returning fewer, better tokens, drawing on the structures the offline survey already built.

What does the work underneath is a novel retrieval architecture that fuses several search modalities into one multi-agent loop (exact shell-style search, dense embedding search, graph search, etc.), all running over those structures. This architecture was inspired by the learnings and results from the Direct Corpus Interaction (DCI) paradigm (Li et al., 2026) and Recursive Language Models (RLM; Zhang et al., 2025).

The engine reasons about what kind of lookup each part of a task needs and composes the modalities accordingly, rather than firing all of them blindly. Different questions want different tools: some want the exact token, some the nearest meaning, some the document three references away. Routing them well, against an already-organized corpus, is where the extra retrieval compute turns into fewer, better tokens.
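
As a rough illustration of per-lookup routing, the sketch below picks modalities with a few hand-written rules. This is a deliberately simple stand-in: the engine described above reasons about the task rather than matching patterns, and the rules, modality names, and the `route_lookup` helper are all hypothetical.

```python
import re


def route_lookup(sub_query):
    """Pick search modalities for one sub-query using simple, hypothetical rules."""
    modalities = []
    # Quoted strings or code-like tokens (e.g. SC-17) want the exact token.
    if re.search(r'"[^"]+"|\b[A-Z]{2,}-\d+\b', sub_query):
        modalities.append("exact")
    # Cross-reference wording suggests following links between documents.
    if re.search(r"\b(related|referenced|see also|depends on)\b", sub_query, re.IGNORECASE):
        modalities.append("graph")
    # Fall back to nearest-meaning search when nothing more specific applies.
    if not modalities:
        modalities.append("dense")
    return modalities


def plan(sub_queries):
    """Map each sub-query of a task to the modalities it should use."""
    return {query: route_lookup(query) for query in sub_queries}


routes = plan([
    'What does support code "SC-17" mean?',
    "Which policies are referenced by the overdraft rules?",
    "How do I dispute a card charge?",
])
```

The point of the sketch is the shape of the decision, not the rules themselves: each part of a task gets only the modalities it needs instead of every tool firing on every query.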

The Context Engine’s retrieval system implies a strict authority gradient. The ground-truth spec is the source of truth; everything the engine derives from it (indexes, structures, assembled context) is disposable and regenerable. A human correction to the spec flows down and rebuilds the rest, so the derived machinery never hardens into a second source of truth.
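
One way to picture the authority gradient is to stamp every derived artifact with a fingerprint of the spec it was built from and rebuild whenever the two disagree. The sketch below assumes that design; the spec text is invented, and `build_derived` is a trivial stand-in for the real survey and index build.

```python
import hashlib


def spec_fingerprint(spec_text):
    """Content hash of the human-authored spec."""
    return hashlib.sha256(spec_text.encode("utf-8")).hexdigest()


def build_derived(spec_text):
    """Stand-in for the real build: one index entry per non-empty spec line."""
    rules = [line.strip() for line in spec_text.splitlines() if line.strip()]
    return {"spec_hash": spec_fingerprint(spec_text), "rules": rules}


def ensure_fresh(spec_text, derived):
    """Rebuild derived artifacts whenever they were not built from the current spec."""
    if derived is None or derived["spec_hash"] != spec_fingerprint(spec_text):
        return build_derived(spec_text)
    return derived


# Invented example spec and a later human correction to it.
spec_v1 = "Refunds over $500 need manager approval.\n"
derived_v1 = ensure_fresh(spec_v1, None)

spec_v2 = spec_v1 + "Never disclose full account numbers.\n"
derived_v2 = ensure_fresh(spec_v2, derived_v1)   # stale hash, so this rebuilds
derived_same = ensure_fresh(spec_v2, derived_v2)  # already fresh, returned as-is
```

Because nothing is ever edited in the derived layer directly, a correction can only enter through the spec, which is what keeps the derived machinery from becoming a second source of truth.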

The Context Engine as a retrieval sidecar

Spec stays authoritative; the engine spends offline + query-time compute to assemble a small, task-tailored context for the agent.

Ground-truth spec (authoritative): Human-authored rules, edge cases, tool contracts, worked examples.
→ Context Engine (derived · regenerable): Offline survey builds structures; one agentic loop fuses search modalities (shell / exact, dense embed, graph) and routes per task.
→ Assembled context (small payload): Fewer, better tokens, tailored to the task at hand, not the whole corpus.
→ Agent (consumer): Acts on a focused context instead of grepping raw tools.

Authority gradient: A human correction to the spec flows down and rebuilds the derived machinery, so nothing downstream becomes a second source of truth.

The sidecar sits beside the agent's loop, not inside it; retrieval compute is a dial spent in the engine rather than out of the model's token budget.

Results

τ³-bench · retrieval

Isolating retrieval: surface the right material, cut the noise

τ³-bench retrieval on Mistral Large 3: all-tools baseline vs. Context Engine. Both recall and precision rise at once: the signature of spending compute on retrieval rather than sliding along the usual frontier.

Legend: All-tools baseline · Context Engine.
Panels (Mistral Large 3): retrieval recall (%, axis 0–100%) and retrieval precision (%, axis 0–45%).

Shown on Mistral Large 3. On GPT-OSS 120B the same engine reads as focus rather than reach: against a backbone that already over-retrieves (recall ~0.87, precision ~0.035, ~33 turns/task of grepping), it lowers recall to ~0.51 but buys ~11× the precision (~0.40), lifting retrieval F1 0.07 → 0.45 and the action-check pass rate 6.0% → 15.0% while cutting wasted work to ~20 turns/task.

FinanceBench · page-level retrieval

FinanceBench: vanilla never gets recall and precision together

Page-level retrieval F1 on 150 financial-QA questions. Vanilla dense search peaks at k=1 and decays as it floods more chunks; the Context Engine self-limits and dominates at every operating point.

Legend: Vanilla dense (top-k) · Context Engine. Y-axis: page-level retrieval F1, 0.0–0.6.

Vanilla, k=1: 0.29
Vanilla, k=3: 0.24
Vanilla, k=5: 0.21
Vanilla, k=10: 0.15
Context Engine (self-limits, k ≤ 10): 0.55

Analysis

Mistral: recall and precision rise together: recall 48% → 64%, precision 18% → 38%. The Context Engine’s retrieval doesn’t trade recall for precision; both improve, giving the agent concise, relevant context.
GPT-OSS (not shown): focus, not recall. Against a backbone that already over-retrieves (recall 0.87, precision 0.035, ~33 turns/task of grepping), the engine lowers recall to 0.51 but buys ~11× the precision (0.035 → 0.40), an F1 jump 0.07 → 0.45, and cuts wasted work to ~20 turns/task. The value here isn’t finding more, it’s finding only the right things so the model stops grepping and starts acting.
FinanceBench corroborates on pure retrieval. Page-level F1 climbs from vanilla’s best 0.29 (at k=1) to 0.55, and vanilla can only match the engine’s recall by flooding k=10, collapsing precision to 0.09. At equal recall the engine is ~4–5× more precise; doc-level F1 0.68 → 0.83.
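
The GPT-OSS figures above can be sanity-checked with the standard F1 formula, the harmonic mean of precision and recall. This is a back-of-the-envelope check on the reported aggregates, under the assumption that F1 is computed from those aggregates; if the reported F1 is instead averaged per task, small differences would be expected.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Reported GPT-OSS 120B retrieval figures on the banking suite.
baseline = {"recall": 0.87, "precision": 0.035}
engine = {"recall": 0.51, "precision": 0.40}

baseline_f1 = f1(**baseline)
engine_f1 = f1(**engine)
precision_gain = engine["precision"] / baseline["precision"]

print(f"baseline F1 ~ {baseline_f1:.2f}")
print(f"engine F1 ~ {engine_f1:.2f}")
print(f"precision gain ~ {precision_gain:.1f}x")
```

The harmonic mean is dominated by the smaller of the two inputs, which is why a backbone with high recall but very low precision scores so poorly on F1, and why giving up some recall for a large precision gain raises it.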

Stage 3 · τ³-bench, Harvey LAB

Meta-distillation & continual learning

The third and final stage is where we begin to see the learning loop close. Where CE curates declarative context, MD captures procedural knowledge: it distills reusable how-to guidance from solved trajectories and feedback, in the token space, so it composes with whatever the backbone emits rather than depending on a model’s reasoning machinery.

The raw material is trajectories (and accompanying traces, KB data, feedback, etc.) where a stronger teacher solved a task the student got wrong. From each of those, meta-distillation extracts a compact, reusable how-to: when the situation applies, the procedure that worked, the decision criteria and context the teacher was implicitly using, and a paired right-way/wrong-way example. Crucially, the guides talk about context and tools by their role in the workflow, so the procedure transfers to new tasks. Related guides on the same theme then consolidate into a single playbook: a shared spine plus the specifics.
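
The guide and playbook described above can be pictured as two small data structures. The sketch below is an assumed shape, not the actual schema: the field names, the example steps, and the exact-match `consolidate` rule (steps shared by every guide become the spine) are hypothetical, and a real consolidation step would need to handle paraphrased steps.

```python
from dataclasses import dataclass, field


@dataclass
class Guide:
    theme: str
    applies_when: str          # the situation in which the guide is relevant
    procedure: list            # ordered steps, with tools named by role
    decision_criteria: list
    right_example: str
    wrong_example: str


@dataclass
class Playbook:
    theme: str
    spine: list                                     # steps shared by every guide
    specifics: dict = field(default_factory=dict)   # applies_when -> extra steps


def consolidate(guides):
    """Merge guides on one theme into a shared spine plus per-situation specifics."""
    themes = {g.theme for g in guides}
    if len(themes) != 1:
        raise ValueError("consolidate expects guides that share one theme")
    first = guides[0]
    spine = [step for step in first.procedure if all(step in g.procedure for g in guides)]
    specifics = {
        g.applies_when: [step for step in g.procedure if step not in spine]
        for g in guides
    }
    return Playbook(theme=first.theme, spine=spine, specifics=specifics)


# Invented example guides; tools are referenced by role, not by name.
lookup = "Use the search-role tool to find the governing policy"
record = "Use the writer-role tool to record the outcome"
guides = [
    Guide("disputes", "customer disputes a card charge",
          [lookup, "Confirm the customer's identity", record],
          ["charge is under 60 days old"], "verified, then filed", "filed unverified"),
    Guide("disputes", "customer disputes a transfer",
          [lookup, "Check the transfer is not pending", record],
          ["transfer has settled"], "checked status first", "disputed a pending transfer"),
]
playbook = consolidate(guides)
```

Keying the specifics by situation keeps the playbook short while preserving exactly the steps that differ between cases.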

Many wins distill into one playbook

Heterogeneous evidence from work a teacher got right (trajectories, the documents and tools in play, judge feedback, and human annotation) collapses into a single reusable playbook.

Inputs:
Won trajectories: task-07 ✓, task-23 ✓, …
Source documents: spec.md, runbook.md
Tools, by role: ‹search›, ‹writer›, ‹runner›
LLM-as-judge / verifiers: verdicts, scores
Human annotation: corrections, notes

→ Meta-distill (token space · no weight update)
→ playbooks / consolidated.md (consolidated: one playbook)

Convergence. Every input is evidence from work a teacher actually got right. Meta-distillation keeps the procedure and consolidates related guides into one playbook: a shared spine plus the specifics.

The reason this generalizes is that it’s distillation in token space. The output is plain procedural text, not a weight update: there’s no reward model and no training run, so it composes with whatever the backbone already emits. It’s a close cousin of recent self-distillation work that turns a model’s own feedback- or demonstration-conditioned predictions into a learning signal (Hübotter et al., 2026; Shenfeld et al., 2026), except we keep the distillate in the token space the agent reads at inference, rather than folding it back into weights. That’s why the same distilled procedure lifted both a non-reasoning model and a reasoning one: it’s additive to whatever the model and the retrieval layer were already doing, not a replacement for either. At inference the relevant guidance is pulled in for the situation at hand and kept stable while the agent works through it, so the agent carries procedure instead of re-deriving it every run.
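
Pulling guidance in at inference can be sketched as a selection step followed by plain prompt assembly. This is a hypothetical illustration only: the word-overlap scoring, the `min_overlap` threshold, the guide dictionaries, and the prompt wording are all assumptions, and a real system would likely select guidance with something stronger than token overlap.

```python
import re


def tokenize(text):
    """Lowercased word set used for a crude relevance score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_guidance(task, guides, min_overlap=2):
    """Return the guide whose 'applies_when' best overlaps the task, or None."""
    task_tokens = tokenize(task)
    scored = [(len(task_tokens & tokenize(g["applies_when"])), g) for g in guides]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score >= min_overlap else None


def build_system_prompt(base_prompt, task, guides):
    """Build the prompt once per task so the guidance stays stable across turns."""
    guide = select_guidance(task, guides)
    if guide is None:
        return base_prompt
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(guide["procedure"], 1))
    return f"{base_prompt}\n\nProcedure for this situation:\n{steps}"


# Invented guides for illustration.
guides = [
    {"applies_when": "customer disputes a card charge",
     "procedure": ["Find the governing policy", "Confirm identity", "File the dispute"]},
    {"applies_when": "customer asks to close a savings account",
     "procedure": ["Check for pending transfers", "Confirm identity", "Close the account"]},
]
base = "You are a banking support agent."
with_guide = build_system_prompt(base, "Customer wants to dispute a card charge", guides)
without_guide = build_system_prompt(base, "Reset my online password", guides)
```

Since the output is ordinary text prepended to the prompt, the same step works unchanged for any backbone and sits alongside whatever retrieval is already in place.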

The format has shown its utility through iterations of an outer meta-training loop (inspired by GEPA and DSPy), but it also points to something larger. Every guide is built from a trajectory a teacher actually won, which makes it labeled procedural data: the same loop that improves the agent today is quietly producing the policy data you’d train your own model on tomorrow.

Meta-distillation: from won trajectories to injected procedure

Distill compact how-to guides from teacher wins, consolidate into playbooks, and pull the relevant one in at inference.

Won trajectories (raw material): Teacher solved a task the student got wrong.
→ Distill (token space): Extract when-it-applies, the procedure, decision criteria, and a right/wrong example. Tools by role · no weight update.
→ Playbook (consolidated): Related guides de-duplicate into a shared spine plus specifics.
→ Injected (inference): Relevant guidance pulled in and kept stable while the agent works.

Byproduct: Every guide is built from a trajectory a teacher won: labeled procedural data, the policy data you'd train your own model on tomorrow.

Because the distillate is plain text, it's additive to whatever retrieval (bare tools or the Context Engine) is already in place, and model-backbone-agnostic.

Results

Harvey LAB · legal work

Distilled how-tos rescue a weak legal student

Harvey LAB rubric pass rate (fraction of criteria met) for a Mistral student on 44 held-out legal tasks. Guidance is mined offline from an Opus 4.5 teacher and injected at inference: no weight updates, no teacher at runtime.

Legend: Baseline (no guidance) · + Meta-distillation · Opus 4.5 teacher (ceiling ref)

Mean rubric pass rate · all 44 tasks (axis 0.0–0.8): base 0.35 → MD 0.45 → teacher 0.58

By baseline difficulty · rescue weak, hold strong (axis 0.0–0.6):
weak (base < 0.2 · n=12): 0.01 → 0.38
strong (base ≥ 0.2 · n=32): 0.47

Selected wins: tasks the bare student failed, distillation makes competent

- Energy · Draft markup of EPC contract: 0.00 → 0.95
- Insurance · Extract key terms, acquisition closing docs: 0.00 → 0.81
- Energy · Identify issues in concession agreement: 0.00 → 0.71
- Insurance · Draft reservation-of-rights letter: 0.31 → 0.79
- Immigration · Draft petition support letter: 0.22 → 0.69
- Energy · Identify issues in operations & maintenance: 0.00 → 0.47

τ³-bench · banking

Distilled procedure beats managed episodic memory

τ³-bench action-check pass rate: all-tools baseline vs. AWS AgentCore Memory vs. meta-distillation, fed the same teacher trajectories. GPT-OSS uses the backbone-native MD buffer.

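Legend: All-tools baseline · AWS AgentCore Memory · Meta-distillation. Y-axis: action-check pass rate, 0–50%.

Mistral Large 3 (non-reasoning): baseline 28.2% · AgentCore Memory 30.4% · meta-distillation 45.7%
GPT-OSS 120B (reasoning · native buffer): baseline 6.0% · AgentCore Memory 13.6% · meta-distillation 16.8%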

Fairness note: AgentCore got the stronger memory-writer. Its episodic records were extracted by Opus 4.5; the MD guides only by Sonnet. Yet on Mistral AgentCore lands at the all-tools baseline (30.4% vs ~28%), far below MD (45.7%); on GPT-OSS it edges transferred guides but loses to the backbone-native distill (16.8%). The MD win is therefore not explained by writer quality.

Analysis

Mistral: Action checks reach 45.7% on held-out tasks, topping the Context Engine’s 41.5%. The gain is carried as compact procedure in context rather than as retrieval compute at query time, at a fraction of the cost (~$0.16 vs. ~$0.45 agent/task).
GPT-OSS: Meta-distillation pushes the action score to 16.8%, the best of any arm, beating the Mistral-transfer buffer (12.7%), AgentCore (13.6%), the Context Engine (15.0%), and the baseline (6.0%), at the lowest agent cost. It keeps the strong native retrieval (recall ~0.71) and adds procedure on top rather than replacing it.
It beats managed episodic memory, and the handicap runs the wrong way. On Mistral, AgentCore Memory lands at the all-tools baseline (30.4% vs. ~28%), far below MD (45.7%); on GPT-OSS it’s competitive but still loses to the meta-distill. And AgentCore was fed the stronger Opus 4.5 memory-writer while MD used Sonnet, so the gap isn’t a generation-model artifact.
On Harvey, the lift has exactly the right shape: rescue the weak, hold the strong. Tasks the bare student is near-helpless on jump +0.369 (0.006 → 0.375, from can’t do it to competent); tasks it already handles stay flat (−0.001). Several blank failures become near-solved (draft-markup-of-EPC-contract 0.00 → 0.95). And guidance makes the weak model more reliable: the first-attempt no-deliverable rate halves (11.4% → 4.5%).
Generality across model classes and domains. The same distilled procedure helped a non-reasoning backbone (Mistral) and a reasoning one (GPT-OSS), and the method carried from banking support to legal work. Working in token space is exactly why: the intervention isn’t tied to any one model’s reasoning machinery.
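
The Harvey numbers above are internally consistent, which a short calculation makes visible: weighting the two per-group lifts by group size should recover the overall lift. The sketch below does that arithmetic with the reported figures. The task counts behind the no-deliverable rates are an inference that assumes both rates are measured over all 44 tasks; the source does not state those counts.

```python
# Reported group sizes and per-group lifts on Harvey LAB.
n_weak, n_strong = 12, 32
lift_weak, lift_strong = 0.369, -0.001

n_tasks = n_weak + n_strong
overall_lift = (n_weak * lift_weak + n_strong * lift_strong) / n_tasks

# Share of the overall lift contributed by the weak group alone.
weak_contribution = n_weak * lift_weak / n_tasks

# Inferred task counts behind the reported no-deliverable rates
# (assumes both rates are over all 44 tasks).
failures_before = round(0.114 * n_tasks)
failures_after = round(0.045 * n_tasks)

print(f"overall lift ~ {overall_lift:.3f}")
print(f"weak-group contribution ~ {weak_contribution:.3f}")
print(f"no-deliverable tasks: {failures_before} -> {failures_after}")
```

Essentially the entire overall lift comes from the 12 weak tasks, which is the "rescue the weak, hold the strong" shape expressed as a weighted average.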

All three stages together · τ³-bench, Mistral Large 3

A foundation for self-improvement

The three stages aren’t separate tricks; they compose. The survey structures the corpus, the Context Engine retrieves precise declarative context over it, and meta-distillation carries the procedural knowledge for using that context. Run the engine and the distillate together on Mistral and they stack: the combined arm edges past both individual methods at once, capturing the union of what each contributes rather than averaging them. We aren’t claiming recursive self-improvement yet; what this shows is the foundation for it, and for continual learning, built where it should start: in the context infrastructure.

Results

τ³-bench · banking

CE + MD captures the union, not the average

τ³-bench, Mistral, 32 val tasks. The combined arm matches meta-distillation's action quality while keeping the Context Engine's retrieval quality, taking the best of both levers rather than averaging them.

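Legend: All-tools baseline · + Context Engine · + Meta-distillation · + CE & MD. Y-axis: action-check pass rate, 0–50%.

Action-check pass rate (val · n=32): baseline 28.7% · + Context Engine 37.5% · + Meta-distillation 45.7% · + CE & MD 46.2%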

The first step toward a self-improving loop. Stacked today, the combined arm is only marginally ahead of meta-distillation alone (46.2% vs. 45.7%), and that’s exactly the next thread to pull: in future experiments the distillate should improve the search itself, so procedure mined from solved work feeds back into how the Context Engine retrieves, not just into what the agent does with what it retrieves. That’s the step that closes the loop. The pieces are already in place: the survey structures the corpus, the engine retrieves better context, better context produces more solved trajectories (optimal policy data), and those trajectories distill back into procedure that improves the agent again. Run over a team’s live traffic, that loop becomes self-reinforcing: better context yields more optimal rollouts, a better distillate, and better context again, all anchored to a human-authored spec at the top of the authority gradient. That is the foundation we think recursive self-improvement and continual learning should be built on.

Why this matters

Continual learning foundations

We see this as a necessary first step toward continual learning. By pushing the ceiling of a base model’s ability closer to the optimal policy, all downstream RL benefits, and sampled rollouts have a better chance of producing a high reward. These methods also work with small amounts of annotated data (here, provided by the benchmark) that mirror a realistic deployment. Even with weaker evals like LLM-as-judge (no full verifiers as in RLVR), you can automatically improve performance with orders of magnitude less labeling work than sample-hungry methods (SFT, RL). And the specification is fully human-readable and separated from the harness implementation, so interactions are ergonomic for humans as well as the agent: no more wrestling with trace data, annotations, index updates, and test infrastructure just to change the spec for an edge case.

Kinesthetic architecture

There’s a clean architecture underneath, and it’s the one the product is built on: human-authored ground truth is authoritative; a derived engine turns it into runtime context; and transient, model-specific harness features adapt it to a given backbone and shrink as models improve. Nothing downstream of ground truth is a second source of truth; it’s all regenerable from the spec.

From corrections to policy data

This experiment setup shows what you can do with a small sample of trajectories that have been labeled or annotated against human-authored or human-verified success criteria. In the real world, where annotations are expensive and scarce, it shows how valuable a single successful demonstration of a task is. By improving the context layer, you produce more optimal policy data (rollouts/trajectories that solve a task), thereby creating data for distillation. This loop can continuously run over your team’s live traffic and eventually produce high-quality data for a post-trained model you own: specified by your experts rather than rented from a frontier vendor.

Future work

Extend the experiment setup: multi-trial runs, a retrieval-compute scaling curve, ablations of the individual Context Engine and meta-distillation components and processes, and scaling to additional benchmarks.
Extend Meta-Distillation Exploration: What to distill, how aggressively to filter, and how well performance gains scale with teacher and data quality. Additionally, characterize how reasoning vs. non-reasoning models consume the same procedure, and turn compounded corrections (from annotators or LLM-as-judge) into trainable, on-policy data to fully close the loop.
Frontier Expansion: Both methods were deliberately tested on open / open-weight backbones, and applying the same context engine and meta-distillation on top of a frontier model is the natural next step to see whether they push SOTA higher rather than only closing the gap to it.
Partner with us on real-world data: The cleanest test of this work isn’t another benchmark: it’s a design partner’s own traffic. The action-check gains we measured here have somewhere to convert into end-to-end success inside a harness the partner already owns and tunes, and every correction along the way compounds into the context layer as a durable, ownable asset. If you’re running enterprise agents and feeling specification failure, get in touch.

References

Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, et al. GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning. arXiv preprint arXiv:2507.19457, 2025.
Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, et al. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv preprint arXiv:2310.03714, 2023.
Quan Shi, Alexandra Zytek, Pedram Razavi, Karthik Narasimhan, and Victor Barres. τ-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge. arXiv preprint arXiv:2603.04370, 2026.
Jonas Hübotter, Frederike Lübeck, Lejs Behric, et al. Reinforcement Learning via Self-Distillation. arXiv preprint arXiv:2601.20802, 2026.
Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-Distillation Enables Continual Learning. arXiv preprint arXiv:2601.19897, 2026.
Zhuofeng Li, Haoxiang Zhang, Cong Wei, et al. Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction. arXiv preprint arXiv:2605.05242, 2026.
Alex L. Zhang, Tim Kraska, and Omar Khattab. Recursive Language Models. arXiv preprint arXiv:2512.24601, 2025.

Citation

Anthony Le, Mac Broido, "Specifying the agent: closing the gap to frontier with a context engine and meta-distillation", Kinesthetic Research, June 2026. https://kinesthetic.dev/blog