Structured Context Engineering for File-Native Agentic Systems

Original: Simon Willison · 09/02/2026

Summary

Using SQL generation as a proxy for programmatic agent operations, we present a systematic study of context engineering for structured data.

Key Insights

“Using SQL generation as a proxy for programmatic agent operations, we present a systematic study of context engineering for structured data.” — Introduction to the research focus on SQL generation for agent operations.

“The biggest impact was the models themselves - with frontier models beating the leading open source models.” — Highlighting the significant performance difference between frontier and open source models.

“The ‘grep tax’ emerged as schema size scaled.” — Discussing the increased token consumption by models when using TOON format at larger scales.

Topics

Full Article

# Structured Context Engineering for File-Native Agentic Systems

Author: Simon Willison
Published: 2026-02-09
Source: https://simonwillison.net/2026/Feb/9/structured-context-engineering-for-file-native-agentic-systems/#atom-everything

Structured Context Engineering for File-Native Agentic Systems. New paper by Damon McMillan exploring challenging LLM context tasks involving large SQL schemas (up to 10,000 tables) across different models and file formats:

> Using SQL generation as a proxy for programmatic agent operations, we present a systematic study of context engineering for structured data, comprising 9,649 experiments across 11 models, 4 formats (YAML, Markdown, JSON, Token-Oriented Object Notation [TOON]), and schemas ranging from 10 to 10,000 tables.
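To give a feel for why format choice changes context size, here is a minimal sketch that renders one toy table schema as JSON and as a TOON-like tabular layout, then compares character counts. The `orders` schema, the `to_toon_like` helper, and the exact layout are illustrative assumptions. The layout only approximates TOON's tabular style and is not checked against the TOON specification, and character counts are only a rough stand-in for tokens.

```python
import json

# Hypothetical toy schema: one table with three columns.
schema = {
    "table": "orders",
    "columns": [
        {"name": "id", "type": "integer", "nullable": False},
        {"name": "customer_id", "type": "integer", "nullable": False},
        {"name": "placed_at", "type": "timestamp", "nullable": True},
    ],
}


def to_toon_like(schema: dict) -> str:
    """Render the schema in a TOON-like tabular layout (approximation only).

    Field names are declared once in a header row, and each column then
    becomes a single comma-separated line.
    """
    fields = list(schema["columns"][0].keys())
    header = f"columns[{len(schema['columns'])}]{{{','.join(fields)}}}:"
    rows = [
        "  " + ",".join(str(col[f]).lower() for f in fields)
        for col in schema["columns"]
    ]
    return "\n".join([f"table: {schema['table']}", header, *rows])


json_text = json.dumps(schema, indent=2)
toon_text = to_toon_like(schema)

print(toon_text)
print(f"JSON: {len(json_text)} chars, TOON-like: {len(toon_text)} chars")
```

The saving comes from stating the field names once instead of repeating them for every column, which is also what makes the layout look less like the key-value formats models see most often.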

Unsurprisingly, the biggest impact was the models themselves - with frontier models (Opus 4.5, GPT-5.2, Gemini 2.5 Pro) beating the leading open source models (DeepSeek V3.2, Kimi K2, Llama 4).

Those frontier models benefited from filesystem-based context retrieval, but the open source models had much less convincing results with it, which reinforces my feeling that the filesystem coding agent loops aren’t handled as well by open weight models just yet. The Terminal Bench 2.0 leaderboard is still dominated by Anthropic, OpenAI and Gemini.

The “grep tax” result against TOON was an interesting detail. TOON is meant to represent structured data in as few tokens as possible, but it turns out the models’ unfamiliarity with that format led to them spending significantly more tokens over multiple iterations trying to figure it out:

Screenshot of a figure from a research paper. Introductory text reads: "As schema size increased, TOON showed dramatically increased token consumption for Claude models despite being ~25% smaller in file size. Scale experiments used Claude models only." Below is "Figure 7: The 'Grep Tax' - TOON Token Overhead at Scale", a bar chart with a logarithmic y-axis labeled "Tokens" comparing YAML (teal) and TOON (purple) at two schema sizes: S5 (500 tables) and S9 (10,000 tables). At S5, TOON is +138% more tokens than YAML (~1,100 vs ~450). At S9, TOON is +740% more tokens (~50,000 vs ~7,000). Below the chart, explanatory text reads: "The 'grep tax' emerged as schema size scaled. At S5 (500 tables), TOON consumed 138% more tokens than YAML; at S9 (10,000 tables), this grew to 740%. Root cause: models lacked familiarity with TOON's syntax and could not construct effective refinement patterns."
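One way to picture the root cause described in the figure: a search pattern that works on a familiar key-value layout finds nothing in a tabular layout, so an agent has to spend extra iterations rewriting it. The sketch below is a hypothetical illustration, not the paper's harness. The file contents, the `grep` helper, and both patterns are assumptions, and the TOON-like text only approximates TOON syntax.

```python
import re

# Hypothetical schema snippets describing the same two tables.
yaml_text = """\
tables:
  - name: orders
    columns:
      - name: id
        type: integer
  - name: customers
    columns:
      - name: id
        type: integer
"""

toon_like_text = """\
tables[2]{name,column,type}:
  orders,id,integer
  customers,id,integer
"""


def grep(pattern: str, text: str) -> list[str]:
    """Return the lines of text that match pattern, like a basic grep."""
    return [line for line in text.splitlines() if re.search(pattern, line)]


# A pattern written with key-value syntax in mind.
familiar = r"name:\s*orders\b"
print(grep(familiar, yaml_text))       # finds the YAML line for the table
print(grep(familiar, toon_like_text))  # finds nothing: there is no "name:" key

# The tabular layout needs a different, position-based pattern.
positional = r"^\s*orders,"
print(grep(positional, toon_like_text))
```

Each failed pattern costs another round trip through the model, which is consistent with the token growth reported in the figure, though the post does not say which patterns the models actually tried.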

Via @omarsar0

[[topics/agent-native-architecture]]
[[topics/ai-agents]]
[[topics/prompt-engineering]]

Originally published at https://simonwillison.net/2026/Feb/9/structured-context-engineering-for-file-native-agentic-systems/#atom-everything.
