Original: Geoffrey Huntley · 07/04/2025
Summary
Claude 3.7’s advertised context window is 200k, but I’ve noticed that the quality of output clips at the 147k-152k mark. It’s an old joke in the DJ community about upcoming artists having a bad reputation for pushing the audio signal into the red. Red is bad because it results in the audio signal being clipped and the mix sounding muddy.
Key Insights
“Claude 3.7’s advertised context window is 200k, but I’ve noticed that the quality of output clips at the 147k-152k mark.” — Discussing the discrepancy between advertised and actual performance of LLMs.
“These tools make each engineer within your team at least two times more productive.” — Highlighting the productivity gains from using advanced AI tools.
“Going forward, companies should budget 500 USD per day, per dev, on tokens as the new normal for business.” — Discussing the financial implications of integrating AI tools into software development.
Full Article
if you are redlining the LLM, you aren’t headlining
Author: Geoffrey Huntley
Published: 2025-04-07
Source: https://ghuntley.com/redlining/
It’s an old joke in the DJ community about upcoming artists having a bad reputation for pushing the audio signal into the red. Red is bad because it results in the audio signal being clipped and the mix sounding muddy.
It’s an analogy that will increasingly apply to software engineers, however, because these LLMs have context windows.
Claude 3.7’s advertised context window is 200k, but I’ve noticed that the quality of output clips at the 147k-152k mark. Regardless of which agent is used, once clipping occurs, successive tool-call invocations start to fail when using Claude 3.7:
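The practical takeaway is to track context usage yourself and stop dispatching before you approach the observed clipping zone. Here is a minimal sketch of such a guard; the 147k clipping figure comes from this post, while the character-based token estimate and the safety margin are illustrative assumptions (a real agent should use the provider’s token counter):

```python
# Sketch: guard an agent loop against "redlining" the context window.
# The advertised limit (200k) and observed clip point (~147k) are the
# figures from this post; the token heuristic is an assumption.

ADVERTISED_CONTEXT = 200_000
OBSERVED_CLIP_POINT = 147_000  # output quality degrades beyond here
SAFETY_MARGIN = 0.9            # stay comfortably below the clip point

def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English prose.
    Swap in the provider's real token counter in practice."""
    return len(text) // 4

def can_dispatch(conversation: list[str], next_message: str) -> bool:
    """True if adding next_message keeps the context out of the red zone."""
    budget = int(OBSERVED_CLIP_POINT * SAFETY_MARGIN)
    used = sum(estimate_tokens(m) for m in conversation)
    return used + estimate_tokens(next_message) <= budget

# A long-running agent loop should compact or restart the conversation
# once can_dispatch() returns False.
history = ["x" * 400_000]  # ~100k tokens of accumulated context
print(can_dispatch(history, "summarise the repo"))  # under budget
print(can_dispatch(history + ["y" * 200_000], ""))  # ~150k tokens: over budget
```

The same shape works for any model: only the clip point changes, which is why it is worth measuring it per LLM rather than trusting the advertised number.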
RULER: What’s the Real Context Size of Your Long-Context Language Models? — arXiv.org
The needle-in-a-haystack (NIAH) test, which examines the ability to retrieve a piece of information (the “needle”) from long distractor texts (the “haystack”), has been widely adopted to evaluate long…
GitHub - NVIDIA/RULER: This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models? — GitHub
Thus, ‘redlining’ will differ per LLM, and each LLM will have unique properties that make it better suited to different tasks. Some are better at generating the initial specifications, while others are better at implementing tasks within a specification. Either way, you must ensure that the evaluation loop you dispatch does not redline the LLM’s context window.
If I hadn’t ditched Cursor, I would never have made this observation, as they currently do not surface this information within their product. These days, I’m running raw directly against the Anthropic API. It’s expensive, but the outcomes are dramatically better. There’s something cooked about Windsurf’s and Cursor’s go-to-market pricing: there’s no way they are turning a profit when $50/month gets you a happy-meal experience. If you want more power, you gotta stop snacking at McDonald’s.
Going forward, companies should budget 500 USD per day, per dev, on tokens as the new normal for business, which is circa 50k USD (likely) to 127k USD (highest) per year. If you don’t have OPEX per dev to do that, it’s time to start making some adjustments… These tools make each engineer within your team at least two times more productive. Don’t take my word for it: here’s a study by Harvard Business School, published last week, that confirms this.
Please keep in mind that this study was done with the previous generation of AI tooling and is already out of date.
The Cybernetic Teammate: Having an AI on your team can increase performance, provide expertise, and improve your experience — One Useful Thing
So what does it mean if a business doesn’t have the budget for this OPEX spend on something better than a McHappy Meal, while a competitor does have the budget to opt for high-power tools? It means the budget will come from somewhere. If we take what we know (an engineering manager can only manage about seven people), a team of switched-on engineers utilising these tools can output N times more business outcomes than a team without them. Suddenly, you need fewer teams and fewer engineering managers to get the same outcomes…
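The budget range quoted above is easy to sanity-check. A quick sketch, where the working-day counts are my own assumptions rather than figures from the post:

```python
# Sanity-check the "500 USD per day, per dev" figure against the quoted
# annual range of ~50k (likely) to 127k (highest) USD per developer.
# The working-day counts below are assumptions for illustration.

DAILY_TOKEN_SPEND = 500  # USD per developer per day

def annual_spend(working_days: int) -> int:
    """Annual token spend for one developer, in USD."""
    return DAILY_TOKEN_SPEND * working_days

# ~100 days of heavy agent use lands on the "likely" 50k figure;
# every working day of the year (~254) lands on the 127k ceiling.
print(annual_spend(100))  # 50000
print(annual_spend(254))  # 127000
```

In other words, the 50k/127k range corresponds to spending $500 on roughly 100 versus all ~254 working days in a year.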
p.s. socials
- BlueSky: https://bsky.app/profile/ghuntley.com/post/3lm6iqkisw52d
- X: https://x.com/GeoffreyHuntley/status/1909013985887936608
- LinkedIn: https://www.linkedin.com/posts/geoffreyhuntley_if-you-arent-redlining-the-llm-you-arent-activity-7314776697487167490-qddK
Related Topics
- [[topics/anthropic-api]]
- [[topics/ai-agents]]
- [[topics/prompt-engineering]]
Related Articles
I dream about AI subagents; they whisper to me while I'm asleep
Geoffrey Huntley · explanation · 83% similar
I dream of roombas - thousands of automated AI robots that autonomously maintain codebases
Geoffrey Huntley · explanation · 78% similar
Claude Sonnet is a small-brained mechanical squirrel of <T>
Geoffrey Huntley · explanation · 78% similar
Originally published at https://ghuntley.com/redlining/.