Taalas serves Llama 3.1 8B at 17,000 tokens/second

Original: Simon Willison · 20/02/2026

Summary

This new Canadian hardware startup just announced their first product - a custom hardware implementation of the Llama 3.1 8B model (from July 2024) that can run at a staggering 17,000 tokens/second.

Key Insights

“This new Canadian hardware startup just announced their first product - a custom hardware implementation of the Llama 3.1 8B model that can run at a staggering 17,000 tokens/second.” — Introduction of Taalas’s new product and its performance capabilities.

“They describe their Silicon Llama as “aggressively quantized, combining 3-bit and 6-bit parameters.”” — Details on the technical approach Taalas is taking with their hardware.

“Their next generation will use 4-bit - presumably they have quite a long lead time for baking out new models!” — Insight into Taalas’s future plans and the development process for their products.

Topics

Full Article

# Taalas serves Llama 3.1 8B at 17,000 tokens/second

Author: Simon Willison
Published: 2026-02-20
Source: https://simonwillison.net/2026/Feb/20/taalas/#atom-everything

This new Canadian hardware startup just announced their first product - a custom hardware implementation of the Llama 3.1 8B model (from July 2024) that can run at a staggering 17,000 tokens/second. I was going to include a video of their demo but it’s so fast it would look more like a screenshot. You can try it out at chatjimmy.ai. They describe their Silicon Llama as “aggressively quantized, combining 3-bit and 6-bit parameters.” Their next generation will use 4-bit - presumably they have quite a long lead time for baking out new models!
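
To give a rough feel for what mixing 3-bit and 6-bit parameters trades off, here is a minimal Python sketch of symmetric uniform quantization. It is not Taalas's scheme, which the post does not describe beyond the quoted phrase. The function names, the Gaussian stand-in weights, and the single per-tensor scale are all assumptions made for illustration.

```python
import random


def quantize(weights, bits):
    """Symmetric uniform quantization to signed integer codes (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1              # 3 for 3-bit, 31 for 6-bit
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest > 0 else 1.0
    # Round each weight to the nearest level and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def mean_abs_error(weights, bits):
    """Average absolute difference between original and round-tripped weights."""
    codes, scale = quantize(weights, bits)
    restored = dequantize(codes, scale)
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)


# Hypothetical stand-in for one weight tensor: small Gaussian values.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

err3 = mean_abs_error(weights, 3)
err6 = mean_abs_error(weights, 6)
print(f"3-bit mean abs error: {err3:.6f}")
print(f"6-bit mean abs error: {err6:.6f}")
```

In this scheme a 3-bit code has 7 usable levels and a 6-bit code has 63, so the 3-bit round trip loses noticeably more precision. A mixed-precision design would presumably spend the wider format where that loss hurts most, though which parameters Taalas assigns to which width is not stated in the post.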

Notable Quotes

This new Canadian hardware startup just announced their first product - a custom hardware implementation of the Llama 3.1 8B model that can run at a staggering 17,000 tokens/second.

Context: Introduction of Taalas’s new product and its performance capabilities.

They describe their Silicon Llama as “aggressively quantized, combining 3-bit and 6-bit parameters.”

Context: Details on the technical approach Taalas is taking with their hardware.

Their next generation will use 4-bit - presumably they have quite a long lead time for baking out new models!

Context: Insight into Taalas’s future plans and the development process for their products.

[[topics/ai-agents]]
[[topics/model-optimization]]
[[topics/hardware-startups]]

Claude Sonnet is a small-brained mechanical squirrel of <T>

Geoffrey Huntley · explanation · 64% similar

Introducing GPT‑5.3‑Codex‑Spark

Simon Willison · explanation · 63% similar

[AINews] Qwen3.5-397B-A17B: the smallest Open-Opus class, very efficient model

Swyx · reference · 63% similar

Originally published at https://simonwillison.net/2026/Feb/20/taalas/#atom-everything.
