Original: Swyx · 12/02/2026
Summary
*AI News for 2/10/2026-2/11/2026. We checked 12 subreddits, 544 Twitters and 24 Discords (256 channels, and 7988 messages) for you. Estimated reading time saved (at 200wpm): 655 minutes. AINews’ website lets yoKey Insights
“Opus-class, but not a 1T super model like Kimi or Qwen.” — Comparing GLM-5’s capabilities and scale to other models.
“GLM-5 also integrates DeepSeek Sparse Attention (DSA), significantly reducing deployment cost while preserving long-context capacity.” — Highlighting the technical advancement in GLM-5 for cost-effective deployment.
“A big part of the Reddit conversations centered around how they are running into compute constraints on their inference service.” — Discussing the community’s reaction to Z.ai’s computational challenges.
Topics
Full Article
Published: 2026-02-12
Source: https://www.latent.space/p/ainews-zai-glm-5-new-sota-open-weights
AI News for 2/10/2026-2/11/2026. We checked 12 subreddits, 544 Twitters and 24 Discords (256 channels, and 7988 messages) for you. Estimated reading time saved (at 200wpm): 655 minutes. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!As we mentioned yesterday, China open model week is in full swing. Today was Z.ai’s turn to launch their big update before the Big Whale. Per the GLM-5 blogpost:
- GLM-5 also integrates DeepSeek Sparse Attention (DSA), significantly reducing deployment cost while preserving long-context capacity. (prompting comments on the DeepSeek total victory in open model land)
DeepSeek “V4-lite” / 1M context rollout, attention as the differentiator, and inference stack fixes
MiniMax M2.5 / StepFun / Qwen: fast coding models, cost pressure, and benchmark jockeying
Video generation shockwave: SeeDance v2, PixVerse R1, and “IP constraints” as a structural advantage
Agents, coding workflows, and the new “malleable software” toolchain
Measurement, evaluation, and safety: benchmarks, observability, and agent security gaps
Top tweets (by engagement)
1. GLM-5 and MiniMax 2.5 Launches
- OpenAI President Greg Brockman has highlighted the ongoing challenge of compute scarcity, noting that even with significant investments, meeting future demand remains uncertain. OpenAI has published a chart emphasizing that scaling compute resources is crucial for achieving profitability, indicating the broader industry trend of compute limitations impacting AI development. Source.
- The issue of being ‘GPU starved’ is not unique to smaller companies like Z.ai; even major players like Google and OpenAI face similar challenges. Google has reportedly had to ‘nerf’ its models, potentially through quantization, to manage demand with limited resources, highlighting the widespread impact of hardware constraints on AI capabilities.
- The scarcity of high-performance GPUs, such as the RTX 5090, is a common problem among developers and companies alike. This shortage affects both individual developers and large organizations, indicating a significant bottleneck in the AI development pipeline due to hardware availability and pricing constraints.
- GLM-5 is noted for having the lowest hallucination rate on the AA-Omniscience benchmark, which is a significant achievement in reducing errors in AI-generated content. This positions GLM-5 as a leader in accuracy among open-source models, surpassing competitors like Opus 4.5 and GPT-5.2-xhigh.
- The open-source AI community is rapidly closing the gap with closed-source models, now trailing by only about three months. This is exemplified by the upcoming release of DeepSeek v4, which will utilize the same DSA architecture as GLM-5 but on a larger scale, indicating a trend towards more powerful open-source models.
- There is a call for transparency in the AI community regarding the resources required to run these advanced models, such as memory requirements. This information is crucial for developers and researchers to effectively utilize and optimize these models in various applications.
- GLM-5 has been released with model weights available under the MIT License on platforms like Hugging Face and ModelScope. A notable technical detail is that GLM-5 was trained using FP16 precision, which contrasts with Deepseek’s use of FP8, potentially impacting computational efficiency and model performance.
- The cost comparison between GLM-5 and other models like DeepSeek V3.2 Speciale and Kimi K2.5 reveals significant differences. GLM-5’s input costs are approximately 3 times higher than DeepSeek V3.2 Speciale (0.27) and 1.8 times higher than Kimi K2.5 (0.45). Output costs are also notably higher, being 6.2 times more expensive than DeepSeek V3.2 Speciale (0.41) and 14% more expensive than Kimi K2.5 (2.25).
- GLM-5’s release on OpenRouter and the removal of Pony Alpha suggest a strategic shift, with GLM-5 being more expensive than Kimi 2.5. This indicates a potential focus on premium features or performance enhancements that justify the higher pricing, despite the increased cost compared to competitors.
- The release of GLM 5.0 and MiniMax 2.5 is part of a broader trend in China where multiple AI models are being launched in quick succession. This includes models like Seedance 2.0, Seedream 5.0, and Qwen-image 2.0, with more expected soon such as Deepseek-4.0 and Qwen-3.5. This rapid development suggests a highly competitive environment in the Chinese AI sector, potentially leading to significant advancements in AI capabilities.
- The frequent release of AI models in China, such as GLM 5.0 and MiniMax 2.5, indicates a strategic push in AI development, possibly driven by national initiatives to lead in AI technology. This aligns with China’s broader goals to enhance its technological infrastructure and capabilities, suggesting that these releases are not just celebratory but part of a larger, coordinated effort to advance AI technology.
- The rapid succession of AI model releases in China, including GLM 5.0 and MiniMax 2.5, highlights the intense competition and innovation within the Chinese AI industry. This environment fosters accelerated development cycles and could lead to breakthroughs in AI research and applications, positioning China as a formidable player in the global AI landscape.
- Front_Eagle739 raises a concern about the lack of activity on GLM 5’s Hugging Face repository, questioning whether this indicates a shift towards a closed-source model. This could suggest a delay in open-sourcing or a strategic decision to keep the model proprietary, which would impact accessibility and community contributions.
- Sea_Trip5789 provides a link to the updated subscription plans for GLM 5, noting that currently only the ‘max’ plan supports it. They mention that after infrastructure rebalancing, the ‘pro’ plan will also support it, but the ‘lite’ plan will not. This highlights the tiered access strategy and potential limitations for users on lower-tier plans.
2. Local LLM Hardware and Optimization
- segmond suggests obtaining inference numbers before training, mentioning models like
gpt-oss-120bandglm4.6vas examples that could fit completely on the setup. This implies a focus on evaluating the system’s performance with large models to ensure it meets expectations before proceeding with more resource-intensive tasks like training. - lolzinventor discusses their setup using
8x3090GPUs withx16 to x8x8 splittersonPCIe v3and dual processors, highlighting that despite potential bandwidth limitations, the system performs adequately. They mention considering an upgrade toRomed-2Tand using7 GPUs of x16, with a potential configuration change to accommodate an 8th GPU. They also address power stability issues, resolved by using4x1200W PSUsto handle power spikes, and inquire about training intervals, indicating a focus on optimizing power and performance balance.
- The use of
--no-mmapis highlighted as a critical point for optimizing performance when running large models on integrated GPUs. This flag helps avoid doubling memory allocations, which is a common pitfall when using UMA (Unified Memory Architecture) with Vulkan. This insight is particularly relevant for those trying to maximize efficiency on systems with limited resources. - The performance of achieving 18 tokens per second on an 80B Mixture of Experts (MoE) model while simultaneously running NAS and Jellyfin is noted as impressive. This setup demonstrates the potential of using integrated GPUs for heavy computational tasks without the need for discrete GPUs, showcasing a ‘one box to rule them all’ capability.
- A suggestion is made to try running the
gpt-oss-20bmodel, which is claimed to be approximately twice as fast as the current setup. This model, when combined with a server.dev MCP search, is suggested to enhance performance and intelligence, indicating a potential alternative for those seeking faster inference speeds.
- SimplyRemainUnseen discusses a setup using two 128GB Ryzen AI Max+ systems, highlighting their strong 4-bit performance for LLMs and image generation. They mention the ability to fine-tune a QAT LoRA with unsloth’s workflows to improve int4 quantization performance, achieving usable speeds on models like GLM 4.7. The setup also supports running a ComfyUI API and GPT OSS 120B for image and video generation, leveraging the substantial unified memory.
- PraxisOG suggests using 4x 3090 GPUs for a balance of memory capacity and speed, suitable for running models like Qwen coder. They also mention an alternative with 7x AMD V620 for full GPU offload, which can handle models like GLM4.7 or provide extensive context with minimax 2.1 and 2.2. For a quieter setup, they recommend a Strix Halo box, which offers similar VRAM capacity to 4x 3090 but with less noise.
- Own_Atmosphere9534 compares different setups, including a Macbook M4 PRO MAX 128GB and RTX 5090, both around $5K. They highlight the Mac’s performance, comparable to RTX 3090, and its ability to run models like Llama 3.3 70B Instruct and Qwen3 coder variants effectively. They emphasize the importance of model size and hardware familiarity, noting that their M4 MacBook performs well with GPT-OSS-20B, influencing their decision to purchase the M4 PRO MAX.
- Plastic-Ordinary-833 highlights the significance of integrating MCP support into llama-server, noting that it simplifies the process of switching between cloud and local models without altering the tool setup. However, they express concern about how the agentic loop handles errors when smaller models hallucinate tool calls or return malformed JSON, which has been a major issue with local agents.
- allozaur discusses the initial release of MCP support in llama.cpp WebUI, emphasizing the focus on creating a solid client-side base before extending support to the llama-server backend. They mention using GitHub, Hugging Face, and Exa Search remote servers via streamable HTTP, with WebSocket transport also supported. OAuth, notifications, and sampling are not included in the initial release, but the goal is to iterate after a solid first release.
- prateek63 points out that MCP support in llama.cpp is a significant advancement, particularly the agentic loop support, which was a major barrier to using local models for tool-use workflows. The integration allows for native operation with local inference, moving towards self-hosting agentic setups, which were previously reliant on cloud APIs.
3. Qwen Model Developments
- The Qwen-Image-2.0 model is notable for its ability to handle both image generation and editing tasks, with a focus on high-resolution outputs up to 2K. This dual capability is significant as it allows for more versatile applications in creative and professional settings, where both creation and modification of images are required.
- There is a discussion about the model’s performance in rendering natural light and facial features, which are traditionally challenging areas for AI models. The ability to accurately depict these elements suggests advancements in the model’s underlying architecture or training data, potentially making it a ‘game changer’ in the field of AI image generation.
- Concerns are raised about the model’s multilingual capabilities, particularly its performance across different languages. The predominance of Chinese examples in the showcase might indicate a bias or optimization towards Chinese language and cultural contexts, which could affect its utility in more diverse linguistic environments.
- GarbageOk5505 highlights the concept of ‘dead zones’ in the representation space of LLMs, where models consistently reproduce incorrect behavior. This suggests that Reinforcement Learning from Human Feedback (RLHF) might not effectively address these issues, as it could lead to models ignoring certain instruction axes. The commenter is curious about whether the severity of these dead zones correlates with the model’s reliability on downstream tasks, particularly in handling ambiguous instructions, which could impact the development of reliable AI agents.
- TomLucidor suggests a method for testing prompt biases by creating multiple personas using various names and adjectives, and conducting A/A testing with different seeds. This approach could help identify consistent biases in model responses, providing insights into how models might be steered or influenced by different prompts.
- TheRealMasonMac references a study by Anthropic on ‘assistant-axis’, implying that the post might be inspired by similar research. This connection suggests a broader context of exploring how LLMs can be influenced or characterized by different axes of behavior, potentially offering a framework for understanding model personalities.
- A user inquires about the compatibility of the finetuning notebooks with ROCm and AMD cards, and asks about the duration of finetuning processes. They also seek advice on the largest model that can be trained or finetuned on a system with a combined VRAM of 40GB (24GB + 16GB). This suggests a need for detailed hardware compatibility and performance benchmarks for different GPU configurations.
- Another user expresses concerns about the stability and effectiveness of training Mixture of Experts (MoE) models, particularly regarding issues with the router and potential degradation of model intelligence during training processes like SFT (Supervised Fine-Tuning) or DPO (Data Parallel Optimization). They ask if there have been improvements in these areas and seek recommendations for current best practices in MoE model training, indicating ongoing challenges and developments in this field.
Less Technical AI Subreddit Recap
/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo, /r/aivideo
1. Seedance 2.0 AI Video and Image Innovations
- Mr_Universal000 highlights the potential of AI in democratizing filmmaking, especially for those with limited budgets. They express excitement about using AI to create motion pictures from storyboards, which can serve as proof of concept for attracting funding. The commenter is particularly interested in open-source solutions that could make this technology more accessible.
- Forumly_AI discusses the transformative impact of AI-generated video content on society. They predict that AI influencers will become significant, with the potential to shape ideas and perceptions, thereby generating revenue. The commenter anticipates that within a year, advancements in video models will lead to substantial societal changes, suggesting a future where AI’s influence is pervasive.
- aalluubbaa suggests that the accurate voice reconstruction by Seedance 2 might be due to overfitting, particularly because the model could have been trained extensively on the influencer’s content. This implies that the model’s performance might not generalize well across different voices or contexts, highlighting a potential limitation in its training data diversity.
- 1a1b speculates on a technical mechanism for voice reconstruction, suggesting that it might be related to a technique called ‘Side Eye’ developed in 2023. This technique involves extracting audio from the vibrations captured in camera lens springs, which could theoretically leave artifacts that a model might use to reconstruct sound from visual data.
- makertrainer posits that the incident might have been exaggerated by ByteDance to showcase their technology’s capabilities. They suggest that the voice similarity could have been coincidental, rather than a demonstration of advanced AI capabilities, indicating skepticism about the true extent of the technology’s performance.
2. AI Resignations and Industry Concerns
- A technical perspective suggests that the co-founders of xAI might be leaving due to a shift in control dynamics, with Elon Musk taking a more dominant role in the company’s direction. This could lead to a reduced influence for the co-founders, prompting them to pursue ventures where they have more control and a larger stake. The implication is that the strategic vision of xAI is heavily influenced by Musk, which might not align with the co-founders’ aspirations.
- The departure of xAI co-founders could be linked to financial incentives, such as a buyout by SpaceX. This scenario would allow the co-founders to cash out their equity stakes, providing them with the capital to explore new opportunities. This financial angle suggests that the resignations are part of a strategic exit plan rather than a reaction to internal conflicts or dissatisfaction.
- There is speculation that if Elon Musk does not initiate a hiring spree for new executives, it would confirm his central role in managing xAI. This would indicate a consolidation of power and decision-making within the company, potentially leading to a more streamlined but Musk-centric operational model. This could be a strategic move to align xAI’s objectives closely with Musk’s broader vision for AI and technology.
- The discussion highlights the economic model of AI services, comparing it to platforms like Facebook and YouTube. The argument is made that to make AI accessible to everyone, similar to how Facebook operates, ads are necessary. Without ads, AI services would need to charge users, potentially limiting access to wealthier individuals, which contradicts the idea of AI as a ‘great leveler’.
- A user suggests that paying for AI services like ChatGPT can be justified if users are deriving significant real-world benefits and efficiencies. This implies that for professional or intensive users, the cost of subscription could be offset by the productivity gains and additional features provided by the paid service.
- The conversation touches on the perception of AI as distinct from human behavior, yet it reflects a misunderstanding of human behavior itself. This suggests a deeper philosophical debate about the nature of AI and its alignment or divergence from human cognitive processes.
3. DeepSeek Model Updates and Benchmarks
- A user mentioned that Deepseek has been updated to handle 1 million tokens, suggesting a significant increase in its processing capability. This could imply improvements in handling larger datasets or more complex queries, which is a notable enhancement for users dealing with extensive data or requiring detailed analysis.
- Another user reported that after the update, Deepseek provided a nuanced and original review of a complex piece of character writing. This suggests improvements in the model’s ability to understand and critique creative content, indicating advancements in its natural language processing and comprehension skills.
- A comment highlighted that Deepseek’s responses now exhibit more ‘personality,’ drawing a comparison to ChatGPT. This could indicate enhancements in the model’s conversational abilities, making interactions feel more human-like and engaging, which is crucial for applications requiring user interaction.
- DeepSeek’s recent update to support a context length of up to 1 million tokens marks a significant enhancement from its previous version, which supported 128K tokens. This improvement allows for more efficient processing of extensive documents, such as novels, which can contain hundreds of thousands of tokens. This capability is particularly beneficial for tasks involving long-form text analysis and complex Q&A scenarios.
- The update to DeepSeek has reportedly increased the processing time for certain queries. A user noted that a question which previously took 30 seconds to process now takes 160 seconds, indicating a potential trade-off between the increased context length and processing speed. This suggests that while the model can handle larger inputs, it may require more computational resources, impacting response times.
- There is some skepticism about the update, with users questioning the authenticity of the claims regarding the model’s capabilities. One user referred to the update as a ‘hallucination,’ suggesting that there might be doubts about whether the model can truly handle the expanded context length as advertised.
- The update to DeepSeek introduces a significant increase in context window size to 1 million tokens, which translates to approximately 750,000 English words or 1.5 million Chinese characters. This is achieved using Multi-head Latent Attention (MLA), which compresses the key-value cache, allowing for fast inference and reduced memory usage despite the expanded context. This enhancement enables processing of entire codebases or novels without needing to rerun prompts, which is a substantial improvement for handling large datasets.
- There is a clarification that the update does not involve changes to the underlying model architecture itself, but rather extends the context window and updates the knowledge cutoff to May 2025. This means that for existing chats, the primary change users will experience is the increased chat length capability, without alterations to the model’s core functionalities or performance characteristics.
- Despite the significant update in context window size, there are no official release notes available on the DeepSeek website yet. This lack of documentation might leave users without detailed insights into the technical specifics or potential limitations of the new features, such as the impact on performance metrics or compatibility with existing systems.
- The discussion highlights that Grok 4.1 is a closed-source model noted for its cost-effectiveness, suggesting it offers competitive performance at a lower price point compared to other models. This could be particularly relevant for users prioritizing budget without sacrificing too much on performance.
- A query is raised about the absence of DeepSeek V3.2 Speciale in the results, indicating interest in this specific version. This suggests that there might be expectations or known performance metrics associated with this version that users were keen to compare against the tested models.
- The limited number of models tested, only six, is questioned, which implies a potential limitation in the comprehensiveness of the results. This could affect the generalizability of the findings, as a broader range of models might provide a more complete picture of the current state of open-source AI performance.
A summary of Summaries of Summaries by gpt-5.21. GLM-5 Rollout, Access Paths & Benchmark Scrutiny
- LMArena also added glm-5 to Text+Code Arena and reported it hit #1 among open models (#11 overall, score 1452, +11 vs GLM-4.7) on the Text Arena leaderboard, while Eleuther noted a free endpoint on Modal until April 30 with concurrency=1: Modal GLM-5 endpoint.
- Nous Research community also compared GLM-5 vs Kimi on browsecomp, citing ~744B (+10B MTP) for GLM-5 vs 1T for Kimi and claiming higher active params for GLM (40B) vs Kimi (32B), reinforcing that people are reading leaderboard claims with a more technical lens.
- The thread framed GLM-OCR as a practical swap-in for OCR-heavy products (they cited ongoing use of Gemini Flash but wanting something cheaper), while other Latent Space posts highlighted a wave of open multimodal releases (via Merve’s post) as competition intensifies on capability-per-dollar.
- OpenRouter chatter amplified the rumor mill with questions about “deepseek v4” appearing on X and guesses it might be a lite variant, showing how fast unconfirmed model IDs now propagate into planning and routing decisions.
- The reliability complaints contrasted sharply with the launch hype, pushing people toward pragmatic mitigations (e.g., explicitly specifying model fallbacks rather than relying on auto routing) while the thread devolved into jokes rather than concrete SLO fixes.
- DSPy builders simultaneously pushed RLM into practice by integrating RLM into Claude Code via subagents/agent teams and requesting critique on the implementation in a Discord thread: core implementation post.
- The feature list is unusually broad for an MCP plugin—Images, reverse image search, local OCR, Lens, Flights, Stocks, Weather, News/Trends—and the discussion treated it as a quick way to bolt retrieval onto local models without paying per-query.
- A follow-on workflow thread explored auditable context saving with a
/wrapsession boundary that saves context+reflection as markdown with metadata, tying tool ergonomics directly to the “context rot / losing the thread” pain point.
- Multiple members claimed Triton struggles on Blackwell due to unconventional MXFP8/NVFP4 layouts and compiler limits, with more expected at the (linked) Triton TLX talk, signaling a potential tooling bifurcation for next-gen NVIDIA.
- The release notes also mentioned progress toward ABI stability, which matters for downstream integration as teams try to standardize low-precision MoE training stacks across heterogeneous environments.
- Discussion centered on how smaller dtypes might free enough shared memory for c tiles to enable async stores/persistence, reflecting a broader theme: people want low-level control knobs back as architectures and datatypes get weirder.
- Users also reported wrong article links, lower source counts (as low as 24), and suspected cost-saving behaviors like Sonar being used for first responses, creating a reliability/quality tax that engineers notice immediately.
- Pricing backlash escalated with a report of spending $100 every three days for ~9 hours of work using Opus 4.6 and GPT-5.3 Codex, reframing “best coding model” debates as cost/performance engineering.
- Latent Space tied the policy to IPO risk and churn concerns via Discord’s post, while Nous members discussed moving bot/tool communities to Matrix, showing infra builders treat comms platforms as part of their stack.
Key Takeaways
Notable Quotes
Opus-class, but not a 1T super model like Kimi or Qwen.Context: Comparing GLM-5’s capabilities and scale to other models.
GLM-5 also integrates DeepSeek Sparse Attention (DSA), significantly reducing deployment cost while preserving long-context capacity.Context: Highlighting the technical advancement in GLM-5 for cost-effective deployment.
A big part of the Reddit conversations centered around how they are running into compute constraints on their inference service.Context: Discussing the community’s reaction to Z.ai’s computational challenges.
Related Topics
- [[topics/openai-api]]
- [[topics/prompt-engineering]]
- [[topics/ai-agents]]
Related Articles
[AINews] Claude Sonnet 4.6: clean upgrade of 4.5, mostly better with some caveats
Swyx · explanation · 89% similar
[AINews] "Sci-Fi with a touch of Madness"
Swyx · explanation · 88% similar
[AINews] OpenAI and Anthropic go to war: Claude Opus 4.6 vs GPT 5.3 Codex
Swyx · explanation · 85% similar
Originally published at https://www.latent.space/p/ainews-zai-glm-5-new-sota-open-weights.