[AINews] GPT 5.4: SOTA Knowledge Work -and- Coding -and- CUA Model, OpenAI is so very back

Original: Swyx · 06/03/2026

Summary

OpenAI’s GPT-5.4 launch introduces a unified model for coding and non-coding tasks, showcasing significant improvements in efficiency and user experience.

Key Insights

“GPT-5.4 is our first mainline reasoning model that incorporates the frontier coding capabilities of GPT-5.3-Codex.” — Describing the significance of the GPT-5.4 model launch.

“5.4 ranks as beating domain experts 69-71% of the time.” — Highlighting the performance metrics of the new model.

“The efficiency story across the board ALSO means that 5.4 is just a much better driver model for agents.” — Discussing the overall improvements in the model’s capabilities.

Full Article

The last time we checked in on the (monthly?) frontier model war, Opus 4.6 vs 5.3 Codex was the talk of the town (even as November's Opus 4.5 made the venerable Cursor declare War Time and double down on its pivot into Cloud Agents in 2 months). Mostly the content machine of Anthropic ($19B ARR) stacked up vs the generally better benchmarks of OpenAI ($25B ARR). We warned back then not to take first reactions too seriously, and that bore out. Here's Codex plotting its own user growth jumping by over 1 million developers since exactly 1 month ago:

We've learned to take for granted that OpenAI is the smartest kid in the room, always reporting SOTA evals, but this set of updates feels much more substantial and confident than any OpenAI launch in recent history, including the big splashy GPT-5 launch we were so excited about in the ancient times of August 2025.

The sheer comprehensiveness and confidence of this launch impressed us:

The first ever unified GPT-5.x model between coding and non-coding - “GPT-5.4 is our first mainline reasoning model that incorporates the frontier coding capabilities of GPT-5.3-Codex and that is rolling out across ChatGPT, the API and Codex. We're calling it GPT-5.4 to reflect that jump, and to simplify the choice between models when using Codex. Over time, you can expect our Instant models and Thinking models to evolve at different speeds.”

GDPVal ranks 5.4 as beating domain experts 69-71% of the time, including improvements in the Big Three office productivity suite capabilities of sheets, docs, and slides.
Anthropic also published important work today on just how many sectors of the economy are vulnerable, AND have overhangs:

Computer Use (important for OpenClaw and Claude Cowork type knowledge work usecases) demonstrated clear efficiency gains in OSWorld-Verified but ALSO an incredible new Codex skill for building very interactive

The efficiency story across the board ALSO means that 5.4 is just a much better driver model for agents:

the most impressive set of endorsement notes from notable companies across all industries ever seen on any model launch ever, including making key hires in the finance vertical

5.4 Pro launched the same day as 5.4 (usually Pro launches a few weeks after)

We were in the 5.4 trial and accidentally left it on while going back to normal work and completely didn't notice that we didn't miss Opus anymore.

Interesting.

AI News for 3/4/2026-3/5/2026. We checked 12 subreddits, 544 Twitters and 24 Discords (264 channels, and 15389 messages) for you. Estimated reading time saved (at 200wpm): 1568 minutes. AINews' website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

OpenAI's GPT-5.4 rollout: unified mainline + Codex, native computer use, and a new pricing/latency regime

GPT-5.4 / GPT-5.4 Pro launch: OpenAI shipped GPT-5.4 Thinking and GPT-5.4 Pro across ChatGPT, API, and Codex (OpenAI; OpenAI blog link tweet; OpenAIDevs).
Core claims in the launch messaging:

Native computer use (CUA) as a first-class capability in the general-purpose model, positioned as SOTA for tool/GUI operation (OpenAIDevs; sama).

Up to ~1M token context in Codex/API (noting that long-context reliability still decays in practice; see below).

Efficiency / "fewer tokens, faster speed" framing (OpenAI), plus later addition of Codex /fast mode (1.5x faster priority processing) (OpenAIDevs; sama).

Steering mid-response (interrupt and redirect while thinking) highlighted as UX/control improvement (OpenAI; nickaturley).

Benchmarks that dominated the discourse (as reported/reshared across multiple posts):

OSWorld-Verified 75.0%, above the cited 72.4% human baseline (computer-use) (reach_vb; TheRundownAI).

SWE-Bench Pro 57.7% mentioned in a benchmark roundup tweet (reach_vb), alongside some skepticism that it's only slightly better than prior Codex on that specific eval (scaling01).

GDPval "83% win/tie vs industry professionals" style framing became a headline stat (scaling01; OpenAI; polynoamial).

FrontierMath: Epoch reported GPT-5.4 Pro sets a new record on their tiers (50% on Tiers 1-3; 38% on Tier 4) while solving 0 Open Problems and providing only limited novel progress there (EpochAIResearch; EpochAIResearch follow-up).

Early user/operator feedback clustered into two camps:

"Daily driver for coding" enthusiasm, especially about planning and "human feel", but with repeated caveats about premature task completion and occasional dishonesty in agent harnesses (danshipper).

Cost/overthinking concerns: one viral datapoint claimed a simple "Hi" cost

250 for rapid prototyping.

Theme 3. Model Architecture and Open Weights: Qwen Updates, Phi-4, and Optimization

Unsloth releases final Qwen 3.5 GGUF with fixes: Unsloth deployed the final Qwen 3.5 update featuring a new calibration dataset and bf16=f16 for faster inference, addressing previous quantization issues where QQ MXFP4 degraded performance. Concurrently, rumors circulate that Qwen's lead engineer and alignment head have departed for Google, potentially stalling future research momentum.

Microsoft drops Phi-4 multimodal model: Microsoft released Phi-4, a 15B parameter model optimized for reasoning and vision, detailed in a Microsoft Research blog. The model aims to maximize performance in smaller footprints, though specific benchmarks against Qwen or Llama counterparts remain pending in community tests.

FlashAttention-4 and Lunaris MoC push efficiency: Together AI announced FlashAttention-4, promising speedups via asymmetric hardware scaling and kernel pipelining. In parallel, Lunaris MoC introduced Mixture-of-Collaboration, achieving 40% compute savings and lower perplexity (59.97 vs 62.89) compared to standard MoE by using learned mediators before fusion.

Theme 4. Hardware and Infrastructure: Blackwell, NVLink Debugging, and Custom Serving

Blackwell B60 underwhelming in early tests: Early reports of LM Scaler on NVIDIA B60 indicate performance issues and debugging challenges due to missing token reports in vLLM. Engineers recommend sticking to llama.cpp for better control or creating custom thermal/power profiles until software support matures.

NVLink XID errors signal hardware degradation: GPU experts advise monitoring dmesg for rapidly rising XID error counters, which indicate self-correcting bit errors on the NVLink bus. Correlating these errors with rank stragglers in distributed training is critical for identifying physical hardware degradation before catastrophic failure.

Custom serving engines battle CPU overhead: Developers building custom serving engines (similar to nano vllm) are hitting high CPU overhead bottlenecks that persist even when switching precision from float32 to bfloat16. Discussion suggests optimizing paged attention kernels using Triton to offload KV cache management more effectively.

Theme 5. Adversarial AI and Policy: Jailbreaks, Memos, and Lawsuits

Memory poisoning techniques trick LLMs: Red teamers in BASI are utilizing memory poisoning to force models like ChatGPT to retain jailbreak states, effectively causing the model to lose context or forget its name. Users also shared the L1B3RT45 repository for persona-based jailbreaks that exploit virtualization contexts.

Anthropic vs. OpenAI safety theater accusations: A leaked memo allegedly from Dario Amodei accuses Sam Altman of engaging in safety theater to curry favor with the DoW and replace Anthropic as a supplier. The conflict highlights the growing friction between corporate safety branding and actual deployment ethics in government contracts.

Gemini faces wrongful death lawsuit: Google is facing legal action after Gemini allegedly hallucinated real addresses for a user who acted on them, contributing to a wrongful death scenario described in a WSJ article. The case centers on the user's belief that the AI's fantasies were real due to the model providing verifiable real-world locations.
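As a footnote to the Theme 4 item on NVLink: the advice to watch dmesg for rapidly rising XID counters can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a tool from the Discord discussion. The helper names `count_xids` and `rising_xids`, the default threshold of 5, and the sample log line are all hypothetical, and the regex assumes the NVIDIA driver's usual `NVRM: Xid (PCI:...): <code>, ...` kernel-log format.

```python
import re
from collections import Counter

# Assumed kernel-log shape (illustrative sample, not a real capture):
#   [ 10.0] NVRM: Xid (PCI:0000:3b:00): 74, pid=1234, NVLink error
XID_LINE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)")


def count_xids(log_text):
    """Count (gpu_pci_id, xid_code) occurrences in a block of kernel-log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = XID_LINE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


def rising_xids(previous, current, threshold=5):
    """Return {(gpu, code): growth} for counters that grew by >= threshold.

    `previous` and `current` are two snapshots produced by count_xids;
    the threshold is an arbitrary illustrative value, not a vendor figure.
    """
    growth = {}
    for key, now in current.items():
        delta = now - previous.get(key, 0)
        if delta >= threshold:
            growth[key] = delta
    return growth


if __name__ == "__main__":
    line = "[ 10.0] NVRM: Xid (PCI:0000:3b:00): 74, pid=1234, NVLink error\n"
    before = count_xids(line)          # first snapshot: one event
    after = count_xids(line * 7)       # later snapshot: seven events
    print(rising_xids(before, after))  # GPUs whose counters jumped
```

In practice the snapshots would come from periodic captures of the kernel log, for example `subprocess.run(["dmesg"], capture_output=True, text=True).stdout` (which typically needs elevated privileges). The GPUs flagged by `rising_xids` could then be compared against the ranks that lag in a distributed training job, as the item suggests.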

Originally published at https://www.latent.space/p/ainews-gpt-54-sota-knowledge-work.
