Nov 19, 2025
Gemini 3 lands, early benchmarks pop, agents scale to a million steps
🧩 The Gist
Google introduced Gemini 3, framed as its most capable model yet, alongside a detailed model card and early developer impressions. A hands-on review highlights long context, multimodality, and pricing that sits between Gemini 2.5 and top-tier rivals. Benchmarks shared from the model card show sizable gains on reasoning-heavy tests. Beyond the launch, a new arXiv paper demonstrates a system that completes a million LLM steps with zero errors using microagents and voting, while open tooling and infrastructure updates hint at where the stack is heading.
🚀 Key Highlights
- Google announced Gemini 3 as its latest flagship model in a post from Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu, describing it as “our most intelligent model.”
- In hands-on testing, Gemini 3 Pro retains a 1M token input window, supports up to 64k token outputs, and accepts text, images, audio, and video, with a knowledge cutoff of January 2025.
- Pricing observed by reviewers: Gemini 3 Pro input runs $2.00 per 1M tokens at ≤200k context and $4.00 above that; output runs $12.00 and $18.00 respectively. That is a little pricier than Gemini 2.5 and still cheaper than Claude Sonnet 4.5.
- A reader summary of the Gemini 3 Pro model card reports strong results on reasoning benchmarks, for example Humanity’s Last Exam 37.5 percent, ARC‑AGI‑2 31.1 percent, GPQA Diamond 91.9 percent, and AIME 2025 at 95 percent without tools and 100 percent with code execution.
- Real‑world demo: transcribing and structuring a 3.5 hour public meeting with speaker labels and timestamps, a workflow called out as especially useful for local journalism.
- Paper highlight: a system called MAKER solves a task requiring over one million LLM steps with zero errors by extreme task decomposition into microagents and multi‑agent voting.
- Ecosystem notes: RowboatX releases an open‑source CLI for local background agents that use the filesystem as state, Unix tools for supervision, and optional MCP integrations.
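The tiered pricing reported above can be sketched as a small cost estimator. This is a hedged sketch: the rates come from the reviewer's summary, and the assumption that the tier is selected by total input size (as well as the function name itself) is illustrative, not an official billing spec.

```python
# Sketch of Gemini 3 Pro's reported tiered pricing:
# input $2/M tokens at <=200k context, $4/M above; output $12/M, then $18/M.
# Tier-selection semantics are an assumption based on the reviewer's summary.

def gemini3_pro_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate USD cost for one request; the tier follows the input size."""
    if input_tokens <= 200_000:
        in_rate, out_rate = 2.00, 12.00   # <=200k-context tier, $ per 1M tokens
    else:
        in_rate, out_rate = 4.00, 18.00   # long-context tier
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Example: a 150k-token prompt with a 4k-token answer
print(round(gemini3_pro_cost(150_000, 4_000), 4))  # 0.348
```

For a sense of the tier jump: the same 4k-token answer on a 300k-token prompt would come to $1.38 under this sketch, roughly four times the short-context cost.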
🎯 Strategic Takeaways
- Capabilities: Reported gains on reasoning and math suggest leading models are converging at the top tier, with tool use and code execution becoming key amplifiers on hard tests.
- Economics: Gemini 3 Pro’s tiered pricing tightens the gap with rivals, making long‑context multimodal workloads more approachable while still signaling premium positioning.
- Workflows: Long‑form audio transcription and structured summarization look increasingly production‑ready, particularly for civic reporting and enterprise meeting intelligence.
- Agentic systems: The million‑step result indicates that reliability at scale may come from process design, decomposition, and error correction, not just bigger monoliths.
- Infra and ops: New local‑first agent tools and data center innovations point to a stack that blends cloud scale with on‑device control, with leaders publicly cautioning about overheated investment climates.
🧠 Worth Reading
- Solving a Million‑Step LLM Task with Zero Errors: Introduces MAKER, which breaks big problems into simple subtasks for focused microagents, then uses efficient multi‑agent voting to correct errors at every step. The practical takeaway is that robust, long‑running agent workflows can be engineered with modular decomposition and systematic error checking, which teams can adapt to complex back‑office or research pipelines.
- Building more with GPT-5.1-Codex-Max (openai.com): Introducing GPT-5.1-Codex-Max, a faster, more intelligent agentic coding model for Codex, designed for long-running, project-scale work with enhanced reasoning and token efficiency.
- GPT-5.1-Codex-Max System Card (openai.com): Outlines the safety measures implemented for GPT-5.1-Codex-Max, detailing model-level mitigations such as specialized safety training for harmful tasks and prompt…
- How evals drive the next chapter in AI for businesses (openai.com): How evals help businesses define, measure, and improve AI performance, reducing risk, boosting productivity, and driving strategic advantage.
- Strengthening our safety ecosystem with external testing (openai.com): OpenAI works with independent experts to evaluate frontier AI systems; third-party testing strengthens safety, validates safeguards, and increases transparency in how model capabilities and risks are assessed.
- How Scania is accelerating work with AI across its global workforce (openai.com): Global manufacturer Scania is scaling AI with ChatGPT Enterprise; with team-based onboarding and strong guardrails, AI is boosting productivity, quality, and innovation.
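The decomposition-plus-voting pattern described in the MAKER paper can be sketched in a few lines: a long task becomes a chain of one-step subtasks, each step is attempted by several independent microagents, and the majority answer is kept so a single bad sample cannot derail the run. This is a hedged sketch of the general idea, not the paper's actual system; `agent` stands in for an LLM call, and all names are illustrative.

```python
from collections import Counter

def voted_step(step_input, agent, n_voters=5):
    """Run n_voters independent microagents on one subtask; keep the majority answer."""
    votes = Counter(agent(step_input) for _ in range(n_voters))
    answer, _ = votes.most_common(1)[0]
    return answer

def run_pipeline(start, n_steps, agent, n_voters=5):
    """Chain many voted steps: per-step correction keeps errors from compounding."""
    state = start
    for _ in range(n_steps):
        state = voted_step(state, agent, n_voters)
    return state

# Toy subtask: "increment the counter". A reliable agent completes 1,000 steps:
print(run_pipeline(0, 1_000, agent=lambda x: x + 1))  # 1000
```

The design point mirrors the paper's takeaway: because a chained error rate compounds exponentially with step count, pushing the per-step error toward zero via redundancy and voting matters more than making any single agent smarter.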