Nov 19, 2025
Gemini 3 lands, early benchmarks pop, agents scale to a million steps
🧩 The Gist
Google introduced Gemini 3, framed as its most capable model yet, alongside a detailed model card and early developer impressions. A hands-on review highlights long context, multimodality, and pricing that sits between Gemini 2.5 and top-tier rivals. Benchmarks shared from the model card show sizable gains on reasoning-heavy tests. Beyond the launch, a new arXiv paper demonstrates a system that completes a million LLM steps with zero errors using microagents and voting, while open tooling and infrastructure updates hint at where the stack is heading.
🚀 Key Highlights
- Google announced Gemini 3 as its latest flagship model, authored by Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu, and described it as "our most intelligent model."
- In hands-on testing, Gemini 3 Pro retains a 1M-token input window, supports outputs up to 64k tokens, and accepts text, images, audio, and video, with a knowledge cutoff of January 2025.
- Pricing observed by reviewers: Gemini 3 Pro input costs $2.00 per 1M tokens at ≤200k context and $4.00 above that; output costs $12.00 and $18.00 respectively. That is slightly pricier than Gemini 2.5 but still cheaper than Claude Sonnet 4.5.
- A reader summary of the Gemini 3 Pro model card reports strong results on reasoning benchmarks, for example Humanity's Last Exam 37.5 percent, ARC-AGI-2 31.1 percent, GPQA Diamond 91.9 percent, and AIME 2025 at 95 percent without tools and 100 percent with code execution.
- Real-world demo: transcribing and structuring a 3.5-hour public meeting with speaker labels and timestamps, a workflow called out as especially useful for local journalism.
- Paper highlight: a system called MAKER solves a task requiring over one million LLM steps with zero errors by extreme task decomposition into microagents and multi-agent voting.
- Ecosystem notes: RowboatX releases an open-source CLI for local background agents that use the filesystem as state, Unix tools for supervision, and optional MCP integrations.
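The tiered pricing reported above lends itself to a quick back-of-the-envelope cost estimate. This sketch is illustrative only: the function name is mine, and it assumes the tier is selected by the prompt's size and applied to the whole request, which may not match Google's actual billing rules.

```python
def gemini3_pro_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate a single request's cost in USD from reviewer-reported tiers.

    Assumption (mine): requests with <=200k input tokens bill at the lower
    tier ($2/M in, $12/M out); larger prompts bill entirely at the higher
    tier ($4/M in, $18/M out).
    """
    if input_tokens <= 200_000:
        in_rate, out_rate = 2.00, 12.00
    else:
        in_rate, out_rate = 4.00, 18.00
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Example: a 100k-token prompt with a 10k-token reply
# costs roughly 0.20 + 0.12 = $0.32 under these assumptions.
```

By this estimate, even a full 1M-token prompt with a maximal 64k-token reply stays in single-digit dollars, which is what makes the long-context workloads discussed below economically plausible.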
🎯 Strategic Takeaways
- Capabilities: Reported gains on reasoning and math suggest leading models are converging at the top tier, with tool use and code execution becoming key amplifiers on hard tests.
- Economics: Gemini 3 Pro's tiered pricing tightens the gap with rivals, making long-context multimodal workloads more approachable while still signaling premium positioning.
- Workflows: Long-form audio transcription and structured summarization look increasingly production-ready, particularly for civic reporting and enterprise meeting intelligence.
- Agentic systems: The million-step result indicates that reliability at scale may come from process design, decomposition, and error correction, not just bigger monoliths.
- Infra and ops: New local-first agent tools and data center innovations point to a stack that blends cloud scale with on-device control, with leaders publicly cautioning about overheated investment climates.
🧠 Worth Reading
- Solving a Million-Step LLM Task with Zero Errors: Introduces MAKER, which breaks big problems into simple subtasks for focused microagents, then uses efficient multi-agent voting to correct errors at every step. The practical takeaway is that robust, long-running agent workflows can be engineered with modular decomposition and systematic error checking, which teams can adapt to complex back-office or research pipelines.
- Building more with GPT-5.1-Codex-Max (openai.com): Introduces GPT-5.1-Codex-Max, a faster, more intelligent agentic coding model for Codex, designed for long-running, project-scale work with enhanced reasoning and token efficiency.
- GPT-5.1-Codex-Max System Card (openai.com): Outlines the safety measures implemented for GPT-5.1-Codex-Max, detailing model-level mitigations such as specialized safety training for harmful tasks and prompt…
- How evals drive the next chapter in AI for businesses (openai.com): How evals help businesses define, measure, and improve AI performance, reducing risk, boosting productivity, and driving strategic advantage.
- Strengthening our safety ecosystem with external testing (openai.com): OpenAI works with independent experts to evaluate frontier AI systems; third-party testing strengthens safety, validates safeguards, and increases transparency in how model capabilities and risks are assessed.
- How Scania is accelerating work with AI across its global workforce (openai.com): Global manufacturer Scania is scaling AI with ChatGPT Enterprise; with team-based onboarding and strong guardrails, AI is boosting productivity, quality, and innovation.
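The core MAKER idea, one microagent per tiny step with voting as error correction, can be sketched in a few lines. Everything below is illustrative rather than the paper's implementation: the "microagent" is a toy counter that errs with a fixed probability standing in for an unreliable LLM call, and plain per-step majority voting stands in for the paper's more efficient scheme.

```python
import random
from collections import Counter

def microagent(state: int, noise: float = 0.1) -> int:
    # Toy stand-in for one LLM call doing one tiny subtask:
    # advance the state by one, but err with probability `noise`.
    return state + 1 if random.random() > noise else state

def voted_step(state: int, k: int = 5) -> int:
    # MAKER-style error correction (sketch): run k independent
    # microagents on the same subtask and keep the majority answer,
    # so a single agent's mistake rarely survives the step.
    votes = Counter(microagent(state) for _ in range(k))
    return votes.most_common(1)[0][0]

def run(steps: int = 1000) -> int:
    # Extreme decomposition: a long task is just many voted steps.
    state = 0
    for _ in range(steps):
        state = voted_step(state)
    return state
```

With a 10 percent per-agent error rate and five voters, the chance that a majority errs on any one step drops below one percent, which is the intuition behind scaling to a million steps: per-step reliability compounds, so driving it high enough makes zero total errors achievable.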