The test that resists training data

On February 19, François Chollet's ARC-AGI-2 evaluation handed Google a score that shouldn't exist yet. Gemini 3.1 Pro solved 77.1% of its novel logic puzzles. The previous version solved 31.1%. OpenAI and Anthropic were scoring in the 50s and 60s. Google more than doubled its own number and passed everyone.

ARC-AGI-2 generates puzzles with transformation rules that don't appear in any training corpus. A model either infers the hidden rule from a few examples or it fails. Pattern matching from memorized data produces nothing. That's the point.

Google also set a record on Humanity's Last Exam, a test of advanced domain knowledge: 44.4%, versus 37.5% for Gemini 3 Pro and 34.5% for GPT-5.2. The gains aren't narrow. Something changed in the architecture.

The reasoning gap is the autonomy gap

Novel reasoning determines whether an AI agent recovers from unexpected situations or hallucinates through them. An API returns an error format the agent has never seen. A user submits a request the training data didn't cover. A workflow hits a condition nobody anticipated. At 31% novel reasoning, the agent crashes. At 77%, the odds of adaptation improve enough to matter.
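The difference between recovering and hallucinating often comes down to whether the agent treats an unfamiliar input as a signal to stop guessing. A minimal sketch of that distinction, with illustrative names (`handle_response`, the status strings) that stand in for whatever a real agent framework provides:

```python
import json

def handle_response(raw: str) -> dict:
    """Classify an API response; fall back explicitly instead of guessing."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        # Unfamiliar format: surface the failure rather than invent data.
        return {"status": "escalate", "reason": "unparseable response", "raw": raw}
    if "error" in payload:
        # Known failure shape: safe to retry or route around.
        return {"status": "retry", "reason": payload["error"]}
    return {"status": "ok", "data": payload}

print(handle_response('{"error": "rate_limited"}')["status"])   # retry
print(handle_response("<html>502 Bad Gateway</html>")["status"])  # escalate
```

The "escalate" branch is where novel reasoning lives: a weak model papers over the HTML error page with fabricated data, while a strong one recognizes it has left known territory.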

Enterprise customers don't need agents for the predictable path. They need them for the tenth step in a workflow when something goes wrong at 2 AM and nobody is watching.

Seven platforms, one declaration

Google shipped 3.1 Pro across seven surfaces simultaneously: AI Studio, Gemini CLI, Vertex AI, the consumer app, NotebookLM, Android Studio, and Antigravity. Six are incremental distribution. Antigravity is the strategy.

Antigravity launched alongside Gemini 3 in November 2025. It's where developers build agents that browse, code, manage files, and make decisions without approval at each step. Shipping your strongest reasoning model to your agent platform on day one is an architectural decision about what brain you trust to run unsupervised.

Reliability compounds. Everything else doesn't.

If a single agent operation succeeds 95% of the time, a ten-step workflow completes 60% of the time. Improve per-step reliability to 97% and completion jumps to 74%. Two percentage points of reasoning improvement produce fourteen points of workflow completion. The math is nonlinear.
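The arithmetic behind those numbers is simple compounding: an n-step workflow where each step succeeds independently completes with probability p^n. A quick check of the figures above:

```python
def workflow_completion(per_step_reliability: float, steps: int = 10) -> float:
    """Probability that every step of an n-step workflow succeeds,
    assuming steps are independent."""
    return per_step_reliability ** steps

print(f"{workflow_completion(0.95):.0%}")  # ~60%
print(f"{workflow_completion(0.97):.0%}")  # ~74%
```

The independence assumption is generous to the agent: in practice an early mistake often corrupts the context for every step after it, which makes the compounding penalty worse, not better.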

The company that ships the most reliable reasoning engine doesn't win by a margin. It wins by a multiplier, because every completed workflow is a workflow competitors' agents failed. Google's simultaneous seven-surface launch was about speed: get 3.1 Pro into agent development pipelines before Anthropic and OpenAI respond.

Three companies, three theories of victory

Anthropic ships Claude Opus 4.6. It still leads the Arena leaderboard by four points over Gemini for text and beats it on code. Arena is vibes-based (users vote on preferred outputs), but it captures whether responses feel trustworthy in practice. Anthropic's bet: when one catastrophic agent failure destroys a company's reputation, the safest agent wins.

OpenAI ships GPT-5.3 with Codex, an autonomous coding agent. Developer adoption is strong. OpenAI's bet: developers choose the tool that ships code fastest, and speed compounds into ecosystem lock-in.

Google ships the widest distribution and the strongest novel-reasoning foundation. Android Studio reaches every mobile developer. Vertex AI is embedded in enterprise cloud workflows. Google's bet: the broadest reasoning handles the widest range of tasks, and distribution amplifies that advantage.

In its January 2026 report, McKinsey estimated the enterprise autonomous-agent market at $340 billion by 2028. The question is which theory captures the largest share: safety, speed, or reasoning breadth.

The strongest case against getting excited

IBM Watson won Jeopardy! in 2011, then spent a decade failing at medical recommendations. GPT-4 scored in the 90th percentile on the bar exam in 2023 and cited cases that don't exist. Lab conditions control for the exact variables that destroy agents in production: noisy inputs, adversarial users, cascading errors across thousands of daily operations.

ARC-AGI-2 is a lab test. A good one, but still a lab test.

Here's the rebuttal. ARC-AGI-2 specifically measures the capability that makes agents safe in novel situations. Every puzzle is, by construction, an edge case the model has never encountered. The benchmark doesn't prove production readiness. It proves the reasoning foundation is sound. That's the necessary condition Google lacked until February 19. Without it, nothing else matters. With it, everything else becomes possible.

What to watch

Developer migration. If builders on OpenAI or Anthropic start experimenting with Antigravity within 30 days, the reasoning advantage is shifting real behavior. GitHub stars on Antigravity repositories and AI Studio API call volumes will be the leading indicators.

Enterprise pilots. Financial services and healthcare are where reasoning quality matters most and liability risk is highest. The first Fortune 500 company to publicly deploy Gemini agents for critical workflows will signal that the trust threshold has been crossed.

The first failure. Every new capability creates new failure modes. The company whose agent fails first loses a year of market position. The company whose agent handles failure gracefully wins something more durable than a benchmark score.

The model race was always going to become an agent race. Google started it with a 77.1% on a test designed to resist exactly this kind of progress. The question for every developer and enterprise buyer: whose reasoning do you trust to make decisions while you sleep?

Sources: Google AI Blog (February 19, 2026), Ars Technica (February 20, 2026), François Chollet/ARC Prize Foundation, Arena Leaderboard (arena.ai), McKinsey Global Institute (January 2026)