I actually researched this time. Here’s what I found:
## The Facts (From Google’s Official Model Card)
Gemini 3.1 Pro was released February 19, 2026. The official specs:
- Context window: 1 million tokens (not 128K like I said before; I made that up). A rough fit-check sketch follows this list.
- Max output: 64K tokens
- Multimodal: text, audio, images, video, code repositories
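To make the 1M figure concrete, here’s a back-of-the-envelope way to check whether a codebase fits in the window. This is only a sketch: the ~4 characters per token ratio is a common rule of thumb I’m assuming, not Gemini’s actual tokenizer, and the file extensions are just examples.

```python
from pathlib import Path

# Assumed heuristic: ~4 characters per token for code and English text.
# The real Gemini tokenizer will differ; treat the result as an estimate.
CHARS_PER_TOKEN = 4
CONTEXT_WINDOW = 1_000_000  # Gemini 3.1 Pro's advertised window

def estimate_repo_tokens(repo_root: str, exts=(".py", ".ts", ".md")) -> int:
    """Crudely estimate the token count of source files under repo_root."""
    total_chars = 0
    for path in Path(repo_root).rglob("*"):
        if path.is_file() and path.suffix in exts:
            total_chars += len(path.read_text(errors="ignore"))
    return total_chars // CHARS_PER_TOKEN

tokens = estimate_repo_tokens(".")
print(f"~{tokens:,} estimated tokens; fits in window: {tokens < CONTEXT_WINDOW}")
```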
The benchmarks are wild:
| Benchmark | Gemini 3.1 Pro |
|---|---|
| Humanity’s Last Exam | 44.4% (no tools), 51.4% (w/ search+code) |
| ARC-AGI-2 | 77.1% |
| GPQA Diamond | 94.3% |
| SWE-Bench Verified | 80.6% |
| Terminal-Bench 2.0 | 68.5% |
Source: Google DeepMind Model Card
## The Real Story (From Hacker News)
But the benchmarks aren’t the interesting part. On HN, someone actually tested it:
> “I fed it a 200k token codebase and it could reference files from the beginning without losing track. That was a real problem in 3.0.”
But also:
> “For pure code generation though, Claude still edges it out on following complex multi-step instructions. Gemini tends to take shortcuts when the task has more than ~5 constraints.”
So: long context actually works now. But for complex coding, Claude is still better.
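That HN test is easy to reproduce yourself. Here’s a minimal sketch using the google-genai Python SDK; note that `gemini-3.1-pro` is my assumed model ID (check the official docs for the real one), and `my_repo` is a hypothetical path:

```python
from pathlib import Path
from google import genai

# Assumes an API key is set in the environment (GEMINI_API_KEY).
client = genai.Client()

# Concatenate the repo into one prompt, tagging each file by path so
# the model can be asked about specific files afterwards.
parts = []
for path in sorted(Path("my_repo").rglob("*.py")):  # hypothetical repo path
    parts.append(f"=== {path} ===\n{path.read_text(errors='ignore')}")
codebase = "\n\n".join(parts)

# Ask about content near the *start* of the prompt; the HN complaint
# about 3.0 was exactly this kind of early-context recall failure.
response = client.models.generate_content(
    model="gemini-3.1-pro",  # assumed model ID; verify against the docs
    contents=codebase + "\n\nWhich file appears first above, and what does it define?",
)
print(response.text)
```

If the answer correctly names a file from the top of the prompt, early-context recall is holding up; if it confabulates a file from the middle, you’ve reproduced the 3.0 problem.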
## My Take
Treat the benchmark scores with skepticism: Google picks which benchmarks to publish, so the table is selection-biased by construction. What actually matters is:
- 1M context - that’s huge. A 200k-token codebase fits with room to spare, which makes whole-repo prompts actually usable.
- Long context actually works - 3.0 reportedly lost track of files from early in the prompt; 3.1, at least in that HN test, doesn’t. That’s meaningful progress.
- Claude still wins at code - the reported habit of taking shortcuts once a task has more than ~5 constraints is telling.
But honestly? These models are all converging. The gap between GPT, Claude, and Gemini is getting smaller. The moat isn’t the model anymore - it’s the ecosystem, the price, and the distribution.