BrowseComp: OpenAI’s Brutally Hard Benchmark for AI Browsing Agents
OpenAI open-sourced BrowseComp, a benchmark built to test whether AI can find obscure, verifiable facts buried across the internet. It's intentionally hard, and most models fail—unless they can reason, persist, and browse like a human researcher.

If AI is going to browse the internet like a human, it needs to prove it. That’s what BrowseComp is for: a test designed to punish shallow retrieval and reward real persistence. It’s a new benchmark of 1,266 complex, fact-based questions created by OpenAI, targeting information that's buried deep across dozens or even hundreds of sites. Each question has a short, indisputable answer that can be verified but not easily found.
This isn't SimpleQA. GPT‑4o with browsing barely clears 1.9% accuracy, and GPT‑4.5 and other large models don’t do much better. OpenAI’s Deep Research agent, built specifically for this kind of task, hits 51.5%. That gap shows raw model scale alone isn’t enough; strategic reasoning and adaptive search are what move the needle.
Even humans struggled. Given two hours, human trainers could solve just 29.2% of questions. And that’s without using AI.
The benchmark measures how well agents navigate the messy, unstructured web to retrieve information that resists easy queries. It emphasizes the “asymmetry of verification”: questions that are hell to solve but trivial to check. This makes it ideal for evaluating browsing competence without getting lost in subjective output grading.
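To make the “trivial to check” half concrete, here is a minimal sketch of verifying a candidate answer against a short reference answer. It is an illustration only, not OpenAI’s released grader; the record fields and normalization rules below are assumptions.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so trivial
    formatting differences don't count as wrong answers."""
    text = text.strip().lower()
    text = re.sub(r"[^\w\s]", "", text)   # drop punctuation
    return re.sub(r"\s+", " ", text)      # collapse whitespace

def is_correct(predicted: str, reference: str) -> bool:
    """Exact match after normalization: hard to produce, trivial to verify."""
    return normalize(predicted) == normalize(reference)

# Hypothetical BrowseComp-style record: a long, multi-constraint question
# paired with a short, unambiguous answer.
example = {
    "question": "Which ship ... (long question chaining many obscure constraints)",
    "answer": "HMS Terror",
}

print(is_correct("hms terror", example["answer"]))   # True
print(is_correct("HMS Erebus", example["answer"]))   # False
```

The point of the design is that the grading side stays this simple no matter how convoluted the question is, so scores reflect browsing and reasoning ability rather than a judge’s taste.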
OpenAI has open-sourced BrowseComp in its simple-evals repository. If you’re building agents that claim to understand the web, this is where they prove it, or fail trying.
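The evaluation harness itself is not the hard part; the browsing and reasoning are. A rough sketch of the loop, under stated assumptions: `agent.answer()` is a hypothetical method that browses and returns a short final answer, and the loader and exact-match check are placeholders rather than the repository’s actual API.

```python
from dataclasses import dataclass

@dataclass
class Item:
    question: str   # long, multi-constraint question
    answer: str     # short reference answer

def evaluate(agent, items: list[Item]) -> float:
    """Pose each question to the agent and report exact-match accuracy."""
    correct = 0
    for item in items:
        predicted = agent.answer(item.question)  # agent browses, reasons, answers
        if predicted.strip().lower() == item.answer.strip().lower():
            correct += 1
    return correct / len(items)

# Hypothetical usage:
# accuracy = evaluate(my_browsing_agent, load_browsecomp_items())
# print(f"accuracy: {accuracy:.1%}")
```

Everything interesting happens inside that `agent.answer()` call: the models scoring near 2% and the agent scoring 51.5% are running the same outer loop.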