The Rise of AI in Deep Research: Evaluating New Capabilities
Artificial intelligence is making significant strides as a research assistant, moving beyond simple fact lookup to tackle complex inquiries that demand multi-step reasoning. As the landscape of large language models (LLMs) evolves, major players like OpenAI, Anthropic, Google, and Perplexity are branding these advanced capabilities under catchy names: OpenAI refers to it as "Deep Research," while Anthropic calls it "Extended Thinking." But how well do these AI agents perform in real-world research scenarios?
Introducing the Deep Research Bench
The recent report from FutureSearch, known as Deep Research Bench (DRB), provides a comprehensive analysis of how these AI systems handle complex web-based research tasks. Unlike simple Q&A formats, DRB challenges AI models with 89 varied tasks to mimic the intricate and often messy demands faced by human analysts and researchers. Tasks include finding specific numbers and validating claims, making the evaluation not just a test of knowledge but of reasoning and adaptability.
Key Categories of Tasks:
- Finding Numbers: e.g., "How many FDA Class II medical device recalls occurred?"
- Validating Claims: e.g., "Is ChatGPT 10x more energy-intensive than Google Search?"
- Compiling Datasets: e.g., "Job trends for US software developers from 2019–2023."
The Technology Behind the Benchmarks
At the core of DRB is an architecture called ReAct (Reason + Act), designed to simulate human-like research methods. This includes reasoning through problems, performing web searches, and iterating based on observed results. DRB employs a stable dataset known as RetroSearch—essentially a frozen archive of web pages—allowing for consistent evaluations that avoid the chaos of live internet searches.
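To make the reason, act, observe cycle concrete, here is a minimal Python sketch of a ReAct-style loop that searches a frozen archive instead of the live web. It is an illustration under stated assumptions, not DRB's actual implementation: the names FROZEN_ARCHIVE, Step, search_archive, scripted_policy, and run_react are all hypothetical, the archive holds a single made-up toy page, and a hand-written policy stands in for the language model that would normally produce each thought and action.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stand-in for a frozen web snapshot such as RetroSearch.
# The real archive's format is not described in the article; this toy
# dictionary maps a normalized query to made-up page text.
FROZEN_ARCHIVE = {
    "toy question": "Archived page: the toy answer is 42.",
}


@dataclass
class Step:
    thought: str           # the agent's reasoning for this step
    action: str            # "search" or "finish"
    argument: str          # search query, or the final answer
    observation: str = ""  # what the search returned (empty for "finish")


def search_archive(query: str) -> str:
    """Look the query up in the frozen archive rather than the live web."""
    return FROZEN_ARCHIVE.get(query.strip().lower(), "No archived page found.")


def scripted_policy(question: str, history: List[Step]) -> Step:
    """Hand-written placeholder for the model that picks the next step."""
    if not history:
        return Step("I have no evidence yet, so I should search.", "search", question)
    last = history[-1].observation
    if last.startswith("Archived page:"):
        return Step("The archive returned a page; report it.", "finish", last)
    return Step("The archive has nothing relevant; stop.", "finish", "unknown")


Policy = Callable[[str, List[Step]], Step]


def run_react(question: str, policy: Policy, max_steps: int = 5) -> Tuple[str, List[Step]]:
    """Alternate reasoning, acting, and observing until the policy finishes."""
    history: List[Step] = []
    for _ in range(max_steps):
        step = policy(question, history)        # reason: choose the next action
        if step.action == "finish":
            history.append(step)
            return step.argument, history
        step.observation = search_archive(step.argument)  # act, then observe
        history.append(step)
    return "unknown", history                   # step budget exhausted


if __name__ == "__main__":
    answer, history = run_react("Toy question", scripted_policy)
    for s in history:
        print(s.action, "|", s.thought)
    print("Answer:", answer)
```

Because the archive never changes, running the same question twice yields the same observations, which is the property that makes scores comparable across models and over time. The max_steps cap is a design choice of this sketch that keeps a looping agent from searching forever.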
Findings from the Evaluation
Among the tested models, OpenAI’s o3 led the pack, scoring 0.51 out of 1.0 on the DRB scale. The score may seem modest, but it reflects the difficulty of the tasks rather than an absolute measure of capability. Still, even the best models fall short of highly skilled human researchers.
Noteworthy Competitors:
- Claude 3.7 Sonnet from Anthropic, showcasing adaptability in both "thinking" and "non-thinking" tasks.
- Gemini 2.5 Pro from Google, excelling in structured planning and multi-step reasoning.
- DeepSeek-R1, an open-source model, making significant strides and nearing the performance of closed models like GPT-4 Turbo.
Challenges and Limitations
Despite these advancements, AI agents still exhibit notable weaknesses. A common issue is what researchers term "context loss": as tasks get more complex, models often forget earlier details, leading to disjointed output. Other problems include repetitive searching and a tendency to draw premature conclusions from incomplete data, a reminder of the limits these systems still face in mimicking human-like thought processes.
The report also studied "toolless" agents that rely solely on what the model already knows, with no web search. Surprisingly, on simpler tasks these agents performed almost as well as those equipped with web search. On complex inquiries, however, they fell short, which underlines that deep research demands both recall and real-time verification, and only tool-assisted agents can provide the latter.
The Path Forward
The DRB findings underscore a crucial takeaway: while advanced AI systems can outperform average humans in narrowly defined tasks, they lag behind seasoned researchers in nuanced reasoning and adaptability. As we integrate these AI tools into serious knowledge work, frameworks like DRB will be vital in assessing not just what these systems can do, but how effectively they can do it.
In the evolving world of AI, the trends highlighted by FutureSearch show that while there is significant promise, the journey toward achieving fully autonomous, human-like researchers remains ongoing. As firms increasingly lean on AI for strategic insights, understanding these complexities will be imperative for shaping future innovations in this dynamic field.

Writes about personal finance, side hustles, gadgets, and tech innovation.
Bio: Priya specializes in making complex financial and tech topics easy to digest, with experience in fintech and consumer reviews.