Under the Hood of OpenAI’s Deep Research System

OpenAI’s “Deep Research” system is a new agentic capability in ChatGPT that can independently perform multi-step online research and deliver a detailed, cited report “at the level of a research analyst” (OpenAI’s new Deep Research is the ChatGPT AI agent we’ve been waiting for – 3 reasons why I can’t wait to use it | TechRadar). In other words, you give it a complex prompt, and the AI will find information, analyze it across dozens of sources, and synthesize a comprehensive answer with references (Simon Willison on openai) (OpenAI announces "deep research" agent that can complete online research - SD Times). This is a big leap from standard ChatGPT, which normally answers based on its trained knowledge (possibly with a single retrieval step) in one go. Let’s break down how Deep Research works, how it differs from regular ChatGPT, and how it coordinates multiple AI models – including GPT-4o, GPT-4.5, o1, o1 Pro, and the o3-mini family – to carry out complex reasoning tasks.

What Exactly Is the Deep Research System?

Deep Research is essentially an autonomous research agent built into ChatGPT. Unlike a normal ChatGPT session where the model responds immediately using its internal knowledge or a brief search, the Deep Research mode unfolds over several minutes and many steps. It will iteratively:

  * Plan what needs to be researched and formulate search queries
  * Search the web and retrieve relevant links
  * Browse promising pages and extract their content
  * Analyze what it found, revising its plan and spawning follow-up questions
  * Synthesize everything into a structured, cited report

All of this happens behind the scenes. As a user, you see the final “research report” output (often multi-page with headings, bullet points, and references) after a few minutes. You might also get to answer a couple of clarifying questions at the start, but then the agent takes over fully (A Comparison of Deep Research AI Agents | by Omar Santos | Feb, 2025 | AI Security Chronicles).

How is this different from standard ChatGPT? Traditional ChatGPT (even with the browsing plugin) generally handles queries in a single-step fashion – it might fetch one page or use its training data, then immediately answer. Deep Research, by contrast, behaves more like a human researcher: it can dig through many sources, change its strategy if new information suggests a different angle, and only stop when it’s gathered sufficient material (OpenAI announces "deep research" agent that can complete online research - SD Times). OpenAI itself notes that Deep Research “accomplishes in tens of minutes what would take a human many hours” (OpenAI’s new Deep Research is the ChatGPT AI agent we’ve been waiting for – 3 reasons why I can’t wait to use it | TechRadar). It produces a far more in-depth answer than a quick GPT reply, complete with source citations for transparency.

However, this complexity comes with limitations: Deep Research runs are resource-intensive and thus rate-limited for users. (Initially, ChatGPT Plus users got ~10 Deep Research queries per month, while Pro users got ~100-120 (Simon Willison on openai) (OpenAI announces "deep research" agent that can complete online research - SD Times).) This hints that under the hood, multiple powerful models and a lot of computation are being used per query, which is why it’s rationed.

Before diving into the model architecture, one more thing: Deep Research can be almost too convincing. By compiling a lengthy, well-formatted report with references, it may give an impression of PhD-level analysis even when it has missed key information or made subtle errors (Simon Willison on openai). So while it’s extraordinarily useful, users are cautioned to verify critical facts – the system isn’t infallible and can sometimes omit crucial details or even hallucinate capabilities (e.g. believing it has access to a certain tool) (Simon Willison on openai).

With that understanding, let’s explore how Deep Research works internally and how it leverages different model “brains” for different tasks.

A Two-Headed Approach: Reasoning Agent vs. Core Knowledge Model

Under the hood, the Deep Research system uses an agent architecture. You can think of it as having two types of “AI brains” working together:

  * A reasoning agent (an o-series model) that plans the research, decides which searches and pages to pursue, and analyzes findings step by step
  * A core knowledge model (a GPT-4-series model) that contributes broad world knowledge and polished prose, especially when composing the final report

In some designs, these might be the same model playing different roles at different times. In others, there could be a hand-off (for example, one model does the heavy multi-step reasoning and then calls in another model to draft the final answer in polished prose). OpenAI hasn’t published a precise schematic of their Deep Research internals, but based on reports and similar open-source agents, we can outline how tasks are routed between components (A Comparison of Deep Research AI Agents | by Omar Santos | Feb, 2025 | AI Security Chronicles):

  1. Initial Planning: When you submit a prompt in Deep Research mode, the system gives it to the manager/organizer agent (likely an o3-series reasoning model). This agent reads your question and comes up with a plan: identifying what needs to be researched and formulating initial search queries (A Comparison of Deep Research AI Agents | by Omar Santos | Feb, 2025 | AI Security Chronicles). It essentially says “Okay, first I should search for X, Y, and Z.”
  2. Web Searching: The agent then calls a search tool (e.g. using an API or OpenAI’s built-in browser agent called “Operator”) with the queries. It retrieves a list of relevant links (A Comparison of Deep Research AI Agents | by Omar Santos | Feb, 2025 | AI Security Chronicles).
  3. Browsing and Extraction: For each promising link, the agent (or a sub-agent) visits the page and extracts content. This might involve stripping the text from HTML, using an internal parser for PDFs, or even interpreting images if needed (OpenAI announces "deep research" agent that can complete online research - SD Times). The agent can also navigate page links or perform follow-up searches if the initial results weren’t sufficient (A Comparison of Deep Research AI Agents | by Omar Santos | Feb, 2025 | AI Security Chronicles).
  4. Iterative Reasoning: After each round of reading, the reasoning model analyzes the new information. It may update its plan or spawn new questions – just like a human researcher might say “Actually, this piece of data raises another question – let me look that up too.” The agent can change course as it learns new info (OpenAI announces "deep research" agent that can complete online research - SD Times). This loop of search → read → analyze can repeat several times, hopping through sources.
  5. Synthesis and Writing: Once the agent determines it has gathered enough material, it proceeds to synthesize the answer. Now, a large language model composes the report, organizing it into sections, drawing conclusions, and citing sources for each claim. This is where a model with strong writing and knowledge (like GPT-4-series) shines.
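The five numbered steps above can be sketched as a minimal loop. This is an illustrative sketch only: every function here (`plan_queries`, `search`, `read_page`, and so on) is a hypothetical stand-in, not OpenAI’s actual internals, and the search/browse steps are stubbed rather than hitting the real web.

```python
# Minimal sketch of the plan -> search -> read -> analyze -> write loop.
# All names are hypothetical stand-ins; search and browsing are stubbed.

def plan_queries(question):
    # Step 1: a reasoning model turns the question into initial queries.
    return [f"overview of {question}", f"recent data on {question}"]

def search(query):
    # Step 2: a search tool returns candidate links (stubbed here).
    return [f"https://example.com/{query.replace(' ', '-')}"]

def read_page(url):
    # Step 3: visit the page and extract its text (stubbed here).
    return f"extracted text from {url}"

def needs_more_research(notes, max_rounds=3):
    # Step 4: the reasoning model decides whether to keep digging.
    return len(notes) < max_rounds

def write_report(question, notes):
    # Step 5: a strong writing model synthesizes the cited report.
    body = "\n".join(f"- {n}" for n in notes)
    return f"# Report on {question}\n{body}"

def deep_research(question):
    notes = []
    queries = plan_queries(question)
    while needs_more_research(notes):
        # If planned queries run out, the agent spawns follow-ups.
        query = queries.pop(0) if queries else f"follow-up on {question}"
        for url in search(query):
            notes.append(read_page(url))
    return write_report(question, notes)

print(deep_research("quantum error correction"))
```

The key design point the sketch captures is that the loop’s exit condition is itself a model decision (step 4), which is what lets the real agent change course mid-run instead of following a fixed script.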

Throughout this process, Deep Research uses the model(s) in a “chain-of-thought” mode, meaning the AI is effectively talking to itself, generating reasoning steps that aren’t shown to the user but guide its actions (A Comparison of Deep Research AI Agents | by Omar Santos | Feb, 2025 | AI Security Chronicles). OpenAI’s system card even noted the model sometimes produces a hidden chain-of-thought describing tools it “has” (some hallucinated) and what it’s doing (Simon Willison on openai) – evidence of this internal self-dialog.

Importantly, OpenAI chose a specific type of model for the Deep Research agent: a version of their upcoming o3 model optimized for web reasoning (OpenAI announces "deep research" agent that can complete online research - SD Times). In fact, early reports stated “Deep Research…leverages an early OpenAI o3 model, executing multi-step web research, data analysis, and enabling code execution to solve complex tasks” (The Ultimate AI Showdown: GPT-4.5 vs Claude 3.7 Sonnet vs Grok 3). This suggests the heart of the system is one of the new “o-series” models (which are built for reasoning and tool-use), rather than the standard GPT-4.

So how do all the model names come into play? Let’s unpack the roster of models and their roles.

The Cast of Models: GPT-4o, GPT-4.5, o1, o1 Pro, o3 Mini (and “High” mode)

OpenAI’s model lineup has grown more complex in the past year. Each model has different strengths, and Deep Research can pair with them in different ways. Here’s a breakdown of each:

GPT-4o: The All-Purpose Workhorse

GPT-4o (the “o” stands for “omni”) is essentially the general-purpose GPT-4 model that succeeded the original GPT-4. It’s a multimodal, versatile model – capable of handling text, images, and audio inputs in one model, with a massive 128K-token context window (GPT-4o explained: Everything you need to know). GPT-4o is the Swiss Army knife in OpenAI’s toolkit: it’s fast, broad in knowledge, and relatively cost-effective for its power (OpenAI’s o1 vs GPT-4o: A Deep Dive into AI’s Reasoning Revolution | by Cogni Down Under | Medium). In ChatGPT, GPT-4o (and its smaller variant) became the default models for users by late 2024 (GPT-4o explained: Everything you need to know).

Role in Deep Research: GPT-4o can be thought of as the “knowledgeable writer” model. It excels at pulling together information and articulating it well. In a Deep Research workflow, GPT-4o might be used when the agent needs to interpret diverse content (like an image graph found online or an audio snippet) since it’s multimodal (OpenAI’s o1 vs GPT-4o: A Deep Dive into AI’s Reasoning Revolution | by Cogni Down Under | Medium). It’s also a likely choice for drafting the final answer because it’s very good at natural language generation and can handle a lot of context (the collated notes from various sources). If the Deep Research agent were to hand off data to another model for writing, GPT-4o would be a prime candidate.

Strengths: Speed and versatility. GPT-4o is much faster and cheaper than the o-series reasoning models (OpenAI’s o1 vs GPT-4o: A Deep Dive into AI’s Reasoning Revolution | by Cogni Down Under | Medium). For example, its API pricing was roughly $5 per million input tokens vs $15 for o1’s input (OpenAI’s o1 vs GPT-4o: A Deep Dive into AI’s Reasoning Revolution | by Cogni Down Under | Medium). It also supports features like browsing, code execution (e.g. via Code Interpreter), and huge context lengths out of the box. In fact, GPT-4o itself can browse and solve many tasks directly – but it generally doesn’t do the kind of extensive multi-hop reasoning that Deep Research demands.
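To make that price gap concrete, here is a back-of-the-envelope comparison using the per-million-token input prices cited above. The 500K-token run size is an illustrative assumption (a multi-source research run ingests a lot of web text), not a measured figure from a real Deep Research query.

```python
# Rough input-cost comparison at the cited prices:
# $5 per million input tokens for GPT-4o vs $15 for o1.
GPT4O_INPUT_PER_M = 5.00   # USD per million input tokens
O1_INPUT_PER_M = 15.00     # USD per million input tokens

def input_cost(tokens, price_per_million):
    """Dollar cost of feeding `tokens` input tokens to a model."""
    return tokens / 1_000_000 * price_per_million

# Suppose one research run feeds ~500K tokens of web content to the model:
tokens = 500_000
print(f"GPT-4o: ${input_cost(tokens, GPT4O_INPUT_PER_M):.2f}")  # $2.50
print(f"o1:     ${input_cost(tokens, O1_INPUT_PER_M):.2f}")     # $7.50
```

A few dollars of compute per run (before output tokens and the reasoning model’s overhead) helps explain why Deep Research queries are rationed per month rather than unlimited.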

Weaknesses: While GPT-4o is very capable, deep logical reasoning and complex problem-solving are not its specialty relative to the o-series. In challenging domains like tricky math or competitive coding puzzles, it underperforms compared to models like o1. For instance, on a math exam (AIME), GPT-4o managed only ~13% accuracy, whereas the specialized o1 model hit 83% (OpenAI’s o1 vs GPT-4o: A Deep Dive into AI’s Reasoning Revolution | by Cogni Down Under | Medium)! GPT-4o tends to answer questions directly and might not rigorously double-check itself across multiple steps unless explicitly prompted to do so. In short, it’s the fast generalist – great for broad knowledge and fluent output, but not the best fit for long, heavy multi-step reasoning.

GPT-4.5: The Next-Generation Generalist

GPT-4.5 is an intermediate upgrade to GPT-4 that OpenAI released as a research preview in early 2025. It represents the next step in the GPT-4 family, intended to improve in areas like factual accuracy, reasoning, and alignment (it was noted for enhanced pattern recognition and even some emotional intelligence in early tests (The Ultimate AI Showdown: GPT-4.5 vs Claude 3.7 Sonnet vs Grok 3)). Think of GPT-4.5 as “GPT-4, but a bit smarter and safer.”