OpenAI’s “Deep Research” system is a new agentic capability in ChatGPT that can independently perform multi-step online research and deliver a detailed, cited report “at the level of a research analyst” (OpenAI’s new Deep Research is the ChatGPT AI agent we’ve been waiting for – 3 reasons why I can’t wait to use it | TechRadar). In other words, you give it a complex prompt, and the AI will find information, analyze it across dozens of sources, and synthesize a comprehensive answer with references (Simon Willison on openai) (OpenAI announces "deep research" agent that can complete online research - SD Times). This is a big leap from standard ChatGPT, which normally answers based on its trained knowledge (possibly with a single retrieval step) in one go. Let’s break down how Deep Research works, how it differs from regular ChatGPT, and how it coordinates multiple AI models – including GPT-4o, GPT-4.5, o1, o1 Pro, and the o3-mini family – to carry out complex reasoning tasks.
What Exactly Is the Deep Research System?
Deep Research is essentially an autonomous research agent built into ChatGPT. Unlike a normal ChatGPT session where the model responds immediately using its internal knowledge or a brief search, the Deep Research mode unfolds over several minutes and many steps. It will iteratively:
- Formulate search queries based on your prompt,
- Browse web results and read content (articles, PDFs, even images if needed (OpenAI announces "deep research" agent that can complete online research - SD Times)),
- Analyze and cross-check information, and
- Gradually build up an answer with structured sections and citations.
All of this happens behind the scenes. As a user, you see the final “research report” output (often multi-page with headings, bullet points, and references) after a few minutes. You might also get to answer a couple of clarifying questions at the start, but then the agent takes over fully (A Comparison of Deep Research AI Agents | by Omar Santos | Feb, 2025 | AI Security Chronicles).
How is this different from standard ChatGPT? Traditional ChatGPT (even with the browsing plugin) generally handles queries in a single-step fashion – it might fetch one page or use its training data, then immediately answer. Deep Research, by contrast, behaves more like a human researcher: it can dig through many sources, change its strategy if new information suggests a different angle, and only stop when it’s gathered sufficient material (OpenAI announces "deep research" agent that can complete online research - SD Times). OpenAI itself notes that Deep Research “accomplishes in tens of minutes what would take a human many hours” (OpenAI’s new Deep Research is the ChatGPT AI agent we’ve been waiting for – 3 reasons why I can’t wait to use it | TechRadar). It produces a far more in-depth answer than a quick GPT reply, complete with source citations for transparency.
However, this complexity comes with limitations: Deep Research runs are resource-intensive and thus rate-limited for users. (Initially, ChatGPT Plus users got ~10 Deep Research queries per month, while Pro users got ~100-120 (Simon Willison on openai) (OpenAI announces "deep research" agent that can complete online research - SD Times).) This hints that under the hood, multiple powerful models and a lot of computation are being used per query, which is why it’s rationed.
Before diving into the model architecture, one more thing: Deep Research can be almost too convincing. By compiling a lengthy, well-formatted report with references, it may give an impression of PhD-level analysis even when it might have missed key info or made subtle errors (Simon Willison on openai). So while it’s extraordinarily useful, users are cautioned to verify critical facts – the system isn’t infallible and can sometimes omit crucial details or even hallucinate capabilities (e.g. thinking it has a certain tool access) (Simon Willison on openai).
With that understanding, let’s explore how Deep Research works internally and how it leverages different model “brains” for different tasks.
A Two-Headed Approach: Reasoning Agent vs. Core Knowledge Model
Under the hood, the Deep Research system uses an agent architecture. You can think of it as having two types of “AI brains” working together:
- 1. The Reasoning/Planning Agent: This is the part of the system that decides what to do next – generate a search query, read a page, ask a follow-up, etc. It breaks your query into sub-tasks, performs multi-hop reasoning, and orchestrates the whole workflow. This agent is implemented by a specialized “reasoning” model (from OpenAI’s new o-series, as we’ll discuss) that excels at chain-of-thought and tool use (A Comparison of Deep Research AI Agents | by Omar Santos | Feb, 2025 | AI Security Chronicles).
- 2. The Core Answer Generator: This is the part that actually produces the final written report and handles general understanding. It needs broad knowledge and fluent writing ability. Typically this would be a powerful GPT-4 family model that can take all the gathered info and synthesize a coherent answer.
In some designs, these might be the same model playing different roles at different times. In others, there could be a hand-off (for example, one model does the heavy multi-step reasoning and then calls in another model to draft the final answer in polished prose). OpenAI hasn’t published a precise schematic of their Deep Research internals, but based on reports and similar open-source agents, we can outline how tasks are routed between components (A Comparison of Deep Research AI Agents | by Omar Santos | Feb, 2025 | AI Security Chronicles):
- Initial Planning: When you submit a prompt in Deep Research mode, the system gives it to the manager/organizer agent (likely an o3-series reasoning model). This agent reads your question and comes up with a plan: identifying what needs to be researched and formulating initial search queries (A Comparison of Deep Research AI Agents | by Omar Santos | Feb, 2025 | AI Security Chronicles). It essentially says “Okay, first I should search for X, Y, and Z.”
- Web Searching: The agent then calls a search tool (e.g. using an API or OpenAI’s built-in browser agent called “Operator”) with the queries. It retrieves a list of relevant links (A Comparison of Deep Research AI Agents | by Omar Santos | Feb, 2025 | AI Security Chronicles).
- Browsing and Extraction: For each promising link, the agent (or a sub-agent) visits the page and extracts content. This might involve stripping the text from HTML, using an internal parser for PDFs, or even interpreting images if needed (OpenAI announces "deep research" agent that can complete online research - SD Times). The agent can also navigate page links or perform follow-up searches if the initial results weren’t sufficient (A Comparison of Deep Research AI Agents | by Omar Santos | Feb, 2025 | AI Security Chronicles).
- Iterative Reasoning: After each round of reading, the reasoning model analyzes the new information. It may update its plan or spawn new questions – just like a human researcher might say “Actually, this piece of data raises another question – let me look that up too.” The agent can change course as it learns new info (OpenAI announces "deep research" agent that can complete online research - SD Times). This loop of search → read → analyze can repeat several times, hopping through sources.
- Synthesis and Writing: Once the agent determines it has gathered enough material, it proceeds to synthesize the answer. Now, a large language model composes the report, organizing it into sections, drawing conclusions, and citing sources for each claim. This is where a model with strong writing and knowledge (like GPT-4-series) shines.
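The loop above can be sketched as a small program. This is a toy illustration, not OpenAI’s actual implementation – every helper here is a stub standing in for a model or tool call:

```python
# Minimal sketch of the Deep Research plan -> search -> read -> analyze loop.
# All helpers are toy stand-ins, not real OpenAI tools or APIs.

def plan_queries(question):
    # A reasoning model would decompose the question; here, a fixed split.
    return [f"{question} overview", f"{question} recent data"]

def search(query):
    # A search tool would return relevant links; here, one stub URL per query.
    return [f"https://example.com/{query.replace(' ', '-')}"]

def read_page(url):
    # A browser sub-agent would fetch and strip HTML/PDF; here, stub content.
    return f"content of {url}"

def deep_research(question, max_rounds=3):
    notes = []
    queries = plan_queries(question)
    for _ in range(max_rounds):
        notes.extend(read_page(url) for q in queries for url in search(q))
        # The reasoning model would revise its plan here and spawn follow-up
        # questions; the stub simply stops once material has been gathered.
        if notes:
            break
        queries = plan_queries(question)  # re-plan if nothing was found
    # Synthesis step: a writer model would compose the cited report.
    return {"question": question, "sources": len(notes), "notes": notes}

report = deep_research("economic trends")
```

The key structural point is the loop: search and reading happen repeatedly under the reasoning model's control, and synthesis happens exactly once at the end.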
Throughout this process, Deep Research is using the model(s) in a “chain-of-thought” mode, meaning the AI is effectively talking to itself, generating reasoning steps that aren’t shown to the user but guide its actions (A Comparison of Deep Research AI Agents | by Omar Santos | Feb, 2025 | AI Security Chronicles). OpenAI’s system card even noted the model sometimes produces a hidden chain-of-thought describing tools it “has” (some hallucinated) and what it’s doing (Simon Willison on openai) – evidence of this internal self-dialog.
Importantly, OpenAI chose a specific type of model for the Deep Research agent: a version of their upcoming o3 model optimized for web reasoning (OpenAI announces "deep research" agent that can complete online research - SD Times). In fact, early reports stated “Deep Research…leverages an early OpenAI o3 model, executing multi-step web research, data analysis, and enabling code execution to solve complex tasks” (The Ultimate AI Showdown: GPT-4.5 vs Claude 3.7 Sonnet vs Grok 3). This suggests the heart of the system is one of the new “o-series” models (which are built for reasoning and tool-use), rather than the standard GPT-4.
So how do all the model names come into play? Let’s unpack the roster of models and their roles.
The Cast of Models: GPT-4o, GPT-4.5, o1, o1 Pro, o3 Mini (and “High” mode)
OpenAI’s model lineup has grown more complex in the past year. Each model has different strengths, and Deep Research can pair with them in different ways. Here’s a breakdown of each:
GPT-4o: The All-Purpose Workhorse
GPT-4o (the “o” stands for “omni”) is essentially the general-purpose GPT-4 model that succeeded the original GPT-4. It’s a multimodal, versatile model – capable of handling text, images, and audio inputs in one, with a massive 128K token context window (GPT-4o explained: Everything you need to know). GPT-4o is the Swiss Army knife in OpenAI’s toolkit: it’s fast, broad in knowledge, and relatively cost-effective for its power (OpenAI’s o1 vs GPT-4o: A Deep Dive into AI’s Reasoning Revolution | by Cogni Down Under | Medium). In ChatGPT, GPT-4o (and its smaller variant) became the default models for users by late 2024 (GPT-4o explained: Everything you need to know).
Role in Deep Research: GPT-4o can be thought of as the “knowledgeable writer” model. It excels at pulling together information and articulating it well. In a Deep Research workflow, GPT-4o might be used when the agent needs to interpret diverse content (like an image graph found online or an audio snippet) since it’s multimodal (OpenAI’s o1 vs GPT-4o: A Deep Dive into AI’s Reasoning Revolution | by Cogni Down Under | Medium). It’s also a likely choice for drafting the final answer because it’s very good at natural language generation and can handle a lot of context (the collated notes from various sources). If the Deep Research agent were to hand off data to another model for writing, GPT-4o would be a prime candidate.
Strengths: Speed and versatility. GPT-4o is much faster and cheaper than the o-series reasoning models (OpenAI’s o1 vs GPT-4o: A Deep Dive into AI’s Reasoning Revolution | by Cogni Down Under | Medium). For example, its API pricing was roughly $5 per million input tokens vs $15 for o1’s input (OpenAI’s o1 vs GPT-4o: A Deep Dive into AI’s Reasoning Revolution | by Cogni Down Under | Medium). It also supports features like browsing, code execution (e.g. via Code Interpreter), and huge context lengths out of the box. In fact, GPT-4o itself can browse and solve many tasks directly – but it generally doesn’t do the kind of extensive multi-hop reasoning that Deep Research demands.
Weaknesses: While GPT-4o is very capable, deep logical reasoning and complex problem-solving are not its specialty relative to the o-series. In challenging domains like tricky math or competitive coding puzzles, it underperforms compared to models like o1. For instance, on a math exam (AIME), GPT-4o managed only ~13% accuracy, whereas the specialized o1 model hit 83% (OpenAI’s o1 vs GPT-4o: A Deep Dive into AI’s Reasoning Revolution | by Cogni Down Under | Medium)! GPT-4o tends to answer questions more directly and might not rigorously double-check itself across multiple steps unless explicitly prompted to do so. In short, it’s the fast generalist – great for broad knowledge and fluent output, but not the best at heavy reasoning for hours.
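Using the pricing figures above ($5 vs $15 per million input tokens), a quick back-of-envelope shows why routing heavy reading to the cheaper model matters; the 200K-token run size is an assumed example, not a published figure:

```python
# Back-of-envelope input-cost estimate using the per-token prices cited in
# the text ($5/M input for GPT-4o, $15/M input for o1). Illustrative only.

PRICE_PER_M_INPUT = {"gpt-4o": 5.00, "o1": 15.00}  # USD per million tokens

def input_cost(model, tokens):
    """Cost in USD of feeding `tokens` input tokens to `model`."""
    return PRICE_PER_M_INPUT[model] * tokens / 1_000_000

run_tokens = 200_000  # e.g. several long web pages gathered by the agent
cost_4o = input_cost("gpt-4o", run_tokens)
cost_o1 = input_cost("o1", run_tokens)
```

At these rates a single context-heavy run costs three times as much on o1 as on GPT-4o, before even counting o1's slower, longer reasoning outputs.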
GPT-4.5: The Next-Generation Generalist
GPT-4.5 is an intermediate upgrade to GPT-4 that OpenAI released as a research preview in early 2025. It represents the next step in the GPT-4 family, intended to improve in areas like factual accuracy, reasoning, and alignment (it was noted for enhanced pattern recognition and even some emotional intelligence in early tests (The Ultimate AI Showdown: GPT-4.5 vs Claude 3.7 Sonnet vs Grok 3)). Think of GPT-4.5 as “GPT-4, but a bit smarter and safer.”
Role in Deep Research: GPT-4.5 would play a role similar to GPT-4o, but with some improvements. It can also serve as a generalist model for summarizing and writing. If integrated into Deep Research, GPT-4.5 could potentially take on more of the reasoning load than GPT-4o would, thanks to whatever enhancements it has in following multi-step instructions. For example, it might be less prone to hallucinate facts, making the final output more reliable. There are hints that GPT-4.5 was used in ChatGPT for Pro users around the time Deep Research launched (OpenAI’s new Deep Research is the ChatGPT AI agent we’ve been waiting for – 3 reasons why I can’t wait to use it | TechRadar) (it was offered to $200/mo Pro subscribers first). So Pro users’ Deep Research sessions might leverage GPT-4.5 for answer synthesis, whereas Plus users might use GPT-4o or a smaller model.
Strengths: Better reasoning and accuracy than GPT-4o. While exact metrics aren’t public, GPT-4.5 was generally reported to reduce hallucination rates and follow complex instructions more closely. It likely maintains the 128K context and multi-modality. This makes it a strong candidate for being the core model handling the large compiled context of research sources and turning it into a well-structured answer.
Weaknesses: Not as specialized as o-series models. GPT-4.5 is still a jack-of-all-trades, and while it may outperform GPT-4o slightly in reasoning, it won’t match an o1 or o3 on tasks like step-by-step logic or math proofs. It also has a high computational cost (likely similar or slightly above GPT-4o’s cost). Thus, for the planning and multi-hop searching aspect of Deep Research, OpenAI still leans on the more efficient reasoning model (o3) to keep things tractable (OpenAI announces "deep research" agent that can complete online research - SD Times). GPT-4.5 might come in at the final stage to ensure the answer is comprehensive and well-written, using the data the agent gathered.
o1: The Deep Reasoner (First-Generation)
o1 is the first in OpenAI’s new line of “o-series” models focused on reasoning and step-by-step problem solving. If GPT-4o is a Swiss Army knife, o1 is a specialized surgical tool for logic. It uses a chain-of-thought approach, meaning it can internally work through multi-step solutions (almost like how a human would show their work in math) rather than jumping straight to an answer (OpenAI’s o1 vs GPT-4o: A Deep Dive into AI’s Reasoning Revolution | by Cogni Down Under | Medium). The result is astounding performance on complex tasks: o1 scored in the 89th percentile on Codeforces coding challenges and 83% on a challenging math exam, massively outpacing GPT-4 on those (OpenAI’s o1 vs GPT-4o: A Deep Dive into AI’s Reasoning Revolution | by Cogni Down Under | Medium). It’s designed to “think” more deeply.
Role in Deep Research: o1 (or its successor models) is basically the brain of the operation for planning and reasoning. In Deep Research, the agent that decides what to search and how to analyze info likely uses an o-series model like o1. The system’s ability to do multi-hop queries and dynamically adjust comes from this model’s strength in long chain-of-thought. For example, if the question is a complex analytical one (say, comparing economic trends), an o1-type model can break that down into subquestions (one for each economic indicator, etc.), search for each, and combine them, where a normal model might not even know where to start.
We can imagine Deep Research’s internal monologue looking like an o1 model thinking: “First, I should find data on X. Got it. Now, does that data answer the question fully? Maybe I also need Y. Let’s search that…” and so on, until all pieces are assembled. This kind of autonomous decision-making is exactly what o1 was built for (A Comparison of Deep Research AI Agents | by Omar Santos | Feb, 2025 | AI Security Chronicles).
Strengths: Top-tier reasoning and huge context. The o1 model supports a 200,000-token context window (about 150k-160k words of text!) and can output extremely long responses (up to 100k tokens) (Simon Willison on openai). That means it can juggle a ton of information – perfect for absorbing multiple web pages worth of content in Deep Research. Its chain-of-thought ability lets it solve problems that stump other models. It’s also integrated with features like function calling and reading images (so it’s not missing modern conveniences) (Simon Willison on openai). Essentially, o1 can serve as a meticulous researcher, ensuring no logical step is skipped when answering hard questions.
Weaknesses: Speed and cost. All that deep thinking comes at a price: o1 can be dozens of times slower than GPT-4o on a per-query basis (OpenAI’s o1 vs GPT-4o: A Deep Dive into AI’s Reasoning Revolution | by Cogni Down Under | Medium). It also costs significantly more to run (about 3x the price of GPT-4o per token in the API) (OpenAI’s o1 vs GPT-4o: A Deep Dive into AI’s Reasoning Revolution | by Cogni Down Under | Medium). For instance, o1 was priced around $0.015 per 1K tokens input versus $0.005 for GPT-4o (and even pricier for output) (OpenAI’s o1 vs GPT-4o: A Deep Dive into AI’s Reasoning Revolution | by Cogni Down Under | Medium). In practical terms, an o1 might take 30 seconds to reason through something a GPT-4 would answer (perhaps superficially) in 1 second. This is one reason why Deep Research can take a few minutes – the model is literally thinking step by step, which uses more compute. It’s also why OpenAI might not deploy o1 for every Deep Research query by default, especially for non-Pro users. Instead, they introduced lighter-weight reasoning models (like o3-mini) to handle most cases more efficiently, as we’ll see.
o1 Pro: The Reinforced Heavyweight
o1 Pro is essentially an even larger and more powerful version of o1. It’s OpenAI’s “most expensive model” as of early 2025 (Simon Willison on openai), likely intended for enterprise or specialized use. It has the same 200K context and features as o1 (Simon Willison on openai), but presumably with many more parameters or extra training that boost its capabilities further. The pricing is eye-watering – about 10× the cost of o1 (which is 1000× the cost of simpler models like GPT-3.5) (Simon Willison on openai). This implies o1 Pro might be extremely large in size. It’s also only available through a new API (not the standard Chat Completion API) (Simon Willison on openai), reflecting that it’s a bit experimental or niche.
Role in Deep Research: For most users, o1 Pro is overkill and likely not used. However, if an enterprise customer or a developer has access to o1 Pro, they could integrate it into a Deep Research pipeline to tackle the absolutely hardest tasks. You might consider o1 Pro the “extreme reasoning mode” – it could handle even more complex chains of logic with slightly better accuracy or solve problems o1 can’t. In theory, Deep Research could swap in o1 Pro as the reasoning agent when ultimate thoroughness is needed (perhaps via a setting or automatically if a query is detected to be exceptionally difficult). That said, due to cost, it’s doubtful it’s used widely in the built-in ChatGPT Deep Research feature for end-users. OpenAI likely sticks to o3 or o1 for that. So think of o1 Pro as the Ferrari of reasoning models – available to those who absolutely need that extra edge and willing to burn through a lot of tokens.
Strengths: Potentially state-of-the-art reasoning on any task. If o1 is highly capable, o1 Pro would push that further, possibly incorporating more recent data or fine-tuning. It might have slightly better performance on niche benchmarks and an even more robust ability to follow very complex instructions to the letter.
Weaknesses: Extremely high cost and maybe latency. It doesn’t support streaming responses (so it must complete an entire thought before outputting) (Simon Willison on openai), which suggests it might perform very long internal computations. Given its cost, it’s impractical for routine use. It’s truly meant for specialized cases (think: critical research analyses where you want the absolute best AI thinking, or perhaps in domains like scientific research assistance). In most Deep Research scenarios, cheaper models can do the job almost as well.
o3 Mini: The Cost-Effective Reasoning Engine
Now enters the model that is actually at the center of Deep Research’s broad rollout: OpenAI o3-mini. This model was released in Jan 2025 and marketed as “pushing the frontier of cost-effective reasoning.” (OpenAI o3-mini | OpenAI) It’s part of the next generation after o1 – the o3 series – but labeled “mini” because it’s a smaller, optimized model. The key idea is o3-mini tries to deliver o1-like reasoning prowess at a fraction of the cost and with lower latency. OpenAI reports that o3-mini actually outperformed o1 on many benchmarks, especially coding tasks, despite presumably having fewer parameters (Simon Willison on openai). In coding, logic, and even science questions, o3-mini is extremely impressive for its size (OpenAI o3-mini | OpenAI) (OpenAI o3-mini | OpenAI).
Role in Deep Research: o3-mini is the model actually powering most Deep Research queries for ChatGPT Plus and Pro users (OpenAI announces "deep research" agent that can complete online research - SD Times). It’s the “brains” of the Deep Research agent doing the step-by-step browsing and analysis. OpenAI specifically optimized a version of o3 for web browsing, which strongly suggests o3-mini is integrated with the browsing/Operator tool and tuned for that use case (OpenAI announces "deep research" agent that can complete online research - SD Times). It can read through content and reason about it on the fly without running up an outrageous bill – exactly what’s needed to allow more users to access Deep Research. When Deep Research was first exclusive to Pro users, perhaps a larger model was used; but now that it’s rolling out to Plus users, OpenAI notes “the iteration will be powered by GPT-4o Mini” for wider access (OpenAI Rolls Out Deep Research Access to More ChatGPT Users). In other words, to make it scalable, they utilize these mini models for the heavy lifting.
In practice, o3-mini likely handles both the planning and the drafting in many cases (because it’s capable enough to produce a good write-up too, not just plan). It might generate the report content with citations directly. However, if there are parts that require extra reasoning, o3-mini has a special trick: adjustable reasoning “effort” levels.
Strengths: Efficiency and balanced skill. o3-mini is significantly cheaper per token than even GPT-4o mini or o1. OpenAI mentioned a 95% reduction in per-token cost since GPT-4, while maintaining top-tier reasoning (OpenAI o3-mini | OpenAI), which alludes to models like o3-mini. It also has decent speed – measurements showed it produces the first token ~2.5 seconds faster than o1-mini (OpenAI o3-mini | OpenAI). And yet, its accuracy on tough benchmarks is very high: for example, in one evaluation, “o3-mini (high) reached 83.6%” on a certain test (OpenAI o3-mini | OpenAI) (likely MMLU or another academic benchmark), which is on par with the best large models. It’s also currently OpenAI’s best released model on software engineering tasks (OpenAI o3-mini | OpenAI), outperforming even GPT-4o in some coding challenges. All this means o3-mini can handle a wide range of research queries well, at low cost, making Deep Research feasible to offer to many users.
Additionally, o3-mini supports a massive context (likely 100K+ tokens) similar to its peers, so it can absorb lengthy articles and multiple sources. It’s tuned to be reliable with tools – meaning it won’t freak out when asked to use a browser or handle structured data.
Weaknesses: By nature of being “mini,” it might have a lower raw knowledge capacity than the full-sized giants. If asked an extremely obscure question relying on some niche fact not easily searchable, o3-mini might be more likely to miss it compared to GPT-4. Also, while it’s great at reasoning for its size, a truly convoluted reasoning task might still stump it unless it uses the “high effort” mode. In terms of writing quality, o3-mini is strong, but perhaps slightly less eloquent or creative than GPT-4o on open-ended composition. That said, because Deep Research outputs are factual and structured, this is not a big drawback.
o3 Mini “High” Mode: Turning Up the Dial on Reasoning
You’ll notice we refer to “o3-mini (high)” above. OpenAI introduced the idea of configurable reasoning effort with the o-series models. Essentially, the model can trade speed for better reasoning by thinking in more steps. Developers can specify something like `reasoning: {"effort": "high"}` in the API (Simon Willison on openai), which prompts the model to take longer reasoning paths internally. In benchmarks, using high effort dramatically boosts accuracy – for example, on a math test, o3-mini at high effort scored 77.0% versus a lower score at medium effort (OpenAI o3-mini | OpenAI), and on competitive programming tasks it achieved ~49% at high (OpenAI o3-mini | OpenAI), surpassing earlier versions.
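A minimal sketch of what such a request body might look like, following the field names quoted above; the exact wire format varies across OpenAI API versions, so treat the shape as illustrative rather than authoritative:

```python
# Sketch of a request body carrying the reasoning-effort setting quoted in
# the text. Field names follow that snippet; the precise schema may differ
# between OpenAI API versions, so this is illustrative, not authoritative.

def build_request(prompt, effort="medium"):
    """Assemble a hypothetical o3-mini request with a chosen effort level."""
    assert effort in {"low", "medium", "high"}
    return {
        "model": "o3-mini",
        "reasoning": {"effort": effort},  # trade latency for deeper reasoning
        "input": prompt,
    }

req = build_request("Reconcile the conflicting figures in these sources",
                    effort="high")
```

The point is simply that effort is a per-request knob: the same model can answer quickly at low effort or deliberate longer at high effort.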
Role in Deep Research: The system can invoke high-effort reasoning for especially tough sub-problems. Most of the time, o3-mini on its default setting might suffice to gather info and answer. But if the agent encounters, say, a very complex logical puzzle or a need to interpret conflicting data, it could switch to high effort (or always run in high effort for Pro users who want maximum accuracy). Practically, this might mean the agent spends extra time verifying calculations or cross-checking sources. It’s like telling the model “this part is really important, think harder/longer on it.”
From the user perspective, you just notice that Deep Research sometimes takes a bit longer on tricky queries – under the hood it could be cranking up the effort. The result is an answer closer to what an o1-level performance would give, but still using the cheaper o3-mini model. This “mode” is not a distinct model, but it’s worth listing because it effectively changes o3-mini’s behavior and capabilities.
Strengths: Maximum reasoning accuracy short of calling in a giant model. o3-mini at high effort has demonstrated performance that rivals the much larger o1 on many tasks (OpenAI o3-mini | OpenAI). It ensures that if there’s a complicated reasoning chain needed (multi-hop question answering, detailed numerical analysis, etc.), the agent can handle it. The beauty is it does so dynamically – only when needed – so simple questions don’t always incur the slowdown.
Weaknesses: Latency. High effort means the model is allowed to use far more computation per query. This can slow down the response significantly. Deep Research already might take a couple of minutes; running everything at “high” could make it even slower (though still within a few minutes typically). Therefore, OpenAI might balance this – using medium effort for most steps and reserving high for final answer formulation or very complex queries.
Also, even at high effort, o3-mini is still a smaller model at heart. If a query truly needed the full might of o1 Pro (some extreme case), o3-mini high might not fully reach that level. But those cases are rare in practical use.
How Models Collaborate in Deep Research
Putting it all together, Deep Research uses the right model for the right task in a given session. There isn’t necessarily a single fixed pairing; it can depend on your subscription level and the query complexity. Here are some patterns of how these models pair up in Deep Research workflows:
- Default (ChatGPT Plus) Workflow – o3-Mini as Solo Agent: For most users, Deep Research is handled primarily by o3-mini. This one model takes care of both the reasoning and the final writing. Thanks to its training, it can produce a structured report with citations directly (Simon Willison on openai). The system will prompt it to output in the desired format (headings, etc.), effectively letting it act as both planner and writer. This is efficient. GPT-4o might only be called in if something like image understanding is needed (since o3-mini is presumably text-only). For example, if the research involves analyzing an image from a webpage, the agent could hand that image to GPT-4o to interpret, then feed the result back into o3-mini’s context. But for purely text-based research, o3-mini can handle it end-to-end.
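That routing decision can be pictured as a trivial dispatcher. The rule itself – images and audio to GPT-4o, everything else to o3-mini – is our reading of the workflow described above, not documented OpenAI behavior:

```python
# Toy content router for the Plus workflow sketched in the text: o3-mini
# handles text end-to-end, and multimodal content is handed to GPT-4o.
# The routing rule is an assumption, not documented OpenAI behavior.

def pick_model(content_type):
    """Choose which model should process a piece of gathered content."""
    return "gpt-4o" if content_type in {"image", "audio"} else "o3-mini"

assignments = {ct: pick_model(ct) for ct in ["html", "pdf", "image"]}
```

A real agent would make this choice per retrieved artifact, then merge GPT-4o's interpretation of the image back into o3-mini's working context.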
- High-End (Pro) Workflow – Mix of o3 and GPT-4.5: ChatGPT Pro users might benefit from a hybrid approach. The Deep Research agent could use o3-mini (perhaps always in high effort mode) to do the digging and reasoning. Once it has all the info, it could pass the summarized notes to GPT-4.5 to compose a very fluent final answer. GPT-4.5 might polish the language, ensure the context is fully used, and maybe add a bit more nuance or clarity to the writing. The two models essentially tag-team: o3 does the thinking, GPT-4.5 does the talking. This approach leverages o3’s strength in analysis and GPT-4.5’s strength in expression. The user wouldn’t notice the hand-off except that the answer might read a tad smoother or come slightly faster than if o3 was writing everything (since GPT-4.5 is quicker at text generation).
- Special Cases – Escalation to o1: In exceptional cases (say an extremely difficult query or an enterprise setting), the system could escalate from o3-mini to o1. For instance, if after several attempts the o3 model isn’t confidently solving the problem, the system could recognize this and say “let’s bring in the big gun.” An o1 model (with its massive context and more thorough reasoning) might then re-process the gathered info or tackle the hardest part of the question. This isn’t confirmed behavior in ChatGPT’s Deep Research, but it’s feasible especially via the API. Essentially, o3 could act as the first pass, and o1 as the second opinion if needed. The downside is cost and time, so it’d be reserved for only when accuracy is paramount. Alternatively, an advanced user with API access might themselves program a pipeline: use o3-mini to collect data cheaply, then feed it all to o1-pro to compose the ultimate report. That would mimic a human research assistant (o3) gathering files for a senior analyst (o1-pro) to write the final memo.
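Such an escalation policy might be programmed like this in an API pipeline; the confidence scores and the threshold are entirely hypothetical stand-ins for however a real system would judge its own answers:

```python
# Hypothetical escalation policy: try the cheap reasoner first, and only
# bring in o1 if confidence stays low. This mirrors the "first pass /
# second opinion" idea in the text; the scoring mechanism is a stub.

ESCALATION_CHAIN = ["o3-mini", "o1"]  # cheapest model first

def choose_model(confidence_by_model, threshold=0.8):
    """Return the first model in the chain whose (stubbed) confidence
    clears the threshold, falling back to the strongest model."""
    for model in ESCALATION_CHAIN:
        if confidence_by_model[model] >= threshold:
            return model
    return ESCALATION_CHAIN[-1]

# o3-mini is unsure here, so the policy escalates to o1.
chosen = choose_model({"o3-mini": 0.55, "o1": 0.90})
```

Because the chain is ordered by cost, the expensive model is only billed for the minority of queries where the cheap one cannot close the question.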
Overall, the architecture is designed such that each model family contributes what it’s best at:
- GPT-4o/4.5 – provide broad knowledge, multimodal understanding, and high-quality text generation. Ensures the answer isn’t just logically sound but also well-presented and contextually rich.
- o1/o1-Pro – provide maximal reasoning depth and reliability on the hardest problems. They embody the “think step-by-step” ethos. Used when necessary for correctness.
- o3-Mini – provides fast, efficient reasoning to drive the agent’s actions on typical tasks. It’s the day-to-day engine that makes autonomous research affordable and responsive, scaling this capability to many queries.
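The division of labor above can be summarized as a simple stage-to-model routing map. This is purely illustrative: the stage names and assignments are inferred from the descriptions in this article, not a documented OpenAI mechanism.

```python
# Illustrative routing table: which model family handles which pipeline stage.
# Assignments follow the roles described above; not a documented OpenAI spec.
STAGE_MODEL = {
    "plan_queries":    "o3-mini",       # fast, cheap multi-step reasoning
    "browse_and_read": "o3-mini",       # optimized tool use / web browsing
    "deep_analysis":   "o3-mini-high",  # high-effort mode for tricky steps
    "hardest_cases":   "o1",            # maximal reasoning depth
    "final_writeup":   "gpt-4.5",       # fluent, nuanced prose (Pro tier)
}

def model_for(stage: str) -> str:
    # Default to the cost-efficient day-to-day engine.
    return STAGE_MODEL.get(stage, "o3-mini")
```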
The result is that Deep Research can consult hundreds of sources and produce a lengthy report with minimal human intervention (OpenAI’s new Deep Research is the ChatGPT AI agent we’ve been waiting for – 3 reasons why I can’t wait to use it | TechRadar). It truly behaves like an AI research analyst: scouring the web, collating findings, and giving you a cited write-up. From a technical standpoint, it’s a striking example of model orchestration, with different AI models coordinated like microservices in a pipeline.
Before concluding, let’s look at a quick comparison of these models side-by-side in the context of Deep Research:
Model Comparison Table
Below is a comparison of GPT-4o, GPT-4.5, o1, o1 Pro, o3 Mini, and o3 Mini (High Effort) with respect to their roles in Deep Research, capabilities, speed, cost, and context window:
Model | Role in Deep Research | Key Capabilities | Relative Speed | Cost | Context Size |
---|---|---|---|---|---|
GPT-4o | General-purpose model; used for final answer synthesis and multimodal tasks (e.g. interpreting images or audio in sources). Often the “voice” of the report. | - Broad knowledge base<br>- Multimodal: text, image, audio (Cogni Down Under, Medium)<br>- Web browsing & tool support | Fast – low-latency, streams output. | Moderate – standard flagship-model pricing; far cheaper than the o-series. | 128K tokens. |
GPT-4.5 | Advanced generalist; may be used for Pro users’ final answers. Possibly handles some reasoning steps with improved accuracy. | - Improved factual accuracy & intent following (The Ultimate AI Showdown: GPT-4.5 vs Claude 3.7 Sonnet vs Grok 3)<br>- Likely fewer hallucinations<br>- Strong writing with nuance (emotion, tone) | Fast – similar to GPT-4o (possibly slightly heavier, but still far faster than o1). | High – only available to Pro/API; likely similar pricing to GPT-4o or somewhat more (as a premium preview). | 128K tokens (assumed; same architecture baseline as 4o). |
o1 | Deep reasoning engine; plans multi-hop research and solves complex sub-problems. Used when thorough logical reasoning is needed. | - Chain-of-thought reasoning (Cogni Down Under, Medium)<br>- Exceptional STEM problem-solving | Slow – spends extended “thinking” time before answering. | High – premium pricing (o1 Pro below costs ~10× more). | 200K tokens. |
o1 Pro | Ultra-high-end reasoner; reserved for the most complex analyses. Would handle “mission-critical” Deep Research tasks if used. | - Same as o1, but likely higher accuracy (due to more compute)<br>- Might handle even more nuanced or lengthy reasoning without error<br>- Future-proofed for frontier tasks | Very Slow – possibly even slower, and no streaming output (Simon Willison on openai), as it may generate longer thought chains. | Prohibitive – ~$0.15/1K input, $0.60/1K output (Simon Willison on openai), i.e. 10× o1. Typically used only when absolutely needed (e.g. by enterprises or via special access). | 200K tokens (same as o1). No increase in context; primarily an improvement in quality/scale. |
o3 Mini | Primary Deep Research agent model; handles planning, searching, reading, and often composes the draft answer. Optimized for cost-effective reasoning in Plus/Pro. | - Strong multi-step reasoning at lower compute<br>- Excels at coding and logic tasks (OpenAI o3-mini announcement)<br>- Optimized tool use and web browsing (SD Times)<br>- Good structured writing (headings, lists, etc.) | Moderate/Fast – much faster than o1, with a ~2.5s faster time-to-first-token than o1-mini (OpenAI o3-mini announcement). Feels reasonably responsive given its reasoning abilities. | Low – far cheaper than o1; the economical choice for high query volume. | 200K tokens. |
o3 Mini (High) | Same model as o3-mini, run in high-effort mode; used internally when the agent needs maximum accuracy on a complex step. | - Enhanced chain-of-thought (more internal steps allowed)<br>- Top-tier benchmark performance (e.g. 83.6% on difficult knowledge tasks) (OpenAI o3-mini announcement)<br>- Reduces errors on tricky queries via exhaustive reasoning | Slow – slower than normal o3-mini; each question may take multiple internal passes. Still often faster than o1, but with noticeable lag. | Higher – consumes more reasoning tokens per answer, so each query costs more (though still cheaper than a larger model). | 200K tokens (same as o3-mini). |
Notes: “Speed” here refers to relative inference speed/latency for a given task. “Cost” is relative token pricing and compute requirements. Context sizes are based on known specs; all these models have very large context windows, suitable for handling lots of text – a critical feature for Deep Research.
Illustrative Workflow Example (with Model Pairing)
To make this concrete, imagine you ask Deep Research: “Analyze how the retail industry has transformed in the last 3 years and suggest opportunities for a new AI-based retail service. Provide data and sources.”
- Understanding the Query (Model: o3-mini) – The o3-mini agent breaks this down: it needs retail industry trends, specifically last 3 years, plus it needs to identify opportunities for AI services. It might search for “retail industry trends 2021-2024 report” first (A Comparison of Deep Research AI Agents | by Omar Santos | Feb, 2025 | AI Security Chronicles).
- First Search & Read (Model: o3-mini) – It finds a few reports (perhaps a Statista report, a news article about retail tech, etc.). It reads them and finds data on e-commerce growth, omnichannel strategies, post-pandemic consumer behavior. The agent sees mention of AI in retail (like AI-driven personalization, inventory management).
- Follow-up Questions (Model: o3-mini) – Noticing “AI in retail” mentioned, the agent now specifically searches for “AI applications in retail 2023 examples”. It finds maybe a blog or Gartner piece listing AI use-cases in retail.
- Analysis (Model: o3-mini high effort) – To formulate opportunities, the agent now synthesizes: trend data says e-commerce up X%, omnichannel is big; AI use-cases include chatbots, demand forecasting. It internally decides the opportunities likely revolve around those areas. It might run in high effort mode here to carefully cross-check facts (ensure the growth percentages are correct and from reliable sources).
- Drafting the Report (Model: GPT-4o or o3-mini) – Now it composes the answer. If using o3-mini alone, it writes a report with sections: “Retail Trends 2021-2024” (citing the sources for growth stats), “Role of AI in Recent Retail Innovations” (citing the examples found), and “Opportunities for a New AI-based Retail Service” (here it might suggest a service idea like an AI-powered customer insights platform, justified by the trends). If a handoff is configured, o3-mini might pass bullet-point notes to GPT-4.5 to articulate fully. Either way, the final text is structured, with each claim backed by a reference link (Simon Willison on openai).
- Final Output: You receive a multi-section analysis, e.g. “In the past 3 years, retail saw a 35% increase in e-commerce adoption (source: XYZ report) (Simon Willison on openai), and omnichannel models became the norm. AI technologies like chatbots for customer service and AI-driven inventory optimization have been widely adopted (The Ultimate AI Showdown: GPT-4.5 vs Claude 3.7 Sonnet vs Grok 3)... Given these trends, an opportunity emerges for a Retail AI Personalization Service, leveraging AI to analyze customer data and deliver personalized shopping experiences online and in-store…” – all with citations to those reports the agent read.
This process shows how the agent juggles retrieval and reasoning with o3-mini, then uses its language-generation capabilities (possibly handing off to a GPT-4-family model) to produce the final polished report.
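The retail walkthrough above can be condensed into a minimal agent loop. This is a sketch under stated assumptions: `search`, `read`, and `compose` are hypothetical stubs standing in for real tool calls and model invocations, since the actual control flow of Deep Research is not public.

```python
# Minimal sketch of the Deep Research loop from the example above.
# search/read/compose are stubs standing in for real tool and model calls.

def search(query: str) -> list[str]:
    return [f"doc about {query}"]        # stub: would hit a search API

def read(doc: str) -> str:
    return f"notes from {doc}"           # stub: would extract key facts

def compose(notes: list[str]) -> str:
    # Stub: a real system would hand the notes to a language model
    # (o3-mini or GPT-4.5) to draft the structured, cited report.
    return "Report:\n" + "\n".join(f"- {n}" for n in notes)

def deep_research(question: str, followups: list[str]) -> str:
    notes = []
    for query in [question] + followups:  # initial query, then follow-ups
        for doc in search(query):
            notes.append(read(doc))       # read and take notes per source
    return compose(notes)                 # synthesize the final write-up
```

The loop mirrors the steps in the example: query decomposition, iterative search-and-read, and a final synthesis pass over the accumulated notes.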
Conclusion
OpenAI’s Deep Research system is an exciting peek into the future of AI assistants: it combines the strengths of different models to achieve something far beyond a single ChatGPT response. By pairing high-level reasoning models (the o-series) with powerful generalist models (GPT-4 series), it creates an AI that can truly research like a human, not just regurgitate information. It plans, it searches, it cross-checks, and finally it writes – acting as both the researcher and the writer.
For power users, understanding this pairing is valuable. It means when you invoke Deep Research, there’s a lot happening under the hood: a careful reasoning dance by a specialist model, and possibly the fine prose of a generalist polishing the output. Each model contributes – e.g. o1 ensures the logic is sound and thorough, GPT-4o ensures the answer is comprehensive and well-phrased, and o3-mini makes it efficient enough to use at scale.
The architecture is modular and smart: a manager agent delegates to tool-using sub-agents for fetching data (A Comparison of Deep Research AI Agents | by Omar Santos | Feb, 2025 | AI Security Chronicles), and different model “experts” might be used at different stages. This modular design is likely to expand – future iterations might include models specialized in real-time data or models like vision experts if the query demands it.
In summary, Deep Research represents an orchestration of AI components: planning, tool use, and multiple reasoning passes handled by one model (o3/o1), and broad knowledge integration handled by another (GPT-4 family). The result is a system that “can do work for you independently” and compile a report that would normally take a human many hours (OpenAI’s new Deep Research is the ChatGPT AI agent we’ve been waiting for – 3 reasons why I can’t wait to use it | TechRadar). It’s a glimpse of how AI can assist with serious research tasks by leveraging the best of all worlds in model design.
As users, we get the benefit of an on-demand research analyst. And as AI enthusiasts, it’s fascinating to see how under the hood, OpenAI’s models are coordinating – each doing what they do best – to push the envelope of what an AI assistant can do (The Ultimate AI Showdown: GPT-4.5 vs Claude 3.7 Sonnet vs Grok 3).
Sources: The information above is based on reports from OpenAI and others about the Deep Research feature and OpenAI’s models, including the Deep Research system card (Simon Willison on openai), press coverage (OpenAI announces "deep research" agent that can complete online research - SD Times) (A Comparison of Deep Research AI Agents | by Omar Santos | Feb, 2025 | AI Security Chronicles), and expert analyses comparing the models’ capabilities (OpenAI’s o1 vs GPT-4o: A Deep Dive into AI’s Reasoning Revolution | by Cogni Down Under | Medium) (OpenAI’s o1 vs GPT-4o: A Deep Dive into AI’s Reasoning Revolution | by Cogni Down Under | Medium).