Quick Take on OpenAI's New Models: o3, o4-mini — What Can Image Reasoning + Autonomous Tools Bring? | Yunqi Tech π

Yunqi Capital (云启资本) · April 17, 2025

In the early hours of April 17 Beijing time, OpenAI released two new AI reasoning models, o3 and o4-mini. These are likely the last standalone AI reasoning models before GPT-5, with further improvements in both performance and cost efficiency. In this episode of **"Yunqi Tech π"**, we break down the details for you.

AI models keep evolving.


Source: WeChat account "Founder Park" | Original title: "OpenAI Releases o3 and o4-mini: First Reasoning Models That Think With Images, Capable of Autonomous Tool Use, Benchmark Scores Continue to Climb"

OpenAI today announced two new AI reasoning models: o3 — which the company calls its "most powerful reasoning model" — and o4-mini, a smaller, faster model that "delivers exceptional performance for its size and cost." OpenAI says o3 is its most advanced reasoning model to date, outperforming previous models on tests of mathematics, coding, reasoning, science, and visual understanding. Meanwhile, o4-mini offers what OpenAI describes as a competitive balance of price, speed, and performance.

o3 and o4-mini introduce "thinking with images": the ability to "directly integrate images into their chain of thought." OpenAI says the models can also manipulate images while reasoning, for example by zooming in or rotating them. OpenAI also announced that its reasoning models now have access to all ChatGPT tools, including web browsing, Python code execution, image processing, and image generation.

Starting today, ChatGPT Plus, Pro, and Team users can use these tools with o3, o4-mini, and o4-mini-high, and o3-pro is to follow "within weeks." (o1, o3-mini, and o3-mini-high will be gradually phased out from these tiers.) Pricing has dropped significantly compared to equivalent o1 tiers (o3-mini is 63% cheaper than o1-mini).

Sam Altman said o3 and o4-mini are likely the last standalone AI reasoning models in ChatGPT before GPT-5, the company's next-generation product, which aims to unify traditional models (like GPT-4.1) with its reasoning models.

o3 and o4-mini are now available to developers through the Chat Completions API and the Responses API (some developers may need to verify their organization identity to access these models). In the API, both models offer a 200,000-token context window and up to 100,000 output tokens, with a knowledge cutoff of June 1, 2024.
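The API limits quoted above can be turned into a simple pre-flight check. The sketch below is illustrative only: the function name `output_budget` is made up for this example, and it assumes that prompt tokens and generated tokens share the same 200,000-token window, so a long prompt reduces the room left for output.

```python
# Limits quoted in the article for o3 and o4-mini in the API.
CONTEXT_WINDOW = 200_000      # total tokens available to one request
MAX_OUTPUT_TOKENS = 100_000   # cap on generated tokens


def output_budget(prompt_tokens: int, requested_output: int) -> int:
    """Return how many output tokens fit for a given prompt size.

    Assumption (not stated in the article): prompt and output tokens
    both count against the same context window.
    """
    if prompt_tokens < 0 or requested_output < 0:
        raise ValueError("token counts must be non-negative")
    remaining = CONTEXT_WINDOW - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt does not fit in the context window")
    # The budget is limited by the request, the output cap, and the space left.
    return min(requested_output, MAX_OUTPUT_TOKENS, remaining)


if __name__ == "__main__":
    print(output_budget(50_000, 120_000))
    print(output_budget(150_000, 100_000))
```

With a 50,000-token prompt the output cap is the binding limit. With a 150,000-token prompt the remaining window is.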

Benchmark Scores Keep Breaking Records

o3 set new state-of-the-art (SOTA) results on benchmarks including Codeforces, SWE-bench, and MMMU. It is particularly well suited to complex queries that require multi-angle analysis and whose answers may not be obvious, and it performs especially well on visual tasks such as analyzing images, charts, and diagrams. In evaluations by external experts, o3 made 20% fewer major errors than OpenAI o1 on difficult real-world tasks, with standout performance in programming, business/consulting, and creative ideation.

OpenAI o4-mini is a lightweight model optimized for fast, cost-efficient reasoning, and it delivers exceptional performance on mathematics, programming, and visual tasks for its size and cost. On the AIME 2025 benchmark, when given access to a Python interpreter, o4-mini scored 99.5%, close to the benchmark's ceiling. Expert evaluations also show it outperforming its predecessor o3-mini on non-STEM tasks and data science. Thanks to its efficient design, o4-mini supports significantly higher rate limits than o3, making it a strong high-throughput option for workloads that require complex reasoning.

Both new models are also generally more efficient than their predecessors, OpenAI o1 and o3-mini. Taking the 2025 AIME math competition as an example, o3's cost-performance frontier sits above o1's at every cost level, and o4-mini's frontier likewise sits above o3-mini's. Overall, in most practical applications, o3 and o4-mini should be both smarter and cheaper than o1 and o3-mini respectively.
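To make the frontier comparison concrete, here is a minimal sketch of how one frontier can be checked against another. The (cost, accuracy) numbers are invented placeholders, not OpenAI's measurements, and the helper names are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (cost per task, accuracy)


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep only points that no cheaper-or-equal point matches or beats in accuracy."""
    frontier: List[Point] = []
    best_acc = float("-inf")
    # Sort by cost; for equal cost, try the more accurate point first.
    for cost, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if acc > best_acc:
            frontier.append((cost, acc))
            best_acc = acc
    return frontier


def best_accuracy_at(frontier: List[Point], budget: float) -> Optional[float]:
    """Best accuracy reachable with cost <= budget, or None if nothing is affordable."""
    affordable = [acc for cost, acc in frontier if cost <= budget]
    return max(affordable) if affordable else None


def dominates(a: List[Point], b: List[Point]) -> bool:
    """True if frontier a is at least as accurate as b at every cost on b, and better somewhere."""
    strictly_better = False
    for cost, acc_b in b:
        acc_a = best_accuracy_at(a, cost)
        if acc_a is None or acc_a < acc_b:
            return False
        if acc_a > acc_b:
            strictly_better = True
    return strictly_better


# Placeholder measurements for two hypothetical models at different reasoning budgets.
new_model = pareto_frontier([(1.0, 0.80), (2.0, 0.88), (3.0, 0.85), (4.0, 0.92)])
old_model = pareto_frontier([(1.5, 0.70), (3.0, 0.78), (6.0, 0.83)])

if __name__ == "__main__":
    print(dominates(new_model, old_model))
```

The point at cost 3.0 is dropped from the new model's frontier because a cheaper setting already achieves higher accuracy.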

Thinking With Images

OpenAI claims o3 and o4-mini are its first models capable of "thinking with images." In practice, users can upload images to ChatGPT — such as whiteboard sketches or charts from PDFs — and the models will analyze these images during their "chain of thought" before responding. Thanks to this new capability, o3 and o4-mini can understand blurry and low-quality images, and perform operations like zooming in or rotating images during reasoning.
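The article does not describe how the models carry out these operations internally. As a rough illustration of what cropping, zooming, and rotating mean at the pixel level, the sketch below applies them to a tiny grid of integers standing in for an image. All function names are hypothetical.

```python
from typing import List

Grid = List[List[int]]  # rows of pixel values


def crop(img: Grid, top: int, left: int, height: int, width: int) -> Grid:
    """Return the sub-image starting at (top, left) with the given size."""
    return [row[left:left + width] for row in img[top:top + height]]


def zoom(img: Grid, factor: int) -> Grid:
    """Nearest-neighbour upscaling: repeat each pixel `factor` times in both directions."""
    out: Grid = []
    for row in img:
        wide = [px for px in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def rotate_cw(img: Grid) -> Grid:
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


if __name__ == "__main__":
    image = [[1, 2, 3],
             [4, 5, 6]]
    print(crop(image, 0, 1, 2, 2))
    print(zoom([[1, 2]], 2))
    print(rotate_cw(image))
```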

They seamlessly combine advanced reasoning with tools like web search and image processing — automatically scaling, cropping, flipping, or optimizing images — even extracting insights from flawed photos. For example, users can upload photos of economics problem sets for step-by-step solutions, or share screenshots of build errors to quickly get root cause analysis.

OpenAI tested o3 and o4-mini on a diverse set of human exams and machine learning benchmarks. The new visual reasoning models significantly outperformed their predecessors on all multimodal tasks tested.

Autonomous Tool Use

OpenAI's o3 and o4-mini models have full access to tools within ChatGPT, and can use custom tools via function calling in the API. These models are trained to reason about how to solve problems, choose when and how to use tools, quickly generate detailed and thoughtful answers — typically within a minute — and present outputs in the correct format.
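For readers who have not used function calling, the sketch below shows the general shape: a tool is described with a name and a JSON-schema parameter list, the model replies with a tool name plus JSON-encoded arguments, and the application runs the matching local function. The tool `get_energy_usage`, its data, and the `dispatch` helper are all hypothetical, and no API call is made here.

```python
import json

# Hypothetical tool description in the JSON-schema style used for function calling.
GET_USAGE_TOOL = {
    "type": "function",
    "function": {
        "name": "get_energy_usage",
        "description": "Return total energy usage for a region and year, in GWh.",
        "parameters": {
            "type": "object",
            "properties": {
                "region": {"type": "string"},
                "year": {"type": "integer"},
            },
            "required": ["region", "year"],
        },
    },
}


def get_energy_usage(region: str, year: int) -> dict:
    """Hypothetical local implementation; the figure is a made-up placeholder."""
    fake_table = {("California", 2024): 1000.0}
    return {"region": region, "year": year, "gwh": fake_table.get((region, year))}


# Maps tool names the model may request to local Python callables.
REGISTRY = {"get_energy_usage": get_energy_usage}


def dispatch(name: str, arguments_json: str) -> str:
    """Run the named tool with JSON-encoded arguments and return a JSON string."""
    if name not in REGISTRY:
        return json.dumps({"error": f"unknown tool: {name}"})
    args = json.loads(arguments_json)
    return json.dumps(REGISTRY[name](**args))


if __name__ == "__main__":
    # Simulates a tool call as it might come back from the model.
    print(dispatch("get_energy_usage", '{"region": "California", "year": 2024}'))
```

In a real integration, the tool description would be sent with the request and the string returned by `dispatch` would be passed back to the model as the tool result.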

For example, a user might ask: "How will California's summer energy usage compare to last year?" The model can search the web for utility data, write Python code to build projections, generate charts or images, and explain the key factors behind the forecast — chaining together multiple tool calls. Their reasoning capabilities let the models react and adapt to information they encounter as needed. For instance, they can search the web multiple times with the help of a search provider, review results, and try new searches when more information is needed.
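The search-review-retry pattern described here can be sketched as a small loop. This is a simplified illustration, not OpenAI's implementation: `fake_search` is a stand-in that returns canned snippets, and the list of queries is fixed in advance, whereas a reasoning model would compose each follow-up query itself based on what it has found so far.

```python
from typing import Callable, Dict, List


def fake_search(query: str) -> List[str]:
    """Hypothetical stand-in for a web search tool; returns canned snippets."""
    canned = {
        "california summer energy usage": [],
        "california summer electricity demand forecast": ["snippet A", "snippet B"],
    }
    return canned.get(query, [])


def research(
    queries: List[str],
    search: Callable[[str], List[str]],
    needed: int = 2,
    max_steps: int = 5,
) -> Dict[str, object]:
    """Try queries in order until enough snippets are collected or the step budget runs out."""
    found: List[str] = []
    tried: List[str] = []
    for query in queries[:max_steps]:
        tried.append(query)
        found.extend(search(query))   # review what came back
        if len(found) >= needed:      # stop once there is enough to answer
            break
    return {"tried": tried, "found": found, "enough": len(found) >= needed}


if __name__ == "__main__":
    plan = [
        "california summer energy usage",
        "california summer electricity demand forecast",
        "california grid operator outlook",
    ]
    print(research(plan, fake_search))
```

The first query returns nothing, so the loop moves on. The second returns enough material, so the third is never issued.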

New Frontiers in Reinforcement Learning

During the development of OpenAI o3, OpenAI observed that large-scale reinforcement learning exhibits the same "more compute = better performance" trend seen in GPT-series pretraining. Retracing the scaling path, this time in reinforcement learning, OpenAI increased both training compute and inference-time reasoning by an order of magnitude and still saw clear performance gains, which supports the view that model performance keeps improving the more the model is allowed to think. At the same latency and cost as OpenAI o1, o3 delivers higher performance in ChatGPT, and OpenAI says it has validated that performance continues to climb when the model is given more time to think.

OpenAI also trained both models to use tools through reinforcement learning — not just teaching them how to use tools, but teaching them to reason about when to use them. Their ability to deploy tools based on expected outcomes makes them perform better in open-ended situations, especially those involving visual reasoning and multi-step workflows. According to early testers, this improvement shows up both in academic benchmarks and in real-world tasks.

Codex CLI

Codex CLI is a lightweight coding agent that runs from your terminal. It runs directly on your computer, turns natural language into executable code, and is designed to maximize the reasoning capabilities of models like o3 and o4-mini — with support for more API models like GPT-4.1 coming soon.

An OpenAI spokesperson told TechCrunch: "Codex CLI is a lightweight, open-source coding agent that runs locally in your terminal." The goal is "to give users a minimal, transparent interface that directly connects models with code and tasks."

Users can pass screenshots or low-fidelity sketches to the model via command line, combined with local code access, to gain the benefits of multimodal reasoning. OpenAI sees this as bridging its models with users and their compute