ChatGPT 5.5 vs Claude Opus 4.7 vs Gemini 3.1 Pro: Benchmarks, Pricing, and Codex (2026)

Coding Liquids blog cover featuring Sagnik Bhattacharya for ChatGPT 5.5 vs Claude Opus 4.7 vs Gemini 3.1 Pro: Benchmarks, Pricing, and Codex (2026), with OpenAI model comparison dashboards, Codex workflow cards, and benchmark visuals.

On 23 April 2026, OpenAI shipped GPT-5.5 — internally codenamed Spud — and made it available the same day inside ChatGPT and Codex. Unlike the 5.2, 5.3, and 5.4 releases that preceded it, GPT-5.5 is not a post-training refinement of GPT-5; it is the first fully retrained base model since GPT-4.5. That distinction matters because the improvements are broad rather than task-specific: agentic coding, computer use, knowledge work, and early scientific research all move forward together.

This tutorial is written for someone who wants a single, honest reference on GPT-5.5 and its Codex deployment. By the end of it you will know what changed versus GPT-5.4, what the Codex variant actually does differently, how the published benchmark numbers compare against Claude Opus 4.7 and Gemini 3.1 Pro, where pricing lands, and which of the three families deserves your default slot for different kinds of work. I have a companion guide on GitHub Copilot Agent Mode in VS Code if you want the IDE-side story, and ChatGPT vs Claude vs Copilot vs Gemini for Excel for the non-coding comparison.

Quick comparison: ChatGPT 5.5 vs Claude Opus 4.7 vs Gemini 3.1 Pro

Short version: GPT-5.5 Codex is the safest default for autonomous coding and terminal-heavy work, Claude Opus 4.7 has the strongest public SWE-Bench Pro number in the current comparison, and Gemini 3.1 Pro is the value pick when you need strong reasoning with cheaper tokens and a large context window.

| Model | Best for | Benchmark signal | Standard API price |
|---|---|---|---|
| ChatGPT 5.5 / GPT-5.5 Codex | Autonomous coding, terminal tasks, Codex workflows | 82.7% Terminal-Bench 2.0; 58.6% SWE-Bench Pro | $5 input / $30 output per 1M tokens; Pro is $30 / $180 |
| Claude Opus 4.7 | Hard code fixes, code review, sustained agentic work | 64.3% SWE-Bench Pro; 69.4% Terminal-Bench 2.0 in OpenAI's comparison table | $5 input / $25 output per 1M tokens |
| Gemini 3.1 Pro | Cost-sensitive large-context work and Google-stack workflows | 54.2% SWE-Bench Pro; 68.5% Terminal-Bench 2.0 in OpenAI's comparison table | $2 input / $12 output per 1M tokens for prompts up to 200K; higher above 200K |

What GPT-5.5 actually is

GPT-5.5 is positioned by OpenAI as their "smartest and most intuitive to use model" at release. The framing in the announcement is worth paying attention to: it is sold less as a reasoning step-change and more as a faster, sharper thinker for fewer tokens. The model matches GPT-5.4's per-token latency in production serving while producing higher-quality output and consuming fewer output tokens to complete the same task. That combination — equal latency, higher quality, lower token count — is what makes the upgrade economically meaningful rather than just optically impressive.

Three structural things changed with this release:

  1. Fresh base model, not a fine-tune. GPT-5.2, 5.3, and 5.4 were all post-training iterations on the GPT-5 base. GPT-5.5 is a new pre-train — the first since GPT-4.5 in 2025. In practice this shows up as deeper code-semantics understanding and better multi-file reasoning rather than just sharper chat responses.
  2. 1 million token context window at base tier. The Responses and Chat Completions API exposes a 1M token context for both GPT-5.5 and GPT-5.5 Pro. For the work where context matters most — reviewing a whole microservice, summarising a codebase, tracing an incident across logs — this closes the gap with Gemini's long-context advantage.
  3. Agentic capability is the headline. OpenAI explicitly ties GPT-5.5 to "a new way of getting work done on a computer" — operating software, researching online, running multi-step workflows autonomously. This is why the Codex and ChatGPT rollouts landed on the same day rather than Codex catching up weeks later.

GPT-5.5 Pro: when the bigger bill is worth it

GPT-5.5 ships with a sibling model, GPT-5.5 Pro, priced at $30 per 1M input tokens and $180 per 1M output tokens — 6x the base model on input and 6x on output. Pro is not a different architecture; it is the same model running with more reasoning compute per request, tuned for maximum accuracy on hard problems.

Use GPT-5.5 Pro when the task is scientific reasoning, non-trivial mathematics, or a long agentic job where a single mis-step is expensive (a data migration script, a refactor touching payment code, an experimental analysis you will write up and cite). Do not use Pro as your default chat model — the economics do not work, and the base GPT-5.5 already clears the bar for almost all knowledge work. A clean rule of thumb: if you cannot articulate what Pro would have caught that base missed, stay on base.

GPT-5.5 in Codex: what changes

Codex is the coding-specific surface for GPT-5.5. When you use Codex — through the VS Code extension, the Terminal CLI, the Codex SDK, or the cloud async environment — you get a model tuned on top of the GPT-5.5 base for autonomous code generation, multi-file reasoning, and iterative debugging. The loop is structured: natural-language task → plan → generate code → run tests → review and iterate. Changes land as auditable diffs or pull requests rather than direct commits to main, and shell commands, file operations, and package installs are chained inside a sandboxed cloud environment.

Three things the Codex variant does that plain GPT-5.5 does not do as aggressively:

  1. Dynamic reasoning budget based on task scope. A rename or typo fix completes in seconds. An API endpoint with tests takes 5–15 minutes. A multi-file feature takes 30–90 minutes. A framework migration can run for 7+ hours autonomously. The model decides the budget — you just set a maxDurationMinutes ceiling.
  2. First-class Codex SDK. The @openai/codex-sdk package wraps the agent loop cleanly: you pass an API key, a repo path, and a task description; you get back a structured result with filesChanged, testsRun, and a diff.
  3. Multi-agent orchestration. A parent Codex agent can spawn subagents scoped to specific directories (for example, backend, frontend, tests), each running with its own task scope, and merge the results behind a validation pass. This is the pattern NVIDIA has disclosed it uses internally with 10,000+ employees.
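The SDK's actual orchestration surface is not shown in this post, so here is a generic sketch of the fan-out/merge pattern in plain TypeScript. The `ScopedTask`, `run`, and `validate` shapes are my own invention for illustration, not real Codex SDK types:

```typescript
// Fan-out/merge sketch: one subagent per scoped directory, results merged
// behind a validation pass. All type shapes here are hypothetical.
type ScopedTask = { scope: string; task: string };
type SubResult = { scope: string; filesChanged: string[] };

async function orchestrate(
  tasks: ScopedTask[],
  run: (t: ScopedTask) => Promise<SubResult>,
  validate: (files: string[]) => boolean,
): Promise<{ filesChanged: string[]; valid: boolean }> {
  // Fan out: run each scoped subagent concurrently.
  const results = await Promise.all(tasks.map(run));
  // Merge: collect every changed file exactly once.
  const all: string[] = [];
  for (const r of results) all.push(...r.filesChanged);
  const filesChanged = Array.from(new Set(all));
  // Validation pass before the merged diff is surfaced for review.
  return { filesChanged, valid: validate(filesChanged) };
}
```

In a real setup, `run` would call the Codex SDK with the subagent's directory scope, and `validate` would run the full test suite against the merged diff.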

Minimal Codex SDK example

Here is the smallest useful Codex setup. It takes a natural-language task and returns a diff you can inspect before applying:

```typescript
import { CodexAgent } from "@openai/codex-sdk";

const agent = new CodexAgent({
  apiKey: process.env.OPENAI_API_KEY,
  model: "gpt-5-codex",
  repoPath: "./my-project",
});

const result = await agent.run({
  task: "Add input validation to POST /api/users. " +
        "Validate email format, require non-empty name, " +
        "return 400 on invalid input. Include unit tests.",
  maxDurationMinutes: 30,
});

console.log(result.filesChanged);  // ["src/api/users.ts", "tests/users.test.ts"]
console.log(result.testsRun);       // { passed: 12, failed: 0 }
console.log(result.diff);           // full change diff, review before merging
```

In practice you wrap this in a CI job so labelled GitHub issues auto-open a draft pull request. I have a longer walkthrough at How to Build an AI Agent With File Search and Tools — the patterns transfer directly.
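The CI glue itself is simple selection logic. As a sketch, here is the issue-filtering half in TypeScript; the `Issue` shape and the "codex" label convention are assumptions for illustration, not a real GitHub or Codex API:

```typescript
// Decide which labelled issues get handed to a Codex run, and build the
// natural-language task string for each. Shapes here are hypothetical.
type Issue = { number: number; labels: string[]; title: string };

function issuesToDispatch(issues: Issue[], label = "codex"): Issue[] {
  return issues.filter((i) => i.labels.includes(label));
}

function issueToTask(issue: Issue): string {
  return `Resolve GitHub issue #${issue.number}: ${issue.title}. ` +
         `Open a draft pull request with the diff.`;
}
```

The other half is plumbing: fetch open issues on a schedule, pass each task string to `agent.run`, and post the resulting diff as a draft PR.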

Where Codex runs

Four deployment environments, each for a different workflow:

  • VS Code extension. Interactive edits, pair-programming style. Seconds to minutes per action.
  • Terminal CLI. Scripted workflows, CI/CD, local batch jobs. Minutes to hours.
  • Cloud async. Long refactors and unattended work that you submit and review later. Up to 7+ hours per task, with Slack notifications when done.
  • Codex SDK. Programmatic control from your own service for bots, internal tools, or bespoke agents.

For a fuller tour of the VS Code side specifically, see my Copilot Agent Mode in VS Code tutorial — the Codex extension behaves similarly in the IDE but without the Microsoft 365 policy layer.

The headline GPT-5.5 benchmarks

OpenAI published a set of concrete numbers at launch. The ones worth knowing:

  • GDPval: 84.9%. A test of agents producing well-specified knowledge work across 44 occupations. This is the headline number for work replacement — how well the model produces real deliverables in categories like finance, law, consulting, and engineering write-ups.
  • OSWorld-Verified: 78.7%. Measures whether a model can operate a real computer environment autonomously — clicking, typing, reading UI state, completing multi-step tasks in actual applications. This is the computer use benchmark.
  • Tau2-Bench Telecom: 98.0%. A customer-service workflow benchmark on complex telecom tasks, measured without prompt tuning. Near-ceiling performance here is what allows OpenAI to claim agentic readiness for real support workflows.
  • Terminal-Bench 2.0: 82.7%. Complex command-line workflows and DevOps automation. A state-of-the-art score at release.
  • SWE-Bench Pro: 58.6%. Real GitHub issue resolution across Python, JavaScript/TypeScript, Java, and Go. This is the Codex-variant number; it measures practical coding rather than isolated problem-solving.

Two numbers that do not appear on the public slide but matter just as much: token efficiency (GPT-5.5 uses fewer output tokens than GPT-5.4 for identical tasks, which lowers per-task cost at identical per-token pricing), and per-token latency (matched to GPT-5.4, so the upgrade does not slow down production serving).
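To see why token efficiency changes the economics, it helps to price a task rather than a token. The token counts below are made-up illustrations (chosen to land near the 30–40% effective uplift discussed later), but the per-token rates are the published ones:

```typescript
// Per-task cost at published per-token rates. Token counts are
// hypothetical; only the $/1M prices come from the article.
function taskCostUSD(
  inputTokens: number,
  outputTokens: number,
  inPricePer1M: number,
  outPricePer1M: number,
): number {
  return (inputTokens * inPricePer1M + outputTokens * outPricePer1M) / 1e6;
}

// GPT-5.4 at $2.50/$15, hypothetically emitting 8,000 output tokens:
const cost54 = taskCostUSD(10_000, 8_000, 2.5, 15);  // about $0.145
// GPT-5.5 at $5/$30, hypothetically finishing the same task in 4,800:
const cost55 = taskCostUSD(10_000, 4_800, 5, 30);    // about $0.194
```

Despite per-token prices doubling, the shorter output keeps the per-task uplift in the ~35% range in this example, which is why output-token counts are worth logging before and after a switch.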


ChatGPT 5.5 vs Claude Opus 4.7 vs Gemini 3.1 Pro

Three frontier model families are live in April 2026: OpenAI's GPT-5.5, Anthropic's Claude Opus 4.7, and Google's Gemini 3.1 Pro. Benchmarks alone rarely pick a winner, but this comparison is much clearer if you separate autonomous terminal work, hard GitHub issue resolution, long-context reading, and token economics.

Coding benchmarks side by side

| Benchmark | GPT-5.5 / Codex | Claude Opus 4.7 | Gemini 3.1 Pro | What it means |
|---|---|---|---|---|
| SWE-Bench Pro | 58.6% | 64.3% | 54.2% | Opus 4.7 has the strongest public hard-code-fix score in OpenAI's launch table. |
| Terminal-Bench 2.0 | 82.7% | 69.4% | 68.5% | GPT-5.5 is the stronger terminal and DevOps automation choice. |
| GDPval | 84.9% | 80.3% | 67.3% | GPT-5.5 has the strongest professional knowledge-work signal. |
| BrowseComp | 84.4% | 79.3% | 85.9% | Gemini edges the base model on web research; GPT-5.5 Pro leads at 90.1%. |

The honest takeaway is not "GPT-5.5 wins everything." Claude Opus 4.7 is now the stronger published pick for hard SWE-Bench Pro coding fixes, GPT-5.5 has the cleanest lead for terminal-heavy autonomous work, and Gemini 3.1 Pro stays competitive because it gives teams a lower token bill with a large context window.

Where each family wins in practice

  • GPT-5.5 / Codex — agentic and multi-step work. If the task is "resolve this GitHub issue", "migrate this service from Express to Fastify", or "run the full release checklist", GPT-5.5 Codex is still the cleanest default. The integrated Codex workflow, structured SDK, and top Terminal-Bench number tilt this way.
  • Claude Opus 4.7 — hard code fixes and review-quality output. Opus 4.7 is the model I would now test first when the job is a difficult bug fix, a public SDK change, or code a human will review line by line. It has the strongest SWE-Bench Pro number in the current comparison and Anthropic positions it specifically for sustained coding and agentic work.
  • Gemini 3.1 Pro — cost-sensitive large-context work. Gemini's value is less about winning the coding headline and more about being cheap enough to run across a lot of context. If you are feeding long documents, large repos, or batch analysis jobs through Google infrastructure, the pricing and 1M-context family make it hard to ignore.

One more nuance: Claude Opus 4.7 is no longer just the "expensive quality" option. At $5 input and $25 output per 1M tokens, it matches GPT-5.5's standard input price and undercuts GPT-5.5 on output tokens. The real decision is whether you need Codex's integrated agent loop, Opus's stronger hard-code-fix signal, or Gemini's lower large-context cost.

Pricing comparison

Rough per-token pricing, as of the GPT-5.5 launch:

  • GPT-5.5 (base). $5 / 1M input, $30 / 1M output in the API, with a 1M token context window. In Codex, GPT-5.5 runs with a 400K context window.
  • GPT-5.5 Pro. $30 / 1M input, $180 / 1M output. Use it only when the accuracy gain is worth the 6x output price.
  • GPT-5 / GPT-5.4 (for reference). $2.50 / 1M input, $15 / 1M output — still cheaper than 5.5, still the right call for bulk workloads where 5.5's quality uplift does not justify double the cost.
  • Claude Opus 4.7. $5 / 1M input, $25 / 1M output, with prompt caching and batch processing discounts available.
  • Claude Sonnet 4.6. Mid-tier pricing, still the more economical Claude default for everyday coding if you do not need Opus 4.7.
  • Gemini 3.1 Pro. $2 / 1M input and $12 / 1M output for prompts up to 200K tokens; $4 input and $18 output above 200K. Batch and Flex cut the standard rates in half.

Two practical notes on pricing. First, prompt caching on GPT-5.5 gives roughly 90% input-token savings on repeated context — this is material when you are running Codex against the same repo many times per day. Second, GPT-5.5 genuinely consumes fewer output tokens than 5.4 for the same task, so the effective per-task cost is lower than the per-token price suggests. A realistic estimate: GPT-5.5 costs about 30–40% more per task than 5.4, not the full 100% implied by the input-token sticker price.
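The two pricing wrinkles that actually move bills, the cached-input discount and Gemini's 200K tier break, can be modelled as simple multipliers. This is a back-of-envelope sketch using the rates above; real billing has more nuance (cache-write fees, Batch/Flex tiers) that I am deliberately ignoring here:

```typescript
// Rough request pricing. The 90% cached-input discount and Gemini's
// 200K threshold are modelled naively; treat results as estimates only.
function gpt55CostUSD(inTok: number, outTok: number, cachedFrac = 0): number {
  const freshIn = inTok * (1 - cachedFrac);
  const cachedIn = inTok * cachedFrac * 0.1; // ~90% off cached input tokens
  return ((freshIn + cachedIn) * 5 + outTok * 30) / 1e6;
}

function gemini31ProCostUSD(inTok: number, outTok: number): number {
  // Standard rates step up for prompts above 200K tokens.
  const [inP, outP] = inTok <= 200_000 ? [2, 12] : [4, 18];
  return (inTok * inP + outTok * outP) / 1e6;
}
```

For a repeated Codex run over a 300K-token repo context with 10K output tokens, a 90% cache hit rate drops the GPT-5.5 cost from about $1.80 to about $0.59 per run, which is why caching dominates the repo-loop use case.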

Safety, classifiers, and what rolled forward from 5.2

With GPT-5.2 in December 2025, OpenAI deployed a new set of cyber-safeguards — stricter classifiers for requests that look like exploit development, credential stuffing, offensive reconnaissance, or evasion tooling. With GPT-5.5, those classifiers tighten further. In my own testing, legitimate defensive security work (CTFs, threat modelling, log analysis, sandboxed exploit research on intentionally vulnerable lab VMs) still works, but the model now asks more clarifying questions about authorisation context before proceeding with dual-use requests. If you are doing authorised pentesting, lead with the engagement context — that is the difference between a helpful response and a refusal.


How to pick: a decision tree

You do not have to pick one model for everything. Most engineering teams end up with two or three in rotation, routed by task. A decision tree that has held up through GPT-5.5's first days:

  1. Is the task autonomous coding inside a repo (resolve issue, migrate service, open PR)? → GPT-5.5 via Codex. Use the SDK, cap maxDurationMinutes, review the diff.
  2. Is the task producing code a human will review line-by-line (public SDK, regulated codepath, shared library)? → Claude Opus 4.7 first, Sonnet 4.6 when cost matters more. Better hard-fix signal and strong review-quality output.
  3. Is the task understanding a large, unfamiliar codebase to make one cross-cutting fix? → Gemini 3.1 Pro. Long-context behaviour is still the most forgiving at size.
  4. Is the task high-stakes reasoning (scientific analysis, difficult maths, multi-hop investigation with real consequences)? → GPT-5.5 Pro. Accept the 6x price because being right is the whole point.
  5. Is the task bulk and cost-sensitive (document summarisation, lightweight classification, background pipelines)? → GPT-5.4, Claude Sonnet 4.6, or Gemini 3.1 Pro depending on which API you are already paying for. Do not pay GPT-5.5 prices for work a cheaper model solves equally well.
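Teams that route by task usually end up encoding the tree above somewhere. As a literal sketch, here it is as a TypeScript routing function; the `TaskKind` values and returned model names are this article's shorthand, not API model identifiers:

```typescript
// The decision tree above as code. Names are article shorthand, not
// real API model strings.
type TaskKind =
  | "autonomous-coding"
  | "review-critical-code"
  | "large-codebase-reading"
  | "high-stakes-reasoning"
  | "bulk-cheap";

function pickModel(kind: TaskKind): string {
  switch (kind) {
    case "autonomous-coding":
      return "GPT-5.5 Codex";
    case "review-critical-code":
      return "Claude Opus 4.7";
    case "large-codebase-reading":
      return "Gemini 3.1 Pro";
    case "high-stakes-reasoning":
      return "GPT-5.5 Pro";
    case "bulk-cheap":
      return "GPT-5.4 / Claude Sonnet 4.6 / Gemini 3.1 Pro";
  }
}
```

The exhaustive switch is the point: when a new task category appears, the compiler forces you to decide where it routes instead of silently defaulting to the expensive model.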

If you want the broader AI model picker for non-coding work, my Gemma 4 vs ChatGPT vs Claude vs Copilot and open models vs API models guides cover the rest of the map.

Five things to check before you switch your team to GPT-5.5

Moving your default model is disruptive. Before making the call, run through these five checks.

  1. Benchmark on your own tasks, not OpenAI's. Take the 10 prompts your team runs daily, run them on GPT-5.4 and GPT-5.5, and compare quality, cost, and latency. Public benchmarks are useful for shortlisting; your tasks decide the winner.
  2. Measure token efficiency, not just per-token price. The 30–40% effective cost uplift I mentioned above is an average — your workload may be closer to flat (if you were already brief) or worse (if you frequently hit the reasoning ceiling). Log output tokens per task before and after for a week.
  3. Validate the Codex SDK against your repo layout. The SDK assumes a sensible repo — monorepos, git submodules, and repos with non-standard structures can confuse the sandbox. Run it on a representative service first.
  4. Check your cyber-classifier hit rate. If your team does security work, incident response, or log analysis, a small sample of requests will flag under the tightened classifiers. Make sure the prompts carry authorisation context.
  5. Plan for the 1M context window. A 1M context changes what is worth RAG-ing and what is worth stuffing directly. Re-evaluate your RAG boundary — you may find some pipelines now fit in a single prompt and do not need the retrieval layer at all. My RAG with your own documents tutorial covers the tradeoff.
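For check 2, the analysis is just a ratio over paired per-task logs. Here is a minimal helper, assuming you have matched output-token counts for the same tasks on both models (the log shape is an assumption):

```typescript
// Median per-task output-token ratio (after / before). Assumes the two
// arrays are paired by task and equal in length.
function medianOutputRatio(before: number[], after: number[]): number {
  const ratios = after
    .map((a, i) => a / before[i])
    .sort((x, y) => x - y);
  const mid = Math.floor(ratios.length / 2);
  return ratios.length % 2
    ? ratios[mid]
    : (ratios[mid - 1] + ratios[mid]) / 2;
}
```

A median below 1.0 means the new model is genuinely terser on your workload; combine it with the per-token prices to get your real per-task uplift rather than trusting anyone's average.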

Frequently Asked Questions

What is GPT-5.5 and when was it released?

GPT-5.5 is OpenAI's latest frontier model, released on 23 April 2026 under the internal codename "Spud". OpenAI describes it as the first fully retrained base model since GPT-4.5, meaning it is not a fine-tune or post-training variant of GPT-5 — it is a fresh pre-train. It rolled out the same day to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex, with GPT-5.5 Pro additionally available to Pro, Business, and Enterprise tiers.

Is GPT-5.5 for Codex a separate model or the same model?

GPT-5.5 in Codex is the same underlying base model with a Codex-specific post-training pass layered on top. The base GPT-5.5 already improved on agentic coding versus GPT-5.4; the Codex variant adds coding-specific reinforcement, a longer autonomous-action budget (up to 7+ hours for complex refactors), and tighter integration with the Codex SDK, VS Code extension, Terminal CLI, and cloud async environments. You do not pick between the two manually — Codex routes automatically to the Codex-tuned variant when you are inside Codex surfaces.

How much does GPT-5.5 cost to use through the API?

GPT-5.5 is available in the Responses and Chat Completions APIs at $5 per 1 million input tokens and $30 per 1 million output tokens, with a 1 million token context window. GPT-5.5 Pro costs $30 per 1M input and $180 per 1M output — roughly 6x the base model price, for workloads where maximum accuracy matters more than cost. In Codex usage, GPT-5.5 consumes fewer tokens than GPT-5.4 for the same tasks, so the effective cost per completed task is lower even at identical token prices.

How does GPT-5.5 compare to Claude Opus 4.7 and Gemini 3.1 Pro for coding?

OpenAI's launch comparison puts GPT-5.5 ahead on Terminal-Bench 2.0 at 82.7%, while Claude Opus 4.7 leads the same table on SWE-Bench Pro at 64.3%. Gemini 3.1 Pro trails those two on the coding rows but is much cheaper at standard API rates and remains attractive for large-context, cost-sensitive work. Short answer: GPT-5.5 Codex for terminal-heavy autonomous coding, Claude Opus 4.7 for hard code fixes and review-quality output, Gemini 3.1 Pro for budget-aware large-context workflows.

Do I need GPT-5.5 Pro, or is the base GPT-5.5 enough?

Base GPT-5.5 is enough for almost all coding, knowledge work, and agentic tasks — it is the model that scored 82.7% on Terminal-Bench 2.0 and 84.9% on GDPval. Reach for GPT-5.5 Pro only when the task is high-stakes scientific reasoning, difficult mathematics, or a multi-step agentic job where one wrong step costs real time or money. At 6x the output price, Pro is a poor default. Start with base GPT-5.5, promote to Pro only when you can point at a specific failure it would have prevented.


Want to use AI tools more effectively?

My courses cover practical AI workflows, from spreadsheet formulas to app development, with real projects and honest tool comparisons.

Browse all courses