On 23 April 2026, OpenAI shipped GPT-5.5 — internally codenamed Spud — and made it available the same day inside ChatGPT and Codex. Unlike the 5.2, 5.3, and 5.4 releases that preceded it, GPT-5.5 is not a post-training refinement of GPT-5; it is the first fully retrained base model since GPT-4.5. That distinction matters because the improvements are broad rather than task-specific: agentic coding, computer use, knowledge work, and early scientific research all move forward together.
This tutorial is written for someone who wants a single, honest reference on GPT-5.5 and its Codex deployment. By the end you will know what changed versus GPT-5.4, what the Codex variant actually does differently, how the published benchmark numbers compare against Claude Opus 4.6 and Gemini 3.1 Pro, where pricing lands, and which of the three families deserves your default slot for different kinds of work. I have a companion guide on GitHub Copilot Agent Mode in VS Code if you want the IDE-side story, and a ChatGPT vs Claude vs Copilot vs Gemini for Excel comparison for the non-coding side.
What GPT-5.5 actually is
GPT-5.5 is positioned by OpenAI as their "smartest and most intuitive to use model" at release. The framing in the announcement is worth paying attention to: it is sold less as a reasoning step-change and more as a faster, sharper thinker for fewer tokens. The model matches GPT-5.4's per-token latency in production serving while producing higher-quality output and consuming fewer output tokens to complete the same task. That combination — equal latency, higher quality, lower token count — is what makes the upgrade economically meaningful rather than just optically impressive.
Three structural things changed with this release:
- Fresh base model, not a fine-tune. GPT-5.2, 5.3, and 5.4 were all post-training iterations on the GPT-5 base. GPT-5.5 is a new pre-train — the first since GPT-4.5 in 2025. In practice this shows up as deeper code-semantics understanding and better multi-file reasoning rather than just sharper chat responses.
- 1 million token context window at base tier. The Responses and Chat Completions API exposes a 1M token context for both GPT-5.5 and GPT-5.5 Pro. For the work where context matters most — reviewing a whole microservice, summarising a codebase, tracing an incident across logs — this closes the gap with Gemini's long-context advantage.
- Agentic capability is the headline. OpenAI explicitly ties GPT-5.5 to "a new way of getting work done on a computer" — operating software, researching online, running multi-step workflows autonomously. This is why the Codex and ChatGPT rollouts landed on the same day rather than Codex catching up weeks later.
GPT-5.5 Pro: when the bigger bill is worth it
GPT-5.5 ships with a sibling model, GPT-5.5 Pro, priced at $30 per 1M input tokens and $180 per 1M output tokens — 6x the base model on both input and output. Pro is not a different architecture; it is the same model running with more reasoning compute per request, tuned for maximum accuracy on hard problems.
Use GPT-5.5 Pro when the task is scientific reasoning, non-trivial mathematics, or a long agentic job where a single mis-step is expensive (a data migration script, a refactor touching payment code, an experimental analysis you will write up and cite). Do not use Pro as your default chat model — the economics do not work, and the base GPT-5.5 already clears the bar for almost all knowledge work. A clean rule of thumb: if you cannot articulate what Pro would have caught that base missed, stay on base.
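The 6x gap compounds quickly on agentic jobs, so it is worth putting numbers on it. Here is a minimal cost sketch using the per-token prices quoted in this article; the 50K-in / 10K-out task size is an illustrative assumption, not a published figure:

```typescript
// Per-1M-token prices from the article: base $5 in / $30 out, Pro $30 in / $180 out.
interface Price {
  inputPerM: number;
  outputPerM: number;
}

const BASE: Price = { inputPerM: 5, outputPerM: 30 };
const PRO: Price = { inputPerM: 30, outputPerM: 180 };

// Cost in dollars for one request with the given token counts.
function requestCost(p: Price, inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1e6) * p.inputPerM + (outputTokens / 1e6) * p.outputPerM;
}

// A hypothetical 50K-in / 10K-out agentic task:
const baseCost = requestCost(BASE, 50_000, 10_000); // $0.55
const proCost = requestCost(PRO, 50_000, 10_000);   // $3.30
```

At one task this is pocket change; at a few hundred Codex runs per day the delta is a real line item, which is why Pro belongs behind an explicit escalation rule rather than in the default slot.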
GPT-5.5 in Codex: what changes
Codex is the coding-specific surface for GPT-5.5. When you use Codex — through the VS Code extension, the Terminal CLI, the Codex SDK, or the cloud async environment — you get a model tuned on top of the GPT-5.5 base for autonomous code generation, multi-file reasoning, and iterative debugging. The loop is structured: natural-language task → plan → generate code → run tests → review and iterate. Changes land as auditable diffs or pull requests rather than direct commits to main, and shell commands, file operations, and package installs are chained inside a sandboxed cloud environment.
Three things the Codex variant does that plain GPT-5.5 does not do as aggressively:
- Dynamic reasoning budget based on task scope. A rename or typo fix completes in seconds. An API endpoint with tests takes 5–15 minutes. A multi-file feature takes 30–90 minutes. A framework migration can run for 7+ hours autonomously. The model decides the budget — you just set a maxDurationMinutes ceiling.
- First-class Codex SDK. The @openai/codex-sdk package wraps the agent loop cleanly: you pass an API key, a repo path, and a task description; you get back a structured result with filesChanged, testsRun, and a diff.
- Multi-agent orchestration. A parent Codex agent can spawn subagents scoped to specific directories (for example, backend, frontend, tests), each running with its own task scope, and merge the results behind a validation pass. This is the pattern NVIDIA disclosed is used internally across 10,000+ employees.
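The orchestration pattern in the last bullet starts with a routing step: deciding which subagent owns which files. A minimal sketch of that partitioning logic, with directory names (src/server, src/ui, tests) chosen purely for illustration:

```typescript
// Hypothetical routing helper: assign changed files to subagent scopes,
// mirroring the parent/subagent split described above.
type Scope = "backend" | "frontend" | "tests" | "unscoped";

function scopeFor(path: string): Scope {
  if (path.startsWith("tests/") || path.includes(".test.")) return "tests";
  if (path.startsWith("src/server/")) return "backend";
  if (path.startsWith("src/ui/")) return "frontend";
  return "unscoped";
}

// Partition a task's file list so each subagent only sees its own slice.
function partition(paths: string[]): Map<Scope, string[]> {
  const out = new Map<Scope, string[]>();
  for (const p of paths) {
    const s = scopeFor(p);
    out.set(s, [...(out.get(s) ?? []), p]);
  }
  return out;
}
```

The real value of the pattern is the validation pass after the merge — scoping keeps subagents from stepping on each other, but only the parent's test run proves the merged result is coherent.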
Minimal Codex SDK example
Here is the smallest useful Codex setup. It takes a natural-language task and returns a diff you can inspect before applying:
import { CodexAgent } from "@openai/codex-sdk";
const agent = new CodexAgent({
apiKey: process.env.OPENAI_API_KEY,
model: "gpt-5-codex",
repoPath: "./my-project",
});
const result = await agent.run({
task: "Add input validation to POST /api/users. " +
"Validate email format, require non-empty name, " +
"return 400 on invalid input. Include unit tests.",
maxDurationMinutes: 30,
});
console.log(result.filesChanged); // ["src/api/users.ts", "tests/users.test.ts"]
console.log(result.testsRun); // { passed: 12, failed: 0 }
console.log(result.diff); // full change diff, review before merging
In practice you wrap this in a CI job so labelled GitHub issues auto-open a draft pull request. I have a longer walkthrough at How to Build an AI Agent With File Search and Tools — the patterns transfer directly.
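The CI wiring is mostly a filtering step before the SDK call: pick the issues that carry a trigger label and turn each into a task string. A sketch of that step — the "codex" label name and the Issue shape are assumptions, not a GitHub API contract:

```typescript
// Minimal issue shape for this sketch; a real CI job would pull these
// fields from the GitHub API payload.
interface Issue {
  number: number;
  title: string;
  labels: string[];
}

// Select labelled issues and render each as a natural-language Codex task.
function codexTasks(issues: Issue[], trigger = "codex"): string[] {
  return issues
    .filter((i) => i.labels.includes(trigger))
    .map((i) => `Resolve issue #${i.number}: ${i.title}`);
}
```

Each returned string becomes the task field of an agent.run call, and the resulting diff opens as a draft pull request rather than merging directly.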
Where Codex runs
Four deployment environments, each for a different workflow:
- VS Code extension. Interactive edits, pair-programming style. Seconds to minutes per action.
- Terminal CLI. Scripted workflows, CI/CD, local batch jobs. Minutes to hours.
- Cloud async. Long refactors and unattended work that you submit and review later. Up to 7+ hours per task, with Slack notifications when done.
- Codex SDK. Programmatic control from your own service for bots, internal tools, or bespoke agents.
For a fuller tour of the VS Code side specifically, see my Copilot Agent Mode in VS Code tutorial — the Codex extension behaves similarly in the IDE but without the Microsoft 365 policy layer.
The headline GPT-5.5 benchmarks
OpenAI published a set of concrete numbers at launch. The ones worth knowing:
- GDPval: 84.9%. A test of agents producing well-specified knowledge work across 44 occupations. This is the headline number for work replacement — how well the model produces real deliverables in categories like finance, law, consulting, and engineering write-ups.
- OSWorld-Verified: 78.7%. Measures whether a model can operate a real computer environment autonomously — clicking, typing, reading UI state, completing multi-step tasks in actual applications. This is the computer use benchmark.
- Tau2-Bench Telecom: 98.0%. A customer-service workflow benchmark on complex telecom tasks, measured without prompt tuning. Near-ceiling performance here is what allows OpenAI to claim agentic readiness for real support workflows.
- Terminal-Bench 2.0: 82.7%. Complex command-line workflows and DevOps automation. A state-of-the-art score at release.
- SWE-Bench Pro: 58.6%. Real GitHub issue resolution across Python, JavaScript/TypeScript, Java, and Go. This is the Codex-variant number; it measures practical coding rather than isolated problem-solving.
Two numbers that do not appear on the public slide but matter just as much: token efficiency (GPT-5.5 uses fewer output tokens than GPT-5.4 for identical tasks, which lowers per-task cost at identical per-token pricing), and per-token latency (matched to GPT-5.4, so the upgrade does not slow down production serving).
GPT-5.5 vs Claude Opus 4.6 vs Gemini 3.1 Pro
Three frontier model families are live in April 2026: OpenAI's GPT-5.5, Anthropic's Claude Opus 4.6 (with Sonnet 4.6 as the cheaper sibling), and Google's Gemini 3.1 Pro. Benchmarks alone rarely pick a winner, but they do frame the tradeoffs.
Coding benchmarks side by side
- SWE-bench Verified (real GitHub issues). Claude Opus 4.6 at 80.8%, Gemini 3.1 Pro at 80.6%, Claude Sonnet 4.6 at 79.6%. The three published numbers sit within roughly a point of each other on this benchmark.
- SWE-Bench Pro (harder real-world variant, multi-language). GPT-5.5 Codex at 58.6% — the only frontier model with a public number on this newer, harder benchmark at the time of writing.
- Terminal-Bench 2.0. GPT-5.5 at 82.7%, state-of-the-art at release. Anthropic and Google have not published directly comparable numbers on this specific benchmark.
The quick takeaway: on the traditional SWE-bench Verified number the three models are effectively tied. On agentic and terminal-based coding — where you are running an autonomous loop across shells, tools, and files — GPT-5.5 Codex has the cleanest public lead.
Where each family wins in practice
- GPT-5.5 / Codex — agentic and multi-step work. If the task is "resolve this GitHub issue", "migrate this service from Express to Fastify", or "run the full release checklist", GPT-5.5 Codex is the current default. The 7+ hour autonomous budget, structured SDK, and top Terminal-Bench number tilt this way.
- Claude Opus 4.6 — review-quality code and reasoning. Opus produces code that is cleaner to read and more consistently commented. For work where a human has to review what the AI wrote (regulated industries, public SDKs, financial code), Opus is still my first reach. It is also generally stronger on long-context coherence — it holds a plot across 200K+ tokens more reliably.
- Gemini 3.1 Pro — sprawling legacy codebases. Gemini's effective long-context advantage shows up most on SWE-bench when the fix requires reading many files simultaneously to understand the change. If you are feeding a whole enterprise monorepo and asking for a cross-cutting fix, Gemini's context handling tends to be the most forgiving.
On cost, Claude Opus 4.6 is the most expensive of the three — you are paying roughly 2.5x Gemini 3.1 Pro for 0.2 points more on SWE-bench Verified. This is why many teams run Claude Sonnet 4.6 as the default (it handles 80%+ of coding tasks at near-Opus quality) and only escalate to Opus on hard reviews.
Pricing comparison
Rough per-token pricing, as of the GPT-5.5 launch:
- GPT-5.5 (base). $5 / 1M input, $30 / 1M output. 1M token context.
- GPT-5.5 Pro. $30 / 1M input, $180 / 1M output. 1M token context. Use sparingly.
- GPT-5 / GPT-5.4 (for reference). $2.50 / 1M input, $15 / 1M output — still cheaper than 5.5, still the right call for bulk workloads where 5.5's quality uplift does not justify double the cost.
- Claude Opus 4.6. Premium pricing; positioned above GPT-5.5 on input, similar on output.
- Claude Sonnet 4.6. Mid-tier pricing, often the economic winner for coding at frontier quality.
- Gemini 3.1 Pro. The cheapest of the three frontier families for the quality level, which is why it is often the default for large-scale batch jobs.
Two practical notes on pricing. First, prompt caching on GPT-5.5 gives roughly 90% input-token savings on repeated context — this is material when you are running Codex against the same repo many times per day. Second, GPT-5.5 genuinely consumes fewer output tokens than 5.4 for the same task, so the effective per-task cost is lower than the per-token price suggests. A realistic estimate: GPT-5.5 costs about 30–40% more per task than 5.4, not the full 100% implied by the input-token sticker price.
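Those two effects — cached input and lower output-token counts — are easy to fold into a per-task cost estimate. A sketch using the article's GPT-5.5 prices; the 75% cached fraction and the task sizes are illustrative assumptions:

```typescript
// Effective per-task cost with prompt caching applied to repeated context.
function effectiveTaskCost(opts: {
  inputTokens: number;
  cachedFraction: number;  // share of input served from the prompt cache
  outputTokens: number;
  inputPerM: number;       // $ per 1M input tokens
  outputPerM: number;      // $ per 1M output tokens
  cacheDiscount?: number;  // 0.9 = 90% off cached input, per the article
}): number {
  const { inputTokens, cachedFraction, outputTokens, inputPerM, outputPerM } = opts;
  const discount = opts.cacheDiscount ?? 0.9;
  const cached = inputTokens * cachedFraction;
  const fresh = inputTokens - cached;
  const inputCost =
    (fresh / 1e6) * inputPerM + (cached / 1e6) * inputPerM * (1 - discount);
  return inputCost + (outputTokens / 1e6) * outputPerM;
}

// 200K input with 75% cached, 8K output, at $5 / $30:
// 50K fresh = $0.25, 150K cached = $0.075, output = $0.24 → $0.565 per task.
const cost = effectiveTaskCost({
  inputTokens: 200_000,
  cachedFraction: 0.75,
  outputTokens: 8_000,
  inputPerM: 5,
  outputPerM: 30,
});
```

Run the same arithmetic with your own logged token counts on 5.4 and 5.5 before trusting any headline uplift figure, including mine.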
Safety, classifiers, and what rolled forward from 5.2
With GPT-5.2 in December 2025, OpenAI deployed a new set of cyber-safeguards — stricter classifiers for requests that look like exploit development, credential stuffing, offensive reconnaissance, or evasion tooling. With GPT-5.5, those classifiers tighten further. In my own testing, legitimate defensive security work (CTFs, threat modelling, log analysis, sandboxed exploit research on intentionally vulnerable lab VMs) still works, but the model now asks more clarifying questions about authorisation context before proceeding with dual-use requests. If you are doing authorised pentesting, lead with the engagement context — that is the difference between a helpful response and a refusal.
How to pick: a decision tree
You do not have to pick one model for everything. Most engineering teams end up with two or three in rotation, routed by task. A decision tree that has held up through GPT-5.5's first days:
- Is the task autonomous coding inside a repo (resolve issue, migrate service, open PR)? → GPT-5.5 via Codex. Use the SDK, cap maxDurationMinutes, review the diff.
- Is the task producing code a human will review line-by-line (public SDK, regulated codepath, shared library)? → Claude Opus 4.6 or Sonnet 4.6. Better code readability and comment density.
- Is the task understanding a large, unfamiliar codebase to make one cross-cutting fix? → Gemini 3.1 Pro. Long-context behaviour is still the most forgiving at size.
- Is the task high-stakes reasoning (scientific analysis, difficult maths, multi-hop investigation with real consequences)? → GPT-5.5 Pro. Accept the 6x price because being right is the whole point.
- Is the task bulk and cost-sensitive (document summarisation, lightweight classification, background pipelines)? → GPT-5.4, Claude Sonnet 4.6, or Gemini 3.1 Pro depending on which API you are already paying for. Do not pay GPT-5.5 prices for work a cheaper model solves equally well.
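If your team routes requests programmatically, the tree above collapses to a literal switch. The task categories are my own labels for the five branches, and the model IDs follow the names used in this article rather than any confirmed API identifiers:

```typescript
// The decision tree above as a routing function.
type Task =
  | "autonomous-coding"
  | "review-quality-code"
  | "large-codebase-fix"
  | "high-stakes-reasoning"
  | "bulk-cost-sensitive";

function pickModel(task: Task): string {
  switch (task) {
    case "autonomous-coding":     return "gpt-5.5-codex";
    case "review-quality-code":   return "claude-opus-4.6";
    case "large-codebase-fix":    return "gemini-3.1-pro";
    case "high-stakes-reasoning": return "gpt-5.5-pro";
    case "bulk-cost-sensitive":   return "gpt-5.4"; // or Sonnet 4.6 / Gemini 3.1 Pro
  }
}
```

Making the routing explicit also gives you one place to log which branch fired, which is the data you need when you revisit the defaults next release cycle.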
If you want the broader AI model picker for non-coding work, my Gemma 4 vs ChatGPT vs Claude vs Copilot and open models vs API models guides cover the rest of the map.
Five things to check before you switch your team to GPT-5.5
Moving your default model is disruptive. Before making the call, run through these five checks.
- Benchmark on your own tasks, not OpenAI's. Take the 10 prompts your team runs daily, run them on GPT-5.4 and GPT-5.5, and compare quality, cost, and latency. Public benchmarks are useful for shortlisting; your tasks decide the winner.
- Measure token efficiency, not just per-token price. The 30–40% effective cost uplift I mentioned above is an average — your workload may be closer to flat (if you were already brief) or worse (if you frequently hit the reasoning ceiling). Log output tokens per task before and after for a week.
- Validate the Codex SDK against your repo layout. The SDK assumes a sensible repo — monorepos, git submodules, and repos with non-standard structures can confuse the sandbox. Run it on a representative service first.
- Check your cyber-classifier hit rate. If your team does security work, incident response, or log analysis, a small sample of requests will flag under the tightened classifiers. Make sure the prompts carry authorisation context.
- Plan for the 1M context window. A 1M context changes what is worth RAG-ing and what is worth stuffing directly. Re-evaluate your RAG boundary — you may find some pipelines now fit in a single prompt and do not need the retrieval layer at all. My RAG with your own documents tutorial covers the tradeoff.
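For the RAG-boundary check in point 5, a back-of-envelope estimator is usually enough to triage pipelines. This sketch uses the common ~4 characters-per-token heuristic — an approximation, not a tokenizer, so treat borderline results as "measure properly":

```typescript
// Rough check: does a document set fit in the 1M-token window?
const CHARS_PER_TOKEN = 4;       // heuristic average for English text
const CONTEXT_WINDOW = 1_000_000;

function fitsInContext(docs: string[], reservedForOutput = 50_000): boolean {
  const totalChars = docs.reduce((n, d) => n + d.length, 0);
  const estTokens = Math.ceil(totalChars / CHARS_PER_TOKEN);
  return estTokens <= CONTEXT_WINDOW - reservedForOutput;
}
```

Anything that clears this check comfortably is a candidate for dropping the retrieval layer; anything close to the line should keep RAG, since real tokenizer counts vary by content and language.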
Frequently Asked Questions
What is GPT-5.5 and when was it released?
GPT-5.5 is OpenAI's latest frontier model, released on 23 April 2026 under the internal codename "Spud". OpenAI describes it as the first fully retrained base model since GPT-4.5, meaning it is not a fine-tune or post-training variant of GPT-5 — it is a fresh pre-train. It rolled out the same day to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex, with GPT-5.5 Pro additionally available to Pro, Business, and Enterprise tiers.
Is GPT-5.5 for Codex a separate model or the same model?
GPT-5.5 in Codex is the same underlying base model with a Codex-specific post-training pass layered on top. The base GPT-5.5 already improved on agentic coding versus GPT-5.4; the Codex variant adds coding-specific reinforcement, a longer autonomous-action budget (up to 7+ hours for complex refactors), and tighter integration with the Codex SDK, VS Code extension, Terminal CLI, and cloud async environments. You do not pick between the two manually — Codex routes automatically to the Codex-tuned variant when you are inside Codex surfaces.
How much does GPT-5.5 cost to use through the API?
Per the OpenAI announcement, GPT-5.5 is available in the Responses and Chat Completions APIs at $5 per 1 million input tokens and $30 per 1 million output tokens, with a 1 million token context window. GPT-5.5 Pro costs $30 per 1M input and $180 per 1M output — 6x the base model price, for workloads where maximum accuracy matters more than cost. In Codex usage, GPT-5.5 consumes fewer tokens than GPT-5.4 for the same tasks, so the effective cost per completed task is lower even at identical token prices.
How does GPT-5.5 compare to Claude Opus 4.6 and Gemini 3.1 Pro for coding?
On headline coding numbers, GPT-5.5 leads on agentic workflows — 82.7% on Terminal-Bench 2.0 and 58.6% on SWE-Bench Pro through the Codex variant. Claude Opus 4.6 sits at 80.8% on SWE-bench Verified and is frequently preferred for code readability and long-context coherence. Gemini 3.1 Pro scores 80.6% on SWE-bench Verified and wins when a fix requires understanding many files at once, thanks to its larger effective context. Short answer: GPT-5.5 for autonomous multi-step work in Codex, Claude for review-quality code, Gemini for sprawling legacy repos.
Do I need GPT-5.5 Pro, or is the base GPT-5.5 enough?
Base GPT-5.5 is enough for almost all coding, knowledge work, and agentic tasks — it is the model that scored 82.7% on Terminal-Bench 2.0 and 84.9% on GDPval. Reach for GPT-5.5 Pro only when the task is high-stakes scientific reasoning, difficult mathematics, or a multi-step agentic job where one wrong step costs real time or money. At 6x the output price, Pro is a poor default. Start with base GPT-5.5, promote to Pro only when you can point at a specific failure it would have prevented.
Sources and Further Reading
- Introducing GPT-5.5 — OpenAI (23 April 2026)
- OpenAI releases GPT-5.5, bringing company one step closer to an AI 'super app' — TechCrunch
- OpenAI's New GPT-5.5 Powers Codex on NVIDIA Infrastructure — NVIDIA Blog
- OpenAI launches GPT-5.5 just weeks after GPT-5.4 as AI race accelerates — Fortune
- GPT-5.5 is here! Available in Codex and ChatGPT today — OpenAI Developer Community
Related Tutorials
- How to Use GitHub Copilot Agent Mode in VS Code: Autonomous Coding in 2026 — the IDE-side story, with Codex-adjacent workflows.
- Claude Code in VS Code: Extension Setup + Hybrid Workflow with Copilot (2026) — what Claude's equivalent looks like in-editor.
- ChatGPT vs Claude vs Copilot vs Gemini for Excel in 2026 — non-coding comparison for the same families.
- How to Build an AI Agent With File Search and Tools — SDK patterns that transfer directly to Codex.
- How to Cut AI API Costs With Caching and Routing — the cost-control layer that makes GPT-5.5 Pro economical.
- How to Choose Between Open Models and API Models — the bigger decision framework around this pick.
Want to use AI tools more effectively?
My courses cover practical AI workflows, from spreadsheet formulas to app development, with real projects and honest tool comparisons.
Browse all courses