What changed between this model and its predecessor in the 4o family?

Three major leaps: the context window expanded from 128K to 1.0M tokens (8x), instruction following improved significantly for complex multi-constraint prompts, and coding benchmarks rose across generation, review, and refactoring tasks. Cost dropped relative to GPT-4o.

How does the 75% prompt caching discount work with the context of 1.0M tokens?

Cached input tokens, from repeated system prompts, shared few-shot examples, or persistent context, are billed at 75% below the standard input rate. With a window of 1.0M tokens, caching a large system prompt or reference corpus across requests yields substantial savings.

Is GPT-4.1 mini a distilled version of full GPT-4.1?

OpenAI describes it as a separate model in the GPT-4.1 family, not a direct distillation. It was trained to match GPT-4o-level intelligence at lower compute requirements while sharing the GPT-4.1 family's improvements in coding and instruction following.

Can GPT-4.1 mini handle an entire codebase in one request?

The context window of 1.0M tokens accommodates most single-repository codebases. For retrieval accuracy across the full range, the GPT-4.1 family maintains strong performance even at extreme context lengths, an area where previous-generation models often degraded.

What latency improvement should I expect?

GPT-4.1 mini delivers nearly half the latency compared to GPT-4o. See live throughput and time-to-first-token metrics on this page for current measured performance.

When should I use full GPT-4.1 instead of mini?

When the task demands the absolute highest accuracy, particularly on complex coding challenges, nuanced multi-step instructions, or workloads where the quality gap between mini and full is measurable and consequential for your application.

GPT-4.1 mini

GPT-4.1 mini delivers GPT-4o-class intelligence at reduced cost with nearly half the latency, making it a cost-performance option in the GPT-4.1 family for high-volume production workloads.

File InputTool UseVision (Image)Implicit Caching

index.ts

import { streamText } from 'ai'

const result = streamText({
  model: 'openai/gpt-4.1-mini',
  prompt: 'Why is the sky blue?'
})

Overview Playground About Providers Throughput Latency Uptime Status Similar FAQ

Playground

Try out GPT-4.1 mini by OpenAI. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.

Provider

Context	Latency	Throughput	Input	Output	Cache	Web Search	Per Query	Capabilities	ZDR	No Training	Release Date

Legal:Terms

•

Privacy

0.7s

$0.40/M

$1.60/M

Read:$0.1/M

Write:—

$14/K

+ input costs

—

05/14/2025

Legal:Terms

•

Privacy

0.6s

92tps

$0.40/M

$1.60/M

Read:$0.1/M

Write:—

$10.00/K

+ input costs

—

05/14/2025

More models by OpenAI

Model

Context	Latency	Throughput	Input	Output	Cache	Web Search	Per Query	Capabilities	Providers	ZDR	No Training	Release Date

0.7s

119tps

$5.00/M

$30.00/M

Read:

$0.5/M

Write:

—

$10.00/K

+ input costs

—

04/24/2026

400K

1.2s

272tps

$0.75/M

$4.50/M

Read:$0.07/M

Write:—

$10.00/K

+ input costs

—

03/17/2026

400K

0.5s

21tps

$0.20/M

$1.25/M

Read:$0.02/M

Write:—

$10.00/K

+ input costs

—

03/17/2026

1.1M

1.9s

79tps

$2.50/M

$15.00/M

Read:

$0.25/M

Write:

—

$10.00/K

+ input costs

—

03/05/2026

131K

0.1s

330tps

$0.35/M

$0.75/M

Read:$0.25/M

Write:—

—

08/05/2025

128K

0.7s

87tps

$0.15/M

$0.60/M

Read:$0.07/M

Write:—

$14/K

+ input costs

—

07/18/2024

About GPT-4.1 mini

GPT-4.1 mini launched on May 14, 2025 as the middle tier of the GPT-4.1 family. Three advances separate it from its predecessor.

First, the context window expanded from 128K to 1.0M tokens, an 8x increase. An entire codebase, a full conversation history spanning days, or a collection of legal documents all fit in a single request. Combined with the 75% prompt caching discount available across the GPT-4.1 family, long-context workflows that reuse system prompts become very affordable.

Second, instruction following improved materially. OpenAI trained the GPT-4.1 family with a focus on adherence to complex, multi-constraint prompts. For developers building structured pipelines where the model must follow formatting rules, respect output schemas, and handle edge cases in system instructions, this reduces debugging time and increases reliability.

Third, coding capability stepped up. The GPT-4.1 family brought measurable gains on code generation, review, and refactoring benchmarks compared to the GPT-4o generation. GPT-4.1 mini inherits those gains, making it capable enough for code assistance tasks that previously required a full-size model.

The result: GPT-4o-class intelligence at lower cost and nearly half the latency. For most production workloads, GPT-4.1 mini is the right choice.

What To Consider When Choosing a Provider

Configuration: Low latency and a context window of 1.0M tokens make GPT-4.1 mini unusually versatile. It can stream responses in real-time chat while also handling batch jobs that load entire codebases into context. Those two patterns rarely coexist in a single model at this price.
Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.

When to Use GPT-4.1 mini

Best For

Production chat interfaces: Streaming products where reduced latency and cost directly improve unit economics
Long-context workloads: Codebase analysis, document comparison, and extended conversation histories that use the window of 1.0M tokens
Strict output formatting: Pipelines where improved instruction following reduces error rates
Agentic loops: Many sequential calls where both speed and cost per call compound over the session
Code assistance: Generation tasks that benefit from the GPT-4.1 family's coding improvements

Consider Alternatives When

Maximum coding accuracy: Full GPT-4.1 is the stronger choice when top-tier instruction adherence is required
Lightweight tasks: GPT-4.1 nano handles classification, routing, or simple extraction at even lower cost
STEM reasoning dominant: Dedicated reasoning models may outperform on complex math or science workloads

Conclusion

GPT-4.1 mini combines an 8x context expansion, stronger instruction following, improved coding, and reduced cost relative to GPT-4o. Together, these changes make it the default model for production traffic in the GPT-4.1 family.

Frequently Asked Questions

What changed between this model and its predecessor in the 4o family?
Three major leaps: the context window expanded from 128K to 1.0M tokens (8x), instruction following improved significantly for complex multi-constraint prompts, and coding benchmarks rose across generation, review, and refactoring tasks. Cost dropped relative to GPT-4o.
How does the 75% prompt caching discount work with the context of 1.0M tokens?
Cached input tokens, from repeated system prompts, shared few-shot examples, or persistent context, are billed at 75% below the standard input rate. With a window of 1.0M tokens, caching a large system prompt or reference corpus across requests yields substantial savings.
Is GPT-4.1 mini a distilled version of full GPT-4.1?
OpenAI describes it as a separate model in the GPT-4.1 family, not a direct distillation. It was trained to match GPT-4o-level intelligence at lower compute requirements while sharing the GPT-4.1 family's improvements in coding and instruction following.
Can GPT-4.1 mini handle an entire codebase in one request?
The context window of 1.0M tokens accommodates most single-repository codebases. For retrieval accuracy across the full range, the GPT-4.1 family maintains strong performance even at extreme context lengths, an area where previous-generation models often degraded.
What latency improvement should I expect?
GPT-4.1 mini delivers nearly half the latency compared to GPT-4o. See live throughput and time-to-first-token metrics on this page for current measured performance.
When should I use full GPT-4.1 instead of mini?
When the task demands the absolute highest accuracy, particularly on complex coding challenges, nuanced multi-step instructions, or workloads where the quality gap between mini and full is measurable and consequential for your application.

AI Cloud

Core Platform

Security

Company

Learn

Open Source

Use Cases

Tools

Users

GPT-4.1 mini

Playground

Providers

More models by OpenAI

About GPT-4.1 mini

What To Consider When Choosing a Provider

When to Use GPT-4.1 mini

Best For

Consider Alternatives When

Conclusion

Frequently Asked Questions