Build realtime voice agents with GPT Realtime 2

Design low-latency speech-to-speech assistants, live translation workflows, streaming transcription, and tool-enabled voice experiences for production teams.

Independent builder-focused workspace for GPT Realtime 2, WebRTC, tools, and voice operations.

Voice Studio

Voice generation for narration, dialogue, and transcripts

Generate narration, dialogue, or transcriptions with AI voice tools.

Realtime voice agent planning

GPT Realtime 2 is for teams planning low-latency voice agents with the OpenAI Realtime API

GPT Realtime 2 is an independent workspace for mapping realtime speech-to-speech agents, browser WebRTC sessions, server-side WebSocket audio, streaming transcription, live translation, tool calls, and production usage controls before a team ships a voice workflow.

Last updated: May 10, 2026

Key takeaways

Use WebRTC when a browser or mobile client needs responsive microphone input and audio output.
Use WebSocket when a backend service owns audio routing, recording, telephony, or policy enforcement.
Treat tool calls, escalation, monitoring, and cost controls as launch requirements, not afterthoughts.

Architecture fit table

Realtime voice architecture choices for GPT Realtime 2 projects
Workflow signal	Recommended setup	Why it matters
Browser voice assistant	WebRTC session with short-lived client access	Keeps microphone and playback latency low while avoiding long-lived secrets in the client.
Call center or telephony path	Server-controlled realtime audio with explicit handoff rules	Lets the backend manage routing, logs, compliance review, and human escalation.
Live translation or transcription	Separate session settings, transcript review, and usage budget	Keeps language handling, quality checks, and cost forecasting visible to operators.
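
The transport choice in the takeaways and table above can be sketched as a small decision function. This is an illustrative sketch only: the `Workflow` fields, the `choose_transport` name, and the rule that backend ownership of audio takes priority are assumptions made for the example, not part of any API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    # Hypothetical planning flags, named for this example only.
    client_captures_audio: bool  # browser or mobile microphone and playback
    backend_owns_audio: bool     # routing, recording, telephony, or policy enforcement


def choose_transport(workflow: Workflow) -> str:
    """Return "websocket" or "webrtc" following the takeaways above."""
    # Assumption: when the backend must own the audio path, that requirement
    # wins even if a browser client is also involved.
    if workflow.backend_owns_audio:
        return "websocket"
    if workflow.client_captures_audio:
        return "webrtc"
    # Nothing client-side to optimize for, so default to the server path.
    return "websocket"
```

For example, a browser assistant with no backend audio requirements maps to WebRTC, while a telephony path maps to a server-controlled WebSocket stream.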

Voice-agent capabilities for serious workflows

These capabilities cover what teams actually need: responsive conversations, controlled sessions, useful transcripts, and tool actions that fit existing systems.

Speech-to-speech agents

Natural realtime conversations for support, coaching, intake, and guided operations.

Streaming transcription

Capture spoken sessions as structured text for review, search, QA, and follow-up.

Live translation workflows

Route multilingual conversations through a voice experience that stays usable in the moment.

Tool-enabled conversations

Let voice agents check records, create tickets, update systems, or trigger approved actions.

Session control

Tune instructions, voice behavior, context, and handoff rules for repeatable outcomes.

Usage visibility

Plan around session length, model choice, tools, and context so teams can budget confidently.
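
A rough budget forecast can be sketched from session length and tool activity. The function name, the parameters, and the idea of a flat per-minute and per-tool-call credit rate are assumptions for illustration; real pricing depends on the model and plan in use, so the rates must come from your own billing data.

```python
def estimate_monthly_credits(
    sessions_per_day: int,
    avg_minutes_per_session: float,
    credits_per_minute: float,            # hypothetical flat rate
    tool_calls_per_session: float = 0.0,
    credits_per_tool_call: float = 0.0,   # hypothetical flat rate
    days: int = 30,
) -> float:
    """Estimate credits for a month of traffic under flat-rate assumptions."""
    audio_cost = avg_minutes_per_session * credits_per_minute
    tool_cost = tool_calls_per_session * credits_per_tool_call
    return sessions_per_day * (audio_cost + tool_cost) * days
```

Running the estimate with a few traffic scenarios (pilot, expected, peak) gives operators a range to budget against rather than a single number.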

Workflow

From voice idea to production-ready agent

A clean operating model for taking a realtime voice system from first idea to production.

Define the agent

Write the role, boundaries, escalation rules, and success criteria before touching transport or tools.

Configure realtime sessions

Choose voice behavior, input modes, turn handling, and context strategy for the target channel.
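
One way to make these choices reviewable is to write them down as a typed plan and validate it before any transport code exists. The sketch below is a planning aid under stated assumptions: every field name and allowed value is hypothetical and does not correspond to parameters of a real API.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical planning labels, not values from any real API.
ALLOWED_CHANNELS = {"browser", "telephony"}
ALLOWED_TURN_HANDLING = {"automatic", "push_to_talk"}


@dataclass(frozen=True)
class SessionPlan:
    channel: str
    instructions: str
    voice: str
    input_modes: Tuple[str, ...] = ("audio",)
    turn_handling: str = "automatic"
    max_context_turns: int = 50

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means the plan is consistent."""
        issues = []
        if self.channel not in ALLOWED_CHANNELS:
            issues.append(f"unknown channel: {self.channel}")
        if not self.instructions.strip():
            issues.append("instructions are empty")
        if not self.input_modes:
            issues.append("at least one input mode is required")
        if self.turn_handling not in ALLOWED_TURN_HANDLING:
            issues.append(f"unknown turn handling: {self.turn_handling}")
        if self.max_context_turns <= 0:
            issues.append("max_context_turns must be positive")
        return issues
```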

Connect tools and data

Attach only the systems the agent needs, with explicit permissions and predictable failure paths.
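
Explicit permissions and predictable failure paths can be sketched as a small tool registry. The `ToolRegistry` class, its method names, and the error codes are hypothetical and exist only for this example; the point is that every call returns the same result shape whether the tool is missing, blocked, or broken.

```python
from typing import Any, Callable, Dict, Set


class ToolRegistry:
    """Hypothetical registry: tools must be registered AND explicitly allowed."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}
        self._allowed: Set[str] = set()

    def register(self, name: str, fn: Callable[..., Any], allowed: bool = False) -> None:
        self._tools[name] = fn
        if allowed:
            self._allowed.add(name)

    def call(self, name: str, **kwargs: Any) -> Dict[str, Any]:
        # Every path returns a dict with "ok", so the agent never sees a raw exception.
        if name not in self._tools:
            return {"ok": False, "error": "unknown_tool"}
        if name not in self._allowed:
            return {"ok": False, "error": "not_permitted"}
        try:
            return {"ok": True, "result": self._tools[name](**kwargs)}
        except Exception as exc:
            return {"ok": False, "error": f"tool_failed:{type(exc).__name__}"}
```

A blocked or failing call can then be routed to a spoken fallback or a human handoff instead of stalling the conversation.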

Review usage and launch

Track transcript quality, latency, tool activity, and credit consumption before scaling traffic.
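
A pre-launch review of latency, tool activity, and credit consumption can be sketched as a simple gate. The `LaunchReview` class and its thresholds are assumptions for illustration, and transcript quality is left out because it usually needs human or offline review rather than a counter.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LaunchReview:
    """Hypothetical collector for pilot-traffic metrics."""

    latencies_ms: List[float] = field(default_factory=list)
    tool_calls: int = 0
    tool_failures: int = 0
    credits_used: float = 0.0

    def record_turn(self, latency_ms: float, credits: float) -> None:
        self.latencies_ms.append(latency_ms)
        self.credits_used += credits

    def record_tool_call(self, ok: bool) -> None:
        self.tool_calls += 1
        if not ok:
            self.tool_failures += 1

    def p95_latency_ms(self) -> float:
        """Nearest-rank 95th percentile; 0.0 when no turns were recorded."""
        if not self.latencies_ms:
            return 0.0
        ordered = sorted(self.latencies_ms)
        rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
        return ordered[rank - 1]

    def tool_failure_rate(self) -> float:
        return self.tool_failures / self.tool_calls if self.tool_calls else 0.0

    def ready_to_scale(self, max_p95_ms: float, max_failure_rate: float, credit_budget: float) -> bool:
        return (
            self.p95_latency_ms() <= max_p95_ms
            and self.tool_failure_rate() <= max_failure_rate
            and self.credits_used <= credit_budget
        )
```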

Architecture

Implementation patterns for realtime voice

Use the right transport and session shape for the channel: browser voice, server-side audio, secure client access, and tool-backed conversations.

WebRTC

Browser voice transport

Use WebRTC when you need direct low-latency microphone input and audio output in a web product.

Best fit: browser-based voice assistants with responsive turn-taking.

Audio pipeline

Server-side streams

Use server-controlled audio streams when backend orchestration, recording, or telephony integration matters more than direct client-side latency.

Best fit: controlled infrastructure, call routing, compliance review, and server-owned state.

Security

Ephemeral access

Issue short-lived client secrets from your server so browsers can connect without exposing privileged credentials.

Best fit: production clients that need secure session startup and policy enforcement.
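
The general idea of a short-lived client secret can be sketched with a signed, expiring token. This is a generic HMAC illustration, not how any provider's realtime API issues client secrets; in practice your server would request the short-lived secret from the provider. The function names, the token layout, and the 60-second default are assumptions for the example.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def mint_client_secret(server_key: bytes, session_id: str,
                       ttl_seconds: int = 60, now: Optional[float] = None) -> str:
    """Sign a session id plus expiry with a key that never leaves the server."""
    issued = time.time() if now is None else now
    payload = json.dumps({"sid": session_id, "exp": int(issued) + ttl_seconds},
                         separators=(",", ":")).encode()
    signature = hmac.new(server_key, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode() + "."
            + base64.urlsafe_b64encode(signature).decode())


def verify_client_secret(server_key: bytes, token: str,
                         now: Optional[float] = None) -> Optional[str]:
    """Return the session id if the token is authentic and unexpired, else None."""
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:  # wrong number of parts or malformed base64
        return None
    expected = hmac.new(server_key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None
    claims = json.loads(payload)
    current = time.time() if now is None else now
    if current >= claims["exp"]:
        return None
    return claims["sid"]
```

Because the token expires quickly and carries no privileged credential, a leaked copy is only useful for a short window and only for that one session.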

Tooling

Tools and policies

Connect function calls, business rules, retrieval, and handoff paths so the voice agent can act safely.

Best fit: support, sales, training, operations, and internal copilots.

Built around practical production constraints

Three building blocks separate a professional voice-agent deployment from a novelty demo.

128K

context window for long-running realtime workflows

WebRTC

browser voice transport for low-latency interaction

Tools

function calling for workflow actions and system handoff
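
Long-running sessions still need a strategy for staying inside the context window. A minimal sketch, assuming the caller already knows each turn's token count: keep the most recent turns that fit a budget and drop the oldest. The `trim_to_budget` name and the `(text, tokens)` tuple shape are hypothetical.

```python
from typing import List, Tuple

Turn = Tuple[str, int]  # (text, token_count); counts are supplied by the caller


def trim_to_budget(turns: List[Turn], budget_tokens: int) -> List[Turn]:
    """Keep the newest contiguous turns whose total token count fits the budget."""
    kept: List[Turn] = []
    used = 0
    # Walk from newest to oldest and stop at the first turn that would overflow.
    for text, tokens in reversed(turns):
        if used + tokens > budget_tokens:
            break
        kept.append((text, tokens))
        used += tokens
    kept.reverse()  # restore chronological order
    return kept
```

In practice the budget would be set below the full window to leave room for instructions, tool definitions, and the model's reply.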

Where realtime voice agents fit

Realtime voice agents fit best in concrete business conversations rather than generic chat features.

Customer support voice agents

Answer routine questions, collect context, and hand off cleanly when a human should step in.

Sales qualification calls

Capture needs, route leads, and update pipeline tools while keeping the conversation natural.

Language learning tutors

Run spoken practice with corrections, summaries, and adaptive lesson flow.

Live translation assistants

Help multilingual teams communicate across calls, field work, travel, and operations.

Meeting and field copilots

Turn spoken updates into structured notes, tasks, and follow-up records.

Internal operations assistants

Guide staff through checklists, policy questions, and system actions hands-free.

GPT Realtime 2 FAQ

Clear answers for teams evaluating realtime voice agents.

Is this an official OpenAI website?

Can I build production voice agents with GPT Realtime 2?

Does it support WebRTC?

Can voice agents call tools?

How should teams control cost?

Who is this homepage for?