Building Voice AI Agents with Twilio, OpenAI, and FastAPI

Lessons from production: how to architect a real-time voice agent that handles inbound and outbound calls, logs conversations live, and scales reliably.

November 4, 2025

I've spent the last year building Voice AI agents for real-time phone conversations. Here's what actually matters in production — the parts the tutorials skip.

The architecture, end to end

A working voice agent has more moving parts than a typical web app. Here's the flow for an outbound call:

[Trigger] → FastAPI → Twilio API → SIP trunk → PSTN → User's phone
                                    ↓
                              [Media stream]
                                    ↓
                          STT → LLM → TTS → Twilio → User
                                    ↓
                          Function calls → Database

The hard part isn't any single piece — it's getting them to talk to each other in real time without dropping audio or losing context.
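To make the [Media stream] box concrete, here is a minimal sketch of handling the JSON frames that Twilio Media Streams sends over the WebSocket. The `start`, `media`, and `stop` events and the base64 `payload` field follow Twilio's message format. `CallState` and the function names are illustrative choices of mine, and a real handler would forward audio to STT rather than buffer it.

```python
import base64
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CallState:
    """Per-call state accumulated from Media Stream messages (hypothetical type)."""
    stream_sid: Optional[str] = None
    call_sid: Optional[str] = None
    audio: bytearray = field(default_factory=bytearray)
    ended: bool = False


def handle_stream_message(raw: str, state: CallState) -> None:
    """Apply one JSON text frame from the Twilio WebSocket to the call state."""
    msg = json.loads(raw)
    event = msg.get("event")
    if event == "start":
        state.stream_sid = msg["start"]["streamSid"]
        state.call_sid = msg["start"]["callSid"]
    elif event == "media":
        # The payload is base64-encoded audio (8 kHz mu-law by default).
        state.audio.extend(base64.b64decode(msg["media"]["payload"]))
    elif event == "stop":
        state.ended = True
    # Other events ("connected", "mark", ...) are ignored in this sketch.


def outbound_media_message(stream_sid: str, audio: bytes) -> str:
    """Build the frame that plays synthesized audio back to the caller."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(audio).decode("ascii")},
    })
```

In a FastAPI app, `handle_stream_message` would typically be called inside the WebSocket receive loop, and `outbound_media_message` would wrap each TTS chunk before it is sent back on the same socket.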

Latency is the whole game

Users notice 800ms of silence on a call. They tolerate maybe 400ms. Below 200ms feels natural.

That budget has to cover: the user finishing speaking → STT detecting end of speech → LLM generating a response → TTS synthesizing audio → audio reaching the user's phone. Every layer has to be optimized:

Use streaming everywhere. Stream audio to STT, stream LLM tokens to TTS, stream TTS chunks back to Twilio. Don't wait for one stage to finish before starting the next.
Pick a fast LLM. GPT-4 is too slow for real-time. GPT-4o or Gemini Flash are usable. For simple agents, even smaller models work.
Pre-warm your server. Cold starts will kill your first call. Keep a connection pool to OpenAI open.
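As a rough illustration of "stream LLM tokens to TTS", the sketch below groups streamed tokens into sentences so synthesis can begin on the first sentence while the model is still generating the rest. `fake_llm_tokens` is a hypothetical stand-in for a real streaming response, and splitting on sentence-ending punctuation is one simple chunking choice among several.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_llm_tokens() -> AsyncIterator[str]:
    """Stand-in for a streaming LLM response (hypothetical)."""
    for token in ["Sure", ",", " I", " can", " help", ".", " What", " time", " works", "?"]:
        await asyncio.sleep(0)  # yield control, as a network read would
        yield token


async def sentences(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group streamed tokens into sentences so TTS can start early."""
    buffer = ""
    async for token in tokens:
        buffer += token
        if buffer.rstrip().endswith((".", "?", "!")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush a trailing fragment without punctuation


async def collect() -> List[str]:
    """Gather the chunks; a real pipeline would hand each one to TTS instead."""
    return [s async for s in sentences(fake_llm_tokens())]


if __name__ == "__main__":
    print(asyncio.run(collect()))
```

The point of the design is that the first chunk is available after six tokens rather than ten, so time-to-first-audio depends on the first sentence, not the whole reply.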

Tip

OpenAI's Realtime API collapses STT → LLM → TTS into a single WebSocket. Latency is dramatically better, but you trade flexibility (you can't easily swap voices or models mid-call).

Function calling during a live call

This is where it gets interesting. You want the agent to actually do things: log the conversation, look up customer data, transfer the call. With OpenAI's function calling, you describe each action as a tool the model can invoke:

# Tool definition in the shape the Chat Completions API accepts for `tools`.
tools = [{
    "type": "function",
    "function": {
        "name": "log_lead",
        "description": "Save a qualified lead to the database",
        # JSON Schema describing the arguments the model may pass.
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "interest": {"type": "string"},
                "callback_time": {"type": "string"},
            },
        },
    },
}]

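When the model decides to call log_lead, your FastAPI backend can run the handler as a background task while the agent keeps speaking, so the user never hears a pause. That works for fire-and-forget actions like logging; a lookup whose result the agent has to speak still needs to finish before the reply can use it.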
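Here is a minimal sketch of that asynchronous handling, assuming the tool name and JSON arguments string have already been pulled out of the model's response. `LEADS`, `HANDLERS`, and `dispatch_tool_call` are hypothetical names, the in-memory list stands in for a real database, and the sample lead is made up.

```python
import asyncio
import json
from typing import Dict, List, Set

LEADS: List[Dict[str, str]] = []  # stand-in for a real database table


async def log_lead(name: str = "", interest: str = "", callback_time: str = "") -> None:
    """Pretend to persist a lead; the sleep simulates a database write."""
    await asyncio.sleep(0.01)
    LEADS.append({"name": name, "interest": interest, "callback_time": callback_time})


HANDLERS = {"log_lead": log_lead}
_background: Set[asyncio.Task] = set()


def dispatch_tool_call(name: str, arguments_json: str) -> asyncio.Task:
    """Schedule the handler without awaiting it, so audio keeps flowing."""
    args = json.loads(arguments_json)
    task = asyncio.create_task(HANDLERS[name](**args))
    _background.add(task)  # hold a reference so the task is not garbage-collected
    task.add_done_callback(_background.discard)
    return task


async def demo() -> List[str]:
    """Show that the call loop continues before the write has finished."""
    timeline = []
    task = dispatch_tool_call("log_lead", '{"name": "Ayesha", "interest": "fiber"}')
    timeline.append("agent keeps speaking")
    timeline.append(f"leads saved so far: {len(LEADS)}")
    await task  # only the demo waits; the call loop would not
    timeline.append(f"leads saved after await: {len(LEADS)}")
    return timeline


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Keeping a reference to each task matters because the event loop only holds weak references to tasks, so an unreferenced background write can be collected before it completes.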

SIP vs. plain Twilio

For low-volume use cases, Twilio's standard voice API is fine. For real production loads — say, a campaign making 1,000+ calls a day — SIP trunking is non-negotiable. It's cheaper per minute, gives you better routing control, and lets you bring your own carrier (we use Nayatel for Pakistan-based calls).

The catch: SIP brings its own protocol complexity. You'll be debugging RTP audio streams, codec negotiation, and NAT traversal. Budget time for it.

Handling failure modes

Real calls fail in ways your tests won't catch:

The user's phone has bad reception — partial audio, garbled STT
Network jitter — the audio stream stutters
The LLM hallucinates a function call — handle gracefully, don't crash
The user interrupts mid-response — you need barge-in detection

Build for these from day one. Add idle timeouts, max-call-duration limits, and fallbacks for when the LLM fails. Log everything — you'll need it when something goes wrong on call #4,372.
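Two of these guards are easy to sketch: rejecting a hallucinated function call without crashing, and falling back to a canned line when the LLM is slow or errors out. The tool registry, the fallback sentence, and both function names below are assumptions for illustration, and the `generate` callable stands in for whatever produces the model's reply.

```python
import asyncio
import json
from typing import Awaitable, Callable, Tuple

# Allowed argument names per tool (hypothetical registry).
KNOWN_TOOLS = {"log_lead": {"name", "interest", "callback_time"}}
FALLBACK_LINE = "Sorry, could you say that again?"


def validate_tool_call(name: str, arguments_json: str) -> Tuple[bool, str]:
    """Return (ok, reason) instead of raising on a malformed or invented call."""
    if name not in KNOWN_TOOLS:
        return False, f"unknown tool: {name}"
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError:
        return False, "arguments are not valid JSON"
    if not isinstance(args, dict):
        return False, "arguments must be a JSON object"
    extra = set(args) - KNOWN_TOOLS[name]
    if extra:
        return False, f"unexpected arguments: {sorted(extra)}"
    return True, "ok"


async def reply_with_fallback(
    generate: Callable[[], Awaitable[str]], timeout_s: float
) -> str:
    """Await the LLM reply, but use a canned line on timeout or provider error."""
    try:
        return await asyncio.wait_for(generate(), timeout=timeout_s)
    except Exception:
        # Covers asyncio.TimeoutError and errors raised by the provider call.
        return FALLBACK_LINE
```

The same `asyncio.wait_for` pattern can wrap the whole call loop to enforce a maximum call duration, or the wait for the next audio frame to enforce an idle timeout.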

What I'd tell my past self

Start with the shortest possible end-to-end path: place a call, say one sentence, hang up. Get that working. Then add complexity one layer at a time. Voice AI has more failure modes than any web stack I've worked with, and they all interact. Iterate.
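For that first milestone, the call instructions can be as small as a TwiML document with one `<Say>` and a `<Hangup>`. The sketch below builds it with the standard library only; the function name and the sentence are placeholders, and how you hand the XML to Twilio (for example as the call's TwiML when creating it through the REST API) depends on your setup.

```python
from xml.etree import ElementTree as ET


def one_sentence_twiml(sentence: str) -> str:
    """Build TwiML that speaks one sentence and then ends the call."""
    response = ET.Element("Response")
    ET.SubElement(response, "Say").text = sentence  # text is XML-escaped on output
    ET.SubElement(response, "Hangup")
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    print(one_sentence_twiml("Hello, this is a test call."))
```

Once this round trip works against a real phone, swapping `<Say>` for a media stream is the next layer to add.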

The reward is worth it. There's something genuinely magical about a system that can hold a real phone conversation with another human. You'll see.
