Building Voice AI Agents with Twilio, OpenAI, and FastAPI
Lessons from production: how to architect a real-time voice agent that handles inbound and outbound calls, logs conversations live, and scales reliably.
I've spent the last year building Voice AI agents for real-time phone conversations. Here's what actually matters in production — the parts the tutorials skip.
The architecture, end to end
A working voice agent has more moving parts than a typical web app. Here's the flow for an outbound call:
[Trigger] → FastAPI → Twilio API → SIP trunk → PSTN → User's phone
                          ↓
                    [Media stream]
                          ↓
                         STT → LLM → TTS → Twilio → User
                                ↓
                         Function calls → Database
The hard part isn't any single piece — it's getting them to talk to each other in real time without dropping audio or losing context.
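To make the [Media stream] box concrete, here is a minimal sketch of the JSON messages that travel over a Twilio Media Streams WebSocket. It assumes the documented message shape (an "event" field, a "streamSid", and base64-encoded 8 kHz mu-law audio in media.payload); the function names are my own, and the WebSocket plumbing itself is left out.

```python
import base64
import json


def parse_twilio_message(raw: str) -> tuple[str, bytes]:
    """Return (event_name, audio_bytes) for one Media Streams message.

    audio_bytes is empty for non-media events such as "start" or "stop".
    """
    msg = json.loads(raw)
    event = msg.get("event", "")
    if event == "media":
        # Inbound audio arrives base64-encoded in media.payload.
        return event, base64.b64decode(msg["media"]["payload"])
    return event, b""


def build_media_message(stream_sid: str, audio: bytes) -> str:
    """Wrap synthesized mu-law audio in the JSON shape sent back to Twilio."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(audio).decode("ascii")},
    })
```

In a FastAPI app these two helpers would sit inside a WebSocket route: parse each incoming text frame, feed the audio bytes to STT, and send TTS output back through build_media_message.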
Latency is the whole game#
Users notice 800ms of silence on a call. They tolerate maybe 400ms. Below 200ms feels natural.
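Those thresholds are easy to track per turn. The sketch below is one way to do it, assuming you can timestamp the moment the user stops speaking and the moment the first audio chunk goes out. The class and function names are hypothetical, and treating everything above 400ms as "noticeable" is my simplification of the numbers above.

```python
from dataclasses import dataclass


@dataclass
class TurnTiming:
    user_stopped_at: float  # seconds, e.g. from time.monotonic()
    first_audio_at: float   # seconds, when the first TTS chunk was sent

    @property
    def gap_ms(self) -> float:
        """Silence the caller hears between their last word and our first."""
        return (self.first_audio_at - self.user_stopped_at) * 1000.0


def classify_response_gap(gap_ms: float) -> str:
    """Bucket a response gap using the rough thresholds from this post."""
    if gap_ms < 200:
        return "natural"
    if gap_ms <= 400:
        return "tolerable"
    return "noticeable"
```

Logging the bucket for every turn gives you a latency histogram per call, which is far more useful than a single average.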
That budget has to cover: the user finishing speaking → STT detecting end of speech → LLM generating a response → TTS synthesizing audio → audio reaching the user's phone. Every layer has to be optimized:
- Use streaming everywhere. Stream audio to STT, stream LLM tokens to TTS, stream TTS chunks back to Twilio. Don't wait for one stage to finish before starting the next.
- Pick a fast LLM. GPT-4 is too slow for real-time. GPT-4o or Gemini Flash are usable. For simple agents, even smaller models work.
- Pre-warm your server. Cold starts will kill your first call. Keep a connection pool to OpenAI open.
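The "stream everywhere" advice can be sketched with async generators. This is an illustration under stated assumptions, not a production pipeline: fake_llm_tokens and fake_tts are hypothetical stand-ins for real streaming clients, and sentence boundaries are detected with a naive punctuation check.

```python
import asyncio
from typing import AsyncIterator

SENTENCE_END = (".", "!", "?")


async def fake_llm_tokens() -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming LLM response."""
    for token in ["Hi", " there", ".", " How", " can", " I", " help", "?"]:
        await asyncio.sleep(0)  # yield control, as a network read would
        yield token


async def sentences(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group tokens into sentences so TTS can start before the LLM finishes."""
    buffer = ""
    async for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush a trailing partial sentence


async def fake_tts(sentence: str) -> bytes:
    """Hypothetical stand-in for a TTS call; returns placeholder bytes."""
    await asyncio.sleep(0)
    return sentence.encode("utf-8")


async def run_pipeline() -> list[bytes]:
    chunks = []
    async for sentence in sentences(fake_llm_tokens()):
        # In production this is where each chunk would be sent to Twilio.
        chunks.append(await fake_tts(sentence))
    return chunks
```

The first sentence reaches TTS while the model is still generating the second. Putting an asyncio.Queue between the stages would let them overlap fully, at the cost of more bookkeeping.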
OpenAI's Realtime API collapses STT → LLM → TTS into a single WebSocket. Latency is dramatically better, but you trade flexibility (you can't easily swap voices or models mid-call).
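For reference, configuring a Realtime session for a phone call comes down to one JSON event sent after the WebSocket opens. The sketch below assumes the event shape from the beta version of the API (a "session.update" event with flat input_audio_format and output_audio_format fields). Field names have changed between API versions, so check them against the current reference before relying on this.

```python
import json


def build_session_update(instructions: str, voice: str = "alloy") -> str:
    """Build a session.update event (beta-era shape; an assumption to verify)."""
    return json.dumps({
        "type": "session.update",
        "session": {
            "instructions": instructions,
            "voice": voice,
            # Twilio Media Streams carry 8 kHz G.711 mu-law, so request the
            # same format in both directions to avoid transcoding.
            "input_audio_format": "g711_ulaw",
            "output_audio_format": "g711_ulaw",
            # Let the server decide when the caller has stopped speaking.
            "turn_detection": {"type": "server_vad"},
        },
    })
```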
Function calling during a live call
This is where it gets interesting. You want the agent to actually do things: log the conversation, look up customer data, transfer the call. With OpenAI's function calling, you declare each action as a tool the model can invoke:
    tools = [{
        "type": "function",
        "function": {
            "name": "log_lead",
            "description": "Save a qualified lead to the database",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "interest": {"type": "string"},
                    "callback_time": {"type": "string"},
                },
            },
        },
    }]

When the model decides to call log_lead, your FastAPI backend handles it asynchronously while the agent continues speaking. The user never hears a pause.
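One way to handle a tool call without stalling the audio loop is to schedule it as a background task. This is a sketch under assumptions: saved_leads stands in for a real database, the handler registry and function names are hypothetical, and the tool call is assumed to arrive as a name plus a JSON string of arguments.

```python
import asyncio
import json

saved_leads: list[dict] = []  # stand-in for a real database table


async def log_lead(name: str, interest: str = "", callback_time: str = "") -> dict:
    await asyncio.sleep(0)  # placeholder for an async database write
    saved_leads.append(
        {"name": name, "interest": interest, "callback_time": callback_time}
    )
    return {"status": "saved"}


HANDLERS = {"log_lead": log_lead}


async def run_tool(name: str, arguments_json: str) -> dict:
    """Execute one tool call, returning an error dict instead of raising."""
    handler = HANDLERS.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    try:
        args = json.loads(arguments_json)
        return await handler(**args)
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON, or arguments that do not match the handler.
        return {"error": f"bad arguments: {exc}"}


def dispatch_in_background(name: str, arguments_json: str) -> asyncio.Task:
    """Schedule the tool call so the audio loop is never blocked.

    Keep a reference to the returned task; unreferenced tasks can be
    garbage-collected before they finish.
    """
    return asyncio.create_task(run_tool(name, arguments_json))


async def demo() -> dict:
    task = dispatch_in_background("log_lead", '{"name": "Ada", "interest": "demo"}')
    # ... the agent keeps streaming audio here ...
    return await task
```

Returning an error dict rather than raising means a made-up tool name or malformed arguments can be reported back to the model instead of taking down the call.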
SIP vs. plain Twilio
For low-volume use cases, Twilio's standard voice API is fine. For real production loads — say, a campaign making 1,000+ calls a day — SIP trunking is non-negotiable. It's cheaper per minute, gives you better routing control, and lets you bring your own carrier (we use Nayatel for Pakistan-based calls).
The catch: SIP brings its own protocol complexity. You'll be debugging RTP audio streams, codec negotiation, and NAT traversal. Budget time for it.
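When you do end up staring at RTP captures, it helps to decode the fixed 12-byte header yourself. The sketch below follows the header layout in RFC 3550 and the static payload types 0 (PCMU) and 8 (PCMA). It ignores CSRC lists and header extensions, and the names are my own.

```python
import struct
from dataclasses import dataclass

PAYLOAD_TYPES = {0: "PCMU (G.711 mu-law)", 8: "PCMA (G.711 A-law)"}


@dataclass
class RtpHeader:
    version: int
    marker: bool
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int

    @property
    def codec(self) -> str:
        return PAYLOAD_TYPES.get(self.payload_type, "dynamic or unknown")


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode the fixed RTP header (first 12 bytes) of a packet."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    # Network byte order: two flag bytes, 16-bit sequence number,
    # 32-bit timestamp, 32-bit SSRC.
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=b0 >> 6,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence=seq,
        timestamp=ts,
        ssrc=ssrc,
    )
```

Gaps in the sequence field point to packet loss, and an unexpected payload_type usually means codec negotiation did not end where you thought it did.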
Handling failure modes
Real calls fail in ways your tests won't catch:
- The user's phone has bad reception — partial audio, garbled STT
- Network jitter — the audio stream stutters
- The LLM hallucinates a function call — handle gracefully, don't crash
- The user interrupts mid-response — you need barge-in detection
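Barge-in is the least obvious item on that list, so here is a rough sketch. It assumes something upstream (your STT or a VAD) tells you the user has started talking, and it relies on the Twilio Media Streams "clear" message, which tells Twilio to drop audio it has buffered but not yet played. The Playback class is hypothetical.

```python
import asyncio
import json
from typing import Awaitable, Callable, Coroutine, Optional


class Playback:
    """Tracks the agent's current speech so it can be cut off on interruption."""

    def __init__(self, stream_sid: str, send: Callable[[str], Awaitable[None]]):
        self.stream_sid = stream_sid
        self.send = send  # async callable that writes text to the WebSocket
        self.task: Optional[asyncio.Task] = None

    def start(self, speak: Coroutine) -> None:
        """Begin streaming a reply in the background."""
        self.task = asyncio.create_task(speak)

    async def interrupt(self) -> bool:
        """Call when the user starts talking; True if speech was cut off."""
        if self.task is None or self.task.done():
            return False
        self.task.cancel()
        try:
            await self.task
        except asyncio.CancelledError:
            pass  # expected: we cancelled it ourselves
        # Ask Twilio to discard audio it has buffered but not yet played.
        await self.send(json.dumps({"event": "clear", "streamSid": self.stream_sid}))
        return True


async def demo():
    sent = []

    async def send(text: str) -> None:
        sent.append(json.loads(text))

    async def speak() -> None:
        await asyncio.sleep(10)  # pretend to stream a long reply

    playback = Playback("MZ123", send)
    playback.start(speak())
    await asyncio.sleep(0)  # let the reply start
    cut = await playback.interrupt()
    return cut, sent
```

Cancelling the local task is only half of it: without the clear message, the caller keeps hearing whatever Twilio already has queued.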
Build for these from day one. Add idle timeouts, max-call-duration limits, and fallbacks for when the LLM fails. Log everything — you'll need it when something goes wrong on call #4,372.
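Timeouts and fallbacks are mostly a matter of wrapping every await that can hang. A minimal sketch, assuming generate and next_utterance are async callables you supply (both hypothetical), with default limits that are placeholders rather than recommendations:

```python
import asyncio
from typing import Awaitable, Callable, Optional

FALLBACK_REPLY = "Sorry, I didn't catch that. Could you say it again?"


async def reply_with_fallback(
    generate: Callable[[], Awaitable[str]], timeout_s: float = 2.0
) -> str:
    """Return the LLM reply, or a canned line if it fails or times out."""
    try:
        return await asyncio.wait_for(generate(), timeout=timeout_s)
    except Exception:
        # Covers timeouts and provider errors; log the exception in real code.
        return FALLBACK_REPLY


async def run_call(
    next_utterance: Callable[[], Awaitable[Optional[str]]],
    idle_timeout_s: float = 10.0,
    max_duration_s: float = 600.0,
) -> str:
    """Consume user turns until a limit is hit; return why the call ended."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + max_duration_s
    while True:
        remaining = deadline - loop.time()
        if remaining <= 0:
            return "max_duration"
        try:
            utterance = await asyncio.wait_for(
                next_utterance(), timeout=min(idle_timeout_s, remaining)
            )
        except asyncio.TimeoutError:
            # Whichever limit was shorter is the one that fired.
            return "idle_timeout" if remaining > idle_timeout_s else "max_duration"
        if utterance is None:
            return "hangup"  # the caller hung up or the stream closed
```

Recording the returned reason with the call log makes it easy to see later whether calls are ending naturally or being cut off by a limit.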
What I'd tell my past self
Start with the shortest possible end-to-end path: place a call, say one sentence, hang up. Get that working. Then add complexity one layer at a time. Voice AI has more failure modes than any web stack I've worked with, and they all interact. Iterate.
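That first milestone fits in a few lines of TwiML. Here is a sketch that builds it with the standard library only, using the real Say and Hangup verbs; the function name is mine, and serving it from a webhook is left to you.

```python
from xml.etree.ElementTree import Element, SubElement, tostring


def one_sentence_twiml(sentence: str) -> str:
    """Build TwiML that speaks one sentence and then ends the call."""
    response = Element("Response")
    SubElement(response, "Say").text = sentence  # text is XML-escaped for us
    SubElement(response, "Hangup")
    return tostring(response, encoding="unicode")
```

Return that string from a FastAPI route with an XML media type, point the call's webhook URL at the route, and you have the whole loop working before any STT, LLM, or TTS is involved.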
The reward is worth it. There's something genuinely magical about a system that can hold a real phone conversation with another human. You'll see.