The Reality of Voice Agent Development
Here’s what typically happens:

- You build a voice agent
- You call it 5-10 times
- It sounds great! Everything works perfectly
- You deploy to production
- It fails 40-50% of the time with real users

Why? Real users bring conditions you never tested:

- Different accents and speaking styles
- Frustrated or stressed users
- Edge cases and unexpected responses
- The scale of hundreds or thousands of calls
Overview: Building an Agent Start-to-Finish
We’ll build a complete inbound voice agent for a real estate company that can handle:

- Property inquiries (buying, selling, renting)
- Lead qualification
- Information capture (name, address, timeline)
- Conditional logic based on user responses
Step 1: Plan Your Agent with a Call Flow
Before writing a single line of prompt, map out your agent’s conversation flow visually. This makes prompt engineering dramatically easier.

Create a Call Flow Diagram

Use a tool like Whimsical, Miro, or any diagramming tool to map out:

- First Message - What the agent says when it answers
- Pathways - Different conversation flows based on user intent
- Questions - Information to capture at each step
- Conditions - Branching logic (if yes → do this, if no → do that)
- End States - How conversations conclude
Example: Real Estate Agent Flow

Buying pathway:

- “Great! Which property were you interested in?”
- “Could you spell out your full name?”
- “Do you have another property to sell first?”
  - If YES → “Would you like our assistance selling it?”
    - If YES → “What’s your current address?”
  - If NO → “Would you like to book a walkthrough?”

Selling pathway:

- “Great! Could you provide your address?”
- “Just to confirm, your address is [repeat address]. Is that correct?”
- “Could you spell out your full name?”
- “What’s your reason for selling?”
- “Have you done any repairs recently?”
- “What’s your timeline for getting this sold?”

Renting pathway:

- “Great! Which property were you interested in?”
- “Could you spell out your full name?”
- “What’s your timeline for moving in?”
- “Would you like to book a walkthrough?”
Step 2: Configure Your Voice Agent Settings
Before building your prompt, set up your agent configuration in Vapi or Retell.

Model Selection

For Vapi:

- GPT-4o - Reliable, fast, good for most use cases (recommended for starting)
- GPT-4o-mini - Cheaper, slightly less capable
- Claude 3.5 Sonnet - Alternative, good reasoning
- GPT-5 (new) - Test carefully before using in production
Voice Selection
Choosing the right voice is critical.

Vapi Voices:

- Pre-tuned for phone calls
- Sound realistic and not glitchy
- Limited selection (~10-15 voices)
- Recommended for production - they’re tested and reliable

ElevenLabs Voices:

- Larger selection
- Can sound more natural
- Warning: Some voices sound great but glitch after hundreds of calls
- Test extensively if using ElevenLabs
Transcriber Settings
For most agents, the defaults work well:

- Deepgram for Vapi (recommended)
- Start with default settings
- Only adjust after you’ve tested and identified specific transcription issues
Step 3: Structure Your Prompt
A well-structured prompt is critical for reliable agent performance.

Recommended Prompt Structure
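As a sketch, such a structure might look like this (the agent name, company, and exact question wording are illustrative, not part of this guide's real estate example):

```markdown
# ROLE
You are Alex, a friendly AI assistant for Sunrise Realty, a real estate company.

# TASK
Qualify inbound callers who want to buy, sell, or rent a property.

# QUESTIONS
1. `Are you calling about buying, selling, or renting?`
2. `Could you spell out your full name?`
3. Once their name has been captured, repeat it back to them letter by letter.
4. `Do you have another property to sell first?`

# RULES
- Follow each question step-by-step in order. Do not skip any questions.
- If YES to question 4, ask: `Would you like our assistance selling it?`

# NOTES
- Read street numbers digit by digit (123 → "one two three").
```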
Key Prompt Engineering Principles
Use Markdown Formatting: AI models are trained on markdown. Using proper formatting helps the AI understand your prompt better:

- Use `# Headers` for major sections
- Use `## Subheaders` for pathways
- Use `**Bold**` for emphasis on key terms
- Use numbered lists for sequential steps
- Use backticks for exact phrases to say
Number Your Questions: Give every question an explicit number and point your branching logic at those numbers (e.g., If YES to question 4, ask: `Would you like our assistance?`). Reference specific question numbers to avoid ambiguity.
Confirm Critical Information: Always repeat back names (spell back letter by letter), addresses (read back with proper pronunciation), email addresses, and any data going into a CRM. This dramatically improves accuracy.
Define Personality in the Role: The ROLE section shapes how the agent speaks. Adding “friendly” or “professional” or “empathetic” actually changes behavior significantly.
Step 4: Test Manually First
Before running automated tests, do a quick manual test:

1. Publish Your Agent - Save and publish your agent in Vapi/Retell.
2. Call It Yourself - Call your agent and go through one pathway completely. Check:
   - Does it read the first message correctly?
   - Does it follow the question order?
   - Does the voice sound good?
   - Is the response latency acceptable?
3. Note Issues - Write down anything that doesn’t work. Don’t try to fix everything yet - just get a baseline.
Step 5: Import to Relyable
Follow the Quick Start guide to:

- Create a Relyable account
- Import your agent
- Generate test cases
- Create personas
Step 6: Generate Comprehensive Test Cases
Test cases evaluate specific behaviors. Aim for 15-25 test cases covering:

Must-Have Test Cases

Identity and Branding:

- Agent introduces with correct name and company
- Agent mentions it’s an AI system
- Agent maintains consistent identity
Conversation Flow:

- Agent asks questions in the correct order
- Agent doesn’t skip required questions
- Agent handles conditional logic correctly
Data Capture:

- Agent captures required information (name, address, email)
- Agent confirms captured information back to user
- Agent spells back name letter by letter
- Agent reads addresses with proper pronunciation
Intent Detection:

- Agent correctly identifies buying vs selling vs renting intent
- Agent follows the correct pathway for each intent
- Agent handles pathway changes (e.g., buyer also needs to sell)
Difficult Situations:

- Agent handles objections (“Why do you need my name?”)
- Agent handles unclear responses
- Agent doesn’t give up on difficult customers prematurely
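For instance, a single test case might be written like this (the field layout is illustrative - adapt it to however you define test cases in Relyable):

```
Test case: Agent spells back the caller's name
Priority: High
Pass criteria: After capturing the name, the agent repeats it back letter by
letter and asks the caller to confirm before continuing.
```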
Setting Test Case Priorities
| Priority | When to Use | Impact on Score |
|---|---|---|
| Critical | Must work 100% of the time (emergency routing, compliance) | Huge impact |
| High | Core functionality (capturing lead info, following flow) | Large impact |
| Medium | Important but not critical (tone, using caller’s name) | Medium impact |
| Low | Nice-to-have (specific phrases, minor details) | Small impact |
Step 7: Create Diverse Personas
Don’t just create “normal” callers. Create personas that stress-test your agent:

Example Personas to Create

The Frustrated Elderly Customer: Stanley Miller, 79-year-old retired mechanic. Frustrated with life, no patience, speaks in short clipped sentences. Gruff and exasperated. Reluctantly calling because his kids are forcing him to sell his house of 50 years.

The Fast-Talking Young Professional: Sarah Chen, 28-year-old tech professional. Speaks very quickly, interrupts frequently, expects immediate answers. Impatient with any delays. Used to chatbots and expects perfect performance.

The Non-Native Speaker: Raj Patel, 45-year-old engineer. Strong Indian accent, speaks slowly and carefully, sometimes struggles with pronunciation. Very polite but needs information repeated sometimes.

The Skeptical Customer: Mike Johnson, 52-year-old business owner. Doesn’t trust AI systems, questions everything, tests whether it’s really AI or human. Asks unexpected questions to trip up the system.

The Confused Caller: Linda Martinez, 67-year-old retiree. Not tech-savvy, easily confused, needs things explained multiple times. Forgets what was just discussed. Very polite but difficult to keep on track.

Step 8: Run Automated Tests at Scale
Now run your tests:

1. Start with 5-10 Scenarios - Create 5-10 different scenarios using your personas:
   - 2-3 buying scenarios
   - 2-3 selling scenarios
   - 2-3 renting scenarios
   - 1-2 complex scenarios (buyer who also needs to sell)
2. Run All at Once - Select all scenarios and run them together. This typically takes 10-20 minutes for 5-10 calls.
3. Wait for Results - Grab coffee. Relyable will call your agent with each scenario and evaluate against all test cases.
Step 9: Analyze Results and Find Issues
Understanding Your Score
After tests complete, you’ll see an overall score:

- 90%+ → Excellent, production-ready
- 70-89% → Good, acceptable for production
- 50-69% → Needs work before production
- Below 50% → Significant issues, not ready
Reviewing Failed Test Cases
Click on failed test cases to see:

- Which calls failed - Sometimes a test case fails on 2/5 calls, not all
- Why it failed - AI explanation of what went wrong
- The exact conversation - Full transcript and audio
- Suggestions - How to fix it in your prompt
Common Failure Patterns
Skipping Questions:

- Symptom: Agent jumps to question 5 without asking questions 3 and 4
- Fix: Add explicit numbering and: “Follow each question step-by-step in order. Do not skip any questions.”

Missing Confirmation:

- Symptom: Agent captures name but doesn’t spell it back
- Fix: Make confirmation explicit: “3. Once their name has been captured, repeat it back to them letter by letter.”

Giving Up Too Easily:

- Symptom: When customer pushes back (“Why do you need that?”), agent says “No worries, have a great day” and ends the call
- Fix: Add handling: “If the caller objects or questions why you need information, briefly explain the reason and ask again gently. Do not end the call unless they explicitly want to hang up.”

Broken Conditional Logic:

- Symptom: Agent asks “Would you like help selling?” even when user said NO to having a property to sell
- Fix: Be more explicit: “If the user responds NO to question 4, skip to question 6 directly.”

Mispronounced Numbers:

- Symptom: Agent says “one hundred twenty-three” instead of “one two three” for address 123
- Fix: Add pronunciation guide in a NOTES section showing how to split numbers and addresses.
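Such a NOTES section might look like the following sketch (the exact wording is illustrative):

```markdown
# NOTES
- Read street numbers digit by digit: `123 Main St` → say "one two three Main Street"
- Read ZIP codes digit by digit: `94103` → say "nine four one zero three"
- Spell names and email addresses character by character when confirming them
```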
Step 10: Iterate and Improve
The key to production-ready agents is iteration:

1. Fix 2-3 Issues at a Time - Don’t try to fix everything. Pick the 2-3 most critical failures and fix those in your prompt.
2. Update in Vapi/Retell - Make your prompt changes in your voice platform, not in Relyable. Relyable is read-only.
3. Sync to Relyable - Click “Sync Prompt” in Relyable to pull the updated prompt.
4. Test Again - Run the same scenarios again. Did your score improve? Did the specific failures get fixed?
5. Repeat Until 70%+ - Keep iterating. Most production-ready agents take 5-10 iterations to get from 50% to 75%+.
Tracking Improvements
| Iteration | Score | Changes Made |
|---|---|---|
| 1 | 51% | Baseline |
| 2 | 58% | Added step-by-step ordering |
| 3 | 63% | Added name confirmation |
| 4 | 69% | Fixed conditional logic |
| 5 | 74% | Added objection handling |
| 6 | 78% | Production Ready |
Step 11: Enable Live Monitoring
Once you reach 70%+ consistently:

1. Enable Call Monitoring - In Relyable, go to Agent Settings → Enable Call Monitoring.
2. Deploy to Production - Your agent is now ready for real customers.
3. Monitor Performance - Every production call is evaluated against your test cases. You’ll see:
   - Real-time scores
   - Which test cases are failing in production
   - Trends over time (is quality improving or degrading?)
4. Set Up Alerts - Get notified when:
   - Score drops below 70%
   - Critical test cases fail
   - Specific issues occur multiple times
Prompt Engineering Best Practices
Use Markdown Effectively
Good Example: a prompt laid out as described in Step 3 - `#` headers for sections, numbered questions, and backticked exact phrases.

Preview Your Prompt

Use Markdown Live Preview to see how your prompt renders. This shows you how the AI “sees” your prompt.

Be Explicit About Everything
Don’t assume the AI will “figure it out.” Be explicit about:

- Question order
- What to say exactly (use backticks)
- When to say it
- What NOT to do
Test One Change at a Time
If you change 5 things and the score improves, you don’t know which change helped. Change 1-2 things per iteration.

Use Real Call Examples

When you find a failure in testing, paste the transcript into your prompt as an example of what not to do, alongside the response you wanted instead.

Advanced: Data Capture and CRM Integration
Once your agent is reliable, integrate it with your CRM.

Capture Data with Functions

In Vapi, set up functions to capture:

- Name
- Phone number
- Address
- Other lead information
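As a sketch, a capture function might be declared with an OpenAI-style schema (the function name and field set here are illustrative - check your platform's tool documentation for the exact format it expects):

```json
{
  "name": "capture_lead",
  "description": "Save lead details after they have been confirmed with the caller",
  "parameters": {
    "type": "object",
    "properties": {
      "name": { "type": "string", "description": "Full name, spelled back and confirmed" },
      "phone": { "type": "string", "description": "Callback number" },
      "address": { "type": "string", "description": "Confirmed street address" }
    },
    "required": ["name", "phone"]
  }
}
```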
Confirm Before Sending
Always confirm information before sending to CRM:Test the Integration
Run automated tests with the CRM integration enabled. Verify:

- Data is captured correctly
- Data is confirmed before sending
- Invalid data doesn’t get sent
- The agent handles CRM errors gracefully
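The middle two checks can also be enforced outside the prompt. Here is a minimal Python sketch (the field names and `confirmed` flag are assumptions for illustration, not a Relyable or Vapi API) of a guard that keeps incomplete or unconfirmed leads out of the CRM:

```python
# Illustrative sketch: gate CRM writes on the checks listed above -
# required fields present, phone number sane, data confirmed by the caller.
import re

REQUIRED_FIELDS = ("name", "phone", "address")

def is_valid_lead(lead: dict) -> bool:
    """Return True only for complete, confirmed leads."""
    for field in REQUIRED_FIELDS:
        if not str(lead.get(field, "")).strip():
            return False  # incomplete data never reaches the CRM
    digits = re.sub(r"\D", "", str(lead["phone"]))
    if len(digits) < 7:  # loose sanity check, not full validation
        return False
    return bool(lead.get("confirmed"))  # caller must have confirmed the data

lead = {"name": "Jane Doe", "phone": "555-0123",
        "address": "123 Main St", "confirmed": True}
print(is_valid_lead(lead))  # True
```

A guard like this gives you a hard backstop even when the prompt occasionally fails the confirmation step.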
Performance Optimization
Reducing Latency
If your agent is too slow:

- Use a faster model - Try GPT-4o-mini instead of GPT-4o
- Shorten your prompt - Remove unnecessary details
- Reduce tool calls - Fewer function calls = faster responses
- Use Vapi’s latest features - They constantly optimize latency
Improving Speech Quality
If speech sounds robotic or glitchy:

- Switch voices - Some voices are more reliable
- Adjust WPM (words per minute) - 150-180 is natural
- Add punctuation to responses - Helps with cadence
- Use SSML tags - For pauses and emphasis (platform-dependent)
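For example, a standard SSML fragment for a pause and a character-by-character readback looks like this (whether it is honored depends on your platform and voice):

```xml
<speak>
  Thanks, I have your details. <break time="400ms"/>
  Just to confirm, your address is
  <say-as interpret-as="characters">123</say-as> Main Street.
</speak>
```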
Common Pitfalls
Testing Only Happy Paths: Don’t just test scenarios where everything goes perfectly. Test difficult customers, people who object or refuse to give information, people who go off-script, and multiple intents in one call.

Assuming Manual Testing Is Enough: You CANNOT catch all issues with manual testing. You’ll miss rare edge cases, issues with specific accents, problems at scale, and inconsistencies (works 80% of the time).

Over-Engineering Too Early: Start simple. Get the basic flow working well before adding complex branching, advanced features, or CRM integrations. A simple agent that works is better than a complex one that’s unreliable.

Ignoring Low-Priority Test Cases: Just because something is “low priority” doesn’t mean ignore it. If low-priority test cases fail 100% of the time, fix them. They’re still part of the user experience.

Not Documenting Changes: Keep notes on what you change and why. This helps you understand what works, train team members, debug issues later, and build knowledge for future agents.

Checklist: Is My Agent Production-Ready?
Functionality Requirements
- Agent introduces itself correctly 100% of the time
- Agent follows conversation flow step-by-step
- Agent captures all required information
- Agent confirms critical data (name, email, address)
- Agent handles conditional logic correctly
- Agent doesn’t skip questions
- Agent doesn’t give up on difficult customers
Testing Requirements
- Agent scores 70%+ on automated tests consistently
- Tested with at least 5 different personas
- Tested with 20+ different scenarios
- All Critical test cases pass 100%
- All High test cases pass 90%+
- Edge cases are handled gracefully
Quality Standards
- Voice sounds natural and doesn’t glitch
- Latency is under 1 second average
- Words per minute is 150-180 (natural pace)
- Agent sounds friendly and professional
- No awkward pauses or weird inflections
Production Monitoring
- Live monitoring is enabled in Relyable
- Alerts are set up for critical failures
- Team knows how to check performance
- Process for responding to issues is defined
Next Steps
API Documentation
Integrate Relyable into your development workflow
Quick Start
Get your first agent tested in 10 minutes
Dashboard
Access your Relyable workspace
Community
Join other voice AI builders