The Reality of Voice Agent Development
Here’s what typically happens:

- You build a voice agent
- You call it 5-10 times
- It sounds great! Everything works perfectly
- You deploy to production
- It fails 40-50% of the time with real users

Why? Real users bring conditions you never tested:

- Different accents and speaking styles
- Frustrated or stressed users
- Edge cases and unexpected responses
- The scale of hundreds or thousands of calls
Overview: Building an Agent Start-to-Finish
We’ll build a complete inbound voice agent for a real estate company that can handle:

- Property inquiries (buying, selling, renting)
- Lead qualification
- Information capture (name, address, timeline)
- Conditional logic based on user responses
Step 1: Plan Your Agent with a Call Flow
Before writing a single line of prompt, map out your agent’s conversation flow visually. This makes prompt engineering dramatically easier.

Create a Call Flow Diagram

Use a tool like Whimsical, Miro, or any diagramming tool to map out:

- First Message - What the agent says when it answers
- Pathways - Different conversation flows based on user intent
- Questions - Information to capture at each step
- Conditions - Branching logic (if yes → do this, if no → do that)
- End States - How conversations conclude
Example: Real Estate Agent Flow

Buying pathway:

- “Great! Which property were you interested in?”
- “Could you spell out your full name?”
- “Do you have another property to sell first?”
  - If YES → “Would you like our assistance selling it?”
    - If YES → “What’s your current address?”
  - If NO → “Would you like to book a walkthrough?”

Selling pathway:

- “Great! Could you provide your address?”
- “Just to confirm, your address is [repeat address]. Is that correct?”
- “Could you spell out your full name?”
- “What’s your reason for selling?”
- “Have you done any repairs recently?”
- “What’s your timeline for getting this sold?”

Renting pathway:

- “Great! Which property were you interested in?”
- “Could you spell out your full name?”
- “What’s your timeline for moving in?”
- “Would you like to book a walkthrough?”
Step 2: Configure Your Voice Agent Settings
Before building your prompt, set up your agent configuration in Vapi or Retell.

Model Selection

For Vapi:

- GPT-4o - Reliable, fast, good for most use cases (recommended for starting)
- GPT-4o-mini - Cheaper, slightly less capable
- Claude 3.5 Sonnet - Alternative, good reasoning
- GPT-5 (new) - Test carefully before using in production
Voice Selection
Choosing the right voice is critical.

Vapi Voices:

- Pre-tuned for phone calls
- Sound realistic and not glitchy
- Limited selection (~10-15 voices)
- Recommended for production - they’re tested and reliable

ElevenLabs Voices:

- Larger selection
- Can sound more natural
- Warning: Some voices sound great but glitch after hundreds of calls
- Test extensively if using ElevenLabs
Transcriber Settings
For most agents, the defaults work well:

- Deepgram for Vapi (recommended)
- Start with default settings
- Only adjust after you’ve tested and identified specific transcription issues
Step 3: Structure Your Prompt
A well-structured prompt is critical for reliable agent performance.

Recommended Prompt Structure
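As a sketch, such a structure might look like this (the agent name, company, and exact question wording are illustrative, not part of this guide's real estate example):

```markdown
# ROLE
You are Alex, a friendly AI assistant for Sunrise Realty, a real estate company.

# TASK
Qualify inbound callers who want to buy, sell, or rent a property.

# QUESTIONS
1. `Are you calling about buying, selling, or renting?`
2. `Could you spell out your full name?`
3. Once their name has been captured, repeat it back to them letter by letter.
4. `Do you have another property to sell first?`

# RULES
- Follow each question step-by-step in order. Do not skip any questions.
- If YES to question 4, ask: `Would you like our assistance selling it?`

# NOTES
- Read street numbers digit by digit (123 → "one two three").
```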
Key Prompt Engineering Principles
Use Markdown Formatting: AI models are trained on markdown. Using proper formatting helps the AI understand your prompt better:

- Use `# Headers` for major sections
- Use `## Subheaders` for pathways
- Use `**Bold**` for emphasis on key terms
- Use numbered lists for sequential steps
- Use backticks for exact phrases to say
Number Your Questions: Give every question an explicit number and point your branching logic at those numbers (e.g., If YES to question 4, ask: `Would you like our assistance?`). Reference specific question numbers to avoid ambiguity.
Confirm Critical Information: Always repeat back names (spell back letter by letter), addresses (read back with proper pronunciation), email addresses, and any data going into a CRM. This dramatically improves accuracy.
Define Personality in the Role: The ROLE section shapes how the agent speaks. Adding “friendly” or “professional” or “empathetic” actually changes behavior significantly.
Step 4: Test Manually First
Before running automated tests, do a quick manual test:

1. Publish Your Agent - Save and publish your agent in Vapi/Retell.
2. Call It Yourself - Call your agent and go through one pathway completely. Check:
   - Does it read the first message correctly?
   - Does it follow the question order?
   - Does the voice sound good?
   - Is the response latency acceptable?
3. Note Issues - Write down anything that doesn’t work. Don’t try to fix everything yet - just get a baseline.
Step 5: Import to Relyable
Follow the Quick Start guide to:

- Create a Relyable account
- Import your agent
- Generate test cases
- Create personas
Step 6: Generate Comprehensive Test Cases
Test cases evaluate specific behaviors. Aim for 15-25 test cases covering:

Must-Have Test Cases

Identity and Branding:

- Agent introduces with correct name and company
- Agent mentions it’s an AI system
- Agent maintains consistent identity
Conversation Flow:

- Agent asks questions in the correct order
- Agent doesn’t skip required questions
- Agent handles conditional logic correctly
Data Capture:

- Agent captures required information (name, address, email)
- Agent confirms captured information back to user
- Agent spells back name letter by letter
- Agent reads addresses with proper pronunciation
Intent Detection:

- Agent correctly identifies buying vs selling vs renting intent
- Agent follows the correct pathway for each intent
- Agent handles pathway changes (e.g., buyer also needs to sell)
Difficult Situations:

- Agent handles objections (“Why do you need my name?”)
- Agent handles unclear responses
- Agent doesn’t give up on difficult customers prematurely
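For instance, a single test case might be written like this (the field layout is illustrative - adapt it to however you define test cases in Relyable):

```
Test case: Agent spells back the caller's name
Priority: High
Pass criteria: After capturing the name, the agent repeats it back letter by
letter and asks the caller to confirm before continuing.
```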
Setting Test Case Priorities
| Priority | When to Use | Impact on Score |
|---|---|---|
| Critical | Must work 100% of the time (emergency routing, compliance) | Huge impact |
| High | Core functionality (capturing lead info, following flow) | Large impact |
| Medium | Important but not critical (tone, using caller’s name) | Medium impact |
| Low | Nice-to-have (specific phrases, minor details) | Small impact |
Step 7: Create Diverse Personas
Don’t just create “normal” callers. Create personas that stress-test your agent:

Example Personas to Create

The Frustrated Elderly Customer: Stanley Miller, 79-year-old retired mechanic. Frustrated with life, no patience, speaks in short clipped sentences. Gruff and exasperated. Reluctantly calling because his kids are forcing him to sell his house of 50 years.

The Fast-Talking Young Professional: Sarah Chen, 28-year-old tech professional. Speaks very quickly, interrupts frequently, expects immediate answers. Impatient with any delays. Used to chatbots and expects perfect performance.

The Non-Native Speaker: Raj Patel, 45-year-old engineer. Strong Indian accent, speaks slowly and carefully, sometimes struggles with pronunciation. Very polite but needs information repeated sometimes.

The Skeptical Customer: Mike Johnson, 52-year-old business owner. Doesn’t trust AI systems, questions everything, tests whether it’s really AI or human. Asks unexpected questions to trip up the system.

The Confused Caller: Linda Martinez, 67-year-old retiree. Not tech-savvy, easily confused, needs things explained multiple times. Forgets what was just discussed. Very polite but difficult to keep on track.

Step 8: Run Automated Tests at Scale
Now run your tests:

1. Start with 5-10 Scenarios - Create 5-10 different scenarios using your personas:
   - 2-3 buying scenarios
   - 2-3 selling scenarios
   - 2-3 renting scenarios
   - 1-2 complex scenarios (buyer who also needs to sell)
2. Run All at Once - Select all scenarios and run them together. This typically takes 10-20 minutes for 5-10 calls.
3. Wait for Results - Grab coffee. Relyable will call your agent with each scenario and evaluate against all test cases.
Step 9: Analyze Results and Find Issues
Understanding Your Score
After tests complete, you’ll see an overall score:

- 90%+ → Excellent, production-ready
- 70-89% → Good, acceptable for production
- 50-69% → Needs work before production
- Below 50% → Significant issues, not ready
Reviewing Failed Test Cases
Click on failed test cases to see:

- Which calls failed - Sometimes a test case fails on 2/5 calls, not all
- Why it failed - AI explanation of what went wrong
- The exact conversation - Full transcript and audio
- Suggestions - How to fix it in your prompt
Common Failure Patterns
Skipping Questions:

- Symptom: Agent jumps to question 5 without asking questions 3 and 4
- Fix: Add explicit numbering and: “Follow each question step-by-step in order. Do not skip any questions.”

Missing Confirmation:

- Symptom: Agent captures name but doesn’t spell it back
- Fix: Make confirmation explicit: “3. Once their name has been captured, repeat it back to them letter by letter.”

Giving Up Too Easily:

- Symptom: When customer pushes back (“Why do you need that?”), agent says “No worries, have a great day” and ends the call
- Fix: Add handling: “If the caller objects or questions why you need information, briefly explain the reason and ask again gently. Do not end the call unless they explicitly want to hang up.”

Broken Conditional Logic:

- Symptom: Agent asks “Would you like help selling?” even when user said NO to having a property to sell
- Fix: Be more explicit: “If the user responds NO to question 4, skip to question 6 directly.”

Mispronounced Numbers:

- Symptom: Agent says “one hundred twenty-three” instead of “one two three” for address 123
- Fix: Add pronunciation guide in a NOTES section showing how to split numbers and addresses.
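Such a NOTES section might look like the following sketch (the exact wording is illustrative):

```markdown
# NOTES
- Read street numbers digit by digit: `123 Main St` → say "one two three Main Street"
- Read ZIP codes digit by digit: `94103` → say "nine four one zero three"
- Spell names and email addresses character by character when confirming them
```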
Step 10: Iterate and Improve
The key to production-ready agents is iteration:

1. Fix 2-3 Issues at a Time - Don’t try to fix everything. Pick the 2-3 most critical failures and fix those in your prompt.
2. Update in Vapi/Retell - Make your prompt changes in your voice platform, not in Relyable. Relyable is read-only.
3. Sync to Relyable - Click “Sync Prompt” in Relyable to pull the updated prompt.
4. Test Again - Run the same scenarios again. Did your score improve? Did the specific failures get fixed?
5. Repeat Until 70%+ - Keep iterating. Most production-ready agents take 5-10 iterations to get from 50% to 75%+.
Tracking Improvements
| Iteration | Score | Changes Made |
|---|---|---|
| 1 | 51% | Baseline |
| 2 | 58% | Added step-by-step ordering |
| 3 | 63% | Added name confirmation |
| 4 | 69% | Fixed conditional logic |
| 5 | 74% | Added objection handling |
| 6 | 78% | Production Ready |
Step 11: Enable Live Monitoring
Once you reach 70%+ consistently:

1. Enable Call Monitoring - In Relyable, go to Agent Settings → Enable Call Monitoring.
2. Deploy to Production - Your agent is now ready for real customers.
3. Monitor Performance - Every production call is evaluated against your test cases. You’ll see:
   - Real-time scores
   - Which test cases are failing in production
   - Trends over time (is quality improving or degrading?)
4. Set Up Alerts - Get notified when:
   - Score drops below 70%
   - Critical test cases fail
   - Specific issues occur multiple times
Prompt Engineering Best Practices
Use Markdown Effectively
Good Example: a prompt laid out as described in Step 3 - `#` headers for sections, numbered questions, and backticked exact phrases.

Preview Your Prompt

Use Markdown Live Preview to see how your prompt renders. This shows you how the AI “sees” your prompt.

Be Explicit About Everything
Don’t assume the AI will “figure it out.” Be explicit about:

- Question order
- What to say exactly (use backticks)
- When to say it
- What NOT to do
Test One Change at a Time
If you change 5 things and the score improves, you don’t know which change helped. Change 1-2 things per iteration.

Use Real Call Examples

When you find a failure in testing, paste the transcript into your prompt as an example of what not to do, alongside the response you wanted instead.

Advanced: Data Capture and CRM Integration
Once your agent is reliable, integrate it with your CRM.

Capture Data with Functions

In Vapi, set up functions to capture:

- Name
- Phone number
- Address
- Other lead information
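As a sketch, a capture function might be declared with an OpenAI-style schema (the function name and field set here are illustrative - check your platform's tool documentation for the exact format it expects):

```json
{
  "name": "capture_lead",
  "description": "Save lead details after they have been confirmed with the caller",
  "parameters": {
    "type": "object",
    "properties": {
      "name": { "type": "string", "description": "Full name, spelled back and confirmed" },
      "phone": { "type": "string", "description": "Callback number" },
      "address": { "type": "string", "description": "Confirmed street address" }
    },
    "required": ["name", "phone"]
  }
}
```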
Confirm Before Sending
Always confirm information before sending to CRM:Test the Integration
Run automated tests with the CRM integration enabled. Verify:

- Data is captured correctly
- Data is confirmed before sending
- Invalid data doesn’t get sent
- The agent handles CRM errors gracefully
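The middle two checks can also be enforced outside the prompt. Here is a minimal Python sketch (the field names and `confirmed` flag are assumptions for illustration, not a Relyable or Vapi API) of a guard that keeps incomplete or unconfirmed leads out of the CRM:

```python
# Illustrative sketch: gate CRM writes on the checks listed above -
# required fields present, phone number sane, data confirmed by the caller.
import re

REQUIRED_FIELDS = ("name", "phone", "address")

def is_valid_lead(lead: dict) -> bool:
    """Return True only for complete, confirmed leads."""
    for field in REQUIRED_FIELDS:
        if not str(lead.get(field, "")).strip():
            return False  # incomplete data never reaches the CRM
    digits = re.sub(r"\D", "", str(lead["phone"]))
    if len(digits) < 7:  # loose sanity check, not full validation
        return False
    return bool(lead.get("confirmed"))  # caller must have confirmed the data

lead = {"name": "Jane Doe", "phone": "555-0123",
        "address": "123 Main St", "confirmed": True}
print(is_valid_lead(lead))  # True
```

A guard like this gives you a hard backstop even when the prompt occasionally fails the confirmation step.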
Performance Optimization
Reducing Latency
If your agent is too slow:

- Use a faster model - Try GPT-4o-mini instead of GPT-4o
- Shorten your prompt - Remove unnecessary details
- Reduce tool calls - Fewer function calls = faster responses
- Use Vapi’s latest features - They constantly optimize latency
Improving Speech Quality
If speech sounds robotic or glitchy:

- Switch voices - Some voices are more reliable
- Adjust WPM (words per minute) - 150-180 is natural
- Add punctuation to responses - Helps with cadence
- Use SSML tags - For pauses and emphasis (platform-dependent)
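For example, a standard SSML fragment for a pause and a character-by-character readback looks like this (whether it is honored depends on your platform and voice):

```xml
<speak>
  Thanks, I have your details. <break time="400ms"/>
  Just to confirm, your address is
  <say-as interpret-as="characters">123</say-as> Main Street.
</speak>
```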
Common Pitfalls
Testing Only Happy Paths: Don’t just test scenarios where everything goes perfectly. Test difficult customers, people who object or refuse to give information, people who go off-script, and multiple intents in one call.

Assuming Manual Testing Is Enough: You CANNOT catch all issues with manual testing. You’ll miss rare edge cases, issues with specific accents, problems at scale, and inconsistencies (works 80% of the time).

Over-Engineering Too Early: Start simple. Get the basic flow working well before adding complex branching, advanced features, or CRM integrations. A simple agent that works is better than a complex one that’s unreliable.

Ignoring Low-Priority Test Cases: Just because something is “low priority” doesn’t mean ignore it. If low-priority test cases fail 100% of the time, fix them. They’re still part of the user experience.

Not Documenting Changes: Keep notes on what you change and why. This helps you understand what works, train team members, debug issues later, and build knowledge for future agents.

Checklist: Is My Agent Production-Ready?
Functionality Requirements
- Agent introduces itself correctly 100% of the time
- Agent follows conversation flow step-by-step
- Agent captures all required information
- Agent confirms critical data (name, email, address)
- Agent handles conditional logic correctly
- Agent doesn’t skip questions
- Agent doesn’t give up on difficult customers
Testing Requirements
- Agent scores 70%+ on automated tests consistently
- Tested with at least 5 different personas
- Tested with 20+ different scenarios
- All Critical test cases pass 100%
- All High test cases pass 90%+
- Edge cases are handled gracefully
Quality Standards
- Voice sounds natural and doesn’t glitch
- Latency is under 1 second average
- Words per minute is 150-180 (natural pace)
- Agent sounds friendly and professional
- No awkward pauses or weird inflections
Production Monitoring
- Live monitoring is enabled in Relyable
- Alerts are set up for critical failures
- Team knows how to check performance
- Process for responding to issues is defined
Next Steps
API Documentation
Integrate Relyable into your development workflow
Quick Start
Get your first agent tested in 10 minutes
Dashboard
Access your Relyable workspace
Community
Join other voice AI builders