With the advancement of generative AI, we have witnessed significant growth in chatbot products that dominate the market. Simultaneously, voice AI has improved to the extent that smooth conversations with AI are now feasible. Whether you are building AI for inbound and outbound calls, professional services, companion apps, or something else, voice remains a core part of the experience and is important for conversion. We can all recall frustrating experiences with AI during calls: robotic voices, awkward silences, long delays, and the need to press buttons to interact. Together these make the experience feel less human and irritate users.
Before we jump right into how to build a great voice experience, let's take a moment to recap how humans usually interact in a conversation. We take turns with under 200 ms of latency, we backchannel as needed, we subconsciously sense when the other party has finished their turn, we understand the meaning and emotions of the other party, we use filler words within our sentences, we stop talking when interrupted... The list goes on, but the essential point is that many small mechanisms are at work behind the scenes of even a simple, smooth conversation, and it is extremely HARD for machines to account for all of them and perform like humans.
One question we get asked a lot is: why do I have to use the Retell API? Can't I just stitch ASR (speech-to-text), an LLM, and TTS (text-to-speech) together to build a voice conversation?
Well, hmm, you totally should if you have the time, and see how far a simple stitching approach can get you. The number one problem we hear from those who build their own voice system is that it's hard to cut latency down; the number two problem is that interruption handling is hard to implement with a simple setup; the number three problem is that the agent's responses are not conversational enough to sound human. To tackle all these, let's walk through the components that need to be there and the work that needs to be done for a good conversational voice AI experience.
1. Integrate with a web frontend or programmable communication tools like Twilio or Vonage to get user audio.
2. Work with audio bytes and streaming protocols: User audio from various frontends (web, phone call) comes in different encodings and formats, and is sent over different streaming protocols. This is a strenuous task, as audio bytes are hard to manipulate and time-consuming to work with. Ask any engineer you know who works with audio signals; they will tell you the same.
3. Understand the audio: There are various signals from audio that are vital for a smooth conversation.
4. Decide whether to speak: Understand whether the other party is about to finish their turn or has already finished it, and whether they are awaiting a response or just pausing to formulate their thoughts. This decision needs to combine text, emotion, tonality, pauses, and other audio input.
5. Generate the response: Generating a good response to what the user has said is hard and very scenario-specific. There are various ways to do this part and it is customized for each use case, so here I will just share one simple flow of response generation.
6. Synthesize the audio: Transform the response text into audio, usually with a TTS (text-to-speech) model. To sound humanlike, the voice needs tone and emotion variance that suits the scenario. Ideally, the TTS output should be streamed back for lower latency.
7. Take actions: AI that can talk is cool, and AI that can take actions is cooler. This is usually achieved with the function-calling capabilities of certain models, or with structured data output, so that downstream systems can book appointments or look up information when appropriate.
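To make step 4 a bit more concrete, here is a minimal Python sketch of a naive, rule-based "should I speak now?" check. It is an illustration under loud assumptions, not how Retell AI or any production system does it: the names `TurnState`, `looks_complete`, and `should_respond`, the filler-word list, and both pause thresholds are hypothetical, and it only looks at transcript text and silence duration. A real system would also weigh tonality, emotion, and other audio signals.

```python
from dataclasses import dataclass

# Illustrative assumptions only: these words and thresholds are not tuned values.
FILLER_ENDINGS = {"um", "uh", "so", "and", "but", "because", "like"}
SHORT_PAUSE_MS = 300    # pause that is enough when the sentence looks finished
LONG_PAUSE_MS = 1200    # pause after which we respond even to an unfinished sentence


@dataclass
class TurnState:
    transcript: str   # partial transcript of the user's current turn
    silence_ms: int   # time since the user last produced speech


def looks_complete(transcript: str) -> bool:
    """Guess from text alone whether the utterance reads as finished."""
    text = transcript.strip().lower()
    words = text.rstrip(".?!,").split()
    if not words:
        return False
    # A trailing filler or conjunction suggests the speaker is still thinking.
    if words[-1] in FILLER_ENDINGS:
        return False
    # Relies on the ASR emitting end-of-sentence punctuation.
    return text.endswith((".", "?", "!"))


def should_respond(state: TurnState) -> bool:
    """Combine the text cue with the pause length to decide whether to speak."""
    if not state.transcript.strip():
        return False  # the user has not said anything yet
    if state.silence_ms >= LONG_PAUSE_MS:
        return True   # long silence: assume the turn is over
    return state.silence_ms >= SHORT_PAUSE_MS and looks_complete(state.transcript)


if __name__ == "__main__":
    print(should_respond(TurnState("What time do you open?", 350)))
    print(should_respond(TurnState("I was thinking that, um", 350)))
```

Even this toy version shows the trade-off: a short pause threshold keeps latency low but risks talking over someone who is merely pausing, while a long one is safer but produces the awkward silences users complain about.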
I think by now, most folks would agree that this is not as easy as stitching together ASR, LLM, TTS. Thus, let me (shamelessly) introduce how Retell AI can help here. By integrating with Retell AI, you can save months of development, enjoy a state-of-the-art voice experience, and get all the following covered:
What you need to do: keep iterating on your core product to make it better, while we take care of the audio part. Here are the parts you need to work on:
I hope this post gives you a high-level idea of how to build a great voice agent, and hopefully (and shamelessly) my pitch for Retell AI sheds light on how we can help.
Happy building!