Written by Jane Clarkson
You tell your phone to set a reminder, ask your smart speaker to change the song, or speak a movie title into your remote—no more tapping through tedious menus and on-screen keyboards. Voice-user interfaces (VUI) have transitioned from niche alternatives to essential components of consumer technology. But when did this transformation begin? What’s next for the industry, where does artificial intelligence (AI) come into play, and how will it impact consumers? To answer these questions, we turn to Voice AI engineering expert Manoj Boopathi Raj, a Senior Software Engineer at Google. With a decade of experience driving breakthroughs in voice recognition and VUI, Mr. Boopathi Raj offers a unique perspective on this technological shift.
Please tell us a little more about yourself and how AI has played a role in your career.
I’ve always been fascinated by how quickly technology is molded by our needs—and AI is perhaps the best example. Many people don’t realize that long before large language models (LLMs) became headline news, we were already using machine learning, only in much more mundane ways. Any software that needs to scale and operate with minimal human intervention likely has some AI in its engineering. Optimizing Google Fi cell network coverage, classifying spam uploads on YouTube—I’ve had the privilege of proposing and leading projects that have employed these algorithmic solutions for years now. The current race to create more intuitive interfaces is a natural progression, and it’s thrilling to be part of it.
OpenAI’s GPT-4o comes to mind. Its voice mode created quite a buzz when OpenAI itself raised concerns about users forming an “emotional reliance” on it, precisely because of how effective it is. Can you shed a little more light on the history of voice-user interfaces?
Absolutely. This is just the latest milestone in a journey that began decades ago. It seems primitive now, but older readers will remember that the earliest widespread use of VUI, and arguably of consumer “AI,” was dictation software in the late ’90s, like Dragon NaturallySpeaking—products where you spoke, the computer listened, then interpreted the input into text. We rarely think back on them today because they were clunky, demanded unnaturally slow, deliberate enunciation, and required a lot of manual correction—that is, they failed to address the needs of their users. It wasn’t until a decade later, with the introduction of Apple’s Siri and then Google Assistant, that VUI began to meet those needs more effectively. Fast forward to today, and VUI systems like Google Assistant have become indispensable, expected to handle complex tasks with near-perfect accuracy. The recent excitement around OpenAI’s GPT-4o reminded both technologists and consumers how compelling it is to communicate with our technology in the same way we communicate with one another—through voice.
As a senior engineer on the Google Assistant team, did these big leaps in VUI become apparent in your work?
Google Assistant is a perfect case study for the evolution of VUI. While I’ve led efforts on the mobile side of the product, namely improvements in its natural language processing (NLP) abilities, my work in automotive environments is where the potential impact of VUI becomes most apparent. Some might be more familiar with the name Android Auto.
We were initially faced with all the distractions and noises a human driver might be used to: engine sounds, passenger conversations, their own music—all far more disruptive for a machine, which depends on clean audio signals. I first focused on developing a robust data collection infrastructure—because exhaustive big data is everything when it comes to training AI—and then on fine-tuning the speech models so the VUI could handle every imaginable scenario. The result was a spectacular 50% average reduction in word error rate across six languages, and that was just one initiative.
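Editor’s note: for readers unfamiliar with the metric, word error rate (WER) is the standard yardstick for speech recognition quality—the word-level edit distance (substitutions, insertions, and deletions) between what was said and what the system transcribed, divided by the length of the reference transcript. The sketch below is a minimal, generic illustration of how WER is computed, not Google’s production tooling, and the sample utterance is made up.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance, computed over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# A hypothetical noisy in-car utterance misheard by the recognizer:
# "nearest" became "near his" (one substitution plus one insertion),
# so WER = 2 errors / 6 reference words ~= 0.33.
print(wer("navigate to the nearest gas station",
          "navigate to the near his gas station"))
```

A “50% average reduction” in this metric means the models made roughly half as many word-level mistakes after fine-tuning as before, averaged across the six languages.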
The Android Automotive OS, now installed in over 200 million cars worldwide, really shows how VUI is becoming indispensable in everyday tech. This isn’t just about convenience; it’s about ensuring that these systems function reliably in the real world, where the stakes—especially in automotive applications—are incredibly high. Android Auto’s VUI is featured in vehicles from leading OEMs, so the market will continue to see the technology deployed, if not expanded upon, in the coming years.
On that note, what are your thoughts on manufacturers moving towards more voice user interfaces? Or, phrased differently, how do you see VUI affecting consumers as it becomes the new standard?
The trend towards VUI coincides, and not accidentally, with the rise of autonomous vehicles and the prevalence of LLMs. Trusting an AI to drive a car is a bold proposition, and VUI may play a critical role in building that trust in the not-too-distant future of the automotive industry. Consumers need to feel confident that the AI can accurately listen to both the environment and their commands, understand context, and make decisions as well as—or better than—a human driver. The skepticism is reasonable, and the stakes are high—if one system fails, it can undermine trust in the entire industry. That trust is very hard to earn back.
The way I see it, VUI has the potential to bridge the gap between current skepticism and future trust in AI. As these systems become more robust, they provide an opportunity to introduce AI in discreet ways that feel natural and intuitive. We want to reduce the learning curve and build consumer confidence. As a bonus, they offer an unprecedented chance to create accessibility options without requiring users to adapt to new, unfamiliar interfaces—simply say what you need or ask a question. Even if we never fully hand over the wheel, VUI provides an interface that makes the technology behind it more approachable.
Looking even further ahead, the potential for VUI goes beyond driving or smart home commands. VUI and AI together could revolutionize how we access information and receive personalized services. Healthcare is a potent example: VUI could empower patients to manage their immediate needs through voice commands, lowering the barriers to critical information or urgent support. In education, it could create new opportunities for learning, providing students with personalized, voice-driven tutoring and feedback tailored to individual needs. The possibilities are there, if we’re prepared to make them a reality. The biggest challenge for the industry today is making the technology more reliable and human-centric. It isn’t that people are afraid of speaking to their tech—they’re just afraid of not being heard.
Readers can see some of the work you’re doing to make technical information more accessible on Hackernoon and DZone. Thank you for your time, Mr. Boopathi Raj. We look forward to seeing the industry evolve with engineers like yourself paving the way.
Thank you. I’m excited to be part of this journey, and I look forward to what the future has to say.