Researchers at Cornell University have developed EchoSpeech, a silent-speech recognition interface that uses acoustic sensing and artificial intelligence to continuously recognize up to 31 unvocalized commands based on lip and mouth movements. This low-power, wearable interface runs on a smartphone and requires only a few minutes of user training data before it can recognize commands.
Ruidong Zhang, a doctoral student of information science, is the lead author of “EchoSpeech: Continuous Silent Speech Recognition on Minimally-obtrusive Eyewear Powered by Acoustic Sensing,” which will be presented at the Association for Computing Machinery Conference on Human Factors in Computing Systems (CHI) this month in Hamburg, Germany.
“For people who cannot vocalize sound, this silent speech technology could be an excellent input for a voice synthesizer. It could give patients their voices back,” Zhang said, highlighting the technology’s potential applications with further development.
Real-World Applications and Privacy Advantages
In its current form, EchoSpeech could be used to communicate with others via smartphone in places where speech is inconvenient or inappropriate, such as noisy restaurants or quiet libraries. The silent speech interface can also be paired with a stylus and used with design software like CAD, significantly reducing the need for a keyboard and a mouse.
Equipped with microphones and speakers smaller than pencil erasers, the EchoSpeech glasses function as a wearable AI-powered sonar system, sending and receiving sound waves across the face and detecting mouth movements. A deep learning algorithm then analyzes these echo profiles in real time with approximately 95% accuracy.
“We’re moving sonar onto the body,” said Cheng Zhang, assistant professor of information science and director of Cornell’s Smart Computer Interfaces for Future Interactions (SciFi) Lab.
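The article does not describe the signal processing behind these echo profiles, so the following is only a generic sonar-style sketch, not the EchoSpeech implementation. It assumes a short near-ultrasonic chirp is played from a speaker, that the microphone picks up delayed and attenuated copies of it, and that an "echo profile" can be approximated by cross-correlating the recording with the chirp. The sample rate, chirp frequencies, delays, and gains are all hypothetical values chosen for illustration.

```python
import math

def make_chirp(n_samples, f_start, f_end, sample_rate):
    """Linear frequency sweep used as the transmitted probe signal."""
    duration = n_samples / sample_rate
    sweep_rate = (f_end - f_start) / duration
    chirp = []
    for i in range(n_samples):
        t = i / sample_rate
        chirp.append(math.sin(2 * math.pi * (f_start * t + 0.5 * sweep_rate * t * t)))
    return chirp

def simulate_received(chirp, echoes, total_len):
    """Sum delayed, attenuated copies of the chirp (hypothetical echo paths)."""
    received = [0.0] * total_len
    for delay, gain in echoes:
        for i, sample in enumerate(chirp):
            if delay + i < total_len:
                received[delay + i] += gain * sample
    return received

def echo_profile(received, chirp):
    """Cross-correlate the recording with the chirp: one value per candidate delay."""
    n = len(chirp)
    return [
        sum(received[lag + i] * chirp[i] for i in range(n))
        for lag in range(len(received) - n + 1)
    ]

# Hypothetical parameters, not taken from the article.
SAMPLE_RATE = 48_000
chirp = make_chirp(128, 16_000, 22_000, SAMPLE_RATE)

# Two assumed reflections: a strong one at 20 samples, a weak one at 70.
received = simulate_received(chirp, echoes=[(20, 1.0), (70, 0.3)], total_len=256)

profile = echo_profile(received, chirp)
strongest_delay = max(range(len(profile)), key=lambda k: profile[k])
print("strongest echo at delay (samples):", strongest_delay)
```

In a system like the one described, a sequence of such profiles captured over time would presumably be what the deep learning model receives as input; as the mouth moves, the delays and strengths of the reflections shift, and the model learns to map those changes to commands.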
Existing silent-speech recognition technology typically relies on a limited set of predetermined commands and requires the user to face or wear a camera. Cheng Zhang explained that this is neither practical nor feasible and also raises significant privacy concerns for both the user and those they interact with.
EchoSpeech’s acoustic-sensing technology eliminates the need for wearable video cameras. Moreover, since audio data is smaller than image or video data, it requires less bandwidth to process and can be transmitted to a smartphone via Bluetooth in real time, according to François Guimbretière, professor of information science.
“And because the data is processed locally on your smartphone instead of uploaded to the cloud,” he said, “privacy-sensitive information never leaves your control.”