Google’s Pixel Buds have long offered impressive real-time translation. More recently, companies like Timkettle have launched similar earbuds aimed at business users. All of these devices, however, can translate only one audio source at a time.
Researchers at the University of Washington have created a groundbreaking set of AI-powered headphones that can translate multiple voices simultaneously. Imagine an expert linguist in a busy bar, effortlessly processing conversations in several languages all at once.
This innovation, known as Spatial Speech Translation, relies on binaural headphones, which reproduce sound the way human ears naturally receive it. To capture that effect, microphones are mounted on a dummy head at roughly the spacing of human ears.
The approach matters because it lets us not only hear sounds but also tell which direction they come from. The goal is a natural sound field, a stereo experience akin to sitting in a live concert, an effect referred to today as spatial listening.
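To make the directional idea concrete, here is a minimal sketch, written for illustration and not drawn from the UW system, of how a pair of ear signals can reveal a source’s direction: cross-correlating the left and right channels gives the interaural time difference, and its sign tells you which ear the sound reached first.

```python
import numpy as np

def estimate_itd(left, right, sample_rate, max_itd_s=0.0008):
    """Estimate the interaural time difference (ITD) between two ear signals
    via cross-correlation. A positive value means the sound reached the left
    ear first, i.e. the source sits to the listener's left."""
    max_lag = int(max_itd_s * sample_rate)            # ~0.8 ms spans a human head width
    corr = np.correlate(right, left, mode="full")     # lag > 0 => right channel lags behind left
    lags = np.arange(-(len(left) - 1), len(right))
    plausible = (lags >= -max_lag) & (lags <= max_lag)
    best_lag = lags[plausible][np.argmax(corr[plausible])]
    return best_lag / sample_rate

# Example: a tone whose arrival at the right ear is 5 samples (~0.3 ms) late.
sr = 16_000
tone = np.sin(2 * np.pi * 440 * np.arange(0, 0.05, 1 / sr))
delay = 5
left = np.pad(tone, (0, delay))
right = np.pad(tone, (delay, 0))
print(f"estimated ITD: {estimate_itd(left, right, sr) * 1e3:.2f} ms")  # ~0.31 ms
```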
Led by Professor Shyam Gollakota, the team has worked on various impressive projects, including apps that provide underwater GPS for smartwatches and brain implants that interact with electronic devices.
How does multi-speaker translation work?
“We’re capturing the uniqueness of each person’s voice along with their directional speech for the first time,” notes Gollakota, a professor in the Paul G. Allen School of Computer Science & Engineering.
The system functions like radar, detecting the number of speakers in its vicinity and dynamically updating that count as individuals move in and out of range. Impressively, this process operates entirely on-device, ensuring privacy by not sending audio data to cloud servers.
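As a rough illustration of that radar-like counting, the toy tracker below (hypothetical, not the UW implementation) merges per-frame direction-of-arrival estimates into speaker tracks and drops tracks that go silent, so the count rises and falls as people move in and out of range.

```python
from dataclasses import dataclass, field

@dataclass
class SpeakerTracker:
    """Toy sketch of radar-style speaker counting: each audio frame yields
    direction-of-arrival estimates in degrees; nearby angles are merged into
    one track, and tracks that stay silent too long are dropped."""
    angle_tolerance_deg: float = 15.0
    expiry_frames: int = 50
    tracks: dict = field(default_factory=dict)   # angle -> frames since last detection

    def update(self, detected_angles):
        # Age every existing track by one frame.
        self.tracks = {a: age + 1 for a, age in self.tracks.items()}
        for angle in detected_angles:
            match = next((a for a in self.tracks
                          if abs(a - angle) <= self.angle_tolerance_deg), None)
            if match is None:
                self.tracks[angle] = 0           # a new speaker entered the scene
            else:
                self.tracks[match] = 0           # refresh an existing speaker
        # Forget speakers that have been silent for too many frames.
        self.tracks = {a: age for a, age in self.tracks.items()
                       if age <= self.expiry_frames}
        return len(self.tracks)

tracker = SpeakerTracker()
print(tracker.update([-40.0, 10.0]))   # two speakers detected -> 2
print(tracker.update([12.0]))          # the 10-degree speaker shifts slightly -> still 2
```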
Along with translating speech, the technology preserves the tonal qualities and loudness of each speaker’s voice, and it adjusts the rendered audio as a speaker moves around the space. Interestingly, Apple is reportedly developing a similar real-time audio translation feature for AirPods.
How does it all come to life?
In testing, the UW team evaluated the headphones’ translation in a range of indoor and outdoor settings. The system processes and outputs translated speech within 2 to 4 seconds; participants tended to prefer a delay of around 3 to 4 seconds, though the team is still working to speed up translation.
While initial tests have focused on Spanish, German, and French, the researchers aim to add more languages. Their approach combines blind source separation, localization, real-time expressive translation, and binaural sound rendering into a single pipeline.
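To show how those four stages might chain together, here is a heavily simplified sketch with stand-in functions; every name is hypothetical and each stage is a trivial placeholder for what would really be an on-device model. The point is only the data flow: separate the binaural mixture into per-speaker streams, localize and translate each one, then render the translated voice back at its estimated position.

```python
import numpy as np

# --- trivial stand-ins so the sketch runs; the real components are ML models ---
def separate_sources(left, right):
    """Stand-in for blind source separation: pretend the mixture holds one source."""
    return [0.5 * (left + right)]

def localize(source, left, right, sample_rate):
    """Stand-in for localization: a real system would estimate azimuth from
    interaural cues; here we simply report 'straight ahead' (0 degrees)."""
    return 0.0

def translate_speech(source, target_lang):
    """Stand-in for expressive speech-to-speech translation (identity here)."""
    return source

def render_binaural(audio, azimuth_deg):
    """Stand-in for binaural rendering: pan by a crude level difference."""
    pan = np.clip(azimuth_deg / 90.0, -1.0, 1.0)   # -1 = hard left, +1 = hard right
    return (1.0 - pan) / 2.0 * audio, (1.0 + pan) / 2.0 * audio

def spatial_translation_frame(left_mix, right_mix, sample_rate):
    """Chain the four stages: separate -> localize -> translate -> re-render."""
    out_left = np.zeros_like(left_mix, dtype=float)
    out_right = np.zeros_like(right_mix, dtype=float)
    for source in separate_sources(left_mix, right_mix):
        azimuth = localize(source, left_mix, right_mix, sample_rate)
        translated = translate_speech(source, target_lang="en")
        l, r = render_binaural(translated, azimuth)
        out_left += l
        out_right += r
    return out_left, out_right

sr = 16_000
frame = np.random.randn(sr)          # one second of fake binaural audio
left_out, right_out = spatial_translation_frame(frame, frame, sr)
print(left_out.shape, right_out.shape)
```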
The team used a speech translation model capable of running in real time on Apple’s M2 silicon, pairing Sony’s noise-cancelling WH-1000XM4 headphones for playback with a Sonic Presence SP15C binaural USB microphone for capture.
Moreover, the code for this proof-of-concept device is available for others to explore, enabling the scientific community and hobbyists to build on the groundwork laid by the UW team.