Meta and a group of researchers from the University of Texas at Austin (UT Austin) are working on bringing realistic audio to the metaverse.
As Kristen Grauman, Research Director at Meta AI, explains, there’s more to augmented and virtual reality (AR and VR, respectively) than just visuals. Audio plays a very important role in making a world feel alive. Grauman says “audio is shaped by the environment that [it’s] in.” Several factors influence how sound behaves, such as the geometry of a room, what’s in that room, and how far someone is from a source.
To achieve realistic audio, Meta plans to use AR glasses to record both audio and video from one location, then use a set of three AI models to transform and clean the recording so it feels like it's happening in front of you when you play it back at home. The AIs will take into account the room that you're in so the audio can match the environment.
Looking at the projects, it appears Meta is focusing on AR glasses. Meta's plan for VR headsets includes replicating the sights and sounds of an environment, like a concert, so it feels like you're there in person.
We asked Meta how people can listen to the enhanced audio. Will people need a pair of headphones to listen, or will it come from the headset? We didn't get a response.
We also asked Meta how developers can get hold of these AI models. They've been made open source so third-party developers can work on the tech, but Meta didn't offer any further details.
Transformed by AI
The question is how Meta can record audio on a pair of AR glasses and have it reflect a new setting.
The first solution is known as AViTAR, which is a “Visual Acoustic Matching model.” This is the AI that transforms audio to match a new environment. Meta offers the example of a mother recording her child’s dance recital at an auditorium with a pair of AR glasses.
One of the researchers claims that the mother in question can take that recording and play it back at home on the same glasses, where the AI will morph the audio. It'll scan the environment, take into account any obstacles in the room, and make the recital sound like it's happening right in front of her. The researcher states the audio will come from the glasses.
To help clean up audio, there is Visually-Informed Dereverberation. Basically, it removes distracting reverb from the clip. The example given is recording a violin concert at a train station, taking it home, and having the AI clean up the clip so you hear nothing but music.
The last AI model is VisualVoice, which uses a combination of visual and audio cues to separate voices from other noises. Imagine recording a video of two people arguing. This AI will isolate one voice so you can understand that person while silencing everything else. Meta explains that visual cues are important because the AI needs to see who’s talking in order to pick up on certain nuances.
In relation to visuals, Meta states it plans on bringing in video and other cues to further enhance AI-driven audio. Since this technology is still early in development, it’s unknown if and when Meta will bring these AIs to a Quest headset near you.