Unveiling the Astonishing Power of AI: Extracting Sound from Silent Videos and Images
The idea of extracting audio from a static image might sound like something out of a science fiction novel, but thanks to the power of artificial intelligence (AI), a group of scientists has turned this concept into a reality.
Led by Kevin Fu, a professor of electrical and computer engineering and computer science at Northeastern University, the team has created a machine learning tool called Side Eye that recovers information about sound from images.
Functionality:
Side Eye can analyze a still image and infer several things about the sound that was present when it was taken. It can determine the gender of a speaker in the room where the photo was taken, transcribe spoken words, and even pinpoint the location where the image was captured. The tool also works on muted videos, which extends its utility.
Kevin Fu shared his thoughts on Side Eye, explaining its potential applications:
“Imagine someone is recording a TikTok video and they’ve muted the audio, overlaying it with music. Have you ever been curious about what they’re really saying? Is it ‘watermelon, watermelon,’ or perhaps something more sensitive like ‘here’s my password’? Was there someone speaking behind them? With Side Eye, you can uncover what’s being said off-camera.”
Operation:
Side Eye combines machine learning with a side effect of the image stabilization technology commonly found in smartphone cameras.
Modern smartphone cameras suspend the lens on springs immersed in liquid so that photos stay sharp even when the photographer's hand is unsteady. Sensors detect movement, and an electromagnet counteracts it by shifting the lens in the opposite direction, which stabilizes the image.
When someone speaks close to the camera lens while a photo is being captured, the sound produces subtle vibrations in the springs, and those vibrations slightly alter the path of light. Extracting audio frequencies from these vibrations might seem like an insurmountable challenge, but it becomes possible because of the rolling shutter, the row-by-row readout method used by most cameras.
As Fu explains, “Cameras today operate by not capturing all pixels of an image simultaneously; they do it one row at a time, hundreds of thousands of times in a single photo. This effectively allows you to amplify the amount of frequency information you can obtain, essentially enhancing the granularity of the audio.”
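The effect Fu describes can be illustrated with a small, idealized sketch. It is not Side Eye's actual pipeline: the function names, the 480-row sensor, the 30 frames-per-second rate, and the 450 Hz test tone are all assumptions chosen for illustration, and the model ignores the pause between frames and every source of noise. It treats each sensor row as one time sample of the lens vibration and shows that a tone far above what whole frames could capture is still recoverable from the rows.

```python
import cmath
import math


def row_sample_rate(rows_per_frame, frames_per_second):
    """Samples per second if each sensor row counts as one time sample."""
    return rows_per_frame * frames_per_second


def simulate_row_shifts(tone_hz, rows_per_frame, frames_per_second, n_frames=1):
    """Idealized per-row image shift caused by a lens vibrating at tone_hz.

    Assumes rows are read back to back at a constant rate (no blanking gap).
    """
    rate = row_sample_rate(rows_per_frame, frames_per_second)
    n = rows_per_frame * n_frames
    return [math.sin(2 * math.pi * tone_hz * i / rate) for i in range(n)]


def dominant_frequency(samples, rate):
    """Return the strongest frequency in samples using a plain DFT."""
    n = len(samples)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2):  # skip the DC bin
        coeff = sum(
            s * cmath.exp(-2j * math.pi * k * i / n)
            for i, s in enumerate(samples)
        )
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin * rate / n


if __name__ == "__main__":
    rows, fps, tone = 480, 30, 450  # hypothetical values
    rate = row_sample_rate(rows, fps)
    shifts = simulate_row_shifts(tone, rows, fps)
    print("frame-level Nyquist limit (Hz):", fps / 2)
    print("row-level sample rate (Hz):", rate)
    print("recovered tone (Hz):", dominant_frequency(shifts, rate))
```

With these assumed numbers, sampling once per frame could only represent frequencies up to 15 Hz, while sampling once per row gives 14,400 samples per second, enough to recover the 450 Hz tone from a single frame. A real system would first have to estimate the per-row shifts from image content, which this sketch takes as given.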
Implications:
While Side Eye is still in its early stages and requires substantial training data to improve and reach its full potential, there are both potential benefits and risks associated with its development.
In the wrong hands, an advanced version of Side Eye could pose a significant cybersecurity threat. However, there are also promising prospects for this technology, especially if an advanced iteration were to be employed as a digital tool for law enforcement agencies in crime investigations, providing valuable digital evidence.
This groundbreaking technology has the potential to reshape how we analyze and interpret visual content, with a wide range of applications in various fields, from security to entertainment.