Microsoft recently unveiled its cutting-edge text-to-speech AI language model VALL-E, which it claims can mimic any voice — including its emotional tone, vocal timbre and even the background noise — after training using just three seconds of audio.
The researchers envision VALL-E working as a high-quality text-to-speech synthesizer, as well as a speech editor that could doctor audio recordings to include phrases not originally said. Coupled with generative AI models like OpenAI’s GPT-3, the developers say VALL-E could even be used in original audio content creation.
The development has some experts sounding alarm bells over the technology’s implications for misuse; through VALL-E and other generative AI programs, malicious actors could mass-produce audio-based disinformation at unprecedented scales, sources say.
How does VALL-E work?
Unlike previous speech synthesizers, most of which work by modulating waveforms to sound just like human speech, VALL-E functions by analyzing a brief voice sample to generate the most probable representation of what that voice might sound like, based on its hundreds of hours of training data, according to Microsoft’s paper.
To provide enough data to match almost any voice sample possible, VALL-E was trained on a whopping 60,000 hours of speech from over 7,000 unique speakers, using Meta’s LibriLight audio library. By comparison, current text-to-speech systems average less than 600 hours of training data, the authors wrote.
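To make the approach described above concrete, here is a minimal, illustrative sketch of the pipeline in Python: a short voice prompt is compressed into discrete tokens, a language model predicts the most probable continuation of those tokens given the target text, and the result is decoded back into audio. Every name in the snippet (SpeechCodec, TokenLM, synthesize) is a hypothetical stand-in for the components the paper describes, not Microsoft’s actual VALL-E code or API.

```python
# Illustrative sketch only: SpeechCodec, TokenLM and synthesize are hypothetical
# stand-ins for the components described in the paper, not Microsoft's VALL-E code.
from dataclasses import dataclass
from typing import List


@dataclass
class SpeechCodec:
    """Stand-in for a neural audio codec that turns waveforms into discrete tokens."""

    def encode(self, waveform: List[float]) -> List[int]:
        # A real codec learns its token vocabulary; here we simply quantize samples.
        return [int((sample + 1.0) * 511) for sample in waveform]

    def decode(self, tokens: List[int]) -> List[float]:
        return [token / 511.0 - 1.0 for token in tokens]


@dataclass
class TokenLM:
    """Stand-in for the language model trained on thousands of hours of speech."""

    def generate(self, phonemes: List[str], prompt_tokens: List[int]) -> List[int]:
        # A real model predicts the most probable acoustic tokens that continue the
        # prompt, conditioned on the phonemes; this stub just echoes the prompt so
        # the example runs end to end.
        repeats = max(1, len(phonemes) // max(1, len(prompt_tokens)))
        return prompt_tokens * repeats


def synthesize(prompt_audio: List[float], text: str,
               codec: SpeechCodec, lm: TokenLM) -> List[float]:
    prompt_tokens = codec.encode(prompt_audio)   # the three-second voice sample
    phonemes = list(text)                        # crude stand-in for phonemization
    continuation = lm.generate(phonemes, prompt_tokens)
    return codec.decode(continuation)            # waveform in the prompt's voice


if __name__ == "__main__":
    fake_prompt = [0.0, 0.1, -0.1, 0.2] * 100    # placeholder for real audio samples
    output = synthesize(fake_prompt, "Hello from a borrowed voice.",
                        SpeechCodec(), TokenLM())
    print(len(output), "samples generated")
```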
The result, according to the researchers, is a model that outperforms current state-of-the-art text-to-speech generators in terms of “speech naturalness and speaker similarity.”
Samples of the model’s capabilities can be found online. While some samples sounded obviously fake, others approached and even achieved natural-sounding speech. As AI continues to develop at a breakneck pace, some experts believe VALL-E could soon provide near-perfect imitations of anybody’s voice.
VALL-E’s unveiling preceded reports that Microsoft allegedly plans to invest $10 billion in OpenAI, the Elon Musk-cofounded startup that created GPT-3 (one of the most powerful language models available) and its mega-viral chatbot application, ChatGPT. It’s unclear whether development of VALL-E impacted the decision.
Microsoft declined the Star’s requests for comment.
Ease of use
Brett Caraway, an associate professor of media economics at the University of Toronto, said voice-mimicking synthesizers already exist, but they require a great deal of audio data to pull off convincing speech.
With technology like VALL-E, however, anyone could achieve the same results with a couple of seconds of audio.
“VALL-E lowers the threshold or barrier to replicating anyone else’s voice,” Caraway said. “So, in making it easier to do, it creates a risk of proliferation of content because more people will be able to do it more quickly with fewer resources.”
“It will create a real crisis in managing disinformation campaigns. It will be harder to identify and it will be overwhelming in terms of the quantity of disinformation, potentially,” he said.
Lack of trust
Bad actors could pair an altered voice with manufactured video to make anyone appear to say anything, Caraway continued. Spam and scam callers could phone people pretending to be someone they’re not. Fraudsters could use it to bypass voice identification systems, and that’s just the tip of the iceberg. Eventually, Caraway is concerned, “it could erode people’s trust across the board.”
Abhishek Gupta, founder and principal researcher at the Montreal AI Ethics Institute, agreed. Over email, he wrote: “There’s the potential for erosion of our trust in provided testimony, evidence, and other attestations, since there’s always the claim that somebody can make that their voice’s likeness was replicated and they didn’t say any of the things that are being attributed to them.
“This further diminishes the health of the information ecosystem and makes trust a very tenuous commodity in society.”
Gupta also noted that artists who rely on their voice to make a living could be impacted, as it’s now possible to steal anyone’s voice for use in projects you’d previously have to pay them for.
How can we prevent harm?
Gupta believes it’s time to assemble a “multidisciplinary set of stakeholders who carry domain expertise across AI, policy-making and futures thinking” to proactively prepare for future challenges instead of simply reacting to each new advancement.
“Leaning in on existing research in the areas of accountability, transparency, fairness, privacy, security, etc. as it pertains to AI can also help alleviate the severity of the challenges that one might encounter in the space,” he continued.
Microsoft’s researchers acknowledged VALL-E’s potential for harm in their conclusion, saying its abilities “may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E.”
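As a rough illustration of what such a detection model might look like, the sketch below trains a simple binary classifier to flag clips as real or synthesized. The feature extraction, placeholder data and scikit-learn classifier are assumptions made for the example, not the method the paper proposes.

```python
# Rough sketch of a synthesized-speech detector; the features, placeholder data and
# model choice are assumptions for illustration, not the VALL-E paper's approach.
import numpy as np
from sklearn.linear_model import LogisticRegression


def spectral_features(waveform: np.ndarray, n_bins: int = 64) -> np.ndarray:
    """Crude stand-in for real acoustic features (e.g. spectrogram statistics)."""
    spectrum = np.abs(np.fft.rfft(waveform, n=2 * n_bins))[:n_bins]
    return spectrum / (spectrum.sum() + 1e-9)


# Placeholder corpus: in practice these would be labelled real and synthesized clips.
rng = np.random.default_rng(0)
real_clips = [rng.normal(size=16_000) for _ in range(50)]
synth_clips = [np.convolve(rng.normal(size=16_000), np.ones(8) / 8, mode="same")
               for _ in range(50)]  # toy "synthetic" clips with a smoother spectrum

X = np.array([spectral_features(clip) for clip in real_clips + synth_clips])
y = np.array([0] * len(real_clips) + [1] * len(synth_clips))  # 1 = synthesized

detector = LogisticRegression(max_iter=1000).fit(X, y)
print(detector.predict(X[:3]))  # 0 = judged real, 1 = judged synthesized
```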
While he agreed it could help, Caraway is skeptical that solely relying on AI-detection software is enough: as detection models advance, so too will techniques to bypass said detection. Instead, he believes media-literacy education is the best solution, teaching kids from a young age how to find trustworthy information online.
“One of the things that I have been a proponent of is attempting to institute media literacy and information literacy starting in preschool,” he said.
“I also think a key component here is recommitting to good journalism … not only in expression, but in terms of investment into quality journalism. We need it now more than ever.”