This Speech Synthesis Is Better Than Real Humans

Don’t you hate it when you click on a YouTube video hoping to find an explanation of, say, the development of speech synthesis, only to hear the female Google voice narrating in the background? I do. And since my mother is still familiarizing herself with social media, she regularly sends me videos filled with that annoying Google speech.

Dessa’s Meta Labs machine learning engineers Hashiam Kadhim, Joe Palermo and Rayhane Mama have finally produced a lifelike speech simulation and, dang, we didn’t see that coming. When we listened to the audio they recorded with the AI voice, we were blown away. Just think about all those videos that won’t need that annoying Google speech anymore!

And oh, dear, Jarvis could be real.

They used Joe Rogan’s voice to make the simulation. If you don’t know who he is, he is the host of The Joe Rogan Experience, a podcast that was only two episodes short of 1,300 as of May 16.

Now, get ready to test yourself. Can you guess whose voice is in this audio? Can you tell whether it’s the AI or the actual Joe Rogan speaking?

It must have been challenging to tell whether that was the AI or an actual living human speaking. Here’s the actual audio recorded by the team of engineers.

The risks and benefits

With the ability to perfectly impersonate someone’s voice, the technology is very likely to be sought after by people who want to do good and by people who want to do harm. AI speech at this level carries risks and, obviously, benefits. The benefits include, for example:

  • Have your e-books read by your favorite reader, such as Alec Baldwin.
  • Listen to smoother, more natural-sounding speech.
  • Improve the quality of accessibility services for people who need them, such as ALS patients.

And there are likely more benefits to come that we can’t foresee yet, once this AI text-to-speech is perfected even further. Unfortunately, concerns come with it too.

  • Spam calls that aim to obtain personal information without you knowing who the caller actually is.
  • Impersonation of someone’s voice that may lead to bullying or harassment.
  • Audio deepfakes that become easy to produce, so that it takes a professional analyst with the right software to tell whether a recording is real.

We can think of worse, but let’s stay positive: people can turn even a spoon to bad ends, and they can do much more with such a realistic AI voice.

Out of concern for the technology’s implications for society, the research lab has not released anything as open source. Governments and lawmakers are also working toward regulating cybercrime and the misuse of AI technologies.

The team decided to showcase this achievement to help people understand what the technology does and what it could do to society.


