Listen to this AI voice clone of Bill Gates created by Facebook’s engineers

We’re headed for a revolution in computer-generated speech, and a voice clone of Microsoft co-founder Bill Gates demonstrates exactly why.

In the clips embedded below, you can listen to what seems to be Gates reeling off a series of innocuous phrases. “A cramp is no small danger on a swim,” he cautions. “Write a fond note to the friend you cherish,” he advises. But each voice clip has been generated by a machine learning system named MelNet, designed and created by engineers at Facebook.

In fact, Gates is just the best known of the handful of individuals MelNet can mimic.

Now you may be wondering why the researchers chose to replicate such a STEM-y bunch of speakers. Well, the simple answer is that one of the resources used to train MelNet was a 452-hour dataset of TED talks. The rest of the training data came from audiobooks, chosen because the “highly animated manner” of the speakers makes for a challenging target.

These audio samples are undeniably impressive, but MelNet isn’t exactly a bolt from the blue. The quality of voice clones has been steadily improving in recent years, with a recent replica of podcaster Joe Rogan demonstrating exactly how far we’ve come. Much of this progress dates back to 2016 with the unveiling of SampleRNN and WaveNet, the latter a machine learning text-to-speech program created by Google’s London-based AI lab DeepMind; WaveNet now powers the Google Assistant.

The basic approach with WaveNet, SampleRNN, and similar programs is to feed the AI system a ton of data and use that to analyze the nuances in a human voice. (Older text-to-speech systems don’t generate audio, but reconstitute it: chopping up speech samples into phonemes, then stitching these back together to create new words.) But while WaveNet and others were trained using audio waveforms, Facebook’s MelNet uses a richer and more informationally dense format to learn to speak: the spectrogram.
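To make that difference concrete, here is a minimal sketch of how a raw waveform gets turned into a mel spectrogram, the kind of representation MelNet-style models train on. It uses the librosa library, and the file name, window size, hop length, and band count are illustrative assumptions, not the paper’s actual pipeline.

```python
# Minimal sketch: waveform -> mel spectrogram (illustrative settings, not Facebook's pipeline).
import librosa

# Load a hypothetical clip; the waveform is one amplitude value per sample.
y, sr = librosa.load("speech.wav", sr=22050)

mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=1024,        # window size of each short-time Fourier transform frame
    hop_length=256,    # step between frames; sets the spectrogram's time resolution
    n_mels=80,         # number of mel-frequency bands per frame
)
mel_db = librosa.power_to_db(mel)  # log scale, closer to how loudness is perceived

print("waveform samples:", y.shape)          # e.g. (88200,) for 4 seconds of audio
print("spectrogram bands x frames:", mel_db.shape)  # e.g. (80, ~345)
```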

In an accompanying paper, Facebook’s researchers note that while WaveNet produces higher-fidelity audio output, MelNet is superior at capturing “high-level structure” — the subtle consistencies contained in a speaker’s voice that are, ironically, almost impossible to describe in words, but to which the human ear is finely attuned.

They say that this is because the data captured in a spectrogram is “orders of magnitude more compact” than that found in audio waveforms. This compactness allows the algorithms to produce more consistent voices, rather than being distracted by and homing in on the extreme detail of a waveform recording (to use an overly simplistic human analogy).
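As a rough back-of-the-envelope illustration of that compactness claim (the sample rate and hop size below are assumptions for the sake of the arithmetic, not figures from the paper), compare how many timesteps a model has to predict for a few seconds of audio in each representation:

```python
# Rough arithmetic comparing sequence lengths; illustrative numbers, not measurements.
sample_rate = 22_050   # waveform values per second of audio
hop_length = 256       # waveform samples summarized by one spectrogram frame
seconds = 4

waveform_steps = sample_rate * seconds              # 88,200 individual samples to generate
spectrogram_steps = waveform_steps // hop_length    # ~344 frames along the time axis

print(waveform_steps, spectrogram_steps)  # 88200 vs 344: hundreds of times fewer timesteps
                                          # over which the model must stay consistent
```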

There are limitations, though. The most important is that the model can’t replicate how a human voice changes over longer periods of time; building up drama or tension over a paragraph or page of text, for example. Interestingly, this is similar to the constraints we’ve seen in AI text generation, which captures surface-level coherence but not long-term structure.

These caveats aside, the results are astoundingly good. And, more impressively, MelNet is a multifunction system. It doesn’t just generate realistic voices; it can also be used to generate music (though the output is a little dodgy at times, and it doesn’t seem like it can be shaped and sculpted in a way that would make it commercially useful).

As ever, there are benefits and dangers to this technology. The benefits? Higher-quality AI assistants; realistic voice models for people with speech impairments; and a range of uses in the entertainment industry. The dangers? How about crumbling trust in traditional forms of evidence, and the potential for audio harassment, scams, and generalized slander? All the fun of the AI fake fair, basically. Just pair it with this recent research that lets you edit what someone says in a video by typing in new speech, and the possibilities are endless.
