We’re headed for a revolution in computer-generated speech, and a voice clone of Microsoft co-founder Bill Gates demonstrates exactly why.
In the clips embedded below, you can listen to what seems to be Gates reeling off a series of innocuous phrases. “A cramp is no small danger on a swim,” he cautions. “Write a fond note to the friend you cherish,” he advises. But each voice clip has been generated by a machine learning system named MelNet, designed and created by engineers at Facebook.
In fact, Gates is just the best known of the handful of individuals MelNet can mimic.
Now you may be wondering why the researchers chose to replicate such a STEM-y bunch of speakers. Well, the simple answer is that one of the resources used to train MelNet was a 452-hour dataset of TED talks. The rest of the training data came from audiobooks, chosen because the speakers’ “highly animated manner” makes for a challenging target.
These audio samples are undeniably impressive, but MelNet isn’t exactly a bolt from the blue. The quality of voice clones has been steadily improving in recent years, with a recent replica of podcaster Joe Rogan demonstrating just how far we’ve come. Much of this progress dates back to 2016 with the unveiling of SampleRNN and WaveNet, the latter being a machine learning text-to-speech program created by Google’s London-based AI lab DeepMind, which now powers the Google Assistant.
The basic approach with WaveNet, SampleRNN, and similar programs is to feed the AI system a ton of data and use that to analyze the nuances in a human voice. (Older text-to-speech systems don’t generate audio, but reconstitute it: chopping up speech samples into phonemes, then stitching these back together to create new words.) But while WaveNet and others were trained using audio waveforms, Facebook’s MelNet uses a richer and more informationally dense format to learn to speak: the spectrogram.
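For a concrete sense of the difference between the two representations, here’s a minimal sketch that turns one second of audio into a mel spectrogram using the open-source librosa library. The sample rate, FFT size, hop length, and 80 mel bands are illustrative assumptions, not MelNet’s published settings:

```python
import numpy as np
import librosa

# One second of a synthetic signal stands in for real speech here.
sr = 22050                                  # sample rate in Hz (an assumed value)
t = np.linspace(0, 1.0, sr, endpoint=False)
waveform = 0.5 * np.sin(2 * np.pi * 220 * t) + 0.3 * np.sin(2 * np.pi * 440 * t)

# A waveform is just amplitude over time: one long 1-D array of samples.
print(waveform.shape)                       # (22050,)

# A mel spectrogram re-expresses that same second as energy per frequency band
# per short time frame -- the 2-D picture of sound that spectrogram-based models
# like MelNet train on.
mel = librosa.feature.melspectrogram(
    y=waveform, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
mel_db = librosa.power_to_db(mel, ref=np.max)   # log scale, closer to how we hear
print(mel_db.shape)                         # (80, 87): 80 mel bands x ~87 time frames
```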
In an accompanying paper, Facebook’s researchers note that while WaveNet produces higher-fidelity audio output, MelNet is superior at capturing “high-level structure” — the subtle consistencies contained in a speaker’s voice that are, ironically, almost impossible to describe in words, but to which the human ear is finely attuned.
They say that this is because the data captured in a spectrogram is “orders of magnitude more compact” than that found in audio waveforms. This density allows the algorithms to produce more consistent voices, rather than being distracted by the extreme detail of a waveform recording (to use an overly simplistic human analogy).
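That compactness is easiest to see along the time axis. Here’s a quick back-of-envelope comparison for a single second of audio, again using assumed settings rather than the paper’s exact configuration:

```python
# Rough comparison of how many timesteps a model must relate across one second.
# These numbers are illustrative assumptions, not MelNet's published configuration.
sr = 22050                 # waveform samples (timesteps) per second
hop = 256                  # assumed spacing between spectrogram frames, in samples
frames = sr // hop + 1     # ~87 spectrogram frames cover the same second

print(f"waveform timesteps per second:  {sr}")
print(f"spectrogram frames per second:  {frames}")
print(f"reduction along the time axis:  ~{sr // frames}x")   # roughly 250x fewer steps
```

Fewer steps along the time axis means the model has a much shorter span to keep coherent, which is the intuition behind the consistency advantage described above.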
There are limitations, though. The most important is that the model can’t replicate how a human voice changes over longer stretches of speech: building up drama or tension over a paragraph or page of text, for example. Interestingly, this is similar to the constraints we’ve seen in AI text generation, which captures surface-level coherence but not long-term structure.
These caveats aside, the results are astoundingly good. And, more impressively, MelNet is a multifunction system. It doesn’t just generate realistic voices; it can also be used to generate music (though the output is a little dodgy at times, and it doesn’t seem like it can be shaped and sculpted in a way that would make it commercially useful).
As ever, there are benefits and dangers with this technology. The benefits? Higher-quality AI assistants; realistic voice models for people with speech impairments; and a range of uses in the entertainment industry. The dangers? How about crumbling trust in traditional forms of evidence, and the potential for audio harassment, scams, and generalized slander? All the fun of the AI fake fair, basically. Just pair it with this recent research that lets you edit what someone says in a video by typing in new speech, and the possibilities are endless.