At this point, anyone who has been following AI research is long familiar with generative models that can synthesize speech or melodic music from nothing but text prompting. Nvidia’s newly revealed “Fugatto” model looks to go a step further, using new synthetic training methods and inference-level combination techniques to “transform any mix of music, voices, and sounds,” including the synthesis of sounds that have never existed.
While Fugatto isn’t available for public testing yet, a sample-filled website showcases how the model can be used to dial a number of distinct audio traits and descriptions up or down, resulting in everything from the sound of saxophones barking to people speaking underwater to ambulance sirens singing in a kind of choir. Though the results can be a bit hit or miss, the sheer breadth of capabilities on display helps support Nvidia’s description of Fugatto as “a Swiss Army knife for sound.”
You’re only as good as your data
In an explanatory research paper, over a dozen Nvidia researchers explain the difficulty in crafting a training dataset that can “reveal meaningful relationships between audio and language.” While standard language models can often infer how to handle various instructions from the text-based data itself, it can be hard to generalize descriptions and traits from audio without more explicit guidance.
To that end, the researchers start by using an LLM to generate a Python script that can create a large number of template-based and free-form instructions describing different audio “personas” (e.g., “standard, young-crowd, thirty-somethings, professional”). They then generate a set of both absolute (e.g., “synthesize a happy voice”) and relative (e.g., “increase the happiness of this voice”) instructions that can be applied to those personas.
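Nvidia hasn’t published that script, but the general shape of the approach is easy to picture. Here’s a minimal, purely illustrative sketch of template-based instruction generation; the persona names are quoted from the paper’s example, while the templates and trait lists are assumptions made for illustration:

```python
import random

# Personas quoted in the paper; the templates and traits below are illustrative guesses.
PERSONAS = ["standard", "young-crowd", "thirty-somethings", "professional"]
TRAITS = {"happiness": "happy", "sadness": "sad", "surprise": "surprised"}

ABSOLUTE_TEMPLATES = [
    "synthesize a {adj} voice in a {persona} style",
    "generate {persona} speech that sounds {adj}",
]
RELATIVE_TEMPLATES = [
    "increase the {trait} of this voice",
    "slightly decrease the {trait} of this voice",
]

def absolute_instruction() -> str:
    """An absolute instruction, e.g. 'synthesize a happy voice in a professional style'."""
    trait, adj = random.choice(list(TRAITS.items()))
    return random.choice(ABSOLUTE_TEMPLATES).format(adj=adj, persona=random.choice(PERSONAS))

def relative_instruction() -> str:
    """A relative instruction, e.g. 'increase the happiness of this voice'."""
    return random.choice(RELATIVE_TEMPLATES).format(trait=random.choice(list(TRAITS)))

for _ in range(3):
    print(absolute_instruction())
    print(relative_instruction())
```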
The many open source audio datasets used as the basis for Fugatto generally don’t have these kinds of trait measurements embedded in them by default. But the researchers prompt existing audio understanding models to create “synthetic captions” for their training clips, producing natural language descriptions that can automatically quantify traits such as gender, emotion, and speech quality. Audio processing tools are also used to describe and quantify training clips on a more acoustic level (e.g., “fundamental frequency variance” or “reverb”).
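The paper doesn’t spell out which tool produces each descriptor, but off-the-shelf audio libraries can compute the acoustic side of such captions. As a hedged sketch (librosa is chosen here purely for illustration, not because it is part of the researchers’ toolchain), estimating a clip’s “fundamental frequency variance” might look something like this:

```python
import numpy as np
import librosa  # assumption: any pitch tracker would do; this is not Nvidia's pipeline


def f0_variance(path: str) -> float:
    """Estimate the variance of a clip's fundamental frequency (pitch).

    Unvoiced frames come back from pYIN as NaN and are ignored. The resulting
    number is the kind of acoustic descriptor that could be folded into a
    synthetic caption, e.g. "calm delivery, low pitch variance, dry (little reverb)".
    """
    y, sr = librosa.load(path, sr=None, mono=True)
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    return float(np.nanvar(f0))
```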
For relational comparisons, the researchers rely on datasets where one factor is held constant while another changes, such as different emotional readings of the same text or different instruments playing the same notes. By comparing such paired samples across a large enough body of data, the model can start to learn what kinds of audio characteristics tend to appear in “happier” speech, for instance, or how the sound of a saxophone differs from that of a flute.
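As a toy illustration of that pairing idea, imagine each clip carries a transcript and a happiness score produced by the captioning step above; the field names and instruction text below are hypothetical, not taken from the paper:

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Iterable, Iterator, Tuple

@dataclass
class Clip:
    path: str
    transcript: str
    happiness: float  # hypothetical 0-1 score from an audio-understanding model

def relative_happiness_pairs(clips: Iterable[Clip]) -> Iterator[Tuple[Clip, Clip, str]]:
    """Pair clips that share a transcript but differ in measured happiness,
    so the held-constant text can't explain the acoustic difference."""
    for a, b in combinations(list(clips), 2):
        if a.transcript != b.transcript or a.happiness == b.happiness:
            continue
        lower, higher = sorted((a, b), key=lambda c: c.happiness)
        yield lower, higher, "increase the happiness of this voice"
```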
After running a variety of different open source audio collections through this process, the researchers ended up with a heavily annotated dataset of 20 million separate samples representing at least 50,000 hours of audio. From there, a set of 32 Nvidia Tensor Core GPUs was used to create a model with 2.5 billion parameters that started to show reliable scores on a variety of audio quality tests.
It’s all in the mix
Beyond the training, Nvidia is also talking up Fugatto’s “ComposableART” system (for “Audio Representation Transformation”). When provided with a prompt in text and/or audio, this system can use “conditional guidance” to “independently control and generate (unseen) combinations of instructions and tasks” and generate “highly customizable audio outputs outside the training distribution.” In other words, it can combine different traits from its training set to create entirely new sounds that have never been heard before.
I won’t pretend to understand all of the complex math described in the paper, which involves a “weighted combination of vector fields between instructions, frame indices and models.” But the end results, as shown in examples on the project’s webpage and in an Nvidia trailer, highlight how ComposableART can be used to create the sound of, say, a violin that “sounds like a laughing baby or a banjo that’s playing in front of gentle rainfall” or “factory machinery that screams in metallic agony.” While some of these examples are more convincing to our ears than others, the fact that Fugatto can take a decent stab at these kinds of combinations at all is a testament to the way the model characterizes and mixes extremely disparate audio data from multiple open source datasets.
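At a rough conceptual level, though, weighting several instruction-conditioned predictions against one another is simple enough to sketch. The snippet below is a toy analogue rather than Nvidia’s actual formulation: `model(x, t, text)` stands in for whatever per-step prediction a conditional generator makes for a single instruction, and the weights decide how hard each instruction steers the output.

```python
import numpy as np

def composed_step(model, x, t, instructions, weights):
    """Toy stand-in for combining per-instruction predictions with scalar weights.

    `model(x, t, text)` is a placeholder for a conditional generative model's
    per-step output (e.g., a predicted vector field) given one text instruction;
    the normalized weighted sum nudges generation toward a blend of all of them.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    predictions = [model(x, t, text) for text in instructions]
    return sum(w * p for w, p in zip(weights, predictions))

# e.g. composed_step(model, x, t,
#                    ["a violin playing", "a baby laughing"],
#                    weights=[0.6, 0.4])
```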
Perhaps the most interesting part of Fugatto is the way it treats each individual audio trait as a tunable continuum rather than a binary. In one example that melds the sound of an acoustic guitar and running water, for instance, the result ends up very different depending on whether the guitar or the water is weighted more heavily in Fugatto’s interpolated mix. Nvidia also mentions examples of tuning a French accent to be heavier or lighter, or varying the “degree of sorrow” inherent in a spoken clip.
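Conceptually, that knob is just the weighting from the previous sketch exposed as a continuous parameter. The self-contained toy below, with a dummy stand-in for the real generator, shows what sweeping it might look like; none of this is Nvidia’s code.

```python
import hashlib
import numpy as np

def dummy_model(x, t, text):
    """Stand-in for a conditional prediction, seeded per instruction so the demo runs."""
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "little")
    return np.random.default_rng(seed).standard_normal(x.shape)

def blend(x, t, trait_a, trait_b, alpha, model=dummy_model):
    """alpha=1.0 leans entirely on trait_a (say, the guitar), 0.0 entirely on
    trait_b (the running water); values in between interpolate the mix."""
    return alpha * model(x, t, trait_a) + (1.0 - alpha) * model(x, t, trait_b)

x = np.zeros(16)
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    out = blend(x, t=0.5, trait_a="acoustic guitar strumming",
                trait_b="water running in a stream", alpha=alpha)
    print(alpha, float(out.mean()))
```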
Beyond tuning and combining different audio traits, Fugatto can also perform the kinds of audio tasks we’ve seen in previous models, like changing the emotion in a piece of spoken text or isolating the vocal track in a piece of music. Fugatto can also detect individual notes in a piece of MIDI music and replace them with a variety of vocal performances, or detect the beat of a piece of music and add effects from drums to barking dogs to ticking clocks in a way that matches the rhythm.
While the researchers describe Fugatto as just the first step “towards a future where unsupervised multitask learning emerges from data and model scale,” Nvidia is already talking up use cases from song prototyping to dynamically changing video game scores to international ad targeting. But Nvidia was also quick to highlight that models like Fugatto are best seen as a new tool for audio artists rather than a replacement for their creative talents.
“The history of music is also a history of technology,” Nvidia Inception participant and producer/songwriter Ido Zmishlany said in Nvidia’s blog post. “The electric guitar gave the world rock and roll. When the sampler showed up, hip-hop was born. With AI, we’re writing the next chapter of music. We have a new instrument, a new tool for making music—and that’s super exciting.”