At this point, anyone who has been following AI research is long familiar with generative models that can synthesize speech or melodic music from nothing but text prompts. Nvidia’s newly revealed “Fugatto” model looks to go a step further, using new synthetic training methods and inference-level combination techniques to “transform any mix of music, voices, and sounds,” including the synthesis of sounds that have never existed.
While Fugatto isn’t available for public testing yet, a sample-filled website showcases how the model can be used to dial a number of distinct audio traits and descriptions up or down, resulting in everything from the sound of saxophones barking to people speaking underwater to ambulance sirens singing in a kind of choir. The results can be a bit hit or miss, but the vast array of capabilities on display helps support Nvidia’s description of Fugatto as “a Swiss Army knife for sound.”