Combining audio control and style transfer using latent diffusion

Supporting webpage for our ISMIR 2024 paper. Code is available here

Abstract

Deep generative models are now able to synthesize high-quality audio signals, shifting the critical aspect of their development from audio quality to control capabilities. Although text-to-music generation is being widely adopted by the general public, explicit control and example-based style transfer are better suited to capturing the intents of artists and musicians. In this paper, we aim to unify explicit control and style transfer within a single model by separating local and global information to capture musical structure and timbre, respectively. To do so, we leverage the semantic feature extraction capabilities of diffusion autoencoders to build two representation spaces. We enforce disentanglement between these spaces using an adversarial criterion and a two-stage training strategy. The resulting model can generate audio matching a timbre target while specifying structure either through explicit controls or another audio example. We evaluate our model on one-shot timbre transfer and MIDI-to-audio tasks on instrumental recordings and show that we outperform existing baselines in terms of audio quality and target fidelity. Furthermore, we show that our method can generate cover versions of complete musical pieces by transferring rhythmic and melodic content to the style of a target audio in a different genre.
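To make the disentanglement idea concrete, here is a minimal, illustrative PyTorch sketch of two encoders (a local structure encoder and a global timbre encoder) trained against an adversarial predictor that tries to recover timbre information from the structure features. All module names, shapes, and losses are assumptions for illustration; the actual model conditions a latent diffusion decoder on both representations and uses a two-stage training strategy, which is omitted here.

```python
# Minimal sketch (not the paper's code): a local "structure" encoder and a
# global "timbre" encoder, with an adversary that tries to predict the timbre
# vector from the structure features. The encoders are trained to defeat it.
import torch
import torch.nn as nn

class StructureEncoder(nn.Module):          # local, time-varying features
    def __init__(self, n_mels=128, dim=64):
        super().__init__()
        self.net = nn.Conv1d(n_mels, dim, kernel_size=3, padding=1)
    def forward(self, spec):                # spec: (batch, n_mels, frames)
        return self.net(spec)               # (batch, dim, frames)

class TimbreEncoder(nn.Module):             # global, time-invariant vector
    def __init__(self, n_mels=128, dim=64):
        super().__init__()
        self.net = nn.Conv1d(n_mels, dim, kernel_size=3, padding=1)
    def forward(self, spec):
        return self.net(spec).mean(dim=-1)  # pool over time -> (batch, dim)

class Adversary(nn.Module):                 # predicts timbre from structure
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Linear(dim, dim)
    def forward(self, z_struct):
        return self.net(z_struct.mean(dim=-1))

struct_enc, timbre_enc, adv = StructureEncoder(), TimbreEncoder(), Adversary()
opt_main = torch.optim.Adam(list(struct_enc.parameters()) + list(timbre_enc.parameters()), lr=1e-4)
opt_adv = torch.optim.Adam(adv.parameters(), lr=1e-4)

spec = torch.randn(8, 128, 256)             # dummy mel-spectrogram batch

# Step 1: the adversary learns to recover the timbre vector from structure features.
z_s, z_t = struct_enc(spec), timbre_enc(spec)
adv_loss = ((adv(z_s.detach()) - z_t.detach()) ** 2).mean()
opt_adv.zero_grad(); adv_loss.backward(); opt_adv.step()

# Step 2: the encoders are trained so that this prediction fails (disentanglement).
# In practice this term is balanced with the diffusion/reconstruction objective (omitted here).
z_s, z_t = struct_enc(spec), timbre_enc(spec)
disent_loss = -((adv(z_s) - z_t.detach()) ** 2).mean()
opt_main.zero_grad(); disent_loss.backward(); opt_main.step()
```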

MIDI-to-audio

Examples of MIDI-to-audio generation on the Slakh dataset. For each MIDI file, we present reconstruction results (using the original audio associated with the MIDI file) and transfer to a different recording timbre. For the SpecDiff baseline (Multi-instrument music synthesis with spectrogram diffusion [1]), we swap the MIDI instrument program to that of the target timbre sample.
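As a minimal sketch of this baseline preparation step, the snippet below replaces every (non-drum) instrument program in a MIDI file with the General MIDI program of the target timbre. The use of pretty_midi, the file names, and the chosen program number are illustrative assumptions, not the paper's actual pipeline.

```python
# Swap all instrument programs in a MIDI file to a target General MIDI program,
# as done for the SpecDiff transfer examples (illustrative sketch).
import pretty_midi

def swap_program(midi_path: str, target_program: int, out_path: str) -> None:
    midi = pretty_midi.PrettyMIDI(midi_path)
    for instrument in midi.instruments:
        if not instrument.is_drum:              # leave drum tracks untouched
            instrument.program = target_program
    midi.write(out_path)

# Hypothetical usage: render the source MIDI with program 24 (Acoustic Guitar, nylon).
swap_program("source.mid", target_program=24, out_path="source_as_guitar.mid")
```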

Scroll to see all the results if necessary.

Columns: MIDI | Target | SpecDiff | Ours with encoder | Ours

Piano: reconstruction, transfer
Guitar: reconstruction, transfer
Strings: reconstruction, transfer
Voice: reconstruction, transfer
Synth: reconstruction, transfer
Bass: reconstruction, transfer
Flute: reconstruction, transfer

Timbre Transfer

Synthetic Data

Examples of timbre transfer on the Slakh dataset. We compare our method with two baselines, SS-VAE [2] and Music Style Transfer [3].

Columns: Source | Target | SS-VAE | Music Style Transfer | Ours (no adv.) | Ours

Piano to guitar
Guitar to voice
Synth to strings
Guitar to flute
Bass to keys
Guitar to guitar

Real Data

Examples of timbre transfer on three datasets of real instrumental recordings.

Columns: Source | Target | SS-VAE | Music Style Transfer | Ours (no adv.) | Ours

Piano to guitar
Guitar to piano
Flute to piano
Guitar to flute
Piano to flute
Violin to guitar
Violin to piano
Piano to piano

Music style transfer

Examples of music style transfer between recordings of rock, jazz, dub, and lo-fi hip-hop. For MusicGen, we use the source audio as melody input together with a text prompt matching the target genre (an example call is sketched below the table).

Columns: Source | Target | MusicGen | Ours (no adv.) | Ours
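For reference, a melody-conditioned MusicGen generation can be run roughly as follows. This is a minimal sketch assuming the audiocraft package and the musicgen-melody checkpoint; the input file name and prompt text are illustrative placeholders, not the exact inputs used for the examples above.

```python
# Illustrative MusicGen baseline call: the source recording conditions the melody,
# while the text prompt describes the target genre.
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained("facebook/musicgen-melody")  # melody-conditioned checkpoint
model.set_generation_params(duration=10)                     # seconds of audio to generate

melody, sr = torchaudio.load("source_rock.wav")              # hypothetical source recording
wav = model.generate_with_chroma(
    descriptions=["lo-fi hip-hop beat, mellow, dusty drums"],  # example target-genre prompt
    melody_wavs=melody[None],                                  # (batch, channels, time)
    melody_sample_rate=sr,
)
```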

References

[1] C. Hawthorne, I. Simon, A. Roberts, N. Zeghidour, J. Gardner, E. Manilow, and J. Engel, "Multi-instrument music synthesis with spectrogram diffusion," arXiv preprint arXiv:2206.05408, 2022.

[2] O. Cífka, A. Ozerov, U. Şimşekli, and G. Richard, "Self-supervised VQ-VAE for one-shot music style transfer," in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 96–100.

[3] S. Li, Y. Zhang, F. Tang, C. Ma, W. Dong, and C. Xu, "Music style transfer with time-varying inversion of diffusion models," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 1, 2024, pp. 547–555.