MARS: Sound Generation via Multi-Channel Autoregression on Spectrograms

Eleonora Ristori · Luca Bindini · Paolo Frasconi

📄 View on arXiv

Abstract

Research on audio generation has progressively shifted from waveform-based approaches to spectrogram-based methods, which more naturally capture harmonic and temporal structures. At the same time, advances in image synthesis have shown that autoregression across scales, rather than tokens, improves coherence and detail. Building on these ideas, we introduce MARS (Multi-channel AutoRegression on Spectrograms), which, to the best of our knowledge, is the first adaptation of next-scale autoregressive modeling to the spectrogram domain. MARS treats spectrograms as multi-channel images and employs channel multiplexing (CMX), a reshaping strategy that reduces spatial resolution without information loss. A shared tokenizer provides consistent discrete representations across scales, enabling a transformer-based autoregressor to refine spectrograms from coarse to fine resolutions efficiently. Experiments on a large-scale dataset demonstrate that MARS performs comparably to or better than state-of-the-art baselines across multiple evaluation metrics, establishing an efficient and scalable paradigm for high-fidelity sound generation.
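The channel-multiplexing idea can be illustrated with a space-to-depth reshape: spatial resolution is traded for channels, so every spectrogram value is preserved and the operation is exactly invertible. The sketch below is illustrative only (the function names and the factor `r` are our own, not from the paper); it shows one lossless reshaping of this kind, not the exact CMX implementation used in MARS.

```python
import numpy as np

def channel_multiplex(spec: np.ndarray, r: int = 2) -> np.ndarray:
    """Lossless space-to-depth reshape: (C, H, W) -> (C*r*r, H//r, W//r).

    Each r x r spatial patch is folded into the channel dimension,
    so no spectrogram values are discarded.
    """
    C, H, W = spec.shape
    assert H % r == 0 and W % r == 0, "H and W must be divisible by r"
    x = spec.reshape(C, H // r, r, W // r, r)
    x = x.transpose(0, 2, 4, 1, 3)  # (C, r, r, H//r, W//r)
    return x.reshape(C * r * r, H // r, W // r)

def channel_demultiplex(x: np.ndarray, r: int = 2) -> np.ndarray:
    """Exact inverse of channel_multiplex: (C*r*r, h, w) -> (C, h*r, w*r)."""
    Crr, h, w = x.shape
    C = Crr // (r * r)
    y = x.reshape(C, r, r, h, w).transpose(0, 3, 1, 4, 2)  # (C, h, r, w, r)
    return y.reshape(C, h * r, w * r)

# Round trip on a toy 1-channel "spectrogram": nothing is lost.
spec = np.arange(1 * 8 * 8, dtype=np.float32).reshape(1, 8, 8)
folded = channel_multiplex(spec, r=2)      # shape (4, 4, 4)
restored = channel_demultiplex(folded, r=2)
assert np.array_equal(spec, restored)
```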

Generated Samples

Below we present examples of audio generated by MARS, each accompanied by the three most similar sounds from the training set. Similarity is measured as the cosine distance between embeddings of the generated and training sounds extracted with VGGish.
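The nearest-neighbor retrieval above can be sketched as follows. We assume each sound is represented by a fixed-length embedding vector (VGGish produces 128-dimensional embeddings); the function names and the random data are illustrative, not part of the MARS pipeline.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance = 1 - cosine similarity; 0 means identical direction."""
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_neighbors(query: np.ndarray, bank: np.ndarray, k: int = 3) -> list:
    """Return indices of the k training embeddings closest to the query."""
    dists = [cosine_distance(query, e) for e in bank]
    return sorted(range(len(bank)), key=lambda i: dists[i])[:k]

# Toy example: a bank of 10 random 128-d "VGGish" embeddings.
rng = np.random.default_rng(0)
bank = rng.normal(size=(10, 128))
query = bank[3] + 0.01 * rng.normal(size=128)  # near-duplicate of item 3
print(nearest_neighbors(query, bank, k=3))     # item 3 ranks first
```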

Example 1

Generated Sample

Nearest Neighbors

Example 2

Generated Sample

Nearest Neighbors

Example 3

Generated Sample

Nearest Neighbors

Example 4

Generated Sample

Nearest Neighbors

Example 5

Generated Sample

Nearest Neighbors