Most widely used modern audio codecs, such as Ogg Vorbis and MP3, as well as more recent "neural" codecs like Meta's Encodec or Descript's audio codec, are based on block coding; audio is divided into overlapping, fixed-size "frames", which are then compressed. While they produce excellent reproduction quality and can be used for downstream tasks such as text-to-audio, they do not produce an intuitive, directly interpretable representation.
In this work, we introduce a proof-of-concept audio encoder that encodes audio as a sparse set of events and their times of occurrence. Rudimentary physics-based assumptions are used to model the attack and the physical resonance of both the instrument being played and the room in which the performance occurs, hopefully encouraging a sparse, parsimonious, and easy-to-interpret representation.
We imagine a near-future audio codec where some types of musical composition could take place directly in the audio codec space. Text-to-audio is an excellent interface for non-musicians creating background music for movies, advertisements or social media content, but it is our view that experienced musicians and composers work in a "space" not fully captured by language and will prefer a much finer-grained representation that affords an effectively infinite range of possible sounds.
This work clearly does not replace current music generation models, but it could serve as the underlying encoding on which generative models are trained. It is the authors' intuition that models trained on this rich, symbolic representation might develop a much deeper "understanding" of the content they produce, given the point-cloud-like nature of the signal. Instead of predicting the next frame, the generative model would predict the relationships between "events". Long-term coherence in musical generation has always been a difficult problem, and we speculate that many models spend an enormous share of their capacity learning to reproduce physical resonance rather than the human forces that drive it in interesting, musical directions.
Early speech results from the LJ-Speech dataset can be found here.
This work takes inspiration from symbolic approaches, such as MIDI, iterative decomposition methods like matching pursuit, and granular synthesis, which represents audio as a sparse set of "grains" or simple audio atoms.
The encoder iteratively removes energy from the input spectrogram, producing an event vector and a one-hot/Dirac impulse representing the time of occurrence. Other representations of time are possible, e.g., a scalar value in seconds, or a binary vector representing the frame number with log2(n_samples) dimensionality.
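As a minimal illustration (the bin count, hop size, and time resolution below are assumptions for the sketch, not details of the model), the same time of occurrence can be written in each of the three forms mentioned above:

```python
import numpy as np

n_positions = 512      # assumed number of time bins in a segment
samplerate = 22050
hop = 256              # assumed samples per time bin
event_position = 137   # the bin at which the event occurs

# 1. one-hot / Dirac impulse over time bins
one_hot = np.zeros(n_positions)
one_hot[event_position] = 1.0

# 2. scalar time in seconds
seconds = event_position * hop / samplerate

# 3. binary encoding of the position index, using log2(n_positions) bits
n_bits = int(np.ceil(np.log2(n_positions)))
binary = np.array([(event_position >> i) & 1 for i in range(n_bits)], dtype=np.float32)
```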
The decoder uses the 32-dimensional event vector to choose an attack envelope, an evolving resonance, and a room impulse response to model the acoustic event, and then "schedules" it by convolving the event with the one-hot/Dirac impulse. Audio is not produced using the typical upsampling convolutions, avoiding artifacts and producing more natural-sounding events.
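To make the scheduling step concrete, here is a minimal sketch (not the model's actual code; the segment length, event length, and position are assumptions) of placing a rendered event at its time of occurrence by convolving it with a Dirac impulse:

```python
import numpy as np
from scipy.signal import fftconvolve

n_samples = 2 ** 17             # assumed segment length (~6 seconds at 22050 Hz)
event = np.random.randn(8192)   # stand-in for a decoded event (attack, resonance, room)

# Dirac/one-hot impulse marking the event's time of occurrence
impulse = np.zeros(n_samples)
impulse[44100] = 1.0            # the event begins two seconds into the segment

# convolving with a Dirac impulse simply shifts the event to that position
scheduled = fftconvolve(impulse, event)[:n_samples]
```

Because the impulse is one-hot, the convolution is just a time shift; summing all scheduled events yields the reconstructed segment.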
We train on the MusicNet dataset for ~76 hours, selecting random ~6-second audio segments sampled at 22050 Hz (mono) with a batch size of 2. For each training sample, the model runs 32 iterations of the encoding step: analyze the residual spectrogram, emit an event vector and its time of occurrence, render and schedule the event, and subtract its energy from the residual.
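A rough sketch of that loop, with hypothetical `encoder` and `decoder` callables standing in for the actual model components:

```python
import torch

def decompose(spec: torch.Tensor, encoder, decoder, n_steps: int = 32):
    """Greedy, iterative decomposition: each step removes one event's energy."""
    residual = spec.clone()
    events, times = [], []
    for _ in range(n_steps):
        vector, time = encoder(residual)   # 32-dim event vector + one-hot time
        rendered = decoder(vector, time)   # event rendered and scheduled
        residual = residual - rendered     # remove that event's energy
        events.append(vector)
        times.append(time)
    return events, times, residual
```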
The model is trained to maximize the amount of energy removed from the original signal at each step, and to minimize an adversarial loss produced by a small, convolutional, down-sampling discriminator that is trained in parallel and analyzes both the real and reproduced signals in the STFT spectrogram domain. Half of the generated events are masked/removed when analyzed by the discriminator, encouraging each event vector to stand on its own as a realistic event.
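A sketch of that event masking, assuming a hypothetical per-event rendering tensor (this layout is an assumption, not the model's actual API):

```python
import torch

def mask_half_of_events(rendered: torch.Tensor) -> torch.Tensor:
    """Drop a random half of the per-event renderings before summing the 'fake'
    signal shown to the discriminator, so each event must sound realistic alone.

    rendered: (batch, n_events, n_samples)
    """
    batch, n_events, _ = rendered.shape
    keep = torch.zeros(batch, n_events, 1)
    for b in range(batch):
        kept = torch.randperm(n_events)[: n_events // 2]
        keep[b, kept] = 1.0
    return (rendered * keep).sum(dim=1)
```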
All training and model code can be found here.
When encoding, the entire ~6-second spectrogram is analyzed, but its second half is masked when choosing the next event. In this way, the model can slide along overlapping sections of audio and encode segments of arbitrary duration.
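A sketch of how such a sliding scheme might look, assuming a hypothetical `encode_segment` function that returns events only for the unmasked first half of its window:

```python
import numpy as np

def streaming_encode(audio: np.ndarray, segment_length: int, encode_segment):
    """Slide over the signal in half-segment hops; since each call only encodes
    the first half of its window, consecutive windows tile the full signal."""
    hop = segment_length // 2
    all_events = []
    for start in range(0, len(audio) - segment_length + 1, hop):
        window = audio[start:start + segment_length]
        for time, vector in encode_segment(window):
            all_events.append((start + time, vector))
    return all_events
```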
Recent experiments use a greedy, per-event loss which maximizes the energy removed from the signal at each step, as well as a learned, adversarial loss. Reconstruction quality will likely benefit from a more perceptually-aligned loss and a larger, more diverse dataset.
Firstly, this model is relatively small, weighing in at ~14M parameters (~80 MB on disk), and has only been trained for around 76 hours, so there is plenty of room to improve by increasing the model size, dataset size, and training time. The reconstruction quality of the examples on this page is not yet good enough even for a lossy audio codec, but the structure the model extracts seems like it could support many interesting applications, and future work will improve perceptual audio quality.
The decoder side of the model is very interesting, and all sorts of physical-modelling approaches could yield better, more realistic, and sparser renderings of the audio. For example, simple RNNs might serve as a natural alternative to the decoder used for the sound reproductions in this article.
If you'd like to cite this article, you can use the following BibTeX block.
In this latest iteration of the work, we introduce a "streaming" algorithm so that we can decompose audio segments of arbitrary lengths.
We mask the second half of the input audio to enable the streaming algorithm, so only the first half of the input audio is reproduced.
We can see that while energy is removed at each step, the removed segments do not map cleanly onto audio "events" as a human listener would typically conceive of them. Future work will move toward fewer and more meaningful events via induced sparsity and/or clustering of events.
Here, we generate random event vectors with the original event times.
Here we use the original event vectors, but generate random times.
Each event vector is "perturbed", or moved in the same direction in event space, by adding a random event vector with small magnitude, as sketched below.
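A minimal sketch of these three manipulations (the tensor shapes are assumptions, not the model's actual API):

```python
import torch

# assumed shapes: vectors (n_events, 32), times (n_events, n_frames) one-hot

def random_vectors(vectors, times):
    return torch.randn_like(vectors), times           # new events, original times

def random_times(vectors, times):
    new_times = torch.zeros_like(times)
    chosen = torch.randint(0, times.shape[1], (times.shape[0],))
    new_times[torch.arange(times.shape[0]), chosen] = 1.0
    return vectors, new_times                          # original events, new times

def perturb(vectors, times, scale=0.1):
    direction = scale * torch.randn(1, vectors.shape[1])  # one small random direction...
    return vectors + direction, times                     # ...applied to every event
```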
Different stopping conditions might be chosen during inference (e.g., the norm of the residual), but during training we remove energy for 32 steps. Each event vector is of dimension 32. The decoder generates an event from this vector, which is then scheduled.
Time is along the x-axis, and a 32D -> 1D projection of event vectors using t-SNE constitutes the distribution along the y-axis. Colors are produced via a random projection from 32D -> 3D (RGB). Here it becomes clear that there are many redundant/overlapping events. Future work will stress more sparsity and less event overlap, hopefully increasing interpretability further.
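A sketch of how such a scatter plot can be produced (using scikit-learn and matplotlib; the plotting details here are assumptions):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

vectors = np.random.randn(32, 32)     # stand-in for 32 event vectors of dimension 32
times = np.sort(np.random.rand(32))   # stand-in for times of occurrence, in seconds

# y-axis: 32D -> 1D t-SNE embedding of the event vectors
y = TSNE(n_components=1, perplexity=10).fit_transform(vectors)[:, 0]

# color: random linear projection from 32D -> 3D, squashed into valid RGB values
colors = 1.0 / (1.0 + np.exp(-vectors @ np.random.randn(32, 3)))

plt.scatter(times, y, c=colors)
plt.xlabel('time (seconds)')
plt.ylabel('1D t-SNE projection of event vector')
plt.show()
```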
This blog post is generated from a Python script using conjure.