TSA

The Strange Agency

Funky Diffusion

If we could guide our Stable Diffusion with audio, we could generate music videos. One, perhaps naive, way to try this is to create noise fields from audio spectra, and then use these with img2img diffusion.

Instead of creating a typical spectrogram, we map audio spectra to noise. We do this by chunking our audio into frame-length pieces and taking an FFT of the onset portion of each piece.

n_window_samples = math.floor(window_size * sample_rate)
amps = rfft(signal, n_window_samples)
amps = np.absolute(amps)

Nothing unusual there. We end up with a pile of frequency bins, which we can subsample into a smaller array or arbitrary size. Here we grab 90 bins, with logarithmic spacing, which gives us a more acoustically informative picture.

# create 90 logarithmically spaced indices, from 10 to n_window_samples / 2
indices = np.logspace(1, math.log(n_window_samples / 2, 10), num=90).astype(int)
image_amps = np.take(amps, indices)

The signal in each bin is then mapped onto image space using the Python hash function, with color determined by frequency bin and lightness by the magnitude of that bin.

We can then diffuse the images, using start_schedule=0.5 and a prompt like, “iceland vista high res photo 4k cinematic lighting anamorphic 16mm”. The seed is kept constant, so while each frame produces a different image, there is a reasonable amount of coherence between them.

There is a good deal of grain in the images, and bumping up the start_schedule to 0.75 helps eradicate that, while also making the image more interesting. However, we start to sacrifice consistency between images.

Changing the prompt to “hyperrealistic photography of an android moody fog cyberpunk”, we get less consistency, and the grain from the initial images is quite apparent.

Interestingly, increasing the start_schedule here actually creates more consistency between frames with similar frequency content, but less consistency overall.

As the grain is still quite visible, we perform a gaussian blur on the noisy source image before feeding that to the diffusion process. The effect is quite pronounced, yielding much smoother images. Hover there are still some flickery artifacts.

With a 0.75 start_schedule setting we get still less noise in the image, and we eliminate the artifacts.

Below we see frames generated at 15 fps, with a little FILM added for improved temporal coherence.

Music: James Brown & Clyde Stubblefield