Advection Diffusion

Feb 12, 2023

Controlling Stable Diffusion with point advection.

Process

Here we start with a generative animation created with Houdini. The animation frames are used as source images for Stable Diffusion.

By starting with an animation, we achieve reasonable temporal coherence between frames, as can be seen below. The diffusion results closely mimic the animation input.

In order to make the results match our music, we have to do a little tweaking. The original track runs at 128 bpm, which does not line up very nicely with the intended 30 fps animation.

Beats per minute: 128
Beats per second: 2.1333
Length of 1 bar (4 beats): 1.875 seconds
Length of 1 beat: 0.46875 seconds
At 30 fps: 14.0625 frames per beat
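
The figures above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the function name `frames_per_beat` is made up for illustration and is not part of any tool used here.

```python
def frames_per_beat(bpm: float, fps: float) -> float:
    """Number of video frames that elapse during one beat."""
    return fps * 60.0 / bpm


bpm, fps = 128, 30
print(f"Beats per second: {bpm / 60:.4f}")
print(f"Length of 1 bar (4 beats): {4 * 60 / bpm} seconds")
print(f"Length of 1 beat: {60 / bpm} seconds")
print(f"At {fps} fps: {frames_per_beat(bpm, fps)} frames per beat")
```

A non-integer result here means the beats drift against the frame grid, which is the problem the next step addresses.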

To fix this, we can slightly slow down the music so that beats line up cleanly with frames. Playing the original at 87.890625 percent of its speed (112.5 / 128) gives us a 112.5 bpm tempo. Since literally everything sounds better when it is slowed down, we lose nothing with this step.

sox -t wav input.128.wav -t wav output.112.5.wav speed 0.87890625
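
The speed factor passed to sox can be derived rather than typed by hand. The sketch below assumes we want exactly 16 frames per beat at 30 fps; the helper name `speed_factor` is hypothetical.

```python
def speed_factor(original_bpm: float, fps: float, target_frames_per_beat: int):
    """Return (target_bpm, factor) so that one beat spans a whole number of frames."""
    # Tempo at which one beat lasts exactly target_frames_per_beat frames.
    target_bpm = fps * 60.0 / target_frames_per_beat
    # Ratio of new tempo to old tempo; values below 1 slow the track down.
    return target_bpm, target_bpm / original_bpm


target_bpm, factor = speed_factor(original_bpm=128, fps=30, target_frames_per_beat=16)
print(target_bpm, factor)
```

The second value is the number used with the `speed` effect in the command above.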

The intention was to run two interpolation passes on the output, turning each frame into 4. At 112.5 bpm and 30 fps, we get a very convenient 16 frames per beat (30*60/112.5 = 16). Taking every fourth frame (pre-interpolation), we have 4 frames per beat, or 8 frames for every two beats (half a bar). The snare drum hits on every second beat, so we modulate the start_schedule parameter so that it jumps to 0.8 on every snare hit and then decays toward 0.4 over the following 8 frames.

start_schedule = 0.4 + (1 - ((i - 4) % 8) / 8) * .4

To quickly check the equation in action, we can print the sawtooth term for the first 16 frames. (Arrows added for clarity.)

>>> for i in range(16): print(i, 1-((i-4)%8)/8)
...
0 0.5
1 0.375
2 0.25
3 0.125
4 1.0 <------
5 0.875
6 0.75
7 0.625
8 0.5
9 0.375
10 0.25
11 0.125
12 1.0 <------
13 0.875
14 0.75
15 0.625
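
The listing above shows only the sawtooth term. As a rough sketch, the full schedule including the 0.4 offset and 0.4 scale can be wrapped in a function; the parameter names `base`, `depth`, `period`, and `offset` are introduced here for illustration and do not come from the original setup.

```python
def start_schedule(i: int, base: float = 0.4, depth: float = 0.4,
                   period: int = 8, offset: int = 4) -> float:
    """Sawtooth that jumps to base + depth on snare frames, then decays toward base."""
    # Python's % always returns a non-negative result for a positive divisor,
    # so frames before the first snare (i < offset) wrap around correctly.
    phase = ((i - offset) % period) / period
    return base + (1 - phase) * depth


for i in range(16):
    print(i, round(start_schedule(i), 3))
```

With these defaults the value peaks on frames 4, 12, 20, ... and reaches its lowest point on the frame just before each snare hit.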

And we can identify the backbeat frames, i.e. the frames where the snare drum hits, like so:

# snare drum frames
>>> for i in range(32):
...     if 1-((i-4)%8)/8 == 1:
...             print(i)
...
4
12
20
28
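
Because the snare frames are evenly spaced, they can also be generated directly and converted to timestamps for checking against the audio. This is a sketch that assumes the pre-interpolation frame rate of 7.5 fps (30 fps divided by 4); the function names are hypothetical.

```python
BASE_FPS = 7.5  # assumed pre-interpolation frame rate: 30 fps / 4


def snare_frames(n_frames: int, first: int = 4, period: int = 8) -> list:
    """Indices of frames that coincide with a snare hit."""
    return list(range(first, n_frames, period))


def frame_to_seconds(frame: int, fps: float = BASE_FPS) -> float:
    """Timestamp of a frame in the slowed-down (112.5 bpm) track."""
    return frame / fps


for frame in snare_frames(32):
    print(frame, round(frame_to_seconds(frame), 4))
```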

The results can be seen below, with the face coming into view every 8 frames (after the initial 4) and then melting into incoherence. We can also see how the underlying animation lends a temporal continuity to the imagery.

The last step is to scale back up from our 7.5 fps animation to a full 30 fps by running two passes of FILM interpolation, each of which doubles the frame rate. This further smooths out discontinuities between frames, yielding our final output.
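
The frame-rate bookkeeping for this step can be sketched as follows. The code does not call FILM; it only assumes that each interpolation pass inserts one new frame between every pair of adjacent frames. Both function names are made up for this example.

```python
def interpolated_fps(base_fps: float, passes: int) -> float:
    """Frame rate after repeatedly doubling the number of frames per second."""
    return base_fps * 2 ** passes


def interpolated_frame_count(n_frames: int, passes: int) -> int:
    """Total frames when each pass adds one frame between every adjacent pair."""
    # n frames have n - 1 gaps; each pass doubles the number of gaps.
    return (n_frames - 1) * 2 ** passes + 1


print(interpolated_fps(7.5, 2))
print(interpolated_frame_count(32, 2))
```

Under that assumption, two passes turn every gap between source frames into four, which is what brings the 4 frames per beat back up to 16.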