I forgot to mention that I only run 1 sample at a time (batch size 1), and that affects the step time.
It looks like each additional image in a batch is cheaper than the first one. For example, if I reduce my resolution so I can fit more images in a single batch:
1 image, 50 steps, 320x320: 5s
2 images, 50 steps, 320x320: 8s
3 images, 50 steps, 320x320: 11s
4 images, 50 steps, 320x320: 14s
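In these runs each extra image adds about 3s, versus 5s for the first, and the trend continues. My reported iterations/sec goes down as well, but that number is misleading: it doesn't account for the fact that with steps=50 and batch size=4 it's actually doing the work of 200 single-image steps, just 4 at a time in parallel.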
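
To make the difference concrete, here is a rough sketch that recomputes the numbers from the timings above. It assumes the reported rate is simply steps divided by wall-clock time (I haven't checked how the progress bar actually computes it), and the function names are just made up for this example.

```python
# Timings from the runs above: batch size -> total seconds (50 steps, 320x320).
TIMINGS = {1: 5.0, 2: 8.0, 3: 11.0, 4: 14.0}
STEPS = 50


def reported_its(steps, seconds):
    """Rate as a progress bar would show it (assumption: one iteration per batched step)."""
    return steps / seconds


def effective_its(steps, batch_size, seconds):
    """Rate counting every image's step separately (steps * batch size)."""
    return steps * batch_size / seconds


def seconds_per_image(batch_size, seconds):
    """Wall-clock cost of each image in the batch."""
    return seconds / batch_size


if __name__ == "__main__":
    for batch_size, seconds in TIMINGS.items():
        print(
            f"batch={batch_size}  "
            f"reported={reported_its(STEPS, seconds):.2f} it/s  "
            f"effective={effective_its(STEPS, batch_size, seconds):.2f} it/s  "
            f"per image={seconds_per_image(batch_size, seconds):.2f} s"
        )
```

With these timings the reported rate falls as the batch grows while the effective rate rises, so seconds per image is probably the fairer number to compare across batch sizes.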