I tried using a Docker container and it took 3 min to generate a prompt. However, it seems that the first 2:45 min is somehow spent on the CPU, and the GPU only gets utilized for the remaining 15 seconds.
I haven't had the time to look into this yet, but it does seem to work.
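For anyone who wants to narrow down where the time goes, here is a minimal sketch of timing each stage separately. The helper name `stage_timer` and the stage names are made up for illustration, and the `time.sleep` calls are placeholders for the real model-loading and generation calls, which I haven't profiled.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(name, results):
    """Record the wall-clock duration of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[name] = time.perf_counter() - start


timings = {}

# Placeholder stages: swap the sleeps for the actual load/generate calls.
with stage_timer("load_model", timings):
    time.sleep(0.02)

with stage_timer("generate", timings):
    time.sleep(0.01)

for name, seconds in timings.items():
    print(f"{name}: {seconds:.2f}s")
```

If most of the wall-clock time lands in the loading stage rather than in generation, that would point at startup rather than the GPU work itself, but that is only a guess until someone measures it.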
I've gotten it running with a Radeon RX 6800 on Ubuntu Linux 22.04 (by overwriting PyTorch with a ROCm-supporting version) and on Windows 10 (in a very barebones way, using ONNX). Are there better, more full-featured ways to get it running on Windows? Would love to know.
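In case it helps others on the Linux side, this is a rough sketch of checking which backend the installed PyTorch was built for after swapping in the ROCm version. The function name `describe_torch_backend` is my own; it relies on `torch.version.hip` being set on ROCm builds and on `torch.cuda.is_available()`, which ROCm builds also report through. It returns a placeholder result if PyTorch is not installed at all.

```python
import importlib.util


def describe_torch_backend():
    """Return a small dict describing the GPU backend of the installed PyTorch."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed in this environment.
        return {"installed": False, "backend": None, "gpu_available": False}

    import torch

    hip = getattr(torch.version, "hip", None)    # version string on ROCm builds
    cuda = getattr(torch.version, "cuda", None)  # version string on CUDA builds
    backend = "rocm" if hip else ("cuda" if cuda else "cpu")
    return {
        "installed": True,
        "backend": backend,
        "gpu_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(describe_torch_backend())
```

On a working ROCm setup I would expect the backend to come back as `rocm` with the GPU reported as available; if it says `cpu`, the original PyTorch build is probably still the one being imported.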