I’ve been playing around with every possible way to run Stable Diffusion on my 6700XT. In theory there are many potential solutions; in practice there are far fewer.
The traditional route is ONNX, which works, but with some drawbacks. There’s no well-documented path to FP16, and without it, it eats VRAM and easily exhausts even the 12GB on my 6700XT. The other issue is performance. The latter is the easiest to solve, as it comes down to the sedate pace of official ONNX DirectML Runtime releases: switch to the ORT nightly build and you get twice the speed.
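For reference, the baseline route looks roughly like this. A minimal sketch, assuming a local ONNX export of the model in ./stable_diffusion_onnx (a placeholder path) and the onnxruntime-directml package (or its nightly counterpart) installed:

```python
# Minimal sketch of the ONNX + DirectML route. The model directory and prompt
# are placeholders; onnxruntime-directml (or the nightly build) must be installed.
from diffusers import OnnxStableDiffusionPipeline

pipe = OnnxStableDiffusionPipeline.from_pretrained(
    "./stable_diffusion_onnx",
    provider="DmlExecutionProvider",  # run on the GPU via DirectML instead of the CPU
)
image = pipe("a photo of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```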
For the FP16 question I looked at all the available conversion scripts and eventually ran a “silly” experiment. In theory it seemed like I could just convert the model to mixed precision while keeping the inputs and outputs the same. You can then load it as a normal pipeline and, if all goes well, the Runtime runs the converted internals in FP16. That sounds almost too good to be true, right? Funnily enough, it isn’t. I converted the UNet (the big, slow part of it all) to FP16 and the Runtime happily ran it with a lot less VRAM and, on top of that, it was faster.
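As a rough sketch of that conversion (the path is a placeholder, and this assumes the onnx and onnxconverter-common packages are installed):

```python
# Rough sketch of the mixed-precision UNet conversion. The path is a placeholder
# pointing at the unet folder of a standard diffusers ONNX export.
import onnx
from onnxconverter_common import float16

unet = onnx.load("./stable_diffusion_onnx/unet/model.onnx")

# Convert weights and internal ops to FP16, but keep the graph's inputs and
# outputs as FP32 so the surrounding pipeline doesn't notice anything changed.
unet_fp16 = float16.convert_float_to_float16(unet, keep_io_types=True)

onnx.save(unet_fp16, "./stable_diffusion_onnx/unet/model_fp16.onnx")
```

Swap the converted file in for the original unet/model.onnx and the stock pipeline picks it up without any other changes.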
On 512×512 models the performance boost is modest, roughly 25% (from 2it/s to 2.5it/s). For 768×768 models it’s a life-or-death change for my 6700XT. Before, if I ran 2.0 (or now the new 2.1) at 768×768, it would take 7.3s per iteration, only to run out of VRAM when it hit the VAE stage; no image was ever finished. My new mixed-precision approach gets me 1.5s/it (roughly 5x faster!) and no VRAM issues. For convenience I’ll share the entire ONNX process on GitHub in the near future.
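In the meantime, the usage side is nothing special; a sketch (directory name and prompt are placeholders) with the converted UNet already swapped in for the original unet/model.onnx:

```python
# Sketch of running the mixed-precision pipeline at 768x768. Assumes the converted
# FP16 UNet has replaced unet/model.onnx inside the exported model directory.
from diffusers import OnnxStableDiffusionPipeline

pipe = OnnxStableDiffusionPipeline.from_pretrained(
    "./stable_diffusion_2_1_onnx",
    provider="DmlExecutionProvider",
)
image = pipe(
    "a misty mountain lake at dawn",
    width=768,
    height=768,
).images[0]
image.save("landscape_768.png")
```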
I have also tried some alternatives:
- Tensorflow-DirectML: There’s a Stable Diffusion implementation for Tensorflow HERE and there’s a recent release of Tensorflow-DirectML. The problem is that the Tensorflow implementation of Stable Diffusion is not maintained. While it’s easy to get it running, there’s no benefit over ONNX. Additionally, FP16 doesn’t work in Tensorflow-DirectML, making it a dead end even if you were willing to commit development time.
- PyTorch-DirectML: This would obviously be the theoretical holy grail, given diffusers and co. are built on PyTorch. We got Microsoft to update PyTorch-DirectML, but apparently they did not anticipate that we would expect to run Stable Diffusion on it. That effort hence quickly ended in nothing more than a (now acknowledged) issue on GitHub. Fingers crossed this eventually works.
And then finally there’s Nod AI’s SHARK. This is an MLIR/IREE-based implementation where everything gets compiled into high-performance models tailored to your hardware. From a performance and VRAM perspective, SHARK is a clear winner. It’s not the most practical option as of yet, though. You will need to install a specific driver for the best results (kindly provided by AMD directly on their website). It’s also not as versatile as ONNX: while with ONNX you can get nearly any model working, with SHARK you currently depend on the developers to release things, as there’s no model conversion option. From a speed perspective it does well, hitting 3.1it/s at 512×512 (using FP16), with likely some room for improvement still on the table.
The winner? If you have a somewhat decent card (6700XT and beyond) I’d go with ONNX until SHARK becomes more versatile. And maybe with FP16 you’ll be fine with ONNX on 8GB cards too. In the long run I hope to get SHARK’s performance and resource management without compromising on versatility.