CUDAnosaur


I wiped the dust off my six-year-old laptop, as it had a fairly forward-looking feature back in the day: 8GB of VRAM. The Wi-Fi is quirky and the disk is slow, but that 8GB belongs to a GTX 1070, which is … a CUDA card.

I figured people have been running Stable Diffusion on cards with less than 8GB, so a mobile GTX 1070 with 8GB sounds quite “capable”. And surprisingly, it is!

I saw a recent comment on the diffusers GitHub explaining how inefficient the ONNX pipelines actually are. They do work with DirectML, but at the cost of continuously copying data between CPU and GPU. I figured even an old CUDA card would stand a chance if it ran as efficiently as possible.
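For reference, the “run it straight on CUDA” route boils down to something like the minimal sketch below. The model id is just the stock Stable Diffusion 2 Base weights on the Hugging Face Hub, and the prompt and step count are placeholders.

```python
# Minimal sketch of the plain CUDA route with diffusers.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-base",
    torch_dtype=torch.float32,  # FP16 buys little on Pascal, so stay in FP32
)
pipe = pipe.to("cuda")           # everything stays on the GPU, no CPU round trips
pipe.enable_attention_slicing()  # keeps peak VRAM comfortably under 8GB

image = pipe("a dinosaur reading a CUDA manual", num_inference_steps=25).images[0]
image.save("cudanosaur.png")
```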

The model loading times are pretty poor, but once it actually gets to work it manages about 1.8 it/s. That is only about 10% short of what my 6700XT currently gets me with the same model (Stable Diffusion 2 Base), and that’s right out of the gate with little to no attempt at optimisation.

Given Nvidia purposely crippled FP16 throughput on the GTX 1070 and its siblings, I am not sure how much more performance I can wring out of it. But it definitely makes for an interesting test platform. It’ll let me test CUDA code, and I can use CUDA to export FP16 ONNX models. Maybe the latter will let me get more out of the 6700XT.
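For the export itself I’d lean on the conversion script in the diffusers repo (or optimum), but the gist of it for the UNet alone looks roughly like the sketch below. The wrapper class, shapes (64×64 latents, 1024-dim text embeddings for SD 2 Base) and opset are my assumptions; a usable pipeline also needs the text encoder and VAE exported the same way.

```python
# Hedged sketch: export just the SD 2 Base UNet to an FP16 ONNX graph on the CUDA card.
import torch
from diffusers import UNet2DConditionModel


class UNetWrapper(torch.nn.Module):
    """Return a plain tensor so torch.onnx.export can trace the graph."""

    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states):
        return self.unet(sample, timestep, encoder_hidden_states, return_dict=False)[0]


unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2-base", subfolder="unet", torch_dtype=torch.float16
).to("cuda").eval()

dtype, device = torch.float16, "cuda"
sample = torch.randn(2, 4, 64, 64, dtype=dtype, device=device)
timestep = torch.randn(2, dtype=dtype, device=device)
encoder_hidden_states = torch.randn(2, 77, 1024, dtype=dtype, device=device)

torch.onnx.export(
    UNetWrapper(unet),
    (sample, timestep, encoder_hidden_states),
    "unet_fp16.onnx",
    input_names=["sample", "timestep", "encoder_hidden_states"],
    output_names=["out_sample"],
    opset_version=14,
)
```

This is where the CUDA card earns its keep: tracing the model in half precision is only really practical with the weights sitting on a GPU.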
