This is a story about counting limbs, fingers and toes. A fairly trivial concept for us, but a major hurdle for AI. My ideas here are somewhat speculative but hopefully thought-provoking.
When depicting people with Stable Diffusion, the best way to get good results is to avoid limbs. Results with limbs are often nightmare fuel: extra limbs, or body parts growing into each other, are rarely a happy sight. You'd think enough pictures with limbs were in the training set, so why is it so hard to get right? Compare this with faces, for example, where Stable Diffusion often manages to do a far better job.
The thing is, we may not appreciate how much our knowledge of anatomy helps us fill in the blanks. At all times, and from many angles, plenty of body parts are partially or entirely hidden. For starters, by clothes: they hide a lot of our body, but we have a fairly good idea of what is anatomically there. A hand or arm behind someone's back is just fine. A thumb hidden because we're holding something? Easy too.
But what about faces? Well, they're rarely hidden and are made up of features that hardly move around. Arms and legs move all over the place; eyes, noses and mouths do not. They change shape, but they're firmly anchored location-wise. Looking at the hundreds of photographs of people I've taken, faces are also definitely the thing you'll learn the most about. So it isn't surprising that Stable Diffusion manages a fairly decent job of building faces.
So how about trying a workaround? If enough samples were in the training set, we might expect Stable Diffusion to do well with descriptive prompts. Hence the title of this post: from the emoji to Jean-Luc Picard, we all know what a facepalm should look like. And handily, it implies a fairly clear, unobstructed positioning of the limbs involved.
Unfortunately, Stable Diffusion quickly disagreed. I generated 850 pictures for my prompt, with limited luck. While it was clearly aiming for the intended outcome, it mostly produced nightmare fuel. Out of the 850, only 6 looked acceptably realistic. So I took those and had it create variations by changing the guidance scale and the number of steps. In the end I kept the 3 best.
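For the curious, here is roughly what such a run could look like using Hugging Face's diffusers library. This is a minimal sketch only: the model ID, the prompt wording and the parameter grid are placeholders, not my exact settings.

```python
import torch
from diffusers import StableDiffusionPipeline

# Sketch: re-render a kept seed while varying guidance scale and step
# count. Model ID, prompt and parameter ranges are illustrative.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "photo of a man doing a facepalm, studio lighting"
seed = 1234  # one of the seeds that produced an acceptable image

for guidance in (6.0, 7.5, 9.0, 12.0):
    for steps in (30, 50, 75):
        # Reseeding the generator keeps the initial noise identical,
        # so only guidance and steps change between variations.
        generator = torch.Generator("cuda").manual_seed(seed)
        image = pipe(
            prompt,
            guidance_scale=guidance,
            num_inference_steps=steps,
            generator=generator,
        ).images[0]
        image.save(f"facepalm_s{seed}_g{guidance}_n{steps}.png")
```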

These aren't perfect, but they can be cleaned up with minimal effort.
Further attempts at anything a bit harder all failed. Asking for hands holding something very recognisable, like a phone, doesn't give any decent results.
I am also a bit sceptical when it comes to endless prompt tuning. In most cases it doesn't really help me much. I generally fare better trying a few hundred seeds than trying to wring a good result out of the wrong seed; the loop I mean is sketched below. Maybe I'm just bad at it? If anyone has had really good results with prompts explicitly involving hands, I'd love to hear about it.
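The seed-first approach is really just this, again sketched against diffusers with a placeholder model ID and prompt:

```python
import torch
from diffusers import StableDiffusionPipeline

# Sketch of a seed sweep: fixed prompt and settings, many seeds,
# then pick the winners by eye. Model ID and prompt are illustrative.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "photo of a man doing a facepalm, studio lighting"

for seed in range(300):
    generator = torch.Generator("cuda").manual_seed(seed)
    image = pipe(prompt, generator=generator).images[0]
    image.save(f"candidate_{seed:04d}.png")
```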