Computational biologist, data scientist, digital artist
This is amazing: Diffusion models have internal depth and foreground/background maps. This suggests these models build an internal 3D representation of the scene and then render it.
Paper is from a few months ago, but I only encountered it today.
yc015.github.io/scene-representation-diffusion-model/