I'm impressed but skeptic, too. Laypeople might not get an accurate impression on what is possible.
Considering the image with the guy in a black shirt playing guitar. What if it was a picture of a naked man standing next to a black shirt that is hanging on a clothes hook, and next to it is a guitar is mounted on a rack: My guess it that the computer will still spit out: Guy in black shirt playing guitar, just because this sentence is very plausible in the underlying language model.