Thanks a lot for sharing. I liked the explanation, although as far as I understand the match between attention and a kernel isn't perfect: x_o corresponds to Qx, x_i to Kx, and y_i to Vx, but that doesn't map cleanly onto the Wu · Wv form.
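To make what I mean concrete, here is a toy sketch of attention written as kernel smoothing (the names W_q/W_k/W_v, the shapes, and the 1/sqrt(d) scaling are my own assumptions, not from the article): the "kernel" exp((W_q u)·(W_k v)) isn't symmetric, since the query and key projections are different matrices.

    # Toy sketch: single-head attention as kernel smoothing (assumed names/shapes).
    import numpy as np

    d = 8                                    # model dimension (arbitrary)
    rng = np.random.default_rng(0)
    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

    def kernel(u, v):
        # exp((W_q u) . (W_k v)) -- asymmetric, because W_q != W_k
        return np.exp((W_q @ u) @ (W_k @ v) / np.sqrt(d))

    def attend(x_o, xs):
        # kernel-smoothing form: normalized weighted average of values y_i = W_v x_i
        weights = np.array([kernel(x_o, x_i) for x_i in xs])
        weights /= weights.sum()
        values = np.array([W_v @ x_i for x_i in xs])
        return weights @ values

    xs = rng.normal(size=(5, d))             # a toy "sequence" of 5 tokens
    print(attend(xs[0], xs))                 # output for the first token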
Anyway, just curious: do you or anyone else have more sources in this format?
At least this way it is very helpful to think about the methods. I kind of agree that the formula does look very similar to well-known methods. But on the other hand, the author doesn't explain transformers in a similarly simple way, since it is not obvious why you can stack kernels and get better results.
"Again: Calling this "attention" at best a joke."
http://bactra.org/notebooks/nn-attention-and-transformers.ht...