Also, I think this thread is a good place to complain about the paper. The model is not described clearly. For example, try to find the size of the input vector for the model in the paper: it is not specified. There is also a misleading phrase:
All sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
which makes the reader think that each Transformer block gets a 512-dimensional vector as input and produces 512 numbers at the output. But this is wrong: 512 numbers represent just a single word, not the entire text or internal state. The block actually operates on a whole sequence of such vectors at once. I could not understand this from reading just the original paper.
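For anyone else who was confused by this: here is a minimal NumPy sketch of my reading (the shapes and d_model = 512 / d_ff = 2048 are from the paper; the random weights and sequence length of 10 are just for illustration). Each sub-layer maps a (seq_len, 512) matrix to a (seq_len, 512) matrix, one row per token:

```python
import numpy as np

# The block input is not a single 512-vector but a whole sequence:
# one 512-dimensional vector per token.
seq_len, d_model, d_ff = 10, 512, 2048   # 10 tokens is an arbitrary example
x = np.random.randn(seq_len, d_model)

# A sub-layer (here the position-wise feed-forward network, biases omitted)
# transforms each token's 512 numbers; the sequence length is untouched.
W1 = np.random.randn(d_model, d_ff)
W2 = np.random.randn(d_ff, d_model)
out = np.maximum(x @ W1, 0) @ W2         # FFN(x) = max(0, x W1) W2

print(out.shape)                         # (10, 512): same shape in, same shape out
```

So "outputs of dimension d_model = 512" refers to the per-token vector size, not the size of the whole block's output.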
Also, it is not written where the keys, queries, and values for attention come from.
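As far as I can tell (this is my reconstruction, not spelled out in the paper's text), in self-attention Q, K, and V all come from the same input, via three separately learned linear projections; only in the decoder's encoder-decoder attention do K and V come from the encoder output while Q comes from the decoder state. A sketch with the paper's d_k = d_model / h = 64 for h = 8 heads:

```python
import numpy as np

seq_len, d_model, d_k = 10, 512, 64      # d_k = d_model / h with h = 8 heads
x = np.random.randn(seq_len, d_model)    # e.g. encoder self-attention input

# In self-attention, all three are projections of the SAME input x.
Wq = np.random.randn(d_model, d_k)
Wk = np.random.randn(d_model, d_k)
Wv = np.random.randn(d_model, d_k)
Q, K, V = x @ Wq, x @ Wk, x @ Wv

# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
out = weights @ V

print(out.shape)                         # (10, 64): one d_k-vector per query token
```

In encoder-decoder attention you would instead compute K and V from the encoder output and Q from the decoder's previous layer; the arithmetic is otherwise identical.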