Also, I think this thread is a good place to complain about the paper. The model is not described clearly. For example, try to find the size of the input vector for the model in the paper: it is not specified. There is also a misleading phrase:
All sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
which makes the reader think that each Transformer block gets a 512-dimensional vector as input and produces 512 numbers at the output. But this is wrong: 512 numbers represent just a single word, not the entire text or internal state. The block actually operates on a whole sequence of such vectors at once. I could not understand this from reading just the original paper.
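For anyone else who was confused by this: here is a minimal NumPy sketch of my reading (the shapes and d_model = 512 / d_ff = 2048 are from the paper; the random weights and sequence length of 10 are just for illustration). Each sub-layer maps a (seq_len, 512) matrix to a (seq_len, 512) matrix, one row per token:

```python
import numpy as np

# The block input is not a single 512-vector but a whole sequence:
# one 512-dimensional vector per token.
seq_len, d_model, d_ff = 10, 512, 2048   # 10 tokens is an arbitrary example
x = np.random.randn(seq_len, d_model)

# A sub-layer (here the position-wise feed-forward network, biases omitted)
# transforms each token's 512 numbers; the sequence length is untouched.
W1 = np.random.randn(d_model, d_ff)
W2 = np.random.randn(d_ff, d_model)
out = np.maximum(x @ W1, 0) @ W2         # FFN(x) = max(0, x W1) W2

print(out.shape)                         # (10, 512): same shape in, same shape out
```

So "outputs of dimension d_model = 512" refers to the per-token vector size, not the size of the whole block's output.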
Also, it is not written where the keys, queries, and values for attention come from.
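As far as I can tell (this is my reconstruction, not spelled out in the paper's text), in self-attention Q, K, and V all come from the same input, via three separately learned linear projections; only in the decoder's encoder-decoder attention do K and V come from the encoder output while Q comes from the decoder state. A sketch with the paper's d_k = d_model / h = 64 for h = 8 heads:

```python
import numpy as np

seq_len, d_model, d_k = 10, 512, 64      # d_k = d_model / h with h = 8 heads
x = np.random.randn(seq_len, d_model)    # e.g. encoder self-attention input

# In self-attention, all three are projections of the SAME input x.
Wq = np.random.randn(d_model, d_k)
Wk = np.random.randn(d_model, d_k)
Wv = np.random.randn(d_model, d_k)
Q, K, V = x @ Wq, x @ Wk, x @ Wv

# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
out = weights @ V

print(out.shape)                         # (10, 64): one d_k-vector per query token
```

In encoder-decoder attention you would instead compute K and V from the encoder output and Q from the decoder's previous layer; the arithmetic is otherwise identical.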