Transformer is a building block (a part) of a language model. A "language model" is an algorithm that can predict the words following given words. For example, you can give a text to a model and get a summary of the text, an answer to a question in the text, or a translation of the text.
Language models are often made of two parts: an encoder and a decoder. The encoder reads the input text (each word is encoded as a bunch of numbers, for example, as a list of 512 floating-point numbers) and produces a "state" (also a large list of numbers) which is expected to encode the meaning of the text. Then the decoder reads the state and produces the output as words (to be exact, as probabilities for every possible word in the dictionary to be at a certain position in the output).
Before Transformers, people tended to use the so-called "recurrent neural network" (RNN) architecture. With this approach, the encoder processes the text word by word and updates the state after every word:
state = some initial state
for word in text:
    state = model(state, word)
model(...) here is a complicated mathematical function, often with millions of operations and parameters.
As I have written above, after reading the text, the state should encode the meaning of the text.
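The loop above can be made runnable. The "model" here is a trivial stand-in (a real RNN cell has millions of learned parameters); it only shows the data flow: one word at a time, state updated after each word.

```python
D = 4

def embed(word):
    # toy word encoding derived from character codes; not learned
    return [ord(c) % 7 / 7.0 for c in word[:D].ljust(D)]

def model(state, word_vec):
    # toy state update: mix the old state with the new word's vector
    return [0.5 * s + 0.5 * w for s, w in zip(state, word_vec)]

state = [0.0] * D  # some initial state
for word in "the cat sat".split():
    state = model(state, embed(word))

print(state)  # the final state stands in for the meaning of the text
```

Note how each old value is halved at every step: this is also a crude picture of why early words fade from the state, which is the problem discussed next.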
But it turned out that this approach doesn't scale well to long or complicated texts, because information from the beginning of the text gets lost: the model tends to "forget" what it read earlier. So a new architecture, the "Transformer", was proposed. The difference is that we now give the entire text (each word encoded as a bunch of numbers) to the model:
state = model(input text)
Now the model processes the whole text at once. But implementing this naively would result in a very large model with too many parameters, requiring too much memory and computing time. So the developers used a trick: most of the time, each input word is processed separately from the others (as in the recurrent model), but there are stages, called "attention", where the words are processed together (and those stages are relatively cheap), so it looks like this:
# stage where all text is processed at once
# using quick algorithm
state1 = attention(input text)
# stage where each part of state is processed independently
# with a lot of heavy calculations
state2 = map(some function, state1)
state3 = attention(state2)
state4 = map(some function, state3)
...
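The staged structure above can be sketched as real code. The attention function here is a bare-bones similarity-weighted average (real attention adds learned weights, scaling, and multiple heads), and some_function is just a ReLU stand-in; the point is only which stages mix information across words and which process each position independently.

```python
import math

def attention(xs):
    # each output position is a weighted average of ALL input vectors,
    # weighted by similarity to the vector at that position
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) for k in xs]
        exps = [math.exp(s) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[i] for w, v in zip(weights, xs))
                    for i in range(len(q))])
    return out

def some_function(x):
    # per-position processing: sees only one position at a time
    return [max(0.0, v) for v in x]

text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 "words", 2 numbers each
state1 = attention(text)
state2 = list(map(some_function, state1))
state3 = attention(state2)
state4 = list(map(some_function, state3))
```

Only the attention stages compare every word with every other word; the map stages never look across positions, which is what keeps them cheap to parallelize.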
To summarize, in Transformers the model processes the text all at once, but we have to employ tricks and split the processing into stages to make the calculation feasible. That is probably why some people believe the authors deserve an award for their work.
I think this explanation is as far as one can get without learning ML.
Also, I think this thread is a good place to complain about the paper. The model is not described clearly. For example, try to find the size of the input data vector for the model in the paper - it is not specified. There is also a misleading phrase:
All sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
which makes the reader think that each block (Transformer) gets a 512-dimensional vector as input and produces 512 numbers at the output. But this is wrong: 512 numbers is just a single word, not the entire text or the internal state. I could not understand this from reading the original paper alone.
Also, it is not written where the keys, queries, and values for attention come from.
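For what it's worth, the answer is: in the encoder's self-attention, the queries, keys, and values are all computed from the same input vectors through three different learned matrices (in the decoder's cross-attention, queries come from the decoder side while keys and values come from the encoder output). A sketch with random matrices standing in for the learned ones:

```python
import random

random.seed(1)
d = 4  # per-word vector size (512 in the paper)

def rand_matrix(n, m):
    # stand-in for a learned projection matrix
    return [[random.gauss(0, 0.1) for _ in range(m)] for _ in range(n)]

def matvec(W, x):
    return [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]

W_q, W_k, W_v = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)

x = [1.0, 2.0, 3.0, 4.0]  # one word's vector
q = matvec(W_q, x)        # query:  what this word is looking for
k = matvec(W_k, x)        # key:    what this word offers for matching
v = matvec(W_v, x)        # value:  what this word contributes if matched
```

So a single 512-number word vector spawns three different views of itself, and attention then matches queries against keys to decide how much of each value to mix in.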