Transformer is a building block (a part) of a language model. A "language model" is an algorithm that can predict the words following given words. For example, you can give a text to a model and get a summary of the text, an answer to a question in the text, or a translation of the text.
Language models are often made of two parts: an encoder and a decoder. The encoder reads the input text (each word is encoded as a bunch of numbers, for example, as a list of 512 floating-point numbers) and produces a "state" (also a large list of numbers) which is expected to encode the meaning of the text. Then the decoder reads the state and produces the output as words (to be exact, as probabilities for every possible word in the dictionary to be at a certain position in the output).
Before Transformers, people tended to use the so-called "recurrent neural network" (RNN) architecture. With this approach, the encoder processes the text word by word and updates the state after every word:
state = some initial state
for word in text:
    state = model(state, word)
model(...) here is a complicated mathematical function, often with millions of operations and parameters.
As I have written above, after reading the text, the state should encode the meaning of the text.
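The loop above can be made runnable. The "model" here is a trivial stand-in (a real RNN cell has millions of learned parameters); it only shows the data flow: one word at a time, state updated after each word.

```python
D = 4

def embed(word):
    # toy word encoding derived from character codes; not learned
    return [ord(c) % 7 / 7.0 for c in word[:D].ljust(D)]

def model(state, word_vec):
    # toy state update: mix the old state with the new word's vector
    return [0.5 * s + 0.5 * w for s, w in zip(state, word_vec)]

state = [0.0] * D  # some initial state
for word in "the cat sat".split():
    state = model(state, embed(word))

print(state)  # the final state stands in for the meaning of the text
```

Note how each old value is halved at every step: this is also a crude picture of why early words fade from the state, which is the problem discussed next.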
But it turned out that this approach doesn't scale well to long or complicated texts, because information from the beginning of the text gets lost: the model tends to "forget" what it read earlier. So a new architecture, the "Transformer", was proposed. The difference is that we now give the entire text (each word encoded as a bunch of numbers) to the model:
state = model(input text)
Now the model processes the whole text at once. But implementing this naively would result in a very large model with too many parameters, requiring too much memory and computing time. So the developers used a trick: most of the time, each input word is processed separately from the others (as in the recurrent model), but there are stages, called "attention", where the words are processed together (and those stages are relatively cheap), so it looks like this:
# stage where all text is processed at once
# using quick algorithm
state1 = attention(input text)
# stage where each part of state is processed independently
# with a lot of heavy calculations
state2 = map(some function, state1)
state3 = attention(state2)
state4 = map(some function, state3)
...
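The staged structure above can be sketched as real code. The attention function here is a bare-bones similarity-weighted average (real attention adds learned weights, scaling, and multiple heads), and some_function is just a ReLU stand-in; the point is only which stages mix information across words and which process each position independently.

```python
import math

def attention(xs):
    # each output position is a weighted average of ALL input vectors,
    # weighted by similarity to the vector at that position
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) for k in xs]
        exps = [math.exp(s) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[i] for w, v in zip(weights, xs))
                    for i in range(len(q))])
    return out

def some_function(x):
    # per-position processing: sees only one position at a time
    return [max(0.0, v) for v in x]

text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 "words", 2 numbers each
state1 = attention(text)
state2 = list(map(some_function, state1))
state3 = attention(state2)
state4 = list(map(some_function, state3))
```

Only the attention stages compare every word with every other word; the map stages never look across positions, which is what keeps them cheap to parallelize.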
To summarize, in Transformers the model processes the text all at once, but we have to employ tricks and split the processing into stages to make the calculation feasible. That is probably why some people believe the authors deserve an award for their work.
I think this explanation is as far as one can get without learning ML.
Also, I think this thread is a good place to complain about the paper. The model is not described clearly. For example, try to find the size of the input data vector for the model in the paper - it is not specified. There is also a misleading phrase:
All sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
which makes the reader think that each block (Transformer) gets a 512-dimensional vector as input and produces 512 numbers at the output. But this is wrong: 512 numbers is just a single word, not the entire text or the internal state. I could not understand this from reading the original paper alone.
Also, it is not written where the keys, queries, and values for attention come from.
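For what it's worth, the answer is: in the encoder's self-attention, the queries, keys, and values are all computed from the same input vectors through three different learned matrices (in the decoder's cross-attention, queries come from the decoder side while keys and values come from the encoder output). A sketch with random matrices standing in for the learned ones:

```python
import random

random.seed(1)
d = 4  # per-word vector size (512 in the paper)

def rand_matrix(n, m):
    # stand-in for a learned projection matrix
    return [[random.gauss(0, 0.1) for _ in range(m)] for _ in range(n)]

def matvec(W, x):
    return [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]

W_q, W_k, W_v = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)

x = [1.0, 2.0, 3.0, 4.0]  # one word's vector
q = matvec(W_q, x)        # query:  what this word is looking for
k = matvec(W_k, x)        # key:    what this word offers for matching
v = matvec(W_v, x)        # value:  what this word contributes if matched
```

So a single 512-number word vector spawns three different views of itself, and attention then matches queries against keys to decide how much of each value to mix in.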