I understand what you mean, but please understand that this code is targeted at people who already have some background knowledge, such as having read the seminal Transformer paper, "Attention Is All You Need" (https://arxiv.org/abs/1706.03762).
Most of the code becomes really straightforward once you have. Many of the magic constants are the result of multi-page proofs (like the GELU constant) that would be impractical to reproduce in the code.
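To illustrate the GELU case: the constant that shows up in GPT-style code is sqrt(2/π) ≈ 0.7978845608, from the tanh-based approximation of the Gaussian CDF proposed by Hendrycks & Gimpel; the 0.044715 cubic correction term was fitted, not derived in the code itself. A minimal sketch in plain Python comparing the exact erf-based GELU with the tanh approximation:

```python
import math

def gelu_exact(x: float) -> float:
    # GELU(x) = x * Phi(x), where Phi is the standard normal CDF,
    # written exactly via the error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh_approx(x: float) -> float:
    # The "magic constant" sqrt(2/pi) ~= 0.7978845608 comes from the
    # tanh approximation of the normal CDF; 0.044715 is a fitted
    # cubic correction (Hendrycks & Gimpel, 2016).
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))

# The two agree to within roughly 1e-3 across typical activation ranges.
for x in (-3.0, -1.0, 0.0, 0.5, 1.0, 3.0):
    print(x, gelu_exact(x), gelu_tanh_approx(x))
```

The point is that the derivation behind those two numbers spans pages in the paper, while the code just uses the results; a comment with a citation is about as much as a repo can reasonably carry.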
Deep learning research really is a field that requires some amount of background knowledge, and it's normal not to immediately understand state-of-the-art code. Here is the GPT-2 paper: https://d4mucfpksywv.cloudfront.net/better-language-models/l...