
I'm not a ML expert but I know a bit about math.

It's "differentiable" in the same way that e.g. the "jump function" (Heaviside step function) is differentiable: not as a function from real numbers to real numbers, but as a distribution. Its derivative is the "point impulse function" (Dirac delta function), which, again, is a distribution, not a real function.

Distributions are well defined in math, but you can't really work with them numerically (at least not in the same way as real/float functions). You can, however, approximate them using continuous functions. So instead of having a function jump from 0 to 1 at a single point, you "spread" the jump and implement it as a continuous transition from 0 to 1 over the interval from `0-epsilon` to `0+epsilon` for some tiny epsilon. Then you can differentiate it as usual, even numerically.
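To make the "spread the jump" idea concrete, here is a minimal Python sketch. The names `smooth_step` and `numeric_derivative` are mine, and the linear ramp is just one possible choice of continuous transition (a sigmoid would work too). The point is only that the derivative becomes a tall, narrow bump of height `1/(2*eps)` that stands in for the Dirac delta.

```python
def smooth_step(x, eps=1e-3):
    """Continuous stand-in for the Heaviside step.

    Ramps linearly from 0 to 1 over [-eps, +eps] instead of jumping at 0.
    The linear ramp is an illustrative assumption, not the only option.
    """
    if x <= -eps:
        return 0.0
    if x >= eps:
        return 1.0
    return (x + eps) / (2 * eps)


def numeric_derivative(f, x, h=1e-6):
    """Central finite difference; h should be well below eps."""
    return (f(x + h) - f(x - h)) / (2 * h)


# Inside the ramp the slope is 1 / (2 * eps); outside it is 0.
slope_at_jump = numeric_derivative(smooth_step, 0.0)
slope_far_away = numeric_derivative(smooth_step, 1.0)
```

As `eps` shrinks, the bump gets taller and narrower while its area stays 1, which is the usual picture of a sequence of functions approaching the delta.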

Similarly, a hash table lookup is a discontinuous function: the result of `hash.get(lookup)` is just `value` (or `null`). To make it continuous, you "spread" the value, so that nearby keys (for some definition of "nearby") will return nearby values.

One way to do this is to take the scalar product between `lookup` and every key in the hash table (with normalized vectors, the scalar product is close to 1 if the arguments are "nearby"), and use the results as weights in a weighted sum over all values in the hash table. That's essentially what attention in a transformer does.
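A rough sketch of such a "soft" lookup in plain Python, under some assumptions of mine: keys and values are lists of floats, similarity is the normalized scalar product (cosine similarity), and the similarities are passed through a softmax so the weights are positive and sum to 1. The function name `soft_lookup` and the `sharpness` parameter are hypothetical. Real transformers use learned projections and a scaled, unnormalized dot product, so treat this as an illustration of the idea rather than the actual attention formula.

```python
import math


def soft_lookup(query, keys, values, sharpness=10.0):
    """Return a similarity-weighted blend of all values.

    Every key contributes; keys pointing in nearly the same direction
    as the query dominate. `sharpness` (an assumed knob) controls how
    close this gets to a hard, exact-match lookup.
    """
    def norm(v):
        return math.sqrt(sum(a * a for a in v))

    # Cosine similarity between the query and each key.
    sims = [
        sum(q * k for q, k in zip(query, key)) / (norm(query) * norm(key))
        for key in keys
    ]

    # Softmax over the similarities (shifted by the max for stability).
    top = max(sims)
    exps = [math.exp(sharpness * (s - top)) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the value vectors, one component at a time.
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0], [20.0]]
near_first_key = soft_lookup([1.0, 0.0], keys, values)
halfway_between = soft_lookup([1.0, 1.0], keys, values)
```

A query that matches the first key returns almost exactly the first value, and a query halfway between the two keys returns the average of the two values. Because the output changes smoothly as the query moves, you can take gradients through the lookup.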



Thanks for this explanation. I couldn't wrap my mind around the "differentiable hash table" analogy, but "distribution of keys" -> "distribution of values" starts to click.

I'm not an ML expert either but I have taken graduate level courses and published papers with "machine learning" in the title, so I feel like I should be able to understand these things better. The field just moves so fast. It's a lot of work to keep up. Easy-to-digest explanations like this are underrated.


>The field just moves so fast. It's a lot of work to keep up. Easy-to-digest explanations like this are underrated.

This is really the truth. I honestly can't understand how even the talented people in this field keep up. I have a binder full of seminal papers that I have to cull every few months to make room for more recent and relevant research. I feel there is a lot of potential in simplifying the details of the mechanisms that drive a lot of it, but nobody has time to stop, consolidate the information, and publish it. And if they did, it would just be another outdated textbook in a few years.


Thank you. This made it click.



