Hacker News

An LSTM is slightly different from a standard feedforward unit: an LSTM unit has internal state that depends on past inputs. Gradient descent trains the LSTM to learn not only which output to produce for a given input, but also how to update the internal state given an input. That the whole thing should still be differentiable is not intuitive to me.

edit: It's kind of like, if not actually equivalent to, programming a Turing machine by gradient descent on training data.
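To see why the state update stays differentiable: a minimal numpy sketch of one LSTM step (gate names and the parameter packing here are my own illustration, not any particular library's layout). Every operation in the update — sigmoid, tanh, elementwise multiply, add — is smooth, so the new cell state is differentiable with respect to the weights, the input, and the previous state, and unrolling over time just composes these smooth functions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    # Hypothetical packing: W maps [x; h_prev] to the four gate
    # pre-activations (input, forget, output, candidate), stacked.
    z = W @ np.concatenate([x, h_prev]) + b
    n = h_prev.size
    i = sigmoid(z[0*n:1*n])   # input gate: how much new info to write
    f = sigmoid(z[1*n:2*n])   # forget gate: how much old state to keep
    o = sigmoid(z[2*n:3*n])   # output gate: how much state to expose
    g = np.tanh(z[3*n:4*n])   # candidate values for the cell state
    c = f * c_prev + i * g    # state update: smooth in every argument
    h = o * np.tanh(c)        # hidden state / output
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.normal(size=(4 * n_hid, n_in + n_hid))
b = np.zeros(4 * n_hid)
h = np.zeros(n_hid)
c = np.zeros(n_hid)
for t in range(5):            # unroll over a short input sequence
    x = rng.normal(size=n_in)
    h, c = lstm_step(x, h, c, W, b)
```

Because the whole unrolled computation is a composition of these differentiable pieces, backpropagation through time can assign a gradient to every weight, including the ones governing the state update.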


