An LSTM is slightly different from a standard feedforward unit: an LSTM unit has internal state that depends on past inputs. Gradient descent trains the LSTM not only on which output to produce for a given input, but also on how to update the internal state given an input. That the whole thing is still differentiable is not intuitive to me.
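To make the point concrete, here is a minimal sketch of a single LSTM cell step in NumPy (the standard gated formulation; the stacked weight layout and toy sizes are my own illustrative choices). Every operation in the state update is a matmul, sigmoid, tanh, or elementwise product, so the whole recurrence stays differentiable and gradients can flow through the internal state `c` across time steps:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM cell step. Gates decide how the internal state c is
    updated, and all operations involved are differentiable."""
    z = np.concatenate([x, h_prev]) @ W + b  # all four gate pre-activations
    H = h_prev.size
    i = sigmoid(z[0:H])        # input gate: how much new info to write
    f = sigmoid(z[H:2*H])      # forget gate: how much old state to keep
    o = sigmoid(z[2*H:3*H])    # output gate: how much state to expose
    g = np.tanh(z[3*H:4*H])    # candidate state
    c = f * c_prev + i * g     # new internal state (depends on past inputs)
    h = o * np.tanh(c)         # new output
    return h, c

# Run a few steps to show state carrying across time.
rng = np.random.default_rng(0)
X, H = 3, 4                            # toy input / hidden sizes
W = rng.normal(size=(X + H, 4 * H)) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for t in range(5):
    h, c = lstm_step(rng.normal(size=X), h, c, W, b)
```

Training then amounts to backpropagating through this unrolled loop, which adjusts both the input-to-output mapping and the state-update behavior at once.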
edit: It's kind of like, if not actually equivalent to, programming a Turing machine by its gradient on training data.