The structure of the LSTM is shown in the image above.
Items in the Structure
Inputs $\boldsymbol{z}(t)=(\boldsymbol{x}(t), \boldsymbol{y}(t-1))$
The input $\boldsymbol z(t)$ concatenates the current input $\boldsymbol x(t)$ with the output of the previous time step, $\boldsymbol y(t-1)$.
Network Updates ($W_*$ are the free parameters)
$\boldsymbol f(t)$ is the forget gate; $\boldsymbol g(t)$ is the input gate, which decides whether and what to remember from the current input; $\boldsymbol h(t)$ is the input block; and $\boldsymbol o(t)$ is the output gate.
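Written out, a common bias-free form of these updates is given below. The logistic sigmoid $\sigma$ and the weight names $W_f, W_g, W_h, W_o$ are assumptions taken from the standard LSTM formulation, not from the image; many implementations also add a bias term to each line.

$$
\begin{aligned}
\boldsymbol f(t) &= \sigma\big(W_f\, \boldsymbol z(t)\big) \\
\boldsymbol g(t) &= \sigma\big(W_g\, \boldsymbol z(t)\big) \\
\boldsymbol h(t) &= \tanh\big(W_h\, \boldsymbol z(t)\big) \\
\boldsymbol o(t) &= \sigma\big(W_o\, \boldsymbol z(t)\big)
\end{aligned}
$$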
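Long-term memory update $\boldsymbol{c}(t)=\boldsymbol{f}(t) \otimes \boldsymbol{c}(t-1)+\boldsymbol{g}(t) \otimes \boldsymbol{h}(t)$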
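It can be seen as the effect of the forget gate on the previous memory state plus the new memory update.
(The forget gate decides how much of the previous hidden state to keep; adding what the selective-memory stage contributes at this step gives the next memory state $\boldsymbol c(t)$.)
Output $\boldsymbol{y}(t)=\boldsymbol{o}(t) \otimes \tanh (\boldsymbol{c}(t))$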
The output is the element-wise product of the output gate and $\tanh(\boldsymbol c(t))$.
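The whole step, from the combined input to the output, can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: it assumes no bias terms, a logistic sigmoid for the gates, and the standard memory update $\boldsymbol c(t)=\boldsymbol f(t)\otimes\boldsymbol c(t-1)+\boldsymbol g(t)\otimes\boldsymbol h(t)$. The function name `lstm_step` and the sizes `n_x`, `n_y` are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, y_prev, c_prev, W_f, W_g, W_h, W_o):
    """One LSTM time step; biases are omitted (an assumption of this sketch)."""
    z = np.concatenate([x, y_prev])   # z(t) = (x(t), y(t-1))
    f = sigmoid(W_f @ z)              # forget gate
    g = sigmoid(W_g @ z)              # input gate
    h = np.tanh(W_h @ z)              # input block (candidate values)
    o = sigmoid(W_o @ z)              # output gate
    c = f * c_prev + g * h            # long-term memory update
    y = o * np.tanh(c)                # output
    return y, c

rng = np.random.default_rng(0)
n_x, n_y = 3, 4                       # hypothetical input and output sizes
W_f, W_g, W_h, W_o = (
    rng.normal(scale=0.1, size=(n_y, n_x + n_y)) for _ in range(4)
)
x = rng.normal(size=n_x)
y, c = lstm_step(x, np.zeros(n_y), np.zeros(n_y), W_f, W_g, W_h, W_o)
print(y.shape, c.shape)
```

Because every gate lies in $(0, 1)$ and is applied element-wise, each component of the memory is forgotten, written, and exposed independently.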
Note
We can train an LSTM by unrolling (unwrapping) it in time and backpropagating through the unrolled network.
Each step involves four dense layers with sigmoidal (or tanh) outputs: the three gates and the input block.
LSTMs are typically very slow to train.
There are a few variants of the LSTM, but all are very similar. The most popular is probably the Gated Recurrent Unit (GRU).
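To make the unrolling in time concrete, here is a rough forward-pass sketch. It stacks the four dense layers into one matrix `W` and reuses it at every time step, which is the sharing that backpropagation through time has to account for. The function name `lstm_unrolled`, the stacked layout of `W`, and the absence of biases are assumptions of this sketch; only the forward pass is shown, not the gradient computation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_unrolled(xs, W, n_y):
    """Run one LSTM over a sequence xs of shape (T, n_x).

    W stacks the four dense layers (f, g, h, o) into one matrix of shape
    (4 * n_y, n_x + n_y). The same W is used at every time step.
    """
    y = np.zeros(n_y)                 # y(0): initial output
    c = np.zeros(n_y)                 # c(0): initial long-term memory
    ys = []
    for x in xs:                      # one copy of the cell per time step
        z = np.concatenate([x, y])
        a_f, a_g, a_h, a_o = np.split(W @ z, 4)
        f, g, o = sigmoid(a_f), sigmoid(a_g), sigmoid(a_o)
        h = np.tanh(a_h)
        c = f * c + g * h             # memory carried to the next step
        y = o * np.tanh(c)            # output fed back into the next step
        ys.append(y)
    return np.stack(ys)

rng = np.random.default_rng(0)
T, n_x, n_y = 5, 3, 4                 # hypothetical sequence length and sizes
W = rng.normal(scale=0.1, size=(4 * n_y, n_x + n_y))
ys = lstm_unrolled(rng.normal(size=(T, n_x)), W, n_y)
print(ys.shape)
```

The loop is inherently sequential, since step $t$ needs $\boldsymbol y(t-1)$ and $\boldsymbol c(t-1)$; this is one reason training tends to be slow.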
(to be continued)
GRU
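There are many LSTM variants; one that changes the design substantially is the Gated Recurrent Unit (GRU), proposed by Cho et al. (2014). It merges the forget gate and the input gate into a single update gate. It also merges the cell state and the hidden state, and makes a few other changes. The resulting model is simpler than the standard LSTM. It performs about as well as the LSTM but has roughly a quarter fewer parameters (three weight blocks instead of four), which makes it less prone to overfitting.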
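A single GRU step can be sketched in the same style. This is a hedged illustration of one common convention, not necessarily the exact form in Cho et al.: it assumes no biases, and the names `gru_step`, `W_u`, `W_r`, `W_c`, and `n_weights` are hypothetical. Some references swap the roles of `u` and `1 - u` in the final interpolation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, y_prev, W_u, W_r, W_c):
    """One GRU time step with a single state vector (no separate cell state)."""
    z = np.concatenate([x, y_prev])
    u = sigmoid(W_u @ z)                                   # update gate
    r = sigmoid(W_r @ z)                                   # reset gate
    cand = np.tanh(W_c @ np.concatenate([x, r * y_prev]))  # candidate state
    return (1.0 - u) * y_prev + u * cand                   # blend old and new

def n_weights(n_blocks, n_x, n_y):
    """Weight count for n_blocks dense layers of shape (n_y, n_x + n_y)."""
    return n_blocks * n_y * (n_x + n_y)

n_x, n_y = 3, 4                                            # hypothetical sizes
ratio = n_weights(3, n_x, n_y) / n_weights(4, n_x, n_y)    # GRU vs. LSTM
print(ratio)
```

The single update gate `u` does the work of both forgetting and writing: whatever fraction of the old state is dropped is replaced by the candidate. Under these bias-free assumptions the GRU needs three weight blocks where the LSTM needs four.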
Reference
- Adam lstm slide COMP6208
- RNN, LSTM and GRU