LSTM

[Figure: structure of the LSTM cell]

The structure of the LSTM is shown in the figure above.

Items in the Structure

  • Inputs $\boldsymbol{z}(t)=(\boldsymbol{x}(t), \boldsymbol{y}(t-1))$

    The input is the concatenation of the current input $\boldsymbol{x}(t)$ and the previous output $\boldsymbol{y}(t-1)$.

  • Network Updates ($W_*$ are the free parameters)

    Each is a dense layer applied to $\boldsymbol z(t)$: $\boldsymbol f(t)=\sigma(W_f\,\boldsymbol z(t))$ is the forget gate, $\boldsymbol g(t)=\sigma(W_g\,\boldsymbol z(t))$ decides whether and what to remember from the current input, $\boldsymbol h(t)=\tanh(W_h\,\boldsymbol z(t))$ is the input block, and $\boldsymbol o(t)=\sigma(W_o\,\boldsymbol z(t))$ is the output gate (see the sketch after this list).

  • Long-term memory update $\boldsymbol{c}(t)=\boldsymbol{f}(t) \otimes \boldsymbol{c}(t-1)+\boldsymbol{g}(t) \otimes \boldsymbol{h}(t)$

    The forget gate decides how much of the previous cell state $\boldsymbol{c}(t-1)$ to keep, and the gated input block $\boldsymbol{g}(t) \otimes \boldsymbol{h}(t)$ decides what to memorize from the current step; their sum is the cell state $\boldsymbol{c}(t)$ passed on to the next step.

  • Output $\boldsymbol{y}(t)=\boldsymbol{o}(t) \otimes \tanh (\boldsymbol{c}(t))$

    The element-wise product of the output gate and $\tanh(\boldsymbol c(t))$.
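
To make the items above concrete, the following is a minimal NumPy sketch of one forward step, written in the same notation ($\boldsymbol z$, $\boldsymbol f$, $\boldsymbol g$, $\boldsymbol h$, $\boldsymbol o$, $\boldsymbol c$, $\boldsymbol y$). The weight shapes, the random initialization, and the omission of bias terms are illustrative assumptions, not the exact parameterization from the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_cell(x_t, y_prev, c_prev, W_f, W_g, W_h, W_o):
    """One LSTM step in the notation above (biases omitted for brevity)."""
    # z(t): concatenate the current input x(t) with the previous output y(t-1)
    z_t = np.concatenate([x_t, y_prev])

    f_t = sigmoid(W_f @ z_t)   # forget gate
    g_t = sigmoid(W_g @ z_t)   # "remember" gate for the current input
    h_t = np.tanh(W_h @ z_t)   # input block (candidate memory)
    o_t = sigmoid(W_o @ z_t)   # output gate

    # long-term memory update: keep part of c(t-1), add the gated input block
    c_t = f_t * c_prev + g_t * h_t
    # output: gated, squashed cell state
    y_t = o_t * np.tanh(c_t)
    return y_t, c_t

# toy usage: input size 3, hidden size 4 (arbitrary illustrative sizes)
rng = np.random.default_rng(0)
n_x, n_h = 3, 4
W_f, W_g, W_h, W_o = (rng.standard_normal((n_h, n_x + n_h)) for _ in range(4))
y, c = np.zeros(n_h), np.zeros(n_h)
y, c = lstm_cell(rng.standard_normal(n_x), y, c, W_f, W_g, W_h, W_o)
print(y.shape, c.shape)  # (4,) (4,)
```

Stepping this function over a whole sequence, feeding each $\boldsymbol y(t)$ and $\boldsymbol c(t)$ back in, gives the recurrent behaviour discussed in the note below.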

Note

We can train an LSTM by unrolling it in time (backpropagation through time).
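
As an illustration of what unrolling means, the following sketch steps PyTorch's `nn.LSTMCell` over a toy sequence and then backpropagates through the whole unrolled computation graph. The choice of PyTorch, the linear readout, the MSE loss, and all sizes are assumptions made for this example, not the setup used in the course.

```python
import torch

T, batch, n_in, n_hidden = 10, 8, 3, 16      # illustrative sizes only
cell = torch.nn.LSTMCell(n_in, n_hidden)     # one cell, reused at every time step
readout = torch.nn.Linear(n_hidden, 1)
opt = torch.optim.Adam(list(cell.parameters()) + list(readout.parameters()), lr=1e-3)

x = torch.randn(T, batch, n_in)              # dummy input sequence
target = torch.randn(batch, 1)               # dummy target

h = torch.zeros(batch, n_hidden)
c = torch.zeros(batch, n_hidden)
for t in range(T):                           # "unrolled" forward pass through time
    h, c = cell(x[t], (h, c))
loss = torch.nn.functional.mse_loss(readout(h), target)

opt.zero_grad()
loss.backward()                              # backpropagation through the unrolled graph
opt.step()
```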

Each cell involves four dense layers with sigmoid (or tanh) outputs: the gates and the input block described above.

The LSTM is typically very slow to train.

There are a few variants of the LSTM, but all are very similar. The most popular is probably the Gated Recurrent Unit (GRU).

(to be continued)

GRU

The LSTM has many variants; one with larger changes is the Gated Recurrent Unit (GRU), proposed by Cho, et al. (2014). It merges the forget gate and the input gate into a single update gate, it also combines the cell state and the hidden state, and it makes a few other changes. The resulting model is simpler than the standard LSTM. Its performance is about the same as the LSTM's, but it has roughly a quarter fewer parameters, so it is less prone to overfitting.

[Figure: structure of the GRU cell]
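
As a rough sketch of the merged design described above, the following NumPy step uses a single update gate in place of the LSTM's separate forget and input gates and keeps no separate cell state. Shapes, naming, and the omission of bias terms are illustrative assumptions rather than the exact equations from Cho, et al. (2014).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(x_t, y_prev, W_z, W_r, W_h):
    """One GRU step (biases omitted): no separate cell state, one update gate."""
    z_in = np.concatenate([x_t, y_prev])
    u_t = sigmoid(W_z @ z_in)          # update gate (merged forget + input gates)
    r_t = sigmoid(W_r @ z_in)          # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([x_t, r_t * y_prev]))  # candidate state
    # interpolate between the old state and the candidate with the single update gate
    return (1.0 - u_t) * y_prev + u_t * h_tilde

# toy usage with the same sizes as the LSTM sketch above
rng = np.random.default_rng(0)
n_x, n_h = 3, 4
W_z, W_r, W_h = (rng.standard_normal((n_h, n_x + n_h)) for _ in range(3))
y = np.zeros(n_h)
y = gru_cell(rng.standard_normal(n_x), y, W_z, W_r, W_h)
print(y.shape)  # (4,)
```

Note that this sketch uses three weight matrices ($W_z$, $W_r$, $W_h$) where the LSTM sketch above uses four, which is where the parameter saving comes from.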
