1. RNN Overview

Ordinary artificial neural networks and convolutional neural networks both rest on the assumption that the inputs are independent of one another, but in real life this assumption often fails. Consider completing a meaningful sentence: "Meeting someone takes only 1 second, liking someone takes only 3 seconds, falling in love takes only 1 minute, yet I will spend my whole [?] loving you." A human reader knows the blank should be "life", but only because we have read the context; in an ordinary neural network the inputs are independent of each other and the network has no memory. More generally, training samples may be continuous sequences of varying length, such as a stretch of continuous speech or continuous text, in which earlier inputs are correlated with later ones, so it is difficult to split them into individual samples for DNN/CNN training. Recurrent Neural Networks (RNN) are widely applied to exactly this kind of sequential data.
2. RNN Network Structure and Principle

The meanings of the parameters in the unrolled RNN diagram:
1) \(x^{(t)}\): the input at sequence index \(t\);
2) \(h^{(t)}\): the hidden state at index \(t\), determined jointly by \(x^{(t)}\) and \(h^{(t-1)}\);
3) \(o^{(t)}\): the output at index \(t\), determined only by the current hidden state \(h^{(t)}\);
4) \(L^{(t)}\): the loss at index \(t\);
5) \(y^{(t)}\): the true label of the training sample at index \(t\);
6) \(U, W, V\): the weight matrices, which are shared across all positions of the sequence.

3. RNN Forward Propagation

For any sequence index \(t\), the hidden state \(h^{(t)}\) is obtained from \(x^{(t)}\) and \(h^{(t-1)}\):
\[h^{(t)} = \sigma(z^{(t)}) = \sigma(Ux^{(t)}+Wh^{(t-1)}+b)
\]
where \(\sigma\) is the activation function and \(b\) is a bias. The output at sequence index \(t\) is:
\[o^{(t)} = Vh^{(t)}+c
\]
The predicted output is then:
\[\hat{y}^{(t)} = \sigma(o^{(t)})
\]
In this process the activation function is applied twice: once to obtain the hidden state \(h^{(t)}\) and once to obtain the predicted output \(\hat{y}^{(t)}\). The first is usually tanh and the second is usually softmax.
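Below is a minimal sketch of this forward pass in NumPy, assuming tanh for the hidden state and softmax for the output. The sizes, random initialization and input sequence are made-up placeholders, not values from the text.

```python
import numpy as np

# Hypothetical dimensions: input size, hidden size, output size, sequence length.
n_in, n_hidden, n_out, T = 4, 8, 3, 5

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(n_hidden, n_in))      # input  -> hidden
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # hidden -> hidden
V = rng.normal(scale=0.1, size=(n_out, n_hidden))     # hidden -> output
b, c = np.zeros(n_hidden), np.zeros(n_out)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.normal(size=(T, n_in))   # one input vector per sequence index
h = np.zeros(n_hidden)           # h^{(0)}
for t in range(T):
    h = np.tanh(U @ x[t] + W @ h + b)   # h^{(t)} = tanh(U x^{(t)} + W h^{(t-1)} + b)
    o = V @ h + c                       # o^{(t)} = V h^{(t)} + c
    y_hat = softmax(o)                  # y_hat^{(t)} = softmax(o^{(t)})
    print(t, y_hat)
```

Note that the same \(U\), \(W\), \(V\), \(b\), \(c\) are reused at every step; only the hidden state carries information forward.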
4. RNN Backpropagation Derivation

RNN backpropagation obtains suitable parameters through repeated iterations of gradient descent. For an RNN there is a loss at every position of the sequence, so the total loss is
\[L = \sum_{t=1}^{\tau}L^{(t)}
\]
We now take partial derivatives of the loss with respect to the parameters to be updated (note that the two activation functions used here are tanh for the hidden state and softmax for the output).
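The per-step loss is not written out explicitly here; the gradient expressions below take the standard form that results from a softmax output combined with a cross-entropy loss, i.e. (stated as an assumption):
\[L^{(t)} = -\sum_{i} y_i^{(t)} \log \hat{y}_i^{(t)}, \qquad \frac{\partial{L^{(t)}}}{\partial{o^{(t)}}} = \hat{y}^{(t)} - y^{(t)}
\]
With this, the gradients of the loss with respect to \(c\) and \(V\) are: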
\[\frac{\partial{L}}{\partial{c}} = \sum_{t=1}^{\tau}\frac{\partial{L^{(t)}}}{\partial{c}} = \sum_{t =1}^{\tau}(\hat{y}^{(t)}-y^{(t)})
\] \[\frac{\partial{L}}{\partial{V}} = \sum_{t=1}^{\tau}\frac{\partial{L^{(t)}}}{\partial{V}} = \sum_{t =1}^{\tau}(\hat{y}^{(t)}-y^{(t)})(h^{(t)})^T
\]
The gradients with respect to \(W\), \(U\) and \(b\) are more involved, because the hidden state \(h^{(t)}\) affects not only \(o^{(t)}\) but also the next hidden state:
\[h^{(t+1)} = tanh(Ux^{(t+1)}+Wh^{(t)}+b)
\]
To handle this, define the gradient of the loss with respect to the hidden state at position \(t\):
\[\delta^{(t)} = \frac{\partial{L}}{\partial{h^{(t)}}}
\]
\(\delta^{(t)}\) can then be computed recursively from \(\delta^{(t+1)}\), working backwards from the end of the sequence:
\[\delta^{(t)} = (\frac{\partial{o^{(t)}}}{\partial{h^{(t)}}})^T \frac{\partial{L}}{\partial{o^{(t)}}} + (\frac{\partial{h^{(t+1)}}}{\partial{h^{(t)}}})^T \frac{\partial{L}}{\partial{h^{(t+1)}}} = V^T(\hat{y}^{(t)}-y^{(t)}) +W^Tdiag(1-(h^{(t+1)})^2)\delta^{(t+1)}
\]
For \(\delta^{(\tau)}\), there is no later index (it is the last position of the sequence), so:
\[\delta^{(\tau)} = (\frac{\partial{o^{(\tau)}}}{\partial{h^{(\tau)}}})^T \frac{\partial{L}}{\partial{o^{(\tau)}}} = V^T(\hat{y}^{(\tau)}-y^{(\tau)})
\]
With \(\delta^{(t)}\) available, we can compute the gradients of \(W\), \(b\) and \(U\):
\[\frac{\partial{L}}{\partial{W}} = \sum_{t=1}^{\tau}diag(1-(h^{(t)})^2)\delta^{(t)}(h^{(t-1)})^T
\] \[\frac{\partial{L}}{\partial{b}} = \sum_{t=1}^{\tau}diag(1-(h^{(t)})^2)\delta^{(t)}
\] \[\frac{\partial{L}}{\partial{U}} = \sum_{t=1}^{\tau}diag(1-(h^{(t)})^2)\delta^{(t)}(x^{(t)})^T
\]
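The sketch below implements these backward-pass formulas in NumPy (backpropagation through time), under the same assumptions as the forward-pass sketch above: tanh hidden units, softmax output, per-step cross-entropy loss, and made-up sizes and data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 4, 8, 3, 5   # hypothetical sizes

U = rng.normal(scale=0.1, size=(n_hidden, n_in))
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.1, size=(n_out, n_hidden))
b, c = np.zeros(n_hidden), np.zeros(n_out)

x = rng.normal(size=(T, n_in))                  # inputs x^{(1..T)}
y = np.eye(n_out)[rng.integers(n_out, size=T)]  # one-hot targets y^{(1..T)}

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Forward pass: store every hidden state h^{(t)} and prediction y_hat^{(t)}.
h = np.zeros((T + 1, n_hidden))                 # h[0] plays the role of h^{(0)}
y_hat = np.zeros((T, n_out))
for t in range(1, T + 1):
    h[t] = np.tanh(U @ x[t - 1] + W @ h[t - 1] + b)
    y_hat[t - 1] = softmax(V @ h[t] + c)

# Backward pass: the delta^{(t)} recursion and the parameter gradients.
dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
db, dc = np.zeros_like(b), np.zeros_like(c)
for t in range(T, 0, -1):
    dy = y_hat[t - 1] - y[t - 1]                # y_hat^{(t)} - y^{(t)}
    dc += dy                                    # dL/dc
    dV += np.outer(dy, h[t])                    # dL/dV
    if t == T:                                  # delta^{(tau)} = V^T (y_hat - y)
        delta = V.T @ dy
    else:                                       # recursion through h^{(t+1)}
        delta = V.T @ dy + W.T @ ((1 - h[t + 1] ** 2) * delta_next)
    grad_pre = (1 - h[t] ** 2) * delta          # diag(1 - (h^{(t)})^2) delta^{(t)}
    dW += np.outer(grad_pre, h[t - 1])          # dL/dW
    db += grad_pre                              # dL/db
    dU += np.outer(grad_pre, x[t - 1])          # dL/dU
    delta_next = delta
```

A gradient-descent step would then update each parameter, e.g. `W -= learning_rate * dW`.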
5. The Vanishing Gradient Problem in RNNs

Consider a simplified RNN unrolled for three time steps (the activation function is omitted for the moment):
\[S_1 = W_xX_1 + W_sS_0+b_1 ; O_1 = W_oS_1 +b_2
\] \[S_2 = W_xX_2 + W_sS_1+b_1 ; O_2 = W_oS_2 +b_2
\] \[S_3 = W_xX_3 + W_sS_2+b_1 ; O_3 = W_oS_3 +b_2
\] Assume that at time t = 3 the loss is $$L_3 = \frac{1}{2}(Y_3-O_3)^2$$ Then \[\frac{\partial{L}_3}{\partial{W}_o} = \frac{\partial{L}_3}{\partial{O}_3} \frac{\partial{O}_3}{\partial{W}_o}
\] \[\frac{\partial{L}_3}{\partial{W}_x} = \frac{\partial{L}_3}{\partial{O}_3} \frac{\partial{O}_3}{\partial{S}_3} \frac{\partial{S}_3}{\partial{W}_x} + \frac{\partial{L}_3}{\partial{O}_3} \frac{\partial{O}_3}{\partial{S}_3} \frac{\partial{S}_3}{\partial{S}_2}\frac{\partial{S}_2}{\partial{W}_x}+ \frac{\partial{L}_3}{\partial{O}_3} \frac{\partial{O}_3}{\partial{S}_3} \frac{\partial{S}_3}{\partial{S}_2}\frac{\partial{S}_2}{\partial{S}_1}\frac{\partial{S}_1}{\partial{W}_x}
\] \[\frac{\partial{L}_3}{\partial{W}_s} = \frac{\partial{L}_3}{\partial{O}_3} \frac{\partial{O}_3}{\partial{S}_3} \frac{\partial{S}_3}{\partial{W}_s} + \frac{\partial{L}_3}{\partial{O}_3} \frac{\partial{O}_3}{\partial{S}_3} \frac{\partial{S}_3}{\partial{S}_2}\frac{\partial{S}_2}{\partial{W}_s}+ \frac{\partial{L}_3}{\partial{O}_3} \frac{\partial{O}_3}{\partial{S}_3} \frac{\partial{S}_3}{\partial{S}_2}\frac{\partial{S}_2}{\partial{S}_1}\frac{\partial{S}_1}{\partial{W}_s}
\]
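As a sanity check on this chain-rule expansion, the following sketch evaluates the three-term expression for \(\partial{L}_3 / \partial{W}_x\) on the scalar toy model above and compares it against a finite-difference estimate. All numeric values are arbitrary placeholders.

```python
import numpy as np

# Toy scalar model from the text (no activation):
#   S_t = W_x * X_t + W_s * S_{t-1} + b_1,   O_t = W_o * S_t + b_2
#   L_3 = 0.5 * (Y_3 - O_3)**2
W_x, W_s, W_o, b_1, b_2 = 0.6, 0.9, 1.1, 0.1, -0.2
S_0 = 0.0
X = [0.5, -1.0, 2.0]   # X_1, X_2, X_3 (arbitrary)
Y_3 = 1.5

def loss(W_x):
    S = S_0
    for x in X:
        S = W_x * x + W_s * S + b_1
    O_3 = W_o * S + b_2
    return 0.5 * (Y_3 - O_3) ** 2, O_3

L_3, O_3 = loss(W_x)

# Analytic gradient: dL3/dO3 * dO3/dS3 * (dS3/dWx + dS3/dS2*dS2/dWx + dS3/dS2*dS2/dS1*dS1/dWx)
dL_dO3 = O_3 - Y_3
dL_dWx = dL_dO3 * W_o * (X[2] + W_s * X[1] + W_s ** 2 * X[0])

# Numerical check by central finite differences.
eps = 1e-6
num = (loss(W_x + eps)[0] - loss(W_x - eps)[0]) / (2 * eps)
print(dL_dWx, num)   # the two values agree to several decimal places
```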
From these lengthy expressions we can see that, under gradient descent, the gradient of the loss at an arbitrary time \(t\) with respect to \(W_x\) and \(W_s\) is:
\[\frac{\partial{L}_t}{\partial{W}_x} = \sum_{k=0}^{t}\frac{\partial{L}_t}{\partial{O}_t}\frac{\partial{O}_t}{\partial{S}_t}(\prod_{j=k+1}^{t}\frac{\partial{S}_j}{\partial{S}_{j-1}})\frac{\partial{S}_k}{\partial{W}_x}
\] \[\frac{\partial{L}_t}{\partial{W}_s} = \sum_{k=0}^{t}\frac{\partial{L}_t}{\partial{O}_t}\frac{\partial{O}_t}{\partial{S}_t}(\prod_{j=k+1}^{t}\frac{\partial{S}_j}{\partial{S}_{j-1}})\frac{\partial{S}_k}{\partial{W}_s}
\] If we now add the activation function: $$S_j = tanh(W_xX_j + W_sS_{j-1}+b_1)$$ then $$\prod_{j=k+1}^{t}\frac{\partial{S}_j}{\partial{S}_{j-1}} = \prod_{j=k+1}^{t}W_s tanh^{'}$$ The tanh activation function [2]: \[f(x) = tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
\]
The derivative of tanh:
\[f^{'}(x) = 1 - (tanh(x))^2
\]
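As a quick check, this derivative follows from applying the quotient rule to the definition above:
\[f^{'}(x) = \frac{(e^x + e^{-x})^2 - (e^x - e^{-x})^2}{(e^x + e^{-x})^2} = 1 - \left(\frac{e^x - e^{-x}}{e^x + e^{-x}}\right)^2 = 1 - (tanh(x))^2
\]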
From the plots of the activation function and its derivative [3] we can see that \(tanh^{'}\) lies in \((0, 1]\) and reaches its maximum value 1 only when
\[W_xX_j + W_sS_{j-1} + b_1 = 0
\]
In most cases, therefore, \(tanh^{'} < 1\). If \(W_s\) is also small, the product consists of many factors smaller than 1, and as the distance \(t - k\) grows
\[\prod_{j=k+1}^{t}W_s tanh^{'} \to 0
\]
which is the vanishing gradient: the contribution of distant time steps to the gradient is lost. Conversely, if \(W_s\) is large,
\[\prod_{j=k+1}^{t}W_s tanh^{'} \to \infty
\]
which is the exploding gradient.
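A tiny numerical illustration of this product, using one representative value of \(tanh^{'}\) below its maximum of 1 (both the value and the weights are arbitrary placeholders):

```python
# Product of t - k = 50 identical factors W_s * tanh', for a representative
# tanh' value of 0.65 (the true value varies per step but never exceeds 1).
g = 0.65
for W_s in (0.5, 1.0, 3.0):
    print(W_s, (W_s * g) ** 50)
# W_s = 0.5 and 1.0 give vanishingly small products (gradient vanishes),
# while W_s = 3.0 gives an enormous product (gradient explodes).
```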
6. Eliminating Gradient Explosion and Gradient Vanishing

In the formulas:
\[\frac{\partial{L}_t}{\partial{W}_x} = \sum_{k=0}^{t}\frac{\partial{L}_t}{\partial{O}_t}\frac{\partial{O}_t}{\partial{S}_t}(\prod_{j=k+1}^{t}\frac{\partial{S}_j}{\partial{S}_{j-1}})\frac{\partial{S}_k}{\partial{W}_x}
\] \[\frac{\partial{L}_t}{\partial{W}_s} = \sum_{k=0}^{t}\frac{\partial{L}_t}{\partial{O}_t}\frac{\partial{O}_t}{\partial{S}_t}(\prod_{j=k+1}^{t}\frac{\partial{S}_j}{\partial{S}_{j-1}})\frac{\partial{S}_k}{\partial{W}_s}
\] the cause of gradient vanishing and gradient explosion lies in the factor: \[\prod_{j=k+1}^{t}\frac{\partial{S}_j}{\partial{S}_{j-1}}
\] One way to eliminate the influence of this factor is to make \[\frac{\partial{S}_j}{\partial{S}_{j-1}} \approx 1
\] The other is to make: \[\frac{\partial{S}_j}{\partial{S}_{j-1}} \approx 0
\]