
An AI Scientist Teaches You from Scratch: Recurrent Neural Networks!

 爱因思念l5j0t8 2017-12-08



In the previous tutorials, all of the networks we used were feedforward neural networks. They are called feedforward because the whole network is a chain (recall gluon.nn.Sequential): each layer's output is fed to the next layer. In this section we introduce recurrent neural networks, in which each layer not only passes its output to the next layer but also emits a hidden state that the same layer uses when processing the next sample. The figure below shows the difference between the two kinds of networks.

This structure makes recurrent neural networks well suited to samples with sequential dependencies. Let's use a language model as an example of how this works. The task of a language model is: given the first T characters of a sentence, predict the (T+1)-th character. Suppose our sentence is "你好世界" ("hello world"). One way to predict with a feedforward network is to feed "你" at time step 1 and predict "好", then feed "好" into the same network at time step 2 and predict "世". The left side of the figure below shows this process.

Notice the problem: when we predict "世" we are only given the input "好", and "你" is ignored entirely. Intuitively, "你" should matter for this prediction. This issue can usually be mitigated with an n-gram approach: when predicting the (T+1)-th character, we feed in the previous n characters. With n=1 we get exactly what we did above. We can increase n so the input carries more information, but we cannot increase it arbitrarily, because a larger n usually increases model complexity and therefore requires much more data and computation to train.
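To make this concrete, here is a small illustrative sketch (not part of the original notebook) that builds the (context, target) training pairs an n-gram-style model would use for the example sentence; enlarging the context window means enlarging n, and with it the number of input patterns the model has to handle:

# Hypothetical illustration: (context, target) pairs for different context sizes n.
# With n=1, the model predicting '世' only ever sees '好', as described above.
sentence = '你好世界'
for n in (1, 2, 3):
    pairs = [(sentence[max(0, t - n):t], sentence[t])
             for t in range(1, len(sentence))]
    print('n =', n, pairs)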

A recurrent neural network instead uses a hidden state to record what it has seen so far and to help with the current prediction. The right side of the figure above shows this process. When predicting "好", we also output a hidden state. We then use this state together with the new input "好" to predict "世", while outputting an updated hidden state. The hope is that earlier information is preserved in this hidden state, improving the prediction.
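The following toy, runnable sketch (not the real model, which is defined later in this tutorial; step is just a stand-in function) only illustrates the data flow: the state produced at one step is passed into the next step, so each prediction can depend on everything seen so far.

# Toy illustration only: `state` is a plain string standing in for the hidden state.
def step(char, state):
    new_state = state + char       # a real RNN updates a numeric hidden state instead
    prediction = '?'               # ...and predicts the next character from it
    return prediction, new_state

state = ''                         # the initial state is "empty" (all zeros in a real RNN)
for char in '你好世':
    prediction, state = step(char, state)
    print('input:', char, '| state now encodes:', state)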

Before introducing the model more formally, let's get a dataset that is a bit more complex than "你好世界".

The Time Machine dataset

We use the book The Time Machine as our dataset, mainly because Project Gutenberg makes it freely downloadable, and because we have seen too many tutorials that use Shakespeare as the example. Below we load the data and look at the first 500 characters:

In [1]:
with open('../data/timemachine.txt') as f:
    time_machine = f.read()
print(time_machine[0:500])
The Time Machine, by H. G. Wells [1898] I The Time Traveller (for so it will be convenient to speak of him) was expounding a recondite matter to us. His grey eyes shone and twinkled, and his usually pale face was flushed and animated. The fire burned brightly, and the soft radiance of the incandescent lights in the lilies of silver caught the bubbles that flashed and passed in our glasses. Our chairs, being his patents, embraced and caressed us rather than submitted to be sat upon, and the

Next we preprocess the data a little: convert everything to lowercase, remove newline characters, and truncate the text so that the training below runs a bit faster.

In [2]:
time_machine = time_machine.lower().replace('\n', '').replace('\r', '')
time_machine = time_machine[0:10000]

Numerical representation of characters

First, collect all the distinct characters in the data into a dictionary:

In [3]:
character_list = list(set(time_machine))
character_dict = dict([(char,i) for i,char in enumerate(character_list)])

vocab_size = len(character_dict)

print('vocab size:', vocab_size)
print(character_dict)
vocab size: 43
{'x': 0, 'm': 1, 'o': 2, ')': 3, 'j': 4, 'h': 5, 'w': 6, 'l': 7, ':': 8, '-': 9, 'q': 10, 'k': 11, '!': 12, 'c': 13, '[': 14, "'": 15, 'p': 16, '9': 17, 'n': 18, '1': 19, 's': 20, 'a': 21, 'e': 22, '.': 23, ']': 24, '?': 25, ';': 26, 'd': 27, 'i': 28, '8': 29, 'f': 30, 'u': 31, ',': 32, ' ': 33, 'y': 34, 'z': 35, 'g': 36, '(': 37, 't': 38, '_': 39, 'r': 40, 'v': 41, 'b': 42}

Then we can convert each character into an index starting from 0, which makes it easier to use later.

In [4]:
time_numerical = [character_dict[char] for char in time_machine]

sample = time_numerical[:40]

print('chars: \n', ''.join([character_list[idx] for idx in sample]))
print('\nindices: \n', sample)
chars: 
 the time machine, by h. g. wells [1898]i 

indices: 
 [38, 5, 22, 33, 38, 28, 1, 22, 33, 1, 21, 13, 5, 28, 18, 22, 32, 33, 42, 34, 33, 5, 23, 33, 36, 23, 33, 6, 22, 7, 7, 20, 33, 14, 19, 29, 17, 29, 24, 28]

Reading the data

As before, we need to randomly read batch_size samples and their corresponding labels at a time. The samples here are a bit different from before: a sample is now a sequence of consecutive characters (in a feedforward network, each character might be a sample on its own).

If we set the sequence length (seq_len) to 10, one possible sample is "The Time T". Its label is also a sequence of length 10, where each character is the one that follows the corresponding character in the sample. The label for the sample above is therefore "he Time Tr".
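As a quick, hypothetical illustration (not from the original notebook), the sample and its label are simply offset by one character:

# Hypothetical example of the one-character offset between a sample and its label.
text = 'The Time Traveller'
seq_len = 10
sample = text[0:seq_len]           # 'The Time T'
label = text[1:seq_len + 1]        # 'he Time Tr'
print(sample, '->', label)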

The code below randomly samples one batch from the data at a time:

In [5]:
import random
from mxnet import nd
def data_iter(batch_size, seq_len, ctx=None):
    num_examples = (len(time_numerical) - 1) // seq_len
    num_batches = num_examples // batch_size
    # Shuffle the samples
    idx = list(range(num_examples))
    random.shuffle(idx)
    # Return seq_len characters starting at pos
    def _data(pos):
        return time_numerical[pos:pos + seq_len]
    for i in range(num_batches):
        # Read batch_size random samples each time
        i = i * batch_size
        examples = idx[i:i + batch_size]
        data = nd.array(
            [_data(j * seq_len) for j in examples], ctx=ctx)
        label = nd.array(
            [_data(j * seq_len + 1) for j in examples], ctx=ctx)
        yield data, label

Let's see what a batch looks like:

In [6]:
for data, label in data_iter(batch_size=3, seq_len=8):
    print('data: ', data, '\n\nlabel:', label)
    break
data:  
[[ 40.   2.  18.  36.  33.  20.  28.  27.]
 [ 33.  38.  28.   1.  22.  33.  38.  40.]
 [  2.  41.  28.  18.  13.  28.  21.   7.]]
<NDArray 3x8 @cpu(0)> 

label: 
[[  2.  18.  36.  33.  20.  28.  27.  22.]
 [ 38.  28.   1.  22.  33.  38.  40.  21.]
 [ 41.  28.  18.  13.  28.  21.   7.  33.]]
<NDArray 3x8 @cpu(0)>

Recurrent neural networks

Now that we understand the input and output data, let's introduce recurrent neural networks more formally. First, recall the definition of a feedforward network with a single hidden layer. If the hidden layer's activation function is $\phi$, its output is

$$H = \phi(X W_{xh} + b_h)$$

The final output is

$$\hat{Y} = \mathrm{softmax}(H W_{hy} + b_y)$$

(Compared with the multilayer perceptron, we have changed the subscripts from $W_1$ and $W_2$ to the more descriptive $W_{xh}$ and $W_{hy}$.)

To turn the network above into a recurrent neural network, we first attach a time index $t$ to the inputs and outputs. Let $X_t$ be the $t$-th input in the sequence, with corresponding hidden-layer output $H_t$ and final output $\hat{Y}_t$. When computing the hidden layer's output, a recurrent neural network simply adds a weighted contribution from the previous time step's hidden state. To do so, we introduce a new learnable weight matrix $W_{hh}$:

$$H_t = \phi(X_t W_{xh} + H_{t-1} W_{hh} + b_h)$$

The output is computed exactly as before:

$$\hat{Y}_t = \mathrm{softmax}(H_t W_{hy} + b_y)$$

As mentioned at the beginning, the hidden-layer output (also called the hidden state) can be thought of as the network's memory: it stores information from earlier time steps, and the output is based entirely on this state. The very first hidden state is usually initialized to all zeros.

One-hot encoding

Note that each character is now represented by an integer, but the network expects a fixed-length vector as input. A common approach is one-hot encoding: if the value is i, we create an all-zero vector of length vocab_size and set its i-th position to 1.

In [7]:
nd.one_hot(nd.array([0,4]), vocab_size)
Out[7]:
[[ 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]] <NDArray 2x43 @cpu(0)>

Recall that each batch of data we read has shape batch_size x seq_len. The function below converts it into seq_len matrices of shape batch_size x vocab_size that can be fed into the network.

In [8]:
def get_inputs(data):
    return [nd.one_hot(X, vocab_size) for X in data.T]

inputs = get_inputs(data)
print('input length: ',len(inputs))
print('input[0] shape: ', inputs[0].shape)
input length:  8
input[0] shape:  (3, 43)

Initializing the model parameters

Both the input and output dimensions of the model are vocab_size.

In [9]:
import mxnet as mx

# Try to use a GPU
import sys
sys.path.append('..')
import utils
ctx = utils.try_gpu()
print('Will use ', ctx)

num_hidden = 256
weight_scale = .01

# Hidden layer
Wxh = nd.random_normal(shape=(vocab_size,num_hidden), ctx=ctx) * weight_scale
Whh = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx) * weight_scale
bh = nd.zeros(num_hidden, ctx=ctx)

# Output layer
Why = nd.random_normal(shape=(num_hidden,vocab_size), ctx=ctx) * weight_scale
by = nd.zeros(vocab_size, ctx=ctx)
params = [Wxh, Whh, bh, Why, by]
for param in params: param.attach_grad()
Will use gpu(0)

Defining the model

We translate the model equations above directly into code.

In [10]:
def rnn(inputs, H):
    # inputs: seq_len matrices of shape batch_size x vocab_size
    # H: a batch_size x num_hidden matrix
    # outputs: seq_len matrices of shape batch_size x vocab_size
    outputs = []
    for X in inputs:
        H = nd.tanh(nd.dot(X, Wxh) + nd.dot(H, Whh) + bh)
        Y = nd.dot(H, Why) + by
        outputs.append(Y)
    return (outputs, H)

Let's run a quick test:

In [11]:
state = nd.zeros(shape=(data.shape[0], num_hidden), ctx=ctx)
outputs, state_new = rnn(get_inputs(data.as_in_context(ctx)), state)

print('output length: ',len(outputs))
print('output[0] shape: ', outputs[0].shape)
print('state shape: ', state_new.shape)
output length:  8
output[0] shape:  (3, 43)
state shape:  (3, 256)

Predicting a sequence

When making predictions, we only need to provide the input at time 0 and an initial hidden state. From then on, each time step's output is fed back in as the next time step's input.


In [12]:
def predict(prefix, num_chars):
    # Predict the num_chars characters that follow the given prefix
    prefix = prefix.lower()
    state = nd.zeros(shape=(1, num_hidden), ctx=ctx)
    output = [character_dict[prefix[0]]]
    for i in range(num_chars + len(prefix)):
        X = nd.array([output[-1]], ctx=ctx)
        Y, state = rnn(get_inputs(X), state)
        if i < len(prefix) - 1:
            # Still inside the prefix: force the next known character
            next_input = character_dict[prefix[i + 1]]
        else:
            # Past the prefix: take the most likely character as the next input
            next_input = int(Y[0].argmax(axis=1).asscalar())
        output.append(next_input)
    return ''.join([character_list[i] for i in output])

Gradient clipping

When computing gradients, a recurrent neural network has to perform $O(\text{seq\_len})$ repeated multiplications, which can cause numerical stability problems (think of $2^{40}$ versus $0.5^{40}$). A common remedy is to clip the gradient whenever it becomes too large: suppose we concatenate all gradients into a single vector $\boldsymbol{g}$ and the clipping threshold is $\theta$; we then rescale as follows so that the norm of $\boldsymbol{g}$ never exceeds $\theta$:

$$\boldsymbol{g} \leftarrow \min\left(\frac{\theta}{\|\boldsymbol{g}\|}, 1\right) \boldsymbol{g}$$

In [13]:
def grad_clipping(params, theta):
    norm = nd.array([0.0], ctx)
    for p in params:
        norm += nd.sum(p.grad ** 2)
    norm = nd.sqrt(norm).asscalar()
    if norm > theta:
        for p in params:
            p.grad[:] *= theta / norm

Training the model

Now we can train the model. Compared with the earlier feedforward-network tutorials, there are only two differences:

  1. We usually report Perplexity (PPL). It can be understood simply as the exponential of the cross-entropy loss, which makes the number easier to read (see the short sketch after this list).

  2. We clip the gradients before updating the parameters.
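As a quick sketch of point 1, using a made-up loss value rather than one taken from the results below, perplexity is simply the exponential of the average per-character cross-entropy:

from math import exp

avg_cross_entropy = 1.2            # hypothetical average cross-entropy per character
perplexity = exp(avg_cross_entropy)
print('PPL: %.3f' % perplexity)    # about 3.32, i.e. roughly a choice among ~3.3 characters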

In [14]:
from mxnet import autograd
from mxnet import gluon
from math import exp

epochs = 200
seq_len = 35
learning_rate = .1
batch_size = 32

softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()
for e in range(epochs + 1):
    train_loss, num_examples = 0, 0
    state = nd.zeros(shape=(batch_size, num_hidden), ctx=ctx)
    for data, label in data_iter(batch_size, seq_len, ctx):
        with autograd.record():
            outputs, state = rnn(get_inputs(data), state)
            # reshape label to (batch_size*seq_len, )
            # concat outputs to (batch_size*seq_len, vocab_size)
            label = label.T.reshape((-1,))
            outputs = nd.concat(*outputs, dim=0)
            loss = softmax_cross_entropy(outputs, label)
        loss.backward()

        grad_clipping(params, 5)
        utils.SGD(params, learning_rate)

        train_loss += nd.sum(loss).asscalar()
        num_examples += loss.size

    if e % 20 == 0:
        print('Epoch %d. PPL %f' % (e, exp(train_loss/num_examples)))
        print(' - ', predict('The Time Ma', 100))
        print(' - ', predict('The Medical Man rose, came to the lamp,', 100), '\n')
Epoch 0. PPL 30.966309 - the time maaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa - the medical man rose, came to the lamp,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa Epoch 20. PPL 11.129305 - the time mathe the the the the the the the the the the the the the the the the the the the the the the the the t - the medical man rose, came to the lamp, and and and and and and and and and and and and and and and and and and and and and and and and and Epoch 40. PPL 9.489771 - the time mat the the the the the the the the the the the the the the the the the the the the the the the the the - the medical man rose, came to the lamp, the the the the the the the the the the the the the the the the the the the the the the the the the Epoch 60. PPL 8.317870 - the time mant and have the thang the thang the thang the thang the thang the thang the thang the thang the thang - the medical man rose, came to the lamp, and have than the than the than the than the than the than the than the than the than the than the t Epoch 80. PPL 7.294700 - the time mave are it and he be way he meding and the time traveller and he in way his some traveller and he in w - the medical man rose, came to the lamp, and he in way his some traveller and he in way his some traveller and he in way his some traveller a Epoch 100. PPL 6.312849 - the time mading an a ther thing to the pat an at le sand the time traveller thin the time traveller thin the tim - the medical man rose, came to the lamp, the time traveller thin the time traveller thin the time traveller thin the time traveller thin the Epoch 120. PPL 5.235753 - the time mant move thengthe time traveller chovegred an a some traveller sime thang his on the time traveller ch - the medical man rose, came to the lamp, and the time traveller chovegred an a some traveller sime thang his on the time traveller chovegred Epoch 140. PPL 4.408807 - the time mavere at and the limensions, ard the time traveller cand the time traveller spoce of hald and the pacc - the medical man rose, came to the lamp, and the pacced hal gan thought rover ancedt and the time, and the ond have a monting to the one of t Epoch 160. PPL 3.712035 - the time mading of that is alle to spee of che rider, and hr ghas the for the medical man. sursaid the time trav - the medical man rose, came to the lamp, and the pay ur maling of the ind in alle to spee of che rider, and hr ghas the for the medical man. Epoch 180. PPL 3.219052 - the time mannot move about in time traveller simelanother at our why gente is a forest anowhy trovel tha pexpers - the medical man rose, came to the lamp, and hover ancepter atint to bot spofper--fical ol motracollyon,' said the medical man. our carealll Epoch 200. PPL 2.823239 - the time machine, but bact asmolly he ghat uxishen the time traveller cand ner allage bet berace for the redinee - the medical man rose, came to the lamp, ank in a maning in anyally mowe coon ther the focrous cint the lact the time traveller cand nean ofr

We can see that the model first learns simple characters, then simple words, then slightly more complex words, and eventually produces something that starts to look like sentences.

Conclusion

By means of its hidden state, a recurrent neural network can make better use of the sequential information in the data.

Exercises

Tune the hyperparameters (dataset size, model complexity, learning rate) and see how they affect the perplexity and the predicted text. Comments and discussion are welcome.

Original article: http://zh./chapter_recurrent-neural-networks/rnn-scratch.html

