时间序列ARMA中p，q选择

imelee 2018-09-17

展开全文

时间序列中p,q值选择

1.模型识别：

对平稳时间序列Yn，求得其自相关函数（ACF）和偏自相关函数（PACF）序列。
若PACF序列满足在p步截尾，且ACF序列被负指数函数控制收敛到0，则Yn为AR(p)序列。
若ACF序列满足在q步截尾，且PACF序列被负指数函数控制收敛到0，则Yn为MA(q)序列。
若ACF序列和PACF序列满足皆不截尾，但都被负指数函数控制收敛到0，则Yn为ARMA序列。

2.模型定阶：

对于有N个观察值的序列，求得相应于AR(p)、MA(q) 和 ARMA(p,q)三种模型的残差方差，出现模型最小残差方差时的模型阶数就是各个模型的最佳阶数。

3.模型参数估计：

AR模型的参数可以根据ACF序列构成的矩阵及其矩阵之间的转化关系求得。
MA模型的参数采用线性迭代法即可求出。
ARMA模型参数估计方法是按上述求解AR模型和MA模型参数的方法分别对AR和MA模型进行参数估计，即可得到ARMA模型的参数。

4.模型估计函数：

根据对应的模型以及估计参数等带入估计函数计算出估计值。

python中使用Statsmodel库进行p，q值选择

最近工作中用到时间序列模型中的ARMA（ARIMA），需要自动选择p，q值，但是我查到的资料都是根据自相关图和偏自相关图来观察拖尾和截尾，以此来选择p，q值，刚开始时一筹莫展，后来灵机一动，为何不看下导入的statsmodel库中对应画图的函数调用时用的源代码呢：

import statsmodels

print statsmodels.__file__1
2
3

根据输出找到源代码所在位置，用来画偏自相关图的代码部分为：

acf_x, confint = pacf(x, nlags=nlags, alpha=alpha, method=method)

    if use_vlines:
        ax.vlines(lags, [0], acf_x, **kwargs)
        ax.axhline(**kwargs)

    # center the confidence interval TODO: do in acf?
    confint = confint - confint.mean(1)[:,None]
    kwargs.setdefault('marker', 'o')
    kwargs.setdefault('markersize', 5)
    kwargs.setdefault('linestyle', 'None')
    ax.margins(.05)
    ax.plot(lags, acf_x, **kwargs)
    ax.fill_between(lags, confint[:,0], confint[:,1], alpha=.25)1
2
3
4
5
6
7
8
9
10
11
12
13
14

从画图部分可以看到置信区间上界为confint[:,0]，下界线为 confint[:,1]，又有：

confint = confint - confint.mean(1)[:,None]1

因此可以写个循环，当出现截尾时返回当前p,q值，也就可以确定选择AR或MA模型了，如果两者都拖尾，则需要用赤池信息准则，或者贝叶斯信息准则或别的准则来进一步判断，那就是另一回事了，除了运行时间稍微有点长外并没有什么困难的了。

代码如下：

1、判断时间序列稳定性

#用来检查时间序列稳定性的，代码中选用的临界值为5%，p-value选用的为0.1，这个可以根据实际进行修改。

from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.stattools import adfuller, acf, pacf


def testStationarity(timeSer):

    stationarity = False

    dftest = adfuller(timeSer)
    dfoutput = Series(dftest[:4], index=[
                      'Test Statistic', 'p-value', 'lags', 'nobs'])

    for key, value in dftest[4].items():
        dfoutput['Critical values (%s)' % key] = value

    if dfoutput['Test Statistic'] < dfoutput['Critical values (5%)']:
        if dfoutput['p-value'] < 0.1:
            stationarity = True

    return stationarity
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23

2、选择合适的p，q值

（由于自动判断，因此收敛速度状况并没有做判断，这个是因为我使用的数据在满足第一个在置信区间内的值后即是局部最优，实际中因数据的不同代码需要部分修改，这里只是提供了一个思路。）

def p_q_choice(timeSer, nlags=40, alpha=.05):

    kwargs = {'nlags': nlags, 'alpha': alpha}
    acf_x, confint = acf(timeSer, **kwargs)
    acf_px, confint2 = pacf(timeSer, **kwargs)

    confint = confint - confint.mean(1)[:, None]
    confint2 = confint2 - confint2.mean(1)[:, None]

    for key1, x, y, z in zip(range(nlags), acf_x, confint[:,0], confint[:,1]):
        if x > y and x < z:
            q = key1
            break

    for key2, x, y, z in zip(range(nlags), acf_px, confint2[:,0], confint[:,1]):
        if x > y and x < z:
            p = key2
            break

    return p, q
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21

附录：
ARIMA模型运用的流程

根据时间序列的散点图、自相关函数和偏自相关函数图识别其平稳性。
对非平稳的时间序列数据进行平稳化处理。直到处理后的自相关函数和偏自相关函数的数值非显著非零
根据所识别出来的特征建立相应的时间序列模型。
平稳化处理后，若偏自相关函数是截尾的，而自相关函数是拖尾的，则建立AR模型；
若偏自相关函数和自相关函数均是拖尾的，则序列适合ARMA模型。
参数估计，检验是否具有统计意义。
假设检验，判断（诊断）残差序列是否为白噪声序列。
利用已通过检验的模型进行预测。