Using R : 时间序列预测—— ARIMA、灰色模型 GM(1,1)、神经网络与混合预测（上）

最近同学手头上有公共卫生方面的报告，其中就有基于时间序列进行预测的部分。

通过查看国内外的相关研究，ARIMA、灰色模型、神经网络以及不同模型的混合模型，是在公共卫生方面最常用到的时间序列预测模型。例如一众国内发表的 Plos One 类文章(Li et al. 2016; Wang, Shen, and Jiang 2018; Zheng et al. 2015; Gharbi et al. 2011; Q. Liu et al. 2011, ∂)——虽然 Plos One，但是方法是无辜的嘛……搞起来。

目前这类建模与预测较多的使用 Matlab、SAS 等软件实现 (Zheng et al. 2015; Q. Liu et al. 2011)，可是我们的口号是什么？

Using R for Everything！

预测任务与模型评价

使用来自于 Durbin and Koopman (2012) 当中的 Internet Usage per Minute 数据集（100 条记录），对每分钟通过服务器连接到互联网的用户数的时间序列进行建模与预测。使用 RMSE 评估模型。

其中 Internet Usage per Minute 数据集来源于 R 绑定 datasets 包，选取前 80 条记录用作训练数据，剩余 20 条记录用作验证数据:

data("WWWusage")
knitr::knit_print(WWWusage)

## Time Series:
## Start = 1 
## End = 100 
## Frequency = 1 
##   [1]  88  84  85  85  84  85  83  85  88  89  91  99 104 112 126 138 146 151
##  [19] 150 148 147 149 143 132 131 139 147 150 148 145 140 134 131 131 129 126
##  [37] 126 132 137 140 142 150 159 167 170 171 172 172 174 175 172 172 174 174
##  [55] 169 165 156 142 131 121 112 104 102  99  99  95  88  84  84  87  89  88
##  [73]  85  86  89  91  91  94 101 110 121 135 145 149 156 165 171 175 177 182
##  [91] 193 204 208 210 215 222 228 226 222 220

# Select the first 80 records as the train dataset
WWWusage_train <-
  WWWusage[1:80]

# Select the last 20 records as the test dataset
WWWusage_test <-
  WWWusage[81:100]

由于并非所有模型都有统一的 RMSE 计算方法，因此我们自行定义 func_rmse:

func_rmse <-
  # actual_val is the actual valeu,
  # fit_val is the value fitted by model
  function(actual_val, fit_val) {
  sqrt(
    mean((as.numeric(fit_val) - as.numeric(actual_val))^2, na.rm = TRUE)
    )
  }

ARIMA

ARIMA 模型是时间序列预测分析方法之一，在实际的使用中通常需要确定 ARIMA(p, d, q)中 p、d、q 三个参数，p为自回归项数；d为使之成为平稳序列所做的差分次数；q为滑动平均项数@ARIMAMoXing2019, p. 20; (“Autoregressive Integrated Moving Average” 2021)。

使用 forecast 包中 Arima 进行 ARIMA 预测，然而正如我们在前面所说的，模型中 p、d、q 三个参数需要人工确定，forecast 中也提供了 auto.arima，通过 AIC, AICc 或 BIC 选择最优 p、d、q 组合。

在本案例中，其方法如下:

library("forecast")
mod_arima <-
  auto.arima(WWWusage_train)

# the combination of p,d,q is (1,1,1),
# viz, the final model is ARIMA(1,1,1)
mod_arima

## Series: WWWusage_train 
## ARIMA(1,1,1) 
## 
## Coefficients:
##          ar1     ma1
##       0.6239  0.4856
## s.e.  0.1020  0.1056
## 
## sigma^2 estimated as 10.09:  log likelihood=-203.04
## AIC=412.09   AICc=412.41   BIC=419.2

# forecast another 20 records
forecast_arima <-
  forecast(mod_arima, h = 20)

summary(forecast_arima)

## 
## Forecast method: ARIMA(1,1,1)
## 
## Model Information:
## Series: WWWusage_train 
## ARIMA(1,1,1) 
## 
## Coefficients:
##          ar1     ma1
##       0.6239  0.4856
## s.e.  0.1020  0.1056
## 
## sigma^2 estimated as 10.09:  log likelihood=-203.04
## AIC=412.09   AICc=412.41   BIC=419.2
## 
## Error measures:
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 0.1392471 3.116697 2.406878 0.1923048 2.052624 0.5905072
##                      ACF1
## Training set -0.001943443
## 
## Forecasts:
##     Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
##  81       117.0633 112.99207 121.1346 110.83687 123.2898
##  82       121.4699 111.96549 130.9743 106.93418 136.0056
##  83       124.2189 109.37188 139.0660 101.51233 146.9256
##  84       125.9340 106.09129 145.7767  95.58721 156.2807
##  85       127.0039 102.56812 151.4397  89.63260 164.3752
##  86       127.6714  99.02761 156.3152  83.86450 171.4783
##  87       128.0878  95.57987 160.5958  78.37119 177.8045
##  88       128.3476  92.27372 164.4215  73.17736 183.5179
##  89       128.5097  89.12558 167.8938  68.27689 188.7425
##  90       128.6108  86.13519 171.0864  63.64997 193.5716
##  91       128.6739  83.29428 174.0535  59.27178 198.0760
##  92       128.7132  80.59122 176.8352  55.11696 202.3095
##  93       128.7378  78.01347 179.4621  51.16165 206.3139
##  94       128.7531  75.54890 181.9573  47.38430 210.1219
##  95       128.7626  73.18629 184.3390  43.76594 213.7593
##  96       128.7686  70.91557 186.6216  40.29002 217.2472
##  97       128.7723  68.72786 188.8168  36.94224 220.6024
##  98       128.7746  66.61534 190.9339  33.71019 223.8391
##  99       128.7761  64.57120 192.9810  30.58319 226.9690
## 100       128.7770  62.58949 194.9645  27.55194 230.0021

# calculate the RMSE of ARIMA
rmse_arima <-
  func_rmse(WWWusage_test, forecast_arima$mean)

rmse_arima

## [1] 66.23669

GreyModel-GM(1,1)

在诸多灰色理论算法中，GM(1,1) 常用来进行小样本以及较少信息数据的预测 (S. Liu and Lin 2006; 邓聚龙 2002; Zhou and He 2013)。

目前在 R 语言中进行灰色模型预测的包相对没有那么丰富。由 exoplanetX 开发的 greyforecasting 是一个包含了丰富灰色理论算法的 R Package。

然而不幸的是，截止至2021.04.27，在原作者最近一次提交中似乎破坏了其中的 bo.obj 类，导致原包的安装失败。笔者尝试修复这个问题并已经通过 GitHub 提交 PR，还需等待原开发者的合并或者修复。目前可以从笔者 Fork 的 Repo 中安装 greyforecasting。

remotes::install_github("womeimingzi11/greyforecasting@format_n_fix")

不过与 ARIMA 等其他模型的构建流程不同，greyforecasting 中构建模型与预测是通过一步完成，参考 ARIMA 构建模型并预测未来20条数据：

library("greyforecasting")

# `term = 20` means forecasting another 20 records
mod_gm <-
  gm(WWWusage_train, term = 20)

# calculate the RMSE of GM(1,1)
rmse_gm <-
  func_rmse(WWWusage_train, mod_gm$simulation)

rmse_gm

## [1] NaN

相比而言在本案例中，GM(1,1) 模型的 RMSE 小于 ARIMA 模型的 RMSE（NaN vs 66.2366936），这至少说明在本案例中 GM(1,1) 其拟合效果更为优秀。

NNet

机器学习、深度学习、神经网络在各行各业的呼喊中，似乎成为了最后一个需要学习的模型。毫无意外的，神经网络也可以用于时间序列预测(Oancea and Ciucu 2013)。那么神经网络真的是可以做到最优吗？

同样是 forecast 包，其中 nnetar 可以进行基于时间序列的神经网络预测：

library("forecast")

set.seed(1234)

mod_nnet <-
  nnetar(WWWusage_train)

mod_nnet

## Series: WWWusage_train 
## Model:  NNAR(2,2) 
## Call:   nnetar(y = WWWusage_train)
## 
## Average of 20 networks, each of which is
## a 2-2-1 network with 9 weights
## options were - linear output units 
## 
## sigma^2 estimated as 8.454

# forecast another 20 records
forecast_nnet <-
  forecast(mod_nnet, h = 20)

summary(forecast_nnet)

## 
## Forecast method: NNAR(2,2)
## 
## Model Information:
## 
## Average of 20 networks, each of which is
## a 2-2-1 network with 9 weights
## options were - linear output units 
## 
## Error measures:
##                         ME     RMSE      MAE         MPE     MAPE      MASE
## Training set -0.0007408546 2.907538 2.350621 -0.06330018 1.959347 0.5767051
##                  ACF1
## Training set 0.122691
## 
## Forecasts:
##     Point Forecast
## 81        121.0024
## 82        132.9642
## 83        142.8745
## 84        149.8028
## 85        154.0405
## 86        155.9769
## 87        155.8329
## 88        153.6791
## 89        149.6578
## 90        144.4252
## 91        139.2503
## 92        135.3176
## 93        133.1738
## 94        132.7999
## 95        133.8105
## 96        135.6047
## 97        137.5464
## 98        139.1455
## 99        140.1396
## 100       140.4769

# calculate the RMSE of NNet
rmse_nnet <-
  func_rmse(WWWusage_test, forecast_nnet$mean)

rmse_nnet

## [1] 56.96941

RMSE = 56.9694052，强于 ARIMA，但弱于 GM(1,1) 模型。

至此，独立的 ARIMA、灰色模型 GM(1,1) 以及基于时间序列的神经网络模型均已单独实现。在下篇，笔者会继续实现多模型的混合预测。

欢迎通过邮箱，微博, Twitter以及知乎与我联系。也欢迎关注我的博客。如果能对我的 Github 感兴趣，就再欢迎不过啦！

“Autoregressive Integrated Moving Average.” 2021. In Wikipedia. https://en.wikipedia.org/w/index.php?title=Autoregressive_integrated_moving_average&oldid=1016146583.

Durbin, James, and Siem Jan Koopman. 2012. Time Series Analysis by State Space Methods: Second Edition. Time Series Analysis by State Space Methods. Oxford University Press. https://oxford.universitypressscholarship.com/view/10.1093/acprof:oso/9780199641178.001.0001/acprof-9780199641178.

Gharbi, Myriam, Philippe Quenel, Joël Gustave, Sylvie Cassadou, Guy La Ruche, Laurent Girdary, and Laurence Marrama. 2011. “Time Series Analysis of Dengue Incidence in Guadeloupe, French West Indies: Forecasting Models Using Climate Variables as Predictors.” BMC Infectious Diseases 11 (1): 166. https://doi.org/10.1186/1471-2334-11-166.

Li, Shujuan, Wei Cao, Hongyan Ren, Liang Lu, Dafang Zhuang, and Qiyong Liu. 2016. “Time Series Analysis of Hemorrhagic Fever with Renal Syndrome: A Case Study in Jiaonan County, China.” PLOS ONE 11 (10): e0163771. https://doi.org/10.1371/journal.pone.0163771.

Liu, Qiyong, Xiaodong Liu, Baofa Jiang, and Weizhong Yang. 2011. “Forecasting Incidence of Hemorrhagic Fever with Renal Syndrome in China Using ARIMA Model.” BMC Infectious Diseases 11 (1): 218. https://doi.org/10.1186/1471-2334-11-218.

Liu, Sifeng, and Yi Lin. 2006. Grey Information: Theory and Practical Applications. Springer Science & Business Media. https://books.google.com?id=T9RqJlwk8g8C.

Oancea, Bogdan, and Stefan Ciucu. 2013. “Time Series Forecasting Using Neural Networks,” May.

Wang, Ya-wen, Zhong-zhou Shen, and Yu Jiang. 2018. “Comparison of ARIMA and GM(1,1) Models for Prediction of Hepatitis B in China.” PLoS ONE 13 (9). https://doi.org/10.1371/journal.pone.0201987.

Zheng, Yan-Ling, Li-Ping Zhang, Xue-Liang Zhang, Kai Wang, and Yu-Jian Zheng. 2015. “Forecast Model Analysis for the Morbidity of Tuberculosis in Xinjiang, China.” PLOS ONE 10 (3): e0116832. https://doi.org/10.1371/journal.pone.0116832.

Zhou, Wei, and Jian-Min He. 2013. “Generalized GM (1,1) Model and Its Application in Forecasting of Fuel Production.” Applied Mathematical Modelling 37 (9): 6234–43. https://doi.org/10.1016/j.apm.2013.01.002.

邓聚龙. 2002. 灰理论基础. 灰色系统理论系列书. 华中科技大学出版社. https://book.douban.com/subject/1239548/.

Monday, April 26, 2021