LI Weijun, ZHANG Xinyong, GAO Yuxiao, et al. Video Frame Prediction Model Based on Gated Spatio-Temporal Attention[J]. Journal of Zhengzhou University (Engineering Science), 2024, 45(01): 70-77. [doi:10.13705/j.issn.1671-6833.2024.01.017]

Video Frame Prediction Model Based on Gated Spatio-Temporal Attention

Journal of Zhengzhou University (Engineering Science) [ISSN:1671-6833/CN:41-1339/T]

Volume:
45
Issue:
2024, No. 01
Pages:
70-77
Publication Date:
2024-01-19

Article Information

Title:
Video Frame Prediction Model Based on Gated Spatio-Temporal Attention
Author(s):
LI Weijun, ZHANG Xinyong, GAO Yuxiao, GU Jianlai, LIU Jintong
1. School of Computer Science and Engineering, North Minzu University, Yinchuan 750021, China; 2. The Key Laboratory of Images and Graphics Intelligent Processing of State Ethnic Affairs Commission, North Minzu University, Yinchuan 750021, China
Keywords:
video frame prediction; convolutional neural network; attention mechanism; gated convolution; encoder-decoder network
DOI:
10.13705/j.issn.1671-6833.2024.01.017
Document Code:
A
Abstract:
A video frame prediction model based on gated spatio-temporal attention was proposed to address the issues of low accuracy, slow training, complex structure, and error accumulation in recurrent video frame prediction architectures. First, high-level semantic information of the video frame sequence was extracted by a spatial encoder while background features were preserved. Second, a gated spatio-temporal attention mechanism was established, using multi-scale depthwise strip convolutions and channel attention to learn both intra-frame and inter-frame spatio-temporal features; a gated fusion mechanism was employed to balance the feature learning capabilities of the spatial and temporal attention. Finally, a spatial decoder reconstructed the high-level features into predicted realistic images and complemented background semantics to refine the details. Experimental results on the Moving MNIST, TaxiBJ, WeatherBench, and KITTI datasets showed that, compared with the multi-in-multi-out model SimVP, the mean squared error (MSE) was reduced by 14.7%, 6.7%, 10.5%, and 18.5%, respectively. In ablation and extension experiments, the proposed model achieved good overall performance, with advantages such as high prediction accuracy, low computational cost, and efficient inference.
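
To make the pipeline in the abstract concrete, below is a minimal PyTorch sketch of a gated spatio-temporal attention block, reconstructed from the abstract together with its cited building blocks: SegNeXt-style multi-scale depthwise strip convolutions [11] and squeeze-and-excitation channel attention [12]. The class names, kernel sizes, and the residual gated-fusion rule are illustrative assumptions, not the authors' published implementation.

```python
# Hedged sketch of a gated spatio-temporal attention (GSTA) block.
# All module names, kernel sizes, and the fusion rule are hypothetical,
# assembled from the abstract's description and its cited references.
import torch
import torch.nn as nn


class MultiScaleStripAttention(nn.Module):
    """Spatial branch: multi-scale depthwise strip convolutions (SegNeXt-style)."""
    def __init__(self, channels: int, scales=(7, 11)):
        super().__init__()
        self.proj_in = nn.Conv2d(channels, channels, 1)
        self.strips = nn.ModuleList()
        for k in scales:
            self.strips.append(nn.Sequential(
                # a k x 1 followed by a 1 x k depthwise conv approximates
                # a large k x k kernel at far lower cost
                nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels),
                nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels),
            ))
        self.proj_out = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        u = self.proj_in(x)
        attn = sum(strip(u) for strip in self.strips)
        return self.proj_out(attn) * x  # attention map acts as a spatial gate


class ChannelAttention(nn.Module):
    """Temporal branch: SE-style channel attention. With T frames stacked
    along the channel axis, channel weighting mixes inter-frame information."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.fc(x) * x


class GatedSpatioTemporalAttention(nn.Module):
    """Fuses the two branches with a learned per-pixel gate."""
    def __init__(self, channels: int):
        super().__init__()
        self.spatial = MultiScaleStripAttention(channels)
        self.temporal = ChannelAttention(channels)
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, x):
        s, t = self.spatial(x), self.temporal(x)
        g = self.gate(torch.cat([s, t], dim=1))
        return x + g * s + (1 - g) * t  # residual gated fusion


if __name__ == "__main__":
    # 4 input frames with 16 feature channels each, stacked along channels
    frames = torch.randn(2, 4 * 16, 32, 32)
    block = GatedSpatioTemporalAttention(4 * 16)
    print(block(frames).shape)  # torch.Size([2, 64, 32, 32])
```

The usage example follows the multi-in-multi-out convention of SimVP [10]: the T input frames are stacked along the channel axis, so plain 2D channel attention can mix inter-frame information while the strip convolutions capture intra-frame spatial structure.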

References:

[1] DAI K, LI X T, YE Y M, et al. MSTCGAN: multiscale time conditional generative adversarial network for long-term satellite image sequence prediction[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 1-16.

[2] TAN C, LI S Y, GAO Z Y, et al. OpenSTL: a comprehensive benchmark of spatio-temporal predictive learning[EB/OL]. (2023-06-20)[2023-07-20]. https://arxiv.org/abs/2306.11249.
[3] SRIVASTAVA N, MANSIMOV E, SALAKHUTDINOV R. Unsupervised learning of video representations using LSTMs[EB/OL]. (2016-01-04)[2023-07-20]. https://arxiv.org/abs/1502.04681.
[4] SHI X J, CHEN Z R, WANG H, et al. Convolutional LSTM network: a machine learning approach for precipitation nowcasting[EB/OL]. (2015-09-19)[2023-07-20]. https://arxiv.org/abs/1506.04214.
[5] WANG Y B, LONG M S, WANG J M, et al. PredRNN: recurrent neural networks for predictive learning using spatiotemporal LSTMs[C]//NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates, 2017: 879-888.
[6] WANG Y B, GAO Z F, LONG M S, et al. PredRNN++: towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning[EB/OL]. (2018-11-19)[2023-07-20]. https://arxiv.org/abs/1804.06300.
[7] LIU Z W, YEH R A, TANG X O, et al. Video frame synthesis using deep voxel flow[C]//2017 IEEE International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2017: 4473-4481.
[8] AIGNER S, KÖRNER M. FutureGAN: anticipating the future frames of video sequences using spatio-temporal 3D convolutions in progressively growing GANs[EB/OL]. (2018-11-26)[2023-07-20]. https://arxiv.org/abs/1810.01325.
[9] YE X, BILODEAU G A. VPTR: efficient transformers for video prediction[C]//2022 26th International Conference on Pattern Recognition (ICPR). Piscataway: IEEE, 2022: 3492-3499.
[10] GAO Z Y, TAN C, WU L R, et al. SimVP: simpler yet better video prediction[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 3160-3170.
[11] GUO M H, LU C Z, HOU Q B, et al. SegNeXt: rethinking convolutional attention design for semantic segmentation[EB/OL]. (2022-09-18)[2023-07-20]. https://arxiv.org/abs/2209.08575.
[12] HU J, SHEN L, SUN G. Squeeze-and-excitation networks[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7132-7141.
[13] WANG Y B, ZHANG J J, ZHU H Y, et al. Memory in memory: a predictive neural network for learning higher-order non-stationarity from spatiotemporal dynamics[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2020: 9146-9154.
[14] LOTTER W, KREIMAN G, COX D. Deep predictive coding networks for video prediction and unsupervised learning[EB/OL]. (2017-05-01)[2023-07-20]. https://arxiv.org/abs/1605.08104.
[15] GUEN V L, THOME N. Disentangling physical dynamics from unknown factors for unsupervised video prediction[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2020: 11471-11481.
[16] PAN T, JIANG Z Q, HAN J N, et al. Taylor saves for later: disentanglement for video prediction using Taylor representation[J]. Neurocomputing, 2022, 472: 166-174.
[17] SUN F, BAI C, SONG Y, et al. MMINR: multi-frame-to-multi-frame inference with noise resistance for precipitation nowcasting with radar[C]//2022 26th International Conference on Pattern Recognition (ICPR). Piscataway: IEEE, 2022: 97-103.
[18] NING S L, LAN M C, LI Y R, et al. MIMO is all you need: a strong multi-in-multi-out baseline for video prediction[EB/OL]. (2023-05-30)[2023-07-20]. https://arxiv.org/abs/2212.04655.
[19] TAN C, GAO Z Y, LI S Y, et al. SimVP: towards simple yet powerful spatiotemporal predictive learning[EB/OL]. (2023-04-26)[2023-07-20]. https://arxiv.org/abs/2211.12509.
[20] SMITH L N, TOPIN N. Super-convergence: very fast training of neural networks using large learning rates[EB/OL]. (2017-08-23)[2023-07-20]. https://arxiv.org/abs/1708.07120v1.
[21] CHANG Z, ZHANG X F, WANG S S, et al. MAU: a motion-aware unit for video prediction and beyond[C]//35th Conference on Neural Information Processing Systems. Sydney: NeurIPS, 2021: 1-13.
[22] ZHANG J B, ZHENG Y, QI D K. Deep spatio-temporal residual networks for citywide crowd flows prediction[C]//Proceedings of the 31st AAAI Conference on Artificial Intelligence. New York: ACM, 2017: 1655-1661.
[23] TAN C, GAO Z Y, WU L R, et al. Temporal attention unit: towards efficient spatiotemporal predictive learning[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2023: 18770-18782.
[24] RASP S, DUEBEN P D, SCHER S, et al. WeatherBench: a benchmark data set for data-driven weather forecasting[J]. Journal of Advances in Modeling Earth Systems, 2020, 12(11): 1-17.
[25] DING X H, ZHANG X Y, MA N N, et al. RepVGG: making VGG-style ConvNets great again[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2021: 13728-13737.

Similar Documents:

[1] HAO Wangshen, CHEN Yao, SUN Hao, et al. Bearing Fault Diagnosis Based on Full Vector-CNN[J]. Journal of Zhengzhou University (Engineering Science), 2020, 41(05): 92. [doi:10.13705/j.issn.1671-6833.2020.03.004]
[2] SUN Ning, WANG Longyu, LIU Jixin, et al. Scene Recognition Based on Privilege Information and Attention Mechanism[J]. Journal of Zhengzhou University (Engineering Science), 2021, 42(01): 42. [doi:10.13705/j.issn.1671-6833.2021.01.007]
[3] BEN Kerong, YANG Jiahui, ZHANG Xian, et al. Code Clone Detection Based on Transformer and Convolutional Neural Network[J]. Journal of Zhengzhou University (Engineering Science), 2023, 44(06): 12. [doi:10.13705/j.issn.1671-6833.2023.03.012]
[4] GAO Yufei, MA Zixing, XU Jing, et al. Brain Glioma Image Segmentation Based on Convolution and Deformable Attention[J]. Journal of Zhengzhou University (Engineering Science), 2024, 45(02): 27. [doi:10.13705/j.issn.1671-6833.2023.05.007]

Last Update: 2024-01-24