[1]陈 燕,赖宇斌,肖 澳,等.基于 CLIP 和交叉注意力的多模态情感分析模型[J].郑州大学学报(工学版),2024,45(02):42-50.[doi:10.13705/j.issn.1671-6833.2024.02.003]
 CHEN Yan,LAI Yubin,XIAO Ao,et al.Multimodal Sentiment Analysis Model Based on CLIP and Cross-attention[J].Journal of Zhengzhou University (Engineering Science),2024,45(02):42-50.[doi:10.13705/j.issn.1671-6833.2024.02.003]

基于 CLIP 和交叉注意力的多模态情感分析模型

《郑州大学学报(工学版)》[ISSN:1671-6833/CN:41-1339/T]

Volume:
45
Issue:
2024, No. 02
Pages:
42-50
Column:
Publication date:
2024-03-06

文章信息/Info

Title:
Multimodal Sentiment Analysis Model Based on CLIP and Cross-attention
作者:
陈燕1,2, 赖宇斌1, 肖澳1, 廖宇翔1, 陈宁江1
1. 广西大学 计算机与电子信息学院,广西 南宁 530000;2. 广西大学 广西多媒体通信与网络技术重点实验室,广西 南宁 530000
Author(s):
CHEN Yan1,2, LAI Yubin1, XIAO Ao1, LIAO Yuxiang1, CHEN Ningjiang1
1. School of Computer and Electronic Information Science, Guangxi University, Nanning 530000, China; 2. Guangxi Key Laboratory of Multimedia Communication and Network Technology, Guangxi University, Nanning 530000, China
关键词:
情感分析 多模态学习 交叉注意力 CLIP 模型 Transformer 特征融合
Keywords:
sentiment analysis multimodal learning cross-attention CLIP model Transformer feature fusion
CLC number:
TP391
DOI:
10.13705/j.issn.1671-6833.2024.02.003
Document code:
A
摘要:
针对多模态情感分析中存在的标注数据量少、模态间融合不充分以及信息冗余等问题,提出了一种基于对比语言-图片训练(CLIP)和交叉注意力(CA)的多模态情感分析(MSA)模型 CLIP-CA-MSA。首先,该模型使用 CLIP 预训练的 BERT 模型、PIFT 模型来提取视频特征向量与文本特征;其次,使用交叉注意力机制将图像特征向量和文本特征向量进行交互,以加强不同模态之间的信息传递;最后,利用不确定性损失进行特征融合后计算输出最终的情感分类结果。实验结果表明:该模型比其他多模态模型准确率提高 5 百分点至 14 百分点,F1 值提高 3 百分点至 12 百分点,验证了该模型的优越性,并使用消融实验验证该模型各模块的有效性。该模型能够有效地利用多模态数据的互补性和相关性,同时利用不确定性损失来提高模型的鲁棒性和泛化能力。
Abstract:
In response to the issues of limited annotated data, insufficient fusion between modalities, and information redundancy in multimodal sentiment analysis, a multimodal sentiment analysis (MSA) model called CLIP-CA-MSA, based on contrastive language-image pre-training (CLIP) and a cross-attention (CA) mechanism, was proposed in this study. The model employed the CLIP pre-trained BERT and PIFT models to extract video feature vectors and text features. A cross-attention mechanism was then applied to let the image feature vectors and text feature vectors interact, enhancing information exchange across the different modalities. Finally, an uncertainty loss was used to fuse the features and compute the final sentiment classification results. The experimental results showed that the model improved accuracy by 5 to 14 percentage points and the F1 score by 3 to 12 percentage points over other multimodal models, which verified its superiority, and ablation experiments verified the effectiveness of each module of the model. The model could effectively exploit the complementarity and correlation of multimodal data, while using the uncertainty loss to improve its robustness and generalization ability.
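The abstract names the two fusion mechanisms only at a high level: cross-attention between the CLIP-derived image and text feature vectors, and an uncertainty loss used when fusing the modalities. The following minimal PyTorch sketch illustrates these two ideas; it is not the authors' released code, and the feature dimension, head count, module names, and the Kendall-style log-variance weighting are assumptions made only for illustration.

import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """One cross-attention block: the query modality attends over the other modality."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats: torch.Tensor, context_feats: torch.Tensor) -> torch.Tensor:
        # query_feats:   (batch, Lq, dim), e.g. text tokens from a CLIP text encoder
        # context_feats: (batch, Lk, dim), e.g. frame/patch features from a CLIP image encoder
        fused, _ = self.attn(query_feats, context_feats, context_feats)
        return self.norm(query_feats + fused)  # residual connection keeps the original modality signal

class UncertaintyWeightedLoss(nn.Module):
    """Uncertainty weighting in the style of Kendall et al.: each branch loss is scaled by a
    learned precision exp(-s_i), with s_i added as a regularizer, so noisier branches count less."""
    def __init__(self, num_losses: int = 2):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_losses))

    def forward(self, losses):
        total = torch.zeros((), device=losses[0].device)
        for i, loss in enumerate(losses):
            total = total + torch.exp(-self.log_vars[i]) * loss + self.log_vars[i]
        return total

# Usage sketch with random stand-ins for CLIP encoder outputs.
text_feats = torch.randn(4, 32, 512)   # (batch, text tokens, dim)
image_feats = torch.randn(4, 49, 512)  # (batch, image patches/frames, dim)
cross_attn = CrossModalAttention(dim=512, num_heads=8)
text_to_image = cross_attn(text_feats, image_feats)   # text attends to image
image_to_text = cross_attn(image_feats, text_feats)   # image attends to text
criterion = UncertaintyWeightedLoss(num_losses=2)
branch_losses = [text_to_image.pow(2).mean(), image_to_text.pow(2).mean()]  # placeholder losses
total_loss = criterion(branch_losses)

Here the per-branch losses are placeholders; in the pipeline the abstract describes, they would come from the sentiment classification heads applied after fusion. The sketch shows only how cross-attention exchanges information between modalities and how learned log-variances balance the per-modality losses.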

参考文献/References:

[1] PANG B, LEE L, VAITHYANATHAN S. Thumbs up? sentiment classification using machine learning techniques[C]∥Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002). Stroudsburg: ACL, 2002: 79-86.
[2] ZHANG L, LIU B. Sentiment analysis and opinion mining[EB/OL]. (2015-12-31)[2023-04-24]. https://doi.org/10.1007/978-1-4899-7502-7_907-2.
[3] 李勇, 金庆雨, 张青川. 融合位置注意力机制和改进BLSTM的食品评论情感分析[J]. 郑州大学学报(工学版), 2020, 41(1): 58-62.
LI Y, JIN Q Y, ZHANG Q C. Improved BLSTM food review sentiment analysis with positional attention mechanisms[J]. Journal of Zhengzhou University (Engineering Science), 2020, 41(1): 58-62.
[4] MUNIKAR M, SHAKYA S, SHRESTHA A. Fine-grained sentiment classification using BERT[EB/OL]. (2019-10-04)[2023-04-24]. https://arxiv.org/abs/1910.03474.
[5] ZHU X G, LI L, ZHANG W, et al. Dependency exploitation: a unified CNN-RNN approach for visual emotion recognition[C]∥Proceedings of the 26th International Joint Conference on Artificial Intelligence. New York: ACM, 2017: 3595-3601.
[6] YOU Q Z, JIN H L, LUO J B. Visual sentiment analysis by attending on local image regions[C]∥Proceedings of the Thirty-first AAAI Conference on Artificial Intelligence. New York: ACM, 2017: 231-237.
[7] WANG H H, MEGHAWAT A, MORENCY L P, et al. Select-additive learning: improving generalization in multimodal sentiment analysis[C]∥2017 IEEE International Conference on Multimedia and Expo (ICME). Piscataway: IEEE, 2017: 949-954.
[8] 吴思思, 马静. 基于感知融合的多任务多模态情感分析模型[J]. 数据分析与知识发现, 2023(10): 74-84.
WU S S, MA J. Multi-task & multi-modal sentiment analysis model based on aware fusion[J]. Data Analysis and Knowledge Discovery, 2023(10): 74-84.
[9] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[EB/OL]. (2021-02-26)[2023-04-24]. https://arxiv.org/abs/2103.00020.
[10] 赖宇斌, 陈燕, 胡小春, 等. 基于提示嵌入的突发公共卫生事件微博文本情感分析[J]. 数据分析与知识发现, 2023, 7(11): 46-55.
LAI Y B, CHEN Y, HU X C, et al. Emotional analysis of public health emergency micro-blog based on prompt embedding[J]. Data Analysis and Knowledge Discovery, 2023, 7(11): 46-55.
[11] YU W M, XU H, MENG F Y, et al. CH-SIMS: a Chinese multimodal sentiment analysis dataset with fine-grained annotation of modality[C]∥Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2020: 3718-3727.
[12] ZADEH A, CHEN M H, PORIA S, et al. Tensor fusion network for multimodal sentiment analysis[EB/OL]. (2017-07-23)[2023-04-24]. https://doi.org/10.48550/arXiv.1707.07250.
[13] LIU Z, SHEN Y, LAKSHMINARASIMHAN V B, et al. Efficient low-rank multimodal fusion with modality-specific factors[C]∥Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2018: 2247-2256.
[14] TSAI Y H H, BAI S J, LIANG P P, et al. Multimodal Transformer for unaligned multimodal language sequences[C]∥Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2019: 6558-6569.
[15] YU W M, XU H, YUAN Z Q, et al. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis[C]∥Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto: AAAI, 2021: 10790-10797.
[16] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. (2014-09-04)[2023-04-24]. https://arxiv.org/abs/1409.1556.
[17] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]∥2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2016: 770-778.
[18] LIU Z, MAO H Z, WU C Y, et al. A ConvNet for the 2020s[C]∥2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 11966-11976.
[19] BALTRUSAITIS T, ZADEH A, LIM Y C, et al. OpenFace 2.0: facial behavior analysis toolkit[C]∥2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). Piscataway: IEEE, 2018: 59-66.
[20] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[EB/OL]. (2020-10-22)[2023-04-24]. https://arxiv.org/abs/2010.11929.
[21] DESAI S, RAMASWAMY H G. Ablation-CAM: visual explanations for deep convolutional network via gradient-free localization[C]∥2020 IEEE Winter Conference on Applications of Computer Vision (WACV). Piscataway: IEEE, 2020: 972-980.
[22] LAN Z Z, CHEN M D, GOODMAN S, et al. ALBERT: a lite BERT for self-supervised learning of language representations[EB/OL]. (2019-09-26)[2023-04-24]. https://arxiv.org/abs/1909.11942.
[23] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[EB/OL]. (2018-11-11)[2023-04-24]. https://doi.org/10.48550/arXiv.1810.04805.
[24] SUN Y, WANG S H, LI Y K, et al. ERNIE: enhanced representation through knowledge integration[EB/OL]. (2019-04-19)[2023-04-24]. https://doi.org/10.48550/arXiv.1904.09223.
[25] CUI Y M, CHE W X, LIU T, et al. Revisiting pre-trained models for Chinese natural language processing[EB/OL]. (2020-04-29)[2023-04-24]. https://doi.org/10.48550/arXiv.2004.13922.
[26] LIU Y H, OTT M, GOYAL N, et al. RoBERTa: a robustly optimized BERT pretraining approach[EB/OL]. (2019-07-26)[2023-04-24]. https://doi.org/10.48550/arXiv.1907.11692.
[27] LUO H S, JI L, ZHONG M, et al. CLIP4Clip: an empirical study of CLIP for end to end video clip retrieval and captioning[J]. Neurocomputing, 2022, 508

相似文献/References:

[1]李勇,金庆雨,张青川.融合位置注意力机制和改进BLSTM的食品评论情感分析[J].郑州大学学报(工学版),2020,41(01):58.[doi:10.13705/j.issn.1671-6833.2020.01.006]
Li Yong, Jin Qingyu, Zhang Qingchuan. Improved BLSTM Food Review Sentiment Analysis with Positional Attention Mechanisms[J]. Journal of Zhengzhou University (Engineering Science), 2020, 41(01): 58.[doi:10.13705/j.issn.1671-6833.2020.01.006]

更新日期/Last Update: 2024-03-08