[1]纪 科,张 秀,马 坤,等.基于关键实体和文本摘要多特征融合的话题匹配算法[J].郑州大学学报(工学版),2024,45(02):51-59.[doi:10.13705/j.issn.1671-6833.2024.02.008]
 JI Ke,ZHANG Xiu,MA Kun,et al.Topic Matching Algorithm Based on Multi-feature Fusion of Key Entities and Text Abstracts[J].Journal of Zhengzhou University (Engineering Science),2024,45(02):51-59.[doi:10.13705/j.issn.1671-6833.2024.02.008]

Topic Matching Algorithm Based on Multi-feature Fusion of Key Entities and Text Abstracts

《郑州大学学报(工学版)》[ISSN:1671-6833/CN:41-1339/T]

Volume:
45
Issue:
2024, No. 02
Pages:
51-59
Publication Date:
2024-03-06

Article Information/Info

Title:
Topic Matching Algorithm Based on Multi-feature Fusion of Key Entities and Text Abstracts
作者:
纪科¹,² 张秀¹,² 马坤¹,² 孙润元¹,² 陈贞翔¹,² 邬俊³
1. 济南大学 信息科学与工程学院, 山东 济南 250022; 2. 济南大学 山东省网络环境智能计算技术重点实验室, 山东 济南 250022; 3. 北京交通大学 计算机与信息技术学院, 北京 100044
Author(s):
JI Ke¹,² ZHANG Xiu¹,² MA Kun¹,² SUN Runyuan¹,² CHEN Zhenxiang¹,² WU Jun³
1. School of Information Science and Engineering, University of Jinan, Jinan 250022, China; 2. Shandong Provincial Key Laboratory of Network Based Intelligent Computing, University of Jinan, Jinan 250022, China; 3. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
关键词:
话题匹配; 关键实体; 文本摘要; 文本匹配; 信息检索
Keywords:
topic matching; key entity; text summary; text matching; information retrieval
CLC Number:
TP391.1
DOI:
10.13705/j.issn.1671-6833.2024.02.008
Document Code:
A
摘要:
随着网络的快速普及,互联网新闻的数量剧增,在这种情况下,如何有效地找到更加符合特定主题的相关报道成为一个迫切需要解决的问题。针对这一问题,提出了基于关键实体和文本摘要多特征融合的话题匹配算法。首先,使用W2NER模型进行命名实体识别,通过词频、TF-IDF、词的合群性、词词相似度和词句相似度特征,提取关键的实体。其次,使用Pegasus模型进行文本摘要,通过BiLSTM融合关键实体特征与文本摘要特征,得到新闻文本的深层次语义特征。再次,使用交叉注意力机制对待匹配新闻进行特征交互,增进彼此的联系。最后,融合新闻文本的深层次语义特征和文本交互特征,共同参与文本话题匹配的判断。在来自于搜狐的真实数据上进行了不同算法的对比实验,结果表明:所提算法准确率和精确率均与其他算法效果相近,召回率和F1值均有所提升。
Abstract:
With the rapid popularization of the Internet, the amount of online news has increased dramatically, making it an urgent problem to effectively find reports relevant to a specific topic. To address this issue, a topic matching algorithm based on multi-feature fusion of key entities and text abstracts was proposed in this study. Firstly, the W2NER model was used for named entity recognition, and key entities were extracted using features such as word frequency, TF-IDF, lexical cohesion, word-word similarity, and word-sentence similarity. Secondly, the Pegasus model was used for text summarization, and deep semantic features of the news texts were obtained by fusing the key entity features with the text summary features through a BiLSTM. Next, a cross-attention mechanism was employed to perform feature interaction between the news articles to be matched, strengthening their connection. Finally, the deep semantic features of the news texts and the text interaction features were fused to jointly determine whether two texts match in topic. Comparative experiments on real data from Sohu showed that the proposed algorithm achieved accuracy and precision similar to those of other algorithms, while improving recall and F1 score.
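The abstract describes ranking candidate entities with word frequency and TF-IDF (among other features) before fusing them with summary features. As an illustrative sketch of that scoring step only, not the paper's implementation (the example documents and the cutoff k are invented here, and the paper additionally uses cohesion and similarity features), a minimal pure-Python TF-IDF ranker could look like:

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Score each term in each document by TF-IDF.
    docs: list of token lists. Returns one {term: score} dict per document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return scores

def top_entities(docs, k=2):
    """Keep the k highest-scoring terms of each document as 'key entities'."""
    return [
        [t for t, _ in sorted(s.items(), key=lambda kv: -kv[1])[:k]]
        for s in tfidf_scores(docs)
    ]

# Toy pre-tokenized news snippets (invented for the sketch).
docs = [
    ["epl", "final", "goal", "goal", "referee"],
    ["market", "stocks", "goal", "rally"],
    ["epl", "transfer", "striker", "goal"],
]
print(top_entities(docs))
```

A term that occurs in every document ("goal" here) gets an IDF of zero, so TF-IDF alone already suppresses topic-neutral words; the paper layers the remaining features on top of this signal.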

参考文献/References:

[1] MALA V, LOBIYAL D K. Semantic and keyword based web techniques in information retrieval[C]//2016 International Conference on Computing, Communication and Automation (ICCCA). Piscataway: IEEE, 2017: 23-26.

[2] 陈宁. 基于网络的关键词检索技巧[J]. 中国科技信息, 2008(2): 115, 117.
CHEN N. Key words retrieval skills based on network[J]. China Science and Technology Information, 2008(2): 115, 117.

[3] COHEN W W, RAVIKUMAR P, FIENBERG S. A comparison of string distance metrics for name-matching tasks[C]//Proceedings of the 2003 International Conference on Information Integration on the Web. New York: ACM, 2003: 73-78.

[4] 庞亮, 兰艳艳, 徐君, 等. 深度文本匹配综述[J]. 计算机学报, 2017, 40(4): 985-1003.
PANG L, LAN Y Y, XU J, et al. A survey on deep text matching[J]. Chinese Journal of Computers, 2017, 40(4): 985-1003.

[5] LIU J, KONG X, ZHOU X, et al. Data mining and information retrieval in the 21st century: a bibliographic review[J]. Computer Science Review, 2019, 34: 100193.

[6] ARORA S, BATRA K, SINGH S. Dialogue system: a brief review[EB/OL]. (2013-06-18) [2023-06-15]. https://arxiv.org/abs/1306.4134.

[7] MUELLER J, THYAGARAJAN A. Siamese recurrent architectures for learning sentence similarity[C]//Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. New York: ACM, 2016: 2786-2792.

[8] YIN W P, SCHÜTZE H, XIANG B, et al. ABCNN: attention-based convolutional neural network for modeling sentence pairs[J]. Transactions of the Association for Computational Linguistics, 2016, 4: 259-272.

[9] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[EB/OL]. (2018-10-11) [2023-06-15]. https://arxiv.org/abs/1810.04805.

[10] LIU Y H, OTT M, GOYAL N, et al. RoBERTa: a robustly optimized BERT pretraining approach[EB/OL]. (2019-07-26) [2023-06-15]. https://arxiv.org/abs/1907.11692.

[11] WEI J Q, REN X Z, LI X G, et al. NEZHA: neural contextualized representation for Chinese language understanding[EB/OL]. (2019-08-31) [2023-06-15]. https://arxiv.org/abs/1909.00204.

[12] PEINELT N, NGUYEN D, LIAKATA M. tBERT: topic models and BERT joining forces for semantic similarity detection[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2020: 7047-7055.

[13] 周澳回, 翁知远, 周思源, 等. 一种基于主题过滤和语义匹配的服务发现方法[J]. 郑州大学学报(工学版), 2022, 43(6): 36-41, 56.
ZHOU A H, WENG Z Y, ZHOU S Y, et al. A service discovery method based on topic filtering and semantic matching[J]. Journal of Zhengzhou University (Engineering Science), 2022, 43(6): 36-41, 56.

[14] MIAO C Y, CAO Z, TAM Y C. Keyword-attentive deep semantic matching[EB/OL]. (2020-05-11) [2023-06-15]. https://arxiv.org/abs/2003.11516.

[15] ZOU Y C, LIU H W, GUI T, et al. Divide and conquer: text semantic matching with disentangled keywords and intents[EB/OL]. (2022-05-06) [2023-06-15]. https://arxiv.org/abs/2203.02898.

[16] HUANG Z H, XU W, YU K. Bidirectional LSTM-CRF models for sequence tagging[EB/OL]. (2015-08-09) [2023-06-15]. https://arxiv.org/abs/1508.01991.

[17] LI J Y, FEI H, LIU J, et al. Unified named entity recognition as word-word relation classification[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2022, 36(10): 10965-10973.

[18] LIU Y. Fine-tune BERT for extractive summarization[EB/OL]. (2019-05-25) [2023-06-15]. https://arxiv.org/abs/1903.10318.

[19] ZHANG J Q, ZHAO Y, SALEH M, et al. PEGASUS: pre-training with extracted gap-sentences for abstractive summarization[EB/OL]. (2019-12-18) [2023-06-15]. https://arxiv.org/abs/1912.08777.

[20] YU Y, SI X, HU C, et al. A review of recurrent neural networks: LSTM cells and network architectures[J]. Neural Computation, 2019, 31(7): 1235-1270.

[21] MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space[EB/OL]. (2013-01-16) [2023-06-15]. https://arxiv.org/abs/1301.3781.

[22] ZHANG J X, GAN R Y, WANG J J, et al. Fengshenbang 1.0: being the foundation of Chinese cognitive intelligence[EB/OL]. (2022-09-07) [2023-06-15]. https://arxiv.org/abs/2209.02970.

[23] 李勇, 金庆雨, 张青川. 融合位置注意力机制和改进BLSTM的食品评论情感分析[J]. 郑州大学学报(工学版), 2020, 41(1): 58-62.
LI Y, JIN Q Y, ZHANG Q C. Improved BLSTM food review sentiment analysis with positional attention mechanisms[J]. Journal of Zhengzhou University (Engineering Science), 2020, 41(1): 58-62.

[24] 搜狐. 2021搜狐校园文本匹配算法大赛[EB/OL]. (2021-03-29) [2023-06-15]. https://www.biendata.xyz/competition/sohu_2021/.
Sohu. 2021 Sohu campus text matching algorithm competition[EB/OL]. (2021-03-29) [2023-06-15]. https://www.biendata.xyz/competition/sohu_2021/.

[25] REIMERS N, GUREVYCH I. Sentence-BERT: sentence embeddings using Siamese BERT-networks[EB/OL]. (2019-08-27) [2023-06-15]. https://arxiv.org/abs/1908.10084.

[26] SUN Y, WANG S H, FENG S K, et al. ERNIE 3.0: large-scale knowledge enhanced pre-training for language understanding and generation[EB/OL]. (2021-07-05) [2023-06-15]. https://arxiv.org/abs/2107.02137.

更新日期/Last Update: 2024-03-08