[1] GRAVES A, FERNÁNDEZ S, GOMEZ F, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks[C]∥Proceedings of the 23rd International Conference on Machine Learning. New York: ACM, 2006: 369-376.
[2] GRAVES A, MOHAMED A R, HINTON G. Speech recognition with deep recurrent neural networks[C]∥2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2013: 6645-6649.
[3] CHAN W, JAITLY N, LE Q, et al. Listen, attend and spell: a neural network for large vocabulary conversational speech recognition[C]∥2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2016: 4960-4964.
[4] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]∥Proceedings of the 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010.
[5] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[EB/OL]. (2019-05-24)[2023-03-10]. https://arxiv.org/abs/1810.04805.
[6] GALES M J F, KNILL K M, RAGNI A, et al. Speech recognition and keyword spotting for low-resource languages: Babel project research at CUED[C]∥The 4th International Workshop on Spoken Language Technologies for Under-Resourced Languages. St. Petersburg: RFBR, 2014: 16-23.
[7] ZHAO S F, DONG X Y. Research on speech recognition based on improved LSTM deep neural network[J]. Journal of Zhengzhou University (Engineering Science), 2018, 39(5): 63-67.
[8] THOMAS S, GANAPATHY S, HERMANSKY H. Multilingual MLP features for low-resource LVCSR systems[C]∥2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2012: 4269-4272.
[9] POVEY D, BURGET L, AGARWAL M, et al. The subspace Gaussian mixture model: a structured model for speech recognition[J]. Computer Speech & Language, 2011, 25(2): 404-439.
[10] IMSENG D, BOURLARD H, GARNER P N. Using KL-divergence and multilingual information to improve ASR for under-resourced languages[C]∥2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2012: 4869-4872.
[11] MOHAMED A R, DAHL G E, HINTON G. Acoustic modeling using deep belief networks[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2012, 20(1): 14-22.
[12] POVEY D, CHENG G F, WANG Y M, et al. Semi-orthogonal low-rank matrix factorization for deep neural networks[C]∥Interspeech 2018. Hyderabad: ISCA, 2018: 3743-3747.
[13] XUE J X, HUANG S B, WANG Y B, et al. Speech emotion recognition TSTNet based on spatial-temporal features[J]. Journal of Zhengzhou University (Engineering Science), 2021, 42(6): 28-33.
[14] POVEY D, PEDDINTI V, GALVEZ D, et al. Purely sequence-trained neural networks for ASR based on lattice-free MMI[C]∥Interspeech 2016. San Francisco: ISCA, 2016: 2751-2755.
[15] JAITLY N, HINTON G E. Vocal tract length perturbation (VTLP) improves speech recognition[C]∥Proceedings of the Workshop on Deep Learning for Audio, Speech and Language. Atlanta: ICML, 2013: 1-5.
[16] KO T, PEDDINTI V, POVEY D, et al. Audio augmentation for speech recognition[C]∥Interspeech 2015. Dresden: ISCA, 2015: 3586-3589.
[17] PARK D S, CHAN W, ZHANG Y, et al. SpecAugment: a simple data augmentation method for automatic speech recognition[EB/OL]. (2019-04-18)[2023-03-10]. https://arxiv.org/abs/1904.08779.
[18] KHARITONOV E, RIVIÈRE M, SYNNAEVE G, et al. Data augmenting contrastive learning of speech representations in the time domain[C]∥2021 IEEE Spoken Language Technology Workshop (SLT). Piscataway: IEEE, 2021: 215-222.
[19] XIE Q Z, LUONG M T, HOVY E, et al. Self-training with noisy student improves ImageNet classification[C]∥2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2020: 10684-10695.
[20] GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks[EB/OL]. (2014-06-10)[2023-03-10]. https://arxiv.org/abs/1406.2661.
[21] WANG K F, GOU C, DUAN Y J, et al. Generative adversarial networks: the state of the art and beyond[J]. Acta Automatica Sinica, 2017, 43(3): 321-332.
[22] QIAN Y M, HU H, TAN T. Data augmentation using generative adversarial networks for robust speech recognition[J]. Speech Communication, 2019, 114: 1-9.
[23] SUN S N, YEH C F, OSTENDORF M, et al. Training augmentation with adversarial examples for robust speech recognition[EB/OL]. (2018-06-07)[2023-03-10]. https://arxiv.org/abs/1806.02782.
[24] SHINOHARA Y. Adversarial multi-task learning of deep neural networks for robust speech recognition[C]∥Interspeech 2016. San Francisco: ISCA, 2016: 2369-2372.
[25] LIU B, NIE S, ZHANG Y P, et al. Boosting noise robustness of acoustic model via deep adversarial training[C]∥2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2018: 5034-5038.
[26] LI C Y, VU N T. Improving speech recognition on noisy speech via speech enhancement with multi-discriminators CycleGAN[C]∥2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Piscataway: IEEE, 2022: 830-836.
[27] QU D, ZHANG W L, YANG X K. Practical deep learning foundation[M]. Beijing: Tsinghua University Press, 2022.
[28] CHUNG Y A, HSU W N, TANG H, et al. An unsupervised autoregressive model for speech representation learning[C]∥Interspeech 2019. Graz: ISCA, 2019: 146-150.
[29] CHUNG Y A, TANG H, GLASS J. Vector-quantized autoregressive predictive coding[C]∥Interspeech 2020. Shanghai: ISCA, 2020: 3760-3764.
[30] LIU A T, LI S W, LEE H Y. TERA: self-supervised learning of transformer encoder representation for speech[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 2351-2366.
[31] HSU W N, BOLTE B, TSAI Y H H, et al. HuBERT: self-supervised speech representation learning by masked prediction of hidden units[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3451-3460.
[32] GUTMANN M, HYVÄRINEN A. Noise-contrastive estimation: a new estimation principle for unnormalized statistical models[J]. Journal of Machine Learning Research, 2010, 9: 297-304.
[33] OORD A V D, LI Y Z, VINYALS O. Representation learning with contrastive predictive coding[EB/OL]. (2019-01-22)[2023-03-10]. https://arxiv.org/abs/1807.03748.
[34] SCHNEIDER S, BAEVSKI A, COLLOBERT R, et al. Wav2vec: unsupervised pre-training for speech recognition[C]∥Interspeech 2019. Graz: ISCA, 2019: 3465-3469.
[35] TJANDRA A, SAKTI S, NAKAMURA S. Sequence-to-sequence ASR optimization via reinforcement learning[C]∥2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2018: 5829-5833.
[36] TJANDRA A, SAKTI S, NAKAMURA S. End-to-end speech recognition sequence training with reinforcement learning[J]. IEEE Access, 2019, 7: 79758-79769.
[37] LUO Y P, CHIU C C, JAITLY N, et al. Learning online alignments with continuous rewards policy gradient[C]∥2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2017: 2801-2805.
[38] VARIANI E, RYBACH D, ALLAUZEN C, et al. Hybrid autoregressive transducer (HAT)[C]∥2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2020: 6139-6143.
[39] KALA T K, SHINOZAKI T. Reinforcement learning of speech recognition system based on policy gradient and hypothesis selection[C]∥2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2018: 5759-5763.
[40] CHUNG H, JEON H B, PARK J G. Semi-supervised training for sequence-to-sequence speech recognition using reinforcement learning[C]∥2020 International Joint Conference on Neural Networks (IJCNN). Piscataway: IEEE, 2020: 1-6.
[41] RADZIKOWSKI K, NOWAK R, WANG L, et al. Dual supervised learning for non-native speech recognition[J]. EURASIP Journal on Audio, Speech, and Music Processing, 2019, 2019(1): 1-10.
[42] WANG L, PAN W L. Speech recognition based on meta-learning[J]. Journal of Yunnan Minzu University (Natural Sciences Edition), 2019, 28(5): 510-516.
[43] HOU J L, PAN W L. Low-resource speech recognition based on meta-metric learning[J]. Journal of Yunnan Minzu University (Natural Sciences Edition), 2021, 30(3): 272-278.
[44] KLEJCH O, FAINBERG J, BELL P. Learning to adapt: a meta-learning approach for speaker adaptation[C]∥Interspeech 2018. Hyderabad: ISCA, 2018: 867-871.
[45] HSU J Y, CHEN Y J, LEE H Y. Meta learning for end-to-end low-resource speech recognition[C]∥2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2020: 7844-7848.
[46] XIAO Y B, GONG K, ZHOU P, et al. Adversarial meta sampling for multilingual low-resource speech recognition[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(16): 14112-14120.
[47] WINATA G I, CAHYAWIJAYA S, LIU Z H, et al. Learning fast adaptation on cross-accented speech recognition[C]∥Interspeech 2020. Shanghai: ISCA, 2020: 1276-1280.
[48] WINATA G I, CAHYAWIJAYA S, LIN Z J, et al. Meta-transfer learning for code-switched speech recognition[EB/OL]. (2020-03-04)[2023-03-10]. https://arxiv.org/abs/2003.01901.