[1] GRAVES A, FERNÁNDEZ S, GOMEZ F, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks[C]∥Proceedings of the 23rd International Conference on Machine Learning. New York: ACM, 2006: 369-376.
[2] GRAVES A, MOHAMED A R, HINTON G. Speech recognition with deep recurrent neural networks[C]∥2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2013: 6645-6649.
[3] CHAN W, JAITLY N, LE Q, et al. Listen, attend and spell: a neural network for large vocabulary conversational speech recognition[C]∥2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2016: 4960-4964.
[4] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]∥Proceedings of the 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010.
[5] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[EB/OL]. (2019-05-24)[2023-03-10]. https://arxiv.org/abs/1810.04805.
[6] GALES M J F, KNILL K M, RAGNI A, et al. Speech recognition and keyword spotting for low-resource languages: Babel project research at CUED[C]∥The 4th International Workshop on Spoken Language Technologies for Under-Resourced Languages. St. Petersburg: RFBR, 2014: 16-23.
[7] ZHAO S F, DONG X Y. Research on speech recognition based on improved LSTM deep neural network[J]. Journal of Zhengzhou University (Engineering Science), 2018, 39(5): 63-67.
[8] THOMAS S, GANAPATHY S, HERMANSKY H. Multilingual MLP features for low-resource LVCSR systems[C]∥2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2012: 4269-4272.
[9] POVEY D, BURGET L, AGARWAL M, et al. The subspace Gaussian mixture model: a structured model for speech recognition[J]. Computer Speech & Language, 2011, 25(2): 404-439.
[10] IMSENG D, BOURLARD H, GARNER P N. Using KL-divergence and multilingual information to improve ASR for under-resourced languages[C]∥2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2012: 4869-4872.
[11] MOHAMED A R, DAHL G E, HINTON G. Acoustic modeling using deep belief networks[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2012, 20(1): 14-22.
[12] POVEY D, CHENG G F, WANG Y M, et al. Semi-orthogonal low-rank matrix factorization for deep neural networks[C]∥Interspeech 2018. Hyderabad: ISCA, 2018: 3743-3747.
[13] XUE J X, HUANG S B, WANG Y B, et al. Speech emotion recognition TSTNet based on spatial-temporal features[J]. Journal of Zhengzhou University (Engineering Science), 2021, 42(6): 28-33.
[14] POVEY D, PEDDINTI V, GALVEZ D, et al. Purely sequence-trained neural networks for ASR based on lattice-free MMI[C]∥Interspeech 2016. San Francisco: ISCA, 2016: 2751-2755.
[15] JAITLY N, HINTON G E. Vocal tract length perturbation (VTLP) improves speech recognition[C]∥Proceedings of the Workshop on Deep Learning for Audio, Speech and Language. Atlanta: ICML, 2013: 1-5.
[16] KO T, PEDDINTI V, POVEY D, et al. Audio augmentation for speech recognition[C]∥Interspeech 2015. Dresden: ISCA, 2015: 3586-3589.
[17] PARK D S, CHAN W, ZHANG Y, et al. SpecAugment: a simple data augmentation method for automatic speech recognition[EB/OL]. (2019-04-18)[2023-03-10]. https://arxiv.org/abs/1904.08779.
[18] KHARITONOV E, RIVIÈRE M, SYNNAEVE G, et al. Data augmenting contrastive learning of speech representations in the time domain[C]∥2021 IEEE Spoken Language Technology Workshop (SLT). Piscataway: IEEE, 2021: 215-222.
[19] XIE Q Z, LUONG M T, HOVY E, et al. Self-training with noisy student improves ImageNet classification[C]∥2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2020: 10684-10695.
[20] GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks[EB/OL]. (2014-06-10)[2023-03-10]. https://arxiv.org/abs/1406.2661.
[21] WANG K F, GOU C, DUAN Y J, et al. Generative adversarial networks: the state of the art and beyond[J]. Acta Automatica Sinica, 2017, 43(3): 321-332.
[22] QIAN Y M, HU H, TAN T. Data augmentation using generative adversarial networks for robust speech recognition[J]. Speech Communication, 2019, 114: 1-9.
[23] SUN S N, YEH C F, OSTENDORF M, et al. Training augmentation with adversarial examples for robust speech recognition[EB/OL]. (2018-06-07)[2023-03-10]. https://arxiv.org/abs/1806.02782.
[24] SHINOHARA Y. Adversarial multi-task learning of deep neural networks for robust speech recognition[C]∥Interspeech 2016. San Francisco: ISCA, 2016: 2369-2372.
[25] LIU B, NIE S, ZHANG Y P, et al. Boosting noise robustness of acoustic model via deep adversarial training[C]∥2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2018: 5034-5038.
[26] LI C Y, VU N T. Improving speech recognition on noisy speech via speech enhancement with multi-discriminators CycleGAN[C]∥2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Piscataway: IEEE, 2022: 830-836.
[27] QU D, ZHANG W L, YANG X K. Practical deep learning foundation[M]. Beijing: Tsinghua University Press, 2022.
[28] CHUNG Y A, HSU W N, TANG H, et al. An unsupervised autoregressive model for speech representation learning[C]∥Interspeech 2019. Graz: ISCA, 2019: 146-150.
[29] CHUNG Y A, TANG H, GLASS J. Vector-quantized autoregressive predictive coding[C]∥Interspeech 2020. Shanghai: ISCA, 2020: 3760-3764.
[30] LIU A T, LI S W, LEE H Y. TERA: self-supervised learning of transformer encoder representation for speech[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 2351-2366.
[31] HSU W N, BOLTE B, TSAI Y H H, et al. HuBERT: self-supervised speech representation learning by masked prediction of hidden units[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3451-3460.
[32] GUTMANN M, HYVÄRINEN A. Noise-contrastive estimation: a new estimation principle for unnormalized statistical models[J]. Journal of Machine Learning Research, 2010, 9: 297-304.
[33] OORD A V D, LI Y Z, VINYALS O. Representation learning with contrastive predictive coding[EB/OL]. (2019-01-22)[2023-03-10]. https://arxiv.org/abs/1807.03748.
[34] SCHNEIDER S, BAEVSKI A, COLLOBERT R, et al. Wav2vec: unsupervised pre-training for speech recognition[C]∥Interspeech 2019. Graz: ISCA, 2019: 3465-3469.
[35] TJANDRA A, SAKTI S, NAKAMURA S. Sequence-to-sequence ASR optimization via reinforcement learning[C]∥2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2018: 5829-5833.
[36] TJANDRA A, SAKTI S, NAKAMURA S. End-to-end speech recognition sequence training with reinforcement learning[J]. IEEE Access, 2019, 7: 79758-79769.
[37] LUO Y P, CHIU C C, JAITLY N, et al. Learning online alignments with continuous rewards policy gradient[C]∥2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2017: 2801-2805.
[38] VARIANI E, RYBACH D, ALLAUZEN C, et al. Hybrid autoregressive transducer (HAT)[C]∥2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2020: 6139-6143.
[39] KALA T K, SHINOZAKI T. Reinforcement learning of speech recognition system based on policy gradient and hypothesis selection[C]∥2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2018: 5759-5763.
[40] CHUNG H, JEON H B, PARK J G. Semi-supervised training for sequence-to-sequence speech recognition using reinforcement learning[C]∥2020 International Joint Conference on Neural Networks (IJCNN). Piscataway: IEEE, 2020: 1-6.
[41] RADZIKOWSKI K, NOWAK R, WANG L, et al. Dual supervised learning for non-native speech recognition[J]. EURASIP Journal on Audio, Speech, and Music Processing, 2019, 2019(1): 1-10.
[42] WANG L, PAN W L. Speech recognition based on meta-learning[J]. Journal of Yunnan Minzu University (Natural Sciences Edition), 2019, 28(5): 510-516.
[43] HOU J L, PAN W L. Low-resource speech recognition based on meta-metric learning[J]. Journal of Yunnan Minzu University (Natural Sciences Edition), 2021, 30(3): 272-278.
[44] KLEJCH O, FAINBERG J, BELL P. Learning to adapt: a meta-learning approach for speaker adaptation[C]∥Interspeech 2018. Hyderabad: ISCA, 2018: 867-871.
[45] HSU J Y, CHEN Y J, LEE H Y. Meta learning for end-to-end low-resource speech recognition[C]∥2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2020: 7844-7848.
[46] XIAO Y B, GONG K, ZHOU P, et al. Adversarial meta sampling for multilingual low-resource speech recognition[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(16): 14112-14120.
[47] WINATA G I, CAHYAWIJAYA S, LIU Z H, et al. Learning fast adaptation on cross-accented speech recognition[C]∥Interspeech 2020. Shanghai: ISCA, 2020: 1276-1280.
[48] WINATA G I, CAHYAWIJAYA S, LIN Z J, et al. Meta-transfer learning for code-switched speech recognition[EB/OL]. (2020-03-04)[2023-03-10]. https://arxiv.org/abs/2003.01901.