[1] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]∥2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2016: 770-778.
[2] TAN M X, LE Q V. EfficientNet: rethinking model scaling for convolutional neural networks[EB/OL]. (2020-09-11)[2023-08-09]. https://arxiv.org/abs/1905.11946.
[3] RADOSAVOVIC I, KOSARAJU R P, GIRSHICK R, et al. Designing network design spaces[C]∥2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2020: 10425-10433.
[4] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]∥Proceedings of the 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010.
[5] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: Transformers for image recognition at scale[EB/OL]. (2021-06-03)[2023-08-09]. https://arxiv.org/abs/2010.11929.
[6] CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with Transformers[J]. Lecture Notes in Computer Science, 2020, 12346: 213-229.
[7] WANG H Y, ZHU Y K, ADAM H, et al. MaX-DeepLab: end-to-end panoptic segmentation with mask Transformers[C]∥2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2021: 5459-5470.
[8] CHENG B W, SCHWING A G, KIRILLOV A. Per-pixel classification is not all you need for semantic segmentation[EB/OL]. (2021-08-31)[2023-08-09]. https://arxiv.org/abs/2107.06278.
[9] CHEN X, YAN B, ZHU J W, et al. Transformer tracking[C]∥2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2021: 8122-8131.
[10] JIANG Y F, CHANG S Y, WANG Z Y. TransGAN: two pure Transformers can make one strong GAN, and that can scale up[EB/OL]. (2021-12-09)[2023-08-09]. https://arxiv.org/abs/2102.07074.
[11] CHEN H T, WANG Y H, GUO T Y, et al. Pre-trained image processing Transformer[C]∥2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2021: 12294-12305.
[12] TAY Y, DEHGHANI M, BAHRI D, et al. Efficient Transformers: a survey[J]. ACM Computing Surveys, 2023, 55(6): 1-28.
[13] KHAN S, NASEER M, HAYAT M, et al. Transformers in vision: a survey[J]. ACM Computing Surveys, 2022, 54(10s): 1-41.
[14] HAN K, WANG Y H, CHEN H T, et al. A survey on Vision Transformer[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(1): 87-110.
[15] LIN T Y, WANG Y X, LIU X Y, et al. A survey of Transformers[J]. AI Open, 2022, 3: 111-132.
[16] BI Y, XUE B, ZHANG M J. A survey on genetic programming to image analysis[J]. Journal of Zhengzhou University (Engineering Science), 2018, 39(6): 3-13.
[17] YUAN L, CHEN Y P, WANG T, et al. Tokens-to-token ViT: training Vision Transformers from scratch on ImageNet[C]∥2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2021: 558-567.
[18] WU H P, XIAO B, CODELLA N, et al. CvT: introducing convolutions to Vision Transformers[C]∥2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2021: 22-31.
[19] WANG W H, XIE E Z, LI X, et al. Pyramid Vision Transformer: a versatile backbone for dense prediction without convolutions[C]∥2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2021: 568-578.
[20] WANG W H, XIE E Z, LI X, et al. PVTv2: improved baselines with pyramid Vision Transformer[J]. Computational Visual Media, 2022, 8(3): 415-424.
[21] PAN Z Z, ZHUANG B H, HE H Y, et al. Less is more: pay less attention in Vision Transformers[EB/OL]. (2021-12-23)[2023-08-09]. https://arxiv.org/abs/2105.14217.
[22] SHAW P, USZKOREIT J, VASWANI A. Self-attention with relative position representations[EB/OL]. (2018-04-12)[2023-08-09]. https://arxiv.org/abs/1803.02155.
[23] CHU X X, TIAN Z, ZHANG B, et al. Conditional positional encodings for Vision Transformers[EB/OL]. (2023-02-13)[2023-08-09]. https://arxiv.org/abs/2102.10882.
[24] DONG X Y, BAO J M, CHEN D D, et al. CSWin Transformer: a general Vision Transformer backbone with cross-shaped windows[C]∥2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 12114-12124.
[25] LIU Z, LIN Y T, CAO Y, et al. Swin Transformer: hierarchical Vision Transformer using shifted windows[C]∥2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2021: 10012-10022.
[26] ZHANG Z M, GONG X. Axially expanded windows for local-global interaction in Vision Transformers[EB/OL]. (2022-11-13)[2023-08-09]. https://arxiv.org/abs/2209.08726.
[27] TU Z Z, TALEBI H, ZHANG H, et al. MaxViT: multi-axis Vision Transformer[C]∥European Conference on Computer Vision. Cham: Springer, 2022: 459-479.
[28] FANG J M, XIE L X, WANG X G, et al. MSG-Transformer: exchanging local spatial information by manipulating messenger tokens[C]∥2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 12053-12062.
[29] HAN K, XIAO A, WU E H, et al. Transformer in Transformer[EB/OL]. (2021-08-26)[2023-08-09]. https://arxiv.org/abs/2103.00112.
[30] CHU X X, TIAN Z, WANG Y Q, et al. Twins: revisiting the design of spatial attention in Vision Transformers[EB/OL]. (2021-09-30)[2023-08-09]. https://arxiv.org/abs/2104.13840.
[31] FAN Q H, HUANG H B, GUAN J Y, et al. Rethinking local perception in lightweight Vision Transformer[EB/OL]. (2023-06-01)[2023-08-09]. https://arxiv.org/abs/2303.17803.
[32] GUO J Y, HAN K, WU H, et al. CMT: convolutional neural networks meet Vision Transformers[C]∥2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 12165-12175.
[33] WOO S, DEBNATH S, HU R H, et al. ConvNeXt V2: co-designing and scaling ConvNets with masked autoencoders[C]∥2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2023: 16133-16142.
[34] SANDLER M, HOWARD A, ZHU M L, et al. MobileNetV2: inverted residuals and linear bottlenecks[C]∥2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 4510-4520.
[35] LIU Z, MAO H Z, WU C Y, et al. A ConvNet for the 2020s[C]∥2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 11966-11976.
[36] REN S C, ZHOU D Q, HE S F, et al. Shunted self-attention via multi-scale token aggregation[C]∥2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 10853-10862.
[37] YUAN K, GUO S P, LIU Z W, et al. Incorporating convolution designs into Visual Transformers[C]∥2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2021: 559-568.
[38] LEE-THORP J, AINSLIE J, ECKSTEIN I, et al. FNet: mixing tokens with Fourier Transforms[EB/OL]. (2022-05-26)[2023-08-09]. https://arxiv.org/abs/2105.03824.
[39] MARTINS A F T, FARINHAS A, TREVISO M, et al. Sparse and continuous attention mechanisms[EB/OL]. (2020-10-29)[2023-08-09]. https://arxiv.org/abs/2006.07214.
[40] MARTINS P H, MARINHO Z, MARTINS A F T. ∞-former: infinite memory Transformer[EB/OL]. (2022-05-25)[2023-08-09]. https://arxiv.org/abs/2109.00301.
[41] RAO Y M, ZHAO W L, ZHU Z, et al. Global filter networks for image classification[EB/OL]. (2021-10-26)[2023-08-09]. https://arxiv.org/abs/2107.00645.
[42] YU W H, LUO M, ZHOU P, et al. MetaFormer is actually what you need for vision[C]∥2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2022: 10819-10829.
[43] BERTASIUS G, WANG H, TORRESANI L. Is space-time attention all you need for video understanding?[EB/OL]. (2021-02-24)[2023-08-09]. https://arxiv.org/abs/2102.05095.