
Neighbourhood Representative Sampling for Efficient End-to-End Video Quality Assessment

Published: 26 September 2023

Abstract

The increasing resolution of real-world videos presents a dilemma between efficiency and accuracy for deep Video Quality Assessment (VQA). On the one hand, keeping the original resolution leads to unacceptable computational costs. On the other hand, existing practices such as resizing or cropping change the quality of the original videos through loss of detail or content, and are therefore harmful to quality assessment. Studies of spatial-temporal redundancy in the human visual system suggest that visual quality within a local neighbourhood is very likely to be similar, which motivates us to investigate an effective, quality-sensitive neighbourhood representative sampling scheme for VQA. In this work, we propose a unified scheme, spatial-temporal grid mini-cube sampling (St-GMS), whose resultant samples are named fragments. In St-GMS, full-resolution videos are first divided into mini-cubes with predefined spatial-temporal grids, and then temporally aligned quality representatives are sampled from these mini-cubes to compose the fragments that serve as inputs for VQA. In addition, we design the Fragment Attention Network (FANet), a network architecture tailored specifically for fragments. With fragments and FANet, the proposed FAST-VQA and FasterVQA (with an improved sampling scheme) achieve up to 1612× higher efficiency than the existing state of the art, while achieving significantly better performance on all relevant VQA benchmarks.
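The abstract outlines the St-GMS pipeline: divide a full-resolution video into spatial-temporal grid mini-cubes, sample one temporally aligned raw-resolution patch as the quality representative of each cell, and splice the patches into a small fragment that is fed to the network. The minimal NumPy sketch below illustrates that idea only; the function name st_gms_sample, the 7×7 grid, the 32×32 patch size, and the 8-frame clip length are illustrative assumptions for this sketch, not the authors' released implementation.

import numpy as np

def st_gms_sample(video, grid=7, patch=32, clip_len=8):
    # Sketch of spatial-temporal grid mini-cube sampling (St-GMS), assuming
    # video is an array of shape (T, H, W, C) at full resolution.
    # Returns a "fragment" of shape (clip_len, grid*patch, grid*patch, C).
    T, H, W, C = video.shape
    assert T >= clip_len and H >= grid * patch and W >= grid * patch

    # Pick one contiguous temporal clip so temporal variations are kept.
    t0 = np.random.randint(0, T - clip_len + 1)
    clip = video[t0:t0 + clip_len]

    cell_h, cell_w = H // grid, W // grid
    fragment = np.empty((clip_len, grid * patch, grid * patch, C), dtype=video.dtype)
    for gy in range(grid):
        for gx in range(grid):
            # One raw-resolution patch per grid cell; the SAME spatial offset
            # is reused for every frame, keeping the patches temporally aligned.
            y = gy * cell_h + np.random.randint(0, cell_h - patch + 1)
            x = gx * cell_w + np.random.randint(0, cell_w - patch + 1)
            fragment[:, gy * patch:(gy + 1) * patch,
                        gx * patch:(gx + 1) * patch] = clip[:, y:y + patch, x:x + patch]
    return fragment

if __name__ == "__main__":
    dummy = np.random.randint(0, 256, size=(32, 1080, 1920, 3), dtype=np.uint8)
    print(st_gms_sample(dummy).shape)  # (8, 224, 224, 3)

Because every grid cell contributes an unresized patch and all patches share one temporal offset, local textures and temporal variation are preserved while the network input shrinks to a small fixed-size clip.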

        Published In

IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 45, Issue 12
        Dec. 2023
        1966 pages

        Publisher

        IEEE Computer Society

        United States

        Publication History

        Published: 26 September 2023

        Qualifiers

        • Research-article
