
AdvReverb: Rethinking the Stealthiness of Audio Adversarial Examples to Human Perception

Published: 21 December 2023

Abstract

Audio systems built on deep learning, including keyword spotting, automatic speech recognition, and speaker identification, have recently been shown to be vulnerable to adversarial examples, raising broad concerns in both academia and industry. Existing attacks follow the adversarial example generation paradigm inherited from computer vision, i.e., overlaying optimized additive perturbations on original voices. However, because additive perturbations are inherently audible to humans, balancing stealthiness against attack capability remains a challenging problem. In this paper, we rethink the stealthiness of audio adversarial examples and introduce another kind of audio distortion, reverberation, as a new perturbation format for stealthy adversarial example generation. These convolutional adversarial perturbations are crafted as real-world impulse responses and behave like natural reverberation, thereby deceiving human listeners. Based on this idea, we propose AdvReverb to construct, optimize, and deliver phoneme-level convolutional adversarial perturbations on both speech and music carriers with a well-designed objective. Experimental results demonstrate that AdvReverb achieves attack success rates above 95% on three audio-domain tasks while attaining superior perceptual quality and remaining stealthy to human perception in both over-the-air and over-the-line delivery scenarios.
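The convolutional perturbation format described above can be illustrated with a minimal sketch: instead of adding a noise signal to the carrier, the carrier waveform is convolved with an impulse-response-like kernel, so the distortion sounds like room reverberation. The toy impulse response and the `apply_convolutional_perturbation` helper below are hypothetical illustrations under simple assumptions, not the paper's optimized phoneme-level perturbations.

```python
import numpy as np

def apply_convolutional_perturbation(carrier: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a carrier waveform with an impulse response (the
    convolutional perturbation format), truncate to the carrier length,
    and rescale so the output peak matches the carrier's peak level."""
    out = np.convolve(carrier, rir)[: len(carrier)]
    peak = np.max(np.abs(out))
    if peak > 0:
        out = out * (np.max(np.abs(carrier)) / peak)
    return out

# Toy example: a sparse decaying-echo kernel applied to a 1-second sine "voice".
sr = 16000
t = np.arange(sr) / sr
carrier = 0.5 * np.sin(2 * np.pi * 440.0 * t)

rir = np.zeros(800)
rir[0] = 1.0     # direct path
rir[400] = 0.6   # one reflection arriving 25 ms later
perturbed = apply_convolutional_perturbation(carrier, rir)
```

In an attack setting, the kernel entries would be the optimization variables, constrained so the kernel stays close to a plausible room impulse response; here the kernel is fixed by hand purely to show the signal path.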


Cited By

  • (2024) "Adversarial Perturbation Prediction for Real-Time Protection of Speech Privacy," IEEE Transactions on Information Forensics and Security, vol. 19, pp. 8701–8716. DOI: 10.1109/TIFS.2024.3463538
  • (2024) "Adversarial Examples Against WiFi Fingerprint-Based Localization in the Physical World," IEEE Transactions on Information Forensics and Security, vol. 19, pp. 8457–8471. DOI: 10.1109/TIFS.2024.3453041


        Published In

IEEE Transactions on Information Forensics and Security, Volume 19, 2024 (9612 pages)

        Publisher

        IEEE Press


        Qualifiers

        • Research-article

