
Showing 1–50 of 260 results for author: Meng, H

  1. arXiv:2410.07542  [pdf, other]

    eess.SP cs.AI

    Generalizable Indoor Human Activity Recognition Method Based on Micro-Doppler Corner Point Cloud and Dynamic Graph Learning

    Authors: Xiaopeng Yang, Weicheng Gao, Xiaodong Qu, Haoyu Meng

    Abstract: Through-the-wall radar (TWR) human activity recognition can be achieved by fusing micro-Doppler signature extraction and intelligent decision-making algorithms. However, limited by the insufficient priori of tester in practical indoor scenarios, the trained models on one tester are commonly difficult to inference well on other testers, which causes poor generalization ability. To solve this proble…

    Submitted 9 October, 2024; originally announced October 2024.

    Comments: 15 pages, 12 figures, 6 tables, in IEEE Transactions on Aerospace and Electronic Systems, 2024

    MSC Class: 94 ACM Class: I.5.1

  2. arXiv:2409.16322  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD q-bio.NC

    Towards Within-Class Variation in Alzheimer's Disease Detection from Spontaneous Speech

    Authors: Jiawen Kang, Dongrui Han, Lingwei Meng, Jingyan Zhou, Jinchao Li, Xixin Wu, Helen Meng

    Abstract: Alzheimer's Disease (AD) detection has emerged as a promising research area that employs machine learning classification models to distinguish between individuals with AD and those without. Unlike conventional classification tasks, we identify within-class variation as a critical challenge in AD detection: individuals with AD exhibit a spectrum of cognitive impairments. Given that many AD detectio…

    Submitted 21 September, 2024; originally announced September 2024.

  3. arXiv:2409.12560  [pdf, other]

    eess.AS cs.SD

    AudioComposer: Towards Fine-grained Audio Generation with Natural Language Descriptions

    Authors: Yuanyuan Wang, Hangting Chen, Dongchao Yang, Zhiyong Wu, Helen Meng, Xixin Wu

    Abstract: Current Text-to-audio (TTA) models mainly use coarse text descriptions as inputs to generate audio, which hinders models from generating audio with fine-grained control of content and style. Some studies try to improve the granularity by incorporating additional frame-level conditions or control networks. However, this usually leads to complex system design and difficulties due to the requirement…

    Submitted 19 September, 2024; originally announced September 2024.

  4. arXiv:2409.12388  [pdf, other]

    eess.AS cs.AI cs.SD

    Disentangling Speakers in Multi-Talker Speech Recognition with Speaker-Aware CTC

    Authors: Jiawen Kang, Lingwei Meng, Mingyu Cui, Yuejiao Wang, Xixin Wu, Xunying Liu, Helen Meng

    Abstract: Multi-talker speech recognition (MTASR) faces unique challenges in disentangling and transcribing overlapping speech. To address these challenges, this paper investigates the role of Connectionist Temporal Classification (CTC) in speaker disentanglement when incorporated with Serialized Output Training (SOT) for MTASR. Our visualization reveals that CTC guides the encoder to represent different sp…

    Submitted 18 September, 2024; originally announced September 2024.

  5. arXiv:2409.11630  [pdf, other]

    cs.SD eess.AS

    Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation

    Authors: Haohan Guo, Fenglong Xie, Dongchao Yang, Xixin Wu, Helen Meng

    Abstract: The neural codec language model (CLM) has demonstrated remarkable performance in text-to-speech (TTS) synthesis. However, troubled by "recency bias", CLM lacks sufficient attention to coarse-grained information at a higher temporal scale, often producing unnatural or even unintelligible speech. This work proposes CoFi-Speech, a coarse-to-fine CLM-TTS approach, employing multi-scale speech coding…

    Submitted 17 September, 2024; originally announced September 2024.

  6. arXiv:2409.10411  [pdf, other]

    cs.CR cs.SE

    A Large-Scale Privacy Assessment of Android Third-Party SDKs

    Authors: Mark Huasong Meng, Chuan Yan, Yun Hao, Qing Zhang, Zeyu Wang, Kailong Wang, Sin Gee Teo, Guangdong Bai, Jin Song Dong

    Abstract: Third-party Software Development Kits (SDKs) are widely adopted in Android app development, to effortlessly accelerate development pipelines and enhance app functionality. However, this convenience raises substantial concerns about unauthorized access to users' privacy-sensitive information, which could be further abused for illegitimate purposes like user tracking or monetization. Our study offer…

    Submitted 16 September, 2024; originally announced September 2024.

    Comments: 16 pages

  7. arXiv:2409.08596  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions

    Authors: Lingwei Meng, Shujie Hu, Jiawen Kang, Zhaoqing Li, Yuejiao Wang, Wenxuan Wu, Xixin Wu, Xunying Liu, Helen Meng

    Abstract: Recent advancements in large language models (LLMs) have revolutionized various domains, bringing significant progress and new opportunities. Despite progress in speech-related tasks, LLMs have not been sufficiently explored in multi-talker scenarios. In this work, we present a pioneering effort to investigate the capability of LLMs in transcribing speech in multi-talker environments, following ve…

    Submitted 13 September, 2024; originally announced September 2024.

  8. arXiv:2409.06029  [pdf, other]

    cs.SD cs.AI eess.AS

    SongCreator: Lyrics-based Universal Song Generation

    Authors: Shun Lei, Yixuan Zhou, Boshi Tang, Max W. Y. Lam, Feng Liu, Hangyu Liu, Jingcheng Wu, Shiyin Kang, Zhiyong Wu, Helen Meng

    Abstract: Music is an integral part of human culture, embodying human intelligence and creativity, of which songs compose an essential part. While various aspects of song generation have been explored by previous works, such as singing voice, vocal composition and instrumental arrangement, etc., generating songs with both vocals and accompaniment given lyrics remains a significant challenge, hindering the a…

    Submitted 9 September, 2024; originally announced September 2024.

    Comments: work in progress

  9. arXiv:2409.00933  [pdf, other]

    cs.SD eess.AS

    SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis

    Authors: Haohan Guo, Fenglong Xie, Kun Xie, Dongchao Yang, Dake Guo, Xixin Wu, Helen Meng

    Abstract: The long speech sequence has been troubling language models (LM) based TTS approaches in terms of modeling complexity and efficiency. This work proposes SoCodec, a semantic-ordered multi-stream speech codec, to address this issue. It compresses speech into a shorter, multi-stream discrete semantic sequence with multiple tokens at each frame. Meanwhile, the ordered product quantization is proposed…

    Submitted 2 September, 2024; originally announced September 2024.

    Comments: Accepted by SLT 2024

  10. arXiv:2408.14357  [pdf, other]

    cs.SE

    Exploring ChatGPT App Ecosystem: Distribution, Deployment and Security

    Authors: Chuan Yan, Ruomai Ren, Mark Huasong Meng, Liuhuo Wan, Tian Yang Ooi, Guangdong Bai

    Abstract: ChatGPT has enabled third-party developers to create plugins to expand ChatGPT's capabilities. These plugins are distributed through OpenAI's plugin store, making them easily accessible to users. With ChatGPT as the backbone, this app ecosystem has illustrated great business potential by offering users personalized services in a conversational manner. Nonetheless, many crucial aspects regarding app…

    Submitted 26 August, 2024; originally announced August 2024.

    Comments: Accepted by the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE 2024)

  11. arXiv:2408.13893  [pdf, other]

    cs.SD cs.CL eess.AS

    SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models

    Authors: Dongchao Yang, Rongjie Huang, Yuanyuan Wang, Haohan Guo, Dading Chong, Songxiang Liu, Xixin Wu, Helen Meng

    Abstract: Scaling Text-to-speech (TTS) to large-scale datasets has been demonstrated as an effective method for improving the diversity and naturalness of synthesized speech. At the high level, previous large-scale TTS models can be categorized into either Auto-regressive (AR) based (e.g., VALL-E) or Non-auto-regressive (NAR) based models (e.g., NaturalSpeech 2/3). Although these works dem…

    Submitted 28 August, 2024; v1 submitted 25 August, 2024; originally announced August 2024.

    Comments: Submit to TASLP

  12. arXiv:2408.10588  [pdf, other]

    cs.CV cs.GR

    DEGAS: Detailed Expressions on Full-Body Gaussian Avatars

    Authors: Zhijing Shao, Duotun Wang, Qing-Yao Tian, Yao-Dong Yang, Hengyu Meng, Zeyu Cai, Bo Dong, Yu Zhang, Kang Zhang, Zeyu Wang

    Abstract: Although neural rendering has made significant advancements in creating lifelike, animatable full-body and head avatars, incorporating detailed expressions into full-body avatars remains largely unexplored. We present DEGAS, the first 3D Gaussian Splatting (3DGS)-based modeling method for full-body avatars with rich facial expressions. Trained on multiview videos of a given subject, our method lea…

    Submitted 20 August, 2024; originally announced August 2024.

  13. arXiv:2408.10493  [pdf, other]

    cs.LG

    Clustering by Mining Density Distributions and Splitting Manifold Structure

    Authors: Zhichang Xu, Zhiguo Long, Hua Meng

    Abstract: Spectral clustering requires the time-consuming decomposition of the Laplacian matrix of the similarity graph, thus limiting its applicability to large datasets. To improve the efficiency of spectral clustering, a top-down approach was recently proposed, which first divides the data into several micro-clusters (granular-balls), then splits these micro-clusters when they are not "compact", and fin…

    Submitted 19 August, 2024; originally announced August 2024.

  14. arXiv:2408.10084  [pdf, other]

    cs.LG

    TANGO: Clustering with Typicality-Aware Nonlocal Mode-Seeking and Graph-Cut Optimization

    Authors: Haowen Ma, Zhiguo Long, Hua Meng

    Abstract: Density-based clustering methods by mode-seeking usually achieve clustering by using local density estimation to mine structural information, such as local dependencies from lower density points to higher neighbors. However, they often rely too heavily on local structures and neglect global characteristics, which can lead to significant errors in peak selection and dependency establi…

    Submitted 19 August, 2024; originally announced August 2024.

  15. arXiv:2408.04914  [pdf, other]

    cs.CV

    GuidedNet: Semi-Supervised Multi-Organ Segmentation via Labeled Data Guide Unlabeled Data

    Authors: Haochen Zhao, Hui Meng, Deqian Yang, Xiaozheng Xie, Xiaoze Wu, Qingfeng Li, Jianwei Niu

    Abstract: Semi-supervised multi-organ medical image segmentation aids physicians in improving disease diagnosis and treatment planning and reduces the time and effort required for organ annotation. Existing state-of-the-art methods train the labeled data with ground truths and train the unlabeled data with pseudo-labels. However, the two training flows are separate, which does not reflect the interrelationsh…

    Submitted 2 September, 2024; v1 submitted 9 August, 2024; originally announced August 2024.

    Comments: Accepted by ACM MM2024, 10 pages, 5 figures

  16. GlitchProber: Advancing Effective Detection and Mitigation of Glitch Tokens in Large Language Models

    Authors: Zhibo Zhang, Wuxia Bai, Yuxi Li, Mark Huasong Meng, Kailong Wang, Ling Shi, Li Li, Jun Wang, Haoyu Wang

    Abstract: Large language models (LLMs) have achieved unprecedented success in the field of natural language processing. However, the black-box nature of their internal mechanisms has brought many concerns about their trustworthiness and interpretability. Recent research has discovered a class of abnormal tokens in the model's vocabulary space and named them "glitch tokens". Those tokens, once included in th…

    Submitted 22 September, 2024; v1 submitted 9 August, 2024; originally announced August 2024.

  17. arXiv:2407.21467  [pdf]

    cs.CV cs.AI

    Deep Learning-Based Longitudinal Prediction of Childhood Myopia Progression Using Fundus Image Sequences and Baseline Refraction Data

    Authors: Mengtian Kang, Yansong Hu, Shuo Gao, Yuanyuan Liu, Hongbei Meng, Xuemeng Li, Xuhang Chen, Hubin Zhao, Jing Fu, Guohua Hu, Wei Wang, Yanning Dai, Arokia Nathan, Peter Smielewski, Ningli Wang, Shiming Li

    Abstract: Childhood myopia constitutes a significant global health concern. It exhibits an escalating prevalence and has the potential to evolve into severe, irreversible conditions that detrimentally impact familial well-being and create substantial economic costs. Contemporary research underscores the importance of precisely predicting myopia progression to enable timely and effective interventions, there…

    Submitted 31 July, 2024; originally announced July 2024.

  18. Large Kernel Distillation Network for Efficient Single Image Super-Resolution

    Authors: Chengxing Xie, Xiaoming Zhang, Linze Li, Haiteng Meng, Tianlin Zhang, Tianrui Li, Xiaole Zhao

    Abstract: Efficient and lightweight single-image super-resolution (SISR) has achieved remarkable performance in recent years. One effective approach is the use of large kernel designs, which have been shown to improve the performance of SISR models while reducing their computational requirements. However, current state-of-the-art (SOTA) models still face problems such as high computational costs. To address…

    Submitted 19 July, 2024; originally announced July 2024.

    Comments: Accepted to CVPR workshop 2023

  19. arXiv:2407.13782  [pdf, other]

    eess.AS cs.AI cs.SD

    Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition

    Authors: Shujie Hu, Xurong Xie, Mengzhe Geng, Zengrui Jin, Jiajun Deng, Guinan Li, Yi Wang, Mingyu Cui, Tianzi Wang, Helen Meng, Xunying Liu

    Abstract: Self-supervised learning (SSL) based speech foundation models have been applied to a wide range of ASR tasks. However, their application to dysarthric and elderly speech via data-intensive parameter fine-tuning is confronted by in-domain data scarcity and mismatch. To this end, this paper explores a series of approaches to integrate domain fine-tuned SSL pre-trained models and their features into…

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: IEEE/ACM Transactions on Audio, Speech, and Language Processing

  20. arXiv:2407.13509  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models

    Authors: Weiqin Li, Peiji Yang, Yicheng Zhong, Yixuan Zhou, Zhisheng Wang, Zhiyong Wu, Xixin Wu, Helen Meng

    Abstract: Spontaneous style speech synthesis, which aims to generate human-like speech, often encounters challenges due to the scarcity of high-quality data and limitations in model capabilities. Recent language model-based TTS systems can be trained on large, diverse, and low-quality speech datasets, resulting in highly natural synthesized speech. However, they are limited by the difficulty of simulating v…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: Accepted by INTERSPEECH 2024

  21. arXiv:2407.10376  [pdf, other]

    q-bio.NC cs.CL

    Large Language Model-based FMRI Encoding of Language Functions for Subjects with Neurocognitive Disorder

    Authors: Yuejiao Wang, Xianmin Gong, Lingwei Meng, Xixin Wu, Helen Meng

    Abstract: Functional magnetic resonance imaging (fMRI) is essential for developing encoding models that identify functional changes in language-related brain areas of individuals with Neurocognitive Disorders (NCD). While large language model (LLM)-based fMRI encoding has shown promise, existing studies predominantly focus on healthy, young adults, overlooking older NCD populations and cognitive level corre…

    Submitted 14 July, 2024; originally announced July 2024.

    Comments: 5 pages, accepted by Interspeech 2024

  22. arXiv:2407.09817  [pdf, other]

    cs.SD cs.CL eess.AS

    Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System

    Authors: Lingwei Meng, Jiawen Kang, Yuejiao Wang, Zengrui Jin, Xixin Wu, Xunying Liu, Helen Meng

    Abstract: Multi-talker speech recognition and target-talker speech recognition, both involve transcription in multi-talker contexts, remain significant challenges. However, existing methods rarely attempt to simultaneously address both tasks. In this study, we propose a pioneering approach to empower Whisper, which is a speech foundation model, to tackle joint multi-talker and target-talker speech recogniti…

    Submitted 24 August, 2024; v1 submitted 13 July, 2024; originally announced July 2024.

    Comments: Accepted to INTERSPEECH 2024

  23. arXiv:2407.08551  [pdf, other]

    cs.CL cs.SD eess.AS

    Autoregressive Speech Synthesis without Vector Quantization

    Authors: Lingwei Meng, Long Zhou, Shujie Liu, Sanyuan Chen, Bing Han, Shujie Hu, Yanqing Liu, Jinyu Li, Sheng Zhao, Xixin Wu, Helen Meng, Furu Wei

    Abstract: We present MELLE, a novel continuous-valued tokens based language modeling approach for text to speech synthesis (TTS). MELLE autoregressively generates continuous mel-spectrogram frames directly from text condition, bypassing the need for vector quantization, which are originally designed for audio compression and sacrifice fidelity compared to mel-spectrograms. Specifically, (i) instead of cross…

    Submitted 11 July, 2024; originally announced July 2024.

  24. arXiv:2407.07810  [pdf, other]

    cs.LG cs.AI cs.CL

    Transformer Block Coupling and its Correlation with Generalization in LLMs

    Authors: Murdock Aubry, Haoming Meng, Anton Sugolov, Vardan Papyan

    Abstract: Large Language Models (LLMs) have made significant strides in natural language processing, and a precise understanding of the internal mechanisms driving their success is essential. In this work, we trace the trajectories of individual tokens as they pass through transformer blocks, and linearize the system along these trajectories through their Jacobian matrices. By examining the relationships be…

    Submitted 14 October, 2024; v1 submitted 10 July, 2024; originally announced July 2024.

  25. arXiv:2407.07518  [pdf, other]

    cs.CV

    Multi-modal Crowd Counting via a Broker Modality

    Authors: Haoliang Meng, Xiaopeng Hong, Chenhao Wang, Miao Shang, Wangmeng Zuo

    Abstract: Multi-modal crowd counting involves estimating crowd density from both visual and thermal/depth images. This task is challenging due to the significant gap between these distinct modalities. In this paper, we propose a novel approach by introducing an auxiliary broker modality and on this basis frame the task as a triple-modal learning problem. We devise a fusion-based method to generate this brok…

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: This is the preprint version of the paper and supplemental material to appear in ECCV 2024. Please cite the final published version. Code is available at https://github.com/HenryCilence/Broker-Modality-Crowd-Counting

  26. arXiv:2407.06310  [pdf, other]

    cs.SD cs.AI cs.HC cs.LG eess.AS

    Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation

    Authors: Mengzhe Geng, Xurong Xie, Jiajun Deng, Zengrui Jin, Guinan Li, Tianzi Wang, Shujie Hu, Zhaoqing Li, Helen Meng, Xunying Liu

    Abstract: The application of data-intensive automatic speech recognition (ASR) technologies to dysarthric and elderly adult speech is confronted by their mismatch against healthy and nonaged voices, data scarcity and large speaker-level variability. To this end, this paper proposes two novel data-efficient methods to learn homogeneous dysarthric and elderly speaker-level features for rapid, on-the-fly test-…

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: In submission to IEEE/ACM Transactions on Audio, Speech, and Language Processing

  27. arXiv:2407.01850  [pdf, other]

    cs.CL

    Purple-teaming LLMs with Adversarial Defender Training

    Authors: Jingyan Zhou, Kun Li, Junan Li, Jiawen Kang, Minda Hu, Xixin Wu, Helen Meng

    Abstract: Existing efforts in safeguarding LLMs are limited in actively exposing the vulnerabilities of the target LLM and readily adapting to newly emerging safety risks. To address this, we present Purple-teaming LLMs with Adversarial Defender training (PAD), a pipeline designed to safeguard LLMs by novelly incorporating the red-teaming (attack) and blue-teaming (safety training) techniques. In PAD, we au…

    Submitted 1 July, 2024; originally announced July 2024.

  28. arXiv:2406.14092  [pdf, other]

    cs.CL eess.AS

    Seamless Language Expansion: Enhancing Multilingual Mastery in Self-Supervised Models

    Authors: Jing Xu, Minglin Wu, Xixin Wu, Helen Meng

    Abstract: Self-supervised (SSL) models have shown great performance in various downstream tasks. However, they are typically developed for limited languages, and may encounter new languages in real-world. Developing a SSL model for each new language is costly. Thus, it is vital to figure out how to efficiently adapt existed SSL models to a new language without impairing its original abilities. We propose ad…

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  29. arXiv:2406.13963  [pdf, ps, other]

    cs.CV

    SSAD: Self-supervised Auxiliary Detection Framework for Panoramic X-ray based Dental Disease Diagnosis

    Authors: Zijian Cai, Xinquan Yang, Xuguang Li, Xiaoling Luo, Xuechen Li, Linlin Shen, He Meng, Yongqiang Deng

    Abstract: Panoramic X-ray is a simple and effective tool for diagnosing dental diseases in clinical practice. When deep learning models are developed to assist dentist in interpreting panoramic X-rays, most of their performance suffers from the limited annotated data, which requires dentist's expertise and a lot of time cost. Although self-supervised learning (SSL) has been proposed to address this challeng…

    Submitted 19 June, 2024; originally announced June 2024.

  30. arXiv:2406.12236  [pdf, other]

    eess.AS cs.SD eess.SP

    Binaural Selective Attention Model for Target Speaker Extraction

    Authors: Hanyu Meng, Qiquan Zhang, Xiangyu Zhang, Vidhyasaharan Sethu, Eliathamby Ambikairajah

    Abstract: The remarkable ability of humans to selectively focus on a target speaker in cocktail party scenarios is facilitated by binaural audio processing. In this paper, we present a binaural time-domain Target Speaker Extraction model based on the Filter-and-Sum Network (FaSNet). Inspired by human selective hearing, our proposed model introduces target speaker embedding into separators using a multi-head…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH2024

  31. arXiv:2406.10991  [pdf, other]

    cs.CL

    Adaptive Query Rewriting: Aligning Rewriters through Marginal Probability of Conversational Answers

    Authors: Tianhua Zhang, Kun Li, Hongyin Luo, Xixin Wu, James Glass, Helen Meng

    Abstract: Query rewriting is a crucial technique for passage retrieval in open-domain conversational question answering (CQA). It decontexualizes conversational queries into self-contained questions suitable for off-the-shelf retrievers. Existing methods attempt to incorporate retriever's preference during the training of rewriting models. However, these approaches typically rely on extensive annotations su…

    Submitted 16 June, 2024; originally announced June 2024.

  32. arXiv:2406.10152  [pdf, other]

    cs.SD eess.AS

    Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and Recognition

    Authors: Guinan Li, Jiajun Deng, Youjun Chen, Mengzhe Geng, Shujie Hu, Zhe Li, Zengrui Jin, Tianzi Wang, Xurong Xie, Helen Meng, Xunying Liu

    Abstract: This paper proposes joint speaker feature learning methods for zero-shot adaptation of audio-visual multichannel speech separation and recognition systems. xVector and ECAPA-TDNN speaker encoders are connected using purpose-built fusion blocks and tightly integrated with the complete system training. Experiments conducted on LRS3-TED data simulated multichannel overlapped speech suggest that joint…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  33. arXiv:2406.10056  [pdf, other]

    cs.SD eess.AS

    UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner

    Authors: Dongchao Yang, Haohan Guo, Yuanyuan Wang, Rongjie Huang, Xiang Li, Xu Tan, Xixin Wu, Helen Meng

    Abstract: The Large Language models (LLMs) have demonstrated supreme capabilities in text understanding and generation, but cannot be directly applied to cross-modal tasks without fine-tuning. This paper proposes a cross-modal in-context learning approach, empowering the frozen LLMs to achieve multiple audio tasks in a few-shot style without any parameter update. Specifically, we propose a novel and LLMs-dr…

    Submitted 14 June, 2024; originally announced June 2024.

  34. arXiv:2406.10034  [pdf, other]

    cs.SD cs.AI eess.AS

    Towards Effective and Efficient Non-autoregressive Decoding Using Block-based Attention Mask

    Authors: Tianzi Wang, Xurong Xie, Zhaoqing Li, Shoukang Hu, Zengrui Jin, Jiajun Deng, Mingyu Cui, Shujie Hu, Mengzhe Geng, Guinan Li, Helen Meng, Xunying Liu

    Abstract: This paper proposes a novel non-autoregressive (NAR) block-based Attention Mask Decoder (AMD) that flexibly balances performance-efficiency trade-offs for Conformer ASR systems. AMD performs parallel NAR inference within contiguous blocks of output labels that are concealed using attention masks, while conducting left-to-right AR prediction and history context amalgamation between blocks. A beam s…

    Submitted 30 August, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

    Comments: 5 pages, 2 figures, 2 tables, Interspeech24 conference

  35. arXiv:2406.08336  [pdf, other]

    cs.SD cs.CV eess.AS

    CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction

    Authors: Xueyuan Chen, Dongchao Yang, Dingdong Wang, Xixin Wu, Zhiyong Wu, Helen Meng

    Abstract: Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech. It still suffers from low speaker similarity and poor prosody naturalness. In this paper, we propose a multi-modal DSR model by leveraging neural codec language modeling to improve the reconstruction results, especially for the speaker similarity and prosody naturalness. Our proposed model consists of: (…

    Submitted 24 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  36. arXiv:2406.06326  [pdf, other]

    cs.CL

    Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching

    Authors: Xiaoying Zhang, Baolin Peng, Ye Tian, Jingyan Zhou, Yipeng Zhang, Haitao Mi, Helen Meng

    Abstract: Large language models (LLMs) often struggle to provide up-to-date information due to their one-time training and the constantly evolving nature of the world. To keep LLMs current, existing approaches typically involve continued pre-training on new documents. However, they frequently face difficulties in extracting stored knowledge. Motivated by the remarkable success of the Feynman Technique in ef…

    Submitted 15 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

    Comments: 30 pages

  37. arXiv:2406.05358  [pdf, other]

    cs.LG math.OC

    Reinforcement Learning for Intensity Control: An Application to Choice-Based Network Revenue Management

    Authors: Huiling Meng, Ningyuan Chen, Xuefeng Gao

    Abstract: Intensity control is a type of continuous-time dynamic optimization problems with many important applications in Operations Research including queueing and revenue management. In this study, we adapt the reinforcement learning framework to intensity control using choice-based network revenue management as a case study, which is a classical problem in revenue management that features a large state…

    Submitted 8 June, 2024; originally announced June 2024.

  38. arXiv:2406.02940  [pdf, other]

    cs.SD eess.AS

    Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder

    Authors: Haohan Guo, Fenglong Xie, Dongchao Yang, Hui Lu, Xixin Wu, Helen Meng

    Abstract: VQ-VAE, as a mainstream approach of speech tokenizer, has been troubled by "index collapse", where only a small number of codewords are activated in large codebooks. This work proposes product-quantized (PQ) VAE with more codebooks but fewer codewords to address this problem and build large-codebook speech tokenizers. It encodes speech features into multiple VQ subspaces and composes them into c…

    Submitted 5 June, 2024; originally announced June 2024.

  39. arXiv:2406.02328  [pdf, other]

    cs.SD eess.AS

    SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models

    Authors: Dongchao Yang, Dingdong Wang, Haohan Guo, Xueyuan Chen, Xixin Wu, Helen Meng

    Abstract: In this study, we propose a simple and efficient Non-Autoregressive (NAR) text-to-speech (TTS) system based on diffusion, named SimpleSpeech. Its simpleness shows in three aspects: (1) It can be trained on the speech-only dataset, without any alignment information; (2) It directly takes plain text as input and generates speech through an NAR way; (3) It tries to model speech in a finite and compac…

    Submitted 14 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: Accepted by InterSpeech 2024

  40. arXiv:2405.16090  [pdf, other]

    cs.HC eess.SP

    EEG-DBNet: A Dual-Branch Network for Temporal-Spectral Decoding in Motor-Imagery Brain-Computer Interfaces

    Authors: Xicheng Lou, Xinwei Li, Hongying Meng, Jun Hu, Meili Xu, Yue Zhao, Jiazhang Yang, Zhangyong Li

    Abstract: Motor imagery electroencephalogram (EEG)-based brain-computer interfaces (BCIs) offer significant advantages for individuals with restricted limb mobility. However, challenges such as low signal-to-noise ratio and limited spatial resolution impede accurate feature extraction from EEG signals, thereby affecting the classification accuracy of different actions. To address these challenges, this stud…

    Submitted 19 June, 2024; v1 submitted 25 May, 2024; originally announced May 2024.

  41. arXiv:2405.05758  [pdf, other]

    cs.HC cs.CL cs.CY

    Exploring the Potential of Human-LLM Synergy in Advancing Qualitative Analysis: A Case Study on Mental-Illness Stigma

    Authors: Han Meng, Yitian Yang, Yunan Li, Jungup Lee, Yi-Chieh Lee

    Abstract: Qualitative analysis is a challenging, yet crucial aspect of advancing research in the field of Human-Computer Interaction (HCI). Recent studies show that large language models (LLMs) can perform qualitative coding within existing schemes, but their potential for collaborative human-LLM discovery and new insight generation in qualitative analysis is still underexplored. To bridge this gap and adva…

    Submitted 9 May, 2024; originally announced May 2024.

    Comments: 55 pages

  42. arXiv:2405.00074  [pdf, other]

    cs.LG cs.SE

    PAODING: A High-fidelity Data-free Pruning Toolkit for Debloating Pre-trained Neural Networks

    Authors: Mark Huasong Meng, Hao Guan, Liuhuo Wan, Sin Gee Teo, Guangdong Bai, Jin Song Dong

    Abstract: We present PAODING, a toolkit to debloat pretrained neural network models through the lens of data-free pruning. To preserve the model fidelity, PAODING adopts an iterative process, which dynamically measures the effect of deleting a neuron to identify candidates that have the least impact to the output layer. Our evaluation shows that PAODING can significantly reduce the model size, generalize on…

    Submitted 30 April, 2024; originally announced May 2024.

    Comments: 3 pages

  43. arXiv:2404.09011  [pdf, other]

    cs.CV cs.LG

    PracticalDG: Perturbation Distillation on Vision-Language Models for Hybrid Domain Generalization

    Authors: Zining Chen, Weiqiu Wang, Zhicheng Zhao, Fei Su, Aidong Men, Hongying Meng

    Abstract: Domain Generalization (DG) aims to resolve distribution shifts between source and target domains, and current DG methods are default to the setting that data from source and target domains share identical categories. Nevertheless, there exists unseen classes from target domains in practical scenarios. To address this issue, Open Set Domain Generalization (OSDG) has emerged and several methods have…

    Submitted 13 April, 2024; originally announced April 2024.

    Comments: Accepted to CVPR2024

  44. arXiv:2404.06702  [pdf, other]

    eess.AS cs.SD eess.SP

    What is Learnt by the LEArnable Front-end (LEAF)? Adapting Per-Channel Energy Normalisation (PCEN) to Noisy Conditions

    Authors: Hanyu Meng, Vidhyasaharan Sethu, Eliathamby Ambikairajah

    Abstract: There is increasing interest in the use of the LEArnable Front-end (LEAF) in a variety of speech processing systems. However, there is a dearth of analyses of what is actually learnt and the relative importance of training the different components of the front-end. In this paper, we investigate this question on keyword spotting, speech-based emotion recognition and language identification tasks an…

    Submitted 9 April, 2024; originally announced April 2024.

    Comments: Interspeech 2023 Proceeding

    Journal ref: Interspeech 2023

  45. arXiv:2403.16078  [pdf, other]

    cs.SD eess.AS

    Target Speech Extraction with Pre-trained AV-HuBERT and Mask-And-Recover Strategy

    Authors: Wenxuan Wu, Xueyuan Chen, Xixin Wu, Haizhou Li, Helen Meng

    Abstract: Audio-visual target speech extraction (AV-TSE) is one of the enabling technologies in robotics and many audio-visual applications. One of the challenges of AV-TSE is how to effectively utilize audio-visual synchronization information in the process. AV-HuBERT can be a useful pre-trained model for lip-reading, which has not been adopted by AV-TSE. In this paper, we would like to explore the way to…

    Submitted 24 March, 2024; originally announced March 2024.

    Comments: Accepted by IJCNN 2024

  46. arXiv:2403.09326  [pdf, other]

    cs.GR cs.AI

    HeadEvolver: Text to Head Avatars via Expressive and Attribute-Preserving Mesh Deformation

    Authors: Duotun Wang, Hengyu Meng, Zeyu Cai, Zhijing Shao, Qianxi Liu, Lin Wang, Mingming Fan, Xiaohang Zhan, Zeyu Wang

    Abstract: Current text-to-avatar methods often rely on implicit representations (e.g., NeRF, SDF, and DMTet), leading to 3D content that artists cannot easily edit and animate in graphics software. This paper introduces a novel framework for generating stylized head avatars from text guidance, which leverages locally learnable mesh deformation and 2D diffusion priors to achieve high-quality digital assets f…

    Submitted 20 September, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

    Comments: 13 pages, 20 figures

    ACM Class: I.2.6; I.3.8

  47. arXiv:2403.05834  [pdf, other]

    cs.MM cs.SD eess.AS

    Enhancing Expressiveness in Dance Generation via Integrating Frequency and Music Style Information

    Authors: Qiaochu Huang, Xu He, Boshi Tang, Haolin Zhuang, Liyang Chen, Shuochen Gao, Zhiyong Wu, Haozhi Huang, Helen Meng

    Abstract: Dance generation, as a branch of human motion generation, has attracted increasing attention. Recently, a few works attempt to enhance dance expressiveness, which includes genre matching, beat alignment, and dance dynamics, from certain aspects. However, the enhancement is quite limited as they lack comprehensive consideration of the aforementioned three factors. In this paper, we propose Expressi…

    Submitted 9 March, 2024; originally announced March 2024.

  48. arXiv:2403.00606  [pdf, other]

    cs.CV

    Flattening Singular Values of Factorized Convolution for Medical Images

    Authors: Zexin Feng, Na Zeng, Jiansheng Fang, Xingyue Wang, Xiaoxi Lu, Heng Meng, Jiang Liu

    Abstract: Convolutional neural networks (CNNs) have long been the paradigm of choice for robust medical image processing (MIP). Therefore, it is crucial to effectively and efficiently deploy CNNs on devices with different computing capabilities to support computer-aided diagnosis. Many methods employ factorized convolutional layers to alleviate the burden of limited computational resources at the expense of…

    Submitted 1 March, 2024; originally announced March 2024.

  49. arXiv:2402.10642  [pdf, other]

    eess.AS cs.AI

    Speaking in Wavelet Domain: A Simple and Efficient Approach to Speed up Speech Diffusion Model

    Authors: Xiangyu Zhang, Daijiao Liu, Hexin Liu, Qiquan Zhang, Hanyu Meng, Leibny Paola Garcia, Eng Siong Chng, Lina Yao

    Abstract: Recently, Denoising Diffusion Probabilistic Models (DDPMs) have attained leading performances across a diverse range of generative tasks. However, in the field of speech synthesis, although DDPMs exhibit impressive performance, their long training duration and substantial inference costs hinder practical deployment. Existing approaches primarily focus on enhancing inference speed, while approaches…

    Submitted 23 September, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

  50. arXiv:2402.09267  [pdf, other]

    cs.CL cs.AI

    Self-Alignment for Factuality: Mitigating Hallucinations in LLMs via Self-Evaluation

    Authors: Xiaoying Zhang, Baolin Peng, Ye Tian, Jingyan Zhou, Lifeng Jin, Linfeng Song, Haitao Mi, Helen Meng

    Abstract: Despite showing increasingly human-like abilities, large language models (LLMs) often struggle with factual inaccuracies, i.e. "hallucinations", even when they hold relevant knowledge. To address these hallucinations, current approaches typically necessitate high-quality human factuality annotations. In this work, we explore Self-Alignment for Factuality, where we leverage the self-evaluation capa…

    Submitted 11 June, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

    Comments: 20 pages

    Journal ref: ACL2024 Main