Search | arXiv e-print repository

Movie Gen: A Cast of Media Foundation Models

Authors: Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, et al. (63 additional authors not shown)

Abstract: We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user's image. Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. Our largest video generation model is a 30B parameter transformer trained with a maximum context length of 73K video tokens, corresponding to a generated video of 16 seconds at 16 frames-per-second. We show multiple technical innovations and simplifications on the architecture, latent spaces, training objectives and recipes, data curation, evaluation protocols, parallelization techniques, and inference optimizations that allow us to reap the benefits of scaling pre-training data, model size, and training compute for training large scale media generation models. We hope this paper helps the research community to accelerate progress and innovation in media generation models. All videos from this paper are available at https://go.fb.me/MovieGenResearchVideos.

Submitted 17 October, 2024; originally announced October 2024.

arXiv:2410.08135 [pdf, other]

State Feedback System Level Synthesis in Continuous Time

Authors: Yaozhi Du, Jing Shuang Li

Abstract: System level synthesis (SLS) is a controller parameterization technique that facilitates distributed structured control via convex techniques. Results on SLS are primarily in the discrete-time setting; this paper extends SLS to the continuous-time setting. We translate the parametrization and associated constraints to continuous time, and propose a controller design procedure consisting of two steps: (1) pole selection and (2) optimization over closed-loops. We provide SLS reformulations of H2 and Hinf control, and show that the proposed procedure allows for convex design of structured H2 and Hinf controllers. We verify our methods in simulation on a grid of linearized swing equations. The resulting structured (i.e. sparse) controllers perform similarly to the centralized (i.e. dense) controllers, in some cases within 1% cost. The proposed procedure preserves the scalability and disturbance-rejection features of the original discrete-time SLS framework.

Submitted 10 October, 2024; originally announced October 2024.

Comments: 8 pages, 6 figures, conference

arXiv:2410.04261 [pdf, other]

Compositional Diffusion Models for Powered Descent Trajectory Generation with Flexible Constraints

Authors: Julia Briden, Yilun Du, Enrico M. Zucchelli, Richard Linares

Abstract: This work introduces TrajDiffuser, a compositional diffusion-based flexible and concurrent trajectory generator for 6 degrees of freedom powered descent guidance. TrajDiffuser is a statistical model that learns the multi-modal distributions of a dataset of simulated optimal trajectories, each subject to only one or a few constraints that may vary for different trajectories. During inference, the trajectory is generated simultaneously over time, providing stable long-horizon planning, and constraints can be composed together, increasing the model's generalizability and decreasing the training data required. The generated trajectory is then used to initialize an optimizer, increasing its robustness and speed.

Submitted 5 October, 2024; originally announced October 2024.

Comments: Full manuscript submitted to IEEE Aerospace 2025 on 4-Oct-2024

arXiv:2409.15911 [pdf, other]

A Modular-based Strategy for Mitigating Gradient Conflicts in Simultaneous Speech Translation

Authors: Xiaoqian Liu, Yangfan Du, Jianjin Wang, Yuan Ge, Chen Xu, Tong Xiao, Guocheng Chen, Jingbo Zhu

Abstract: Simultaneous Speech Translation (SimulST) involves generating target language text while continuously processing streaming speech input, presenting significant real-time challenges. Multi-task learning is often employed to enhance SimulST performance but introduces optimization conflicts between primary and auxiliary tasks, potentially compromising overall efficiency. The existing model-level conflict resolution methods are not well suited to this task, which exacerbates inefficiencies and leads to high GPU memory consumption. To address these challenges, we propose a Modular Gradient Conflict Mitigation (MGCM) strategy that detects conflicts at a finer-grained modular level and resolves them utilizing gradient projection. Experimental results demonstrate that MGCM significantly improves SimulST performance, particularly under medium and high latency conditions, achieving a 0.68 BLEU score gain in offline tasks. Additionally, MGCM reduces GPU memory consumption by over 95% compared to other conflict mitigation methods, establishing it as a robust solution for SimulST tasks.

Submitted 24 September, 2024; originally announced September 2024.

arXiv:2409.14739 [pdf, other]

AmpAgent: An LLM-based Multi-Agent System for Multi-stage Amplifier Schematic Design from Literature for Process and Performance Porting

Authors: Chengjie Liu, Weiyu Chen, Anlan Peng, Yuan Du, Li Du, Jun Yang

Abstract: Multi-stage amplifiers are widely applied in analog circuits. However, their large number of components, complex transfer functions, and intricate pole-zero distributions necessitate extensive manpower for derivation and parameter sizing to ensure their stability. In order to achieve efficient derivation of the transfer function and reduce the difficulty of circuit design, we propose AmpAgent: a multi-agent system based on large language models (LLMs) for efficiently designing such complex amplifiers from literature with process and performance porting. AmpAgent is composed of three agents: Literature Analysis Agent, Mathematics Reasoning Agent and Device Sizing Agent. They are separately responsible for retrieving key information (e.g. formulas and transfer functions) from the literature, decomposing the whole circuit's design problem by deriving the key formulas, and addressing the decomposed problem iteratively. AmpAgent was employed in the schematic design of seven types of multi-stage amplifiers with different compensation techniques. In terms of design efficiency, AmpAgent has reduced the number of iterations by 1.32$\sim$4$\times$ and execution time by 1.19$\sim$2.99$\times$ compared to conventional optimization algorithms, with a success rate increased by 1.03$\sim$6.79$\times$. In terms of circuit performance, it has improved by 1.63$\sim$27.25$\times$ compared to the original literature. The findings suggest that LLMs could play a crucial role in the field of complex analog circuit schematic design, as well as process and performance porting.

Submitted 23 September, 2024; originally announced September 2024.

arXiv:2409.13863 [pdf, other]

Unsupervised Learning of Multi-modal Affine Registration for PET/CT

Authors: Junyu Chen, Yihao Liu, Shuwen Wei, Aaron Carass, Yong Du

Abstract: Affine registration plays a crucial role in PET/CT imaging, where aligning PET with CT images is challenging due to their respective functional and anatomical representations. Despite the significant promise shown by recent deep learning (DL)-based methods in various medical imaging applications, their application to multi-modal PET/CT affine registration remains relatively unexplored. This study investigates a DL-based approach for PET/CT affine registration. We introduce a novel method using Parzen windowing to approximate the correlation ratio, which acts as the image similarity measure for training DNNs in multi-modal registration. Additionally, we propose a multi-scale, instance-specific optimization scheme that iteratively refines the DNN-generated affine parameters across multiple image resolutions. Our method was evaluated against the widely used mutual information metric and a popular optimization-based technique from the ANTs package, using a large public FDG-PET/CT dataset with synthetic affine transformations. Our approach achieved a mean Dice Similarity Coefficient (DSC) of 0.870, outperforming the compared methods and demonstrating its effectiveness in multi-modal PET/CT image registration.

Submitted 20 September, 2024; originally announced September 2024.

Comments: Accepted by IEEE NSS/MIC/RTSD'24 ((c) IEEE). Code available at https://github.com/junyuchen245/Correlation_Ratio

arXiv:2409.12873 [pdf, other]

doi 10.1109/TSTE.2024.3462476

Reliability-Based Planning of Cable Layout for Offshore Wind Farm Electrical Collector System Considering Post-Fault Network Reconfiguration

Authors: Xiaochi Ding, Yunfei Du, Xinwei Shen, Qiuwei Wu, Xuan Zhang, Nikos D. Hatziargyriou

Abstract: The electrical collector system (ECS) plays a crucial role in determining the performance of offshore wind farms (OWFs). Existing research has predominantly restricted ECS cable layouts to conventional radial or ring structures and employed graph theory heuristics for solutions. However, both economic efficiency and reliability of the OWFs heavily depend on their ECS structure, and the optimal ECS cable layout often deviates from typical configurations. In this context, this paper introduces a novel reliability-based ECS cable layout planning method for large-scale OWFs, employing a two-stage stochastic programming approach to address uncertainties of wind power and contingencies. To enhance reliability, the model incorporates optimal post-fault network reconfiguration strategies by adjusting wind turbine power supply paths through link cables. To tackle computation challenges arising from numerous contingency scenarios, a customized progressive contingency incorporation (CPCI) framework is developed to solve the model with higher efficiency by iteratively identifying non-trivial scenarios and solving the simplified problems. The convergence and optimality are theoretically proven. Numerical tests on several real-world OWFs validate the necessity of fully optimizing ECS structures and demonstrate the efficiency of the CPCI algorithm.

Submitted 19 September, 2024; originally announced September 2024.

Comments: 13 pages

arXiv:2408.06185 [pdf, other]

Hi-SAM: A high-scalable authentication model for satellite-ground Zero-Trust system using mean field game

Authors: Xuesong Wu, Tianshuai Zheng, Runfang Wu, Jie Ren, Junyan Guo, Ye Du

Abstract: As more and more Internet of Things (IoT) devices are connected to satellite networks, the Zero-Trust Architecture brings dynamic security to the satellite-ground system, while frequent authentication creates challenges for system availability. To make the system accommodate more IoT devices, this paper proposes a high-scalable authentication model (Hi-SAM). Hi-SAM introduces the Proof-of-Work idea to authentication, which allows devices to obtain network resources based on frequency. To optimize the frequency, a mean field game is used for competition among devices, which can reduce the decision space of large-scale population games. In addition, a dynamic time-range message authentication code is designed for security. In tests at large population scales, Hi-SAM is superior in the optimization of authentication workload and in anomaly detection efficiency.

Submitted 12 August, 2024; originally announced August 2024.

arXiv:2407.15335 [pdf, other]

Addressing Out-of-Distribution Challenges in Image Semantic Communication Systems with Multi-modal Large Language Models

Authors: Feifan Zhang, Yuyang Du, Kexin Chen, Yulin Shao, Soung Chang Liew

Abstract: Semantic communication is a promising technology for next-generation wireless networks. However, the out-of-distribution (OOD) problem, where a pre-trained machine learning (ML) model is applied to unseen tasks that are outside the distribution of its training data, may compromise the integrity of semantic compression. This paper explores the use of multi-modal large language models (MLLMs) to address the OOD issue in image semantic communication. We propose a novel "Plan A - Plan B" framework that leverages the broad knowledge and strong generalization ability of an MLLM to assist a conventional ML model when the latter encounters an OOD input in the semantic encoding process. Furthermore, we propose a Bayesian optimization scheme that reshapes the probability distribution of the MLLM's inference process based on the contextual information of the image. The optimization scheme significantly enhances the MLLM's performance in semantic compression by 1) filtering out irrelevant vocabulary in the original MLLM output; and 2) using contextual similarities between prospective answers of the MLLM and the background information as prior knowledge to modify the MLLM's probability distribution during inference. Further, at the receiver side of the communication system, we put forth a "generate-criticize" framework that utilizes the cooperation of multiple MLLMs to enhance the reliability of image reconstruction.

Submitted 21 July, 2024; originally announced July 2024.

arXiv:2407.11333 [pdf, other]

Disentangled Acoustic Fields For Multimodal Physical Scene Understanding

Authors: Jie Yin, Andrew Luo, Yilun Du, Anoop Cherian, Tim K. Marks, Jonathan Le Roux, Chuang Gan

Abstract: We study the problem of multimodal physical scene understanding, where an embodied agent needs to find fallen objects by inferring object properties, direction, and distance of an impact sound source. Previous works adopt feed-forward neural networks to directly regress the variables from sound, leading to poor generalization and domain adaptation issues. In this paper, we illustrate that learning a disentangled model of acoustic formation, referred to as disentangled acoustic field (DAF), to capture the sound generation and propagation process, enables the embodied agent to construct a spatial uncertainty map over where the objects may have fallen. We demonstrate that our analysis-by-synthesis framework can jointly infer sound properties by explicitly decomposing and factorizing the latent space of the disentangled model. We further show that the spatial uncertainty map can significantly improve the success rate for the localization of fallen objects by proposing multiple plausible exploration locations.

Submitted 15 July, 2024; originally announced July 2024.

arXiv:2407.04675 [pdf, other]

Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

Authors: Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, Lu Gao, Yi Guo, Minglun Han, Ting Han, Wenchao Hu, Xinying Hu, Yuxiang Hu, Deyu Hua, Lu Huang, Mingkun Huang, Youjia Huang, Jishuo Jin, Fanliu Kong, Zongwei Lan, Tianyu Li, et al. (30 additional authors not shown)

Abstract: Modern automatic speech recognition (ASR) models are required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc.) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data matching scenarios and are gradually approaching a bottleneck. In this work, we introduce Seed-ASR, a large language model (LLM) based speech recognition model. Seed-ASR is developed based on the framework of audio conditioned LLM (AcLLM), leveraging the capabilities of LLMs by inputting continuous speech representations together with contextual information into the LLM. Through stage-wise large-scale training and the elicitation of context-aware capabilities in LLM, Seed-ASR demonstrates significant improvement over end-to-end models on comprehensive evaluation sets, including multiple domains, accents/dialects and languages. Additionally, Seed-ASR can be further deployed to support specific needs in various scenarios without requiring extra language models. Compared to recently released large ASR models, Seed-ASR achieves 10%-40% reduction in word (or character, for Chinese) error rates on Chinese and English public test sets, further demonstrating its powerful performance.

Submitted 10 July, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

arXiv:2407.02913 [pdf, other]

SFC: Achieve Accurate Fast Convolution under Low-precision Arithmetic

Authors: Liulu He, Yufei Zhao, Rui Gao, Yuan Du, Li Du

Abstract: Fast convolution algorithms, including Winograd and FFT, can efficiently accelerate convolution operations in deep models. However, these algorithms depend on high-precision arithmetic to maintain inference accuracy, which conflicts with model quantization. To resolve this conflict and further improve the efficiency of quantized convolution, we propose SFC, a new algebra transform for fast convolution by extending the Discrete Fourier Transform (DFT) with symbolic computing, in which only additions are required to perform the transformation at specific transform points, avoiding the calculation of irrational numbers and reducing the requirement for precision. Additionally, we enhance convolution efficiency by introducing correction terms to convert invalid circular convolution outputs of the Fourier method into effective ones. The numerical error analysis is presented for the first time in this type of work and proves that our algorithms can provide a 3.68x multiplication reduction for 3x3 convolution, while the Winograd algorithm only achieves a 2.25x reduction with similarly low numerical errors. Experiments carried out on benchmarks and FPGA show that our new algorithms can further improve the computation efficiency of quantized models while maintaining accuracy, surpassing both the quantization-alone method and existing works on fast convolution quantization.

Submitted 3 July, 2024; originally announced July 2024.

Comments: ICML 2024

arXiv:2407.00008 [pdf, other]

Spectral Brain Graph Neural Network for Prediction of Anxiety in Children with Autism Spectrum Disorder

Authors: Peiyu Duan, Nicha C. Dvornek, Jiyao Wang, Jeffrey Eilbott, Yuexi Du, Denis G. Sukhodolsky, James S. Duncan

Abstract: Children with Autism Spectrum Disorder (ASD) frequently exhibit comorbid anxiety, which contributes to impairment and requires treatment. Therefore, it is critical to investigate co-occurring autism and anxiety with functional imaging tools to understand the brain mechanisms of this comorbidity. Multidimensional Anxiety Scale for Children, 2nd edition (MASC-2) score is a common tool to evaluate the daily anxiety level in autistic children. Predicting MASC-2 score with Functional Magnetic Resonance Imaging (fMRI) data will help gain more insights into the brain functional networks of children with ASD complicated by anxiety. However, most of the current graph neural network (GNN) studies using fMRI only focus on graph operations but ignore the spectral features. In this paper, we explored the feasibility of using spectral features to predict the MASC-2 total scores. We proposed SpectBGNN, a graph-based network, which uses spectral features and integrates graph spectral filtering layers to extract hidden information. We experimented with multiple spectral analysis algorithms and compared the performance of the SpectBGNN model with CPM, GAT, and BrainGNN on a dataset consisting of 26 typically developing and 70 ASD children with 5-fold cross-validation. We showed that among all spectral analysis algorithms tested, using the Fast Fourier Transform (FFT) or Welch's Power Spectrum Density (PSD) as node features performs significantly better than correlation features, and adding the graph spectral filtering layer significantly increases the network's performance.

Submitted 23 April, 2024; originally announced July 2024.

Comments: ISBI 2024 Oral

arXiv:2406.16754 [pdf, other]

The MRI Scanner as a Diagnostic: Image-less Active Sampling

Authors: Yuning Du, Rohan Dharmakumar, Sotirios A. Tsaftaris

Abstract: Despite the high diagnostic accuracy of Magnetic Resonance Imaging (MRI), using MRI as a Point-of-Care (POC) disease identification tool poses significant accessibility challenges due to the use of high magnetic field strength and lengthy acquisition times. We ask a simple question: Can we dynamically optimise acquired samples, at the patient level, according to an (automated) downstream decision task, while discounting image reconstruction? We propose an ML-based framework that learns an active sampling strategy, via reinforcement learning, at a patient-level to directly infer disease from undersampled k-space. We validate our approach by inferring Meniscus Tear in undersampled knee MRI data, where we achieve diagnostic performance comparable with ML-based diagnosis, using fully sampled k-space data. We analyse task-specific sampling policies, showcasing the adaptability of our active sampling approach. The introduced frugal sampling strategies have the potential to reduce high field strength requirements that in turn strengthen the viability of MRI-based POC disease identification and associated preliminary screening tools.

Submitted 24 June, 2024; originally announced June 2024.

Comments: Accepted in MICCAI 2024

arXiv:2406.11546 [pdf, other]

GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement

Authors: Yifan Yang, Zheshu Song, Jianheng Zhuo, Mingyu Cui, Jinpeng Li, Bo Yang, Yexing Du, Ziyang Ma, Xunying Liu, Ziyuan Wang, Ke Li, Shuai Fan, Kai Yu, Wei-Qiang Zhang, Guoguo Chen, Xie Chen

Abstract: The evolution of speech technology has been spurred by the rapid increase in dataset sizes. Traditional speech models generally depend on a large amount of labeled training data, which is scarce for low-resource languages. This paper presents GigaSpeech 2, a large-scale, multi-domain, multilingual speech recognition corpus. It is designed for low-resource languages and does not rely on paired speech and text data. GigaSpeech 2 comprises about 30,000 hours of automatically transcribed speech, including Thai, Indonesian, and Vietnamese, gathered from unlabeled YouTube videos. We also introduce an automated pipeline for data crawling, transcription, and label refinement. Specifically, this pipeline uses Whisper for initial transcription and TorchAudio for forced alignment, combined with multi-dimensional filtering for data quality assurance. A modified Noisy Student Training is developed to further refine flawed pseudo labels iteratively, thus enhancing model performance. Experimental results on our manually transcribed evaluation set and two public test sets from Common Voice and FLEURS confirm our corpus's high quality and broad applicability. Notably, ASR models trained on GigaSpeech 2 can reduce the word error rate for Thai, Indonesian, and Vietnamese on our challenging and realistic YouTube test set by 25% to 40% compared to the Whisper large-v3 model, with merely 10% of the model parameters. Furthermore, our ASR models trained on GigaSpeech 2 yield superior performance compared to commercial services. We believe that our newly introduced corpus and pipeline will open a new avenue for low-resource speech recognition and significantly facilitate research in this area.

Submitted 17 June, 2024; originally announced June 2024.

Comments: Under review

arXiv:2406.05954 [pdf, other]

Aligning Large Language Models with Representation Editing: A Control Perspective

Authors: Lingkai Kong, Haorui Wang, Wenhao Mu, Yuanqi Du, Yuchen Zhuang, Yifei Zhou, Yue Song, Rongzhi Zhang, Kai Wang, Chao Zhang

Abstract: Aligning large language models (LLMs) with human objectives is crucial for real-world applications. However, fine-tuning LLMs for alignment often suffers from unstable training and requires substantial computing resources. Test-time alignment techniques, such as prompting and guided decoding, do not modify the underlying model, and their performance remains dependent on the original model's capabilities. To address these challenges, we propose aligning LLMs through representation editing. The core of our method is to view a pre-trained autoregressive LLM as a discrete-time stochastic dynamical system. To achieve alignment for specific objectives, we introduce external control signals into the state space of this language dynamical system. We train a value function directly on the hidden states according to the Bellman equation, enabling gradient-based optimization to obtain the optimal control signals at test time. Our experiments demonstrate that our method outperforms existing test-time alignment techniques while requiring significantly fewer resources compared to fine-tuning methods.

Submitted 11 June, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

Comments: fix typos

arXiv:2406.00497 [pdf, ps, other]

Recent Advances in End-to-End Simultaneous Speech Translation

Authors: Xiaoqian Liu, Guoqiang Hu, Yangfan Du, Erfeng He, Yingfeng Luo, Chen Xu, Tong Xiao, Jingbo Zhu

Abstract: Simultaneous speech translation (SimulST) is a demanding task that involves generating translations in real-time while continuously processing speech input. This paper offers a comprehensive overview of the recent developments in SimulST research, focusing on four major challenges. Firstly, the complexities associated with processing lengthy and continuous speech streams pose significant hurdles. Secondly, satisfying real-time requirements presents inherent difficulties due to the need for immediate translation output. Thirdly, striking a balance between translation quality and latency constraints remains a critical challenge. Finally, the scarcity of annotated data adds another layer of complexity to the task. Through our exploration of these challenges and the proposed solutions, we aim to provide valuable insights into the current landscape of SimulST research and suggest promising directions for future exploration.

Submitted 20 August, 2024; v1 submitted 1 June, 2024; originally announced June 2024.

Comments: Accepted by IJCAI 2024

arXiv:2404.08199 [pdf, other]

doi 10.1109/TCSII.2023.3266594

Cepstral Analysis Based Artifact Detection, Recognition and Removal for Prefrontal EEG

Authors: Siqi Han, Chao Zhang, Jiaxin Lei, Qingquan Han, Yuhui Du, Anhe Wang, Shuo Bai, Milin Zhang

Abstract: This paper proposes to use cepstrum for artifact detection, recognition and removal in prefrontal EEG. This work focuses on the artifact caused by eye movement. A database containing artifact-free EEG and eye movement contaminated EEG from different subjects is established. A cepstral analysis-based feature extraction with support vector machine (SVM) based classifier is designed to identify the artifacts from the target EEG signals. The proposed method achieves an accuracy of 99.62% on the artifact detection task and an 82.79% accuracy on the 6-category eye movement classification task. A statistical value-based artifact removal method is proposed and evaluated on a public EEG database, where an accuracy improvement of 3.46% is obtained on the 3-category emotion classification task. In order to make a confident decision of each 5s EEG segment, the algorithm requires only 0.66M multiplication operations. Compared to the state-of-the-art approaches in artifact detection and removal, the proposed method features higher detection accuracy and lower computational cost, which makes it a more suitable solution to be integrated into a real-time and artifact robust Brain-Machine Interface (BMI).

Submitted 11 April, 2024; originally announced April 2024.

Comments: 5 pages, 4 figures, published by TCAS-II

Journal ref: IEEE Transactions on Circuits and Systems II: Express Briefs, 2023
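
The abstract above names the cepstrum as its feature domain but does not spell out the exact features. As a rough illustration only, the sketch below computes a textbook real cepstrum (inverse DFT of the log-magnitude spectrum) with the Python standard library. The helper names `dft` and `real_cepstrum`, the `eps` guard, and the toy echo signal are assumptions made for this example and are not taken from the paper.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stdlib only)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    # The inverse transform carries the 1/n normalisation.
    return [v / n for v in out] if inverse else out


def real_cepstrum(x, eps=1e-12):
    """Real cepstrum: inverse DFT of the log-magnitude spectrum.

    eps is an illustrative guard against log(0); it is not from the paper.
    """
    log_mag = [math.log(abs(v) + eps) for v in dft(x)]
    return [v.real for v in dft(log_mag, inverse=True)]


# Toy signal: an impulse plus a half-amplitude echo 8 samples later.
n, delay = 64, 8
signal = [0.0] * n
signal[0], signal[delay] = 1.0, 0.5
ceps = real_cepstrum(signal)

# Search the first half of the quefrency axis for the strongest peak.
peak = max(range(1, n // 2), key=lambda q: ceps[q])
print("largest cepstral peak at quefrency", peak)
```

A periodic structure in the spectrum, such as the ripple produced by the echo here, shows up as a peak at the matching quefrency, which is why cepstral coefficients can serve as compact features for a downstream classifier.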

arXiv:2404.01082 [pdf, other]

The state-of-the-art in Cardiac MRI Reconstruction: Results of the CMRxRecon Challenge in MICCAI 2023

Authors: Jun Lyu, Chen Qin, Shuo Wang, Fanwen Wang, Yan Li, Zi Wang, Kunyuan Guo, Cheng Ouyang, Michael Tänzer, Meng Liu, Longyu Sun, Mengting Sun, Qin Li, Zhang Shi, Sha Hua, Hao Li, Zhensen Chen, Zhenlin Zhang, Bingyu Xin, Dimitris N. Metaxas, George Yiasemis, Jonas Teuwen, Liping Zhang, Weitian Chen, Yidong Zhao, et al. (25 additional authors not shown)

Abstract: Cardiac MRI, crucial for evaluating heart structure and function, faces limitations like slow imaging and motion artifacts. Undersampling reconstruction, especially data-driven algorithms, has emerged as a promising solution to accelerate scans and enhance imaging performance using highly under-sampled data. Nevertheless, the scarcity of publicly available cardiac k-space datasets and evaluation platforms hinders the development of data-driven reconstruction algorithms. To address this issue, we organized the Cardiac MRI Reconstruction Challenge (CMRxRecon) in 2023, in collaboration with the 26th International Conference on MICCAI. CMRxRecon presented an extensive k-space dataset comprising cine and mapping raw data, accompanied by detailed annotations of cardiac anatomical structures. With overwhelming participation, the challenge attracted more than 285 teams and over 600 participants. Among them, 22 teams successfully submitted Docker containers for the testing phase, with 7 teams submitting for both cine and mapping tasks. All teams use deep learning based approaches, indicating that deep learning has predominantly become a promising solution for the problem. The first-place winner of both tasks utilizes the E2E-VarNet architecture as backbones. In contrast, U-Net is still the most popular backbone for both multi-coil and single-coil reconstructions.
This paper provides a comprehensive overview of the challenge design, presents a summary of the submitted results, reviews the employed methods, and offers an in-depth discussion that aims to inspire future advancements in cardiac MRI reconstruction models. The summary emphasizes the effective strategies observed in Cardiac MRI reconstruction, including backbone architecture, loss function, pre-processing techniques, physical modeling, and model complexity, thereby providing valuable insights for further developments in this field.

Submitted 16 April, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

Comments: 25 pages, 17 figures

arXiv:2403.13148 [pdf, other]

SIFT-DBT: Self-supervised Initialization and Fine-Tuning for Imbalanced Digital Breast Tomosynthesis Image Classification

Authors: Yuexi Du, Regina J. Hooley, John Lewin, Nicha C. Dvornek

Abstract: Digital Breast Tomosynthesis (DBT) is a widely used medical imaging modality for breast cancer screening and diagnosis, offering higher spatial resolution and greater detail through its 3D-like breast volume imaging capability. However, the increased data volume also introduces pronounced data imbalance challenges, where only a small fraction of the volume contains suspicious tissue. This further exacerbates the data imbalance due to the case-level distribution in real-world data and leads to learning a trivial classification model that only predicts the majority class. To address this, we propose a novel method using view-level contrastive Self-supervised Initialization and Fine-Tuning for identifying abnormal DBT images, namely SIFT-DBT. We further introduce a patch-level multi-instance learning method to preserve spatial resolution. The proposed method achieves 92.69% volume-wise AUC on an evaluation of 970 unique studies.

Submitted 19 March, 2024; originally announced March 2024.

Comments: Accepted by IEEE ISBI 2024

arXiv:2403.05111 [pdf, other]

From Registration Uncertainty to Segmentation Uncertainty

Authors: Junyu Chen, Yihao Liu, Shuwen Wei, Zhangxing Bian, Aaron Carass, Yong Du

Abstract: Understanding the uncertainty inherent in deep learning-based image registration models has been an ongoing area of research. Existing methods have been developed to quantify both transformation and appearance uncertainties related to the registration process, elucidating areas where the model may exhibit ambiguity regarding the generated deformation. However, our study reveals that neither uncertainty effectively estimates the potential errors when the registration model is used for label propagation. Here, we propose a novel framework to concurrently estimate both the epistemic and aleatoric segmentation uncertainties for image registration. To this end, we implement a compact deep neural network (DNN) designed to transform the appearance discrepancy in the warping into aleatoric segmentation uncertainty by minimizing a negative log-likelihood loss function. Furthermore, we present epistemic segmentation uncertainty within the label propagation process as the entropy of the propagated labels. By introducing segmentation uncertainty along with existing methods for estimating registration uncertainty, we offer vital insights into the potential uncertainties at different stages of image registration. We validated our proposed framework using publicly available datasets, and the results prove that the segmentation uncertainties estimated with the proposed method correlate well with errors in label propagation, all while achieving superior registration performance.

Submitted 8 March, 2024; originally announced March 2024.

Comments: Accepted by IEEE ISBI'24 ((c) IEEE). Code available at https://bit.ly/42VOZER
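
The abstract above describes epistemic segmentation uncertainty as the entropy of the propagated labels. A minimal sketch of that quantity, assuming each voxel carries a probability vector over classes after warping, might look like the following. The function name `label_entropy` and the three example voxels are hypothetical, and the paper's actual pipeline may differ.

```python
import math


def label_entropy(probs):
    """Shannon entropy (in nats) of one voxel's propagated class probabilities."""
    # Zero-probability classes contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical soft labels for three voxels over three classes after warping.
propagated = [
    [1.0, 0.0, 0.0],        # confidently a single class
    [0.5, 0.5, 0.0],        # ambiguous between two classes
    [1 / 3, 1 / 3, 1 / 3],  # maximally ambiguous
]
uncertainty = [label_entropy(p) for p in propagated]
print(uncertainty)
```

Entropy is zero when the propagated label is one-hot and grows toward log(K) for K equally likely classes, so higher values flag voxels where the propagated labels are least decided.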

arXiv:2403.02565 [pdf, other]

Deep Cooperation in ISAC System: Resource, Node and Infrastructure Perspectives

Authors: Zhiqing Wei, Haotian Liu, Zhiyong Feng, Huici Wu, Fan Liu, Qixun Zhang, Yucong Du

Abstract: With the emerging Integrated Sensing and Communication (ISAC) technique, exploiting the mobile communication system with multi-domain resources, multiple network elements, and large-scale infrastructures to realize cooperative sensing is a crucial approach satisfying the requirements of high-accuracy and large-scale sensing in IoE. In this article, the deep cooperation in ISAC system including three perspectives is investigated. In the microscopic perspective, namely, within a single node, the sensing information carried by time-frequency-space-code domain resources is processed, such as phase compensation, coherent accumulation and other operations, thereby improving the sensing accuracy. In the mesoscopic perspective, the sensing accuracy could be improved through the cooperation of multiple nodes. We explore various multi-node cooperative sensing scenarios and present the corresponding challenges and future research trends. In the macroscopic perspective, the massive number of infrastructures from the same operator or different operators could perform cooperative sensing to extend the sensing coverage and improve the sensing continuity. We investigate network architecture, target tracking methods, and the large-scale sensing assisted digital twin construction. Simulation results demonstrate the superiority of multi-nodes and multi-resources cooperative sensing over single resource or node sensing.
This article may provide a deep and comprehensive view on the cooperative sensing in ISAC system to enhance the performance of sensing, supporting the applications of IoE.

Submitted 2 September, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

Comments: 8 pages and 6 figures, Accepted by IEEE Internet of Things Magazine

arXiv:2402.05390 [pdf, other]

Integrated Sensing and Communication Driven Digital Twin for Intelligent Machine Network

Authors: Zhiqing Wei, Yucong Du, Qixun Zhang, Wangjun Jiang, Yanpeng Cui, Zeyang Meng, Huici Wu, Zhiyong Feng

Abstract: Intelligent machines (IMs), including industrial machines, unmanned aerial vehicles (UAVs), and unmanned vehicles, etc., could perform effective cooperation in complex environment when they form IM network. The efficient environment sensing and communication are crucial for IM network, enabling the real-time and stable control of IMs. With the emergence of integrated sensing and communication (ISAC) technology, IM network is empowered with ubiquitous sensing capabilities, which is helpful in improving the efficiency of communication and sensing with the mutual benefit of them. However, the massive amount of sensing information brings challenges for the processing, storage and application of sensing information. In this article, ISAC driven digital twin (DT) is proposed for IM network, and the architecture and enabling technologies are revealed. ISAC driven DT structurally stores the sensing information, which is further applied to optimize communication, networking and control schemes of IMs, promoting the widespread applications of IMs.

Submitted 7 February, 2024; originally announced February 2024.

Comments: 9 pages, 5 figures, 1 Table

ACM Class: C.2.1

arXiv:2402.02694 [pdf, other]

Description on IEEE ICME 2024 Grand Challenge: Semi-supervised Acoustic Scene Classification under Domain Shift

Authors: Jisheng Bai, Mou Wang, Haohe Liu, Han Yin, Yafei Jia, Siwei Huang, Yutong Du, Dongzhe Zhang, Dongyuan Shi, Woon-Seng Gan, Mark D. Plumbley, Susanto Rahardja, Bin Xiang, Jianfeng Chen

Abstract: Acoustic scene classification (ASC) is a crucial research problem in computational auditory scene analysis, and it aims to recognize the unique acoustic characteristics of an environment. One of the challenges of the ASC task is the domain shift between training and testing data. Since 2018, ASC challenges have focused on the generalization of ASC models across different recording devices. Although this task, in recent years, has achieved substantial progress in device generalization, the challenge of domain shift between different geographical regions, involving discrepancies such as time, space, culture, and language, remains insufficiently explored at present. In addition, considering the abundance of unlabeled acoustic scene data in the real world, it is important to study the possible ways to utilize these unlabelled data. Therefore, we introduce the task Semi-supervised Acoustic Scene Classification under Domain Shift in the ICME 2024 Grand Challenge. We encourage participants to innovate with semi-supervised learning techniques, aiming to develop more robust ASC models under domain shift.

Submitted 28 February, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

arXiv:2401.10070 [pdf, other]

Communication-Efficient Personalized Federated Learning for Speech-to-Text Tasks

Authors: Yichao Du, Zhirui Zhang, Linan Yue, Xu Huang, Yuqing Zhang, Tong Xu, Linli Xu, Enhong Chen

Abstract: To protect privacy and meet legal regulations, federated learning (FL) has gained significant attention for training speech-to-text (S2T) systems, including automatic speech recognition (ASR) and speech translation (ST). However, the commonly used FL approach (i.e., FedAvg) in S2T tasks typically suffers from extensive communication overhead due to multi-round interactions based on the whole model and performance degradation caused by data heterogeneity among clients. To address these issues, we propose a personalized federated S2T framework that introduces FedLoRA, a lightweight LoRA module for client-side tuning and interaction with the server to minimize communication overhead, and FedMem, a global model equipped with a $k$-nearest-neighbor ($k$NN) classifier that captures client-specific distributional shifts to achieve personalization and overcome data heterogeneity. Extensive experiments based on Conformer and Whisper backbone models on CoVoST and GigaSpeech benchmarks show that our approach significantly reduces the communication overhead on all S2T tasks and effectively personalizes the global model to overcome data heterogeneity.

Submitted 18 January, 2024; originally announced January 2024.

Comments: ICASSP 2024

arXiv:2312.15538 [pdf, other]

doi 10.1109/TVT.2024.3433028

Exploiting Multipath Information for Integrated Localization and Sensing via PHD Filtering

Authors: Yinuo Du, Hanying Zhao, Yang Liu, Xinlei Yu, Yuan Shen

Abstract: Accurate localization and perception are pivotal for enhancing the safety and reliability of vehicles. However, current localization methods suffer from reduced accuracy when the line-of-sight (LOS) path is obstructed, or a combination of reflections and scatterings is present. In this paper, we present an integrated localization and sensing method that delivers superior performance in complex environments while being computationally efficient. Our method uniformly leverages various types of multipath components (MPCs) through the lens of random finite sets (RFSs), encompassing reflections, scatterings, and their combinations. This advancement eliminates the need for the multipath identification step and streamlines the filtering process by removing the necessity for distinct filters for different multipath types, a requirement that was critical in previous research. The simulation results demonstrate the superior performance of our method in both robustness and effectiveness, particularly in complex environments where the LOS MPC is obscured and in situations involving clutter and missed detection of MPC measurements.

Submitted 15 August, 2024; v1 submitted 24 December, 2023; originally announced December 2023.

Comments: 6 pages, 6 figures. This work has been accepted and published by the IEEE Transactions on Vehicular Technology (2024)

arXiv:2311.15069 [pdf, ps, other]

Multiuser Beamforming for Partially-Connected Millimeter Wave Massive MIMO

Authors: Chenhao Qi, Jinlin Hu, Yang Du, Arumugam Nallanathan

Abstract: Multiuser beamforming is considered for partially-connected millimeter wave massive MIMO systems. Based on perfect channel state information (CSI), a low-complexity hybrid beamforming scheme that decouples the analog beamformer and the digital beamformer is proposed to maximize the sum-rate. The analog beamformer design is modeled as a phase alignment problem to harvest the array gain. Given the analog beamformer, the digital beamformer is designed by solving a weighted minimum mean squared error problem. Then based on imperfect CSI, an analog-only beamformer design scheme is proposed, where the design problem aims at maximizing the desired signal power on the current user and minimizing the power on the other users to mitigate the multiuser interference. The original problem is then transformed into a series of independent beam nulling subproblems, where an efficient iterative algorithm using the majorization-minimization framework is proposed to solve the subproblems. Simulation results show that, under perfect CSI, the proposed scheme achieves almost the same sum-rate performance as the existing schemes but with lower computational complexity; and under imperfect CSI, the proposed analog-only beamforming design scheme can effectively mitigate the multiuser interference.

Submitted 25 November, 2023; originally announced November 2023.

arXiv:2311.13785 [pdf, other]

Federated Learning Assisted Distributed Energy Optimization

Authors: Yuhan Du, Nuno Mendes, Simin Rasouli, Javad Mohammadi, Pedro Moura

Abstract: The increased penetration of distributed energy resources and the adoption of sensing and control technologies are driving the transition from our current centralized electric grid to a distributed system controlled by multiple entities (agents). The Transactive Energy Community (TEC) serves as an established example of this transition. Distributed energy management approaches can effectively address the scalability, resilience, and privacy requirements of the evolving grid. In this context, the accuracy of agents' estimations becomes crucial for the performance of distributed and multi-agent decision-making paradigms. This paper specifically focuses on integrating Federated Learning (FL) with the multi-agent energy management procedure. FL is utilized to forecast agents' local energy generation and demand, aiming to accelerate the convergence of the distributed decision-making process. To enhance energy aggregation in TECs, we propose an FL-assisted distributed Consensus + Innovations approach. The results demonstrate that employing FL significantly reduces errors in predicting net power demand. The improved forecast accuracy, in turn, introduces less error in the distributed optimization process, thereby enhancing its convergence behavior.

Submitted 22 November, 2023; originally announced November 2023.

Comments: 14 pages, 14 figures, submitted for journal IET Renewable Power Generation

arXiv:2311.12190 [pdf, other]

Equitable Coordination in Multi-agent Power Systems: Impacts of Computation Granularity

Authors: Yuhan Du, Javad Mohammadi

Abstract: The growing integration of distributed energy resources drives the centralized power system towards a decentralized multi-agent network. Operating multi-agent networks significantly relies on inter-agent communications. Computation granularity in this context refers to the number of nodes overseen by an agent. The impact of granularity on equitable power coordination, particularly among marginalized customers with limited communication bandwidth (e.g., intermittent internet connectivity), is not well studied. This work explores different levels of computation granularity for agent-based energy dispatch and studies their impact on equitable coordination. We will leverage the Consensus + Innovations approach to model the equitable coordination of a multi-agent power system.

Submitted 20 November, 2023; originally announced November 2023.

Comments: 5 pages, 10 figures, submitted for 2024 IEEE Power & Energy Society General Meeting

arXiv:2311.08585 [pdf, other]

Unsupervised segmentation of irradiation-induced order-disorder phase transitions in electron microscopy

Authors: Arman H Ter-Petrosyan, Jenna A Bilbrey, Christina M Doty, Bethany E Matthews, Le Wang, Yingge Du, Eric Lang, Khalid Hattar, Steven R Spurgeon

Abstract: We present a method for the unsupervised segmentation of electron microscopy images, which are powerful descriptors of materials and chemical systems. Images are oversegmented into overlapping chips, and similarity graphs are generated from embeddings extracted from a domain-pretrained convolutional neural network (CNN). The Louvain method for community detection is then applied to perform segmentation. The graph representation provides an intuitive way of presenting the relationship between chips and communities. We demonstrate our method to track irradiation-induced amorphous fronts in thin films used for catalysis and electronics. This method has potential for "on-the-fly" segmentation to guide emerging automated electron microscopes.

Submitted 14 November, 2023; originally announced November 2023.

Comments: 7 pages, 3 figures. Accepted to Machine Learning and the Physical Sciences Workshop, NeurIPS 2023

arXiv:2310.00593 [pdf, other]

Nonlinear Multi-Carrier System with Signal Clipping: Measurement, Analysis, and Optimization

Authors: Yuyang Du, Liang Hao, Yiming Lei, Qun Yang, Shiqi Xu

Abstract: Signal clipping is a classic technique for reducing peak-to-average power ratio (PAPR) in orthogonal frequency division multiplexing (OFDM) systems. It has been widely applied in consumer electronic devices owing to its low complexity and high efficiency. Although clipping reduces the nonlinear distortion caused by power amplifiers (PAs), it induces additional clipping distortion. Optimizing the joint system performance with consideration of both PA nonlinearity and clipping distortion remains an open problem due to the complex PA modeling. In this paper, we analyze the PA nonlinearity through the Bessel-Fourier PA (BFPA) model and simplify its power expression using inter-modulation product (IMP) analysis. We derive expressions of the receiver signal-to-noise ratio (SNR) and system symbol error rate (SER) for the nonlinear clipped OFDM system. With the derivations, we investigate the optimal system setting to achieve the SER lower bound in a practical OFDM system that considers both PA nonlinearity and clipping distortion. The methods and results presented in this paper can serve as a useful reference for the system-level optimization of clipped OFDM systems with nonlinear PA.

Submitted 16 February, 2024; v1 submitted 1 October, 2023; originally announced October 2023.
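
To make the clipping step in the abstract above concrete, the sketch below builds one OFDM symbol from random QPSK subcarriers, clips its envelope at a multiple of the RMS amplitude, and compares PAPR before and after. It covers only generic amplitude clipping and does not model the paper's BFPA amplifier analysis. The function names, the 64-subcarrier size, and the clipping ratio of 1.4 are illustrative assumptions.

```python
import cmath
import math
import random


def papr_db(samples):
    """Peak-to-average power ratio of a complex baseband block, in dB."""
    powers = [abs(s) ** 2 for s in samples]
    return 10.0 * math.log10(max(powers) / (sum(powers) / len(powers)))


def clip(samples, clipping_ratio):
    """Amplitude clipping: cap |x| at clipping_ratio * RMS and keep the phase."""
    rms = math.sqrt(sum(abs(s) ** 2 for s in samples) / len(samples))
    threshold = clipping_ratio * rms
    return [s if abs(s) <= threshold else cmath.rect(threshold, cmath.phase(s))
            for s in samples]


def ofdm_symbol(num_subcarriers, rng):
    """Time-domain OFDM symbol from random QPSK subcarriers via a direct IDFT."""
    n = num_subcarriers
    qpsk = [complex(rng.choice((-1, 1)), rng.choice((-1, 1))) / math.sqrt(2)
            for _ in range(n)]
    return [sum(qpsk[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n))
            / math.sqrt(n)
            for t in range(n)]


rng = random.Random(0)           # fixed seed so the run is repeatable
x = ofdm_symbol(64, rng)         # 64 subcarriers is an arbitrary choice
y = clip(x, clipping_ratio=1.4)  # illustrative clipping ratio
print(f"PAPR before: {papr_db(x):.2f} dB, after: {papr_db(y):.2f} dB")
```

Lowering the clipping ratio reduces PAPR further but distorts more samples, which is the trade-off between amplifier nonlinearity and clipping distortion that the abstract refers to.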

arXiv:2309.14392 [pdf, other]

Unveiling Fairness Biases in Deep Learning-Based Brain MRI Reconstruction

Authors: Yuning Du, Yuyang Xue, Rohan Dharmakumar, Sotirios A. Tsaftaris

Abstract: Deep learning (DL) reconstruction particularly of MRI has led to improvements in image fidelity and reduction of acquisition time. In neuroimaging, DL methods can reconstruct high-quality images from undersampled data. However, it is essential to consider fairness in DL algorithms, particularly in terms of demographic characteristics. This study presents the first fairness analysis in a DL-based brain MRI reconstruction model. The model utilises the U-Net architecture for image reconstruction and explores the presence and sources of unfairness by implementing baseline Empirical Risk Minimisation (ERM) and rebalancing strategies. Model performance is evaluated using image reconstruction metrics. Our findings reveal statistically significant performance biases between the gender and age subgroups. Surprisingly, data imbalance and training discrimination are not the main sources of bias. This analysis provides insights of fairness in DL-based image reconstruction and aims to improve equity in medical AI applications.

Submitted 25 September, 2023; originally announced September 2023.

Comments: Accepted for publication at FAIMI 2023 (Fairness of AI in Medical Imaging) at MICCAI

arXiv:2309.13385 [pdf, other]

doi 10.1007/978-3-031-52448-6_40

Cine cardiac MRI reconstruction using a convolutional recurrent network with refinement

Authors: Yuyang Xue, Yuning Du, Gianluca Carloni, Eva Pachetti, Connor Jordan, Sotirios A. Tsaftaris

Abstract: Cine Magnetic Resonance Imaging (MRI) allows for understanding of the heart's function and condition in a non-invasive manner. Undersampling of the $k$-space is employed to reduce the scan duration, thus increasing patient comfort and reducing the risk of motion artefacts, at the cost of reduced image quality. In this challenge paper, we investigate the use of a convolutional recurrent neural network (CRNN) architecture to exploit temporal correlations in supervised cine cardiac MRI reconstruction. This is combined with a single-image super-resolution refinement module to improve single coil reconstruction by 4.4% in structural similarity and 3.9% in normalised mean square error compared to a plain CRNN implementation. We deploy a high-pass filter to our $\ell_1$ loss to allow greater emphasis on high-frequency details which are missing in the original data. The proposed model demonstrates considerable enhancements compared to the baseline case and holds promising potential for further improving cardiac MRI reconstruction.

Submitted 23 September, 2023; originally announced September 2023.

Comments: MICCAI STACOM workshop 2023

arXiv:2309.03641 [pdf, other]

Spiking Structured State Space Model for Monaural Speech Enhancement

Authors: Yu Du, Xu Liu, Yansong Chua

Abstract: Speech enhancement seeks to extract clean speech from noisy signals. Traditional deep learning methods face two challenges: efficiently using information in long speech sequences and high computational costs. To address these, we introduce the Spiking Structured State Space Model (Spiking-S4). This approach merges the energy efficiency of Spiking Neural Networks (SNN) with the long-range sequence modeling capabilities of Structured State Space Models (S4), offering a compelling solution. Evaluation on the DNS Challenge and VoiceBank+Demand Datasets confirms that Spiking-S4 rivals existing Artificial Neural Network (ANN) methods but with fewer computational resources, as evidenced by reduced parameters and Floating Point Operations (FLOPs).

Submitted 20 April, 2024; v1 submitted 7 September, 2023; originally announced September 2023.

arXiv:2308.15742 [pdf, other]

ASTER: Automatic Speech Recognition System Accessibility Testing for Stutterers

Authors: Yi Liu, Yuekang Li, Gelei Deng, Felix Juefei-Xu, Yao Du, Cen Zhang, Chengwei Liu, Yeting Li, Lei Ma, Yang Liu

Abstract: The popularity of automatic speech recognition (ASR) systems nowadays leads to an increasing need for improving their accessibility. Handling stuttering speech is an important feature for accessible ASR systems. To improve the accessibility of ASR systems for stutterers, we need to expose and analyze the failures of ASR systems on stuttering speech. The speech datasets recorded from stutterers are not diverse enough to expose most of the failures. Furthermore, these datasets lack ground truth information about the non-stuttered text, rendering them unsuitable as comprehensive test suites. Therefore, a methodology for generating stuttering speech as test inputs to test and analyze the performance of ASR systems is needed. However, generating valid test inputs in this scenario is challenging. The reason is that although the generated test inputs should mimic how stutterers speak, they should also be diverse enough to trigger more failures. To address the challenge, we propose ASTER, a technique for automatically testing the accessibility of ASR systems. ASTER can generate valid test cases by injecting five different types of stuttering. The generated test cases can both simulate realistic stuttering speech and expose failures in ASR systems. Moreover, ASTER can further enhance the quality of the test cases with a multi-objective optimization-based seed updating algorithm. We implemented ASTER as a framework and evaluated it on four open-source ASR models and three commercial ASR systems. We conduct a comprehensive evaluation of ASTER and find that it significantly increases the word error rate, match error rate, and word information loss in the evaluated ASR systems. Additionally, our user study demonstrates that the generated stuttering audio is indistinguishable from real-world stuttering audio clips.

Submitted 29 August, 2023; originally announced August 2023.

arXiv:2308.05941 [pdf, other]

doi 10.1016/j.apenergy.2023.121713

A Robust Planning Model for Offshore Microgrid Considering Tidal Power and Desalination

Authors: Zhimeng Wang, Ang Xuan, Xinwei Shen, Yunfei Du, Hongbin Sun

Abstract: Increasing attention has been paid to resources on islands, thus microgrids on islands need to be invested. Different from onshore microgrids, offshore microgrids (OM) are usually abundant in ocean renewable energy (ORE), such as offshore wind, tidal power generation (TPG), etc. Moreover, some special loads such as seawater desalination unit (SDU) should be included. In this sense, this paper proposes a planning method for OM to minimize the investment cost while the ORE's fluctuation could be accommodated with robustness. First, a deterministic planning model (DPM) is formulated for the OM with TPG and SDU. A robust planning model (RPM) is then developed considering the uncertainties from both TPG and load demand. The Column-and-constraint generation (C&CG) algorithm is then employed to solve the RPM, producing planning results for the OM that is robust against the worst scenario. Results of the case studies show that the investment and operation decisions of the proposed model are robust, and TPG shows good complementarity with the other RESs.

Submitted 11 August, 2023; originally announced August 2023.

arXiv:2307.16518 [pdf, other]

Continuous-Time Channel Prediction Based on Tensor Neural Ordinary Differential Equation

Authors: Mingyao Cui, Hao Jiang, Yuhao Chen, Yang Du, Linglong Dai

Abstract: Channel prediction is critical to address the channel aging issue in mobile scenarios. Existing channel prediction techniques are mainly designed for discrete channel prediction, which can only predict the future channel in a fixed time slot per frame, while the other intra-frame channels are usually recovered by interpolation. However, these approaches suffer from a serious interpolation loss, especially for mobile millimeter wave communications. To solve this challenging problem, we propose a tensor neural ordinary differential equation (TN-ODE) based continuous-time channel prediction scheme to realize the direct prediction of intra-frame channels. Specifically, inspired by the recently developed continuous mapping model named neural ODE in the field of machine learning, we first utilize the neural ODE model to predict future continuous-time channels. To improve the channel prediction accuracy and reduce computational complexity, we then propose the TN-ODE scheme to learn the structural characteristics of the high-dimensional channel by low dimensional learnable transform. Simulation results show that the proposed scheme is able to achieve higher intra-frame channel prediction accuracy than existing schemes.

Submitted 31 July, 2023; originally announced July 2023.

Comments: A tensor neural ODE based method is proposed to predict continuous-time wireless channels

arXiv:2307.15615 [pdf, other]

A survey on deep learning in medical image registration: new technologies, uncertainty, evaluation metrics, and beyond

Authors: Junyu Chen, Yihao Liu, Shuwen Wei, Zhangxing Bian, Shalini Subramanian, Aaron Carass, Jerry L. Prince, Yong Du

Abstract: Deep learning technologies have dramatically reshaped the field of medical image registration over the past decade. The initial developments, such as regression-based and U-Net-based networks, established the foundation for deep learning in image registration. Subsequent progress has been made in various aspects of deep learning-based registration, including similarity measures, deformation regularizations, network architectures, and uncertainty estimation. These advancements have not only enriched the field of image registration but have also facilitated its application in a wide range of tasks, including atlas construction, multi-atlas segmentation, motion estimation, and 2D-3D registration. In this paper, we present a comprehensive overview of the most recent advancements in deep learning-based image registration. We begin with a concise introduction to the core concepts of deep learning-based image registration. Then, we delve into innovative network architectures, loss functions specific to registration, and methods for estimating registration uncertainty. Additionally, this paper explores appropriate evaluation metrics for assessing the performance of deep learning models in registration tasks. Finally, we highlight the practical applications of these novel techniques in medical imaging and discuss the future prospects of deep learning-based image registration.

Submitted 30 April, 2024; v1 submitted 28 July, 2023; originally announced July 2023.

Comments: A list of open-sourced code from the papers reviewed has been organized and is available at https://bit.ly/3QgFJ9z

arXiv:2307.07319 [pdf, other]

The Power of Large Language Models for Wireless Communication System Development: A Case Study on FPGA Platforms

Authors: Yuyang Du, Hongyu Deng, Soung Chang Liew, Kexin Chen, Yulin Shao, He Chen

Abstract: Large language models (LLMs) have garnered significant attention across various research disciplines, including the wireless communication community. There have been several heated discussions on the intersection of LLMs and wireless technologies. While recent studies have demonstrated the ability of LLMs to generate hardware description language (HDL) code for simple computation tasks, developing wireless prototypes and products via HDL poses far greater challenges because of the more complex computation tasks involved. In this paper, we aim to address this challenge by investigating the role of LLMs in FPGA-based hardware development for advanced wireless signal processing. We begin by exploring LLM-assisted code refactoring, reuse, and validation, using an open-source software-defined radio (SDR) project as a case study. Through the case study, we find that an LLM assistant can potentially yield substantial productivity gains for researchers and developers. We then examine the feasibility of using LLMs to generate HDL code for advanced wireless signal processing, using the Fast Fourier Transform (FFT) algorithm as an example. This task presents two unique challenges: the scheduling of subtasks within the overall task and the multi-step thinking required to solve certain arithmetic problem within the task. To address these challenges, we employ in-context learning (ICL) and Chain-of-Thought (CoT) prompting techniques, culminating in the successful generation of a 64-point Verilog FFT module. Our results demonstrate the potential of LLMs for generalization and imitation, affirming their usefulness in writing HDL code for wireless communication systems. Overall, this work contributes to understanding the role of LLMs in wireless communication and motivates further exploration of their capabilities.

Submitted 14 July, 2024; v1 submitted 14 July, 2023; originally announced July 2023.

arXiv:2305.14374 [pdf, other]

Inferring Attracting Basins of Power System with Machine Learning

Authors: Yao Du, Qing Li, Huawei Fan, Meng Zhan, Jinghua Xiao, Xingang Wang

Abstract: Power systems dominated by renewable energy encounter frequently large, random disturbances, and a critical challenge faced in power-system management is how to anticipate accurately whether the perturbed systems will return to the functional state after the transient or collapse. Whereas model-based studies show that the key to addressing the challenge lies in the attracting basins of the functional and dysfunctional states in the phase space, the finding of the attracting basins for realistic power systems remains a challenge, as accurate models describing the system dynamics are generally unavailable. Here we propose a new machine learning technique, namely balanced reservoir computing, to infer the attracting basins of a typical power system based on measured data. Specifically, trained by the time series of a handful of perturbation events, we demonstrate that the trained machine can predict accurately whether the system will return to the functional state in response to a large, random perturbation, thereby reconstructing the attracting basin of the functional state. The working mechanism of the new machine is analyzed, and it is revealed that the success of the new machine is attributed to the good balance between the echo and fading properties of the reservoir network; the effect of noisy signals on the prediction performance is also investigated, and a stochastic-resonance-like phenomenon is observed. Finally, we demonstrate that the new technique can be also utilized to infer the attracting basins of coexisting attractors in typical chaotic systems.

Submitted 20 May, 2023; originally announced May 2023.

Comments: 13 pages, 7 figures

arXiv:2304.08490 [pdf, other]

Conditional Generation of Audio from Video via Foley Analogies

Authors: Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, Andrew Owens

Abstract: The sound effects that designers add to videos are designed to convey a particular artistic effect and, thus, may be quite different from a scene's true sound. Inspired by the challenges of creating a soundtrack for a video that differs from its true sound, but that nonetheless matches the actions occurring on screen, we propose the problem of conditional Foley. We present the following contributions to address this problem. First, we propose a pretext task for training our model to predict sound for an input video clip using a conditional audio-visual clip sampled from another time within the same source video. Second, we propose a model for generating a soundtrack for a silent input video, given a user-supplied example that specifies what the video should "sound like". We show through human studies and automated evaluation metrics that our model successfully generates sound from video, while varying its output according to the content of a supplied example. Project site: https://xypb.github.io/CondFoleyGen/

Submitted 17 April, 2023; originally announced April 2023.

Comments: CVPR 2023

arXiv:2303.06179 [pdf, other]

Deformable Cross-Attention Transformer for Medical Image Registration

Authors: Junyu Chen, Yihao Liu, Yufan He, Yong Du

Abstract: Transformers have recently shown promise for medical image applications, leading to an increasing interest in developing such models for medical image registration. Recent advancements in designing registration Transformers have focused on using cross-attention (CA) to enable a more precise understanding of spatial correspondences between moving and fixed images. Here, we propose a novel CA mechanism that computes windowed attention using deformable windows. In contrast to existing CA mechanisms that require intensive computational complexity by either computing CA globally or locally with a fixed and expanded search window, the proposed deformable CA can selectively sample a diverse set of features over a large search window while maintaining low computational complexity. The proposed model was extensively evaluated on multi-modal, mono-modal, and atlas-to-patient registration tasks, demonstrating promising performance against state-of-the-art methods and indicating its effectiveness for medical image registration. The source code for this work will be available after publication.

Submitted 10 March, 2023; originally announced March 2023.

arXiv:2303.06168 [pdf, other]

Spatially-varying Regularization with Conditional Transformer for Unsupervised Image Registration

Authors: Junyu Chen, Yihao Liu, Yufan He, Yong Du

Abstract: In the past, optimization-based registration models have used spatially-varying regularization to account for deformation variations in different image regions. However, deep learning-based registration models have mostly relied on spatially-invariant regularization. Here, we introduce an end-to-end framework that uses neural networks to learn a spatially-varying deformation regularizer directly from data. The hyperparameter of the proposed regularizer is conditioned into the network, enabling easy tuning of the regularization strength. The proposed method is built upon a Transformer-based model, but it can be readily adapted to any network architecture. We thoroughly evaluated the proposed approach using publicly available datasets and observed a significant performance improvement while maintaining smooth deformation. The source code of this work will be made available after publication.

Submitted 10 March, 2023; originally announced March 2023.

arXiv:2303.04015 [pdf, ps, other]

Simultaneous Recursive Identification of Parameters and Switching Manifolds Identification of Discrete-Time Switched Linear Systems

Authors: Zengjie Zhang, Yingwei Du, Tong Liu, Fangzhou Liu, Martin Buss

Abstract: A novel procedure for the online identification of a class of discrete-time switched linear systems, which simultaneously estimates the parameters and switching manifolds of the systems, is proposed in this paper. Firstly, to estimate the parameters of the subsystems, a discrete-time concurrent learning-based recursive parameter estimator is designed to guarantee the exponential convergence of the estimation errors to zero. Secondly, as an assistant procedure of the identification framework, an online switching detection method is proposed by making use of the history stacks produced by the concurrent learning estimators. Thirdly, techniques of incremental support vector machine are applied to develop the recursive algorithm to estimate the system switching manifolds, with its stability proven by a Lyapunov-based method. At the end of the paper, the stability and precision of the proposed identification methods are confirmed by the numerical simulation of a 2-order switched linear system. Compared to the traditional offline identification methods, the proposed online identification framework possesses superior efficiency with respect to large amounts of data, while the limitations and outlook of this framework are also discussed within the conclusion.

Submitted 7 March, 2023; originally announced March 2023.

arXiv:2301.03857 [pdf, ps, other]

doi 10.1109/JIOT.2023.3235618

Integrated Sensing and Communication Signals Toward 5G-A and 6G: A Survey

Authors: Zhiqing Wei, Hanyang Qu, Yuan Wang, Xin Yuan, Huici Wu, Ying Du, Kaifeng Han, Ning Zhang, Zhiyong Feng

Abstract: Integrated sensing and communication (ISAC) has the advantages of efficient spectrum utilization and low hardware cost. It is promising to be implemented in the fifth-generation-advanced (5G-A) and sixth-generation (6G) mobile communication systems, having the potential to be applied in intelligent applications requiring both communication and high-accurate sensing capabilities. As the fundamental technology of ISAC, ISAC signal directly impacts the performance of sensing and communication. This article systematically reviews the literature on ISAC signals from the perspective of mobile communication systems, including ISAC signal design, ISAC signal processing algorithms and ISAC signal optimization. We first review the ISAC signal design based on 5G, 5G-A and 6G mobile communication systems. Then, radar signal processing methods are reviewed for ISAC signals, mainly including the channel information matrix method, spectrum lines estimator method and super resolution method. In terms of signal optimization, we summarize peak-to-average power ratio (PAPR) optimization, interference management, and adaptive signal optimization for ISAC signals. This article may provide the guidelines for the research of ISAC signals in 5G-A and 6G mobile communication systems.

Submitted 15 December, 2023; v1 submitted 10 January, 2023; originally announced January 2023.

Comments: 25 pages, 13 figures, 8 tables. IEEE Internet of Things Journal, 2023

MSC Class: 94-02 ACM Class: A.1

arXiv:2212.12134 [pdf, other]

AMDET: Attention based Multiple Dimensions EEG Transformer for Emotion Recognition

Authors: Yongling Xu, Yang Du, Jing Zou, Tianying Zhou, Lushan Xiao, Li Liu, Pengcheng

Abstract: Affective computing is an important branch of artificial intelligence, and with the rapid development of brain computer interface technology, emotion recognition based on EEG signals has received broad attention. It is still a great challenge to effectively explore the multi-dimensional information in the EEG data in spite of a large number of deep learning methods. In this paper, we propose a deep model called Attention-based Multiple Dimensions EEG Transformer (AMDET), which can exploit the complementarity among the spectral-spatial-temporal features of EEG data by employing the multi-dimensional global attention mechanism. We transformed the original EEG data into 3D temporal-spectral-spatial representations and then the AMDET would use spectral-spatial transformer encoder layer to extract effective features in the EEG signal and concentrate on the critical time frame with a temporal attention layer. We conduct extensive experiments on the DEAP, SEED, and SEED-IV datasets to evaluate the performance of AMDET and the results outperform the state-of-the-art baseline on three datasets. Accuracy rates of 97.48%, 96.85%, 97.17%, 87.32% were achieved in the DEAP-Arousal, DEAP-Valence, SEED, and SEED-IV datasets, respectively. We also conduct extensive experiments to explore the possible brain regions that influence emotions and the coupling of EEG signals. AMDET can perform as well even with few channels which are identified by visualizing what learned model focus on. The accuracy could achieve over 90% even with only eight channels and it is of great use and benefit for practical applications.

Submitted 22 December, 2022; originally announced December 2022.

arXiv:2212.02715 [pdf, other]

Efficient Learning of Voltage Control Strategies via Model-based Deep Reinforcement Learning

Authors: Ramij R. Hossain, Tianzhixi Yin, Yan Du, Renke Huang, Jie Tan, Wenhao Yu, Yuan Liu, Qiuhua Huang

Abstract: This article proposes a model-based deep reinforcement learning (DRL) method to design emergency control strategies for short-term voltage stability problems in power systems. Recent advances show promising results in model-free DRL-based methods for power systems, but model-free methods suffer from poor sample efficiency and training time, both critical for making state-of-the-art DRL algorithms practically applicable. DRL-agent learns an optimal policy via a trial-and-error method while interacting with the real-world environment. And it is desirable to minimize the direct interaction of the DRL agent with the real-world power grid due to its safety-critical nature. Additionally, state-of-the-art DRL-based policies are mostly trained using a physics-based grid simulator where dynamic simulation is computationally intensive, lowering the training efficiency. We propose a novel model-based-DRL framework where a deep neural network (DNN)-based dynamic surrogate model, instead of a real-world power-grid or physics-based simulation, is utilized with the policy learning framework, making the process faster and sample efficient. However, stabilizing model-based DRL is challenging because of the complex system dynamics of large-scale power systems. We solved these issues by incorporating imitation learning to have a warm start in policy learning, reward-shaping, and multi-step surrogate loss. Finally, we achieved 97.5% sample efficiency and 87.7% training efficiency for an application to the IEEE 300-bus test system.

Submitted 5 December, 2022; originally announced December 2022.

arXiv:2210.14644 [pdf, ps, other]

Speaker Diarization Based on Multi-channel Microphone Array in Small-scale Meeting

Authors: Yuxuan Du, Ruohua Zhou

Abstract: In the task of speaker diarization, the number of small-scale meetings accounts for a large proportion. When microphone arrays are employed as a recording device, their spatial information is usually ignored by most researchers. In this paper, inspired by the clustering method combining d-vector and microphone array spatial vector, we propose a diarization method that uses multi-channel microphone arrays for a meeting with no more than 4 speakers. We utilize speech enhancement to preprocess the audio from the microphone array. The Steered-Response Power Phase Transform (SRP-PHAT) algorithm is employed to get more accurate speakers, and we apply the number of speakers to recluster the speech segments to achieve better performance. Finally, we fuse our system by DOVER-LAP to get the best result. We evaluated our system on the AMI corpus. Compared with the best experimental results so far, our system achieves a large improvement in the diarization error rate (DER).

Submitted 26 October, 2022; originally announced October 2022.

arXiv:2209.09635 [pdf]

The BUCEA Speaker Diarization System for the VoxCeleb Speaker Recognition Challenge 2022

Authors: Ruohua Zhou, Yuxuan Du, Chenlei Hu

Abstract: This paper describes the BUCEA speaker diarization system for the 2022 VoxCeleb Speaker Recognition Challenge. Voxsrc-22 provides the development set and test set of VoxConverse, and we mainly use the test set of VoxConverse for parameter adjustment. Our system consists of several modules, including speech activity detection (VAD), speaker embedding extractor, clustering methods, overlapping speech detection (OSD), and result fusion. Without considering overlap, the Dover-LAP (short for Diarization Output Voting Error Reduction) method was applied to system fusion, and overlapping speech detection and processing were finally carried out. Our best system achieves a diarization error rate (DER) of 5.48% and a Jaccard error rate (JER) of 32.1% on the VoxSRC 2022 evaluation set respectively.

Submitted 20 September, 2022; originally announced September 2022.

arXiv:2208.08654 [pdf, other]

Rethinking the Performance of ISAC System: From Efficiency and Utility Perspectives

Authors: Jiamo Jiang, Mingfeng Xu, Zhongyuan Zhao, Kaifeng Han, Yang Li, Ying Du, Zhiqin Wang

Abstract: Integrated sensing and communications (ISAC) is an essential technology for the 6G communication system, which enables the conventional wireless communication network capable of sensing targets around. The shared use of pilots is a promising strategy to achieve ISAC. It brings a trade-off between communication and sensing, which is still unclear under the imperfect channel estimation condition. To provide some insights, the trade-off between ergodic capacity with imperfect channel estimation and ergodic Cramer-Rao bound (CRB) of range sensing is investigated. Firstly, the closed-form expressions of ergodic capacity and ergodic range CRB are derived, which are associated with the number of pilots. Secondly, two novel metrics named efficiency and utility are proposed to evaluate the joint performance of capacity and range sensing error. Specifically, efficiency is used to evaluate the achievable capacity per unit of the sensing error, and utility is designed to evaluate the utilization degree of ISAC. Moreover, an algorithm of pilot length optimization is designed to achieve the best efficiency. Finally, simulation results are given to verify the accuracy of analytical results, and provide some insights on designing the slot structure.

Submitted 18 August, 2022; originally announced August 2022.

Showing 1–50 of 88 results for author: Du, Y