Search | arXiv e-print repository

HumanDiffusion: diffusion model using perceptual gradients

Authors: Yota Ueda, Shinnosuke Takamichi, Yuki Saito, Norihiro Takamune, Hiroshi Saruwatari

Abstract: We propose {\it HumanDiffusion,} a diffusion model trained from humans' perceptual gradients to learn an acceptable range of data for humans (i.e., human-acceptable distribution). Conventional HumanGAN aims to model the human-acceptable distribution wider than the real-data distribution by training a neural network-based generator with human-based discriminators. However, HumanGAN training tends t… ▽ More We propose {\it HumanDiffusion,} a diffusion model trained from humans' perceptual gradients to learn an acceptable range of data for humans (i.e., human-acceptable distribution). Conventional HumanGAN aims to model the human-acceptable distribution wider than the real-data distribution by training a neural network-based generator with human-based discriminators. However, HumanGAN training tends to converge in a meaningless distribution due to the gradient vanishing or mode collapse and requires careful heuristics. In contrast, our HumanDiffusion learns the human-acceptable distribution through Langevin dynamics based on gradients of human perceptual evaluations. Our training iterates a process to diffuse real data to cover a wider human-acceptable distribution and can avoid the issues in the HumanGAN training. The evaluation results demonstrate that our HumanDiffusion can successfully represent the human-acceptable distribution without any heuristics for the training. △ Less

Submitted 21 June, 2023; originally announced June 2023.

Comments: Proceedings of INTERSPEECH

arXiv:2211.05869 [pdf, other]

A Study on the Integration of Pre-trained SSL, ASR, LM and SLU Models for Spoken Language Understanding

Authors: Yifan Peng, Siddhant Arora, Yosuke Higuchi, Yushi Ueda, Sujay Kumar, Karthik Ganesan, Siddharth Dalmia, Xuankai Chang, Shinji Watanabe

Abstract: Collecting sufficient labeled data for spoken language understanding (SLU) is expensive and time-consuming. Recent studies achieved promising results by using pre-trained models in low-resource scenarios. Inspired by this, we aim to ask: which (if any) pre-training strategies can improve performance across SLU benchmarks? To answer this question, we employ four types of pre-trained models and thei… ▽ More Collecting sufficient labeled data for spoken language understanding (SLU) is expensive and time-consuming. Recent studies achieved promising results by using pre-trained models in low-resource scenarios. Inspired by this, we aim to ask: which (if any) pre-training strategies can improve performance across SLU benchmarks? To answer this question, we employ four types of pre-trained models and their combinations for SLU. We leverage self-supervised speech and language models (LM) pre-trained on large quantities of unpaired data to extract strong speech and text representations. We also explore using supervised models pre-trained on larger external automatic speech recognition (ASR) or SLU corpora. We conduct extensive experiments on the SLU Evaluation (SLUE) benchmark and observe self-supervised pre-trained models to be more powerful, with pre-trained LM and speech models being most beneficial for the Sentiment Analysis and Named Entity Recognition task, respectively. △ Less

Submitted 10 November, 2022; originally announced November 2022.

Comments: Accepted at SLT 2022

arXiv:2203.17068 [pdf, other]

EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers

Authors: Soumi Maiti, Yushi Ueda, Shinji Watanabe, Chunlei Zhang, Meng Yu, Shi-Xiong Zhang, Yong Xu

Abstract: In this paper, we present a novel framework that jointly performs three tasks: speaker diarization, speech separation, and speaker counting. Our proposed framework integrates speaker diarization based on end-to-end neural diarization (EEND) models, speaker counting with encoder-decoder based attractors (EDA), and speech separation using Conv-TasNet. In addition, we propose a multiple 1x1 convoluti… ▽ More In this paper, we present a novel framework that jointly performs three tasks: speaker diarization, speech separation, and speaker counting. Our proposed framework integrates speaker diarization based on end-to-end neural diarization (EEND) models, speaker counting with encoder-decoder based attractors (EDA), and speech separation using Conv-TasNet. In addition, we propose a multiple 1x1 convolutional layer architecture for estimating the separation masks corresponding to a flexible number of speakers and a fusion technique for refining the separated speech signal with obtained speaker diarization information to improve the joint framework. Experiments using the LibriMix dataset show that our proposed method outperforms the single-task baselines in both diarization and separation metrics for fixed and flexible numbers of speakers and improves speaker counting performance for flexible numbers of speakers. All materials will be open-sourced and reproducible in ESPnet toolkit. △ Less

Submitted 15 December, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

Comments: Accepted in SLT 2022

arXiv:2111.14706 [pdf, other]

ESPnet-SLU: Advancing Spoken Language Understanding through ESPnet

Authors: Siddhant Arora, Siddharth Dalmia, Pavel Denisov, Xuankai Chang, Yushi Ueda, Yifan Peng, Yuekai Zhang, Sujay Kumar, Karthik Ganesan, Brian Yan, Ngoc Thang Vu, Alan W Black, Shinji Watanabe

Abstract: As Automatic Speech Processing (ASR) systems are getting better, there is an increasing interest of using the ASR output to do downstream Natural Language Processing (NLP) tasks. However, there are few open source toolkits that can be used to generate reproducible results on different Spoken Language Understanding (SLU) benchmarks. Hence, there is a need to build an open source standard that can b… ▽ More As Automatic Speech Processing (ASR) systems are getting better, there is an increasing interest of using the ASR output to do downstream Natural Language Processing (NLP) tasks. However, there are few open source toolkits that can be used to generate reproducible results on different Spoken Language Understanding (SLU) benchmarks. Hence, there is a need to build an open source standard that can be used to have a faster start into SLU research. We present ESPnet-SLU, which is designed for quick development of spoken language understanding in a single framework. ESPnet-SLU is a project inside end-to-end speech processing toolkit, ESPnet, which is a widely used open-source standard for various speech processing tasks like ASR, Text to Speech (TTS) and Speech Translation (ST). We enhance the toolkit to provide implementations for various SLU benchmarks that enable researchers to seamlessly mix-and-match different ASR and NLU models. We also provide pretrained models with intensively tuned hyper-parameters that can match or even outperform the current state-of-the-art performances. The toolkit is publicly available at https://github.com/espnet/espnet. △ Less

Submitted 3 March, 2022; v1 submitted 29 November, 2021; originally announced November 2021.

Comments: Accepted at ICASSP 2022 (5 pages)

arXiv:2102.04051 [pdf, other]

HumanACGAN: conditional generative adversarial network with human-based auxiliary classifier and its evaluation in phoneme perception

Authors: Yota Ueda, Kazuki Fujii, Yuki Saito, Shinnosuke Takamichi, Yukino Baba, Hiroshi Saruwatari

Abstract: We propose a conditional generative adversarial network (GAN) incorporating humans' perceptual evaluations. A deep neural network (DNN)-based generator of a GAN can represent a real-data distribution accurately but can never represent a human-acceptable distribution, which are ranges of data in which humans accept the naturalness regardless of whether the data are real or not. A HumanGAN was propo… ▽ More We propose a conditional generative adversarial network (GAN) incorporating humans' perceptual evaluations. A deep neural network (DNN)-based generator of a GAN can represent a real-data distribution accurately but can never represent a human-acceptable distribution, which are ranges of data in which humans accept the naturalness regardless of whether the data are real or not. A HumanGAN was proposed to model the human-acceptable distribution. A DNN-based generator is trained using a human-based discriminator, i.e., humans' perceptual evaluations, instead of the GAN's DNN-based discriminator. However, the HumanGAN cannot represent conditional distributions. This paper proposes the HumanACGAN, a theoretical extension of the HumanGAN, to deal with conditional human-acceptable distributions. Our HumanACGAN trains a DNN-based conditional generator by regarding humans as not only a discriminator but also an auxiliary classifier. The generator is trained by deceiving the human-based discriminator that scores the unconditioned naturalness and the human-based classifier that scores the class-conditioned perceptual acceptability. The training can be executed using the backpropagation algorithm involving humans' perceptual evaluations. Our experimental results in phoneme perception demonstrate that our HumanACGAN can successfully train this conditional generator. △ Less

Submitted 8 February, 2021; originally announced February 2021.

Comments: 5 pages, 6 figures, to be published in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing

arXiv:2005.11040 [pdf, other]

DevReplay: Automatic Repair with Editable Fix Pattern

Authors: Yuki Ueda, Takashi Ishio, Akinori Ihara, Kenichi Matsumoto

Abstract: Static analysis tools, or linters, detect violation of source code conventions to maintain project readability. Those tools automatically fix specific violations while developers edit the source code. However, existing tools are designed for the general conventions of programming languages. These tools do not check the project/API-specific conventions. We propose a novel static analysis tool DevRe… ▽ More Static analysis tools, or linters, detect violation of source code conventions to maintain project readability. Those tools automatically fix specific violations while developers edit the source code. However, existing tools are designed for the general conventions of programming languages. These tools do not check the project/API-specific conventions. We propose a novel static analysis tool DevReplay that generates code change patterns by mining the code change history, and we recommend changes using the matched patterns. Using DevReplay, developers can automatically detect and fix project/API-specific problems in the code editor and code review. Also, we evaluate the accuracy of DevReplay using automatic program repair tool benchmarks and real software. We found that DevReplay resolves more bugs than state-of-the-art APR tools. Finally, we submitted patches to the most popular open-source projects that are implemented by different languages, and project reviewers accepted 80% (8 of 10) patches. DevReplay is available on https://devreplay.github.io. △ Less

Submitted 22 May, 2020; originally announced May 2020.

arXiv:1911.08816 [pdf, other]

Can We Benchmark Code Review Studies? A Systematic Mapping Study of Methodology, Dataset, and Metric

Authors: Dong Wang, Yuki Ueda, Raula Gaikovina Kula, Takashi Ishio, Kenichi Matsumoto

Abstract: Code Review (CR) is the cornerstone for software quality assurance and a crucial practice for software development. As CR research matures, it can be difficult to keep track of the best practices and state-of-the-art in methodology, dataset, and metric. This paper investigates the potential of benchmarking by collecting methodology, dataset, and metric of CR studies. A systematic mapping study was… ▽ More Code Review (CR) is the cornerstone for software quality assurance and a crucial practice for software development. As CR research matures, it can be difficult to keep track of the best practices and state-of-the-art in methodology, dataset, and metric. This paper investigates the potential of benchmarking by collecting methodology, dataset, and metric of CR studies. A systematic mapping study was conducted. A total of 112 studies from 19,847 papers published in high-impact venues between the years 2011 and 2019 were selected and analyzed. First, we find that empirical evaluation is the most common methodology (65% of papers), with solution and experience being the least common methodology. Second, we highlight 50% of papers that use the quantitative method or mixed-method have the potential for replicability. Third, we identify 457 metrics that are grouped into sixteen core metric sets, applied to nine Software Engineering topics, showing different research topics tend to use specific metric sets. We conclude that at this stage, we cannot benchmark CR studies. Nevertheless, a common benchmark will facilitate new researchers, including experts from other fields, to innovate new techniques and build on top of already established methodologies. A full replication is available at https://naist-se.github.io/code-review/. △ Less

Submitted 13 April, 2021; v1 submitted 20 November, 2019; originally announced November 2019.

Showing 1–7 of 7 results for author: Ueda, Y