EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers

Maiti, Soumi; Ueda, Yushi; Watanabe, Shinji; Zhang, Chunlei; Yu, Meng; Zhang, Shi-Xiong; Xu, Yong

arXiv:2203.17068 (eess)

[Submitted on 31 Mar 2022 (v1), last revised 15 Dec 2022 (this version, v2)]

Abstract: In this paper, we present a novel framework that jointly performs three tasks: speaker diarization, speech separation, and speaker counting. Our proposed framework integrates speaker diarization based on end-to-end neural diarization (EEND) models, speaker counting with encoder-decoder-based attractors (EDA), and speech separation using Conv-TasNet. In addition, we propose a multiple 1x1 convolutional layer architecture for estimating the separation masks corresponding to a flexible number of speakers, and a fusion technique that refines the separated speech signal using the obtained speaker diarization information to improve the joint framework. Experiments using the LibriMix dataset show that our proposed method outperforms the single-task baselines in both diarization and separation metrics for fixed and flexible numbers of speakers, and improves speaker counting performance for flexible numbers of speakers. All materials will be open-sourced and reproducible in the ESPnet toolkit.
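The abstract names two components without giving implementation details: a bank of 1x1 convolutional layers for mask estimation with a flexible number of speakers, and a fusion step that uses diarization output to refine the separated signals. The following is a minimal NumPy sketch of one way these ideas could fit together. It is not the authors' code: the layer sizes, the choice of one head per speaker count, the sigmoid mask nonlinearity, the multiplicative fusion rule, and the names `heads`, `estimate_masks`, and `fuse` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

N_FILTERS = 8      # encoder channels (hypothetical size)
BOTTLENECK = 4     # separator feature channels (hypothetical size)
MAX_SPEAKERS = 3   # largest supported speaker count (assumption)

# Assumption: one 1x1 conv head per possible speaker count.
# A 1x1 convolution over a (channels, frames) feature map is a
# per-frame matrix multiply, so each head is just a weight matrix.
heads = {
    c: rng.standard_normal((c * N_FILTERS, BOTTLENECK)) * 0.1
    for c in range(1, MAX_SPEAKERS + 1)
}


def estimate_masks(features, num_speakers):
    """Pick the head matching the (estimated) speaker count.

    features: array of shape (BOTTLENECK, T)
    returns:  masks of shape (num_speakers, N_FILTERS, T) in (0, 1)
    """
    weight = heads[num_speakers]
    logits = weight @ features                 # (num_speakers * N_FILTERS, T)
    masks = 1.0 / (1.0 + np.exp(-logits))      # sigmoid (assumed nonlinearity)
    return masks.reshape(num_speakers, N_FILTERS, -1)


def fuse(separated, activity):
    """Scale each separated signal by its speaker's activity probability.

    separated: (S, n_samples) separated waveforms
    activity:  (S, n_frames) per-frame speech activity in [0, 1]
    Assumed fusion rule: upsample activity to sample rate by repetition,
    then multiply element-wise.
    """
    n_samples, n_frames = separated.shape[1], activity.shape[1]
    if n_frames > n_samples:
        raise ValueError("activity has more frames than the signal has samples")
    hop = n_samples // n_frames
    upsampled = np.repeat(activity, hop, axis=1)
    pad = n_samples - upsampled.shape[1]
    if pad > 0:
        # Extend the last frame's value over any leftover samples.
        upsampled = np.pad(upsampled, ((0, 0), (0, pad)), mode="edge")
    return separated * upsampled
```

In this sketch the speaker count passed to `estimate_masks` would come from the counting module (EDA in the paper), and `activity` would come from the diarization branch; regions where a speaker is judged inactive are attenuated in that speaker's separated output.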

Comments:	Accepted in SLT 2022
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2203.17068 [eess.AS]
	(or arXiv:2203.17068v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2203.17068

Submission history

From: Soumi Maiti
[v1] Thu, 31 Mar 2022 14:36:00 UTC (92 KB)
[v2] Thu, 15 Dec 2022 19:35:55 UTC (1,655 KB)
