Heterogeneous sound classification with the Broad Sound Taxonomy and Dataset

Authors: Panagiota Anastasopoulou, Jessica Torrey, Xavier Serra, Frederic Font

Abstract: Automatic sound classification has a wide range of applications in machine listening, enabling context-aware sound processing and understanding. This paper explores methodologies for automatically classifying heterogeneous sounds characterized by high intra-class variability. Our study evaluates the classification task using the Broad Sound Taxonomy, a two-level taxonomy comprising 28 classes designed to cover a heterogeneous range of sounds with semantic distinctions tailored for practical user applications. We construct a dataset through manual annotation to ensure accuracy, diverse representation within each class and relevance in real-world scenarios. We compare a variety of both traditional and modern machine learning approaches to establish a baseline for the task of heterogeneous sound classification. We investigate the role of input features, specifically examining how acoustically derived sound representations compare to embeddings extracted with pre-trained deep neural networks that capture both acoustic and semantic information about sounds. Experimental results illustrate that audio embeddings encoding acoustic and semantic information achieve higher accuracy in the classification task. After careful analysis of classification errors, we identify some underlying reasons for failure and propose actions to mitigate them. The paper highlights the need for deeper exploration of all stages of classification, understanding the data and adopting methodologies capable of effectively handling data complexity and generalizing in real-world sound environments.

Submitted 1 October, 2024; originally announced October 2024.

Comments: DCASE2024, post-print, 5 pages, 2 figures

arXiv:2409.01864 [pdf, other]

The Role of Large Language Models in Musicology: Are We Ready to Trust the Machines?

Authors: Pedro Ramoneda, Emilia Parada-Cabaleiro, Benno Weck, Xavier Serra

Abstract: In this work, we explore the use and reliability of Large Language Models (LLMs) in musicology. From a discussion with experts and students, we assess the current acceptance and concerns regarding this now-ubiquitous technology. We aim to go one step further, proposing a semi-automatic method to create an initial benchmark using retrieval-augmented generation models and multiple-choice question generation, validated by human experts. Our evaluation on 400 human-validated questions shows that current vanilla LLMs are less reliable than retrieval-augmented generation from music dictionaries. This paper suggests that the potential of LLMs in musicology requires musicology-driven research that can specialize LLMs by including accurate and reliable domain knowledge.

Submitted 3 September, 2024; originally announced September 2024.

arXiv:2408.00473 [pdf, other]

Towards Explainable and Interpretable Musical Difficulty Estimation: A Parameter-efficient Approach

Authors: Pedro Ramoneda, Vsevolod Eremenko, Alexandre D'Hooge, Emilia Parada-Cabaleiro, Xavier Serra

Abstract: Estimating music piece difficulty is important for organizing educational music collections. This process could be partially automated to facilitate the educator's role. Nevertheless, the decisions made by prevalent deep-learning models are hard to understand, which may impair the acceptance of such a technology in music education curricula. Our work employs explainable descriptors for difficulty estimation in symbolic music representations. Furthermore, through a novel parameter-efficient white-box model, we outperform previous efforts while delivering interpretable results. These comprehensible outcomes emulate the functionality of a rubric, a tool widely used in music education. Our approach, evaluated on piano repertoire categorized into 9 classes, achieved 41.4% accuracy independently, with a mean squared error (MSE) of 1.7, showing precise difficulty estimation. Through our baseline, we illustrate how building on top of past research can offer alternatives for music difficulty assessment which are explainable and interpretable. With this, we aim to promote more effective communication between the Music Information Retrieval (MIR) community and the music education community.

Submitted 1 August, 2024; originally announced August 2024.

arXiv:2407.14364 [pdf, other]

Towards Assessing Data Replication in Music Generation with Music Similarity Metrics on Raw Audio

Authors: Roser Batlle-Roca, Wei-Hisang Liao, Xavier Serra, Yuki Mitsufuji, Emilia Gómez

Abstract: Recent advancements in music generation are raising multiple concerns about the implications of AI in creative music processes, current business models and impacts related to intellectual property management. A relevant discussion and related technical challenge is the potential replication and plagiarism of the training set in AI-generated music, which could lead to misuse of data and intellectual property rights violations. To tackle this issue, we present the Music Replication Assessment (MiRA) tool: a model-independent open evaluation method based on diverse audio music similarity metrics to assess data replication. We evaluate the ability of five metrics to identify exact replication by conducting a controlled replication experiment in different music genres using synthetic samples. Our results show that the proposed methodology can estimate exact data replication with a proportion higher than 10%. By introducing the MiRA tool, we intend to encourage the open evaluation of music-generative models by researchers, developers, and users concerning data replication, highlighting the importance of the ethical, social, legal, and economic consequences. Code and examples are available for reproducibility purposes.

Submitted 1 August, 2024; v1 submitted 19 July, 2024; originally announced July 2024.

Comments: Accepted at ISMIR 2024

arXiv:2403.03947 [pdf, other]

Can Audio Reveal Music Performance Difficulty? Insights from the Piano Syllabus Dataset

Authors: Pedro Ramoneda, Minhee Lee, Dasaem Jeong, J. J. Valero-Mas, Xavier Serra

Abstract: Automatically estimating the performance difficulty of a music piece represents a key process in music education to create tailored curricula according to the individual needs of the students. Given its relevance, the Music Information Retrieval (MIR) field depicts some proof-of-concept works addressing this task that mainly focus on high-level music abstractions such as machine-readable scores or music sheet images. In this regard, the potential of directly analyzing audio recordings has been generally neglected, which prevents students from exploring diverse music pieces that may not have a formal symbolic-level transcription. This work pioneers the automatic estimation of performance difficulty of music pieces on audio recordings with two precise contributions: (i) the first audio-based difficulty estimation dataset -- namely, the Piano Syllabus (PSyllabus) dataset -- featuring 7,901 piano pieces across 11 difficulty levels from 1,233 composers; and (ii) a recognition framework capable of managing different input representations -- in both unimodal and multimodal manners -- directly derived from audio to perform the difficulty estimation task. The comprehensive experimentation comprising different pre-training schemes, input modalities, and multi-task scenarios proves the validity of the proposal and establishes PSyllabus as a reference dataset for audio-based difficulty estimation in the MIR field. The dataset as well as the developed code and trained models are publicly shared to promote further research in the field.

Submitted 6 March, 2024; originally announced March 2024.

arXiv:2402.09318 [pdf, other]

Leveraging Pre-Trained Autoencoders for Interpretable Prototype Learning of Music Audio

Authors: Pablo Alonso-Jiménez, Leonardo Pepino, Roser Batlle-Roca, Pablo Zinemanas, Dmitry Bogdanov, Xavier Serra, Martín Rocamora

Abstract: We present PECMAE, an interpretable model for music audio classification based on prototype learning. Our model is based on a previous method, APNet, which jointly learns an autoencoder and a prototypical network. Instead, we propose to decouple both training processes. This enables us to leverage existing self-supervised autoencoders pre-trained on much larger data (EnCodecMAE), providing representations with better generalization. APNet allows prototypes' reconstruction to waveforms for interpretability relying on the nearest training data samples. In contrast, we explore using a diffusion decoder that allows reconstruction without such dependency. We evaluate our method on datasets for music instrument classification (Medley-Solos-DB) and genre recognition (GTZAN and a larger in-house dataset), the latter being a more challenging task not addressed with prototypical networks before. We find that the prototype-based models preserve most of the performance achieved with the autoencoder embeddings, while the sonification of prototypes benefits understanding the behavior of the classifier.

Submitted 14 February, 2024; originally announced February 2024.

arXiv:2312.09207 [pdf, other]

doi 10.1007/978-3-031-56435-2_4

WikiMuTe: A web-sourced dataset of semantic descriptions for music audio

Authors: Benno Weck, Holger Kirchhoff, Peter Grosche, Xavier Serra

Abstract: Multi-modal deep learning techniques for matching free-form text with music have shown promising results in the field of Music Information Retrieval (MIR). Prior work is often based on large proprietary data while publicly available datasets are few and small in size. In this study, we present WikiMuTe, a new and open dataset containing rich semantic descriptions of music. The data is sourced from Wikipedia's rich catalogue of articles covering musical works. Using a dedicated text-mining pipeline, we extract both long and short-form descriptions covering a wide range of topics related to music content such as genre, style, mood, instrumentation, and tempo. To show the use of this data, we train a model that jointly learns text and audio representations and performs cross-modal retrieval. The model is evaluated on two tasks: tag-based music retrieval and music auto-tagging. The results show that our approach has state-of-the-art performance on multiple tasks, but we still observe a difference in performance depending on the data used for training.

Submitted 14 December, 2023; originally announced December 2023.

Comments: Submitted to 30th International Conference on MultiMedia Modeling (MMM2024). This preprint has not undergone peer review or any post-submission improvements or corrections

Journal ref: The Version of Record of this contribution is published in MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14565. Springer, Cham

arXiv:2311.08350 [pdf, other]

ChoralSynth: Synthetic Dataset of Choral Singing

Authors: Jyoti Narang, Viviana De La Vega, Xavier Lizarraga, Oscar Mayor, Hector Parra, Jordi Janer, Xavier Serra

Abstract: Choral singing, a widely practiced form of ensemble singing, lacks comprehensive datasets in the realm of Music Information Retrieval (MIR) research, due to challenges arising from the requirement to curate multitrack recordings. To address this, we devised a novel methodology, leveraging state-of-the-art synthesizers to create and curate quality renditions. The scores were sourced from the Choral Public Domain Library (CPDL). This work is done in collaboration with a diverse team of musicians, software engineers and researchers. The resulting dataset, complete with its associated metadata, and the methodology are released as part of this work, opening up new avenues for exploration and advancement in the field of singing voice research.

Submitted 21 November, 2023; v1 submitted 14 November, 2023; originally announced November 2023.

Comments: Dataset Link: https://doi.org/10.5281/zenodo.10137883

arXiv:2309.16418 [pdf, other]

Efficient Supervised Training of Audio Transformers for Music Representation Learning

Authors: Pablo Alonso-Jiménez, Xavier Serra, Dmitry Bogdanov

Abstract: In this work, we address music representation learning using convolution-free transformers. We build on top of existing spectrogram-based audio transformers such as AST and train our models on a supervised task using patchout training similar to PaSST. In contrast to previous works, we study how specific design decisions affect downstream music tagging tasks instead of focusing on the training task. We assess the impact of initializing the models with different pre-trained weights, using various input audio segment lengths, using learned representations from different blocks and tokens of the transformer for downstream tasks, and applying patchout at inference to speed up feature extraction. We find that 1) initializing the model from ImageNet or AudioSet weights and using longer input segments are beneficial both for the training and downstream tasks, 2) the best representations for the considered downstream tasks are located in the middle blocks of the transformer, and 3) using patchout at inference allows faster processing than our convolutional baselines while maintaining superior performance. The resulting models, MAEST, are publicly available and obtain the best performance among open models in music tagging tasks.

Submitted 28 September, 2023; originally announced September 2023.

Comments: Accepted at the 2023 International Society for Music Information Retrieval Conference (ISMIR'23)

arXiv:2309.16287 [pdf, other]

Predicting performance difficulty from piano sheet music images

Authors: Pedro Ramoneda, Jose J. Valero-Mas, Dasaem Jeong, Xavier Serra

Abstract: Estimating the performance difficulty of a musical score is crucial in music education for adequately designing the learning curriculum of the students. Although the Music Information Retrieval community has recently shown interest in this task, existing approaches mainly use machine-readable scores, leaving the broader case of sheet music images unaddressed. Based on previous works involving sheet music images, we use a mid-level representation, bootleg score, describing notehead positions relative to staff lines coupled with a transformer model. This architecture is adapted to our task by introducing an encoding scheme that reduces the encoded sequence length to one-eighth of the original size. In terms of evaluation, we consider five datasets -- more than 7500 scores with up to 9 difficulty levels --, two of them particularly compiled for this work. The results obtained when pretraining the scheme on the IMSLP corpus and fine-tuning it on the considered datasets prove the proposal's validity, achieving the best-performing model with a balanced accuracy of 40.34% and a mean square error of 1.33. Finally, we provide access to our code, data, and models for transparency and reproducibility.

Submitted 28 September, 2023; originally announced September 2023.

arXiv:2307.12888 [pdf, other]

An objective evaluation of Hearing Aids and DNN-based speech enhancement in complex acoustic scenes

Authors: Enric Gusó, Joanna Luberadzka, Martí Baig, Umut Sayin Saraç, Xavier Serra

Abstract: We investigate the objective performance of five high-end commercially available Hearing Aid (HA) devices compared to DNN-based speech enhancement algorithms in complex acoustic environments. To this end, we measure the HRTFs of a single HA device to synthesize a binaural dataset for training two state-of-the-art causal and non-causal DNN enhancement models. We then generate an evaluation set of realistic speech-in-noise situations using an Ambisonics loudspeaker setup and record with a KU100 dummy head wearing each of the HA devices, both with and without the conventional HA algorithms, applying the DNN enhancers to the latter. We find that the DNN-based enhancement outperforms the HA algorithms in terms of noise suppression and objective intelligibility metrics.

Submitted 24 July, 2023; originally announced July 2023.

Comments: Accepted to WASPAA23

arXiv:2306.08480 [pdf, other]

Combining piano performance dimensions for score difficulty classification

Authors: Pedro Ramoneda, Dasaem Jeong, Vsevolod Eremenko, Nazif Can Tamer, Marius Miron, Xavier Serra

Abstract: Predicting the difficulty of playing a musical score is essential for structuring and exploring score collections. Despite its importance for music education, the automatic difficulty classification of piano scores is not yet solved, mainly due to the lack of annotated data and the subjectiveness of the annotations. This paper aims to advance the state-of-the-art in score difficulty classification with two major contributions. To address the lack of data, we present the Can I Play It? (CIPI) dataset, a machine-readable piano score dataset with difficulty annotations obtained from the renowned classical music publisher Henle Verlag. The dataset is created by matching public domain scores with difficulty labels from Henle Verlag, then reviewed and corrected by an expert pianist. As a second contribution, we explore various input representations from score information to pre-trained ML models for piano fingering and expressiveness inspired by the musicology definition of performance. We show that combining the outputs of multiple classifiers performs better than the classifiers on their own, pointing to the fact that the representations capture different aspects of difficulty. In addition, we conduct numerous experiments that lay a foundation for score difficulty classification and create a basis for future research. Our best-performing model reports a 39.47% balanced accuracy and 1.13 median square error across the nine difficulty levels proposed in this study. Code, dataset, and models are made available for reproducibility.

Submitted 27 September, 2023; v1 submitted 14 June, 2023; originally announced June 2023.

Comments: 36 pages

arXiv:2304.12257 [pdf, other]

Pre-Training Strategies Using Contrastive Learning and Playlist Information for Music Classification and Similarity

Authors: Pablo Alonso-Jiménez, Xavier Favory, Hadrien Foroughmand, Grigoris Bourdalas, Xavier Serra, Thomas Lidy, Dmitry Bogdanov

Abstract: In this work, we investigate an approach that relies on contrastive learning and music metadata as a weak source of supervision to train music representation models. Recent studies show that contrastive learning can be used with editorial metadata (e.g., artist or album name) to learn audio representations that are useful for different classification tasks. In this paper, we extend this idea to using playlist data as a source of music similarity information and investigate three approaches to generate anchor and positive track pairs. We evaluate these approaches by fine-tuning the pre-trained models for music multi-label classification tasks (genre, mood, and instrument tagging) and music similarity. We find that creating anchor and positive track pairs by relying on co-occurrences in playlists provides better music similarity and competitive classification results compared to choosing tracks from the same artist as in previous works. Additionally, our best pre-training approach based on playlists provides superior classification performance for most datasets.

Submitted 24 April, 2023; originally announced April 2023.

Comments: Accepted at the 2023 International Conference on Acoustics, Speech, and Signal Processing (ICASSP'23)

arXiv:2302.12258 [pdf, other]

doi 10.1109/ICASSP49357.2023.10094617

Data leakage in cross-modal retrieval training: A case study

Authors: Benno Weck, Xavier Serra

Abstract: The recent progress in text-based audio retrieval was largely propelled by the release of suitable datasets. Since the manual creation of such datasets is a laborious task, obtaining data from online resources can be a cheap solution to create large-scale datasets. We study the recently proposed SoundDesc benchmark dataset, which was automatically sourced from the BBC Sound Effects web page. In our analysis, we find that SoundDesc contains several duplicates that cause leakage of training data to the evaluation data. This data leakage ultimately leads to overly optimistic retrieval performance estimates in previous benchmarks. We propose new training, validation, and testing splits for the dataset that we make available online. To avoid weak contamination of the test data, we pool audio files that share similar recording setups. In our experiments, we find that the new splits serve as a more challenging benchmark.

Submitted 23 February, 2023; originally announced February 2023.

Comments: 5 pages. Accepted at ICASSP2023

arXiv:2211.08367 [pdf, other]

FlowGrad: Using Motion for Visual Sound Source Localization

Authors: Rajsuryan Singh, Pablo Zinemanas, Xavier Serra, Juan Pablo Bello, Magdalena Fuentes

Abstract: Most recent work in visual sound source localization relies on semantic audio-visual representations learned in a self-supervised manner, and by design excludes temporal information present in videos. While it proves to be effective for widely used benchmark datasets, the method falls short for challenging scenarios like urban traffic. This work introduces temporal context into the state-of-the-art methods for sound source localization in urban scenes using optical flow as a means to encode motion information. An analysis of the strengths and weaknesses of our methods helps us better understand the problem of visual sound source localization and sheds light on open challenges for audio-visual scene understanding.

Submitted 14 April, 2023; v1 submitted 15 November, 2022; originally announced November 2022.

Comments: Accepted in ICASSP 2023

arXiv:2210.02833 [pdf, other]

Matching Text and Audio Embeddings: Exploring Transfer-learning Strategies for Language-based Audio Retrieval

Authors: Benno Weck, Miguel Pérez Fernández, Holger Kirchhoff, Xavier Serra

Abstract: We present an analysis of large-scale pretrained deep learning models used for cross-modal (text-to-audio) retrieval. We use embeddings extracted by these models in a metric learning framework to connect matching pairs of audio and text. Shallow neural networks map the embeddings to a common dimensionality. Our system, which is an extension of our submission to the Language-based Audio Retrieval Task of the DCASE Challenge 2022, employs the RoBERTa foundation model as the text embedding extractor. A pretrained PANNs model extracts the audio embeddings. To improve the generalisation of our model, we investigate how pretraining with audio and associated noisy text collected from the online platform Freesound improves the performance of our method. Furthermore, our ablation study reveals that the proper choice of the loss function and fine-tuning the pretrained models are essential in training a competitive retrieval system.

Submitted 6 October, 2022; originally announced October 2022.

Comments: 5 pages, 2 figures. Accepted at Detection and Classification of Acoustic Scenes and Events 2022 (DCASE2022)

arXiv:2207.10947 [pdf, other]

doi 10.1016/j.patcog.2022.109190

Multilabel Prototype Generation for Data Reduction in k-Nearest Neighbour classification

Authors: Jose J. Valero-Mas, Antonio Javier Gallego, Pablo Alonso-Jiménez, Xavier Serra

Abstract: Prototype Generation (PG) methods are typically considered for improving the efficiency of the $k$-Nearest Neighbour ($k$NN) classifier when tackling high-size corpora. Such approaches aim at generating a reduced version of the corpus without decreasing the classification performance when compared to the initial set. Despite their large application in multiclass scenarios, very few works have addressed the proposal of PG methods for the multilabel space. In this regard, this work presents the novel adaptation of four multiclass PG strategies to the multilabel case. These proposals are evaluated with three multilabel $k$NN-based classifiers, 12 corpora comprising a varied range of domains and corpus sizes, and different noise scenarios artificially induced in the data. The results obtained show that the proposed adaptations are capable of significantly improving -- both in terms of efficiency and classification performance -- the only reference multilabel PG work in the literature as well as the case in which no PG method is applied, also presenting a statistically superior robustness in noisy scenarios. Moreover, these novel PG strategies allow prioritising either the efficiency or efficacy criteria through their configuration depending on the target scenario, hence covering a wide area in the solution space not previously filled by other works.

Submitted 22 July, 2022; originally announced July 2022.

Journal ref: Pattern Recognition, Vol. 135, 2023

arXiv:2203.13010 [pdf, other]

Score difficulty analysis for piano performance education based on fingering

Authors: Pedro Ramoneda, Nazif Can Tamer, Vsevolod Eremenko, Xavier Serra, Marius Miron

Abstract: In this paper, we introduce score difficulty classification as a sub-task of music information retrieval (MIR), which may be used in music education technologies, for personalised curriculum generation, and score retrieval. We introduce a novel dataset for our task, Mikrokosmos-difficulty, containing 147 piano pieces in symbolic representation and the corresponding difficulty labels derived by its composer Béla Bartók and the publishers. As part of our methodology, we propose piano technique feature representations based on different piano fingering algorithms. We use these features as input for two classifiers: a Gated Recurrent Unit neural network (GRU) with attention mechanism and gradient-boosted trees trained on score segments. We show that for our dataset fingering-based features perform better than a simple baseline considering solely the notes in the score. Furthermore, the GRU with attention mechanism classifier surpasses the gradient-boosted trees. Our proposed models are interpretable and are capable of generating difficulty feedback both locally, on short term segments, and globally, for whole pieces. Code, datasets, models, and an online demo are made available for reproducibility.

Submitted 24 March, 2022; originally announced March 2022.

arXiv:2111.13468 [pdf, other]

Emotion Embedding Spaces for Matching Music to Stories

Authors: Minz Won, Justin Salamon, Nicholas J. Bryan, Gautham J. Mysore, Xavier Serra

Abstract: Content creators often use music to enhance their stories, as it can be a powerful tool to convey emotion. In this paper, our goal is to help creators find music to match the emotion of their story. We focus on text-based stories that can be auralized (e.g., books), use multiple sentences as input queries, and automatically retrieve matching music. We formalize this task as a cross-modal text-to-music retrieval problem. Both the music and text domains have existing datasets with emotion labels, but mismatched emotion vocabularies prevent us from using mood or emotion annotations directly for matching. To address this challenge, we propose and investigate several emotion embedding spaces, both manually defined (e.g., valence/arousal) and data-driven (e.g., Word2Vec and metric learning) to bridge this gap. Our experiments show that by leveraging these embedding spaces, we are able to successfully bridge the gap between modalities to facilitate cross modal retrieval. We show that our method can leverage the well established valence-arousal space, but that it can also achieve our goal via data-driven embedding spaces. By leveraging data-driven embeddings, our approach has the potential of being generalized to other retrieval tasks that require broader or completely different vocabularies.

Submitted 26 November, 2021; originally announced November 2021.

Comments: International Society for Music Information Retrieval (ISMIR) 2021, Best Student Paper

arXiv:2111.13457 [pdf, other]

Semi-Supervised Music Tagging Transformer

Authors: Minz Won, Keunwoo Choi, Xavier Serra

Abstract: We present Music Tagging Transformer that is trained with a semi-supervised approach. The proposed model captures local acoustic characteristics in shallow convolutional layers, then temporally summarizes the sequence of the extracted features using stacked self-attention layers. Through a careful model assessment, we first show that the proposed architecture outperforms the previous state-of-the-art music tagging models that are based on convolutional neural networks under a supervised scheme. The Music Tagging Transformer is further improved by noisy student training, a semi-supervised approach that leverages both labeled and unlabeled data combined with data augmentation. To our best knowledge, this is the first attempt to utilize the entire audio of the million song dataset.

Submitted 26 November, 2021; originally announced November 2021.

Comments: International Society for Music Information Retrieval (ISMIR) 2021

arXiv:2111.08009 [pdf, other]

Piano Fingering with Reinforcement Learning

Authors: Pedro Ramoneda, Marius Miron, Xavier Serra

Abstract: Hand and finger movements are a mainstay of piano technique. Automatic Fingering from symbolic music data allows us to simulate finger and hand movements. Previous proposals achieve automatic piano fingering based on knowledge-driven or data-driven techniques. We combine both approaches with deep reinforcement learning techniques to derive piano fingering. Finally, we explore how to incorporate past experience into reinforcement learning-based piano fingering in further work.

Submitted 15 November, 2021; originally announced November 2021.

arXiv:2110.07410 [pdf, other]

Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning

Authors: Benno Weck, Xavier Favory, Konstantinos Drossos, Xavier Serra

Abstract: Automated audio captioning (AAC) is the task of automatically generating textual descriptions for general audio signals. A captioning system has to identify various information from the input signal and express it with natural language. Existing works mainly focus on investigating new methods and try to improve their performance measured on existing datasets. Having attracted attention only recently, very few works on AAC study the performance of existing pre-trained audio and natural language processing resources. In this paper, we evaluate the performance of off-the-shelf models with a Transformer-based captioning approach. We utilize the freely available Clotho dataset to compare four different pre-trained machine listening models, four word embedding models, and their combinations in many different settings. Our evaluation suggests that YAMNet combined with BERT embeddings produces the best captions. Moreover, in general, fine-tuning pre-trained word embeddings can lead to better performance. Finally, we show that sequences of audio embeddings can be processed using a Transformer encoder to produce higher-quality captions.

Submitted 14 October, 2021; originally announced October 2021.

Comments: 5 pages, 4 figures. Accepted at Detection and Classification of Acoustic Scenes and Events 2021 (DCASE2021)

arXiv:2109.12690 [pdf, ps, other]

Soundata: A Python library for reproducible use of audio datasets

Authors: Magdalena Fuentes, Justin Salamon, Pablo Zinemanas, Martín Rocamora, Genís Paja, Irán R. Román, Marius Miron, Xavier Serra, Juan Pablo Bello

Abstract: Soundata is a Python library for loading and working with audio datasets in a standardized way, removing the need for writing custom loaders in every project, and improving reproducibility by providing tools to validate data against a canonical version. It speeds up research pipelines by allowing users to quickly download a dataset, load it into memory in a standardized and reproducible way, validate that the dataset is complete and correct, and more. Soundata is based on and inspired by mirdata, and is designed to complement mirdata by working with environmental sound, bioacoustic and speech datasets, among others. Soundata was created to be easy to use, easy to contribute to, and to increase reproducibility and standardize usage of sound datasets in a flexible way.

Submitted 4 October, 2021; v1 submitted 26 September, 2021; originally announced September 2021.

arXiv:2107.00623 [pdf, other]

Improving Sound Event Classification by Increasing Shift Invariance in Convolutional Neural Networks

Authors: Eduardo Fonseca, Andres Ferraro, Xavier Serra

Abstract: Recent studies have put into question the commonly assumed shift invariance property of convolutional networks, showing that small shifts in the input can affect the output predictions substantially. In this paper, we analyze the benefits of addressing lack of shift invariance in CNN-based sound event classification. Specifically, we evaluate two pooling methods to improve shift invariance in CNNs, based on low-pass filtering and adaptive sampling of incoming feature maps. These methods are implemented via small architectural modifications inserted into the pooling layers of CNNs. We evaluate the effect of these architectural changes on the FSD50K dataset using models of different capacity and in presence of strong regularization. We show that these modifications consistently improve sound event classification in all cases considered. We also demonstrate empirically that the proposed pooling methods increase shift invariance in the network, making it more robust against time/frequency shifts in input spectrograms. This is achieved by adding a negligible amount of trainable parameters, which makes these methods an appealing alternative to conventional pooling layers. The outcome is a new state-of-the-art mAP of 0.541 on the FSD50K classification benchmark.

Submitted 22 July, 2021; v1 submitted 1 July, 2021; originally announced July 2021.

arXiv:2106.02415 [pdf, ps, other]

What is fair? Exploring the artists' perspective on the fairness of music streaming platforms

Authors: Andres Ferraro, Xavier Serra, Christine Bauer

Abstract: Music streaming platforms are currently among the main sources of music consumption, and the embedded recommender systems significantly influence what the users consume. There is an increasing interest to ensure that those platforms and systems are fair. Yet, we first need to understand what fairness means in such a context. Although artists are the main content providers for music platforms, there is a research gap concerning the artists' perspective. To fill this gap, we conducted interviews with music artists to understand how they are affected by current platforms and what improvements they deem necessary. Using a Qualitative Content Analysis, we identify the aspects that the artists consider relevant for fair platforms. In this paper, we discuss the following aspects derived from the interviews: fragmented presentation, reaching an audience, transparency, influencing users' listening behavior, popularity bias, artists' repertoire size, quotas for local music, gender balance, and new music. For some topics, our findings do not indicate a clear direction about the best way how music platforms should act and function; for other topics, though, there is a clear consensus among our interviewees: for these, the artists have a clear idea of the actions that should be taken so that music platforms will be fair also for the artists.

Submitted 4 June, 2021; originally announced June 2021.

Journal ref: Proceedings of the 18th IFIP International Conference on Human-Computer Interaction (INTERACT 2021)

arXiv:2105.10371 [pdf, other]

LoopNet: Musical Loop Synthesis Conditioned On Intuitive Musical Parameters

Authors: Pritish Chandna, António Ramires, Xavier Serra, Emilia Gómez

Abstract: Loops, seamlessly repeatable musical segments, are a cornerstone of modern music production. Contemporary artists often mix and match various sampled or pre-recorded loops based on musical criteria such as rhythm, harmony and timbral texture to create compositions. Taking such criteria into account, we present LoopNet, a feed-forward generative model for creating loops conditioned on intuitive parameters. We leverage Music Information Retrieval (MIR) models as well as a large collection of public loop samples in our study and use the Wave-U-Net architecture to map control parameters to audio. We also evaluate the quality of the generated audio and propose intuitive controls for composers to map the ideas in their minds to an audio loop.

Submitted 21 May, 2021; originally announced May 2021.

arXiv:2105.02132 [pdf, other]

Self-Supervised Learning from Automatically Separated Sound Scenes

Authors: Eduardo Fonseca, Aren Jansen, Daniel P. W. Ellis, Scott Wisdom, Marco Tagliasacchi, John R. Hershey, Manoj Plakal, Shawn Hershey, R. Channing Moore, Xavier Serra

Abstract: Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and each other is semantically constrained: the sound scene contains the union of source classes and not all classes naturally co-occur. With this motivation, this paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically-linked views for use in self-supervised contrastive learning. We find that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone. Further, we discover that optimal source separation is not required for successful contrastive learning by demonstrating that a range of separation system convergence states all lead to useful and often complementary example transformations. Our best system incorporates these unsupervised separation models into a single augmentation front-end and jointly optimizes similarity maximization and coincidence prediction objectives across the views. The result is an unsupervised audio representation that rivals state-of-the-art alternatives on the established shallow AudioSet classification benchmark.

Submitted 14 September, 2021; v1 submitted 5 May, 2021; originally announced May 2021.

arXiv:2102.00201 [pdf, other]

Melon Playlist Dataset: a public dataset for audio-based playlist generation and music tagging

Authors: Andres Ferraro, Yuntae Kim, Soohyeon Lee, Biho Kim, Namjun Jo, Semi Lim, Suyon Lim, Jungtaek Jang, Sehwan Kim, Xavier Serra, Dmitry Bogdanov

Abstract: One of the main limitations in the field of audio signal processing is the lack of large public datasets with audio representations and high-quality annotations due to restrictions of copyrighted commercial music. We present Melon Playlist Dataset, a public dataset of mel-spectrograms for 649,091 tracks and 148,826 associated playlists annotated by 30,652 different tags. All the data is gathered from Melon, a popular Korean streaming service. The dataset is suitable for music information retrieval tasks, in particular, auto-tagging and automatic playlist continuation. Even though the latter can be addressed by collaborative filtering approaches, audio provides opportunities for research on track suggestions and building systems resistant to the cold-start problem, for which we provide a baseline. Moreover, the playlists and the annotations included in the Melon Playlist Dataset make it suitable for metric learning and representation learning.

Submitted 30 January, 2021; originally announced February 2021.

Comments: 2021 IEEE International Conference on Acoustics, Speech and Signal Processing

arXiv:2011.07616 [pdf, other]

Unsupervised Contrastive Learning of Sound Event Representations

Authors: Eduardo Fonseca, Diego Ortego, Kevin McGuinness, Noel E. O'Connor, Xavier Serra

Abstract: Self-supervised representation learning can mitigate the limitations in recognition tasks with few manually labeled data but abundant unlabeled data---a common scenario in sound event research. In this work, we explore unsupervised contrastive learning as a way to learn sound event representations. To this end, we propose to use the pretext task of contrasting differently augmented views of sound events. The views are computed primarily via mixing of training examples with unrelated backgrounds, followed by other data augmentations. We analyze the main components of our method via ablation experiments. We evaluate the learned representations using linear evaluation, and in two in-domain downstream sound event classification tasks, namely, using limited manually labeled data, and using noisy labeled data. Our results suggest that unsupervised contrastive pre-training can mitigate the impact of data scarcity and increase robustness against noisy labels, outperforming supervised baselines.

Submitted 15 November, 2020; originally announced November 2020.

Comments: A 4-page version is submitted to ICASSP 2021

arXiv:2010.16030 [pdf, other]

Multimodal Metric Learning for Tag-based Music Retrieval

Authors: Minz Won, Sergio Oramas, Oriol Nieto, Fabien Gouyon, Xavier Serra

Abstract: Tag-based music retrieval is crucial to browse large-scale music libraries efficiently. Hence, automatic music tagging has been actively explored, mostly as a classification task, which has an inherent limitation: a fixed vocabulary. On the other hand, metric learning enables flexible vocabularies by using pretrained word embeddings as side information. Also, metric learning has already proven its suitability for cross-modal retrieval tasks in other domains (e.g., text-to-image) by jointly learning a multimodal embedding space. In this paper, we investigate three ideas to successfully introduce multimodal metric learning for tag-based music retrieval: elaborate triplet sampling, acoustic and cultural music information, and domain-specific word embeddings. Our experimental results show that the proposed ideas enhance the retrieval system quantitatively, and qualitatively. Furthermore, we release the MSD500, a subset of the Million Song Dataset (MSD) containing 500 cleaned tags, 7 manually annotated tag categories, and user taste profiles.

Submitted 29 October, 2020; originally announced October 2020.

Comments: 5 pages, 2 figures, submitted to ICASSP 2021

arXiv:2010.14171 [pdf, other]

Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags

Authors: Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, Xavier Serra

Abstract: Self-supervised audio representation learning offers an attractive alternative for obtaining generic audio embeddings that can be employed in various downstream tasks. Published approaches that consider both audio and words/tags associated with audio do not employ text processing models that are capable of generalizing to tags unknown during training. In this work we propose a method for learning audio representations using an audio autoencoder (AAE), a general word embeddings model (WEM), and a multi-head self-attention (MHA) mechanism. MHA attends on the output of the WEM, providing a contextualized representation of the tags associated with the audio, and we align the output of MHA with the output of the encoder of AAE using a contrastive loss. We jointly optimize AAE and MHA and we evaluate the audio representations (i.e. the output of the encoder of AAE) by utilizing them in three different downstream tasks, namely sound, music genre, and music instrument classification. Our results show that employing multi-head self-attention with multiple heads in the tag-based network can induce better learned audio representations.

Submitted 27 October, 2020; originally announced October 2020.

Comments: 5 pages, 1 figure

arXiv:2010.00475 [pdf, other]

FSD50K: An Open Dataset of Human-Labeled Sound Events

Authors: Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Xavier Serra

Abstract: Most existing datasets for sound event recognition (SER) are relatively small and/or domain-specific, with the exception of AudioSet, based on over 2M tracks from YouTube videos and encompassing over 500 sound classes. However, AudioSet is not an open dataset as its official release consists of pre-computed audio features. Downloading the original audio tracks can be problematic due to YouTube videos gradually disappearing and usage rights issues. To provide an alternative benchmark dataset and thus foster SER research, we introduce FSD50K, an open dataset containing over 51k audio clips totalling over 100h of audio manually labeled using 200 classes drawn from the AudioSet Ontology. The audio clips are licensed under Creative Commons licenses, making the dataset freely distributable (including waveforms). We provide a detailed description of the FSD50K creation process, tailored to the particularities of Freesound data, including challenges encountered and solutions adopted. We include a comprehensive dataset characterization along with discussion of limitations and key factors to allow its audio-informed usage. Finally, we conduct sound event classification experiments to provide baseline systems as well as insight on the main factors to consider when splitting Freesound audio data for SER. Our goal is to develop a dataset to be widely adopted by the community as a new open benchmark for SER research.

Submitted 23 April, 2022; v1 submitted 1 October, 2020; originally announced October 2020.

Comments: Accepted version in TASLP. Main updates include: estimation of the amount of label noise in FSD50K, SNR comparison between FSD50K and AudioSet, improved description of evaluation metrics including equations, clarification of experimental methodology and some results, some content moved to Appendix for readability. https://ieeexplore.ieee.org/document/9645159

arXiv:2008.11529 [pdf, other]

TIV.lib: an open-source library for the tonal description of musical audio

Authors: António Ramires, Gilberto Bernardes, Matthew E. P. Davies, Xavier Serra

Abstract: In this paper, we present TIV.lib, an open-source library for the content-based tonal description of musical audio signals. Its main novelty relies on the perceptually-inspired Tonal Interval Vector space based on the Discrete Fourier transform, from which multiple instantaneous and global representations, descriptors and metrics are computed - e.g., harmonic change, dissonance, diatonicity, and musical key. The library is cross-platform, implemented in Python and the graphical programming language Pure Data, and can be used in both online and offline scenarios. Of note is its potential for enhanced Music Information Retrieval, where tonal descriptors sit at the core of numerous methods and applications.

Submitted 26 August, 2020; originally announced August 2020.

arXiv:2008.11507 [pdf, other]

The Freesound Loop Dataset and Annotation Tool

Authors: Antonio Ramires, Frederic Font, Dmitry Bogdanov, Jordan B. L. Smith, Yi-Hsuan Yang, Joann Ching, Bo-Yu Chen, Yueh-Kao Wu, Hsu Wei-Han, Xavier Serra

Abstract: Music loops are essential ingredients in electronic music production, and there is a high demand for pre-recorded loops in a variety of styles. Several commercial and community databases have been created to meet this demand, but most are not suitable for research due to their strict licensing. We present the Freesound Loop Dataset (FSLD), a new large-scale dataset of music loops annotated by experts. The loops originate from Freesound, a community database of audio recordings released under Creative Commons licenses, so the audio in our dataset may be redistributed. The annotations include instrument, tempo, meter, key and genre tags. We describe the methodology used to assemble and annotate the data, and report on the distribution of tags in the data and inter-annotator agreement. We also present to the community an online loop annotator tool that we developed. To illustrate the usefulness of FSLD, we present short case studies on using it to estimate tempo and key, generate music tracks, and evaluate a loop separation algorithm. We anticipate that the community will find yet more uses for the data, in applications from automatic loop characterisation to algorithmic composition.

Submitted 23 September, 2020; v1 submitted 26 August, 2020; originally announced August 2020.

Comments: This work will be presented in the 21st International Society for Music Information Retrieval (ISMIR2020). Annotator website: http://mtg.upf.edu/fslannotator Dataset: https://zenodo.org/record/3967852

arXiv:2008.07226 [pdf, other]

doi 10.1145/3383313.3412213

Exploring Longitudinal Effects of Session-based Recommendations

Authors: Andres Ferraro, Dietmar Jannach, Xavier Serra

Abstract: Session-based recommendation is a problem setting where the task of a recommender system is to make suitable item suggestions based only on a few observed user interactions in an ongoing session. The lack of long-term preference information about individual users in such settings usually results in a limited level of personalization, where a small set of popular items may be recommended to many users. This repeated exposure of such a subset of the items through the recommendations may in turn lead to a reinforcement effect over time, and to a system which is not able to help users discover new content anymore to the desirable extent. In this work, we investigate such potential longitudinal effects of session-based recommendations in a simulation-based approach. Specifically, we analyze to what extent algorithms of different types may lead to concentration effects over time. Our experiments in the music domain reveal that all investigated algorithms---both neural and heuristic ones---may lead to lower item coverage and to a higher concentration on a subset of the items. Additional simulation experiments however also indicate that relatively simple re-ranking strategies, e.g., by avoiding too many repeated recommendations in the music domain, may help to deal with this problem.

Submitted 17 August, 2020; originally announced August 2020.

Comments: The 14th ACM Conference on Recommender Systems

arXiv:2006.08386 [pdf, other]

COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations

Authors: Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, Xavier Serra

Abstract: Audio representation learning based on deep neural networks (DNNs) emerged as an alternative approach to hand-crafted features. For achieving high performance, DNNs often need a large amount of annotated data which can be difficult and costly to obtain. In this paper, we propose a method for learning audio representations, aligning the learned latent representations of audio and associated tags. Aligning is done by maximizing the agreement of the latent representations of audio and tags, using a contrastive loss. The result is an audio embedding model which reflects acoustic and semantic characteristics of sounds. We evaluate the quality of our embedding model, measuring its performance as a feature extractor on three different tasks (namely, sound event recognition, and music genre and musical instrument classification), and investigate what type of characteristics the model captures. Our results are promising, sometimes on par with the state of the art in the considered tasks, and the embeddings produced with our method are well correlated with some acoustic descriptors.

Submitted 8 July, 2020; v1 submitted 15 June, 2020; originally announced June 2020.

Comments: 8 pages, 1 figure, workshop on Self-supervision in Audio and Speech at the 37th International Conference on Machine Learning (ICML), 2020, Vienna, Austria

arXiv:2006.00751 [pdf, other]

Evaluation of CNN-based Automatic Music Tagging Models

Authors: Minz Won, Andres Ferraro, Dmitry Bogdanov, Xavier Serra

Abstract: Recent advances in deep learning accelerated the development of content-based automatic music tagging systems. Music information retrieval (MIR) researchers proposed various architecture designs, mainly based on convolutional neural networks (CNNs), that achieve state-of-the-art results in this multi-label binary classification task. However, due to the differences in experimental setups followed by researchers, such as using different dataset splits and software versions for evaluation, it is difficult to compare the proposed architectures directly with each other. To facilitate further research, in this paper we conduct a consistent evaluation of different music tagging models on three datasets (MagnaTagATune, Million Song Dataset, and MTG-Jamendo) and provide reference results using common evaluation metrics (ROC-AUC and PR-AUC). Furthermore, all the models are evaluated with perturbed inputs to investigate the generalization capabilities concerning time stretch, pitch shift, dynamic range compression, and addition of white noise. For reproducibility, we provide the PyTorch implementations with the pre-trained models.

Submitted 1 June, 2020; originally announced June 2020.

Comments: 7 pages, 2 figures, Sound and Music Computing 2020 (SMC 2020)
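As a reading aid for the ROC-AUC metric named in the abstract above, here is a minimal sketch in plain Python. It is not the authors' evaluation code: the function names `roc_auc` and `macro_roc_auc` are hypothetical, and the macro-average over tags is an assumed (though common) way to summarise a multi-label tagging task. ROC-AUC is computed from its pairwise definition, the probability that a randomly chosen positive example is scored above a randomly chosen negative one.

```python
def roc_auc(labels, scores):
    """ROC-AUC for one tag via the pairwise definition (ties count as 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_roc_auc(label_matrix, score_matrix):
    """Average per-tag ROC-AUC; rows are tracks, columns are tags (assumed layout)."""
    num_tags = len(label_matrix[0])
    per_tag = [
        roc_auc([row[t] for row in label_matrix],
                [row[t] for row in score_matrix])
        for t in range(num_tags)
    ]
    return sum(per_tag) / num_tags
```

The quadratic pairwise loop is chosen for clarity; a rank-based formulation gives the same value more efficiently on large test sets.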

arXiv:2005.00878 [pdf, other]

doi 10.1109/LSP.2020.3006378

Addressing Missing Labels in Large-Scale Sound Event Recognition Using a Teacher-Student Framework With Loss Masking

Authors: Eduardo Fonseca, Shawn Hershey, Manoj Plakal, Daniel P. W. Ellis, Aren Jansen, R. Channing Moore, Xavier Serra

Abstract: The study of label noise in sound event recognition has recently gained attention with the advent of larger and noisier datasets. This work addresses the problem of missing labels, one of the big weaknesses of large audio datasets, and one of the most conspicuous issues for AudioSet. We propose a simple and model-agnostic method based on a teacher-student framework with loss masking to first identify the most critical missing label candidates, and then ignore their contribution during the learning process. We find that a simple optimisation of the training label set improves recognition performance without additional computation. We discover that most of the improvement comes from ignoring a critical tiny portion of the missing labels. We also show that the damage done by missing labels is larger as the training set gets smaller, yet it can still be observed even when training with massive amounts of audio. We believe these insights can generalize to other large-scale datasets.

Submitted 25 July, 2020; v1 submitted 2 May, 2020; originally announced May 2020.

Comments: Accepted in IEEE Signal Processing Letters, openly accessible at https://ieeexplore.ieee.org/document/9130823

Journal ref: IEEE Signal Processing Letters, Vol. 27, 2020, pages 1235-1239
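The loss-masking idea in the abstract above can be illustrated with a small sketch. This is an assumption-laden illustration rather than the paper's procedure: the function name `masked_bce`, the fixed `threshold`, and the rule "a negative label that the teacher scores highly is a missing-label candidate" are all hypothetical choices made here for clarity.

```python
import math


def masked_bce(targets, student_probs, teacher_probs, threshold=0.9):
    """Binary cross-entropy that skips suspected missing labels.

    A label is skipped when it is marked negative (0) but the teacher
    model's probability is at or above `threshold` (assumed rule).
    """
    eps = 1e-7  # keeps log() finite at probabilities of exactly 0 or 1
    total, kept = 0.0, 0
    for t, p, q in zip(targets, student_probs, teacher_probs):
        if t == 0 and q >= threshold:
            continue  # suspected missing label: excluded from the loss
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
        kept += 1
    return total / kept if kept else 0.0
```

In this sketch only the loss computation changes; the student model itself is untouched, which is what makes the approach model-agnostic.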

arXiv:2004.03985 [pdf, other]

doi 10.1145/3372278.3390691

Search Result Clustering in Collaborative Sound Collections

Authors: Xavier Favory, Frederic Font, Xavier Serra

Abstract: The large size of today's online multimedia databases makes retrieving their content a difficult and time-consuming task. Users of online sound collections typically submit search queries that express a broad intent, often making the system return large and unmanageable result sets. Search Result Clustering is a technique that organises search-result content into coherent groups, which allows users to identify useful subsets in their results. Obtaining coherent and distinctive clusters that can be explored with a suitable interface is crucial for making this technique a useful complement of traditional search engines. In our work, we propose a graph-based approach using audio features for clustering diverse sound collections obtained when querying large online databases. We propose an approach to assess the performance of different features at scale, by taking advantage of the metadata associated with each sound. This analysis is complemented with an evaluation using ground-truth labels from manually annotated datasets. We show that using a confidence measure for discarding inconsistent clusters improves the quality of the partitions. After identifying the most appropriate features for clustering, we conduct an experiment with users performing a sound design task, in order to evaluate our approach and its user interface. A qualitative analysis is carried out including usability questionnaires and semi-structured interviews. This provides us with valuable new insights regarding the features that promote efficient interaction with the clusters.

Submitted 8 April, 2020; originally announced April 2020.

Comments: 8 pages, 4 figures, Proceedings of the 2020 International Conference on Multimedia Retrieval (ICMR 20), June 8-11, 2020, Dublin, Ireland. ACM, New York, NY, USA

ACM Class: H.3.3

arXiv:2003.07393 [pdf, ps, other]

TensorFlow Audio Models in Essentia

Authors: Pablo Alonso-Jiménez, Dmitry Bogdanov, Jordi Pons, Xavier Serra

Abstract: Essentia is a reference open-source C++/Python library for audio and music analysis. In this work, we present a set of algorithms that employ TensorFlow in Essentia, allow predictions with pre-trained deep learning models, and are designed to offer flexibility of use, easy extensibility, and real-time inference. To show the potential of this new interface with TensorFlow, we provide a number of pre-trained state-of-the-art music tagging and classification CNN models. We run an extensive evaluation of the developed models. In particular, we assess the generalization capabilities in a cross-collection evaluation utilizing both external tag datasets as well as manual annotations tailored to the taxonomies of our models.

Submitted 16 March, 2020; originally announced March 2020.

arXiv:1911.11853 [pdf, other]

Neural Percussive Synthesis Parameterised by High-Level Timbral Features

Authors: António Ramires, Pritish Chandna, Xavier Favory, Emilia Gómez, Xavier Serra

Abstract: We present a deep neural network-based methodology for synthesising percussive sounds with control over high-level timbral characteristics of the sounds. This approach allows for intuitive control of a synthesizer, enabling the user to shape sounds without extensive knowledge of signal processing. We use a feedforward convolutional neural network-based architecture, which is able to map input parameters to the corresponding waveform. We propose two datasets to evaluate our approach in both a restrictive context and one covering a broader spectrum of sounds. The timbral features used as parameters are taken from recent literature in signal processing. We also use these features for evaluation and validation of the presented model, to ensure that changing the input parameters produces a congruent waveform with the desired characteristics. Finally, we evaluate the quality of the output sound using a subjective listening test. We provide sound examples and the system's source code for reproducibility.

Submitted 3 April, 2020; v1 submitted 25 November, 2019; originally announced November 2019.

arXiv:1911.04827 [pdf, other]

Artist and style exposure bias in collaborative filtering based music recommendations

Authors: Andres Ferraro, Dmitry Bogdanov, Xavier Serra, Jason Yoon

Abstract: Algorithms have an increasing influence on the music that we consume and understanding their behavior is fundamental to make sure they give a fair exposure to all artists across different styles. In this on-going work we contribute to this research direction analyzing the impact of collaborative filtering recommendations from the perspective of artist and music style exposure given by the system. We first analyze the distribution of the recommendations considering the exposure of different styles or genres and compare it to the users' listening behavior. This comparison suggests that the system is reinforcing the popularity of the items. Then, we simulate the effect of the system in the long term with a feedback loop. From this simulation we can see how the system gives less opportunity to the majority of artists, concentrating the users on fewer items. The results of our analysis demonstrate the need for a better evaluation methodology for current music recommendation algorithms, not only limited to user-focused relevance metrics.

Submitted 12 November, 2019; originally announced November 2019.

Comments: Presented at Workshop on Designing Human-Centric MIR Systems, ISMIR 2019

arXiv:1911.04824 [pdf, other]

How Low Can You Go? Reducing Frequency and Time Resolution in Current CNN Architectures for Music Auto-tagging

Authors: Andres Ferraro, Dmitry Bogdanov, Xavier Serra, Jay Ho Jeon, Jason Yoon

Abstract: Automatic tagging of music is an important research topic in Music Information Retrieval and audio analysis algorithms proposed for this task have achieved improvements with advances in deep learning. In particular, many state-of-the-art systems use Convolutional Neural Networks and operate on mel-spectrogram representations of the audio. In this paper, we compare commonly used mel-spectrogram representations and evaluate model performances that can be achieved by reducing the input size in terms of both fewer frequency bands and larger frame rates. We use the MagnaTagATune dataset for comprehensive performance comparisons and then compare selected configurations on the larger Million Song Dataset. The results of this study can serve researchers and practitioners in their trade-off decision between accuracy of the models, data storage size and training and inference times.

Submitted 28 June, 2020; v1 submitted 12 November, 2019; originally announced November 2019.

Comments: The 28th European Signal Processing Conference (EUSIPCO)

arXiv:1911.04385 [pdf, other]

Visualizing and Understanding Self-attention based Music Tagging

Authors: Minz Won, Sanghyuk Chun, Xavier Serra

Abstract: Recently, we proposed a self-attention based music tagging model. Different from most of the conventional deep architectures in music information retrieval, which use stacked 3x3 filters by treating music spectrograms as images, the proposed self-attention based model attempted to regard music as a temporal sequence of individual audio events. Beyond its performance, it could also facilitate better interpretability. In this paper, we mainly focus on visualizing and understanding the proposed self-attention based music tagging model.

Submitted 11 November, 2019; originally announced November 2019.

Comments: Machine Learning for Music Discovery Workshop (ML4MD) at ICML 2019

arXiv:1910.12004 [pdf, other]

Model-agnostic Approaches to Handling Noisy Labels When Training Sound Event Classifiers

Authors: Eduardo Fonseca, Frederic Font, Xavier Serra

Abstract: Label noise is emerging as a pressing issue in sound event classification. This arises as we move towards larger datasets that are difficult to annotate manually, but it is even more severe if datasets are collected automatically from online repositories, where labels are inferred through automated heuristics applied to the audio content or metadata. While learning from noisy labels has been an active area of research in computer vision, it has received little attention in sound event classification. Most recent computer vision approaches against label noise are relatively complex, requiring complex networks or extra data resources. In this work, we evaluate simple and efficient model-agnostic approaches to handling noisy labels when training sound event classifiers, namely label smoothing regularization, mixup and noise-robust loss functions. The main advantage of these methods is that they can be easily incorporated into existing deep learning pipelines without need for network modifications or extra resources. We report results from experiments conducted with the FSDnoisy18k dataset. We show that these simple methods can be effective in mitigating the effect of label noise, providing an accuracy boost of up to 2.5% when incorporated into two different CNNs, while requiring minimal intervention and computational overhead.

Submitted 26 October, 2019; originally announced October 2019.

Comments: WASPAA 2019
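Two of the techniques named in the abstract above, label smoothing regularization and mixup, are simple enough to sketch in a few lines of plain Python. This is a generic illustration under assumed defaults, not the paper's implementation: the function names `smooth_labels` and `mixup_pair` and the values `epsilon=0.1` and `alpha=0.2` are hypothetical choices.

```python
import random


def smooth_labels(one_hot, epsilon=0.1):
    """Label smoothing: blend a one-hot target with a uniform distribution."""
    k = len(one_hot)
    return [(1.0 - epsilon) * t + epsilon / k for t in one_hot]


def mixup_pair(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Mixup: convex combination of two examples and of their targets.

    The mixing weight lam is drawn from Beta(alpha, alpha).
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam
```

Both functions only transform the training targets (and, for mixup, the inputs), so they can be applied in a data pipeline without changing the network, which matches the model-agnostic framing of the abstract.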

arXiv:1909.06654 [pdf, other]

musicnn: Pre-trained convolutional neural networks for music audio tagging

Authors: Jordi Pons, Xavier Serra

Abstract: Pronounced as "musician", the musicnn library contains a set of pre-trained musically motivated convolutional neural networks for music audio tagging: https://github.com/jordipons/musicnn. This repository also includes some pre-trained vgg-like baselines. These models can be used as out-of-the-box music audio taggers, as music feature extractors, or as pre-trained models for transfer learning. We also provide the code to train the aforementioned models: https://github.com/jordipons/musicnn-training. This framework also allows implementing novel models. For example, a musically motivated convolutional neural network with an attention-based output layer (instead of the temporal pooling layer) can achieve state-of-the-art results for music audio tagging: 90.77 ROC-AUC / 38.61 PR-AUC on the MagnaTagATune dataset, and 88.81 ROC-AUC / 31.51 PR-AUC on the Million Song Dataset.

Submitted 14 September, 2019; originally announced September 2019.

Comments: Accepted to be presented at the Late-Breaking/Demo session of ISMIR 2019

arXiv:1908.10133 [pdf, other]

A hybrid parametric-deep learning approach for sound event localization and detection

Authors: Andres Perez-Lopez, Eduardo Fonseca, Xavier Serra

Abstract: This work describes and discusses an algorithm submitted to the Sound Event Localization and Detection Task of DCASE2019 Challenge. The proposed methodology relies on parametric spatial audio analysis for source localization and detection, combined with a deep learning-based monophonic event classifier. The evaluation of the proposed algorithm yields overall results comparable to the baseline system. The main highlight is a reduction of the localization error on the evaluation dataset by a factor of 2.6, compared with the baseline performance.

Submitted 27 August, 2019; originally announced August 2019.

Comments: 5 pages, 5 figures, submitted to DCASE2019 Workshop

arXiv:1907.08520 [pdf, other]

Data Augmentation for Instrument Classification Robust to Audio Effects

Authors: António Ramires, Xavier Serra

Abstract: Reusing recorded sounds (sampling) is a key component in Electronic Music Production (EMP), which has been present since its early days and is at the core of genres like hip-hop or jungle. Commercial and non-commercial services allow users to obtain collections of sounds (sample packs) to reuse in their compositions. Automatic classification of one-shot instrumental sounds allows automatically categorising the sounds contained in these collections, allowing easier navigation and better characterisation. Automatic instrument classification has mostly targeted the classification of unprocessed isolated instrumental sounds or detecting predominant instruments in mixed music tracks. For this classification to be useful in audio databases for EMP, it has to be robust to the audio effects applied to unprocessed sounds. In this paper we evaluate how a state of the art model trained with a large dataset of one-shot instrumental sounds performs when classifying instruments processed with audio effects. In order to evaluate the robustness of the model, we use data augmentation with audio effects and evaluate how each effect influences the classification accuracy.

Submitted 19 July, 2019; originally announced July 2019.

arXiv:1906.04972 [pdf, other]

Toward Interpretable Music Tagging with Self-Attention

Authors: Minz Won, Sanghyuk Chun, Xavier Serra

Abstract: Self-attention is an attention mechanism that learns a representation by relating different positions in the sequence. The transformer, which is a sequence model solely based on self-attention, and its variants achieved state-of-the-art results in many natural language processing tasks. Since music composes its semantics based on the relations between components in sparse positions, adopting the self-attention mechanism to solve music information retrieval (MIR) problems can be beneficial. Hence, we propose a self-attention based deep sequence model for music tagging. The proposed architecture consists of shallow convolutional layers followed by stacked Transformer encoders. Compared to conventional approaches using fully convolutional or recurrent neural networks, our model is more interpretable while reporting competitive results. We validate the performance of our model with the MagnaTagATune and the Million Song Dataset. In addition, we demonstrate the interpretability of the proposed architecture with a heat map visualization.

Submitted 12 June, 2019; originally announced June 2019.

Comments: 13 pages, 12 figures; code: https://github.com/minzwon/self-attention-music-tagging

arXiv:1906.02975 [pdf, other]

Audio tagging with noisy labels and minimal supervision

Authors: Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Serra

Abstract: This paper introduces Task 2 of the DCASE2019 Challenge, titled "Audio tagging with noisy labels and minimal supervision". This task was hosted on the Kaggle platform as "Freesound Audio Tagging 2019". The task evaluates systems for multi-label audio tagging using a large set of noisy-labeled data, and a much smaller set of manually-labeled data, under a large vocabulary setting of 80 everyday sound classes. In addition, the proposed dataset poses an acoustic mismatch problem between the noisy train set and the test set due to the fact that they come from different web audio sources. This can correspond to a realistic scenario given by the difficulty in gathering large amounts of manually labeled data. We present the task setup, the FSDKaggle2019 dataset prepared for this scientific evaluation, and a baseline system consisting of a convolutional neural network. All these resources are freely available.

Submitted 19 January, 2020; v1 submitted 7 June, 2019; originally announced June 2019.

Comments: DCASE2019 Workshop

Showing 1–50 of 77 results for author: Serra, X