Showing 1–50 of 77 results for author: Serra, X

Searching in archive cs.
  1. arXiv:2410.00980  [pdf, other]

    cs.SD cs.AI eess.AS

    Heterogeneous sound classification with the Broad Sound Taxonomy and Dataset

    Authors: Panagiota Anastasopoulou, Jessica Torrey, Xavier Serra, Frederic Font

    Abstract: Automatic sound classification has a wide range of applications in machine listening, enabling context-aware sound processing and understanding. This paper explores methodologies for automatically classifying heterogeneous sounds characterized by high intra-class variability. Our study evaluates the classification task using the Broad Sound Taxonomy, a two-level taxonomy comprising 28 classes desi…

    Submitted 1 October, 2024; originally announced October 2024.

    Comments: DCASE2024, post-print, 5 pages, 2 figures

  2. arXiv:2409.01864  [pdf, other]

    cs.SD cs.AI cs.CL cs.DL eess.AS

    The Role of Large Language Models in Musicology: Are We Ready to Trust the Machines?

    Authors: Pedro Ramoneda, Emilia Parada-Cabaleiro, Benno Weck, Xavier Serra

    Abstract: In this work, we explore the use and reliability of Large Language Models (LLMs) in musicology. From a discussion with experts and students, we assess the current acceptance and concerns regarding this, nowadays ubiquitous, technology. We aim to go one step further, proposing a semi-automatic method to create an initial benchmark using retrieval-augmented generation models and multiple-choice ques…

    Submitted 3 September, 2024; originally announced September 2024.

  3. arXiv:2408.00473  [pdf, other]

    cs.SD cs.AI cs.IR eess.AS

    Towards Explainable and Interpretable Musical Difficulty Estimation: A Parameter-efficient Approach

    Authors: Pedro Ramoneda, Vsevolod Eremenko, Alexandre D'Hooge, Emilia Parada-Cabaleiro, Xavier Serra

    Abstract: Estimating music piece difficulty is important for organizing educational music collections. This process could be partially automatized to facilitate the educator's role. Nevertheless, the decisions performed by prevalent deep-learning models are hardly understandable, which may impair the acceptance of such a technology in music education curricula. Our work employs explainable descriptors for d…

    Submitted 1 August, 2024; originally announced August 2024.

  4. arXiv:2407.14364  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS

    Towards Assessing Data Replication in Music Generation with Music Similarity Metrics on Raw Audio

    Authors: Roser Batlle-Roca, Wei-Hsiang Liao, Xavier Serra, Yuki Mitsufuji, Emilia Gómez

    Abstract: Recent advancements in music generation are raising multiple concerns about the implications of AI in creative music processes, current business models and impacts related to intellectual property management. A relevant discussion and related technical challenge is the potential replication and plagiarism of the training set in AI-generated music, which could lead to misuse of data and intellectua…

    Submitted 1 August, 2024; v1 submitted 19 July, 2024; originally announced July 2024.

    Comments: Accepted at ISMIR 2024

  5. arXiv:2403.03947  [pdf, other]

    cs.SD eess.AS

    Can Audio Reveal Music Performance Difficulty? Insights from the Piano Syllabus Dataset

    Authors: Pedro Ramoneda, Minhee Lee, Dasaem Jeong, J. J. Valero-Mas, Xavier Serra

    Abstract: Automatically estimating the performance difficulty of a music piece represents a key process in music education to create tailored curricula according to the individual needs of the students. Given its relevance, the Music Information Retrieval (MIR) field depicts some proof-of-concept works addressing this task that mainly focus on high-level music abstractions such as machine-readable scores…

    Submitted 6 March, 2024; originally announced March 2024.

  6. arXiv:2402.09318  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS

    Leveraging Pre-Trained Autoencoders for Interpretable Prototype Learning of Music Audio

    Authors: Pablo Alonso-Jiménez, Leonardo Pepino, Roser Batlle-Roca, Pablo Zinemanas, Dmitry Bogdanov, Xavier Serra, Martín Rocamora

    Abstract: We present PECMAE, an interpretable model for music audio classification based on prototype learning. Our model is based on a previous method, APNet, which jointly learns an autoencoder and a prototypical network. Instead, we propose to decouple both training processes. This enables us to leverage existing self-supervised autoencoders pre-trained on much larger data (EnCodecMAE), providing represe…

    Submitted 14 February, 2024; originally announced February 2024.

  7. arXiv:2312.09207  [pdf, other]

    cs.CL cs.IR cs.LG cs.SD eess.AS

    WikiMuTe: A web-sourced dataset of semantic descriptions for music audio

    Authors: Benno Weck, Holger Kirchhoff, Peter Grosche, Xavier Serra

    Abstract: Multi-modal deep learning techniques for matching free-form text with music have shown promising results in the field of Music Information Retrieval (MIR). Prior work is often based on large proprietary data while publicly available datasets are few and small in size. In this study, we present WikiMuTe, a new and open dataset containing rich semantic descriptions of music. The data is sourced from…

    Submitted 14 December, 2023; originally announced December 2023.

    Comments: Submitted to 30th International Conference on MultiMedia Modeling (MMM2024). This preprint has not undergone peer review or any post-submission improvements or corrections

    Journal ref: The Version of Record of this contribution is published in MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14565. Springer, Cham

  8. arXiv:2311.08350  [pdf, other]

    cs.SD cs.IR eess.AS

    ChoralSynth: Synthetic Dataset of Choral Singing

    Authors: Jyoti Narang, Viviana De La Vega, Xavier Lizarraga, Oscar Mayor, Hector Parra, Jordi Janer, Xavier Serra

    Abstract: Choral singing, a widely practiced form of ensemble singing, lacks comprehensive datasets in the realm of Music Information Retrieval (MIR) research, due to challenges arising from the requirement to curate multitrack recordings. To address this, we devised a novel methodology, leveraging state-of-the-art synthesizers to create and curate quality renditions. The scores were sourced from Choral Pub…

    Submitted 21 November, 2023; v1 submitted 14 November, 2023; originally announced November 2023.

    Comments: Dataset Link: https://doi.org/10.5281/zenodo.10137883

  9. arXiv:2309.16418  [pdf, other]

    cs.SD eess.AS

    Efficient Supervised Training of Audio Transformers for Music Representation Learning

    Authors: Pablo Alonso-Jiménez, Xavier Serra, Dmitry Bogdanov

    Abstract: In this work, we address music representation learning using convolution-free transformers. We build on top of existing spectrogram-based audio transformers such as AST and train our models on a supervised task using patchout training similar to PaSST. In contrast to previous works, we study how specific design decisions affect downstream music tagging tasks instead of focusing on the training tas…

    Submitted 28 September, 2023; originally announced September 2023.

    Comments: Accepted at the 2023 International Society for Music Information Retrieval Conference (ISMIR'23)

  10. arXiv:2309.16287  [pdf, other]

    cs.SD cs.DL eess.AS

    Predicting performance difficulty from piano sheet music images

    Authors: Pedro Ramoneda, Jose J. Valero-Mas, Dasaem Jeong, Xavier Serra

    Abstract: Estimating the performance difficulty of a musical score is crucial in music education for adequately designing the learning curriculum of the students. Although the Music Information Retrieval community has recently shown interest in this task, existing approaches mainly use machine-readable scores, leaving the broader case of sheet music images unaddressed. Based on previous works involving shee…

    Submitted 28 September, 2023; originally announced September 2023.

  11. arXiv:2307.12888  [pdf, other]

    cs.SD eess.AS

    An objective evaluation of Hearing Aids and DNN-based speech enhancement in complex acoustic scenes

    Authors: Enric Gusó, Joanna Luberadzka, Martí Baig, Umut Sayin Saraç, Xavier Serra

    Abstract: We investigate the objective performance of five high-end commercially available Hearing Aid (HA) devices compared to DNN-based speech enhancement algorithms in complex acoustic environments. To this end, we measure the HRTFs of a single HA device to synthesize a binaural dataset for training two state-of-the-art causal and non-causal DNN enhancement models. We then generate an evaluation set of r…

    Submitted 24 July, 2023; originally announced July 2023.

    Comments: Accepted to WASPAA23

  12. arXiv:2306.08480  [pdf, other]

    cs.SD eess.AS

    Combining piano performance dimensions for score difficulty classification

    Authors: Pedro Ramoneda, Dasaem Jeong, Vsevolod Eremenko, Nazif Can Tamer, Marius Miron, Xavier Serra

    Abstract: Predicting the difficulty of playing a musical score is essential for structuring and exploring score collections. Despite its importance for music education, the automatic difficulty classification of piano scores is not yet solved, mainly due to the lack of annotated data and the subjectiveness of the annotations. This paper aims to advance the state-of-the-art in score difficulty classification…

    Submitted 27 September, 2023; v1 submitted 14 June, 2023; originally announced June 2023.

    Comments: 36 pages

  13. arXiv:2304.12257  [pdf, other]

    cs.SD eess.AS

    Pre-Training Strategies Using Contrastive Learning and Playlist Information for Music Classification and Similarity

    Authors: Pablo Alonso-Jiménez, Xavier Favory, Hadrien Foroughmand, Grigoris Bourdalas, Xavier Serra, Thomas Lidy, Dmitry Bogdanov

    Abstract: In this work, we investigate an approach that relies on contrastive learning and music metadata as a weak source of supervision to train music representation models. Recent studies show that contrastive learning can be used with editorial metadata (e.g., artist or album name) to learn audio representations that are useful for different classification tasks. In this paper, we extend this idea to us…

    Submitted 24 April, 2023; originally announced April 2023.

    Comments: Accepted at the 2023 International Conference on Acoustics, Speech, and Signal Processing (ICASSP'23)

  14. arXiv:2302.12258  [pdf, other]

    cs.SD cs.CL cs.IR cs.LG eess.AS

    Data leakage in cross-modal retrieval training: A case study

    Authors: Benno Weck, Xavier Serra

    Abstract: The recent progress in text-based audio retrieval was largely propelled by the release of suitable datasets. Since the manual creation of such datasets is a laborious task, obtaining data from online resources can be a cheap solution to create large-scale datasets. We study the recently proposed SoundDesc benchmark dataset, which was automatically sourced from the BBC Sound Effects web page. In ou…

    Submitted 23 February, 2023; originally announced February 2023.

    Comments: 5 pages. Accepted at ICASSP2023

  15. arXiv:2211.08367  [pdf, other]

    cs.SD cs.CV cs.MM eess.AS

    FlowGrad: Using Motion for Visual Sound Source Localization

    Authors: Rajsuryan Singh, Pablo Zinemanas, Xavier Serra, Juan Pablo Bello, Magdalena Fuentes

    Abstract: Most recent work in visual sound source localization relies on semantic audio-visual representations learned in a self-supervised manner, and by design excludes temporal information present in videos. While it proves to be effective for widely used benchmark datasets, the method falls short for challenging scenarios like urban traffic. This work introduces temporal context into the state-of-the-ar…

    Submitted 14 April, 2023; v1 submitted 15 November, 2022; originally announced November 2022.

    Comments: Accepted in ICASSP 2023

  16. arXiv:2210.02833  [pdf, other]

    cs.IR cs.CL cs.LG cs.SD eess.AS

    Matching Text and Audio Embeddings: Exploring Transfer-learning Strategies for Language-based Audio Retrieval

    Authors: Benno Weck, Miguel Pérez Fernández, Holger Kirchhoff, Xavier Serra

    Abstract: We present an analysis of large-scale pretrained deep learning models used for cross-modal (text-to-audio) retrieval. We use embeddings extracted by these models in a metric learning framework to connect matching pairs of audio and text. Shallow neural networks map the embeddings to a common dimensionality. Our system, which is an extension of our submission to the Language-based Audio Retrieval T…

    Submitted 6 October, 2022; originally announced October 2022.

    Comments: 5 pages, 2 figures. Accepted at Detection and Classification of Acoustic Scenes and Events 2022 (DCASE2022)

  17. Multilabel Prototype Generation for Data Reduction in k-Nearest Neighbour classification

    Authors: Jose J. Valero-Mas, Antonio Javier Gallego, Pablo Alonso-Jiménez, Xavier Serra

    Abstract: Prototype Generation (PG) methods are typically considered for improving the efficiency of the $k$-Nearest Neighbour ($k$NN) classifier when tackling high-size corpora. Such approaches aim at generating a reduced version of the corpus without decreasing the classification performance when compared to the initial set. Despite their large application in multiclass scenarios, very few works have addr…

    Submitted 22 July, 2022; originally announced July 2022.

    Journal ref: Pattern Recognition, Vol. 135, 2023

  18. arXiv:2203.13010  [pdf, other]

    cs.SD cs.MM eess.AS

    Score difficulty analysis for piano performance education based on fingering

    Authors: Pedro Ramoneda, Nazif Can Tamer, Vsevolod Eremenko, Xavier Serra, Marius Miron

    Abstract: In this paper, we introduce score difficulty classification as a sub-task of music information retrieval (MIR), which may be used in music education technologies, for personalised curriculum generation, and score retrieval. We introduce a novel dataset for our task, Mikrokosmos-difficulty, containing 147 piano pieces in symbolic representation and the corresponding difficulty labels derived by its…

    Submitted 24 March, 2022; originally announced March 2022.

  19. arXiv:2111.13468  [pdf, other]

    cs.IR

    Emotion Embedding Spaces for Matching Music to Stories

    Authors: Minz Won, Justin Salamon, Nicholas J. Bryan, Gautham J. Mysore, Xavier Serra

    Abstract: Content creators often use music to enhance their stories, as it can be a powerful tool to convey emotion. In this paper, our goal is to help creators find music to match the emotion of their story. We focus on text-based stories that can be auralized (e.g., books), use multiple sentences as input queries, and automatically retrieve matching music. We formalize this task as a cross-modal text-to-m…

    Submitted 26 November, 2021; originally announced November 2021.

    Comments: International Society for Music Information Retrieval (ISMIR) 2021, Best Student Paper

  20. arXiv:2111.13457  [pdf, other]

    cs.SD eess.AS

    Semi-Supervised Music Tagging Transformer

    Authors: Minz Won, Keunwoo Choi, Xavier Serra

    Abstract: We present Music Tagging Transformer that is trained with a semi-supervised approach. The proposed model captures local acoustic characteristics in shallow convolutional layers, then temporally summarizes the sequence of the extracted features using stacked self-attention layers. Through a careful model assessment, we first show that the proposed architecture outperforms the previous state-of-the-…

    Submitted 26 November, 2021; originally announced November 2021.

    Comments: International Society for Music Information Retrieval (ISMIR) 2021

  21. arXiv:2111.08009  [pdf, other]

    cs.OH

    Piano Fingering with Reinforcement Learning

    Authors: Pedro Ramoneda, Marius Miron, Xavier Serra

    Abstract: Hand and finger movements are a mainstay of piano technique. Automatic Fingering from symbolic music data allows us to simulate finger and hand movements. Previous proposals achieve automatic piano fingering based on knowledge-driven or data-driven techniques. We combine both approaches with deep reinforcement learning techniques to derive piano fingering. Finally, we explore how to incorporate pa…

    Submitted 15 November, 2021; originally announced November 2021.

  22. arXiv:2110.07410  [pdf, other]

    cs.LG cs.CL cs.SD eess.AS

    Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning

    Authors: Benno Weck, Xavier Favory, Konstantinos Drossos, Xavier Serra

    Abstract: Automated audio captioning (AAC) is the task of automatically generating textual descriptions for general audio signals. A captioning system has to identify various information from the input signal and express it with natural language. Existing works mainly focus on investigating new methods and try to improve their performance measured on existing datasets. Having attracted attention only recent…

    Submitted 14 October, 2021; originally announced October 2021.

    Comments: 5 pages, 4 figures. Accepted at Detection and Classification of Acoustic Scenes and Events 2021 (DCASE2021)

  23. arXiv:2109.12690  [pdf, ps, other]

    cs.SD cs.DB cs.LG eess.AS

    Soundata: A Python library for reproducible use of audio datasets

    Authors: Magdalena Fuentes, Justin Salamon, Pablo Zinemanas, Martín Rocamora, Genís Plaja, Irán R. Román, Marius Miron, Xavier Serra, Juan Pablo Bello

    Abstract: Soundata is a Python library for loading and working with audio datasets in a standardized way, removing the need for writing custom loaders in every project, and improving reproducibility by providing tools to validate data against a canonical version. It speeds up research pipelines by allowing users to quickly download a dataset, load it into memory in a standardized and reproducible way, valid…

    Submitted 4 October, 2021; v1 submitted 26 September, 2021; originally announced September 2021.

  24. arXiv:2107.00623  [pdf, other]

    cs.SD cs.LG eess.AS

    Improving Sound Event Classification by Increasing Shift Invariance in Convolutional Neural Networks

    Authors: Eduardo Fonseca, Andres Ferraro, Xavier Serra

    Abstract: Recent studies have put into question the commonly assumed shift invariance property of convolutional networks, showing that small shifts in the input can affect the output predictions substantially. In this paper, we analyze the benefits of addressing lack of shift invariance in CNN-based sound event classification. Specifically, we evaluate two pooling methods to improve shift invariance in CNNs…

    Submitted 22 July, 2021; v1 submitted 1 July, 2021; originally announced July 2021.

  25. arXiv:2106.02415  [pdf, ps, other]

    cs.HC

    What is fair? Exploring the artists' perspective on the fairness of music streaming platforms

    Authors: Andres Ferraro, Xavier Serra, Christine Bauer

    Abstract: Music streaming platforms are currently among the main sources of music consumption, and the embedded recommender systems significantly influence what the users consume. There is an increasing interest to ensure that those platforms and systems are fair. Yet, we first need to understand what fairness means in such a context. Although artists are the main content providers for music platforms, ther…

    Submitted 4 June, 2021; originally announced June 2021.

    Journal ref: Proceedings of the 18th IFIP International Conference on Human-Computer Interaction (INTERACT 2021)

  26. arXiv:2105.10371  [pdf, other]

    cs.SD cs.LG eess.AS

    LoopNet: Musical Loop Synthesis Conditioned On Intuitive Musical Parameters

    Authors: Pritish Chandna, António Ramires, Xavier Serra, Emilia Gómez

    Abstract: Loops, seamlessly repeatable musical segments, are a cornerstone of modern music production. Contemporary artists often mix and match various sampled or pre-recorded loops based on musical criteria such as rhythm, harmony and timbral texture to create compositions. Taking such criteria into account, we present LoopNet, a feed-forward generative model for creating loops conditioned on intuitive par…

    Submitted 21 May, 2021; originally announced May 2021.

  27. arXiv:2105.02132  [pdf, other]

    cs.SD cs.LG eess.AS

    Self-Supervised Learning from Automatically Separated Sound Scenes

    Authors: Eduardo Fonseca, Aren Jansen, Daniel P. W. Ellis, Scott Wisdom, Marco Tagliasacchi, John R. Hershey, Manoj Plakal, Shawn Hershey, R. Channing Moore, Xavier Serra

    Abstract: Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and each other is semantically constrained: the sound scene contains the union of source classes and not all classes naturally co-occur. With this motivation, this…

    Submitted 14 September, 2021; v1 submitted 5 May, 2021; originally announced May 2021.

  28. arXiv:2102.00201  [pdf, other]

    cs.SD cs.IR cs.LG cs.MM eess.AS

    Melon Playlist Dataset: a public dataset for audio-based playlist generation and music tagging

    Authors: Andres Ferraro, Yuntae Kim, Soohyeon Lee, Biho Kim, Namjun Jo, Semi Lim, Suyon Lim, Jungtaek Jang, Sehwan Kim, Xavier Serra, Dmitry Bogdanov

    Abstract: One of the main limitations in the field of audio signal processing is the lack of large public datasets with audio representations and high-quality annotations due to restrictions of copyrighted commercial music. We present Melon Playlist Dataset, a public dataset of mel-spectrograms for 649,091 tracks and 148,826 associated playlists annotated by 30,652 different tags. All the data is gathered fr…

    Submitted 30 January, 2021; originally announced February 2021.

    Comments: 2021 IEEE International Conference on Acoustics, Speech and Signal Processing

  29. arXiv:2011.07616  [pdf, other]

    cs.SD cs.LG eess.AS

    Unsupervised Contrastive Learning of Sound Event Representations

    Authors: Eduardo Fonseca, Diego Ortego, Kevin McGuinness, Noel E. O'Connor, Xavier Serra

    Abstract: Self-supervised representation learning can mitigate the limitations in recognition tasks with few manually labeled data but abundant unlabeled data, a common scenario in sound event research. In this work, we explore unsupervised contrastive learning as a way to learn sound event representations. To this end, we propose to use the pretext task of contrasting differently augmented views of sound…

    Submitted 15 November, 2020; originally announced November 2020.

    Comments: A 4-page version is submitted to ICASSP 2021

  30. arXiv:2010.16030  [pdf, other]

    cs.IR cs.MM cs.SD eess.AS

    Multimodal Metric Learning for Tag-based Music Retrieval

    Authors: Minz Won, Sergio Oramas, Oriol Nieto, Fabien Gouyon, Xavier Serra

    Abstract: Tag-based music retrieval is crucial to browse large-scale music libraries efficiently. Hence, automatic music tagging has been actively explored, mostly as a classification task, which has an inherent limitation: a fixed vocabulary. On the other hand, metric learning enables flexible vocabularies by using pretrained word embeddings as side information. Also, metric learning has already proven its…

    Submitted 29 October, 2020; originally announced October 2020.

    Comments: 5 pages, 2 figures, submitted to ICASSP 2021

  31. arXiv:2010.14171  [pdf, other]

    cs.SD cs.IR cs.LG eess.AS stat.ML

    Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags

    Authors: Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, Xavier Serra

    Abstract: Self-supervised audio representation learning offers an attractive alternative for obtaining generic audio embeddings that can be employed in various downstream tasks. Published approaches that consider both audio and words/tags associated with audio do not employ text processing models that are capable of generalizing to tags unknown during training. In this work we propose a method for learni…

    Submitted 27 October, 2020; originally announced October 2020.

    Comments: 5 pages, 1 figure

  32. arXiv:2010.00475  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    FSD50K: An Open Dataset of Human-Labeled Sound Events

    Authors: Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Xavier Serra

    Abstract: Most existing datasets for sound event recognition (SER) are relatively small and/or domain-specific, with the exception of AudioSet, based on over 2M tracks from YouTube videos and encompassing over 500 sound classes. However, AudioSet is not an open dataset as its official release consists of pre-computed audio features. Downloading the original audio tracks can be problematic due to YouTube vid…

    Submitted 23 April, 2022; v1 submitted 1 October, 2020; originally announced October 2020.

    Comments: Accepted version in TASLP. Main updates include: estimation of the amount of label noise in FSD50K, SNR comparison between FSD50K and AudioSet, improved description of evaluation metrics including equations, clarification of experimental methodology and some results, some content moved to Appendix for readability. https://ieeexplore.ieee.org/document/9645159

  33. arXiv:2008.11529  [pdf, other]

    eess.AS cs.SD

    TIV.lib: an open-source library for the tonal description of musical audio

    Authors: António Ramires, Gilberto Bernardes, Matthew E. P. Davies, Xavier Serra

    Abstract: In this paper, we present TIV.lib, an open-source library for the content-based tonal description of musical audio signals. Its main novelty relies on the perceptually-inspired Tonal Interval Vector space based on the Discrete Fourier transform, from which multiple instantaneous and global representations, descriptors and metrics are computed - e.g., harmonic change, dissonance, diatonicity, and m…

    Submitted 26 August, 2020; originally announced August 2020.

  34. arXiv:2008.11507  [pdf, other]

    eess.AS cs.SD

    The Freesound Loop Dataset and Annotation Tool

    Authors: Antonio Ramires, Frederic Font, Dmitry Bogdanov, Jordan B. L. Smith, Yi-Hsuan Yang, Joann Ching, Bo-Yu Chen, Yueh-Kao Wu, Hsu Wei-Han, Xavier Serra

    Abstract: Music loops are essential ingredients in electronic music production, and there is a high demand for pre-recorded loops in a variety of styles. Several commercial and community databases have been created to meet this demand, but most are not suitable for research due to their strict licensing. We present the Freesound Loop Dataset (FSLD), a new large-scale dataset of music loops annotated by expe…

    Submitted 23 September, 2020; v1 submitted 26 August, 2020; originally announced August 2020.

    Comments: This work will be presented in the 21st International Society for Music Information Retrieval (ISMIR2020). Annotator website: http://mtg.upf.edu/fslannotator Dataset: https://zenodo.org/record/3967852

  35. Exploring Longitudinal Effects of Session-based Recommendations

    Authors: Andres Ferraro, Dietmar Jannach, Xavier Serra

    Abstract: Session-based recommendation is a problem setting where the task of a recommender system is to make suitable item suggestions based only on a few observed user interactions in an ongoing session. The lack of long-term preference information about individual users in such settings usually results in a limited level of personalization, where a small set of popular items may be recommended to many us…

    Submitted 17 August, 2020; originally announced August 2020.

    Comments: The 14th ACM Conference on Recommender Systems

  36. arXiv:2006.08386  [pdf, other]

    cs.LG cs.IR eess.AS stat.ML

    COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations

    Authors: Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, Xavier Serra

    Abstract: Audio representation learning based on deep neural networks (DNNs) emerged as an alternative approach to hand-crafted features. For achieving high performance, DNNs often need a large amount of annotated data which can be difficult and costly to obtain. In this paper, we propose a method for learning audio representations, aligning the learned latent representations of audio and associated tags. A…

    Submitted 8 July, 2020; v1 submitted 15 June, 2020; originally announced June 2020.

    Comments: 8 pages, 1 figure, workshop on Self-supervision in Audio and Speech at the 37th International Conference on Machine Learning (ICML), 2020, Vienna, Austria

  37. arXiv:2006.00751  [pdf, other]

    eess.AS cs.SD

    Evaluation of CNN-based Automatic Music Tagging Models

    Authors: Minz Won, Andres Ferraro, Dmitry Bogdanov, Xavier Serra

    Abstract: Recent advances in deep learning accelerated the development of content-based automatic music tagging systems. Music information retrieval (MIR) researchers proposed various architecture designs, mainly based on convolutional neural networks (CNNs), that achieve state-of-the-art results in this multi-label binary classification task. However, due to the differences in experimental setups followed…

    Submitted 1 June, 2020; originally announced June 2020.

    Comments: 7 pages, 2 figures, Sound and Music Computing 2020 (SMC 2020)

  38. arXiv:2005.00878  [pdf, other]

    cs.SD cs.LG eess.AS

    Addressing Missing Labels in Large-Scale Sound Event Recognition Using a Teacher-Student Framework With Loss Masking

    Authors: Eduardo Fonseca, Shawn Hershey, Manoj Plakal, Daniel P. W. Ellis, Aren Jansen, R. Channing Moore, Xavier Serra

    Abstract: The study of label noise in sound event recognition has recently gained attention with the advent of larger and noisier datasets. This work addresses the problem of missing labels, one of the big weaknesses of large audio datasets, and one of the most conspicuous issues for AudioSet. We propose a simple and model-agnostic method based on a teacher-student framework with loss masking to first ident…

    Submitted 25 July, 2020; v1 submitted 2 May, 2020; originally announced May 2020.

    Comments: Accepted in IEEE Signal Processing Letters, openly accessible at https://ieeexplore.ieee.org/document/9130823

    Journal ref: IEEE Signal Processing Letters, Vol. 27, 2020, pages 1235-1239

  39. arXiv:2004.03985  [pdf, other]

    cs.IR cs.HC cs.LG cs.SD

    Search Result Clustering in Collaborative Sound Collections

    Authors: Xavier Favory, Frederic Font, Xavier Serra

    Abstract: The large size of today's online multimedia databases makes retrieving their content a difficult and time-consuming task. Users of online sound collections typically submit search queries that express a broad intent, often making the system return large and unmanageable result sets. Search Result Clustering is a technique that organises search-result content into coherent groups, which allows us…

    Submitted 8 April, 2020; originally announced April 2020.

    Comments: 8 pages, 4 figures, Proceedings of the 2020 International Conference on Multimedia Retrieval (ICMR 20), June 8-11, 2020, Dublin, Ireland. ACM, New York, NY, USA, 8 pages

    ACM Class: H.3.3

  40. arXiv:2003.07393  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    TensorFlow Audio Models in Essentia

    Authors: Pablo Alonso-Jiménez, Dmitry Bogdanov, Jordi Pons, Xavier Serra

    Abstract: Essentia is a reference open-source C++/Python library for audio and music analysis. In this work, we present a set of algorithms that employ TensorFlow in Essentia, allow predictions with pre-trained deep learning models, and are designed to offer flexibility of use, easy extensibility, and real-time inference. To show the potential of this new interface with TensorFlow, we provide a number of pr…

    Submitted 16 March, 2020; originally announced March 2020.

  41. arXiv:1911.11853  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Neural Percussive Synthesis Parameterised by High-Level Timbral Features

    Authors: António Ramires, Pritish Chandna, Xavier Favory, Emilia Gómez, Xavier Serra

    Abstract: We present a deep neural network-based methodology for synthesising percussive sounds with control over high-level timbral characteristics of the sounds. This approach allows for intuitive control of a synthesizer, enabling the user to shape sounds without extensive knowledge of signal processing. We use a feedforward convolutional neural network-based architecture, which is able to map input para…

    Submitted 3 April, 2020; v1 submitted 25 November, 2019; originally announced November 2019.

  42. arXiv:1911.04827  [pdf, other]

    cs.IR

    Artist and style exposure bias in collaborative filtering based music recommendations

    Authors: Andres Ferraro, Dmitry Bogdanov, Xavier Serra, Jason Yoon

    Abstract: Algorithms have an increasing influence on the music that we consume and understanding their behavior is fundamental to make sure they give a fair exposure to all artists across different styles. In this on-going work we contribute to this research direction analyzing the impact of collaborative filtering recommendations from the perspective of artist and music style exposure given by the system.…

    Submitted 12 November, 2019; originally announced November 2019.

    Comments: Presented at Workshop on Designing Human-Centric MIR Systems, ISMIR 2019

  43. arXiv:1911.04824  [pdf, other]

    cs.IR cs.SD eess.AS

    How Low Can You Go? Reducing Frequency and Time Resolution in Current CNN Architectures for Music Auto-tagging

    Authors: Andres Ferraro, Dmitry Bogdanov, Xavier Serra, Jay Ho Jeon, Jason Yoon

    Abstract: Automatic tagging of music is an important research topic in Music Information Retrieval and audio analysis algorithms proposed for this task have achieved improvements with advances in deep learning. In particular, many state-of-the-art systems use Convolutional Neural Networks and operate on mel-spectrogram representations of the audio. In this paper, we compare commonly used mel-spectrogram rep…

    Submitted 28 June, 2020; v1 submitted 12 November, 2019; originally announced November 2019.

    Comments: The 28th European Signal Processing Conference (EUSIPCO)

  44. arXiv:1911.04385  [pdf, other]

    cs.SD eess.AS

    Visualizing and Understanding Self-attention based Music Tagging

    Authors: Minz Won, Sanghyuk Chun, Xavier Serra

    Abstract: Recently, we proposed a self-attention based music tagging model. Different from most of the conventional deep architectures in music information retrieval, which use stacked 3x3 filters by treating music spectrograms as images, the proposed self-attention based model attempted to regard music as a temporal sequence of individual audio events. Not only the performance, but it could also facilitate…

    Submitted 11 November, 2019; originally announced November 2019.

    Comments: Machine Learning for Music Discovery Workshop (ML4MD) at ICML 2019

  45. arXiv:1910.12004  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    Model-agnostic Approaches to Handling Noisy Labels When Training Sound Event Classifiers

    Authors: Eduardo Fonseca, Frederic Font, Xavier Serra

    Abstract: Label noise is emerging as a pressing issue in sound event classification. This arises as we move towards larger datasets that are difficult to annotate manually, but it is even more severe if datasets are collected automatically from online repositories, where labels are inferred through automated heuristics applied to the audio content or metadata. While learning from noisy labels has been an ac…

    Submitted 26 October, 2019; originally announced October 2019.

    Comments: WASPAA 2019

  46. arXiv:1909.06654  [pdf, other]

    cs.SD cs.CL eess.AS

    musicnn: Pre-trained convolutional neural networks for music audio tagging

    Authors: Jordi Pons, Xavier Serra

    Abstract: Pronounced as "musician", the musicnn library contains a set of pre-trained musically motivated convolutional neural networks for music audio tagging: https://github.com/jordipons/musicnn. This repository also includes some pre-trained vgg-like baselines. These models can be used as out-of-the-box music audio taggers, as music feature extractors, or as pre-trained models for transfer learning. W…

    Submitted 14 September, 2019; originally announced September 2019.

    Comments: Accepted to be presented at the Late-Breaking/Demo session of ISMIR 2019

  47. arXiv:1908.10133  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    A hybrid parametric-deep learning approach for sound event localization and detection

    Authors: Andres Perez-Lopez, Eduardo Fonseca, Xavier Serra

    Abstract: This work describes and discusses an algorithm submitted to the Sound Event Localization and Detection Task of DCASE2019 Challenge. The proposed methodology relies on parametric spatial audio analysis for source localization and detection, combined with a deep learning-based monophonic event classifier. The evaluation of the proposed algorithm yields overall results comparable to the baseline syst…

    Submitted 27 August, 2019; originally announced August 2019.

    Comments: 5 pages, 5 figures, submitted to DCASE2019 Workshop

  48. arXiv:1907.08520  [pdf, other]

    cs.SD eess.AS

    Data Augmentation for Instrument Classification Robust to Audio Effects

    Authors: António Ramires, Xavier Serra

    Abstract: Reusing recorded sounds (sampling) is a key component in Electronic Music Production (EMP), which has been present since its early days and is at the core of genres like hip-hop or jungle. Commercial and non-commercial services allow users to obtain collections of sounds (sample packs) to reuse in their compositions. Automatic classification of one-shot instrumental sounds allows automatically cat…

    Submitted 19 July, 2019; originally announced July 2019.

  49. arXiv:1906.04972  [pdf, other]

    cs.SD eess.AS

    Toward Interpretable Music Tagging with Self-Attention

    Authors: Minz Won, Sanghyuk Chun, Xavier Serra

    Abstract: Self-attention is an attention mechanism that learns a representation by relating different positions in the sequence. The transformer, which is a sequence model solely based on self-attention, and its variants achieved state-of-the-art results in many natural language processing tasks. Since music composes its semantics based on the relations between components in sparse positions, adopting the s…

    Submitted 12 June, 2019; originally announced June 2019.

    Comments: 13 pages, 12 figures; code: https://github.com/minzwon/self-attention-music-tagging

  50. arXiv:1906.02975  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    Audio tagging with noisy labels and minimal supervision

    Authors: Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Serra

    Abstract: This paper introduces Task 2 of the DCASE2019 Challenge, titled "Audio tagging with noisy labels and minimal supervision". This task was hosted on the Kaggle platform as "Freesound Audio Tagging 2019". The task evaluates systems for multi-label audio tagging using a large set of noisy-labeled data, and a much smaller set of manually-labeled data, under a large vocabulary setting of 80 everyday sou…

    Submitted 19 January, 2020; v1 submitted 7 June, 2019; originally announced June 2019.

    Comments: DCASE2019 Workshop