Enhancing Song Classification through Machine Listening
Building on existing methods, this project classifies songs by their audio content using Deep Learning and Community Detection. By analyzing spectrograms, the framework extracts features to identify similarities and cluster songs, a capability that could eventually aid historical studies and species identification based on sound.
SongNet: Making a Machine Listen to Music
Tyler Farnan, Aisha Dantuluri, Soumyaraj Bose
Motivation
- Most frameworks used by music recommendation platforms identify musical works through their metadata (tags, artists, etc.) and state-of-the-art clustering methods.
- We aim to build a framework that uses such established methods but also examines (rather, listens to) the songs themselves in order to classify them, much as a human ear would.
- We use tools and concepts from Deep Learning and Community Detection to examine the frequency-time maps of songs and to cluster those with similar aesthetics.
- We believe that, in the long run, this functionality can be adapted to facilitate historical studies or to better identify species or phenomena from their sounds.
Literature Survey
- Our framework is inspired by the DEEJ-AI project [1] developed by Robert D. Smith.
- We take from DEEJ-AI the idea of using spectrograms (frequency-time maps) of song samples as training data.
- For the most part, DEEJ-AI learns and recreates the physical structure of tracks using 1-D and 2-D CNNs, optimizing reconstruction quality with a cosine-proximity loss.
- We build on the suggestion in [1] that convolutional autoencoders could improve the reconstruction of the song-sample spectrograms.
Literature Survey (continued)
- We take the latent-layer representations generated by these autoencoders and apply a combination of approaches to classify the analyzed set of songs:
- a data-driven clustering approach [2] that groups songs with similar latent representations, and
- Community Detection [3] on user-curated playlist metadata, which identifies communities of songs as a point of comparison against the clusters of latent audio features.
How can Machine Learning / Deep Learning help solve this problem?
- Feature extraction is core to the solution of our problem: detecting low-level detail in the spectrograms is essential for generating close-to-accurate reconstructions.
- Convolutional neural network architectures achieve state-of-the-art performance in computer vision. By representing music audio as spectrograms (in the time-frequency space), we can leverage CNNs to learn hierarchical feature maps capable of capturing the complex patterns that make musical audio unique.
Details on the dataset
- Musical audio features: We use the same physical dataset as [1] (Mel spectrograms of frequencies within the space of 96 Hz, recorded over 216 seconds), restricted to the first 5 seconds of each sample. The Mel spectrograms are .PNG image files converted to NumPy arrays for training (a loading sketch follows below).
- Playlist metadata: We also use user-curated playlist data from the Spotify API, processed with graph pruning/analysis to produce a network of the most connected tracks (two songs are connected if they appear in the same playlist).
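As a minimal sketch of this preprocessing step, assuming a hypothetical `spectrograms/` folder of PNG files and a 96x216 target size (neither is specified on the slides), the images could be loaded into a training tensor like this:

```python
# Sketch: load Mel-spectrogram PNGs into a NumPy training tensor.
# The folder name and target size are assumptions for illustration only.
from pathlib import Path

import numpy as np
from PIL import Image

SPECTROGRAM_DIR = Path("spectrograms")   # hypothetical folder of .png files
TARGET_SIZE = (96, 216)                  # (frequency bins, time frames), assumed

def load_spectrograms(directory: Path) -> np.ndarray:
    """Read every PNG, convert to grayscale, and scale pixels to [0, 1]."""
    arrays = []
    for png_path in sorted(directory.glob("*.png")):
        img = Image.open(png_path).convert("L").resize(TARGET_SIZE[::-1])
        arrays.append(np.asarray(img, dtype=np.float32) / 255.0)
    # Add a channel axis so the tensor is ready for a Conv2D autoencoder.
    return np.stack(arrays)[..., np.newaxis]

if __name__ == "__main__":
    X = load_spectrograms(SPECTROGRAM_DIR)
    print(X.shape)  # (num_songs, 96, 216, 1)
```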
Details on feature extraction
- Audio features from Mel spectrograms: Convolutional Autoencoders (CAEs) are trained to reproduce the spectrograms; the latent representations of these spectrograms are extracted as preliminary features, clustered, and the resulting clusters are studied and compared.
- Communities from playlist metadata: playlist and song metadata are used for graph analysis and community detection; the detected communities are then compared against the clusters, as sketched below.
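The community-detection half could look roughly as follows; the toy playlists are hypothetical and greedy modularity maximization is a stand-in, since the slides do not name the exact algorithm applied to the Spotify playlist graph:

```python
# Sketch: build a song co-occurrence graph from playlists and detect communities.
# The playlists are toy data; greedy modularity is an illustrative algorithm choice.
from itertools import combinations

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical playlist metadata: playlist name -> list of track IDs.
playlists = {
    "road_trip": ["songA", "songB", "songC"],
    "late_night": ["songB", "songC", "songD"],
    "workout": ["songD", "songE", "songF"],
}

G = nx.Graph()
for tracks in playlists.values():
    # Connect every pair of songs appearing in the same playlist,
    # accumulating an edge weight for repeated co-occurrences.
    for u, v in combinations(tracks, 2):
        weight = G.get_edge_data(u, v, default={"weight": 0})["weight"]
        G.add_edge(u, v, weight=weight + 1)
# (In the real pipeline the graph is also pruned to its most connected tracks.)

communities = greedy_modularity_communities(G, weight="weight")
for i, community in enumerate(communities):
    print(f"community {i}: {sorted(community)}")
```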
Details on the models tested
- Various CAE architectures were tested to explore the generated clusters further (a sketch of one such CAE follows below).
- The model selection process covered 4 separate models with different latent-representation sizes.
- Hyperparameters: batch size: 5, 10, 15; learning rate: 0.01, 0.005, 0.001, 0.0005; activation functions: ReLU, Sigmoid.
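A minimal sketch of one such CAE in Keras, assuming a 96x216 single-channel spectrogram input, the "72x72" latent size, and one of the listed learning rates; the layer counts and filter sizes are illustrative rather than the project's exact architecture:

```python
# Sketch of a convolutional autoencoder (CAE) over spectrogram "images".
# Input shape, layer counts, and filter sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

INPUT_SHAPE = (96, 216, 1)   # (frequency bins, time frames, channels), assumed
LATENT_DIM = 72 * 72         # e.g. the "72x72" latent size from the slides

def build_cae(latent_dim: int = LATENT_DIM):
    inputs = layers.Input(shape=INPUT_SHAPE)

    # Encoder: two conv/pool stages, then a dense bottleneck.
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)          # 96x216 -> 24x54 feature maps
    latent = layers.Dense(latent_dim, activation="relu")(layers.Flatten()(x))

    # Decoder: mirror the encoder back up to the input resolution.
    x = layers.Dense(24 * 54 * 32, activation="relu")(latent)
    x = layers.Reshape((24, 54, 32))(x)
    x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(x)
    outputs = layers.Conv2D(1, 3, padding="same", activation="sigmoid")(x)

    autoencoder = models.Model(inputs, outputs)
    encoder = models.Model(inputs, latent)   # used later to extract latent features
    autoencoder.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                        loss="mse")
    return autoencoder, encoder

autoencoder, encoder = build_cae()
autoencoder.summary()
# Training would then be, e.g.: autoencoder.fit(X, X, batch_size=10, epochs=50)
```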
Details on the VAE model
- A variational autoencoder (VAE) model based on [1] was also tested. Feature extraction was done in the same way as for the models above, with LeakyReLU activations and dropout layers (20-40%) added.
- We used a multivariate normal layer from the TensorFlow Probability library to generate the encoding.
- We experimented with the decoder output: (1) a 2-D transposed convolutional layer with sigmoid activation, and (2) a pixel-independent Bernoulli layer.
- Losses optimized: negative log-likelihood (reconstruction) and KL divergence (measuring the proximity between the encoder distribution and the prior). A sketch of this setup follows below.
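A condensed sketch of such a VAE using TensorFlow Probability, assuming the same 96x216 input and a 64-dimensional latent space; the encoder depth and the MultivariateNormalTriL variant are assumptions, while the multivariate-normal encoding, pixel-independent Bernoulli decoder head, and negative log-likelihood plus KL objective follow the slide:

```python
# Sketch of a VAE with a TensorFlow Probability encoding and a Bernoulli decoder.
# Encoder depth, latent size, and dropout rate are illustrative choices.
import tensorflow as tf
import tensorflow_probability as tfp

tfk, tfkl, tfpl, tfd = tf.keras, tf.keras.layers, tfp.layers, tfp.distributions

INPUT_SHAPE = (96, 216, 1)   # assumed spectrogram shape
LATENT_DIM = 64              # illustrative latent size

# Standard-normal prior over the latent code.
prior = tfd.Independent(tfd.Normal(loc=tf.zeros(LATENT_DIM), scale=1.0),
                        reinterpreted_batch_ndims=1)

encoder = tfk.Sequential([
    tfkl.InputLayer(input_shape=INPUT_SHAPE),
    tfkl.Conv2D(32, 3, strides=2, padding="same"),
    tfkl.LeakyReLU(),
    tfkl.Dropout(0.3),                        # dropout in the 20-40% range
    tfkl.Conv2D(64, 3, strides=2, padding="same"),
    tfkl.LeakyReLU(),
    tfkl.Flatten(),
    tfkl.Dense(tfpl.MultivariateNormalTriL.params_size(LATENT_DIM)),
    # Multivariate-normal posterior, pulled toward the prior via a KL penalty.
    tfpl.MultivariateNormalTriL(
        LATENT_DIM,
        activity_regularizer=tfpl.KLDivergenceRegularizer(prior, weight=1.0)),
])

decoder = tfk.Sequential([
    tfkl.InputLayer(input_shape=(LATENT_DIM,)),
    tfkl.Dense(24 * 54 * 64),
    tfkl.LeakyReLU(),
    tfkl.Reshape((24, 54, 64)),
    tfkl.Conv2DTranspose(32, 3, strides=2, padding="same"),
    tfkl.LeakyReLU(),
    tfkl.Conv2DTranspose(1, 3, strides=2, padding="same"),
    tfkl.Flatten(),
    # Pixel-independent Bernoulli distribution over the spectrogram pixels.
    tfpl.IndependentBernoulli(INPUT_SHAPE, tfd.Bernoulli.logits),
])

vae = tfk.Model(inputs=encoder.inputs, outputs=decoder(encoder.outputs[0]))
# Reconstruction term: negative log-likelihood of the input under the decoder.
negative_log_likelihood = lambda x, rv_x: -rv_x.log_prob(x)
vae.compile(optimizer=tfk.optimizers.Adam(learning_rate=1e-3),
            loss=negative_log_likelihood)
```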
Results: Training and cluster analysis
K-means: optimal number of clusters per model (compared against the loss plots); a selection sketch follows below.

Model Architecture | Silhouette Scoring | Elbow Method
CAE 72x72 | 2 | 3
CAE 48x81 | 4 | 3
CAE 168x108 | 3 | 3
CAE 12x15 | 2 | 3
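The cluster-count selection can be reproduced with scikit-learn [4] roughly as follows; `latent_features` is a random placeholder standing in for the encoder outputs:

```python
# Sketch: choose the number of KMeans clusters via silhouette scores and the
# elbow (inertia) curve. `latent_features` is a placeholder for encoder outputs.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
latent_features = rng.normal(size=(500, 64))   # stand-in for real latent vectors

inertias, silhouettes = {}, {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(latent_features)
    inertias[k] = km.inertia_                  # elbow method: look for the "bend"
    silhouettes[k] = silhouette_score(latent_features, km.labels_)

best_k = max(silhouettes, key=silhouettes.get)
print("silhouette-optimal k:", best_k)
print("inertia by k:", {k: round(v, 1) for k, v in inertias.items()})
```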
Results: PCA with K-means labels
- The latent representations are poorly clustered in the PCA projection.
- Similar results hold across all model architectures.
Results: t-SNE colored by community labels (4 communities, left, vs. 10, right)
- No correlated structure emerges in either case; a projection sketch follows below.
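Both projections can be produced with scikit-learn [4]; a minimal sketch with placeholder latent vectors and label arrays:

```python
# Sketch: 2-D projections of the latent space, colored by KMeans clusters (PCA)
# and by playlist communities (t-SNE). All inputs here are random placeholders.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
latent_features = rng.normal(size=(500, 64))     # stand-in latent vectors
kmeans_labels = rng.integers(0, 3, size=500)     # stand-in cluster labels
community_labels = rng.integers(0, 4, size=500)  # stand-in community labels

pca_2d = PCA(n_components=2).fit_transform(latent_features)
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(latent_features)

fig, (ax_pca, ax_tsne) = plt.subplots(1, 2, figsize=(10, 4))
ax_pca.scatter(pca_2d[:, 0], pca_2d[:, 1], c=kmeans_labels, cmap="tab10", s=8)
ax_pca.set_title("PCA, colored by KMeans cluster")
ax_tsne.scatter(tsne_2d[:, 0], tsne_2d[:, 1], c=community_labels, cmap="tab10", s=8)
ax_tsne.set_title("t-SNE, colored by playlist community")
plt.tight_layout()
plt.show()
```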
Confusion matrix: K-means clusters vs. playlist communities (4 communities, left; 10 communities, right). A cross-tabulation sketch follows below.
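The underlying cross-tabulation can be computed with scikit-learn's contingency_matrix [4]; a sketch with placeholder label arrays:

```python
# Sketch: cross-tabulate KMeans cluster labels against playlist-community labels.
# Both label arrays are random placeholders standing in for the real assignments.
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

rng = np.random.default_rng(0)
kmeans_labels = rng.integers(0, 3, size=500)      # clusters from the latent space
community_labels = rng.integers(0, 4, size=500)   # communities from playlists

# Rows index KMeans clusters, columns index playlist communities.
matrix = contingency_matrix(kmeans_labels, community_labels)
print(matrix)
```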
Further items to be completed
- To improve our predictive capability, we are currently fine-tuning and training Variational Autoencoders (VAEs) on the same dataset.
- Analyze the diversity of the full training dataset (300k spectrograms).
- Explore various playlist sub-networks and community detection methods.
- Modularize our code for reproducibility.
References
[1] R. Smith, "Create automatic playlists by using Deep Learning to listen to music", https://towardsdatascience.com/create-automatic-playlists-by-using-deep-learning-to-listen-to-the-music-b72836c24ce2, 2019.
[2] L. Barreira, S. Cavaco, J. F. da Silva, "Unsupervised Music Genre Classification with a Model-based Approach", EPIA: Progress in Artificial Intelligence, pp. 268-281, 2011.
[3] B. He, Y. Li, B. Nguy, "Music Playlist Generation based on Community Detection & Personalized PageRank", http://snap.stanford.edu/class/cs224w-2015/projects_2015/Music_Playlist_Generation.pdf, 2015.
[4] F. Pedregosa et al., "Scikit-learn: Machine Learning in Python", JMLR 12, pp. 2825-2830, 2011.
Thank you! ...Please stay tuned for live code review.