Applying machine learning and deep learning methods to audio analysis

comet.ml

93 points by gidim 6 years ago · 20 comments

jononor 6 years ago

As an introduction I guess this is OK. However, there are two major limitations:

1: The feature extraction ends with mean-summarizing across the entire audio clip, leaving no temporal information. This only works well for simple tasks. At least mentioning analysis windows and temporal modelling would be good, as the natural next step - be it an LSTM/GRU on the MFCCs, or a CNN on the mel-spectrogram (see the sketch after point 2).

2: The folds of the UrbanSound8K dataset are not respected in the evaluation. In UrbanSound8K, different folds contain clips extracted from the same original audio files, usually very close in time. So mixing the folds for the test set means it is no longer entirely "unseen data". The model very likely exploits this data leakage, as the reported accuracy is above SOTA (for no data augmentation) - unreasonable given the low-fidelity feature representation. At least mentioning this limitation, and that the performance number they give cannot be compared with other methods, would be prudent.
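
As a rough sketch of that next step (librosa-based; the parameter values are illustrative, not taken from the article), you keep the per-frame MFCC matrix instead of averaging it away, so an LSTM/GRU or CNN can see the time axis:

    import librosa

    def mfcc_frames(path, n_mfcc=40, sr=22050):
        # Compute MFCCs per analysis frame, rather than mean-summarizing
        # them over the whole clip.
        y, sr = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)
        return mfcc.T  # (n_frames, n_mfcc): a sequence an LSTM/GRU can consume

    # The mean-summarized variant loses the time axis entirely:
    # import numpy as np
    # clip_features = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).T, axis=0)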

When I commented similarly on r/machinelearning the authors acknowledged these weaknesses, but did not update the article to reflect them.

  • gidim (OP) 6 years ago

    We're working on another version that fixes the folds issue on UrbanSound8K and will update the article ASAP.

    • jononor 6 years ago

      Nice!

      • gidim (OP) 6 years ago

        Just to clarify - are you referring to this experiment? https://www.comet.ml/demo/urbansound8k/be09e32700cd435fb6b55...

        • jononor 6 years ago

          Sure, that demonstrates the issue. The problem is the use of train_test_split(X, yy, test_size=0.2, ...) - this assumes independent samples, which is violated for this dataset (because some clips come from the same source audio files). The easiest (and completely acceptable) fix is to use one fold as the validation data, one fold as the test set, and the remaining folds for training, roughly like the sketch below.
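
          A rough sketch of that fold-aware split, using the fold column from the UrbanSound8K metadata (assuming X and yy are the same numpy arrays passed to train_test_split, aligned row-for-row with the metadata):

              import numpy as np
              import pandas as pd

              meta = pd.read_csv("UrbanSound8K/metadata/UrbanSound8K.csv")
              val_fold, test_fold = 9, 10  # arbitrary choice of held-out folds

              folds = meta["fold"].to_numpy()
              train_idx = np.flatnonzero(~np.isin(folds, [val_fold, test_fold]))
              val_idx = np.flatnonzero(folds == val_fold)
              test_idx = np.flatnonzero(folds == test_fold)

              # Replaces train_test_split(X, yy, test_size=0.2), which leaks clips
              # from the same source file across train and test.
              X_train, y_train = X[train_idx], yy[train_idx]
              X_val, y_val = X[val_idx], yy[val_idx]
              X_test, y_test = X[test_idx], yy[test_idx]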

          This problem is unfortunately quite common even in academic papers using this dataset, even though the dataset authors warn about it.

          EDIT: There is one more issue with the UrbanSound8K folds: the difficulty of the various folds differs quite a bit. So one should ideally report the performance across all folds (mean/std or a boxplot). But this is a minor issue compared to the data leakage.

          PS: Nice use of the Comet.ml platform here, collaborating online on improving the experimental setup :)

          • nikolaskaris 6 years ago

            Hey jononor — we've updated the post to split the training and test sets based on the folds. Good catch and thanks again for reporting this. Some of the experiments in the project will still have the old code, but the blog post will reflect this new train/test split.

            • jononor 6 years ago

              Nice. Did you update the reported results as well? I think they will change quite a bit.

jononor 6 years ago

Warning: shameless self-promotion. For those who wish to go a bit beyond this article, I gave a presentation on the topic at EuroPython: https://www.youtube.com/watch?v=uCGROOUO_wY It explains how to build models that can make use of temporal variations and learn the feature representations from the (mel) spectrogram. It is especially suited if you are already familiar with image classification using Convolutional Neural Networks.
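
For the CNN-on-mel-spectrogram route, a minimal Keras sketch (the layer sizes and the 64-band x 128-frame input shape are illustrative, not taken from the talk):

    from tensorflow.keras import layers, models

    # Treat the log-mel spectrogram as a single-channel "image".
    model = models.Sequential([
        layers.Input(shape=(64, 128, 1)),  # (mel bands, time frames, channels)
        layers.Conv2D(16, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(10, activation="softmax"),  # e.g. the 10 UrbanSound8K classes
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])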

m0zg 6 years ago

As one of the long-suffering Comet.ml customers, I wish they'd spend more time working on their site's performance and less on writing blog posts. It takes multiple seconds for graphs to render, and leaving any part of Comet.ml UI open in the browser leads to spinning fans and quick battery drain when working from a laptop. The logging component will sometimes hang without a warning and hang your training session as well. Bizarrely, there's no way to show min/max metric values for ongoing and completed runs (AKA the only thing a researcher actually cares about): you have to log them separately in order to display them.

This is a weird field: these are not difficult problems to solve, yet as far as I can tell, all of the popular choices available so far each suck in their own unique way and there's no option that I know of that actually offers convenience and high performance. FOSS options are barely existent, as well, and they also suck.

For the things where Comet.ml would be too onerous to deal with, I still use pen and paper.

  • gidim (OP) 6 years ago

    Hi m0zg! Gideon from Comet - sorry to hear you're having issues. Did you ever try to report these? If you share more info at support@comet.ml or in our Slack channel, I'm sure we can fix it / improve. On a general note: 1. You can see min/max values in the metrics tab for finished/running experiments. 2. We spend tons of time on performance, but these are actually difficult problems to solve, e.g. if you have ten charts each showing 10k data points, all updating in real time. That said, if you share your project we can use it to improve. Finally, the SDK is designed to never crash or slow down your training, and this is the first time we've heard that complaint - again, please ping us so we can figure out what's going on.

  • alon7 6 years ago

    We're actually very happy with Comet and have been using it on very large projects (>50 researchers, 10k models). You can reduce the refresh interval and the number of data points reported if things feel slow.

    • m0zg 6 years ago

      I don't log that many points as it is: about 4K data points per run in total (windowed average loss and LR every 25-30 batches, eval metrics every epoch), for all metrics combined. I also log the same data to TensorBoard, which renders everything pretty much instantaneously with no issues at all, even though I tell it to not downsample beyond 5K samples per graph.

      • gidim (OP) 6 years ago

        M0zg do you mind sending me an email with your project? Happy to look into it. gideon a t comet.ml

        • gidim (OP) 6 years ago

          Also keep in mind that, unlike TensorBoard, we keep your full data series available in the API and only downsample the charts to 15k points.

  • bentoboox 6 years ago

    When we did our evaluation, Comet was far superior to the alternatives, and we're not seeing any of the issues you reported. For better performance, make sure you log every epoch rather than every step.
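
    For example, something along these lines with the comet_ml SDK (the project name and the train_one_epoch/evaluate helpers are placeholders, not Comet APIs):

        from comet_ml import Experiment

        experiment = Experiment(project_name="audio-demo")  # API key read from config/env

        for epoch in range(num_epochs):
            train_loss = train_one_epoch(model, train_loader)  # your own training code
            val_acc = evaluate(model, val_loader)
            # One logging call per epoch instead of one per batch keeps the charts light.
            experiment.log_metrics({"train_loss": train_loss, "val_acc": val_acc}, epoch=epoch)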

  • abrichr 6 years ago

    I'd love to learn more about your use case. What kind of models are you training? What are you using Comet.ml for?

    Thanks!

syntaxing 6 years ago

Is there an easy way to detect a specific word and report the timestamps where it occurs throughout an audio sample? I've been trying to implement something like this but wasn't sure how to approach it.

  • yorwba 6 years ago

    If you already have the transcript without timestamps (e.g. for an audiobook where you know the source text), you could use https://github.com/readbeyond/aeneas , which infers the timestamps by aligning text-to-speech output with the audio using dynamic time warping.

    If you don't have the transcript, you'd use a transcription service that also gives you timestamps. E.g. there was a frontpage submission yesterday where someone used AWS Transcription to count the number of words in each minute of a talk: https://news.ycombinator.com/item?id=21635939
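
    For the transcript-plus-alignment case, the aeneas README shows usage roughly along these lines (the paths and language here are made up):

        from aeneas.executetask import ExecuteTask
        from aeneas.task import Task

        # Align a known transcript against the audio; the sync map holds the timestamps.
        config = u"task_language=eng|is_text_type=plain|os_task_file_format=json"
        task = Task(config_string=config)
        task.audio_file_path_absolute = u"/path/to/audio.mp3"
        task.text_file_path_absolute = u"/path/to/transcript.txt"
        task.sync_map_file_path_absolute = u"/path/to/syncmap.json"
        ExecuteTask(task).execute()
        task.output_sync_map_file()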

  • jononor 6 years ago

    If you can utilize a cloud API, the speech-transcription route is likely the simplest. Recognizing spoken words is challenging and data-demanding, since a word can be spoken by many different speakers.

    But if you want to do this directly on the audio, you chop the audio stream into fixed-length (in time) analysis windows. The length of the window should be a bit longer than the sound of interest (the word). Overlap is normally used for the windows: with 90% overlap, say, the next window is created by moving forward by 10%. This gives the model multiple "shots" at detecting the word as it passes by. This is suitable for spotting a word and giving the time to within something like 50 ms resolution.

    For each analysis window you apply feature pre-processing and a model such as the one shown in the article.
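
    A sketch of that windowing (the 1-second window and 90% overlap are just example values):

        import librosa

        y, sr = librosa.load("recording.wav", sr=16000)

        window_s, overlap = 1.0, 0.9
        win = int(window_s * sr)
        hop = int(win * (1.0 - overlap))  # advance 10% of a window per step

        for start in range(0, len(y) - win + 1, hop):
            chunk = y[start:start + win]
            t = start / sr  # timestamp of this window, in seconds
            # Feature pre-processing + a classifier (as in the article) would go here,
            # e.g. report t whenever the "word present" score exceeds a threshold.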

    This task sounds like what is called Keyword Spotting in the academic literature, which can be seen as a specific version of Audio Event Detection applied to spoken words.

  • bootloop 6 years ago

    Given you have a transcription method (but one without timestamp output), the simplest approach might be to run a moving window over the sample and see where the detection comes back positive.
