I need your help. This is an audio waveform of somebody singing the first four lines of Happy Birthday. Click if you want to see if bigger.
The first line should begin about 1.5 seconds in, and should be finished before 4.5 seconds in. This is because the singer is singing in response to a karaoke style lyrics scroller. There should be about a 1.5 second gap between lines, so the second line of the song should start at about 6 seconds in, and each line should be no more than 3 seconds long. I say should here, but its somewhat inexact, due to peoples sense of timing, and machine performance characteristics etc. There could be a 0.5 second variation in any of these timings, based on testing.
Now, what I want to so, is identify in the waveform where singing starts for each of the four lines of the song. So I want 4 pairs of numbers. The first might be (1.3, 4.1), meaning the singing starts 1.3 seconds into the waveform, and ends 4.1 seconds into the waveform, making a 2.8 second singing clip. The second pair might be (6.1, 8.6) etc.
How can you help? I’ve been hand-rolling all kinds of amateur algorithms to pull this off, with a modicum of success (that I feel chuffed about 🙂 ), but I need something more robust than what I have been able to do. So what I am looking for is pointers either to papers or code for algorithms that are suited to this problem, or even the name of potentially suitable algorithms and I’ll do the research myself. Or else pointers to open-source tools that can do this, so I can dig around and see what they do.
My promise to you is, if anything interesting learning comes from this process, I will report back on what I learned.
Keep well!