Live Q&A - Signal Processing Formulations of Sequence Models
Julius O. Smith III - Recording Soon Available - DSP Online Conference 2024
Hi Robert,
Thanks for your kind words!
Transcribing Clapton's guitar playing and such fits within the classic problem of "automatic transcription", and specifically "polyphonic F0 estimation". Nowadays, neural methods are probably best to pursue first, followed by comparison to, and conversion to, more classical methods. In this case, I'd say the "ideal answer" is a maximum-likelihood estimate of the playing parameters for each guitar string. Ideally you take advantage of all constraints, such as "there is only one left hand", and estimate left-hand position along with everything else.

For neural starters, I would probably build on the Audio Spectrogram Transformer (AST), which is based on the Vision Transformer (ViT). Its output could be, e.g., the stopped fret for each string, and whether or not that string is sounding. For a 25-fret neck, adding open and muted states, that gives 27 * 6 = 162 output logits (a rough sketch follows below). For solos you also want to follow bending, of course, and that could be modeled as a second "original fret" output, or whatever.

All that said, what I try to do as a guitar player is find a YouTube video showing his hands, and reverse engineer from there by ear.
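If it helps, here is a very rough sketch of what that 162-logit output head might look like (the module name, embedding size, pooling, and toy input below are placeholder assumptions on my part, not a tested recipe):

import torch
import torch.nn as nn

N_STRINGS = 6
N_STATES = 27          # frets 1-25 plus "open" and "muted"
EMBED_DIM = 768        # typical ViT/AST embedding size (assumed here)

class FretStateHead(nn.Module):
    """Maps a pooled AST-style embedding to 6 x 27 = 162 per-string logits."""
    def __init__(self, embed_dim=EMBED_DIM):
        super().__init__()
        self.proj = nn.Linear(embed_dim, N_STRINGS * N_STATES)

    def forward(self, pooled_embedding):
        logits = self.proj(pooled_embedding)          # (batch, 162)
        return logits.view(-1, N_STRINGS, N_STATES)   # (batch, 6, 27)

# Toy usage with a random stand-in for the encoder's pooled output:
head = FretStateHead()
fake_embedding = torch.randn(4, EMBED_DIM)            # batch of 4 clips
per_string_logits = head(fake_embedding)
print(per_string_logits.argmax(dim=-1).shape)         # torch.Size([4, 6])

Training would presumably use a per-string cross-entropy loss, and bending could get a second "original fret" head along the same lines.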
Hello again, and thank you for the reply and information! I'll be looking further into your recommendations, which confirm my assumption that the problem has received a good deal of interest and research from fellow audio lovers and musicians of the world. A brief look into the 'polyphonic F0 estimation' approach also confirms my conclusion that it's not an easy problem to solve (if it can be solved at all). From Wikipedia: "There have been many attempts at multiple (aka polyphonic) F0 estimation and melody extraction, a related area" and "Since F0 tracking of all sources in a complex audio mixture can be very hard, we are restricting the problem to 3 cases...". Maybe it will be solved by future DSP and music heads :) Being able to input any music composition, be it Cream, or Bach on classical guitar (a much easier prospect), and get an immediate transcription, is just too enticing.
I've used YouTube live performances too :), not to mention some of the detailed instructional videos, which pretty much show what to play and where! Versus the old days, when the main option was infinite repeat of a record or cassette (record playback at 16 speed was useful for some of the more tricky solos :) )
Best Regards, and thanks again,
Robert
I see there is an October 2023 review article on music transcription:
https://www.mdpi.com/2076-3417/13/21/11882
I would also look up all citations to that paper (and to other good papers you find) on Google Scholar to get fully up to date.
Cheers,
Julius
An excellent paper, thanks! Right at the heart of the matter. Section 4.3 was particularly relevant to my findings: "Musical sounds often consist of a fundamental frequency and its harmonics. The presence of strong harmonics can lead to ambiguity in pitch estimation, as the algorithm may detect multiple potential fundamental frequencies that align with different harmonics."
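To make that concrete for myself, here is a tiny back-of-the-envelope check (the note choices and the 1% tolerance are just my illustration, using standard equal-tempered frequencies):

# How many harmonics of the open low E (E2) line up with other candidate fundamentals?
E2 = 82.41                                   # Hz
harmonics = [E2 * k for k in range(1, 9)]    # first 8 harmonics

candidates = {"E2": 82.41, "E3": 164.81, "B3": 246.94, "E4": 329.63}

for name, f0 in candidates.items():
    ratios = [h / f0 for h in harmonics]
    # A harmonic "fits" the candidate if it sits within ~1% of an integer multiple of f0.
    fits = [r for r in ratios if round(r) >= 1 and abs(r - round(r)) < 0.01]
    print(f"{name} ({f0:.2f} Hz): {len(fits)} of {len(harmonics)} E2 harmonics fit")

So even a single clean low E already offers a peak-picker several plausible fundamentals (E3, B3, E4 all account for part of the harmonic series), and a full chord only compounds that.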
I've been checking out some of the transcription tools cited, including AnthemScore, which was relatively easy to use and interpret. And even though it didn't exactly replicate polyphonic scores, it did get pretty close with a relatively clean Bach Fugue on classical guitar.
Fascinating!
Best Regards,
Robert
Hi Dr. Smith,
During the Q&A discussion you had mentioned something about an interesting YouTube video titled "Make More" (or something like that). I made a note to ask you about it, as it sounded interesting. Could you post a link here?
Sure: Andrej Karpathy's "micrograd" and "makemore" tutorials on YouTube:
https://www.youtube.com/@AndrejKarpathy
I especially recommend the series "Neural Networks: Zero to Hero"
Super, that was it. Thank you!
Hello Dr Smith, Thanks for the discussion yesterday. I meant to ask a question during it, but the opportunity wasn't right, so I'll ask here.

My entire career in DSP and embedded software has been driven by a love of music, and a desire to use technology to understand it better. Coming out of grad school, I had an idea of being able to decompose music into the associated chord progressions and notation, using DSP methods, primarily the FFT and frequency analysis. I'm a lifelong guitar player (electric mainly, Gibson SG), and can usually figure out note-by-note solos, but had trouble determining the chord progression of a song, and even more coveted, the position on the fretboard where each chord was struck, using which strings. So as an example, using Cream's White Room that you had mentioned in the presentation yesterday, I could figure out the awesome wah solo, but I really wanted to know how Eric Clapton was fingering the rhythm section, since by ear, it was hard to know.

I spent a long time personally pursuing that, as a DSP audio hobbyist (and learning a good deal about embedded DSP in the process). But it ended up as a glorified spectrum/audio analyzer, which took in the real-time audio, and could zoom-filter and display various frequency bands, etc. It occasionally determined some rudimentary chords, but only under the most ideal conditions, so it didn't come close to my original idea. One primary issue I encountered was that the interplay of the fundamentals and harmonics of a chord made it almost impossible to know which peaks were the notes and which were their harmonics. In my pursuit of this technology, I often ended up on your webpages, thanks.
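To give a flavor of the kind of frequency-domain analysis I was attempting, here is a minimal numpy sketch (the synthesized chord, window, and frequency range are only illustrative, not my actual analyzer): fold the magnitude spectrum into 12 pitch classes and look at the strongest ones.

import numpy as np

fs = 44100                                   # sample rate, Hz
t = np.arange(int(0.5 * fs)) / fs            # half a second of signal

# Crude stand-in for a strummed A minor triad (A2, C3, E3), three harmonics per note.
note_freqs = [110.00, 130.81, 164.81]
x = sum(np.sin(2 * np.pi * f * k * t) / k
        for f in note_freqs for k in (1, 2, 3))

# Magnitude spectrum of the windowed signal.
X = np.abs(np.fft.rfft(x * np.hanning(len(x))))
freqs = np.fft.rfftfreq(len(x), 1 / fs)

# Fold spectral energy into 12 pitch classes (a simple pitch-class profile).
names = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
profile = np.zeros(12)
band = (freqs > 60) & (freqs < 2000)
pitch_class = (np.round(12 * np.log2(freqs[band] / 440.0)).astype(int) + 9) % 12
for pc, mag in zip(pitch_class, X[band]):
    profile[pc] += mag

# The triad's pitch classes dominate, but the notes' harmonics also deposit energy
# on G and B, which is exactly the fundamental-vs-harmonic confusion described above.
for i in np.argsort(profile)[::-1][:5]:
    print(f"{names[i]}: {profile[i]:.1f}")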
My question is whether you are aware that this problem has been solved, or whether it is a search for the Holy Grail, inherently unsolvable due to the physics and data characteristics involved. I'm aware that having a separate sampling channel/amplifier on each string would make it easier, but I was more interested in taking any composition, recorded without such special setups.
Thanks again for any review of my question, and for your contributions to an area of lifelong interest to me (DSP audio!)
Best Regards,
Robert Wolfe