GPU-Powered Real-Time Audio Source Separation
Alexander Talashov - DSP Online Conference 2025 - Duration: 01:52:37
Audio source separation has been a long-standing area of research due to its broad range of real-world applications. This technology allows music producers and sound engineers to repurpose archival recordings, even those captured under suboptimal conditions. In the film industry, it enables the restoration of classic movies—particularly those with mono audio—by enhancing them with modern innovations such as spatial audio. Moreover, it plays a crucial role in assistive technologies, helping individuals with hearing impairments communicate more effectively in noisy environments.
Leveraging GPUs to accelerate source separation addresses two key objectives: (1) increasing throughput to process multiple audio files in parallel, which streamlines workflows for music and film professionals; and (2) reducing latency, which is vital for real-time communication systems. The latter is especially relevant in high-demand sectors such as industrial operations and transportation, where speed and precision are essential.
In this workshop, we will provide an overview of the GPU AUDIO Platform—its architecture, core features, and how it enables the development of accelerated audio modules. Following this introduction, we will explore how to integrate state-of-the-art, open-source neural network-based source separation models with the GPU AUDIO Platform. Attendees will learn how to build a custom software stack, integrate with virtually any GPU-powered environment, and achieve millisecond-level latencies while running audio compute alongside other GPU workloads.
Participants will receive access to the workshop codebase and development environment, enabling them to replicate the setup on their own machines and continue experimenting with real-time, low-latency audio processing. By the end of the session, attendees will have a solid foundation in GPU-accelerated digital signal processing for modern audio applications.
This guide was created with the help of AI, based on the presentation's transcript. Its goal is to give you useful context and background so you can get the most out of the session.
What this presentation is about and why it matters
This talk demonstrates how to run neural audio source separation in real time by moving the entire pipeline onto GPUs and exposing a practical SDK and processor model. It covers both the algorithmic choices that make low-latency separation possible (choosing hybrid/time-domain models like TasNet and other low-lookahead networks) and the engineering challenges of building a predictable, multi-instance GPU scheduler for audio workloads.
For DSP engineers and developers, this matters because source separation is useful in music production, restoration, spatialization, hearing-assistive tech, and teleconferencing. The difference between an offline, high-quality model and a real-time model is both algorithmic (how much future context the model needs) and systems-level (how to feed small buffers, manage overlap, preserve state, and schedule many short GPU kernels with low jitter). This presentation links both worlds.
Who will benefit the most from this presentation
- DSP engineers and audio plugin developers interested in migrating heavy audio ML to GPUs.
- Machine‑learning engineers who want to understand latency vs. quality tradeoffs in audio models.
- Systems programmers building low‑latency media pipelines (buffer/overlap management, scheduling).
- Researchers wanting a practical example of turning open‑source models into real‑time processors.
What you need to know
These are the core concepts and simple formulas that will help you follow the talk.
Time vs frequency representations
Most separation networks work on time‑domain waveforms, frequency‑domain frames (STFT), or a hybrid of both. The short‑time Fourier transform (STFT) splits the signal into overlapping frames and transforms each to frequency:
STFT (conceptual): $X(\tau,k)=\sum_n x[n]\,w[n-\tau]\,e^{-j2\pi k n/N}$
The inverse STFT (ISTFT) reconstructs the time waveform using overlap-add plus normalization. Window choice (Hann/Hamming) and normalization across overlapping windows are key to avoiding amplitude modulation.
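As a concrete reference (not taken from the workshop code), here is a minimal NumPy sketch of STFT analysis and overlap-add ISTFT synthesis with a Hann window; dividing by the accumulated squared window is the normalization step that prevents the amplitude modulation mentioned above. The frame length and hop are illustrative choices.

```python
import numpy as np

def stft(x, frame_len=1024, hop=256):
    """Frame the signal with a Hann window and take the FFT of each frame."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)            # shape: (n_frames, frame_len // 2 + 1)

def istft(X, frame_len=1024, hop=256):
    """Overlap-add synthesis; divide by the summed squared window to normalize."""
    win = np.hanning(frame_len)
    n_frames = X.shape[0]
    out = np.zeros(frame_len + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    frames = np.fft.irfft(X, n=frame_len, axis=-1)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + frame_len] += frame * win
        norm[i * hop : i * hop + frame_len] += win ** 2
    return out / np.maximum(norm, 1e-8)

# Round-trip check on a test tone: y approximates x away from the edges.
x = np.sin(2 * np.pi * 440 * np.arange(48000) / 48000)
y = istft(stft(x))
```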
Masking and hybrid approaches
Many networks predict masks on a spectrogram: multiply the mixture spectrogram by a mask for each source and invert. Hybrid models combine time‑domain processing and spectrogram masks to improve perceptual quality while keeping latency manageable.
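To make the masking idea concrete, the sketch below applies per-source multiplicative masks to a mixture spectrogram. In a real separator the network predicts the masks; here a hypothetical "ideal ratio mask" built from known source spectrograms stands in for the prediction, purely as an illustration.

```python
import numpy as np

def apply_masks(mixture_stft, masks):
    """Apply one multiplicative mask per source to the mixture spectrogram.

    mixture_stft: complex array, shape (n_frames, n_bins)
    masks:        real array,    shape (n_sources, n_frames, n_bins), values in [0, 1]
    Each masked spectrogram is then sent through the ISTFT to get a time-domain estimate.
    """
    return masks * mixture_stft[None, :, :]

def ideal_ratio_masks(source_stfts, eps=1e-8):
    """Oracle masks from known sources, standing in for a network's prediction."""
    mags = np.abs(source_stfts)                     # (n_sources, n_frames, n_bins)
    return mags / (mags.sum(axis=0, keepdims=True) + eps)

# Toy usage: two synthetic "sources" and their mixture in the STFT domain.
rng = np.random.default_rng(0)
sources = rng.standard_normal((2, 100, 513)) + 1j * rng.standard_normal((2, 100, 513))
mixture = sources.sum(axis=0)
estimates = apply_masks(mixture, ideal_ratio_masks(sources))   # (2, 100, 513)
```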
Latency, lookahead and buffers
Lookahead is how much future audio the model needs before emitting output. High-quality offline models can use seconds of lookahead (Demucs v4 uses ~8 s), which is impractical for live use. Low-latency designs target tens of milliseconds of lookahead. Real systems also handle overlapping frames: with frame length $N$ and hop $H$, each new input of $H$ samples is combined with the previous $N-H$ samples to form the $N$-sample analysis window.
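A minimal sketch of that buffer bookkeeping, assuming an $H$-sample input buffer per callback (the class and parameter names are illustrative, not from the GPU AUDIO SDK):

```python
import numpy as np

class FrameAssembler:
    """Keep the last N - H samples so each new H-sample buffer yields a full N-sample frame."""

    def __init__(self, frame_len=1024, hop=256):
        assert frame_len >= hop
        self.frame_len = frame_len
        self.hop = hop
        self.history = np.zeros(frame_len - hop)    # tail of previously received audio

    def push(self, new_samples):
        """new_samples has length == hop; returns the current N-sample analysis window."""
        frame = np.concatenate([self.history, new_samples])
        self.history = frame[self.hop:]             # keep the newest N - H samples
        return frame

# Output is delayed by at least N - H samples plus whatever lookahead the model needs.
assembler = FrameAssembler(frame_len=1024, hop=256)
window = assembler.push(np.zeros(256))              # shape (1024,)
```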
LSTM statefulness and real‑time processing
LSTM layers provide temporal memory and are often used to compensate for short lookahead. Running LSTMs in real time means preserving cell/hidden states between small input chunks and ensuring deterministic execution order on the GPU.
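The snippet below illustrates this with plain PyTorch: a single LSTM layer stands in for the recurrent core of a separator, and the hidden/cell state is carried across chunks so that streaming small buffers matches one long pass. This is a generic illustration, not the workshop's processor code.

```python
import torch

# Stand-in for the recurrent part of a low-lookahead separation model (sizes are arbitrary).
lstm = torch.nn.LSTM(input_size=513, hidden_size=256, batch_first=True)
lstm.eval()

def process_stream(feature_chunks):
    """Run small chunks through the LSTM while carrying (h, c) across calls.

    Re-initializing the state every chunk would erase temporal context and
    audibly degrade separation; the state must persist between buffers.
    """
    state = None                                    # (h, c), created lazily on the first call
    outputs = []
    with torch.no_grad():
        for chunk in feature_chunks:                # chunk: (1, frames_per_chunk, 513)
            y, state = lstm(chunk, state)
            outputs.append(y)
    return torch.cat(outputs, dim=1)

# Streaming many short chunks should match one long pass, up to numerical noise.
chunks = [torch.randn(1, 4, 513) for _ in range(8)]
streamed = process_stream(chunks)
with torch.no_grad():
    full, _ = lstm(torch.cat(chunks, dim=1))
print((streamed - full).abs().max())                # ~0: chunked and full passes agree
```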
Performance metric you will hear
Real-time ratio: (length of audio processed) / (wall-clock processing time). A ratio > 1 means faster than real time; < 1 is too slow for live use. The talk compares CPU runs, a baseline PyTorch GPU run, and a tuned GPU implementation whose real-time ratio is several times greater than 1.
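For reference, a tiny sketch of how such a ratio can be measured around any processing callable (the timing harness is illustrative, not part of the GPU AUDIO tooling):

```python
import time
import numpy as np

def real_time_ratio(process_fn, audio, sample_rate=48000):
    """(seconds of audio processed) / (wall-clock seconds spent processing)."""
    start = time.perf_counter()
    process_fn(audio)
    elapsed = time.perf_counter() - start
    return (len(audio) / sample_rate) / elapsed

# Example with a trivial "processor"; a ratio > 1 means faster than real time,
# e.g. 4.0 means one second of audio is handled in 250 ms of wall-clock time.
ratio = real_time_ratio(lambda x: np.fft.rfft(x), np.random.randn(48000 * 10))
```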
Glossary
- STFT: Short‑time Fourier transform; framewise Fourier analysis of the waveform.
- ISTFT: Inverse STFT; overlap‑add synthesis with window normalization.
- Spectrogram: Magnitude (and optionally phase) representation across time and frequency.
- Masking: Multiplicative selection applied to a spectrogram to isolate sources.
- LSTM: Recurrent neural unit (long short‑term memory) that keeps internal state for temporal context.
- TasNet: A family of time‑domain separation networks designed for low latency.
- Demucs v4: A high‑quality hybrid time‑frequency separator with multi‑second lookahead (offline oriented).
- SDR: Source‑to‑distortion ratio; a common objective metric for separation quality.
- Lookahead / Latency: Amount of future audio the model needs and the resulting delay of output.
- Real‑time ratio: Throughput metric = (audio processed) / (processing time); >1 is real‑time capable.
Closing notes — why you should watch
This presentation is practical: it mixes model selection, low-level signal handling (STFT, windowing, overlap-add), and systems engineering (GPU scheduling, task splitting, tuning thread/block sizes) in one workshop. The speakers share code and a Jupyter environment, demonstrating not just theory but a working pipeline and notes on the tradeoffs between quality and latency. If you build audio plugins or real-time ML systems, or are curious about squeezing small-buffer audio onto GPUs, you will find both motivation and actionable guidance here.
Finally, the talk bridges algorithmic intuition and engineering practice: it shows how a careful choice of model (favoring low‑lookahead hybrids or TasNet variants), plus attention to buffer/state management and GPU tasking, makes real‑time source separation achievable. That combination is exactly what practitioners need — enjoy the workshop and bring your questions.
