
Recent Advancements In ML-Based Speech Enhancement Techniques

Satheesh PK - DSP Online Conference 2025 - Duration: 02:00:03


Recent advancements in machine learning-based speech enhancement focus on sophisticated deep learning models—including deep neural networks (DNNs), transformer architectures, and generative models—engineered for robust, real-time performance, personalization, and adaptability across varied acoustic environments.

The presentation will address key developments in speech enhancement such as:

  • Quality-optimized and task-specific model architectures
  • Real-time and edge-oriented implementations
  • Multichannel and spatially aware signal processing
  • Generative data augmentation methods
  • Applications of self-supervised and foundation models
  • Unified frameworks supporting both personalized and general speech enhancement

This guide was created with the help of AI, based on the presentation's transcript. Its goal is to give you useful context and background so you can get the most out of the session.

Before you watch: Recent Advancements in ML-Based Speech Enhancement

Welcome — this talk is aimed at practicing engineers and curious learners who want to connect classic signal‑processing intuition with the latest machine‑learning advances for making speech sound and work better in the real world. If you design audio products, build ASR pipelines, or optimize low‑latency embedded systems, the topics here will directly affect the perceptual quality, robustness, and deployability of your systems.

Why this matters in practice

  • Human perception matters: a model can achieve a low numerical error (low MSE) and still sound bad. Perceptual quality, intelligibility, and downstream task performance (like ASR) are the real goals.
  • Real environments are hard: non‑stationary noise, overlapping speakers, and reverberation break assumptions behind classical filters and require richer models.
  • Edge constraints are real: latency, power, and memory shape what models you can ship — not just peak quality in a lab.

Key concepts to have in mind

A short glossary of practical reminders that will make the talk easier to follow:

  • Time–frequency representation (STFT): many systems work on short frames and spectrograms (magnitude and phase). Know why frame length and overlap trade off time vs. frequency resolution and latency.
  • Masking and spectral gains: classical and neural methods often estimate a multiplicative mask applied to the noisy spectrum (see the short sketch after this list). This has been extended to complex masks that modify both magnitude and phase.
  • Phase vs magnitude: phase is no longer optional—phase modeling (or representations like instantaneous frequency) can improve perceived quality, especially at short frames.
  • Perceptual metrics: PESQ, STOI, and ASR word-error rate correlate with perceived quality and intelligibility but are hard to use directly as training losses.
  • Discriminative vs generative approaches: discriminative models predict a target (masks or spectra); generative models (GANs, diffusion) model the full data distribution and can reduce artifacts but bring new evaluation challenges.
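
To make the first two glossary items concrete, here is a minimal, illustrative sketch of the STFT → mask → ISTFT pipeline using plain NumPy/SciPy. It is not the speaker's code; the "oracle" mask below is computed from the clean signal purely for illustration, whereas a real system would have a neural network predict the mask from the noisy input alone.

```python
# Minimal mask-based enhancement sketch: STFT -> multiplicative mask -> ISTFT.
# The oracle mask uses the clean signal (not available in practice); a model
# would predict this mask from noisy features instead.
import numpy as np
from scipy.signal import stft, istft

fs = 16000                    # sample rate (Hz)
nperseg = 512                 # 32 ms frames at 16 kHz
noverlap = 384                # 75% overlap -> 8 ms hop

rng = np.random.default_rng(0)
t = np.arange(fs) / fs
clean = 0.5 * np.sin(2 * np.pi * 220 * t)       # stand-in "speech" signal
noisy = clean + 0.1 * rng.standard_normal(fs)   # additive noise

# Short-time Fourier transforms of both signals (freq bins x frames)
_, _, S = stft(clean, fs=fs, nperseg=nperseg, noverlap=noverlap)
_, _, Y = stft(noisy, fs=fs, nperseg=nperseg, noverlap=noverlap)

# Oracle magnitude mask in [0, 1]; this is the quantity a network estimates
mask = np.clip(np.abs(S) / (np.abs(Y) + 1e-8), 0.0, 1.0)

# Apply the mask to the noisy spectrum (keeps the noisy phase) and invert
_, enhanced = istft(mask * Y, fs=fs, nperseg=nperseg, noverlap=noverlap)
```

Note how the frame length and overlap chosen here directly set both the spectral resolution and the minimum processing delay, which is the time/frequency/latency trade-off mentioned above.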

What background helps most

  • Basic DSP: sampling, windowing, FFT/STFT, and how frame size affects latency and resolution (a small worked example follows this list).
  • Linear algebra / probability: covariance, basic Gaussian noise models, and why beamforming uses covariance matrices.
  • Machine learning fundamentals: supervised losses, gradient descent, and high‑level familiarity with convnets, RNNs/LSTMs, and attention/transformers.
  • Practical deployment concerns: quantization, pruning, and why on‑device vs cloud tradeoffs matter.
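
As a back-of-the-envelope illustration of the frame-size trade-off (assumed example values, not figures from the talk):

```python
# Frame-size vs. latency/resolution arithmetic for 16 kHz audio with
# 32 ms frames and 50% overlap (assumed values for illustration).
fs = 16000                 # sample rate (Hz)
frame = 512                # 512 samples = 32 ms per frame
hop = frame // 2           # 50% overlap -> 16 ms hop

frame_ms = 1000 * frame / fs     # 32.0 ms of context per frame
hop_ms = 1000 * hop / fs         # 16.0 ms between output updates
freq_res_hz = fs / frame         # ~31.25 Hz per FFT bin

# A causal frame-based enhancer cannot emit a frame before it has been
# fully captured, so algorithmic latency is at least one frame length
# (plus any lookahead the model uses).
min_latency_ms = frame_ms
print(frame_ms, hop_ms, freq_res_hz, min_latency_ms)
```

Longer frames buy finer frequency resolution at the cost of latency and time resolution, which is why on-device systems often live with short frames and lean on the model to compensate.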

High‑level map of the talk (no spoilers)

  • Why older approaches fell short — especially the mismatch between training objectives and perceptual goals.
  • How modern neural methods address perceptual quality (e.g., metric‑aware adversarial training) without assuming perfect differentiability.
  • Phase‑aware processing and representations that make phase learnable and stable.
  • Generative models (notably diffusion methods) and when they shine versus discriminative models.
  • Transformer and Conformer ideas for long‑range context, plus U‑Net and hybrid architectures used in practice.
  • Multichannel processing and learned beamforming — how spatial cues change the game.
  • Real‑time, on‑device constraints and the compression techniques that make deployment feasible.
  • Data strategies: generative augmentation, self‑supervision, and foundation models for transfer and personalization.

One small technical preview (optional)

If you encounter the complex ideal ratio mask (cIRM) in the talk, it is simply the elementwise complex ratio of the clean to the noisy spectrum, M(t, f) = S(t, f) / Y(t, f). Neural models often learn to predict this complex mask (its real and imaginary parts) to jointly correct magnitude and phase; a small sketch follows.
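
Here is a minimal NumPy sketch of that definition, for illustration only. It uses toy complex spectrograms and an unbounded mask; in the literature the cIRM is commonly compressed (e.g., with a tanh) before being used as a regression target, a detail omitted here.

```python
# Complex ideal ratio mask (cIRM): elementwise complex ratio of clean to
# noisy STFTs. A network would be trained to predict the real and imaginary
# parts of this mask from the noisy spectrum.
import numpy as np

def cirm(S, Y, eps=1e-8):
    """cIRM M = S / Y, computed per time-frequency bin (eps avoids division by ~0)."""
    return S / (Y + eps)

def apply_mask(M, Y):
    """Enhanced spectrum: elementwise complex multiplication corrects
    magnitude and phase jointly."""
    return M * Y

# Toy complex spectrograms (257 freq bins x 100 frames)
rng = np.random.default_rng(0)
S = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
N = 0.3 * (rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100)))
Y = S + N

M = cirm(S, Y)
S_hat = apply_mask(M, Y)   # recovers S up to the eps regularization
```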

How to get the most out of the session

  • Listen for the design tradeoffs: quality vs latency vs model size; the best solution depends on product constraints.
  • Note how evaluation objectives drive architecture choices — a useful lesson for any engineering decisions you’ll make.
  • Think about which parts could be plugged into your pipeline immediately (e.g., metric‑aware loss, small U‑Net, or learned beamforming) and which are longer‑term (large diffusion models, foundation model distillation).

Enjoy the talk — it’s a concise tour of the current frontier where signal processing principles meet modern generative and sequence models. You’ll leave with practical ideas you can test and a clearer view of where the field is heading next.
