Performance Engineering for Modern DSP
Jatin Chowdhury - DSP Online Conference 2025 - Duration: 43:44
There have been significant changes in computer hardware over the past two decades; however, most publicly available advice on performance engineering and optimization for DSP software is unchanged from the late '90s and early 2000s.
This talk will discuss some of the features available in modern CPUs and DSPs, including SIMD vectorization, execution pipelines, and hardware caching, as well as advice on how to optimize DSP code with those features in mind.
This guide was created with the help of AI, based on the presentation's transcript. Its goal is to give you useful context and background so you can get the most out of the session.
What this presentation is about and why it matters
This talk surveys modern performance engineering for digital signal processing (DSP). It explains how contemporary CPUs and DSPs — with multi‑level caches, wide SIMD units, deep pipelines, and many cores — change the old rules of optimization. Instead of low‑level bit tricks and hand‑written assembly, the talk shows practical ways to make DSP code fast and predictable by thinking about memory, vectorization, and instruction‑level parallelism.
Why this matters: in real‑time DSP (audio plugins, instrument effects, live systems) you have hard deadlines measured in microseconds or milliseconds. Poor use of memory or lots of unpredictable branching can cause glitches, limit how many effects you can run, and prevent interesting algorithms from being practical. Understanding the modern hardware model unlocks much higher utilization of your machine and opens creative possibilities (e.g., running many filter instances, using higher‑order approximations, or larger neural nets in real time).
Who will benefit the most from this presentation
- Audio and music software developers working on real‑time plugins and effects.
- Embedded DSP engineers and system designers who need predictable latency and throughput.
- Students and early‑career engineers who want practical rules for writing fast DSP code.
- Any developer curious about when to favor computation over memory accesses, how to exploit SIMD, and how to reason about pipelines and branching.
What you need to know
No deep prior knowledge is required, but the talk moves quickly through a number of hardware‑centric ideas. To get the most out of it, be comfortable with:
- Basic DSP algorithms: filters, FFTs, waveshapers, and the notion of sample blocks and processing deadlines (e.g., 48 kHz, block sizes, and per‑block runtime budgets).
- Basic computer architecture terms: CPU core, cache, main memory, and the idea that accessing memory has dramatically different latencies depending on where the data lives.
- Elementary programming concepts: loops, function calls, branching, and why predictable control flow matters for performance.
- Single‑precision vs. double‑precision floating point, and why the choice of precision affects register usage, throughput, and vector width.
Useful background to follow the examples: a sense of what SIMD/vector instructions are (doing the same arithmetic on multiple values at once), and the difference between cache hits and misses. The speaker assumes you understand why a cache miss can cost many CPU cycles and why that can be more expensive than several arithmetic operations.
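To make the per‑block runtime budgets mentioned above concrete, here is a quick back‑of‑the‑envelope sketch in C++ (the sample rate and block sizes are illustrative choices, not figures from the talk):

```cpp
#include <cstdio>

int main()
{
    // At a sample rate of 48 kHz, a block of N samples must be fully processed
    // in N / 48000 seconds, or the audio glitches (illustrative numbers only).
    const double sampleRate = 48000.0;
    const int blockSizes[] = { 32, 256, 1024 };

    for (int n : blockSizes)
    {
        const double budgetMs = 1000.0 * n / sampleRate;
        std::printf("block size %4d -> budget %6.3f ms\n", n, budgetMs);
    }
    // Prints roughly: 32 -> 0.667 ms, 256 -> 5.333 ms, 1024 -> 21.333 ms.
    // A single trip to main memory costs on the order of 100 ns, so a few
    // thousand cache misses per block already consume a visible slice of the
    // smallest budgets, which is why memory behavior matters so much here.
    return 0;
}
```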
Glossary
- Cache (L1/L2/L3) — Small, fast memory layers on the CPU. L1 is per‑core and fastest; L3 is larger and often shared. Latency increases with level.
- Memory hierarchy — The layered model (registers → L1 → L2 → L3 → RAM → disk) where each level trades size for speed.
- SIMD / Vectorization — Executing one instruction on multiple data elements (e.g., 4 or 8 floats at once) to improve throughput; a short code sketch follows this glossary.
- Auto‑vectorizer — Compiler machinery that tries to transform scalar loops into SIMD code automatically; useful but fragile.
- Prefetcher — Hardware that speculatively loads sequential data into cache to hide memory latency when access patterns are predictable.
- Cache locality — The property that nearby or recently used data is likely to be reused soon; good locality reduces cache misses.
- Instruction pipeline — Stages an instruction passes through (fetch, decode, execute, write‑back). Stalls and bubbles reduce throughput.
- Branch predictor — Hardware that guesses the direction of branches to keep the pipeline fed; mispredictions are costly.
- Out‑of‑order execution — CPU feature that reorders independent instructions at runtime to keep execution units busy despite data dependencies.
- Scratch vs persistent memory — Scratch: temporary buffers you can reuse (good for cache); persistent: long‑lived state (may increase working set and cause misses).
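To tie several of these terms together, here is a minimal sketch (not code from the presentation) of the same gain‑and‑mix loop written as plain scalar code and again with x86 SSE intrinsics; with optimizations enabled, many compilers' auto‑vectorizers will generate something similar from the scalar version on their own:

```cpp
#include <xmmintrin.h>  // x86 SSE intrinsics: 4 single-precision floats per register

// Scalar version: branch-free, with purely sequential memory access, which keeps
// both the auto-vectorizer and the hardware prefetcher happy.
void mixScalar(float* out, const float* in, float gain, int numSamples)
{
    for (int i = 0; i < numSamples; ++i)
        out[i] += gain * in[i];
}

// Hand-written SSE version: the same loop, 4 samples per iteration.
// Assumes numSamples is a multiple of 4 and both buffers are 16-byte aligned.
void mixSSE(float* out, const float* in, float gain, int numSamples)
{
    const __m128 vGain = _mm_set1_ps(gain);
    for (int i = 0; i < numSamples; i += 4)
    {
        const __m128 vIn  = _mm_load_ps(in + i);   // load 4 input samples
        const __m128 vOut = _mm_load_ps(out + i);  // load 4 output samples
        _mm_store_ps(out + i, _mm_add_ps(vOut, _mm_mul_ps(vGain, vIn)));
    }
}
```

The alignment and multiple‑of‑four assumptions are exactly the kind of constraint that decides whether a loop vectorizes cleanly; any leftover "tail" samples are typically handled with a short scalar loop.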
Final thoughts
This is a practical, approachable talk: the speaker balances hardware realities with pragmatic advice for DSP engineers. He emphasizes rules that matter today — think about memory first, favor algorithms that vectorize, and be explicit about performance‑critical code instead of relying blindly on compilers. If you write audio, real‑time systems, or computational DSP, this presentation will give you mental models and concrete directions you can apply immediately. It's a friendly and realistic guide to getting far more out of modern hardware without descending into opaque micro‑optimization tricks.
Great presentation, thank you! A question - can you recommend one or more books on learning SIMD and/or algorithms that use SIMD vectorization?

I'm glad you enjoyed it!
For working with SIMD in general, I mostly learned from reading blogs and watching presentations... here are a few that I can recommend:
- WolfSound: https://youtu.be/XiaIbmMGqdg?si=HoCAJv3wVO1dpavx
- Jamie Pond at the Audio Developer Conference: https://youtu.be/X8dPANPmC7E?si=qZwoTKuKd-zWec4w
- Handmade Hero: https://youtu.be/1CVmlnhgT3g?si=XFXhiKTMtM8b5wxT
For working with SIMD at the assembly level, I can recommend Agner Fog's Assembly Optimization Manual (https://www.agner.org/optimize/#manuals) and the FFmpeg project's assembly lessons (https://github.com/FFmpeg/asm-lessons).
Although I haven't read it, I've also heard good things about Aart Bik's "The Software Vectorization Handbook".
A good algorithm to implement to learn/practice SIMD vectorization is a time-domain FIR filter, since it forces you to think about memory alignment and which "dimension" you want to vectorize along.
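As a rough illustration of that exercise (a simplified sketch, with its assumptions noted in the comments, not code from the talk or this thread), here is a reference scalar FIR and one way to vectorize along the tap dimension with SSE:

```cpp
#include <xmmintrin.h>  // x86 SSE intrinsics

// Reference scalar FIR: y[n] = sum over k of h[k] * x[n - k].
// Two "dimensions" you could vectorize along:
//   1. across taps (the inner k loop, sketched below), or
//   2. across output samples (compute y[n..n+3] together).
// Assumes x points into a buffer with numTaps - 1 samples of history before x[0].
void firScalar(float* y, const float* x, const float* h, int numSamples, int numTaps)
{
    for (int n = 0; n < numSamples; ++n)
    {
        float acc = 0.0f;
        for (int k = 0; k < numTaps; ++k)
            acc += h[k] * x[n - k];
        y[n] = acc;
    }
}

// Option 1 with SSE: vectorize the tap loop, 4 taps at a time.
// hRev holds the coefficients in reverse order (hRev[j] = h[numTaps - 1 - j]),
// which turns the convolution into a forward dot product over a slice of x.
// Assumes numTaps is a multiple of 4, plus the same history requirement as above.
void firSSE(float* y, const float* x, const float* hRev, int numSamples, int numTaps)
{
    for (int n = 0; n < numSamples; ++n)
    {
        const float* xs = x + n - (numTaps - 1);  // oldest sample contributing to y[n]
        __m128 vAcc = _mm_setzero_ps();

        for (int j = 0; j < numTaps; j += 4)
            vAcc = _mm_add_ps(vAcc, _mm_mul_ps(_mm_loadu_ps(hRev + j), _mm_loadu_ps(xs + j)));

        // Horizontal sum of the four partial accumulators.
        float tmp[4];
        _mm_storeu_ps(tmp, vAcc);
        y[n] = tmp[0] + tmp[1] + tmp[2] + tmp[3];
    }
}
```

Vectorizing across output samples instead broadcasts one coefficient at a time and keeps four accumulators live; which choice wins tends to depend on tap count, block size, and buffer alignment.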