Vectorizing IIR Filters: Making Recursive Filters Parallel
Ayan Shafqat - DSP Online Conference 2025 - Duration: 37:27
IIR filters dominate real-time audio, yet their feedback paths make straightforward SIMD implementation difficult. This talk distills practical ways to vectorize IIR filters on mainstream CPUs like ARM and x86 without sacrificing numerical stability.
We explore:
- batch independent channels or filter instances lane-wise;
- restructure multi-band crossovers into a breadth-parallel, tree-like structure so SIMD lanes execute multiple second-order sections (SOS) at once;
- apply algebraic splits (e.g., partial fractions) to evaluate sections concurrently.
We detail coefficient/state packing, transposition-free layouts, and scratch-buffer scheduling that keep vector registers and cache lines full. The result is a repeatable recipe for turning scalar IIR code into high-throughput, energy-efficient SIMD pipelines.
This guide was created with the help of AI, based on the presentation's transcript. Its goal is to give you useful context and background so you can get the most out of the session.
What this presentation is about and why it matters
This talk explains how to turn recursive IIR filters — which are inherently serial because each output depends on previous outputs — into code that runs efficiently on SIMD (single instruction, multiple data) vector hardware. The speaker walks through three practical strategies: 1) mapping independent channels or filter instances across SIMD lanes, 2) restructuring multi-band crossovers into a breadth-parallel (tree) form so lanes process different bands, and 3) using algebraic transforms (partial-fraction expansion / residue calculus) to rewrite cascaded sections into parallel sections.
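A minimal C sketch of strategy 1 (names and the filter choice are illustrative, not taken from the talk): each SIMD lane carries one independent one-pole filter, $y[n] = x[n] + a\,y[n-1]$. Because every lane's feedback stays inside its own slot, the inner loop has no cross-lane dependency and is a direct auto-vectorization candidate.

```c
#include <stddef.h>

#define LANES 4  /* one independent channel per SIMD lane */

/* Run one independent one-pole filter per lane over n frames.
   Each lane's recursion y[n] = x[n] + a*y[n-1] only touches its own
   slot k, so the inner loop maps 1:1 onto vector instructions. */
void onepole_bank(const float x[][LANES], float y[][LANES],
                  float state[LANES], const float a[LANES], size_t n)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t k = 0; k < LANES; ++k)   /* the "vector" loop */
            y[i][k] = state[k] = x[i][k] + a[k] * state[k];
}
```

The same layout generalizes from one-pole filters to banks of biquads; the key property is that the serial dependency runs along the frame axis, never across lanes.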
Why it matters: real-time audio systems (EQs, crossovers, multiband processors) frequently use IIR filters because they are cheap per-sample. But on modern CPUs and embedded cores, throughput and energy efficiency increasingly depend on vectorization. Learning these techniques lets engineers get large SIMD speedups without changing the filter behavior or adding latency — or at least know the trade-offs when numerical issues arise.
Who will benefit the most from this presentation
- DSP engineers and audio developers optimizing real-time processing (EQ, crossovers, multiband compressors).
- Systems programmers who write performance-critical, portable signal-processing kernels in C/C++ using intrinsics.
- Engineers working on embedded audio (ARM NEON, Helium) or desktop/server audio (SSE/AVX) who need to squeeze more throughput from limited cores or power budgets.
- Students learning how algorithm structure affects vectorizability and numerical stability.
What you need to know
Reading or viewing will be more productive if you understand the following concepts at a basic level:
- IIR vs FIR: IIR (infinite impulse response) filters use feedback (past outputs). FIR filters are non‑recursive (only past inputs).
- Second‑order sections / biquads: Higher‑order IIRs are typically implemented as cascades of second‑order sections (biquads) for numerical stability. A common notation is $H(z)=\dfrac{b_0 + b_1 z^{-1} + b_2 z^{-2}}{1 + a_1 z^{-1} + a_2 z^{-2}}$.
- Transposed Direct Form II: A biquad structure preferred in floating point for fewer state variables and better numeric behavior.
- SIMD basics: Vector registers contain multiple lanes; one instruction computes the same operation across lanes. Familiarity with intrinsics, lane packing, and typical register counts (e.g., AArch64 vs x86) helps follow the implementation details.
- Memory layout and packing: For vector throughput you often convert planar audio (buffers per channel) into interleaved lane-aligned vectors (pack/unpack) so loads/stores are full-width and aligned.
- FMA (fused multiply-add): Using FMA reduces instruction count and improves numeric accuracy by avoiding intermediate rounding.
- Partial fraction expansion / residues: A mathematical transform that rewrites a rational transfer function as a sum of first-order terms. Conjugate pairs of complex poles combine into real biquads, enabling evaluation of multiple sections in parallel.
- Numerical caveats: Parallelizing by residues can be ill-conditioned for repeated or closely spaced poles; expect a hybrid approach (some cascades preserved, others parallelized) in practice.
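To make the biquad and Transposed Direct Form II items above concrete, here is a scalar sketch in C (struct and function names are my own, not from the talk). It implements $H(z)$ from the notation above with only two state variables, versus the four delays of Direct Form I.

```c
/* Transposed Direct Form II biquad: two state variables (s1, s2)
   realize H(z) = (b0 + b1 z^-1 + b2 z^-2) / (1 + a1 z^-1 + a2 z^-2),
   with a0 normalized to 1. */
typedef struct {
    double b0, b1, b2, a1, a2;  /* filter coefficients */
    double s1, s2;              /* unit-delay states */
} biquad_tdf2;

double biquad_tdf2_tick(biquad_tdf2 *f, double x)
{
    double y = f->b0 * x + f->s1;
    f->s1 = f->b1 * x - f->a1 * y + f->s2;
    f->s2 = f->b2 * x - f->a2 * y;
    return y;
}
```

With $a_1 = a_2 = 0$ the feedback vanishes and the impulse response is simply $b_0, b_1, b_2$, which is an easy sanity check on the state update.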
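The pack step mentioned above can be sketched as a simple transpose from planar buffers (one array per channel) to lane-interleaved frames; the function name is hypothetical, and real kernels would add alignment and width handling.

```c
#include <stddef.h>

/* Pack `chans` planar channel buffers into lane-interleaved frames:
   interleaved[i*chans + c] holds sample i of channel c, so a vector
   kernel can load one full-width vector per sample index. */
void pack_planar(const float *const *planar, float *interleaved,
                 size_t chans, size_t frames)
{
    for (size_t i = 0; i < frames; ++i)
        for (size_t c = 0; c < chans; ++c)
            interleaved[i * chans + c] = planar[c][i];
}
```

The unpack direction is the same loop with the assignment reversed; in production code both are usually done with vector shuffle/zip instructions rather than scalar stores.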
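As a small illustration of the FMA point above, the biquad output can be computed as a chain of C99 `fma()` calls, each rounding once instead of twice; this is the per-lane pattern that SIMD FMA instructions execute (the function is a sketch, not code from the talk).

```c
#include <math.h>

/* One Direct Form I biquad output as chained fused multiply-adds:
   y = b0*x + b1*x1 + b2*x2 - a1*y1 - a2*y2,
   where x1, x2 are delayed inputs and y1, y2 delayed outputs.
   Each fma() performs a*b + c with a single rounding. */
double biquad_fma(double b0, double b1, double b2,
                  double a1, double a2,
                  double x, double x1, double x2,
                  double y1, double y2)
{
    double acc = fma(b0, x, fma(b1, x1, b2 * x2));
    return fma(-a1, y1, fma(-a2, y2, acc));
}
```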
Glossary
- SIMD lane: One element slot inside a vector register where the same instruction is applied.
- Biquad: A second-order IIR section with two poles and (usually) two zeros; common building block for higher-order filters.
- Transposed Direct Form II: A biquad realization with minimal state and good numeric behavior for floating point.
- FMA: Fused multiply-add instruction that computes a*b + c in one rounding step.
- Partial-fraction expansion: Algebraic decomposition that expresses a rational function as a sum of simpler fractions (residues over poles).
- Residue: The coefficient associated with a pole in a partial-fraction term.
- Horizontal add: An operation that sums the lanes of a vector to produce a scalar.
- Register pressure: When a routine needs more registers than available, forcing spills to memory and hurting performance.
- Pack / unpack: Converting between host buffer layouts (planar) and SIMD-interleaved lane layouts used by vector kernels.
- Balanced crossover tree: A way to order crossover splits so sibling bands are processed together, mapping well to SIMD breadth-parallel execution.
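The partial-fraction, residue, and horizontal-add entries above can be tied together with a toy example (poles and names chosen for illustration): with distinct real poles $p_1 \ne p_2$, the cascade $\frac{1}{(1-p_1 z^{-1})(1-p_2 z^{-1})}$ equals the parallel form $\frac{r_1}{1-p_1 z^{-1}} + \frac{r_2}{1-p_2 z^{-1}}$ with residues $r_1 = \frac{p_1}{p_1-p_2}$, $r_2 = \frac{p_2}{p_2-p_1}$. The two first-order terms share no state, so they can run in separate SIMD lanes with a horizontal add at the end.

```c
/* Impulse response h[n] of the two-pole cascade, run serially:
   section 2 must wait for section 1 each sample. */
double cascade_impulse(double p1, double p2, int n)
{
    double y1 = 0.0, y2 = 0.0, h = 0.0;
    for (int i = 0; i <= n; ++i) {
        double x = (i == 0) ? 1.0 : 0.0;
        y1 = x + p1 * y1;       /* first section */
        y2 = y1 + p2 * y2;      /* second section, depends on y1 */
        h = y2;
    }
    return h;
}

/* Same h[n] via partial fractions: the two one-pole recursions are
   independent (SIMD lanes), combined by a weighted horizontal add. */
double parallel_impulse(double p1, double p2, int n)
{
    double r1 = p1 / (p1 - p2), r2 = p2 / (p2 - p1);  /* residues */
    double s1 = 0.0, s2 = 0.0, h = 0.0;
    for (int i = 0; i <= n; ++i) {
        double x = (i == 0) ? 1.0 : 0.0;
        s1 = x + p1 * s1;       /* lane 1: 1/(1 - p1 z^-1) */
        s2 = x + p2 * s2;       /* lane 2: 1/(1 - p2 z^-1) */
        h = r1 * s1 + r2 * s2;  /* horizontal add across lanes */
    }
    return h;
}
```

The residue formulas above also show where the numerical caveat comes from: as $p_1 \to p_2$ the denominators $p_1 - p_2$ shrink and the residues blow up, which is why closely spaced poles are better left in cascade form.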
