Signal Processing Formulations of Sequence Models
Julius O. Smith III - DSP Online Conference 2024 - Duration: 01:08:59
Today’s machine-learning sequence models (such as large language models) arose from a blend of principle-based design and empirical discovery, spanning several fields. This talk describes how many of the main ideas could have emerged from an elementary signal-processing approach. This viewpoint offers several benefits:
- Signal-processing practitioners can quickly learn, in a well-motivated way, what these models are doing
- Machine-learning experts might benefit from signal-processing insights
- Obvious suggestions for things to try next naturally arise
This guide was created with the help of AI, based on the presentation's transcript. Its goal is to give you useful context and background so you can get the most out of the session.
What this presentation is about and why it matters
This talk re-derives many modern sequence-modeling building blocks (RNNs, gated units, state-space models, and attention) from a signal-processing perspective. Instead of starting from machine-learning heuristics, the speaker shows how familiar DSP concepts — one-pole (IIR) filters, FIR filters (delay lines), inner products, and normalization — can motivate the same components used in large language models and recent hybrid architectures (e.g., Mamba, RWKV, SSM-based nets).
Why this matters: engineers who understand the DSP viewpoint get intuition about memory duration, interference, normalization, and gating. Those insights give quantitative rules-of-thumb (for example, how model dimension limits recoverable items from a vector sum) and suggest concrete experiments or architecture tweaks (gating strategies, where to place attention vs. recurrence, state expansion). If you design or analyze sequence systems, this talk connects principled signal‑processing thinking to state‑of‑the‑art neural architectures.
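As a flavor of those rules-of-thumb, here is a minimal numpy sketch (an illustration added for this guide, not code from the talk) of the high-dimensional orthogonality fact behind them: random unit vectors have expected squared dot product of about 1/N, so reading one item out of a superposition of K items incurs interference that grows roughly like K/N. The dimension N and item count K below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_unit(n):
    """Random unit vector in R^n."""
    v = rng.standard_normal(n)
    return v / np.linalg.norm(v)

N = 1024   # model dimension (illustrative value)
K = 64     # number of items superposed in one memory vector

items = np.stack([rand_unit(N) for _ in range(K)])
memory = items.sum(axis=0)            # superposed vector memory

# Read item 0 back out by matched filtering (inner product):
score = items[0] @ memory             # ~1 (signal) plus interference
interference = score - 1.0

# Expected squared dot product between two random unit vectors is ~1/N,
# so the interference variance with K superposed items is ~(K - 1)/N.
pair_dots = items[1:] @ items[0]
print("mean squared pairwise dot:", np.mean(pair_dots**2), "vs 1/N =", 1 / N)
print("readout score:", score, " interference:", interference)
```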
Who will benefit the most from this presentation
- Signal-processing engineers curious about how DSP ideas map to sequence models and LLM internals.
- Machine-learning practitioners who want principled intuition for gating, normalization, and memory limitations.
- Researchers designing hybrid architectures (IIR + FIR) or experimenting with state-space / recurrent layers.
- Students learning sequence models who already know linear systems, convolution, and basic linear algebra.
What you need to know
To get the most out of the talk, be comfortable with these core concepts:
- Vector inner products & matched filtering: detecting whether a known pattern is present in a sum of vectors via dot products. This is the basic detection mechanism used to read items out of a vector memory.
- High-dimensional orthogonality: random unit vectors in N dimensions are approximately orthogonal; the expected squared dot product between two random unit vectors scales like 1/N. This underpins why embeddings can be retrieved from sums if the dimension is large enough.
- One-pole IIR filter as associative memory: a running, exponentially weighted vector sum with forgetting factor p stores recent vectors; after N steps, the contribution of the k-th vector decays like p^{N-k}. The effective squared-length term for the interference then leads to factors like (1 - p^{2N})/(1 - p^2) · (1/N) when computing expected interference. (A minimal sketch of this memory and its matched-filter readout follows this list.)
- FIR filters / delay lines and attention: an FIR filter with M taps gives direct access to the most recent M inputs (the context buffer). Relevance weights for the taps can be computed from inner products between a query vector and down-projected keys (the transformer Q-K-V view); see the attention sketch after this list.
- Gating and state expansion: gating (input gate, forget/reset gate, output gate) arises naturally by letting the recurrence gains depend on the input; a small gating sketch follows this list. State expansion (up-projection) increases internal capacity and is equivalent to using a larger state vector in a state-space model.
- Normalization and residuals: RMS normalization keeps vectors on the hypersphere; skip connections (residual streams) preserve gradient flow and prevent any single layer from blocking information. (A short RMSNorm-plus-residual sketch closes the examples after this list.)
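First, a minimal sketch of the one-pole vector memory and its matched-filter readout. The dimension N, sequence length T, and forgetting factor p are illustrative values, not numbers from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)

def rand_unit(n):
    v = rng.standard_normal(n)
    return v / np.linalg.norm(v)

N, T = 512, 50    # model dimension and sequence length (illustrative)
p = 0.9           # forgetting factor of the one-pole recurrence

x = [rand_unit(N) for _ in range(T)]   # embeddings arriving over time

# One-pole IIR recurrence: s[k] = p * s[k-1] + x[k]
s = np.zeros(N)
for xk in x:
    s = p * s + xk

# After T steps, the item stored j steps ago carries weight p^j.
# Matched-filter readout (inner product) recovers roughly that weight,
# plus interference from the other stored vectors.
j = 3
score = x[T - 1 - j] @ s
print("readout:", score, " expected weight p^j =", p**j)
```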
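Next, a correspondingly minimal single-head attention sketch over an FIR context buffer. The random projection matrices stand in for learned weights, and all dimensions are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

N, M, d = 256, 16, 32   # model dim, number of FIR taps (context length), head dim
X = rng.standard_normal((M, N))   # the last M embeddings (the delay-line contents)

# Down-projections for queries, keys, values (random here, learned in practice)
Wq = rng.standard_normal((N, d)) / np.sqrt(N)
Wk = rng.standard_normal((N, d)) / np.sqrt(N)
Wv = rng.standard_normal((N, N)) / np.sqrt(N)

q = X[-1] @ Wq           # query from the newest input
K = X @ Wk               # one key per tap in the buffer
V = X @ Wv               # one value per tap

# Relevance of each tap = scaled inner product of the query with its key
scores = K @ q / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()           # softmax over the M taps

# Output = relevance-weighted sum of values (data-dependent FIR taps)
y = weights @ V
print("tap weights:", np.round(weights, 3))
```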
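A sketch of gating as an input-dependent gain on the same one-pole recurrence. The scalar forget gate and its weight vector Wf are illustrative stand-ins for a learned gate; state expansion would simply make the recurring state larger than the embedding.

```python
import numpy as np

rng = np.random.default_rng(3)

N = 128
Wf = rng.standard_normal(N) / np.sqrt(N)   # illustrative gate weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_step(s, x):
    """One step of a one-pole recurrence whose forgetting factor depends on
    the input: a simple scalar forget gate in (0, 1), driven by the input."""
    f = sigmoid(Wf @ x)
    return f * s + (1.0 - f) * x

s = np.zeros(N)
for _ in range(20):
    s = gated_step(s, rng.standard_normal(N))
print("state norm after 20 gated steps:", np.linalg.norm(s))
```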
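Finally, a sketch of RMS normalization inside a pre-norm residual block; the random linear layer is just a placeholder for any sublayer.

```python
import numpy as np

def rms_norm(x, gain=1.0, eps=1e-8):
    """Scale x so its root-mean-square over the model dimension is ~1,
    i.e., place it on a hypersphere of radius sqrt(N)."""
    return gain * x / np.sqrt(np.mean(x**2) + eps)

def residual_block(x, layer):
    """Pre-norm residual: the layer sees a normalized copy, and its output
    is added back to the residual stream, so information always passes through."""
    return x + layer(rms_norm(x))

# Illustrative usage with a random linear "layer":
rng = np.random.default_rng(4)
N = 64
W = rng.standard_normal((N, N)) / np.sqrt(N)
x = rng.standard_normal(N)
y = residual_block(x, lambda v: v @ W)
print(np.linalg.norm(x), np.linalg.norm(y))
```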
Glossary
- Sequence model: a model that maps sequences to sequences (e.g., language models), often built from recurrent or attention blocks.
- IIR (Infinite Impulse Response): filters with feedback (poles); in the talk, one-pole vector recurrences implement associative memory.
- FIR (Finite Impulse Response): filters implemented with a finite delay line (taps); used to model attention/context buffers.
- State-space model (SSM): representation with matrices A, B, C, D describing internal state evolution and input/output mappings; structured SSMs are used in recent sequence nets.
- Gating: multiplicative control (input/forget/output gates) that enables selective storage, reset, or suppression of signals.
- RMSNorm: normalization that fixes the root-mean-square magnitude across the model dimension (the L2 norm up to a factor of √N), putting vectors on a hypersphere.
- Embedding: a vector representation of a token or primitive (word, syllable, patch) used as the signal-space for processing.
- Attention (QKV): mechanism that computes weighted sums of values using relevance scores from queries and keys (implemented via down-projections and dot products).
- Orthogonality (high-D): in large dimensions, random vectors have small pairwise dot products (expected squared dot product ~1/N), which limits interference in vector sums.
- Residual / skip connection: direct passthrough that adds layer outputs to a running residual stream, aiding gradient flow and robustness.
Parting note
This presentation is a clear, principled bridge between classic DSP intuition and modern sequence-model practice. The speaker methodically shows how filters, delay lines, normalization, and gating lead naturally to the RNN/SSM and attention components you see in LLMs — and he gives practical, quantitative takeaways (memory bounds, when to normalize, where to gate). If you like seeing familiar signal-processing tools used to explain and motivate cutting‑edge architectures, this talk rewards close listening and sparks many useful ideas to try in experiments.
