
Non-Powers-of-2 FFTs and Why You Should Use Them

Bradford Watson - DSP Online Conference 2025 - Duration: 48:10


The Fast Fourier Transform (FFT) is a fundamental algorithm in digital signal processing, and it comes in various forms. However, the most commonly used FFTs are those with sizes that are powers of 2 (e.g., 256, 512, 1024, etc.). This is due to their widespread availability, ease of implementation, and ability to achieve an efficiency of O(N log N), making them a popular choice among designers.

While powers-of-2 FFTs are convenient and efficient, they are not always the best choice from a system standpoint: forcing one can lead to complications elsewhere in the system, in the form of inefficient resource use and increased complexity.

Fortunately, there are alternative FFT algorithms that can handle any size of FFT, not just powers of 2.

This talk will describe traditional FFT implementations as well as more exotic FFT algorithms such as:

  • Rader FFT
  • Winograd FFT
  • Prime Factor FFT

It will also cover how to extend Cooley-Tukey factoring to combine powers-of-2 and non-powers-of-2 FFTs, and the four-step FFT.

Several other algorithms will be touched upon, along with their application, advantages, and disadvantages.

This guide was created with the help of AI, based on the presentation's transcript. Its goal is to give you useful context and background so you can get the most out of the session.

What this presentation is about and why it matters

This talk explains why you don't have to be "handcuffed" to radix-2 (power-of-two) FFT sizes and shows practical, efficient alternatives for arbitrary transform lengths. Many real systems (radio front-ends, legacy sampling rates, matched-bandwidth detectors) use sample rates and desired frequency resolutions that don't line up with 2^n sizes. Forcing a power-of-two FFT often costs sensitivity (worse bin matching), extra resampling, wasted compute, or extra noise. The presenter surveys generalizations of Cooley-Tukey (mixed-radix and four-step), the prime-factor algorithm, and lower-level methods (Rader, Winograd), and discusses where each method can reduce multipliers, trade adders for multiplies, or eliminate twiddle factors. If you design DSP systems, firmware for FPGAs/ASICs, or high-performance signal-processing software, these alternatives can save resources and improve detection performance.

Who will benefit the most from this presentation

  • DSP engineers who design spectral processing chains (detectors, channelizers, spectrometers).
  • Hardware implementers writing FFT kernels for FPGAs, ASICs, or DSPs who need resource-efficient layouts.
  • Software engineers optimizing FFT-based systems where memory and cache patterns matter.
  • Students and researchers who want to understand how number theory and algorithm choices affect practical FFT cost and performance.

What you need to know

Familiarity with the Discrete Fourier Transform (DFT), complex exponentials, and the basics of the Cooley-Tukey FFT will make the talk much easier to follow. The N-point DFT is usually written as

$X[k]=\sum_{n=0}^{N-1} x[n]\,e^{-j2\pi kn/N}$
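
Evaluating this sum directly costs O(N^2) operations, which is the baseline every FFT improves on. As a point of reference, here is a minimal direct-evaluation sketch in Python/NumPy (our illustration, not code from the talk; the function name is ours):

```python
import numpy as np

def dft_direct(x):
    # Literal evaluation of the sum above: an N x N matrix of
    # complex exponentials applied to the input vector, O(N^2).
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    return np.exp(-2j * np.pi * k * n / N) @ x

# Works for any N, power of 2 or not.
x = np.random.randn(360) + 1j * np.random.randn(360)
assert np.allclose(dft_direct(x), np.fft.fft(x))
```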

Key background topics that will be referenced:

  • Cooley-Tukey factoring: splitting an N-point DFT into smaller DFTs by factoring N. This yields radix-2, radix-4, and mixed-radix FFTs and explains the usual O(N log N) savings.
  • Twiddle factors: the multiplicative complex weights introduced when combining sub-DFTs. They cost complex multiplies in implementation.
  • Four-step FFT: reorganizes the computation into input sub-FFTs, twiddle multiplications, a matrix transpose (data rearrangement), then output sub-FFTs. This emphasizes memory and transpose costs (a sketch of this organization follows the list).
  • Prime-factor algorithm (Good-Thomas): when N = n1*n2 with n1 and n2 coprime, inputs/outputs can be remapped so twiddle factors disappear; uses number-theoretic index remapping (Chinese Remainder / modular inverses).
  • Rader and Winograd methods: low-level techniques for prime or small composite sizes. Rader turns a prime-length DFT into a cyclic convolution; Winograd uses number theory and cyclotomic structures to minimize multipliers (often trading more additions for fewer multiplies).
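
To make the factoring concrete, here is a minimal Python/NumPy sketch of one Cooley-Tukey step in its four-step organization (again our own illustration, not code from the talk; the length-N1 and length-N2 sub-DFTs are simply delegated to numpy rather than factored further):

```python
import numpy as np

def fft_four_step(x, N1, N2):
    # One Cooley-Tukey step for N = N1*N2: sub-FFTs, twiddles,
    # transpose, sub-FFTs -- the "four-step" organization.
    N = N1 * N2
    assert len(x) == N
    A = x.reshape(N1, N2)                        # A[n1, n2] = x[N2*n1 + n2]
    A = np.fft.fft(A, axis=0)                    # step 1: length-N1 sub-DFTs
    k1 = np.arange(N1).reshape(-1, 1)
    n2 = np.arange(N2)
    A = A * np.exp(-2j * np.pi * k1 * n2 / N)    # step 2: twiddle factors
    A = A.T                                      # step 3: transpose (the memory cost)
    A = np.fft.fft(A, axis=0)                    # step 4: length-N2 sub-DFTs
    # A[k2, k1] flattened row-major is X[k1 + N1*k2], i.e. natural order.
    return A.reshape(N)

x = np.random.randn(360) + 1j * np.random.randn(360)
assert np.allclose(fft_four_step(x, 8, 45), np.fft.fft(x))   # 360 = 8 * 45
```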

Also useful: basic modular arithmetic (modular inverses, Chinese remainder theorem), convolution properties of DFTs, and some understanding of hardware costs (why multiplies are more expensive than adds, and why memory transposes are nontrivial in FPGAs/ASICs).
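
To show how that number theory pays off, here is a minimal Python/NumPy sketch of the Good-Thomas remapping (our illustration; Python's pow(a, -1, m) computes the modular inverse, and the small DFTs are again delegated to numpy). Note that, unlike the four-step sketch above, there is no twiddle stage:

```python
import numpy as np
from math import gcd

def fft_pfa(x, N1, N2):
    # Good-Thomas PFA for N = N1*N2 with gcd(N1, N2) = 1:
    # number-theoretic index remaps remove the twiddle stage entirely.
    N = N1 * N2
    assert len(x) == N and gcd(N1, N2) == 1
    n1 = np.arange(N1).reshape(-1, 1)
    n2 = np.arange(N2)
    A = x[(N2 * n1 + N1 * n2) % N]    # input remap (Good's map)
    A = np.fft.fft(A, axis=0)         # length-N1 DFTs
    A = np.fft.fft(A, axis=1)         # length-N2 DFTs -- no twiddles between!
    M2 = pow(N2, -1, N1)              # N2^{-1} (mod N1)
    M1 = pow(N1, -1, N2)              # N1^{-1} (mod N2)
    k1 = np.arange(N1).reshape(-1, 1)
    k2 = np.arange(N2)
    X = np.empty(N, dtype=complex)
    X[(k1 * N2 * M2 + k2 * N1 * M1) % N] = A   # output remap (CRT map)
    return X

x = np.random.randn(360) + 1j * np.random.randn(360)
assert np.allclose(fft_pfa(x, 8, 45), np.fft.fft(x))   # gcd(8, 45) = 1
```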

Glossary

  • DFT: Discrete Fourier Transform, mapping an N-sample sequence to N frequency bins via complex exponentials.
  • FFT: Fast Fourier Transform, any algorithm that reduces the DFT's O(N^2) cost toward O(N log N).
  • Cooley-Tukey: a general factoring approach that splits the DFT by integer factors of N (radix-2, radix-4, mixed-radix).
  • Twiddle factors: the complex exponential multipliers introduced when combining sub-DFTs in factorization.
  • Mixed-radix: an FFT built from stages of different radices (e.g., radix-2 then radix-4), useful for composite N with varied prime factors.
  • Four-step FFT: a reorganization into sub-FFTs, twiddles, a transpose, and final sub-FFTs; highlights memory layout and transpose cost.
  • Prime-factor algorithm (PFA): an FFT method for coprime factorizations of N that removes twiddle factors via index remapping.
  • Rader's algorithm: converts a prime-length DFT into a cyclic convolution to leverage convolution techniques for efficiency (sketched after this glossary).
  • Winograd's algorithm: uses number-theoretic and cyclotomic constructions to minimize multipliers (often at the cost of more adders) for small transform sizes.
  • Bluestein (chirp Z): a method that computes arbitrary-frequency or arbitrary-length DFTs via chirp convolution; useful when direct factorization is awkward.
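
As a taste of how Rader's trick works, here is a minimal Python/NumPy sketch for a prime length p (our illustration; the brute-force primitive-root search and the FFT-based convolution are just the simplest choices, not what an optimized implementation would do):

```python
import numpy as np

def primitive_root(p):
    # Smallest generator of the multiplicative group mod prime p
    # (brute force -- fine for the small primes Rader targets).
    for g in range(2, p):
        if len({pow(g, m, p) for m in range(1, p)}) == p - 1:
            return g

def dft_rader(x):
    # Rader: a prime-length-p DFT becomes a length-(p-1) cyclic
    # convolution, computed here via the convolution theorem.
    p = len(x)
    g = primitive_root(p)
    ginv = pow(g, -1, p)
    perm = [pow(g, m, p) for m in range(p - 1)]       # k = g^m
    iperm = [pow(ginv, q, p) for q in range(p - 1)]   # n = g^{-q}
    a = x[iperm]                                      # permuted input
    b = np.exp(-2j * np.pi * np.array(perm) / p)      # fixed kernel W_p^{g^m}
    conv = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b))
    X = np.empty(p, dtype=complex)
    X[0] = x.sum()                                    # DC bin handled separately
    X[perm] = x[0] + conv                             # X[g^m] = x[0] + conv[m]
    return X

x = np.random.randn(13) + 1j * np.random.randn(13)    # 13 is prime
assert np.allclose(dft_rader(x), np.fft.fft(x))
```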

When you watch: listen for the practical trade-offs (multiplier vs adder cost, transpose/memory cost, and when remapping removes twiddles), and for the example pipeline (360-point FFT decomposed into 8-, 9-, and 5-point stages) that shows why a non-power-of-two choice can improve detection sensitivity with less total effort than heavy resampling.
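
For what it's worth, 8, 9, and 5 happen to be pairwise coprime, so 360 also admits a pure prime-factor decomposition. The sketch below is our own numerical check in Python/NumPy (not the talk's architecture, which uses rotator stages between sub-FFTs); it confirms that three banks of small DFTs with remapped indices, and no twiddles at all, reproduce the 360-point DFT:

```python
import numpy as np

# 360 = 8 * 9 * 5, pairwise coprime, so a 3-factor Good-Thomas
# decomposition applies: three banks of small DFTs, no twiddles.
N1, N2, N3 = 8, 9, 5
N = N1 * N2 * N3
x = np.random.randn(N) + 1j * np.random.randn(N)

n1 = np.arange(N1).reshape(-1, 1, 1)
n2 = np.arange(N2).reshape(1, -1, 1)
n3 = np.arange(N3).reshape(1, 1, -1)
# Input remap: n = (n1*N/N1 + n2*N/N2 + n3*N/N3) mod N.
A = x[(n1 * (N // N1) + n2 * (N // N2) + n3 * (N // N3)) % N]

# The 8-, 9-, and 5-point stages, one along each axis.
for axis in range(3):
    A = np.fft.fft(A, axis=axis)

# Output remap (CRT): k = sum_i k_i * (N/N_i) * ((N/N_i)^-1 mod N_i).
k1, k2, k3 = n1, n2, n3   # same index ranges, renamed for readability
kmap = (k1 * (N // N1) * pow(N // N1, -1, N1)
        + k2 * (N // N2) * pow(N // N2, -1, N2)
        + k3 * (N // N3) * pow(N // N3, -1, N3)) % N
X = np.empty(N, dtype=complex)
X[kmap] = A
assert np.allclose(X, np.fft.fft(x))
```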


EnesMUTTA
Score: 0 | 1 week ago | 1 reply

Hello Mr Watson
I enjoyed your presentation and would like to thank you.

I am a young FPGA designer and I implement FFT cores for real-time signal processing. The systems I work on sample at 100+ MHz, and the FFT cores I design crunch data in a serial manner (one input/output per clock cycle). Before I watched your presentation, I thought non-powers-of-2 FFTs would be too impractical (costly) for the kind of data loads I handle, but I am starting to wonder if that is not the case :). I can already think of a couple of benefits a non-power-of-2 FFT implementation could provide.

Do you know (maybe from experience) whether the non-power-of-2 approaches you presented can be implemented on FPGAs in a real-time pipelined fashion? If so, can you comment on the memory requirement? For an FFT with a point size N that is a power of 2, the single-path delay feedback (SDF) architecture requires a memory that can fit N-1 samples. Can the pipelined versions of non-power-of-2 FFTs be implemented with a similar amount of memory?

Also, if these pipelined versions exist, I think it should be possible to have run-time-configurable transform-length functionality by simply bypassing some initial stages of the pipeline. For instance, in the FFT360 example @44min: if the FFT8 stage and the following transpose are bypassed and the input samples are fed directly to FFT9, then we should obtain an FFT45 using the remaining structure (with minor modifications to the rotator modules within FFT9 and FFT5). Can you comment on whether this sounds right?
Kind regards

Bradford Watson (Speaker)
Score: 1 | 1 week ago | 1 reply

Hello, thanks for watching the presentation and for your comments.
I have implemented many non-powers-of-2 FFTs in FPGAs with various types of architectures, including the SDF architecture (although this is the first time I've heard it called that, lol). I think the primary challenge you'll have with the memories is that there will always be some non-utilized memory in FPGA block RAM because the number of samples isn't a power of 2. This is less of a problem in ASICs, where you can specify exactly how many memory locations you need. In principle I think it should be similar to the radix-2 FFTs (N-1 samples, as you say), but I haven't looked into it, so I don't know. Also, the bit-reversal trick that's commonly used with radix-2 FFTs isn't likely to be very clean, so you'll need to manage memory pointers.
For your second question, it's perfectly possible to bypass stages and use components of a larger FFT to calculate smaller ones. I'm doing that right now on a project.
Thanks again for your interest and comments.
Brad

EnesMUTTA
Score: 0 | 1 week ago | 1 reply

Hello, thanks for your swift response.
The point sizes I deal with range from 10^4 to 10^6, so the imperfect utilization of some memory primitives can simply be disregarded as rounding error. This is also why, although very neat and clever, the memory requirement of reordering the input samples for the prime factor algorithm sounded to me like a nightmare compared to the cost of the single rotator it saves. I should also admit that I was baffled when you said in your presentation:

"The Prime Factor FFT is useful for large composite FFTs that are non-power-of-2."

Although, I guess the scale of "large" and the throughput requirements change from application to application, and I should not judge it only by my own use cases. I have come across multiple cases where the memory utilization within the FPGA was approaching 90% while the DSP and LUT utilization was barely 10%. I had to come up with some creative solutions for that.
When it comes to natural ordering, you just have to bite the memory cost of it if you don't want to take any penalty to throughput. There is no way around that (as far as I know; I would welcome anyone who points out a clever solution around it). I remember coming across a paper that reduced the memory cost of output reordering by a couple of percent while making it astronomically more complex to manage.
When it comes to reordering the output of a pipelined non-power-of-2 FFT module, it had not occurred to me that it might differ from regular power-of-2 modules before you pointed it out (thanks). Definitely some in-depth work is needed to analyse it and figure out a solution.
Thanks again for taking your time and responding

Bradford Watson (Speaker)
Score: 0 | 1 week ago | no reply

Thanks. Sounds like you have some interesting challenges.
When I said "large" I was thinking more in a practical sense. For example, a 6-point FFT could be composed of a 2-point cascaded with a 3-point. But you could also build a 6-point FFT using the Winograd approach, which would likely perform better. On the other hand, I've built 12-point FFTs using the PFA approach many times, and I wouldn't call them "large." So, maybe I overstated it.