BCM2835 "GPU_FFT" release 3.0 by Andrew Holme, 2015.

GPU_FFT is an FFT library for the Raspberry Pi which exploits the BCM2835 SoC
3D hardware to deliver ten times more data throughput than is possible on the
700 MHz ARM of the Pi 1.  Kernels are provided for all power-of-2 FFT lengths
between 256 and 4,194,304 points inclusive.  A transpose function, which also
uses the 3D hardware, is provided to support 2-dimensional transforms.


*** Accuracy ***

GPU_FFT uses single-precision floats for data and twiddle factors.  The output
is not scaled.  The relative root-mean-square (rms) error in parts-per-million
(ppm) for different transform lengths (N) is typically:

log2(N) |  8    |  9    | 10    | 11    | 12    | 13    | 14    | 15
ppm rms |  0.33 |  0.46 |  0.52 |  0.59 |  0.78 |  0.83 |  0.92 |  0.98

log2(N) | 16    | 17    | 18    | 19    | 20    |  21   | 22
ppm rms |  1.0  |  1.3  |  1.3  |  1.4  |  1.5  |  1.5  |  1.5

Accuracy has improved significantly over previous releases at the expense of a
small (2%) performance hit; however, FFTW is still one order of magnitude more
accurate than GPU_FFT.


*** Throughput ***

GPU_FFT 1.0 had to be invoked through a "mailbox" which added a 100us overhead
on every call.  To mitigate this, batches of transforms could be submitted via
a single call.  GPU_FFT now avoids this 100us overhead by poking GPU registers
directly from the ARM if total batch runtime will be short; but still uses the
mailbox for longer jobs to avoid busy waiting at 100% CPU for too long.

Typical per-transform runtimes for batch sizes of 1 and 10; and comparative
figures for FFTW (FFTW_MEASURE mode) on a Pi 1 with L2 cache enabled are:

log2(N) |   8   |   9   |  10   |  11   |  12  |  13  |  14  |  15
      1 | 0.033 | 0.049 | 0.070 | 0.12  | 0.25 | 0.61 |  1.2 |  3.5
     10 | 0.017 | 0.029 | 0.049 | 0.11  | 0.27 | 0.66 |  1.2 |  3.3
   FFTW | 0.092 | 0.22  | 0.48  | 0.95  | 3.0  | 5.1  | 12   | 31

log2(N) |  16  |  17 |  18 |  19 |   20 |   21 |   22        All times in
      1 |  7.0 |  17 |  43 |  97 |  194 |  388 |  786        milliseconds
   FFTW | 83   | 180 | 560 | 670 | 1600 | 3400 | 8800        2 sig. figs.


*** API functions ***

    gpu_fft_prepare()       Call once to allocate memory and initialise data
                            structures.  Returns 0 for success.

    gpu_fft_execute()       Call one or more times to execute a previously
                            prepared FFT batch.  Returns 0 for success.

    gpu_fft_release()       Call once to release resources after use.
                            GPU memory is permanently lost if not freed.


*** Parameters ***

    int mb          Mailbox file descriptor obtained by calling mbox_open()

    int log2_N      log2(FFT length) = 8 to 22

    int direction   FFT direction:  GPU_FFT_FWD for forward FFT
                                    GPU_FFT_REV for inverse FFT

    int jobs        Number of transforms in batch = 1 or more

    GPU_FFT **      Output parameter from prepare: control structure.
    GPU_FFT *       Input parameter to execute and release


*** Data format ***

Complex data arrays are stored as alternate real and imaginary parts:

    struct GPU_FFT_COMPLEX {
        float re, im;
    };

The GPU_FFT struct created by gpu_fft_prepare() contains pointers to the input
and output arrays:

    struct GPU_FFT {
       struct GPU_FFT_COMPLEX *in, *out;

When executing a batch of transforms, buffer pointers are obtained as follows:

    struct GPU_FFT *fft = gpu_fft_prepare( ... , jobs);
    for (int j=0; j<jobs; j++) {
       struct GPU_FFT_COMPLEX *in  = fft->in  + j*fft->step;
       struct GPU_FFT_COMPLEX *out = fft->out + j*fft->step;

GPU_FFT.step is greater than FFT length because a guard space is left between
buffers for caching and alignment reasons.

GPU_FFT performs multiple passes between ping-pong buffers.  The final output
lands in the same buffer as input after an even number of passes.  Transforms
where log2_N=12...16 use an odd number of passes and the final result is left
out-of-place.  The input data is never preserved.


*** Example program ***

The code that produced the above accuracy and performance figures is included
as a demo with the latest Raspbian distro.  Build and run it as follows:

cd /opt/vc/src/hello_pi/hello_fft
make
sudo ./hello_fft.bin 12

It accepts three optional command-line arguments: <log2_N> <batch> <loops>

The special character device is required for the ioctl mailbox through which
the ARM communicates with the Videocore GPU.


*** With Open GL on Pi 1 ***

GPU_FFT and Open GL will run concurrently on Pi 1 if GPU_FFT is configured not
to use VC4 L2 cache by zeroing a define in file gpu_fft_base.c as follows:

#define GPU_FFT_USE_VC4_L2_CACHE 0 // Pi 1 only: cached=1; direct=0

Overall performance will probably be higher if GPU_FFT and Open GL take turns
at using the 3D hardware.  Since eglSwapBuffers() returns immediately without
waiting for rendering, call glFlush() and glFinish() afterwards as follows:

    for (;;) {
        ....
        eglSwapBuffers(....); // non-blocking call returns immediately
        glFlush();
        glFinish(); // wait until V3D hardware is idle
        ....
        gpu_fft_execute(....); // blocking call
        ....
    }


*** 2-dimensional FFT ***

Please study the hello_fft_2d demo source, which is built and executed thus:

make hello_fft_2d.bin
sudo ./hello_fft_2d.bin

This generates a Windows BMP file: "hello_fft_2d.bmp"

The demo uses a square 512x512 array; however, rectangular arrays are allowed.
The following lines in gpu_fft_trans.c will do what is safe:

    ptr.arm.uptr[6] = src->x < dst->y? src->x : dst->y;
    ptr.arm.uptr[7] = src->y < dst->x? src->y : dst->x;

One may transpose the output from the second FFT pass back into the first pass
input buffer, by preparing and executing a second transposition; however, this
is probably unnecessary.  It depends on how the final output will be accessed.