[FFmpeg-devel] [PATCH] Some ARM VFP optimizations (vector_fmul, vector_fmul_reverse, float_to_int16)
Siarhei Siamashka
siarhei.siamashka
Sun Apr 20 22:18:31 CEST 2008
On Sunday 20 April 2008, Michael Niedermayer wrote:
[...]
> > +static void vector_fmul_vfp(float *dst, const float *src, int len)
> > +{
> > + int tmp;
> > + asm volatile(
> > + "fmrx %[tmp], fpscr\n\t"
> > + "orr %[tmp], %[tmp], #(3 << 16)\n\t" /* set vector size to 4 */
> > + "fmxr fpscr, %[tmp]\n\t"
> > +
> > + "fldmias %[src1]!, {s0-s3}\n\t"
> > + "fldmias %[src2]!, {s8-s11}\n\t"
> > + "fldmias %[src1]!, {s4-s7}\n\t"
> > + "fldmias %[src2]!, {s12-s15}\n\t"
> > + "fmuls s8, s0, s8\n\t"
> > + "1:\n\t"
> > + "subs %[len], %[len], #16\n\t"
> > + "fmuls s12, s4, s12\n\t"
> > + "fldmiasge %[src1]!, {s16-s19}\n\t"
> > + "fldmiasge %[src2]!, {s24-s27}\n\t"
> > + "fldmiasge %[src1]!, {s20-s23}\n\t"
> > + "fldmiasge %[src2]!, {s28-s31}\n\t"
> > + "fmulsge s24, s16, s24\n\t"
> > + "fstmias %[dst]!, {s8-s11}\n\t"
> > + "fstmias %[dst]!, {s12-s15}\n\t"
> > + "fmulsge s28, s20, s28\n\t"
> > + "fldmiasgt %[src1]!, {s0-s3}\n\t"
> > + "fldmiasgt %[src2]!, {s8-s11}\n\t"
> > + "fldmiasgt %[src1]!, {s4-s7}\n\t"
> > + "fldmiasgt %[src2]!, {s12-s15}\n\t"
> > + "fmulsge s8, s0, s8\n\t"
> > + "fstmiasge %[dst]!, {s24-s27}\n\t"
> > + "fstmiasge %[dst]!, {s28-s31}\n\t"
> > + "bgt 1b\n\t"
>
> If the 4 and 8 cycle latencies you mentioned are correct then this has many
> stalls.
This code has zero stalls (except on the first iteration). I checked it
carefully and also verified with oprofile (ARM11 has hardware performance
counters and can collect pipeline stall statistics).
The part you have probably missed is that this code operates on vectors. One
fmuls* instruction queues 4 multiplies, which are performed one after another
in the arithmetic pipeline (occupying it for 4 cycles). Each load/store
instruction likewise queues 4 loads or stores, occupying the load/store
pipeline for 2 cycles.
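For reference, the scalar C version this routine replaces computes a simple
element-wise product (this is my paraphrase of FFmpeg's dsputil vector_fmul,
not code from the patch):

```c
#include <stddef.h>

/* Plain-C sketch of what vector_fmul computes: an in-place element-wise
 * product, dst[i] *= src[i].  The VFP version above does the same work,
 * but 4 elements at a time, with the FPSCR vector length set to 4. */
static void vector_fmul_c(float *dst, const float *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] *= src[i];
}
```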
Real benchmarks show that this function processes data using approximately
2 cycles or less per element. The theoretical throughput limit, unreachable
in practice, is 1.5 cycles per element (per element it needs to do two single
precision floating point loads, one multiplication and one single precision
floating point store), with the multiplications completely shadowed by
load/store operations.
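As a back-of-the-envelope check of that 1.5 figure (my arithmetic, under the
assumptions stated above: the load/store pipeline retires 2 single-precision
words per cycle and the multiplies are fully hidden):

```c
/* Per element the loop must move 3 single-precision words (2 loads from
 * the source arrays, 1 store to dst).  With the LSU handling 2 words per
 * cycle (a 4-word fldmias occupies it for 2 cycles), memory traffic alone
 * bounds the loop at 3/2 = 1.5 cycles per element. */
static double lsu_cycles_per_element(double words_per_element,
                                     double words_per_cycle)
{
    return words_per_element / words_per_cycle;
}
```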
To make everything more fun, the ARM optimization manual has the following notice:
"Complex instruction dependencies and memory system interactions make it
impossible to describe briefly the exact cycle timing of all instructions in
all circumstances. The timing shown in Table 4.17 is accurate in most cases.
For precise timing, you must use a cycle-accurate model of the ARM1136JF-S
processor." :)
So the optimization manual from ARM provides only a simplified model and
can't guarantee exact results. I also tried removing all the multiplication
instructions, keeping only the load/store operations; the performance
remained exactly the same (even though counting cycles for load/store
operations alone should supposedly be trivial). The final code is the result
of some 'genetic' variations, taking the fastest version :)
Oprofile shows that we get a lot of 'LSU_STALL' events, whatever that means.
It probably has something to do with some data cache throughput limitation
which is not mentioned in the manual.
Better implementations are surely welcome, all the infrastructure for
testing/benchmarking these functions is available.
[...]
> Also note that there are high level optims which should be done to the
> IMDCT that is merging it with these windowing functions. It might make
> sense to look into this before low level optimizing them.
Sure, but I will leave this stuff to somebody else for now :)
In the short term, I'm more interested in optimizing the DJBFFT library for
ARM VFP to get faster FFT performance. It is part of a big plan :)
--
Best regards,
Siarhei Siamashka