[FFmpeg-devel] [PATCH] Some ARM VFP optimizations (vector_fmul, vector_fmul_reverse, float_to_int16)
Siarhei Siamashka
siarhei.siamashka
Sun Apr 20 22:18:31 CEST 2008
On Sunday 20 April 2008, Michael Niedermayer wrote:
[...]
> > +static void vector_fmul_vfp(float *dst, const float *src, int len)
> > +{
> > + int tmp;
> > + asm volatile(
> > + "fmrx %[tmp], fpscr\n\t"
> > + "orr %[tmp], %[tmp], #(3 << 16)\n\t" /* set vector size to 4 */
> > + "fmxr fpscr, %[tmp]\n\t"
> > +
> > + "fldmias %[src1]!, {s0-s3}\n\t"
> > + "fldmias %[src2]!, {s8-s11}\n\t"
> > + "fldmias %[src1]!, {s4-s7}\n\t"
> > + "fldmias %[src2]!, {s12-s15}\n\t"
> > + "fmuls s8, s0, s8\n\t"
> > + "1:\n\t"
> > + "subs %[len], %[len], #16\n\t"
> > + "fmuls s12, s4, s12\n\t"
> > + "fldmiasge %[src1]!, {s16-s19}\n\t"
> > + "fldmiasge %[src2]!, {s24-s27}\n\t"
> > + "fldmiasge %[src1]!, {s20-s23}\n\t"
> > + "fldmiasge %[src2]!, {s28-s31}\n\t"
> > + "fmulsge s24, s16, s24\n\t"
> > + "fstmias %[dst]!, {s8-s11}\n\t"
> > + "fstmias %[dst]!, {s12-s15}\n\t"
> > + "fmulsge s28, s20, s28\n\t"
> > + "fldmiasgt %[src1]!, {s0-s3}\n\t"
> > + "fldmiasgt %[src2]!, {s8-s11}\n\t"
> > + "fldmiasgt %[src1]!, {s4-s7}\n\t"
> > + "fldmiasgt %[src2]!, {s12-s15}\n\t"
> > + "fmulsge s8, s0, s8\n\t"
> > + "fstmiasge %[dst]!, {s24-s27}\n\t"
> > + "fstmiasge %[dst]!, {s28-s31}\n\t"
> > + "bgt 1b\n\t"
>
> If the 4 and 8 cycle latencies you mentioned are correct then this has many
> stalls.
This code has zero stalls (except on the first iteration). I checked it
carefully and also verified with oprofile (ARM11 has hardware performance
counters and can collect pipeline stall statistics).
The part you have probably missed is that this code operates on vectors. One
fmuls* instruction queues 4 multiplies, which are performed one after another
in the arithmetic pipeline (occupying it for 4 cycles). Each load/store
instruction likewise queues 4 loads or stores, occupying the load/store
pipeline for 2 cycles.
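For reference, the scalar C version this routine replaces computes a simple
element-wise product (this is my paraphrase of FFmpeg's dsputil vector_fmul,
not code from the patch):

```c
#include <stddef.h>

/* Plain-C sketch of what vector_fmul computes: an in-place element-wise
 * product, dst[i] *= src[i].  The VFP version above does the same work,
 * but 4 elements at a time, with the FPSCR vector length set to 4. */
static void vector_fmul_c(float *dst, const float *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] *= src[i];
}
```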
Real benchmarks show that this function processes data using approximately
2 cycles or less per element. The theoretical throughput limit, unreachable
in practice, is 1.5 cycles per element (per element it needs to do two single
precision floating point loads, one multiplication and one single precision
floating point store), with the multiplications completely shadowed by
load/store operations.
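As a back-of-the-envelope check of that 1.5 figure (my arithmetic, under the
assumptions stated above: the load/store pipeline retires 2 single-precision
words per cycle and the multiplies are fully hidden):

```c
/* Per element the loop must move 3 single-precision words (2 loads from
 * the source arrays, 1 store to dst).  With the LSU handling 2 words per
 * cycle (a 4-word fldmias occupies it for 2 cycles), memory traffic alone
 * bounds the loop at 3/2 = 1.5 cycles per element. */
static double lsu_cycles_per_element(double words_per_element,
                                     double words_per_cycle)
{
    return words_per_element / words_per_cycle;
}
```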
To make everything more fun, the ARM optimization manual has the following notice:
"Complex instruction dependencies and memory system interactions make it
impossible to describe briefly the exact cycle timing of all instructions in
all circumstances. The timing shown in Table 4.17 is accurate in most cases.
For precise timing, you must use a cycle-accurate model of the ARM1136JF-S
processor." :)
So the optimization manual from ARM provides only a simplified model and
can't guarantee exact results. I also tried removing all the multiplication
instructions, keeping only the load/store operations; the performance
remained exactly the same (even though counting cycles for load/store
operations alone should supposedly be trivial). The final code is the result
of some 'genetic' variations, taking the fastest version :)
Oprofile shows that we get a lot of 'LSU_STALL' events, whatever that means.
It probably has something to do with some data cache throughput limitation
which is not mentioned in the manual.
Better implementations are surely welcome, all the infrastructure for
testing/benchmarking these functions is available.
[...]
> Also note that there are high level optims which should be done to the
> IMDCT that is merging it with these windowing functions. It might make
> sense to look into this before low level optimizing them.
Sure, but I will leave this stuff to somebody else for now :)
In the short term, I'm more interested in optimizing the DJBFFT library for
ARM VFP to get faster FFT performance. It is part of a big plan :)
--
Best regards,
Siarhei Siamashka