[FFmpeg-devel] [PATCH] Some ARM VFP optimizations (vector_fmul, vector_fmul_reverse, float_to_int16)

Mon Apr 21 00:11:57 CEST 2008

On Sun, Apr 20, 2008 at 11:18:31PM +0300, Siarhei Siamashka wrote:
> On Sunday 20 April 2008, Michael Niedermayer wrote:
> 
> [...]
> 
> > > +static void vector_fmul_vfp(float *dst, const float *src, int len)
> > > +{
> > > +    int tmp;
> > > +    asm volatile(
> > > +        "fmrx       %[tmp], fpscr\n\t"
> > > +        "orr        %[tmp], %[tmp], #(3 << 16)\n\t" /* set vector size to 4 */
> > > +        "fmxr       fpscr, %[tmp]\n\t"
> > > +
> > > +        "fldmias    %[src1]!, {s0-s3}\n\t"
> > > +        "fldmias    %[src2]!, {s8-s11}\n\t"
> > > +        "fldmias    %[src1]!, {s4-s7}\n\t"
> > > +        "fldmias    %[src2]!, {s12-s15}\n\t"
> > > +        "fmuls      s8, s0, s8\n\t"
> > > +    "1:\n\t"
> > > +        "subs       %[len], %[len], #16\n\t"
> > > +        "fmuls      s12, s4, s12\n\t"
> > > +        "fldmiasge  %[src1]!, {s16-s19}\n\t"
> > > +        "fldmiasge  %[src2]!, {s24-s27}\n\t"
> > > +        "fldmiasge  %[src1]!, {s20-s23}\n\t"
> > > +        "fldmiasge  %[src2]!, {s28-s31}\n\t"
> > > +        "fmulsge    s24, s16, s24\n\t"
> > > +        "fstmias    %[dst]!, {s8-s11}\n\t"
> > > +        "fstmias    %[dst]!, {s12-s15}\n\t"
> > > +        "fmulsge    s28, s20, s28\n\t"
> > > +        "fldmiasgt  %[src1]!, {s0-s3}\n\t"
> > > +        "fldmiasgt  %[src2]!, {s8-s11}\n\t"
> > > +        "fldmiasgt  %[src1]!, {s4-s7}\n\t"
> > > +        "fldmiasgt  %[src2]!, {s12-s15}\n\t"
> > > +        "fmulsge    s8, s0, s8\n\t"
> > > +        "fstmiasge  %[dst]!, {s24-s27}\n\t"
> > > +        "fstmiasge  %[dst]!, {s28-s31}\n\t"
> > > +        "bgt        1b\n\t"
> >
> > If the 4 and 8 cycle latencies you mentioned are correct then this has many
> > stalls.
> 
> This code has zero stalls (except on the first iteration). I checked it
> carefully and also verified with oprofile (ARM11 has hardware performance
> counters and can collect pipeline stalls statistics).
> 

> The part you have probably missed is that this code operates on vectors. 

no


> So
> one fmuls* instruction queues 4 multiplies, which get performed one after
> another in the arithmetic pipeline (occupying it for 4 cycles).

That's what I missed; I expected it to do them in parallel, like a real CPU :)
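
To make the point above concrete, here is a minimal C sketch of how a short-vector fmuls behaves once the FPSCR vector length is set to 4: one instruction, four element multiplies issued one after another. The vfp_regs type and the fmuls_vec4 and vector_fmul_ref names are made up for illustration. The model ignores bank wrap-around and only accepts register choices that, as far as I understand the VFP short-vector rules, give a pure vector operation. The scalar reference assumes vector_fmul means dst[i] *= src[i].

```c
#include <assert.h>

/* Hypothetical model of the single-precision register file s0..s31. */
typedef struct { float s[32]; } vfp_regs;

/* Simplified model of "fmuls sd, sn, sm" with vector length 4.
 * The four multiplies are performed sequentially, so the instruction
 * occupies the arithmetic pipeline for four slots.
 * Returns the number of slots consumed. */
int fmuls_vec4(vfp_regs *r, int sd, int sn, int sm)
{
    int slots = 0;

    /* Assumption: a destination or second operand in bank 0 (s0-s7)
     * would make the operation scalar or mixed, so reject those here. */
    assert(sd >= 8 && sm >= 8 && sn >= 0);
    /* No wrap-around inside an 8-register bank in this simplified model. */
    assert(sd % 8 <= 4 && sn % 8 <= 4 && sm % 8 <= 4);
    assert(sd + 4 <= 32 && sn + 4 <= 32 && sm + 4 <= 32);

    for (int i = 0; i < 4; i++) {
        r->s[sd + i] = r->s[sn + i] * r->s[sm + i];
        slots++;
    }
    return slots;
}

/* Scalar reference for the assumed semantics of vector_fmul. */
void vector_fmul_ref(float *dst, const float *src, int len)
{
    for (int i = 0; i < len; i++)
        dst[i] *= src[i];
}
```

With s0-s3 and s8-s11 loaded, fmuls_vec4(&r, 8, 0, 8) corresponds to the "fmuls s8, s0, s8" line in the patch: it overwrites s8-s11 with the four products and reports four pipeline slots.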


[...]

> So the optimization manual from ARM provides only a simplified model
> and can't guarantee exact results. I also tried removing all the
> multiplication instructions and keeping only the load/store operations;
> the performance remained exactly the same (even though counting cycles
> for load/store operations should supposedly be trivial). The final code is
> a result of some 'genetic' variations and taking the fastest version :)
> 
> Oprofile shows that we get a lot of 'LSU_STALL' events, whatever it means. So
> it probably has something to do with some data cache throughput limitation
> which is not mentioned in the manual.

google says:
LSU_STALL: cycles stalled because the Load Store request queue is full
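
A rough tally of the steady-state loop body quoted earlier in this message may help show why the load/store side is the likelier bottleneck. The counts are read off the listing, and the one-slot-per-element figure comes from the description in this thread. The function names are made up for illustration, and nothing here models the real ARM11 load/store queue.

```c
/* Counts per loop iteration, read off the quoted listing
 * (one iteration handles 16 floats, matching "subs len, len, #16"). */
enum {
    VEC_LEN         = 4, /* registers per fldmias/fstmias, elements per fmuls */
    LOADS_PER_ITER  = 8, /* fldmias instructions */
    STORES_PER_ITER = 4, /* fstmias instructions */
    FMULS_PER_ITER  = 4  /* vector fmuls instructions */
};

/* 32-bit words that pass through the load/store unit per iteration. */
int words_moved_per_iter(void)
{
    return (LOADS_PER_ITER + STORES_PER_ITER) * VEC_LEN;
}

/* Arithmetic pipeline slots per iteration, assuming one slot per element. */
int fpu_slots_per_iter(void)
{
    return FMULS_PER_ITER * VEC_LEN;
}

/* Output floats produced per iteration. */
int results_per_iter(void)
{
    return STORES_PER_ITER * VEC_LEN;
}
```

That gives 48 words moved for 16 multiply slots, or three words through the load/store unit per multiply. This is at least consistent with LSU_STALL dominating and with the timing not changing when the multiplies are removed, but it is a tally, not a measurement.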

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

I hate to see young programmers poisoned by the kind of thinking
Ulrich Drepper puts forward since it is simply too narrow -- Roman Shaposhnik