[FFmpeg-devel] [PATCH] Some ARM VFP optimizations (vector_fmul, vector_fmul_reverse, float_to_int16)
Michael Niedermayer
michaelni
Mon Apr 21 00:11:57 CEST 2008
On Sun, Apr 20, 2008 at 11:18:31PM +0300, Siarhei Siamashka wrote:
> On Sunday 20 April 2008, Michael Niedermayer wrote:
>
> [...]
>
> > > +static void vector_fmul_vfp(float *dst, const float *src, int len)
> > > +{
> > > + int tmp;
> > > + asm volatile(
> > > + "fmrx %[tmp], fpscr\n\t"
> > > + "orr %[tmp], %[tmp], #(3 << 16)\n\t" /* set vector size to 4 */
> > > + "fmxr fpscr, %[tmp]\n\t"
> > > +
> > > + "fldmias %[src1]!, {s0-s3}\n\t"
> > > + "fldmias %[src2]!, {s8-s11}\n\t"
> > > + "fldmias %[src1]!, {s4-s7}\n\t"
> > > + "fldmias %[src2]!, {s12-s15}\n\t"
> > > + "fmuls s8, s0, s8\n\t"
> > > + "1:\n\t"
> > > + "subs %[len], %[len], #16\n\t"
> > > + "fmuls s12, s4, s12\n\t"
> > > + "fldmiasge %[src1]!, {s16-s19}\n\t"
> > > + "fldmiasge %[src2]!, {s24-s27}\n\t"
> > > + "fldmiasge %[src1]!, {s20-s23}\n\t"
> > > + "fldmiasge %[src2]!, {s28-s31}\n\t"
> > > + "fmulsge s24, s16, s24\n\t"
> > > + "fstmias %[dst]!, {s8-s11}\n\t"
> > > + "fstmias %[dst]!, {s12-s15}\n\t"
> > > + "fmulsge s28, s20, s28\n\t"
> > > + "fldmiasgt %[src1]!, {s0-s3}\n\t"
> > > + "fldmiasgt %[src2]!, {s8-s11}\n\t"
> > > + "fldmiasgt %[src1]!, {s4-s7}\n\t"
> > > + "fldmiasgt %[src2]!, {s12-s15}\n\t"
> > > + "fmulsge s8, s0, s8\n\t"
> > > + "fstmiasge %[dst]!, {s24-s27}\n\t"
> > > + "fstmiasge %[dst]!, {s28-s31}\n\t"
> > > + "bgt 1b\n\t"
> >
> > If the 4 and 8 cycle latencies you mentioned are correct then this has many
> > stalls.
>
> This code has zero stalls (except on the first iteration). I checked it
> carefully and also verified with oprofile (ARM11 has hardware performance
> counters and can collect pipeline stalls statistics).
>
> The part you have probably missed is that this code operates on vectors.
no
> So
> one fmuls* instruction queues 4 multiplies, which get performed one after
> another in arithmetic pipeline (occupying it for 4 cycles).
That's what I missed; I expected it to do them in parallel, like a real CPU :)
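
For readers following along, here is a rough C model of the two pieces being discussed: the FPSCR update at the top of the quoted asm, and what a single fmuls does once the vector length is 4. It is only a sketch. The helper names (fpscr_set_vector_len, fpscr_get_vector_len, model_fmuls) are made up for illustration and are not part of the patch. The model assumes stride 1 and ignores the VFP register-bank rules (wrap-around within a bank, scalar treatment of bank 0 destinations). Unlike the quoted "orr", the setter clears the LEN field first, so the two only agree if LEN was 0 beforehand.

```c
#include <assert.h>
#include <stdint.h>

/* FPSCR.LEN occupies bits [18:16]; the vector length is LEN + 1. */
#define FPSCR_LEN_SHIFT 16
#define FPSCR_LEN_MASK  (7u << FPSCR_LEN_SHIFT)

/* Hypothetical helper: return fpscr with the vector length set to n (1..8). */
static uint32_t fpscr_set_vector_len(uint32_t fpscr, unsigned n)
{
    assert(n >= 1 && n <= 8);
    return (fpscr & ~FPSCR_LEN_MASK) | ((uint32_t)(n - 1) << FPSCR_LEN_SHIFT);
}

/* Hypothetical helper: read the vector length back out of an FPSCR value. */
static unsigned fpscr_get_vector_len(uint32_t fpscr)
{
    return ((fpscr & FPSCR_LEN_MASK) >> FPSCR_LEN_SHIFT) + 1;
}

/*
 * Simplified model of "fmuls s<d>, s<n>, s<m>" in short-vector mode:
 * one instruction produces vlen products on consecutive registers.
 * On the hardware described in the thread these products are issued one
 * after another, so the instruction occupies the pipeline for vlen cycles.
 */
static void model_fmuls(float s[32], unsigned d, unsigned n, unsigned m,
                        unsigned vlen)
{
    assert(d + vlen <= 32 && n + vlen <= 32 && m + vlen <= 32);
    for (unsigned i = 0; i < vlen; i++)
        s[d + i] = s[n + i] * s[m + i];
}
```

With the vector length at 4, model_fmuls(s, 8, 0, 8, 4) stands in for "fmuls s8, s0, s8": it multiplies s0-s3 by s8-s11 and leaves the four products in s8-s11, which is the block the loop then writes out with fstmias.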
[...]
> So the optimization manual from ARM provides only some simplified model
> and can't guarantee exact results. I also tried to remove all the
> multiplication instructions, keeping load/store operations only, and the
> performance remained exactly the same (while supposedly calculating cycles
> for load/store operations should be trivial). The final code is a result of
> some 'genetic' variations and taking the fastest version :)
>
> Oprofile shows that we get a lot of 'LSU_STALL' events, whatever it means. So
> it probably has something to do with some data cache throughput limitation
> which is not mentioned in the manual.
google says:
LSU_STALL : cycles stalled because Load Store request queue is full
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
I hate to see young programmers poisoned by the kind of thinking
Ulrich Drepper puts forward since it is simply too narrow -- Roman Shaposhnik