[FFmpeg-devel] [PATCH] Some ARM VFP optimizations (vector_fmul, vector_fmul_reverse, float_to_int16)

Sun Apr 20 21:02:26 CEST 2008

On Sun, Apr 20, 2008 at 07:29:33PM +0300, Siarhei Siamashka wrote:
> On Sunday 20 April 2008, Måns Rullgård wrote:
> > Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:
> > > Hello,
> > >
> > > Here is a patch which adds some initial optimizations for ARM VFP
> > > (floating point coprocessor available in some ARM11 cores).
> > > Index: configure
> > > ===================================================================
> > > --- configure	(revision 12910)
> > > +++ configure	(working copy)
> >
> > Oddly enough, I had those exact changes to the configure script here,
> > which I've now committed.
> 
> Here is the second revision of the patch (with arguments in asm code
> realigned as you asked).

[...]
> +/*
> + * VFP is a floating point coprocessor used in some ARM cores. VFP11 has 1 cycle
> + * throughput for almost all instructions (except for double precision
> + * arithmetic), but rather high latency. Latency is 4 cycles for loads and 8
> + * cycles for arithmetic operations. Scheduling code to avoid pipeline stalls is
> + * very important for performance. One more interesting feature is that VFP has
> + * independent load/store and arithmetic pipelines, so it is possible to make
> + * them work simultaneously and get more than 1 operation per cycle. The
> + * load/store pipeline can process 2 single precision floating point values per
> + * cycle and supports bulk loads and stores for large sets of registers.
> + * Arithmetic operations can be done on vectors, which keeps the arithmetic
> + * pipeline busy while the processor issues and executes other instructions.
> + * Detailed optimization manuals can be found at http://www.arm.com
> + */
> +

> +/**
> + * ARM VFP optimized implementation of 'vector_fmul_c' function.
> + * Assumes that len is a positive multiple of 8.
> + *
> + * Takes ~1.9 cycles per element on VFP11 when source and destination
> + * buffers are perfectly aligned and cached.
> + */
> +static void vector_fmul_vfp(float *dst, const float *src, int len)
> +{
> +    int tmp;
> +    asm volatile(
> +        "fmrx       %[tmp], fpscr\n\t"
> +        "orr        %[tmp], %[tmp], #(3 << 16)\n\t" /* set vector size to 4 */
> +        "fmxr       fpscr, %[tmp]\n\t"
> +
> +        "fldmias    %[src1]!, {s0-s3}\n\t"
> +        "fldmias    %[src2]!, {s8-s11}\n\t"
> +        "fldmias    %[src1]!, {s4-s7}\n\t"
> +        "fldmias    %[src2]!, {s12-s15}\n\t"
> +        "fmuls      s8, s0, s8\n\t"
> +    "1:\n\t"
> +        "subs       %[len], %[len], #16\n\t"
> +        "fmuls      s12, s4, s12\n\t"
> +        "fldmiasge  %[src1]!, {s16-s19}\n\t"
> +        "fldmiasge  %[src2]!, {s24-s27}\n\t"
> +        "fldmiasge  %[src1]!, {s20-s23}\n\t"
> +        "fldmiasge  %[src2]!, {s28-s31}\n\t"
> +        "fmulsge    s24, s16, s24\n\t"
> +        "fstmias    %[dst]!, {s8-s11}\n\t"
> +        "fstmias    %[dst]!, {s12-s15}\n\t"
> +        "fmulsge    s28, s20, s28\n\t"
> +        "fldmiasgt  %[src1]!, {s0-s3}\n\t"
> +        "fldmiasgt  %[src2]!, {s8-s11}\n\t"
> +        "fldmiasgt  %[src1]!, {s4-s7}\n\t"
> +        "fldmiasgt  %[src2]!, {s12-s15}\n\t"
> +        "fmulsge    s8, s0, s8\n\t"
> +        "fstmiasge  %[dst]!, {s24-s27}\n\t"
> +        "fstmiasge  %[dst]!, {s28-s31}\n\t"
> +        "bgt        1b\n\t"

If the 4 and 8 cycle latencies you mentioned are correct, then this has many
stalls.

The following is less unrolled and should have no stalls (note that I did not
look at any ARM docs, so feel free to flame me). It can of course be
unrolled more ...

subs
fstmias s16
fmul    s16, s4, s0
fstmias s20
fmul    s20, s8, s12
fldmias s0
fldmias s4
fldmias s8
fldmias s12
bgt
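
The dataflow of a schedule like the one above can be sketched in plain C.
This is only an illustration under assumptions, not the code from the patch:
the name vector_fmul_pipelined and the block size of 4 are hypothetical, and
a C compiler is free to reorder the statements, so the sketch shows which
iteration each load, multiply and store belongs to rather than actual
instruction timing. Each loop pass stores products computed one pass earlier,
multiplies operands loaded one pass earlier, and loads operands for the next
pass.

```c
#include <assert.h>

/* Reference behaviour: multiply dst by src element by element. */
static void vector_fmul_ref(float *dst, const float *src, int len)
{
    for (int i = 0; i < len; i++)
        dst[i] *= src[i];
}

/*
 * Hypothetical software-pipelined variant working on blocks of 4 floats.
 * Assumption: len is at least 8 and a multiple of 4.
 */
static void vector_fmul_pipelined(float *dst, const float *src, int len)
{
    float a[4], b[4], prod[4];
    int i, k;

    assert(len >= 8 && len % 4 == 0);

    /* Prologue: load block 0, multiply it, then load block 1. */
    for (k = 0; k < 4; k++) { a[k] = dst[k]; b[k] = src[k]; }
    for (k = 0; k < 4; k++) prod[k] = a[k] * b[k];
    for (k = 0; k < 4; k++) { a[k] = dst[4 + k]; b[k] = src[4 + k]; }

    /* Steady state: three different blocks are in flight per pass. */
    for (i = 0; i + 8 < len; i += 4) {
        for (k = 0; k < 4; k++) dst[i + k] = prod[k];   /* store block at i   */
        for (k = 0; k < 4; k++) prod[k] = a[k] * b[k];  /* multiply block i+4 */
        for (k = 0; k < 4; k++) {                       /* load block i+8     */
            a[k] = dst[i + 8 + k];
            b[k] = src[i + 8 + k];
        }
    }

    /* Epilogue: drain the two blocks still in flight. */
    for (k = 0; k < 4; k++) dst[i + k] = prod[k];
    for (k = 0; k < 4; k++) dst[i + 4 + k] = a[k] * b[k];
}
```

Both functions should produce identical output for the same input; the
pipelined form only changes the order in which the work is issued.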

I've not checked the others, but if they access operands before they are
available, they should also be changed so that they do not.
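
One rough way to check whether a sequence reads operands too early is a toy
in-order issue model. The sketch below is an assumption-laden simplification,
not a description of the real VFP11: it issues at most one instruction per
cycle, ignores the separate load/store and arithmetic pipelines and the extra
issue cycles of vector operations, and takes the 4 and 8 cycle latencies
quoted in the patch comment at face value. The struct insn layout and the
name count_stall_cycles are hypothetical.

```c
#include <assert.h>

enum { NUM_REGS = 32 };

/* One instruction in the toy model; -1 means "no register". */
struct insn {
    int dst;      /* register written                           */
    int src1;     /* first register read                        */
    int src2;     /* second register read                       */
    int latency;  /* cycles from issue until dst may be read    */
};

/* Count cycles spent waiting for source registers to become ready. */
static int count_stall_cycles(const struct insn *code, int n)
{
    int ready[NUM_REGS] = {0};  /* cycle at which each register is valid */
    int cycle = 0, stalls = 0;

    for (int i = 0; i < n; i++) {
        int issue = cycle;

        assert(code[i].dst < NUM_REGS);
        assert(code[i].src1 < NUM_REGS && code[i].src2 < NUM_REGS);

        /* Delay issue until every source operand is available. */
        if (code[i].src1 >= 0 && ready[code[i].src1] > issue)
            issue = ready[code[i].src1];
        if (code[i].src2 >= 0 && ready[code[i].src2] > issue)
            issue = ready[code[i].src2];

        stalls += issue - cycle;
        if (code[i].dst >= 0)
            ready[code[i].dst] = issue + code[i].latency;
        cycle = issue + 1;
    }
    return stalls;
}
```

Under this model, a back-to-back "load s0, load s8, multiply s8 = s0 * s8,
store s8" sequence waits 3 cycles before the multiply and 7 before the store,
10 stall cycles in total, which is the kind of dependency chain that
interleaving independent work is meant to hide.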

Also note that there are high-level optimizations which should be done to the
IMDCT, namely merging it with these windowing functions. It might make sense
to look into this before low-level optimizing them.
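
The idea of merging the window multiply into the transform can be sketched as
follows. This is a minimal illustration under assumptions: last_pass is a
made-up stand-in for the final output stage of a transform and is not a real
IMDCT step, and all function names are hypothetical. The point is only that
the fused version writes each output once instead of writing it, reading it
back and writing it again.

```c
#include <assert.h>

/* Stand-in for the last pass of a transform; NOT a real IMDCT step. */
static float last_pass(const float *in, int i)
{
    return in[2 * i] - in[2 * i + 1];
}

/* Two passes: write the transform output, then window it in place. */
static void transform_then_window(float *out, const float *in,
                                  const float *win, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        out[i] = last_pass(in, i);
    for (int i = 0; i < n; i++)
        out[i] *= win[i];
}

/* One pass: apply the window while the value is still in a register. */
static void transform_windowed(float *out, const float *in,
                               const float *win, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        out[i] = last_pass(in, i) * win[i];
}
```

With the window folded in like this, a separate vector multiply pass over the
output buffer is no longer needed for that step.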

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Frequently ignored answer #1: FFmpeg bugs should be sent to our bugtracker. User
questions about the command line tools should be sent to the ffmpeg-user ML.
And questions about how to use libav* should be sent to the libav-user ML.