[FFmpeg-devel] Some IWMMXT functions for libavcodec

Sat May 17 20:11:13 CEST 2008

On Saturday 17 May 2008, Dmitry Antipov wrote:
> Michael Niedermayer wrote:
> > So write code which is near perfect on both
>
> As we're investigated,
>
>      wldrd wr2, [%1, #8]
>      wldrd wr1, [%1], %2
>
> is much better than
>
>      wldrd wr1, [%1]
>      wldrd wr2, [%1, #8]
>      add %1, %1, %2
>
> but the first version will work on PXA3xx cores only.

There is one pitfall with modern GNU toolchains. For example, if we have this
chunk of code:

    asm volatile(
        "wldrd    wr1, [r0]        \n\t"
        "wldrd    wr1, [r0], r1    \n\t"
        "wldrd    wr1, [r0, r1]    \n\t"
        "wldrd    wr1, [r0, r1]!   \n\t");

Compiling it with:
gcc -march=iwmmxt2 -mcpu=iwmmxt2 -mfloat-abi=soft -O2 -c test.c
objdump -m iwmmxt2 -d test.o

gives the following output:
00000000 <test>:
   0:   edd01100        wldrd   wr1, [r0]
   4:   fcf01101        wldrd   wr1, [r0], +r1
   8:   fdd01101        wldrd   wr1, [r0, +r1]
   c:   fdf01101        wldrd   wr1, [r0, +r1]!

BUT! If we compile it with:
gcc -march=iwmmxt -mcpu=iwmmxt -mfloat-abi=soft -O2 -c test.c
objdump -m iwmmxt2 -d test.o

The result is:
00000000 <test>:
   0:   edd01100        wldrd   wr1, [r0]
   4:   ecf01100        wldrd   wr1, [r0]
   8:   edd01100        wldrd   wr1, [r0]
   c:   edf01100        wldrd   wr1, [r0]

So using the "-march=iwmmxt -mcpu=iwmmxt" options results in incorrect code
generation: the last three instructions silently lose their register offset
and are assembled with a zero offset instead. The expected result would be a
compilation failure with an error message about these instructions not being
supported. It's a dangerous bug that needs to be reported somewhere.
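
To see what went wrong in the second listing, the opcode words can be decoded
by hand. Below is a minimal sketch in C that splits a word into the fields of
the generic ARM coprocessor load/store (LDC) encoding, which WLDRD uses. The
names ldc_fields, decode_ldc and print_ldc are made up for this illustration,
and treating the low byte of the iwmmxt2 encodings as the offset register
number is only my reading of the objdump output above, not something checked
against the manual.

```c
#include <stdint.h>
#include <stdio.h>

/* Fields of the generic ARM coprocessor load/store (LDC/STC) encoding. */
struct ldc_fields {
    unsigned cond; /* bits 31-28: condition code (0xe = always) */
    unsigned p;    /* bit 24: 1 = offset/pre-indexed, 0 = post-indexed */
    unsigned u;    /* bit 23: 1 = offset is added to the base */
    unsigned w;    /* bit 21: 1 = base register is written back */
    unsigned rn;   /* bits 19-16: base register number */
    unsigned low8; /* bits 7-0: offset field */
};

/* Split one 32-bit opcode word into the fields above. */
static struct ldc_fields decode_ldc(uint32_t insn)
{
    struct ldc_fields f;
    f.cond = (insn >> 28) & 0xf;
    f.p    = (insn >> 24) & 1;
    f.u    = (insn >> 23) & 1;
    f.w    = (insn >> 21) & 1;
    f.rn   = (insn >> 16) & 0xf;
    f.low8 = insn & 0xff;
    return f;
}

/* Print the decoded fields of one opcode word on a single line. */
static void print_ldc(uint32_t insn)
{
    struct ldc_fields f = decode_ldc(insn);
    printf("%08lx: cond=%x P=%u U=%u W=%u Rn=r%u low8=0x%02x\n",
           (unsigned long)insn, f.cond, f.p, f.u, f.w, f.rn, f.low8);
}
```

For 0xecf01100 (from the iwmmxt build) this gives P=0, W=1 and a low byte of
0, i.e. post-indexed addressing with a zero offset. For 0xfcf01101 (from the
iwmmxt2 build) it gives cond=0xf and a low byte of 1, which matches r1.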

Now about the performance. The following code is perfectly fine until we
take potential cache misses into account:
      wldrd wr2, [%1, #8]
      wldrd wr1, [%1], %2

But this code reads memory "backwards" and may (or may not) result in worse
performance. For example, ARM9 and ARM11 cores have a "critical word first"
cache refill policy. On a cache miss, the whole 32-byte cache line needs to
be loaded from memory, and this keeps the memory bus busy for a while. With
the "critical word first" algorithm, some speedup is gained by fetching first
the data needed by the instruction that caused the cache miss. The rest of
the cache line is then loaded in the background (reading on to the end of the
cache line, then wrapping around and reading the remaining data from its
start) while the cpu already has access to the data it needs. So memory
access latency gets reduced.

But if we read memory backwards, only the first access gains from "critical
word first" loading. The second access stalls, because its data lies just
before the critical word and so becomes available last. Everything needs to
be benchmarked of course, but generally, reading memory forward should be at
least as fast as reading it backwards.
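
The refill order can be modelled with a few lines of C. This is only a sketch
of the wrap-around order described above, not a model of any particular core:
the 32-byte line size is taken from the text, while the 8-byte transfer size
(BEAT_SIZE) and the function name beats_before_available are assumptions made
for the illustration.

```c
/* LINE_SIZE comes from the text above; BEAT_SIZE is an assumed bus width. */
enum { LINE_SIZE = 32, BEAT_SIZE = 8 };

/*
 * A miss on miss_addr starts a "critical word first" refill: the transfer
 * containing miss_addr arrives first, then the following ones up to the end
 * of the line, then the refill wraps around to the start of the line.
 * Returns how many transfers arrive before the one containing wanted_addr.
 * Both addresses are assumed to lie in the same cache line.
 */
static unsigned beats_before_available(unsigned long miss_addr,
                                       unsigned long wanted_addr)
{
    unsigned nbeats = LINE_SIZE / BEAT_SIZE;
    unsigned first  = (unsigned)(miss_addr % LINE_SIZE) / BEAT_SIZE;
    unsigned wanted = (unsigned)(wanted_addr % LINE_SIZE) / BEAT_SIZE;

    return (wanted + nbeats - first) % nbeats;
}
```

With a line starting at 0x1000, the backward order (miss on 0x1008, then a
load from 0x1000) makes the second load wait for 3 earlier transfers, while
the forward order (miss on 0x1000, then a load from 0x1008) waits for only 1.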

Hopefully my explanation was not very confusing :) Anybody interested can
consult the ARM1136 Technical Reference Manual for more details.

I used that code fragment for simplicity and easier understanding. But in
reality, a few more changes may be required to make the code perfect :)

> How is it reasonable to implement different specialized version for
> each generation of the core like it does with MMX and MMX2?
>
> Moreover, WMMX2 adds some useful instructions - for example, WAVG4 may
> be used to implement very fast pix_abs16_xy2 and pix_abs8_xy2 (which
> will not work on PXA2xx cores).

You can add support for WMMX2 to the ffmpeg configure script in the same way
as WMMX and other ARM instruction set extensions are supported. You just need
an instruction which makes the compiler refuse to compile the code unless
"-march=iwmmxt2" is selected (this register post-increment WLDRD variant is
naturally not a good choice :) ). Also, --enable-iwmmxt2 and --disable-iwmmxt2
configure options could be used to force the configuration you need even if
the autodetection fails.

-- 
Best regards,
Siarhei Siamashka