[FFmpeg-devel] [PATCH] Merge some computations in C code for VC-1 inverse transforms

Sat Feb 16 15:09:00 CET 2008

On 18 January 2008, Michael Niedermayer wrote:
> On Thu, Jan 17, 2008 at 08:52:53PM +0100, Christophe GISQUET wrote:
> [...]

       t3 = 22 * src[ 8] + 10 * src[24];
       t4 = 22 * src[24] - 10 * src[ 8];

> > > t3= 10*(src[ 8] + src[24]);
> > > t4= 32*src[24] - t3;
> > > t3+= 12*src[ 8];
> > >
> > > is faster?
> > > it's 3 add, 2 mul, 1 shift vs. 2 add, 4 mul
> >
> > Should have been at first glance, but this seems to cost 10-30
> > dezicycles more per loop
> >
> > Again, maybe it could be explained by checking the generated asm code, but
> > another CPU might see another result with the same code...
>
> tests from something like ARM would be interesting, there the reduction
> of multiplies should make a difference, x86 will have mmx code anyway ...

Actually this code snippet is quite interesting on ARM. The compiler
completely eliminates all the multiplications and replaces them with
additions and shifts (almost every arithmetic instruction can have
a 'free' shift for one of the source operands). This holds for both the
old and the new code when compiling for ARMv4. Your new code has fewer
instructions and should be faster there.
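
To illustrate the kind of shift-and-add replacement meant here, a minimal
sketch in C follows. The particular decomposition (22 = 16 + 4 + 2,
10 = 8 + 2) and all function names are assumptions made for illustration;
the sequence the compiler actually emits may well differ.

```c
/* Multiply by 2^n. Written as a multiplication so that negative inputs
 * stay well defined in C; a compiler turns it into a shift, which on ARM
 * can be folded into the following add. Hypothetical helper. */
static int shl(int x, int n) { return x * (1 << n); }

/* 10*x = 8*x + 2*x */
int mul10_shift_add(int x) { return shl(x, 3) + shl(x, 1); }

/* 22*x = 16*x + 4*x + 2*x */
int mul22_shift_add(int x) { return shl(x, 4) + shl(x, 2) + shl(x, 1); }

/* The two odd terms from the patch, computed without any real multiply. */
void odd_terms_shift_add(int s8, int s24, int *t3, int *t4)
{
    *t3 = mul22_shift_add(s8)  + mul10_shift_add(s24);
    *t4 = mul22_shift_add(s24) - mul10_shift_add(s8);
}
```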

When compiling for the ARMv5TE ISA, however, the compiler uses fast
single-cycle 16x16->32 multiplication instructions. These instructions also
have multiply-accumulate variants, which are unfortunately not used by the
compiler (nothing is perfect).

Theoretically, the original code requires only 4 operations on ARM (ARMv5TE+),
each of them executing in a single cycle: 2 mul and 2 mac.
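
As a rough sketch of that 2 mul + 2 mac form next to the merged variant
from the patch: mul() and mac() below are hypothetical helpers that only
model a multiply and a multiply-accumulate in plain C (on ARMv5TE these
would correspond to the SMULxy/SMLAxy family); the function names are
made up for illustration as well.

```c
/* Hypothetical models of a multiply and a multiply-accumulate. */
static int mul(int a, int b)          { return a * b; }
static int mac(int acc, int a, int b) { return acc + a * b; }

/* Original code: t3 = 22*s8 + 10*s24, t4 = 22*s24 - 10*s8,
 * expressed as 2 mul + 2 mac (the constants would sit in registers). */
void odd_terms_original(int s8, int s24, int *t3, int *t4)
{
    *t3 = mac(mul(22, s8),  10, s24);
    *t4 = mac(mul(22, s24), -10, s8);
}

/* Merged variant from the patch: shares 10*(s8 + s24) between both terms. */
void odd_terms_merged(int s8, int s24, int *t3, int *t4)
{
    int t = 10 * (s8 + s24);
    *t4 = 32 * s24 - t;   /* 22*s24 - 10*s8 */
    *t3 = t + 12 * s8;    /* 22*s8 + 10*s24 */
}
```

Both functions produce the same results; they differ only in how the
arithmetic is grouped.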

If you are interested, I can provide disassembly listings for different code
fragments.

Subjectively, I prefer the old variant: it should be faster on ARM (with a
nonexistent theoretical 'perfect' compiler), is less 'obfuscated', and is
easier to use as a reference implementation when doing SIMD optimizations.