[FFmpeg-devel] [RFC][PATCH] DSPUtilize some functions from APE decoder

Kostya kostya.shishkov
Thu Jul 3 05:57:46 CEST 2008


On Thu, Jul 03, 2008 at 03:29:07AM +0300, Ivan Kalvachev wrote:
> On 7/3/08, Loren Merritt <lorenm at u.washington.edu> wrote:
> > On Wed, 2 Jul 2008, Kostya wrote:
> >
> >> I'm not satisfied with the decoding speed of the APE decoder,
> >> so I've decided to finally dsputilize the functions marked as such.
> >
> >> +static void vector_int16_add_sse(int16_t * v1, int16_t * v2, int order)
> >
> > sse2

oops

Michael, can you say something about moving the C functions to dsputil?
I'll polish the SSE2 and Altivec versions later.

> >> +       "movdqa  (%0),   %%xmm0 \n\t"
> >> +       "movdqu  (%1),   %%xmm1 \n\t"
> >> +       "paddw   %%xmm1, %%xmm0 \n\t"
> >
> > movdqu  (%1),   %%xmm0
> > paddw   (%0),   %%xmm0
> >
> >> +static int32_t vector_int16_scalarproduct_sse(int16_t * v1, int16_t * v2,
> >> int order)
> >> +{
> >> +    int i;
> >> +    int res = 0, *resp=&res;
> >> +
> >> +    asm volatile("pxor %xmm7, %xmm7 \n\t");
> >> +
> >> +    for(i = 0; i < order; i += 8){
> >> +        asm volatile(
> >> +       "movdqu   (%0),   %%xmm0 \n\t"
> >> +       "movdqa   (%1),   %%xmm1 \n\t"
> >> +       "pmaddwd  %%xmm1, %%xmm0 \n\t"
> >> +       "movhlps  %%xmm0, %%xmm2 \n\t"
> >> +
> >> +       "paddd    %%xmm2, %%xmm0 \n\t"
> >> +       "pshufd  $0x01, %%xmm0,%%xmm2 \n\t"
> >> +       "paddd    %%xmm2, %%xmm0 \n\t"
> >> +       "paddd   %%xmm0, %%xmm7 \n\t"
> >> +       : "+r"(v1), "+r"(v2)
> >> +       );
> >> +       v1 += 8;
> >> +       v2 += 8;
> >> +    }
> >> +    asm volatile("movd %%xmm7, (%0)\n\t" : "+r"(resp));
> >> +    return res;
> >> +}
> >
> > horizontal sum should be outside the loop
> > pshuflw is faster than pshufd
> 
> 
> Few more things.
> 
> What guarantees that these functions are called at 8-byte-aligned
> addresses and that they always process the data in bunches of 8 (aka
> order%8 == 0)?
> (I actually have no idea if the exact instructions you used require 8B
> alignment, I just assume they do. If they don't, they are slow ;)

In the APE decoder we have orders = 16, 32, 64, 256 and 1280.
Also, all vector operations are invoked on an av_malloc()ed array with some
offset, so one of the arguments is perfectly aligned and the other is
offset in increments of 2.
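
To make the two conditions concrete, here is a minimal sketch in plain C.
The helper names is_aligned16() and order_ok() are made up for illustration
and are not part of the patch. It assumes the 16-byte boundary that movdqa
needs, and that the loop consumes 8 int16_t values per iteration:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if p sits on a 16-byte boundary,
 * which is what an aligned 128-bit load such as movdqa requires. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15) == 0;
}

/* Hypothetical helper: nonzero if a loop that handles 8 int16_t
 * elements per iteration covers `order` elements exactly. */
static int order_ok(int order)
{
    return order > 0 && order % 8 == 0;
}
```

All five orders listed above pass order_ok(). A pointer that is one int16_t
past a 16-byte boundary fails is_aligned16(), so that argument needs an
unaligned load.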
 
> I think somewhere in the docs there is a requirement not to break
> asm blocks just to do the loop in C; this would definitely make you
> use one variable/register for the loop instead of 2.
> 
> I'm not sure why you use a pointer to a local variable;
> there must be a way to give the return variable directly
> to the asm block, so if the compiler pleases and that variable
> is assigned to the eax register then "movd" would put the value
> in eax directly and return it this way.

I don't speak assembler well, I can only read it.
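
As a rough sketch of what the review comments amount to, here is the scalar
product written with compiler intrinsics instead of inline asm. This is not
the patch itself: the function names are placeholders, it assumes order is a
multiple of 8, and it uses unaligned loads for both inputs so that it makes
no alignment assumption. The horizontal sum is done once after the loop, and
the result is returned in a register rather than through a pointer. Where
SSE2 is not available it falls back to the plain C loop.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Plain C reference: sum of v1[i] * v2[i] over `order` elements. */
static int32_t scalarproduct_int16_c(const int16_t *v1, const int16_t *v2,
                                     int order)
{
    int32_t res = 0;
    for (int i = 0; i < order; i++)
        res += v1[i] * v2[i];
    return res;
}

/* Placeholder name; assumes order % 8 == 0. */
static int32_t scalarproduct_int16_sse2(const int16_t *v1, const int16_t *v2,
                                        int order)
{
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();

    assert(order % 8 == 0);
    for (int i = 0; i < order; i += 8) {
        /* Unaligned loads on both inputs (movdqu). */
        __m128i a = _mm_loadu_si128((const __m128i *)(v1 + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(v2 + i));
        /* pmaddwd: four 32-bit sums of adjacent 16-bit products. */
        acc = _mm_add_epi32(acc, _mm_madd_epi16(a, b));
    }
    /* Horizontal sum of the four 32-bit lanes, once, after the loop. */
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4));
    /* movd to a general-purpose register; no pointer to a local needed. */
    return _mm_cvtsi128_si32(acc);
#else
    return scalarproduct_int16_c(v1, v2, order);
#endif
}
```

Like pmaddwd in the patch, the 32-bit accumulators wrap on overflow, so this
is only equivalent to the C loop while the running sums fit in 32 bits.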



