[Ffmpeg-devel] VP3/Theora Perfection

Tue May 17 13:55:52 CEST 2005

Hi

On Monday 16 May 2005 22:10, Mike Melanson wrote:
> Michael Niedermayer wrote:
[...]
> > * the switch / case mess used for some vlc decoding
>
> 	Expound. Are you talking about the unpack_token() function? That is

yes, and get_motion_vector_vlc()

> called a lot and perhaps should be inline'd. Otherwise, the actual
> switch/case logic should reduce to a jump table. On2's original code

you don't seem to be aware that jump tables with unpredictable jump targets are
very slow (each one is an indirect branch that the cpu tends to mispredict)
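
A minimal sketch of the alternative, assuming a table-driven decoder: the
per-token differences are moved into a data table, so the token selects a load
instead of a jump target. BitReader, read_bits(), token_info and unpack_run()
are hypothetical names, and the table values are made up; this is not the VP3
token table or the actual unpack_token() code.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal MSB-first bit reader (hypothetical, for illustration only). */
typedef struct {
    const uint8_t *buf;  /* must have one padding byte after the data */
    size_t bit_pos;      /* position of the next unread bit */
} BitReader;

/* Read n bits, 0 <= n <= 8, from a 16-bit window. No loop, no switch. */
static unsigned read_bits(BitReader *br, unsigned n)
{
    size_t byte = br->bit_pos >> 3;
    unsigned window = ((unsigned)br->buf[byte] << 8) | br->buf[byte + 1];
    unsigned offset = (unsigned)(br->bit_pos & 7);
    unsigned shift = 16u - offset - n;

    br->bit_pos += n;
    return (window >> shift) & ((1u << n) - 1u);
}

/* Per-token parameters. Made-up values, NOT the VP3 token table. */
typedef struct {
    uint8_t extra_bits;  /* number of extra bits that follow the token */
    uint8_t base;        /* value the extra bits are added to */
} TokenInfo;

static const TokenInfo token_info[4] = {
    { 0, 1 },
    { 1, 2 },
    { 2, 4 },
    { 3, 8 },
};

/* One table load replaces switch (token) { case 0: ... case 3: ... }. */
static unsigned unpack_run(BitReader *br, unsigned token)
{
    const TokenInfo *t = &token_info[token & 3];
    return t->base + read_bits(br, t->extra_bits);
}
```

The caller has to keep one padding byte after the data, because read_bits()
always loads two bytes.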

[...]
> > actually
> > the dequant should be done during bitstream decoding
>
> 	Why? Dequantization is a parallelizable operation that can be optimized
> with SIMD instructions. That is why it is done at the same time as the
> optimized IDCTs.

i prefer to multiply the 2 nonzero elements of a block without SIMD over
multiplying all 64 with SIMD
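
Roughly what dequantizing during bitstream decoding could look like, assuming
the decoder already knows the position and level of each coded coefficient.
Coeff and store_coeffs() are hypothetical names for this sketch, not existing
vp3.c code.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical decoded coefficient: position in the 8x8 block and level. */
typedef struct {
    uint8_t index;  /* 0..63 */
    int16_t level;  /* quantized value as read from the bitstream */
} Coeff;

/* Dequantize while storing: one multiply per coded coefficient,
 * instead of 64 multiplies per block in front of the IDCT. */
static void store_coeffs(int16_t block[64], const int16_t dequant[64],
                         const Coeff *coeffs, int n)
{
    memset(block, 0, 64 * sizeof(block[0]));
    for (int i = 0; i < n; i++) {
        int idx = coeffs[i].index & 63;
        block[idx] = (int16_t)(coeffs[i].level * dequant[idx]);
    }
}
```

With 2 coded coefficients this does 2 multiplies, and the IDCT receives a block
that is already dequantized.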

[...]

> > * mmx.h based asm code (slow due to gcc bugs, and problematic due to
> > bugs in mmx.h)
> >
> >>has MMX and SSE2 optimizations that I can port over when I am confident
> >>that the C-based loop filter works.
> >
> > note, please do not use mmx.h,
>
> 	Please give me a good reason. I have checked code generated from mmx.h
> against objdump and the generated ASM is correct.

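the operand constraints in mmx.h are wrong ("X" accepts any operand at all, and
the register being written is never declared as an output or clobber), for example: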
#define mmx_m2ri(op,mem,reg,imm) \
        __asm__ __volatile__ (#op " %1, %0, %%" #reg \
                              : /* nothing */ \
                              : "X" (mem), "X" (imm))

the speed of mmx.h code depends strongly upon the compiler ...
you cannot use integer instructions or non-mmx registers directly
there's no real advantage over asm() style
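
For comparison, a small sketch of asm() style with explicit constraints,
assuming gcc-compatible extended asm on x86 with MMX enabled. words4 and
add_words4() are hypothetical names, and the C fallback is only there so the
example builds on targets without MMX.

```c
#include <stdint.h>

/* Four packed 16-bit words, the size of one MMX register. */
typedef struct {
    int16_t w[4];
} words4;

/* dst += src, element-wise with wraparound. */
static void add_words4(words4 *dst, const words4 *src)
{
#if defined(__GNUC__) && defined(__MMX__)
    __asm__ volatile(
        "movq  %0, %%mm0 \n\t"  /* load dst */
        "paddw %1, %%mm0 \n\t"  /* add src, 4 x 16 bit */
        "movq  %%mm0, %0 \n\t"  /* store the result back to dst */
        "emms            \n\t"  /* leave MMX state */
        : "+m"(*dst)            /* read and written, must be memory */
        : "m"(*src)             /* read only, must be memory */
        : "mm0");               /* scratch register, declared as clobbered */
    /* Note: emms resets the x87 tag word. Code that keeps x87 values live
     * across this statement would also need the st registers in the
     * clobber list, which this sketch does not show. */
#else
    for (int i = 0; i < 4; i++)
        dst->w[i] = (int16_t)(dst->w[i] + src->w[i]);
#endif
}
```

Every operand states what it is ("m") and whether it is written ("+"), and the
scratch register is listed as a clobber, so the compiler does not have to guess.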

[...]
-- 
Michael