[MPlayer-dev-eng] [PATCH] (new version) AltiVec: dct64 for mp3lib, IMDCT for liba52, detection code

Mon Jan 20 10:20:46 CET 2003

Daniel Egger wrote:

> Well, I figured out what they do by preprocessing the source and using
> the expanded (and reformatted) source for ideas.

Did the same thing. The preprocessed output of dsputil.c
is also unreadable :-(

> Huh? You either need to align it (costly, especially if you need to read
> 2x16 bytes which is often the case) or you'll end up with wrong pixels
> on the screen.

Dual load + vec_lvsl is OK. It isn't that slow, in practice.

Unaligned stores are the real killers.
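
A minimal sketch of the dual-load idea, written in portable C that only emulates what the AltiVec operations do (vec_ld ignoring the low 4 address bits, vec_lvsl building the shift pattern, vec_perm picking bytes from two vectors). The function names and the index-as-address model of memory are assumptions made for illustration, not code from any patch:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Emulated aligned load: like vec_ld, the low 4 bits of the address
 * are ignored, so the load always starts on a 16-byte boundary.
 * "addr" is an index into "mem", standing in for a real address. */
static void load_aligned16(const uint8_t *mem, size_t addr, uint8_t out[16])
{
    memcpy(out, mem + (addr & ~(size_t)0x0F), 16);
}

/* Emulated vec_lvsl: control bytes sh, sh+1, ..., sh+15, where sh is
 * the offset of addr inside its 16-byte block. */
static void make_lvsl(size_t addr, uint8_t perm[16])
{
    uint8_t sh = (uint8_t)(addr & 0x0F);
    for (int i = 0; i < 16; i++)
        perm[i] = (uint8_t)(sh + i);
}

/* Emulated vec_perm: each control byte selects one byte out of the
 * 32-byte concatenation of a and b. */
static void permute(const uint8_t a[16], const uint8_t b[16],
                    const uint8_t perm[16], uint8_t out[16])
{
    for (int i = 0; i < 16; i++) {
        assert(perm[i] < 32);
        out[i] = perm[i] < 16 ? a[perm[i]] : b[perm[i] - 16];
    }
}

/* Dual load + permute: two aligned loads cover any 16 unaligned bytes.
 * The caller must make sure the block after addr is readable too. */
static void load_unaligned16(const uint8_t *mem, size_t addr, uint8_t out[16])
{
    uint8_t lo[16], hi[16], perm[16];
    load_aligned16(mem, addr, lo);
    load_aligned16(mem, addr + 16, hi);
    make_lvsl(addr, perm);
    permute(lo, hi, perm, out);
}
```

Note that the second load is always issued in this sketch, even when addr is already aligned and the data in "hi" ends up unused.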

 >>For the xy2, maybe I should check if ((address & 0X1F) < 8), in that
 >>case I have both 8 pixels block in a single vector and I can avoid the
 >>second load from "pixels". What do you think ?
 >
 > Why 0x1f? To check for 16-byte alignment you'd use something like
 > (address & 0x0f) == 0. The nice thing about the code blocks calculating
 > the mean of 4 adjacent pixels is that with altivec one can trivially
 > do something like:
[snip]
 > as long as the stride is a multiple of 16.
 > It gets quickly complicated when the start of the useful data in memory
 > is between 0x.......7 and 0x.......f because then one needs to either
 > special case or generally fetch 2x16 bytes and align them. The alignment
 > vector can be calculated in advance and reused unless stride % 16 != 0.
 > Of course there is a small distinction between the to-be-applied data and
 > the picture itself because the former will be en bloc while the latter
 > uses the stride.

The 0x1F was a typing mistake :-) (it should have been 0x0F). I only
need to know if the offset within the vector is between 0 and 7
inclusive, so that all 9 pixels are inside the first vector.
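
The check can be sketched as follows, assuming 16-byte vectors and 9 source pixels per row for the xy2 case; the helper name is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 when the "count" bytes starting at addr all lie inside the
 * 16-byte-aligned block that contains addr, so that a single aligned
 * load already holds every pixel that is needed. */
static int fits_in_first_vector(uintptr_t addr, unsigned count)
{
    assert(count >= 1 && count <= 16);
    return (addr & 0x0F) + count <= 16;
}
```

For count == 9 this is the same as (addr & 0x0F) < 8. With the mistyped 0x1F mask, addresses whose offset in a 32-byte window is 16..23 would fail the test even though the 9 pixels fit, so the shortcut would simply be missed for those addresses.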

My last patch to ffmpeg mistakenly "fixed" the stride. I used
to pre-compute the alignment vector, then put the computation back
in the loop for some functions. This is stupid: if line_size % 16 != 0,
then all loads of 16-byte-aligned blocks become wrong anyway.
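
A small sketch of why a pre-computed alignment vector is only reusable when line_size is a multiple of 16: the offset of each row start inside its 16-byte block must stay the same from row to row. The helper name and its parameters are assumptions for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if every row start has the same offset inside its 16-byte
 * block as the first row, i.e. one alignment vector computed before
 * the loop is valid for all rows. */
static int alignment_is_row_invariant(uintptr_t start, int line_size, int rows)
{
    assert(rows >= 1);
    uintptr_t first = start & 0x0F;
    for (int r = 1; r < rows; r++) {
        uintptr_t row = start + (uintptr_t)((intptr_t)r * line_size);
        if ((row & 0x0F) != first)
            return 0;
    }
    return 1;
}
```

For example, a start offset of 3 with line_size 32 keeps offset 3 on every row, while line_size 24 gives offset 11 on the second row.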

 > This does not necessarily matter, you can mark the data as uncacheable
 > if in doubt. Also the i-cache footprint and the schedulability of the
 > code make a whole lot of difference. Normally altivec code will yield
 > at least a fourfold improvement in terms of instructions which can
 > easily pay off, especially when considering that CPUs normally fetch
 > memory in sizes of a whole cacheline which is 16 bytes for embedded
 > PPC CPUs, 32 bytes in our case or even 128 bytes on PPC64.

I talked about that with my PhD advisor; he also thinks it's
not useful to try to remove the (potentially) spurious load
in put_pixels8_x2y_altivec. More code, more branches, and the
only gain is to avoid accessing a cache line in 7/32 of
the calls (assuming an even distribution of alignments).
And this is costly only if the line isn't in the L1 cache.

-- 
Romain Dolbeau