[MPlayer-dev-eng] [PATCH] (new version) AltiVec: dct64 for mp3lib, IMDCT for liba52, detection code
Romain Dolbeau
dolbeau at irisa.fr
Mon Jan 20 10:20:46 CET 2003
Daniel Egger wrote:
> Well, I figured out what they do by preprocessing the source and using
> the expanded (and reformatted) source for ideas.
Did the same thing. The preprocessed output of dsputil.c
is also unreadable :-(
> Huh? You either need to align it (costly, especially if you need to read
> 2x16 bytes which is often the case) or you'll end up with wrong pixels
> on the screen.
Dual load + vec_lvsl is OK. It isn't that slow, in practice.
Unaligned stores are the real killers.
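
For readers without the AltiVec manual at hand, here is a scalar C model of the dual load + vec_lvsl idiom. It is only a sketch: model_ld, model_lvsl, model_perm and load_unaligned are hypothetical names that imitate vec_ld, vec_lvsl and vec_perm on plain byte arrays, and "mem" stands for memory whose index 0 is assumed to be 16-byte aligned. It is not the code from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar stand-in for one 16-byte AltiVec register. */
typedef struct { uint8_t b[16]; } vec16;

/* Models vec_ld: the low 4 bits of the address are ignored, so the
   load always returns the aligned 16-byte block containing addr. */
static vec16 model_ld(const uint8_t *mem, size_t addr)
{
    vec16 v;
    memcpy(v.b, mem + (addr & ~(size_t)15), 16);
    return v;
}

/* Models vec_lvsl: permute control {s, s+1, ..., s+15}, s = addr & 15. */
static vec16 model_lvsl(size_t addr)
{
    vec16 v;
    for (int i = 0; i < 16; i++)
        v.b[i] = (uint8_t)((addr & 15) + i);
    return v;
}

/* Models vec_perm: each control byte picks one byte out of the
   32-byte concatenation a:b. */
static vec16 model_perm(vec16 a, vec16 b, vec16 ctl)
{
    vec16 r;
    for (int i = 0; i < 16; i++) {
        unsigned k = ctl.b[i] & 31;
        r.b[i] = (k < 16) ? a.b[k] : b.b[k - 16];
    }
    return r;
}

/* Dual load + permute: 16 bytes starting at an arbitrary addr.
   mem must hold at least 32 bytes from the aligned block of addr. */
static vec16 load_unaligned(const uint8_t *mem, size_t addr)
{
    vec16 lo = model_ld(mem, addr);
    vec16 hi = model_ld(mem, addr + 16);
    return model_perm(lo, hi, model_lvsl(addr));
}
```

Note that in this form the second load is always issued, even when addr is already aligned and the first block holds everything.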
>>For the xy2, maybe I should check if ((address & 0X1F) < 8), in that
>>case I have both 8 pixels block in a single vector and I can avoid the
>>second load from "pixels". What do you think ?
>
> Why 0x1f? To check for 16-byte alignment you'd use something like
> (address & 0x0f) == 0. The nice thing about the code blocks calculating
> the mean of 4 adjacent pixels is that with altivec one can trivially
> do something like:
[snip]
> as long as the stride is a multiple of 16.
> It gets quickly complicated when the start of the useful data in memory
> is between 0x.......7 and 0x.......f because then one needs to either
> special case or generally fetch 2x16 bytes and align them. The alignment
> vector can be calculated in advance and reused unless stride % 16 != 0.
> Of course there is a small distinction between the to-be-applied data and
> the picture itself because the former will be en bloc while the latter
> uses the stride.
The 0x1F is a typing mistake :-) I only need to know
if the offset is between 0 and 7 inclusive, so that all 9 pixels
are inside the first vector.
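
As a sketch of that test (nine_pixels_in_one_vector is a hypothetical helper name, not something from the patch), assuming 16-byte vectors and 9 consecutive one-byte pixels:

```c
#include <stdint.h>

/* Hypothetical helper: nonzero when the 9 bytes starting at addr all
   lie inside the aligned 16-byte block that contains addr.
   Offsets 0..7 qualify, since 7 + 8 = 15 is the last byte of the block. */
static int nine_pixels_in_one_vector(uintptr_t addr)
{
    return (addr & 0x0F) < 8;
}
```

With the mask 0x0F the check looks at the offset inside the vector; with 0x1F it would look at the offset inside a 32-byte span instead.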
My last patch to ffmpeg mistakenly "fixed" the stride. I used
to pre-compute it, then put it back in the loop for some
functions. This is stupid: if line_size % 16 != 0, then all
loads from 16-byte-aligned blocks become wrong anyway.
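
To illustrate why the stride matters for anything computed once outside the loop, here is a small sketch. shift_is_row_invariant is a hypothetical helper; it only checks whether every row starts at the same offset inside its 16-byte block as row 0, which is the condition under which a single alignment vector can be reused.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if every row start has the same
   offset within its 16-byte block as row 0, else 0. */
static int shift_is_row_invariant(uintptr_t base, ptrdiff_t line_size, int rows)
{
    for (int y = 1; y < rows; y++) {
        uintptr_t row = base + (uintptr_t)((ptrdiff_t)y * line_size);
        if ((row & 15) != (base & 15))
            return 0;   /* the offset changed, so row 0's vector is stale */
    }
    return 1;
}
```

For example, a base address ending in 5 with line_size 32 keeps offset 5 on every row, while line_size 20 already gives offset 9 on the second row.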
> This does not necessarily matter, you can mark the data as uncacheable
> if in doubt. Also the i-cache footprint and the schedulability of the
> code make a whole lot of difference. Normally altivec code will yield
> at least a fourfold improvement in terms of instructions which can
> easily pay off, especially when considering that CPUs normally fetch
> memory in sizes of a whole cacheline which is 16 bytes for embedded
> PPC CPUs, 32 bytes in our case or even 128 bytes on PPC64.
I talked about that to my PhD advisor; he also thinks it's
not useful to try to remove the (potentially) spurious load
in put_pixels8_x2y_altivec. More code, more branches, and the
only gain is to avoid accessing a cache line in 7/32 of
the calls (assuming an even distribution of alignments).
And this is costly only if the line isn't in the L1.
--
Romain Dolbeau