[FFmpeg-devel] [PATCH 04/12] Add vector_fmul_matrix to dsputil

Mon Oct 19 02:00:24 CEST 2009

Michael Niedermayer <michaelni at gmx.at> writes:

> On Sun, Oct 18, 2009 at 11:29:22PM +0100, Måns Rullgård wrote:
>> Michael Niedermayer <michaelni at gmx.at> writes:
>> 
>> > On Sun, Oct 18, 2009 at 10:13:20PM +0100, Måns Rullgård wrote:
>> >> Michael Niedermayer <michaelni at gmx.at> writes:
>> >> 
>> >> > On Sun, Oct 18, 2009 at 09:17:48PM +0100, Måns Rullgård wrote:
>> >> >> Michael Niedermayer <michaelni at gmx.at> writes:
>> >> > [...]
>> >> >> >> +        }
>> >> >> >> +    } else {
>> >> >> >> +        for (i = 0; i < len; i++) {
>> >> >> >> +            const float *m = mtx;
>> >> >> >> +            for (j = 0; j < w; j++) {
>> >> >> >> +                float s = 0;
>> >> >> >
>> >> >> >> +                for (k = 0; k < w; k++)
>> >> >> >> +                    s += v[k][i] * *m++;
>> >> >> >
>> >> >> > this is quite inefficient because, for(k), v[k][i] needs 2
>> >> >> > memory reads; a flat 2d array would be better
>> >> >> 
>> >> >> And how will the data magically transform itself into such a layout?
>> >> >
>> >> > What is the reason that the data is not in that layout?
>> >> > If the answer is that some decoder is implemented that way then my next
>> >> > question is, would there be a disadvantage in changing it?
>> >> 
>> >> Many of the audio decoders allocate the channels separately.  I didn't
>> >> write them, so I can't say how difficult it would be to change that.
>> >
>> > for many channels it should even be faster to memcpy them instead of the
>> > double dereferences
>> > memcpy needs O(w*len)
>> > the dereferences are O(w*w*len)
>> 
>> I don't expect w to be greater than 8.
>> It will probably be 2 or 6 in most cases.
>
> for 6 channels we have 36 dereferences, a copy copying just
> 1 value at a time needs 6 reads and 6 writes to get rid of these 36.
> at that naive instruction counting level, it seems my suggestion
> with copy is faster than yours without

Can you please explain exactly what you're thinking of?  I thought you
were saying the audio channel data was to be moved such that all the
channels would be contiguous in memory instead of passing a pointer to
each.  Copying it into such a layout would require w*len operations,
not w*w, and I still don't see how that would be massively more
efficient.  I also don't understand what 36 unnecessary dereferences
you're talking about.  The entire matrix must of course be read for
each sample.  We are doing len [1 x w]*[w x w] matrix multiplications.
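
To make the two alternatives under discussion concrete, here is a
minimal sketch, not the code from the patch.  The function names, the
separate out[] array and the MAX_CH limit are assumptions made for
illustration only.  The first variant reads v[k][i] inside the inner
product, as in the quoted hunk; the second first gathers the w channel
values of sample i into a small contiguous buffer and then multiplies.

```c
#define MAX_CH 8  /* assumption: w <= 8, as expected in this thread */

/* Variant A: dereference v[k][i] inside the inner product.
 * Per sample: w*w loads through the channel pointers. */
void fmul_matrix_ptrs(float **out, float *const *v, const float *mtx,
                      int len, int w)
{
    for (int i = 0; i < len; i++) {
        const float *m = mtx;
        for (int j = 0; j < w; j++) {
            float s = 0;
            for (int k = 0; k < w; k++)
                s += v[k][i] * *m++;
            out[j][i] = s;   /* hypothetical output array */
        }
    }
}

/* Variant B: gather the w values of sample i once (w reads, w writes),
 * then run the inner product on the contiguous copy. */
void fmul_matrix_gather(float **out, float *const *v, const float *mtx,
                        int len, int w)
{
    float tmp[MAX_CH];

    for (int i = 0; i < len; i++) {
        const float *m = mtx;
        for (int k = 0; k < w; k++)
            tmp[k] = v[k][i];
        for (int j = 0; j < w; j++) {
            float s = 0;
            for (int k = 0; k < w; k++)
                s += tmp[k] * *m++;
            out[j][i] = s;
        }
    }
}
```

Both variants still read all w*w matrix coefficients for every sample;
they differ only in how often the channel pointers are followed.
Whether that difference is measurable would have to be benchmarked.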

>> > also, maybe mtx would be more convenient for SIMD if it's transposed
>> > before the function 
>> 
>> That's quite possible, but we can change that later.
>
> true but the later it's done, the more code has to be changed at once.
> Now it's just an unused dsp function that would need a trivial change.

If every implementation is at least as good with the transposed
matrix, then you're right.  If there is any reason to allow different
permutations, such as for the IDCT, there will still be more work
needed when a new layout is added.
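
For reference, a rough sketch of what a transposed layout could look
like.  Both function names are hypothetical and nothing here is part
of the patch.  With the matrix transposed up front, the inner loop
walks consecutive outputs and consecutive coefficients, which is the
access pattern a SIMD implementation would presumably want.

```c
/* Hypothetical helper: dst[k*w + j] = src[j*w + k]. */
void transpose_matrix(float *dst, const float *src, int w)
{
    for (int j = 0; j < w; j++)
        for (int k = 0; k < w; k++)
            dst[k * w + j] = src[j * w + k];
}

/* One sample with the transposed matrix mt:
 * out[j] = sum over k of in[k] * mt[k*w + j].
 * For a fixed k the inner loop reads mt and writes out contiguously. */
void mul_sample_transposed(float *out, const float *in,
                           const float *mt, int w)
{
    for (int j = 0; j < w; j++)
        out[j] = 0;
    for (int k = 0; k < w; k++)
        for (int j = 0; j < w; j++)
            out[j] += in[k] * mt[k * w + j];
}
```

The result is the same as with the untransposed matrix; only the
order in which the coefficients are stored and read changes.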

-- 
Måns Rullgård
mans at mansr.com