[FFmpeg-devel] [PATCH 04/10] lavc/flacenc: add sse4 version of the lpc encoder
James Darnley
james.darnley at gmail.com
Wed Feb 12 16:11:54 CET 2014
On 2014-02-12 12:41, Christophe Gisquet wrote:
> Hi,
>
> 2014-02-12 0:11 GMT+01:00 James Darnley <james.darnley at gmail.com>:
>
>> +%if ARCH_X86_64
>> + cglobal flac_enc_lpc_16, 6, 8, 4, 0, res, smp, len, order, coefs, shift
>> + %define posj r6
>> + %define negj r7
>> +%else
>> + cglobal flac_enc_lpc_16, 6, 6, 4, 0, res, smp, len, order, coefs, shift
>> + %define posj r2
>> + %define negj r5
>> +%endif
> [...]
>> +movd m3, shiftmp
>
> If I'm not mistaken and x264asm isn't already brighter than me, you're
> forcing the loading of shift into a gpr, while you really never have
> to.
> This 6th argument will always be on the stack, so you need one less
> gpr in all cases.
As I understand it, nix64 has it in a register; I think that is what
libavutil/x86/x86inc.asm:501 says, anyway.
I just ended up with all the args loaded because, when I tried it on
Win64, I got "cmp R9, R9" at one point despite thinking I had a
register and a memory location.
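For illustration, a minimal sketch of what I understand the suggestion
to be (untested, and assuming x86inc lets you request fewer GPRs than
there are arguments, leaving shiftm pointing at the argument's home
location: a register on nix64, a stack slot on Win64 and x86-32):

    ; untested sketch: shift is never copied into a GPR; movd takes
    ; either a GPR or a memory operand, so shiftm works in both ABIs
    cglobal flac_enc_lpc_16, 6, 5, 4, 0, res, smp, len, order, coefs, shift
        movd m3, shiftm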
> I'm not sure, but is it possible to leave order or len wherever they
> are for x86, so as to save another gpr? That may require to manually
> load the args.
I will look again, more closely, to see if I can reduce the number of
registers used. I think an easy way to do this is to re-order the
arguments so that the pointers can all go at the beginning.
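Something like this, hypothetically (the C prototype and the callers
would of course have to change to match):

    ; hypothetical argument order with the three pointers grouped first
    cglobal flac_enc_lpc_16, 6, 6, 4, 0, res, smp, coefs, len, order, shift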
>> +.looplen:
>> + pxor m0, m0
>> + xor posj, posj
>> + xor negj, negj
>> + .looporder:
>> + movd m2, [coefsq+posj*4] ; c = coefs[j]
>> + SPLATD m2
>> + movu m1, [smpq+negj*4-4] ; s = smp[i-j-1]
>> + pmulld m1, m2
>> + paddd m0, m1 ; p += c * s
>> +
>> + add posj, 1
>> + sub negj, 1
>> + cmp posj, ordermp
>> + jne .looporder
>
> Potentially stupid question: do the add and sub get compiled to
> inc/dec? Is there a benefit compared to adding/subtracting 4? (I
> guess there is)
> Also, maybe not worthwhile, coefsq could be incremented by orderq*4,
> posj set to -orderq, and then you would do:
> dec negj
> inc posj
> jl/jnz .looporder
No, they don't get reduced to inc and dec.
In my first, non-public attempt at this I did loop over decreasing
order, but my code produced completely wrong results. I could look
again at doing this now that I have working code and have "decoded"
the algorithm from the C code.
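If I do, I would expect the inner loop to come out something like this
(an untested sketch of your suggestion; posj now counts from -order up
to zero, so the loop condition comes free from the inc's flags):

    lea    coefsq, [coefsq+orderq*4]   ; once, before .looplen:
                                       ; point one past the last coef
    .looplen:
        pxor   m0, m0
        mov    posj, orderq
        neg    posj                    ; posj = -order
        xor    negj, negj
        .looporder:
            movd   m2, [coefsq+posj*4] ; c = coefs[order+posj]
            SPLATD m2
            movu   m1, [smpq+negj*4-4] ; s = smp[i-j-1], as before
            pmulld m1, m2
            paddd  m0, m1              ; p += c * s
            dec    negj
            inc    posj
            jnz    .looporder          ; stop when posj reaches 0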
>> + movu [resq], m1 ; res[i] = smp[i] - (p >> shift)
>> +
>> + add resq, mmsize
>> + add smpq, mmsize
>> + sub lenmp, mmsize/4
>> +jg .looplen
>
> Equivalent trick here if len is in a reg: add 4*len*mmsize to resq,
> neg lenq then:
> movu [resq+4*lenq], m1
> add smpq, mmsize
> add lenq, mmsize/4
> jg .looplen
> There are probably errors in what I gave, but this should be
> sufficient to give you the idea.
Yes, I think so.
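With the details fixed up (the offset being 4*len rather than
4*len*mmsize, and the branch becoming jl because lenq counts up from
-len towards zero), I would expect something like this untested sketch:

    lea    resq, [resq+lenq*4]     ; point one past the end of res
    neg    lenq                    ; lenq = -len
    .looplen:
        ; (inner .looporder loop, shift and subtract as before)
        movu   [resq+lenq*4], m1   ; res[i] = smp[i] - (p >> shift)
        add    smpq, mmsize
        add    lenq, mmsize/4
        jl     .looplen            ; continue while lenq < 0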