[FFmpeg-devel] [Patch][OpenHEVC]added ASM DBF functions
Pierre Edouard Lepere
Pierre-Edouard.Lepere at insa-rennes.fr
Fri May 16 11:47:27 CEST 2014
Hi,
Here's a patch with the changes you suggested. However, I think the luma code is still SSSE3-dependent.
Regards,
Pierre-Edouard Lepere
----- Original Message -----
From: "James Almer" <jamrial at gmail.com>
To: "FFmpeg development discussions and patches" <ffmpeg-devel at ffmpeg.org>
Sent: Thursday, 15 May 2014 19:29:55
Subject: Re: [FFmpeg-devel] [Patch][OpenHEVC]added ASM DBF functions
On 15/05/14 11:40 AM, Pierre Edouard Lepere wrote:
> Hi,
> Here is a patch adding Seppo Tomperi's ASM functions for HEVC loop filters with some quick fixes and cosmetic changes.
>
> Regards,
> Pierre-Edouard Lepere
A couple comments below.
> +SECTION_RODATA
> +
> +pw_pixel_max: times 8 dw ((1 << 10)-1)
> +
> +SECTION .text
> +INIT_XMM sse2
> +
> +; expands to [base],...,[base+7*stride]
> +%define PASS8ROWS(base, base3, stride, stride3) \
> + [base], [base+stride], [base+stride*2], [base3], \
> + [base3+stride], [base3+stride*2], [base3+stride3], [base3+stride*4]
> +
> +; in: 8 rows of 4 bytes in %1..%8
> +; out: 4 rows of 8 words in m0..m3
> +%macro TRANSPOSE4x8B_LOAD 8
> + movd m0, %1
> + movd m2, %2
> + movd m1, %3
> + movd m3, %4
> +
> + punpcklbw m0, m2
> + punpcklbw m1, m3
> + punpcklwd m0, m1
> +
> + movd m4, %5
> + movd m6, %6
> + movd m5, %7
> + movd m7, %8
> +
> + punpcklbw m4, m6
> + punpcklbw m5, m7
> + punpcklwd m4, m5
> +
> + movdqa m2, m0
> + punpckldq m0, m4
> + punpckhdq m2, m4
There are tons of cases like this where you should instead use a 3-operand form,
and let x86inc take care of the copy instruction if needed.
This will let you add xmm AVX versions of the functions that will be faster than
their SSE2/SSSE3 counterparts because all these movdqa will be removed.
Also a nit: In general we use mova instead of movdqa/movaps. x86inc expands it
to the correct instruction depending on what you used for INIT_[XY]MM.
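To make the 3-operand point concrete, here is a small scalar model in C. It is not part of the patch: the `xmm` struct and all function names are hypothetical, and it only mimics what punpckldq/punpckhdq do to four dwords. It compares the quoted copy-then-unpack sequence with a 3-operand sequence (`punpckhdq m2, m0, m4` followed by `punpckldq m0, m4`). Note that in the 3-operand version the high unpack has to come first, because the low unpack overwrites m0.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of an XMM register viewed as four dwords. */
typedef struct { uint32_t d[4]; } xmm;

/* Model of punpckldq: interleave the low two dwords of a and b. */
static xmm unpack_lo_dq(xmm a, xmm b)
{
    xmm r = {{ a.d[0], b.d[0], a.d[1], b.d[1] }};
    return r;
}

/* Model of punpckhdq: interleave the high two dwords of a and b. */
static xmm unpack_hi_dq(xmm a, xmm b)
{
    xmm r = {{ a.d[2], b.d[2], a.d[3], b.d[3] }};
    return r;
}

/* Two-operand style, as in the quoted patch: explicit copy first. */
static void two_operand(xmm *m0, xmm *m2, const xmm *m4)
{
    *m2 = *m0;                      /* movdqa    m2, m0 */
    *m0 = unpack_lo_dq(*m0, *m4);   /* punpckldq m0, m4 */
    *m2 = unpack_hi_dq(*m2, *m4);   /* punpckhdq m2, m4 */
}

/* Three-operand style: no explicit copy, high half computed first
 * because the second instruction overwrites m0. */
static void three_operand(xmm *m0, xmm *m2, const xmm *m4)
{
    *m2 = unpack_hi_dq(*m0, *m4);   /* punpckhdq m2, m0, m4 */
    *m0 = unpack_lo_dq(*m0, *m4);   /* punpckldq m0, m4     */
}

/* Helper: bytewise register comparison. */
static int same(xmm a, xmm b)
{
    return memcmp(&a, &b, sizeof a) == 0;
}
```

Both functions leave the same values in m0 and m2. With SSE2 the 3-operand line still costs a register copy, which x86inc inserts for you; in an AVX build it becomes a single instruction.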
[...]
> +; input in m0 ... m7, betas in r2 tcs in r3. Output in m1...m6
> +%macro LUMA_DEBLOCK_BODY 2
> + movdqa m9, m2
> + psllw m9, 1; *2
> + movdqa m10, m1
> + psubw m10, m9
> + paddw m10, m3
> + pabsw m10, m10 ; 0dp0, 0dp3 , 1dp0, 1dp3
Use ABS1 here, or PABSW where pabsw is used with two different registers.
Then you can add an SSE2 version of the luma functions as well (Phenom users
will thank you).
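As a rough illustration of why an SSE2 version is possible without pabsw: a word-wise absolute value can be built from SSE2 instructions alone, as max(x, 0 - x) using pxor, psubw and pmaxsw. My understanding is that the non-SSSE3 path of ABS1 works along these lines, but treat that as an assumption and check x86util.asm. The C below is a scalar model of a single 16-bit lane; the function names are made up for this sketch.

```c
#include <stdint.h>

/* One 16-bit lane of psubw: subtraction with wraparound.
 * The arithmetic is done in a wider type and wrapped by hand so the
 * result does not rely on implementation-defined conversions. */
static int16_t sub_wrap16(int16_t a, int16_t b)
{
    int r = (int)a - (int)b;                        /* -65535..65535 */
    uint32_t u = (uint32_t)(r + 32768 + 65536) & 0xFFFFu;
    return (int16_t)((int)u - 32768);
}

/* One 16-bit lane of pmaxsw: signed maximum. */
static int16_t max_s16(int16_t a, int16_t b)
{
    return a > b ? a : b;
}

/* Absolute value of one lane, modelled on the sequence
 *     pxor   tmp, tmp
 *     psubw  tmp, x
 *     pmaxsw x, tmp
 */
static int16_t abs16_sse2_style(int16_t x)
{
    int16_t tmp = 0;            /* pxor   tmp, tmp */
    tmp = sub_wrap16(tmp, x);   /* psubw  tmp, x   */
    return max_s16(x, tmp);     /* pmaxsw x, tmp   */
}
```

Like pabsw, this leaves -32768 unchanged, since its negation wraps back to itself.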
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel at ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-updated-to-use-x86util-macros.patch
Type: text/x-patch
Size: 14318 bytes
Desc: not available
URL: <https://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20140516/f51b74b4/attachment.bin>