[FFmpeg-devel] [PATCH] x86/hevc_res_add: refactor ff_hevc_transform_add{16, 32}_8
Hendrik Leppkes
h.leppkes at gmail.com
Thu Aug 21 15:03:08 CEST 2014
On Thu, Aug 21, 2014 at 12:42 AM, James Almer <jamrial at gmail.com> wrote:
> * Reduced xmm register count to 7 (As such they are now enabled for x86_32).
> * Removed four movdqa (affects the sse2 version only).
> * pxor is now used to clear m0 only once.
>
> ~5% faster.
>
> Signed-off-by: James Almer <jamrial at gmail.com>
> ---
Good job, faster and 32-bit compat!
> libavcodec/x86/hevc_res_add.asm | 122 ++++++++++++++++------------------------
> libavcodec/x86/hevcdsp_init.c | 10 ++--
> 2 files changed, 51 insertions(+), 81 deletions(-)
>
> diff --git a/libavcodec/x86/hevc_res_add.asm b/libavcodec/x86/hevc_res_add.asm
> index feea50c..7238fb3 100644
> --- a/libavcodec/x86/hevc_res_add.asm
> +++ b/libavcodec/x86/hevc_res_add.asm
> @@ -88,71 +88,41 @@ cglobal hevc_transform_add4_8, 3, 4, 6
> movhps [r0+r3 ], m1
> %endmacro
>
> -%macro TR_ADD_INIT_SSE_8 0
> - pxor m0, m0
> -
> - mova m4, [r1]
> - mova m1, [r1+16]
> - psubw m2, m0, m1
> - psubw m5, m0, m4
> - packuswb m4, m1
> - packuswb m5, m2
> -
> - mova m6, [r1+32]
> - mova m1, [r1+48]
> - psubw m2, m0, m1
> - psubw m7, m0, m6
> - packuswb m6, m1
> - packuswb m7, m2
> -
> - mova m8, [r1+64]
> - mova m1, [r1+80]
> - psubw m2, m0, m1
> - psubw m9, m0, m8
> - packuswb m8, m1
> - packuswb m9, m2
> -
> - mova m10, [r1+96]
> - mova m1, [r1+112]
> - psubw m2, m0, m1
> - psubw m11, m0, m10
> - packuswb m10, m1
> - packuswb m11, m2
> -%endmacro
> -
> -
> -%macro TR_ADD_SSE_16_8 0
> - TR_ADD_INIT_SSE_8
> -
> - paddusb m0, m4, [r0 ]
> - paddusb m1, m6, [r0+r2 ]
> - paddusb m2, m8, [r0+r2*2]
> - paddusb m3, m10,[r0+r3 ]
> - psubusb m0, m5
> - psubusb m1, m7
> - psubusb m2, m9
> - psubusb m3, m11
> - mova [r0 ], m0
> - mova [r0+r2 ], m1
> - mova [r0+2*r2], m2
> - mova [r0+r3 ], m3
> -%endmacro
> -
> -%macro TR_ADD_SSE_32_8 0
> - TR_ADD_INIT_SSE_8
> -
> - paddusb m0, m4, [r0 ]
> - paddusb m1, m6, [r0+16 ]
> - paddusb m2, m8, [r0+r2 ]
> - paddusb m3, m10,[r0+r2+16]
> - psubusb m0, m5
> - psubusb m1, m7
> - psubusb m2, m9
> - psubusb m3, m11
> - mova [r0 ], m0
> - mova [r0+16 ], m1
> - mova [r0+r2 ], m2
> - mova [r0+r2+16], m3
> +%macro TR_ADD_SSE_16_32_8 3
> + mova m2, [r1+%1 ]
> + mova m6, [r1+%1+16]
> +%if cpuflag(avx)
> + psubw m1, m0, m2
> + psubw m5, m0, m6
> +%else
> + mova m1, m0
> + mova m5, m0
> + psubw m1, m2
> + psubw m5, m6
> +%endif
I was wondering about these blocks - doesn't the x264asm layer
automatically emit the movas for you when you just use the three-operand
form on sse2? Or is there a measurable speed benefit to grouping the movs
together like this?
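(For context, a sketch of what I mean - this assumes x86inc.asm's usual
AVX_INSTR emulation, it's not taken from the patch: when an instruction is
written in three-operand form under a pre-AVX cpuflag, the abstraction
layer inserts a register copy itself whenever the destination differs from
the first source, so the explicit %if/%else split above should be
equivalent to just writing the short form once.)

```asm
; Sketch only (assumption: x86inc's automatic 3-operand emulation).
; Writing this once, without any %if cpuflag(avx) guard:
psubw   m1, m0, m2
; would assemble under AVX as:    vpsubw xmm1, xmm0, xmm2
; and under SSE2 x86inc expands it to:
;                                 mova   m1, m0
;                                 psubw  m1, m2
```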
- Hendrik