[FFmpeg-devel] libavcodec/exr : add SSE SIMD for reorder_pixels v2 (WIP)
Martin Vignali
martin.vignali at gmail.com
Mon Sep 4 00:26:43 EEST 2017
Hello,
Thanks Ivan for your comments and explanations,
---
> > [...]
> > +;******************************************************************************
> > +
> > +%include "libavutil/x86/x86util.asm"
>
> Still missing explicit x86inc.asm
>
If I include x86inc instead of x86util, I get a linker error (it seems the
function prefix becomes x264 instead of ff).
> > +
> > + shr sizeq, 1; sizeq = half_size
> > + mov r3, sizeq
> > +    shr r3, 4; r3 = half_size/16 -> loop_simd count
> > +
> > +loop_simd:
> > +;initial condition loop
> > +    jle after_loop_simd; jump to scalar part if loop_simd count(r3) is 0
> > +
> > + movdqa m0, [srcq]; load first part
> > + movdqu m1, [srcq + sizeq]; load second part
>
> Would you test if moving the movdqu first makes any difference in speed?
> I had a similar case and I think that makes it faster,
> since movdqu has bigger latency.
> Might not matter on newer cpu.
>
> (If you can't tell the difference, leave it as it is.)
>
I don't notice a speed difference.
For the rest of your comments:
You're right, I can remove the scalar part:
the src and dst buffers seem to be padded to 32 bytes by av_fast_padded_malloc,
so for the SSE version that should be enough to avoid overreads and overwrites.
But that will need care for an AVX2 version.
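For reference, this is roughly how such a padded scratch buffer is obtained
(a minimal sketch; the struct and field names below are hypothetical, not the
ones in exr.c). av_fast_padded_malloc() allocates min_size plus
AV_INPUT_BUFFER_PADDING_SIZE bytes of zeroed padding (32 here, as said above),
which is what makes a 16-byte SSE load starting inside the buffer safe, while
a 32-byte AVX2 load can use up the whole padding and needs a closer check.

/* Minimal sketch, hypothetical names (not the exr.c code). */
#include <stdint.h>
#include "libavcodec/avcodec.h"
#include "libavutil/error.h"

typedef struct ScratchBuffer {
    uint8_t     *tmp;       /* scratch buffer used as src/dst of reorder_pixels */
    unsigned int tmp_size;  /* current allocated size, updated by av_fast_padded_malloc */
} ScratchBuffer;

static int get_scratch(ScratchBuffer *s, unsigned int uncompressed_size)
{
    /* (re)allocate uncompressed_size + AV_INPUT_BUFFER_PADDING_SIZE bytes,
     * with the padding zeroed, so vector loads may read slightly past
     * uncompressed_size without faulting */
    av_fast_padded_malloc(&s->tmp, &s->tmp_size, uncompressed_size);
    if (!s->tmp)
        return AVERROR(ENOMEM);
    return 0;
}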
I also modified the loop, following your comments.
I offset src and src2 by half_size, and dst by 2*half_size, so I can remove
some add/sub instructions, and I use half_size * -1 as the offset for src,
src2, and dst.
The current asm version is below (still WIP, but it passes the FATE tests for me).
I still need to check the maximum overread/overwrite for various size values.
%include "libavutil/x86/x86util.asm"

SECTION .text

;------------------------------------------------------------------------------
; void ff_reorder_pixels(uint8_t *src, uint8_t *dst, int size)
;------------------------------------------------------------------------------

INIT_XMM sse2
cglobal reorder_pixels, 3,5,3, src, dst, size
    add       dstq, sizeq                   ; offset dst by 2 * half_size
    shr       sizeq, 1                      ; sizeq = half_size
    mov       r3, sizeq                     ; r3 = half_size
    add       srcq, r3                      ; offset src by half_size
    mov       r4, srcq                      ; r4 is the start of the second part of the buffer
    add       r4, r3                        ; offset r4 by half_size
    neg       r3                            ; r3 = half_size * -1 (offset of dst, src, src2 (r4))

loop_simd:
    ; initial condition loop
    jge       end                           ; done when the negative offset reaches 0

    movdqa    m0, [srcq + r3]               ; load first part
    movdqu    m1, [r4   + r3]               ; load second part
    punpcklbw m2, m0, m1                    ; interleaved part 1
    movdqa    [dstq + r3 * 2], m2           ; copy to dst array
    punpckhbw m0, m1                        ; interleaved part 2
    movdqa    [dstq + r3 * 2 + mmsize], m0  ; copy to dst array
    add       r3, mmsize
    jmp       loop_simd

end:
    RET
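For comparison, the scalar version this replaces works roughly like this
(paraphrased from memory, not a verbatim copy of exr.c): it interleaves the
first and second half of the buffer byte by byte, which is what the
punpcklbw/punpckhbw pair above does 16 bytes at a time.

/* Rough scalar reference (paraphrased; see reorder_pixels() in exr.c for
 * the authoritative version): interleave the two halves of src into dst. */
#include <stdint.h>

static void reorder_pixels_scalar(const uint8_t *src, uint8_t *dst, int size)
{
    const uint8_t *t1   = src;                  /* first half  */
    const uint8_t *t2   = src + (size + 1) / 2; /* second half */
    uint8_t       *s    = dst;
    uint8_t       *stop = dst + size;

    while (s < stop) {
        *s++ = *t1++;  /* one byte from the first half  */
        *s++ = *t2++;  /* one byte from the second half */
    }
}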
For the perf, the current state is:

Scalar:
3082024 decicycles in reorder_pixels_zip, 130413 runs, 659 skips
bench: utime=115.926s
bench: maxrss=607670272kB

SSE ASM:
296370 decicycles in reorder_pixels_zip, 130946 runs, 126 skips
bench: utime=101.481s
bench: maxrss=607698944kB

SSE intrinsics:
289448 decicycles in reorder_pixels_zip, 130944 runs, 128 skips
bench: utime=101.417s
bench: maxrss=607694848kB
After taking a look at the asm code generated by clang from the intrinsics
version (at -O2), it seems clang modifies the loop_simd part to process twice
as many bytes per iteration (and it adds a condition to handle an odd
half_size).
I will run some tests on that, to see if I can get a speed improvement with
the same method; see the sketch below.
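My understanding of what clang does is roughly the following (a hand-written
intrinsics sketch under that assumption, not clang's actual output and not
part of the patch; the function name is only for the test): the main loop
handles 32 bytes of each half per iteration, a single 16-byte step handles an
odd number of vectors, and a scalar tail covers the remaining bytes instead
of relying on the padding.

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of a 2x unrolled SSE2 interleave, for testing the idea only.
 * Assumes an even size for simplicity. */
static void reorder_pixels_unrolled(const uint8_t *src, uint8_t *dst, ptrdiff_t size)
{
    ptrdiff_t half = size >> 1;
    const uint8_t *src2 = src + half;
    ptrdiff_t i = 0;

    /* main loop: 32 bytes of each half -> 64 bytes of dst per iteration */
    for (; i + 32 <= half; i += 32) {
        __m128i a0 = _mm_loadu_si128((const __m128i *)(src  + i));
        __m128i a1 = _mm_loadu_si128((const __m128i *)(src  + i + 16));
        __m128i b0 = _mm_loadu_si128((const __m128i *)(src2 + i));
        __m128i b1 = _mm_loadu_si128((const __m128i *)(src2 + i + 16));
        _mm_storeu_si128((__m128i *)(dst + 2 * i),      _mm_unpacklo_epi8(a0, b0));
        _mm_storeu_si128((__m128i *)(dst + 2 * i + 16), _mm_unpackhi_epi8(a0, b0));
        _mm_storeu_si128((__m128i *)(dst + 2 * i + 32), _mm_unpacklo_epi8(a1, b1));
        _mm_storeu_si128((__m128i *)(dst + 2 * i + 48), _mm_unpackhi_epi8(a1, b1));
    }
    /* one extra 16-byte step when half is not a multiple of 32 */
    for (; i + 16 <= half; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(src  + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(src2 + i));
        _mm_storeu_si128((__m128i *)(dst + 2 * i),      _mm_unpacklo_epi8(a, b));
        _mm_storeu_si128((__m128i *)(dst + 2 * i + 16), _mm_unpackhi_epi8(a, b));
    }
    /* scalar tail, so this version does not depend on the buffer padding */
    for (; i < half; i++) {
        dst[2 * i]     = src[i];
        dst[2 * i + 1] = src2[i];
    }
}

If that wins over the current loop, the same unrolling should translate
directly to the asm version (and later to an AVX2 one).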
Martin