[FFmpeg-devel] libavcodec/exr : add x86 SIMD for predictor
Henrik Gramner
henrik at gramner.com
Sun Oct 1 15:47:24 EEST 2017
On Fri, Sep 22, 2017 at 11:12 PM, Martin Vignali
<martin.vignali at gmail.com> wrote:
> +static void predictor_scalar(uint8_t *src, ptrdiff_t size)
> +{
> +    uint8_t *t = src + 1;
> +    uint8_t *stop = src + size;
> +
> +    while (t < stop) {
> +        int d = (int) t[-1] + (int) t[0] - 128;
> +        t[0] = d;
> +        ++t;
> +    }
> +}
Can be simplified quite a bit:
static void predictor_scalar(uint8_t *src, ptrdiff_t size)
{
    for (size_t i = 1; i < size; i++)
        src[i] += src[i-1] - 128;
}
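
To make the equivalence concrete, here is a minimal sketch that runs both
forms on the same buffer and compares the results. The names
predictor_scalar_orig, predictor_scalar_simple and predictor_forms_agree
are made up for this illustration and are not part of the patch; the
simplified loop uses a ptrdiff_t index here, which is my own choice to
avoid a signed/unsigned comparison.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Form from the patch: explicit int arithmetic and a pointer walk. */
static void predictor_scalar_orig(uint8_t *src, ptrdiff_t size)
{
    uint8_t *t    = src + 1;
    uint8_t *stop = src + size;

    while (t < stop) {
        int d = (int) t[-1] + (int) t[0] - 128;
        t[0] = d;                /* truncated to 8 bits on store */
        ++t;
    }
}

/* Simplified form: the compound assignment truncates the same way. */
static void predictor_scalar_simple(uint8_t *src, ptrdiff_t size)
{
    for (ptrdiff_t i = 1; i < size; i++)
        src[i] += src[i - 1] - 128;
}

/* Hypothetical self-check: returns 1 if both forms agree. */
static int predictor_forms_agree(void)
{
    uint8_t a[64], b[64];

    for (int i = 0; i < 64; i++)
        a[i] = (uint8_t) (i * 37 + 11);   /* arbitrary byte pattern */
    memcpy(b, a, sizeof(a));

    predictor_scalar_orig(a, sizeof(a));
    predictor_scalar_simple(b, sizeof(b));
    return memcmp(a, b, sizeof(a)) == 0;
}
```

Both versions compute the sum in int and only keep the low 8 bits when
storing, so they should agree for any input, not just this pattern.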
> +SECTION_RODATA 32
> +
> +neg_128: times 16 db -128
> +shuffle_15: times 16 db 15
Drop the 32-byte alignment from the section directive, we don't need it here.
db -128 is weird since it's identical to +128. I would rename those as such:
pb_128: times 16 db 128
pb_15: times 16 db 15
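
For reference, the reason the sign does not matter, and the reason a pxor
with this constant can stand in for the "- 128" of the C code, is that
for 8-bit values XOR with 0x80, adding 128 and subtracting 128 all give
the same result. A small sketch to check this exhaustively
(bias_forms_agree is a made-up name for the illustration):

```c
#include <stdint.h>

/* Returns 1 if, for every byte value, XOR with 0x80, adding 128 and
 * subtracting 128 all produce the same 8-bit result. */
static int bias_forms_agree(void)
{
    /* -128 and +128 share the 8-bit pattern 0x80. */
    if ((uint8_t) -128 != (uint8_t) 128)
        return 0;

    for (int v = 0; v < 256; v++) {
        uint8_t x      = (uint8_t) v;
        uint8_t by_xor = x ^ 0x80;
        uint8_t by_add = (uint8_t) (x + 128);
        uint8_t by_sub = (uint8_t) (x - 128);

        if (by_xor != by_add || by_add != by_sub)
            return 0;
    }
    return 1;
}
```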
> +INIT_XMM ssse3
> +cglobal predictor, 2,3,5, src, size, tmp
> +
> +    mov tmpb, [srcq]
> +    xor tmpb, -128
> +    mov [srcq], tmpb
> +
> +;offset src by size
> +    add srcq, sizeq
> +    neg sizeq ; size = offset for src
> +
> +;init mm
> +    mova m0, [neg_128] ; m0 = const for xor high byte
> +    mova m1, [shuffle_15] ; m1 = shuffle mask
> +    pxor m2, m2 ; m2 = prev_buffer
> +
> +
> +.loop:
> +    mova m3, [srcq + sizeq]
> +    pxor m3, m0
> +
> +    ;compute prefix sum
> +    mova m4, m3
> +    pslldq m4, 1
> +
> +    paddb m4, m3
> +    mova m3, m4
> +    pslldq m3, 2
> +
> +    paddb m3, m4
> +    mova m4, m3
> +    pslldq m4, 4
> +
> +    paddb m4, m3
> +    mova m3, m4
> +    pslldq m3, 8
> +
> +    paddb m4, m2
> +    paddb m4, m3
> +
> +    mova [srcq + sizeq], m4
> +
> +    ;broadcast high byte for next iter
> +    pshufb m4, m1
> +    mova m2, m4
> +
> +    add sizeq, mmsize
> +    jl .loop
> +    RET
%macro PREDICTOR 0
cglobal predictor, 2,3,5, src, size, tmp
%if mmsize == 32
    vbroadcasti128 m0, [pb_128]
%else
    mova xm0, [pb_128]
%endif
    mova xm1, [pb_15]
    mova xm2, xm0
    add srcq, sizeq
    neg sizeq
.loop:
    pxor m3, m0, [srcq + sizeq]
    pslldq m4, m3, 1
    paddb m3, m4
    pslldq m4, m3, 2
    paddb m3, m4
    pslldq m4, m3, 4
    paddb m3, m4
    pslldq m4, m3, 8
%if mmsize == 32
    paddb m3, m4
    paddb xm2, xm3
    vextracti128 xm4, m3, 1
    mova [srcq + sizeq], xm2
    pshufb xm2, xm1
    paddb xm2, xm4
    mova [srcq + sizeq + 16], xm2
%else
    paddb m2, m3
    paddb m2, m4
    mova [srcq + sizeq], m2
%endif
    pshufb xm2, xm1
    add sizeq, mmsize
    jl .loop
    RET
%endmacro

INIT_XMM ssse3
PREDICTOR

INIT_XMM avx
PREDICTOR

%if HAVE_AVX2_EXTERNAL
INIT_YMM avx2
PREDICTOR
%endif
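
For anyone reading along, the following is a rough C model of what the
16-byte (SSSE3/AVX) path of the loop above computes. It is only a sketch
for reasoning about the algorithm, not code from the patch: the names
shift_add, predictor_blockwise, predictor_reference and
blockwise_matches_reference are invented here, and it assumes size is a
positive multiple of 16.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 16   /* bytes in one XMM register */

/* Models "pslldq tmp, v, n" followed by "paddb v, tmp": add a copy of
 * v shifted up by n bytes, with zeros shifted in at the low end. */
static void shift_add(uint8_t v[BLOCK], int n)
{
    /* Walk downwards so v[i - n] still holds its pre-step value. */
    for (int i = BLOCK - 1; i >= n; i--)
        v[i] = (uint8_t) (v[i] + v[i - n]);
}

/* Assumption: size is a positive multiple of BLOCK. */
static void predictor_blockwise(uint8_t *src, ptrdiff_t size)
{
    uint8_t carry = 128;   /* cancels the XOR applied to src[0] */

    for (ptrdiff_t pos = 0; pos < size; pos += BLOCK) {
        uint8_t *v = src + pos;

        for (int i = 0; i < BLOCK; i++)
            v[i] ^= 0x80;                  /* pxor with pb_128 */

        for (int n = 1; n < BLOCK; n <<= 1)
            shift_add(v, n);               /* shifts of 1, 2, 4, 8 */

        for (int i = 0; i < BLOCK; i++)
            v[i] = (uint8_t) (v[i] + carry);

        carry = v[BLOCK - 1];              /* pshufb with pb_15 */
    }
}

/* Scalar reference, same arithmetic as predictor_scalar. */
static void predictor_reference(uint8_t *src, ptrdiff_t size)
{
    for (ptrdiff_t i = 1; i < size; i++)
        src[i] += src[i - 1] - 128;
}

/* Hypothetical self-check: returns 1 if both versions agree. */
static int blockwise_matches_reference(void)
{
    uint8_t a[64], b[64];

    for (int i = 0; i < 64; i++)
        a[i] = (uint8_t) (i * 73 + 5);     /* arbitrary byte pattern */
    memcpy(b, a, sizeof(a));

    predictor_reference(a, sizeof(a));
    predictor_blockwise(b, sizeof(b));
    return memcmp(a, b, sizeof(a)) == 0;
}
```

The four shift-and-add steps give an inclusive prefix sum within each
block, and starting the carry at 128 is what makes the separate scalar
fix-up of the first byte unnecessary.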
predictor_c: 15351.5
predictor_ssse3: 1206.5
predictor_avx: 1207.5
predictor_avx2: 880.0
On SKL-X. Only tested in checkasm.
AVX is the same speed as SSSE3 here since modern Intel CPUs eliminate
reg-reg moves in the register renaming stage, but somewhat older CPUs
such as Sandy Bridge, which is still quite popular, do not, so it should
help there.
On Fri, Sep 22, 2017 at 11:12 PM, Martin Vignali
<martin.vignali at gmail.com> wrote:
> Hello,
>
> in attach a patch
> with a port to asm of the predictor part of this patch :
>
> https://github.com/openexr/openexr/pull/229/commits/4198128397c033d4f69e5cc0833195da500c31cf
>
> Tested on OSX, pass fate test for me
> Check asm also pass for me
>
> Results with reorder simd disable :
> SSSE3 : 94.5s
> 1036758 decicycles in predictor, 130751 runs, 321 skips
>
> Scalar : 114s
> 4255109 decicycles in predictor, 130276 runs, 796 skips
>
> using reorder and predictor simd : 82.6s
>
>
> Check asm benchmark :
> ./tests/checkasm/checkasm --test=exrdsp --bench
>
> predictor_c: 10635.1
> predictor_ssse3: 1634.6
>
>
> Comments welcome
>
>
> Martin