[FFmpeg-devel] swscale/arm/yuv2rgb: make the code bitexact with its aarch64 counter part

Matthieu Bouron matthieu.bouron at gmail.com
Sun Mar 27 17:58:15 CEST 2016


On Fri, Mar 25, 2016 at 11:45 PM, Matthieu Bouron
<matthieu.bouron at gmail.com> wrote:

> The following patchset aims to make the yuv->rgba armv7 neon code path
> bitexact with the aarch64 one. It also aims to make the two code bases as
> close as possible.
>
> [PATCH 01/10] swscale/arm/yuv2rgb: remove 32bit code path
>
> The current 32bit code path, which is unused, is removed.
>
> [PATCH 06/10] swscale/arm/yuv2rgb: only process one line at a time
>
> The code processes only one line at a time for the yuv420p, nv12 and nv21
> formats, with no regression in performance observed on a rpi2 (I've even
> observed a slight increase of performance for the nv12 and nv21 formats).
>
> [PATCH 10/10] swscale/arm/yuv2rgb: make the code bitexact with its
>
> The last patch of the series makes the code bitexact with the aarch64
> version. The increase of precision (which introduces a performance loss)
> is compensated by a refactor/optimisation that saves quite a few mov, vdup
> and vqdmulh.
>
> ./ffmpeg_g -nostats -f lavfi -i
> testsrc2=1920x1080:d=5,format=nv12,bench=start,format=bgra,bench=stop -f
> null -
>
> without patchset:
> [bench @ 0x3eb6a0] t:0.020660 avg:0.020813 max:0.039399 min:0.020605
>
> with patchset:
> [bench @ 0xe5f6a0] t:0.018924 avg:0.019075 max:0.037472 min:0.01884


I've managed to run the code on a BeagleBone Black board; here are the
results:

nv12->bgra
without patchset:
  [bench @ 0x1fc02d0] t:0.011618 avg:0.011743 max:0.032600 min:0.011513
with patches 01-06/10 applied:
  [bench @ 0x8052d0] t:0.013438 avg:0.013659 max:0.034427 min:0.013411
with patches 01-10/10 applied:
  [bench @ 0x1fbb2d0] t:0.012554 avg:0.012751 max:0.034288 min:0.012523

yuv420p->bgra
without patchset:
  [bench @ 0x6d42d0] t:0.012954 avg:0.013159 max:0.033866 min:0.012945
with patches 01-06/10 applied:
  [bench @ 0x20172d0] t:0.015154 avg:0.015358 max:0.036186 min:0.015134
with patches 01-10/10 applied:
  [bench @ 0x1d162d0] t:0.014623 avg:0.014784 max:0.035487 min:0.014568

So it looks like processing one line at a time has a negative effect on
performance on this board (as opposed to the rpi2). I'll try to keep the
two-line processing code and post some results, so we can decide which
version to choose.
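
To put the slowdown in relative terms, here is a minimal Python sketch that
pulls the avg: field out of the bench lines quoted above and computes the
percentage change against the unpatched baseline. The helper names
(parse_avg, relative_change) are illustrative only and not part of FFmpeg;
the input lines are copied from the BeagleBone Black results in this mail.

```python
import re

# Matches the "avg:<seconds>" field of a line printed by the lavfi bench filter.
AVG_FIELD = re.compile(r"avg:(\d+\.\d+)")


def parse_avg(line):
    """Return the avg time (seconds) from one bench log line."""
    match = AVG_FIELD.search(line)
    if match is None:
        raise ValueError("no avg: field in line: %r" % line)
    return float(match.group(1))


def relative_change(baseline, value):
    """Percentage change of value against baseline (positive means slower)."""
    return (value - baseline) / baseline * 100.0


# Bench lines copied from the BeagleBone Black results in this mail.
RESULTS = {
    "nv12->bgra": [
        ("without patchset",
         "[bench @ 0x1fc02d0] t:0.011618 avg:0.011743 max:0.032600 min:0.011513"),
        ("patches 01-06/10",
         "[bench @ 0x8052d0] t:0.013438 avg:0.013659 max:0.034427 min:0.013411"),
        ("patches 01-10/10",
         "[bench @ 0x1fbb2d0] t:0.012554 avg:0.012751 max:0.034288 min:0.012523"),
    ],
    "yuv420p->bgra": [
        ("without patchset",
         "[bench @ 0x6d42d0] t:0.012954 avg:0.013159 max:0.033866 min:0.012945"),
        ("patches 01-06/10",
         "[bench @ 0x20172d0] t:0.015154 avg:0.015358 max:0.036186 min:0.015134"),
        ("patches 01-10/10",
         "[bench @ 0x1d162d0] t:0.014623 avg:0.014784 max:0.035487 min:0.014568"),
    ],
}


def summarize(results):
    """Map each conversion to a list of (label, avg, percent change vs. first entry)."""
    summary = {}
    for conversion, runs in results.items():
        baseline = parse_avg(runs[0][1])
        summary[conversion] = [
            (label, parse_avg(line), relative_change(baseline, parse_avg(line)))
            for label, line in runs
        ]
    return summary


if __name__ == "__main__":
    for conversion, rows in summarize(RESULTS).items():
        print(conversion)
        for label, avg, change in rows:
            print("  %-18s avg=%.6f  %+.1f%%" % (label, avg, change))
```

On these numbers the one-line-at-a-time patches cost roughly 16-17% on
average frame time for both conversions, and the full series recovers part
of that (to roughly 9% for nv12 and 12% for yuv420p).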

Matthieu

