[FFmpeg-devel] [PATCH] 8-bit hevc decoding optimization on aarch64 with neon
Clément Bœsch
u at pkh.me
Sat Nov 25 10:25:03 EET 2017
On Sat, Nov 18, 2017 at 06:35:48PM +0100, Rafal Dabrowa wrote:
>
> This is a proposal of performance optimizations for 8-bit
> hevc video decoding on aarch64 platform with neon (simd) extension.
>
> I'm testing my optimizations on a NanoPi M3 device, mainly using the
> "Big Buck Bunny" video file at 1280x720 for testing.
> The video file was pulled from libde265.org page, see
> http://www.libde265.org/hevc-bitstreams/bbb-1280x720-cfg06.mkv
> The movie duration is 00:10:34.53.
>
> Overall performance gain is about 2x. Without optimizations the movie
> playback stops in practice after a few seconds. With
> optimizations the file is played smoothly 99% of the time.
>
> For performance testing the following command was used:
>
> time ./ffmpeg -hide_banner -i ~/bbb-1280x720-cfg06.mkv -f yuv4mpegpipe - >/dev/null
>
> The video file was pre-read before test to minimize disk reads during testing.
> Program execution time without optimization was as follows:
>
> real 11m48.576s
> user 43m8.111s
> sys 0m12.469s
>
> Execution time with optimizations:
>
> real 6m17.046s
> user 21m19.792s
> sys 0m14.724s
>
Can you post the results of checkasm --bench for hevc?
Did you run it to check for any calling convention violation?
>
> The patch contains optimizations for most heavily used qpel, epel, sao and idct
> functions. Among the functions provided for optimization there are two
> intensively used, but not optimized in this patch: hevc_v_loop_filter_luma_8
> and hevc_h_loop_filter_luma_8. I have no idea how they could be optimized,
> hence I left them without optimizations.
>
You may want to check x86/hevc_deblock.asm then (no idea if these are
implemented).
[...]
> +function ff_hevc_put_hevc_pel_pixels4_8_neon, export=1
> + mov x7, 128
> +1: ld1 { v0.s }[0], [x1], x2
> + ushll v4.8h, v0.8b, 6
> + st1 { v4.d }[0], [x0], x7
Is using #128 directly not possible here?
> + subs x3, x3, 1
> + b.ne 1b
> + ret
Here and below: why is the x6 register not used?
A few comments on the style:
- please use consistent spacing (the current function does not match later
code), preferably using a relatively large number of spaces as common
ground (check the other sources)
- we use capitalized size suffixes (B, H, ...); and IIRC the lower case
forms are problematic with some assemblers, but don't quote me on that.
- we don't use spaces inside {}
> +endfunc
> +
> +function ff_hevc_put_hevc_pel_pixels6_8_neon, export=1
> + mov x7, 120
> +1: ld1 { v0.8b }, [x1], x2
> + ushll v4.8h, v0.8b, 6
> + st1 { v4.d }[0], [x0], 8
I think you need to use # as a prefix for the immediates
> + st1 { v4.s }[2], [x0], x7
I assume you can't use #120?
Have you checked if using #128 here and decrementing x0 afterward isn't
faster?
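For reference, a scalar C sketch of what the pel_pixels functions above compute, to make the 128 and 8 + 120 byte advances easier to follow. The name pel_pixels_8_ref is hypothetical; MAX_PB_SIZE (64 int16_t elements, so 128 bytes per dst row) is the constant FFmpeg's HEVC code uses for the intermediate buffer pitch. As far as I know, the immediate post-index forms of ld1/st1 only accept the number of bytes transferred, which would explain why 128 and 120 go through a register, but that is worth checking against the architecture manual.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PB_SIZE 64 /* dst row pitch in int16_t elements, i.e. 128 bytes */

/* Hypothetical scalar model of the 8-bit pel_pixels copy:
 * widen each source pixel to 16 bits and shift it left by 6 (ushll #6). */
static void pel_pixels_8_ref(int16_t *dst, const uint8_t *src,
                             ptrdiff_t srcstride, int height, int width)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            dst[x] = (int16_t)(src[x] << 6);
        src += srcstride;   /* x1 += x2 in the assembly */
        dst += MAX_PB_SIZE; /* 128 bytes: 8 + 120 in the width-6 variant */
    }
}
```

With width 6, the assembly stores 8 bytes and then 4 bytes per row, so the two post-increments have to add up to the same 128-byte pitch.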
[...]
> +function ff_hevc_put_hevc_pel_bi_pixels32_8_neon, export=1
> + mov x10, 128
> +1: ld1 { v0.16b, v1.16b }, [x2], x3 // src
> + ushll v16.8h, v0.8b, 6
> + ushll2 v17.8h, v0.16b, 6
> + ushll v18.8h, v1.8b, 6
> + ushll2 v19.8h, v1.16b, 6
> + ld1 { v20.8h, v21.8h, v22.8h, v23.8h }, [x4], x10 // src2
> + sqadd v16.8h, v16.8h, v20.8h
> + sqadd v17.8h, v17.8h, v21.8h
> + sqadd v18.8h, v18.8h, v22.8h
> + sqadd v19.8h, v19.8h, v23.8h
> + sqrshrun v0.8b, v16.8h, 7
> + sqrshrun2 v0.16b, v17.8h, 7
> + sqrshrun v1.8b, v18.8h, 7
> + sqrshrun2 v1.16b, v19.8h, 7
Does pairing help here?
sqrshrun v0.8b, v16.8h, 7
sqrshrun v1.8b, v18.8h, 7
sqrshrun2 v0.16b, v17.8h, 7
sqrshrun2 v1.16b, v19.8h, 7
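Whatever the instruction order, the per-lane result must stay the same. A scalar C sketch of one lane of the sequence above may help when checking a reordered version; the helper names are hypothetical, and the model is read off the quoted instructions (ushll #6, sqadd, sqrshrun #7) rather than taken from the patch.

```c
#include <stdint.h>

/* Saturating signed 16-bit add, modelling one lane of sqadd. */
static int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + b;
    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (int16_t)s;
}

/* Rounding shift right by 7 with unsigned 8-bit saturation,
 * modelling one lane of sqrshrun #7. */
static uint8_t rshift7_sat_u8(int16_t v)
{
    int32_t r = (int32_t)v + 64; /* rounding constant 1 << 6 */
    if (r < 0) return 0;         /* negative values saturate to 0 */
    r >>= 7;
    return r > 255 ? 255 : (uint8_t)r;
}

/* One output pixel of the bi-prediction copy: widen src, add src2, narrow. */
static uint8_t bi_pixel_8_ref(uint8_t src, int16_t src2)
{
    return rshift7_sat_u8(sat_add16((int16_t)(src << 6), src2));
}
```

Since each output lane depends only on its own inputs, interleaving the sqrshrun/sqrshrun2 pairs changes scheduling but not results.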
[...]
> + sqrshrun v0.8b, v16.8h, 7
> + sqrshrun2 v0.16b, v17.8h, 7
> + sqrshrun v1.8b, v18.8h, 7
> + sqrshrun2 v1.16b, v19.8h, 7
> + sqrshrun v2.8b, v20.8h, 7
> + sqrshrun2 v2.16b, v21.8h, 7
> + sqrshrun v3.8b, v22.8h, 7
> + sqrshrun2 v3.16b, v23.8h, 7
Again, this might be a good candidate for attempting to shuffle the
instructions and see if it helps (there are many other places, I picked
one randomly).
> +.Lepel_filters:
const/endconst + align might be better for all these labels
[...]
> +function ff_hevc_put_hevc_epel_hv12_8_neon, export=1
> + add x10, x3, 3
> + lsl x10, x10, 7
> + sub sp, sp, x10 // tmp_array
> + stp x0, x3, [sp, -16]!
> + stp x5, x30, [sp, -16]!
> + add x0, sp, 32
> + sub x1, x1, x2
> + add x3, x3, 3
> + bl ff_hevc_put_hevc_epel_h12_8_neon
> + ldp x5, x30, [sp], 16
> + ldp x0, x3, [sp], 16
> + load_epel_filterh x5, x4
> + mov x5, 112
> + mov x10, 128
> + ld1 { v16.8h, v17.8h }, [sp], x10
> + ld1 { v18.8h, v19.8h }, [sp], x10
> + ld1 { v20.8h, v21.8h }, [sp], x10
> +1: ld1 { v22.8h, v23.8h }, [sp], x10
> + calc_epelh v4, v16, v18, v20, v22
> + calc_epelh2 v4, v5, v16, v18, v20, v22
> + calc_epelh v5, v17, v19, v21, v23
> + st1 { v4.8h }, [x0], 16
> + st1 { v5.4h }, [x0], x5
> + subs x3, x3, 1
> + b.eq 2f
> +
> + ld1 { v16.8h, v17.8h }, [sp], x10
> + calc_epelh v4, v18, v20, v22, v16
> + calc_epelh2 v4, v5, v18, v20, v22, v16
> + calc_epelh v5, v19, v21, v23, v17
> + st1 { v4.8h }, [x0], 16
> + st1 { v5.4h }, [x0], x5
> + subs x3, x3, 1
> + b.eq 2f
> +
> + ld1 { v18.8h, v19.8h }, [sp], x10
> + calc_epelh v4, v20, v22, v16, v18
> + calc_epelh2 v4, v5, v20, v22, v16, v18
> + calc_epelh v5, v21, v23, v17, v19
> + st1 { v4.8h }, [x0], 16
> + st1 { v5.4h }, [x0], x5
> + subs x3, x3, 1
> + b.eq 2f
> +
> + ld1 { v20.8h, v21.8h }, [sp], x10
> + calc_epelh v4, v22, v16, v18, v20
> + calc_epelh2 v4, v5, v22, v16, v18, v20
> + calc_epelh v5, v23, v17, v19, v21
> + st1 { v4.8h }, [x0], 16
> + st1 { v5.4h }, [x0], x5
> + subs x3, x3, 1
> + b.ne 1b
Introducing macros probably makes sense in these functions
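To illustrate why the four unrolled blocks above are one body repeated with rotated register names, here is a scalar C sketch of a 4-tap vertical pass over the intermediate buffer. It is an assumption-laden model: calc_epelh is not shown in the quoted patch, so the final ">> 6" follows FFmpeg's C reference for the second epel_hv pass as I remember it, and the function name and row-pointer rotation are only illustrative.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PB_SIZE 64 /* intermediate row pitch in int16_t elements (128 bytes) */

/* Hypothetical 4-tap vertical filter; tmp must hold height + 3 rows.
 * The four row pointers rotate each iteration instead of being copied,
 * which is what the unrolled assembly does with v16..v23. */
static void epel_v_rotating_ref(int16_t *dst, const int16_t *tmp,
                                int height, int width, const int8_t f[4])
{
    const int16_t *r[4] = { tmp, tmp + MAX_PB_SIZE, tmp + 2 * MAX_PB_SIZE, NULL };
    const int16_t *next = tmp + 3 * MAX_PB_SIZE;
    int oldest = 0;

    for (int y = 0; y < height; y++) {
        r[(oldest + 3) & 3] = next; /* load one new row per output row */
        next += MAX_PB_SIZE;
        for (int x = 0; x < width; x++) {
            int sum = f[0] * r[oldest][x]
                    + f[1] * r[(oldest + 1) & 3][x]
                    + f[2] * r[(oldest + 2) & 3][x]
                    + f[3] * r[(oldest + 3) & 3][x];
            /* assumes arithmetic right shift for negative sums */
            dst[x] = (int16_t)(sum >> 6);
        }
        dst += MAX_PB_SIZE;
        oldest = (oldest + 1) & 3; /* drop the oldest row */
    }
}
```

An assembler macro taking the four row registers as arguments would express the same rotation, leaving four one-line invocations instead of four copies of the body.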
[...]
> +8: b 9f // 0
> + nop
> + nop
> + nop
> + st1 { v29.b }[0], [x7] // 1
> + b 9f
> + nop
> + nop
> + st1 { v29.h }[0], [x7] // 2
> + b 9f
> + nop
> + nop
> + st1 { v29.h }[0], [x7], 2 // 3
> + st1 { v29.b }[2], [x7]
> + b 9f
> + nop
> + st1 { v29.s }[0], [x7] // 4
> + b 9f
> + nop
> + nop
> + st1 { v29.s }[0], [x7], 4 // 5
> + st1 { v29.b }[4], [x7]
> + b 9f
> + nop
> + st1 { v29.s }[0], [x7], 4 // 6
> + st1 { v29.h }[2], [x7]
> + b 9f
> + nop
> + st1 { v29.s }[0], [x7], 4 // 7
> + st1 { v29.h }[2], [x7], 2
> + st1 { v29.b }[6], [x7]
What are these nops for? Alignment?
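In the quoted table every case takes exactly four instruction slots (16 bytes), which would fit a computed branch to "8b + n * 16"; the dispatch code is elided above, so that reading is an assumption. Under it, the table is the assembly equivalent of the following C switch, where store_tail is a hypothetical name.

```c
#include <stdint.h>
#include <string.h>

/* Store the first n (0..7) bytes of an 8-byte vector.
 * Each case mirrors one slot of the quoted table. */
static void store_tail(uint8_t *dst, const uint8_t v[8], unsigned n)
{
    switch (n) {
    case 0: break;
    case 1: memcpy(dst, v, 1); break;
    case 2: memcpy(dst, v, 2); break;
    case 3: memcpy(dst, v, 2); memcpy(dst + 2, v + 2, 1); break;
    case 4: memcpy(dst, v, 4); break;
    case 5: memcpy(dst, v, 4); memcpy(dst + 4, v + 4, 1); break;
    case 6: memcpy(dst, v, 4); memcpy(dst + 4, v + 4, 2); break;
    case 7: memcpy(dst, v, 4); memcpy(dst + 4, v + 4, 2);
            memcpy(dst + 6, v + 6, 1); break;
    default: break; /* n >= 8 is not handled by the table */
    }
}
```

If that is the intent, a short comment next to the table stating the fixed slot size would answer the question for future readers.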
[...]
Anyway, can you split your patch? It's really a lot of code and there is
no way anyone can review it properly quickly.
I also think macros would be welcome in many places to reduce the size of
the patch(es).
Regards,
--
Clément B.