[FFmpeg-devel] [PATCH 1/4] lavc/aarch64: new optimization for 8-bit hevc_epel_v

Tue Oct 31 14:17:16 EET 2023

On Thu, 26 Oct 2023, Logan.Lyu wrote:

> And I missed submitting a commit that was earlier than these four commits, 
> which caused the corrupted whitespace problem. Now I have recreated these 
> patches.
>
> In addition, I rebased it to ensure that these patches can be successfully 
> applied on the latest master branch.
>
> Please check again, thank you.

Thanks, now these were possible to apply, and they looked mostly ok, so I 
touched up the last details I noticed and pushed them.

Things I noticed and fixed before pushing:

A bunch of minor cosmetics: you had minor misindentations in a few places 
(that were copy-pasted around in lots of places), which I fixed like this:

          ld1             {v18.16b}, [x1], x2
  .macro calc src0, src1, src2, src3
-        ld1            {\src3\().16b}, [x1], x2
+        ld1             {\src3\().16b}, [x1], x2
          movi            v4.8h, #0
          movi            v5.8h, #0
          calc_epelb      v4, \src0, \src1, \src2, \src3
@@ -461,7 +461,7 @@ function ff_hevc_put_hevc_epel_v64_8_neon, export=1
  .endm
  1:      calc_all16
  .purgem calc
-2:             ld1             {v8.8b-v11.8b}, [sp]
+2:      ld1             {v8.8b-v11.8b}, [sp]
          add             sp, sp, #32
          ret

The first patch, with mostly small trivial functions, can probably be 
scheduled better for in-order cores. I'll send a patch if I can make them 
measurably faster.

In almost every patch, you have loads/stores to the stack; you use the 
fused stack decrement nicely everywhere possible, but for the loading, 
you're almost always lacking the fused stack increment. I've fixed it now 
for this patchset, but please do keep this in mind and fix it up before 
submitting any further patches. I've fixed that up like this:

          bl              X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
-        ldp             x5, x30, [sp]
          ldp             x0, x3, [sp, #16]
-        add             sp, sp, #32
+        ldp             x5, x30, [sp], #32
          load_epel_filterh x5, x4

(In many places.)
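
As a rough illustration of the general pattern (a minimal sketch, not code 
from the patches; the registers and the 32-byte frame size are arbitrary 
choices for the example), a matching prologue and epilogue could look like 
this:

```asm
        // Prologue: pre-indexed store. sp is decremented by 32 and the
        // pair is stored at the new sp, all in one instruction.
        stp             x19, x30, [sp, #-32]!
        stp             x20, x21, [sp, #16]

        // ... function body ...

        // Epilogue: restore the upper slots first, then use a post-indexed
        // load that reads from [sp] and adds 32 to sp afterwards, so no
        // separate "add sp, sp, #32" is needed.
        ldp             x20, x21, [sp, #16]
        ldp             x19, x30, [sp], #32
        ret
```

The post-indexed load has to come last among the restores, since the other 
offsets are relative to the not yet incremented sp.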

In one place, you wrote below the stack pointer before decrementing it. 
That's ok on OSes with a defined red zone, but we shouldn't need to assume 
that; I've fixed that like this:

  function ff_hevc_put_hevc_qpel_v48_8_neon, export=1
-        stp             x5, x30, [sp, #-16]
-        stp             x0, x1, [sp, #-32]
          stp             x2, x3, [sp, #-48]!
+        stp             x0, x1, [sp, #16]
+        stp             x5, x30, [sp, #32]
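
To spell out the difference, here is a small sketch of the two orderings 
(illustrative only; the registers and the 16-byte size are made up for the 
example and are not from the patches):

```asm
        // Risky without a red zone: the store lands below sp, where e.g.
        // an asynchronous signal handler may overwrite it before sp is
        // moved down to cover it.
        stp             x0, x1, [sp, #-16]
        sub             sp, sp, #16

        // Safe everywhere: decrement sp as part of the store (pre-indexed
        // addressing), so the data is never located below sp.
        stp             x0, x1, [sp, #-16]!
```

With several pairs to store, do the pre-indexed store with the full frame 
size first, and then store the remaining pairs at positive offsets from 
the new sp.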

I'll push the patchset with these changes soon.


// Martin