[FFmpeg-devel] [PATCH] vp9/x86: 16px MC functions (64bit only).
Ronald S. Bultje
rsbultje at gmail.com
Wed Jan 15 15:13:54 CET 2014
Hi,
On Wed, Jan 15, 2014 at 8:33 AM, Clément Bœsch <u at pkh.me> wrote:
> On Thu, Dec 26, 2013 at 09:05:37PM -0500, Ronald S. Bultje wrote:
> > Cycle counts for large MCs (old -> new on ped1080p.webm, mx!=0&&my!=0):
>
> decicycle?
No, cycles (yes, they take forever!).
> > 16x8: 876 -> 870 (0.7%)
> > 16x16: 1444 -> 1435 (0.7%)
> > 16x32: 2784 -> 2748 (1.3%)
> > 32x16: 2455 -> 2349 (4.5%)
> > 32x32: 4641 -> 4084 (13.6%)
> > 32x64: 9200 -> 7834 (17.4%)
> > 64x32: 8980 -> 7197 (24.8%)
> > 64x64: 17330 -> 13796 (25.6%)
> > Total decoding time goes from 9.326sec to 9.182sec.
> > ---
> > libavcodec/x86/vp9dsp_init.c | 5 ++
> > libavcodec/x86/vp9mc.asm | 122 +++++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 127 insertions(+)
> >
> [...]
> > +%if ARCH_X86_64
> > +
> > +%macro filter_vx2_fn 1
> > +%assign %%px mmsize
> > +cglobal %1_8tap_1d_v_ %+ %%px, 6, 8, 14, dst, dstride, src, sstride, h, filtery, src4, sstride3
>
> > + sub srcq, sstrideq
> > + lea sstride3q, [sstrideq*3]
> > + sub srcq, sstrideq
> > + mova m13, [pw_256]
> > + sub srcq, sstrideq
> > + mova m8, [filteryq+ 0]
> > + lea src4q, [srcq+sstrideq*4]
> > + mova m9, [filteryq+16]
> > + mova m10, [filteryq+32]
> > + mova m11, [filteryq+48]
>
> Untested, but wouldn't it be simpler to have:
>
> lea sstride3q, [sstrideq*3]
> lea src4q, [srcq+sstrideq]
> sub srcq, sstride3q
> mova m13, [pw_256]
> mova m8, [filteryq+ 0]
> mova m9, [filteryq+16]
> mova m10, [filteryq+32]
> mova m11, [filteryq+48]
>
> ?
I think you're almost right, but we'd have to swap the second lea and the
sub. Feel free to test it; if it works and is faster (or at least not
slower), go ahead and commit. I can maybe try it this weekend.
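
The ordering question can be checked with a small model of the address arithmetic. This is only a sketch in Python, not the assembly itself: registers are modeled as plain integers, and the function names and example values are made up for illustration.

```python
def original(src, sstride):
    """Pointer setup as in the patch: three subs, src4 taken after the last sub."""
    src -= sstride                  # sub srcq, sstrideq
    sstride3 = sstride * 3          # lea sstride3q, [sstrideq*3]
    src -= sstride                  # sub srcq, sstrideq
    src -= sstride                  # sub srcq, sstrideq
    src4 = src + sstride * 4        # lea src4q, [srcq+sstrideq*4]
    return src, src4, sstride3


def suggested(src, sstride):
    """Suggested order: src4 is computed before src is moved back."""
    sstride3 = sstride * 3          # lea sstride3q, [sstrideq*3]
    src4 = src + sstride            # lea src4q, [srcq+sstrideq]
    src -= sstride3                 # sub srcq, sstride3q
    return src, src4, sstride3


def swapped(src, sstride):
    """Same instructions, with the second lea and the sub exchanged."""
    sstride3 = sstride * 3          # lea sstride3q, [sstrideq*3]
    src -= sstride3                 # sub srcq, sstride3q
    src4 = src + sstride            # lea src4q, [srcq+sstrideq]
    return src, src4, sstride3


if __name__ == "__main__":
    # Arbitrary example values; any base address and stride would do.
    for name, fn in (("original", original),
                     ("suggested", suggested),
                     ("swapped", swapped)):
        print(name, fn(1000, 64))
```

Under this model the suggested order already gives the same values as the original setup (srcq = src - 3*sstride, src4q = src + sstride). With the lea and the sub swapped, src4q ends up at src - 2*sstride, so the swap would only match if the lea also used sstrideq*4, as in the original patch.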
Ronald