[FFmpeg-devel] [PATCH] flac/x86: add ff_flac_lpc_32_sse4()
James Almer
jamrial at gmail.com
Sat Feb 1 05:45:38 CET 2014
On 01/02/14 1:38 AM, James Almer wrote:
> x64
> 1261661 decicycles in flac_lpc_32_c, 32768 runs
> 1045689 decicycles in ff_flac_lpc_32_sse4, 32768 runs
>
> 1431506 decicycles in flac_lpc_32_c, 32768 runs
> 1209322 decicycles in ff_flac_lpc_32_sse4, 32768 runs
>
> x86
> 1429597 decicycles in flac_lpc_32_c, 32768 runs
> 953667 decicycles in ff_flac_lpc_32_sse4, 32768 runs
>
> 1610348 decicycles in flac_lpc_32_c, 32768 runs
> 1079424 decicycles in ff_flac_lpc_32_sse4, 32768 runs
>
> About 100 to 500 ms faster decoding using -threads 1 depending on song and arch.
> Tested using a few 24-bit samples on an AMD FX 6300, Win7 x64 and x86.
> Biggest speedup appears to be on x86 builds.
>
> Signed-off-by: James Almer <jamrial at gmail.com>
> ---
> libavcodec/flacdsp.c | 2 ++
> libavcodec/flacdsp.h | 1 +
> libavcodec/x86/Makefile | 2 ++
> libavcodec/x86/flacdsp.asm | 61 +++++++++++++++++++++++++++++++++++++++++++
> libavcodec/x86/flacdsp_init.c | 39 +++++++++++++++++++++++++++
> 5 files changed, 105 insertions(+)
> create mode 100644 libavcodec/x86/flacdsp.asm
> create mode 100644 libavcodec/x86/flacdsp_init.c
>
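To put the quoted timings in relative terms, here is a small sketch that turns a pair of decicycle counts into a speedup factor. The helper names (speedup, percent_saved) are made up for illustration and are not part of the patch; the inputs are meant to be the START_TIMER/STOP_TIMER figures above.

```c
/* Hypothetical helpers, not part of the patch: compare two timer readings. */

/* How many times faster the optimized version is than the baseline. */
double speedup(double baseline_decicycles, double optimized_decicycles)
{
    return baseline_decicycles / optimized_decicycles;
}

/* Fraction of the baseline time that was saved, as a percentage. */
double percent_saved(double baseline_decicycles, double optimized_decicycles)
{
    return 100.0 * (baseline_decicycles - optimized_decicycles)
                 / baseline_decicycles;
}
```

Feeding it the first x64 pair (1261661 vs 1045689) gives a factor of roughly 1.2, and the first x86 pair (1429597 vs 953667) gives roughly 1.5, which matches the note that x86 builds gain the most.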
I couldn't test this with Valgrind, or on a Linux box for that matter.
I have access to this FX 6300 for the time being, so I used it to write this, but I can't
install a VM on it.
I originally wrote this to do two calculations per packed instruction (using all 128
bits of the xmm registers instead of 64), but after punpckldq-ing and pshufd-ing values
around and adding extra checks for odd pred_order values, it somehow ended up slower
than the pure C implementation.
This version will do until I get that other one working faster. If I can, of course.
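
For anyone not familiar with the loop in question, here is a rough scalar sketch of the two strategies. It is an illustration only, not the code from the patch: the function names lpc_32_ref and lpc_32_pairs are made up, and it assumes coeffs[j] lines up with the j-th oldest sample in the history window. Each product needs a 64-bit accumulator, which is why a 128-bit xmm register can hold at most two of them at a time.

```c
#include <stdint.h>

/* Hypothetical reference loop: one 32x32->64 multiply-accumulate per step.
 * decoded[0..pred_order-1] are warm-up samples; the rest hold residuals
 * that get the prediction added in place.
 * Assumes >> on a negative int64_t is an arithmetic shift. */
void lpc_32_ref(int32_t *decoded, const int32_t *coeffs,
                int pred_order, int qlevel, int len)
{
    for (int i = pred_order; i < len; i++) {
        const int32_t *hist = decoded + i - pred_order;
        int64_t sum = 0;
        for (int j = 0; j < pred_order; j++)
            sum += (int64_t)coeffs[j] * hist[j];
        decoded[i] += (int32_t)(sum >> qlevel);
    }
}

/* Hypothetical "two per step" variant: two independent 64-bit accumulators,
 * mimicking both halves of a 128-bit register, plus the extra tail check
 * that an odd pred_order requires. */
void lpc_32_pairs(int32_t *decoded, const int32_t *coeffs,
                  int pred_order, int qlevel, int len)
{
    for (int i = pred_order; i < len; i++) {
        const int32_t *hist = decoded + i - pred_order;
        int64_t lo = 0, hi = 0;
        int j = 0;
        for (; j + 1 < pred_order; j += 2) {
            lo += (int64_t)coeffs[j]     * hist[j];
            hi += (int64_t)coeffs[j + 1] * hist[j + 1];
        }
        if (j < pred_order)              /* leftover tap when the order is odd */
            lo += (int64_t)coeffs[j] * hist[j];
        decoded[i] += (int32_t)((lo + hi) >> qlevel);
    }
}
```

Both functions should produce identical output. In the SIMD case the second form also has to shuffle 32-bit inputs into the right lanes and fold the two halves together at the end, which is presumably where the extra overhead comes from.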
Regards.