[FFmpeg-devel] [PATCH 5/5] aarch64/opusdsp: implement NEON accelerated postfilter and deemphasis
Lynne
dev at lynne.ee
Sat Mar 23 18:16:00 EET 2019
23 Mar 2019, 15:04 by ceffmpeg at gmail.com:
> 2019-03-23 15:23 GMT+01:00, Lynne <dev at lynne.ee>:
>
>> 16 Mar 2019, 16:34 by dev at lynne.ee:
>>
>>> 153372 UNITS in postfilter_c, 65536 runs, 0 skips
>>> 73164 UNITS in postfilter_neon, 65536 runs, 0 skips -> 2.1x speedup
>>>
>>> 80591 UNITS in deemphasis_c, 131072 runs, 0 skips
>>> 43969 UNITS in deemphasis_neon, 131072 runs, 0 skips -> 1.83x
>>> speedup
>>>
>>> Total decoder speedup: ~15% on a Raspberry Pi 3 (from 28.1x to 33.5x
>>> realtime)
>>>
>>> Deemphasis SIMD based on the following unrolling:
>>> const float c1 = CELT_EMPH_COEFF, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1;
>>> float state = coeff;
>>>
>>> for (int i = 0; i < len; i += 4) {
>>> y[0] = x[0] + c1*state;
>>> y[1] = x[1] + c2*state + c1*x[0];
>>> y[2] = x[2] + c3*state + c1*x[1] + c2*x[0];
>>> y[3] = x[3] + c4*state + c1*x[2] + c2*x[1] + c3*x[0];
>>>
>>> state = y[3];
>>> y += 4;
>>> x += 4;
>>> }
>>>
>>> Unlike the x86 version, duplication is used instead of pslldq, so
>>> the structure and tables are different.
>>> The same approach, tested on x86 (3x pslldq -> vbroadcastss + shufps +
>>> pslldq), gave the same performance, so 3x pslldq was kept because
>>> vbroadcastss has a higher latency.
>>>
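For anyone reviewing the unrolling quoted above, here is a minimal C sketch that
compares it against the plain one-sample recurrence y[n] = x[n] + c*y[n-1].
It is not taken from the patch: EMPH_COEFF is an illustrative stand-in for
CELT_EMPH_COEFF, the function names are hypothetical, and len is assumed to be
a multiple of 4.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Illustrative stand-in for CELT_EMPH_COEFF (assumption, not the real constant). */
#define EMPH_COEFF 0.85f

/* Reference: one sample at a time, y[n] = x[n] + c*y[n-1]. Returns the final state. */
float deemphasis_scalar(float *y, const float *x, float state, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        y[i]  = x[i] + EMPH_COEFF * state;
        state = y[i];
    }
    return state;
}

/* Unrolled by 4 as in the commit message: each output depends only on the
 * inputs and on the state carried in from the previous group of four. */
float deemphasis_unrolled(float *y, const float *x, float state, size_t len)
{
    const float c1 = EMPH_COEFF, c2 = c1 * c1, c3 = c2 * c1, c4 = c3 * c1;

    for (size_t i = 0; i < len; i += 4) {
        y[0] = x[0] + c1 * state;
        y[1] = x[1] + c2 * state + c1 * x[0];
        y[2] = x[2] + c3 * state + c1 * x[1] + c2 * x[0];
        y[3] = x[3] + c4 * state + c1 * x[2] + c2 * x[1] + c3 * x[0];

        state = y[3];
        y += 4;
        x += 4;
    }
    return state;
}

/* Returns 1 if both versions agree within tol on the given input, else 0.
 * Handles at most 64 samples; len must be a multiple of 4. */
int deemphasis_versions_match(const float *x, float state, size_t len, float tol)
{
    float ref[64], out[64];

    assert(len <= 64 && len % 4 == 0);

    float s_ref = deemphasis_scalar(ref, x, state, len);
    float s_out = deemphasis_unrolled(out, x, state, len);

    for (size_t i = 0; i < len; i++)
        if (fabsf(ref[i] - out[i]) > tol)
            return 0;
    return fabsf(s_ref - s_out) <= tol;
}
```

The two versions are algebraically identical but round differently, so the
comparison uses a small tolerance rather than exact equality. The point of the
rewrite is that the four outputs no longer depend on each other, which is what
allows them to be computed as one vector.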
>>
>> Could someone review the patches?
>>
>
> Which toolchains did you test?
> (For compilation, not performance.)
>
gcc 8.2.1 on both aarch64 and x86-64