[FFmpeg-devel] [PATCH] Fix non-rounding up to next 16-bit aligned bug in IFF decoder
Sebastian Vater
cdgs.basty
Thu Apr 29 20:45:55 CEST 2010
Sebastian Vater wrote:
> Ronald S. Bultje wrote:
>
>> Hi,
>>
>> On Thu, Apr 29, 2010 at 2:22 PM, Sebastian Vater
>> <cdgs.basty at googlemail.com> wrote:
>>
>>
>>> Regarding your idea, gcc outputs:
>>>   52:   0f b6 16        movzbl (%esi),%edx
>>>   55:   83 c6 01        add    $0x1,%esi
>>>   58:   89 d0           mov    %edx,%eax
>>>   5a:   83 e2 0f        and    $0xf,%edx
>>>   5d:   c1 e8 04        shr    $0x4,%eax
>>>   60:   8b 04 81        mov    (%ecx,%eax,4),%eax
>>>   63:   09 03           or     %eax,(%ebx)
>>>   65:   8b 04 91        mov    (%ecx,%edx,4),%eax
>>>   68:   09 43 04        or     %eax,0x4(%ebx)
>>>   6b:   83 c3 08        add    $0x8,%ebx
>>>   6e:   39 df           cmp    %ebx,%edi
>>>   70:   77 e0           ja     52 <decodeplane8+0x52>
>>>
>>>
>> 12 instructions, so 2 less, as intended. Is it faster?
>>
>>
>
> Old method:
> basty at cdgs-basty:~/src/ffmpeg/build$ ./ffplay ../patches/MRLake.iff
> FFplay version git-5b9f10d, Copyright (c) 2003-2010 the FFmpeg developers
> built on Apr 29 2010 15:19:11 with gcc 4.2.4 (Ubuntu 4.2.4-1ubuntu4)
> configuration:
> libavutil 50.14. 0 / 50.14. 0
> libavcodec 52.66. 0 / 52.66. 0
> libavformat 52.61. 0 / 52.61. 0
> libavdevice 52. 2. 0 / 52. 2. 0
> libswscale 0.10. 0 / 0.10. 0
> [IFF @ 0x8b32790]Estimating duration from bitrate, this may be inaccurate
> Input #0, IFF, from '../patches/MRLake.iff':
> Duration: N/A, bitrate: N/A
> Stream #0.0: Video: iff_byterun1, pal8, 737x595, PAR 17:20 DAR
> 737:700, 90k tbr, 90k tbn, 90k tbc
> 18270 dezicycles in decodeplane8, 1 runs, 0 skips
> 14805 dezicycles in decodeplane8, 2 runs, 0 skips
> 12340 dezicycles in decodeplane8, 4 runs, 0 skips
> 11127 dezicycles in decodeplane8, 8 runs, 0 skips
> 9640 dezicycles in decodeplane8, 16 runs, 0 skips
> 8882 dezicycles in decodeplane8, 32 runs, 0 skips
> 8496 dezicycles in decodeplane8, 64 runs, 0 skips
> 8312 dezicycles in decodeplane8, 127 runs, 1 skips
> 8220 dezicycles in decodeplane8, 255 runs, 1 skips
> 8173 dezicycles in decodeplane8, 511 runs, 1 skips
> 8149 dezicycles in decodeplane8, 1023 runs, 1 skips
> 8135 dezicycles in decodeplane8, 2046 runs, 2 skips
> 8128 dezicycles in decodeplane8, 4093 runs, 3 skips
> 0.76 A-V: 0.000 s:0.0 aq= 0KB vq= 0KB sq= 0B f=0/0 0/0
>
> New method with two fewer instructions:
> basty at cdgs-basty:~/src/ffmpeg/build$ ./ffplay ../patches/MRLake.iff
> FFplay version git-5b9f10d, Copyright (c) 2003-2010 the FFmpeg developers
> built on Apr 29 2010 15:19:11 with gcc 4.2.4 (Ubuntu 4.2.4-1ubuntu4)
> configuration:
> libavutil 50.14. 0 / 50.14. 0
> libavcodec 52.66. 0 / 52.66. 0
> libavformat 52.61. 0 / 52.61. 0
> libavdevice 52. 2. 0 / 52. 2. 0
> libswscale 0.10. 0 / 0.10. 0
> [IFF @ 0x8b32790]Estimating duration from bitrate, this may be inaccurate
> Input #0, IFF, from '../patches/MRLake.iff':
> Duration: N/A, bitrate: N/A
> Stream #0.0: Video: iff_byterun1, pal8, 737x595, PAR 17:20 DAR
> 737:700, 90k tbr, 90k tbn, 90k tbc
> 16180 dezicycles in decodeplane8, 1 runs, 0 skips
> 14210 dezicycles in decodeplane8, 2 runs, 0 skips
> 12887 dezicycles in decodeplane8, 4 runs, 0 skips
> 10575 dezicycles in decodeplane8, 8 runs, 0 skips
> 9363 dezicycles in decodeplane8, 16 runs, 0 skips
> 8755 dezicycles in decodeplane8, 32 runs, 0 skips
> 8445 dezicycles in decodeplane8, 64 runs, 0 skips
> 8288 dezicycles in decodeplane8, 128 runs, 0 skips
> 8208 dezicycles in decodeplane8, 256 runs, 0 skips
> 8185 dezicycles in decodeplane8, 512 runs, 0 skips
> 8162 dezicycles in decodeplane8, 1024 runs, 0 skips
> 8148 dezicycles in decodeplane8, 2048 runs, 0 skips
> 8145 dezicycles in decodeplane8, 4095 runs, 1 skips
> 3.84 A-V: 0.000 s:0.0 aq= 0KB vq= 0KB sq= 0B f=0/0 0/0
>
> The new method is initially faster but ends up about 0.2% slower (8145 vs. 8128 dezicycles)...
>
> Remember, I'm back at home again, so these benchmarks were run on the
> Athlon XP 2100+ again; please don't compare them with the ones I did on
> the Pentium 4.
>
I did another version:
/**
 * Decode interleaved plane buffer up to 8bpp
 * @param dst Destination buffer
 * @param buf Source buffer
 * @param buf_size Size of the source buffer in bytes
 * @param bps bits_per_coded_sample (must be <= 8)
 * @param plane plane number to decode as
 */
static void decodeplane8(uint8_t *dst,
                         const uint8_t *buf,
                         unsigned buf_size,
                         const unsigned bps,
                         const unsigned plane)
{
    START_TIMER;
    const uint32_t *lut = plane8_lut[plane];
    /* Process exactly buf_size source bytes; each yields 8 output bytes. */
    for (; buf_size != 0; buf_size--, dst += 8) {
        uint32_t v;
        const unsigned x = *buf++;
        v = AV_RN32A(dst) | lut[x >> 4];       /* high nibble -> dst[0..3] */
        AV_WN32A(dst, v);
        v = AV_RN32A(dst + 4) | lut[x & 0x0F]; /* low nibble -> dst[4..7] */
        AV_WN32A(dst + 4, v);
    }
    STOP_TIMER("decodeplane8");
}
With this, I get:
  58:   0f b6 16        movzbl (%esi),%edx
  5b:   83 c6 01        add    $0x1,%esi
  5e:   89 d0           mov    %edx,%eax
  60:   83 e2 0f        and    $0xf,%edx
  63:   c1 e8 04        shr    $0x4,%eax
  66:   8b 04 87        mov    (%edi,%eax,4),%eax
  69:   09 01           or     %eax,(%ecx)
  6b:   8b 04 97        mov    (%edi,%edx,4),%eax
  6e:   09 41 04        or     %eax,0x4(%ecx)
  71:   83 c1 08        add    $0x8,%ecx
  74:   83 eb 01        sub    $0x1,%ebx
  77:   75 df           jne    58 <decodeplane8+0x58>
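
The loop depends on plane8_lut, which is not shown in this mail. The following is only a sketch of how such a table could be built and used. It assumes each entry holds four output bytes, with bit `plane` set in every byte whose nibble bit (most significant bit first) is set. The names plane8_lut_sketch, init_plane8_lut_sketch and decodeplane8_sketch are made up for illustration, and memcpy stands in for AV_RN32A/AV_WN32A so that it compiles without the FFmpeg headers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for plane8_lut: [plane][nibble] -> 4 output bytes. */
static uint32_t plane8_lut_sketch[8][16];

/* Fill the table once before decoding. */
static void init_plane8_lut_sketch(void)
{
    for (unsigned plane = 0; plane < 8; plane++) {
        for (unsigned nibble = 0; nibble < 16; nibble++) {
            uint8_t px[4];
            /* Assumption: the leftmost pixel comes from the top nibble bit. */
            for (unsigned i = 0; i < 4; i++)
                px[i] = (uint8_t)(((nibble >> (3 - i)) & 1u) << plane);
            /* memcpy fixes the byte order in memory on any endianness. */
            memcpy(&plane8_lut_sketch[plane][nibble], px, sizeof(px));
        }
    }
}

/* OR one bitplane row into dst; dst must hold 8 * buf_size bytes. */
static void decodeplane8_sketch(uint8_t *dst, const uint8_t *buf,
                                unsigned buf_size, unsigned plane)
{
    assert(plane < 8);
    const uint32_t *lut = plane8_lut_sketch[plane];
    for (; buf_size != 0; buf_size--, dst += 8) {
        const unsigned x = *buf++;
        uint32_t hi, lo;
        memcpy(&hi, dst, 4);      /* portable replacement for AV_RN32A */
        memcpy(&lo, dst + 4, 4);
        hi |= lut[x >> 4];        /* high nibble -> dst[0..3] */
        lo |= lut[x & 0x0F];      /* low nibble  -> dst[4..7] */
        memcpy(dst, &hi, 4);      /* portable replacement for AV_WN32A */
        memcpy(dst + 4, &lo, 4);
    }
}
```

Under these assumptions, decoding the single byte 0xA5 as plane 0 into a zeroed 8-byte buffer gives 1,0,1,0,0,1,0,1, and OR-ing the same byte in again as plane 2 turns every 1 into 5.
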
Benchmark results:
basty at cdgs-basty:~/src/ffmpeg/build$ ./ffplay ../patches/MRLake.iff
FFplay version git-5b9f10d, Copyright (c) 2003-2010 the FFmpeg developers
built on Apr 29 2010 15:19:11 with gcc 4.2.4 (Ubuntu 4.2.4-1ubuntu4)
configuration:
libavutil 50.14. 0 / 50.14. 0
libavcodec 52.66. 0 / 52.66. 0
libavformat 52.61. 0 / 52.61. 0
libavdevice 52. 2. 0 / 52. 2. 0
libswscale 0.10. 0 / 0.10. 0
[IFF @ 0x8b33790]Estimating duration from bitrate, this may be inaccurate
Input #0, IFF, from '../patches/MRLake.iff':
Duration: N/A, bitrate: N/A
Stream #0.0: Video: iff_byterun1, pal8, 737x595, PAR 17:20 DAR
737:700, 90k tbr, 90k tbn, 90k tbc
22270 dezicycles in decodeplane8, 1 runs, 0 skips
16775 dezicycles in decodeplane8, 2 runs, 0 skips
14285 dezicycles in decodeplane8, 4 runs, 0 skips
12052 dezicycles in decodeplane8, 8 runs, 0 skips
10081 dezicycles in decodeplane8, 16 runs, 0 skips
9067 dezicycles in decodeplane8, 32 runs, 0 skips
8562 dezicycles in decodeplane8, 64 runs, 0 skips
8318 dezicycles in decodeplane8, 128 runs, 0 skips
8195 dezicycles in decodeplane8, 256 runs, 0 skips
8132 dezicycles in decodeplane8, 512 runs, 0 skips
8096 dezicycles in decodeplane8, 1023 runs, 1 skips
8077 dezicycles in decodeplane8, 2046 runs, 2 skips
8070 dezicycles in decodeplane8, 4094 runs, 2 skips
1.04 A-V: 0.000 s:0.0 aq= 0KB vq= 0KB sq= 0B f=0/0 0/0
It removes one pipeline stall and also the initial right shift by 3...
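
To put the final averages of the three runs side by side, a small helper like the one below can be used. This is only a sketch: relative_change_percent and the three constants are made-up names, and the numbers are copied from the last dezicycles line of each log in this thread.

```c
#include <assert.h>

/* Final averages in dezicycles, taken from the last line of each log. */
static const double OLD_METHOD     = 8128.0; /* "Old method" run              */
static const double TWO_INSNS_LESS = 8145.0; /* run with two fewer instructions */
static const double THIS_VERSION   = 8070.0; /* run of the version in this mail */

/* Relative change of candidate versus baseline, in percent.
 * A positive result means the candidate needs more cycles (slower). */
static double relative_change_percent(double baseline, double candidate)
{
    assert(baseline > 0.0);
    return (candidate - baseline) / baseline * 100.0;
}
```

relative_change_percent(OLD_METHOD, TWO_INSNS_LESS) comes out at about +0.21 and relative_change_percent(OLD_METHOD, THIS_VERSION) at about -0.71, so the earlier variant ends roughly 0.2% slower and this version roughly 0.7% faster than the old method on this machine. With one run each and differences this small, the gap may well be within run-to-run noise.
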
--
Best regards,
:-) Basty/CDGS (-: