[FFmpeg-devel] [PATCH] Fix non-rounding up to next 16-bit aligned bug in IFF decoder
Måns Rullgård
Fri Apr 30 15:28:47 CEST 2010
Sebastian Vater <cdgs.basty at googlemail.com> writes:
> Sebastian Vater wrote:
>> Sebastian Vater wrote:
>>
>>> Ronald S. Bultje wrote:
>>>
>>>
>>>> Hi,
>>>>
>>>> On Thu, Apr 29, 2010 at 2:45 PM, Sebastian Vater
>>>> <cdgs.basty at googlemail.com> wrote:
>>>>
>>>>
>>>>
>>>>> Did another version:
>>>>>
>>>>>
>>>>>
>>>> [..]
>>>>
>>>>
>>>>
>>>>> START_TIMER;
>>>>> const uint32_t *lut = plane8_lut[plane];
>>>>> for(; --buf_size != 0; dst += 8) {
>>>>> uint32_t v;
>>>>> const unsigned x = *buf++;
>>>>> v = AV_RN32A(dst) | lut[x >> 4];
>>>>> AV_WN32A(dst, v);
>>>>> v = AV_RN32A(dst + 4) | lut[x & 0x0F];
>>>>> AV_WN32A(dst + 4, v);
>>>>> }
>>>>> STOP_TIMER("decodeplane8");
>>>>>
>>>>>
>>>>>
>>>> [..]
>>>>
>>>>
>>>>
>>>>> 58: 0f b6 16 movzbl (%esi),%edx
>>>>> 5b: 83 c6 01 add $0x1,%esi
>>>>> 5e: 89 d0 mov %edx,%eax
>>>>> 60: 83 e2 0f and $0xf,%edx
>>>>> 63: c1 e8 04 shr $0x4,%eax
>>>>> 66: 8b 04 87 mov (%edi,%eax,4),%eax
>>>>> 69: 09 01 or %eax,(%ecx)
>>>>> 6b: 8b 04 97 mov (%edi,%edx,4),%eax
>>>>> 6e: 09 41 04 or %eax,0x4(%ecx)
>>>>> 71: 83 c1 08 add $0x8,%ecx
>>>>> 74: 83 eb 01 sub $0x1,%ebx
>>>>> 77: 75 df jne 58 <decodeplane8+0x58>
>>>>>
>>>>>
>>>>>
>>>> [..]
>>>>
>>>>
>>>>
>>>>> 9067 dezicycles in decodeplane8, 32 runs, 0 skips
>>>>> 8562 dezicycles in decodeplane8, 64 runs, 0 skips
>>>>> 8318 dezicycles in decodeplane8, 128 runs, 0 skips
>>>>> 8195 dezicycles in decodeplane8, 256 runs, 0 skips
>>>>> 8132 dezicycles in decodeplane8, 512 runs, 0 skips
>>>>> 8096 dezicycles in decodeplane8, 1023 runs, 1 skips
>>>>> 8077 dezicycles in decodeplane8, 2046 runs, 2 skips
>>>>> 8070 dezicycles in decodeplane8, 4094 runs, 2 skips
>>>>>
>>>>>
>>>>>
>>>> That looks good to me.
>>>>
>>>>
>>>>
>>> Using uint64_t again with 8-bit lut:
>>> /**
>>> * Decode interleaved plane buffer up to 8bpp
>>> * @param dst Destination buffer
>>> * @param buf Source buffer
>>> * @param buf_size Size of the source buffer in bytes
>>> * @param bps bits_per_coded_sample (must be <= 8)
>>> * @param plane plane number to decode as
>>> */
>>> static void decodeplane8(uint8_t *dst,
>>> const uint8_t *buf,
>>> unsigned buf_size,
>>> const unsigned bps,
>>> const unsigned plane)
>>> {
>>> START_TIMER;
>>> const uint64_t *lut = plane8_lut[plane];
>>> for(; --buf_size != 0; dst += 8) {
>>> const uint64_t v = AV_RN64A(dst) | lut[*buf++];
>>> AV_WN64A(dst, v);
>>> }
>>> STOP_TIMER("decodeplane8");
>>> }
>>>
>>> 58: 8b 54 24 64 mov 0x64(%esp),%edx
>>> 5c: 8b 4c 24 44 mov 0x44(%esp),%ecx
>>> 60: 8b 5d 04 mov 0x4(%ebp),%ebx
>>> 63: 8b 45 00 mov 0x0(%ebp),%eax
>>> 66: 0f b6 32 movzbl (%edx),%esi
>>> 69: 83 44 24 64 01 addl $0x1,0x64(%esp)
>>> 6e: 8b 54 f1 04 mov 0x4(%ecx,%esi,8),%edx
>>> 72: 0b 04 f1 or (%ecx,%esi,8),%eax
>>> 75: 09 da or %ebx,%edx
>>> 77: 89 45 00 mov %eax,0x0(%ebp)
>>> 7a: 89 55 04 mov %edx,0x4(%ebp)
>>> 7d: 83 c5 08 add $0x8,%ebp
>>> 80: 83 ef 01 sub $0x1,%edi
>>> 83: 75 d3 jne 58 <decodeplane8+0x58>
>>>
>>> Benchmark results:
>>> basty at cdgs-basty:~/src/ffmpeg/build$ ./ffplay ../patches/MRLake.iff
>>> FFplay version git-5b9f10d, Copyright (c) 2003-2010 the FFmpeg developers
>>> built on Apr 29 2010 15:19:11 with gcc 4.2.4 (Ubuntu 4.2.4-1ubuntu4)
>>> configuration:
>>> libavutil 50.14. 0 / 50.14. 0
>>> libavcodec 52.66. 0 / 52.66. 0
>>> libavformat 52.61. 0 / 52.61. 0
>>> libavdevice 52. 2. 0 / 52. 2. 0
>>> libswscale 0.10. 0 / 0.10. 0
>>> [IFF @ 0x8b33790]Estimating duration from bitrate, this may be inaccurate
>>> Input #0, IFF, from '../patches/MRLake.iff':
>>> Duration: N/A, bitrate: N/A
>>> Stream #0.0: Video: iff_byterun1, pal8, 737x595, PAR 17:20 DAR
>>> 737:700, 90k tbr, 90k tbn, 90k tbc
>>> 40560 dezicycles in decodeplane8, 1 runs, 0 skips
>>> 35145 dezicycles in decodeplane8, 2 runs, 0 skips
>>> 23632 dezicycles in decodeplane8, 4 runs, 0 skips
>>> 17127 dezicycles in decodeplane8, 8 runs, 0 skips
>>> 11680 dezicycles in decodeplane8, 16 runs, 0 skips
>>> 8965 dezicycles in decodeplane8, 32 runs, 0 skips
>>> 7628 dezicycles in decodeplane8, 64 runs, 0 skips
>>> 6939 dezicycles in decodeplane8, 128 runs, 0 skips
>>> 6565 dezicycles in decodeplane8, 256 runs, 0 skips
>>> 6385 dezicycles in decodeplane8, 512 runs, 0 skips
>>> 6290 dezicycles in decodeplane8, 1024 runs, 0 skips
>>> 6246 dezicycles in decodeplane8, 2048 runs, 0 skips
>>> 6224 dezicycles in decodeplane8, 4096 runs, 0 skips
>>> 1.94 A-V: 0.000 s:0.0 aq= 0KB vq= 0KB sq= 0B f=0/0 0/0
>>>
>>> Way faster, but when I look at the code I wonder why...probably because
>>> it handles pipeline stalling much better...
>>>
>>>
>> I have attached the uint64_t patch. Please review it!
>>
>> The 8-bit table declaration looks a little bit long...
>>
>
> Sorry, the patch from yesterday had a wrong dp8 8-bit table which caused
> graphics glitches. The new patch attached here fixes that.
>
> BTW, I got confirmation that this patch now also works on big-endian!
Would you like an account on a PPC machine?
> Little endian was tested by me, so it works now for both... if someone
> could help me shorten the #define stuff here for the 8-bit table,
> I'd be glad.
I will, when you send a patch that applies to current svn.
--
Måns Rullgård
mans at mansr.com
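
For readers following the thread, the 8-bit lookup-table idea in the quoted decodeplane8() can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the patch: init_plane8_lut() and decodeplane8_sketch() are hypothetical names, the table is built at run time instead of through #define macros, and memcpy() stands in for the AV_RN64A/AV_WN64A accessors. Because every table entry is assembled byte by byte, the same code should behave identically on little- and big-endian hosts.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical table: plane8_lut[plane][byte] holds 8 output pixels,
 * one per bit of the source byte (most significant bit first), each
 * with only bit number `plane` possibly set. */
static uint64_t plane8_lut[8][256];

/* Fill the table at run time (assumption: the patch used static data). */
static void init_plane8_lut(void)
{
    for (unsigned plane = 0; plane < 8; plane++) {
        for (unsigned byte = 0; byte < 256; byte++) {
            uint8_t px[8];
            for (unsigned bit = 0; bit < 8; bit++)
                px[bit] = (uint8_t)(((byte >> (7 - bit)) & 1u) << plane);
            /* Copying bytes in memory order keeps the entry endian-neutral. */
            memcpy(&plane8_lut[plane][byte], px, sizeof(px));
        }
    }
}

/* OR one bitplane into an 8bpp chunky buffer: each source byte
 * contributes to 8 destination pixels with a single 64-bit OR. */
static void decodeplane8_sketch(uint8_t *dst, const uint8_t *buf,
                                unsigned buf_size, unsigned plane)
{
    const uint64_t *lut = plane8_lut[plane];
    for (; buf_size != 0; buf_size--, dst += 8) {
        uint64_t v;
        memcpy(&v, dst, sizeof(v));   /* alignment-safe 64-bit load  */
        v |= lut[*buf++];
        memcpy(dst, &v, sizeof(v));   /* alignment-safe 64-bit store */
    }
}
```

Calling init_plane8_lut() once and then decodeplane8_sketch() once per plane on a zeroed row accumulates the pixel values. Note that this sketch processes all buf_size bytes, which is a deliberate choice here and differs from the pre-decrement loop condition in the quoted code.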