[FFmpeg-devel] [PATCH] use AV_RB16 in cabac refill

Sat Mar 27 02:02:11 CET 2010

Alexander Strange <astrange at ithinksw.com> writes:

> 2010/3/25 M?ns Rullg?rd <mans at mansr.com>:
>> Alexander Strange <astrange at ithinksw.com> writes:
>>
>>> On Mar 25, 2010, at 4:08 AM, David Conrad wrote:
>>>
>>>> On Mar 25, 2010, at 3:30 AM, Alexander Strange wrote:
>>>>
>>>>> Measured 1 cycle faster decode_cabac_residual on x86-64. Didn't try anywhere else, but I'd be a little interested in what arm does.
>>>>
>>>> It ought to be 2 instruction less and faster. However, both llvm
>>>> and gcc decide to zero extend from 16 bits twice, and
>>>> (llvm-)gcc-4.2 decides to load bytestream twice.
>>>
>>> Hmm, zero-extending in bswap_16 isn't really surprising, since asm
>>> operands are always extended to int.
>>
>> That depends on how the asm is written.
>>
>>> The only solution there is to write AV_RB16 in asm too.
>>>
>>> --disable-asm is remarkably bad, I think it should be using
>>> (p[0] << 8 | p[1]) instead of __attribute__((packed)) and bswap_16
>>> when FAST_UNALIGNED isn't defined.
>>
>> I don't quite understand that.
>
> If I configure for arm with --disable-asm (using iPhone gcc), this:

What --cpu did you use?

> #include "libavutil/intreadwrite.h"
>
> int test(uint16_t *p);
>
> int test(uint16_t *p)
> {
>     return AV_RB16(p);
> }
>
> turns into this:
>
> int test(uint16_t *p)
> {
>     return bswap_16((((const union unaligned_16 *) (p))->l));
> }
>
> which compiles to:
>
> _test:
>         @ args = 0, pretend = 0, frame = 0
>         @ frame_needed = 0, uses_anonymous_args = 0
>         @ link register save eliminated.
>         ldrb    r3, [r0, #0]    @ zero_extendqisi2
>         ldrb    r0, [r0, #1]    @ zero_extendqisi2
>         @ lr needed for prologue
>         orr     r3, r3, r0, asl #8
>         mov     r0, r3, lsr #8
>         orr     r0, r0, r3, asl #8
>         uxth    r0, r0
>         bx      lr

This is gcc being stupid in multiple ways.  A clever compiler would
turn that into an LDRH followed by a REV16.  Since it used UXTH, it
must have been compiling for ARMv6 or later, so unaligned loads would
be OK.  Even without unaligned loads, a decent compiler should be able
to unravel it all and eliminate the redundant shift/or operations.

> because it (apparently) always uses byte loads for packed structures.
> I don't know if anyone cares how well --disable-asm compiles, I was
> just curious.

Guess why I added the asm for those things.

-- 
M?ns Rullg?rd
mans at mansr.com