[FFmpeg-devel] [PATCH] PPC: 32-bit asm for MAC64 and MLS64
Måns Rullgård
mans
Mon May 11 05:20:27 CEST 2009
Luca Barbato <lu_zero at gentoo.org> writes:
> Mans Rullgard wrote:
>> GCC makes a mess of these operations, so give it a hand.
>> 55% faster MP3 decoding on G4.
>
> O_o? Do you have a reduced testcase to send to gcc people? It smells
> like bug IMHO...
The part gcc struggles with is the loop at the end of
ff_mpa_synth_filter(). Here's what the start of that loop looks like
after my patch:
2634: 80 d3 00 44 lwz r6,68(r19)
2638: 81 77 00 00 lwz r11,0(r23)
263c: 7d 0b 31 d6 mullw r8,r11,r6
2640: 7d 4b 30 96 mulhw r10,r11,r6
2644: 7d 29 40 14 addc r9,r9,r8
2648: 7c 00 51 14 adde r0,r0,r10
264c: 90 01 00 08 stw r0,8(r1)
2650: 91 21 00 0c stw r9,12(r1)
2654: 80 b3 01 44 lwz r5,324(r19)
2658: 80 93 02 44 lwz r4,580(r19)
265c: 80 73 03 44 lwz r3,836(r19)
2660: 83 93 04 44 lwz r28,1092(r19)
2664: 83 53 05 44 lwz r26,1348(r19)
2668: 83 34 01 bc lwz r25,444(r20)
266c: 81 77 01 00 lwz r11,256(r23)
2670: 83 13 06 44 lwz r24,1604(r19)
2674: 7d 0b 29 d6 mullw r8,r11,r5
2678: 7d 4b 28 96 mulhw r10,r11,r5
267c: 7d 29 40 14 addc r9,r9,r8
2680: 7c 00 51 14 adde r0,r0,r10
2684: 80 f6 00 00 lwz r7,0(r22)
2688: 81 77 02 00 lwz r11,512(r23)
268c: 7d 0b 21 d6 mullw r8,r11,r4
2690: 7d 4b 20 96 mulhw r10,r11,r4
2694: 7d 29 40 14 addc r9,r9,r8
2698: 7c 00 51 14 adde r0,r0,r10
269c: 81 77 03 00 lwz r11,768(r23)
26a0: 7d 0b 19 d6 mullw r8,r11,r3
26a4: 7d 4b 18 96 mulhw r10,r11,r3
26a8: 7d 29 40 14 addc r9,r9,r8
26ac: 7c 00 51 14 adde r0,r0,r10
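
For readers who don't read PPC assembly, each mullw/mulhw/addc/adde group in the listing above amounts to one 64-bit multiply-accumulate done in 32-bit halves. The C below is only a sketch of that arithmetic, not the patch itself: the acc64 type and the function names are made up for illustration, and the real change uses inline asm.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type: a 64-bit accumulator kept as two 32-bit words,
 * the way the listing keeps it in r0 (high) and r9 (low). */
typedef struct { uint32_t hi, lo; } acc64;

/* One multiply-accumulate step, written to mirror the four instructions. */
static acc64 mac64_32bit(acc64 acc, int32_t a, int32_t b)
{
    int64_t  p     = (int64_t)a * b;                /* signed 32x32 -> 64 product */
    uint32_t plo   = (uint32_t)p;                   /* mullw: low word of product */
    uint32_t phi   = (uint32_t)((uint64_t)p >> 32); /* mulhw: high word of product */
    uint32_t lo    = acc.lo + plo;                  /* addc: add low words */
    uint32_t carry = lo < plo;                      /* carry out of the low add */

    acc.hi = acc.hi + phi + carry;                  /* adde: add high words + carry */
    acc.lo = lo;
    return acc;
}

/* Plain C reference: what MAC64(d, a, b) is meant to compute. */
static int64_t mac64_ref(int64_t d, int32_t a, int32_t b)
{
    return d + (int64_t)a * b;
}

/* Small self-check comparing the two on a few values. */
static void mac64_selfcheck(void)
{
    static const int32_t v[] = { 0, 1, -1, 12345, -54321, INT32_MAX, INT32_MIN };
    const int n = (int)(sizeof(v) / sizeof(v[0]));

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            acc64   start = { 0, 1000u };           /* accumulator starts at 1000 */
            acc64   got   = mac64_32bit(start, v[i], v[j]);
            int64_t want  = mac64_ref(1000, v[i], v[j]);

            assert((((uint64_t)got.hi << 32) | got.lo) == (uint64_t)want);
        }
    }
}
```

That is two multiply instructions per product, which is where the figure of 64 for the whole loop comes from.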
Here's what an unsupervised gcc does:
2660: 83 61 00 1c lwz r27,28(r1)
2664: 83 41 00 1c lwz r26,28(r1)
2668: 82 c1 00 1c lwz r22,28(r1)
266c: 83 7b 02 44 lwz r27,580(r27)
2670: 83 5a 01 44 lwz r26,324(r26)
2674: 83 81 00 1c lwz r28,28(r1)
2678: 93 61 01 58 stw r27,344(r1)
267c: 93 41 01 54 stw r26,340(r1)
2680: 83 a1 00 1c lwz r29,28(r1)
2684: 81 21 01 58 lwz r9,344(r1)
2688: 80 a1 00 1c lwz r5,28(r1)
268c: 81 41 00 1c lwz r10,28(r1)
2690: 82 d6 00 44 lwz r22,68(r22)
2694: 7d 29 fe 70 srawi r9,r9,31
2698: 81 0e 00 00 lwz r8,0(r14)
269c: 83 9c 03 44 lwz r28,836(r28)
26a0: 83 e1 01 54 lwz r31,340(r1)
26a4: 7e de fe 70 srawi r30,r22,31
26a8: 80 0e 01 00 lwz r0,256(r14)
26ac: 7d 1b fe 70 srawi r27,r8,31
26b0: 7e fe 41 d6 mullw r23,r30,r8
26b4: 83 bd 04 44 lwz r29,1092(r29)
26b8: 80 a5 05 44 lwz r5,1348(r5)
26bc: 7f ff fe 70 srawi r31,r31,31
26c0: 7f 7b b1 d6 mullw r27,r27,r22
26c4: 91 21 00 64 stw r9,100(r1)
26c8: 7c 03 fe 70 srawi r3,r0,31
26cc: 81 4a 06 44 lwz r10,1604(r10)
26d0: 7d 36 40 16 mulhwu r9,r22,r8
26d4: 80 e1 01 54 lwz r7,340(r1)
26d8: 93 81 01 5c stw r28,348(r1)
26dc: 7f 9c fe 70 srawi r28,r28,31
26e0: 7e bf 01 d6 mullw r21,r31,r0
26e4: 80 ce 02 00 lwz r6,512(r14)
26e8: 7e f7 da 14 add r23,r23,r27
26ec: 80 8e 03 00 lwz r4,768(r14)
26f0: 7c 63 39 d6 mullw r3,r3,r7
26f4: 7d 37 4a 14 add r9,r23,r9
26f8: 83 4e 04 00 lwz r26,1024(r14)
26fc: 83 0e 05 00 lwz r24,1280(r14)
2700: 7d 67 00 16 mulhwu r11,r7,r0
2704: 7c d9 fe 70 srawi r25,r6,31
2708: 82 8e 06 00 lwz r20,1536(r14)
270c: 92 c1 01 50 stw r22,336(r1)
2710: 7e b5 1a 14 add r21,r21,r3
2714: 7d 87 01 d6 mullw r12,r7,r0
2718: 93 a1 01 60 stw r29,352(r1)
271c: 7c 80 fe 70 srawi r0,r4,31
2720: 7f 11 fe 70 srawi r17,r24,31
2724: 93 c1 00 58 stw r30,88(r1)
Lots of loads and stores there, and the whole loop contains a total of
128 multiplies even though only 64 are needed.
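
The doubled multiply count seems to come from gcc sign-extending both 32-bit operands to 64 bits (the srawi instructions) and then doing a generic 64x64 multiply: two mullw for the cross terms, one mulhwu and one mullw for the low words. The C below sketches that sequence as I read it from the listing; the function names are invented for illustration, and this is not gcc's actual expansion code.

```c
#include <assert.h>
#include <stdint.h>

/* Low 64 bits of the product of two sign-extended 32-bit values,
 * built from 32-bit pieces with four multiplies. */
static uint64_t mul64_generic(int32_t a, int32_t b)
{
    uint32_t alo = (uint32_t)a;
    uint32_t blo = (uint32_t)b;
    uint32_t ahi = a < 0 ? 0xFFFFFFFFu : 0u;   /* srawi a,31: sign word */
    uint32_t bhi = b < 0 ? 0xFFFFFFFFu : 0u;   /* srawi b,31: sign word */

    uint32_t lo = alo * blo;                                /* mullw  (low x low, low word)  */
    uint32_t hi = (uint32_t)(((uint64_t)alo * blo) >> 32);  /* mulhwu (low x low, high word) */
    hi += ahi * blo;                                        /* mullw  (cross term)           */
    hi += alo * bhi;                                        /* mullw  (cross term)           */

    return ((uint64_t)hi << 32) | lo;
}

/* The same product from a single signed 32x32 -> 64 multiply,
 * which needs only mullw + mulhw. */
static uint64_t mul64_direct(int32_t a, int32_t b)
{
    return (uint64_t)((int64_t)a * b);
}

/* Small self-check: both routes give the same 64-bit result. */
static void mul64_selfcheck(void)
{
    static const int32_t v[] = { 0, 1, -1, 12345, -54321, INT32_MAX, INT32_MIN };
    const int n = (int)(sizeof(v) / sizeof(v[0]));

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            assert(mul64_generic(v[i], v[j]) == mul64_direct(v[i], v[j]));
}
```

Both routes give the same result, so the two cross-term multiplies and the sign-word computations are pure overhead when the operands are known to be 32-bit.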
> PS: the patch looks fine for me
Applied.
--
Måns Rullgård
mans at mansr.com