[Ffmpeg-devel-irc] ffmpeg-devel.log.20161109

Thu Nov 10 03:05:02 EET 2016

[00:17:10 CET] <kierank> Gramner: thanks, the redis crc thing works, going to implement the avx2 tomorrow
[00:17:18 CET] <kierank> funnily enough ends up being 10-bit bitpacked
[04:55:16 CET] <kierank> time for bed
[04:55:20 CET] <kierank> all hail the trump overlord
[16:43:59 CET] <J_Darnley> I'd like to pick your brains.  (Particularly you Gramner, kierank said you'd help).  Does the mask controlling loads in vpgather(dd) affect the speed?  Will loading data I don't really need make it slower?
[16:45:00 CET] <Gramner> yes
[16:47:15 CET] <J_Darnley> thanks
[17:50:23 CET] <cone-299> ffmpeg Steven Liu master:ab6ffc2a0800: MAINTAINERS: Add myself to flvenc
[20:52:10 CET] <J_Darnley> Gramner or anyone else, do you have any tips for doing horizontal ORs and XORs of dwords or qwords?  (paste of code coming shortly)
[20:53:51 CET] <Gramner> move the upper half of the register to the lower half of a temporary register. perform desired operation. repeat until done.
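The halving reduction Gramner describes can be sketched in plain C. This is only a model under stated assumptions: the register is represented as an array of 8 dwords rather than a real ymm register, and the name `hor_or_dwords` is hypothetical, not taken from the pasted code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 8  /* assumption: one 256-bit register modelled as 8 dwords */

/* Horizontal OR of all lanes by repeated halving: copy the upper half
 * into a temporary, OR it into the lower half, then halve the width. */
uint32_t hor_or_dwords(const uint32_t v[LANES])
{
    uint32_t acc[LANES];
    uint32_t tmp[LANES];

    assert(v != NULL);
    memcpy(acc, v, sizeof(acc));

    for (int width = LANES / 2; width >= 1; width /= 2) {
        /* "move the upper half of the register to the lower half of a
         * temporary register" */
        memcpy(tmp, acc + width, (size_t)width * sizeof(acc[0]));
        /* "perform desired operation": OR here, XOR works the same way */
        for (int i = 0; i < width; i++)
            acc[i] |= tmp[i];
    }
    return acc[0];  /* after log2(LANES) steps the result sits in lane 0 */
}
```

With 8 lanes this takes three move-and-OR steps, which is why the cost grows as the vectors get wider.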
[20:55:13 CET] <Gramner> x86 SIMD simply isn't designed to do horizontal operations fast
[20:55:31 CET] <J_Darnley> Yeah, I'm seeing that.
[20:56:24 CET] <Gramner> it's also getting worse as vector sizes increase. Intel are aware of this.
[20:56:31 CET] <J_Darnley> Would you expect it to be faster to load once and shift/shuffle than to load several times?
[20:56:53 CET] <J_Darnley> Yeah, the split lanes can be a pain if you want to cross them.
[20:57:04 CET] <Gramner> afaik the original Xeon Phis had horizontal sum instructions but they were removed when it turned into AVX-512 for some reason
[20:57:46 CET] <Gramner> loads are generally pretty fast
[20:58:03 CET] <Gramner> modern intel cpus can do 2 loads per cycle but only one shuffle per cycle
[20:58:05 CET] <J_Darnley> So I better write both and bench it.
[20:58:19 CET] <Gramner> also loads can often be OOE:d better
[20:59:37 CET] <J_Darnley> Thanks again.
[21:05:52 CET] <Gramner> there was actually a horizontal OR reduce instruction in KNC. https://software.intel.com/en-us/node/523756
[21:06:26 CET] <Gramner> I wouldn't be surprised to see those returned in some AVX-512 extension at some point
[21:17:27 CET] <cone-299> ffmpeg Andreas Cadhalpun master:467eece1bea5: icodec: fix leaking pkt on error
[21:17:29 CET] <cone-299> ffmpeg Andreas Cadhalpun master:d54c95a1435a: icodec: add ico_read_close to fix leaking ico->images
[21:17:31 CET] <cone-299> ffmpeg Andreas Cadhalpun master:226d35c84591: escape124: reject codebook size 0
[22:32:58 CET] <J_Darnley> Gramner, would you mind casting your eye over this ugly thing http://pastebin.com/2fxEQCQS
[22:33:21 CET] <J_Darnley> It works but I'm looking for ways to speed it up.
[22:33:53 CET] <J_Darnley> I plan to try replacing some moves with shuffles
[22:34:12 CET] <J_Darnley> And perhaps interleave some distant instructions.
[22:34:40 CET] <J_Darnley> By the way, how large is a typical OOE buffer?
[22:34:43 CET] <Gramner> pmozx is a shuffle
[22:34:50 CET] <Gramner> pmovzx*
[22:37:26 CET] <J_Darnley> unpack is a shuffle too, right?
[22:37:35 CET] <Gramner> yes
[22:38:18 CET] <Gramner> http://agner.org/optimize/instruction_tables.pdf look for p5, that's the "shuffle unit"
[22:38:33 CET] <J_Darnley> ah, thank you
[22:44:04 CET] <jamrial_> J_Darnley: what is this for?
[22:44:10 CET] <Gramner> might be more efficient to combine the two first gathers and shuffle the data afterwards instead of doing the unaligned pmovzx in the beginning
[22:44:45 CET] <J_Darnley> An attempt at speeding up a problematic CRC for kierank
[22:45:00 CET] <J_Darnley> we already managed to get the C very fast
[22:45:39 CET] <J_Darnley> but this looks like a dead end from a performance viewpoint.
[22:46:43 CET] <Gramner> you might be bottlenecked by memory loads which means that reducing computations with SIMD won't help
[22:50:44 CET] <rcombs> why not use the CRC instruction?
[22:50:54 CET] <kierank> because that's a specific crc
[22:51:13 CET] <rcombs> ah, need a different poly?
[22:51:15 CET] <kierank> and it's not a 10-bit crc
[22:51:16 CET] <kierank> yes
[22:51:38 CET] <rcombs> well then
[22:51:42 CET] <rcombs> carry on
[23:15:03 CET] <kierank> J_Darnley: i think you can do a load and then do a variable shift with pmulld
[23:15:14 CET] <kierank> which will shift only the necessary buffers
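The multiply trick kierank suggests rests on the identity x << n == x * 2^n (mod 2^32), applied per lane. A minimal sketch in plain C, assuming 4 dword lanes as in an xmm register; the function name `shift_left_by_mul` is hypothetical and the loop stands in for a single pmulld against a vector of powers of two.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4  /* assumption: one 128-bit register modelled as 4 dwords */

/* Per-lane left shift done as a multiply by a power of two.
 * pmulld keeps the low 32 bits of each product, which matches
 * unsigned 32-bit multiplication in C. */
void shift_left_by_mul(uint32_t dst[LANES], const uint32_t src[LANES],
                       const uint32_t shift[LANES])
{
    for (int i = 0; i < LANES; i++) {
        assert(shift[i] < 32);
        uint32_t factor = (uint32_t)1 << shift[i];  /* 2^n for this lane */
        dst[i] = src[i] * factor;                   /* same as src[i] << n */
    }
}
```

A lane whose factor is 1 passes through unchanged, so only the lanes that need shifting are affected. A multiply can only express left shifts.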
[23:15:39 CET] <J_Darnley> ah, I always forget about trying a multiply
[23:16:34 CET] <Gramner> avx2 has variable shifts already though, but only for dwords
[23:16:48 CET] <Gramner> and words I guess
[23:16:54 CET] <Gramner> qwords*
[23:17:56 CET] <J_Darnley> If I had to shift more than 2 samples I'm sure it could be helpful.
[23:18:15 CET] <J_Darnley> (either a multiply or variable shift)
[23:18:43 CET] <kierank> not sure if mova, punpk is better than pmovzxwd 
[23:19:14 CET] <J_Darnley> Was a bitwise dqword shift added at some point?
[23:19:31 CET] <Gramner> J_Darnley: nope
[23:20:43 CET] <Gramner> kierank: punpck and pmovzx are both shuffles. they are equally fast but the latter is shorter and doesn't require a zero register
[23:21:02 CET] <kierank> interesting
[23:21:42 CET] <kierank> J_Darnley: is there a reason you use xm3 and xm4
[23:22:05 CET] <Gramner> I'm actually a bit surprised that they didn't make the zero/sign extension part of the load unit like it is for scalar ones
[23:22:15 CET] <Gramner> that would make it essentially free
[23:22:44 CET] <J_Darnley> kierank: mostly because I don't need the upper 16bytes everywhere
[23:23:12 CET] <kierank> xm3 and xm4 are the same no?
[23:23:38 CET] <kierank> or i misunderstand perhaps
[23:23:42 CET] <J_Darnley> Huh?
[23:23:53 CET] <Gramner> gather overwrites the mask
[23:23:59 CET] <kierank> ah
[23:24:01 CET] <J_Darnley> Oh, the gather obliterates the mask
[23:24:38 CET] <Gramner> because it can encounter faults during the instruction and have to resume. without overwriting the mask it would then need to re-fetch everything.
[23:24:47 CET] Action: kierank tries to optimise c
[23:25:00 CET] <J_Darnley> Ah, so there is a reason to it.
[23:51:50 CET] <kierank> J_Darnley: I wonder if the dw nature of things makes the c faster
[00:00:00 CET] --- Thu Nov 10 2016