[Ffmpeg-devel-irc] ffmpeg-devel.log.20161003

Tue Oct 4 03:05:03 EEST 2016

[00:17:03 CEST] <Chloe> jamrial: thanks for the review
[01:17:33 CEST] <cone-979> ffmpeg Josh de Kock master:441d15b7c0a0: doc/t2h: use container
[01:45:00 CEST] <cone-979> ffmpeg Michael Niedermayer master:cced8394b6e0: fate: Add PSP copy test
[02:46:04 CEST] <Chloe> BBB: the musl thing is an issue on our side though, isn't it?
[02:46:45 CEST] <BBB> depends
[02:46:46 CEST] <Chloe> no need to report it upstream, afaik we should be clearing the x87 state before calling an external function
[02:47:02 CEST] <BBB> I disagree
[02:47:23 CEST] <BBB> you're free to have that opinion btw, I'm by no means the teller of absolute truth
[02:47:53 CEST] <BBB> I'd rather expect these kinds of functions to not use x87 functions
[02:49:17 CEST] <Chloe> I just thought it was part of the cdecl calling convention
[02:52:09 CEST] <Chloe> 'If the function does not return a floating-point value, then this register must be empty. This register must be empty before entry to a function.' (in regards to the floating point registers)
[02:52:26 CEST] <Chloe> http://sco.com/developers/devspecs/abi386-4.pdf page 37-38
[02:53:06 CEST] <Chloe> we're in violation of this atm
[02:55:23 CEST] <wm4> yeah, I agree it's ffmpeg's fault, even if musl is doing ridiculous things
[02:55:43 CEST] <wm4> BBB: I think it's a log2, not a hash function
[03:01:26 CEST] <BBB> oh is that the old-fashioned log2 without lut?
[03:02:55 CEST] <wm4> whoops posted the same thing as Chloe 
[03:03:02 CEST] <wm4> BBB: looks like it
[03:03:14 CEST] <Chloe> wm4: it seems we do this a lot
[03:04:22 CEST] <BBB> maybe we should send them a speed-up
[03:04:28 CEST] <BBB> (make it lut-based)
[03:11:30 CEST] <Chloe> they have a lut log2 just above the float stuff, it's just macro'd out
[03:12:22 CEST] <Chloe> we're all talking about bin_index/bin_index_up right?
[03:12:33 CEST] <Chloe> http://git.musl-libc.org/cgit/musl/tree/src/malloc/malloc.c#n114
[03:13:51 CEST] <BBB> yes
[03:14:01 CEST] <BBB> carl eugen suggested that that is likely the offending code
[03:14:23 CEST] <BBB> it seems to be the only explicit float variable in there
[03:14:31 CEST] <BBB> (or double)
[03:15:46 CEST] <Chloe> I'm confused, why are you talking about a log2?
[03:26:04 CEST] <wm4> we're thinking the float trickery is a log2
[03:37:44 CEST] <BBB> a very limited one at best...
[03:38:38 CEST] <BBB> I don't see the macroed out log2 btw
[03:39:12 CEST] <Chloe> just above
[03:39:27 CEST] <Chloe> BBB: http://git.musl-libc.org/cgit/musl/tree/src/malloc/malloc.c#n91
[03:39:38 CEST] <BBB> that's not a log2 :-p
[03:39:43 CEST] <Chloe> ;_;
[03:39:46 CEST] <Chloe> what is it
[03:39:52 CEST] <BBB> a debruijn
[03:40:21 CEST] <Chloe> so I saw a debruijn in the answer to a question asking how to log2
[03:40:25 CEST] <BBB> an algo for counting trailing zeroes
[03:40:40 CEST] <BBB> but it's not used in the function itself
[03:41:07 CEST] <Chloe> ok maybe I didn't
[03:41:12 CEST] <BBB> ok I see
[03:41:13 CEST] <Chloe> idk how I got them confused
[03:41:38 CEST] <BBB> first_set is a log2, yes
[03:41:50 CEST] <BBB> I'm still not sure what bin_index does
[03:42:46 CEST] <Chloe> so ctz is used in log2
[03:42:47 CEST] <Chloe> god I need a good algorithms book
[03:42:55 CEST] <Chloe> I also need to learn wtf a log2 is
[03:45:34 CEST] <Chloe> oh right. It just solves 2^n = x for n 
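
A minimal sketch of the two bit tricks being compared here: counting trailing zeroes with a de Bruijn multiplication, and an integer log2 that finds the largest n with 2^n <= x. This is not musl's code. The names `ctz32_debruijn` and `floor_log2_u32` are made up for illustration, and 0x077CB531 is a commonly used 32-bit de Bruijn constant, which may differ from the one in musl. The lookup table is built on first use rather than hard-coded.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper names; this is an illustration, not musl's code. */

/* 0x077CB531 is a 32-bit de Bruijn sequence: every 5-bit window of it is
 * distinct, so ((1 << i) * constant) >> 27 gives a different 5-bit value
 * for each i in 0..31. */
#define DEBRUIJN32 0x077CB531u

static unsigned char debruijn_table[32];
static int debruijn_ready;

/* Fill the table so that the 5-bit hash of (1 << i) maps back to i. */
static void debruijn_init(void)
{
    for (unsigned i = 0; i < 32; i++)
        debruijn_table[(uint32_t)((UINT32_C(1) << i) * DEBRUIJN32) >> 27] =
            (unsigned char)i;
    debruijn_ready = 1;
}

/* Count trailing zero bits of a nonzero 32-bit value. */
static int ctz32_debruijn(uint32_t x)
{
    assert(x != 0);
    if (!debruijn_ready)
        debruijn_init();
    /* x & -x isolates the lowest set bit, i.e. 1 << ctz(x). */
    uint32_t lowest = x & (0u - x);
    return debruijn_table[(uint32_t)(lowest * DEBRUIJN32) >> 27];
}

/* Largest n such that 2^n <= x, for nonzero x. */
static int floor_log2_u32(uint32_t x)
{
    assert(x != 0);
    int n = 0;
    while (x >>= 1)
        n++;
    return n;
}
```

For a power of two the two functions agree, which is why a trailing-zero count can show up in an answer about log2: `ctz32_debruijn(1024)` and `floor_log2_u32(1024)` both give 10, while for 1025 the trailing-zero count is 0 and the log2 is still 10.
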
[04:20:49 CEST] <Compn> i think we know the musl dev
[04:21:05 CEST] <Compn> it's dalias :P
[04:21:27 CEST] <Compn> if you wanted to bug him direct haha
[05:04:09 CEST] <Compn> oh there's a #musl too
[06:51:47 CEST] <cone-238> ffmpeg James Almer master:eb60256c2083: fate: add bitexact decode flag to fate-svq3-watermark
[13:52:12 CEST] <wm4> where do we still have mmx code that matters?
[13:52:25 CEST] <wm4> I mean, I'd expect _at_least_ half of them being meaningless
[13:52:52 CEST] <nevcairiel> some algorithms using small datasets that probably don't benefit much from bigger regs
[13:52:56 CEST] <nevcairiel> and old code
[14:02:51 CEST] <kierank> some prediction
[14:03:42 CEST] <nevcairiel> you could probably re-build all the mmx code in sse2 only using half the register size, and wouldn't notice any performance changes
[14:04:33 CEST] <kierank> iirc x264's checkasm disagrees
[14:04:49 CEST] <nevcairiel> may depend on the cpu generation, of course
[14:05:14 CEST] <nevcairiel> but iirc i heard some people say that mmx execution is getting slightly slower on newer cpus, presumably because they are moving those to be executed with the sse2 units anyway
[14:05:46 CEST] <nevcairiel> and save hardware
[14:18:01 CEST] <cone-426> ffmpeg Matthieu Bouron master:68822da8ff7d: lavc/mediacodecdec_h2645: fix nalu data_size type
[14:20:39 CEST] <BBB> a lot of modern SIMD code is still MMX, wm4
[14:21:35 CEST] <BBB> wm4: on medium-end CPUs, xmm instructions are one cycle slower than mmx instructions (and handle double the data, indeed), but that means if you don't need double data (small blocksizes), then mmx is actually faster than xmm on these CPUs
[14:34:48 CEST] <Chloe> BBB: wm4: 'It's roughly log2(x), but with 4 linearly spaced bins for each logarithmic step'
[14:35:29 CEST] <BBB> ok
[14:42:07 CEST] <wm4> Chloe: fascinating
[14:43:12 CEST] <Chloe> I assume they wouldn't mind a patch which speeds it up, and just happens to fix our issue as well, assuming the patch still follows the standard/complies.
[14:43:26 CEST] <Chloe> (as BBB suggested earlier)
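
A minimal sketch of what "roughly log2(x), but with 4 linearly spaced bins for each logarithmic step" can look like when done with integer operations only, so that no float or x87 state is touched. This is an illustration under assumptions, not musl's `bin_index` and not a proposed patch: the function name `quarter_log2_bin` is hypothetical, and the offsets, clamping and size scaling that a real allocator would apply are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical function; an integer-only illustration of "log2 with 4
 * linearly spaced bins per power of two". Not musl's bin_index(). */
static int quarter_log2_bin(uint32_t x)
{
    /* Below 4 there are fewer than two bits under the leading one. */
    assert(x >= 4);

    /* e = floor(log2(x)), found by shifting until the value is exhausted. */
    int e = 0;
    for (uint32_t t = x; t >>= 1; )
        e++;

    /* The two bits just below the leading one pick one of 4 linear sub-bins. */
    unsigned sub = (x >> (e - 2)) & 3u;

    return 4 * e + (int)sub;
}
```

With this sketch, 4, 5, 6 and 7 land in bins 8, 9, 10 and 11, the range 8..15 is split into bins 12..15 two values at a time, and 16 starts the next logarithmic step at bin 16.
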
[15:00:37 CEST] <Gramner> punpckh* instructions in particular are very common and behave differently on mmx and xmm registers, also memory operands require 16-byte alignment with sse2 and no alignment with mmx (although 8-byte alignment is beneficial) which is an issue on 8x8 blocks for example
[15:01:09 CEST] <Gramner> so simply straight up porting mmx code to use lower half of xmm registers isn't possible
[15:02:58 CEST] <wm4> thanks for the mess, intel
[15:02:59 CEST] <Gramner> some instructions are faster with xmm registers than mmx registers on skylake, but this really only affects a small subset of existing code anyway
[15:21:31 CEST] <Gramner> BBB: mmx is often faster than xmm on 10+ year old cpus (e.g. conroe). when it comes to more modern chips though the situation isn't really the same. I know that shifts with variable amount (specified in vector reg) is slower with xmm than mmx but is there actually anything else?
[15:24:42 CEST] <Gramner> mmx does have the advantage of more compact instruction encoding which saves cache
[15:28:22 CEST] <BBB> Gramner: from my testing, on medium-end cpus, mmx is still faster (not by much, but by some) for small-block functions
[15:32:49 CEST] <BBB> whether that difference is important is an interesting question
[15:32:52 CEST] <BBB> anyway
[15:33:23 CEST] <BBB> the more pressing issue is that we have a ton of legacy mmx code and that's unlikely to be converted to sse2 anytime soon even if we wanted to convert it
[15:33:25 CEST] <BBB> it's like inline asm
[15:33:27 CEST] <Gramner> hmm, could be size-related. e.g. bottlenecked by instruction decoding
[15:33:28 CEST] <BBB> nice long-term quest
[15:33:41 CEST] <BBB> but short-term, we're stuck with the status quo and have to accept that as-such
[15:34:02 CEST] <Gramner> yes, any potential rewrite would take plenty of time
[15:34:12 CEST] <BBB> Gramner: maybe, yes. I never looked very deeply at it. I assumed it was b/c mmx is one cycle faster according to that instruction cycle counter thing you guys always use
[15:34:16 CEST] <BBB> but I didn't prove that
[15:34:52 CEST] <BBB> (mostly because I didn't care why (I just cared that) it was faster)
[16:05:09 CEST] <cone-426> ffmpeg Timo Rothenpieler master:a0d7ce140662: avutil/hwcontext_cuda: align allocated frames
[16:05:10 CEST] <cone-426> ffmpeg Timo Rothenpieler master:c4b78f966223: MAINTAINERS: add myself for hwcontext_cuda
[16:47:29 CEST] <cone-426> ffmpeg Adriano Pallavicino master:f4e692a0e90b: lavf/bink.c: fix warning due to misleading indentation
[18:29:56 CEST] <lehar> @michaelni
[20:01:15 CEST] <kierank> ==18388== 32 bytes in 1 blocks are definitely lost in loss record 14 of 41
[20:01:15 CEST] <kierank> ==18388==    at 0x5630899: posix_memalign (jemalloc.c:1062)
[20:01:15 CEST] <kierank> ==18388==    by 0x4DF62C: av_malloc (mem.c:95)
[20:01:15 CEST] <kierank> ==18388==    by 0x4DF7ED: av_mallocz (mem.c:252)
[20:01:15 CEST] <kierank> ==18388==    by 0x4D65A1: av_dict_set (dict.c:85)
[20:01:15 CEST] <kierank> ==18388==    by 0x4E2FB8: av_opt_set_dict2 (opt.c:1467)
[20:01:16 CEST] <kierank> ==18388==    by 0x456AFB: avcodec_open2 (utils.c:1413)
[20:01:20 CEST] <kierank> how do I deal with that?
[20:03:39 CEST] <BtbN> kierank, did you build jemalloc with --enable-valgrind?
[20:03:54 CEST] <kierank> does it need that?
[20:04:11 CEST] <BtbN> without that, valgrind gets terribly confused by what it's doing and reports nonsense/misses stuff.
[20:04:27 CEST] <BtbN> It makes it _a lot_ slower though
[20:04:49 CEST] <kierank> i have a ton of valgrind spam from ioctls anyway
[20:04:58 CEST] <kierank> it's just leaks that are of interest really
[21:14:18 CEST] <Chloe> I don't really get this guy "you have no experience or say at all because this is ffmpeg, but err, no offence"
[21:18:52 CEST] <philipl> Odd chap.
[21:21:49 CEST] <kierank> atomnuker: https://www2.iis.fraunhofer.de/AAC/multichannel.html
[21:25:24 CEST] <lehar_> quit
[22:12:25 CEST] <TD-Linux> if you're referring to the musl thing, fwiw I think adding emms is the "correct" solution (it was added to libtheora)
[22:16:04 CEST] <ubitux> yes it is
[22:16:11 CEST] <Gramner> the strictly "correct" solution is to add emms between every single use of mmx and calling any third party functions (memcpy, memset, memcmp et al.). the problem is that doing so will hurt performance because emms is very expensive on many cpus. plus it would be a nightmare to maintain.
[22:18:07 CEST] <BtbN> Why is the ffmpeg mmx code even doing memory allocations?
[22:18:14 CEST] <BtbN> Can't that just be moved outside of the loop?
[22:21:34 CEST] <ubitux> jemalloc seems to have doubles in its profiling code
[22:22:02 CEST] <ubitux> pretty sure we can find floats in others
[22:23:46 CEST] <Gramner> the libc function call doesn't have to be in an inner loop for it to be a potential issue. it's quite common to have some inner loop simd asm followed by some c code that calls memcpy or whatever. you need to identify every single case in the codebase where this occurs and add an emms
[22:24:16 CEST] <ubitux> we're just talking about the allocator currently
[22:27:41 CEST] <Gramner> we could guard only mallocs with emms and ignore everything else, sure. but that's a hack, not a "correct" solution (which doesn't necessarily mean it's a bad idea though)
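
A minimal sketch of the "guard only mallocs with emms" idea, assuming an x86 target and a GCC- or clang-style compiler. The names `clear_mmx_state` and `guarded_malloc` are hypothetical and are not FFmpeg API; on other targets the guard compiles to nothing. It only illustrates where the instruction would sit and says nothing about the performance cost raised above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical wrapper names; this is not FFmpeg's allocator code. */

/* Reset the x87/MMX register state so that code using x87 floating
 * point (for example inside a libc allocator) sees an empty stack. */
static inline void clear_mmx_state(void)
{
#if (defined(__i386__) || defined(__x86_64__)) && \
    (defined(__GNUC__) || defined(__clang__))
    /* emms marks all x87 registers as empty again after MMX use. */
    __asm__ volatile ("emms" ::: "memory");
#endif
}

/* Allocate memory, clearing any leftover MMX state first. */
static void *guarded_malloc(size_t size)
{
    clear_mmx_state();
    return malloc(size);
}
```

As noted in the discussion, this covers only the allocator path; any other external call made while MMX state is live would still need its own guard.
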
[22:28:37 CEST] <ubitux> https://github.com/jemalloc/jemalloc/blob/dev/src/prof.c#L836
[22:28:44 CEST] <ubitux> > workaround for versions of glibc that don't properly save/restore floating point registers
[22:28:47 CEST] <ubitux> heh
[22:29:29 CEST] <ubitux> and indeed i guess compilers that generate simd could be an issue 
[22:29:35 CEST] <ubitux> but in that case, it could happen... anyway
[22:29:39 CEST] <ubitux> anywhere*
[22:30:41 CEST] <TD-Linux> Gramner, well, it's closer to correct. I agree guarding every C library call is more correct
[22:30:43 CEST] <jkqxz> Any C library call can be hooked by the user, so you really do have to do everything.  memcpy() is totally fatal because the compiler can insert it anywhere and you can imagine someone doing some sort of instrumentation on it which happens to use floating point.
[22:34:52 CEST] <cone-172> ffmpeg Marton Balint master:2face3e7b568: lavc/utils: disallow zero sized packets with data set in avcodec_send_packet
[22:34:52 CEST] <cone-172> ffmpeg Marton Balint master:fbf8ac7d2a37: lavd/openal: don't return zero sized packet if no samples are available
[22:37:08 CEST] <BBB> memcpy is indeed an issue
[22:37:15 CEST] <BBB> according to the standard, we can't use memcpy
[22:37:25 CEST] <BBB> (without preceding the call to it with a call to emms)
[22:37:40 CEST] <nevcairiel> according to the standard, we can't call any function with a mmx fpu state
[22:37:46 CEST] <BBB> right
[22:37:47 CEST] <nevcairiel> C library or something else
[22:37:59 CEST] <nevcairiel> the cdecl calling convention forbids it
[22:38:18 CEST] <nevcairiel> of course if it's our own functions, we can do what we want
[22:38:24 CEST] <ubitux> maybe we should have our av_memcpy etc
[22:38:35 CEST] <ubitux> :)
[22:38:39 CEST] <Gramner> libavc
[22:38:51 CEST] <nevcairiel> the real problem I see is not fixing those cases, but actually finding relevant places
[22:38:59 CEST] <nevcairiel> often it can probably just be fixed by moving an existing emms
[22:39:02 CEST] <nevcairiel> or something alike
[22:40:43 CEST] <wm4> like jkqxz mentioned, some compilers replace plain C code with memcpys and you can't control that
[22:41:08 CEST] <nevcairiel> we can't control that, and we can't fix that
[22:41:22 CEST] <Gramner> simply "fixing" it is easy. modify x86inc.asm to add emms before returning from an mmx function and add emms to any inline mmx asm. doing it in a way that minimizes the performance impact on the other hand by only using emms when it's actually required is hard
[22:41:26 CEST] <TD-Linux> wm4, actual calls to libc memcpy or the compiler intrinsic memcpy?
[22:41:38 CEST] <jkqxz> TD-Linux:  Actual calls to libc memcpy.
[22:41:41 CEST] <nevcairiel> it's ludicrous to assume anything can convert into a compiler-inserted libc call and try to fix those
[22:42:03 CEST] <TD-Linux> jkqxz, ah. I seem to recall a clang bug where they screwed that up.
[22:42:17 CEST] <jkqxz> You can suppress it with compiler options, I think.
[22:44:09 CEST] <nevcairiel> can you assert for a  proper fpu state, somehow?
[22:44:11 CEST] <jkqxz> It is "fixable" without that, though.  You can force resolution of the possibly-generated symbols to somewhere inside lav* (either real memcpy(), etc. implementations or state reset + actual library call).
[22:46:44 CEST] <rcombs> does the required emms defeat the advantage of MMX's smaller instruction encoding?
[22:47:09 CEST] <nevcairiel> not really if its used properly, you would only fire it once after the whole processing is done
[22:51:25 CEST] <Gramner> emms is like 30+ µops or something crazy like that on many cpus, probably microcoded. probably slower than many entire mmx functions. so you'd want to avoid using it more than necessary
[22:53:16 CEST] <nevcairiel> is it only this slow when the  fpu is in mmx mode, or always?
[22:54:40 CEST] <Gramner> I don't know, but I would guess always. it's 31 µops on sandy bridge - broadwell according to agner and no indication of it being variable. they improved it to 10 on skylake
[22:55:36 CEST] <wm4> at this level wouldn't the function call overhead matter too?
[22:55:47 CEST] <nevcairiel> which function?
[22:55:49 CEST] <TD-Linux> for the memcpy() case it is pretty tempting to either pick some amount of reliance on implementation specific behavior, or implement jkqxz's solution
[22:56:26 CEST] <BtbN> there could be one "generic" implementation, that's safe. And one that issues emms whenever it might be required.
[22:56:27 CEST] <jkqxz> I think the compiler-injected code case isn't actually relevant.  If anything fails there then it is an error in the toolchain and must be fixed there - the compiler and libc do not have the same beliefs about the effect of memcpy() (in an extra-standard sense here, but the FPU rounding mode will have an equivalent problem in a fully standard situation).
[22:56:55 CEST] <jkqxz> So we only need to worry about explicit calls.
[22:56:59 CEST] <BtbN> And then the current one, for libcs where it's known to work.
[22:58:20 CEST] <jkqxz> But memcpy() is still technically a problem.  (Except not if we can guarantee that the compiler /does/ generate it, because that forces the appropriate contract onto the libc.)
[22:59:26 CEST] <wm4> nevcairiel: most asm functions are called via function pointer, right
[22:59:57 CEST] <nevcairiel> sure, anything that processes entire blocks etc
[23:00:16 CEST] <nevcairiel> we do have some legitimate inline asm that's inline for speed, not because someone couldn't write yasm
[23:47:06 CEST] <cone-172> ffmpeg Stephan Holljes master:d0be0cbebc20: lavf/aviobuf.c: Adapt avio_accept and avio_handshake to new AVIOContext API
[00:00:00 CEST] --- Tue Oct  4 2016