[Ffmpeg-devel-irc] ffmpeg-devel.log.20140207

Sat Feb 8 02:05:02 CET 2014

[00:00] <JEEB> orly, did you opt out?
[00:00] <BtbN> no idea
[00:00] <JEEB> or did they stop using it?
[00:00] <JEEB> I would be surprised tho
[00:00] <BtbN> The official way to use the CUDA sdk is via their new Visual Studio addon
[00:00] <JEEB> since most SDKs for windows have their own env var
[00:00] <BtbN> which handles all the path stuff
[00:00] <JEEB> yes, those usually use the env var as well
[00:01] <BtbN> it would also cause most distributions to not build it with nvenc support, because they won't setup the cuda and nvenc SDKs on their build environments
[00:02] <nevcairiel> the nvenc sdk i downloaded is a zip file, it doesnt come with anything by default :p
[00:02] <BtbN> it also depends on the CUDA sdk
[00:03] <BtbN> which isn't needed at all, because i dynload a minimal subset of it, just enough to use nvenc
[00:04] <nevcairiel> in any case, if the license doesnt allow including the header, then you can't include the header, and anyone that wants to build it with it needs to provide it somehow.
[00:04] <BtbN> which would make it quite pointless for me to contribute it at all
[00:05] <nevcairiel> its not exactly rocket science to get that one header if you care about that feature
[00:05] <BtbN> but you have to build it yourself, as no distribution will ever include it
[00:05] <BtbN> and that there is absolutely no default path for that header doesn't help
[00:06] <BtbN> there isn't even a default name
[00:08] <BtbN> i'm not sure if the header can't be included. The license header seems to allow that, it's just the (L)GPL which might have a problem with it, and i'm not sure what it says about headers
[00:13] <BtbN> Hm, nope, can't be included...
[00:14] <BtbN> So i can't implement it in any useful way
[00:18] <J_Darnley> Does fate.sh just exit successfully if no changes can be pulled?
[00:20] <nevcairiel> i believe it stores the version of the last run in a status file, if thats the same script i'm thinking about
[00:21] <J_Darnley> Oh yes, there it is.  Line 108: test ... && exit 0
[00:21] <J_Darnley> No wonder I can't get Windows to run it.
[00:21] <J_Darnley> I'll remove the version file and try again.
[00:57] <cone-595> ffmpeg.git Peter Krefting master:d5733936d857: configure: Remove dcbzl check for e500v1 and e500v2 architectures
[00:57] <cone-595> ffmpeg.git James Almer master:a4e4948ffe36: x86/cpu: add missing avx2 AVOption in av_parse_cpu_flags()
[01:09] <cone-595> ffmpeg.git Ronald S. Bultje master:5351964a2b52: vp8: fix bilinear C code to work if src_stride != dst_stride.
[01:09] <cone-595> ffmpeg.git Michael Niedermayer master:9707b539b995: Merge remote-tracking branch 'qatar/master'
[04:53] <cone-295> ffmpeg.git Michael Niedermayer master:951793717a28: avcodec/hevc_filter: assert validity of qp predictor input
[04:53] <cone-295> ffmpeg.git Michael Niedermayer master:56985d26d705: avcodec/hevc: clear tab_slice_address in hevc_frame_start()
[04:53] <cone-295> ffmpeg.git Michael Niedermayer master:a18f11158216: avcodec/hevc: clear tab_slice_address of ctb on error.
[04:53] <cone-295> ffmpeg.git Michael Niedermayer master:6ef57f4d9a09: avcodec/hevc: hls_decode_entry: check that the previous slice segment is available before decoding the next
[09:36] <cone-240> ffmpeg.git Clément Bœsch master:f21d0beb0cc4: Fix a few heigth/height typo.
[10:56] <ubitux> BBB: Carl pointed me out that the failure i told you about yesterday was because i didn't enable the vp9 parser
[10:56] <ubitux> BBB: any idea why some tests actually need the parser, and will output something different (??) in case it's not present?
[11:10] <ubitux> trac still down :(
[11:10] <ubitux> sadness.
[12:34] <BBB> ubitux: only if there's composite packets (i.e. one invisible frame - ARF - and a regular frame in a single matroska video packet)
[12:35] <BBB> ubitux: the parser splits them in two so the decoder can handle individual frames
[12:35] <BBB> ubitux: similar to mpeg4 packet parsing in the avi container
[12:35] <BBB> ubitux: so yes vp9 decoding absolutely requires the vp9 parser
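Editor's note: the splitting BBB describes can be sketched as below. This is a minimal illustration, not the code of FFmpeg's vp9 parser. It assumes the composite packet carries the VP9 "superframe index" at its end (a marker byte with the top three bits set to 110, repeated at both ends of the index, with little-endian frame sizes in between). The function name `vp9_superframe_sizes` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not an FFmpeg API): read the superframe index at the
 * end of a VP9 packet and report the size of each contained frame.
 * Returns the number of frames (1 if the packet carries no index),
 * or -1 if the packet is empty or the index is inconsistent.
 * sizes[] must have room for 8 entries.
 */
int vp9_superframe_sizes(const uint8_t *data, size_t size, size_t sizes[8])
{
    if (!data || size == 0)
        return -1;

    uint8_t marker = data[size - 1];
    if ((marker & 0xe0) == 0xc0) {
        int nframes = (marker & 0x07) + 1;        /* frames in the packet  */
        int nbytes  = ((marker >> 3) & 0x03) + 1; /* bytes per size field  */
        size_t idx_size = 2 + (size_t)nbytes * nframes;

        /* The marker byte must also open the index. */
        if (size >= idx_size && data[size - idx_size] == marker) {
            const uint8_t *p = data + size - idx_size + 1;
            size_t total = 0;
            for (int i = 0; i < nframes; i++) {
                size_t sz = 0;
                for (int j = 0; j < nbytes; j++)
                    sz |= (size_t)*p++ << (8 * j); /* little-endian */
                sizes[i] = sz;
                total += sz;
            }
            /* Frames plus index must fit in the packet. */
            return total + idx_size <= size ? nframes : -1;
        }
    }

    /* No index: the whole packet is a single frame. */
    sizes[0] = size;
    return 1;
}
```

A decoder fed the whole packet without this split would see only one frame, which fits the differing test output ubitux reported when the parser was not enabled.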
[12:43] <ubitux> BBB: should we add a dependency to the parser in the decoder then?
[12:43] <ubitux> or just in FATE?
[12:44] <nevcairiel> if the decoder cant really function without it, it should probably depend on it
[12:44] <ubitux> -vp9_decoder_select="videodsp"
[12:44] <ubitux> +vp9_decoder_select="videodsp vp9_parser"
[12:44] <ubitux> i guess.
[12:45] <nevcairiel> which brings me to another interesting question, what happens if you run the parser twice? is it smart enough not to wreak havoc?
[12:48] <BBB> ubitux: yeah that looks good
[12:48] <BBB> nevcairiel: it just ignores itself the second time I think
[12:59] <ubitux> BBB: we're decoding at about 2 fps on a beaglebone black ("BBB" ;))
[13:00] <ubitux> i'm going to try to do something about it i guess :)
[13:01] <ubitux> (ped1080p.webm)
[13:12] <BBB> \o/
[13:12] <BBB> yeah arm is pretty much nothing-done-so-far territory
[13:12] <BBB> as is ppc
[13:12] <BBB> not that anyone cares about ppc
[13:14] Action: Daemon404 remembers the vp9 fpga being larger than an a8
[13:16] <BBB> good thing I don't write fpgas
[13:16] <BBB> how big is the hevc fpga?
[13:18] <Daemon404> good question
[13:19] <BBB> bash-3.2$ wc -l ../../libvpx/vp9/{decoder,common}/*.[ch] ../../libvpx/vp9/common/x86/*.{asm,c,h} | grep total
[13:19] <BBB>    28321 total
[13:19] <BBB> bash-3.2$ wc -l ../libavcodec/vp9* ../libavcodec/x86/vp9* | grep total
[13:19] <BBB>    12530 total
[13:19] <BBB> there's also that
[13:20] <Daemon404> i was more referring to mobile use
[13:20] <BBB> I know
[13:20] <Daemon404> though, i think phones are *finally* getting vp8 hw
[13:20] <Daemon404> mabe.
[13:20] <Daemon404> maybe*
[13:23] <kierank> Daemon404: you mean asic, right?
[13:26] <Daemon404> probably
[13:36] <kurosu__> the hevc decoder is already implemented in mediatek and qualcomm newest flag[sc]hips, but I don't know the respective sizes
[13:37] <kurosu__> and I guess it depends on the process, eg 40nm or 28nm, so size is not a proper metric
[13:51] <BBB> bash-3.2$ du -ch libavcodec/{x86/,}vp9*.o|grep total
[13:51] <BBB> 508K	total
[13:51] <BBB> bash-3.2$ du -ch ../../libvpx/x86-64/vp9/{decoder,common}/*.o ../../libvpx/x86-64/vp9/common/x86/*.o|grep total
[13:51] <BBB> 932K	total
[13:51] <BBB> also good stuff
[13:54] <ubitux> :)
[13:54] <ubitux> BBB: did you have a chance to look at the latest fuzzed?
[13:54] <BBB> is that fuzzy7?
[13:54] <BBB> or are there new ones?
[13:55] <BBB> because fuz7 doesn't crash for me
[13:56] Action: BBB tries valgrind
[13:56] <ubitux> fuz7 yes iirc
[13:56] <ubitux> let me check
[13:59] <ubitux> BBB: try with -threads 7
[14:00] <ubitux> BBB: valgrind output here: http://pastie.org/pastes/8708604/text
[14:13] <BBB> ==23507==    at 0x100AFE0B4: find_ref_mvs (vp9.c:1059)
[14:13] <BBB> ==23507==    by 0x100B018D2: fill_mv (vp9.c:1181)
[14:15] <BBB> good we get the same thing :)
[14:18] <ubitux> BBB: so, we want to have prefetch (& friends) for the coeffs decode, some AVX2 and ARM before doing a proper bench, or you want to do that soon?
[14:22] <BBB> not coefs decode
[14:22] <BBB> prefetch is for mc
[14:22] <BBB> avx2 would be nice
[14:22] <BBB> arm isn't necessary since we'll bench on x86 anyway, but it'd be nice to have anyway
[14:22] <BBB> coefs decode is typically one of the hard-to-optimize slow parts of a decoder; so maybe we just have to live with it being slow?
[14:24] <ubitux> fine with me :)
[14:24] <BBB> it's up to us what we want finished before we bench, I guess
[14:24] <BBB> there's no rules for this; we make the rules
[14:24] <BBB> as long as it's fair
[14:24] <ubitux> ok
[14:24] <BBB> I'll look at the valgrind error tonight
[14:24] <BBB> work now
[14:24] <BBB> bbl
[14:24] <ubitux> i'll have an avx2 cheap laptop in a few days
[14:25] <ubitux> probably next week, maybe in 2
[14:25] <Compn> thats a lot of LOC
[14:26] <Compn> :)
[14:28] <kierank> I have some cargo cult avx2 patches to swscale somewhere
[14:43] <kurosu__> iirc, there are a lot of low hanging fruits with audio dsps and avx
[15:03] <J_Darnley> Even older SIMD sets have some fruit available.
[15:04] <kurosu__> I'm particularly aware of this
[15:28] <Compn> is any of the /mips/ ripe for stealing ?
[15:38] <kurosu__> oh, I was restricting older SIMDs to < SSE2
[15:38] <kurosu__> as in, some stuff still hasn't x86 SIMD
[15:39] <kurosu__> even fewer contributors there
[15:42] <cone-240> ffmpeg.git Michael Niedermayer master:2a03eb4c99f9: avcodec/wmalosslessdec: use sizeof() instead of literal number
[15:42] <cone-240> ffmpeg.git Michael Niedermayer master:ec9578d54d09: avcodec/wmalosslessdec: fix mclms_coeffs* array size
[15:58] <J_Darnley> I don't mind writing for SSE2 and earlier
[15:58] <J_Darnley> Until 5 months ago that was all I had.  (I miss that Athlon64)
[16:00] <J_Darnley> I would have done for my flac patch if it wasn't so easy to use one sse4 instruction for multiplying double words
[16:00] <kurosu__> I remember people complaining about that SSE was so antiquated they disliked the fact I wrote code for that set
[16:00] <kurosu__> I admit I often slip and use SSE2 insn sometimes
[16:01] <J_Darnley> If they've ever used a float on x64 code they've "written" sse
[16:02] <J_Darnley> Next thing people will say that with the PC being declared dead is that nobody should write any x86 code
[16:04] <kurosu__> those were people actually quite seasoned in writing x86 asm, so it might have stemmed from the maintenance burden. or something
[16:07] <kurosu__> 3dnow (even the float part) is more debatable though as the pcs having this at most are now really old
[16:08] <J_Darnley> Aren't new CPUs supposed to be dropping it at some point?
[16:08] <kurosu__> since 2010 iirc
[16:08] <kurosu__> I mean, newer amd models starting from around 2010 don't have support for 3dnow
[16:10] <kurosu__> http://developer.amd.com/community/blog/2010/08/18/3dnow-deprecated/ <- confirmed
[16:14] Action: J_Darnley is so very out of touch
[17:10] <kurosu__> J_Darnley: btw, there is also the mlp decoder that might benefit from the flac work, but I remember an unorthodox mix of things in this
[17:13] <Keestu> what it is all about cortex-v8, ?  when i build latest ffmpeg/x264 , and loading in android i get the error   /libx264.a(pixel-a.o) for Cortex-A8 erratum because it has no mapping symbols.
[17:14] <Keestu> could someone kindly give me a hint ?
[17:21] <J_Darnley> kurosu__: I might look tome time.  I was working on the encoder.  Another James was working on the decoder.
[17:22] <J_Darnley> *some time
[17:42] <yamyam> Does anybody may know if it is possible to compile ffmpeg arguments to auto start with the binary? like -af aresample=async=1000. I have to create a work around how to give ffmpeg arguments which is called by flussonic streaming server who is compiling the transcoder dynamically.I played around on several injects but winded up getting escaped by the other bin who is calling ffmpeg
[17:47] <J_Darnley> yamyam: You could edit ffmpeg.c to change how it makes the filter graph
[17:47] <J_Darnley> I can't say whether that is easy though
[17:48] <wm4> create a script and let your broken software call that instead
[17:48] <yamyam> ok thanks, that's already helpful :-) I was playing around in the filter.c and trying to call the option in any case
[17:48] <wm4> the script would just call ffmpeg with additional args
[17:49] <yamyam> I tried that with a script but the other bin is dynamically shifting with the arguments for ffmpeg (then they escape or crop my argument)
[17:58] Action: J_Darnley curses firefox pdf viewer
[18:49] <BtbN> What YUV format does AV_PIX_FMT_YUV420P mean? nvenc accepts NV12(Which has its own PIX_FMT), YV12 and IYUV(which seems to be identical to I420). It also supports YUV444, but doesn't specify a specific format for it.
[18:50] <nevcairiel> its YV12 with different plane order
[18:50] <nevcairiel> YUV is well, Y U V
[18:50] <nevcairiel> while YV12 is Y V U
[18:50] <BtbN> i know what the formats are, i mean what format AV_PIX_FMT_YUV420P is
[18:50] <nevcairiel> i just told you
[18:51] <wm4> isn't the difference that the planes are "packed"
[18:51] <wm4> and come after each other in memory?
[18:51] <nevcairiel> YUV420P is YV12 with a different plane order
[18:51] <wm4> maybe packed is a bad word to describe it
[18:51] <nevcairiel> there is nothing in ffmpeg that corresponds 1:1 to YV12 as-is
[18:51] <BtbN> so i should be able to just copy it directly, swapping the two planes
[18:51] <nevcairiel> yes
[18:51] <BtbN> and honoring the different pitches
[18:52] <nevcairiel> wm4: I dont know of any special behaviour of YV12, always looked like plain planar YUV420 to me, except the different U/V ordering
[18:52] <nevcairiel> which is quite obvious if you get it wrong, since all people look like smurfs
[18:52] <nevcairiel> (blue skin)
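Editor's note: a minimal sketch of the copy discussed above, taking planar 4:2:0 data in Y, U, V order and writing it as YV12 (Y, then V, then U). The function names are hypothetical, and the destination layout (one contiguous buffer, chroma pitch equal to half the luma pitch) is an assumption for illustration, not something the nvenc documentation is quoted as requiring here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy one plane row by row; source and destination strides may differ. */
void copy_rows(uint8_t *dst, int dst_stride,
               const uint8_t *src, int src_stride,
               int width, int height)
{
    assert(width <= dst_stride && width <= src_stride);
    for (int y = 0; y < height; y++)
        memcpy(dst + (size_t)y * dst_stride,
               src + (size_t)y * src_stride, (size_t)width);
}

/*
 * Hypothetical helper: src[0..2] are the Y, U, V planes with their own
 * line sizes (the order used by AV_PIX_FMT_YUV420P). dst is one contiguous
 * buffer of dst_pitch * height * 3 / 2 bytes laid out as Y, V, U.
 * width and height are assumed to be even.
 */
void yuv420p_to_yv12(uint8_t *dst, int dst_pitch,
                     const uint8_t *const src[3], const int src_linesize[3],
                     int width, int height)
{
    int cw = width / 2, ch = height / 2;
    int cpitch = dst_pitch / 2;                 /* assumed chroma pitch */

    uint8_t *dst_y = dst;
    uint8_t *dst_v = dst_y + (size_t)dst_pitch * height;
    uint8_t *dst_u = dst_v + (size_t)cpitch * ch;

    copy_rows(dst_y, dst_pitch, src[0], src_linesize[0], width, height);
    copy_rows(dst_v, cpitch,    src[2], src_linesize[2], cw, ch); /* V first */
    copy_rows(dst_u, cpitch,    src[1], src_linesize[1], cw, ch); /* then U  */
}
```

Swapping the last two calls gives the I420/IYUV order instead, which is the mistake that produces the blue skin tones mentioned above.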
[18:52] <JEEB> NV12 is the half-packed one
[18:53] <JEEB> YV12 is planar
[18:53] <wm4> mplayer also has it, I was surprised when I realized that mplayer's YV12 is actually "swapped" (the same as YUV420P)
[18:53] <JEEB> lal
[18:53] <BtbN> nevcairiel, ah, so it's IYUV/I420. Which is YV12 with swapped planes
[18:53] <nevcairiel> i guess, dont really know those two
[18:54] <JEEB> usually it goes the other way
[18:54] <JEEB> YV12 is I420 with swapped planes :D
[18:54] <nevcairiel> in directshow world I420 isnt really used by anything, they prefer YV12 (or rather NV12)
[18:54] <BtbN> but i have to copy it manually anyway. Or can i request a specific pitch?
[18:55] <nevcairiel> no you cannot, the caller sets the pitch in the incoming frame
[18:55] <nevcairiel> so unless you can tell the encoder to deal with a specific pitch...
[18:55] <BtbN> nope, it justs sets it
[18:55] <BtbN> -s
[18:56] <BtbN> are there ready to use accelerated copying functions?
[18:57] <wm4> memcpy
[18:57] <BtbN> that's not really efficient
[18:57] <BtbN> and causes a lot of load just for the copying
[19:02] <BtbN> Hm, doesn't look like it. So that's over 3000 memcpy calls for each frame. For example VLC has a set of special functions for plane copying, which use sse4 and other optimizations if possible
[19:03] <wm4> lol
[19:03] <Plorkyeran> "optimized" memcpys tend to be trivially faster on the author's cpu and slower on all others
[19:04] <BtbN> no
[19:04] <BtbN> it's not an optimized memcpy
[19:04] <BtbN> it's an optimised plane with line stride copy
[19:04] <Plorkyeran> aka optimized memcpy
[19:04] <wm4> so it doesn't copy the stride padding or what?
[19:05] <BtbN> it does, but in a much faster way by using special cpu instructions
[19:05] <wm4> haha that attitude
[19:05] <BtbN> and the difference between 3000 memcpy calls and one call to the optimized function can easily be 3% vs. 30% cpu usage
[19:05] <wm4> don't you think libc authors would optimize their memcpyies this way too?
[19:05] <BtbN> memcpy is optimized, but it's called 3000 times for 1080p
[19:05] <Plorkyeran> it really can't be 3% vs 30% unless you're comparing -O0
[19:06] <Plorkyeran> the number of calls to memcpy is not particularly relevant
[19:06] <BtbN> it is 3% vs. 30%, just by enabling or disabling the SSE optimizations
[19:06] <wm4> OTOH, I wouldn't be surprised if these OMG OPTIMIZED memcpyies had a high startup overhead
[19:06] <BtbN> no, they don't
[19:07] <BtbN> http://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers this is an intel article about it, the VLC copy functions are based on it
[19:07] <BtbN> they also have a benchmark
[19:07] <wm4> oh
[19:07] <wm4> this talks about special memory, like video memory
[19:08] <BtbN> yes, which is exactly what i'm talking about
[19:08] <Plorkyeran> you should probably have mentioned that at some point, then
[19:08] <BtbN> i did oO
[19:09] <BtbN> i was looking for yuv frame copying functions which honor the line stride
[19:09] <kurosu_> memcpy has to do many checks, like the length of the copy, whether dst/src are aligned etc
[19:10] <kurosu_> I remember replacing such a memcpy in an fft or mdct function (already x86) and got a 10% speedup
[19:10] <kurosu_> (for the function)
[19:10] <wm4> BtbN: no, you want a memcpy which works with uncached memory
[19:11] <wm4> or something like this
[19:11] <BtbN> for 1080p it's over 3200 memcpy calls each frame to change the line stride, which is really slow
[19:11] <wm4> it doesn't have much to do with strides... unless startup overhead matters somehow
[19:11] <BtbN> It does, if the strides were the same, it'd be just 3 memcpy calls
[19:12] <BtbN> but as they are different, each line needs its own memcpy call
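Editor's note: the distinction BtbN draws (one memcpy per plane when the strides match, one per line when they differ) can be sketched as follows. The function name `copy_plane` is hypothetical and this is only an illustration of the two paths; it says nothing about which is faster in practice, which is the point under dispute here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: copy `height` rows of `row_bytes` bytes each.
 * If both buffers use the same stride, the plane is one contiguous block
 * and a single memcpy is enough (this also copies the padding between
 * rows). Otherwise every row needs its own memcpy.
 */
void copy_plane(uint8_t *dst, size_t dst_stride,
                const uint8_t *src, size_t src_stride,
                size_t row_bytes, size_t height)
{
    assert(row_bytes <= dst_stride && row_bytes <= src_stride);
    if (height == 0)
        return;

    if (dst_stride == src_stride) {
        /* Fast path: one call for the whole plane, last row without padding. */
        memcpy(dst, src, src_stride * (height - 1) + row_bytes);
        return;
    }

    /* Different strides: one call per row. */
    for (size_t y = 0; y < height; y++)
        memcpy(dst + y * dst_stride, src + y * src_stride, row_bytes);
}
```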
[19:24] <ubitux> BtbN: we could probably use that in av_image_copy_plane()
[19:29] <ubitux> maybe you can test and send a patch if that provides a performance boost?
[19:35] <BtbN> I benchmarked this with some xbmc guys, that's where i got that 30% to 3% from. With a simple memcpy each line solution, we had like 30% cpu usage with 1080p25 video. We then switched to the intel solution, and got only 3% cpu usage
[19:37] <BtbN> it's quite complicated to implement, though. Because it needs inline asm or sse intrinsics which aren't available everywhere
[19:41] <nevcairiel> the special memcpy the intel article mentions is for copying FROM such memory, not to
[19:41] <nevcairiel> and any good compiler has intrinsic inline memcpy so that there is no call overhead
[19:42] <BtbN> it's still inefficient to do 3200 memcpy calls, even if memcpy itself is highly optimized
[19:42] <nevcairiel> there are no special instructions to write to such special memory anyway
[19:43] <BtbN> sse4.1 has some instructions which address the problem quite exactly
[19:43] <nevcairiel> no, its for reading from such memory
[19:43] <nevcairiel> not writing to
[19:43] <BtbN> it's both
[19:44] <nevcairiel> you keep believing that
[19:44] <BtbN> there's nothing to believe here, i successfully implemented such a function and it was 30% cpu usage to 3% cpu usage. Compared to 3200 memcpy calls with gcc -O3
[19:45] <ubitux> and so, are you going to submit a patch to ffmpeg?
[19:45] <BtbN> Maybe later
[19:45] <wm4> BtbN: was that main memory to main memory?
[19:45] <ubitux> you mean you won't?
[19:45] <nevcairiel> your system must be rather weird if simply copying data around gives you 30% load
[19:46] <BtbN> copying data around with 3200 memcpy calls creates load
[19:46] <BtbN> wm4, what do you mean with main memory?
[19:46] <nevcairiel> I can copy 1080p60 realtime with memcpy twice and still get like 1% load
[19:46] <wm4> normal, cached system RAM
[19:46] <nevcairiel> because you know, my software does it
[19:46] <BtbN> yes, if you do one memcpy call for each plane that's no problem.
[19:46] <nevcairiel> for every line
[19:47] <BtbN> but if you have to do one memcpy call for every line, it creates extreme load
[19:47] <nevcairiel> memcpy is inlined, there is very little overhead
[19:47] <BtbN> wm4, it does normal system ram -> USWC block -> system ram
[19:48] <wm4> I don't think gcc inlines memcpy when the size is runtime variable, but I don't know details
[19:48] <ubitux> what's that specific scenario?
[19:48] <ubitux> is it just after hw decode or something?
[19:48] <BtbN> copying one yuv plane to another yuv plane, with a different linesize
[19:51] <ubitux> http://pastie.org/pastes/8709519/text
[19:52] <ubitux> this is a ffmpeg -i big_buck_bunny_1080p_h264.mov -vf 'pad=iw*2' -f null -
[19:52] <ubitux> so 15% of the time is in copying this
[19:52] <ubitux> (it's indeed the memcpy of the pad since removing the filter drops the memcpy from the benchmark)
[19:53] <ubitux> maybe we can reduce that 15% with that method
[19:54] <BtbN> the vlc code can't be used directly unfortunately, as it relies on gcc features for it
[19:54] <ubitux> what's the problem?
[19:54] <nevcairiel> its nonsense, it doesn't do anything except in the special case when you need to copy from USWC video memory
[19:54] <ubitux> BtbN: we can just add #ifdefery, that's not a problem
[19:54] <nevcairiel> my code does several memcpys on a line-by-line basis in some cases, and there is no higher load than the cases where its a full-plane copy
[19:55] <BtbN> ubitux, but having it working in different compilers would be better
[19:55] <ubitux> can be added later
[19:55] <BtbN> no idea how widespread the _mm_... intrinsics are
[19:55] <ubitux> adding improvements just for one specific compiler/arch/whatever is fine as long as the generic path still works
[19:55] <ubitux> no, no intrinsics though
[19:55] <ubitux> inline asm would be fine
[19:56] <nevcairiel> ubitux: it doesn't help
[19:56] <BtbN> it does help...
[19:56] <Daemon404> do you guys want big sticks to hit eachother with?
[19:56] <ubitux> BtbN: well, i'd suggest you just send a PoC patch to prove what you say :)
[19:56] <nevcairiel> I don't know what GCC does, but on MSVC the memcpy is so efficient that there is no difference between one call or 3000 calls
[19:57] <nevcairiel> in performance anyway
[19:57] <nevcairiel> because memcpy is a intrinsic compiler function
[19:58] <wm4> nevcairiel: what code does msvc generate for a memcpy with the size not known at compile time?
[19:58] <BtbN> Even if it's inlined, it's still far less efficient than a 4K block streamed copy using sse
[19:59] <nevcairiel> memcpy uses sse instructions
[19:59] <nevcairiel> even avx
[20:00] <BtbN> it does, but it does not optimize across multiple memcpy calls
[20:00] <nevcairiel> prove it then
[20:00] <nevcairiel> show an example we can benchmark
[20:01] <wm4> all memcpy sse examples I see are quite lengthy and don't really look suitable for inlining
[20:02] <BtbN> nevcairiel, i already did a benchmark for this, just in a different software. Nothing that needs proof there.
[20:02] <nevcairiel> And i benchmarked the opposite result just now
[20:05] <BtbN> also, you can't just enable sse4 optimizations when compiling binary packages.
[20:05] <ubitux> BtbN: do you have a thread/post/patch/whatever about the xbmc perf boost?
[20:06] <BtbN> ubitux, no, we did that outselves during development
[20:06] <BtbN> *r
[20:06] <ubitux> what year was that?
[20:06] <BtbN> last year...
[20:07] Action: ubitux wonders where the memcpy happens in vf_pad
[20:10] <BtbN> The way the streamed load/store plane copying works is by using an uncachable intermediate buffer. Which is filled with streaming load instructions, and then written to the destination buffer with streaming store instructions. So yes, it needs special memory, but only as intermediate storage.
[20:10] <BtbN> that's something memcpy can never achieve
[20:10] <wm4> BtbN: I think you keep mixing these 2 things?
[20:10] <BtbN> which two things?
[20:10] <wm4> uncached special memory vs. calling memcpy "too often"
[20:10] <BtbN> why would i mix them?
[20:10] <nevcairiel> What you talk about is only really useful for weird memory, like USWC memory
[20:10] <BtbN> nevcairiel, no, it's not.
[20:10] <nevcairiel> like when copying FROM a GPU
[20:11] <nevcairiel> BtbN: then show the proof already
[20:11] <BtbN> ...
[20:11] <BtbN> I don't have a written 3000 page proof paper ready for you
[20:11] <nevcairiel> why would we believe you if all you give us is "i have benchmarked it"
[20:11] <BtbN> I can just tell you from personal experience from two different software projects that 3200 memcpy calls ARE inefficient
[20:12] <nevcairiel> sure, doing one call is more efficient, but not 3% vs 30% cpu usage inefficient, more 3% vs 4% inefficient
[20:14] <wm4> BtbN: the right way to go about this is to write a test program that confirms your claims, and then let us confirm it
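Editor's note: a test program of the kind wm4 asks for could start from the sketch below. It times a per-row memcpy against a single memcpy over the same plane in ordinary cached system RAM, using the C standard `clock()`. All names are hypothetical, it does not touch USWC or GPU memory, and it does not include the streaming-load variant under discussion; a serious benchmark would also need to check that the compiler has not elided any of the copies.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* One memcpy call per row of the plane. */
void copy_per_row(uint8_t *dst, const uint8_t *src,
                  size_t stride, size_t width, size_t height)
{
    assert(width <= stride);
    for (size_t y = 0; y < height; y++)
        memcpy(dst + y * stride, src + y * stride, width);
}

/*
 * Hypothetical harness: copy a stride x height plane `iters` times and
 * return the CPU time used in seconds, or -1.0 on failure.
 * per_row != 0 selects one memcpy per row, otherwise one memcpy per plane.
 */
double time_plane_copy(size_t stride, size_t height, int iters, int per_row)
{
    size_t bytes = stride * height;
    uint8_t *src = malloc(bytes);
    uint8_t *dst = malloc(bytes);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0x5a, bytes);
    memset(dst, 0, bytes);

    clock_t t0 = clock();
    for (int i = 0; i < iters; i++) {
        if (per_row)
            copy_per_row(dst, src, stride, stride, height);
        else
            memcpy(dst, src, bytes);
    }
    clock_t t1 = clock();

    /* Verify the result so a broken copy is reported as a failure. */
    int ok = memcmp(dst, src, bytes) == 0;
    free(src);
    free(dst);
    return ok ? (double)(t1 - t0) / CLOCKS_PER_SEC : -1.0;
}
```

Calling `time_plane_copy(1920, 1080, 1000, 1)` and `time_plane_copy(1920, 1080, 1000, 0)` on the same machine gives two numbers that others in the channel could reproduce.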
[20:15] <BtbN> so i should invest maybe a whole day just to prove something where even research papers from intel exist, which you just don't believe?
[20:15] <nevcairiel> the paper you linked is about one specific case, reading from USWC memory
[20:15] <BtbN> no, it's not
[20:15] <nevcairiel> read it
[20:15] <BtbN> it uses a 4k USWC memory block as intermediate storage, that's it
[20:15] <nevcairiel> This paper explains best known methods for improving performance of data copies from Uncacheable Speculative Write Combining (USWC) memory to ordinary write back (WB) system memory.
[20:15] <nevcairiel> no, the GPU memory is USWC
[20:16] <BtbN> the other two buffers are normal system memory
[20:16] <nevcairiel> i give up, you clearly havent even read the article you linked
[20:17] <BtbN> i did, i even implemented a function based on it. And it was several hundred times faster than a memcpy based solution
[20:17] <BtbN> for normal system memory to system memory copies
[20:18] <nevcairiel> then give us a small  test application that uses this function. If you've already written it, it should be a matter of half an hour
[20:18] <nevcairiel> otherwise, stop making claims you don't want to back with facts
[20:18] <BtbN> The xbmc function needs gcc inline asm, so it can't be used right away
[20:19] <nevcairiel> we can test with gcc, thats no problem
[20:19] <BtbN> but i can't currently, because i only have MSVC here
[20:20] <nevcairiel> here have a gcc; http://files.1f0.de/mingw/mingw-w64-gcc-4.8.2-stable-r10.7z
[20:20] <BtbN> yes, gcc on windows is super useful on its own...
[20:20] <BtbN> and i don't want a full mingw/cygwin environment on this machine
[20:30] <durandal_1707> for how long is trac going to be dead?
[20:34] <llogan> durandal_1707: maybe today. probably by tomorrow according to beastd, IIRCAFAIK
[21:09] <rcombs> anyone know if there's a way to access the current AVPacket from AVCodec.encode_sub? I need to do av_packet_get_side_data
[21:11] <nevcairiel> rcombs: you cannot, the subtitle encode API doesn't accept AVPackets
[21:11] <rcombs> ah, that's unfortunate
[21:11] <wm4> there's a subtitle encode API?
[21:11] <nevcairiel> sure
[21:12] <nevcairiel> there is some bitmap sub encoders
[21:12] <nevcairiel> vob and dvb, iirc
[21:12] <wm4> nice
[21:14] <cone-12> ffmpeg.git Lukasz Marek master:0792b8733570: lavc/evrcdec: fix const misplacement
[21:14] <cone-12> ffmpeg.git Lukasz Marek master:7bb8b8765452: lavc/adpcm_data: fix const misplacement
[21:39] <J_Darnley> Does anyone know a good visual representation of the instructions added by sse3 and later?
[21:39] <J_Darnley> I mean something like: http://tommesani.com/index.php/component/content/article/2-simd/37-mmx-arithmetic.html
[21:41] <J_Darnley> Or is there an updated NASM manual that explains the newer instructions?
[21:46] <kurosu_> you mean, sorted by generation? ie these are from sse3, these from sse4 etc ?
[21:47] <kurosu_> otherwise, if for the full set, I generally check a site in French, or this: http://download.intel.com/products/processor/manual/253667.pdf
[21:50] <J_Darnley> It needn't be sorted into generations, I can see that elsewhere.
[21:52] <J_Darnley> Yes, that should do
[22:06] <kurosu_> the document also has the different sets, but there's value in having them sorted, as it helps not using an sse(y>x) insn in a ssex function
[23:21] <j-b> good morning!
[23:21] <nevcairiel> morning? did you move to australia?
[23:24] <j-b> It's always morning on IRC
[00:00] --- Sat Feb  8 2014