[Ffmpeg-devel-irc] ffmpeg-devel.log.20160819

burek burek021 at gmail.com
Sat Aug 20 03:05:03 EEST 2016


[00:27:30 CEST] <durandal_1707> michaelni: those h264 changes, got new reports? Or those are old?
[00:30:38 CEST] <jamrial_> Dresk|Dev: look at avcodec_parameters_to_context(), if you haven't already
[00:31:44 CEST] <jamrial_> you need to call it after avcodec_alloc_context3 but before avcodec_open2, to pass the stream codec parameters to your manually allocated avcodeccontext
[00:32:50 CEST] <jamrial_> you're right that it's currently poorly documented. the avcodec_open2 doxy should probably have a line about it
[03:22:09 CEST] <michaelni> why does everyone quit after asking a question
[03:23:16 CEST] <michaelni> and before one can answer
[03:24:57 CEST] <cone-254> ffmpeg Michael Niedermayer master:237207645b36: avcodec/rawdec: Fix bits_per_coded_sample checks
[03:24:57 CEST] <cone-254> ffmpeg Michael Niedermayer master:2a3720bc22d9: avformat/swfdec: Move packet size check before side data allocation
[03:28:34 CEST] <jamrial_> michaelni: if you mean paul, it's kinda late for him right now so it makes sense he left. also his connection is often spotty
[11:04:23 CEST] <cone-941> ffmpeg Michael Niedermayer master:a453bbb68f3e: avformat/swfdec: Fix inflate() error code check
[14:50:54 CEST] <cone-095> ffmpeg Michael Niedermayer master:9ffe44c5c75c: avcodec/indeo2: check ctab
[16:06:01 CEST] <durandal_1707> michaelni: see above
[16:15:38 CEST] <cone-095> ffmpeg Umair Khan master:4f6f56114e56: avformat/movenc: allow rewriting extradata
[16:15:39 CEST] <cone-095> ffmpeg Michael Niedermayer master:ca906e81909d: avformat/movenc: Free extradata after successfull allocation of new instead of before
[18:03:13 CEST] <cone-095> ffmpeg Paul B Mahol master:0d8b6a15ddca: avfilter/vf_histogram: make foreground and background opacity configurable
[19:47:29 CEST] <durandal_1707> michaelni: so there's no way to loop audio without reinitializing?
[19:49:43 CEST] <CFS-MP3> michaelni: Just submitted a new version of SCTE-35, fixes the crash you reported on the previous one for one of the test files
[20:02:53 CEST] <BtbN> Is there a way to find out the type of a memory address? Specifically, if it's uswc memory or not.
[20:03:45 CEST] <nevcairiel> no
[20:04:05 CEST] <BtbN> Is there no table somewhere to find out if it's in a range of uncacheable memory?
[20:04:39 CEST] <nevcairiel> nothing even remotely portable
[20:04:49 CEST] <BtbN> Well, it only has to work on linux.
[20:15:11 CEST] <fritsch> you also cannot check vaapi / driver as they might do it differently depending on the driver / gpu gen
[20:18:37 CEST] <BtbN> The movntdqa based copy code is notably slower on normal memory, so knowing when to use it would be good.
[20:19:00 CEST] <nevcairiel> ime its not really slower
[20:19:23 CEST] <fritsch> afaik it was factor 4 slower
[20:19:27 CEST] <BtbN> might have just been my test-code then
[20:19:40 CEST] <fritsch> i think btbn tested it with a simple memcpy
[20:19:47 CEST] <fritsch> so not much you can do wrong there :-)
[20:20:02 CEST] <BtbN> Well, I compared against av_image_copy_plane iirc
[20:20:05 CEST] <nevcairiel> these instructions have certain restrictions on how to use them
[20:21:13 CEST] <nevcairiel> like always unroll into groups of 4
[20:21:33 CEST] <BtbN> I also used intrinsics for the test
[20:21:55 CEST] <nevcairiel> anyway trying to transparently shoehorn that into existing APIs is like not a good idea
[20:22:08 CEST] <nevcairiel> better to make a new one and let the calling code decide
[20:22:57 CEST] <fritsch> the unroll and the alignment should not be an issue in that regard
[20:23:10 CEST] <fritsch> as ffmpeg's buffers fulfill all the needs of this sse4 copy
[20:23:37 CEST] <nevcairiel> but still needs to be written like that, because if you use it without the unroll it will be slower
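The unrolling nevcairiel describes can be sketched with intrinsics as below. This is a minimal sketch, not the code from the gist or from Kodi: the function names `stream_load_copy64` and `copy_streaming_load` are hypothetical, a GCC or Clang compiler on x86 is assumed, and the fast path is only taken when both pointers are 16-byte aligned and the size is a multiple of 64. Anything else falls back to `memcpy`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <immintrin.h>
#define HAVE_SSE41_PATH 1
#else
#define HAVE_SSE41_PATH 0
#endif

#if HAVE_SSE41_PATH
/* Hypothetical helper. Requires: src and dst 16-byte aligned,
 * size a multiple of 64 bytes. */
__attribute__((target("sse4.1")))
static void stream_load_copy64(uint8_t *dst, const uint8_t *src, size_t size)
{
    for (size_t i = 0; i < size; i += 64) {
        __m128i *s = (__m128i *)(src + i);
        __m128i *d = (__m128i *)(dst + i);
        /* Group of four streaming loads (movntdqa), 4 x 16 = 64 bytes. */
        __m128i x0 = _mm_stream_load_si128(s + 0);
        __m128i x1 = _mm_stream_load_si128(s + 1);
        __m128i x2 = _mm_stream_load_si128(s + 2);
        __m128i x3 = _mm_stream_load_si128(s + 3);
        /* Plain aligned stores to the ordinary (cached) destination. */
        _mm_store_si128(d + 0, x0);
        _mm_store_si128(d + 1, x1);
        _mm_store_si128(d + 2, x2);
        _mm_store_si128(d + 3, x3);
    }
}
#endif

/* Hypothetical entry point: uses the streaming-load path when the
 * constraints hold and the CPU reports SSE4.1, otherwise memcpy. */
void copy_streaming_load(uint8_t *dst, const uint8_t *src, size_t size)
{
#if HAVE_SSE41_PATH
    int ok = ((uintptr_t)dst % 16 == 0) &&
             ((uintptr_t)src % 16 == 0) &&
             (size % 64 == 0);
    if (ok && __builtin_cpu_supports("sse4.1")) {
        stream_load_copy64(dst, src, size);
        return;
    }
#endif
    memcpy(dst, src, size);
}
```

On ordinary write-back memory the streaming load behaves like a normal load, so the sketch is safe to run anywhere. Whether it is actually faster on a mapped VAAPI surface is exactly what the rest of the discussion tries to measure.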
[20:24:09 CEST] <BtbN> I hope I still have that code somewhere
[20:24:47 CEST] <fritsch> we have it in kodi, pretty straight forward encapsulated into a method
[20:27:02 CEST] <BtbN> https://gist.github.com/35f0e54489d5494628405100b389fe93 still have it
[20:27:42 CEST] <fritsch> for ffmpeg that alignment should be given correctly
[20:27:55 CEST] <fritsch> so no need to copy the modulo bits
[20:28:01 CEST] <fritsch> manually I think
[20:29:22 CEST] <fritsch> the copy2d misses the unaligned partly loop
[20:29:23 CEST] <fritsch> btw.
[20:29:28 CEST] <fritsch> that is a bug
[20:29:41 CEST] <fritsch> ah no it's not, you assert above
[20:29:50 CEST] <fritsch> but compute the unaligned var nevertheless
[20:30:13 CEST] <BtbN> i think that code comes from some Intel-Sample
[20:30:39 CEST] <fritsch> i know :-) we copied the same
[20:31:01 CEST] <nevcairiel> personally i've also never seen the small cache block intermediate to actually help but only complicate the code
[20:31:13 CEST] <fritsch> we have tested it
[20:31:23 CEST] <nevcairiel> you think i have not? =p
[20:31:23 CEST] <fritsch> our use case was: yadif deinterlacing of vaapi decoded video
[20:31:54 CEST] <fritsch> and that needs the copy from uswc memory
[20:32:12 CEST] <fritsch> celeron 1007U: 35 to 45 ms 1080p frame
[20:32:15 CEST] <fritsch> with memcpy
[20:32:25 CEST] <fritsch> and 1 to 5 ms with this copy
[20:32:52 CEST] <nevcairiel> its not about this copy vs. memcopy, its about this copy or a simplified version of this copy
[20:33:00 CEST] <nevcairiel> without the intermediate cache
[20:33:16 CEST] <BtbN> It's 16 seconds vs. 11 seconds here. So, SSE4 code vs. plain memcpy
[20:33:49 CEST] <BtbN> Is there some way to get a bit of uswc memory, to actually test the speed benefit?
[20:34:03 CEST] <nevcairiel> other than using your gpu to lock some?
[20:34:23 CEST] <BtbN> without actually going through all of VAAPI
[20:34:33 CEST] <fritsch> mmh
[20:34:50 CEST] <fritsch> you can just use the decoded image itself?
[20:35:04 CEST] <fritsch> it's actually "a fair bit of uswc memory"
[20:35:05 CEST] <nevcairiel> its not necessarily about being uswc alone, but also that its really gpu memory which might impact it further, so i would test with the proper case
[20:35:39 CEST] <BtbN> Well, I could just hijack ffmpeg, and place the benchmark in the middle of its VAAPI code
[20:35:45 CEST] <fritsch> hehe
[20:35:56 CEST] <fritsch> or just assume it's uswc memory in 100% of the cases
[20:36:07 CEST] <fritsch> that will be true for vaapi nowadays most of the time I think
[20:36:08 CEST] <nevcairiel> ffmpeg_vaapi.c copies the image to memory, just add your alternate copy function there
[20:36:09 CEST] <BtbN> No, I want to see how fast it is on actual uswc memory
[20:36:18 CEST] <nevcairiel> its not really much hijacking
[20:36:28 CEST] <BtbN> libavutil copies the image iirc?
[20:36:40 CEST] <nevcairiel> oh right that was probably refactored
[20:36:55 CEST] <BtbN> https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/hwcontext_vaapi.c#L801
[20:36:57 CEST] <nevcairiel> avutil/hwcontext_vaapi.c then =p
[20:37:42 CEST] <BtbN> Will have to re-implement larger parts of av_frame_copy though. It does quite a bit of stuff
[20:38:06 CEST] <fritsch> for benchmarking, hacking it into av_frame_copy directly?
[20:38:17 CEST] <fritsch> will most likely be sufficient - as it's for testing only
[20:38:49 CEST] <nevcairiel> its really just a wrapper around av_image_copy which just copies all planes
[20:38:52 CEST] <nevcairiel> not that much magic
[20:39:14 CEST] <BtbN> only some weird PAL stuff
[20:39:30 CEST] <fritsch> probably bypassed in the case
[20:39:35 CEST] <nevcairiel> which you dont need to handle :D
[20:39:47 CEST] <BtbN> it can't happen in stuff vaapi decodes?
[20:40:39 CEST] <fritsch> i currently don't see how - as vaapi outputs NV12
[20:40:52 CEST] <fritsch> or has kind of nv12 format in its buffers
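For the NV12 case discussed here, "just copies all planes" reduces to two row-by-row plane copies. The following is a self-contained sketch and not FFmpeg's `av_image_copy`: the names `copy_plane_rows` and `copy_nv12` are hypothetical, one shared stride per frame is assumed for both planes, and the palette (PAL) handling mentioned above is deliberately left out because NV12 has none.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy one plane row by row. Strides may be larger than the visible
 * row width, so the rows cannot be copied as one contiguous block. */
static void copy_plane_rows(uint8_t *dst, ptrdiff_t dst_stride,
                            const uint8_t *src, ptrdiff_t src_stride,
                            size_t bytes_per_row, int rows)
{
    for (int y = 0; y < rows; y++) {
        memcpy(dst, src, bytes_per_row);
        dst += dst_stride;
        src += src_stride;
    }
}

/* Hypothetical NV12 frame copy.
 * Plane 0: Y, width bytes per row, height rows.
 * Plane 1: interleaved U/V pairs at half resolution, so
 *          2 * ceil(width / 2) bytes per row and ceil(height / 2) rows. */
void copy_nv12(uint8_t *dst_y, uint8_t *dst_uv, ptrdiff_t dst_stride,
               const uint8_t *src_y, const uint8_t *src_uv,
               ptrdiff_t src_stride, int width, int height)
{
    size_t uv_row_bytes = (size_t)((width + 1) / 2) * 2;
    int uv_rows = (height + 1) / 2;

    copy_plane_rows(dst_y, dst_stride, src_y, src_stride,
                    (size_t)width, height);
    copy_plane_rows(dst_uv, dst_stride, src_uv, src_stride,
                    uv_row_bytes, uv_rows);
}
```

Swapping the `memcpy` in `copy_plane_rows` for a specialised row copy is the kind of local change being proposed for benchmarking.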
[20:42:04 CEST] <BtbN> For this to land in ffmpeg I'd have to write yasm for it anyway, and I have no idea about that.
[20:46:09 CEST] <BtbN> nevcairiel, so, without that buffer would essentially mean just using that copy_from_uswc function from my test directly, right?
[20:46:20 CEST] <nevcairiel> yes
[20:46:31 CEST] <BtbN> that should be somewhat simple to write in yasm
[20:46:36 CEST] <jamrial_> BtbN: BBB will hit me for suggesting it, but if you can glue it properly you could maybe use inline asm
[20:47:03 CEST] <BtbN> It should not be hard to do it in yasm. For someone who has done yasm before...
[20:47:44 CEST] <fritsch> i wonder if there is some kind of auto converter
[20:47:46 CEST] <fritsch> :p
[20:48:03 CEST] <nevcairiel> there is no reason to use inline asm, we dont want any new inline asm, and copying an entire plane also has no argument of saving overhead
[20:48:05 CEST] <jamrial_> assuming it's just one or a few mov instructions and not a full function implemented in assembly, using inline (like the intreadwrite macros) may also be better than yasm
[20:48:14 CEST] <jamrial_> fair enough
[20:48:37 CEST] <BtbN> https://bpaste.net/show/70b17c7760b8
[20:48:42 CEST] <BtbN> it's exactly this function
[20:48:54 CEST] <nevcairiel> if you implement it with very constrained requirements its probably pretty simple
[20:49:13 CEST] <michaelni> CFS-MP3, i can confirm it doesnt crash anymore
[20:49:29 CEST] <fritsch> BtbN: that's without the intermediate or do I overlook that?
[20:49:36 CEST] <BtbN> Yes.
[20:51:29 CEST] <jamrial_> yeah, that should be implemented in yasm
[20:51:58 CEST] <fritsch> i really wonder how many instructions that will generate at the end
[20:52:11 CEST] <BtbN> way too much, no idea what gcc is doing to it
[20:52:30 CEST] <fritsch> and I wonder about the policy to why this is needed
[20:52:36 CEST] <fritsch> for such a straight forward loop
[20:52:45 CEST] <BtbN> Uhm.
[20:52:54 CEST] <BtbN> Without the intermediate 4K buffer, it even outperforms plain memcpy
[20:53:06 CEST] <BtbN> not by much, but it's faster.
[20:53:25 CEST] <fritsch> and how fast is it with uswc memory in comparison to with 4k buffer?
[20:53:39 CEST] <BtbN> No idea, I don't have a uswc buffer at hand in the test app
[20:54:05 CEST] <fritsch> i have a feeling, that it makes sense
[20:54:09 CEST] <fritsch> for non uswc memory
[20:54:18 CEST] <fritsch> to be faster
[20:54:27 CEST] <BtbN> In that case, av_image_copy_plane could just be using that sse4 code if it's available all the time?
[20:54:46 CEST] <fritsch> if it's still fast enough for the real uswc copy
[20:54:59 CEST] <fritsch> which it should be in any ways - as it's faster
[20:55:01 CEST] <fritsch> than before
[20:55:13 CEST] <fritsch> i wonder how that performs on amd cpus
[20:55:22 CEST] <fritsch> or other platforms hitting that path
[20:55:54 CEST] <fritsch> i can benchmark on ivb, hsw, bsw, snb
[20:56:00 CEST] <fritsch> but no amd currently plugged in
[21:11:40 CEST] <BtbN> http://www.felixcloutier.com/x86/MOVNTDQA.html what's the difference between the first two?
[21:12:33 CEST] <fritsch> i think it's the same
[21:12:47 CEST] <fritsch> besides the first one works on cpus with sse41 only that don't have avx?
[21:12:50 CEST] <fritsch> does that make sense?
[21:12:51 CEST] <BtbN> The operation seems to be slightly different
[21:14:13 CEST] <jamrial_> BtbN: legacy sse and VEX encoding versions of the same instruction
[21:17:55 CEST] <jamrial_> they are the same. x86inc will emit the second if you init the function targeting avx or newer, otherwise the first
[21:18:39 CEST] <BtbN> https://gist.github.com/56be86b002a21db35de5a4b66f78c483 good patch.
[21:19:25 CEST] <jamrial_> the latter obviously has the difference of clearing the high 128 bits of ymm regs whereas the first doesn't, but that's only an issue if you mix sse and ymm vex instructions which you never should
[21:19:51 CEST] <BtbN> ah, so that's what it refers to
[21:20:21 CEST] <BtbN> The ymm and xmm registers share the first 128 bit?
[21:21:19 CEST] <jamrial_> yes. xmm regs on avx cpus are mapped to the low 128 bits of ymm regs
[21:21:34 CEST] <BtbN> Yeah, makes sense then.
[21:22:01 CEST] <BtbN> I'm quite amazed this patch actually compiles. If you configure with cpu=host that is, or something that supports those.
[21:42:43 CEST] <BtbN> turns out testing vaapi decoding stuff on my nvidia box wasn't the best idea.
[21:43:01 CEST] <BtbN> took me way too long to realize why stuff was failing horribly.
[21:45:37 CEST] <fritsch> hehe
[21:45:42 CEST] <fritsch> should I test something for you?
[21:45:55 CEST] <BtbN> Nah, just had to use another PC...
[21:46:08 CEST] <fritsch> and already some preliminary results?
[21:46:23 CEST] <BtbN> Well, using VAAPI doesn't work on Nvidia cards.
[21:46:30 CEST] <fritsch> yeah
[21:46:37 CEST] <fritsch> and even if you get the vdpau-vaapi wrapper
[21:46:41 CEST] <fritsch> it won't be uswc memory
[21:47:49 CEST] <BtbN> The box I was testing on does have an ivy bridge GPU though
[21:47:54 CEST] <BtbN> But it kept refusing to use it
[21:47:58 CEST] <BtbN> Even vainfo fails
[21:48:09 CEST] <fritsch> use your braswell
[21:48:20 CEST] <BtbN> Yeah, that's what I'm re-compiling on right now.
[22:01:27 CEST] <BtbN> hm, i barely get 60 fps out of vaapi decoding h264
[22:01:38 CEST] <BtbN> There must be some other bottleneck
[22:02:43 CEST] <fritsch> is it vsynced?
[22:02:46 CEST] <fritsch> "somehow"?
[22:02:56 CEST] <fritsch> yeah I think if vaPutSurface is used it actually is :-(
[22:04:23 CEST] <BtbN> Where does it use vaPutSurface?
[22:04:42 CEST] <fritsch> "if" <-
[22:04:58 CEST] <fritsch> read something like that recently on vaapi mailing list
[22:05:10 CEST] <BtbN> I get 55 fps with classic copying, and 77 fps with the sse4 intrinsic copy function
[22:05:37 CEST] <BtbN> with ffmpeg using around 50% CPU each time
[22:05:38 CEST] <fritsch> can you "noop" also?
[22:05:44 CEST] <fritsch> e.g. don't copy at all?
[22:05:53 CEST] <fritsch> or memset zeros?
[22:06:29 CEST] <BtbN> doing nothing hits 87 fps
[22:06:48 CEST] <BtbN> and still 40% CPU
[22:07:05 CEST] <BtbN> No idea how to properly operate that vaapi decoding in ffmpeg though
[22:07:11 CEST] <fritsch> what's the input file?
[22:07:16 CEST] <BtbN> 1080p
[22:07:58 CEST] <fritsch> 77/55 = 40 % improvement
[22:08:03 CEST] <fritsch> with 10 lines of code
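The 40 % figure, and what the frame rates quoted above imply for per-frame copy cost, can be worked out as follows. This is only arithmetic on the numbers from the log (55 fps classic copy, 77 fps SSE4 copy, 87 fps with no copy); the function names are hypothetical, and the result is rough because other work, such as the auto-inserted scaler that turns up next, was still part of each measurement.

```c
/* Relative throughput gain in percent, e.g. 55 fps -> 77 fps. */
double speedup_percent(double fps_before, double fps_after)
{
    return (fps_after / fps_before - 1.0) * 100.0;
}

/* Wall-clock time spent per frame, in milliseconds. */
double frame_ms(double fps)
{
    return 1000.0 / fps;
}

/* Extra time per frame attributable to the copy, estimated as the
 * difference between a run with the copy and a run with the copy
 * stubbed out. */
double copy_cost_ms(double fps_with_copy, double fps_without_copy)
{
    return frame_ms(fps_with_copy) - frame_ms(fps_without_copy);
}
```

With the logged values, `copy_cost_ms(55, 87)` comes to roughly 6.7 ms per frame for the classic copy and `copy_cost_ms(77, 87)` to roughly 1.5 ms for the SSE4 copy, which is a larger relative difference than the 40 % throughput gain alone suggests.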
[22:08:12 CEST] <BtbN> But I wonder what is using so much CPU there?
[22:08:15 CEST] <BtbN> The CPU should be idle!
[22:08:37 CEST] <fritsch> good question
[22:09:04 CEST] <BtbN> ./ffmpeg -threads 1 -vaapi_device /dev/dri/renderD128 -hwaccel vaapi -i ~/game.of.thrones.s01e01.1080.mkv -c:v rawvideo -sn -an -y -f null -pix_fmt nv12 /dev/null
[22:10:39 CEST] <BtbN> [auto-inserted scaler 0 @ 0x247be60] Setting 'flags' to value 'bicubic'
[22:10:43 CEST] <fritsch> haha
[22:10:44 CEST] <fritsch> ^^
[22:10:44 CEST] <BtbN> that sounds like a good candidate
[22:10:46 CEST] <fritsch> now you know
[22:10:51 CEST] <BtbN> But why?!
[22:11:07 CEST] <fritsch> it's the default sws_scale algorithm?
[22:11:17 CEST] <fritsch> can you force it to bilinear_fast or something?
[22:11:18 CEST] <BtbN> It should not insert a scaler for matching formats
[22:11:21 CEST] <BtbN> at all
[22:11:34 CEST] <BtbN> or rather, the scaler should not do anything
[22:11:44 CEST] <fritsch> change the pix_fmt?
[22:11:50 CEST] <BtbN> I did.
[22:12:03 CEST] <BtbN> [graph 0 input from stream 0:0 @ 0x244f1a0] w:1920 h:1080 pixfmt:yuv420p tb:1/1000 fr:13978/583 sar:1/1 sws_param:flags=2
[22:12:07 CEST] <fritsch> perhaps it's used to convert from input to output
[22:12:13 CEST] <BtbN> it thinks the input is yuv420p, even though vaapi will output nv12
[22:12:13 CEST] <BtbN> hm
[22:13:04 CEST] <CFS-MP3> michaelni: About this reply to send on the ML: "can you document in a comment briefly what is in the AVPackets
[22:13:04 CEST] <CFS-MP3> for most codecs that is clear, SCTE-35 is maybe a bit special
[22:13:04 CEST] <CFS-MP3> also what the dts/pts values of the AVPackets mean"
[22:13:25 CEST] <CFS-MP3> Can you point me to an example comment that meets the comment requirements? :-)
[22:13:39 CEST] <CFS-MP3> reply you sent I mean, sorry
[22:15:12 CEST] <michaelni> i dont think others are documented, its clear for most like what pts/dts are for video
[22:24:48 CEST] <fritsch> BtbN: so what happens if you say yuv420p?
[22:25:48 CEST] <BtbN> it switches to nv12 at runtime and scales.
[22:25:55 CEST] <fritsch> hehe
[22:32:52 CEST] <cone-820> ffmpeg Michael Niedermayer master:b8b36717217c: avcodec/cfhd: Increase minimum band dimension to 3
[22:34:11 CEST] <jkqxz> Are you getting the scaler because of the reinit after hwaccel is enabled?  You might be able to do better by setting -hwaccel_output_format explicitly.
[22:41:21 CEST] <BtbN> Even with that set to nv12, it always inserts a scaler.
[22:41:23 CEST] <jkqxz> Also, thoughts on hardware frame mapping such as you are looking for there are most welcome.  <https://lists.libav.org/pipermail/libav-devel/2016-July/078123.html>
[22:42:34 CEST] <BtbN> isn't hwframe_map already in ffmpeg?
[22:42:50 CEST] <durandal_17> michaelni: https://we.tl/aJmrgBc5KK
[22:43:29 CEST] <BtbN> ah, no. It uses vaPutImage...
[22:43:44 CEST] <BtbN> on unmap though oO
[22:44:00 CEST] <fritsch> BtbN: haha
[22:44:14 CEST] <fritsch> that's totally bad
[22:44:27 CEST] <fritsch> it auto scales to rgb full
[22:44:30 CEST] <jkqxz> It's kindof there for VAAPI, because it tries to do something sensible for the read/write cases.  Really this is for user-accessible mapping.
[22:44:32 CEST] <BtbN> seems like it uses vaPutSurface to copy back changed data
[22:44:49 CEST] <BtbN> Which should never happen during hwdownload
[22:45:32 CEST] <jkqxz> It only puts the data back if you were writing to the surface.
[22:46:13 CEST] <BtbN> But why does it not use vaDeriveImage, which is the fastest from what i know, if the mapping is READ?
[22:46:16 CEST] <jkqxz> (transfer_data_from() maps it read-only.)
[22:46:19 CEST] <BtbN> https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/hwcontext_vaapi.c#L717
[22:46:50 CEST] <jkqxz> See comment above.  No one has written a usable uncached copy for it.
[22:47:09 CEST] <BtbN> So, for testing my uncached copy, i should plain remove that.
[22:47:26 CEST] <jkqxz> Yes.
[22:47:27 CEST] <fritsch> wait it uses vaPutImage and not vaPutSurface
[22:47:31 CEST] <fritsch> that's a big difference
[22:48:55 CEST] <BtbN> Yes, now i see a noticeable speedup
[22:49:26 CEST] <BtbN> 200 fps with a no-op copy.
[22:50:18 CEST] <jkqxz> fritsch:  Um, yeah.  vaPutSurface() is a debug-only function for terrible output to X11 windows.
[22:50:28 CEST] <fritsch> it was "the only resort" for years
[22:50:32 CEST] <fritsch> :-(
[22:50:39 CEST] <fritsch> and used with texture from pixmap
[22:50:46 CEST] <fritsch> in kodi and anywhere else
[22:50:56 CEST] <BtbN> So, 200 fps with not copying at all.
[22:51:08 CEST] <fritsch> now let's see what you get with sse4 part
[22:51:09 CEST] <BtbN> 25 fps with plain simple sse4 copy
[22:51:14 CEST] <BtbN> 10 fps with classic copy
[22:51:22 CEST] <fritsch> mmmh
[22:51:25 CEST] <fritsch> there is something wrong
[22:51:27 CEST] <fritsch> isn't it?
[22:51:32 CEST] <BtbN> Going to put the whole 4K buffer thing in there
[22:51:58 CEST] <fritsch> kodi results: https://dl.dropboxusercontent.com/u/55728161/sse4vsputsurface.png
[22:52:17 CEST] <fritsch> you should get > 100
[22:55:53 CEST] <BtbN> even with the optimized copy function, it's slow, using tons of CPU
[22:57:45 CEST] <fritsch> can you post the diff, please?
[22:57:52 CEST] <fritsch> I don't fully get what you removed now
[22:58:03 CEST] <fritsch> and why you had 77 fps before
[22:58:15 CEST] <BtbN> just reduced the check for it to use deriveImage
[22:58:23 CEST] <BtbN> it always uses it now
[22:58:48 CEST] <BtbN> perf also confirms, 90% of the CPU time is spent in av_image_copy_plane
[22:59:56 CEST] <BtbN> https://bpaste.net/show/f1d53e60696b that's the current diff
[23:00:01 CEST] <nevcairiel> on DXVA with an optimized copy function i maybe get 2-3% CPU use and practically the same decoding speed as not copying back at all
[23:00:25 CEST] <BtbN> Yeah, something must be wrong with that copy function
[23:00:26 CEST] <nevcairiel> but thats not through ffmpeg.c
[23:00:45 CEST] <BtbN> Well, perf confirmed that it indeed is the copy function being slow
[23:01:37 CEST] <BtbN> There is no difference in speed between the cache-block and the direct approach though
[23:01:41 CEST] <BtbN> so they are equally slow
[23:04:14 CEST] <jamrial_> https://trac.ffmpeg.org/ticket/5781#comment:4 the doxy needs to be improved
[00:00:00 CEST] --- Sat Aug 20 2016