[MPlayer-dev-eng] Adding threaded functionality to mplayer NODAEMON

Roberto Ragusa mail at robertoragusa.it
Mon Sep 13 15:43:19 CEST 2004


On Sun, 12 Sep 2004 09:32:49 +0900
Attila Kinali <attila at kinali.ch> wrote:

> DO NOT SEND MAILS DIRECTLY TO DEVELOPERS!
> We all read this list!

Sorry, I reply personally and CC the list by habit.

Thank you for your detailed reply.
The discussion on threads has been practically settled through the
mail replies of the last few days.

The summary I can make is that L2 cache thrashing is (according to you
developers, who have direct experience) really important for performance.
I pointed out that a frame is almost the entire size of the cache (if not bigger),
so the filtering stage will not find hot caches after the decoding stage; but
apparently slice-by-slice processing in mplayer is more common than I thought.

> Is it possible to extract a clock from the DVB stream, w/o relying on
> the bandwidth ? If so, this should be used to adjust MPlayer's
> reference. Ie implementing a PLL in software.

It is possible; the PCR pid (usually equal to the video pid) carries the timings.
Real set-top boxes use that reference for buffer management synchronization, and
I suppose they are also able to use it to clock the TV encoder chip, which
in turn clocks the beam of your TV.
In this way you watch television at 25.001 fps or 24.999 fps, exactly as
the broadcaster wants.

The buffer fill management is easy, but I don't know how it can be implemented
in mplayer (as everything depends on audio, I messed with the quantity of
audio samples, and it worked).

The TV encoder part is difficult; maybe there is hardware support, but no driver
support. I used to do this kind of synchronization with my Amiga; it was really
easy because there was a register containing the current video coordinates and it was
readable and *writable*, so you could write(read()+7+offset) and skip or duplicate
a few pixels (7 because of the read/write delay); doing it during the vsync
worked beautifully.

> > How the reference is generated is not important.
> 
> It is. The reference needs to have a certain phase stability.

Well, I didn't mean that every bad reference can be used, I meant that
the software is ready to follow whatever reference we have.

> Nope, although schedulers are similar to state machines, they have one
> big draw back in our case: they have no idea about the dataflow within
> the "tasks" they run. While we exactly know what's going to happen and
> thus can optimize on this.

What you are saying is that instead of decoding frames 100, 101, 102 and then
filtering frames 100, 101, 102 (which is reasonable from a scheduler's point of
view), you decode and filter 100, then 101 and then 102, hoping for cache
benefits.

> > With multiple threads (on a single CPU) you're doing a similar thing,
> > but it is now preemptive multitasking. I can stop the decoding of a frame
> > to write some audio to the sound card immediately.
> 
> And destroy cache coherency completly.

OK, but in the "I have to wait 2 ms and then output an audio frame" scenario,
wouldn't having some video decoding interrupted by the audio player and then
resumed on thrashed caches be better than waiting 2 ms doing nothing? The video
decoder will resume with cache misses, but part of the work will already be done.
(Here we are not considering the fact that the system can run another process
in those 2 ms.)

> And dont forget that RAM is slow compared to L2 cache.

Maybe words like fast and slow are ambiguous, because they can refer to
latency or bandwidth.
I'd say that memory is very slow in latency terms but not too slow in bandwidth terms.

Reading 262144 bytes in random order from a hot 512 KiB cache is a lot
faster than doing that directly from memory, but maybe reading 262144
bytes in sequential order from memory is not excessively slow compared
to a read from cache.

Memories, chipsets and processors are optimized for bandwidth nowadays
(for the simple reason that killing latency is too hard, so they add
some caches and hope the working set is small enough to fit).
I'm referring to wide buses and all the prefetch and look-ahead tricks
the hardware usually does.

As a DVD frame is 622 kB, at 25 fps we have about 15 MB/s, which is not a
significant part of the available RAM bandwidth.
Estimating 15 MB/s of writes from the decoder, 15 MB/s of reads and 15 MB/s
of writes for one filter, and 15 MB/s of reads to go to the video card,
we have "only" 60 MB/s of highly sequential traffic. Today RAM peak
performance is measured in GB/s, right?

L2 cache is important for mpeg quantization tables and similar things,
sure, but for raw streaming data?
Didn't hardware designers come up with instructions to read/write memory
directly, bypassing the cache, and with explicit prefetching...? It has
been said that it's better to keep code and tables in the cache than to
thrash everything with pixels from a frame that will not fit entirely in
the cache anyway.

A similar issue is debated at the kernel level: why should we try to
cache gigabytes from VOB files during playback and discard all the
things which can be useful in the future (libraries, config files,
tmp files, ...)? See madvise(MADV_DONTNEED).


> > Am I describing DirectX Graph Flows? Am I describing gstreamer? I don't
> > know, it's just the way I see it.
> 
> Yes it's somewhat similar. But DirectX (or rather DShow) sucks because
> of its bad design and gstreamer seems to be very slow.

Sorry, DShow.

> > I hope to receive more insightful comments than flames :-)
> 
> Sorry, our master flamer got cured in tibet ;)

I suspect I lack the information needed to understand this joke.

Thank you for the nice conversation.
-- 
   Roberto Ragusa    mail at robertoragusa.it



