[FFmpeg-devel] [PATCH 2/3] doc/dict2: Add doc and api change for AVDictionary2
softworkz .
softworkz at hotmail.com
Fri Apr 18 01:38:32 EEST 2025
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of
> Michael Niedermayer
> Sent: Thursday, 17 April 2025 01:41
> To: FFmpeg development discussions and patches <ffmpeg-
> devel at ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH 2/3] doc/dict2: Add doc and api
> change for AVDictionary2
>
> Hi
>
> On Wed, Apr 16, 2025 at 11:15:12PM +0000, softworkz . wrote:
> >
> >
> > > -----Original Message-----
> > > From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of
> > > Michael Niedermayer
> > > Sent: Wednesday, 16 April 2025 23:48
> > > To: FFmpeg development discussions and patches <ffmpeg-
> > > devel at ffmpeg.org>
> > > Subject: Re: [FFmpeg-devel] [PATCH 2/3] doc/dict2: Add doc and api
> > > change for AVDictionary2
> > >
> > > Hi softworkz
> > >
> > > I think we should use AI to support us and reduce the workload
> > > on people.
> > > I think this here cost you money
> >
> > This is part of ongoing research for a project that is totally
> > unrelated to FFmpeg. It wasn't my own money, and it wasn't spent
> > in order to create an AVDictionary2 for FFmpeg.
> >
>
> > Also, I didn't know that you were working on it; you had written
> > that you wouldn't have time. That's why I thought it was a good subject,
>
> Yeah, I say I have no time and then spend time on it anyway ;)
I know that all too well - unfortunately 😊
> maybe that's one of several reasons why I don't have time
> But AVMap surely is/was an interesting project
>
> There are just too many interesting things to work on.
> I need more time, the days are too short, life is too short,
> and I need an assistant.
> Also, we (FFmpeg) need someone to
> manage the bug tracker better. In the past Carl did that
> (ask people questions when reports were incomplete or unreproducible,
> bisect regressions, contact people causing regressions, stuff like
> that),
> and I think we should fund Carl to do it again. But until we find
> someone funding Carl, maybe you can get some AI to do a subset of
> these tasks?
> Also, maybe we could train an LLM on the bug tracker data, so that
> we could then just ask it questions about it.
I am no expert on the subject, but from my understanding it doesn't
work like that. When a model is trained on data, the information it
"learns" needs to be reflected in multiple places in the data to
become "memorable". Singular data - like the individual tickets in
the bug tracker - is more like noise that falls off the table.
So even if the trac data were part of the training data, the model
wouldn't know about it on a per-ticket basis - only recurring
information patterns might stick, or perhaps tickets that have
been mentioned and/or discussed in multiple places within the
whole body of data.
Anyway, "training" a model from scratch requires millions of dollars
for the GPU clusters needed to compute it.
There's "fine-tuning" - a kind of additional training on top of an
existing model. But it has the same limitations, and everyone says
it still needs large amounts of data to be effective. It still
wouldn't memorize the trac database, and fine-tuning is also not
something you'd do weekly to keep it up to date.
What might be suitable for fine-tuning is the mailing list content
from the past 10 years (user and devel), but it would need to be
pre-processed to exclude mails containing patches/code and all
e-mails from the unfriendly members here - that's surely not what
you want to teach a model.
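Just to illustrate the kind of pre-processing I mean, here's a rough
Python sketch, assuming the archives are available as mbox files; the
filter rules and the block list are only placeholders:

    import mailbox

    EXCLUDED_SENDERS = {"someone at example.org"}   # hypothetical block list

    def looks_like_patch(body):
        # crude heuristic: skip anything that carries a diff
        return "diff --git" in body or "\n+++ " in body

    def usable_messages(mbox_path):
        for msg in mailbox.mbox(mbox_path):
            subject = msg.get("Subject", "")
            sender = msg.get("From", "")
            body = msg.get_payload()
            if not isinstance(body, str):
                continue               # skip multipart mails for simplicity
            if "[PATCH" in subject or looks_like_patch(body):
                continue
            if any(s in sender for s in EXCLUDED_SENDERS):
                continue
            yield subject, body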
Another option is vector databases. In this case, the data doesn't
become part of the model; it's rather a storage layer which the
model can interact with (if supported). Yet, I don't have the
impression that this is the hottest cow on the field.
More interesting are "embeddings". You pay for tokenizing the data
you supply - the same operation that happens as a first step when
you submit a message or any other input.
Those embeddings can be configured to be included in all
conversations. It's more or less the same as providing any other
input to the model - it becomes part of the conversation, but
with an important difference: it doesn't count against the context
window of the model, which is limited by its maximum supported
token length.
Embeddings would be suitable for supplying the FFmpeg source code,
all other kinds of documents, the website content, the wiki on
trac and also instructions regarding the intended behavior, etc.
But they are still not suitable for the bug tracker content.
Actually, that is not something the model needs to "know"; it
rather needs to be able to access it (just like us humans do) via
an API or browser automation.
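To make that a bit more concrete, computing and querying such
embeddings could look roughly like this - a sketch assuming an
OpenAI-style embeddings endpoint; the model name and the document
chunks are placeholders, not a recommendation:

    import numpy as np
    from openai import OpenAI

    client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment

    def embed(texts):
        resp = client.embeddings.create(model="text-embedding-3-small",
                                        input=texts)
        return np.array([d.embedding for d in resp.data])

    # e.g. chunks of the wiki, the docs, the source tree
    doc_chunks = ["...chunk 1...", "...chunk 2..."]
    doc_vecs = embed(doc_chunks)

    def most_relevant(question, k=3):
        # rank the stored chunks by cosine similarity to the question
        q = embed([question])[0]
        scores = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1)
                                 * np.linalg.norm(q))
        return [doc_chunks[i] for i in np.argsort(scores)[::-1][:k]]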
> the LLM would probably mix and confuse things and hallucinate
> a lot of nonsense.
That's less of a problem these days, as the available context
windows have increased, and operating on trac ticket discussions
doesn't create conversations so long that the context window
overflows and important parts fall off.
Some care only needs to be taken that it doesn't ingest really
large log outputs, as are sometimes included in the tickets.
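The kind of guard I have in mind is simple - cap what gets handed to
the model per log or attachment (the limit below is an arbitrary
placeholder):

    MAX_CHARS = 20_000   # arbitrary placeholder limit

    def clip_for_model(text, limit=MAX_CHARS):
        if len(text) <= limit:
            return text
        # keep head and tail; the interesting part of a log is often at the end
        half = limit // 2
        return text[:half] + "\n[... snipped ...]\n" + text[-half:]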
At this time, it would still be too bold to let it work fully
autonomously, but that's not necessary, because its operations
could easily be gated by conventional logic.
It could be controlled by a set of tags - something like:
- tracbot-error
- tracbot-inconclusive
- tracbot-needs-manual-review
- tracbot-awaiting-user-response
- tracbot-reproduced-in-master
- tracbot-fixed-in-master
Then a scheduler service would run over all open issues and
invoke the AI on each of them (see below).
The scheduler would exclude tickets which already have one of
those tags assigned.
Additionally, it would include tickets that are tagged with
"tracbot-awaiting-user-response" and have been updated since
the tag was assigned.
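A rough sketch of that scheduler pass, assuming trac's XML-RPC
plugin is enabled on the tracker; the URL, the use of the keywords
field for the tags and the helper functions are placeholders:

    import xmlrpc.client

    TRACBOT_TAGS = {"tracbot-error", "tracbot-inconclusive",
                    "tracbot-needs-manual-review",
                    "tracbot-awaiting-user-response",
                    "tracbot-reproduced-in-master",
                    "tracbot-fixed-in-master"}

    server = xmlrpc.client.ServerProxy("https://trac.example.org/login/rpc")

    def tickets_to_process():
        for tid in server.ticket.query("status!=closed"):
            _id, _created, changed, attrs = server.ticket.get(tid)
            tags = set(attrs.get("keywords", "").split())
            bot_tags = tags & TRACBOT_TAGS
            if not bot_tags:
                yield tid                          # not looked at yet
            elif bot_tags == {"tracbot-awaiting-user-response"}:
                # re-check only if the ticket changed after the tag was set
                if changed > tag_assignment_time(tid):   # hypothetical helper
                    yield tid

    for tid in tickets_to_process():
        invoke_ai_on_ticket(tid)                   # hypothetical helper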
When the AI is invoked on a ticket, it has clear instructions
to follow. The primary directive is to reproduce the reported
issue. If the provided information is unclear or incomplete,
or when no test file is provided, it posts a message asking
for the missing information and applies the
tracbot-awaiting-user-response tag.
The AI would have an execution environment in a Docker
container where it has access to a library of daily builds
from the past 5 years.
If the issue doesn't reproduce with the latest daily build,
it adds the tracbot-fixed-in-master tag.
If it can be reproduced with the latest build, it "bisects"
the issue using the daily binaries.
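The bisection itself doesn't need any AI - it's a plain binary
search over the sorted build dates. A sketch, where reproduce() is
a placeholder for whatever check a given ticket calls for:

    import subprocess

    def reproduce(build_dir, repro_args):
        # placeholder: "non-zero exit code" stands in for the actual check
        r = subprocess.run([build_dir + "/ffmpeg", *repro_args],
                           capture_output=True, timeout=300)
        return r.returncode != 0

    def first_bad_build(builds, repro_args):
        # builds: directories sorted oldest -> newest; assumes the oldest
        # build is good and the newest one reproduces the issue
        lo, hi = 0, len(builds) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if reproduce(builds[mid], repro_args):
                hi = mid
            else:
                lo = mid + 1
        return builds[lo]       # first build in which the issue appears

With roughly 1800 daily builds for 5 years, that's about eleven
runs per ticket.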
It then adds a message like "Issue reproducible since version
20xx-xx-xx" and the tag tracbot-reproduced-in-master.
If it can't make sense of the ticket, if the issue is
platform-specific or needs certain hardware, or if it runs into
errors, it adds one of the other tags.
Some safeguards must be added to prevent anybody from getting
into a longer chat with it (which would always end with
awaiting-user-response), but otherwise, I don't think
there's much that can go wrong.
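One simple safeguard, sketched below: count how often the bot has
already replied on a ticket and hand the ticket over to a human
beyond a small limit (the helper names are placeholders):

    MAX_BOT_REPLIES = 2

    def may_reply(ticket_id):
        comments = get_ticket_comments(ticket_id)          # hypothetical helper
        bot_replies = [c for c in comments if c.author == "tracbot"]
        if len(bot_replies) >= MAX_BOT_REPLIES:
            apply_tag(ticket_id, "tracbot-needs-manual-review")  # hypothetical
            return False
        return True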
A mailing list could be set up to which it reports its
operations and to which interested members (or anybody)
can subscribe. This would provide a kind of real-time
monitoring by the community.
All in all, I think it's quite doable.
Unfortunately, though, I cannot spend that much time on it myself.
Perhaps a candidate for GSoC?
Best,
sw