[FFmpeg-devel] [RFC] Type descriptors

Thu Dec 31 22:15:45 EET 2020

On Thu Dec 31 15:35:38 EET 2020, Nicolas George <george at nsup.org> wrote:

> …For each simple type, including enumerations like AVColorRange and flat
> structures like AVReplayGain, have a set of standardized functions for
> common operations, including probably:
>
> - printing;
> - serializing to string;
> - parsing from string;
> …
> These functions will have a standardized name and prototype. They will be
> grouped in structures that describe a type entirely.
>
> Note: this project requires a good unified string API.

This relates to one of FFmpeg's imperfections: it writes human-readable
text to stdout and stderr in an unpredictable and inconsistent encoding.
That output should be 100% consistently encoded. I suggest the encoding
be Unicode in the UTF-8 encoding form.

One place where FFmpeg's inconsistent encoding caused me a problem was
when I was operating on a QuickTime video. ffmpeg (or perhaps ffprobe)
printed a 4-byte QuickTime tag literally to stdout. The tag's byte
sequence was not valid UTF-8, so the output as a whole was no longer
valid UTF-8 text. That tag, being arbitrary binary data, should have
been escaped, printed in hex, or otherwise represented in valid UTF-8.
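
A minimal sketch of the kind of escaping meant here, written in Python
for brevity rather than as FFmpeg code. The function name tag_to_text
and the sample tags are illustrative assumptions; the tag from the
incident above is not known.

```python
def tag_to_text(tag: bytes) -> str:
    """Return printable ASCII unchanged; escape every other byte as \\xNN."""
    parts = []
    for byte in tag:
        # Keep printable ASCII, except the backslash itself, which is
        # escaped so that the output stays unambiguous.
        if 0x20 <= byte <= 0x7E and byte != 0x5C:
            parts.append(chr(byte))
        else:
            parts.append(f"\\x{byte:02x}")
    return "".join(parts)


# Hypothetical sample tags, chosen only for illustration.
print(tag_to_text(b"avc1"))     # prints: avc1
print(tag_to_text(b"\xa9nam"))  # prints: \xa9nam
```

The result is pure ASCII, so it is valid UTF-8 whatever bytes the tag
contained.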

I suggest that the type descriptor[1] and Unified string / stream API[2] 
proposals offer a good opportunity to define two separate data types: 
string of text, and stream of bytes. Define encode functions to 
transform text into bytes, and decode functions to transform bytes into 
text. The Python language's str, bytes, and codecs architecture[3] is a
pretty good model.
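
For readers unfamiliar with that model, a short sketch of how Python
keeps the two types apart. The wrapper names encode_text and
decode_bytes are assumptions made for this example; str.encode and
bytes.decode are the real built-in methods.

```python
def encode_text(text: str) -> bytes:
    """Transform a string of text into a stream of bytes (UTF-8)."""
    return text.encode("utf-8")


def decode_bytes(data: bytes) -> str:
    """Transform UTF-8 bytes back into a string of text."""
    return data.decode("utf-8")


text = "caf\u00e9"            # four characters of text
data = encode_text(text)      # five bytes: the last character needs two
print(len(text), len(data))   # prints: 4 5

# The two types do not mix implicitly; joining them is a type error.
try:
    text + data
except TypeError:
    print("str and bytes cannot be concatenated")
```

The point of the separation is that every conversion between text and
bytes is an explicit call that names an encoding.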

I suggest that FFmpeg define that strings of text always be stored as 
UTF-8 code units. An argument could be made for defining strings of text 
as being in any encoding, as long as every single string instance be 
clearly labelled with its text encoding. (Specifying that all text is in 
UTF-8 achieves clear labelling with no code.) I suggest requiring that 
only validly-encoded data shall be permitted in text strings.
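
One way to enforce the "validly-encoded only" rule is to check bytes at
the point where they are admitted as text. A sketch in Python, with
is_valid_utf8 as an assumed helper name; it relies on Python's strict
UTF-8 decoder, which raises UnicodeDecodeError on malformed input.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True only if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8("caf\u00e9".encode("utf-8")))  # prints: True
print(is_valid_utf8(b"\xa9nam"))                   # prints: False
```

Data that fails such a check would stay a byte stream and never become
a text string.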

FFmpeg code often operates on byte-granularity binary data. Such data
should have data types distinct from "string", because it is not text.

FFmpeg generates human-readable output to stdout, to stderr, and to 
logs. I suggest that all this output be required to be text strings, 
preferably always in UTF-8. Any arbitrary binary data written to 
human-readable output must be encoded or escaped somehow, so that it is 
represented as valid text.
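
A sketch of what such an output routine could look like, again in
Python and not based on any existing FFmpeg function. The name
write_line and its parameters are assumptions for illustration. Text is
always encoded as UTF-8, and any binary payload is shown in hex.

```python
import io


def write_line(stream, message: str, payload: bytes = b"") -> None:
    """Write one line of UTF-8 text to a binary stream.

    message is text; payload is arbitrary binary data, rendered in hex
    so that the bytes written are always valid UTF-8.
    """
    line = message
    if payload:
        line += " 0x" + payload.hex()
    stream.write((line + "\n").encode("utf-8"))


# An in-memory stream stands in for stdout, stderr, or a log file.
out = io.BytesIO()
write_line(out, "unknown tag:", b"\xa9nam")
print(out.getvalue())  # prints: b'unknown tag: 0xa96e616d\n'
```

Because the routine accepts text and bytes as separate arguments, raw
bytes cannot reach the output without being converted first.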

[1] https://ffmpeg.org/pipermail/ffmpeg-devel/2020-December/274170.html
[2] https://ffmpeg.org/pipermail/ffmpeg-devel/2020-December/274169.html
[3] https://docs.python.org/3/howto/unicode.html

This is an ambitious project. Good luck with it!
        --Jim DeLaHunt, Vancouver, Canada