[FFmpeg-devel] [PATCH] libavcodec: Do not return encoding errors when -sub_charenc_mode is do_nothing

Fri Aug 30 00:08:27 CEST 2013

On 29 aug. 2013, at 22:16, Nicolas George <nicolas.george at normalesup.org> wrote:
>> I’m also curious to hear how you plan to handle the encoding detection
>> (e.g. for an SRT file) or if you think that’s the responsibility of the
>> user.
> 
> My plan is mostly to imitate Vim's behaviour: let the user specify a list of
> encodings, try them each until one works, and recognize obvious signs such
> as byte order marks.

Hmm, sorry, that’s not the solution I’m looking for.  I want something so that I can just pass in an SRT file and FFmpeg will figure out the encoding.  
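
For reference, the list-based approach quoted above boils down to roughly the following.  This is only a Python sketch, not FFmpeg code; the function name and the candidate list are made up for illustration.

```python
import codecs

# Byte order marks to recognise before trying the user's list.
# UTF-32 is checked before UTF-16 because the UTF-32-LE BOM
# begins with the same two bytes as the UTF-16-LE BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_candidates(data, candidates):
    """Return (encoding, text) for the first encoding that decodes cleanly.

    Hypothetical helper: a BOM wins outright, otherwise the candidates
    are tried in the order given by the user.
    """
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):].decode(name)
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding fits the input")
```

Note that "works" here only means "decodes without error".  That is a useful test for UTF-8, but a single-byte encoding in the list accepts almost any input, so the loop effectively stops at the first one it reaches.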

For Vim it’s okay to make a wrong guess as long as the encodings are binary compatible (e.g. the ISO-8859-* family).  Even if you make some changes in Vim, you can write out the file without breaking anything, because the bytes you didn’t touch stay the same.  FFmpeg will always need to convert the input to UTF-8, so, for example, mistaking ISO-8859-2 for ISO-8859-3 means every Ś will turn into a Ĥ.
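
To make the ISO-8859-2 vs. ISO-8859-3 example concrete, here is a small Python sketch (the sample word is just an illustration).  Both decodes succeed without any error, so nothing signals that the second one is wrong.

```python
# A subtitle fragment as it would be stored in an ISO-8859-2 file.
raw = "Śnieg".encode("iso8859_2")

# Decoding with the correct charset gives back the original text.
right = raw.decode("iso8859_2")

# Decoding with ISO-8859-3 also succeeds, but byte 0xA6 now maps
# to a different letter, and that wrong letter is what would be
# converted to UTF-8 and written to the output.
wrong = raw.decode("iso8859_3")

print(right, right.encode("utf-8"))
print(wrong, wrong.encode("utf-8"))
```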

It’s very important to realize that character encoding detection is not something that can be done in an exact manner.  Please take a look at the charset detectors of the ICU project (the files starting with ‘cs’ in the ‘i18n’ dir of the sources) and the statistical models they use to calculate ‘confidence’ scores for encodings.  It’s also good to see how libicu is used in projects like Chromium.

ICU is a very large library (although it’s probably also widely installed). A smaller solution, albeit only in C++, is Mozilla’s ‘universalchardet’ (http://lxr.mozilla.org/mozilla-release/source/extensions/universalchardet/src/). I don’t have any experience using it though.

>> I’m not entirely sure that all formats and tools can be trusted though.
> 
> It is probably a dangerous assumption indeed, but I believe you should not
> try to spend time on how to handle the situation until it actually occurs
> for you, just be sure you can detect it.
> 
> That makes me realize: disabling the check would allow ffmpeg to produce
> just that kind of invalid files: S_TEXT in Matroska is specified as UTF-8,
> while ffmpeg would just copy the encoding of the input file. It is IMHO a
> very good reason not to disable it.

I would expect the Matroska muxer to enforce that, not the decoder.

Regards,

Eelco Lempsink
