[FFmpeg-devel] [PATCH] lavc: support subtitles charset conversion.

Clément Bœsch ubitux at gmail.com
Thu Jan 3 09:08:58 CET 2013


On Wed, Jan 02, 2013 at 11:20:13AM +0100, Nicolas George wrote:
> On Tridi 13 Nivôse, year CCXXI, Clement Boesch wrote:
> > I was considering the UTF-16 to UTF-8 as part of the demuxing: all the
> > text decoders are currently designed to deal with text only (and I don't
> > think that's a good idea to change this); they deal with a simple ASCII
> > text string, because that's pretty straightforward.
> > 
> > We must IMO just make the demuxers output ASCII-compliant charset in all
> > the cases, which will be sent to the different text decoders, and then
> > converted to UTF-8 if necessary (as proposed in the patch).
> 
> That will work perfectly for subtitles coming from text files in strange
> encodings, but it will not work when the subtitles come from a muxed file,
> since we agree that special cases in real demuxers must be avoided.
> 
> I do not have an example currently, but finding a format that can store text
> subtitles and has a metadata field for the encoding seems quite likely.
> Matroska mandates that subtitles are in UTF-8, but I am pretty sure someone
> somewhere produced Matroska files with UTF-16 text subtitles in them, and if
> someone reports them, we will want to support them.
> 

OK

> The way I see it, recoding may need to happen either before the demuxer,
> inside the demuxer, between the demuxer and the decoder, inside the decoder
> or after the decoder. And probably any of these cases can be necessary in at
> least one situation: we need an API that can handle all.
> 
> Since your patch is about lavc, we do not have to worry about the demuxer
> part, and only the before-decoder, inside-decoder, after-decoder parts have
> to be handled.
> 
> A simple additional flag may be just enough:
> 
>     char *text_encoding;
>     unsigned char text_encoding_mode;

User configurable?

>     AV_TEXT_ENCODING_MODE_DEFAULT, //< let lavc decide

Detection based on what?

>     AV_TEXT_ENCODING_MODE_MANUAL,  //< the decoder does the work

Internally to the decoder, using the helper you're talking about below?

>     AV_TEXT_ENCODING_MODE_DONE,    //< the demuxer did the work

Internally to the demuxer, using the helper you're talking about below?

>     AV_TEXT_ENCODING_MODE_PRE,     //< lavc must recode the packet

Since lavc is not really supposed to modify the AVPacket (AFAIK), this
might be a bit painful (buffer copy before the decoding callback). Maybe it
would belong in a post-demux step, but that may be a bit problematic for the
stream selection.
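
To illustrate the copy in question, here is a minimal sketch. It is not lavc
code: struct text_pkt and pre_recode_latin1() are made-up names standing in
for the real packet and for whatever helper would end up doing the work, and
ISO-8859-1 is chosen only because it can be recoded by hand without iconv.
The point is that the input stays untouched and the decoder would be handed
the recoded copy instead.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a packet; this is NOT the real AVPacket. */
struct text_pkt {
    uint8_t *data;
    size_t   size;
};

/*
 * Recode an ISO-8859-1 payload to UTF-8 into a freshly allocated buffer,
 * leaving the source packet untouched.
 * Returns 0 on success, -1 if the allocation fails.
 */
static int pre_recode_latin1(struct text_pkt *dst, const struct text_pkt *src)
{
    size_t i, n = 0;

    assert(dst != NULL && src != NULL);

    /* Each Latin-1 byte becomes at most two UTF-8 bytes, plus a terminator. */
    dst->data = malloc(src->size * 2 + 1);
    if (!dst->data)
        return -1;

    for (i = 0; i < src->size; i++) {
        uint8_t c = src->data[i];
        if (c < 0x80) {
            dst->data[n++] = c;                 /* plain ASCII, copied as-is */
        } else {
            dst->data[n++] = 0xC0 | (c >> 6);   /* two-byte UTF-8 sequence */
            dst->data[n++] = 0x80 | (c & 0x3F);
        }
    }
    dst->data[n] = 0;
    dst->size    = n;
    return 0;
}
```

The caller owns dst->data and has to free() it once the decoding callback
returns, which is the extra bookkeeping that makes this mode a bit painful.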

>     AV_TEXT_ENCODING_MODE_POST,    //< lavc must recode the decoded text
> 

That sounds like the perfect place ;)

Except that the decoded text doesn't come with a buffer size, so only
ASCII-compliant charset conversions are possible there.
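
To make the size problem concrete: a UTF-16 buffer contains zero bytes, so
any code that only has a char pointer and relies on the 0-terminator sees a
truncated string. A minimal sketch with a hand-written UTF-16LE byte array
(visible_len() is a made-up name for illustration, nothing from lavc):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "Hi" encoded as UTF-16LE, followed by a 16-bit terminator. */
static const char utf16le_hi[] = { 'H', 0x00, 'i', 0x00, 0x00, 0x00 };

/* Hypothetical helper: the length seen by code that only gets a char pointer. */
static size_t visible_len(const char *text)
{
    assert(text != NULL);
    return strlen(text); /* stops at the first zero byte */
}
```

Here visible_len(utf16le_hi) is 1 although the array holds 6 bytes: the high
byte of 'H' already looks like the end of the string.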

> Your patch already implements POST; implementing PRE the same way would be
> pretty trivial, it is just a matter of copying an AVPacket instead of an
> AVSubtitleRect; the other cases do not need implementing at all.
> 
> > Now to make life easy for demuxers, we need to propose a few helpers to
> > transform UTF-16 input into UTF-8. The main problem I see currently is the
> > format detection of such encoding; it might require some tweaking in the
> > probing. Any idea welcome.
> 
> I suggest this:
> 
> /**
>  * Try to detect a memory buffer text encoding and convert it to UTF-8.
>  *
>  * @param[out] ret        text in UTF-8 with 0-terminator
>  * @param[in]  in         text in unknown encoding
>  * @param[in]  in_size    size of in
>  * @param[in]  encodings  comma-separated list of encodings to try (or NULL)
>  * @param[out] encoding   detected encoding
>  * @param[out] remaining  size of in that could not be recoded
>  * @return  score of the detection, or <0 error code
>  */
> int ff_recode_detect_buffer(char **ret, const char *in, size_t in_size,
>                             const char *encodings,
>                             char **encoding, size_t *remaining);
> 

Note: inside the demuxer, you don't have access to the codec charset
(options are not yet populated). Inside the decoder that's possible.
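
For the UTF-16 case specifically, a helper does not need any charset option
when the text starts with a byte order mark. The following is a rough sketch
under that assumption only: recode_utf16_bom(), rd16() and put_utf8() are
hypothetical names, the signature is much simpler than the one proposed
above (no encodings list, no score, no remaining size), and input without a
BOM is simply rejected.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Read one 16-bit unit in the given byte order. */
static uint32_t rd16(const uint8_t *p, int be)
{
    return be ? ((uint32_t)p[0] << 8) | p[1] : ((uint32_t)p[1] << 8) | p[0];
}

/* Write one code point as UTF-8; returns the number of bytes written. */
static size_t put_utf8(uint8_t *out, uint32_t cp)
{
    if (cp < 0x80) {
        out[0] = cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = 0xC0 | (cp >> 6);
        out[1] = 0x80 | (cp & 0x3F);
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = 0xE0 | (cp >> 12);
        out[1] = 0x80 | ((cp >> 6) & 0x3F);
        out[2] = 0x80 | (cp & 0x3F);
        return 3;
    }
    out[0] = 0xF0 | (cp >> 18);
    out[1] = 0x80 | ((cp >> 12) & 0x3F);
    out[2] = 0x80 | ((cp >> 6) & 0x3F);
    out[3] = 0x80 | (cp & 0x3F);
    return 4;
}

/*
 * Hypothetical helper: if `in` starts with a UTF-16 BOM, convert the rest to
 * a newly allocated 0-terminated UTF-8 string stored in *ret and return 0.
 * Returns -1 without a BOM, on malformed input, or if the allocation fails.
 */
static int recode_utf16_bom(char **ret, const uint8_t *in, size_t in_size)
{
    int be;
    size_t i, n = 0;
    uint8_t *out;

    assert(ret != NULL && in != NULL);
    if (in_size < 2 || in_size % 2)
        return -1;
    if (in[0] == 0xFE && in[1] == 0xFF)
        be = 1;
    else if (in[0] == 0xFF && in[1] == 0xFE)
        be = 0;
    else
        return -1;

    /* One 16-bit unit gives at most 3 UTF-8 bytes; a surrogate pair gives 4. */
    out = malloc((in_size / 2) * 3 + 1);
    if (!out)
        return -1;

    for (i = 2; i < in_size; i += 2) {
        uint32_t cp = rd16(in + i, be);
        if (cp >= 0xD800 && cp <= 0xDBFF) {        /* high surrogate */
            uint32_t lo;
            if (i + 3 >= in_size) { free(out); return -1; }
            lo = rd16(in + i + 2, be);
            if (lo < 0xDC00 || lo > 0xDFFF) { free(out); return -1; }
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            i += 2;                                /* skip the low surrogate */
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) { /* stray low surrogate */
            free(out);
            return -1;
        }
        n += put_utf8(out + n, cp);
    }
    out[n] = 0;
    *ret = (char *)out;
    return 0;
}
```

Note that this takes the input size explicitly, for the reason given earlier:
the UTF-16 input cannot be measured with strlen(). Files without a BOM would
still need real detection or a user-supplied charset.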

> A similar ff_read_detect_recode_stream() taking an aviobuf would be helpful
> too.
> 
> But that can come later.
> 

I must say I have a hard time following what you actually want me to do.
Can you first tell me more about what you want to expose to the user?

-- 
Clément B.