[Ffmpeg-devel] Character encoding in libavformat header

Fri Apr 28 14:20:32 CEST 2006

Hi,

I'm trying to use ffmpeg for ASF demuxing and WMA decoding, mostly in
the XMMS2 project's official asf/wma plugin. I already made a forked
version of it where I ripped out just the ASF and WMA decoding parts
and their supporting functions. It worked fine, but it was ugly as
hell. Now most Linux and *BSD distributions seem to ship statically
linkable versions of the ffmpeg libraries, so I thought, why not use
one of those, since that would make all the security patches and such
other people's responsibility. However, one major obstacle got in my
way...

The ASF demuxer stores all the header information (title, author,
description) as ISO-8859-1, even though the ASF file format natively
uses UCS-2 (that is, UTF-16; I don't know for sure, but I suspect it
doesn't support surrogates). The get_str16_nolen function in asf.c
goes as follows:

static void get_str16_nolen(ByteIOContext *pb, int len, char *buf, int buf_size)
{
     int c;
     char *q;

     q = buf;
     while (len > 0) {
         c = get_le16(pb);
         if ((q - buf) < buf_size - 1)
             *q++ = c;
         len-=2;
     }
     *q = '\0';
}

As you can see, it simply drops the high byte of every 16-bit
character. The result isn't even recognizable ISO-8859-1 text if the
header contains characters with code points above 255. So it should
at least do a check like: *q++ = (c > 255) ? '?' : c; to make sure
that all unrepresentable characters are shown as ? characters instead
of garbage.
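
To illustrate that check, here is a rough sketch. It is only an
illustration under stated assumptions: utf16le_to_latin1 is a made-up
helper name, not an existing ffmpeg function, and it reads from a plain
memory buffer instead of a ByteIOContext so that it compiles on its
own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of libavformat: converts a UTF-16LE
 * byte buffer to ISO-8859-1 and writes '?' for every code unit above
 * 0xFF. The result is always NUL-terminated and truncated to fit. */
static void utf16le_to_latin1(const uint8_t *src, size_t len,
                              char *buf, size_t buf_size)
{
    char *q = buf;
    size_t i;

    assert(buf_size > 0);
    for (i = 0; i + 1 < len; i += 2) {
        /* assemble one little-endian 16-bit code unit */
        unsigned c = src[i] | (src[i + 1] << 8);
        if ((size_t)(q - buf) < buf_size - 1)
            *q++ = (c > 255) ? '?' : (char)c;
    }
    *q = '\0';
}
```

With the input bytes 41 00 E4 00 AC 20 (A, a-umlaut, euro sign) this
should give 'A', the byte 0xE4, and '?'.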

Even better would be to re-encode it into UTF-8, which is trivial to
say the least, or alternatively to have some way to access the
original raw header data. The advantage of UTF-8 would of course be
that it can be handled the same way as an ISO-8859-1 string, as a
plain NUL-terminated char string. The disadvantage is that characters
in the range [128, 255] wouldn't show correctly in applications that
treat the strings as ISO-8859-1. Has ffmpeg made some decision about
the internal metadata character encoding?
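
To show what I mean by the UTF-8 re-encoding, here is a rough sketch.
Again these are assumptions, not ffmpeg code: ucs2le_to_utf8 is a
made-up name, the input is a memory buffer rather than a
ByteIOContext, and the data is treated as plain UCS-2, so surrogate
code units are simply written as '?'.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of libavformat: re-encodes a UCS-2
 * little-endian byte buffer as UTF-8. Surrogate code units
 * (0xD800-0xDFFF) are written as '?' because this sketch assumes plain
 * UCS-2. A character is emitted only if it fits completely, so the
 * output is never cut in the middle of a multi-byte sequence. */
static void ucs2le_to_utf8(const uint8_t *src, size_t len,
                           char *buf, size_t buf_size)
{
    size_t pos = 0;
    size_t i;

    assert(buf_size > 0);
    for (i = 0; i + 1 < len; i += 2) {
        unsigned c = src[i] | ((unsigned)src[i + 1] << 8);
        unsigned char out[3];
        size_t n, k;

        if (c >= 0xD800 && c <= 0xDFFF)
            c = '?';

        if (c < 0x80) {                 /* 1 byte: 0xxxxxxx */
            out[0] = (unsigned char)c;
            n = 1;
        } else if (c < 0x800) {         /* 2 bytes: 110xxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xC0 | (c >> 6));
            out[1] = (unsigned char)(0x80 | (c & 0x3F));
            n = 2;
        } else {                        /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xE0 | (c >> 12));
            out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[2] = (unsigned char)(0x80 | (c & 0x3F));
            n = 3;
        }

        if (pos + n > buf_size - 1)
            break;  /* no room left for the whole character */
        for (k = 0; k < n; k++)
            buf[pos++] = (char)out[k];
    }
    buf[pos] = '\0';
}
```

A version reading from the actual stream would still have to consume
the remaining len bytes after the buffer fills up, like the current
loop does, instead of breaking out early.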

Our goal is to support metadata and charsets as well as possible so  
this is really an important issue. I'd very much like to hear some  
comments about the issue.

Juho Vähä-Herttua

P.S. Please keep me in the cc while replying since I'm not on this  
mailing list.