[MPlayer-dev-eng] Detecting icy info charset
Reimar Döffinger
Reimar.Doeffinger at gmx.de
Wed Aug 17 21:39:18 CEST 2011
On Wed, Aug 17, 2011 at 05:10:34PM +0300, Timur Aydin wrote:
> On 08/17/11 15:48, Nicolas George wrote:
> > Walk the string; if it is valid UTF-8 until the end, then treat it as UTF-8
> > (this also takes care of the ASCII case). If you encounter a byte sequence
> > that is not valid in UTF-8, consider the string as being in the user's
> > locale, as defined by the LC_CTYPE category. If that fails, fall back to
> > ISO-8859-1.
> >
> > The rationale for this is:
> >
> > - UTF-8 is quite recognizable; a string in a legacy 8-bit encoding has
> > little chance of also being valid UTF-8.
> >
> > - If someone has their locale set to a Russian encoding, they are more likely
> > to listen to Russian radio stations than Greek ones.
> >
>
> Hmm, I guess statistically, this would work most of the time. But as you
> mentioned, there are characters that are both valid UTF-8 and a valid
> member of other charsets.
No, not really in the way I guess you meant it.
Unless a string uses only values in the ASCII range (0-127), it is very
unlikely to be valid UTF-8 without actually being UTF-8.
So the real issue only starts when something is not UTF-8.
There's libenca but it's generally not very useful.
Some of the Chinese and Japanese encodings probably are reasonably
auto-detectable, too.
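
For illustration, the fallback chain Nicolas describes above (strict UTF-8
first, then the locale's charset, then ISO-8859-1) could be sketched roughly
like this in Python. The function name decode_icy and its locale_encoding
parameter are made up for this example; this is not MPlayer code, just a
minimal sketch of the idea.

```python
import locale
from typing import Optional, Tuple


def decode_icy(raw: bytes, locale_encoding: Optional[str] = None) -> Tuple[str, str]:
    """Decode an ICY metadata string, returning (text, encoding_used).

    Hypothetical helper: tries strict UTF-8, then the locale charset,
    then ISO-8859-1 as the last resort.
    """
    # Step 1: strict UTF-8. Pure ASCII passes here as well.
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    # Step 2: the user's locale charset (derived from LC_CTYPE),
    # unless the caller passes one explicitly.
    enc = locale_encoding or locale.getpreferredencoding(False)
    try:
        return raw.decode(enc), enc
    except (UnicodeDecodeError, LookupError):
        # Invalid bytes for that charset, or a charset name Python
        # does not know.
        pass

    # Step 3: ISO-8859-1 maps every byte to a character, so it cannot fail.
    return raw.decode("iso-8859-1"), "iso-8859-1"
```

With a KOI8-R locale, a KOI8-R encoded title fails step 1 and is picked up by
step 2. Note that a single-byte locale charset that defines all 256 byte
values will accept anything, so step 3 is only reached for locales such as
UTF-8 or plain ASCII.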
> Right now I have assembled a list of radio stations that use certain
> charsets. For each one of them, I will use Wireshark to see if
> the HTTP headers give a hint as to what encoding is in effect...
Uh, mplayer -v prints the headers.
And there usually isn't anything useful.