[MPlayer-cvslog] r34443 - trunk/sub/subreader.c
Ivan Kalvachev
ikalvachev at gmail.com
Sat Dec 17 12:04:51 CET 2011
On 12/14/11, Reimar Döffinger <Reimar.Doeffinger at gmx.de> wrote:
> On Wed, Dec 14, 2011 at 01:17:50AM +0100, iive wrote:
>> Author: iive
>> Date: Wed Dec 14 01:17:49 2011
>> New Revision: 34443
>>
>> Log:
>> Avoid double conversion from utf16/ucs2 to utf8 for text subtitles.
>>
>> There is code that auto-detects utf16 encoding of the subtitle stream
>> and forces the reading functions to convert it to native utf8.
>> The bug happens when using enca to (correctly) guess that the input
>> file has ucs2 encoding and tries to convert the input stream to utf8,
>> again.
>
> Hm, wouldn't it be better to change the enca code so it never sees any
> utf16 encoded data in the case we convert ourselves?
> Or what else is going wrong here?
> Or is there a reset of enca state missing at some point or such?
It's not enca at fault here. The whole utf16 handling is wrong.
To provide reliable detection enca needs as much data as possible,
that's why it is provided with a big buffer of the raw file content.
The utf16 autodetection, however, uses line-by-line reading
functions while detecting the subtitle format.
One line would not be enough to provide reliable detection of the
encoding. E.g. srt subtitles start with distinctive headers that
use only ascii characters, and almost all encodings have ascii as a subset.
On the other hand the utf16 detection relies on the fact that
subtitles would be using the ascii subset to recognize the multi-byte
encoding.
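To illustrate the kind of heuristic meant here, below is a minimal sketch, not the actual MPlayer detection code. It assumes the start of the file is mostly ascii (as with srt headers) and counts 16-bit units where one byte is zero and the other is an ascii character. The names guess_utf16 and enum utf16_guess are made up for this example.

```c
#include <stddef.h>

enum utf16_guess { NOT_UTF16 = 0, UTF16_LE, UTF16_BE };

/* Hypothetical heuristic: guess UTF-16 byte order from a short buffer
 * that is expected to contain mostly ASCII text. */
static enum utf16_guess guess_utf16(const unsigned char *buf, size_t len)
{
    size_t i, pairs = len / 2, le = 0, be = 0;

    /* A byte order mark settles the question immediately. */
    if (len >= 2) {
        if (buf[0] == 0xFF && buf[1] == 0xFE) return UTF16_LE;
        if (buf[0] == 0xFE && buf[1] == 0xFF) return UTF16_BE;
    }

    /* Count 16-bit units that look like an ASCII character
     * padded with a zero byte on one side or the other. */
    for (i = 0; i + 1 < len; i += 2) {
        if (buf[i] != 0 && buf[i] < 0x80 && buf[i + 1] == 0)
            le++;
        if (buf[i] == 0 && buf[i + 1] != 0 && buf[i + 1] < 0x80)
            be++;
    }

    if (pairs == 0)
        return NOT_UTF16;
    /* Assumed threshold: more than half of the units must match. */
    if (le * 2 > pairs) return UTF16_LE;
    if (be * 2 > pairs) return UTF16_BE;
    return NOT_UTF16;
}
```

A check like this only works on input that really is mostly ascii, which is exactly the assumption described above, and it tells you nothing about which 8-bit encoding a non-utf16 file uses.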
The utf16 solution seems to stem from the fact that you can't use
stock library string functions on utf16 data, so the stream reading
functions need to do the conversion on their own, otherwise we won't even
find the line ending. As an end result, utf16 function parameters have
propagated into almost all subtitling code.
I think the proper solution would be for the stream line reading
function to read a raw buffer of bytes, convert it using iconv, and only
then try to find the line end. The rest of the code would work as before.
This would allow other stuff like overriding of the detected encoding
(e.g. for the case where ucs2 and utf16 differ).
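A rough sketch of that idea, assuming the source encoding is already known (detected or overridden by the user) and the whole raw buffer fits in memory. The helper names raw_to_utf8 and next_line are hypothetical and do not exist in subreader.c; only the iconv calls are real library functions.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert a raw buffer to NUL-terminated UTF-8.
 * Returns the number of bytes written (excluding the NUL),
 * or (size_t)-1 on any failure. */
static size_t raw_to_utf8(const char *from_enc, const char *in, size_t in_len,
                          char *out, size_t out_size)
{
    iconv_t cd;
    char *inp = (char *)in;      /* iconv() takes a non-const pointer */
    char *outp = out;
    size_t in_left = in_len;
    size_t out_left;
    size_t rc;

    if (out_size == 0)
        return (size_t)-1;
    out_left = out_size - 1;     /* keep room for the terminating NUL */

    cd = iconv_open("UTF-8", from_enc);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    rc = iconv(cd, &inp, &in_left, &outp, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;       /* invalid, truncated, or too large input */

    *outp = '\0';
    return (size_t)(outp - out);
}

/* Hypothetical helper: return the next line of a converted buffer.
 * The line ending (\n or \r\n) is overwritten with NUL and *pos is
 * advanced past it. Returns NULL when the buffer is exhausted. */
static char *next_line(char **pos)
{
    char *line = *pos;
    char *eol;

    if (line == NULL || *line == '\0')
        return NULL;

    eol = strchr(line, '\n');    /* safe now: the data is UTF-8 */
    if (eol) {
        if (eol > line && eol[-1] == '\r')
            eol[-1] = '\0';
        *eol = '\0';
        *pos = eol + 1;
    } else {
        *pos = line + strlen(line);
    }
    return line;
}
```

With something like this, the format detection and parsing code would only ever see utf8 lines and would not need a utf16 parameter. A real reader working in chunks would also have to carry over an incomplete multi-byte sequence at the end of each chunk, which this sketch simply treats as an error.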
Unfortunately, I don't have the time or incentive to mess with the
subtitle code at the moment.