[FFmpeg-devel] [RFC] function to check for valid UTF-8 string
Rich Felker
dalias
Tue Dec 11 05:43:57 CET 2007
On Mon, Dec 10, 2007 at 08:38:56PM +0100, Reimar Döffinger wrote:
> Hello,
> On Mon, Dec 10, 2007 at 11:33:57AM -0500, Rich Felker wrote:
> > On Mon, Dec 10, 2007 at 11:01:59AM -0500, Rich Felker wrote:
> > > Validating UTF-8 is trivial. Again see the ABNF. If you don't want to
> > > write the code I'll write it...
> >
> > I just wrote (or rather adapted from libc) the code but I don't have
> > time to check for mistakes at the moment and I don't feel like being
> > ridiculed for any silly errors I made. I'll post it later once I
> > reread and test it.
>
> I'm interested to see what you did, but I think I will clearly win the
> ugliness contest with this (I doubt I want this to actually be used,
> though it is interesting that gcc actually manages to unroll the inner
> loop with -O3 and replaces the arrays by and/cmp with constants):
>
> const char *check_utf8(const char *in) {
>     static const uint32_t masks[]    = {0xf8c0c0c0, 0xf0c0c0, 0xe0c0, 0x80};
>     static const uint32_t vals[]     = {0xf0808080, 0xe08080, 0xc080, 0x00};
>     static const uint32_t anymasks[] = {0x07300000, 0x0f2000, 0x1e00, 0x7f};
>     const uint8_t *str = in;
>     while (*str) {
>         long i = 3;
>         uint32_t v = *str++;
>         while ((v & masks[i]) != vals[i] || !(v & anymasks[i])) {
>             if (--i < 0 || !*str) return str - 3 + i;
>             v = (v << 8) | *str++;
>         }
>     }
>     return NULL;
> }
Your code is incorrect. It considers "ed a0 80" and "f5 80 80 80"
valid, contrary to the definition of UTF-8 and just like the buggy
UTF-8 decoder already in ffmpeg.
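To see why both sequences must be rejected, it helps to decode them
naively. A minimal sketch, assuming a hypothetical helper naive_decode
(not part of ffmpeg or of the code above) that only strips the marker
bits and does no range checking at all:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one 2- to 4-byte sequence by stripping the
 * marker bits only. It deliberately performs no validity checks. */
static uint32_t naive_decode(const unsigned char *p, size_t len)
{
    assert(p != NULL && len >= 2 && len <= 4);
    /* The lead byte keeps 5, 4 or 3 payload bits for lengths 2, 3 and 4. */
    uint32_t cp = p[0] & (0xffu >> (len + 1));
    /* Each following byte contributes its low 6 bits. */
    for (size_t i = 1; i < len; i++)
        cp = (cp << 6) | (p[i] & 0x3fu);
    return cp;
}
```

With this, "ed a0 80" comes out as 0xD800, which is a surrogate, and
"f5 80 80 80" comes out as 0x140000, which is beyond the 0x10FFFF limit.
Neither may be encoded in UTF-8.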
Here is my implementation. Feel free to optimize as long as you keep
it correct:
int is_valid_utf8(const unsigned char *s)
{
    /* bounds table to use for 0xe0 thru 0xf4 lead bytes */
    static const unsigned char bmap[] = {
        1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 2, 0, 0,
        3, 0, 0, 0, 4
    };
    /* valid byte bounds tables in the form { start, length } */
    static const unsigned char bounds[][2] = {
        { 0x80, 0x40 }, { 0xa0, 0x20 }, { 0x80, 0x20 },
        { 0x90, 0x30 }, { 0x80, 0x10 }
    };
    unsigned b, i;
    unsigned char k;
    while ((b=*s++)) {
        if (b < 0x80) continue;
        else if (b - 0xc2 > 0xf4 - 0xc2) return 0;
        k = b << 1;
        if (b < 0xe0) i = 0;
        else i = bmap[b-0xe0];
        if ((unsigned)*s++ - bounds[i][0] >= bounds[i][1]) return 0;
        while (((k<<=1) & 0x80))
            if ((unsigned)*s++ - 0x80 >= 0x40) return 0;
    }
    return 1;
}
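
For cross-checking, the same byte ranges can be spelled out case by
case, following the ABNF in RFC 3629. This is only a sketch for
comparison, not a replacement for the table-driven version; the names
in_range and utf8_valid_ref are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: nonzero if lo <= c <= hi. */
static int in_range(unsigned char c, unsigned char lo, unsigned char hi)
{
    return c >= lo && c <= hi;
}

/* Hypothetical reference validator for a NUL-terminated string.
 * Returns 1 if the whole string is valid UTF-8, 0 otherwise. */
int utf8_valid_ref(const unsigned char *s)
{
    assert(s != NULL);
    while (*s) {
        unsigned char b = *s++;
        unsigned char lo = 0x80, hi = 0xbf; /* bounds for the second byte */
        int tail;                           /* bytes after the second one */

        if (b < 0x80) continue;             /* ASCII */
        if (in_range(b, 0xc2, 0xdf)) {
            tail = 0;                       /* c0 and c1 would be overlong */
        } else if (in_range(b, 0xe0, 0xef)) {
            tail = 1;
            if (b == 0xe0) lo = 0xa0;       /* reject overlong 3-byte forms */
            if (b == 0xed) hi = 0x9f;       /* reject surrogates */
        } else if (in_range(b, 0xf0, 0xf4)) {
            tail = 2;
            if (b == 0xf0) lo = 0x90;       /* reject overlong 4-byte forms */
            if (b == 0xf4) hi = 0x8f;       /* reject values above 0x10FFFF */
        } else {
            return 0;                       /* stray continuation or bad lead */
        }

        /* A NUL here fails the range test, so a truncated sequence is
         * rejected without reading past the terminator. */
        if (!in_range(*s++, lo, hi)) return 0;
        while (tail-- > 0)
            if (!in_range(*s++, 0x80, 0xbf)) return 0;
    }
    return 1;
}
```

Running both functions over the same inputs, including "ed a0 80",
"f5 80 80 80" and truncated sequences, is a cheap way to catch a
mistake introduced while optimizing.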
Rich