[FFmpeg-devel] [PATCH] lavu/avstring: add av_get_utf8() function
Stefano Sabatini
stefasab at gmail.com
Wed Nov 20 13:02:16 CET 2013
On date Saturday 2013-11-16 11:57:07 +0100, Nicolas George encoded:
> On quartidi 24 Brumaire, year CCXXII, Stefano Sabatini wrote:
> > Another interface optimization would be to document that *code is set
> > whenever the sequence is structurally valid, even if the code range is
> > not accepted.
>
> Not sure how it can be done, since you still need to be able to distinguish
> cases where the UTF-8 is really invalid.
I can set *codep whenever the sequence is structurally valid, even if
the code is then rejected by the range checks, and mark it as unset
otherwise. Check the updated test program.
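To make that contract concrete, here is a minimal sketch. It is not the
code from the patch: sketch_utf8_decode(), sketch_classify() and the
SKETCH_* names are hypothetical, the -1 sentinel is only an assumption
about what "unset" could mean, and the range checks are reduced to
surrogates and codes above 0x10FFFF, with no flags.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CODE_UNSET (-1) /* hypothetical sentinel, not in the patch */
#define SKETCH_ERR        (-1) /* stand-in for AVERROR(EILSEQ) */

/* Simplified stand-in for the decoder under discussion. */
static int sketch_utf8_decode(int32_t *codep, const uint8_t **bufp,
                              const uint8_t *buf_end)
{
    const uint8_t *p = *bufp;
    uint64_t code, top;
    int ret = 0;

    *codep = SKETCH_CODE_UNSET;
    if (p >= buf_end)
        return 0;

    code = *p++;
    /* a lead byte cannot be 10xx-xxxx, 0xFE or 0xFF */
    if ((code & 0xC0) == 0x80 || code >= 0xFE) {
        ret = SKETCH_ERR;
        goto end;
    }
    top = (code & 0x80) >> 1;

    while (code & top) {
        uint8_t b;
        if (p >= buf_end) {          /* incomplete sequence */
            ret = SKETCH_ERR;
            goto end;
        }
        b = *p++;
        if ((b & 0xC0) != 0x80) {    /* not a continuation byte */
            ret = SKETCH_ERR;
            goto end;
        }
        code = (code << 6) + (b & 0x3F);
        top <<= 5;
    }
    code &= (top << 1) - 1;          /* drop the length marker bits */

    if (code >= (UINT64_C(1) << 31)) {
        ret = SKETCH_ERR;
        goto end;
    }

    /* structurally valid: always report the code ... */
    *codep = (int32_t)code;
    /* ... even when the range is then rejected */
    if (code > 0x10FFFF || (code >= 0xD800 && code <= 0xDFFF))
        ret = SKETCH_ERR;

end:
    *bufp = p;
    return ret;
}

enum sketch_result { SKETCH_OK, SKETCH_RANGE_REJECTED, SKETCH_MALFORMED };

/* Caller side: tell "really invalid UTF-8" apart from "valid encoding of
 * a code we do not accept" by looking at the returned code. */
static enum sketch_result sketch_classify(const uint8_t *buf, size_t len,
                                          int32_t *code)
{
    const uint8_t *p = buf;
    int ret = sketch_utf8_decode(code, &p, buf + len);

    if (ret >= 0)
        return SKETCH_OK;
    return *code == SKETCH_CODE_UNSET ? SKETCH_MALFORMED
                                      : SKETCH_RANGE_REJECTED;
}
```

With this shape, a caller that wants to be lenient about surrogates can
still use the decoded value after an error, while a stray continuation
byte or a truncated sequence remains distinguishable.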
> OTOH, a flag to automatically return a replacement character when an invalid
> sequence is detected could be useful, but that can come later.
The problem is that you need to specify which code, or which sequence,
to use instead. This would also need a separate function, and it's not
clear how much control that function should provide before becoming
too bloated.
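For illustration only, the smallest form such a separate function could
take is a wrapper where the caller passes the replacement code. All
names below are hypothetical, and the ASCII-only decoder is a toy
stand-in so that the sketch is self-contained; it is not the decoder
from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Same shape as the decoder in the patch, so any function with this
 * signature can be plugged in. */
typedef int (*sketch_decode_fn)(int32_t *codep, const uint8_t **bufp,
                                const uint8_t *buf_end, unsigned int flags);

/* Hypothetical wrapper: on a decoding error, report `replacement`
 * instead of failing. Returns 1 when a substitution took place,
 * otherwise the decoder's own (non-negative) return value. */
static int sketch_decode_or_replace(sketch_decode_fn decode, int32_t *codep,
                                    const uint8_t **bufp,
                                    const uint8_t *buf_end,
                                    unsigned int flags, int32_t replacement)
{
    int ret = decode(codep, bufp, buf_end, flags);

    if (ret < 0) {
        *codep = replacement;
        return 1;
    }
    return ret;
}

/* Toy decoder used only to exercise the wrapper: accepts ASCII and
 * rejects every byte >= 0x80, always consuming one byte. */
static int sketch_ascii_decode(int32_t *codep, const uint8_t **bufp,
                               const uint8_t *buf_end, unsigned int flags)
{
    (void)flags;
    if (*bufp >= buf_end)
        return 0;
    *codep = *(*bufp)++;
    return *codep < 0x80 ? 0 : -1;
}
```

Even this version already has to pick one policy (a single replacement
code per invalid sequence); replacing with a multi-code sequence, or one
code per invalid byte, would each need more parameters.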
>
> > >From 40a1b7a61d509efe64fdd1c1047fdd1507ab181e Mon Sep 17 00:00:00 2001
> > From: Stefano Sabatini <stefasab at gmail.com>
> > Date: Thu, 3 Oct 2013 01:21:40 +0200
> > Subject: [PATCH] lavu/avstring: add av_utf8_decode() function
> >
> > ---
> > doc/APIchanges | 3 +++
> > libavutil/Makefile | 1 +
> > libavutil/avstring.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++
> > libavutil/avstring.h | 35 ++++++++++++++++++++++++++++
> > libavutil/utf8.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > libavutil/version.h | 2 +-
> > 6 files changed, 170 insertions(+), 1 deletion(-)
> > create mode 100644 libavutil/utf8.c
> >
> > diff --git a/doc/APIchanges b/doc/APIchanges
> > index dfdc159..b292d19 100644
> > --- a/doc/APIchanges
> > +++ b/doc/APIchanges
> > @@ -15,6 +15,9 @@ libavutil: 2012-10-22
> >
> > API changes, most recent first:
> >
> > +2013-11-12 - xxxxxxx - lavu 52.53.100 - avstring.h
> > + Add av_utf8_decode() function.
> > +
> > 2013-11-xx - xxxxxxx - lavc 55.41.100 / 55.25.0 - avcodec.h
> > lavu 52.51.100 - frame.h
> > Add ITU-R BT.2020 and other not yet included values to color primaries,
> > diff --git a/libavutil/Makefile b/libavutil/Makefile
> > index 7b3b439..19540e4 100644
> > --- a/libavutil/Makefile
> > +++ b/libavutil/Makefile
> > @@ -155,6 +155,7 @@ TESTPROGS = adler32 \
> > sha \
> > sha512 \
> > tree \
> > + utf8 \
> > xtea \
> >
> > TESTPROGS-$(HAVE_LZO1X_999_COMPRESS) += lzo
> > diff --git a/libavutil/avstring.c b/libavutil/avstring.c
> > index eed58fa..8ff953e 100644
> > --- a/libavutil/avstring.c
> > +++ b/libavutil/avstring.c
> > @@ -307,6 +307,70 @@ int av_isxdigit(int c)
> > return av_isdigit(c) || (c >= 'a' && c <= 'f');
> > }
> >
> > +int av_utf8_decode(int32_t *codep, const uint8_t **bufp, const uint8_t *buf_end,
> > +                   unsigned int flags)
> > +{
> > +    const uint8_t *p = *bufp;
> > +    uint32_t top;
> > +    uint64_t code;
> > +    int ret = 0;
> > +
> > +    if (p >= buf_end)
> > +        return 0;
> > +
> > +    code = *p++;
> > +
> > +    /* first sequence byte starts with 10, or is 1111-1110 or 1111-1111,
> > +       which is not admitted */
> > +    if ((code & 0xc0) == 0x80 || code >= 0xFE) {
> > +        ret = AVERROR(EILSEQ);
> > +        goto end;
> > +    }
> > +    top = (code & 128) >> 1;
> > +
> > +    while (code & top) {
> > +        int tmp;
> > +        if (p >= buf_end) {
> > +            ret = AVERROR(EILSEQ); /* incomplete sequence */
> > +            goto end;
> > +        }
> > +
> > +        /* we assume the byte to be in the form 10xx-xxxx */
> > +        tmp = *p++ - 128; /* strip leading 1 */
> > +        if (tmp>>6) {
> > +            ret = AVERROR(EILSEQ);
> > +            goto end;
> > +        }
> > +        code = (code<<6) + tmp;
> > +        top <<= 5;
> > +    }
> > +    code &= (top << 1) - 1;
> > +
> > +    if (code >= 1U<<31) {
> > +        ret = AVERROR(EILSEQ); /* out-of-range value */
> > +        goto end;
> > +    }
> > +
> > +    *codep = code;
> > +
> > +    if (code > 0x10FFFF &&
> > +        !(flags & AV_UTF8_CHECK_FLAG_ACCEPT_INVALID_BIG_CODES))
> > +        ret = AVERROR(EILSEQ);
> > +    if (code < 0x20 && code != 0x9 && code != 0xA && code != 0xD &&
> > +        flags & AV_UTF8_CHECK_FLAG_EXCLUDE_XML_INVALID_CONTROL_CODES)
> > +        ret = AVERROR(EILSEQ);
> > +    if (code >= 0xD800 && code <= 0xDFFF &&
> > +        !(flags & AV_UTF8_CHECK_FLAG_ACCEPT_SURROGATES))
> > +        ret = AVERROR(EILSEQ);
> > +    if ((code == 0xFFFE || code == 0xFFFF) &&
> > +        !(flags & AV_UTF8_CHECK_FLAG_ACCEPT_NON_CHARACTERS))
> > +        ret = AVERROR(EILSEQ);
> > +
> > +end:
> > +    *bufp = p;
> > +    return ret;
> > +}
> > +
> > #ifdef TEST
> >
> > int main(void)
> > diff --git a/libavutil/avstring.h b/libavutil/avstring.h
> > index 438ef79..9a8aadf 100644
> > --- a/libavutil/avstring.h
> > +++ b/libavutil/avstring.h
> > @@ -22,6 +22,7 @@
> > #define AVUTIL_AVSTRING_H
> >
> > #include <stddef.h>
> > +#include <stdint.h>
> > #include "attributes.h"
> >
> > /**
> > @@ -295,6 +296,40 @@ enum AVEscapeMode {
> > int av_escape(char **dst, const char *src, const char *special_chars,
> > enum AVEscapeMode mode, int flags);
> >
>
> > +#define AV_UTF8_CHECK_FLAG_ACCEPT_INVALID_BIG_CODES 1 ///< accept codepoints over 0x10FFFF
> > +#define AV_UTF8_CHECK_FLAG_ACCEPT_NON_CHARACTERS 2 ///< accept non-characters - 0xFFFE and 0xFFFF
> > +#define AV_UTF8_CHECK_FLAG_ACCEPT_SURROGATES 4 ///< accept UTF-16 surrogates codes
> > +#define AV_UTF8_CHECK_FLAG_EXCLUDE_XML_INVALID_CONTROL_CODES 8 ///< exclude control codes not accepted by XML
>
> I still think that CHECK is redundant with ACCEPT and EXCLUDE, but that is
> your call.
Removed CHECK, hope we won't need to change it later.
> > +
> > +/**
> > + * Read and decode a single UTF-8 code point (character) from the
> > + * buffer in *buf, and update *buf to point to the next byte to
> > + * decode.
> > + *
> > + * In case of an invalid byte sequence, the pointer will be updated to
> > + * the next byte after the invalid sequence and the function will
> > + * return an error code.
> > + *
> > + * Depending on the specified flags, the function will also fail in
> > + * case the decoded code point does not belong to a valid range.
> > + *
> > + * @note For speed-relevant code a carefully implemented use of
> > + * GET_UTF8() may be preferred.
> > + *
> > + * @param code pointer used to return the parsed code in case of success
> > + * @param buf pointer to the first byte of the sequence to decode
> > +
> > + * @param buf_end mark the end of the buffer, points to the next byte
> > + * past the last in the buffer. This is used to avoid
> > + * buffer overreads (in case of an unfinished UTF-8
>
> > + * sequence towards the end of the buffer).
> > + * @param flags a collection of AV_UTF8_CHECK_FLAG_* flags
>
> Nit: broken alignment.
[...]
> The patch looks very fine to me now, thanks for bearing with me.
Updated.
--
FFmpeg = Frenzy and Fancy Mythic Powered Evil Goblin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-lavu-avstring-add-av_utf8_decode-function.patch
Type: text/x-diff
Size: 8385 bytes
Desc: not available
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20131120/865b818d/attachment.bin>