[FFmpeg-devel] [PATCH 3/3] vc-1: Optimise parser (with special attention to ARM)
Michael Niedermayer
michaelni at gmx.at
Sun Apr 20 05:22:25 CEST 2014
Hi
On Wed, Apr 16, 2014 at 08:38:14PM +0100, Ben Avison wrote:
> The previous implementation of the parser made four passes over each input
> buffer (reduced to two if the container format already guaranteed the input
> buffer corresponded to frames, such as with MKV). But these buffers are
> often 200K in size, certainly enough to flush the data out of L1 cache, and
> for many CPUs, all the way out to main memory. The passes were:
>
> 1) locate frame boundaries (not needed for MKV etc)
> 2) copy the data into a contiguous block (not needed for MKV etc)
> 3) locate the start codes within each frame
> 4) unescape the data between start codes
>
> After this, the unescaped data was parsed to extract certain header fields,
> but because the unescape operation was so large, this was usually also
> effectively operating on uncached memory. Most of the unescaped data was
> simply thrown away and never processed further. Only step 2 - because it
> used memcpy - was using prefetch, making things even worse.
>
> This patch reorganises these steps so that, aside from the copying, the
> operations are performed in parallel, maximising cache utilisation. No more
> than the worst-case number of bytes needed for header parsing is unescaped.
> Most of the data is, in practice, only read in order to search for a start
> code, for which optimised implementations already existed in the H264 codec
> (notably the ARM version uses prefetch, so we end up doing both remaining
> passes at maximum speed). For MKV files, we know when we've found the last
> start code of interest in a given frame, so we are able to avoid doing even
> that one remaining pass for most of the buffer.
>
> In some use-cases (such as the Raspberry Pi) video decode is handled by the
> GPU, but the entire elementary stream is still fed through the parser to
> pick out certain elements of the header which are necessary to manage the
> decode process. As you might expect, in these cases, the performance of the
> parser is significant.
>
> To measure parser performance, I used the same VC-1 elementary stream in
> either an MPEG-2 transport stream or an MKV file, and fed it through ffmpeg
> with -c:v copy -c:a copy -f null. These are the gperftools counts for
> those streams, both filtered to only include vc1_parse() and its callees,
> and unfiltered (to include the whole binary). Lower numbers are better:
>
>                       Before           After
> File  Filtered   Mean   StdDev    Mean   StdDev  Confidence  Change
> M2TS  No         861.7    8.2     650.5    8.1   100.0%      +32.5%
> MKV   No         868.9    7.4     731.7    9.0   100.0%      +18.8%
> M2TS  Yes        250.0   11.2      27.2    3.4   100.0%      +817.9%
> MKV   Yes        149.0   12.8       1.7    0.8   100.0%      +8526.3%
>
> Yes, that last case shows vc1_parse() running 86 times faster! The M2TS
> case does show a larger absolute improvement though, since it was worse
> to begin with.
>
> This patch has been tested with the FATE suite (albeit on x86 for speed).
> ---
> libavcodec/vc1_parser.c | 270 +++++++++++++++++++++++++++++------------------
> 1 files changed, 166 insertions(+), 104 deletions(-)
>
> diff --git a/libavcodec/vc1_parser.c b/libavcodec/vc1_parser.c
> index cc29ce1..b18f6dc 100644
> --- a/libavcodec/vc1_parser.c
> +++ b/libavcodec/vc1_parser.c
> @@ -30,122 +30,87 @@
> #include "vc1.h"
> #include "get_bits.h"
>
> +/** The maximum number of bytes of a sequence, entry point or
> + * frame header whose values we pay any attention to */
> +#define UNESCAPED_THRESHOLD 37
> +
> +/** The maximum number of bytes of a sequence, entry point or
> + * frame header which must be valid memory (because they are
> + * used to update the bitstream cache in skip_bits() calls)
> + */
> +#define UNESCAPED_LIMIT 144
> +
> +typedef enum {
> + NO_MATCH,
> + ONE_ZERO,
> + TWO_ZEROS,
> + ONE
> +} VC1ParseSearchState;
> +
> typedef struct {
> ParseContext pc;
> VC1Context v;
> + uint8_t prev_start_code;
> + uint8_t unesc_buffer[UNESCAPED_LIMIT];
> + size_t unesc_index;
> + VC1ParseSearchState search_state;
> } VC1ParseContext;
>
> -static void vc1_extract_headers(AVCodecParserContext *s, AVCodecContext *avctx,
> - const uint8_t *buf, int buf_size)
> +static void vc1_extract_header(AVCodecParserContext *s, AVCodecContext *avctx,
> + const uint8_t *buf, int buf_size)
> {
> + /* Parse the header we just finished unescaping */
> VC1ParseContext *vpc = s->priv_data;
> GetBitContext gb;
> - const uint8_t *start, *end, *next;
> - uint8_t *buf2 = av_mallocz(buf_size + FF_INPUT_BUFFER_PADDING_SIZE);
> -
> + int ret;
> vpc->v.s.avctx = avctx;
> vpc->v.parse_only = 1;
> - vpc->v.first_pic_header_flag = 1;
> - next = buf;
> - s->repeat_pict = 0;
> -
> - for(start = buf, end = buf + buf_size; next < end; start = next){
> - int buf2_size, size;
> - int ret;
> -
> - next = find_next_marker(start + 4, end);
> - size = next - start - 4;
> - buf2_size = vc1_unescape_buffer(start + 4, size, buf2);
> - init_get_bits(&gb, buf2, buf2_size * 8);
> - if(size <= 0) continue;
> - switch(AV_RB32(start)){
> - case VC1_CODE_SEQHDR:
> - ff_vc1_decode_sequence_header(avctx, &vpc->v, &gb);
> - break;
> - case VC1_CODE_ENTRYPOINT:
> - ff_vc1_decode_entry_point(avctx, &vpc->v, &gb);
> - break;
> - case VC1_CODE_FRAME:
> - if(vpc->v.profile < PROFILE_ADVANCED)
> - ret = ff_vc1_parse_frame_header (&vpc->v, &gb);
> - else
> - ret = ff_vc1_parse_frame_header_adv(&vpc->v, &gb);
> -
> - if (ret < 0)
> - break;
> -
> - /* keep AV_PICTURE_TYPE_BI internal to VC1 */
> - if (vpc->v.s.pict_type == AV_PICTURE_TYPE_BI)
> - s->pict_type = AV_PICTURE_TYPE_B;
> - else
> - s->pict_type = vpc->v.s.pict_type;
> -
> - if (avctx->ticks_per_frame > 1){
> - // process pulldown flags
> - s->repeat_pict = 1;
> - // Pulldown flags are only valid when 'broadcast' has been set.
> - // So ticks_per_frame will be 2
> - if (vpc->v.rff){
> - // repeat field
> - s->repeat_pict = 2;
> - }else if (vpc->v.rptfrm){
> - // repeat frames
> - s->repeat_pict = vpc->v.rptfrm * 2 + 1;
> - }
> - }
> -
> - if (vpc->v.broadcast && vpc->v.interlace && !vpc->v.psf)
> - s->field_order = vpc->v.tff ? AV_FIELD_TT : AV_FIELD_BB;
> - else
> - s->field_order = AV_FIELD_PROGRESSIVE;
> + init_get_bits(&gb, buf, buf_size * 8);
> + switch (vpc->prev_start_code) {
> + case VC1_CODE_SEQHDR & 0xFF:
> + ff_vc1_decode_sequence_header(avctx, &vpc->v, &gb);
> + break;
> + case VC1_CODE_ENTRYPOINT & 0xFF:
> + ff_vc1_decode_entry_point(avctx, &vpc->v, &gb);
> + break;
> + case VC1_CODE_FRAME & 0xFF:
> + if(vpc->v.profile < PROFILE_ADVANCED)
> + ret = ff_vc1_parse_frame_header (&vpc->v, &gb);
> + else
> + ret = ff_vc1_parse_frame_header_adv(&vpc->v, &gb);
>
> + if (ret < 0)
> break;
> - }
> - }
>
> - av_free(buf2);
> -}
> + /* keep AV_PICTURE_TYPE_BI internal to VC1 */
> + if (vpc->v.s.pict_type == AV_PICTURE_TYPE_BI)
> + s->pict_type = AV_PICTURE_TYPE_B;
> + else
> + s->pict_type = vpc->v.s.pict_type;
>
> -/**
> - * Find the end of the current frame in the bitstream.
> - * @return the position of the first byte of the next frame, or -1
> - */
> -static int vc1_find_frame_end(ParseContext *pc, const uint8_t *buf,
> - int buf_size) {
> - int pic_found, i;
> - uint32_t state;
> -
> - pic_found= pc->frame_start_found;
> - state= pc->state;
> -
> - i=0;
> - if(!pic_found){
> - for(i=0; i<buf_size; i++){
> - state= (state<<8) | buf[i];
> - if(state == VC1_CODE_FRAME || state == VC1_CODE_FIELD){
> - i++;
> - pic_found=1;
> - break;
> + if (avctx->ticks_per_frame > 1){
> + // process pulldown flags
> + s->repeat_pict = 1;
> + // Pulldown flags are only valid when 'broadcast' has been set.
> + // So ticks_per_frame will be 2
> + if (vpc->v.rff){
> + // repeat field
> + s->repeat_pict = 2;
> + }else if (vpc->v.rptfrm){
> + // repeat frames
> + s->repeat_pict = vpc->v.rptfrm * 2 + 1;
> }
> + }else{
> + s->repeat_pict = 0;
> }
> - }
>
> - if(pic_found){
> - /* EOF considered as end of frame */
> - if (buf_size == 0)
> - return 0;
> - for(; i<buf_size; i++){
> - state= (state<<8) | buf[i];
> - if(IS_MARKER(state) && state != VC1_CODE_FIELD && state != VC1_CODE_SLICE){
> - pc->frame_start_found=0;
> - pc->state=-1;
> - return i-3;
> - }
> - }
> + if (vpc->v.broadcast && vpc->v.interlace && !vpc->v.psf)
> + s->field_order = vpc->v.tff ? AV_FIELD_TT : AV_FIELD_BB;
> + else
> + s->field_order = AV_FIELD_PROGRESSIVE;
> +
> + break;
> }
> - pc->frame_start_found= pic_found;
> - pc->state= state;
> - return END_NOT_FOUND;
> }
>
> static int vc1_parse(AVCodecParserContext *s,
> @@ -153,14 +118,106 @@ static int vc1_parse(AVCodecParserContext *s,
> const uint8_t **poutbuf, int *poutbuf_size,
> const uint8_t *buf, int buf_size)
> {
> + /* Here we do the searching for frame boundaries and headers at
> + * the same time. Only a minimal amount at the start of each
> + * header is unescaped. */
> VC1ParseContext *vpc = s->priv_data;
> - int next;
> + int pic_found = vpc->pc.frame_start_found;
> + uint8_t *unesc_buffer = vpc->unesc_buffer;
> + size_t unesc_index = vpc->unesc_index;
> + VC1ParseSearchState search_state = vpc->search_state;
> + int next = END_NOT_FOUND;
> + int i = 0;
> +
> + if (pic_found && buf_size == 0) {
> + /* EOF considered as end of frame */
> + memset(unesc_buffer + unesc_index, 0, UNESCAPED_THRESHOLD - unesc_index);
> + vc1_extract_header(s, avctx, unesc_buffer, unesc_index);
> + next = 0;
> + }
> + while (i < buf_size) {
> + int start_code_found = 0;
> + uint8_t b;
> + while (i < buf_size && unesc_index < UNESCAPED_THRESHOLD) {
> + b = buf[i++];
> + unesc_buffer[unesc_index++] = b;
> + if (search_state <= ONE_ZERO)
> + search_state = b ? NO_MATCH : search_state + 1;
> + else if (search_state == TWO_ZEROS) {
> + if (b == 1)
> + search_state = ONE;
> + else if (b > 1) {
> + if (b == 3)
> + unesc_index--; // swallow emulation prevention byte
> + search_state = NO_MATCH;
> + }
> + }
> + else { // search_state == ONE
> + // Header unescaping terminates early due to detection of next start code
> + search_state = NO_MATCH;
> + start_code_found = 1;
> + break;
> + }
> + }
> + if ((s->flags & PARSER_FLAG_COMPLETE_FRAMES) &&
> + unesc_index >= UNESCAPED_THRESHOLD &&
> + vpc->prev_start_code == (VC1_CODE_FRAME & 0xFF))
> + {
> + // No need to keep scanning the rest of the buffer for
> + // start codes if we know it contains a complete frame and
> + // we've already unescaped all we need of the frame header
> + vc1_extract_header(s, avctx, unesc_buffer, unesc_index);
> + break;
> + }
> + if (unesc_index >= UNESCAPED_THRESHOLD && !start_code_found) {
> + while (i < buf_size) {
> + if (search_state == NO_MATCH) {
> + i += vpc->v.vc1dsp.vc1_find_start_code_candidate(buf + i, buf_size - i) + 1;
This doesn't look correct.
A parser can be fed arbitrary pieces of data; for example,
a parser could receive only 1 byte at a time and never more.
The code above looks like it needs more than 1 byte.

Also, the unesc_buffer logic looks like it would fail if there's
random data before a frame.

And make sure that cases where the start code is split over
several calls to the parser work correctly.
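
To illustrate the requirement, here is a minimal sketch of a start code
scanner that keeps all of its matching state between calls, so it gives
the same answer whether the stream arrives in one buffer or 1 byte at a
time, and it ignores leading junk. It is not code from the patch or from
libavcodec: StartCodeScanner, scanner_init(), scanner_feed() and
count_start_codes() are hypothetical names, and it only assumes that a
start code is the byte sequence 00 00 01 xx.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scanner state; this is not VC1ParseContext from the patch. */
typedef struct {
    uint32_t state; /* last four bytes seen, most recent in the low byte */
} StartCodeScanner;

static void scanner_init(StartCodeScanner *sc)
{
    /* All ones cannot match 00 00 01 xx until four real bytes were fed. */
    sc->state = 0xFFFFFFFFu;
}

/* Feed one chunk of any length, including 0 or 1 byte.
 * Returns how many start codes (00 00 01 xx) were completed inside this
 * chunk. If last_code is not NULL, it receives the xx byte of the most
 * recent one. Nothing here looks ahead past buf[len - 1]. */
static int scanner_feed(StartCodeScanner *sc, const uint8_t *buf,
                        size_t len, uint8_t *last_code)
{
    int found = 0;

    assert(buf || len == 0);
    for (size_t i = 0; i < len; i++) {
        sc->state = (sc->state << 8) | buf[i];
        if ((sc->state & 0xFFFFFF00u) == 0x00000100u) {
            found++;
            if (last_code)
                *last_code = buf[i];
        }
    }
    return found;
}

/* Deliver the same stream in pieces of `chunk` bytes and count the
 * start codes seen, to compare different ways of splitting the input. */
static int count_start_codes(const uint8_t *data, size_t size, size_t chunk)
{
    StartCodeScanner sc;
    int total = 0;

    assert(chunk > 0);
    scanner_init(&sc);
    for (size_t off = 0; off < size; off += chunk) {
        size_t n = size - off < chunk ? size - off : chunk;
        total += scanner_feed(&sc, data + off, n, NULL);
    }
    return total;
}
```

Calling count_start_codes() on one stream with chunk sizes of 1, 3 and
the full length should return the same count each time. A fast path that
scans many bytes per call is fine as long as it falls back to something
like this at chunk boundaries.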
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
Rewriting code that is poorly written but fully understood is good.
Rewriting code that one doesn't understand is a sign that one is less smart
than the original author; trying to rewrite it will not make it better.