[FFmpeg-devel] [PATCH] avformat/rtcenc: Add WHIP muxer support for subsecond latency streaming

Wed May 31 23:46:07 EEST 2023

Hello, quick skim below, followe by some waffling at the end.

I have not bothere to mention any code style nits, to keep it technical.

On 5/29/2023 12:50 PM, Steven Liu wrote:
> @@ -3483,6 +3483,7 @@ ogg_demuxer_select="dirac_parse"
>  ogv_muxer_select="ogg_muxer"
>  opus_muxer_select="ogg_muxer"
>  psp_muxer_select="mov_muxer"
> +rtc_muxer_deps_any="openssl"

Given we support a bunch of SSL libraries, how exactly does this work with that? Can
it be abstracted out? OpenSSL poses licensing issues for many.

Further, I think this is missing a bunch of dependencies like HTTP and AVC?
> +#ifndef CONFIG_OPENSSL
> +#error "DTLS is not supported, please enable openssl"
> +#else
> +#include <openssl/ssl.h>
> +#include <openssl/err.h>
> +#if OPENSSL_VERSION_NUMBER < 0x1010102fL
> +#error "OpenSSL version 1.1.1b or newer is required"
> +#endif
> +#endif

This should be in configure.

> +/*
> + * Supported DTLS cipher suites for FFmpeg as a DTLS client.
> + * These cipher suites are used to negotiate with DTLS servers.
> + *
> + * It is advisable to use a limited number of cipher suites to reduce
> + * the size of DTLS UDP packets.
> + */
> +#define DTLS_CIPHER_SUTES "ECDHE-ECDSA-AES128-GCM-SHA256"\
> +    ":ECDHE-RSA-AES128-GCM-SHA256"\
> +    ":ECDHE-ECDSA-AES128-SHA"\
> +    ":ECDHE-RSA-AES128-SHA"\
> +    ":ECDHE-ECDSA-AES256-SHA"\
> +    ":ECDHE-RSA-AES256-SHA"

The reason for this specific subset should be documented.

> +/**
> + * STAP-A stands for Single-Time Aggregation Packet.
> + * The NALU type for STAP-A is 24 (0x18).
> + */
> +#define NALU_TYPE_STAP_A 24

Why is a NALU type being defined in a protocol implementation file?

> +/**
> + * Wait for a small timeout in milliseconds to allow for the server to process
> + * the Interactive Connectivity Establishment (ICE) request. If we immediately
> + * read the response after sending the request, we may receive nothing and need
> + * to immediately retry. To lessen the likelihood of retries, we can send the
> + * request and wait for a small amount of time for the server to process it
> + * before reading the response.
> + */
> +#define ICE_PROCESSING_TIMEOUT 10

Are we really hardcoding sleeps? Surely a better way exists. What do other implementations
do here?

> +/**
> + * Wait for a short timeout in milliseconds to allow the server to process
> + * the Datagram Transport Layer Security (DTLS) request. If we immediately
> + * read the response after sending the request, we may receive nothing and
> + * need to immediately retry. To reduce the likelihood of retries, we can
> + * send the request and wait a short amount of time for the server to
> + * process it before attempting to read the response.
> + */
> +#define DTLS_PROCESSING_TIMEOUT 30

Same comment as above.

> +/**
> + * The maximum number of retries for Datagram Transport Layer Security (DTLS) EAGAIN errors.
> + * When we send a DTLS request and receive no response, we may encounter an EAGAIN error.
> + * In this situation, we wait briefly and attempt to read the response again.
> + * We limit the maximum number of times we retry this loop.
> + */
> +#define DTLS_EAGAIN_RETRIES_MAX 5

Should this be an option?

> +/**
> + * The DTLS timer's base timeout in microseconds. Its purpose is to minimize the unnecessary
> + * retransmission of ClientHello.
> + */
> +#define DTLS_SSL_TIMER_BASE 400 * 1000

Is this from a spec or arbitrary?

> +typedef struct DTLSContext {
> +    /* For av_log to write log to this category. */
> +    void *log_avcl;

Why is this void *?

> +    /**
> +     * This represents the material used to build the SRTP master key. It is
> +     * generated by DTLS and has the following layout:
> +     *          16B         16B         14B             14B
> +     *      client_key | server_key | client_salt | server_salt
> +     */
> +    uint8_t dtls_srtp_material[DTLS_SRTP_MASTER_KEY_LEN * 2];

*2?

> +/**
> + * Generate a self-signed certificate and private key for DTLS.
> + */

Does WHIP allow use of arbitrary certs (i.e. non-self-signed)?

> +static av_cold int dtls_context_init(DTLSContext *ctx)
> +{
> +    int ret = 0, serial, expire_day, i, n = 0;
> +    AVBPrint fingerprint;
> +    unsigned char md[EVP_MAX_MD_SIZE];
> +    const char *aor = "ffmpeg.org", *curve = NULL;

I am not sure the the FFmpeg community wants to advertise itself as a CN.

Mayb lavf?

> +    curve = "prime256v1";

For this and all other occurrences: They should be intialized at declaration
time, like the rest of the codebase.

> +    ctx->dtls_pkey = dtls_pkey = EVP_EC_gen(curve);
> +    if (!dtls_pkey) {
> +        av_log(s1, AV_LOG_ERROR, "DTLS: EVP_EC_gen curve=%s failed\n", curve);

For this and all ther logs: Something more descriptive than "function_name failed:"
should be used.

> +    serial = (int)av_get_random_seed();

For all uses of av_get_random_seed: Why are we using a seed and not a PRNG?

> +    if (!av_bprint_is_complete(&fingerprint)) {
> +        av_log(s1, AV_LOG_ERROR, "Fingerprint %d exceed max %d, %s\n", ret, MAX_SDP_SIZE, fingerprint.str);
> +        ret = AVERROR(EIO);
> +        goto end;
> +    }

Is this actually possible to trigger?

> +    if (!fingerprint.str || !strlen(fingerprint.str)) {
> +        av_log(s1, AV_LOG_ERROR, "Fingerprint is empty\n");
> +        ret = AVERROR(EINVAL);
> +        goto end;
> +    }

Same as above.

> +    /* never exceed the max timeout. */
> +    timeout_us = FFMIN(timeout_us, 30 * 1000 * 1000); /* in us */

Is this max based on anything?

> +    /* Change_cipher_spec(20), alert(21), handshake(22), application_data(23) */
> +    if (length >= 1)
> +        content_type = (uint8_t)data[0];
> +    if (length >= 13)
> +        size = (uint16_t)(data[11])<<8 | (uint16_t)data[12];

nit: AV_WB16?

> +    if (length >= 14)
> +        handshake_type = (uint8_t)data[13];

Casting uint8_t to uint8_t? Above, also.

> +    av_log(s1, AV_LOG_VERBOSE, "WHIP: DTLS state %s %s, done=%u, arq=%u, r0=%d, r1=%d, len=%u, cnt=%u, size=%u, hs=%u\n",
> +        "Active", (incoming? "RECV":"SEND"), ctx->dtls_done_for_us, ctx->dtls_arq_packets, r0, r1, length,
> +        content_type, size, handshake_type);

Does the use of this trace callback affec perf in any way? I have to assume not,
but thought I would check.

> +    /* For ECDSA, we could set the curves list. */
> +    if (SSL_CTX_set1_curves_list(dtls_ctx, "P-521:P-384:P-256") != 1) {

Where did this list come from?

> +    /* Only support SRTP_AES128_CM_SHA1_80, please read ssl/d1_srtp.c *> +    if (SSL_CTX_set_tlsext_use_srtp(dtls_ctx, "SRTP_AES128_CM_SHA1_80")) {

What is ssl/d1_srtp.c? It is not from FFmpeg.

> +        /* Wait so that the server can process the request and no need ARQ then. */
> +#if DTLS_PROCESSING_TIMEOUT > 0
> +        av_usleep(DTLS_PROCESSING_TIMEOUT * 10000);
> +#endif

This #if is a no-op.

> +
> +        for (j = 0; j <= DTLS_EAGAIN_RETRIES_MAX && !res_size; j++) {

Does this not make DTLS_EAGAIN_RETRIES_MAX a misnomer (due to use of <=)?

> +            /* Ignore other packets, such as ICE indication, except DTLS. */
> +            if (ret > 0 && (ret < 13 || buf[0] <= 19 || buf[0] >= 64))
> +                continue;

Magic numbers?

> +    if (!ctx->udp_uc) {
> +        av_log(s1, AV_LOG_ERROR, "DTLS: No UDP context\n");

Kind of odd this would be possible?

> +    av_log(s1, AV_LOG_INFO, "WHIP: DTLS handshake done=%d, arq=%d, srtp_material=%luB, cost=%dms\n",
> +        ctx->dtls_done_for_us, ctx->dtls_arq_packets, sizeof(ctx->dtls_srtp_material),
> +        (int)(av_gettime() - starttime) / 1000);

Bit verbose (here and other uses of INFO)?

Why are we logging the cost/time diff?

Also we do not use %lu, I think.
> +    /* The SPS/PPS of AVC video */
> +    uint8_t *avc_sps;
> +    int avc_sps_size;
> +    uint8_t *avc_pps;
> +    int avc_pps_size;
> +    /* The size of NALU in ISOM format. */
> +    int avc_nal_length_size;
> +

Will get into this in my waffling at the end...

> +    /* The SRTP send context, to encrypt outgoing packets. */
> +    struct SRTPContext srtp_audio_send;
> +    struct SRTPContext srtp_video_send;
> +    struct SRTPContext srtp_rtcp_send;
> +    /* The SRTP receive context, to decrypt incoming packets. */
> +    struct SRTPContext srtp_recv;

Not typedefed?

> +static int on_rtp_write_packet(void *opaque, uint8_t *buf, int buf_size);

Why not reorder the file instead of a same-file proto?

> +    rtc->dtls_ctx.dtls_arq_max = rtc->dtls_arq_max;
> +    rtc->dtls_ctx.dtls_arq_timeout = rtc->dtls_arq_timeout;
> +    rtc->dtls_ctx.pkt_size = rtc->pkt_size;

Why are these duplicated instead of the struct pointer embedded?

> +static int isom_read_avcc(AVFormatContext *s, uint8_t *extradata, int  extradata_size)

AS Kieran noted, duplicated code.
> +/**
> + * Generate SDP offer according to the codec parameters, DTLS and ICE information.
> + * The below is an example of SDP offer:

This comment and function need some serious work:
  * The example and comment explain exactly nothing about the SDP text being generated. It
    is an opaque blob.
  * It is illegal SDP.
  * We already have SDP facilties you should use.

> +    snprintf(rtc->ice_ufrag_local, sizeof(rtc->ice_ufrag_local), "%08x",
> +             av_get_random_seed());
> +    snprintf(rtc->ice_pwd_local, sizeof(rtc->ice_pwd_local), "%08x%08x%08x%08x",
> +             av_get_random_seed(), av_get_random_seed(), av_get_random_seed(),
> +             av_get_random_seed());

Use the PRNG... calling av_get_random_seed 5 times is a bit whack.

Also, is it limited to hex characters?

> +    rtc->audio_ssrc = av_get_random_seed();
> +    rtc->video_ssrc = av_get_random_seed();

Same as above.

> +    rtc->audio_payload_type = 111;
> +    rtc->video_payload_type = 106;

Magic numbers. I assume this is H.264 and Opus? Do not harcode.

> +
> +    av_bprintf(&bp, ""
> +        "v=0\r\n"
> +        "o=FFmpeg 4489045141692799359 2 IN IP4 127.0.0.1\r\n"

Hardcoding 127.0.0.1?

4489045141692799359?

> +    if (rtc->video_par) {
> +        profile = rtc->video_par->profile < 0 ? 0x42 : rtc->video_par->profile;
> +        level = rtc->video_par->level < 0 ? 30 : rtc->video_par->level;
> +        profile_iop = profile & FF_PROFILE_H264_CONSTRAINED;

Do not hardcode H264.

> +        av_bprintf(&bp, ""
> +            "m=video 9 UDP/TLS/RTP/SAVPF %u\r\n"
> +            "c=IN IP4 0.0.0.0\r\n"
> +            "a=ice-ufrag:%s\r\n"
> +            "a=ice-pwd:%s\r\n"
> +            "a=fingerprint:sha-256 %s\r\n"
> +            "a=setup:active\r\n"
> +            "a=mid:1\r\n"
> +            "a=sendonly\r\n"
> +            "a=msid:FFmpeg video\r\n"
> +            "a=rtcp-mux\r\n"
> +            "a=rtcp-rsize\r\n"
> +            "a=rtpmap:%u H264/90000\r\n"

Same.

> +            "a=fmtp:%u level-asymmetry-allowed=1;packetization-mode=1;profile-level-id=%02x%02x%02x\r\n"

Same.

> +            "a=ssrc:%u cname:FFmpeg\r\n"
> +            "a=ssrc:%u msid:FFmpeg video\r\n",
> +            rtc->video_payload_type,
> +            rtc->ice_ufrag_local,
> +            rtc->ice_pwd_local,
> +            rtc->dtls_ctx.dtls_fingerprint,
> +            rtc->video_payload_type,
> +            rtc->video_payload_type,
> +            profile & (~FF_PROFILE_H264_CONSTRAINED),

Same.

> + * Exchange SDP offer with WebRTC peer to get the answer.
> + * The below is an example of SDP answer:

Same comment as my before: This is an opaque blob. 

> +/**
> + * Parses the ICE ufrag, pwd, and candidates from the SDP answer.

We have SDP facilities already.

> +    /* Write 20 bytes header */
> +    avio_wb16(pb, 0x0001); /* STUN binding request */
> +    avio_wb16(pb, 0);      /* length */
> +    avio_wb32(pb, STUN_MAGIC_COOKIE); /* magic cookie */
> +    avio_wb32(pb, av_get_random_seed()); /* transaction ID */
> +    avio_wb32(pb, av_get_random_seed()); /* transaction ID */
> +    avio_wb32(pb, av_get_random_seed()); /* transaction ID */

Use the PRNG.
> +    av_opt_set(rtc->udp_uc->priv_data, "connect", "1", 0);
> +    av_opt_set(rtc->udp_uc->priv_data, "fifo_size", "0", 0);
> +    /* Set the max packet size to the buffer size. */
> +    snprintf(tmp, sizeof(tmp), "%d", rtc->pkt_size);
> +    av_opt_set(rtc->udp_uc->priv_data, "pkt_size", tmp, 0);

This is as good of a place as any to ask: Do we plan to handle passing
AVOptions down to subcontexts?

> +static int setup_srtp(AVFormatContext *s)

> +    const char* suite = "AES_CM_128_HMAC_SHA1_80";

Source?

Same as before, re: subcontext AVOptions?

> +    /* Ignore if not RTP or RTCP packet. */
> +    if (buf_size < 12 || (buf[0] & 0xC0) != 0x80)
> +        return 0;
> +
> +    /* Only support audio, video and rtcp. */
> +    is_rtcp = buf[1] >= 192 && buf[1] <= 223;

Magic numbers.

> +    /**
> +     * For video, the STAP-A with SPS/PPS should:
> +     * 1. The marker bit should be 0, never be 1.
> +     * 2. The NRI should equal to the first NALU's.
> +     */
> +    if (is_video && buf_size > 12) {
> +        nalu_header = buf[12] & 0x1f;
> +        if (nalu_header == NALU_TYPE_STAP_A) {
> +            /* Reset the marker bit to 0. */
> +            if (buf[1] & 0x80)
> +                buf[1] &= 0x7f;
> +
> +            /* Reset the NRI to the first NALU's NRI. */
> +            if (buf_size > 15 && (buf[15]&0x60) != (buf[12]&0x60))
> +                buf[12] = (buf[12]&0x80) | (buf[15]&0x60) | (buf[12]&0x1f);
> +        }
> +    }

This seems hardcoded to H.264.

> +/**
> + * Inserts the SPS/PPS data before each IDR (Instantaneous Decoder Refresh) frame.
> + *
> + * The SPS/PPS is parsed from the extradata. If it's in ISOM format, the SPS/PPS is
> + * multiplexed to the data field of the packet. If it's in annexb format, then the entire
> + * extradata is set to the data field of the packet.

Why do we do it different ways instea of always munging it into the same way?

Also, do we have a BSF for this?

> +    /* For audio OPUS stream, correct the timestamp. */
> +    if (st->codecpar->codec_type == AVMEDIA_TYPE_AUDIO) {
> +        pkt->dts = pkt->pts = rtc->audio_jitter_base;
> +        rtc->audio_jitter_base += 960;
> +    }

1. Do not hardcode opus.
2. 960 is not always true for opus, it just happens to be the libopus default.

> +    ret = insert_sps_pps_packet(s, pkt);
> +    if (ret < 0) {
> +        av_log(s, AV_LOG_ERROR, "Failed to insert SPS/PPS packet\n");
> +        return ret;
> +    }

Do not hardcode H264.

> +static const AVClass rtc_muxer_class = {
> +    .class_name = "WebRTC muxer",

WHIP or WebRTC?

[...]

OK. So now on to my waffling.

This patch, to me, seems almost like a proof of concept rather than a fully fleshed out
(read: pushable) patch. I have few concerns:

1. If it is pushed in such a state, lacking so many things, my hunch, based on experience,
   is that it will stay that way: incomplete, just being maintained, rather than finished.
   Users will rightfully ask why it is like this, and will ask for it to be completed, but
   I expect nobody will. But there will be a blog post about how "FFmpeg has WebRTC now",
   like we saw with IPFS.
2. It is monolithic - WHIP, DTLS, ICE, STUN, SRTP, RTP, and SPD should really all be separate
   layers. Typing that out really does make one realize WebRTC suffers from deep IETF
   committee-ness...
3. The codec parts should probably be separated out like they are for RTP.
4. If our SDP facilities are not good enough for WHIP, they should be improved so that they
   are.

- Derek