[FFmpeg-devel] [PATCH] libavfilter: Whisper audio filter

Wed Jul 23 11:43:16 EEST 2025

Hi,
I've applied some changes and created a pull request:
https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20022

>
> > +    frames = FFMAX(0, FFMIN(frames, wctx->audio_buffer_fill_size));
>
> I would call it samples, sample_count or nb_samples
>
> Why are you clipping the number of samples?
>
> I assume run_transcription() would be called with the correct number, or am I missing
> something?

When using the VAD option, we want to process only a portion of the
total samples stored in the buffer (up to the detected silence), so the
requested count can be smaller than the buffer fill size.
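
To illustrate the clamp, here is a minimal sketch. The helper name
clamp_nb_samples() and the SKETCH_* macros are hypothetical stand-ins so
that it compiles without the FFmpeg headers; the patch itself uses
FFMAX/FFMIN on wctx->audio_buffer_fill_size directly, and how the VAD
arrives at the requested count is not shown here.

```c
/* Local stand-ins for FFMAX/FFMIN (assumption: same semantics as the
 * FFmpeg macros, redefined only to keep the sketch self-contained). */
#define SKETCH_MAX(a, b) ((a) > (b) ? (a) : (b))
#define SKETCH_MIN(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical helper: limit the number of samples handed to the
 * transcription step to the range [0, buffer_fill_size]. With VAD the
 * request ends at the detected silence, so it may be smaller than the
 * fill size; it must never be negative or exceed what is buffered. */
static int clamp_nb_samples(int nb_samples, int buffer_fill_size)
{
    return SKETCH_MAX(0, SKETCH_MIN(nb_samples, buffer_fill_size));
}
```

With 300 samples buffered, a request for 200 passes through unchanged, a
request for 500 is cut to 300, and a negative request becomes 0.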

> A bigger problem is that the input frame->pts are not passed through to the output
> srt/json timestamps.
>
> To understand why this is a problem, consider some audio input device
> which samples at 16 kHz. This hardware contains, let's say for simplicity, a 16 kHz
> crystal and samples based on that. But depending on the temperature of this
> crystal, it will really sample at, let's say, between 15990 and 16010 Hz. So
> simply counting samples alone is not enough; the frame->pts need to be
> used too if the subtitles should be perfectly in sync with the video.
>
> It's probably best to give the user the option to produce srt/json times
> based purely on sample numbers but also on pts.

Ok, let me think about using pts instead.
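
As a rough sketch of the two options (not what the patch currently does):
the function names and the SketchRational type below are hypothetical,
the latter standing in for AVRational so the example compiles on its own.
Inside the filter the rescaling would more likely go through
av_rescale_q(), and I am assuming the pts of the first buffered frame is
kept alongside the audio buffer.

```c
#include <stdint.h>

/* Hypothetical stand-in for AVRational (time base = num/den seconds). */
typedef struct SketchRational {
    int num;
    int den;
} SketchRational;

/* Option 1: timestamp derived purely from the running sample count.
 * This trusts the nominal sample rate, so any clock error in the
 * capture device accumulates over time. */
static int64_t ms_from_sample_count(int64_t sample_index, int sample_rate)
{
    return sample_index * 1000 / sample_rate;
}

/* Option 2: timestamp anchored on the pts of the first frame in the
 * buffer. Only the offset inside the current buffer relies on the
 * nominal sample rate, so drift cannot build up across buffers. */
static int64_t ms_from_pts(int64_t first_pts, SketchRational time_base,
                           int64_t offset_samples, int sample_rate)
{
    int64_t base_ms = first_pts * 1000 * time_base.num / time_base.den;
    return base_ms + offset_samples * 1000 / sample_rate;
}
```

For the 16010 Hz case from the review: one real hour of capture yields
16010 * 3600 = 57636000 samples, which option 1 reports as 3602250 ms,
i.e. 2.25 s late, while option 2 stays tied to the container timestamps.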