Getting quite a bit of "Thank you for watching!" during silence

#19
by FlippFuzz - opened

With the VAD enabled, I've noticed that I'm getting quite a bit of "Thank you for watching!".
This is not as noticeable without the VAD.


Command:

time python3 cli.py --model large-v2 --language Japanese --task translate --verbose True \
--whisper_implementation faster-whisper --compute_type float32 --vad silero-vad --vad_cpu_cores 4 \
--language Japanese --beam_size 1 --temperature 0 \
--output_dir output -- Sdx8kCr9DvQ.webm

Sample 1:

Running whisper from  01:22:19.108  to  01:22:36.686 , duration:  17.57800000000043 expanded:  8.596000000000458 prompt:   Seriously? language:  ja
 Wow…
 Ouch!
 Seriously?
 Thank you for watching!       <--- Supposed to be silence.
 Please subscribe our channel! <--- Supposed to be silence.

Sample 2:

Running whisper from  01:45:06.798  to  01:45:45.198 , duration:  38.399999999999636 expanded:  17.84400000000005 prompt:  None language:  ja
 It's high
 But I see, I can't go in
 Is it one way?
 If it's a water elevator
 Inside here
 Eh? I can switch between the two?
 Thank you for watching! Please subscribe to my channel!  <--- Supposed to be silence.

Sample 3:

Running whisper from  03:39:57.186  to  03:40:27.906 , duration:  30.719999999999345 expanded:  12.5 prompt:  None language:  ja
 It's not good to die here
 First I'll try to go without dying
 That was close
 Thank you for watching!  <--- Supposed to be silence.

I'm guessing that Whisper is actually expecting 30s worth of input and if the input is short, there's a chance that Whisper thinks that the video is ending and translates it as "Thank you for watching".

For example, let's use "Sample 3" above.
30.72s would basically be processed in two passes: 30s + 0.72s.
Because the 2nd part of the input (0.72s) is very short, perhaps Whisper thinks that the video is ending and adds the "Thank you for watching!".


Is there anything we can do to help mask this (or even verify my assumption that it's caused by the small chunks?)

The way Silero VAD works in conjunction with the Whisper WebUI is to detect sections of continuous speech in the audio file. It then joins these sections into chunks, provided that they are within 5 seconds of each other (VAD - Merge Window (s)), until the chunk has reached the maximum merge size (VAD - Max Merge Size(s)). Sections are also padded by 1 second (VAD - Padding). Finally, the detected text of the previous chunk is also passed to Whisper through the initial prompt text, provided the text ended within 3 seconds (VAD - Prompt Window).

To illustrate this, take a look at the following figure:
[Figure: Chunks and sections.png]

The detected speech sections (Speech #1, Speech #2, etc.) are marked in yellow, and these are then joined into larger chunks marked in green (chunk #1, #2, etc.). There is also a slight padding around each individual merged chunk. Finally, there are areas of silence not included in any of the chunks.
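As a rough sketch of the merging described above (not the actual Whisper WebUI code): the function name `merge_sections`, the `Span` type and the choice to pad each merged chunk rather than each section are my own assumptions for illustration. The default values mirror the 5 second merge window, 30 second max merge size and 1 second padding mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Span:
    start: float  # seconds
    end: float    # seconds


def merge_sections(sections, merge_window=5.0, max_merge_size=30.0, padding=1.0):
    """Greedily join speech sections into chunks (illustrative only)."""
    chunks = []
    current = None
    for section in sections:
        if current is None:
            current = Span(section.start, section.end)
            continue
        gap = section.start - current.end
        merged_length = section.end - current.start
        if gap <= merge_window and merged_length <= max_merge_size:
            # Close enough and still small enough: extend the current chunk.
            current.end = section.end
        else:
            chunks.append(current)
            current = Span(section.start, section.end)
    if current is not None:
        chunks.append(current)
    # Pad every merged chunk, never going below the start of the audio.
    # Overlap between padded neighbours is not handled in this sketch.
    return [Span(max(0.0, c.start - padding), c.end + padding) for c in chunks]


speech = [Span(10, 20), Span(23, 28), Span(40, 45)]
for chunk in merge_sections(speech):
    print(chunk)
```

Here the first two sections are 3 seconds apart and get merged, while the third is 12 seconds away and becomes its own chunk.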

However, these silent areas are treated a bit differently when you select the "silero-vad" option (currently default) under VAD. The nearby chunks (in green) are expanded until they fill the subsequent silent area, and then sent directly to Whisper, or the silent area is passed as is to Whisper. This is what is meant by the "expanded" number in the samples you provided above. For instance, for sample #1, the total duration of the chunk is 17.578 seconds, out of which 8.596 seconds were subsequent silence that was included in the chunk.
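To make the "expanded" number concrete, here is a small calculation over the three samples from the first post. The helper `split_chunk` is just a name I made up; the durations are copied (rounded to milliseconds) from the log lines above.

```python
def split_chunk(duration, expanded):
    """Return (seconds before expansion, share of the input that is gap)."""
    original = duration - expanded
    return original, expanded / duration


samples = {
    "Sample 1": (17.578, 8.596),
    "Sample 2": (38.400, 17.844),
    "Sample 3": (30.720, 12.500),
}

for name, (duration, expanded) in samples.items():
    original, gap_share = split_chunk(duration, expanded)
    print(f"{name}: {original:.3f}s before expansion, {gap_share:.0%} of the input is gap")
```

In all three samples a large part of what Whisper receives (roughly 40 to 50 percent) is audio that the VAD did not classify as speech.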

You can avoid this by selecting the "silero-vad-skip-gaps" option, which will neither run Whisper on these silent areas nor expand nearby chunks, at the cost of potentially missing out on dialogue. As explained in the documentation:

silero-vad-skip-gaps
As above, but sections that doesn't contain speech according to Silero will be skipped. This will be slightly faster, but
may cause dialogue to be skipped.

Thanks for the picture. That is really useful!


I've done a bit more research and it seems that Whisper is actually chunking into 30s and zero padding the input if the input is shorter than 30s.
https://github.com/openai/whisper/discussions/838

I highly suspect that sending data that is significantly shorter than 30s to be transcribed results in all of these "Thank you for watching" lines, because Whisper is basically seeing a zero-padded input and thinks that the video has ended.
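A quick way to see how much zero padding a given chunk would get, assuming (as a simplification) that the input is cut into fixed 30s windows. As far as I know, Whisper actually moves its window based on the timestamps it predicts, so real window boundaries can differ. `window_padding` is a made-up helper name.

```python
WINDOW_SECONDS = 30.0


def window_padding(duration):
    """Return (audio_seconds, padding_seconds) for each fixed 30s window."""
    windows = []
    offset = 0.0
    while offset < duration:
        audio = min(WINDOW_SECONDS, duration - offset)
        windows.append((audio, WINDOW_SECONDS - audio))
        offset += WINDOW_SECONDS
    return windows


# Sample 3 from the first post: a 30.72s chunk.
for audio, padding in window_padding(30.72):
    print(f"audio: {audio:.2f}s, zero padding: {padding:.2f}s")
```

Under this assumption, the second window of Sample 3 would be almost entirely padding.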


Goal: I'm now looking for a method to send chunks to Whisper that are multiples of 30s (30s, 60s, 90s, etc.).
Is this something that can be done?

Example: Let's say the VAD detects audio at the following timestamps.
a) 10s-20s
b) 30s-35s
c) 42s-73s

Expected chunks to be sent to whisper:
10-40s, 42s-102s


This seems to be "silero-vad-expand-into-gaps", but I might be misunderstanding the description in the doc, because I'm not getting that behaviour.

time python3 cli.py --model large-v2 --language Japanese --task translate --verbose True \
--whisper_implementation faster-whisper --compute_type float32 \
--vad silero-vad-expand-into-gaps --vad_cpu_cores 4 --language Japanese \
--beam_size 1 --temperature 0 --output_dir output -- Sdx8kCr9DvQ.webm

I don't seem to be getting blocks that are multiples of 30s passed into Whisper.
Example:

Running whisper from  03:14:15.976  to  03:14:44.488 , duration:  28.512000000000626 expanded:  0.0 prompt:  None language:  ja
Running whisper from  03:14:44.488  to  03:15:12.482 , duration:  27.993999999998778 expanded:  3.2519999999985885 prompt:   I'll go play some Athletic language:  ja
Running whisper from  03:15:12.482  to  03:15:37.688 , duration:  25.20600000000195 expanded:  0.0 prompt:  None language:  ja
Running whisper from  03:15:37.688  to  03:16:09.858 , duration:  32.169999999998254 expanded:  3.763999999999214 prompt:   I want to aim for 45 seconds. language:  ja
 Thank you for watching.
Running whisper from  03:16:09.858  to  03:16:12.120 , duration:  2.2620000000006257 expanded:  0.0 prompt:  None language:  ja
Running whisper from  03:16:12.120  to  03:16:40.568 , duration:  28.4479999999985 expanded:  0.0 prompt:  None language:  ja

None of these has a duration that is a multiple of 30 seconds.
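For reference, the grouping I have in mind for the 10s-20s / 30s-35s / 42s-73s example could be sketched like this. This is purely hypothetical and not an existing WebUI option; `chunks_in_multiples` is a name invented here.

```python
import math


def chunks_in_multiples(sections, unit=30.0):
    """Group (start, end) speech sections into chunks whose length is a multiple of `unit`."""
    chunks = []
    for start, end in sections:
        if chunks and start < chunks[-1][1]:
            # Section begins inside the current chunk: grow the chunk
            # (in whole units) until it covers the section.
            chunk_start, chunk_end = chunks[-1]
            units = math.ceil((end - chunk_start) / unit)
            chunks[-1] = (chunk_start, max(chunk_end, chunk_start + unit * units))
        else:
            # Start a new chunk at the section start, at least one unit long.
            units = max(1, math.ceil((end - start) / unit))
            chunks.append((start, start + unit * units))
    return chunks


print(chunks_in_multiples([(10, 20), (30, 35), (42, 73)]))
```

Note that a chunk built this way can run past the end of the audio or into later speech, so a real implementation would need to handle both cases.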

Thanks for the picture. That is really useful!

No problem. 👍

I've done a bit more research and it seems that Whisper is actually chunking into 30s (... and) sending data that is significantly shorter than 30s to be transcribed results in all of these "Thank you for watching" (...)

Interesting, though there is logic within Whisper WebUI to clamp or filter out predicted text segments that exceed each chunk passed to Whisper. So if it is hallucinating these "Thank you for watching" lines in the padded areas of up to 30 seconds, the lines ought to be filtered out by VAD.py, unless they start within the original chunk timestamps.
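The idea behind that clamping can be sketched roughly as follows. This is not the actual code from VAD.py; `filter_segments` is a made-up name, and the segments are assumed to be dictionaries with `start`, `end` and `text` keys, with times relative to the start of the chunk.

```python
def filter_segments(segments, chunk_duration):
    """Drop segments starting after the chunk ends, and clamp the rest."""
    kept = []
    for segment in segments:
        if segment["start"] >= chunk_duration:
            # Predicted entirely in the padded area: discard.
            continue
        # Starts inside the chunk: keep it, but do not let it run past the end.
        kept.append({**segment, "end": min(segment["end"], chunk_duration)})
    return kept


segments = [
    {"start": 0.0, "end": 2.1, "text": "That was close"},
    {"start": 1.9, "end": 4.0, "text": "Thank you for watching!"},
    {"start": 5.0, "end": 7.0, "text": "Please subscribe to my channel!"},
]
print(filter_segments(segments, chunk_duration=2.262))
```

The second segment survives (clamped) because it starts inside the chunk, which is the exception mentioned above.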

Goal: I'm now looking for a method to send chunks to Whisper that are multiples of 30s (30s, 60s, 90s, etc.).
Is this something that can be done?
(...)
None of these has a duration that is a multiple of 30 seconds.

None of "silero-vad", "silero-vad-skip-gaps" or "silero-vad-expand-into-gaps" will send chunks to Whisper in multiples of 30 seconds (or whatever VAD - Max Merge Size(s) happens to be set to). They will try to merge sections (controlled by VAD - Merge Window (s)) into something that is close to 30 seconds, though as you can see a chunk can be much shorter, like 2.26 seconds.

You can force Whisper WebUI to use exact chunks of 30 seconds, however, by using the "periodic-vad" option, which simply subdivides the input into one chunk every 30 seconds (controlled by VAD - Max Merge Size(s)). For instance:

Running whisper from  00:00.000  to  00:30.000 , duration:  30 expanded:  0 prompt:  None language:  None
Loading faster whisper model large-v2 for device None
Running whisper from  00:30.000  to  01:00.000 , duration:  30 expanded:  0 prompt:  あの鍛えるランドはもう夏終わった と思います はい最近涼しくなってきました language:  ja
Running whisper from  01:00.000  to  01:30.000 , duration:  30 expanded:  0 prompt:  あーもう夏が終わったんじゃないって 感じるのは language:  ja
Running whisper from  01:30.000  to  02:00.000 , duration:  30 expanded:  0 prompt:  ノリJPティーチャーというアカウント で language:  ja
Running whisper from  02:00.000  to  02:30.000 , duration:  30 expanded:  0 prompt:  なぜなら7月ぐらいかな SNSで language:  ja
Running whisper from  02:30.000  to  03:00.000 , duration:  30 expanded:  0 prompt:  近くにあるからついつい手を伸ばして みてしまうんだけれどもでもそれでも language:  ja
Running whisper from  03:00.000  to  03:30.000 , duration:  30 expanded:  0 prompt:  なかなか気持ちが乗れなかった language:  ja
Running whisper from  03:30.000  to  04:00.000 , duration:  30 expanded:  0 prompt:  手紙は 手書きの時もあればタイピングをしてそれをプリントして送ってくれ language:  ja
Running whisper from  04:00.000  to  04:30.000 , duration:  30 expanded:  0 prompt:  プライベートレッスンも大好きなので 
...

This doesn't actually use a VAD though, so it will break up sentences or even words, introducing artifacts in the predicted text. But the chunks will be exactly 30 seconds, and the text from the previous chunk will be included as context (in the prompt part).
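Conceptually, the periodic subdivision amounts to something like the sketch below. It is an illustration only, not the WebUI's implementation; `periodic_chunks` is a hypothetical name, and the handling of the shorter final chunk is an assumption.

```python
def periodic_chunks(total_duration, period=30.0):
    """Cut [0, total_duration) into consecutive chunks of `period` seconds."""
    chunks = []
    start = 0.0
    while start < total_duration:
        # The last chunk is shorter when the audio length is not a multiple of period.
        chunks.append((start, min(start + period, total_duration)))
        start += period
    return chunks


print(periodic_chunks(75.0))
```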

Interesting, though there is logic within Whisper WebUI to clamp or filter out predicted text segments that exceed each chunk passed to Whisper. So if it is hallucinating these "Thank you for watching" lines in the padded areas of up to 30 seconds, the lines ought to be filtered out by VAD.py, unless they start within the original chunk timestamps.

I didn't realize that there was such logic.

I've added timestamps in https://huggingface.co/spaces/aadnk/whisper-webui/discussions/21 and can confirm that these happen before max_source_time.
So I'm not certain what is happening.

This only happens when we chunk the input. There are no issues with vad=none.
I guess this needs more investigation.

This happens for both the original whisper and faster-whisper.
Anyways, I'm just manually dropping "Thank you for watching" in my own code.
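For anyone wanting to do the same, a minimal sketch of such a post-filter is below. The phrase list is taken from the samples in this thread, the function name `drop_hallucinations` is arbitrary, and the approach is an assumption on how one might do it rather than anything built into Whisper or the WebUI.

```python
import re

# Phrases observed in the samples above; extend as needed.
PHRASES = [
    "Thank you for watching",
    "Please subscribe to my channel",
    "Please subscribe our channel",
]
PATTERN = re.compile(
    "|".join(re.escape(phrase) + r"[.!]*" for phrase in PHRASES),
    re.IGNORECASE,
)


def drop_hallucinations(lines):
    """Remove the known phrases and drop lines that become empty."""
    cleaned = []
    for line in lines:
        text = PATTERN.sub("", line).strip()
        if text:
            cleaned.append(text)
    return cleaned


lines = [
    "That was close",
    "Thank you for watching!",
    "Thank you for watching! Please subscribe to my channel!",
]
print(drop_hallucinations(lines))
```

The obvious downside is that a speaker who genuinely says "Thank you for watching" will be dropped as well.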

FlippFuzz changed discussion status to closed
