How do I return timestamps on the serverless Inference API?
Unfortunately, I have not had much luck with this either. I wanted to enable verbose responses to get the detected language, but so far nothing has worked.
First, I tried:
#################################################
import requests

# OpenAI-style form fields, all passed through files=
files = {
    "file": open(audio_file_path, "rb"),
    "model": "openai/whisper-large-v3-turbo",
    "response_format": "verbose_json",
    "timestamp_granularities[]": "word",
}
response = requests.post(url, headers=headers, files=files)
#################################################
Results: Transcription was a success but I got only the transcribed text.
Next, I tried:
#################################################
# Options passed as JSON alongside the multipart upload
options = {
    "parameters": {
        "return_timestamps": True,
        "response_format": "verbose_json",
    }
}
files = {
    "file": ("audio.mp3", output_audio_io, "audio/mpeg")
}
response = requests.post(api_url, headers=headers, files=files, json=options)
#################################################
Results: Transcription was a success but I got only the transcribed text.
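A possible reason the second attempt returns only text, as far as I can tell from how `requests` builds the body: when `files=` is passed, the `json=` argument does not end up in the request, so the `parameters` never reach the server. Here is a quick offline check that prepares the request without sending it. The URL is a placeholder and the audio bytes are dummy data, so treat this as a sketch for debugging rather than a fix:

```python
import io

import requests

options = {"parameters": {"return_timestamps": True}}
files = {"file": ("audio.mp3", io.BytesIO(b"dummy audio bytes"), "audio/mpeg")}

# Build the request without sending it; the URL is only a placeholder.
prepared = requests.Request(
    "POST", "https://example.invalid/asr", files=files, json=options
).prepare()

# Inspect what would actually go over the wire.
content_type = prepared.headers["Content-Type"]
options_sent = b"return_timestamps" in prepared.body

print(content_type.split(";")[0])
print(options_sent)
```

If the content type is multipart and `options_sent` is false, the options were dropped before the request left the client, which would match the text-only result.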
*** Update: Using the helpful comment from @trystoh, I was able to get timestamps, though I am still trying to enable verbose responses.
*** Update 2: Oops, according to https://huggingface.co/docs/api-inference/tasks/automatic-speech-recognition, it looks like I cannot get verbose responses. If anyone can point me in the right direction, I would greatly appreciate it.
Understood. Have you tried using a base64-encoded audio file?
The docs use tricky wording like “If not using parameters you can also just use an audio file directly”, so you may have to convert the audio to base64 first.
Let me know if I misunderstood, I spent all day on this 😂
Yeah, when I use the base64-encoded audio file, it tells me the payload is too big.
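For anyone landing here later, this is roughly how I read the base64 suggestion. It is a minimal sketch, assuming the JSON body shape described on the docs page linked above (`inputs` as a base64 string plus a `parameters` object). `build_asr_payload` and `encoded_size` are hypothetical helper names, and `api_url`, `headers` and `audio_file_path` are placeholders as in the earlier snippets. Base64 turns every 3 bytes into 4 characters, so the body grows by about a third, which may be part of why larger files hit the payload limit:

```python
import base64


def build_asr_payload(audio_bytes: bytes, return_timestamps: bool = True) -> dict:
    """Build a JSON-serializable request body (hypothetical helper).

    Assumes the API accepts {"inputs": <base64 string>, "parameters": {...}}.
    """
    return {
        "inputs": base64.b64encode(audio_bytes).decode("ascii"),
        "parameters": {"return_timestamps": return_timestamps},
    }


def encoded_size(num_bytes: int) -> int:
    """Length of the padded base64 text for num_bytes of input (4 chars per 3 bytes)."""
    return 4 * ((num_bytes + 2) // 3)


# Usage (not executed here; api_url, headers and audio_file_path are placeholders):
#   import requests
#   with open(audio_file_path, "rb") as f:
#       payload = build_asr_payload(f.read())
#   response = requests.post(api_url, headers=headers, json=payload)
#   print(response.json())
```

Checking `encoded_size` against the file size before sending gives a rough idea of whether the request will fit. I do not know the actual limit, so that part is left out.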