
I can't believe text normalization is so underdiscussed in streaming text-to-speech [D]

Kinda surprises me how little discussion there is about mistakes in streaming TTS models.

People look for natural-sounding readers, high voice quality, expressive speech. And most models don't look dumb there. They fail when you give them basic stuff like prices, dates, URLs, promo codes, phone numbers.
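To make the failure mode concrete, here's a minimal sketch of the kind of text normalization a TTS front end needs before synthesis. The patterns, word lists, and function names are my own illustration, not anything from the benchmark or a production rule set:

```python
import re

# Illustrative only: expand prices and promo codes into speakable words.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve"]

def spell_digits(s: str) -> str:
    """Read a digit string one digit at a time (phone numbers, promo codes)."""
    return " ".join(ONES[int(d)] for d in s if d.isdigit())

def normalize(text: str) -> str:
    # Prices like "$5" -> "five dollars" (cents handling omitted for brevity)
    text = re.sub(r"\$(\d+)\b",
                  lambda m: f"{ONES[int(m.group(1))]} dollars"
                  if int(m.group(1)) < len(ONES) else m.group(0),
                  text)
    # Promo codes like "SAVE20": letters spelled out, digits read individually
    text = re.sub(r"\b([A-Z]{2,})(\d+)\b",
                  lambda m: " ".join(m.group(1)) + " " + spell_digits(m.group(2)),
                  text)
    return text

print(normalize("Use code SAVE20 to get $5 off"))
# -> "Use code S A V E two zero to get five dollars off"
```

A model that skips this step reads "SAVE20" as a word and "$5" as "dollar five", which is exactly the kind of error that sounds dumb in production.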

So I went looking for info and found a benchmark that compares commercial real-time streaming TTS models on how they pronounce dates, URLs, acronyms, etc. They run 1000+ sentences across 31 categories, then use Gemini to judge how the results came out. https://async-vocie-ai-text-to-speech-normalization-benchmark.static.hf.space/index.html . Looks valid to me.
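The evaluation loop they describe (per-category sentences, LLM as judge) could be sketched roughly like this. Everything here is my guess at the shape, and `judge()` is a stub standing in for the Gemini call the benchmark actually makes:

```python
from collections import defaultdict

def judge(sentence: str, transcript: str) -> bool:
    # Stub: the real benchmark presumably asks Gemini whether the ASR
    # transcript verbalizes the tricky tokens (dates, prices, URLs) correctly.
    # Here we just compare the strings ignoring case and spaces.
    return transcript.lower().replace(" ", "") == sentence.lower().replace(" ", "")

def score(results):
    """results: iterable of (category, reference_sentence, asr_transcript)."""
    per_cat = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for category, sentence, transcript in results:
        per_cat[category][0] += int(judge(sentence, transcript))
        per_cat[category][1] += 1
    return {cat: passed / total for cat, (passed, total) in per_cat.items()}

print(score([("dates", "may first", "may first"),
             ("dates", "may first", "may one")]))
# -> {'dates': 0.5}
```

With a per-category breakdown like this you can see exactly which normalization classes a vendor's model trips on, rather than one blended accuracy number.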

Obviously this is a vendor benchmark, so I'm not taking it at face value, but the focus feels on point.

This has been one of the biggest challenges for us in production. I am curious how you guys deal with it in practice.

submitted by /u/lilitbroyan


Tagged with

#natural language processing
#text normalization
#streaming TTS
#voice quality
#expressive speech
#benchmark
#commercial models
#pronunciation
#dates
#URLs
#acronyms
#sentences
#production challenges