
> The big problem for open source speech recognition is training data.

What exactly is needed for training - audio recordings with transcripts, human validation of recognized text?

There are successful crowdsourced efforts for proofreading of OCR'ed text. Archive.org could host a CC-licensed archive of sound & transcripts.

Recognition of the human voice is almost like writing; hopefully everyone can have access to it.

Edit: how much disk space would be needed - TB or PB?



For example, the Switchboard corpus (300 h of transcribed 8 kHz audio) is about 16 GB.

That is a common size for LVCSR (large-vocabulary continuous speech recognition), and you need something in that range to get good performance (maybe 100h at minimum). In their academic papers, Google researchers usually use their own private training set, e.g. one with 1900h (see http://arxiv.org/pdf/1402.1128.pdf).
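
To put rough numbers on the TB-or-PB question, here is a back-of-the-envelope sketch in Python. The helper name raw_audio_bytes is made up for illustration, and the encoding parameters are assumptions rather than facts about either corpus: 2 bytes per sample and one channel for the 300h case, and a 16 kHz sample rate for the 1900h case. It counts uncompressed audio only and ignores transcripts, headers and compression.

```python
def raw_audio_bytes(hours, sample_rate_hz, bytes_per_sample=2, channels=1):
    """Size of uncompressed PCM audio in bytes.

    bytes_per_sample and channels are assumed defaults (16-bit mono),
    not values taken from any particular corpus.
    """
    seconds = hours * 3600
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)


def to_gb(num_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 10**9


# 300h at 8 kHz, assuming 16-bit mono
small = raw_audio_bytes(300, 8000)
print(f"300h @ 8 kHz:   {to_gb(small):.2f} GB")   # 17.28 GB

# 1900h, assuming 16 kHz 16-bit mono (the sample rate is a guess)
large = raw_audio_bytes(1900, 16000)
print(f"1900h @ 16 kHz: {to_gb(large):.2f} GB")   # 218.88 GB
```

Under these assumptions the 300h estimate lands in the same ballpark as the 16 GB figure, and even 1900h stays in the hundreds of GB, so the answer looks like GB to low TB rather than PB.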

Some crowdsourced effort to collect transcribed audio under a CC-licence would be great!


Maybe this? http://www.voxforge.org/home - "VoxForge was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac)." (Caveat: I have not been able to record with it from any of my machines; apparently I don't have the right plugin.)

Maybe also: https://librivox.org - has audiobooks read by volunteers, plus the book text.


The more data the better, although the relationship isn't linear.



