> The big problem for open source speech recognition is training data. What exac...

albertzeyer · on Oct 9, 2014

For example, the Switchboard corpus (300h, 8 kHz, transcribed audio) is about 16GB.
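
As a rough sanity check on that size, here is a back-of-the-envelope sketch. It assumes uncompressed 16-bit mono PCM, which is my assumption and not necessarily how the corpus is actually distributed; `pcm_size_bytes` is just a hypothetical helper name.

```python
def pcm_size_bytes(hours, sample_rate_hz, bytes_per_sample=2, channels=1):
    """Size in bytes of uncompressed PCM audio.

    bytes_per_sample=2 (16-bit) and channels=1 (mono) are assumptions,
    not facts about how Switchboard is stored.
    """
    seconds = hours * 3600
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)


size = pcm_size_bytes(hours=300, sample_rate_hz=8000)
print(f"{size / 1e9:.2f} GB, {size / 2**30:.2f} GiB")
# prints "17.28 GB, 16.09 GiB"
```

So 300h at 8 kHz lands in the same ballpark as the 16GB figure. The same audio at 16 kHz would be roughly twice that.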

That is a common size for LVCSR (large-vocabulary continuous speech recognition), and you need something in that range to get good performance (maybe 100h at minimum). In its academic papers, Google usually uses its own private training data set, e.g. one with 1900h (see http://arxiv.org/pdf/1402.1128.pdf).

Some crowdsourced effort to collect transcribed audio under a CC-licence would be great!

bainsfather · on Oct 9, 2014

Maybe this? http://www.voxforge.org/home - "VoxForge was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac)." (Caveat: I have not managed to record on it from any of my machines; apparently I don't have the right plugin.)

Maybe also: https://librivox.org - has audiobooks read by volunteers, plus the book text.

kansface · on Oct 9, 2014

The more data the better, although the relationship isn't linear.