First announced in November 2017, the first version of the National Speech Corpus (NSC) is now available for download. It contains 2,000 hours of locally accented audio and corresponding text transcriptions. The transcriptions contain more than 40,000 unique words, including local terms such as “Tanjong Pagar”, “ice kachang”, and “nasi lemak”. The data is made available under the Singapore Open Data Licence.
Automatic speech recognition engines are trained on multiple speech corpora (the plural of corpus) so that they can interpret and transcribe spoken words accurately. The NSC thus enables global technology providers to offer speech-related applications, such as voice assistants, for use in Singapore. The NSC will be continually updated.
Interested parties can request a copy of the corpus here.