The first version of the National Speech Corpus (NSC) was released on 22 Nov 2018 and contains 2,000 hours of locally accented audio with corresponding text transcriptions. The audio is read speech, recorded from scripts featuring words pertinent to Singapore’s context. It contains more than 40,000 unique words, including local terms such as “Tanjong Pagar”, “ice kachang”, and “nasi lemak”.
Version 2.0 of the NSC adds a further 1,000 hours of conversational speech on top of the existing 2,000 hours of read speech, and contains more than 70,000 unique words. Read speech corpora are generally suited to Artificial Intelligence (AI) applications in which user speech is articulated in a reading style, while conversational corpora can be used to train AI applications for users who tend to speak more spontaneously. The expansion therefore broadens the range of speech applications the NSC can support and enables further improvements in the accuracy of speech recognition technologies for Singapore across different styles of speech.
Who is it for?
Technology Providers and Developers, Institutes of Higher Learning (IHLs), Research Institutes (RIs) and Individuals are welcome to use the NSC.
The NSC improves the accuracy with which speech engines recognise and transcribe locally accented English.
Click here to download the Corpus.
What Our Partners Say
“We are very happy that we have been working with IMDA on the National Speech Corpus that will enable us on this goal. …. We want to give “EMMA” a uniquely Singaporean voice. … As we go along, this will help us engage with our customers in a much better manner.”
Mr. Pranav Seth, Senior Vice President & Head, Digital & Innovation (E-business, Business Transformation and Fintech & Innovation group) on how the NSC can improve OCBC’s chatbot to enhance consumer engagement.
Everybody loves inspiring Success Stories. Find out how the NSC has been implemented in the industry here.
1. What are the costs associated with using the NSC?
The NSC is made available via the Singapore Open Data Licence.
2. Do I need to have a Dropbox account to download the NSC?
Yes, users who are interested in downloading the NSC will require a Dropbox account.
3. How big is the NSC?
Currently, the NSC is 1 TB in size and is expected to grow over time.
4. Will there be future updates to the corpus?
The NSC will be continually updated over time. The current release version is V2.0.
For further enquiries on the National Speech Corpus, please contact DSL_Tech@imda.gov.sg.