National Speech Corpus (NSC)

About the National Speech Corpus (NSC)

The National Speech Corpus (NSC) is the first large-scale Singapore English corpus spearheaded by the Info-communications and Media Development Authority (IMDA) of Singapore. It aims to become an important source of open speech data for automatic speech recognition (ASR) research and speech-related applications. By harnessing the power of Artificial Intelligence (AI), the NSC is paving the way for innovative digital solutions and driving progress in Singapore's digital landscape.

There is a growing trend for people to use voice to interact with services, be it at home, at work, or in public spaces. However, the speech technologies that support these interactions can be inaccurate at recognising and transcribing locally accented English. To close this technology gap, IMDA introduced the National Speech Corpus (NSC).

The NSC helps improve the accuracy of speech engines in recognising and transcribing locally accented English. It can also contribute to speech synthesis technology, making it possible to produce an AI voice that is more familiar to Singaporeans, with local terms pronounced more accurately.

Benefits

As speech technology improves, speech engines tuned to the Singaporean English accent will enable Singapore to keep pace with current and future advancements in speech interfaces.

For example, Automatic Speech Recognition (ASR) may be used by telco call centres to transcribe calls for auditing and sentiment analysis. Chatbots can go beyond text, accurately understanding the local accent and replying in a familiar local accent with accurate pronunciations of street names and food.

Download the NSC tool to use it today!

FAQs

1. What are the costs associated with using the NSC?

The NSC is made available under the Singapore Open Data Licence.

2. Do I need to have a Dropbox account to download the NSC?

Yes, users who wish to download the NSC will need a Dropbox account. The folder will be shared with users via email.

3. How big is the NSC?

Currently, the NSC is approximately 1.2 TB in size.

4. Will there be future updates to the National Speech Corpus (NSC)?

There are currently no updates planned for the baseline corpora. The last update occurred in July 2021, when another 3 parts were added to the corpora.

5. How do I obtain the latest datasets (all 6 parts)? I downloaded the corpora when there were only 3 parts.

Simply re-register and you should receive a link to all 6 parts.

6. Is there a demo of the transcription or speech engine?

The speech corpus is a dataset of audio and transcripts, and not a speech engine. Do contact us if you wish to find industry contacts who have trained their speech engine with the NSC.

Contact

For further enquiries on the National Speech Corpus, please contact nsc@imda.gov.sg.
