Generative AI Evaluation Sandbox

Over 10 global players join Sandbox to develop evaluation benchmarks for trusted Gen AI
Sandbox is anchored on new draft catalogue proposing a common basis for understanding the current state of Large Language Model (LLM) evaluations

SINGAPORE – 31 OCT 2023

1. The Infocomm Media Development Authority (IMDA) and the AI Verify Foundation unveiled the first-of-its-kind Generative AI (Gen AI) Evaluation Sandbox today. The Sandbox will bring global ecosystem players together through concrete use cases, to enable the evaluation of trusted AI products. The Sandbox will make use of a new Evaluation Catalogue, as a shared resource, that sets out common baseline methods and recommendations for Large Language Models (LLMs).

2. This is part of the effort to have a common standard approach to assess Gen AI. The key risks and harms of LLMs were set out in our discussion paper entitled Generative AI: Implications for Trust and Governance¹. Efforts to tackle these harms have been piecemeal. To support broader safe and trustworthy adoption of Gen AI, IMDA is inviting industry partners to collaboratively build evaluation tools and capabilities in the Gen AI Evaluation Sandbox.

Sandbox will offer a common language for evaluation of Gen AI through the Catalogue

3. The Sandbox will provide a baseline by offering a research-based categorisation of current evaluation benchmarks and methods. The Catalogue provides an anchor by (a) compiling the existing commonly used technical testing tools and organising these tests according to what they test for and their methods; and (b) recommending a baseline set of evaluation tests for use in Gen AI products.

4. The catalogue can be accessed here. The AI Verify Foundation welcomes initial comments and feedback on this draft catalogue at info@aiverify.sg.

Sandbox will build up a body of knowledge on how Gen AI products should be tested

5. Beyond a starting baseline of evaluation tests, the way these tests are implemented across the ecosystem of those who build AI also needs to be developed. The Sandbox will help build evaluation capabilities beyond what currently resides with model developers, as testing of Gen AI should also include the application developers who build on top of the models. The Sandbox will also involve players in the third-party testing ecosystem, to help model developers understand what external testers would look for in responsible AI models. Where possible, each Sandbox use case should involve an upstream Gen AI model developer, a downstream application deployer and a third-party tester to demonstrate how the different players in the ecosystem can work together. By involving regulators like the Singapore Personal Data Protection Commission (PDPC), the Sandbox will provide a space for experimentation and development and allow all parties along the supply chain to be transparent about their needs.

Sandbox will develop new benchmarks and tests

6. It is anticipated that the Sandbox use cases will reveal gaps in the current landscape of Gen AI evaluations, particularly in domain-specific (e.g. Human Resources or Security) and culture-specific areas, which are currently under-developed. The Sandbox will develop benchmarks for evaluating model performance in specific areas that are important for use cases, and for countries like Singapore because of cultural and language specificities.

7. For a start, (a) key model developers like Google, Microsoft, Anthropic, IBM, NVIDIA, Stability.AI and Amazon Web Services (AWS); (b) app developers with concrete use cases like DataRobot, OCBC, Global Regulation Inc, Singtel and X0PA.AI; and (c) third-party testers such as Resaro.AI, Deloitte, EY and TÜV SÜD have joined the Sandbox. The full list of participants is available in Annex A (List of participants in sandbox). All participants in the Sandbox will aid in creating a more robust testing environment.

“With AWS, businesses have access to a variety of leading foundation models, cost effective infrastructure, along with enterprise-grade security and privacy to power their generative AI applications. The responsible use of generative AI technologies will transform entire industries and reimagine how work gets done. We look forward to being a part of IMDA’s Generative AI Evaluation Sandbox to provide businesses with the tools and guidance needed to build artificial intelligence and machine learning applications responsibly.” - Elsie Tan, Country Manager, Worldwide Public Sector, Singapore, AWS.

“At Microsoft, we are committed to the responsible use of AI. Guardrails aligning with our core principles are a fundamental feature of our products, and we curate best practices and tools to empower others to achieve the same. To this end, Microsoft is delighted to partner with IMDA and the AI Verify Foundation in launching the Generative AI Evaluation Sandbox. This unique initiative opens up possibilities for developers to evaluate their applications for adherence to common standards, advancing our shared goal of a trusted ecosystem.” - Mike Yeh, Chief Legal Counsel, Microsoft Asia

8. An example of a Sandbox project that uses the Catalogue as a resource to identify aspects for red-teaming² is the collaboration between Anthropic and IMDA. Anthropic, a leading developer of frontier Gen AI models with a focus on AI safety, will lend its expertise and methodology in red-teaming. IMDA will leverage Anthropic's models and research tooling platform to develop and tailor red-teaming methodologies for Singapore’s diverse linguistic and cultural landscape, for example by testing how well AI models perform in Singapore’s multilingual context.

“Promoting an ecosystem of independent, open source, and third-party model evaluations is critical to building safe and trustworthy AI. The AI Verify Foundation and Generative AI Sandbox is an important step in that direction. We appreciate our strong cooperation with IMDA and look forward to deepening our partnership,” said Dario Amodei, Co-Founder and CEO of Anthropic.

9. Singapore has made significant strides in building responsible AI in the last year, with the introduction of the AI Verify Foundation. The launch of this Sandbox marks the advancement of AI Verify into Generative AI by tapping on the collective power and contributions of the global open-source community.

10. AI Verify Foundation and IMDA invite interested model and app developers, and third-party testers to participate in this Sandbox.

Resources:

Annex A - List of participants in sandbox (111.92KB)
