Publications

Publications, preprints, and conference tutorials. See my full publication list on Google Scholar.

* Equal contribution

SEATauBench: Adapting Tool-Agent-User Evaluation Into Low-Resource Southeast Asian Languages

My Chiffon Nguyen^*, Aulia Adila^*, Saksorn Ruangtanusak^*, Kittiphat Leesombatwathana^*, Vissuta Gunawan Lim^*, Patomporn Payoungkhamdee, Samuel Cahyawijaya,

In Submission • 2026

Abstract

We introduce SEATauBench, the first agentic-focused evaluation framework for sovereign AI development in Southeast Asia, a region of strategic importance with over 700 million people. Despite growing regional evaluation efforts, existing multilingual agents show limited capability when operating in SEA languages, particularly in mixed-language scenarios. Through evaluation across multiple adaptation approaches, we find that while English agentic capabilities transfer to target language responses, performance degrades significantly when context is provided in SEA languages. We propose a translation-based mitigation strategy that preserves entity consistency while enabling agents to leverage English comprehension. SEATauBench establishes a rigorous benchmark for sovereign AI agent assessment, providing diagnostic tools to address capability gaps and support agentic AI development in diverse linguistic communities in the region.


Position: Synthetic Persona Needs Explicit Grounding and Standardized Reporting

Yucheng Lu, Xiaoyi Liu, Heming Liu, Hanwen Xing, Shirley Huang, Guanghui Min, Zhengyang Shan, My Chiffon Nguyen, , Xiaomin Li, Yuexing Hao

In Submission (Early Work) • 2026

Abstract

This position paper argues that synthetic personas used in Large Language Model (LLM) social simulations need explicit grounding. By grounding, we mean that researchers should state where persona information comes from, what social actor or population it is meant to represent, what evidence supports it, and what kinds of simulation claims it can justify. Without such grounding, persona-based simulations risk treating the appearance of plausible descriptions or vivid narratives as evidence of realism. We identify several recurring failure modes that explicit grounding can help expose, and argue that these failures should not be diagnosed only after simulation results are produced; they should be addressed at the level of persona construction and reporting. We conclude by proposing lightweight reporting standards for persona provenance, construction process, selection or sampling logic, internal consistency checks, model enactment checks, and intended inference.

Anthropogenic Regional Adaptation in Multimodal Vision-Language Model

Samuel Cahyawijaya, Peerat Limkonchotiwat, Tack Hwa Wong, Carlos Rafael Catalan, Manuel Antonio Rufino, Hitesh Laxmichand Patel, Amit Agarwal, Muhammad Reza Qorib, , Dun Li Chan, Sherissa Caren Djuniwar

Preprint • 2026

Abstract

While the field of vision-language (VL) has achieved remarkable success in integrating visual and textual information across multiple languages and domains, there is still no dedicated framework for assessing human-centric alignment in vision-language systems. We offer two contributions to address this gap. First, we introduce Anthropogenic Regional Adaptation: a novel paradigm that aims to optimize model relevance to specific regional contexts while ensuring the retention of global generalization capabilities. Second, we present a simple but effective adaptation method named Geographical-generalization-made-easy (GG-EZ), which utilizes regional data filtering and model merging. Through comprehensive experiments on 3 VL architectures (large vision-language models, text-to-image diffusion models, and vision-language embedding models) and a case study in Southeast Asia (SEA) regional adaptation, we demonstrate the importance of Anthropogenic Regional Adaptation and the effectiveness of GG-EZ, showing 5-15% gains in cultural relevance metrics across SEA while maintaining over 98% of global performance and even occasionally surpassing it. Our findings establish Anthropogenic Regional Alignment as a foundational paradigm towards applicability of multimodal vision-language models in diverse regions and demonstrate a simple-yet-effective baseline method that optimizes regional value alignment while preserving global generalization.


CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data

Pedro Ortiz Suarez, Laurie Burchell, Catherine Arnett, Rafael Mosquera-Gómez, Sara Hincapie-Monsalve, Thom Vaughan, Damian Stewart, Malte Ostendorff, , Kenton Murray, Sarah Luger

ACL 2026 Main Conference • 2026

Abstract

Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language models. In this paper, we introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. Many of the included languages have been previously under-served, making CommonLID a key resource for developing more representative high-quality text corpora. We show CommonLID's value by using it, alongside five other common evaluation sets, to test eight popular LID models. We analyse our results to situate our contribution and to provide an overview of the state of the art. In particular, we highlight that existing evaluations overestimate LID accuracy for many languages in the web domain. We make CommonLID and the code used to create it available under an open, permissive license.


Humanity's Last Exam: A benchmark of expert-level academic questions to assess AI capabilities

Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, , Alexandr Wang, Dan Hendrycks

Nature • 2026

Abstract

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve more than 90% accuracy on popular benchmarks such as Measuring Massive Multitask Language Understanding, limiting informed measurement of state-of-the-art LLM capabilities. Here, in response, we introduce Humanity’s Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be an expert-level closed-ended academic benchmark with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable but cannot be quickly answered by internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a marked gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.

