Computation and Language
☆ MetricX-25 and GemSpanEval: Google Translate Submissions to the WMT25 Evaluation Shared Task
Juraj Juraska, Tobias Domhan, Mara Finkelstein, Tetsuji Nakagawa, Geza Kovacs, Daniel Deutsch, Pidong Wang, Markus Freitag
In this paper, we present our submissions to the unified WMT25 Translation
Evaluation Shared Task. For the Quality Score Prediction subtask, we create a
new generation of MetricX with improvements in the input format and the
training protocol, while for the Error Span Detection subtask we develop a new
model, GemSpanEval, trained to predict error spans along with their severities
and categories. Both systems are based on the state-of-the-art multilingual
open-weights model Gemma 3, fine-tuned on publicly available WMT data. We
demonstrate that MetricX-25, adapting Gemma 3 to an encoder-only architecture
with a regression head on top, can be trained to effectively predict both MQM
and ESA quality scores, and significantly outperforms its predecessor. Our
decoder-only GemSpanEval model, on the other hand, we show to be competitive in
error span detection with xCOMET, a strong encoder-only sequence-tagging
baseline. With error span detection formulated as a generative task, we
instruct the model to also output the context for each predicted error span,
thus ensuring that error spans are identified unambiguously.
comment: Accepted to WMT25
☆ ComboBench: Can LLMs Manipulate Physical Devices to Play Virtual Reality Games?
Shuqing Li, Jiayi Yan, Chenyu Niu, Jen-tse Huang, Yun Peng, Wenxuan Wang, Yepang Liu, Michael R. Lyu
Virtual Reality (VR) games require players to translate high-level semantic
actions into precise device manipulations using controllers and head-mounted
displays (HMDs). While humans intuitively perform this translation based on
common sense and embodied understanding, whether Large Language Models (LLMs)
can effectively replicate this ability remains underexplored. This paper
introduces a benchmark, ComboBench, evaluating LLMs' capability to translate
semantic actions into VR device manipulation sequences across 262 scenarios
from four popular VR games: Half-Life: Alyx, Into the Radius, Moss: Book II,
and Vivecraft. We evaluate seven LLMs, including GPT-3.5, GPT-4, GPT-4o,
Gemini-1.5-Pro, LLaMA-3-8B, Mixtral-8x7B, and GLM-4-Flash, compared against
annotated ground truth and human performance. Our results reveal that while
top-performing models like Gemini-1.5-Pro demonstrate strong task decomposition
capabilities, they still struggle with procedural reasoning and spatial
understanding compared to humans. Performance varies significantly across
games, suggesting sensitivity to interaction complexity. Few-shot examples
substantially improve performance, indicating potential for targeted
enhancement of LLMs' VR manipulation capabilities. We release all materials at
https://sites.google.com/view/combobench.
★ Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents
Yueqi Song, Ketan Ramaneti, Zaid Sheikh, Ziru Chen, Boyu Gou, Tianbao Xie, Yiheng Xu, Danyang Zhang, Apurva Gandhi, Fan Yang, Joseph Liu, Tianyue Ou, Zhihao Yuan, Frank Xu, Shuyan Zhou, Xingyao Wang, Xiang Yue, Tao Yu, Huan Sun, Yu Su, Graham Neubig
Public research results on large-scale supervised finetuning of AI agents
remain relatively rare, since the collection of agent training data presents
unique challenges. In this work, we argue that the bottleneck is not a lack of
underlying data sources, but that a large variety of data is fragmented across
heterogeneous formats, tools, and interfaces. To this end, we introduce the
agent data protocol (ADP), a light-weight representation language that serves
as an "interlingua" between agent datasets in diverse formats and unified agent
training pipelines downstream. The design of ADP is expressive enough to
capture a large variety of tasks, including API/tool use, browsing, coding,
software engineering, and general agentic workflows, while remaining simple to
parse and train on without engineering at a per-dataset level. In experiments,
we unified a broad collection of 13 existing agent training datasets into ADP
format, and converted the standardized ADP data into training-ready formats for
multiple agent frameworks. We performed SFT on these data, and demonstrated an
average performance gain of ~20% over corresponding base models, and delivers
state-of-the-art or near-SOTA performance on standard coding, browsing, tool
use, and research benchmarks, without domain-specific tuning. All code and data
are released publicly, in the hope that ADP could help lower the barrier to
standardized, scalable, and reproducible agent training.
★ Tongyi DeepResearch Technical Report
Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, Kuan Li, Liangcai Su, Litu Ou, Liwen Zhang, Pengjun Xie, Rui Ye, Wenbiao Yin, Xinmiao Yu, Xinyu Wang, Xixi Wu, Xuanzhong Chen, Yida Zhao, Zhen Zhang, Zhengwei Tao, Zhongwang Zhang, Zile Qiao, Chenxi Wang, Donglei Yu, Gang Fu, Haiyang Shen, Jiayin Yang, Jun Lin, Junkai Zhang, Kui Zeng, Li Yang, Hailong Yin, Maojia Song, Ming Yan, Peng Xia, Qian Xiao, Rui Min, Ruixue Ding, Runnan Fang, Shaowei Chen, Shen Huang, Shihang Wang, Shihao Cai, Weizhou Shen, Xiaobin Wang, Xin Guan, Xinyu Geng, Yingcheng Shi, Yuning Wu, Zhuo Chen, Zijian Li, Yong Jiang
We present Tongyi DeepResearch, an agentic large language model, which is
specifically designed for long-horizon, deep information-seeking research
tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is
developed through an end-to-end training framework that combines agentic
mid-training and agentic post-training, enabling scalable reasoning and
information seeking across complex tasks. We design a highly scalable data
synthesis pipeline that is fully automatic, without relying on costly human
annotation, and empowers all training stages. By constructing customized
environments for each stage, our system enables stable and consistent
interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total
parameters, with only 3.3 billion activated per token, achieves
state-of-the-art performance across a range of agentic deep research
benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH,
WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We
open-source the model, framework, and complete solutions to empower the
community.
comment: https://tongyi-agent.github.io/blog
☆ ParallelMuse: Agentic Parallel Thinking for Deep Information Seeking
Baixuan Li, Dingchu Zhang, Jialong Wu, Wenbiao Yin, Zhengwei Tao, Yida Zhao, Liwen Zhang, Haiyang Shen, Runnan Fang, Pengjun Xie, Jingren Zhou, Yong Jiang
Parallel thinking expands exploration breadth, complementing the deep
exploration of information-seeking (IS) agents to further enhance
problem-solving capability. However, conventional parallel thinking faces two
key challenges in this setting: inefficiency from repeatedly rolling out from
scratch, and difficulty in integrating long-horizon reasoning trajectories
during answer generation, as limited context capacity prevents full
consideration of the reasoning process. To address these issues, we propose
ParallelMuse, a two-stage paradigm designed for deep IS agents. The first
stage, Functionality-Specified Partial Rollout, partitions generated sequences
into functional regions and performs uncertainty-guided path reuse and
branching to enhance exploration efficiency. The second stage, Compressed
Reasoning Aggregation, exploits reasoning redundancy to losslessly compress
information relevant to answer derivation and synthesize a coherent final
answer. Experiments across multiple open-source agents and benchmarks
demonstrate up to 62% performance improvement with a 10--30% reduction in
exploratory token consumption.
☆ AgentFold: Long-Horizon Web Agents with Proactive Context Management
Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai Su, Liwen Zhang, Zile Qiao, Xinyu Wang, Pengjun Xie, Fei Huang, Siheng Chen, Jingren Zhou, Yong Jiang
LLM-based web agents show immense promise for information seeking, yet their
effectiveness on long-horizon tasks is hindered by a fundamental trade-off in
context management. Prevailing ReAct-based agents suffer from context
saturation as they accumulate noisy, raw histories, while methods that fixedly
summarize the full history at each step risk the irreversible loss of critical
details. Addressing these, we introduce AgentFold, a novel agent paradigm
centered on proactive context management, inspired by the human cognitive
process of retrospective consolidation. AgentFold treats its context as a
dynamic cognitive workspace to be actively sculpted, rather than a passive log
to be filled. At each step, it learns to execute a `folding' operation, which
manages its historical trajectory at multiple scales: it can perform granular
condensations to preserve vital, fine-grained details, or deep consolidations
to abstract away entire multi-step sub-tasks. The results on prominent
benchmarks are striking: with simple supervised fine-tuning (without continual
pre-training or RL), our AgentFold-30B-A3B agent achieves 36.2% on BrowseComp
and 47.3% on BrowseComp-ZH. Notably, this performance not only surpasses or
matches open-source models of a dramatically larger scale, such as the
DeepSeek-V3.1-671B-A37B, but also surpasses leading proprietary agents like
OpenAI's o4-mini.
comment: 26 pages, 9 figures
☆ WebLeaper: Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking
Zhengwei Tao, Haiyang Shen, Baixuan Li, Wenbiao Yin, Jialong Wu, Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Liwen Zhang, Xinyu Wang, Pengjun Xie, Jingren Zhou, Yong Jiang
Large Language Model (LLM)-based agents have emerged as a transformative
approach for open-ended problem solving, with information seeking (IS) being a
core capability that enables autonomous reasoning and decision-making. While
prior research has largely focused on improving retrieval depth, we observe
that current IS agents often suffer from low search efficiency, which in turn
constrains overall performance. A key factor underlying this inefficiency is
the sparsity of target entities in training tasks, which limits opportunities
for agents to learn and generalize efficient search behaviors. To address these
challenges, we propose WebLeaper, a framework for constructing high-coverage IS
tasks and generating efficient solution trajectories. We formulate IS as a
tree-structured reasoning problem, enabling a substantially larger set of
target entities to be embedded within a constrained context. Leveraging curated
Wikipedia tables, we propose three variants for synthesizing IS tasks, Basic,
Union, and Reverse-Union, to systematically increase both IS efficiency and
efficacy. Finally, we curate training trajectories by retaining only those that
are simultaneously accurate and efficient, ensuring that the model is optimized
for both correctness and search performance. Extensive experiments on both
basic and comprehensive settings, conducted on five IS benchmarks, BrowserComp,
GAIA, xbench-DeepSearch, WideSearch, and Seal-0, demonstrate that our method
consistently achieves improvements in both effectiveness and efficiency over
strong baselines.
☆ AgentFrontier: Expanding the Capability Frontier of LLM Agents with ZPD-Guided Data Synthesis
Xuanzhong Chen, Zile Qiao, Guoxin Chen, Liangcai Su, Zhen Zhang, Xinyu Wang, Pengjun Xie, Fei Huang, Jingren Zhou, Yong Jiang
Training large language model agents on tasks at the frontier of their
capabilities is key to unlocking advanced reasoning. We introduce a data
synthesis approach inspired by the educational theory of the Zone of Proximal
Development (ZPD), which defines this frontier as tasks an LLM cannot solve
alone but can master with guidance. To operationalize this, we present the
AgentFrontier Engine, an automated pipeline that synthesizes high-quality,
multidisciplinary data situated precisely within the LLM's ZPD. This engine
supports both continued pre-training with knowledge-intensive data and targeted
post-training on complex reasoning tasks. From the same framework, we derive
the ZPD Exam, a dynamic and automated benchmark designed to evaluate agent
capabilities on these frontier tasks. We train AgentFrontier-30B-A3B model on
our synthesized data, which achieves state-of-the-art results on demanding
benchmarks like Humanity's Last Exam, even surpassing some leading proprietary
agents. Our work demonstrates that a ZPD-guided approach to data synthesis
offers a scalable and effective path toward building more capable LLM agents.
comment: https://tongyi-agent.github.io/blog/introducing-tongyi-deep-research/
☆ Repurposing Synthetic Data for Fine-grained Search Agent Supervision
Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang, Baixuan Li, Maojia Song, Zhuo Chen, Chenxi Wang, Xinyu Wang, Kewei Tu, Pengjun Xie, Jingren Zhou, Yong Jiang
LLM-based search agents are increasingly trained on entity-centric synthetic
data to solve complex, knowledge-intensive tasks. However, prevailing training
methods like Group Relative Policy Optimization (GRPO) discard this rich entity
information, relying instead on sparse, outcome-based rewards. This critical
limitation renders them unable to distinguish informative "near-miss"
samples-those with substantially correct reasoning but a flawed final
answer-from complete failures, thus discarding valuable learning signals. We
address this by leveraging the very entities discarded during training. Our
empirical analysis reveals a strong positive correlation between the number of
ground-truth entities identified during an agent's reasoning process and final
answer accuracy. Building on this insight, we introduce Entity-aware Group
Relative Policy Optimization (E-GRPO), a novel framework that formulates a
dense entity-aware reward function. E-GRPO assigns partial rewards to incorrect
samples proportional to their entity match rate, enabling the model to
effectively learn from these "near-misses". Experiments on diverse
question-answering (QA) and deep research benchmarks show that E-GRPO
consistently and significantly outperforms the GRPO baseline. Furthermore, our
analysis reveals that E-GRPO not only achieves superior accuracy but also
induces more efficient reasoning policies that require fewer tool calls,
demonstrating a more effective and sample-efficient approach to aligning search
agents.
☆ STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence
Zihan Liu, Zhikang Niu, Qiuyang Xiao, Zhisheng Zheng, Ruoqi Yuan, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Jianze Liang, Xie Chen, Leilei Sun, Dahua Lin, Jiaqi Wang
Despite rapid progress in Multi-modal Large Language Models and Large
Audio-Language Models, existing audio benchmarks largely test semantics that
can be recovered from text captions, masking deficits in fine-grained
perceptual reasoning. We formalize audio 4D intelligence that is defined as
reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to
measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six
attributes under absolute and relative regimes) with a Holistic Spatio-Temporal
Reasoning setting that includes segment reordering for continuous and discrete
processes and spatial tasks spanning static localization, multi-source
relations, and dynamic trajectories. Our data curation pipeline uses two
methods to ensure high-quality samples. For foundational tasks, we use
procedurally synthesized and physics-simulated audio. For holistic data, we
follow a four-stage process that includes human annotation and final selection
based on human performance. Unlike prior benchmarks where caption-only
answering reduces accuracy slightly, STAR-Bench induces far larger drops
(-31.5\% temporal, -35.2\% spatial), evidencing its focus on linguistically
hard-to-describe cues. Evaluating 19 models reveals substantial gaps compared
with humans and a capability hierarchy: closed-source models are bottlenecked
by fine-grained perception, while open-source models lag across perception,
knowledge, and reasoning. Our STAR-Bench provides critical insights and a clear
path forward for developing future models with a more robust understanding of
the physical world.
comment: Homepage: https://internlm.github.io/StarBench/
★ SPICE: Self-Play In Corpus Environments Improves Reasoning
Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xian Li, Sainbayar Sukhbaatar, Jack Lanchantin, Jason Weston
Self-improving systems require environmental interaction for continuous
adaptation. We introduce SPICE (Self-Play In Corpus Environments), a
reinforcement learning framework where a single model acts in two roles: a
Challenger that mines documents from a large corpus to generate diverse
reasoning tasks, and a Reasoner that solves them. Through adversarial dynamics,
the Challenger creates an automatic curriculum at the frontier of the
Reasoner's capability, while corpus grounding provides the rich,
near-inexhaustible external signal necessary for sustained improvement. Unlike
existing ungrounded self-play methods that offer more limited benefits, SPICE
achieves consistent gains across mathematical (+8.9%) and general reasoning
(+9.8%) benchmarks on multiple model families. Our analysis reveals how
document grounding is a key ingredient in SPICE to continuously generate its
own increasingly challenging goals and achieve them, enabling sustained
self-improvement.
☆ Dissecting Role Cognition in Medical LLMs via Neuronal Ablation
Large language models (LLMs) have gained significant traction in medical
decision support systems, particularly in the
context of medical question answering and role-playing simulations. A common
practice, Prompt-Based Role Playing (PBRP),
instructs models to adopt different clinical roles (e.g., medical students,
residents, attending physicians) to simulate varied
professional behaviors. However, the impact of such role prompts on model
reasoning capabilities remains unclear. This
study introduces the RP-Neuron-Activated Evaluation Framework(RPNA) to
evaluate whether role prompts induce distinct,
role-specific cognitive processes in LLMs or merely modify linguistic style.
We test this framework on three medical QA
datasets, employing neuron ablation and representation analysis techniques to
assess changes in reasoning pathways. Our
results demonstrate that role prompts do not significantly enhance the
medical reasoning abilities of LLMs. Instead, they
primarily affect surface-level linguistic features, with no evidence of
distinct reasoning pathways or cognitive differentiation
across clinical roles. Despite superficial stylistic changes, the core
decision-making mechanisms of LLMs remain uniform
across roles, indicating that current PBRP methods fail to replicate the
cognitive complexity found in real-world medical
practice. This highlights the limitations of role-playing in medical AI and
emphasizes the need for models that simulate genuine
cognitive processes rather than linguistic imitation.We have released the
related code in the following repository:https:
//github.com/IAAR-Shanghai/RolePlay_LLMDoctor
comment: 15 pages, 9 figures
☆ InteractComp: Evaluating Search Agents With Ambiguous Queries
Mingyi Deng, Lijun Huang, Yani Fan, Jiayi Zhang, Fashen Ren, Jinyi Bai, Fuzhen Yang, Dayi Miao, Zhaoyang Yu, Yifan Wu, Yanfei Zhang, Fengwei Teng, Yingjia Wan, Song Hu, Yude Li, Xin Jin, Conghao Hu, Haoyu Li, Qirui Fu, Tai Zhong, Xinyu Wang, Xiangru Tang, Nan Tang, Chenglin Wu, Yuyu Luo
Language agents have demonstrated remarkable potential in web search and
information retrieval. However, these search agents assume user queries are
complete and unambiguous, an assumption that diverges from reality where users
begin with incomplete queries requiring clarification through interaction. Yet
most agents lack interactive mechanisms during the search process, and existing
benchmarks cannot assess this capability. To address this gap, we introduce
InteractComp, a benchmark designed to evaluate whether search agents can
recognize query ambiguity and actively interact to resolve it during search.
Following the principle of easy to verify, interact to disambiguate, we
construct 210 expert-curated questions across 9 domains through a
target-distractor methodology that creates genuine ambiguity resolvable only
through interaction. Evaluation of 17 models reveals striking failure: the best
model achieves only 13.73% accuracy despite 71.50% with complete context,
exposing systematic overconfidence rather than reasoning deficits. Forced
interaction produces dramatic gains, demonstrating latent capability current
strategies fail to engage. Longitudinal analysis shows interaction capabilities
stagnated over 15 months while search performance improved seven-fold,
revealing a critical blind spot. This stagnation, coupled with the immediate
feedback inherent to search tasks, makes InteractComp a valuable resource for
both evaluating and training interaction capabilities in search agents. The
code is available at https://github.com/FoundationAgents/InteractComp.
☆ MQM Re-Annotation: A Technique for Collaborative Evaluation of Machine Translation
Human evaluation of machine translation is in an arms race with translation
model quality: as our models get better, our evaluation methods need to be
improved to ensure that quality gains are not lost in evaluation noise. To this
end, we experiment with a two-stage version of the current state-of-the-art
translation evaluation paradigm (MQM), which we call MQM re-annotation. In this
setup, an MQM annotator reviews and edits a set of pre-existing MQM
annotations, that may have come from themselves, another human annotator, or an
automatic MQM annotation system. We demonstrate that rater behavior in
re-annotation aligns with our goals, and that re-annotation results in
higher-quality annotations, mostly due to finding errors that were missed
during the first pass.
☆ Evolving Diagnostic Agents in a Virtual Clinical Environment
Pengcheng Qiu, Chaoyi Wu, Junwei Liu, Qiaoyu Zheng, Yusheng Liao, Haowen Wang, Yun Yue, Qianrui Fan, Shuai Zhen, Jian Wang, Jinjie Gu, Yanfeng Wang, Ya Zhang, Weidi Xie
In this paper, we present a framework for training large language models
(LLMs) as diagnostic agents with reinforcement learning, enabling them to
manage multi-turn diagnostic processes, adaptively select examinations, and
commit to final diagnoses. Unlike instruction-tuned models trained on static
case summaries, our method acquires diagnostic strategies through interactive
exploration and outcome-based feedback. Our contributions are fourfold: (i) We
present DiagGym, a diagnostics world model trained with electronic health
records that emits examination outcomes conditioned on patient history and
recommended examination, serving as a virtual clinical environment for
realistic diagnosis training and evaluation; (ii) We train DiagAgent via
end-to-end, multi-turn reinforcement learning to learn diagnostic policies that
optimize both information yield and diagnostic accuracy; (iii) We introduce
DiagBench, a diagnostic benchmark comprising 750 cases with physician-validated
examination recommendations and 99 cases annotated with 973 physician-written
rubrics on diagnosis process; (iv) we demonstrate superior performance across
diverse diagnostic settings. DiagAgent significantly outperforms 10
state-of-the-art LLMs, including DeepSeek-v3 and GPT-4o, as well as two
prompt-engineered agents. In single-turn settings, DiagAgent achieves 9.34%
higher diagnostic accuracy and 44.03% improvement in examination recommendation
hit ratio. In end-to-end settings, it delivers 15.12% increase in diagnostic
accuracy and 23.09% boost in examination recommendation F1 score. In
rubric-based evaluation, it surpasses the next-best model, Claude-sonnet-4, by
7.1% in weighted rubric score. These findings indicate that learning policies
in interactive clinical environments confers dynamic and clinically meaningful
diagnostic management abilities unattainable through passive training alone.
☆ Optimizing Retrieval for RAG via Reinforced Contrastive Learning
As retrieval-augmented generation (RAG) becomes increasingly widespread, the
role of information retrieval (IR) is shifting from retrieving information for
human users to retrieving contextual knowledge for artificial intelligence (AI)
systems, where relevance becomes difficult to define or annotate beforehand. To
address this challenge, we propose R3, a Retrieval framework optimized for RAG
through trialand-feedback Reinforced contrastive learning. Unlike prior
approaches that rely on annotated or synthetic data for supervised fine-tuning,
R3 enables the retriever to dynamically explore and optimize relevance within
the RAG environment. During training, the retrieved results interact with the
environment to produce contrastive signals that automatically guide the
retriever's self-improvement. Extensive experiments across diverse tasks
demonstrate that R3 improves RAG performance by 5.2% over the original
retriever and surpasses state-of-the-art retrievers by 4.9%, while achieving
comparable results to LLM-augmented retrieval and RAG systems built on
post-trained or instruction-tuned LLMs. It is both efficient and practical,
requiring only 4 GPUs and completing training within a single day.
☆ Quantifying the Effects of Word Length, Frequency, and Predictability on Dyslexia
We ask where, and under what conditions, dyslexic reading costs arise in a
large-scale naturalistic reading dataset. Using eye-tracking aligned to
word-level features (word length, frequency, and predictability), we model how
each feature influences dyslexic time costs. We find that all three features
robustly change reading times in both typical and dyslexic readers, and that
dyslexic readers show stronger sensitivities to each, especially
predictability. Counterfactual manipulations of these features substantially
narrow the dyslexic-control gap by about one third, with predictability showing
the strongest effect, followed by length and frequency. These patterns align
with dyslexia theories that posit heightened demands on linguistic working
memory and phonological encoding, and they motivate further work on lexical
complexity and parafoveal preview benefits to explain the remaining gap. In
short, we quantify when extra dyslexic costs arise, how large they are, and
offer actionable guidance for interventions and computational models for
dyslexics.
☆ OpenReward: Learning to Reward Long-form Agentic Tasks via Reinforcement Learning
Ziyou Hu, Zhengliang Shi, Minghang Zhu, Haitao Li, Teng Sun, Pengjie Ren, Suzan Verberne, Zhaochun Ren
Reward models (RMs) have become essential for aligning large language models
(LLMs), serving as scalable proxies for human evaluation in both training and
inference. However, existing RMs struggle on knowledge-intensive and long-form
tasks, where evaluating correctness requires grounding beyond the model's
internal knowledge. This limitation hinders them from reliably discriminating
subtle quality differences, especially when external evidence is necessary. To
address this, we introduce OpenRM, a tool-augmented long-form reward model that
systematically judges open-ended responses by invoking external tools to gather
relevant evidence. We train OpenRM with Group Relative Policy Optimization
(GRPO) on over 27K synthesized pairwise examples generated through a
controllable data synthesis framework. The training objective jointly
supervises intermediate tool usage and final outcome accuracy, incentivizing
our reward model to learn effective evidence-based judgment strategies.
Extensive experiments on three newly-collected datasets and two widely-used
benchmarks demonstrate that OpenRM substantially outperforms existing reward
modeling approaches. As a further step, we integrate OpenRM into both
inference-time response selection and training-time data selection. This yields
consistent gains in downstream LLM alignment tasks, highlighting the potential
of tool-augmented reward models for scaling reliable long-form evaluation.
☆ "Mm, Wat?" Detecting Other-initiated Repair Requests in Dialogue
Maintaining mutual understanding is a key component in human-human
conversation to avoid conversation breakdowns, in which repair, particularly
Other-Initiated Repair (OIR, when one speaker signals trouble and prompts the
other to resolve), plays a vital role. However, Conversational Agents (CAs)
still fail to recognize user repair initiation, leading to breakdowns or
disengagement. This work proposes a multimodal model to automatically detect
repair initiation in Dutch dialogues by integrating linguistic and prosodic
features grounded in Conversation Analysis. The results show that prosodic cues
complement linguistic features and significantly improve the results of
pretrained text and audio embeddings, offering insights into how different
features interact. Future directions include incorporating visual cues,
exploring multilingual and cross-context corpora to assess the robustness and
generalizability.
comment: 9 pages
★ Relative Scaling Laws for LLMs
Scaling laws describe how language models improve with additional data,
parameters, and compute. While widely used, they are typically measured on
aggregate test sets. Aggregate evaluations yield clean trends but average over
heterogeneous subpopulations, obscuring performance disparities. We introduce
relative scaling laws, which track how performance gaps between test
distributions evolve with scale rather than focusing solely on absolute error.
Using 255 decoder-only Transformers trained under matched-compute (IsoFLOP)
budgets from $10^{18}$--$10^{20}$ FLOPs on standard pretraining datasets, we
find diverse trajectories: academic domains on MMLU converge toward parity;
regional English dialects shift depending on population size; and clusters of
AI risk behaviours split, with capability- and influence-related risks
increasing during pretraining while adversarial risks do not. These results
show that although scaling improves overall performance, it is not a universal
equalizer. To support further study, we release all model checkpoints from this
work to enable practitioners to measure relative alongside traditional scaling
laws, in order to better prioritize robustness challenges in light of the
bitter lesson.
☆ Zero-Shot Cross-Lingual Transfer using Prefix-Based Adaptation
With the release of new large language models (LLMs) like Llama and Mistral,
zero-shot cross-lingual transfer has become increasingly feasible due to their
multilingual pretraining and strong generalization capabilities. However,
adapting these decoder-only LLMs to new tasks across languages remains
challenging. While parameter-efficient fine-tuning (PeFT) techniques like
Low-Rank Adaptation (LoRA) are widely used, prefix-based techniques such as
soft prompt tuning, prefix tuning, and Llama Adapter are less explored,
especially for zero-shot transfer in decoder-only models. We present a
comprehensive study of three prefix-based methods for zero-shot cross-lingual
transfer from English to 35+ high- and low-resource languages. Our analysis
further explores transfer across linguistic families and scripts, as well as
the impact of scaling model sizes from 1B to 24B. With Llama 3.1 8B, prefix
methods outperform LoRA-baselines by up to 6% on the Belebele benchmark.
Similar improvements were observed with Mistral v0.3 7B as well. Despite using
only 1.23M learning parameters with prefix tuning, we achieve consistent
improvements across diverse benchmarks. These findings highlight the potential
of prefix-based techniques as an effective and scalable alternative to LoRA,
particularly in low-resource multilingual settings.
comment: 12 Pages
☆ Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs NeurIPS 2025
The quadratic cost of attention hinders the scalability of long-context LLMs,
especially in resource-constrained settings. Existing static sparse methods
such as sliding windows or global tokens utilizes the sparsity of attention to
reduce the cost of attention, but poorly adapts to the content-dependent
variations in attention due to their staticity. While previous work has
proposed several dynamic approaches to improve flexibility, they still depend
on predefined templates or heuristic mechanisms. Such strategies reduce
generality and prune tokens that remain contextually important, limiting their
accuracy across diverse tasks. To tackle these bottlenecks of existing methods
for long-context modeling, we introduce Dynamic Hierarchical Sparse Attention
(DHSA), a data-driven framework that dynamically predicts attention sparsity
online without retraining. Our proposed DHSA adaptively segments sequences into
variable-length chunks, then computes chunk representations by aggregating the
token embeddings within each chunk. To avoid the bias introduced by varying
chunk lengths, we apply length-normalized aggregation that scales the averaged
embeddings by the square root of the chunk size. Finally, DHSA upsamples the
chunk-level similarity scores to token level similarities to calculate
importance scores that determine which token-level interactions should be
preserved. Our experiments on Gemma2 with Needle-in-a-Haystack Test and
LongBench show that DHSA matches dense attention in accuracy, while reducing
prefill latency by 20-60% and peak memory usage by 35%. Compared to other
representative baselines such as block sparse attention, DHSA achieves
consistently higher accuracy (6-18% relative gains) with comparable or lower
cost, offering an efficient and adaptable solution for long-context on-device
LLMs.
comment: Accepted to NeurIPS 2025 Workshop on Efficient Reasoning
☆ Diffusion LLM with Native Variable Generation Lengths: Let [EOS] Lead the Way
Diffusion-based large language models (dLLMs) have exhibited substantial
potential for parallel text generation, which may enable more efficient
generation compared to autoregressive models. However, current dLLMs suffer
from fixed generation lengths, which indicates the generation lengths of dLLMs
have to be determined before decoding as a hyper-parameter, leading to issues
in efficiency and flexibility. To solve these problems, in this work, we
propose to train a diffusion LLM with native variable generation lengths,
abbreviated as dLLM-Var. Concretely, we aim to train a model to accurately
predict the [EOS] token in the generated text, which makes a dLLM be able to
natively infer in a block diffusion manner, while still maintaining the ability
of global bi-directional (full) attention and high parallelism. Experiments on
standard benchmarks demonstrate that our method achieves a 30.1x speedup over
traditional dLLM inference paradigms and a 2.4x speedup relative to
autoregressive models such as Qwen and Llama. Our method achieves higher
accuracy and faster inference, elevating dLLMs beyond mere academic novelty and
supporting their practical use in real-world applications. Codes and models
have been released.
★ ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization
Guoxin Chen, Jing Wu, Xinjie Chen, Wayne Xin Zhao, Ruihua Song, Chengxi Li, Kai Fan, Dayiheng Liu, Minpeng Liao
Autoformalization, which translates natural language mathematics into
machine-verifiable formal statements, is critical for using formal mathematical
reasoning to solve math problems stated in natural language. While Large
Language Models can generate syntactically correct formal statements, they
often fail to preserve the original problem's semantic intent. This limitation
arises from the LLM approaches' treating autoformalization as a simplistic
translation task which lacks mechanisms for self-reflection and iterative
refinement that human experts naturally employ. To address these issues, we
propose ReForm, a Reflective Autoformalization method that tightly integrates
semantic consistency evaluation into the autoformalization process. This
enables the model to iteratively generate formal statements, assess its
semantic fidelity, and self-correct identified errors through progressive
refinement. To effectively train this reflective model, we introduce
Prospective Bounded Sequence Optimization (PBSO), which employs different
rewards at different sequence positions to ensure that the model develops both
accurate autoformalization and correct semantic validations, preventing
superficial critiques that would undermine the purpose of reflection. Extensive
experiments across four autoformalization benchmarks demonstrate that ReForm
achieves an average improvement of 17.2 percentage points over the strongest
baselines. To further ensure evaluation reliability, we introduce
ConsistencyCheck, a benchmark of 859 expert-annotated items that not only
validates LLMs as judges but also reveals that autoformalization is inherently
difficult: even human experts produce semantic errors in up to 38.5% of cases.
comment: Ongoing Work
☆ ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers?
Christine Ye, Sihan Yuan, Suchetha Cooray, Steven Dillmann, Ian L. V. Roque, Dalya Baron, Philipp Frank, Sergio Martin-Alvarez, Nolan Koblischke, Frank J Qu, Diyi Yang, Risa Wechsler, Ioana Ciuca
Frontier AI agents show increasing promise as scientific research assistants,
and may eventually be useful for extended, open-ended research workflows.
However, in order to use agents for novel research, we must first assess the
underlying faithfulness and correctness of their work. To evaluate agents as
research assistants, we introduce ReplicationBench, an evaluation framework
that tests whether agents can replicate entire research papers drawn from the
astrophysics literature. Astrophysics, where research relies heavily on
archival data and computational study while requiring little real-world
experimentation, is a particularly useful testbed for AI agents in scientific
research. We split each paper into tasks which require agents to replicate the
paper's core contributions, including the experimental setup, derivations, data
analysis, and codebase. Each task is co-developed with the original paper
authors and targets a key scientific result, enabling objective evaluation of
both faithfulness (adherence to original methods) and correctness (technical
accuracy of results). ReplicationBench is extremely challenging for current
frontier language models: even the best-performing language models score under
20%. We analyze ReplicationBench trajectories in collaboration with domain
experts and find a rich, diverse set of failure modes for agents in scientific
research. ReplicationBench establishes the first benchmark of paper-scale,
expert-validated astrophysics research tasks, reveals insights about agent
performance generalizable to other domains of data-driven science, and provides
a scalable framework for measuring AI agents' reliability in scientific
research.
☆ BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation ICASSP 2026
Automatic Speech Recognition (ASR) systems, despite large multilingual
training, struggle in out-of-domain and low-resource scenarios where labeled
data is scarce. We propose BEARD (BEST-RQ Encoder Adaptation with Re-training
and Distillation), a novel framework designed to adapt Whisper's encoder using
unlabeled data. Unlike traditional self-supervised learning methods, BEARD
uniquely combines a BEST-RQ objective with knowledge distillation from a frozen
teacher encoder, ensuring the encoder's complementarity with the pre-trained
decoder. Our experiments focus on the ATCO2 corpus from the challenging Air
Traffic Control (ATC) communications domain, characterized by non-native
speech, noise, and specialized phraseology. Using about 5,000 hours of
untranscribed speech for BEARD and 2 hours of transcribed speech for
fine-tuning, the proposed approach significantly outperforms previous baseline
and fine-tuned model, achieving a relative improvement of 12% compared to the
fine-tuned model. To the best of our knowledge, this is the first work to use a
self-supervised learning objective for domain adaptation of Whisper.
comment: Submitted to ICASSP 2026
★ Open Korean Historical Corpus: A Millennia-Scale Diachronic Collection of Public Domain Texts
The history of the Korean language is characterized by a discrepancy between
its spoken and written forms and a pivotal shift from Chinese characters to the
Hangul alphabet. However, this linguistic evolution has remained largely
unexplored in NLP due to a lack of accessible historical corpora. To address
this gap, we introduce the Open Korean Historical Corpus, a large-scale, openly
licensed dataset spanning 1,300 years and 6 languages, as well as
under-represented writing systems like Korean-style Sinitic (Idu) and
Hanja-Hangul mixed script. This corpus contains 18 million documents and 5
billion tokens from 19 sources, ranging from the 7th century to 2025. We
leverage this resource to quantitatively analyze major linguistic shifts: (1)
Idu usage peaked in the 1860s before declining sharply; (2) the transition from
Hanja to Hangul was a rapid transformation starting around 1890; and (3) North
Korea's lexical divergence causes modern tokenizers to produce up to 51 times
higher out-of-vocabulary rates. This work provides a foundational resource for
quantitative diachronic analysis by capturing the history of the Korean
language. Moreover, it can serve as a pre-training corpus for large language
models, potentially improving their understanding of Sino-Korean vocabulary in
modern Hangul as well as archaic writing systems.
comment: Dataset and code available at https://github.com/seyoungsong/OKHC
☆ Dark & Stormy: Modeling Humor in the Worst Sentences Ever Written
Textual humor is enormously diverse and computational studies need to account
for this range, including intentionally bad humor. In this paper, we curate and
analyze a novel corpus of sentences from the Bulwer-Lytton Fiction Contest to
better understand "bad" humor in English. Standard humor detection models
perform poorly on our corpus, and an analysis of literary devices finds that
these sentences combine features common in existing humor datasets (e.g., puns,
irony) with metaphor, metafiction and simile. LLMs prompted to synthesize
contest-style sentences imitate the form but exaggerate the effect by
over-using certain literary devices, and including far more novel
adjective-noun bigrams than human writers. Data, code and analysis are
available at https://github.com/venkatasg/bulwer-lytton
☆ Levée d'ambiguïtés par grammaires locales
Many words are ambiguous in terms of their part of speech (POS). However,
when a word appears in a text, this ambiguity is generally much reduced.
Disambiguating POS involves using context to reduce the number of POS
associated with words, and is one of the main challenges of lexical tagging.
The problem of labeling words by POS frequently arises in natural language
processing, for example for spelling correction, grammar or style checking,
expression recognition, text-to-speech conversion, text corpus analysis, etc.
Lexical tagging systems are thus useful as an initial component of many natural
language processing systems. A number of recent lexical tagging systems produce
multiple solutions when the text is lexically ambiguous or the uniquely correct
solution cannot be found. These contributions aim to guarantee a zero silence
rate: the correct tag(s) for a word must never be discarded. This objective is
unrealistic for systems that tag each word uniquely. This article concerns a
lexical disambiguation method adapted to the objective of a zero silence rate
and implemented in Silberztein's INTEX system (1993). We present here a formal
description of this method. We show that to verify a local disambiguation
grammar in this framework, it is not sufficient to consider the transducer
paths separately: one needs to verify their interactions. Similarly, if a
combination of multiple transducers is used, the result cannot be predicted by
considering them in isolation. Furthermore, when examining the initial labeling
of a text as produced by INTEX, ideas for disambiguation rules come
spontaneously, but grammatical intuitions may turn out to be inaccurate, often
due to an unforeseen construction or ambiguity. If a zero silence rate is
targeted, local grammars must be carefully tested. This is where a detailed
specification of what a grammar will do once applied to texts would be
necessary.
comment: in French language
★ Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs
Huanyu Zhang, Wenshan Wu, Chengzu Li, Ning Shang, Yan Xia, Yangyu Huang, Yifan Zhang, Li Dong, Zhang Zhang, Liang Wang, Tieniu Tan, Furu Wei
While Multimodal Large Language Models (MLLMs) excel at visual understanding,
they often struggle in complex scenarios that require visual planning and
imagination. Inspired by how humans use sketching as a form of visual thinking
to develop and communicate ideas, we introduce Latent Sketchpad, a framework
that equips MLLMs with an internal visual scratchpad. The internal visual
representations of MLLMs have traditionally been confined to perceptual
understanding. We repurpose them to support generative visual thought without
compromising reasoning ability. Building on frontier MLLMs, our approach
integrates visual generation directly into their native autoregressive
reasoning process. It allows the model to interleave textual reasoning with the
generation of visual latents. These latents guide the internal thought process
and can be translated into sketch images for interpretability. To realize this,
we introduce two components: a Context-Aware Vision Head autoregressively
produces visual representations, and a pretrained Sketch Decoder renders these
into human-interpretable images. We evaluate the framework on our new dataset
MazePlanning. Experiments across various MLLMs show that Latent Sketchpad
delivers comparable or even superior reasoning performance to their backbone.
It further generalizes across distinct frontier MLLMs, including Gemma3 and
Qwen2.5-VL. By extending model's textual reasoning to visual thinking, our
framework opens new opportunities for richer human-computer interaction and
broader applications. More details and resources are available on our project
page: https://latent-sketchpad.github.io/.
☆ CritiCal: Can Critique Help LLM Uncertainty or Confidence Calibration?
Qing Zong, Jiayu Liu, Tianshi Zheng, Chunyang Li, Baixuan Xu, Haochen Shi, Weiqi Wang, Zhaowei Wang, Chunkit Chan, Yangqiu Song
Accurate confidence calibration in Large Language Models (LLMs) is critical
for safe use in high-stakes domains, where clear verbalized confidence enhances
user trust. Traditional methods that mimic reference confidence expressions
often fail to capture the reasoning needed for accurate confidence assessment.
We propose natural language critiques as a solution, ideally suited for
confidence calibration, as precise gold confidence labels are hard to obtain
and often require multiple generations. This paper studies how natural language
critiques can enhance verbalized confidence, addressing: (1) What to critique:
uncertainty (question-focused) or confidence (answer-specific)? Analysis shows
confidence suits multiple-choice tasks, while uncertainty excels in open-ended
scenarios. (2) How to critique: self-critique or critique calibration training?
We propose Self-Critique, enabling LLMs to critique and optimize their
confidence beyond mere accuracy, and CritiCal, a novel Critique Calibration
training method that leverages natural language critiques to improve confidence
calibration, moving beyond direct numerical optimization. Experiments show that
CritiCal significantly outperforms Self-Critique and other competitive
baselines, even surpassing its teacher model, GPT-4o, in complex reasoning
tasks. CritiCal also shows robust generalization in out-of-distribution
settings, advancing LLM's reliability.
☆ A word association network methodology for evaluating implicit biases in LLMs compared to humans
As Large language models (LLMs) become increasingly integrated into our
lives, their inherent social biases remain a pressing concern. Detecting and
evaluating these biases can be challenging because they are often implicit
rather than explicit in nature, so developing evaluation methods that assess
the implicit knowledge representations of LLMs is essential. We present a novel
word association network methodology for evaluating implicit biases in LLMs
based on simulating semantic priming within LLM-generated word association
networks. Our prompt-based approach taps into the implicit relational
structures encoded in LLMs, providing both quantitative and qualitative
assessments of bias. Unlike most prompt-based evaluation methods, our method
enables direct comparisons between various LLMs and humans, providing a
valuable point of reference and offering new insights into the alignment of
LLMs with human cognition. To demonstrate the utility of our methodology, we
apply it to both humans and several widely used LLMs to investigate social
biases related to gender, religion, ethnicity, sexual orientation, and
political party. Our results reveal both convergences and divergences between
LLM and human biases, providing new perspectives on the potential risks of
using LLMs. Our methodology contributes to a systematic, scalable, and
generalizable framework for evaluating and comparing biases across multiple
LLMs and humans, advancing the goal of transparent and socially responsible
language technologies.
comment: 24 pages, 13 figures, 3 tables
☆ Talk2Ref: A Dataset for Reference Prediction from Scientific Talks
Scientific talks are a growing medium for disseminating research, and
automatically identifying relevant literature that grounds or enriches a talk
would be highly valuable for researchers and students alike. We introduce
Reference Prediction from Talks (RPT), a new task that maps long, and
unstructured scientific presentations to relevant papers. To support research
on RPT, we present Talk2Ref, the first large-scale dataset of its kind,
containing 6,279 talks and 43,429 cited papers (26 per talk on average), where
relevance is approximated by the papers cited in the talk's corresponding
source publication. We establish strong baselines by evaluating
state-of-the-art text embedding models in zero-shot retrieval scenarios, and
propose a dual-encoder architecture trained on Talk2Ref. We further explore
strategies for handling long transcripts, as well as training for domain
adaptation. Our results show that fine-tuning on Talk2Ref significantly
improves citation prediction performance, demonstrating both the challenges of
the task and the effectiveness of our dataset for learning semantic
representations from spoken scientific content. The dataset and trained models
are released under an open license to foster future research on integrating
spoken scientific communication into citation recommendation systems.
☆ Mitigating Hallucination in Large Language Models (LLMs): An Application-Oriented Survey on RAG, Reasoning, and Agentic Systems
Hallucination remains one of the key obstacles to the reliable deployment of
large language models (LLMs), particularly in real-world applications. Among
various mitigation strategies, Retrieval-Augmented Generation (RAG) and
reasoning enhancement have emerged as two of the most effective and widely
adopted approaches, marking a shift from merely suppressing hallucinations to
balancing creativity and reliability. However, their synergistic potential and
underlying mechanisms for hallucination mitigation have not yet been
systematically examined. This survey adopts an application-oriented perspective
of capability enhancement to analyze how RAG, reasoning enhancement, and their
integration in Agentic Systems mitigate hallucinations. We propose a taxonomy
distinguishing knowledge-based and logic-based hallucinations, systematically
examine how RAG and reasoning address each, and present a unified framework
supported by real-world applications, evaluations, and benchmarks.
comment: 25 pages, 7 figures, 3 tables
☆ Iterative Critique-Refine Framework for Enhancing LLM Personalization
Durga Prasad Maram, Dhruvin Gandhi, Zonghai Yao, Gayathri Akkinapalli, Franck Dernoncourt, Yu Wang, Ryan A. Rossi, Nesreen K. Ahmed
Personalized text generation requires models not only to produce coherent
text but also to align with a target user's style, tone, and topical focus.
Existing retrieval-augmented approaches such as LaMP and PGraphRAG enrich
profiles with user and neighbor histories, but they stop at generation and
often yield outputs that drift in tone, topic, or style. We present PerFine, a
unified, training-free critique-refine framework that enhances personalization
through iterative, profile-grounded feedback. In each iteration, an LLM
generator produces a draft conditioned on the retrieved profile, and a critic
LLM - also conditioned on the same profile - provides structured feedback on
tone, vocabulary, sentence structure, and topicality. The generator then
revises, while a novel knockout strategy retains the stronger draft across
iterations. We further study additional inference-time strategies such as
Best-of-N and Topic Extraction to balance quality and efficiency. Across Yelp,
Goodreads, and Amazon datasets, PerFine consistently improves personalization
over PGraphRAG, with GEval gains of +7-13%, steady improvements over 3-5
refinement iterations, and scalability with increasing critic size. These
results highlight that post-hoc, profile-aware feedback offers a powerful
paradigm for personalized LLM generation that is both training-free and
model-agnostic.
☆ Charting the European LLM Benchmarking Landscape: A New Taxonomy and a Set of Best Practices LREC 2026
While new benchmarks for large language models (LLMs) are being developed
continuously to catch up with the growing capabilities of new models and AI in
general, using and evaluating LLMs in non-English languages remains a
little-charted landscape. We give a concise overview of recent developments in
LLM benchmarking, and then propose a new taxonomy for the categorization of
benchmarks that is tailored to multilingual or non-English use scenarios. We
further propose a set of best practices and quality standards that could lead
to a more coordinated development of benchmarks for European languages. Among
other recommendations, we advocate for a higher language and culture
sensitivity of evaluation methods.
comment: 12 pages, 1 figure. Submitted to the LREC 2026 conference
☆ SPARTA: Evaluating Reasoning Segmentation Robustness through Black-Box Adversarial Paraphrasing in Text Autoencoder Latent Space
Viktoriia Zinkovich, Anton Antonov, Andrei Spiridonov, Denis Shepelev, Andrey Moskalenko, Daria Pugacheva, Elena Tutubalina, Andrey Kuznetsov, Vlad Shakhuro
Multimodal large language models (MLLMs) have shown impressive capabilities
in vision-language tasks such as reasoning segmentation, where models generate
segmentation masks based on textual queries. While prior work has primarily
focused on perturbing image inputs, semantically equivalent textual
paraphrases-crucial in real-world applications where users express the same
intent in varied ways-remain underexplored. To address this gap, we introduce a
novel adversarial paraphrasing task: generating grammatically correct
paraphrases that preserve the original query meaning while degrading
segmentation performance. To evaluate the quality of adversarial paraphrases,
we develop a comprehensive automatic evaluation protocol validated with human
studies. Furthermore, we introduce SPARTA-a black-box, sentence-level
optimization method that operates in the low-dimensional semantic latent space
of a text autoencoder, guided by reinforcement learning. SPARTA achieves
significantly higher success rates, outperforming prior methods by up to 2x on
both the ReasonSeg and LLMSeg-40k datasets. We use SPARTA and competitive
baselines to assess the robustness of advanced reasoning segmentation models.
We reveal that they remain vulnerable to adversarial paraphrasing-even under
strict semantic and grammatical constraints. All code and data will be released
publicly upon acceptance.
☆ Law in Silico: Simulating Legal Society with LLM-Based Agents
Since real-world legal experiments are often costly or infeasible, simulating
legal societies with Artificial Intelligence (AI) systems provides an effective
alternative for verifying and developing legal theory, as well as supporting
legal administration. Large Language Models (LLMs), with their world knowledge
and role-playing capabilities, are strong candidates to serve as the foundation
for legal society simulation. However, the application of LLMs to simulate
legal systems remains underexplored. In this work, we introduce Law in Silico,
an LLM-based agent framework for simulating legal scenarios with individual
decision-making and institutional mechanisms of legislation, adjudication, and
enforcement. Our experiments, which compare simulated crime rates with
real-world data, demonstrate that LLM-based agents can largely reproduce
macro-level crime trends and provide insights that align with real-world
observations. At the same time, micro-level simulations reveal that a
well-functioning, transparent, and adaptive legal system offers better
protection of the rights of vulnerable individuals.
☆ Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content NeurIPS 2025
Abdullah Mushtaq, Rafay Naeem, Ezieddin Elmahjub, Ibrahim Ghaznavi, Shawqi Al-Maliki, Mohamed Abdallah, Ala Al-Fuqaha, Junaid Qadir
Large language models are increasingly used for Islamic guidance, but risk
misquoting texts, misapplying jurisprudence, or producing culturally
inconsistent responses. We pilot an evaluation of GPT-4o, Ansari AI, and Fanar
on prompts from authentic Islamic blogs. Our dual-agent framework uses a
quantitative agent for citation verification and six-dimensional scoring (e.g.,
Structure, Islamic Consistency, Citations) and a qualitative agent for
five-dimensional side-by-side comparison (e.g., Tone, Depth, Originality).
GPT-4o scored highest in Islamic Accuracy (3.93) and Citation (3.38), Ansari AI
followed (3.68, 3.32), and Fanar lagged (2.76, 1.82). Despite relatively strong
performance, models still fall short in reliably producing accurate Islamic
content and citations -- a paramount requirement in faith-sensitive writing.
GPT-4o had the highest mean quantitative score (3.90/5), while Ansari AI led
qualitative pairwise wins (116/200). Fanar, though trailing, introduces
innovations for Islamic and Arabic contexts. This study underscores the need
for community-driven benchmarks centering Muslim perspectives, offering an
early step toward more reliable AI in Islamic knowledge and other high-stakes
domains such as medicine, law, and journalism.
comment: Accepted at 39th Conference on Neural Information Processing Systems
(NeurIPS 2025) Workshop: 5th Muslims in Machine Learning (MusIML) Workshop
♻ ☆ Retrieval-Augmented Generation-based Relation Extraction
Information Extraction (IE) is a transformative process that converts
unstructured text data into a structured format by employing entity and
relation extraction (RE) methodologies. The identification of the relation
between a pair of entities plays a crucial role within this framework. Despite
the existence of various techniques for relation extraction, their efficacy
heavily relies on access to labeled data and substantial computational
resources. In addressing these challenges, Large Language Models (LLMs) emerge
as promising solutions; however, they might return hallucinating responses due
to their own training data. To overcome these limitations, Retrieved-Augmented
Generation-based Relation Extraction (RAG4RE) in this work is proposed,
offering a pathway to enhance the performance of relation extraction tasks.
This work evaluated the effectiveness of our RAG4RE approach utilizing
different LLMs. Through the utilization of established benchmarks, such as
TACRED, TACREV, Re-TACRED, and SemEval RE datasets, our aim is to
comprehensively evaluate the efficacy of our RAG4RE approach. In particularly,
we leverage prominent LLMs including Flan T5, Llama2, and Mistral in our
investigation. The results of our study demonstrate that our RAG4RE approach
surpasses performance of traditional RE approaches based solely on LLMs,
particularly evident in the TACRED dataset and its variations. Furthermore, our
approach exhibits remarkable performance compared to previous RE methodologies
across both TACRED and TACREV datasets, underscoring its efficacy and potential
for advancing RE tasks in natural language processing.
comment: published at the Semantic Web journal. The last version is available:
https://doi.org/10.1177/22104968251385519
♻ ☆ Arena-Lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons
As Large Language Models (LLMs) expand across domains, LLM judges have become
essential for systems evaluation. Current benchmarks typically compare system
outputs against baselines. This baseline-mediated approach, though convenient,
yields lower reliability than direct comparison between systems. We propose
Arena-Lite which integrates tournament structure on top of head-to-head
comparison. The application of a tournament structure and direct comparison
eliminates the need for baseline outputs, reduces the number of required
comparisons, and allows higher reliability in system rankings. We conducted two
experiments: (1) controlled stochastic modeling and (2) empirical validation
with a real LLM judge. Those experiments collectively demonstrate that
Arena-Lite consistently achieves higher reliability with fewer comparisons,
even with smaller datasets or weaker judges. We release an easy-to-use web
demonstration and code to foster adoption of Arena-Lite, streamlining model
selection across research and industry communities. Arena-Lite demo and code
are available on
\href{https://huggingface.co/spaces/NCSOFT/ArenaLite}{https://huggingface.co/spaces/NCSOFT/ArenaLite}
comment: 8 pages for main body, 19 pages in total
♻ ☆ Says Who? Effective Zero-Shot Annotation of Focalization
Focalization describes the way in which access to narrative information is
restricted or controlled based on the knowledge available to knowledge of the
narrator. It is encoded via a wide range of lexico-grammatical features and is
subject to reader interpretation. Even trained annotators frequently disagree
on correct labels, suggesting this task is both qualitatively and
computationally challenging. In this work, we test how well five contemporary
large language model (LLM) families and two baselines perform when annotating
short literary excerpts for focalization. Despite the challenging nature of the
task, we find that LLMs show comparable performance to trained human
annotators, with GPT-4o achieving an average F1 of 84.79%. Further, we
demonstrate that the log probabilities output by GPT-family models frequently
reflect the difficulty of annotating particular excerpts. Finally, we provide a
case study analyzing sixteen Stephen King novels, demonstrating the usefulness
of this approach for computational literary studies and the insights gleaned
from examining focalization at scale.
comment: Accepted at CHR 2025
♻ ★ TableTime: Reformulating Time Series Classification as Training-Free Table Understanding with Large Language Models
Large language models (LLMs) have demonstrated their effectiveness in
multivariate time series classification (MTSC). Effective adaptation of LLMs
for MTSC necessitates informative data representations. Existing LLM-based
methods directly encode embeddings for time series within the latent space of
LLMs from scratch to align with semantic space of LLMs. Despite their
effectiveness, we reveal that these methods conceal three inherent bottlenecks:
(1) they struggle to encode temporal and channel-specific information in a
lossless manner, both of which are critical components of multivariate time
series; (2) it is much difficult to align the learned representation space with
the semantic space of the LLMs; (3) they require task-specific retraining,
which is both computationally expensive and labor-intensive. To bridge these
gaps, we propose TableTime, which reformulates MTSC as a table understanding
task. Specifically, TableTime introduces the following strategies: (1) convert
multivariate time series into a tabular form, thus minimizing information loss
to the greatest extent; (2) represent tabular time series in text format to
achieve natural alignment with the semantic space of LLMs; (3) design a
reasoning framework that integrates contextual text information, neighborhood
assistance, multi-path inference and problem decomposition to enhance the
reasoning ability of LLMs and realize zero-shot classification. Extensive
experiments performed on 10 publicly representative datasets from UEA archive
verify the superiorities of the TableTime.
♻ ☆ BrowseConf: Confidence-Guided Test-Time Scaling for Web Agents
Litu Ou, Kuan Li, Huifeng Yin, Liwen Zhang, Zhongwang Zhang, Xixi Wu, Rui Ye, Zile Qiao, Pengjun Xie, Jingren Zhou, Yong Jiang
Confidence in LLMs is a useful indicator of model uncertainty and answer
reliability. Existing work mainly focused on single-turn scenarios, while
research on confidence in complex multi-turn interactions is limited. In this
paper, we investigate whether LLM-based search agents have the ability to
communicate their own confidence through verbalized confidence scores after
long sequences of actions, a significantly more challenging task compared to
outputting confidence in a single interaction. Experimenting on open-source
agentic models, we first find that models exhibit much higher task accuracy at
high confidence while having near-zero accuracy when confidence is low. Based
on this observation, we propose Test-Time Scaling (TTS) methods that use
confidence scores to determine answer quality, encourage the model to try again
until reaching a satisfactory confidence level. Results show that our proposed
methods significantly reduce token consumption while demonstrating competitive
performance compared to baseline fixed budget TTS methods.
comment: 25 pages
♻ ☆ The Hawthorne Effect in Reasoning Models: Evaluating and Steering Test Awareness NeurIPS 2025
Reasoning-focused LLMs sometimes alter their behavior when they detect that
they are being evaluated, which can lead them to optimize for test-passing
performance or to comply more readily with harmful prompts if real-world
consequences appear absent. We present the first quantitative study of how such
"test awareness" impacts model behavior, particularly its performance on
safety-related tasks. We introduce a white-box probing framework that (i)
linearly identifies awareness-related activations and (ii) steers models toward
or away from test awareness while monitoring downstream performance. We apply
our method to different state-of-the-art open-weight reasoning LLMs across both
realistic and hypothetical tasks (denoting tests or simulations). Our results
demonstrate that test awareness significantly impacts safety alignment (such as
compliance with harmful requests and conforming to stereotypes) with effects
varying in both magnitude and direction across models. By providing control
over this latent effect, our work aims to provide a stress-test mechanism and
increase trust in how we perform safety evaluations.
comment: NeurIPS 2025 (Spotlight). Code is available at:
https://github.com/microsoft/Test_Awareness_Steering
♻ ☆ TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs
Accelerating the inference of large language models (LLMs) has been a
critical challenge in generative AI. Speculative decoding (SD) substantially
improves LLM inference efficiency. However, its utility is limited by a
fundamental constraint: the draft and target models must share the same
vocabulary, thus limiting the herd of available draft models and often
necessitating the training of a new model from scratch. Inspired by Dynamic
Time Warping (DTW), a classic algorithm for aligning time series, we propose
the algorithm TokenTiming for universal speculative decoding. It operates by
re-encoding the draft token sequence to get a new target token sequence, and
then uses DTW to build a mapping to transfer the probability distributions for
speculative sampling. Benefiting from this, our method accommodates mismatched
vocabularies and works with any off-the-shelf models without retraining and
modification. We conduct comprehensive experiments on various tasks,
demonstrating 1.57x speedup. This work enables a universal approach for draft
model selection, making SD a more versatile and practical tool for LLM
acceleration.
♻ ☆ Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays
BERT and its variants are extensively explored for automated scoring.
However, a limit of 512 tokens for these encoder-based models showed the
deficiency in automated scoring of long essays. Thus, this research explores
generative language models for automated scoring of long essays via
summarization and prompting. The results revealed great improvement of scoring
accuracy with QWK increased from 0.822 to 0.8878 for the Learning Agency Lab
Automated Essay Scoring 2.0 dataset.
comment: 19 pages, 5 Tables 7 Figures, Presentation at Artificial Intelligence
in Measurement and Education Conference (AIME-Con)
♻ ☆ AutoJudge: Judge Decoding Without Manual Annotation
Roman Garipov, Fedor Velikonivtsev, Ivan Ermakov, Ruslan Svirschevski, Vage Egiazarian, Max Ryabinin
We introduce AutoJudge, a method that accelerates large language model (LLM)
inference with task-specific lossy speculative decoding. Instead of matching
the original model output distribution token-by-token, we identify which of the
generated tokens affect the downstream quality of the response, relaxing the
distribution match guarantee so that the "unimportant" tokens can be generated
faster. Our approach relies on a semi-greedy search algorithm to test which of
the mismatches between target and draft models should be corrected to preserve
quality and which ones may be skipped. We then train a lightweight classifier
based on existing LLM embeddings to predict, at inference time, which
mismatching tokens can be safely accepted without compromising the final answer
quality. We evaluate the effectiveness of AutoJudge with multiple draft/target
model pairs on mathematical reasoning and programming benchmarks, achieving
significant speedups at the cost of a minor accuracy reduction. Notably, on
GSM8k with the Llama 3.1 70B target model, our approach achieves up to
$\approx2\times$ speedup over speculative decoding at the cost of $\le 1\%$
drop in accuracy. When applied to the LiveCodeBench benchmark, AutoJudge
automatically detects programming-specific important tokens, accepting $\ge 25$
tokens per speculation cycle at $2\%$ drop in Pass@1. Our approach requires no
human annotation and is easy to integrate with modern LLM inference frameworks.
♻ ☆ Mano Technical Report
Tianyu Fu, Anyang Su, Chenxu Zhao, Hanning Wang, Minghui Wu, Zhe Yu, Fei Hu, Mingjia Shi, Wei Dong, Jiayao Wang, Yuyang Chen, Ruiyang Yu, Siran Peng, Menglin Li, Nan Huang, Haitian Wei, Jiawei Yu, Yi Xin, Xilin Zhao, Kai Gu, Ping Jiang, Sifan Zhou, Shuo Wang
Graphical user interfaces (GUIs) are the primary medium for human-computer
interaction, yet automating GUI interactions remains challenging due to the
complexity of visual elements, dynamic environments, and the need for
multi-step reasoning. Existing methods based on vision-language models (VLMs)
often suffer from limited resolution, domain mismatch, and insufficient
sequential decisionmaking capability. To address these issues, we propose Mano,
a robust GUI agent built upon a multi-modal foundation model pre-trained on
extensive web and computer system data. Our approach integrates a novel
simulated environment for high-fidelity data generation, a three-stage training
pipeline (supervised fine-tuning, offline reinforcement learning, and online
reinforcement learning), and a verification module for error recovery. Mano
demonstrates state-of-the-art performance on multiple GUI benchmarks, including
Mind2Web and OSWorld, achieving significant improvements in success rate and
operational accuracy. Our work provides new insights into the effective
integration of reinforcement learning with VLMs for practical GUI agent
deployment, highlighting the importance of domain-specific data, iterative
training, and holistic reward design.
♻ ☆ Are you sure? Measuring models bias in content moderation through uncertainty ACL
Automatic content moderation is crucial to ensuring safety in social media.
Language Model-based classifiers are being increasingly adopted for this task,
but it has been shown that they perpetuate racial and social biases. Even if
several resources and benchmark corpora have been developed to challenge this
issue, measuring the fairness of models in content moderation remains an open
issue. In this work, we present an unsupervised approach that benchmarks models
on the basis of their uncertainty in classifying messages annotated by people
belonging to vulnerable groups. We use uncertainty, computed by means of the
conformal prediction technique, as a proxy to analyze the bias of 11 models
against women and non-white annotators and observe to what extent it diverges
from metrics based on performance, such as the $F_1$ score. The results show
that some pre-trained models predict with high accuracy the labels coming from
minority groups, even if the confidence in their prediction is low. Therefore,
by measuring the confidence of models, we are able to see which groups of
annotators are better represented in pre-trained models and lead the debiasing
process of these models before their effective use.
comment: accepted at Findings of ACL: EMNLP 2025