Computation and Language
☆ Convergence and Divergence of Language Models under Different Random Seeds EMNLP 2025
In this paper, we investigate the convergence of language models (LMs)
trained under different random seeds, measuring convergence as the expected
per-token Kullback--Leibler (KL) divergence across seeds. By comparing LM
convergence as a function of model size and training checkpoint, we identify a
four-phase convergence pattern: (i) an initial uniform phase, (ii) a
sharp-convergence phase, (iii) a sharp-divergence phase, and (iv) a
slow-reconvergence phase. Further, we observe that larger models reconverge
faster in later training stages, while smaller models never actually
reconverge; these results suggest that a certain model size may be necessary to
learn stable distributions. Restricting our analysis to specific token
frequencies or part-of-speech (PoS) tags further reveals that convergence is
uneven across linguistic categories: frequent tokens and function words
converge faster and more reliably than their counterparts (infrequent tokens
and content words). Overall, our findings highlight factors that influence the
stability of the learned distributions in model training.
comment: Published at EMNLP 2025
☆ Scaling Spoken Language Models with Syllabic Speech Tokenization
Spoken language models (SLMs) typically discretize speech into
high-frame-rate tokens extracted from SSL speech models. As the most successful
LMs are based on the Transformer architecture, processing these long token
streams with self-attention is expensive, as attention scales quadratically
with sequence length. A recent SSL work introduces acoustic tokenization of
speech at the syllable level, which is more interpretable and potentially more
scalable with significant compression in token lengths (4-5 Hz). Yet, their
value for spoken language modeling is not yet fully explored. We present the
first systematic study of syllabic tokenization for spoken language modeling,
evaluating models on a suite of SLU benchmarks while varying training data
scale. Syllabic tokens can match or surpass the previous high-frame rate tokens
while significantly cutting training and inference costs, achieving more than a
2x reduction in training time and a 5x reduction in FLOPs. Our findings
highlight syllable-level language modeling as a promising path to efficient
long-context spoken language models.
☆ Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models
Runze Liu, Jiakang Wang, Yuling Shi, Zhihui Xie, Chenxin An, Kaiyan Zhang, Jian Zhao, Xiaodong Gu, Lei Lin, Wenping Hu, Xiu Li, Fuzheng Zhang, Guorui Zhou, Kun Gai
Reinforcement Learning (RL) has shown remarkable success in enhancing the
reasoning capabilities of Large Language Models (LLMs). Process-Supervised RL
(PSRL) has emerged as a more effective paradigm compared to outcome-based RL.
However, existing PSRL approaches suffer from limited exploration efficiency,
both in terms of branching positions and sampling. In this paper, we introduce
a novel PSRL framework (AttnRL), which enables efficient exploration for
reasoning models. Motivated by preliminary observations that steps exhibiting
high attention scores correlate with reasoning behaviors, we propose to branch
from positions with high values. Furthermore, we develop an adaptive sampling
strategy that accounts for problem difficulty and historical batch size,
ensuring that the whole training batch maintains non-zero advantage values. To
further improve sampling efficiency, we design a one-step off-policy training
pipeline for PSRL. Extensive experiments on multiple challenging mathematical
reasoning benchmarks demonstrate that our method consistently outperforms prior
approaches in terms of performance and sampling and training efficiency.
☆ Searching for Difficult-to-Translate Test Examples at Scale
NLP models require test data that are sufficiently challenging. The
difficulty of an example is linked to the topic it originates from (''seed
topic''). The relationship between the topic and the difficulty of its
instances is stochastic in nature: an example about a difficult topic can
happen to be easy, and vice versa. At the scale of the Internet, there are tens
of thousands of potential topics, and finding the most difficult one by drawing
and evaluating a large number of examples across all topics is computationally
infeasible. We formalize this task and treat it as a multi-armed bandit
problem. In this framework, each topic is an ''arm,'' and pulling an arm (at a
cost) involves drawing a single example, evaluating it, and measuring its
difficulty. The goal is to efficiently identify the most difficult topics
within a fixed computational budget. We illustrate the bandit problem setup of
finding difficult examples for the task of machine translation. We find that
various bandit strategies vastly outperform baseline methods like brute-force
searching the most challenging topics.
★ DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively
While previous AI Scientist systems can generate novel findings, they often
lack the focus to produce scientifically valuable contributions that address
pressing human-defined challenges. We introduce DeepScientist, a system
designed to overcome this by conducting goal-oriented, fully autonomous
scientific discovery over month-long timelines. It formalizes discovery as a
Bayesian Optimization problem, operationalized through a hierarchical
evaluation process consisting of "hypothesize, verify, and analyze". Leveraging
a cumulative Findings Memory, this loop intelligently balances the exploration
of novel hypotheses with exploitation, selectively promoting the most promising
findings to higher-fidelity levels of validation. Consuming over 20,000 GPU
hours, the system generated about 5,000 unique scientific ideas and
experimentally validated approximately 1100 of them, ultimately surpassing
human-designed state-of-the-art (SOTA) methods on three frontier AI tasks by
183.7\%, 1.9\%, and 7.9\%. This work provides the first large-scale evidence of
an AI achieving discoveries that progressively surpass human SOTA on scientific
tasks, producing valuable findings that genuinely push the frontier of
scientific discovery. To facilitate further research into this process, we will
open-source all experimental logs and system code at
https://github.com/ResearAI/DeepScientist/.
★ MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages
Chenxi Whitehouse, Sebastian Ruder, Tony Lin, Oksana Kurylo, Haruka Takagi, Janice Lam, Nicolò Busetto, Denise Diaz
Ensuring native-like quality of large language model (LLM) responses across
many languages is challenging. To address this, we introduce MENLO, a framework
that operationalizes the evaluation of native-like response quality based on
audience design-inspired mechanisms. Using MENLO, we create a dataset of 6,423
human-annotated prompt-response preference pairs covering four quality
dimensions with high inter-annotator agreement in 47 language varieties. Our
evaluation reveals that zero-shot LLM judges benefit significantly from
pairwise evaluation and our structured annotation rubrics, yet they still
underperform human annotators on our dataset. We demonstrate substantial
improvements through fine-tuning with reinforcement learning, reward shaping,
and multi-task learning approaches. Additionally, we show that RL-trained
judges can serve as generative reward models to enhance LLMs' multilingual
proficiency, though discrepancies with human judgment remain. Our findings
suggest promising directions for scalable multilingual evaluation and
preference alignment. We release our dataset and evaluation framework to
support further research in multilingual LLM evaluation.
comment: 10 pages, 23 tables, 17 figures
☆ Deconstructing Self-Bias in LLM-generated Translation Benchmarks
As large language models (LLMs) begin to saturate existing benchmarks,
automated benchmark creation using LLMs (LLM as a benchmark) has emerged as a
scalable alternative to slow and costly human curation. While these generated
test sets have to potential to cheaply rank models, we demonstrate a critical
flaw. LLM generated benchmarks systematically favor the model that created the
benchmark, they exhibit self bias on low resource languages to English
translation tasks. We show three key findings on automatic benchmarking of LLMs
for translation: First, this bias originates from two sources: the generated
test data (LLM as a testset) and the evaluation method (LLM as an evaluator),
with their combination amplifying the effect. Second, self bias in LLM as a
benchmark is heavily influenced by the model's generation capabilities in the
source language. For instance, we observe more pronounced bias in into English
translation, where the model's generation system is developed, than in out of
English translation tasks. Third, we observe that low diversity in source text
is one attribution to self bias. Our results suggest that improving the
diversity of these generated source texts can mitigate some of the observed
self bias.
☆ Clarification as Supervision: Reinforcement Learning for Vision-Language Interfaces
Recent text-only models demonstrate remarkable mathematical reasoning
capabilities. Extending these to visual domains requires vision-language models
to translate images into text descriptions. However, current models, trained to
produce captions for human readers, often omit the precise details that
reasoning systems require. This creates an interface mismatch: reasoners often
fail not due to reasoning limitations but because they lack access to critical
visual information. We propose Adaptive-Clarification Reinforcement Learning
(AC-RL), which teaches vision models what information reasoners need through
interaction. Our key insight is that clarification requests during training
reveal information gaps; by penalizing success that requires clarification, we
create pressure for comprehensive initial captions that enable the reasoner to
solve the problem in a single pass. AC-RL improves average accuracy by 4.4
points over pretrained baselines across seven visual mathematical reasoning
benchmarks, and analysis shows it would cut clarification requests by up to 39%
if those were allowed. By treating clarification as a form of implicit
supervision, AC-RL demonstrates that vision-language interfaces can be
effectively learned through interaction alone, without requiring explicit
annotations.
♻ ☆ The Impact of Language Mixing on Bilingual LLM Reasoning EMNLP 2025
Proficient multilingual speakers often intentionally switch languages in the
middle of a conversation. Similarly, recent reasoning-focused bilingual large
language models (LLMs) with strong capabilities in both languages exhibit
language mixing-alternating languages within their chain of thought.
Discouraging this behavior in DeepSeek-R1 was found to degrade accuracy,
suggesting that language mixing may benefit reasoning. In this work, we study
language switching in Chinese-English bilingual reasoning models. We identify
reinforcement learning with verifiable rewards (RLVR) as the critical training
stage that leads to language mixing. We show that language mixing can enhance
reasoning: enforcing monolingual decoding reduces accuracy by 5.6 percentage
points on MATH500. Additionally, a lightweight probe can be trained to predict
whether a potential language switch would benefit or harm reasoning, and when
used to guide decoding, increases accuracy by 2.92 percentage points. Our
findings suggest that language mixing is not merely a byproduct of multilingual
training, but is a strategic reasoning behavior.
comment: Accepted at EMNLP 2025 (Main Conference)
♻ ☆ Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models
The impact of misinformation arises not only from factual inaccuracies but
also from the misleading narratives that creators deliberately embed.
Interpreting such creator intent is therefore essential for multimodal
misinformation detection (MMD) and effective information governance. To this
end, we introduce DeceptionDecoded, a large-scale benchmark of 12,000
image-caption pairs grounded in trustworthy reference articles, created using
an intent-guided simulation framework that models both the desired influence
and the execution plan of news creators. The dataset captures both misleading
and non-misleading cases, spanning manipulations across visual and textual
modalities, and supports three intent-centric tasks: (1) misleading intent
detection, (2) misleading source attribution, and (3) creator desire inference.
We evaluate 14 state-of-the-art vision-language models (VLMs) and find that
they struggle with intent reasoning, often relying on shallow cues such as
surface-level alignment, stylistic polish, or heuristic authenticity signals.
These results highlight the limitations of current VLMs and position
DeceptionDecoded as a foundation for developing intent-aware models that go
beyond shallow cues in MMD.