Computation and Language
☆ HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation
Recent progress in vision-language segmentation has significantly advanced
grounded visual understanding. However, these models often exhibit
hallucinations by producing segmentation masks for objects not grounded in the
image content or by incorrectly labeling irrelevant regions. Existing
evaluation protocols for segmentation hallucination primarily focus on label or
textual hallucinations without manipulating the visual context, limiting their
capacity to diagnose critical failures. In response, we introduce
HalluSegBench, the first benchmark specifically designed to evaluate
hallucinations in visual grounding through the lens of counterfactual visual
reasoning. Our benchmark consists of a novel dataset of 1340 counterfactual
instance pairs spanning 281 unique object classes, and a set of newly
introduced metrics that quantify hallucination sensitivity under visually
coherent scene edits. Experiments on HalluSegBench with state-of-the-art
vision-language segmentation models reveal that vision-driven hallucinations
are significantly more prevalent than label-driven ones, with models often
persisting in false segmentation, highlighting the need for counterfactual
reasoning to diagnose grounding fidelity.
comment: Project webpage: https://plan-lab.github.io/hallusegbench/
★ Data Efficacy for Language Model Training
Yalun Dai, Yangyu Huang, Xin Zhang, Wenshan Wu, Chong Li, Wenhui Lu, Shijie Cao, Li Dong, Scarlett Li
Data is fundamental to the training of language models (LM). Recent research
has been dedicated to data efficiency, which aims to maximize performance by
selecting a minimal or optimal subset of training data. Techniques such as data
filtering, sampling, and selection play a crucial role in this area. To
complement it, we define Data Efficacy, which focuses on maximizing performance
by optimizing the organization of training data and remains relatively
underexplored. This work introduces a general paradigm, DELT, for considering
data efficacy in LM training, which highlights the significance of training
data organization. DELT comprises three components: Data Scoring, Data
Selection, and Data Ordering. Among these components, we design
Learnability-Quality Scoring (LQS), as a new instance of Data Scoring, which
considers both the learnability and quality of each data sample from the
gradient consistency perspective. We also devise Folding Ordering (FO), as a
novel instance of Data Ordering, which addresses issues such as model
forgetting and data distribution bias. Comprehensive experiments validate the
data efficacy in LM training, which demonstrates the following: Firstly,
various instances of the proposed DELT enhance LM performance to varying
degrees without increasing the data scale and model size. Secondly, among these
instances, the combination of our proposed LQS for data scoring and Folding for
data ordering achieves the most significant improvement. Lastly, data efficacy
can be achieved together with data efficiency by applying data selection.
Therefore, we believe that data efficacy is a promising foundational area in LM
training.
☆ "What's Up, Doc?": Analyzing How Users Seek Health Information in Large-Scale Conversational AI Datasets
Akshay Paruchuri, Maryam Aziz, Rohit Vartak, Ayman Ali, Best Uchehara, Xin Liu, Ishan Chatterjee, Monica Agrawal
People are increasingly seeking healthcare information from large language
models (LLMs) via interactive chatbots, yet the nature and inherent risks of
these conversations remain largely unexplored. In this paper, we filter
large-scale conversational AI datasets to achieve HealthChat-11K, a curated
dataset of 11K real-world conversations composed of 25K user messages. We use
HealthChat-11K and a clinician-driven taxonomy for how users interact with LLMs
when seeking healthcare information in order to systematically study user
interactions across 21 distinct health specialties. Our analysis reveals
insights into the nature of how and why users seek health information, such as
common interactions, instances of incomplete context, affective behaviors, and
interactions (e.g., leading questions) that can induce sycophancy, underscoring
the need for improvements in the healthcare support capabilities of LLMs
deployed as conversational AI. Code and artifacts to retrieve our analyses and
combine them into a curated dataset can be found here:
https://github.com/yahskapar/HealthChat
comment: 25 pages, 6 figures, 4 tables, corresponds to initial HealthChat-11K
dataset release
☆ Potemkin Understanding in Large Language Models
Large language models (LLMs) are regularly evaluated using benchmark
datasets. But what justifies making inferences about an LLM's capabilities
based on its answers to a curated set of questions? This paper first introduces
a formal framework to address this question. The key is to note that the
benchmarks used to test LLMs -- such as AP exams -- are also those used to test
people. However, this raises an implication: these benchmarks are only valid
tests if LLMs misunderstand concepts in ways that mirror human
misunderstandings. Otherwise, success on benchmarks only demonstrates potemkin
understanding: the illusion of understanding driven by answers irreconcilable
with how any human would interpret a concept. We present two procedures for
quantifying the existence of potemkins: one using a specially designed
benchmark in three domains, the other using a general procedure that provides a
lower-bound on their prevalence. We find that potemkins are ubiquitous across
models, tasks, and domains. We also find that these failures reflect not just
incorrect understanding, but deeper internal incoherence in concept
representations.
☆ skLEP: A Slovak General Language Understanding Benchmark ACL 2025
Marek Šuppa, Andrej Ridzik, Daniel Hládek, Tomáš Javůrek, Viktória Ondrejová, Kristína Sásiková, Martin Tamajka, Marián Šimko
In this work, we introduce skLEP, the first comprehensive benchmark
specifically designed for evaluating Slovak natural language understanding
(NLU) models. We have compiled skLEP to encompass nine diverse tasks that span
token-level, sentence-pair, and document-level challenges, thereby offering a
thorough assessment of model capabilities. To create this benchmark, we curated
new, original datasets tailored for Slovak and meticulously translated
established English NLU resources. Within this paper, we also present the first
systematic and extensive evaluation of a wide array of Slovak-specific,
multilingual, and English pre-trained language models using the skLEP tasks.
Finally, we also release the complete benchmark data, an open-source toolkit
facilitating both fine-tuning and evaluation of models, and a public
leaderboard at https://github.com/slovak-nlp/sklep in the hopes of fostering
reproducibility and drive future research in Slovak NLU.
comment: ACL 2025 Findings
★ Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Weijian Qi, Andrei Kopanev, Botao Yu, Bernal Jiménez Gutiérrez, Yiheng Shu, Chan Hee Song, Jiaman Wu, Shijie Chen, Hanane Nour Moussa, Tianshu Zhang, Jian Xie, Yifei Li, Tianci Xue, Zeyi Liao, Kai Zhang, Boyuan Zheng, Zhaowei Cai, Viktor Rozgic, Morteza Ziyadi, Huan Sun, Yu Su
Agentic search such as Deep Research systems, where large language models
autonomously browse the web, synthesize information, and return comprehensive
citation-backed answers, represents a major shift in how users interact with
web-scale information. While promising greater efficiency and cognitive
offloading, the growing complexity and open-endedness of agentic search have
outpaced existing evaluation benchmarks and methodologies, which largely assume
short search horizons and static answers. In this paper, we introduce Mind2Web
2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that
require real-time web browsing and extensive information synthesis, constructed
with over 1,000 hours of human labor. To address the challenge of evaluating
time-varying and complex answers, we propose a novel Agent-as-a-Judge
framework. Our method constructs task-specific judge agents based on a
tree-structured rubric design to automatically assess both answer correctness
and source attribution. We conduct a comprehensive evaluation of nine frontier
agentic search systems and human performance, along with a detailed error
analysis to draw insights for future development. The best-performing system,
OpenAI Deep Research, can already achieve 50-70% of human performance while
spending half the time, showing a great potential. Altogether, Mind2Web 2
provides a rigorous foundation for developing and benchmarking the next
generation of agentic search systems.
comment: Project Homepage: https://osu-nlp-group.github.io/Mind2Web2/
☆ Enhancing User Engagement in Socially-Driven Dialogue through Interactive LLM Alignments
Enhancing user engagement through interactions plays an essential role in
socially-driven dialogues. While prior works have optimized models to reason
over relevant knowledge or plan a dialogue act flow, the relationship between
user engagement and knowledge or dialogue acts is subtle and does not guarantee
user engagement in socially-driven dialogues. To this end, we enable
interactive LLMs to learn user engagement by leveraging signals from the future
development of conversations. Specifically, we adopt a more direct and relevant
indicator of user engagement, i.e., the user's reaction related to dialogue
intention after the interaction, as a reward to align interactive LLMs. To
achieve this, we develop a user simulator to interact with target interactive
LLMs and explore interactions between the user and the interactive LLM system
via \textit{i$\times$MCTS} (\textit{M}onte \textit{C}arlo \textit{T}ree
\textit{S}earch for \textit{i}nteraction). In this way, we collect a dataset
containing pairs of higher and lower-quality experiences using
\textit{i$\times$MCTS}, and align interactive LLMs for high-level user
engagement by direct preference optimization (DPO) accordingly. Experiments
conducted on two socially-driven dialogue scenarios (emotional support
conversations and persuasion for good) demonstrate that our method effectively
enhances user engagement in interactive LLMs.
☆ Bridging Offline and Online Reinforcement Learning for LLMs
Jack Lanchantin, Angelica Chen, Janice Lan, Xian Li, Swarnadeep Saha, Tianlu Wang, Jing Xu, Ping Yu, Weizhe Yuan, Jason E Weston, Sainbayar Sukhbaatar, Ilia Kulikov
We investigate the effectiveness of reinforcement learning methods for
finetuning large language models when transitioning from offline to semi-online
to fully online regimes for both verifiable and non-verifiable tasks. Our
experiments cover training on verifiable math as well as non-verifiable
instruction following with a set of benchmark evaluations for both. Across
these settings, we extensively compare online and semi-online Direct Preference
Optimization and Group Reward Policy Optimization objectives, and surprisingly
find similar performance and convergence between these variants, which all
strongly outperform offline methods. We provide a detailed analysis of the
training dynamics and hyperparameter selection strategies to achieve optimal
results. Finally, we show that multi-tasking with verifiable and non-verifiable
rewards jointly yields improved performance across both task types.
☆ Logios : An open source Greek Polytonic Optical Character Recognition system
In this paper, we present an Optical Character Recognition (OCR) system
specifically designed for the accurate recognition and digitization of Greek
polytonic texts. By leveraging the combined strengths of convolutional layers
for feature extraction and recurrent layers for sequence learning, our system
addresses the unique challenges posed by Greek polytonic scripts. This approach
aims to overcome the limitations of traditional OCR methods, offering
significant improvements in accuracy and efficiency. We release the underlying
model as an open-source library and make our OCR platform available for
academic use.
☆ TopK Language Models
Sparse autoencoders (SAEs) have become an important tool for analyzing and
interpreting the activation space of transformer-based language models (LMs).
However, SAEs suffer several shortcomings that diminish their utility and
internal validity. Since SAEs are trained post-hoc, it is unclear if the
failure to discover a particular concept is a failure on the SAE's side or due
to the underlying LM not representing this concept. This problem is exacerbated
by training conditions and architecture choices affecting which features an SAE
learns. When tracing how LMs learn concepts during training, the lack of
feature stability also makes it difficult to compare SAEs features across
different checkpoints. To address these limitations, we introduce a
modification to the transformer architecture that incorporates a TopK
activation function at chosen layers, making the model's hidden states
equivalent to the latent features of a TopK SAE. This approach eliminates the
need for post-hoc training while providing interpretability comparable to SAEs.
The resulting TopK LMs offer a favorable trade-off between model size,
computational efficiency, and interpretability. Despite this simple
architectural change, TopK LMs maintain their original capabilities while
providing robust interpretability benefits. Our experiments demonstrate that
the sparse representations learned by TopK LMs enable successful steering
through targeted neuron interventions and facilitate detailed analysis of
neuron formation processes across checkpoints and layers. These features make
TopK LMs stable and reliable tools for understanding how language models learn
and represent concepts, which we believe will significantly advance future
research on model interpretability and controllability.
☆ Aligning Spoken Dialogue Models from User Interactions ICML 2025
We propose a novel preference alignment framework for improving spoken
dialogue models on real-time conversations from user interactions. Current
preference learning methods primarily focus on text-based language models, and
are not directly suited to the complexities of real-time speech interactions,
with richer dynamics (e.g. interruption, interjection) and no explicit
segmentation between speaker turns.We create a large-scale dataset of more than
150,000 preference pairs from raw multi-turn speech conversations, annotated
with AI feedback, to cover preferences over both linguistic content and
temporal context variations. We leverage offline alignment methods to finetune
a full-duplex autoregressive speech-to-speech model. Extensive experiments
demonstrate that feedback on generic conversations can be consistently
effective in improving spoken dialogue models to produce more factual, safer
and more contextually aligned interactions. We deploy the finetuned model and
conduct holistic human evaluations to assess the impact beyond single-turn
conversations. Our findings shed light on the importance of a well-calibrated
balance among various dynamics, crucial for natural real-time speech dialogue
systems.
comment: Accepted at ICML 2025
★ Spatial Mental Modeling from Limited Views
Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Manling Li, Jiajun Wu, Li Fei-Fei
Can Vision Language Models (VLMs) imagine the full scene from just a few
views, like humans do? Humans form spatial mental models, internal
representations of unseen space, to reason about layout, perspective, and
motion. Our new MindCube benchmark with 21,154 questions across 3,268 images
exposes this critical gap, where existing VLMs exhibit near-random performance.
Using MindCube, we systematically evaluate how well VLMs build robust spatial
mental models through representing positions (cognitive mapping), orientations
(perspective-taking), and dynamics (mental simulation for "what-if" movements).
We then explore three approaches to help VLMs approximate spatial mental
models, including unseen intermediate views, natural language reasoning chains,
and cognitive maps. The significant improvement comes from a synergistic
approach, "map-then-reason", that jointly trains the model to first generate a
cognitive map and then reason upon it. By training models to reason over these
internal maps, we boosted accuracy from 37.8% to 60.8% (+23.0%). Adding
reinforcement learning pushed performance even further to 70.7% (+32.9%). Our
key insight is that such scaffolding of spatial mental models, actively
constructing and utilizing internal structured spatial representations with
flexible reasoning processes, significantly improves understanding of
unobservable space.
comment: Preprint version
☆ Text2Cypher Across Languages: Evaluating Foundational Models Beyond English
Recent advances in large language models have enabled natural language
interfaces that translate user questions into database queries, such as
Text2SQL, Text2SPARQL, and Text2Cypher. While these interfaces enhance database
accessibility, most research today focuses solely on English, with limited
evaluation in other languages. This paper investigates the performance of
foundational LLMs on the Text2Cypher task across multiple languages. We create
and release a multilingual test set by translating English questions into
Spanish and Turkish while preserving the original Cypher queries, enabling fair
cross-lingual comparison. We evaluate multiple foundational models using
standardized prompts and metrics. Our results show a consistent performance
pattern: highest on English, then Spanish, and lowest on Turkish. We attribute
this to differences in training data availability and linguistic
characteristics. Additionally, we explore the impact of translating task
prompts into Spanish and Turkish. Results show little to no change in
evaluation metrics, suggesting prompt translation has minor impact. Our
findings highlight the need for more inclusive evaluation and development in
multilingual query generation. Future work includes schema localization and
fine-tuning across diverse languages.
☆ Domain Knowledge-Enhanced LLMs for Fraud and Concept Drift Detection
Detecting deceptive conversations on dynamic platforms is increasingly
difficult due to evolving language patterns and Concept Drift (CD)\-i.e.,
semantic or topical shifts that alter the context or intent of interactions
over time. These shifts can obscure malicious intent or mimic normal dialogue,
making accurate classification challenging. While Large Language Models (LLMs)
show strong performance in natural language tasks, they often struggle with
contextual ambiguity and hallucinations in risk\-sensitive scenarios. To
address these challenges, we present a Domain Knowledge (DK)\-Enhanced LLM
framework that integrates pretrained LLMs with structured, task\-specific
insights to perform fraud and concept drift detection. The proposed
architecture consists of three main components: (1) a DK\-LLM module to detect
fake or deceptive conversations; (2) a drift detection unit (OCDD) to determine
whether a semantic shift has occurred; and (3) a second DK\-LLM module to
classify the drift as either benign or fraudulent. We first validate the value
of domain knowledge using a fake review dataset and then apply our full
framework to SEConvo, a multiturn dialogue dataset that includes various types
of fraud and spam attacks. Results show that our system detects fake
conversations with high accuracy and effectively classifies the nature of
drift. Guided by structured prompts, the LLaMA\-based implementation achieves
98\% classification accuracy. Comparative studies against zero\-shot baselines
demonstrate that incorporating domain knowledge and drift awareness
significantly improves performance, interpretability, and robustness in
high\-stakes NLP applications.
☆ Scalable Bayesian Low-Rank Adaptation of Large Language Models via Stochastic Variational Subspace Inference UAI 2025
Despite their widespread use, large language models (LLMs) are known to
hallucinate incorrect information and be poorly calibrated. This makes the
uncertainty quantification of these models of critical importance, especially
in high-stakes domains, such as autonomy and healthcare. Prior work has made
Bayesian deep learning-based approaches to this problem more tractable by
performing inference over the low-rank adaptation (LoRA) parameters of a
fine-tuned model. While effective, these approaches struggle to scale to larger
LLMs due to requiring further additional parameters compared to LoRA. In this
work we present $\textbf{Scala}$ble $\textbf{B}$ayesian $\textbf{L}$ow-Rank
Adaptation via Stochastic Variational Subspace Inference (ScalaBL). We perform
Bayesian inference in an $r$-dimensional subspace, for LoRA rank $r$. By
repurposing the LoRA parameters as projection matrices, we are able to map
samples from this subspace into the full weight space of the LLM. This allows
us to learn all the parameters of our approach using stochastic variational
inference. Despite the low dimensionality of our subspace, we are able to
achieve competitive performance with state-of-the-art approaches while only
requiring ${\sim}1000$ additional parameters. Furthermore, it allows us to
scale up to the largest Bayesian LLM to date, with four times as a many base
parameters as prior work.
comment: Accepted at UAI 2025
☆ Hybrid Deep Learning and Signal Processing for Arabic Dialect Recognition in Low-Resource Settings
Arabic dialect recognition presents a significant challenge in speech
technology due to the linguistic diversity of Arabic and the scarcity of large
annotated datasets, particularly for underrepresented dialects. This research
investigates hybrid modeling strategies that integrate classical signal
processing techniques with deep learning architectures to address this problem
in low-resource scenarios. Two hybrid models were developed and evaluated: (1)
Mel-Frequency Cepstral Coefficients (MFCC) combined with a Convolutional Neural
Network (CNN), and (2) Discrete Wavelet Transform (DWT) features combined with
a Recurrent Neural Network (RNN). The models were trained on a dialect-filtered
subset of the Common Voice Arabic dataset, with dialect labels assigned based
on speaker metadata. Experimental results demonstrate that the MFCC + CNN
architecture achieved superior performance, with an accuracy of 91.2% and
strong precision, recall, and F1-scores, significantly outperforming the
Wavelet + RNN configuration, which achieved an accuracy of 66.5%. These
findings highlight the effectiveness of leveraging spectral features with
convolutional models for Arabic dialect recognition, especially when working
with limited labeled data. The study also identifies limitations related to
dataset size, potential regional overlaps in labeling, and model optimization,
providing a roadmap for future research. Recommendations for further
improvement include the adoption of larger annotated corpora, integration of
self-supervised learning techniques, and exploration of advanced neural
architectures such as Transformers. Overall, this research establishes a strong
baseline for future developments in Arabic dialect recognition within
resource-constrained environments.
☆ Leveraging LLM-Assisted Query Understanding for Live Retrieval-Augmented Generation SIGIR 2025
Real-world live retrieval-augmented generation (RAG) systems face significant
challenges when processing user queries that are often noisy, ambiguous, and
contain multiple intents. While RAG enhances large language models (LLMs) with
external knowledge, current systems typically struggle with such complex
inputs, as they are often trained or evaluated on cleaner data. This paper
introduces Omni-RAG, a novel framework designed to improve the robustness and
effectiveness of RAG systems in live, open-domain settings. Omni-RAG employs
LLM-assisted query understanding to preprocess user inputs through three key
modules: (1) Deep Query Understanding and Decomposition, which utilizes LLMs
with tailored prompts to denoise queries (e.g., correcting spelling errors) and
decompose multi-intent queries into structured sub-queries; (2) Intent-Aware
Knowledge Retrieval, which performs retrieval for each sub-query from a corpus
(i.e., FineWeb using OpenSearch) and aggregates the results; and (3) Reranking
and Generation, where a reranker (i.e., BGE) refines document selection before
a final response is generated by an LLM (i.e., Falcon-10B) using a
chain-of-thought prompt. Omni-RAG aims to bridge the gap between current RAG
capabilities and the demands of real-world applications, such as those
highlighted by the SIGIR 2025 LiveRAG Challenge, by robustly handling complex
and noisy queries.
comment: Accepted at SIGIR 2025 LiveRAG Workshop (Oral Presentation)
☆ Structuralist Approach to AI Literary Criticism: Leveraging Greimas Semiotic Square for Large Language Models
Large Language Models (LLMs) excel in understanding and generating text but
struggle with providing professional literary criticism for works with profound
thoughts and complex narratives. This paper proposes GLASS (Greimas Literary
Analysis via Semiotic Square), a structured analytical framework based on
Greimas Semiotic Square (GSS), to enhance LLMs' ability to conduct in-depth
literary analysis. GLASS facilitates the rapid dissection of narrative
structures and deep meanings in narrative works. We propose the first dataset
for GSS-based literary criticism, featuring detailed analyses of 48 works. Then
we propose quantitative metrics for GSS-based literary criticism using the
LLM-as-a-judge paradigm. Our framework's results, compared with expert
criticism across multiple works and LLMs, show high performance. Finally, we
applied GLASS to 39 classic works, producing original and high-quality analyses
that address existing research gaps. This research provides an AI-based tool
for literary research and education, offering insights into the cognitive
mechanisms underlying literary engagement.
comment: Accepted in CogSci 2025
☆ Latent Prototype Routing: Achieving Near-Perfect Load Balancing in Mixture-of-Experts
Mixture-of-Experts (MoE) architectures have emerged as a key strategy for
scaling large language models (LLMs) efficiently. However, current MoE systems
suffer from severe load imbalance, where only a small subset of experts is
consistently activated during training and inference, leading to significant
underutilization of model capacity and computational resources. In this work,
we revisit expert routing through a clustering perspective and propose Latent
Prototype Routing (LPR), a novel routing framework that generalizes existing
approaches while promoting balanced expert utilization without compromising
downstream performance. Extensive experiments across multiple open-source MoE
models -- including DeepSeek-V3, Qwen3-MoE, and Mixtral -- demonstrate that LPR
reduces the Gini coefficient of expert load from 0.70 to 0.035 on average,
improves the min-max expert load ratio from 1e-6 to 0.70, achieving
near-perfect load balancing.
comment: 15 pages,4 figures
☆ Exploring Adapter Design Tradeoffs for Low Resource Music Generation
Fine-tuning large-scale music generation models, such as MusicGen and
Mustango, is a computationally expensive process, often requiring updates to
billions of parameters and, therefore, significant hardware resources.
Parameter-Efficient Fine-Tuning (PEFT) techniques, particularly adapter-based
methods, have emerged as a promising alternative, enabling adaptation with
minimal trainable parameters while preserving model performance. However, the
design choices for adapters, including their architecture, placement, and size,
are numerous, and it is unclear which of these combinations would produce
optimal adapters and why, for a given case of low-resource music genre. In this
paper, we attempt to answer this question by studying various adapter
configurations for two AI music models, MusicGen and Mustango, on two genres:
Hindustani Classical and Turkish Makam music.
Our findings reveal distinct trade-offs: convolution-based adapters excel in
capturing fine-grained local musical details such as ornamentations and short
melodic phrases, while transformer-based adapters better preserve long-range
dependencies crucial for structured improvisation. Additionally, we analyze
computational resource requirements across different adapter scales,
demonstrating how mid-sized adapters (40M parameters) achieve an optimal
balance between expressivity and quality. Furthermore, we find that Mustango, a
diffusion-based model, generates more diverse outputs with better adherence to
the description in the input prompt while lacking in providing stability in
notes, rhythm alignment, and aesthetics. Also, it is computationally intensive
and requires significantly more time to train. In contrast, autoregressive
models like MusicGen offer faster training and are more efficient, and can
produce better quality output in comparison, but have slightly higher
redundancy in their generations.
comment: 9 pages, 5 figures
☆ Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models ACL 2025
In this paper, we explore the use of a text-only, autoregressive language
modeling approach for the extraction of referring expressions from visually
grounded dialogue. More specifically, the aim is to investigate the extent to
which the linguistic context alone can inform the detection of mentions that
have a (visually perceivable) referent in the visual context of the
conversation. To this end, we adapt a pretrained large language model (LLM) to
perform a relatively course-grained annotation of mention spans in unfolding
conversations by demarcating mention span boundaries in text via next-token
prediction. Our findings indicate that even when using a moderately sized LLM,
relatively small datasets, and parameter-efficient fine-tuning, a text-only
approach can be effective, highlighting the relative importance of the
linguistic context for this task. Nevertheless, we argue that the task
represents an inherently multimodal problem and discuss limitations fundamental
to unimodal approaches.
comment: Accepted for publication at XLLM @ ACL 2025
☆ Small Encoders Can Rival Large Decoders in Detecting Groundedness
Istabrak Abbes, Gabriele Prato, Quentin Fournier, Fernando Rodriguez, Alaa Boukhary, Adam Elwood, Sarath Chandar
Augmenting large language models (LLMs) with external context significantly
improves their performance in natural language processing (NLP) tasks. However,
LLMs struggle to answer queries reliably when the provided context lacks
information, often resorting to ungrounded speculation or internal knowledge.
Groundedness - generating responses strictly supported by the context - is
essential for ensuring factual consistency and trustworthiness. This study
focuses on detecting whether a given query is grounded in a document provided
in context before the costly answer generation by LLMs. Such a detection
mechanism can significantly reduce both inference time and resource
consumption. We show that lightweight, task specific encoder models such as
RoBERTa and NomicBERT, fine-tuned on curated datasets, can achieve accuracy
comparable to state-of-the-art LLMs, such as Llama3 8B and GPT4o, in
groundedness detection while reducing inference latency by orders of magnitude.
The code is available at : https://github.com/chandarlab/Hallucinate-less
☆ Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning
Xin Xu, Tianhao Chen, Fan Zhang, Wanlong Liu, Pengxiang Li, Ajay Kumar Jaiswal, Yuchen Yan, Jishan Hu, Yang Wang, Hao Chen, Shiwei Liu, Shizhe Diao, Can Yang, Lu Yin
While slow-thinking large language models (LLMs) exhibit reflection-like
reasoning, commonly referred to as the "aha moment:, their ability to generate
informative critiques and refine prior solutions remains limited. In this
paper, we introduce Double-Checker, a principled framework designed to enhance
the reasoning capabilities of slow-thinking LLMs by fostering explicit
self-critique and iterative refinement of their previous solutions. By
fine-tuning on our curated 1,730 self-critical instances, Double-Checker
empowers long-CoT LLMs to iteratively critique and refine their outputs during
inference until they evaluate their solutions as correct under self-generated
critiques. We validate the efficacy of Double-Checker across a comprehensive
suite of reasoning benchmarks, demonstrating that iterative self-critique
significantly enhances the reasoning capabilities of long-CoT LLMs. Notably,
our Double-Checker increases the pass@1 performance on challenging AIME
benchmarks from 4.4% to 18.2% compared to the original long-CoT LLMs. These
results highlight a promising direction for developing more trustworthy and
effective LLMs capable of structured self-critique.
comment: 10 pages
☆ HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context
Qize Yang, Shimin Yao, Weixuan Chen, Shenghao Fu, Detao Bai, Jiaxing Zhao, Boyuan Sun, Bowen Yin, Xihan Wei, Jingren Zhou
With the rapid evolution of multimodal large language models, the capacity to
deeply understand and interpret human intentions has emerged as a critical
capability, which demands detailed and thoughtful reasoning. In recent studies,
Reinforcement Learning (RL) has demonstrated potential in enhancing the
reasoning capabilities of Large Language Models (LLMs). Nonetheless, the
challenges associated with adapting RL to multimodal data and formats remain
largely unaddressed. In this paper, we identify two issues in existing
multimodal reasoning models: insufficient global context understanding and
shortcut problems. Insufficient context understanding can happen when a model
misinterprets multimodal context, resulting in incorrect answers. The shortcut
problem occurs when the model overlooks crucial clues in multimodal inputs,
directly addressing the query without considering the multimodal information.
To tackle these issues, we emphasize the necessity for the model to reason with
a clear understanding of the global context within multimodal inputs. This
global context understanding can effectively prevent the model from overlooking
key multimodal cues and ensure a thorough reasoning process. To ensure the
accurate interpretation of multimodal context information, we implement a
context reward judged by a large language model, alongside format and accuracy
rewards. Additionally, to improve complex reasoning capability, we employ the
LLM to assess the logical reward, determining whether the reasoning process
successfully integrates multimodal information with logical methods. We also
introduce a reasoning omni-modal benchmark, IntentBench, aimed at evaluating
models in understanding complex human intentions and emotions. Our proposed
method demonstrates advanced performance across multiple omni-modal benchmarks
compared to other open-source omni-modal models.
☆ Cat and Mouse -- Can Fake Text Generation Outpace Detector Systems?
Large language models can produce convincing "fake text" in domains such as
academic writing, product reviews, and political news. Many approaches have
been investigated for the detection of artificially generated text. While this
may seem to presage an endless "arms race", we note that newer LLMs use ever
more parameters, training data, and energy, while relatively simple classifiers
demonstrate a good level of detection accuracy with modest resources. To
approach the question of whether the models' ability to beat the detectors may
therefore reach a plateau, we examine the ability of statistical classifiers to
identify "fake text" in the style of classical detective fiction. Over a 0.5
version increase, we found that Gemini showed an increased ability to generate
deceptive text, while GPT did not. This suggests that reliable detection of
fake text may remain feasible even for ever-larger models, though new model
architectures may improve their deceptiveness
comment: (Submitted for publication)
☆ DiLoCoX: A Low-Communication Large-Scale Training Framework for Decentralized Cluster
The distributed training of foundation models, particularly large language
models (LLMs), demands a high level of communication. Consequently, it is
highly dependent on a centralized cluster with fast and reliable interconnects.
Can we conduct training on slow networks and thereby unleash the power of
decentralized clusters when dealing with models exceeding 100 billion
parameters? In this paper, we propose DiLoCoX, a low-communication large-scale
decentralized cluster training framework. It combines Pipeline Parallelism with
Dual Optimizer Policy, One-Step-Delay Overlap of Communication and Local
Training, and an Adaptive Gradient Compression Scheme. This combination
significantly improves the scale of parameters and the speed of model
pre-training. We justify the benefits of one-step-delay overlap of
communication and local training, as well as the adaptive gradient compression
scheme, through a theoretical analysis of convergence. Empirically, we
demonstrate that DiLoCoX is capable of pre-training a 107B foundation model
over a 1Gbps network. Compared to vanilla AllReduce, DiLoCoX can achieve a 357x
speedup in distributed training while maintaining negligible degradation in
model convergence. To the best of our knowledge, this is the first
decentralized training framework successfully applied to models with over 100
billion parameters.
★ Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents ACL 2025
As Multimodal Large Language Models (MLLMs) advance, multimodal agents show
promise in real-world tasks like web navigation and embodied intelligence.
However, due to limitations in a lack of external feedback, these agents
struggle with self-correction and generalization. A promising approach is to
use reward models as external feedback, but there is no clear on how to select
reward models for agents. Thus, there is an urgent need to build a reward bench
targeted at agents. To address these challenges, we propose Agent-RewardBench,
a benchmark designed to evaluate reward modeling ability in MLLMs. The
benchmark is characterized by three key features: (1) Multiple dimensions and
real-world agent scenarios evaluation. It covers perception, planning, and
safety with 7 scenarios; (2) Step-level reward evaluation. It allows for the
assessment of agent capabilities at the individual steps of a task, providing a
more granular view of performance during the planning process; and (3)
Appropriately difficulty and high-quality. We carefully sample from 10 diverse
models, difficulty control to maintain task challenges, and manual verification
to ensure the integrity of the data. Experiments demonstrate that even
state-of-the-art multimodal models show limited performance, highlighting the
need for specialized training in agent reward modeling. Code is available at
github.
comment: ACL 2025 Main
☆ Enhancing Automatic Term Extraction with Large Language Models via Syntactic Retrieval
Automatic Term Extraction (ATE) identifies domain-specific expressions that
are crucial for downstream tasks such as machine translation and information
retrieval. Although large language models (LLMs) have significantly advanced
various NLP tasks, their potential for ATE has scarcely been examined. We
propose a retrieval-based prompting strategy that, in the few-shot setting,
selects demonstrations according to \emph{syntactic} rather than semantic
similarity. This syntactic retrieval method is domain-agnostic and provides
more reliable guidance for capturing term boundaries. We evaluate the approach
in both in-domain and cross-domain settings, analyzing how lexical overlap
between the query sentence and its retrieved examples affects performance.
Experiments on three specialized ATE benchmarks show that syntactic retrieval
improves F1-score. These findings highlight the importance of syntactic cues
when adapting LLMs to terminology-extraction tasks.
☆ Complexity-aware fine-tuning
General-purpose Large Language Models (LLMs) are frequently fine-tuned
through supervised fine-tuning (SFT) to enhance performance in specific
domains. Better results can be achieved by distilling the chain-of-thought of a
larger model at the cost of numerous expensive calls and a much greater amount
of data. We propose a novel blueprint for efficient fine-tuning that uses
reasoning only for complex data identified by entropy. Specifically, across two
small open models ($\approx 3B$) we split the training data into complexity
categories by a single token answer entropy (ROC AUC $0.73$), fine-tune large
language models (LLMs) via SFT and distillation, and show that our pipeline
significantly outperforms the standard SFT approach ($0.55$ vs $0.43$ average
accuracy) and provides comparable with distillation performance while using
$62\%$ less data ($0.55$ average accuracy for both). We publish our code and
data to facilitate further research in this direction.
☆ Unveiling Causal Reasoning in Large Language Models: Reality or Mirage? NeurIPS 2024
Causal reasoning capability is critical in advancing large language models
(LLMs) toward strong artificial intelligence. While versatile LLMs appear to
have demonstrated capabilities in understanding contextual causality and
providing responses that obey the laws of causality, it remains unclear whether
they perform genuine causal reasoning akin to humans. However, current evidence
indicates the contrary. Specifically, LLMs are only capable of performing
shallow (level-1) causal reasoning, primarily attributed to the causal
knowledge embedded in their parameters, but they lack the capacity for genuine
human-like (level-2) causal reasoning. To support this hypothesis,
methodologically, we delve into the autoregression mechanism of
transformer-based LLMs, revealing that it is not inherently causal.
Empirically, we introduce a new causal Q&A benchmark called CausalProbe-2024,
whose corpora are fresh and nearly unseen for the studied LLMs. The LLMs
exhibit a significant performance drop on CausalProbe-2024 compared to earlier
benchmarks, indicating the fact that they primarily engage in level-1 causal
reasoning. To bridge the gap towards level-2 causal reasoning, we draw
inspiration from the fact that human reasoning is usually facilitated by
general knowledge and intended goals. We propose G^2-Reasoner, a method that
incorporates general knowledge and goal-oriented prompts into LLMs' causal
reasoning processes. Experiments demonstrate that G^2-Reasoner significantly
enhances LLMs' causal reasoning capability, particularly in fresh and
counterfactual contexts. This work sheds light on a new path for LLMs to
advance towards genuine causal reasoning, going beyond level-1 and making
strides towards level-2.
comment: 24 pages, accepted at NeurIPS 2024
☆ Prompt-Guided Turn-Taking Prediction SIGDIAL 2025
Turn-taking prediction models are essential components in spoken dialogue
systems and conversational robots. Recent approaches leverage transformer-based
architectures to predict speech activity continuously and in real-time. In this
study, we propose a novel model that enables turn-taking prediction to be
dynamically controlled via textual prompts. This approach allows intuitive and
explicit control through instructions such as "faster" or "calmer" adapting
dynamically to conversational partners and contexts. The proposed model builds
upon a transformer-based voice activity projection (VAP) model, incorporating
textual prompt embeddings into both channel-wise transformers and a
cross-channel transformer. We evaluated the feasibility of our approach using
over 950 hours of human-human spoken dialogue data. Since textual prompt data
for the proposed approach was not available in existing datasets, we utilized a
large language model (LLM) to generate synthetic prompt sentences. Experimental
results demonstrated that the proposed model improved prediction accuracy and
effectively varied turn-taking timing behaviors according to the textual
prompts.
comment: This paper has been accepted for presentation at SIGdial Meeting on
Discourse and Dialogue 2025 (SIGDIAL 2025) and represents the author's
version of the work
☆ Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks
The Massive Text Embedding Benchmark (MTEB) has become a standard evaluation
platform for text embedding models. While previous work has established the
core benchmark methodology, this paper focuses on the engineering aspects that
ensure MTEB's continued reproducibility and extensibility. We present our
approach to maintaining robust continuous integration pipelines that validate
dataset integrity, automate test execution, and assess benchmark results'
generalizability. We detail the design choices that collectively enhance
reproducibility and usability. Furthermore, we discuss our strategies for
handling community contributions and extending the benchmark with new tasks and
datasets. These engineering practices have been instrumental in scaling MTEB to
become more comprehensive while maintaining quality and, ultimately, relevance
to the field. Our experiences offer valuable insights for benchmark maintainers
facing similar challenges in ensuring reproducibility and usability in machine
learning evaluation frameworks. The MTEB repository is available at:
https://github.com/embeddings-benchmark/mteb
☆ Compressed and Smooth Latent Space for Text Diffusion Modeling
Autoregressive language models dominate modern text generation, yet their
sequential nature introduces fundamental limitations: decoding is slow, and
maintaining global coherence remains challenging. Diffusion models offer a
promising alternative by enabling parallel generation and flexible control;
however, their application to text generation is hindered by the high
dimensionality of token-level representations. We introduce Cosmos, a novel
approach to text generation that operates entirely in a compressed, smooth
latent space tailored specifically for diffusion. This space is learned using
an autoencoder trained simultaneously for token-level reconstruction and
alignment with frozen activations from a pretrained language encoder, providing
robust semantic grounding and enabling effective perturbation-based
augmentations. Empirically, we demonstrate that text representations can be
compressed by $8\times$ while maintaining generation quality comparable to
token-level diffusion models. Furthermore, increasing the latent sequence
length allows Cosmos to surpass both diffusion-based and autoregressive
baselines. We evaluate Cosmos on four diverse generative tasks including story
generation, question generation, summarization, and detoxification and compare
it with various generative paradigms. Cosmos achieves comparable or superior
generation quality while offering more than $2\times$ faster inference.
☆ Progtuning: Progressive Fine-tuning Framework for Transformer-based Language Models ICONIP 2024
Fine-tuning is a promising technique for leveraging Transformer-based
language models in downstream tasks. As model sizes continue to grow, updating
all model parameters becomes increasingly costly. Parameter-efficient
fine-tuning methods effectively address this issue by selectively updating a
small subset of parameters. However, fine-tuning and most existing
parameter-efficient fine-tuning methods require updating the same number of
parameters as the initial size, ignoring the unequal contribution across
Transformer blocks and leading to extremely inefficient allocation of computing
resources. In this paper, we propose Progtuning, the novel fine-tuning
framework combined with progressive learning for Transformer-based language
models. Specifically, Progtuning progressively reduces the number of updated
transformer blocks based on the contribution. Remarkably, Progtuning optimizes
resource allocation and reduces the number of updated parameters by
approximately 25\%, while still maintaining competitive performance. And it
also exhibits high adaptability with parameter-efficient fine-tuning methods,
demonstrating excellent performance across various adaptation scenarios.
comment: Accepted by ICONIP 2024
☆ Learning to Skip the Middle Layers of Transformers
Conditional computation is a popular strategy to make Transformers more
efficient. Existing methods often target individual modules (e.g.,
mixture-of-experts layers) or skip layers independently of one another.
However, interpretability research has demonstrated that the middle layers of
Transformers exhibit greater redundancy, and that early layers aggregate
information into token positions. Guided by these insights, we propose a novel
architecture that dynamically skips a variable number of layers from the middle
outward. In particular, a learned gating mechanism determines whether to bypass
a symmetric span of central blocks based on the input, and a gated attention
mechanism prevents subsequent tokens from attending to skipped token positions.
Residual norms are controlled with a 'sandwich' or 'perilayernorm' scheme and
gate sparsity with an adaptive regularization loss. We had aimed to reduce
compute requirements for 'simpler' tokens and potentially foster an emergent
multi-level representational hierarchy but, at the scales investigated, our
approach does not achieve improvements in the trade-off between validation
cross-entropy and estimated FLOPs compared to dense baselines with fewer
layers. We release our code at https://github.com/tim-lawson/skip-middle.
comment: 11 pages, 2 figures
☆ Domain Knowledge-Enhanced LLMs for Fraud and Concept Drift Detection
Detecting deceptive conversations on dynamic platforms is increasingly
difficult due to evolving language patterns and Concept Drift (CD)-i.e.,
semantic or topical shifts that alter the context or intent of interactions
over time. These shifts can obscure malicious intent or mimic normal dialogue,
making accurate classification challenging. While Large Language Models (LLMs)
show strong performance in natural language tasks, they often struggle with
contextual ambiguity and hallucinations in risk-sensitive scenarios. To address
these challenges, we present a Domain Knowledge (DK)-Enhanced LLM framework
that integrates pretrained LLMs with structured, task-specific insights to
perform fraud and concept drift detection. The proposed architecture consists
of three main components: (1) a DK-LLM module to detect fake or deceptive
conversations; (2) a drift detection unit (OCDD) to determine whether a
semantic shift has occurred; and (3) a second DK-LLM module to classify the
drift as either benign or fraudulent. We first validate the value of domain
knowledge using a fake review dataset and then apply our full framework to
SEConvo, a multiturn dialogue dataset that includes various types of fraud and
spam attacks. Results show that our system detects fake conversations with high
accuracy and effectively classifies the nature of drift. Guided by structured
prompts, the LLaMA-based implementation achieves 98% classification accuracy.
Comparative studies against zero-shot baselines demonstrate that incorporating
domain knowledge and drift awareness significantly improves performance,
interpretability, and robustness in high-stakes NLP applications.
♻ ☆ OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages
We present OpenNER 1.0, a standardized collection of openly-available named
entity recognition (NER) datasets. OpenNER contains 36 NER corpora that span 52
languages, human-annotated in varying named entity ontologies. We correct
annotation format issues, standardize the original datasets into a uniform
representation with consistent entity type names across corpora, and provide
the collection in a structure that enables research in multilingual and
multi-ontology NER. We provide baseline results using three pretrained
multilingual language models and two large language models to compare the
performance of recent models and facilitate future research in NER. We find
that no single model is best in all languages and that significant work remains
to obtain high performance from LLMs on the NER task.
comment: Under review
♻ ☆ Prompting with Phonemes: Enhancing LLMs' Multilinguality for Non-Latin Script Languages NAACL 2025
Hoang H Nguyen, Khyati Mahajan, Vikas Yadav, Julian Salazar, Philip S. Yu, Masoud Hashemi, Rishabh Maheshwary
Although multilingual LLMs have achieved remarkable performance across
benchmarks, we find they continue to underperform on non-Latin script languages
across contemporary LLM families. This discrepancy arises from the fact that
LLMs are pretrained with orthographic scripts, which are dominated by Latin
characters that obscure their shared phonology with non-Latin scripts. We
propose leveraging phonemic transcriptions as complementary signals to induce
script-invariant representations. Our study demonstrates that integrating
phonemic signals improves performance across both non-Latin and Latin script
languages, with a particularly significant impact on closing the performance
gap between the two. Through detailed experiments, we show that phonemic and
orthographic scripts retrieve distinct examples for in-context learning (ICL).
This motivates our proposed Mixed-ICL retrieval strategy, where further
aggregation from both leads to our significant performance improvements for
both Latin script languages (up to 12.6%) and non-Latin script languages (up to
15.1%) compared to randomized ICL retrieval.
comment: Accepted to NAACL 2025 (Main Conference). This version contains minor
improvements to the camera-ready
♻ ☆ From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents
Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan, Liangwei Yang, Chenxuan Xie, Yuyao Yang, Wei-Chieh Huang, Chunyu Miao, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Yankai Chen, Chunkit Chan, Peilin Zhou, Xinyang Zhang, Chenwei Zhang, Jingbo Shang, Ming Zhang, Yangqiu Song, Irwin King, Philip S. Yu
Information retrieval is a cornerstone of modern knowledge acquisition,
enabling billions of queries each day across diverse domains. However,
traditional keyword-based search engines are increasingly inadequate for
handling complex, multi-step information needs. Our position is that Large
Language Models (LLMs), endowed with reasoning and agentic capabilities, are
ushering in a new paradigm termed Agentic Deep Research. These systems
transcend conventional information search techniques by tightly integrating
autonomous reasoning, iterative retrieval, and information synthesis into a
dynamic feedback loop. We trace the evolution from static web search to
interactive, agent-based systems that plan, explore, and learn. We also
introduce a test-time scaling law to formalize the impact of computational
depth on reasoning and search. Supported by benchmark results and the rise of
open-source implementations, we demonstrate that Agentic Deep Research not only
significantly outperforms existing approaches, but is also poised to become the
dominant paradigm for future information seeking. All the related resources,
including industry products, research papers, benchmark datasets, and
open-source implementations, are collected for the community in
https://github.com/DavidZWZ/Awesome-Deep-Research.
♻ ☆ Explainability of Large Language Models using SMILE: Statistical Model-agnostic Interpretability with Local Explanations
Large language models like GPT, LLAMA, and Claude have become incredibly
powerful at generating text, but they are still black boxes, so it is hard to
understand how they decide what to say. That lack of transparency can be
problematic, especially in fields where trust and accountability matter. To
help with this, we introduce SMILE, a new method that explains how these models
respond to different parts of a prompt. SMILE is model-agnostic and works by
slightly changing the input, measuring how the output changes, and then
highlighting which words had the most impact. Create simple visual heat maps
showing which parts of a prompt matter the most. We tested SMILE on several
leading LLMs and used metrics such as accuracy, consistency, stability, and
fidelity to show that it gives clear and reliable explanations. By making these
models easier to understand, SMILE brings us one step closer to making AI more
transparent and trustworthy.
comment: The submission contains incorrect references that require substantial
revision
♻ ☆ Rethinking LLM Training through Information Geometry and Quantum Metrics
Optimization in large language models (LLMs) unfolds over high-dimensional
parameter spaces with non-Euclidean structure. Information geometry frames this
landscape using the Fisher information metric, enabling more principled
learning via natural gradient descent. Though often impractical, this geometric
lens clarifies phenomena such as sharp minima, generalization, and observed
scaling laws. We argue that curvature-aware approaches deepen our understanding
of LLM training. Finally, we speculate on quantum analogies based on the
Fubini-Study metric and Quantum Fisher Information, hinting at efficient
optimization in quantum-enhanced systems.
comment: 9 pages, 1 figure(s)
♻ ★ DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation
Diffusion large language models (dLLMs) are compelling alternatives to
autoregressive (AR) models because their denoising models operate over the
entire sequence. The global planning and iterative refinement features of dLLMs
are particularly useful for code generation. However, current training and
inference mechanisms for dLLMs in coding are still under-explored. To demystify
the decoding behavior of dLLMs and unlock their potential for coding, we
systematically investigate their denoising processes and reinforcement learning
(RL) methods. We train a 7B dLLM, \textbf{DiffuCoder}, on 130B tokens of code.
Using this model as a testbed, we analyze its decoding behavior, revealing how
it differs from that of AR models: (1) dLLMs can decide how causal their
generation should be without relying on semi-AR decoding, and (2) increasing
the sampling temperature diversifies not only token choices but also their
generation order. This diversity creates a rich search space for RL rollouts.
For RL training, to reduce the variance of token log-likelihood estimates and
maintain training efficiency, we propose \textbf{coupled-GRPO}, a novel
sampling scheme that constructs complementary mask noise for completions used
in training. In our experiments, coupled-GRPO significantly improves
DiffuCoder's performance on code generation benchmarks (+4.4\% on EvalPlus) and
reduces reliance on AR bias during decoding. Our work provides deeper insight
into the machinery of dLLM generation and offers an effective, diffusion-native
RL training framework. https://github.com/apple/ml-diffucoder.
comment: minor update
♻ ☆ Thinkless: LLM Learns When to Think
Reasoning Language Models, capable of extended chain-of-thought reasoning,
have demonstrated remarkable performance on tasks requiring complex logical
inference. However, applying elaborate reasoning for all queries often results
in substantial computational inefficiencies, particularly when many problems
admit straightforward solutions. This motivates an open question: Can LLMs
learn when to think? To answer this, we propose Thinkless, a learnable
framework that empowers an LLM to adaptively select between short-form and
long-form reasoning, based on both task complexity and the model's ability.
Thinkless is trained under a reinforcement learning paradigm and employs two
control tokens, for concise responses and for detailed
reasoning. At the core of our method is a Decoupled Group Relative Policy
Optimization (DeGRPO) algorithm, which decomposes the learning objective of
hybrid reasoning into two components: (1) a control token loss that governs the
selection of the reasoning mode, and (2) a response loss that improves the
accuracy of the generated answers. This decoupled formulation enables
fine-grained control over the contributions of each objective, stabilizing
training and effectively preventing collapse observed in vanilla GRPO.
Empirically, on several benchmarks such as Minerva Algebra, MATH-500, and
GSM8K, Thinkless is able to reduce the usage of long-chain thinking by 50% -
90%, significantly improving the efficiency of Reasoning Language Models. The
code is available at https://github.com/VainF/Thinkless
♻ ★ A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns ACL 2025
With the development of large language models, they are widely used as agents
in various fields. A key component of agents is memory, which stores vital
information but is susceptible to jailbreak attacks. Existing research mainly
focuses on single-agent attacks and shared memory attacks. However, real-world
scenarios often involve independent memory. In this paper, we propose the
Troublemaker Makes Chaos in Honest Town (TMCHT) task, a large-scale,
multi-agent, multi-topology text-based attack evaluation framework. TMCHT
involves one attacker agent attempting to mislead an entire society of agents.
We identify two major challenges in multi-agent attacks: (1) Non-complete graph
structure, (2) Large-scale systems. We attribute these challenges to a
phenomenon we term toxicity disappearing. To address these issues, we propose
an Adversarial Replication Contagious Jailbreak (ARCJ) method, which optimizes
the retrieval suffix to make poisoned samples more easily retrieved and
optimizes the replication suffix to make poisoned samples have contagious
ability. We demonstrate the superiority of our approach in TMCHT, with 23.51%,
18.95%, and 52.93% improvements in line topology, star topology, and 100-agent
settings. Encourage community attention to the security of multi-agent systems.
comment: ACL 2025 Main
♻ ☆ Simulating Hard Attention Using Soft Attention
We study conditions under which transformers using soft attention can
simulate hard attention, that is, effectively focus all attention on a subset
of positions. First, we examine several subclasses of languages recognized by
hard-attention transformers, which can be defined in variants of linear
temporal logic. We demonstrate how soft-attention transformers can compute
formulas of these logics using unbounded positional embeddings or temperature
scaling. Second, we demonstrate how temperature scaling allows softmax
transformers to simulate general hard-attention transformers, using a
temperature that depends on the minimum gap between the maximum attention
scores and other attention scores.
comment: 19 pages
♻ ☆ Capturing Style in Author and Document Representation
A wide range of Deep Natural Language Processing (NLP) models integrates
continuous and low dimensional representations of words and documents.
Surprisingly, very few models study representation learning for authors. These
representations can be used for many NLP tasks, such as author identification
and classification, or in recommendation systems. A strong limitation of
existing works is that they do not explicitly capture writing style, making
them hardly applicable to literary data. We therefore propose a new
architecture based on Variational Information Bottleneck (VIB) that learns
embeddings for both authors and documents with a stylistic constraint. Our
model fine-tunes a pre-trained document encoder. We stimulate the detection of
writing style by adding predefined stylistic features making the representation
axis interpretable with respect to writing style indicators. We evaluate our
method on three datasets: a literary corpus extracted from the Gutenberg
Project, the Blog Authorship Corpus and IMDb62, for which we show that it
matches or outperforms strong/recent baselines in authorship attribution while
capturing much more accurately the authors stylistic aspects.
♻ ☆ TAPS: Tool-Augmented Personalisation via Structured Tagging
Recent advancements in tool-augmented large language models have enabled them
to interact with external tools, enhancing their ability to perform complex
user tasks. However, existing approaches overlook the role of personalisation
in guiding tool use. This work investigates how user preferences can be
effectively integrated into goal-oriented dialogue agents. Through extensive
analysis, we identify key weaknesses in the ability of LLMs to personalise tool
use. To this end, we introduce TAPS, a novel solution that enhances
personalised tool use by leveraging a structured tagging tool and an
uncertainty-based tool detector. TAPS significantly improves the ability of
LLMs to incorporate user preferences, achieving the new state-of-the-art for
open source models on the NLSI task.
♻ ☆ LLM-Based Human-Agent Collaboration and Interaction Systems: A Survey
Henry Peng Zou, Wei-Chieh Huang, Yaozu Wu, Yankai Chen, Chunyu Miao, Hoang Nguyen, Yue Zhou, Weizhi Zhang, Liancheng Fang, Langzhou He, Yangning Li, Dongyuan Li, Renhe Jiang, Xue Liu, Philip S. Yu
Recent advances in large language models (LLMs) have sparked growing interest
in building fully autonomous agents. However, fully autonomous LLM-based agents
still face significant challenges, including limited reliability due to
hallucinations, difficulty in handling complex tasks, and substantial safety
and ethical risks, all of which limit their feasibility and trustworthiness in
real-world applications. To overcome these limitations, LLM-based human-agent
systems (LLM-HAS) incorporate human-provided information, feedback, or control
into the agent system to enhance system performance, reliability and safety.
These human-agent collaboration systems enable humans and LLM-based agents to
collaborate effectively by leveraging their complementary strengths. This paper
provides the first comprehensive and structured survey of LLM-HAS. It clarifies
fundamental concepts, systematically presents core components shaping these
systems, including environment & profiling, human feedback, interaction types,
orchestration and communication, explores emerging applications, and discusses
unique challenges and opportunities arising from human-AI collaboration. By
consolidating current knowledge and offering a structured overview, we aim to
foster further research and innovation in this rapidly evolving
interdisciplinary field. Paper lists and resources are available at
https://github.com/HenryPengZou/Awesome-Human-Agent-Collaboration-Interaction-Systems.
comment: Paper lists and resources are available at
https://github.com/HenryPengZou/Awesome-Human-Agent-Collaboration-Interaction-Systems
♻ ☆ CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models
Ping Wu, Guobin Shen, Dongcheng Zhao, Yuwei Wang, Yiting Dong, Yu Shi, Enmeng Lu, Feifei Zhao, Yi Zeng
Ensuring that Large Language Models (LLMs) align with mainstream human values
and ethical norms is crucial for the safe and sustainable development of AI.
Current value evaluation and alignment are constrained by Western cultural bias
and incomplete domestic frameworks reliant on non-native rules; furthermore,
the lack of scalable, rule-driven scenario generation methods makes evaluations
costly and inadequate across diverse cultural contexts. To address these
challenges, we propose a hierarchical value framework grounded in core Chinese
values, encompassing three main dimensions, 12 core values, and 50 derived
values. Based on this framework, we construct a large-scale Chinese Values
Corpus (CVC) containing over 250,000 value rules enhanced and expanded through
human annotation. Experimental results show that CVC-guided scenarios
outperform direct generation ones in value boundaries and content diversity. In
the evaluation across six sensitive themes (e.g., surrogacy, suicide), seven
mainstream LLMs preferred CVC-generated options in over 70.5% of cases, while
five Chinese human annotators showed an 87.5% alignment with CVC, confirming
its universality, cultural relevance, and strong alignment with Chinese values.
Additionally, we construct 400,000 rule-based moral dilemma scenarios that
objectively capture nuanced distinctions in conflicting value prioritization
across 17 LLMs. Our work establishes a culturally-adaptive benchmarking
framework for comprehensive value evaluation and alignment, representing
Chinese characteristics. All data are available at
https://huggingface.co/datasets/Beijing-AISI/CVC, and the code is available at
https://github.com/Beijing-AISI/CVC.
♻ ☆ Do Large Language Models Advocate for Inferentialism?
The emergence of large language models (LLMs) such as ChatGPT and Claude
presents new challenges for philosophy of language, particularly regarding the
nature of linguistic meaning and representation. While LLMs have traditionally
been understood through distributional semantics, this paper explores Robert
Brandom's inferential semantics as an alternative foundational framework for
understanding these systems. We examine how key features of inferential
semantics -- including its anti-representationalist stance, logical
expressivism, and quasi-compositional approach -- align with the architectural
and functional characteristics of Transformer-based LLMs. Through analysis of
the ISA (Inference, Substitution, Anaphora) approach, we demonstrate that LLMs
exhibit fundamentally anti-representationalist properties in their processing
of language. We further develop a consensus theory of truth appropriate for
LLMs, grounded in their interactive and normative dimensions through mechanisms
like RLHF. While acknowledging significant tensions between inferentialism's
philosophical commitments and LLMs' sub-symbolic processing, this paper argues
that inferential semantics provides valuable insights into how LLMs generate
meaning without reference to external world representations. Our analysis
suggests that LLMs may challenge traditional assumptions in philosophy of
language, including strict compositionality and semantic externalism, though
further empirical investigation is needed to fully substantiate these
theoretical claims.
♻ ★ Learning Evaluation Models from Large Language Models for Sequence Generation
Chenglong Wang, Hang Zhou, Kaiyan Chang, Tongran Liu, Chunliang Zhang, Quan Du, Tong Xiao, Yue Zhang, Jingbo Zhu
Automatic evaluation of sequence generation, traditionally reliant on metrics
like BLEU and ROUGE, often fails to capture the semantic accuracy of generated
text sequences due to their emphasis on n-gram overlap. A promising solution to
this problem is to develop model-based metrics, such as BLEURT and COMET.
However, these approaches are typically hindered by the scarcity of labeled
evaluation data, which is necessary to train the evaluation models. In this
work, we build upon this challenge by proposing the Customized Sequence
Evaluation Metric (CSEM), a three-stage evaluation model training method that
utilizes large language models to generate labeled data for model-based metric
development, thereby eliminating the need for human-labeled data. Additionally,
we expand the scope of CSEM to support various evaluation types, including
single-aspect, multi-aspect, reference-free, and reference-based evaluations,
enabling the customization of metrics to suit diverse real-world scenarios.
Experimental results on the SummEval benchmark demonstrate that CSEM can
effectively train an evaluation model without human-labeled data. Further
experiments in reinforcement learning and reranking show that metrics developed
through CSEM outperform traditional evaluation metrics, leading to substantial
improvements in sequence quality as evaluated by both commonly used metrics and
ChatGPT.
comment: Accepted by TASLP 2025