Computation and Language
☆ Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities
While question-answering (QA) benchmark performance is an automatic and
scalable method to compare LLMs, it is an indirect method of evaluating their
underlying problem-solving capabilities. Therefore, we propose a holistic and
generalizable framework based on \emph{cascaded question disclosure} that
provides a more accurate estimate of the models' problem-solving capabilities
while maintaining the scalability and automation. This approach collects model
responses in a stagewise manner with each stage revealing partial information
about the question designed to elicit generalized reasoning in LLMs. We find
that our approach not only provides a better comparison between LLMs, but also
induces better intermediate traces in models compared to the standard QA
paradigm. We empirically verify this behavior on diverse reasoning and
knowledge-heavy QA datasets by comparing LLMs of varying sizes and families.
Our approach narrows the performance gap observed in the standard QA evaluation
settings, indicating that the prevalent indirect QA paradigm of evaluation
overestimates the differences in performance between models. We further
validate our findings by extensive ablation studies.
comment: Under review
★ SimuRA: Towards General Goal-Oriented Agent via Simulative Reasoning Architecture with LLM-Based World Model
AI agents built on large language models (LLMs) hold enormous promise, but
current practice focuses on a one-task-one-agent approach, which not only falls
short of scalability and generality, but also suffers from the fundamental
limitations of autoregressive LLMs. On the other hand, humans are general
agents who reason by mentally simulating the outcomes of their actions and
plans. Moving towards a more general and powerful AI agent, we introduce
SimuRA, a goal-oriented architecture for generalized agentic reasoning. Based
on a principled formulation of an optimal agent in any environment, SimuRA
overcomes the limitations of autoregressive reasoning by introducing a world
model for planning via simulation. The generalized world model is implemented
using an LLM, which can flexibly plan in a wide range of environments using the
concept-rich latent space of natural language. Experiments on difficult web
browsing tasks show that SimuRA improves the success rate of flight search from
0\% to 32.2\%. World-model-based planning, in particular, shows a consistent
advantage of up to 124\% over autoregressive planning, demonstrating the
advantage of world model simulation as a reasoning paradigm. We are excited
about the possibility of training a single, general agent model based on LLMs
that can act superintelligently in all environments. To start, we make SimuRA,
a web-browsing agent built on this architecture with pretrained LLMs, available
as a research demo for public testing.
★ CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks
Ping Yu, Jack Lanchantin, Tianlu Wang, Weizhe Yuan, Olga Golovneva, Ilia Kulikov, Sainbayar Sukhbaatar, Jason Weston, Jing Xu
We propose CoT-Self-Instruct, a synthetic data generation method that
instructs LLMs to first reason and plan via Chain-of-Thought (CoT) based on the
given seed tasks, and then to generate a new synthetic prompt of similar
quality and complexity for use in LLM training, followed by filtering for
high-quality data with automatic metrics. In verifiable reasoning, our
synthetic data significantly outperforms existing training datasets, such as
s1k and OpenMathReasoning, across MATH500, AMC23, AIME24 and GPQA-Diamond. For
non-verifiable instruction-following tasks, our method surpasses the
performance of human or standard self-instruct prompts on both AlpacaEval 2.0
and Arena-Hard.
☆ Rule2Text: Natural Language Explanation of Logical Rules in Knowledge Graphs
Knowledge graphs (KGs) often contain sufficient information to support the
inference of new facts. Identifying logical rules not only improves the
completeness of a knowledge graph but also enables the detection of potential
errors, reveals subtle data patterns, and enhances the overall capacity for
reasoning and interpretation. However, the complexity of such rules, combined
with the unique labeling conventions of each KG, can make them difficult for
humans to understand. In this paper, we explore the potential of large language
models to generate natural language explanations for logical rules.
Specifically, we extract logical rules using the AMIE 3.5.1 rule discovery
algorithm from the benchmark dataset FB15k-237 and two large-scale datasets,
FB-CVT-REV and FB+CVT-REV. We examine various prompting strategies, including
zero- and few-shot prompting, the inclusion of variable entity types, and
chain-of-thought reasoning. We conduct a comprehensive human evaluation of the
generated explanations based on correctness, clarity, and hallucination, and
also assess the use of large language models as automatic judges. Our results
demonstrate promising performance in terms of explanation correctness and
clarity, although several challenges remain for future research. All scripts
and data used in this study are publicly available at
https://github.com/idirlab/KGRule2NL.
★ Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving
Luoxin Chen, Jinming Gu, Liankai Huang, Wenhao Huang, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Kaijing Ma, Cheng Ren, Jiawei Shen, Wenlei Shi, Tong Sun, He Sun, Jiahui Wang, Siran Wang, Zhihong Wang, Chenrui Wei, Shufa Wei, Yonghui Wu, Yuchen Wu, Yihang Xia, Huajian Xin, Fan Yang, Huaiyuan Ying, Hongyi Yuan, Zheng Yuan, Tianyang Zhan, Chi Zhang, Yue Zhang, Ge Zhang, Tianyun Zhao, Jianqiu Zhao, Yichi Zhou, Thomas Hanwen Zhu
LLMs have demonstrated strong mathematical reasoning abilities by leveraging
reinforcement learning with long chain-of-thought, yet they continue to
struggle with theorem proving due to the lack of clear supervision signals when
solely using natural language. Dedicated domain-specific languages like Lean
provide clear supervision via formal verification of proofs, enabling effective
training through reinforcement learning. In this work, we propose
\textbf{Seed-Prover}, a lemma-style whole-proof reasoning model. Seed-Prover
can iteratively refine its proof based on Lean feedback, proved lemmas, and
self-summarization. To solve IMO-level contest problems, we design three
test-time inference strategies that enable both deep and broad reasoning.
Seed-Prover proves $78.1\%$ of formalized past IMO problems, saturates MiniF2F,
and achieves over 50\% on PutnamBench, outperforming the previous
state-of-the-art by a large margin. To address the lack of geometry support in
Lean, we introduce a geometry reasoning engine \textbf{Seed-Geometry}, which
outperforms previous formal geometry engines. We use these two systems to
participate in IMO 2025 and fully prove 5 out of 6 problems. This work
represents a significant advancement in automated mathematical reasoning,
demonstrating the effectiveness of formal verification with long
chain-of-thought reasoning.
☆ TextQuests: How Good are LLMs at Text-Based Video Games?
Evaluating AI agents within complex, interactive environments that mirror
real-world challenges is critical for understanding their practical
capabilities. While existing agent benchmarks effectively assess skills like
tool use or performance on structured tasks, they often do not fully capture an
agent's ability to operate autonomously in exploratory environments that demand
sustained, self-directed reasoning over a long and growing context. To spur the
development of agents capable of more robust intrinsic reasoning over long
horizons, we introduce TextQuests, a benchmark based on the Infocom suite of
interactive fiction games. These text-based adventures, which can take human
players over 30 hours and require hundreds of precise actions to solve, serve
as an effective proxy for evaluating AI agents on focused, stateful tasks. The
benchmark is specifically designed to assess an LLM agent's capacity for
self-contained problem-solving by precluding the use of external tools, thereby
focusing on intrinsic long-context reasoning capabilities in an exploratory
environment characterized by the need for trial-and-error learning and
sustained problem-solving within a single interactive session. We release
TextQuests at https://textquests.ai.
☆ TweakLLM: A Routing Architecture for Dynamic Tailoring of Cached Responses
Muhammad Taha Cheema, Abeer Aamir, Khawaja Gul Muhammad, Naveed Anwar Bhatti, Ihsan Ayyub Qazi, Zafar Ayyub Qazi
Large Language Models (LLMs) process millions of queries daily, making
efficient response caching a compelling optimization for reducing cost and
latency. However, preserving relevance to user queries using this approach
proves difficult due to the personalized nature of chatbot interactions and the
limited accuracy of semantic similarity search. To address this, we present
TweakLLM, a novel routing architecture that employs a lightweight LLM to
dynamically adapt cached responses to incoming prompts. Through comprehensive
evaluation, including user studies with side-by-side comparisons, satisfaction
voting, as well as multi-agent LLM debates, we demonstrate that TweakLLM
maintains response quality comparable to frontier models while significantly
improving cache effectiveness. Our results across real-world datasets highlight
TweakLLM as a scalable, resource-efficient caching solution for high-volume LLM
deployments without compromising user experience.
comment: 13 pages, 9 figures
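The routing idea in the TweakLLM abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `cosine` and `route`, the cache layout, the similarity threshold of 0.8, and the two callables standing in for the lightweight and frontier models. The abstract does not specify how the paper's router makes its decision.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def route(query, query_vec, cache, tweak_llm, frontier_llm, threshold=0.8):
    """Adapt a cached response when a similar query exists, else call the large model.

    `threshold` is a hypothetical cut-off, not a value from the paper.
    """
    # Find the most similar cached entry, if any.
    best, best_sim = None, -1.0
    for entry in cache:
        sim = cosine(query_vec, entry["vec"])
        if sim > best_sim:
            best, best_sim = entry, sim

    # Cache hit: let the lightweight model tailor the cached response.
    if best is not None and best_sim >= threshold:
        return tweak_llm(query, best["response"])

    # Cache miss: query the frontier model and store its answer.
    response = frontier_llm(query)
    cache.append({"vec": query_vec, "response": response})
    return response
```

Passing the two models as callables keeps the sketch independent of any particular LLM client; a real deployment would also need an embedding model to produce `query_vec`.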
☆ Arabic Hate Speech Identification and Masking in Social Media using Deep Learning Models and Pre-trained Models Fine-tuning
Hate speech identification in social media has become an increasingly
important issue in recent years. In this research, we address two problems: 1)
detecting hate speech in Arabic text, and 2) cleaning a given text of hate
speech. Cleaning here means replacing each offensive word with stars, one per
letter of the word. Regarding the first problem, we
conduct several experiments using deep learning models and transformers to
determine the best model in terms of the F1 score. Regarding the second problem, we
treat it as a machine translation task, where the input is a sentence
containing offensive text and the output is the same sentence with the
offensive text masked. The best hate speech detection model achieves
a 92\% Macro F1 score and 95\% accuracy. In the text
cleaning experiment, the best hate speech masking model reached
0.3 in BLEU score with 1-gram, which is a good result compared with
state-of-the-art machine translation systems.
comment: 23 pages, 5 figures
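The masking target described in the abstract above (each flagged word replaced by stars matching its length) can be shown with a small sketch. Note that this is a rule-based illustration of the output format only: the paper frames masking as a machine translation task, whereas the lexicon lookup, the function name `mask_terms`, and the sample lexicon below are assumptions.

```python
import re

def mask_terms(text, lexicon):
    """Replace every word found in `lexicon` with stars of equal length."""
    def repl(match):
        word = match.group(0)
        # Mask only flagged words; leave everything else untouched.
        return "*" * len(word) if word.lower() in lexicon else word

    # \w+ matches runs of Unicode word characters in Python 3.
    return re.sub(r"\w+", repl, text)
```

For example, `mask_terms("you are a fool", {"fool"})` keeps the sentence intact apart from the flagged word, which becomes four stars.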
☆ Deep Learning-based Prediction of Clinical Trial Enrollment with Uncertainty Estimates
Clinical trials are a systematic endeavor to assess the safety and efficacy
of new drugs or treatments. Conducting such trials typically demands
significant financial investment and meticulous planning, highlighting the need
for accurate predictions of trial outcomes. Accurately predicting patient
enrollment, a key factor in trial success, is one of the primary challenges
during the planning phase. In this work, we propose a novel deep learning-based
method to address this critical challenge. Our method, implemented as a neural
network model, leverages pre-trained language models (PLMs) to capture the
complexities and nuances of clinical documents, transforming them into
expressive representations. These representations are then combined with
encoded tabular features via an attention mechanism. To account for
uncertainties in enrollment prediction, we enhance the model with a
probabilistic layer based on the Gamma distribution, which enables range
estimation. We apply the proposed model to predict clinical trial duration,
assuming site-level enrollment follows a Poisson-Gamma process. We carry out
extensive experiments on real-world clinical trial data, and show that the
proposed method can effectively predict the number of patients enrolled at a
number of sites for a given clinical trial, outperforming established baseline
models.
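The Poisson-Gamma assumption mentioned in the abstract above has a standard closed form, sketched here for reference. The parametrization (shape $\alpha$, rate $\beta$, observation time $t$) is an assumption for illustration; the abstract does not state which one the paper uses.

```latex
% Site-level enrollment rate and count (assumed parametrization):
\lambda \sim \mathrm{Gamma}(\alpha, \beta), \qquad
N(t) \mid \lambda \sim \mathrm{Poisson}(\lambda t)

% Marginalizing over \lambda gives a negative binomial count:
P\bigl(N(t) = n\bigr)
  = \frac{\Gamma(n + \alpha)}{n!\,\Gamma(\alpha)}
    \left(\frac{\beta}{\beta + t}\right)^{\alpha}
    \left(\frac{t}{\beta + t}\right)^{n}

% Moments of the marginal count:
\mathbb{E}[N(t)] = \frac{\alpha t}{\beta}, \qquad
\mathrm{Var}[N(t)] = \frac{\alpha t}{\beta}\left(1 + \frac{t}{\beta}\right)
```

The variance exceeds the mean for any $t > 0$, which is what allows such a model to express uncertainty ranges beyond a plain Poisson count.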
☆ DiffLoRA: Differential Low-Rank Adapters for Large Language Models
Differential Transformer has recently been proposed to improve performance in
Transformer models by canceling out noise through a denoiser attention
mechanism. In this work, we introduce DiffLoRA, a parameter-efficient
adaptation of the differential attention mechanism, with low-rank adapters on
both positive and negative attention terms. This approach retains the
efficiency of LoRA while aiming to benefit from the performance gains of
differential attention. We evaluate DiffLoRA across a broad range of NLP tasks,
including general benchmarks, many-shot in-context learning, RAG, and
long-context tests. We observe that, although DiffLoRA falls short of other
parameter-efficient fine-tuning methods in most evaluation tasks, it shows
interesting results in certain domains (+11 pts on LoRA for HumanEval). We
analyze the attention patterns post-finetuning to identify the reasons for this
behavior.
★ T-Detect: Tail-Aware Statistical Normalization for Robust Detection of Adversarial Machine-Generated Text
The proliferation of sophisticated text generation models necessitates the
development of robust detection methods capable of identifying
machine-generated content, particularly text designed to evade detection
through adversarial perturbations. Existing zero-shot detectors often rely on
statistical measures that implicitly assume Gaussian distributions, a premise
that falters when confronted with the heavy-tailed statistical artifacts
characteristic of adversarial or non-native English texts. This paper
introduces T-Detect, a novel detection method that fundamentally redesigns the
statistical core of curvature-based detectors. Our primary innovation is the
replacement of standard Gaussian normalization with a heavy-tailed discrepancy
score derived from the Student's t-distribution. This approach is theoretically
grounded in the empirical observation that adversarial texts exhibit
significant leptokurtosis, rendering traditional statistical assumptions
inadequate. T-Detect computes a detection score by normalizing the
log-likelihood of a passage against the expected moments of a t-distribution,
providing superior resilience to statistical outliers. We validate our approach
on the challenging RAID benchmark for adversarial text and the comprehensive
HART dataset. Experiments show that T-Detect provides a consistent performance
uplift over strong baselines, improving AUROC by up to 3.9\% in targeted
domains. When integrated into a two-dimensional detection framework (CT), our
method achieves state-of-the-art performance, with an AUROC of 0.926 on the
Books domain of RAID. Our contributions are a new, theoretically-justified
statistical foundation for text detection, an ablation-validated method that
demonstrates superior robustness, and a comprehensive analysis of its
performance under adversarial conditions. Our code is released at
https://github.com/ResearAI/t-detect.
☆ Med-R$^3$: Enhancing Medical Retrieval-Augmented Reasoning of LLMs via Progressive Reinforcement Learning
In medical scenarios, effectively retrieving external knowledge and
leveraging it for rigorous logical reasoning is of significant importance.
Despite its potential, existing work has predominantly focused on enhancing
either retrieval or reasoning capabilities of the models in isolation, with
little attention given to their joint optimization, which leads to limited
coordination between the two processes. Additionally, current methods rely
heavily on supervised fine-tuning (SFT), which can cause models to memorize
existing problem-solving pathways, thereby restricting their generalization
ability when confronted with novel problem contexts. Furthermore, while some
studies have explored improving retrieval-augmented reasoning in general
domains via reinforcement learning, their reward function designs do not
adequately capture the specific demands of the medical domain. To address these
challenges, we introduce **Med-R$^3$**, a **Med**ical **R**etrieval-augmented
**R**easoning framework driven by progressive **R**einforcement learning. In
this framework, we first develop the model's ability to perform logical
reasoning over medical problems. Subsequently, on the basis of this foundation,
we adaptively optimize the retrieval capability to better align with the
characteristics of knowledge corpus and external information utilization
throughout the reasoning process. Finally, we conduct joint optimization of the
model's retrieval and reasoning coordination. Extensive experiments indicate
that **Med-R$^3$** achieves state-of-the-art performance, with
LLaMA3.1-8B-Instruct + Med-R$^3$ surpassing the closed-source GPT-4o-mini by
3.93\% at a comparable parameter scale, while Qwen2.5-14B augmented with
Med-R$^3$ shows a more substantial gain of 13.53\%.
☆ MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks
Yadong Niu, Tianzi Wang, Heinrich Dinkel, Xingwei Sun, Jiahao Zhou, Gang Li, Jizhong Liu, Xunying Liu, Junbo Zhang, Jian Luan
While large audio-language models have advanced open-ended audio
understanding, they still fall short of nuanced human-level comprehension. This
gap persists largely because current benchmarks, limited by data annotations
and evaluation metrics, fail to reliably distinguish between generic and highly
detailed model outputs. To this end, this work introduces MECAT, a Multi-Expert
Constructed Benchmark for Fine-Grained Audio Understanding Tasks. Generated via
a pipeline that integrates analysis from specialized expert models with
Chain-of-Thought large language model reasoning, MECAT provides
multi-perspective, fine-grained captions and open-set question-answering pairs.
The benchmark is complemented by a novel metric: DATE (Discriminative-Enhanced
Audio Text Evaluation). This metric penalizes generic terms and rewards
detailed descriptions by combining single-sample semantic similarity with
cross-sample discriminability. A comprehensive evaluation of state-of-the-art
audio models is also presented, providing new insights into their current
capabilities and limitations. The data and code are available at
https://github.com/xiaomi-research/mecat
comment: 9 main pages, 5 figures, 3 tables, and 14 appendix pages
☆ A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains
Shirui Wang, Zhihui Tang, Huaxia Yang, Qiuhong Gong, Tiantian Gu, Hongyang Ma, Yongxin Wang, Wubin Sun, Zeliang Lian, Kehang Mao, Yinan Jiang, Zhicheng Huang, Lingyun Ma, Wenjie Shen, Yajie Ji, Yunhui Tan, Chunbo Wang, Yunlu Gao, Qianling Ye, Rui Lin, Mingyu Chen, Lijuan Niu, Zhihao Wang, Peng Yu, Mengran Lang, Yue Liu, Huimin Zhang, Haitao Shen, Long Chen, Qiguang Zhao, Si-Xuan Liu, Lina Zhou, Hua Gao, Dongqiang Ye, Lingmin Meng, Youtao Yu, Naixin Liang, Jianxiong Wu
Large language models (LLMs) hold promise in clinical decision support but
face major challenges in safety evaluation and effectiveness validation. We
developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a
multidimensional framework built on clinical expert consensus, encompassing 30
criteria covering critical areas like critical illness recognition, guideline
adherence, and medication safety, with weighted consequence measures.
Thirty-two specialist physicians developed and reviewed 2,069 open-ended Q&A
items aligned with these criteria, spanning 26 clinical departments to simulate
real-world scenarios. Benchmark testing of six LLMs revealed moderate overall
performance (average total score 57.2%, safety 54.7%, effectiveness 62.3%),
with a significant 13.3% performance drop in high-risk scenarios (p < 0.0001).
Domain-specific medical LLMs showed consistent performance advantages over
general-purpose models, with relatively higher top scores in safety (0.912) and
effectiveness (0.861). The findings of this study not only provide a
standardized metric for evaluating the clinical application of medical LLMs,
facilitating comparative analyses, risk exposure identification, and
improvement directions across different scenarios, but also hold the potential
to promote safer and more effective deployment of large language models in
healthcare environments.
☆ Role-Aware Language Models for Secure and Contextualized Access Control in Organizations
Saeed Almheiri, Yerulan Kongrat, Adrian Santosh, Ruslan Tasmukhanov, Josemaria Vera, Muhammad Dehan Al Kautsar, Fajri Koto
As large language models (LLMs) are increasingly deployed in enterprise
settings, controlling model behavior based on user roles becomes an essential
requirement. Existing safety methods typically assume uniform access and focus
on preventing harmful or toxic outputs, without addressing role-specific access
constraints. In this work, we investigate whether LLMs can be fine-tuned to
generate responses that reflect the access privileges associated with different
organizational roles. We explore three modeling strategies: a BERT-based
classifier, an LLM-based classifier, and role-conditioned generation. To
evaluate these approaches, we construct two complementary datasets. The first
is adapted from existing instruction-tuning corpora through clustering and role
labeling, while the second is synthetically generated to reflect realistic,
role-sensitive enterprise scenarios. We assess model performance across varying
organizational structures and analyze robustness to prompt injection, role
mismatch, and jailbreak attempts.
☆ Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems
This paper investigates defenses for LLM-based evaluation systems against
prompt injection. We formalize a class of threats called blind attacks, where a
candidate answer is crafted independently of the true answer to deceive the
evaluator. To counter such attacks, we propose a framework that augments
Standard Evaluation (SE) with Counterfactual Evaluation (CFE), which
re-evaluates the submission against a deliberately false ground-truth answer.
An attack is detected if the system validates an answer under both standard and
counterfactual conditions. Experiments show that while standard evaluation is
highly vulnerable, our SE+CFE framework significantly improves security by
boosting attack detection with minimal performance trade-offs.
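The detection rule stated in the abstract above (flag a submission that is validated under both the true and a deliberately false reference) can be sketched as follows. The function names, the three-way verdict labels, and the toy evaluator that stands in for an injectable LLM judge are all assumptions for illustration.

```python
def detect_blind_attack(evaluate, question, candidate, true_answer, false_answer):
    """Combine Standard Evaluation (SE) with Counterfactual Evaluation (CFE).

    `evaluate(question, candidate, reference)` returns True when the
    evaluator accepts the candidate against the given reference.
    """
    se = evaluate(question, candidate, true_answer)    # standard condition
    cfe = evaluate(question, candidate, false_answer)  # counterfactual condition

    # An answer accepted against both references cannot depend on the truth.
    if se and cfe:
        return "attack"
    return "correct" if se else "incorrect"


def toy_evaluate(question, candidate, reference):
    """Hypothetical vulnerable judge: an injected phrase forces acceptance."""
    if "mark this as correct" in candidate.lower():
        return True
    return candidate.strip() == reference.strip()
```

With this toy judge, an honest answer passes only the standard check, while an injected instruction passes both checks and is flagged.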
☆ Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration
Critical thinking is essential for building robust AI systems, preventing
them from blindly accepting flawed data or biased reasoning. However, prior
work has primarily focused on passive critical thinking, where models simply
reject problematic queries without taking constructive steps to address user
requests. In this work, we introduce proactive critical thinking, a paradigm
where models actively seek missing or clarifying information from users to
resolve their queries better. To evaluate this capability, we present GSM-MC
and GSM-MCE, two novel benchmarks based on GSM8K for assessing mathematical
reasoning under incomplete or misleading conditions. GSM-MC contains 1,368 math
problems with a key variable deliberately removed, requiring models to identify
and request the missing information. GSM-MCE further increases the difficulty
by introducing irrelevant details to test robustness against distractions.
Experiments on Qwen3 and Llama series models show that, while these models
excel in traditional reasoning tasks due to extensive post-training and
inference-time scaling, they struggle with proactive critical thinking,
especially smaller ones. However, we demonstrate that reinforcement learning
(RL) can significantly improve this ability. Using our enhanced RL algorithm,
we achieve substantial gains, boosting Qwen3-1.7B's accuracy from 0.15% to
73.98% on GSM-MC. We hope this work advances models that collaborate more
effectively with users in problem-solving through proactive critical thinking.
☆ Enhanced Arabic Text Retrieval with Attentive Relevance Scoring
Arabic poses a particular challenge for natural language processing (NLP) and
information retrieval (IR) due to its complex morphology, optional diacritics
and the coexistence of Modern Standard Arabic (MSA) and various dialects.
Despite the growing global significance of Arabic, it is still underrepresented
in NLP research and benchmark resources. In this paper, we present an enhanced
Dense Passage Retrieval (DPR) framework developed specifically for Arabic. At
the core of our approach is a novel Attentive Relevance Scoring (ARS) that
replaces standard interaction mechanisms with an adaptive scoring function that
more effectively models the semantic relevance between questions and passages.
Our method integrates pre-trained Arabic language models and architectural
refinements to improve retrieval performance and significantly increase ranking
accuracy when answering Arabic questions. The code is made publicly available
at https://github.com/Bekhouche/APR.
☆ MRGSEM-Sum: An Unsupervised Multi-document Summarization Framework based on Multi-Relational Graphs and Structural Entropy Minimization
The core challenge faced by multi-document summarization is the complexity of
relationships among documents and the presence of information redundancy. Graph
clustering is an effective paradigm for addressing this issue, as it models the
complex relationships among documents using graph structures and reduces
information redundancy through clustering, achieving significant research
progress. However, existing methods often only consider single-relational
graphs and require a predefined number of clusters, which hinders their ability
to fully represent rich relational information and adaptively partition
sentence groups to reduce redundancy. To overcome these limitations, we propose
MRGSEM-Sum, an unsupervised multi-document summarization framework based on
multi-relational graphs and structural entropy minimization. Specifically, we
construct a multi-relational graph that integrates semantic and discourse
relations between sentences, comprehensively modeling the intricate and dynamic
connections among sentences across documents. We then apply a two-dimensional
structural entropy minimization algorithm for clustering, automatically
determining the optimal number of clusters and effectively organizing sentences
into coherent groups. Finally, we introduce a position-aware compression
mechanism to distill each cluster, generating concise and informative
summaries. Extensive experiments on four benchmark datasets (Multi-News,
DUC-2004, PubMed, and WikiSum) demonstrate that our approach consistently
outperforms previous unsupervised methods and, in several cases, achieves
performance comparable to supervised models and large language models. Human
evaluation demonstrates that the summaries generated by MRGSEM-Sum exhibit high
consistency and coverage, approaching human-level quality.
☆ Beyond the Cloud: Assessing the Benefits and Drawbacks of Local LLM Deployment for Translators
The rapid proliferation of Large Language Models presents both opportunities
and challenges for the translation field. While commercial, cloud-based AI
chatbots have garnered significant attention in translation studies, concerns
regarding data privacy, security, and equitable access necessitate exploration
of alternative deployment models. This paper investigates the feasibility and
performance of locally deployable, free language models as a viable alternative
to proprietary, cloud-based AI solutions. This study evaluates three
open-source models installed on CPU-based platforms and compares them against
commercially available online chatbots. The evaluation focuses on functional
performance rather than a comparative analysis of human-machine translation
quality, an area already subject to extensive research. The platforms assessed
were chosen for their accessibility and ease of use across various operating
systems. While local deployment introduces its own challenges, the benefits of
enhanced data control, improved privacy, and reduced dependency on cloud
services are compelling. The findings of this study contribute to a growing
body of knowledge concerning the democratization of AI technology and inform
future research and development efforts aimed at making LLMs more accessible
and practical for a wider range of users, specifically focusing on the needs of
individual translators and small businesses.
☆ Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models
Decoder-only large language models (LLMs) are increasingly used to build
embedding models that effectively encode the semantic information of natural
language texts into dense vector representations for various embedding tasks.
However, many existing methods primarily focus on removing the causal attention
mask in LLMs to enable bidirectional attention, potentially undermining the
model's ability to extract semantic information acquired during pretraining.
Additionally, leading unidirectional approaches often rely on extra input text
to overcome the inherent limitations of causal attention, inevitably increasing
computational costs. In this work, we propose Causal2Vec, a general-purpose
embedding model tailored to enhance the performance of decoder-only LLMs
without altering their original architectures or introducing significant
computational overhead. Specifically, we first employ a lightweight BERT-style
model to pre-encode the input text into a single Contextual token, which is
then prepended to the LLM's input sequence, allowing each token to capture
contextualized information even without attending to future tokens.
Furthermore, to mitigate the recency bias introduced by last-token pooling and
help LLMs better leverage the semantic information encoded in the Contextual
token, we concatenate the last hidden states of Contextual and EOS tokens as
the final text embedding. In practice, Causal2Vec achieves state-of-the-art
performance on the Massive Text Embeddings Benchmark (MTEB) among models
trained solely on publicly available retrieval datasets, while reducing the
required sequence length by up to 85% and inference time by up to 82% compared
to best-performing methods.
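The pooling step described in the abstract above (concatenating the last hidden states of the Contextual and EOS tokens) can be sketched in a few lines. The assumptions here are that the Contextual token sits at position 0 and the EOS token at the last position of the sequence, and that hidden states arrive as plain lists of floats; the function name is hypothetical and the actual model may position tokens differently.

```python
def contextual_eos_embedding(hidden_states):
    """Concatenate the final-layer states of the Contextual and EOS tokens.

    `hidden_states` is a list of per-token vectors from the last layer.
    Assumed layout: index 0 is the Contextual token, index -1 is EOS.
    """
    contextual = hidden_states[0]
    eos = hidden_states[-1]
    # The resulting embedding is twice the hidden dimension.
    return list(contextual) + list(eos)
```

Compared with last-token pooling alone, the concatenation keeps a component that was computed from the whole input before the causal pass, which is the motivation the abstract gives for reducing recency bias.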
★ MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models
Multimodal planning capabilities refer to the ability to predict, reason, and
design steps for task execution with multimodal context, which is essential for
complex reasoning and decision-making across multiple steps. However, current
benchmarks face two key challenges: (1) they cannot directly assess multimodal
real-world planning capabilities, and (2) they lack constraints or implicit
constraints across modalities. To address these issues, we introduce Multimodal
Planning with Complex Constraints (MPCC), the first benchmark to systematically
evaluate MLLMs' ability to handle multimodal constraints in planning. To
address the first challenge, MPCC focuses on three real-world tasks: Flight
Planning, Calendar Planning, and Meeting Planning. To solve the second
challenge, we introduce complex constraints (e.g. budget, temporal, and
spatial) in these tasks, with graded difficulty levels (EASY, MEDIUM, HARD) to
separate constraint complexity from search space expansion. Experiments on 13
advanced MLLMs reveal significant challenges: closed-source models achieve only
21.3% feasible plans, while open-source models average below 11%. Additionally,
we observe that MLLMs are highly sensitive to constraint complexity and that
traditional multimodal prompting strategies fail in multi-constraint scenarios.
Our work formalizes multimodal constraints in planning, provides a rigorous
evaluation framework, and highlights the need for advancements in
constraint-aware reasoning for real-world MLLM applications.
comment: Accepted to ACM Multimedia 2025
☆ Holistic Evaluations of Topic Models
Topic models are gaining increasing commercial and academic interest for
their ability to summarize large volumes of unstructured text. As unsupervised
machine learning methods, they enable researchers to explore data and help
general users understand key themes in large text collections. However, they
risk becoming a 'black box', where users input data and accept the output as an
accurate summary without scrutiny. This article evaluates topic models from a
database perspective, drawing insights from 1140 BERTopic model runs. The goal
is to identify trade-offs in optimizing model parameters and to reflect on what
these findings mean for the interpretation and responsible use of topic models.
comment: 10 pages, 6 tables
☆ SWE-Exp: Experience-Driven Software Issue Resolution
Silin Chen, Shaoxin Lin, Xiaodong Gu, Yuling Shi, Heng Lian, Longfei Yun, Dong Chen, Weiguo Sun, Lin Cao, Qianxiang Wang
Recent advances in large language model (LLM) agents have shown remarkable
progress in software issue resolution, leveraging advanced techniques such as
multi-agent collaboration and Monte Carlo Tree Search (MCTS). However, current
agents act as memoryless explorers - treating each problem separately without
retaining or reusing knowledge from previous repair experiences. This leads to
redundant exploration of failed trajectories and missed chances to adapt
successful issue resolution methods to similar problems. To address this
problem, we introduce SWE-Exp, an experience-enhanced approach that distills
concise and actionable experience from prior agent trajectories, enabling
continuous learning across issues. Our method introduces a multi-faceted
experience bank that captures both successful and failed repair attempts.
Specifically, it extracts reusable issue resolution knowledge at different
levels - from high-level problem comprehension to specific code changes.
Experiments show that SWE-Exp achieves state-of-the-art resolution rate (41.6%
Pass@1) on SWE-bench-Verified under open-source agent frameworks. Our approach
establishes a new paradigm in which automated software engineering agents
systematically accumulate and leverage repair expertise, fundamentally shifting
from trial-and-error exploration to strategic, experience-driven issue
resolution.
comment: Our code and data are available at
https://github.com/YerbaPage/SWE-Exp
☆ Text-to-SQL Task-oriented Dialogue Ontology Construction
Renato Vukovic, Carel van Niekerk, Michael Heck, Benjamin Ruppik, Hsien-Chin Lin, Shutong Feng, Nurul Lubis, Milica Gasic
Large language models (LLMs) are widely used as general-purpose knowledge
sources, but they rely on parametric knowledge, limiting explainability and
trustworthiness. In task-oriented dialogue (TOD) systems, this separation is
explicit, using an external database structured by an explicit ontology to
ensure explainability and controllability. However, building such ontologies
requires manual labels or supervised training. We introduce TeQoDO: a
Text-to-SQL task-oriented Dialogue Ontology construction method. Here, an LLM
autonomously builds a TOD ontology from scratch without supervision using its
inherent SQL programming capabilities combined with dialogue theory provided in
the prompt. We show that TeQoDO outperforms transfer learning approaches, and
its constructed ontology is competitive on a downstream dialogue state tracking
task. Ablation studies demonstrate the key role of dialogue theory. TeQoDO also
scales to allow construction of much larger ontologies, which we investigate on
a Wikipedia and ArXiv dataset. We view this as a step towards broader
application of ontologies to increase LLM explainability.
☆ SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution
Han Li, Yuling Shi, Shaoxin Lin, Xiaodong Gu, Heng Lian, Xin Wang, Yantao Jia, Tao Huang, Qianxiang Wang
Issue resolution has made remarkable progress thanks to the advanced
reasoning capabilities of large language models (LLMs). Recently, agent-based
frameworks such as SWE-agent have further advanced this progress by enabling
autonomous, tool-using agents to tackle complex software engineering tasks.
While existing agent-based issue resolution approaches are primarily based on
agents' independent explorations, they often get stuck in local solutions and
fail to identify issue patterns that span across different parts of the
codebase. To address this limitation, we propose SWE-Debate, a competitive
multi-agent debate framework that encourages diverse reasoning paths and
achieves more consolidated issue localization. SWE-Debate first creates
multiple fault propagation traces as localization proposals by traversing a
code dependency graph. Then, it organizes a three-round debate among
specialized agents, each embodying distinct reasoning perspectives along the
fault propagation trace. This structured competition enables agents to
collaboratively converge on a consolidated fix plan. Finally, this consolidated
fix plan is integrated into an MCTS-based code modification agent for patch
generation. Experiments on the SWE-bench benchmark show that SWE-Debate
achieves new state-of-the-art results in open-source agent frameworks and
outperforms baselines by a large margin.
comment: Our code and data are available at
https://github.com/YerbaPage/SWE-Debate
☆ DSBC: Data Science task Benchmarking with Context engineering
Recent advances in large language models (LLMs) have significantly impacted
data science workflows, giving rise to specialized data science agents designed
to automate analytical tasks. Despite rapid adoption, systematic benchmarks
evaluating the efficacy and limitations of these agents remain scarce. In this
paper, we introduce a comprehensive benchmark specifically crafted to reflect
real-world user interactions with data science agents by observing usage of our
commercial applications. We evaluate three LLMs: Claude-4.0-Sonnet,
Gemini-2.5-Flash, and OpenAI-o4-Mini across three approaches: zero-shot with
context engineering, multi-step with context engineering, and with SmolAgent.
Our benchmark assesses performance across a diverse set of eight data science
task categories, additionally exploring the sensitivity of models to common
prompting issues, such as data leakage and slightly ambiguous instructions. We
further investigate the influence of temperature parameters on overall and
task-specific outcomes for each model and approach. Our findings reveal
distinct performance disparities among the evaluated models and methodologies,
highlighting critical factors that affect practical deployment. The benchmark
dataset and evaluation framework introduced herein aim to provide a foundation
for future research on more robust and effective data science agents.
comment: 32 pages
♻ ☆ Perception-Aware Policy Optimization for Multimodal Reasoning
Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji
Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a
highly effective strategy for endowing Large Language Models (LLMs) with robust
multi-step reasoning abilities. However, its design and optimizations remain
tailored to purely textual domains, resulting in suboptimal performance when
applied to multimodal reasoning tasks. In particular, we observe that a major
source of error in current multimodal reasoning lies in the perception of
visual inputs. To address this bottleneck, we propose PAPO, a novel policy
gradient algorithm that encourages the model to learn to perceive while
learning to reason. Specifically, we introduce the Implicit Perception Loss in
the form of a KL divergence term, which can be seamlessly plugged into
mainstream RLVR algorithms such as GRPO and DAPO. Notably, PAPO does not rely
on additional data curation, reward models, or stronger teacher models. To
further enhance the training stability of PAPO, we introduce the Double Entropy
Loss, which effectively regularizes the new KL objective without compromising
performance. Despite its simplicity, PAPO yields significant overall
improvements of 4.4%-17.5% on diverse multimodal benchmarks. The improvements
are more pronounced, approaching 8.0%-19.1%, on tasks with high vision
dependency. We also observe a substantial reduction of 30.5% in perception
errors, indicating improved perceptual capabilities with PAPO. Overall, our
work introduces a deeper integration of perception-aware supervision into core
learning objectives and lays the groundwork for a new RL framework that
encourages visually grounded reasoning. Code and data will be made publicly
available for research purposes. Project page:
https://mikewangwzhl.github.io/PAPO.
♻ ☆ How AI Ideas Affect the Creativity, Diversity, and Evolution of Human Ideas: Evidence From a Large, Dynamic Experiment
Exposure to large language model output is rapidly increasing. How will
seeing AI-generated ideas affect human ideas? We conducted an experiment (800+
participants, 40+ countries) where participants viewed creative ideas that were
from ChatGPT or prior experimental participants and then brainstormed their own
idea. We varied the number of AI-generated examples (none, low, or high
exposure) and whether the examples were labeled as 'AI' (disclosure). Our dynamic
experiment design -- ideas from prior participants in an experimental condition
are used as stimuli for future participants in the same experimental condition
-- speaks to the interdependent process of cultural creation: creative ideas
are built upon prior ideas. Hence, we capture the compounding effects of having
LLMs 'in the culture loop'. We find that high AI exposure (but not low AI
exposure) did not affect the creativity of individual ideas but did increase
the average amount and rate of change of collective idea diversity. AI made
ideas different, not better. There were no main effects of disclosure. We also
found that self-reported creative people were less influenced by knowing an
idea was from AI and that participants may knowingly adopt AI ideas when the
task is difficult. Our findings suggest that introducing AI ideas may increase
collective diversity but not individual creativity.
comment: Accepted at ACM Collective Intelligence 2025. Originally posted 2024
♻ ★ RecGPT Technical Report
Chao Yi, Dian Chen, Gaoyang Guo, Jiakai Tang, Jian Wu, Jing Yu, Mao Zhang, Sunhao Dai, Wen Chen, Wenjun Yang, Yuning Jiang, Zhujin Gao, Bo Zheng, Chi Li, Dimin Wang, Dixuan Wang, Fan Li, Fan Zhang, Haibin Chen, Haozhuang Liu, Jialin Zhu, Jiamang Wang, Jiawei Wu, Jin Cui, Ju Huang, Kai Zhang, Kan Liu, Lang Tian, Liang Rao, Longbin Li, Lulu Zhao, Na He, Peiyang Wang, Qiqi Huang, Tao Luo, Wenbo Su, Xiaoxiao He, Xin Tong, Xu Chen, Xunke Xi, Yang Li, Yaxuan Wu, Yeqiu Yang, Yi Hu, Yinnan Song, Yuchen Li, Yujie Luo, Yujin Yuan, Yuliang Yan, Zhengyang Wang, Zhibo Xiao, Zhixin Ma, Zile Zhou, Ziqi Zhang
Recommender systems are among the most impactful applications of artificial
intelligence, serving as critical infrastructure connecting users, merchants,
and platforms. However, most current industrial systems remain heavily reliant
on historical co-occurrence patterns and log-fitting objectives, i.e.,
optimizing for past user interactions without explicitly modeling user intent.
This log-fitting approach often leads to overfitting to narrow historical
preferences, failing to capture users' evolving and latent interests. As a
result, it reinforces filter bubbles and long-tail phenomena, ultimately
harming user experience and threatening the sustainability of the whole
recommendation ecosystem.
To address these challenges, we rethink the overall design paradigm of
recommender systems and propose RecGPT, a next-generation framework that places
user intent at the center of the recommendation pipeline. By integrating large
language models (LLMs) into key stages of user interest mining, item retrieval,
and explanation generation, RecGPT transforms log-fitting recommendation into
an intent-centric process. To effectively align general-purpose LLMs to the
above domain-specific recommendation tasks at scale, RecGPT incorporates a
multi-stage training paradigm, which integrates reasoning-enhanced
pre-alignment and self-training evolution, guided by a Human-LLM cooperative
judge system. Currently, RecGPT has been fully deployed on the Taobao App.
Online experiments demonstrate that RecGPT achieves consistent performance
gains across stakeholders: users benefit from increased content diversity and
satisfaction, while merchants and the platform gain greater exposure and conversions.
These comprehensive improvements across all stakeholders validate that
LLM-driven, intent-centric design can foster a more sustainable and mutually
beneficial recommendation ecosystem.
♻ ☆ Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length ICML 2025
Information retrieval in Large Language Models (LLMs) is increasingly
recognized as intertwined with generation capabilities rather than mere lookup.
While longer contexts are often assumed to improve retrieval, the effects of
intra-context interference remain understudied. To address this, we adapt the
proactive interference (PI) paradigm from cognitive science, where earlier
information disrupts recall of newer updates. In humans, susceptibility to such
interference is inversely linked to working memory capacity. We introduce
PI-LLM, an evaluation that sequentially streams semantically related key-value
updates and queries only the final values. Although these final values are
clearly positioned just before the query, LLM retrieval accuracy declines
log-linearly toward zero as interference accumulates; errors arise from
retrieving previously overwritten values. Attempts to mitigate interference via
prompt engineering (e.g., instructing models to ignore earlier input) yield
limited success. These findings reveal a fundamental constraint on LLMs'
ability to disentangle interference and flexibly manipulate information,
suggesting a working memory bottleneck beyond mere context access. This calls
for approaches that strengthen models' ability to suppress irrelevant content
during retrieval.
comment: Accepted at ICML 2025 Workshop on Long Context Foundation Models
(ICFM). Code: https://github.com/zhuangziGiantfish/Unable-to-Forget
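The evaluation setup described above can be sketched roughly as follows. This is an illustrative construction only: the prompt wording, function name, and sampling scheme are assumptions and are not taken from the PI-LLM code.

```python
import random
from typing import Dict, List, Sequence, Tuple


def build_pi_stream(
    keys: Sequence[str],
    values: Sequence[str],
    updates_per_key: int,
    seed: int = 0,
) -> Tuple[str, Dict[str, str]]:
    """Stream repeated key-value updates and return (prompt, final values).

    Each round overwrites every key once, so earlier values become
    interference for the final query.
    """
    rng = random.Random(seed)
    lines: List[str] = []
    final: Dict[str, str] = {}
    for _ in range(updates_per_key):
        for key in keys:
            value = rng.choice(values)
            lines.append(f"{key}: {value}")
            final[key] = value  # later updates overwrite earlier ones
    # Hypothetical query wording; only the final values count as correct.
    lines.append("What is the current value of each key?")
    return "\n".join(lines), final


prompt, expected = build_pi_stream(["alpha", "beta"], ["1", "2", "3"], updates_per_key=4)
```

Raising `updates_per_key` increases the amount of interference while the final values stay directly before the query, which is the condition the abstract describes.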
♻ ☆ DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures
We introduce DocPolarBERT, a layout-aware BERT model for document
understanding that eliminates the need for absolute 2D positional embeddings.
We extend self-attention to take into account text block positions in a relative
polar coordinate system rather than the Cartesian one. Despite being
pre-trained on a dataset more than six times smaller than the widely used
IIT-CDIP corpus, DocPolarBERT achieves state-of-the-art results. These results
demonstrate that a carefully designed attention mechanism can compensate for
reduced pre-training data, offering an efficient and effective alternative for
document understanding.
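As a rough illustration of what a relative polar encoding between two text blocks could look like, the sketch below converts the offset between two block positions into a distance and an angle. Using block centers and this particular function name are assumptions; the abstract does not specify how the model discretizes or embeds these quantities.

```python
import math
from typing import Tuple


def relative_polar(
    source_xy: Tuple[float, float],
    target_xy: Tuple[float, float],
) -> Tuple[float, float]:
    """Return (distance, angle in radians) of target relative to source."""
    dx = target_xy[0] - source_xy[0]
    dy = target_xy[1] - source_xy[1]
    return math.hypot(dx, dy), math.atan2(dy, dx)


# Two hypothetical block centers on a page.
distance, angle = relative_polar((0.0, 0.0), (3.0, 4.0))
```

Because only offsets enter the computation, the result is unchanged when both blocks are shifted by the same amount, which is why no absolute 2D position is needed.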
♻ ☆ Who's important? -- SUnSET: Synergistic Understanding of Stakeholder, Events and Time for Timeline Generation
As news reporting becomes increasingly global and decentralized online,
tracking related events across multiple sources presents significant
challenges. Existing news summarization methods typically utilize Large
Language Models and Graphical methods on article-based summaries. However, this
is not effective since it only considers the textual content of similarly dated
articles to understand the gist of the event. To counteract the lack of
analysis on the parties involved, it is essential to come up with a novel
framework to gauge the importance of stakeholders and the connection of related
events through the relevant entities involved. Therefore, we present SUnSET:
Synergistic Understanding of Stakeholder, Events and Time for the task of
Timeline Summarization (TLS). We leverage powerful Large Language Models (LLMs)
to build SET triplets and introduce the use of stakeholder-based ranking to
construct a $Relevancy$ metric, which can be extended to general situations.
In our experiments, our approach outperforms all prior baselines and sets a new
state of the art, highlighting the impact of stakeholder information within
news articles.
♻ ☆ How Can I Publish My LLM Benchmark Without Giving the True Answers Away? ICML 2025
Publishing a large language model (LLM) benchmark on the Internet risks
contaminating future LLMs: the benchmark may be unintentionally (or
intentionally) used to train or select a model. A common mitigation is to keep
the benchmark private and let participants submit their models or predictions
to the organizers. However, this strategy requires trust in a single
organization and still permits test-set overfitting through repeated queries.
To overcome this issue, we propose a way to publish benchmarks without
completely disclosing the ground-truth answers to the questions, while still
maintaining the ability to openly evaluate LLMs. Our main idea is to inject
randomness into the answers by preparing several logically correct answers and
including only one of them as the solution in the benchmark. This reduces the
best possible accuracy, i.e., Bayes accuracy, of the benchmark. Not only is
this helpful to keep us from disclosing the ground truth, but this approach
also offers a test for detecting data contamination. In principle, even fully
capable models should not surpass the Bayes accuracy. If a model surpasses this
ceiling despite this expectation, this is a strong signal of data
contamination. We present experimental evidence that our method can detect data
contamination accurately on a wide range of benchmarks, models, and training
methodologies.
comment: Extended version of the paper presented as an Oral at the ICML 2025
Workshop on the Impact of Memorization on Trustworthy Foundation Models
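A small worked sketch of the ceiling idea above, assuming each question comes with a known probability distribution over which logically correct answer was published. The function names and the `margin` parameter are hypothetical; a real test would presumably also account for sampling noise, which this sketch does not model.

```python
from typing import Sequence


def bayes_accuracy(answer_distributions: Sequence[Sequence[float]]) -> float:
    """Best achievable accuracy without knowing the published answers.

    For each question, the best guess is the most probable answer, so the
    ceiling is the mean of the per-question maximum probabilities.
    """
    return sum(max(p) for p in answer_distributions) / len(answer_distributions)


def exceeds_ceiling(
    observed_accuracy: float,
    answer_distributions: Sequence[Sequence[float]],
    margin: float = 0.0,
) -> bool:
    """Flag possible contamination when accuracy is above the ceiling."""
    return observed_accuracy > bayes_accuracy(answer_distributions) + margin


# Question 1 has two equally likely correct answers, question 2 has four.
distributions = [[0.5, 0.5], [0.25, 0.25, 0.25, 0.25]]
ceiling = bayes_accuracy(distributions)
```

With uniformly chosen answers, a question with k correct candidates contributes 1/k to the ceiling, so adding more candidates per question lowers the Bayes accuracy further.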
♻ ☆ Splits! A Flexible Dataset and Evaluation Framework for Sociocultural Linguistic Investigation
Variation in language use, shaped by speakers' sociocultural background and
specific context of use, offers a rich lens into cultural perspectives, values,
and opinions. However, the computational study of these Sociocultural
Linguistic Phenomena (SLP) has often been limited to bespoke analyses of
specific groups or topics, hindering the pace of scientific discovery. To
address this, we introduce Splits!, a 9.7 million-post dataset from Reddit
designed for systematic and flexible research. The dataset contains posts from
over 53,000 users across 6 demographic groups, organized into 89 discussion
topics to enable comparative analysis. We validate Splits! via
self-identification and by successfully replicating several known SLPs from
existing literature. We complement this dataset with a framework that leverages
efficient retrieval methods to rapidly validate potential SLPs (PSLPs) by
automatically evaluating whether a given hypothesis is supported by our data.
Crucially, to distinguish between novel and obvious insights, the framework
incorporates a human-validated measure of a hypothesis's ``unexpectedness.'' We
demonstrate that the two-stage process reduces the number of statistically
significant findings requiring manual inspection by a factor of 1.5-1.8x,
streamlining the discovery of promising phenomena for further investigation.
comment: Preprint, under review
♻ ☆ ILID: Native Script Language Identification for Indian Languages
The language identification task is a crucial fundamental step in NLP. Often
it serves as a pre-processing step for widely used NLP applications such as
multilingual machine translation, information retrieval, question
answering, and text summarization. The core challenge of language
identification lies in distinguishing languages in noisy, short, and code-mixed
environments. This becomes even harder in the case of diverse Indian languages that
exhibit lexical and phonetic similarities, but have distinct differences. Many
Indian languages share the same script, making the task even more challenging.
Taking all these challenges into account, we develop and release a dataset of
250K sentences consisting of 23 languages including English and all 22 official
Indian languages labeled with their language identifiers, where data in most
languages are newly created. We also develop and release baseline models using
state-of-the-art approaches in machine learning and fine-tuning pre-trained
transformer models. Our models outperform the state-of-the-art pre-trained
transformer models for the language identification task. The dataset and the
code are available at https://yashingle-ai.github.io/ILID/ and in Hugging Face
open-source libraries.
comment: 10 pages, 1 figure, 6 tables, Paper accepted in RANLP 2025
♻ ☆ Inside-Out: Hidden Factual Knowledge in LLMs
Zorik Gekhman, Eyal Ben David, Hadas Orgad, Eran Ofek, Yonatan Belinkov, Idan Szpektor, Jonathan Herzig, Roi Reichart
This work presents a framework for assessing whether large language models
(LLMs) encode more factual knowledge in their parameters than what they express
in their outputs. While a few studies hint at this possibility, none has
clearly defined or demonstrated this phenomenon. We first propose a formal
definition of knowledge, quantifying it for a given question as the fraction of
correct-incorrect answer pairs where the correct one is ranked higher. This
gives rise to external and internal knowledge, depending on the information
used to score individual answer candidates: either the model's observable
token-level probabilities or its intermediate computations. Hidden knowledge
arises when internal knowledge exceeds external knowledge. We then present a
case study, applying this framework to three popular open-weights LLMs in a
closed-book QA setup. Our results indicate that: (1) LLMs consistently encode
more factual knowledge internally than what they express externally, with an
average relative gap of 40%. (2) Surprisingly, some knowledge is so deeply
hidden that a model can internally know an answer perfectly, yet fail to
generate it even once, despite large-scale repeated sampling of 1,000 answers.
This reveals fundamental limitations in the generation capabilities of LLMs,
which (3) put a practical constraint on scaling test-time compute via repeated
answer sampling in closed-book QA: significant performance improvements remain
inaccessible because some answers are practically never sampled, yet if they
were, we would be guaranteed to rank them first.
comment: Accepted to COLM 2025
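The pairwise definition of knowledge given above can be illustrated with a short sketch. The function name and the toy scores are illustrative assumptions, and treating ties as "not ranked higher" is a choice made here rather than something stated in the abstract.

```python
from itertools import product
from typing import Sequence


def knowledge_score(
    correct_scores: Sequence[float],
    incorrect_scores: Sequence[float],
) -> float:
    """Fraction of (correct, incorrect) pairs where the correct answer scores higher."""
    pairs = list(product(correct_scores, incorrect_scores))
    if not pairs:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(1 for correct, incorrect in pairs if correct > incorrect)
    return wins / len(pairs)


# Hypothetical scores for one question. "External" scores would come from
# token-level probabilities, "internal" scores from intermediate computations.
external = knowledge_score([0.9, 0.4], [0.5, 0.1])
internal = knowledge_score([0.9, 0.8], [0.5, 0.1])
hidden_gap = internal - external
```

In this toy case the internal scorer ranks every correct answer above every incorrect one while the external scorer does not, so the positive gap corresponds to what the abstract calls hidden knowledge.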
♻ ☆ Neutral Residues: Revisiting Adapters for Model Extension ICML 2025
We address the problem of extending a pretrained large language model to a
new domain that was not seen during training. Standard techniques, such as
finetuning or low-rank adaptation (LoRA), are successful at domain adaptation,
but do not formally add capacity to the model. This often leads to a trade-off
between performing well on the new domain and degrading performance on the
original domain. Here, we revisit and improve adapters to extend LLMs from
three angles: data, architecture and training procedure, which are
advantageously considered jointly. The resulting method, called neutral
residues, modifies adapters in a way that leads each new residual block to
output near-zeros on the original domain. This solution leads to strong results
when adapting a state-of-the-art model originally trained on English to a new
language. Neutral residues significantly outperform competing approaches such
as finetuning, LoRA or vanilla adapters in terms of the trade-off between
learning the new language and not forgetting English.
comment: Accepted at ICML 2025
♻ ☆ Can LLMs assist with Ambiguity? A Quantitative Evaluation of various Large Language Models on Word Sense Disambiguation
Ambiguous words are often found in modern digital communications. Lexical
ambiguity challenges traditional Word Sense Disambiguation (WSD) methods, due
to limited data. Consequently, the efficiency of translation, information
retrieval, and question-answering systems is hindered by these limitations.
This study investigates the use of Large Language Models (LLMs) to improve WSD
using a novel approach combining a systematic prompt augmentation mechanism
with a knowledge base (KB) consisting of different sense interpretations. The
proposed method incorporates a human-in-the-loop approach for prompt augmentation
where the prompt is supported by Part-of-Speech (POS) tagging, synonyms of
ambiguous words, aspect-based sense filtering and few-shot prompting to guide
the LLM. By utilizing a few-shot Chain of Thought (COT) prompting-based
approach, this work demonstrates a substantial improvement in performance. The
evaluation was conducted using FEWS test data and sense tags. This research
advances accurate word interpretation in social media and digital
communication.
comment: 12 pages, 6 tables, 1 figure, Proceedings of the 1st International
Conference on NLP & AI for Cyber Security
♻ ☆ PurpCode: Reasoning for Safer Code Generation
Jiawei Liu, Nirav Diwan, Zhe Wang, Haoyu Zhai, Xiaona Zhou, Kiet A. Nguyen, Tianjiao Yu, Muntasir Wahed, Yinlin Deng, Hadjer Benkraouda, Yuxiang Wei, Lingming Zhang, Ismini Lourentzou, Gang Wang
We introduce PurpCode, the first post-training recipe for training safe code
reasoning models towards generating secure code and defending against malicious
cyberactivities. PurpCode trains a reasoning model in two stages: (i) Rule
Learning, which explicitly teaches the model to reference cybersafety rules to
generate vulnerability-free code and to avoid facilitating malicious
cyberactivities; and (ii) Reinforcement Learning, which optimizes model safety
and preserves model utility through diverse, multi-objective reward mechanisms.
To empower the training pipelines with comprehensive cybersafety data, we
conduct internal red-teaming to synthesize comprehensive and high-coverage
prompts based on real-world tasks for inducing unsafe cyberactivities in the
model. Based on PurpCode, we develop a reasoning-based coding model, namely
PurpCode-32B, which demonstrates state-of-the-art cybersafety, outperforming
various frontier models. Meanwhile, our alignment method decreases the model's
overrefusal rates in both general and cybersafety-specific scenarios, while
preserving model utility in both code generation and common security knowledge.
♻ ☆ LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning
Federico Cocchi, Nicholas Moratelli, Davide Caffagni, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara
Recent progress in Multimodal Large Language Models (MLLMs) has highlighted
the critical roles of both the visual backbone and the underlying language
model. While prior work has primarily focused on scaling these components to
billions of parameters, the trade-offs between model size, architecture, and
performance remain underexplored. Additionally, inconsistencies in training
data and evaluation protocols have hindered direct comparisons, making it
difficult to derive optimal design choices. In this paper, we introduce
LLaVA-MORE, a new family of MLLMs that integrates recent language models with
diverse visual backbones. To ensure fair comparisons, we employ a unified
training protocol applied consistently across all architectures. Our analysis
systematically explores both small- and medium-scale LLMs -- including Phi-4,
LLaMA-3.1, and Gemma-2 -- to evaluate multimodal reasoning, generation, and
instruction following, while examining the relationship between model size and
performance. Beyond evaluating the LLM impact on final results, we conduct a
comprehensive study of various visual encoders, ranging from CLIP-based
architectures to alternatives such as DINOv2, SigLIP, and SigLIP2. Additional
experiments investigate the effects of increased image resolution and
variations in pre-training datasets. Overall, our results provide insights into
the design of more effective MLLMs, offering a reproducible evaluation
framework that facilitates direct comparisons and can guide future model
development. Our source code and trained models are publicly available at:
https://github.com/aimagelab/LLaVA-MORE.
comment: ICCV 2025 Workshop on What is Next in Multimodal Foundation Models
♻ ☆ EducationQ: Evaluating LLMs' Teaching Capabilities Through Multi-Agent Dialogue Framework
Large language models (LLMs) increasingly serve as educational tools, yet
evaluating their teaching capabilities remains challenging due to the
resource-intensive, context-dependent, and methodologically complex nature of
teacher-student interactions. We introduce EducationQ, a multi-agent dialogue
framework that efficiently assesses teaching capabilities through simulated
dynamic educational scenarios, featuring specialized agents for teaching,
learning, and evaluation. Testing 14 LLMs across major AI Organizations
(OpenAI, Meta, Google, Anthropic, and others) on 1,498 questions spanning 13
disciplines and 10 difficulty levels reveals that teaching effectiveness does
not correlate linearly with model scale or general reasoning capabilities -
with some smaller open-source models outperforming larger commercial
counterparts in teaching contexts. This finding highlights a critical gap in
current evaluations that prioritize knowledge recall over interactive pedagogy.
Our mixed-methods evaluation, combining quantitative metrics with qualitative
analysis and expert case studies, identifies distinct pedagogical strengths
employed by top-performing models (e.g., sophisticated questioning strategies,
adaptive feedback mechanisms). Human expert evaluations show 78% agreement with
our automated qualitative analysis of effective teaching behaviors, validating
our methodology. EducationQ demonstrates that LLMs-as-teachers require
specialized optimization beyond simple scaling, suggesting next-generation
educational AI prioritize targeted enhancement of specific pedagogical
effectiveness.
comment: Paper URL: https://aclanthology.org/2025.acl-long.1576 ; Presentation
Video: https://www.youtube.com/watch?v=j63ooKE50I0
♻ ☆ The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models
Current large language models (LLMs) have demonstrated emerging capabilities
in social intelligence tasks, including implicature resolution and
theory-of-mind reasoning, both of which require substantial pragmatic
understanding. However, how LLMs acquire this pragmatic competence throughout
the training process remains poorly understood. In this work, we introduce
ALTPRAG, a dataset grounded in the pragmatic concept of alternatives, to
evaluate whether LLMs at different training stages can accurately infer nuanced
speaker intentions. Each instance pairs two equally plausible yet pragmatically
divergent continuations and requires the model to (i) infer the speaker's
intended meaning and (ii) explain when and why a speaker would choose one
utterance over its alternative, thus directly probing pragmatic competence
through contrastive reasoning. We systematically evaluate 22 LLMs across 3 key
training stages: after pre-training, supervised fine-tuning (SFT), and
preference optimization, to examine the development of pragmatic competence.
Our results show that even base models exhibit notable sensitivity to pragmatic
cues, which improves consistently with increases in model and data scale.
Additionally, SFT and RLHF contribute further gains, particularly in
cognitive-pragmatic scenarios. These findings highlight pragmatic competence as
an emergent and compositional property of LLM training and offer new insights
for aligning models with human communicative norms.
♻ ☆ RAVine: Reality-Aligned Evaluation for Agentic Search
Agentic search, as a more autonomous and adaptive paradigm of retrieval
augmentation, is driving the evolution of intelligent search systems. However,
existing evaluation frameworks fail to align well with the goals of agentic
search. First, the complex queries commonly used in current benchmarks often
deviate from realistic user search scenarios. Second, prior approaches tend to
introduce noise when extracting ground truth for end-to-end evaluations,
leading to distorted assessments at a fine-grained level. Third, most current
frameworks focus solely on the quality of final answers, neglecting the
evaluation of the iterative process inherent to agentic search. To address
these limitations, we propose RAVine -- a Reality-Aligned eValuation framework
for agentic LLMs with search. RAVine targets multi-point queries and long-form
answers that better reflect user intents, and introduces an attributable ground
truth construction strategy to enhance the accuracy of fine-grained evaluation.
Moreover, RAVine examines the model's interaction with search tools throughout
the iterative process and accounts for efficiency. We benchmark a
series of models using RAVine and derive several insights, which we hope will
contribute to advancing the development of agentic search systems. The code and
datasets are available at https://github.com/SwordFaith/RAVine.
♻ ☆ Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models ACL 2025
Large language models (LLMs) have shown strong performance across natural
language reasoning tasks, yet their reasoning processes remain brittle and
difficult to interpret. Prompting techniques like Chain-of-Thought (CoT)
enhance reliability by eliciting intermediate reasoning steps or aggregating
multiple outputs. However, they lack mechanisms for enforcing logical structure
and assessing internal coherence. We introduce Theorem-of-Thought (ToTh), a
novel framework that models reasoning as collaboration among three parallel
agents, each simulating a distinct mode of inference: abductive, deductive, and
inductive. Each agent produces a reasoning trace, which is structured into a
formal reasoning graph. To evaluate consistency, we apply Bayesian belief
propagation guided by natural language inference (NLI), assigning confidence
scores to each step. The most coherent graph is selected to derive the final
answer. Experiments on symbolic (WebOfLies) and numerical (MultiArith)
reasoning benchmarks show that ToTh consistently outperforms CoT,
Self-Consistency, and CoT-Decoding across multiple LLMs, while producing
interpretable and logically grounded reasoning chains. Our findings suggest a
promising direction for building more robust and cognitively inspired LLM
reasoning. The implementation is available at
https://github.com/KurbanIntelligenceLab/theorem-of-thought.
comment: ACL 2025 KnowFM
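The selection step described in the abstract above (score each agent's reasoning graph, then keep the most coherent one) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each graph is a simple chain, that an NLI model has already produced an entailment probability for each consecutive pair of steps, and that coherence is the mean step belief. The names `Trace`, `propagate`, `select_most_coherent`, and the `prior` and `leak` parameters are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trace:
    """One agent's reasoning chain (hypothetical structure, not from the paper)."""
    mode: str            # "abductive", "deductive", or "inductive"
    steps: List[str]     # reasoning steps in order
    entail: List[float]  # entail[i]: assumed NLI probability that step i+1 follows from step i
    answer: str


def propagate(trace: Trace, prior: float = 0.9, leak: float = 0.1) -> List[float]:
    """Forward belief update along a chain.

    belief[0] = prior
    belief[i] = belief[i-1] * entail[i-1] + (1 - belief[i-1]) * leak
    `leak` is the assumed chance that a step holds even if its parent does not.
    """
    beliefs = [prior]
    for p in trace.entail:
        prev = beliefs[-1]
        beliefs.append(prev * p + (1.0 - prev) * leak)
    return beliefs


def coherence(trace: Trace) -> float:
    """Assumed coherence score: the mean belief over all steps."""
    beliefs = propagate(trace)
    return sum(beliefs) / len(beliefs)


def select_most_coherent(traces: List[Trace]) -> Trace:
    """Return the trace with the highest assumed coherence score."""
    return max(traces, key=coherence)


if __name__ == "__main__":
    # Illustrative values only; real scores would come from an NLI model.
    traces = [
        Trace("deductive", ["s1", "s2", "s3"], [0.9, 0.9], "A"),
        Trace("inductive", ["s1", "s2", "s3"], [0.5, 0.5], "B"),
    ]
    best = select_most_coherent(traces)
    print(best.mode, best.answer)
```

A chain is the simplest graph shape; a fuller treatment would propagate beliefs over steps with multiple parents.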
♻ ☆ WildSpeech-Bench: Benchmarking Audio LLMs in Natural Speech Conversation
Recent multi-modal Large Language Models (LLMs) such as GPT-4o have
demonstrated strong capabilities in direct speech interaction. However, the
lack of specialized and comprehensive benchmarks for end-to-end speech LLM
evaluation hinders optimizing the user experience of Audio LLMs in real-world
applications. Existing evaluation methods often adapt text-based benchmarks,
overlooking speech's unique characteristics and challenges, including prosody,
homophones, stuttering, and differing user expectations. Here, we present a
novel approach to thoroughly evaluate LLMs in practical speech conversations.
We systematically curate real-world chat data relevant to spoken scenarios,
introduce diversity in speaker attributes and acoustic conditions, and augment
the dataset with speech-specific phenomena. We further design a query-aware
evaluation method that uses customized evaluation checklists and prompts to
enhance the accuracy of automatic evaluation. We conduct comprehensive testing
and detailed analysis of various mainstream speech models, revealing
significant differences in model performance across different speech scenarios.
The use of query-aware evaluation further enables a finer-grained assessment
under various speech-specific scenarios. Our benchmark can provide valuable
insights for speech model development and evaluation.
♻ ☆ Robust and Fine-Grained Detection of AI Generated Texts
Ram Mohan Rao Kadiyala, Siddartha Pullakhandam, Kanwal Mehreen, Drishti Sharma, Siddhant Gupta, Jebish Purbey, Ashay Srivastava, Subhasya TippaReddy, Arvind Reddy Bobbili, Suraj Telugara Chandrashekhar, Modabbir Adeeb, Srinadh Vura, Suman Debnath, Hamza Farooq
An ideal detection system for machine-generated content should work well on
any generator, as more advanced LLMs appear every day. Existing systems often
struggle to accurately identify AI-generated content in shorter texts. Further,
not all texts are entirely authored by a human or an LLM, so we focus on
partial cases, i.e., human-LLM co-authored texts. Our paper introduces a set of
models built for the task of token classification and trained on an extensive
collection of human-machine co-authored texts; they perform well on texts from
unseen domains, texts from unseen generators, texts by non-native speakers, and
texts with adversarial inputs. We also introduce a new dataset of over 2.4M
such texts, mostly co-authored by several popular proprietary LLMs, covering 23
languages. We also present our models' performance on the texts of each domain
and generator. Additional findings include comparisons of performance across
adversarial methods and input text lengths, and the characteristics of
generated texts compared to the original human-authored texts.
comment: 18 pages, 6 figures
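Framing detection as token classification means the output is a label per token rather than one label per document. The sketch below shows only the post-processing side of that idea, under stated assumptions: labels are already predicted, label 1 means machine-written and 0 means human-written, and tokens are whitespace-separated words. The function name `ai_spans` and this label scheme are hypothetical, not taken from the paper.

```python
from typing import List, Tuple


def ai_spans(tokens: List[str], labels: List[int]) -> List[Tuple[int, int, str]]:
    """Group consecutive tokens labeled 1 (assumed: machine-written) into spans.

    Returns a list of (start, end_exclusive, text) tuples.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")

    spans: List[Tuple[int, int, str]] = []
    start = None  # index where the current machine-written run began
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i
        elif label != 1 and start is not None:
            spans.append((start, i, " ".join(tokens[start:i])))
            start = None
    if start is not None:  # close a run that reaches the end of the text
        spans.append((start, len(tokens), " ".join(tokens[start:])))
    return spans


if __name__ == "__main__":
    # Illustrative labels only; a real system would predict them with a model.
    tokens = "The cat sat on the mat today".split()
    labels = [0, 0, 1, 1, 1, 0, 1]
    print(ai_spans(tokens, labels))
```

Span-level output of this kind is what makes partially machine-written (co-authored) texts representable, which a single document-level label cannot do.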
♻ ★ VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning
Ruifeng Yuan, Chenghao Xiao, Sicong Leng, Jianyu Wang, Long Li, Weiwen Xu, Hou Pong Chan, Deli Zhao, Tingyang Xu, Zhongyu Wei, Hao Zhang, Yu Rong
Reinforcement learning has proven its effectiveness in enhancing the
reasoning capabilities of large language models. Recent research efforts have
progressively extended this paradigm to multimodal reasoning tasks. Due to the
inherent complexity and diversity of multimodal tasks, especially in semantic
content and problem formulations, existing models often exhibit unstable
performance across various domains and difficulty levels. To address these
limitations, we propose VL-Cogito, an advanced multimodal reasoning model
trained via a novel multi-stage Progressive Curriculum Reinforcement Learning
(PCuRL) framework. PCuRL systematically guides the model through tasks of
gradually increasing difficulty, substantially improving its reasoning
abilities across diverse multimodal contexts. The framework introduces two key
innovations: (1) an online difficulty soft weighting mechanism, dynamically
adjusting training difficulty across successive RL training stages; and (2) a
dynamic length reward mechanism, which encourages the model to adaptively
regulate its reasoning path length according to task complexity, thus balancing
reasoning efficiency with correctness. Experimental evaluations demonstrate
that VL-Cogito consistently matches or surpasses existing reasoning-oriented
models across mainstream multimodal benchmarks spanning mathematics, science,
logic, and general understanding, validating the effectiveness of our approach.
comment: 21 pages, 5 figures, 6 tables. Work in progress
♻ ☆ KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities
Fine-tuning is an immensely resource-intensive process when retraining Large
Language Models (LLMs) to incorporate a larger body of knowledge. Although many
fine-tuning techniques have been developed to reduce the time and computational
cost involved, the challenge persists as LLMs continue to grow in size and
complexity. To address this, a new approach to knowledge expansion in LLMs is
needed. Retrieval-Augmented Generation (RAG) offers one such alternative by
storing external knowledge in a database and retrieving relevant chunks to
support question answering. However, naive implementations of RAG face
significant limitations in scalability and answer accuracy. This paper
introduces KeyKnowledgeRAG (K2RAG), a novel framework designed to overcome
these limitations. Inspired by the divide-and-conquer paradigm, K2RAG
integrates dense and sparse vector search, knowledge graphs, and text
summarization to improve retrieval quality and system efficiency. The framework
also includes a preprocessing step that summarizes the training data,
significantly reducing the training time. K2RAG was evaluated using the
MultiHopRAG dataset, where the proposed pipeline was trained on the document
corpus and tested on a separate evaluation set. Results demonstrated notable
improvements over common naive RAG implementations. K2RAG achieved the highest
mean answer similarity score of 0.57, and reached the highest third quartile
(Q3) similarity of 0.82, indicating better alignment with ground-truth answers.
In addition to improved accuracy, the framework proved highly efficient. The
summarization step reduced the average training time of individual components
by 93%, and execution speed was up to 40% faster than traditional knowledge
graph-based RAG systems. K2RAG also demonstrated superior scalability,
requiring three times less VRAM than several naive RAG implementations tested
in this study.
comment: 21 pages, 14 figures
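The abstract above mentions combining dense and sparse vector search but does not say how the two result lists are merged. One common option is reciprocal rank fusion (RRF), shown here purely as an illustrative stand-in; it is not confirmed to be what K2RAG uses. The function name, the document IDs, and the constant `k = 60` are assumptions for the sketch.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document ID for determinism.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    # Hypothetical retrieval results from a dense and a sparse retriever.
    dense_hits = ["a", "b", "c"]
    sparse_hits = ["b", "d", "a"]
    print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

RRF uses only ranks, so it avoids having to normalize dense similarity scores against sparse scores that live on different scales.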
♻ ☆ Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance
Ram Mohan Rao Kadiyala, Siddartha Pullakhandam, Siddhant Gupta, Drishti Sharma, Jebish Purbey, Kanwal Mehreen, Muhammad Arham, Suman Debnath, Hamza Farooq
Large Language Models (LLMs) have shown remarkable capabilities, but their
development has primarily focused on English and other high-resource languages,
leaving many languages underserved. We present our latest Hindi-English
bilingual LLM Mantra-14B with ~3% average improvement in benchmark
scores over both languages, outperforming models twice its size. Using a
curated dataset composed of English and Hindi instruction data of 485K samples,
we instruction-tuned models such as Qwen-2.5-14B-Instruct and Phi-4 to improve
performance on both English and Hindi. Our experiments encompassing seven
different LLMs of varying parameter sizes and over 140 training attempts with
varying English-Hindi training data ratios demonstrated that it is possible to
significantly improve multilingual performance without compromising native
performance. Further, our approach avoids resource-intensive techniques like
vocabulary expansion or architectural modifications, thus keeping the model
size small. Our results indicate that modest fine-tuning with culturally and
locally informed data can bridge performance gaps without incurring significant
computational overhead. We release our training code, datasets, and models
under MIT and Apache licenses to aid further research on under-represented
and low-resource languages.
comment: 24 pages, 18 figures