Computation and Language
☆ Parallel-R1: Towards Parallel Thinking via Reinforcement Learning
Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Xinyu Yang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, Dong Yu
Parallel thinking has emerged as a novel approach for enhancing the reasoning
capabilities of large language models (LLMs) by exploring multiple reasoning
paths concurrently. However, activating such capabilities through training
remains challenging, as existing methods predominantly rely on supervised
fine-tuning (SFT) over synthetic data, which encourages teacher-forced
imitation rather than exploration and generalization. Different from them, we
propose \textbf{Parallel-R1}, the first reinforcement learning (RL) framework
that enables parallel thinking behaviors for complex real-world reasoning
tasks. Our framework employs a progressive curriculum that explicitly addresses
the cold-start problem in training parallel thinking with RL. We first use SFT
on prompt-generated trajectories from easier tasks to instill the parallel
thinking ability, then transition to RL to explore and generalize this skill on
harder problems. Experiments on various math benchmarks, including MATH, AMC23,
and AIME, show that Parallel-R1 successfully instills parallel thinking,
leading to 8.4% accuracy improvements over the sequential thinking model
trained directly on challenging tasks with RL. Further analysis reveals a clear
shift in the model's thinking behavior: at an early stage, it uses parallel
thinking as an exploration strategy, while in a later stage, it uses the same
capability for multi-perspective verification. Most significantly, we validate
parallel thinking as a \textbf{mid-training exploration scaffold}, where this
temporary exploratory phase unlocks a higher performance ceiling after RL,
yielding a 42.9% improvement over the baseline on AIME25. Our model, data, and
code will be open-source at https://github.com/zhengkid/Parallel-R1.
comment: Project website: https://zhengkid.github.io/Parallel_R1.github.io/
★ Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
Recent advances in large multimodal models have leveraged image-based tools
with reinforcement learning to tackle visual problems. However, existing
open-source approaches often exhibit monotonous reasoning patterns and allow
only a limited number of interaction turns, making them inadequate for
difficult tasks that require trial-and-error exploration. In this work, we
address this limitation by scaling up tool-based interactions and introduce
Mini-o3, a system that executes deep, multi-turn reasoning -- spanning tens of
steps -- and achieves state-of-the-art performance on challenging visual search
tasks. Our recipe for reproducing OpenAI o3-style behaviors comprises three key
components. First, we construct the Visual Probe Dataset, a collection of
thousands of challenging visual search problems designed for exploratory
reasoning. Second, we develop an iterative data collection pipeline to obtain
cold-start trajectories that exhibit diverse reasoning patterns, including
depth-first search, trial-and-error, and goal maintenance. Third, we propose an
over-turn masking strategy that prevents penalization of over-turn responses
(those that hit the maximum number of turns) during reinforcement learning,
thereby balancing training-time efficiency with test-time scalability. Despite
training with an upper bound of only six interaction turns, our model generates
trajectories that naturally scale to tens of turns at inference time, with
accuracy improving as the number of turns increases. Extensive experiments
demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking
paths, effectively solving challenging visual search problems.
comment: Code, datasets, models are available at
https://github.com/Mini-o3/Mini-o3. Project Page: https://mini-o3.github.io/
☆ SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge
We introduce SimpleQA Verified, a 1,000-prompt benchmark for evaluating Large
Language Model (LLM) short-form factuality based on OpenAI's SimpleQA. It
addresses critical limitations in OpenAI's benchmark, including noisy and
incorrect labels, topical biases, and question redundancy. SimpleQA Verified
was created through a rigorous multi-stage filtering process involving
de-duplication, topic balancing, and source reconciliation to produce a more
reliable and challenging evaluation set, alongside improvements in the
autorater prompt. On this new benchmark, Gemini 2.5 Pro achieves a
state-of-the-art F1-score of 55.6, outperforming other frontier models,
including GPT-5. This work provides the research community with a
higher-fidelity tool to track genuine progress in parametric model factuality
and to mitigate hallucinations. The benchmark dataset, evaluation code, and
leaderboard are available at:
https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified.
☆ Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images
Visual reasoning over structured data such as tables is a critical capability
for modern vision-language models (VLMs), yet current benchmarks remain limited
in scale, diversity, or reasoning depth, especially when it comes to rendered
table images. Addressing this gap, we introduce Visual-TableQA, a large-scale,
open-domain multimodal dataset specifically designed to evaluate and enhance
visual reasoning over complex tabular data. Our generation pipeline is modular,
scalable, and fully autonomous, involving multiple reasoning LLMs collaborating
across distinct roles: generation, validation, and inspiration. Visual-TableQA
comprises 2.5k richly structured LaTeX-rendered tables and 6k
reasoning-intensive QA pairs, all produced at a cost of under USD 100. To
promote diversity and creativity, our pipeline performs multi-model
collaborative data generation via cross-model prompting ('inspiration') and
LLM-jury filtering. Stronger models seed layouts and topics that weaker models
elaborate, collectively distilling diverse reasoning patterns and visual
structures into the dataset. Empirical results show that models fine-tuned on
Visual-TableQA generalize robustly to external benchmarks, outperforming
several proprietary models despite the dataset's synthetic nature. The full
pipeline and resources are publicly available at
https://github.com/AI-4-Everyone/Visual-TableQA.
comment: Work in Progress
☆ GENUINE: Graph Enhanced Multi-level Uncertainty Estimation for Large Language Models EMNLP 2025
Uncertainty estimation is essential for enhancing the reliability of Large
Language Models (LLMs), particularly in high-stakes applications. Existing
methods often overlook semantic dependencies, relying on token-level
probability measures that fail to capture structural relationships within the
generated text. We propose GENUINE: Graph ENhanced mUlti-level uncertaINty
Estimation for Large Language Models, a structure-aware framework that
leverages dependency parse trees and hierarchical graph pooling to refine
uncertainty quantification. By incorporating supervised learning, GENUINE
effectively models semantic and structural relationships, improving confidence
assessments. Extensive experiments across NLP tasks show that GENUINE achieves
up to 29% higher AUROC than semantic entropy-based approaches and reduces
calibration errors by over 15%, demonstrating the effectiveness of graph-based
uncertainty modeling. The code is available at
https://github.com/ODYSSEYWT/GUQ.
comment: Accepted by EMNLP 2025
☆ Uncovering Scaling Laws for Large Language Models via Inverse Problems EMNLP
Arun Verma, Zhaoxuan Wu, Zijian Zhou, Xiaoqiang Lin, Zhiliang Chen, Rachael Hwee Ling Sim, Rui Qiao, Jingtan Wang, Nhung Bui, Xinyuan Niu, Wenyang Hu, Gregory Kang Ruey Lau, Zi-Yu Khoo, Zitong Zhao, Xinyi Xu, Apivich Hemachandra, See-Kiong Ng, Bryan Kian Hsiang Low
Large Language Models (LLMs) are large-scale pretrained models that have
achieved remarkable success across diverse domains. These successes have been
driven by unprecedented complexity and scale in both data and computations.
However, due to the high costs of training such models, brute-force
trial-and-error approaches to improve LLMs are not feasible. Inspired by the
success of inverse problems in uncovering fundamental scientific laws, this
position paper advocates that inverse problems can also efficiently uncover
scaling laws that guide the building of LLMs to achieve the desirable
performance with significantly better cost-effectiveness.
comment: Accepted at EMNLP Findings 2025
☆ Biased Tales: Cultural and Topic Bias in Generating Children's Stories
Stories play a pivotal role in human communication, shaping beliefs and
morals, particularly in children. As parents increasingly rely on large
language models (LLMs) to craft bedtime stories, the presence of cultural and
gender stereotypes in these narratives raises significant concerns. To address
this issue, we present Biased Tales, a comprehensive dataset designed to
analyze how biases influence protagonists' attributes and story elements in
LLM-generated stories. Our analysis uncovers striking disparities. When the
protagonist is described as a girl (as compared to a boy), appearance-related
attributes increase by 55.26%. Stories featuring non-Western children
disproportionately emphasize cultural heritage, tradition, and family themes
far more than those for Western children. Our findings highlight the role of
sociocultural bias in making creative AI use more equitable and diverse.
☆ From Detection to Mitigation: Addressing Gender Bias in Chinese Texts via Efficient Tuning and Voting-Based Rebalancing NLPCC 2025
This paper presents our team's solution to Shared Task 7 of NLPCC-2025, which
focuses on sentence-level gender bias detection and mitigation in Chinese. The
task aims to promote fairness and controllability in natural language
generation by automatically detecting, classifying, and mitigating gender bias.
To address this challenge, we adopt a fine-tuning approach based on large
language models (LLMs), efficiently adapt to the bias detection task via
Low-Rank Adaptation (LoRA). In terms of data processing, we construct a more
balanced training set to alleviate class imbalance and introduce heterogeneous
samples from multiple sources to enhance model generalization. For the
detection and classification sub-tasks, we employ a majority voting strategy
that integrates outputs from multiple expert models to boost performance.
Additionally, to improve bias generation detection and mitigation, we design a
multi-temperature sampling mechanism to capture potential variations in bias
expression styles. Experimental results demonstrate the effectiveness of our
approach in bias detection, classification, and mitigation. Our method
ultimately achieves an average score of 47.90%, ranking fourth in the shared
task.
comment: NLPCC 2025
☆ Are Humans as Brittle as Large Language Models?
The output of large language models (LLM) is unstable, due to both
non-determinism of the decoding process as well as to prompt brittleness. While
the intrinsic non-determinism of LLM generation may mimic existing uncertainty
in human annotations through distributional shifts in outputs, it is largely
assumed, yet unexplored, that the prompt brittleness effect is unique to LLMs.
This raises the question: do human annotators show similar sensitivity to
instruction changes? If so, should prompt brittleness in LLMs be considered
problematic? One may alternatively hypothesize that prompt brittleness
correctly reflects human annotation variances. To fill this research gap, we
systematically compare the effects of prompt modifications on LLMs and
identical instruction modifications for human annotators, focusing on the
question of whether humans are similarly sensitive to prompt perturbations. To
study this, we prompt both humans and LLMs for a set of text classification
tasks conditioned on prompt variations. Our findings indicate that both humans
and LLMs exhibit increased brittleness in response to specific types of prompt
modifications, particularly those involving the substitution of alternative
label sets or label formats. However, the distribution of human judgments is
less affected by typographical errors and reversed label order than that of
LLMs.
☆ Small Open Models Achieve Near Parity with Large Models in Low Resource Literary Translation at a Fraction of the Cost
Literary translation has recently gained attention as a distinct and complex
task in machine translation research. However, the translation by small open
models remains an open problem. We contribute to this ongoing research by
introducing TINYFABULIST TRANSLATION FRAMEWORK (TF2), a unified framework for
dataset creation, fine tuning, and evaluation in English-Romanian literary
translations, centred on the creation and open release of both a compact, fine
tuned language model (TF2-12B) and large scale synthetic parallel datasets
(DS-TF2-EN-RO-3M and DS-TF2-EN-RO-15K). Building on DS-TF1-EN-3M (TF1), the
largest collection of synthetic English fables to date, we address the need for
rich, high quality literary datasets in low resource languages such as
Romanian. Our pipeline first generates 15k high quality Romanian references
from the TF1 pool using a high performing LLM. We then apply a two stage fine
tuning process to a 12B parameter open weight model: (i) instruction tuning to
capture genre specific narrative style, and (ii) adapter compression for
efficient deployment. Evaluation combines corpus level BLEU and a five
dimension LLM based rubric (accuracy, fluency, coherence, style, cultural
adaptation) to provide a nuanced assessment of translation quality. Results
show that our fine tuned model achieves fluency and adequacy competitive with
top performing large proprietary models, while being open, accessible, and
significantly more cost effective. Alongside the fine tuned model and both
datasets, we publicly release all scripts and evaluation prompts. TF2 thus
provides an end-to-end, reproducible pipeline for research on cost efficient
translation, cross lingual narrative generation, and the broad adoption of open
models for culturally significant literary content in low resource settings.
comment: 25 pages, 8 figures, includes datasets and models released on Hugging
Face
☆ Dual Knowledge-Enhanced Two-Stage Reasoner for Multimodal Dialog Systems
Textual response generation is pivotal for multimodal \mbox{task-oriented}
dialog systems, which aims to generate proper textual responses based on the
multimodal context. While existing efforts have demonstrated remarkable
progress, there still exist the following limitations: 1) \textit{neglect of
unstructured review knowledge} and 2) \textit{underutilization of large
language models (LLMs)}. Inspired by this, we aim to fully utilize dual
knowledge (\textit{i.e., } structured attribute and unstructured review
knowledge) with LLMs to promote textual response generation in multimodal
task-oriented dialog systems. However, this task is non-trivial due to two key
challenges: 1) \textit{dynamic knowledge type selection} and 2)
\textit{intention-response decoupling}. To address these challenges, we propose
a novel dual knowledge-enhanced two-stage reasoner by adapting LLMs for
multimodal dialog systems (named DK2R). To be specific, DK2R first extracts
both structured attribute and unstructured review knowledge from external
knowledge base given the dialog context. Thereafter, DK2R uses an LLM to
evaluate each knowledge type's utility by analyzing LLM-generated provisional
probe responses. Moreover, DK2R separately summarizes the intention-oriented
key clues via dedicated reasoning, which are further used as auxiliary signals
to enhance LLM-based textual response generation. Extensive experiments
conducted on a public dataset verify the superiority of DK2R. We have released
the codes and parameters.
☆ SciNLP: A Domain-Specific Benchmark for Full-Text Scientific Entity and Relation Extraction in NLP EMNLP 2025
Structured information extraction from scientific literature is crucial for
capturing core concepts and emerging trends in specialized fields. While
existing datasets aid model development, most focus on specific publication
sections due to domain complexity and the high cost of annotating scientific
texts. To address this limitation, we introduce SciNLP - a specialized
benchmark for full-text entity and relation extraction in the Natural Language
Processing (NLP) domain. The dataset comprises 60 manually annotated full-text
NLP publications, covering 7,072 entities and 1,826 relations. Compared to
existing research, SciNLP is the first dataset providing full-text annotations
of entities and their relationships in the NLP domain. To validate the
effectiveness of SciNLP, we conducted comparative experiments with similar
datasets and evaluated the performance of state-of-the-art supervised models on
this dataset. Results reveal varying extraction capabilities of existing models
across academic texts of different lengths. Cross-comparisons with existing
datasets show that SciNLP achieves significant performance improvements on
certain baseline models. Using models trained on SciNLP, we implemented
automatic construction of a fine-grained knowledge graph for the NLP domain.
Our KG has an average node degree of 3.2 per entity, indicating rich semantic
topological information that enhances downstream applications. The dataset is
publicly available at https://github.com/AKADDC/SciNLP.
comment: EMNLP 2025 Main
☆ Are LLMs Enough for Hyperpartisan, Fake, Polarized and Harmful Content Detection? Evaluating In-Context Learning vs. Fine-Tuning
Michele Joshua Maggini, Dhia Merzougui, Rabiraj Bandyopadhyay, Gaël Dias, Fabrice Maurel, Pablo Gamallo
The spread of fake news, polarizing, politically biased, and harmful content
on online platforms has been a serious concern. With large language models
becoming a promising approach, however, no study has properly benchmarked their
performance across different models, usage methods, and languages. This study
presents a comprehensive overview of different Large Language Models adaptation
paradigms for the detection of hyperpartisan and fake news, harmful tweets, and
political bias. Our experiments spanned 10 datasets and 5 different languages
(English, Spanish, Portuguese, Arabic and Bulgarian), covering both binary and
multiclass classification scenarios. We tested different strategies ranging
from parameter efficient Fine-Tuning of language models to a variety of
different In-Context Learning strategies and prompts. These included zero-shot
prompts, codebooks, few-shot (with both randomly-selected and
diversely-selected examples using Determinantal Point Processes), and
Chain-of-Thought. We discovered that In-Context Learning often underperforms
when compared to Fine-Tuning a model. This main finding highlights the
importance of Fine-Tuning even smaller models on task-specific settings even
when compared to the largest models evaluated in an In-Context Learning setup -
in our case LlaMA3.1-8b-Instruct, Mistral-Nemo-Instruct-2407 and
Qwen2.5-7B-Instruct.
☆ Factuality Beyond Coherence: Evaluating LLM Watermarking Methods for Medical Texts EMNLP 2025
As large language models (LLMs) adapted to sensitive domains such as
medicine, their fluency raises safety risks, particularly regarding provenance
and accountability. Watermarking embeds detectable patterns to mitigate these
risks, yet its reliability in medical contexts remains untested. Existing
benchmarks focus on detection-quality tradeoffs, overlooking factual risks
under low-entropy settings often exploited by watermarking's reweighting
strategy. We propose a medical-focused evaluation workflow that jointly
assesses factual accuracy and coherence. Using GPT-Judger and further human
validation, we introduce the Factuality-Weighted Score (FWS), a composite
metric prioritizing factual accuracy beyond coherence to guide watermarking
deployment in medical domains. Our evaluation shows current watermarking
methods substantially compromise medical factuality, with entropy shifts
degrading medical entity representation. These findings underscore the need for
domain-aware watermarking approaches that preserve the integrity of medical
content.
comment: Accepted at EMNLP 2025 Findings
☆ M-BRe: Discovering Training Samples for Relation Extraction from Unlabeled Texts with Large Language Models EMNLP2025
For Relation Extraction (RE), the manual annotation of training data may be
prohibitively expensive, since the sentences that contain the target relations
in texts can be very scarce and difficult to find. It is therefore beneficial
to develop an efficient method that can automatically extract training
instances from unlabeled texts for training RE models. Recently, large language
models (LLMs) have been adopted in various natural language processing tasks,
with RE also benefiting from their advances. However, when leveraging LLMs for
RE with predefined relation categories, two key challenges arise. First, in a
multi-class classification setting, LLMs often struggle to comprehensively
capture the semantics of every relation, leading to suboptimal results. Second,
although employing binary classification for each relation individually can
mitigate this issue, it introduces significant computational overhead,
resulting in impractical time complexity for real-world applications.
Therefore, this paper proposes a framework called M-BRe to extract training
instances from unlabeled texts for RE. It utilizes three modules to combine the
advantages of both of the above classification approaches: Relation Grouping,
Relation Extraction, and Label Decision. Extensive experiments confirm its
superior capability in discovering high-quality training samples from unlabeled
texts for RE.
comment: Accepted by EMNLP2025 Main Conference
☆ MaLei at MultiClinSUM: Summarisation of Clinical Documents using Perspective-Aware Iterative Self-Prompting with LLMs
Efficient communication between patients and clinicians plays an important
role in shared decision-making. However, clinical reports are often lengthy and
filled with clinical jargon, making it difficult for domain experts to identify
important aspects in the document efficiently. This paper presents the
methodology we applied in the MultiClinSUM shared task for summarising clinical
case documents. We used an Iterative Self-Prompting technique on large language
models (LLMs) by asking LLMs to generate task-specific prompts and refine them
via example-based few-shot learning. Furthermore, we used lexical and embedding
space metrics, ROUGE and BERT-score, to guide the model fine-tuning with
epochs. Our submission using perspective-aware ISP on GPT-4 and GPT-4o achieved
ROUGE scores (46.53, 24.68, 30.77) and BERTscores (87.84, 83.25, 85.46) for (P,
R, F1) from the official evaluation on 3,396 clinical case reports from various
specialties extracted from open journals. The high BERTscore indicates that the
model produced semantically equivalent output summaries compared to the
references, even though the overlap at the exact lexicon level is lower, as
reflected in the lower ROUGE scores. This work sheds some light on how
perspective-aware ISP (PA-ISP) can be deployed for clinical report
summarisation and support better communication between patients and clinicians.
comment: system paper at CLEF 2025
☆ BALI: Enhancing Biomedical Language Representations through Knowledge Graph and Language Model Alignment SIGIR
In recent years, there has been substantial progress in using pretrained
Language Models (LMs) on a range of tasks aimed at improving the understanding
of biomedical texts. Nonetheless, existing biomedical LLMs show limited
comprehension of complex, domain-specific concept structures and the factual
information encoded in biomedical Knowledge Graphs (KGs). In this work, we
propose BALI (Biomedical Knowledge Graph and Language Model Alignment), a novel
joint LM and KG pre-training method that augments an LM with external knowledge
by the simultaneous learning of a dedicated KG encoder and aligning the
representations of both the LM and the graph. For a given textual sequence, we
link biomedical concept mentions to the Unified Medical Language System (UMLS)
KG and utilize local KG subgraphs as cross-modal positive samples for these
mentions. Our empirical findings indicate that implementing our method on
several leading biomedical LMs, such as PubMedBERT and BioLinkBERT, improves
their performance on a range of language understanding tasks and the quality of
entity representations, even with minimal pre-training on a small alignment
dataset sourced from PubMed scientific abstracts.
comment: 9 pages, 1 figure, published in "The 48th International ACM SIGIR
Conference on Research and Development in Information Retrieval (SIGIR 2025)"
☆ Avoiding Knowledge Edit Skipping in Multi-hop Question Answering with Guided Decomposition EMNLP
In a rapidly evolving world where information updates swiftly, knowledge in
large language models (LLMs) becomes outdated quickly. Retraining LLMs is not a
cost-effective option, making knowledge editing (KE) without modifying
parameters particularly necessary. We find that although existing
retrieval-augmented generation (RAG)-based KE methods excel at editing simple
knowledge, they struggle with KE in multi-hop question answering due to the
issue of "edit skipping", which refers to skipping the relevant edited fact in
inference. In addition to the diversity of natural language expressions of
knowledge, edit skipping also arises from the mismatch between the granularity
of LLMs in problem-solving and the facts in the edited memory. To address this
issue, we propose a novel Iterative Retrieval-Augmented Knowledge Editing
method with guided decomposition (IRAKE) through the guidance from single
edited facts and entire edited cases. Experimental results demonstrate that
IRAKE mitigates the failure of editing caused by edit skipping and outperforms
state-of-the-art methods for KE in multi-hop question answering.
comment: Accepted in EMNLP Findings 2025
☆ VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents
Zheng Wu, Heyuan Huang, Xingyu Lou, Xiangmou Qu, Pengzhou Cheng, Zongru Wu, Weiwen Liu, Weinan Zhang, Jun Wang, Zhaoxiang Wang, Zhuosheng Zhang
With the rapid progress of multimodal large language models, operating system
(OS) agents become increasingly capable of automating tasks through on-device
graphical user interfaces (GUIs). However, most existing OS agents are designed
for idealized settings, whereas real-world environments often present
untrustworthy conditions. To mitigate risks of over-execution in such
scenarios, we propose a query-driven human-agent-GUI interaction framework that
enables OS agents to decide when to query humans for more reliable task
completion. Built upon this framework, we introduce VeriOS-Agent, a trustworthy
OS agent trained with a two-stage learning paradigm that falicitate the
decoupling and utilization of meta-knowledge. Concretely, VeriOS-Agent
autonomously executes actions in normal conditions while proactively querying
humans in untrustworthy scenarios. Experiments show that VeriOS-Agent improves
the average step-wise success rate by 20.64\% in untrustworthy scenarios over
the state-of-the-art, without compromising normal performance. Analysis
highlights VeriOS-Agent's rationality, generalizability, and scalability. The
codes, datasets and models are available at
https://github.com/Wuzheng02/VeriOS.
☆ Competitive Audio-Language Models with Data-Efficient Single-Stage Training on Public Data
Large language models (LLMs) have transformed NLP, yet their integration with
audio remains underexplored -- despite audio's centrality to human
communication. We introduce Falcon3-Audio, a family of Audio-Language Models
(ALMs) built on instruction-tuned LLMs and Whisper encoders. Using a remarkably
small amount of public audio data -- less than 30K hours (5K unique) --
Falcon3-Audio-7B matches the best reported performance among open-weight models
on the MMAU benchmark, with a score of 64.14, matching R1-AQA, while
distinguishing itself through superior data and parameter efficiency,
single-stage training, and transparency. Notably, our smallest 1B model remains
competitive with larger open models ranging from 2B to 13B parameters. Through
extensive ablations, we find that common complexities -- such as curriculum
learning, multiple audio encoders, and intricate cross-attention connectors --
are not required for strong performance, even compared to models trained on
over 500K hours of data.
comment: Accepted at ASRU 2025
★ ALLabel: Three-stage Active Learning for LLM-based Entity Recognition using Demonstration Retrieval
Many contemporary data-driven research efforts in the natural sciences, such
as chemistry and materials science, require large-scale, high-performance
entity recognition from scientific datasets. Large language models (LLMs) have
increasingly been adopted to solve the entity recognition task, with the same
trend being observed on all-spectrum NLP tasks. The prevailing entity
recognition LLMs rely on fine-tuned technology, yet the fine-tuning process
often incurs significant cost. To achieve a best performance-cost trade-off, we
propose ALLabel, a three-stage framework designed to select the most
informative and representative samples in preparing the demonstrations for LLM
modeling. The annotated examples are used to construct a ground-truth retrieval
corpus for LLM in-context learning. By sequentially employing three distinct
active learning strategies, ALLabel consistently outperforms all baselines
under the same annotation budget across three specialized domain datasets.
Experimental results also demonstrate that selectively annotating only 5\%-10\%
of the dataset with ALLabel can achieve performance comparable to the method
annotating the entire dataset. Further analyses and ablation studies verify the
effectiveness and generalizability of our proposal.
☆ Astra: A Multi-Agent System for GPU Kernel Performance Optimization
Anjiang Wei, Tianran Sun, Yogesh Seenichamy, Hang Song, Anne Ouyang, Azalia Mirhoseini, Ke Wang, Alex Aiken
GPU kernel optimization has long been a central challenge at the intersection
of high-performance computing and machine learning. Efficient kernels are
crucial for accelerating large language model (LLM) training and serving, yet
attaining high performance typically requires extensive manual tuning.
Compiler-based systems reduce some of this burden, but still demand substantial
manual design and engineering effort. Recently, researchers have explored using
LLMs for GPU kernel generation, though prior work has largely focused on
translating high-level PyTorch modules into CUDA code. In this work, we
introduce Astra, the first LLM-based multi-agent system for GPU kernel
optimization. Unlike previous approaches, Astra starts from existing CUDA
implementations extracted from SGLang, a widely deployed framework for serving
LLMs, rather than treating PyTorch modules as the specification. Within Astra,
specialized LLM agents collaborate through iterative code generation, testing,
profiling, and planning to produce kernels that are both correct and
high-performance. On kernels from SGLang, Astra achieves an average speedup of
1.32x using zero-shot prompting with OpenAI o4-mini. A detailed case study
further demonstrates that LLMs can autonomously apply loop transformations,
optimize memory access patterns, exploit CUDA intrinsics, and leverage fast
math operations to yield substantial performance gains. Our work highlights
multi-agent LLM systems as a promising new paradigm for GPU kernel
optimization.
♻ ☆ Llama-Nemotron: Efficient Reasoning Models
Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, Ido Shahaf, Oren Tropp, Ehud Karpas, Ran Zilberstein, Jiaqi Zeng, Soumye Singhal, Alexander Bukharin, Yian Zhang, Tugrul Konuk, Gerald Shen, Ameya Sunil Mahabaleshwarkar, Bilal Kartal, Yoshi Suhara, Olivier Delalleau, Zijia Chen, Zhilin Wang, David Mosallanezhad, Adi Renduchintala, Haifeng Qian, Dima Rekesh, Fei Jia, Somshubra Majumdar, Vahid Noroozi, Wasi Uddin Ahmad, Sean Narenthiran, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Igor Gitman, Ivan Moshkov, Wei Du, Shubham Toshniwal, George Armstrong, Branislav Kisacanin, Matvei Novikov, Daria Gitman, Evelina Bakhturina, Prasoon Varshney, Makesh Narsimhan, Jane Polak Scowcroft, John Kamalu, Dan Su, Kezhi Kong, Markus Kliegl, Rabeeh Karimi Mahabadi, Ying Lin, Sanjeev Satheesh, Jupinder Parmar, Pritam Gundecha, Brandon Norick, Joseph Jennings, Shrimai Prabhumoye, Syeda Nahida Akter, Mostofa Patwary, Abhinav Khattar, Deepak Narayanan, Roger Waleffe, Jimmy Zhang, Bor-Yiing Su, Guyue Huang, Terry Kong, Parth Chadha, Sahil Jain, Christine Harvey, Elad Segal, Jining Huang, Sergey Kashirsky, Robert McQueen, Izzy Putterman, George Lam, Arun Venkatesan, Sherry Wu, Vinh Nguyen, Manoj Kilaru, Andrew Wang, Anna Warno, Abhilash Somasamudramath, Sandip Bhaskar, Maka Dong, Nave Assaf, Shahar Mor, Omer Ullman Argov, Scot Junkin, Oleksandr Romanenko, Pedro Larroy, Monika Katariya, Marco Rovinelli, Viji Balas, Nicholas Edelman, Anahita Bhiwandiwalla, Muthu Subramaniam, Smita Ithape, Karthik Ramamoorthy, Yuting Wu, Suguna Varshini Velury, Omri Almog, Joyjit Daw, Denys Fridman, Erick Galinkin, Michael Evans, Shaona Ghosh, Katherine Luna, Leon Derczynski, Nikki Pope, Eileen Long, Seth Schneider, Guillermo Siman, Tomasz Grzegorzek, Pablo Ribalta, Monika Katariya, Chris Alexiuk, Joey Conway, Trisha Saar, Ann Guan, Krzysztof Pawelec, Shyamala Prayaga, Oleksii Kuchaiev, Boris Ginsburg, Oluwatobi Olabiyi, Kari Briski, Jonathan Cohen, Bryan Catanzaro, Jonah Alben, Yonatan Geifman, Eric Chung
We introduce the Llama-Nemotron series of models, an open family of
heterogeneous reasoning models that deliver exceptional reasoning capabilities,
inference efficiency, and an open license for enterprise use. The family comes
in three sizes -- Nano (8B), Super (49B), and Ultra (253B) -- and performs
competitively with state-of-the-art reasoning models such as DeepSeek-R1 while
offering superior inference throughput and memory efficiency. In this report,
we discuss the training procedure for these models, which entails using neural
architecture search from Llama 3 models for accelerated inference, knowledge
distillation, and continued pretraining, followed by a reasoning-focused
post-training stage consisting of two main parts: supervised fine-tuning and
large scale reinforcement learning. Llama-Nemotron models are the first
open-source models to support a dynamic reasoning toggle, allowing users to
switch between standard chat and reasoning modes during inference. To further
support open research and facilitate model development, we provide the
following resources: 1. We release the Llama-Nemotron reasoning models --
LN-Nano, LN-Super, and LN-Ultra -- under the commercially permissive NVIDIA
Open Model License Agreement. 2. We release the complete post-training dataset:
Llama-Nemotron-Post-Training-Dataset. 3. We also release our training
codebases: NeMo, NeMo-Aligner, and Megatron-LM.
♻ ★ Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts ACL
Evaluating natural language generation systems is challenging due to the
diversity of valid outputs. While human evaluation is the gold standard, it
suffers from inconsistencies, lack of standardisation, and demographic biases,
limiting reproducibility. LLM-based evaluators offer a scalable alternative but
are highly sensitive to prompt design, where small variations can lead to
significant discrepancies. In this work, we propose an inversion learning
method that learns effective reverse mappings from model outputs back to their
input instructions, enabling the automatic generation of highly effective,
model-specific evaluation prompts. Our method requires only a single evaluation
sample and eliminates the need for time-consuming manual prompt engineering,
thereby improving both efficiency and robustness. Our work contributes toward a
new direction for more robust and efficient LLM-based evaluation.
comment: 12 pages, accepted by Transactions of the Association for
Computational Linguistics (TACL)
♻ ☆ A Systematic Literature Review of Retrieval-Augmented Generation: Techniques, Metrics, and Challenges
This systematic review of the research literature on retrieval-augmented
generation (RAG) provides a focused analysis of the most highly cited studies
published between 2020 and May 2025. A total of 128 articles met our inclusion
criteria. The records were retrieved from ACM Digital Library, IEEE Xplore,
Scopus, ScienceDirect, and the Digital Bibliography and Library Project (DBLP).
RAG couples a neural retriever with a generative language model, grounding
output in up-to-date, non-parametric memory while retaining the semantic
generalisation stored in model weights. Guided by the PRISMA 2020 framework, we
(i) specify explicit inclusion and exclusion criteria based on citation count
and research questions, (ii) catalogue datasets, architectures, and evaluation
practices, and (iii) synthesise empirical evidence on the effectiveness and
limitations of RAG. To mitigate citation-lag bias, we applied a lower
citation-count threshold to papers published in 2025 so that emerging
breakthroughs with naturally fewer citations were still captured. This review
clarifies the current research landscape, highlights methodological gaps, and
charts priority directions for future research.
comment: 58 page
♻ ☆ Bhav-Net: Knowledge Transfer for Cross-Lingual Antonym vs Synonym Distinction via Dual-Space Graph Transformers
Antonym vs synonym distinction across multiple languages presents unique
computational challenges due to the paradoxical nature of antonymous
relationships words that share semantic domains while expressing opposite
meanings. This work introduces Bhav-Net, a novel dual-space architecture that
enables effective knowledge transfer from complex multilingual models to
simpler, language-specific architectures while maintaining robust cross-lingual
antonym--synonym distinction capabilities. Our approach combines
language-specific BERT encoders with graph transformer networks, creating
distinct semantic projections where synonymous pairs cluster in one space while
antonymous pairs exhibit high similarity in a complementary space. Through
comprehensive evaluation across eight languages (English, German, French,
Spanish, Italian, Portuguese, Dutch, and Russian), we demonstrate that semantic
relationship modeling transfers effectively across languages. The dual-encoder
design achieves competitive performance against state-of-the-art baselines
while providing interpretable semantic representations and effective
cross-lingual generalization.
comment: Found some issues and need to correct them
♻ ☆ Cardiverse: Harnessing LLMs for Novel Card Game Prototyping EMNLP 2025
The prototyping of computer games, particularly card games, requires
extensive human effort in creative ideation and gameplay evaluation. Recent
advances in Large Language Models (LLMs) offer opportunities to automate and
streamline these processes. However, it remains challenging for LLMs to design
novel game mechanics beyond existing databases, generate consistent gameplay
environments, and develop scalable gameplay AI for large-scale evaluations.
This paper addresses these challenges by introducing a comprehensive automated
card game prototyping framework. The approach highlights a graph-based indexing
method for generating novel game variations, an LLM-driven system for
consistent game code generation validated by gameplay records, and a gameplay
AI constructing method that uses an ensemble of LLM-generated heuristic
functions optimized through self-play. These contributions aim to accelerate
card game prototyping, reduce human labor, and lower barriers to entry for game
developers. For code repo visit this http URL
https://github.com/danruili/Cardiverse
comment: 37 pages, 13 figures, 8 tables. Accepted by EMNLP 2025
♻ ☆ JoPA:Explaining Large Language Model's Generation via Joint Prompt Attribution ACL 2025
Large Language Models (LLMs) have demonstrated impressive performances in
complex text generation tasks. However, the contribution of the input prompt to
the generated content still remains obscure to humans, underscoring the
necessity of understanding the causality between input and output pairs.
Existing works for providing prompt-specific explanation often confine model
output to be classification or next-word prediction. Few initial attempts
aiming to explain the entire language generation often treat input prompt texts
independently, ignoring their combinatorial effects on the follow-up
generation. In this study, we introduce a counterfactual explanation framework
based on Joint Prompt Attribution, JoPA, which aims to explain how a few prompt
texts collaboratively influences the LLM's complete generation. Particularly,
we formulate the task of prompt attribution for generation interpretation as a
combinatorial optimization problem, and introduce a probabilistic algorithm to
search for the casual input combination in the discrete space. We define and
utilize multiple metrics to evaluate the produced explanations, demonstrating
both the faithfulness and efficiency of our framework.
comment: Accepted to ACL 2025 (Main)
♻ ☆ Robust Adaptation of Large Multimodal Models for Retrieval Augmented Hateful Meme Detection EMNLP 2025
Hateful memes have become a significant concern on the Internet,
necessitating robust automated detection systems. While Large Multimodal Models
(LMMs) have shown promise in hateful meme detection, they face notable
challenges like sub-optimal performance and limited out-of-domain
generalization capabilities. Recent studies further reveal the limitations of
both supervised fine-tuning (SFT) and in-context learning when applied to LMMs
in this setting. To address these issues, we propose a robust adaptation
framework for hateful meme detection that enhances in-domain accuracy and
cross-domain generalization while preserving the general vision-language
capabilities of LMMs. Analysis reveals that our approach achieves improved
robustness under adversarial attacks compared to SFT models. Experiments on six
meme classification datasets show that our approach achieves state-of-the-art
performance, outperforming larger agentic systems. Moreover, our method
generates higher-quality rationales for explaining hateful content compared to
standard SFT, enhancing model interpretability. Code available at
https://github.com/JingbiaoMei/RGCL
comment: EMNLP 2025 Main
♻ ☆ Hunyuan-MT Technical Report
In this report, we introduce Hunyuan-MT-7B, our first open-source
multilingual translation model, which supports bidirectional translation across
33 major languages and places a special emphasis on translation between
Mandarin and several ethnic minority languages as well as dialects.
Furthermore, to serve and address diverse translation scenarios and enhance
model performance at test time, we introduce Hunyuan-MT-Chimera-7B, a
translation model inspired by the slow thinking mode. This model integrates
multiple outputs generated by the Hunyuan-MT-7B model under varying parameter
settings, thereby achieving performance superior to that of conventional
slow-thinking models based on Chain-of-Thought (CoT). The development of our
models follows a holistic training process specifically engineered for
multilingual translation, which begins with general and MT-oriented
pre-training to build foundational capabilities, proceeds to Supervised
Fine-Tuning (SFT) for task-specific adaptation, and culminates in advanced
alignment through Reinforcement Learning (RL) and weak-to-strong RL. Through
comprehensive experimentation, we demonstrate that both Hunyuan-MT-7B and
Hunyuan-MT-Chimera-7B significantly outperform all translation-specific models
of comparable parameter size and most of the SOTA large models, particularly on
the task of translation between Mandarin and minority languages as well as
dialects. In the WMT2025 shared task (General Machine Translation), our models
demonstrate state-of-the-art performance, ranking first in 30 out of 31
language pairs. This result highlights the robustness of our models across a
diverse linguistic spectrum, encompassing high-resource languages such as
Chinese, English, and Japanese, as well as low-resource languages including
Czech, Marathi, Estonian, and Icelandic.
♻ ★ TreeReview: A Dynamic Tree of Questions Framework for Deep and Efficient LLM-based Scientific Peer Review EMNLP2025
Yuan Chang, Ziyue Li, Hengyuan Zhang, Yuanbo Kong, Yanru Wu, Hayden Kwok-Hay So, Zhijiang Guo, Liya Zhu, Ngai Wong
While Large Language Models (LLMs) have shown significant potential in
assisting peer review, current methods often struggle to generate thorough and
insightful reviews while maintaining efficiency. In this paper, we propose
TreeReview, a novel framework that models paper review as a hierarchical and
bidirectional question-answering process. TreeReview first constructs a tree of
review questions by recursively decomposing high-level questions into
fine-grained sub-questions and then resolves the question tree by iteratively
aggregating answers from leaf to root to get the final review. Crucially, we
incorporate a dynamic question expansion mechanism to enable deeper probing by
generating follow-up questions when needed. We construct a benchmark derived
from ICLR and NeurIPS venues to evaluate our method on full review generation
and actionable feedback comments generation tasks. Experimental results of both
LLM-based and human evaluation show that TreeReview outperforms strong
baselines in providing comprehensive, in-depth, and expert-aligned review
feedback, while reducing LLM token usage by up to 80% compared to
computationally intensive approaches. Our code and benchmark dataset are
available at https://github.com/YuanChang98/tree-review.
comment: Accepted to EMNLP2025 Main
♻ ☆ UPLex: Fine-Grained Personality Control in Large Language Models via Unsupervised Lexical Modulation EMNLP 2025
Tianlong Li, Wenhao Liu, Muling Wu, Shihan Dou, Zhenghua Wang, Changze Lv, Xiaohua Wang, Xiaoqing Zheng, Xuanjing Huang
Personality is a crucial factor that shapes human communication patterns,
thereby regulating the personalities of large language models (LLMs) holds
significant potential in enhancing their user experiences. Previous approaches
either relied on fine-tuning LLMs on specific corpora or required manually
crafted prompts to evoke specific personalities from LLMs. However, the former
is inefficient and costly, while the latter cannot precisely manipulate
personality traits at a fine-grained level. To address these challenges, we
propose UPLex, a method that uses an Unsupervisedly-Built Personalized Lexicon
(UPL) during the decoding phase to manipulate LLM's personality traits. UPL can
be constructed from a newly built situational judgment test dataset in an
unsupervised fashion, and used to modulate the personality expression of LLMs
by dynamically altering their predicted probability of upcoming words in a
pluggable fashion. Extensive experimentation demonstrates the remarkable
effectiveness and pluggability of our method for fine-grained manipulation of
LLMs' personalities.
comment: EMNLP 2025 Findings
♻ ☆ Dovetail: A CPU/GPU Heterogeneous Speculative Decoding for LLM inference
With the continuous advancement in the performance of large language models
(LLMs), their demand for computational resources and memory has significantly
increased, which poses major challenges for efficient inference on
consumer-grade devices and legacy servers. These devices typically feature
relatively weaker GPUs and stronger CPUs. Although techniques such as parameter
offloading and partial offloading can alleviate GPU memory pressure to some
extent, their effectiveness is limited due to communication latency and
suboptimal hardware resource utilization. To address this issue, we propose
Dovetail, a lossless inference acceleration method that leverages the
complementary characteristics of heterogeneous devices and the advantages of
speculative decoding. Dovetail deploys a draft model on the GPU to perform
preliminary predictions, while a target model running on the CPU validates
these outputs. By reducing the granularity of data transfer, Dovetail
significantly minimizes communication overhead. To further improve efficiency,
we optimize the draft model specifically for heterogeneous hardware
environments by reducing the number of draft tokens to lower parallel
verification latency, increasing model depth to enhance predictive
capabilities, and introducing a Dynamic Gating Fusion (DGF) mechanism to
improve the integration of feature and embedding information. We conduct
comprehensive evaluations of Dovetail across various consumer-grade GPUs,
covering multiple tasks and mainstream models. Experimental results on 13B
models demonstrate that Dovetail achieves inference speedups ranging from 1.79x
to 10.1x across different devices, while maintaining consistency and stability
in the distribution of generated texts.
comment: 14 pages, 6 figures
♻ ☆ MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering
Multi-entity question answering (MEQA) represents significant challenges for
large language models (LLM) and retrieval-augmented generation (RAG) systems,
which frequently struggle to consolidate scattered information across diverse
documents. While existing methods excel at single-document comprehension, they
often struggle with cross-document aggregation, particularly when resolving
entity-dense questions like "What is the distribution of ACM Fellows among
various fields of study?", which require integrating entity-centric insights
from heterogeneous sources (e.g., Wikipedia pages). To address this gap, we
introduce MEBench, a novel multi-document, multi-entity benchmark designed to
systematically evaluate LLMs' capacity to retrieve, consolidate, and reason
over fragmented information. Our benchmark comprises 4,780 questions which are
systematically categorized into three primary categories, further divided into
eight distinct types, ensuring broad coverage of real-world multi-entity
reasoning scenarios. Our experiments on state-of-the-art LLMs (e.g., GPT-4,
Llama-3) and RAG pipelines reveal critical limitations: even advanced models
achieve only 59% accuracy on MEBench. Our benchmark emphasizes the importance
of completeness and factual precision of information extraction in MEQA tasks,
using Entity-Attributed F1 (EA-F1) metric for granular evaluation of
entity-level correctness and attribution validity. MEBench not only highlights
systemic weaknesses in current LLM frameworks but also provides a foundation
for advancing robust, entity-aware QA architectures.
♻ ☆ Local Normalization Distortion and the Thermodynamic Formalism of Decoding Strategies for Large Language Models
Advances in hardware and language model architecture have spurred a
revolution in natural language generation. However, autoregressive models
compute probability distributions over next-token choices, and sampling from
these distributions, known as decoding, has received significantly less
attention than other design choices. Existing decoding strategies are largely
based on heuristics, resulting in methods that are difficult to apply or
improve in a principled manner. We develop the theory of decoding strategies
for language models by expressing popular decoding algorithms as equilibrium
states in the language of ergodic theory and stating the objective functions
they optimize. Using this, we analyze the effect of the local normalization
step required to make probabilities sum to one in top-k, nucleus, and
temperature sampling. We argue that local normalization distortion is a
fundamental defect of decoding strategies and quantify the size of this
distortion and its effect on mathematical proxies for the quality and diversity
of generated text. This yields conclusions for the design of decoding
algorithms and the detection of machine-generated text.
♻ ☆ AraHalluEval: A Fine-grained Hallucination Evaluation Framework for Arabic LLMs
Recently, extensive research on the hallucination of the large language
models (LLMs) has mainly focused on the English language. Despite the growing
number of multilingual and Arabic-specific LLMs, evaluating LLMs' hallucination
in the Arabic context remains relatively underexplored. The knowledge gap is
particularly pressing given Arabic's widespread use across many regions and its
importance in global communication and media. This paper presents the first
comprehensive hallucination evaluation of Arabic and multilingual LLMs on two
critical Arabic natural language generation tasks: generative question
answering (GQA) and summarization. This study evaluates a total of 12 LLMs,
including 4 Arabic pre-trained models, 4 multilingual models, and 4
reasoning-based models. To assess the factual consistency and faithfulness of
LLMs' outputs, we developed a fine-grained hallucination evaluation framework
consisting of 12 fine-grained hallucination indicators that represent the
varying characteristics of each task. The results reveal that factual
hallucinations are more prevalent than faithfulness errors across all models
and tasks. Notably, the Arabic pre-trained model Allam consistently
demonstrates lower hallucination rates than multilingual models and a
comparative performance with reasoning-based models. The code is available at:
https://github.com/aishaalansari57/AraHalluEval
♻ ☆ A Japanese Language Model and Three New Evaluation Benchmarks for Pharmaceutical NLP
We present a Japanese domain-specific language model for the pharmaceutical
field, developed through continual pretraining on 2 billion Japanese
pharmaceutical tokens and 8 billion English biomedical tokens. To enable
rigorous evaluation, we introduce three new benchmarks: YakugakuQA, based on
national pharmacist licensing exams; NayoseQA, which tests cross-lingual
synonym and terminology normalization; and SogoCheck, a novel task designed to
assess consistency reasoning between paired statements. We evaluate our model
against both open-source medical LLMs and commercial models, including GPT-4o.
Results show that our domain-specific model outperforms existing open models
and achieves competitive performance with commercial ones, particularly on
terminology-heavy and knowledge-based tasks. Interestingly, even GPT-4o
performs poorly on SogoCheck, suggesting that cross-sentence consistency
reasoning remains an open challenge. Our benchmark suite offers a broader
diagnostic lens for pharmaceutical NLP, covering factual recall, lexical
variation, and logical consistency. This work demonstrates the feasibility of
building practical, secure, and cost-effective language models for Japanese
domain-specific applications, and provides reusable evaluation resources for
future research in pharmaceutical and healthcare NLP. Our model, codes, and
datasets are released at https://github.com/EQUES-Inc/pharma-LLM-eval.
comment: 15 pages, 9 tables, 5 figures
♻ ★ FinRAGBench-V: A Benchmark for Multimodal RAG with Visual Citation in the Financial Domain
Retrieval-Augmented Generation (RAG) plays a vital role in the financial
domain, powering applications such as real-time market analysis, trend
forecasting, and interest rate computation. However, most existing RAG research
in finance focuses predominantly on textual data, overlooking the rich visual
content in financial documents, resulting in the loss of key analytical
insights. To bridge this gap, we present FinRAGBench-V, a comprehensive visual
RAG benchmark tailored for finance which effectively integrates multimodal data
and provides visual citation to ensure traceability. It includes a bilingual
retrieval corpus with 60,780 Chinese and 51,219 English pages, along with a
high-quality, human-annotated question-answering (QA) dataset spanning
heterogeneous data types and seven question categories. Moreover, we introduce
RGenCite, an RAG baseline that seamlessly integrates visual citation with
generation. Furthermore, we propose an automatic citation evaluation method to
systematically assess the visual citation capabilities of Multimodal Large
Language Models (MLLMs). Extensive experiments on RGenCite underscore the
challenging nature of FinRAGBench-V, providing valuable insights for the
development of multimodal RAG systems in finance.
♻ ☆ Register Always Matters: Analysis of LLM Pretraining Data Through the Lens of Language Variation
Pretraining data curation is a cornerstone in Large Language Model (LLM)
development, leading to growing research on quality filtering of large web
corpora. From statistical quality flags to LLM-based labelling systems,
datasets are divided into categories, frequently reducing to a binary: those
passing the filters are deemed as valuable examples, others are discarded as
useless or detrimental. However, a more detailed understanding of the
contribution of different kinds of texts to model performance is still largely
lacking. In this article, we present the first study utilising registers or
genres - a widely used standard in corpus linguistics to model linguistic
variation - to curate pretraining datasets and investigate the effect of
register on the performance of LLMs. We train small generative models with
register classified data and evaluate them using standard benchmarks, and show
that the register of pretraining data substantially affects model performance.
We uncover surprising relationships between the pretraining material and the
resulting models: using the News register results in subpar performance, and on
the contrary, including the Opinion class, covering texts such as reviews and
opinion blogs, is highly beneficial. While a model trained on the entire
unfiltered dataset outperforms those trained on datasets limited to a single
register, combining well-performing registers like How-to-Instructions,
Informational Description, and Opinion leads to major improvements.
Furthermore, analysis of individual benchmark results reveals key differences
in the strengths and drawbacks of specific register classes as pretraining
data. These findings show that register is an important explainer of model
variation and can facilitate more deliberate future data selection practices.
♻ ☆ TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection EMNLP2025
Rapid advances in Large Language Models (LLMs) have spurred demand for
processing extended context sequences in contemporary applications. However,
this progress faces two challenges: performance degradation due to sequence
lengths out-of-distribution, and excessively long inference times caused by the
quadratic computational complexity of attention. These issues limit LLMs in
long-context scenarios. In this paper, we propose Dynamic Token-Level KV Cache
Selection (TokenSelect), a training-free method for efficient and accurate
long-context inference. TokenSelect builds upon the observation of
non-contiguous attention sparsity, using QK dot products to measure per-head KV
Cache criticality at token-level. By per-head soft voting mechanism,
TokenSelect selectively involves a few critical KV cache tokens in attention
calculation without sacrificing accuracy. To further accelerate TokenSelect, we
design the Selection Cache based on observations of consecutive Query
similarity and implemented the efficient Paged Dot Product Kernel,
significantly reducing the selection overhead. A comprehensive evaluation of
TokenSelect demonstrates up to $23.84\times$ speedup in attention computation
and up to $2.28\times$ acceleration in end-to-end latency, while providing
superior performance compared to state-of-the-art long-context inference
methods.
comment: Accepted by EMNLP2025
♻ ☆ Trust but Verify! A Survey on Verification Design for Test-time Scaling
Test-time scaling (TTS) has emerged as a new frontier for scaling the
performance of Large Language Models. In test-time scaling, by using more
computational resources during inference, LLMs can improve their reasoning
process and task performance. Several approaches have emerged for TTS such as
distilling reasoning traces from another model or exploring the vast decoding
search space by employing a verifier. The verifiers serve as reward models that
help score the candidate outputs from the decoding process to diligently
explore the vast solution space and select the best outcome. This paradigm
commonly termed has emerged as a superior approach owing to parameter free
scaling at inference time and high performance gains. The verifiers could be
prompt-based, fine-tuned as a discriminative or generative model to verify
process paths, outcomes or both. Despite their widespread adoption, there is no
detailed collection, clear categorization and discussion of diverse
verification approaches and their training mechanisms. In this survey, we cover
the diverse approaches in the literature and present a unified view of verifier
training, types and their utility in test-time scaling. Our repository can be
found at
https://github.com/elixir-research-group/Verifierstesttimescaling.github.io.
comment: 18 pages
♻ ☆ LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Complex Reasoning
Tao Liu, Xutao Mao, Hongying Zan, Dixuan Zhang, Yifan Li, Haixin Liu, Lulu Kong, Jiaming Hou, Rui Li, YunLong Li, aoze zheng, Zhiqiang Zhang, Luo Zhewei, Kunli Zhang, Min Peng
Text-to-SQL is a critical task in natural language processing that aims to
transform natural language questions into accurate and executable SQL queries.
In real-world scenarios, these reasoning tasks are often accompanied by complex
mathematical computations, domain knowledge, and hypothetical reasoning
scenarios. However, existing large-scale Text-to-SQL datasets typically focus
on business logic and task logic, neglecting critical factors such as vertical
domain knowledge, complex mathematical reasoning, and hypothetical reasoning,
which are essential for realistically reflecting the reasoning demands in
practical applications and completing data querying and analysis. To bridge
this gap, we introduce LogicCat, the first Text-to-SQL benchmark dataset
specifically designed for complex reasoning and chain-of-thought parsing,
encompassing physics, arithmetic, commonsense, and hypothetical reasoning
scenarios. LogicCat comprises 4,038 English questions paired 12,114 detailed
chain-of-thought reasoning steps, spanning 45 databases across diverse domains,
significantly surpassing existing datasets in complexity. Experimental results
demonstrate that LogicCat substantially increases the task difficulty for
current state-of-the-art models to at most 33.20% execution accuracy,
indicating that this task remains exceptionally challenging. The advancement of
LogicCat represents a crucial step toward developing systems suitable for
real-world enterprise data analysis and autonomous query generation. We have
released our dataset code at https://github.com/Ffunkytao/LogicCat.
comment: 9 pages, 5 figures
♻ ☆ CTourLLM: Enhancing LLMs with Chinese Tourism Knowledge
Recently, large language models (LLMs) have demonstrated their effectiveness
in various natural language processing (NLP) tasks. However, the lack of
tourism knowledge limits the performance of LLMs in tourist attraction
presentations and travel planning. To address this challenge, we constructed a
supervised fine-tuning dataset for the Chinese culture and tourism domain,
named Cultour. This dataset consists of three parts: tourism knowledge base
data, travelogues data, and tourism QA data. Additionally, we propose CTourLLM,
a Qwen-based model supervised fine-tuned with Cultour, to improve the quality
of information about attractions and travel planning. To evaluate the
performance of CTourLLM, we proposed a human evaluation criterion named RRA
(Relevance, Readability, Availability), and employed both automatic and human
evaluation. The experimental results demonstrate that CTourLLM outperforms
ChatGPT, achieving an improvement of 1.21 in BLEU-1 and 1.54 in Rouge-L,
thereby validating the effectiveness of the response outcomes. Our proposed
Cultour is accessible at https://github.com/mrweiqk/Cultour.
♻ ☆ Interleaving Reasoning for Better Text-to-Image Generation
Wenxuan Huang, Shuang Chen, Zheyong Xie, Shaosheng Cao, Shixiang Tang, Yufan Shen, Qingyu Yin, Wenbo Hu, Xiaoman Wang, Yuntian Tang, Junbo Qiao, Yue Guo, Yao Hu, Zhenfei Yin, Philip Torr, Yu Cheng, Wanli Ouyang, Shaohui Lin
Unified multimodal understanding and generation models recently have achieve
significant improvement in image generation capability, yet a large gap remains
in instruction following and detail preservation compared to systems that
tightly couple comprehension with generation such as GPT-4o. Motivated by
recent advances in interleaving reasoning, we explore whether such reasoning
can further improve Text-to-Image (T2I) generation. We introduce Interleaving
Reasoning Generation (IRG), a framework that alternates between text-based
thinking and image synthesis: the model first produces a text-based thinking to
guide an initial image, then reflects on the result to refine fine-grained
details, visual quality, and aesthetics while preserving semantics. To train
IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL),
which targets two sub-goals: (1) strengthening the initial think-and-generate
stage to establish core content and base quality, and (2) enabling high-quality
textual reflection and faithful implementation of those refinements in a
subsequent image. We curate IRGL-300K, a dataset organized into six decomposed
learning modes that jointly cover learning text-based thinking, and full
thinking-image trajectories. Starting from a unified foundation model that
natively emits interleaved text-image outputs, our two-stage training first
builds robust thinking and reflection, then efficiently tunes the IRG pipeline
in the full thinking-image trajectory data. Extensive experiments show SoTA
performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF,
GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality
and fine-grained fidelity. The code, model weights and datasets will be
released in: https://github.com/Osilly/Interleaving-Reasoning-Generation .
♻ ☆ SemCAFE: When Named Entities make the Difference Assessing Web Source Reliability through Entity-level Analytics
With the shift from traditional to digital media, the online landscape now
hosts not only reliable news articles but also a significant amount of
unreliable content. Digital media has faster reachability by significantly
influencing public opinion and advancing political agendas. While newspaper
readers may be familiar with their preferred outlets political leanings or
credibility, determining unreliable news articles is much more challenging. The
credibility of many online sources is often opaque, with AI generated content
being easily disseminated at minimal cost. Unreliable news articles,
particularly those that followed the Russian invasion of Ukraine in 2022,
closely mimic the topics and writing styles of credible sources, making them
difficult to distinguish. To address this, we introduce SemCAFE, a system
designed to detect news reliability by incorporating entity relatedness into
its assessment. SemCAFE employs standard Natural Language Processing
techniques, such as boilerplate removal and tokenization, alongside entity
level semantic analysis using the YAGO knowledge base. By creating a semantic
fingerprint for each news article, SemCAFE could assess the credibility of
46,020 reliable and 3,407 unreliable articles on the 2022 Russian invasion of
Ukraine. Our approach improved the macro F1 score by 12% over state of the art
methods. The sample data and code are available on GitHub
♻ ☆ Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts
While Multimodal Large Language Models (MLLMs) excel at general
vision-language tasks, visuospatial cognition - reasoning about spatial
layouts, relations, and dynamics - remains a significant challenge. Existing
models often lack the necessary architectural components and specialized
training data for fine-grained spatial understanding. We introduce ViCA2
(Visuospatial Cognitive Assistant 2), a novel MLLM designed to enhance spatial
reasoning. ViCA2 features a dual vision encoder architecture integrating SigLIP
for semantics and Hiera for spatial structure, coupled with a token ratio
control mechanism for efficiency. We also developed ViCA-322K, a new
large-scale dataset with over 322,000 spatially grounded question-answer pairs
for targeted instruction tuning. On the challenging VSI-Bench benchmark, our
ViCA2-7B model achieves a state-of-the-art average score of 56.8, significantly
surpassing larger open-source models (e.g., LLaVA-NeXT-Video-72B, 40.9) and
leading proprietary models (Gemini-1.5 Pro, 45.4). This demonstrates the
effectiveness of our approach in achieving strong visuospatial intelligence
with a compact model. We release ViCA2, its codebase, and the ViCA-322K dataset
to facilitate further research.
comment: 26 pages, 19 figures, 4 tables
♻ ☆ Visuospatial Cognitive Assistant
Video-based spatial cognition is vital for robotics and embodied AI but
challenges current Vision-Language Models (VLMs). This paper makes two key
contributions. First, we introduce ViCA (Visuospatial Cognitive
Assistant)-322K, a diverse dataset of 322,003 QA pairs from real-world indoor
videos (ARKitScenes, ScanNet, ScanNet++), offering supervision for 3D
metadata-grounded queries and video-based complex reasoning. Second, we develop
ViCA-7B, fine-tuned on ViCA-322K, which achieves new state-of-the-art on all
eight VSI-Bench tasks, outperforming existing models, including larger ones
(e.g., +26.1 on Absolute Distance). For interpretability, we present
ViCA-Thinking-2.68K, a dataset with explicit reasoning chains, and fine-tune
ViCA-7B to create ViCA-7B-Thinking, a model that articulates its spatial
reasoning. Our work highlights the importance of targeted data and suggests
paths for improved temporal-spatial modeling. We release all resources to
foster research in robust visuospatial intelligence.
comment: 31 pages, 10 figures, 6 tables
♻ ☆ Are Economists Always More Introverted? Analyzing Consistency in Persona-Assigned LLMs EMNLP 2025
Personalized Large Language Models (LLMs) are increasingly used in diverse
applications, where they are assigned a specific persona - such as a happy high
school teacher - to guide their responses. While prior research has examined
how well LLMs adhere to predefined personas in writing style, a comprehensive
analysis of consistency across different personas and task types is lacking. In
this paper, we introduce a new standardized framework to analyze consistency in
persona-assigned LLMs. We define consistency as the extent to which a model
maintains coherent responses when assigned the same persona across different
tasks and runs. Our framework evaluates personas across four different
categories (happiness, occupation, personality, and political stance) spanning
multiple task dimensions (survey writing, essay generation, social media post
generation, single turn, and multi-turn conversations). Our findings reveal
that consistency is influenced by multiple factors, including the assigned
persona, stereotypes, and model design choices. Consistency also varies across
tasks, increasing with more structured tasks and additional context. All code
is available on GitHub.
comment: Accepted to EMNLP 2025 findings
♻ ☆ When Large Language Models Meet Speech: A Survey on Integration Approaches ACL 2025
Recent advancements in large language models (LLMs) have spurred interest in
expanding their application beyond text-based tasks. A large number of studies
have explored integrating other modalities with LLMs, notably speech modality,
which is naturally related to text. This paper surveys the integration of
speech with LLMs, categorizing the methodologies into three primary approaches:
text-based, latent-representation-based, and audio-token-based integration. We
also demonstrate how these methods are applied across various speech-related
applications and highlight the challenges in this field to offer inspiration
for
comment: Accepted at Findings of ACL 2025 (Long Paper)
♻ ★ Step-level Verifier-guided Hybrid Test-Time Scaling for Large Language Models EMNLP 2025
Kaiyan Chang, Yonghao Shi, Chenglong Wang, Hang Zhou, Chi Hu, Xiaoqian Liu, Yingfeng Luo, Yuan Ge, Tong Xiao, Jingbo Zhu
Test-Time Scaling (TTS) is a promising approach to progressively elicit the
model's intelligence during inference. Recently, training-based TTS methods,
such as continued reinforcement learning (RL), have further surged in
popularity, while training-free TTS methods are gradually fading from
prominence. However, the additional computation overhead of training amplifies
the burden on test-time scaling. In this paper, we focus on training-free TTS
methods for reasoning. We first design Conditional Step-level Self-refinement,
a fine-grained sequential scaling method guided by process verification. On top
of its effectiveness, we further combine it with other classical parallel
scaling methods at the step level, to introduce a novel inference paradigm
called Hybrid Test-Time Scaling. Extensive experiments on five
instruction-tuned LLMs across different scales (3B-14B) and families
demonstrate that hybrid strategy incorporating various training-free TTS
methods at a fine granularity has considerable potential for expanding the
reasoning performance boundaries of LLMs.
comment: Accepted by EMNLP 2025. Code: https://github.com/Lucky-259/Hybrid_TTS