News | Nanjing University Collaborative Innovation Center for Large Model Research

Recent Research Progress: Neuro-Symbolic Automated Formal Proof

Wed, 03 Jun 2026 00:00:00 +0000

As Bjarne Stroustrup said, “software carries our civilization.” Ensuring the absolute correctness of large-scale software is a foundation for reliable information systems and a long-standing goal in software engineering. Formal proof offers a mathematically rigorous path toward that goal, but its human cost remains prohibitive. Verifying the seL4 operating-system kernel, which has fewer than ten thousand lines of source code, took more than 11 person-years [1]. End-to-end verification of the real distributed systems IronRSL and IronKV took 3.7 person-years [2]. This cost makes proof automation an urgent problem.

Recent progress in large language models has brought new opportunities. The Software Institute team at Nanjing University has carried out a series of studies on automated formal proof for foundational software systems. Their neuro-symbolic proof-generation framework substantially improves automation: using only a locally fine-tuned 7B model, it nearly doubles the success rate for automatically generating proofs of seL4-related theorems, from 40.3% to 77.6%. This work has been accepted by OSDI 2026. For safety proofs of distributed protocols, the team has also built an automated pipeline from inductive invariant generation to TLAPS proof writing and final safety verification. It can use general-purpose large models to fully prove an industrial-grade distributed protocol such as MongoLoglessDynamicRaft, producing more than 6,000 lines of proof.

01 A lightweight local solution proves 77% of seL4 theorems

In automated formal verification for the industrial-grade seL4 operating-system kernel, large language models show strong potential for mathematical reasoning and proof-code generation, but they also reveal clear weaknesses. They may produce logical hallucinations or steps that violate the strict syntax and semantics of interactive theorem provers (ITPs). In long and complex proofs, relying entirely on an LLM-generated sequence of steps often causes the proof to fail midway.

To address this issue, the team proposes a neuro-symbolic proof-generation framework. The core idea is simple: instead of letting the LLM work alone, make it collaborate closely with the ITP. The LLM acts as a heuristic generator, predicting candidate next steps from the current proof state. The ITP immediately checks each step and provides support for error repair, redundancy filtering, and branch completion. This turns proof generation from a one-shot attempt into a closed loop of generation, checking, repair, and feedback.
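The closed loop described above can be illustrated with a minimal sketch. Everything here is a toy assumption: `check_step` stands in for the ITP, `propose` stands in for the LLM, and a proof state is just a tuple of open goals. It is not the team's framework, only the generate-and-check control flow.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[str, ...]  # remaining goals: a toy stand-in for an ITP proof state


def check_step(state: State, step: str) -> Optional[State]:
    """Toy checker: 'solve <goal>' is valid only if it closes the first open goal."""
    if state and step == f"solve {state[0]}":
        return state[1:]
    return None  # rejected, like a tactic that fails in the prover


def propose(state: State) -> List[str]:
    """Stub for the LLM: ranked candidates, with a wrong one deliberately first."""
    return ["solve nonsense", f"solve {state[0]}"]


def prove(goals: State,
          propose_fn: Callable[[State], List[str]],
          max_steps: int = 20) -> Optional[List[str]]:
    state, script = goals, []
    for _ in range(max_steps):
        if not state:
            return script            # all goals closed: proof complete
        for step in propose_fn(state):
            nxt = check_step(state, step)
            if nxt is not None:      # keep only steps the checker accepts
                state = nxt
                script.append(step)
                break
        else:
            return None              # no candidate was accepted
    return script if not state else None
```

Calling `prove(("A", "B"), propose)` discards the rejected candidates and returns only the checked steps. A real system would feed the checker's error messages back into the next proposal rather than simply moving on.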

Figure 1: A proof-generation framework based on neuro-symbolic integration

Experiments show that with only a locally fine-tuned 7B model, the framework improves the automatic proof-generation success rate on seL4 from 40.3% to 77.6%.

02 An end-to-end proof framework writes a 6,000-line TLAPS proof

The team has also made progress on automated safety proofs for distributed protocols. Distributed protocols run in environments with many nodes, arbitrary network delays or partitions, and possible node failures or malicious messages. A single safety bug can cause history forks, state rollback, or consensus failure. Even protocols that were proved by hand or checked by the TLC model checker at finite scale have later been found to contain errors. Machine-checked theorem proving of protocol safety is therefore essential.

This task is difficult because protocol interactions are complex, state spaces explode, and key inductive invariants often require repeated trial and error. For example, Basilisk reports that IronFleet spent months proving the inductive invariant for Multi-Paxos [3], and IronFleet’s end-to-end verification of two real distributed systems required about 3.7 person-years [2].

The team implements an automated framework that generates inductive invariants, writes TLAPS proofs, and proves protocol safety properties. The technical path first combines large models with the classical IC3 algorithm through neuro-symbolic methods to obtain inductive invariants for distributed protocols. It then uses an agentic harness to generate TLAPS proof scripts automatically. Importantly, the TLAPS proof stage uses no protocol-specific manual prompts or handwritten lemmas. The agent sees only the protocol, the inductive invariant, the proof goal, and general proof feedback, and then iteratively advances proof obligations until it produces a complete proof.
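To make "inductive invariant" concrete, here is a small sketch of the three conditions such an invariant must satisfy, checked by brute force on a toy transition system. The system (a counter stepping by 2 modulo 8) and the names `init`, `next_states`, and `safe` are illustrative assumptions. Real protocols have unbounded state spaces, which is why the work relies on IC3 and TLAPS rather than enumeration.

```python
from typing import Callable, Iterable

STATES = range(8)  # toy finite state space


def init(x: int) -> bool:
    return x == 0


def next_states(x: int) -> Iterable[int]:
    return [(x + 2) % 8]


def safe(x: int) -> bool:
    return x != 3  # the safety property we want to prove


def is_inductive_invariant(inv: Callable[[int], bool]) -> bool:
    # Initiation: every initial state satisfies the invariant.
    initiation = all(inv(x) for x in STATES if init(x))
    # Consecution: the invariant is preserved by every transition.
    consecution = all(inv(y) for x in STATES if inv(x) for y in next_states(x))
    # Safety: the invariant implies the property.
    implies_safety = all(safe(x) for x in STATES if inv(x))
    return initiation and consecution and implies_safety
```

Here `is_inductive_invariant(lambda x: x % 2 == 0)` holds, while the safety property alone is not inductive: state 1 satisfies `x != 3` but steps to 3. Finding the strengthening ("x is even") is the trial-and-error part that invariant generation automates.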

Figure 2: Automatic TLAPS proof-generation workflow

For MongoLoglessDynamicRaft, a logless dynamic reconfiguration protocol proposed and deployed by MongoDB engineers William Schultz and colleagues [4], the team has completed a fully automated safety proof. The resulting proof contains 6,308 lines of TLAPS code, all generated by the agent.

03 The future: Verified Spec-driven Development

The continuing improvement of large language models is reshaping the boundary of formal verification. Frontier models with longer context windows and stronger reasoning ability can already call external tools autonomously when given the workflow. Numina-Lean-Agent, which combines Claude Code with the Lean theorem prover, solved all 12 problems in the Putnam 2025 competition [5]. In Agentic Proof Automation, researchers used Claude Code to generate proofs for an algorithm with 14,000 lines of Lean code and achieved an 87% success rate on 189 evaluated tasks, with about 16% requiring human intervention [6].

Formal verification is moving out of the academic ivory tower and into mainstream software development. In the future, developers may only need to clearly describe system behavior and constraints to obtain provably correct implementations, while every code change can receive mathematical correctness guarantees in real time.

References:

[1] G. Klein et al., “seL4: Formal Verification of an OS Kernel,” in Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, 2009.

[2] C. Hawblitzel, J. Howell, M. Kapritsos, J. R. Lorch, B. Parno, M. L. Roberts, S. Setty, and B. Zill, “IronFleet: Proving Practical Distributed Systems Correct,” in Proceedings of the 25th Symposium on Operating Systems Principles (SOSP 2015), pp. 1-17, 2015.

[3] T. N. Zhang, K. Singh, T. Chajed, M. Kapritsos, and B. Parno, “Basilisk: Using Provenance Invariants to Automate Proofs of Undecidable Protocols,” in 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2025), 2025.

[4] W. Schultz and S. Zhou, “Rapid Prototyping A Safe, Logless Reconfiguration Protocol For MongoDB With TLA+,” MongoDB Technical Blog.

[5] J. Liu et al., “Numina-Lean-Agent: An Open and General Agentic Reasoning System for Formal Mathematics,” arXiv preprint arXiv:2601.14027, 2026.

[6] Y. Xu and M. Odersky, “Agentic Proof Automation: A Case Study,” arXiv preprint arXiv:2601.03768, 2026.

ACL 2026 Accepted Papers

Fri, 08 May 2026 00:00:00 +0000

ACL (Annual Meeting of the Association for Computational Linguistics) is one of the top international conferences in natural language processing and computational linguistics. Organized annually by the Association for Computational Linguistics, it brings together frontier research from universities, research institutes, and industry in language understanding, machine translation, information extraction, dialogue systems, large language models, multimodal language intelligence, and related areas. ACL is a CCF-A conference in artificial intelligence and, together with EMNLP and NAACL, forms the core group of the most influential conferences in NLP. According to Google Scholar Metrics 2025, ACL ranks No. 1 in the Computational Linguistics category.

The Large Model Center at the School of Computer Science, Nanjing University has 9 papers accepted by ACL 2026, including 5 ACL Main papers and 4 Findings of ACL papers.

01

ACL Track: ACL Main

Title: Bootstrapping Code Translation with Weighted Multilanguage Exploration

Authors: Yuhan Wu, Huan Zhang, Wei Cheng, Chen Shen, Jingyue Yang, Wei Hu

Affiliations: Nanjing University

Link: https://arxiv.org/abs/2601.03512

Paper Summary:

This paper studies multilingual code translation and aims to reduce reliance on high-quality parallel corpora and executable test data. To address the scarcity of parallel verification data, the uneven difficulty of translation directions, and the tendency of models to favor easy translation paths, the authors propose BootTrans, a bootstrapped code translation framework that does not require parallel corpora. It uses the transferability of unit tests across programming languages to build a cyclic training mechanism driven by two data pools. During rollouts, the framework dynamically collects successful translation samples and further expands back-translation and cross-language translation paths, moving beyond traditional pivot-language constraints. BootTrans also introduces a language-aware dynamic weighted optimization strategy that adaptively adjusts training weights according to the difficulty of each translation direction, encouraging the model to focus on harder or lower-performing target languages. Experiments on HumanEval-X and TransCoder-Test show that BootTrans significantly outperforms base models and existing code-translation fine-tuning methods, achieving up to a 16.57% performance gain on Llama-3.1-8B. It also generalizes well to unseen languages, low-resource languages, and more complex class-level code translation tasks.
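One simple way to realize difficulty-dependent weighting is sketched below. It assumes difficulty is measured by the current unit-test pass rate of each translation direction and uses a softmax over `1 - pass_rate`. The function name, the `temperature` parameter, and this particular formula are assumptions for illustration, not the weighting rule from the paper.

```python
import math
from typing import Dict


def direction_weights(pass_rates: Dict[str, float],
                      temperature: float = 1.0) -> Dict[str, float]:
    """Softmax over (1 - pass_rate): lower pass rate yields a larger weight."""
    scores = {d: (1.0 - p) / temperature for d, p in pass_rates.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {d: math.exp(s - top) for d, s in scores.items()}
    total = sum(exps.values())
    return {d: e / total for d, e in exps.items()}
```

For example, with pass rates of 0.8 for `py->java` and 0.3 for `py->rust`, the harder `py->rust` direction receives the larger training weight. Lowering `temperature` sharpens that preference.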

02

ACL Track: ACL Main

Title: AEA: Adaptive Expert Allocation Improves Sentence Embeddings from Mixture-of-Experts LLM

Authors: Shufan Yang, Zifeng Cheng, Zhiwei Jiang, Qingfeng Qi, Yafeng Yin, Cong Wang, Ao Zhou, Qing Gu

Affiliations: Nanjing University

Paper Summary:

Directly extracting sentence embeddings from Mixture-of-Experts (MoE) models is a promising but underexplored direction because it requires no additional data or fine-tuning. Previous studies have used semantic-compression prompts or expert routing information to improve sentence embeddings, but they usually assign a fixed number of experts uniformly across all layers and tokens, ignoring layer-wise and token-wise heterogeneity. This paper identifies two key phenomena in MoE models: layer-level differences in expert homogeneity, which indicate that different layers require different expert budgets, and imbalanced token contributions, which indicate that different tokens should also receive different numbers of experts. The authors propose Adaptive Expert Allocation (AEA), a framework that dynamically performs layer-level and token-level expert allocation to improve embedding quality. AEA assigns fewer experts to layers with higher expert homogeneity and to tokens with lower attention importance, where layer homogeneity is measured by the similarity among embeddings produced by experts in each layer. The method is plug-and-play, integrates smoothly with existing prompting methods, and introduces no additional time overhead. Experiments on STS tasks show that AEA consistently improves sentence embeddings across multiple MoE models.
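The layer-level half of this idea can be sketched as follows. The sketch assumes homogeneity is the mean pairwise cosine similarity of expert outputs and maps it linearly to an expert budget between hypothetical bounds `k_min` and `k_max`. The exact allocation rule in the paper may differ, and token-level allocation is omitted.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    # Assumes both vectors are non-zero.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def layer_homogeneity(expert_outputs: List[Sequence[float]]) -> float:
    """Mean pairwise cosine similarity among the experts' embeddings in one layer."""
    pairs = list(combinations(expert_outputs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def experts_for_layer(homogeneity: float, k_min: int = 1, k_max: int = 4) -> int:
    """More homogeneous layers get fewer experts (linear map, clipped to [0, 1])."""
    h = min(max(homogeneity, 0.0), 1.0)
    return k_min + round((k_max - k_min) * (1.0 - h))
```

A layer whose experts all produce the same direction gets `k_min` experts, while a layer with orthogonal expert outputs gets `k_max`.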

03

ACL Track: ACL Main

Title: Focusing Condition: Inference-Time Self-Contrastive Steering Elicits Better Conditional Text Embeddings in LLMs

Authors: Zifeng Cheng, Lingyun Qian, Zhiwei Jiang, Cong Wang, Yafeng Yin, Fei Shen, Ao Zhou, Qing Gu

Affiliations: Nanjing University; National University of Singapore

Paper Summary:

Extracting conditional text embeddings directly from large language models has attracted broad interest because it requires no additional data or fine-tuning. Existing methods add conditions to prompts to guide LLMs toward condition-specific embeddings. However, relying only on prompts often fails to produce high-quality conditional embeddings, because these embeddings remain entangled with general-purpose text embeddings and therefore degrade in quality. This work proposes an inference-time self-contrastive steering (SCS) method that improves conditional embeddings by constructing unconditional general text embeddings and steering the conditional representation to focus more on the target condition. Specifically, the method masks out the condition by modifying the attention mask and positional encoding, obtains unconditional text embeddings, and intervenes in multi-head self-attention computation. The method is efficient and requires only one extra multi-head self-attention computation at inference time. Large-scale experiments on clustering, semantic textual similarity, and triplet-alignment datasets show that SCS can improve existing prompt-based methods across different LLMs in a training-free and plug-and-play manner.

04

ACL Track: ACL Main

Title: A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAM∆ Integration into Upcycled MoE

Authors: Hao Zhou, Tianhao Li, Zhijun Wang, Shuaijie She, Linjuan Wu, Hao-ran Wei, Baosong Yang, Jiajun Chen, Shujian Huang

Affiliations: Nanjing University; Tongyi Lab; Zhejiang University

Paper Summary:

Current large language models mainly possess strong English or Chinese abilities, while their capabilities in low-resource languages remain limited. Traditional language-expansion methods first use large-scale monolingual data in continued pretraining to add basic knowledge for target languages, and then perform post-training to align the model with human preferences. Because post-training requires large amounts of high-quality annotated data in the target language, many works attempt to replace post-training with parameter merging to bypass the data bottleneck. However, these methods face a key conflict: parameters obtained from continued pretraining (CPT) and those obtained from post-training can be incompatible. To address this issue, the authors propose DeltaMoE, which expands multiple experts and adds the post-training parameter delta to each expert, helping the MoE model acquire alignment ability. Experiments show that DeltaMoE brings significant improvements on expanded languages under both matched parameter counts and matched training FLOPs, while also preserving knowledge in original languages and avoiding catastrophic forgetting.
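The parameter arithmetic behind "adding the post-training parameter delta to each expert" can be sketched on plain lists of floats. The sketch assumes the delta is the element-wise difference between a post-trained model and its base, and that every expert shares the same parameter shapes. Which tensors receive the delta in DeltaMoE, and how experts are upcycled, is not specified here.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # toy parameter store: name -> flat weights


def post_training_delta(base: Params, post: Params) -> Params:
    """Delta = post-trained weights minus base weights, per parameter."""
    return {name: [p - b for p, b in zip(post[name], base[name])] for name in base}


def add_delta_to_experts(experts: List[Params], delta: Params) -> List[Params]:
    """Add the same post-training delta to every expert's weights."""
    return [
        {name: [w + d for w, d in zip(expert[name], delta[name])] for name in expert}
        for expert in experts
    ]
```

In this toy setting, each expert keeps what it learned from continued pretraining and additionally receives the shift that post-training applied to the original model.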

05

ACL Track: ACL Main

Title: Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers

Authors: Xin Chen, Feng Jiang, Yiqian Zhang, Hardy Chen, Shuo Yan, Wenya Xie, Min Yang, Shujian Huang

Affiliations: Nanjing University; Artificial Intelligence Research Institute, Shenzhen University of Advanced Technology; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences

Link: https://arxiv.org/abs/2601.22139

Paper Summary:

Reasoning large language models such as DeepSeek-R1 have achieved strong performance on complex tasks through explicit reasoning traces. However, these models are still constrained by a blind self-thinking paradigm: when user instructions have missing premises or ambiguous intent, the model often performs lengthy internal reasoning, leading to overthinking, hallucinations, and conclusions that deviate from the user’s true intent. This harms interaction efficiency and user experience. To address this issue, the authors propose Proactive Interactive Reasoning (PIR), a new paradigm that transforms reasoning LLMs from passive solvers into proactive inquirers. PIR enables models to interleave thinking, asking, and feedback during reasoning, so they can clarify key uncertainties and better align with user intent.

The PIR framework has two stages. The first is interaction capability activation, where uncertainty-aware data augmentation identifies key decision points with uncertainty during reasoning and injects clarification questions and simulated user replies at those points. This converts monotonic reasoning traces into an interactive think-ask-feedback format and uses supervised fine-tuning to activate proactive questioning ability. The second is user-intent alignment, where the authors build a user-simulator-based Group Relative Policy Optimization framework (US-GRPO). It combines extrinsic rewards for task correctness with intrinsic rewards for the helpfulness and efficiency of model questions, guiding the model to solve tasks accurately while reducing unnecessary interaction. Experiments on multi-turn interaction tasks in mathematical reasoning, code generation, and document editing show that PIR consistently improves over baselines and significantly reduces reasoning computation and redundant interactions. Evaluations on non-interactive benchmarks such as MMLU, MMLU-Pro, TriviaQA, SQuAD, and Missing Premise tests further show PIR’s generalization potential and robustness.
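The reward shaping and group-relative normalization in the second stage can be sketched as below. The normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO form. The `combined_reward` formula, with its hypothetical `lam` and `cost` coefficients, is only an assumed way to mix task correctness with question helpfulness and interaction cost.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(correct: bool, helpfulness: float, num_questions: int,
                    lam: float = 0.5, cost: float = 0.1) -> float:
    """Extrinsic correctness plus an intrinsic term, minus a per-question cost."""
    return float(correct) + lam * helpfulness - cost * num_questions


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: normalize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because advantages are relative to the group, a rollout that solves the task with fewer clarification questions is preferred over one that solves it with many, even though both are correct.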

06

ACL Track: Findings of ACL

Title: Understanding New-Knowledge-Induced Factual Hallucinations in LLMs: Analysis and Interpretation

Authors: Renfei Dang, Peng Hu, Zhejian Lai, Changjiang Gao, Min Zhang, Shujian Huang

Affiliations: Nanjing University; Huawei Translation Services Center

Link: https://arxiv.org/abs/2511.02626

Paper Summary:

Previous studies have shown that fine-tuning large language models on new knowledge can induce factual hallucinations, causing models to answer incorrectly when asked about information they originally knew. However, the forms and mechanisms of these hallucinations remain insufficiently understood. To fill this gap, the authors construct a controlled dataset, Biography-Reasoning, and conduct fine-grained analyses across multiple knowledge types and knowledge-based question answering and reasoning tasks.

They find that factual hallucinations seriously affect not only tasks involving the newly learned knowledge, but also other evaluation tasks. When a specific knowledge type in fine-tuning data consists entirely of new knowledge, LLMs show a stronger tendency toward hallucination. Through interpretability analysis, the authors further find that learning new knowledge weakens the model’s attention to key entities in the input question and makes it rely more heavily on surrounding context, thereby increasing hallucination risk. Conversely, reintroducing a small amount of known knowledge in the later training stage can restore attention to key entities and significantly reduce hallucinations. The paper also shows that this disturbed attention pattern can propagate between lexically similar contexts, causing hallucinations to spread beyond the original task.

07

ACL Track: Findings of ACL

Title: PEGRL: Improving Machine Translation by Post-Editing Guided Reinforcement Learning

Authors: Yunzhi Shen, Hao Zhou, Xin Huang, Xue Han, Junlan Feng, Shujian Huang

Affiliations: Nanjing University; China Mobile

Paper Summary:

Reinforcement learning has shown strong potential for LLM-based machine translation, and recent methods such as GRPO have achieved notable performance gains. However, applying reinforcement learning effectively to translation still faces several challenges. Policy-gradient estimates based on Monte Carlo baselines have high variance, and the large trajectory space encourages global exploration while making fine-grained local optimization harder.

To address these challenges, the authors propose PEGRL, a two-stage reinforcement learning framework that introduces post-editing as an auxiliary task to stabilize training and guide optimization. At each step, the model first samples translation outputs and then constructs post-editing task inputs from them. Low-variance gradients from the post-editing task can then propagate during training, strengthening local optimization while preserving global exploration.

The authors also design a task-specific weighting mechanism to amplify the influence of post-editing gradients, producing a moderately biased but more sample-efficient gradient estimator. Extensive experiments on English-to-Finnish, English-to-Turkish, and bidirectional English-Chinese translation show that PEGRL consistently improves over multiple reinforcement learning baselines. On English-to-Turkish, its COMETKiwi score is comparable to advanced LLM translation systems such as DeepSeek-V3.2.

08

ACL Track: Findings of ACL

Title: To Diff or Not to Diff? Structure-Aware and Adaptive Output Formats for Efficient LLM-based Code Editing

Authors: Wei Cheng, Yongchang Cao, Chen Shen, Binhua Li, Jue Chen, Yongbin Li, Wei Hu

Affiliations: Nanjing University; Tongyi Lab

Paper Summary:

This paper addresses the high latency and inference cost of LLM-based code editing in interactive coding assistants. Mainstream approaches often use full-code generation, which regenerates the entire file even when only a few lines need to change, causing substantial token waste and response latency. Traditional diff formats can shorten generation length, but because they rely on line numbers or fragmented content snippets, they can break code structure and make model outputs less natural, reducing editing accuracy. To solve this problem, researchers from Nanjing University and Tongyi Lab propose a structure-aware diff format and the AdaEdit adaptive editing strategy. The structure-aware diff organizes code changes into syntactically complete logical units based on the abstract syntax tree, preserving the efficiency of diff while improving the naturalness of model generation. AdaEdit further enables the model to decide whether to use diff generation or full generation for each editing task, selecting the more token-efficient output format. Experiments on Qwen2.5-Coder, DeepSeek-Coder, and multiple Python and JavaScript datasets show that the method matches or surpasses full-generation baselines in editing accuracy, reduces generation latency and token cost by more than 30% on long-code editing tasks, and achieves output-format selection accuracy above 90%. The study shows that optimizing output format and generation strategy can significantly improve the practical efficiency of coding assistants without increasing model size.
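A rough sketch of both ideas, using Python's standard `ast` module: changed top-level functions are emitted as whole units, and the shorter of the two outputs is chosen. This is a simplification under stated assumptions. It ignores deleted functions and non-function code, uses character count as a proxy for tokens, and picks the format with a fixed heuristic, whereas AdaEdit lets the model make that decision.

```python
import ast
from typing import Dict, Tuple


def function_sources(source: str) -> Dict[str, str]:
    """Map each top-level function name to its full source text."""
    tree = ast.parse(source)
    return {
        node.name: ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    }


def structural_diff(old: str, new: str) -> Dict[str, str]:
    """Return new or modified functions as syntactically complete units."""
    before, after = function_sources(old), function_sources(new)
    return {name: src for name, src in after.items() if before.get(name) != src}


def choose_output(old: str, new: str) -> Tuple[str, str]:
    """Pick the shorter format: structural diff or full regeneration."""
    changed = structural_diff(old, new)
    diff_text = "\n\n".join(changed.values())
    if changed and len(diff_text) < len(new):
        return "diff", diff_text
    return "full", new
```

When only one function in a long file changes, the diff output contains just that function. When most of the file changes, the full output is no longer than the diff and is used instead.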

09

ACL Track: Findings of ACL

Title: How Do Answer Tokens Read Reasoning Traces? Self-Reading Patterns in Thinking LLMs for Quantitative Reasoning

Authors: Haoyang Chen, Yi Liu, Jianzhi Shao, Tao Zhang, Chengfu Huo, Wei Hu

Affiliations: Nanjing University

Paper Summary:

This paper studies how reasoning LLMs that first think and then answer use their preceding reasoning traces during answer generation. It reveals a stable benign self-reading pattern in thinking models for quantitative reasoning. When a model answers correctly, attention during the answer stage usually moves progressively along the reasoning chain and remains focused on key semantic anchors such as problem constraints, solution plans, reflection and verification steps, and final conclusions. Incorrect samples more often show scattered attention and disordered reading trajectories. Based on this finding, the Nanjing University large model research group proposes a training-free activation steering method driven by Self-Reading Quality (SRQ). SRQ measures whether the model reads along an effective reasoning path from a geometric perspective and whether it focuses on key reasoning evidence from a semantic perspective. The method then constructs activation steering vectors from high- and low-SRQ samples to guide the model toward a more orderly, focused, and stable internal state. Experiments on GSM8K, MATH500, SciQ, AIME24-25, and other quantitative reasoning benchmarks show consistent improvements across reasoning models including R1-Distill-Qwen-7B, R1-Distill-Llama-8B, and Qwen3-4B-Thinking. The approach is also compatible with mainstream activation steering mechanisms such as CAA, Conceptor, and PCA-CAA. This study deepens the understanding of how reasoning models read their reasoning traces during answer generation and provides a general and effective internal supervision signal for improving reasoning without additional training.
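The steering step can be sketched in the difference-of-means style used by CAA-like methods: average the activations of high-SRQ samples, subtract the average of low-SRQ samples, and add a scaled copy of the result to a hidden state. The SRQ scoring itself is not reproduced here. The function names, the strength `alpha`, and the use of plain Python lists as activations are assumptions.

```python
from typing import List, Sequence


def mean_vector(vectors: List[Sequence[float]]) -> List[float]:
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(high: List[Sequence[float]],
                    low: List[Sequence[float]]) -> List[float]:
    """Difference of mean activations: high-quality minus low-quality samples."""
    return [h - l for h, l in zip(mean_vector(high), mean_vector(low))]


def steer(hidden: Sequence[float], vector: Sequence[float],
          alpha: float = 1.0) -> List[float]:
    """Shift a hidden state along the steering direction, with no weight updates."""
    return [x + alpha * v for x, v in zip(hidden, vector)]
```

Since only activations are shifted at inference time, the approach remains training-free, which matches the paper's framing.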

CVPR 2026 Accepted Papers

Tue, 05 May 2026 00:00:00 +0000

CVPR (IEEE/CVF Conference on Computer Vision and Pattern Recognition) is one of the most influential international conferences in artificial intelligence, focusing on frontier research in computer vision, pattern recognition, and related AI areas. According to Google Scholar Metrics 2025, CVPR ranks No. 2 among all English-language journals and conferences worldwide, second only to Nature, and No. 1 in the Engineering & Computer Science category.

The Large Model Center at the School of Computer Science, Nanjing University has 12 papers accepted by CVPR 2026.

01

Title: VideoRealBench: A Chain-of-Thought Realism Evaluation Benchmark for Generated Human-Centric Videos

Authors: Min Yang, Xinwen Zhang, Jialei Tang, Xin Zhou, Kehan Li, Zeyi Huang, Limin Wang

Affiliations: Nanjing University; Huawei Central Media Technology Institute; Shanghai AI Laboratory

Paper Summary:

With the rapid development of video generation models, content creators and researchers increasingly use these technologies to produce human-centric videos at scale for content creation and customized data generation. Although current video generation models can produce videos with high visual quality, their limited understanding of realism often leads to generated content that lacks authenticity. Existing evaluators for generated video quality are commonly trained on low-quality generated videos and annotations, causing their scores to deviate from human preferences. They also often lack interpretability because they do not provide chain-of-thought-style reasoning. To address these issues, this work proposes VideoRealBench, a comprehensive benchmark for evaluating the realism of human-centric generated videos. The authors design a human-preference-based scoring system and provide three-step reasoning evidence for each score. Based on this design, they construct the carefully annotated VideoRealDataset and introduce VideoRealEval, an evaluator that provides both reliable scores and detailed reasoning. On VideoRealDataset, VideoRealEval achieves a PLCC of 57.07% and an SROCC of 56.78%, showing stronger alignment with human preferences than existing evaluators.
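For readers unfamiliar with the two metrics, PLCC is the Pearson linear correlation between evaluator scores and human scores, and SROCC is the Spearman rank-order correlation, that is, Pearson correlation computed on ranks. A minimal standard-library sketch follows. The reported percentages presumably correspond to these coefficients multiplied by 100, which is an assumption about the reporting convention.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """PLCC: linear correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """SROCC: Pearson correlation of the rank-transformed scores."""
    return pearson(ranks(x), ranks(y))
```

The two can disagree: for scores that increase monotonically but not linearly, SROCC is 1 while PLCC is lower.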

02

Title: TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

Authors: Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li, Ying Shan, Limin Wang

Affiliations: Nanjing University; Tencent; Shanghai AI Laboratory

Paper Summary:

Video Temporal Grounding (VTG) is a core capability in video understanding. Rather than introducing an entirely new task-specific method, this work builds a straightforward, incremental, yet strong baseline for VTG. Although multimodal large language models (MLLMs) have performed well on many video understanding tasks, systematic studies on how to optimize them for VTG remain limited. TimeLens studies this problem from two key dimensions: data quality and algorithm design. First, the authors reveal serious data-quality issues in existing VTG benchmarks and introduce TimeLens-Bench, a high-quality evaluation set produced by carefully re-annotating three mainstream benchmarks under strict standards. Their analysis shows that model rankings can change significantly under the revised benchmarks, indicating that previous evaluation protocols are unreliable. They also build TimeLens-100K, a large-scale high-quality training set through automatic re-annotation of noisy training data. On top of this, the work explores key algorithmic principles for VTG, including interleaved textual encoding for temporal representation, a thinking-free RLVR training paradigm, and a carefully designed RLVR recipe. These components lead to the TimeLens model family, which reaches state-of-the-art VTG performance among open-source models and even surpasses frontier closed-source models such as GPT-5 and Gemini-2.5-Flash. Code, data, and models will be released to support future research.

03

Title: UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions

Authors: Guozhen Zhang, Zixiang Zhou, Teng Hu, Ziqiao Peng, Youliang Zhang, Yi Chen, Yuan Zhou, Qinglin Lu, Limin Wang

Affiliations: Nanjing University; Tencent Hunyuan; Shanghai Jiao Tong University; Renmin University of China; Tsinghua University; Shanghai AI Laboratory

Paper Summary:

Existing open-source audio-video generation methods often lack effective cross-modal modeling, leading to weak lip synchronization and insufficient semantic consistency. This work proposes UniAVGen, a unified framework for joint audio and video generation. UniAVGen uses a dual-branch joint synthesis architecture and two parallel diffusion Transformers (DiTs) to construct a unified cross-modal latent space. Its core component is an asymmetric cross-modal interaction mechanism, which supports bidirectional and temporally aligned cross-attention to ensure precise spatiotemporal synchronization and semantic consistency. The framework further enhances cross-modal interaction through a face-aware modulation (FAM) module that dynamically weights visually salient regions. To improve generation fidelity at inference time, the authors introduce modality-aware classifier-free guidance (MA-CFG), which explicitly strengthens cross-modal association signals. UniAVGen can support audio-video joint generation, audio-video continuation, video-to-audio dubbing, audio-driven video synthesis, and other tasks within a single model. Experiments show that even with far fewer training samples than existing methods (1.3 million vs. 30.1 million), UniAVGen achieves strong advantages in audio-video synchronization, timbre consistency, and emotional consistency.
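As background for MA-CFG, the standard classifier-free guidance rule combines a conditional and an unconditional prediction and extrapolates toward the conditional one. The sketch below shows only that standard rule on plain lists. How the paper makes the guidance modality-aware is not described here, so the choice of what counts as the "condition" (for example, the other modality's features) is left as an assumption.

```python
from typing import List, Sequence


def classifier_free_guidance(uncond: Sequence[float],
                             cond: Sequence[float],
                             scale: float) -> List[float]:
    """Standard CFG: uncond + scale * (cond - uncond), element-wise."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]
```

With `scale = 1.0` the result equals the conditional prediction, and larger values push the output further in the direction the condition suggests.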

04

Title: InternVideo-Next: Towards World Understanding Video Models

Authors: Chenting Wang, Yuhan Zhu, Yicheng Xu, Jiange Yang, Ziang Yan, Yali Wang, Yi Wang, Limin Wang

Affiliations: Shanghai Jiao Tong University; Shanghai AI Laboratory; Shanghai Innovation Institute; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; Nanjing University

Paper Summary:

Large-scale video-text pretraining has achieved strong performance, but it often relies heavily on noisy synthetic text and overlooks implicit physical-world knowledge such as object motion, 3D geometry, and physical cues. At the same time, masked video modeling methods that directly use spatiotemporal structure often underperform on general tasks because pixel-level reconstruction conflicts with high-level semantics or latent-space prediction encourages shortcut learning. To address these architectural issues, the authors propose InternVideo-Next, a two-stage video-only pretraining architecture for physical-world understanding. The method decouples the traditional encoder-decoder architecture into an encoder-predictor-decoder (EPD) framework, where the predictor acts as a latent world model. In the first stage, it uses a conditional diffusion decoder and reliable image-level semantic priors to build a latent space that preserves both semantics and low-level details. In the second stage, it learns world knowledge by predicting frozen target features in this latent space, mitigating shortcut learning. Experiments show that with only public unlabeled video data for pretraining, InternVideo-Next achieves state-of-the-art performance on benchmarks including action recognition, fine-grained motion, depth estimation, and object tracking. It is also the first video-only model to surpass image-text pretrained models on Kinetics-400 and SSv2 without explicit video-text supervision, offering an efficient and scalable path for general video representation learning.

05

Title: DDT: Decoupled Diffusion Transformer

Authors: Shuai Wang, Zhi Tian, Weilin Huang, Limin Wang

Affiliations: Nanjing University; ByteDance

Link: https://arxiv.org/abs/2504.05741

Paper Summary:

Diffusion Transformers achieve excellent generation quality but require many training iterations and numerous inference steps. At each denoising step, a Diffusion Transformer encodes noisy inputs to extract low-frequency semantic features and then decodes high-frequency information with modules of the same structure. This architecture contains an inherent optimization conflict: low-frequency semantic encoding suppresses high-frequency features, creating a trade-off between semantic encoding and high-frequency decoding. This paper proposes the Decoupled Diffusion Transformer (DDT), which uses a dual-branch design with a dedicated condition encoder for semantic feature extraction and an independent velocity decoder for high-frequency reconstruction. Experiments show that increasing encoder capacity brings consistent gains as the model scales. On ImageNet at 256x256 resolution, DDT-XL/2 achieves an FID of 1.31, setting a new state of the art and improving training convergence speed by nearly 4x over existing Diffusion Transformers. At 512x512 resolution, DDT-XL/2 further achieves an FID of 1.28. The decoupled architecture also enables self-conditioning information reuse across adjacent denoising steps, substantially accelerating inference. To minimize performance loss, the paper introduces a statistical dynamic programming strategy to solve the optimal information-reuse schedule.

06

Title: TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning

Authors: Tao Wu, Li Yang, Gen Zhan, Yabin Zhang, Yiting Liao, Junlin Li, Deliang Fu, Li Zhang, Limin Wang

Affiliations: Nanjing University; ByteDance; Shanghai AI Laboratory

Paper Summary:

Improving the temporal understanding ability of multimodal large language models (MLLMs) is essential for long-video analysis and supports tasks such as temporal grounding and time-sensitive video question answering. Existing reinforcement learning methods for temporal reasoning are often limited to specific tasks or datasets and cannot satisfy different temporal understanding requirements across scenarios. This paper proposes TempR1, a reinforcement learning-based multi-task training method for temporal understanding in MLLMs. It builds a multi-task corpus covering diverse temporal structures and uses Group Relative Policy Optimization (GRPO) for stable cross-task optimization. TempR1 further divides temporal tasks into three interval-instance correspondence types and designs tailored localization reward functions for each type, enabling the model to capture fine-grained temporal dependencies and adapt to different temporal patterns. Extensive experiments show that TempR1 achieves leading performance across multiple benchmarks for five temporal understanding tasks. Joint optimization across complementary tasks produces clear synergistic effects, improving model generalization and single-task performance, and provides a scalable paradigm for enhancing temporal reasoning in MLLMs.
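To make the combination of a localization reward and GRPO more concrete, here is a minimal sketch. It assumes the simplest correspondence type (one predicted interval against one ground-truth interval) and uses temporal IoU as the reward; the summary does not give TempR1's actual reward functions, so `temporal_iou`, `group_relative_advantages`, and the example intervals are illustrative assumptions only.

```python
from statistics import mean, pstdev

def temporal_iou(pred, target):
    """IoU between two (start, end) intervals given in seconds."""
    inter = max(0.0, min(pred[1], target[1]) - max(pred[0], target[0]))
    union = (pred[1] - pred[0]) + (target[1] - target[0]) - inter
    return inter / union if union > 0 else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of four sampled answers for one grounding query.
target = (10.0, 20.0)
group = [(10.0, 20.0), (12.0, 22.0), (30.0, 40.0), (0.0, 15.0)]
rewards = [temporal_iou(p, target) for p in group]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the other samples in the same group, rewards with different scales across tasks can be mixed in one training run, which is one plausible reason a group-relative method suits multi-task training.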

07

Title: VMonarch: A Sub-Quadratic Attention Mechanism for Video Diffusion Transformers

Authors: Cheng Liang, Haoxian Chen, Liang Hou, Qi Fan, Gangshan Wu, Xin Tao, Limin Wang

Affiliations: Nanjing University; Kling Team, Kuaishou

Paper Summary:

The quadratic complexity of attention severely limits the context length of Video Diffusion Transformers (Video DiTs). The authors observe that the highly sparse spatiotemporal attention patterns in Video DiTs can be naturally represented by Monarch matrices, a class of structured matrices with flexible sparsity that enable sub-quadratic attention computation through alternating minimization. Based on this insight, they propose VMonarch, a new attention mechanism for Video DiTs that uses structured Monarch matrices to compute dynamic sparse patterns efficiently. The method includes a spatiotemporal Monarch decomposition to explicitly capture intra-frame and inter-frame correlations, a recomputation strategy to mitigate artifacts caused by instability during alternating minimization, and an online entropy algorithm fused into FlashAttention for fast Monarch matrix updates in long-sequence settings. Experiments show that after only limited tuning, VMonarch reaches comparable or better generation quality than full attention on VBench. It breaks the attention bottleneck of Video DiTs, reducing attention FLOPs by 17.5x, accelerating long-video attention by more than 5x, and outperforming state-of-the-art sparse attention methods at 90% sparsity.

08

Title: CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning

Authors: Jiange Yang, Yansong Shi, Haoyi Zhu, Mingyu Liu, Kaijing Ma, Yating Wang, Gangshan Wu, Tong He, Limin Wang

Affiliations: Nanjing University; Shanghai AI Laboratory

Paper Summary:

This paper proposes CoMo, a framework for unsupervised learning of continuous latent motion representations from large-scale Internet videos. Existing discretization-based methods often lose fine-grained motion information and are inconsistent with the continuous distribution of robot actions, hindering unified policy learning. CoMo addresses these issues with an early temporal differencing mechanism and a temporal contrastive learning strategy. Together, they improve the model’s ability to avoid shortcut learning, focus the extracted latent motion representations on meaningful foreground motion regions, and strengthen motion cues. CoMo also shows strong zero-shot generalization by generating effective pseudo action labels for unseen videos without action annotations. Its continuous latent motion representations align naturally with the continuous distribution of real robot actions, benefiting unified policy learning. Extensive simulation and real-robot experiments show that adding CoMo pseudo-labeled video data to joint training significantly improves robot policy performance across many manipulation tasks. Overall, CoMo provides a unified and more precise action-labeling solution for large-scale heterogeneous video data and offers an efficient path toward general and scalable robot policy learning.

09

Title: AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs

Authors: Lidong Lu, Guo Chen, Wei Zhu, Zhiqi Li, Yicheng Liu, Tong Lu

Affiliations: Nanjing University; China Mobile Zijin Innovation Institute

Link: https://arxiv.org/abs/2506.05328; https://av-reasoner.github.io/

Paper Summary:

Although multimodal large language models have made significant progress in image captioning, video question answering, and audio-video understanding, they still struggle with counting, especially in long videos that require fine-grained object recognition, spatiotemporal localization, cross-modal alignment, and de-duplication across multiple instances. Existing video counting benchmarks typically involve short scenarios, limited question forms, insufficient explainable clue annotations, and weak audio-video joint evaluation, making it difficult to assess whether models truly have interpretable counting reasoning ability. This work introduces CG-AV-Counting, a manually annotated benchmark for long-video audio-visual counting. It contains 497 real long videos, 1,027 multimodal counting questions, and 5,845 fine-grained clues covering events, objects, and attributes. The authors further propose AV-Reasoner, which improves perception, localization, and reasoning through cold-start supervised fine-tuning, curriculum reinforcement learning, staged review, and full-task reinforcement learning under limited counting annotations. Experiments show that mainstream multimodal models still lag far behind humans on long-video counting, while AV-Reasoner achieves significant improvements across counting and audio-video understanding benchmarks, providing a new benchmark and method for fine-grained multimodal reasoning and interpretable video counting.

10

Title: Will Multimodal Models Be Dazzled by Multi-Image Visual Puzzles?

Authors: Zhi Zhu, YaoQi Fan, Zhe Chen, Yue Cao, Yangzhou Liu, Tong Lu

Affiliations: Nanjing University

Paper Summary:

With the rapid development of multimodal large language models (MLLMs), the limitations of existing benchmarks in evaluating complex reasoning across multiple images have become increasingly clear. To fill this gap, this work introduces MIRACLE, a benchmark designed for evaluating complex multi-image reasoning and logical understanding. MIRACLE contains 4,000 high-quality items spanning visual comparison, temporal analysis, spatial relationships, and other reasoning dimensions. Its key strength lies in its strict emphasis on inter-image dependency. Through systematic data collection, careful instance grouping, and targeted question design, the benchmark forces models to solve tasks through cross-image logical integration rather than single-image recognition. Experiments show that current leading MLLMs, such as Gemini-2.5-Pro, score only 55.91% on MIRACLE, highlighting the difficulty of multi-image reasoning. The study further finds that all tested models suffer substantial performance drops in high-density visual information scenarios, such as puzzle tasks and ultra-many-image inputs. These results reveal shortcomings of current MLLMs in handling complex structural relationships and collaborative reasoning under high visual load. MIRACLE provides a new evaluation dimension and may help push multimodal reasoning beyond current boundaries.

11

Title: Bayesian Decomposition and Semantic Completion for Few-shot Semantic Segmentation

Authors: Guangchen Shi, Yirui Wu, Zhu Wei, Tao Wang, Hao Zhang, Bo Li, Tong Lu

Affiliations: Nanjing University; Hohai University; China Mobile Zijin Innovation Institute; VIVO

Paper Summary:

Few-shot semantic segmentation (FSS) aims to segment objects of novel categories using only a small number of annotated examples. Existing methods often depend on complex class-specific modeling, which makes training costly and limits generalization under few-shot conditions. To address these challenges, the authors propose a Bayesian probabilistic network (BPNet), which reformulates FSS as the combination of three interpretable components: a prior, a likelihood, and a category-consistency term. Specifically, the method uses efficient SAM to generate fragmented prior regions for the query image, while both the likelihood and consistency terms are estimated by a lightweight class-agnostic localization module (CALM). CALM uses a binary classification head to predict category consistency between support and query images and estimates the likelihood term by localizing target regions in the support image. By evaluating fragmented regions generated by SAM in parallel, CALM can efficiently identify core class regions and turn segmentation into a simple binary classification task. To address the semantic incompleteness of SAM-generated regions, the authors introduce an attention-based semantic completion module (SCM), which uses local and global context cues to integrate fragmented regions into semantically complete masks. Extensive experiments show that BPNet achieves state-of-the-art performance while maintaining efficient segmentation.

12

Title: Rethinking BCE Loss for Multi-Label Image Recognition with Fine-Tuning

Authors: Ao Zhou, Zhiwei Jiang, Zifeng Cheng, Cong Wang, Yafeng Yin, Shufan Yang, Qing Gu

Affiliations: Nanjing University

Paper Summary:

In multi-label image recognition, the authors find that when visual-language models are fine-tuned with binary cross-entropy loss, model confidence becomes systematically distorted: predictions for base classes seen during training are overly conservative, while predictions for unseen novel classes are overly confident. Existing calibration methods cannot adequately solve this issue. The paper proposes Class-wise Covariance Regularization (CCR), which uses a large number of negative samples to construct a prediction covariance matrix and aligns it with the semantic correlations of text embeddings, thereby preserving the stability of inter-class geometry during fine-tuning. CCR not only significantly improves confidence reliability but also improves recognition and calibration for head, tail, and novel classes. The method is plug-and-play and compatible with existing fine-tuning frameworks, including prompt fine-tuning and adapter fine-tuning. It has important application value in real-world scenarios requiring trustworthy predictions, such as medical imaging and autonomous driving.


ICLR 2026 Accepted Papers

Fri, 24 Apr 2026 00:00:00 +0000

ICLR (International Conference on Learning Representations) is one of the leading AI conferences focusing on deep learning and representation learning. Eleven papers from the Large Model Innovation Center at Nanjing University’s School of Computer Science were accepted at ICLR 2026.

01

Title: Long-Context Attention Benchmark: From Kernel Efficiency to Distributed Context Parallelism

Authors: Tao Bu, Qiangang Wang, Bowen Zeng, Hanwen Sun, Yunpeng Huang, Chun Cao, Jingwei Xu

Affiliations: Nanjing University; Peking University; Zhejiang University

Link: https://arxiv.org/abs/2510.17896

Abstract:

Transformer-based large language models have achieved remarkable success, but their softmax attention mechanism incurs quadratic computation and memory costs as sequence length grows, making it a key bottleneck for long-context training. Existing work mainly optimizes attention from two perspectives: operator-level acceleration for dense or sparse attention, and module-level distributed attention or context parallelism across devices. However, the field still lacks a systematic benchmark that comprehensively compares attention operators and clearly analyzes the performance of different context-parallel methods. This work proposes a unified benchmarking framework that integrates multiple attention operators and context-parallel mechanisms, evaluating them along two key dimensions: attention-mask patterns and the interaction between sequence length and distributed scale. Experiments on up to 96 GPUs provide reproducible comparisons, reveal trade-offs among methods, and offer practical guidance for designing and deploying attention mechanisms in long-context LLM training.
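The quadratic growth described above can be checked with simple arithmetic. The sketch below counts only the two matrix products in softmax attention and the size of a fully materialized score matrix; the function names and the example sizes are assumptions for illustration, and fused kernels avoid storing the full matrix in practice.

```python
def attention_matmul_flops(seq_len, head_dim, num_heads):
    """FLOPs for QK^T plus (attention weights)V, counting a multiply-add as 2 FLOPs."""
    per_matmul = 2 * seq_len * seq_len * head_dim
    return 2 * per_matmul * num_heads

def score_matrix_bytes(seq_len, num_heads, bytes_per_elem=2):
    """Memory needed to materialize the full seq_len x seq_len score matrix."""
    return seq_len * seq_len * num_heads * bytes_per_elem

# Doubling the sequence length quadruples both quantities.
ratio = attention_matmul_flops(8192, 128, 32) / attention_matmul_flops(4096, 128, 32)

# One head at 131,072 tokens in 16-bit precision: 2**35 bytes, i.e. 32 GiB.
gib = score_matrix_bytes(131072, 1) / 2**30
```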

02

Title: PoseX: A Large-Scale Cross-Docking Benchmark for Real-World Protein-Ligand Docking

Authors: Yize Jiang, Xinze Li, Yuanyuan Zhang, Jin Han, Youjun Xu, Ayush Pandit, Zaixi Zhang, Mengdi Wang, Mengyang Wang, Chong Liu, Guang Yang, Yejin Choi, Wu-Jun Li, Tianfan Fu, Fang Wu, Junhong Liu

Affiliations: WecoAI; Princeton University; Nanjing University; ByteDance; Stanford University; Peking University

Link: https://arxiv.org/abs/2505.01700

Abstract:

Molecular docking is a core technique for biopharmaceutical research and industrial enzyme engineering, yet traditional methods struggle to handle dynamic protein conformations. Cross-docking has therefore become a widely recognized challenge. The field has long lacked a unified, high-quality benchmark oriented toward real-world practice, causing many algorithms that perform well in laboratory settings to fall short in practical deployment. PoseX introduces an open collaborative evaluation platform and the first large-scale benchmark dedicated to cross-docking. It contains 718 self-docking samples and 1,312 cross-docking samples, covering 24 mainstream algorithms across physics-based methods, AI docking, and AI co-folding. Rigorous evaluations show that leading AI algorithms comprehensively outperform traditional physics-based methods in cross-docking; SurfDock achieves state-of-the-art performance, with a success rate above 77% after relaxation. The benchmark also clarifies the performance gap between blind docking and pocket-specified docking, as well as the generalization behavior of AI models. PoseX fills a practical evaluation gap and provides digital infrastructure for synthetic biology, drug discovery, and enzyme engineering.

03

Title: PixNerd: Pixel Neural Field Diffusion

Authors: Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, Limin Wang

Affiliations: Nanjing University; ByteDance Seed; National University of Singapore

Link: https://arxiv.org/pdf/2507.23268

Abstract:

The strong performance of diffusion transformers often depends on compressed latent spaces produced by pretrained variational autoencoders. This two-stage paradigm inevitably introduces error accumulation and decoding artifacts. Pixel-space modeling is a promising alternative, but existing approaches often require complex cascaded pipelines and substantially increase token-level computation. Inspired by the simplicity and efficiency of latent diffusion transformers, this work proposes a pixel-space diffusion approach based on large image patches and neural-field decoding, forming a lightweight, single-stage, end-to-end solution called PixNerd. With efficient neural-field representation, PixNerd achieves an FID score of 1.93 on ImageNet at 256x256 resolution without cascaded architectures or VAEs, while reducing inference latency by nearly 8x compared with existing pixel-level diffusion models. The framework is further extended to text-to-image generation, achieving a GenEval score of 0.73 and a DPG score of 80.9.

04

Title: RIVER: A Real-Time Interaction Benchmark for Video LLMs

Authors: Yansong Shi, Qingsong Zhao, Tianxiang Jiang, Xiangyu Zeng, Yi Wang, Limin Wang

Affiliations: University of Science and Technology of China; Shanghai AI Laboratory; Fudan University; Nanjing University

Abstract:

Video large language models have demonstrated impressive capabilities, but most operate in an offline setting, limiting their potential for real-time interaction. RIVER Bench addresses this gap by evaluating how models interact with humans through streaming multimedia inputs. The benchmark introduces three task categories: retrospective memory, real-time perception, and proactive response. Rather than following the conventional one-shot full-video understanding paradigm, it simulates interactive human dialogue. The benchmark integrates heterogeneous video sources with varying durations and fine-grained annotations to create realistic real-time interaction settings. Results show that offline models may perform well on single-turn QA but struggle in real-time scenarios. The work further proposes a general improvement strategy for limitations such as weak long-term memory and insufficient future awareness, improving flexibility and adaptability in real-time interaction.

05

Title: Arbitrary Generative Video Interpolation

Authors: Guozhen Zhang, Haiguang Wang, Chunyu Wang, Yuan Zhou, Qinglin Lu, Limin Wang

Affiliations: Nanjing University; Tencent Hunyuan; Shanghai AI Laboratory

Abstract:

Generative video frame interpolation synthesizes intermediate frames between given start and end frames and plays an important role in video creation. Existing generative VFI methods, however, are usually limited to producing a fixed number of intermediate frames, which restricts flexible control over frame rate and duration. This work proposes ArbInterp, a new generative VFI framework that supports arbitrary timestamps and arbitrary-length interpolation. To handle arbitrary timestamps, it introduces timestamp-aware rotary positional encoding, which modulates temporal RoPE positions so generated frames align with target normalized timestamps. For arbitrary-length interpolation, the method decomposes long-sequence generation into segmented frame synthesis and designs a disentangled appearance-motion conditioning strategy. Appearance consistency is maintained through previous segment boundaries, while temporal semantics preserve motion continuity across segments. Experiments on a multi-scale interpolation benchmark from 2x to 32x show that ArbInterp outperforms existing methods across settings, achieving higher fidelity and smoother spatiotemporal continuity.
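As a rough illustration of the timestamp-aware idea, the sketch below applies standard rotary positional encoding at a fractional temporal position derived from a normalized timestamp. The linear mapping in `timestamp_to_position` and all function names are assumptions; the summary does not specify how ArbInterp modulates the RoPE positions.

```python
import math

def rope_angles(position, dim, base=10000.0):
    """Rotation angles for one (possibly fractional) position; dim must be even."""
    return [position / (base ** (2 * i / dim)) for i in range(dim // 2)]

def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by the position-dependent angles."""
    out = []
    for i, theta in enumerate(rope_angles(position, len(vec), base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

def timestamp_to_position(t, num_slots):
    """Map a normalized timestamp t in [0, 1] onto the temporal axis (assumed linear mapping)."""
    return t * (num_slots - 1)

# A frame requested at t = 0.3 between the start (t = 0) and end (t = 1) frames.
pos = timestamp_to_position(0.3, num_slots=9)
encoded = apply_rope([1.0, 0.0, 0.5, -0.5], pos)
```

Since rotary encoding only requires a real-valued position, a frame at any timestamp between the two endpoints can be encoded without changing the number of learned parameters.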

06

Title: VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

Authors: Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, Yu Qiao, Yali Wang, Limin Wang

Affiliations: Shanghai AI Laboratory; Nanjing University; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences

Abstract:

Long-context video modeling is essential for multimodal large language models, enabling them to process movies, online streams, and other long-form content. Despite recent progress, efficient understanding of extremely long video contexts remains challenging. This work provides a systematic solution across model architecture, training data, training strategy, and evaluation. It first proposes hierarchical video token compression (HiCo), which exploits visual redundancy in long videos and compresses video context from clip level to video level, retaining key information while reducing computation by approximately 50x with nearly no performance loss. It then introduces a short-to-long multi-stage learning paradigm, builds LongVid, a large-scale real-world long-video dataset, and designs a challenging multi-hop needle-in-a-video-haystack benchmark. The resulting VideoChat-Flash model achieves leading performance on both long- and short-video benchmarks at 2B and 7B scales, and reaches 99.1% accuracy on a 10,000-frame NIAH test among open-source models.

07

Title: CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval

Authors: Yifan Xu, Xinhao Li, Yichun Yang, Desen Meng, Rui Huang, Limin Wang

Affiliations: Nanjing University; Shanghai AI Laboratory

Abstract:

Video captioning and video retrieval remain important challenges for video-language models. Existing benchmarks often contain only short text descriptions, making it difficult to evaluate whether models deeply understand fine-grained video details. CaReBench addresses this issue with a fine-grained benchmark for detailed video captioning and retrieval. It contains 1,000 high-quality video-description pairs, and each video is further annotated with spatial and temporal splits. Based on this design, the authors propose ReBias for retrieval and CapST for captioning, enabling systematic analysis of spatial and temporal biases in video-language models. The work also builds a unified baseline based on multimodal large language models, supporting both fine-grained video retrieval and detailed video captioning through two-stage supervised fine-tuning. Experiments show competitive performance against CLIP-style retrieval models and mainstream multimodal LLMs, suggesting the potential of unified multimodal LLM modeling for both tasks.

08

Title: Balancing the Experts: Unlocking LoRA-MoE for GRPO via Mechanism-Aware Rewards

Authors: Changlian Ma, Zizheng Huang, Xiangyu Zeng, Yi Wang, Cheng Liang, Kun Tian, Xinhai Zhao, Limin Wang

Affiliations: Nanjing University; Shanghai AI Laboratory; Huawei Noah’s Ark Lab; Shanghai Innovation Institute; Shanghai Jiao Tong University

Abstract:

Parameter-efficient mixture-of-experts architectures such as LoRA-MoE perform well in fine-tuning, but when combined with advanced reinforcement learning algorithms such as GRPO, they often suffer from severe routing collapse and insufficient parameter utilization. This work proposes RO-GRPO, a mechanism-aware reinforcement fine-tuning framework. Its core idea is to convert internal expert-routing statistics collected during training, such as routing entropy and load distribution, into direct scalar reward signals. Routing supervision is seamlessly integrated into reinforcement fine-tuning without adding extra training stages or differentiable auxiliary losses. Experiments on both unimodal and multimodal mathematical reasoning benchmarks show that RO-GRPO significantly improves task performance and expert load balancing while mitigating text degeneration. The work demonstrates that scalar rewards in GRPO can explicitly guide optimization of internal mechanisms, extending model alignment from external behavior to internal mechanism alignment.
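One way to picture turning routing statistics into a scalar reward is the sketch below, which scores a rollout by the normalized entropy of its expert load and adds that to the task reward. The normalization, the `weight` value, and the function names are assumptions; the summary does not give RO-GRPO's actual reward formula.

```python
import math

def routing_entropy(load_fractions):
    """Shannon entropy (in nats) of the expert load distribution."""
    return -sum(p * math.log(p) for p in load_fractions if p > 0)

def routing_reward(expert_counts):
    """Normalized entropy in [0, 1]: 1.0 is perfectly balanced, 0.0 is full collapse."""
    total = sum(expert_counts)
    n = len(expert_counts)
    if total == 0 or n < 2:
        return 0.0
    fractions = [c / total for c in expert_counts]
    return routing_entropy(fractions) / math.log(n)

def total_reward(task_reward, expert_counts, weight=0.1):
    """Task reward plus a weighted routing bonus (the weight is a hypothetical hyperparameter)."""
    return task_reward + weight * routing_reward(expert_counts)

balanced = total_reward(1.0, [5, 5, 5, 5])     # tokens spread evenly over 4 experts
collapsed = total_reward(1.0, [20, 0, 0, 0])   # every token routed to one expert
```

The routing term enters only as a number attached to each rollout, so no gradient has to flow through the router statistics, in line with the summary's point that no differentiable auxiliary loss is added.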

09

Title: UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation

Authors: Zhenrong Yue, Haiyu Zhang, Xiangyu Zeng, Boyu Chen, Chenting Wang, Shaobin Zhuang, Lu Dong, Yi Wang, Limin Wang, Yali Wang

Affiliations: Shanghai Jiao Tong University; Shanghai AI Laboratory; Beihang University; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; Nanjing University; University of Science and Technology of China; University of Chinese Academy of Sciences

Abstract:

Tokenizers are central components for visual understanding and generation. Recent research has turned toward unified tokenizers, but existing methods face a clear trade-off between understanding and generation due to the conflict between high-level semantic abstraction and low-level pixel reconstruction. UniFlow is a unified tokenizer that flexibly adapts any visual encoder through a simple reconstruction decoder. It introduces layer-wise adaptive self-distillation to well-pretrained visual encoders, allowing UniFlow to inherit strong semantic features for understanding while adapting to fine-grained details for generation. It also proposes a lightweight patch-wise pixel flow decoder, which models the conditional flow from noise to patch pixels for efficient high-fidelity reconstruction. By using semantic features as visual conditions and simplifying the data distribution through patch-wise learning, UniFlow reduces the training conflict between understanding and generation. Extensive experiments on 13 benchmarks across seven visual understanding and generation tasks show a win-win result: UniFlow-XL 7B outperforms TokenFlow-XL 14B by about 6.05% on average across understanding benchmarks while remaining competitive in reconstruction and generation, surpassing UniTok by 0.15 rFID and 0.09 gFID without guidance.

10

Title: The First Impression Problem: Internal Bias Triggers Overthinking in Reasoning Models

Authors: Renfei Dang, Zhening Li, Shujian Huang, Jiajun Chen

Affiliations: Nanjing University

Link: https://iclr.cc/virtual/2026/poster/10011746

Abstract:

Reasoning models often exhibit overthinking. This work argues that internal bias triggered by the input question is a key cause. When a model receives a user question, it forms an initial guess before systematic reasoning; because this guess is usually not explicitly output, the authors define it as internal bias. When this initial guess conflicts with later reasoning or the final answer, the model tends to fall into excessive reflection. The authors verify a significant correlation between internal bias and overthinking across multiple models and reasoning tasks. To establish causality, they design two counterfactual intervention experiments: removing the input question after the model forms an initial tendency reduces the influence of question-induced bias and redundant reasoning, while manually injecting bias changes the degree of overthinking. Interpretability experiments further show that excessive attention to the input question is a key mechanism through which internal bias affects later reasoning trajectories. The work also evaluates existing methods for reducing overthinking, finding that internal bias remains persistent across tested settings.

11

Title: DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization

Authors: Shuaijie She, Yu Bao, Yu Lu, Lu Xu, Tao Li, Wenhao Zhu, Jianbing Zhang, Shujian Huang, Shanbo Cheng, Lu Lu, Yuxuan Wang

Affiliations: Nanjing University; ByteDance Seed

Link: https://iclr.cc/virtual/2026/poster/10009423

Abstract:

Training large language models still faces a key bottleneck: reinforcement learning from human feedback relies on costly human annotation, while reinforcement learning with verifiable rewards reduces annotation needs but is limited to verifiable tasks such as math and code. Traditional dual learning provides self-supervised feedback through task duality, but its strict bidirectional invertibility requirement restricts it to a small set of symmetric tasks and is affected by asymmetry between forward and reverse capabilities. DuPO proposes a preference optimization framework based on generalized duality. It decomposes the input of the original task into known and unknown parts, and redefines the dual task as reconstructing the unknown part using the forward output and known information, thereby mitigating capability asymmetry. Experiments show that DuPO improves multiple tasks without external annotation: it raises average COMET by 2.1 points across 756 translation directions, improves accuracy by 6.4 percentage points on four challenging mathematical reasoning benchmarks, and brings a 9.3 percentage-point gain at inference time through reranking without additional fine-tuning. These results suggest a scalable, general, annotation-free path for LLM self-improvement.
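The generalized-duality idea can be illustrated with a toy arithmetic task. In the sketch below the forward task is "known + unknown = ?", and the dual task reconstructs the unknown operand from a candidate answer and the known operand. In DuPO the dual task is carried out by the model itself; the exact subtraction here is a stand-in, and all names are hypothetical.

```python
def dual_reconstruct(forward_output, known):
    """Dual task for known + unknown = y: recover the unknown part from y and the known part."""
    return forward_output - known

def duality_reward(candidate_output, known, unknown):
    """1.0 if the candidate answer lets us reconstruct the hidden input, else 0.0."""
    return 1.0 if dual_reconstruct(candidate_output, known) == unknown else 0.0

def rank_candidates(candidates, known, unknown):
    """Order candidate answers by duality reward; ties keep their input order."""
    return sorted(candidates,
                  key=lambda y: duality_reward(y, known, unknown),
                  reverse=True)

# Forward question: 17 + 25 = ?  The sampled answers need no human label to be scored.
ranked = rank_candidates([40, 42, 41], known=17, unknown=25)
chosen, rejected = ranked[0], ranked[-1]   # a preference pair for optimization
```

The same ranking can be used at inference time to rerank sampled answers, which corresponds to the fine-tuning-free gain mentioned in the summary.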


ISCA 2026 | New MoE LLM Edge Inference Acceleration Method Reduces Decoding Latency by 48%

Fri, 17 Apr 2026 00:00:00 +0000

00 Overview

Recently, Professor Haipeng Dai’s team at Nanjing University made an important breakthrough in accelerating large language model inference on edge devices. By designing a new algorithm-system co-design mechanism called Expert Substitution, the team addresses the high latency caused by dynamic offloading when deploying Mixture-of-Experts (MoE) models on edge hardware with limited GPU memory. The paper “SMoE: An Algorithm-System Co-Design for Pushing MoE to the Edge via Expert Substitution” has been accepted by the 53rd Annual International Symposium on Computer Architecture (ISCA). This is the first ISCA paper led by the Nanjing University team.

ISCA is a CCF-A conference and one of the oldest and most authoritative conferences in computer architecture. First held in 1973, it has a history of more than 50 years. The conference is jointly sponsored by ACM SIGARCH and IEEE TCCA.

01 Motivation

As large language models are increasingly deployed on edge devices, the MoE architecture has become a promising low-cost inference approach. However, because edge devices have limited GPU memory, they cannot hold all experts at once. During inference, systems must frequently offload some experts to slower CPU memory. Since PCIe transfer and CPU computation are 10 to 100 times slower than GPU execution, this data movement introduces severe inference latency. Through an in-depth analysis of fine-grained MoE models, the research team identifies a blind spot in existing offloading strategies: they ignore the large importance differences among activated experts. In practice, although each step activates the top-k experts, only a few experts usually receive high gating scores, while the remaining activated experts have very low scores, sometimes close to those of inactive experts.

This observation reveals a fundamental problem in current online MoE offloading mechanisms. The system spends substantial time on CPU computation and PCIe transfer merely to process low-score experts that have little effect on the final output. Designing a GPU-friendly expert scheduling mechanism that greatly reduces inference latency without harming model accuracy is therefore the central challenge.

02 Solution

Figure 1: The core idea of SMoE and its expert-substitution scheduling mechanism

The paper proposes SMoE from an algorithm-system co-design perspective. SMoE addresses the challenge through three coordinated mechanisms: low-score expert substitution, high-score expert prefetching, and CPU-assisted task scheduling. Its core idea is to move beyond treating offloading as a pure scheduling problem. Instead, it uses expert importance to guide decisions and directly replaces low-importance activated experts with functionally similar idle experts already cached in GPU memory, reducing memory use, data transfer, and PCIe overhead while preserving accuracy.

For low-score expert substitution, SMoE designs an expert-cache router and a history-score-based cache eviction strategy to identify low-score experts accurately and replace them with functionally similar idle experts already in GPU memory, maximizing the GPU expert cache hit rate. For high-score expert prefetching, the system loads only predicted high-score experts, reducing PCIe bandwidth pressure and enabling effective overlap between data loading and computation. For CPU-assisted computation, SMoE introduces a dynamic two-pointer scheduling algorithm to balance CPU computation and PCIe transfer time, handling experts that cannot be substituted or successfully prefetched and preventing pipeline stalls.
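The substitution step can be sketched as follows for a single token. This is a simplified illustration under stated assumptions: the gating score is used as a stand-in for functional similarity, `threshold` is a hypothetical cutoff, and prefetching, cache eviction, and the two-pointer CPU scheduler are left out.

```python
def select_experts(scores, k, cached, threshold):
    """Return (expert_ids, to_load) for one token.

    scores: gating score per expert; cached: set of expert ids resident in GPU memory.
    Activated experts that score below `threshold` and miss the cache are swapped
    for the best-scoring idle expert that is already cached.
    """
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    active = ranked[:k]
    idle_cached = [e for e in ranked[k:] if e in cached]  # best-scoring first
    chosen, to_load = [], []
    for e in active:
        if e in cached:
            chosen.append(e)                    # cache hit, nothing to move
        elif scores[e] < threshold and idle_cached:
            chosen.append(idle_cached.pop(0))   # substitute, no PCIe transfer
        else:
            chosen.append(e)
            to_load.append(e)                   # important expert: fetch it or run it on CPU
    return chosen, to_load

# Six experts, top-4 routing, experts 0, 1, and 2 resident in GPU memory.
scores = [0.40, 0.05, 0.30, 0.04, 0.15, 0.06]
chosen, to_load = select_experts(scores, k=4, cached={0, 1, 2}, threshold=0.10)
# Expert 5 (score 0.06) is replaced by cached expert 1; expert 4 still has to be loaded.
```

In this example only one of the two cache misses triggers data movement, which is the effect the mechanism aims for: transfers are spent on high-score experts only.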

03 Results

Figure 2: TPOT comparison between SMoE and different methods across workloads

Evaluations in realistic low-batch edge inference settings show that SMoE achieves strong performance in both decoding latency (TPOT) and model accuracy. Compared with existing state-of-the-art methods, SMoE reduces average latency by 24% at batch=1 and by 35% at batch=3. In particular, on A6000 hardware, SMoE reduces decoding latency by up to 48% and maintains an expert GPU cache hit rate above 60%.

For model accuracy, extensive tests on datasets including Gaokao, MMLU, and HumanEval show that when the expert-substitution threshold is kept within a reasonable range, such as below 0.35, SMoE introduces almost negligible accuracy loss.

SMoE explores a new path from pure scheduling optimization to score-based expert substitution. This work is the team’s latest research result in MLSys and offers a promising solution for efficient and nearly lossless deployment of large models on memory-constrained edge devices.


Professor Haipeng Dai Elected Fellow of the Institution of Engineering and Technology

Fri, 17 Oct 2025 00:00:00 +0000

Congratulations to Professor Haipeng Dai of Nanjing University on being elected a Fellow of the Institution of Engineering and Technology (IET Fellow).

Recently, after selection by the Institution of Engineering and Technology (IET), Professor Haipeng Dai of Nanjing University was elected an IET Fellow. This marks an important achievement for the laboratory in developing high-level and leading talent.

About IET

The Institution of Engineering and Technology is a leading global professional society in engineering and technology. It currently has 167,000 members across 150 countries and is the largest professional engineering institution in Europe and the second largest in the world. IET Fellow is the highest academic honor granted by the IET to outstanding senior professionals who have made significant achievements in science and engineering. Each year, the IET selects about 200 to 300 Fellows, with recipients from mainland China accounting for about 10%.

About Professor Haipeng Dai

Haipeng Dai is an associate professor and doctoral supervisor at the School of Computer Science, Nanjing University. He is a recipient of a national young-talent program, an IET Fellow, a CCF Distinguished Member, and a senior member of ACM and IEEE. He has received honors including the ACM China Rising Star Award, the IEEE Technical Committee on Scalable Computing Middle Career Researcher Award, and the Outstanding Scientific and Technological Worker Award of the Chinese Institute of Electronics. His research focuses on the Internet of Things, data mining, and mobile computing.

He has published more than 300 papers in leading international conferences and journals, including more than 130 CCF-A papers in venues such as NSDI, UbiComp, INFOCOM, SIGMOD, VLDB, ICDE, KDD, WWW, EuroSys, and ATC. He has received more than ten best-paper and distinguished-paper awards from top conferences and journals, including outstanding scientific papers recognized by the China Association for Science and Technology and four best-paper awards at CCF-A/B conferences. He has led a key task under the National Key R&D Program and has hosted or participated in more than ten projects including National Natural Science Foundation of China general and key joint-fund projects.

Professor Dai received the first prize of the 2024 Jiangsu Computer Society Science and Technology Award as the first contributor and the second prize of the 2025 China Invention Association Invention and Entrepreneurship Achievement Award as the second contributor. He serves as secretary-general of ACM SIGCOMM China and as a standing committee member of the CCF Technical Committee on Internet of Things and the Technical Committee on Network and Data Communications. He has chaired more than ten conferences including ISPA, HPCC, ICNP, and COCOON, and serves as area editor of COMNET, editorial board member of TII, and youth editorial board member of Acta Electronica Sinica. He has been included in the World’s Top 2% Scientists list.

Professor Dai’s election as an IET Fellow recognizes his outstanding contributions to the Internet of Things, edge computing, and related fields. It also reflects the laboratory’s growing international influence and academic reputation. We congratulate Professor Dai on this honor and look forward to his continued achievements in research and talent cultivation.


Scientific Data 2025 | TrialBench, the First Multimodal AI Platform for Clinical-Trial Prediction, Released

Fri, 17 Oct 2025 00:00:00 +0000

Clinical trials are a critical bridge from laboratory drug discovery to patient treatment, but the process is highly challenging: the average success rate is below 15%, timelines often exceed ten years, and costs can reach billions of dollars.

In September 2025, TrialBench, jointly developed by teams from HKUST (Guangzhou), Nanjing University, Harvard, Stanford, IQVIA, and other institutions, was formally published in Scientific Data, a Nature Portfolio journal. It is the world’s first multimodal clinical-trial prediction dataset designed for AI.

Platform Value

TrialBench integrates 23 sub-datasets and covers eight core prediction tasks:

Predicting trial duration
Predicting patient dropout rate
Predicting serious adverse events
Predicting mortality events
Predicting whether a trial will be approved
Identifying failure reasons
Automatically generating inclusion criteria
Recommending reasonable dosage

These tasks summarize eight key clinical-trial prediction problems.

Technical Features

The platform integrates multi-source data and advanced AI techniques:

Graph neural networks for drug molecular structures
Bio-BERT for clinical text
Hierarchical attention models for disease-code understanding

It also provides complete baseline models, evaluation metrics, and multimodal fusion methods, with Python and R toolkits for out-of-the-box use.

Applications

Experiments show that across 14 binary classification tasks, multimodal models achieve F1 scores above 0.7 on 11 tasks, demonstrating strong predictive capability. Google DeepMind has already used TrialBench in TxGemma for adverse-event prediction, and the AUTOCT project also uses it as a benchmark evaluation platform.
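For readers less familiar with the metric, F1 on a binary task is the harmonic mean of precision and recall. The sketch below computes it from scratch. The labels are made up for illustration and are not TrialBench data.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical predictions for six trials.
print(f1_score([1, 1, 1, 0, 0, 1], [1, 1, 0, 0, 1, 1]))
```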

Open Access

TrialBench is open to researchers worldwide. It aims to promote deeper integration between AI and medical research, improve clinical-trial design, and accelerate new drug development.

Platform: https://huyjj.github.io/Trialbench/


NeurIPS 2025 Accepted Papers Overview

Sat, 11 Oct 2025 00:00:00 +0000

NeurIPS (Annual Conference on Neural Information Processing Systems) is a top-tier machine learning conference. Alongside ICML and ICLR, it is recognized as one of the most challenging, highest-level, and most influential conferences in the field. NeurIPS is a CCF Class A conference and a Core Conference Ranking Class A conference, with an H5 index of 278. Founded in 1987 by neural-network researchers from the connectionist school, NeurIPS has grown steadily in influence, and its papers focus primarily on machine learning, artificial intelligence, and statistics.

The Large Model Center at Nanjing University’s School of Computer Science has 9 papers accepted to NeurIPS 2025.

01

Title: Gated Integration of Low-Rank Adaptation for Continual Learning of Large Language Models

Authors: Yan-Shuo Liang, Jia-Rui Chen, Wu-Jun Li

Institution: Nanjing University

Abstract:

Thanks to the rich knowledge obtained from large-scale pre-training and subsequent fine-tuning strategies, existing large language models (LLMs) have demonstrated excellent performance across a wide range of tasks. However, when LLMs learn multiple downstream tasks sequentially, they often forget previously learned knowledge, leading to significant performance degradation on old tasks, a phenomenon known as catastrophic forgetting. Catastrophic forgetting hinders LLMs from continuously accumulating new knowledge, making it crucial to design continual learning methods that can overcome this challenge. Meanwhile, Low-Rank Adaptation (LoRA), as one of the most representative methods in parameter-efficient fine-tuning, has gained widespread attention in continual learning for LLMs. LoRA reparameterizes pre-trained weights into low-rank forms, requiring only a small number of parameters to be updated for task adaptation. Compared to full parameter updates, LoRA significantly improves fine-tuning efficiency. However, existing LoRA-based continual learning methods still have limitations. They typically expand new LoRA branches when learning new tasks while freezing old branches, thereby avoiding forgetting caused by directly modifying old parameters. During inference, these methods usually adopt simple addition to integrate new and old branches. This approach forces new and old branches to contribute equally to old tasks, which may instead cause new branches to significantly interfere with old tasks, exacerbating forgetting and reducing overall performance. To address this, we propose GainLoRA (gated integration of low-rank adaptation), a new continual learning method for LLMs. GainLoRA expands new LoRA branches for each new task and dynamically integrates new and old branches through a gating module. By imposing initialization and update constraints on the new gating module, GainLoRA significantly reduces interference from new LoRA branches on old tasks, effectively mitigating forgetting and improving the overall performance of LLMs in continual learning.
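The difference between simple addition and gated integration can be sketched in a few lines. This is an illustrative toy, not GainLoRA's code: the helper names are hypothetical, and the gate values are passed in as fixed numbers, whereas the abstract describes a gating module that produces them.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(x, w0, branches, gates):
    """Compute y = W0 x + sum_i gates[i] * B_i (A_i x).

    branches -- list of (A_i, B_i) low-rank pairs, one per task
    gates    -- one weight per branch; all ones reproduces simple addition
    """
    y = matvec(w0, x)
    for (a, b), g in zip(branches, gates):
        delta = matvec(b, matvec(a, x))  # low-rank update B_i A_i x
        y = [yi + g * di for yi, di in zip(y, delta)]
    return y


w0 = [[1.0, 0.0], [0.0, 1.0]]
old_branch = ([[1.0, 0.0]], [[1.0], [0.0]])  # rank-1 branch from an old task
new_branch = ([[0.0, 1.0]], [[0.0], [1.0]])  # rank-1 branch from a new task
x = [1.0, 2.0]

print(lora_forward(x, w0, [old_branch, new_branch], [1.0, 1.0]))  # simple addition
print(lora_forward(x, w0, [old_branch, new_branch], [1.0, 0.0]))  # new branch gated off
```

In the second call the new branch contributes nothing, which is the behavior one would want for an input that belongs to an old task.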

Figure 1

02

Title: StreamForest: Efficient Online Video Understanding with Persistent Event Memory

Authors: Xiangyu Zeng, Kefan Qiu, Qingyu Zhang, Xinhao Li, Jing Wang, Jiaxin Li, Ziang Yan, Kun Tian, Meng Tian, Xinhai Zhao, Yi Wang, Limin Wang

Institution: Nanjing University, Shanghai AI Laboratory, Zhejiang University, Huawei Noah’s Ark Lab, Yinwang Intelligent Technology

Abstract:

Multimodal large language models have made significant progress in video understanding in recent years. However, due to historical visual feature storage limitations and insufficient real-time spatiotemporal reasoning capabilities, their effectiveness in real-time streaming scenarios remains limited. To address these challenges, we propose StreamForest, a novel architecture specifically designed for streaming video understanding. The core of StreamForest is the Persistent Event Memory Forest, a memory mechanism that can adaptively organize video frames into multiple event-level tree structures. This process is guided by a penalty function based on temporal distance, content similarity, and merging frequency, enabling efficient long-term memory retention under limited computational resources. To enhance real-time perception, we introduce the Fine-grained Spatiotemporal Window, which captures detailed short-term visual cues to improve current scene perception. Additionally, we propose OnlineIT, an instruction tuning dataset customized for streaming video tasks. OnlineIT significantly improves MLLM performance in real-time perception and future prediction. To evaluate its generalization ability in practical applications, we introduce ODV-Bench, a new benchmark focused on real-time streaming video understanding in autonomous driving scenarios. Experimental results show that StreamForest achieves state-of-the-art performance, reaching 77.3% accuracy on StreamingBench, 60.5% on OVBench, and 55.6% on OVO-Bench. Notably, even under extreme visual token compression (limited to 1024 tokens), the model maintains 96.8% average accuracy across eight benchmarks (relative to the default 8k setting). These results highlight StreamForest’s robustness, efficiency, and versatility in streaming video understanding.
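A penalty-guided merge of this kind can be sketched as follows. This is a deliberate simplification under stated assumptions: memory is a flat, time-ordered list rather than a forest of trees, the penalty is an assumed weighted sum of the three factors named above, and the weights and merge rule (averaging) are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def penalty(a, b, w_time=1.0, w_sim=1.0, w_merge=1.0):
    """Cost of merging two adjacent nodes (assumed weighted-sum form)."""
    return (w_time * abs(b["t"] - a["t"])                      # temporal distance
            + w_sim * (1.0 - cosine(a["feat"], b["feat"]))     # content dissimilarity
            + w_merge * (a["merges"] + b["merges"]))           # merging frequency


def compress(nodes, budget):
    """Merge the cheapest adjacent pair until at most `budget` nodes remain."""
    nodes = [dict(n) for n in nodes]
    while len(nodes) > max(budget, 1):
        i = min(range(len(nodes) - 1),
                key=lambda k: penalty(nodes[k], nodes[k + 1]))
        a, b = nodes[i], nodes[i + 1]
        merged = {
            "t": (a["t"] + b["t"]) / 2,
            "feat": [(x + y) / 2 for x, y in zip(a["feat"], b["feat"])],
            "merges": a["merges"] + b["merges"] + 1,
        }
        nodes[i:i + 2] = [merged]
    return nodes


frames = [
    {"t": 0.0, "feat": [1.0, 0.0], "merges": 0},
    {"t": 1.0, "feat": [1.0, 0.0], "merges": 0},
    {"t": 5.0, "feat": [0.0, 1.0], "merges": 0},
]
print(compress(frames, budget=2))
```

The first two frames are close in time and identical in content, so they merge first, while the distant and dissimilar third frame is kept as its own event.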

Figure 2

Figure 3

03

Title: LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization

Authors: Zhenpeng Huang, Jiaqi Li, Zihan Jia, Xinhao Li, Desen Meng, Lingxue Song, Xi Chen, Liang Li, Limin Wang

Institution: Nanjing University, China Mobile Research Institute

Abstract:

Current vision-language models (VLMs) have limited performance in long video understanding: they rely on expensive and scarce long video annotations, and short-context models easily overlook intermediate content when extended to long sequences, causing performance imbalance between long and short tasks. To address this, we propose LongVPO—a two-stage direct preference optimization framework that requires no long video annotations. LongVPO first uses “anchored cues” to automatically synthesize preference data from short video clips, then achieves cross-clip alignment through “self-reasoning” on real long videos, learning complex long-range reasoning capabilities. Using only 16K synthetic data, LongVPO achieves superior performance on LVBench, LongVideoBench, MLVU, VideoMME, and other benchmarks while maintaining strong performance on short video tasks, providing a new paradigm for efficient and scalable long video understanding.
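For context, the standard direct preference optimization (DPO) objective for a single preference pair is shown below. This is the generic textbook loss. Whether LongVPO modifies it is not stated in the abstract, and the function and parameter names are illustrative.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    logp_* -- log-probability of each response under the policy being trained
    ref_*  -- log-probability of each response under the frozen reference model
    """
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin)).
    # Not guarded against overflow for very large negative margins.
    return math.log1p(math.exp(-beta * margin))


# The policy prefers the chosen response more than the reference does,
# so the loss falls below log(2), its value at zero margin.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```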

Figure 4

04

Title: Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models

Authors: Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang, Yicheng Liu, Lidong Lu, De-An Huang, Wonmin Byeon, Matthieu Le, Tuomas Rintamaki, Tyler Poon, Max Ehrlich, Tong Lu, Limin Wang, Bryan Catanzaro, Jan Kautz, Andrew Tao, Zhiding Yu, Guilin Liu

Institution: Nanjing University, NVIDIA, Hong Kong Polytechnic University, Rutgers University

Abstract:

Eagle 2.5 is a series of frontier vision-language models (VLMs) designed for long-context multimodal understanding. Existing VLMs mainly focus on short-context tasks, with insufficient support for long video understanding and high-resolution image processing. Eagle 2.5 proposes a general training framework with two core technologies: Automatic Degradation Sampling (ADS) and Image Area Preservation (IAP), which dynamically allocate visual and text input budgets and maintain image integrity when segmenting. Additionally, the authors introduce a progressive mixed post-training strategy that gradually extends context length to improve model stability in handling diverse inputs. To support training, they construct the new Eagle-Video-110K dataset, providing story-level and clip-level dual annotations to enhance long video understanding capabilities. Experiments show that Eagle 2.5 achieves significant improvements on multiple long video and image understanding benchmarks. For example, the 8B parameter Eagle 2.5 achieves 72.4% on Video-MME with 512 frame input, approaching the performance of larger models like GPT-4o and Qwen2.5-VL-72B. The model also performs excellently on high-resolution image understanding tasks. In summary, Eagle 2.5 achieves efficient and powerful long-context multimodal understanding capabilities through innovative sampling strategies, progressive training methods, and large-scale multi-level datasets, providing a strong direction for future high-performance VLM development.

Figure 5

05

Title: VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception

Authors: Ziang Yan, Xinhao Li, Yinan He, Zhengrong Yue, Xiangyu Zeng, Yali Wang, Yu Qiao, Limin Wang, Yi Wang

Institution: Zhejiang University, Shanghai AI Laboratory, Nanjing University, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences

Abstract:

Infusing reasoning capabilities into multimodal large language models is key to achieving human-level perception and understanding. Existing methods mostly rely on the reasoning capabilities of LLMs to analyze parsed visual information, but are often limited by static perception stages. This paper proposes “Visual Test-Time Scaling,” which enhances multimodal LLM reasoning capabilities through iterative perception during inference. Guided by updated text predictions, it gradually refines attention to high-confidence spatiotemporal regions, mimicking human hierarchical attention mechanisms. The training process combines reinforcement learning with spatiotemporal supervision signals for end-to-end optimization of reasoning paths. These designs allow multimodal LLMs to improve performance by increasing the compute spent on perception. Extensive experiments validate the effectiveness and generalization of the iterative perception method across various tasks and benchmarks. The newly introduced VideoChat-R1.5 model achieves significant improvements across more than 15 benchmarks covering video dialogue, video reasoning, and spatiotemporal perception, with an average improvement of more than 5% over strong baselines such as Qwen2.5VL-3B and Qwen2.5VL-7B.

Figure 6

06

Title: MotionRAG: Motion Retrieval-Augmented Image-to-Video Generation

Authors: Chenhui Zhu, Yilu Wu, Shuai Wang, Gangshan Wu, Limin Wang

Institution: Nanjing University

Abstract:

Thanks to the development of diffusion models, image-to-video generation technology has made significant progress. However, generating motion-realistic videos remains a formidable challenge. The core of this challenge lies in accurately modeling the complexity of motion, which requires capturing physical laws, object interactions, and domain-specific motion patterns—prior knowledge that is difficult to generalize effectively across diverse scenarios. To address this, we propose MotionRAG, a retrieval-augmented generation framework. This framework extracts and transfers motion priors from relevant reference videos through a Context-Aware Motion Adaptation (CAMA) mechanism to improve the motion realism of generated videos. The core technical innovations include: (1) Retrieval-based motion representation extraction: using video encoders and resamplers to extract semantic-level motion features from retrieved reference videos; (2) Context learning-based motion adaptation method: efficiently learning and transferring motion patterns from multiple retrieved reference videos to target scenarios through a causal Transformer architecture; (3) Attention motion injection adapter: injecting motion features into pre-trained video diffusion models to enhance motion realism. Extensive experiments demonstrate that our method achieves significant improvements across multiple scenarios and various base models, introducing only negligible computational overhead during inference. Furthermore, its modular design supports zero-shot generalization to new domains—simply updating the retrieval database without retraining any model components. This research enhances the core capabilities of video generation systems by enabling efficient retrieval and transfer of motion priors, providing a new paradigm for synthesizing videos with realistic dynamic effects.

Figure 7

07

Title: Loquetier: A Virtualized Multi-LoRA Framework for Unified LLM Fine-tuning and Serving

Authors: Yuchen Zhang, Hanyue Du, Chun Cao, Jingwei Xu

Institution: Nanjing University

Abstract:

Low-Rank Adaptation (LoRA) has become a widely adopted parameter-efficient fine-tuning (PEFT) technique for adapting large language models (LLMs) to downstream tasks. Although numerous studies have explored strategies for unifying LLM training and serving, the domain of unified fine-tuning and inference for LoRA-based models remains underexplored. This paper proposes Loquetier—a virtualized multi-LoRA framework that seamlessly integrates LoRA fine-tuning and inference serving in a single runtime environment. Loquetier consists of two main components: (1) a virtualization module that isolates PEFT-based model modifications and supports deploying multiple adapters on a shared single base model; (2) an optimized computational flow with kernel designs that fuse fine-tuning and inference paths in forward propagation, enabling efficient batch processing and minimizing kernel call overhead. In extensive experiments across three task scenarios, Loquetier significantly outperforms existing baselines in both performance and flexibility: achieving 3.0× throughput of top co-serving systems in inference-only tasks, and 46.4× higher service level objective attainment rate than PEFT in unified fine-tuning and inference tasks.

Figure 8

08

Title: 3D Interaction Geometric Pre-training for Molecular Relational Learning

Authors: Namkyeong Lee, Yunhak Oh, Heewoong Noh, Gyoung S. Na, Minkai Xu, Hanchen Wang, Tianfan Fu, Chanyoung Park

Institution: KAIST, KRICT, Stanford University, Genentech, Nanjing University

Abstract:

Accurate prediction of molecular interactions is crucial in drug discovery and materials science. However, existing molecular relational learning methods are mostly limited to using 2D topological structures of molecules, ignoring 3D spatial geometric information that determines the nature of interactions—primarily because obtaining precise 3D interaction conformations is extremely expensive. To break through this bottleneck, we propose 3DMRL, an innovative 3D geometric pre-training framework. The core of this framework is that instead of relying on expensive computations to obtain true interaction conformations, it simulates how molecules contact each other in 3D space by constructing a “virtual interaction environment”—arranging multiple small molecules around a large molecule through random sampling, translation, and rotation. Based on this, we design dual pre-training tasks to guide 2D models to learn 3D geometric information in this virtual environment: one uses contrastive learning to help models understand the global geometric structure of interactions; the other uses an equivariant network to predict fine local relative geometric relationships between molecules, capturing atomic-level interaction details. Extensive experiments show that 3DMRL can significantly improve the performance of various mainstream models on molecular interaction prediction and drug-drug interaction prediction tasks, achieving up to 24.93% performance improvement across 40 tasks and demonstrating excellent generalization capabilities in out-of-distribution scenarios. This work systematically introduces 3D geometric pre-training to the field of molecular relational learning for the first time, laying a solid foundation for developing more accurate and versatile AI-assisted scientific discovery tools.
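The "virtual interaction environment" idea can be sketched with plain rigid-body geometry. This is a hedged illustration, not 3DMRL's code: the function names, the fixed `offset` distance, and the choice of a random atom of the large molecule as the contact site are all assumptions.

```python
import math
import random


def random_unit_vector(rng):
    """Sample a direction by normalizing a 3D Gaussian draw."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 1e-12:
            return [c / norm for c in v]


def rotate(points, axis, angle):
    """Rotate points about a unit axis through the origin (Rodrigues' formula)."""
    ux, uy, uz = axis
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for x, y, z in points:
        dot = ux * x + uy * y + uz * z
        cx, cy, cz = uy * z - uz * y, uz * x - ux * z, ux * y - uy * x
        out.append([x * c + cx * s + ux * dot * (1 - c),
                    y * c + cy * s + uy * dot * (1 - c),
                    z * c + cz * s + uz * dot * (1 - c)])
    return out


def place_copy(large, small, rng, offset=3.0):
    """Rigidly move `small` so its centroid sits `offset` from a random atom of `large`."""
    n = len(small)
    centroid = [sum(p[i] for p in small) / n for i in range(3)]
    centered = [[p[i] - centroid[i] for i in range(3)] for p in small]
    rotated = rotate(centered, random_unit_vector(rng),
                     rng.uniform(0.0, 2.0 * math.pi))      # random rotation
    anchor = rng.choice(large)                              # random sampling
    direction = random_unit_vector(rng)
    target = [anchor[i] + offset * direction[i] for i in range(3)]
    return [[q[i] + target[i] for i in range(3)] for q in rotated]  # translation


def virtual_environment(large, small, n_copies, seed=0):
    """Arrange several randomly posed copies of `small` around `large`."""
    rng = random.Random(seed)
    return [place_copy(large, small, rng) for _ in range(n_copies)]
```

Because each copy undergoes only a rotation and a translation, its internal geometry is preserved. Only the relative pose between the two molecules varies, and that relative pose is the signal the pre-training tasks are meant to learn from.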

Figure 9

09

Title: EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs

Authors: Yuping He, Yifei Huang, Guo Chen, Baoqi Pei, Jilan Xu, Tong Lu, Jiangmiao Pang

Institution: Nanjing University, Shanghai AI Laboratory, University of Tokyo, Zhejiang University, Fudan University

Abstract:

Human intelligence can naturally transfer and integrate knowledge between first-person (egocentric) and third-person (exocentric) perspectives, which is crucial for learning and communication. However, although current multimodal large language models (MLLMs) have achieved significant progress in single-perspective video understanding, they still lack systematic evaluation of cross-perspective reasoning. To address this, we propose EgoExoBench—the first benchmark for evaluating MLLMs’ first-person and third-person video understanding and reasoning capabilities.

EgoExoBench is built on public datasets and contains 7300+ multiple-choice questions (MCQs) covering 11 sub-tasks, divided into three major challenges: semantic alignment, viewpoint association, and temporal reasoning. Task designs cover matching from task, action, object, to person levels, as well as cross-perspective spatial correspondence and event sequence reasoning.

The research team conducted a systematic evaluation of 13 mainstream open-source and closed-source MLLMs, including GPT-4o, Claude 3.7 Sonnet, Qwen2.5-VL, and InternVL3. Results show that these models perform well on single-perspective tasks but exhibit significant performance degradation on cross-perspective tasks. For example, the best open-source model, Qwen2.5-VL-72B, achieves only 47% overall accuracy, while humans achieve over 90% accuracy on the same tasks. Further experiments show that chain-of-thought (CoT) prompting does not improve performance and even reduces accuracy on some tasks, indicating that cross-perspective reasoning remains a major challenge for existing models.

In summary, EgoExoBench provides a systematic and scalable evaluation framework that helps advance embodied agents and human-robot collaboration systems with human-like cross-perspective intelligence.

Figure 10

Professor Wang Limin Receives 2025 Ant Intech Technology Award

Fri, 19 Sep 2025 00:00:00 +0000

Recently, at the 2025 Inclusion Bund Conference, the “2025 Ant Intech Award” was officially announced. Ten young scientists received the “Ant Intech Technology Award”, and ten Chinese doctoral students from top universities worldwide received the “Ant Intech Scholarship”. Professor Wang Limin was among the recipients of the 2025 Ant Intech Technology Award.

The 2025 Ant Intech Award was established by Ant Group Co., Ltd. to provide public-welfare research funding for outstanding young scholars and doctoral students in computer science. It comprises two core awards: the “Ant Intech Technology Award” and the “Ant Intech Scholarship”.

Figure: 2025 Ant Intech Technology Award Ceremony

Academicians and industry authorities attended the award ceremony, including Chen Chun (Academician of Chinese Academy of Engineering, Professor at Zhejiang University), Zhang Hongjiang (Foreign Academician of US National Academy of Engineering), and Zheng Weimin (Academician of Chinese Academy of Engineering, Professor at Tsinghua University). Michael I. Jordan (Member of US National Academy of Sciences, Engineering, and Arts & Sciences) and Jack Dongarra (Turing Award winner, Academician of US National Academy of Engineering, Professor at University of Tennessee) sent video messages to young scholars: “The path of research may not be smooth, but the problems you explore today will define future technologies and opportunities. Be bold in seeking truth, and your research will ultimately impact the world.”

This year’s award recipients have demonstrated exceptional innovation in frontier areas such as Artificial General Intelligence (AGI), embodied intelligence, digital medicine, and data security, and their achievements have been widely adopted by industry. Professor Wang Limin won the award for his significant contributions to Artificial General Intelligence. The award citation credits him with developing InternVideo, the first leading general-purpose video understanding large model (with over 5 million downloads); proposing the “progressive training” method, which enables AI to understand the dynamic world in layers as humans do; and empowering application scenarios such as autonomous driving.

Figure: Professor Wang Limin participating in the 2025 Ant Intech Technology Award ceremony round table forum

ICCV 2025 Accepted Papers

Tue, 12 Aug 2025 00:00:00 +0000

ICCV is one of the most influential top-tier conferences in computer vision. It is organized by the IEEE Computer Society and held biennially. Together with CVPR and ECCV, it is regarded as one of the three flagship vision venues. ICCV covers cutting-edge topics such as image processing, object detection, 3D reconstruction, video understanding, and vision–language research, serving as a premier platform for presenting the latest advances and exchanging ideas. With its very high acceptance standards, ICCV represents the frontier trends and research hotspots of the field.

Seven papers from the Large Model Center of the Department of Computer Science and Technology, Nanjing University (NJU MCG), have been accepted to ICCV 2025.

01

Title: MobileViCLIP: An Efficient Video-Text Model for Mobile Devices

Authors: Min Yang, Zihan Jia, Zhilin Dai, Sheng Guo, Limin Wang

Affiliations: Nanjing University; Ant Group

Abstract:

Although large models have achieved strong performance on many vision tasks, efficient lightweight neural networks are receiving growing attention due to their faster inference and easier deployment on mobile devices. However, existing video models still focus on larger ViT architectures, with few attempts to build efficient video architectures. Given that many efficient CLIP models already demonstrate strong zero-shot classification and retrieval capabilities, we aim to fill the gap for video–text understanding and propose MobileViCLIP, a fast and efficient video–text model with strong zero-shot capability that can be deployed on mobile devices. Concretely, MobileViCLIP achieves performance comparable to mainstream ViT-based models on several text–video retrieval and zero-shot video classification datasets, while improving inference speed on mobile devices by tens of times. We believe focusing on efficiency for video–text models is important and valuable to the field.

02

Title: p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay

Authors: Jun Zhang (张峻), Desen Meng (孟德森), Zhengming Zhang (张拯明), Zhenpeng Huang (黄振鹏), Tao Wu (吴涛), Limin Wang (王利民)

Affiliations: Nanjing University; China Mobile Research Institute

Abstract:

Despite the strong performance of multimodal large language models (MLLMs) on various downstream tasks, their massive training and inference costs hinder further development. A major cause is that the LLM must process an enormous number of visual tokens. We propose p-MoD, an efficient MLLM architecture that significantly reduces computational cost during both training and inference while maintaining performance. To reduce the number of visual tokens processed at each LLM Transformer layer, p-MoD introduces a Mixture-of-Depths (MoD) mechanism that processes only the most informative tokens at each layer and skips redundant ones. Integrating MoD into MLLMs is nontrivial; to address training/inference stability and limited training data, p-MoD designs Tanh-gated Weight Normalization (TanhNorm) and Symmetric Token Reweighting (STRing). Furthermore, we observe that visual token redundancy increases in deeper layers and thus propose Progressive Ratio Decay (PRD) to gradually reduce the kept-token ratio layer by layer. This key design fully unlocks MoD’s potential, markedly boosting efficiency and performance. On 15 benchmarks with LLaVA-1.5 and LLaVA-NeXT baselines, p-MoD matches or surpasses baseline performance while using only 55.6% of the inference TFLOPs, 53.7% of the KV cache, and 77.7% of the GPU training time.
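The combination of per-layer token selection and a decaying keep ratio can be sketched as follows. The cosine schedule, the `r_max` and `r_min` bounds, and the function names are assumptions chosen for illustration. The abstract states only that the kept-token ratio is reduced gradually layer by layer.

```python
import math


def keep_ratio(layer, num_layers, r_max=1.0, r_min=0.1):
    """Kept-token ratio for a layer: cosine decay from r_max (first) to r_min (last).

    Assumes num_layers >= 2.
    """
    progress = layer / (num_layers - 1)
    return r_min + 0.5 * (r_max - r_min) * (1 + math.cos(math.pi * progress))


def select_tokens(weights, ratio):
    """Indices of the highest-weight tokens, returned in their original order."""
    k = max(1, math.ceil(ratio * len(weights)))
    top = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    return sorted(top)


# Hypothetical router weights for four visual tokens at one layer.
weights = [0.2, 0.9, 0.1, 0.7]
for layer in range(4):
    ratio = keep_ratio(layer, num_layers=4)
    print(layer, round(ratio, 3), select_tokens(weights, ratio))
```

Tokens that are not selected skip the layer, so deeper layers, where redundancy is reported to be higher, process progressively fewer visual tokens.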

03

Title: Scalable Image Tokenization with Index Backpropagation Quantization

Authors: Fengyuan Shi (石丰源), Zhuoyan Luo (罗卓彦), Yixiao Ge (葛艺潇), Yujiu Yang (杨余久), Ying Shan (单瀛), Limin Wang (王利民)

Affiliations: Nanjing University; Tsinghua University; Tencent

Abstract:

Existing vector quantization (VQ) methods face scalability issues, largely because codebooks updated only partially during training become unstable: as the distribution gap between inactive codes and visual features widens, codebook utilization drops and training eventually collapses. We propose Index Backpropagation Quantization (IBQ), a new VQ method that jointly optimizes all codebook embeddings and the visual encoder. By applying a straight-through estimator to the one-hot categorical distribution between encoded features and the codebook, IBQ makes all codes differentiable and maintains a latent space consistent with the visual encoder. IBQ enables scalable training of visual tokenizers and, for the first time, achieves large codebooks with high utilization at high dimension (256) and scale (2¹⁸). On ImageNet, IBQ shows strong scalability and competitive performance for both image reconstruction and autoregressive visual generation.

04

Title: Make Your Training Flexible: Towards Deployment-Efficient Video Models

Authors: Chenting Wang, Kunchang Li, Tianxiang Jiang, Xiangyu Zeng, Yi Wang, Limin Wang

Affiliations: Shanghai AI Laboratory; Shanghai Jiao Tong University; University of Science and Technology of China; Nanjing University

Abstract:

Mainstream video training typically relies on fixed spatiotemporal sampling grids that extract a fixed number of visual tokens as input, making both training and inference heavily constrained by preset sampling strategies. This rigid design hampers adaptation to varying computational budgets in downstream scenarios—especially when models trained under high compute cannot be efficiently deployed on resource-limited edge devices. We propose a new training paradigm to achieve “lossless adaptation across scenarios”: retain top performance under high compute while enabling lossless migration to low-resource environments. We first introduce Token Optimization (TO), an adaptive inference framework that dynamically samples and selects tokens according to downstream compute limits to maximize information utilization. We then develop Flux, a training-side data augmentation tool that enables flexible sampling grids with token selection, integrating seamlessly into mainstream video training frameworks to markedly enhance robustness and flexibility at near-zero extra cost. Integrated into large-scale video pretraining, FluxViT sets new SOTA under standard compute; notably, with only 1/4 tokens, FluxViT with TO still rivals the best InternVideo2 models, saving nearly 90% compute without loss.

05

Title: VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

Authors: Jiashuo Yu, Yue Wu, Meng Chu, Zhifei Ren, Zizheng Huang, Pei Chu, Ruijie Zhang, Yinan He, Qirui Li, Songze Li, Zhenxiang Li, Zhongying Tu, Conghui He, Yu Qiao, Yali Wang, Yi Wang, Limin Wang

Affiliations: Shanghai AI Laboratory; Nanjing University; SIAT, Chinese Academy of Sciences (Shenzhen)

Abstract:

We introduce VRBench, the first long-form narrative video benchmark specifically designed to evaluate multi-step reasoning in large models, addressing limitations of existing evaluations that overlook temporal reasoning and process validity. VRBench contains 1,010 long videos (avg. length 1.6 hours), 9,468 human-annotated multi-step QA pairs, and 30,292 timestamped reasoning steps. Videos are curated through a multi-stage pipeline with expert cross-checking, ensuring plot coherence and complexity. We build a human-in-the-loop framework to generate coherent reasoning chains with timestamped steps across seven types (e.g., causal attribution, implicit reasoning). A multi-stage evaluation assesses models on both results and processes: beyond MCQ results, we propose an LLM-guided process score to comprehensively assess reasoning-chain quality. Experiments with 12 LLMs and 16 VLMs reveal current limitations in long-video multi-step reasoning and offer recommendations.

06

Title: Divide-and-Conquer for Enhancing Unlabeled Learning, Stability, and Plasticity in Semi-supervised Continual Learning

Authors: Yue Duan (段岳), Taicai Chen (陈泰财), Lei Qi (祁磊), Yinghuan Shi (史颖欢)

Affiliations: Nanjing University; Southeast University

Links: https://arxiv.org/abs/2508.05316, https://github.com/NJUyued/USP4SSCL

Abstract:

Semi-supervised continual learning (SSCL) aims to learn from a sequence of tasks in which only part of the data is labeled—highly practical yet challenging. The core challenge is to effectively leverage unlabeled data while balancing memory stability (avoiding forgetting) and learning plasticity (learning new knowledge). We propose USP, a divide-and-conquer collaborative framework that systematically enhances Unlabeled learning, Stability, and Plasticity via three coupled modules. For plasticity, we propose Feature Space Reservation (FSR), which uses an Equiangular Tight Frame (ETF) to reserve positions in the feature space for future classes, reducing conflicts when learning new tasks. For unlabeled learning, we design Divide-and-Conquer Pseudo-labeling (DCP), which splits unlabeled data into high- and low-confidence subsets and assigns pseudo-labels using a classifier and a more robust Nearest Class Mean (NCM), respectively, fully utilizing all data. For stability, we introduce Class-mean anchored Unlabeled Distillation (CUD), which reuses DCP’s intermediate results and anchors unlabeled data to stable class centers computed from labeled data, effectively mitigating catastrophic forgetting. Extensive experiments show that USP significantly outperforms SOTA, improving final-task accuracy by up to 5.94%.
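
The divide-and-conquer pseudo-labeling step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the confidence threshold `tau`, and the use of Euclidean distance for the Nearest Class Mean are hypothetical choices, not details taken from the USP paper or its repository.

```python
import math


def nearest_class_mean(feature, class_means):
    """Return the label whose class mean is closest to the feature (Euclidean)."""
    return min(class_means, key=lambda c: math.dist(feature, class_means[c]))


def divide_and_conquer_pseudo_label(probs, features, class_means, tau=0.9):
    """Assign a pseudo-label to every unlabeled sample.

    probs:       per-sample classifier probabilities (lists of floats)
    features:    per-sample feature vectors
    class_means: dict mapping label -> mean feature computed from labeled data
    tau:         hypothetical confidence threshold that splits the two subsets
    """
    labels = []
    for p, f in zip(probs, features):
        confidence = max(p)
        if confidence >= tau:
            # High-confidence subset: trust the classifier's prediction.
            labels.append((p.index(confidence), "classifier"))
        else:
            # Low-confidence subset: fall back to the nearest class mean.
            labels.append((nearest_class_mean(f, class_means), "ncm"))
    return labels


class_means = {0: [0.0, 0.0], 1: [1.0, 1.0]}
probs = [[0.95, 0.05], [0.55, 0.45]]
features = [[0.1, 0.0], [0.9, 0.8]]
pseudo_labels = divide_and_conquer_pseudo_label(probs, features, class_means)
```

In this toy input the second sample is labeled 0 by the classifier at low confidence, but its feature lies closer to the mean of class 1, so the fallback assigns class 1. No sample is discarded.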

07

Title: Correspondence as Video: Test-Time Adaption on SAM2 for Reference Segmentation in the Wild

Authors: Haoran Wang (王皓冉), Zekun Li (李泽昆), Jian Zhang (张剑), Lei Qi (祁磊), Yinghuan Shi (史颖欢)

Affiliations: Nanjing University; Southeast University

Links: https://arxiv.org/abs/2508.07759, https://github.com/wanghr64/cav-sam

Abstract:

Large vision models (e.g., SAM) often degrade on downstream segmentation tasks involving new domains or categories. Reference Segmentation, which uses an annotated reference image to guide the segmentation of a target image, is a promising direction. However, existing methods largely rely on meta-learning, requiring heavy training data and compute. We propose CAV-SAM, a new paradigm that turns the “correspondence” between the reference and target images into a “pseudo video,” enabling the latest video model SAM2 to adapt effectively through lightweight test-time tuning, completely avoiding costly meta-learning. The framework includes: (1) Diffusion-based Semantic Transition (DBST), which generates a smooth semantic transition sequence (pseudo video) from the reference to the target to handle semantic differences (same class, different instances); and (2) Test-Time Geometric Alignment (TTGA), which performs lightweight tuning of SAM2 using only the reference image and a novel enhanced cycle-consistency loss to better align geometric changes (pose, scale). Without meta-learning, CAV-SAM surpasses prior SOTA by about 5% on average across multiple datasets.

ICML 2025 Accepted Papers

Fri, 18 Jul 2025 00:00:00 +0000

ICML is one of the most prestigious and influential conferences in machine learning. It is among the longest-running and largest venues in the field and a CCF Class-A conference.

Four papers from the Large Model Center of the Department of Computer Science and Technology, Nanjing University (NJU MCG), have been accepted to ICML 2025.

01

Title: On the Tension between Byzantine Robustness and No-Attack Accuracy in Distributed Learning

Authors: Yi-Rui Yang, Chang-Wei Shi, Wu-Jun Li

Affiliations: Nanjing University

Link: https://cs.nju.edu.cn/lwj/paper/ICML2025_NFLinBRDL.pdf

Abstract:

Distributed machine learning leverages multiple interconnected devices (nodes) and their data to train models. As datasets and models scale up, large clusters face higher rates of software/hardware failures; in open-network scenarios such as federated learning, adversarial attacks are also more likely. Faulty or malicious nodes are called Byzantine nodes. Byzantine-robust distributed learning often uses robust aggregators to withstand such behavior. However, when no Byzantine nodes are present, the effect of robust aggregation is underexplored. This work theoretically analyzes aggregation error in the no-attack setting and proves that the worst-case aggregation error of a robust aggregator increases with the number of Byzantine nodes it is designed to tolerate, revealing an inherent tension between Byzantine robustness and no-attack accuracy. For both non-convex objectives and those satisfying the Polyak–Łojasiewicz (PL) condition, the paper establishes tight lower bounds on the convergence rate of gradient descent with robust aggregation, reflecting the same trade-off. Experiments substantiate the theory and suggest a practical recipe: use robust aggregation during most epochs to prevent crashes/restarts; near convergence, if the cluster is healthy, switch to standard averaging to further improve accuracy. This accelerates training and reduces cost while preserving accuracy. Accepted as Spotlight (top 2.6% of submissions; 9.6% of accepted papers).
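
A small numerical example may help make the trade-off concrete. The Python sketch below uses a trimmed mean as a stand-in robust aggregator (an assumption for illustration; the paper's analysis covers robust aggregators in general, and the gradient values here are made up). With no Byzantine nodes, it measures how far the robust aggregate drifts from the plain average as the tolerated number of Byzantine nodes `f` grows.

```python
def trimmed_mean(values, f):
    """Drop the f smallest and f largest values, then average the rest."""
    if len(values) <= 2 * f:
        raise ValueError("need more than 2f values")
    kept = sorted(values)[f:len(values) - f]
    return sum(kept) / len(kept)


def aggregation_error(values, f):
    """Distance between the robust aggregate and the plain average."""
    plain_mean = sum(values) / len(values)
    return abs(trimmed_mean(values, f) - plain_mean)


# Scalar "gradients" from five honest nodes with heterogeneous data (made up).
honest_gradients = [0.0, 0.0, 1.0, 5.0, 14.0]

# Error in the no-attack setting for tolerance levels f = 0, 1, 2.
errors = [aggregation_error(honest_gradients, f) for f in range(3)]
```

On this input the error is zero for `f = 0` and grows as `f` increases, because the aggregator discards honest but atypical gradients. This is one hand-picked instance, not a proof of the worst-case bound.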

02

Title: Stochastic Layer-Wise Shuffle for Improving Vision Mamba Training

Authors: Zizheng Huang, Haoxing Chen, Jiaqi Li, Jun Lan, Huijia Zhu, Weiqiang Wang, Limin Wang

Affiliations: Nanjing University; Shanghai Institute of Advanced Innovation; China Mobile Research Institute; Shanghai AI Laboratory

Link: https://arxiv.org/abs/2408.17081

Abstract:

Vision Mamba (Vim) offers near-linear computational complexity and strong potential for high-resolution images and long videos, but training—especially at large scales—often suffers from overfitting and complicated pipelines, leaving a gap to leading ViT models on standard benchmarks. This paper proposes Stochastic Layer-Wise Shuffle (SLWS), a plug-and-play regularization method that randomly shuffles each layer’s input token sequence during training with a probability increasing linearly with depth, and restores the original order at output. SLWS encourages deep layers to learn position-invariant high-level semantics, while shallow layers remain sensitive to low-level positional cues. The induced shuffling increases task difficulty as a regularizer, mitigating overfitting. SLWS requires no architectural changes and incurs zero inference overhead. It stabilizes training of large Vim models and yields consistent gains under supervised training. With CLIP-feature–guided masked feature distillation pretraining, Vim-Huge achieves 87.6% fine-tuning accuracy on ImageNet-1K, establishing a new SOTA for Vision Mamba training.
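
The shuffle-then-restore idea can be sketched as a thin wrapper around any layer that maps a token sequence to a sequence of the same length. The code below is a minimal Python sketch, not the authors' implementation: `shuffle_probability`, `layer_with_shuffle`, the `max_prob` value, and the choice of zero probability at the first layer are all assumptions for illustration.

```python
import random


def shuffle_probability(layer_index, num_layers, max_prob=0.5):
    """Probability that grows linearly with depth (assumed: 0 at the first layer)."""
    if num_layers == 1:
        return 0.0
    return max_prob * layer_index / (num_layers - 1)


def layer_with_shuffle(layer_fn, tokens, prob, rng, training=True):
    """Run layer_fn, shuffling its input with probability prob during training.

    The output is put back into the original token order, so downstream
    layers see the usual layout. At inference time nothing is shuffled.
    """
    if not training or rng.random() >= prob:
        return layer_fn(tokens)
    perm = list(range(len(tokens)))
    rng.shuffle(perm)
    shuffled = [tokens[i] for i in perm]
    out = layer_fn(shuffled)
    restored = [None] * len(out)
    for pos, src in enumerate(perm):
        restored[src] = out[pos]  # undo the permutation
    return restored


def double(xs):
    """Toy per-token layer used only for the demonstration."""
    return [2 * x for x in xs]


rng = random.Random(0)
tokens = [1, 2, 3, 4, 5]
restored_output = layer_with_shuffle(double, tokens, prob=1.0, rng=rng)
```

Because the wrapper is bypassed when `training=False`, the sketch is consistent with the zero-inference-overhead property stated in the abstract.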

03

Title: Elucidating the Design Space of Multimodal Protein Language Models (ICML Spotlight)

Authors: Xinyou Wang*, Cheng-Yen Hsieh*, Daiheng Zhang, Dongyu Xue, Fei Ye, Shujian Huang, Zaixiang Zheng, Quanquan Gu

Affiliations: Nanjing University; Rutgers University; ByteDance

Link: https://arxiv.org/abs/2504.11454

Abstract:

Proteins are biological macromolecules whose amino-acid sequences fold into specific 3D structures. AI-driven protein modeling and design is a key direction in AI for Science. After the 2024 Nobel Prize in Chemistry recognized the developers of DeepMind’s AlphaFold for protein structure prediction, AI methods are increasingly used in antibody design, enzyme engineering, and therapeutics. Protein sequences share structural similarity with natural language. Building on this insight, NJU’s NLP group and ByteDance Research have explored generative protein modeling, including DPLM (a general diffusion protein language model, ICML 2024) and DPLM-2 (a multimodal protein base model, ICLR 2025). This work advances that line of research. Code: https://github.com/bytedance/dplm; Project: https://bytedance.github.io/dplm/.

Multimodal Protein Language Models (PLMs) jointly model and generate protein sequences and structures. Sequences are modeled with discrete diffusion over amino-acid tokens (as in DPLM). Structures are continuous 3D coordinates that must be discretized into structure tokens for joint modeling. We identify three challenges: (1) discretizing coordinates causes information loss and harms fine-grained structural fidelity; (2) discrete structure tokens under-capture intrinsic correlations of local structure; and (3) insufficient geometric modeling hinders accurate capture of complex 3D residue interactions.

We address these by introducing a more precise generative modeling scheme tailored for protein structures to improve prediction accuracy, and by adding explicit geometric supervision via a geometric module with representation alignment to enhance geometric relational modeling. Experiments show strong gains: RMSD on folding drops from 5.52 to 2.36, comparable to ESMFold; in unconditional protein generation, sampling diversity improves by ~30% while maintaining sample quality.

04

Title: Differentiable Solver Search for Fast Diffusion Sampling

Authors: Shuai Wang, Zexian Li, Qipeng Zhang, Tianhui Song, Xubin Li, Tiezheng Ge, Bo Zheng, Limin Wang

Affiliations: Nanjing University; Alibaba

Link: https://arxiv.org/abs/2505.21114

Abstract:

Diffusion models deliver excellent generation quality but with substantial inference cost. Recent ODE-based advanced solvers target lower compute under few sampling steps, yet many are inspired by Adams-type linear multistep methods and rely solely on time-dependent Lagrange interpolation—which may be suboptimal for diffusion dynamics. This paper reveals a compact solver-design search space over time steps and solver coefficients, and proposes a differentiable solver search algorithm to discover superior solvers.

With the searched solvers, the flow-matching models SiT-XL/2 and FlowDCN-XL/2 achieve FID 2.40 and 2.35, respectively, on ImageNet 256×256 with only 10 steps; the DDPM model DiT-XL/2 reaches FID 2.33 in 10 steps. The discovered solvers substantially outperform traditional solvers (and even some distillation methods) and generalize across architectures, resolutions, and model scales.
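
For context on the baseline the abstract contrasts against, the Python sketch below derives the classical Adams-Bashforth weights by integrating Lagrange basis polynomials over one step on a uniform time grid. It shows the textbook construction only; it is not the paper's searched solver, and the function names are chosen here for illustration. The coefficients depend only on the time grid, which is the sense in which such solvers are purely time-dependent.

```python
from fractions import Fraction


def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def integrate_unit_interval(poly):
    """Integrate a polynomial over [0, 1]."""
    return sum(c / (k + 1) for k, c in enumerate(poly))


def adams_bashforth_coefficients(order):
    """Weights b_j in x_{n+1} = x_n + h * sum_j b_j * f(t_{n-j}, x_{n-j}).

    Past evaluations sit at t = 0, -1, ..., -(order - 1) in units of the step
    size h. Each weight is the integral over [0, 1] of the Lagrange basis
    polynomial attached to one of those nodes.
    """
    nodes = [Fraction(-j) for j in range(order)]
    coeffs = []
    for j, tj in enumerate(nodes):
        basis = [Fraction(1)]
        for m, tm in enumerate(nodes):
            if m != j:
                # Multiply by the factor (t - tm) / (tj - tm).
                basis = poly_mul(basis, [-tm / (tj - tm), 1 / (tj - tm)])
        coeffs.append(integrate_unit_interval(basis))
    return coeffs


ab2 = adams_bashforth_coefficients(2)  # two-step method
ab3 = adams_bashforth_coefficients(3)  # three-step method
```

The two-step result reproduces the familiar weights 3/2 and -1/2. A search over solver coefficients, as the abstract describes, would treat such weights as free parameters instead of fixing them by interpolation.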

12 Papers from Nanjing University’s Large Model Center Accepted by CVPR 2025

Wed, 30 Apr 2025 00:00:00 +0000

CVPR (the IEEE/CVF Conference on Computer Vision and Pattern Recognition) is one of the world’s most influential annual academic conferences, covering cutting-edge research in computer vision, pattern recognition, and related fields. Each year it gathers top researchers, scholars, and industry professionals to discuss the latest technological advances and innovative applications. Topics range from image processing and machine learning to 3-D reconstruction and video analysis. All submissions undergo a rigorous peer-review process to ensure originality and academic value. In the 2024 Google Scholar Metrics, CVPR ranked second among all journals and conferences worldwide, just behind Nature.

The Large Model Center of the School of Computer Science at Nanjing University has had 12 papers accepted by CVPR 2025.

01

Title: UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming

Authors: Hao Lin, Ke Wu, Jie Li, Jun Li, Wu-Jun Li

Affiliation: Nanjing University

Link: https://arxiv.org/abs/2307.16375

Abstract: Training large models usually demands multi-node, multi-GPU distributed setups. Even with ample hardware, 64%–87% of users (in our experiments) fail to obtain results because of sub-optimal hyper-parameters such as how the model and data are partitioned. Moreover, slow training is often tackled by adding GPUs while ignoring the decisive role of distributed algorithms in hardware utilization. Efficient algorithms can be several times faster, and correspondingly cheaper, than less efficient ones. Many existing strategies are inefficient and can even slow training as GPU count rises. We present UniAP, the first method to jointly optimize intra-layer (e.g., tensor parallelism) and inter-layer (e.g., pipeline parallelism) strategies via automatic search, together with a supporting platform. Given a model and hardware profile, UniAP automatically finds a high-performance scheme, achieving up to 3.8× speedup over the best prior work and up to 9× over unoptimized baselines, while preventing the hyper-parameter mistakes that often cripple runs. UniAP has also been adapted to domestic AI accelerators. The paper was accepted as an Oral (0.7% of submissions, 3.3% of accepted papers) at CVPR 2025.

02

Title: Balanced Direction from Multifarious Choices: Arithmetic Meta-Learning for Domain Generalization

Authors: Xiran Wang, Jian Zhang, Lei Qi, Yinghuan Shi

Affiliation: Nanjing University; Southeast University

Link: https://arxiv.org/abs/2503.18987

Abstract: Domain generalization tackles distribution shifts between source (training) and unseen target (test) domains. First-order meta-learning based on gradient alignment finds balanced parameters across multiple sources, mitigating over-fitting. We reveal that gradient-aligned paths are not unique and that existing methods explore only one. Furthermore, they focus on directional alignment but ignore where in parameter space the model converges; ideally, the solution should lie near the centroid of the source-domain optima. We propose Arithmetic Meta-Learning (Arith), which introduces parameter averaging into meta-learning and designs an arithmetic-gradient optimizer that approximates the centroid while preserving gradient direction. Arith needs no extra expert networks or explicit regularizers and achieves strong generalization across benchmarks.

03

Title: Steady Progress Beats Stagnation: Mutual Aid of Foundation and Conventional Models in Mixed-Domain Semi-Supervised Medical Image Segmentation

Authors: Qinghe Ma, Jian Zhang, Zekun Li, Qian Yu, Lei Qi, Yinghuan Shi

Affiliation: Nanjing University; Southeast University

Link: https://arxiv.org/abs/2503.16997

Abstract: Large-scale pretrained vision foundation models show impressive generality, yet their rich priors can be a double-edged sword when adapted to specialized tasks. In medical-image segmentation with domain mismatch, foundation models such as MedSAM often yield over-confident but erroneous predictions, hampering leverage of unlabeled data. We introduce SynFoC, a framework that co-trains a foundation model with a from-scratch conventional model. The latter corrects high-confidence errors of the former, while the former supplies high-quality pseudo-labels early on. A Self-Mutual Confidence (SMC) module assesses pseudo-label quality and adaptively fuses them; a consensus–disagreement consistency constraint further boosts collaboration. Experiments confirm superior performance over existing approaches.

04

Title: Taste More, Taste Better: Diverse Data and Strong Model Boost Semi-Supervised Crowd Counting

Authors: Maochen Yang, Zekun Li, Jian Zhang, Lei Qi, Yinghuan Shi

Affiliation: Nanjing University; Southeast University

Link: https://arxiv.org/abs/2503.17984

Abstract: Crowd counting is vital in smart-city and public-safety applications, yet dense annotation is costly. Semi-supervised counting aims to exploit unlabeled data, but effective use remains challenging. We propose TMTB (Taste More Taste Better), advancing both data and model aspects. (1) Inpainting Augmentation uses diffusion models to regenerate image backgrounds without disturbing crowd structures, greatly enriching data diversity; unreliable regions are filtered. (2) A Visual State Space Model (VSSM) serves as the backbone, capturing global context with linear complexity, which suits extreme density, low light, or bad weather. (3) A noise-robust classification head supplies coarse-but-stable interval-count supervision, mitigating regression sensitivity to label noise. On multiple datasets, TMTB outperforms state-of-the-art methods under 5%, 10%, and 40% label fractions; on JHU-Crowd++ with only 5% labels it lowers MAE below 70 for the first time (67.0) and shows strong cross-domain generalization.

05

Title: AutoLUT: LUT-Based Image Super-Resolution with Automatic Sampling and Adaptive Residual Learning

Authors: Yuheng Xu, Shijie Yang, Xin Liu, Jie Liu, Jie Tang, Gangshan Wu

Affiliation: Nanjing University

Link: https://arxiv.org/abs/2503.01565

Abstract: The spread of high-DPI displays heightens demand for high-definition images, yet edge devices struggle to host heavy super-resolution (SR) networks, so efficient methods are needed. Prior LUT-based SR makes little use of pixel-level cues and relies on fixed sampling, limiting accuracy and fine-detail capture. We introduce two plug-and-play modules: AutoSample, which learns flexible LUT sampling weights during training (adapting to pixel variations and enlarging the receptive field with no inference overhead), and AdaRL, which strengthens inter-layer connections to boost fine-detail reconstruction. With similar storage, AutoLUT lifts MuLUT by ≈ 0.20 dB PSNR across five datasets; on SPF-LUT it halves storage, cuts inference time by two-thirds, and maintains fidelity.

06

Title: CATANet: Efficient Content-Aware Token Aggregation for Lightweight Image Super-Resolution

Authors: Xin Liu, Jie Liu, Jie Tang, Gangshan Wu

Affiliation: Nanjing University

Link: https://arxiv.org/abs/2503.06896

Abstract: Transformer-based SR excels on low-level vision but its quadratic complexity explodes with resolution. Existing speed-ups partition images into content-agnostic windows, curtailing long-range redundancy exploitation vital for SR. We propose CATANet, a lightweight Content-Aware Token Aggregation Network. A novel aggregation module clusters content-similar tokens across the entire image, sharing aggregation centers and updating them only during training to cut computation. We then apply intra-group self-attention for long-range interaction and inter-group cross-attention to enhance global fusion. Compared with the clustering-based SPIN, CATANet is faster at inference while gaining up to 0.33 dB PSNR.

07

Title: Tra-MoE: Learning Trajectory Prediction Model from Multiple Domains for Adaptive Policy Conditioning

Authors: Jiange Yang, Haoyi Zhu, Yating Wang, Gangshan Wu, Tong He, Limin Wang

Affiliation: Nanjing University; Shanghai AI Lab; USTC; Tongji University

Link: https://arxiv.org/pdf/2411.14519

Abstract: Data scarcity and heterogeneity challenge robot learning. Tra-MoE adopts a sparsely gated Mixture-of-Experts to learn trajectory prediction from large-scale cross-domain video without action labels, balancing parameter sharing and specialization. It fuses simulation videos rendered by different physics engines with real videos of humans, single-arm robots, and dual-arm robots, a promising setup for cross-agent learning. An adaptive policy-conditioning mechanism leverages predicted trajectories to boost downstream robot control, greatly reducing the need for expensive real-robot data.

08

Title: LeviTor: 3-D Trajectory Oriented Image-to-Video Synthesis

Authors: Hanlin Wang, Hao Ouyang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Qifeng Chen, Yujun Shen, Limin Wang

Affiliation: Nanjing University; Ant Group; Zhejiang University; Hong Kong University of Science and Technology; Shanghai AI Lab

Link: https://github.com/ant-research/LeviTor

Abstract: Sketching a trajectory is an intuitive way to control motion in image-to-video synthesis, yet 2-D paths are ambiguous for out-of-plane motion. LeviTor enriches interaction by adding a depth dimension: users assign relative depth to trajectory key-points, retaining 2-D convenience while enabling 3-D control. Objects are represented by a few cluster points reflecting depth and occlusion. These, along with depth and instance maps, guide a video-diffusion generator to produce videos faithfully following 3-D trajectories. Extensive experiments demonstrate precise motion control and high realism.

09

Title: Contextual AD Narration with Interleaved Multimodal Sequence

Authors: Hanlin Wang, Zhan Tong, Kecheng Zheng, Yujun Shen, Limin Wang

Affiliation: Nanjing University; KU Leuven; Ant Group; Shanghai AI Lab

Link: https://arxiv.org/abs/2403.12922

Abstract: Audio description (AD) narrates visual content for the visually impaired. We present Uni-AD, a simple unified framework that feeds interleaved multimodal sequences—video features, text, character lists, and context—into a pretrained language model. A lightweight mapper aligns video to text space for fine-grained fusion; a character-optimization module highlights major roles in context. Coupled with context cues and a contrastive loss, Uni-AD generates fluent, context-aware narration. Experiments on multiple AD datasets confirm its superiority.

10

Title: Multiple Object Tracking as ID Prediction

Authors: Ruopeng Gao, Ji Qi, Limin Wang

Affiliation: Nanjing University; China Mobile (Jiangsu) Software Technology Co.; Shanghai AI Lab

Link: https://github.com/MCG-NJU/MOTIP

Abstract: Multi-object tracking (MOT) is traditionally decomposed into detection and association, with handcrafted algorithms maintaining trajectories and computing cost matrices—effective yet requiring extensive tuning for complex scenes. We reconceptualize MOT as context-conditioned ID prediction and propose MOTIP, an end-to-end framework that directly decodes ID labels for current detections given past trajectories. Using only appearance features, MOTIP achieves state-of-the-art results on multiple benchmarks without elaborate tricks, offering a powerful baseline for future research.

11

Title: Online Video Understanding: OVBench and VideoChat-Online

Authors: Zhenpeng Huang, Xinhao Li, Jiaqi Li, Jing Wang, Xiangyu Zeng, Cheng Liang, Tao Wu, Xi Chen, Liang Li, Limin Wang

Project Site: https://videochat-online.github.io/

Affiliation: Nanjing University; China Mobile Research Institute; Shanghai AI Lab

Abstract: Multimodal large language models have excelled at offline video understanding, but real-time scenarios (e.g., autonomous driving, HCI) pose fresh challenges. We contribute on three fronts: (1) OVBench, a comprehensive QA benchmark evaluating perception, memory, and reasoning over streaming video, spanning six task types across past, current, and future contexts (16 subtasks from diverse datasets). (2) Pyramid Memory Bank, which efficiently retains critical spatio-temporal cues. (3) An offline-to-online learning paradigm, with an alternating dialog format and the VideoChatOnline-IT instruction-tuning set for streaming data. Our resulting framework, VideoChat-Online, outperforms state-of-the-art offline and online models on common offline benchmarks and OVBench, despite lower compute cost.

12

Title: Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

Authors: Zi’ang Yan, Zhilin Li, Yinan He, Chenting Wang, Kunchang Li, Xinhao Li, Xiangyu Zeng, Zilei Wang, Yali Wang, Yu Qiao, Limin Wang, Yi Wang

Affiliation: Shanghai AI Lab; Zhejiang University; University of Science and Technology of China; Shanghai Jiao Tong University; Shenzhen Institutes of Advanced Technology, CAS; Nanjing University

Abstract: Although multimodal LLMs excel at broad visual reasoning, they lag on fine-grained or high-precision tasks. Prior efforts either add tool-usage skills or fold specific vision tasks into the autoregressive framework, often harming overall multimodal performance. We propose Task Preference Optimization (TPO), which introduces differentiable task preferences distilled from fine-grained vision tasks to guide optimization. Learnable task tokens form dynamic links between multiple task-specific heads and the core MLLM, enabling effective use of rich labeled data. TPO supports joint multi-task training, boosting overall performance by 14.6% versus baselines and delivering strong zero-shot generalization comparable to fully-supervised state-of-the-art models. We instantiate TPO on VideoChat and LLaVA, confirming significant gains and opening a scalable pathway to enhance MLLMs on diverse visual tasks.

Five Papers from Nanjing University’s School of Computer Science Large Model Innovation Center Accepted at ICLR 2025

Tue, 15 Apr 2025 00:00:00 +0000

ICLR (International Conference on Learning Representations) is one of the leading AI conferences focusing on deep learning and representation learning. Since its inception in 2013, ICLR has become a premier platform for machine learning research, particularly in deep learning, neural architectures, reinforcement learning, generative models, and NLP.

Five papers from the Large Model Innovation Center of Nanjing University’s School of Computer Science were accepted at ICLR 2025.

01

Title: TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
Authors: Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhengrong Yue, Yi Wang, Yali Wang, Yu Qiao, Limin Wang
Affiliations: Nanjing University, Shanghai AI Laboratory, Chinese Academy of Sciences, etc.
Link: https://openreview.net/forum?id=nAVejJURqZ
Abstract: Most existing video multimodal large models tend to focus on irrelevant segments when understanding long videos, often leading to hallucinations. Can we enhance MLLMs’ long-video QA performance by using temporal localization as an auxiliary task to pinpoint relevant subsegments? We propose TimeSuite, which incrementally fine-tunes short-video MLLMs with temporal-localization data to boost long-video understanding. TimeSuite includes: a simple, efficient long-video framework (VideoChat-T); a high-quality localization-based instruction tuning dataset (TimePro); and a tailored instruction task (Temporal Grounded Caption). Joint tuning guides MLLMs to focus on the correct segments, improving QA accuracy. First, VideoChat-T achieves expert-level temporal localization without external decoders while retaining strong QA generalization and zero-shot ability. Second, integrating the expert task enhances comprehensive long-video understanding, validating this hybrid approach. Experiments show VideoChat-T yields 5.6% and 6.8% accuracy gains on EgoSchema and Video-MME, respectively, and demonstrates superior zero-shot localization, matching supervised expert models after fine-tuning.

02

Title: CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding
Authors: Guo Chen, Yicheng Liu, Yifei Huang, Yuping He, Baoqi Pei, Jilan Xu, Yali Wang, Tong Lu, Limin Wang
Affiliations: Nanjing University, Shanghai AI Laboratory, Fudan University, Zhejiang University
Link: https://openreview.net/forum?id=le4IoZZHy1
Abstract: We introduce CG-Bench, a benchmark for long-video multimodal reasoning built on “Clue-Question-Answer” triplets. Unlike multiple-choice tests, models must both answer correctly and accurately locate the supporting video segments. CG-Bench offers three tasks: perception (basic visual skills), reasoning (temporal and multimodal integration), and hallucination detection (reliability under ambiguity). It uses dual evaluation: white-box IoU for localization precision and black-box Clue Recovery Rate for context dilution. Combining multiple-choice and open-ended forms with human annotations and heuristic rules, CG-Bench ensures evaluation quality. The dataset contains 1,219 long videos across 638 subcategories, totaling 12,129 QA pairs. Results show models like GPT-4o perform well on multiple choice but drop sharply when localization is required (white-box acc@IoU only 4.38%, open-ended accuracy <40%). Performance varies with video length, frame sampling, and multimodal cues, highlighting challenges in precise information retrieval for long-video reasoning.
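
The white-box metric rests on temporal intersection-over-union between a predicted clue interval and the annotated one. The Python sketch below shows that computation for a single pair of intervals; the function name and the `(start, end)` representation in seconds are assumptions for illustration, and CG-Bench's exact scoring protocol (for example, how multiple clue intervals or thresholds are handled) is not reproduced here.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals, each given as (start, end) in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Predicted clue at 10-30 s, annotated clue at 20-40 s:
# 10 s of overlap out of a 30 s union.
overlap_score = temporal_iou((10.0, 30.0), (20.0, 40.0))
disjoint_score = temporal_iou((0.0, 5.0), (50.0, 60.0))
```

A model can pick the right multiple-choice option while scoring near zero here, which is the gap the reported results highlight.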

03

Title: SPA: 3D Spatial-Awareness Enables Effective Embodied Representation
Authors: Haoyi Zhu, Honghui Yang, Yating Wang, Jiange Yang, Limin Wang, Tong He
Affiliations: University of Science and Technology of China, Shanghai AI Laboratory, Zhejiang University, Tongji University, Nanjing University
Link: https://openreview.net/forum?id=6TLdqAZgzn
Abstract: Spatial awareness is critical for robots in complex environments, but existing methods struggle to capture 3D geometry. We propose SPA, a visual representation framework that enhances 3D spatial awareness for embodied tasks. SPA trains on a large multi-view dataset with camera poses, depth, and semantic maps from synthetic and real robot scenes. It builds volumetric features from multi-view input, uses mask-based differentiable neural rendering to generate RGB, depth, and semantic maps, and applies Eikonal regularization with SDF supervision for geometric consistency. After 6,000 GPU hours, SPA outperforms baselines on 200+ tasks across real and eight simulated environments, ranking first in 30.3% of tasks.

04

Title: Proving Olympiad Inequalities by Synergizing LLMs and Symbolic Reasoning
Authors: Zenan Li, Zhaoyu Li, Wen Tang, Xian Zhang, Yuan Yao, Xujie Si, Fan Yang, Kaiyu Yang, Xiaoxing Ma
Affiliations: Nanjing University, University of Toronto, Microsoft Research Asia, Peking University, Meta
Link: https://openreview.net/forum?id=FiyS0ecSm0
Abstract: AI has advanced in competition-level theorem proving, including inequality proofs, where each step faces a huge search space. We present a neural-symbolic system that integrates neural networks with symbolic reasoning and excels on Olympiad-level inequality tasks. On a standard set of 20 problems, our system solves 16 on average (versus 15 by human gold medalists), outperforming GPT and DeepSeek. This result showcases the potential of neural-symbolic methods for complex mathematical reasoning, opening new avenues in automated theorem proving, education, and research.

05

Title: MeteoRA: Multiple-tasks Embedded LoRA for Large Language Models
Authors: Jingwei Xu, Junyu Lai, Yunpeng Huang
Affiliations: Nanjing University
Link: https://openreview.net/pdf?id=yOOJwR15xg
Abstract: The “pretrain + finetune” paradigm underpins LLM deployment, with LoRA as a popular efficient fine-tuning method. Yet task awareness and adapter switching remain challenging when multiple LoRA adapters are in use. We propose MeteoRA, a scalable multi-task LoRA architecture that embeds task-specific adapters and a routing component via a Mixture-of-Experts (MoE) design for adaptive adapter selection. An MoE acceleration strategy uses custom operators built on PyTorch and Triton to avoid MoE routing loops, achieving a 4× speedup. Experiments demonstrate MeteoRA’s effectiveness on composite tasks: it handles up to ten serial questions per inference and shows clear routing biases, confirming adaptive switching.
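
As a rough illustration of MoE-style routing over LoRA adapters, the Python sketch below adds a gated sum of low-rank updates to a frozen base projection. The function name `lora_moe_forward`, the top-k softmax gate, and all shapes and values are assumptions made for illustration; they are not taken from the MeteoRA code. The sketch also uses exactly the kind of per-adapter loop that the custom operators mentioned in the abstract are meant to avoid.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def lora_moe_forward(x, base_w, adapters, gate_w, top_k=2):
    """Base projection plus a gated mixture of LoRA updates.

    adapters: list of (A, B) pairs, with A of shape (r, d_in) and B of
              shape (d_out, r), so each adapter contributes B @ (A @ x).
    gate_w:   one row of gating weights per adapter (hypothetical router).
    """
    out = matvec(base_w, x)
    logits = matvec(gate_w, x)
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    weights = softmax([logits[i] for i in chosen])
    for w, i in zip(weights, chosen):
        a, b = adapters[i]
        delta = matvec(b, matvec(a, x))
        out = [o + w * d for o, d in zip(out, delta)]
    return out, chosen


base_w = [[1.0, 0.0], [0.0, 1.0]]
adapters = [
    ([[1.0, 0.0]], [[0.0], [2.0]]),
    ([[0.0, 1.0]], [[3.0], [0.0]]),
    ([[1.0, 1.0]], [[1.0], [1.0]]),
]
gate_w = [[5.0, 0.0], [0.0, 5.0], [-5.0, 0.0]]
output, routed_to = lora_moe_forward([1.0, 0.0], base_w, adapters, gate_w, top_k=1)
```

Here the router sends the input to the first adapter only, so the result is the base output plus that adapter's low-rank update. Different inputs would select different adapters without any manual switching.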

Shusheng InternVideo2.5 Open-Sourced, Precisely Finding the 'Needle in a Haystack' in Tens of Thousands of Frames, with Fine-Grained Spatiotemporal Perception

Tue, 11 Feb 2025 00:00:00 +0000

Recently, the Shanghai AI Lab, Nanjing University, and the Shenzhen Institutes of Advanced Technology jointly open-sourced the multi-modal video model Shusheng InternVideo2.5. In video understanding, the upgraded InternVideo2.5 improves both temporal span and fine-grained detail, expanding its capacity sixfold compared with the previous model. It supports precise “needle in a haystack” search within long videos containing tens of thousands of frames, allowing AI to interpret the complex real world more accurately and giving new impetus to a range of applications. Previously, China Central Television applied the Shusheng InternVideo series during its live broadcast of the Paris Olympics to pinpoint athletes’ scoring moments and the corresponding slow-motion replays, significantly improving TV production efficiency. With its stronger long-video processing, InternVideo2.5 will offer more efficient AI support for applications such as autonomous driving, security surveillance, and virtual reality.

Open source link: https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5
Paper link: https://arxiv.org/abs/2501.12386
Huggingface link: https://huggingface.co/OpenGVLab/InternVideo2_5_Chat_8B

Focus on Fine-Grained Spatiotemporal Understanding and Efficient Long Video Processing

Shanghai AI Lab has invested continuously in video multi-modal large model (Video MLLM) technology since 2022, successively launching and open-sourcing the general video foundation model Shusheng InternVideo, the video understanding large model Shusheng InternVideo2, and the dialogue-centric video understanding paradigm VideoChat. Drawing on this experience in video visual representation learning and multi-modal dialogue, the upgraded InternVideo2.5 focuses on fine-grained spatiotemporal understanding through deep integration of visual perception and language comprehension, achieving breakthroughs in long video understanding.

InternVideo2.5 Capability Characteristics:

Ultra-long video processing: Accurately locate targets within tens of thousands of frames, with processing length extended from 3,000 to 10,000 frames.
Fine-grained perception: Accurately identify and locate objects, scenes, and actions while comprehending subtle spatiotemporal relationships.
Integration of multiple visual capabilities: Not only supports general video Q&A but also proficiently handles specialized tasks such as object tracking and segmentation.

Left image: Performance comparison between InternVideo2.5 and other 8-billion-parameter open models on MVBench and VideoMME; Right image: InternVideo2.5 accurately tracks and analyzes videos.

LRC Combined with Progressive Training to Overcome Bottlenecks in Long Video Modeling

For long videos and fine-grained visual tasks, traditional video multi-modal large models struggle to track target objects accurately in ultra-long videos or to recognize subtle spatiotemporal relationships in complex scenes. For example, in “needle in a haystack” tasks, conventional methods require extensive computational resources yet deliver unsatisfactory localization accuracy, which limits industrial applications. To address this, Shanghai AI Lab and its joint research team built on the self-developed Shusheng InternVL2.5 base model and proposed Long-range Context Modeling (LRC).

The Two Core Modules of Long-range Context Modeling (LRC) Technology:

Hierarchical Context Compression (HiCo): Exploits redundancy in long video visual data through layered compression. Experimental results demonstrate that with HiCo, InternVideo2.5 can accurately locate target frames within tens of thousands of frames, leading in performance among open models.
Task Preference Optimization (TPO): Transforms annotations from various fine-grained visual tasks (such as object tracking, segmentation, and temporal localization) into differentiable task preferences, thereby guiding the model’s self-learning to extend its capabilities to specialized visual applications.
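
The layered compression idea behind HiCo can be illustrated with a toy example. The article only says that HiCo exploits redundancy in long video data through layered compression. The merging rule below (averaging adjacent feature vectors whose cosine similarity exceeds a threshold, first inside each clip and then across clips), the thresholds, and the names `merge_similar` and `hierarchical_compress` are all assumptions made for illustration, not the released implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def merge_similar(tokens, threshold):
    """Greedy pass that averages each token into the previous group when similar."""
    groups = []  # each entry is (running sum vector, count)
    for tok in tokens:
        if groups:
            total, count = groups[-1]
            mean = [s / count for s in total]
            if cosine(mean, tok) >= threshold:
                groups[-1] = ([s + t for s, t in zip(total, tok)], count + 1)
                continue
        groups.append((list(tok), 1))
    return [[s / count for s in total] for total, count in groups]

def hierarchical_compress(frames, clip_len, frame_threshold=0.95, clip_threshold=0.9):
    """Two-level compression: inside each clip, then across the whole video."""
    clip_tokens = []
    for start in range(0, len(frames), clip_len):
        # Level 1: remove redundancy between neighbouring frames of one clip.
        clip_tokens.extend(merge_similar(frames[start:start + clip_len], frame_threshold))
    # Level 2: remove redundancy that remains across clip boundaries.
    return merge_similar(clip_tokens, clip_threshold)

# 16 toy frame features: a static scene with one distinct "needle" frame.
static, needle = [1.0, 0.0], [0.0, 1.0]
frames = [static] * 8 + [needle] + [static] * 7
compressed = hierarchical_compress(frames, clip_len=4)
```

In this toy video, 16 frames shrink to 3 tokens and the distinct frame survives as its own token. That is the property a “needle in a haystack” search depends on.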

Additionally, the team pre-trained InternVideo2.5 on over 300,000 hours of video data using a progressive multi-stage training strategy, ensuring robust video processing capabilities. The training corpus includes vision-language alignment data, long video sequences, and specialized visual task data. Following the progressive training scheme of Shusheng InternVL, the approach enhances fine-grained perception and temporal understanding in three stages. The first stage covers basic learning for task recognition and video-language alignment. The second integrates and trains the task-specific components alongside visual concept pre-training. The third combines multi-task training with instruction fine-tuning on mixed corpora to optimize all model components. This method scales the model from “small to large” and refines the data from “coarse to fine”, reducing costs while enhancing performance.

The Chinese Academy of Sciences Academicians Forum on the Healthy Development and Empowerment of Large Models/AIGC Held in Nanjing

Tue, 16 Jan 2024 00:00:00 +0000

The 155th Frontier Forum of the Chinese Academy of Sciences Academicians, “The Healthy Development and Empowerment of Large Models/AIGC”, was held in Nanjing on January 6 and 7, 2024. The forum was organized by the Chinese Academy of Sciences Academicians, hosted by the Academic and Publishing Work Committee and the Standing Committee of the Information Technology Science Department of the Chinese Academy of Sciences, and co-organized by Nanjing University, Southeast University, and the publisher “Science in China”. Academicians Lu Jian and Huang Ru, together with Academician Wang Jian of the Chinese Academy of Engineering, served as forum co-chairs.

Academician Bao Xinhai, Director of the Academic and Publishing Work Committee, attended the forum along with Zhou Dejin from the Work Bureau of the Chinese Academy of Sciences Academicians, Ren Youqun from the Ministry of Education’s Teacher Work Department, Academician Huang Ru from Southeast University, and Xu Guanghui from the Jiangsu Science and Technology Department, who delivered opening remarks.

Six academicians of the Chinese Academy of Sciences (Bao Xinhai, Lu Jian, Huang Ru, Tan Tieniu, E Weinan, and Xu Zongben), two academicians of the Chinese Academy of Engineering (Gao Wen and Yang Shanlin), and nearly 300 experts from 87 universities, research institutes, and companies attended the forum. The participating organizations included the Chinese Academy of Sciences, Nanjing University, Southeast University, Hong Kong University of Science and Technology, iFlytek, Huawei, Alibaba, Xiaomi, Midea, and Geely Automobile Research Institute. More than half of the attendees were young scientists under 45.

The forum comprised two sessions: keynote presentations and special topic reports. In the keynote session, Academician Tan Tieniu discussed trends in generative AI; Academician Gao Wen introduced the Pengcheng Brain pre-trained large model platform and open-source collaborations; Academician Yang Shanlin presented AIGC and its scientific foundations; Academician E Weinan explained the basics of deep learning; Academician Xu Zongben discussed mathematical research on large models; Professor Guo Yike, an Academician of the Royal Academy of Engineering (UK) and Vice-Chancellor of Hong Kong University of Science and Technology, addressed the intrinsic scientific issues of large models; and AI experts from iFlytek, Huawei, and Alibaba showcased applications and innovative practices of large models.

In the special topic session, experts presented reports on eight topics: “Frontier and Collaborative Innovation in the Development of Large Models/AIGC”, “Empowering Technological Development with Large Models/AIGC”, “Boosting the Real Economy with Large Models/AIGC”, “Facilitating Educational Transformation with Large Models/AIGC”, “Large Models/AIGC and Intelligent Basic Software”, “Large Models/AIGC, Computing Infrastructure, and Chip Technology”, “Safety, Controllability, Privacy Protection, and Low-cost Deployment of Large Models/AIGC”, and “Governance and Management of Large Models/AIGC”. Following the reports, experts engaged in roundtable discussions on these topics.

After two days of discussions, the experts explored key technologies and challenges in the development of large models and AI, application scenarios, industrial empowerment, and legal and ethical risks, reaching some preliminary consensus. The forum outcomes will be released in the form of briefings and special reports.
