UMU-Bench:
Closing the Modality Gap in Multimodal Unlearning Evaluation

Zhejiang University, Hangzhou Dianzi University
NeurIPS 2025

Abstract

Although Multimodal Large Language Models (MLLMs) have advanced numerous fields, their training on extensive multimodal datasets introduces significant privacy concerns, prompting the necessity for effective unlearning methods. However, current multimodal unlearning approaches often directly adapt techniques from unimodal contexts, largely overlooking the critical issue of modality alignment, i.e., consistently removing knowledge across both unimodal and multimodal settings. To close this gap, we introduce UMU-Bench, a unified benchmark specifically targeting modality misalignment in multimodal unlearning. UMU-Bench consists of a meticulously curated dataset featuring 653 individual profiles, each described with both unimodal and multimodal knowledge. Additionally, novel tasks and evaluation metrics focusing on modality alignment are introduced, facilitating a comprehensive analysis of unimodal and multimodal unlearning effectiveness. Through extensive experimentation with state-of-the-art unlearning algorithms on UMU-Bench, we demonstrate prevalent modality misalignment issues in existing methods. These findings underscore the critical need for novel multimodal unlearning approaches explicitly considering modality alignment. The code and data are publicly available at https://github.com/QDRhhhh/UMU-bench.

Benchmark

In this paper, we introduce UMU-Bench, a knowledge-based benchmark that balances unimodal and multimodal data and is designed to evaluate unlearning in both settings.

Dataset composition and structure

Datasets

The dataset is composed of 500 fictitious individuals and 153 real individuals, each with a rich profile. Each profile contains several kinds of knowledge, including personal information such as an image, name, birthplace, birthdate, occupation, and more. Together, the profiles cover a broad spectrum of knowledge, spanning 70 countries, 270 regions, birthdates from 1950 to 2010, 145 distinct occupations, and diverse personal preferences for each individual.
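
To make the profile structure concrete, below is a minimal Python sketch of what a single profile could look like. The field names and example values are illustrative assumptions based on the attributes listed above, not the benchmark's actual schema; see the repository for the released format.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Profile:
    # Illustrative sketch of one UMU-Bench profile; field names are assumptions.
    name: str                 # individual's name
    image_path: str           # portrait image paired with the textual knowledge
    birthplace: str           # one of roughly 270 regions across 70 countries
    birthdate: str            # between 1950 and 2010
    occupation: str           # one of 145 distinct occupations
    preferences: List[str] = field(default_factory=list)  # personal preferences
    is_real: bool = False     # True for the 153 real individuals

# Example instance with fictitious data.
profile = Profile(
    name="Alex Moreau",
    image_path="images/alex_moreau.png",
    birthplace="Lyon, France",
    birthdate="1987-04-12",
    occupation="marine biologist",
    preferences=["hiking", "jazz"],
)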

The evaluation of unlearning is primarily conducted from two perspectives: Unlearning Completeness (UC) and Model Utility (UT). To facilitate these evaluations, the dataset is divided into three distinct subsets: the forget set, the retain set, and the real person set.

Forget Set

This subset is used to evaluate the model's UC. The forget set is drawn from the knowledge instances of the 500 fictitious individuals, with forgetting rates configured at 5%, 10%, and 15%. These instances are specifically chosen to assess how well the model forgets particular details. Ideally, after the unlearning process, the model should show a significant reduction in performance on this subset, as it is expected to have forgotten the associated knowledge.

Retain Set

This subset is designed to assess UT. It comprises the remaining 95%, 90%, or 85% of the 500 fictitious individuals (corresponding to the three forgetting rates) after the forget-set knowledge instances have been removed. The retain set evaluates the model's ability to keep relevant knowledge and maintain performance on the remaining data even after specific information has been unlearned. Ideally, the model should show minimal performance degradation on this set, indicating that unlearning has not unduly affected its ability to recall and use the retained knowledge.

Real Person Set

This subset also assesses UT and consists of the profiles of 153 real individuals. Its key property is independence from the forget set, so it evaluates the model's general performance and robustness. Because it represents real-world knowledge, it tests whether the model generalizes beyond the synthetic knowledge used in the forget set. Ideally, unlearning should not adversely affect performance on this set, ensuring that the model retains its utility and general capabilities after the unlearning process.
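
As a rough illustration of how the three subsets relate, the sketch below partitions the fictitious profiles at a given forgetting rate and keeps the real profiles aside. The released benchmark ships fixed splits, so the random sampling and the function name here are assumptions for exposition only.

import random

def split_subsets(fictitious_profiles, real_profiles, forget_rate=0.05, seed=0):
    # forget_rate is one of the configured rates (0.05, 0.10, 0.15).
    rng = random.Random(seed)
    shuffled = list(fictitious_profiles)
    rng.shuffle(shuffled)

    n_forget = int(len(shuffled) * forget_rate)
    forget_set = shuffled[:n_forget]        # knowledge to be unlearned (evaluates UC)
    retain_set = shuffled[n_forget:]        # knowledge to be preserved (evaluates UT)
    real_person_set = list(real_profiles)   # held-out real-world knowledge (evaluates UT)

    return forget_set, retain_set, real_person_set

# With 500 fictitious and 153 real profiles at a 10% forgetting rate, this yields
# a forget set of 50 profiles, a retain set of 450, and a real person set of 153.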

Results

Performance comparison of different unlearning algorithms on the UMU-Bench dataset, using the LLaVA-1.5-7B model across three forgetting rates (5%, 10%, and 15%).


Performance across three unlearning modalities: unimodal, multimodal, and mixed. The evaluation metric is the difference between the model's performance before and after unlearning.
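
The following minimal sketch shows one way such a before-versus-after difference could be computed; the scores and setting names are hypothetical and are not results from the paper.

def performance_drop(before, after):
    # For each setting, the score of the original model minus the score of the
    # same model after unlearning. On the forget set a large drop means more
    # complete unlearning; on the retain and real person sets it means lost utility.
    return {setting: before[setting] - after[setting] for setting in before}

# Hypothetical forget-set accuracies for one unlearning algorithm.
before = {"unimodal": 0.86, "multimodal": 0.84, "mixed": 0.85}
after = {"unimodal": 0.31, "multimodal": 0.58, "mixed": 0.44}
drops = performance_drop(before, after)
# Here the unimodal drop (~0.55) far exceeds the multimodal one (~0.26):
# exactly the kind of modality misalignment UMU-Bench is designed to surface.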


Evaluation of modality alignment across different unlearning algorithms (GA, GD) and a range of unimodal-to-multimodal loss balancing ratios (α : β). Each subfigure illustrates performance under varying proportions for three task types (i.e., classification, cloze, and generation) across unimodal (text-only), multimodal (text + image), and hybrid (mixed) unlearning setups.

The results demonstrate how different balancing ratios influence unlearning completeness and modality alignment, highlighting the trade-offs between unimodal and multimodal performance in each algorithm.
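
For intuition about the α : β sweep described above, here is a hedged sketch of how a GA- or GD-style objective could weigh a unimodal forget loss against a multimodal one. The function and its arguments are assumptions for illustration; the paper's exact objectives may differ.

def mixed_unlearning_loss(loss_uni_forget, loss_multi_forget,
                          loss_retain=None, alpha=1.0, beta=1.0):
    # loss_uni_forget / loss_multi_forget: language-modeling losses on unimodal
    # (text-only) and multimodal (text + image) forget samples; in practice these
    # would be tensor-valued losses from the MLLM's forward pass.
    forget_term = alpha * loss_uni_forget + beta * loss_multi_forget
    objective = -forget_term                 # GA-style: gradient ascent on forget data
    if loss_retain is not None:              # GD-style: additionally fit the retain data
        objective = objective + loss_retain
    return objective

# e.g. weighting unimodal forgetting twice as heavily as multimodal (alpha : beta = 2 : 1)
objective = mixed_unlearning_loss(2.3, 1.9, loss_retain=1.1, alpha=2.0, beta=1.0)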


BibTeX

@inproceedings{wang2025umu,
  author = {Wang, Chengye and Li, Yuyuan and Feng, Xiaohua and Chen, Chaochao and Zheng, Xiaolin and Yin, Jianwei},
  title = {UMU-Bench: Closing the Modality Gap in Multimodal Unlearning Evaluation},
  booktitle = {Advances in Neural Information Processing Systems},
  year = {2025}
}