While DPO has proven effective for LLM alignment, it often struggles to achieve consistent improvements in multimodal scenarios. We find this is due to unconditional preference in multimodal preference optimization, where the image condition is overlooked. Through controlled comparisons, we discover that multimodal LLMs can achieve similar performance even when all images are removed from the multimodal preference data during DPO. We attribute this to a systematic gap between the theoretical expectations and practical implementations of the DPO objective in multimodal settings: while DPO aims to compute implicit rewards conditioned on all input modalities, in practice it may prioritize language-only preferences and overlook the image condition, leading to suboptimal model performance and increased hallucination.
We propose mDPO, a multimodal DPO objective that prevents the over-prioritization of language-only preferences by jointly optimizing text and image preferences, with a reward anchor that prevents a decrease in the likelihood of the chosen response.
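Below is a minimal PyTorch-style sketch of how such an objective could be assembled, based only on the description above. The function name `mdpo_loss`, the argument names, the use of a corrupted image as the less-preferred condition, and the equal weighting of the three terms are illustrative assumptions, not the paper's reference implementation; consult the paper and released code for the exact formulation.

```python
import torch.nn.functional as F

def mdpo_loss(
    policy_chosen_logps,          # log pi_theta(y_w | x, image)
    policy_rejected_logps,        # log pi_theta(y_l | x, image)
    policy_chosen_corrupt_logps,  # log pi_theta(y_w | x, corrupted image)
    ref_chosen_logps,             # same three quantities under the frozen
    ref_rejected_logps,           # reference model
    ref_chosen_corrupt_logps,
    beta=0.1,                     # DPO temperature (assumed value)
    anchor=0.0,                   # reward anchor threshold (assumed value)
):
    """Sketch of a conditional multimodal DPO loss: a standard text-preference
    term, an image-preference term, and a reward anchor on the chosen response."""
    # Implicit rewards: scaled log-ratios of policy vs. reference model.
    r_chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    r_rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    r_chosen_corrupt = beta * (policy_chosen_corrupt_logps - ref_chosen_corrupt_logps)

    # Standard DPO term: prefer the chosen over the rejected response
    # given the same (original) image.
    loss_text = -F.logsigmoid(r_chosen - r_rejected)

    # Image-conditional term: prefer the original image over a corrupted one
    # as the condition for the same chosen response, so the model cannot
    # ignore the visual input.
    loss_image = -F.logsigmoid(r_chosen - r_chosen_corrupt)

    # Reward anchor: discourage the chosen response's implicit reward from
    # falling below the anchor (0 by default), i.e. below the reference model.
    loss_anchor = -F.logsigmoid(r_chosen - anchor)

    return (loss_text + loss_image + loss_anchor).mean()
```

The log-probabilities would be computed per preference pair by the policy and a frozen reference model; how the corrupted image is produced (e.g., cropping or masking) is left unspecified here and is one of the design choices made in the paper.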
@inproceedings{wang2024mdpo,
title={mDPO: Conditional Preference Optimization for Multimodal Large Language Models},
author={Wang, Fei and Zhou, Wenxuan and Huang, James Y and Xu, Nan and Zhang, Sheng and Poon, Hoifung and Chen, Muhao},
booktitle={Proceedings of EMNLP 2024},
year={2024}
}