mDPO: Conditional Preference Optimization for Multimodal Large Language Models

USC · UC Davis · Microsoft Research

EMNLP 2024

The Pitfall of Multimodal Preference Optimization


While DPO has proven effective for LLM alignment, it often struggles to achieve consistent improvements in multimodal scenarios. We find that this stems from unconditional preference in multimodal preference optimization, where the image condition is overlooked. Through controlled comparisons, we discover that multimodal LLMs can achieve similar performance even when all images are removed from the multimodal preference data during DPO. We attribute this to a systematic gap between the theoretical expectation and the practical behavior of the DPO objective in multimodal settings: although DPO is meant to compute implicit rewards conditioned on all input modalities, it may prioritize language-only preferences and overlook the image condition, leading to suboptimal model performance and increased hallucination.
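
For reference, the standard DPO objective on multimodal preference data can be written as below; the notation (m for the image, q for the question, y_w and y_l for the chosen and rejected responses) is ours and is only meant to make the image condition explicit.

        \mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(m,\, q,\, y_w,\, y_l)}\!\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid m, q)}{\pi_{\mathrm{ref}}(y_w \mid m, q)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid m, q)}{\pi_{\mathrm{ref}}(y_l \mid m, q)} \right) \right]

Nothing in this objective forces the optimizer to use m: the preference between y_w and y_l can often be satisfied from the text alone, which is exactly the unconditional-preference gap described above.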

mDPO: Conditional and Anchored Preference Optimization


We propose mDPO, a multimodal DPO objective that avoids over-prioritizing language-only preferences by jointly optimizing text and image preferences, and adds a reward anchor that keeps the likelihood of the chosen response from decreasing.
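
A minimal PyTorch-style sketch of such a combined objective is shown below. The function name, the use of a degraded (e.g., heavily cropped) image as the less-informative visual condition, and anchoring the chosen reward at zero are illustrative assumptions for this sketch, not the exact implementation.

    import torch
    import torch.nn.functional as F

    def mdpo_loss(logp_w, logp_l, logp_w_ref, logp_l_ref,
                  logp_w_badimg, logp_w_badimg_ref, beta=0.1):
        """Sketch of a conditional + anchored multimodal preference loss.

        logp_w / logp_l         : summed log-probs of the chosen / rejected response
                                  under the policy, conditioned on (image, question).
        logp_*_ref              : the same quantities under the frozen reference model.
        logp_w_badimg(_ref)     : log-prob of the chosen response when the image
                                  condition is degraded (an illustrative choice).
        """
        # Text preference (standard DPO term): prefer the chosen response over the rejected one.
        text_margin = beta * (logp_w - logp_w_ref) - beta * (logp_l - logp_l_ref)
        loss_text = -F.logsigmoid(text_margin)

        # Image preference (conditional term): the same chosen response should be more
        # likely under the original image than under the degraded image.
        image_margin = beta * (logp_w - logp_w_ref) - beta * (logp_w_badimg - logp_w_badimg_ref)
        loss_image = -F.logsigmoid(image_margin)

        # Reward anchor: keep the implicit reward of the chosen response non-negative,
        # so its likelihood does not drop below the reference model's.
        anchor_margin = beta * (logp_w - logp_w_ref)
        loss_anchor = -F.logsigmoid(anchor_margin)  # anchored at 0 in this sketch

        return (loss_text + loss_image + loss_anchor).mean()

The image-preference term is what restores the conditioning on the image: it can only be reduced by actually attending to the visual input, since the response text is identical on both sides of the comparison.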

Qualitative Results


Top: When trained with standard DPO, Bunny often assumes the image description in the question is correct and responds accordingly, even when the question contains an adversarial premise about the image. In contrast, mDPO identifies the false premise in the question by referencing the image. Bottom: Bunny trained with standard DPO may disregard the image and offer an educated guess as the answer. Conversely, mDPO delivers a correct answer that is conditioned on the image.

Quantitative Results


Experiments on multimodal LLMs of different sizes (Bunny-3B and LLaVA-7B) and widely used benchmarks show that mDPO effectively addresses the unconditional preference issue and significantly improves model performance, particularly in reducing hallucination.

Citation


        @inproceedings{wang2024mdpo,
          title={mDPO: Conditional Preference Optimization for Multimodal Large Language Models},
          author={Wang, Fei and Zhou, Wenxuan and Huang, James Y and Xu, Nan and Zhang, Sheng and Poon, Hoifung and Chen, Muhao},
          booktitle={Proceedings of EMNLP 2024},
          year={2024}
        }