Authors: Yinhong Liu, Jianfeng He, Hang Su, Ruixue Lian, Yi Nian, Jake W. Vincent, Srikanth Vishnubhotla, Robinson Piramuthu, Saab Mansour.
Updates: Our work has been accepted by EMNLP 2025 🎉
This is the official repository for the MDSEval benchmark. It includes all human annotations, benchmark data, and the implementation of our newly proposed data filtering framework, Mutually Exclusive Key Information (MEKI). MEKI is designed to filter high-quality multimodal data by ensuring that each modality contributes unique information.
Multimodal Dialogue Summarization (MDS) is an important task with wide-ranging applications. To develop effective MDS models, robust automatic evaluation methods are essential to reduce both costs and human effort. However, such methods require a strong meta-evaluation benchmark grounded in human annotations.
MDSEval is the first meta-evaluation benchmark for MDS. It consists of:
- Image-sharing dialogues
- Multiple corresponding summaries
- Human judgments across eight well-defined quality dimensions
To ensure data quality and diversity, we introduce a novel filtering framework, Mutually Exclusive Key Information (MEKI), which leverages complementary information across modalities.
Our contributions include:
- The first formalization of key evaluation dimensions specific to MDS
- A high-quality benchmark dataset for robust evaluation
- A comprehensive assessment of state-of-the-art evaluation methods, showing their limitations in distinguishing between summaries from advanced MLLMs and their vulnerability to various biases
Besides `requirements.txt`, we additionally depend on:
- The google-research repository, installed via the command in `prepare_dialog_data.sh`
- The external images listed in `MDSEval_annotations.json`, downloaded via `prepare_image_data.sh`
- The model checkpoint ViT-H-14-378-quickgelu, loaded by `meki.py`
We first download and merge the textual dialogues from their sources (PhotoChat and DialogCC):
bash prepare_dialog_data.sh
Then download the images for MDSEval:
bash prepare_image_data.sh
Note that the original hosting website is not very stable, so you may need to run the script multiple times to ensure all images are successfully downloaded.
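Because the hosting site is flaky, re-running the script by hand can get tedious. A small retry wrapper can automate it (a hypothetical convenience, not part of the repository; the `run_with_retries` helper and its defaults are illustrative):

```python
import subprocess

def run_with_retries(cmd, max_attempts=5):
    """Re-run `cmd` until it exits 0 or the retry budget is exhausted."""
    for attempt in range(1, max_attempts + 1):
        if subprocess.run(cmd).returncode == 0:
            return True
        print(f"Attempt {attempt} failed; retrying...")
    return False

# For example: run_with_retries(["bash", "prepare_image_data.sh"])
```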
You can explore the MDSEval dataset using the provided notebook `demonstrations.ipynb`, which contains functions to load and visualize the data.
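For a quick look without the notebook, the annotation file can also be loaded directly. A minimal sketch, assuming only that `MDSEval_annotations.json` parses as standard JSON (the schema itself is documented in the notebook, so no field names are assumed here):

```python
import json
from pathlib import Path

def load_annotations(path="MDSEval_annotations.json"):
    """Parse the MDSEval annotation file.

    Only assumes the file is valid JSON; see demonstrations.ipynb for the
    official loading and visualization utilities.
    """
    return json.loads(Path(path).read_text())
```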
The MDSEval dataset includes the following statistics:
Statistic | Value |
---|---|
Total number of dialogues | 198 |
Summaries per dialogue | 5 |
Avg. turns per dialogue | 17.1 |
Avg. tokens per dialogue | 209.0 |
Evaluation aspects | 8 |
Avg. annotators per summary | 2.9 |
Avg. sentences per summary | 4.5 |
Human annotations across eight evaluation dimensions:
Evaluation Dimensions | Scale | Note |
---|---|---|
Multimodal Coherence (COH) | 1-5 | |
Conciseness (CON) | 1-5 | |
Multimodal Coverage (COV) | 1-5 | |
Multimodal Information Balancing (BAL) | 1-7 | Bipolar |
Topic Progression (PROG) | 1-5 | |
Multimodal Faithfulness (FAI) | Faithful, Not faithful to image, Not faithful to text, Not faithful to both | Both sentence level and summary level |
To ensure the dataset is sufficiently challenging for multimodal summarization, dialogues should contain key information uniquely conveyed by a single modality — meaning it cannot be inferred from the other. To quantify this, we introduce Mutually Exclusive Key Information (MEKI) as a selection metric.
We embed both the image and the textual dialogue into a shared semantic space, e.g. using the CLIP model, denoted as vectors $I$ and $T$ respectively.
To measure the Exclusive Information (EI) in the image modality given the text, we take the component of $I$ orthogonal to $T$, denoted $I_T^\perp$.
Next, to identify Exclusive Key Information (EKI) — crucial content uniquely conveyed by one modality — we first generate a pseudo-summary, embedded as $S$, and measure the norm of the projection of $I_T^\perp$ onto it:

[
\operatorname{EKI}(I \mid T; S) = \left| \operatorname{Proj}_S(I_T^\perp) \right| =
\left| \frac{\langle I_T^\perp, S\rangle}{\langle S, S\rangle} S \right|
]

which quantifies the extent of exclusive image-based key information. Similarly, we compute $\operatorname{EKI}(T \mid I; S)$ for the textual modality.
Finally, the MEKI score aggregates both components:
[
\operatorname{MEKI}(I, T; S) = \lambda \operatorname{EKI}(I \mid T; S) + (1-\lambda)\operatorname{EKI}(T \mid I; S)
]
where $\lambda \in [0, 1]$ balances the contributions of the two modalities.
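The computation above can be sketched in plain Python (an illustrative reading of the formulas with list-based vectors, not the official implementation in `meki.py`; the function names are ours):

```python
from math import sqrt

def _dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def exclusive_key_info(a, b, s):
    """EKI(a | b; s): take the component of `a` orthogonal to `b`,
    project it onto the pseudo-summary embedding `s`, return the norm."""
    coef = _dot(a, b) / _dot(b, b)
    a_perp = [x - coef * y for x, y in zip(a, b)]  # a's exclusive component
    proj = _dot(a_perp, s) / _dot(s, s)            # scalar projection coefficient
    return sqrt(_dot(s, s)) * abs(proj)            # norm of the projected vector

def meki(image_emb, text_emb, summary_emb, lam=0.5):
    """MEKI(I, T; S) = lam * EKI(I|T;S) + (1-lam) * EKI(T|I;S)."""
    return (lam * exclusive_key_info(image_emb, text_emb, summary_emb)
            + (1 - lam) * exclusive_key_info(text_emb, image_emb, summary_emb))
```

With orthogonal toy embeddings, each modality's exclusive component is the vector itself, so each EKI term reduces to the norm of its projection onto the pseudo-summary.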
The MEKI implementation is provided in `meki.py`. Please follow the instructions in the file to use it.
MDSEval is constructed using images and dialogues from the following sources:
Accordingly, we release MDSEval under the Apache 2.0 License.
If you find the benchmark useful, please consider citing our work.
This was an internship project that has concluded, so this repository will not receive regular updates.