Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Language Models

1Institute of Science Tokyo   2MBZUAI   3NII LLMC
LREC 2026 (to appear)

Abstract

Vision-language models (VLMs) have shown impressive abilities across a range of multi-modal tasks. However, existing metrics for evaluating the quality of text generated by VLMs typically focus on an overall evaluation for a specific task, such as image captioning. While the overall evaluation is essential for any task, the criteria prioritized can differ depending on the task, making it challenging for current metrics to adapt to multi-task scenarios. To address this limitation, we propose HarmonicEval, a reference-free comprehensive evaluation metric that aggregates criterion-wise scores to produce the overall score in a bottom-up manner. Furthermore, to assess the generalizability of automatic evaluation metrics in multi-task scenarios, we construct the Multi-task Multi-criteria Human Evaluation (MMHE) benchmark, which comprises 18,000 expert human judgments across four multi-modal tasks. Our experiments demonstrate that HarmonicEval achieves higher correlations with human judgments than conventional metrics while providing numerical scores for each criterion.

HarmonicEval

Overview of HarmonicEval

HarmonicEval consists of two steps. (a) Criterion-wise scoring prompts a VLM to evaluate the input text on each criterion, followed by score smoothing based on first-order statistics to improve robustness. (b) Score aggregation produces an overall score via harmonic weighting based on second-order statistics, which reduces statistical fluctuations.
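The two steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes score smoothing means taking the expectation over the VLM's score-token distribution (a first-order statistic), and that harmonic weighting means inverse-variance weights (a second-order statistic), so that criteria with noisier score distributions contribute less to the overall score. The function names and the 1e-8 stabilizer are hypothetical.

```python
def smooth_score(score_probs):
    # Score smoothing (assumed form): take the expectation over the VLM's
    # score-token distribution instead of the single most likely score.
    # `score_probs` maps each candidate score (e.g. 1..5) to its probability.
    total = sum(score_probs.values())
    return sum(s * p / total for s, p in score_probs.items())

def harmonic_overall(per_criterion_probs):
    # Score aggregation (assumed form): inverse-variance weighting, so that
    # criteria whose score distributions have low variance (i.e. the VLM is
    # more certain) receive larger weights in the overall score.
    means, weights = [], []
    for probs in per_criterion_probs:
        total = sum(probs.values())
        mean = sum(s * p / total for s, p in probs.items())
        var = sum(((s - mean) ** 2) * p / total for s, p in probs.items())
        means.append(mean)
        weights.append(1.0 / (var + 1e-8))  # hypothetical stabilizer
    z = sum(weights)
    return sum(m * w / z for m, w in zip(means, weights))
```

For example, a criterion scored with a confident distribution such as {4: 0.9, 5: 0.1} would dominate a criterion scored with a spread-out distribution such as {1: 0.5, 5: 0.5}, pulling the overall score toward the more reliable judgment.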

MMHE Benchmark

Overview of MMHE benchmark

The MMHE benchmark is a multi-task, multi-criteria human evaluation benchmark spanning four multi-modal tasks. Each candidate text is manually evaluated by three expert annotators.

Main Results

Main results table 1

HarmonicEval achieves higher correlations with human judgments than conventional metrics on MMHE.

Main results table 2

Criterion-level analysis on MMHE reveals that existing metrics over- or under-weight specific criteria, whereas HarmonicEval achieves the highest correlation with human judgments on most criteria.

BibTeX

@misc{ohi2025multimodalmultitaskmulticriteriaautomatic,
      title={Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Language Models},
      author={Masanari Ohi and Masahiro Kaneko and Naoaki Okazaki and Nakamasa Inoue},
      year={2025},
      eprint={2412.14613},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.14613},
}