Below we show results obtained from the BLIP-Large models with and without MOCHa on 500 randomly selected test images from Flickr30K and MS-COCO.
Click "Previous" and "Next" to browse through the results. "Jump To" lets the user skip to a specific image. "Show MS-COCO"/"Show Flickr30K" toggles between the two datasets.
To save screen space, we recommend clicking "Hide Instructions" once you have finished reading.
We report the difference in contradiction probability between the top beams of the two models, coloring positive differences in green (for these, the captions predicted with MOCHa contradict the GT captions less)
and negative differences in red (for these, the captions predicted with MOCHa contradict the GT captions more). The color intensity is proportional to the magnitude of the difference.
Contradiction probabilities are computed with respect to the ground-truth caption shown below.
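
The coloring rule described above can be sketched as follows. This is only an illustration, not the code behind this page: the function names, the sign convention (probability without MOCHa minus probability with MOCHa, inferred from the green/red description), and the use of an RGBA alpha channel for intensity are all our assumptions.

```python
def contradiction_delta(p_without: float, p_with: float) -> float:
    """Signed difference in contradiction probability (hypothetical helper).

    Assumed convention: positive when the caption predicted with MOCHa
    contradicts the ground-truth caption less than the one without it.
    """
    return p_without - p_with


def delta_to_rgba(delta: float, max_abs: float = 1.0) -> tuple:
    """Map a signed difference to a green or red RGBA color.

    The alpha channel grows linearly with |delta| and is clipped at 1.0,
    so the intensity is proportional to the magnitude of the difference.
    """
    if max_abs <= 0:
        raise ValueError("max_abs must be positive")
    alpha = min(abs(delta) / max_abs, 1.0)
    if delta > 0:
        return (0, 255, 0, alpha)   # green: MOCHa contradicts less
    if delta < 0:
        return (255, 0, 0, alpha)   # red: MOCHa contradicts more
    return (0, 0, 0, 0.0)           # no difference: fully transparent
```

For example, under these assumptions a sample with contradiction probability 0.75 without MOCHa and 0.25 with MOCHa would be shown in green at half intensity.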
To focus first on cases where BLIP's predictions with and without MOCHa differ substantially, samples are ordered by the n-gram similarity between these two predictions.
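
A minimal sketch of such an ordering is given below. The exact n-gram measure used for this page is not specified here, so the sketch substitutes a simple stand-in: Jaccard overlap of word bigrams. The function names, the dictionary keys `"without"` and `"with"`, and the ascending sort (least similar pairs first) are assumptions made for illustration.

```python
def ngrams(text: str, n: int = 2) -> set:
    """Return the set of word n-grams of a lowercased, whitespace-split caption."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Jaccard overlap of word n-grams (a stand-in, not the page's actual metric)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        # Both captions are shorter than n words: fall back to exact match.
        return 1.0 if a.lower().split() == b.lower().split() else 0.0
    return len(grams_a & grams_b) / len(union)


def order_by_dissimilarity(samples: list) -> list:
    """Sort samples so the least similar prediction pairs come first."""
    return sorted(
        samples,
        key=lambda s: ngram_similarity(s["without"], s["with"]),
    )


# Hypothetical captions, for illustration only.
samples = [
    {"id": 1, "without": "a dog runs on grass", "with": "a dog runs on grass"},
    {"id": 2, "without": "a man holds a red umbrella", "with": "a woman rides a bike"},
    {"id": 3, "without": "a cat sits on a sofa", "with": "a cat sits on a chair"},
]
ordered_ids = [s["id"] for s in order_by_dissimilarity(samples)]
```

With this stand-in measure, identical predictions score 1.0 and land at the end of the list, while pairs sharing no bigrams score 0.0 and are shown first.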
We refer the reader to the accompanying PDF document for additional details about our qualitative comparison and for additional results.