Mitigating Open-Vocabulary Caption Hallucinations

Hallucinated details are prevalent in the outputs of modern image captioning models. Prior work has largely focused on detecting or mitigating hallucinations by using closed-vocabulary object lists, which simplify the problem but fail to capture most types of hallucinations that occur in practice. By leveraging recent progress in generative foundation models, we propose a unified framework for quantifying and mitigating open-vocabulary hallucinations.


First, we introduce OpenCHAIR, a benchmark for evaluating open-vocabulary hallucinations which surpasses the existing benchmark CHAIR both in diversity and accuracy:




Additionally, we introduce MOCHa, a reinforcement learning-based approach that adjusts captioning models to output detailed, valid captions while avoiding such hallucinations:

Hallucinated details are shown as highlighted text.

Abstract

While recent years have seen rapid progress in image-conditioned text generation, image captioning still suffers from the fundamental issue of hallucinations, namely, the generation of spurious details that cannot be inferred from the given image. Existing methods largely use closed-vocabulary object lists to mitigate or evaluate hallucinations in image captioning, ignoring most types of hallucinations that occur in practice. To this end, we propose a framework for addressing hallucinations in image captioning in the open-vocabulary setting, including quantifying their presence and optimizing to mitigate such hallucinations. Our OpenCHAIR benchmark leverages generative foundation models to evaluate open-vocabulary caption hallucinations, surpassing the popular CHAIR benchmark in both diversity and accuracy. To mitigate open-vocabulary hallucinations on the sequence level, we propose MOCHa, an approach harnessing advancements in reinforcement learning. Our multi-objective reward function explicitly targets the trade-off between fidelity and adequacy in generations without requiring any strong supervision. MOCHa improves a large variety of image captioning models, as captured by our OpenCHAIR benchmark as well as other existing metrics.

OpenCHAIR

Instead of relying on closed object lists, OpenCHAIR uses a knowledgeable LLM to perform fine-grained hallucination detection over a significantly larger vocabulary:


To exploit the LLM's full potential, we construct a new dataset by generating 5000 captions with highly diverse objects and having a powerful text-to-image model generate images for them. This not only significantly increases the benchmark's diversity but also improves evaluation accuracy with respect to a human evaluator:


*CHAIR also includes a coarse synonym list.
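As a rough illustration of the idea above, an open-vocabulary hallucination rate can be sketched as the fraction of mentioned objects that a judge considers unsupported by the ground-truth caption. This is a minimal sketch, not the OpenCHAIR implementation: `extract_objects` and `is_supported` are hypothetical callables (in practice a text parser and an LLM query), and the toy stand-ins below use simple word matching only so that the example runs without any model.

```python
from typing import Callable, Iterable, List, Tuple


def open_vocab_hallucination_rate(
    samples: Iterable[Tuple[str, str]],
    extract_objects: Callable[[str], List[str]],
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of objects mentioned in predicted captions that the judge
    marks as unsupported by the matching reference caption."""
    total = 0
    hallucinated = 0
    for predicted, reference in samples:
        for obj in extract_objects(predicted):
            total += 1
            if not is_supported(obj, reference):
                hallucinated += 1
    return hallucinated / total if total else 0.0


# Toy stand-ins (assumptions): a real pipeline would use a parser and an LLM.
TOY_VOCAB = {"dog", "frisbee", "park", "chef", "spatula"}


def toy_extract_objects(caption: str) -> List[str]:
    return [word for word in caption.lower().split() if word in TOY_VOCAB]


def toy_is_supported(obj: str, reference: str) -> bool:
    return obj in reference.lower().split()


samples = [
    ("a dog chasing a frisbee in a park", "a dog runs across a grassy park"),
    ("a chef holding a spatula", "a chef holding a spatula in a kitchen"),
]
rate = open_vocab_hallucination_rate(samples, toy_extract_objects, toy_is_supported)
print(f"hallucination rate: {rate:.2f}")
```

Here only "frisbee" is unsupported, so one of the five mentioned objects counts as hallucinated. Passing the judge in as a callable keeps the metric independent of any closed object list.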

OpenCHAIR vs. CHAIR

We display the object coverage of CHAIR (over MS-COCO) and OpenCHAIR, measured as the number of unique objects:

We provide a list of all additional objects in OpenCHAIR:

sculpture, haystack, monument, boar, buffet, whiskey, land, capsule, bulldozer, jester, cougar, skateboarder, greenhouse, cigarette, tool, cookie, trampoline, rifle, pigtail, tile, cement, wood, sunflower, police, honeycomb, hive, comb, cottage, eyeglass, vine, alley, scar, reef, cable, sparkler, gingerbread, deck, dirt, bartender, sled, hotel, armor, picnic, yard, battlefield, pasture, pond, map, jet, badminton, pendant, steel, bonnet, knuckle, sun, rafter, cafeteria, sauce, snorkel, ramp, globe, robot, bag, movie, staircase, chopstick, foot, cream, chin, pepperoni, swamp, knee, diver, bike, tadpole, town, hacksaw, mustard, iron, spit, railing, claw, gravel, bullwhip, mother, tuna, workshop, underwear, beast, package, cartwheel, lace, shore, garage, fish, bull, snowball, mist, shelf, metal, fabric, lipstick, shirt, alcohol, pistol, asteroid, medal, tightrope, video, wagon, salad, worm, hook, hourglass, cabinet, tower, waterfront, jar, piano, grapevine, singer, baby, beer, steeple, toad, ice, beach, cloud, hoodie, fingerprint, pinstripe, rollerblade, lime, road, coin, flower, heel, shark, graph, dessert, pallet, grill, carton, ham, motorbike, frog, swimwear, bar, streetlight, head, cloth, headscarf, china, headdress, jean, badge, pet, gift, sack, plant, dandelion, riverbank, sharpener, iphone, saloon, fur, waiter, midwife, microphone, flesh, joystick, card, propeller, sea, iceberg, treadmill, gorilla, dancer, megaphone, restroom, skull, prison, shrimp, napkin, shaft, kale, harmonica, ipad, rhino, boyfriend, typewriter, balloon, gearbox, spice, rock, saddle, fluid, bullfight, hatchet, screw, site, falcon, beaker, driftwood, deadbolt, drawer, bumper, rainforest, helicopter, rollercoaster, penny, throne, trenchcoat, hardhat, tutu, arm, breast, spear, casserole, room, talon, machinery, mammal, corral, flute, sunroof, bridge, chainsaw, baton, fog, human, honey, patch, gate, toolbox, mascara, pasta, antenna, signpost, arrow, bib, ostrich, equipment, grid, sunset, 
spatula, stereo, tank, uniform, sushi, headstand, church, canoe, washroom, flock, ground, popsicle, barn, dove, foliage, piston, muffin, plaque, duck, basement, forklift, pencil, strap, bullfighting, bagpipe, teacup, banner, lighthouse, lighter, parrot, chip, necktie, bowling, skin, glue, backyard, factory, gym, coastline, hawk, pail, emerald, sundae, comet, bracelet, paint, watch, river, porch, crane, wristwatch, pole, garden, gun, appliance, owl, wheelbarrow, trumpet, feather, hurricane, canine, sleeve, courtyard, pendulum, desert, postcard, cushion, trapeze, cassette, dunk, lake, poppy, microscope, tavern, dishwasher, chess, armchair, bug, bison, handkerchief, smoke, drugs, berry, pickaxe, roof, cowboy, lightning, orchid, hill, cattle, playpen, water, fence, scotch, lamppost, meat, cardboard, paper, turtleneck, polaroid, skyscraper, cheetah, bath, pinwheel, tablecloth, stiletto, sombrero, blade, street, spacesuit, stage, boulder, morgue, neighborhood, calendar, pavement, park, sash, suit, tongue, yarn, worker, village, manuscript, frosting, basket, panda, bumblebee, wine, leather, scanner, seed, liquor, duckling, acupuncturist, child, lung, planter, stopwatch, ship, cherry, rainstorm, mane, figurine, grass, grassland, gargoyle, lumberjack, speedboat, computer, steak, cactus, kilt, journal, bronco, beehive, ponytail, mannequin, cowgirl, animal, bicep, wheat, dam, fridge, runway, seal, balcony, chimney, bell, stone, radio, bay, haircut, keyhole, dashboard, tuxedo, vessel, hammer, hole, taxi, wheel, campfire, headpiece, wool, alligator, gymnasium, grain, macaroni, bow, jail, salsa, ticket, maid, lever, fireman, pool, laundry, heart, gymnast, cavern, mop, poncho, flashlight, schoolyard, shelter, leg, costume, baker, extinguisher, pancake, armadillo, stocking, sidewalk, syringe, cave, bush, denim, photo, gem, sofa, inhaler, supermarket, kayak, people, rim, poodle, tanker, hoof, cabin, bulb, leaf, sweatshirt, shotgun, bullhorn, burner, debris, cannon, fishbowl, 
hanger, bunny, loaf, sword, shower, tapestry, coffee, blackbird, eye, bee, windmill, duster, pod, shed, vehicle, roller, insect, turban, chimpanzee, runner, mountain, patio, potato, jewel, tuba, tourist, arcade, closet, marshmallow, bedroom, petal, earphone, lasso, shipyard, rain, chick, doll, cappuccino, handle, bead, pianist, bible, trunk, squid, farmhouse, chameleon, boardroom, utensil, baseball, pipe, clown, toupee, actor, peanut, aquarium, stair, diner, pig, fan, copper, compass, cafe, hoe, cupboard, cola, pilot, rink, garbage, turkey, ribbon, hilltop, butt, pocket, sleigh, shell, drawing, meter, brush, firewood, omelet, lion, cloak, dice, trailer, grandmother, mule, torch, traffic, grenade, tail, shack, towel, ruby, salmon, gondola, swing, pocketknife, horn, eagle, pearl, chain, tiger, sand, bookbag, tricycle, toy, corkscrew, charcoal, text, somersault, man, pinecone, grandfather, hedge, chocolate, infant, wheelchair, sugar, bread, wand, tape, hose, boot, blackboard, bookstore, doorway, wing, pin, skeleton, dial, marinara, needle, beanie, forehead, meteorite, wetsuit, waterfall, barbecue, flask, candy, pebble, skirt, thread, sparrow, hoop, visor, oyster, oatmeal, waffle, girl, roadway, stump, bubblegum, mummy, tomato, candelabra, picture, spaghetti, sticker, cheek, gold, rabbit, vat, canvas, sledgehammer, string, dish, reporter, guitar, firetruck, plastic, house, parachute, brownie, firefighter, handlebar, table, rubble, gnome, bark, satchel, chalk, nickel, cube, milkshake, mustache, apron, workstation, firework, treehouse, container, cub, jewelry, stagecoach, groom, beard, glass, toothpaste, watermelon, beekeeper, monkey, trash, screen, pollen, bucket, wig, carriage, waterhole, chapel, lightbulb, dartboard, tweezer, handstand, tequila, straw, bathrobe, penguin, lava, lily, cobblestone, noodle, poster, bagel, sandcastle, jukebox, wardrobe, pyramid, nightgown, lumber, dime, stem, fin, afro, burger, pirate, building, kitchen, cigar, snow, cyclist, seashell, 
syrup, stain, sill, luggage, phone, washcloth, medication, tracksuit, creek, bamboo, key, surf, hydrant, eyelash, meal, airship, wallpaper, windowsill, bluebird, net, vest, flame, jaw, fire, battery, snail, flour, cupcake, box, kitten, stool, pit, paperwork, children, smoothie, moat, chimp, photograph, screwdriver, awning, darts, chickadee, hula, raincoat, bullfighter, lego, crack, bicyclist, pajama, vineyard, mountaintop, leash, puppet, album, nut, stroller, canyon, sketch, clay, touchscreen, pudding, certificate, backseat, piggy, dumbbell, seashore, crowd, curtain, wrist, seaweed, soldier, seagull, printer, recliner, pastry, parent, mud, toe, market, lid, ladder, sling, hedgehog, ax, ceiling, pub, beaver, puppy, crib, notepad, onion, cubicle, swimsuit, chisel, airport, coal, soup, elk, saucer, shrub, ukulele, guy, cinema, wrench, meatball, storybook, sock, bean, cocoa, ant, beet, mall, jeweler, campsite, swan, receipt, tugboat, cereal, daughter, dollar, pen, grave, pantry, pan, lobby, stairwell, astronaut, crate, wall, gumball, mailbox, pizzeria, helmet, bolt, dreadlock, puddle, cap, tiara, food, icing, boxer, taco, sailor, cotton, blood, engine, body, lip, banjo, television, army, fox, squirrel, flag, seat, eyeshadow, surgeon, barbell, hospital, hardware, donkey, slingshot, ram, princess, missile, koala, cashier, bouquet, star, brick, arch, limb, peel, pineapple, flowerpot, shoulder, doctor, graveyard, oil, temple, sponge, jungle, projector, mitt, pony, earring, seesaw, bathtub, gadget, tray, snowstorm, step, camper, can, bubble, snowflake, office, clothesline, button, billiard, legging, collage, mast, jug, crab, zoo, medallion, lollipop, blueprint, pigeon, smokestack, corn, bank, moustache, ranch, candle, urn, cork, waltz, chicken, throat, lane, dagger, thermometer, pumpkin, sidecar, dragonfly, money, lamb, cargo, birdhouse, antelope, toadstool, pill, cockroach, glove, cellar, barnyard, model, lettuce, leotard, cityscape, fountain, band, tattoo, server, gum, 
daisy, smartphone, skydiver, skating, tripod, purse, jackrabbit, freezer, necklace, hatchling, stairway, lizard, mold, warehouse, peg, saxophone, flagpole, arena, tooth, photographer, swimmer, bookmark, foam, ginger, rodeo, beak, gown, night, locker, escalator, shield, rainbow, lab, blanket, sweater, dress, coat, highchair, lunchbox, knob, palace, bill, milk, auditorium, ottoman, violin, kid, spa, harp, scarf, neck, farmer, ring, soda, branch, horseshoe, doghouse, mask, eraser, grape, whiteboard, hamburger, chalkboard, nose, martini, checkerboard, broom, organ, bikini, goatee, diploma, crayon, turtle, teapot, sunscreen, bullring, plank, pug, baboon, freckle, skunk, whale, gemstone, refinery, jacket, sign, magazine, juice, ski, crater, painting, hiker, emu, chandelier, lamp, submarine, salon, film, shuffleboard, ivory, camera, ledge, spotlight, hillside, jump, twig, hand, vegetable, reindeer, pie, therapist, pitchfork, silverware, tube, cheese, slipper, cone, jumpsuit, female, newspaper, waist, actress, outdoors, wire, notebook, bathroom, clarinet, boy, lavender, forest, bat, medicine, strawberry, tree, crayfish, dentist, moon, hall, trophy, shoehorn, shovel, perfume, board, pegboard, brass, cello, axe, barge, cinnamon, chessboard, restaurant, workbench, bowtie, harbor, popcorn, cage, yacht, peacock, mallet, chili, dart, nipple, golfer, thumb, starfish, skate, lock, waitress, birdcage, earmuff, volcano, robe, bassinet, cut, puzzle, mayonnaise, sunrise, toast, mozzarella, ocean, pile, goldfish, ballerina, portrait, vault, raft, windshield, cider, moss, grove, canopy, cobra, goose, ladle, turret, elevator, woodcutter, hut, pacifier, wolf, shuttle, attic, chariot, ball, vinyl, beverage, drink, plane, canal, rooster, mushroom, silver, babysitter, playground, cobweb, paintbrush, redhead, spool, cape, kiwi, tarp, rope, stove, menu, champagne, hair, telephone, wildflower, snowman, fertilizer, bride, pear, llama, tulip, wreath, handgun, mattress, blueberry, sailboat, rose, 
nail, vanilla, iguana, shoreline, officer, sweat, tear, dough, brain, nest, knight, cookbook, marble, oak, letter, corridor, lemon, bobcat, engineer, lantern, snowsuit, automobile, olive, easel, mouth, scalpel, boardwalk, rodent, cannonball, bacon, fist, sneaker, ballroom, graffiti, basketball, dinosaur, sail, farm, city, kangaroo, railroad, cart, parade, parmesan, statue, meteor, scissor, quill, courtroom, tractor, chestnut, boxing, sandbox, stomach, mug, spyglass, pot, crown, wallet, face, ear, latte, planet, cheeseburger, bakery, siding, island, ox, goat, radish, moose, telescope, floor, drum, shampoo, lemonade, belly, page, belt, beret, casino, ruler, sheet, castle, studio, mansion, bandanna

MOCHa

To mitigate captioning hallucinations in the open-vocabulary setting, we propose an RL-based pipeline. The algorithm iteratively collects a batch of data from the image captioning model M (left side), evaluates it, and applies an optimization step (right side):


We wish to optimize for the competing objectives of output fidelity (low hallucination rate) and adequacy (including sufficient details to describe the input image), as optimizing for one of these alone may cause the other to deteriorate.
MOCHa's multi-objective reward model handles this delicate trade-off by using a state-of-the-art natural language inference (NLI) model to measure open-vocabulary hallucinations, together with BERTScore, a pretrained model that measures the general quality of the generated text. Both models score the predicted captions with respect to reference captions.
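To make the trade-off concrete, the following sketch combines a fidelity term and an adequacy term into one scalar reward. It is an illustration under stated assumptions, not MOCHa's actual reward: the function name `combined_reward`, the weight `alpha`, and the choice of `1 - P(contradiction)` as the fidelity term are all hypothetical, and the input scores are made-up numbers rather than outputs of real NLI or BERTScore models.

```python
def combined_reward(
    contradiction_prob: float,
    adequacy_score: float,
    alpha: float = 0.5,
) -> float:
    """Weighted sum of a fidelity term and an adequacy term.

    contradiction_prob: NLI probability that the caption contradicts the
        reference (lower is better), assumed to lie in [0, 1].
    adequacy_score: a BERTScore-like similarity to the reference (higher is
        better), assumed to lie in [0, 1].
    alpha: hypothetical weight on fidelity; 1 - alpha goes to adequacy.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    fidelity = 1.0 - contradiction_prob
    return alpha * fidelity + (1.0 - alpha) * adequacy_score


# Made-up scores for three caption styles (illustrative assumptions only).
detailed_and_faithful = combined_reward(contradiction_prob=0.05, adequacy_score=0.90)
terse_but_faithful = combined_reward(contradiction_prob=0.02, adequacy_score=0.60)
detailed_but_hallucinating = combined_reward(contradiction_prob=0.70, adequacy_score=0.92)

print(detailed_and_faithful, terse_but_faithful, detailed_but_hallucinating)
```

With these numbers, the detailed and faithful caption scores highest. The terse caption is penalized through the adequacy term and the hallucinating caption through the fidelity term, which is the behavior a reward on either objective alone would not give.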

MOCHa Is Able To Reduce Hallucinations While Maintaining Caption Quality

We evaluate model performance on metrics that capture both caption fidelity (the consistency of the generated caption with the provided ground truth) and caption quality. NLI P(C), CH, and OCH denote NLI contradiction probability, CHAIR (i for instance, s for sentence), and OpenCHAIR, respectively.
See our paper for additional metrics and more results.
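For readers unfamiliar with the two CHAIR variants mentioned above, the sketch below computes them from already-extracted object mentions: the instance-level score is the fraction of mentioned objects that are hallucinated, and the sentence-level score is the fraction of captions containing at least one hallucinated object. The function name `chair_scores` and the toy data are assumptions for illustration; the actual CHAIR benchmark additionally maps caption words to a fixed MS-COCO object list using synonyms, which is omitted here.

```python
from typing import List, Sequence, Set, Tuple


def chair_scores(
    mentioned: Sequence[List[str]],
    ground_truth: Sequence[Set[str]],
) -> Tuple[float, float]:
    """Return (instance-level, sentence-level) hallucination rates.

    mentioned[k] lists the objects mentioned in caption k, and
    ground_truth[k] is the set of objects actually present in image k.
    """
    if len(mentioned) != len(ground_truth):
        raise ValueError("mentioned and ground_truth must have equal length")
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0
    for objects, present in zip(mentioned, ground_truth):
        missing = [obj for obj in objects if obj not in present]
        total_mentions += len(objects)
        hallucinated_mentions += len(missing)
        if missing:
            captions_with_hallucination += 1
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = captions_with_hallucination / len(mentioned) if mentioned else 0.0
    return chair_i, chair_s


# Toy data (assumption): three captions, two of which mention an absent object.
mentioned = [["dog", "frisbee"], ["cat"], ["car", "bus", "person"]]
ground_truth = [{"dog"}, {"cat", "sofa"}, {"car", "person"}]
chair_i, chair_s = chair_scores(mentioned, ground_truth)
print(f"CHAIR_i = {chair_i:.3f}, CHAIR_s = {chair_s:.3f}")
```

In this toy example, two of six mentions are hallucinated and two of three captions contain a hallucination, so the sentence-level score is the stricter of the two.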

Additional Visual Results

For additional qualitative results, we refer to our interactive visualization tool. The tool provides image captioning results using BLIP-Large with and without MOCHa for 500 randomly selected test images from MS-COCO and Flickr30K. For more information about the visualization tool, please see chapter A in the appendix.

BibTeX

@misc{benkish2024mitigating,
      title={Mitigating Open-Vocabulary Caption Hallucinations}, 
      author={Assaf Ben-Kish and Moran Yanuka and Morris Alper and Raja Giryes and Hadar Averbuch-Elor},
      year={2024},
      eprint={2312.03631},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
      }