Using AI to Automate More Accessible Alt-Text

Marcus Mendes reported last week for 9to5Mac that Apple researchers have devised a way to train AI models to generate image descriptions which, despite the models being “far smaller” than competing ones, “delivers more accurate, detailed descriptions.”

The new study, a collaborative effort between Apple’s band of AI researchers and folks at the University of Wisconsin-Madison, was recently detailed in a paper entitled RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning.

“Dense image captioning is critical for cross-modal alignment in vision-language pretraining and text-to-image generation, but scaling expert-quality annotations is prohibitively expensive,” Apple’s researchers wrote in prefacing their findings on the company’s Machine Learning Research blog. “While synthetic captioning via strong vision-language models (VLMs) is a practical alternative, supervised distillation often yields limited output diversity and weak generalization.”

As Mendes explains in his story, the concept of dense image captioning refers to “generating detailed, region-level descriptions of everything happening within an image, rather than a single overall summary.”

He continues: “In other words, it identifies multiple elements and regions in an image, and describes them with fine-grain detail, resulting in a much richer understanding of the scene than an overall description.”
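To make the distinction concrete, here is a minimal Python sketch of the idea. This is my own illustration of what region-level captions might look like, not Apple’s actual data format or API; the class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RegionCaption:
    """One region-level caption (hypothetical structure for illustration)."""
    box: tuple  # bounding box in pixel coordinates: (x, y, width, height)
    caption: str

# A single overall summary, versus dense, region-level descriptions.
overall = "A dog plays fetch in a park."

dense = [
    RegionCaption((40, 120, 300, 260),
                  "A golden retriever mid-leap, mouth open toward a red ball"),
    RegionCaption((0, 0, 1024, 300),
                  "An overcast sky above a line of oak trees"),
    RegionCaption((500, 600, 200, 150),
                  "A wooden bench with a folded blue jacket draped over it"),
]

def to_alt_text(regions):
    """Join region captions into one readable alt-text string."""
    return "; ".join(r.caption for r in regions) + "."

print(to_alt_text(dense))
```

The point of the sketch is simply that dense captioning yields several localized descriptions that can be composed into far richer alt-text than the one-line summary alone.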

Without getting into the veracity of Apple’s claims here, the pertinence to accessibility is obvious. As with language translation, providing image descriptions, or alt-text, is one task where leaning on artificial intelligence is savvy; it’s the kind of thing which plays to a computer’s core competency: automation. Indeed, from a human point of view, the ability to append good image descriptions is most definitely a skill to be learned and honed. There are tips and tricks for doing it well, but the tension lies in walking the tightrope between brevity and baseline; that is, wanting to be succinct in length whilst still offering a baseline of good information. Most alt-text I encounter, particularly on social media, falls off and descends into the abyss, not because people are lazy or unfeeling, but because, as I said, it’s a skill one must learn and practice.

On the other side, I absolutely loathe the penchant of people (and brands) for using the alt-text box as though it were a secret treasure chest, adding cutesy, pithy messages. In my opinion, doing so is not merely offensive (using a bona fide accessibility tool for one’s own vanity); it also underscores an utterly, and tellingly, fundamental lack of understanding of disability and of empathy for the people who need accessibility. I get it: this practice may not be malicious in intent, but it is absolutely malicious in effect because it reduces an assistive technology to a mere novelty.

If you want to try AI-powered alt-text, I suggest downloading Ice Cubes. The app is an open-source Mastodon client, and I prefer its interface design to something like Ivory. I have Ice Cubes on iOS and macOS, and there’s a little button that looks like a magic wand which, when tapped, will automatically generate alt-text for your photo(s). You can edit it, of course, but it works well in my experience. As I said, leveraging AI like this plays to the technology’s strengths. Not only that, but manually typing out alt-text every time you want to share pictures can be cognitively and/or physically taxing for a person with disabilities, depending on their individual needs and tolerances.
