Apple’s New Speech Frameworks Portend More Accessible Audio for Developers and Users Alike

John Voorhees wrote a piece for MacStories earlier this week that delves into a new class in Apple’s Speech APIs and compares it to Whisper, OpenAI’s open-source speech-to-text model. Apple’s class, named SpeechAnalyzer, and its corresponding SpeechTranscriber module are both included in the developer betas released during last week’s WWDC festivities. As Voorhees explains, Apple’s APIs are highly performant.
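The division of labor is straightforward: SpeechTranscriber is the module that turns speech into text, and SpeechAnalyzer is the engine that drives it over the audio. Here’s a rough sketch in Swift of how the two might be wired together to transcribe a file; the initializer, the offline-transcription preset, and the method names reflect my reading of the first betas and should be treated as assumptions rather than settled API.

```swift
import AVFoundation
import Speech

// A rough sketch, based on the first developer betas, of transcribing an
// audio file with SpeechAnalyzer and SpeechTranscriber. The signatures here
// are assumptions drawn from the beta documentation and may change.
func transcribe(fileAt url: URL) async throws -> String {
    // The transcriber module produces text; the analyzer runs the session.
    let transcriber = SpeechTranscriber(locale: Locale(identifier: "en_US"),
                                        preset: .offlineTranscription)
    let analyzer = SpeechAnalyzer(modules: [transcriber])

    // Accumulate results on a separate task as they stream in.
    let resultsTask = Task {
        var text = ""
        for try await result in transcriber.results {
            text += String(result.text.characters)
        }
        return text
    }

    // Feed the whole file to the analyzer, then finalize the session.
    let audioFile = try AVAudioFile(forReading: url)
    if let lastSample = try await analyzer.analyzeSequence(from: audioFile) {
        try await analyzer.finalizeAndFinish(through: lastSample)
    }

    return try await resultsTask.value
}
```

Notably, the underlying models run on device, so there’s no waiting on a network round trip.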

Software development minutiae notwithstanding, the nut of Voorhees’ story—especially pertinent to accessibility—is his conclusion that the speed with which Apple’s framework works is “a game-changer for anyone who uses voice transcription to create text from lectures, podcasts, YouTube videos, and more.” By contrast, Voorhees writes that Whisper is inexpensive yet glacially slow; he calls its pokiness “frustrating” when trying to finish a YouTube project. Voorhees’ son, Finn, cobbled together a command-line utility aptly named Yap that chews up audio and video files and spits them out as SRT and TXT files. Notably for accessibility purposes, SRT is a plain-text file format commonly used for displaying captions and subtitles in videos. Using Yap, which is built atop Apple’s aforementioned APIs, it took Voorhees a mere 45 seconds to create an SRT file for an episode of his AppStories podcast. In MacWhisper, Voorhees reports, the same conversion took 1:41 using OpenAI’s Large V3 Turbo model and 3:55 using its Large V2 model.
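For those unfamiliar with SRT, the format is about as plain as it gets: each caption cue is just a sequence number, a start and end timestamp, and a line or two of text. A small fragment—the timings and words here are purely illustrative, not taken from AppStories—looks like this:

```
1
00:00:00,000 --> 00:00:03,200
Welcome back to the show.

2
00:00:03,200 --> 00:00:07,450
This week, we're talking about what's new at WWDC.
```

Because it’s plain text, the same file can be opened in any editor and uploaded anywhere captions are supported.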

Voorhees posits Apple’s frameworks represent what he described as “a significant leap forward in transcription speed without compromising on quality,” adding that he believes the new APIs will “replace Whisper as the default transcription model for transcription apps on Apple platforms.” The advent of the APIs is a noteworthy development, a sentiment seconded by John Gruber, who writes today that it “bodes very well for all sorts of use cases where transcription would be helpful, like third-party podcast players.”

Podcast players. I’ve extolled the virtues of having high-quality transcripts in podcast apps many times before; to have them is to turn a medium that’s ostensibly exclusionary to, say, Deaf and hard-of-hearing people into something substantially more inclusive. Before WWDC, I advocated for Apple to build a “TranscriptKit” API that would help App Store developers include transcripts in apps such as Marco Arment’s Overcast. As far as I know, no such tool came to be this year, but perhaps one will in the future, presumably built atop these very technologies. As it stands, however, part of the problem with including transcripts in any audio app is production: it’s hard to generate accurate transcripts in a timely fashion, as Voorhees illustrates. Apple’s new APIs seemingly are designed to solve that very problem, the net effect of which could be more accessible and inclusive audio apps from Arment and others in the years to come.
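To make the idea concrete, a “TranscriptKit”-style interface needn’t be elaborate. The following is entirely hypothetical—no such framework exists—but something this small would cover the basics for a podcast app:

```swift
import Foundation

// Entirely hypothetical: a sketch of the shape a "TranscriptKit"-style API
// could take for podcast apps. No such Apple framework exists today.
struct TranscriptCue {
    let start: TimeInterval   // when the cue begins, in seconds
    let end: TimeInterval     // when the cue ends, in seconds
    let text: String          // the spoken words for this span
}

protocol TranscriptProviding {
    // Generate cues for an episode's audio, suitable for on-screen display
    // as the episode plays or for export as an SRT file.
    func transcript(for audioURL: URL) async throws -> [TranscriptCue]
}
```

The point isn’t the exact shape; it’s that the hard part—generating the cues quickly and accurately—is exactly what Apple’s new speech frameworks now make feasible.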

Yap is available for download on GitHub.
