Mozilla’s new open source model aims to revolutionize voice recognition

Voice recognition tech has been making steady progress in recent times, with all the big tech firms keen to make strides in this arena, if only to improve their digital assistants, from Cortana to Siri. Mozilla, however, wants to push harder, and more broadly, on this front with the release of an open source speech recognition model.

The initial release of this Automatic Speech Recognition engine has just been unleashed, based on work carried out by the Machine Learning team at Mozilla. The engine is modelled on ‘Deep Speech’ papers published by Baidu, which detail a trainable multi-layered deep neural network.

Mozilla says that its project initially had a goal of hitting a ‘word error rate’ (the proportion of words the engine gets wrong when its transcript is compared against a reference transcript) of less than 10%. However, the firm says the engine’s word error rate on LibriSpeech’s test-clean set is now 6.5%, which comfortably beats this goal and comes close to the Holy Grail of human-level performance (put at around 5.8% by the Deep Speech 2 paper).
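For readers wondering how a word error rate is worked out, here is a minimal Python sketch of the standard definition: the number of word substitutions, deletions and insertions needed to turn the engine's transcript into the reference transcript, divided by the number of words in the reference. This is not Mozilla's own evaluation code, and the function name `word_error_rate` is a hypothetical one chosen for illustration. It also assumes both transcripts have already been normalised (for example, lowercased and stripped of punctuation).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count.

    Illustrative helper only (hypothetical name); assumes the two
    transcripts are already normalised and split cleanly on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One wrong word out of four reference words.
    print(word_error_rate("open the pod bay", "open the pod day"))
```

On this definition, a word error rate of 6.5% means roughly 6.5 word-level mistakes for every 100 words in the reference transcripts.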

Mozilla has worked hard to train the speech recognition model using ‘supervised learning’ and a huge dataset of thousands of hours of labeled audio, drawn from all manner of sources including free (TED-LIUM and LibriSpeech) and paid (Fisher and Switchboard) speech corpora.

Further labeled speech data was pulled from the likes of language study departments in universities, and public TV and radio stations, all of which provided more material for honing the speech recognition engine.

And of course the huge strength of this project is its open source nature, which means that this honed technology is now available for anyone to use in their own speech recognition projects.

Streamlined speech

Mozilla further notes that the plan for the future is to release a model that’s light and fast enough to run on a smartphone or single-board computer like the Raspberry Pi.

The company has also launched its Common Voice initiative, an open and publicly available voice dataset containing some 400,000 recordings from 20,000 different speakers, which adds up to around 500 hours of speech.

As Mozilla puts it, the idea here is to “build a speech corpus that's free, open source, and big enough to create meaningful products with”, running in parallel with the new speech recognition model.

Microsoft is also making big strides on the voice recognition front, having achieved a word error rate of 5.1% in the Switchboard speech recognition benchmark, as announced back in the summer.

Darren is a freelancer writing news and features for TechRadar (and occasionally T3) across a broad range of computing topics including CPUs, GPUs, various other hardware, VPNs, antivirus and more. He has written about tech for the best part of three decades, and writes books in his spare time (his debut novel - 'I Know What You Did Last Supper' - was published by Hachette UK in 2013).