Watch my lips Burton Pritzker/Getty
This article is being made free to view thanks to sponsorship from P&G
Artificial intelligence is getting its teeth into lip reading. A project by Google鈥檚 DeepMind and the University of Oxford applied deep learning to a huge data set of BBC programmes to create a lip-reading system that leaves professionals in the dust.
The AI system was trained using some 5000 hours from six different TV programmes, including Newsnight, BBC Breakfast and Question Time. In total, the videos contained 118,000 sentences.
Advertisement
First the University of Oxford and DeepMind researchers trained the AI on shows that aired between January 2010 and December 2015. Then they tested its performance on programmes broadcast between March and September 2016.
By only looking at each speaker鈥檚 lips, the system accurately deciphered entire phrases, with examples including 鈥淲e know there will be hundreds of journalists here as well鈥 and 鈥淎ccording to the latest figures from the Office of National Statistics鈥.
Event:
Here is a聽clip from the database without subtitles:

And here鈥檚 the same clip with subtitles provided by the AI system:

AI shows the way
The AI vastly outperformed a professional lip-reader who attempted to decipher 200 randomly selected clips from the data set.
The professional annotated just 12.4 per cent of words without any error. But the AI annotated 46.8 per cent of all words in the March to September data set without any error. And many of its mistakes were small slips, like missing an 鈥榮鈥 at the end of a word. With these results, the system also outperforms all other automatic lip-reading systems.
鈥淚t鈥檚 a big step for developing fully automatic lip-reading systems,鈥 says at the University of Oulu in Finland. 鈥淲ithout that huge data set, it鈥檚 very difficult for us to verify new technologies like deep learning.鈥
Two weeks ago, a similar deep learning system called 鈥 also developed at the University of Oxford 鈥 outperformed humans on a lip-reading data set known as GRID. But where GRID only contains a vocabulary of 51 unique words, the BBC data set contains nearly 17,500 unique words, making it a much bigger challenge.
In addition, the grammar in the BBC data set comes from a wide diversity of real human speech, whereas the grammar in GRID鈥檚 33,000 sentences follows the same pattern and so is far easier to predict.
The DeepMind and Oxford group says it will release its BBC data set as a training resource. , who is working on LipNet, says he is looking forward to using it.
Lining up the lips
To make the BBC data set suitable for automatic lip reading in the study, video clips had to be prepared using machine learning. The problem was that the audio and video streams were sometimes out of sync by almost a second, which would have made it impossible for the AI to learn associations between the words said and the way the speaker moved their lips.
But by assuming that most of the video was correctly synced to its audio, a computer system was taught the correct links between sounds and mouth shapes. Using this information, the system figured out how much the feeds were out of sync when they didn鈥檛 match up, and realigned them. It then automatically processed all 5000 hours of the video and audio ready for the lip-reading challenge 鈥 a task that would have been onerous by hand.
The question now is how to use AI鈥檚 new lip-reading capabilities. We probably don鈥檛 need to fear computer systems eavesdropping on our conversations by reading our lips because long-range microphones are better for spying in most situations.
Instead, Zhou thinks lip-reading AIs are most likely to be used in consumer devices to help them figure out what we are trying to say.
鈥淲e believe that machine lip readers have enormous practical potential, with applications in improved hearing aids, silent dictation in public spaces (Siri will never have to hear your voice again) and speech recognition in noisy environments,鈥 says Assael.
arXiv
Topics:



