Sci-fi fans may recall a spooky episode of the British TV show Black Mirror, in which the protagonist purchases an android replica of her deceased boyfriend, created from his voice recordings, chat history and social media profiles.

While digital necromancy of this kind may seem the stuff of a distant future, a patent recently granted to Microsoft suggests the technology may not be as far-fetched as it seemed when the Black Mirror episode first aired in 2013. Granted in December last year, the patent allows Microsoft to create a chatbot that could potentially be used to offer a conversational experience with a deceased person by using their “images, voice data, social media posts and electronic messages”. The patent goes one step further by suggesting this information could also be applied to recreate a model of the person.

While the capabilities mentioned in this patent seem otherworldly, the latest Stanford AI Index report found that two of the most rapidly developing areas of AI technology, automatic speech recognition (ASR) and natural language processing (NLP), already match or exceed human capabilities on some tasks.

ASR and NLP explained

If we take the example of a video call with a digital clone, ASR is the technology that allows the computer to recognise our voice and transcribe it into text (“listening”), while NLP is the technology that processes the text, allowing the computer to understand its meaning and generate a response that is given through text or, increasingly, by speaking back to us (“reasoning and responding”).

ASR uses algorithms to recognise and convert human speech to text. At its simplest, ASR works by recording speech and breaking down the recording into individual sounds (or ‘phonemes’). Sequences of phonemes are then analysed to find the word that is most probable. Once this is achieved and repeated, spoken words can be transcribed into written text.
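That last step, mapping a phoneme sequence to the most probable word, can be caricatured in a few lines of Python. This is a toy sketch only: the phoneme spellings and word probabilities below are invented for illustration, and real ASR systems use trained acoustic and language models rather than lookup tables.

```python
# Toy ASR word decoder: pick the most probable word for a phoneme sequence.
# The pronunciation dictionary and probabilities are invented for illustration.

# Candidate words, keyed by a (simplified) phoneme sequence.
PRONUNCIATIONS = {
    ("HH", "EH", "L", "OW"): ["hello", "hallow"],
    ("K", "AE", "T"): ["cat"],
}

# How often each word appears in our (imaginary) language model.
WORD_PROBABILITY = {"hello": 0.90, "hallow": 0.10, "cat": 1.0}

def decode(phonemes):
    """Return the most probable word for a phoneme sequence, or None."""
    candidates = PRONUNCIATIONS.get(tuple(phonemes), [])
    if not candidates:
        return None
    return max(candidates, key=lambda w: WORD_PROBABILITY[w])

print(decode(["HH", "EH", "L", "OW"]))  # hello
```

Repeating this decision word after word, weighted by which words are likely to follow each other, is what turns a stream of sounds into a transcript.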

Related to, but distinct from, ASR is NLP. NLP uses algorithms to identify and separate human language into fragments so that the grammatical structure of sentences and the meaning of words can be analysed and understood by the computer in context, which in turn helps computers respond meaningfully.
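The “fragment, analyse, respond” loop can be sketched with a deliberately simple example. Modern NLP systems use large trained models, not hand-written rules; the tokeniser and the two intents below are invented purely to make the pipeline concrete.

```python
import re

# Toy NLP pipeline: split a sentence into fragments (tokens), then match
# simple hand-written patterns to respond meaningfully.
# Real systems use trained models; these rules are invented for illustration.

def tokenize(sentence):
    """Break a sentence into lowercase word fragments."""
    return re.findall(r"[a-z']+", sentence.lower())

def respond(sentence):
    """Match token patterns against a handful of hand-written intents."""
    tokens = tokenize(sentence)
    if "weather" in tokens:
        return "Checking today's forecast."
    if tokens[:2] == ["set", "an"] and "alarm" in tokens:
        return "Alarm set."
    return "Sorry, I didn't understand that."

print(respond("What's the weather like today?"))  # Checking today's forecast.
```

The hard part of NLP is everything these rules skip over: resolving ambiguity, tracking context across sentences, and coping with the endless ways people phrase the same request.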

Stronger, faster, smarter

Stanford’s AI Index report highlights the impressive technical improvements in ASR and NLP that have been measured using testing techniques which can also be used to measure human language capabilities.

LibriSpeech’s “test other” set determines how well ASR systems can transcribe speech in realistic environments. Tests of ASR systems on LibriSpeech showed the word error rate for speech transcription dropped from 14% in 2016 (enough to make for an unreliable transcriber) to a vanishingly small 2.6% in 2021, as demonstrated in the graph below.
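The error rates quoted above are word error rates (WER): the number of word substitutions, insertions and deletions needed to turn the system’s transcript into the reference transcript, divided by the length of the reference. A minimal sketch of that calculation, using a standard dynamic-programming edit distance:

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: edit distance (substitutions + insertions +
    deletions) between word sequences, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word in a six-word reference: WER of 1/6, roughly 17%.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

At a 2.6% WER, a system mishears roughly one word in forty; at 14%, roughly one word in seven, which is why the 2016-era systems made for unreliable transcribers.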

NLP has also seen impressive progress. SuperGLUE is a benchmark that evaluates the performance of NLP systems on a series of language understanding tasks. The graph below demonstrates the rapid improvements to NLP systems over the past year, with top NLP systems achieving an impressive score of 90.3, surpassing the ‘human baseline’ score of 89.8 that was based on the performance of hired crowdworkers.

Uses of ASR and NLP

Advances like the ones described above have led to exciting and innovative uses for ASR and NLP, some of which are familiar in our everyday lives, including:

  1. Virtual assistants - virtual assistants built into mobile and smart home devices, such as Amazon’s Alexa, Google Assistant and Apple’s Siri, are mainstream examples of ASR and NLP in action today. Many of us have become accustomed to interacting with these devices by using our voices to have them perform a range of tasks, such as scheduling alarms, playing music, enquiring about the weather and even translating words on the spot. In fact, a study by Edison Research found that in 2020, approximately 26% of Australians owned a smart speaker.
  2. Voice banking – AI powered by ASR and NLP can allow customers to use their voice to manage their banking operations. ASR can also be used to verify and authenticate customers. In what could be a big step for customer service, NLP can detect the emotions of speakers so that disgruntled customers can be more quickly directed to assistance services before the problem escalates.
  3. Medical transcripts and analysis - Amazon Transcribe Medical provides an ASR service that transcribes physician-patient conversations for clinical documentation and subtitling telehealth consultations. This has been used in conjunction with NLP, which can analyse transcripts and medical records to triage and generate diagnostic models in order to perform a range of tasks, such as detecting early-stage chronic disease.
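The emotion-detection idea in the voice banking example can be caricatured with a few lines of keyword matching over an ASR transcript. Production systems use trained NLP models over acoustic and textual features; the cue list and threshold below are invented for illustration only.

```python
# Toy escalation check: flag transcripts containing frustration cues so a
# call can be routed to a human agent before the problem escalates.
# Real systems use trained NLP models; this keyword list is invented.
FRUSTRATION_CUES = {"ridiculous", "unacceptable", "angry", "complaint", "cancel"}

def should_escalate(transcript):
    """True if any frustration cue appears in the (ASR-produced) transcript."""
    words = set(transcript.lower().split())
    return bool(words & FRUSTRATION_CUES)

print(should_escalate("this fee is unacceptable i want to cancel"))  # True
print(should_escalate("i would like to check my balance"))           # False
```

The value of real NLP here is catching frustration that no keyword list would: sarcasm, escalating tone across turns, or polite wording masking an urgent complaint.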

Communication breakdown

The performance of NLP and ASR systems has surpassed human benchmarks for several tasks. However, it’s not time for humankind to throw in the towel just yet.

As the Microsoft team behind DeBERTa, one of the current state-of-the-art models deployed to perform NLP tasks, comments: the technology is “by no means reaching the human level of natural language understanding”. This is because humans are “extremely good at leveraging the knowledge learned from different tasks to solve a new task with no or little task-specific demonstration”, or in other words, highly capable of learning the meaning of a new word and then applying it to other language contexts. This is a skill NLP models are yet to match to the same extent.

Technical challenges remain for ASR as well, with current systems still struggling to process speech against significant background noise and to identify individual speakers in larger datasets.

Perhaps more concerning are the results of research conducted by Stanford University, which found that even state-of-the-art ASR technology exhibits significant racial and gender disparity: on average, Black speakers experienced twice as many errors as white speakers. This “racial gap” also extends to NLP – a paper released by Oxford Insights suggests that NLP systems understand “standard” varieties of a language but struggle with the vernacular used in minority communities. Issues such as these echo criticisms directed at other branches of AI.

Where to from here?

This all shows that AI apps have become very good listeners, perhaps better than most of us. But there is still a way to go with AI’s ability to keep up its end of a conversation before it is truly humanlike (like that seen in Black Mirror).

The Visual Commonsense Reasoning (VCR) task, first introduced in 2018, asks machines to answer a challenging question about a given image and to justify that answer with reasoning (whereas Siri will just give you an answer or suggestion in response to your question). As the following graph from the Stanford AI Index report shows, AI’s performance is still below that of humans, but the gap is rapidly narrowing.


Read more: Artificial Intelligence Index Report 2021


Authors: Edward Zheng, Jen Bradley and Peter Waters