Is there something special about the human voice?

Artificial intelligence-powered speech synthesisers can now hold eerily realistic spoken conversations, putting on accents, whispering and even cloning the voices of others. So how can we tell them apart from the human voice?

These days it's quite easy to strike up a conversation with AI. Ask a question of some chatbots, and they'll even provide an engaging response verbally. You can chat with them across multiple languages and request a reply in a particular dialect or accent.

It is now even possible to use AI-powered speech cloning tools to replicate the voices of real humans. One was recently used to copy the voice of the late British broadcaster Sir Michael Parkinson to produce an eight-part podcast series, while natural history broadcaster Sir David Attenborough said he was "profoundly disturbed" to hear his voice had been cloned by AI and used to say things he never uttered.

In some cases the technology is being used in sophisticated scams to trick people into handing over money to criminals.

Not all AI-generated voices are used for nefarious means, however. They are also being built into chatbots powered by large language models so that they can respond and converse in a far more natural and convincing way. ChatGPT's voice function, for example, can now reply with variations in tone and emphasis on certain words, much as a human would to convey empathy and emotion. It can also pick up on non-verbal cues such as sighs and sobs, speak in 50 languages and can render accents on the fly. It can even make phone calls on behalf of users to help with tasks. At one demonstration by OpenAI, the system ordered strawberries from a vendor.

These capabilities raise an interesting question: is there anything unique about the human voice to help us distinguish it from robo-speech?

Jonathan Harrington, a professor of phonetics and digital speech processing at the University of Munich, Germany, has spent decades studying the intricacies of how humans talk and produce the sounds of words and accents. Even he is impressed by the capabilities of AI-powered voice synthesisers.

"In the last 50 years, and especially recently, speech generation/synthesis systems have become so good that it is often very difficult to tell an AI-generated and a real voice apart," he says.

But he believes there are still some important cues that can help us to tell if we are talking to a human or an AI.

Before we get into that, however, we decided to set up a little challenge to see just how convincing an AI-generated voice could be compared to a human one. To do this we asked New York University Stern School of Business chief AI architect Conor Grennan to create pairs of audio clips reading out short segments of text.

One was a passage from Lewis Carroll's classic tale Alice in Wonderland, read by Grennan, and the other was an identical segment generated with an AI speech-cloning tool from software company ElevenLabs. You can listen to them both below to see if you can tell the difference.

Surprisingly, around half of the people we played the clips to couldn't tell which was which by ear. It's worth pointing out that our experiment was far from scientific, and the clips weren't listened to on high-end audio equipment – just typical laptop and smartphone speakers.

Steve Grobman, chief technology officer of cybersecurity company McAfee, struggled to discern which voice was human and which was AI by ear alone.

"There were definitely things beyond speech, like the inhalation which would have me go more towards human, but the cadence, balance, tonality would push me to AI," he says. For the untrained human ear, many of these things can be difficult to pick up.

"Humans are very bad at this," says Grobman, explaining that deepfake detection software is helping catch things the human ear can miss. But it gets especially challenging when bad actors manipulate real audio with bits of fake audio, he says, pointing to a video of Microsoft co-founder Bill Gates hawking a quantum AI stock trading tool. To the human ear, the audio sounded exactly like the tech billionaire, but when run through a scam classifier it was flagged as a deepfake.

McAfee recently highlighted how a fabricated advert used mixed deepfake and real audio of singer Taylor Swift. Grobman's tip: "Always listen to the context of what is being said, things that sound suspicious likely are." 

Another cybersecurity expert we spoke to – Pete Nicoletti, global chief information security officer of Check Point Software, a threat analysis platform – was also stumped by our "Alice in Wonderland" challenge.

He says he usually listens for unnatural speech patterns such as irregular pauses and awkward phrasing when playing audio. Strange artefacts like distortions and mismatched background noise can also be a give-away. He also listens for limited variations in volume, cadence and tone because voices that are cloned from just a few seconds of audio may not have the full range of a human voice.
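That last cue – limited variation in volume – is something that can be roughly quantified. As a simple illustration (not a tool Nicoletti describes using), the sketch below compares the spread of short-term loudness between two hypothetical signals: one whose volume swells and fades, as natural speech tends to, and one held at a constant level. A cloned voice with a flattened dynamic range would score low on this measure.

```python
import math
from statistics import pstdev

def rms_energy(signal, frame_len=1024):
    """Short-term loudness: root-mean-square energy of each frame."""
    n_frames = len(signal) // frame_len
    return [
        math.sqrt(
            sum(x * x for x in signal[i * frame_len:(i + 1) * frame_len])
            / frame_len
        )
        for i in range(n_frames)
    ]

def loudness_spread(signal):
    """Standard deviation of frame loudness: low values suggest a flat delivery."""
    return pstdev(rms_energy(signal))

# Hypothetical stand-ins for recordings: a 440 Hz tone, sampled at 16 kHz,
# once with a slow volume swell and once at constant volume.
sr = 16_000
tone = [math.sin(2 * math.pi * 440 * i / sr) for i in range(2 * sr)]
varied = [s * (0.5 + 0.5 * math.sin(2 * math.pi * 0.5 * i / sr))
          for i, s in enumerate(tone)]          # swelling volume
flat = [s * 0.75 for s in tone]                 # constant volume

print(loudness_spread(varied) > loudness_spread(flat))  # True
```

The threshold separating "human-like" from "suspiciously flat" would of course have to be calibrated on real recordings; this only shows that the cue can be measured at all.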

"We live in a post-real society where AI-generated voice clones can fool even the voice validation systems of credit card companies," Nicoletti says. "Turing would be turning over in his grave right now," he adds, referring to the World War II British codebreaker Alan Turing, who proposed the "Turing test" as a way of judging whether a machine could pass as human in conversation.

Dane Sherrets, innovation architect of emerging technologies at HackerOne, a community of bug bounty hunters who work to expose security vulnerabilities at some of the world's biggest companies, was among those able to correctly identify the human voice. The natural inflection and breathing in the clips were the give-away, he says.

Listening for the accentuation, or emphasis, that words are given in a sentence can be a good trick for spotting computer-generated speech, agrees Harrington. This is because humans use accentuation to give a sentence more meaning within the context of a dialogue.

"For example, a sentence like 'Marianna made the marmalade' typically has most emphasis on the first and last words if read as an individual sentence devoid of context," he says. But if someone asked if Marianna bought the marmalade, the emphasis might instead fall on the word "made" in the answer.