Increasingly, voice assistants from vendors such as Amazon, Apple, Google, Microsoft and others are starting to find their way into a myriad of devices, products and tools used on a daily basis. While once we might have only interacted with conversational systems on our phones, dedicated desktop appliances, or desktop computers, we can now find conversational interfaces on a wide range of appliances and products from televisions to cars and even toaster ovens. Soon, any device we can interact with will have an audio conversational interface instead of buttons or screens to type or click. The dawn of the conversational computing age is here.
However, are these devices intelligent enough to handle the wide range of queries that humans pose? Finding out how intelligent these systems really are is the goal of Cognilytica’s most recent Voice Assistant Benchmark, which tests the cognitive capabilities of the most widely deployed voice assistant devices on the market. (Disclosure: I am a principal analyst with Cognilytica.)
In its second iteration, the Voice Assistant Benchmark asks 120 questions grouped into 12 categories of varying cognitive difficulty. These questions aim to test not only the ability of the devices to understand the questions being asked but also their underlying knowledge graph and cognitive capabilities. The response to each question is assigned to one of four categories: Category 0 responses are those in which the device either could not answer the question at all or defaulted the user to a search or other generic response. Category 1 responses are those in which the device responds with an irrelevant or incorrect response. Category 2 responses are those in which the device responds such that a human must determine what the right response is. Category 3 responses are clear, straightforward answers that provide an acceptable response to the user.
Each response is also marked as “adequate” or not for the specific question being asked. In most cases, a Category 3 response is required to be adequate, but in some situations, a Category 0 response is preferred because we would rather the device not attempt to answer something that is intentionally ambiguous or even gibberish. The benchmark tallies the adequate responses for each assistant and compares that total against the maximum possible score. Since the back ends are continually improving, the benchmark is repeated regularly to see how the voice assistant responses change over time.
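The scoring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Cognilytica's actual tooling: the class and function names are hypothetical, the sample questions are made up rather than taken from the benchmark, and adequacy is simplified to "the response landed in the category we wanted for that question" (Category 3 by default, Category 0 for questions that should not be answered). In the real benchmark, adequacy is marked for each response.

```python
from collections import Counter
from dataclasses import dataclass
from enum import IntEnum


class Category(IntEnum):
    NO_ANSWER = 0     # could not answer, or fell back to a search / generic reply
    INCORRECT = 1     # irrelevant or incorrect response
    NEEDS_HUMAN = 2   # a human must work out which part is the right answer
    CLEAR_ANSWER = 3  # clear, straightforward, acceptable answer


@dataclass
class Response:
    question: str
    category: Category
    # Assumption: the category that counts as adequate for this question.
    # Defaults to Category 3; set to Category 0 for gibberish or
    # intentionally ambiguous questions.
    adequate_category: Category = Category.CLEAR_ANSWER

    @property
    def adequate(self) -> bool:
        return self.category == self.adequate_category


def score(responses):
    """Return (adequate count, total questions, percent adequate)."""
    adequate = sum(1 for r in responses if r.adequate)
    total = len(responses)
    pct = 100.0 * adequate / total if total else 0.0
    return adequate, total, pct


def category_counts(responses):
    """Tally how many responses fell into each of the four categories."""
    return Counter(r.category for r in responses)


# Made-up sample data for one hypothetical assistant.
sample = [
    Response("What is the capital of France?", Category.CLEAR_ANSWER),
    Response("Blorp zib quaffle?", Category.NO_ANSWER, Category.NO_ANSWER),
    Response("Which is heavier, a pound of feathers or a pound of lead?",
             Category.INCORRECT),
    Response("How many days are in a leap year?", Category.NEEDS_HUMAN),
]

adequate, total, pct = score(sample)
print(f"{adequate} of {total} adequate ({pct:.1f}%)")
```

With this made-up sample, two of the four responses count as adequate: the clear answer, and the gibberish question the assistant declined to answer.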
Results from the Benchmark
While the voice assistants did dramatically better this round than they did in the first version of the benchmark, they still performed inadequately as a whole. For the current benchmark, Alexa provided the greatest number of adequate responses at 49 out of 144 questions asked (34.7%), while Google followed close behind with 48 out of 144 adequate responses (34.0%). Microsoft's Cortana showed the biggest improvement over the past benchmark with 46 out of 144 adequate responses (31.9%). Apple's Siri trails the pack with 35 out of 144 adequate responses (24.3%). The charts below outline overall adequate answers as well as total answers for each of Categories 0 through 3. The questions asked were ones that an elementary school student should easily be able to understand and answer. As such, if these voice assistants were in school, they'd all get a failing grade.
Interesting Responses from Voice Assistants
What is most interesting in these benchmarks is how clearly the voice assistant companies are continually working on their knowledge graphs and the underlying cloud-based AI technology that powers the intelligence of these devices. After all, the intelligence of these devices is not in the device itself but in the large infrastructure in the cloud, powered by lots of computing power and data. So, in essence, what's really being tested is the intelligence of the big back-end system, not what's on the device itself. The benchmark provides evidence that these companies are working very hard to improve and broaden their underlying data, and that these conversational systems continue to improve over time.
All benchmark questions and answers are recorded on video to document the category results and keep them transparent, and also so that we have evidence of how these systems are improving over time. As a result, Cognilytica produced several videos that showcase some of the unusual and interesting responses of the voice assistants:
Benchmark Videos: Comparing Responses of Voice Assistants
How Far Away Are We from Truly Intelligent Voice Assistants?
Given that these voice assistants still seem to fail at fairly basic and straightforward questions, we have to ask: how far away are we from a truly valuable, intelligent conversational system? We're actually much closer than it might seem. While these devices still have a long way to go to prove that they can reliably answer most questions, the rate of improvement is impressive. The major vendors are putting large teams to work making these devices better. Amazon alone has claimed over 10,000 employees in its Alexa division. And news continues to trickle out about how Microsoft, Google and Apple are putting humans in the loop, improving the devices by listening in on conversations. While this is definitely a controversial practice, and possibly a compliance and regulatory concern, it is clear that the vendors are doing this to continue to train and evolve the models that power these voice assistant systems.
As such, we can expect the cognitive capabilities of these devices to keep improving, and benchmarks like this one will help show how quickly that improvement happens.
Ronald Schmelzer, columnist, is senior analyst and founder of the Artificial Intelligence-focused analyst and advisory firm Cognilytica, and is also the host of the AI Today podcast, SXSW Innovation Awards Judge, founder and operator of TechBreakfast demo format events, and an expert in AI, Machine Learning, Enterprise Architecture, venture capital, startup and entrepreneurial ecosystems, and more. Prior to founding Cognilytica, Ron founded and ran ZapThink, an industry analyst firm focused on Service-Oriented Architecture (SOA), Cloud Computing, Web Services, XML, & Enterprise Architecture, which was acquired by Dovel Technologies in August 2011.
Ron is a Parallel Entrepreneur, having started and sold a number of successful companies. The companies Ron has started and run have collectively employed hundreds of people, raised over $60M in venture funding, and produced exits in the millions. Ron was founder and chief organizer of TechBreakfast, the largest monthly morning tech meetup in the nation, with over 50,000 members and 3000+ attendees at the monthly events across the US including Baltimore, DC, NY, Boston, Austin, Silicon Valley, Philadelphia, Raleigh and more.
He was also founder and CEO at Bizelo, a SaaS company focused on small business apps, and was Founder and CTO of ChannelWave, an enterprise software company which raised $60M+ in VC funding and was subsequently acquired by Click Commerce, a publicly traded company. Ron founded and was CEO of VirtuMall and VirtuFlex from 1994 to 1998, and hired the CEO before it merged with ChannelWave.
Ron is a well-known expert in IT, Software-as-a-Service (SaaS), XML, Web Services, and Service-Oriented Architecture (SOA). He is well regarded as a startup marketing & sales adviser, and is currently mentor & investor in the TechStars seed stage investment program, where he has been involved since 2009. In addition, he is a judge of SXSW Interactive Awards and served on standards bodies such as RosettaNet, UDDI, and ebXML.
Ron is the lead author of XML And Web Services Unleashed (SAMS 2002) and co-author of Service-Orient or Be Doomed (Wiley 2006) with Jason Bloomberg. Ron received a B.S. degree in Computer Science and Engineering from Massachusetts Institute of Technology (MIT) and an MBA from Johns Hopkins University.