Is Voice Always the Best Choice?

One Engineer’s Argument for Integrating Voice and Screens

June 16, 2018
by Jared Kirschner

“There’s a skill for that.” –You, circa 2020

Amazon Alexa has a lot of skills, including those that pay the bills—nearly quadrupling in 2017 to 25,000. Our puny human brains can’t possibly keep up. Luckily, Amazon recognizes our human limitations and will soon release a solution they call Skill Arbitration, allowing Alexa to find—and, if needed, install—the right skill to address a query without depending so heavily on our ability to conjure the necessary incantation (“Hey Alexa, tell <SKILL NAME> to <CAREFULLY WORDED COMMAND>”). But does Amazon understand the machine’s limitations?

“There’s an app for that” has been the reigning tech joke of the last decade. The joke: For anything you can think of, someone has made an app. But beneath the joke there’s a pervasive attitude I find troubling: that the appropriate solution to any problem is an app. It’s the default, the first thing that comes to mind, even for non-software engineers. And hey, as a software engineer, I’m flattered. But I’m also a practitioner of human-centered design who recognizes that the strongest solutions (a) are built from an understanding of user needs, and (b) may include technology, but often only as part of the solution.

“There’s a skill for that” is the next joke, the next default solution, the next troubling attitude.


But there are ways to avoid developing products and services that are the setup for this joke. And Amazon is actually showing the first steps on that path by acknowledging one of the machine’s fundamental limitations—that voice is not always the best choice—with a change to some recently released Echo products…

Touchscreens! The Echo Show and Echo Spot both have integrated touchscreens.

While this might not seem like a revolutionary idea—after all, we’ve had touchscreens for as long as we’ve had, well, apps—it opens up the possibility of different types of interfaces complementing each other, producing a better interaction than either could provide separately.

As a simple case, imagine you need a timer for something. To set one up on a smartphone/laptop/tablet, you must pick up the device, open a timer app, navigate to the right screen, and fiddle with controls to get the settings right. After a somewhat cumbersome and time-consuming setup, checking it is super easy… you can just glance at it. The experience with a voice-only virtual assistant is basically the opposite. Setting up the timer is trivial… you just say what you need. But if you want to quickly check the timer? That requires a conversation, which is far more diverting and time-consuming than glancing at a screen.

The Best of Both Worlds

With an integrated interface, you can have the best of both worlds: setting up the timer through voice, then checking the timer visually. As the Nielsen Norman Group succinctly puts it: “voice is an efficient input modality” while “screen is an efficient output modality.” This is one example of a broadly applicable design principle that should be considered whenever designing a voice interface.
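
The timer example can be made concrete with a small sketch. Assume a hypothetical assistant that parses a voice command into structured state (voice as input) and then renders that state for glanceable display (screen as output); all names here are invented for illustration:

```python
import re
from dataclasses import dataclass

@dataclass
class Timer:
    """Structured timer state: set by voice, checked by glancing at a screen."""
    remaining_seconds: int

    def render(self) -> str:
        # Screen output: glanceable and persistent, no conversation needed.
        minutes, seconds = divmod(self.remaining_seconds, 60)
        return f"{minutes:02d}:{seconds:02d}"

def parse_timer_command(utterance: str) -> Timer:
    """Voice input: immediate and hands-free, no app to open or screen to navigate."""
    match = re.search(r"(\d+)\s*(minute|second)", utterance)
    if not match:
        raise ValueError(f"Could not parse a duration from: {utterance!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return Timer(remaining_seconds=amount * 60 if unit == "minute" else amount)

# Voice handles the setup; the screen handles the status check.
timer = parse_timer_command("set a timer for 10 minutes")
print(timer.render())  # 10:00
```

The division of labor mirrors the principle above: the spoken utterance is the efficient way in, and the rendered countdown is the efficient way out.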

As an output modality, voice has a lot of problems.

Voice has an inherent advantage as an input modality because it is immediate (no setup or navigation required) and can be used in parallel with other activities (it’s hands-free and doesn’t demand much mental attention). But as an output modality, it has a lot of problems. First, you have no control over when the information delivery occurs—it just happens, whether you’re ready to listen and process the information or not. Second, it takes a long time to deliver even small amounts of information, with no ability to skip or scan. Third, human brains have limited working memory, so holding voice output in our heads, rather than having it saved for us on a screen, can make anything more than the simplest interaction difficult. (As an aside, these are all reasons that experiences with text-based chat bots are fundamentally different from those with voice-based virtual assistants, even though both use words as their primary input and output.)

Because of this, voice-only interfaces are only optimal for simple interactions involving a command with no or limited response, such as turning off the lights. They don’t do well in either of the following cases:

Feedback-Driven Interactions

What if you want to dim the lights rather than turn them off? That is an iterative process: you make adjustments (the setting), possibly several times, based on the observed result (the light level). If the output interface is voice, this is going to take a while. Even in an ideal world, where the virtual assistant is as good as a friend at understanding you, would you really want to have this conversation: “Dim the lights… a little more… no, I meant more dim, not more light… a little lower… too much, back a bit… okay, good, stop”? It will always be far more effective to use a dimmer switch, whether physical or digital (touchscreen), unless “dim” refers to a previously agreed-upon, fixed level.
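
The difference in round trips can be sketched in code. Assume a hypothetical light whose level can be nudged by relative voice commands or set directly from a dimmer control (the class and method names are invented for illustration):

```python
class DimmableLight:
    """A light at 0-100% brightness, adjustable by voice nudges or a dimmer."""

    def __init__(self, level: int = 100):
        self.level = level
        self.interactions = 0  # count of discrete user interactions

    def nudge(self, delta: int) -> None:
        # Voice path: each relative adjustment is a full command/observe cycle.
        self.level = max(0, min(100, self.level + delta))
        self.interactions += 1

    def set_level(self, level: int) -> None:
        # Dimmer path: one continuous gesture lands on the observed target.
        self.level = max(0, min(100, level))
        self.interactions += 1

# Voice: "dim the lights... a little more... too much, back a bit..."
by_voice = DimmableLight()
for delta in (-30, -20, -20, +10):  # iterate until the level looks right
    by_voice.nudge(delta)

# Dimmer: slide until it looks right; feedback is continuous, so one interaction.
by_dimmer = DimmableLight()
by_dimmer.set_level(40)

print(by_voice.interactions, by_dimmer.interactions)  # 4 1
```

Both paths reach the same brightness, but the voice path pays a round-trip cost for every adjustment, which is exactly what makes feedback-driven interactions a poor fit for voice output.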


Back-and-Forth Interactions

Say you want to try a new local restaurant. You ask your device to list Thai restaurants nearby. The system can maybe make some assumptions, such as that you want the place to be currently open and well-rated, but not others, such as preferences for price range or distance (drive? walk?). It may ask some clarifying questions. You answer them and hear a choice of three options, but do you have everything you need to make your comparison and decide? You’ve probably already forgotten at least one of the options, so they may need to be repeated… more than once. And maybe you’d like to know more about the menu or the reviews or the ambiance? …Yeah, good luck with that. Why aren’t you just using the Yelp app? I mean, there really is an app for that.

No matter how capable artificial intelligence systems become, this weakness will remain. There is no advance in technology that can overcome this fundamental limitation of voice as an output modality.

An Integrated Experience

As an example of how voice and screen-based interfaces can complement each other, let’s expand on the example above—wanting to try a new restaurant. The experience today involves asking Alexa for Thai restaurants nearby, listening to it respond with a list of suggestions made with unclear selection criteria, and then being directed to look at the Alexa app on your phone for further information. This interaction recognizes that trying to do everything over voice would be crazy-making but doesn’t take full advantage of the power of integrating voice and screen. How can we make this better?


Setting aside the specific device (e.g., Echo Show), let’s consider what we could do with both voice and screen available. In responding to the initial request, the system should let the user understand both the results and the selection criteria, and allow the criteria to be adjusted or refined if needed. One possible approach is to vocalize the search criteria—“Here’s a list of Thai restaurants open now near your current location”—while on the screen giving the results primary focus and the search criteria secondary focus. Vocalizing the results isn’t very effective, for the reasons given above. Vocalizing the initial search criteria, on the other hand, could be effective: the user can immediately check the system’s assumptions (is anything I heard wrong?) without much attention or memory, and without interrupting the primary focus (assessing the results). If the selection criteria need adjustment or refinement, the user could modify them via voice or touch and update the results. Changes made via voice could be acknowledged over voice; changes made via the screen don’t need a vocal acknowledgement, because the user is already looking at the screen (e.g., press the “$” price-range button, see that the “$” button is now selected).
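
One way to structure such a hybrid response is to pair a short spoken summary of the criteria with a screen payload carrying the full results. The sketch below is illustrative only; the types, names, and stand-in search results are invented, not any real Alexa or device API:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class SearchCriteria:
    cuisine: str
    open_now: bool = True
    price_range: Optional[str] = None  # e.g. "$", "$$"

    def vocalize(self) -> str:
        # Voice output carries only the assumptions, so the user can quickly
        # confirm them ("is anything I heard wrong?") without leaving the screen.
        price = f" in the {self.price_range} price range" if self.price_range else ""
        status = "open now " if self.open_now else ""
        return f"Here's a list of {self.cuisine} restaurants {status}near you{price}."

@dataclass
class HybridResponse:
    speech: str          # spoken: the criteria, brief and checkable
    results: List[str]   # on screen: scannable, persistent, skippable

def search(criteria: SearchCriteria) -> HybridResponse:
    # Stand-in for a real restaurant lookup.
    results = ["Thai Basil", "Lemongrass Kitchen", "Bangkok Garden"]
    return HybridResponse(speech=criteria.vocalize(), results=results)

def refine(criteria: SearchCriteria, price_range: str,
           via_voice: bool) -> Tuple[HybridResponse, str]:
    # Changes made by voice get a spoken acknowledgement; changes made on
    # screen don't, since the selected "$" button is already visible feedback.
    criteria.price_range = price_range
    ack = f"Okay, showing {price_range} options." if via_voice else ""
    return search(criteria), ack

response = search(SearchCriteria(cuisine="Thai"))
print(response.speech)
```

The key design choice is that each modality acknowledges input through the channel the user is already attending to: voice changes are confirmed aloud, screen changes are confirmed visually.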

Know Thy Tools

This approach shows that a voice-screen hybrid experience offers potential advantages over voice-only or screen-only experiences. Voice interfaces are an exciting new technology, and it can be tempting to search the world for nails to use this proverbial hammer on. Whenever designing an experience, it’s important to understand the strengths and weaknesses of the tools in your toolbelt. Voice interfaces bring new capabilities we didn’t have before, but they often can’t get the job done well in isolation.

In this post, we mostly explored the usability of voice interfaces, but that’s just one component of a strong user experience. Next time, we’ll explore a different component: social and emotional needs.

Image by Vidar Nordli-Mathisen on Unsplash

filed in: IoT

About the Author

Jared Kirschner
Electrical Engineer

Jared orchestrates electrons via software and circuitry to enable carefully considered user experiences defined in collaboration with design, strategy, and engineering colleagues.

With a diverse set of competencies in software and electrical engineering, Jared applies the appropriate technology to the task at hand—whether developing firmware for a regulated medical device, prototyping an experience to inform the design process, building out the full stack for a web/mobile experience, or analyzing data and building quantitative models to understand and optimize engineering systems.

Jared earned a B.Sc. in electrical and computer engineering with a concentration in sustainability from Olin College of Engineering.