Thursday, October 27, 2011

The advent of the Audio User Interface

Since about 2000, I have touted the inevitability of powerful voice recognition on mobile devices and the benefits of using voice as the primary user interface. The latest iPhone release (the 4S) brought Siri, which demonstrates a rudimentary implementation of that idea.

The key technological component is the ability to do ad-hoc voice recognition on distributed servers out on the internet (i.e., "in the cloud") and to have the results of what I will call "voice parsing" returned to the requestor.

If we can toss aside our preconceived notions about having a Graphical User Interface in favor of an Audio User Interface, then we can perhaps envision a device that is the size of a watch (or really even smaller - the size of a set of earbuds with a mic) that has the sole job of recording our voice and transmitting it to the internet.
Once a "voice command" is sent to be parsed in the cloud, the results can be routed back to the original device, provided it has enough technical sophistication to do anything with them. More likely, the results will be forwarded to a "virtual command processor" that is personal to you, which will then execute the command and send the results back to your tiny device.  Think of the "virtual command processor" as a virtual PC (or Mac) that has all of the capabilities of a personal computer and that accepts machine-control commands from a voice parser.
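
To make the routing concrete, here is a minimal sketch in Python of one utterance traveling from the device to the voice parser, on to the virtual command processor, and back. Every name in it (VoiceParser, VirtualCommandProcessor, handle_utterance) is hypothetical, and the actual speech recognition is replaced by simply decoding bytes as text, so treat it as an illustration of the flow rather than a real implementation.

```python
class VoiceParser:
    """Hypothetical stand-in for the cloud service that parses recorded audio."""

    def parse(self, audio: bytes) -> str:
        # Assumption: real speech recognition is replaced by decoding the bytes
        # as text, so the sketch runs without any speech library.
        return audio.decode("utf-8").strip().lower()


class VirtualCommandProcessor:
    """Hypothetical stand-in for the personal 'virtual PC' that executes commands."""

    def __init__(self, owner: str):
        self.owner = owner
        self.log = []  # every command received, in order

    def execute(self, command: str) -> str:
        self.log.append(command)
        if command == "compose email":
            return "new email opened"
        return f"unknown command: {command}"


def handle_utterance(audio: bytes, parser: VoiceParser,
                     vcp: VirtualCommandProcessor) -> str:
    """Route one utterance: device -> voice parser -> VCP -> back to the device."""
    command = parser.parse(audio)
    return vcp.execute(command)


if __name__ == "__main__":
    my_vcp = VirtualCommandProcessor(owner="me")
    print(handle_utterance(b"Compose Email", VoiceParser(), my_vcp))
```

The tiny device only has to record and transmit; everything that needs real computing power lives in the two server-side classes.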

The sheer computational power required to recognize - and to parse - a voice command is simply too great to expect a small, handheld device to provide. Thus it becomes necessary to offload the processing to a distributed bank of servers somewhere out on the internet.

For example, I currently wear an iPod nano as a watch and plug my iPhone headset into it.

What if this same setup were capable of recognizing and executing voice commands?

Certainly, with a bit of tweaking, a rudimentary cellphone could be made the size of a watch (using the headphone cable as an antenna or not, as the case may be).

In this hypothetical situation, here is a dialog with the device:
I speak into my headset: "Compose email"
The "voice parser" (VP) parses the command and routes it to my personal "virtual command processor" (VCP), which knows my email account information and literally opens Yahoo Mail and starts a new email. (Of course, I see none of this.)
I speak into my headset: "Email contents"
VP tells VCP to set focus to the body of the email.
I speak into my headset: "Four score and seven years ago"
VP parses this and is unsure of the spelling of "Four" vs. "For".
VP automated voice requests: "Uncertain of spelling of 'Four' in 'Four score and seven'. Is it spelled F-O-U-R?"
VP waits for my response of Yes or No.
I speak into my headset: "Yes"
VP tells VCP to inject the full sentence into the body of the email.
I speak into my headset: "Send email to Tom Stevens"
VP tells VCP to set focus to the "To" field and type in "Tom Stevens", which the VCP will find - or not - in my personal address book. VP then tells VCP to hit "Send".
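
The confirmation step in the dialog above (the VP pausing on "Four" vs. "For" and waiting for Yes or No) can be sketched as a small state machine. This is only an illustration under assumptions: the DictationSession class, its two-entry HOMOPHONES table, and the exact wording of the prompt are all made up for the example, and a real voice parser would need a far richer notion of ambiguity.

```python
# Illustrative, two-entry table of words the parser cannot tell apart by sound.
HOMOPHONES = {"four": "for", "for": "four"}


class DictationSession:
    """Hypothetical VP-side session that confirms ambiguous words before
    injecting a sentence into the email body."""

    def __init__(self):
        self.body = []           # sentences already injected into the email
        self.pending = None      # sentence as first guessed, awaiting Yes/No
        self.alternative = None  # same sentence with the other spelling

    def hear(self, utterance: str) -> str:
        """Process one spoken utterance and return the VP's spoken reply."""
        text = utterance.strip()

        # If a question is outstanding, the utterance must be Yes or No.
        if self.pending is not None:
            answer = text.lower()
            if answer not in ("yes", "no"):
                return "Please answer Yes or No"
            self.body.append(self.pending if answer == "yes" else self.alternative)
            self.pending = self.alternative = None
            return "OK"

        # Otherwise look for the first word with a known homophone.
        words = text.split()
        idx = next((i for i, w in enumerate(words) if w.lower() in HOMOPHONES), None)
        if idx is None:
            self.body.append(text)
            return "OK"

        guess = words[idx]
        other = HOMOPHONES[guess.lower()]
        if guess[0].isupper():
            other = other.capitalize()
        self.pending = text
        self.alternative = " ".join(words[:idx] + [other] + words[idx + 1:])
        spelled = "-".join(guess.upper())
        return f"Uncertain of spelling of {guess}. Is it spelled {spelled}?"


if __name__ == "__main__":
    session = DictationSession()
    print(session.hear("Four score and seven years ago"))
    print(session.hear("Yes"))
    print(session.body)
```

Nothing reaches the email body until the question is answered, which mirrors the "VP waits for my response" step: the session refuses any reply other than Yes or No while a confirmation is pending.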


Obviously there are many edge cases and flow issues to work out before all of this runs smoothly.  This is definitely an evolutionary process.
We are at version 0.1 alpha right now.


I think visually challenged persons will be of great help in developing this technology to its fullest. They are used to interfacing with the world around them from a *mostly* auditory standpoint and can help us to more fully develop Audio User Interfaces ("AUIs") that make sense.
shannon norrell