Microsoft's future Natural Language Query unveiled. Speech recognition API for Windows Phone 7 in the works

Microsoft’s Tellme division has unveiled its future plans for the upcoming Natural Language Query speech recognition technology that the company is developing for Xbox/Kinect, Windows, and Windows Phone. As of right now, Tellme is only used for relatively simple speech search queries and commands that usually don’t involve more than a few words, plus simple voice-to-text translation (as currently available in Windows Phone 7 Mango and other Microsoft products like Windows 7, Kinect, etc.).

Tellme’s future speech recognition system should allow users to talk naturally to their phones, with a more integrated system that interacts seamlessly with the user’s social and business connections and personal preferences, and that understands their intent, likes, and dislikes. You should check out the video below highlighting these future capabilities:

Microsoft has also revealed that a Tellme Speech Recognition API for Windows Phone 7 is in the works, which will allow third-party developers to use the Tellme service in their apps.
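Microsoft hasn’t published the API surface yet, so as a rough, hypothetical sketch only: speech APIs on mobile platforms tend to follow a session-plus-callback shape, where the app starts a recognition session and receives transcribed text asynchronously. The sketch below (in Python for readability; a Windows Phone app would actually use C#/Silverlight) uses entirely made-up names — `SpeechSession`, `on_result`, `FAKE_TRANSCRIPTS` — and a faked transcript so the control flow is visible without a real backend.

```python
# Hypothetical sketch of the general shape a third-party speech API might
# take. All names here are illustrative assumptions, not the Tellme API.

class SpeechSession:
    """Simulates a push-to-talk recognition session that returns text."""

    def __init__(self, on_result):
        # Callback invoked with the recognized text when a result arrives.
        self.on_result = on_result

    def recognize(self, audio_clip_id):
        # A real service would stream audio to a backend recognizer;
        # here we look up a canned transcript so the flow is testable.
        transcript = FAKE_TRANSCRIPTS.get(audio_clip_id, "")
        self.on_result(transcript)


# Canned "recognition results" standing in for the remote service.
FAKE_TRANSCRIPTS = {"clip-1": "find pizza near me"}

results = []
session = SpeechSession(on_result=results.append)
session.recognize("clip-1")
print(results[0])  # -> find pizza near me
```

The callback style matters here because, as commenters note below, recognition depends on a backend server: an asynchronous result (or error) path lets the app stay responsive when that server is slow or unreachable.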

Head over to the Tellme blog post for a more detailed view of the current services and upcoming plans.

source: Tellme via All About Microsoft

  • vkvraju (http://www.vkvraju.com)

    This is going to be huge! Looks like a market changer!

  • BucksterMcgee

    Speech recognition is a tricky thing; it’s one of those things that are magical when they work, but frustrating when they don’t.

    And remember, it is an incredibly hard thing to do. Even if you have a system just as good as a human, think of all the times you’ve misheard someone or they’ve misheard you. There are numerous forms of interference that can distort both what the listener hears and what the speaker is trying to say. Thick accents can turn intended sounds into entirely different ones; stuttering, mumbling, or even just poor enunciation can change a speaker’s output from a clean phrase into a jumbled mess; and overlapping background noise can make conversation unrecognizable or misunderstood.

    Even with a near-perfect system there will be times when it fails, and when it does, will it be as frustrating as when you encounter the same failure speaking to other humans? There are times on support calls when the language barrier between myself and the person on the other end is so bad that I would rather use a mouse to click through a series of menus to solve my problem than continue “trying” to communicate.

    A mouse is rather limited, in that it can only perform one task at a time at one point, but generally the task is completed accurately. Speech, on the other hand, can quickly convey a complex message/task/request, but due in part to that complexity (or the system’s inability to handle it accurately), a failure doesn’t just leave the task undone; it can leave you feeling that the task is now further from completion than before you started. So any system looking to replace the simple precision of a mouse must not only be highly successful, but also fail gracefully, so that when the inevitable does happen, it isn’t received as a step backward.

    As an example, I love the speech on my phone. Often opening something, finding something, or responding to something is easier and quicker than if I tried to navigate to it normally. When searching with Bing I almost never use the keyboard unless the speech has a hard time hearing me, or the search terms are hard to convey through speech. For certain applications that I don’t want pinned up front, it can be faster and easier to simply voice-search for them rather than scroll to them. With search I can go directly to the item I want, regardless of the size of the list, in one command, compared to slowly narrowing in on the target through normal UI elements.

    The problems arise when it doesn’t work as I was hoping. Searching Bing for names of places, restaurants, etc. often causes issues because the phonetics of how I pronounce something don’t match up with how Bing sees it. This isn’t necessarily Bing’s fault, as my own ignorance could be leading me to pronounce the search incorrectly, but I would assume the average person wouldn’t take that into consideration. When I search for an item on my phone and the system misunderstands, taking me to the wrong app or trying to do a Bing search, even if I cancel and try again, the time and effort I would have saved by not navigating normally begins to be lost.

    These are problems that happen even with fairly simple search terms; asking my phone to find me a bakery to bake a red velvet cake for August the 23rd is, while I understand how it could be broken down to work, still several years off.

    That all being said, I am actually a huge supporter of this type of advancement in technology. I view the keyboard, and especially the mouse, as highly outdated technologies that limit the way we interact every day. While they have their places, I would far rather have that perfect Star Trek-type communication with my computer (or Natural User Interface, as Microsoft calls it) than these limited tools we use now. And while we are still years away from the perfect interface, we can see the beginnings of that movement with things like the Kinect and the Tellme services.

    I am incredibly anxious to get the Fall update for the Xbox dashboard. Not only is it the next step to an all-Metro UI for Microsoft, but it also brings the type of voice control I’ve expected from the Kinect since its early prototypes. It’s the “Computer [Xbox], play The Frames, ‘People Get Ready’” type of interaction that I want. The “Computer, play Lord of the Rings: The Two Towers, chapter 5” … “Computer, pause. I mean chapter 4” that they are so close to.

    If they just keep advancing and don’t stay stagnant, they will almost surely reach the near-perfect interface within the decade. But right now on my Xbox it’s just, “Xbox pause… Xbox play… Xbox next.”

    And again, don’t get me wrong: with that alone I am already disappointed when I can’t do the same with my computer, TV, or hell, even my microwave. It’s the “I don’t have to reach for anything,” the “crap, I forgot to turn that off,” or the “just wait a minute…” type of thing that is so great. From across the room, without even lifting your head or turning an inch, you can control it, and that is exactly the bit of that magic I talked about before. I smile every damn time I pause my Xbox, and at the subsequent plays as well. It’s great when I’m in the next room and a less than desirable song comes on: “Xbox next!” and my Xbox obeys. It allows you to continue whatever you were doing without interruption, and truly, the more you use it the more you wish it were everywhere. But then again, when it fails, it’s embarrassing, and the magic fades away.

    So when I see a video like this, I think, “Ya, well of course you want a computer that understands you perfectly, just as if there were a tiny human hiding in the circuitry; that’s an easy thing to picture conceptually, but how close are they? Really?” I would be far more impressed if they showed off some of their latest research so we could see just how close they are. Also, what are the limitations of this? It seems like a lot of what Tellme does requires a backend server to do its magic, so what happens when you can’t reach that server? Does the magic fade away?

    I would love, love, love to see even a tenth of this type of control in Windows 8. It would be just the type of OS where you can interact however you want. Mouse and keyboard, touch, gesture, voice? Yes. I can see how they can slowly improve Windows Phone as well, even with its limited resources. Tellme for third parties is a great idea, but how soon? Do we have to wait a year for it? Or is it in the next (after Mango) update? What can you do with it? Will we be able to transcribe full emails and messages? Will it have an always-on mic like Kinect, so we can just say, “Phone [Zune?], pause”? Or does it require a press like now? And again, when do they expect to have that near-perfect “find me a bakery…” type of interaction ready? A year? Five years? Ten?

    When it comes down to it, I’m more impressed with what they can do now rather than what they “could” do later. If they say that’s what they will be able to do, then I’d better start seeing more evidence, because hell, I want a transporter that can beam me from here to the beaches of Hawaii, and not only can I make a nice presentation about it, I even have a few ideas for how to make it work. ;)

  • Anonymous

    Holy crap!  I can’t believe you just wrote all of that.