Monday, March 27, 2006

Developing Speech Applications - Part I

I haven't blogged, or conducted a session, about Speech Technologies (my second love) and the Microsoft Speech Application SDK (SASDK) for quite some time. However, I have received a lot of requests for the SASDK tutorial on the FlightEnquiry.NET Speech Demo that I showed at the Microsoft ISV Community Days in March 2005, at the Pakistan Developer Conference 2005 in June 2005, and at various INETA Pakistan user group events last year. I have figured that it is better to blog about something in a series of posts than to type one HUGE post and be absent from the blogging world for days and weeks. So, if time and old age permit, I will try to blog about the SASDK tutorial in 3 posts. (I will try to keep the "Atlas at last" posts coming during this time too.)

NOTE: Most of the following text had been written about 8 months ago.

BACKGROUND
When ASP.NET first appeared on the horizon, most of us heaved a sigh of relief: no more having to work with ASP's two separate languages, VBScript on the server and JavaScript on the client; no spaghetti code; and most importantly, no need to worry about posting data back on to the form when it reloaded. With a number of cool ASP.NET server controls, life had been made simpler for the Web Developer. In a sense, for a brief moment in time, ASP Developers (who had shifted to ASP.NET) forgot all about JavaScript, since all ASP.NET controls had client-side logic built into them. The burden of writing JavaScript (or JScript) code had shifted from the Web Application Developer to the ASP.NET Server Control developer.

INTRODUCTION
Enter the Microsoft Speech Application SDK (SASDK). Suddenly, client-side scripting became relevant again. Speech Applications built with the SASDK were browser-based ASP.NET applications with much of the handling done at the client, i.e. the web browser. In making their Speech platform compatible with ASP.NET, Microsoft had two objectives: (1) to allow companies to use their existing ASP.NET code infrastructure in developing Speech apps; and (2) to make use of SALT (Speech Application Language Tags), a standard for providing speech input and output in the web browser using tags. In various talks and speaker sessions, I have repeatedly been asked if the Microsoft SASDK would speech-enable websites or allow voice-activated web browsing. My answer: yes and no, depending on the type of platform you are developing the speech application for.

A Speech Application developed using the Microsoft SASDK falls into one of two categories: (1) Voice-only or (2) Multi-Modal. Both application types leverage existing ASP.NET code but differ in implementation. Voice-only applications have a VUI (Voice User Interface) and no GUI, i.e. all communication between the app and the user is by means of voice or DTMF (Dual-Tone Multi-Frequency) that may be transferred over a telephone line or any other voice carrier medium. Multi-Modal applications, on the other hand, have a VUI as well as a GUI. This allows the user to interact with the application using voice as well as with the mouse and keyboard.

The use of Voice-only telephony applications is understandable, but why use Multi-Modal? Fact of the matter is that the bulk of Multi-Modal applications that would be developed within the next few years won't be for the desktop PC, but would instead be for Mobile and Smart Clients. PC users typically have all the hardware they need, namely the keyboard and mouse, to interact with their applications. However, it is difficult to use Mobile/Smart Client applications with a small keypad or even a stylus while on the road. Microsoft's release of the Mobile Internet Toolkit did not really bring about the increase in Mobile application development that many would have expected, because such applications are extremely difficult to use on the small devices they run on. Speech-enabled Mobile or Smart Client applications require minimal use of the keypad or stylus and rely more on speech for input (and even output), thus providing increased ease-of-use.

[NOTE: Multi-modal applications failed to take off in the way that Microsoft had expected. The Speech API (SAPI) 5.3 that comes built into Windows Vista allows development of desktop speech applications and DOES NOT make use of SALT.
You can download a video of the Multi-modal Speech Application (Sublime Demo) running on a hand-held device from Jahanzeb Sherwani's web page.]

The execution model followed by an application developed using the SASDK is much the same as that of any ASP.NET application, consisting mainly of a certain number of requests and responses. However, the client-side processing done for a speech application is different. In a typical web form (non-speech app) scenario, information is submitted to the server through form controls (textboxes, checkboxes, radio buttons etc.) which form the GUI. In speech applications, input is received by a speech control that is not visually rendered but is in fact hosted inside the web page using a tag from the SALT specification. When a voice input is received, it is processed on the client by JScript; meaningful information is extracted and sent to the server by means of a variable known in the SASDK domain as a "semantic item". The importance of JScript that I highlighted at the start of this article stems from this client-side processing of the speech input. This scenario is valid for both Voice-only and Multi-modal Speech applications. For more information on SALT, check out the SALT Forum web site.
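To make the flow above a little more concrete, here is a rough, language-neutral sketch written in Python. It is not SASDK, SALT, or JScript code: the class names, the `siDestination` item name, the toy city vocabulary, and the confidence threshold are all assumptions made purely for illustration. It only mirrors the three steps described above: a recognition result arrives on the client, script extracts the meaningful part into a semantic item, and only that extracted value travels to the server.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RecognitionResult:
    """Hypothetical stand-in for what a recognizer hands to client script."""
    text: str
    confidence: float


@dataclass
class SemanticItem:
    """Hypothetical stand-in for a semantic item: a named slot for extracted meaning."""
    name: str
    value: Optional[str] = None


# Assumed toy vocabulary; a real speech application would rely on a grammar.
CITY_CODES = {"karachi": "KHI", "lahore": "LHE", "islamabad": "ISB"}


def extract_city(result: RecognitionResult, item: SemanticItem,
                 threshold: float = 0.5) -> bool:
    """Client-side step: pull the meaningful part out of the utterance."""
    if result.confidence < threshold:
        return False  # too uncertain; an application would re-prompt here
    for word in result.text.lower().split():
        if word in CITY_CODES:
            item.value = CITY_CODES[word]
            return True
    return False  # nothing usable was said


def build_postback(items: List[SemanticItem]) -> Dict[str, str]:
    """Only filled semantic items are sent to the server, much like form fields."""
    return {i.name: i.value for i in items if i.value is not None}


destination = SemanticItem("siDestination")
heard = RecognitionResult("I want to fly to Lahore", 0.82)
if extract_city(heard, destination):
    print(build_postback([destination]))
```

The point of the sketch is the division of labour: the raw utterance never reaches the server, only the value that client-side script decided was meaningful.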

Please note that in order for Internet Explorer to interpret SALT tags, you need to have the Microsoft Enterprise Instrumentation Framework (EIF) installed on your PC. EIF is a prerequisite and comes with the SASDK. Also, SASDK v1.0 and v1.1 work only with .NET Framework 1.1 and Visual Studio .NET 2003.

In the second part of this three-part series of posts, I will briefly discuss the series of functions performed during the execution of a typical Voice-only application.

4 comments:

kumara said...

Hi, I went through your presentation and it is good. I have been working with Speech Server for only the past 4 days. I created a sample application. When I run that solution, I get the error "not attached to SALT client" in the telephony simulator status bar. I didn't connect any telephone line through TIM. Is that necessary, or is it needed only in deployment? Kindly give a solution for this.

Adnan Farooq Hashmi said...

TIM is not related to the error message you are getting. SALT is interpreted inside IE, which happens to be your SALT client. I guess the error message points to a faulty EIF (Enterprise Instrumentation Framework) installation.

kumara said...

Hi, I have some doubts about Speech Server. I created a sample DTMF application in Speech Server. I don't know how to retrieve the values that are pressed. Where do I get those values? Kindly help me.

dorla said...

Very nice - easy to follow, simple, and working. Thanks for the knowledge!