GB2342530A - Gathering user inputs by speech recognition - Google Patents

Gathering user inputs by speech recognition

Info

Publication number
GB2342530A
Authority
GB
United Kingdom
Prior art keywords
speech
data
user input
text
page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB9821870A
Other versions
GB9821870D0 (en)
Inventor
Michael Banbrook
Michael Edward Williams
James Andrew Walker
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vocalis Ltd
Original Assignee
Vocalis Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vocalis Ltd filed Critical Vocalis Ltd
Priority to GB9821870A priority Critical patent/GB2342530A/en
Publication of GB9821870D0 publication Critical patent/GB9821870D0/en
Publication of GB2342530A publication Critical patent/GB2342530A/en
Withdrawn legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/42Systems providing special services or facilities to subscribers
    • H04M3/50Centralised arrangements for answering calls; Centralised arrangements for recording messages for absent or busy subscribers ; Centralised arrangements for recording messages
    • H04M3/527Centralised call answering arrangements not requiring operator intervention
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/42Systems providing special services or facilities to subscribers
    • H04M3/487Arrangements for providing information services, e.g. recorded voice services or time announcements
    • H04M3/493Interactive information services, e.g. directory enquiries ; Arrangements therefor, e.g. interactive voice response [IVR] systems or voice portals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M2201/00Electronic components, circuits, software, systems or apparatus used in telephone systems
    • H04M2201/40Electronic components, circuits, software, systems or apparatus used in telephone systems using speech recognition
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M2201/00Electronic components, circuits, software, systems or apparatus used in telephone systems
    • H04M2201/60Medium conversion
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/42Systems providing special services or facilities to subscribers
    • H04M3/428Arrangements for placing incoming calls on hold
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M7/00Arrangements for interconnection between switching centres
    • H04M7/12Arrangements for interconnection between switching centres for working between exchanges having different types of switching equipment, e.g. power-driven and step by step or decimal and non-decimal

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

A voice recognition system answers an incoming telephone call and then reads an Internet web page, using it as a template for: speech data to be synthesised and read to the caller; prompt speech data read to the caller to prompt an appropriate response; and recognition data used by the speech recogniser to analyse the caller's speech signals and identify the selection or input being specified. The user can be prompted to select one or more hypertext links, each of which triggers another web page, or a portion of a web page, to be read to the user together with its associated inputs. The user inputs may also be options from a drop-down list, free-form text or button selections. The control codes within the received web page that specify these various options are identified, and appropriate prompt text and recognition data are generated for each one. Additional navigation recognition data, such as call termination, repeat and review words, is included. The data retrieved and recognised by the speech recognition system is relayed back into the web page recovered by the system, mimicking the web page having been completed over a conventional computer link using keyboard and mouse selections.

Description

GATHERING USER INPUTS BY SPEECH RECOGNITION This invention relates to the gathering of user inputs by speech recognition.
More particularly, this invention relates to the recognition of speech delivered on a speech communication channel to a remote call answering device.
Speech recognition is known in various fields. One example is the recognition of speech for automated text input into word processors and the like. A different use of speech recognition technology is in the automation of telephone messaging systems, whereby a user makes a telephone call and a speech recognition system on the receiving side navigates through a menu structure to gather information from the user, for example to route a call.
A problem with speech recognition systems of the type that respond to speech over a telephone line is that their cost and complexity make them unattractive to small and medium size businesses.
Viewed from one aspect the present invention provides apparatus responsive to user speech signals for gathering user inputs, said apparatus comprising: a call answering device for answering an incoming speech call on a speech communication channel; a text data and control code reader responsive to said incoming speech call for accessing a page of text data and control code data corresponding to a page of text data and control data stored on a host computer and accessed via a computer network, said control code data including user input control code data for controlling at least one user input response relating to said page to be returned to said host computer via said computer network; a speech synthesiser responsive to speech data representing said text data for generating a machine speech signal that is transmitted to a user via said speech communication channel; a speech recogniser responsive to speech recognition data representing at least one spoken user input associated with said user input control code and a user speech signal received from said user via said speech communication channel for recognising a spoken user input; and a user input relay responsive to recognition of said spoken user input by said speech recogniser for generating a corresponding user input response and returning said corresponding user input response to said host computer via said computer network.
The invention uses text data and control data read from a host computer via a computer network as a template for prompting for, and recognising, user inputs in the form of speech. The control codes within pages stored on the host can provide links to other pages, and jumps within a page, that serve the function of a menu-type structure. The invention makes it possible for the speech recognition hardware and systems to be centrally provided (e.g. shared between many users), with the gathered user inputs being distributed via the computer network back to the different users and input into their standard processing systems. Data input to the page via the speech recognition system of the invention, or input to the page by conventional techniques, may be routed in a single stream to the recipient, simplifying the management and control of this information. Furthermore, maintaining and updating the page of text data and control code data will also update the mechanism for speech recognition input to this page without any changes having to be made to the central system.
It will be appreciated that the text data and control code data could be read directly from the host computer each time it was needed. However, a higher level of performance is achieved in preferred embodiments in which the text data and control code reader accesses a cached version of said page of text data and control code data in response to said incoming speech call.
The speech data and speech recognition data may be automatically generated by a parser, either dynamically or prestored, depending upon the balance required between currency, storage space and speed. The speech data may also be recorded live speech data.
It will be appreciated that the page of text data and control code data could have various different formats and encoding languages. However, preferred embodiments of the invention are ones in which the page is an internet web page and the text data and control code data are encoded as html.
The control codes embedded within the page can have various different forms.
In order to provide menu-type navigation, hypertext links can be read out and recognised to provide branching to other pages or portions of pages, giving a menu-like interface.
Further control codes can be of the type that specifies a form, which may for instance include free-form text fields (possibly constrained to numeric data only), option lists, button selections and the like.
On a graphically displayed page, graphical information (such as a drop-down list button or a radio button display) gives a visual prompt to a user of the type of input required. When speech recognition is being used and access to these visual prompts is not available, the system is able to compensate by replacing them with appropriate text data within the page that is synthesised into speech, explaining the input options to the user and prompting them to make an appropriate choice or select an appropriate option.
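By way of illustration only, the short Python sketch below shows how such a spoken prompt might be built from the options of a drop-down list or radio-button group. It is not part of the patent; the function name and the example option values (taken from the dialogue later in this description) are merely illustrative, and at least two options are assumed.

    def prompt_for_options(field_name, options):
        # Build a spoken prompt equivalent to the visual control:
        # name the field, list the choices, and offer a repeat keyword.
        head = ", ".join('"%s"' % opt for opt in options[:-1])
        return ('%s. Say %s or "%s". To repeat this list, say "repeat".'
                % (field_name, head, options[-1]))

    print(prompt_for_options("Model number", ["P5-2", "P5-3", "P6-1", "Other"]))
    # Model number. Say "P5-2", "P5-3", "P6-1" or "Other".
    # To repeat this list, say "repeat".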
The present invention is particularly suited for allowing user input to a user support web page via speech recognition rather than a conventional computer network link. A clear example of the usefulness of this alternative input mechanism would be when the fault being reported was one that prevented the user making an appropriate internet connection to the user support web page.
The invention is also particularly useful in providing an alternative way for a user to leave information when an incoming speech call cannot be connected to a human operator, and as an alternative to the extremely frustrating situation where a user is kept on hold for long periods of time pending the availability of an operator.
Viewed from another aspect the present invention provides a method of gathering user inputs in response to user speech signals, said method comprising the steps of: answering with a call answering device an incoming speech call on a speech communication channel; in response to said incoming speech call, accessing a page of text data and control code data corresponding to a page of text data and control data stored on a host computer and accessed via a computer network, said control code data including user input control code data for controlling at least one user input response relating to said page to be returned to said host computer via said computer network; in response to said speech data representing said text data, generating a machine speech signal that is transmitted to a user via said speech communication channel; in response to said speech recognition data representing at least one spoken user input associated with said user input control code data and a user speech signal received from said user via said speech communication channel, recognising a spoken user input; and in response to recognition of said spoken user input, generating a corresponding user input response and returning said corresponding user input response to said host computer via said computer network.
An embodiment of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which: Figures 1 and 2 illustrate example internet web pages from a user support web site; Figure 3 schematically illustrates an embodiment of the present invention; Figures 4A, 4B, 4C and 4D are a flow diagram illustrating one embodiment of the invention; and Figures 5A, 5B and 5C are a flow diagram illustrating another embodiment of the invention.
Consider the following:

Scenario

Mr A's recently-purchased PC has crashed for the third time today. He tries to call the support number but gives up after being held in a queue for over 30 minutes. The support leaflet lists another support option, an 0800 number where he can log the fault and be guaranteed a call back by an engineer within four hours. He calls 0800 XXX XXX, and hears the following:

Dialogue

System: "Welcome to some-company's support service. To help us provide the support you need, please select from one of the following options: say "Hardware support", "Software support" or "General faults". To repeat this list, say "repeat"; to leave at any time, say "Goodbye"." <beep>

Mr A: "Software support"

System: "Software support. To assist us in resolving your fault, please provide the following details. Say "Cancel" at any time if you wish to discontinue. Customer number." <beep>

Mr A: "842278"

System: "Model number. Say P5-2, P5-3, P6-1 or Other. To repeat this list, say "repeat"." <beep>

Mr A: "P6-1"

System: "Does your PC start Windows 95 successfully? Say "yes", "no", or "don't know"." <beep>

Mr A: "No"

System: "Briefly describe the nature of the problem." <beep>

Mr A: "I switch my PC on and it says Starting Windows 95... Then it says Windows Protection fault. Please contact your hardware vendor."

System: "An engineer will contact you within four hours of logging this fault. Please provide a telephone number we can contact you on. Telephone number." <beep>

Mr A: "01223 846177"

System: "To review your answers, say "review". To submit your response, say "submit". To cancel, say "Cancel"." <beep>

Mr A: "Review"

System: "Customer number 842278. Model number P6-1. Does your PC start Windows 95 successfully? No. Briefly describe the nature of the problem. I switch my PC on... etc. Telephone number 01223 846177."

System: "To review your answers, say "review". To submit your response, say "submit". To cancel, say "cancel"." <beep>

Mr A: "Submit"

System: "Your fault reference number is 66745/1. An engineer will contact you shortly. Goodbye."
Behind the scenes

The above dialogue is achieved using the following two web pages, HTML parsing, text-to-speech and speech recognition.
The HTML code for the web page shown in Figure 1 is:

<html>
<head>
<title>Support service</title>
</head>
<body bgcolor="#FFFFFF">
<hr>
<p align="center"><strong>Welcome to some-company&#146;s support service.</strong></p>
<hr>
<p>To help us provide the support you need, please select from one of the following options:</p>
<p align="center">
<a href="hw.html">Hardware support</a><br>
<a href="sw.html">Software support</a><br>
<a href="general.html">General faults</a>
</p>
</body>
</html>

What the system does

In the first instance, the system answers the call and looks up the web page associated with the called number. This web page is grabbed live (although it could be cached, with a check for currency being made by retrieving the header of the remote page), and scanned for HTML "href" tags (also known as links). These are converted to phonemes for recognition purposes, and stored in an option list along with the hypertext links. Any other text within the <body> tags is read out to the caller. The speech data and recognition data may alternatively be created in advance to speed processing. The speech data may also be a recorded live speech signal rather than synthesised.
Text to read out:

"Welcome to some-company's support service. To help us provide the support you need, please select from one of the following options:"

Option list:

Hardware support (hw.html)
Software support (sw.html)
General faults (general.html)

Dialogue is built around this by adding the word "Say" before the option list, "or" before the last option in the list, and adding options for repeating the list or leaving the service. This additional text may also be built into a special version of the web page with this extra prompt material included.
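The link-scanning step described above might, for illustration only, be sketched with Python's standard html.parser module; the demo used its own hand-written parser, so this stdlib-based stand-in (and the file name support.html) is an assumption:

    from html.parser import HTMLParser

    class OptionScanner(HTMLParser):
        # Collects <a href> links and their anchor text as an option list.
        def __init__(self):
            super().__init__()
            self.options = []   # (spoken keyword, target URL) pairs
            self._href = None
            self._text = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self._href = dict(attrs).get("href")
                self._text = []

        def handle_data(self, data):
            if self._href is not None:
                self._text.append(data)

        def handle_endtag(self, tag):
            if tag == "a" and self._href is not None:
                self.options.append(("".join(self._text).strip(), self._href))
                self._href = None

    scanner = OptionScanner()
    scanner.feed(open("support.html").read())
    print(scanner.options)
    # [('Hardware support', 'hw.html'), ('Software support', 'sw.html'),
    #  ('General faults', 'general.html')]

Fed the Figure 1 page above, such a scanner yields the same option list as shown, ready for conversion of each keyword to phonemes.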
When the caller speaks an option that matches one from the list, the associated link is followed, the new web page shown in Figure 2 is retrieved and the process repeats. Again, the system reads out the web page until it encounters the form. At this point it says "Say "Cancel" at any time if you wish to discontinue".
Then, for each item, it reads out any text, signals a beep, and feeds responses from the caller into the form fields.
For a list item, such as the model number, it reads out all the items from the list box and offers a chance to repeat them. Each item in the list is converted to phonemes for recognition (in exactly the same way as an option list built from hypertext links is).
Figure 3 shows an overview of the components of the system, which comprise:

* Text to Speech (TTS) engine: a package called "rsynth" which runs on Linux. This is a Klatt-based synthesiser which, given text input, generates both the transcription (in machine-readable IPA) and a 16-bit linear speech file. The package transcribes the text using transcriptions from the Beep dictionary where they exist, or a rule-based technique where not. Although this package can run on SCO, it seemed that under SCO the dictionary could not be included.
* Recogniser: standard PowerPC hardware recogniser.
* Dialogic: D41ESC in SCBus mode.
* HTML grabber: a package called "lynx" which again runs under Linux. This package can display the page but, by giving the "source" option, it just gets the page and pipes it to stdout.
* HTML parser: this is written into the demo program itself.
The system has a default web page which it uses as the home page. This is the first page that is read upon starting a call. When a call is received the initial page is obtained through the "HTML grabber" from the web. The page is then interpreted line by line by the "HTML parser".
The parser takes the raw html and pre-processes it thus:

* places all commands on a new line by themselves
* removes blank lines
* converts problem characters (such as "&")

The pre-processed html is then read in line by line. Each line is either:

1. text
2. an html command word
3. end of page

Text is passed to the TTS engine, which returns a speech file. This file is passed to the Dialogic card to play asynchronously whilst the next line is read.
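A minimal Python sketch of the three pre-processing steps above follows, for illustration only; the entity table is deliberately abbreviated and is an assumption rather than the demo's actual table:

    import re

    ENTITY_TEXT = {"&amp;": " and ", "&quot;": '"', "&num;": "#"}  # abbreviated

    def preprocess(raw_html):
        # 1. place every html command on a new line by itself
        text = re.sub(r"(<[^>]+>)", r"\n\1\n", raw_html)
        # 2. strip whitespace and remove blank lines
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        # 3. convert problem characters such as "&" into speakable text
        for entity, replacement in ENTITY_TEXT.items():
            lines = [line.replace(entity, replacement) for line in lines]
        return lines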
HTML command words are ignored (neither interpreted nor read out) except for commands which define a list of hypertext links:

* <ul> defines the start of a recognition list
* </ul> defines the end of a recognition list
* <dt> defines the start of a new recognition item in a recognition list. Any text after this is interpreted as "play text" for the current recognition item.
* <a href="http://....">text</a> defines the hyperlink and the recognition keyword(s) for a recognition item.
Navigation of the web is to be done through the "recognition lists". The list contains a number of list items, each of which contains a defined recognition item.
The recognition item defines the html page to go to upon recognition, the keyword to use for the recognition and the text to play out.
When a start of list is encountered, the parser keeps reading lines in until the end of the list is detected. At the end of the list a speech file is generated by the TTS engine containing the speech to be played (in talkover mode) by the recogniser. The TTS engine is also used to transcribe the keywords into IPA phonetic form. This form is then translated into SAMPA (used by the recogniser) and a ptrans file is generated using these keyword transcriptions. This ptrans file is then downloaded to the recogniser and a ptrans recognition initiated in talkover mode.
The return from the recogniser is a list item number, which is used to look up the hypertext link from the recognition list items. This link is then passed to the "HTML grabber" and the whole process begins again with the new web page.
If the end of page is encountered then the call is deemed to have finished; the system closes the call and exits to a wait state for the next caller.
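The outer loop just described might be summarised as below. This is a sketch only: grab_page, speak and recognise stand in for the lynx-based grabber, the TTS/Dialogic playback and the PowerPC recogniser, and their names and signatures are assumptions rather than the demo's actual interfaces.

    def run_call(home_url, grab_page, speak, recognise):
        # Drive one call: fetch a page, speak its text, follow recognised links.
        url = home_url
        while url is not None:
            text, rec_list = grab_page(url)   # rec_list: [(keyword, href), ...]
            speak(text)                       # played in talkover mode
            if not rec_list:                  # end of page, no recognition list:
                break                         # the call is deemed finished
            keyword = recognise([kw for kw, _ in rec_list])
            url = dict(rec_list).get(keyword) # follow the matching hypertext link
        # hang up and wait for the next caller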
Figures 4A, 4B, 4C and 4D are a flow diagram illustrating the possible operation of the device of Figure 3 in accordance with one embodiment.
At step 100 a html page to be processed is fetched via a network link. At step 102 the first line of the html page is parsed, and at step 104 a check is made to see whether it is a html command (control data). If the line does not contain a command, then speech data corresponding to the text is synthesised at step 106 and sent to an asynchronously operating playing mechanism at step 108 before the next line is parsed at step 102.
If the line parsed does contain a command, then step 110 identifies the type of command. A subset of all the possible commands is illustrated. If the line contains a hypertext link (an <a href> tag), then these are collected into a recognition list at step 112 before the next line is parsed. If the line contains an end of paragraph command </p> (or equivalent), then the recognition list is checked at step 114 to see if it is active (i.e. not empty). If the recognition list is not active, then a brief pause state is entered at step 116 before processing of the next paragraph is started.
If the recognition list at step 114 is active, then processing proceeds to a step at which the item "Continue" is added to the recognition list. The recognition list then contains all the hypertext links encountered in the paragraph, together with the "Continue" option. The user may want to select one of the hypertext links or may wish to take none of them, i.e. to continue.
At step 120, the message "You have the following options" is played to the user, followed at step 122 by the playing of all the entries in the recognition list. These options are played as talkover text, i.e. a user can make a selection before the full list has been played, to speed up operation.
At step 124 a check is made to see if any words spoken by the user have been recognised. If no words have been recognised and this is the first time the list has been read, then step 126 plays the message "I'm sorry I have not recognised your selection" and the list is repeated once.
If a word is recognised, or the list has been read twice, then step 128 checks to see if the word recognised was "Continue" or no word was recognised following two readings of the list. If either of these conditions is met, then processing returns to step 102 and no hypertext link is taken. If a word other than "Continue" is recognised, then the html page corresponding to that link is fetched at step 130 and a return to processing at step 102 is made.
If a list command (or other option selection command, e.g. a button) is recognised at step 110 in Figure 4A, then processing proceeds to steps 132, 134 and 136, where the corresponding text data for introducing the options to a user is stripped out from among the command data (a numbered list would have numbers added), and recognition data for the possible user inputs is generated (including numbers for numbered lists), together with the commands to be triggered by recognition of the various options. Step 138 checks to determine when the end of the list has been reached.
When the end of list has been reached, step 140 synthesises the text data into speech and this is then asynchronously played to the user as talkover text at step 142.
Step 144 checks to see if a word has been recognised. If no word has been recognised and the list has been read once, then the "I'm sorry..." message is read at step 146 and the list is repeated. If a word is recognised at step 144, then the corresponding page is fetched at step 148 and processing returns to step 102. A default option may be taken if no word is recognised and the text has been read twice.
Returning to Figure 4A, if a <form> command is recognised at step 110, then the processing illustrated in Figure 4D is started. Every form is terminated by a "Submit" button. Step 150 checks to see if the end of the form has yet been reached.
If the end of the form has not been reached, then the next item within the form is processed at step 152. The items could be: lists, checkboxes, radio boxes, free-form text, numeric data, alphanumeric data etc. These items will be processed using techniques similar to those described in relation to Figures 4B and 4C. After each item has been processed, the user selection/input is stored at step 154 (e.g. a variable's value is set for later return to the host).
At step 150, when the submit button is reached, processing proceeds to step 156 where the user's selections are read back to them, e.g. by reading the form's text and the user's selected options. At step 158 the user is asked to confirm their selections by saying "Yes" or to repeat the input by saying "No". If the user says "Yes", then this is recognised and the form data is sent to the host via the network at step 162. The host will return a next page in response to the form data and this will be processed at step 102. If the user says "No" or no response is made, then the form is cleared and processing returns to the start of the form at step 150.
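For illustration, the read-back and confirmation cycle of steps 156 to 162 might be sketched as follows (Python; read_back and ask are assumed helper functions standing in for the speech synthesiser and recogniser):

    def confirm_form(fields, read_back, ask):
        # Steps 156-162: read back the selections and ask for confirmation.
        read_back(fields)
        answer = ask('Say "Yes" to confirm or "No" to re-enter your answers.')
        if answer == "yes":
            return fields       # step 162: send the form data to the host
        fields.clear()          # "No" or no response: clear the form...
        return None             # ...and the caller re-enters it from step 150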
Figures 5A, 5B and 5C are a flow diagram illustrating the possible operation of the device of Figure 3 in accordance with another embodiment.
At step 10 an incoming call is answered by an automatic call answering and speech recognition system. At step 12 a web page corresponding to the initial web page for the telephone number that was dialled is directly fetched via an internet connection from the host computer on which it resides. In steps 14, 16, 18, 20, 22, 24 and 26 the fetched web page is parsed, possibly amended, and then speech data and recognition data generated.
More particularly, at step 14 all commands within the fetched web page are identified and placed on a new line. At step 16 all blank lines are removed. At step 18 all problem characters, such as "&", quotation marks or parentheses, are converted into corresponding text.
At step 20 the commands within the page are analysed to identify their type, e.g. a list of hypertext links, a selection list from a drop-down box, a free-form text field, a numeric field etc., and appropriate additional text is added that will prompt the user to make a speech response of an appropriate type to generate data that can be used to provide the information needed.
At step 22 the text that has been produced for the fetched page is used to generate speech data that can be supplied to a speech synthesiser over the telephone line. At step 24 the recognition data is generated that can be used for comparison with received speech signals from the user after prompts for input, to detect an appropriate user selection. At step 26 additional speech recognition data is added that can correspond to options such as "goodbye" or "cancel" that may be input at any time or may be appropriate to a particular type of selection. Step 26 may not be needed if a special purpose page is being used.
At step 28 the speech data generated at step 22 is read to the user over the telephone line by the speech synthesiser until a control code is encountered within the page corresponding to a desired user input. At step 30 a test is made to determine whether the user input is one expecting free-form text. If the answer to this is yes, then step 32 proceeds to recognise the user speech signal and convert this to text (possibly a phoneme transcription) until a pause of a predetermined length is detected. When such a pause is detected, recognition ceases and the system proceeds to step 34, where a test is made to see if the word "cancel" was recognised.
If cancel was not recognised, then the free-form text that was converted from the user's input speech signal at step 32 is relayed to the free-form text field within the web page at step 36. If the answer at step 34 was yes, then step 36 is bypassed.
If the answer at step 30 was no, then the user input required will be one of selecting between a plurality of hypertext links, options from a drop-down list or button selections. The appropriate prompt text for the input required is read to the user by the speech synthesiser at step 38. The response of the user is fed to the speech recogniser at step 40. At step 42 a test is made to see if the word "repeat" was recognised. If the answer is yes, then the system returns to step 38 where the prompts are read again. If the answer is no, then the system proceeds to step 44.
At step 44 the system determines whether a valid link, option or selection was recognised. If the answer is no then the system returns to step 38. If the answer is yes, then the system proceeds to step 46 where a test is made to determine whether the option was a hypertext link option. If the option was a hypertext link option then a branch to the new page or portion of page is made at step 48 and then processing returns to step 12.
If the option was not a hypertext link, then step 50 relays the appropriate option or button selection to the web page and processing proceeds to step 52.
At step 52 a test is made to determine whether the end of the current page has been reached. If the end of the current page has not been reached then processing returns to step 28. If the end of the page has been reached then processing proceeds to step 54, at which a summary of the user input data is read back to the user over the telephone line.
At step 56 the user is prompted to select, via a voice input, one of the options of reviewing the input data again, confirming the input data or re-entering the input data.
At step 58 the system recognises the user's speech signal response to this prompt.
At step 60 a user request to review the information is detected and an appropriate return made to step 54 if a review has been requested. At step 62 a test is made for a request to re-enter the data, and if such a request is present then a return is made to step 28 at the start of the page concerned. At step 64, a test is made to detect whether a confirmed response was received from the user. If the result is negative, then processing returns to step 56 until a valid response is received from the user.
If at step 64 a confirm signal was received, then processing proceeds to step 66 where that confirm signal is returned to the web page. At step 68 a test is made as to whether there are any more web pages to be subject to user input following the confirm signal. If more web pages need to be visited then processing proceeds to step 12. If there are no more web pages that need to be visited then the telephone call is terminated at step 70.
A functional specification of a system in accordance with another embodiment of the invention is as follows.

1. Introduction

This is the functional specification for the Vocalis SPEECHtml Alpha system.
1.1 Scope

This document contains the proposed functional solution to the SPEECHtml service. This document covers the functionality of the underlying software for the SPEECHtml service, the dialogue design ethos and the environment used.
2. System Objectives

The following section details the overall structure and objectives for the SPEECHtml system.
2.1 SPEECHtml Service - Overview

This section is included to provide an overview of the service, and is for information only. The specific functionality to be implemented in the system follows in subsequent sections.
SPEECHtml is a service that allows a client to provide a custom IVR system without the requirement for any client side hardware or software, other than access to an ISP to host the client's website.
The (client's) IVR service is defined by standard HTML pages, hosted by the client's own ISP. SPEECHtml interprets these pages in real time, providing telephony-based interaction to form a complete IVR service. The client is able to update the IVR service at any time by simply updating their HTML pages on their own host ISP.
SPEECHtml will use a combination of text-to-speech (TTS) technology and speech recognition to provide the IVR service. The definitions for how SPEECHtml will interpret standard HTML codes are given in this document.
It is intended that SPEECHtml should provide low cost access, with minimal upfront cost, to a high quality, high value IVR service.
3. Functional Description

SPEECHtml consists of 6 modules:

* Web
* Telephony system
* Core FSM
* Text to Phoneme parser
* TTS engine
* Hardware

Each of these modules has its own requirements and functional specifications, as laid out in the following sections.

3.1 Web

This section summarises the full functional specification for this module.
SPEECHtml will have 3 Web sites:

1. Administration: a site dedicated to obtaining general client information (name, address, phone number, email), service information (http address for the service pages, type of telephone number required), advanced user information (user configurable service data), and providing the customer with their administration and service details (account number, password, usage statistics, client's IVR service telephone number).
2. Reference/Help: a site dedicated to making it as easy as possible for the client to set up an IVR service using SPEECHtml. The site will contain reference services and help pages.
3. Host administration: a site dedicated to enabling the SPEECHtml service provider to administer the server (access client data, enable/disable accounts).

3.2 Telephony System

SPEECHtml will provide 30 simultaneous telephony ports. These ports will be available to any of 300 possible telephone numbers which are allocated to service clients.
Any one telephone number can be active on up to 30 ports at any time.
If all 30 ports are busy, then the next caller will receive a busy tone.
It must be possible to locate the called number on any incoming call.
3.3 Core Dialogue Engine

This is responsible for providing the interaction between the end user and the SPEECHtml service. It consists of 5 elements:

* Call flow engine
* Recognition engine
* Page grabber
* HTML parser
* TTS system

3.3.1 Call flow engine

The call flow engine is responsible for the call flow. The basic model for the engine is shown in Illustration 1, where the engine:

* picks up a call
* identifies the called number and, via a database lookup, maps this number to the relevant http address
* loops around a cycle of: get HTML page; interpret/play/recognise HTML information; perform action upon recognition (skip through the current page, or return to the start of the loop with a new http address)
* if no action is required and the end of page is encountered, then hangs up.
Illustration 1: Call flow diagram (pick up the call; identify the called number; look up the URL; grab the HTML page, playing a "Sorry there is a problem" message on failure; interpret/play the page; upon recognition, follow a new URL request or form submission, or skip to an anchor point; hang up when the end of page is reached with no action defined)

In the event that the identified called number is not in the user database then the system should exit straight to hangup without playing any prompt.
In the event that the download of a page is slow then the system plays out a user configurable text message in a repeating loop.
In the event that the page requested is not available then the service will play a user configurable error prompt and exit to hangup.
Each page will result in one of 3 possible actions:

* skip forward to a point defined as an anchor point in the current page and resume with the interpret function
* request a new page from the page grabber
* exit with no further action (to hangup)

3.3.2 Recognition

Each recognition within a page will be of a default style. For the Alpha version this will not be user configurable.
All recognitions (except numeric forms input) will be:

* phoneme based
* talkover
* wordspotting
* OOV rejection enabled
Connected digit recognition (Forms input only) will use the digit word models in an arbitrary length digit grammar. During such recognitions DTMF input will be enabled. DTMF input will be used as the default in the case where the user enters both DTMF and voice.
Recognition grammars will be generated from the text using a text-to-phoneme conversion.
Error conditions and handling:

* Silence: if the number of options is less than 5 then say "I'm sorry, your options are...", else say "I'm sorry I didn't hear that, could you repeat that".
* OOV rejection: as Silence.
* Error correction will be 2 level (active twice on any one recognition).
* Hardware errors: exit to hangup through a custom error prompt.
It will be possible to run, in real time, 16 simultaneous recognitions at the expected average vocabulary of five 2-word phrases.
3.3.3 Page grabber

This is responsible for fetching the required HTML pages from a requested URL. The process must allow for user feedback in the case where the download is slow or delayed. This feedback should not affect the download speed. The grabber will support the following requests:

* GET - connect to the requested URL and retrieve the page.
The page grabber will be passed a URL which will be checked for syntax. In the case of badly formed or invalid URLs the page grabber will play an error prompt and exit.
The page grabber will not handle relative URL references. It expects a fully formed URL and it will be the responsibility of the calling function to maintain the current page URL and handle relative references to pages. Only HTTP URLs will be supported, although the page grabber will resolve URLs to IP addresses using name lookup and support HTTP connections on non default web server ports.
The page grabber will provide a disk-caching facility for HTML pages to reduce the need to retrieve pages on multiple occasions. The page grabber will make use of the HTTP HEAD command to compare cached pages with those stored on web servers to check whether a full retrieval is required.
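For illustration only, the HEAD-based currency check might look like the following Python sketch; the real grabber's implementation is not specified here, so this is an assumption:

    import http.client
    import time

    def cache_is_current(host, path, cached_mtime):
        # Issue an HTTP HEAD request and compare Last-Modified against the
        # modification time of the cached copy (seconds since the epoch).
        conn = http.client.HTTPConnection(host, timeout=15)
        try:
            conn.request("HEAD", path)
            modified = conn.getresponse().getheader("Last-Modified")
        finally:
            conn.close()
        if modified is None:
            return False  # no header: fall back to a full GET
        # crude parse; ignores time zone subtleties, good enough for a sketch
        remote = time.mktime(time.strptime(modified, "%a, %d %b %Y %H:%M:%S %Z"))
        return remote <= cached_mtime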
* POST - send form data.
The page grabber will support the transmission of completed form details to the web server using the POST command. No checking of content will take place.
Errors will be dealt with as follows:

* Delays > 3 seconds for total download or form post: music, a default message, or nothing will be played (this option will be customer configurable). Music will play immediately after the 3 second delay and is interrupted immediately upon completion of the retrieval. For the message option, this will be played immediately after the 3 second delay, will be played once only, and will not be interrupted by completion of page retrieval.
* Delays > 15 seconds for total download or form post: play error prompt and exit to hangup.
* No such page: play error prompt and exit to hangup.
* All other errors: play error prompt and exit to hangup.

3.3.4 HTML Parser

Within the core call flow engine, the system interprets html pages, passing text to the TTS system and performing recognitions when required. This is performed using an HTML parser that interprets HTML according to the following specification.
There are 4 main elements that the parser can understand:

* General text - this will be passed to the TTS system, one sentence at a time. This will be an asynchronous operation to allow the core engine to play a sentence whilst generating the next one. This will minimise delays. Where a sentence is longer than 50 words, the parser will split the text after 50 words.
* Hypertext links - links will be interpreted according to where in a page they occur. If the current context is a paragraph then the link(s) will be stored, and at the end of the current paragraph the system will play out a list of possible options and perform a recognition on these options. If the current context is a list then the action will be as defined below. If the link is within a form then the link will be ignored.
* Lists - these are interpreted as a whole before being played out (unlike general text). If the list contains no hypertext links then the text is simply passed to the TTS system and then played out. If the list contains hypertext links then a set of keywords is generated and a recognition is performed on the possible list items. There are two main types of lists: numbered and unnumbered. Numbered lists will have numbers inserted at the start of each list item, and these numbers will also be added to the keywords available for recognition. Multiple level lists will be interpreted the same as single level lists and not as individual recognition blocks.
* Forms - these are interpreted as a series of recognitions which collate information to be passed on as a form. Each input on the form will be a separate recognition (a short sketch follows this list).
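As a sketch of the forms element (Python, for illustration only; ask is an assumed helper wrapping one prompt-and-recognise exchange, and the field definitions in the usage comment are merely examples):

    def fill_form(inputs, ask):
        # inputs: list of (field_name, allowed_words); allowed_words is None
        # for numeric fields, which use connected digit recognition instead.
        form_data = {}
        for name, words in inputs:
            if words is None:
                form_data[name] = ask(name + ".", digits=True)
            else:
                form_data[name] = ask("%s. Say %s." % (name, " or ".join(words)))
        return form_data  # collated and POSTed to the web server on submit

    # e.g. fill_form([("Customer number", None),
    #                 ("Model number", ["P5-2", "P5-3", "P6-1", "Other"])], ask)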
HTML codes that the system will understand are listed in the table below. The parser will ignore any HTML codes not defined here.

HTML code | HTML definition | Parser definition | Supported in Phase
<ul> | Unnumbered list | Play current buffered data (async) and begin a new play context. The current play context will not end until a </ul> is encountered. | Alpha
<ol> | Numbered list | Play current buffered data (async) and begin a new play context. The current play context will not end until a </ol> is encountered. | Alpha
<dl> | Unnumbered list (of definitions) | As <ul> | Alpha
<dir> | Unnumbered list | As <ul> | Alpha
<menu> | Unnumbered list | As <ul> | Alpha
<li> | New list item | If within <ol> then insert a list number, otherwise insert pause. | Alpha
<dt> | New list item (in <dl>) | Insert pause. | Alpha
<dd> | Definition of list item | Insert pause. | Alpha
<a href="...">Text</a> | Hypertext link | Add a recognition item to the recognition list (see recognition lists and recognition items), and "Text" to the current play context. | Alpha
</ul>, </ol>, </menu>, </dir> | End of list | Play the current buffered data as a talkover recognition if the list contains hyperlinks, otherwise play the context with no recognition active. | Alpha
</p> | End of paragraph | Play current play context and begin a new one. | Alpha
Blank line | Nothing | Implies end of paragraph/sentence, so treat as </p> if there is a currently active play context. | Alpha
<form ...> | Start of a FORM | See input types. | Alpha
<input ...> | An input item in the form | Do a recognition according to the type of input listed below. | Alpha
<input type=text ...> | Text input | Ignored | Ignored
<input type=numeric ...> | Not defined | Numeric entry box (arbitrary length). | Alpha
<input type=file ...> | Select a file | Ignored | Ignored
<input type=checkbox ...> | Multiple entry checkbox | Multiple level yes/no recognition of the possible checkboxes. | Alpha
<input type=radio ...> | Single entry checkbox | Word recognition of the list. Only one item is allowed to be selected. | Alpha
<input type=submit> | Send the form | Collate all the information and send the form. | Alpha
<input type=reset> | Reset all the fields to default values | Go back to the start of the form and clear all form variables. | Alpha
<input type=image ...> | "Clickable" image button | Ignored | Ignored
<input type=hidden> | Embed data that the user can not alter | Set the variable and add it to the form information. | Alpha
<textarea ...> | Multiple line text entry | Ignored | Ignored
<select ...> | Produce drop down/scrolling lists | As "radio box". | Alpha
<select multiple> | Drop down list that allows multiple selection | As "checkbox". | Alpha
<option value=??> | List item for select | Add to the recognition list. | Alpha
<pre></pre> | Use formatted text | Ignored | Ignored

3.3.5 TTS server

From the core dialogue engine, requests will be made to a speech server, which will deliver a speech file in alaw format containing speech specified by an input text file.
Full specification of this server is given in section 3.5 Text-To-Speech server.
3.4 Text-to-Phoneme parser

Given an arbitrary text string, the text-to-phoneme parser will return the phonemic transcription as an entry in a recognition structure suitable for direct download to a recogniser.
The language supported will be UK English, based on SAMPA phonemes.
The parser will have a dictionary of common words (exact dictionary yet to be decided) which will always be searched for a matching word. If the word is not in the dictionary then the parser will use basic rules to convert from text into a string of phonemes.
Any search of the dictionary must take no more than 0.1 seconds.
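For illustration only, the dictionary-then-rules behaviour might be sketched as below; the SAMPA entries and single-letter rules shown are tiny invented stand-ins for the Beep dictionary and the real letter-to-sound rules:

    # tiny illustrative stand-ins, not the real Beep dictionary or rules
    DICTIONARY = {"support": "s @ p O: t", "hardware": "h A: d w e@"}
    LETTER_RULES = {"a": "{", "e": "e", "i": "I", "o": "Q", "u": "V", "c": "k"}

    def transcribe(word):
        # Dictionary lookup first; crude letter-to-sound rules otherwise.
        word = word.lower()
        if word in DICTIONARY:
            return DICTIONARY[word]
        # naive one-letter rules; real rules handle digraphs, stress, context
        return " ".join(LETTER_RULES.get(ch, ch) for ch in word)

    print(transcribe("support"))   # dictionary hit:  s @ p O: t
    print(transcribe("fault"))     # rule fallback:   f { V l t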
3.5 Text-To-Speech server

The text-to-speech server will be able to serve 30 simultaneous channels of text to speech in real time. The server will accept the text input from an arbitrary file and deliver alaw encoded speech to an arbitrary output file.
The voice will be male English (US). As other voices become available these will be included as options.
Where an error occurs, the system will exit via a pre-synthesised error prompt. This will not be user configurable.
The current base technology for the TTS will be ACUVOICE AV2001 (Solaris X86).
This is deemed to be the highest quality synthesiser available.
3.6 Hardware

The system will comprise 3 physical hosts:

* TTS server host
* Application host
* Web host

The TTS server host will be a Dell OptiPlex GX1 333MHz running X86 Solaris. This will host the ACUVOICE TTS system, which will communicate with the Application host via a combination of socket connections (for communications) and NFS mount points (for file access). These connections will physically be made via a LAN connection between the two machines.

The Application host will be a machine of at least 200MHz capable of:

* running SCO Openserver 5.04
* hosting 2 PowerPC recogniser cards (full length ISA slots)
* hosting 1 D300SC Dialogic card (1 full length ISA slot)
* 64M RAM
* at least 2Gbyte hard drive
* LAN adapter for connection to the TTS server
* LAN adapter for connection to the Web host and the external internet connection

The Web host will be a machine capable of running the SPEECHtml web site. The operating system is not specified. This machine must be accessible as an external web server and also have a LAN connection to the Application host for transferring client database information.
The overall connectivity is shown in Illustration 2.
Illustration 2: System connectivity (Application host and TTS host connected via routers and a hub to the Web host, with a 64Mbyte pipe to the Internet)

4. Operations and Maintenance

4.1 Normal operation

4.1.1 Normal operation

The system must be accessible from any external machine that is Internet connected.
4.1.2 Performance

4.1.2.1 Minimum performance

The SPEECHtml system must run with no noticeable delays (greater than 1 sec), except for the delay inherent in downloading new pages. This delay is dependent on factors beyond the control of the SPEECHtml system.
4.1.2.2 Resource capacity

The SPEECHtml system must support 30 simultaneous phone accesses.
4.1.3 Shut-down

Shutting down any part of the system must result in no loss of customer data.
Each host may be shut down independently of the rest of the SPEECHtml system, with the following degradation of service:

* DWS - no customer data can be updated on the main application host. The system should show no degradation of service.
* Speech server - the system can no longer convert text to speech, so the applications should exit using a pre-synthesised prompt "Sorry there is a problem, please call again later".
* Application host - if the application host is shut down then the system cannot answer calls. The calling parties' calls will ring without being picked up. In the Beta stage, when more than one 30-port system is envisaged, the system should be tolerant to any one host being offline.
4.2 Security

Any database files must be protected against external access by processes or users other than those with permission. The customer database files must only be accessible to specified customers (those in possession of an account number and password), and a designated system administrator.
5. Constraints

5.1 Hardware design constraints

5.1.1 Hardware Requirements

1. A Web server running SCO UNIX which supports full external access.
2. A speech server running Solaris X86 with no external access.
3. An application host running SCO UNIX which has outgoing access to external web pages.
4. LAN connections between the SPEECHtml system host, the Web server and the external web gateway.
5. LAN connection between the SPEECHtml system host and the Speech server.

5.2 Software design constraints

5.2.1 Reliability

5.2.1.1 Accuracy

All database operations must ensure the integrity of the database.
Recognition accuracy cannot be quantified since it depends on the vocabulary supplied.
5.2.1.2 Robustness

The host machines and all SPEECHtml processes must not crash, and must be able to deal with, and report, any incorrectly entered data.
All calls must finish with the system hanging up. This must be a graceful exit via an error prompt where appropriate.
5.2.1.3 Consistency

All database operations must conform to a similar interface.
5.2.2 Reusability

All modules of SPEECHtml are to be designed as the basis for the full operational system, and as such should allow for ease of expansion.
5.2.3 Testability

5.2.3.1 Communicativeness

All modules must provide a log file system for debugging purposes.
5.2.3.2 Self-descriptiveness

The code must be clearly and accurately labelled for easy maintenance and understandability.
5.2.3.3 Structuredness

The system will be constructed in modules. This allows modules to be individually developed and tested, then incorporated in a controlled manner.
5.2.4 Efficiency

The service as a whole must operate with no noticeable delays except those inherent with the Internet interface.
As previously stated all database accesses must be performed in less than one second.
Access to the text-to-speech server must be real-time.
5.3 Code and build constraints

The coding should conform to all of the quality standards detailed for ISO 9001.
6. Definitions, acronyms and abbreviations
HTML - Hyper Text Markup Language
TTS - Text to Speech
IVR - Interactive Voice Response
LAN - Local Area Network

Claims (23)

1. Apparatus responsive to user speech signals for gathering user inputs, said apparatus comprising: a call answering device for answering an incoming speech call on a speech communication channel; a text data and control code reader responsive to said incoming speech call for accessing a page of text data and control code data corresponding to a page of text data and control data stored on a host computer and accessed via a computer network, said control code data including user input control code data for controlling at least one user input response relating to said page to be returned to said host computer via said computer network; a speech synthesiser responsive to speech data representing said text data for generating a machine speech signal that is transmitted to a user via said speech communication channel; a speech recogniser responsive to speech recognition data representing at least one spoken user input associated with said user input control code data and a user speech signal received from said user via said speech communication channel for recognising a spoken user input; and a user input relay responsive to recognition of said spoken user input by said speech recogniser for generating a corresponding user input response and returning said corresponding user input response to said host computer via said computer network.
2. Apparatus as claimed in claim 1, wherein a text data and control code reader accesses a cached version of said page of text data and control code data in response to said incoming speech call.
3. Apparatus as claimed in any one of claims 1 and 2, wherein a parser parses said page of text data and control code data to generate said speech data.
4. Apparatus as claimed in any one of claims 1 and 2, wherein a parser parses said page of text data and control code data to generate said speech recognition data.
5. Apparatus as claimed in any one of claims 3 and 4, wherein said speech data and said speech recognition data is generated prior to receiving said incoming speech call.
6. Apparatus as claimed in any one of claims 1 and 2, wherein said speech data is a recorded live speech signal.
7. Apparatus as claimed in any one of the preceding claims, wherein said page of text data and control code data is an internet web page.
8. Apparatus as claimed in any one of the preceding claims, wherein said page of text data and control code data is an html page.
9. Apparatus as claimed in any one of the preceding claims, wherein said user input control code corresponds to at least one hypertext link to another page or portion of a page of text data and control code data, said speech data includes data for generating a speech signal corresponding to said hypertext link, said recognition data includes data for recognising a speech signal corresponding to said hypertext link and said corresponding user input response is selection of said hypertext link by a user.
10. Apparatus as claimed in claim 9, wherein said user input control code corresponds to a list of a plurality of hypertext links each having corresponding speech data, recognition data and a user input response.
11. Apparatus as claimed in any one of the preceding claims, wherein said text data and said speech recognition data include text inviting a spoken user input requesting a repeat of at least a portion of said machine speech signal and a response requesting said repeat.
12. Apparatus as claimed in any one of the preceding claims, wherein said text data and said speech recognition data include text inviting a spoken user input requesting termination of said incoming speech call and a response requesting termination.
13. Apparatus as claimed in any one of the preceding claims, wherein said text data and said speech recognition data include text inviting a spoken user input requesting a review of at least a part of previously gathered user inputs during said incoming speech call and a response requesting said review.
14. Apparatus as claimed in any one of the preceding claims, wherein said text data and said speech recognition data include text inviting a spoken user input requesting confirmation of at least a part of previously gathered user inputs during said incoming speech call and a response indicating said confirmation.
15. Apparatus as claimed in any one of the preceding claims, wherein said text data and said speech recognition data include text inviting a spoken user input requesting re-entering of at least a part of previously gathered user inputs during said incoming speech call and a response requesting said re-entering.
16. Apparatus as claimed in any one of the preceding claims, wherein said user input control code corresponds to a free text field code, said speech data includes data for generating a speech signal corresponding to a prompt to enter a free text field, said speech recogniser includes a free text speech recogniser.
17. Apparatus as claimed in any one of claims 1 to 15, wherein said user input control code corresponds to a free text field code, said speech data includes data for generating a speech signal corresponding to a prompt to enter a free text field, said speech recogniser includes a phoneme recogniser and said corresponding user input response is a phoneme transcription of said spoken user input.
18. Apparatus as claimed in any one of the preceding claims, wherein said user input control code corresponds to an option list code, said speech data includes data for generating a speech signal corresponding to a prompt to enter an option list selection, said recognition data includes data for recognising a speech signal corresponding to said option list selection and said corresponding user input response is an option selection.
19. Apparatus as claimed in any one of the preceding claims, wherein said user input control code corresponds to a button selection code, said speech data includes data for generating a speech signal corresponding to a prompt to enter a button selection, said recognition data includes data for recognising a speech signal corresponding to said button selection and said corresponding user input response is a button selection.
20. Apparatus as claimed in any one of the preceding claims, wherein said call answering device answers said incoming speech call upon determination that said incoming speech call cannot be connected to a human operator.
21. A method of gathering user inputs in response to user speech signals, said method comprising the steps of:
answering with a call answering device an incoming speech call on a speech communication channel;
in response to said incoming speech call, accessing a page of text data and control code data corresponding to a page of text data and control data stored on a host computer and accessed via a computer network, said control code data including user input control code data for controlling at least one user input response relating to said page to be returned to said host computer via said computer network;
in response to said speech data representing said text data, generating a machine speech signal that is transmitted to a user via said speech communication channel;
in response to said speech recognition data representing at least one spoken user input associated with said user input control code data and a user speech signal received from said user via said speech communication channel, recognising a spoken user input; and
in response to recognition of said spoken user input, generating a corresponding user input response and returning said corresponding user input response to said host computer via said computer network.
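
Claim 21 recites the end-to-end method: answer the call, fetch a page from the host over the network, speak its text, recognise the caller's reply and return the response to the host. A compact sketch of that loop, with hypothetical answer_call, fetch_page, speak, recognise and post primitives standing in for the telephony, network, text-to-speech and recognition components, and with the page attributes assumed for illustration:

    def handle_incoming_call(answer_call, fetch_page, speak, recognise, post):
        # Illustrative flow only: each fetched page is spoken to the
        # caller, the reply is recognised against that page's vocabulary,
        # and the response is returned to the host, which may answer with
        # the next page to visit (or None to end the call).
        call = answer_call()                      # answer the speech call
        url = "http://host.example/start"         # hypothetical first page
        while url is not None:
            page = fetch_page(url)                # text + user input control codes
            speak(page.text, call)                # machine speech signal
            heard = recognise(page.vocabulary, call)
            response = page.response_for(heard)   # utterance -> input response
            url = post(url, response)             # returned to the host
        call.hang_up()
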
22. Apparatus responsive to user speech signals for gathering user inputs substantially as hereinbefore described with reference to the accompanying drawings.
23. A method of gathering user inputs in response to user speech signals substantially as hereinbefore described with reference to the accompanying drawings.
GB9821870A 1998-10-07 1998-10-07 Gathering user inputs by speech recognition Withdrawn GB2342530A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB9821870A GB2342530A (en) 1998-10-07 1998-10-07 Gathering user inputs by speech recognition

Publications (2)

Publication Number Publication Date
GB9821870D0 (en) 1998-12-02
GB2342530A (en) 2000-04-12

Family

ID=10840154

Family Applications (1)

Application Number Title Priority Date Filing Date
GB9821870A Withdrawn GB2342530A (en) 1998-10-07 1998-10-07 Gathering user inputs by speech recognition

Country Status (1)

Country Link
GB (1) GB2342530A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2307619A (en) * 1995-11-21 1997-05-28 Alexander James Pollitt Internet information access system
EP0847179A2 (en) * 1996-12-04 1998-06-10 AT&T Corp. System and method for voiced interface with hyperlinked information

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1380154A1 (en) * 2001-04-19 2004-01-14 BRITISH TELECOMMUNICATIONS public limited company Voice response system
USRE45096E1 (en) 2001-04-19 2014-08-26 British Telecommunications Public Limited Company Voice response system
WO2003061256A1 (en) * 2002-01-15 2003-07-24 Avaya Technology Corp. Method and apparatus for delivering enhanced messages to a calling party
CN100424630C (en) * 2004-03-26 2008-10-08 宏碁股份有限公司 Operation method of web page speech interface

Also Published As

Publication number Publication date
GB9821870D0 (en) 1998-12-02

Similar Documents

Publication Publication Date Title
US7286985B2 (en) Method and apparatus for preprocessing text-to-speech files in a voice XML application distribution system using industry specific, social and regional expression rules
US7609829B2 (en) Multi-platform capable inference engine and universal grammar language adapter for intelligent voice application execution
CA2280331C (en) Web-based platform for interactive voice response (ivr)
US7242752B2 (en) Behavioral adaptation engine for discerning behavioral characteristics of callers interacting with an VXML-compliant voice application
US20110106527A1 (en) Method and Apparatus for Adapting a Voice Extensible Markup Language-enabled Voice System for Natural Speech Recognition and System Response
US20180007201A1 (en) Personal Voice-Based Information Retrieval System
US7571100B2 (en) Speech recognition and speaker verification using distributed speech processing
EP1506666B1 (en) Dynamic content generation for voice messages
US7877261B1 (en) Call flow object model in a speech recognition system
US7406418B2 (en) Method and apparatus for reducing data traffic in a voice XML application distribution system through cache optimization
US7436939B1 (en) Method and system for consolidated message notification in a voice command platform
US20040153322A1 (en) Menu-based, speech actuated system with speak-ahead capability
US20040153323A1 (en) Method and system for voice activating web pages
US20050043953A1 (en) Dynamic creation of a conversational system from dialogue objects
US20090144131A1 (en) Advertising method and apparatus
JP2003015860A (en) Speech driven data selection in voice-enabled program
US20030202504A1 (en) Method of implementing a VXML application into an IP device and an IP device having VXML capability
US7054421B2 (en) Enabling legacy interactive voice response units to accept multiple forms of input
US7451086B2 (en) Method and apparatus for voice recognition
GB2342530A (en) Gathering user inputs by speech recognition
WO2007101024A2 (en) System and method for defining, synthesizing and retrieving variable field utterances from a file server
WO2000018100A9 (en) Interactive voice dialog application platform and methods for using the same
US20030091176A1 (en) Communication system and method for establishing an internet connection by means of a telephone
CA2256781A1 (en) Method and apparatus for automatically dialling a desired telephone number using speech commands
Ångström et al. Royal Institute of Technology, KTH Practical Voice over IP IMIT 2G1325

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)