WO2001004876A1 - Distributed object-oriented architecture for speech understanding - Google Patents

Distributed object-oriented architecture for speech understanding

Info

Publication number
WO2001004876A1
WO2001004876A1, PCT/GB1999/002240, GB9902240W
Authority
WO
WIPO (PCT)
Prior art keywords
data
program
speech
programs
input
Prior art date
Application number
PCT/GB1999/002240
Other languages
English (en)
Inventor
Andrew Paul Breen
Simon Downey
Original Assignee
British Telecommunications Public Limited Company
Priority date
Filing date
Publication date
Application filed by British Telecommunications Public Limited Company filed Critical British Telecommunications Public Limited Company
Priority to PCT/GB1999/002240 (WO2001004876A1)
Priority to CA002343077A (CA2343077A1)
Publication of WO2001004876A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/28 - Constructional details of speech recognition systems
    • G10L15/34 - Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing

Definitions

  • This invention relates to human-computer interaction, and particularly, but not exclusively, to human-computer interaction via a speech language interface. Description of the Background Art
  • Systems such as ViaVoice Executive (TM), available from IBM Corporation, and Dragon NaturallySpeaking, available from Dragon Systems Inc, of 320 Nevada Street, Newton, MA 02610, US, enable a computer user to command "speech enabled" applications programs on his computer, by speaking key words corresponding to menu selection items.
  • a text generator which, in essence, uses the same semantic and syntactic information as the semantic and syntactic analyser, to convert the semantic information to be conveyed to the user (for example a query for further information from the user, or the report of an action taken) into natural language text. This may, of course, be followed by text-to-speech conversion to generate speech output from the text.
  • speech recognition, speech synthesis, and natural language analysis and generation are each separate, computationally intensive, applications programs. Whilst it is certainly possible to obtain separate speech recognition, speech synthesis, and natural language processing programs and to cause them to operate sequentially, little thought has been given to optimising the way in which they interwork.
  • an object of the present invention is to provide a human computer interaction system in which different components can interwork to improve the responsiveness to a human operator.
  • the invention provides an architecture for human computer interaction using natural language processing, in which the input, analysis, processing and response functions are provided as separate software operating in parallel, and communicating asynchronously with each other.
  • the asynchronous communication utilises queues.
  • the responses of the applications are event-driven.
  • the events include timestamp data. Where additional input modes are present (for example via a GUI) it is thus possible to take these into account by comparing the timestamps with those for speech events.
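As an editorial illustration only (not part of the patent disclosure), the sketch below shows one way timestamped events from the speech channel and an additional input mode such as a GUI might be compared; the field names and the coincidence window are assumptions.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Event:
    source: str                     # e.g. "speech" or "gui" (hypothetical labels)
    payload: object                 # recognised word, mouse selection, etc.
    timestamp: float = field(default_factory=time.time)

def coinciding_gui_events(speech_event, gui_events, window=0.5):
    """Return GUI events whose timestamps fall within `window` seconds of the
    speech event, so the two input modes can be taken into account together."""
    return [g for g in gui_events
            if abs(g.timestamp - speech_event.timestamp) <= window]
```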
  • Figure 1 is a block diagram showing the components of the first embodiment of the invention
  • Figure 2 is a block diagram showing the hardware present within a user interface terminal computer of Figure 1;
  • Figure 3 is an illustrative diagram showing the layers of software present in the computers of Figures 1 and 2;
  • Figure 4 is an illustrative diagram showing in greater detail the applications software present within the embodiment of Figure 1 ;
  • Figure 5 is an illustrative diagram showing the stages of dialogue performed by the embodiment of Figure 1 ;
  • Figure 6 is an illustrative diagram showing the architecture of a Presentation Agent program forming part of the applications of Figure 4;
  • Figure 7 is an illustrative diagram showing the architecture of a Strategy Agent program forming part of the applications of Figure 4;
  • Figure 8 is an illustrative diagram showing the architecture of a speech recogniser program forming part of Figure 4.
  • Figure 9 is an illustrative diagram showing in greater detail the speech recognition application of Figure 8.
  • Figure 10 is an illustrative diagram showing the structure of a directed graph output by the speech recogniser of Figures 8 and 9;
  • Figure 11 is an illustrative diagram showing the structure of a parser program forming part of Figure 4;
  • Figure 12 is an illustrative diagram showing the structure of a semantic analysis program forming part of Figure 4.
  • Figure 13 is an illustrative diagram showing the structure of a dialogue manager program forming part of Figure 4;
  • Figure 14 is a flow diagram showing the operation of a graphical user interface program forming part of Figure 4.
  • Figure 15 is a flow diagram showing the operation of the audio interface program of Figure 4;
  • Figure 16 (comprising Figures 16a-16d) is a flow diagram showing the operation of the speech recogniser program of Figures 8 and 9;
  • Figure 17 (comprising Figures 17a and 17b) is a flow diagram showing the operation of the Presentation Agent program of Figure 6;
  • Figure 18 (comprising Figures 18a - 18c) is a flow diagram showing the operation of a Strategy Agent program of Figure 7;
  • Figure 19 is an illustrative diagram showing the communication through Object Request Brokers in an embodiment
  • Figure 20 is an illustrative diagram showing the data flow through an event channel in an embodiment
  • Figure 21 is an illustrative diagram showing an event channel in greater detail.
  • the first embodiment of the invention comprises six computers 10, 20, 30, 40, 50, 60, interconnected by a local area network (LAN) 70.
  • the first computer 10 (conveniently a multimedia personal computer, to be described in greater detail with reference to Figures 2 and 3) provides a user interface client terminal, at which the user will enter input commands and information, and the system will present results or requests for information.
  • a second computer 20 provides natural language parsing, semantic analysis, and text generation facilities. It may for example be a Sun (TM) workstation.
  • a third computer 30 operates a speech recognition program.
  • the third computer may, for example, be a Hewlett Packard (TM) UX9 or UX10 computer.
  • a fourth computer 40 synthesises output for the user using a text to speech program and a "talking head", or “avatar”, three dimensional visual display unit. It is likewise a Hewlett Packard (TM) UX9 or UX10 computer.
  • a fifth computer 50 hosts an event channel server program, which hosts and manages the event channels.
  • a sixth computer 60 comprises a database server, storing a database of information (for example an e-mail database) and a program for interrogating the database.
  • a user makes a spoken request to the client PC 10, from which the captured audio data is sent via the LAN 70 to the recogniser computer 30, which performs speech recognition.
  • the results from the recogniser 30 are sent to the parser computer 20 which parses the words recognised by the speech recogniser computer 30 to provide a parsed syntactic structure, which is analysed, and compared with possible actions which could be performed in relation to the database held on the database server computer 60.
  • the message is passed to the database server computer 60, which performs the required action on its database and, if necessary, returns response data to the parser computer 20.
  • Where the response is a message to the user, the parser computer 20 generates corresponding text, and passes this as a text file to the output synthesis computer 40. Likewise, if the analysed input semantics could not be resolved into an action for the database server computer 60, the parser computer 20 generates text to request re-input by the user, or queries for the required additional information. The output synthesis computer 40 performs text-to-speech conversion to generate an audio file which is passed to the client PC 10 for playback to the user.
  • a "talking head" three dimensional image of a speaking human head is generated, the lips of which move in synchronisation with the output audio, and the image is sent as a video file to the client computer 10 via the LAN 70 for display thereon.
  • the user interface terminal 10 comprises conventional components such as a processor 13 (for example a Pentium III Processor) and associated read only and random access memory and disk drives (shown as store 18) for holding programs for execution by the processor 13.
  • Also provided are a microphone 11, a loudspeaker 12, a VDU 15, a keyboard 16, and a mouse (or other cursor control device) 17.
  • An audio interface such as a sound card 19 provides analogue/digital and digital/analogue conversion; storage of audio files; and other conventional processes. It may, for example, be a Sound Blaster (TM) audiocard.
  • Figure 3 indicates the software present in each of the computers 10-60.
  • Each computer includes conventional operating system and networking software 300.
  • the operating system may be Microsoft Windows NT on the terminal 10, and the same or Unix, Xenix or Linux on the other computers 10-60, and the networking software may be Novell Netware.
  • Communicating through the networking software is a so-called “middleware" program layer 200 on each of the computers 10-60.
  • the function of the "middleware” is to allow different programs on different computers to share information and call functions on each other.
  • the "middleware" layer 200 is provided by Object Request Brokers (ORBs) consistent with the Common Object Request Broker Architecture (CORBA), a protocol developed by the Object Management Group which is described in the CORBA/IIOP specification 2.2 (available at http://www.omg.org/library/).
  • one Object Request Broker which is suitable for the present application is Orbix 2.3, available from Iona Technologies.
  • CORBA allows component objects on different platforms to communicate with each other in a distributed processing system.
  • Programs in the present embodiment reside under the CORBA platform and communicate with each other through object request brokers (ORBs) using standard interfaces, as shown in Figure 19.
  • CORBA allows remote invocation of methods (i.e. subroutines) on other computers in a distributed network. Normally, such calls are synchronous; that is, the calling program awaits the response from the called program.
  • Event channels are an additional mechanism provided by CORBA (see the OMG CORBA Event Service Specification, available at http://www.omg.org/library/, incorporated herein by reference), which allow two programs to communicate in a de-coupled fashion, by providing queues to which data can be written by one program at one time, and read by another at a later time. Event Channels
  • One suitable event channel implementation is OrbixTalk V1.1, available from Iona Technologies and described in the OrbixTalk V1.1 Reference Manual, May 1997; but where higher speeds are required, other software is preferred, to provide real-time event channels capable of handling data at a steady flow rate. Such real-time event channels were being harmonised by the Object Management Group at the time of writing (and it is anticipated that a standardised event channel will be preferred).
  • One commercially available real-time (RT) event service is the TAO service developed by Washington University, described in "Using the Real-Time Event Service", Chris Gill, Tim Harrison and Carlos O'Ryan, at http://www.cs.wustl.edu/~schmidt/events_tutorial.html, and in "The Design and Performance of a Real-Time CORBA Object Event Service", by Timothy H Harrison, David L Levine and Douglas C Schmidt, published in Proceedings of OOPSLA '97 (ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages and Applications), October 5-9 1997, also available at http://www.cs.wustl.edu/~schmidt/oopsla.html.
  • the middleware software 200 provides at least one event channel 116 for each applications program in the application program layer 400.
  • Each applications program may be written in a different computing language, but consistency is provided by providing each application with an interface written in a common Interface Description Language (IDL), which specifies the functions or methods provided by each application, and the format of data required as input or output from each such method.
  • objects that generate events are called suppliers and objects that receive them are called consumers.
  • Suppliers can either supply events to a specific consumer via its event channel (the push model), or consumers can ask the specific supplier to generate events (the pull model). Since an event channel is itself an object, a single event channel can simultaneously have more than one supplier and/or more than one consumer. Programs may poll their event channels to determine whether new data has arrived, and may then pull data from the channel, as shown in Figure 20.
  • the events of the present invention are, typically, a packet of audio data; directed graph data output from the recogniser; a marked-up text phrase; or a database query. They therefore often include substantial amounts of data. Some messages which report, or instruct, for example a change of state of a program are also carried as events.
  • an event channel service (ECS) which provides a combination of the push supplier and pull consumer models, as shown in Figure 21.
  • a push supplier generates events and sends them to a proxy push consumer, which is actually part of the EC itself.
  • a pull consumer makes requests for data to a proxy pull supplier which is again part of the EC.
  • the EC acts as a proxy push consumer and a proxy pull supplier.
  • the real push suppliers and pull consumers are bound to the EC and only see the interfaces provided by the EC's proxy objects.
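A minimal sketch of the de-coupled queue arrangement described above, written in Python rather than CORBA IDL; the class and method names are illustrative, not the patent's or OMG's actual interfaces. The push supplier sees only the channel's proxy push consumer interface, and the pull consumer sees only its proxy pull supplier interface.

```python
from collections import deque

class EventChannel:
    """A de-coupled queue: suppliers push events onto it at one time,
    consumers pull them from it at a later time."""

    def __init__(self):
        self._queue = deque()

    # Interface seen by push suppliers (the channel acting as proxy push consumer).
    def push(self, event):
        self._queue.append(event)

    # Interface seen by pull consumers (the channel acting as proxy pull supplier).
    def try_pull(self):
        return self._queue.popleft() if self._queue else None

# Supplier and consumer never reference each other directly:
channel = EventChannel()
channel.push({"type": "audio", "data": b"\x00\x01"})  # push supplier side
event = channel.try_pull()                            # pull consumer side (polling)
```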
  • the IDL interface code for each application is arranged to read messages containing such data from, and write messages containing such data to, the event channel or channels 116 for the application to which it interfaces. It is also arranged to accept messages invoking its functions or methods and to return messages in reply.
  • the IDL interface code provides a common "wrapper" for each application in the application layer 400, enabling it to communicate with other applications via the event channels, or procedure calls. Other messages are also used for some communications between programs.
  • Another type of message here is, typically, a call to a function of another program (on the same or a remote computer), which is handled through the ORBs. Such messages are used in the present embodiments where immediate, rather than deferred or asynchronous, execution is desired.
  • the software architecture of Figure 2 is present in each of the computers 10-60.
  • The applications layer 400 comprises three co-ordinating agent programs: a "Presentation Agent" program 110; a "Strategy Agent" program 112; and an "Information Agent" program 114.
  • Each of these has some role in controlling other programs, and each acts as a communications buffer, so that most communications between other applications pass through the co-ordinating program 110-114 concerned.
  • The term "agent" as used herein does not necessarily imply either a program which modifies its behaviour with experience, or one which is personalised for a user, and implies nothing about the structure or function of the programs it describes.
  • service programs 122-140 are provided, each comprising an IDL code interface (the function of which will be discussed in detail) and an application.
  • Presentation Agent Program 110 - Overview: In general terms, the Presentation Agent program controls the receipt of data or commands from the human user, and the return of video or audio information to the user.
  • the Presentation Agent program 110 co-ordinates, and acts as a communications interface to, the graphical user interface 120 (for example Microsoft Windows or X Windows) forming part of the operating system layer 300 of the user interface terminal 10.
  • the Presentation Agent program 110 is located on any convenient computer; it may be conveniently co-located with some of the programs it controls, and/or with the other agent programs, to reduce network traffic over the LAN 70 and reduce latency.
  • the Information Agent program 114 acts as an interface to the databases present; in this case the e-mail database, and additional database information services (not shown). It issues commands to the e-mail database and receives data from the e-mail database. In this case, the Information Agent program 114 acts as an IMAP e-mail client program. Where several different databases are present, the Information Agent interacts with each.
  • Strategy Agent Program 112
  • the Strategy Agent program controls the programs which try to format and interpret the user commands.
  • the parser program 130 attempts to parse speech from the speech recogniser program 124 using a grammar of the language concerned, and the semantics program 132 attempts to analyse the "meaning" of the speech and to compare the parsed input with possible tasks to be performed.
  • the dialogue manager program 134 maintains a memory of the previous recent inputs from the user (speech, text or other) and uses these, together with the current input from the user, to resolve ambiguities (such as the meaning of pronouns). It makes use of the semantic information used by the semantic analysis program 132, to generate semantic structures requesting further information where necessary, or reporting results.
  • the text generation program 136 applies the syntactic (i.e. grammar) data used by the parser program 130 to generate natural language text from the semantics produced by the dialogue manager program 134.
  • the programs communicate via five event channels; an event channel 142, 144, 146 for each of the three agent programs 110, 112, 114 respectively; an event channel 148 for the speech recogniser program 124; and an event channel 150 for the audio user interface program 122.
  • Each event channel is a first-in-first-out queue of events; i.e., a stack of memory locations. Data is placed on the event channel in chronological order, together with a "timestamp" (i.e. data recording the time of the event).
  • Each of the programs can read the event data from its own event channel. Each can also write data in the form of an event to the event channel of another program.
  • the format of the event data for each program is defined by the interface description of that program.
  • the speech synthesis program 126 is provided with two event queues; an input event channel 152 at which text is received, and an output event channel 154 at which speech is supplied, to be read by the audio interface program 122.
  • the user inputs a speech phrase "list new messages”.
  • the audio is processed by the recogniser application 124, the output of which is a directed graph (to be described in greater detail below) comprising possible sequences of recognised words together with the recognition "score” (a measure of similarity between each portion of the input audio and a stored word).
  • the directed graph is passed to the parser program 130, which applies grammar (syntax) rules to determine which of the word sequences making up a directed graph is grammatically appropriate, and outputs syntactic data listing the syntactic categories (e.g. verb, noun, noun-phrase and so on) detected in the directed graph, together with the values corresponding to each syntactic category (in this case, "list", "new" and "message").
  • the syntactic data is passed to the semantic analysis program 132, which looks up the meanings of the input syntactic data and compares these meanings with operations which can be performed by the information service (e.g. the e-mail database). In this case, the syntactic data is determined to mean a command to list the new e-mail messages.
  • the command is passed to the Information Agent program 114, which translates it into a procedure call to the e-mail database application 140, which is executed to cause display on the user interface of the new e-mail messages.
  • the number of new messages is returned to the Information Agent program 114 (in this case 3), which in turn forwards the data to the Strategy Agent program 112.
  • the Strategy Agent program 112 supplies the data returned from the Information Agent program 114, together with the semantic structure output from the semantic analyser 132, to the dialogue generator program 134.
  • the dialogue generator 134 then generates the speech dialogue to report the existence of the three new messages, generating a semantic structure (in the same format as the output of the semantic analyser program 132) reporting the results from the Information Agent program 114. This output semantic structure is passed to the text generation program 136.
  • Figure 6 shows the information flow architecture for the Presentation Agent program 110.
  • the Presentation Agent program 110 receives messages, data and other events on the Presentation Agent queue (PAQ) 142. It supplies data and other events to the Strategy Agent queue (SAQ) 144.
  • the Presentation Agent program 110 is arranged to send messages to, and receive messages from, the graphical user interface (GUI) program 120; the audio interface program 122; the speech recogniser program 124; the text-to-speech program 126; and, optionally (not shown in Figure 4), an audio player program 127 arranged to replay stored audio sequences; for example, a CD player. This is used for standard announcement noises or messages.
  • the audio interface program 122 is arranged to output audio data onto the speech recogniser event channel 148 and to its own audio event channel 150, which may be used by other audio processing applications (not shown or used in this embodiment).
  • the speech recogniser program 124 is arranged to supply its output data to the input event channel 142 of the Presentation Agent program 110.
  • the Presentation Agent program 110 is arranged to supply text data to the input event channel 152 of the speech synthesis program 126, the output audio data of which is pushed onto the speech synthesis program output event channel 154.
  • the audio interface program 122 is arranged to poll the output event queue 154 of the speech synthesiser program 126.
  • the GUI program 120 has an event channel (not shown) to which the Presentation Agent program 110 is arranged to write events.
  • the Strategy Agent program 112 is arranged to read the Strategy Agent queue 144 and to write data or other events to the Presentation Agent queue 142 and the Information Agent queue 146.
  • the Strategy Agent program 112 is arranged to communicate with, and control the operation of, the parser program 130; the semantics program 132; the dialogue manager program 134; and the text generation program 136, by passing messages to and receiving messages from each, so as to invoke methods (i.e. call sub-routines) within the service programs 130-136 and receive data and replies therefrom where necessary.
  • the Information Agent program 114 is arranged to receive data or other events on its event queue 146 and to write data to the Strategy Agent queue 144.
  • It is arranged, in response to the receipt of data representing a formatted query from the Strategy Agent on its input queue 146, to execute a corresponding call to the relevant information program 140, invoking a desired method and supplying query data. It is likewise arranged to receive a return message from the information service program 140 and generate, in response, data which is written to the Strategy Agent queue 144. Speech Recogniser Program 124
  • Figure 8 shows the information flow of the speech recogniser program 124.
  • the program comprises a recogniser server program 124a, comprising the IDL interface code for communicating with the queues 148 and 142, and a speech recognition engine 124b.
  • Figure 9 shows the structure of the speech recogniser engine 124b. It comprises a database 31 of templates, or models, corresponding to words or parts of words; a comparison engine 32 arranged to compare audio data with the stored templates (for example using a Hidden Markov Model (HMM) comparison process); and a register 33 storing an indication of the current recogniser state, which may be idle (after having recognised the end of an uttered phrase); ready to receive data; recognising (whilst receiving data); or waiting (having finished receiving data and still recognising). Audio data (a file of audio samples) is received from the speech recogniser queue 148, and recognition output data is supplied to the Presentation Agent queue 142.
  • the recognition output data is in the form of directed graph data.
  • Speech recognisers generally compare a segment of audio data with each template in the store 31, and allocate each a "score" indicating the degree of correlation between the audio segment and the word corresponding to that template.
  • the templates are then ranked according to score and the word with the highest value score is selected as having been recognised.
  • the second highest, third highest and so on are also retained, normally for purposes of later correction by a user.
  • The score of the next word recognised may depend on that of the preceding word. For example, if two words are elided, the first may be misrecognised by the inclusion of the starting portion of the second, which will lead to a lower score for the correct second word.
  • Figure 10 shows the recognition output for the input question, "Do I have any e-mails from Bob?".
  • the recogniser accordingly outputs two possible words ("Do” or "You") each with an associated score or correlation value.
  • the recogniser outputs two possibilities ("My” or "I”), each associated with two scores, indicating their likelihood of having occurred following either one of the first words.
  • Each word node in the directed graph output of the speech recogniser comprises timestamp information (the start and endpoint times of the word in the utterance), recognition score, and overall score along each path.
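The directed graph of Figure 10 can be pictured as a word lattice. The sketch below is a hypothetical representation of its nodes (the patent does not specify a data format), with per-word timestamps and scores; summing the per-word scores to obtain an overall path score is likewise an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class WordNode:
    word: str
    start: float                 # start time of the word within the utterance (s)
    end: float                   # end time of the word within the utterance (s)
    score: float                 # recognition score for this word
    followers: list = field(default_factory=list)   # alternative next words

def path_score(path):
    """Overall score along one path through the directed graph (assumed additive)."""
    return sum(node.score for node in path)

# Two alternatives for the first word ("Do" or "You"), as in Figure 10:
i_node  = WordNode("I",  0.25, 0.40, 0.8)
my_node = WordNode("My", 0.25, 0.40, 0.5)
do_node  = WordNode("Do",  0.0, 0.25, 0.9, [i_node, my_node])
you_node = WordNode("You", 0.0, 0.25, 0.4, [i_node, my_node])
```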
  • the recogniser can also receive command messages to change its state between "idle”, “ready”, “recognising”, and "waiting".
  • the audio user interface is arranged to provide two functions; respectively to record audio data and to play audio data.
  • the audio recording function is arranged, on commencement, to receive audio data from the microphone 11; digitise it; and supply it to the recogniser event channel 148, together with timestamp data indicating, for each frame of audio data supplied to the event channel as an event, the start and endpoint times of the frame. It can be switched on and off under the control of the Presentation Agent program 110.
  • the audio reproduction function receives audio samples, converts them to analogue and supplies them to the loudspeaker 12. Reproduction is performed when data is present on the speech synthesis output event channel 154. When no data is available on that event channel, the reproduction function is set to idle mode.
  • the recording function is given priority over the reproduction function. Accordingly, if the recording function is in the "recording" (i.e. active) state, the presence of data on the speech synthesis output queue 154 does not cause the reproduction function to become active.
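A sketch of the record-over-playback priority rule as a three-state machine; the state names follow the description above, but the code itself is an editorial illustration, not the patent's implementation.

```python
class AudioInterface:
    """States: 'idle', 'playing', 'recording'. Recording has priority."""

    def __init__(self):
        self.state = "idle"

    def on_synthesis_data_available(self):
        # Data on the speech synthesis output queue starts (or continues)
        # playback only if we are not currently recording.
        if self.state in ("idle", "playing"):
            self.state = "playing"
        # if recording, the data is simply left on the queue

    def toggle_recording(self):
        if self.state == "recording":
            self.state = "idle"
        else:
            self.state = "recording"   # pre-empts playback as well as idle
```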
  • the parser program 130 comprises a parser server program 130a comprising IDL interface code for interfacing with other programs, and a parser engine 130b.
  • the parser engine 130b comprises a database 21 comprising a lexicon (i.e. word list) indicating the syntactic category (i.e. noun, verb, and so on) of each word, and a database of syntax or grammar rules specifying the orders in which such syntactic categories can validly occur in a given language.
  • the parser of this embodiment uses network minimisation (as described in, for example, Mehryar Mohri; "Finite-State Transducers in Language and Speech Processing"; Computational Linguistics; vol 23, no 2, pp 269-312 (June 1997)) and chart parsing (as described in, for example, Martin Kay; "Algorithm schemata and data structures in syntactic processing"; in B.J. Grosz, K. Sparck Jones, B. Webber (eds), "Readings in Natural Language Processing"; pp 35-70; Morgan Kaufmann (1986)) techniques, both of which are well known in the art.
  • the parser program is conveniently written in the artificial intelligence computer language PROLOG, with the addition of an interface, in interface description language (IDL).
  • the lexicon database 21 may contain all words in a given language, but it is preferable to restrict the words to those likely to occur in connection with the database to be interrogated, to reduce the processing time.
  • the parser evaluates each path through the graph to determine whether any complete path is grammatical according to the syntax rules. Where several possible paths are grammatical, then the scores for the alternative words are taken into account, together with other factors including the relative frequency of occurrence of each word, and/or grammar rule which it satisfies, in the language (a priori probability), to select the likeliest path by multiplying score by a priori probability, and/or syntactic accuracy.
  • the parser may be able to select the path including the lower probability correct word if this results in a more grammatical utterance, and/or is a more commonly occurring word. If no path from start to end is found, then the longest path is selected, taking into account the acoustic scores (for example, low scoring words at the beginning or end of the utterance may simply indicate misrecognised noises, which can be ignored by the parser in an otherwise complete parse of an utterance).
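A sketch of the path-selection idea: acoustic scores are combined with a priori word probabilities (multiplication is assumed, as the text suggests), and grammaticality checking is reduced to a stub standing in for the full chart parser. All names are illustrative.

```python
def best_path(paths, acoustic_score, prior, is_grammatical):
    """paths: candidate word sequences through the directed graph.
    acoustic_score(word) and prior(word) return per-word values;
    is_grammatical(path) stands in for the chart parser's syntax check."""
    # If no path is grammatical, fall back to considering all candidates.
    candidates = [p for p in paths if is_grammatical(p)] or paths

    def combined(path):
        total = 1.0
        for word in path:
            total *= acoustic_score(word) * prior(word)
        return total

    return max(candidates, key=combined)
```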
  • the parser may also (as described below) receive text typed in by a user, in which case there is no need to select a path as only a single path exists.
  • the output of the parser is therefore a string of text, identifying the syntactic categories present in the input utterance, and the values of those categories (e.g. a noun is present, the value of the noun is "e-mail"). Associated with each syntactic category entry is the timestamp data from the speech recogniser for the corresponding word.
  • the semantic analyser program 132 comprises a server program 132a comprising IDL interface code for interfacing with other programs, and a semantic analyser 132b.
  • the semantic analyser program is conveniently written in the artificial intelligence computer language PROLOG, with the addition of an interface, in interface description language (IDL).
  • the semantic analyser program 132 comprises a semantic lexicon database 23 storing words and phrases, and linking each to a meaning, so as to map similar words onto the same meaning. Also provided is a semantic analysis engine 24, arranged to read the syntactic data, and to look up each word in the semantic lexicon, and to output corresponding semantic information.
  • semantic output information is represented by a data structure consisting of nestable feature-value pairs in the format [feature:value].
  • the utterance 'List new emails' may be represented as [command:list, object:[type:email, number:plural, status:new]].
  • This semantic representation is also used by the dialogue manager and text generation services.
  • "nestable" indicates that a value such as "object" can be made up of further values.
  • Associated with each semantic entry is the timestamp data from the speech recogniser for the corresponding word.
  • the entries in the semantic lexicon for the phrases "Have I got any " (e.g. messages), "How many do I have", "list ", are all mapped onto the meaning entry for the single command "list” in the database, and the semantics analysis engine 24 is arranged therefore to replace the occurrence of any of these in the syntactic data by the command "list”, and to determine the object of the command (e.g. messages) and identify this.
  • the semantics analysis program outputs an indication of the command, and the fact that it refers to a plural object, the identity of which is for resolution by the dialogue manager program as described below.
  • Words are represented by codes referring to the corresponding meaning entry in the semantic lexicon (i.e. tokenised) in the semantic data output by the analyser. Any words representing numbers are replaced by the numeric values.
  • the semantics service uses the utterances to build a semantic representation of the best possible input, which it represents as a set of feature-value pairs (e.g. feature "number" could have value "plural").
  • a set of task-specific semantic features are defined (e.g. "command”, "sender"), which may be present in any input, and which have specific rules for filling their values. These rules are applied to every input, making this an information-seeking approach to semantics, looking for specific information in the input.
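The nestable feature:value structure maps naturally onto nested dictionaries. The sketch below reproduces the 'List new emails' example from the text and adds a toy information-seeking rule set; the rules themselves are invented for illustration.

```python
# Semantic representation of "List new emails" as nestable feature:value pairs.
semantics = {
    "command": "list",
    "object": {"type": "email", "number": "plural", "status": "new"},
}

# Information-seeking approach: task-specific features, each with a rule for
# filling its value, applied to every input (rules here are illustrative only).
FEATURE_RULES = {
    "command": lambda words: "list" if {"list", "show"} & set(words) else None,
    "sender":  lambda words: words[words.index("from") + 1] if "from" in words else None,
}

def extract_features(words):
    return {name: rule(words) for name, rule in FEATURE_RULES.items() if rule(words)}

print(extract_features("do i have any emails from bob".split()))   # {'sender': 'bob'}
```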
  • Dialogue Manager Program 134: The role of the dialogue service is to evaluate the meaning of the utterance in the context of the overall dialogue between the user and the computer.
  • the dialogue is modelled in an object hierarchy, so that the characteristics of general dialogue acts (e.g. "actions”), may be inherited by more specific dialogue acts (e.g. "email deletion”), thus capturing generality, and maximising code sharing and reuse.
  • the current dialogue state may require the dialogue service to query one of the information services to get information about the user's email, or perform some action.
  • the dialogue manager program comprises a short term store 82, a semantic database 84 (equivalent to that 23 of the semantics engine 132b) and a logical processing program 86.
  • the short term store 82 is arranged to store previous inputs from the user (received via the Presentation Agent program 110). These comprise previous semantics output by the semantic analysis program 132 and previous data received from the graphical user interface 120 (in the form of an indication of which on-screen display objects have been selected), together with, in each case, timestamp data indicating the time at which the input was received via the graphical or audio user interface.
  • the provision of the short term memory and time stamping information enables the dialogue manager program 134 to perform resolution of ellipsis or ambiguity in the user speech input. For example, having previously requested a list of new e-mails, a user may say "print them".
  • the dialogue process 86 accesses the short-term store 82 and reads the records of the most recent inputs, utilising the timestamp data to work backwards in time from the present.
  • the objects described in each previous user input are compared with the ambiguous or elliptical term, and the most recent match is determined.
  • the single selected e-mail does not correspond to the plural form "them" and the preceding multiple selection of new e-mails is therefore selected as the objects referred to by the "them".
  • the dialogue manager program 134 alters the semantic data received from the semantic analysis program 132, to replace the elliptical or ambiguous term by the term retrieved from the short term store 82.
  • the dialogue manager program also receives, from the Strategy Agent program, any non-speech input events such as mouse clicks, each with an associated timestamp field.
  • the timestamp fields of the non-speech inputs are compared with those of words of the semantic structure, and if there is coincidence, the dialogue manager determines whether the element of the semantic structure is more meaningful if combined with the object pointed to by the mouse. For example, the command "Move this here” is ambiguous, but if an object was pointed to within, or closely before or after, "this” and a location was pointed to within, or closely before or after, “here”, then the ambiguity is resolved by the dialogue manager.
  • non-speech events may be associated with words if they fall between the end point of the word before and the start point of the word afterwards.
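A minimal sketch of the timestamp-based matching performed by the dialogue manager: associating a non-speech event (such as a mouse click) with the spoken word whose start and end times bracket it, and resolving a plural pronoun against the most recent plural entry in the short term store. The store format and function names are assumptions.

```python
def word_for_click(click_time, words):
    """words: list of dicts with 'word', 'start' and 'end' timestamps.
    A non-speech event is associated with the word whose interval contains it."""
    for w in words:
        if w["start"] <= click_time <= w["end"]:
            return w
    return None   # falls between words; could instead attach to the nearest word

def resolve_plural_pronoun(short_term_store):
    """Work backwards in time and return the most recent plural referent."""
    for entry in sorted(short_term_store, key=lambda e: e["timestamp"], reverse=True):
        if entry.get("number") == "plural":
            return entry["referent"]
    return None
```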
  • the store can also be used to guide the user towards a coherent dialogue. For instance, User - "delete the e-mail from Ed"
  • a database of available commands 88 is provided.
  • the disambiguated semantic structure is compared with each of the available commands in the or each command database 88, using the semantic lexicon database 84 to determine synonyms of the available commands. Where the input semantic structure matches a command and its arguments exactly, that command is returned to the Strategy Agent program 112.
  • the dialogue manager program is arranged to generate semantics for an error message. For example, where an argument is missing from an otherwise matched command, the semantics specifies a query and the missing argument.
  • the dialogue manager does not await a response from the databases before being able to accept new input from the semantics analysis program.
  • the dialogue manager may have generated a further request to the same, or a different, database and may even have received a reply therefrom, before receiving a reply to an earlier user query.
  • the dialogue manager program is able to match each reply to the corresponding request, using the short term store 82, using a code sent with the query and returned with the results.
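Because the dialogue manager does not block while waiting for database replies, each outgoing query can carry a correlation code that is echoed back in the reply. A minimal sketch of that bookkeeping, with hypothetical names:

```python
import itertools

class PendingQueries:
    """Match asynchronous replies back to the queries that caused them."""

    def __init__(self):
        self._codes = itertools.count(1)
        self._pending = {}            # code -> original query / short-term-store entry

    def send(self, query, dispatch):
        code = next(self._codes)
        self._pending[code] = query
        dispatch({"code": code, "query": query})   # e.g. push onto the Information Agent queue
        return code

    def on_reply(self, reply):
        """reply: dict containing the returned 'code' and the 'result' data."""
        query = self._pending.pop(reply["code"])
        return query, reply["result"]
```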
  • the dialogue manager is arranged to generate a message (for audio or GUI presentation) saying "The information on XXX you required is now available - would you like to hear it?", rather than reproducing the results immediately.
  • the dialogue manager program generates verbal answers from the data received from the Information Agent program. This is performed by matching the received data (for example "three new e-mails") to one of the inputs in the short term memory 82.
  • the question or command semantics can then be inverted (e.g. "How many do I have" becomes "You have") and combined with the data received from the Information Agent ("three"), to generate a semantic structure representing the response to the user.
  • An additional flag indicating emphasis is attached to the semantics for the new information received from the Information Agent program 114.
  • the text generation program 136 utilises the same lexicon and syntactic rules as the parser program described above (for example, stored as a Definite Clause Grammar (DCG)), to create grammatical output text from the semantic structure it receives from the Strategy Agent program.
  • Such text generation is, in general, well known, and simply reverses the operation of the parser program, without the need for chart parsing.
  • the text generation program adds SGML tags before and after the words to be emphasised by the speech synthesiser.
  • the text is marked-up in SGML to contain prosodic elements, such as emphasis of elements of new information, which is passed to the speech synthesis service for presentation to the user.
  • Markup of text for speech synthesis is described in Slott, J. M. 1996, "A Generalised Platform and Markup Language for Text to Speech Synthesis", Ph.D. diss., Dept of Electrical Engineering and Computer Science, MIT; Taylor, P., Isard, A. 1997, "SSML: A Speech Synthesis Markup Language", Speech Communication 21: 123-133; and Sproat, R., Taylor, P., Tanenblatt, M., Isard, A. 1997, "A Markup Language for Text-to-Speech Synthesis", Eurospeech '97: 4, incorporated herein by reference. Text-to-speech (speech synthesiser) Program 126
  • Text to speech programs are well known.
  • the text-to-speech engine is British Telecom's Laureate expressive engine, described in the following three documents which are incorporated herein by reference: "Overview of Current Text-to-Speech Techniques: Part I - Text and Linguistic Analysis" (Edgington M., Lowry A., Jackson P., Breen A.P. & Minnis S., BT Technology Journal, Vol 14, No 1, pp 68-83, January 1996); "Overview of Current Text-to-Speech Techniques: Part II - Prosody and Speech Generation" (Edgington M., Lowry A., Jackson P., Breen A.P.
  • The GUI program 120 can display text or images, and receive from the native GUI an indication of objects selected on the screen by the user. Additionally, it will both input and display text, typed on the keyboard 16 by the user and terminated by the carriage return or other termination signals. Each such input causes an event in the native GUI which is signalled to the Presentation Agent program by the GUI program. Event Channel Server Program
  • An event channel (EC) server program forming part of the middleware layer 200 is provided on the computer 50.
  • the event channel server program monitors the event channels.
  • the server provides two particular features: event caching and event throttling.
  • Event throttling involves adapting the EC to the particular type of event data being transmitted from one component to another.
  • the EC provides a plurality of push and pull functions that are optimised for different events according to their data type. For example, an event consisting of audio data is relatively large but needs to be passed quickly from one component to another. Specific push and pull functions are provided for transmitting audio data.
  • each EC maintains a queue or cache of data to be processed by its agent or service. The EC monitors the length of the queue to ensure that no queue is overflowing. Event queues specify a maximum buffer length of data held on the EC. Once this maximum is reached, a component attempting to push a new event or message onto the queue will be informed that it is full. The component may then request a callback from the EC once the queue has space, or it may every so often retry pushing the data itself.
  • a callback can also be set to trigger on a queue becoming, or ceasing to be, empty, to enable a pull client program to avoid continual polling (although in this embodiment, continual polling is in fact performed).
  • a minimum length of data that may be pulled from the EC is also specified for each EC.
  • a client program attempting to pull an event or message from its EC queue will be informed if there is insufficient or no data on the queue.
  • the maximum and minimum buffer lengths are dynamic variables and may be varied depending upon network and processor load, for example. Monitoring queue lengths and setting appropriate maxima and minima helps to maintain overall synchronisation in the system and ensure that no process is getting too far ahead or lagging too far behind.
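A sketch of the throttling behaviour: a bounded queue that reports "full" to a pushing component, supports a callback once space becomes available, and refuses pulls when there is less than a minimum amount of data. The thresholds and method names are illustrative, not those of any CORBA product.

```python
from collections import deque

class ThrottledChannel:
    def __init__(self, max_len=100, min_pull=1):
        self._queue = deque()
        self.max_len = max_len        # maximum buffer length (may be varied with load)
        self.min_pull = min_pull      # minimum amount of data that may be pulled
        self._space_callbacks = []

    def push(self, event):
        if len(self._queue) >= self.max_len:
            return False              # caller is informed that the queue is full
        self._queue.append(event)
        return True

    def request_callback_on_space(self, fn):
        self._space_callbacks.append(fn)

    def pull(self, n=1):
        if n < self.min_pull or len(self._queue) < n:
            return None               # insufficient or no data on the queue
        items = [self._queue.popleft() for _ in range(n)]
        for fn in self._space_callbacks:  # queue now has space again
            fn()
        self._space_callbacks.clear()
        return items
```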
  • each event has a priority field specifying high priority (e.g. a command) or low priority.
  • Normally, events are pulled from a queue in First-In-First-Out (FIFO) order.
  • the event channel server program is arranged, if a queue length exceeds a predetermined length, to pull events in priority order so as to avoid excessively long delays for important events.
  • the timestamps of remaining events in the queue are examined, and for those whose timestamps exceed a predetermined time in the past, their priorities are increased, to avoid them being kept too long in the queue.
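A sketch of the priority and ageing rule: when the queue grows beyond a threshold, events are pulled in priority order, and events whose timestamps are too far in the past have their priority raised so they are not kept waiting indefinitely. The numeric thresholds are invented for illustration.

```python
import time

def pull_next(queue, max_fifo_len=50, max_age=2.0, now=None):
    """queue: list of dicts with 'priority' (higher = more urgent) and 'timestamp'."""
    if not queue:
        return None
    now = time.time() if now is None else now

    # Raise the priority of events that have waited too long in the queue.
    for event in queue:
        if now - event["timestamp"] > max_age:
            event["priority"] += 1

    if len(queue) <= max_fifo_len:
        chosen = queue[0]                                  # normal FIFO order
    else:
        chosen = max(queue, key=lambda e: e["priority"])   # priority order when backed up
    queue.remove(chosen)
    return chosen
```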
  • session management programs are also provided which allow a user to log in; commence a session; recognise applications programs present in the system; initialise event channels and queues; and bind each agent to the service programs it controls. Corresponding processes for clearing down a session are likewise provided.
  • Other input and output services may also be present.
  • a telephone interface (not shown) may be provided, for receiving speech from and transmitting speech to another telephony user.
  • a video synthesis application (not shown) is preferably provided together with the speech synthesis application, to provide a representation of a human head or "avatar" with lip movements synchronised to the synthesised speech, as described in US 4841575, EP 0225729, WO 97/36288, or Breen, A. P. 1996, "The Face of Talking Machines in a Multi-media World", British Telecommunications Engineering 15, or Breen.
  • the GUI is configured in a step 1002, to set up two threads of execution on the client terminal 10, each thread providing an event loop monitoring the native GUI for events.
  • the program polls the GUI event channel.
  • if an event is present, a corresponding message is sent for display on the native GUI in step 1006.
  • step 1008 in the second event loop, the graphical user interface program 120 awaits user input via the mouse 17 or keyboard 16. Where the user input indicates (e.g. by a predetermined keystroke) that the recogniser mode is to be changed, to toggle between starting and stopping recording, this is signalled as a "start recording” or "stop recording” flag on the Presentation Agent event channel 142 (step 1010), and the GUI changes state of an internal flag to indicate the new state as recording or not. Where the input is other than to change recording mode (for example, is either typed in text or a selected object on the screen) the text or the object pointer are sent to the presentation event channel in step 1012.
  • the audio interface is configured to control the sound card 19.
  • the audio interface can be in one of three states; idle, playing, or recording. Initially, it is configured in the idle state (step 1104). In step 1106, it is determined whether or not a start/stop recording message has been received from the Presentation Agent program. If so, in step 1108, it is tested whether the current state is idle and, if so, the state is changed to the recording data state in step 1110 and data recording commences.
  • step 1112 it is tested whether the state is already the recording state and, if so, in step 1114, recording ceases and the state is set to idle.
  • if in step 1112 the state is not recording, then in step 1116 the program determines whether the state is set as playing (reproducing audio data). If so, playing is ceased and recording is commenced in step 1118. If the state is none of the above, then the audio interface is in an unknown state and cannot function; accordingly, an error is signalled.
  • control returns to step 1106.
  • If data is present in step 1120, then in step 1122 it is determined whether the state is set to recording. If so, no action is taken to remove data from the text-to-speech queue and control returns to step 1106. If not, then in step 1124 it is determined whether the current state is idle or playing and, if so, the audio data is pulled from the text-to-speech output event channel 154 and reproduced by the sound card 19. If the state is none of the above then, again, an error state is signalled and processing ceases.
  • the audio interface thus effectively operates two event loops; one polling for audio data to play, and playing it if the mode is other than record mode, and the other waiting to change mode between recording mode and idle mode.
  • Speech Synthesiser Program 126
  • the speech synthesiser, on starting, polls its input event channel 152 for SGML text; pulls any text present from the event channel; and causes the speech synthesiser engine to create an audio (and optionally a video) output sequence. It also polls the speech synthesiser engine for available synthesis data, and pushes the audio data, as it becomes available, onto the speech synthesis output event channel 154.
  • Speech Recogniser Program 124 Referring to Figure 16 (comprising Figures 16a to 16d) the operation of the speech recogniser program will now be described.
  • the recogniser program is configured and commences operation.
  • the recogniser polls its event channel 148. If data is present, then in step 1206 the recogniser determines its current status. If its current status is "ready”, then in step 1208 the state is changed to commence recognition and the data is pulled from the channel and fed to recognition engine 124b.
  • step 1210 it is determined whether the recogniser is already in the recognising state and, if so, then in step 1212 the new audio data is sent to the recogniser 124b.
  • step 1214 it is determined whether the recogniser is in the waiting state and, if so, then in step 1216 the new audio data is rejected (as being post-utterance noise). If the recogniser is not in any of the preceding states then an error message is generated, since its state is unknown.
  • After step 1208, step 1212 or step 1216, or if no data is present at the input in step 1204, the data polling loop is finished and control passes to the process of Figure 16b.
  • step 1220 it is determined whether a message from the Presentation Agent program to interrupt recording has been received and, if so, in step 1222, it is determined whether the recogniser is in either the recognising or the ready states. If so, then in step 1224, recognition is interrupted and the status is set to waiting.
  • If the state is neither recognising nor ready in step 1222, then an error message is generated in step 1226 and sent to the Presentation Agent program.
  • step 1228 the recogniser program determines whether the Presentation Agent program has sent a message to start recognition and, if so, then in step 1230 it is determined whether the recogniser is in either the recognising or the waiting state and, if so, an error message is generated in step 1232 and sent back to the Presentation Agent program. If not, then in step 1234, the state is set to ready. Following steps 1232 or 1234, or if no start message was detected in step 1228, control passes to the process of Figure 16c. In step 1240, the recogniser program polls the recogniser event queue for messages from the Presentation Agent.
  • the recogniser program determines whether the message is an end of speech indication and, if so, in step 1244, an end-of-speech flag is set by the recogniser, which changes its state from "waiting" to "ready".
  • some audio samples may still be in the system (i.e. on event queues), so the recogniser is set to a 'Wait' external state while processing these (which may simply involve discarding them); when the last sample is received, the state is set to ready, and processing of the next utterance may begin.
  • the recogniser knows how many samples it will receive (and hence which is the last one), as the audio interface program sends a message to the recogniser (via the Presentation Agent) containing the number of samples sent when it is told to stop.
  • After step 1244, or if the message is other than an end of speech indication in step 1242, or there is no message in step 1240, control passes to the process of Figure 16d.
  • step 1250 the speech recogniser interface program 124a polls the speech recognition engine 124b. If there is no reply in step 1252, control returns to step 1204. If there is a reply then in step 1254 it is determined whether it is an alarm and, if so, in step 1256 the recogniser deals as appropriate with the alarm.
  • step 1257 it is determined whether the reply is a recognition result and, if so, then in step 1258 the directed graph data read from the recognition engine 124b is sent to the Presentation Agent event channel and, in step 1260, the status is set to ready.
  • If the result is none of the above, then in step 1262 an error message is sent to the Presentation Agent program 110. After step 1260 or step 1262, control returns to step 1204 of Figure 16a.
  • the recogniser program 124 provides loops which poll for availability of data; change of recognition state instructions; and for completion of recognition.
  • Presentation Agent Program 110
  • the Presentation Agent program is configured on commencement of operation of the system.
  • the Presentation Agent program determines which service and agent programs are present, and selects appropriate output destinations for output indications. If no visual output is available (because the user is communicating only via a telephony interface), output messages are directed solely to the text-to-speech converter, whereas if no text-to-speech converter is present the text is directed to the graphical user interface. It is assumed here, as described above, that graphical and audio interfaces are available.
  • step 1304 it is determined whether the message to stop recording has been received on the Presentation Agent event channel (e.g. from the GUI program 120), and, if so, in step 1306 it is determined whether the present state is recording audio data. If not, then an error message is generated (step 1307).
  • If so, in step 1308 a message is sent to the recogniser program to interrupt recognition and, at step 1310, a message is sent to the audio program to interrupt audio recording. Finally, in step 1312, a message is sent to the GUI 120 to confirm to the user that the recording has stopped.
  • After steps 1312 or 1307, or if no "stop recording" indication was detected in step 1304, control passes to step 1314, in which it is determined whether a "start recording" message has similarly been detected. If so, in step 1316 it is determined whether the current status is recording and, if so, an error message is generated in step 1317 and sent to the GUI.
  • step 1318 a message is sent to start the recogniser and in step 1320 a message is sent to start the audio service recording.
  • step 1322 a confirmation message is sent to the GUI program 120 for display.
  • step 1330 it is determined whether text or selection of an object (e.g. to execute a command or select an option) has been received from the GUI program. If so, then in step 1332 a confirmation message is sent back to the GUI and, in step 1334, the text or selection is placed on the Strategy Agent event channel 144.
  • In step 1336 it is determined whether the message was the "Exit system" command and, if so, in step 1338 the Presentation Agent program 110 instructs all its associated service programs to shut down and then shuts itself down.
  • After step 1336, or if no text or command was detected in step 1330, then in step 1340 the Presentation Agent event channel 142 is polled. If no event is present, control passes back to step 1304 of Figure 17a. Otherwise, in step 1344, it is determined whether the received event is an output message for the user, in which case it is sent to the GUI program 120 in step 1346 for display. If the event is not a message event but a recognition result (step 1348), then in step 1350 the recognition result is put on the Strategy Agent event channel 144, and the recognised text is also sent to the GUI program 120 for display (step 1346); this allows the user immediately to detect erroneous input. Otherwise, in step 1352, it is determined whether the event is text for the text-to-speech synthesiser and, if so, in step 1354 the text is pushed onto the text-to-speech converter input event queue 152.
  • In step 1356 it is determined whether the event is one requiring other action by the Presentation Agent - for example, messages from service programs indicating that actions have been performed, or that their states have changed, requiring further action by the Presentation Agent such as controlling or shutting down one of its associated service programs - and, if so, in step 1358 the required action is performed and the Presentation Agent state is updated if necessary. Otherwise, in step 1360 it is determined whether the event is an alarm and, if so, in step 1362 the alarm is handled. If none of the above is detected, an error message is generated and sent to the GUI 120 in step 1364. A sketch of this routing portion of the loop is given below.
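The event-routing portion of the loop (steps 1340 to 1364) can be pictured with a short Python sketch. It is illustrative only: the parameter names, the event-tuple format and the `gui.display` call are assumptions made for the example, and housekeeping such as the start/stop-recording handling and the "Exit system" command is omitted.

```python
import queue

def presentation_agent_cycle(pa_channel, strategy_queue, tts_queue, gui):
    """One pass over the Presentation Agent event channel (cf. steps 1340-1364)."""
    try:
        kind, payload = pa_channel.get_nowait()   # step 1340: poll the event channel
    except queue.Empty:
        return                                    # no event: back to step 1304

    if kind == "user_message":                    # step 1344: output message for the user
        gui.display(payload)                      # step 1346
    elif kind == "recognition_result":            # step 1348
        strategy_queue.put(("string", payload))   # step 1350: to the Strategy Agent
        gui.display(payload)                      # step 1346: echo the recognised text
    elif kind == "tts_text":                      # step 1352
        tts_queue.put(payload)                    # step 1354: to the text-to-speech queue
    elif kind == "service_status":                # step 1356: other action required
        pass                                      # step 1358: act and update agent state
    elif kind == "alarm":                         # step 1360
        pass                                      # step 1362: handle the alarm
    else:
        gui.display("error: unrecognised event")  # step 1364
```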
  • The Strategy Agent program is configured at start-up of the system; it binds to each associated service program and causes each service program to configure itself.
  • In step 1404 the program polls the Strategy Agent queue 144.
  • In step 1406 it is determined whether or not an event is present on the Strategy Agent queue 144 and, if not, control returns to step 1404.
  • In step 1408 it is determined whether the event corresponds to a string of alphanumeric text and, if so, then in step 1410 it is determined whether the text is plain text or a directed graph. If (step 1410) it is a directed graph (i.e. it has been received, via the Presentation Agent 110, from the speech recogniser program 124), then in step 1412 it is determined whether the event includes the end of the directed graph. If not, then in step 1414 the event is temporarily buffered and control returns to step 1402.
  • If in step 1408 the event was determined to be other than a string event, then control passes to step 1420 (Figure 18b), in which it is determined whether or not the event includes feature/value data, which would indicate that it is a reply from the Information Agent program responding to an earlier instruction. If not, then (step 1421) an error message is generated (as the message is unknown) and control returns to step 1404. If the data is recognised as a reply from the Information Agent 114, it is sent to the dialogue program 134 in step 1422 and, in step 1424, the response is received from the dialogue program 134.
  • In step 1426 the response from the dialogue program 134 is forwarded to the text generation program 126, and in step 1428 the marked-up text is received from the text generation program 126 and forwarded (step 1430) to the Presentation Agent queue 142. Control then returns to step 1404 of Figure 18a.
  • If (step 1444) the response is a command to the Information Agent 114, it is pushed onto the Information Agent queue 146 in step 1446, and control returns to step 1404 of Figure 18a. Otherwise (i.e. the dialogue manager is returning a reply to the user), control passes to step 1426 of Figure 18b.
  • Thus, in summary, the Strategy Agent polls its event queue for events from the Presentation Agent 110 and passes these forward towards the dialogue manager 134; passes information requests from the dialogue manager to the Information Agent 114; passes results supplied by the Information Agent 114 to the dialogue manager 134; and passes replies for the user generated by the dialogue manager 134 to the Presentation Agent 110. A minimal sketch of this cycle follows.
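The following Python fragment sketches one Strategy Agent cycle under several simplifying assumptions: the event-tuple format, the `dialogue.handle` and `text_gen.generate` calls, and the way directed-graph fragments are marked are invented for the illustration and do not reproduce the actual interfaces of the dialogue program 134 or the text generation program 126.

```python
import queue

def strategy_agent_cycle(strategy_queue, info_agent_queue, pa_queue,
                         dialogue, text_gen, graph_buffer):
    """One pass over the Strategy Agent queue (cf. steps 1404-1446), illustrative only."""
    try:
        kind, payload = strategy_queue.get_nowait()        # steps 1404 and 1406
    except queue.Empty:
        return

    if kind == "string":                                   # step 1408: text or directed graph
        if payload.get("directed_graph"):                  # step 1410
            graph_buffer.append(payload)                   # step 1414: buffer the fragment
            if not payload.get("is_end"):                  # step 1412: wait for the end
                return
            payload = {"graph": list(graph_buffer)}        # complete graph assembled
            graph_buffer.clear()
        response = dialogue.handle(payload)                # pass to the dialogue manager 134
    elif kind == "feature_value":                          # step 1420: Information Agent reply
        response = dialogue.handle(payload)                # steps 1422 and 1424
    else:
        pa_queue.put(("user_message", "error: unknown event"))  # step 1421
        return

    if response.get("info_request"):                       # step 1444: command for the Info Agent
        info_agent_queue.put(response)                     # step 1446
    else:                                                  # a reply destined for the user
        text = text_gen.generate(response)                 # steps 1426 and 1428
        pa_queue.put(("user_message", text))               # step 1430
```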
  • Information Agent Program 114
  • The operation of the Information Agent program 114 will largely be clear from the foregoing. In particular, it polls its event channel for requests from the Strategy Agent, and for data from each Information Service.
  • On receiving a request from the Strategy Agent, it buffers the identification code and timestamp supplied with the request in a request buffer, and looks up a database query based on the request, using a database of commands for each Information Service. This query is then forwarded, together with an identification code (which may be the same as that received with the request), to the required Information Service(s).
  • Where more than one Information Service is to be consulted, the Information Agent program 114 is arranged to query each in parallel, and to forward replies to the Strategy Agent queue as they arrive; a minimal sketch of this behaviour is given below.
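The sketch below illustrates this parallel querying in Python, under stated assumptions: `services` maps a service name to a callable, `command_db` maps a request type to a stored query, and the request fields and reply format are invented for the example.

```python
import concurrent.futures
import time
import uuid

def information_agent_request(request, services, command_db, strategy_queue):
    """Query each Information Service in parallel and forward replies as they arrive."""
    request_id = request.get("id") or str(uuid.uuid4())   # identification code
    timestamp = time.time()                               # buffered with the id in a request buffer
    query = command_db[request["type"]]                   # look up the stored database query

    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {pool.submit(service, query): name
                   for name, service in services.items()}
        for done in concurrent.futures.as_completed(futures):
            # Replies are forwarded individually, not held until all have completed.
            strategy_queue.put({"kind": "feature_value",
                                "id": request_id,
                                "timestamp": timestamp,
                                "service": futures[done],
                                "result": done.result()})
```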
  • a store of "real-world” knowledge is also included in the semantics analysis application, containing for example, facts about the particular information services present or potentially present, such as which semantic items can co-occur and which are incompatible.
  • an item such as a "book” is linked to a more generic type such as "object”, and therefore inherits specific properties that all objects have, such as being the object of actions such as “to lift”, “to move” and so on, but is also specifically linked to particular actions ("to read”) and not to others ("to swim in”).
  • The semantics engine can thus detect incorrect information, or infer missing information, from the real-world data and the input semantics; a toy illustration of such a knowledge store follows.
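The inheritance just described can be pictured with a toy knowledge store in Python. The items, type links and action sets below are invented for the example; a real store would of course be far richer.

```python
# Toy "real-world" knowledge store: each item names its more generic parent type
# and the actions that apply to it specifically.
WORLD = {
    "object": {"parent": None,     "actions": {"to lift", "to move"}},
    "book":   {"parent": "object", "actions": {"to read"}},
    "lake":   {"parent": "object", "actions": {"to swim in"}},
}

def allowed_actions(item: str) -> set:
    """Collect the actions an item supports, inheriting those of its generic types."""
    actions = set()
    while item is not None:
        actions |= WORLD[item]["actions"]
        item = WORLD[item]["parent"]
    return actions

def compatible(action: str, item: str) -> bool:
    """Decide whether an action/item pairing is semantically plausible."""
    return action in allowed_actions(item)

if __name__ == "__main__":
    print(compatible("to move", "book"))     # True: inherited from "object"
    print(compatible("to read", "book"))     # True: specific to "book"
    print(compatible("to swim in", "book"))  # False: incompatible co-occurrence
```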
  • a pragmatics reasoning stage may additionally be provided for further meaning-in-context inference to be performed on the input from the user.
  • Particular recognisers have been described above. However, other suitable recognisers which can be arranged to supply outputs in the form of a directed graph are commercially available. Even recognisers which do not supply a directed graph, but merely the best (or best N) words, can be used, although the parser is less able to correct recognition errors in such embodiments.
  • a handwriting recognition engine is provided in parallel with, or instead of, the speech recognition engine.
  • The directed graph output by the handwriting engine may be on a letter-by-letter basis or on a word-by-word basis, as discussed in relation to the speech recogniser above.
  • a video camera could be used to capture aspects of user input such as hand gestures, eye movements or even expressions.
  • a video recognition engine is provided in parallel with, or instead of, the speech recognition engine. In place of the described parser and semantic analysis programs, analysis as described in, for example, WO 99/31604 or WO 99/08202 (incorporated herein by reference) could be used.
  • Some or all of the databases may be remotely located and accessed via the Internet rather than via a LAN as described. In this case, the asynchronous, parallel processing of the above embodiments is particularly useful.
  • The Object Request Broker in such cases conforms to IIOP (the Internet Inter-ORB Protocol).
  • Where several such programs are operated in parallel, the first to return a result, or the first to return a grammatically parsable result, may be used (a sketch of this selection is given below).
  • the system thus continues to respond in real-time to the user's later requests, even where some substantial processing task is under way.
  • a distributed network of computers can be used to share heavy processing tasks, enabling the user interface to be small - for example a downloadable web page.
  • The embodiment enables the different inputs from a user (for example, verbally and via a mouse, but also possibly through other feedback or general input means) to be associated together, thus enabling body language or other actions to be correlated with speech to disambiguate the user's intention.
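The first-result selection mentioned above might look roughly as follows in Python. The sketch assumes the parallel components (alternative recognisers, or remote services) can be wrapped as simple callables and that a `parses` predicate is available; neither assumption comes from the described embodiment.

```python
import concurrent.futures

def first_parsable(sources, utterance, parses):
    """Return the first grammatically parsable result, else the first result of any kind."""
    fallback = None
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(source, utterance) for source in sources]
        for done in concurrent.futures.as_completed(futures):
            result = done.result()
            if fallback is None:
                fallback = result            # remember the first result to arrive
            if parses(result):
                return result                # first parsable result wins
    return fallback

if __name__ == "__main__":
    # Trivial demonstration with two fake "recognisers".
    noisy = lambda utterance: "wreck a nice beach"
    clean = lambda utterance: "recognise speech"
    is_parsable = lambda text: "recognise" in text
    print(first_parsable([noisy, clean], "<audio>", is_parsable))
```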

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer And Data Communications (AREA)
  • Machine Translation (AREA)

Abstract

The invention concerns apparatus for human-computer interaction, whereby a user can control at least one computer service. The apparatus comprises input apparatus for receiving at least one word from a user; analysis apparatus arranged to analyse the word; an interface for controlling at least one computer service in dependence upon the at least one word; and output apparatus for generating a response to the user in dependence upon said word; characterised in that at least the input apparatus, the interface and the output apparatus execute separate programs, and in that these programs communicate asynchronously with one another.
PCT/GB1999/002240 1999-07-13 1999-07-13 Architecture distribuee orientee objet pour comprehension vocale WO2001004876A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/GB1999/002240 WO2001004876A1 (fr) 1999-07-13 1999-07-13 Architecture distribuee orientee objet pour comprehension vocale
CA002343077A CA2343077A1 (fr) 1999-07-13 1999-07-13 Architecture distribuee orientee objet pour comprehension vocale

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/GB1999/002240 WO2001004876A1 (fr) 1999-07-13 1999-07-13 Architecture distribuee orientee objet pour comprehension vocale

Publications (1)

Publication Number Publication Date
WO2001004876A1 true WO2001004876A1 (fr) 2001-01-18

Family

ID=10846920

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB1999/002240 WO2001004876A1 (fr) 1999-07-13 1999-07-13 Architecture distribuee orientee objet pour comprehension vocale

Country Status (2)

Country Link
CA (1) CA2343077A1 (fr)
WO (1) WO2001004876A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3373141A1 (fr) * 2017-03-09 2018-09-12 Capital One Services, LLC Systèmes et procédés permettant de fournir un dialogue automatisé en langage naturel avec des clients
CN108958804A (zh) * 2017-05-25 2018-12-07 蔚来汽车有限公司 适于与车辆有关的应用场景的人机交互方法

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999026233A2 (fr) * 1997-11-14 1999-05-27 Koninklijke Philips Electronics N.V. Procede et systeme configures pour le partage selectif du materiel dans un systeme d'intercommunication a commande vocale avec traitement de la parole sur plusieurs niveaux de complexite relative

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999026233A2 (fr) * 1997-11-14 1999-05-27 Koninklijke Philips Electronics N.V. Procede et systeme configures pour le partage selectif du materiel dans un systeme d'intercommunication a commande vocale avec traitement de la parole sur plusieurs niveaux de complexite relative

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
ARONS B: "TOOLS FOR BUILDING ASYNCHRONOUS SERVERS TO SUPPORT SPEECH AND AUDIOAPPLICATIONS", ACM SYMPOSIUM ON USER INTERFACE SOFTWARE AND TECHNOLOGY,US,NEW YORK, NY: ACM, 1992, pages 71 - 78, XP000337552, ISBN: 0-89791-549-6 *
HARRISON T H ET AL: "THE DESIGN AND PERFORMANCE OF A REAL-TIME CORBA EVENT SERVICE", ACM SIGPLAN NOTICES,US,ASSOCIATION FOR COMPUTING MACHINERY, NEW YORK, vol. 32, no. 10, 1 October 1997 (1997-10-01), pages 184 - 200, XP000723422, ISSN: 0362-1340 *
HOGE H: "SPICOS II - A SPEECH UNDERSTANDING DIALOGUE SYSTEM", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING (ICSLP),JP,TOKYO, ASJ, 1990, pages 1313 - 1316, XP000506994 *
NIEMOELLER M ET AL: "A PC-BASED REAL-TIME LARGE VOCABULARY CONTINUOUS SPEECH RECOGNIZER FOR GERMAN", IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP),US,NEW YORK, IEEE, 1997, pages 1807 - 1810, XP000735012, ISBN: 0-8186-7920-4 *
O'MALLEY M H ET AL: "Beyond the "reading machine": combining smart text-to-speech with an AI-based dialogue generator", SPEECH TECHNOLOGY, SEPT.-OCT. 1986, USA, vol. 3, no. 3, pages 34 - 40, XP002132480, ISSN: 0744-1355 *
YOICHI TAKEBAYASHI ET AL: "A REAL-TIME SPEECH DIALOGUE SYSTEM USING SPONTANEOUS SPEECH UNDERSTANDING", IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS,JP,INSTITUTE OF ELECTRONICS INFORMATION AND COMM. ENG. TOKYO, vol. E76 - D, no. 1, 1 January 1993 (1993-01-01), pages 112 - 120, XP000354892, ISSN: 0916-8532 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3373141A1 (fr) * 2017-03-09 2018-09-12 Capital One Services, LLC Systèmes et procédés permettant de fournir un dialogue automatisé en langage naturel avec des clients
US10332505B2 (en) 2017-03-09 2019-06-25 Capital One Services, Llc Systems and methods for providing automated natural language dialogue with customers
US10614793B2 (en) 2017-03-09 2020-04-07 Capital One Services, Llc Systems and methods for providing automated natural language dialogue with customers
US11004440B2 (en) 2017-03-09 2021-05-11 Capital One Services, Llc Systems and methods for providing automated natural language dialogue with customers
US11735157B2 (en) 2017-03-09 2023-08-22 Capital One Services, Llc Systems and methods for providing automated natural language dialogue with customers
CN108958804A (zh) * 2017-05-25 2018-12-07 蔚来汽车有限公司 适于与车辆有关的应用场景的人机交互方法

Also Published As

Publication number Publication date
CA2343077A1 (fr) 2001-01-18

Similar Documents

Publication Publication Date Title
EP3895161B1 (fr) Utilisation de flux d'entrée pré-événement et post-événement pour l'entrée en service d'un assistant automatisé
US7912726B2 (en) Method and apparatus for creation and user-customization of speech-enabled services
EP1602102B1 (fr) Gestion de conversations
EP4254402A2 (fr) Détermination automatique d'une langue pour la reconnaissance de la parole d'un énoncé vocal reçu par l'intermédiaire d'une interface d'assistant automatisé
EP0607615B1 (fr) Système d'interface de reconnaissance de la parole adapté pour des systèmes à fenêtre et systèmes de messagerie à parole
RU2349969C2 (ru) Синхронное понимание семантических объектов, реализованное с помощью тэгов речевого приложения
US8645122B1 (en) Method of handling frequently asked questions in a natural language dialog service
JP3725566B2 (ja) 音声認識インターフェース
US8000974B2 (en) Speech recognition system and method
EP1175060B1 (fr) Couche intergiciel entre applications et moteurs vocales
US7188067B2 (en) Method for integrating processes with a multi-faceted human centered interface
JP4012263B2 (ja) アプリケーション間タスクのための多モード自然言語インタフェース
US7624018B2 (en) Speech recognition using categories and speech prefixing
EP1349145B1 (fr) Système et procédé permettant la gestion des informations utilisant un interface de dialogue parlé
US20030061029A1 (en) Device for conducting expectation based mixed initiative natural language dialogs
EP1650744A1 (fr) Détection de commande invalide en reconnaissance de la parole
GB2323694A (en) Adaptation in speech to text conversion
MXPA04005121A (es) Entendimiento sincronico de objeto semantico para interfase altamente interactiva.
US6745165B2 (en) Method and apparatus for recognizing from here to here voice command structures in a finite grammar speech recognition system
US7162425B2 (en) Speech-related event notification system
Kadous et al. InCa: A mobile conversational agent
WO2001004876A1 (fr) Architecture distribuee orientee objet pour comprehension vocale
GB2375210A (en) Grammar coverage tool for spoken language interface
Fabbrizio et al. Extending a standard-based ip and computer telephony platform to support multi-modal services
Paggio et al. Linguistic interaction in staging—A language engineering view

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CA US

WWE Wipo information: entry into national phase

Ref document number: 09763828

Country of ref document: US

ENP Entry into the national phase

Ref document number: 2343077

Country of ref document: CA

Ref country code: CA

Ref document number: 2343077

Kind code of ref document: A

Format of ref document f/p: F