WO1997032431A1 - Dialogue system - Google Patents

Dialogue system

Info

Publication number
WO1997032431A1
Authority
WO
WIPO (PCT)
Prior art keywords
dialogue
recogniser
input
speech
human
Prior art date
Application number
PCT/GB1997/000554
Other languages
French (fr)
Inventor
Alan Einer Hendrickson
Original Assignee
Pulse Train Technology Limited
Priority date
Filing date
Publication date
Application filed by Pulse Train Technology Limited filed Critical Pulse Train Technology Limited
Priority to AU22226/97A priority Critical patent/AU2222697A/en
Publication of WO1997032431A1 publication Critical patent/WO1997032431A1/en


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/42Systems providing special services or facilities to subscribers
    • H04M3/50Centralised arrangements for answering calls; Centralised arrangements for recording messages for absent or busy subscribers; Centralised arrangements for recording messages
    • H04M3/527Centralised call answering arrangements not requiring operator intervention
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/42Systems providing special services or facilities to subscribers
    • H04M3/487Arrangements for providing information services, e.g. recorded voice services or time announcements
    • H04M3/493Interactive information services, e.g. directory enquiries; Arrangements therefor, e.g. interactive voice response [IVR] systems or voice portals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M2201/00Electronic components, circuits, software, systems or apparatus used in telephone systems
    • H04M2201/40Electronic components, circuits, software, systems or apparatus used in telephone systems using speech recognition
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/42Systems providing special services or facilities to subscribers
    • H04M3/50Centralised arrangements for answering calls; Centralised arrangements for recording messages for absent or busy subscribers; Centralised arrangements for recording messages
    • H04M3/51Centralised call answering arrangements requiring operator intervention, e.g. call or contact centers for telemarketing
    • H04M3/5166Centralised call answering arrangements requiring operator intervention, e.g. call or contact centers for telemarketing in combination with interactive voice response systems or voice portals, e.g. as front-ends

Definitions

  • The present invention relates to a dialogue system and a method of operating a dialogue system.
  • The invention also extends to a method of generating a compressed version of an item of speech.
  • Dialogue systems use pre-recorded or computer generated speech output to convey information (which may be imperative, informative, or interrogative) to a respondent.
  • Conventional dialogue systems require a respondent to input information by means of the dial or buttons on a telephone instrument.
  • The number of pulses or the tone frequencies emitted by the instrument are interpreted by the dialogue systems by mapping them into an agreed set of semantic meanings. For example, a respondent might be instructed to 'press one' to indicate agreement, 'press two' to indicate disagreement, etc.
  • Newer dialogue systems have attempted to employ electronic apparatus to directly respond to the voice of a human respondent using speech recognition techniques which may be further coupled with semantic analysis algorithms to improve the accuracy of the dialogue process.
  • However, such voice-directed systems are still inaccurate unless the respondent is specially trained, and in any event the spoken responses are limited in terms of the vocabulary and grammar which are acceptable for automatic recognition purposes.
  • A dialogue system adapted to carry out a conversation between the system and a human respondent, the system comprising a dialogue generator adapted to electronically generate output dialogue which is routed to the human respondent, wherein the output dialogue invites a response from the human respondent; and means for receiving input dialogue responsive to the output dialogue in the form of a speech signal from the human respondent, characterised in that the system further comprises one or more recogniser stations, each for communicating input dialogue to a respective human recogniser for speech recognition and semantic analysis.
  • The system generates and transmits output to the respondent (who may be either the initiator or the called party in the conversation) and the human recogniser is responsible for the interpretation of audible input received from the respondent (e.g., answers to survey questions).
  • The invention thus recognises the limitations of speech recognition algorithms, and provides a 'semi-automated' system in which the accuracy of the overall system is greatly improved.
  • The system according to the present invention gives the following advantages.
  • The system is adapted for connection to a communication medium such as the public switched telephone network (PSTN) or private networks of leased lines (e.g., travel agents, currency traders, etc.) to transmit and receive dialogue output or input dialogue to or from the respondent.
  • If the respondent has suitable equipment (such as a personal computer with a monitor screen or a TV set capable of receiving and interpreting digital signals) then the output dialogue may be in the form of textual data displayed on the screen or TV set and viewed by the respondent.
  • Typically, however, the output dialogue is an audio signal such as a pre-recorded or synthesised voice.
  • A typical conversation consists of a series of transactions between the system and a respondent.
  • Each transaction comprises an item of output dialogue from the system and a corresponding item of input dialogue from a respondent.
  • Transactions may consist of an output followed by an input or conversely by an input followed by an output, depending on the nature of the conversation and the application to which the method is applied. Conversations with mixed types of transactions are also possible.
  • The conversation may involve items of output dialogue which do not require a response from the respondent, but at least one item of output dialogue will require a response (typically it will be interrogatory, i.e. expressing or involving a question).
  • The conversation may be initiated by the system or by the respondent.
  • At the onset of the conversation the audible output from the system may be either a software controlled electronically generated audible output or the equivalent output spoken by a human initiator.
  • A human initiator would more typically be used in conversations which were initiated by the system (e.g., in survey research applications).
  • The initiator may also be a recogniser at other times. Ideally, only a small proportion of a conversation employs a human initiator and at some point in the conversation the output is taken over by the electronic components of the system.
  • The means for receiving input dialogue typically comprises a routing device adapted to receive input dialogue from a plurality of human respondents in parallel and route the input dialogue to the or a selected one of the recogniser stations.
  • The conversations may be carried out sequentially or concurrently.
  • The system further comprises a dialogue manager for controlling the routing device whereby the system can carry out a plurality of concurrent conversations with a respective plurality of human respondents, the number of concurrent conversations being greater than the number of recogniser stations. This enables each recogniser to participate in more than one concurrent conversation, and reduces the number of recogniser stations and the number of human recognisers required.
  • The concurrent conversations between the system and the respondents are typically independent of each other in that there is no synchronisation between them. Moreover, the nature and purpose of each conversation may differ one from the other. Each human recogniser is engaged to deal with the input of a single transaction at a time.
  • The system comprises a plurality of recogniser stations and the means for receiving input dialogue comprises a routing device adapted to route the input dialogue to a selected one of said plurality of recogniser stations.
  • When a human recogniser indicates to the system that the input has been recognised and interpreted, they are considered by the system to be part of a 'pool' of available recognisers and immediately become available to participate in another transaction which may or may not involve the same respondent as the previous transaction. This effectively allows the system to carry out more simultaneous conversations than there are human recognisers in the pool.
  • The human recogniser may also take part in a conversation, or at least control the timing and/or content of the dialogue output generated by the system, as discussed below.
  • The system will typically generate and transmit a complete output and receive a corresponding input from the respondent.
  • A human recogniser is selected from the pool of available recognisers on the system, and the audible input is switched through to the selected human recogniser.
  • The routing device is typically adapted to route the input dialogue to a human recogniser immediately following the end of an item of output dialogue, or at a predetermined interval before the end of an item of output dialogue.
  • Alternatively, the routing device may be initiated by the receipt of input dialogue, which in some cases may be received before the point where it is normally anticipated; i.e., the respondent may interrupt the normal course of the transaction before the output is completed.
  • When a selected recogniser is presented with the input from a respondent, they interpret the response and, depending on the semantic content of the response, initiate one of a number of possible actions.
  • Typically an action will comprise classifying the input into one or more predefined semantic categories. For example, the output dialogue of the system could have been a question to which the expected response is either affirmation or disagreement. The recogniser would in this case indicate to the system (e.g., by pressing an appropriate key) which of these two alternatives the input dialogue was most appropriately matched to.
  • The system will typically also provide a number of standard actions to be employed in cases where the input does not fall into an expected response category.
  • The completion of the action taken by the recogniser when receiving the input from the respondent may be a signal to the system that another transaction in the dialogue may commence.
  • The dialogue generator may also employ information taken from any prior action by a human recogniser in the conversation, or a logical combination of prior actions, to direct the conversation at points where the conversation is allowed to branch.
  • The recogniser is thereby responsible for the synchronisation of the pairs of input and output dialogue forming each transaction.
  • The system typically continuously monitors the input channel of the telephone line for each respondent to see if audible sound can be detected. At the same time the input is typically stored in a digital form. If the system does not detect audible sound, the recording is discarded.
  • Preferably, the system further comprises a store for storing input dialogue prior to playing back the stored input dialogue to a human recogniser.
  • For certain conversation protocols the input may be required for later analysis, and in this situation every item of input dialogue will typically be recorded and stored.
  • Alternatively, the means for receiving input dialogue may be adapted to route input dialogue to the store only when the or each recogniser station is occupied. The stored input dialogue can then be played back to a selected recogniser when that recogniser becomes available.
  • A method of operating a dialogue system to carry out a conversation between the system and a human respondent, the method comprising electronically generating output dialogue and routing the output dialogue to the human respondent, wherein the output dialogue invites a response from the respondent, the method further comprising receiving input dialogue responsive to the output dialogue in the form of a speech signal from the human respondent and communicating the input dialogue to a human recogniser for speech recognition and semantic analysis.
  • A method of operating a dialogue system to carry out a plurality of conversations between the system and a respective plurality of human respondents, the method comprising, for each conversation, generating output dialogue and routing the output dialogue to a respective human respondent, wherein the output dialogue invites a response from the respective human respondent, the method further comprising receiving input dialogue responsive to the output dialogue in the form of a speech signal from each respondent, and communicating the input dialogue from each respondent to an available human recogniser for speech recognition and semantic analysis.
  • A method of generating a compressed version of an item of speech, the item of speech having a first length and the compressed version having a second length shorter than the first length, and wherein the item of speech comprises a speech signal having one or more audible intervals and one or more substantially silent intervals, the method comprising removing the or each substantially silent interval from the speech signal to generate the compressed version of the item of speech.
  • Apparatus for generating a compressed version of an item of speech, the item of speech having a first length and the compressed version having a second length shorter than the first length, and wherein the item of speech comprises a speech signal having one or more audible intervals and one or more substantially silent intervals, the apparatus comprising means for removing the or each substantially silent interval from the speech signal to generate the compressed version of the item of speech.
  • The fourth and fifth aspects of the present invention provide a particularly efficient and simple method of providing a recognisable compressed version of an item of speech which, when played back, has a shorter length than the original item of speech.
  • The method is typically employed in a system according to the first aspect of the invention, or as part of the method according to the second or third aspects of the invention.
  • The system typically comprises a store for storing input dialogue prior to playing back the stored input dialogue to a human recogniser.
  • The input dialogue is typically stored and played back when a human recogniser becomes available.
  • The input dialogue is compressed according to the fourth aspect of the invention either before or after storing, and is played back to the selected human recogniser, typically whilst the item of input dialogue is still being received.
  • The compressed version of the original item of speech is shorter, and therefore allows the selected human recogniser to catch up with the ongoing input dialogue.
  • The method may further comprise removing short slices (typically 10 ms slices) of the audible portions of the speech signal.
  • Figure 1 is a schematic block diagram of an example of an Interactive Voice Response (IVR) dialogue system according to the invention
  • Figure 2 is a logic diagram of the recogniser allocation and de-allocation process
  • Figure 3 illustrates two simultaneous conversations and the activity of a single recogniser
  • Figure 4 illustrates an example of a speech compressor according to the fifth aspect of the invention.
  • Figure 1 is a schematic diagram of an embodiment of a dialogue system according to the present invention.
  • The illustrated system architecture is one example of a number of possible architectures which could be implemented for a system according to the invention.
  • Pictured in Figure 1 are a number of blocks which represent electronic hardware devices or software modules. In the discussion which follows each of these blocks is described.
  • A Dialogue Manager (DM) 1 is the central controlling piece of software.
  • The DM 1 is able to communicate directly with three further software modules, namely the Output Server (OS) 2, Input Server (IS) 3, and Back end database and database servers (DBS) 4.
  • In addition, the DM 1 communicates directly with a hardware switch 5.
  • The DM 1 can incorporate the software logic which controls the overall execution of a dialogue for a given application. More typically the DM 1 will be driven by a database of instructions which together comprise a protocol for a conversation. It would be expected that systems would have the capability of executing a number of such protocols simultaneously and in principle each respondent currently interacting with the system could be doing so under the control of a different protocol.
  • The Switch 5 will typically be a computer incorporating special purpose electronic boards which permit a variety of telephony related functions. To the right of the Switch are shown thirteen telephone lines 6 which originate from the Public Switched Telephone Network (PSTN) 7 or from a private network or possibly a mixture of the two.
  • To the left of the Switch 5 are shown connections to four telephones 8-11, which would more typically be headsets worn by the human recognisers.
  • The telephones 8-11 are each situated next to a personal computer (PC) 12-15.
  • The combination of a telephone and a PC is called a Recogniser Station (RS).
  • The number of telephone lines and the number of Recogniser Stations attached to the switch is flexible, but the numbers shown in Figure 1 indicate that the system is capable of handling a larger number of ongoing dialogues than there are human recognisers associated with the system.
  • The Input Server (IS) module 3 is connected directly to each of the PCs 12-15 which form part of each Recogniser Station.
  • The IS module 3 is responsible for keeping track of the status of each RS, each of which is usually in a state of being 'assigned' or conversely 'available' with respect to a transaction which is in progress within the system.
  • The IS 3 is initially contacted by the DM 1, which requests the services of a recogniser.
  • The IS 3 will respond with an acknowledgement that a recogniser has been assigned to the transaction or conversely that none is currently available.
  • When a recogniser has been assigned to a transaction, the IS 3 is responsible for placing messages and information on the screen of the PC 12-15 of the Recogniser Station, and accepting input from the PC 12-15.
  • The input from the PC 12-15 can originate by any of the conventional methods of human interaction with the PC including keyboard input, mouse or tracker ball input, or even voice input.
  • When input is received from a Recogniser Station, the IS module 3 will validate the response in the context of the overall controlling protocol which controls the conversation. The IS 3 will be aware of which protocol is appropriate by virtue of information passed on from the DM 1 at the time the request was made to assign a recogniser. In the case of invalid input, the IS 3 will send an appropriate error message to the Recogniser Station and await corrected input. When valid input has been received for the transaction, the value (contents) of the input will be transmitted to the DM 1. At the same time, the recogniser will be released from the transaction and placed into the pool of available recognisers.
  • When the output part of a transaction is due to be transmitted to a respondent, the process is initiated by the DM module 1.
  • The DM module 1 will indicate which output is to be transmitted to which outgoing telephone line 6, sending this information to the Output Server (OS) 2.
  • The OS module 2 is responsible for assembling or generating the required output stream. Typically, this will consist of a digitised wave form which is then sent to a Voice Response hardware unit 16.
  • The VR unit 16 can pass the information on in a digital form or, if required, it can convert the digital information into conventional analogue signals which are sent through the switch 5 and down the appropriate telephone line 6.
  • Pre-recorded elements might be the phrase "The number you require is ...", the ten digits "one", "two", etc., and the phrase "I repeat, ...".
  • It would be possible for the output stream to contain any complete telephone number, spoken digit by digit, following the phrase "The number you require is ...".
  • The system also permits direct output to originate from the Recogniser Stations, bypassing in effect the OS 2. This may be required in cases where the protocol employed has not anticipated a situation which arises, as determined from the content of the input received from the respondent. In such cases, the recogniser is able to use their judgement and simply speak into the telephone 8-11 directly to the respondent.
  • The system will typically provide for a keyboard action to be relayed to the DM 1 via the IS 3 to inform the DM of the special situation.
  • The DM 1 is directly connected to the DBS module 4, which acts to provide a number of specialised functions and also serves as a repository of the information which may be collected by some conversations. Each conversation is under the control of a protocol which defines the possible transactions and in some cases the permitted sequence of transactions.
  • The protocols may include provision for branching during the conversation process, and the action of branching will typically be controlled by the DM 1 using information stored in the DBS module 4.
  • Inbound applications are those where the conversation originates by a call placed by the respondent.
  • The mapping of such an inbound call to a specific protocol is usually done on the basis of the telephone line number which is called. Typically there will be more telephone line numbers (virtual lines) than there are available physical telephone lines coming into the Switch 5.
  • In an inbound application, the Switch 5 will typically have the facility to determine which telephone line number has been rung. The Switch 5 will then communicate with the DM 1, passing on the information that a new conversation is to commence and passing on the virtual line number. The DM 1 is then able to make a determination of the appropriate protocol to employ during the conversation process, and request the details of the protocol from the DBS module 4.
  • Inbound applications can begin with an initial output which can be either spoken by a human or generated and transmitted through the VR unit 16.
  • Recogniser 'A' is assigned.
  • Event 4 commences with another output-input transaction, which then may split into a variable number of further output-input transactions as the respondent calls out the required number (events 5, 6 and 7).
  • Event 7 also commences a new output-input transaction, asking for the desired telephone number.
  • The protocol terminates after event 10, with the last action being an output from the system signifying that dialling is now in progress.
  • Another possible inbound application for this invention is to control a multi-line switchboard for a company or organisation. In this case, there may be no special significance attached to the number which the respondent dials. An example conversation is given below.
  • The selected operator is immediately freed to undertake another task.
  • The extension number selected is used to select an appropriate response: "Just a moment please. I'm putting you through to <name of person> in our sales department."
  • Outbound applications can also begin with an initial output which can be either spoken by a human or generated and transmitted through the VR unit 16.
  • The prevailing circumstances in the initial state of an outbound call are more heterogeneous than those of inbound calls, and typically the first output from the system will be conducted by a human.
  • The system would simply be expected to instruct one of the recognisers to carry out this task.
  • The task of the recogniser is to act as a recruiter, to try to get the appropriate person at the telephone number contacted to the telephone, and then to try to persuade them to enter into a conversation which constitutes an interview.
  • There are a number of possible outcomes to this activity ranging from the initiation of the interview proper to finding that the appropriate person is not available or is unwilling to participate.
  • The recogniser is expected to indicate to the system the outcome status, which will be used to select the next step in the controlling protocol.
  • The system may incorporate facilities for auto-dialling or predictive dialling in outbound applications. These facilities, if incorporated, are controlled by special algorithms in the DM 1 which require information to be passed to and from the Switch 5.
  • The Switch 5 has the appropriate circuitry to initiate dialling attempts.
  • The system might output an initial continuity message through the OS 2.
  • The dialogue manager 1 is constantly aware of the length of the question it is currently asking, and inspects the pool of dedicated answer recognisers. At some point (a few seconds from the end of the question) an answer recogniser is selected, and the question and precoded answers (if relevant) are brought up on the screen, thus alerting them to the impending answer. At the moment the question is finished, the selected recogniser is connected through to the incoming wire of the telephone line (not the outgoing wire). The survey process is paused, awaiting the response or some instruction from the recogniser. The system also always records and stores the respondent's response in the DBS 4 at the same time it is played through to the answer recogniser.
  • The recogniser listens, and can do the following:
  • The respondent is thus smoothly handed back to a live person (recruiter or recogniser), and the priority of selection, with the recogniser themselves as the last resort, guarantees there will be no interruption.
  • The recruiter will be aware (to some extent at least) of the nature of the problem, and be well equipped to deal with it.
  • Take over immediately. If, in the judgement of the recogniser, the situation warrants immediate action, or if the situation is too complex to be handled via the (possibly truncated - from the beginning, not the end) recording of the respondent's problem, they can take over immediately with or without playing the continuity recording.
  • Interviewer's special actions. An augmented list of the special actions may be available to the recogniser. Some of these may be (a) silent, or (b) involve a handback to a recruiter or themselves, or (c) possibly go off into a special dialogue. For example, if an appointment is made, it may be possible to do this through electronically generated dialogue as well, thus maintaining contact with the pre-recorded voice throughout.
  • The protocol for a conversation will indicate that the entire spoken response of a respondent to a particular question or questions is desired for possible future use and analysis.
  • The output-input transaction then follows the same course of action as previously described for other output-input transactions, with the exception that recording of the response is no longer optional.
  • The system would be expected to preserve the digitised responses via the DBS module 4 on a storage device (not shown).
  • The function of the recogniser in such questions (which are normally referred to in the market research industry as 'open-ended questions') is to determine the appropriate point at which the informant has given the entire response. In some cases, the recogniser may be instructed to encourage the respondent to give more detail.
  • A menu of prompts such as "Tell me a bit more" or "Is there anything you would like to add to that?" can be selected via an action of the recogniser, which in turn triggers the output via the OS and VR unit.
  • An action is then taken which releases the system to continue with the protocol.
  • The system will continue to record the input from respondents not yet assigned to a human recogniser, and will assign a recogniser to the transaction as soon as one becomes available. It then becomes necessary to present to the recogniser a reconstruction of the input from the moment the respondent began to speak, while at the same time the ongoing input for the same transaction continues to be recorded.
  • The system will replay the recorded input to the recogniser in less elapsed time than was required to record it, in an attempt to catch up with the ongoing input, at which point the system will switch the live input directly to the recogniser.
  • The speeding up of the process of input reconstruction is accomplished by the truncation of the short silences that are naturally present in all human speech.
  • Typical recorded embedded silences of greater than 100 milliseconds will be reduced to this amount of time or less during the reconstruction process.
  • The system may resort to elimination of parts of the audible signal as well, by frequent removal of short 'slices' (i.e. intervals) of sound.
  • A typical slice of removed audible material might be of 10 msec duration, and the frequency at which these slices are removed would be dynamically adjusted to permit the reconstructed input to catch up with input still being received, while preserving as much of the natural input cadence as possible under the circumstances. It is possible to cut about 50% of the material (including silences) in continuous speech and still understand it when it is played back.
  • Speech compression is achieved by a speech compressor 17 illustrated in detail in Figure 4.
  • The DM 1 causes the switch 5 to route the speech signal onto line 18, which forms an input to the speech compressor 17.
  • The speech compressor 17 comprises an analogue-to-digital convertor 25 (optionally located in the switch 5) which outputs a digital signal to a two-way switch 21.
  • The switch 21 routes the digital signal under normal circumstances to a threshold device 22.
  • The threshold device 22 determines whether the digital signal is above or below a predetermined level. If it is below the predetermined level, the signal is discarded at 23. If the signal is above the predetermined level, the signal is output to the DBS 4 via output 20.
  • The DM 1 causes the switch 21 to route the digital signal to a slice remover 23, which removes intervals from the digital speech signal at a frequency determined by a signal from the DM 1 on a control line.
  • Figure 2 shows a logic diagram of the recogniser allocation and de-allocation process which also incorporates the typical (but optional) logic for recording and replaying the truncated reconstructed input to the recogniser when it is required.
  • Figure 3 illustrates a number of the above points.
  • The figure commences with two alternating shaded and unshaded bars labelled 'D1' and 'D2'. These two bars represent conversations which are taking place simultaneously within a single system.
  • The shaded portions of each bar represent the output intervals, and the unshaded and numbered portions are the input intervals where input was received from the two respondents.
  • 'R1' represents the activity of a single recogniser active on the system during the conversations.
  • The numbered sections of R1 represent the input portions taken from the D1 and D2 bars above and show the point at which they are presented to the recogniser. Notice that inputs numbered 1, 2, and 3 are presented to the recogniser as they occur, as the recogniser was available at the time the inputs commenced. However, in the case of inputs numbered 4, 5, and 6, the inputs had to be recorded as the recogniser was not available at the time the inputs commenced. As soon as the recogniser becomes available, the inputs are presented in a shortened form using one or more of the methods described above. The diagram illustrates that one recogniser can cope with two simultaneous conversations and still have a proportion of their time available. At the same time, the conversations proceed at a natural pace and are not delayed by the non-availability of the recogniser.
  • The fact that the input is recorded also allows the recogniser the opportunity to review all or a portion of the input if it is unclear for any reason. This may avoid the need to ask the respondent to repeat what was said.
  • The system will typically provide the recogniser with simple computer controls to allow this type of action.
  • The system will employ an algorithm whereby the frequency with which the method of reconstructing recorded input has to be employed is monitored. If the frequency is too high, the system may employ a number of methods to alleviate this condition.
  • One method is to introduce variable delays prior to starting another transaction (applicable if the transaction is of the form system output followed by respondent input).
  • Another method is to restrict the number of simultaneous conversations by not permitting a new conversation to commence when a previous one terminates.
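A minimal sketch of how such load monitoring might be implemented is given below. The rate threshold, delay value and class name are assumptions made for illustration; the patent itself does not specify these figures.

```python
# Illustrative load monitor: count how often stored input had to be reconstructed
# and replayed, and if that happens too often either delay the next transaction
# or stop admitting new conversations. Threshold and delay values are invented.
class LoadMonitor:
    def __init__(self, max_reconstruction_rate=0.2):
        self.transactions = 0
        self.reconstructions = 0
        self.max_rate = max_reconstruction_rate

    def record(self, needed_reconstruction):
        self.transactions += 1
        self.reconstructions += bool(needed_reconstruction)

    def overloaded(self):
        return self.transactions and self.reconstructions / self.transactions > self.max_rate

    def pre_transaction_delay_s(self):
        return 2.0 if self.overloaded() else 0.0     # variable delay before the next output

    def admit_new_conversation(self):
        return not self.overloaded()                 # otherwise wait for a conversation to end

monitor = LoadMonitor()
for needed in (False, True, True, False, True):
    monitor.record(needed)
print(monitor.pre_transaction_delay_s(), monitor.admit_new_conversation())
```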

Abstract

A dialogue system adapted to carry out a conversation between a system and a human respondent. The system comprises a dialogue generator (2, 16) adapted to electronically generate output dialogue which is routed to the human respondent, wherein the output dialogue invites a response from the human respondent. The system also comprises means (5) for receiving input dialogue responsive to the output dialogue in the form of a speech signal from the human respondent. The system further comprises one or more recogniser stations (8-15), each for communicating input dialogue to a respective human recogniser for speech recognition and semantic analysis. Apparatus for generating a compressed version of an item of speech. The item of speech has a first length and the compressed version has a second length shorter than the first length. The item of speech comprises a speech signal having one or more audible intervals and one or more substantially silent intervals. The apparatus comprises means (22) for removing the or each substantially silent interval from the speech signal to generate the compressed version of the item of speech.

Description

DIALOGUE SYSTEM

The present invention relates to a dialogue system and a method of operating a dialogue system. The invention also extends to a method of generating a compressed version of an item of speech.
Dialogue systems use pre-recorded or computer generated speech output to convey information (which may be imperative, informative, or interrogative) to a respondent.
Conventional dialogue systems require a respondent to input information by means of the dial or buttons on a telephone instrument. The number of pulses or the tone frequencies emitted by the instrument are interpreted by the dialogue systems by mapping them into an agreed set of semantic meanings. For example, a respondent might be instructed to 'press one' to indicate agreement, 'press two' to indicate disagreement, etc.
Newer dialogue systems have attempted to employ electronic apparatus to directly respond to the voice of a human respondent using speech recognition techniques which may be further coupled with semantic analysis algorithms to improve the accuracy of the dialogue process. However, such voice directed systems are still inaccurate unless the respondent is specially trained and in any event the spoken responses are limited in terms of the vocabulary and grammar which are acceptable for automatic recognition purposes.
In accordance with a first aspect of the present invention there is provided a dialogue system adapted to carry out a conversation between the system and a human respondent, the system comprising a dialogue generator adapted to electronically generate output dialogue which is routed to the human respondent, wherein the output dialogue invites a response from the human respondent; and means for receiving input dialogue responsive to the output dialogue in the form of a speech signal from the human respondent characterised in that the system further comprises one or more recogniser stations, each for communicating input dialogue to a respective human recogniser for speech recognition and semantic analysis.
References hereinafter to input or output should be understood to refer to input dialogue or output dialogue respectively.
The system generates and transmits output to the respondent (who may be either the initiator or the called party in the conversation) and the human recogniser is responsible for the interpretation of audible input received from the respondent (e.g., answers to survey questions) . The invention thus recognises the limitations of speech recognition algorithms, and provides a 'semi- automated' system in which the accuracy of the overall system is greatly improved. The system according to the present invention gives the following advantages.
Firstly, the fact that respondents may reply (or interrogate) in a completely natural way means that no time is wasted by having to instruct the respondents about a set of agreed response conventions. Individual conversations thus require less time to conduct.
Secondly, the use of human recognisers for input dialogue recognition and semantic analysis greatly increases the accuracy of the process over existing fully automated voice recognition systems.
Thirdly, the fact that the system imposes no constraints on the input from the respondent and responds accurately in most circumstances means that the use of such systems is acceptable to a greater proportion of the potential respondents using such systems.
Typically the system is adapted for connection to a communication medium such as the public switched telephone network (PSTN) or private networks of leased lines (e.g., travel agents, currency traders, etc.) to transmit and receive dialogue output or input dialogue to or from the respondent. If the respondent has suitable equipment (such as a personal computer with a monitor screen or a TV set capable of receiving and interpreting digital signals) then the output dialogue may be in the form of textual data displayed on the screen or TV set and viewed by the respondent. Typically however the output dialogue is an audio signal such as a pre-recorded or synthesised voice. A typical conversation consists of a series of transactions between the system and a respondent. Each transaction comprises an item of output dialogue from the system and a corresponding item of input dialogue from a respondent. Transactions may consist of an output followed by an input or conversely by an input followed by an output, depending on the nature of the conversation and the application to which the method is applied. Conversations with mixed types of transactions are also possible. The conversation may involve items of output dialogue which do not require a response from the respondent, but at least one item of output dialogue will require a response (typically it will be interrogatory, i.e. expressing or involving a question) .
The conversation may be initiated by the system or by the respondent. At the onset of the conversation the audible output from the system may be either a software controlled electronically generated audible output or the equivalent output spoken by a human initiator. A human initiator would be more typically used in conversations which were initiated by the system (e.g. , in survey research applications) . The initiator may also be a recogniser at other times. Ideally, only a small proportion of a conversation employs a human initiator and at some point in the conversation the output is taken over by the electronic components of the system.
Typically the means for receiving input dialogue comprises a routing device adapted to receive input dialogue from a plurality of human respondents in parallel and route the input dialogue to the or a selected one of the recogniser stations. This enables the system to carry out conversations between the system and a plurality of respondents. The conversations may be carried out sequentially or concurrently. Typically the system further comprises a dialogue manager for controlling the routing device whereby the system can carry out a plurality of concurrent conversations with a respective plurality of human respondents, the number of concurrent conversations being greater than the number of recogniser stations. This enables each recogniser to participate in more than one concurrent conversation. This reduces the number of recogniser stations and the number of human recognisers required. The concurrent conversations between the system and the respondents are typically independent of each other in that there is no synchronisation between them. Moreover, the nature and purpose of each conversation may differ one from the other. Each human recogniser is engaged to deal with the input of a single transaction at a time.
Typically the system comprises a plurality of recogniser stations and the means for receiving input dialogue comprises a routing device adapted to route the input dialogue to a selected one of said plurality of recogniser stations. When a human recogniser indicates to the system that the input has been recognised and interpreted they are considered by the system to be part of a 'pool' of available recognisers and immediately become available to participate in another transaction which may or may not involve the same respondent as the previous transaction. This effectively allows the system to carry out more simultaneous conversations than there are human recognisers in the pool.
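By way of a non-authoritative illustration, the pool behaviour described above can be sketched as follows; the class name, identifiers and the simple first-in-first-out policy are assumptions made for the example rather than details taken from the patent.

```python
# Sketch of a 'pool of available recognisers' shared among more concurrent
# conversations than there are recognisers. All names are illustrative.
from collections import deque

class RecogniserPool:
    def __init__(self, recogniser_ids):
        self.available = deque(recogniser_ids)   # recognisers free to take a transaction
        self.waiting = deque()                   # transactions whose input must be recorded

    def request(self, transaction_id):
        """Dialogue manager asks for a recogniser as an input is about to arrive."""
        if self.available:
            return self.available.popleft()      # input is switched straight through
        self.waiting.append(transaction_id)      # no recogniser free: record for later replay
        return None

    def release(self, recogniser):
        """Recogniser signals that the input has been recognised and interpreted."""
        if self.waiting:
            next_transaction = self.waiting.popleft()
            # immediately re-engage the recogniser on a transaction that was recorded
            print(f"recogniser {recogniser} replays stored input for {next_transaction}")
        else:
            self.available.append(recogniser)    # back into the pool of available recognisers

pool = RecogniserPool(["A", "B", "C", "D"])
assigned = pool.request("conversation-7/transaction-2")
pool.release(assigned)
```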
In the case of applications where transactions consist of outputs from the system followed by inputs from the respondent the human recognisers will typically be guided in their recognition and interpretation efforts by having a textual record of the output just transmitted to the respondent appear on a computer screen. The human recogniser is thus able to place the audible input received from the respondent in the context of the overall conversation.
The human recogniser may also take part in a conversation, or at least control the timing and/or content of the dialogue output generated by the system, as discussed below. In the case of a conversation consisting of outputs followed by inputs the system will typically generate and transmit a complete output and receive a corresponding input from the respondent. A human recogniser is selected from the pool of available recognisers on the system, and the audible input is switched through to the selected human recogniser. Typically the routing device is adapted to route the input dialogue to a human recogniser immediately following the end of an item of output dialogue, or at a predetermined interval before the end of an item of output dialogue. Alternatively, the routing device may be initiated by the receipt of input dialogue which in some cases may be received before the point where it is normally anticipated; i.e., the respondent may interrupt the normal course of the transaction before the output is completed. When a selected recogniser is presented with the input from a respondent they interpret the response and depending on the semantic content of the response initiate one of a number of possible actions. Typically an action will comprise classifying the input into one or more predefined semantic categories. For example the output dialogue of the system could have been a question to which the expected response is either affirmation or disagreement. The recogniser would in this case indicate to the system (e.g., by pressing an appropriate key) which of these two alternatives the input dialogue was most appropriately matched to. The system will typically also provide a number of standard actions to be employed in cases where the input does not fall into an expected response category.
The completion of the action taken by the recogniser when receiving the input from the respondent may be a signal to the system that another transaction in the dialogue may commence. The dialogue generator may also employ information taken from any prior action by a human recogniser in the conversation or logical combination of prior actions to direct the conversation at points where the conversation is allowed to branch. The recogniser is thereby responsible for the synchronisation of the pairs of input and output dialogue forming each transaction.
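The transaction flow just described (system output, human classification of the reply into a semantic category, branch selection) can be sketched as follows; the category names, keypresses and protocol step names are invented for illustration and are not taken from the patent.

```python
# Sketch of one output-input transaction: the system transmits a question, the
# human recogniser classifies the spoken reply (modelled here as a keypress),
# and the completed action selects the branch for the next transaction.
CATEGORIES = {"1": "affirmation", "2": "disagreement"}

def run_transaction(prompt, recogniser_keypress, branches):
    print(f"SYSTEM OUTPUT: {prompt}")
    category = CATEGORIES.get(recogniser_keypress, "unexpected")   # standard action otherwise
    # Completion of the recogniser's action signals that the next transaction may commence.
    return branches.get(category, branches["unexpected"])

next_step = run_transaction(
    "Would you like to take part in a short survey?",
    recogniser_keypress="1",
    branches={"affirmation": "begin_interview",
              "disagreement": "polite_goodbye",
              "unexpected": "hand_back_to_recruiter"},
)
print("Next protocol step:", next_step)
```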
In the case of transactions consisting of inputs from the respondent followed by outputs from the system, the system will normally wait for the respondent to commence with another transaction when the previous transaction has been completed.
Typically the system continuously monitors the input channel of the telephone line for each respondent to see if audible sound can be detected. At the same time the input is typically stored in a digital form. If the system does not detect audible sound, the recording is discarded.
Preferably, the system further comprises a store for storing input dialogue prior to playing back the stored input dialogue to a human recogniser. For certain conversation protocols the input may be required for later analysis and in this situation every item of input dialogue will typically be recorded and stored. Alternatively the means for receiving input dialogue may be adapted to route input dialogue to the store only when the or each recogniser station is occupied. The stored input dialogue can then be played back to a selected recogniser when that recogniser becomes available.
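A minimal sketch of this 'monitor, store, discard if silent' behaviour is given below, assuming the telephone input is available as 16-bit PCM samples and using an arbitrary amplitude threshold for 'audible sound'; neither assumption comes from the patent.

```python
# Sketch: record each respondent's input channel in digital form, discard the
# recording if no audible sound is detected, otherwise keep it in the store.
import array

SILENCE_THRESHOLD = 500      # assumed amplitude below which a sample counts as silence

def monitor_input(samples, store):
    """samples: signed 16-bit PCM values captured from one respondent's line."""
    recording = array.array("h", samples)                 # input is stored in digital form
    audible = any(abs(s) > SILENCE_THRESHOLD for s in recording)
    if not audible:
        return None                                       # no audible sound: recording discarded
    store.append(recording)                               # kept, e.g. for replay to a recogniser
    return recording

store = []
monitor_input([0, 3, -2, 1] * 200, store)                 # silence only: discarded
monitor_input([0, 900, -1200, 400] * 200, store)          # speech-like: retained
print(len(store), "recording(s) retained")
```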
In accordance with a second aspect of the invention there is provided a method of operating a dialogue system to carry out a conversation between the system and a human respondent, the method comprising electronically generating output dialogue and routing the output dialogue to the human respondent, wherein the output dialogue invites a response from the respondent, the method further comprising receiving input dialogue responsive to the output dialogue in the form of a speech signal from the human respondent and communicating the input dialogue to a human recogniser for speech recognition and semantic analysis.
In accordance with a third aspect of the present invention there is provided a method of operating a dialogue system to carry out a plurality of conversations between the system and a respective plurality of human respondents, the method comprising, for each conversation, generating output dialogue and routing the output dialogue to a respective human respondent, wherein the output dialogue invites a response from the respective human respondent, the method further comprising receiving input dialogue responsive to the output dialogue in the form of a speech signal from each respondent, and communicating the input dialogue from each respondent to an available human recogniser for speech recognition and semantic analysis.
In accordance with a fourth aspect of the present invention, there is provided a method of generating a compressed version of an item of speech, the item of speech having a first length and the compressed version having a second length shorter than the first length, and wherein the item of speech comprises a speech signal having one or more audible intervals and one or more substantially silent intervals, the method comprising removing the or each substantially silent interval from the speech signal to generate the compressed version of the item of speech.
In accordance with a fifth aspect of the present invention there is provided apparatus for generating a compressed version of an item of speech, the item of speech having a first length and the compressed version having a second length shorter than the first length, and wherein the item of speech comprises a speech signal having one or more audible intervals and one or more substantially silent intervals, the apparatus comprising means for removing the or each substantially silent interval from the speech signal to generate the compressed version of the item of speech. The fourth and fifth aspects of the present invention provide a particularly efficient and simple method of providing a recognisable compressed version of an item of speech which, when played back, has a shorter length than the original item of speech. Typically the method is employed in a system according to the first aspect of the invention, or as part of the method according to the second or third aspects of the invention.
In this case the system typically comprises a store for storing input dialogue prior to playing back the stored input dialogue to a human recogniser. When a human recogniser is not available, the input dialogue is typically stored and played back when a human recogniser becomes available. The input dialogue is compressed according to the fourth aspect of the invention either before or after storing, and is played back to the selected human recogniser, typically whilst the item of input dialogue is still being received. The compressed version of the original item of speech is shorter, and therefore allows the selected human recogniser to catch up with the ongoing input dialogue. However the compressed version maintains enough information for the human recogniser to carry out the required speech recognition and semantic analysis. The method may further comprise removing short slices (typically 10ms slices) of the audible portions of the speech signal.
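The compression idea of the fourth aspect can be sketched as follows, assuming 8 kHz 16-bit PCM telephone audio, 10 ms frames and an arbitrary silence threshold; beyond the roughly 100 ms silence limit and 10 ms slices mentioned above, these figures are assumptions for illustration only.

```python
# Sketch: truncate embedded silences longer than ~100 ms and, if more speed-up
# is needed, drop every nth 10 ms slice of the audible material.
RATE = 8000                          # assumed telephone sampling rate (samples per second)
FRAME = RATE // 100                  # 10 ms of samples
SILENCE_THRESHOLD = 500              # assumed amplitude below which a frame counts as silent

def compress(samples, max_silence_ms=100, drop_every_nth_audible_frame=0):
    out = []
    silent_run_ms = 0
    audible_frames = 0
    for i in range(0, len(samples), FRAME):
        frame = samples[i:i + FRAME]
        if max(abs(s) for s in frame) < SILENCE_THRESHOLD:
            silent_run_ms += 10
            if silent_run_ms > max_silence_ms:
                continue                              # truncate the embedded silence
        else:
            silent_run_ms = 0
            audible_frames += 1
            if drop_every_nth_audible_frame and audible_frames % drop_every_nth_audible_frame == 0:
                continue                              # remove a short 'slice' of audible sound
        out.extend(frame)
    return out

speech = [0] * 4000 + [1000, -1000] * 2000 + [0] * 4000   # toy signal: silence, speech, silence
print(len(speech), "->", len(compress(speech, drop_every_nth_audible_frame=4)))
```

The drop rate would in practice be adjusted dynamically so that the reconstructed playback catches up with the input still being received.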
An embodiment of the present invention will now be described with reference to the accompanying Figures, in which:- Figure 1 is a schematic block diagram of an example of an Interactive Voice Response (IVR) dialogue system according to the invention;
Figure 2 is a logic diagram of the recogniser allocation and de-allocation process;
Figure 3 illustrates two simultaneous conversations and the activity of a single recogniser; and
Figure 4 illustrates an example of a speech compressor according to the fifth aspect of the invention. Figure 1 is a schematic diagram of an embodiment of a dialogue system according to the present invention. The illustrated system architecture is one example of a number of possible architectures which could be implemented for a system according to the invention. Pictured in Figure 1 are a number of blocks which represent electronic hardware devices or software modules. In the discussion which follows each of these blocks is described.
A Dialogue Manager (DM) 1 is the central controlling piece of software. The DM 1 is able to communicate directly with three further software modules, namely the Output Server (OS) 2, Input Server (IS) 3, and Back end database and database servers (DBS) 4. In addition, the DM 1 communicates directly with a hardware switch 5. The DM 1 can incorporate the software logic which controls the overall execution of a dialogue for a given application. More typically the DM 1 will be driven by a database of instructions which together comprise a protocol for a conversation. It would be expected that systems would have the capability of executing a number of such protocols simultaneously and in principle each respondent currently interacting with the system could be doing so under the control of a different protocol.
The Switch 5 will typically be a computer incorporating special purpose electronic boards which permit a variety of telephony related functions. To the right of the Switch are shown thirteen telephone lines 6 which originate from the Public Switched Telephone Network (PSTN) 7 or from a private network or possibly a mixture of the two.
To the left of the Switch 5 are shown connections to four telephones 8-11 which would more typically be headsets worn by the human recognisers. The telephones 8-11 are each situated next to a personal computer (PC) 12-15. The combination of a telephone and a PC is called a Recogniser Station (RS). The number of telephone lines and the number of Recogniser Stations attached to the switch is flexible, but the numbers shown in Figure 1 indicate that the system is capable of handling a larger number of ongoing dialogues than there are human recognisers associated with the system. The Input Server (IS) module 3 is connected directly to each of the PCs 12-15 which form part of each Recogniser Station. The IS module 3 is responsible for keeping track of the status of each RS, each of which is usually in a state of being 'assigned' or conversely 'available' with respect to a transaction which is in progress within the system.
The IS 3 is initially contacted by the DM 1 which requests the services of a recogniser. The IS 3 will respond with an acknowledgement that a recogniser has been assigned to the transaction or conversely that none is currently available. When a recogniser has been assigned to a transaction, the IS 3 is responsible for placing messages and information on the screen of the PC 12-15 of the Recogniser Station, and accepting input from the PC 12-15. The input from the PC 12-15 can originate by any of the conventional methods of human interaction with the PC including keyboard input, mouse or tracker ball input, or even voice input.
When input is received from a Recogniser Station, the IS module 3 will validate the response in the context of the overall controlling protocol which controls the conversation. The IS 3 will be aware of which protocol is appropriate by virtue of information passed on from the DM 1 at the time the request was made to assign a recogniser. In the case of invalid input, the IS 3 will send an appropriate error message to the Recogniser Station and await corrected input. When valid input has been received for the transaction, the value (contents) of the input will be transmitted to the DM 1. At the same time, the recogniser will be released from the transaction and placed into the pool of available recognisers.
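The Input Server's validation step might be sketched as follows, under the assumption that the controlling protocol supplies, per transaction, the set of responses a recogniser may legitimately key in; the function and field names are illustrative.

```python
# Sketch: validate a recogniser's keyed-in response against the protocol step,
# return an error to the Recogniser Station on invalid input, otherwise pass
# the value to the DM and release the recogniser back into the pool.
def handle_recogniser_input(value, protocol_step, send_to_dm, send_to_rs, release):
    allowed = protocol_step["expected_responses"]
    if value not in allowed:
        send_to_rs(f"'{value}' is not valid here; expected one of {sorted(allowed)}")
        return False                      # await corrected input from the station
    send_to_dm({"transaction": protocol_step["id"], "value": value})
    release()                             # recogniser rejoins the pool of available recognisers
    return True

step = {"id": "call-type", "expected_responses": {"uk_charge_call", "international", "operator"}}
handle_recogniser_input("uk_charge_call", step,
                        send_to_dm=print, send_to_rs=print,
                        release=lambda: print("recogniser released"))
```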
When the output part of a transaction is due to be transmitted to a respondent, the process is initiated by the DM module 1. The DM module 1 will indicate which output is to be transmitted to which outgoing telephone line 6, sending this information to the Output Server (OS) 2.
The OS module 2 is responsible for assembling or generating the required output stream. Typically, this will consist of a digitised wave form which is then sent to a Voice Response hardware unit 16. The VR unit 16 can pass the information on in a digital form or if required it can convert the digital information into conventional analogue signals which are sent through the switch 5 and down the appropriate telephone line 6.
In a number of cases the output stream will need to be assembled from a collection of pre-recorded or synthesised elements. For example, in a directory enquiry application, pre-recorded elements might be the phrase "The number you require is ...", the ten digits "one", "two", etc., and the phrase "I repeat, ...". In this example, it would be possible for the output stream to contain any complete telephone number, spoken digit by digit, following the phrase "The number you require is ...".
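A sketch of how such an output stream might be assembled is given below; the file names standing in for the digitised pre-recorded elements are invented for the example.

```python
# Sketch: assemble a directory-enquiry announcement from pre-recorded elements,
# each named here by the file that would hold its digitised waveform.
DIGIT_PROMPTS = {d: f"digit_{d}.wav" for d in "0123456789"}

def assemble_number_announcement(number):
    digits = [DIGIT_PROMPTS[d] for d in number if d.isdigit()]
    stream = ["number_is.wav"]          # "The number you require is ..."
    stream += digits
    stream += ["i_repeat.wav"]          # "I repeat, ..."
    stream += digits
    return stream                        # sent on to the VR unit 16 for playback

print(assemble_number_announcement("0171 234 5678"))
```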
The system also permits direct output to originate from the Recogniser Stations, bypassing in effect the OS 2. This may be required in cases where the protocol employed has not anticipated a situation which arises as determined from the content of the input received from the respondent. In such cases, the recogniser is able to use their judgement and simply speak into the telephone 8-11 directly to the respondent. The system will typically provide for a keyboard action to be relayed to the DM 1 via the IS 3 to inform the DM of the special situation.
The DM 1 is directly connected to the DBS module 4, which acts to provide a number of specialised functions and also serves as a repository of the information which may be collected by some conversations. Each conversation is under the control of a protocol which defines the possible transactions and in some cases the permitted sequence of transactions. The protocols may include provision for branching during the conversation process, and the action of branching will typically be controlled by the DM 1 using information stored in the DBS module 4.
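One possible way of representing such a protocol, with branching keyed on the recognised meaning of the respondent's input, is sketched below. The dictionary layout and the transaction names are illustrative assumptions, not the specification's own format; the prompts are abbreviated from the examples later in the description.

```python
# Each named transaction carries the output to be spoken and the branches the DM may follow.
PROTOCOL = {
    "greeting":        {"output": "How may I help you today?",
                        "branches": {"charge_call": "ask_card_number"}},
    "ask_card_number": {"output": "May I have your charge card number please?",
                        "branches": {"card_number_given": "ask_destination"}},
    "ask_destination": {"output": "May I have the telephone number you require please?",
                        "branches": {"destination_given": None}},   # None terminates the protocol
}


def next_transaction(current: str, recognised_input: str):
    """Dialogue-manager style branching decision over the structure above."""
    return PROTOCOL[current]["branches"].get(recognised_input)
```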
Applications are conventionally divided into 'inbound' and 'outbound'. Inbound applications are those where the conversation originates with a call placed by the respondent. The mapping of such an inbound call to a specific protocol is usually done on the basis of the telephone line number which is called. Typically there will be more telephone line numbers (virtual lines) than there are physical telephone lines coming into the Switch 5.
In an inbound application, the Switch 5 will typically have the facility to determine which telephone line number has been rung. The Switch 5 will then communicate with the DM 1 passing on the information that a new conversation is to commence and passing on the virtual line number. The DM 1 is then able to make a determination of the appropriate protocol to employ during the conversation process, and request the details of the protocol from the DBS module 4.
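A minimal sketch of that mapping step follows; the virtual numbers, the protocol names and the `load_protocol` call are hypothetical, introduced only to illustrate the lookup the DM performs.

```python
# Hypothetical virtual numbers; in practice the mapping would be held in the DBS module 4.
VIRTUAL_LINE_PROTOCOLS = {
    "0800111222": "charge_card_protocol",
    "0800111333": "switchboard_protocol",
}


def start_inbound_conversation(dialled_number: str, dbs):
    """The Switch reports the dialled (virtual) number; the DM selects and loads the protocol."""
    protocol_name = VIRTUAL_LINE_PROTOCOLS.get(dialled_number, "default_protocol")
    return dbs.load_protocol(protocol_name)    # protocol details requested from the DBS module
```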
Inbound applications can begin with an initial output which can be either spoken by a human or generated and transmitted through the VR unit 16. The following is an example of an inbound dialogue which commences with an initial output from the VR unit 16.

Event 1 (system action): The OS and VR unit output "Good evening, this is BT Direct. How may I help you today?" Recogniser 'A' is assigned.

Event 2 (respondent action): "I'd like to use my BT charge card to ring the United Kingdom please."

Event 3 (system and recogniser actions): The system puts the call and Recogniser 'A' status, 'initial call', on the screen together with a menu comprising all possible expected requests from respondents contacting this number. Recogniser 'A' indicates, by an action on the PC keyboard, that a normal UK charge call is desired, so informing the DM. The recogniser is then de-assigned and returned to the pool.

Event 4 (system action): The OS and VR unit output "Certainly. May I have your charge card number please?" Recogniser 'B' is then assigned.

Event 5 (respondent action): Reads out the card number (probably in groups).

Event 6 (system and recogniser actions): The system puts the Recogniser 'B' status information on the screen so that the recogniser is expecting the respondent to read out the card number. Recogniser 'B' keys in the numbers as they are spoken, terminating with a 'return' key when the respondent pauses.

Event 7 (system action): The OS and VR unit repeat the numbers as keyed in by the recogniser. When the number is complete, the number is validated. If it is valid, recogniser 'B' is immediately de-assigned and the following output is sent: "Thank you. May I have the telephone number in the United Kingdom you require please?" Recogniser 'C' is then assigned.

Event 8 (respondent action): Recites the UK telephone number wanted, probably in groups.

Event 9 (system and recogniser actions): The system places the Recogniser 'C' status information on the screen, and Recogniser 'C' keys in the numbers as they are recited.

Event 10 (system action): The OS and VR unit repeat the numbers as keyed in by the recogniser. When the number is complete, automatic dialling commences and recogniser 'C' is de-assigned. The OS and VR unit say "Thank you. Your number is ringing now."
In the series of transactions described above, there were three logical human recognisers "A", "B" and "C" employed during the execution of the protocol. The actual recognisers assigned could be the same one, and typically if the system was not busy, the system logic could allow the same recogniser to participate in every transaction. At busy times, recognisers would be de-assigned as soon as possible and then re-assigned to the next required task.
The protocol commenced with an output-input type transaction (events 1 and 2, followed by 3). Event 4 commences with another output-input transaction, which may then split into a variable number of further output-input transactions as the respondent calls out the required number (events 5, 6 and 7). Event 7 also commences a new output-input transaction, asking for the desired telephone number. The protocol terminates after event 10, with the last action being an output from the system signifying that dialling is now in progress.

Another possible inbound application for this invention is to control a multi-line switchboard for a company or organisation. In this case, there may be no special significance attached to the number which the respondent dials. An example conversation is given below.
Event 1 (system action): The OS and VR unit output "Good <part of day>, this is Pulse Train Technology. Please state the person or the department of the company you wish to speak with please." If more than one telephonist is available, one is assigned to the call at this point.

Event 2 (respondent and system/recogniser actions): "I'd like to speak with the sales department." The system puts the call status, 'operator - initial call', on the screen. The assigned recogniser listens to the response and selects the appropriate extension number.

Event 3 (system action): The selected operator is immediately freed to undertake another task. The extension number selected is used to select an appropriate response: "Just a moment please. I'm putting you through to <name of person> in our sales department."
Outbound applications can also begin with an initial output which can be either spoken by a human or generated and transmitted through the VR unit 16. However, the prevailing circumstances in the initial state of an outbound call are more heterogeneous than those of inbound calls, and typically the first output from the system will be spoken by a human. The system would simply be expected to instruct one of the recognisers to carry out this task. In the context of a common application, research interviews, the task of the recogniser is to act as a recruiter: to get the appropriate person at the telephone number contacted onto the telephone and then to persuade them to enter into a conversation which constitutes an interview. There are a number of possible outcomes to this activity, ranging from the initiation of the interview proper to finding that the appropriate person is not available or is unwilling to participate. In any case the recogniser is expected to indicate to the system the outcome status, which will be used to select the next step in the controlling protocol.
The system may incorporate facilities for auto-dialling or predictive dialling in outbound applications. These facilities, if incorporated, are controlled by special algorithms in the DM 1 which require information to be passed to and from the Switch 5. The Switch 5 has the appropriate circuitry to initiate dialling attempts.
In the case of a respondent who is initially contacted by a human recogniser acting as a recruiter and who also agrees to be interviewed via an IVR process, the system might output an initial continuity message through the OS 2 and VR unit 16. For example, the message might go as follows:
"Hello, I'm the person who will ask you the rest of our questions. Thanks for agreeing to be interviewed this way - we really appreciate it! And just remember, if you have any special questions, or you want to change your mind about something you said previously, or you just want to talk to <recruiter's name> again, all you have to do is say so. OK, if you're ready, let's start with the first question ..." .
The dialogue manager 1 is constantly aware of the length of the question it is currently asking, and inspects the pool of dedicated answer recognisers. At some point (a few seconds from the end of the question) an answer recogniser is selected, and the question and precoded answers (if relevant) are brought up on the screen, thus alerting them to the impending answer. At the moment the question is finished, the selected recogniser is connected through to the incoming wire of the telephone line (not the outgoing wire). The survey process is paused, awaiting the response or some instruction from the recogniser. The system also always records and stores the respondent's response in DBS 4 at the same time it is played through to the answer recogniser.
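The timing of this step might be sketched roughly as follows, assuming hypothetical pool, line and DBS objects; none of the method names below are taken from the specification.

```python
import time


def ask_question_and_capture(question_duration_s, recogniser_pool, line, dbs, alert_margin_s=3.0):
    """Alert an answer recogniser shortly before the question ends, then bridge the
    incoming side of the telephone line to that recogniser while the response is
    also recorded.  All objects and calls are illustrative stand-ins, not a real API."""
    time.sleep(max(0.0, question_duration_s - alert_margin_s))   # question still playing
    recogniser = recogniser_pool.pick_available()
    recogniser.show_question_and_precoded_answers()              # alert: answer is impending
    time.sleep(min(alert_margin_s, question_duration_s))         # remainder of the question
    line.bridge_incoming_to(recogniser)    # incoming wire only; recogniser is not heard
    dbs.record(line.incoming_audio())      # response stored at the same time it is relayed
```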
The recogniser listens, and can take any of the following actions (a dispatch sketch follows the list):
• Indicate the answer. The answer is stored in DBS 4. This would be the normal response, and it may trigger a "Thank you" acknowledgement from the VR unit 16, and release the system for the next question.
• Play a continuity recording prior to hand back, such as "Just a moment please, I'm going to hand you back/over to <name of selected person>". This is done whenever there is a problem, such as the respondent saying they wish to go back (to change a previous reply). The system immediately selects (in order of priority) (a) the original recruiter, (b) a substitute recruiter, or (c) the recogniser themselves. In the case of (a) or (b) the selected person gets an alert message on their screen, with some indication of the problem, and then hears the prerecorded response just made by the informant (or at least the end of it). This process takes place at the same time the respondent is listening to the continuity recording. The respondent is thus smoothly handed back to a live person (recruiter or recogniser), and the priority of selection, with the recogniser themselves as the last resort, guarantees there will be no interruption. At the same time, the recruiter will be aware (to some extent at least) of the nature of the problem, and be well equipped to deal with it.
• Take over immediately. If, in the judgement of the recogniser, the situation warrants immediate action, or if the situation is too complex to be handled via the (possibly truncated - from the beginning, not the end) recording of the respondent's problem, they can take over immediately, with or without playing the continuity recording.
• Release the system for another question, but review the answer before replying. The recogniser might be slightly uncertain about the response just made, but be none the less confident that they will be able to make a judgement (maybe they couldn't spot the brand name because the list was too long, for example) . In this case, a single keystroke (or mouse click) releases the system, but allows the recogniser to enter the answer afterwards. It is possible, of course, to listen to the response just made again via the digital recording. This may not be required, as all the recogniser may need to do is to find the correct response on their screen.
• Interviewer's special actions. An augmented list of the special actions may be available to the recogniser. Some of these may be (a) silent, (b) involve a handback to a recruiter or to themselves, or (c) possibly branch off into a special dialogue. For example, if an appointment is made, it may be possible to do this through electronically generated dialogue as well, thus maintaining contact with the prerecorded voice throughout.
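A minimal dispatch over these actions might look like the sketch below; the action labels and helper methods are illustrative assumptions rather than the specification's own interface.

```python
def handle_recogniser_action(action, payload, dbs, vr_unit, survey):
    """Dispatch over the recogniser actions listed above (all names are illustrative)."""
    if action == "answer":
        dbs.store_answer(payload)                 # normal case
        vr_unit.play("thank_you")                 # optional acknowledgement
        survey.release_for_next_question()
    elif action == "hand_back":
        vr_unit.play("continuity_recording")      # "Just a moment please ..."
        survey.hand_back(priority=("recruiter", "substitute", "recogniser"))
    elif action == "take_over":
        survey.connect_recogniser_live()          # immediate human takeover
    elif action == "review_later":
        survey.release_for_next_question()        # answer keyed in afterwards; recording kept
    elif action == "special":
        survey.run_special_dialogue(payload)      # e.g. electronically generated appointment making
```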
In some cases, the protocol for a conversation will indicate that the entire spoken response of a respondent to a particular question or questions is desired for possible future use and analysis. In this case, the output-input transaction follows the same course of action as previously described for other output-input transactions, with the exception that recording of the response is no longer optional. The system would be expected to preserve the digitised responses via the DBS module 4 on a storage device (not shown). The function of the recogniser in such questions (which are normally referred to in the market research industry as 'open-ended questions') is to determine the appropriate point at which the informant has given the entire response. In some cases, the recogniser may be instructed to encourage the respondent to give more detail. In this case, a menu of prompts such as "Tell me a bit more" or "Is there anything you would like to add to that?" can be selected via an action of the recogniser, which in turn triggers the output via the OS and VR unit. When the recogniser has determined that the spoken information obtained from the informant is complete, an action is taken which releases the system to continue with the protocol.

Because the simultaneous conversations on the system are independent of each other, the number of human recognisers required by the system to service the input part of the conversations will vary from moment to moment. There exists, therefore, the possibility that the system may at some time require more human recognisers than the number available in the pool. In this case the system will continue to record the input from the respondents not yet assigned to a human recogniser and will assign a recogniser to the transaction as soon as one becomes available. It then becomes necessary to present to the recogniser a reconstruction of the input from the moment the respondent began to speak, while at the same time the ongoing input for the same transaction continues to be recorded. The system will replay the recorded input to the recogniser in less elapsed time than was required to record it, in an attempt to catch up with the ongoing input, at which point the system will switch the live input directly to the recogniser.

The speeding up of the process of input reconstruction is accomplished by the truncation of the short silences that are naturally present in all human speech. Typically, recorded embedded silences of greater than 100 milliseconds will be reduced to this amount of time or less during the reconstruction process. In extreme conditions the system may resort to elimination of parts of the audible signal as well, by frequent removal of short 'slices' (i.e. intervals) of sound. A typical slice of removed audible material might be of 10 msec duration, and the frequency at which these slices are removed would be dynamically adjusted to permit the reconstructed input to catch up with input still being received, while preserving as much of the natural input cadence as possible under the circumstances. It is possible to cut about 50% of the material (including silences) in continuous speech and still understand it when it is played back.
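A rough sketch of the silence truncation and slice removal is given below. Only the 100 ms silence limit and the 10 ms slice length come from the description; the amplitude threshold, the sample handling and the drop-every-n parameter are assumptions made for illustration.

```python
def reconstruct_for_replay(samples, sample_rate, silence_threshold=500,
                           max_silence_ms=100, slice_ms=10, drop_every_n_slices=0):
    """Shorten recorded input for faster replay: embedded silences longer than
    max_silence_ms are cut down to that limit, and under heavy load one audible
    slice in every n can also be dropped (parameter values are illustrative)."""
    max_silence = int(sample_rate * max_silence_ms / 1000)
    slice_len = max(1, int(sample_rate * slice_ms / 1000))

    # 1. Truncate embedded silences beyond the permitted length.
    out, silent_run = [], 0
    for s in samples:
        if abs(s) < silence_threshold:
            silent_run += 1
            if silent_run > max_silence:
                continue                      # excess silence is simply not replayed
        else:
            silent_run = 0
        out.append(s)

    # 2. Optionally remove one short slice in every n to catch up further.
    if drop_every_n_slices:
        kept = []
        for i in range(0, len(out), slice_len):
            if (i // slice_len) % drop_every_n_slices != 0:
                kept.extend(out[i:i + slice_len])
        out = kept
    return out
```

With `drop_every_n_slices=2` roughly half of the remaining material is removed, which is consistent with the observation that about 50% of continuous speech can be cut and still be understood.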
Speech compression is achieved by a speech compressor 17 illustrated in detail in Figure 4. When an item of speech is to be compressed, the DM 1 causes the Switch 5 to route the speech signal onto line 18, which forms an input to the speech compressor 17. As shown in Figure 4, the speech compressor 17 comprises an analogue-to-digital convertor 25 (optionally located in the Switch 5) which outputs a digital signal to a two-way switch 21. The switch 21 routes the digital signal under normal circumstances to a threshold device 22. The threshold device 22 determines whether the digital signal is above or below a predetermined level. If it is below the predetermined level, the signal is discarded at 23. If the signal is above the predetermined level, the signal is output to DBS 4 via output 20. In extreme circumstances when the system is busy, the DM 1 causes the switch 21 to route the digital signal to a slice remover 23, which removes intervals from the digital speech signal at a frequency determined by a signal from the DM 1 on control line 24.

Figure 2 shows a logic diagram of the recogniser allocation and de-allocation process, which also incorporates the typical (but optional) logic for recording and replaying the truncated reconstructed input to the recogniser when it is required.
Figure 3 illustrates a number of the above points. The figure commences with two alternating shaded and unshaded bars labelled 'D1' and 'D2'. These two bars represent conversations which are taking place simultaneously within a single system. The shaded portions of each bar represent the output intervals, and the unshaded and numbered portions are the input intervals where input was received from the two respondents.
Below the D1 and D2 bars is a broken bar labelled 'R1', which represents the activity of a single recogniser active on the system during the conversations. The numbered sections of R1 represent the input portions taken from the D1 and D2 bars above and show the point at which they are presented to the recogniser. Notice that inputs numbered 1, 2 and 3 are presented to the recogniser as they occur, since the recogniser was available at the time the inputs commenced. However, in the case of inputs numbered 4, 5 and 6, the inputs had to be recorded because the recogniser was not available at the time the inputs commenced. As soon as the recogniser becomes available, the inputs are presented in a shortened form using one or more of the methods described above. The diagram illustrates that one recogniser can cope with two simultaneous conversations and still have a proportion of their time available. At the same time, the conversations proceed at a natural pace and are not delayed by the non-availability of the recogniser.
The fact that the input is recorded also gives the recogniser the opportunity to review all or a portion of the input if it is unclear for any reason. This may avoid the need to ask the respondent to repeat what was said. The system will typically provide the recogniser with simple computer controls to allow this type of action.

Typically the system will employ an algorithm whereby the frequency with which recorded input has to be reconstructed and replayed to recognisers is monitored. If the frequency is too high, the system may employ a number of methods to alleviate this condition. One method is to introduce variable delays prior to starting another transaction (applicable if the transaction is of the form system output followed by respondent input). Another method is to restrict the number of simultaneous conversations by not permitting a new conversation to commence when a previous one terminates.
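A simple sketch of such monitoring and throttling logic is given below; the rate threshold and the 0.5 s delay step are arbitrary illustrative values, not figures from the specification.

```python
class LoadGovernor:
    """Monitors how often recorded input must be reconstructed and relieves the load."""

    def __init__(self, max_reconstruction_rate=0.2):
        self.max_rate = max_reconstruction_rate
        self.extra_delay_s = 0.0           # delay inserted before starting another transaction
        self.allow_new_conversations = True
        self.transactions = 0
        self.reconstructed = 0

    def record_transaction(self, needed_reconstruction: bool):
        self.transactions += 1
        self.reconstructed += int(needed_reconstruction)

    def adjust(self):
        rate = self.reconstructed / max(1, self.transactions)
        if rate > self.max_rate:
            self.extra_delay_s += 0.5                # slow the pace of new transactions
            self.allow_new_conversations = False     # do not replace a conversation that ends
        else:
            self.extra_delay_s = max(0.0, self.extra_delay_s - 0.5)
            self.allow_new_conversations = True
```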

Claims

1. A dialogue system adapted to carry out a conversation between the system and a human respondent, the system comprising a dialogue generator adapted to electronically generate output dialogue which is routed to the human respondent, wherein the output dialogue invites a response from the human respondent; and means for receiving input dialogue responsive to the output dialogue in the form of a speech signal from the human respondent characterised in that the system further comprises one or more recogniser stations, each for communicating input dialogue to a respective human recogniser for speech recognition and semantic analysis.
2. A system according to claim 1 wherein the means for receiving input dialogue comprises a routing device adapted to receive input dialogue from a plurality of human respondents in parallel and route the input dialogue to the or a selected one of the recogniser stations.
3. A system according to claim 2 further comprising a dialogue manager for controlling the routing device whereby the system can carry out a plurality of concurrent conversations with a respective plurality of human respondents, the number of concurrent conversations being greater than the number of recogniser stations.
4. A system according to any of the preceding claims wherein the system comprises a plurality of recogniser stations and the means for receiving input dialogue comprises a routing device adapted to route the input dialogue to a selected one of said plurality of recogniser stations.
5. A system according to any of the preceding claims wherein the system further comprises a store for storing input dialogue prior to playing back the stored input dialogue to a human recogniser.
6. A system according to claim 5 wherein the means for receiving input dialogue is adapted to route input dialogue to the store only when the or each recogniser station is occupied.
7. A system according to claim 5 or 6 wherein the system further comprises a speech compressor for shortening the stored input dialogue before or after it is stored, whereby the shortened version of the stored input dialogue is played back to the human recogniser.
8. A system according to claim 7, wherein the speech compressor is adapted to eliminate the time in excess of a predetermined limit of substantially silent intervals contained within the input dialogue and/or to eliminate slices of audible input dialogue at frequent intervals.
9. A system according to any of the preceding claims, wherein the means for receiving input dialogue is adapted to route the input dialogue to a recogniser station at the end of an item of output dialogue, or at a predetermined interval before the end of an item of output dialogue.
10. A system according to any of the preceding claims, wherein the or each recogniser station comprises an input device for the human recogniser to input the result of the speech recognition and semantic analysis.
11. A system according to claim 10, wherein the input result determines the next item of output dialogue generated by the dialogue generator in the conversation.
12. A system according to claim 10 or claim 11, wherein the input result triggers the generation of the next item of output dialogue generated by the dialogue generator in the conversation.
13. A system according to any of the preceding claims, wherein the system is adapted for connection to a communication medium such as the public switched telephone network (PSTN) or a private network of telephone lines.
14. A method of operating a dialogue system, to carry out a conversation between the system and a human respondent, the method comprising electronically generating output dialogue and routing the output dialogue to the human respondent, wherein the output dialogue invites a response from the respondent, the method further comprising receiving input dialogue responsive to the output dialogue in the form of a speech signal from the human respondent and communicating the input dialogue to a human recogniser for speech recognition and semantic analysis.
15. A method of operating a dialogue system, to carry out a plurality of conversations between the system and a respective plurality of human respondents the method comprising, for each conversation, generating output dialogue and routing the output dialogue to a respective human respondent, wherein the output dialogue invites a response from the respective human respondent, the method further comprising receiving input dialogue responsive to the output dialogue in the form of a speech signal from each respondent, and communicating the input dialogue from each respondent to an available human recogniser for speech recognition and semantic analysis.
16. A dialogue system or a method of operating a dialogue system substantially as hereinbefore described with reference to the accompanying drawings.
17. A method of generating a compressed version of an item of speech, the item of speech having a first length and the compressed version having a second length shorter than the first length, and wherein the item of speech comprises a speech signal having one or more audible intervals and one or more substantially silent intervals, the method comprising removing the or each substantially silent interval from the speech signal to generate the compressed version of the item of speech.
18. A method according to claim 17 and claim 14 or claim 15, wherein the item of speech comprises an item of the input dialogue.
19. Apparatus for generating a compressed version of an item of speech, the item of speech having a first length and the compressed version having a second length shorter than the first length, and wherein the item of speech comprises a speech signal having one or more audible intervals and one or more substantially silent intervals, the apparatus comprising means for removing the or each substantially silent interval from the speech signal to generate the compressed version of the item of speech.
20. Apparatus according to claim 19 further comprising means for removing intervals from the speech signal at a selected frequency.
21. Apparatus according to claim 20 further comprising means for dynamically varying the selected frequency.
22. A system according to any of claims 1 to 13, further comprising apparatus according to any of claims 19 to 21, wherein the item of speech comprises an item of the input dialogue.