GB2304957A - Voice-dialog system for automated output of information - Google Patents

Voice-dialog system for automated output of information Download PDF

Info

Publication number
GB2304957A
Authority
GB
United Kingdom
Prior art keywords
utterance
voice
utterances
user
identifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB9618308A
Other versions
GB2304957B (en)
GB9618308D0 (en)
Inventor
Georg Fries
Karlheinz Schuhmacher
Antje Wirth
Bernhard Kaspar
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Deutsche Telekom AG
Original Assignee
Deutsche Telekom AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Deutsche Telekom AG filed Critical Deutsche Telekom AG
Publication of GB9618308D0 publication Critical patent/GB9618308D0/en
Publication of GB2304957A publication Critical patent/GB2304957A/en
Application granted granted Critical
Publication of GB2304957B publication Critical patent/GB2304957B/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Links

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/42Systems providing special services or facilities to subscribers
    • H04M3/487Arrangements for providing information services, e.g. recorded voice services or time announcements
    • H04M3/493Interactive information services, e.g. directory enquiries ; Arrangements therefor, e.g. interactive voice response [IVR] systems or voice portals
    • H04M3/4931Directory assistance systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • G10L15/183Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/187Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M2201/00Electronic components, circuits, software, systems or apparatus used in telephone systems
    • H04M2201/40Electronic components, circuits, software, systems or apparatus used in telephone systems using speech recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

A voice-dialog system outputs information, in particular a telephone number. An alphabet-identifier identifies an utterance which is spelt out by the user and selects utterances that can be spelt in a similar manner from a plurality of predetermined utterances; an utterance-identifier compares the utterance input by the user with the utterances selected by the alphabet-identifier and supplies at least one utterance for output to the user. A lexicon operates on-line and stores orthographic-phonetic information for the plurality of predetermined utterances which the alphabet-identifier, the utterance-identifier and a synthesizer can access in real time.

Description

VOICE-DIALOG SYSTEM FOR AUTOMATED OUTPUT OF INFORMATION The invention relates to a voice-dialog method for automated output of information, such as a telephone number of a user, and to a voice-dialog installation for carrying out the method, and to an apparatus for speaker-independent voice-identification, in particular for use in such an installation.
Voice-dialog systems for automated voice output of telephone numbers are known, in which the dialog between a caller, who requires certain information, and the system is conducted over the telephone. The voice-dialog systems currently in operation can, however, only identify a fixed, small to medium vocabulary of approximately 1000 words. Any texts, including the output of place names, surnames and the telephone number, are output by way of a voice synthesizer. It has, however, been shown that errors occur in the pronunciation of names, particularly if the names do not obey the usual pronunciation rules.
The underlying object of the invention is therefore to make available a voice-dialog method for automated output of information, and to provide a voice-dialog installation which is suitably developed for this purpose, which can process a very large identifiable vocabulary, that is, approximately 10,000 to 100,000 words, while still attaining an acceptable identification rate, and which also reduces or even totally avoids errors in the voice output of foreign-language terms.
According to a first aspect of the present invention, there is provided a voice-dialog method for automated output of information, having the following steps: (a) intermittently loading orthographic-phonetic information for a plurality of predetermined utterances from a lexicon which is capable of operating on-line, with the information being available in real time; (b) verbally requesting the user to input an utterance; (c) temporarily storing the utterance which has been input; (d) verbally requesting the user to spell the utterance which has been input; (e) in response to the spelt-out utterance, identifying and selecting a plurality of the predetermined, spelt-out reference utterances with the aid of the stored orthographic information on the basis of ascertaining similarity; (f) feeding the selected utterances and the temporarily stored utterance to an utterance identifier; (g) identifying and selecting at least one utterance from the selected utterances on the basis of a similarity-comparison; and (h) sequentially outputting the utterances found in step (g) and the associated information in synthesized voice form.
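The sequence of steps (a) to (h) can be sketched in outline as follows. This is an illustrative sketch only: the function names, the toy spelling-similarity measure and the stand-in "acoustic" check are assumptions made for demonstration, not part of the claimed method.

```python
# Illustrative sketch of method steps (a)-(h). The helpers and the toy
# similarity measures are assumptions; the method does not prescribe a
# concrete similarity algorithm.

def similar_spelling(word, spelt_letters, max_mismatch=1):
    # Step (e): toy orthographic similarity -- letter sequences of equal
    # length may differ in at most `max_mismatch` positions.
    if len(word) != len(spelt_letters):
        return False
    return sum(a != b for a, b in
               zip(word.upper(), spelt_letters.upper())) <= max_mismatch

def run_dialog(reference_names, direct_utterance, spelt_letters, sounds_like):
    spoken = direct_utterance                     # (b)+(c): input, stored temporarily
    candidates = [w for w in reference_names      # (d)+(e): spelt input, pre-selection
                  if similar_spelling(w, spelt_letters)]
    results = [w for w in candidates              # (f)+(g): similarity comparison
               if sounds_like(w, spoken)]
    return results                                # (h): utterances for synthesized output

reference_names = ["MEIER", "NEIER", "METER", "HUBER"]
# The lambda is a placeholder for the acoustic comparison of step (g).
hits = run_dialog(reference_names, "meier", "MEIER",
                  lambda word, spoken: word[0] in "MN")
```

The pre-selection in step (e) is what keeps step (g) tractable: only the small candidate list, never the whole vocabulary, is compared against the stored utterance.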
According to a second aspect of the present invention, there is provided a voice-dialog installation, comprising: a device for the input of an utterance by a user, at least one synthesizer for generating voice signals for the user, a voice-inputting device, an alphabet-identifier which can identify an utterance which is spelt out by the user and can select orthographically similar utterances from a plurality of predetermined spelt-out reference utterances, an utterance-identifier which compares the utterance input by the user with the utterances selected by the alphabet-identifier and on the basis of ascertaining similarity supplies at least one utterance for output to the user, and at least one lexicon which is capable of operating on-line and stores orthographic-phonetic information for the plurality of predetermined utterances which the alphabet-identifier, the utterance-identifier and the synthesizer can access in real time.
According to a third aspect of the present invention, there is provided an apparatus for speaker-independent voice-identification, having an alphabet-identifier which can identify an utterance spelt out by a user and can select several spelt-out reference utterances from a plurality of predetermined spelt-out reference utterances on the basis of ascertaining similarity, and having an utterance-identifier which, on the basis of ascertaining similarity, compares an utterance, which is input by the user and which corresponds to the spelt-out utterance, with the utterances which are pre-selected by the alphabet-identifier and supplies at least one output utterance as a result.
The invention is able to process a very large vocabulary at an acceptable identification rate, as an utterance input by a user undergoes combined voice identification. This utterance can be a surname, a first name, a street name, a place name or even words which are joined together. The combined voice identification comprises an alphabet-identifier, which can identify an utterance spelt out by the user and thereupon can select orthographically similar utterances from a plurality of predetermined reference utterances which have been spelt out. The term "orthographically similar utterance" is used in the following to express the fact that two or more sequences of pronounced letters forming words sound alike (e.g. "es e es es e el" and "ef e es es e el").
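This notion of orthographic similarity, in which spelt-out letter sequences sound alike, can be made concrete with a toy confusion model. The confusion sets below (S/F, M/N, and the "E-set" letters) are illustrative assumptions about which pronounced letter names are acoustically close; the description does not specify such sets.

```python
# Toy model of "orthographically similar": two spelt-out words are similar if
# each pair of pronounced letters is identical or acoustically confusable.
# The confusion sets are illustrative assumptions, not taken from the source.

CONFUSABLE = [{"S", "F"}, {"M", "N"}, {"B", "D", "E", "P", "T"}]

def letters_sound_alike(a, b):
    if a == b:
        return True
    return any(a in group and b in group for group in CONFUSABLE)

def orthographically_similar(word1, word2):
    """True if the two spelt-out letter sequences could be confused pairwise."""
    if len(word1) != len(word2):
        return False
    return all(letters_sound_alike(a, b) for a, b in zip(word1, word2))

# "es e es es e el" vs "ef e es es e el" corresponds to SESSEL vs FESSEL:
# only the first pronounced letter differs, and S/F sound alike.
```

An identifier built on such a model deliberately over-generates candidates; the subsequent utterance-identifier then resolves the ambiguity.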
As a second main component, the combined voice identification includes an utterance-identifier which compares the utterance input directly by the user with the reference utterances which correspond to the spelt-out reference utterances selected by the alphabet-identifier. On the basis of ascertaining similarity, the utterance-identifier supplies as an identification result at least one word for output to the user, which word corresponds to a reference utterance similar to the user's utterance. A lexicon capable of operating on-line is used to store orthographic-phonetic information for the plurality of predetermined utterances which the alphabet-identifier, the utterance-identifier and a synthesizer can access in real time.
Advantageously, a memory for temporary storage is provided, which memory temporarily stores the utterance directly input by the user before it is forwarded to the utterance-identifier. In addition, the installation contains a further memory in which the spelt-out reference utterances, which have been preselected by the alphabet-identifier, are loaded in the form of a list of candidates of orthographically similar names.
The utterance-identifier operates in keyword-spotting mode so that the user can, within certain limits, make additional utterances before and after the actual utterance, and the utterance-identifier is still able to extract the relevant utterance.
The orthographic-phonetic information stored in the lexicon pertains, in the first place, to the spelling of the predetermined utterances which the alphabet-identifier uses in order to identify an utterance which has been spelt out and to make therefrom a pre-selection of orthographically similar names for the utterance-identifier. In addition, phonetic transcriptions, for example for place names and surnames, are stored in the lexicon. Orthographic and phonetic transcriptions of proper names are transmitted from an electronic dictionary of pronunciation to the lexicon in an off-line process.
In this connection, only proper names which occur in the electronic telephone directory are transferred.
The electronic telephone directory is a data bank which is capable of operating in real time and which contains the addresses and telephone numbers required to output information to the user. In order to obtain a high level of quality even in the case of voice-output of names which do not obey the usual pronunciation rules, intonation-related information of the terms is also stored in addition to the phonetic information. These voice features reproduce the intonation of syllables and endings of foreign-language words as well.
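A possible shape for the data described above — orthography for the alphabet-identifier, a phonetic transcription for the utterance-identifier, intonation features for the synthesizer, and directory records that reference the lexicon — might look as follows. All field names, the transcription notation and the toy data are assumptions for illustration only.

```python
# Sketch of one possible layout for on-line lexicon entries and telephone
# directory records. Field names and notation are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class LexiconEntry:
    orthography: str                 # spelling, used by the alphabet-identifier
    phonetic: str                    # transcription, used by the utterance-identifier
    intonation: str = ""             # syllable/ending stress hints for the synthesizer
    homonyms: list = field(default_factory=list)  # same-sounding spellings

@dataclass
class DirectoryRecord:
    surname_ref: str                 # reference into the on-line lexicon
    address: str
    phone_number: str                # placeholder value below, not real data

# Toy data modelled on the "Meier" example used later in the description.
lexicon = {
    "Meier": LexiconEntry("Meier", "m aI 6", "stress on first syllable",
                          ["Mayer", "Maier", "Meyer"]),
}
directory = [DirectoryRecord("Meier", "Darmstadt", "0000000")]
```

Keeping the directory records as thin references into the lexicon matches the off-line transfer described above: pronunciation data is copied once from the dictionary of pronunciation, while the directory stays a lean real-time data bank.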
In order to avoid a situation where the results of identification of the combined voice identification are affected at random on account of acoustic similarities between words and/or spoken letters, additional information for homonyms is stored in the lexicon.
This additional information allows one candidate obtained by voice identification to be supplemented by alternatives which can be pronounced in the same way and thus allows the identification rate of the installation to be increased.
Advantageously, the lexicon includes a store for general vocabulary, for names of towns and for the surnames which occur there.
The control of the voice-dialog installation is effected by means of a program-controlled microcomputer. The control software implemented therein ensures inter alia that the required orthographic and phonetic information from the lexicon is made available to the identifiers and the synthesizer in good time and that the installation requests a user in a voice-controlled manner to input the respective utterances. In addition, it monitors the time-outs occurring in the voice-identifiers, processes terminating and help commands and takes over the identification and control of errors.
Internal program loops run in the utterance-identifier and in the alphabet-identifier; these loops can reject an utterance input by the user or, at the end of a given time span, can ask the user to input his utterance anew.
The invention is explained in greater detail below with reference to an exemplary embodiment in conjunction with the enclosed drawings, in which: Figure 1 is a schematic block diagram of a voice-dialog installation having the combined voice-identification according to the invention and an on-line lexicon; Figure 2 is a flow chart showing the progress of an automated voice dialog for name identification and output of a pertinent call number effected by the voice-dialog installation according to Figure 1.
Figure 1 shows the basic structure of a voice-dialog installation which can effect lexicon-controlled identification of any utterances, for example of place names or surnames, by means of a combination of voice-identifiers, and can output information associated with the utterance (for example a call number) on the basis of an utterance which has been ascertained (identification result). In detail, a telephone set or apparatus 10 is represented in Figure 1, at which apparatus a caller can input the place name and the surname of a subscriber, whose telephone number he wishes to find out, or certain other utterances.
Arranged on the operational side of the voice-dialog installation there is at least one analog-to-digital converter 80 which converts the analog voice signals from the subscriber into digital signals. The output of the analog-to-digital converter can be connected to the respective input of a voice memory 20 and an alphabet-identifier or letter-identifier 30. The voice memory 20 is used for temporary storage, for later use, of the utterance directly input into the telephone apparatus 10 by the caller, that is, for example, the name "Meier". The alphabet-identifier 30 receives, by way of the analog-to-digital converter 80 as a function of the status of the voice-dialog run, a spelt-out version of the directly input utterance which was previously stored in the voice memory 20. A program-controlled microcomputer 120 ensures that the directly input utterance is loaded into the voice memory 20 and that the spelt-out utterance is fed to the alphabet-identifier 30. The output of the alphabet-identifier 30 is connected to a memory 40, stored in which there is a list of candidates of orthographically similar utterances which have been ascertained by the alphabet-identifier 30 during a pre-selection. An utterance-identifier 50 is provided with three inputs which are connected to respective outputs of the candidate memory 40, the voice memory 20 and an on-line lexicon 70. The utterance-identifier 50 operates in the so-called keyword-spotting mode which makes it possible for the actual utterance, for example "Meier", to be correctly extracted, even if additional utterances such as "er", "please" or the like precede or follow it. The output of the keyword-spotter 50 is connected to an identification-result memory 55 in which the resultant utterances, that is, similarly sounding names, are stored by the keyword-spotter 50.
The utterances which are stored in the identification-result memory 55 are fed to a synthesizer 60 which, on the basis of the corresponding information from the lexicon, in turn transmits the names in synthesized speech by way of a digital-to-analog converter 85 to the telephone apparatus 10 of the subscriber. The synthesizer 60 can also produce the verbal requests to be made of the caller in conjunction with a database - not shown - in which all of the texts to be announced by the installation are contained in an orthographic or phonetic form.
The on-line lexicon 70 mentioned above is distinguished above all by the fact that it can be used simultaneously and in real time by the alphabet-identifier 30 for letter-identification, by the keyword-spotter 50 and by the synthesizer 60. That is why all the information relating to the utterances to be identified and output by the installation is stored in this lexicon 70. This information is orthographic and pronunciation- or intonation-related information which is loaded from a dictionary of pronunciation 100 into the on-line lexicon 70 in an off-line process. In addition, information on homonyms is stored in the lexicon 70 in order to extend the identification result of the utterance-identifier with names which sound alike, or in order to supplement the spelt-out reference utterances of the alphabet-identifier with orthographically similar names, and thus to increase the probability of detecting the correct utterance. This also ensures an increased success rate during use and an improved total throughput through the installation, as utterances which are to be identified are more rarely rejected by the voice-identifiers 30, 50. The information on homonyms makes it possible for the utterance-identifier, for example for an utterance "Meier", to find all the spellings present in the electronic telephone directory, such as, for example, "Meier", "Mayer", "Maier" and "Meyer", and to include them in the list of identification results. On the other hand, it is thereby possible for the alphabet-identifier to map frequently occurring and possibly incorrect spelling variants, such as, for example, "MULLER" or "MUELLER", to the correct spelt-out reference utterance even if only the spelling with "Ü" appears in the telephone directory. The on-line lexicon 70 which has been described therefore assists both the voice-identification and the voice synthesis.
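The two uses of homonym information just described — extending a recognised candidate with same-sounding names, and normalising spelling variants of the spelt-out input — can be sketched as two small lookup tables. The tables below are toy data modelled on the Meier and Mueller examples in the text; they are assumptions for illustration, not the installation's actual data.

```python
# Sketch of the two uses of homonym information. Tables are toy data
# modelled on the Meier/Mueller examples; not part of the source.

HOMONYMS = {
    "Meier": ["Meier", "Mayer", "Maier", "Meyer"],  # sound-alike surnames
}
SPELLING_VARIANTS = {
    "MULLER": "MÜLLER",    # umlaut dropped entirely
    "MUELLER": "MÜLLER",   # umlaut transliterated as "UE"
}

def expand_result(name):
    """Extend one recognised candidate with all same-sounding spellings,
    so every variant in the telephone directory can be offered to the user."""
    return HOMONYMS.get(name, [name])

def normalise_spelling(spelt):
    """Map a (possibly incorrect) spelt-out variant to the reference form
    actually stored in the telephone directory."""
    return SPELLING_VARIANTS.get(spelt, spelt)
```

Both lookups trade a slightly longer candidate list for fewer outright rejections, which is exactly the throughput gain the description attributes to the homonym data.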
The mode of operation of the voice-dialog installation is explained in greater detail in the following with reference to a name-identification. It may be assumed that the voice-dialog installation already knows the name of the place in which the person, whose telephone number a caller would like to find out, lives. For this purpose, the installation first asks the user of the telephone apparatus 10 to input the place name (for example Darmstadt) directly, that is, in a form not spelt out. Advantageously, the microcomputer 120 controls the installation in such a way that the place name is only fed to the keyword-spotter 50 in order to identify the utterance. As already mentioned, the keyword-spotter is able to tolerate additional utterances, such as "er" or "please", and extract merely the town name as information. The voice-dialog installation can also be developed in such a way that pre-selection of orthographically similar place names is effected by the alphabet-identifier 30 for the keyword-spotter 50 when an incorrect identification result, or no identification result at all, has been supplied by the keyword-spotter 50. After the place name has been identified, the voice-dialog installation makes available from the on-line lexicon 70 all the surnames which are stored in the electronic telephone directory 90 for this place name. It may further be assumed that the on-line lexicon 70 contains the spelling of all the proper names which are required for the spelling identification in the alphabet-identifier 30, a respective sequence of phonetic symbols for all the proper names which are required for the voice-identification in the keyword-spotter, and also a respective sequence of phonetic symbols including the intonation information required for the voice synthesis.
In addition, references to the corresponding entries in the on-line lexicon are contained in the electronic telephone directory 90 which contains the surnames of the subscribers with corresponding telephone numbers and addresses.
The caller is now guided through a dialog, during the course of which he finds out the desired telephone number by virtue of specifying the place name and the name of the subscriber.
The following voice dialog between the caller using the telephone apparatus 10 and the voice-dialog installation is explained in the flow chart according to Figure 2.
The caller is first asked verbally by the installation by way of the synthesizer 60 to input directly the desired name, for example "Meier". This input is subsequently temporarily stored in the voice memory 20. Even additional utterances, such as "er" and "please", are also recorded thereby in the voice memory 20. Subsequently, the caller is requested verbally by way of the synthesizer 60 to spell out the name previously directly input. Thereupon, the subscriber inputs the letter sequence M, E, I, E, R.
In conjunction with the orthographic information which is stored in the on-line lexicon 70, the alphabet-identifier 30 ascertains similarity and makes a pre-selection from the list of available surnames stored in the on-line lexicon 70 under the place name. On account of identification uncertainties, the alphabet-identifier 30 ascertains a plurality of candidates, for example "Neier", "Meier", "Meter", "Mieter", "Neter", "Nieter", "Meiter", "Meider", etc. This list of candidates is stored in the memory 40. The program-controlled microcomputer 120 causes the keyword-spotter 50 to read out the user utterance "Meier" previously temporarily stored in the voice memory 20 and to load the pre-selected candidates which are in the memory 40.
On the basis of ascertaining similarity, the keyword-spotter 50 compares the spoken name "Meier", which is directly input, with the list of candidates by using the phonetic information stored in the on-line lexicon 70. The keyword-spotter 50 supplies, for example, the names "Neier" and "Meier" as an identification result and stores them in the result memory 55. The voice-dialog installation, on account of the phonetic and intonation-related information stored in the on-line lexicon 70, knows how to pronounce and intonate the identification results which have been found.
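A toy version of this second-stage comparison can be written down directly. Here `difflib` string similarity stands in for the phonetic similarity measure applied by the keyword-spotter, which the description leaves unspecified; the candidate list is the example pre-selection from above.

```python
# Toy second stage: re-score the alphabet-identifier's candidate list against
# the directly spoken utterance. difflib.SequenceMatcher is an illustrative
# stand-in for the phonetic comparison; threshold chosen for this toy data.
import difflib

def spot_keyword(spoken, candidates, threshold=0.8):
    """Return candidates whose similarity to the spoken name reaches the
    threshold, best first -- analogous to the keyword-spotter's result list."""
    scored = [(difflib.SequenceMatcher(None, spoken.lower(), c.lower()).ratio(), c)
              for c in candidates]
    return [name for score, name in sorted(scored, reverse=True)
            if score >= threshold]

candidates = ["Neier", "Meier", "Meter", "Mieter", "Neter",
              "Nieter", "Meiter", "Meider"]
results = spot_keyword("Meier", candidates)
```

Because the comparison runs only over the pre-selected candidates, the second stage stays cheap even when the full directory holds tens of thousands of surnames.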
Thereupon, the names which have been found, in the present case the names "Neier" and "Meier", are successively transmitted by way of the synthesizer 60 to the telephone apparatus 10 of the caller. The caller can thereupon select the correct name. With this surname and the identified place name, a data bank inquiry of the electronic telephone directory 90 is then commenced. The names and addresses which are found are read out in a user-controlled manner, that is, the user can influence when the voice-output of the names and addresses which have been found is terminated and how often a list is read out or for which name additional information is to be output. In problem cases, it is possible for the caller to be connected through to an operator. As soon as the user of the voice-dialog installation indicates that the data output by way of the voice synthesizer 60 (first name, surname, street, street number) corresponds to the data of the person whose telephone number he is seeking, the microcomputer 120 causes the installation to read out the corresponding telephone number from the telephone directory 90 and inform the caller thereof verbally.
Owing to the lexicon-controlled identification of any utterances as a result of the combination of the alphabet-identifier 30 and the keyword-spotter 50, it is possible to process, at an acceptable identification rate, a considerably larger vocabulary than conventional installations which use only one voice-identifier.
The reason for this can be seen in the fact that the alphabet-identifier 30 makes a pre-selection of the words which are to be identified and only this comparatively small selection of words which come into question is fed to the keyword-spotter 50 for actual identification.

Claims (19)

1. Voice-dialog method for automated output of information, having the following steps: (a) intermittently loading orthographic-phonetic information for a plurality of predetermined utterances from a lexicon which is capable of operating on-line, with the information being available in real time; (b) verbally requesting the user to input an utterance; (c) temporarily storing the utterance which has been input; (d) verbally requesting the user to spell the utterance which has been input; (e) in response to the spelt-out utterance, identifying and selecting a plurality of the predetermined, spelt-out reference utterances with the aid of the stored orthographic information on the basis of ascertaining similarity; (f) feeding the selected utterances and the temporarily stored utterance to an utterance identifier; (g) identifying and selecting at least one utterance from the selected utterances on the basis of a similarity-comparison; and (h) sequentially outputting the utterances found in step (g) and the associated information in synthesized voice form.
2. Voice-dialog method according to claim 1, characterised in that step (h) is repeated until the user terminates the synthesized voice output of the utterances.
3. Voice-dialog method according to claim 1 or 2, characterised in that steps (e) and (g) are terminated at the end of a predetermined time span and the user is requested to re-input his utterance if no utterance has been identified.
4. Voice-dialog method according to claim 2 or 3, characterised in that the user identifies one of the synthesized utterances as coinciding with his utterance, and in that, in response to this, an inquiry of an electronic telephone directory is commenced, which directory is capable of operating in real time and from which directory all of the data records meeting the criterion of the identified utterance are read out and made available to the user to choose from, and in that, on the basis of a name and an address read out from the directory, the user can identify the data record whose telephone number is to be output by the installation.
5. Voice-dialog method according to one of the claims 1 to 4, characterised in that orthographic-phonetic information for predetermined utterances is loaded at predetermined instants from a lexicon which is capable of operating on-line.
6. Voice-dialog installation for carrying out the method according to one of the claims 1 to 5, comprising: a device for the input of an utterance by a user, at least one synthesizer for generating voice signals for the user, a voice-inputting device, an alphabet-identifier which can identify an utterance which is spelt out by the user and can select orthographically similar utterances from a plurality of predetermined spelt-out reference utterances, an utterance-identifier which compares the utterance input by the user with the utterances selected by the alphabet-identifier and on the basis of ascertaining similarity supplies at least one utterance for output to the user, and at least one lexicon which is capable of operating on-line and stores orthographic-phonetic information for the plurality of predetermined utterances which the alphabet-identifier, the utterance-identifier and the synthesizer can access in real time.
7. Voice-dialog installation according to claim 6, comprising a memory for temporary storage which temporarily stores the utterance input by the user, and a memory which receives the utterances pre-selected by the alphabet-identifier.
8. Voice-dialog installation according to claim 6 or 7, characterised in that the utterance-identifier operates in keyword-spotting mode.
9. Voice-dialog installation according to one of the claims 6 to 8, characterised in that the data which is stored in the lexicon is orthographic, phonetic and intonation-related information for the predetermined utterances.
10. Voice-dialog installation according to claim 9, characterised in that additional information on homonyms is stored in the lexicon.
11. Voice-dialog installation according to one of the claims 6 to 10, characterised in that the utterance input by the user can be a place name, a surname or a plurality of words joined together.
12. Voice-dialog installation according to one of the claims 6 to 11, characterised in that the lexicon is capable of operating on-line and includes means for the storage of a general vocabulary, place names and surnames.
13. Voice-dialog installation according to one of the claims 6 to 12, characterised in that it is controlled by a program-controlled microcomputer.
14. Voice-dialog installation according to one of the claims 6 to 13, characterised in that the utterance-identifier and the alphabet-identifier are developed in such a way that they can reject an utterance input by the user and/or at the end of a given time span can ask the user to re-input his utterance.
15. Apparatus for speaker-independent voice-identification, in particular for use in a voice-dialog installation according to one of the claims 6 to 14, having an alphabet-identifier which can identify an utterance spelt out by a user and can select several spelt-out reference utterances from a plurality of predetermined spelt-out reference utterances on the basis of ascertaining similarity, and having an utterance-identifier which, on the basis of ascertaining similarity, compares an utterance, which is input by the user and which corresponds to the spelt-out utterance, with the utterances which are pre-selected by the alphabet-identifier and supplies at least one output utterance as a result.
16. Apparatus for voice-identification according to claim 15, wherein the utterance-identifier operates in the keyword-spotting mode.
17. Apparatus for voice-identification according to claim 15 or 16, comprising a lexicon which stores orthographic and phonetic information on the plurality of predetermined utterances which the alphabet-identifier and the utterance-identifier can access in real time in order to ascertain utterances which sound alike or are orthographically similar.
18. A voice-dialog method, substantially as herein described with reference to Figure 2 of the accompanying drawings.
19. A voice-dialog installation, substantially as herein described with reference to, or as shown in, Figure 1 of the accompanying drawings.
GB9618308A 1995-08-31 1996-09-02 Voice-dialog system for automated output of information Expired - Fee Related GB2304957B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
DE1995132114 DE19532114C2 (en) 1995-08-31 1995-08-31 Speech dialog system for the automated output of information

Publications (3)

Publication Number Publication Date
GB9618308D0 GB9618308D0 (en) 1996-10-16
GB2304957A true GB2304957A (en) 1997-03-26
GB2304957B GB2304957B (en) 1999-09-29

Family

ID=7770897

Family Applications (1)

Application Number Title Priority Date Filing Date
GB9618308A Expired - Fee Related GB2304957B (en) 1995-08-31 1996-09-02 Voice-dialog system for automated output of information

Country Status (3)

Country Link
DE (1) DE19532114C2 (en)
FR (1) FR2738382B1 (en)
GB (1) GB2304957B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE19907341A1 (en) * 1999-02-20 2000-08-31 Lutz H Karolus Processing data as query information involves comparing original and alternative data files with data in connected database, outputting coinciding data to local data processing machine
DE19907759C2 (en) * 1999-02-23 2002-05-23 Infineon Technologies Ag Method and device for spelling recognition
JP2001117828A (en) * 1999-10-14 2001-04-27 Fujitsu Ltd Electronic device and storage medium
EP1226576A2 (en) * 1999-11-04 2002-07-31 Telefonaktiebolaget Lm Ericsson System and method of increasing the recognition rate of speech-input instructions in remote communication terminals
DE10207895B4 (en) * 2002-02-23 2005-11-03 Harman Becker Automotive Systems Gmbh Method for speech recognition and speech recognition system
AT5730U3 (en) * 2002-05-24 2003-08-25 Roland Moesl METHOD FOR FOGGING WEBSITES
TWI298592B (en) * 2005-11-18 2008-07-01 Primax Electronics Ltd Menu-browsing method and auxiliary-operating system of handheld electronic device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0311414A2 (en) * 1987-10-08 1989-04-12 Nec Corporation Voice controlled dialer having memories for full-digit dialing for any users and abbreviated dialing for authorized users

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE3732849A1 (en) * 1987-09-29 1989-04-20 Siemens Ag SYSTEM ARCHITECTURE FOR AN ACOUSTIC HUMAN / MACHINE DIALOG SYSTEM
US5131045A (en) * 1990-05-10 1992-07-14 Roth Richard G Audio-augmented data keying
US5293451A (en) * 1990-10-23 1994-03-08 International Business Machines Corporation Method and apparatus for generating models of spoken words based on a small number of utterances
DE69232407T2 (en) * 1991-11-18 2002-09-12 Toshiba Kawasaki Kk Speech dialogue system to facilitate computer-human interaction
FR2690777A1 (en) * 1992-04-30 1993-11-05 Lorraine Laminage Control of automaton by voice recognition - uses spelling of word or part of word by the operator to aid voice recognition and returns word recognised before acting
AU5803394A (en) * 1992-12-17 1994-07-04 Bell Atlantic Network Services, Inc. Mechanized directory assistance

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6721702B2 (en) 1999-06-10 2004-04-13 Infineon Technologies Ag Speech recognition method and device
GB2353887A (en) * 1999-09-04 2001-03-07 Ibm Speech recognition system
GB2353887B (en) * 1999-09-04 2003-09-24 Ibm Speech recognition system
US6629071B1 (en) 1999-09-04 2003-09-30 International Business Machines Corporation Speech recognition system
US6687673B2 (en) * 1999-09-04 2004-02-03 International Business Machines Corporation Speech recognition system
GB2362746A (en) * 2000-05-23 2001-11-28 Vocalis Ltd Data recognition and retrieval
US7167545B2 (en) 2000-12-06 2007-01-23 Varetis Solutions Gmbh Method and device for automatically issuing information using a search engine
EP1693829A1 (en) * 2005-02-21 2006-08-23 Harman Becker Automotive Systems GmbH Voice-controlled data system
US9153233B2 (en) 2005-02-21 2015-10-06 Harman Becker Automotive Systems Gmbh Voice-controlled selection of media files utilizing phonetic data

Also Published As

Publication number Publication date
FR2738382A1 (en) 1997-03-07
FR2738382B1 (en) 1999-01-29
DE19532114C2 (en) 2001-07-26
GB2304957B (en) 1999-09-29
DE19532114A1 (en) 1997-03-06
GB9618308D0 (en) 1996-10-16

Similar Documents

Publication Publication Date Title
EP1049072B1 (en) Graphical user interface and method for modifying pronunciations in text-to-speech and speech recognition systems
US7529678B2 (en) Using a spoken utterance for disambiguation of spelling inputs into a speech recognition system
US8285537B2 (en) Recognition of proper nouns using native-language pronunciation
US6999931B2 (en) Spoken dialog system using a best-fit language model and best-fit grammar
EP1975923B1 (en) Multilingual non-native speech recognition
US6996528B2 (en) Method for efficient, safe and reliable data entry by voice under adverse conditions
US6975986B2 (en) Voice spelling in an audio-only interface
KR19990008459A (en) Improved Reliability Word Recognition Method and Word Recognizer
US20050033575A1 (en) Operating method for an automated language recognizer intended for the speaker-independent language recognition of words in different languages and automated language recognizer
US5995931A (en) Method for modeling and recognizing speech including word liaisons
JPH075891A (en) Method and device for voice interaction
US9286887B2 (en) Concise dynamic grammars using N-best selection
WO1994016437A1 (en) Speech recognition system
GB2304957A (en) Voice-dialog system for automated output of information
US7406408B1 (en) Method of recognizing phones in speech of any language
EP0949606B1 (en) Method and system for speech recognition based on phonetic transcriptions
US6119085A (en) Reconciling recognition and text to speech vocabularies
EP1213706B1 (en) Method for online adaptation of pronunciation dictionaries
US7430503B1 (en) Method of combining corpora to achieve consistency in phonetic labeling
EP0786132B1 (en) A method and device for preparing and using diphones for multilingual text-to-speech generating
JPH0743599B2 (en) Computer system for voice recognition
WO2002086863A1 (en) Speech recognition
WO2000036591A1 (en) Speech operated automatic inquiry system
JPH0361954B2 (en)
JP2005534968A (en) Deciding to read kanji

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20100902