EP1097447A1

EP1097447A1 - Method and device for recognizing predetermined key words in spoken language

Info

Publication number: EP1097447A1
Application number: EP99945842A
Authority: EP
Inventors: Alfred Hauenstein
Original assignee: Siemens AG
Current assignee: Siemens AG
Priority date: 1998-07-23
Filing date: 1999-07-01
Publication date: 2001-05-09
Also published as: WO2000005709A1; US20010016814A1

Abstract

The invention relates to a method and a device for recognizing predetermined keywords in spoken language, in which the keywords are modeled for recognition. A predetermined set of filler words is also modeled. If a keyword occurs in the spoken language, it is recognized. If, however, a match with a filler word is determined in the spoken language, no keyword is recognized.

Description

Method and device for recognizing predetermined keywords in spoken language

The invention relates to a method and a device for recognizing predetermined keywords in spoken language by a computer.

A method and a device for speech recognition are known from [1]. That reference also gives a basic introduction to the components of a speech recognition system and to important techniques commonly used in speech recognition.

Modeling is understood below to mean the mapping of words into a vocabulary accessible to the speech recognition system. A vocabulary includes keywords and filler words. A keyword is at least one sound that is to be recognized by the system for recognizing spoken language and that is linked in particular to a predetermined action. In particular, a sound contains at least one phoneme. A keyword can also include several words, at least one pause or at least one sound. A filler word denotes an acoustic unit that does not correspond to a keyword, e.g. a word, a sound or a pause.

Systems for keyword recognition are known (see [2] or [3]) which model only the keywords and/or phrases made up of keywords. To reject words that are not keywords, algorithms are used that distinguish keywords from the other words. A disadvantage of these systems is that the speech recognition system must be reconfigured for each new vocabulary. Another approach to keyword recognition is a speech recognition system with a large vocabulary. If such a system recognizes all words and noises, predefined keywords can also be recognized (compare [4]). Such a system places extremely high demands on computing power, which is generally not available on the computers intended for speech recognition. Furthermore, it is practically impossible to model all acoustic events.

The object of the invention is to provide a method and a device for recognizing keywords in spoken language in which the disadvantages described above are avoided.

The object is achieved according to the features of the independent claims.

To achieve the object, a method for recognizing predetermined keywords in spoken language is first specified, in which the keywords are modeled for recognition. Furthermore, a predetermined set of filler words is modeled. If a keyword occurs in the spoken language, this keyword is recognized; if, on the other hand, a match with a filler word is determined in the spoken language, no keyword is recognized.

A further development consists in the predetermined set of filler words being small. This is a decisive advantage, since the size of the set of filler words directly affects the computing power required by the speech recognition system. A small set of filler words can be handled even by a computer with relatively low computing power, which is advantageous in terms of the cost of the speech recognition system. Furthermore, the predetermined set of filler words is determined from a predetermined number of the most common words in a language.

It is an advantage of the invention that, in particular, the set of filler words can be the same for all possible combinations of keywords, so that the set of filler words need not be changed when the keywords are changed. With this set of filler words, it is possible to absorb all words of the spoken language that are not keywords, that is, to prevent these 'non-keywords' from being recognized as keywords. For this purpose, the filler words are preferably short, monosyllabic words whose acoustic representations match the words of the spoken language that are not keywords, or at least parts of those words. In particular, the set of filler words can be obtained from the analysis of spoken dialogues. To this end, a frequency list of the words occurring in these dialogues is determined, and approximately the 15 to 50 most common words are selected as filler words. The filler words are preferably provided with a marking. If a keyword matches a filler word from the set of filler words, this filler word is removed from the set of filler words. The keywords and the filler words are then preferably modeled using a system for recognizing spoken language (see [1], [5]). All marked filler words are filtered out of the spoken language, so that only the keywords are displayed to a user or passed to a target application.
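Purely by way of illustration, the selection of filler words from a frequency list can be sketched as follows. The sketch is not part of the disclosure: the function name `select_filler_words`, the simple letter-based tokenization and the default value of `n` are assumptions, and only exact matches between a filler word and a keyword are removed.

```python
from collections import Counter
import re


def select_filler_words(dialogue_text, keywords, n=20):
    """Return the n most frequent words of the dialogues, minus any keywords.

    Hypothetical helper: tokenization by runs of letters is an assumption.
    """
    # Split the transcribed dialogues into lowercase words (letters only).
    words = re.findall(r"[^\W\d_]+", dialogue_text.lower())
    # Frequency list: take the n most common words as filler candidates.
    candidates = [word for word, _ in Counter(words).most_common(n)]
    # A filler word that coincides with a keyword is removed from the set.
    keyword_set = {keyword.lower() for keyword in keywords}
    return [word for word in candidates if word not in keyword_set]


if __name__ == "__main__":
    transcript = "yes and then yes and higher yes and stop then"
    print(select_filler_words(transcript, keywords=["then"], n=3))
```

In this sketch the number of filler words can fall below `n` when a frequent word is also a keyword, which mirrors the removal step described above rather than refilling the set.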

It is a particular advantage that the determination of the filler words can be based on a statistical analysis of natural spontaneous language. In this way, words actually spoken by people are modeled, and the filler words achieve excellent hit rates for non-keywords. It is also a particular advantage that the small set of filler words places low demands on the computing power of the computer to be used.

A combination of the invention with known methods for recognizing keywords is also advantageous. This applies in particular to the modeling of noises and pauses (see [2]).

It is also a development of the invention that a filler word is deleted from the set of filler words if this filler word matches part of a keyword.
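As a minimal sketch of this development, and under the assumption that a "part of a keyword" can be approximated by one of the words a multi-word keyword is written with, the deletion could look as follows. The function name `prune_fillers` is hypothetical, and the comparison is purely orthographic; an acoustic comparison would require the recognizer's models.

```python
def prune_fillers(fillers, keywords):
    """Drop every filler word that equals a keyword or a word inside a keyword.

    Hypothetical helper: parts of a keyword are approximated by its
    whitespace-separated words, which is an assumption of this sketch.
    """
    keyword_parts = set()
    for keyword in keywords:
        lowered = keyword.lower()
        keyword_parts.add(lowered)             # the whole keyword
        keyword_parts.update(lowered.split())  # each word inside the keyword
    return [f for f in fillers if f.lower() not in keyword_parts]


if __name__ == "__main__":
    print(prune_fillers(["on", "the", "table", "and"], ["operating table higher"]))
```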

Another development is that the keywords recognized in the spoken language are displayed and the recognized filler words are not displayed.

As part of an additional training, at least one noise or at least one pause is modeled and added to the set of filler words.

One possible use of the method according to the invention is to control a medical device using the key words.

Another use of the invention is to answer a customer request, in particular in a communication network, for example the telephone network, the customer request being triggered by a keyword. For example, the system answers a call from a customer who speaks a specific keyword. This enables automated and efficient interaction between the customer and a computer, and a human customer advisor can also be requested by means of a keyword.

Another development of the invention consists in defining a code word which indicates that a keyword preferably follows immediately. An example is the control of medical devices during an operation with the code word "computer":

"Computer operating table higher" instead of "operating table higher".

The code word "computer" signals to the keyword recognition system that a keyword, here "operating table higher", may follow. In addition, as a further development, the code word "computer" can be modeled as a filler word so that no keyword is detected when the code word is spoken accidentally without a subsequent keyword.
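The use of a code word can be illustrated with the following sketch. It assumes that the recognizer already delivers a sequence of recognized units (keywords, filler words and the code word) as strings, and that a keyword is accepted only when it directly follows the code word; both the function name `gate_by_code_word` and this strict interpretation of "follows immediately" are assumptions.

```python
def gate_by_code_word(recognized, keywords, code_word="computer"):
    """Accept a keyword only if it directly follows the code word.

    Hypothetical helper operating on already recognized units.
    """
    accepted = []
    armed = False  # True right after the code word has been recognized
    for unit in recognized:
        if unit == code_word:
            armed = True
            continue
        if armed and unit in keywords:
            accepted.append(unit)
        # Any unit other than the code word disarms the gate again.
        armed = False
    return accepted


if __name__ == "__main__":
    units = ["computer", "operating table higher", "and", "operating table lower"]
    print(gate_by_code_word(units, {"operating table higher", "operating table lower"}))
```

A code word that is not followed by a keyword simply produces no output in this sketch, which corresponds to treating the code word like a filler word.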

A device for recognizing predetermined keywords in spoken language is also specified, which has a processor unit set up in such a way that the predetermined keywords are modeled for recognition. Furthermore, a predetermined set of filler words is modeled. If a keyword occurs in the spoken language, this keyword is recognized; if a match with a filler word is determined in the spoken language, no keyword is recognized.

A further development of the device according to the invention consists in the predetermined set of filler words being small, or in the predetermined set of filler words being determined from a predetermined number of the most frequent words in a language.

This device is particularly suitable for carrying out the method according to the invention or one of its developments explained above. Further developments of the invention also result from the dependent claims.

Exemplary embodiments of the invention are illustrated in more detail with the aid of the following figures.

The figures show:

Fig. 1 a device for recognizing predetermined keywords in spoken language;

Fig. 2 a block diagram illustrating a method for recognizing predetermined keywords in spoken language;

Fig. 3 a block diagram representing one possibility for determining the filler words;

Fig. 4 a list of possible filler words; and

Fig. 5 a processor unit.

Fig. 1 shows, in general terms, a system architecture for speech recognition (speech recognition system).

A prerequisite for the recognition of naturally spoken language is a suitable formalism for representing knowledge. A complete speech recognition system comprises several levels of processing. These are, in particular, acoustic phonetics, intonation, syntax, semantics and pragmatics. Fig. 1 shows the processing levels involved in the recognition of keywords. The natural speech signal 101 enters the speech recognition system. A feature extraction is carried out there in a component 102. After the feature extraction, an acoustic model is used to carry out a classification 104 (also: distance calculation) of the features of the speech signal 101 obtained in the preprocessing 102. The classification 104 is followed by a search 105 for predefined filler words 106, application-specific keywords 107 or predefined noise models 108 (optionally, pauses can also be modeled). The assignments 106, 107 and/or 108 made on the basis of the search 105 are filtered in a logical block 109, and the sequence of keywords found 110 is output.

It should be noted that the block structure in Fig. 1 represents only a logical division. Implementation in hardware or software components is not tied to the division represented by Fig. 1.
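The filtering in the logical block 109 can be pictured with the following sketch, which is only one possible software reading of the block structure. It assumes that the search 105 yields a sequence of labeled hypotheses; the `Hypothesis` type, its `kind` values and the function name `filter_keywords` are hypothetical.

```python
from typing import List, NamedTuple


class Hypothesis(NamedTuple):
    """One unit found by the search (hypothetical data structure)."""
    label: str  # recognized word, noise or pause
    kind: str   # "keyword", "filler" or "noise"


def filter_keywords(hypotheses: List[Hypothesis]) -> List[str]:
    """Pass on only the keywords; filler words and noises are discarded."""
    return [h.label for h in hypotheses if h.kind == "keyword"]


if __name__ == "__main__":
    found = [
        Hypothesis("and", "filler"),
        Hypothesis("operating table higher", "keyword"),
        Hypothesis("<cough>", "noise"),
    ]
    print(filter_keywords(found))
```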

Fig. 2 shows a block diagram illustrating a method for recognizing predetermined keywords in spoken language. For this purpose, the keywords are modeled in a step 201. In a step 202, the filler words are modeled. Thereupon, in a step 203, the components of the spoken language (sounds) are separated into keywords and filler words. The keywords found are displayed in a step 204.

Fig. 3 shows a block diagram which represents one possibility for determining the filler words. For this purpose, the spoken language 301 is broken down into sounds (components), and these sounds are sorted according to their frequency (see step 302).

In a step 303, the n most frequent sounds are determined as filler words. A sound 304 is particularly a word 305, a syllable 306, multiple words 307, a sound 308 or a pause 309.

Fig. 4 shows a list of possible filler words. These filler words occur frequently in natural-language dialogues in the modeled language (e.g. German) and are ideally suited to modeling non-keywords. Fig. 4 shows an example of a list with 18 filler words:

"I - we - that - yes - then - there - and - that - is - me - on - that - that - until - it - o'clock - still - with."

Fig. 5 shows a computing unit 501. The computing unit 501 comprises a processor CPU 502, a memory 503 and an input/output interface 504, which is used in different ways via an interface 505 led out of the computing unit 501: via a graphics interface, an output is displayed on a monitor 507 and/or printed on a printer 508. An input is made via a mouse 509 or a keyboard 510. The computing unit 501 also has a bus 506, which connects the memory 503, the processor 502 and the input/output interface 504. Additional components, such as additional memory or a hard disk, can also be connected to the bus 506. External devices or another program running on another computer can be controlled via the interface 505 or the bus 506.

The following publications have been cited in this document:

[1] A. Hauenstein: "Optimization of algorithms and design of a processor for automatic speech recognition", Chair for Integrated Circuits, Technical University of Munich, dissertation, July 19, 1993, chapter 2, pages 13-26.

[2] R. C. Rose: "Keyword detection in conversational speech utterances using hidden Markov model based continuous speech recognition", Computer Speech and Language, 9 (1995), pages 309-333.

[3] Junkawitsch, Neubauer, Höge, Ruske: "A new keyword spotting algorithm with pre-calculated optimal thresholds", Proc. Intern. Conference on Speech and Language Processing, 1996, pages 2067-2070.

[4] M. Weintraub: "LVCSR log-likelihood ratio scoring for keyword spotting", Proc. Intern. Conference on Acoustics, Speech and Signal Processing, 1995, pages 297-300.

[5] A. Hauenstein: "Optimization of algorithms and design of a processor for automatic speech recognition", Chair for Integrated Circuits, Technical University of Munich, dissertation, July 19, 1993, chapter 3, pages 27-86.

Claims

1. A method for recognizing predetermined keywords in spoken language by a computer, a) in which the predetermined keywords are modeled for recognition, b) in which a predetermined set of filler words is modeled, c) in which, if a keyword occurs in the spoken language, this keyword is recognized, d) in which, if a match with a filler word is determined in the spoken language, no keyword is recognized.

2. The method of claim 1, wherein the predetermined set of filler words comprises fewer than 50 words.

3. The method of claim 1 or 2, wherein the predetermined set of filler words is determined from a predetermined number of the most common words in the language.

4. The method according to any one of claims 1 to 3, wherein when changing the predetermined keywords, a filler word, which is a keyword, is deleted from the set of filler words.

5. The method according to any one of the preceding claims, in which, if a filler word matches a part of a keyword or is acoustically similar to it, this filler word is deleted from the set of filler words.

6. The method according to any one of the preceding claims, in which the keywords recognized in the spoken language are displayed and the recognized filler words are not displayed.

7. The method according to any one of the preceding claims, wherein at least one sound of the speech is modeled and added to the set of filler words.

8. The method according to any one of the preceding claims, in which at least one pause is modeled and added to the set of filler words.

9. The method according to any one of claims 1 to 8, in which a medical device is controlled using the keywords.

10. The method according to any one of claims 1 to 8, wherein the computer interacts with a user using the keyword, predetermined actions being performed on the computer.

11. The method according to any one of claims 1 to 8, in which a device or an application of the communication technology is controlled using the keywords.

12. The method according to any one of the preceding claims, in which a code word is set up, which indicates that a keyword follows.

13. The method according to claim 12, wherein the code word is modeled as a filler word.

14. A device for recognizing predetermined keywords in spoken language with a processor unit, which is set up in such a way that a) the predetermined keywords are modeled for recognition, b) a predetermined set of filler words is modeled, c) if a keyword occurs in the spoken language, this keyword is recognized, d) if a match with a filler word is determined in the spoken language, no keyword is recognized.

15. The device of claim 14, wherein the processor unit is set up such that the predetermined set of filler words is small.

16. The device of claim 14 or 15, wherein the processor unit is set up such that the predetermined set of filler words can be determined from a predetermined number of the most common words in a language.