WO2001001389A2

WO2001001389A2 - Voice recognition method and device

Info

Publication number: WO2001001389A2
Application number: PCT/DE2000/001056
Authority: WO
Inventors: Andreas Kipp
Original assignee: Siemens Aktiengesellschaft
Priority date: 1999-06-24
Filing date: 2000-04-05
Publication date: 2001-01-04
Also published as: WO2001001389A3; CN1365487A; EP1190413A2; HUP0201923A2

Abstract

A voice recognition method wherein a section of a continuous speech flow consisting of spoken words is detected by means of comparison with stored models. In response to the detection of a first key word, said key word is stored, a first voice recognition system is deactivated and a second voice recognition system is activated. In a second detection step, the speech flow is checked by the second speech recognition system for the appearance of a predetermined, second key word or a second key word sequence.

Description

Method and device for speech recognition

The development of speech recognition and voice control systems suitable for everyday use has been one of the main lines of development in computer technology for years. In the course of this development, considerable progress has been made, and marketable speech recognition systems have been established that also prove themselves in practical use. Advanced systems of this type are also generally suitable for voice control of a computer or connected peripheral devices. Simple speech recognition systems, which can process only a relatively small vocabulary, are also already being used in consumer electronics and automotive equipment, as well as in other areas in which a limited vocabulary makes acoustic control of devices possible and sensible.

Certain problems still exist regarding processing speed, i.e. keeping up with fast speech, and - in the more sophisticated systems - with regard to the high demands on the hardware basis and also relatively high acquisition costs.

The problem of recognizing keyword sequences in a continuous stream of spoken words deserves special attention in the further development of speech recognition systems. Such keyword sequences mostly have a relatively strictly defined information structure, which, when processed appropriately, enables particularly simple and reliable recognition, and they are also often associated with voice control tasks, such as entering a number code, a telephone number, a time or a date. According to the state of the art, such sequences are processed (to a certain extent quite successfully) within the framework of conventional speech recognition systems, for example on the basis of the known hidden Markov modeling, whereby a step-by-step output of the recognition result is also possible, for example by means of the partial traceback method.

The invention is based on the object of specifying a method of the generic type and an apparatus for carrying out the method, which enable a more reliable, simpler and faster recognition of keyword sequences.

This object is achieved in terms of its method aspect by a method with the features of claim 1 and in terms of its device aspect by a device with the features of claim 9.

The invention includes the essential idea of solving the problem of recognizing a coherent keyword sequence better and more reliably by dividing the recognition process into two or more sub-steps, in each of which a specific speech recognition system is used. This idea is based on the realization that speech recognition systems with a relatively small vocabulary can work significantly faster and more reliably than speech recognition systems with a large vocabulary. It also proceeds from the idea that certain keyword sequences that occur frequently and are meaningful in everyday language use also have a relatively clearly defined information structure, so that several existing speech recognition systems, each with a specific vocabulary, can advantageously be activated conditionally in successive sub-steps, depending on the detection result of the respective preceding sub-step. Furthermore, the invention is based on the knowledge that, especially under adverse acoustic conditions (with loud ambient noise or relatively strong distortions), speech recognition systems with a small vocabulary provide much better accuracy than those with a large vocabulary. The conditional use of several systems with a small vocabulary therefore increases the detection rate for keyword sequences as such on the one hand and reduces the rate of incorrect detections on the other.

According to the invention, the interlinked speech recognition systems are activated one after the other. After a system has solved its specific detection task and a detected keyword or part of a keyword sequence has been stored, it is deactivated again, whereupon another system is activated to solve its assigned detection task, a further detected keyword or further part of a keyword sequence is stored, and so on. After completion of the detection process, the keywords or parts of keyword sequences are put together in order and output or transmitted to a corresponding control unit for the realization of a control task.

In a preferred embodiment of the method, one of several speech recognition systems that are, as it were, on standby is selected and activated on the basis of the first partial detection result, i.e. depending on the type of the first detected keyword or part of a keyword sequence.

According to a further preferred embodiment, after the detection of a first keyword or part of a keyword sequence, a time window in the speech stream is predetermined for the detection of a second keyword or part of the keyword sequence (and analogously for further parts of a sequence), within which a second (or further) detection result must be available. Depending on the specific system configuration, this time window can be an absolute time span or a time span related to the speech signals actually arriving. If the window passes without a detection result, the system used first is reactivated.
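The time-window rule can be illustrated with a minimal sketch. It assumes, purely for illustration, that the speech stream has already been reduced to a list of words and that the window is counted in words rather than seconds; the function name `detect_pair` and its parameters are hypothetical and do not appear in the application.

```python
def detect_pair(words, first_vocab, second_vocab, window):
    """Return a (first, second) keyword pair, or None if no pair is completed.

    `window` is the number of words the second system may inspect after the
    first keyword before control falls back to the first system. Counting in
    words is an assumption of this sketch; a real system could use seconds.
    """
    active = 1            # 1: first system listening, 2: second system listening
    first_kw = None
    remaining = 0
    for word in words:
        if active == 1:
            if word in first_vocab:
                # First keyword found: store it and switch systems.
                first_kw, active, remaining = word, 2, window
        elif word in second_vocab:
            return first_kw, word
        else:
            remaining -= 1
            if remaining <= 0:
                # Window expired without a result: reactivate the first system.
                active, first_kw = 1, None
    return None


pair = detect_pair("date a second".split(), {"date"}, {"second"}, window=3)
expired = detect_pair("date a b c second".split(), {"date"}, {"second"}, window=3)
```

In the second call the window runs out after three non-matching words, so the late word "second" reaches only the first system, which does not know it.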

In a further advantageous embodiment, buffering of the speech data is provided, which enables lossless switching between the individual speech recognition systems used. During the first detection step, a process that follows the FIFO (first-in, first-out) principle continuously stores the most recent section of the speech stream, with a predetermined length, as a buffer section. The length of the buffer section depends on the detection speed of the first speech recognition system: it must be long enough to cover the time period between the utterance of the keyword and its detection (plus an additional safety margin). In the second detection step, which is triggered by the availability of the result of the first detection step, the speech stream is processed with a delay equal to this buffer section.
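A FIFO delay buffer of this kind can be sketched in a few lines. The names `DelayBuffer` and `buffer_length` are hypothetical, and the unit "frame" stands in for whatever granularity the speech data actually has; the sketch only shows the delay behavior, not any acoustic processing.

```python
from collections import deque


def buffer_length(detection_latency, safety_margin):
    """Frames to buffer: detection latency of the first system plus a margin."""
    return detection_latency + safety_margin


class DelayBuffer:
    """FIFO buffer that hands back each frame `length` pushes later."""

    def __init__(self, length):
        if length < 1:
            raise ValueError("length must be at least 1")
        self._frames = deque(maxlen=length)

    def push(self, frame):
        # Once the buffer is full, its oldest entry is exactly `length` frames old.
        full = len(self._frames) == self._frames.maxlen
        oldest = self._frames[0] if full else None
        self._frames.append(frame)   # deque(maxlen=...) drops the oldest entry itself
        return oldest


buf = DelayBuffer(buffer_length(detection_latency=1, safety_margin=1))
delayed = [buf.push(frame) for frame in "abcd"]
```

With a length of two, the first two pushes return `None` while the buffer fills; afterwards every push returns the frame that entered two pushes earlier.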

A particularly important application of the invention is represented by keyword sequences in which the first keyword or the first part is such that it is regularly followed by a section of the speech stream containing a number or numbers. In this case, a system specially adapted to the recognition of numbers or combinations of numbers is used as the second speech recognition system. For example, the terms "number", "telephone number", "date", "time" or the like can occur as the first keywords of a keyword sequence, and these terms will be followed by strings of digits or certain combinations of digits and words, for the recognition of which a system with a correspondingly limited vocabulary can be activated.

Another important field of application for the voice control of computers or computer peripherals is keyword sequences in which the first keyword names a class of devices (e.g. "device"), while further parts of the sequence name the specific devices or appliances that are to be activated in some way. Here, too, it is easy to see that the interlinked use of simple speech recognition systems with an extremely reduced vocabulary, and thus a very high level of recognition reliability, is possible.

In addition to the important application mentioned, the voice control of a computer or of computer peripherals, the voice control of other technical devices in the professional or private sector, for example devices in the car or in the household (such as navigation systems, audio or video systems, household appliances, telecommunications terminals, toys etc.), is also of great economic interest.

The device aspects of the proposed solution essentially result directly from the method aspects; for the rest, advantages and useful features of the invention result from the subclaims and the following description of preferred exemplary embodiments with reference to the figures. The figures show:

Fig. 1 a schematic illustration of a simple embodiment of the invention in the form of a functional block diagram,

Fig. 2 a graphical representation illustrating the principle of speech stream buffering according to an advantageous embodiment of the invention, and

Fig. 3 a schematic representation of a further embodiment in the form of a functional block diagram.

Fig. 1 schematically shows a speech recognition device 100 for the detection of keyword sequences in a continuous speech stream S. The speech stream S is divided at a branch point 101 into two partial speech streams S1 and S2 with identical information content. The partial speech stream S1 arrives directly at the input of a first speech recognition unit 102, specifically at a first input of a first detection stage 102a, to the second input of which a first vocabulary memory 102b is connected. The first detection stage 102a has a control output connected to a speech recognition sequence control 103 and a data output connected to a first keyword memory 104.

The second partial speech stream S2 arrives at the input of a ring speech buffer 105, in which the most recent section of the speech stream is temporarily stored and at whose output a partial speech stream S2', delayed by the buffered section, is output. This arrives at the input of a second speech recognition unit 106, which (analogously to the first speech recognition unit 102) consists of a second detection stage 106a and a second vocabulary memory 106b. The data output of the second detection stage 106a is connected to a second keyword memory 107. The outputs of both keyword memories 104, 107 are connected to inputs of a sequence memory 108, the output of which also represents the output of the device 100. The speech recognition sequence control 103 has two control outputs, which are connected to control inputs of the first and second speech recognition units 102 and 106, respectively.

The speech stream S (in the form of the partial speech stream S1 carrying the entire information content) is checked in the first speech recognition unit 102, which is activated by the speech recognition sequence control 103 at the start of the recognition process, to determine whether a word stored in the first vocabulary memory 102b occurs. If such a word occurs, this is registered in the first detection stage 102a, the word in question is transferred to the first keyword memory 104, and at the same time a control signal is output to the speech recognition sequence control 103. The latter thereupon deactivates the first speech recognition unit 102 and activates the second speech recognition unit 106, which has been inactive until then.

After passing through the ring speech buffer 105, the delayed partial speech stream S2' arrives at the input of the second speech recognition unit 106, where (like the partial speech stream S1 in the first speech recognition unit 102) it is checked for the occurrence of a second keyword from the set of words stored in the second vocabulary memory 106b. When such a second keyword is detected by the second detection stage 106a, it is output to the second keyword memory 107. At the same time, a control signal is output to the speech recognition sequence control 103, which then deactivates the second speech recognition unit 106 again and activates the first speech recognition unit 102 instead.

Furthermore, the speech recognition sequence control 103 controls the output of the words stored in the first and second keyword memories 104, 107 to the sequence memory 108, where they are stored in order and made available for output from the device 100. In this simple example, this completes the detection of a keyword sequence using two different speech recognition units, each with its own reduced vocabulary.
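The signal flow of device 100 can be traced with a word-level sketch. It is not the implementation described in the application: the class name `Device100`, the `feed` method and the `latency` parameter are hypothetical, words stand in for audio, and the sketch assumes that unit 102 reports a keyword exactly `latency` words after it was spoken and that the buffer delay equals that latency. Reference numerals from Fig. 1 appear only in comments.

```python
from collections import deque


class Device100:
    """Word-level sketch of the two-stage device (all names are illustrative)."""

    def __init__(self, vocab1, vocab2, latency):
        self.vocab1 = set(vocab1)               # first vocabulary memory 102b
        self.vocab2 = set(vocab2)               # second vocabulary memory 106b
        self.latency = latency                  # assumed detection latency of unit 102
        self.buffer = deque([None] * latency)   # ring speech buffer 105, pre-filled with silence
        self.active = 1                         # sequence control 103: active unit (1 or 2)
        self.pending = None                     # [keyword, words left until unit 102 reports it]
        self.kw1 = self.kw2 = None              # keyword memories 104 and 107
        self.sequences = []                     # sequence memory 108

    def feed(self, word):
        # Branch point 101: the same word goes to S1 and into the buffer (S2).
        self.buffer.append(word)
        delayed = self.buffer.popleft()         # S2': stream delayed by `latency` words

        if self.active == 1:
            if self.pending is None and word in self.vocab1:
                self.pending = [word, self.latency]
            if self.pending is not None:
                if self.pending[1] == 0:
                    self.kw1 = self.pending[0]  # store in keyword memory 104
                    self.pending = None
                    self.active = 2             # deactivate unit 102, activate unit 106
                else:
                    self.pending[1] -= 1
        elif delayed in self.vocab2:
            self.kw2 = delayed                  # store in keyword memory 107
            self.sequences.append((self.kw1, self.kw2))   # ordered copy to memory 108
            self.active = 1                     # deactivate unit 106, reactivate unit 102


device = Device100({"number"}, {"four", "six"}, latency=2)
for token in ["please", "number", "four", "six", None]:   # None stands for silence
    device.feed(token)
```

Unit 102 reports "number" only while "six" is already arriving, yet unit 106 still sees "four" first, because it reads the delayed stream. One trailing silence token is fed so that the buffer releases that word.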

The specific application of the proposed method and the device outlined above is to be outlined in more detail using a practically relevant application example:

The following word sequences are to be recognized:

- Enter telephone number <string of digits>
- Enter date <date>
- Enter time <time>
- Query device <device>

where the expressions in angle brackets have the following meaning:

<string of digits>: continuously consecutive digits
<date>: a date expression, e.g. "November 2nd 99"
<time>: a time expression, e.g. "10 to 9"
<device>: an element from a finite set of devices, e.g. "Computer"

The following speech recognition systems are created:

System 1: detection of the sequences "Enter telephone number", "Enter date", "Enter time", "Query device"
System 2: digit string recognizer
System 3: date recognizer
System 4: time recognizer
System 5: detection of the individual device names from a predetermined set

Depending on the result of system 1, one of systems 2 to 5 is activated. System 1 must also provide information about the end point (in time) of the recognized keyword sequence. When one of the systems 2 to 5 is activated, recognition continues at this point, so buffering is necessary. Furthermore, the recognition systems must at least keep pace with the incoming speech stream.
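The conditional activation of systems 2 to 5 by system 1 can be sketched as a table lookup. The sketch below works on word lists instead of audio; `take_while`, `SECOND_STAGE`, `recognize` and the device names are hypothetical, and the date and time recognizers (systems 3 and 4) are deliberately left out to keep it short.

```python
DIGITS = set("0123456789")
DEVICES = {"computer", "printer", "scanner"}   # hypothetical device names


def take_while(words, vocabulary):
    """Collect the leading words that belong to a small vocabulary."""
    taken = []
    for word in words:
        if word not in vocabulary:
            break
        taken.append(word)
    return taken


# Phrases known to system 1, each mapped to the follow-up system it activates.
# Systems 3 and 4 (date, time) are omitted in this sketch.
SECOND_STAGE = {
    ("enter", "telephone", "number"): lambda rest: take_while(rest, DIGITS),   # system 2
    ("query", "device"): lambda rest: take_while(rest, DEVICES)[:1],           # system 5
}


def recognize(words):
    """Find a system-1 phrase, then hand the remainder to the selected system."""
    for start in range(len(words)):
        for phrase, second_system in SECOND_STAGE.items():
            end = start + len(phrase)          # end point reported by system 1
            if tuple(words[start:end]) == phrase:
                return " ".join(phrase), second_system(words[end:])
    return None


result = recognize("please enter telephone number 4 6 3 1 thanks".split())
```

The follow-up system starts at the end point of the recognized phrase, which is why system 1 has to report that position and not just the phrase itself.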

The function of buffering the last section of the speech stream for seamless processing by the second speech recognition unit ("System 2") is outlined in FIG. 2. Here, t₀ denotes the time at which the first speech recognition unit ("System 1") detects a first keyword sequence "Enter telephone number", t_E the end point in time of this first keyword sequence, P_h,1 the position in the buffer up to which system 1 has read the voice data at time t₀, and P_h,2 the corresponding read position of system 2 at the same time t₀ (at which it is just being activated). The buffering thus ensures that the time taken by system 1 to detect the first keyword sequence, which corresponds to a section of the speech stream, does not lead to a loss of speech stream data. Without the buffering, the first two digits "4" and "6" in the example shown here would be lost to system 2 and would no longer be accessible to detection.
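The effect of the two read positions can be reproduced with list indices. In this sketch the stream is a list of words, `t_end` plays the role of t_E and `t_detect` the role of t₀; the function name and the digits after "6" are made up for illustration.

```python
def heard_by_system2(words, t_end, t_detect, buffered):
    """Words available to system 2 once it is activated at index t_detect.

    t_end is the index just past the first keyword sequence. With buffering,
    system 2 starts reading there; without it, reading starts at t_detect.
    """
    start = t_end if buffered else t_detect
    return words[start:]


# "3" and "1" are invented continuation digits, not taken from the figure.
stream = ["enter", "telephone", "number", "4", "6", "3", "1"]
without_buffer = heard_by_system2(stream, t_end=3, t_detect=5, buffered=False)
with_buffer = heard_by_system2(stream, t_end=3, t_detect=5, buffered=True)
```

Here system 1 needs two further words before it reports the phrase, so the unbuffered variant starts at "3" and misses "4" and "6", while the buffered variant receives all four digits.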

FIG. 3 shows a speech processing device 200 which is modified compared to the device from FIG. 1 and which is distinguished by a double cascading of speech recognition systems and a selection option for different systems in the second stage. Incidentally, the first and second stages with the components 201 to 208 are essentially the same as in the device according to FIG. 1 and are designated with corresponding reference numerals, and these components are not explained again here.

The sequence memory 208 is designed here, as symbolized by the division with two dashed vertical lines, to accommodate a three-part keyword sequence. The partial speech stream S2' from the (here: first) speech buffer 205 is branched at a branch point 209, on the one hand to the second detection stage 206a and on the other hand to a second speech buffer 210. There, the stream is buffered or delayed further, so that a partial speech stream S2.2″ (thus delayed twice) is available at the output. This is fed to the input of a third speech recognition unit 211, specifically a third detection stage 211a.

Like the first and second speech recognition units 202 and 206, the third speech recognition unit 211 also contains a specific vocabulary memory 211b, which is connected to a further input of the third detection stage 211a. Also analogously to the design of the first and second speech recognition units, the (third) detection stage is followed by a (third) keyword memory 212, which in turn is connected on the output side to the sequence memory 208. As can easily be derived from the above explanations for FIG. 1, the assemblies 210 to 212 implement a third step of recognizing a keyword sequence, which also corresponds to a third hierarchical level of the method.

It should also be pointed out that a selector stage 203S is connected to the output of the first detection stage (in addition to the first keyword memory 204). It is organized in the form of a lookup table, assigns one of several available second speech recognition units to each individually detected first keyword, and outputs the corresponding selection signal to the speech recognition sequence control 203. The dash-dotted arrows projecting upward from the latter indicate that, in addition to the second speech recognition unit 206 shown in the figure, other speech recognition units of the second level can optionally be controlled. Of course, just as the third speech recognition unit 211 is assigned to the second speech recognition unit 206 shown in the figure, these too can in turn be assigned speech recognition units of the third level. Furthermore, as can easily be seen, a similar selector stage can also be provided between the second and third levels, so that at this level a selected one of several available third speech recognition units can be activated as a function of the recognized second keyword or second part of a keyword sequence. Finally, cascading is also possible with a single buffer, the delay time of which is then variable and tends to have to be reduced in order to achieve processing that keeps pace with the speech stream.
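A selector stage at every level amounts to walking a tree of small vocabularies. The sketch below uses a nested dictionary for this; the tree contents, the name `TREE` and the function `cascade` are hypothetical, and buffering is ignored so that only the selection logic is shown.

```python
# Hypothetical three-level vocabulary tree: each detected keyword selects
# the vocabulary (that is, the recognition unit) used at the next level.
TREE = {
    "device": {
        "radio": {"louder": {}, "quieter": {}},
        "computer": {"start": {}, "shutdown": {}},
    },
    "number": {str(digit): {} for digit in range(10)},
}


def cascade(words, tree=TREE):
    """Walk the tree along the word stream; return the keywords found, in order."""
    sequence = []
    level = tree                      # vocabulary of the currently active unit
    for word in words:
        if word in level:
            sequence.append(word)     # keyword memory feeding the sequence memory
            level = level[word]       # selector stage: choose the next unit
            if not level:             # leaf reached: the sequence is complete
                break
    return sequence


three_part = cascade("please device uh radio much louder".split())
```

Words outside the currently active vocabulary are simply skipped, so "louder" spoken before "radio" would not be picked up at the second level.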

In other respects, the implementation of the invention is not limited to the examples above, but is also possible in a variety of modifications that lie within the scope of ordinary professional practice.

Claims

1. A method for speech recognition, in which a section of a continuous speech stream of spoken words is detected by comparison with stored patterns, characterized in that

- in a first detection step, the speech stream is checked for the occurrence of a predetermined first keyword or a first keyword sequence using a first speech recognition system,

- in response to the detection of a first keyword or a first keyword sequence, the latter is stored, the first speech recognition system is deactivated and a second speech recognition system is activated,

- in a second detection step, the speech stream is checked by means of the second speech recognition system for the occurrence of a predetermined second keyword or a second keyword sequence,

- in response to the detection of the second keyword or the second keyword sequence, this is stored, the second speech recognition system is deactivated and the first or a further speech recognition system is activated, and

- the stored first and second keywords or keyword sequences are combined and output or made available for output.

2. The method according to claim 1, characterized in that a selected one of several available second speech recognition systems is activated depending on the type of the first detected keyword or the first keyword sequence.

3. The method according to claim 1 or 2, characterized in that a time window in the speech stream is predetermined for the detection of the second keyword or the second keyword sequence.

4. The method according to any one of the preceding claims, characterized in that a respective last section of the speech stream is temporarily stored as a buffer section during the first detection step in a storage process and the second detection step is carried out with the speech stream delayed by the buffer section, the length in time of the buffer section being determined as a function of the detection time constant of the first speech recognition system.

5. The method according to any one of the preceding claims, characterized in that the first keyword or first keyword sequence is predetermined to be one that is regularly followed by a digit or a section containing digits as the second keyword or second keyword sequence, and that a speech recognition system adapted to digit recognition is used as the second speech recognition system.

6. The method according to claim 5, characterized in that one of the words "number", "telephone number", "date" or "time" is predetermined as the first keyword and the second keyword sequence is a string of digits, a date or a time.

7. The method according to any one of the preceding claims, characterized in that it has more than two detection steps, each using a specific speech recognition system.

8. The method according to any one of the preceding claims, characterized by the application for voice control of a computer or a device controlled by a computer or a telecommunications or consumer electronics device.

9. Device (100; 200) for carrying out the method according to one of the preceding claims, with a first speech recognition system (102; 202) for detecting the occurrence of a predetermined first keyword or a first keyword sequence in a continuous speech stream, a second speech recognition system (106; 206) for detecting the occurrence of a predetermined second keyword or a second keyword sequence following the first keyword or the first keyword sequence in the continuous speech stream, and a speech recognition sequence control (103; 203) for the initial activation of the first speech recognition system and for the conditional later activation of the second speech recognition system as a function of a detection result of the first speech recognition system, the first and second speech recognition systems having first and second vocabulary memories (102b, 106b; 202b, 206b) with different vocabularies.

10. The device according to claim 9, characterized by a buffer memory, in particular ring buffer (105; 205, 210) for buffering the continuous speech stream to bridge a processing time of the first speech recognition system (102; 202) for detecting the first keyword or the first keyword sequence.

11. Apparatus according to claim 9 or 10, characterized in that more than two speech recognition systems (202, 206, 211) are provided for the graded conditional detection of more than two interrelated keywords or keyword sequences.

12. Device according to one of claims 9 to 11, characterized by a keyword memory (104, 107; 204, 207, 212) assigned to each speech recognition system and a sequence memory (108; 208) connected to the keyword memories for the orderly storage of a sequence composed of the memory contents of the keyword memories.