NZ243055A - Speech recogniser: neural network and dynamically programmed pattern recogniser operate in parallel - Google Patents

Speech recogniser: neural network and dynamically programmed pattern recogniser operate in parallel

Info

Publication number
NZ243055A
NZ243055A
Authority
NZ
New Zealand
Prior art keywords
speech
neural network
recognition
vocabulary
pattern
Prior art date
Application number
NZ243055A
Inventor
Heidi Hackbarth
Original Assignee
Alcatel Australia
Priority date
Filing date
Publication date
Application filed by Alcatel Australia
Publication of NZ243055A


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks

Abstract

An apparatus and method for speech recognition with progressive expansion of the reference vocabulary, particularly for automatic telephone dialling by voice input. Neural and conventional recognition methods are combined with one another in such a manner that, during the training and (re-)configuration of the neural network, a conventional recogniser, which operates in accordance with the principle of dynamic programming, is provided with the added word patterns as references for immediate classification. After conclusion of the learning phase, the neural network takes over recognition of the entire vocabulary.

Description

<div class="application article clearfix" id="description"> <p class="printTableText" lang="en">NEW ZEALAND PATENTS ACT 1953 COMPLETE SPECIFICATION <br><br> "AN ARRANGEMENT AND METHOD FOR SPEECH RECOGNITION" <br><br> WE, ALCATEL AUSTRALIA LIMITED, (A.C.N. 000 005 363), a company of the State of New South Wales, of 280 Botany Road, Alexandria, New South Wales, 2015, Australia, hereby declare the invention for which we pray that a patent may be granted to us, and the method by which it is to be performed, to be particularly described in and by the following statement: <br><br> This invention relates to an arrangement with a neural network and a method for speech recognition by successive extension of the reference vocabulary. <br><br> It is well known that neural networks with a hierarchical network structure can be used for pattern recognition. In such networks, every element of a higher level is affected by elements of a lower level, with each element of a level typically being connected to every element of the level below it (e.g. A. Krause, H. Hackbarth, "Scaly Artificial Neural Networks for Speaker-Independent Recognition of Isolated Words", IEEE Proceedings of ICASSP 1989, Glasgow UK, with additional references). For speech recognition, neural networks offer the advantage of inherent robustness against disturbing noise, compared to conventional techniques. <br><br> A fundamental disadvantage of neural techniques, however, lies in the relatively long training phase with currently available computers. If, in actual use of a neural speech recogniser, a vocabulary extension by only one word is envisaged, the whole network must be re-trained: additional output elements are added and all weighting parameters are recalculated. 
This means that a word newly entered into the vocabulary can only be recognised after the end of this learning phase, which takes place off-line and, in some circumstances, can last for hours. <br><br> Among other conventional methods, so-called dynamic programming is well known for pattern recognition. In this case, a word first spoken for learning by the speech recogniser is immediately stored as a reference pattern in a speech-pattern memory (e.g. H. Ney, "The Use of a One-Stage Dynamic Programming Algorithm for Connected Word Recognition", IEEE Transactions on Acoustics, Speech and Signal Processing, April 1984, with additional references). This method provides the advantage that the reference pattern is available within a few seconds for classification by the speech recogniser. <br><br> However, a disadvantage of dynamic programming lies in its higher sensitivity to disturbing noise, compared to neural techniques. <br><br> Recognition in real time is provided by both methods for small speech vocabularies (approximately 70 to 100 words, depending on the processor capability). <br><br> An object of the present invention is to provide a robust speech recognition system which is immediately available after successive extension of the vocabulary with single words or groups of words. <br><br> According to a first aspect of the invention there is provided an arrangement with a neural network for speech recognition, wherein in addition to the neural network, a conventional recogniser is provided which operates in accordance with the principles of dynamic programming and immediately stores all newly-spoken words which are to be added to the vocabulary of the arrangement, simultaneously with their being processed in the neural network, in the form of reference patterns in a speech-pattern memory which can be accessed by the conventional recogniser for immediate classification of the reference pattern. 
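The conventional recogniser described above stores a newly spoken word as a reference pattern and classifies later utterances by dynamic-programming comparison against the stored templates. A minimal sketch of such a comparison follows, assuming each word is a sequence of per-frame feature vectors and using a Euclidean local distance; the feature representation and distance measure are illustrative assumptions, as the patent does not specify them.

```python
import math

def dtw_distance(template, utterance):
    """Align two frame sequences by dynamic programming and return
    the cumulative warping cost (lower means a better match)."""
    n, m = len(template), len(utterance)
    INF = float("inf")
    # cost[i][j]: best cumulative cost aligning template[:i] with utterance[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(template[i - 1], utterance[j - 1])  # local frame distance
            cost[i][j] = d + min(cost[i - 1][j],       # stretch the template
                                 cost[i][j - 1],       # stretch the utterance
                                 cost[i - 1][j - 1])   # advance both
    return cost[n][m]

def classify(utterance, references):
    """Return the name of the stored reference pattern with the lowest cost."""
    return min(references, key=lambda name: dtw_distance(references[name], utterance))

# Hypothetical one-dimensional "features" for two stored names:
refs = {"anna": [[0.0], [1.0], [2.0]], "otto": [[5.0], [5.0], [5.0]]}
print(classify([[0.1], [1.1], [2.1]], refs))  # prints "anna"
```

Because storing a template is just an insertion into `references`, a new word is available for classification within seconds, which is the property the invention exploits during the network's training phase.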
<br><br> According to a further aspect of the invention there is provided a method for speech recognition with successive extension of the reference vocabulary, wherein neural and conventional methods are combined such that a word first spoken for learning by the speech recognition device: <br><br> a) is stored in a speech-pattern memory as a reference pattern, and becomes available for immediate classification by a conventional recogniser operating according to the principle of dynamic programming, and <br><br> b) simultaneously initiates the training and (re)configuration of the neural network. <br><br> An extension of the vocabulary by means of the neural network requires a long-duration retraining of the whole network, taking up to several hours. The conventional recogniser, which operates in accordance with the principle of dynamic programming, is provided with the newly entered word patterns as a reference for use in immediate classification. <br><br> Preferably, the conventional recogniser can be activated for recognising a word spoken during the training phase of the neural network, either for the newly entered word patterns only, or for all word patterns, until the basically more robust neural network is trained for the extended word pattern and again takes over the recognition of the complete, now extended, vocabulary. With the aid of the invention it is possible to recognise speech also during the training phase of the neural network. <br><br> In order that the invention may be readily carried into effect, telephone dialling by means of speech input will be described with the aid of two flow chart diagrams, in which: <br><br> Figure 1 shows the very first speaking of the vocabulary N, as Case 1; <br><br> Figure 2 shows the extension by the vocabulary M, as Case 2. <br><br> When the vocabulary N is first spoken (Figure 1), 
the names spoken by the user (possibly several times) are stored in a speech-pattern memory as reference patterns. This takes place during a period of the order of seconds. At the same time as this, the newly-spoken names are processed in the neural network. In this situation, the neural network is trained and configured over a period of several hours. A name entered during the training of the neural network, for the purpose of dialling a telephone connection by speech, activates the conventional recogniser, which operates in accordance with the principle of dynamic programming, compares the just-spoken name with all the reference patterns stored in the speech-pattern memory, and carries out a classification. The training of the neural network continues to run as a background program. After completion of the training and configuration of the neural network, the recognition of the names spoken for the purpose of dialling a telephone connection takes place exclusively by means of the noise-resistant neural network. <br><br> If now the list of subscribers N capable of being dialled is to be extended by M names, as shown in Figure 2, the names spoken by the user are again stored as reference patterns. Simultaneously with this, it is required to extend the output layer of the neural network by M neural elements, to establish the corresponding connections and to retrain the weightings of the connections between all elements. This re-configuration of the neural network, in turn, takes several hours. However, the previous neural network, trained for N, remains intact. <br><br> If the user wishes to use the automatic dialling facility during this learning phase, and speaks a name, the retraining is interrupted and both recognisers are activated. On the one hand, the "old" neural network compares the just-spoken name with the N-vocabulary. The result is a proposed word with a probability value. 
On the other hand, the just-spoken name is compared conventionally with the M newly-entered reference patterns in the speech-pattern memory, and there also a word is proposed with a probability value. In this situation the classical algorithm determines with which of the newly acquired names the just-spoken one agrees best, and how well it does so. The larger of these two values, after appropriate normalisation, determines the word probably spoken, so that finally a single candidate is provided for the spoken name. After conclusion of the learning phase, the more robust neural network again takes over the recognition of the whole vocabulary N + M. <br><br> The method described allows for some variants: during an input of speech (use of the automatic dialling facility), while the neural network is being extended by the vocabulary M and retrained, the conventional recogniser takes over the classification of the entire vocabulary, i.e. the comparison of the reference patterns for N and M. The method then proceeds in the manner described with the aid of Figure 1 in connection with the very first speaking of the vocabulary. The previous neural network, trained for N, does not need to be retained in this case. It also becomes possible to dispense with the costly probability determinations and their normalisation for the purpose of combining the two methods. Admittedly, this simplification can lead to diminished resistance to disturbing noise. <br><br></p> </div>
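The parallel decision described in the paragraphs above can be sketched as follows: the old network scores the N-vocabulary, the dynamic-programming recogniser scores the M newly entered names, and the larger normalised probability wins. The normalisation choices here (softmax over network outputs, `exp(-cost)` over warping costs) are illustrative assumptions; the patent only requires "appropriate normalisation".

```python
import math

def nn_best(outputs):
    """Softmax-normalise raw network outputs (dict: word -> activation)
    and return the most probable word with its probability."""
    z = sum(math.exp(v) for v in outputs.values())
    word = max(outputs, key=outputs.get)
    return word, math.exp(outputs[word]) / z

def dp_best(costs):
    """Turn DTW warping costs (dict: word -> cost, lower is better) into
    normalised pseudo-probabilities and return the best word with its value."""
    weights = {w: math.exp(-c) for w, c in costs.items()}
    word = min(costs, key=costs.get)
    return word, weights[word] / sum(weights.values())

def recognise(nn_outputs, dp_costs):
    """Run both recognisers' decisions in parallel, as in Figure 2, and
    keep the candidate with the larger normalised probability."""
    nn_word, nn_p = nn_best(nn_outputs)
    dp_word, dp_p = dp_best(dp_costs)
    return nn_word if nn_p >= dp_p else dp_word

# Hypothetical scores: the old network is confident about "anna" from the
# N-vocabulary, while the DP recogniser is less decisive about the M new names.
print(recognise({"anna": 2.0, "otto": 0.1}, {"karl": 8.0, "else": 9.0}))  # prints "anna"
```

Note that the simplified variant in the last paragraph above drops this normalisation entirely by letting the conventional recogniser cover N and M alone during retraining, at the cost of reduced noise robustness.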

Claims (5)

<div class="application article clearfix printTableText" id="claims"> <p lang="en">What we claim is:-<br><br>
1. A method of speech recognition with successive expansion of a reference vocabulary, including a speech recognition device comprising a neural network and a speech recognizer, wherein in response to a word being spoken for the first time to train the speech recognition device the method comprises:<br><br> a) storing the word spoken for the first time as a new reference pattern in a speech pattern memory and making this new reference pattern available for immediate use in making a recognition decision by the speech recognizer operating according to dynamic programming principles; and<br><br> b) simultaneously initiating training and configuration of the neural network to subsequently recognize the word spoken for the first time, wherein an already existing neural network is maintained until the training and configuration of the neural network are completed;<br><br> a word spoken, during the training of the neural network, for recognition by the speech recognition device interrupts the training of the neural network and activates the existing neural network to furnish a first probability value from a previous vocabulary for the word spoken during training and simultaneously activates the speech recognizer which compares the word spoken during training with the new reference pattern from the speech pattern memory and determines a second probability value; and the first and second probability values are standardized and compared with one another to make a recognition decision.<br><br>
2. A method as claimed in claim 1, wherein a word spoken during the training of the neural network, for recognition by the speech recognition device, activates only the speech recognizer which thereafter compares the word spoken during training with all reference patterns from the speech pattern memory, including the new reference speech pattern, and makes a recognition decision.<br><br>
3. A method as claimed in claim 2, wherein upon completion of the training and configuration of the neural network, the neural network exclusively takes over recognition using the now expanded vocabulary.<br><br>
4. A speech recognition device comprising:<br><br> pattern recognition means, for receiving input speech, processing the input speech to form reference speech patterns, storing the reference speech patterns as a first vocabulary during an initial training operation, for subsequently receiving input speech, processing the input speech to form input speech patterns, and comparing the input speech patterns with the previously processed reference speech patterns to look for a match during a recognition operation, and for forming an expanded vocabulary of reference speech patterns, including the first vocabulary reference speech patterns and at least one new reference speech pattern; and<br><br> neural network means operating in parallel with the pattern recognition means, for receiving the input speech and processing the input speech to form a first neural network corresponding to the first vocabulary during an initial configuration operation, for subsequently receiving input speech and processing the input speech to reach a recognition decision, and for reconfiguring the neural network to form a second expanded neural network, including the first neural network corresponding to the first vocabulary, when subsequently input speech is received which does not result in a positive recognition decision, wherein during the reconfiguration of the neural network, upon the inputting of speech, the reconfiguration is temporarily stopped, speech recognition operations are performed on the input speech by both the pattern recognition means, using the expanded vocabulary, and the neural network means, using the first neural network, in parallel, results of the respective recognition operations are assigned probability values, the probability values are compared, the result having the highest probability value is selected, and the reconfiguration of the neural network is subsequently continued.<br><br>
5. A method substantially as herein described with reference to Figures 1 - 2 of the accompanying drawings.<br><br> ALCATEL AUSTRALIA LIMITED (A.C.N. 000 005 363)<br><br> B. O'Connor, Authorized Agent<br><br> P5/1/1703<br><br> </p> </div>
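The reconfiguration step recited in claim 4 (and in the description of Figure 2) extends the output layer by M elements while the network trained for the first vocabulary remains intact and usable. A minimal sketch of that step, representing only the output-layer weights as nested lists; the layer sizes and the small random initialisation of the new connections are illustrative assumptions.

```python
import random

def extend_output_layer(weights, m, n_hidden):
    """Return a new output weight matrix with m extra rows.

    weights: existing output weights, one row of n_hidden values per word.
    The old rows are copied unchanged, so the previous network can keep
    serving the original vocabulary while retraining runs in the background.
    """
    old = [row[:] for row in weights]  # keep the first network intact
    new_rows = [[random.uniform(-0.1, 0.1) for _ in range(n_hidden)]
                for _ in range(m)]     # fresh connections for the M new words
    return old + new_rows

w = [[0.2, -0.1], [0.4, 0.3]]      # network trained for N = 2 words
w2 = extend_output_layer(w, 3, 2)  # extend by M = 3 words; w is untouched
```

Retraining then updates all weights of `w2`, which is why it can take hours, while `w` continues to answer recognition requests for the old vocabulary in the meantime.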
NZ243055A 1991-06-20 1992-06-08 Speech recogniser: neural network and dynamically programmed pattern recogniser operate in parallel NZ243055A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
DE4120308A DE4120308A1 (en) 1991-06-20 1991-06-20 DEVICE AND METHOD FOR RECOGNIZING LANGUAGE

Publications (1)

Publication Number Publication Date
NZ243055A true NZ243055A (en) 1995-08-28

Family

ID=6434327

Family Applications (1)

Application Number Title Priority Date Filing Date
NZ243055A NZ243055A (en) 1991-06-20 1992-06-08 Speech recogniser: neural network and dynamically programmed pattern recogniser operate in parallel

Country Status (5)

Country Link
EP (1) EP0519360B1 (en)
AT (1) ATE148253T1 (en)
AU (1) AU658635B2 (en)
DE (2) DE4120308A1 (en)
NZ (1) NZ243055A (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE19540859A1 (en) * 1995-11-03 1997-05-28 Thomson Brandt Gmbh Removing unwanted speech components from mixed sound signal
DE19942869A1 (en) * 1999-09-08 2001-03-15 Volkswagen Ag Operating method for speech-controlled device for motor vehicle involves ad hoc generation and allocation of new speech patterns using adaptive transcription

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4949382A (en) * 1988-10-05 1990-08-14 Griggs Talkwriter Corporation Speech-controlled phonetic typewriter or display device having circuitry for analyzing fast and slow speech
GB8908205D0 (en) * 1989-04-12 1989-05-24 Smiths Industries Plc Speech recognition apparatus and methods

Also Published As

Publication number Publication date
EP0519360A3 (en) 1993-02-10
EP0519360B1 (en) 1997-01-22
AU658635B2 (en) 1995-04-27
DE4120308A1 (en) 1992-12-24
ATE148253T1 (en) 1997-02-15
AU1828392A (en) 1992-12-24
DE59207925D1 (en) 1997-03-06
EP0519360A2 (en) 1992-12-23

Similar Documents

Publication Publication Date Title
US5842165A (en) Methods and apparatus for generating and using garbage models for speaker dependent speech recognition purposes
US5895448A (en) Methods and apparatus for generating and using speaker independent garbage models for speaker dependent speech recognition purpose
CN1248192C (en) Semi-monitoring speaker self-adaption
US5983177A (en) Method and apparatus for obtaining transcriptions from multiple training utterances
JP4180110B2 (en) Language recognition
JP2963142B2 (en) Signal processing method
US5452397A (en) Method and system for preventing entry of confusingly similar phases in a voice recognition system vocabulary list
DE60125542T2 (en) SYSTEM AND METHOD FOR VOICE RECOGNITION WITH A VARIETY OF LANGUAGE RECOGNITION DEVICES
US6076054A (en) Methods and apparatus for generating and using out of vocabulary word models for speaker dependent speech recognition
US5664058A (en) Method of training a speaker-dependent speech recognizer with automated supervision of training sufficiency
EP1022725B1 (en) Selection of acoustic models using speaker verification
US20020091522A1 (en) System and method for hybrid voice recognition
CA2136369A1 (en) Large vocabulary connected speech recognition system and method of language representation using evolutional grammar to represent context free grammars
JPH0422276B2 (en)
JP2001509285A (en) Method and apparatus for operating voice controlled functions of a multi-station network using speaker dependent and speaker independent speech recognition
CN112634867A (en) Model training method, dialect recognition method, device, server and storage medium
US5758021A (en) Speech recognition combining dynamic programming and neural network techniques
EP1205906B1 (en) Reference templates adaptation for speech recognition
EP1159735B1 (en) Voice recognition rejection scheme
CN105472159A (en) Multi-user unlocking method and device
JP2003535366A (en) Rank-based rejection for pattern classification
US6226610B1 (en) DP Pattern matching which determines current path propagation using the amount of path overlap to the subsequent time point
NZ243055A (en) Speech recogniser: neural network and dynamically programmed pattern recogniser operate in parallel
JP2002524777A (en) Voice dialing method and system
WO1999028898A1 (en) Speech recognition method and system