CODING AND TRAINING OF THE VOCABULARY FOR SPEECH RECOGNITION
The invention relates to a method, a device and a computer program for recognizing a word from a speech signal. In particular, the invention relates to a method, a device and a computer program wherein the words to be recognized can be trained.
According to the prior art, EP 0 797 185 discloses a method and a device for word recognition that permit word recognition by a speaker-independent method also for speech signals newly added by a user. For this purpose, each new speech signal to be picked up is mapped onto stored phoneme feature vectors. Several candidates for an appropriate phoneme representation of the speech signal are compared with each other. Thereafter, the speech signal, including the phoneme representation, is stored in a user memory.
Essential for methods and devices for speaker-dependent word recognition is the memory space required for the static storage of speech samples. A corresponding code book grows proportionally to the number of speech samples per word to be recognized, to the number of words to be recognized and to the number of users. The storage requires a static direct access memory, or in other words a static RAM, which has disadvantages over a read-only memory (ROM) in terms of cost and current consumption.
The method according to the above-mentioned prior art reduces the necessary memory space by mapping newly spoken words onto already stored phonemes. Apart from the basic vocabulary, which is present in the form of already stored phonemes and words, the described method moreover requires an additional user vocabulary for obtaining good recognition results, which also contains phoneme representations and words that have, however, to be generated in a speaker-dependent manner.
Furthermore, a phoneme has to be represented by means of several speech feature vectors. In order to obtain error-tolerant word recognition it is necessary to provide numerous different representations of the same phoneme in the memory, whereby the amount of required memory space is increased.
Accordingly, it is the object of the invention to provide a method and a device for word recognition on the basis of a speech signal, wherein the storage of the comparative data required for the word recognition is memory-space efficient and inexpensive.
According to the invention said object is provided by the teaching of patent claims 1, 10 and 16.
It thereby proves to be advantageous that speech feature vectors are stored in a code book memory, which can be accessed by means of allocated codes. The speech feature vectors stored in the code book memory will hereinafter be called code book feature vectors. The code book memory can be factory-made and can, therefore, comprise an optimal representative set of the basic elements required for speech recognition, or in other words, of speech feature vectors. The memory size of the code book depends neither on the number of users nor on the training of the speaker-dependent system.
Moreover it is advantageous that the code book feature vectors stored in the code book memory are basic elements required for the computer-based word recognition. In contrast to phoneme- or word-based code book memories, where the phonemes or words are each described by at least one sequence of feature vectors, a memory-efficient code book composition is thereby achieved. Moreover, the code book consisting of linguistic basic elements can be used universally, as it enables not only the representation of individual phonemes or words of the human language, but rather the representation of the human language as such.
It is another advantage that digitalized words and allocated code sequences are stored in a reference memory. Each code of a code sequence references a code book feature vector. This results in a significant memory space saving for the reference memory. A typical dimension of a code book feature vector, or respectively of a speech feature vector, is 16, wherein each component requires at least 1 byte of memory space. Thus a feature vector requires at least 16 bytes of memory space. In contrast, a code typically requires only 2 bytes of memory space, which in the present invention reduces the required reference memory space by a factor of 8, since codes rather than speech feature vectors are stored.
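The saving claimed above can be sketched as a small back-of-the-envelope calculation; the sizes used are the ones assumed in the text, not fixed by any particular implementation:

```python
# Memory saving of storing a code instead of a full speech feature
# vector in the reference memory (sizes as assumed in the description).
VECTOR_DIM = 16          # components per speech feature vector
BYTES_PER_COMPONENT = 1  # at least 1 byte per component
BYTES_PER_CODE = 2       # a code is typically a 16-bit index

vector_bytes = VECTOR_DIM * BYTES_PER_COMPONENT  # 16 bytes per vector
code_bytes = BYTES_PER_CODE                      # 2 bytes per code

reduction_factor = vector_bytes // code_bytes
print(reduction_factor)  # -> 8
```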
Additional advantageous embodiments are disclosed in patent claims 2 to 9, 11 to 15 and 17.
According to patent claim 2 it is an advantage that the similarity values determined within the scope of word recognition by comparing speech feature vectors of a spoken speech signal with code book feature vectors are intermediately stored in the reference memory. By allocating the similarity values to the corresponding code sequences, which are already present in the reference memory, a duplicate storage of said code sequences can be avoided, and a separate buffer for similarity values can be dispensed with altogether.
According to patent claim 3 it is advantageous that as a result of the word recognition not only a stored digitalized word corresponding to the speech signal is recognized, but also an additional word allocated to said word. This enables the flexible use of the word recognition, particularly in a speech-controlled system, wherein the word recognized from the speech signal is a plain-text control command, which can, for instance, be output on a screen or a control device, and wherein the additional word is a character string of control codes for controlling the system.
According to patent claim 4 it is advantageous that the second digitalized word is a digitalized phone number, which enables the use in a communication terminal, for instance a mobile phone, for a speech-controlled call set-up, wherein a name is spoken and the mobile phone sets up a connection by means of the phone number allocated to the recognized name.
According to patent claim 5 it is advantageous that the contents of the code book memory are speaker-independent so that an efficient factory-made generation of the code book memory contents is supported. It is also advantageous that the reference memory contents are generated in a speaker-dependent manner allowing a high recognition rate in word recognition.
According to patent claim 6 and patent claim 11 it proves to be an advantage that the contents of the code book memory do not depend on the language. This means that words in different languages such as German or English are recognized by means of the same code book memory contents. It is unnecessary to use different code book memory contents for different languages. Thus a code book according to the invention can constitute a speaker-independent and language-independent representation of the human language.
According to patent claim 7 it is an advantage that the speech feature vectors stored in the code book memory can be accessed by means of allocated codes. Therefore, the code book memory contents need not be changed for training a new word; instead, a sequence of suitable references, or in other words a suitable sequence of codes, is added to the reference memory. It is moreover advantageous that, due to the factory-made generation of the code book memory, no new speech feature vectors have to be generated and stored for training a new word; rather, the training can be performed solely with the speech feature vectors previously defined in the code book memory. This is memory-space efficient and can reduce the complexity and the duration of the training process.
According to patent claim 8 and patent claim 15 it is advantageous that a phone number can be allocated to a newly trained word. Especially when the word recognition according to the invention is used in a communication terminal, this enables a speech-controlled connection set-up to a subscriber who is identified to the device by name.
According to patent claim 9 the extraction of a speech feature vector results in a vector comprising both spectral components and an energy component. This enables a good recognition rate for spoken words with a high noise portion in the speech signal, as well as, through the energy normalization, for words spoken at different volumes.
According to patent claim 12 it is advantageous that the reference memory is a direct access memory (RAM) and that the code book memory is a read-only memory (ROM). Common direct access memories have a lower storage density than common read-only memories and, therefore, a higher memory space requirement for identical memory capacities; they also have a higher current consumption and are more expensive. Thus, it is advantageous according to the present invention to store the code book, which need not be changed when training new words and which typically has a clearly larger memory space requirement than the references, in a ROM, while the references, which can be supplemented during training, are stored in a - preferably static - RAM.
According to patent claim 13 it is an advantage that by combining at least 4 of the described components of the computer unit on a chip, space is saved and costs are reduced.

According to patent claim 14 it is an advantage that the inventive computer unit for word recognition is a computer unit of a mobile phone. Thus, the mobile phone can be operated with speech inputs instead of keyboard commands, which, especially in connection with common hands-free telephones, for instance in a car, is user-friendly and an important safety aspect, for example in road traffic.
According to patent claim 17 it proves to be advantageous that, due to storage on a computer-readable medium, the computer program is not bound to a single computer unit or a single computer, but can simply be read in by different computers, for example, for the purpose of simulation, testing or manufacture.
In the following the invention is described in more detail by means of embodiments and figures, wherein
Fig. 1 shows a block diagram of a simplified illustration of speech signal processing in a mobile phone,
Fig. 2 shows a block diagram of a computer unit for word recognition,
Fig. 3 shows an illustration of exemplary memory contents and of a word allocation,
Fig. 4 shows a flow chart of the process of word recognition,
Fig. 5 shows a flow chart of the training process for a new word,
Fig. 6 shows a flow chart of the extraction of a speech feature from a speech signal.
In the following the invention is explained in more detail by means of figure 1. Figure 1 shows, by means of an example, the use of the inventive computer unit for word recognition in a mobile phone. The functional units of the mobile phone shown are simplified and have only been illustrated to the extent necessary for comprehending the invention.
Figure 1 illustrates a speech signal S, a microphone MI, a control unit CR, an inventive computer unit RE, a display DI as output device, a keyboard KB as input device, an additional optional input device FD, which may for instance be a mouse or a track ball, a device for setting up the call CS, a sending means SE and an antenna AT.
The control unit is coupled with the microphone MI, with the input and output devices DI, KB and FD, as well as with the computer unit RE and the call set-up device CS. The call set-up device CS is moreover coupled with the computer unit RE and the sending device SE. The sending device SE is further coupled with the antenna AT.
The control unit CR controls the processing of the speech signal S picked up via the microphone MI. The speech signal is forwarded to the computer unit RE if a user has selected the mode "recognition", for instance, by means of the keyboard. The control unit CR signals the mode to the computer unit RE, which then performs a word recognition according to the invention. As a result a phone number is forwarded to the device CS, which sets up a call via the sender SE by means of said phone number. The device CS signals the effected call set-up to the control unit CR, which switches into the mode "transparent" and connects the speech signal to the sender for the further handling of the call.
The result of the word recognition can be read out on the display DI for checking purposes by the user. The direct recognition result, i.e. the word described by the spoken speech signal S, can be displayed. As an alternative, an additional word allocated to the recognized word can also be displayed. If a user speaks the trained name of a subscriber into the microphone MI, the corresponding digitalized subscriber name is first recognized on the basis of the speech signal S and displayed as an intermediate result. In the given example a phone number is allocated to the subscriber name as additional word; this phone number, as the final result of the word recognition, is used for the call set-up and can equally be read out on the display DI.
The term "word" usually denotes a meaningful tonal unit. In view of the invention, however, the term is used broadly and designates a finite character string having at least one element. Therefore, a telephone number is explicitly to be understood as a word in terms of the present invention.
When the control unit CR illustrated in figure 1 is in the mode "transparent", the speech signal S is forwarded directly, i.e. transparently with respect to the computer unit RE, from the control unit to the sender SE.
The control unit CR enters the mode "training", for example, by means of a corresponding user input or as a result of a failed word recognition. Thereafter a picked-up speech signal S is forwarded to the computer unit RE for implementing a training sequence, i.e. for picking up a new word described by the speech signal. The control unit CR thereby signals the mode "training" to the computer unit RE. The word corresponding to the speech signal S is entered via an input device KB, FD. In addition, another word, for example a phone number, can be entered and allocated correspondingly.
Figure 2 shows the inventive computer unit RE. It comprises an extraction unit ME for extracting speech feature vectors from the speech signal S, a code book memory CS including factory-made, speaker-independently stored code book feature vectors and codes, a reference memory RS for the speaker-dependent storage of digitalized words and code sequences, a read-out unit AE for reading out code book feature vectors from the code book memory, a comparing means VG for comparing speech feature vectors extracted from the speech signal with code book feature vectors of the code book memory, and a recognition unit DT for recognizing, as recognition result ER, a digitalized word from the reference memory RS by which the speech signal S is described.
To each of the code book feature vectors stored in the code book memory CS a code is allocated. To each digitalized word stored in the reference memory RS at least one sequence of codes is allocated. The code book feature vectors are read out of the code book memory CS by means of the read-out unit AE with the aid of the codes of the code sequences stored in the reference memory RS.
According to an advantageous embodiment of the inventive computer unit RE, at least four of the components extraction unit ME, code book memory CS, reference memory RS, comparing means VG, read-out unit AE and recognition unit DT are realized on a chip. In one embodiment, all mentioned components with the exception of the memories can be implemented on a chip.
In another advantageous embodiment of the inventive computer unit RE, the reference memory RS is a direct access memory (RAM). Preferably a so-called non-volatile, i.e. static, RAM is used, so that the speaker-dependently stored digitalized words and codes are retained even when the supply voltage is switched off. In contrast to the reference memory contents, which can be generated, changed or expanded by a user, the contents of the code book memory CS are factory-made, cannot be changed and are pre-specified in a speaker-independent manner. Therefore, a read-only memory (ROM) is preferably used for the code book memory CS in this embodiment.
In the following, the generation of the code book feature vectors of the code book memory, i.e. of the code book, is explained (without figure). Said process comprises the provision of an initial code book with code book feature vectors and the optimization of the initial code book. The methods indicated in this respect are examples and are by no means to be understood as restrictions.
An already existing code book having the desired number of code book feature vectors can be used as the initial code book. If the initial code book comprises a smaller number of code book feature vectors than desired, it can successively be expanded up to the desired number. This can be done by means of a random selection of feature vectors from a training vector sequence. As an alternative, the initial code book, whose code book feature vectors each comprise K elements and thereby span a K-dimensional space, can be generated from several code books for independent sub-spaces of the K-dimensional space. "Splitting" is another possibility, whereby a code book feature vector is replaced by two code book feature vectors. This can, for instance, be done by copying the vector, subsequently adding an offset to all components of one copy and subtracting the same offset from all components of the other copy. In the case of "binary splitting" all code book feature vectors present in the code book are each divided according to this principle until the desired number of code book feature vectors, which can be represented as a power of two, is obtained. In the case of "single splitting", however, only one code book feature vector is divided, so that any code book size can be achieved.
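A minimal sketch of the binary-splitting step described above; the offset value and the example vector are illustrative assumptions, not values prescribed by the invention:

```python
def binary_split(codebook, offset=0.01):
    """One round of 'binary splitting': every code book feature vector
    is replaced by two copies, one shifted by +offset and one by
    -offset in every component. Each round doubles the code book size,
    so the final size is a power of two."""
    new_codebook = []
    for vec in codebook:
        new_codebook.append([x + offset for x in vec])
        new_codebook.append([x - offset for x in vec])
    return new_codebook

# Example: grow a single-vector code book to 4 vectors in two rounds.
codebook = [[0.5, 0.5, 0.5, 0.5]]
codebook = binary_split(binary_split(codebook))
print(len(codebook))  # -> 4
```

Single splitting would apply the same copy-and-offset step to only one vector per round, so the code book can grow to any size.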
One possibility for code book optimization is the use of the so-called Linde-Buzo-Gray (LBG) algorithm, which is known to the person skilled in the art. Said algorithm for signal coding can iteratively optimize the partition limits of a code book so that the error occurring for the used training sequence is minimized. Alternatives are the use of the so-called K-means algorithm and the use of the LBG algorithm within the scope of splitting already during the generation of the initial code book.
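One LBG/K-means style refinement iteration could be sketched as follows; the squared-distance measure and the tiny training set are illustrative assumptions:

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def lbg_iteration(codebook, training_vectors):
    """One LBG/K-means style iteration: assign every training vector to
    its nearest code book vector, then move each code book vector to
    the centroid of the vectors assigned to it."""
    cells = [[] for _ in codebook]
    for tv in training_vectors:
        best = min(range(len(codebook)),
                   key=lambda k: squared_distance(tv, codebook[k]))
        cells[best].append(tv)
    new_codebook = []
    for center, cell in zip(codebook, cells):
        if not cell:                  # empty cell: keep the old vector
            new_codebook.append(center)
            continue
        dim = len(center)
        new_codebook.append([sum(v[i] for v in cell) / len(cell)
                             for i in range(dim)])
    return new_codebook

training = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [0.8, 1.0]]
codebook = lbg_iteration([[0.0, 0.1], [1.0, 0.9]], training)
```

Iterating this step until the overall quantization error stops improving yields the locally optimal code book the LBG algorithm produces.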
The use of a language-independent initial code book, of multi-language training sequences, and of the optimization algorithm itself results in language-independent code book feature vectors, so that the inventive word recognition is not restricted to words in a single language.
Figure 3 illustrates, by means of data simplified in structure and contents, an example of the memory contents of the code book memory CS and the reference memory RS. Moreover, an example of a word allocation is illustrated.
The code book memory CS contains code book feature vectors and uniquely allocated codes. For instance, code 1 is allocated to the code book feature vector a consisting of the components a1, a2, a3 and a4, code 2 is allocated to the code book feature vector b consisting of the components b1, b2, b3 and b4, etc. In general, however, neither the codes nor the individual elements of the code book feature vectors are restricted to a particular data type. A code can, for example, also be represented alphanumerically. The number of elements of a code book feature vector depends on the method used for the extraction of features and can vary within a code book memory.
The reference memory RS contains words ("IRIS", "EPA" ...), to each of which usually at least one code sequence is allocated. For instance, the code sequence "22324" is allocated to the word "IRIS". Several code sequences can also be allocated to a word in order to reduce the error rate in word recognition, particularly in the case of a varying pronunciation by the user. A word without an allocated code sequence is not trained and cannot be recognized by word recognition, although it may be stored in the reference memory RS. A code sequence consists of codes for code book feature vectors. The number of codes in a code sequence can vary, i.e. the code sequences of the reference memory can have different lengths. The positions of the codes within a code sequence are uniquely specified, and each single code of a code sequence can be read out of the reference memory via its position. Position 1 of the code sequence allocated to the word "IRIS", for example, holds code "2", position 2 code "2", position 3 code "3", etc. In the embodiment of the present invention illustrated in figure 3, phone numbers are uniquely allocated to the words in the reference memory as additional words. Thus it is possible by means of the present invention, within the scope of word recognition, to recognize from a spoken speech signal first a name as a first word and then a second word, for instance, as indicated, a phone number.
The relationships between the code book memory contents and the reference memory contents are shown in figure 3 by means of the word "IRIS". The code sequence "22324" is allocated to the word "IRIS" in the reference memory. If the codes of the code sequence are replaced by the corresponding code book feature vectors of the code book memory, the illustrated code book feature vector sequence is obtained. The phone number "0171234" from the reference memory is thus allocated to said sequence and simultaneously to the word "IRIS".
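The allocation illustrated in figure 3 can be mimicked with simple dictionaries; the data below are hypothetical placeholders mirroring the example, not an actual memory layout:

```python
# Code book: codes mapped to feature vectors (components are
# placeholder labels here, standing in for numeric values).
code_book = {
    "1": ["a1", "a2", "a3", "a4"],
    "2": ["b1", "b2", "b3", "b4"],
    "3": ["c1", "c2", "c3", "c4"],
    "4": ["d1", "d2", "d3", "d4"],
}

# Reference memory: words mapped to a code sequence and, optionally,
# to an additional word such as a phone number.
reference_memory = {
    "IRIS": {"codes": "22324", "extra_word": "0171234"},
}

entry = reference_memory["IRIS"]
# Replacing each code by its code book feature vector yields the code
# book feature vector sequence representing the word "IRIS".
vector_sequence = [code_book[c] for c in entry["codes"]]
phone_number = entry["extra_word"]
```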
Figure 4 illustrates the inventive method for the computer-based recognition of a word from a speech signal. In general the present invention is based on a speech signal which essentially contains only relevant information and which describes a digitalized word from the reference memory. The first prerequisite can be fulfilled by filtering out interfering noises. This can be done either prior to the execution of the inventive method or during the extraction of the speech features. Corresponding methods are known to the person skilled in the art. The same applies to methods for determining the beginning and end of a word, which can be applied together with the inventive method if the word recognition is applied to continuous speech, or in other words to a sequence of words.
For the word recognition illustrated in figure 4, the speech signal 400 is divided into a finite sequence of time frames TS. This can be effected either while picking up the speech signal, i.e. during so-called real-time processing, or by means of buffering. The time sequence of the time frames TS is thereby uniquely described by the allocation of an index i. The first time frame TS1, for instance, has the index value 1, the second time frame TS2 has the index value 2, etc.
For word recognition a speech feature vector SVi is extracted 410 from at least one time frame TSi and is buffered 420. The time frames to be used can be pre-specified. This means that both the number of time frames used for the word recognition (e.g. 10 time frames, preferably all time frames, or at least one time frame) and the kind of selection (e.g. every second or every third time frame) are pre-specified.
For comparing a speech feature vector SVi obtained in this way with code book feature vectors CV, for each code sequence CFj 430 stored in the reference memory that code Cji is read out 440 whose position in the code sequence CFj is allocated to the index value i of the time frame TSi. The index j serves to differentiate the code sequences CF of the reference memory and in the indicated example runs from the value 1 to the number of present code sequences CF. For each of the codes Cji recognized in this way, the corresponding, i.e. allocated, code book feature vector CVC is read out of the code book memory CS 450 and compared 460 with the buffered speech feature vector SVi.
The result of a respective comparison is a similarity value 460 indicating the similarity between the speech feature vector SVi and the code book feature vector CVC. The similarity value can, for example, be a probability value of the correspondence or the absolute value of the difference of the vectors SVi and CVC. The respective similarity value is buffered 470 such that it is uniquely allocated to the word described by the code sequence CFj in the reference memory RS. The buffering is preferably done in the reference memory RS.
Each extracted speech feature vector SVi thus results in a number of similarity values corresponding to the number of code sequences CF in the reference memory RS.
From the similarity values determined by the comparisons, the word stored in the reference memory RS by which the speech signal is described is recognized 480. This can be done on the basis of the similarity values by different methods known to the person skilled in the art, for example, so-called dynamic programming or the so-called Viterbi algorithm.
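The frame-wise comparison described above can be sketched as follows. The squared Euclidean distance used as similarity value and the plain accumulation of per-frame values are illustrative assumptions; a real system would time-align frames and code positions with dynamic programming or the Viterbi algorithm, as noted above:

```python
def euclidean_sq(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def recognize(feature_vectors, code_book, reference_memory):
    """Frame-wise recognition: for time frame i the code at position i
    of every stored code sequence is resolved via the code book and
    compared with the extracted speech feature vector. With a distance
    as similarity value, the smallest accumulated value wins."""
    scores = {}
    for word, code_sequence in reference_memory.items():
        total = 0.0
        for i, sv in enumerate(feature_vectors):
            if i >= len(code_sequence):
                break
            cv = code_book[code_sequence[i]]   # read out via the code
            total += euclidean_sq(sv, cv)      # similarity value
        scores[word] = total
    return min(scores, key=scores.get)

# Hypothetical 2-dimensional code book and two trained words.
code_book = {1: [0.0, 0.0], 2: [1.0, 1.0]}
reference_memory = {"YES": [1, 1, 2], "NO": [2, 2, 1]}
spoken = [[0.1, 0.0], [0.0, 0.1], [0.9, 1.0]]
print(recognize(spoken, code_book, reference_memory))  # -> YES
```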
Figure 5 illustrates the inventive picking up of a new word, i.e. the training phase of the word recognition. The reference memory contents are thereby expanded in a speaker-dependent manner by a digitalized training word 505 and by the representation of an allocated speech signal 500, which is hereinafter also called the training speech signal.
At first, the digitalized training word 505 is stored 510 in the reference memory RS. Just as with the word recognition according to figure 4, the training speech signal is subsequently divided into a finite sequence of pre-specified time frames TS. An index value is allocated to the time frames TS by means of an index i, which index value uniquely describes the time sequence of the time frames TSi. A speech feature vector SVi is extracted 530 from at least one time frame TSi 520.
Thereafter a code sequence is determined indicating a sequence of referenced code book feature vectors by which the speech feature vector sequence of the newly picked-up word is described. Said code sequence is allocated to the newly picked-up word and stored in the reference memory 590.
For this purpose each extracted speech feature vector SVi is compared 535 with all code book feature vectors CVj stored in the code book memory CS. To this end, the respective code book feature vector CVj and the allocated code Cj are read out 540 of the code book memory CS. As with the word recognition, each comparison results in a similarity value AW 550, i.e. for example a probability value or a vector difference value, which is buffered 560 together with the code Cj of the corresponding code book feature vector CVj. The number of similarity values AW resulting per extracted speech feature vector SVi is equal to the number of code book feature vectors CVj stored in the code book memory CS.
Thereafter the code CCVmax with the greatest similarity is recognized 580 for each speech feature vector SVi from the buffered codes Cj and the similarity values. For this purpose, for example, the greatest probability value AWmax of the correspondence or - if the similarity values are vector difference values - the smallest vector difference value is searched for. The code CCVmax recognized in this way is stored in the reference memory RS such that it forms an element of a code sequence which is allocated to the digitalized training word, and such that its position in the code sequence is uniquely described 590 by the index value of the time frame TSi from which the respective speech feature vector has been extracted. The code resulting from the comparison of the code book feature vectors with the first speech feature vector, for instance, can be stored in the first position of the code sequence, the code resulting from the comparison of the code book feature vectors with the second speech feature vector in the second position of the code sequence, etc.
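The training step above amounts, in essence, to a nearest-neighbour search per extracted speech feature vector. The following sketch uses a squared vector distance as similarity value; this choice, like the 2-dimensional example data, is an assumption (a probability measure would work analogously with the maximum instead of the minimum):

```python
def euclidean_sq(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def train_word(word, feature_vectors, code_book, reference_memory):
    """For every extracted speech feature vector the most similar code
    book feature vector is searched; its code is stored at the position
    given by the time frame index, forming the code sequence allocated
    to the new word in the reference memory."""
    code_sequence = []
    for sv in feature_vectors:                 # in time frame order
        best_code = min(code_book,
                        key=lambda c: euclidean_sq(sv, code_book[c]))
        code_sequence.append(best_code)
    reference_memory[word] = code_sequence

code_book = {1: [0.0, 0.0], 2: [1.0, 1.0]}
reference_memory = {}
train_word("IRIS", [[0.1, 0.0], [0.9, 0.9], [1.0, 0.8]],
           code_book, reference_memory)
print(reference_memory["IRIS"])  # -> [1, 2, 2]
```

Note that the code book itself is never modified; training only adds a code sequence to the reference memory, matching the ROM/RAM split described earlier.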
In an additional embodiment according to the invention, several code sequences can be allocated in the reference memory to a single digitalized word. This takes place in correspondence with the already described method for newly picking up a word, wherein the renewed storage of the word is omitted.
In another inventive embodiment another word, for example a control character sequence or a phone number, is additionally allocated to the newly picked-up word and stored in the reference memory.
Figure 6 shows, by means of an example, the procedure of the speech feature extraction. At first, a speech signal 600, which is continuous in time and divided into time windows, for example of a width of 10 ms, is sampled, for instance, at a sampling frequency of 8 kHz and quantized 610 by means of an analog-digital converter, for instance, at a resolution of 16 bit. Given a time window width of 10 ms and a sampling frequency of 8 kHz, each time window comprises 80 sampling values. As a speech signal can be considered constant only over a short time period (about 10 ms to 50 ms), the window width should not exceed said time period. A speech feature calculation is triggered for each time window. After filtering and amplifying the sampling values (not shown in fig. 6), a so-called Hamming window, for instance with a width of 32 ms, is applied for cutting out a quasi-stationary segment of the - now discrete - speech signal 620. With the exemplary width of 32 ms of the Hamming window, sampling values of the speech signal are also included which do not belong to the actual time window of 10 ms. This allows a better feature extraction.
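The windowing step can be sketched as follows; the window formula is the standard Hamming definition, and the sample values are placeholders:

```python
import math

def hamming_window(n):
    """Hamming window coefficients of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1))
            for k in range(n)]

# With an 8 kHz sampling rate a 32 ms window covers 256 samples; the
# quasi-stationary segment is cut out by sample-wise multiplication.
SAMPLE_RATE = 8000
WINDOW_MS = 32
n = SAMPLE_RATE * WINDOW_MS // 1000   # 256 samples
window = hamming_window(n)
segment = [1.0] * n                   # placeholder sampling values
windowed = [s * w for s, w in zip(segment, window)]
```

The window tapers the segment edges towards 0.08 of their original amplitude, which reduces the spectral leakage of the subsequent FFT.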
The sampling values of the cut-out segment are subjected to a so-called Fast Fourier Transformation FFT 630. As the speech signal is always real-valued, the result is a mirrored spectrum. Therefore, the first half of the resulting spectral values is sufficient for completely describing the spectrum. As, furthermore, the phase progression is irrelevant for the word recognition, the further calculation is based merely on the squared magnitude of the frequency response 640. A 32 ms segment delivers 256 sampling values, so the FFT results in 256 spectral values. After the formation of the squared magnitude of the frequency response, the calculation continues with 128 spectral values.
By means of a suitable filtering, a multi-dimensional feature vector representing the envelope of the segment is obtained from the spectral values 650. For this purpose, the 128 spectral values can be grouped, for instance, by means of 15 overlapping so-called triangular kernels (not shown in fig. 6), which are arranged according to the so-called MEL scale. The grouping is achieved by multiplying the individual triangular kernels with the magnitude spectrum, wherein the results within such a kernel are summed up. In the mentioned example this results in a 15-dimensional feature vector representing the envelope of the 32 ms segment. Said feature vector is logarithmized in accordance with the logarithmic characteristic of the human ear 660. The energy of the vector is thereafter determined, and the vector is normalized by dividing each single component by the calculated energy 670. The calculated energy is then added to the vector as an additional component, so that in the given example the speech feature extraction results in a speech feature vector with 16 components 670.
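The grouping, logarithmizing and energy-normalizing steps can be sketched as follows. For simplicity the triangular kernels here are spaced linearly rather than on the MEL scale, and the small constant added before the logarithm is an assumption to keep the sketch numerically safe:

```python
import math

def group_with_triangular_kernels(spectrum, num_kernels=15):
    """Illustrative grouping of 128 spectral magnitudes with
    overlapping triangular kernels (linearly spaced here; a real front
    end would space them on the MEL scale)."""
    n = len(spectrum)
    centers = [(k + 1) * n / (num_kernels + 1) for k in range(num_kernels)]
    half_width = n / (num_kernels + 1)
    features = []
    for c in centers:
        acc = 0.0
        for i, mag in enumerate(spectrum):
            w = max(0.0, 1.0 - abs(i - c) / half_width)  # triangle shape
            acc += w * mag          # multiply kernel with spectrum, sum up
        features.append(acc)
    return features

def to_speech_feature_vector(spectrum):
    grouped = group_with_triangular_kernels(spectrum)   # 15 components
    logged = [math.log(x + 1e-10) for x in grouped]     # log characteristic
    energy = math.sqrt(sum(x * x for x in logged))      # vector energy
    normalized = [x / energy for x in logged]           # normalize
    return normalized + [energy]                        # 16 components

vec = to_speech_feature_vector([1.0] * 128)
print(len(vec))  # -> 16
```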
The difference from the preceding vector can optionally be calculated, for instance, for removing background noise, which can appear as a constant portion in the speech feature vector. The difference speech feature vector then forms the actual result of the feature extraction.
A further embodiment of the present invention, which is explained without reference to a figure, relates to a computer program. The term "computer program" as used in the present invention explicitly includes the term "computer program product". The computer program, which can be loaded into the internal memory of a digital computer unit, in particular of a mobile phone, comprises software code portions adapted to perform the described inventive method when the computer program is executed on the computer unit.
Said computer program can in particular also be stored on a computer-readable medium such as a floppy disc, a CD-ROM or an optical disc.