WO2001037261A1 - Coding and training of the vocabulary for speech recognition - Google Patents

Coding and training of the vocabulary for speech recognition

Info

Publication number
WO2001037261A1
Authority
WO
WIPO (PCT)
Prior art keywords
code
code book
word
memory
allocated
Prior art date
Application number
PCT/EP2000/010891
Other languages
French (fr)
Inventor
Stefan Dobler
Karl Hellwig
Original Assignee
Telefonaktiebolaget Lm Ericsson (Publ)
Priority date
Filing date
Publication date
Application filed by Telefonaktiebolaget Lm Ericsson (Publ) filed Critical Telefonaktiebolaget Lm Ericsson (Publ)
Priority to AU19983/01A priority Critical patent/AU1998301A/en
Publication of WO2001037261A1 publication Critical patent/WO2001037261A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training

Definitions

  • the invention relates to a method, a device and a computer program for recognizing a word from a speech signal.
  • the invention relates to a method, a device and a computer program, wherein the words to be recognized can be trained.
  • the EP 0 797 185 discloses a method and a device for word recognition permitting word recognition by a speaker-independent method also for speech signals newly added by a user.
  • a representation on stored phoneme feature vectors is carried out for each new speech signal to be picked up.
  • Several possibilities are compared with each other for recognizing an appropriate phoneme representation of the speech signal.
  • the speech signal including the phoneme representation is stored in a user memory.
  • the method according to the above-mentioned prior art reduces the necessary memory space by representing newly spoken words on already stored phonemes.
  • the described method moreover requires additional user vocabulary for obtaining good recognition results, which also contains phoneme representations and words that have, however, to be generated in a speaker-dependent manner.
  • the representation of a phoneme requires several speech feature vectors. In order to obtain an error-robust word recognition it is necessary to provide numerous different representations of the same phoneme in the memory, whereby the amount of required memory space is increased.
  • it is the object of the invention to provide a method and a device for word recognition on the basis of a speech signal, wherein the storage of comparative data required for the word recognition is memory-space efficient and inexpensive.
  • speech feature vectors are stored in a code book memory, which can be accessed by means of allocated codes. Said speech feature vectors being stored in the code book memory will hereinafter be called code book feature vectors.
  • the code book memory can be factory-made, and it can, therefore, comprise an optimal representational amount of basic elements required for speech recognition, or in other words, speech feature vectors.
  • the memory size of the code book neither depends on the number of users nor on the training of the speaker-dependent system.
  • the code book feature vectors stored in the code book memory are basic elements required for the computer-based word recognition.
  • the phonemes or words are each described by at least one sequence of feature vectors, a memory-efficient code book combination is thereby achieved.
  • the code book consisting of linguistic basic elements can be used universally, as it enables not only the representation of individual phonemes or words of the human language, but rather the representation of the human language as such.
  • each code of a code sequence references a code book feature vector.
  • a typical dimension of a code book feature vector or respectively of a speech feature vector is 16, wherein each component requires at least a 1 byte memory space.
  • a feature vector requires at least a 16 byte memory space.
  • a code typically only requires a 2 byte memory space, which in view of the present invention results in a reduction of the required reference memory space by the factor 8 by storing codes rather than speech feature vectors.
  • the second digitalized word is a digitalized phone number, which enables the use in a communication terminal, for instance a mobile phone, for a speech-controlled call set-up, wherein a name is spoken and the mobile phone sets up a connection by means of the phone number allocated to the recognized name.
  • the contents of the code book memory are speaker-independent so that an efficient factory-made generation of the code book memory contents is supported. It is also advantageous that the reference memory contents are generated in a speaker-dependent manner allowing a high recognition rate in word recognition.
  • a phone number can be allocated to a newly trained word. Especially when using the word recognition according to the invention in a communication terminal, this can enable the speech-controlled connection set-up to a subscriber who is identified towards the device by his name.
  • the reference memory is a direct access memory (RAM) and the code book memory is a read-only memory (ROM).
  • Common direct access memories have a lower storage density compared with common read-only memories and, therefore, a higher memory space requirement at identical memory sizes; they also have a higher current consumption and are more expensive.
  • the inventive computer unit for word recognition is a computer unit of a mobile phone.
  • the mobile phone can be operated with speech inputs instead of keyboard commands, which especially in connection with common handsfree telephones, for instance in a car, is user-friendly and an important security aspect, for example in road traffic.
  • Fig. 1 shows a block diagram for the facilitated illustration of speech signal processing in a mobile phone
  • Fig. 2 shows a block diagram of a computer unit for word recognition
  • Fig. 3 shows an illustration of exemplary data of memory contents and of a word allocation
  • Fig. 4 shows a flow chart of the process of word recognition
  • Fig. 5 shows a flow chart of the training process for a new word
  • Fig. 6 shows a flow chart of an extraction of a speech feature derived from a speech signal.
  • Figure 1 shows by means of an example the use of the inventive computer unit for the word recognition in a mobile phone. The indicated functional units of the mobile phone are simplified and have only been illustrated to the extent necessary for comprehending the invention.
  • Figure 1 illustrates a speech signal S, a microphone MI, a control unit CR, an inventive computer unit RE, a display DI as output device, a keyboard KB as input device, an additional optional input device FD, which may for instance be a mouse or a track ball, a device for setting up the call CS, a sending means SE and an antenna AT.
  • the control unit is coupled with the microphone MI, with the input and output devices DI, KB and FD, as well as with the computer unit RE and the means for setting up a call CS.
  • the device CS is moreover coupled with the computer unit RE and the sending device SE.
  • the sending device SE is further coupled with the antenna AT.
  • the control unit CR controls the processing of the speech signal S picked up via the microphone MI.
  • the speech signal is forwarded to the computer unit RE, if a user has selected a mode "Recognition", for instance, by means of the keyboard.
  • the control unit CR signalizes the mode to the computer unit RE, which then performs a word recognition according to the invention.
  • a phone number is forwarded to the device CS, which by means of said phone number sets up a call via the sender SE.
  • the device CS signalizes the effected call set-up to the control unit CR, which switches into the mode "Transparent" and connects the speech signal to the sender for further handling the call.
  • the result of the word recognition can be read out on the display DI for controlling purposes by the user.
  • the direct recognition result, i.e. the word having been described by the spoken speech signal S, can be displayed.
  • an additional word allocated to the recognized word can be displayed. If a user speaks the trained name of a subscriber into the microphone MI, the corresponding digitalized subscriber name is first recognized on the basis of the speech signal S and displayed as intermediate result. In the given example a phone number is allocated to the subscriber name as additional word, which phone number as final result of the word recognition is used for the call set-up, and which can equally be read out on display DI.
  • the speech signal S is forwarded directly, i.e. transparently for the computer unit RE, from the control unit to the sender SE.
  • the control unit CR arrives in the mode "Training", for example, by means of a corresponding user input or by means of a failed word recognition. Thereafter a picked up speech signal S for implementing a training sequence, i.e. for picking up a new word described by the speech signal, is forwarded to the computer unit RE. The control unit CR thereby signalizes the mode "Training" to the computer unit RE.
  • the word corresponding to the speech signal S is entered via an input device KB, FD.
  • another word for example a phone number, can be entered and allocated correspondingly.
  • Figure 2 shows the inventive computer unit RE. It comprises an extraction unit ME for extracting speech feature vectors from the speech signal S, a code book memory CS including factory-made speaker-independently stored code book feature vectors and codes, a reference memory RS for the speaker-dependent storage of digitalized words and code sequences, a read-out unit AE for reading out code book feature vectors from the code book memory, a comparing means VG for comparing speech feature vectors extracted from the speech signal with code book feature vectors of the code book memory and a recognition unit DT for recognizing a digitalized word from the reference memory RS as recognition result ER, by which the speech signal S is described.
  • a code is allocated to each of the code book feature vectors stored in the code book memory CS.
  • the code book feature vectors are read out of the code book memory CS by means of the read-out unit AE with the aid of codes of the code sequences stored in the reference memory RS.
  • At least four of the components extraction unit ME, code book memory CS, reference memory RS, comparing means VG, read-out unit AE and recognition unit DT are realized on a chip.
  • all other mentioned components can be implemented on a chip in one embodiment.
  • the reference memory RS is a direct access memory (RAM).
  • a so-called non-volatile, i.e. static, RAM is used, as the speaker-dependently stored digitalized words and codes are maintained even when the supply voltage is switched off.
  • a read-only memory (ROM) is preferably used in this embodiment for the code book memory CS.
  • An already existing code book having the desired number of code book feature vectors can be used as initial code book. If the initial code book comprises a smaller number of code book feature vectors than desired, it can successively be expanded up to the desired number. This can be done by means of a random selection of feature vectors from a training vector sequence. As an alternative, the initial code book, the code book feature vectors of which each comprise K elements thereby opening a K-dimensional space, can be generated from many code books for independent sub-spaces of the K-dimensional space.
  • "Splitting" is another possibility, whereby a code book feature vector is replaced by two code book feature vectors.
  • As a further alternative, the Linde-Buzo-Gray (LBG) algorithm can be used for generating the code book.
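The splitting and refinement procedure mentioned above can be sketched in a few lines. This is an illustrative sketch only: the function names, the perturbation value `eps`, the toy training data and the number of split rounds are assumptions, not taken from the patent.

```python
import random

def split(code_book, eps=0.05):
    """Replace every code book vector by two slightly perturbed copies."""
    doubled = []
    for vec in code_book:
        doubled.append([x + eps for x in vec])
        doubled.append([x - eps for x in vec])
    return doubled

def lloyd_pass(code_book, training_vectors):
    """One refinement pass (the core of the LBG algorithm): assign each
    training vector to its nearest code book vector, then move every code
    book vector to the mean of its cluster."""
    clusters = [[] for _ in code_book]
    for tv in training_vectors:
        best = min(range(len(code_book)),
                   key=lambda j: sum((a - b) ** 2
                                     for a, b in zip(tv, code_book[j])))
        clusters[best].append(tv)
    refined = []
    for centroid, members in zip(code_book, clusters):
        if members:
            dim = len(centroid)
            refined.append([sum(m[d] for m in members) / len(members)
                            for d in range(dim)])
        else:  # keep an entry that attracted no training vectors
            refined.append(centroid)
    return refined

random.seed(0)
training = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(200)]
# start from the mean of the training data, then split three times -> 8 vectors
book = [[sum(tv[d] for tv in training) / len(training) for d in range(4)]]
for _ in range(3):
    book = lloyd_pass(split(book), training)
print(len(book))
```

Each split doubles the code book, so three rounds expand a single initial vector to eight code book feature vectors.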
  • Figure 3 illustrates by means of data simplified in view of structure and contents an example for memory contents of the code book memory CS and the reference memory RS. Moreover, an example for a word allocation is illustrated.
  • the code book memory CS contains code book feature vectors and precisely allocated codes. For instance, code 1 is allocated to the code book feature vector a consisting of the components a1, a2, a3 and a4, code 2 is allocated to the code book feature vector b consisting of the components b1, b2, b3 and b4, etc. In general, however, neither the code nor the individual elements of the code book feature vectors are specified to a particular data type. The code can for example also be represented alphanumerically. The number of elements of a code book feature vector depends on the used method for the extraction of features and can vary within a code book memory.
  • the reference memory RS contains words "IRIS", "EPA", etc., to each of which at least one code sequence is usually allocated. For instance, the code sequence "22324" is allocated to the word "IRIS". Also several code sequences can be allocated to a word in order to reduce the error rate in word recognition, particularly in the case of a changing pronunciation by the user. A word without allocated code sequence is not trained and cannot be recognized by word recognition although it may be stored in the reference memory RS.
  • a code sequence consists of codes for code book feature vectors. The number of the respective codes of the code sequences can be variable, i.e. the code sequences of the reference memory can have different lengths.
  • Position 1 of the code sequence allocated to the word "IRIS", for example, has code "2", position 2 code "2", position 3 code "3", position 4 code "2" and position 5 code "4".
  • phone numbers are precisely allocated to the words in the reference memory as additional words.
  • a name is precisely allocated to the words in the reference memory as additional words.
  • The relationships between code book memory contents and reference memory contents are shown in figure 3 by means of the word "IRIS".
  • the code sequence "22324" is allocated to the word "IRIS" in the reference memory. If the codes of the code sequence are replaced by the corresponding code book feature vectors of the code book memory, the represented code book feature vector sequence is obtained. Thus, said sequence, and simultaneously the word "IRIS", has been allocated the phone number "0171234" from the reference memory.
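The memory organisation of the Figure 3 example can be sketched as follows. The code sequence "22324" and the phone number "0171234" are taken from the text; the symbolic vector components and the dictionary layout are illustrative assumptions.

```python
# The code book maps codes to feature vectors; the reference memory maps
# trained words to code sequences and to allocated additional words.
code_book = {                     # factory-made, speaker-independent (ROM)
    1: ("a1", "a2", "a3", "a4"),
    2: ("b1", "b2", "b3", "b4"),
    3: ("c1", "c2", "c3", "c4"),
    4: ("d1", "d2", "d3", "d4"),
}
reference_memory = {              # speaker-dependent (RAM)
    "IRIS": {"codes": [2, 2, 3, 2, 4], "phone": "0171234"},
}

def decode(word):
    """Replace the stored code sequence by the referenced code book vectors."""
    entry = reference_memory[word]
    return [code_book[c] for c in entry["codes"]], entry["phone"]

vectors, number = decode("IRIS")
print(number)       # the allocated phone number
print(vectors[0])   # the vector referenced by code 2
```

Storing the five 2-byte codes instead of five 16-byte vectors is what yields the factor-8 saving described earlier.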
  • Figure 4 illustrates the inventive method for the computer-based recognition of a word from a speech signal.
  • the present invention is based on a speech signal which essentially only contains relevant information and which describes a digitalized word from the reference memory.
  • the first prerequisite can be fulfilled by means of filtration of interfering noises. This can either be done prior to the execution of the inventive method or during the extraction of speech features.
  • Corresponding methods are known to the person skilled in the art. The same refers to methods for determining the beginning and end of a word, which can be applied together with the inventive method, if the word recognition is applied to continuous speech, or in other words to a sequence of words.
  • the speech signal 400 is divided into a finite sequence of time frames TS. This can be effected either during picking up the speech signal, i.e. during so-called real-time processing, or by means of buffering.
  • the time sequence of the time frames TS is thereby precisely described by an allocation of an index i.
  • the first time frame TS_1, for instance, has the index value 1
  • the second time frame TS_2 has the index value 2, etc.
  • a speech feature vector SV_i is extracted from the time frames TS_i, i.e. from at least one time frame TS_i 410, and is buffered 420.
  • Corresponding time frames can be pre-specified. This means that both the number of time frames used for the word recognition (e.g. 10 time frames, preferably all time frames or at least one time frame) as well as the kind of selection (e.g. each second, each third time frame) are pre-specified.
  • the result of a respective comparison is a similarity value 460 indicating the similarity between the speech feature vector SV_i and the code book feature vector CV_c.
  • the similarity value can, for example, be a probability value of the correspondence or the absolute value of the difference of the vectors SV_i and CV_c.
  • the respective similarity value is buffered 470 such that it is precisely allocated to the word described by the code sequence CF_j in the reference memory RS.
  • the buffering is preferably done in the reference memory RS.
  • each extracted speech feature vector SV_i thus results in a number of similarity values corresponding to the number of code sequences CF_j of the reference memory RS.
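A minimal sketch of this comparison step, assuming Euclidean distance as the similarity measure and a simple frame-by-frame alignment over the overlapping part of the sequences (the patent leaves the exact similarity measure and time alignment open). The code book entries, the words "YES"/"NO" and all values are illustrative.

```python
code_book = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (0.0, 1.0)}
reference_memory = {"YES": [2, 2, 3], "NO": [1, 1, 1]}

def distance(u, v):
    """Euclidean distance as a stand-in for the similarity value."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def recognize(speech_vectors):
    """Return the stored word whose referenced code book feature vector
    sequence is most similar to the extracted speech feature vectors."""
    scores = {}
    for word, codes in reference_memory.items():
        ref = [code_book[c] for c in codes]   # decode the code sequence
        scores[word] = sum(distance(sv, cv)
                           for sv, cv in zip(speech_vectors, ref))
    return min(scores, key=scores.get)        # smallest total distance wins

print(recognize([(0.9, 0.1), (1.1, 0.0), (0.1, 0.8)]))  # "YES"
```

Note that only codes, never full vectors, sit in the reference memory; the vectors are looked up in the code book at comparison time.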
  • Figure 5 illustrates the inventive new pick-up of a word, i.e. the training phase of word recognition.
  • the reference memory contents are thereby expanded in a speaker-dependent manner by a digitalized training word 505 and by the representation of an allocated speech signal 500, which is hereinafter also called training speech signal.
  • the digitalized training word 505 is stored 510 in the reference memory RS.
  • the training speech signal is subsequently divided into a finite sequence of pre-specified time frames TS.
  • An index value is allocated to the time frames TS by means of an index i, which index value precisely describes the time sequence of the time frames TS_i.
  • a speech feature vector SV_i is extracted 530 from the time frames TS_i, i.e. from at least one time frame TS_i 520.
  • a code sequence is recognized indicating a sequence of referenced code book feature vectors, by which the speech feature vector sequence of the newly picked up word is described.
  • Said code sequence is allocated to the newly picked up word and stored in the reference memory 590.
  • each extracted speech feature vector SV_i is compared with all code book feature vectors CV_j 535 stored in the code book memory CS. Therefore, the respective code book feature vector CV_j and the allocated code C_j are read out 540 of the code book memory CS.
  • each comparison results in a similarity value AW 550, i.e. for example in a probability value or a vector difference value, which is buffered together with the code C_j of the corresponding code book feature vector CV_j 560.
  • the number of similarity values AW resulting per extracted speech feature vector SV_i is equal to the number of the code book feature vectors CV_j stored in the code book memory CS.
  • the code C_max with the greatest similarity is recognized 580 for each speech feature vector SV_i from the buffered codes C_j and the similarity values.
  • the greatest probability value AW_max of the correspondence or - if the similarity values are vector difference values - the smallest vector difference value is searched.
  • the so recognized code C_max is stored in the reference memory RS such that it forms an element of a code sequence, which is allocated to the digitalized training word, and that its position in the code sequence is precisely described 590 by the index value of the time frame TS_i from which the respective speech feature vector has been extracted.
  • the code resulting from the comparison of the code book feature vectors and the first speech feature vector can be stored in the first position of the code sequence, the code resulting from the comparison of the code book feature vectors and the second speech feature vector in the second position of the code sequence, etc.
  • Several code sequences can be allocated in the reference memory to a single digitalized word. This takes place in correspondence with the already described method for newly picking up a word, wherein the new storage of the word is waived.
  • another word, for example a control character sequence or a phone number, is additionally allocated to the newly picked up word and stored in the reference memory.
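The training steps above amount to vector quantization: each extracted speech feature vector is mapped to the code of its most similar code book feature vector, and the resulting code sequence is stored for the new word. A minimal sketch, assuming squared Euclidean distance as the similarity measure; the code book entries, the word "IRIS" and the input vectors are illustrative.

```python
code_book = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (0.0, 1.0)}
reference_memory = {}

def train(word, speech_vectors):
    """Store the code sequence describing the training speech signal."""
    codes = []
    for sv in speech_vectors:   # one code per time frame, in frame order
        best = min(code_book,
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(sv, code_book[c])))
        codes.append(best)      # position in the sequence = frame index
    reference_memory[word] = codes

train("IRIS", [(0.9, 0.1), (0.1, 0.9), (0.0, 0.1)])
print(reference_memory["IRIS"])  # [2, 3, 1]
```

No new feature vectors are generated or stored during training; only references into the factory-made code book are written to the reference memory.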
  • Figure 6 shows by means of an example the procedure of the speech feature extraction.
  • a speech signal 600 being continuous in terms of time and divided into time windows, for example at a width of 10 ms, is sampled, for instance, at a sampling frequency of 8 kHz, and quantized by means of an analog-digital-converter, for instance, at a resolution of 16 bit 610.
  • each time window thus comprises 80 sampling values.
  • the window width should not exceed said time period.
  • a speech feature calculation is triggered for each time window.
  • a so-called Hamming window, for instance with a width of 32 ms, is applied for cutting out a quasi-stationary segment of the - now discretely present - speech signal 620.
  • Due to the width of 32 ms of the Hamming window, sampling values of the speech signal are also detected which do not belong to the actual time window of 10 ms. This allows a better feature extraction.
  • the sampling values of the cut out segment are subjected to a so-called Fast Fourier Transformation (FFT) 630.
  • a multi-dimensional feature vector is obtained from the spectral values, which vector represents the envelope of the segment 650.
  • the 128 spectral values can be grouped, for instance, by means of 15 overlapping so-called triangular cores (not shown in fig. 6), which are arranged according to the so-called MEL scale.
  • the grouping is achieved by multiplying the individual triangular cores with the absolute value of the spectrum, wherein the results within such a core are summed up. In the mentioned example this results in a 15-dimensional feature vector representing the envelope of the 32 ms segment.
  • Said feature vector is logarithmized in accordance with the logarithmic characteristic of the human ear 660.
  • the energy of the vector is thereafter determined, and by dividing each single component by the calculated energy the vector is normalized 670.
  • the calculated energy is thereby added to the vector as an additional component so that in the given example the speech feature extraction results in a speech feature vector with 16 components 670.
  • the difference over the preceding vector can optionally be calculated, for instance, for removing background noise, which can be a constant portion in the speech feature vector.
  • the difference speech feature vector then forms the actual result of the feature extraction.
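The extraction chain above (windowing, spectrum, grouping of spectral values, logarithm, energy normalization with the energy appended as an extra component) can be sketched in pure Python. This is a toy sketch: the 16-sample frame, the plain DFT standing in for the FFT, and the crude equal-width grouping in place of the 15 MEL-scaled triangular cores are all simplifying assumptions.

```python
import math

def hamming(n, length):
    """Hamming window coefficient for sample n of a window of given length."""
    return 0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))

def spectrum(frame):
    """Magnitudes of a plain DFT (stands in for the FFT of the text)."""
    length = len(frame)
    mags = []
    for k in range(length // 2):
        re = sum(frame[n] * math.cos(2 * math.pi * k * n / length)
                 for n in range(length))
        im = -sum(frame[n] * math.sin(2 * math.pi * k * n / length)
                  for n in range(length))
        mags.append(math.hypot(re, im))
    return mags

def extract(frame, bands=3):
    """Window -> spectrum -> band grouping -> log -> energy normalization."""
    windowed = [x * hamming(n, len(frame)) for n, x in enumerate(frame)]
    mags = spectrum(windowed)
    size = len(mags) // bands                        # crude band grouping
    grouped = [sum(mags[b * size:(b + 1) * size]) for b in range(bands)]
    logged = [math.log(g + 1e-9) for g in grouped]   # logarithmic characteristic
    energy = math.sqrt(sum(v * v for v in logged))
    normalized = [v / energy for v in logged]        # energy normalization
    return normalized + [energy]                     # energy as extra component

frame = [math.sin(2 * math.pi * 3 * n / 16) for n in range(16)]
features = extract(frame)
print(len(features))  # bands + 1 components
```

In the configuration of the text, 15 grouped spectral components plus the energy component give the 16-component speech feature vector.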
  • a further embodiment of the present invention, which is explained without using a figure, relates to a computer program.
  • the term as defined in the present invention explicitly includes the term "computer program product".
  • the computer program which can be loaded into the internal memory of a digital computer unit, particularly of a mobile phone, comprises software code portions adapted to perform the described inventive method, if the computer program is executed on the computer unit.
  • Said computer program can particularly also be stored on a computer-readable medium such as a floppy disc, CD-ROM or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to a method, a device and a computer program for the recognition of a word from a speech signal. In particular the invention relates to a method, a device and a computer program, wherein the words to be recognized can be picked up anew. Both the word recognition and the training of new words is done by means of code book feature vectors, which are pre-specified in a code book memory CS in a factory-made manner. Digitalized words and allocated code sequences, which reference user-dependent sequences of code book feature vectors, are stored in a reference memory RS. For word recognition a speech signal is divided into time frames, from each of which a speech feature vector is extracted. The resulting speech feature vector sequence is compared with code book feature vector sequences, which are referenced by the code sequences stored in the reference memory. By means of similarity values, which result from the comparisons, a digitalized word is recognized, which is described by the speech signal.

Description

CODING AND TRAINING OF THE VOCABULARY FOR SPEECH RECOGNITION
The invention relates to a method, a device and a computer program for recognizing a word from a speech signal. In particular the invention relates to a method, a device and a computer program, wherein the words to be recognized can be trained.
According to the prior art, the EP 0 797 185 discloses a method and a device for word recognition permitting word recognition by a speaker-independent method also for speech signals newly added by a user. For this purpose a representation on stored phoneme feature vectors is carried out for each new speech signal to be picked up. Several possibilities are compared with each other for recognizing an appropriate phoneme representation of the speech signal. Thereafter, the speech signal including the phoneme representation is stored in a user memory .
Essential for methods and devices for speaker-dependent word recognition is the memory space necessary for the static storage of speech samples. A corresponding code book grows proportionally to the number of speech samples per word to be recognized, to the number of the words to be recognized and to the number of users. For the storage a static direct access memory, or in other words a static RAM, is required, which has disadvantages over a read-only memory (ROM) in view of costs and current consumption.
The method according to the above-mentioned prior art reduces the necessary memory space by representing newly spoken words on already stored phonemes. Apart from the basic vocabulary, which is present in the form of already stored phonemes and words, the described method moreover requires additional user vocabulary for obtaining good recognition results, which also contains phoneme representations and words that have, however, to be generated in a speaker-dependent manner. Furthermore, the representation of a phoneme requires several speech feature vectors. In order to obtain an error-robust word recognition it is necessary to provide numerous different representations of the same phoneme in the memory, whereby the amount of required memory space is increased.
In accordance therewith it is the object of the invention to provide a method and a device for word recognition on the basis of a speech signal, wherein the storage of comparative data required for the word recognition is memory-space efficient and inexpensive.
According to the invention said object is provided by the teaching of patent claims 1, 10 and 16.
It thereby proves to be advantageous that speech feature vectors are stored in a code book memory, which can be accessed by means of allocated codes. Said speech feature vectors being stored in the code book memory will hereinafter be called code book feature vectors. The code book memory can be factory-made, and it can, therefore, comprise an optimal representational amount of basic elements required for speech recognition, or in other words, speech feature vectors. The memory size of the code book neither depends on the number of users nor on the training of the speaker-dependent system.
Moreover it is advantageous, that the code book feature vectors stored in the code book memory are basic elements required for the computer-based word recognition. In contrast to phoneme- or word-based code book memories, where the phonemes or words are each described by at least one sequence of feature vectors, a memory-efficient code book combination is thereby achieved. Moreover, the code book consisting of linguistic basic elements can be used universally, as it enables not only the representation of individual phonemes or words of the human language, but rather the representation of the human language as such.
It is another advantage that digitalized words and allocated code sequences are stored in a reference memory. Each code of a code sequence references a code book feature vector. This results in a significant memory space saving for the reference memory. A typical dimension of a code book feature vector or respectively of a speech feature vector is 16, wherein each component requires at least a 1 byte memory space. Thus a feature vector requires at least a 16 byte memory space. In contrast thereto a code typically only requires a 2 byte memory space, which in view of the present invention results in a reduction of the required reference memory space by the factor 8 by storing codes rather than speech feature vectors.
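The memory saving claimed above can be checked with a few lines of arithmetic, using exactly the figures from the text: a 16-component feature vector at a minimum of 1 byte per component versus a 2-byte code per reference.

```python
VECTOR_DIM = 16          # typical dimension of a feature vector
BYTES_PER_COMPONENT = 1  # at least 1 byte per component
BYTES_PER_CODE = 2       # a code typically requires 2 bytes

vector_bytes = VECTOR_DIM * BYTES_PER_COMPONENT  # 16 bytes per stored vector
code_bytes = BYTES_PER_CODE                      # 2 bytes per stored code
reduction_factor = vector_bytes // code_bytes    # factor 8 saving

print(vector_bytes, code_bytes, reduction_factor)
```

Each stored code thus replaces a full feature vector in the reference memory, which is where the factor-8 reduction comes from.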
Additional advantageous embodiments are disclosed in patent claims 2 to 9, 11 to 15 and 17.
According to patent claim 2 it is an advantage that the similarity values recognized within the scope of word recognition by comparing speech feature vectors of a spoken speech signal with code book feature vectors are intermediately stored in the reference memory. By allocating the similarity values to the corresponding code sequences, which are already present in the reference memory, a dual storage of said code sequences can be avoided, and a separate buffer for similarity values can be waived altogether.
According to patent claim 3 it is advantageous that as a result of the word recognition not only a stored digitalized word corresponding to the speech signal is recognized, but also an additional word allocated to said word. This enables the flexible use of the word recognition particularly in a speech-controlled system, wherein the word recognized from the speech signal is a plain-text control command, which can, for instance, be outputted on a screen or a control device, and wherein the additional word is a character string of control codes for controlling the system.
According to patent claim 4 it is advantageous that the second digitalized word is a digitalized phone number, which enables the use in a communication terminal, for instance a mobile phone, for a speech-controlled call set-up, wherein a name is spoken and the mobile phone sets up a connection by means of the phone number allocated to the recognized name.
According to patent claim 5 it is advantageous that the contents of the code book memory are speaker-independent so that an efficient factory-made generation of the code book memory contents is supported. It is also advantageous that the reference memory contents are generated in a speaker-dependent manner allowing a high recognition rate in word recognition.
According to patent claim 6 and patent claim 11 it proves to be an advantage that the contents of the code book memory are not dependent on the language. This means that words in different languages such as German or English are recognized by means of the same code book memory contents. It is unnecessary to use different code book memory contents for different languages. Thus a code book according to the invention can constitute a speaker-independent and language-independent representation of the human language.
According to patent claim 7 it is an advantage that the speech feature vectors stored in the code book memory can be accessed by means of allocated codes. Therefore, the code book memory contents need not be changed for training a new word; a sequence of suitable references, or in other words a suitable sequence of codes, is added to the reference memory instead. It is moreover advantageous that for training a new word no new speech feature vectors have to be generated and stored due to the factory-made generation of the code book memory, but that the training can be performed solely with the speech feature vectors previously defined in the code book memory. This is efficient in view of the memory space and is able to reduce the complexity and the duration of the training process.
According to patent claim 8 and patent claim 15 it is advantageous that a phone number can be allocated to a newly trained word. Especially when using the word recognition according to the invention in a communication terminal, this enables the speech-controlled connection set-up to a subscriber identified towards the device by his name.
According to patent claim 9 it is advantageous that the extraction of a speech feature vector results in a vector provided with both spectral components and an energy component. This enables a good recognition rate for spoken words having a high noise portion in the speech signal, and, through the energy normalization, also for words spoken at different intensities.
According to patent claim 12 it is advantageous that the reference memory is a direct access memory (RAM) and that the code book memory is a read-only memory (ROM). Common direct access memories have a lower storage density compared with common read-only memories and, therefore, a higher memory space requirement at identical memory sizes; they also have a higher current consumption and are more expensive. Thus, it is an advantage according to the present invention to store the code book, which need not be changed when training new words and which typically has a clearly larger memory space requirement than the references, in a ROM, while the references, which can be supplemented during the training, are stored in a preferably static RAM.
According to patent claim 13 it is an advantage that by combining at least four of the described components of the computer unit on a chip, space is saved and costs are reduced. According to patent claim 14 it is an advantage that the inventive computer unit for word recognition is a computer unit of a mobile phone. Thus, the mobile phone can be operated with speech inputs instead of keyboard commands, which, especially in connection with common hands-free telephones, for instance in a car, is user-friendly and an important security aspect, for example in road traffic.
According to patent claim 17 it proves to be advantageous that due to storage on a computer-readable medium the computer program is not bound to a single computer unit or a single computer, but that it can simply be read in by different computers, for example, for the purpose of simulation, test or manufacture.
In the following the invention is described in more detail by means of embodiments and figures, wherein
Fig. 1 shows a block diagram for the simplified illustration of speech signal processing in a mobile phone,
Fig. 2 shows a block diagram of a computer unit for word recognition,
Fig. 3 shows an illustration of exemplary data of memory contents and of a word allocation,
Fig. 4 shows a flow chart of the process of word recognition,
Fig. 5 shows a flow chart of the training process for a new word,
Fig. 6 shows a flow chart of an extraction of a speech feature derived from a speech signal.
In the following the invention is explained in more detail by means of figure 1. Figure 1 shows by means of an example the use of the inventive computer unit for the word recognition in a mobile phone. The indicated functional units of the mobile phone are simplified and have only been illustrated to the extent necessary for comprehending the invention. Figure 1 illustrates a speech signal S, a microphone MI, a control unit CR, an inventive computer unit RE, a display DI as output device, a keyboard KB as input device, an additional optional input device FD, which may for instance be a mouse or a trackball, a device CS for setting up a call, a sending means SE and an antenna AT.
The control unit is coupled with the microphone MI, with the input and output devices DI, KB and FD, as well as with the computer unit RE and the call set-up device CS. The device CS is moreover coupled with the computer unit RE and the sending device SE. The sending device SE is further coupled with the antenna AT.
The control unit CR controls the processing of the speech signal S picked up via the microphone MI. The speech signal is forwarded to the computer unit RE if a user has selected a mode "recognition", for instance by means of the keyboard. The control unit CR signals the mode to the computer unit RE, which then performs a word recognition according to the invention. As a result a phone number is forwarded to the device CS, which by means of said phone number sets up a call via the sender SE. The device CS signals the effected call set-up to the control unit CR, which switches into the mode "Transparent" and connects the speech signal to the sender for further handling the call.
The result of the word recognition can be read out on the display DI for controlling purposes by the user. The direct recognition result, i.e. the word having been described by the spoken speech signal S, can be displayed. As an alternative, also an additional word allocated to the recognized word can be displayed. If a user speaks the trained name of a subscriber into the microphone MI, the corresponding digitalized subscriber name is first recognized on the basis of the speech signal S and displayed as intermediate result. In the given example a phone number is allocated to the subscriber name as additional word, which phone number as final result of the word recognition is used for the call set-up, and which can equally be read out on the display DI.
The term "word" usually relates to a meaningful tonal unit. In view of the invention, however, the term is used in a broad sense and designates a finite character string having at least one element. Therefore, a telephone number is explicitly to be understood as a word in terms of the present invention.
When the control unit CR illustrated in figure 1 is in the mode "Transparent", the speech signal S is forwarded from the control unit to the sender SE directly, i.e. transparently for the computer unit RE.
The control unit CR arrives in the mode "Training", for example, by means of a corresponding user input or as a result of a failed word recognition. Thereafter a picked-up speech signal S for implementing a training sequence, i.e. for picking up a new word described by the speech signal, is forwarded to the computer unit RE. The control unit CR thereby signals the mode "Training" to the computer unit RE. The word corresponding to the speech signal S is entered via an input device KB, FD. In addition, another word, for example a phone number, can be entered and allocated correspondingly.
Figure 2 shows the inventive computer unit RE. It comprises an extraction unit ME for extracting speech feature vectors from the speech signal S, a code book memory CS including factory-made, speaker-independently stored code book feature vectors and codes, a reference memory RS for the speaker-dependent storage of digitalized words and code sequences, a read-out unit AE for reading out code book feature vectors from the code book memory, a comparing means VG for comparing speech feature vectors extracted from the speech signal with code book feature vectors of the code book memory, and a recognition unit DT for recognizing a digitalized word from the reference memory RS as recognition result ER, by which word the speech signal S is described.
To each of the code book feature vectors stored in the code book memory CS a code is allocated. To each digitalized word stored in the reference memory RS at least one sequence of codes is allocated. The code book feature vectors are read out of the code book memory CS by means of the read-out unit AE with the aid of codes of the code sequences stored in the reference memory RS.
According to an advantageous embodiment of the inventive computer unit RE, at least four components out of the components extraction unit ME, code book memory CS, reference memory RS, comparing means VG, read-out unit AE and recognition unit are realized on a chip. With the exception of the memories, all other mentioned components can be implemented on a chip in one embodiment.
In another advantageous embodiment of the inventive computer unit RE, the reference memory RS is a direct access memory (RAM). Preferably a so-called non-volatile, i.e. static RAM is used, as the speaker-dependently stored digitalized words and codes are then maintained also when the supply voltage is switched off. In contrast to the reference memory contents, which can be generated, changed or expanded by a user, the factory-made contents of the code book memory CS cannot be changed and are pre-specified in a speaker-independent manner. Therefore, a read-only memory (ROM) is preferably used in this embodiment for the code book memory CS.
In the following the generation of the code book feature vectors of the code book memory, i.e. of the code book, is explained (without figure). Said process comprises the provision of an initial code book with code book feature vectors and the optimization of the initial code book. The methods indicated in this respect are examples and are by no means to be understood as restrictions.
An already existing code book having the desired number of code book feature vectors can be used as initial code book. If the initial code book comprises a smaller number of code book feature vectors than desired, it can successively be expanded up to the desired number. This can be done by means of a random selection of feature vectors from a training vector sequence. As an alternative, the initial code book, the code book feature vectors of which each comprise K elements thereby spanning a K-dimensional space, can be generated from several code books for independent sub-spaces of the K-dimensional space. "Splitting" is another possibility, whereby a code book feature vector is replaced by two code book feature vectors. This can, for instance, be done by copying the vector and subsequently adding an offset to all components of one copy and subtracting the same offset from all components of the other copy. In the case of "binary splitting" all code book feature vectors present in the code book are each divided according to this principle until the desired number of code book feature vectors, which can be represented as a power of two, is obtained. In the case of "single splitting", however, only one code book feature vector is divided, so that any code book size can be achieved.
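The binary splitting described above can be sketched in a few lines; the offset value is an arbitrary illustration, not one prescribed by the text:

```python
def split_codebook(codebook, offset=0.01):
    """Binary splitting: replace every code book feature vector by two
    copies, shifted by +offset and -offset per component, so the code
    book size doubles (a power of two) with each call."""
    new_book = []
    for vec in codebook:
        new_book.append([c + offset for c in vec])
        new_book.append([c - offset for c in vec])
    return new_book

# starting from a single vector, two splits yield four code book vectors
book = split_codebook(split_codebook([[0.5, 0.5]]))
print(len(book))  # 4
```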
One possibility for the code book optimization is the use of the so-called Linde-Buzo-Gray (LBG) algorithm, which is known to the person skilled in the art. Said algorithm for signal coding can iteratively optimize the partition limits of a code book so that the occurring error is minimized for the used training sequence. Alternatives are the use of the so-called K-means algorithm and the use of the LBG algorithm within the scope of splitting already during the generation of the initial code book.
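A single LBG/K-means-style iteration (nearest-centroid assignment followed by a centroid update) can be sketched as follows. This is a minimal illustration of the optimization idea, not the patent's exact procedure; the handling of empty cells (keeping the old vector) is an assumption of the sketch:

```python
def lbg_iteration(codebook, training):
    """One iteration: assign each training vector to its nearest code book
    vector (squared Euclidean distance), then move each code book vector
    to the centroid of its assigned training vectors."""
    clusters = [[] for _ in codebook]
    for v in training:
        best = min(range(len(codebook)),
                   key=lambda k: sum((a - b) ** 2 for a, b in zip(v, codebook[k])))
        clusters[best].append(v)
    new_book = []
    for k, members in enumerate(clusters):
        if members:
            dim = len(members[0])
            new_book.append([sum(m[d] for m in members) / len(members)
                             for d in range(dim)])
        else:
            new_book.append(codebook[k])  # keep empty cells unchanged (assumption)
    return new_book

training = [[0, 0], [0, 1], [10, 10], [10, 11]]
book = lbg_iteration([[0, 0], [10, 10]], training)
print(book)  # [[0.0, 0.5], [10.0, 10.5]]
```

Repeating the iteration until the codebook no longer moves corresponds to the convergence criterion of the LBG/K-means family.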
The use of a language-independent initial code book, of multi-language training sequences, and of the optimization algorithm itself results in language-independent code book feature vectors, so that the inventive word recognition is not restricted to words in a single language.
Figure 3 illustrates, by means of data simplified in view of structure and contents, an example of memory contents of the code book memory CS and the reference memory RS. Moreover, an example of a word allocation is illustrated.
The code book memory CS contains code book feature vectors and precisely allocated codes. For instance, code 1 is allocated to the code book feature vector a consisting of the components a1, a2, a3 and a4, code 2 is allocated to the code book feature vector b consisting of the components b1, b2, b3 and b4, etc. In general, however, neither the codes nor the individual elements of the code book feature vectors are restricted to a particular data type. A code can, for example, also be represented alphanumerically. The number of elements of a code book feature vector depends on the method used for the extraction of features and can vary within a code book memory.
The reference memory RS contains words ("IRIS", "EPA" ...), to each of which at least one code sequence is usually allocated. For instance, the code sequence "22324" is allocated to the word "IRIS". Also several code sequences can be allocated to a word in order to reduce the error rate in word recognition, particularly in the case of a changing pronunciation by the user. A word without an allocated code sequence is not trained and cannot be recognized by word recognition, although it may be stored in the reference memory RS. A code sequence consists of codes for code book feature vectors. The number of codes of the respective code sequences can be variable, i.e. the code sequences of the reference memory can have different lengths. The positions of the codes of a code sequence are precisely specified, and each single code of a code sequence can be read out of the reference memory via its position. Position 1 of the code sequence allocated to the word "IRIS", for example, has code "2", position 2 code "2", position 3 code "3", etc. With respect to the embodiment of the present invention illustrated in figure 3, phone numbers are precisely allocated to the words in the reference memory as additional words. Thus it is possible by means of the present invention and within the scope of word recognition to recognize a name as a first word from a spoken speech signal, and further a second word, for instance, as indicated, a phone number.
The relationships between code book memory contents and reference memory contents are shown in figure 3 by means of the word "IRIS". The code sequence "22324" is allocated to the word "IRIS" in the reference memory. If the codes of the code sequence are replaced by the corresponding code book feature vectors of the code book memory, the illustrated code book feature vector sequence is obtained. Said sequence, and thus simultaneously the word "IRIS", is allocated the phone number "0171234" in the reference memory.
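The allocation structure of figure 3 can be modelled with two plain dictionaries. The code sequence and phone number below are the illustrative values from the figure; the vector component values are made up for the sketch:

```python
# code book memory (factory-made, speaker-independent): code -> feature vector
code_book = {
    1: (0.10, 0.20, 0.30, 0.40),  # vector a = (a1, a2, a3, a4), values made up
    2: (0.50, 0.60, 0.70, 0.80),  # vector b
    3: (0.90, 1.00, 1.10, 1.20),
    4: (1.30, 1.40, 1.50, 1.60),
}

# reference memory (speaker-dependent): word -> (code sequence, allocated word)
reference = {
    "IRIS": ([2, 2, 3, 2, 4], "0171234"),
}

# replacing each code by its code book feature vector yields the
# code book feature vector sequence describing the word
codes, phone_number = reference["IRIS"]
vector_sequence = [code_book[c] for c in codes]
print(len(vector_sequence), phone_number)  # 5 0171234
```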
Figure 4 illustrates the inventive method for the computer-based recognition of a word from a speech signal. In general the present invention is based on a speech signal which essentially only contains relevant information and which describes a digitalized word from the reference memory. The first prerequisite can be fulfilled by means of filtering out interfering noises. This can be done either prior to the execution of the inventive method or during the extraction of speech features. Corresponding methods are known to the person skilled in the art. The same applies to methods for determining the beginning and end of a word, which can be applied together with the inventive method if the word recognition is applied to continuous speech, or in other words to a sequence of words.
In view of the word recognition illustrated in figure 4, the speech signal 400 is divided into a finite sequence of time frames TS. This can be effected either while picking up the speech signal, i.e. during so-called real-time processing, or by means of buffering. The time sequence of the time frames TS is thereby precisely described by the allocation of an index i. The first time frame TS1, for instance, has the index value 1, the second time frame TS2 the index value 2, etc.
For word recognition a speech feature vector SVi is extracted from at least one time frame TSi 410 and is buffered 420. The corresponding time frames can be pre-specified. This means that both the number of time frames used for the word recognition (e.g. 10 time frames, preferably all time frames, or at least one time frame) as well as the kind of selection (e.g. every second or every third time frame) are pre-specified.
For comparing a speech feature vector SVi obtained in this way with code book feature vectors CV, for each code sequence CFj 430 stored in the reference memory that code Cji is read out 440 whose position in the code sequence CFj is allocated to the index value i of the time frame TSi. The index j serves to differentiate the code sequences CF of the reference memory and in the indicated example runs from the value 1 to the number of code sequences CF present. For each of the codes Cji recognized in this way, the corresponding, i.e. allocated, code book feature vector CVC is read out of the code book memory CS 450 and compared with the buffered speech feature vector SVi 460.
The result of a respective comparison is a similarity value 460 indicating the similarity between the speech feature vector SVi and the code book feature vector CVC. The similarity value can, for example, be a probability value of the correspondence or the absolute value of the difference of the vectors SVi and CVC. The respective similarity value is buffered 470 such that it is precisely allocated to the word described by the code sequence CFj in the reference memory RS. The buffering is preferably done in the reference memory RS. Each extracted speech feature vector SVi thus results in a number of similarity values corresponding to the number of code sequences CF of the reference memory RS.
From the similarity values determined by the comparisons, the word stored in the reference memory RS by which the speech signal is described is recognized 480. This can be done by means of the similarity values through different methods known to the person skilled in the art, for example the algorithm of so-called dynamic programming or the so-called Viterbi algorithm.
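The recognition loop of figure 4 can be sketched as follows. The similarity value here is a summed squared Euclidean distance and the final decision is a simple minimum; the patent leaves the decision to methods such as dynamic programming or the Viterbi algorithm, so this is only an illustrative simplification:

```python
def recognize(speech_vectors, reference, code_book):
    """For each stored code sequence, compare the i-th speech feature
    vector with the code book vector referenced at position i, sum the
    differences, and return the word with the smallest total."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    scores = {}
    for word, codes in reference.items():
        total = 0.0
        for i, sv in enumerate(speech_vectors):
            if i < len(codes):            # code at the position of time frame i
                total += dist(sv, code_book[codes[i]])
        scores[word] = total
    return min(scores, key=scores.get)

code_book = {1: (0.0, 0.0), 2: (1.0, 1.0)}
reference = {"A": [1, 1], "B": [2, 2]}
print(recognize([(0.1, 0.0), (0.0, 0.1)], reference, code_book))  # A
```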
Figure 5 illustrates the inventive new pick-up of a word, i.e. the training phase of word recognition. The reference memory contents are thereby expanded in a speaker-dependent manner by a digitalized training word 505 and by the representation of an allocated speech signal 500, which is hereinafter also called training speech signal.
At first, the digitalized training word 505 is stored 510 in the reference memory RS. Just as with the word recognition according to figure 4, the training speech signal is subsequently divided into a finite sequence of pre-specified time frames TS. An index value is allocated to the time frames TS by means of an index i, which index value precisely describes the time sequence of the time frames TSi. A speech feature vector SVi is extracted 530 from at least one time frame TSi 520.
Thereafter a code sequence is determined indicating a sequence of referenced code book feature vectors by which the speech feature vector sequence of the newly picked-up word is described. Said code sequence is allocated to the newly picked-up word and stored in the reference memory 590. For this purpose each extracted speech feature vector SVi is compared with all code book feature vectors CVj 535 stored in the code book memory CS. To this end, the respective code book feature vector CVj and the allocated code Cj are read out 540 of the code book memory CS. As with word recognition, each comparison results in a similarity value AW 550, i.e. for example in a probability value or a vector difference value, which is buffered together with the code Cj of the corresponding code book feature vector CVj 560. The number of similarity values AW resulting per extracted speech feature vector SVi is equal to the number of the code book feature vectors CVj stored in the code book memory CS.
Thereafter the code CCVmax with the greatest similarity is determined 580 for each speech feature vector SVi from the buffered codes Cj and the similarity values. For this purpose, for example, the greatest probability value AWmax of the correspondence or, if the similarity values are vector difference values, the smallest vector difference value is searched for. The code CCVmax so determined is stored in the reference memory RS such that it forms an element of a code sequence which is allocated to the digitalized training word, and such that its position in the code sequence is precisely described 590 by the index value of the time frame TSi from which the respective speech feature vector has been extracted. The code resulting from the comparison of the code book feature vectors with the first speech feature vector, for instance, can be stored in the first position of the code sequence, the code resulting from the comparison with the second speech feature vector in the second position of the code sequence, etc.
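In this simplified sketch, the training loop of figure 5 reduces to vector quantization: each extracted speech feature vector is mapped to the code of its most similar code book vector, and the resulting code sequence is stored for the new word. A squared Euclidean distance stands in for the similarity value:

```python
def train_word(word, speech_vectors, code_book, reference):
    """Determine, for each speech feature vector, the code whose code book
    feature vector is most similar, and store the resulting code sequence
    under the new word in the reference memory."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    sequence = [min(code_book, key=lambda c: dist(sv, code_book[c]))
                for sv in speech_vectors]      # position = time frame index
    reference[word] = sequence
    return sequence

code_book = {1: (0.0, 0.0), 2: (1.0, 1.0)}
reference = {}
print(train_word("IRIS", [(0.1, 0.1), (0.9, 1.0)], code_book, reference))  # [1, 2]
```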
In an additional embodiment according to the invention, several code sequences can be allocated in the reference memory to a single digitalized word. This takes place in correspondence with the already described method for newly picking up a word, wherein the renewed storage of the word is omitted. In another inventive embodiment another word, for example a control character sequence or a phone number, is additionally allocated to the newly picked-up word and stored in the reference memory.
Figure 6 shows by means of an example the procedure of the speech feature extraction. At first, a speech signal 600, being continuous in terms of time and divided into time windows, for example with a width of 10 ms, is sampled, for instance at a sampling frequency of 8 kHz, and quantized by means of an analog-digital converter, for instance at a resolution of 16 bit 610. Given a time window width of 10 ms and a sampling frequency of 8 kHz, each time window comprises 80 sampling values. As a speech signal can only be considered constant over a short time period (about 10 ms to 50 ms), the window width should not exceed said time period. A speech feature calculation is triggered for each time window. Upon filtering and amplifying the sampling values (not shown in fig. 6), a so-called Hamming window, for instance with a width of 32 ms, is applied for cutting out a quasi-stationary segment of the now discretely present speech signal 620. With the exemplary width of 32 ms of the Hamming window, also sampling values of the speech signal are captured which do not belong to the actual time window of 10 ms. This allows a better feature extraction.
The sampling values of the cut-out segment are subjected to a so-called Fast Fourier Transformation (FFT) 630. As the speech signal is always real-valued, a mirrored spectrum is the result. Therefore, the first half of the resulting spectral values is sufficient for completely describing the spectrum. As, furthermore, the phase progression is meaningless for the word recognition, the continued calculation is merely based on the squared magnitude of the frequency response 640. A 32 ms segment delivers 256 sampling values, and the FFT results in 256 spectral values. Upon forming the squared magnitude of the frequency response, the calculation continues with 128 spectral values.
By means of a suitable filtering, a multi-dimensional feature vector is obtained from the spectral values, which vector represents the envelope of the segment 650. For this purpose, the 128 spectral values can be grouped, for instance, by means of 15 overlapping so-called triangular kernels (not shown in fig. 6), which are arranged according to the so-called MEL scale. The grouping is achieved by multiplying the individual triangular kernels with the magnitude spectrum, wherein the results within such a kernel are summed up. In the mentioned example this results in a 15-dimensional feature vector representing the envelope of the 32 ms segment. Said feature vector is logarithmized in accordance with the logarithmic characteristic of the human ear 660. The energy of the vector is thereafter determined, and through the division of each single component the vector is normalized by the calculated energy 670. The calculated energy is thereby added to the vector as an additional component, so that in the given example the speech feature extraction results in a speech feature vector with 16 components 670.
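The pipeline of figure 6 (window, spectrum, band grouping, logarithm, energy normalization) can be sketched in pure Python. Two simplifications are assumptions of this sketch: a naive DFT stands in for the FFT, and the 15 overlapping MEL-scaled triangular filters are replaced by 15 equal-width bands:

```python
import math

def extract_feature(samples, n_bands=15):
    n = len(samples)
    # Hamming window cuts out a quasi-stationary segment (step 620)
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(samples)]
    # naive DFT instead of the FFT of step 630; the signal is real,
    # so the first half of the spectrum suffices
    spectrum = []
    for k in range(n // 2):
        re = sum(w * math.cos(2 * math.pi * k * i / n) for i, w in enumerate(windowed))
        im = sum(w * math.sin(2 * math.pi * k * i / n) for i, w in enumerate(windowed))
        spectrum.append(re * re + im * im)       # squared magnitude (step 640)
    # equal-width bands stand in for the MEL triangular filters (step 650)
    width = len(spectrum) / n_bands
    env = [math.log(sum(spectrum[int(b * width):int((b + 1) * width)]) + 1e-10)
           for b in range(n_bands)]              # logarithm (step 660)
    energy = math.sqrt(sum(e * e for e in env))
    feature = [e / energy for e in env]          # energy normalization (step 670)
    feature.append(energy)                       # energy as extra component
    return feature

frame = [math.sin(2 * math.pi * 440 * i / 8000) for i in range(256)]  # 32 ms at 8 kHz
print(len(extract_feature(frame)))  # 16
```

As in the text's example, 256 samples yield 128 spectral values, 15 log band values, and a 16-component feature vector once the energy is appended.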
The difference over the preceding vector can optionally be calculated, for instance for removing background noise, which can be a constant portion in the speech feature vector. The difference speech feature vector then forms the actual result of the feature extraction.
A further embodiment of the present invention, which is explained without using a figure, relates to a computer program. The term "computer program" as defined in the present invention explicitly includes the term "computer program product". The computer program, which can be loaded into the internal memory of a digital computer unit, particularly of a mobile phone, comprises software code portions adapted to perform the described inventive method when the computer program is executed on the computer unit.
Said computer program can particularly also be stored on a computer-readable medium such as a floppy disc, CD-ROM or an optical disc.

Patent Claims
1. Method for a computer-based recognition of a word from a speech signal, a) wherein the speech signal is divided into a sequence of time frames to each of which an index value is allocated, b) wherein a time sequence of the time frames is precisely described by the allocated index values, c) wherein the following steps are performed for at least one time frame of the speech signal: c1) extracting and buffering a speech feature vector of the time frame, c2) wherein the following steps are performed for at least one code sequence, which is stored in a reference memory and to which a first digitalized word stored in the reference memory is allocated at each time: c2i) reading out a code of the code sequence, the position of which in the code sequence is allocated to the index value of the time frame, c2ii) reading out a code book feature vector allocated to the read-out code from a code book memory, c2iii) recognizing and buffering a similarity value from a comparison of the buffered speech feature vector with the read-out code book feature vector, d) wherein by means of the buffered similarity values the first digitalized word stored in the reference memory, by which the speech signal is described, is recognized.
2. Method according to claim 1, wherein additionally
- the buffering of the respectively recognized similarity value takes place in the reference memory.
3. Method according to one of claims 1 or 2, wherein additionally a second digitalized word is recognized, which is allocated to the first digitalized word by which the speech signal is described, and which is stored in the reference memory.
4. Method according to claim 3, wherein the second digitalized word is a digitalized phone number.
5. Method according to one of claims 1 to 4, wherein
- contents of the reference memory are speaker-dependent
- contents of the code book memory are speaker-independent.
6. Method according to one of claims 1 to 5, wherein contents of the code book memory are language-independent.
7. Method according to one of claims 1 to 6, wherein contents of the reference memory are expanded in a speaker-dependent manner by means of the following steps with the aid of a digitalized training word and an allocated training speech signal: a) dividing the training speech signal into a sequence of time frames each having an index value clearly describing a time sequence of the time frames, b) storing the digitalized training word in the reference memory, c) performing the following steps for at least one time frame of the training speech signal: c1) extracting and buffering a speech feature vector from a part of the training speech signal designated by the time frame, c2) performing the following steps for at least one code book feature vector stored in the code book memory: c2i) reading out the code book feature vector and buffering the code allocated to the read-out code book feature vector, c2ii) determining a similarity value from a comparison of the buffered speech feature vector with the read-out code book feature vector, c2iii) buffering the similarity value so that the similarity value is allocated to the buffered code, c3) determining by means of the buffered similarity values the code allocated to the code book feature vector by which the speech feature vector is described, c4) storing the determined code in the reference memory so that the determined code forms an element in a code sequence, which is allocated to the digitalized training word, and so that the position thereof in the code sequence is precisely described by the index value of the time frame.
8. Method according to claim 7, wherein additionally a further digitalized word is allocated to the digitalized training word and stored in the reference memory.
9. Method according to one of claims 1 to 8, wherein the extraction of a speech feature vector is performed by means of the following steps:
- sampling a speech signal portion containing at least the portion of the speech signal identified by the time frame with a sampling frequency which can be predefined,
- quantizing the sampling values into discrete sampling values,
- buffering the discrete sampling values,
- determining by means of a Fast Fourier Transformation (FFT) spectral values, wherein the spectral values are calculated from the discrete sampling values of the time frame and from a pre-specified number of sampling values outside the time frame,
- determining a first vector by filtering spectral values representing the magnitude of the spectrum,
- determining a second vector by means of logarithmizing the first vector,
- determining the speech feature vector by means of an energy normalization of the second vector, wherein the calculated energy of the second vector is added to the speech feature vector as additional vector component.
10. Computer unit for recognizing a word from a speech signal, comprising
- an extraction unit for extracting speech feature vectors from the speech signal,
- a code book memory having speaker-independently stored code book feature vectors and codes, wherein a respective code is allocated to the code book feature vectors,
- a reference memory for speaker-dependently storing digitalized words and code sequences, wherein at least one code sequence is allocated to a respective digitalized word,
- a read-out unit for reading code book feature vectors out of the code book memory, wherein the reading out takes place by means of a code of the code sequences of the reference memory,
- a comparing means for comparing speech feature vectors extracted from the speech signal with code book feature vectors of the code book memory, and
- a recognition unit for recognizing a digitalized word from the reference memory, by which the speech signal is described.
11. Computer unit according to claim 10, wherein the code book feature vectors of the code book memory are language-independent.
12. Computer unit according to claim 10 or 11, wherein the reference memory is a random access memory (RAM) and wherein the code book memory is a read-only memory (ROM).
13. Computer unit according to claim 10, 11 or 12, wherein at least four of the following components are realized on a chip:
- the extraction unit,
- the code book memory,
- the reference memory,
- the comparing means,
- the read-out unit,
- the recognition unit.
14. Computer unit according to one of claims 10 to 13, wherein the computer unit for word recognition is a computer unit of a mobile phone.
15. Computer unit according to one of claims 10 to 14, wherein the reference memory additionally stores digitalized phone numbers, each of which is allocated to a digitalized word.
16. Computer program, which can be loaded into an internal memory of a digital computer unit, and which comprises software code portions adapted to perform the steps according to one of claims 1 to 3, 7 to 9, if the computer program is executed on the computer unit.
17. Computer program according to claim 16, wherein the computer program is stored on a computer-readable medium.
PCT/EP2000/010891 1999-11-18 2000-11-04 Coding and training of the vocabulary for speech recognition WO2001037261A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU19983/01A AU1998301A (en) 1999-11-18 2000-11-04 Coding and training of the vocabulary for speech recognition

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP99122889.1 1999-11-18
EP99122889A EP1102239A1 (en) 1999-11-18 1999-11-18 Vocabulary encoding and training for speech recognition

Publications (1)

Publication Number Publication Date
WO2001037261A1 true WO2001037261A1 (en) 2001-05-25

Family

ID=8239418

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2000/010891 WO2001037261A1 (en) 1999-11-18 2000-11-04 Coding and training of the vocabulary for speech recognition

Country Status (3)

Country Link
EP (1) EP1102239A1 (en)
AU (1) AU1998301A (en)
WO (1) WO2001037261A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE19636739C1 (en) * 1996-09-10 1997-07-03 Siemens Ag Multi-lingual hidden Markov model application for speech recognition system
US5732187A (en) * 1993-09-27 1998-03-24 Texas Instruments Incorporated Speaker-dependent speech recognition using speaker independent models

Non-Patent Citations (3)

Title
BAHL L R ET AL: "A METHOD FOR THE CONSTRUCTION OF ACOUSTIC MARKOV MODELS FOR WORDS", IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, US, IEEE INC. NEW YORK, vol. 1, no. 4, 1 October 1993 (1993-10-01), pages 443 - 452, XP000422858, ISSN: 1063-6676 *
FONTAINE V ET AL: "SPEAKER-DEPENDENT SPEECH RECOGNITION BASED ON PHONE-LIKE UNITS MODELS - APPLICATION TO VOICE DIALING", IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP),US,LOS ALAMITOS, IEEE COMP. SOC. PRESS, 1997, pages 1527 - 1530, XP000822750, ISBN: 0-8186-7920-4 *
SADAOKI FURUI: "VECTOR-QUANTIZATION-BASED SPEECH RECOGNITION AND SPEAKER RECOGNITION TECHNIQUES", PROCEEDINGS OF THE ASILOMAR CONFERENCE ON SIGNALS, SYSTEMS AND COMPUTERS,US,LOS ALAMITOS, IEEE COMP. SOC. PRESS, vol. CONF. 25, 1991, pages 954 - 958, XP000314439 *

Also Published As

Publication number Publication date
EP1102239A1 (en) 2001-05-23
AU1998301A (en) 2001-05-30

Similar Documents

Publication Publication Date Title
US7630878B2 (en) Speech recognition with language-dependent model vectors
US5893059A (en) Speech recoginition methods and apparatus
US4624008A (en) Apparatus for automatic speech recognition
US6014624A (en) Method and apparatus for transitioning from one voice recognition system to another
Kanthak et al. Context-dependent acoustic modeling using graphemes for large vocabulary speech recognition
CA2610269C (en) Method of adapting a neural network of an automatic speech recognition device
JP4351385B2 (en) Speech recognition system for recognizing continuous and separated speech
US5865626A (en) Multi-dialect speech recognition method and apparatus
US6092045A (en) Method and apparatus for speech recognition
JP2733955B2 (en) Adaptive speech recognition device
JP4414088B2 (en) System using silence in speech recognition
JPH0422276B2 (en)
US20020178004A1 (en) Method and apparatus for voice recognition
JPH0416800B2 (en)
US20060074662A1 (en) Three-stage word recognition
JP3803029B2 (en) Voice recognition device
WO2001022400A1 (en) Iterative speech recognition from multiple feature vectors
JPH10505687A (en) Method and apparatus for speech recognition using optimized partial stochastic mixed consensus
JPH11272291A (en) Phonetic modeling method using acoustic decision tree
EP1022725B1 (en) Selection of acoustic models using speaker verification
EP1525577B1 (en) Method for automatic speech recognition
KR19980070329A (en) Method and system for speaker independent recognition of user defined phrases
GB2347775A (en) Method of extracting features in a voice recognition system
JP2002536691A (en) Voice recognition removal method
JP2980026B2 (en) Voice recognition device

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase