WO2002101719A1 - Voice recognition apparatus and voice recognition method - Google Patents
- Publication number
- WO2002101719A1 (PCT/JP2002/005647)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- conversion function
- conversion
- input
- voice
- speech
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G10L15/07—Adaptation to the speaker
- G10L15/28—Constructional details of speech recognition systems
Description
- The present invention relates to a speech recognition device and a speech recognition method, and more particularly to a speech recognition device and a speech recognition method capable of performing high-accuracy speech recognition, without making the user aware of model adaptation, even when the device is used by a plurality of users or in a plurality of environments. Background art
- In a speech recognition device, the following processing (speech recognition processing) is generally performed to recognize the input speech.
- That is, first, a feature vector of a predetermined dimension, representing the feature amount of the input speech, is extracted by acoustic analysis of the input speech.
- As a method of acoustic analysis, there is, for example, the Fourier transform.
- Next, a matching process between the feature vector sequence and acoustic models is performed, and the word sequence (word string) corresponding to the sequence of acoustic models that matches the feature vector sequence is obtained as the result of the matching process.
- The acoustic model is configured using a probability (density) function, such as one or more Gaussian probability distributions defined in the feature vector space; for example, an HMM (Hidden Markov Model) is used.
- In the matching process, using the Gaussian probability distributions constituting the acoustic models, the degree (score) to which the feature vector sequence is observed from each sequence of acoustic models serving as a candidate for the speech recognition result (hereinafter referred to as a hypothesis, as appropriate) is calculated, and the final speech recognition result is determined from the multiple hypotheses based on the scores.
- That is, the hypothesis with the highest score for the feature vector sequence is selected as the one that best matches the input speech, and the word string corresponding to the sequence of acoustic models constituting that hypothesis is output as the speech recognition result.
- Speech recognition devices can be broadly divided into three types: devices for a specific speaker, devices for unspecified speakers, and model-adaptive devices. A speech recognition device for a specific speaker uses an acoustic model that has been trained using the speech of that specific speaker, so a highly accurate (low recognition-error-rate) speech recognition result can be obtained for that speaker.
- However, in a speech recognition device for a specific speaker, the speech recognition accuracy for speakers other than that specific speaker generally deteriorates greatly.
- In a speech recognition device for unspecified speakers, an acoustic model trained using the speech of an unspecified number of speakers is used, so relatively accurate speech recognition results can be obtained for any speaker.
- However, the speech recognition accuracy for a particular speaker cannot be made as high as that of a speech recognition device dedicated to that particular speaker.
- A model-adaptive speech recognition device initially has the same performance as a speech recognition device for unspecified speakers, but when a specific user (speaker) uses the device, model adaptation of the acoustic model is performed using that user's voice, and the speech recognition accuracy for that user improves.
- That is, the model-adaptive speech recognition device first performs speech recognition using the same acoustic model as a speech recognition device for unspecified speakers; at that time, the mismatch between the speech input from the user and the acoustic model is analyzed, and based on the analysis result, a conversion matrix for converting the acoustic model into a model that matches (is adapted to) the input speech is obtained. Thereafter, speech recognition is performed using the acoustic model converted by this conversion matrix, that is, the model-adapted acoustic model.
- In a model-adaptive speech recognition device, the above model adaptation is performed as training, for example before the user begins using the device in earnest, so that the acoustic model is converted into one that matches the user's voice and the speech recognition accuracy for that particular user is improved.
- Since the acoustic model in a model-adaptive speech recognition device is converted into one suitable for recognizing speech as described above, the device is adapted to the user if one focuses on the user (speaker), and adapted to the environment if one focuses on the environment in which the device is used.
- The environment in which the speech recognition device is used includes, for example, the noise at that location and the distortion of the channel through which the user's voice reaches the device.
- That is, if a model-adaptive speech recognition device is used in a certain environment, the acoustic model is converted so as to adapt to the sound in that environment, and the device is thereby adapted to the environment in which it is used.
- Channel distortion may be due to the characteristics of the microphone that converts the voice into an electric signal, or, when the voice input to the speech recognition device is transmitted through a band-limited transmission line such as a telephone line, to the characteristics of that transmission line.
- Model adaptation is performed, for example, by linearly transforming the mean vectors defining the Gaussian probability distributions constituting the HMM, using the above-described transformation matrix.
- Here, model adaptation refers to both the conversion of the acoustic model by a conversion matrix and the conversion of the feature vectors. That is, in model adaptation, the acoustic model may be adapted to the feature vectors obtained from the user's voice, or the feature vectors obtained from the user's voice may be adapted to the acoustic model.
- Model adaptation aims to improve (increase) the score, that is, the likelihood that the feature vector of a given utterance is observed from the acoustic model corresponding to that utterance (for example, the acoustic model of the phonemes of the utterance), where the score is calculated from the Gaussian probability distributions constituting the HMM serving as the acoustic model. For example, when the feature vector is the target of conversion, the feature vector is, ideally, mapped by the transformation matrix onto the mean vector defining the Gaussian probability distribution constituting the corresponding acoustic model. As a result, the score for the feature vector of the utterance of interest calculated from the acoustic model corresponding to that utterance becomes larger than the scores calculated from other acoustic models.
- Therefore, in model adaptation, a transformation matrix is determined that performs a linear transformation mapping the feature vector onto the mean vector defining the Gaussian probability distribution constituting the acoustic model corresponding to the utterance of interest. This transformation matrix can be calculated, for example, periodically or irregularly, and at the time of speech recognition, the matching process is performed using the feature vectors (or the acoustic model) converted by the transformation matrix.
- In general, a transformation matrix for performing model adaptation is obtained using a plurality of feature vector sequences obtained from a plurality of utterances of the particular speaker.
- As a method of obtaining a transformation matrix that maps each feature vector to the corresponding mean vector, a method using linear regression (the least squares method) is known, for example. A transformation matrix obtained in this way maps the feature vectors obtained from the utterances of a specific speaker so as to minimize the statistical error (here, the sum of squared errors) with respect to the corresponding mean vectors; in general, therefore, such a transformation matrix cannot convert an arbitrary feature vector from the speaker's utterances so that it exactly matches the corresponding mean vector.
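To make the least-squares fit concrete, the following is a minimal sketch, assuming the feature vectors have already been aligned frame-by-frame to the mean vectors of the recognized acoustic models; the function name and the use of NumPy are illustrative, not part of the patent.

```python
import numpy as np

def estimate_transform(features, means):
    """Least-squares estimate of a matrix W mapping each feature vector x
    as closely as possible to its associated mean vector mu (mu ~ W x).

    features: (T, D) array of feature vectors from a speaker's utterances
    means:    (T, D) array of the mean vectors they were aligned to
    """
    # np.linalg.lstsq minimizes the sum of squared errors
    # ||features @ A - means||^2; A.T is then the matrix applied to
    # column feature vectors.
    A, *_ = np.linalg.lstsq(features, means, rcond=None)
    return A.T
```

As the text notes, the resulting matrix minimizes the total squared error over all training frames, but will generally not map any individual feature vector exactly onto its mean vector.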
- Various other model adaptation methods have been proposed, but all are basically the same as the above method in that the feature vector of the utterance of interest, or the acoustic model corresponding to that utterance, is converted so as to maximize the likelihood that the feature vector is observed from the acoustic model.
- In a model-adaptive speech recognition device, as model adaptation to a specific user's voice or to a specific environment progresses, the speech recognition accuracy for that user or that environment improves, but on the other hand, the speech recognition accuracy for other users and other environments deteriorates. As a result, the model-adaptive speech recognition device comes to have the same performance as a speech recognition device for a specific speaker.
- Of course, such a speech recognition device may subsequently be used by another user or in another environment, and through such use it can be adapted to that other user or environment.
- However, once the acoustic model of the speech recognition device has adapted to the first user or the first environment, the speech recognition accuracy is greatly degraded until it adapts to the other user or environment.
- Furthermore, the acoustic model adapted to the first user or the first environment may not be able to adapt fully to another user or environment; in that case, it is necessary to return the acoustic model adapted to the first user or environment to the original acoustic model (to reset it) and then adapt it to the other user or environment.
- As a method of dealing with this when a speech recognition device is used by a plurality of users, a plurality of sets of acoustic models can be prepared, with a different set adapted to each user.
- In this case, since speech recognition for each user is performed using the set of acoustic models adapted to that user, speech recognition accuracy comparable to that of a speech recognition device for a specific speaker can be obtained for all of the plurality of users.
- However, in this method, since speech recognition must be performed using the acoustic models adapted to the user who is speaking, the device must be informed of which user is speaking. The user must therefore input information identifying himself or herself, by operating a button or the like, before starting to use the device, which is troublesome. Disclosure of the invention
- The present invention has been made in view of such circumstances, and enables high-accuracy speech recognition, without requiring the user to be conscious of model adaptation, even when the device is used by a plurality of users or in a plurality of environments.
- The speech recognition device of the present invention comprises: conversion function storage means for storing one or more conversion functions used, when performing model adaptation for adapting the input speech and the acoustic model used for speech recognition to each other, to convert one of the input speech and the acoustic model so as to adapt it to the other; allocating means for detecting, based on the conversion results obtained by converting one of the input speech and the acoustic model corresponding to the input speech by each of the one or more conversion functions stored in the conversion function storage means, the conversion function that is optimal for adapting one of the input speech and the acoustic model to the other, and for allocating the input speech to that optimal conversion function; speech storage means for storing the input speech to which a conversion function has been allocated; conversion function updating means for updating, among the one or more conversion functions stored in the conversion function storage means, the conversion function to which new input speech has been allocated by the allocating means, using all the input speech allocated to that conversion function; conversion function selecting means for selecting, from the one or more conversion functions stored in the conversion function storage means, the conversion function used to convert one of the input speech and the acoustic model; conversion means for converting one of the input speech and the acoustic model by the conversion function selected by the conversion function selecting means; and matching means for performing a matching process between the converted one of the input speech and the acoustic model and the other, and outputting the speech recognition result of the input speech based on the matching result.
- The speech recognition method of the present invention comprises: an allocation step of detecting, based on the conversion results obtained by converting one of the input speech and the acoustic model corresponding to the input speech by each of one or more conversion functions, the conversion function that is optimal for adapting one of the input speech and the acoustic model to the other, and allocating the input speech to that optimal conversion function; a conversion function updating step of updating the conversion function to which new input speech has been allocated, using all the input speech allocated to that conversion function; a conversion function selection step of selecting, from the one or more conversion functions, the conversion function used to convert one of the input speech and the acoustic model; a conversion step of converting one of the input speech and the acoustic model by the conversion function selected in the conversion function selection step; and a matching step of performing a matching process between the converted one of the input speech and the acoustic model and the other, and outputting the speech recognition result of the input speech based on the matching result.
- The program of the present invention causes a computer to execute: an allocation step of detecting, based on the conversion results obtained by converting one of the input speech and the acoustic model corresponding to the input speech by each of one or more conversion functions, the conversion function that is optimal for adapting one of the input speech and the acoustic model to the other, and allocating the input speech to that optimal conversion function; a conversion function updating step of updating the conversion function to which new input speech has been allocated, using all the input speech allocated to that conversion function; a conversion function selection step of selecting, from the one or more conversion functions, the conversion function used to convert one of the input speech and the acoustic model; a conversion step of converting one of the input speech and the acoustic model by the conversion function selected in the conversion function selection step; and a matching step of performing a matching process between the converted one and the other, and outputting the speech recognition result of the input speech based on the matching result.
- The recording medium of the present invention records a program comprising: an allocation step of detecting, based on the conversion results obtained by converting one of the input speech and the acoustic model corresponding to the input speech by each of one or more conversion functions, the conversion function that is optimal for adapting one of the input speech and the acoustic model to the other, and allocating the input speech to that optimal conversion function; a conversion function updating step of updating the conversion function to which new input speech has been allocated, using all the input speech allocated to that conversion function; a conversion function selection step of selecting, from the one or more conversion functions, the conversion function used to convert one of the input speech and the acoustic model; a conversion step of converting one of the input speech and the acoustic model by the conversion function selected in the conversion function selection step; and a matching step of performing a matching process between the converted one and the other, and outputting the speech recognition result of the input speech based on the matching result.
- In the present invention, based on the conversion results obtained by converting one of the input speech and the acoustic model corresponding to the input speech by each of one or more conversion functions, the conversion function that is optimal for adapting one of them to the other is detected, and the input speech is allocated to that optimal conversion function. The conversion function to which new input speech has been allocated is then updated using all the input speech allocated to it. Further, the conversion function used to convert one of the input speech and the acoustic model is selected from the one or more conversion functions, and one of the input speech and the acoustic model is converted by the selected conversion function. A matching process is then performed between the converted one of the input speech and the acoustic model and the other, and the speech recognition result of the input speech is output based on the matching result. Brief Description of the Drawings
- FIG. 1 is a block diagram showing a configuration example of an embodiment of a speech recognition device to which the present invention is applied.
- FIG. 2 is a flowchart illustrating the speech recognition processing.
- FIG. 3 is a flowchart illustrating the adaptive data registration process.
- FIG. 4 is a flowchart illustrating the conversion matrix update process.
- FIG. 5 is a flowchart illustrating the conversion matrix generation/deletion processing.
- FIG. 6 is a flowchart illustrating a conversion matrix generation process.
- FIG. 7 is a flowchart illustrating the transformation matrix deletion processing.
- FIG. 8 is a block diagram showing a configuration example of another embodiment of the speech recognition device to which the present invention is applied.
- FIG. 9 is a block diagram showing a configuration example of a computer according to an embodiment of the present invention. Best Mode for Carrying Out the Invention
- FIG. 1 shows a configuration example of an embodiment of a speech recognition device to which the present invention is applied.
- The voice uttered by the user is input to a microphone 1, which converts the input voice into an audio signal as an electric signal.
- This audio signal is supplied to an A/D (Analog/Digital) converter 2.
- The A/D converter 2 samples and quantizes the analog audio signal from the microphone 1 and converts it into audio data as a digital signal.
- This audio data is supplied to the feature extraction unit 3.
- The feature extraction unit 3 performs acoustic analysis processing on the audio data from the A/D converter 2 for each appropriate frame, thereby extracting a feature vector as a feature amount such as, for example, an MFCC (Mel Frequency Cepstrum Coefficient).
- Note that the feature extraction unit 3 can also be configured to extract other feature vectors, such as a spectrum, linear prediction coefficients, cepstrum coefficients, or line spectrum pairs.
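The following is a minimal sketch of per-frame MFCC extraction, assuming the librosa library; the file name, sampling rate, and frame parameters are illustrative, not taken from the patent.

```python
import librosa

# Hypothetical input file; 16 kHz is a common rate for speech.
audio, sr = librosa.load("utterance.wav", sr=16000)

mfcc = librosa.feature.mfcc(
    y=audio, sr=sr,
    n_mfcc=13,       # dimensionality of each feature vector
    n_fft=400,       # 25 ms analysis frame at 16 kHz
    hop_length=160,  # 10 ms frame shift
)
features = mfcc.T    # shape (num_frames, 13): one feature vector per frame
```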
- The feature vectors obtained for each frame in the feature extraction unit 3 are sequentially supplied to and stored in the feature vector buffer 4. The buffer 4 therefore stores the time series of feature vectors for each frame.
- Here, the buffer 4 stores, for example, the time-series feature vectors obtained from the start to the end of a certain utterance (voice section).
- The conversion unit 5 linearly converts the feature vectors stored in the buffer 4 using the conversion matrix supplied from the selection unit 14, and supplies the converted feature vectors (hereinafter referred to as conversion feature vectors, as appropriate) to the matching unit 6 as the result of adapting them to the acoustic models stored in the acoustic model storage unit 7.
- Using the feature vectors (conversion feature vectors) supplied from the conversion unit 5, the matching unit 6 refers to the acoustic model storage unit 7, the dictionary storage unit 8, and the grammar storage unit 9 as necessary, and recognizes the voice input to the microphone 1 (the input voice) based on, for example, the continuous-distribution HMM method.
- The acoustic model storage unit 7 stores acoustic models representing the acoustic characteristics of predetermined units (PLUs (Phonetic-Linguistic-Units)) such as the individual phonemes or syllables of the language of the speech to be recognized.
- As an acoustic model, for example, an HMM (Hidden Markov Model) having Gaussian distributions used to calculate the probability that a predetermined feature vector sequence is observed is used.
- Here, it is assumed that the Gaussian distribution of the HMM is defined by a mean vector and a covariance matrix; note that it is also possible to construct the HMM using a probability density function other than the Gaussian distribution.
- The dictionary storage unit 8 stores a word dictionary in which information on pronunciation (phonological information) is described for each word (vocabulary item) to be recognized.
- The grammar storage unit 9 stores grammar rules (language models) describing how the words registered in the word dictionary of the dictionary storage unit 8 are linked (connected) to one another.
- As the grammar rules, for example, rules based on a context-free grammar (CFG) or statistical word-chain probabilities (N-grams) can be used.
- The matching unit 6 connects the acoustic models stored in the acoustic model storage unit 7 by referring to the word dictionary in the dictionary storage unit 8, thereby constructing acoustic models of words (word models). Further, the matching unit 6 connects several word models by referring to the grammar rules stored in the grammar storage unit 9, and uses the word models connected in this way to recognize the voice input to the microphone 1 by the continuous-distribution HMM method, based on the time-series feature vectors. That is, the matching unit 6 calculates a score representing the likelihood that the time-series feature vectors supplied via the conversion unit 5 are observed from each sequence of word models configured as described above, detects the word model sequence with the highest score, and outputs the word string corresponding to that sequence as the speech recognition result.
- More specifically, since speech recognition is performed by the HMM method, the matching unit 6 accumulates the appearance probabilities of the individual feature vectors for the word string corresponding to the connected word models, takes the accumulated value as the score, and outputs the word string with the highest score as the speech recognition result.
- The score calculation is generally performed by comprehensively evaluating an acoustic score given by the acoustic models stored in the acoustic model storage unit 7 (hereinafter referred to as the acoustic score, as appropriate) and a linguistic score given by the grammar rules stored in the grammar storage unit 9 (hereinafter referred to as the language score, as appropriate).
- Specifically, the acoustic score is calculated, for each word, based on the probability that the sequence of feature vectors output by the feature extraction unit 3 is observed from the acoustic models constituting the word model.
- The language score is obtained, for example in the case of a bigram, based on the probability that the word of interest and the word immediately preceding it are linked (connected).
- The speech recognition result is then determined based on a final score obtained by comprehensively evaluating the acoustic score and the language score of each word (hereinafter referred to as the final score, as appropriate).
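As a concrete illustration of the bigram language score mentioned above, the following is a minimal sketch; the word tokens, counts, and add-one smoothing are illustrative assumptions, not taken from the patent.

```python
import math

# Hypothetical corpus counts.
bigram_counts = {("new_york", "ni"): 8, ("ni", "ikitai"): 5}
unigram_counts = {"new_york": 20, "ni": 40, "ikitai": 6}

def language_score(prev_word, word):
    """Log-probability that `word` follows `prev_word` (add-one smoothed)."""
    pair = bigram_counts.get((prev_word, word), 0) + 1
    total = unigram_counts.get(prev_word, 0) + len(unigram_counts)
    return math.log(pair / total)
```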
- Specifically, the final score S of a word string consisting of N words w_1, w_2, ..., w_N is calculated, for example, according to the following equation, where A(w_k) denotes the acoustic score of the word w_k, L(w_k) its language score, and C_k a weight applied to the language score L(w_k):
- S = Σ[k=1..N] ( A(w_k) + C_k × L(w_k) )
- A matching process is performed to find the N and the word string w_1, w_2, ..., w_N that maximize this final score, and that word string w_1, w_2, ..., w_N is output as the speech recognition result.
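As a check on the formula, the following is a minimal sketch of the final-score computation; the log-domain score values and weights are illustrative placeholders.

```python
def final_score(acoustic_scores, language_scores, weights):
    """S = sum over k of (A(w_k) + C_k * L(w_k)) for one word string."""
    return sum(a + c * l
               for a, l, c in zip(acoustic_scores, language_scores, weights))

# Illustrative log-domain scores for a three-word hypothesis.
A = [-12.3, -8.1, -15.6]  # acoustic scores A(w_k)
L = [-2.0, -1.2, -3.4]    # language scores L(w_k)
C = [1.0, 1.0, 1.0]       # weights C_k
print(final_score(A, L, C))  # the recognizer keeps the hypothesis maximizing this
```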
- By performing the above matching process, in the speech recognition device of FIG. 1, when the user utters, for example, "New York ni ikitai desu" ("I want to go to New York"), each of the words "New York", "ni", "ikitai", and "desu" is given an acoustic score and a language score, and when the final score obtained by comprehensively evaluating these is the largest, the word string "New York", "ni", "ikitai", "desu" is output as the speech recognition result.
- In this case, assuming the utterance consists of the four words above, the matching unit 6 must evaluate word strings of four words and determine among them the one that best matches the user's utterance (the one with the greatest final score). If the number of words registered in the word dictionary increases, the number of ways of arranging that many words becomes the number of words raised to the power of the word count, so the number of word strings that must be evaluated becomes enormous.
- Moreover, in general, the number of words contained in an utterance is unknown, so not only word strings consisting of four words but also word strings consisting of one word, two words, and so on must be evaluated. The number of word strings to be evaluated therefore becomes even more enormous, and efficiently determining, in terms of the amount of computation and the memory capacity used, which among these enormous candidates is most likely to be the speech recognition result is a very important issue.
- Examples of methods for reducing the amount of computation and the memory capacity include an acoustic pruning method, which terminates the score calculation based on the acoustic score obtained in the course of computing it, and a linguistic pruning method, which narrows down the words subjected to score calculation based on the language score.
- These pruning techniques are also called beam search methods.
- In a beam search, a predetermined threshold is used for narrowing down (pruning) the words, and this threshold is called the beam width.
- In the following, the acoustic score and the language score are collectively referred to simply as the score, as appropriate.
- In the beam search, a sequence of certain words is assumed as a hypothesis, that is, as a candidate for the speech recognition result; a new hypothesis is generated by connecting a new word to the word sequence of each existing hypothesis, and a score for the word sequence of each generated hypothesis is calculated using the feature vectors.
- Hypotheses with relatively low scores are then deleted, and the same processing is repeated for the remaining hypotheses.
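The following is a minimal sketch of this hypothesis expansion and pruning loop; `extend` and `frame_score` stand in for the word-connection and scoring machinery of the matcher, and all names here are illustrative assumptions.

```python
def beam_search(frames, extend, frame_score, beam_width):
    """Expand hypotheses frame by frame, pruning those far below the best."""
    hypotheses = [((), 0.0)]  # (word sequence, accumulated score)
    for frame in frames:
        expanded = []
        for words, score in hypotheses:
            for new_words in extend(words):  # connect a new word to the hypothesis
                expanded.append((new_words, score + frame_score(frame, new_words)))
        best = max(score for _, score in expanded)
        # Prune: delete hypotheses whose score is relatively low.
        hypotheses = [(w, s) for w, s in expanded if s >= best - beam_width]
    return max(hypotheses, key=lambda h: h[1])  # highest-scoring hypothesis
```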
- The registration unit 10 associates, for each utterance (each voice section), the feature vector sequence of the speech stored in the buffer 4 with the sequence of mean vectors defining the Gaussian distributions of the corresponding sequence of acoustic models (here, HMMs, as described above), and supplies the result to the adaptation database 11.
- The feature vector sequence that the registration unit 10 supplies to the adaptation database 11, together with the sequence of mean vectors associated with it, is used to update the transformation matrices used to adapt the feature vectors output by the feature extraction unit 3 to the acoustic models stored in the acoustic model storage unit 7. A set consisting of a feature vector sequence supplied to the adaptation database 11 by the registration unit 10 and the sequence of mean vectors associated with it is hereinafter referred to as adaptation data, as appropriate.
- The sequence of mean vectors in such adaptation data is the one from which the feature vector sequence is observed with the highest likelihood (probability) given the corresponding sequence of acoustic models; ideally, therefore, a transformation matrix that converts the feature vector sequence in the adaptation data into the sequence of mean vectors associated with it can be said to be a transformation matrix that performs optimal model adaptation.
- The speech feature vectors constituting the adaptation data are obtained by inputting the voice of a user or the like to the microphone 1 and processing it in the feature extraction unit 3.
- The question is how to recognize the sequence of acoustic models corresponding to the voice input to the microphone 1; this can be done, for example, by the following two methods.
- In the first method, the speech recognition device requests the user to utter a predetermined word; in this case, the sequence of acoustic models corresponding to the voice can be recognized based on the predetermined word the user was requested to utter.
- In the second method, the feature vectors obtained from the user's voice are converted by the conversion unit 5 using each of the conversion matrices stored in the conversion matrix storage unit 13 (described later), and the matching unit 6 performs the matching process using each of the resulting conversion feature vectors. The result with the highest score can then be recognized as the correct speech recognition result, and the sequence of acoustic models corresponding to that speech recognition result can be recognized as the sequence of acoustic models corresponding to the user's speech.
- In the present embodiment, the registration unit 10 recognizes the scores by monitoring the internal state of the matching unit 6 and, for the feature vector sequence of the speech stored in the buffer 4, recognizes the corresponding (highest-scoring) sequence of acoustic models.
- Strictly speaking, the registration unit 10 needs to recognize the mean vectors of the HMMs serving as the acoustic models (the mean vectors defining the Gaussian distributions used to calculate the probability that a feature vector is observed from a state of the HMM); the registration unit 10 recognizes these mean vectors by referring to the acoustic model storage unit 7.
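The following is a minimal sketch of how a piece of adaptation data might be represented and registered: a feature vector sequence paired with the mean vector sequence it was recognized against. The class and field names are illustrative assumptions, not taken from the patent.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class AdaptationData:
    features: np.ndarray       # (T, D): feature vectors of one utterance
    means: np.ndarray          # (T, D): mean vectors of the recognized models
    assigned_matrix: int = 0   # index of the transformation matrix it is assigned to

def register(database, features, means):
    """Pair the utterance's feature vectors with the aligned mean vectors
    and store them as one piece of adaptation data."""
    database.append(AdaptationData(features=features, means=means))
```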
- The adaptation database 11 stores the adaptation data supplied from the registration unit 10 together with assignment information indicating to which of the transformation matrices in the transformation matrix storage unit 13 the adaptation data is assigned.
- This assignment information is supplied from the model adaptation unit 12 to the adaptation database 11.
- Using the adaptation data stored in the adaptation database 11, the model adaptation unit 12 performs updating, generation, deletion, and other operations on the transformation matrices used for the model adaptation that adapts the speech feature vectors to the acoustic models stored in the acoustic model storage unit 7.
- Further, when new adaptation data is stored in the adaptation database 11, the model adaptation unit 12 recognizes to which of the transformation matrices stored in the transformation matrix storage unit 13 that adaptation data should be assigned, and generates assignment information representing that assignment. The model adaptation unit 12 then supplies the assignment information to the adaptation database 11, where it is stored in association with the corresponding adaptation data. Consequently, in the speech recognition device of FIG. 1 (and likewise in the device of FIG. 8 described later), every piece of adaptation data stored in the adaptation database 11 is assigned to one of the transformation matrices stored in the transformation matrix storage unit 13, and this assignment classifies the adaptation data into several classes (the classes specified by the transformation matrices).
- The transformation matrix storage unit 13 stores one or more transformation matrices.
- In the initial state, the transformation matrix storage unit 13 stores, for example, only one transformation matrix.
- As this single initial transformation matrix, for example an identity (unit) matrix can be adopted, as in a conventional model-adaptive speech recognition device.
- The selection unit 14 monitors the internal state of the matching unit 6 and, based on the monitoring result, selects from among the one or more transformation matrices stored in the transformation matrix storage unit 13 the one to be used to convert the feature vectors stored in the buffer 4, and supplies it to the conversion unit 5.
- Next, the speech recognition processing performed by the speech recognition device of FIG. 1 will be described with reference to the flowchart of FIG. 2.
- The user's voice input to the microphone 1 is supplied to the feature extraction unit 3 as digital voice data via the A/D converter 2, and the feature extraction unit 3 performs acoustic analysis on the supplied voice data for each predetermined frame to extract feature vectors.
- The feature vectors obtained for each frame in the feature extraction unit 3 are sequentially supplied to the buffer 4 and stored. The extraction of feature vectors by the feature extraction unit 3 and their storage in the buffer 4 continue until one utterance (voice section) of the user ends.
- The detection of the voice section is performed by, for example, a known method.
- When the feature vectors for one utterance have been stored in the buffer 4, the selection unit 14 proceeds to step S1, selects all the transformation matrices stored in the transformation matrix storage unit 13, supplies them to the conversion unit 5, and proceeds to step S2.
- A transformation matrix selected by the selection unit 14 is hereinafter referred to as a selected transformation matrix, as appropriate.
- In step S2, the conversion unit 5 reads the time-series feature vectors from the buffer 4, converts them by each of the selected transformation matrices supplied from the selection unit 14, and starts supplying the conversion feature vectors obtained by the conversion to the matching unit 6.
- That is, the conversion unit 5 converts the feature vectors stored in the buffer 4 using each of the transformation matrices, and supplies each resulting sequence of conversion feature vectors to the matching unit 6.
- In step S2, therefore, the supply to the matching unit 6 of the feature vector sequences converted by each of the one or more transformation matrices stored in the transformation matrix storage unit 13 is started.
- Thereafter, in step S3, the matching unit 6 refers to the acoustic model storage unit 7, the dictionary storage unit 8, and the grammar storage unit 9 as necessary using the feature vector sequences supplied to it, and performs the matching process of calculating scores based on the continuous-distribution HMM method or the like while pruning hypotheses by the beam search method.
- Here, the matching unit 6 performs the matching process on each of the feature vector sequences converted by each of the one or more transformation matrices stored in the transformation matrix storage unit 13.
- Then, in step S4, the matching unit 6 determines, for each of the feature vector sequences converted by the one or more transformation matrices stored in the transformation matrix storage unit 13, whether hypotheses covering a predetermined time from the start of the voice section have been obtained.
- If it is determined in step S4 that hypotheses for the predetermined time from the start of the voice section have not yet been obtained, the process returns to step S3, and the matching unit 6 continues the matching process using the feature vector sequences supplied from the conversion unit 5.
- If it is determined in step S4 that hypotheses for the predetermined time from the start of the voice section have been obtained, that is, when the matching unit 6 has obtained such hypotheses for each of the feature vector sequences converted by each of the one or more transformation matrices stored in the transformation matrix storage unit 13, the process proceeds to step S5, where the selection unit 14 selects, from among the hypotheses obtained for the feature vector sequences converted by the one or more transformation matrices, the one with the highest score.
- Further, in step S5, the selection unit 14 detects the transformation matrix used to convert the feature vector sequence from which the highest-scoring hypothesis was obtained, and proceeds to step S6.
- The transformation matrix detected in this way (hereinafter referred to as the detected transformation matrix, as appropriate) gives, for (the feature vectors of) the user's voice currently being input, the highest score obtainable from the acoustic models stored in the acoustic model storage unit 7; it can therefore be regarded as the matrix that best adapts the user's voice to the acoustic models, that is, the optimal transformation matrix for that voice.
- In step S6, the selection unit 14 selects the detected transformation matrix (the optimal transformation matrix) found in step S5 from among the transformation matrices stored in the transformation matrix storage unit 13, supplies it to the conversion unit 5 as the selected transformation matrix, and proceeds to step S7.
- In step S7, the conversion unit 5 converts the feature vectors read from the buffer 4 by the selected transformation matrix supplied from the selection unit 14, and starts supplying the conversion feature vectors obtained by the conversion to the matching unit 6.
- As a result, the supply to the matching unit 6 of the feature vector sequence converted by the transformation matrix that most appropriately adapts the currently input voice to the acoustic models (hereinafter referred to as the optimal transformation matrix, as appropriate) is started.
- In step S8, the matching unit 6 continues the matching process using the feature vector sequence supplied to it. That is, the matching unit 6 continues the matching process using the feature vector sequence converted by the matrix that, among the transformation matrices stored in the transformation matrix storage unit 13, is optimal for the currently input speech, and thereby calculates the score obtained using the feature vector sequence converted by the optimal transformation matrix.
- Note that the matching unit 6 deletes the scores and hypotheses obtained in the loop processing of steps S3 and S4 using feature vectors converted by transformation matrices other than the optimal one.
- When the calculation of the score up to the end of the voice section is completed, the matching unit 6 proceeds to step S9, detects the hypothesis with the highest score among the remaining hypotheses, outputs it as the speech recognition result, and proceeds to step S10.
- In step S10, an adaptation data registration process for registering (storing) new adaptation data in the adaptation database 11 is performed, and the speech recognition processing ends.
- Next, the adaptation data registration process performed in step S10 of FIG. 2 will be described with reference to the flowchart of FIG. 3.
- First, in step S21, the registration unit 10 refers to the internal state of the matching unit 6 to recognize, for the feature vector sequence of one utterance stored in the buffer 4, the corresponding sequence of acoustic models (the sequence of acoustic models constituting the speech recognition result of that utterance).
- Further, the registration unit 10 recognizes, by referring to the acoustic model storage unit 7, the mean vectors defining the Gaussian distributions of the acoustic models in the recognized sequence, and constructs adaptation data by associating the sequence of mean vectors corresponding to the sequence of acoustic models with the feature vector sequence stored in the buffer 4.
- Then, in step S22, the registration unit 10 supplies the adaptation data to the adaptation database 11 for storage, and proceeds to step S23.
- In step S23, the registration unit 10 clears the buffer 4 by deleting the feature vector sequence for one utterance stored in the buffer 4, and proceeds to step S24.
- In step S24, the model adaptation unit 12 takes the new adaptation data stored in the adaptation database 11 in the immediately preceding step S22 as the adaptation data of interest, and detects, from among the transformation matrices stored in the transformation matrix storage unit 13, the transformation matrix (the optimal transformation matrix) that converts the feature vector sequence in the adaptation data of interest into the vector sequence closest to the sequence of mean vectors associated with that feature vector sequence.
- That is, the model adaptation unit 12 converts the feature vector sequence in the adaptation data of interest using each transformation matrix stored in the transformation matrix storage unit 13, obtaining a conversion feature vector sequence for each. Further, the model adaptation unit 12 calculates, for example, the total sum of the distances between each conversion feature vector of a conversion feature vector sequence and the corresponding mean vector of the sequence of mean vectors in the adaptation data of interest, and takes this sum as the error between the conversion feature vector sequence and the sequence of mean vectors. The model adaptation unit 12 obtains this error for each of the conversion feature vector sequences produced by the transformation matrices stored in the transformation matrix storage unit 13, and detects, as the optimal transformation matrix, the transformation matrix used to obtain the conversion feature vectors that minimize the error.
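The following is a minimal sketch of this error computation and optimal-matrix detection, reusing the AdaptationData layout from the earlier sketch; the function names are illustrative.

```python
import numpy as np

def error(matrix, features, means):
    """Sum of distances between converted feature vectors and mean vectors."""
    converted = features @ matrix.T  # (T, D) conversion feature vectors
    return np.linalg.norm(converted - means, axis=1).sum()

def optimal_matrix(matrices, features, means):
    """Index of the transformation matrix giving the smallest error."""
    return int(np.argmin([error(M, features, means) for M in matrices]))
```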
- In step S25, the model adaptation unit 12 assigns the adaptation data of interest to the optimal transformation matrix. That is, the model adaptation unit 12 uses information representing the optimal transformation matrix as the assignment information described above, supplies that assignment information to the adaptation database 11, and stores it in association with the adaptation data of interest.
- Thereafter, the model adaptation unit 12 performs a transformation matrix update process for updating the transformation matrices stored in the transformation matrix storage unit 13 using the adaptation data stored in the adaptation database 11, and the adaptation data registration process ends.
- Next, the transformation matrix update process will be described with reference to the flowchart of FIG. 4. First, in step S31, the model adaptation unit 12 takes, from among the transformation matrices stored in the transformation matrix storage unit 13, the transformation matrix to which the adaptation data of interest has been assigned as the transformation matrix of interest, and proceeds to step S32.
- In step S32, the model adaptation unit 12 updates the transformation matrix of interest using all the adaptation data assigned to it.
- That is, the model adaptation unit 12 determines, for example by the least squares method (linear regression), the matrix that linearly transforms the feature vector sequences in the pieces of adaptation data assigned to the transformation matrix of interest such that the error between the linearly transformed feature vector sequences and the mean vectors associated with them is minimized. The model adaptation unit 12 then updates the transformation matrix of interest with this matrix (replaces the transformation matrix of interest with this matrix), and supplies the updated transformation matrix of interest to the transformation matrix storage unit 13, where it is stored by overwriting the pre-update transformation matrix of interest.
- The method of updating the transformation matrix of interest in step S32 is basically the same as the model adaptation performed in a conventional model-adaptive speech recognition device.
- However, the update of the transformation matrix of interest in step S32 uses only the adaptation data assigned to that matrix; this differs from the conventional method, which performs model adaptation using, for example, all the voices input for model adaptation. That is, in the conventional model adaptation method, there is no concept of assigning adaptation data to transformation matrices.
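The following is a minimal sketch of this per-matrix update, reusing estimate_transform and the AdaptationData layout from the earlier sketches; names are illustrative.

```python
import numpy as np

def update_matrix(matrices, database, target):
    """Re-fit matrix `target` using only the adaptation data assigned to it."""
    assigned = [d for d in database if d.assigned_matrix == target]
    if not assigned:
        return
    feats = np.vstack([d.features for d in assigned])
    means = np.vstack([d.means for d in assigned])
    matrices[target] = estimate_transform(feats, means)  # least-squares fit
```

In contrast, a conventional adapter would pool every utterance into a single fit; here each matrix is trained only on the class of utterances assigned to it.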
- Note that which adaptation data is assigned to the transformation matrix of interest is recognized by referring to the assignment information stored in the adaptation database 11.
- After the transformation matrix of interest is updated in step S32, the process proceeds to step S33, in which an assignment update process is performed to update the assignment of adaptation data to each transformation matrix stored in the transformation matrix storage unit 13.
- That is, when the transformation matrix of interest is updated in step S32, the transformation matrix currently assigned to a given piece of adaptation data is no longer necessarily its optimal transformation matrix: for some adaptation data assigned to other transformation matrices, the updated transformation matrix of interest may now be the optimal one, and conversely, for some adaptation data assigned to the updated transformation matrix of interest, another transformation matrix may now be the optimal one. Therefore, in the assignment update process of step S33, it is confirmed for each piece of adaptation data stored in the adaptation database 11 whether the transformation matrix currently assigned to it is the optimal one; if it is not, the adaptation data is reassigned to its optimal transformation matrix.
- Specifically, the assignment update process consists of the processing of steps S41 to S48.
- First, the model adaptation unit 12 sets the variables I and J to the number of transformation matrices stored in the transformation matrix storage unit 13 and the number of pieces of adaptation data stored in the adaptation database 11, respectively, and initializes to 1 the variable i for counting the transformation matrices and the variable j for counting the adaptation data.
- In step S42, the model adaptation unit 12 converts the feature vector sequence in the adaptation data #j, the j-th piece of adaptation data stored in the adaptation database 11, using the transformation matrix Mi, the i-th transformation matrix stored in the transformation matrix storage unit 13, and proceeds to step S43.
- In step S43, the model adaptation unit 12 obtains the error ε(i, j) between the conversion feature vector sequence obtained by converting the adaptation data #j with the transformation matrix Mi and the sequence of mean vectors in the adaptation data #j, in the same manner as described for step S24 in FIG. 3.
- In step S44, the model adaptation unit 12 determines whether the variable i is equal to I, the total number of transformation matrices. If it is determined in step S44 that the variable i is not equal to I, the process proceeds to step S45, where the model adaptation unit 12 increments the variable i by 1, returns to step S42, and thereafter repeats the same processing.
- If it is determined in step S44 that the variable i is equal to I, the process proceeds to step S46, where the model adaptation unit 12 determines whether the variable j is equal to J, the total number of pieces of adaptation data. If it is determined in step S46 that the variable j is not equal to J, the process proceeds to step S47, where the model adaptation unit 12 increments the variable j by 1, initializes the variable i to 1, returns to step S42, and thereafter repeats the same processing.
- If it is determined in step S46 that the variable j is equal to J, that is, when the error ε(i, j) has been obtained for every combination of the adaptation data stored in the adaptation database 11 and the transformation matrices stored in the transformation matrix storage unit 13, the process proceeds to step S48, where the model adaptation unit 12 reassigns each piece of adaptation data #j to the transformation matrix Mi that minimizes its error ε(i, j). That is, the model adaptation unit 12 stores (overwrites), as the assignment information associated with the adaptation data #j in the adaptation database 11, information representing the transformation matrix Mi that minimizes the error ε(i, j).
- When adaptation data #j is assigned to the transformation matrix Mi, the error ε(i, j) between the conversion feature vector sequence obtained by converting the feature vector sequence in the adaptation data #j with the transformation matrix Mi and the sequence of mean vectors in the adaptation data #j is hereinafter referred to as the error of that adaptation data, as appropriate.
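The following is a minimal sketch of the assignment update of steps S41 to S48, reusing error() and the AdaptationData layout from the earlier sketches; the reassignment sweep is analogous to the assignment step of k-means clustering, and all names are illustrative.

```python
import numpy as np

def update_assignments(matrices, database):
    """Reassign each piece of adaptation data to its minimum-error matrix.

    Returns the set of matrix indices whose assigned data changed; those
    matrices become new matrices of interest and are re-updated (step S35).
    """
    changed = set()
    for d in database:
        errors = [error(M, d.features, d.means) for M in matrices]
        best = int(np.argmin(errors))
        if best != d.assigned_matrix:
            changed.update({best, d.assigned_matrix})
            d.assigned_matrix = best
    return changed
```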
- After the assignment update process in step S33, the process proceeds to step S34, where the model adaptation unit 12 determines whether there is any transformation matrix whose assigned adaptation data has changed as a result of the assignment update process.
- If it is determined in step S34 that there is a transformation matrix whose assigned adaptation data has changed, the process proceeds to step S35, where the model adaptation unit 12 takes the transformation matrix whose assignment of adaptation data has changed as the transformation matrix of interest, returns to step S32, and thereafter repeats the same processing.
- That is, when there is a transformation matrix whose assignment of adaptation data has changed, that transformation matrix is set as the transformation matrix of interest in step S35; returning to step S32, the transformation matrix of interest is updated using the adaptation data assigned to it, and the assignment update process of step S33 is then repeated.
- When there are a plurality of transformation matrices whose assignments have changed, each of them is regarded as a transformation matrix of interest in step S35, and in step S32 each transformation matrix of interest is updated using the adaptation data assigned to it.
- On the other hand, if it is determined in step S34 that there is no transformation matrix whose assignment of adaptation data has changed, that is, when every piece of adaptation data in the adaptation database 11 has been assigned to its optimal transformation matrix, the process proceeds to step S36, where the model adaptation unit 12 performs the transformation matrix generation/deletion process and then ends the transformation matrix update process.
- Next, the transformation matrix generation/deletion process will be described with reference to the flowchart of FIG. 5. First, in step S51, the model adaptation unit 12 determines whether, among the transformation matrices stored in the transformation matrix storage unit 13, there is a transformation matrix satisfying a predetermined, preset generation condition for generating a new transformation matrix.
- As the generation condition, it is possible to adopt, for example, the condition that a number of pieces of adaptation data equal to or larger than a predetermined threshold (or larger than the threshold) is assigned to the transformation matrix.
- Other possible generation conditions include, for example, that the average value of the errors of the adaptation data assigned to the transformation matrix is equal to or larger than (or larger than) a predetermined threshold, or that the number of pieces of adaptation data whose error is at or above a threshold is itself equal to or larger than a predetermined value. In short, as the generation condition, any condition can be adopted that indicates a situation in which it has become difficult for a transformation matrix to accurately convert the feature vectors in all the adaptation data assigned to it into the mean vectors associated with them.
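The following is a minimal sketch of such a generation-condition check, reusing error() and the AdaptationData layout from the earlier sketches; the threshold values are illustrative assumptions.

```python
import numpy as np

def satisfies_generation_condition(matrices, database, target,
                                   max_count=50, max_avg_error=100.0):
    """True if matrix `target` should be split into two new matrices."""
    assigned = [d for d in database if d.assigned_matrix == target]
    if len(assigned) >= max_count:  # too much adaptation data on one matrix
        return True
    if assigned:
        avg = np.mean([error(matrices[target], d.features, d.means)
                       for d in assigned])
        if avg >= max_avg_error:    # the matrix no longer fits its data well
            return True
    return False
```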
- If it is determined in step S51 that none of the transformation matrices stored in the transformation matrix storage unit 13 satisfies the generation condition, steps S52 and S53 are skipped and the process proceeds to step S54.
- If it is determined in step S51 that some of the transformation matrices stored in the transformation matrix storage unit 13 satisfy the generation condition, the process proceeds to step S52, where the model adaptation unit 12 takes the transformation matrices satisfying the generation condition as transformation matrices of interest, and proceeds to step S53.
- In step S53, the model adaptation unit 12 performs the transformation matrix generation process described later, and proceeds to step S54.
- In step S54, the model adaptation unit 12 determines whether, among the transformation matrices stored in the transformation matrix storage unit 13, there is a transformation matrix satisfying a predetermined, preset deletion condition that is to be met when a transformation matrix is deleted.
- the deletion condition for example, it is possible to adopt that only the number of adaptive data equal to or less than a predetermined threshold (the number less than the predetermined threshold) is assigned to the transformation matrix.
- a deletion condition in addition to the fact that only the number of adaptive data equal to or less than a predetermined threshold value is assigned to the transformation matrix, for example, the average value of the error of the adaptive data assigned to the transformation matrix is equal to or less than the predetermined value. It is possible to adopt a condition that is equal to or greater than (greater than) the threshold value.
- As the deletion condition, it is also possible to adopt that the latest date and time at which the transformation matrix was selected in step S6 of the speech recognition processing of FIG. 2, which is stored in the transformation matrix storage unit 13 for each transformation matrix, is more than a predetermined number of days in the past relative to the current date and time. In this case, a transformation matrix that has not been selected in step S6 of the speech recognition processing of FIG. 2 for a long time is deleted.
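A corresponding sketch of the deletion tests, again with illustrative thresholds; the last-selected timestamp corresponds to the date and time recorded for each matrix as described above.

```python
from datetime import datetime, timedelta

def satisfies_deletion_condition(num_assigned, last_selected,
                                 min_count=5, max_idle_days=30, now=None):
    """Sketch of the deletion tests described above.

    num_assigned: count of adaptation data currently assigned to
    the matrix. last_selected: datetime of the most recent time the
    matrix was chosen as optimal in step S6. min_count and
    max_idle_days are illustrative thresholds (assumptions, not
    values specified by the patent).
    """
    now = now or datetime.now()
    too_few = num_assigned <= min_count
    too_stale = (now - last_selected) > timedelta(days=max_idle_days)
    return too_few or too_stale
```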
- When it is determined in step S54 that none of the transformation matrices stored in the transformation matrix storage unit 13 satisfy the deletion condition, steps S55 and S56 are skipped and the transformation matrix generation/deletion process ends.
- When it is determined in step S54 that some of the transformation matrices stored in the transformation matrix storage unit 13 satisfy the deletion condition, the process proceeds to step S55, where the model adaptation unit 12 takes each transformation matrix that satisfies the deletion condition as a transformation matrix of interest, and proceeds to step S56.
- In step S56, the model adaptation unit 12 performs the transformation matrix deletion process described later, and ends the transformation matrix generation/deletion process.
- In step S61, the model adaptation unit 12 generates first and second matrices based on the transformation matrix of interest.
- That is, in step S52 of FIG. 5, a transformation matrix that satisfies the generation condition is set as the transformation matrix of interest, and in step S61 the transformation matrix of interest is, so to speak, split to generate the first and second matrices.
- When there are a plurality of transformation matrices regarded as transformation matrices of interest in step S52 of FIG. 5, the transformation matrix generation processing of FIG. 6 is performed for each of them, sequentially or in parallel.
- The generation of the first and second matrices based on the transformation matrix of interest in step S61 can be performed, for example, by changing the components of the transformation matrix of interest by a predetermined value.
- Specifically, two matrices can be obtained that map a predetermined vector to positions shifted by a predetermined minute vector Δ in opposite directions relative to the position to which the transformation matrix of interest maps (transforms) that vector, and these two matrices can be used as the first and second matrices. Alternatively, the transformation matrix of interest may be used as the first matrix as it is, and a matrix that maps the predetermined vector to a position shifted by the minute vector Δ from the position to which the transformation matrix of interest maps it can be obtained and used as the second matrix.
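The split can be sketched concretely. The patent leaves the perturbation method open; the sketch below assumes the transform is represented in affine form (A, b), in which case shifting every mapped point by ±Δ amounts to perturbing the offset term. The function name `split_transform` and the value of Δ are illustrative assumptions.

```python
import numpy as np

def split_transform(A, b, delta):
    """Split one affine transform x -> A @ x + b into two transforms
    that map any input to positions shifted by +delta and -delta
    relative to the original mapping (the "minute vector" above)."""
    first = (A.copy(), b + delta)
    second = (A.copy(), b - delta)
    return first, second

# Example with a 2-dimensional transform and a small offset vector.
A = np.eye(2)
b = np.zeros(2)
delta = np.full(2, 1e-2)
(A1, b1), (A2, b2) = split_transform(A, b, delta)
```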
- After the first and second matrices are generated in step S61, the process proceeds to step S62, where the model adaptation unit 12 sets the number of adaptation data allocated to the transformation matrix of interest in a variable K, initializes to 1 a variable k for counting the adaptation data, and proceeds to step S63.
- In step S63, the model adaptation unit 12 transforms the feature vector sequence in the adaptation data #k, the k-th adaptation data assigned to the transformation matrix of interest, using each of the first and second matrices, thereby obtaining two transformed feature vector sequences.
- Here, the transformed feature vector sequences obtained by transforming the feature vector sequence with the first matrix and with the second matrix are referred to as the first transformed feature vector sequence and the second transformed feature vector sequence, respectively.
- In step S64, the model adaptation unit 12 computes the error between the first transformed feature vector sequence and the average vector sequence in the adaptation data #k (hereinafter referred to as the first error, as appropriate) and the error between the second transformed feature vector sequence and that average vector sequence (hereinafter referred to as the second error, as appropriate), and proceeds to step S65.
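The error used in steps S64 and S65 can be sketched as follows, assuming frame-aligned feature and average vector sequences; the Euclidean distance is an assumption, since the patent does not fix the distance measure.

```python
import numpy as np

def sequence_error(features, means, A, b):
    """Total error between a transformed feature vector sequence and
    the average vector sequence associated with it.

    features, means: arrays of shape (T, D), aligned frame by frame.
    Applying this with the first and the second matrix yields the
    first and second errors, respectively.
    """
    transformed = features @ A.T + b              # shape (T, D)
    return float(np.sum(np.linalg.norm(transformed - means, axis=1)))
```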
- In step S65, the model adaptation unit 12 determines whether the first error is smaller than (or equal to or smaller than) the second error.
- When it is determined in step S65 that the first error is smaller than the second error, that is, when, comparing the first and second matrices, the first matrix can more appropriately adapt the feature vector sequence in the adaptation data #k to the corresponding acoustic model, the process proceeds to step S66, where the model adaptation unit 12 assigns the adaptation data #k to the first matrix, and proceeds to step S68.
- When it is determined in step S65 that the first error is not smaller than the second error, that is, when, comparing the first and second matrices, the second matrix can more appropriately adapt the feature vector sequence in the adaptation data #k to the corresponding acoustic model, the process proceeds to step S67, where the model adaptation unit 12 assigns the adaptation data #k to the second matrix, and proceeds to step S68.
- In step S68, the model adaptation unit 12 determines whether or not the variable k is equal to the total number K of adaptation data allocated to the transformation matrix of interest.
- When it is determined in step S68 that the variable k is not equal to K, the process proceeds to step S69, where the model adaptation unit 12 increments the variable k by 1 and returns to step S63, after which the same processing is repeated.
- When it is determined in step S68 that the variable k is equal to K, that is, when each of the adaptation data assigned to the transformation matrix of interest has been assigned to the more appropriate one of the first and second matrices (the one that transforms its feature vectors closer to the corresponding average vectors), the process proceeds to step S70, where the model adaptation unit 12 deletes the transformation matrix of interest from the transformation matrix storage unit 13 and stores the first and second matrices in the transformation matrix storage unit 13 as new transformation matrices.
- As a result, the transformation matrix of interest is deleted and two new transformation matrices are added, so the number of transformation matrices is substantially increased by one (a matrix is generated).
- In step S71, the model adaptation unit 12 takes the two new transformation matrices as transformation matrices of interest and proceeds to step S72.
- In step S72, the model adaptation unit 12 updates each transformation matrix of interest using all the adaptation data assigned to it, as in step S32 of FIG. 4.
- Here, the two transformation matrices newly stored in the transformation matrix storage unit 13 are the transformation matrices of interest, and therefore each of the two is updated using the adaptation data assigned to it.
- the process proceeds to step S73, where the model adaptation unit 12 performs the same assignment updating process as in step S33 of FIG. 4, and proceeds to step S74.
- In step S74, the model adaptation unit 12 determines whether there is a transformation matrix whose assigned adaptation data has changed as a result of the assignment update process in step S73.
- When it is determined in step S74 that there is a transformation matrix whose assigned adaptation data has changed, the process proceeds to step S75, where the model adaptation unit 12 takes that transformation matrix as a new transformation matrix of interest, returns to step S72, and repeats the same processing.
- That is, when there is a transformation matrix whose allocation of adaptation data has changed, that transformation matrix is set in step S75 as the transformation matrix of interest; the process then returns to step S72, where the transformation matrix of interest is updated using the adaptation data assigned to it, and the assignment update process of step S73 is performed again.
- When there are a plurality of transformation matrices whose allocation has changed, each of them is regarded in step S75 as a transformation matrix of interest, and in step S72 each transformation matrix of interest is updated using the adaptation data assigned to it.
- When it is determined in step S74 that there is no transformation matrix whose allocation of adaptation data has changed, that is, when all the adaptation data in the adaptation database 11 have been allocated to their optimal transformation matrices, the transformation matrix generation processing ends.
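The convergence loop of steps S72 to S75 (and, identically, steps S32 to S35 of FIG. 4) is structurally the same alternation as k-means clustering: assign each adaptation data to its best transform, re-estimate each transform from its assigned data, and stop when no assignment changes. A minimal sketch, with the matrix estimation and the error measure abstracted behind caller-supplied functions; `update_fn` and `error_fn` are illustrative names, not from the patent.

```python
def refine_transforms(transforms, data, update_fn, error_fn, max_iters=20):
    """Alternate between assigning adaptation data to its best
    transform and re-estimating each transform from its data, until
    no assignment changes.

    transforms: list of transform parameters.
    data: list of (features, means) adaptation pairs.
    update_fn(pairs) -> new transform, e.g. a least-squares solve.
    error_fn(transform, pair) -> scalar error.
    """
    def best(pair):
        return min(range(len(transforms)),
                   key=lambda i: error_fn(transforms[i], pair))

    assign = [best(pair) for pair in data]
    for _ in range(max_iters):
        # Update every transform from the data assigned to it.
        for i in range(len(transforms)):
            pairs = [d for d, a in zip(data, assign) if a == i]
            if pairs:
                transforms[i] = update_fn(pairs)
        # Re-assign each adaptation pair to its optimal transform.
        new_assign = [best(pair) for pair in data]
        if new_assign == assign:       # no allocation changed: done
            break
        assign = new_assign
    return transforms, assign
```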
- In step S81, the model adaptation unit 12 deletes the transformation matrix of interest from the transformation matrix storage unit 13.
- That is, in step S55 of FIG. 5, a transformation matrix that satisfies the deletion condition is set as the transformation matrix of interest, and in step S81 that transformation matrix of interest is deleted from the transformation matrix storage unit 13.
- After the transformation matrix of interest is deleted in step S81, the process proceeds to step S82, where the model adaptation unit 12 sets the number of adaptation data allocated to the transformation matrix of interest in a variable K, initializes to 1 a variable k for counting the adaptation data, and proceeds to step S83.
- When there are a plurality of transformation matrices of interest, the total number of adaptation data assigned to all of them is set in the variable K in step S82.
- In step S83, the model adaptation unit 12 detects, from the transformation matrices stored in the transformation matrix storage unit 13, the transformation matrix that converts the feature vector sequence in the adaptation data #k, the k-th adaptation data, into the vector sequence closest to the associated average vector sequence, that is, the optimal transformation matrix, in the same manner as in step S24 of FIG. 3, and proceeds to step S84.
- In step S84, the model adaptation unit 12 assigns (reassigns) the adaptation data #k to the transformation matrix (optimal transformation matrix) detected in step S83, and proceeds to step S85.
- In step S85, the model adaptation unit 12 determines whether or not the variable k is equal to the total number K of adaptation data allocated to the transformation matrix of interest deleted in step S81.
- When it is determined in step S85 that the variable k is not equal to K, the process proceeds to step S86, where the model adaptation unit 12 increments the variable k by 1 and returns to step S83, after which the same processing is repeated. When it is determined in step S85 that the variable k is equal to K, that is, when all the adaptation data assigned to the transformation matrix of interest deleted in step S81 have been reassigned to one of the transformation matrices stored in the transformation matrix storage unit 13, the process proceeds to step S87, where the model adaptation unit 12 takes every transformation matrix to which any of that adaptation data has been newly assigned as a transformation matrix of interest, and proceeds to step S88.
- In step S88, the model adaptation unit 12 updates each transformation matrix of interest using all the adaptation data assigned to it, as in step S32 of FIG. 4.
- That is, when there are a plurality of transformation matrices of interest, each of them is updated using the adaptation data assigned to it.
- In step S89, the model adaptation unit 12 performs the same assignment update process as in step S33 of FIG. 4, and then proceeds to step S90.
- In step S90, the model adaptation unit 12 determines whether there is a transformation matrix whose assigned adaptation data has changed as a result of the assignment update process in step S89.
- When it is determined in step S90 that there is a transformation matrix whose assigned adaptation data has changed, the process proceeds to step S91, where the model adaptation unit 12 takes that transformation matrix as a new transformation matrix of interest, returns to step S88, and repeats the same processing.
- That is, when there is a transformation matrix whose allocation of adaptation data has changed, that transformation matrix is set in step S91 as the transformation matrix of interest; the process then returns to step S88, where the transformation matrix of interest is updated using the adaptation data assigned to it, and the assignment update process of step S89 is performed again.
- When there are a plurality of such transformation matrices, each of them is regarded in step S91 as a transformation matrix of interest, and each transformation matrix of interest is updated using the adaptation data assigned to it.
- When it is determined in step S90 that there is no transformation matrix whose allocation of adaptation data has changed, that is, when all the adaptation data in the adaptation database 11 have been allocated to their optimal transformation matrices, the transformation matrix deletion processing ends.
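The reassignment performed by steps S81 to S87 can be sketched as follows; the dictionary-based bookkeeping and the function name are illustrative assumptions.

```python
def delete_transform(transforms, assignments, doomed, error_fn, data):
    """Delete one transform and reassign its adaptation data to the
    best remaining transform, mirroring steps S81 to S87.

    transforms: dict mapping transform id -> parameters.
    assignments: dict mapping data index -> transform id.
    error_fn(transform, pair) -> scalar error.
    """
    del transforms[doomed]
    touched = set()
    for k, tid in list(assignments.items()):
        if tid == doomed:
            best = min(transforms,
                       key=lambda i: error_fn(transforms[i], data[k]))
            assignments[k] = best
            touched.add(best)
    # Matrices that gained data become the matrices of interest and
    # are subsequently updated and re-checked (steps S88 onward).
    return touched
```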
- As described above, adaptation data including the feature vectors of the user's voice is registered by the adaptation data registration process of FIG. 3, and the adaptation data is assigned to the optimal transformation matrix among the one or more transformation matrices stored in the transformation matrix storage unit 13. The transformation matrix to which adaptation data is newly assigned is updated using the adaptation data assigned to it by the transformation matrix update process of FIG. 4, and the assignments are further revised so that each adaptation data stored in the adaptation database 11 is assigned to its optimal transformation matrix.
- That is, the adaptation data are classified (clustered) according to the transformation matrix that is optimal for adapting their feature vector sequences to the corresponding acoustic models, and the transformation matrix corresponding to each class is updated using the adaptation data of that class. The speech input by the user is thus, so to speak, automatically classified, and each transformation matrix is updated so as to adapt the speech of its class appropriately to the corresponding acoustic models; as a result, by performing model adaptation with such transformation matrices, the speech recognition accuracy can be improved.
- Further, the classification of the voice input by the user is performed from the viewpoint of which transformation matrix is optimal for that voice, so the user does not need to specify the class into which his or her own voice should be classified. This means, for example, that the voice of the same user may be classified into a different class (assigned to a different transformation matrix) when the environment in which the speech recognition apparatus is used differs; even in that case, for the speech classified into each class, the transformation matrix corresponding to that class is the optimal transformation matrix, and therefore the speech can be optimally adapted to the corresponding acoustic model by that optimal transformation matrix.
- Further, when a transformation matrix satisfying the generation condition exists, a new transformation matrix is generated and is updated using the adaptation data for which it is the optimal transformation matrix. Therefore, even when the speech recognition apparatus is used in an environment significantly different from past environments, or when speech is input by a user whose voice characteristics differ significantly from those of previous users, a large degradation of the speech recognition accuracy can be prevented.
- That is, in such cases, the transformation matrices already stored in the transformation matrix storage unit 13 may be unable to adapt the input speech sufficiently to the corresponding acoustic models, and the speech recognition accuracy may be degraded. With the transformation matrix generation processing of FIG. 6, however, a new transformation matrix is generated and is updated using the speech input under the significantly different environment or the speech of the user with significantly different characteristics; as a result, the degradation of speech recognition accuracy caused by a change of user or environment, which occurs in conventional model-adaptive speech recognition apparatuses, can be prevented.
- Moreover, in the transformation matrix generation processing of FIG. 6, first and second matrices that, so to speak, divide the allocation of the adaptation data are generated as new transformation matrices, and each adaptation data is reassigned to whichever of them maps (converts) its feature vector sequence closer to the corresponding average vector sequence. A transformation matrix that adapts the speech to the corresponding acoustic model is thus generated dynamically, so to speak, without the user's knowledge, and the user therefore does not need to be aware of model adaptation.
- Further, in the transformation matrix deletion processing of FIG. 7, a transformation matrix is deleted, for example, when the number of adaptation data allocated to it decreases; this prevents the number of transformation matrices stored in the transformation matrix storage unit 13 from becoming excessive and the processing amount from increasing accordingly.
- Further, in the speech recognition processing, the feature vector sequence for a predetermined initial time of the input speech is transformed by each of the one or more transformation matrices stored in the transformation matrix storage unit 13, the matching process is performed using each transformed feature vector sequence, and the subsequent matching process is continued using the transformation matrix that yields the highest likelihood. The input speech is thus converted by the optimal transformation matrix (in this embodiment, the transformation matrix that adapts the feature vector sequence of the input speech to the acoustic models corresponding to that speech) and adapted to the corresponding acoustic models.
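The selection of step S6 can be sketched as follows, assuming affine transforms and a caller-supplied scoring function standing in for the HMM matching process; `score_fn` is an assumed interface, not part of the patent.

```python
import numpy as np

def choose_transform(initial_features, transforms, score_fn):
    """Return the id of the transform whose conversion of the first
    portion of the utterance gives the highest matching likelihood.

    initial_features: (T, D) array for the initial time span.
    transforms: dict mapping transform id -> (A, b).
    score_fn(features) -> log-likelihood from the matcher.
    """
    best_id, best_score = None, -np.inf
    for tid, (A, b) in transforms.items():
        score = score_fn(initial_features @ A.T + b)
        if score > best_score:
            best_id, best_score = tid, score
    return best_id
```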
- Therefore, even when the speech recognition apparatus is used by multiple users or in multiple environments, the voice of each user, or the voice input under each environment, is immediately adapted to the corresponding acoustic models, and high-accuracy speech recognition can be performed without making the user aware of model adaptation. That is, in a conventional model-adaptive speech recognition apparatus, as described above, once model adaptation has been performed for a specific user or a specific environment and use by another user or in another environment then begins, the speech recognition accuracy is greatly degraded until the acoustic models adapt to the other user or environment, because the acoustic models remain adapted to the first user and the first environment. The speech recognition apparatus of FIG. 1, by contrast, adapts the input speech to the corresponding acoustic models by transforming it with the optimal transformation matrix, and can therefore respond (adapt) immediately to other users and environments.
- In the embodiment described above, model adaptation is performed so that (the feature vectors of) the input speech are adapted to the corresponding acoustic models; in the speech recognition apparatus, however, it is also possible to perform model adaptation in which the acoustic models are adapted to the input speech.
- FIG. 8 shows a configuration example of such a speech recognition device.
- That is, the speech recognition apparatus of FIG. 8 is configured basically in the same manner as that of FIG. 1, except that the conversion unit 5, which performs conversion using the transformation matrix selected by the selection unit 14, is placed not between the buffer 4 and the matching unit 6 but between the matching unit 6 and the acoustic model storage unit 7. Therefore, in the speech recognition apparatus of FIG. 8, it is not the feature vector sequence that is converted by the transformation matrix but the average vectors that define the Gaussian distributions of the acoustic models stored in the acoustic model storage unit 7.
- By converting the average vectors in this way, the matching unit 6 obtains acoustic models adapted to the input speech and performs the matching process using those acoustic models.
- When the acoustic models are adapted to the input speech in this manner, the transformation matrix that converts the average vector sequence in the adaptation data into the sequence closest to the feature vector sequence in that adaptation data is obtained as the optimal transformation matrix. Simply put, therefore, the transformation matrices used in the speech recognition apparatus of FIG. 1 and those used in the speech recognition apparatus of FIG. 8 are in an inverse relationship.
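That inverse relationship can be checked numerically. The sketch below assumes invertible affine transforms: adapting the feature toward a mean with (A, b), and adapting the mean toward the feature with the inverse map, produce matching results. The concrete values are illustrative.

```python
import numpy as np

# Feature-space adaptation (FIG. 1): transform the observation so it
# matches the model mean. Model-space adaptation (FIG. 8): transform
# the mean so it matches the observation. If features are adapted by
# x -> A @ x + b, adapting the means with the inverse map gives the
# same match -- the "inverse relationship" noted above.
A = np.array([[1.1, 0.0], [0.0, 0.9]])
b = np.array([0.2, -0.1])
x = np.array([1.0, 2.0])            # feature vector
mu = A @ x + b                      # a mean that the adapted x matches

adapted_feature = A @ x + b         # FIG. 1 style
A_inv = np.linalg.inv(A)
adapted_mean = A_inv @ (mu - b)     # FIG. 8 style

assert np.allclose(adapted_feature, mu)
assert np.allclose(adapted_mean, x)
```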
- FIG. 9 shows a configuration example of an embodiment of a computer on which a program for executing the above-described series of processes is installed.
- The program can be recorded in advance on the hard disk 105 or in the ROM 103 serving as a recording medium built into the computer.
- Alternatively, the program can be stored (recorded) temporarily or permanently on a removable recording medium 111 such as a flexible disk, a CD-ROM (Compact Disc Read Only Memory), an MO (Magneto-Optical) disc, a DVD (Digital Versatile Disc), a magnetic disk, or a semiconductor memory.
- Such a removable recording medium 111 can be provided as so-called package software.
- Besides being installed on the computer from the removable recording medium 111 as described above, the program can be transferred to the computer wirelessly from a download site via a satellite for digital satellite broadcasting, or transferred to the computer by wire via a network such as a LAN (Local Area Network) or the Internet; the computer can receive the program thus transferred in the communication unit 108 and install it on the built-in hard disk 105.
- The computer has a built-in CPU (Central Processing Unit) 102.
- An input / output interface 110 is connected to the CPU 102 via the bus 101.
- When the user operates the input unit 107, which includes a keyboard, a mouse, a microphone, and the like, a command is input to the CPU 102 via the input/output interface 110, and in response the CPU 102 executes the program stored in the ROM (Read Only Memory) 103.
- Alternatively, the CPU 102 loads into the RAM (Random Access Memory) 104 and executes a program stored on the hard disk 105; a program transferred from a satellite or a network, received by the communication unit 108, and installed on the hard disk 105; or a program read from the removable recording medium 111 mounted in the drive 109 and installed on the hard disk 105. The CPU 102 thereby performs the processing according to the flowcharts described above or the processing performed by the configurations of the block diagrams described above. Then, as necessary, the CPU 102 outputs the processing result from the output unit 106, which includes an LCD (Liquid Crystal Display), a speaker, and the like, via the input/output interface 110, transmits it from the communication unit 108, or records it on the hard disk 105.
- In this specification, the processing steps describing the program that causes the computer to perform the various kinds of processing need not necessarily be executed in chronological order in the sequence described in the flowcharts; they also include processing executed in parallel or individually (for example, parallel processing or object-based processing).
- Further, the program may be processed by a single computer or processed in a distributed manner by a plurality of computers, and may also be transferred to a remote computer and executed there.
- In the embodiments described above, a matrix (transformation matrix) is used for the conversion performed for model adaptation, but any other function can be used. That is, in the embodiments described above, a linear transformation is performed as the transformation for model adaptation, but a non-linear transformation, for example, may be performed instead.
- Further, in the embodiments described above, an HMM is used as the acoustic model and a matching process based on the HMM method is performed to obtain a score representing a likelihood as the result of speech recognition; however, the present invention is not limited to the HMM method.
- Further, in the embodiments described above, the feature vectors are included in the adaptation data and stored in the adaptation database 11, but the adaptation data may instead include, for example, the audio waveform data itself in place of the feature vectors.
- Further, in the embodiments described above, the transformation matrix update processing of FIG. 4 is performed on the input speech after the speech recognition result is output, but the transformation matrix update processing may be performed at any other timing, regularly or irregularly.
- Likewise, in the embodiments described above, the transformation matrix generation/deletion processing of FIG. 5 is performed as part of the transformation matrix update processing of FIG. 4, but the transformation matrix generation/deletion processing may also be performed at any other timing, regularly or irregularly.
- Further, adaptation data can be stored in the adaptation database 11 up to the upper limit of its storage capacity; in that case, it is possible, for example, not to store adaptation data supplied thereafter, or to delete old (past) adaptation data from the adaptation database 11. Furthermore, a plurality of adaptation data whose feature vector sequences are similar and are associated with the same average vector sequence may be searched for, and those adaptation data may be combined into a single adaptation data consisting of that same average vector sequence and an arbitrary one of the similar feature vector sequences.
- Further, in the embodiments described above, speech recognition is performed by the continuous HMM method; however, a discrete HMM method, for example, may also be employed for the speech recognition.
- Further, in the transformation matrix generation processing of FIG. 6, two matrices, the first and second matrices, are generated from a transformation matrix that satisfies the generation condition, but it is also possible to generate three or more matrices.
Industrial applicability
- As described above, according to the present invention, a conversion function that is optimal for adapting one of the input speech and the acoustic model to the other is detected from among one or more conversion functions, the input speech is assigned to that optimal conversion function, and the conversion function is updated using all the input speech assigned to it. Further, a conversion function used to convert one of the input speech and the acoustic model is selected from the one or more conversion functions, and one of the input speech and the acoustic model is converted by the selected conversion function, so that the input speech and the acoustic model can be appropriately adapted to each other.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Machine Translation (AREA)
- Telephonic Communication Services (AREA)
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP02733382A EP1394770A4 (en) | 2001-06-08 | 2002-06-07 | VOICE RECOGNIZING METHOD AND DEVICE |
KR1020037001766A KR100924399B1 (ko) | 2001-06-08 | 2002-06-07 | 음성 인식 장치 및 음성 인식 방법 |
US10/344,031 US7219055B2 (en) | 2001-06-08 | 2002-06-07 | Speech recognition apparatus and method adapting best transformation function to transform one of the input speech and acoustic model |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2001174633A JP2002366187A (ja) | 2001-06-08 | 2001-06-08 | 音声認識装置および音声認識方法、並びにプログラムおよび記録媒体 |
JP2001-174633 | 2001-06-08 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2002101719A1 true WO2002101719A1 (en) | 2002-12-19 |
Family
ID=19015892
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2002/005647 WO2002101719A1 (en) | 2001-06-08 | 2002-06-07 | Voice recognition apparatus and voice recognition method |
Country Status (6)
Country | Link |
---|---|
US (1) | US7219055B2 (ja) |
EP (1) | EP1394770A4 (ja) |
JP (1) | JP2002366187A (ja) |
KR (1) | KR100924399B1 (ja) |
CN (1) | CN1244902C (ja) |
WO (1) | WO2002101719A1 (ja) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8645135B2 (en) | 2008-09-12 | 2014-02-04 | Rosetta Stone, Ltd. | Method for creating a speech model |
Families Citing this family (41)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050246330A1 (en) * | 2004-03-05 | 2005-11-03 | Giang Phan H | System and method for blocking key selection |
US7818172B2 (en) * | 2004-04-20 | 2010-10-19 | France Telecom | Voice recognition method and system based on the contexual modeling of voice units |
JP2006201749A (ja) * | 2004-12-21 | 2006-08-03 | Matsushita Electric Ind Co Ltd | 音声による選択装置、及び選択方法 |
CN1811911B (zh) * | 2005-01-28 | 2010-06-23 | 北京捷通华声语音技术有限公司 | 自适应的语音变换处理方法 |
WO2006128107A2 (en) * | 2005-05-27 | 2006-11-30 | Audience, Inc. | Systems and methods for audio signal analysis and modification |
US7778831B2 (en) * | 2006-02-21 | 2010-08-17 | Sony Computer Entertainment Inc. | Voice recognition with dynamic filter bank adjustment based on speaker categorization determined from runtime pitch |
US8010358B2 (en) * | 2006-02-21 | 2011-08-30 | Sony Computer Entertainment Inc. | Voice recognition with parallel gender and age normalization |
WO2007142102A1 (ja) * | 2006-05-31 | 2007-12-13 | Nec Corporation | 言語モデル学習システム、言語モデル学習方法、および言語モデル学習用プログラム |
US7617103B2 (en) * | 2006-08-25 | 2009-11-10 | Microsoft Corporation | Incrementally regulated discriminative margins in MCE training for speech recognition |
US8423364B2 (en) * | 2007-02-20 | 2013-04-16 | Microsoft Corporation | Generic framework for large-margin MCE training in speech recognition |
TWI319563B (en) * | 2007-05-31 | 2010-01-11 | Cyberon Corp | Method and module for improving personal speech recognition capability |
GB2453366B (en) * | 2007-10-04 | 2011-04-06 | Toshiba Res Europ Ltd | Automatic speech recognition method and apparatus |
US8788256B2 (en) * | 2009-02-17 | 2014-07-22 | Sony Computer Entertainment Inc. | Multiple language voice recognition |
US8442833B2 (en) * | 2009-02-17 | 2013-05-14 | Sony Computer Entertainment Inc. | Speech processing with source location estimation using signals from two or more microphones |
US8442829B2 (en) * | 2009-02-17 | 2013-05-14 | Sony Computer Entertainment Inc. | Automatic computation streaming partition for voice recognition on multiple processors with limited memory |
US9026444B2 (en) | 2009-09-16 | 2015-05-05 | At&T Intellectual Property I, L.P. | System and method for personalization of acoustic models for automatic speech recognition |
US9478216B2 (en) | 2009-12-08 | 2016-10-25 | Nuance Communications, Inc. | Guest speaker robust adapted speech recognition |
CN101923854B (zh) * | 2010-08-31 | 2012-03-28 | 中国科学院计算技术研究所 | 一种交互式语音识别系统和方法 |
US8635067B2 (en) | 2010-12-09 | 2014-01-21 | International Business Machines Corporation | Model restructuring for client and server based automatic speech recognition |
US9224384B2 (en) * | 2012-06-06 | 2015-12-29 | Cypress Semiconductor Corporation | Histogram based pre-pruning scheme for active HMMS |
KR20140028174A (ko) * | 2012-07-13 | 2014-03-10 | 삼성전자주식회사 | 음성 인식 방법 및 이를 적용한 전자 장치 |
CN102862587B (zh) * | 2012-08-20 | 2016-01-27 | 泉州市铁通电子设备有限公司 | 一种铁路车机联控语音分析方法和设备 |
KR101429138B1 (ko) * | 2012-09-25 | 2014-08-11 | 주식회사 금영 | 복수의 사용자를 위한 장치에서의 음성 인식 방법 |
CN113470641B (zh) | 2013-02-07 | 2023-12-15 | 苹果公司 | 数字助理的语音触发器 |
US20140337030A1 (en) * | 2013-05-07 | 2014-11-13 | Qualcomm Incorporated | Adaptive audio frame processing for keyword detection |
US9251784B2 (en) | 2013-10-23 | 2016-02-02 | International Business Machines Corporation | Regularized feature space discrimination adaptation |
JP5777178B2 (ja) * | 2013-11-27 | 2015-09-09 | 国立研究開発法人情報通信研究機構 | 統計的音響モデルの適応方法、統計的音響モデルの適応に適した音響モデルの学習方法、ディープ・ニューラル・ネットワークを構築するためのパラメータを記憶した記憶媒体、及び統計的音響モデルの適応を行なうためのコンピュータプログラム |
US9589560B1 (en) * | 2013-12-19 | 2017-03-07 | Amazon Technologies, Inc. | Estimating false rejection rate in a detection system |
CN103730120A (zh) * | 2013-12-27 | 2014-04-16 | 深圳市亚略特生物识别科技有限公司 | 电子设备的语音控制方法及系统 |
US9697828B1 (en) * | 2014-06-20 | 2017-07-04 | Amazon Technologies, Inc. | Keyword detection modeling using contextual and environmental information |
KR102371697B1 (ko) | 2015-02-11 | 2022-03-08 | 삼성전자주식회사 | 음성 기능 운용 방법 및 이를 지원하는 전자 장치 |
US9578173B2 (en) | 2015-06-05 | 2017-02-21 | Apple Inc. | Virtual assistant aided communication with 3rd party service in a communication session |
US10255907B2 (en) * | 2015-06-07 | 2019-04-09 | Apple Inc. | Automatic accent detection using acoustic models |
KR20170046291A (ko) * | 2015-10-21 | 2017-05-02 | 삼성전자주식회사 | 전자 기기, 그의 음향 모델 적응 방법 및 음성 인식 시스템 |
JP6805037B2 (ja) * | 2017-03-22 | 2020-12-23 | 株式会社東芝 | 話者検索装置、話者検索方法、および話者検索プログラム |
CN107180640B (zh) * | 2017-04-13 | 2020-06-12 | 广东工业大学 | 一种相位相关的高密度叠窗频谱计算方法 |
US10446136B2 (en) * | 2017-05-11 | 2019-10-15 | Ants Technology (Hk) Limited | Accent invariant speech recognition |
DK179560B1 (en) | 2017-05-16 | 2019-02-18 | Apple Inc. | FAR-FIELD EXTENSION FOR DIGITAL ASSISTANT SERVICES |
CN109754784B (zh) | 2017-11-02 | 2021-01-29 | 华为技术有限公司 | 训练滤波模型的方法和语音识别的方法 |
CN110517680B (zh) * | 2018-11-15 | 2023-02-03 | 腾讯科技(深圳)有限公司 | 一种人工智能的数据检测方法及装置、存储介质 |
CN113345428B (zh) * | 2021-06-04 | 2023-08-04 | 北京华捷艾米科技有限公司 | 语音识别模型的匹配方法、装置、设备和存储介质 |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6223159B1 (en) * | 1998-02-25 | 2001-04-24 | Mitsubishi Denki Kabushiki Kaisha | Speaker adaptation device and speech recognition device |
JP2001255886A (ja) * | 2000-03-09 | 2001-09-21 | Matsushita Electric Ind Co Ltd | 音声認識方法および音声認識装置 |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2980382B2 (ja) * | 1990-12-19 | 1999-11-22 | 富士通株式会社 | 話者適応音声認識方法および装置 |
JPH06214596A (ja) * | 1993-01-14 | 1994-08-05 | Ricoh Co Ltd | 音声認識装置および話者適応化方法 |
JPH06324695A (ja) * | 1993-05-13 | 1994-11-25 | Seiko Epson Corp | 音声認識装置 |
JP3216565B2 (ja) * | 1996-08-02 | 2001-10-09 | 日本電信電話株式会社 | 音声モデルの話者適応化方法及びその方法を用いた音声認識方法及びその方法を記録した記録媒体 |
JP3035239B2 (ja) * | 1997-03-10 | 2000-04-24 | 株式会社エイ・ティ・アール音声翻訳通信研究所 | 話者正規化装置、話者適応化装置及び音声認識装置 |
JP3088357B2 (ja) * | 1997-09-08 | 2000-09-18 | 株式会社エイ・ティ・アール音声翻訳通信研究所 | 不特定話者音響モデル生成装置及び音声認識装置 |
US6151573A (en) * | 1997-09-17 | 2000-11-21 | Texas Instruments Incorporated | Source normalization training for HMM modeling of speech |
US6343267B1 (en) * | 1998-04-30 | 2002-01-29 | Matsushita Electric Industrial Co., Ltd. | Dimensionality reduction for speaker normalization and speaker and environment adaptation using eigenvoice techniques |
US6999926B2 (en) * | 2000-11-16 | 2006-02-14 | International Business Machines Corporation | Unsupervised incremental adaptation using maximum likelihood spectral transformation |
US6915259B2 (en) * | 2001-05-24 | 2005-07-05 | Matsushita Electric Industrial Co., Ltd. | Speaker and environment adaptation based on linear separation of variability sources |
US7165028B2 (en) * | 2001-12-12 | 2007-01-16 | Texas Instruments Incorporated | Method of speech recognition resistant to convolutive distortion and additive distortion |
US7072834B2 (en) * | 2002-04-05 | 2006-07-04 | Intel Corporation | Adapting to adverse acoustic environment in speech processing using playback training data |
-
2001
- 2001-06-08 JP JP2001174633A patent/JP2002366187A/ja active Pending
-
2002
- 2002-06-07 WO PCT/JP2002/005647 patent/WO2002101719A1/ja active Application Filing
- 2002-06-07 CN CNB028025784A patent/CN1244902C/zh not_active Expired - Fee Related
- 2002-06-07 KR KR1020037001766A patent/KR100924399B1/ko not_active IP Right Cessation
- 2002-06-07 US US10/344,031 patent/US7219055B2/en not_active Expired - Fee Related
- 2002-06-07 EP EP02733382A patent/EP1394770A4/en not_active Withdrawn
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6223159B1 (en) * | 1998-02-25 | 2001-04-24 | Mitsubishi Denki Kabushiki Kaisha | Speaker adaptation device and speech recognition device |
JP2001255886A (ja) * | 2000-03-09 | 2001-09-21 | Matsushita Electric Ind Co Ltd | 音声認識方法および音声認識装置 |
Non-Patent Citations (2)
Title |
---|
FURUI MATSUI: "Onsei ninshiki no tame no N-best ni motozuku washa tekioka", THE ACOUSTICAL SOCIETY OF JAPAN (ASJ) HEISEI 8 NENDO SHUKI KENKYU HAPPYOKAI KOEN RONBUNSHU, vol. 3-3-16, 25 September 1996 (1996-09-25), pages 117 - 118, XP002954482 * |
KOSAKA, MATSUNAGA, SAGAYAMA: "Ki kozo washa clustering o mochiita washa tekio", THE ACOUSTICAL SOCIETY OF JAPAN (ASJ) HEISEI 5 NENDO SHUKI KENKYU HAPPYOKAI KOEN RONBUNSHU, vol. 2-7-14, 5 October 1993 (1993-10-05), pages 97 - 98, XP002954483 * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8645135B2 (en) | 2008-09-12 | 2014-02-04 | Rosetta Stone, Ltd. | Method for creating a speech model |
Also Published As
Publication number | Publication date |
---|---|
CN1244902C (zh) | 2006-03-08 |
KR100924399B1 (ko) | 2009-10-29 |
EP1394770A1 (en) | 2004-03-03 |
CN1465043A (zh) | 2003-12-31 |
US20040059576A1 (en) | 2004-03-25 |
US7219055B2 (en) | 2007-05-15 |
EP1394770A4 (en) | 2006-06-07 |
JP2002366187A (ja) | 2002-12-20 |
KR20030018073A (ko) | 2003-03-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR100924399B1 (ko) | 음성 인식 장치 및 음성 인식 방법 | |
JP6705008B2 (ja) | 話者照合方法及びシステム | |
US10210862B1 (en) | Lattice decoding and result confirmation using recurrent neural networks | |
US8019602B2 (en) | Automatic speech recognition learning using user corrections | |
JP4195428B2 (ja) | 多数の音声特徴を利用する音声認識 | |
KR100612840B1 (ko) | 모델 변이 기반의 화자 클러스터링 방법, 화자 적응 방법및 이들을 이용한 음성 인식 장치 | |
JP3933750B2 (ja) | 連続密度ヒドンマルコフモデルを用いた音声認識方法及び装置 | |
JP4141495B2 (ja) | 最適化された部分的確率混合共通化を用いる音声認識のための方法および装置 | |
JP4465564B2 (ja) | 音声認識装置および音声認識方法、並びに記録媒体 | |
JP5106371B2 (ja) | 話認認証の検証のための方法および装置、話者認証システム | |
WO2001065541A1 (fr) | Dispositif de reconnaissance de la parole, procede de reconnaissance de la parole et support d'enregistrement | |
CN112349289B (zh) | 一种语音识别方法、装置、设备以及存储介质 | |
JP4515054B2 (ja) | 音声認識の方法および音声信号を復号化する方法 | |
KR101014086B1 (ko) | 음성 처리 장치 및 방법, 및 기록 매체 | |
US20040006469A1 (en) | Apparatus and method for updating lexicon | |
Manasa et al. | Comparison of acoustical models of GMM-HMM based for speech recognition in Hindi using PocketSphinx | |
Zgank et al. | Predicting the acoustic confusability between words for a speech recognition system using Levenshtein distance | |
JP2886118B2 (ja) | 隠れマルコフモデルの学習装置及び音声認識装置 | |
JP4048473B2 (ja) | 音声処理装置および音声処理方法、並びにプログラムおよび記録媒体 | |
KR100586045B1 (ko) | 고유음성 화자적응을 이용한 재귀적 화자적응 음성인식시스템 및 방법 | |
JP5136621B2 (ja) | 情報検索装置及び方法 | |
JP4678464B2 (ja) | 音声認識装置および音声認識方法、並びにプログラムおよび記録媒体 | |
JP3894419B2 (ja) | 音声認識装置、並びにこれらの方法、これらのプログラムを記録したコンピュータ読み取り可能な記録媒体 | |
Wang | Automatic Speech Recognition Model for Swedish Using Kaldi | |
JPH10149190A (ja) | 音声認識方法及び音声認識装置 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): CN KR US |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2002733382 Country of ref document: EP Ref document number: 1020037001766 Country of ref document: KR |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
WWP | Wipo information: published in national office |
Ref document number: 1020037001766 Country of ref document: KR |
|
WWE | Wipo information: entry into national phase |
Ref document number: 028025784 Country of ref document: CN |
|
WWE | Wipo information: entry into national phase |
Ref document number: 10344031 Country of ref document: US |
|
WWP | Wipo information: published in national office |
Ref document number: 2002733382 Country of ref document: EP |