CN101452701B - Confidence estimation method and device based on inverse model

Info

Publication number: CN101452701B (application number CN2007101941394A)
Authority: CN (China)
Other versions: CN101452701A
Original language: Chinese (zh)
Inventor: 何磊
Assignee (current and original): Toshiba Corp
Legal status: Expired - Fee Related
Events: application filed by Toshiba Corp with priority to CN2007101941394A; published as CN101452701A; granted and published as CN101452701B

Classification: Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides a method and apparatus for training phoneme-related inverse models, a method and apparatus for generating weights used in inverse-model-based confidence estimation, an inverse-model-based confidence estimation method and apparatus for speech recognition results, and a speech recognition method and system. The method for training phoneme-related inverse models comprises: recognizing training speech with phonemes as acoustic units to obtain a recognition result of the training speech; analyzing the confusion among the phonemes in the recognition result; selecting, for each phoneme in the recognition result, at least one competing phoneme that is easily confused with that phoneme; establishing a first inverse model and a second inverse model; training the first inverse model with the training speech segments corresponding to the competing phonemes; and training the second inverse model with the training speech segments corresponding to the phonemes other than the competing phonemes.

Description

Confidence estimation method and device based on inverse model
Technical field
The present invention relates to confidence estimation techniques used in speech recognition to detect and reject out-of-vocabulary speech input, and in particular to a method and apparatus for training phoneme-related inverse models, a method and apparatus for generating weights used in inverse-model-based confidence estimation, an inverse-model-based confidence estimation method and apparatus for speech recognition results, and a speech recognition method and system.
Background art
Usually, when performing speech recognition, a speech recognition system outputs the candidate with the highest probability as the recognition result. However, when the input speech contains out-of-vocabulary material, such a "best but wrong" recognition result tends to cause serious problems, especially for voice command-and-control systems. Detecting and rejecting out-of-vocabulary speech is therefore a core capability of a speech recognition system.
Technical schemes have been proposed that use confidence estimation to detect and reject out-of-vocabulary speech. Several confidence estimation techniques exist in the prior art; the main ones are briefly described below.
Method 1 uses the acoustic posterior probability (acoustic score) of each candidate as its confidence: the higher the acoustic score, the higher the confidence. This method directly reuses the acoustic scores obtained during recognition and has the lowest system overhead. However, because it relies only on the acoustic model used for recognition, it provides limited detection and rejection performance for out-of-vocabulary speech.
Method 2 extracts confidence from the intermediate output of the decoder in the speech recognition system, such as N-best candidate lists, word lattices and/or confusion networks. This method is described in detail in G. Evermann and P. C. Woodland, "Large Vocabulary Decoding and Confidence Estimation Using Word Posterior Probabilities", Proceedings of ICASSP, 2000, and in A. Lee, K. Shikano and T. Kawahara, "Real Time Word Confidence Scoring Using Local Posterior Probabilities on Tree Trellis Search", Proceedings of ICASSP, 2004.
Method 3 uses heuristic information, such as the degradation of a statistical language model score, to compute the confidence of a recognition result. This method is described in detail in X. Huang, A. Acero and H. Hon, "Spoken Language Processing: A Guide to Theory, Algorithm and System Development", Prentice Hall, Section 9.7, pp. 451-455, 2001. The full contents of the above documents/articles are incorporated herein by reference.
Methods 2 and 3 above are generally applicable to medium- or large-vocabulary speech recognition systems; method 3 in particular must be based on a well-trained language model.
Method 4 uses the log-likelihood ratio between the best candidate and some alternative hypothesis as the confidence in order to reject out-of-vocabulary speech. This method is suitable for small-vocabulary speech recognition systems such as voice command-and-control systems.
Summary of the invention
The present invention has been made in view of the above technical problems. Its object is to provide a method and apparatus for training phoneme-related inverse models, a method and apparatus for generating weights used in inverse-model-based confidence estimation, an inverse-model-based confidence estimation method and apparatus for speech recognition results, and a speech recognition method and system.
According to one aspect of the present invention, a method for training phoneme-related inverse models is provided, comprising: recognizing training speech with phonemes as acoustic units to obtain a recognition result of the training speech; analyzing the degree of confusion between the phonemes in the recognition result; for each phoneme in the recognition result, selecting at least one competing phoneme that is easily confused with that phoneme; establishing a first inverse model and a second inverse model; training the first inverse model with the training speech segments corresponding to the at least one competing phoneme; and training the second inverse model with the training speech segments corresponding to the phonemes other than the at least one competing phoneme.
According to another aspect of the present invention, a method for generating weights used in inverse-model-based confidence estimation is provided, comprising: building a training speech set; based on the training speech set, designing vocabularies for a plurality of specific voice command-and-control applications; for each specific voice command-and-control application, building a corresponding speech recognizer with phonemes as acoustic units; recognizing the speech in the corresponding vocabularies with the plurality of speech recognizers to obtain recognition results of the speech; for each phoneme in the recognition results, calculating the log-likelihood ratio of each combination of inverse model, phoneme type and phoneme position of that phoneme; determining, from the log-likelihood ratio of each combination, the equal error rate obtained when that combination alone is used for confidence estimation; and setting the weight of each combination according to its equal error rate, wherein a combination with a lower equal error rate is given a higher weight than a combination with a higher equal error rate; wherein the inverse models of the phonemes are trained with the above method for training phoneme-related inverse models.
According to another aspect of the present invention, an inverse-model-based confidence estimation method for a speech recognition result is provided, comprising: for each phoneme in the speech recognition result, calculating the log-likelihood ratio of that phoneme based on its acoustic score and its inverse models; and for each word in the speech recognition result, calculating the log-likelihood ratio of that word as the sum of the products of the log-likelihood ratios of all the phonemes composing the word and their respective weights; wherein the inverse models are trained with the above method for training phoneme-related inverse models, and the weights are generated with the above method for generating weights used in inverse-model-based confidence estimation.
According to another aspect of the present invention, a speech recognition method is provided, comprising: recognizing input speech with phonemes as acoustic units to obtain a recognition result of the speech; and performing confidence estimation on the recognition result using the above inverse-model-based confidence estimation method for speech recognition results.
According to another aspect of the present invention, an apparatus for training phoneme-related inverse models is provided, comprising: a speech recognizer that recognizes training speech with phonemes as acoustic units to obtain a recognition result of the training speech; a confusion analysis unit that analyzes the degree of confusion between the phonemes in the recognition result; a selection unit that, for each phoneme in the recognition result, selects at least one competing phoneme that is easily confused with that phoneme; an inverse model building unit that establishes a first inverse model and a second inverse model for each phoneme; a first training unit that, for each phoneme, trains its first inverse model with the training speech segments corresponding to the at least one competing phoneme of that phoneme; and a second training unit that, for each phoneme, trains its second inverse model with the training speech segments corresponding to the phonemes other than the at least one competing phoneme of that phoneme.
According to another aspect of the present invention, an apparatus for generating weights used in inverse-model-based confidence estimation is provided, comprising: a training speech set; a vocabulary design unit that designs, based on the training speech set, vocabularies for a plurality of specific voice command-and-control applications; a plurality of speech recognizers, each corresponding to one of the specific voice command-and-control applications and built with phonemes as acoustic units, which recognize the speech in the corresponding vocabularies to obtain recognition results of the speech; a log-likelihood ratio calculation unit that, for each phoneme in the recognition results, calculates the log-likelihood ratio of each combination of inverse model, phoneme type and phoneme position of that phoneme; an equal error rate determination unit that determines, from the log-likelihood ratio of each combination, the equal error rate obtained when that combination alone is used for confidence estimation; and a weight setting unit that sets the weight of each combination according to its equal error rate, wherein a combination with a lower equal error rate is given a higher weight than a combination with a higher equal error rate; wherein the inverse models of the phonemes are trained with the above apparatus for training phoneme-related inverse models.
According to another aspect of the present invention, an inverse-model-based confidence estimation apparatus for speech recognition results is provided, comprising: a phoneme log-likelihood ratio calculation unit that, for each phoneme in the speech recognition result, calculates the log-likelihood ratio of that phoneme based on its acoustic score and its inverse models; and a word log-likelihood ratio calculation unit that, for each word in the speech recognition result, calculates the log-likelihood ratio of that word as the sum of the products of the log-likelihood ratios of all the phonemes composing the word and their respective weights; wherein the inverse models are trained with the above apparatus for training phoneme-related inverse models, and the weights are generated with the above apparatus for generating weights used in inverse-model-based confidence estimation.
According to another aspect of the present invention, a speech recognition system is provided, comprising: a speech recognition device that recognizes input speech with phonemes as acoustic units to obtain a recognition result of the speech; and the above inverse-model-based confidence estimation apparatus for speech recognition results, which performs confidence estimation on the recognition result of the speech.
Description of drawings
Fig. 1 is a flowchart of a method for training phoneme-related inverse models according to an embodiment of the invention;
Fig. 2 is a flowchart of a method for generating weights used in inverse-model-based confidence estimation according to an embodiment of the invention;
Fig. 3 is a flowchart of an inverse-model-based confidence estimation method for speech recognition results according to an embodiment of the invention;
Fig. 4 is a flowchart of a speech recognition method according to an embodiment of the invention;
Fig. 5 is a schematic block diagram of an apparatus for training phoneme-related inverse models according to an embodiment of the invention;
Fig. 6 is a schematic block diagram of an apparatus for generating weights used in inverse-model-based confidence estimation according to an embodiment of the invention;
Fig. 7 is a schematic block diagram of an inverse-model-based confidence estimation apparatus for speech recognition results according to an embodiment of the invention;
Fig. 8 is a schematic block diagram of a speech recognition system according to an embodiment of the invention.
Detailed description of the embodiments
It is believed that the above and other objects, features and advantages of the present invention will become more apparent from the following detailed description of specific embodiments of the invention, taken in conjunction with the accompanying drawings.
To facilitate understanding of the embodiments below, the log-likelihood ratio (LLR) test mentioned above is first briefly introduced.
The LLR test calculates the LLR with the following formula (1):
$\mathrm{LLR} = \log \dfrac{P(O \mid H_0)}{P(O \mid H_1)}$    (1)
where O denotes the input speech, H_0 denotes the null hypothesis corresponding to the target model of the best candidate output by recognition, H_1 denotes the alternative hypothesis corresponding to the inverse model of that target model, and P denotes the posterior probability. Obviously, for an input speech O, if the probability P(O|H_0) under the null hypothesis is much larger than the probability P(O|H_1) under the alternative hypothesis, the null hypothesis is reliable. In the concrete decision, if the log-likelihood ratio between the null hypothesis and the alternative hypothesis is greater than a predetermined decision threshold, the null hypothesis H_0, i.e. the best candidate, can be reliably accepted; otherwise, the best candidate is rejected.
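A minimal sketch of this accept/reject rule, assuming the two likelihoods and the decision threshold are already available as plain numbers (the values below are hypothetical and only illustrate formula (1)):

```python
import math

def llr_accept(p_o_given_h0: float, p_o_given_h1: float, threshold: float) -> bool:
    """LLR test of formula (1): accept the best candidate (null hypothesis H0)
    only if log P(O|H0) - log P(O|H1) exceeds the decision threshold."""
    llr = math.log(p_o_given_h0) - math.log(p_o_given_h1)
    return llr > threshold

# Hypothetical likelihoods for one utterance O under the target model (H0)
# and its inverse model (H1); the threshold is an assumed tuning value.
print(llr_accept(p_o_given_h0=1e-3, p_o_given_h1=1e-5, threshold=2.0))  # True: accept
```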
In the LLR test technique, the design and training of the inverse model are extremely important; they directly determine the rejection performance of the speech recognition system for out-of-vocabulary speech.
Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a method for training phoneme-related inverse models according to an embodiment of the invention. This embodiment is described in detail below with reference to the drawing. The inverse models trained with the method of this embodiment are used in the method for generating weights for inverse-model-based confidence estimation and in the inverse-model-based confidence estimation method for speech recognition results, both described later in conjunction with other embodiments.
As shown in Fig. 1, first, in step 101, the training speech is recognized to obtain a recognition result of the training speech. Generally, training speech may be defined as speech data used to train an acoustic model or to estimate certain parameters, for example speech data recorded by users in advance. Specifically, the training speech is recognized with a speech recognizer that uses phonemes as acoustic units, for example a phoneme-loop recognition network. Of course, those of ordinary skill in the art will appreciate that other speech recognizers using phonemes as acoustic units may also be used; the structure and principle of such recognizers are well known and their description is omitted here.
Then, in step 105, the degree of confusion between the phonemes in the recognition result obtained in step 101 is analyzed, yielding a confusion matrix that records the degree of confusion between each pair of phonemes. In this embodiment, the degree of confusion between a pair of phonemes refers to the number of training speech samples of one phoneme of the pair that are misrecognized as the other phoneme.
Those of ordinary skill in the art will understand that any existing or future confusion analysis method can be used in this step.
Then, in step 110, for each phoneme in the recognition result, at least one competing phoneme that is easily confused with that phoneme is selected. Specifically, for each phoneme, the other phonemes are first sorted by their degree of confusion with that phoneme; then one or more phonemes with larger degrees of confusion are selected from the sorted phonemes, so that the ratio of the sum of their degrees of confusion with this phoneme to the total number of samples of this phoneme in the training speech exceeds a confusion threshold. Those phonemes are the competing phonemes of this phoneme, as sketched below.
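A minimal sketch of this selection step, assuming the confusion matrix of step 105 is available as a nested dictionary `confusion[p][q]` (samples of phoneme p misrecognized as q) and `total_samples[p]` gives the number of training samples of p; these data structures and the 0.85 default threshold are illustrative assumptions, not the patent's implementation:

```python
def select_competing_phonemes(phoneme, confusion, total_samples, threshold=0.85):
    """Sort the other phonemes by their confusion with `phoneme` and keep the most
    confusable ones until their accumulated confusion covers `threshold` of the
    phoneme's training samples (cf. the 80%-90% range given in the embodiment)."""
    others = sorted(
        (q for q in confusion[phoneme] if q != phoneme),
        key=lambda q: confusion[phoneme][q],
        reverse=True,
    )
    competing, covered = [], 0
    for q in others:
        competing.append(q)
        covered += confusion[phoneme][q]
        if covered / total_samples[phoneme] > threshold:
            break
    return competing

# Toy confusion counts: samples of "b" misrecognized as "p", "d" and "g".
confusion = {"b": {"p": 70, "d": 20, "g": 5}}
total_samples = {"b": 100}
print(select_competing_phonemes("b", confusion, total_samples))  # ['p', 'd']
```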
In this embodiment, the confusion threshold is predetermined; preferably, it lies in the range of 80%-90%.
Then, in step 115, a first inverse model and a second inverse model are established for each phoneme. In this embodiment, the initial first inverse model and second inverse model have the same topology, and both may be Gaussian mixture models (GMM) or hidden Markov models (HMM).
Then, in step 120, the first inverse model of each phoneme is trained with the training speech segments corresponding to the competing phonemes of that phoneme, and in step 125 the second inverse model of each phoneme is trained with the training speech segments corresponding to the phonemes other than the competing phonemes; one way to partition the training data is sketched below.
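A sketch of the data partition behind steps 120 and 125, under two assumptions that are not spelled out in the patent: the labelled segments are available as a mapping from phoneme label to segment list, and the phoneme's own segments are excluded from both pools. The actual GMM/HMM estimation would then be done with a standard acoustic model training toolkit.

```python
def split_training_segments(phoneme, competing, segments_by_phoneme):
    """Return (segments for the first inverse model, segments for the second inverse model).
    `segments_by_phoneme` maps each phoneme label to its list of training speech segments."""
    first_pool, second_pool = [], []
    for label, segments in segments_by_phoneme.items():
        if label == phoneme:
            continue  # assumption: the phoneme's own segments train its target model only
        if label in competing:
            first_pool.extend(segments)   # step 120: segments of the competing phonemes
        else:
            second_pool.extend(segments)  # step 125: segments of all remaining phonemes
    return first_pool, second_pool
```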
The first inverse model trained in this way is acoustically close to the corresponding phoneme in the acoustic space and is sensitive to recognition errors caused by phonetic similarity, so it detects out-of-vocabulary speech with high accuracy; however, such an inverse model also tends to increase false rejections, i.e. correct recognition results rejected as out-of-vocabulary speech. The second inverse model trained in this way is acoustically farther from the corresponding phoneme, but is more robust to environment changes and more stable with respect to a fixed decision threshold. Using both inverse models, the detection accuracy for out-of-vocabulary speech can be maintained while robustness is enhanced, thereby reducing the false rejection rate.
Those of ordinary skill in the art will readily appreciate that standard acoustic model training methods can be used to train the first and second inverse models.
In addition, in this embodiment, for Chinese speech, initials and finals may be used instead of phonemes.
As can be seen from the above description, the method of this embodiment for training phoneme-related inverse models trains, based on a confusion analysis of the recognition result of the training speech, the first and second inverse models of each phoneme with two types of speech segments respectively. By combining the two inverse models, a speech recognition system employing them can accurately detect out-of-vocabulary speech while being more robust to environment changes, thus reducing false rejections.
Under the same inventive concept, Fig. 2 is a flowchart of a method for generating weights used in inverse-model-based confidence estimation according to an embodiment of the invention. This embodiment is described in detail below with reference to the drawing; explanations of parts identical to the previous embodiment are omitted. The weights generated with the method of this embodiment are used in the inverse-model-based confidence estimation method for speech recognition results described later in conjunction with another embodiment.
As shown in Fig. 2, in step 201, a training speech set is built. In this embodiment, the training speech set includes at least: phoneme-balanced words and phrases; common words of specific voice command-and-control applications; and speech mixed with different noises.
Then, in step 205, based on this training speech set, vocabularies for a plurality of specific voice command-and-control applications are designed. Specifically, according to the needs of a specific voice command-and-control application, corresponding training speech is selected from the training speech set to form its vocabulary.
Then, in step 210, for each specific voice command-and-control application, a corresponding speech recognizer with phonemes as acoustic units is built to recognize the speech of that application. The speech recognizer built in this step may be any existing or future statistical-model-based speech recognizer that uses phonemes as acoustic units.
In this way, the weights used in the log-likelihood ratio calculation can be generated from the vocabularies designed above and the speech recognizers built above.
Then, in step 215, the speech in the corresponding vocabularies is recognized with the plurality of speech recognizers built in step 210 to obtain recognition results of the speech, which include the acoustic scores of the phonemes.
Then, in step 220, for each phoneme in the recognition results, the log-likelihood ratio of each combination of inverse model, phoneme type and phoneme position of that phoneme is calculated as its confidence, where the inverse models of the phonemes are trained with the method for training phoneme-related inverse models of the embodiment shown in Fig. 1.
Specifically, for a phoneme in the recognition result, its inverse models, phoneme type and phoneme position are determined and a plurality of combinations are formed. Then, the log-likelihood ratio of each of these combinations is calculated.
If the inverse model is a Gaussian mixture model, the log-likelihood ratio is calculated with the following formula (2):
$\mathrm{LLR}_m^c = \dfrac{1}{T_m}\sum_{t=1}^{T_m}\log\dfrac{b_{m_j}(O_t)}{b_A^c(O_t)} = \dfrac{1}{T_m}\left(\mathrm{AScore}_m - \sum_{t=1}^{T_m}\log b_A^c(O_t)\right) = \dfrac{1}{T_m}\left(\mathrm{AScore}_m - \sum_{t=1}^{T_m}\log\sum_{i=1}^{N_c} w_i^c\, P(O_t \mid \lambda_i^c)\right)$    (2)
where LLR_m^c denotes the log-likelihood ratio between the m-th phoneme and its c-th inverse model, T_m denotes the total number of speech sample frames (duration) of the m-th phoneme, AScore_m denotes the acoustic score of the m-th phoneme (so that AScore_m equals the sum over t of log b_{m_j}(O_t)), O_t denotes the t-th frame of speech samples of the m-th phoneme, b_A^c(·) denotes the Gaussian mixture distribution corresponding to the c-th inverse model, N_c denotes the total number of Gaussian components of the c-th inverse model, P(·|λ_i^c) denotes the i-th Gaussian distribution in the mixture of the c-th inverse model, and w_i^c denotes the mixture weight of the i-th Gaussian distribution.
As can be seen from formula (2), the computational cost of the log-likelihood ratio mainly comes from evaluating the Gaussian mixture model; Gaussian pruning techniques can be employed here to significantly reduce the amount of computation without degrading performance.
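A sketch of formula (2) under the assumption that the c-th inverse model is a diagonal-covariance GMM stored as arrays of weights, means and variances; the array-based interface and the use of `logsumexp` are illustrative choices, not the patent's implementation (and no Gaussian pruning is applied here):

```python
import numpy as np
from scipy.special import logsumexp

def phoneme_llr(ascore_m, frames, weights, means, variances):
    """Formula (2): LLR_m^c = (AScore_m - sum_t log b_A^c(O_t)) / T_m,
    where b_A^c is a diagonal-covariance GMM with `weights`, `means`, `variances`.
    `frames` has shape (T_m, D); `ascore_m` is the phoneme's acoustic score."""
    T_m, _ = frames.shape
    # Per-frame, per-component Gaussian log densities.
    diff = frames[:, None, :] - means[None, :, :]                    # (T_m, N_c, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=-1)
                        + np.sum(np.log(2 * np.pi * variances), axis=-1))
    log_b = logsumexp(log_gauss + np.log(weights), axis=1)           # log b_A^c(O_t)
    return (ascore_m - log_b.sum()) / T_m

# Toy inverse model with N_c = 2 components over D = 3-dimensional frames.
rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 3))
weights = np.array([0.4, 0.6])
means = rng.normal(size=(2, 3))
variances = np.ones((2, 3))
print(phoneme_llr(ascore_m=-40.0, frames=frames,
                  weights=weights, means=means, variances=variances))
```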
Then, in step 225, according to the log-likelihood ratio of each combination calculated in step 220, the equal error rate obtained when that combination alone is used for confidence estimation is determined.
Generally, speech recognition errors include the aforementioned false rejections and false acceptances, where a false acceptance means accepting a wrong recognition result (out-of-vocabulary speech, or a misrecognized in-vocabulary word) as correct. The two kinds of errors are clearly in conflict: for a given speech recognition system, reducing false rejections increases false acceptances, and vice versa. By adjusting the decision threshold, the false rejection rate and the false acceptance rate can be made equal; the false rejection rate (or false acceptance rate) at that point is the equal error rate. The equal error rate measures how well a combination detects speech recognition errors and therefore how good the rejection performance of the system is. Generally, the lower the equal error rate, the better the combination detects speech recognition errors and the better the rejection performance of the system, and the larger the weight this combination of the phoneme receives in the log-likelihood ratio calculation of a word containing that phoneme.
Finally, in step 230, the weight of each combination is set according to its equal error rate, where a combination with a lower equal error rate is given a higher weight than a combination with a higher equal error rate, as sketched below.
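A sketch of steps 225 and 230 under assumed inputs: for every combination we are given the LLR values of correctly recognized samples and of samples that should be rejected, the equal error rate is found by sweeping the decision threshold, and the weight is set inversely proportional to the equal error rate and normalized. The inverse-EER weighting rule is one possible realization of "lower equal error rate, higher weight", not the patent's exact formula.

```python
import numpy as np

def equal_error_rate(llr_correct, llr_wrong):
    """Sweep the decision threshold over all observed LLR values and return the
    operating point where the false rejection rate (correct samples below the
    threshold) and the false acceptance rate (wrong samples at or above it) are closest."""
    llr_correct = np.asarray(llr_correct, dtype=float)
    llr_wrong = np.asarray(llr_wrong, dtype=float)
    best_gap, eer = np.inf, 1.0
    for thr in np.unique(np.concatenate([llr_correct, llr_wrong])):
        frr = np.mean(llr_correct < thr)   # false rejections of correct results
        far = np.mean(llr_wrong >= thr)    # false acceptances of wrong results
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2
    return eer

def set_weights(eer_by_combination):
    """Step 230: give combinations with lower EER larger weights (here 1/EER,
    normalized so that the weights sum to 1, as required by formula (3))."""
    raw = {k: 1.0 / max(e, 1e-6) for k, e in eer_by_combination.items()}
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()}

# Toy scores for two combinations of (inverse model, phoneme type, phoneme position).
eers = {
    ("first", "initial", "word-start"): equal_error_rate([3.0, 2.5, 2.8], [0.5, 1.0, 2.6]),
    ("second", "final", "word-end"): equal_error_rate([1.2, 0.8, 2.0], [0.9, 1.1, 1.5]),
}
print(set_weights(eers))
```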
As can be seen from the above description, the method of this embodiment for generating weights for inverse-model-based confidence estimation sets, on a basis that takes into account phoneme balance and optimization for different voice command-and-control applications, the weight of each combination of inverse model, phoneme type and phoneme position of each phoneme according to the equal error rate of the speech recognition errors caused by that combination. The weights are thus generated in a data-driven manner, which provides good portability.
Under the same inventive concept, Fig. 3 is a flowchart of an inverse-model-based confidence estimation method for a speech recognition result according to an embodiment of the invention, where the speech recognition result includes the acoustic scores of the phonemes. This embodiment is described in detail below with reference to the drawing; explanations of parts identical to the previous embodiments are omitted.
As shown in Fig. 3, first, in step 301, for each phoneme in the speech recognition result, the log-likelihood ratio of that phoneme is calculated based on its acoustic score and its inverse models. In this embodiment, the inverse models of the phonemes are trained with the method for training phoneme-related inverse models of the embodiment shown in Fig. 1.
Specifically, when the first and second inverse models are Gaussian mixture models, the log-likelihood ratio LLR_m^c of a phoneme, i.e. the log-likelihood ratio between the m-th phoneme and its c-th inverse model, is calculated with formula (2) above.
Then, in step 310, for each word in the speech recognition result, the log-likelihood ratio of that word is calculated as the sum of the products of the log-likelihood ratios of all the phonemes composing the word and their respective weights; that is, the likelihood ratio of a word equals the weighted sum of the log-likelihood ratios of all its phonemes. Because of the different phoneme types and the different positions of the phonemes within a word, each phoneme contributes differently to the confidence of the whole word. For Chinese speech, for example, unvoiced initials usually contribute more to the confidence than voiced finals, and the word-initial phoneme is very important; therefore different phonemes should have different weights. In this embodiment, the phoneme weights are generated with the method for generating weights for inverse-model-based confidence estimation of the embodiment shown in Fig. 2.
Specifically, the log-likelihood ratio of a word is calculated with the following formula (3):
$\mathrm{LLR} = \sum_{c=1}^{2}\sum_{m=1}^{M} w_m^c\,\mathrm{LLR}_m^c$    (3)
where $\sum_{c=1}^{2}\sum_{m=1}^{M} w_m^c = 1$, $w_m^c$ denotes the weight, and M denotes the number of phonemes composing the word; a small numerical sketch of this weighted sum follows. As can be seen from the above description, the inverse-model-based confidence estimation method for speech recognition results of this embodiment uses the log-likelihood ratio of a word as its confidence and calculates it as the weighted sum of the log-likelihood ratios of the phonemes of the word, thereby taking into account the influence of the inverse models, phoneme types and phoneme positions of the phonemes, and can significantly improve the rejection performance of a speech recognition system.
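Formula (3) is a plain weighted sum; the sketch below assumes the per-inverse-model phoneme LLRs and their weights are already available as (weight, LLR) pairs, which is an illustrative data layout rather than the patent's interface.

```python
def word_llr(phoneme_scores):
    """Formula (3): LLR of a word = sum over phonemes m and inverse models c
    of w_m^c * LLR_m^c. `phoneme_scores` is a list of (weight, llr) pairs,
    one pair per (phoneme, inverse model) combination; the weights sum to 1."""
    return sum(w * llr for w, llr in phoneme_scores)

# Toy word with two phonemes, each scored against its first and second inverse model.
scores = [(0.3, 2.1), (0.2, 1.5),   # phoneme 1: first / second inverse model
          (0.3, 0.9), (0.2, 1.8)]   # phoneme 2: first / second inverse model
print(word_llr(scores))  # 1.56
```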
Further, in another embodiment, after the log-likelihood ratio of each word in the speech recognition result has been obtained (step 310), the log-likelihood ratio of each word is compared with one or more decision thresholds to determine whether the word has been correctly recognized.
In this embodiment, the decision thresholds can be predetermined by the user as required. With a single decision threshold, the word is indicated as correctly recognized when its log-likelihood ratio is greater than the threshold; otherwise, it is indicated as a misrecognition. With multiple decision thresholds, the word is indicated as correctly recognized when its log-likelihood ratio is greater than the largest threshold; it is indicated as a misrecognition when its log-likelihood ratio is smaller than the smallest threshold; and when its log-likelihood ratio lies between the smallest and the largest threshold, the user is notified to readjust the decision thresholds.
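A small sketch of the single- and multi-threshold decision just described; the return labels are illustrative, not terms from the patent.

```python
def decide(word_llr, thresholds):
    """Single threshold: accept if the word LLR exceeds it, otherwise reject.
    Multiple thresholds: accept above the largest, reject below the smallest,
    and ask the user to readjust the thresholds in between."""
    if len(thresholds) == 1:
        return "correct" if word_llr > thresholds[0] else "misrecognized"
    lo, hi = min(thresholds), max(thresholds)
    if word_llr > hi:
        return "correct"
    if word_llr < lo:
        return "misrecognized"
    return "readjust thresholds"

print(decide(1.56, [1.0]))        # correct
print(decide(1.56, [1.0, 2.0]))   # readjust thresholds
```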
Further, in another embodiment, after the log-likelihood ratio of each word in the speech recognition result has been obtained (step 310), the log-likelihood ratio of the word is normalized with a normalization function to a confidence score within a certain range, for example 1 to 100. The speech recognition result and its confidence score are then provided to the user.
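The patent leaves the normalization function unspecified; one possible choice, stated here purely as an assumption, is a logistic mapping of the word LLR onto the range 1 to 100.

```python
import math

def confidence_score(word_llr, scale=1.0):
    """Map a word LLR onto a 1-100 confidence score with a logistic function;
    `scale` is an assumed tuning constant, not a value from the patent."""
    return 1 + 99 / (1 + math.exp(-scale * word_llr))

print(round(confidence_score(1.56), 1))  # about 82.8
```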
Under the same inventive concept, Fig. 4 is a flowchart of a speech recognition method according to an embodiment of the invention. This embodiment is described in detail below with reference to the drawing; explanations of parts identical to the previous embodiments are omitted.
As shown in Fig. 4, first, in step 401, the input speech is recognized with phonemes as acoustic units to obtain a recognition result of the input speech. As described above, a speech recognizer using phonemes as acoustic units, for example a phoneme-loop recognition network, may be used to recognize the input speech.
Then, in step 410, confidence estimation is performed on the recognition result of the input speech using the inverse-model-based confidence estimation method for speech recognition results of the embodiment shown in Fig. 3.
Under the same inventive concept, Fig. 5 is a schematic block diagram of an apparatus for training phoneme-related inverse models according to an embodiment of the invention. This embodiment is described in detail below with reference to the drawing. The inverse models trained with the apparatus of this embodiment are used in the apparatus for generating weights for inverse-model-based confidence estimation and in the inverse-model-based confidence estimation apparatus for speech recognition results, both described later in conjunction with other embodiments.
As shown in Fig. 5, the apparatus 500 of this embodiment for training phoneme-related inverse models comprises: a speech recognizer 501, which recognizes training speech with phonemes as acoustic units to obtain a recognition result of the training speech; a confusion analysis unit 502, which analyzes the degree of confusion between the phonemes in the recognition result obtained by the speech recognizer 501; a competing phoneme selection unit 503, which, for each phoneme in the recognition result, selects at least one competing phoneme that is easily confused with that phoneme; an inverse model building unit 504, which establishes a first inverse model and a second inverse model for each phoneme; a first training unit 505, which, for each phoneme, trains the first inverse model of that phoneme with the speech segments corresponding to its competing phonemes; and a second training unit 506, which, for each phoneme, trains the second inverse model of that phoneme with the speech segments corresponding to the phonemes other than its competing phonemes.
In this embodiment, the speech recognizer 501 may be any existing or future statistical-model-based speech recognizer that uses phonemes as acoustic units.
When the competing phoneme selection unit 503 selects the competing phonemes of a phoneme, a phoneme sorting unit first sorts the other phonemes by their degree of confusion with that phoneme; a phoneme selection unit then selects one or more phonemes with larger degrees of confusion from the phonemes sorted by the phoneme sorting unit, so that the ratio of the sum of their degrees of confusion with this phoneme to the total number of samples of this phoneme in the training speech exceeds the confusion threshold. Those phonemes are the competing phonemes of this phoneme.
The first inverse model and second inverse model established by the inverse model building unit 504 may have the same initial topology, and both may be Gaussian mixture models or hidden Markov models.
It should be pointed out that the apparatus 500 of this embodiment for training phoneme-related inverse models and its components may be implemented with dedicated circuits or chips, or by a computer (processor) executing corresponding programs. In operation, the apparatus 500 can carry out the method for training phoneme-related inverse models of the embodiment shown in Fig. 1.
Under the same inventive concept, Fig. 6 is a schematic block diagram of an apparatus for generating weights used in inverse-model-based confidence estimation according to an embodiment of the invention. This embodiment is described in detail below with reference to the drawing; explanations of parts identical to the previous embodiments are omitted. The weights generated with the apparatus of this embodiment are used in the inverse-model-based confidence estimation apparatus for speech recognition results described later in conjunction with another embodiment.
As shown in Fig. 6, the apparatus 600 of this embodiment for generating weights used in inverse-model-based confidence estimation comprises: a training speech set 601, which contains training speech; a vocabulary design unit 602, which designs vocabularies for a plurality of specific voice command-and-control applications based on the training speech in the training speech set 601; a plurality of speech recognizers 603, each corresponding to one of the specific voice command-and-control applications and built with phonemes as acoustic units, which recognize the speech in the vocabularies designed by the vocabulary design unit 602 to obtain recognition results; a log-likelihood ratio calculation unit 604, which, for each phoneme in the recognition results, calculates the log-likelihood ratio of each combination of inverse model, phoneme type and phoneme position of that phoneme; an equal error rate determination unit 605, which determines, from the log-likelihood ratio of each combination, the equal error rate obtained when that combination alone is used for confidence estimation; and a weight setting unit 606, which sets the weight of each combination according to its equal error rate, wherein a combination with a lower equal error rate is given a higher weight than a combination with a higher equal error rate. The inverse models of the phonemes are trained with the apparatus 500 for training phoneme-related inverse models of the embodiment shown in Fig. 5.
In this embodiment, the training speech set 601 includes at least: phoneme-balanced words and phrases; common words of specific voice command-and-control applications; and speech mixed with different noises.
In the log-likelihood ratio calculation unit 604, if the inverse models of the phonemes are Gaussian mixture models, the log-likelihood ratio of each combination is calculated with formula (2) above.
It should be pointed out that the apparatus 600 of this embodiment for generating weights used in inverse-model-based confidence estimation and its components may be implemented with dedicated circuits or chips, or by a computer (processor) executing corresponding programs. In operation, the apparatus 600 can carry out the method for generating weights used in inverse-model-based confidence estimation of the embodiment shown in Fig. 2.
Under the same inventive concept, Fig. 7 is a schematic block diagram of an inverse-model-based confidence estimation apparatus for speech recognition results according to an embodiment of the invention. This embodiment is described in detail below with reference to the drawing; explanations of parts identical to the previous embodiments are omitted.
As shown in Fig. 7, the inverse-model-based confidence estimation apparatus 700 of this embodiment for speech recognition results comprises: a phoneme log-likelihood ratio calculation unit 701, which, for each phoneme in the speech recognition result, calculates the log-likelihood ratio of that phoneme based on its acoustic score and its inverse models; and a word log-likelihood ratio calculation unit 702, which, for each word in the speech recognition result, calculates the log-likelihood ratio of that word as the sum of the products of the log-likelihood ratios of all the phonemes composing the word and their respective weights. The inverse models of the phonemes are trained with the apparatus 500 for training phoneme-related inverse models of the embodiment shown in Fig. 5, and the weights are generated with the apparatus 600 for generating weights used in inverse-model-based confidence estimation of the embodiment shown in Fig. 6.
In the phoneme log-likelihood ratio calculation unit 701, if the inverse models of the phonemes are Gaussian mixture models, the log-likelihood ratio of a phoneme is calculated with formula (2) above; in the word log-likelihood ratio calculation unit 702, the log-likelihood ratio of a word is calculated with formula (3) above.
Further, the inverse-model-based confidence estimation apparatus 700 of this embodiment may also comprise: a comparison unit, which compares the calculated log-likelihood ratio of each word with one or more decision thresholds to determine whether the word has been correctly recognized. When the log-likelihood ratio of a word is greater than the decision threshold, the word is indicated as correctly recognized.
As described above, the decision thresholds can be predetermined by the user as required. With a single decision threshold, the word is indicated as correctly recognized when its log-likelihood ratio is greater than the threshold; otherwise, it is indicated as a misrecognition. With multiple decision thresholds, the word is indicated as correctly recognized when its log-likelihood ratio is greater than the largest threshold; it is indicated as a misrecognition when its log-likelihood ratio is smaller than the smallest threshold; and when its log-likelihood ratio lies between the smallest and the largest threshold, the user is notified to readjust the decision thresholds.
Further, the inverse-model-based confidence estimation apparatus 700 of this embodiment may also comprise: a normalization unit, which normalizes the log-likelihood ratio of a word with a normalization function to a confidence score within a certain range, for example 1 to 100.
It should be pointed out that the inverse-model-based confidence estimation apparatus 700 of this embodiment for speech recognition results and its components may be implemented with dedicated circuits or chips, or by a computer (processor) executing corresponding programs. In operation, the apparatus 700 can carry out the inverse-model-based confidence estimation method for speech recognition results of the embodiment shown in Fig. 3.
Under the same inventive concept, Fig. 8 is a schematic block diagram of a speech recognition system according to an embodiment of the invention. This embodiment is described in detail below with reference to the drawing; explanations of parts identical to the previous embodiments are omitted.
As shown in Fig. 8, the speech recognition system 800 of this embodiment comprises: a speech recognition device 801, which may be any existing or future speech recognition device using phonemes as acoustic units and which recognizes the input speech to obtain a recognition result of the speech; and an inverse-model-based confidence estimation device, which may be the inverse-model-based confidence estimation apparatus 700 for speech recognition results of the embodiment shown in Fig. 7, and which performs confidence estimation on the recognition result output by the speech recognition device 801.
It should be pointed out that the speech recognition system 800 of this embodiment and its components may be implemented with dedicated circuits or chips, or by a computer (processor) executing corresponding programs. In operation, the speech recognition system 800 can carry out the speech recognition method of the embodiment shown in Fig. 4.
Although the method and apparatus for training phoneme-related inverse models, the method and apparatus for generating weights used in inverse-model-based confidence estimation, the inverse-model-based confidence estimation method and apparatus for speech recognition results, and the speech recognition method and system of the present invention have been described in detail above through some exemplary embodiments, these embodiments are not exhaustive, and those skilled in the art may make various variations and modifications within the spirit and scope of the present invention. Therefore, the present invention is not limited to these embodiments; the scope of the present invention is defined only by the appended claims.

Claims (30)

1. A method for training phoneme-related inverse models, comprising:
recognizing training speech with phonemes as acoustic units to obtain a recognition result of the training speech;
analyzing the degree of confusion between the phonemes in the recognition result;
for each phoneme in the recognition result,
selecting at least one competing phoneme that is easily confused with the phoneme;
establishing a first inverse model and a second inverse model;
training the first inverse model with the training speech segments corresponding to the at least one competing phoneme; and
training the second inverse model with the training speech segments corresponding to the phonemes other than the at least one competing phoneme.
2. The method for training phoneme-related inverse models according to claim 1, wherein the step of selecting at least one competing phoneme that is easily confused with the phoneme comprises:
sorting the other phonemes in the recognition result by their degree of confusion with the phoneme; and
selecting, from the sorted other phonemes, one or more phonemes with larger degrees of confusion as competing phonemes, so that the ratio of the sum of their degrees of confusion with the phoneme to the total number of samples of the phoneme in the training speech exceeds a confusion threshold.
3. the method for the inverse model that training according to claim 1 and 2 is relevant with phoneme, wherein, above-mentioned phoneme is replaced with the initial consonant and the simple or compound vowel of a Chinese syllable of Chinese speech.
4. the method for the inverse model that training according to claim 1 and 2 is relevant with phoneme, wherein, above-mentioned first inverse model and second inverse model are gauss hybrid models.
5. the method for the inverse model that training according to claim 3 is relevant with phoneme, wherein, above-mentioned first inverse model and second inverse model are gauss hybrid models.
6. the method for the inverse model that training according to claim 1 and 2 is relevant with phoneme, wherein, above-mentioned first inverse model and second inverse model are hidden Markov models.
7. the method for the inverse model that training according to claim 3 is relevant with phoneme, wherein, above-mentioned first inverse model and second inverse model are hidden Markov models.
8. A method for generating weights used in inverse-model-based confidence estimation, comprising:
building a training speech set;
based on the training speech set, designing vocabularies for a plurality of specific voice command-and-control applications;
for each of the specific voice command-and-control applications, building a corresponding speech recognizer with phonemes as acoustic units;
recognizing the speech in the corresponding vocabularies with the plurality of speech recognizers to obtain recognition results of the speech;
for each phoneme in the recognition results,
calculating, for each combination of inverse model, phoneme type and phoneme position of the phoneme, the log-likelihood ratio of the combination;
determining, from the log-likelihood ratio of each combination, the equal error rate obtained when that combination alone is used for confidence estimation; and
setting the weight of each combination according to its equal error rate, wherein a combination with a lower equal error rate is given a higher weight than a combination with a higher equal error rate;
wherein the inverse models of the phonemes are trained with the method for training phoneme-related inverse models according to any one of claims 1 to 7.
9. The method for generating weights used in inverse-model-based confidence estimation according to claim 8, wherein the training speech set includes at least: phoneme-balanced words and phrases; common words of specific voice command-and-control applications; and speech mixed with different noises.
10. The method for generating weights used in inverse-model-based confidence estimation according to claim 8 or 9, wherein, when the inverse model is a Gaussian mixture model, the step of calculating the log-likelihood ratio of each combination comprises calculating the following formula:
$\mathrm{LLR}_m^c = \dfrac{1}{T_m}\left(\mathrm{AScore}_m - \sum_{t=1}^{T_m}\log\sum_{i=1}^{N_c} w_i^c\, P(O_t \mid \lambda_i^c)\right)$
where LLR_m^c denotes the log-likelihood ratio between the m-th phoneme and the c-th inverse model, T_m denotes the number of training speech frames of the m-th phoneme, AScore_m denotes the acoustic score of the m-th phoneme, O_t denotes the t-th frame of speech samples of the m-th phoneme, N_c denotes the total number of Gaussian components of the c-th inverse model, P(·|λ_i^c) denotes the i-th Gaussian distribution in the mixture of the c-th inverse model, and w_i^c denotes the mixture weight of the i-th Gaussian distribution.
11. An inverse-model-based confidence estimation method for a speech recognition result, comprising:
for each phoneme in the speech recognition result, calculating the log-likelihood ratio of the phoneme based on the acoustic score of the phoneme and the inverse models of the phoneme; and
for each word in the speech recognition result, calculating the log-likelihood ratio of the word as the sum of the products of the log-likelihood ratios of all the phonemes composing the word and their respective weights;
wherein the inverse models are trained with the method for training phoneme-related inverse models according to any one of claims 1 to 7;
and the weights are generated with the method for generating weights used in inverse-model-based confidence estimation according to any one of claims 8 to 10.
12. The inverse-model-based confidence estimation method for a speech recognition result according to claim 11, wherein, when the inverse model is a Gaussian mixture model, the step of calculating the log-likelihood ratio of the phoneme comprises calculating the following formula:
$\mathrm{LLR}_m^c = \dfrac{1}{T_m}\left(\mathrm{AScore}_m - \sum_{t=1}^{T_m}\log\sum_{i=1}^{N_c} w_i^c\, P(O_t \mid \lambda_i^c)\right)$
where LLR_m^c denotes the log-likelihood ratio between the m-th phoneme and the c-th inverse model, T_m denotes the number of training speech frames of the m-th phoneme, AScore_m denotes the acoustic score of the m-th phoneme, O_t denotes the t-th frame of speech samples of the m-th phoneme, N_c denotes the total number of Gaussian components of the c-th inverse model, P(·|λ_i^c) denotes the i-th Gaussian distribution in the mixture of the c-th inverse model, and w_i^c denotes the mixture weight of the i-th Gaussian distribution.
13., also comprise according to claim 11 or the 12 described confidence degree estimation methods that are used for voice identification result based on inverse model:
Whether correctly the log-likelihood ratio of more above-mentioned each speech and one or more decision-making value determine the identification of this speech.
14. according to claim 11 or the 12 described confidence degree estimation methods based on inverse model that are used for voice identification result, also comprise: the log-likelihood ratio that will go up predicate is normalized to the interior degree of confidence score of certain limit.
15. A speech recognition method, comprising:
recognizing input speech with phonemes as the acoustic units, to obtain a recognition result of the speech; and
performing inverse-model-based confidence estimation on the recognition result of the speech with the method for confidence estimation of a speech recognition result according to any one of claims 11 to 14.
16. A device for training inverse models related to phonemes, comprising:
a speech recognizer that recognizes training speech with phonemes as the acoustic units, to obtain a recognition result of the training speech;
a confusion analyzing unit that analyzes the degree of confusion between the phonemes in the recognition result;
a competing-phoneme selecting unit that, for each phoneme in the recognition result, selects at least one competing phoneme that is easily confused with this phoneme;
an inverse-model establishing unit that establishes a first inverse model and a second inverse model for each phoneme;
a first training unit that, for each phoneme, trains the first inverse model of this phoneme with the training speech segments corresponding to its at least one competing phoneme; and
a second training unit that, for each phoneme, trains the second inverse model of this phoneme with the training speech segments corresponding to the phonemes other than its at least one competing phoneme.
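The following sketch mirrors the first and second training units of claim 16 under the Gaussian-mixture variant (claims 19 and 20), using scikit-learn GMMs as a stand-in for the patent's own training procedure; the data layout and function names are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_inverse_models(frames_by_phoneme, competitors, n_components=8):
    """For each phoneme p, fit two GMM inverse models:
       first  - on the speech frames of p's competing (easily confused) phonemes
       second - on the frames of all remaining phonemes (excluding p and its competitors)

    frames_by_phoneme : dict phoneme -> (T, D) array of training frames
    competitors       : dict phoneme -> set of competing phonemes
    """
    models = {}
    all_phonemes = set(frames_by_phoneme)
    for p in all_phonemes:
        comp = set(competitors[p])
        rest = all_phonemes - comp - {p}
        first = GaussianMixture(n_components=n_components).fit(
            np.vstack([frames_by_phoneme[q] for q in comp]))
        second = GaussianMixture(n_components=n_components).fit(
            np.vstack([frames_by_phoneme[q] for q in rest]))
        models[p] = (first, second)
    return models
```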
17. The device for training inverse models related to phonemes according to claim 16, wherein the competing-phoneme selecting unit comprises:
a phoneme sorting unit for sorting the other phonemes in the recognition result according to their degree of confusion with this phoneme; and
a phoneme selecting unit for selecting, from the sorted other phonemes, one or more phonemes with a high degree of confusion as the competing phonemes, such that the ratio of the sum of their degrees of confusion with this phoneme to the total number of training samples of this phoneme in the training speech exceeds a confusion threshold.
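For illustration of the phoneme selecting unit in claim 17, a sketch that picks competing phonemes from confusion counts until their share of this phoneme's training samples exceeds the confusion threshold; the counting convention and the default threshold value are assumptions.

```python
def select_competitors(confusion_counts, total_samples, threshold=0.5):
    """confusion_counts : dict other_phoneme -> how often this phoneme's training
                          samples were recognized as other_phoneme
       total_samples    : total number of training samples of this phoneme
       threshold        : confusion threshold from claim 17 (value assumed here)
    """
    ranked = sorted(confusion_counts.items(), key=lambda kv: kv[1], reverse=True)
    selected, covered = [], 0
    for other, count in ranked:
        selected.append(other)
        covered += count
        if covered / total_samples > threshold:
            break
    return selected
```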
18. The device for training inverse models related to phonemes according to claim 16 or 17, wherein the phonemes are the initials and finals of Chinese speech.
19. The device for training inverse models related to phonemes according to claim 16 or 17, wherein the first inverse model and the second inverse model are Gaussian mixture models.
20. The device for training inverse models related to phonemes according to claim 18, wherein the first inverse model and the second inverse model are Gaussian mixture models.
21. The device for training inverse models related to phonemes according to claim 16 or 17, wherein the first inverse model and the second inverse model are hidden Markov models.
22. The device for training inverse models related to phonemes according to claim 18, wherein the first inverse model and the second inverse model are hidden Markov models.
23. A device for generating weights for inverse-model-based confidence estimation, comprising:
a training speech set;
a vocabulary designing unit that, based on the training speech set, designs vocabularies used respectively for a plurality of specific voice command-and-control applications;
a plurality of speech recognizers, each corresponding to one of the specific voice command-and-control applications and built with phonemes as the acoustic units, which recognize the speech for the corresponding vocabulary to obtain recognition results of the speech;
a log-likelihood-ratio calculating unit that, for each phoneme in the recognition results, calculates the log-likelihood ratio of each combination of inverse model, phoneme type and phoneme position of this phoneme;
an equal-error-rate determining unit that, according to the log-likelihood ratio of each combination, determines the equal error rate obtained when confidence estimation is performed with this combination alone; and
a weight setting unit that sets the weight of each combination according to its equal error rate, wherein a combination with a lower equal error rate is given a higher weight than a combination with a higher equal error rate;
wherein the inverse model of each phoneme is trained with the device for training inverse models related to phonemes according to any one of claims 16 to 22.
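A hedged sketch of the weight setting unit in claim 23: the claims only require that a lower equal error rate yields a higher weight, so the 1 − EER mapping with renormalization used below is just one simple choice, not the patent's prescribed formula.

```python
def weights_from_eer(eer_by_combination):
    """eer_by_combination : dict (inverse model, phoneme type, position) -> equal error rate
       Returns weights that sum to 1 and decrease as the EER increases."""
    raw = {combo: 1.0 - eer for combo, eer in eer_by_combination.items()}
    total = sum(raw.values())
    return {combo: value / total for combo, value in raw.items()}
```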
24. The device for generating weights for inverse-model-based confidence estimation according to claim 23, wherein the training speech set comprises at least: phoneme-balanced words and phrases; common words used in the specific voice command-and-control applications; and speech mixed with different noises.
25. The device for generating weights for inverse-model-based confidence estimation according to claim 23 or 24, wherein, when the inverse model is a Gaussian mixture model, the log-likelihood-ratio calculating unit calculates the log-likelihood ratio of each combination according to the following formula:
$$\mathrm{LLR}_m^c = \frac{1}{T_m}\left(\mathrm{AScore}_m - \sum_{t=1}^{T_m}\log\left(\sum_{i=1}^{N_c} w_i^c\, P\!\left(O_t \mid \lambda_i^c\right)\right)\right)$$

where $\mathrm{LLR}_m^c$ denotes the log-likelihood ratio between the m-th phoneme and the c-th inverse model, $T_m$ the number of training speech frames of the m-th phoneme, $\mathrm{AScore}_m$ the acoustic score of the m-th phoneme, $O_t$ the t-th frame speech sample of the m-th phoneme, $N_c$ the total number of Gaussian components in the mixture corresponding to the c-th inverse model, $P(\cdot \mid \lambda_i^c)$ the i-th Gaussian distribution in that mixture, and $w_i^c$ the mixture weight of the i-th Gaussian distribution.
26. A device for inverse-model-based confidence estimation of a speech recognition result, comprising:
a phoneme log-likelihood-ratio calculating unit that, for each phoneme in the speech recognition result, calculates the log-likelihood ratio of this phoneme based on the acoustic score and the inverse model of this phoneme; and
a word log-likelihood-ratio calculating unit that, for each word in the speech recognition result, calculates the log-likelihood ratio of this word as the sum of the products of the log-likelihood ratios of all the phonemes composing the word and their respective weights;
wherein the inverse models are trained with the device for training inverse models related to phonemes according to any one of claims 16 to 22; and
the weights are generated with the device for generating weights for inverse-model-based confidence estimation according to any one of claims 23 to 25.
27. The device for inverse-model-based confidence estimation of a speech recognition result according to claim 26, wherein, when the inverse model is a Gaussian mixture model, the phoneme log-likelihood-ratio calculating unit calculates the log-likelihood ratio of each phoneme according to the following formula:
$$\mathrm{LLR}_m^c = \frac{1}{T_m}\left(\mathrm{AScore}_m - \sum_{t=1}^{T_m}\log\left(\sum_{i=1}^{N_c} w_i^c\, P\!\left(O_t \mid \lambda_i^c\right)\right)\right)$$

where $\mathrm{LLR}_m^c$ denotes the log-likelihood ratio between the m-th phoneme and the c-th inverse model, $T_m$ the number of training speech frames of the m-th phoneme, $\mathrm{AScore}_m$ the acoustic score of the m-th phoneme, $O_t$ the t-th frame speech sample of the m-th phoneme, $N_c$ the total number of Gaussian components in the mixture corresponding to the c-th inverse model, $P(\cdot \mid \lambda_i^c)$ the i-th Gaussian distribution in that mixture, and $w_i^c$ the mixture weight of the i-th Gaussian distribution.
28. The device for inverse-model-based confidence estimation of a speech recognition result according to claim 26 or 27, further comprising:
a comparing unit that compares the log-likelihood ratio of each word with one or more decision thresholds to determine whether the word is recognized correctly.
29. The device for inverse-model-based confidence estimation of a speech recognition result according to claim 26 or 27, further comprising:
a normalizing unit that normalizes the log-likelihood ratio of each word into a confidence score within a certain range.
30. A speech recognition system, comprising:
a speech recognition device that recognizes input speech with phonemes as the acoustic units, to obtain a recognition result of the speech; and
the device for inverse-model-based confidence estimation of a speech recognition result according to any one of claims 26 to 29, which performs confidence estimation on the recognition result of the speech.
CN2007101941394A 2007-12-05 2007-12-05 Confidence degree estimation method and device based on inverse model Expired - Fee Related CN101452701B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2007101941394A CN101452701B (en) 2007-12-05 2007-12-05 Confidence degree estimation method and device based on inverse model


Publications (2)

Publication Number Publication Date
CN101452701A CN101452701A (en) 2009-06-10
CN101452701B true CN101452701B (en) 2011-09-07

Family

ID=40734901

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2007101941394A Expired - Fee Related CN101452701B (en) 2007-12-05 2007-12-05 Confidence degree estimation method and device based on inverse model

Country Status (1)

Country Link
CN (1) CN101452701B (en)


Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102142253B (en) * 2010-01-29 2013-05-29 富士通株式会社 Voice emotion identification equipment and method
JP5875414B2 (en) 2012-03-07 2016-03-02 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Noise suppression method, program and apparatus
CN103366734B (en) * 2012-03-31 2015-11-25 佳能株式会社 The voice recognition result method of inspection and equipment, voice recognition and audio monitoring systems
US9536528B2 (en) * 2012-07-03 2017-01-03 Google Inc. Determining hotword suitability
CN105632495B (en) * 2015-12-30 2019-07-05 百度在线网络技术(北京)有限公司 Audio recognition method and device
CN105869637B (en) * 2016-05-26 2019-10-15 百度在线网络技术(北京)有限公司 Voice awakening method and device
CN108831443B (en) * 2018-06-25 2020-07-21 华中师范大学 Mobile recording equipment source identification method based on stacked self-coding network
CN111968649B (en) * 2020-08-27 2023-09-15 腾讯科技(深圳)有限公司 Subtitle correction method, subtitle display method, device, equipment and medium


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0953970A2 (en) * 1998-04-29 1999-11-03 Matsushita Electric Industrial Co., Ltd. Method and apparatus using decision trees to generate and score multiple pronunciations for a spelled word
CN1293428A (en) * 2000-11-10 2001-05-02 清华大学 Information check method based on speed recognition
CN1490786A (en) * 2002-10-17 2004-04-21 中国科学院声学研究所 Phonetic recognition confidence evaluating method, system and dictation device therewith
CN1979638A (en) * 2005-12-02 2007-06-13 中国科学院自动化研究所 Method for correcting error of voice identification result
CN1808567A (en) * 2006-01-26 2006-07-26 覃文华 Voice-print authentication device and method of authenticating people presence

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104205214A (en) * 2012-03-09 2014-12-10 国际商业机器公司 Noise alleviation method, program, and device
CN104205214B (en) * 2012-03-09 2016-11-23 国际商业机器公司 noise reduction method and device

Also Published As

Publication number Publication date
CN101452701A (en) 2009-06-10

Similar Documents

Publication Publication Date Title
CN101452701B (en) Confidence degree estimation method and device based on inverse model
CN108428446B (en) Speech recognition method and device
US20190266998A1 (en) Speech recognition method and device, computer device and storage medium
CN103971685B (en) Method and system for recognizing voice commands
EP2387031B1 (en) Methods and systems for grammar fitness evaluation as speech recognition error predictor
US7475015B2 (en) Semantic language modeling and confidence measurement
CN103810996B (en) The processing method of voice to be tested, Apparatus and system
US6618702B1 (en) Method of and device for phone-based speaker recognition
CN102568475B (en) System and method for assessing proficiency in Putonghua
CN102194454B (en) Equipment and method for detecting key word in continuous speech
JP5223673B2 (en) Audio processing apparatus and program, and audio processing method
CN105529028A (en) Voice analytical method and apparatus
CN109036471B (en) Voice endpoint detection method and device
CN101465123A (en) Verification method and device for speaker authentication and speaker authentication system
KR102199246B1 (en) Method And Apparatus for Learning Acoustic Model Considering Reliability Score
CN102439660A (en) Voice-tag method and apparatus based on confidence score
KR101317339B1 (en) Apparatus and method using Two phase utterance verification architecture for computation speed improvement of N-best recognition word
KR20210130024A (en) Dialogue system and method of controlling the same
KR20130126570A (en) Apparatus for discriminative training acoustic model considering error of phonemes in keyword and computer recordable medium storing the method thereof
Jiang et al. A dynamic in-search data selection method with its applications to acoustic modeling and utterance verification
KR20110071742A (en) Apparatus for utterance verification based on word specific confidence threshold
JP3456444B2 (en) Voice determination apparatus and method, and recording medium
KR101752709B1 (en) Utterance verification method in voice recognition system and the voice recognition system
JP5066668B2 (en) Speech recognition apparatus and program
Zhang et al. Confidence measure (CM) estimation for large vocabulary speaker-independent continuous speech recognition system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110907

Termination date: 20141205

EXPY Termination of patent right or utility model