Summary of the invention
Embodiments of the present invention provide a language-independent keyword recognition method and system, so as to reduce the false alarm rate of keyword recognition and improve system performance.
To this end, the present invention provides the following technical solutions:
A language-independent keyword recognition method, including:
Receiving a speech signal to be detected;
Decoding the speech signal according to a pre-built decoding network to obtain candidate keywords;
Performing confidence evaluation on the candidate keywords in different ways;
Fusing the confidence evaluation results of the different ways to obtain an effective confidence of the candidate keywords;
Determining the keyword to be output according to the effective confidence.
Preferably, performing confidence evaluation on the candidate keywords in different ways includes: calculating the confidence of the candidate keywords based on the log-likelihood ratio; and further includes: calculating the confidence of the candidate keywords based on the wVector correlation, and/or calculating the confidence of the candidate keywords based on the state frame variance score.
Preferably, calculating the confidence of the candidate keywords based on the wVector correlation includes:
Training a universal background model;
Training a keyword GMM model according to the keyword training sample speech segments and the universal background model;
Obtaining the speech segment of the candidate keyword according to the path corresponding to the candidate keyword in the decoding network, and then training a candidate keyword GMM model according to the speech segment of the candidate keyword and the universal background model;
Calculating the KL distance between the keyword GMM model and the candidate keyword GMM model, and using the KL distance as the confidence of the candidate keyword.
Preferably, calculating the confidence of the candidate keywords based on the wVector correlation includes:
Training a universal background model;
Calculating the likelihood of the keyword training sample speech segments on each Gaussian component of the universal background model to form a keyword pronunciation model;
Obtaining the speech segment of the candidate keyword according to the path corresponding to the candidate keyword in the decoding network, and then calculating the likelihood of the speech segment on each Gaussian component of the universal background model to form a candidate keyword pronunciation model;
Calculating the correlation between the keyword pronunciation model and the candidate keyword pronunciation model, and using the correlation as the confidence of the candidate keyword.
Preferably, calculating the confidence of the candidate keywords based on the state frame variance score includes:
Obtaining the speech segment corresponding to the candidate keyword;
Performing forced alignment on the keyword model to obtain the number of speech frames of the speech segment contained in each state;
Computing, from the number of speech frames in each state, the variance of the speech frames as the confidence of the candidate keyword.
Preferably, calculating the confidence of the candidate keywords based on the state frame variance score includes:
Obtaining the speech segment corresponding to the candidate keyword and its speech frames in each state of the keyword model;
Computing the sample variance of the speech frames in each state;
Combining the sample variances of the speech frames in all states to obtain an overall state sample variance, and using the overall state sample variance as the confidence of the candidate keyword.
A language-independent keyword recognition system, including:
A receiving module, configured to receive a speech signal to be detected;
A decoding module, configured to decode the speech signal according to a pre-built decoding network to obtain candidate keywords;
A confidence evaluation module, configured to perform confidence evaluation on the candidate keywords in different ways;
A fusion module, configured to fuse the confidence evaluation results of the different ways to obtain an effective confidence of the candidate keywords;
An output module, configured to determine the keyword to be output according to the effective confidence.
Preferably, the confidence evaluation module includes:
A first evaluation module, configured to calculate the confidence of the candidate keywords based on the log-likelihood ratio;
The confidence evaluation module further includes:
A second evaluation module, configured to calculate the confidence of the candidate keywords based on the wVector correlation;
And/or
A third evaluation module, configured to calculate the confidence of the candidate keywords based on the state frame variance score.
Preferably, the second evaluation module includes:
A background model training unit, configured to train a universal background model;
A keyword model training unit, configured to train a keyword GMM model according to the keyword training sample speech segments and the universal background model;
A candidate keyword model training unit, configured to obtain the speech segment of the candidate keyword according to the path corresponding to the candidate keyword in the decoding network, and then train a candidate keyword GMM model according to the speech segment of the candidate keyword and the universal background model;
A distance calculation unit, configured to calculate the KL distance between the keyword GMM model and the candidate keyword GMM model, and use the KL distance as the confidence of the candidate keyword.
Preferably, the second evaluation module includes:
A background model training unit, configured to train a universal background model;
A keyword pronunciation model construction unit, configured to calculate the likelihood of the keyword training sample speech segments on each Gaussian component of the universal background model to form a keyword pronunciation model;
A candidate keyword pronunciation model construction unit, configured to obtain the speech segment of the candidate keyword according to the path corresponding to the candidate keyword in the decoding network, and then calculate the likelihood of the speech segment on each Gaussian component of the universal background model to form a candidate keyword pronunciation model;
A correlation calculation unit, configured to calculate the correlation between the keyword pronunciation model and the candidate keyword pronunciation model, and use the correlation as the confidence of the candidate keyword.
Preferably, the third evaluation module includes:
A speech segment acquisition unit, configured to obtain the speech segment corresponding to the candidate keyword;
An alignment unit, configured to perform forced alignment on the keyword model to obtain the number of speech frames of the speech segment contained in each state;
A speech frame variance statistics unit, configured to compute, from the number of speech frames in each state, the variance of the speech frames as the confidence of the candidate keyword.
Preferably, the third evaluation module includes:
A speech frame acquisition unit, configured to obtain the speech segment corresponding to the candidate keyword and its speech frames in each state of the keyword model;
A sample variance statistics unit, configured to compute the sample variance of the speech frames in each state;
A combining unit, configured to combine the sample variances of the speech frames in all states to obtain an overall state sample variance, and use the overall state sample variance as the confidence of the candidate keyword.
In the language-independent keyword recognition method and system provided by the embodiments of the present invention, after the keyword decoding results are obtained from the decoding network, confidence evaluation is performed on the keyword decoding results in different ways, and the confidence evaluation results of the different ways are fused to determine the confidence of each keyword decoding result. The reasonableness of each keyword decoding result is determined from this confidence, so that confidence-based filtering of the keyword decoding results is more accurate and reasonable, which effectively improves system performance.
Detailed description of the invention
To enable those skilled in the art to better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the accompanying drawings and implementations.
Under the HMM/Filler framework, the HMM training algorithm based on the MLE (Maximum Likelihood Estimation) criterion, together with efficient decoding algorithms such as Viterbi and WFST (Weighted Finite State Transducer), gives the decoding method based on the keyword statistical model and the Filler model good operability and generality in practical applications. In real environments, however, the speech signal to be detected is often affected by factors such as noise, channel, and regional speaker populations, so the keyword results retrieved by direct decoding often contain many false alarms, which degrades system performance. For this reason, existing HMM/Filler systems generally suppress the false alarm probability by confidence filtering after decoding.
Fig. 1 is a flowchart of a prior-art keyword recognition method based on confidence filtering, which includes the following steps:
Step 101: Train the keyword HMM model and the Filler model separately.
Step 102: Build a decoding network from these models.
Step 103: For the received speech signal to be detected, search for the optimal path in the decoding network, and determine the speech segment corresponding to each keyword model and its position in the speech signal.
Step 104: Compute a confidence score for each keyword decoding result obtained, including the speech segment corresponding to the keyword and the decoding path score, to confirm the reasonableness of the keyword retrieval result.
Step 105: Output the recognition result.
As the above flow shows, whether the confidence score is calculated reasonably directly determines which keyword retrieval results are kept. The higher the confidence score, the more reliable the keyword obtained. Conversely, if the confidence score cannot truly reflect the retrieval situation, keyword retrieval errors are likely.
At present, confidence calculation for language-independent keywords generally uses scores such as duration and the log-likelihood ratio. These confidence calculation methods depend on the decoding path result and can give good results under favorable conditions, but in complex real application environments, a single confidence measure that depends on the decoding path result can hardly filter out false alarms effectively.
An analysis of existing HMM/Filler systems shows that the keyword retrieval results obtained after confidence-based filtering still contain false-alarm errors (speech segments that are not the keyword are retrieved as the keyword), for two main reasons:
1. The recognition result and the training samples sound quite different, that is, the test environment and the training environment differ greatly.
2. The pronunciation of the recognition result partly matches that of the training samples. For example, the recognition result contains the word "Chinese" while the keyword sample is "Chinese"; with such a partial match, the acoustic model score tends to be high, which produces a false-alarm error.
In traditional filtering methods based on the decoding path score and the log-likelihood ratio confidence score, some of these segments receive high confidence scores, so the false alarm rate is high, which harms system performance and the user's subjective experience.
Based on the above analysis of why keyword retrieval results in existing HMM/Filler systems produce false-alarm errors, the embodiments of the present invention provide a language-independent keyword recognition method that makes confidence-based filtering more accurate and reasonable, and thereby improves the performance of the keyword recognition system.
Fig. 2 is a flowchart of the language-independent keyword recognition method according to an embodiment of the present invention, which includes the following steps:
Step 201: Receive a speech signal to be detected.
Step 202: Decode the speech signal according to a pre-built decoding network to obtain candidate keywords.
The decoding network may be built from the keyword models and the Filler model. The training of the keyword models and the Filler model and the construction of the decoding network may use training and construction methods from the prior art, which the embodiments of the present invention do not limit.
The decoding process mainly searches, for the received speech signal to be detected, for the optimal path in the decoding network, and determines the speech segment corresponding to each keyword model and its position in the speech signal.
Step 203: Perform confidence evaluation on the candidate keywords in different ways.
The purpose of performing confidence evaluation on the candidate keywords is to determine the correctness of each keyword decoding result. Whether the confidence score is calculated reasonably directly affects whether each candidate keyword is kept or discarded. If the confidence score cannot truly reflect the retrieval situation, keyword retrieval errors are likely. Therefore, unlike traditional confidence-based filtering methods, which use a single confidence measure, the embodiments of the present invention calculate the confidence of each keyword decoding result from different angles in multiple ways, and fuse the confidences calculated in these different ways to obtain an effective confidence for each candidate keyword, which makes confidence-based filtering more accurate and reasonable.
In the embodiments of the present invention, the confidence of the candidate keywords calculated from the log-likelihood ratio serves as the basis and is supplemented by new, targeted confidence score calculation methods. Confidence fusion suppresses the confidence scores of false-alarm errors, so that confidence-based filtering is more accurate and reasonable and system performance improves.
The process of calculating the confidence of the candidate keywords based on the log-likelihood ratio is similar to the prior art and is roughly as follows:
According to the theory of hypothesis testing, the likelihood ratio is defined as the ratio of the probability of a given observation under hypothesis H1 (the observation belongs to a certain probability distribution) to its probability under hypothesis H0 (the observation does not belong to that probability distribution). Since the probability distribution is usually assumed to have an exponential form, the log-likelihood ratio is generally used in place of the likelihood ratio for computational convenience. In language-independent keyword recognition, let the features of the candidate segment identified by decoding be $O$, the corresponding keyword model be $\lambda_{hmm}$, and the Filler model be $\lambda_{filler}$. The log-likelihood ratio score is then defined as:
$$LLR(O) = \log \frac{p(O \mid \lambda_{hmm})}{p(O \mid \lambda_{filler})} = \log p(O \mid \lambda_{hmm}) - \log p(O \mid \lambda_{filler})$$
The log-likelihood ratio reflects the confidence that the features of the current candidate segment belong to $\lambda_{hmm}$.
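As an illustration of the log-likelihood ratio score, the following minimal Python sketch sums per-frame log-likelihoods of one candidate segment under the keyword model and under the Filler model and takes their difference. The function name and the assumption that frame-level log-likelihoods along the decoding path are already available are hypothetical and are not specified by this description.

```python
def log_likelihood_ratio(frame_loglik_keyword, frame_loglik_filler):
    """Log-likelihood ratio of one candidate segment.

    Both arguments are assumed (hypothetically) to hold one log-likelihood
    per speech frame of the same segment: the first under the keyword HMM,
    the second under the Filler model.
    """
    if len(frame_loglik_keyword) != len(frame_loglik_filler):
        raise ValueError("both models must score the same frames")
    if not frame_loglik_keyword:
        raise ValueError("the candidate segment has no frames")
    # log p(O | keyword) - log p(O | filler), with each term a sum over frames
    return sum(frame_loglik_keyword) - sum(frame_loglik_filler)


# A segment of two frames that the keyword model explains better than the Filler model
score = log_likelihood_ratio([-1.0, -2.0], [-2.0, -4.0])
```

A positive score means the keyword model explains the segment better than the Filler model does. In practice the score is often divided by the number of frames so that segments of different length are comparable; that normalisation is a common convention and not something stated here.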
The embodiments of the present invention also propose the following two new confidence calculation methods:
(1) calculating the confidence of the candidate keywords based on the wVector correlation;
(2) calculating the confidence of the candidate keywords based on the state frame variance score.
The calculation of these two new confidences is described in detail later.
Step 204: Fuse the confidence evaluation results of the different ways to obtain the effective confidence of the candidate keywords.
It should be noted that, in practical applications, the confidence of the candidate keywords calculated from the log-likelihood ratio may be fused with either one of the above two confidences, or with both of them at the same time, which the embodiments of the present invention do not limit.
For example, assume that the confidence based on the log-likelihood ratio score of the recognition result speech segment on the keyword model is $S_{llr}$, the confidence based on the state frame variance is $S_{var\_frame}$, and the confidence score based on the wVector correlation is $S_{wvec}$.
In the embodiments of the present invention, a weighted average may be used to fuse the above confidence scores.
First $S_{llr}$ and $S_{var\_frame}$ are fused, and the result is then fused with $S_{wvec}$. The fusion formula is as follows:
$$S_{final} = (1-\beta)\,(S_{llr} + \alpha S_{var\_frame}) + \beta\,\frac{S_{wvec} - \mu}{\sigma}$$
Here, $S_{llr} + \alpha S_{var\_frame}$ treats the state frame variance as an additional term of the likelihood ratio score ($S_{var\_frame}$ is only weakly discriminative, so it is better suited to being an additional term), and $\mu$ and $\sigma$ are introduced to normalize $S_{wvec}$ to the same scale as $S_{llr} + \alpha S_{var\_frame}$.
Of course, other fusion methods may also be used in practical applications, which the embodiments of the present invention do not limit.
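The weighted-average fusion above can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, and the values of alpha, beta, mu and sigma are placeholders that would have to be tuned on development data.

```python
def fuse_confidence(s_llr, s_var_frame, s_wvec, alpha, beta, mu, sigma):
    """Weighted-average fusion of the three confidence scores.

    alpha weights the state frame variance term, beta balances the wVector
    term against the rest, and mu/sigma rescale the wVector score.
    """
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    # State frame variance enters as an additional term of the LLR score
    base = s_llr + alpha * s_var_frame
    # Bring the wVector score to the same scale as the base score
    normalised_wvec = (s_wvec - mu) / sigma
    return (1.0 - beta) * base + beta * normalised_wvec


# Placeholder values chosen only to show the call
s_final = fuse_confidence(2.0, 1.0, 0.9, alpha=0.5, beta=0.25, mu=0.5, sigma=0.2)
```

Setting beta to 0 reduces the fusion to the log-likelihood ratio plus the variance term, and setting alpha to 0 drops the variance term, which corresponds to fusing the log-likelihood ratio confidence with only one of the two new confidences.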
Step 205: Determine the keyword to be output according to the effective confidence.
For example, when the effective confidence obtained by fusion for a candidate keyword is higher than a set threshold, that candidate keyword may be output.
In the language-independent keyword recognition method provided by the embodiments of the present invention, after the keyword decoding results, namely the candidate keywords, are obtained from the decoding network, confidence evaluation is performed on the candidate keywords in different ways, and the confidence evaluation results of the different ways are fused to determine the confidence of each candidate keyword. The reasonableness of each keyword decoding result is determined from this confidence, so that confidence-based filtering of the keyword decoding results is more accurate and reasonable, which effectively improves system performance.
As mentioned above, the embodiments of the present invention use several different ways to determine the confidence of a keyword decoding result. Each of them is described in detail below.
Fig. 3 shows the confidence calculation flow based on the wVector correlation in an embodiment of the present invention.
To address false alarms in which the speech segment corresponding to the recognition result sounds quite different from the keyword training sample speech segments, a Gaussian mixture model (GMM) may be built for the keyword training sample speech segments and for the decoded candidate keyword speech segment respectively, and false alarms may then be controlled by calculating the KL distance (Kullback-Leibler divergence) between the two Gaussian mixture models.
To preserve the correspondence between the Gaussian components of the mixture models, the parameters of the keyword GMM model and the candidate keyword GMM model may be estimated from a universal background model (UBM) using the maximum a posteriori (MAP) estimation algorithm.
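To make the MAP step more concrete, the following Python sketch adapts only the component means of a UBM to a set of frames. It rests on several assumptions that this description does not state: features are scalar (one-dimensional), only the means are adapted, and the relevance factor of 16 is an arbitrary illustrative value. All function names are hypothetical.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a one-dimensional UBM.

    Because every adapted model starts from the same UBM, component k of
    one adapted model still corresponds to component k of another.
    """
    k = len(weights)
    occupancy = [0.0] * k      # soft frame counts per component
    first_order = [0.0] * k    # posterior-weighted sums of the frames
    for x in frames:
        log_joint = [math.log(w) + gaussian_logpdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
        peak = max(log_joint)
        norm = peak + math.log(sum(math.exp(v - peak) for v in log_joint))
        for i, lj in enumerate(log_joint):
            post = math.exp(lj - norm)   # posterior of component i for this frame
            occupancy[i] += post
            first_order[i] += post * x
    adapted = []
    for i in range(k):
        if occupancy[i] > 0.0:
            a = occupancy[i] / (occupancy[i] + relevance)
            adapted.append(a * first_order[i] / occupancy[i] + (1.0 - a) * means[i])
        else:
            adapted.append(means[i])     # no data: keep the UBM mean
    return adapted


# Sixteen frames near the first component pull its mean halfway toward the data
new_means = map_adapt_means([1.0] * 16, [0.5, 0.5], [0.0, 10.0], [1.0, 1.0])
```

Components that see little data stay close to the UBM, which is why this kind of adaptation remains usable on short speech segments.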
The specific calculation process is shown in Fig. 3 and includes the following steps:
Step 1: Train a universal background model on a large amount of real language-dependent data.
Step 2: For each keyword, train a GMM model corresponding to that keyword from its training sample speech segments.
Specifically, the MAP algorithm may be used to adapt the pre-estimated universal background model to obtain a GMM model dependent on the keyword text, referred to for convenience as the keyword GMM model.
Step 3: Obtain the speech segment of each candidate keyword from the decoding result path of the speech to be detected, and use the MAP algorithm to adapt the pre-estimated universal background model to obtain a text-dependent GMM model for the candidate keyword speech segment, referred to for convenience as the candidate keyword GMM model.
Step 4: Calculate the KL distance between the keyword GMM model and the candidate keyword GMM model.
Assume that the probability distributions represented by the keyword GMM model and the candidate keyword GMM model are $f(x)$ and $g(x)$ respectively. The KL distance is then defined as:
$$D_{KL}(f \,\|\, g) = \int f(x) \log \frac{f(x)}{g(x)}\, dx$$
In practice, the KL distance may be calculated using methods such as Monte Carlo sampling.
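A Monte Carlo estimate of the KL distance between two GMMs can be sketched in Python as follows: samples are drawn from the first model and the log-density difference is averaged over them. The sketch assumes one-dimensional GMMs represented as hypothetical (weights, means, variances) tuples; the sample count and seed are illustrative.

```python
import math
import random


def gmm_logpdf(x, weights, means, variances):
    """Log density of a one-dimensional GMM, computed with log-sum-exp."""
    terms = [math.log(w) - 0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def sample_gmm(rng, weights, means, variances):
    """Draw one sample: pick a component by weight, then sample its Gaussian."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(means[k], math.sqrt(variances[k]))


def kl_monte_carlo(f, g, n_samples=20000, seed=0):
    """Estimate KL(f || g) as the mean of log f(x) - log g(x) for x drawn from f."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = sample_gmm(rng, *f)
        total += gmm_logpdf(x, *f) - gmm_logpdf(x, *g)
    return total / n_samples


# Two single-Gaussian "GMMs" whose exact KL distance is 0.5
keyword_gmm = ([1.0], [0.0], [1.0])
candidate_gmm = ([1.0], [1.0], [1.0])
estimate = kl_monte_carlo(keyword_gmm, candidate_gmm)
```

The KL distance is zero for identical models and grows as the models diverge, so a smaller value indicates a candidate that is closer to the keyword training samples. It is also not symmetric, so the order of the two models matters.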
Further, since the recognition result speech segment is short, the keyword-text-dependent model of the test environment obtained by training (namely the aforementioned keyword GMM model) may not be accurate enough. For this case, the embodiments of the present invention also propose an alternative scheme, which includes the following steps:
Step 1: Train a universal background model on a large amount of real language-dependent data.
Step 2: Calculate the likelihood of the keyword training sample speech segments on each Gaussian component of the universal background model to form a keyword pronunciation model.
Since each Gaussian component of the universal background model physically represents a different pronunciation unit, the distribution of the likelihood of the training sample speech segments over the Gaussians characterizes the pronunciation of the keyword. The likelihoods on the different Gaussian components form a feature vector that serves as the pronunciation model of the keyword.
Step 3: For the speech segment of each candidate keyword, likewise calculate its likelihood on each Gaussian component of the universal background model to form a candidate keyword pronunciation model.
Step 4: Calculate the correlation between the keyword pronunciation model and the candidate keyword pronunciation model, and use this correlation as the measure of confidence.
Since the pronunciation model formed from the likelihoods on the Gaussians of the universal background model is similar to the weights in the keyword GMM model described above, this method may be called the confidence calculation method based on the wVector correlation.
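The alternative scheme can be illustrated with the Python sketch below. It assumes scalar features, represents each pronunciation model as the average posterior occupancy of every UBM component over the segment, and uses the Pearson correlation coefficient as the correlation measure. These choices and all function names are assumptions made for illustration, since the description does not fix the exact likelihood statistic or correlation measure.

```python
import math


def component_occupancy(frames, weights, means, variances):
    """Average posterior of each UBM component over the frames of a segment."""
    if not frames:
        raise ValueError("the segment has no frames")
    acc = [0.0] * len(weights)
    for x in frames:
        log_joint = [math.log(w) - 0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
                     for w, m, v in zip(weights, means, variances)]
        peak = max(log_joint)
        norm = peak + math.log(sum(math.exp(t - peak) for t in log_joint))
        for i, lj in enumerate(log_joint):
            acc[i] += math.exp(lj - norm)
    return [a / len(frames) for a in acc]


def pearson(u, v):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    num = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mean_u) ** 2 for a in u) * sum((b - mean_v) ** 2 for b in v))
    return num / den if den > 0.0 else 0.0


# Toy three-component UBM; each component stands for one pronunciation unit
ubm = ([1 / 3, 1 / 3, 1 / 3], [-5.0, 0.0, 5.0], [1.0, 1.0, 1.0])
keyword_vec = component_occupancy([-5.1, -4.9, 0.1], *ubm)
similar_vec = component_occupancy([-5.0, -5.2, 0.0], *ubm)
different_vec = component_occupancy([5.0, 5.1, 4.9], *ubm)
```

Here `pearson(keyword_vec, similar_vec)` is close to 1 and `pearson(keyword_vec, different_vec)` is negative, so a candidate that occupies the same UBM components as the keyword samples receives the higher confidence. No model is trained on the short candidate segment, which is the motivation given above for this scheme.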
Fig. 4 shows the confidence calculation flow based on the state frame variance score in an embodiment of the present invention.
Here, the state frame variance is the sample variance of the sets of speech frames assigned to the states of the keyword HMM model, computed for the speech segment corresponding to the candidate keyword.
In false-alarm errors caused by a partial pronunciation match, the frames of the states that can be matched tend to dominate, and accordingly the sample variance of the speech frames over the states is usually abnormal (very large). Detecting the state frame variance therefore makes it possible to suppress false-alarm errors caused by partial matches. The calculation flow is shown in Fig. 4 and includes the following steps:
Step 1: Obtain the speech segment corresponding to the candidate keyword;
Step 2: Perform forced alignment on the keyword model to obtain the number of speech frames of the speech segment contained in each state;
Step 3: Compute, from the number of speech frames in each state, the variance of the speech frames as the confidence of the candidate keyword.
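The three steps above can be sketched in Python as follows, assuming that forced alignment has already produced one state index per frame. The function name, the alignment representation, and the use of the population variance over the per-state frame counts are illustrative assumptions.

```python
import statistics
from collections import Counter


def state_frame_count_variance(alignment, num_states):
    """Variance of the number of frames assigned to each keyword HMM state.

    alignment holds one state index (0 .. num_states - 1) per speech frame,
    as a forced alignment would produce. States that receive no frames
    count as zero so that skipped states also raise the variance.
    """
    if num_states < 1:
        raise ValueError("the keyword model needs at least one state")
    counts = Counter(alignment)
    per_state = [counts.get(state, 0) for state in range(num_states)]
    return statistics.pvariance(per_state)


# Balanced alignment: every state gets two frames
balanced = state_frame_count_variance([0, 0, 1, 1, 2, 2], num_states=3)
# Partial match: one state absorbs almost all frames
skewed = state_frame_count_variance([0] * 10 + [1, 2], num_states=3)
```

In this example the balanced alignment gives a variance of 0 and the skewed one a variance of 18. A large value signals that a few states dominate, so when this quantity is used as a confidence, higher values indicate a less reliable candidate.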
Fig. 5 is a flowchart of another confidence calculation based on the state frame variance score in an embodiment of the present invention, which includes the following steps:
Step 1: Obtain the speech segment corresponding to the candidate keyword and its speech frames in each state of the keyword model;
Step 2: Compute the sample variance of the speech frames in each state;
Step 3: Combine the sample variances of the speech frames in all states to obtain an overall state sample variance, and use the overall state sample variance as the confidence of the candidate keyword.
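One possible reading of this second variant is sketched below in Python. The description does not say which per-frame quantity the variance is taken over or how the per-state variances are combined, so the sketch assumes one scalar value per frame (for example an acoustic score) and a frame-count-weighted average of the per-state sample variances. Both choices, and the function name, are assumptions.

```python
import statistics
from collections import defaultdict


def overall_state_sample_variance(frame_values, alignment):
    """Frame-weighted average of the per-state sample variances.

    frame_values holds one scalar per frame and alignment the state index
    of that frame. States with fewer than two frames have no sample
    variance and are skipped.
    """
    if len(frame_values) != len(alignment):
        raise ValueError("one state index is needed per frame")
    by_state = defaultdict(list)
    for value, state in zip(frame_values, alignment):
        by_state[state].append(value)
    weighted_sum = 0.0
    total_frames = 0
    for values in by_state.values():
        if len(values) < 2:
            continue
        weighted_sum += len(values) * statistics.variance(values)
        total_frames += len(values)
    return weighted_sum / total_frames if total_frames else 0.0


# State 0 holds values 1.0 and 3.0 (sample variance 2.0); state 1 holds 2.0 twice (variance 0.0)
overall = overall_state_sample_variance([1.0, 3.0, 2.0, 2.0], [0, 0, 1, 1])
```

Weighting by frame count keeps a state with only a handful of frames from dominating the overall value; an unweighted mean or a sum would be equally valid readings of "combining" the per-state variances.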
Correspondingly, the embodiments of the present invention also provide a language-independent keyword recognition system. Fig. 6 is a schematic structural diagram of the system.
The system includes:
Receiving module 601, configured to receive a speech signal to be detected;
Decoding module 602, configured to decode the speech signal according to a pre-built decoding network to obtain candidate keywords;
Confidence evaluation module 603, configured to perform confidence evaluation on the candidate keywords in different ways;
Fusion module 604, configured to fuse the confidence evaluation results of the different ways to obtain the effective confidence of the candidate keywords;
Output module 605, configured to determine the keyword to be output according to the effective confidence.
In this embodiment, the confidence evaluation module 603 includes a first evaluation module, and further includes a second evaluation module and/or a third evaluation module, where:
The first evaluation module is configured to calculate the confidence of the candidate keywords based on the log-likelihood ratio;
The second evaluation module is configured to calculate the confidence of the candidate keywords based on the wVector correlation;
The third evaluation module is configured to calculate the confidence of the candidate keywords based on the state frame variance score.
For the process by which each of these evaluation modules calculates the confidence of the candidate keywords, refer to the description in the method embodiments above.
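The way the modules cooperate can be sketched in Python as a small pipeline. The class and field names below are hypothetical, and the decoder, evaluators and fusion function are passed in as plain callables so that the sketch stays independent of any particular acoustic model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Candidate:
    """A candidate keyword and its position in the signal (in frames)."""
    keyword: str
    start_frame: int
    end_frame: int


@dataclass
class KeywordSpotter:
    decode: Callable[[Sequence[float]], List[Candidate]]                 # decoding module
    evaluators: List[Callable[[Sequence[float], Candidate], float]]     # evaluation modules
    fuse: Callable[[List[float]], float]                                 # fusion module
    threshold: float                                                     # used by the output module

    def run(self, signal: Sequence[float]) -> List[Candidate]:
        accepted = []
        for candidate in self.decode(signal):
            # Each evaluator scores the same candidate in a different way
            scores = [evaluate(signal, candidate) for evaluate in self.evaluators]
            # Output only candidates whose fused confidence exceeds the threshold
            if self.fuse(scores) > self.threshold:
                accepted.append(candidate)
        return accepted


# Stub components, used only to show how the pieces are wired together
spotter = KeywordSpotter(
    decode=lambda signal: [Candidate("alpha", 0, 10), Candidate("beta", 20, 30)],
    evaluators=[lambda s, c: 1.0 if c.keyword == "alpha" else 0.0,
                lambda s, c: 0.5],
    fuse=lambda scores: sum(scores) / len(scores),
    threshold=0.5,
)
accepted = spotter.run([0.0])
```

With these stubs only the candidate "alpha" is accepted. Keeping the evaluators in a list mirrors the fact that the second and third evaluation modules are optional and may be present individually or together.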
It should be noted that, in practical applications, the second evaluation module and the third evaluation module may each be implemented in several ways, for example:
One embodiment of the second evaluation module may include:
A background model training unit, configured to train a universal background model;
A keyword model training unit, configured to train a keyword GMM model according to the keyword training sample speech segments and the universal background model;
A candidate keyword model training unit, configured to obtain the speech segment of the candidate keyword according to the path corresponding to the candidate keyword in the decoding network, and then train a candidate keyword GMM model according to the speech segment of the candidate keyword and the universal background model;
A distance calculation unit, configured to calculate the KL distance between the keyword GMM model and the candidate keyword GMM model, and use the KL distance as the confidence of the candidate keyword.
Another embodiment of the second evaluation module may include:
A background model training unit, configured to train a universal background model;
A keyword pronunciation model construction unit, configured to calculate the likelihood of the keyword training sample speech segments on each Gaussian component of the universal background model to form a keyword pronunciation model;
A candidate keyword pronunciation model construction unit, configured to obtain the speech segment of the candidate keyword according to the path corresponding to the candidate keyword in the decoding network, and then calculate the likelihood of the speech segment on each Gaussian component of the universal background model to form a candidate keyword pronunciation model;
A correlation calculation unit, configured to calculate the correlation between the keyword pronunciation model and the candidate keyword pronunciation model, and use the correlation as the confidence of the candidate keyword.
One embodiment of the third evaluation module may include:
A speech segment acquisition unit, configured to obtain the speech segment corresponding to the candidate keyword;
An alignment unit, configured to perform forced alignment on the keyword model to obtain the number of speech frames of the speech segment contained in each state;
A speech frame variance statistics unit, configured to compute, from the number of speech frames in each state, the variance of the speech frames as the confidence of the candidate keyword.
Another embodiment of the third evaluation module may include:
A speech frame acquisition unit, configured to obtain the speech segment corresponding to the candidate keyword and its speech frames in each state of the keyword model;
A sample variance statistics unit, configured to compute the sample variance of the speech frames in each state;
A combining unit, configured to combine the sample variances of the speech frames in all states to obtain an overall state sample variance, and use the overall state sample variance as the confidence of the candidate keyword.
In addition, it should be noted that, in practical applications, the fusion module 604 may fuse the confidence of the candidate keywords calculated from the log-likelihood ratio with either one of the above two confidences, or with both of them at the same time, which the embodiments of the present invention do not limit. Specifically, the fusion may use a weighted average of the above confidence scores.
In the language-independent keyword recognition system provided by the embodiments of the present invention, after the keyword decoding results are obtained from the decoding network, confidence evaluation is performed on the keyword decoding results in different ways, and the confidence evaluation results of the different ways are fused to determine the confidence of each keyword decoding result. The reasonableness of each keyword decoding result is determined from this confidence, so that confidence-based filtering of the keyword decoding results is more accurate and reasonable, which effectively improves system performance.
The embodiments in this specification are described in a progressive manner. For parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. The system embodiment in particular is described briefly because it is substantially similar to the method embodiment; for the relevant parts, refer to the description of the method embodiment. The system embodiment described above is merely illustrative. The modules or units described as separate components may or may not be physically separate, and the components shown as modules or units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the system according to the embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
The embodiments of the present invention have been described in detail above. Specific implementations are used herein to explain the invention, and the description of the above embodiments is only intended to help in understanding the method and apparatus of the invention. For those of ordinary skill in the art, changes may be made to the specific implementations and to the scope of application in accordance with the idea of the invention. In summary, the content of this specification should not be construed as limiting the invention.