CN103559881B

CN103559881B - Language-independent keyword recognition method and system

Info

Publication number: CN103559881B
Application number: CN201310553073.9A
Authority: CN
Inventors: 刘俊华; 魏思; 胡国平; 胡郁
Original assignee: iFlytek Co Ltd
Current assignee: Technological University Xunfei Hebei Technology Co Ltd
Priority date: 2013-11-08
Filing date: 2013-11-08
Publication date: 2016-08-31
Anticipated expiration: 2033-11-08
Also published as: CN103559881A

Abstract

The invention discloses a language-independent keyword recognition method and system. The method includes: receiving a speech signal to be detected; decoding the speech signal according to a pre-built decoding network to obtain candidate keywords; evaluating the confidence of the candidate keywords in several different ways; fusing the confidence evaluation results of the different ways to obtain an effective confidence for each candidate keyword; and determining the keywords to output according to the effective confidence.

Description

Language-independent keyword recognition method and system

Technical field

The present invention relates to the technical field of spoken keyword recognition, and in particular to a language-independent keyword recognition method and system.

Background art

Spoken keyword recognition means determining, from a given speech file or speech data, whether the data contains a specific keyword, and determining information such as the position at which the keyword occurs. Mainstream spoken keyword recognition today is based primarily on speech recognition technology: a speech recognizer for the language in question first recognizes the text content of the speech, and the specific keyword text, together with its position information, is then retrieved from that text. With this approach users can define new keywords fairly easily, so it extends well. However, developing and training a speech recognizer requires building an acoustic model and a language model for the corresponding language, so the approach cannot be carried over to other languages for which annotated training data is lacking.

In recent years, demand in the public safety field for keyword retrieval in less commonly spoken languages and dialects has become increasingly urgent. Because personnel skilled in a given language are relatively scarce and annotated data is lacking, a corresponding speech recognizer cannot be developed quickly, and traditional spoken keyword recognition systems and methods therefore cannot be used for keyword retrieval. In response, researchers have proposed language-independent keyword recognition, which builds keyword models from annotated keyword pronunciation samples so that a spoken keyword recognition system can be set up quickly and flexibly.

At present, the most common approaches to language-independent keyword recognition are the method based on DTW (Dynamic Time Warping) and the decoding method based on a keyword statistical model and a Filler model (HMM/Filler). The former first extracts the speech feature sequence of the keyword and compares it, block by block, with the features of the speech signal to be searched, to find similar speech segments. This algorithm has high computational complexity and has difficulty combining the features of multiple keyword samples, so its retrieval performance is unsatisfactory and it is hard to deploy effectively for keyword recognition in continuous speech. The method based on a keyword statistical model and a Filler model builds a statistical model for the keyword and a Filler model for non-keyword speech. On the one hand, modeling combines multiple keyword samples effectively; on the other hand, a dynamic search algorithm such as Viterbi decoding determines the optimal path of the speech to be detected through the search network built from these models, and thereby the position of the keyword. This method tends to give good recognition results when the training data has sufficient coverage, that is, when the detection environment matches the training environment. In practice, however, because of complex noise and variability in accent and channel in the speech data to be detected, the retrieved keywords are often not true keywords. In other words, the false alarm rate is high, which degrades system performance.
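As a non-limiting illustration of the DTW-based approach mentioned above, the following sketch computes the classic dynamic time warping alignment cost between a keyword feature sequence and a candidate segment. The function name `dtw_distance`, the Euclidean frame distance, and the list-of-vectors feature layout are assumptions made for illustration; the source does not specify them.

```python
import math


def dtw_distance(query, target):
    """Classic DTW alignment cost between two sequences of feature vectors.

    query, target: lists of equal-dimension feature vectors (hypothetical layout).
    """
    n, m = len(query), len(target)
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning query[:i] with target[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between the two frames (an assumed local cost)
            d = math.dist(query[i - 1], target[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

The table has one cell per pair of frames, so the cost grows with the product of the two sequence lengths for every keyword sample and every segment compared, which is consistent with the complexity concern raised above.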

Summary of the invention

Embodiments of the present invention provide a language-independent keyword recognition method and system, so as to reduce the false alarm rate of keyword recognition and improve system performance.

To this end, the present invention provides the following technical solutions:

A language-independent keyword recognition method, including:

receiving a speech signal to be detected;

decoding the speech signal according to a pre-built decoding network to obtain candidate keywords;

evaluating the confidence of the candidate keywords in different ways;

fusing the confidence evaluation results of the different ways to obtain an effective confidence of the candidate keywords;

determining the keywords to output according to the effective confidence.

Preferably, evaluating the confidence of the candidate keywords in different ways includes: computing the confidence of the candidate keywords based on the log-likelihood ratio; and further includes: computing the confidence of the candidate keywords based on wVector correlation, and/or computing the confidence of the candidate keywords based on a state frame variance score.

Preferably, computing the confidence of the candidate keywords based on wVector correlation includes:

training a universal background model;

training a keyword GMM model from the keyword training sample speech segments and the universal background model;

obtaining the speech segment of the candidate keyword from the path corresponding to the candidate keyword in the decoding network, and then training a candidate keyword GMM model from the speech segment of the candidate keyword and the universal background model;

computing the KL distance between the keyword GMM model and the candidate keyword GMM model, and using the KL distance as the confidence of the candidate keyword.

Preferably, computing the confidence of the candidate keywords based on wVector correlation includes:

training a universal background model;

computing the likelihood of the keyword training sample speech segments on each Gaussian component of the universal background model to form a keyword pronunciation model;

obtaining the speech segment of the candidate keyword from the path corresponding to the candidate keyword in the decoding network, and then computing the likelihood of that speech segment on each Gaussian component of the universal background model to form a candidate keyword pronunciation model;

computing the correlation between the keyword pronunciation model and the candidate keyword pronunciation model, and using the correlation as the confidence of the candidate keyword.

Preferably, computing the confidence of the candidate keywords based on a state frame variance score includes:

obtaining the speech segment corresponding to the candidate keyword;

performing forced alignment on the keyword model to obtain the number of speech frames of the speech segment contained in each state;

computing, from the number of speech frames in each state, the variance of the speech frame counts as the confidence of the candidate keyword.

Preferably, computing the confidence of the candidate keywords based on a state frame variance score includes:

obtaining the speech segment corresponding to the candidate keyword and the speech frames in each state on the keyword model;

computing the sample variance of the speech frames in each state;

combining the sample variances of the speech frames in all states to obtain an overall state sample variance, and using the overall state sample variance as the confidence of the candidate keyword.

A language-independent keyword recognition system, including:

a receiving module, configured to receive a speech signal to be detected;

a decoding module, configured to decode the speech signal according to a pre-built decoding network to obtain candidate keywords;

a confidence evaluation module, configured to evaluate the confidence of the candidate keywords in different ways;

a fusion module, configured to fuse the confidence evaluation results of the different ways to obtain an effective confidence of the candidate keywords;

an output module, configured to determine the keywords to output according to the effective confidence.

Preferably, the confidence evaluation module includes:

a first evaluation module, configured to compute the confidence of the candidate keywords based on the log-likelihood ratio;

the confidence evaluation module further includes:

a second evaluation module, configured to compute the confidence of the candidate keywords based on wVector correlation; and/or

a third evaluation module, configured to compute the confidence of the candidate keywords based on a state frame variance score.

Preferably, the second evaluation module includes:

a background model training unit, configured to train a universal background model;

a keyword model training unit, configured to train a keyword GMM model from the keyword training sample speech segments and the universal background model;

a candidate keyword model training unit, configured to obtain the speech segment of the candidate keyword from the path corresponding to the candidate keyword in the decoding network, and then train a candidate keyword GMM model from the speech segment of the candidate keyword and the universal background model;

a distance calculation unit, configured to compute the KL distance between the keyword GMM model and the candidate keyword GMM model, and to use the KL distance as the confidence of the candidate keyword.

Preferably, the second evaluation module includes:

a keyword pronunciation model construction unit, configured to compute the likelihood of the keyword training sample speech segments on each Gaussian component of the universal background model to form a keyword pronunciation model;

a candidate keyword pronunciation model construction unit, configured to obtain the speech segment of the candidate keyword from the path corresponding to the candidate keyword in the decoding network, and then compute the likelihood of that speech segment on each Gaussian component of the universal background model to form a candidate keyword pronunciation model;

a correlation calculation unit, configured to compute the correlation between the keyword pronunciation model and the candidate keyword pronunciation model, and to use the correlation as the confidence of the candidate keyword.

Preferably, the third evaluation module includes:

a speech segment acquisition unit, configured to obtain the speech segment corresponding to the candidate keyword;

a segmentation unit, configured to perform forced alignment on the keyword model to obtain the number of speech frames of the speech segment contained in each state;

a speech frame variance statistics unit, configured to compute, from the number of speech frames in each state, the variance of the speech frame counts as the confidence of the candidate keyword.

Preferably, the third evaluation module includes:

a speech frame acquisition unit, configured to obtain the speech segment corresponding to the candidate keyword and the speech frames in each state on the keyword model;

a sample variance statistics unit, configured to compute the sample variance of the speech frames in each state;

a combination unit, configured to combine the sample variances of the speech frames in all states to obtain an overall state sample variance, and to use the overall state sample variance as the confidence of the candidate keyword.

With the language-independent keyword recognition method and system provided by the embodiments of the present invention, after keyword decoding results are obtained from the decoding network, the decoding results are evaluated for confidence in several different ways, and the confidence evaluation results of the different ways are fused to determine the confidence of each keyword decoding result. The reasonableness of each keyword decoding result is judged from this fused confidence, so that confidence-based filtering of keyword decoding results becomes more accurate and reasonable, which effectively improves system performance.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present application or of the prior art more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are only some of the embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them.

Fig. 1 is a flow chart of a prior-art keyword recognition method based on confidence filtering;

Fig. 2 is a flow chart of the language-independent keyword recognition method according to an embodiment of the present invention;

Fig. 3 is a flow chart of confidence computation based on wVector correlation in an embodiment of the present invention;

Fig. 4 is a flow chart of one confidence computation based on a state frame variance score in an embodiment of the present invention;

Fig. 5 is a flow chart of another confidence computation based on a state frame variance score in an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of the language-independent keyword recognition system according to an embodiment of the present invention.

Detailed description of the embodiments

To help those skilled in the art better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the accompanying drawings and implementations.

Under the HMM/Filler framework, HMM training algorithms based on the MLE (Maximum Likelihood Estimation) criterion, together with efficient decoding algorithms such as Viterbi and WFST (Weighted Finite-State Transducer), give the decoding method based on a keyword statistical model and a Filler model good operability and generality in practical applications. In real environments, however, the speech signal to be detected is often affected by factors such as noise, channel, and regional speaker populations, so the keyword results retrieved by direct decoding often contain many false alarms, which degrades system performance. For this reason, existing HMM/Filler systems usually suppress false alarms by confidence filtering after decoding.

As shown in Fig. 1, the flow of a prior-art keyword recognition method based on confidence filtering includes the following steps:

Step 101: train a keyword HMM model and a Filler model respectively.

Step 102: build a decoding network from these models.

Step 103: for the received speech signal to be detected, search for the optimal path in the decoding network, and determine the speech segment corresponding to the keyword model and its position in the speech signal.

Step 104: compute a confidence score for the obtained keyword decoding result, including the speech segment corresponding to the keyword and the decoding path score, to confirm whether the keyword retrieval result is reasonable.

Step 105: output the recognition result.

As the above flow shows, whether the confidence score is computed reasonably directly determines which keyword retrieval results are accepted or rejected. The higher the confidence score, the more reliable the retrieved keyword. Conversely, if the confidence score does not truly reflect the retrieval situation, keyword retrieval errors readily occur.

At present, confidence computation for language-independent keywords usually uses scores such as duration and log-likelihood ratio. These methods depend on the decoding path result. They can give good results under ideal conditions, but in complex real application environments a single confidence measure that depends on the decoding path result has difficulty filtering out false alarms effectively.

An analysis of existing HMM/Filler systems shows that the keyword retrieval results obtained after confidence filtering still contain false-alarm errors (speech segments that are not keywords are retrieved as keywords), for two main reasons:

1. The recognition result sounds quite different from the training samples, that is, the test environment differs greatly from the training environment.

2. The pronunciation of the recognition result partly matches that of the training samples. For example, the recognition result contains the word "Chinese" while the keyword sample is "Chinese"; with such a partial match the acoustic model score tends to be high, which produces a false-alarm error.

In traditional confidence filtering based on the decoding path score, such as the log-likelihood ratio, some false-alarm segments receive high confidence scores. This leads to a high false alarm rate and harms both system performance and the user's subjective experience.

Based on the above analysis of the causes of false-alarm errors in the keyword retrieval results of existing HMM/Filler systems, embodiments of the present invention provide a language-independent keyword recognition method that makes confidence-based filtering more accurate and reasonable, thereby improving the performance of the keyword recognition system.

As shown in Fig. 2, the flow of the language-independent keyword recognition method according to an embodiment of the present invention includes the following steps:

Step 201: receive a speech signal to be detected.

Step 202: decode the speech signal according to a pre-built decoding network to obtain candidate keywords.

The decoding network can be built from a keyword model and a Filler model. The training of the keyword model and the Filler model, and the construction of the decoding network, can use training and construction approaches known in the prior art; the embodiments of the present invention place no limitation on this.

Decoding mainly consists of searching for the optimal path in the decoding network for the received speech signal to be detected, and determining the speech segment corresponding to the keyword model and its position in the speech signal.
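As a non-limiting illustration of the optimal-path search described above, the following sketch runs a generic log-domain Viterbi search over a small state network and then reads off which frames fall in keyword states. The function names `viterbi` and `keyword_span`, the dense transition table, and the idea of marking certain state indices as keyword states are assumptions made for illustration; an actual decoding network built from keyword and Filler models would be considerably larger.

```python
def viterbi(log_init, log_trans, log_emit):
    """Return (best_score, best_path) for a log-domain HMM.

    log_init[s]     : log initial probability of state s
    log_trans[p][s] : log transition probability from state p to state s
    log_emit[t][s]  : log emission probability of frame t in state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at frame t
    for t in range(1, len(log_emit)):
        prev, score, ptr = score, [], []
        for s in range(n_states):
            best_p = max(range(n_states), key=lambda p: prev[p] + log_trans[p][s])
            ptr.append(best_p)
            score.append(prev[best_p] + log_trans[best_p][s] + log_emit[t][s])
        back.append(ptr)
    # Trace back from the best final state
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return score[last], path


def keyword_span(path, keyword_states):
    """First and last frame index whose state is a keyword state, or None."""
    frames = [t for t, s in enumerate(path) if s in keyword_states]
    return (frames[0], frames[-1]) if frames else None
```

In this sketch the span returned by `keyword_span` plays the role of the position of the keyword speech segment within the signal.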

Step 203: evaluate the confidence of the candidate keywords in different ways.

The purpose of evaluating the confidence of the candidate keywords is to determine whether each keyword decoding result is correct. Whether the confidence score is computed reasonably directly affects which candidate keywords are accepted or rejected. If the confidence score does not truly reflect the retrieval situation, keyword retrieval errors readily occur. Therefore, unlike traditional confidence-based filtering, which uses a single confidence measure, the embodiments of the present invention compute the confidence of each keyword decoding result from several different angles and fuse the confidences computed in these different ways to obtain an effective confidence for each candidate keyword, so that confidence-based filtering becomes more accurate and reasonable.

In the embodiments of the present invention, the confidence computed for a candidate keyword from the log-likelihood ratio serves as the basis and is supplemented by new, targeted confidence scores. Fusing these confidences suppresses the confidence scores of false-alarm errors, so that confidence-based filtering becomes more accurate and reasonable and system performance improves.

The computation of the confidence of a candidate keyword based on the log-likelihood ratio is similar to the prior art and is roughly as follows:

According to hypothesis testing theory, the likelihood ratio is defined as the ratio of the probability of a given observation under hypothesis H1 (it belongs to a certain probability distribution) to its probability under hypothesis H0 (it does not belong to that distribution). Because probability distributions are usually assumed to take an exponential form, the log-likelihood ratio is generally used in place of the likelihood ratio for computational convenience. In language-independent keyword recognition, let the features of a candidate segment identified by decoding be O, the corresponding keyword model be λ_{hmm}, and the Filler model be λ_{filler}. The log-likelihood ratio score is then defined as:

S_{llr} = \frac{1}{T} \log \frac{P(O | λ_{hmm})}{P(O | λ_{filler})}

The log-likelihood ratio reflects the confidence that the features of the current candidate segment belong to λ_{hmm}.
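As a non-limiting illustration of the log-likelihood ratio score above, the following sketch computes the score from per-frame log-likelihoods. It assumes that T is the number of frames in the candidate segment and that the log-likelihood of the whole segment under each model is the sum of its per-frame log-likelihoods along the decoding path; neither assumption is stated in the source, and the function name `llr_confidence` is hypothetical.

```python
def llr_confidence(frame_loglik_keyword, frame_loglik_filler):
    """Length-normalized log-likelihood ratio of a candidate segment.

    frame_loglik_keyword: per-frame log-likelihoods under the keyword model
    frame_loglik_filler : per-frame log-likelihoods under the Filler model
    Both lists cover the same T frames (T assumed to be the segment length).
    """
    T = len(frame_loglik_keyword)
    if T == 0 or T != len(frame_loglik_filler):
        raise ValueError("both inputs must cover the same non-empty frame range")
    # log(P_kw / P_filler) = log P_kw - log P_filler, then divide by T
    return (sum(frame_loglik_keyword) - sum(frame_loglik_filler)) / T
```

Working with the difference of log-likelihoods avoids forming the ratio of two very small probabilities directly.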

The embodiments of the present invention also propose the following two new ways of computing confidence:

(1) computing the confidence of the candidate keyword based on wVector correlation;

(2) computing the confidence of the candidate keyword based on a state frame variance score.

The computation of these two new confidences is described in detail later.

Step 204: fuse the confidence evaluation results of the different ways to obtain the effective confidence of the candidate keyword.

It should be noted that, in practical applications, the confidence computed from the log-likelihood ratio can be fused with either one of the two confidences above, or with both of them at once; the embodiments of the present invention place no limitation on this.

For example, suppose the confidence based on the log-likelihood ratio score of the recognized speech segment on the keyword model is S_{llr}, the confidence based on state frame variance is S_{var\_frame}, and the confidence score based on wVector correlation is S_{wvec}.

In the embodiments of the present invention, a weighted average can be used to fuse these confidence scores.

S_{llr} and S_{var\_frame} are fused first, and the result is then fused with S_{wvec}. The fusion formula is:

S_{final} = (1-β)(S_{llr} + α S_{var\_frame}) + β (S_{wvec} - μ) / σ

Here, the term S_{llr} + α S_{var\_frame} treats the state frame variance as an additional term of the likelihood ratio score (S_{var\_frame} discriminates relatively weakly, so it is better suited as an additive term), and μ and σ are introduced to normalize S_{wvec} to the same scale as S_{llr} + α S_{var\_frame}.
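As a non-limiting illustration, the fusion formula above can be written directly as a small function. The function name `fuse_confidence` and its parameter names are hypothetical, and the source does not say how α, β, μ, and σ are chosen; they are assumed here to be tuning constants supplied by the caller.

```python
def fuse_confidence(s_llr, s_var_frame, s_wvec, alpha, beta, mu, sigma):
    """Weighted fusion of the three confidence scores.

    alpha      : weight of the state frame variance as an additive term
    beta       : interpolation weight between the two fused parts
    mu, sigma  : shift and scale that bring s_wvec to the scale of the first part
    """
    base = s_llr + alpha * s_var_frame       # likelihood ratio plus additive term
    normalized_wvec = (s_wvec - mu) / sigma  # rescaled wVector score
    return (1 - beta) * base + beta * normalized_wvec
```

With beta set to 0 the result reduces to the likelihood-ratio part alone, and with beta set to 1 it reduces to the normalized wVector score alone.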

Of course, other fusion approaches can also be used in practical applications; the embodiments of the present invention place no limitation on this.

Step 205: determine the keywords to output according to the effective confidence.

For example, when the fused effective confidence of a candidate keyword is higher than a set threshold, that candidate keyword can be output.
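As a non-limiting illustration of the thresholding in step 205, the following sketch keeps only the candidates whose effective confidence exceeds a threshold. The tuple layout and the function name `select_keywords` are assumptions made for illustration.

```python
def select_keywords(candidates, threshold):
    """Keep candidates whose effective confidence is above the threshold.

    candidates: iterable of (keyword, start_frame, end_frame, effective_confidence)
    """
    return [c for c in candidates if c[3] > threshold]
```

The threshold itself would typically be tuned on held-out data to trade missed detections against false alarms; the source does not prescribe a value.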

With the language-independent keyword recognition method provided by the embodiments of the present invention, after the keyword decoding results, that is, the candidate keywords, are obtained from the decoding network, the candidate keywords are evaluated for confidence in several different ways, and the confidence evaluation results of the different ways are fused to determine the confidence of each candidate keyword. The reasonableness of each keyword decoding result is judged from this fused confidence, so that confidence-based filtering of keyword decoding results becomes more accurate and reasonable, which effectively improves system performance.

As mentioned above, the embodiments of the present invention use several different ways to determine the confidence of a keyword decoding result. Each is described in detail below.

Fig. 3 shows the flow of confidence computation based on wVector correlation in an embodiment of the present invention.

For false alarms in which the speech segment corresponding to the recognition result sounds quite different from the keyword training sample speech segments, a Gaussian mixture model (GMM) can be built for the keyword training sample speech segments and for the decoded candidate keyword speech segment respectively, and false alarms can then be controlled by computing the KL distance (Kullback-Leibler divergence) between the two Gaussian mixture models.

To preserve the correspondence between the Gaussian components of the mixture models, the parameters of the keyword GMM model and the candidate keyword GMM model can be estimated from the universal background model (UBM) using maximum a posteriori (MAP) estimation.
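As a non-limiting illustration of deriving a GMM from the universal background model by MAP estimation, the following sketch adapts only the component means of a one-dimensional GMM. The source does not give the update rule; the relevance-factor mean update shown here is the commonly used form and is an assumption, as are the function names, the `(weight, mean, variance)` tuple layout, and the default relevance value.

```python
import math


def gauss_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def map_adapt_means(ubm, frames, relevance=16.0):
    """MAP-adapt the means of a 1-D GMM; weights and variances are kept.

    ubm    : list of (weight, mean, variance) tuples
    frames : list of scalar features from one speech segment
    """
    n = [0.0] * len(ubm)   # soft frame count per component
    sx = [0.0] * len(ubm)  # posterior-weighted sum of frames per component
    for x in frames:
        logp = [math.log(w) + gauss_logpdf(x, m, v) for w, m, v in ubm]
        mx = max(logp)
        post = [math.exp(lp - mx) for lp in logp]
        z = sum(post)
        for k, p in enumerate(post):
            n[k] += p / z
            sx[k] += p / z * x
    adapted = []
    for k, (w, m, v) in enumerate(ubm):
        a = n[k] / (n[k] + relevance)         # more data moves the mean further
        ex = sx[k] / n[k] if n[k] > 0 else m  # data mean seen by component k
        adapted.append((w, a * ex + (1 - a) * m, v))
    return adapted
```

Because every adapted model keeps the component order of the UBM, component k of the keyword model and component k of the candidate model descend from the same UBM component, which is the correspondence referred to above.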

The specific computation, shown in Fig. 3, includes the following steps:

Step 1: train a universal background model from a large amount of real data in the relevant language.

Step 2: from the training sample speech segments of each keyword, train the corresponding keyword GMM model.

Specifically, the MAP algorithm can be used to adapt the pre-estimated universal background model to obtain a GMM model that depends on the keyword text, referred to for convenience as the keyword GMM model.

Step 3: obtain the speech segment of each candidate keyword from the decoding result path for the speech in the language to be detected, and adapt the pre-estimated universal background model with the MAP algorithm to obtain a text-dependent GMM model for the candidate keyword speech segment, referred to for convenience as the candidate keyword GMM model.

Step 4: compute the KL distance between the keyword GMM model and the candidate keyword GMM model.

Suppose the probability distributions represented by the keyword GMM model and the candidate keyword GMM model are f(x) and g(x) respectively. The KL distance is then defined as:

D(f \| g) = \int f(x) \log \frac{f(x)}{g(x)} \, dx

In practice, the KL distance can be computed with methods such as Monte Carlo sampling.
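As a non-limiting illustration of the Monte Carlo computation mentioned above, the following sketch draws samples from f and averages log f(x) minus log g(x). It uses one-dimensional GMMs stored as lists of `(weight, mean, variance)` tuples; this layout, the function names, the sample count, and the fixed seed are all assumptions made for illustration.

```python
import math
import random


def gauss_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def gmm_logpdf(x, gmm):
    """Log density of a 1-D GMM, computed with the log-sum-exp trick."""
    logs = [math.log(w) + gauss_logpdf(x, m, v) for w, m, v in gmm]
    mx = max(logs)
    return mx + math.log(sum(math.exp(l - mx) for l in logs))


def gmm_sample(gmm, rng):
    """Draw one sample: pick a component by weight, then sample from it."""
    _w, m, v = rng.choices(gmm, weights=[c[0] for c in gmm])[0]
    return rng.gauss(m, math.sqrt(v))


def kl_monte_carlo(f, g, n_samples=5000, seed=0):
    """Estimate D(f || g) as the sample mean of log f(x) - log g(x), x ~ f."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = gmm_sample(f, rng)
        total += gmm_logpdf(x, f) - gmm_logpdf(x, g)
    return total / n_samples
```

A larger estimated distance indicates that the candidate segment is less similar to the keyword samples. How the distance is mapped onto a score where higher means more reliable is not specified in the source.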

Further, considering that the recognized speech segment is very short, the text-dependent model trained for the test environment (that is, the keyword GMM model described above) may not be accurate enough. For this case, the embodiments of the present invention also propose an alternative scheme, which includes the following steps:

Step 2: compute the likelihood of the keyword training sample speech segments on each Gaussian component of the universal background model to form a keyword pronunciation model.

Because each Gaussian component of the universal background model physically represents a different pronunciation unit, the distribution of the training sample speech segment's likelihood over the Gaussians characterizes the pronunciation of the keyword. The vector of likelihoods on the different Gaussian components forms the pronunciation model of the keyword.

Step 3: for the speech segment of each candidate keyword, likewise compute its likelihood on each Gaussian component of the universal background model to form a candidate keyword pronunciation model.

Step 4: compute the correlation between the keyword pronunciation model and the candidate keyword pronunciation model, and use this correlation as the measure of confidence.

Because the pronunciation model formed from the likelihoods on the Gaussians of the universal background model resembles the weights in the keyword GMM model described above, this method can be called the confidence computation method based on wVector correlation.
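As a non-limiting illustration of the alternative scheme above, the following sketch builds a pronunciation vector by averaging, over the frames of a segment, the normalized likelihood of each universal background model component, and then compares two such vectors. The source does not specify the normalization or the correlation measure; per-frame normalization and the Pearson correlation coefficient are assumptions, as are the function names and the one-dimensional `(weight, mean, variance)` layout.

```python
import math


def gauss_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def wvector(ubm, frames):
    """Average normalized per-component likelihood of a segment on the UBM."""
    acc = [0.0] * len(ubm)
    for x in frames:
        logp = [math.log(w) + gauss_logpdf(x, m, v) for w, m, v in ubm]
        mx = max(logp)
        lik = [math.exp(lp - mx) for lp in logp]
        z = sum(lik)
        for k, l in enumerate(lik):
            acc[k] += l / z  # share of this frame explained by component k
    return [a / len(frames) for a in acc]


def pearson(u, v):
    """Pearson correlation of two equal-length vectors (0.0 if degenerate)."""
    mu_u, mu_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    denom = math.sqrt(var_u * var_v)
    return cov / denom if denom > 0 else 0.0
```

Each vector sums to one, much like a set of mixture weights, which matches the resemblance to GMM weights noted above. No model is trained on the candidate segment, so the scheme remains usable when that segment is very short.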

Fig. 4 shows a flow of confidence computation based on a state frame variance score in an embodiment of the present invention.

Here, the state frame variance is the sample variance of the set of speech frames assigned to each state of the keyword HMM model, given the speech segment corresponding to the candidate keyword.

In false-alarm errors caused by partial pronunciation matches, the frames of the states that can be matched tend to dominate, and correspondingly the sample variance of the speech frames across states is usually abnormal (very large). Detecting the state frame variance can therefore suppress false-alarm errors caused by partial matches. The computation flow, shown in Fig. 4, includes the following steps:

Step 1: obtain the speech segment corresponding to the candidate keyword;

Step 2: perform forced alignment on the keyword model to obtain the number of speech frames of the speech segment contained in each state;

Step 3: from the number of speech frames in each state, compute the variance of the speech frame counts as the confidence of the candidate keyword.
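As a non-limiting illustration of the three steps above, the following sketch takes a forced alignment expressed as one state index per frame, counts the frames in each state, and returns the variance of those counts. The alignment format, the use of the population variance, and the function name `state_frame_count_variance` are assumptions made for illustration; the forced alignment itself is assumed to be produced elsewhere.

```python
from collections import Counter
from statistics import pvariance


def state_frame_count_variance(alignment, n_states):
    """Variance of the number of frames assigned to each keyword HMM state.

    alignment: list with the state index of every frame in the segment
    n_states : number of states in the keyword model (empty states count as 0)
    """
    counts = Counter(alignment)
    per_state = [counts[s] for s in range(n_states)]
    return pvariance(per_state)
```

An evenly spread alignment such as two frames in each of three states gives a variance of zero, whereas an alignment in which one state absorbs most of the frames gives a large variance, the pattern associated above with partial matches.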

Fig. 5 shows another flow of confidence computation based on a state frame variance score in an embodiment of the present invention, which includes the following steps:

Step 1: obtain the speech segment corresponding to the candidate keyword and the speech frames in each state on the keyword model;

Step 2: compute the sample variance of the speech frames in each state;

Step 3: combine the sample variances of the speech frames in all states to obtain an overall state sample variance, and use the overall state sample variance as the confidence of the candidate keyword.
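As a non-limiting illustration of the flow of Fig. 5, the following sketch groups the feature vectors of a segment by aligned state, computes a variance per state, and combines the per-state values into one number. The source does not say how the per-state variances are combined; the frame-count-weighted mean used here is an assumption, as are averaging the variance over feature dimensions, the use of the population variance (so that single-frame states are handled), and the function names.

```python
from collections import defaultdict
from statistics import pvariance


def per_state_variance(alignment, frames):
    """Variance of the feature vectors in each state, averaged over dimensions.

    alignment: state index of every frame
    frames   : feature vector (list of floats) of every frame
    """
    by_state = defaultdict(list)
    for s, x in zip(alignment, frames):
        by_state[s].append(x)
    result = {}
    for s, xs in by_state.items():
        dims = list(zip(*xs))  # one tuple of values per feature dimension
        result[s] = sum(pvariance(d) for d in dims) / len(dims)
    return result


def overall_state_variance(alignment, frames):
    """Frame-count-weighted mean of the per-state variances."""
    variances = per_state_variance(alignment, frames)
    total = len(alignment)
    return sum(variances[s] * alignment.count(s) for s in variances) / total
```

Weighting by frame count lets states that hold many frames contribute proportionally more to the overall value; an unweighted mean over states would be an equally plausible reading of the text.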

Correspondingly, the embodiments of the present invention also provide a language-independent keyword recognition system. Fig. 6 is a schematic structural diagram of this system.

The system includes:

a receiving module 601, configured to receive a speech signal to be detected;

a decoding module 602, configured to decode the speech signal according to a pre-built decoding network to obtain candidate keywords;

a confidence evaluation module 603, configured to evaluate the confidence of the candidate keywords in different ways;

a fusion module 604, configured to fuse the confidence evaluation results of the different ways to obtain an effective confidence of the candidate keywords;

an output module 605, configured to determine the keywords to output according to the effective confidence.

In this embodiment, the confidence evaluation module 603 includes a first evaluation module, and further includes a second evaluation module and/or a third evaluation module, where:

the first evaluation module is configured to compute the confidence of the candidate keywords based on the log-likelihood ratio;

the second evaluation module is configured to compute the confidence of the candidate keywords based on wVector correlation;

the third evaluation module is configured to compute the confidence of the candidate keywords based on a state frame variance score.

For the way each of these evaluation modules computes the confidence of a candidate keyword, refer to the description in the method embodiments above.
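As a non-limiting illustration of how the modules above fit together, the following sketch wires a decoder, a list of confidence evaluators, a fusion function, and a threshold into one object. The class names `Candidate` and `KeywordSpotter`, the callable signatures, and the threshold-based output rule are assumptions made for illustration; the actual modules may be organized differently.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Candidate:
    keyword: str
    start: int  # first frame of the candidate segment
    end: int    # last frame of the candidate segment


class KeywordSpotter:
    """Decoder, evaluators, fusion, and output combined in one pipeline."""

    def __init__(
        self,
        decode: Callable[[Sequence[float]], List[Candidate]],
        evaluators: List[Callable[[Sequence[float], Candidate], float]],
        fuse: Callable[[List[float]], float],
        threshold: float,
    ):
        self.decode = decode          # role of the decoding module
        self.evaluators = evaluators  # first, second, and/or third evaluators
        self.fuse = fuse              # role of the fusion module
        self.threshold = threshold    # used for the output decision

    def run(self, signal: Sequence[float]) -> List[Tuple[Candidate, float]]:
        results = []
        for cand in self.decode(signal):
            scores = [evaluate(signal, cand) for evaluate in self.evaluators]
            confidence = self.fuse(scores)
            if confidence > self.threshold:
                results.append((cand, confidence))
        return results
```

Passing the evaluators as a list mirrors the "and/or" structure of the confidence evaluation module: the log-likelihood ratio evaluator can be combined with either or both of the other two without changing the rest of the pipeline.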

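It should be noted that, in practical applications, the second evaluation module and the third evaluation module described above can be implemented in several ways, for example:

One implementation of the second evaluation module may include: a background model training unit, a keyword model training unit, a candidate keyword model training unit, and a distance calculation unit.

Another implementation of the second evaluation module may include: a keyword pronunciation model construction unit, a candidate keyword pronunciation model construction unit, and a correlation calculation unit.

One implementation of the third evaluation module may include: a speech segment acquisition unit, a segmentation unit, and a speech frame variance statistics unit.

Another implementation of the third evaluation module may include: a speech frame acquisition unit, a sample variance statistics unit, and a combination unit.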

In addition, it should be noted that, in practical applications, the fusion module 604 may fuse the confidence calculated based on the log-likelihood ratio with either one of the other two confidences described above, or with both of them at the same time; the embodiment of the present invention places no limitation on this. Specifically, the confidence scores may be fused by weighted averaging.
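The weighted-average fusion can be sketched as below. The function name and the example weights are hypothetical; the text does not say how the weights are chosen. Because the individual scores may lie on different scales, this sketch assumes they have already been normalized to a comparable range.

```python
def fuse_confidences(scores, weights):
    """Weighted average of confidence scores from different evaluation modes.

    Assumption: `scores` are already on a comparable scale, and `weights`
    are non-negative tuning parameters chosen by the implementer.
    """
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and equal in length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total_weight
```

Fusing the log-likelihood-ratio confidence with only one other confidence is then a call with two scores, and fusing all three is a call with three.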

With the language-independent keyword recognition system provided by the embodiment of the present invention, after keyword decoding results are obtained according to the decoding network, confidence evaluation is performed on the keyword decoding results in several different ways, and the confidence evaluation results of the different ways are fused to determine the confidence of each keyword decoding result. The plausibility of each keyword decoding result is judged from this confidence, so that confidence-based filtering of the keyword decoding results is more accurate and reasonable, which effectively improves system performance.

The embodiments in this specification are described in a progressive manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiment is substantially similar to the method embodiment, it is described relatively briefly; for the relevant parts, refer to the description of the method embodiment. The system embodiment described above is merely illustrative: the modules or units described as separate components may or may not be physically separate, and the components shown as modules or units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement it without creative effort.

The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the system according to the embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program or a computer program product) for performing part or all of the method described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

The embodiments of the present invention have been described in detail above, and specific implementations are used herein to explain the present invention. The description of the above embodiments is only intended to help in understanding the method and apparatus of the present invention. Meanwhile, for a person of ordinary skill in the art, both the specific implementations and the scope of application may vary in accordance with the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A language-independent keyword recognition method, characterized by comprising:

receiving a speech signal to be detected;

performing confidence evaluation on the candidate keyword in different ways, wherein performing confidence evaluation on the candidate keyword in different ways comprises: calculating the confidence of the candidate keyword based on the log-likelihood ratio; and further comprises: calculating the confidence of the candidate keyword based on wVector correlation, and/or calculating the confidence of the candidate keyword based on the state frame variance score;

determining the keyword to output according to the effective confidence;

wherein calculating the confidence of the candidate keyword based on wVector correlation comprises:

training a universal background model;

calculating the KL distance between the keyword GMM and the candidate keyword GMM, and using the KL distance as the confidence of the candidate keyword;

or, calculating the confidence of the candidate keyword based on wVector correlation comprises:

training a universal background model;

calculating the correlation between the keyword pronunciation model and the candidate keyword pronunciation model, and using the correlation as the confidence of the candidate keyword;

wherein calculating the confidence of the candidate keyword based on the state frame variance score comprises:

obtaining the speech segment corresponding to the candidate keyword;

computing the variance of the speech frames according to the number of speech frames in each state, and using the variance as the confidence of the candidate keyword;

or, calculating the confidence of the candidate keyword based on the state frame variance score comprises:

computing the sample variance of the speech frames in each state;

2. A language-independent keyword recognition system, characterized by comprising:

a receiving module, configured to receive a speech signal to be detected;

an output module, configured to determine the keyword to output according to the effective confidence;

the confidence evaluation module comprises:

the confidence evaluation module further comprises:

a third evaluation module, configured to calculate the confidence of the candidate keyword based on the state frame variance score;

wherein the second evaluation module comprises:

a distance calculation unit, configured to calculate the KL distance between the keyword GMM and the candidate keyword GMM, and to use the KL distance as the confidence of the candidate keyword;

or, the second evaluation module comprises:

a correlation calculation unit, configured to calculate the correlation between the keyword pronunciation model and the candidate keyword pronunciation model, and to use the correlation as the confidence of the candidate keyword;

wherein the third evaluation module comprises:

a speech frame variance statistics unit, configured to compute the variance of the speech frames according to the number of speech frames in each state, and to use the variance as the confidence of the candidate keyword;

or, the third evaluation module comprises: