CN103559881B - Language-independent keyword recognition method and system - Google Patents

Language-independent keyword recognition method and system

Info

Publication number
CN103559881B
CN103559881B CN201310553073.9A CN201310553073A
Authority
CN
China
Prior art keywords
candidate keywords
keyword
model
confidence level
confidence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310553073.9A
Other languages
Chinese (zh)
Other versions
CN103559881A (en)
Inventor
刘俊华
魏思
胡国平
胡郁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Technological University Xunfei Hebei Technology Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201310553073.9A priority Critical patent/CN103559881B/en
Publication of CN103559881A publication Critical patent/CN103559881A/en
Application granted granted Critical
Publication of CN103559881B publication Critical patent/CN103559881B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a language-independent keyword recognition method and system. The method includes: receiving a speech signal to be detected; decoding the speech signal according to a pre-built decoding network to obtain candidate keywords; performing confidence evaluation on the candidate keywords in different ways; fusing the confidence evaluation results of the different ways to obtain an effective confidence of the candidate keywords; and determining the keywords to output according to the effective confidence.

Description

Language-independent keyword recognition method and system
Technical field
The present invention relates to the technical field of speech keyword recognition, and in particular to a language-independent keyword recognition method and system.
Background technology
Speech keyword recognition refers to judging, from a given speech file or data, whether the speech data contains certain specific keywords, and determining the position information where each keyword occurs. Current mainstream speech keyword recognition is mainly based on speech recognition technology: a speech recognizer for the language in question first transcribes the speech into text, and the keyword text and its position are then retrieved from the transcript. In this approach the user can easily define new keywords, so it has good extensibility. However, developing and training a speech recognizer requires building acoustic and language models for the corresponding language, so the approach cannot be applied when extended to other languages for which annotated training data is lacking.
In recent years, the public-safety field has had an increasingly urgent demand for keyword retrieval in rare foreign languages or dialects. Given that personnel skilled in a specific language are relatively limited and annotated data is scarce, the corresponding speech recognizer cannot be developed quickly, and traditional speech keyword recognition systems and methods therefore cannot be used for keyword retrieval. To this end, researchers have proposed language-independent keyword recognition, which builds keyword models from annotated keyword pronunciation samples and thereby constructs a speech keyword recognition system quickly and flexibly.
At present, language-independent keyword recognition is most commonly based on DTW (Dynamic Time Warping) or on keyword statistical model / Filler model decoding (HMM/Filler). The former first extracts the phonetic feature sequence of the keyword and compares it, block by block, against the features of the speech signal to be retrieved to find similar speech segments. This algorithm is computationally expensive, it is difficult to combine multiple keyword sample features effectively, its retrieval performance is unsatisfactory, and it is hard to extend to keyword recognition in continuous speech. The HMM/Filler approach builds a statistical model for the keywords and a Filler model for non-keywords; on the one hand, model-based training combines multiple keyword samples effectively, and on the other hand a dynamic search algorithm such as Viterbi decoding determines the optimal path of the speech to be detected through the search network built from these models, thereby determining the keyword positions. When the training data provide full coverage, that is, when the detection environment matches the training environment, this method tends to obtain good recognition results. In practical applications, however, because of the variability of noise, accent, and channel in the speech to be detected, the retrieved keywords are often not true keywords, i.e., the false-alarm rate is high, which degrades system performance.
Summary of the invention
The embodiments of the present invention provide a language-independent keyword recognition method and system, to reduce the false-alarm rate of keyword recognition and improve system performance.
To this end, the present invention provides the following technical solution:
A language-independent keyword recognition method, including:
receiving a speech signal to be detected;
decoding the speech signal according to a pre-built decoding network to obtain candidate keywords;
performing confidence evaluation on the candidate keywords in different ways;
fusing the confidence evaluation results of the different ways to obtain an effective confidence of the candidate keywords;
determining the keywords to output according to the effective confidence.
Preferably, performing confidence evaluation on the candidate keywords in different ways includes: calculating the confidence of the candidate keywords based on the log-likelihood ratio; and further includes: calculating the confidence of the candidate keywords based on the wVector correlation, and/or calculating the confidence of the candidate keywords based on the state-frame variance score.
Preferably, calculating the confidence of the candidate keywords based on the wVector correlation includes:
training a universal background model;
training a keyword GMM model according to the keyword training-sample speech segments and the universal background model;
obtaining the speech segment of a candidate keyword according to its corresponding path in the decoding network, and then training a candidate-keyword GMM model according to that speech segment and the universal background model;
calculating the KL distance between the keyword GMM model and the candidate-keyword GMM model, and taking the KL distance as the confidence of the candidate keyword.
Preferably, calculating the confidence of the candidate keywords based on the wVector correlation includes:
training a universal background model;
calculating the likelihood of the keyword training-sample speech segments on each Gaussian component of the universal background model, to form a keyword pronunciation model;
obtaining the speech segment of a candidate keyword according to its corresponding path in the decoding network, then calculating the likelihood of that speech segment on each Gaussian component of the universal background model, to form a candidate-keyword pronunciation model;
calculating the correlation between the keyword pronunciation model and the candidate-keyword pronunciation model, and taking the correlation as the confidence of the candidate keyword.
Preferably, calculating the confidence of the candidate keywords based on the state-frame variance score includes:
obtaining the speech segment corresponding to a candidate keyword;
performing forced segmentation on the keyword model to obtain the number of speech frames of the segment contained in each state;
computing, from the per-state speech-frame counts, the variance of the speech frames as the confidence of the candidate keyword.
Preferably, calculating the confidence of the candidate keywords based on the state-frame variance score includes:
obtaining the speech segment corresponding to a candidate keyword and the speech frames in each state on the keyword model;
computing the sample variance of the speech frames in each state;
combining the per-state sample variances into an overall state sample variance, and taking the overall state sample variance as the confidence of the candidate keyword.
A language-independent keyword recognition system, including:
a receiving module, configured to receive a speech signal to be detected;
a decoding module, configured to decode the speech signal according to a pre-built decoding network to obtain candidate keywords;
a confidence evaluation module, configured to perform confidence evaluation on the candidate keywords in different ways;
a fusion module, configured to fuse the confidence evaluation results of the different ways to obtain an effective confidence of the candidate keywords;
an output module, configured to determine the keywords to output according to the effective confidence.
Preferably, the confidence evaluation module includes:
a first evaluation module, configured to calculate the confidence of the candidate keywords based on the log-likelihood ratio;
and the confidence evaluation module further includes:
a second evaluation module, configured to calculate the confidence of the candidate keywords based on the wVector correlation; and/or
a third evaluation module, configured to calculate the confidence of the candidate keywords based on the state-frame variance score.
Preferably, the second evaluation module includes:
a background model training unit, configured to train a universal background model;
a keyword model training unit, configured to train a keyword GMM model according to the keyword training-sample speech segments and the universal background model;
a candidate-keyword model training unit, configured to obtain the speech segment of a candidate keyword according to its corresponding path in the decoding network, and then train a candidate-keyword GMM model according to that speech segment and the universal background model;
a distance calculation unit, configured to calculate the KL distance between the keyword GMM model and the candidate-keyword GMM model, and take the KL distance as the confidence of the candidate keyword.
Preferably, the second evaluation module includes:
a background model training unit, configured to train a universal background model;
a keyword pronunciation model construction unit, configured to calculate the likelihood of the keyword training-sample speech segments on each Gaussian component of the universal background model, to form a keyword pronunciation model;
a candidate-keyword pronunciation model construction unit, configured to obtain the speech segment of a candidate keyword according to its corresponding path in the decoding network, then calculate the likelihood of that speech segment on each Gaussian component of the universal background model, to form a candidate-keyword pronunciation model;
a correlation calculation unit, configured to calculate the correlation between the keyword pronunciation model and the candidate-keyword pronunciation model, and take the correlation as the confidence of the candidate keyword.
Preferably, the third evaluation module includes:
a speech segment acquisition unit, configured to obtain the speech segment corresponding to a candidate keyword;
a segmentation unit, configured to perform forced segmentation on the keyword model, obtaining the number of speech frames of the segment contained in each state;
a speech-frame variance statistics unit, configured to compute, from the per-state speech-frame counts, the variance of the speech frames as the confidence of the candidate keyword.
Preferably, the third evaluation module includes:
a speech frame acquisition unit, configured to obtain the speech segment corresponding to a candidate keyword and the speech frames in each state on the keyword model;
a sample variance statistics unit, configured to compute the sample variance of the speech frames in each state;
a combination unit, configured to combine the per-state sample variances into an overall state sample variance, and take the overall state sample variance as the confidence of the candidate keyword.
In the language-independent keyword recognition method and system provided by the embodiments of the present invention, after the keyword decoding results are obtained from the decoding network, confidence evaluation is performed on them in several different ways, and the evaluation results of the different ways are fused to determine the confidence of each keyword decoding result. The reasonableness of each decoding result is judged by this confidence, so that confidence-based filtering of the keyword decoding results is more accurate and reasonable, effectively improving system performance.
Accompanying drawing explanation
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can also obtain other drawings from them.
Fig. 1 is a flowchart of a prior-art keyword recognition method with confidence-based filtering;
Fig. 2 is a flowchart of the language-independent keyword recognition method of an embodiment of the present invention;
Fig. 3 is a flowchart of wVector-correlation-based confidence calculation in an embodiment of the present invention;
Fig. 4 is a flowchart of one state-frame-variance-based confidence calculation in an embodiment of the present invention;
Fig. 5 is a flowchart of another state-frame-variance-based confidence calculation in an embodiment of the present invention;
Fig. 6 is a structural diagram of the language-independent keyword recognition system of an embodiment of the present invention.
Detailed description of the invention
To enable those skilled in the art to better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the drawings.
Under the HMM/Filler framework, HMM training based on the MLE (Maximum Likelihood Estimation) criterion and efficient decoding algorithms such as Viterbi and WFST (Weighted Finite State Transducer) give keyword statistical model / Filler model decoding good operability and generality in practice. In real environments, however, the speech signal to be detected is often affected by factors such as noise, channel, and regional accent, so the keywords retrieved by direct decoding tend to have a high false-alarm rate, degrading system performance. For this reason, existing HMM/Filler systems usually suppress false alarms by confidence-based filtering after decoding.
As shown in Fig. 1, the flowchart of the prior-art keyword recognition method with confidence-based filtering includes the following steps:
Step 101: train the keyword HMM models and the Filler model separately.
Step 102: build the decoding network from these models.
Step 103: for the received speech signal to be detected, search for the optimal path in the decoding network, and determine the speech segments corresponding to the keyword models and their positions in the speech signal.
Step 104: score the confidence of the obtained keyword decoding results, including the speech segments corresponding to the keywords and the decoding path scores, to confirm the reasonableness of the keyword retrieval results.
Step 105: output the recognition results.
It can be seen from the above flow that whether the confidence score is computed reasonably directly determines whether a keyword retrieval result is accepted or rejected. The higher the confidence score, the more reliable the retrieved keyword; conversely, if the confidence score cannot truly reflect the retrieval situation, keyword retrieval errors are easily caused.
At present, confidence calculation for language-independent keywords generally uses scores such as duration and the log-likelihood ratio. These confidence calculation methods depend on the decoding-path result; they can obtain good results in matched conditions, but in complex real application environments a single confidence that depends on the decoding-path result can hardly filter false alarms effectively.
Analysis of existing HMM/Filler systems shows that the keyword retrieval results obtained after confidence-based filtering still contain false-alarm errors (speech segments that are not keywords are retrieved as keywords), mainly for the following two reasons:
1. The recognition result and the training samples sound quite different, i.e., the test environment differs greatly from the training environment.
2. The pronunciation of the recognition result partially matches the training samples; for example, the recognition result is a word whose pronunciation partially overlaps that of the keyword sample. In such partial-match cases the acoustic-model score tends to be high, causing false-alarm errors.
In traditional filtering methods based on decoding-path scores such as the log-likelihood ratio, such partial matches yield overly high confidence scores and thus a higher false-alarm rate, harming system performance and the user's subjective experience.
Based on the above analysis of the causes of false-alarm errors in the keyword retrieval results of existing HMM/Filler systems, the embodiments of the present invention provide a language-independent keyword recognition method that makes confidence-based filtering more accurate and reasonable, thereby improving the performance of keyword recognition systems.
As shown in Fig. 2, the flowchart of the language-independent keyword recognition method of an embodiment of the present invention includes the following steps:
Step 201: receive the speech signal to be detected.
Step 202: decode the speech signal according to the pre-built decoding network to obtain candidate keywords.
The decoding network may be built from the keyword models and the Filler model. The training of the keyword and Filler models and the construction of the decoding network may use existing training and construction methods, which this embodiment of the present invention does not limit.
The decoding process mainly searches, for the received speech signal to be detected, for the optimal path through the decoding network, determining the speech segments corresponding to the keyword models and their positions in the speech signal.
Step 203: perform confidence evaluation on the candidate keywords in different ways.
The purpose of confidence evaluation of the candidate keywords is to determine the correctness of each keyword decoding result. Whether the confidence score is computed reasonably directly affects the acceptance or rejection of each candidate keyword; if the confidence score cannot truly reflect the retrieval situation, keyword retrieval errors are easily caused. Therefore, unlike traditional confidence-based filtering methods that use a single confidence, in the embodiments of the present invention multiple methods are used to compute the confidence of each keyword decoding result from different perspectives, and the confidences computed in these different ways are fused to obtain an effective confidence for each candidate keyword, making confidence-based filtering more accurate and reasonable.
In the embodiments of the present invention, on the basis of the log-likelihood-ratio confidence of the candidate keywords, new targeted confidence score calculations are added, and confidence fusion suppresses the confidence scores of false-alarm errors, so that confidence-based filtering is more accurate and reasonable, thereby improving system performance.
The log-likelihood-ratio confidence of a candidate keyword is computed much as in the prior art, roughly as follows:
According to hypothesis-testing theory, the likelihood ratio is defined as the ratio of the probability of a given observation under hypothesis H1 (that it belongs to a certain probability distribution) to its probability under hypothesis H0 (that it does not). Since probability distributions are usually assumed to be exponential in form, for computational convenience the log-likelihood ratio is generally used instead of the likelihood ratio. In language-independent keyword recognition, let the decoded candidate-segment features be O, the corresponding keyword model be λ_hmm, and the Filler model be λ_filler; the log-likelihood-ratio score is then defined as:
S_llr = (1/T) · log( P(O | λ_hmm) / P(O | λ_filler) )
The log-likelihood ratio reflects the confidence that the current candidate-segment features belong to λ_hmm.
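The score above can be sketched as follows. This is a minimal illustration only: the function name and the per-frame log-likelihood inputs are assumptions, since in a real system these values come from the HMM/Filler decoder rather than from hand-written lists.

```python
def llr_confidence(kw_frame_loglik, filler_frame_loglik):
    """Length-normalized log-likelihood-ratio score S_llr.

    kw_frame_loglik / filler_frame_loglik: per-frame log-likelihoods of the
    candidate segment under the keyword model and the Filler model
    (hypothetical inputs; a real system obtains these from the decoder).
    """
    assert len(kw_frame_loglik) == len(filler_frame_loglik) > 0
    T = len(kw_frame_loglik)
    # log of the ratio = difference of the summed log-likelihoods,
    # normalized by the segment length T
    return (sum(kw_frame_loglik) - sum(filler_frame_loglik)) / T

# A segment the keyword model explains better than the Filler model
# receives a positive score:
score = llr_confidence([-4.0, -3.5, -4.2], [-5.0, -4.5, -5.2])
```

The length normalization by T keeps scores comparable across candidate segments of different durations.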
The embodiments of the present invention also propose the following two new confidence calculation methods, namely:
(1) calculating the confidence of a candidate keyword based on the wVector correlation;
(2) calculating the confidence of a candidate keyword based on the state-frame variance score.
The calculation processes of these two new confidences are described in detail later.
Step 204: fuse the confidence evaluation results of the different ways to obtain the effective confidence of the candidate keywords.
It should be noted that, in practical applications, the log-likelihood-ratio confidence of a candidate keyword may be fused with either one of the above two new confidences, or with both of them at the same time; this embodiment of the present invention does not limit this.
For example, suppose the log-likelihood-ratio confidence of the recognition-result speech segment on the keyword model is S_llr, the state-frame-variance confidence is S_var_frame, and the wVector-correlation confidence is S_wvec.
In the embodiments of the present invention, a weighted-average method may be used to fuse the above confidence scores.
First S_llr and S_var_frame are fused, and the result is then fused with S_wvec; the fusion formula is:
S_final = (1-β)(S_llr + α·S_var_frame) + β·(S_wvec - μ)/σ
where S_llr + α·S_var_frame treats the state-frame variance as an additional term of the likelihood-ratio score (S_var_frame is weakly discriminative on its own, so it is suitable as an additive term), and μ and σ are introduced to normalize S_wvec to the same level as S_llr + α·S_var_frame.
Of course, in practical applications other fusion methods may also be used; this embodiment of the present invention does not limit this.
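The weighted fusion can be sketched directly from the formula. The parameter values below (α, β, μ, σ, and the acceptance threshold) are illustrative assumptions only; the patent gives no concrete settings, so in practice they would be tuned on development data.

```python
def fuse_confidences(s_llr, s_var_frame, s_wvec,
                     alpha=0.1, beta=0.5, mu=0.0, sigma=1.0):
    """S_final = (1-beta)(S_llr + alpha*S_var_frame) + beta*(S_wvec - mu)/sigma.

    mu and sigma shift and scale S_wvec to the same level as the
    S_llr + alpha*S_var_frame term; all parameter values are assumed.
    """
    return (1 - beta) * (s_llr + alpha * s_var_frame) + beta * (s_wvec - mu) / sigma

def accept(s_final, threshold=0.5):
    # A candidate keyword is output only if its fused (effective)
    # confidence exceeds a preset threshold (threshold value assumed).
    return s_final > threshold
```

With s_llr=1.0, s_var_frame=2.0, s_wvec=3.0 and the defaults, this yields S_final = 0.5·1.2 + 0.5·3.0 = 2.1, which would pass the illustrative threshold.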
Step 205: determine the keywords to output according to the effective confidence.
For example, when the fused effective confidence of a candidate keyword is higher than a set threshold, that candidate keyword can be output.
In the language-independent keyword recognition method provided by the embodiments of the present invention, after the keyword decoding results, i.e., the candidate keywords, are obtained from the decoding network, confidence evaluation is performed on them in several different ways, and the evaluation results of the different ways are fused to determine the confidence of each candidate keyword. The reasonableness of each keyword decoding result is judged by this confidence, so that confidence-based filtering of the decoding results is more accurate and reasonable, effectively improving system performance.
As noted above, the embodiments of the present invention use multiple different methods to determine the confidence of the keyword decoding results; each is described in detail below.
As shown in Fig. 3, the flow of wVector-correlation-based confidence calculation in an embodiment of the present invention is as follows.
For false-alarm problems where the speech segment corresponding to the recognition result sounds quite different from the keyword training-sample speech segments, a Gaussian mixture model (GMM) can be built for the keyword training-sample speech segments and another for the decoded candidate-keyword speech segment, and false alarms can then be controlled by computing the KL distance (Kullback-Leibler divergence) between the two GMMs.
To keep the correspondence between the Gaussian components of the mixture models, the parameters of the keyword GMM model and the candidate-keyword GMM model can both be estimated by maximum a posteriori (MAP) adaptation from a universal background model (UBM).
The concrete calculation flow, shown in Fig. 3, includes the following steps:
Step 1: train a universal background model on a large amount of real data from relevant languages.
Step 2: train, from the training-sample speech segments of each keyword, a GMM model for that keyword.
Specifically, the MAP algorithm can be used to adapt the pre-estimated universal background model, obtaining a GMM related to the keyword text (for convenience, called the keyword GMM model).
Step 3: obtain the speech segment of each candidate keyword from the decoding-result path in the language to be detected, and use the MAP algorithm to adapt the pre-estimated universal background model, obtaining a GMM related to the text of the candidate-keyword speech segment (for convenience, called the candidate-keyword GMM model).
Step 4: calculate the KL distance between the keyword GMM model and the candidate-keyword GMM model.
Suppose the probability distributions represented by the keyword GMM model and the candidate-keyword GMM model are f(x) and g(x) respectively; the KL distance is then defined as:
D(f || g) = ∫ f(x) log( f(x) / g(x) ) dx
In the concrete computation of the KL distance, since the KL divergence between two GMMs has no closed form, Monte Carlo methods and the like can be used.
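A Monte Carlo estimate of D(f || g) draws samples from f and averages log f(x) - log g(x). The sketch below is a toy 1-D version under stated assumptions: real keyword GMMs are multivariate and MAP-adapted from the UBM, and all function names here are illustrative.

```python
import math
import random

def gmm_logpdf(x, weights, means, sigmas):
    """Log-density of a 1-D Gaussian mixture."""
    p = sum(w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
            for w, m, s in zip(weights, means, sigmas))
    return math.log(p)

def gmm_sample(weights, means, sigmas, rng):
    # Pick a component by its weight, then draw from that Gaussian.
    i = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(means[i], sigmas[i])

def kl_monte_carlo(f, g, n=20000, seed=0):
    """Estimate D(f || g) = E_f[log f(x) - log g(x)] by sampling from f.

    f and g are (weights, means, sigmas) triples describing two GMMs.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = gmm_sample(*f, rng)
        total += gmm_logpdf(x, *f) - gmm_logpdf(x, *g)
    return total / n
```

For two unit-variance single Gaussians with means 0 and 1, the true KL divergence is 0.5, which the estimator approaches as n grows.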
Further, considering that the recognition-result speech segment may be very short, the keyword-text-related model obtained in the test environment (i.e., the aforementioned keyword GMM model) may not be accurate enough. For this, the embodiments of the present invention also propose an alternative scheme including the following steps:
Step 1: train a universal background model on a large amount of real data from relevant languages.
Step 2: calculate the likelihood of the keyword training-sample speech segments on each Gaussian component of the universal background model, forming the keyword pronunciation model.
Since each Gaussian component of the universal background model represents, in a physical sense, a different pronunciation unit, the distribution of the training-sample speech segment's likelihoods over the Gaussians characterizes the pronunciation of the keyword. The vector of likelihoods over the different Gaussian components thus forms the pronunciation model characterizing this keyword.
Step 3: for the speech segment of each candidate keyword, likewise calculate its likelihood on each Gaussian component of the universal background model, forming the candidate-keyword pronunciation model.
Step 4: calculate the correlation between the keyword pronunciation model and the candidate-keyword pronunciation model, and use the correlation as the measure of confidence.
Since the pronunciation model composed of the per-Gaussian likelihoods on the universal background model plays a role similar to that of the weights in the above keyword GMM model, this method can be called the wVector-correlation-based confidence calculation method.
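The steps above can be sketched as follows. Assumptions to note: features are toy 1-D scalars (real frames are feature vectors), per-component likelihoods are normalized into occupation probabilities before averaging, and cosine similarity is used as the correlation measure, since the patent does not specify the exact correlation formula.

```python
import math

def wvector(frames, weights, means, sigmas):
    """Average per-component occupation of a segment on a 1-D UBM.

    Each UBM Gaussian loosely stands for a pronunciation unit; the vector
    of average component occupation characterizes the segment's pronunciation.
    """
    k = len(weights)
    acc = [0.0] * k
    for x in frames:
        lik = [w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
               for w, m, s in zip(weights, means, sigmas)]
        tot = sum(lik)
        for i in range(k):
            acc[i] += lik[i] / tot  # normalized occupation of component i
    return [a / len(frames) for a in acc]

def correlation(u, v):
    """Cosine similarity between two wVectors (correlation measure assumed)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)
```

A candidate segment whose frames fall near the same UBM components as the keyword's training samples yields a high correlation, while an acoustically different segment yields a low one.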
As shown in Fig. 4, the flow of state-frame-variance-based confidence calculation in an embodiment of the present invention is as follows.
Here the state-frame variance is the sample variance of the numbers of speech frames assigned to each state of the keyword HMM model, according to the speech segment corresponding to the candidate keyword.
In false-alarm errors caused by partial pronunciation matches, the frame counts of the matched states tend to dominate, and correspondingly the sample variance of the speech frames over the states is usually abnormal (very large). Such partial-match false alarms can therefore be suppressed by detecting the state-frame variance. The calculation flow, shown in Fig. 4, includes the following steps:
Step 1: obtain the speech segment corresponding to the candidate keyword;
Step 2: perform forced segmentation on the keyword model to obtain the number of speech frames of the segment contained in each state;
Step 3: compute the variance of the per-state speech-frame counts as the confidence of the candidate keyword.
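The three steps above reduce to a variance over the per-state frame counts produced by forced alignment. A minimal sketch, assuming the alignment output is already available as a list of counts (the function name and inputs are illustrative, not from the patent):

```python
def frame_count_variance(state_frame_counts):
    """Variance of the per-state frame counts from forced alignment.

    state_frame_counts[i] = number of frames aligned to HMM state i
    (hypothetical alignment output). A large variance suggests a few states
    absorbed most of the frames, the partial-match false-alarm pattern, so
    high values would be penalized when forming the confidence.
    """
    n = len(state_frame_counts)
    mean = sum(state_frame_counts) / n
    return sum((c - mean) ** 2 for c in state_frame_counts) / n

# A balanced alignment vs. one dominated by a single state:
balanced = frame_count_variance([10, 11, 9, 10])  # small variance
skewed = frame_count_variance([2, 35, 2, 1])      # large variance
```

The contrast between the two example alignments shows why the variance separates genuine matches from partial ones.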
As shown in Fig. 5, another state-frame-variance-based confidence calculation in an embodiment of the present invention includes the following steps:
Step 1: obtain the speech segment corresponding to the candidate keyword and the speech frames in each state on the keyword model;
Step 2: compute the sample variance of the speech frames within each state;
Step 3: combine the per-state sample variances into an overall state sample variance, and use the overall state sample variance as the confidence of the candidate keyword.
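This second variant operates on the frames themselves rather than on their counts. The sketch below uses toy scalar features (real frames are feature vectors) and averages the per-state variances as the combination rule, which is an assumption: the patent only says the per-state sample variances are combined into an overall value.

```python
def per_state_feature_variance(state_frames):
    """Overall state sample variance over the frames within each state.

    state_frames: list over HMM states; each entry is the list of (scalar)
    frame features aligned to that state by forced alignment.
    """
    variances = []
    for frames in state_frames:
        n = len(frames)
        mean = sum(frames) / n
        # population sample variance of the frames within this state
        variances.append(sum((x - mean) ** 2 for x in frames) / n)
    # combination rule assumed: simple average of the per-state variances
    return sum(variances) / len(variances)
```

A candidate whose frames within each state are internally consistent yields a small overall variance, while heterogeneous frames inside a state inflate it.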
Correspondingly, an embodiment of the present invention further provides a language-independent keyword spotting system. Fig. 6 is a schematic structural diagram of this system.
The system includes:
a receiver module 601, configured to receive a voice signal to be detected;
a decoder module 602, configured to decode the voice signal according to a pre-built decoding network to obtain a candidate keyword;
a confidence evaluation module 603, configured to perform confidence evaluation on the candidate keyword in different ways;
a fusion module 604, configured to fuse the confidence evaluation results of the different ways to obtain an effective confidence of the candidate keyword;
an output module 605, configured to determine the keyword to output according to the effective confidence.
In this embodiment, the confidence evaluation module 603 includes a first evaluation module, and further includes a second evaluation module and/or a third evaluation module, wherein:
the first evaluation module is configured to calculate the confidence of the candidate keyword based on a log-likelihood ratio;
the second evaluation module is configured to calculate the confidence of the candidate keyword based on a wVector correlation;
the third evaluation module is configured to calculate the confidence of the candidate keyword based on a state-frame variance score.
For the confidence calculation process of each of the above evaluation modules, reference may be made to the description in the method embodiments of the present invention above.
It should be noted that, in practical applications, the second evaluation module and the third evaluation module may have multiple implementations, for example:
One implementation of the second evaluation module may include:
a background model training unit, configured to train a universal background model;
a keyword model training unit, configured to train a keyword GMM model according to keyword training-sample speech segments and the universal background model;
a candidate keyword model training unit, configured to obtain the speech segment of the candidate keyword according to the path corresponding to the candidate keyword in the decoding network, and then to train a candidate keyword GMM model according to the speech segment of the candidate keyword and the universal background model;
a distance calculation unit, configured to calculate the KL distance between the keyword GMM model and the candidate keyword GMM model, and to use the KL distance as the confidence of the candidate keyword.
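The KL distance between two GMMs has no closed form. One common approximation — plausible here because both GMMs are adapted from the same universal background model, so their Gaussian components align one-to-one — is a weighted sum of per-component Gaussian KL divergences. The sketch below assumes diagonal covariances and shared mixture weights; the function names and this particular approximation are assumptions, not stated in the patent.

```python
import numpy as np

def diag_gauss_kl(m0, v0, m1, v1):
    """KL divergence KL(N0 || N1) between two diagonal-covariance
    Gaussians with means m0, m1 and variance vectors v0, v1."""
    return 0.5 * np.sum(np.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0)

def matched_pair_gmm_kl(weights, means_a, vars_a, means_b, vars_b):
    """Matched-pair approximation of the KL distance between two GMMs
    whose components are aligned one-to-one (e.g. both MAP-adapted from
    the same UBM). weights: shared mixture weights, shape (M,);
    means/vars: shape (M, d)."""
    return float(sum(
        weights[i] * diag_gauss_kl(means_a[i], vars_a[i],
                                   means_b[i], vars_b[i])
        for i in range(len(weights))
    ))
```

Identical keyword and candidate models give a distance of zero; the farther apart the adapted component means, the larger the distance and the lower the match confidence.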
Another implementation of the second evaluation module may include:
a background model training unit, configured to train a universal background model;
a keyword pronunciation model construction unit, configured to calculate the likelihood of each Gaussian component of the universal background model on keyword training-sample speech segments, to form a keyword pronunciation model;
a candidate keyword pronunciation model construction unit, configured to obtain the speech segment of the candidate keyword according to the path corresponding to the candidate keyword in the decoding network, and then to calculate the likelihood of each Gaussian component of the universal background model on the speech segment, to form a candidate keyword pronunciation model;
a correlation calculation unit, configured to calculate the correlation between the keyword pronunciation model and the candidate keyword pronunciation model, and to use the correlation as the confidence of the candidate keyword.
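A minimal sketch of the pronunciation-model ("wVector") construction and correlation: each segment is summarized by the average occupation of the UBM's Gaussian components, and the correlation between the keyword vector and the candidate vector serves as the confidence. Whether the patent uses raw per-component likelihoods or normalized posteriors, and which correlation measure it uses, is not specified; normalized posteriors and Pearson correlation are assumed here, and the names are illustrative.

```python
import numpy as np

def wvector(frame_loglikes):
    """Build the pronunciation-model vector from per-frame, per-component
    log-likelihoods under the UBM.

    frame_loglikes: array of shape (T, M) — log-likelihood of each of
    the M UBM Gaussians for each of the T frames of the segment.
    Returns an (M,) vector of average component occupations.
    """
    # softmax per frame (stabilized), then average over frames
    post = np.exp(frame_loglikes - frame_loglikes.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    return post.mean(axis=0)

def wvector_correlation(vec_keyword, vec_candidate):
    """Pearson correlation between the keyword and candidate wVectors,
    used directly as the confidence score."""
    return float(np.corrcoef(vec_keyword, vec_candidate)[0, 1])
```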
One implementation of the third evaluation module may include:
a speech segment acquisition unit, configured to obtain the speech segment corresponding to the candidate keyword;
a segmentation unit, configured to perform forced segmentation on the keyword model, to obtain the number of speech frames of the speech segment contained in each state;
a speech frame variance statistics unit, configured to compute, according to the number of speech frames in each state, the variance of the speech frame counts as the confidence of the candidate keyword.
Another implementation of the third evaluation module may include:
a speech frame acquisition unit, configured to obtain the speech segment corresponding to the candidate keyword and the speech frames in each state on the keyword model;
a sample variance statistics unit, configured to compute the sample variance of the speech frames in each state;
a combination unit, configured to combine the per-state sample variances of the speech frames into an overall sample variance, and to use the overall sample variance as the confidence of the candidate keyword.
In addition, it should be noted that, in practical applications, the fusion module 604 may fuse the confidence of the candidate keyword calculated based on the log-likelihood ratio with either one of the above two confidences, or with both of them at the same time; the embodiment of the present invention is not limited in this respect. A specific fusion method may be to fuse the above confidence scores by weighted averaging.
In the language-independent keyword spotting system provided by the embodiment of the present invention, after the keyword decoding results are obtained according to the decoding network, confidence evaluation is performed on the keyword decoding results in different ways, and the confidence evaluation results of the different ways are fused to determine the confidence of each keyword decoding result. The reasonableness of each keyword decoding result is judged according to this confidence, so that the confidence-based filtering of the keyword decoding results is more accurate and reasonable, effectively improving system performance.
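The weighted-average fusion mentioned above can be sketched as follows. Equal weights as a default, and the assumption that the individual scores have already been normalized to a common range, are illustrative choices not specified by the patent.

```python
def fuse_confidences(scores, weights=None):
    """Weighted-average fusion of per-method confidence scores
    (e.g. log-likelihood ratio, wVector correlation, state-frame
    variance), yielding the effective confidence of the candidate
    keyword. Scores are assumed normalized to a common range."""
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)  # equal weights
    return sum(w * s for w, s in zip(weights, scores))
```

The candidate is then output as a detected keyword only if this effective confidence exceeds a preset threshold.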
The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to each other, and each embodiment focuses on its differences from the others. In particular, the system embodiment is described relatively briefly because it is substantially similar to the method embodiment; for relevant parts, reference may be made to the description of the method embodiment. The system embodiment described above is merely schematic: the modules or units described as separate components may or may not be physically separate, and the components shown as modules or units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the system according to embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (e.g., a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals; such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
The embodiments of the present invention have been described in detail above. Specific examples are used herein to set forth the present invention, and the description of the above embodiments is only intended to help understand the method and apparatus of the present invention. Meanwhile, for those of ordinary skill in the art, changes may be made to the specific implementation and application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (2)

1. A language-independent keyword recognition method, characterized by comprising:
receiving a voice signal to be detected;
decoding the voice signal according to a pre-built decoding network to obtain a candidate keyword;
performing confidence evaluation on the candidate keyword in different ways, wherein performing confidence evaluation on the candidate keyword in different ways comprises: calculating a confidence of the candidate keyword based on a log-likelihood ratio; and further comprises: calculating a confidence of the candidate keyword based on a wVector correlation, and/or calculating a confidence of the candidate keyword based on a state-frame variance score;
fusing the confidence evaluation results of the different ways to obtain an effective confidence of the candidate keyword;
determining a keyword to output according to the effective confidence;
wherein calculating the confidence of the candidate keyword based on the wVector correlation comprises:
training a universal background model;
training a keyword GMM model according to keyword training-sample speech segments and the universal background model;
obtaining the speech segment of the candidate keyword according to the path corresponding to the candidate keyword in the decoding network, and then training a candidate keyword GMM model according to the speech segment of the candidate keyword and the universal background model;
calculating the KL distance between the keyword GMM model and the candidate keyword GMM model, and using the KL distance as the confidence of the candidate keyword;
or, calculating the confidence of the candidate keyword based on the wVector correlation comprises:
training a universal background model;
calculating the likelihood of each Gaussian component of the universal background model on keyword training-sample speech segments, to form a keyword pronunciation model;
obtaining the speech segment of the candidate keyword according to the path corresponding to the candidate keyword in the decoding network, and then calculating the likelihood of each Gaussian component of the universal background model on the speech segment, to form a candidate keyword pronunciation model;
calculating the correlation between the keyword pronunciation model and the candidate keyword pronunciation model, and using the correlation as the confidence of the candidate keyword;
wherein calculating the confidence of the candidate keyword based on the state-frame variance score comprises:
obtaining the speech segment corresponding to the candidate keyword;
performing forced segmentation on the keyword model, to obtain the number of speech frames of the speech segment contained in each state;
computing, according to the number of speech frames in each state, the variance of the speech frame counts as the confidence of the candidate keyword;
or, calculating the confidence of the candidate keyword based on the state-frame variance score comprises:
obtaining the speech segment corresponding to the candidate keyword and the speech frames in each state on the keyword model;
computing the sample variance of the speech frames in each state;
combining the per-state sample variances of the speech frames into an overall sample variance, and using the overall sample variance as the confidence of the candidate keyword.
2. A language-independent keyword spotting system, characterized by comprising:
a receiver module, configured to receive a voice signal to be detected;
a decoder module, configured to decode the voice signal according to a pre-built decoding network to obtain a candidate keyword;
a confidence evaluation module, configured to perform confidence evaluation on the candidate keyword in different ways;
a fusion module, configured to fuse the confidence evaluation results of the different ways to obtain an effective confidence of the candidate keyword;
an output module, configured to determine the keyword to output according to the effective confidence;
wherein the confidence evaluation module includes:
a first evaluation module, configured to calculate the confidence of the candidate keyword based on a log-likelihood ratio;
and the confidence evaluation module further includes:
a second evaluation module, configured to calculate the confidence of the candidate keyword based on a wVector correlation; and/or
a third evaluation module, configured to calculate the confidence of the candidate keyword based on a state-frame variance score;
wherein the second evaluation module includes:
a background model training unit, configured to train a universal background model;
a keyword model training unit, configured to train a keyword GMM model according to keyword training-sample speech segments and the universal background model;
a candidate keyword model training unit, configured to obtain the speech segment of the candidate keyword according to the path corresponding to the candidate keyword in the decoding network, and then to train a candidate keyword GMM model according to the speech segment of the candidate keyword and the universal background model;
a distance calculation unit, configured to calculate the KL distance between the keyword GMM model and the candidate keyword GMM model, and to use the KL distance as the confidence of the candidate keyword;
or, the second evaluation module includes:
a background model training unit, configured to train a universal background model;
a keyword pronunciation model construction unit, configured to calculate the likelihood of each Gaussian component of the universal background model on keyword training-sample speech segments, to form a keyword pronunciation model;
a candidate keyword pronunciation model construction unit, configured to obtain the speech segment of the candidate keyword according to the path corresponding to the candidate keyword in the decoding network, and then to calculate the likelihood of each Gaussian component of the universal background model on the speech segment, to form a candidate keyword pronunciation model;
a correlation calculation unit, configured to calculate the correlation between the keyword pronunciation model and the candidate keyword pronunciation model, and to use the correlation as the confidence of the candidate keyword;
wherein the third evaluation module includes:
a speech segment acquisition unit, configured to obtain the speech segment corresponding to the candidate keyword;
a segmentation unit, configured to perform forced segmentation on the keyword model, to obtain the number of speech frames of the speech segment contained in each state;
a speech frame variance statistics unit, configured to compute, according to the number of speech frames in each state, the variance of the speech frame counts as the confidence of the candidate keyword;
or, the third evaluation module includes:
a speech frame acquisition unit, configured to obtain the speech segment corresponding to the candidate keyword and the speech frames in each state on the keyword model;
a sample variance statistics unit, configured to compute the sample variance of the speech frames in each state;
a combination unit, configured to combine the per-state sample variances of the speech frames into an overall sample variance, and to use the overall sample variance as the confidence of the candidate keyword.
CN201310553073.9A 2013-11-08 2013-11-08 Language-independent keyword recognition method and system Active CN103559881B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310553073.9A CN103559881B (en) 2013-11-08 2013-11-08 Language-independent keyword recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310553073.9A CN103559881B (en) 2013-11-08 2013-11-08 Language-independent keyword recognition method and system

Publications (2)

Publication Number Publication Date
CN103559881A CN103559881A (en) 2014-02-05
CN103559881B true CN103559881B (en) 2016-08-31

Family

ID=50014112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310553073.9A Active CN103559881B (en) 2013-11-08 2013-11-08 Keyword recognition method that languages are unrelated and system

Country Status (1)

Country Link
CN (1) CN103559881B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104505090B (en) * 2014-12-15 2017-11-14 北京国双科技有限公司 The audio recognition method and device of sensitive word
US10055767B2 (en) * 2015-05-13 2018-08-21 Google Llc Speech recognition for keywords
CN106297776B (en) * 2015-05-22 2019-07-09 中国科学院声学研究所 A kind of voice keyword retrieval method based on audio template
US10438593B2 (en) 2015-07-22 2019-10-08 Google Llc Individualized hotword detection models
CN105679316A (en) * 2015-12-29 2016-06-15 深圳微服机器人科技有限公司 Voice keyword identification method and apparatus based on deep neural network
CN107545261A (en) * 2016-06-23 2018-01-05 佳能株式会社 The method and device of text detection
CN108694940B (en) * 2017-04-10 2020-07-03 北京猎户星空科技有限公司 Voice recognition method and device and electronic equipment
US10311874B2 (en) 2017-09-01 2019-06-04 4Q Catalyst, LLC Methods and systems for voice-based programming of a voice-controlled device
CN108922521B (en) * 2018-08-15 2021-07-06 合肥讯飞数码科技有限公司 Voice keyword retrieval method, device, equipment and storage medium
CN109192224B (en) * 2018-09-14 2021-08-17 科大讯飞股份有限公司 Voice evaluation method, device and equipment and readable storage medium
CN112185367A (en) * 2019-06-13 2021-01-05 北京地平线机器人技术研发有限公司 Keyword detection method and device, computer readable storage medium and electronic equipment
WO2021051403A1 (en) * 2019-09-20 2021-03-25 深圳市汇顶科技股份有限公司 Voice control method and apparatus, chip, earphones, and system
WO2021062705A1 (en) * 2019-09-30 2021-04-08 大象声科(深圳)科技有限公司 Single-sound channel robustness speech keyword real-time detection method
CN111128128B (en) * 2019-12-26 2023-05-23 华南理工大学 Voice keyword detection method based on complementary model scoring fusion
CN111540363B (en) * 2020-04-20 2023-10-24 合肥讯飞数码科技有限公司 Keyword model and decoding network construction method, detection method and related equipment
CN111554273B (en) * 2020-04-28 2023-02-10 华南理工大学 Method for selecting amplified corpora in voice keyword recognition
CN111798840B (en) * 2020-07-16 2023-08-08 中移在线服务有限公司 Voice keyword recognition method and device
CN111968649B (en) * 2020-08-27 2023-09-15 腾讯科技(深圳)有限公司 Subtitle correction method, subtitle display method, device, equipment and medium
CN112259101B (en) * 2020-10-19 2022-09-23 腾讯科技(深圳)有限公司 Voice keyword recognition method and device, computer equipment and storage medium
CN113327597B (en) * 2021-06-23 2023-08-22 网易(杭州)网络有限公司 Speech recognition method, medium, device and computing equipment
CN113823274B (en) * 2021-08-16 2023-10-27 华南理工大学 Voice keyword sample screening method based on detection error weighted editing distance

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102314876A (en) * 2010-06-29 2012-01-11 株式会社理光 Speech retrieval method and system
CN102439660A (en) * 2010-06-29 2012-05-02 株式会社东芝 Voice-tag method and apparatus based on confidence score

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7809564B2 (en) * 2006-12-18 2010-10-05 International Business Machines Corporation Voice based keyword search algorithm

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102314876A (en) * 2010-06-29 2012-01-11 株式会社理光 Speech retrieval method and system
CN102439660A (en) * 2010-06-29 2012-05-02 株式会社东芝 Voice-tag method and apparatus based on confidence score

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Phoneme-lattice-based confidence calculation in keyword detection systems; Zhang Pengyuan et al.; Journal of Electronics & Information Technology; 2007-09-15; abstract, p. 2063 left column para. 1, p. 2064 right column last para. through p. 2065 left column para. 2 *

Also Published As

Publication number Publication date
CN103559881A (en) 2014-02-05

Similar Documents

Publication Publication Date Title
CN103559881B (en) Language-independent keyword recognition method and system
US12035106B2 (en) Machine learning model capability assessment
CN107797984B (en) Intelligent interaction method, equipment and storage medium
CN105574098B (en) The generation method and device of knowledge mapping, entity control methods and device
KR101734829B1 (en) Voice data recognition method, device and server for distinguishing regional accent
JP6101196B2 (en) Voice identification method and apparatus
US10650801B2 (en) Language recognition method, apparatus and device and computer storage medium
JP5223673B2 (en) Audio processing apparatus and program, and audio processing method
US20160210984A1 (en) Voice Quality Evaluation Method and Apparatus
CN110415705A (en) A kind of hot word recognition methods, system, device and storage medium
CN105869633A (en) Cross-lingual initialization of language models
US11120802B2 (en) Diarization driven by the ASR based segmentation
CN112257437B (en) Speech recognition error correction method, device, electronic equipment and storage medium
US11183180B2 (en) Speech recognition apparatus, speech recognition method, and a recording medium performing a suppression process for categories of noise
CN110704597B (en) Dialogue system reliability verification method, model generation method and device
KR101564087B1 (en) Method and apparatus for speaker verification
CN103559289B (en) Language-irrelevant keyword search method and system
CN111639529A (en) Speech technology detection method and device based on multi-level logic and computer equipment
CN109243427A (en) A kind of car fault diagnosis method and device
Avila et al. Bayesian restoration of audio signals degraded by impulsive noise modeled as individual pulses
US10468031B2 (en) Diarization driven by meta-information identified in discussion content
CN113299278B (en) Acoustic model performance evaluation method and device and electronic equipment
US12087276B1 (en) Automatic speech recognition word error rate estimation applications, including foreign language detection
CN114758645B (en) Training method, device, equipment and storage medium for speech synthesis model
CN104572820B (en) The generation method and device of model, importance acquisition methods and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: No. 666 Wangjiang Road, High-tech Development Zone, Hefei City, Anhui Province, 230088

Applicant after: Iflytek Co., Ltd.

Address before: No. 666 Wangjiang Road, High-tech Development Zone, Hefei City, Anhui Province, 230088

Applicant before: Anhui USTC iFLYTEK Co., Ltd.

COR Change of bibliographic data
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20190419

Address after: 065001 Xinya R&D Building 610-612, 106 No. 1 Road, Langfang Economic and Technological Development Zone, Hebei Province

Patentee after: Technological University Xunfei Hebei Technology Co., Ltd.

Address before: 230088 666 Wangjiang West Road, Hefei hi tech Development Zone, Anhui

Patentee before: Iflytek Co., Ltd.

TR01 Transfer of patent right