CN1455387A - Rapid decoding method for voice identifying system - Google Patents


Info

Publication number
CN1455387A
CN1455387A CN02148682A
Authority
CN
China
Prior art keywords
token
node
current
state
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN02148682A
Other languages
Chinese (zh)
Other versions
CN1201284C (en)
Inventor
韩疆
颜永红
潘接林
张建平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, Beijing Kexin Technology Co Ltd filed Critical Institute of Acoustics CAS
Priority to CNB021486824A priority Critical patent/CN1201284C/en
Publication of CN1455387A publication Critical patent/CN1455387A/en
Application granted granted Critical
Publication of CN1201284C publication Critical patent/CN1201284C/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Abstract

(1) Initialize the decoding arithmetic unit. (2) Take the feature codeword vector of the next speech frame from the length-T speech feature codeword sequence input to the decoding unit; the current speech frame is O_t, 1<=t<=T. (3) Filter the current speech frame O_t. (4) Based on O_t, examine each active node of each layer I of the lexical tree token resource L_t at time t, i.e. of L_t[I]. (5) Process the tokens at word nodes of the lexical tree. (6) Adaptively adjust the pruning-related thresholds according to the maximum local path probability. (7) Repeat steps (2)-(6). The generated text string that best matches the acoustic model and the language model is output, producing the speech recognition result. Compared with traditional methods, this strategy speeds up the decoding operation.

Description

Fast decoding method for a speech recognition system
Technical field
The present invention relates to a fast decoding method for a speech recognition system.
Background art
The decoding operation is the main component of a speech recognition system. Its function is: given an acoustic model and a language model, for an input sequence of acoustic observation feature vectors, let the computer automatically find, in a statically or dynamically constructed search space, the text string that best matches the acoustic model and the language model, thereby converting the user's voice input into the corresponding text.
Fig. 1 shows the block diagram of a known speech recognition system. The analog speech is converted by analog-to-digital conversion unit 11 into a digital signal the computer can process. Feature extraction unit 12 then splits this signal into frames, typically with a frame length of 20 ms and a frame shift of 10 ms, and extracts the MFCC parameters of each frame to obtain an MFCC vector sequence. Decoding arithmetic unit 14 takes the feature vector sequence of the input speech, acoustic model 13 and language model 15, and applies a search strategy such as depth-first search (the Viterbi algorithm) or breadth-first search to obtain the recognition result. The language model applies knowledge of the linguistic level during large-vocabulary continuous speech recognition and improves the recognition accuracy of the system.
A speech recognizer built as in Fig. 1 places high demands on the central processor speed and memory size of the computer. Some current commercial dictation systems, for example IBM's ViaVoice and the dictation module in Microsoft Office XP, require a fast central processor (Intel Pentium II 400 MHz or above) and a large memory (100 MB or more). In general, the decoding operation occupies more than 90% of the central processor's computation and most of the memory of the whole recognizer, while the analog-to-digital conversion module and the feature extraction unit occupy less than 10% of the computation and very little memory.
Current commercial embedded speech recognition systems mainly perform speaker-dependent recognition based on simple template matching over small vocabularies, for example voice dialing and simple command recognition in mobile phones. Because this technique requires the user to register speech data, its ease of use and applicability are limited. Some speaker-independent embedded systems target small-vocabulary command-word recognition, but their computation and memory requirements are still large; for example, IBM's personal voice assistant recognizer requires a computing device with 50 DMIPS of processing power for a 500-word task domain.
The basic principles and concepts of the known decoding operation are as follows:
1. The lexical tree
The lexical tree is a tree structure used to organize the pronunciations of all words in the recognition system. Phonemes are the basic units of word pronunciation, and the TRIPHONE is the phoneme unit commonly used in current speech recognition systems. For example, the TRIPHONE sequence of the word "China" (zhong guo) is "sil-zh+ong zh-ong+g ong-g+uo g-uo+sil", where "sil" is a special phoneme describing pauses in the user's speech. The TRIPHONE is a context-dependent phoneme: compared with a plain pinyin representation, it can describe the pronunciation variation a phoneme undergoes in different contexts, and therefore describes the acoustic features of word pronunciations more accurately. Words in the recognition system may share the same prefix syllables or sub-words; for example, the words "central" (zhong yang) and "China" (zhong guo) share the prefix "zhong", and this sharing can be described with a tree structure. Suppose the vocabulary of the recognition system contains the five words "abe", "ab", "acg", "acgi" and "ac"; the lexical tree of this vocabulary is then as shown in Fig. 4.
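For illustration, the following sketch (hypothetical Python, not part of the patent) builds such a prefix tree over the five example words, with one letter standing in for one phoneme; the class and field names are assumptions.

# Minimal sketch of the lexical (prefix) tree described above.
class LexNode:
    def __init__(self, phone):
        self.phone = phone        # phoneme labeling the arc into this node
        self.children = {}        # phone -> LexNode
        self.word = None          # set when a word's pronunciation ends here

def build_lexical_tree(words):
    root = LexNode(phone=None)
    for w in words:
        node = root
        for phone in w:           # one letter stands in for one phoneme
            node = node.children.setdefault(phone, LexNode(phone))
        node.word = w             # mark the word node
    return root

# The five-word example vocabulary of Fig. 4:
tree = build_lexical_tree(["abe", "ab", "acg", "acgi", "ac"])
assert tree.children["a"].children["b"].word == "ab"
assert tree.children["a"].children["c"].children["g"].children["i"].word == "acgi"

Shared prefixes ("a", "ac") occupy a single node each, which is exactly why the tree organization reduces the search space relative to a flat word list.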
Each node of the lexical tree corresponds to a TRIPHONE phoneme and is associated with the Hidden Markov Model (HMM) of that TRIPHONE. Fig. 5 shows an HMM topology representing a TRIPHONE phoneme; an HMM consists of several HMM states.
2. Token definition and token expansion policy
A token denotes an active search path from the start frame of the user's speech to the current speech frame. It contains the path identification information and the score of the path's match against the acoustic model and language model, where the path identification information comprises all words on the path and their boundary information. Each token corresponds to one active search path, and different tokens differ in their acoustic context and their language context.
A movable token can reside in each state of the HMM associated with each node of the lexical tree, and each state of a node has a token linked list that stores all tokens active in that state at any time. Suppose that at time t the score of an expandable token in the token list of state i of a node is s_i(t-1). During the search, if s_i(t-1), plus the transition probability from state i to state j, plus the observation probability of state j for the current speech frame t, exceeds the current pruning threshold, a new token is produced with score s_j(t) and attached to state j. After all tokens residing on the lexical tree at time t-1 have been processed, the token resource to be expanded at time t has been produced, and all tokens residing on the tree at time t-1 are deleted.
During token propagation, hypothesized words and word boundary information are recorded in a linked list that identifies the path. Therefore, at the end time T of the speech input, the word sequence with the best match and the corresponding word boundary positions can be extracted by backtracking the path identification list held by the best-scoring token.
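The token just described can be sketched as a small data structure (hypothetical Python; the field names are assumptions, not the patent's): a score plus a backpointer chain of word/boundary entries, from which backtracking recovers the best word sequence at time T.

from dataclasses import dataclass
from typing import Optional

@dataclass
class WordLink:                    # one entry of the path-identification list
    word: str
    end_frame: int
    prev: Optional["WordLink"]

@dataclass
class Token:
    score: float                   # combined acoustic + language model log score
    history: Optional[WordLink]    # word and word-boundary information

def backtrace(best):
    """Recover the best word sequence and word boundaries from the best token."""
    words, link = [], best.history
    while link is not None:
        words.append((link.word, link.end_frame))
        link = link.prev
    return list(reversed(words))

best = Token(score=-123.4, history=WordLink("guo", 42, WordLink("zhong", 21, None)))
assert backtrace(best) == [("zhong", 21), ("guo", 42)]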
3. The token resource of a lexical tree node
Suppose a lexical tree node contains M HMM states s_1 … s_M. The token resource of the node is then defined as the set of list heads H_{s_1}, H_{s_2}, …, H_{s_M}, where H_{s_i} is the head of the token linked list of HMM state s_i in the node.
Traditional decoding methods place excessively high demands on hardware computing power and memory, and their cost-effectiveness is low.
Chinese patent application 02131086.6 discloses a compression method for the feature vector set of a speech recognition system. In the process of clustering the speech feature vectors into a codebook, it adds steps that dynamically merge and split subsets according to the number of vectors in a subset and the total distance measure of its vectors, reducing the total distance between vectors and their corresponding codewords after clustering and improving the precision of the clustering algorithm. Applying a codebook compressed by that method in a speech recognition system greatly reduces the system's storage while preserving recognition performance. That invention also discloses a speech recognition system, whose block diagram is shown in Fig. 2: the system replaces the acoustic model with a feature codebook and a probability table, so that Gaussian probabilities need not be computed during decoding; the required probability values are simply looked up in the pre-stored probability table. This significantly reduces the cost of Gaussian probability computation in the decoding operation and can thus improve the recognition speed of the system to a large extent.
Summary of the invention
The object of the present invention is to provide an improved fast decoding method for current speaker-independent large-vocabulary continuous speech recognition systems. The method further addresses the problem that current embedded speech recognition technology demands too much hardware computing power and memory relative to the market's acceptance of product prices, making current speech recognition technology applicable to embedded hardware platforms such as PDAs, mobile phones and smart phones.
The object of the present invention is achieved by the following measures:
A fast decoding method for a speech recognition system comprises the following steps:
(1) initialize the decoding arithmetic unit of the speech recognition system;
(2) take, in order, the feature codeword vector of the next speech frame from the length-T speech feature codeword sequence input to the decoding unit, and set it as the current speech frame O_t, 1≤t≤T;
(3) filter the current speech frame O_t; if O_t is filtered out, return to step (2); otherwise set O_t as the current valid speech frame;
(4) based on the current valid speech frame O_t, examine each active node of each layer L_t[I] of the lexical tree token resource L_t at time t; for tokens judged expandable, expand the tokens in the node's token resource table and chain the newly produced tokens into the token resource tables of the destination nodes, where I is the layer index, 1≤I≤H, and H is the height of the lexical tree; otherwise execute step (7);
(5) process the tokens residing at word nodes of the lexical tree;
(6) adaptively adjust the pruning-related thresholds according to the maximum local path probability at time t and the maximum local path probability at time t̂, the time of the previous valid speech frame;
(7) repeat steps (2)-(6) to obtain, at the end time T of the input speech, the global path of the best-scoring token; finish the token expansion and output the generated text string that best matches the acoustic model and the language model, producing the speech recognition result.
The present invention does not address the expansion of word-node tokens and the associated processing algorithms; the user can customize those algorithms according to the task domain (for example command-word recognition, Chinese syllable recognition, or large-vocabulary continuous speech recognition).
The lexical tree token resource L_t at time t is the sum of the token resources of all active nodes in the lexical tree at that time. Active nodes at time t are indexed by the layer they occupy in the lexical tree: all active nodes in the same layer are chained into one linked list, each layer of the tree has such a list, and the whole forms a two-dimensional linked list.
L_t[I], the I-th layer of the token resource at time t, is the I-th layer list of the lexical tree active-node token resource L_t indexed in the above manner.
The maximum local path probability at time t is the maximum of the local path scores over the set of local paths corresponding to all tokens newly produced at time t. The maximum local path probability corresponding to the previous valid speech frame is the maximum of the local path scores over the set of local paths corresponding to all tokens newly produced at time t̂, the time of the previous valid speech frame.
The initialization step (1) further comprises the following steps (see the sketch after this list):
a. produce a token with score zero and chain it into the token resource head of the root node of the lexical tree; the active nodes of the current lexical tree comprise only the root node, which lies in the first layer of the tree;
b. initialize the global pruning threshold L_g to the logarithmic minimum value;
c. initialize the local pruning baseline threshold L_b to the logarithmic minimum value;
d. initialize the pruning width threshold L_w to a positive constant L_w^c preset by the user.
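A minimal sketch of this initialization, assuming plain Python containers and using -inf as a stand-in for the logarithmic minimum value:

import math
LOG_MIN = -math.inf                 # stand-in for the logarithmic minimum

def init_decoder(L_w_c):
    root_tokens = [0.0]             # a: one token with score zero at the root's head
    active_layers = [["root"]]      # the root is the only active node, in layer 1
    L_g = LOG_MIN                   # b: global pruning threshold
    L_b = LOG_MIN                   # c: local pruning baseline threshold
    L_w = L_w_c                     # d: pruning width, user-preset positive constant
    return root_tokens, active_layers, L_g, L_b, L_w

tokens, active, L_g, L_b, L_w = init_decoder(L_w_c=80.0)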
The filtering step (3) further comprises the following steps:
a. if the current speech frame O_t at time t is the initial frame of the user's speech input, set it as a valid speech frame and finish the filtering; otherwise execute step b;
b. compare the Y feature codewords f_1^t f_2^t … f_Y^t of the current frame O_t with the Y feature codewords f_1^{t-1} f_2^{t-1} … f_Y^{t-1} of frame O_{t-1} at time t-1, obtaining a similarity value V;
c. compare V with the decision threshold θ: if V≤θ, judge O_t to be a speech frame invalid for the decoding operation; otherwise judge O_t to be a speech frame valid for the decoding operation.
The decision threshold θ is a positive constant set by the user.
The node token resource expansion step (4) further comprises the following steps:
a. based on the current valid speech frame O_t, expand each token in the token resource list of the last HMM state of the current node outward, i.e. into the token resource tables of all child nodes of this node in the lexical tree;
b. take one HMM state of the M-state HMM associated with the current node as the current HMM state s_n, where 1≤n≤M;
c. take one token from the token resource table of state s_n as the current token;
d. if the score of the current token of state s_n is greater than the global pruning threshold L_g corresponding to the previous valid speech frame t̂, then, according to the topology of the HMM associated with the current node, take one state reachable from s_n as the current target state s_m; otherwise go to step k;
e. compute the score s_m(t) of the token moving from s_n to s_m: s_m(t) is the token's current score plus the transition probability from s_n to s_m, plus the observation probability of s_m for the current speech frame O_t; this observation probability is obtained by a table lookup in the probability table input to the decoding unit;
f. compute the current local pruning threshold L_p by the formula L_p = L_b − L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;
g. if the token's score from s_n to s_m is greater than the current local pruning threshold L_p, produce a new token and set its score to s_m(t); otherwise execute step j;
h. chain the new token produced in step g into the token resource table headed by H_{s_m} in the target node, and check whether that node is in the active-node list of its layer of the lexical tree; if not, chain it in;
i. based on the score s_m(t) of the new token, if s_m(t) − L_w > L_b holds, update the local pruning baseline threshold to L_b = s_m(t);
j. take another state reachable from s_n, set it as the current target state s_m, and repeat steps e-i until all states reachable from s_n have been processed; then go to step k;
k. take another token from the token resource table of s_n as the current token and repeat steps d-j until the expansion of all tokens in the table of s_n is finished; then go to step l;
l. take another HMM state of the M-state HMM associated with the node as the current HMM state s_n, 1≤n≤M, and repeat steps c-k until the expansion of all token resources of the current node is finished.
Step a of the node token resource expansion step (4) comprises the following steps:
i. if the current score of the token is less than or equal to the global pruning threshold L_g corresponding to the previous valid speech frame t̂, the token need not be expanded to the child nodes of its node; otherwise execute step ii;
ii. take the j-th of the J child nodes of the token's node in the lexical tree, node_j, as the current node;
iii. accumulate the score s_1(t) of the token reaching the first state s_1 of node_j: s_1(t) is the token's current score plus the transition probability from the last state of the token's node to the first state of node_j, plus the observation probability of the first state s_1 of node_j for the current speech frame O_t; this observation probability is obtained by a table lookup in the probability table input to the decoding unit;
iv. compute the current local pruning threshold L_p by the formula L_p = L_b − L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;
v. if the token's score reaching the first state s_1 of node_j is greater than the current local pruning threshold L_p, execute step vi; otherwise execute step ix;
vi. produce a new token and set its score to s_1(t);
vii. chain this token into the token resource table headed by H_{s_1} in node_j, and check whether node_j is in the active-node list of its layer of the lexical tree; if not, chain it in;
viii. based on the score s_1(t), if s_1(t) − L_w > L_b, update the local pruning baseline threshold to L_b = s_1(t);
ix. take another child node of the token's node in the lexical tree as the current node node_j and repeat steps iii-viii until the token's expansion to all child nodes of its node in the lexical tree is finished.
The adaptive pruning step (6) based on the maximum local path probability comprises the following steps:
a. according to the maximum local path probabilities at the current time t and at the time t̂ corresponding to the previous valid speech frame, compute the pruning width adjustment factor L_f [the explicit formula is reproduced in the original only as an image], where n_t denotes the number of all valid speech frames up to time t, the current global pruning threshold L_g is the maximum local path probability at t̂, and the current local pruning baseline threshold L_b is the maximum local path probability at time t;
b. regularize the computed adjustment factor L_f: if L_f > L_f^MAX then L_f = L_f^MAX; if L_f < L_f^MIN then L_f = L_f^MIN, where L_f^MAX is the upper bound of L_f and L_f^MIN its lower bound, both positive constants that can be set by the user;
c. according to the computed adjustment factor L_f, update the pruning width threshold to L_w = L_f · L_w^c, where L_w^c is the initial pruning width threshold obtained in initialization step (1);
d. update the global pruning threshold of time t to L_g = L_b, in preparation for the token expansion at the next valid speech frame;
e. reset the local pruning baseline threshold L_b to the logarithmic minimum value, in preparation for the token expansion at the next valid speech frame.
Compared with the prior art, the present invention has the following advantages:
Compared with traditional decoding methods, the present invention adds the following improvements: an adaptive pruning strategy based on the maximum local path probability, and a speech frame filtering strategy based on feature codeword vectors.
The decoding operation of the present invention adopts a breadth-first search framework with pruning based on the lexical tree and token expansion. The computational complexity of this algorithm is O(MT), where T is the number of speech frames entering the search and M is the average number of active paths per speech frame during the search.
Traditional decoding performs the search over all speech frames of the user's input. In fact, the user's speech input is a time-varying signal with local stationarity; therefore, in the stationary segments of the input, some frames that are similar to their neighboring frames can be removed, i.e. kept out of the search, without affecting decoding precision. To this end, the present invention provides a speech frame filtering strategy based on the frame's feature codeword vector. This strategy effectively removes frames that are redundant for the search, so that the number of frames actually entering the search is smaller than the number of frames contained in the user's speech input. From the above formula it follows that, compared with the traditional method, this strategy speeds up the decoding operation.
On the other hand, the above formula also shows that the search speed depends on M, the average number of active paths per frame: with T fixed, the larger M is, the larger the search cost, and the smaller M is, the smaller the cost. The size of M depends on the pruning strategy adopted by the search. To this end, the present invention provides an adaptive pruning strategy based on the maximum local path probability which, compared with the traditional method, effectively reduces M without noticeably affecting recognition accuracy, thereby further speeding up the decoding operation.
Description of drawings
Fig. 1 is the block diagram of a known speech recognition system
Fig. 2 is the block diagram of another known speech recognition system
Fig. 3 is the decoding operation flowchart of the present invention
Fig. 4 is the structure diagram of a known lexical tree
Fig. 5 is a schematic diagram of the HMM topology of a known TRIPHONE phoneme
Specific embodiments
Fig. 3 gives the decoding operation flowchart of the present invention. Referring to Fig. 2 in combination with Fig. 3, the operation flow of a speech recognition system based on the decoding arithmetic system of the present invention is: the input analog speech signal is converted into a digital signal; the digital signal is split into frames and the feature parameters of each frame are extracted, one feature vector per frame, yielding the feature vector sequence of the input speech; the feature codebook is used to quantize this feature vector sequence, one feature codeword vector per frame, yielding the corresponding feature codeword vector sequence; the codeword vector sequence is fed into the speech frame filtering unit of the decoding system, which removes the codeword vectors of invalid speech frames and yields the valid-frame feature codeword vector sequence; the search is then run over this sequence to obtain the recognition result. During the search, the adaptive pruning strategy based on the maximum local path probability prunes local search paths, and for each codeword of the valid-frame codeword vector sequence the observation probability on a (partial) search path is found directly in the probability table.
Definition of the lexical tree active-node token resource and its indexing
In the present invention, a node of the lexical tree that holds an active token at any time t is called an active node at time t. Active nodes at time t are indexed by the layer they occupy in the lexical tree: all active nodes in the same layer are chained into one linked list, each layer of the tree has such a list, and the whole forms a two-dimensional linked list (see the sketch below).
At any time t, the sum of the token resources of all active nodes in the lexical tree is called the lexical tree active-node token resource at time t; it specifies the token resource to be expanded at time t. For convenience of the following description, the active-node token resource at time t indexed as above is denoted L_t, with index variable I (1≤I≤H), where H is the height of the lexical tree.
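A small sketch of this layer-indexed structure (hypothetical Python, with lists standing in for the linked lists of the text; the tree height is an assumed example value):

H = 4                                   # assumed tree height for the example
active = [[] for _ in range(H)]         # active[I-1]: active nodes of layer I

def activate(node_id, layer_I):
    """Chain a node into its layer's active list if not already present."""
    if node_id not in active[layer_I - 1]:
        active[layer_I - 1].append(node_id)

activate("root", 1)
activate("zh", 2); activate("zh", 2)    # the second call is a no-op
assert active[0] == ["root"] and active[1] == ["zh"]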
Based on the above basic principles and concepts of the decoding operation, an embodiment of the fast decoding method of the present invention is given.
The decoding method of the present invention comprises the following steps:
1. initialize the decoding arithmetic unit of the speech recognition system;
2. take the feature codeword vector of the first speech frame from the length-T speech feature codeword vector sequence input to the decoding unit, and set it as the current speech frame O_t (t=1);
3. apply the filtering operation to the current speech frame O_t;
4. if O_t is a valid speech frame, then for each layer I (1≤I≤H) of the lexical tree token resource L_t at time t, expand the tokens in the token resource table of each active node of L_t[I] and chain the newly produced tokens into the token resource tables of the destination nodes; otherwise go to step 7;
5. process the tokens at word nodes; the present invention does not address the expansion of word-node tokens and the associated processing algorithms, which the user can customize according to the task domain (for example command-word recognition, Chinese syllable recognition, or large-vocabulary continuous speech recognition);
6. adaptively adjust the pruning-related thresholds, including the global pruning threshold L_g, the local pruning baseline threshold L_b and the pruning width threshold L_w, according to the maximum local path probabilities at time t and at the time t̂ of the previous valid speech frame;
7. take the next speech frame from the length-T feature sequence input to the decoding unit; if one is available, set it as the current speech frame O_t (t≤T) and go to step 3; otherwise go to step 8;
8. finish the token expansion and produce the recognition result: backtrack the global path of the best-scoring token at time T, and output the text string that best matches the acoustic model and the language model.
The sub-steps of the expansion of the current node's token resource in step 4 of the above decoding method are as follows (a sketch follows the list):
a. for each token in the token resource list of the last HMM state of the node's associated HMM, expand the token into the token resource tables of all child nodes of this node in the lexical tree;
b. take the first HMM state of the node's M-state HMM as the current HMM state s_n (n=1);
c. take one token from the token resource table of state s_n as the current token;
d. if the current score of the token is greater than the global pruning threshold L_g corresponding to the previous valid speech frame t̂, then, according to the topology of the HMM associated with the current node, take one state reachable from s_n and set it as the current target state s_m; otherwise go to step k;
e. compute the score s_m(t) of the token reaching state s_m as: the token's current score plus the transition probability from s_n to s_m, plus the observation probability of s_m for the current speech frame O_t; this observation probability is obtained by a table lookup in the probability table input to the decoding unit;
f. compute the current local pruning threshold L_p by the formula L_p = L_b − L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;
g. if the token's score reaching s_m is less than or equal to the current local pruning threshold L_p, go to step j; otherwise produce a new token and set its score to s_m(t);
h. chain this token into the token resource table headed by H_{s_m} in the node, and check whether the node is in the active-node list of its layer of the lexical tree; if not, chain it in;
i. update the local pruning baseline threshold L_b according to the score s_m(t): if s_m(t) − L_w > L_b, set L_b = s_m(t); otherwise do not update;
j. take another state s_m reachable from s_n; if one is obtained, set it as the current target state and go to step e; otherwise go to step k;
k. take another token from the token resource table of s_n; if one is obtained, set it as the current token and go to step d; otherwise go to step l;
l. take the next HMM state of the node's M-state HMM; if one is obtained, set it as the current HMM state s_n (n≤M) and go to step c; otherwise the expansion of the current node's token resource is finished.
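The within-node part of this expansion (steps b-l) can be made concrete with a short sketch. This is a hypothetical Python illustration under assumed data layouts, not the patent's implementation: token lists are dicts of score lists, trans maps a state to its reachable states with log transition probabilities, and obs holds the observation log probabilities looked up from the probability table; word-history bookkeeping is omitted.

import math
LOG_MIN = -math.inf

def expand_node_tokens(token_lists, trans, obs, L_g, L_b, L_w):
    """token_lists: state -> scores of tokens resident at time t-1.
    trans: state -> [(reachable state, log transition prob), ...].
    obs: state -> log observation prob for the current frame (table lookup).
    Returns (token lists for time t, updated local baseline L_b)."""
    new_lists = {}
    for s_n, tokens in token_lists.items():
        for score in tokens:
            if score <= L_g:                      # step d: global pruning vs L_g of t̂
                continue
            for s_m, a_nm in trans.get(s_n, []):  # states reachable from s_n
                s_m_t = score + a_nm + obs[s_m]   # step e: accumulate the score
                if s_m_t > L_b - L_w:             # steps f-g: local pruning test
                    new_lists.setdefault(s_m, []).append(s_m_t)  # step h
                    if s_m_t - L_w > L_b:         # step i: raise the baseline
                        L_b = s_m_t
    return new_lists, L_b

# Toy 2-state left-to-right HMM:
tokens = {0: [-1.0]}
trans = {0: [(0, -0.5), (1, -1.0)]}
obs = {0: -0.3, 1: -0.2}
new_tokens, L_b = expand_node_tokens(tokens, trans, obs, LOG_MIN, LOG_MIN, 50.0)

Note that L_b may rise while the frame is still being processed, so the local threshold L_p = L_b − L_w tightens as better partial paths appear within the same frame.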
The steps in sub-step a of step 4 of the above decoding method, expanding the current token to all child nodes of its node, are as follows (a sketch follows the list):
1. if the score of the current token is less than or equal to the global pruning threshold L_g corresponding to the previous valid speech frame t̂, the token need not be expanded to the child nodes of its node; otherwise go to step 2;
2. take one child node of the token's node as the current node node_j (j=1);
3. accumulate the score s_1(t) of the token reaching the first state s_1 of node_j as: the token's current score plus the transition probability from the last state of the token's node to the first state of node_j, plus the observation probability of the first state s_1 of node_j for the current speech frame O_t; this observation probability is obtained by a table lookup in the probability table input to the decoding unit;
4. compute the current local pruning threshold L_p by the formula L_p = L_b − L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;
5. if the token's score reaching the first state s_1 of node_j is less than or equal to the current local pruning threshold L_p, go to step 9; otherwise execute step 6;
6. produce a new token and set its score to s_1(t);
7. chain this token into the token resource table headed by H_{s_1} in node_j, and check whether node_j is in the active-node list of its layer of the lexical tree; if not, chain it in;
8. update the local pruning baseline threshold L_b according to the score s_1(t): if s_1(t) − L_w > L_b, set L_b = s_1(t); otherwise do not update;
9. take another child node of the token's node; if one is obtained, set it as the current node node_j (j≤N, where N is the number of child nodes of the token's node in the lexical tree) and go to step 3; otherwise the expansion of the current token to all child nodes of its node is finished.
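Sub-steps 1-9 can be sketched the same way; again a hypothetical illustration under the same assumed data layout, covering the global-pruning gate, the score accumulation into each child's first state, the local pruning test and the baseline update.

import math
LOG_MIN = -math.inf

def expand_to_children(token_score, children, exit_trans, obs_first, L_g, L_b, L_w):
    """token_score: score of a token in the last HMM state of the current node.
    children: child-node ids of this node in the lexical tree.
    exit_trans[child]: log prob from this node's last state to child's first state.
    obs_first[child]: observation log prob of child's first state (table lookup).
    Returns ({child: new token score}, updated local baseline L_b)."""
    if token_score <= L_g:                # sub-step 1: global pruning
        return {}, L_b
    new_tokens = {}
    for child in children:                # sub-steps 2 and 9: every child node
        s_1_t = token_score + exit_trans[child] + obs_first[child]  # sub-step 3
        if s_1_t > L_b - L_w:             # sub-steps 4-5: local pruning test
            new_tokens[child] = s_1_t     # sub-steps 6-7: new token in node_j
            if s_1_t - L_w > L_b:         # sub-step 8: raise the baseline
                L_b = s_1_t
    return new_tokens, L_b

new_toks, L_b = expand_to_children(-1.0, ["b", "c"],
                                   {"b": -0.7, "c": -1.2},
                                   {"b": -0.4, "c": -0.3},
                                   LOG_MIN, LOG_MIN, 50.0)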
Step 1 of the above decoding method, initializing the decoding arithmetic unit, comprises the following steps:
1. produce a token with score zero and chain it into the token resource head of the root node of the lexical tree; the active nodes of the current lexical tree comprise only the root node, which lies in the first layer of the tree;
2. initialize the global pruning threshold L_g to the logarithmic minimum value;
3. initialize the local pruning baseline threshold L_b to the logarithmic minimum value;
4. initialize the pruning width threshold L_w to a positive constant L_w^c; this value can be read from the decoding unit configuration file set by the user.
Step 3 of the above decoding method, the filtering operation on the current speech frame, comprises the following steps (a sketch follows the list):
1. if the current speech frame O_t at time t is the initial frame of the user's speech input, set it as a valid speech frame and finish the filtering; otherwise execute step 2;
2. compare the feature codeword vector f_1^t f_2^t … f_Y^t of the current frame O_t with the feature codeword vector f_1^{t-1} f_2^{t-1} … f_Y^{t-1} of frame O_{t-1} at time t-1 to obtain a similarity value V, where Y is the number of feature codewords in a frame's codeword vector; V is computed as V = Σ_{i=1}^{Y} C(f_i^t, f_i^{t-1}), where C(f_i^t, f_i^{t-1}) is defined by a formula reproduced in the original only as an image;
3. compare V with the decision threshold θ (a positive constant set by the user, readable from the decoding unit configuration file): if V≤θ, judge O_t to be a speech frame invalid for the decoding operation; otherwise judge O_t to be a speech frame valid for the decoding operation.
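A minimal runnable sketch of the filter follows. Since the definition of C survives only as an image, the sketch assumes C indicates that the i-th codewords differ, which is the reading consistent with the rule that V≤θ marks a frame as redundant; this assumption and the example values are not from the patent.

def is_valid_frame(frame, prev_frame, theta):
    """frame, prev_frame: length-Y sequences of feature codeword indices."""
    if prev_frame is None:          # first frame of the utterance: always valid
        return True
    # Assumed C: 1 if the i-th codewords differ, 0 if they are equal.
    V = sum(1 for a, b in zip(frame, prev_frame) if a != b)
    return V > theta                # V <= theta: too similar, filter the frame out

# Example: with theta = 2, a frame differing in only one codeword is dropped.
assert not is_valid_frame([3, 7, 7, 2], [3, 7, 5, 2], theta=2)
assert is_valid_frame([3, 1, 9, 4], [3, 7, 5, 2], theta=2)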
Experimental results show that the above frame filtering removes 20%-30% of the invalid frames in the speech feature codeword vector sequence, speeding up decoding relative to the traditional method, while recognition performance remains insensitive to the user's speaking rate. This is because for a slower speaker the filtering removes more invalid frames, while for a faster speaker it removes fewer; that is, the filtering normalizes, to a certain degree, the speaking rates of different users.
Step 6 of the above decoding method further comprises the following steps. From the first five steps of the decoding unit it is known that the current global pruning threshold L_g is the maximum local path probability at the time t̂ of the previous valid speech frame, and the current local pruning baseline threshold L_b is the maximum local path probability at time t. Accordingly, step 6 is executed in the following sub-steps (a sketch follows the list):
1. according to the maximum local path probabilities at times t and t̂, compute the pruning width adjustment factor L_f [the explicit formula is reproduced in the original only as an image], where n_t is the number of all valid speech frames up to time t;
2. regularize the computed adjustment factor L_f: if L_f > L_f^MAX then L_f = L_f^MAX; if L_f < L_f^MIN then L_f = L_f^MIN, where L_f^MAX is the upper bound of L_f (for example 1.05) and L_f^MIN its lower bound (for example 0.5), both positive constants readable from the decoder configuration file set by the user;
3. according to the computed adjustment factor L_f, update the pruning width threshold to L_w = L_f · L_w^c;
4. update the global pruning threshold of time t to L_g = L_b, in preparation for the token expansion at the next valid speech frame;
5. reset the local pruning baseline threshold L_b to the logarithmic minimum value, in preparation for the token expansion at the next valid speech frame.
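Sub-steps 2-5 are fully specified and can be sketched as follows (hypothetical Python); since the formula for L_f itself survives only as an image, the factor is passed in as a parameter rather than computed.

import math
LOG_MIN = -math.inf

def adapt_thresholds(L_f, L_w_c, L_b, L_f_min=0.5, L_f_max=1.05):
    """Returns (new L_w, new L_g, reset L_b). L_w_c is the initial pruning
    width; L_f_min/L_f_max are the user-set bounds (example values from the
    text above)."""
    L_f = min(max(L_f, L_f_min), L_f_max)  # sub-step 2: regularize the factor
    L_w = L_f * L_w_c                      # sub-step 3: new pruning width
    L_g = L_b                              # sub-step 4: next frame's global threshold
    L_b = LOG_MIN                          # sub-step 5: reset the local baseline
    return L_w, L_g, L_b

L_w, L_g, L_b = adapt_thresholds(L_f=0.9, L_w_c=80.0, L_b=-42.0)
assert (L_w, L_g, L_b) == (72.0, -42.0, LOG_MIN)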
In the traditional search algorithm the pruning width threshold L_w is a constant. In the present invention, after the search over the current valid speech frame, L_w is adaptively adjusted according to the maximum local path probability, realizing adaptive pruning of local paths when the next valid frame is searched. Experimental results show that, without affecting recognition accuracy, this method effectively reduces the average token count M in decoding (by 10%-20%), thereby further speeding up the decoding operation.

Claims (10)

1. A fast decoding method for a speech recognition system, comprising the steps of:
(1) initializing the decoding arithmetic unit of the speech recognition system;
(2) taking, in order, the feature codeword vector of the next speech frame from the length-T speech feature codeword sequence input to the decoding unit and setting it as the current speech frame O_t, 1≤t≤T;
(3) filtering the current speech frame O_t; if O_t is filtered out, executing step (2); otherwise setting O_t as the current valid speech frame;
(4) based on the current valid speech frame O_t, examining each active node of each layer L_t[I] of the lexical tree token resource L_t at time t, expanding the tokens judged expandable in the node's token resource table, and chaining the newly produced tokens into the token resource tables of the destination nodes, where I is the layer index, 1≤I≤H, and H is the height of the lexical tree; otherwise executing step (7);
(5) processing the tokens residing at word nodes of the lexical tree;
(6) adaptively adjusting the pruning-related thresholds according to the maximum local path probability at time t and the maximum local path probability at the time t̂ of the previous valid speech frame;
(7) repeating steps (2)-(6) to obtain, at the end time T of the input speech, the global path of the best-scoring token, finishing the token expansion, and outputting the generated text string that best matches the acoustic model and the language model, producing the speech recognition result.
2. The fast decoding method for a speech recognition system of claim 1, wherein said lexical tree token resource L_t at time t is the sum of the token resources of all active nodes in the lexical tree at that time.
3. The fast decoding method for a speech recognition system of claim 1, wherein said maximum local path probability at time t is the maximum of the local path scores over the set of local paths corresponding to all tokens newly produced at time t.
4. The fast decoding method for a speech recognition system of claim 1, wherein said maximum local path probability corresponding to the previous valid speech frame is the maximum of the local path scores over the set of local paths corresponding to all tokens newly produced at the time t̂ of the previous valid speech frame.
5. The fast decoding method for a speech recognition system of claim 1, wherein said initialization step (1) further comprises the steps of:
a. producing a token with score zero and chaining it into the token resource head of the root node of the lexical tree, the active nodes of the current lexical tree comprising only the root node, which lies in the first layer of the tree;
b. initializing the global pruning threshold L_g to the logarithmic minimum value;
c. initializing the local pruning baseline threshold L_b to the logarithmic minimum value;
d. initializing the pruning width threshold L_w to a positive constant L_w^c preset by the user.
6. The fast decoding method for a speech recognition system of claim 1, wherein said filtering step (3) further comprises the steps of:
a. if the current speech frame O_t at time t is the initial frame of the user's speech input, setting it as a valid speech frame and finishing the filtering; otherwise executing step b;
b. comparing the Y feature codewords f_1^t f_2^t … f_Y^t of the current frame O_t with the Y feature codewords f_1^{t-1} f_2^{t-1} … f_Y^{t-1} of frame O_{t-1} at time t-1, obtaining a similarity value V;
c. comparing V with the decision threshold θ: if V≤θ, judging O_t to be a speech frame invalid for the decoding operation; otherwise judging O_t to be a speech frame valid for the decoding operation.
7. The fast decoding method for a speech recognition system of claim 1, wherein said decision threshold θ is a positive constant set by the user.
8. The fast decoding method for a speech recognition system of claim 1, wherein said node token resource expansion step (4) further comprises the steps of:
a. based on the current valid speech frame O_t, expanding each token in the token resource list of the last HMM state of the current node outward, i.e. into the token resource tables of all child nodes of this node in the lexical tree;
b. taking one HMM state of the M-state HMM associated with the current node as the current HMM state s_n, where 1≤n≤M;
c. taking one token from the token resource table of state s_n as the current token;
d. if the score of the current token of state s_n is greater than the global pruning threshold L_g corresponding to the previous valid speech frame t̂, taking, according to the topology of the HMM associated with the current node, one state reachable from s_n as the current target state s_m; otherwise going to step k;
e. computing the score s_m(t) of the token moving from s_n to s_m, s_m(t) being the token's current score plus the transition probability from s_n to s_m, plus the observation probability of s_m for the current speech frame O_t;
f. computing the current local pruning threshold L_p by the formula L_p = L_b − L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;
g. if the token's score from s_n to s_m is greater than the current local pruning threshold L_p, producing a new token and setting its score to s_m(t); otherwise executing step j;
h. chaining the new token produced in step g into the token resource table headed by H_{s_m} in the target node, and checking whether that node is in the active-node list of its layer of the lexical tree; if not, chaining it in;
i. based on the score s_m(t) of the new token, if s_m(t) − L_w > L_b, updating the local pruning baseline threshold to L_b = s_m(t);
j. taking another state reachable from s_n, setting it as the current target state s_m, and repeating steps e-i until all states reachable from s_n have been processed, then going to step k;
k. taking another token from the token resource table of s_n as the current token and repeating steps d-j until the expansion of all tokens in the table of s_n is finished, then going to step l;
l. taking another HMM state of the M-state HMM associated with the node as the current HMM state s_n, 1≤n≤M, and repeating steps c-k until all token resource expansion of the current node is finished.
9. The fast decoding method for a speech recognition system of claim 8, wherein step a of said node token resource expansion step (4) comprises the steps of:
i. if the current score of the token is less than or equal to the global pruning threshold L_g corresponding to the previous valid speech frame t̂, the token need not be expanded to the child nodes of its node; otherwise executing step ii;
ii. taking the j-th of the J child nodes of the token's node as the current node node_j;
iii. accumulating the score s_1(t) of the token reaching the first state s_1 of node_j, s_1(t) being the token's current score plus the transition probability from the last state of the token's node to the first state of node_j, plus the observation probability of the first state s_1 of node_j for the current speech frame O_t;
iv. computing the current local pruning threshold L_p by the formula L_p = L_b − L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;
v. if the token's score reaching the first state s_1 of node_j is greater than the current local pruning threshold L_p, executing step vi; otherwise executing step ix;
vi. producing a new token and setting its score to s_1(t);
vii. chaining this token into the token resource table headed by H_{s_1} in node_j, and checking whether node_j is in the active-node list of its layer of the lexical tree; if not, chaining it in;
viii. according to the score s_1(t), if s_1(t) − L_w > L_b, updating the local pruning baseline threshold to L_b = s_1(t);
ix. taking another child node of the token's node in the lexical tree as the current node node_j and repeating steps iii-viii until the token's expansion to all child nodes of its node in the lexical tree is finished.
10. The fast decoding method for a speech recognition system of claim 1, wherein said adaptive pruning step (6) based on the maximum local path probability comprises the steps of:
a. according to the maximum local path probabilities at the current time t and at the time t̂ corresponding to the previous valid speech frame, computing the pruning width adjustment factor L_f [the explicit formula is reproduced in the original only as an image], where n_t is the number of all valid speech frames up to time t, the current global pruning threshold L_g is the maximum local path probability at t̂, and the current local pruning baseline threshold L_b is the maximum local path probability at time t;
b. regularizing the computed adjustment factor L_f: if L_f > L_f^MAX then L_f = L_f^MAX; if L_f < L_f^MIN then L_f = L_f^MIN, where L_f^MAX is the upper bound of L_f and L_f^MIN its lower bound, both positive constants readable from the decoder configuration file set by the user;
c. according to the computed adjustment factor L_f, updating the pruning width threshold to L_w = L_f · L_w^c, L_w^c being the initial pruning width threshold obtained in initialization step (1);
d. updating the global pruning threshold of time t to L_g = L_b, in preparation for the token expansion at the next valid speech frame;
e. resetting the local pruning baseline threshold L_b to the logarithmic minimum value, in preparation for the token expansion at the next valid speech frame.
CNB021486824A 2002-11-15 2002-11-15 Rapid decoding method for voice identifying system Expired - Lifetime CN1201284C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB021486824A CN1201284C (en) 2002-11-15 2002-11-15 Rapid decoding method for voice identifying system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB021486824A CN1201284C (en) 2002-11-15 2002-11-15 Rapid decoding method for voice identifying system

Publications (2)

Publication Number Publication Date
CN1455387A true CN1455387A (en) 2003-11-12
CN1201284C CN1201284C (en) 2005-05-11

Family

ID=29257528

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB021486824A Expired - Lifetime CN1201284C (en) 2002-11-15 2002-11-15 Rapid decoding method for voice identifying system

Country Status (1)

Country Link
CN (1) CN1201284C (en)

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101420438B (en) * 2008-11-18 2011-06-22 北京航空航天大学 Three stage progressive network attack characteristic extraction method based on sequence alignment
CN102737638A (en) * 2012-06-30 2012-10-17 北京百度网讯科技有限公司 Voice decoding method and device
CN102737638B (en) * 2012-06-30 2015-06-03 北京百度网讯科技有限公司 Voice decoding method and device
CN107256706B (en) * 2012-10-04 2020-08-18 谷歌有限责任公司 Computing device and storage medium thereof
CN107256706A (en) * 2012-10-04 2017-10-17 谷歌公司 Audio language is mapped into action using grader
CN103915092A (en) * 2014-04-01 2014-07-09 百度在线网络技术(北京)有限公司 Voice identification method and device
WO2015149543A1 (en) * 2014-04-01 2015-10-08 百度在线网络技术(北京)有限公司 Voice recognition method and device
US9805712B2 (en) 2014-04-01 2017-10-31 Baidu Online Network Technology (Beijing) Co., Ltd. Method and device for recognizing voice
CN103915092B (en) * 2014-04-01 2019-01-25 百度在线网络技术(北京)有限公司 Audio recognition method and device
CN105632500B (en) * 2014-11-21 2021-06-25 三星电子株式会社 Speech recognition apparatus and control method thereof
CN105632500A (en) * 2014-11-21 2016-06-01 三星电子株式会社 Voice recognition apparatus and method of controlling the same
CN106601229A (en) * 2016-11-15 2017-04-26 华南理工大学 Voice awakening method based on soc chip
CN108550365A (en) * 2018-02-01 2018-09-18 北京云知声信息技术有限公司 The threshold adaptive method of adjustment of offline speech recognition
CN108550365B (en) * 2018-02-01 2021-04-02 云知声智能科技股份有限公司 Threshold value self-adaptive adjusting method for off-line voice recognition
CN110164421A (en) * 2018-12-14 2019-08-23 腾讯科技(深圳)有限公司 Tone decoding method, device and storage medium
WO2020119351A1 (en) * 2018-12-14 2020-06-18 腾讯科技(深圳)有限公司 Speech decoding method and apparatus, computer device and storage medium
CN110164421B (en) * 2018-12-14 2022-03-11 腾讯科技(深圳)有限公司 Voice decoding method, device and storage medium
US11935517B2 (en) 2018-12-14 2024-03-19 Tencent Technology (Shenzhen) Company Limited Speech decoding method and apparatus, computer device, and storage medium
CN110110294B (en) * 2019-03-26 2021-02-02 北京捷通华声科技股份有限公司 Dynamic reverse decoding method, device and readable storage medium
CN110110294A (en) * 2019-03-26 2019-08-09 北京捷通华声科技股份有限公司 A kind of method, apparatus and readable storage medium storing program for executing of dynamic inversely decoding
CN111640423A (en) * 2020-05-29 2020-09-08 北京声智科技有限公司 Word boundary estimation method and device and electronic equipment
CN111640423B (en) * 2020-05-29 2023-10-13 北京声智科技有限公司 Word boundary estimation method and device and electronic equipment
CN112397053A (en) * 2020-11-02 2021-02-23 腾讯科技(深圳)有限公司 Voice recognition method and device, electronic equipment and readable storage medium
CN112397053B (en) * 2020-11-02 2022-09-06 腾讯科技(深圳)有限公司 Voice recognition method and device, electronic equipment and readable storage medium
CN112259082A (en) * 2020-11-03 2021-01-22 苏州思必驰信息科技有限公司 Real-time voice recognition method and system
WO2022193892A1 (en) * 2021-03-16 2022-09-22 深圳地平线机器人科技有限公司 Speech interaction method and apparatus, and computer-readable storage medium and electronic device

Also Published As

Publication number Publication date
CN1201284C (en) 2005-05-11

Similar Documents

Publication Publication Date Title
CN1150515C (en) Speech recognition device
CN1296886C (en) Speech recognition system and method
CN1185621C (en) Speech recognition device and speech recognition method
CN1199148C (en) Voice identifying device and method, and recording medium
CN1201284C (en) Rapid decoding method for voice identifying system
CN1123863C Information check method based on speech recognition
US6529866B1 (en) Speech recognition system and associated methods
CN105118501B (en) The method and system of speech recognition
CN1157712C Speech recognition device and method, and recording medium
CN1703923A (en) Portable digital mobile communication apparatus and voice control method and system thereof
WO2001022400A1 (en) Iterative speech recognition from multiple feature vectors
CN1573926A (en) Discriminative training of language models for text and speech classification
CN1315809A (en) Apparatus and method for spelling speech recognition in mobile communication
JP2002511609A (en) Dynamically configurable acoustic models for speech recognition systems
CN1781102A (en) Low memory decision tree
CN1835075A Speech synthesis method combining natural sample selection and acoustic parameters to build models
CN1924994A Embedded speech synthesis method and system
CN1534597A (en) Speech sound identification method using change inference inversion state space model
CN1125437C (en) Speech recognition method
CN1223985C (en) Phonetic recognition confidence evaluating method, system and dictation device therewith
CN1499484A (en) Recognition system of Chinese continuous speech
CN109493846B (en) English accent recognition system
CN1190772C (en) Voice identifying system and compression method of characteristic vector set for voice identifying system
CN1284134C (en) A speech recognition system
CN1190773C (en) Voice identifying system and compression method of characteristic vector set for voice identifying system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C14 Grant of patent or utility model
GR01 Patent grant
CX01 Expiry of patent term
CX01 Expiry of patent term

Granted publication date: 20050511