CN1455387A - Rapid decoding method for voice identifying system - Google Patents


Info

Publication number
CN1455387A
CN1455387A CN02148682A
Authority
CN
China
Prior art keywords
token
node
current
state
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN02148682A
Other languages
Chinese (zh)
Other versions
CN1201284C (en)
Inventor
韩疆
颜永红
潘接林
张建平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, Beijing Kexin Technology Co Ltd filed Critical Institute of Acoustics CAS
Priority to CNB021486824A priority Critical patent/CN1201284C/en
Publication of CN1455387A publication Critical patent/CN1455387A/en
Application granted granted Critical
Publication of CN1201284C publication Critical patent/CN1201284C/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Abstract

(1) Initialize the decoding arithmetic unit. (2) Take the feature codeword vector of the next speech frame from the length-T speech feature codeword sequence input to the decoding unit; the current speech frame is O_t, 1<=t<=T. (3) Filter the current speech frame O_t. (4) Based on O_t, examine each active node of each layer I of the lexical tree token resource L_t at time t, i.e. of L_t[I]. (5) Process the tokens at word nodes of the lexical tree. (6) Adaptively adjust the pruning-related thresholds according to the maximum local path probability. (7) Repeat steps (2)-(6). The generated text string that best matches the acoustic model and the language model is output, producing the speech recognition result. Compared with traditional methods, this strategy speeds up the decoding operation.

Description

Fast decoding method for a speech recognition system
Technical field
The present invention relates to a fast decoding method for a speech recognition system.
Background art
The decoding operation is the main component of a speech recognition system. Its function is: given an acoustic model and a language model, for an input sequence of acoustic observation feature vectors, let the computer automatically find, in a statically or dynamically constructed search space, the text string that best matches the acoustic model and the language model, thereby converting the user's voice input into the corresponding text.
Fig. 1 shows the block diagram of a known speech recognition system. The analog speech is converted by analog-to-digital conversion unit 11 into a digital signal the computer can process. Feature extraction unit 12 then splits this signal into frames, typically with a frame length of 20 ms and a frame shift of 10 ms, and extracts the MFCC parameters of each frame to obtain an MFCC vector sequence. Decoding arithmetic unit 14 takes the feature vector sequence of the input speech, acoustic model 13 and language model 15, and applies a search strategy such as depth-first search (the Viterbi algorithm) or breadth-first search to obtain the recognition result. The language model applies knowledge of the linguistic level during large-vocabulary continuous speech recognition and improves the recognition accuracy of the system.
A speech recognizer built as in Fig. 1 places high demands on the central processor speed and memory size of the computer. Some current commercial dictation systems, for example IBM's ViaVoice and the dictation module in Microsoft Office XP, require a fast central processor (Intel Pentium II 400 MHz or above) and a large memory (100 MB or more). In general, the decoding operation occupies more than 90% of the central processor's computation and most of the memory of the whole recognizer, while the analog-to-digital conversion module and the feature extraction unit occupy less than 10% of the computation and very little memory.
Current commercial embedded speech recognition systems mainly perform speaker-dependent recognition based on simple template matching over small vocabularies, for example voice dialing and simple command recognition in mobile phones. Because this technique requires the user to register speech data, its ease of use and applicability are limited. Some speaker-independent embedded systems target small-vocabulary command-word recognition, but their computation and memory requirements are still large; for example, IBM's personal voice assistant recognizer requires a computing device with 50 DMIPS of processing power for a 500-word task domain.
The basic principles and concepts of the known decoding operation are as follows:
1. The lexical tree
The lexical tree is a tree structure used to organize the pronunciations of all words in the recognition system. Phonemes are the basic units of word pronunciation, and the TRIPHONE is the phoneme unit commonly used in current speech recognition systems. For example, the TRIPHONE sequence of the word "China" (zhong guo) is "sil-zh+ong zh-ong+g ong-g+uo g-uo+sil", where "sil" is a special phoneme describing pauses in the user's speech. The TRIPHONE is a context-dependent phoneme: compared with a plain pinyin representation, it can describe the pronunciation variation a phoneme undergoes in different contexts, and therefore describes the acoustic features of word pronunciations more accurately. Words in the recognition system may share the same prefix syllables or sub-words; for example, the words "central" (zhong yang) and "China" (zhong guo) share the prefix "zhong", and this sharing can be described with a tree structure. Suppose the vocabulary of the recognition system contains the five words "abe", "ab", "acg", "acgi" and "ac"; the lexical tree of this vocabulary is then as shown in Fig. 4.
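For illustration, the following sketch (hypothetical Python, not part of the patent) builds such a prefix tree over the five example words, with one letter standing in for one phoneme; the class and field names are assumptions.

# Minimal sketch of the lexical (prefix) tree described above.
class LexNode:
    def __init__(self, phone):
        self.phone = phone        # phoneme labeling the arc into this node
        self.children = {}        # phone -> LexNode
        self.word = None          # set when a word's pronunciation ends here

def build_lexical_tree(words):
    root = LexNode(phone=None)
    for w in words:
        node = root
        for phone in w:           # one letter stands in for one phoneme
            node = node.children.setdefault(phone, LexNode(phone))
        node.word = w             # mark the word node
    return root

# The five-word example vocabulary of Fig. 4:
tree = build_lexical_tree(["abe", "ab", "acg", "acgi", "ac"])
assert tree.children["a"].children["b"].word == "ab"
assert tree.children["a"].children["c"].children["g"].children["i"].word == "acgi"

Shared prefixes ("a", "ac") occupy a single node each, which is exactly why the tree organization reduces the search space relative to a flat word list.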
Each node of the lexical tree corresponds to a TRIPHONE phoneme and is associated with the Hidden Markov Model (HMM) of that TRIPHONE. Fig. 5 shows an HMM topology representing a TRIPHONE phoneme; an HMM consists of several HMM states.
2. Token definition and token expansion policy
A token denotes an active search path from the start frame of the user's speech to the current speech frame. It contains the path identification information and the score of the path's match against the acoustic model and language model, where the path identification information comprises all words on the path and their boundary information. Each token corresponds to one active search path, and different tokens differ in their acoustic context and their language context.
A movable token can reside in each state of the HMM associated with each node of the lexical tree, and each state of a node has a token linked list that stores all tokens active in that state at any time. Suppose that at time t the score of an expandable token in the token list of state i of a node is s_i(t-1). During the search, if s_i(t-1), plus the transition probability from state i to state j, plus the observation probability of state j for the current speech frame t, exceeds the current pruning threshold, a new token is produced with score s_j(t) and attached to state j. After all tokens residing on the lexical tree at time t-1 have been processed, the token resource to be expanded at time t has been produced, and all tokens residing on the tree at time t-1 are deleted.
During token propagation, hypothesized words and word boundary information are recorded in a linked list that identifies the path. Therefore, at the end time T of the speech input, the word sequence with the best match and the corresponding word boundary positions can be extracted by backtracking the path identification list held by the best-scoring token.
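The token just described can be sketched as a small data structure (hypothetical Python; the field names are assumptions, not the patent's): a score plus a backpointer chain of word/boundary entries, from which backtracking recovers the best word sequence at time T.

from dataclasses import dataclass
from typing import Optional

@dataclass
class WordLink:                    # one entry of the path-identification list
    word: str
    end_frame: int
    prev: Optional["WordLink"]

@dataclass
class Token:
    score: float                   # combined acoustic + language model log score
    history: Optional[WordLink]    # word and word-boundary information

def backtrace(best):
    """Recover the best word sequence and word boundaries from the best token."""
    words, link = [], best.history
    while link is not None:
        words.append((link.word, link.end_frame))
        link = link.prev
    return list(reversed(words))

best = Token(score=-123.4, history=WordLink("guo", 42, WordLink("zhong", 21, None)))
assert backtrace(best) == [("zhong", 21), ("guo", 42)]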
3. The token resource of a lexical tree node
Suppose a lexical tree node contains M HMM states s_1 … s_M. The token resource of the node is then defined as the set of list heads H_{s_1}, H_{s_2}, …, H_{s_M}, where H_{s_i} is the head of the token linked list of HMM state s_i in the node.
Traditional decoding methods place excessively high demands on hardware computing power and memory, and their cost-effectiveness is low.
Chinese patent application 02131086.6 discloses a compression method for the feature vector set of a speech recognition system. In the process of clustering the speech feature vectors into a codebook, it adds steps that dynamically merge and split subsets according to the number of vectors in a subset and the total distance measure of its vectors, reducing the total distance between vectors and their corresponding codewords after clustering and improving the precision of the clustering algorithm. Applying a codebook compressed by that method in a speech recognition system greatly reduces the system's storage while preserving recognition performance. That invention also discloses a speech recognition system, whose block diagram is shown in Fig. 2: the system replaces the acoustic model with a feature codebook and a probability table, so that Gaussian probabilities need not be computed during decoding; the required probability values are simply looked up in the pre-stored probability table. This significantly reduces the cost of Gaussian probability computation in the decoding operation and can thus improve the recognition speed of the system to a large extent.
Summary of the invention
The object of the present invention is to provide an improved fast decoding method for current speaker-independent large-vocabulary continuous speech recognition systems. The method further addresses the problem that current embedded speech recognition technology demands too much hardware computing power and memory relative to the market's acceptance of product prices, making current speech recognition technology applicable to embedded hardware platforms such as PDAs, mobile phones and smart phones.
The object of the present invention is achieved by the following measures:
A fast decoding method for a speech recognition system comprises the following steps:
(1) initialize the decoding arithmetic unit of the speech recognition system;
(2) take, in order, the feature codeword vector of the next speech frame from the length-T speech feature codeword sequence input to the decoding unit, and set it as the current speech frame O_t, 1≤t≤T;
(3) filter the current speech frame O_t; if O_t is filtered out, return to step (2); otherwise set O_t as the current valid speech frame;
(4) based on the current valid speech frame O_t, examine each active node of each layer L_t[I] of the lexical tree token resource L_t at time t; for tokens judged expandable, expand the tokens in the node's token resource table and chain the newly produced tokens into the token resource tables of the destination nodes, where I is the layer index, 1≤I≤H, and H is the height of the lexical tree; otherwise execute step (7);
(5) process the tokens residing at word nodes of the lexical tree;
(6) adaptively adjust the pruning-related thresholds according to the maximum local path probability at time t and the maximum local path probability at time t̂, the time of the previous valid speech frame;
(7) repeat steps (2)-(6) to obtain, at the end time T of the input speech, the global path of the best-scoring token; finish the token expansion and output the generated text string that best matches the acoustic model and the language model, producing the speech recognition result.
The present invention does not address the expansion of word-node tokens and the associated processing algorithms; the user can customize those algorithms according to the task domain (for example command-word recognition, Chinese syllable recognition, or large-vocabulary continuous speech recognition).
The lexical tree token resource L_t at time t is the sum of the token resources of all active nodes in the lexical tree at that time. Active nodes at time t are indexed by the layer they occupy in the lexical tree: all active nodes in the same layer are chained into one linked list, each layer of the tree has such a list, and the whole forms a two-dimensional linked list.
L_t[I], the I-th layer of the token resource at time t, is the I-th layer list of the lexical tree active-node token resource L_t indexed in the above manner.
The maximum local path probability at time t is the maximum of the local path scores over the set of local paths corresponding to all tokens newly produced at time t. The maximum local path probability corresponding to the previous valid speech frame is the maximum of the local path scores over the set of local paths corresponding to all tokens newly produced at time t̂, the time of the previous valid speech frame.
The initialization step (1) further comprises the following steps (see the sketch after this list):
a. produce a token with score zero and chain it into the token resource head of the root node of the lexical tree; the active nodes of the current lexical tree comprise only the root node, which lies in the first layer of the tree;
b. initialize the global pruning threshold L_g to the logarithmic minimum value;
c. initialize the local pruning baseline threshold L_b to the logarithmic minimum value;
d. initialize the pruning width threshold L_w to a positive constant L_w^c preset by the user.
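A minimal sketch of this initialization, assuming plain Python containers and using -inf as a stand-in for the logarithmic minimum value:

import math
LOG_MIN = -math.inf                 # stand-in for the logarithmic minimum

def init_decoder(L_w_c):
    root_tokens = [0.0]             # a: one token with score zero at the root's head
    active_layers = [["root"]]      # the root is the only active node, in layer 1
    L_g = LOG_MIN                   # b: global pruning threshold
    L_b = LOG_MIN                   # c: local pruning baseline threshold
    L_w = L_w_c                     # d: pruning width, user-preset positive constant
    return root_tokens, active_layers, L_g, L_b, L_w

tokens, active, L_g, L_b, L_w = init_decoder(L_w_c=80.0)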
The filtering step (3) further comprises the following steps:
a. if the current speech frame O_t at time t is the initial frame of the user's speech input, set it as a valid speech frame and finish the filtering; otherwise execute step b;
b. compare the Y feature codewords f_1^t f_2^t … f_Y^t of the current frame O_t with the Y feature codewords f_1^{t-1} f_2^{t-1} … f_Y^{t-1} of frame O_{t-1} at time t-1, obtaining a similarity value V;
c. compare V with the decision threshold θ: if V≤θ, judge O_t to be a speech frame invalid for the decoding operation; otherwise judge O_t to be a speech frame valid for the decoding operation.
The decision threshold θ is a positive constant set by the user.
The node token resource expansion step (4) further comprises the following steps:
a. based on the current valid speech frame O_t, expand each token in the token resource list of the last HMM state of the current node outward, i.e. into the token resource tables of all child nodes of this node in the lexical tree;
b. take one HMM state of the M-state HMM associated with the current node as the current HMM state s_n, where 1≤n≤M;
c. take one token from the token resource table of state s_n as the current token;
d. if the score of the current token of state s_n is greater than the global pruning threshold L_g corresponding to the previous valid speech frame t̂, then, according to the topology of the HMM associated with the current node, take one state reachable from s_n as the current target state s_m; otherwise go to step k;
e. compute the score s_m(t) of the token moving from s_n to s_m: s_m(t) is the token's current score plus the transition probability from s_n to s_m, plus the observation probability of s_m for the current speech frame O_t; this observation probability is obtained by a table lookup in the probability table input to the decoding unit;
f. compute the current local pruning threshold L_p by the formula L_p = L_b − L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;
g. if the token's score from s_n to s_m is greater than the current local pruning threshold L_p, produce a new token and set its score to s_m(t); otherwise execute step j;
h. chain the new token produced in step g into the token resource table headed by H_{s_m} in the target node, and check whether that node is in the active-node list of its layer of the lexical tree; if not, chain it in;
i. based on the score s_m(t) of the new token, if s_m(t) − L_w > L_b holds, update the local pruning baseline threshold to L_b = s_m(t);
j. take another state reachable from s_n, set it as the current target state s_m, and repeat steps e-i until all states reachable from s_n have been processed; then go to step k;
k. take another token from the token resource table of s_n as the current token and repeat steps d-j until the expansion of all tokens in the table of s_n is finished; then go to step l;
l. take another HMM state of the M-state HMM associated with the node as the current HMM state s_n, 1≤n≤M, and repeat steps c-k until the expansion of all token resources of the current node is finished.
Step a of the node token resource expansion step (4) comprises the following steps:
i. if the current score of the token is less than or equal to the global pruning threshold L_g corresponding to the previous valid speech frame t̂, the token need not be expanded to the child nodes of its node; otherwise execute step ii;
ii. take the j-th of the J child nodes of the token's node in the lexical tree, node_j, as the current node;
iii. accumulate the score s_1(t) of the token reaching the first state s_1 of node_j: s_1(t) is the token's current score plus the transition probability from the last state of the token's node to the first state of node_j, plus the observation probability of the first state s_1 of node_j for the current speech frame O_t; this observation probability is obtained by a table lookup in the probability table input to the decoding unit;
iv. compute the current local pruning threshold L_p by the formula L_p = L_b − L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;
v. if the token's score reaching the first state s_1 of node_j is greater than the current local pruning threshold L_p, execute step vi; otherwise execute step ix;
vi. produce a new token and set its score to s_1(t);
vii. chain this token into the token resource table headed by H_{s_1} in node_j, and check whether node_j is in the active-node list of its layer of the lexical tree; if not, chain it in;
viii. based on the score s_1(t), if s_1(t) − L_w > L_b, update the local pruning baseline threshold to L_b = s_1(t);
ix. take another child node of the token's node in the lexical tree as the current node node_j and repeat steps iii-viii until the token's expansion to all child nodes of its node in the lexical tree is finished.
The adaptive pruning step (6) based on the maximum local path probability comprises the following steps:
a. according to the maximum local path probabilities at the current time t and at the time t̂ corresponding to the previous valid speech frame, compute the pruning width adjustment factor L_f [the explicit formula is reproduced in the original only as an image], where n_t denotes the number of all valid speech frames up to time t, the current global pruning threshold L_g is the maximum local path probability at t̂, and the current local pruning baseline threshold L_b is the maximum local path probability at time t;
b. regularize the computed adjustment factor L_f: if L_f > L_f^MAX then L_f = L_f^MAX; if L_f < L_f^MIN then L_f = L_f^MIN, where L_f^MAX is the upper bound of L_f and L_f^MIN its lower bound, both positive constants that can be set by the user;
c. according to the computed adjustment factor L_f, update the pruning width threshold to L_w = L_f · L_w^c, where L_w^c is the initial pruning width threshold obtained in initialization step (1);
d. update the global pruning threshold of time t to L_g = L_b, in preparation for the token expansion at the next valid speech frame;
e. reset the local pruning baseline threshold L_b to the logarithmic minimum value, in preparation for the token expansion at the next valid speech frame.
Compared with the prior art, the present invention has the following advantages:
Compared with traditional decoding methods, the present invention adds the following improvements: an adaptive pruning strategy based on the maximum local path probability, and a speech frame filtering strategy based on feature codeword vectors.
The decoding operation of the present invention adopts a breadth-first search framework with pruning based on the lexical tree and token expansion. The computational complexity of this algorithm is O(MT), where T is the number of speech frames entering the search and M is the average number of active paths per speech frame during the search.
Traditional decoding performs the search over all speech frames of the user's input. In fact, the user's speech input is a time-varying signal with local stationarity; therefore, in the stationary segments of the input, some frames that are similar to their neighboring frames can be removed, i.e. kept out of the search, without affecting decoding precision. To this end, the present invention provides a speech frame filtering strategy based on the frame's feature codeword vector. This strategy effectively removes frames that are redundant for the search, so that the number of frames actually entering the search is smaller than the number of frames contained in the user's speech input. From the above formula it follows that, compared with the traditional method, this strategy speeds up the decoding operation.
On the other hand, the above formula also shows that the search speed depends on M, the average number of active paths per frame: with T fixed, the larger M is, the larger the search cost, and the smaller M is, the smaller the cost. The size of M depends on the pruning strategy adopted by the search. To this end, the present invention provides an adaptive pruning strategy based on the maximum local path probability which, compared with the traditional method, effectively reduces M without noticeably affecting recognition accuracy, thereby further speeding up the decoding operation.
Description of drawings
Fig. 1 is the block diagram of a known speech recognition system
Fig. 2 is the block diagram of another known speech recognition system
Fig. 3 is the decoding operation flowchart of the present invention
Fig. 4 is the structure diagram of a known lexical tree
Fig. 5 is a schematic diagram of the HMM topology of a known TRIPHONE phoneme
Specific embodiments
Fig. 3 gives the decoding operation flowchart of the present invention. Referring to Fig. 2 in combination with Fig. 3, the operation flow of a speech recognition system based on the decoding arithmetic system of the present invention is: the input analog speech signal is converted into a digital signal; the digital signal is split into frames and the feature parameters of each frame are extracted, one feature vector per frame, yielding the feature vector sequence of the input speech; the feature codebook is used to quantize this feature vector sequence, one feature codeword vector per frame, yielding the corresponding feature codeword vector sequence; the codeword vector sequence is fed into the speech frame filtering unit of the decoding system, which removes the codeword vectors of invalid speech frames and yields the valid-frame feature codeword vector sequence; the search is then run over this sequence to obtain the recognition result. During the search, the adaptive pruning strategy based on the maximum local path probability prunes local search paths, and for each codeword of the valid-frame codeword vector sequence the observation probability on a (partial) search path is found directly in the probability table.
Definition of the lexical tree active-node token resource and its indexing
In the present invention, a node of the lexical tree that holds an active token at any time t is called an active node at time t. Active nodes at time t are indexed by the layer they occupy in the lexical tree: all active nodes in the same layer are chained into one linked list, each layer of the tree has such a list, and the whole forms a two-dimensional linked list (see the sketch below).
At any time t, the sum of the token resources of all active nodes in the lexical tree is called the lexical tree active-node token resource at time t; it specifies the token resource to be expanded at time t. For convenience of the following description, the active-node token resource at time t indexed as above is denoted L_t, with index variable I (1≤I≤H), where H is the height of the lexical tree.
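A small sketch of this layer-indexed structure (hypothetical Python, with lists standing in for the linked lists of the text; the tree height is an assumed example value):

H = 4                                   # assumed tree height for the example
active = [[] for _ in range(H)]         # active[I-1]: active nodes of layer I

def activate(node_id, layer_I):
    """Chain a node into its layer's active list if not already present."""
    if node_id not in active[layer_I - 1]:
        active[layer_I - 1].append(node_id)

activate("root", 1)
activate("zh", 2); activate("zh", 2)    # the second call is a no-op
assert active[0] == ["root"] and active[1] == ["zh"]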
Based on the above basic principles and concepts of the decoding operation, an embodiment of the fast decoding method of the present invention is given.
The decoding method of the present invention comprises the following steps:
1. initialize the decoding arithmetic unit of the speech recognition system;
2. take the feature codeword vector of the first speech frame from the length-T speech feature codeword vector sequence input to the decoding unit, and set it as the current speech frame O_t (t=1);
3. apply the filtering operation to the current speech frame O_t;
4. if O_t is a valid speech frame, then for each layer I (1≤I≤H) of the lexical tree token resource L_t at time t, expand the tokens in the token resource table of each active node of L_t[I] and chain the newly produced tokens into the token resource tables of the destination nodes; otherwise go to step 7;
5. process the tokens at word nodes; the present invention does not address the expansion of word-node tokens and the associated processing algorithms, which the user can customize according to the task domain (for example command-word recognition, Chinese syllable recognition, or large-vocabulary continuous speech recognition);
6. adaptively adjust the pruning-related thresholds, including the global pruning threshold L_g, the local pruning baseline threshold L_b and the pruning width threshold L_w, according to the maximum local path probabilities at time t and at the time t̂ of the previous valid speech frame;
7. take the next speech frame from the length-T feature sequence input to the decoding unit; if one is available, set it as the current speech frame O_t (t≤T) and go to step 3; otherwise go to step 8;
8. finish the token expansion and produce the recognition result: backtrack the global path of the best-scoring token at time T, and output the text string that best matches the acoustic model and the language model.
The sub-steps of the expansion of the current node's token resource in step 4 of the above decoding method are as follows (a sketch follows the list):
a. for each token in the token resource list of the last HMM state of the node's associated HMM, expand the token into the token resource tables of all child nodes of this node in the lexical tree;
b. take the first HMM state of the node's M-state HMM as the current HMM state s_n (n=1);
c. take one token from the token resource table of state s_n as the current token;
d. if the current score of the token is greater than the global pruning threshold L_g corresponding to the previous valid speech frame t̂, then, according to the topology of the HMM associated with the current node, take one state reachable from s_n and set it as the current target state s_m; otherwise go to step k;
e. compute the score s_m(t) of the token reaching state s_m as: the token's current score plus the transition probability from s_n to s_m, plus the observation probability of s_m for the current speech frame O_t; this observation probability is obtained by a table lookup in the probability table input to the decoding unit;
f. compute the current local pruning threshold L_p by the formula L_p = L_b − L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;
g. if the token's score reaching s_m is less than or equal to the current local pruning threshold L_p, go to step j; otherwise produce a new token and set its score to s_m(t);
h. chain this token into the token resource table headed by H_{s_m} in the node, and check whether the node is in the active-node list of its layer of the lexical tree; if not, chain it in;
i. update the local pruning baseline threshold L_b according to the score s_m(t): if s_m(t) − L_w > L_b, set L_b = s_m(t); otherwise do not update;
j. take another state s_m reachable from s_n; if one is obtained, set it as the current target state and go to step e; otherwise go to step k;
k. take another token from the token resource table of s_n; if one is obtained, set it as the current token and go to step d; otherwise go to step l;
l. take the next HMM state of the node's M-state HMM; if one is obtained, set it as the current HMM state s_n (n≤M) and go to step c; otherwise the expansion of the current node's token resource is finished.
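The within-node part of this expansion (steps b-l) can be made concrete with a short sketch. This is a hypothetical Python illustration under assumed data layouts, not the patent's implementation: token lists are dicts of score lists, trans maps a state to its reachable states with log transition probabilities, and obs holds the observation log probabilities looked up from the probability table; word-history bookkeeping is omitted.

import math
LOG_MIN = -math.inf

def expand_node_tokens(token_lists, trans, obs, L_g, L_b, L_w):
    """token_lists: state -> scores of tokens resident at time t-1.
    trans: state -> [(reachable state, log transition prob), ...].
    obs: state -> log observation prob for the current frame (table lookup).
    Returns (token lists for time t, updated local baseline L_b)."""
    new_lists = {}
    for s_n, tokens in token_lists.items():
        for score in tokens:
            if score <= L_g:                      # step d: global pruning vs L_g of t̂
                continue
            for s_m, a_nm in trans.get(s_n, []):  # states reachable from s_n
                s_m_t = score + a_nm + obs[s_m]   # step e: accumulate the score
                if s_m_t > L_b - L_w:             # steps f-g: local pruning test
                    new_lists.setdefault(s_m, []).append(s_m_t)  # step h
                    if s_m_t - L_w > L_b:         # step i: raise the baseline
                        L_b = s_m_t
    return new_lists, L_b

# Toy 2-state left-to-right HMM:
tokens = {0: [-1.0]}
trans = {0: [(0, -0.5), (1, -1.0)]}
obs = {0: -0.3, 1: -0.2}
new_tokens, L_b = expand_node_tokens(tokens, trans, obs, LOG_MIN, LOG_MIN, 50.0)

Note that L_b may rise while the frame is still being processed, so the local threshold L_p = L_b − L_w tightens as better partial paths appear within the same frame.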
The steps in sub-step a of step 4 of the above decoding method, expanding the current token to all child nodes of its node, are as follows (a sketch follows the list):
1. if the score of the current token is less than or equal to the global pruning threshold L_g corresponding to the previous valid speech frame t̂, the token need not be expanded to the child nodes of its node; otherwise go to step 2;
2. take one child node of the token's node as the current node node_j (j=1);
3. accumulate the score s_1(t) of the token reaching the first state s_1 of node_j as: the token's current score plus the transition probability from the last state of the token's node to the first state of node_j, plus the observation probability of the first state s_1 of node_j for the current speech frame O_t; this observation probability is obtained by a table lookup in the probability table input to the decoding unit;
4. compute the current local pruning threshold L_p by the formula L_p = L_b − L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;
5. if the token's score reaching the first state s_1 of node_j is less than or equal to the current local pruning threshold L_p, go to step 9; otherwise execute step 6;
6. produce a new token and set its score to s_1(t);
7. chain this token into the token resource table headed by H_{s_1} in node_j, and check whether node_j is in the active-node list of its layer of the lexical tree; if not, chain it in;
8. update the local pruning baseline threshold L_b according to the score s_1(t): if s_1(t) − L_w > L_b, set L_b = s_1(t); otherwise do not update;
9. take another child node of the token's node; if one is obtained, set it as the current node node_j (j≤N, where N is the number of child nodes of the token's node in the lexical tree) and go to step 3; otherwise the expansion of the current token to all child nodes of its node is finished.
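Sub-steps 1-9 can be sketched the same way; again a hypothetical illustration under the same assumed data layout, covering the global-pruning gate, the score accumulation into each child's first state, the local pruning test and the baseline update.

import math
LOG_MIN = -math.inf

def expand_to_children(token_score, children, exit_trans, obs_first, L_g, L_b, L_w):
    """token_score: score of a token in the last HMM state of the current node.
    children: child-node ids of this node in the lexical tree.
    exit_trans[child]: log prob from this node's last state to child's first state.
    obs_first[child]: observation log prob of child's first state (table lookup).
    Returns ({child: new token score}, updated local baseline L_b)."""
    if token_score <= L_g:                # sub-step 1: global pruning
        return {}, L_b
    new_tokens = {}
    for child in children:                # sub-steps 2 and 9: every child node
        s_1_t = token_score + exit_trans[child] + obs_first[child]  # sub-step 3
        if s_1_t > L_b - L_w:             # sub-steps 4-5: local pruning test
            new_tokens[child] = s_1_t     # sub-steps 6-7: new token in node_j
            if s_1_t - L_w > L_b:         # sub-step 8: raise the baseline
                L_b = s_1_t
    return new_tokens, L_b

new_toks, L_b = expand_to_children(-1.0, ["b", "c"],
                                   {"b": -0.7, "c": -1.2},
                                   {"b": -0.4, "c": -0.3},
                                   LOG_MIN, LOG_MIN, 50.0)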
Step 1 of the above decoding method, initializing the decoding arithmetic unit, comprises the following steps:
1. produce a token with score zero and chain it into the token resource head of the root node of the lexical tree; the active nodes of the current lexical tree comprise only the root node, which lies in the first layer of the tree;
2. initialize the global pruning threshold L_g to the logarithmic minimum value;
3. initialize the local pruning baseline threshold L_b to the logarithmic minimum value;
4. initialize the pruning width threshold L_w to a positive constant L_w^c; this value can be read from the decoding unit configuration file set by the user.
Step 3 of the above decoding method, the filtering operation on the current speech frame, comprises the following steps (a sketch follows the list):
1. if the current speech frame O_t at time t is the initial frame of the user's speech input, set it as a valid speech frame and finish the filtering; otherwise execute step 2;
2. compare the feature codeword vector f_1^t f_2^t … f_Y^t of the current frame O_t with the feature codeword vector f_1^{t-1} f_2^{t-1} … f_Y^{t-1} of frame O_{t-1} at time t-1 to obtain a similarity value V, where Y is the number of feature codewords in a frame's codeword vector; V is computed as V = Σ_{i=1}^{Y} C(f_i^t, f_i^{t-1}), where C(f_i^t, f_i^{t-1}) is defined by a formula reproduced in the original only as an image;
3. compare V with the decision threshold θ (a positive constant set by the user, readable from the decoding unit configuration file): if V≤θ, judge O_t to be a speech frame invalid for the decoding operation; otherwise judge O_t to be a speech frame valid for the decoding operation.
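A minimal runnable sketch of the filter follows. Since the definition of C survives only as an image, the sketch assumes C indicates that the i-th codewords differ, which is the reading consistent with the rule that V≤θ marks a frame as redundant; this assumption and the example values are not from the patent.

def is_valid_frame(frame, prev_frame, theta):
    """frame, prev_frame: length-Y sequences of feature codeword indices."""
    if prev_frame is None:          # first frame of the utterance: always valid
        return True
    # Assumed C: 1 if the i-th codewords differ, 0 if they are equal.
    V = sum(1 for a, b in zip(frame, prev_frame) if a != b)
    return V > theta                # V <= theta: too similar, filter the frame out

# Example: with theta = 2, a frame differing in only one codeword is dropped.
assert not is_valid_frame([3, 7, 7, 2], [3, 7, 5, 2], theta=2)
assert is_valid_frame([3, 1, 9, 4], [3, 7, 5, 2], theta=2)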
Experimental results show that the above frame filtering removes 20%-30% of the invalid frames in the speech feature codeword vector sequence, speeding up decoding relative to the traditional method, while recognition performance remains insensitive to the user's speaking rate. This is because for a slower speaker the filtering removes more invalid frames, while for a faster speaker it removes fewer; that is, the filtering normalizes, to a certain degree, the speaking rates of different users.
Step 6 of the above decoding method further comprises the following steps. From the first five steps of the decoding unit it is known that the current global pruning threshold L_g is the maximum local path probability at the time t̂ of the previous valid speech frame, and the current local pruning baseline threshold L_b is the maximum local path probability at time t. Accordingly, step 6 is executed in the following sub-steps (a sketch follows the list):
1. according to the maximum local path probabilities at times t and t̂, compute the pruning width adjustment factor L_f [the explicit formula is reproduced in the original only as an image], where n_t is the number of all valid speech frames up to time t;
2. regularize the computed adjustment factor L_f: if L_f > L_f^MAX then L_f = L_f^MAX; if L_f < L_f^MIN then L_f = L_f^MIN, where L_f^MAX is the upper bound of L_f (for example 1.05) and L_f^MIN its lower bound (for example 0.5), both positive constants readable from the decoder configuration file set by the user;
3. according to the computed adjustment factor L_f, update the pruning width threshold to L_w = L_f · L_w^c;
4. update the global pruning threshold of time t to L_g = L_b, in preparation for the token expansion at the next valid speech frame;
5. reset the local pruning baseline threshold L_b to the logarithmic minimum value, in preparation for the token expansion at the next valid speech frame.
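Sub-steps 2-5 are fully specified and can be sketched as follows (hypothetical Python); since the formula for L_f itself survives only as an image, the factor is passed in as a parameter rather than computed.

import math
LOG_MIN = -math.inf

def adapt_thresholds(L_f, L_w_c, L_b, L_f_min=0.5, L_f_max=1.05):
    """Returns (new L_w, new L_g, reset L_b). L_w_c is the initial pruning
    width; L_f_min/L_f_max are the user-set bounds (example values from the
    text above)."""
    L_f = min(max(L_f, L_f_min), L_f_max)  # sub-step 2: regularize the factor
    L_w = L_f * L_w_c                      # sub-step 3: new pruning width
    L_g = L_b                              # sub-step 4: next frame's global threshold
    L_b = LOG_MIN                          # sub-step 5: reset the local baseline
    return L_w, L_g, L_b

L_w, L_g, L_b = adapt_thresholds(L_f=0.9, L_w_c=80.0, L_b=-42.0)
assert (L_w, L_g, L_b) == (72.0, -42.0, LOG_MIN)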
In the traditional search algorithm the pruning width threshold L_w is a constant. In the present invention, after the search over the current valid speech frame, L_w is adaptively adjusted according to the maximum local path probability, realizing adaptive pruning of local paths when the next valid frame is searched. Experimental results show that, without affecting recognition accuracy, this method effectively reduces the average token count M in decoding (by 10%-20%), thereby further speeding up the decoding operation.

Claims (10)

1. A fast decoding method for a speech recognition system, comprising the steps of:
(1) initializing the decoding arithmetic unit of the speech recognition system;
(2) taking, in order, the feature codeword vector of the next speech frame from the length-T speech feature codeword sequence input to the decoding unit and setting it as the current speech frame O_t, 1≤t≤T;
(3) filtering the current speech frame O_t; if O_t is filtered out, executing step (2); otherwise setting O_t as the current valid speech frame;
(4) based on the current valid speech frame O_t, examining each active node of each layer L_t[I] of the lexical tree token resource L_t at time t, expanding the tokens judged expandable in the node's token resource table, and chaining the newly produced tokens into the token resource tables of the destination nodes, where I is the layer index, 1≤I≤H, and H is the height of the lexical tree; otherwise executing step (7);
(5) processing the tokens residing at word nodes of the lexical tree;
(6) adaptively adjusting the pruning-related thresholds according to the maximum local path probability at time t and the maximum local path probability at the time t̂ of the previous valid speech frame;
(7) repeating steps (2)-(6) to obtain, at the end time T of the input speech, the global path of the best-scoring token, finishing the token expansion, and outputting the generated text string that best matches the acoustic model and the language model, producing the speech recognition result.
2. The fast decoding method for a speech recognition system of claim 1, wherein said lexical tree token resource L_t at time t is the sum of the token resources of all active nodes in the lexical tree at that time.
3. The fast decoding method for a speech recognition system of claim 1, wherein said maximum local path probability at time t is the maximum of the local path scores over the set of local paths corresponding to all tokens newly produced at time t.
4. The fast decoding method for a speech recognition system of claim 1, wherein said maximum local path probability corresponding to the previous valid speech frame is the maximum of the local path scores over the set of local paths corresponding to all tokens newly produced at the time t̂ of the previous valid speech frame.
5. The fast decoding method for a speech recognition system of claim 1, wherein said initialization step (1) further comprises the steps of:
a. producing a token with score zero and chaining it into the token resource head of the root node of the lexical tree, the active nodes of the current lexical tree comprising only the root node, which lies in the first layer of the tree;
b. initializing the global pruning threshold L_g to the logarithmic minimum value;
c. initializing the local pruning baseline threshold L_b to the logarithmic minimum value;
d. initializing the pruning width threshold L_w to a positive constant L_w^c preset by the user.
6. The fast decoding method for a speech recognition system of claim 1, wherein said filtering step (3) further comprises the steps of:
a. if the current speech frame O_t at time t is the initial frame of the user's speech input, setting it as a valid speech frame and finishing the filtering; otherwise executing step b;
b. comparing the Y feature codewords f_1^t f_2^t … f_Y^t of the current frame O_t with the Y feature codewords f_1^{t-1} f_2^{t-1} … f_Y^{t-1} of frame O_{t-1} at time t-1, obtaining a similarity value V;
c. comparing V with the decision threshold θ: if V≤θ, judging O_t to be a speech frame invalid for the decoding operation; otherwise judging O_t to be a speech frame valid for the decoding operation.
7. The fast decoding method for a speech recognition system of claim 1, wherein said decision threshold θ is a positive constant set by the user.
8. The fast decoding method for a speech recognition system of claim 1, wherein said node token resource expansion step (4) further comprises the steps of:
a. based on the current valid speech frame O_t, expanding each token in the token resource list of the last HMM state of the current node outward, i.e. into the token resource tables of all child nodes of this node in the lexical tree;
b. taking one HMM state of the M-state HMM associated with the current node as the current HMM state s_n, where 1≤n≤M;
c. taking one token from the token resource table of state s_n as the current token;
d. if the score of the current token of state s_n is greater than the global pruning threshold L_g corresponding to the previous valid speech frame t̂, taking, according to the topology of the HMM associated with the current node, one state reachable from s_n as the current target state s_m; otherwise going to step k;
e. computing the score s_m(t) of the token moving from s_n to s_m, s_m(t) being the token's current score plus the transition probability from s_n to s_m, plus the observation probability of s_m for the current speech frame O_t;
f. computing the current local pruning threshold L_p by the formula L_p = L_b − L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;
g. if the token's score from s_n to s_m is greater than the current local pruning threshold L_p, producing a new token and setting its score to s_m(t); otherwise executing step j;
h. chaining the new token produced in step g into the token resource table headed by H_{s_m} in the target node, and checking whether that node is in the active-node list of its layer of the lexical tree; if not, chaining it in;
i. based on the score s_m(t) of the new token, if s_m(t) − L_w > L_b, updating the local pruning baseline threshold to L_b = s_m(t);
j. taking another state reachable from s_n, setting it as the current target state s_m, and repeating steps e-i until all states reachable from s_n have been processed, then going to step k;
k. taking another token from the token resource table of s_n as the current token and repeating steps d-j until the expansion of all tokens in the table of s_n is finished, then going to step l;
l. taking another HMM state of the M-state HMM associated with the node as the current HMM state s_n, 1≤n≤M, and repeating steps c-k until all token resource expansion of the current node is finished.
9. The fast decoding method for a speech recognition system of claim 8, wherein step a of said node token resource expansion step (4) comprises the steps of:
i. if the current score of the token is less than or equal to the global pruning threshold L_g corresponding to the previous valid speech frame t̂, the token need not be expanded to the child nodes of its node; otherwise executing step ii;
ii. taking the j-th of the J child nodes of the token's node as the current node node_j;
iii. accumulating the score s_1(t) of the token reaching the first state s_1 of node_j, s_1(t) being the token's current score plus the transition probability from the last state of the token's node to the first state of node_j, plus the observation probability of the first state s_1 of node_j for the current speech frame O_t;
iv. computing the current local pruning threshold L_p by the formula L_p = L_b − L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;
v. if the token's score reaching the first state s_1 of node_j is greater than the current local pruning threshold L_p, executing step vi; otherwise executing step ix;
vi. producing a new token and setting its score to s_1(t);
vii. chaining this token into the token resource table headed by H_{s_1} in node_j, and checking whether node_j is in the active-node list of its layer of the lexical tree; if not, chaining it in;
viii. according to the score s_1(t), if s_1(t) − L_w > L_b, updating the local pruning baseline threshold to L_b = s_1(t);
ix. taking another child node of the token's node in the lexical tree as the current node node_j and repeating steps iii-viii until the token's expansion to all child nodes of its node in the lexical tree is finished.
10. The fast decoding method for a speech recognition system of claim 1, wherein said adaptive pruning step (6) based on the maximum local path probability comprises the steps of:
a. according to the maximum local path probabilities at the current time t and at the time t̂ corresponding to the previous valid speech frame, computing the pruning width adjustment factor L_f [the explicit formula is reproduced in the original only as an image], where n_t is the number of all valid speech frames up to time t, the current global pruning threshold L_g is the maximum local path probability at t̂, and the current local pruning baseline threshold L_b is the maximum local path probability at time t;
b. regularizing the computed adjustment factor L_f: if L_f > L_f^MAX then L_f = L_f^MAX; if L_f < L_f^MIN then L_f = L_f^MIN, where L_f^MAX is the upper bound of L_f and L_f^MIN its lower bound, both positive constants readable from the decoder configuration file set by the user;
c. according to the computed adjustment factor L_f, updating the pruning width threshold to L_w = L_f · L_w^c, L_w^c being the initial pruning width threshold obtained in initialization step (1);
d. updating the global pruning threshold of time t to L_g = L_b, in preparation for the token expansion at the next valid speech frame;
e. resetting the local pruning baseline threshold L_b to the logarithmic minimum value, in preparation for the token expansion at the next valid speech frame.
CNB021486824A 2002-11-15 2002-11-15 Rapid decoding method for voice identifying system Expired - Lifetime CN1201284C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB021486824A CN1201284C (en) 2002-11-15 2002-11-15 Rapid decoding method for voice identifying system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB021486824A CN1201284C (en) 2002-11-15 2002-11-15 Rapid decoding method for voice identifying system

Publications (2)

Publication Number Publication Date
CN1455387A true CN1455387A (en) 2003-11-12
CN1201284C CN1201284C (en) 2005-05-11

Family

ID=29257528

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB021486824A Expired - Lifetime CN1201284C (en) 2002-11-15 2002-11-15 Rapid decoding method for voice identifying system

Country Status (1)

Country Link
CN (1) CN1201284C (en)

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101420438B (en) * 2008-11-18 2011-06-22 北京航空航天大学 Three stage progressive network attack characteristic extraction method based on sequence alignment
CN102737638A (en) * 2012-06-30 2012-10-17 北京百度网讯科技有限公司 Voice decoding method and device
CN102737638B (en) * 2012-06-30 2015-06-03 北京百度网讯科技有限公司 Voice decoding method and device
CN107256706B (en) * 2012-10-04 2020-08-18 谷歌有限责任公司 Computing device and storage medium thereof
CN107256706A (en) * 2012-10-04 2017-10-17 谷歌公司 Audio language is mapped into action using grader
CN103915092A (en) * 2014-04-01 2014-07-09 百度在线网络技术(北京)有限公司 Voice identification method and device
WO2015149543A1 (en) * 2014-04-01 2015-10-08 百度在线网络技术(北京)有限公司 Voice recognition method and device
US9805712B2 (en) 2014-04-01 2017-10-31 Baidu Online Network Technology (Beijing) Co., Ltd. Method and device for recognizing voice
CN103915092B (en) * 2014-04-01 2019-01-25 百度在线网络技术(北京)有限公司 Audio recognition method and device
CN105632500B (en) * 2014-11-21 2021-06-25 三星电子株式会社 Speech recognition apparatus and control method thereof
CN105632500A (en) * 2014-11-21 2016-06-01 三星电子株式会社 Voice recognition apparatus and method of controlling the same
CN106601229A (en) * 2016-11-15 2017-04-26 华南理工大学 Voice awakening method based on soc chip
CN108550365A (en) * 2018-02-01 2018-09-18 北京云知声信息技术有限公司 The threshold adaptive method of adjustment of offline speech recognition
CN108550365B (en) * 2018-02-01 2021-04-02 云知声智能科技股份有限公司 Threshold value self-adaptive adjusting method for off-line voice recognition
CN110164421A (en) * 2018-12-14 2019-08-23 腾讯科技(深圳)有限公司 Tone decoding method, device and storage medium
WO2020119351A1 (en) * 2018-12-14 2020-06-18 腾讯科技(深圳)有限公司 Speech decoding method and apparatus, computer device and storage medium
CN110164421B (en) * 2018-12-14 2022-03-11 腾讯科技(深圳)有限公司 Voice decoding method, device and storage medium
US11935517B2 (en) 2018-12-14 2024-03-19 Tencent Technology (Shenzhen) Company Limited Speech decoding method and apparatus, computer device, and storage medium
CN110110294B (en) * 2019-03-26 2021-02-02 北京捷通华声科技股份有限公司 Dynamic reverse decoding method, device and readable storage medium
CN110110294A (en) * 2019-03-26 2019-08-09 北京捷通华声科技股份有限公司 A kind of method, apparatus and readable storage medium storing program for executing of dynamic inversely decoding
CN111640423A (en) * 2020-05-29 2020-09-08 北京声智科技有限公司 Word boundary estimation method and device and electronic equipment
CN111640423B (en) * 2020-05-29 2023-10-13 北京声智科技有限公司 Word boundary estimation method and device and electronic equipment
CN112397053A (en) * 2020-11-02 2021-02-23 腾讯科技(深圳)有限公司 Voice recognition method and device, electronic equipment and readable storage medium
CN112397053B (en) * 2020-11-02 2022-09-06 腾讯科技(深圳)有限公司 Voice recognition method and device, electronic equipment and readable storage medium
CN112259082A (en) * 2020-11-03 2021-01-22 苏州思必驰信息科技有限公司 Real-time voice recognition method and system
WO2022193892A1 (en) * 2021-03-16 2022-09-22 深圳地平线机器人科技有限公司 Speech interaction method and apparatus, and computer-readable storage medium and electronic device

Also Published As

Publication number Publication date
CN1201284C (en) 2005-05-11

Similar Documents

Publication Publication Date Title
CN1150515C (en) Speech recognition device
CN1296886C (en) Speech recognition system and method
CN1185621C (en) Speech recognition device and speech recognition method
CN1199148C (en) Voice identifying device and method, and recording medium
CN1201284C (en) Rapid decoding method for voice identifying system
CN1123863C Information check method based on speech recognition
US6529866B1 (en) Speech recognition system and associated methods
CN105118501B (en) The method and system of speech recognition
CN1157712C Speech recognition device and method, and recording medium
CN1703923A (en) Portable digital mobile communication apparatus and voice control method and system thereof
WO2001022400A1 (en) Iterative speech recognition from multiple feature vectors
CN1573926A (en) Discriminative training of language models for text and speech classification
CN1315809A (en) Apparatus and method for spelling speech recognition in mobile communication
JP2002511609A (en) Dynamically configurable acoustic models for speech recognition systems
CN1781102A (en) Low memory decision tree
CN1835075A Speech synthesis method combining natural sample selection and acoustic parameters to build models
CN1924994A Embedded speech synthesis method and system
CN1534597A (en) Speech sound identification method using change inference inversion state space model
CN1125437C (en) Speech recognition method
CN1223985C (en) Phonetic recognition confidence evaluating method, system and dictation device therewith
CN1499484A (en) Recognition system of Chinese continuous speech
CN109493846B (en) English accent recognition system
CN1190772C (en) Voice identifying system and compression method of characteristic vector set for voice identifying system
CN1284134C (en) A speech recognition system
CN1190773C (en) Voice identifying system and compression method of characteristic vector set for voice identifying system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C14 Grant of patent or utility model
GR01 Patent grant
CX01 Expiry of patent term
CX01 Expiry of patent term

Granted publication date: 20050511