CN1201284C - Rapid decoding method for voice identifying system - Google Patents


Info

Publication number
CN1201284C
CN1201284C · CNB021486824A · CN02148682A
Authority
CN
China
Prior art keywords
token
node
current
pruning
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
CNB021486824A
Other languages
Chinese (zh)
Other versions
CN1455387A (en)
Inventor
韩疆 (Han Jiang)
颜永红 (Yan Yonghong)
潘接林 (Pan Jielin)
张建平 (Zhang Jianping)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, Beijing Kexin Technology Co Ltd filed Critical Institute of Acoustics CAS
Priority to CNB021486824A priority Critical patent/CN1201284C/en
Publication of CN1455387A publication Critical patent/CN1455387A/en
Application granted granted Critical
Publication of CN1201284C publication Critical patent/CN1201284C/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The present invention relates to a rapid decoding method for a speech recognition system, comprising the following steps. First, the decoding operation unit in the speech recognition system is initialized. Second, the feature codeword vector of the next speech frame is taken in sequence from a speech feature codeword sequence of length T input into the decoding operation unit and set as the current speech frame O_t, 1 ≤ t ≤ T. Third, the current speech frame O_t is filtered. Fourth, based on the current valid frame O_t^V, each active node in each layer I of the time-t dictionary-tree token resource L_t is examined. Fifth, the tokens at dictionary-tree word nodes are processed. Sixth, the pruning-related thresholds are adaptively adjusted according to the local-path maximum probability at time t and the local-path maximum probability of the previous valid speech frame. Seventh, steps two through six are repeated, and the text string that best matches the acoustic model and the language model is output as the speech recognition result. Compared with the traditional method, the rapid decoding method accelerates the decoding computation.

Description

Quick decoding method in speech recognition system
Technical Field
The invention relates to a fast decoding method in a speech recognition system.
Background
The decoding operation is a main component of a speech recognition system. Its function is: given an acoustic model and a language model, for an input sequence of acoustic observation feature vectors, the computer automatically finds, in a statically or dynamically constructed search space, the text string that best matches the acoustic model and the language model, thereby converting the user's speech input into the corresponding text.
Fig. 1 is a block diagram of a known speech recognition system. Analog speech is converted by an analog-to-digital conversion unit 11 into a digital signal that the computer can process. The feature extraction unit 12 then frames the digital signal, with a frame length of usually 20 ms and a frame shift of 10 ms, and extracts the MFCC parameters of each speech frame to obtain an MFCC vector sequence. The decoding operation unit 14 obtains the recognition result from the feature vector sequence of the input speech, the acoustic model 13, and the language model 15, using some search strategy such as depth-first search (the Viterbi algorithm) or breadth-first search. The language model applies linguistic knowledge to the speech recognition system when performing large-vocabulary continuous speech recognition, improving the recognition accuracy of the system.
Because the speech recognizer of fig. 1 places very high demands on central processing unit speed and memory capacity, some commercial dictation systems, such as IBM's ViaVoice and the dictation module in Microsoft Office XP, require a high-speed central processing unit (Intel Pentium II 400 MHz or better) and a large memory (100 MByte or more). In general, the decoding operation occupies more than 90% of the central processor computing resources and almost all of the memory resources in the entire speech recognizer; the analog-to-digital conversion module and the feature extraction unit occupy less than 10% of the computing resources and little memory.
Current commercial embedded speech recognition mainly uses speaker-dependent, small-vocabulary recognition based on simple template matching, such as voice dialing and simple command recognition on mobile phones; because this technique requires the user to register speech data, its usability and applicability are limited. Some speaker-independent embedded speech recognition systems target small-vocabulary command-word recognition, but the computational and memory requirements remain large: for example, IBM's Personal Voice Assistant speech recognition system requires a computing device with 50 DMIPS of computing power for a task domain of 500 words.
The basic principles and concepts of known decoding operations are as follows:
1. dictionary tree
The lexicon tree is a tree structure used to organize the pronunciations of all words in the recognition system. Phonemes are the basic units that make up the pronunciation of a word, and TRIPHONE phonemes are the phoneme units commonly used by current speech recognition systems. For example, the TRIPHONE phoneme sequence of the word "中国" (China) is "sil-zh+ong zh-ong+g ong-g+uo g-uo+sil", where "sil" is a special phoneme used to describe a pause in the user's speech. Compared with a conventional pinyin representation, a TRIPHONE phoneme is a context-dependent phoneme that describes how the pronunciation of a phoneme varies in different contexts, so it can describe the acoustic characteristics of a word's pronunciation more accurately. Words in the recognition system may share a prefix or subword; for example, the words "中" (middle) and "中国" (China) share the prefix "中", which can be described by a tree structure. Assuming the vocabulary of the recognition system contains the five words "abe", "ab", "acg", "acgi", and "ac", the dictionary tree of this vocabulary is shown in fig. 4.
the TRIPHONE phone corresponding to each node in the dictionary tree is associated with a Hidden Markov Model (HMM) corresponding to the TRIPHONE, and fig. 5 shows an HMM topology representing the TRIPHONE phone, where an HMM is composed of several HMM states.
2. Token definition and token extension policy
A token represents an active search path from the start frame of the user's speech to the current speech frame. It includes path identification information, comprising all words in the path and their word-boundary information, and the scores of the path against the acoustic model and language model. Each token corresponds to one active search path; tokens differ in that they have different acoustic contexts and different language contexts.
Each state in the HMM associated with each node in the dictionary tree can host active tokens, and each state of a node has a token linked list used to store all tokens active in that state at any time. Suppose at time t, the score of one expandable token in the token linked list of state i of a node in the dictionary tree is s_i(t−1). During the search, if s_i(t−1) plus the transition probability from state i to state j plus the observation probability of state j for the current speech frame t exceeds the current pruning threshold, a new token with score s_j(t) is generated and attached to state j. After all the tokens residing on the dictionary tree at time t−1 have been processed, the token resource to be expanded at time t has been generated, and all tokens residing on the dictionary tree at time t−1 are deleted.
During token propagation, possible words and word-boundary information are recorded in a linked-list structure that identifies the path. Therefore, at the speech-input end time T, the best-matching word sequence and the corresponding word-boundary positions can be extracted by tracing back the path-identification linked list of the best-scoring token.
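The path-identification linked list described above can be sketched as follows; all field and function names are assumptions for illustration, and scores are kept in the log domain as elsewhere in the document:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WordLink:
    """One path-identification record: a word and its end-frame boundary."""
    word: str
    end_frame: int
    prev: Optional["WordLink"]

@dataclass
class Token:
    score: float                  # log-domain match score of the partial path
    history: Optional[WordLink]   # head of the path-identification linked list

def record_word(token, word, t):
    """Prepend a (word, boundary) record when a word end is crossed at frame t."""
    return Token(token.score, WordLink(word, t, token.history))

def backtrace(token):
    """Recall the word sequence and boundaries from the best-scoring token."""
    words, link = [], token.history
    while link is not None:
        words.append((link.word, link.end_frame))
        link = link.prev
    return list(reversed(words))

tok = record_word(record_word(Token(0.0, None), "中", 37), "国", 80)
```

Prepending to the list keeps word recording O(1) per token; the full sequence is only materialized once, by the backtrace at time T.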
3. Token resource definition of dictionary tree nodes
Assume a dictionary-tree node contains M HMM states s_1 … s_M. The token resource of the node then contains the following information:

node token resource: H_s1, H_s2, …, H_sM

where H_si (1 ≤ i ≤ M) is the head of the token linked list of HMM state s_i in the node.
The traditional decoding operation method places heavy demands on hardware computing power and memory, and therefore offers poor cost-effectiveness.
Chinese patent application 02131086.6 discloses a method of compressing the feature vector set for a speech recognition system. In the process of clustering the speech feature vector set to obtain a codebook, a step is added that dynamically merges and splits subsets according to the number of vectors in each subset and the total distance measure of the vectors. This reduces the sum of the distances between the vectors in the clustered set and their corresponding codewords and improves the accuracy of the clustering algorithm; when a codebook compressed by that method is applied to a speech recognition system, the storage requirement of the system can be greatly reduced while its recognition performance is preserved. That application also discloses a speech recognition system, whose structure is shown in fig. 2. The system replaces the acoustic model with a feature codebook and a probability table: Gaussian probabilities need not be computed during decoding, and the required probability values are simply looked up in the prestored probability table, which greatly reduces the cost of Gaussian probability computation in the decoding operation and can considerably improve the recognition speed of the system.
Disclosure of Invention
The invention aims to provide a fast decoding method for current speaker-independent, large-vocabulary continuous speech recognition systems, addressing the problems of current embedded speech recognition technology: limited market acceptance relative to product price, and excessive demands on hardware computing power and memory. With it, current speech recognition technology can also be applied on embedded hardware platforms such as PDAs, mobile phones, and smart phones.
The object of the invention can be achieved by the following measures:
a fast decoding method in a speech recognition system, comprising the steps of:
(1) initializing a decoding operation unit in the voice recognition system;
(2) taking the feature codeword vector of the next speech frame in sequence from the length-T speech feature codeword sequence input into the decoding operation unit, and setting it as the speech frame O_t at time t, 1 ≤ t ≤ T;
(3) filtering the speech frame O_t at time t; if the frame is filtered out, going to step (2); otherwise, setting the speech frame O_t as the valid speech frame O_t^V;
(4) based on the valid speech frame O_t^V, judging each active node in each layer-I token resource L_t[I] of the time-t dictionary-tree token resource L_t; if a token is judged extensible, extending the tokens in the node's token resource table and linking the newly generated tokens into the token resource table of the target node, where I is an index variable, 1 ≤ I ≤ H, and H is the height of the dictionary tree; otherwise, executing step (7);
(5) processing tokens at nodes of a dictionary tree;
(6) adaptively adjusting the pruning-related thresholds according to the local-path maximum probability at time t and the local-path maximum probability at the time corresponding to the previous valid speech frame;
(7) repeating steps (2) to (6); at the speech-input end time T, tracing back the global path of the best-scoring token, ending token expansion, and outputting the text string generated at that moment that best matches the acoustic model and the language model as the speech recognition result.
The present invention does not prescribe the expansion of word-node tokens or the related processing algorithms, which a user can customize based on the task domain (e.g., command-word recognition, Chinese monosyllable recognition, large-vocabulary continuous speech recognition, etc.).
The time-t dictionary-tree token resource L_t is the sum of the token resources of all active nodes in the dictionary tree at that time. Active nodes at time t are indexed in the dictionary tree by layer: all active nodes on the same layer are linked in series into one list, each layer of the dictionary tree has such a list, and the whole forms a two-dimensional linked list.
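A minimal sketch of this layered (two-dimensional linked list) index, using Python lists in place of linked lists; the class and method names are assumptions:

```python
class ActiveNodeIndex:
    """Two-dimensional index of active nodes: one list per dictionary-tree layer."""
    def __init__(self, height):
        # layers[I - 1] holds the active-node list of layer I, 1 <= I <= height
        self.layers = [[] for _ in range(height)]

    def activate(self, node, layer):
        """Link a node into its layer's active-node list if not already there."""
        if node not in self.layers[layer - 1]:
            self.layers[layer - 1].append(node)

    def active_nodes(self, layer):
        """The layer-I token resource L_t[I] is read off this list."""
        return self.layers[layer - 1]
```

Iterating layer by layer over `active_nodes(I)` for 1 ≤ I ≤ H visits exactly the token resources to be expanded at time t.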
The layer-I token resource L_t[I] of the time-t dictionary-tree token resource is the I-th linked list of the time-t active-node token resource L_t indexed as above.
The local-path maximum probability at time t is: the maximum of all local-path scores in the set of local paths corresponding to all tokens newly generated at time t.
The local-path maximum probability at the time corresponding to the previous valid speech frame is: the maximum of all local-path scores in the set of local paths corresponding to all tokens newly generated at that time.
The initialization step (1) further comprises the following steps:
a. generating a token with a score of zero and linking it into the head of the token resource table of the root node of the dictionary tree; the active nodes of the current dictionary tree then comprise only the root node, which is located at the first layer of the tree;
b. initializing the global pruning threshold L_g to the logarithmic minimum;
c. initializing the local pruning baseline threshold L_b to the logarithmic minimum;
d. initializing the pruning width threshold L_w to a positive constant L_w^c preset by the user.
The filtering step (3) further comprises the following steps:
3a. if the speech frame O_t at time t is the initial frame of the user's speech input, setting it as a valid speech frame and finishing the filtering operation; otherwise, executing step 3b;
3b. comparing the Y feature codeword vectors f_1^t, f_2^t, …, f_Y^t of the speech frame O_t at time t with the Y feature codeword vectors f_1^{t-1}, f_2^{t-1}, …, f_Y^{t-1} of the speech frame O_{t-1} at time t−1, and obtaining a similarity measure V from their degree of similarity;
3c. comparing the similarity measure V with a decision threshold θ: if V ≤ θ, judging the speech frame O_t invalid for the decoding operation; otherwise, judging the speech frame O_t valid for the decoding operation.
The decision threshold θ is a constant greater than 0 set by the user.
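Steps 3a to 3c can be sketched as follows. The concrete similarity measure V used here (the number of codeword positions that differ between consecutive frames) is an assumption; the text only requires some measure of similarity compared against the threshold θ:

```python
def filter_frames(codeword_frames, theta):
    """Keep the first frame; drop later frames too similar to their predecessor.

    codeword_frames: list of length-Y tuples of feature codeword indices.
    theta: decision threshold; a frame whose similarity measure V <= theta
           is judged invalid for the decoding operation. V is assumed here
           to be the number of codeword positions changed since the previous
           frame, so small V means the frame is redundant.
    """
    valid = []
    prev = None
    for frame in codeword_frames:
        if prev is None:                       # step 3a: initial frame is valid
            valid.append(frame)
        else:                                  # steps 3b and 3c
            v = sum(1 for a, b in zip(frame, prev) if a != b)
            if v > theta:
                valid.append(frame)
        prev = frame                           # always compare against O_{t-1}
    return valid
```

With θ = 0 only exact repeats of the previous frame are dropped; larger θ removes more of a stationary segment, trading a little precision for fewer frames in the search computation.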
The node token resource expanding step (4) further comprises the following steps:
4a. based on the valid speech frame O_t^V, externally expanding each token in the token resource linked list of the last state of the HMM associated with the current node, i.e., expanding each such token into the token resource tables of all child nodes of the node in the dictionary tree;
4b. taking one HMM state of the M-state HMM associated with the current node as the current HMM state s_n to be processed, where 1 ≤ n ≤ M;
4c. taking one token in the token resource table of state s_n as the current token to be processed;
4d. if the score of the current token in state s_n is greater than the global pruning threshold L_g of the time corresponding to the previous valid speech frame, taking one state reachable from s_n according to the topology of the HMM associated with the current node as the current state s_m to be processed; otherwise, going to step 4k;
4e. computing the score s_m(t) of the token going from s_n to state s_m: s_m(t) is the current score of the token, plus the transition probability from state s_n to state s_m, plus the observation probability of state s_m for the current speech frame O_t, which is obtained by table lookup from the probability table input into the decoding operation unit;
4f. computing the current local pruning threshold L_p by the formula L_p = L_b − L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;
4g. if the score of the token from s_n to s_m is greater than the current local pruning threshold L_p, generating a new token with score s_m(t); otherwise, executing step 4j;
4h. linking the new token generated in step 4g into the token linked list with head H_sm in the node, checking whether the node is in the active node table of its layer of the dictionary tree, and linking it in if not;
4i. according to the new token's score s_m(t): if s_m(t) − L_w > L_b, updating the local pruning baseline threshold as L_b = s_m(t);
4j. taking another state reachable from s_n as the current state s_m to be processed, and repeating steps 4e to 4i until all states reachable from s_n have been processed; then going to step 4k;
4k. taking another token in the token resource table of state s_n as the current token to be processed, and repeating steps 4d to 4j until the expansion of all tokens in the token resource table of s_n is finished; then going to step 4l;
4l. taking another HMM state of the M-state HMM associated with the node as the current HMM state s_n, 1 ≤ n ≤ M, and repeating steps 4c to 4k until all token resource expansion operations of the current node are complete.
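Steps 4d to 4i, for extending one token to one reachable state, can be sketched as below; the function shape, and treating the probability-table lookups as precomputed arguments, are assumptions for the sketch:

```python
import math

LOG_MIN = -math.inf   # the "logarithmic minimum" used to initialize thresholds

def expand_token(score, trans_prob, obs_prob, L_g, L_b, L_w):
    """Try to extend one token from state s_n to a reachable state s_m.

    score      -- current token score (log domain)
    trans_prob -- log transition probability s_n -> s_m
    obs_prob   -- log observation probability of s_m for the current frame,
                  assumed already looked up from the probability table
    L_g        -- global pruning threshold of the previous valid frame (step 4d)
    L_b, L_w   -- local pruning baseline and width thresholds (steps 4f to 4i)
    Returns (new token score, or None if pruned; possibly updated L_b).
    """
    if score <= L_g:                      # step 4d: the token itself is pruned
        return None, L_b
    s_m = score + trans_prob + obs_prob   # step 4e: score reaching s_m
    L_p = L_b - L_w                       # step 4f: local pruning threshold
    if s_m <= L_p:                        # step 4g: prune this extension
        return None, L_b
    if s_m - L_w > L_b:                   # step 4i: raise the local baseline
        L_b = s_m
    return s_m, L_b
```

The full step 4 is this function applied inside three loops: over states s_n, over the tokens in each state, and over the states reachable from s_n.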
The external expansion step 4a comprises the following steps:
4a-i. if the current score of the token is less than or equal to the global pruning threshold L_g of the time corresponding to the previous valid speech frame, not expanding the current token to the child nodes of its node; otherwise, executing step 4a-ii;
4a-ii. taking the j-th of the J child nodes in the dictionary tree of the node where the current token is located, node_j, as the current node to be processed;
4a-iii. accumulating the score s_1(t) of the token reaching the first state s_1 of node_j: s_1(t) is the current score of the token, plus the transition probability from the last state of the current token's node to node_j, plus the observation probability of the first state s_1 of node_j for the current speech frame O_t, which is obtained by table lookup from the probability table input into the decoding operation unit;
4a-iv. computing the current local pruning threshold L_p by the formula L_p = L_b − L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;
4a-v. if the score of the token reaching the first state s_1 of node_j is greater than the current local pruning threshold L_p, executing step 4a-vi; otherwise, executing step 4a-ix;
4a-vi. generating a new token with score s_1(t);
4a-vii. linking the token into the token linked list with head H_s1 in node_j, checking whether node_j is in the active node table of its layer of the dictionary tree, and linking it in if not;
4a-viii. according to the score s_1(t): if s_1(t) − L_w > L_b, updating the local pruning baseline threshold as L_b = s_1(t);
4a-ix. taking another child node of the current token's node in the dictionary tree as the current node_j to be processed, and repeating steps 4a-iii to 4a-viii until the expansion of the current token to all child nodes of its node in the dictionary tree is complete.
The adaptive pruning step (6) based on the maximum probability of the local path comprises the following steps:
6a. computing a pruning width threshold adjustment factor L_f from the local-path maximum probabilities at the current time t and at the time corresponding to the previous valid speech frame, where the current global pruning threshold L_g is the local-path maximum probability at the time corresponding to the previous valid speech frame, the current local pruning baseline threshold L_b is the local-path maximum probability at time t, and the number of all valid speech frames up to time t also enters the computation;
6b. normalizing the computed adjustment factor L_f: if L_f > L_f^MAX, setting L_f = L_f^MAX; if L_f < L_f^MIN, setting L_f = L_f^MIN, where L_f^MAX and L_f^MIN are the upper and lower bounds of the adjustment factor, both positive constants settable by the user;
6c. updating the pruning width threshold according to the computed adjustment factor: L_w = L_f · L_w^c, where L_w^c is the initial pruning width threshold obtained in initialization step (1);
6d. updating the global pruning threshold at time t as L_g = L_b, preparing for token expansion at the next valid speech frame;
6e. resetting the local pruning baseline threshold L_b to the logarithmic minimum, preparing for token expansion at the next valid speech frame.
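Steps 6a to 6e can be sketched as follows. The exact adjustment-factor formula of step 6a is carried by an image in the original text and is not recoverable, so the ratio used below is only a placeholder; the normalization and threshold updates of steps 6b to 6e follow the text:

```python
import math

def adapt_pruning(L_b, L_g, n_valid, L_w_init, L_f_min, L_f_max):
    """Steps 6a-6e: adapt the pruning thresholds after a valid frame at time t.

    L_b      -- local-path maximum probability at time t (local baseline threshold)
    L_g      -- local-path maximum probability at the previous valid frame's time
                (current global pruning threshold)
    n_valid  -- number of valid speech frames up to time t (enters the patent's
                real formula; unused by the placeholder below)
    Returns the updated (L_w, L_g, L_b).
    """
    # step 6a -- placeholder for the unrecoverable formula: compare the
    # current and previous local-path maxima
    if math.isinf(L_g) or L_g == 0:
        L_f = 1.0                           # no usable previous maximum yet
    else:
        L_f = L_b / L_g
    L_f = min(max(L_f, L_f_min), L_f_max)   # step 6b: clamp to user bounds
    L_w = L_f * L_w_init                    # step 6c: rescale the pruning width
    L_g_new = L_b                           # step 6d: new global threshold
    L_b_new = -math.inf                     # step 6e: reset the local baseline
    return L_w, L_g_new, L_b_new
```

The effect is that the beam width L_w tightens or relaxes from frame to frame depending on how the best local path is evolving, instead of staying at the fixed value L_w^c.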
Compared with the prior art, the invention has the following advantages:
compared with the traditional decoding operation method, the invention comprises the following improvements: the adaptive pruning strategy based on the maximum probability of the local path; and (3) a voice frame filtering strategy based on the characteristic code word vector.
The decoding operation of the invention adopts a breadth-first search framework with pruning, based on a dictionary tree and token expansion. The computational complexity of the algorithm is O(MT), where T is the number of speech frames entering the search computation and M is the average number of active paths per speech frame during the search.
The traditional decoding operation runs the search computation over every speech frame of the user's input. In fact, the user's speech is a time-varying signal with local stationarity, so within a stationary segment some frames similar to their neighbors can be removed, i.e., kept out of the search computation, without affecting the precision of the decoding operation. The invention therefore provides a speech-frame filtering strategy based on the feature codeword vectors of the frames, which effectively removes frames redundant for the search, so that the number of frames actually entering the search computation is smaller than the number of frames contained in the user's speech input.
On the other hand, as the complexity formula above shows, the speed of the search computation also depends on the average number of active paths per frame: with T unchanged, the larger M is, the larger the search overhead, and the smaller M is, the smaller the overhead. The size of M depends on the pruning strategy adopted. The invention therefore provides an adaptive pruning strategy based on the local-path maximum probability which, compared with the traditional method, effectively reduces M without noticeably affecting recognition precision, further accelerating the decoding operation.
Drawings
FIG. 1 is a block diagram of a known speech recognition system
FIG. 2 is a block diagram of another known speech recognition system
FIG. 3 is a flowchart illustrating a decoding operation according to the present invention
FIG. 4 is a diagram of a known dictionary tree structure
FIG. 5 is a schematic diagram of an HMM topology of a conventional TRIPHONE phone
Detailed Description
FIG. 3 is a flow chart of the decoding operation of the present invention. With reference to fig. 2 and fig. 3, the operation flow of a speech recognition system based on the decoding subsystem of the invention is as follows: convert the input analog speech signal into a digital signal; frame the digital signal and extract the feature parameters of each speech frame, each frame corresponding to one feature vector, to obtain the feature vector sequence of the input speech; quantize and encode the feature vector sequence with the feature codebook, each frame corresponding to one feature codeword vector, to obtain the corresponding feature codeword vector sequence; input the speech feature codeword vector sequence into the speech-frame filtering unit of the decoding subsystem, perform the frame filtering operation, and remove the codeword vectors of invalid frames from the sequence to obtain the valid speech feature codeword vector sequence; and run the search computation over the valid sequence to obtain the recognition result. During the search, local search paths are pruned with the adaptive pruning strategy based on the local-path maximum probability, and the observation probability of each codeword of the valid sequence on a (local) search path is looked up directly from the probability table.
Definition of the dictionary-tree active-node token resource and its indexing
In the invention, at any time t, a node holding an active token in the dictionary tree is called an active node at time t. Active nodes at time t are indexed in the dictionary tree by layer: all active nodes on the same layer are linked in series into one list, each layer of the dictionary tree has such a list, and the whole forms a two-dimensional linked list.
At any time t, the sum of the token resources of all active nodes in the dictionary tree is called the time-t dictionary-tree active-node token resource, and it specifies the token resource to be expanded at time t. For convenience of description, the time-t dictionary-tree active-node token resource indexed as above is denoted L_t, with index variable I (1 ≤ I ≤ H), where H is the height of the dictionary tree.
A specific embodiment of the fast decoding method of the present invention is given based on the above-described basic principles and related concepts regarding decoding operations.
The decoding method of the present invention includes the steps of:
1. initializing a decoding operation unit in the voice recognition system;
2. extracting the feature codeword vector of the first speech frame from the length-T speech feature codeword vector sequence input into the decoding operation unit, and setting it as the speech frame O_t at time t (t = 1);
3. performing the filtering operation on the speech frame O_t at time t;
4. if O_t is a valid speech frame, for each active node in each layer-I token resource L_t[I] (1 ≤ I ≤ H) of the time-t dictionary-tree token resource L_t, expanding the tokens in the node's token resource table and linking the newly generated tokens into the token resource table of the target node; otherwise going to step 7;
5. processing the tokens at word nodes; the invention does not prescribe the expansion of word-node tokens or the related processing algorithms, which the user can customize according to the task domain (e.g., command-word recognition, Chinese monosyllable recognition, large-vocabulary continuous speech recognition, etc.);
6. adaptively adjusting the pruning-related thresholds according to the local-path maximum probability at time t and that at the time corresponding to the previous valid speech frame; the adjusted thresholds comprise the global pruning threshold L_g, the local pruning baseline threshold L_b, and the pruning width threshold L_w;
7. taking the next speech frame out of the length-T speech feature vector sequence input into the decoding operation unit; if one can be taken out, setting it as the speech frame O_t at time t (t ≤ T) and going to step 3; otherwise going to step 8;
8. ending token expansion and producing the recognition result: by tracing back the global path of the best-scoring token at time T, outputting the text string that best matches the acoustic model and the language model.
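The eight steps above amount to an outer loop of roughly this shape; every callback is a hypothetical stand-in for the corresponding operation described in the text:

```python
import math

def decode(frames, expand_active_nodes, process_word_tokens,
           adapt_thresholds, is_valid_frame, backtrace_best):
    """Skeleton of the decoding loop; each callback is a hypothetical
    stand-in for the corresponding numbered step in the text."""
    # step 1: initialization (the zero-score root token is assumed to be
    # created by the callbacks; only the thresholds are tracked here)
    state = {"L_g": -math.inf, "L_b": -math.inf}
    prev = None
    for t, frame in enumerate(frames, start=1):    # steps 2 and 7: next frame
        if not is_valid_frame(frame, prev, t):     # step 3: frame filtering
            prev = frame
            continue
        expand_active_nodes(frame, state)          # step 4: token expansion
        process_word_tokens(frame, state)          # step 5: word-node tokens
        adapt_thresholds(state)                    # step 6: adaptive pruning
        prev = frame
    return backtrace_best(state)                   # step 8: best-path recall
```

Filtered frames skip steps 4 through 6 entirely, which is exactly where the frame-filtering strategy saves search computation.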
The sub-steps of the expansion operation on the current node token resource in step 4 of the decoding method are as follows:
4a, expanding the token into token resource tables of all son nodes of the node in a dictionary tree for each token in the token resource chain table corresponding to the last state of the HMM associated with the node;
4b. taking the first HMM state of the M-state HMM associated with the node as the current HMM state to be processed, s_n (n = 1);
4c. taking one token in the token resource table corresponding to state s_n as the current token to be processed;
4d. if the current score of the token is greater than the global pruning threshold L_g of the time t′ corresponding to the previous valid speech frame, then, according to the topology of the HMM model associated with the current node, taking a state reachable from state s_n and setting it as the current state to be processed, s_m; otherwise going to step 4k;
4e. calculating the score s_m(t) with which the token reaches state s_m, namely: the current score of the token, plus the transition probability from state s_n to state s_m, plus the observation probability of state s_m for the current speech frame O_t; the observation probability can be obtained by a table lookup in the probability table input to the decoding operator;
4f. calculating the current local pruning threshold L_p by the formula L_p = L_b − L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;
4g. if the score with which the token reaches state s_m is less than or equal to the current local pruning threshold L_p, going to step 4j; otherwise generating a new token with score s_m(t);
4h. linking the new token into the token resource table of the node whose header is H_sm, checking whether the node is in the active node table of its layer of the dictionary tree, and if not, linking the node into the active node table;
4i. according to the score s_m(t), updating the local pruning baseline threshold L_b as follows: if s_m(t) − L_w > L_b, then L_b = s_m(t); otherwise not updating;
4j. taking another state s_m reachable from state s_n; if one is taken, setting it as the current state to be processed and going to step 4e; otherwise going to step 4k;
4k. taking another token in the token resource table corresponding to state s_n; if one is taken, setting it as the current token to be processed and going to step 4d; otherwise going to step 4l;
4l. taking the next HMM state of the M-state HMM associated with the node; if one is taken, setting it as the current HMM state to be processed, s_n (n ≤ M), and going to step 4c; otherwise the token resource expansion operation of the current node is complete.
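The inner loop of substeps 4c–4j can be sketched as follows. This is an illustrative simplification, not the claimed implementation: tokens are bare scores, the HMM topology is given as a list of reachable states, and the transition and observation probabilities are supplied as precomputed log-probability tables.

```python
def expand_state_tokens(tokens, reachable, trans, obs, Lg_prev, Lb, Lw):
    """Sketch of substeps 4c-4j: expand every token held by one HMM state s_n.

    tokens: scores of the tokens in state s_n
    reachable: indices of states s_m reachable from s_n (HMM topology)
    trans[m]: log transition probability s_n -> s_m
    obs[m]: log observation probability of s_m for the current frame (table lookup)
    Lg_prev: global pruning threshold from the previous valid frame
    Returns the surviving new tokens and the updated baseline threshold Lb.
    """
    new_tokens = []
    for score in tokens:
        if score <= Lg_prev:                   # 4d: global pruning
            continue
        for m in reachable:
            sm_t = score + trans[m] + obs[m]   # 4e: accumulate the path score
            Lp = Lb - Lw                       # 4f: current local pruning threshold
            if sm_t <= Lp:                     # 4g: local pruning
                continue
            new_tokens.append((m, sm_t))       # 4g/4h: new token linked to s_m
            if sm_t - Lw > Lb:                 # 4i: raise the baseline threshold
                Lb = sm_t
    return new_tokens, Lb
```

Note that L_b rises as better-scoring tokens appear, so later tokens within the same frame are pruned against a progressively tighter local threshold; the same pruning pattern applies to the cross-node expansion of substep 4a.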
Substep 4a of step 4 of the decoding method, expanding the current token to all child nodes of the node where it is located, comprises the following steps:
4a-1. if the score of the current token is less than or equal to the global pruning threshold L_g of the time t′ corresponding to the previous valid speech frame, the operation of expanding the current token to all child nodes of its node is not needed; otherwise going to step 4a-2;
4a-2. taking one child node of the node where the current token is located as the current node to be processed, node_j (j = 1);
4a-3. accumulating the score s_1(t) with which the token reaches the first state s_1 of node_j, namely: the current score of the token, plus the transition probability from the last state of the token's node to node_j, plus the observation probability of the first state s_1 of node_j for the current speech frame O_t; the observation probability can be obtained by a table lookup in the probability table input to the decoding operator;
4a-4. calculating the current local pruning threshold L_p by the formula L_p = L_b − L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;
4a-5. if the score with which the token reaches the first state s_1 of node_j is less than or equal to the current local pruning threshold L_p, going to step 4a-9; otherwise going to step 4a-6;
4a-6. generating a new token with score s_1(t);
4a-7. linking the token into the token resource table of node_j whose header is H_s1, checking whether node_j is in the active node table of its layer of the dictionary tree, and if not, linking the node into the active node table;
4a-8. according to the score s_1(t), updating the local pruning baseline threshold L_b as follows: if s_1(t) − L_w > L_b, then L_b = s_1(t); otherwise not updating;
4a-9. taking another child node of the node where the current token is located; if one is taken, setting it as the current node to be processed, node_j (j ≤ N, where N is the number of child nodes of the token's node in the dictionary tree), and going to step 4a-3; otherwise the operation of expanding the current token to all child nodes of its node is complete.
The step 1 of the decoding method for initializing the decoding arithmetic unit includes the steps of:
a. generating a token with score zero and linking it into the token resource table header of the root node of the dictionary tree; the active nodes of the current dictionary tree comprise only the root node, which is located at the first layer of the dictionary tree;
b. initializing the global pruning threshold L_g to the logarithmic minimum;
c. initializing the local pruning baseline threshold L_b to the logarithmic minimum;
d. initializing the pruning width threshold L_w to a positive constant L_w^c; this value can be read from a decoding operator configuration file set by the user.
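A minimal sketch of this initialization step, assuming a simple dictionary-based representation of the decoder state (the data structures and the name `init_decoder` are illustrative, not from the patent):

```python
LOG_MIN = float("-inf")  # the patent's "logarithmic minimum"

def init_decoder(Lw_c=100.0):
    """Sketch of step 1: one zero-score token at the root, thresholds reset.

    Lw_c stands in for the constant pruning width read from the
    user-supplied decoder configuration file.
    """
    root_tokens = [0.0]              # a: a zero-score token at the dictionary-tree root
    active_nodes = {1: ["root"]}     # active nodes; layer 1 holds only the root
    thresholds = {
        "Lg": LOG_MIN,               # b: global pruning threshold
        "Lb": LOG_MIN,               # c: local pruning baseline threshold
        "Lw": Lw_c,                  # d: pruning width threshold
    }
    return root_tokens, active_nodes, thresholds
```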
The step 3 of the decoding method for performing the filtering operation on the current speech frame comprises the following steps:
3a. if the current speech frame O_t at time t is the initial speech frame of the user's speech input, setting it as a valid speech frame and ending the filtering operation; otherwise going to step 3b;
3b. comparing the feature codeword vector (f_1^t, f_2^t, …, f_Y^t) of the speech frame O_t at time t with the feature codeword vector (f_1^{t-1}, f_2^{t-1}, …, f_Y^{t-1}) of the speech frame O_{t-1} at time t−1, where Y is the number of feature codewords contained in the feature codeword vector of a speech frame, and obtaining a similarity measure V from the degree of similarity of the two vectors; V can be calculated by the formula

V = Σ_{i=1}^{Y} C(f_i^t, f_i^{t-1}),

where C(f_i^t, f_i^{t-1}) is a per-codeword comparison function whose definition is given by a formula in the original document that is not reproduced in this text;
3c. comparing the similarity measure V with a decision threshold θ (a positive constant that can be read from a decoding operator configuration file set by the user); if V ≤ θ, judging the speech frame O_t to be invalid for the decoding operation; otherwise judging the speech frame O_t to be valid for the decoding operation.
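The filtering step can be sketched as below. The per-codeword comparison C is given in the original by a formula that does not survive in this text; here it is assumed to count *differing* codewords, which is consistent with the decision rule "V ≤ θ ⇒ frame is redundant and filtered out". The function name and argument shapes are illustrative.

```python
def frame_is_valid(prev_codewords, cur_codewords, theta):
    """Sketch of step 3: filter a frame by codeword similarity to its predecessor.

    Assumption (not confirmed by the surviving text): C(a, b) = 1 when the
    codewords differ and 0 when they are equal, so V counts changed codewords.
    """
    if prev_codewords is None:      # 3a: the initial frame is always valid
        return True
    # 3b: similarity measure V over the Y codewords of the two frames
    V = sum(1 for a, b in zip(cur_codewords, prev_codewords) if a != b)
    return V > theta                # 3c: V <= theta -> too similar, filtered out
```

Under this reading, a slowly speaking user produces many near-identical consecutive frames and therefore more frames are filtered, matching the speech-rate normalization effect described above.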
Experimental results show that the speech frame filtering operation can remove 20–30% of the invalid speech frames in the speech feature codeword vector sequence, and can therefore accelerate the decoding operation. Compared with the traditional method, the recognition performance is also insensitive to the user's speaking rate: for a user who speaks slowly, the filtering operation removes more invalid speech frames, and for a user who speaks quickly it removes fewer; that is, the filtering operation normalizes the speaking rates of different users to a certain extent.
Step 6 of the above decoding method further comprises the steps of:
After the first 5 steps of the decoding operator, the following are available: the current global pruning threshold L_g, which equals the maximum local path probability at the time t′ corresponding to the previous valid speech frame, and the current local pruning baseline threshold L_b, which equals the maximum local path probability at time t. Accordingly, step 6 of the decoding operator is divided into the following steps:
6a. calculating the pruning width threshold adjustment factor L_f as:

L_f = (L_b − L_g) / (t̃ · L_b),

where t̃ is the number of all valid speech frames up to time t;
6b. normalizing the calculated pruning width threshold adjustment factor L_f: if L_f > L_f^MAX, then L_f = L_f^MAX; if L_f < L_f^MIN, then L_f = L_f^MIN; where L_f^MAX is the upper bound of the adjustment factor L_f (e.g. 1.05) and L_f^MIN is its lower bound (e.g. 0.5), both positive constants that can be read from a decoder configuration file set by the user;
6c. according to the calculated pruning width threshold adjustment factor L_f, updating the pruning width threshold L_w as: L_w = L_f · L_w^c;
6d. updating the global pruning threshold at time t as L_g = L_b, preparing for token expansion for the next valid speech frame;
6e. resetting the local pruning baseline threshold L_b to the logarithmic minimum, preparing for token expansion for the next valid speech frame.
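Substeps 6a–6e can be sketched as a single update function. Note that the adjustment-factor formula is reconstructed from garbled OCR of the original and may differ from the patent's actual equation; the function name and argument order are illustrative.

```python
def adapt_pruning(Lb, Lg, t_valid, Lw_c, Lf_max=1.05, Lf_min=0.5):
    """Sketch of step 6 under the reconstructed formula
    Lf = (Lb - Lg) / (t_valid * Lb),
    where t_valid is the number of valid frames processed so far.
    Returns the updated (Lw, Lg, Lb) for the next valid frame.
    """
    Lf = (Lb - Lg) / (t_valid * Lb)       # 6a: adjustment factor
    Lf = min(max(Lf, Lf_min), Lf_max)     # 6b: clamp to [Lf_min, Lf_max]
    Lw = Lf * Lw_c                        # 6c: new pruning width
    Lg = Lb                               # 6d: global threshold <- best score so far
    Lb = float("-inf")                    # 6e: reset the baseline
    return Lw, Lg, Lb
```

Clamping L_f (step 6b) keeps the beam width within a bounded multiple of the configured constant L_w^c, so a single unusual frame cannot collapse or explode the search beam.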
In the conventional search algorithm, the pruning width threshold L_w is constant. In the present invention, after the search calculation on the current valid speech frame, the pruning width threshold L_w is adaptively adjusted according to the maximum local path probability, so that adaptive pruning of local paths is achieved when the next valid speech frame is searched. Experimental results show that, without affecting recognition accuracy, this method effectively reduces the average token number M in the decoding process by 10–20%, thereby further accelerating the decoding operation.

Claims (10)

1. A fast decoding method in a speech recognition system, comprising the steps of:
(1) initializing a decoding operation unit in the voice recognition system;
(2) taking the feature codeword vector of the next speech frame in sequence from the speech feature codeword sequence of length T input to the decoding operator and setting it as the speech frame O_t at time t, 1 ≤ t ≤ T;
(3) filtering the speech frame O_t at time t; if the speech frame is filtered out, executing step (2); otherwise setting it as the current valid speech frame O_t^V;
(4) based on the valid speech frame O_t^V, judging each active node in each layer I of the dictionary-tree token resource L_t at time t, i.e. in L_t[I]; if a token is judged to be expandable, expanding the tokens in the token resource table of the node and linking the newly generated tokens into the token resource table of the target node; where I is an index variable, 1 ≤ I ≤ H, and H is the height of the dictionary tree; otherwise executing step (7);
(5) processing tokens at nodes of a dictionary tree;
(6) according to the maximum local path probability at time t and the maximum local path probability at the time t′ corresponding to the previous valid speech frame, adaptively adjusting the thresholds related to pruning;
(7) repeating steps (2) to (6) to obtain the global path of the best-scoring token at the end time T of the input speech, ending token expansion, and outputting the text string generated at that moment that best matches the acoustic model and the language model, producing the speech recognition result.
2. The fast decoding method in a speech recognition system of claim 1, wherein said dictionary-tree token resource L_t at time t is the sum of the token resources of all active nodes in the dictionary tree at that time.
3. The method of claim 1, wherein the local path maximum probability at time t is a maximum of all local path scores in the local path sets corresponding to all the newly generated tokens at time t.
4. The fast decoding method in a speech recognition system of claim 1, wherein the maximum local path probability at the time t′ corresponding to the previous valid speech frame is the maximum of all local path scores in the local path sets corresponding to all the tokens newly generated at time t′.
5. The fast decoding method in a speech recognition system as claimed in claim 1, wherein said initializing step (1) further comprises the steps of:
a. generating a token with a value of zero, and linking the token into a token resource table head of a root node in a dictionary tree, wherein an active node of the current dictionary tree only comprises a root node, and the active node is positioned at the first layer of the dictionary tree;
b. initializing the global pruning threshold L_g to the logarithmic minimum;
c. initializing the local pruning baseline threshold L_b to the logarithmic minimum;
d. initializing the pruning width threshold L_w to a positive constant L_w^c, L_w^c being preset by the user.
6. The fast decoding method in a speech recognition system as claimed in claim 1, wherein said filtering step (3) further comprises the steps of:
3a. if the speech frame O_t at time t is the initial speech frame of the user's speech input, setting it as a valid speech frame and ending the filtering operation; otherwise executing step 3b;
3b. comparing the Y feature codewords (f_1^t, f_2^t, …, f_Y^t) of the speech frame O_t at time t with the Y feature codewords (f_1^{t-1}, f_2^{t-1}, …, f_Y^{t-1}) of the speech frame O_{t-1} at time t−1, and obtaining a similarity measure V according to their degree of similarity;
3c. comparing the similarity measure V with a decision threshold θ; if V ≤ θ, judging the speech frame O_t at time t to be invalid for the decoding operation; otherwise judging the speech frame O_t at time t to be valid for the decoding operation.
7. The method of claim 1, wherein the decision threshold θ is a constant greater than 0 set by a user.
8. The fast decoding method in a speech recognition system according to claim 1, wherein said node token resource expanding step (4) further comprises the steps of:
4a. based on the valid speech frame O_t^V, performing an outward expansion of each token in the token resource linked list corresponding to the last state of the HMM associated with the current node, i.e. expanding each such token into the token resource tables of all child nodes of the node in the dictionary tree;
4b. taking one HMM state of the M-state HMM associated with the current node as the current HMM state to be processed, s_n, where 1 ≤ n ≤ M;
4c. taking one token in the token resource table corresponding to state s_n as the current token to be processed;
4d. if the score of the current token to be processed in state s_n is greater than the global pruning threshold L_g of the time t′ corresponding to the previous valid speech frame, taking, according to the topology of the HMM model associated with the current node, a state reachable from state s_n and setting it as the current state to be processed, s_m; otherwise going to step 4k;
4e. calculating the score s_m(t) of the token from state s_n to state s_m; the score s_m(t) is the current score of the token plus the transition probability from state s_n to state s_m, plus the observation probability of state s_m for the current speech frame O_t;
4f. calculating the current local pruning threshold L_p by the formula L_p = L_b − L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;
4g. if the score of the token from state s_n to state s_m is greater than the current local pruning threshold L_p, generating a new token with score s_m(t); otherwise executing step 4j;
4h. linking the new token generated in step 4g into the token resource table of the node whose header is H_sm, checking whether the node is in the active node table of its layer of the dictionary tree, and if not, linking the node into the active node table;
4i. according to the score s_m(t) of the new token, if s_m(t) − L_w > L_b, updating the local pruning baseline threshold L_b, namely L_b = s_m(t);
4j. taking another state reachable from state s_n and setting it as the current state to be processed, s_m; repeating steps 4e–4i until all states reachable from state s_n are processed; going to step 4k;
4k. taking another token in the token resource table corresponding to state s_n as the current token to be processed; repeating steps 4d–4j until the expansion operation of all tokens in the token resource table corresponding to state s_n is completed; going to step 4l;
4l. taking another HMM state of the M-state HMM associated with the node as the current HMM state to be processed, s_n, 1 ≤ n ≤ M, and repeating steps 4c–4k until all token resource expansion operations of the current node are completed.
9. The fast decoding method in speech recognition system according to claim 8, wherein the node token resource expanding step (4) a comprises the steps of:
4a-i) if the current score of the token is less than or equal to the global pruning threshold L_g of the time t′ corresponding to the previous valid speech frame, the operation of expanding the current token to all child nodes of its node is not needed; otherwise executing step 4a-ii;
4a-ii) taking the jth of the J child nodes of the node where the current token is located as the current node to be processed, node_j;
4a-iii) accumulating the score s_1(t) with which the token reaches the first state s_1 of node_j; the score s_1(t) is the current score of the token plus the transition probability from the last state of the token's node to node_j, plus the observation probability of the first state s_1 of node_j for the current speech frame O_t;
4a-iv) calculating the current local pruning threshold L_p by the formula L_p = L_b − L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;
4a-v) if the score with which the token reaches the first state s_1 of node_j is greater than the current local pruning threshold L_p, executing step 4a-vi; otherwise executing step 4a-ix;
4a-vi) generating a new token with score s_1(t);
4a-vii) linking the token into the token resource table of node_j whose header is H_s1, and checking whether node_j is in the active node table of its layer of the dictionary tree; if not, linking the node into the active node table;
4a-viii) according to the score s_1(t), if s_1(t) − L_w > L_b, updating the local pruning baseline threshold L_b, namely L_b = s_1(t);
4a-ix) taking another child node in the dictionary tree of the node where the current token is located as the current node to be processed, node_j, and repeating steps 4a-iii to 4a-viii until the expansion of the current token to all child nodes in the dictionary tree of its node is completed.
10. A fast decoding method in a speech recognition system as claimed in claim 1, characterized in that said step (6) of adaptive pruning based on local path maximum probability comprises the steps of:
6a. according to the maximum local path probabilities at the current time t and at the time t′ corresponding to the previous valid speech frame, calculating the pruning width threshold adjustment factor L_f as: L_f = (L_b − L_g) / (t̃ · L_b), where t̃ is the number of all valid speech frames up to time t, the current global pruning threshold L_g is the maximum local path probability at the time t′ corresponding to the previous valid speech frame, and the current local pruning baseline threshold L_b is the maximum local path probability at time t;
6b. normalizing the calculated pruning width threshold adjustment factor L_f: if L_f > L_f^MAX, setting L_f to L_f^MAX; if L_f < L_f^MIN, setting L_f to L_f^MIN; where L_f^MAX is the upper bound of the adjustment factor L_f and L_f^MIN is its lower bound, both positive constants that can be read from a decoder configuration file set by the user;
6c. according to the calculated pruning width threshold adjustment factor L_f, updating the pruning width threshold L_w as: L_w = L_f · L_w^c, L_w^c being the initialized pruning width threshold obtained in initialization step (1);
6d. updating the global pruning threshold at time t as L_g = L_b, preparing for token expansion for the next valid speech frame;
6e. resetting the local pruning baseline threshold L_b to the logarithmic minimum, preparing for token expansion for the next valid speech frame.
CNB021486824A 2002-11-15 2002-11-15 Rapid decoding method for voice identifying system Expired - Lifetime CN1201284C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB021486824A CN1201284C (en) 2002-11-15 2002-11-15 Rapid decoding method for voice identifying system


Publications (2)

Publication Number Publication Date
CN1455387A CN1455387A (en) 2003-11-12
CN1201284C true CN1201284C (en) 2005-05-11

Family

ID=29257528


Country Status (1)

Country Link
CN (1) CN1201284C (en)




Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C14 Grant of patent or utility model
GR01 Patent grant
CX01 Expiry of patent term

Granted publication date: 20050511