CN1201284C - Rapid decoding method for voice identifying system - Google Patents


Info

Publication number
CN1201284C
CN1201284C · CNB021486824A · CN02148682A
Authority
CN
China
Prior art keywords
token
node
current
pruning
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
CNB021486824A
Other languages
Chinese (zh)
Other versions
CN1455387A (en)
Inventor
韩疆 (Han Jiang)
颜永红 (Yan Yonghong)
潘接林 (Pan Jielin)
张建平 (Zhang Jianping)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, Beijing Kexin Technology Co Ltd filed Critical Institute of Acoustics CAS
Priority to CNB021486824A priority Critical patent/CN1201284C/en
Publication of CN1455387A publication Critical patent/CN1455387A/en
Application granted granted Critical
Publication of CN1201284C publication Critical patent/CN1201284C/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The present invention relates to a rapid decoding method for a speech recognition system, comprising the following steps. First, the decoding operation unit in the speech recognition system is initialized. Second, the feature codeword vector of the next speech frame is taken in sequence from a speech feature codeword sequence of length T input into the decoding operation unit and set as the current speech frame O_t, 1 ≤ t ≤ T. Third, the current speech frame O_t is filtered. Fourth, based on the current valid frame O_t^V, each active node in each layer I of the time-t dictionary-tree token resource L_t is examined. Fifth, the tokens at dictionary-tree word nodes are processed. Sixth, the pruning-related thresholds are adaptively adjusted according to the local-path maximum probability at time t and the local-path maximum probability of the previous valid speech frame. Seventh, steps two through six are repeated, and the text string that best matches the acoustic model and the language model is output as the speech recognition result. Compared with the traditional method, the rapid decoding method accelerates the decoding computation.

Description

Quick decoding method in speech recognition system
Technical Field
The invention relates to a fast decoding method in a speech recognition system.
Background
The decoding operation is a main component of a speech recognition system. Its function is: given an acoustic model and a language model, for an input sequence of acoustic observation feature vectors, the computer automatically finds, in a statically or dynamically constructed search space, the text string that best matches the acoustic model and the language model, thereby converting the user's speech input into the corresponding text.
Fig. 1 is a block diagram of a known speech recognition system. Analog speech is converted by an analog-to-digital conversion unit 11 into a digital signal that the computer can process. The feature extraction unit 12 then frames the digital signal, with a frame length of usually 20 ms and a frame shift of 10 ms, and extracts the MFCC parameters of each speech frame to obtain an MFCC vector sequence. The decoding operation unit 14 obtains the recognition result from the feature vector sequence of the input speech, the acoustic model 13, and the language model 15, using some search strategy such as depth-first search (the Viterbi algorithm) or breadth-first search. The language model applies linguistic knowledge to the speech recognition system when performing large-vocabulary continuous speech recognition, improving the recognition accuracy of the system.
Because the speech recognizer of fig. 1 places very high demands on central processing unit speed and memory capacity, some commercial dictation systems, such as IBM's ViaVoice and the dictation module in Microsoft Office XP, require a high-speed central processing unit (Intel Pentium II 400 MHz or better) and a large memory (100 MByte or more). In general, the decoding operation occupies more than 90% of the central processor computing resources and almost all of the memory resources in the entire speech recognizer; the analog-to-digital conversion module and the feature extraction unit occupy less than 10% of the computing resources and little memory.
Current commercial embedded speech recognition mainly uses speaker-dependent, small-vocabulary recognition based on simple template matching, such as voice dialing and simple command recognition on mobile phones; because this technique requires the user to register speech data, its usability and applicability are limited. Some speaker-independent embedded speech recognition systems target small-vocabulary command-word recognition, but the computational and memory requirements remain large: for example, IBM's Personal Voice Assistant speech recognition system requires a computing device with 50 DMIPS of computing power for a task domain of 500 words.
The basic principles and concepts of known decoding operations are as follows:
1. dictionary tree
The lexicon tree is a tree structure used to organize the pronunciations of all words in the recognition system. Phonemes are the basic units that make up the pronunciation of a word, and TRIPHONE phonemes are the phoneme units commonly used by current speech recognition systems. For example, the TRIPHONE phoneme sequence of the word "中国" (China) is "sil-zh+ong zh-ong+g ong-g+uo g-uo+sil", where "sil" is a special phoneme used to describe a pause in the user's speech. Compared with a conventional pinyin representation, a TRIPHONE phoneme is a context-dependent phoneme that describes how the pronunciation of a phoneme varies in different contexts, so it can describe the acoustic characteristics of a word's pronunciation more accurately. Words in the recognition system may share a prefix or subword; for example, the words "中" (middle) and "中国" (China) share the prefix "中", which can be described by a tree structure. Assuming the vocabulary of the recognition system contains the five words "abe", "ab", "acg", "acgi", and "ac", the dictionary tree of this vocabulary is shown in fig. 4.
the TRIPHONE phone corresponding to each node in the dictionary tree is associated with a Hidden Markov Model (HMM) corresponding to the TRIPHONE, and fig. 5 shows an HMM topology representing the TRIPHONE phone, where an HMM is composed of several HMM states.
2. Token definition and token extension policy
A token represents an active search path from the start frame of the user's speech to the current speech frame. It includes path identification information, comprising all words in the path and their word-boundary information, and the scores of the path against the acoustic model and language model. Each token corresponds to one active search path; tokens differ in that they have different acoustic contexts and different language contexts.
Each state in the HMM associated with each node in the dictionary tree can host active tokens, and each state of a node has a token linked list used to store all tokens active in that state at any time. Suppose at time t, the score of one expandable token in the token linked list of state i of a node in the dictionary tree is s_i(t−1). During the search, if s_i(t−1) plus the transition probability from state i to state j plus the observation probability of state j for the current speech frame t exceeds the current pruning threshold, a new token with score s_j(t) is generated and attached to state j. After all the tokens residing on the dictionary tree at time t−1 have been processed, the token resource to be expanded at time t has been generated, and all tokens residing on the dictionary tree at time t−1 are deleted.
During token propagation, possible words and word-boundary information are recorded in a linked-list structure that identifies the path. Therefore, at the speech-input end time T, the best-matching word sequence and the corresponding word-boundary positions can be extracted by tracing back the path-identification linked list of the best-scoring token.
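The path-identification linked list described above can be sketched as follows; all field and function names are assumptions for illustration, and scores are kept in the log domain as elsewhere in the document:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WordLink:
    """One path-identification record: a word and its end-frame boundary."""
    word: str
    end_frame: int
    prev: Optional["WordLink"]

@dataclass
class Token:
    score: float                  # log-domain match score of the partial path
    history: Optional[WordLink]   # head of the path-identification linked list

def record_word(token, word, t):
    """Prepend a (word, boundary) record when a word end is crossed at frame t."""
    return Token(token.score, WordLink(word, t, token.history))

def backtrace(token):
    """Recall the word sequence and boundaries from the best-scoring token."""
    words, link = [], token.history
    while link is not None:
        words.append((link.word, link.end_frame))
        link = link.prev
    return list(reversed(words))

tok = record_word(record_word(Token(0.0, None), "中", 37), "国", 80)
```

Prepending to the list keeps word recording O(1) per token; the full sequence is only materialized once, by the backtrace at time T.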
3. Token resource definition of dictionary tree nodes
Assume a dictionary-tree node contains M HMM states s_1 … s_M. The token resource of the node then contains the following information:

node token resource: H_s1, H_s2, …, H_sM

where H_si (1 ≤ i ≤ M) is the head of the token linked list of HMM state s_i in the node.
The traditional decoding operation method places heavy demands on hardware computing power and memory, and therefore offers poor cost-effectiveness.
Chinese patent application 02131086.6 discloses a method of compressing the feature vector set for a speech recognition system. In the process of clustering the speech feature vector set to obtain a codebook, a step is added that dynamically merges and splits subsets according to the number of vectors in each subset and the total distance measure of the vectors. This reduces the sum of the distances between the vectors in the clustered set and their corresponding codewords and improves the accuracy of the clustering algorithm; when a codebook compressed by that method is applied to a speech recognition system, the storage requirement of the system can be greatly reduced while its recognition performance is preserved. That application also discloses a speech recognition system, whose structure is shown in fig. 2. The system replaces the acoustic model with a feature codebook and a probability table: Gaussian probabilities need not be computed during decoding, and the required probability values are simply looked up in the prestored probability table, which greatly reduces the cost of Gaussian probability computation in the decoding operation and can considerably improve the recognition speed of the system.
Disclosure of Invention
The invention aims to provide a fast decoding method for current speaker-independent, large-vocabulary continuous speech recognition systems, addressing the problems of current embedded speech recognition technology: limited market acceptance relative to product price, and excessive demands on hardware computing power and memory. With it, current speech recognition technology can also be applied on embedded hardware platforms such as PDAs, mobile phones, and smart phones.
The object of the invention can be achieved by the following measures:
a fast decoding method in a speech recognition system, comprising the steps of:
(1) initializing a decoding operation unit in the voice recognition system;
(2) taking the feature codeword vector of the next speech frame in sequence from the length-T speech feature codeword sequence input into the decoding operation unit, and setting it as the speech frame O_t at time t, 1 ≤ t ≤ T;
(3) filtering the speech frame O_t at time t; if the frame is filtered out, going to step (2); otherwise, setting the speech frame O_t as the valid speech frame O_t^V;
(4) based on the valid speech frame O_t^V, judging each active node in each layer-I token resource L_t[I] of the time-t dictionary-tree token resource L_t; if a token is judged extensible, extending the tokens in the node's token resource table and linking the newly generated tokens into the token resource table of the target node, where I is an index variable, 1 ≤ I ≤ H, and H is the height of the dictionary tree; otherwise, executing step (7);
(5) processing tokens at nodes of a dictionary tree;
(6) adaptively adjusting the pruning-related thresholds according to the local-path maximum probability at time t and the local-path maximum probability at the time corresponding to the previous valid speech frame;
(7) repeating steps (2) to (6); at the speech-input end time T, tracing back the global path of the best-scoring token, ending token expansion, and outputting the text string generated at that moment that best matches the acoustic model and the language model as the speech recognition result.
The present invention does not prescribe the expansion of word-node tokens or the related processing algorithms, which a user can customize based on the task domain (e.g., command-word recognition, Chinese monosyllable recognition, large-vocabulary continuous speech recognition, etc.).
The time-t dictionary-tree token resource L_t is the sum of the token resources of all active nodes in the dictionary tree at that time. Active nodes at time t are indexed in the dictionary tree by layer: all active nodes on the same layer are linked in series into one list, each layer of the dictionary tree has such a list, and the whole forms a two-dimensional linked list.
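A minimal sketch of this layered (two-dimensional linked list) index, using Python lists in place of linked lists; the class and method names are assumptions:

```python
class ActiveNodeIndex:
    """Two-dimensional index of active nodes: one list per dictionary-tree layer."""
    def __init__(self, height):
        # layers[I - 1] holds the active-node list of layer I, 1 <= I <= height
        self.layers = [[] for _ in range(height)]

    def activate(self, node, layer):
        """Link a node into its layer's active-node list if not already there."""
        if node not in self.layers[layer - 1]:
            self.layers[layer - 1].append(node)

    def active_nodes(self, layer):
        """The layer-I token resource L_t[I] is read off this list."""
        return self.layers[layer - 1]
```

Iterating layer by layer over `active_nodes(I)` for 1 ≤ I ≤ H visits exactly the token resources to be expanded at time t.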
The layer-I token resource L_t[I] of the time-t dictionary-tree token resource is the I-th linked list of the time-t active-node token resource L_t indexed as above.
The local-path maximum probability at time t is: the maximum of all local-path scores in the set of local paths corresponding to all tokens newly generated at time t.
The local-path maximum probability at the time corresponding to the previous valid speech frame is: the maximum of all local-path scores in the set of local paths corresponding to all tokens newly generated at that time.
The initialization step (1) further comprises the following steps:
a. generating a token with a score of zero and linking it into the head of the token resource table of the root node of the dictionary tree; the active nodes of the current dictionary tree then comprise only the root node, which is located at the first layer of the tree;
b. initializing the global pruning threshold L_g to the logarithmic minimum;
c. initializing the local pruning baseline threshold L_b to the logarithmic minimum;
d. initializing the pruning width threshold L_w to a positive constant L_w^c preset by the user.
The filtering step (3) further comprises the following steps:
3a. if the speech frame O_t at time t is the initial frame of the user's speech input, setting it as a valid speech frame and finishing the filtering operation; otherwise, executing step 3b;
3b. comparing the Y feature codeword vectors f_1^t, f_2^t, …, f_Y^t of the speech frame O_t at time t with the Y feature codeword vectors f_1^{t-1}, f_2^{t-1}, …, f_Y^{t-1} of the speech frame O_{t-1} at time t−1, and obtaining a similarity measure V from their degree of similarity;
3c. comparing the similarity measure V with a decision threshold θ: if V ≤ θ, judging the speech frame O_t invalid for the decoding operation; otherwise, judging the speech frame O_t valid for the decoding operation.
The decision threshold θ is a constant greater than 0 set by the user.
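Steps 3a to 3c can be sketched as follows. The concrete similarity measure V used here (the number of codeword positions that differ between consecutive frames) is an assumption; the text only requires some measure of similarity compared against the threshold θ:

```python
def filter_frames(codeword_frames, theta):
    """Keep the first frame; drop later frames too similar to their predecessor.

    codeword_frames: list of length-Y tuples of feature codeword indices.
    theta: decision threshold; a frame whose similarity measure V <= theta
           is judged invalid for the decoding operation. V is assumed here
           to be the number of codeword positions changed since the previous
           frame, so small V means the frame is redundant.
    """
    valid = []
    prev = None
    for frame in codeword_frames:
        if prev is None:                       # step 3a: initial frame is valid
            valid.append(frame)
        else:                                  # steps 3b and 3c
            v = sum(1 for a, b in zip(frame, prev) if a != b)
            if v > theta:
                valid.append(frame)
        prev = frame                           # always compare against O_{t-1}
    return valid
```

With θ = 0 only exact repeats of the previous frame are dropped; larger θ removes more of a stationary segment, trading a little precision for fewer frames in the search computation.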
The node token resource expanding step (4) further comprises the following steps:
4a. based on the valid speech frame O_t^V, externally expanding each token in the token resource linked list of the last state of the HMM associated with the current node, i.e., expanding each such token into the token resource tables of all child nodes of the node in the dictionary tree;
4b. taking one HMM state of the M-state HMM associated with the current node as the current HMM state s_n to be processed, where 1 ≤ n ≤ M;
4c. taking one token in the token resource table of state s_n as the current token to be processed;
4d. if the score of the current token in state s_n is greater than the global pruning threshold L_g of the time corresponding to the previous valid speech frame, taking one state reachable from s_n according to the topology of the HMM associated with the current node as the current state s_m to be processed; otherwise, going to step 4k;
4e. computing the score s_m(t) of the token going from s_n to state s_m: s_m(t) is the current score of the token, plus the transition probability from state s_n to state s_m, plus the observation probability of state s_m for the current speech frame O_t, which is obtained by table lookup from the probability table input into the decoding operation unit;
4f. computing the current local pruning threshold L_p by the formula L_p = L_b − L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;
4g. if the score of the token from s_n to s_m is greater than the current local pruning threshold L_p, generating a new token with score s_m(t); otherwise, executing step 4j;
4h. linking the new token generated in step 4g into the token linked list with head H_sm in the node, checking whether the node is in the active node table of its layer of the dictionary tree, and linking it in if not;
4i. according to the new token's score s_m(t): if s_m(t) − L_w > L_b, updating the local pruning baseline threshold as L_b = s_m(t);
4j. taking another state reachable from s_n as the current state s_m to be processed, and repeating steps 4e to 4i until all states reachable from s_n have been processed; then going to step 4k;
4k. taking another token in the token resource table of state s_n as the current token to be processed, and repeating steps 4d to 4j until the expansion of all tokens in the token resource table of s_n is finished; then going to step 4l;
4l. taking another HMM state of the M-state HMM associated with the node as the current HMM state s_n, 1 ≤ n ≤ M, and repeating steps 4c to 4k until all token resource expansion operations of the current node are complete.
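Steps 4d to 4i, for extending one token to one reachable state, can be sketched as below; the function shape, and treating the probability-table lookups as precomputed arguments, are assumptions for the sketch:

```python
import math

LOG_MIN = -math.inf   # the "logarithmic minimum" used to initialize thresholds

def expand_token(score, trans_prob, obs_prob, L_g, L_b, L_w):
    """Try to extend one token from state s_n to a reachable state s_m.

    score      -- current token score (log domain)
    trans_prob -- log transition probability s_n -> s_m
    obs_prob   -- log observation probability of s_m for the current frame,
                  assumed already looked up from the probability table
    L_g        -- global pruning threshold of the previous valid frame (step 4d)
    L_b, L_w   -- local pruning baseline and width thresholds (steps 4f to 4i)
    Returns (new token score, or None if pruned; possibly updated L_b).
    """
    if score <= L_g:                      # step 4d: the token itself is pruned
        return None, L_b
    s_m = score + trans_prob + obs_prob   # step 4e: score reaching s_m
    L_p = L_b - L_w                       # step 4f: local pruning threshold
    if s_m <= L_p:                        # step 4g: prune this extension
        return None, L_b
    if s_m - L_w > L_b:                   # step 4i: raise the local baseline
        L_b = s_m
    return s_m, L_b
```

The full step 4 is this function applied inside three loops: over states s_n, over the tokens in each state, and over the states reachable from s_n.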
The external expansion step 4a comprises the following steps:
4a-i. if the current score of the token is less than or equal to the global pruning threshold L_g of the time corresponding to the previous valid speech frame, not expanding the current token to the child nodes of its node; otherwise, executing step 4a-ii;
4a-ii. taking the j-th of the J child nodes in the dictionary tree of the node where the current token is located, node_j, as the current node to be processed;
4a-iii. accumulating the score s_1(t) of the token reaching the first state s_1 of node_j: s_1(t) is the current score of the token, plus the transition probability from the last state of the current token's node to node_j, plus the observation probability of the first state s_1 of node_j for the current speech frame O_t, which is obtained by table lookup from the probability table input into the decoding operation unit;
4a-iv. computing the current local pruning threshold L_p by the formula L_p = L_b − L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;
4a-v. if the score of the token reaching the first state s_1 of node_j is greater than the current local pruning threshold L_p, executing step 4a-vi; otherwise, executing step 4a-ix;
4a-vi. generating a new token with score s_1(t);
4a-vii. linking the token into the token linked list with head H_s1 in node_j, checking whether node_j is in the active node table of its layer of the dictionary tree, and linking it in if not;
4a-viii. according to the score s_1(t): if s_1(t) − L_w > L_b, updating the local pruning baseline threshold as L_b = s_1(t);
4a-ix. taking another child node of the current token's node in the dictionary tree as the current node_j to be processed, and repeating steps 4a-iii to 4a-viii until the expansion of the current token to all child nodes of its node in the dictionary tree is complete.
The adaptive pruning step (6) based on the maximum probability of the local path comprises the following steps:
6a. computing a pruning width threshold adjustment factor L_f from the local-path maximum probabilities at the current time t and at the time corresponding to the previous valid speech frame, where the current global pruning threshold L_g is the local-path maximum probability at the time corresponding to the previous valid speech frame, the current local pruning baseline threshold L_b is the local-path maximum probability at time t, and the number of all valid speech frames up to time t also enters the computation;
6b. normalizing the computed adjustment factor L_f: if L_f > L_f^MAX, setting L_f = L_f^MAX; if L_f < L_f^MIN, setting L_f = L_f^MIN, where L_f^MAX and L_f^MIN are the upper and lower bounds of the adjustment factor, both positive constants settable by the user;
6c. updating the pruning width threshold according to the computed adjustment factor: L_w = L_f · L_w^c, where L_w^c is the initial pruning width threshold obtained in initialization step (1);
6d. updating the global pruning threshold at time t as L_g = L_b, preparing for token expansion at the next valid speech frame;
6e. resetting the local pruning baseline threshold L_b to the logarithmic minimum, preparing for token expansion at the next valid speech frame.
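Steps 6a to 6e can be sketched as follows. The exact adjustment-factor formula of step 6a is carried by an image in the original text and is not recoverable, so the ratio used below is only a placeholder; the normalization and threshold updates of steps 6b to 6e follow the text:

```python
import math

def adapt_pruning(L_b, L_g, n_valid, L_w_init, L_f_min, L_f_max):
    """Steps 6a-6e: adapt the pruning thresholds after a valid frame at time t.

    L_b      -- local-path maximum probability at time t (local baseline threshold)
    L_g      -- local-path maximum probability at the previous valid frame's time
                (current global pruning threshold)
    n_valid  -- number of valid speech frames up to time t (enters the patent's
                real formula; unused by the placeholder below)
    Returns the updated (L_w, L_g, L_b).
    """
    # step 6a -- placeholder for the unrecoverable formula: compare the
    # current and previous local-path maxima
    if math.isinf(L_g) or L_g == 0:
        L_f = 1.0                           # no usable previous maximum yet
    else:
        L_f = L_b / L_g
    L_f = min(max(L_f, L_f_min), L_f_max)   # step 6b: clamp to user bounds
    L_w = L_f * L_w_init                    # step 6c: rescale the pruning width
    L_g_new = L_b                           # step 6d: new global threshold
    L_b_new = -math.inf                     # step 6e: reset the local baseline
    return L_w, L_g_new, L_b_new
```

The effect is that the beam width L_w tightens or relaxes from frame to frame depending on how the best local path is evolving, instead of staying at the fixed value L_w^c.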
Compared with the prior art, the invention has the following advantages:
compared with the traditional decoding operation method, the invention comprises the following improvements: the adaptive pruning strategy based on the maximum probability of the local path; and (3) a voice frame filtering strategy based on the characteristic code word vector.
The decoding operation of the invention adopts a breadth-first search framework with pruning, based on a dictionary tree and token expansion. The computational complexity of the algorithm is O(MT), where T is the number of speech frames entering the search computation and M is the average number of active paths per speech frame during the search.
The traditional decoding operation runs the search computation over every speech frame of the user's input. In fact, the user's speech is a time-varying signal with local stationarity, so within a stationary segment some frames similar to their neighbors can be removed, i.e., kept out of the search computation, without affecting the precision of the decoding operation. The invention therefore provides a speech-frame filtering strategy based on the feature codeword vectors of the frames, which effectively removes frames redundant for the search, so that the number of frames actually entering the search computation is smaller than the number of frames contained in the user's speech input.
On the other hand, as the complexity formula above shows, the speed of the search computation also depends on the average number of active paths per frame: with T unchanged, the larger M is, the larger the search overhead, and the smaller M is, the smaller the overhead. The size of M depends on the pruning strategy adopted. The invention therefore provides an adaptive pruning strategy based on the local-path maximum probability which, compared with the traditional method, effectively reduces M without noticeably affecting recognition precision, further accelerating the decoding operation.
Drawings
FIG. 1 is a block diagram of a known speech recognition system
FIG. 2 is a block diagram of another known speech recognition system
FIG. 3 is a flowchart illustrating a decoding operation according to the present invention
FIG. 4 is a diagram of a known dictionary tree structure
FIG. 5 is a schematic diagram of an HMM topology of a conventional TRIPHONE phone
Detailed Description
FIG. 3 is a flow chart of the decoding operation of the present invention. With reference to fig. 2 and fig. 3, the operation flow of a speech recognition system based on the decoding subsystem of the invention is as follows: convert the input analog speech signal into a digital signal; frame the digital signal and extract the feature parameters of each speech frame, each frame corresponding to one feature vector, to obtain the feature vector sequence of the input speech; quantize and encode the feature vector sequence with the feature codebook, each frame corresponding to one feature codeword vector, to obtain the corresponding feature codeword vector sequence; input the speech feature codeword vector sequence into the speech-frame filtering unit of the decoding subsystem, perform the frame filtering operation, and remove the codeword vectors of invalid frames from the sequence to obtain the valid speech feature codeword vector sequence; and run the search computation over the valid sequence to obtain the recognition result. During the search, local search paths are pruned with the adaptive pruning strategy based on the local-path maximum probability, and the observation probability of each codeword of the valid sequence on a (local) search path is looked up directly from the probability table.
Definition of the dictionary-tree active-node token resource and its indexing
In the invention, at any time t, a node holding an active token in the dictionary tree is called an active node at time t. Active nodes at time t are indexed in the dictionary tree by layer: all active nodes on the same layer are linked in series into one list, each layer of the dictionary tree has such a list, and the whole forms a two-dimensional linked list.
At any time t, the sum of the token resources of all active nodes in the dictionary tree is called the time-t dictionary-tree active-node token resource, and it specifies the token resource to be expanded at time t. For convenience of description, the time-t dictionary-tree active-node token resource indexed as above is denoted L_t, with index variable I (1 ≤ I ≤ H), where H is the height of the dictionary tree.
A specific embodiment of the fast decoding method of the present invention is given based on the above-described basic principles and related concepts regarding decoding operations.
The decoding method of the present invention includes the steps of:
1. initializing a decoding operation unit in the voice recognition system;
2. extracting the feature codeword vector of the first speech frame from the length-T speech feature codeword vector sequence input into the decoding operation unit, and setting it as the speech frame O_t at time t (t = 1);
3. performing the filtering operation on the speech frame O_t at time t;
4. if O_t is a valid speech frame, for each active node in each layer-I token resource L_t[I] (1 ≤ I ≤ H) of the time-t dictionary-tree token resource L_t, expanding the tokens in the node's token resource table and linking the newly generated tokens into the token resource table of the target node; otherwise going to step 7;
5. processing the tokens at word nodes; the invention does not prescribe the expansion of word-node tokens or the related processing algorithms, which the user can customize according to the task domain (e.g., command-word recognition, Chinese monosyllable recognition, large-vocabulary continuous speech recognition, etc.);
6. adaptively adjusting the pruning-related thresholds according to the local-path maximum probability at time t and that at the time corresponding to the previous valid speech frame; the adjusted thresholds comprise the global pruning threshold L_g, the local pruning baseline threshold L_b, and the pruning width threshold L_w;
7. taking the next speech frame out of the length-T speech feature vector sequence input into the decoding operation unit; if one can be taken out, setting it as the speech frame O_t at time t (t ≤ T) and going to step 3; otherwise going to step 8;
8. ending token expansion and producing the recognition result: by tracing back the global path of the best-scoring token at time T, outputting the text string that best matches the acoustic model and the language model.
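The eight steps above amount to an outer loop of roughly this shape; every callback is a hypothetical stand-in for the corresponding operation described in the text:

```python
import math

def decode(frames, expand_active_nodes, process_word_tokens,
           adapt_thresholds, is_valid_frame, backtrace_best):
    """Skeleton of the decoding loop; each callback is a hypothetical
    stand-in for the corresponding numbered step in the text."""
    # step 1: initialization (the zero-score root token is assumed to be
    # created by the callbacks; only the thresholds are tracked here)
    state = {"L_g": -math.inf, "L_b": -math.inf}
    prev = None
    for t, frame in enumerate(frames, start=1):    # steps 2 and 7: next frame
        if not is_valid_frame(frame, prev, t):     # step 3: frame filtering
            prev = frame
            continue
        expand_active_nodes(frame, state)          # step 4: token expansion
        process_word_tokens(frame, state)          # step 5: word-node tokens
        adapt_thresholds(state)                    # step 6: adaptive pruning
        prev = frame
    return backtrace_best(state)                   # step 8: best-path recall
```

Filtered frames skip steps 4 through 6 entirely, which is exactly where the frame-filtering strategy saves search computation.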
The sub-steps of the expansion operation on the current node token resource in step 4 of the decoding method are as follows:
4a, expanding the token into token resource tables of all son nodes of the node in a dictionary tree for each token in the token resource chain table corresponding to the last state of the HMM associated with the node;
4b. taking the first HMM state of the M-state HMM associated with the node as the current HMM state to be processed, s_n (n = 1);
4c. taking one token in the token resource table corresponding to state s_n as the current token to be processed;
4d. if the current score of the token is greater than the global pruning threshold L_g of the time t′ corresponding to the previous valid speech frame, then, according to the topology of the HMM model associated with the current node, taking a state reachable from state s_n and setting it as the current state to be processed, s_m; otherwise going to step 4k;
4e. calculating the score s_m(t) with which the token reaches state s_m, namely: the current score of the token, plus the transition probability from state s_n to state s_m, plus the observation probability of state s_m for the current speech frame O_t; the observation probability can be obtained by a table lookup in the probability table input to the decoding operator;
4f. calculating the current local pruning threshold L_p by the formula L_p = L_b − L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;
4g. if the score with which the token reaches state s_m is less than or equal to the current local pruning threshold L_p, going to step 4j; otherwise generating a new token with score s_m(t);
4h. linking the new token into the token resource table of the node whose header is H_sm, checking whether the node is in the active node table of its layer of the dictionary tree, and if not, linking the node into the active node table;
4i. according to the score s_m(t), updating the local pruning baseline threshold L_b as follows: if s_m(t) − L_w > L_b, then L_b = s_m(t); otherwise not updating;
4j. taking another state s_m reachable from state s_n; if one is taken, setting it as the current state to be processed and going to step 4e; otherwise going to step 4k;
4k. taking another token in the token resource table corresponding to state s_n; if one is taken, setting it as the current token to be processed and going to step 4d; otherwise going to step 4l;
4l. taking the next HMM state of the M-state HMM associated with the node; if one is taken, setting it as the current HMM state to be processed, s_n (n ≤ M), and going to step 4c; otherwise the token resource expansion operation of the current node is complete.
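The inner loop of substeps 4c–4j can be sketched as follows. This is an illustrative simplification, not the claimed implementation: tokens are bare scores, the HMM topology is given as a list of reachable states, and the transition and observation probabilities are supplied as precomputed log-probability tables.

```python
def expand_state_tokens(tokens, reachable, trans, obs, Lg_prev, Lb, Lw):
    """Sketch of substeps 4c-4j: expand every token held by one HMM state s_n.

    tokens: scores of the tokens in state s_n
    reachable: indices of states s_m reachable from s_n (HMM topology)
    trans[m]: log transition probability s_n -> s_m
    obs[m]: log observation probability of s_m for the current frame (table lookup)
    Lg_prev: global pruning threshold from the previous valid frame
    Returns the surviving new tokens and the updated baseline threshold Lb.
    """
    new_tokens = []
    for score in tokens:
        if score <= Lg_prev:                   # 4d: global pruning
            continue
        for m in reachable:
            sm_t = score + trans[m] + obs[m]   # 4e: accumulate the path score
            Lp = Lb - Lw                       # 4f: current local pruning threshold
            if sm_t <= Lp:                     # 4g: local pruning
                continue
            new_tokens.append((m, sm_t))       # 4g/4h: new token linked to s_m
            if sm_t - Lw > Lb:                 # 4i: raise the baseline threshold
                Lb = sm_t
    return new_tokens, Lb
```

Note that L_b rises as better-scoring tokens appear, so later tokens within the same frame are pruned against a progressively tighter local threshold; the same pruning pattern applies to the cross-node expansion of substep 4a.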
Substep 4a of step 4 of the decoding method, expanding the current token to all child nodes of the node where it is located, comprises the following steps:
4a-1. if the score of the current token is less than or equal to the global pruning threshold L_g of the time t′ corresponding to the previous valid speech frame, the operation of expanding the current token to all child nodes of its node is not needed; otherwise going to step 4a-2;
4a-2. taking one child node of the node where the current token is located as the current node to be processed, node_j (j = 1);
4a-3. accumulating the score s_1(t) with which the token reaches the first state s_1 of node_j, namely: the current score of the token, plus the transition probability from the last state of the token's node to node_j, plus the observation probability of the first state s_1 of node_j for the current speech frame O_t; the observation probability can be obtained by a table lookup in the probability table input to the decoding operator;
4a-4. calculating the current local pruning threshold L_p by the formula L_p = L_b − L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;
4a-5. if the score with which the token reaches the first state s_1 of node_j is less than or equal to the current local pruning threshold L_p, going to step 4a-9; otherwise going to step 4a-6;
4a-6. generating a new token with score s_1(t);
4a-7. linking the token into the token resource table of node_j whose header is H_s1, checking whether node_j is in the active node table of its layer of the dictionary tree, and if not, linking the node into the active node table;
4a-8. according to the score s_1(t), updating the local pruning baseline threshold L_b as follows: if s_1(t) − L_w > L_b, then L_b = s_1(t); otherwise not updating;
4a-9. taking another child node of the node where the current token is located; if one is taken, setting it as the current node to be processed, node_j (j ≤ N, where N is the number of child nodes of the token's node in the dictionary tree), and going to step 4a-3; otherwise the operation of expanding the current token to all child nodes of its node is complete.
The step 1 of the decoding method for initializing the decoding arithmetic unit includes the steps of:
a. generating a token with score zero and linking it into the token resource table header of the root node of the dictionary tree; the active nodes of the current dictionary tree comprise only the root node, which is located at the first layer of the dictionary tree;
b. initializing the global pruning threshold L_g to the logarithmic minimum;
c. initializing the local pruning baseline threshold L_b to the logarithmic minimum;
d. initializing the pruning width threshold L_w to a positive constant L_w^c; this value can be read from a decoding operator configuration file set by the user.
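A minimal sketch of this initialization step, assuming a simple dictionary-based representation of the decoder state (the data structures and the name `init_decoder` are illustrative, not from the patent):

```python
LOG_MIN = float("-inf")  # the patent's "logarithmic minimum"

def init_decoder(Lw_c=100.0):
    """Sketch of step 1: one zero-score token at the root, thresholds reset.

    Lw_c stands in for the constant pruning width read from the
    user-supplied decoder configuration file.
    """
    root_tokens = [0.0]              # a: a zero-score token at the dictionary-tree root
    active_nodes = {1: ["root"]}     # active nodes; layer 1 holds only the root
    thresholds = {
        "Lg": LOG_MIN,               # b: global pruning threshold
        "Lb": LOG_MIN,               # c: local pruning baseline threshold
        "Lw": Lw_c,                  # d: pruning width threshold
    }
    return root_tokens, active_nodes, thresholds
```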
The step 3 of the decoding method for performing the filtering operation on the current speech frame comprises the following steps:
3a. if the current speech frame O_t at time t is the initial speech frame of the user's speech input, setting it as a valid speech frame and ending the filtering operation; otherwise going to step 3b;
3b. comparing the feature codeword vector (f_1^t, f_2^t, …, f_Y^t) of the speech frame O_t at time t with the feature codeword vector (f_1^{t-1}, f_2^{t-1}, …, f_Y^{t-1}) of the speech frame O_{t-1} at time t−1, where Y is the number of feature codewords contained in the feature codeword vector of a speech frame, and obtaining a similarity measure V from the degree of similarity of the two vectors; V can be calculated by the formula

V = Σ_{i=1}^{Y} C(f_i^t, f_i^{t-1}),

where C(f_i^t, f_i^{t-1}) is a per-codeword comparison function whose definition is given by a formula in the original document that is not reproduced in this text;
3c. comparing the similarity measure V with a decision threshold θ (a positive constant that can be read from a decoding operator configuration file set by the user); if V ≤ θ, judging the speech frame O_t to be invalid for the decoding operation; otherwise judging the speech frame O_t to be valid for the decoding operation.
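The filtering step can be sketched as below. The per-codeword comparison C is given in the original by a formula that does not survive in this text; here it is assumed to count *differing* codewords, which is consistent with the decision rule "V ≤ θ ⇒ frame is redundant and filtered out". The function name and argument shapes are illustrative.

```python
def frame_is_valid(prev_codewords, cur_codewords, theta):
    """Sketch of step 3: filter a frame by codeword similarity to its predecessor.

    Assumption (not confirmed by the surviving text): C(a, b) = 1 when the
    codewords differ and 0 when they are equal, so V counts changed codewords.
    """
    if prev_codewords is None:      # 3a: the initial frame is always valid
        return True
    # 3b: similarity measure V over the Y codewords of the two frames
    V = sum(1 for a, b in zip(cur_codewords, prev_codewords) if a != b)
    return V > theta                # 3c: V <= theta -> too similar, filtered out
```

Under this reading, a slowly speaking user produces many near-identical consecutive frames and therefore more frames are filtered, matching the speech-rate normalization effect described above.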
Experimental results show that the speech frame filtering operation can remove 20–30% of the invalid speech frames in the speech feature codeword vector sequence, and can therefore accelerate the decoding operation. Compared with the traditional method, the recognition performance is also insensitive to the user's speaking rate: for a user who speaks slowly, the filtering operation removes more invalid speech frames, and for a user who speaks quickly it removes fewer; that is, the filtering operation normalizes the speaking rates of different users to a certain extent.
Step 6 of the above decoding method further comprises the steps of:
After the first 5 steps of the decoding operator, the following are available: the current global pruning threshold L_g, which equals the maximum local path probability at the time t′ corresponding to the previous valid speech frame, and the current local pruning baseline threshold L_b, which equals the maximum local path probability at time t. Accordingly, step 6 of the decoding operator is divided into the following steps:
6a. calculating the pruning width threshold adjustment factor L_f as:

L_f = (L_b − L_g) / (t̃ · L_b),

where t̃ is the number of all valid speech frames up to time t;
6b. normalizing the calculated pruning width threshold adjustment factor L_f: if L_f > L_f^MAX, then L_f = L_f^MAX; if L_f < L_f^MIN, then L_f = L_f^MIN; where L_f^MAX is the upper bound of the adjustment factor L_f (e.g. 1.05) and L_f^MIN is its lower bound (e.g. 0.5), both positive constants that can be read from a decoder configuration file set by the user;
6c. according to the calculated pruning width threshold adjustment factor L_f, updating the pruning width threshold L_w as: L_w = L_f · L_w^c;
6d. updating the global pruning threshold at time t as L_g = L_b, preparing for token expansion for the next valid speech frame;
6e. resetting the local pruning baseline threshold L_b to the logarithmic minimum, preparing for token expansion for the next valid speech frame.
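Substeps 6a–6e can be sketched as a single update function. Note that the adjustment-factor formula is reconstructed from garbled OCR of the original and may differ from the patent's actual equation; the function name and argument order are illustrative.

```python
def adapt_pruning(Lb, Lg, t_valid, Lw_c, Lf_max=1.05, Lf_min=0.5):
    """Sketch of step 6 under the reconstructed formula
    Lf = (Lb - Lg) / (t_valid * Lb),
    where t_valid is the number of valid frames processed so far.
    Returns the updated (Lw, Lg, Lb) for the next valid frame.
    """
    Lf = (Lb - Lg) / (t_valid * Lb)       # 6a: adjustment factor
    Lf = min(max(Lf, Lf_min), Lf_max)     # 6b: clamp to [Lf_min, Lf_max]
    Lw = Lf * Lw_c                        # 6c: new pruning width
    Lg = Lb                               # 6d: global threshold <- best score so far
    Lb = float("-inf")                    # 6e: reset the baseline
    return Lw, Lg, Lb
```

Clamping L_f (step 6b) keeps the beam width within a bounded multiple of the configured constant L_w^c, so a single unusual frame cannot collapse or explode the search beam.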
In the conventional search algorithm, the pruning width threshold L_w is constant. In the present invention, after the search calculation on the current valid speech frame, the pruning width threshold L_w is adaptively adjusted according to the maximum local path probability, so that adaptive pruning of local paths is achieved when the next valid speech frame is searched. Experimental results show that, without affecting recognition accuracy, this method effectively reduces the average token number M in the decoding process by 10–20%, thereby further accelerating the decoding operation.

Claims (10)

1. A fast decoding method in a speech recognition system, comprising the steps of:
(1) initializing a decoding operation unit in the voice recognition system;
(2) taking the feature codeword vector of the next speech frame in sequence from the speech feature codeword sequence of length T input to the decoding operator and setting it as the speech frame O_t at time t, 1 ≤ t ≤ T;
(3) filtering the speech frame O_t at time t; if the speech frame is filtered out, executing step (2); otherwise setting it as the current valid speech frame O_t^V;
(4) based on the valid speech frame O_t^V, judging each active node in each layer I of the dictionary-tree token resource L_t at time t, i.e. in L_t[I]; if a token is judged to be expandable, expanding the tokens in the token resource table of the node and linking the newly generated tokens into the token resource table of the target node; where I is an index variable, 1 ≤ I ≤ H, and H is the height of the dictionary tree; otherwise executing step (7);
(5) processing tokens at nodes of a dictionary tree;
(6) according to the maximum local path probability at time t and the maximum local path probability at the time t′ corresponding to the previous valid speech frame, adaptively adjusting the thresholds related to pruning;
(7) repeating steps (2) to (6) to obtain the global path of the best-scoring token at the end time T of the input speech, ending token expansion, and outputting the text string generated at that moment that best matches the acoustic model and the language model, producing the speech recognition result.
2. The fast decoding method in a speech recognition system of claim 1, wherein said dictionary-tree token resource L_t at time t is the sum of the token resources of all active nodes in the dictionary tree at that time.
3. The method of claim 1, wherein the local path maximum probability at time t is a maximum of all local path scores in the local path sets corresponding to all the newly generated tokens at time t.
4. The fast decoding method in a speech recognition system of claim 1, wherein the maximum local path probability at the time t′ corresponding to the previous valid speech frame is the maximum of all local path scores in the local path sets corresponding to all the tokens newly generated at time t′.
5. The fast decoding method in a speech recognition system as claimed in claim 1, wherein said initializing step (1) further comprises the steps of:
a. generating a token with a value of zero, and linking the token into a token resource table head of a root node in a dictionary tree, wherein an active node of the current dictionary tree only comprises a root node, and the active node is positioned at the first layer of the dictionary tree;
b. initializing the global pruning threshold L_g to the logarithmic minimum;
c. initializing the local pruning baseline threshold L_b to the logarithmic minimum;
d. initializing the pruning width threshold L_w to a positive constant L_w^c, L_w^c being preset by the user.
6. The fast decoding method in a speech recognition system as claimed in claim 1, wherein said filtering step (3) further comprises the steps of:
3a. if the speech frame O_t at time t is the initial speech frame of the user's speech input, setting it as a valid speech frame and ending the filtering operation; otherwise executing step 3b;
3b. comparing the Y feature codewords (f_1^t, f_2^t, …, f_Y^t) of the speech frame O_t at time t with the Y feature codewords (f_1^{t-1}, f_2^{t-1}, …, f_Y^{t-1}) of the speech frame O_{t-1} at time t−1, and obtaining a similarity measure V according to their degree of similarity;
3c. comparing the similarity measure V with a decision threshold θ; if V ≤ θ, judging the speech frame O_t at time t to be invalid for the decoding operation; otherwise judging the speech frame O_t at time t to be valid for the decoding operation.
7. The method of claim 1, wherein the decision threshold θ is a constant greater than 0 set by a user.
8. The fast decoding method in a speech recognition system according to claim 1, wherein said node token resource expanding step (4) further comprises the steps of:
4a. based on the valid speech frame O_t^V, performing an outward expansion of each token in the token resource linked list corresponding to the last state of the HMM associated with the current node, i.e. expanding each such token into the token resource tables of all child nodes of the node in the dictionary tree;
4b. taking one HMM state of the M-state HMM associated with the current node as the current HMM state to be processed, s_n, where 1 ≤ n ≤ M;
4c. taking one token in the token resource table corresponding to state s_n as the current token to be processed;
4d. if the score of the current token to be processed in state s_n is greater than the global pruning threshold L_g of the time t′ corresponding to the previous valid speech frame, taking, according to the topology of the HMM model associated with the current node, a state reachable from state s_n and setting it as the current state to be processed, s_m; otherwise going to step 4k;
4e. calculating the score s_m(t) of the token from state s_n to state s_m; the score s_m(t) is the current score of the token plus the transition probability from state s_n to state s_m, plus the observation probability of state s_m for the current speech frame O_t;
4f. calculating the current local pruning threshold L_p by the formula L_p = L_b − L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;
4g. if the score of the token from state s_n to state s_m is greater than the current local pruning threshold L_p, generating a new token with score s_m(t); otherwise executing step 4j;
4h. linking the new token generated in step 4g into the token resource table of the node whose header is H_sm, checking whether the node is in the active node table of its layer of the dictionary tree, and if not, linking the node into the active node table;
4i. according to the score s_m(t) of the new token, if s_m(t) − L_w > L_b, updating the local pruning baseline threshold L_b, namely L_b = s_m(t);
4j. taking another state reachable from state s_n and setting it as the current state to be processed, s_m; repeating steps 4e–4i until all states reachable from state s_n are processed; going to step 4k;
4k. taking another token in the token resource table corresponding to state s_n as the current token to be processed; repeating steps 4d–4j until the expansion operation of all tokens in the token resource table corresponding to state s_n is completed; going to step 4l;
4l. taking another HMM state of the M-state HMM associated with the node as the current HMM state to be processed, s_n, 1 ≤ n ≤ M, and repeating steps 4c–4k until all token resource expansion operations of the current node are completed.
9. The fast decoding method in speech recognition system according to claim 8, wherein the node token resource expanding step (4) a comprises the steps of:
4a-i) if the current score of the token is less than or equal to the global pruning threshold L_g of the time t′ corresponding to the previous valid speech frame, the operation of expanding the current token to all child nodes of its node is not needed; otherwise executing step 4a-ii;
4a-ii) taking the jth of the J child nodes of the node where the current token is located as the current node to be processed, node_j;
4a-iii) accumulating the score s_1(t) with which the token reaches the first state s_1 of node_j; the score s_1(t) is the current score of the token plus the transition probability from the last state of the token's node to node_j, plus the observation probability of the first state s_1 of node_j for the current speech frame O_t;
4a-iv) calculating the current local pruning threshold L_p by the formula L_p = L_b − L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;
4a-v) if the score with which the token reaches the first state s_1 of node_j is greater than the current local pruning threshold L_p, executing step 4a-vi; otherwise executing step 4a-ix;
4a-vi) generating a new token with score s_1(t);
4a-vii) linking the token into the token resource table of node_j whose header is H_s1, and checking whether node_j is in the active node table of its layer of the dictionary tree; if not, linking the node into the active node table;
4a-viii) according to the score s_1(t), if s_1(t) − L_w > L_b, updating the local pruning baseline threshold L_b, namely L_b = s_1(t);
4a-ix) taking another child node in the dictionary tree of the node where the current token is located as the current node to be processed, node_j, and repeating steps 4a-iii to 4a-viii until the expansion of the current token to all child nodes in the dictionary tree of its node is completed.
10. A fast decoding method in a speech recognition system as claimed in claim 1, characterized in that said step (6) of adaptive pruning based on local path maximum probability comprises the steps of:
6a. according to the maximum local path probabilities at the current time t and at the time t′ corresponding to the previous valid speech frame, calculating the pruning width threshold adjustment factor L_f as: L_f = (L_b − L_g) / (t̃ · L_b), where t̃ is the number of all valid speech frames up to time t, the current global pruning threshold L_g is the maximum local path probability at the time t′ corresponding to the previous valid speech frame, and the current local pruning baseline threshold L_b is the maximum local path probability at time t;
6b. normalizing the calculated pruning width threshold adjustment factor L_f: if L_f > L_f^MAX, setting L_f to L_f^MAX; if L_f < L_f^MIN, setting L_f to L_f^MIN; where L_f^MAX is the upper bound of the adjustment factor L_f and L_f^MIN is its lower bound, both positive constants that can be read from a decoder configuration file set by the user;
6c. according to the calculated pruning width threshold adjustment factor L_f, updating the pruning width threshold L_w as: L_w = L_f · L_w^c, L_w^c being the initialized pruning width threshold obtained in initialization step (1);
6d. updating the global pruning threshold at time t as L_g = L_b, preparing for token expansion for the next valid speech frame;
6e. resetting the local pruning baseline threshold L_b to the logarithmic minimum, preparing for token expansion for the next valid speech frame.
CNB021486824A 2002-11-15 2002-11-15 Rapid decoding method for voice identifying system Expired - Lifetime CN1201284C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB021486824A CN1201284C (en) 2002-11-15 2002-11-15 Rapid decoding method for voice identifying system


Publications (2)

Publication Number Publication Date
CN1455387A CN1455387A (en) 2003-11-12
CN1201284C true CN1201284C (en) 2005-05-11

Family

ID=29257528


Country Status (1)

Country Link
CN (1) CN1201284C (en)




Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C14 Grant of patent or utility model
GR01 Patent grant
CX01 Expiry of patent term

Granted publication date: 20050511