CN101118745A - Confidence degree quick acquiring method in speech identification system - Google Patents

Confidence degree quick acquiring method in speech identification system

Info

Publication number
CN101118745A
Authority
CN
China
Prior art keywords
voice
state
frame
speech
acoustic
Prior art date
Legal status
Granted
Application number
CNA2006100891355A
Other languages
Chinese (zh)
Other versions
CN101118745B (en)
Inventor
董滨 (Dong Bin)
赵庆卫 (Zhao Qingwei)
颜永红 (Yan Yonghong)
Current Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Institute of Acoustics CAS, Beijing Kexin Technology Co Ltd filed Critical Institute of Acoustics CAS
Priority to CN2006100891355A priority Critical patent/CN101118745B/en
Publication of CN101118745A publication Critical patent/CN101118745A/en
Application granted granted Critical
Publication of CN101118745B publication Critical patent/CN101118745B/en
Legal status: Expired - Fee Related

Abstract

The present invention relates to an improved algorithm for computing the confidence of a speech recognition system, including: preprocessing with framing; extraction of the speech features of each frame; calculation of the likelihood probability p(x_t|s_j) of each frame of speech for every state in the state diagram, from the state diagram, the acoustic model, and the frame's feature vector; storage of p(x_t|s_j) indexed by frame number and state number; pruning of states according to p(x_t|s_j); calculation, after pruning, of the likelihood of the acoustic space and of the generalized posterior probability; and calculation of the generalized posterior probability of each phoneme, which is taken as its confidence score. In the prior art, a search for phonemes is needed to obtain the phoneme candidates, and then a second search is carried out to compute the confidence using a different acoustic model. The present invention is a synchronous calculation method that computes the confidence with the same acoustic model while the recognizer performs its frame-synchronous beam search; the search is therefore done only once, saving system running time and computational complexity.

Description

Method for quickly obtaining the confidence measure in a speech recognition system
Technical Field
The invention belongs to the technical field of speech recognition, and particularly relates to a method for quickly obtaining the confidence measure in a speech recognition system.
Background
When a speech recognition system is used under natural conditions rather than in an ideal environment, its performance degrades greatly. Moreover, real spoken language mixes many non-speech sounds into the speech, such as abnormal pauses, coughs and many environmental noises, which makes it difficult for a conventional speech recognition system to achieve its nominal recognition performance. In addition, if the words spoken by the user fall outside the preset domain of the speech recognition system, recognition errors are easily caused. In short, the user of a commercial speech recognition system wants it to reject wrong speech as far as possible, and confidence score evaluation is a good way to address these difficulties.
The confidence evaluation method performs a hypothesis test on the recognition result of the speech recognition system, evaluates the reliability of the result against a threshold set by experiment, and locates errors in the result, thereby improving the recognition rate and the robustness of the recognition system.
At present, the two-pass calculation method is the most widely applied way of calculating confidence. The input speech is first decoded in one pass by the recognizer, producing a word graph or word sequence corresponding to the input speech. The second pass then works on this word graph or word sequence and calculates a confidence score, as shown in FIG. 2. The acoustic models used in the two passes are different; the second pass, which computes the confidence, generally uses an all-phoneme model. Because two decoding passes are needed, the computational complexity of the confidence calculation is high and more system time is required, which hinders online use of the speech recognition system.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by jointly considering calculation speed and robustness, and provides a method for quickly obtaining the confidence with only a single search pass.
In order to achieve the above object, the method for quickly obtaining confidence level in a speech recognition system provided by the present invention comprises the following steps:
1) Input the speech to be recognized into the speech recognition system.
2) Preprocess the input speech; the preprocessing includes framing.
3) Extract the speech features, obtaining the MFCC feature vector of each frame of speech.
4) Traverse all speech frames; for each frame, calculate from the state diagram, the acoustic model and the frame's MFCC feature vector the likelihood probability p(x_t|s_j) of each state in the state diagram for that frame, whose negative logarithm is

$$-\log p(x_t \mid s_j) = \frac{1}{2}(x_t-\mu_j)^{T}\Sigma_j^{-1}(x_t-\mu_j) + \frac{n}{2}\log(2\pi) + \frac{1}{2}\log\left|\Sigma_j\right|$$

wherein x_t is the input speech feature, s_j is the corresponding state of its Markov model, modeled as the normal distribution N(μ_j, Σ_j), and n is the dimension of the feature vector;
5) Store the likelihood probabilities p(x_t|s_j) obtained in step 4), indexed by the frame number and state number of the current speech.
6) Judge whether the current pointer points to a virtual node in the state diagram; if so, go to step 7); if not, prune the current state. The virtual node is a marker for the end of a phoneme in the state diagram;
7) Calculate the likelihood probability sum of the acoustic space after pruning,

$$p(x_t) = \sum_{s_j \in D^*} p(x_t \mid s_j)$$

wherein D* is the set of all states retained in the state diagram after pruning;
8) Calculate the generalized posterior probability

$$p(s_j \mid x_t) = \frac{p(x_t \mid s_j)}{\sum_{s_i \in D^*} p(x_t \mid s_i)}$$
9) Calculate the generalized posterior probability of each phoneme,

$$p(ph) = \frac{1}{N}\sum_{j=1}^{N}\frac{1}{\tau_e[j]-\tau_b[j]+1}\sum_{t=\tau_b[j]}^{\tau_e[j]} p(s_j \mid x_t)$$

where N is the number of states that make up each HMM, τ_b[j] and τ_e[j] respectively indicate the starting and ending frame numbers of the speech input data in the current state, and j is the state number; and take the generalized posterior probability of the phoneme as the confidence score of the phoneme.
In the above technical solution, the preprocessing of the input speech in step 2) includes digitizing, pre-emphasis (high-frequency boosting), framing, and windowing of the input speech.
In the above technical solution, the extraction of speech features in step 3) includes: calculating the MFCC cepstral coefficients, cepstral weighting, and calculating the differential cepstral coefficients.
In the above technical solution, the pruning in step 6) adopts a pruning method based on frame-synchronous beam search.
The advantage of the invention is that only one decoding pass is needed. In the prior art, a first search produces the phoneme candidates and a second search is then carried out to calculate the confidence, with different acoustic models used in the two searches; the invention computes the confidence synchronously during the single search, using the same acoustic model.
Drawings
FIG. 1 is a flow diagram of one embodiment of a fast confidence score method of the present invention;
FIG. 2 is a schematic diagram of a confidence two-pass search calculation method of the prior art;
FIG. 3 is a schematic diagram of the word network of the present invention;
FIG. 4 is a schematic diagram of the state diagram of the present invention;
FIG. 5 is a schematic diagram of confidence synchronization calculation pruning based on a state diagram according to the present invention;
FIG. 6 is a ROC plot of the performance of the one-pass search method of the present invention versus the two-pass search method of the prior art.
Detailed Description
The invention is further described with reference to the following figures and specific embodiments.
Examples
As shown in fig. 1, the method for fast obtaining confidence in a speech recognition system provided by the present invention includes the following steps:
a) Input the speech to be recognized into the speech recognition system.
b) Speech preprocessing, mainly framing. In this embodiment, the preprocessing follows this flow:
1. Digitize the speech signal at a 16 kHz sampling rate.
2. Boost the high frequencies by pre-emphasis; the pre-emphasis filter is H(z) = 1 - αz^{-1}, where α = 0.98.
3. Frame the data, with a frame length of 20 ms and an overlap of 10 ms between frames.
4. Windowing. The window function is the common Hamming window, namely

$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right),\qquad 0 \le n \le N-1$$
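As an illustration of this preprocessing flow, here is a minimal NumPy sketch of steps 1-4 (pre-emphasis with α = 0.98, 20 ms frames with a 10 ms shift, Hamming windowing). It assumes the input is a 1-D array of samples at 16 kHz; the function name and interface are illustrative, not taken from the patent:

```python
import numpy as np

def preprocess(signal, fs=16000, alpha=0.98, frame_ms=20, shift_ms=10):
    """Pre-emphasis H(z) = 1 - alpha*z^-1, then 20 ms frames with a
    10 ms shift, each multiplied by a Hamming window."""
    # 2. high-frequency boosting by pre-emphasis
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # 3. framing: 320-sample frames, 160-sample shift at 16 kHz
    frame_len = fs * frame_ms // 1000
    shift = fs * shift_ms // 1000
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // shift)
    # 4. windowing: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    window = np.hamming(frame_len)
    return np.stack([emphasized[i * shift : i * shift + frame_len] * window
                     for i in range(n_frames)])
```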
c) Speech feature extraction. The invention adopts the MFCC (Mel-frequency cepstral coefficient) feature extraction method; the specific flow is as follows:
5. Calculate the MFCC cepstral coefficients c(m), 1 ≤ m ≤ N_c, where N_c is the number of cepstral coefficients, N_c = 14.
6. Cepstral weighting, i.e. adjusting the weight of each dimension of the cepstral coefficients with the lifter

$$w(m) = 1 + \frac{N_c}{2}\sin\left(\frac{\pi m}{N_c}\right)$$

to obtain the weighted cepstral coefficients

$$\hat c(m) = w(m)\,c(m),\qquad 1 \le m \le N_c.$$
7. Calculate the first- and second-order differences of the energy feature and the cepstral features. The differential cepstral coefficients are calculated with the regression formula

$$\Delta c(t) = \mu \sum_{\tau=-T}^{T} \tau\, c(t+\tau)$$

where μ is the normalization factor, τ is an integer, and 2T+1 is the number of speech frames used to calculate the differential cepstral coefficients.
8. For each frame, generate a 39-dimensional MFCC feature vector.
The invention can also adopt the LPC feature extraction method, which is prior art and is not described here again.
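For the difference step (step 7 above), a common regression-style implementation is sketched below in NumPy. The window half-width T = 2 (i.e. 2T+1 = 5 frames) and the edge padding are assumptions; the normalization factor μ is folded in as 1/(2·Στ²):

```python
import numpy as np

def delta(cepstra, T=2):
    """Differential coefficients by the regression formula
    delta_c(t) = mu * sum_{tau=1..T} tau * (c(t+tau) - c(t-tau)),
    with mu = 1/(2*sum tau^2); cepstra has shape (frames, coeffs)."""
    denom = 2.0 * sum(tau * tau for tau in range(1, T + 1))
    padded = np.pad(cepstra, ((T, T), (0, 0)), mode="edge")  # replicate edges
    return np.stack([
        sum(tau * (padded[t + T + tau] - padded[t + T - tau])
            for tau in range(1, T + 1)) / denom
        for t in range(len(cepstra))
    ])
```

Applying delta once to the 13 static coefficients (including energy) and once more to the result, then stacking statics, deltas and delta-deltas, yields the 39-dimensional feature vector of step 8.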
d) For each frame of speech, calculate the likelihood probability p(x_t|s_j) of each state constituting the phoneme Markov models, based on the state diagram, the acoustic model, and the MFCC feature vector of the frame itself. The likelihood probability p(x_t|s_j) is the acoustic-layer score of the Markov model state s_j for the input speech feature x_t.
The method for constructing the state diagram utilized in this step is as follows:
As shown in FIG. 3, a word-based search space, i.e. a word network, is first built according to the content of the task grammar; the recognizer searches this word network for the best path corresponding to the input speech as the recognition result. Before searching, the word network is expanded, by means of the dictionary information in the recognition system, into a phoneme network whose minimum unit is a phoneme: each node is transformed from a word into phonemes, and each phoneme is then replaced by the corresponding Markov model (HMM) in the acoustic model. Since each Markov model (HMM) is composed of several states, the final search space becomes a state diagram, as shown in FIG. 4.
In FIG. 4, each node represents one state of some HMM, and any path through the state diagram represents a sentence or word candidate in the task grammar. To reduce the search space and the storage it requires, the state diagram is merged to obtain the final state diagram. In this process each node undergoes forward merging and backward merging: in forward merging, nodes with the same forward path are found and combined; in backward merging, nodes with the same backward path are combined.
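As a toy illustration of the backward-merging idea, the sketch below collapses nodes that carry the same state label and the same successor set, repeating the pass until a fixed point is reached; forward merging is the mirror image using predecessor sets. The graph representation used here ({node id: (label, successor set)}) is an assumption for illustration, not the patent's data structure:

```python
def merge_backward(nodes):
    """Collapse nodes with identical (label, successor set) until stable.
    nodes: {node_id: (label, set of successor node_ids)}."""
    while True:
        seen, remap = {}, {}
        for nid, (label, succs) in nodes.items():
            key = (label, frozenset(succs))
            remap[nid] = seen.setdefault(key, nid)  # first such node wins
        if all(nid == keep for nid, keep in remap.items()):
            return nodes  # fixed point: nothing left to merge
        merged = {}
        for nid, (label, succs) in nodes.items():
            keep = remap[nid]
            _, s = merged.setdefault(keep, (label, set()))
            s.update(remap[x] for x in succs)  # redirect edges to survivors
        nodes = merged
```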
The method of calculating the likelihood probability for each state is as follows:
All speech frames are traversed: when a frame of data enters the recognizer, the likelihood probability p(x_t|s_j) of each state in the state diagram is first calculated for the current frame; the accumulation of the likelihood probability and the state transition probability, compared against the pruning threshold, serves as the basis for pruning. The likelihood probability p(x_t|s_j) is the acoustic-layer score of the Markov model of state s_j for the input speech feature x_t; its negative logarithm is

$$-\log p(x_t \mid s_j) = \frac{1}{2}(x_t-\mu_j)^{T}\Sigma_j^{-1}(x_t-\mu_j) + \frac{n}{2}\log(2\pi) + \frac{1}{2}\log\left|\Sigma_j\right|$$

wherein state s_j is modeled as the normal distribution N(μ_j, Σ_j), whose parameters are obtained from the acoustic model; x_t is the feature vector of the speech frame; μ_j and Σ_j are respectively the mean vector and covariance matrix of the model of state s_j; and n is the dimension of x_t (i.e. of μ_j and Σ_j).
The acoustic model employed in this embodiment contains 5005 states with 16-Gaussian mixture models.
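A direct NumPy transcription of the negative log-likelihood formula above is sketched below, assuming diagonal covariance matrices (a common choice that the patent does not specify), together with the log-domain combination one would use for a 16-Gaussian mixture state:

```python
import numpy as np

def neg_log_likelihood(x, mu, var):
    """-log N(x; mu, Sigma) with diagonal covariance `var`:
    0.5*(x-mu)^T Sigma^-1 (x-mu) + (n/2)*log(2*pi) + 0.5*log|Sigma|."""
    n = x.shape[0]
    diff = x - mu
    return (0.5 * np.sum(diff * diff / var)
            + 0.5 * n * np.log(2.0 * np.pi)
            + 0.5 * np.sum(np.log(var)))

def state_neg_log_likelihood(x, weights, mus, vars_):
    """-log p(x_t|s_j) for a mixture state: -log sum_k w_k N(x; mu_k, var_k),
    combined in the log domain (log-sum-exp) for numerical stability."""
    log_terms = np.array([np.log(w) - neg_log_likelihood(x, m, v)
                          for w, m, v in zip(weights, mus, vars_)])
    top = log_terms.max()
    return -(top + np.log(np.sum(np.exp(log_terms - top))))
```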
e) Store the likelihood probabilities p(x_t|s_j) obtained in step d), indexed by the frame number and state number of the current speech.
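Step e) amounts to caching each score under a (frame number, state number) key so that steps g)-i) can reuse it without re-evaluating the Gaussians; a trivial sketch, with the dictionary layout as an assumption:

```python
# {(frame index t, state index j): -log p(x_t|s_j)}
likelihood_table = {}

def store_likelihood(t, j, neg_log_lik):
    likelihood_table[(t, j)] = neg_log_lik
```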
f) Judge whether the pointer points to a virtual node; if so, go to step g); if not, prune the current state.
In the state diagram used by the recognition system, each phoneme has a virtual node as its end marker; a phoneme is recognized as soon as the search pointer reaches a virtual node.
During decoding, a pruning strategy is applied to improve the decoding speed and reduce the search space. In FIG. 5, the solid dots represent states retained after pruning and the hollow dots represent pruned states. As shown, when a state contributes too little to the observation sequence (in this embodiment, the MFCC feature vectors), i.e. its likelihood probability p(x_t|s_j) for the observation sequence falls below a preset threshold, the state is pruned. This embodiment uses a pruning strategy based on frame-synchronous beam search during decoding; the search itself uses the conventional Viterbi algorithm. The pruning threshold is set to 200, and the pruning criterion is as follows: take the log probability of each state for the current frame of speech, subtract the pruning threshold from the maximum log probability at the current position, and prune any state whose log probability is smaller than the resulting value.
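The pruning criterion just described — keep a state only if its log score for the current frame is within the beam width of the best current score — can be sketched as follows; the dictionary interface and the use of log probabilities (rather than the stored negative logs) are illustrative assumptions:

```python
def beam_prune(frame_log_scores, beam=200.0):
    """Frame-synchronous beam pruning: retain state j only if its
    accumulated log score is within `beam` of the current-frame maximum."""
    best = max(frame_log_scores.values())
    floor = best - beam
    return {j: s for j, s in frame_log_scores.items() if s >= floor}
```

The surviving states form the set D* used in step g).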
g) Calculate the likelihood probability sum of the acoustic space after pruning,

$$p(x_t) = \sum_{s_j \in D^*} p(x_t \mid s_j)$$

wherein D* is the set of all states retained in the state diagram after pruning.
The accumulated likelihood of the states retained after pruning is much larger than that of the pruned states, so it can safely be used as the denominator of the generalized posterior probability

$$p(s_j \mid x_t) = \frac{p(x_t \mid s_j)}{\sum_{s_i \in D^*} p(x_t \mid s_i)}$$
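In the log domain, the sum over the retained states D* and the division both reduce to a log-sum-exp; a minimal sketch, under the assumption that the cached scores have been converted to log likelihoods:

```python
import numpy as np

def state_posteriors(log_liks):
    """p(s_j|x_t) = p(x_t|s_j) / sum_{s_i in D*} p(x_t|s_i), computed from
    the log likelihoods of the retained states via a stable log-sum-exp."""
    states = list(log_liks)
    vals = np.array([log_liks[j] for j in states])
    log_denom = vals.max() + np.log(np.sum(np.exp(vals - vals.max())))
    return {j: float(np.exp(v - log_denom)) for j, v in zip(states, vals)}
```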
h) Calculate the generalized posterior probability of each phoneme.
In the speech recognition system, each phoneme is represented by a Markov model (HMM). The generalized posterior probability of a phoneme is defined as the arithmetic mean of the posterior probabilities of the states corresponding to the phoneme:

$$p(ph) = \frac{1}{N}\sum_{j=1}^{N}\frac{1}{\tau_e[j]-\tau_b[j]+1}\sum_{t=\tau_b[j]}^{\tau_e[j]} p(s_j \mid x_t)$$

where N is the number of states that make up each HMM, τ_b[j] and τ_e[j] respectively indicate the starting and ending frame numbers of the speech input data in the current state, j is the state number, and p(s_j|x_t) is the generalized posterior probability obtained in step g).
i) The generalized posterior probability of a phoneme is used as the confidence score of that phoneme.
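Putting steps h) and i) together, the phoneme-level score can be sketched as follows; the segment bookkeeping ({state: (τ_b, τ_e)}) is an assumed interface:

```python
def phoneme_confidence(posteriors, segments):
    """Arithmetic mean over the phoneme's N states of the time-averaged
    state posteriors:
      (1/N) * sum_j [ 1/(tau_e[j]-tau_b[j]+1) * sum_t p(s_j|x_t) ].
    posteriors: {(t, j): p(s_j|x_t)}; segments: {j: (tau_b, tau_e)}."""
    total = 0.0
    for j, (tb, te) in segments.items():
        span = [posteriors[(t, j)] for t in range(tb, te + 1)]
        total += sum(span) / len(span)
    return total / len(segments)
```

The resulting score is then compared against a threshold tuned on development data to accept or reject the phoneme, as in the experiments below.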
The state-diagram-based confidence likelihood synchronous estimation algorithm of the present invention was tested on a Chinese telephone name database used for testing real telephone speech recognition systems. The test task evaluated the recognition rate of a recognition system with a dictionary of 1278 person names. The test speech was normal speech from 6 speakers, 3 men and 3 women. The test set includes 180 out-of-set words, and each task grammar includes 213 person names. The confidence score is used to reject the out-of-set words in the test set; the goal is to increase rejection, i.e., to reduce the false acceptance rate of out-of-set words.
Two different algorithms were used to calculate the confidence. One is the two-pass (2-Pass) search algorithm shown in FIG. 2; the other, the state-diagram-based confidence synchronous calculation method of the invention, is the one-pass (1-Pass) algorithm, i.e. the synchronous estimation algorithm. The two-pass search algorithm uses two different acoustic models: the first pass uses an acoustic model containing 5005 states with 16-Gaussian mixtures, while the acoustic model used to calculate the confidence is a smaller model covering only all the phonemes, containing 1005 states with 8-Gaussian mixtures. The one-pass search algorithm uses a single acoustic model, containing 5005 states with 16-Gaussian mixtures.
The performance curves (ROC, receiver operating characteristic) of the two algorithms are shown in FIG. 6. As the figure shows, the one-pass search algorithm used in the present invention outperforms the two-pass search algorithm: its equal error rate is 16.1%, against 21% for the two-pass search algorithm. Because the one-pass search algorithm uses only one acoustic model, and that model remains fine-grained when calculating the confidence, the performance does not degrade even though the post-pruning acoustic space likelihood is an approximation.
In addition, the computational complexity of the two methods differs: the one-pass search algorithm is 16% faster than the two-pass search algorithm.

Claims (4)

1. A method for quickly obtaining confidence in a speech recognition system, characterized by comprising the following steps:
1) Inputting the voice to be recognized into a voice recognition system;
2) Preprocessing input voice, wherein the preprocessing comprises framing processing;
3) Extracting MFCC feature vectors of each frame of voice;
4) Traversing all speech frames and, for each frame of speech, calculating the likelihood probability p(x_t|s_j) of each state in the state diagram for that frame, according to the state diagram and the acoustic model in the speech recognition system and the MFCC feature vector of the frame, the negative logarithm of the likelihood probability being

$$-\log p(x_t \mid s_j) = \frac{1}{2}(x_t-\mu_j)^{T}\Sigma_j^{-1}(x_t-\mu_j) + \frac{n}{2}\log(2\pi) + \frac{1}{2}\log\left|\Sigma_j\right|$$

wherein x_t is the feature vector of the speech frame, μ_j and Σ_j are respectively the mean vector and covariance matrix of the model of state s_j, and n is the dimension of the feature vector;
5) Storing the likelihood probabilities p(x_t|s_j) obtained in step 4), indexed by the frame number and state number of the current speech;
6) Judging whether the current pointer points to a virtual node in the state diagram; if so, entering step 7); if not, pruning the current state; the virtual node being a marker for the end of a phoneme in the state diagram;
7) Calculating the likelihood probability sum of the acoustic space after pruning, $p(x_t) = \sum_{s_j \in D^*} p(x_t \mid s_j)$, wherein D* is the set of all states retained in the state diagram after pruning;
8) Calculating the generalized posterior probability

$$p(s_j \mid x_t) = \frac{p(x_t \mid s_j)}{\sum_{s_i \in D^*} p(x_t \mid s_i)};$$
9) Calculating the generalized posterior probability of each phoneme,

$$p(ph) = \frac{1}{N}\sum_{j=1}^{N}\frac{1}{\tau_e[j]-\tau_b[j]+1}\sum_{t=\tau_b[j]}^{\tau_e[j]} p(s_j \mid x_t)$$

and taking the generalized posterior probability of the phoneme as the confidence score of the phoneme; wherein N is the number of states that make up each Markov model, τ_b[j] and τ_e[j] respectively indicate the starting and ending frame numbers of the speech input data in the current state, and j is the state number.
2. The method of claim 1 wherein preprocessing the input speech in step 2) includes digitizing, pre-emphasizing, high-frequency boosting, framing and windowing the input speech.
3. The method of fast confidence level calculation in a speech recognition system according to claim 1, wherein said extracting speech features in step 3) comprises: calculating the MFCC cepstral coefficients, cepstral weighting, and calculating the differential cepstral coefficients.
4. The method for fast confidence level estimation in a speech recognition system according to claim 1, wherein the pruning in step 6) is performed by a pruning method based on frame-synchronous beam search.
CN2006100891355A 2006-08-04 2006-08-04 Confidence degree quick acquiring method in speech identification system Expired - Fee Related CN101118745B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2006100891355A CN101118745B (en) 2006-08-04 2006-08-04 Confidence degree quick acquiring method in speech identification system


Publications (2)

Publication Number Publication Date
CN101118745A (en) 2008-02-06
CN101118745B (en) 2011-01-19

Family

ID=39054824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2006100891355A Expired - Fee Related CN101118745B (en) 2006-08-04 2006-08-04 Confidence degree quick acquiring method in speech identification system

Country Status (1)

Country Link
CN (1) CN101118745B (en)


Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5566272A (en) * 1993-10-27 1996-10-15 Lucent Technologies Inc. Automatic speech recognition (ASR) processing using confidence measures
US5737489A (en) * 1995-09-15 1998-04-07 Lucent Technologies Inc. Discriminative utterance verification for connected digits recognition
US5794189A (en) * 1995-11-13 1998-08-11 Dragon Systems, Inc. Continuous speech recognition
CN1223985C (en) * 2002-10-17 2005-10-19 中国科学院声学研究所 Phonetic recognition confidence evaluating method, system and dictation device therewith
CN100514446C (en) * 2004-09-16 2009-07-15 北京中科信利技术有限公司 Pronunciation evaluating method based on voice identification and voice analysis
GB0426347D0 (en) * 2004-12-01 2005-01-05 Ibm Methods, apparatus and computer programs for automatic speech recognition

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102047322B (en) * 2008-06-06 2013-02-06 株式会社雷特龙 Audio recognition device, audio recognition method, and electronic device
CN101393739B (en) * 2008-10-31 2011-04-27 清华大学 Computation method for characteristic value of Chinese speech recognition credibility
CN101645271B (en) * 2008-12-23 2011-12-07 中国科学院声学研究所 Rapid confidence-calculation method in pronunciation quality evaluation system
CN101650886B (en) * 2008-12-26 2011-05-18 中国科学院声学研究所 Method for automatically detecting reading errors of language learners
CN102142253B (en) * 2010-01-29 2013-05-29 富士通株式会社 Voice emotion identification equipment and method
CN101894549A (en) * 2010-06-24 2010-11-24 中国科学院声学研究所 Method for fast calculating confidence level in speech recognition application field
CN101980336A (en) * 2010-10-18 2011-02-23 福州星网视易信息系统有限公司 Hidden Markov model-based vehicle sound identification method
CN103811008A (en) * 2012-11-08 2014-05-21 中国移动通信集团上海有限公司 Audio frequency content identification method and device
CN103810997A (en) * 2012-11-14 2014-05-21 北京百度网讯科技有限公司 Method and device for determining confidence of voice recognition result
CN103810997B (en) * 2012-11-14 2018-04-03 北京百度网讯科技有限公司 A kind of method and apparatus for determining voice identification result confidence level
CN103021408A (en) * 2012-12-04 2013-04-03 中国科学院自动化研究所 Method and device for speech recognition, optimizing and decoding assisted by stable pronunciation section
CN107004408B (en) * 2014-12-09 2020-07-17 微软技术许可有限责任公司 Method and system for determining user intent in spoken dialog based on converting at least a portion of a semantic knowledge graph to a probabilistic state graph
CN107004408A (en) * 2014-12-09 2017-08-01 微软技术许可有限责任公司 For determining the method and system of the user view in spoken dialog based at least a portion of semantic knowledge figure is converted into Probability State figure
CN106297769A (en) * 2015-05-27 2017-01-04 国家计算机网络与信息安全管理中心 A kind of distinctive feature extracting method being applied to languages identification
CN106297769B (en) * 2015-05-27 2019-07-09 国家计算机网络与信息安全管理中心 A kind of distinctive feature extracting method applied to languages identification
CN106611048A (en) * 2016-12-20 2017-05-03 李坤 Language learning system with online voice assessment and voice interaction functions
CN110447068A (en) * 2017-03-24 2019-11-12 三菱电机株式会社 Speech recognition equipment and audio recognition method
CN109872715A (en) * 2019-03-01 2019-06-11 深圳市伟文无线通讯技术有限公司 A kind of voice interactive method and device
CN112151020A (en) * 2019-06-28 2020-12-29 北京声智科技有限公司 Voice recognition method and device, electronic equipment and storage medium
CN110634469A (en) * 2019-09-27 2019-12-31 腾讯科技(深圳)有限公司 Speech signal processing method and device based on artificial intelligence and storage medium
CN110634469B (en) * 2019-09-27 2022-03-11 腾讯科技(深圳)有限公司 Speech signal processing method and device based on artificial intelligence and storage medium

Also Published As

Publication number Publication date
CN101118745B (en) 2011-01-19

Similar Documents

Publication Publication Date Title
CN101118745A (en) Confidence degree quick acquiring method in speech identification system
US6125345A (en) Method and apparatus for discriminative utterance verification using multiple confidence measures
EP0880126B1 (en) Speech-silence discrimination based on unsupervised HMM adaptation
KR100631786B1 (en) Method and apparatus for speech recognition by measuring frame's confidence
Lin et al. OOV detection by joint word/phone lattice alignment
Akbacak et al. Environmental sniffing: noise knowledge estimation for robust speech systems
CN111640423B (en) Word boundary estimation method and device and electronic equipment
Mengistu Automatic text independent amharic language speaker recognition in noisy environment using hybrid approaches of LPCC, MFCC and GFCC
KR100969138B1 (en) Method For Estimating Noise Mask Using Hidden Markov Model And Apparatus For Performing The Same
Anguera et al. Evolutive speaker segmentation using a repository system
Matsuda et al. ATR parallel decoding based speech recognition system robust to noise and speaking styles
Benıtez et al. Different confidence measures for word verification in speech recognition
Nakamura et al. Multi-modal temporal asynchronicity modeling by product HMMs for robust audio-visual speech recognition
KR100586045B1 (en) Recursive Speaker Adaptation Automation Speech Recognition System and Method using EigenVoice Speaker Adaptation
Nazreen et al. A joint enhancement-decoding formulation for noise robust phoneme recognition
KR20050036301A (en) Apparatus and method for distinction using pitch and mfcc
Remes et al. Missing feature reconstruction and acoustic model adaptation combined for large vocabulary continuous speech recognition
Casar et al. Analysis of HMM temporal evolution for automatic speech recognition and utterance verification.
Yamada et al. Improvement of rejection performance of keyword spotting using anti-keywords derived from large vocabulary considering acoustical similarity to keywords.
Kosaka et al. Speaker adaptation based on system combination using speaker-class models.
Zacharie et al. Keyword spotting on word lattices
JP3105708B2 (en) Voice recognition device
Scanzio et al. Word confidence using duration models.
Moon et al. Out-of-vocabulary word rejection algorithm in Korean variable vocabulary word recognition
ICU et al. Recognition Confidence

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110119