CN102290048A - Robust voice recognition method based on MFCC (Mel frequency cepstral coefficient) long-distance difference

Info

Publication number: CN102290048A
Application number: CN2011102588847A
Authority: CN
Inventors: 赵斯培; 邱小军
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2011-09-05
Filing date: 2011-09-05
Publication date: 2011-12-21
Anticipated expiration: 2031-09-05
Also published as: CN102290048B

Abstract

The invention discloses a robust speech recognition method based on the long-distance difference of the MFCC (Mel frequency cepstral coefficient). Its distinguishing feature is that the 4-sample-point and 6-sample-point long-distance differences of the MFCC are used as the speech recognition feature parameters. With essentially no increase in computation or storage, the recognition rate for noisy speech is 20-40 percentage points higher than with the MFCC parameters and their first-order difference coefficients, the feature parameters commonly used in the field.

Description

A robust speech recognition method based on the long-distance difference of MFCC

I. Technical Field

The present invention relates to the field of speech recognition technology. It proposes a robust speech recognition method that uses the long-distance difference of Mel frequency cepstral coefficients (MFCC) as the feature parameter.

II. Background Art

The main reason the performance of a speech recognition system degrades in a noisy environment is the mismatch between the clean training data and the noise-corrupted test data. Finding a feature parameter that reduces this mismatch is an important way to improve the recognition rate for noisy speech. The speech recognition feature parameters in common use today are the Mel frequency cepstral coefficient (MFCC) and the linear predictive cepstral coefficient (LPCC).

MFCC matches the auditory characteristics of the human ear and has comparatively good noise immunity. It is computed as follows: the speech signal is first preprocessed by endpoint detection, pre-emphasis, framing, and windowing; a fast Fourier transform (FFT) is then applied to each frame, and the squared magnitude gives the power spectrum; the power spectrum is filtered with a bank of 24 Mel filters; the logarithm of the filtered energies is taken; and finally a discrete cosine transform (DCT) yields the MFCC parameters. For the detailed computation see, for example, Han Jiqing, Zhang Lei, Zheng Tieran. Speech Signal Processing [M]. Beijing: Tsinghua University Press, 2004.

LPCC is based on a model of human speech production. The model is assumed to be all-pole, so that the speech sample at the current instant can be represented as a linear combination of the speech samples at several preceding instants. The linear prediction coefficients of this combination can be obtained with the minimum mean square error criterion and the autocorrelation method, and the linear predictive cepstral coefficients (LPCC) then follow from homomorphic processing. For the detailed computation see the same reference (Han Jiqing, Zhang Lei, Zheng Tieran. Speech Signal Processing [M]. Beijing: Tsinghua University Press, 2004.).

A large number of experiments (e.g. Steven B. Davis, Paul Mermelstein. Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences [J]. IEEE Trans. on ASSP, 1980, 28(4): 357-366; and Shang-Ming Lee, Shi-hau Fang, Jeih-weih Hung and Lin-Shan Lee. Improved MFCC feature extraction by PCA-optimized filter-bank for speech recognition [J]. IEEE Automatic Speech Recognition and Understanding, 2001, 49-52) show that MFCC is more robust to noise than LPCC. Even so, MFCC still does not give satisfactory results in robust speech recognition (Yeganeh H., Ahadi S.M., Ziaei A. A new MFCC improvement method for robust ASR [J]. IEEE ICSP, 2008, 643-646).

One reference (Steven B. Davis, Paul Mermelstein. Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences [J]. IEEE Trans. on ASSP, 1980, 28(4): 357-366) uses principal component analysis (PCA) to optimize the Mel filter bank and thereby improve robustness. Another (Yeganeh H., Ahadi S.M., Ziaei A. A new MFCC improvement method for robust ASR [J]. IEEE ICSP, 2008, 643-646) first performs spectral subtraction on the Mel subbands, then estimates the signal-to-noise ratio of each subband and weights the parameters according to this estimate, giving larger weights to the parameters less affected by noise, so as to improve the robustness of the speech recognition system in noisy environments. Korean patent KR100893154B1 uses weighted MFCC coefficients to identify the gender of a voice. US patent US2009177466 extracts the Mel frequency cepstral coefficients of speech from the energy of the spectral peaks instead of the whole power spectrum, improving the noise robustness of speech recognition without increasing the dimension of the speech feature.

The distinguishing feature of the present invention is that the long-distance difference of the MFCC is used as the speech recognition feature parameter, and the conventional combination of the MFCC parameters themselves and their first-order difference coefficients is abandoned. Experiments show that the speech recognition system has the best noise robustness when the 4-sample-point and 6-sample-point long-distance differences of the MFCC are selected as the feature parameters.

III. Summary of the Invention

1. Object of the invention: to propose a robust speech recognition method based on the long-distance difference of MFCC. The method uses the 4-sample-point and 6-sample-point long-distance differences of the MFCC as feature parameters and abandons the conventional MFCC parameters themselves and their first-order difference coefficients.

2. Technical solution: to achieve the above object, the proposed algorithm first computes the MFCC parameters, then obtains their 4-sample-point and 6-sample-point long-distance differences, and uses these as the speech recognition feature parameters for training and recognition.

The standard MFCC parameters are computed as follows: the speech signal is first preprocessed (endpoint detection, pre-emphasis, framing, windowing); the FFT of each speech frame is then computed and its squared magnitude taken to obtain the power spectrum; the power spectrum is filtered with the Mel filter bank; the logarithm of the filter outputs is taken; and the DCT is computed to obtain the standard MFCC parameters. For details see the article by Jing Xinxing and co-authors on the application of an improved MFCC feature algorithm in speech recognition [J]. Computer Engineering and Science, 2009, 31(12): 146-148.
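By way of non-limiting illustration, the standard MFCC computation described above might be sketched in Python as follows. Only the 24 Mel filters, the Hamming window, and the 256/128 frame length and shift come from this description; the Hz-to-Mel formula, the pre-emphasis coefficient 0.97, the choice of 12 cepstral coefficients, and all function names are assumptions made for the sketch. Endpoint detection is omitted.

```python
import numpy as np


def hz_to_mel(f_hz):
    # One common Hz-to-Mel formula (assumption; the source does not specify one).
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters spaced evenly on the Mel scale, shape (n_filters, n_fft//2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):       # rising edge of the triangle
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):      # falling edge, peak value 1 at center
            fbank[m - 1, k] = (right - k) / (right - center)
    return fbank


def dct_matrix(n_ceps, n_filters):
    """Unnormalized DCT-II basis; row 0 is the sum of the log energies."""
    n = np.arange(n_filters)
    k = np.arange(n_ceps)[:, None]
    return np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))


def mfcc(signal, fs=8000, frame_len=256, frame_shift=128,
         n_filters=24, n_ceps=12, preemph=0.97):
    """Return an array of shape (n_frames, n_ceps); needs len(signal) >= frame_len."""
    signal = np.asarray(signal, dtype=float)
    # Pre-emphasis: y[n] = x[n] - preemph * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    # Framing and Hamming windowing
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # Power spectrum: squared magnitude of the FFT
    power = np.abs(np.fft.rfft(frames, n=frame_len, axis=1)) ** 2
    # Mel filtering, logarithm (floored to avoid log(0)), then DCT
    energies = power @ mel_filterbank(n_filters, frame_len, fs).T
    log_energies = np.log(np.maximum(energies, 1e-10))
    return log_energies @ dct_matrix(n_ceps, n_filters).T
```

One second of 8 kHz speech gives 61 frames with these settings, so `mfcc(x)` would return a 61-by-12 array in this sketch.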

The 2-sample-point difference of the MFCC is computed as:

Δ₂MFCC(i) = MFCC(i+1) - MFCC(i-1)    (1)

Similarly, the 4-sample-point long-distance difference of the MFCC is computed as:

Δ₄MFCC(i) = MFCC(i+2) - MFCC(i-2)    (2)

The 6-sample-point long-distance difference of the MFCC is computed as:

Δ₆MFCC(i) = MFCC(i+3) - MFCC(i-3)    (3)

Here MFCC(i) is the MFCC parameter of the i-th frame of the speech signal, Δ₂MFCC is the 2-sample-point difference of the MFCC, Δ₄MFCC is the 4-sample-point long-distance difference of the MFCC, and Δ₆MFCC is the 6-sample-point long-distance difference of the MFCC. Each difference is thus taken along the frame index i, between frames that lie 2, 4, and 6 frame shifts apart respectively.
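As a minimal sketch of formulas (2) and (3), the long-distance differences could be computed over a matrix of per-frame MFCC vectors as shown below. The function names are hypothetical, and the handling of the first and last frames (repeating the edge frame) is an assumption, since the description does not state how the sequence boundaries are treated.

```python
import numpy as np


def long_distance_difference(mfcc, span):
    """Delta_span MFCC(i) = MFCC(i + span/2) - MFCC(i - span/2) for every frame i.

    mfcc: array of shape (n_frames, n_ceps); span: 2, 4, or 6 (any positive even int).
    Frames beyond either end are replaced by the nearest edge frame (assumption).
    """
    if span <= 0 or span % 2 != 0:
        raise ValueError("span must be a positive even integer")
    half = span // 2
    mfcc = np.asarray(mfcc, dtype=float)
    padded = np.pad(mfcc, ((half, half), (0, 0)), mode="edge")
    # padded[i + 2*half] is MFCC(i + half); padded[i] is MFCC(i - half)
    return padded[2 * half:] - padded[:-2 * half]


def long_distance_features(mfcc):
    """Concatenate the 4-point and 6-point differences into one feature vector per frame."""
    return np.hstack([long_distance_difference(mfcc, 4),
                      long_distance_difference(mfcc, 6)])
```

With 12 cepstral coefficients per frame, for example, `long_distance_features` yields 24 values per frame, the same dimension as MFCC plus first-order differences, which is consistent with the statement that computation and storage do not essentially increase.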

The speech recognition system itself may use, for example but without limitation, a hidden Markov model (HMM) as the system model. With the selected feature parameters (the 4-sample-point and 6-sample-point long-distance differences of the MFCC disclosed herein), training may use, without limitation, the Baum-Welch algorithm, and recognition may use, without limitation, the Viterbi decoding algorithm. For the algorithm flow of such a speech recognition system see He Qiang, He Ying. MATLAB Extended Programming [M]. Beijing: Tsinghua University Press, 2002.
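For illustration only, the recognition step with Viterbi decoding might look like the following log-domain sketch. It assumes that the per-frame emission log-likelihoods of each word model have already been computed from the feature parameters; the function names and the dictionary layout of the word models are hypothetical and are not taken from the cited reference.

```python
import numpy as np


def viterbi_log_score(log_pi, log_A, log_B):
    """Best-path log-probability and state sequence of one HMM.

    log_pi: (N,) initial log-probabilities
    log_A:  (N, N) transition log-probabilities, log_A[i, j] for i -> j
    log_B:  (T, N) emission log-likelihoods of each frame under each state
    """
    T, N = log_B.shape
    delta = log_pi + log_B[0]
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A          # scores[i, j]: arrive in j from i
        psi[t] = np.argmax(scores, axis=0)       # best predecessor of each state j
        delta = scores[psi[t], np.arange(N)] + log_B[t]
    # Backtrack from the best final state
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):
        path[t - 1] = psi[t, path[t]]
    return float(np.max(delta)), path


def recognize(models, log_emissions):
    """Return the word whose HMM gives the highest Viterbi score.

    models: dict word -> (log_pi, log_A); log_emissions: dict word -> (T, N) array.
    """
    return max(models,
               key=lambda w: viterbi_log_score(*models[w], log_emissions[w])[0])
```

For isolated-word recognition, one model per vocabulary word is scored against the same utterance and the highest-scoring word is output.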

3. Beneficial effects: the notable advantage of the present invention is that, by using the 4-sample-point and 6-sample-point long-distance differences of the MFCC as the speech recognition feature parameters, the recognition rate for noisy speech is raised by 20-40 percentage points over the combination of the MFCC parameters themselves and their first-order difference coefficients commonly used in the field, with essentially no increase in computation or storage.

IV. Description of the Drawings

Fig. 1 is a block diagram of the computation of the 4-sample-point long-distance difference of the MFCC.

Fig. 2 is a block diagram of the computation of the 6-sample-point long-distance difference of the MFCC.

V. Detailed Description of the Embodiment

The algorithm proposed by the invention is characterized in that the long-distance difference of the MFCC is used as the speech recognition feature parameter, and the conventional combination of the MFCC parameters themselves and their first-order difference coefficients is abandoned. Its implementation is described in detail below, taking an isolated-word robust speech recognition system as an example.

The isolated-word robust speech recognition system uses a hidden Markov model (HMM) as the system model; training uses the Baum-Welch algorithm and recognition uses the Viterbi decoding algorithm. The speech data are sampled at 8 kHz with 16-bit quantization; the frame length is 256 samples (32 ms) and the frame shift is 128 samples (16 ms), and a Hamming window is used. In the preprocessing stage, endpoint detection uses the classical double-threshold method based on short-time energy and zero-crossing rate. For the HMM algorithm flow see He Qiang, He Ying. MATLAB Extended Programming [M]. Beijing: Tsinghua University Press, 2002. The detailed procedure is as follows:

1. Compute the 4-sample-point and 6-sample-point long-distance differences of the MFCC as feature parameters

The speech signal is first preprocessed (endpoint detection, pre-emphasis, framing, windowing); the FFT of each speech frame is then computed and its squared magnitude taken to obtain the power spectrum; the power spectrum is filtered with the Mel filter bank; the logarithm of the filter outputs is taken; and the DCT is computed to obtain the standard MFCC parameters. Finally, the 4-sample-point and 6-sample-point long-distance differences of the MFCC are computed by the method described above and used as the feature parameters.
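The endpoint detection named in the preprocessing relies on short-time energy and zero-crossing rate. A simplified sketch of a double-threshold detector with the 256-sample frame length and 128-sample frame shift of this embodiment is given below. The threshold values are left to the caller, and the function names and the exact search order (high energy threshold, then low energy threshold, then zero-crossing rate) are assumptions about one common form of the method, not details taken from this description.

```python
import numpy as np


def frame_signal(x, frame_len=256, frame_shift=128):
    """Split x into overlapping frames, shape (n_frames, frame_len)."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n_frames)[:, None]
    return x[idx]


def short_time_energy(frames):
    return np.sum(frames ** 2, axis=1)


def zero_crossing_rate(frames):
    """Number of sign changes within each frame (zeros count as positive)."""
    signs = np.sign(frames)
    signs[signs == 0] = 1
    return np.sum(signs[:, 1:] != signs[:, :-1], axis=1)


def detect_endpoints(energy, zcr, e_high, e_low, z_thresh):
    """Return (first_frame, last_frame) of the detected speech, or None."""
    above = np.flatnonzero(energy > e_high)
    if above.size == 0:
        return None
    start, end = int(above[0]), int(above[-1])
    # Extend outward while the energy stays above the low threshold
    while start > 0 and energy[start - 1] > e_low:
        start -= 1
    while end < len(energy) - 1 and energy[end + 1] > e_low:
        end += 1
    # Extend further while the zero-crossing rate indicates unvoiced speech
    while start > 0 and zcr[start - 1] > z_thresh:
        start -= 1
    while end < len(zcr) - 1 and zcr[end + 1] > z_thresh:
        end += 1
    return start, end
```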

2. Train the HMM models with clean speech

For speech recognition with HMMs, the model parameters must first be trained. Here the 4-sample-point and 6-sample-point long-distance differences of the MFCC of clean speech from 120 speakers (63 male, 57 female) are used as the speech recognition feature parameters and input to the HMMs for training. The HMMs use continuous probability density functions; each HMM has 4 states, and each state is modeled by a mixture of 3 Gaussian components.
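To make the state model concrete, the emission log-likelihood of one feature vector under a single HMM state with a 3-component Gaussian mixture could be evaluated as in the sketch below. Diagonal covariance matrices are an assumption, since the description does not state the covariance structure, and the function name is hypothetical.

```python
import numpy as np


def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | state) for a diagonal-covariance Gaussian mixture.

    x: (D,) feature vector; weights: (M,); means, variances: (M, D).
    M = 3 in the embodiment described above.
    """
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    # Per-component log normalizer and log exponent of the Gaussian density
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_terms = np.log(weights) + log_norm + log_exp
    # Log-sum-exp over the components for numerical stability
    peak = np.max(log_terms)
    return float(peak + np.log(np.sum(np.exp(log_terms - peak))))
```

Evaluating this function for every frame and every one of the 4 states gives the emission log-likelihood matrix that Baum-Welch training and Viterbi decoding operate on.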

3. Test with noisy speech

Speech from 51 speakers (31 male, 20 female) at different signal-to-noise ratios (SNR) is used for testing. The recognition rate with the 4-sample-point and 6-sample-point long-distance differences of the MFCC as feature parameters is found to be 20-40 percentage points higher than with the MFCC parameters themselves and their first-order difference coefficients, which are normally used in the field. The detailed results are shown in Tables 1 to 4.

Table 1. Speech recognition rates of the different feature parameters at different SNRs (Gaussian noise)

Table 2. Speech recognition rates of the different feature parameters at different SNRs (Formocarbam supermarket noise)

Table 3. Speech recognition rates of the different feature parameters at different SNRs (noise in a subway carriage)

Table 4. Speech recognition rates of the different feature parameters at different SNRs (Hunan Road traffic noise)
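Test speech at a prescribed SNR, of the kind evaluated in Tables 1 to 4, could for instance be produced by scaling a noise recording before adding it to clean speech. The sketch below shows one way of doing so; it is not the procedure used to obtain the reported results, the function name is hypothetical, and it assumes that the noise recording is at least as long as the speech and that the SNR is defined over the whole utterance.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the overall SNR equals snr_db (in dB)."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain g such that p_speech / (g**2 * p_noise) = 10 ** (snr_db / 10)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```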

Claims

1. A robust speech recognition method based on the long-distance difference of the Mel frequency cepstral coefficient (MFCC), characterized in that the 4-sample-point and 6-sample-point long-distance differences of the MFCC are used as feature parameters.

2. The method for computing the 4-sample-point long-distance difference of the MFCC as claimed in claim 1, characterized in that:

Δ₄MFCC(i) = MFCC(i+2) - MFCC(i-2),

where MFCC(i) is the MFCC parameter of the i-th frame of the speech signal and Δ₄MFCC is the 4-sample-point long-distance difference of the MFCC.

3. The method for computing the 6-sample-point long-distance difference of the MFCC as claimed in claim 1, characterized in that:

Δ₆MFCC(i) = MFCC(i+3) - MFCC(i-3),

where MFCC(i) is the MFCC parameter of the i-th frame of the speech signal and Δ₆MFCC is the 6-sample-point long-distance difference of the MFCC.