Anti-noise voice recognition system
Technical Field
The invention relates to the technical field of voice recognition.
Background
With the rapid development of information technology, human-computer interaction has received more and more attention, and speech recognition has become both a key technology of human-computer interaction and a research focus in the field. Speech recognition is a technology in which a computer converts speech signals into corresponding text or commands by extracting and analyzing the semantic information in human speech; it is widely applied in fields such as industry, household appliances, communications, automotive electronics, medical treatment, home services and consumer electronics.
However, speech signals are particularly susceptible to noise, and every link from acquisition through transmission to reconstruction may be affected by it. Spectral subtraction is one of the speech enhancement technologies; it is simple to operate and easy to implement.
Currently, the most mainstream characteristic parameter in speech recognition is the Mel Frequency Cepstrum Coefficient (MFCC). MFCC features are extracted based on the Fourier transform, which is in fact only suitable for processing stationary signals. The auditory transformation, a newer method for processing non-stationary speech signals, makes up for this defect of the Fourier transform and has the advantages of low harmonic distortion and good spectral smoothness. Cochlear filter cepstral coefficients (CFCC), the first feature to use the auditory transformation, were proposed in 2011 by Dr. Peter Li of Bell Labs and applied to speaker recognition. Although many scholars have since studied CFCC features, and although a nonlinear power function derived from the saturation relation between the neuron action-potential firing rate and the sound intensity can approximate the auditory neuron intensity curve, the traditional CFCC extraction method does not take this characteristic of human hearing into account. The invention therefore adopts a nonlinear power function that simulates this auditory characteristic to extract new CFCC features.
A complete speech signal contains both frequency information and energy information. The Teager energy operator, a nonlinear difference operator, can eliminate the influence of zero-mean noise and has a speech enhancement capability. Used for feature extraction, it better reflects the energy change of the speech signal, suppresses noise, enhances the speech signal, and yields good results in speech recognition.
Support Vector Machines (SVMs) are a machine learning technique based on the principle of structural risk minimization. They handle small-sample, nonlinear and high-dimensional classification problems well, generalize well, are widely applied to pattern recognition and classification estimation, and, owing to their excellent classification ability and good generalization performance, have become a common classification model in speech recognition technology.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: how to improve the speech recognition effect.
The technical scheme adopted by the invention is as follows: an anti-noise speech recognition system, comprising the steps of:
Step one, windowing and framing the speech signal s(n), then performing a discrete Fourier transform to obtain the amplitude and phase angle of the speech signal;
windowing the speech signal s(n), the window function used being the Hamming window w(n):
w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1
multiplying the speech signal s(n) by the window function w(n) to form the windowed speech signal x(n):
x(n) = s(n)·w(n)
the windowed speech signal x(n) is subjected to framing processing, after which it is expressed as x_n(t), where n is the frame number, t is the time index within a frame, and N is the frame length;
the framed speech signal x_n(t) is subjected to a discrete Fourier transform:
X(n,k) = Σ_{t=0}^{N-1} x_n(t)·e^(-j2πkt/N), k = 0, 1, ..., N-1
where j is the imaginary unit, e is the natural constant, π is the circular constant, and k is the harmonic component number; the short-time amplitude spectrum of the windowed speech signal is then estimated as |X(n,k)|, and the phase angle is:
∠X(n,k) = arctan(Im[X(n,k)] / Re[X(n,k)])
the value |X(n,k)| is taken as the amplitude of the speech signal, and ∠X(n,k) as its phase angle;
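A minimal MATLAB sketch of step one follows; the frame length N, the frame shift inc, and the assumption that s is a column vector holding the sampled speech are illustrative choices, not values fixed by the invention:
% Minimal sketch of windowing, framing and DFT (assumed N, inc; s is a column vector).
N = 256;                                         % frame length (assumed)
inc = 128;                                       % frame shift, 50% overlap (assumed)
w = hamming(N);                                  % Hamming window w(n)
numFrames = floor((length(s) - N)/inc) + 1;
X = zeros(N, numFrames);
for n = 1:numFrames
    frame = s((n-1)*inc + (1:N)) .* w;           % x_n(t): framed, windowed signal
    X(:, n) = fft(frame, N);                     % X(n,k), k = 0..N-1
end
mag   = abs(X);                                  % short-time amplitude |X(n,k)|
phase = angle(X);                                % phase angle of X(n,k)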
Step two, calculating the average energy of the noise section and obtaining the power spectrum of the estimated signal through spectral subtraction;
the duration of the leading noise-only section is IS and its corresponding number of frames is NIS; the average energy (noise power spectrum) of the noise section is:
D(k) = (1/NIS)·Σ_{n=1}^{NIS} |X(n,k)|²
the power spectrum of the estimated signal is obtained by the following spectral subtraction operation:
|X̂(n,k)|² = |X(n,k)|² - a1·D(k), if |X(n,k)|² ≥ a1·D(k)
|X̂(n,k)|² = b1·D(k), otherwise
where a1 and b1 are two constants: a1 is the over-subtraction factor and b1 is the gain compensation factor;
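Continuing the sketch, a minimal implementation of the noise estimate and the spectral subtraction; treating the first NIS frames as noise-only and the constants a1 = 4, b1 = 0.001 are assumptions, not values fixed by the invention:
% Minimal sketch of step two (assumed NIS, a1, b1; mag and numFrames from the previous sketch).
NIS = 5;                                         % leading noise-only frames (assumed)
a1 = 4; b1 = 0.001;                              % over-subtraction and gain compensation (assumed)
D = mean(mag(:, 1:NIS).^2, 2);                   % average noise power D(k)
P = mag.^2 - a1 * (D * ones(1, numFrames));      % subtract scaled noise power
floorP = b1 * (D * ones(1, numFrames));          % residual noise floor b1*D(k)
mask = P < floorP;
P(mask) = floorP(mask);                          % clamp to the noise floor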
Step three, reconstructing the signal using the phase angle information from before spectral subtraction to obtain the spectrally subtracted speech sequence;
the spectrally subtracted power spectrum |X̂(n,k)|² is combined with the phase angle information ∠X(n,k) from before spectral subtraction:
X̂(n,k) = |X̂(n,k)|·e^(j∠X(n,k))
an IFFT is then performed to restore the frequency domain to the time domain, giving the spectrally subtracted speech sequence ŝ(n);
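A minimal MATLAB sketch of this reconstruction, continuing from the sketches above (P and phase from steps one and two); the 50% overlap-add matches the frame shift assumed earlier:
% Minimal sketch of signal reconstruction by IFFT and overlap-add.
Xhat = sqrt(P) .* exp(1j * phase);               % subtracted magnitude, original phase
shat = zeros((numFrames-1)*inc + N, 1);
for n = 1:numFrames
    frame = real(ifft(Xhat(:, n), N));           % back to the time domain
    idx = (n-1)*inc + (1:N);
    shat(idx) = shat(idx) + frame;               % overlap-add
end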
Step four, for the spectrally subtracted speech sequence ŝ(n), simulating the auditory characteristics of the human ear with a nonlinear power function to extract the cochlear filter cepstral coefficient feature CFCC and its first-order difference ΔCFCC, and performing feature mixing by a dimension screening method;
the auditory transformation simulates the auditory mechanism of the human ear: taking the cochlear filter function as a new wavelet basis function, filtering is realized by means of a wavelet transform;
the output of the spectrally subtracted speech sequence ŝ(t) over a certain frequency band after auditory transformation is:
T(a2,b2) = ∫ ŝ(t)·(1/√a2)·ψ((t-b2)/a2) dt
where ψ(t) is the cochlear filter function, whose expression is:
ψ(t) = t^α·exp(-2π·fL·β·t)·cos(2π·fL·t + θ)·u(t), α > 0, β > 0
where the values of α and β determine the frequency-domain shape and width of the cochlear filter function, u(t) is the unit step function, b2 is a real number varying with time, a2 is the scale variable, and θ is the initial phase; in general, a2 can be determined from the centre frequency fc of the filter bank and the lowest centre frequency fL:
a2 = fL / fc
where α generally takes an empirical value within a fixed range, while β generally takes the empirical value β = 0.2;
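As an illustration of the cochlear filter function above, the following MATLAB sketch generates one filter impulse response; the sampling rate fs, the centre frequencies, the 50 ms support, and α = 3 are assumptions, not values fixed by the invention:
% Minimal sketch of one cochlear filter impulse response (assumed parameters).
fs = 16000;                 % sampling rate in Hz (assumed)
fL = 50; fc = 1000;         % lowest and current centre frequency in Hz (assumed)
alpha = 3; beta = 0.2;      % empirical shape parameters (alpha = 3 is an assumption)
theta = 0;                  % initial phase
a2 = fL / fc;               % scale variable a2 = fL / fc
t = (0:1/fs:0.05)';         % 50 ms of support; u(t) = 1 on this range
psi = (1/sqrt(a2)) * (t/a2).^alpha .* exp(-2*pi*fL*beta*t/a2) ...
      .* cos(2*pi*fL*t/a2 + theta);   % cochlear filter psi_{a2,0}(t)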
The inner hair cells of the human cochlea convert the speech signal output by the auditory transformation into an electrical signal that the human brain can analyze:
h(a2,b2) = [T(a2,b2)]²
According to the auditory characteristics of the human ear, the response duration of the auditory nerve to a sound shortens gradually as the frequency increases, indicating that the human ear is more sensitive to high-frequency transient components; the time smoothing window of a cochlear filter with a higher centre frequency therefore needs to be suitably shortened. Different window lengths are selected for different frequency bands, and the mean value of the hair-cell function of the i-th frequency band can be expressed as:
S(i,w) = (1/d)·Σ_{t=(w-1)L+1}^{(w-1)L+d} h(i,t)
where d = max{3.5·τ_i, 20 ms} is the smoothing window length of the i-th band, τ_i = 1/fc(i) is the period of the centre frequency of the i-th filter band, L is the frame shift with L = d/2, and w is the window number;
The output of the hair cells undergoes a loudness transformation through the nonlinear power function, converting the energy value into a perceived loudness; the perceived loudness of the i-th frequency band can be expressed as:
y(i,w) = [S(i,w)]^0.101
Finally, the obtained features are decorrelated using a discrete cosine transform (DCT) to obtain the CFCC feature parameters:
CFCC(n1) = Σ_{i=1}^{M} y(i,w)·cos(π·n1·(2i-1)/(2M))
where n1 is the order of the CFCC feature and M is the number of channels of the cochlear filter;
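The following MATLAB sketch strings the hair-cell, smoothing and loudness stages together for one frequency band; the variable T (the auditory-transform output of that band, as a column vector) and the sampling rate fs are assumed to be available from the previous stages:
% Minimal sketch of the hair-cell, smoothing and loudness stages for one band.
h = T.^2;                                        % inner-hair-cell output h = T^2
tau = 1/fc;                                      % period of the band's centre frequency
d = max(round(3.5*tau*fs), round(0.020*fs));     % window d = max{3.5*tau, 20 ms} in samples
L = round(d/2);                                  % frame shift L = d/2
numWin = floor((length(h) - d)/L) + 1;
S = zeros(numWin, 1);
for w = 1:numWin
    S(w) = mean(h((w-1)*L + (1:d)));             % mean hair-cell output S(i,w)
end
y = S.^0.101;                                    % perceived loudness y(i,w)
% Collecting y over all M bands frame by frame and applying dct() across
% the M channels yields the CFCC coefficients of each frame.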
After extracting the CFCC parameters, the first-order difference coefficient is calculated:
d_x(n1) = (Σ_{i=-k}^{k} i·C_{x+i}(n1)) / √(Σ_{i=-k}^{k} i²)
where d_x(n1) denotes the n1-th order coefficient of the first-order difference CFCC parameter of the x-th speech frame, C_x(n1) is the n1-th order CFCC of frame x, and k is a constant, generally taken as 2;
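A minimal MATLAB sketch of this first-order difference, assuming the CFCC features are stored as a matrix C with one frame per column (an assumed layout):
% Minimal sketch of the first-order difference (delta) with k = 2.
k = 2;
F = size(C, 2);                                  % number of frames
dC = zeros(size(C));
denom = sqrt(2 * sum((1:k).^2));                 % sqrt of sum_{i=-k}^{k} i^2
for x = (k+1):(F-k)                              % boundary frames left as zeros here
    acc = zeros(size(C, 1), 1);
    for i = 1:k
        acc = acc + i * (C(:, x+i) - C(:, x-i)); % sum_{i=-k}^{k} i*C_{x+i}(n1)
    end
    dC(:, x) = acc / denom;                      % d_x(n1)
end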
After the 16-order CFCC and ΔCFCC are respectively extracted, dimension screening is performed on the features, and the parts that best represent the speech characteristics are selected for feature mixing;
Step five, adding the Teager energy operator cepstral coefficient TEOCC on the basis of the CFCC + ΔCFCC features to form the fusion features;
for each frame of the speech signal x(n), its TEO energy is calculated:
ψ[x(n)] = x²(n) - x(n+1)·x(n-1)
the TEO energy is normalized and its logarithm is taken;
finally, a DCT transform is performed to obtain the one-dimensional TEOCC;
the one-dimensional TEOCC feature is appended as the last dimension of the mixed feature vector;
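A minimal MATLAB sketch of the TEOCC computation for one frame xf (a column vector); normalizing by the maximum absolute value and keeping the first DCT coefficient as the one-dimensional TEOCC are assumptions about details the text leaves open:
% Minimal sketch of step five for one frame xf.
teo = xf(2:end-1).^2 - xf(3:end).*xf(1:end-2);   % psi[x(n)] = x(n)^2 - x(n+1)*x(n-1)
teo = teo / max(abs(teo));                       % normalization (assumed: by max absolute value)
logTeo = log(abs(teo) + eps);                    % take logarithm (eps guards log(0))
c = dct(logTeo);                                 % DCT transform
TEOCC = c(1);                                    % one-dimensional TEOCC (assumed: first coefficient)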
Step six, performing data normalization on the fusion features to form a normalized training set and a normalized test set, and labeling the two sets respectively to obtain a training set label and a test set label;
let y_i be any data sample in the feature training set or feature test set; after normalization, the corresponding data sample in the normalized training set or test set is:
y_i' = (y_i - y_min) / (y_max - y_min)
where y_min and y_max represent the minimum and maximum of y_i, respectively.
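A minimal MATLAB sketch of this min-max normalization, assuming the feature matrix Y stores one sample per row and is normalized dimension by dimension; repmat is used because MATLAB R2011a, the stated platform, has no implicit expansion:
% Minimal sketch of min-max normalization per feature dimension.
ymin = min(Y, [], 1);                            % y_min of each column
ymax = max(Y, [], 1);                            % y_max of each column
Ynorm = (Y - repmat(ymin, size(Y,1), 1)) ./ ...
        (repmat(ymax - ymin, size(Y,1), 1) + eps);  % eps guards constant columns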
Step seven, reducing the dimension of the normalized feature sets with PCA and feeding the result into an SVM model to obtain the recognition accuracy;
the dimension-reduced voice features are divided into a training set train_data and a test set test_data, the training set label train_label and the test set label test_label are attached respectively, and the training set is input into the SVM (support vector machine) to establish a model:
model=svmtrain(train_label,train_data)
the test set is tested with the established model to obtain the recognition accuracy rate accuracy:
accuracy=svmpredict(test_label,test_data,model).
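The following MATLAB sketch shows steps six and seven end to end with the libsvm interface functions svmtrain/svmpredict named in the text and princomp for the PCA; the retained dimension and the variable names train_all/test_all are assumptions:
% Minimal sketch of PCA dimension reduction plus SVM training and testing.
mu = mean(train_all, 1);                         % mean of the normalized training set
[coeff, score] = princomp(train_all);            % PCA (Statistics Toolbox, R2011a era)
dim = 20;                                        % retained principal components (assumed)
train_data = score(:, 1:dim);                    % dimension-reduced training features
test_data  = (test_all - repmat(mu, size(test_all,1), 1)) * coeff(:, 1:dim);
model = svmtrain(train_label, train_data);       % libsvm MATLAB interface
[predict_label, accuracy] = svmpredict(test_label, test_data, model);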
The beneficial effects of the invention are as follows: by introducing spectral subtraction at the front end of feature extraction, the invention reduces the influence of noise on the speech signal; a nonlinear power function is adopted to simulate the auditory characteristics of the human ear when extracting the CFCC and its first-order difference coefficient; the TEOCC, which characterizes the energy of the speech signal, is added on this basis to form fusion features; feature selection is performed on the fusion features by principal component analysis; and the SVM model built on the selected features is applied to the speech recognition system, giving higher recognition accuracy, stronger robustness and faster recognition.
Detailed Description
In the invention, a Windows 7 system is used as the software environment and MATLAB R2011a as the program development platform. In this embodiment, for 10 isolated words at a signal-to-noise ratio of 0 dB, 270 speech samples produced by 9 speakers pronouncing each word three times are used as the training set, and 210 speech samples from 7 speakers under the same vocabulary and signal-to-noise ratio are used as the test set.
Step one, windowing and framing the speech signal s(n), then performing a discrete Fourier transform to obtain the amplitude and phase angle of the speech signal;
windowing the speech signal s(n), the window function used being the Hamming window w(n):
w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1
multiplying the speech signal s(n) by the window function w(n) to form the windowed speech signal x(n):
x(n) = s(n)·w(n)
the windowed speech signal x(n) is subjected to framing processing, after which it is expressed as x_n(t), where n is the frame number, t is the time index within a frame, and N is the frame length;
the framed speech signal x_n(t) is subjected to a discrete Fourier transform:
X(n,k) = Σ_{t=0}^{N-1} x_n(t)·e^(-j2πkt/N), k = 0, 1, ..., N-1
where j is the imaginary unit, e is the natural constant, π is the circular constant, and k is the harmonic component number; the short-time amplitude spectrum of the windowed speech signal is then estimated as |X(n,k)|, and the phase angle is:
∠X(n,k) = arctan(Im[X(n,k)] / Re[X(n,k)])
the value |X(n,k)| is taken as the amplitude of the speech signal, and ∠X(n,k) as its phase angle;
Step two, calculating the average energy of the noise section and obtaining the power spectrum of the estimated signal through spectral subtraction;
the duration of the leading noise-only section is IS and its corresponding number of frames is NIS; the average energy (noise power spectrum) of the noise section is:
D(k) = (1/NIS)·Σ_{n=1}^{NIS} |X(n,k)|²
the power spectrum of the estimated signal is obtained by the following spectral subtraction operation:
|X̂(n,k)|² = |X(n,k)|² - a1·D(k), if |X(n,k)|² ≥ a1·D(k)
|X̂(n,k)|² = b1·D(k), otherwise
where a1 and b1 are two constants: a1 is the over-subtraction factor and b1 is the gain compensation factor;
Step three, reconstructing the signal using the phase angle information from before spectral subtraction to obtain the spectrally subtracted speech sequence;
the spectrally subtracted power spectrum |X̂(n,k)|² is combined with the phase angle information ∠X(n,k) from before spectral subtraction:
X̂(n,k) = |X̂(n,k)|·e^(j∠X(n,k))
an IFFT is then performed to restore the frequency domain to the time domain, giving the spectrally subtracted speech sequence ŝ(n);
Step four, for the spectrally subtracted speech sequence ŝ(n), simulating the auditory characteristics of the human ear with a nonlinear power function to extract the cochlear filter cepstral coefficient feature CFCC and its first-order difference ΔCFCC, and performing feature mixing by a dimension screening method;
the auditory transformation simulates the auditory mechanism of the human ear: taking the cochlear filter function as a new wavelet basis function, filtering is realized by means of a wavelet transform;
the output of the spectrally subtracted speech sequence ŝ(t) over a certain frequency band after auditory transformation is:
T(a2,b2) = ∫ ŝ(t)·(1/√a2)·ψ((t-b2)/a2) dt
where ψ(t) is the cochlear filter function, whose expression is:
ψ(t) = t^α·exp(-2π·fL·β·t)·cos(2π·fL·t + θ)·u(t), α > 0, β > 0
where the values of α and β determine the frequency-domain shape and width of the cochlear filter function, u(t) is the unit step function, b2 is a real number varying with time, a2 is the scale variable, and θ is the initial phase; in general, a2 can be determined from the centre frequency fc of the filter bank and the lowest centre frequency fL:
a2 = fL / fc
where α generally takes an empirical value within a fixed range, while β generally takes the empirical value β = 0.2;
The inner hair cells of the human cochlea convert the speech signal output by the auditory transformation into an electrical signal that the human brain can analyze:
h(a2,b2) = [T(a2,b2)]²
According to the auditory characteristics of the human ear, the response duration of the auditory nerve to a sound shortens gradually as the frequency increases, indicating that the human ear is more sensitive to high-frequency transient components; the time smoothing window of a cochlear filter with a higher centre frequency therefore needs to be suitably shortened. Different window lengths are selected for different frequency bands, and the mean value of the hair-cell function of the i-th frequency band can be expressed as:
S(i,w) = (1/d)·Σ_{t=(w-1)L+1}^{(w-1)L+d} h(i,t)
where d = max{3.5·τ_i, 20 ms} is the smoothing window length of the i-th band, τ_i = 1/fc(i) is the period of the centre frequency of the i-th filter band, L is the frame shift with L = d/2, and w is the window number;
The output of the hair cells undergoes a loudness transformation through the nonlinear power function, converting the energy value into a perceived loudness; the perceived loudness of the i-th frequency band can be expressed as:
y(i,w) = [S(i,w)]^0.101
Finally, the obtained features are decorrelated using a discrete cosine transform (DCT) to obtain the CFCC feature parameters:
CFCC(n1) = Σ_{i=1}^{M} y(i,w)·cos(π·n1·(2i-1)/(2M))
where n1 is the order of the CFCC feature and M is the number of channels of the cochlear filter;
After extracting the CFCC parameters, the first-order difference coefficient is calculated:
d_x(n1) = (Σ_{i=-k}^{k} i·C_{x+i}(n1)) / √(Σ_{i=-k}^{k} i²)
where d_x(n1) denotes the n1-th order coefficient of the first-order difference CFCC parameter of the x-th speech frame, C_x(n1) is the n1-th order CFCC of frame x, and k is a constant, generally taken as 2;
After the 16-order CFCC and ΔCFCC are respectively extracted, dimension screening is performed on the features, and the parts that best represent the speech characteristics are selected for feature mixing;
Step five, adding the Teager energy operator cepstral coefficient TEOCC on the basis of the CFCC + ΔCFCC features to form the fusion features;
for each frame of the speech signal x(n), its TEO energy is calculated:
ψ[x(n)] = x²(n) - x(n+1)·x(n-1)
the TEO energy is normalized and its logarithm is taken;
finally, a DCT transform is performed to obtain the one-dimensional TEOCC;
the one-dimensional TEOCC feature is appended as the last dimension of the mixed feature vector;
Step six, performing data normalization on the fusion features to form a normalized training set and a normalized test set, and labeling the two sets respectively to obtain a training set label and a test set label;
let y_i be any data sample in the feature training set or feature test set; after normalization, the corresponding data sample in the normalized training set or test set is:
y_i' = (y_i - y_min) / (y_max - y_min)
where y_min and y_max represent the minimum and maximum of y_i, respectively.
Step seven, reducing the dimension of the normalized feature sets with PCA and feeding the result into an SVM model to obtain the recognition accuracy;
the dimension-reduced voice features are divided into a training set train_data and a test set test_data, the training set label train_label and the test set label test_label are attached respectively, and the training set is input into the SVM (support vector machine) to establish a model:
model=svmtrain(train_label,train_data)
the test set is tested with the established model to obtain the recognition accuracy rate accuracy:
accuracy=svmpredict(test_label,test_data,model).
where accuracy is the classification accuracy on the test set samples; the speech recognition accuracy obtained on the test set is 88.10%.