Anti-noise voice recognition system
Technical Field
The invention relates to the technical field of voice recognition.
Background
With the rapid development of information technology, human-computer interaction has received more and more attention, and speech recognition has become both a key technology of human-computer interaction and a research focus in the field. Speech recognition is a technology in which a computer converts speech signals into corresponding text or commands by extracting and analyzing the semantic information in human speech; it is widely applied in fields such as industry, household appliances, communications, automotive electronics, medical treatment, home services and consumer electronics.
However, speech signals are particularly susceptible to noise, and every link from acquisition through transmission to reconstruction may be affected by it. Spectral subtraction is one of the speech enhancement technologies; it is simple to operate and easy to implement.
Currently, the most mainstream characteristic parameter in speech recognition is the Mel Frequency Cepstrum Coefficient (MFCC). MFCC features are extracted based on the Fourier transform, which is in fact only suitable for processing stationary signals. The auditory transformation, a newer method for processing non-stationary speech signals, makes up for this defect of the Fourier transform and has the advantages of low harmonic distortion and good spectral smoothness. Cochlear filter cepstral coefficients (CFCC), the first feature to use the auditory transformation, were proposed in 2011 by Dr. Peter Li of Bell Labs and applied to speaker recognition. Although many scholars have since studied CFCC features, and although a nonlinear power function derived from the saturation relation between the neuron action-potential firing rate and the sound intensity can approximate the auditory neuron intensity curve, the traditional CFCC extraction method does not take this characteristic of human hearing into account. The invention therefore adopts a nonlinear power function that simulates this auditory characteristic to extract new CFCC features.
A complete speech signal contains both frequency information and energy information. The Teager energy operator, a nonlinear difference operator, can eliminate the influence of zero-mean noise and has a speech enhancement capability. Used for feature extraction, it better reflects the energy change of the speech signal, suppresses noise, enhances the speech signal, and yields good results in speech recognition.
Support Vector Machines (SVMs) are a machine learning technique based on the principle of structural risk minimization. They handle small-sample, nonlinear and high-dimensional classification problems well, generalize well, are widely applied to pattern recognition and classification estimation, and, owing to their excellent classification ability and good generalization performance, have become a common classification model in speech recognition technology.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: how to improve the speech recognition effect.
The technical scheme adopted by the invention is as follows: an anti-noise speech recognition system, comprising the steps of:
Step one, windowing and framing the speech signal s(n), then performing a discrete Fourier transform to obtain the amplitude and phase angle of the speech signal;
windowing the speech signal s(n), the window function used being the Hamming window w(n):
w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1
multiplying the speech signal s(n) by the window function w(n) to form the windowed speech signal x(n):
x(n) = s(n)·w(n)
the windowed speech signal x(n) is subjected to framing processing, after which it is expressed as x_n(t), where n is the frame number, t is the time index within a frame, and N is the frame length;
the framed speech signal x_n(t) is subjected to a discrete Fourier transform:
X(n,k) = Σ_{t=0}^{N-1} x_n(t)·e^(-j2πkt/N), k = 0, 1, ..., N-1
where j is the imaginary unit, e is the natural constant, π is the circular constant, and k is the harmonic component number; the short-time amplitude spectrum of the windowed speech signal is then estimated as |X(n,k)|, and the phase angle is:
∠X(n,k) = arctan(Im[X(n,k)] / Re[X(n,k)])
the value |X(n,k)| is taken as the amplitude of the speech signal, and ∠X(n,k) as its phase angle;
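A minimal MATLAB sketch of step one follows; the frame length N, the frame shift inc, and the assumption that s is a column vector holding the sampled speech are illustrative choices, not values fixed by the invention:
% Minimal sketch of windowing, framing and DFT (assumed N, inc; s is a column vector).
N = 256;                                         % frame length (assumed)
inc = 128;                                       % frame shift, 50% overlap (assumed)
w = hamming(N);                                  % Hamming window w(n)
numFrames = floor((length(s) - N)/inc) + 1;
X = zeros(N, numFrames);
for n = 1:numFrames
    frame = s((n-1)*inc + (1:N)) .* w;           % x_n(t): framed, windowed signal
    X(:, n) = fft(frame, N);                     % X(n,k), k = 0..N-1
end
mag   = abs(X);                                  % short-time amplitude |X(n,k)|
phase = angle(X);                                % phase angle of X(n,k)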
Step two, calculating the average energy of the noise section and obtaining the power spectrum of the estimated signal through spectral subtraction;
the duration of the leading noise-only section is IS and its corresponding number of frames is NIS; the average energy (noise power spectrum) of the noise section is:
D(k) = (1/NIS)·Σ_{n=1}^{NIS} |X(n,k)|²
the power spectrum of the estimated signal is obtained by the following spectral subtraction operation:
|X̂(n,k)|² = |X(n,k)|² - a1·D(k), if |X(n,k)|² ≥ a1·D(k)
|X̂(n,k)|² = b1·D(k), otherwise
where a1 and b1 are two constants: a1 is the over-subtraction factor and b1 is the gain compensation factor;
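Continuing the sketch, a minimal implementation of the noise estimate and the spectral subtraction; treating the first NIS frames as noise-only and the constants a1 = 4, b1 = 0.001 are assumptions, not values fixed by the invention:
% Minimal sketch of step two (assumed NIS, a1, b1; mag and numFrames from the previous sketch).
NIS = 5;                                         % leading noise-only frames (assumed)
a1 = 4; b1 = 0.001;                              % over-subtraction and gain compensation (assumed)
D = mean(mag(:, 1:NIS).^2, 2);                   % average noise power D(k)
P = mag.^2 - a1 * (D * ones(1, numFrames));      % subtract scaled noise power
floorP = b1 * (D * ones(1, numFrames));          % residual noise floor b1*D(k)
mask = P < floorP;
P(mask) = floorP(mask);                          % clamp to the noise floor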
Step three, reconstructing the signal using the phase angle information from before spectral subtraction to obtain the spectrally subtracted speech sequence;
the spectrally subtracted power spectrum |X̂(n,k)|² is combined with the phase angle information ∠X(n,k) from before spectral subtraction:
X̂(n,k) = |X̂(n,k)|·e^(j∠X(n,k))
an IFFT is then performed to restore the frequency domain to the time domain, giving the spectrally subtracted speech sequence ŝ(n);
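A minimal MATLAB sketch of this reconstruction, continuing from the sketches above (P and phase from steps one and two); the 50% overlap-add matches the frame shift assumed earlier:
% Minimal sketch of signal reconstruction by IFFT and overlap-add.
Xhat = sqrt(P) .* exp(1j * phase);               % subtracted magnitude, original phase
shat = zeros((numFrames-1)*inc + N, 1);
for n = 1:numFrames
    frame = real(ifft(Xhat(:, n), N));           % back to the time domain
    idx = (n-1)*inc + (1:N);
    shat(idx) = shat(idx) + frame;               % overlap-add
end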
Step four, for the spectrally subtracted speech sequence ŝ(n), simulating the auditory characteristics of the human ear with a nonlinear power function to extract the cochlear filter cepstral coefficient feature CFCC and its first-order difference ΔCFCC, and performing feature mixing by a dimension screening method;
the auditory transformation simulates the auditory mechanism of the human ear: taking the cochlear filter function as a new wavelet basis function, filtering is realized by means of a wavelet transform;
the output of the spectrally subtracted speech sequence ŝ(t) over a certain frequency band after auditory transformation is:
T(a2,b2) = ∫ ŝ(t)·(1/√a2)·ψ((t-b2)/a2) dt
where ψ(t) is the cochlear filter function, whose expression is:
ψ(t) = t^α·exp(-2π·fL·β·t)·cos(2π·fL·t + θ)·u(t), α > 0, β > 0
where the values of α and β determine the frequency-domain shape and width of the cochlear filter function, u(t) is the unit step function, b2 is a real number varying with time, a2 is the scale variable, and θ is the initial phase; in general, a2 can be determined from the centre frequency fc of the filter bank and the lowest centre frequency fL:
a2 = fL / fc
where α generally takes an empirical value within a fixed range, while β generally takes the empirical value β = 0.2;
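As an illustration of the cochlear filter function above, the following MATLAB sketch generates one filter impulse response; the sampling rate fs, the centre frequencies, the 50 ms support, and α = 3 are assumptions, not values fixed by the invention:
% Minimal sketch of one cochlear filter impulse response (assumed parameters).
fs = 16000;                 % sampling rate in Hz (assumed)
fL = 50; fc = 1000;         % lowest and current centre frequency in Hz (assumed)
alpha = 3; beta = 0.2;      % empirical shape parameters (alpha = 3 is an assumption)
theta = 0;                  % initial phase
a2 = fL / fc;               % scale variable a2 = fL / fc
t = (0:1/fs:0.05)';         % 50 ms of support; u(t) = 1 on this range
psi = (1/sqrt(a2)) * (t/a2).^alpha .* exp(-2*pi*fL*beta*t/a2) ...
      .* cos(2*pi*fL*t/a2 + theta);   % cochlear filter psi_{a2,0}(t)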
The inner hair cells of the human cochlea convert the speech signal output by the auditory transformation into an electrical signal that the human brain can analyze:
h(a2,b2) = [T(a2,b2)]²
According to the auditory characteristics of the human ear, the response duration of the auditory nerve to a sound shortens gradually as the frequency increases, indicating that the human ear is more sensitive to high-frequency transient components; the time smoothing window of a cochlear filter with a higher centre frequency therefore needs to be suitably shortened. Different window lengths are selected for different frequency bands, and the mean value of the hair-cell function of the i-th frequency band can be expressed as:
S(i,w) = (1/d)·Σ_{t=(w-1)L+1}^{(w-1)L+d} h(i,t)
where d = max{3.5·τ_i, 20 ms} is the smoothing window length of the i-th band, τ_i = 1/fc(i) is the period of the centre frequency of the i-th filter band, L is the frame shift with L = d/2, and w is the window number;
The output of the hair cells undergoes a loudness transformation through the nonlinear power function, converting the energy value into a perceived loudness; the perceived loudness of the i-th frequency band can be expressed as:
y(i,w) = [S(i,w)]^0.101
Finally, the obtained features are decorrelated using a discrete cosine transform (DCT) to obtain the CFCC feature parameters:
CFCC(n1) = Σ_{i=1}^{M} y(i,w)·cos(π·n1·(2i-1)/(2M))
where n1 is the order of the CFCC feature and M is the number of channels of the cochlear filter;
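The following MATLAB sketch strings the hair-cell, smoothing and loudness stages together for one frequency band; the variable T (the auditory-transform output of that band, as a column vector) and the sampling rate fs are assumed to be available from the previous stages:
% Minimal sketch of the hair-cell, smoothing and loudness stages for one band.
h = T.^2;                                        % inner-hair-cell output h = T^2
tau = 1/fc;                                      % period of the band's centre frequency
d = max(round(3.5*tau*fs), round(0.020*fs));     % window d = max{3.5*tau, 20 ms} in samples
L = round(d/2);                                  % frame shift L = d/2
numWin = floor((length(h) - d)/L) + 1;
S = zeros(numWin, 1);
for w = 1:numWin
    S(w) = mean(h((w-1)*L + (1:d)));             % mean hair-cell output S(i,w)
end
y = S.^0.101;                                    % perceived loudness y(i,w)
% Collecting y over all M bands frame by frame and applying dct() across
% the M channels yields the CFCC coefficients of each frame.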
After extracting the CFCC parameters, the first-order difference coefficient is calculated:
d_x(n1) = (Σ_{i=-k}^{k} i·C_{x+i}(n1)) / √(Σ_{i=-k}^{k} i²)
where d_x(n1) denotes the n1-th order coefficient of the first-order difference CFCC parameter of the x-th speech frame, C_x(n1) is the n1-th order CFCC of frame x, and k is a constant, generally taken as 2;
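A minimal MATLAB sketch of this first-order difference, assuming the CFCC features are stored as a matrix C with one frame per column (an assumed layout):
% Minimal sketch of the first-order difference (delta) with k = 2.
k = 2;
F = size(C, 2);                                  % number of frames
dC = zeros(size(C));
denom = sqrt(2 * sum((1:k).^2));                 % sqrt of sum_{i=-k}^{k} i^2
for x = (k+1):(F-k)                              % boundary frames left as zeros here
    acc = zeros(size(C, 1), 1);
    for i = 1:k
        acc = acc + i * (C(:, x+i) - C(:, x-i)); % sum_{i=-k}^{k} i*C_{x+i}(n1)
    end
    dC(:, x) = acc / denom;                      % d_x(n1)
end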
After the 16-order CFCC and ΔCFCC are respectively extracted, dimension screening is performed on the features, and the parts that best represent the speech characteristics are selected for feature mixing;
Step five, adding the Teager energy operator cepstral coefficient TEOCC on the basis of the CFCC + ΔCFCC features to form the fusion features;
for each frame of the speech signal x(n), its TEO energy is calculated:
ψ[x(n)] = x²(n) - x(n+1)·x(n-1)
the TEO energy is normalized and its logarithm is taken;
finally, a DCT transform is performed to obtain the one-dimensional TEOCC;
the one-dimensional TEOCC feature is appended as the last dimension of the mixed feature vector;
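A minimal MATLAB sketch of the TEOCC computation for one frame xf (a column vector); normalizing by the maximum absolute value and keeping the first DCT coefficient as the one-dimensional TEOCC are assumptions about details the text leaves open:
% Minimal sketch of step five for one frame xf.
teo = xf(2:end-1).^2 - xf(3:end).*xf(1:end-2);   % psi[x(n)] = x(n)^2 - x(n+1)*x(n-1)
teo = teo / max(abs(teo));                       % normalization (assumed: by max absolute value)
logTeo = log(abs(teo) + eps);                    % take logarithm (eps guards log(0))
c = dct(logTeo);                                 % DCT transform
TEOCC = c(1);                                    % one-dimensional TEOCC (assumed: first coefficient)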
Step six, performing data normalization on the fusion features to form a normalized training set and a normalized test set, and labeling the two sets respectively to obtain a training set label and a test set label;
let y_i be any data sample in the feature training set or feature test set; after normalization, the corresponding data sample in the normalized training set or test set is:
y_i' = (y_i - y_min) / (y_max - y_min)
where y_min and y_max represent the minimum and maximum of y_i, respectively.
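A minimal MATLAB sketch of this min-max normalization, assuming the feature matrix Y stores one sample per row and is normalized dimension by dimension; repmat is used because MATLAB R2011a, the stated platform, has no implicit expansion:
% Minimal sketch of min-max normalization per feature dimension.
ymin = min(Y, [], 1);                            % y_min of each column
ymax = max(Y, [], 1);                            % y_max of each column
Ynorm = (Y - repmat(ymin, size(Y,1), 1)) ./ ...
        (repmat(ymax - ymin, size(Y,1), 1) + eps);  % eps guards constant columns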
Step seven, reducing the dimension of the normalized feature sets with PCA and feeding the result into an SVM model to obtain the recognition accuracy;
the dimension-reduced voice features are divided into a training set train_data and a test set test_data, the training set label train_label and the test set label test_label are attached respectively, and the training set is input into the SVM (support vector machine) to establish a model:
model=svmtrain(train_label,train_data)
the test set is tested with the established model to obtain the recognition accuracy rate accuracy:
accuracy=svmpredict(test_label,test_data,model).
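The following MATLAB sketch shows steps six and seven end to end with the libsvm interface functions svmtrain/svmpredict named in the text and princomp for the PCA; the retained dimension and the variable names train_all/test_all are assumptions:
% Minimal sketch of PCA dimension reduction plus SVM training and testing.
mu = mean(train_all, 1);                         % mean of the normalized training set
[coeff, score] = princomp(train_all);            % PCA (Statistics Toolbox, R2011a era)
dim = 20;                                        % retained principal components (assumed)
train_data = score(:, 1:dim);                    % dimension-reduced training features
test_data  = (test_all - repmat(mu, size(test_all,1), 1)) * coeff(:, 1:dim);
model = svmtrain(train_label, train_data);       % libsvm MATLAB interface
[predict_label, accuracy] = svmpredict(test_label, test_data, model);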
The beneficial effects of the invention are as follows: by introducing spectral subtraction at the front end of feature extraction, the invention reduces the influence of noise on the speech signal; a nonlinear power function is adopted to simulate the auditory characteristics of the human ear when extracting the CFCC and its first-order difference coefficient; the TEOCC, which characterizes the energy of the speech signal, is added on this basis to form fusion features; feature selection is performed on the fusion features by principal component analysis; and the SVM model built on the selected features is applied to the speech recognition system, giving higher recognition accuracy, stronger robustness and faster recognition.
Detailed Description
In the invention, a Windows 7 system is used as the software environment and MATLAB R2011a as the program development platform. In this embodiment, for 10 isolated words at a signal-to-noise ratio of 0 dB, 270 speech samples produced by 9 speakers pronouncing each word three times are used as the training set, and 210 speech samples from 7 speakers under the same vocabulary and signal-to-noise ratio are used as the test set.
Step one, windowing and framing the speech signal s(n), then performing a discrete Fourier transform to obtain the amplitude and phase angle of the speech signal;
windowing the speech signal s(n), the window function used being the Hamming window w(n):
w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1
multiplying the speech signal s(n) by the window function w(n) to form the windowed speech signal x(n):
x(n) = s(n)·w(n)
the windowed speech signal x(n) is subjected to framing processing, after which it is expressed as x_n(t), where n is the frame number, t is the time index within a frame, and N is the frame length;
the framed speech signal x_n(t) is subjected to a discrete Fourier transform:
X(n,k) = Σ_{t=0}^{N-1} x_n(t)·e^(-j2πkt/N), k = 0, 1, ..., N-1
where j is the imaginary unit, e is the natural constant, π is the circular constant, and k is the harmonic component number; the short-time amplitude spectrum of the windowed speech signal is then estimated as |X(n,k)|, and the phase angle is:
∠X(n,k) = arctan(Im[X(n,k)] / Re[X(n,k)])
the value |X(n,k)| is taken as the amplitude of the speech signal, and ∠X(n,k) as its phase angle;
Step two, calculating the average energy of the noise section and obtaining the power spectrum of the estimated signal through spectral subtraction;
the duration of the leading noise-only section is IS and its corresponding number of frames is NIS; the average energy (noise power spectrum) of the noise section is:
D(k) = (1/NIS)·Σ_{n=1}^{NIS} |X(n,k)|²
the power spectrum of the estimated signal is obtained by the following spectral subtraction operation:
|X̂(n,k)|² = |X(n,k)|² - a1·D(k), if |X(n,k)|² ≥ a1·D(k)
|X̂(n,k)|² = b1·D(k), otherwise
where a1 and b1 are two constants: a1 is the over-subtraction factor and b1 is the gain compensation factor;
Step three, reconstructing the signal using the phase angle information from before spectral subtraction to obtain the spectrally subtracted speech sequence;
the spectrally subtracted power spectrum |X̂(n,k)|² is combined with the phase angle information ∠X(n,k) from before spectral subtraction:
X̂(n,k) = |X̂(n,k)|·e^(j∠X(n,k))
an IFFT is then performed to restore the frequency domain to the time domain, giving the spectrally subtracted speech sequence ŝ(n);
Step four, for the spectrally subtracted speech sequence ŝ(n), simulating the auditory characteristics of the human ear with a nonlinear power function to extract the cochlear filter cepstral coefficient feature CFCC and its first-order difference ΔCFCC, and performing feature mixing by a dimension screening method;
the auditory transformation simulates the auditory mechanism of the human ear: taking the cochlear filter function as a new wavelet basis function, filtering is realized by means of a wavelet transform;
the output of the spectrally subtracted speech sequence ŝ(t) over a certain frequency band after auditory transformation is:
T(a2,b2) = ∫ ŝ(t)·(1/√a2)·ψ((t-b2)/a2) dt
where ψ(t) is the cochlear filter function, whose expression is:
ψ(t) = t^α·exp(-2π·fL·β·t)·cos(2π·fL·t + θ)·u(t), α > 0, β > 0
where the values of α and β determine the frequency-domain shape and width of the cochlear filter function, u(t) is the unit step function, b2 is a real number varying with time, a2 is the scale variable, and θ is the initial phase; in general, a2 can be determined from the centre frequency fc of the filter bank and the lowest centre frequency fL:
a2 = fL / fc
where α generally takes an empirical value within a fixed range, while β generally takes the empirical value β = 0.2;
The inner hair cells of the human cochlea convert the speech signal output by the auditory transformation into an electrical signal that the human brain can analyze:
h(a2,b2) = [T(a2,b2)]²
According to the auditory characteristics of the human ear, the response duration of the auditory nerve to a sound shortens gradually as the frequency increases, indicating that the human ear is more sensitive to high-frequency transient components; the time smoothing window of a cochlear filter with a higher centre frequency therefore needs to be suitably shortened. Different window lengths are selected for different frequency bands, and the mean value of the hair-cell function of the i-th frequency band can be expressed as:
S(i,w) = (1/d)·Σ_{t=(w-1)L+1}^{(w-1)L+d} h(i,t)
where d = max{3.5·τ_i, 20 ms} is the smoothing window length of the i-th band, τ_i = 1/fc(i) is the period of the centre frequency of the i-th filter band, L is the frame shift with L = d/2, and w is the window number;
The output of the hair cells undergoes a loudness transformation through the nonlinear power function, converting the energy value into a perceived loudness; the perceived loudness of the i-th frequency band can be expressed as:
y(i,w) = [S(i,w)]^0.101
Finally, the obtained features are decorrelated using a discrete cosine transform (DCT) to obtain the CFCC feature parameters:
CFCC(n1) = Σ_{i=1}^{M} y(i,w)·cos(π·n1·(2i-1)/(2M))
where n1 is the order of the CFCC feature and M is the number of channels of the cochlear filter;
After extracting the CFCC parameters, the first-order difference coefficient is calculated:
d_x(n1) = (Σ_{i=-k}^{k} i·C_{x+i}(n1)) / √(Σ_{i=-k}^{k} i²)
where d_x(n1) denotes the n1-th order coefficient of the first-order difference CFCC parameter of the x-th speech frame, C_x(n1) is the n1-th order CFCC of frame x, and k is a constant, generally taken as 2;
After the 16-order CFCC and ΔCFCC are respectively extracted, dimension screening is performed on the features, and the parts that best represent the speech characteristics are selected for feature mixing;
Step five, adding the Teager energy operator cepstral coefficient TEOCC on the basis of the CFCC + ΔCFCC features to form the fusion features;
for each frame of the speech signal x(n), its TEO energy is calculated:
ψ[x(n)] = x²(n) - x(n+1)·x(n-1)
the TEO energy is normalized and its logarithm is taken;
finally, a DCT transform is performed to obtain the one-dimensional TEOCC;
the one-dimensional TEOCC feature is appended as the last dimension of the mixed feature vector;
Step six, performing data normalization on the fusion features to form a normalized training set and a normalized test set, and labeling the two sets respectively to obtain a training set label and a test set label;
let y_i be any data sample in the feature training set or feature test set; after normalization, the corresponding data sample in the normalized training set or test set is:
y_i' = (y_i - y_min) / (y_max - y_min)
where y_min and y_max represent the minimum and maximum of y_i, respectively.
Step seven, reducing the dimension of the normalized feature sets with PCA and feeding the result into an SVM model to obtain the recognition accuracy;
the dimension-reduced voice features are divided into a training set train_data and a test set test_data, the training set label train_label and the test set label test_label are attached respectively, and the training set is input into the SVM (support vector machine) to establish a model:
model=svmtrain(train_label,train_data)
the test set is tested with the established model to obtain the recognition accuracy rate accuracy:
accuracy=svmpredict(test_label,test_data,model).
where accuracy is the classification accuracy on the test set samples; the speech recognition accuracy obtained on the test set is 88.10%.