CN109949821B - Method for removing reverberation of far-field voice by using U-NET structure of CNN - Google Patents

Method for removing reverberation of far-field voice by using U-NET structure of CNN

Info

Publication number
CN109949821B
CN109949821B (application number CN201910200023.XA)
Authority
CN
China
Prior art keywords
mfcc
voice
reverberation
features
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910200023.XA
Other languages
Chinese (zh)
Other versions
CN109949821A (en)
Inventor
李楠
张健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huiyan Technology Tianjin Co ltd
Original Assignee
Huiyan Technology Tianjin Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huiyan Technology Tianjin Co ltd filed Critical Huiyan Technology Tianjin Co ltd
Priority to CN201910200023.XA priority Critical patent/CN109949821B/en
Publication of CN109949821A publication Critical patent/CN109949821A/en
Application granted granted Critical
Publication of CN109949821B publication Critical patent/CN109949821B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a method for removing reverberation from far-field speech using the U-NET structure of a CNN (convolutional neural network), belonging to the technical field of speech signal processing. Aiming at the situation where strong reverberation under far-field conditions severely reduces speech recognition accuracy, a CNN-based U-NET structure is proposed, with the 2014 REVERB Challenge data set as the processing object. The method mainly comprises the following steps: extracting features from the reverberant speech in the data set and from the corresponding non-reverberant speech; mapping the extracted reverberant speech features to the non-reverberant speech features; and training an acoustic model on the features enhanced by the proposed network framework and decoding with it.

Description

Method for removing reverberation of far-field voice by using U-NET structure of CNN
Technical Field
The invention belongs to the technical field of voice signal processing, and particularly relates to a method for removing reverberation of far-field voice by using a U-NET structure of a CNN.
Background
In recent years, emerging industries such as smart homes, conversation robots and smart speakers have developed vigorously, greatly changing people's lifestyles and the way people interact with machines; voice interaction, as a new interaction mode, is widely applied in these emerging fields. With the application of deep learning to speech recognition, recognition performance has improved greatly: the recognition rate exceeds 95 percent, essentially reaching the level of human hearing. However, these results are limited to near-field conditions, where noise and room reverberation are very small; achieving good recognition in complex scenes with heavy noise or reverberation has therefore become a very important part of the user experience.
Dereverberation of speech is one of the main research directions in far-field speech recognition. Within a room, reverberant speech can be represented as the convolution of the clean speech signal with the room impulse response (RIR), so reverberant speech is disturbed by earlier speech information in the same sentence. Reverberation includes early and late reverberation: early reverberation can even bring some improvement to speech recognition, but late reverberation degrades it. Therefore, if late reverberation can be effectively suppressed or reduced, good speech recognition performance can be obtained.
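The convolution relationship above can be made concrete with a small sketch (illustrative only; the function name and the toy impulse response are not from the patent):

```python
import numpy as np

def simulate_reverb(clean, rir):
    """Reverberant speech modeled as the convolution of the clean
    signal with a room impulse response (RIR)."""
    return np.convolve(clean, rir)

# A toy RIR: a unit direct path plus one reflection, attenuated by
# half and delayed by two samples.
clean = np.array([1.0, 0.0, 0.0, 0.0])
rir = np.array([1.0, 0.0, 0.5])
reverberant = simulate_reverb(clean, rir)
```

The reflection smears each sample of the clean signal two samples into the future, which is exactly the "disturbance from earlier speech" the paragraph describes.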
Existing studies fall into two categories. One uses signal processing for speech dereverberation, such as the weighted prediction error (WPE) method of NTT Corporation in Japan, but signal processing alone is far from satisfactory in more complex scenes. The other uses deep learning methods, such as deep neural networks, for speech dereverberation. Although existing neural network methods can establish a good nonlinear mapping, a fully connected network alone rarely achieves the expected effect; a well-designed network structure can bring a substantial improvement in recognition performance and has practical significance for speech recognition in complex scenes. The invention also compares the current mainstream methods under the same conditions, and the results show that the neural network framework used by the invention is greatly superior to them.
Disclosure of Invention
In order to solve the existing problems, the invention provides a method for dereverberating far-field voice by using a U-NET structure of a CNN.
The technical scheme of the invention is as follows: a method for far-field speech dereverberation by using a U-NET structure of CNN comprises the following steps:
Step one: extracting features from the data;
Pre-emphasis: each speech signal s(n) in the data set is passed through a high-pass filter;
Windowing: take 25 ms as one frame and apply a Hanning window;
Fast Fourier transform (FFT): perform an FFT on each frame to convert time-domain data into frequency-domain data, and compute the spectral line energy;
Mel filtering: pass each frame's spectral line energy through a Mel filter bank and compute the energy in each Mel filter;
Computing the DCT cepstrum: take the logarithm of the Mel filter energies and compute the DCT to obtain the Mel-frequency cepstral coefficients (MFCC);
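The extraction steps above can be sketched end to end as follows (an illustrative NumPy implementation, not the code of the invention; the 10 ms frame shift, 512-point FFT, 23 Mel filters and 13 cepstral coefficients are assumed typical values):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, nfft=512,
         n_mels=23, n_ceps=13, a=0.95):
    # Pre-emphasis: s'(n) = s(n) - a*s(n-1), i.e. H(z) = 1 - a*z^-1
    emph = np.append(signal[0], signal[1:] - a * signal[:-1])
    # Framing (25 ms frames at 16 kHz) with a Hanning window
    n_frames = 1 + (len(emph) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emph[idx] * np.hanning(frame_len)
    # FFT and spectral line (power) energy
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2
    # Triangular Mel filter bank energies
    mel_pts = np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, nfft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log of Mel energies, then DCT-II -> first n_ceps cepstral coefficients
    logmel = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (n[None, :] + 0.5) * np.arange(n_ceps)[:, None])
    return logmel @ dct.T

feats = mfcc(np.random.default_rng(0).standard_normal(16000))  # one second of audio
```

One second of 16 kHz audio yields 98 frames of 13 coefficients with these settings.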
Step two: designing a neural network framework for front-end feature enhancement;
Using the MFCC features obtained in step one as input, the width of the convolutional layers is set to the MFCC dimension and their height to 11; the filter sizes of the convolutional layers are then set to [12,12,24,24,32,32,24,24,12,12], which is equivalent to an encoder-decoder network structure. At the same time, the encoder and decoder are connected with a Resnet-style structure, i.e. the outputs of the second and fourth convolutional layers are added to the seventh and ninth convolutional layers, respectively. The features of the 11 input frames are integrated directly into one frame and spliced with the output of the U-NET network, two fully connected neural network layers are added, and the MFCC features of clean speech are output through an output layer.
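The 11-frame input window described above (five context frames on either side of the current frame) can be illustrated with a small splicing helper; the function name and the edge handling by frame repetition are assumptions, since the patent does not specify them:

```python
import numpy as np

def splice_context(feats, left=5, right=5):
    """Stack each frame with `left` + `right` context frames (11 total
    for 5+1+5), repeating the first/last frame at the utterance edges."""
    padded = np.concatenate([np.repeat(feats[:1], left, axis=0),
                             feats,
                             np.repeat(feats[-1:], right, axis=0)], axis=0)
    win = left + 1 + right
    # One (win, dim) window per original frame -> shape (T, 11, dim)
    return np.stack([padded[t:t + win] for t in range(len(feats))])

utt = np.arange(20 * 13, dtype=float).reshape(20, 13)  # 20 frames of 13-dim MFCCs
windows = splice_context(utt)                          # (20, 11, 13) CNN input
```

Each window matches the convolutional layer geometry in the text: height 11 (frames) by width equal to the MFCC dimension.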
Step three: training and decoding a voice recognition model;
the method comprises the steps of normalizing by using MFCC characteristics of clean voice, then solving a first-order difference and a second-order difference of the clean voice, carrying out one-factor and three-phoneme training on the characteristics subjected to the difference, carrying out training on an acoustic model by using the MFCC characteristics subjected to voice dereverberation in a data set under multiple scenes, and decoding test set data subjected to dereverberation.
Further, the data set is the REVERB Challenge data set of 2014.
Further, the transfer function of the high-pass filter in step one can be expressed as H(z) = 1 − az⁻¹, where a ∈ [0.9, 1]; here a takes the value 0.95.
Further, the loss function used in the second step is MSE, and the loss function is as follows:
MSE = (1/N) Σ_{i=1}^{N} (Y_i − X_{C,i})²
where Y represents the MFCC features output by the neural network and X_C represents the MFCC features of clean speech.
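In code, this loss is simply the mean of the squared differences (an illustrative NumPy form of the same formula):

```python
import numpy as np

def mse_loss(Y, Xc):
    """Mean squared error between the network's output MFCCs Y
    and the clean-speech target MFCCs Xc."""
    return np.mean((Y - Xc) ** 2)

# Every element differs by exactly 1, so the mean squared error is 1.0.
loss = mse_loss(np.ones((4, 13)), np.zeros((4, 13)))
```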
Further, the MFCC features of clean speech output in step two are the features of a single frame.
Furthermore, the MFCC features used as input in step two are MFCC features with five frames of context on each side, so that context information can be better learned.
Further, the number of the neurons of the two fully-connected neural networks in the second step is 1024.
The invention has the beneficial effects that: the invention mainly addresses the situation where real-world speech contains reverberation, processing reverberant speech simulated from clean speech with a neural network so as to improve speech recognition accuracy under real conditions. An encoder-decoder-based U-NET framework is constructed and its combination with a DNN is improved, finally achieving better speech recognition performance on a single-channel data set. Applying the encoder-decoder-based U-NET framework to feature enhancement in far-field speech recognition allows both the semantic information of the speech features and their frequency-domain characteristics to be learned better, combining the respective advantages of the CNN and the DNN in far-field speech recognition: the CNN learns the frequency domain better, while the DNN better handles the mapping from reverberant speech to clean speech. No RNN is used in the invention, because RNNs are slow in training and decoding, whereas convolution operations in CNNs are well optimized on various hardware, so the network used has a great advantage in decoding speed.
Drawings
FIG. 1 is a system block diagram of the present invention.
Detailed Description
For the convenience of understanding the technical solution of the present invention, the following description is made with reference to fig. 1 and the specific embodiments, which are not intended to limit the scope of the present invention.
Examples
In this embodiment, the REVERB Challenge data set is taken as an example. The whole system algorithm flow is shown in fig. 1 and includes feature extraction of the data, construction of the encoder-decoder-based U-NET + DNN network, training of the speech recognition model, and its decoding. The specific steps are as follows:
experiments were conducted using the REVERB challenge single channel dataset in the official dataset using a multi-ambient training set derived from clean training data by convolving clean utterances with measured room impulse responses with some additive noise added, generally with a signal to noise ratio of 20db, experimental test data including simulated data (SimData) and real-ambient data (RealData), simuda consisting of reverberant speech generated based on a WSJCAM0 corpus, in the same artificial distortion mode as the multi-ambient training set. Simulata simulates six reverberation cases: three rooms of different sizes (small, medium, large) and the distance between one speaker and microphone (near 50cm and far 200cm), RealData utterances come from the MC-WSJ-AV corpus, and in practical cases, since the speaker follows the head movement, the sound source cannot be considered as completely spatially fixed, so RealData and analog data are data in two different states. The room for the RealData recording differs from the room for the SimuData and training set in that the room has a reverberation time of about 0.7s, and also contains some fixed ambient noise. The RealData is classified into two different conditions according to two distances between the speaker and the microphone (near 100cm and far 250 cm). But since the text of the sentences used in RealData and simuldata is the same. Thus, we can use the same language model as well as the acoustic model for simuldata and RealData.
Speech recognition is performed with an acoustic model from nnet2 in Kaldi. The MFCC features of clean speech are normalized, first- and second-order differences are computed, monophone and triphone training is performed on the differenced features, and the model is optimized with the LDA and MLLR algorithms. The acoustic model is then trained with the MFCC features of the multi-scene training set of the REVERB Challenge data set, and finally the test set data are decoded; the language model used is a tri-gram language model. The result without dereverberation is recorded as NULL, as shown in Table 1;
TensorFlow is used as the tool for the front-end feature enhancement part. The MFCC features of five context frames on each side, 11 frames in total, are used as input; the width of the convolutional layers is set to the MFCC dimension and their height to 11, and the filter sizes of the convolutional layers are set to [12,12,24,24,32,32,32,24,24,12,12] respectively, which is equivalent to an encoder-decoder network structure. At the same time, the encoder and decoder are connected with a Resnet-style structure, i.e. the outputs of the second and fourth convolutional layers are added to the seventh and ninth layers, respectively; this structure greatly reduces the gradient-vanishing phenomenon caused by deepening the network. To better retain some information of the original input speech features, the 11 input frames are integrated directly into one frame and spliced with the output of the U-NET network, so some features of the original speech frames are retained while deep semantic features can still be learned well. Then, to map better to the MFCC features of clean speech, two fully connected layers with 1024 neurons each are added, followed by the output layer, which outputs the MFCC features (one frame) of clean speech. The final word error rate is recorded as U-CNET + DNN, as shown in Table 1;
TABLE 1 results of error rates of words obtained for different methods
For comparison, the same task is performed with the mainstream DNN method, using three hidden layers with 1024 neurons each and the same five-frame-context input; the result is recorded as DNN, as shown in Table 1. The result obtained with a plain U-NET (with the same number of input frames and other conditions as the proposed structure) is recorded as U-NET, as shown in Table 1, and the result obtained with the improved U-NET (the final CNN output spliced with the integrated 11-frame input features and then fed to the fully connected layers) is recorded as U-CNET, as shown in Table 1;
from table 1 we can see that the results obtained using the method provided by the present invention have significant advantages over the results obtained by other methods.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (1)

1. A method for far-field speech dereverberation by using a U-NET structure of CNN is characterized by comprising the following steps:
Step one: extracting features from the data;
Pre-emphasis: each speech signal s(n) in the data set is passed through a high-pass filter;
Windowing: take 25 ms as one frame and apply a Hanning window;
Fast Fourier transform (FFT): perform an FFT on each frame to convert time-domain data into frequency-domain data, and compute the spectral line energy;
Mel filtering: pass each frame's spectral line energy through a Mel filter bank and compute the energy in each Mel filter;
Computing the DCT cepstrum: take the logarithm of the Mel filter energies and compute the DCT to obtain the Mel-frequency cepstral coefficients (MFCC);
Step two: designing a neural network framework for front-end feature enhancement;
Using the MFCC features obtained in step one as input, the width of the convolutional layers is set to the MFCC dimension and their height to 11; the filter sizes of the convolutional layers are then set to [12,12,24,24,32,32,24,24,12,12] respectively, which is equivalent to an encoder-decoder network structure; at the same time, the encoder and decoder are connected with a Resnet-style structure, i.e. the outputs of the second and fourth convolutional layers are added to the seventh and ninth convolutional layers, respectively; the features of the 11 input frames are integrated directly into one frame and spliced with the output of the U-NET network, two fully connected neural network layers are added, and the MFCC features of clean speech are output through an output layer;
step three: training and decoding a voice recognition model;
Normalizing with the MFCC features of clean speech and then computing first- and second-order differences; carrying out monophone and triphone training on the differenced features; training the acoustic model with the dereverberated MFCC features in the data set; and decoding the dereverberated test set data;
the data set is REVERB Challenge data set of 2014;
The transfer function of the high-pass filter in step one can be expressed as H(z) = 1 − az⁻¹, where a ∈ [0.9, 1]; the value of a is 0.95;
the loss function used in the second step is MSE, and the loss function is as follows:
MSE = (1/N) Σ_{i=1}^{N} (Y_i − X_{C,i})²
where Y represents the MFCC features output by the neural network and X_C represents the MFCC features of clean speech;
the MFCC features of the output clean speech in the second step are features of one frame;
the MFCC features used as input in step two are the MFCC features with five frames of context on each side;
and in the second step, the number of the neurons of the two fully-connected neural networks is 1024.
CN201910200023.XA 2019-03-15 2019-03-15 Method for removing reverberation of far-field voice by using U-NET structure of CNN Active CN109949821B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910200023.XA CN109949821B (en) 2019-03-15 2019-03-15 Method for removing reverberation of far-field voice by using U-NET structure of CNN


Publications (2)

Publication Number Publication Date
CN109949821A CN109949821A (en) 2019-06-28
CN109949821B true CN109949821B (en) 2020-12-08

Family

ID=67008408


Country Status (1)

Country Link
CN (1) CN109949821B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110544485A (en) * 2019-09-27 2019-12-06 慧言科技(天津)有限公司 method for performing far-field speech dereverberation by using SE-ED network of CNN
CN111899738A (en) * 2020-07-29 2020-11-06 北京嘀嘀无限科技发展有限公司 Dialogue generating method, device and storage medium
CN112017682B (en) * 2020-09-18 2023-05-23 中科极限元(杭州)智能科技股份有限公司 Single-channel voice simultaneous noise reduction and reverberation removal system
CN112542177B (en) * 2020-11-04 2023-07-21 北京百度网讯科技有限公司 Signal enhancement method, device and storage medium
CN113129919A (en) * 2021-04-17 2021-07-16 上海麦图信息科技有限公司 Air control voice noise reduction method based on deep learning
CN113421581B (en) * 2021-08-24 2021-11-02 广州易方信息科技股份有限公司 Real-time voice noise reduction method for jump network
CN115331691A (en) * 2022-10-13 2022-11-11 广州成至智能机器科技有限公司 Pickup method and device for unmanned aerial vehicle, unmanned aerial vehicle and computer readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103390403A (en) * 2013-06-19 2013-11-13 北京百度网讯科技有限公司 Extraction method and device for mel frequency cepstrum coefficient (MFCC) characteristics
CN106373589A (en) * 2016-09-14 2017-02-01 东南大学 Binaural mixed voice separation method based on iteration structure
CN108320749A (en) * 2018-03-14 2018-07-24 百度在线网络技术(北京)有限公司 Far field voice control device and far field speech control system
CN109243429A (en) * 2018-11-21 2019-01-18 苏州奇梦者网络科技有限公司 A kind of pronunciation modeling method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9491545B2 (en) * 2014-05-23 2016-11-08 Apple Inc. Methods and devices for reverberation suppression
US10389885B2 (en) * 2017-02-01 2019-08-20 Cisco Technology, Inc. Full-duplex adaptive echo cancellation in a conference endpoint


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Image Segmentation with Pyramid Dilated Convolution Based on ResNet and U-Net; Qiao Zhang et al.; Neural Information Processing; 2017-12-31; pp. 365-366 *
Speech Dereverberation Using Fully Convolutional Networks; Ori Ernst et al.; EUSIPCO; 2018-12-31; pp. 390-394 *
A Survey of Image Semantic Segmentation; Xiao Zhaoxia et al.; Software Guide; 2018-08-31; Vol. 17, No. 8; pp. 6-8 *

Also Published As

Publication number Publication date
CN109949821A (en) 2019-06-28

Similar Documents

Publication Publication Date Title
CN109949821B (en) Method for removing reverberation of far-field voice by using U-NET structure of CNN
Qian et al. Very deep convolutional neural networks for noise robust speech recognition
Zhao et al. Monaural speech dereverberation using temporal convolutional networks with self attention
CN109887489B (en) Speech dereverberation method based on depth features for generating countermeasure network
Li et al. Two heads are better than one: A two-stage complex spectral mapping approach for monaural speech enhancement
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
Tan et al. Real-time speech enhancement using an efficient convolutional recurrent network for dual-microphone mobile phones in close-talk scenarios
CN110085245B (en) Voice definition enhancing method based on acoustic feature conversion
Pandey et al. Self-attending RNN for speech enhancement to improve cross-corpus generalization
CN108172238A (en) A kind of voice enhancement algorithm based on multiple convolutional neural networks in speech recognition system
Tawara et al. Multi-Channel Speech Enhancement Using Time-Domain Convolutional Denoising Autoencoder.
Nikzad et al. Deep residual-dense lattice network for speech enhancement
CN106328123B (en) Method for recognizing middle ear voice in normal voice stream under condition of small database
CN110047478B (en) Multi-channel speech recognition acoustic modeling method and device based on spatial feature compensation
Yuliani et al. Speech enhancement using deep learning methods: A review
Zezario et al. Self-supervised denoising autoencoder with linear regression decoder for speech enhancement
Wu et al. Increasing compactness of deep learning based speech enhancement models with parameter pruning and quantization techniques
CN110728991B (en) Improved recording equipment identification algorithm
CN112885375A (en) Global signal-to-noise ratio estimation method based on auditory filter bank and convolutional neural network
Kothapally et al. Skipconvgan: Monaural speech dereverberation using generative adversarial networks via complex time-frequency masking
Geng et al. End-to-end speech enhancement based on discrete cosine transform
CN110867178B (en) Multi-channel far-field speech recognition method
CN115424627A (en) Voice enhancement hybrid processing method based on convolution cycle network and WPE algorithm
CN110544485A (en) method for performing far-field speech dereverberation by using SE-ED network of CNN
Elshamy et al. DNN-based cepstral excitation manipulation for speech enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant