CN109949821B - Method for removing reverberation of far-field voice by using U-NET structure of CNN - Google Patents

Method for removing reverberation of far-field voice by using U-NET structure of CNN

Info

Publication number
CN109949821B
CN109949821B (application number CN201910200023.XA)
Authority
CN
China
Prior art keywords
mfcc
voice
reverberation
features
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910200023.XA
Other languages
Chinese (zh)
Other versions
CN109949821A (en)
Inventor
李楠
张健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huiyan Technology Tianjin Co ltd
Original Assignee
Huiyan Technology Tianjin Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huiyan Technology Tianjin Co ltd filed Critical Huiyan Technology Tianjin Co ltd
Priority to CN201910200023.XA priority Critical patent/CN109949821B/en
Publication of CN109949821A publication Critical patent/CN109949821A/en
Application granted granted Critical
Publication of CN109949821B publication Critical patent/CN109949821B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a method for removing reverberation from far-field speech using the U-NET structure of a CNN (convolutional neural network), belonging to the technical field of speech signal processing. Aiming at the situation where strong reverberation under far-field conditions severely reduces speech recognition accuracy, a CNN-based U-NET structure is proposed, with the 2014 REVERB Challenge data set as the processing object. The method mainly comprises the following steps: extracting features from the reverberant speech in the data set and from the corresponding non-reverberant speech; mapping the extracted reverberant speech features to the non-reverberant speech features; and training an acoustic model on the features enhanced by the proposed network framework and decoding with it.

Description

Method for removing reverberation of far-field voice by using U-NET structure of CNN
Technical Field
The invention belongs to the technical field of voice signal processing, and particularly relates to a method for removing reverberation of far-field voice by using a U-NET structure of a CNN.
Background
In recent years, emerging industries such as smart homes, conversation robots and smart speakers have developed vigorously, greatly changing people's lifestyles and the way people interact with machines; voice interaction, as a new interaction mode, is widely applied in these emerging fields. With the application of deep learning to speech recognition, recognition performance has improved greatly: the recognition rate exceeds 95 percent, essentially reaching the level of human hearing. However, these results are limited to near-field conditions, where noise and room reverberation are very small; achieving good recognition in complex scenes with heavy noise or reverberation has therefore become a very important part of the user experience.
Dereverberation of speech is one of the main research directions in far-field speech recognition. Within a room, reverberant speech can be represented as the convolution of the clean speech signal with the room impulse response (RIR), so reverberant speech is disturbed by earlier speech information in the same sentence. Reverberation includes early and late reverberation: early reverberation can even bring some improvement to speech recognition, but late reverberation degrades it. Therefore, if late reverberation can be effectively suppressed or reduced, good speech recognition performance can be obtained.
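The convolution relationship above can be made concrete with a small sketch (illustrative only; the function name and the toy impulse response are not from the patent):

```python
import numpy as np

def simulate_reverb(clean, rir):
    """Reverberant speech modeled as the convolution of the clean
    signal with a room impulse response (RIR)."""
    return np.convolve(clean, rir)

# A toy RIR: a unit direct path plus one reflection, attenuated by
# half and delayed by two samples.
clean = np.array([1.0, 0.0, 0.0, 0.0])
rir = np.array([1.0, 0.0, 0.5])
reverberant = simulate_reverb(clean, rir)
```

The reflection smears each sample of the clean signal two samples into the future, which is exactly the "disturbance from earlier speech" the paragraph describes.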
Existing studies fall into two categories. One uses signal processing for speech dereverberation, such as the weighted prediction error (WPE) method of NTT Corporation in Japan, but signal processing alone is far from satisfactory in more complex scenes. The other uses deep learning methods, such as deep neural networks, for speech dereverberation. Although existing neural network methods can establish a good nonlinear mapping, a fully connected network alone rarely achieves the expected effect; a well-designed network structure can bring a substantial improvement in recognition performance and has practical significance for speech recognition in complex scenes. The invention also compares the current mainstream methods under the same conditions, and the results show that the neural network framework used by the invention is greatly superior to them.
Disclosure of Invention
In order to solve the existing problems, the invention provides a method for dereverberating far-field voice by using a U-NET structure of a CNN.
The technical scheme of the invention is as follows: a method for far-field speech dereverberation by using a U-NET structure of CNN comprises the following steps:
Step one: extracting features from the data;
Pre-emphasis: each speech signal s(n) in the data set is passed through a high-pass filter;
Windowing: take 25 ms as one frame and apply a Hanning window;
Fast Fourier transform (FFT): perform an FFT on each frame to convert time-domain data into frequency-domain data, and compute the spectral line energy;
Mel filtering: pass each frame's spectral line energy through a Mel filter bank and compute the energy in each Mel filter;
Computing the DCT cepstrum: take the logarithm of the Mel filter energies and compute the DCT to obtain the Mel-frequency cepstral coefficients (MFCC);
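The extraction steps above can be sketched end to end as follows (an illustrative NumPy implementation, not the code of the invention; the 10 ms frame shift, 512-point FFT, 23 Mel filters and 13 cepstral coefficients are assumed typical values):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, nfft=512,
         n_mels=23, n_ceps=13, a=0.95):
    # Pre-emphasis: s'(n) = s(n) - a*s(n-1), i.e. H(z) = 1 - a*z^-1
    emph = np.append(signal[0], signal[1:] - a * signal[:-1])
    # Framing (25 ms frames at 16 kHz) with a Hanning window
    n_frames = 1 + (len(emph) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emph[idx] * np.hanning(frame_len)
    # FFT and spectral line (power) energy
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2
    # Triangular Mel filter bank energies
    mel_pts = np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, nfft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log of Mel energies, then DCT-II -> first n_ceps cepstral coefficients
    logmel = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (n[None, :] + 0.5) * np.arange(n_ceps)[:, None])
    return logmel @ dct.T

feats = mfcc(np.random.default_rng(0).standard_normal(16000))  # one second of audio
```

One second of 16 kHz audio yields 98 frames of 13 coefficients with these settings.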
Step two: designing a neural network framework for front-end feature enhancement;
Using the MFCC features obtained in step one as input, the width of the convolutional layers is set to the MFCC dimension and their height to 11; the filter sizes of the convolutional layers are then set to [12,12,24,24,32,32,24,24,12,12], which is equivalent to an encoder-decoder network structure. At the same time, the encoder and decoder are connected with a Resnet-style structure, i.e. the outputs of the second and fourth convolutional layers are added to the seventh and ninth convolutional layers, respectively. The features of the 11 input frames are integrated directly into one frame and spliced with the output of the U-NET network, two fully connected neural network layers are added, and the MFCC features of clean speech are output through an output layer.
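The 11-frame input window described above (five context frames on either side of the current frame) can be illustrated with a small splicing helper; the function name and the edge handling by frame repetition are assumptions, since the patent does not specify them:

```python
import numpy as np

def splice_context(feats, left=5, right=5):
    """Stack each frame with `left` + `right` context frames (11 total
    for 5+1+5), repeating the first/last frame at the utterance edges."""
    padded = np.concatenate([np.repeat(feats[:1], left, axis=0),
                             feats,
                             np.repeat(feats[-1:], right, axis=0)], axis=0)
    win = left + 1 + right
    # One (win, dim) window per original frame -> shape (T, 11, dim)
    return np.stack([padded[t:t + win] for t in range(len(feats))])

utt = np.arange(20 * 13, dtype=float).reshape(20, 13)  # 20 frames of 13-dim MFCCs
windows = splice_context(utt)                          # (20, 11, 13) CNN input
```

Each window matches the convolutional layer geometry in the text: height 11 (frames) by width equal to the MFCC dimension.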
Step three: training and decoding a voice recognition model;
the method comprises the steps of normalizing by using MFCC characteristics of clean voice, then solving a first-order difference and a second-order difference of the clean voice, carrying out one-factor and three-phoneme training on the characteristics subjected to the difference, carrying out training on an acoustic model by using the MFCC characteristics subjected to voice dereverberation in a data set under multiple scenes, and decoding test set data subjected to dereverberation.
Further, the data set is the REVERB Challenge data set of 2014.
Further, the transfer function of the high-pass filter in step one can be expressed as H(z) = 1 − az⁻¹, where a ∈ [0.9, 1]; here a takes the value 0.95.
Further, the loss function used in the second step is MSE, and the loss function is as follows:
MSE = (1/N) Σ_{i=1}^{N} (Y_i − X_{C,i})²
where Y represents the MFCC features output by the neural network and X_C represents the MFCC features of clean speech.
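In code, this loss is simply the mean of the squared differences (an illustrative NumPy form of the same formula):

```python
import numpy as np

def mse_loss(Y, Xc):
    """Mean squared error between the network's output MFCCs Y
    and the clean-speech target MFCCs Xc."""
    return np.mean((Y - Xc) ** 2)

# Every element differs by exactly 1, so the mean squared error is 1.0.
loss = mse_loss(np.ones((4, 13)), np.zeros((4, 13)))
```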
Further, the MFCC features of clean speech output in step two are the features of a single frame.
Furthermore, the MFCC features used as input in step two are MFCC features with five frames of context on each side, so that context information can be better learned.
Further, the number of the neurons of the two fully-connected neural networks in the second step is 1024.
The invention has the beneficial effects that: the invention mainly addresses the situation where real-world speech contains reverberation, processing reverberant speech simulated from clean speech with a neural network so as to improve speech recognition accuracy under real conditions. An encoder-decoder-based U-NET framework is constructed and its combination with a DNN is improved, finally achieving better speech recognition performance on a single-channel data set. Applying the encoder-decoder-based U-NET framework to feature enhancement in far-field speech recognition allows both the semantic information of the speech features and their frequency-domain characteristics to be learned better, combining the respective advantages of the CNN and the DNN in far-field speech recognition: the CNN learns the frequency domain better, while the DNN better handles the mapping from reverberant speech to clean speech. No RNN is used in the invention, because RNNs are slow in training and decoding, whereas convolution operations in CNNs are well optimized on various hardware, so the network used has a great advantage in decoding speed.
Drawings
FIG. 1 is a system block diagram of the present invention.
Detailed Description
For the convenience of understanding the technical solution of the present invention, the following description is made with reference to fig. 1 and the specific embodiments, which are not intended to limit the scope of the present invention.
Examples
In this embodiment, the REVERB Challenge data set is taken as an example. The whole system algorithm flow is shown in fig. 1 and includes feature extraction of the data, construction of the encoder-decoder-based U-NET + DNN network, training of the speech recognition model, and its decoding. The specific steps are as follows:
experiments were conducted using the REVERB challenge single channel dataset in the official dataset using a multi-ambient training set derived from clean training data by convolving clean utterances with measured room impulse responses with some additive noise added, generally with a signal to noise ratio of 20db, experimental test data including simulated data (SimData) and real-ambient data (RealData), simuda consisting of reverberant speech generated based on a WSJCAM0 corpus, in the same artificial distortion mode as the multi-ambient training set. Simulata simulates six reverberation cases: three rooms of different sizes (small, medium, large) and the distance between one speaker and microphone (near 50cm and far 200cm), RealData utterances come from the MC-WSJ-AV corpus, and in practical cases, since the speaker follows the head movement, the sound source cannot be considered as completely spatially fixed, so RealData and analog data are data in two different states. The room for the RealData recording differs from the room for the SimuData and training set in that the room has a reverberation time of about 0.7s, and also contains some fixed ambient noise. The RealData is classified into two different conditions according to two distances between the speaker and the microphone (near 100cm and far 250 cm). But since the text of the sentences used in RealData and simuldata is the same. Thus, we can use the same language model as well as the acoustic model for simuldata and RealData.
Speech recognition is performed with an acoustic model from nnet2 in Kaldi. The MFCC features of clean speech are normalized, first- and second-order differences are computed, monophone and triphone training is performed on the differenced features, and the model is optimized with the LDA and MLLR algorithms. The acoustic model is then trained with the MFCC features of the multi-scene training set of the REVERB Challenge data set, and finally the test set data are decoded; the language model used is a tri-gram language model. The result without dereverberation is recorded as NULL, as shown in Table 1;
TensorFlow is used as the tool for the front-end feature enhancement part. The MFCC features of five context frames on each side, 11 frames in total, are used as input; the width of the convolutional layers is set to the MFCC dimension and their height to 11, and the filter sizes of the convolutional layers are set to [12,12,24,24,32,32,32,24,24,12,12] respectively, which is equivalent to an encoder-decoder network structure. At the same time, the encoder and decoder are connected with a Resnet-style structure, i.e. the outputs of the second and fourth convolutional layers are added to the seventh and ninth layers, respectively; this structure greatly reduces the gradient-vanishing phenomenon caused by deepening the network. To better retain some information of the original input speech features, the 11 input frames are integrated directly into one frame and spliced with the output of the U-NET network, so some features of the original speech frames are retained while deep semantic features can still be learned well. Then, to map better to the MFCC features of clean speech, two fully connected layers with 1024 neurons each are added, followed by the output layer, which outputs the MFCC features (one frame) of clean speech. The final word error rate is recorded as U-CNET + DNN, as shown in Table 1;
TABLE 1 results of error rates of words obtained for different methods
For comparison, the same task is performed with the mainstream DNN method, using three hidden layers with 1024 neurons each and the same five-frame-context input; the result is recorded as DNN, as shown in Table 1. The result obtained with a plain U-NET (with the same number of input frames and other conditions as the proposed structure) is recorded as U-NET, as shown in Table 1, and the result obtained with the improved U-NET (the final CNN output spliced with the integrated 11-frame input features and then fed to the fully connected layers) is recorded as U-CNET, as shown in Table 1;
from table 1 we can see that the results obtained using the method provided by the present invention have significant advantages over the results obtained by other methods.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (1)

1. A method for far-field speech dereverberation by using a U-NET structure of CNN is characterized by comprising the following steps:
Step one: extracting features from the data;
Pre-emphasis: each speech signal s(n) in the data set is passed through a high-pass filter;
Windowing: take 25 ms as one frame and apply a Hanning window;
Fast Fourier transform (FFT): perform an FFT on each frame to convert time-domain data into frequency-domain data, and compute the spectral line energy;
Mel filtering: pass each frame's spectral line energy through a Mel filter bank and compute the energy in each Mel filter;
Computing the DCT cepstrum: take the logarithm of the Mel filter energies and compute the DCT to obtain the Mel-frequency cepstral coefficients (MFCC);
Step two: designing a neural network framework for front-end feature enhancement;
Using the MFCC features obtained in step one as input, the width of the convolutional layers is set to the MFCC dimension and their height to 11; the filter sizes of the convolutional layers are then set to [12,12,24,24,32,32,24,24,12,12] respectively, which is equivalent to an encoder-decoder network structure; at the same time, the encoder and decoder are connected with a Resnet-style structure, i.e. the outputs of the second and fourth convolutional layers are added to the seventh and ninth convolutional layers, respectively; the features of the 11 input frames are integrated directly into one frame and spliced with the output of the U-NET network, two fully connected neural network layers are added, and the MFCC features of clean speech are output through an output layer;
step three: training and decoding a voice recognition model;
Normalizing with the MFCC features of clean speech and then computing first- and second-order differences; carrying out monophone and triphone training on the differenced features; training the acoustic model with the dereverberated MFCC features in the data set; and decoding the dereverberated test set data;
the data set is REVERB Challenge data set of 2014;
The transfer function of the high-pass filter in step one can be expressed as H(z) = 1 − az⁻¹, where a ∈ [0.9, 1]; the value of a is 0.95;
the loss function used in the second step is MSE, and the loss function is as follows:
MSE = (1/N) Σ_{i=1}^{N} (Y_i − X_{C,i})²
where Y represents the MFCC features output by the neural network and X_C represents the MFCC features of clean speech;
the MFCC features of the output clean speech in the second step are features of one frame;
the MFCC features used as input in step two are the MFCC features with five frames of context on each side;
and in the second step, the number of the neurons of the two fully-connected neural networks is 1024.
CN201910200023.XA 2019-03-15 2019-03-15 Method for removing reverberation of far-field voice by using U-NET structure of CNN Active CN109949821B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910200023.XA CN109949821B (en) 2019-03-15 2019-03-15 Method for removing reverberation of far-field voice by using U-NET structure of CNN


Publications (2)

Publication Number Publication Date
CN109949821A CN109949821A (en) 2019-06-28
CN109949821B true CN109949821B (en) 2020-12-08

Family

ID=67008408


Country Status (1)

Country Link
CN (1) CN109949821B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110544485A (en) * 2019-09-27 2019-12-06 慧言科技(天津)有限公司 method for performing far-field speech dereverberation by using SE-ED network of CNN
CN111899738A (en) * 2020-07-29 2020-11-06 北京嘀嘀无限科技发展有限公司 Dialogue generating method, device and storage medium
CN112017682B (en) * 2020-09-18 2023-05-23 中科极限元(杭州)智能科技股份有限公司 Single-channel voice simultaneous noise reduction and reverberation removal system
CN112542177B (en) * 2020-11-04 2023-07-21 北京百度网讯科技有限公司 Signal enhancement method, device and storage medium
CN113129919A (en) * 2021-04-17 2021-07-16 上海麦图信息科技有限公司 Air control voice noise reduction method based on deep learning
CN113421581B (en) * 2021-08-24 2021-11-02 广州易方信息科技股份有限公司 Real-time voice noise reduction method for jump network
CN115331691A (en) * 2022-10-13 2022-11-11 广州成至智能机器科技有限公司 Pickup method and device for unmanned aerial vehicle, unmanned aerial vehicle and computer readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103390403A (en) * 2013-06-19 2013-11-13 北京百度网讯科技有限公司 Extraction method and device for mel frequency cepstrum coefficient (MFCC) characteristics
CN106373589A (en) * 2016-09-14 2017-02-01 东南大学 Binaural mixed voice separation method based on iteration structure
CN108320749A (en) * 2018-03-14 2018-07-24 百度在线网络技术(北京)有限公司 Far field voice control device and far field speech control system
CN109243429A (en) * 2018-11-21 2019-01-18 苏州奇梦者网络科技有限公司 A kind of pronunciation modeling method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9491545B2 (en) * 2014-05-23 2016-11-08 Apple Inc. Methods and devices for reverberation suppression
US10389885B2 (en) * 2017-02-01 2019-08-20 Cisco Technology, Inc. Full-duplex adaptive echo cancellation in a conference endpoint


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Image Segmentation with Pyramid Dilated Convolution Based on ResNet and U-Net; Qiao Zhang et al.; Neural Information Processing; 2017-12-31; pp. 365-366 *
Speech Dereverberation Using Fully Convolutional Networks; Ori Ernst et al.; EUSIPCO; 2018-12-31; pp. 390-394 *
A Survey of Image Semantic Segmentation; Xiao Zhaoxia et al.; Software Guide; 2018-08-31; Vol. 17, No. 8; pp. 6-8 *

Also Published As

Publication number Publication date
CN109949821A (en) 2019-06-28

Similar Documents

Publication Publication Date Title
CN109949821B (en) Method for removing reverberation of far-field voice by using U-NET structure of CNN
Qian et al. Very deep convolutional neural networks for noise robust speech recognition
Zhao et al. Monaural speech dereverberation using temporal convolutional networks with self attention
CN109887489B (en) Speech dereverberation method based on depth features for generating countermeasure network
Li et al. Two heads are better than one: A two-stage complex spectral mapping approach for monaural speech enhancement
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
Tan et al. Real-time speech enhancement using an efficient convolutional recurrent network for dual-microphone mobile phones in close-talk scenarios
CN110085245B (en) Voice definition enhancing method based on acoustic feature conversion
Pandey et al. Self-attending RNN for speech enhancement to improve cross-corpus generalization
CN108172238A (en) A kind of voice enhancement algorithm based on multiple convolutional neural networks in speech recognition system
Tawara et al. Multi-Channel Speech Enhancement Using Time-Domain Convolutional Denoising Autoencoder.
Nikzad et al. Deep residual-dense lattice network for speech enhancement
CN106328123B (en) Method for recognizing middle ear voice in normal voice stream under condition of small database
CN110047478B (en) Multi-channel speech recognition acoustic modeling method and device based on spatial feature compensation
Yuliani et al. Speech enhancement using deep learning methods: A review
Zezario et al. Self-supervised denoising autoencoder with linear regression decoder for speech enhancement
Wu et al. Increasing compactness of deep learning based speech enhancement models with parameter pruning and quantization techniques
CN110728991B (en) Improved recording equipment identification algorithm
CN112885375A (en) Global signal-to-noise ratio estimation method based on auditory filter bank and convolutional neural network
Kothapally et al. Skipconvgan: Monaural speech dereverberation using generative adversarial networks via complex time-frequency masking
Geng et al. End-to-end speech enhancement based on discrete cosine transform
CN110867178B (en) Multi-channel far-field speech recognition method
CN115424627A (en) Voice enhancement hybrid processing method based on convolution cycle network and WPE algorithm
CN110544485A (en) method for performing far-field speech dereverberation by using SE-ED network of CNN
Elshamy et al. DNN-based cepstral excitation manipulation for speech enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant