CN113707164A - Voice enhancement method for improving multi-resolution residual error U-shaped network - Google Patents


Info

Publication number
CN113707164A
Authority
CN
China
Prior art keywords
voice
speech
network
amplitude spectrum
noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111026177.5A
Other languages
Chinese (zh)
Inventor
兰朝风
刘春东
周贤武
韩玉兰
郭小霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology
Priority to CN202111026177.5A
Publication of CN113707164A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L15/02 — Feature extraction for speech recognition; selection of recognition unit
    • G10L15/063 — Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L25/27 — Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 — Analysis techniques using neural networks
    • G10L25/45 — Analysis techniques characterised by the type of analysis window
    • G10L25/69 — Analysis techniques specially adapted for evaluating synthetic or decoded voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A speech enhancement method based on an improved multi-resolution residual U-shaped network, belonging to the field of speech enhancement. The improvements aim to make the traditional multi-resolution residual U-shaped network better suited to the speech enhancement task in the time-frequency domain, to remedy its weak ability to restore speech detail at low signal-to-noise ratio in the decoding stage, which easily loses speech features, and to change the convolution kernel size to suit the characteristic that the width of the speech feature map obtained after a speech signal is transformed to the time-frequency domain is far greater than its height. The invention comprises the following steps: S1, applying the short-time Fourier transform to the clean speech and the noisy speech to obtain the magnitude spectra of the two speech signals; S2, taking the noisy-speech magnitude spectrum as the network input and the clean-speech magnitude spectrum as the training target, and fitting the nonlinear relationship between the network input and the training target with the improved multi-resolution residual U-shaped network to obtain a speech enhancement model based on the improved network; S3, obtaining the magnitude spectrum of the noisy speech through the STFT and passing it through the improved multi-resolution residual U-shaped network model to obtain the magnitude spectrum of the target speech; S4, combining this magnitude spectrum with the phase of the noisy speech and performing waveform reconstruction to obtain the enhanced speech.

Description

Voice enhancement method for improving multi-resolution residual error U-shaped network
Technical Field
The invention relates to deep neural networks, in particular to a speech enhancement method based on an improved multi-resolution residual U-shaped network, and belongs to the field of speech enhancement.
Background
Single-channel speech enhancement is an interesting and challenging technique whose main aims are to improve speech quality, enhance speech intelligibility, and make the target speech clearer in a noisy environment. Because of this practical value it has many engineering applications, for example in hearing aids, communication equipment and robust speech recognition, in all of which single-channel speech enhancement plays an important role.
Single-channel algorithms can be divided into supervised and unsupervised speech enhancement algorithms. Unsupervised speech enhancement algorithms focus on modelling the noise, and most of them require prior conditions. In 1978 Lim and Oppenheim introduced Wiener filtering into the field of speech enhancement: assuming the noise is stationary, a transfer function conditioned on the minimum mean square error is constructed from the estimated power spectra of the noisy speech and the noise. This method, however, does not filter out noise effectively. Boll et al. proposed spectral subtraction in 1979. Assuming the noise is stationary, additive and uncorrelated with the speech signal, the noise segments in the speech signal are first found by speech endpoint detection; the power spectrum of the noise segments is then estimated; finally, the estimated noise power spectrum is subtracted from the noisy-speech power spectrum to obtain the clean-speech power spectrum. Spectral subtraction is called the most classical speech enhancement algorithm because it is not only computationally simple but also efficient at handling wideband noise; its assumptions, however, are too simple, and the estimation introduces "musical noise". Speech enhancement algorithms based on statistical models assume the noise is Gaussian distributed. For example, Ephraim and Malah proposed the minimum mean-square error short-time spectral amplitude estimator (MMSE-STSA) in 1984 and the minimum mean-square error log-spectral amplitude estimator in 1985. As with Wiener filtering, when MMSE fails to estimate the a priori SNR correctly, its ability to filter background noise is reduced.
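In symbols, writing $|Y(\omega)|^2$ for the noisy-speech power spectrum and $|\hat{D}(\omega)|^2$ for the noise power spectrum estimated from the detected noise segments, spectral subtraction recovers the clean-speech power spectrum as (a standard textbook formulation, not quoted from the patent):

$$|\hat{S}(\omega)|^{2} = \max\!\left(|Y(\omega)|^{2} - |\hat{D}(\omega)|^{2},\; 0\right)$$

The flooring at zero is what leaves behind the isolated spectral peaks perceived as "musical noise".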
With the continuous development of deep learning, supervised speech enhancement algorithms have gradually advanced. Speech enhancement techniques based on deep learning estimate the features of clean speech by training the nonlinear relationship between noisy and clean speech. Maas et al. improved the robustness of speech recognition by building a speech enhancement model implemented with a deep recurrent denoising autoencoder network. Attabi et al. proposed a two-stage speech enhancement method that first generates an initial gain function using a non-negative matrix factorization algorithm and then uses a DNN to estimate a modified gain function from the initial one, which is more robust to noise interference. Tu et al. improved the DNN along the lines of residual networks and proposed the Skip-DNN speech enhancement model, which carries more speech detail information during training and alleviates problems such as vanishing gradients and the model non-identifiability caused by singularity. Fu et al. proposed a signal-to-noise-ratio-aware speech enhancement model that works in two stages: perceiving the signal-to-noise ratio, then selecting the appropriate CNN denoising model according to the SNR level. Gao et al., building on a DNN-based progressive learning framework, proposed a progressive learning framework based on long short-term memory networks and applied it to speech enhancement, enabling the model to learn rich target information and alleviating the problem of information loss. Donahue et al., working in the frequency domain, proposed a speech enhancement model based on generative adversarial networks whose robust recognition in noisy speech environments is greatly improved compared with multi-style training. Ronneberger et al. proposed the U-shaped network (U_Net) in 2015 and applied it to medical image segmentation; it was subsequently also widely applied in the field of speech enhancement. Ernst et al. employed U_Net as the generator of a generative adversarial network in order to obtain a more intuitive loss function. Subsequently, Bulut et al. used U_Net alone to enhance single-channel speech signals, applied sub-pixel convolution layers, originally used for reconstructing high-resolution images, to recover the details of speech features, and used a smaller time window when processing the data, improving the enhancement effect and achieving low-latency speech enhancement; however, because a single window size is used, some inherent characteristics may be lost when extracting features. In 2020 Ibtehaz et al. proposed MultiResU_Net, derived from U_Net, and applied it to medical image segmentation; MultiResU_Net adopts windows of various sizes to extract features at multiple resolutions, so some inherent features are retained, and it has since also been applied to CT image segmentation. In 2016 an upsampling method called sub-pixel convolution was proposed to achieve image super-resolution, and in 2019 it was used for the speech super-resolution task, where the super-resolved speech signals showed noise robustness.
A speech signal is time-frequency transformed to obtain a time-frequency representation, which, when used as the network input, is an image; the speech enhancement task can therefore also be regarded as an image processing task. The time-frequency representation expresses pitch continuity, harmonic structure and formants, and image processing methods can exploit these structures to realize the enhancement task, so MultiResU_Net, designed for medical image segmentation, can be applied to the field of speech enhancement.
Disclosure of Invention
In order to make the traditional multi-resolution residual U-shaped network better suited to the speech enhancement task in the time-frequency domain, to remedy its weak ability to restore speech detail at low signal-to-noise ratio in the decoding stage, which easily loses speech features, and to change the convolution kernel size to suit the characteristic that the width of the speech feature map obtained after a speech signal is transformed to the time-frequency domain is far greater than its height, the invention provides a speech enhancement method based on an improved multi-resolution residual U-shaped network.
The speech enhancement method of the improved multi-resolution residual U-shaped network disclosed by the invention comprises the following steps:
S1, applying the Short-Time Fourier Transform (STFT) to the clean speech and the noisy speech to obtain the magnitude spectra of the two speech signals;
S2, taking the noisy-speech magnitude spectrum as the network input and the clean-speech magnitude spectrum as the training target, and fitting the nonlinear relationship between the network input and the training target with the improved multi-resolution residual U-shaped network to obtain a speech enhancement model based on the improved multi-resolution residual U-shaped network;
S3, obtaining the magnitude spectrum of the noisy speech through the STFT and passing it through the improved multi-resolution residual U-shaped network model to obtain the magnitude spectrum of the target speech;
S4, combining this magnitude spectrum with the phase of the noisy speech and performing waveform reconstruction to obtain the enhanced speech.
The method adopts a sub-pixel convolution layer to improve the upsampling process and thus recover detail in the network, uses a mixed channel to rearrange the output features of the residual path and the upsampling so as to improve information fusion, and thereby establishes an improved multi-resolution residual U-shaped network speech enhancement model. The time-frequency representation obtained by short-time Fourier transform of the speech signal is used as the network input. Compared with speech enhancement models based on the traditional MultiResU_Net, a fully convolutional neural network and the U-shaped network, the improved model achieves STOI, PESQ and SDR scores of 0.7271, 2.0571 and 8.5052 respectively at a signal-to-noise ratio of -5 dB: the STOI score is 5.0%, 13.9% and 10.6% higher than the other models; the PESQ score is 7.4%, 15.2% and 4.2% higher; and the SDR score is 13.7%, 60.0% and 38.3% higher. The improved multi-resolution residual U-shaped network speech enhancement model thus scores higher on the evaluation indices than the other models at different signal-to-noise ratios, has a better enhancement effect, and is particularly suitable for speech enhancement at low signal-to-noise ratio.
Drawings
FIG. 1 is a block diagram of a MultiResU_Net-based speech enhancement system;
FIG. 2 is a schematic diagram of the MultiResU_Net model;
FIG. 3 is a schematic diagram of the multi-resolution module;
FIG. 4 is a schematic diagram of the residual path module;
FIG. 5 is a diagram of the mixed-channel-enhanced MultiResU_Net;
FIG. 6 is a flow diagram of the mixing channel;
FIG. 7 is a diagram of the sub-pixel convolution architecture;
FIG. 8 illustrates an implementation of sub-pixel convolution;
FIG. 9 is a schematic diagram of the MultiResU_Net network parameters;
FIG. 10 shows the MultiResU_Net improvement results: panel (a) shows the PESQ scores, panel (b) the STOI scores, and panel (c) the SDR scores.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The invention is further described with reference to the following drawings and specific examples, which are not intended to be limiting.
The embodiment provides a speech enhancement method based on an improved multi-resolution residual U-shaped network, which involves three processing stages: feature extraction, model construction and training-target definition. The method of the present embodiment comprises:
s1, obtaining the magnitude spectrums of the two voice signals by Short-Time Fourier transform (STFT) of the pure and noisy voice;
s2, taking the noisy speech amplitude spectrum as the input of the network, and taking the pure speech amplitude spectrum as the training target; fitting a nonlinear relation between the network input and a training target through the improved multi-resolution residual U-shaped network to further obtain a voice enhancement model based on the improved multi-resolution residual U-shaped network;
s3, obtaining the amplitude spectrum of the voice with noise through STFT; the amplitude spectrum of the target voice can be obtained by passing the target voice through an improved multi-resolution residual U-shaped network model;
and S4, combining the amplitude spectrum with the phase of the voice with the noise, carrying out waveform reconstruction, and obtaining the enhanced voice after reconstruction.
Speech enhancement based on the MultiResU_Net is a mapping-based speech enhancement technique: the nonlinear relationship between the features of clean and noisy speech is trained through the MultiResU_Net, yielding a MultiResU_Net-based speech enhancement model. Compared with the U_Net network, which trains only a single speech feature, the MultiResU_Net can extract speech features at multiple resolutions and splice the extracted features into complementary speech features, so more inherent speech information is retained in a low signal-to-noise-ratio environment. This implementation builds a speech enhancement system block diagram on top of the MultiResU_Net network, as shown in FIG. 1.
As can be seen from FIG. 1, the enhanced speech is obtained through a training stage and an enhancement stage, with the following processing flow. In the training stage, the Short-Time Fourier Transform (STFT) is applied to the clean and noisy speech to obtain the magnitude spectra of the two speech signals; the noisy-speech magnitude spectrum is taken as the network input and the clean-speech magnitude spectrum as the training target; and the nonlinear relationship between the network input and the training target is fitted with the MultiResU_Net to obtain a MultiResU_Net-based speech enhancement model. In the enhancement stage, the magnitude spectrum of the noisy speech is obtained via the STFT and passed through the MultiResU_Net model to obtain the magnitude spectrum of the target speech; this magnitude spectrum is combined with the phase of the noisy speech, a process called waveform reconstruction, and the enhanced speech is obtained after reconstruction, as sketched below.
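A minimal PyTorch sketch of this analysis/reconstruction pipeline, using the STFT parameters given later in the embodiment (512-point Hanning window, 128-point shift, 16 kHz audio); the model call is left as a placeholder and the input waveform is a stand-in:

```python
import torch

N_FFT, HOP = 512, 128  # 32 ms Hanning window, 128-point shift (parameters from the embodiment)

def stft_magnitude_phase(wave: torch.Tensor):
    """Front end of both stages: STFT of a mono waveform -> (magnitude, phase)."""
    spec = torch.stft(wave, n_fft=N_FFT, hop_length=HOP,
                      window=torch.hann_window(N_FFT), return_complex=True)
    return spec.abs(), spec.angle()

def waveform_reconstruction(enhanced_mag, noisy_phase):
    """Combine the enhanced magnitude with the noisy phase and invert the STFT."""
    spec = torch.polar(enhanced_mag, noisy_phase)   # magnitude * exp(j * phase)
    return torch.istft(spec, n_fft=N_FFT, hop_length=HOP,
                       window=torch.hann_window(N_FFT))

# Enhancement stage: the magnitude goes through the model, the phase is reused.
noisy = torch.randn(16000 * 8)              # stand-in for an 8 s noisy utterance
mag, phase = stft_magnitude_phase(noisy)
enhanced_mag = mag                          # placeholder for model(mag)
enhanced = waveform_reconstruction(enhanced_mag, phase)
```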
Based on the MultiResU_Net network model commonly used in the field of image processing, the speech information is time-frequency transformed and then used as the network input, the network structure is improved, and a MultiResU_Net-based speech enhancement model is built. A schematic diagram of the MultiResU_Net model is shown in FIG. 2. The MultiResU_Net consists of an encoder, a decoder and a bottleneck stage: the encoder is composed of multi-resolution modules followed by downsampling, and the decoder of upsampling (realized by sub-pixel convolution in the improved network) followed by multi-resolution modules. The MultiResU_Net also contains cross-layer connections, structured as residual path modules consisting of several residual networks; to prevent the output features of the connected encoder and decoder from differing too much, the residual networks learn the encoding-stage features before these are fused with the decoding-stage features. The number of residual networks used in a residual path differs according to which multi-resolution modules of the encoder and decoder it connects. The activation function of the non-output layers of the model is ReLU, and the output layer uses Sigmoid. To reduce training time and counter the vanishing-gradient problem, the MultiResU_Net employs batch normalization after each layer.
The main role of the multi-resolution module in the MultiResU_Net is to construct multi-resolution image features, so the design of this module plays an important role in speech enhancement. By selecting windows of different sizes, the multi-resolution module can acquire features at different resolutions. A 5×5 convolutional layer is equivalent, in receptive field, to a stack of two 3×3 convolutional layers, which makes it possible to enlarge the receptive field without introducing excessive parameters. The present implementation therefore uses a cascade of three 3×5 convolutional layers; the multi-resolution module is shown schematically in FIG. 3. The input speech features pass through the cascade of three convolutional layers with 3×5 kernels to extract a low-resolution speech feature map; skip connections to the individual 3×5 convolutional layers yield higher-resolution speech feature maps; the different speech feature maps are spliced together; and finally a residual connection is established through one 1×1 convolutional layer.
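A minimal PyTorch sketch of such a multi-resolution module, under stated assumptions (the per-branch channel split, the padding and the normalization placement are illustrative choices, not taken from the patent):

```python
import torch
import torch.nn as nn

class MultiResBlock(nn.Module):
    """Multi-resolution module: three cascaded 3x5 convolutions whose outputs are
    concatenated (mixing receptive fields), plus a 1x1 residual connection."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        c = out_ch // 3  # illustrative split of the output channels across branches
        def unit(ci, co):
            return nn.Sequential(
                nn.Conv2d(ci, co, kernel_size=(3, 5), padding=(1, 2)),
                nn.BatchNorm2d(co),
                nn.ReLU(inplace=True))
        self.conv1 = unit(in_ch, c)
        self.conv2 = unit(c, c)
        self.conv3 = unit(c, out_ch - 2 * c)
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch))

    def forward(self, x):
        a = self.conv1(x)                  # smallest receptive field
        b = self.conv2(a)                  # medium receptive field (two 3x5 stacked)
        d = self.conv3(b)                  # largest receptive field (three 3x5 stacked)
        out = torch.cat([a, b, d], dim=1)  # splice the multi-resolution feature maps
        return torch.relu(out + self.shortcut(x))
```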
The main role of the residual path module in the MultiResU_Net is to supplement the decoding stage with the low-level information it lacks from the encoding stage, mitigating the difference in characteristics between the encoder and decoder. As can be seen from FIG. 2, the multi-resolution module D1 yields a relatively low-level feature map, whereas the multi-resolution module Dm with which it is fused yields a much deeper feature map; since the difference between the two feature maps is large, several residual networks are used to further process the low-level feature map and so obtain features with a small difference. When the difference between the two connected layers is small, the number of residual networks is reduced accordingly, which lowers the complexity of the network. The residual path block diagram is shown in FIG. 4. The residual path module is a chain of residual networks arranged between the encoder and the decoder, and the number of residual networks decreases as the encoder layer gets deeper. As can be seen from FIG. 2, when the depth of the connected multi-resolution modules at the encoder and decoder ends is m/2, one residual network connects them; at depth (m-2)/2 the number of residual networks is 2, and so on, so the numbers of residual networks used are 1, 2, …, m/2. The convolution kernels in the residual networks are 3×5, and the residual connections use 1×1 kernels.
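A corresponding sketch of a residual path (a chain of residual units with 3×5 convolutions and 1×1 shortcuts; the normalization and activation placement are assumptions):

```python
import torch
import torch.nn as nn

class ResPath(nn.Module):
    """Residual path: a chain of n residual units (3x5 convolution plus a 1x1
    shortcut) along the skip connection; n shrinks as the encoder gets deeper."""
    def __init__(self, channels: int, n_units: int):
        super().__init__()
        self.units = nn.ModuleList(
            nn.ModuleDict({
                "conv": nn.Sequential(
                    nn.Conv2d(channels, channels, kernel_size=(3, 5), padding=(1, 2)),
                    nn.BatchNorm2d(channels)),
                "skip": nn.Conv2d(channels, channels, kernel_size=1),
            })
            for _ in range(n_units))

    def forward(self, x):
        for u in self.units:
            x = torch.relu(u["conv"](x) + u["skip"](x))
        return x
```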
To improve the feature extraction capability of the network, this embodiment adopts a mixed-channel scheme to rearrange the channels of the feature map produced by the residual path and the feature map produced by upsampling at the decoder end. These two groups of features are obtained through different convolutional layers and are therefore only weakly related; so, when splicing them, the features are first divided into two groups and the corresponding channels are then arranged next to each other so that the two input channels become associated, which improves the information fusion capability and in turn the feature extraction capability of the network. The part of the MultiResU_Net structure improved by the hybrid channel is shown in FIG. 5, and the implementation flow of the mixing channel in FIG. 6.
There are two groups of input feature maps; assuming each group has N channels, height H and width W, the total number of channels is 2N. To achieve shuffling along the channel direction, the channel dimension is first split into two dimensions (2, N), these two dimensions are transposed to (N, 2), the channels are then flattened back to 2N, and the spliced features are finally processed by a multi-resolution module. The main task of sub-pixel convolution is to extract low-resolution image features and to realize upsampling by rearranging the feature pixels, thereby obtaining a high-resolution image. The architecture of the sub-pixel convolution is shown in FIG. 7: a low-resolution image is taken as input, two convolutional layers with stride 1 produce a feature map with r² channels, and the sub-pixel convolution layer spreads these channels spatially to obtain a high-resolution image, which realizes the upsampling in the MultiResU_Net. For 2× upsampling, r = 2, and the implementation of the sub-pixel convolution of FIG. 7 is shown in FIG. 8: for a 4×4 feature map whose number of channels equals r² (r = 2), sub-pixel convolution yields an 8×8 feature map. Because sub-pixel convolution can recover detail information in the network and the rearrangement step itself needs no parameters, this implementation adopts sub-pixel convolution to realize the upsampling in the MultiResU_Net, which handles the speech feature map better.
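Both rearrangements are pure tensor operations; a minimal sketch (channel counts are illustrative, and the sub-pixel step uses PyTorch's built-in `PixelShuffle`):

```python
import torch
import torch.nn as nn

def mix_channels(skip_feat: torch.Tensor, up_feat: torch.Tensor) -> torch.Tensor:
    """Channel shuffle: interleave the residual-path features with the upsampled
    decoder features so that corresponding channels sit next to each other."""
    b, n, h, w = skip_feat.shape                    # each group has N channels
    x = torch.stack([skip_feat, up_feat], dim=1)    # (B, 2, N, H, W)
    x = x.transpose(1, 2)                           # (B, N, 2, H, W)
    return x.reshape(b, 2 * n, h, w)                # flatten back to 2N channels

# Sub-pixel (pixel-shuffle) upsampling for r = 2: a convolution produces
# r^2 * C channels and PixelShuffle rearranges them into an r-times larger map.
C = 64  # illustrative channel count
upsample = nn.Sequential(
    nn.Conv2d(C, C * 2 ** 2, kernel_size=(3, 5), padding=(1, 2)),
    nn.PixelShuffle(2),   # (B, 4C, H, W) -> (B, C, 2H, 2W), parameter-free rearrangement
)
```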
Experimental data processing and analysis:
selection and setting of network parameters
The MultiResU_Net network parameters used in the experiments are shown in FIG. 9. The MultiResU_Net is trained with a mini-batch size of 64; the Adam algorithm is selected to optimize the training process, the initial learning rate is 1×10⁻⁴, the number of iterations is set to 100, and the MSE is selected as the loss function of the MultiResU_Net-based speech enhancement model.
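Under these settings, the training loop would look roughly as follows; the model and the data loader are stand-ins (a single convolution in place of the improved MultiResU_Net, synthetic magnitude patches in place of the corpus), and the reading of the 100 iterations as epochs is an assumption:

```python
import torch

# Stand-ins: one conv layer in place of the improved MultiResU_Net, and a
# synthetic loader of (noisy, clean) magnitude patches shaped (batch, 1, 16, 256).
model = torch.nn.Conv2d(1, 1, kernel_size=(3, 5), padding=(1, 2))
train_loader = [(torch.randn(64, 1, 16, 256), torch.randn(64, 1, 16, 256))
                for _ in range(4)]

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # initial learning rate 1e-4
criterion = torch.nn.MSELoss()                             # MSE loss, as stated

for epoch in range(100):                        # "100 iterations" read as epochs
    for noisy_mag, clean_mag in train_loader:   # mini-batches of size 64
        optimizer.zero_grad()
        loss = criterion(model(noisy_mag), clean_mag)  # fit the noisy -> clean mapping
        loss.backward()
        optimizer.step()
```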
The clean speech for this implementation is selected from the LibriSpeech ASR corpus, a roughly 1000-hour corpus of read English speech sampled at 16 kHz. 150 clean utterances with an average duration of about 8 s are selected from the corpus, 105 used as the training set and 45 as the test set. The noise is taken from the NOISEX-92 noise library, which contains 15 kinds of noise; this implementation selects 10 of them as noise signals, namely babble, buccaneer1, buccaneer2, destroyerengine, destroyerops, f16, factory, hfchannel, pink and white. The noise signals are downsampled to 16 kHz and mixed with the clean speech at signal-to-noise ratios of -5 dB, 0 dB, 5 dB and 10 dB to form noisy speech signals at different SNRs, giving a noisy-speech training set of 4200 utterances (105 × 10 × 4 = 4200). To study the generalization performance of the model, the test noise is randomly selected as 10 noises from the Nonspeech-100 library and mixed with the clean speech, again at SNRs of -5 dB, 0 dB, 5 dB and 10 dB, to form the noisy-speech test set of 1800 utterances (45 × 10 × 4 = 1800).
(II) Feature extraction for the experimental data and the MultiResU_Net model
The feature extraction of the MultiResU_Net-based speech enhancement model uses the STFT with a Hanning window; the window length is 512 points (32 ms) and the window shift is 128 points (8 ms). After removing the symmetric half, 257 STFT magnitude points remain; because the last STFT bin covers only a 31.25 Hz band of the signal and is of little importance, it is also removed, so 256 magnitude points are fed into the network. To ensure that the input sizes of the training and test sets are the same, segments of 16 frames are cut out and fed into the network, i.e. the network input size is 16 × 256 × 1.
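Following these parameters, the feature extraction can be sketched as below (the truncation to whole 16-frame patches is an assumption about how the segments are cut):

```python
import torch

def extract_patches(wave: torch.Tensor) -> torch.Tensor:
    """Magnitude-spectrum patches of shape (num_patches, 16, 256, 1)."""
    spec = torch.stft(wave, n_fft=512, hop_length=128,
                      window=torch.hann_window(512), return_complex=True)
    mag = spec.abs().T[:, :256]          # (frames, 257) -> drop the last 31.25 Hz bin
    frames = mag.shape[0] // 16 * 16     # keep a whole number of 16-frame segments
    return mag[:frames].reshape(-1, 16, 256, 1)

patches = extract_patches(torch.randn(16000 * 8))   # an 8 s utterance at 16 kHz
```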
(III) Comparative experiments based on the MultiResU_Net model
To recover detail information in the network during the upsampling process, a sub-pixel convolution layer is adopted in place of deconvolution, giving a MultiResU_Net speech enhancement model with sub-pixel convolution (MultiResU_Net-sub-pixel convolution); the channel-shuffle scheme is used to rearrange the channels of the residual path and the upsampling output, giving a mixed-channel MultiResU_Net speech enhancement model (MultiResU_Net-mixed channel); and an improved MultiResU_Net speech enhancement model (improved MultiResU_Net model) is constructed using both the sub-pixel convolution layer and the shuffled channels. At signal-to-noise ratios of -5 dB, 0 dB, 5 dB and 10 dB, the experiments analyse the speech enhancement effect of the traditional MultiResU_Net speech enhancement model, MultiResU_Net-sub-pixel convolution, MultiResU_Net-mixed channel and the improved MultiResU_Net model, evaluated with PESQ, STOI and SDR; the average scores of the 4 models on the evaluation indices at the different signal-to-noise ratios are shown in FIG. 10. From FIGS. 10(a), (b) and (c), the PESQ, STOI and SDR scores of the speech enhancement model based on the MultiResU_Net refined with the sub-pixel convolution layer and the mixed channel are all higher than those of the other speech enhancement models at signal-to-noise ratios of -5 dB, 0 dB, 5 dB and 10 dB. At a signal-to-noise ratio of -5 dB, the comparison of the different models shows that the PESQ score of the improved MultiResU_Net model is higher than those of the traditional MultiResU_Net model, MultiResU_Net-sub-pixel convolution and MultiResU_Net-mixed channel by 7.3%, 10.9% and 6.8% respectively; the STOI score by 5.0%, 3.9% and 3.0% respectively; and the SDR score by 13.7%, 5.8% and 3.2% respectively. From these results it can be seen that both the sub-pixel convolution layer and the mixed channel improve the enhancement effect of the MultiResU_Net, and their combination improves it most markedly.
(IV) Network parameter selection for the improved MultiResU_Net model
In the multi-resolution module of the MultiResU_Net, speech features at different resolutions are obtained through different window sizes (convolution kernel sizes). This experiment judges, by the PESQ, STOI and SDR scores of the enhanced speech and by the training time, which window size is more favourable for the improved MultiResU_Net based on sub-pixel convolution and the mixed channel. The convolution kernel sizes are set to 3×3, 3×5, 5×5 and 5×7; as known from section 3.1.2, the input speech feature map has dimension 16 × 256 × 1, with width far greater than height, so the kernels are set in two forms: equal height and width, and width greater than height. Meanwhile, to determine the influence of the network depth on the training result, experiments are carried out at network depths of 5 and 9: at depth 5 the height of the feature map is reduced to 1, reaching the bottleneck stage; at depth 9 both the height and the width of the feature map are reduced to 1, reaching the bottleneck stage. The speech evaluation scores at depths 5 and 9 are therefore compared in this section. The evaluation index scores are shown in Table 1.
Table 1. Comparison of different network parameters
(Table 1 is reproduced only as an image in the original publication.)
As can be seen from Table 1, at network depth 9 the PESQ, STOI and SDR scores are highest when the convolution kernel size is 3×5: PESQ is 11.0%, 11.3% and 1.9% higher than for the kernels of the other sizes, STOI is 0.9%, 9.1% and 0.3% higher, and SDR is 1.8%, 36.0% and 10.0% higher; training takes 116 s more per step for a 5×7 kernel than for a 3×5 kernel. It follows that a larger convolution kernel means a longer training time, but not a better and better speech enhancement. The bold data mark the best results obtained by the different kernel sizes at the same depth; as Table 1 shows, the bold data all come from kernels whose width is greater than their height, so the speech enhancement effect is better when the kernel is wider than it is tall. Comparing the two depths at a fixed kernel size, the evaluation index scores at depth 9 are mostly higher than at depth 5, but the training time is also longer: for the 3×5 kernel, training at depth 9 takes 29 s more than at depth 5. The best enhancement effect of the improved MultiResU_Net model therefore appears at network depth 9 with convolution kernel size 3×5.
(V) Speech enhancement effects of different network models
To verify the effectiveness of the MultiResU_Net in the field of speech enhancement, this experiment sets up several comparison experiments in environments with signal-to-noise ratios of -5 dB, 0 dB, 5 dB and 10 dB. The models compared are: the traditional MultiResU_Net model, the Fully CNN model [22], the U_Net model [17] and the improved MultiResU_Net model.
In the experiment, the network parameters of the Fully CNN and U_Net are those given in their original papers. For the improved MultiResU_Net model the network depth is 9 and the convolution kernel size 3×5. The results of evaluating the speech enhancement effect using PESQ, STOI and SDR are shown in Table 2.
Table 2. Evaluation of speech enhancement effect under different models
(Table 2 is reproduced only as an image in the original publication.)
As can be seen from Table 2, in the low-SNR environment the PESQ, STOI and SDR scores of the improved MultiResU_Net model proposed in this embodiment are mostly higher than those of the other models. Taking the -5 dB case as an example, the PESQ score of the improved MultiResU_Net model is 7.4%, 15.2% and 4.2% higher than those of the other models, the STOI score 5.0%, 13.9% and 10.6% higher, and the SDR score 13.7%, 60.0% and 38.3% higher.
Therefore, in a low signal-to-noise-ratio environment, the improved MultiResU_Net model achieves a better enhancement effect than the traditional MultiResU_Net, Fully CNN and U_Net models. The speech enhancement model based on the MultiResU_Net improved with sub-pixel convolution and the mixed channel thus enhances noisy speech better, and its enhancement capability is especially evident for speech signals with low signal-to-noise ratio.
Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims. It should be understood that features described in different dependent claims and herein may be combined in ways different from those described in the original claims. It is also to be understood that features described in connection with individual embodiments may be used in other described embodiments.

Claims (4)

1. A speech enhancement method of an improved multi-resolution residual U-shaped network, characterized in that inherent speech features are retained more completely, fewer speech characteristics are lost in training, the information fusion capability is strong, and network detail is recovered better; the method comprises the following steps:
s1, obtaining the magnitude spectrums of the two voice signals by Short-Time Fourier transform (STFT) of the pure and noisy voice;
s2, taking the noisy speech amplitude spectrum as the input of the network, and taking the pure speech amplitude spectrum as the training target; fitting a nonlinear relation between the network input and a training target through the improved multi-resolution residual U-shaped network to further obtain a voice enhancement model based on the improved multi-resolution residual U-shaped network;
s3, obtaining the amplitude spectrum of the voice with noise through STFT; the amplitude spectrum of the target voice can be obtained by passing the target voice through an improved multi-resolution residual U-shaped network model;
and S4, combining the amplitude spectrum with the phase of the voice with the noise, carrying out waveform reconstruction, and obtaining the enhanced voice after reconstruction.
2. The speech enhancement method according to claim 1, wherein a mixed-channel scheme is adopted to rearrange the channels of the speech features output by the residual path and the speech features upsampled at the decoder end, so as to promote the fusion of information between channels and further improve the network's ability to extract features.
3. The speech enhancement method according to claim 1, wherein the window size (the size of the convolution kernel) is varied to obtain windows of different heights and widths, matching the speech feature map, whose width is substantially greater than its height.
4. The speech enhancement method of claim 1 wherein the upsampling task in the MultiResU _ Net is implemented using a subpixel convolution layer for recovering detail information in the network.
CN202111026177.5A 2021-09-02 2021-09-02 Voice enhancement method for improving multi-resolution residual error U-shaped network Pending CN113707164A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111026177.5A CN113707164A (en) 2021-09-02 2021-09-02 Voice enhancement method for improving multi-resolution residual error U-shaped network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111026177.5A CN113707164A (en) 2021-09-02 2021-09-02 Voice enhancement method for improving multi-resolution residual error U-shaped network

Publications (1)

Publication Number Publication Date
CN113707164A true CN113707164A (en) 2021-11-26

Family

ID=78657357

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111026177.5A Pending CN113707164A (en) 2021-09-02 2021-09-02 Voice enhancement method for improving multi-resolution residual error U-shaped network

Country Status (1)

Country Link
CN (1) CN113707164A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842863A (en) * 2022-04-19 2022-08-02 电子科技大学 Signal enhancement method based on multi-branch-dynamic merging network
CN115497496A (en) * 2022-09-22 2022-12-20 东南大学 FirePS convolutional neural network-based voice enhancement method
CN117611442A (en) * 2024-01-19 2024-02-27 第六镜科技(成都)有限公司 Near infrared face image generation method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110610717A (en) * 2019-08-30 2019-12-24 西南电子技术研究所(中国电子科技集团公司第十研究所) Separation method of mixed signals in complex frequency spectrum environment
CN111081268A (en) * 2019-12-18 2020-04-28 浙江大学 Phase-correlated shared deep convolutional neural network speech enhancement method
CN112927709A (en) * 2021-02-04 2021-06-08 武汉大学 Voice enhancement method based on time-frequency domain joint loss function
CN113223001A (en) * 2021-05-07 2021-08-06 西安智诊智能科技有限公司 Image segmentation method based on multi-resolution residual error network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110610717A (en) * 2019-08-30 2019-12-24 西南电子技术研究所(中国电子科技集团公司第十研究所) Separation method of mixed signals in complex frequency spectrum environment
CN111081268A (en) * 2019-12-18 2020-04-28 浙江大学 Phase-correlated shared deep convolutional neural network speech enhancement method
CN112927709A (en) * 2021-02-04 2021-06-08 武汉大学 Voice enhancement method based on time-frequency domain joint loss function
CN113223001A (en) * 2021-05-07 2021-08-06 西安智诊智能科技有限公司 Image segmentation method based on multi-resolution residual error network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ERNST O.; CHAZAN S. E.; GANNOT S. et al.: "Speech dereverberation using fully convolutional networks", 2018 26th European Signal Processing Conference (EUSIPCO), Rome, Italy, pages 390 - 394 *
RONNEBERGER O. et al.: "U-Net: Convolutional networks for biomedical image segmentation", International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234 - 241 *
SHI WENHUA; ZHANG XIONGWEI; ZOU XIA; SUN MENG; LI LI: "Single-channel speech enhancement combining a deep encoder-decoder network and time-frequency masking estimation", Acta Acustica (声学学报), no. 003, pages 299 - 307 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842863A (en) * 2022-04-19 2022-08-02 电子科技大学 Signal enhancement method based on multi-branch-dynamic merging network
CN114842863B (en) * 2022-04-19 2023-06-02 电子科技大学 Signal enhancement method based on multi-branch-dynamic merging network
CN115497496A (en) * 2022-09-22 2022-12-20 东南大学 FirePS convolutional neural network-based voice enhancement method
CN115497496B (en) * 2022-09-22 2023-11-14 东南大学 Voice enhancement method based on FirePS convolutional neural network
CN117611442A (en) * 2024-01-19 2024-02-27 第六镜科技(成都)有限公司 Near infrared face image generation method

Similar Documents

Publication Publication Date Title
CN113707164A (en) Voice enhancement method for improving multi-resolution residual error U-shaped network
Ernst et al. Speech dereverberation using fully convolutional networks
Yin et al. Phasen: A phase-and-harmonics-aware speech enhancement network
CN110739002B (en) Complex domain speech enhancement method, system and medium based on generation countermeasure network
CN109841226B (en) Single-channel real-time noise reduction method based on convolution recurrent neural network
CN111653288B (en) Target person voice enhancement method based on conditional variation self-encoder
CN109890043B (en) Wireless signal noise reduction method based on generative countermeasure network
CN110148420A (en) A kind of audio recognition method suitable under noise circumstance
Gurbuz et al. Application of affine-invariant Fourier descriptors to lipreading for audio-visual speech recognition
CN110490816B (en) Underwater heterogeneous information data noise reduction method
CN113823308B (en) Method for denoising voice by using single voice sample with noise
CN111899750A (en) Speech enhancement algorithm combining cochlear speech features and hopping deep neural network
CN110675888A (en) Speech enhancement method based on RefineNet and evaluation loss
Li et al. Deeplabv3+ vision transformer for visual bird sound denoising
CN117496990A (en) Speech denoising method, device, computer equipment and storage medium
CN114613384B (en) Deep learning-based multi-input voice signal beam forming information complementation method
CN115295002A (en) Single-channel speech enhancement method based on interactive time-frequency attention mechanism
Hwang et al. Monoaural Speech Enhancement Using a Nested U-Net with Two-Level Skip Connections.
CN113035217A (en) Voice enhancement method based on voiceprint embedding under low signal-to-noise ratio condition
Kar et al. Convolutional Neural Network for Removal of Environmental Noises from Acoustic Signal
CN115472153A (en) Voice enhancement system, method, device and equipment
Lee et al. Speech enhancement with MAP estimation and ICA-based speech features
CN113327633A (en) Method and device for detecting noisy speech endpoint based on deep neural network model
KR100329596B1 (en) Text-Independent Speaker Identification Using Telephone Speech
CN112907456A (en) Deep neural network image denoising method based on global smooth constraint prior model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination