CN113421581B - Real-time voice noise reduction method for jump network - Google Patents
- Publication number
- CN113421581B (application CN202110971215.8A)
- Authority
- CN
- China
- Prior art keywords
- signal
- noise
- audio
- voice
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/14—Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0264—Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
Abstract
The invention discloses a real-time speech noise reduction method for a skip network, based on a multi-layer short-time Fourier transform loss function, comprising the following steps: construct an audio training set for network training using frequency-band masking and signal-reverberation data augmentation; construct a lightweight skip U-Net network structure; train the model with a multi-layer short-time Fourier transform loss function, and denoise with the trained model. The invention lightens the model by adopting a skip U-Net structure, and greatly improves the model's generalization to different noise types by combining the loss function based on multi-layer short-time Fourier transforms with data augmentation such as noise shifting and signal reverberation.
Description
Technical Field
The present invention relates to speech noise reduction methods, and more particularly to a speech noise reduction method based on a skip network.
Background
Speech enhancement has long been an active research field with great practical value in daily life, for example in video conferencing and voice calls, where speech enhancement and noise reduction can greatly improve call quality. Traditional speech noise reduction methods rely mainly on spectral subtraction and statistical models; these algorithms perform poorly on non-stationary noise. Classical approaches such as Wiener filtering struggle with non-stationary noise and multi-speaker babble. Later deep-neural-network denoising methods improved on them, but their processing speed is slow, which limits their effectiveness in practical applications.
In recent years, with the development of deep learning, deep networks have also been applied to audio noise reduction with good results. However, common deep neural networks have large parameter counts and complex models, so audio processing takes a long time.
Disclosure of Invention
In view of the above technical problems, an object of the present invention is to provide a speech denoising method for a skip network that employs a lighter network structure, feeds noisy audio to the input layer as the network input, and uses the corresponding clean, noise-free audio as the supervised training target.
The technical scheme of the invention is as follows:
A real-time speech noise reduction method for a skip network, based on a multi-layer short-time Fourier transform loss function, characterized by comprising the following steps:
S1: construct an audio training set for network training using frequency-band masking and signal-reverberation data augmentation, where frequency-band masking passes the audio through a band-stop filter to remove part of its frequency content, and signal reverberation adds continuously attenuated and delayed copies of the audio back into the original audio;
S2: construct a lightweight skip U-Net network structure, obtaining features with different channel counts through convolution and transposed convolution, and connecting and adding features with the same channel count, so that the lightweight U-Net learns relationships between high-level and low-level features simultaneously;
S3: use the multi-layer short-time Fourier transform loss function together with the mean absolute error as the model loss, train the model with the Adam optimization algorithm, and denoise with the trained model.
The preferred technical solution of the present invention is that the audio training set for network training in step S1 is constructed through the following steps:
S101: acquire clean speech signals and noise signals as model training data from the Valentini dataset and the DNS2020 reference dataset;
S102: superpose several noise signals to obtain a mixed noise signal;
S103: randomly crop and combine the mixed noise signal and the speech signal to obtain a noisy speech signal;
S104: delay and attenuate the speech signal and the original noise signal, and add them to the noisy speech signal to obtain a reverberant noisy speech signal.
The preferable technical solution of the present invention is that step S2 specifically includes:
S201: construct an encoding module: pass the audio signal through a one-dimensional convolution module, zero out values below zero with the ReLU activation function, continue with a convolution whose kernel doubles the channel count, and finally obtain the encoded signal through a gated linear unit;
S202: process the encoded signal with an LSTM signal processing module, where the LSTM signal processing module is built from a unidirectional or a bidirectional LSTM network;
S203: construct a decoding module: after the encoded signal has been processed by the LSTM signal processing module, reduce the channel count with a one-dimensional convolution module, process the signal with a gated linear unit, and finally obtain the speech-enhanced audio through a one-dimensional transposed convolution module;
S204: connect each encoder module to the decoder module whose input channel count equals the encoder module's output channel count, constructing the lightweight skip U-Net network structure.
The preferable technical solution of the present invention is that step S3 specifically includes:
S302: construct the STFT loss by applying short-time Fourier transforms with different parameters to the input noisy signal and the clean audio signal respectively;
s303: inputting the model parameters of the coding module, the decoding module and the LSTM into an Adam optimizer for optimization learning, and training a final model;
s304: and directly inputting the voice signal with noise into the trained final model to obtain the voice enhanced voice signal.
The preferred technical scheme of the invention is that the formula of the gated linear unit is as follows:

$$\mathrm{GLU}(X) = (X \ast W + b) \otimes \sigma(X \ast V + c)$$

where X is the output of the convolution module, W, b, V and c are all learnable parameters, ⊗ is the element-wise product, and σ(·) is the sigmoid function.
The preferred technical solution of the present invention is that the formula of the mean-absolute-error loss function is as follows:

$$\mathcal{L}_{1} = \frac{1}{T}\sum_{t=1}^{T}\left|y_{t}-\hat{y}_{t}\right|$$

where y is the clean speech signal, ŷ is the enhanced speech signal, T is the audio length, and the value of T varies from sample to sample.
The preferred technical solution of the present invention is that the formula of the multi-layer short-time Fourier transform loss function is as follows:

$$\mathcal{L}_{stft}=\sum_{m=1}^{3}\frac{1}{T}\left\|\left|\mathrm{STFT}_{m}(y)\right|-\left|\mathrm{STFT}_{m}(\hat{y})\right|\right\|_{1}$$

where STFT(·) is the short-time Fourier transform, y is the clean speech signal, ŷ is the enhanced speech signal, and T is the audio length, which varies between samples; the numbers of Fourier transform points are chosen as 512, 1024 and 2048, with corresponding frame shifts of 50, 120 and 240 and window lengths of 240, 600 and 1200.
Compared with the prior art, the invention has the following beneficial effects:
the invention can obtain better voice noise reduction effect, and has the advantages of small distortion, strong generalization capability and good noise reduction effect.
Drawings
To illustrate the embodiments of the present invention or the prior-art technical solutions more clearly, the drawings needed in their description are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a voice denoising method for a hop network according to an embodiment of the present invention;
fig. 2 is a flowchart of a voice noise reduction method for a hop network according to embodiment 1 of the present invention;
FIG. 3 is a schematic diagram of a hop denoising network;
fig. 4 is a schematic diagram of an LSTM network.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention.
As shown in figs. 1-2, the speech noise reduction method according to the embodiment of the present invention is based on a multi-layer short-time Fourier transform loss function and includes:
S1: construct an audio training set for network training using frequency-band masking and signal-reverberation data augmentation, where frequency-band masking passes the audio through a band-stop filter to remove part of its frequency content, and signal reverberation adds continuously attenuated and delayed copies of the audio back into the original audio;
S2: construct a lightweight skip U-Net network structure, as shown in fig. 3, obtaining features with different channel counts through convolution and transposed convolution, and connecting and adding features with the same channel count, so that the model learns relationships between high-level and low-level features simultaneously and achieves a better result;
S3: use the multi-layer short-time Fourier transform loss function together with the mean absolute error as the model loss, train the model with the Adam optimization algorithm, and denoise with the trained model.
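The frequency-band masking of S1 can be sketched with a standard band-stop filter. Here a Butterworth design from scipy stands in for whichever filter the patent actually uses, and the band edges (1–2 kHz) and filter order are assumed for illustration:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def band_mask(audio, sr=16000, low_hz=1000.0, high_hz=2000.0, order=6):
    """Frequency-band masking: pass the audio through a Butterworth
    band-stop filter to remove part of its spectrum.  Band edges and
    order are illustrative, not taken from the patent."""
    sos = butter(order, [low_hz, high_hz], btype="bandstop", fs=sr, output="sos")
    return sosfilt(sos, audio)

# A tone in the middle of the stopped band is strongly attenuated.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1500.0 * t)
masked = band_mask(tone, sr)
```

Applying this to randomly chosen frequency bands of the training audio yields the masked variants used for augmentation.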
The audio training set for network training in step S1 is constructed through the following steps:
S101: acquire clean speech signals and noise signals as model training data from the Valentini dataset and the DNS2020 reference dataset;
Here, Valentini is a training dataset for speech enhancement and speech synthesis provided by the Centre for Speech Technology Research at the University of Edinburgh, and DNS2020 is the dataset of Microsoft's Deep Noise Suppression challenge, which provides a large amount of clean speech and noise signals.
S102: superpose several noise signals to obtain a mixed noise signal;
S103: randomly crop and combine the mixed noise signal and the speech signal to obtain a noisy speech signal;
S104: delay and attenuate the speech signal and the original noise signal, and add them to the noisy speech signal to obtain a reverberant noisy speech signal.
The audio training set in step S1 thus contains audio data and noise data, from which various kinds of noisy audio and the corresponding clean audio used for supervision are synthesized.
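The signal-reverberation augmentation of S104 can be sketched minimally as follows: delayed, progressively attenuated copies of the signal are added back onto it. The delay lengths (in samples) and the decay factor are assumed values, since the patent does not specify them:

```python
import numpy as np

def add_reverb(signal, delays=(800, 1600, 2400), decay=0.5):
    """Simulated reverberation: add progressively attenuated, delayed
    copies of the signal to itself.  Delays and decay are hypothetical."""
    out = signal.astype(float).copy()
    gain = decay
    for d in delays:
        out[d:] += gain * signal[:-d]   # echo arriving d samples later
        gain *= decay                   # each later echo is weaker
    return out

noisy = np.random.default_rng(0).standard_normal(16000)  # 1 s noisy-speech stand-in
reverberant = add_reverb(noisy)
```

Note that samples before the first echo arrives are unchanged, which keeps the direct-path signal intact.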
Step S2 specifically includes:
S201: construct an encoding module: the audio signal passes through a one-dimensional convolution module, is processed by the ReLU activation function, is further convolved with a kernel that doubles the channel count, and finally passes through a gated linear unit to obtain the encoded signal;
The ReLU activation function is:

$$\mathrm{relu}(x) = \max(0, x)$$
the formula for the gated linear cell is as follows:
where X is the output of the convolution module, W, b, V, c are all learnable parameters, ⨂ is the element product, and σ (-) is the sigmoid function.
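A numerical sketch of the gated linear unit above; the feature width and batch size are arbitrary, and plain matrix products stand in for the convolutions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu(x, w, b, v, c):
    """Gated linear unit: (XW + b) elementwise-multiplied by sigmoid(XV + c),
    with w, b, v, c playing the role of the learnable parameters."""
    return (x @ w + b) * sigmoid(x @ v + c)

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8))                  # 4 feature vectors of width 8
w, v = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
b, c = np.zeros(8), np.zeros(8)
y = glu(x, w, b, v, c)
```

With the gate saturated open (large bias c), the unit reduces to the plain linear branch, which is what makes the gate an information-flow control.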
S202: construct the encoded-signal processing module from a unidirectional or a bidirectional LSTM network, and process the encoded signal with it. A long short-term memory (LSTM) network is a special RNN model whose structural design avoids the long-term dependency problem. In fig. 4, σ denotes the sigmoid function, tanh the hyperbolic tangent function, and + element-wise vector addition. On the left, the input x and the previous hidden state h feed a sigmoid that controls forgetting of the previous cell state; in the middle, x and h pass through sigmoid and tanh functions to decide which new information is retained to update the cell state; the rightmost part combines x, h and the cell state to produce the unit output h.
S203: construct a decoding module as follows: after the encoded signal has been processed by the LSTM signal processing module, first reduce the channel count with a one-dimensional convolution module, then process the signal with a gated linear unit, and finally obtain the speech-enhanced audio through a one-dimensional transposed convolution module;
S204: connect each encoder module to the decoder module whose input channel count equals the encoder module's output channel count, thereby constructing the skip network structure.
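The channel bookkeeping of the skip connections in S204 can be illustrated at shape level. Pooling and repetition stand in for the real convolutions and transposed convolutions, so this is a structural sketch only, not the patent's network:

```python
import numpy as np

def down(x):
    """Stand-in for a stride-2 convolution that doubles the channel count:
    (C, T) -> (2C, T/2)."""
    pooled = x.reshape(x.shape[0], -1, 2).mean(axis=2)    # halve the time axis
    return np.concatenate([pooled, pooled], axis=0)       # double the channels

def up(x):
    """Stand-in for a transposed convolution that halves the channel count:
    (2C, T) -> (C, 2T)."""
    half = 0.5 * (x[: x.shape[0] // 2] + x[x.shape[0] // 2:])
    return np.repeat(half, 2, axis=1)

def skip_unet(x, depth=3):
    """Encoder features are saved and added to the decoder features with
    the same channel count -- the skip connections of S204."""
    skips = []
    for _ in range(depth):
        skips.append(x)
        x = down(x)
    for _ in range(depth):
        x = up(x)
        x = x + skips.pop()      # channel counts match by construction
    return x

signal = np.random.default_rng(2).standard_normal((1, 64))   # (channels, time)
out = skip_unet(signal)
```

Because each decoder stage exactly mirrors an encoder stage, the output recovers the input shape, which is what lets the network map waveform to waveform.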
Step S3 specifically includes:
S301: construct the mean-absolute-error loss between the enhanced speech signal and the clean speech signal:

$$\mathcal{L}_{1} = \frac{1}{T}\sum_{t=1}^{T}\left|y_{t}-\hat{y}_{t}\right|$$

where y is the clean speech signal, ŷ is the enhanced speech signal, T is the audio length, and the value of T varies from sample to sample.
S302: construct the STFT loss by applying short-time Fourier transforms with different parameters to the enhanced speech signal and the clean speech signal respectively:

$$\mathcal{L}_{stft}=\sum_{m=1}^{3}\frac{1}{T}\left\|\left|\mathrm{STFT}_{m}(y)\right|-\left|\mathrm{STFT}_{m}(\hat{y})\right|\right\|_{1}$$

where STFT(·) is the short-time Fourier transform, y is the clean speech signal, ŷ is the enhanced speech signal, and T is the audio length, which varies between samples; the numbers of Fourier transform points are chosen as 512, 1024 and 2048, with corresponding frame shifts of 50, 120 and 240 and window lengths of 240, 600 and 1200.
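With the three parameter sets named above, the multi-layer STFT loss can be sketched as follows. The mean L1 distance between magnitude spectrograms is one plausible reading; the patent's exact expression is not reproduced in this text, so this specific form is an assumption:

```python
import numpy as np
from scipy.signal import stft

# (FFT points, frame shift, window length) -- the three parameter
# sets given in the patent text.
STFT_CONFIGS = [(512, 50, 240), (1024, 120, 600), (2048, 240, 1200)]

def multi_stft_loss(clean, enhanced, fs=16000):
    """Multi-layer STFT loss sketch: mean L1 distance between magnitude
    spectrograms, summed over the three STFT resolutions."""
    total = 0.0
    for n_fft, hop, win in STFT_CONFIGS:
        _, _, sc = stft(clean, fs=fs, nperseg=win, noverlap=win - hop, nfft=n_fft)
        _, _, se = stft(enhanced, fs=fs, nperseg=win, noverlap=win - hop, nfft=n_fft)
        total += float(np.mean(np.abs(np.abs(sc) - np.abs(se))))
    return total

rng = np.random.default_rng(3)
clean = rng.standard_normal(16000)
enhanced = clean + 0.1 * rng.standard_normal(16000)
```

Comparing magnitudes at several resolutions penalizes both fine spectral detail (long windows) and temporal detail (short windows) at once.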
S303: inputting the model parameters of the coding module, the decoding module and the LSTM into an Adam optimizer for optimization learning, and training a final model;
s304: and directly inputting the voice signal with the noise into the network to obtain the voice enhanced voice signal.
The target sampling rate of the audio data in the dataset constructed in the present invention is 16 kHz. Audio with a different sampling rate is first resampled to the target rate and then fed directly into the network to obtain the speech-enhanced audio. Through the above embodiments, the invention achieves a better speech noise reduction effect, with low distortion, strong generalization ability and good denoising performance.
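The resampling step before inference can be done with a polyphase resampler, for example:

```python
import numpy as np
from scipy.signal import resample_poly

def to_target_rate(audio, sr, target_sr=16000):
    """Resample audio of any sampling rate to the 16 kHz target rate
    expected by the network."""
    if sr == target_sr:
        return audio
    g = int(np.gcd(sr, target_sr))               # reduce the up/down ratio
    return resample_poly(audio, target_sr // g, sr // g)

audio_48k = np.random.default_rng(4).standard_normal(48000)   # 1 s at 48 kHz
audio_16k = to_target_rate(audio_48k, 48000)
```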
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.
Claims (7)
1. A real-time speech noise reduction method for a skip network, based on a multi-layer short-time Fourier transform loss function, characterized by comprising the following steps:
S1: construct an audio training set for network training using frequency-band masking and signal-reverberation data augmentation, where frequency-band masking passes the audio through a band-stop filter to remove part of its frequency content, and signal reverberation adds continuously attenuated and delayed copies of the audio back into the original audio;
S2: construct a lightweight skip U-Net network structure, obtaining features with different channel counts through convolution and transposed convolution, and connecting and adding features with the same channel count, so that the lightweight U-Net learns relationships between high-level and low-level features simultaneously;
2. The real-time speech noise reduction method according to claim 1, wherein constructing the audio training set for network training in step S1 comprises the following steps:
S101: acquire clean speech signals and noise signals as model training data from the Valentini dataset and the DNS2020 reference dataset;
S102: superpose several noise signals to obtain a mixed noise signal;
S103: randomly crop and combine the mixed noise signal and the speech signal to obtain a noisy speech signal;
S104: delay and attenuate the speech signal and the original noise signal, and add them to the noisy speech signal to obtain a reverberant noisy speech signal.
3. The real-time speech noise reduction method according to claim 2, wherein step S2 specifically includes:
S201: construct an encoding module: pass the audio signal through a one-dimensional convolution module, zero out values below zero with the ReLU activation function, continue with a convolution whose kernel doubles the channel count, and finally obtain the encoded signal through a gated linear unit;
S202: process the encoded signal with an LSTM signal processing module, where the LSTM signal processing module is built from a unidirectional or a bidirectional LSTM network;
S203: construct a decoding module: after the encoded signal has been processed by the LSTM signal processing module, reduce the channel count with a one-dimensional convolution module, process the signal with a gated linear unit, and finally obtain the speech-enhanced audio through a one-dimensional transposed convolution module;
S204: connect each encoder module to the decoder module whose input channel count equals the encoder module's output channel count, constructing the lightweight skip U-Net network structure.
4. The real-time speech noise reduction method according to claim 3, wherein step S3 specifically includes:
S302: construct the STFT loss by applying short-time Fourier transforms with different parameters to the input noisy signal and the clean audio signal respectively;
S303: input the model parameters of the encoding module, the decoding module and the LSTM into an Adam optimizer for optimization, and train the final model;
S304: directly input the noisy speech signal into the trained final model to obtain the speech-enhanced signal.
7. The real-time speech noise reduction method of claim 6,
wherein STFT(·) is the short-time Fourier transform, y is the clean speech signal, ŷ is the enhanced speech signal, and T is the audio length, which varies between samples; the numbers of Fourier transform points are 512, 1024 and 2048, with corresponding frame shifts of 50, 120 and 240 and window lengths of 240, 600 and 1200.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110971215.8A CN113421581B (en) | 2021-08-24 | 2021-08-24 | Real-time voice noise reduction method for jump network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113421581A CN113421581A (en) | 2021-09-21 |
CN113421581B true CN113421581B (en) | 2021-11-02 |
Family
ID=77719525
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110971215.8A Active CN113421581B (en) | 2021-08-24 | 2021-08-24 | Real-time voice noise reduction method for jump network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113421581B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109949821A (en) * | 2019-03-15 | 2019-06-28 | 慧言科技(天津)有限公司 | A method of far field speech dereverbcration is carried out using the U-NET structure of CNN |
CN111757172A (en) * | 2019-03-29 | 2020-10-09 | Tcl集团股份有限公司 | HDR video acquisition method, HDR video acquisition device and terminal equipment |
CN112151059A (en) * | 2020-09-25 | 2020-12-29 | 南京工程学院 | Microphone array-oriented channel attention weighted speech enhancement method |
CN113011093A (en) * | 2021-03-15 | 2021-06-22 | 哈尔滨工程大学 | Ship navigation noise simulation generation method based on LCWaveGAN |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190385282A1 (en) * | 2018-06-18 | 2019-12-19 | Drvision Technologies Llc | Robust methods for deep image transformation, integration and prediction |
US10923141B2 (en) * | 2018-08-06 | 2021-02-16 | Spotify Ab | Singing voice separation with deep u-net convolutional networks |
2021-08-24: application CN202110971215.8A filed; granted as patent CN113421581B (status: Active)
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110491404B (en) | Voice processing method, device, terminal equipment and storage medium | |
CN111564160B (en) | Voice noise reduction method based on AEWGAN | |
CN111081268A (en) | Phase-correlated shared deep convolutional neural network speech enhancement method | |
CN111653285B (en) | Packet loss compensation method and device | |
Lin et al. | Speech enhancement using multi-stage self-attentive temporal convolutional networks | |
CN111524530A (en) | Voice noise reduction method based on expansion causal convolution | |
CN116486826A (en) | Voice enhancement method based on converged network | |
Lim et al. | Harmonic and percussive source separation using a convolutional auto encoder | |
CN113782044B (en) | Voice enhancement method and device | |
CN113421581B (en) | Real-time voice noise reduction method for jump network | |
CN117174105A (en) | Speech noise reduction and dereverberation method based on improved deep convolutional network | |
CN113936680B (en) | Single-channel voice enhancement method based on multi-scale information perception convolutional neural network | |
CN111916060A (en) | Deep learning voice endpoint detection method and system based on spectral subtraction | |
CN115331690A (en) | Method for eliminating noise of call voice in real time | |
Xiang et al. | Joint waveform and magnitude processing for monaural speech enhancement | |
Hou et al. | A real-time speech enhancement algorithm based on convolutional recurrent network and Wiener filter | |
Ullah et al. | Semi-supervised transient noise suppression using OMLSA and SNMF algorithms | |
CN115116451A (en) | Audio decoding method, audio encoding method, audio decoding device, audio encoding device, electronic equipment and storage medium | |
CN114822569A (en) | Audio signal processing method, device, equipment and computer readable storage medium | |
CN110751958A (en) | Noise reduction method based on RCED network | |
Pastor-Naranjo et al. | Conditional Generative Adversarial Networks for Acoustic Echo Cancellation | |
WO2024055751A1 (en) | Audio data processing method and apparatus, device, storage medium, and program product | |
CN113903355B (en) | Voice acquisition method and device, electronic equipment and storage medium | |
Lee et al. | Stacked U-Net with high-level feature transfer for parameter efficient speech enhancement | |
US20240096332A1 (en) | Audio signal processing method, audio signal processing apparatus, computer device and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| CP03 | Change of name, title or address | Address after: Room 402, No. 66, North Street, University Town Center, Panyu District, Guangzhou City, Guangdong Province, 510006; Patentee after: Yifang Information Technology Co.,Ltd. Address before: 510006 Room 601, 603, 605, science museum, Guangdong University of technology, 100 Waihuan West Road, Xiaoguwei street, Panyu District, Guangzhou City, Guangdong Province; Patentee before: GUANGZHOU EASEFUN INFORMATION TECHNOLOGY Co.,Ltd. |
| GR01 | Patent grant | |