CN113421581B - Real-time voice noise reduction method for jump network - Google Patents
- Publication number
- CN113421581B (application CN202110971215.8A)
- Authority
- CN
- China
- Prior art keywords
- signal
- noise
- audio
- voice
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/14—Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0264—Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
Abstract
The invention discloses a real-time speech noise reduction method for a skip network, based on a multi-layer short-time Fourier transform loss function, comprising the following steps: construct an audio training set for network training using frequency-band masking and signal-reverberation data augmentation; construct a lightweight skip U-Net network structure; train the model with a multi-layer short-time Fourier transform loss function, and denoise with the trained model. The invention lightens the model by adopting a skip U-Net structure, and greatly improves the model's generalization to different noise types by combining the loss function based on multi-layer short-time Fourier transforms with data augmentation such as noise shifting and signal reverberation.
Description
Technical Field
The present invention relates to speech noise reduction methods, and more particularly to a speech noise reduction method based on a skip network.
Background
Speech enhancement has long been an active research field with great practical value in daily life, for example in video conferencing and voice calls, where speech enhancement and noise reduction can greatly improve call quality. Traditional speech noise reduction methods rely mainly on spectral subtraction and statistical models; these algorithms perform poorly on non-stationary noise. Classical approaches such as Wiener filtering struggle with non-stationary noise and multi-speaker babble. Later deep-neural-network denoising methods improved on them, but their processing speed is slow, which limits their effectiveness in practical applications.
In recent years, with the development of deep learning, deep networks have also been applied to audio noise reduction with good results. However, common deep neural networks have large parameter counts and complex models, so audio processing takes a long time.
Disclosure of Invention
In view of the above technical problems, an object of the present invention is to provide a speech denoising method for a skip network that employs a lighter network structure, feeds noisy audio to the input layer as the network input, and uses the corresponding clean, noise-free audio as the supervised training target.
The technical scheme of the invention is as follows:
A real-time speech noise reduction method for a skip network, based on a multi-layer short-time Fourier transform loss function, characterized by comprising the following steps:
S1: construct an audio training set for network training using frequency-band masking and signal-reverberation data augmentation, where frequency-band masking passes the audio through a band-stop filter to remove part of its frequency content, and signal reverberation adds continuously attenuated and delayed copies of the audio back into the original audio;
S2: construct a lightweight skip U-Net network structure, obtaining features with different channel counts through convolution and transposed convolution, and connecting and adding features with the same channel count, so that the lightweight U-Net learns relationships between high-level and low-level features simultaneously;
S3: use the multi-layer short-time Fourier transform loss function together with the mean absolute error as the model loss, train the model with the Adam optimization algorithm, and denoise with the trained model.
The preferred technical solution of the present invention is that the audio training set for network training in step S1 is constructed through the following steps:
S101: acquire clean speech signals and noise signals as model training data from the Valentini dataset and the DNS2020 reference dataset;
S102: superpose several noise signals to obtain a mixed noise signal;
S103: randomly crop and combine the mixed noise signal and the speech signal to obtain a noisy speech signal;
S104: delay and attenuate the speech signal and the original noise signal, and add them to the noisy speech signal to obtain a reverberant noisy speech signal.
The preferable technical solution of the present invention is that step S2 specifically includes:
S201: construct an encoding module: pass the audio signal through a one-dimensional convolution module, zero out values below zero with the ReLU activation function, continue with a convolution whose kernel doubles the channel count, and finally obtain the encoded signal through a gated linear unit;
S202: process the encoded signal with an LSTM signal processing module, where the LSTM signal processing module is built from a unidirectional or a bidirectional LSTM network;
S203: construct a decoding module: after the encoded signal has been processed by the LSTM signal processing module, reduce the channel count with a one-dimensional convolution module, process the signal with a gated linear unit, and finally obtain the speech-enhanced audio through a one-dimensional transposed convolution module;
S204: connect each encoder module to the decoder module whose input channel count equals the encoder module's output channel count, constructing the lightweight skip U-Net network structure.
The preferable technical solution of the present invention is that step S3 specifically includes:
S302: construct the STFT loss by applying short-time Fourier transforms with different parameters to the input noisy signal and the clean audio signal respectively;
s303: inputting the model parameters of the coding module, the decoding module and the LSTM into an Adam optimizer for optimization learning, and training a final model;
s304: and directly inputting the voice signal with noise into the trained final model to obtain the voice enhanced voice signal.
The preferred technical scheme of the invention is that the formula of the gated linear unit is as follows:

$$\mathrm{GLU}(X) = (X \ast W + b) \otimes \sigma(X \ast V + c)$$

where X is the output of the convolution module, W, b, V and c are all learnable parameters, ⊗ is the element-wise product, and σ(·) is the sigmoid function.
The preferred technical solution of the present invention is that the formula of the mean-absolute-error loss function is as follows:

$$\mathcal{L}_{1} = \frac{1}{T}\sum_{t=1}^{T}\left|y_{t}-\hat{y}_{t}\right|$$

where y is the clean speech signal, ŷ is the enhanced speech signal, T is the audio length, and the value of T varies from sample to sample.
The preferred technical solution of the present invention is that the formula of the multi-layer short-time Fourier transform loss function is as follows:

$$\mathcal{L}_{stft}=\sum_{m=1}^{3}\frac{1}{T}\left\|\left|\mathrm{STFT}_{m}(y)\right|-\left|\mathrm{STFT}_{m}(\hat{y})\right|\right\|_{1}$$

where STFT(·) is the short-time Fourier transform, y is the clean speech signal, ŷ is the enhanced speech signal, and T is the audio length, which varies between samples; the numbers of Fourier transform points are chosen as 512, 1024 and 2048, with corresponding frame shifts of 50, 120 and 240 and window lengths of 240, 600 and 1200.
Compared with the prior art, the invention has the following beneficial effects:
the invention can obtain better voice noise reduction effect, and has the advantages of small distortion, strong generalization capability and good noise reduction effect.
Drawings
To illustrate the embodiments of the present invention or the prior-art technical solutions more clearly, the drawings needed in their description are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a voice denoising method for a hop network according to an embodiment of the present invention;
fig. 2 is a flowchart of a voice noise reduction method for a hop network according to embodiment 1 of the present invention;
FIG. 3 is a schematic diagram of a hop denoising network;
fig. 4 is a schematic diagram of an LSTM network.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention.
As shown in figs. 1-2, the speech noise reduction method according to the embodiment of the present invention is based on a multi-layer short-time Fourier transform loss function and includes:
S1: construct an audio training set for network training using frequency-band masking and signal-reverberation data augmentation, where frequency-band masking passes the audio through a band-stop filter to remove part of its frequency content, and signal reverberation adds continuously attenuated and delayed copies of the audio back into the original audio;
S2: construct a lightweight skip U-Net network structure, as shown in fig. 3, obtaining features with different channel counts through convolution and transposed convolution, and connecting and adding features with the same channel count, so that the model learns relationships between high-level and low-level features simultaneously and achieves a better result;
S3: use the multi-layer short-time Fourier transform loss function together with the mean absolute error as the model loss, train the model with the Adam optimization algorithm, and denoise with the trained model.
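The frequency-band masking of S1 can be sketched with a standard band-stop filter. Here a Butterworth design from scipy stands in for whichever filter the patent actually uses, and the band edges (1–2 kHz) and filter order are assumed for illustration:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def band_mask(audio, sr=16000, low_hz=1000.0, high_hz=2000.0, order=6):
    """Frequency-band masking: pass the audio through a Butterworth
    band-stop filter to remove part of its spectrum.  Band edges and
    order are illustrative, not taken from the patent."""
    sos = butter(order, [low_hz, high_hz], btype="bandstop", fs=sr, output="sos")
    return sosfilt(sos, audio)

# A tone in the middle of the stopped band is strongly attenuated.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1500.0 * t)
masked = band_mask(tone, sr)
```

Applying this to randomly chosen frequency bands of the training audio yields the masked variants used for augmentation.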
The audio training set for network training in step S1 is constructed through the following steps:
S101: acquire clean speech signals and noise signals as model training data from the Valentini dataset and the DNS2020 reference dataset;
Here, Valentini is a training dataset for speech enhancement and speech synthesis provided by the Centre for Speech Technology Research at the University of Edinburgh, and DNS2020 is the dataset of Microsoft's Deep Noise Suppression challenge, which provides a large amount of clean speech and noise signals.
S102: superpose several noise signals to obtain a mixed noise signal;
S103: randomly crop and combine the mixed noise signal and the speech signal to obtain a noisy speech signal;
S104: delay and attenuate the speech signal and the original noise signal, and add them to the noisy speech signal to obtain a reverberant noisy speech signal.
The audio training set in step S1 thus contains audio data and noise data, from which various kinds of noisy audio and the corresponding clean audio used for supervision are synthesized.
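The signal-reverberation augmentation of S104 can be sketched minimally as follows: delayed, progressively attenuated copies of the signal are added back onto it. The delay lengths (in samples) and the decay factor are assumed values, since the patent does not specify them:

```python
import numpy as np

def add_reverb(signal, delays=(800, 1600, 2400), decay=0.5):
    """Simulated reverberation: add progressively attenuated, delayed
    copies of the signal to itself.  Delays and decay are hypothetical."""
    out = signal.astype(float).copy()
    gain = decay
    for d in delays:
        out[d:] += gain * signal[:-d]   # echo arriving d samples later
        gain *= decay                   # each later echo is weaker
    return out

noisy = np.random.default_rng(0).standard_normal(16000)  # 1 s noisy-speech stand-in
reverberant = add_reverb(noisy)
```

Note that samples before the first echo arrives are unchanged, which keeps the direct-path signal intact.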
Step S2 specifically includes:
S201: construct an encoding module: the audio signal passes through a one-dimensional convolution module, is processed by the ReLU activation function, is further convolved with a kernel that doubles the channel count, and finally passes through a gated linear unit to obtain the encoded signal;
The ReLU activation function is:

$$\mathrm{relu}(x) = \max(0, x)$$
the formula for the gated linear cell is as follows:
where X is the output of the convolution module, W, b, V, c are all learnable parameters, ⨂ is the element product, and σ (-) is the sigmoid function.
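A numerical sketch of the gated linear unit above; the feature width and batch size are arbitrary, and plain matrix products stand in for the convolutions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu(x, w, b, v, c):
    """Gated linear unit: (XW + b) elementwise-multiplied by sigmoid(XV + c),
    with w, b, v, c playing the role of the learnable parameters."""
    return (x @ w + b) * sigmoid(x @ v + c)

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8))                  # 4 feature vectors of width 8
w, v = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
b, c = np.zeros(8), np.zeros(8)
y = glu(x, w, b, v, c)
```

With the gate saturated open (large bias c), the unit reduces to the plain linear branch, which is what makes the gate an information-flow control.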
S202: construct the encoded-signal processing module from a unidirectional or a bidirectional LSTM network, and process the encoded signal with it. A long short-term memory (LSTM) network is a special RNN model whose structural design avoids the long-term dependency problem. In fig. 4, σ denotes the sigmoid function, tanh the hyperbolic tangent function, and + element-wise vector addition. On the left, the input x and the previous hidden state h feed a sigmoid that controls forgetting of the previous cell state; in the middle, x and h pass through sigmoid and tanh functions to decide which new information is retained to update the cell state; the rightmost part combines x, h and the cell state to produce the unit output h.
S203: construct a decoding module as follows: after the encoded signal has been processed by the LSTM signal processing module, first reduce the channel count with a one-dimensional convolution module, then process the signal with a gated linear unit, and finally obtain the speech-enhanced audio through a one-dimensional transposed convolution module;
S204: connect each encoder module to the decoder module whose input channel count equals the encoder module's output channel count, thereby constructing the skip network structure.
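The channel bookkeeping of the skip connections in S204 can be illustrated at shape level. Pooling and repetition stand in for the real convolutions and transposed convolutions, so this is a structural sketch only, not the patent's network:

```python
import numpy as np

def down(x):
    """Stand-in for a stride-2 convolution that doubles the channel count:
    (C, T) -> (2C, T/2)."""
    pooled = x.reshape(x.shape[0], -1, 2).mean(axis=2)    # halve the time axis
    return np.concatenate([pooled, pooled], axis=0)       # double the channels

def up(x):
    """Stand-in for a transposed convolution that halves the channel count:
    (2C, T) -> (C, 2T)."""
    half = 0.5 * (x[: x.shape[0] // 2] + x[x.shape[0] // 2:])
    return np.repeat(half, 2, axis=1)

def skip_unet(x, depth=3):
    """Encoder features are saved and added to the decoder features with
    the same channel count -- the skip connections of S204."""
    skips = []
    for _ in range(depth):
        skips.append(x)
        x = down(x)
    for _ in range(depth):
        x = up(x)
        x = x + skips.pop()      # channel counts match by construction
    return x

signal = np.random.default_rng(2).standard_normal((1, 64))   # (channels, time)
out = skip_unet(signal)
```

Because each decoder stage exactly mirrors an encoder stage, the output recovers the input shape, which is what lets the network map waveform to waveform.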
Step S3 specifically includes:
S301: construct the mean-absolute-error loss between the enhanced speech signal and the clean speech signal:

$$\mathcal{L}_{1} = \frac{1}{T}\sum_{t=1}^{T}\left|y_{t}-\hat{y}_{t}\right|$$

where y is the clean speech signal, ŷ is the enhanced speech signal, T is the audio length, and the value of T varies from sample to sample.
S302: construct the STFT loss by applying short-time Fourier transforms with different parameters to the enhanced speech signal and the clean speech signal respectively:

$$\mathcal{L}_{stft}=\sum_{m=1}^{3}\frac{1}{T}\left\|\left|\mathrm{STFT}_{m}(y)\right|-\left|\mathrm{STFT}_{m}(\hat{y})\right|\right\|_{1}$$

where STFT(·) is the short-time Fourier transform, y is the clean speech signal, ŷ is the enhanced speech signal, and T is the audio length, which varies between samples; the numbers of Fourier transform points are chosen as 512, 1024 and 2048, with corresponding frame shifts of 50, 120 and 240 and window lengths of 240, 600 and 1200.
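With the three parameter sets named above, the multi-layer STFT loss can be sketched as follows. The mean L1 distance between magnitude spectrograms is one plausible reading; the patent's exact expression is not reproduced in this text, so this specific form is an assumption:

```python
import numpy as np
from scipy.signal import stft

# (FFT points, frame shift, window length) -- the three parameter
# sets given in the patent text.
STFT_CONFIGS = [(512, 50, 240), (1024, 120, 600), (2048, 240, 1200)]

def multi_stft_loss(clean, enhanced, fs=16000):
    """Multi-layer STFT loss sketch: mean L1 distance between magnitude
    spectrograms, summed over the three STFT resolutions."""
    total = 0.0
    for n_fft, hop, win in STFT_CONFIGS:
        _, _, sc = stft(clean, fs=fs, nperseg=win, noverlap=win - hop, nfft=n_fft)
        _, _, se = stft(enhanced, fs=fs, nperseg=win, noverlap=win - hop, nfft=n_fft)
        total += float(np.mean(np.abs(np.abs(sc) - np.abs(se))))
    return total

rng = np.random.default_rng(3)
clean = rng.standard_normal(16000)
enhanced = clean + 0.1 * rng.standard_normal(16000)
```

Comparing magnitudes at several resolutions penalizes both fine spectral detail (long windows) and temporal detail (short windows) at once.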
S303: inputting the model parameters of the coding module, the decoding module and the LSTM into an Adam optimizer for optimization learning, and training a final model;
s304: and directly inputting the voice signal with the noise into the network to obtain the voice enhanced voice signal.
The target sampling rate of the audio data in the dataset constructed in the present invention is 16 kHz. Audio with a different sampling rate is first resampled to the target rate and then fed directly into the network to obtain the speech-enhanced audio. Through the above embodiments, the invention achieves a better speech noise reduction effect, with low distortion, strong generalization ability and good denoising performance.
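The resampling step before inference can be done with a polyphase resampler, for example:

```python
import numpy as np
from scipy.signal import resample_poly

def to_target_rate(audio, sr, target_sr=16000):
    """Resample audio of any sampling rate to the 16 kHz target rate
    expected by the network."""
    if sr == target_sr:
        return audio
    g = int(np.gcd(sr, target_sr))               # reduce the up/down ratio
    return resample_poly(audio, target_sr // g, sr // g)

audio_48k = np.random.default_rng(4).standard_normal(48000)   # 1 s at 48 kHz
audio_16k = to_target_rate(audio_48k, 48000)
```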
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.
Claims (7)
1. A real-time speech noise reduction method for a skip network, based on a multi-layer short-time Fourier transform loss function, characterized by comprising the following steps:
S1: construct an audio training set for network training using frequency-band masking and signal-reverberation data augmentation, where frequency-band masking passes the audio through a band-stop filter to remove part of its frequency content, and signal reverberation adds continuously attenuated and delayed copies of the audio back into the original audio;
S2: construct a lightweight skip U-Net network structure, obtaining features with different channel counts through convolution and transposed convolution, and connecting and adding features with the same channel count, so that the lightweight U-Net learns relationships between high-level and low-level features simultaneously;
2. The real-time speech noise reduction method according to claim 1, wherein constructing the audio training set for network training in step S1 comprises the following steps:
S101: acquire clean speech signals and noise signals as model training data from the Valentini dataset and the DNS2020 reference dataset;
S102: superpose several noise signals to obtain a mixed noise signal;
S103: randomly crop and combine the mixed noise signal and the speech signal to obtain a noisy speech signal;
S104: delay and attenuate the speech signal and the original noise signal, and add them to the noisy speech signal to obtain a reverberant noisy speech signal.
3. The real-time speech noise reduction method according to claim 2, wherein step S2 specifically includes:
S201: construct an encoding module: pass the audio signal through a one-dimensional convolution module, zero out values below zero with the ReLU activation function, continue with a convolution whose kernel doubles the channel count, and finally obtain the encoded signal through a gated linear unit;
S202: process the encoded signal with an LSTM signal processing module, where the LSTM signal processing module is built from a unidirectional or a bidirectional LSTM network;
S203: construct a decoding module: after the encoded signal has been processed by the LSTM signal processing module, reduce the channel count with a one-dimensional convolution module, process the signal with a gated linear unit, and finally obtain the speech-enhanced audio through a one-dimensional transposed convolution module;
S204: connect each encoder module to the decoder module whose input channel count equals the encoder module's output channel count, constructing the lightweight skip U-Net network structure.
4. The real-time speech noise reduction method according to claim 3, wherein step S3 specifically includes:
S302: construct the STFT loss by applying short-time Fourier transforms with different parameters to the input noisy signal and the clean audio signal respectively;
S303: input the model parameters of the encoding module, the decoding module and the LSTM into an Adam optimizer for optimization, and train the final model;
S304: directly input the noisy speech signal into the trained final model to obtain the speech-enhanced signal.
7. The real-time speech noise reduction method of claim 6,
wherein STFT(·) is the short-time Fourier transform, y is the clean speech signal, ŷ is the enhanced speech signal, and T is the audio length, which varies between samples; the numbers of Fourier transform points are 512, 1024 and 2048, with corresponding frame shifts of 50, 120 and 240 and window lengths of 240, 600 and 1200.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110971215.8A CN113421581B (en) | 2021-08-24 | 2021-08-24 | Real-time voice noise reduction method for jump network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113421581A CN113421581A (en) | 2021-09-21 |
CN113421581B true CN113421581B (en) | 2021-11-02 |
Family
ID=77719525
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110971215.8A Active CN113421581B (en) | 2021-08-24 | 2021-08-24 | Real-time voice noise reduction method for jump network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113421581B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109949821A (en) * | 2019-03-15 | 2019-06-28 | 慧言科技(天津)有限公司 | A method of far field speech dereverbcration is carried out using the U-NET structure of CNN |
CN111757172A (en) * | 2019-03-29 | 2020-10-09 | Tcl集团股份有限公司 | HDR video acquisition method, HDR video acquisition device and terminal equipment |
CN112151059A (en) * | 2020-09-25 | 2020-12-29 | 南京工程学院 | Microphone array-oriented channel attention weighted speech enhancement method |
CN113011093A (en) * | 2021-03-15 | 2021-06-22 | 哈尔滨工程大学 | Ship navigation noise simulation generation method based on LCWaveGAN |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190385282A1 (en) * | 2018-06-18 | 2019-12-19 | Drvision Technologies Llc | Robust methods for deep image transformation, integration and prediction |
US10923141B2 (en) * | 2018-08-06 | 2021-02-16 | Spotify Ab | Singing voice separation with deep u-net convolutional networks |
2021-08-24: application CN202110971215.8A filed; granted as patent CN113421581B (status: Active)
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110491404B (en) | Voice processing method, device, terminal equipment and storage medium | |
CN111564160B (en) | Voice noise reduction method based on AEWGAN | |
CN111081268A (en) | Phase-correlated shared deep convolutional neural network speech enhancement method | |
CN111653285B (en) | Packet loss compensation method and device | |
Lin et al. | Speech enhancement using multi-stage self-attentive temporal convolutional networks | |
CN111524530A (en) | Voice noise reduction method based on expansion causal convolution | |
CN116486826A (en) | Voice enhancement method based on converged network | |
Lim et al. | Harmonic and percussive source separation using a convolutional auto encoder | |
CN113782044B (en) | Voice enhancement method and device | |
CN113421581B (en) | Real-time voice noise reduction method for jump network | |
CN117174105A (en) | Speech noise reduction and dereverberation method based on improved deep convolutional network | |
CN113936680B (en) | Single-channel voice enhancement method based on multi-scale information perception convolutional neural network | |
CN111916060A (en) | Deep learning voice endpoint detection method and system based on spectral subtraction | |
CN115331690A (en) | Method for eliminating noise of call voice in real time | |
Xiang et al. | Joint waveform and magnitude processing for monaural speech enhancement | |
Hou et al. | A real-time speech enhancement algorithm based on convolutional recurrent network and Wiener filter | |
Ullah et al. | Semi-supervised transient noise suppression using OMLSA and SNMF algorithms | |
CN115116451A (en) | Audio decoding method, audio encoding method, audio decoding device, audio encoding device, electronic equipment and storage medium | |
CN114822569A (en) | Audio signal processing method, device, equipment and computer readable storage medium | |
CN110751958A (en) | Noise reduction method based on RCED network | |
Pastor-Naranjo et al. | Conditional Generative Adversarial Networks for Acoustic Echo Cancellation | |
WO2024055751A1 (en) | Audio data processing method and apparatus, device, storage medium, and program product | |
CN113903355B (en) | Voice acquisition method and device, electronic equipment and storage medium | |
Lee et al. | Stacked U-Net with high-level feature transfer for parameter efficient speech enhancement | |
US20240096332A1 (en) | Audio signal processing method, audio signal processing apparatus, computer device and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| CP03 | Change of name, title or address | Address after: Room 402, No. 66, North Street, University Town Center, Panyu District, Guangzhou City, Guangdong Province, 510006; Patentee after: Yifang Information Technology Co.,Ltd. Address before: 510006 Room 601, 603, 605, science museum, Guangdong University of technology, 100 Waihuan West Road, Xiaoguwei street, Panyu District, Guangzhou City, Guangdong Province; Patentee before: GUANGZHOU EASEFUN INFORMATION TECHNOLOGY Co.,Ltd. |
| GR01 | Patent grant | |