CN112767960B - Audio noise reduction method, system, device and medium

Info

Publication number
CN112767960B
Authority
CN
China
Prior art keywords
audio signals
signal
noise
neural network
noisy
Prior art date
Legal status
Active
Application number
CN202110164294.1A
Other languages
Chinese (zh)
Other versions
CN112767960A (en)
Inventor
周利明
Current Assignee
Yuncong Technology Group Co Ltd
Original Assignee
Yuncong Technology Group Co Ltd
Priority date
Filing date
Publication date
Application filed by Yuncong Technology Group Co Ltd
Priority to CN202110164294.1A
Publication of CN112767960A
Application granted
Publication of CN112767960B
Legal status: Active
Anticipated expiration

Classifications

    • G10L21/0208 — Speech enhancement; noise filtering
    • G10L21/0232 — Noise filtering characterised by the method used for estimating noise; processing in the frequency domain
    • G06N3/045 — Neural networks; combinations of networks
    • G06N3/08 — Neural networks; learning methods
    • G10L25/18 — Speech or voice analysis; extracted parameters being spectral information of each sub-band
    • G10L25/21 — Speech or voice analysis; extracted parameters being power information
    • G10L25/30 — Speech or voice analysis using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention provides an audio noise reduction method, system, device and medium based on deep learning. A noisy audio signal is acquired and its amplitude value and angle value are extracted; an ideal amplitude ratio associated with the noisy audio signal is predicted from the amplitude value using a neural network model; the complex spectrum of the clean audio signal is determined from the predicted ideal amplitude ratio and the amplitude value and angle value of the noisy audio signal; and an inverse transform is performed on the complex spectrum of the clean audio signal to obtain the clean audio signal corresponding to the noisy audio signal. Aiming at the existing problems, the invention designs a method that is based on deep learning and retains noise reduction capability in scenes with many kinds of noise. Because the audio signal is processed with a deep learning method, no idealized assumptions are needed, and the model can be trained on a wider noise data set, overcoming the insufficient robustness and insufficient noise reduction capability of traditional algorithms.

Description

Audio noise reduction method, system, device and medium
Technical Field
The present invention relates to the field of noise processing technologies, and in particular, to an audio noise reduction method, system, device, and medium.
Background
In the real world there are many kinds of noise, and their presence seriously affects the application of voice products. In a conference voice recording scenario, for example, the conference recorder usually needs to play the recording back later to help organize the conference content, but noise introduced by the recording environment, the recording equipment and so on makes the played-back sound unclear. During a voice call, environmental noise is usually present at the call site and likewise degrades call quality. Noise reduction is therefore an important step in speech applications. However, the traditional noise reduction algorithms in wide use today have limited noise reduction performance and poor robustness, and are no longer sufficient for the noise reduction scenarios of increasingly broad speech applications. Noise in a real environment may come from many sources, such as machine operation noise, music, and noisy crowds. Traditional noise reduction methods usually work only for a limited number of noise sources, which reduces the robustness of noise reduction. Moreover, traditional noise reduction algorithms rely on idealized assumptions, which reduces their noise reduction capability in real scenes.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, it is an object of the present invention to provide an audio noise reduction method, system, device and medium for solving the technical problems in the prior art.
To achieve the above and other related objects, the present invention provides an audio noise reduction method, comprising:
obtaining one or more noisy audio signals;
extracting amplitude values and angle values of the one or more noisy audio signals, and predicting ideal amplitude ratio values associated with the one or more noisy audio signals by using one or more neural network models based on the extracted amplitude values;
determining a complex spectrum of one or more clean audio signals according to the predicted ideal amplitude ratio value, the amplitude values and the angle values of the one or more noisy audio signals;
and performing inverse transformation on the complex frequency spectrum of the one or more clean audio signals to obtain one or more clean audio signals corresponding to the one or more noisy audio signals.
Optionally, the training process of the one or more neural network models includes:
acquiring a voice signal and a noise signal in a real environment, and obtaining a voice signal with noise in the real environment according to the voice signal and the noise signal in the real environment;
respectively extracting the amplitude corresponding to the voice signal in the real environment and the amplitude corresponding to the voice signal with noise in the real environment, and determining the amplitude ratio in the real environment according to the amplitude corresponding to the voice signal in the real environment and the amplitude corresponding to the voice signal with noise;
taking the logarithm of the amplitude of the noisy speech signal in the real environment, and inputting the obtained values into the one or more neural network models to obtain a predicted ideal amplitude ratio;
determining whether the one or more neural network models converge or not according to the predicted ideal amplitude ratio and the amplitude ratio in the real environment, and finishing the training of the one or more neural network models when the one or more neural network models converge;
wherein the network structure of the one or more neural network models comprises at least: an encoding network layer, a decoding network layer and a recurrent network; the recurrent network is embedded between the encoding network layer and the decoding network layer, and the encoding network layer and the decoding network layer correspond to each other to form a U-shaped network structure.
Optionally, determining a loss function according to the predicted ideal amplitude ratio and the amplitude ratio in the real environment;
judging whether the absolute value of the difference between the loss function values of two adjacent training iterations is smaller than a preset target value;
if it is smaller than the preset target value, finishing the training of the one or more neural network models;
and if it is not smaller than the preset target value, updating the parameters of the one or more neural network models and continuing to train the one or more neural network models.
Optionally, the specific process of extracting the amplitude values and the angle values of the one or more noisy audio signals includes:
framing and windowing the acquired one or more noisy audio signals;
performing Fourier transform on one or more noisy audio signals after framing and windowing are completed;
calculating amplitudes of the one or more noisy audio signals based on a fourier transform result, and obtaining angle values of the one or more noisy audio signals according to the fourier transform result;
and taking a log of the calculated amplitude to obtain the amplitude values of the one or more noisy audio signals.
Optionally, performing an inverse transform on the complex spectrum of the one or more clean audio signals to obtain one or more clean audio signals corresponding to the one or more noisy audio signals, includes:
performing inverse Fourier transform on the complex frequency spectrum of the one or more clean audio signals to obtain time sequence signals of the one or more clean audio signals;
and performing overlap-add convolution operation on the time sequence signal of each frame of the pure audio signal to obtain one or more final pure audio signals corresponding to the one or more noisy audio signals.
Optionally, the method further includes pre-emphasizing the one or more noisy audio signals before extracting amplitude values and angle values of the one or more noisy audio signals;
and after acquiring the one or more clean audio signals, performing de-pre-emphasis on the acquired one or more clean audio signals.
Optionally, the method further includes, before acquiring one or more noisy audio signals, resampling the one or more noisy audio signals according to a preset sampling rate, and unifying the sampling rate of the one or more noisy audio signals to the preset sampling rate;
and after one or more pure audio signals are obtained, resampling the one or more pure audio signals, and restoring the sampling rate of the one or more pure audio signals.
The invention also provides an audio noise reduction system, comprising:
the acquisition module is used for acquiring one or more noisy audio signals;
the extraction module is used for extracting amplitude values and angle values of the one or more noisy audio signals;
a prediction module for predicting an ideal amplitude ratio associated with the one or more noisy audio signals using one or more neural network models based on the extracted amplitude values;
a noise reduction module for determining a complex spectrum of one or more clean audio signals according to the predicted ideal amplitude ratio, the amplitude values and the angle values of the one or more noisy audio signals; and performing an inverse transform on the complex spectrum of the one or more clean audio signals to obtain one or more clean audio signals corresponding to the one or more noisy audio signals.
Optionally, the training process of one or more neural network models in the prediction module includes:
acquiring a voice signal and a noise signal in a real environment, and obtaining a voice signal with noise in the real environment according to the voice signal and the noise signal in the real environment;
respectively extracting the amplitude corresponding to the voice signal in the real environment and the amplitude corresponding to the voice signal with noise in the real environment, and determining the amplitude ratio in the real environment according to the amplitude corresponding to the voice signal in the real environment and the amplitude corresponding to the voice signal with noise;
taking the logarithm of the amplitude of the noisy speech signal in the real environment, and inputting the obtained values into the one or more neural network models to obtain a predicted ideal amplitude ratio;
determining whether the one or more neural network models converge or not according to the predicted ideal amplitude ratio and the amplitude ratio in the real environment, and finishing the training of the one or more neural network models when the one or more neural network models converge;
wherein the network structure of the one or more neural network models comprises at least: an encoding network layer, a decoding network layer and a recurrent network; the recurrent network is embedded between the encoding network layer and the decoding network layer, and the encoding network layer and the decoding network layer correspond to each other to form a U-shaped network structure.
Optionally, determining a loss function according to the predicted ideal amplitude ratio and the amplitude ratio in the real environment;
judging whether the absolute value of the difference between the loss function values of two adjacent training iterations is smaller than a preset target value;
if it is smaller than the preset target value, finishing the training of the one or more neural network models;
and if it is not smaller than the preset target value, updating the parameters of the one or more neural network models and continuing to train the one or more neural network models.
Optionally, the specific process of extracting the amplitude values and the angle values of the one or more noisy audio signals by the extraction module includes:
framing and windowing the acquired one or more noisy audio signals;
performing Fourier transform on one or more noisy audio signals after framing and windowing are completed;
calculating amplitudes of the one or more noisy audio signals based on a fourier transform result, and obtaining angle values of the one or more noisy audio signals according to the fourier transform result;
and taking a log of the calculated amplitude to obtain the amplitude values of the one or more noisy audio signals.
Optionally, the performing inverse transformation on the complex spectrum of the one or more clean audio signals by the noise reduction module to obtain one or more clean audio signals corresponding to the one or more noisy audio signals includes:
performing inverse Fourier transform on the complex frequency spectrum of the one or more clean audio signals to obtain time sequence signals of the one or more clean audio signals;
and performing overlap-add convolution operation on the time sequence signal of each frame of the pure audio signal to obtain one or more final pure audio signals corresponding to the one or more noisy audio signals.
The present invention also provides an apparatus comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform a method as in any one of the above.
The invention also provides one or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the method as described in any one of the above.
As described above, the present invention provides an audio noise reduction method, system, device and medium based on deep learning, with the following advantages: one or more noisy audio signals are acquired; their amplitude values and angle values are extracted, and the ideal amplitude ratios associated with the one or more noisy audio signals are predicted from the extracted amplitude values using one or more neural network models; the complex spectra of one or more clean audio signals are determined from the predicted ideal amplitude ratios and the amplitude values and angle values of the one or more noisy audio signals; and an inverse transform is performed on the complex spectra of the one or more clean audio signals to obtain the one or more clean audio signals corresponding to the one or more noisy audio signals. Aiming at the existing problems, the invention designs a method that is based on deep learning and retains noise reduction capability in scenes with many kinds of noise. Because the audio signal is processed with a deep learning method, no idealized assumptions are needed, and the model can be trained on a wider noise data set, overcoming the insufficient robustness and insufficient noise reduction capability of traditional algorithms. The invention solves the problem of insufficient noise reduction capability caused by the over-idealized assumptions of traditional noise reduction algorithms, and also solves the problem of insufficient noise reduction robustness caused by the inability of traditional algorithms to be trained on large amounts of noise data. The invention can remove machine noise, noisy crowd noise from roads, restaurants and the like, music noise, white noise and other noise; it retains noise reduction capability even in extreme noise pollution scenes (for example, a signal-to-noise ratio of -10 dB); and compared with the signal before noise reduction, the perceptual evaluation of speech quality (PESQ) score after noise reduction can be improved by 0.8.
Drawings
Fig. 1 is a schematic flowchart of an audio denoising method based on deep learning according to an embodiment;
FIG. 2 is a schematic diagram of a network structure of a deep learning neural network model according to an embodiment;
FIG. 3 is a flowchart illustrating a deep learning based audio denoising method according to another embodiment;
FIG. 4 is a schematic diagram illustrating a process of training and applying a neural network model according to an embodiment;
FIG. 5 is a schematic flow chart for generating training data according to an embodiment;
FIG. 6 is a diagram illustrating a hardware structure of an audio denoising system based on deep learning according to an embodiment;
fig. 7 is a schematic hardware structure diagram of a terminal device according to an embodiment;
fig. 8 is a schematic diagram of a hardware structure of a terminal device according to another embodiment.
Description of the element reference numerals
M10 acquisition module
M20 extraction module
M30 prediction module
M40 noise reduction module
1100 input device
1101 first processor
1102 output device
1103 first memory
1104 communication bus
1200 processing assembly
1201 second processor
1202 second memory
1203 communication assembly
1204 Power supply Assembly
1205 multimedia assembly
1206 Audio component
1207 input/output interface
1208 sensor assembly
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
Referring to fig. 1, the present invention provides an audio denoising method, including the following steps:
S100, acquiring one or more noisy audio signals; the audio signal at least comprises a voice signal and a video signal.
S200, extracting amplitude values and angle values of the one or more noisy audio signals, and predicting ideal amplitude ratios associated with the one or more noisy audio signals by using one or more neural network models based on the amplitude values;
S300, determining the complex frequency spectrum of one or more clean audio signals according to the predicted ideal amplitude ratio, the amplitude value and the angle value of the one or more noisy audio signals;
S400, performing an inverse transform on the complex spectrum of the one or more clean audio signals to obtain one or more clean audio signals corresponding to the one or more noisy audio signals.
Aiming at the existing problems, the method is designed based on deep learning and retains noise reduction capability in scenes with many kinds of noise. Because the audio signal is processed with a deep learning method, no idealized assumptions are needed, and the model can be trained on a wider noise data set, overcoming the insufficient robustness and insufficient noise reduction capability of traditional algorithms. The method solves the problem of insufficient noise reduction capability caused by the over-idealized assumptions of traditional noise reduction algorithms, as well as the problem of insufficient noise reduction robustness caused by the inability of traditional algorithms to be trained on large amounts of noise data. The method can remove machine noise, noisy crowd noise from roads, restaurants and the like, music noise, white noise and other noise; it retains noise reduction capability even in extreme noise pollution scenes (for example, a signal-to-noise ratio of -10 dB); and compared with the signal before noise reduction, the perceptual evaluation of speech quality (PESQ) score after noise reduction can be improved by 0.8.
In an exemplary implementation, the network structure of the one or more neural network models includes at least: an encoding network layer, a decoding network layer and a recurrent network; the recurrent network is embedded between the encoding network layer and the decoding network layer, and the encoding network layer and the decoding network layer correspond to each other to form a U-shaped network structure. By way of example, the network structure of a single neural network model in the method adopts an 8-layer neural network, which combines an encoder-decoder structure with a Unet structure. The encoding network (encoder) and the decoding network (decoder) are each 3-layer networks, and after being placed in one-to-one correspondence they form a Unet structure (i.e. a U-shaped network structure). The Unet structure uses skip connections from the encoder layers to the corresponding decoder layers, bypassing the intermediate layers and thereby avoiding information loss that may occur in the intermediate-layer coding. Two recurrent layers are embedded between the encoder and the decoder; in the embodiment of the present application the recurrent network consists of long short-term memory networks (LSTM), through which temporal characteristics between the feature codes, such as the fundamental frequency and formants of speech and the stationarity of noise, can be learned. The network structure of the neural network model is shown in fig. 2. As another example, the Dense network layers in the neural network model may be replaced with a convolutional neural network (CNN). When there are many types of noise and the noise intensities differ greatly, using amplitude-spectrum features as input may make convergence of the neural network model difficult, and a CNN layer can be selected to replace the Dense network layer to prevent the performance of the neural network model from degrading.
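The following is a minimal PyTorch sketch of the U-shaped structure just described, intended only as an illustration: the 257-bin input (for a 512-point FFT), the exact layer widths and the skip-connection wiring are assumptions, since the description only names Dense(512)/Dense(1024) layers, a two-layer LSTM, Leaky-ReLU activations and a final Sigmoid.

```python
import torch
import torch.nn as nn

class IMRNet(nn.Module):
    """Sketch of the U-shaped Dense encoder / 2-layer LSTM / Dense decoder."""

    def __init__(self, n_bins=257):
        super().__init__()
        act = nn.LeakyReLU()
        # 3-layer fully connected encoder
        self.enc1 = nn.Sequential(nn.Linear(n_bins, 512), act)
        self.enc2 = nn.Sequential(nn.Linear(512, 512), act)
        self.enc3 = nn.Sequential(nn.Linear(512, 512), act)
        # 2-layer recurrent network embedded between encoder and decoder
        self.lstm = nn.LSTM(512, 512, num_layers=2, batch_first=True)
        # 3-layer fully connected decoder with skip connections from the encoder
        self.dec1 = nn.Sequential(nn.Linear(512 + 512, 1024), act)
        self.dec2 = nn.Sequential(nn.Linear(1024 + 512, 1024), act)
        self.dec3 = nn.Sequential(nn.Linear(1024 + 512, n_bins), nn.Sigmoid())

    def forward(self, log_mag):                      # log_mag: (batch, frames, n_bins)
        e1 = self.enc1(log_mag)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        h, _ = self.lstm(e3)
        d1 = self.dec1(torch.cat([h, e3], dim=-1))   # skip from the deepest encoder layer
        d2 = self.dec2(torch.cat([d1, e2], dim=-1))  # skip from the middle encoder layer
        g_hat = self.dec3(torch.cat([d2, e1], dim=-1))
        return g_hat                                 # predicted ideal amplitude ratio in [0, 1]
```

The Sigmoid on the last decoder layer keeps the predicted ratio between 0 and 1, consistent with the definition of the amplitude ratio used later in the description.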
In accordance with the above, in an exemplary implementation, the specific process of extracting the amplitude values and angle values of the one or more noisy audio signals includes: framing and windowing the acquired one or more noisy audio signals; performing a Fourier transform on the framed and windowed noisy audio signals; calculating the amplitudes of the one or more noisy audio signals from the Fourier transform results and obtaining their angle values from the same results; and taking the logarithm of the calculated amplitudes to obtain the amplitude values of the one or more noisy audio signals. In the embodiment of the present application, the speech signal has a short-time stationary characteristic, so speech is usually segmented, i.e. framed. The framed speech signal is windowed for smoothing, which avoids boundary effects between frames. A magnitude spectrum is then obtained through the Fourier transform, so that characteristics of the speech signal such as formants can be analyzed intuitively in the frequency domain; taking the logarithm of the amplitude after the Fourier transform compresses the dynamic range of the signal amplitude, which aids network convergence. Meanwhile, taking the logarithm of the audio signal amplitude from the Fourier transform result gives the amplitude value of the audio signal in the frequency domain; the ratio of the imaginary part to the real part of the audio signal in the frequency domain is calculated from the Fourier transform result, and its arc tangent gives the angle value of the audio signal in the frequency domain. The calculated amplitude values and angle values of the audio signal can be stored for later use when restoring the signal. As another example, the method may also use the magnitude spectrum itself as the input feature.
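As an illustration, a minimal numpy sketch of this feature-extraction step (framing, windowing, FFT, log magnitude and angle) might look as follows; the frame length of 512, hop of 256 and square-root Hanning window follow the embodiment described later, and the small epsilon added before the logarithm is an assumption for numerical safety.

```python
import numpy as np

def extract_features(y, frame_len=512, hop=256, eps=1e-8):
    """Frame, window and Fourier-transform a noisy signal; return log magnitude and angle."""
    window = np.sqrt(np.hanning(frame_len))            # square-root Hanning analysis window
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=-1)                # complex spectrum, 257 bins per frame
    magnitude = np.abs(spec)
    log_mag = np.log(magnitude + eps)                  # compress the amplitude range
    angle = np.angle(spec)                             # saved for later signal recovery
    return log_mag, angle, magnitude
```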
In accordance with the above, in an exemplary implementation, performing the inverse transform on the complex spectrum of the one or more clean audio signals to obtain the one or more clean audio signals corresponding to the one or more noisy audio signals comprises: performing an inverse Fourier transform on the complex spectrum of the one or more clean audio signals to obtain the time-sequence signals of the one or more clean audio signals; and performing overlap-add (referred to in some embodiments as an overlap-add convolution operation) on the time-sequence signal of each frame of the clean audio signal to obtain the one or more final clean audio signals corresponding to the one or more noisy audio signals.
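A matching numpy sketch of this reconstruction step (inverse FFT per frame followed by overlap-add) is given below; applying the square-root Hanning window again at synthesis follows the windowing discussion later in the description, and the function name is illustrative.

```python
import numpy as np

def reconstruct(clean_spec, frame_len=512, hop=256):
    """Inverse-FFT each frame of the clean complex spectrum and overlap-add them."""
    window = np.sqrt(np.hanning(frame_len))            # synthesis window (square-root Hanning)
    frames = np.fft.irfft(clean_spec, n=frame_len, axis=-1) * window
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_len] += frame      # accumulate the overlapping parts
    return out
```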
According to the above description, before the amplitude values and angle values of the one or more noisy audio signals are extracted, the one or more noisy audio signals are pre-emphasized; and after the one or more clean audio signals are obtained, de-pre-emphasis is performed on them. As an example, in the embodiment of the present application pre-emphasis may be performed by high-pass filtering the audio signal. Because high-frequency speech components attenuate more easily than low-frequency components and speech energy is concentrated mainly in the low-frequency region, pre-emphasis raises the energy of the high-frequency components so that the network can learn evenly over all frequency intervals. During de-pre-emphasis, the clean audio signal is low-pass filtered so that it is restored to the naturally attenuated high-frequency state; de-pre-emphasis is the inverse operation of pre-emphasis.
According to the above description, before one or more noisy audio signals are obtained, the method further includes resampling the one or more noisy audio signals at a preset sampling rate, unifying their sampling rate to the preset sampling rate; and after the one or more clean audio signals are obtained, resampling them to restore their original sampling rate. As an example, all noisy audio signals may be unified to single-channel audio at a 16000 Hz sampling rate; audio at other sampling rates is resampled to 16000 Hz, and for multi-channel data each channel may be noise-reduced separately. After the clean audio signal is obtained, if the input was resampled from another format to 16000 Hz, it needs to be resampled back to the original sampling rate.
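A minimal sketch of this resampling step, assuming SciPy is available; the polyphase resampler is one reasonable choice, and the 16000 Hz target rate mirrors the description above.

```python
from math import gcd
from scipy.signal import resample_poly

TARGET_SR = 16000  # sampling rate assumed by the model

def to_model_rate(y, orig_sr):
    """Resample a signal from its original rate to the model's 16000 Hz rate."""
    g = gcd(orig_sr, TARGET_SR)
    return resample_poly(y, TARGET_SR // g, orig_sr // g)

def from_model_rate(y, orig_sr):
    """Restore a denoised signal back to its original sampling rate."""
    g = gcd(orig_sr, TARGET_SR)
    return resample_poly(y, orig_sr // g, TARGET_SR // g)
```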
As a specific embodiment, as shown in fig. 3, a speech noise reduction processing flow based on deep learning is provided, which includes:
Step S101, signal preprocessing. The sampling rule is set to support only single-channel audio data at a 16000 Hz sampling rate, and audio data at other sampling rates need to be resampled to 16000 Hz. Before feature extraction, the noisy audio signal can be pre-emphasized with a high-pass filter so that the energy of the high-frequency components is relatively boosted.
Step S102, feature extraction. Framing, windowing and Fourier transform are performed in sequence on the noisy audio signal; the amplitude of the noisy audio signal in the frequency domain is calculated from the Fourier transform, and its logarithm is taken to obtain the log-amplitude feature of the noisy audio signal in the frequency domain. Since speech signals have a short-time stationary characteristic, speech is usually segmented, i.e. framed. The framed speech signal is windowed for smoothing, which avoids boundary effects between frames. A magnitude spectrum is then obtained through the Fourier transform, so that characteristics of the speech signal such as formants can be analyzed intuitively in the frequency domain; taking the logarithm of the amplitude after the Fourier transform compresses the dynamic range of the signal amplitude, which aids network convergence. Meanwhile, taking the logarithm of the audio signal amplitude from the Fourier transform result gives the amplitude value of the audio signal in the frequency domain; the ratio of the imaginary part to the real part of the audio signal in the frequency domain is calculated from the Fourier transform result, and its arc tangent gives the angle value of the audio signal in the frequency domain. The calculated amplitude values and angle values of the audio signal can be stored for later use when restoring the signal.
Step S103, predicting the ideal amplitude ratio (IMR). The IMR value is the ratio of the amplitude of the clean speech signal contained in noisy speech to the amplitude of the noisy speech signal. Because the true IMR value cannot be obtained directly, deep learning with one or more trained neural network models can be used to predict a value as close as possible to the true IMR. As an example, after a real noisy speech signal is obtained, one or more trained neural network models are used to predict the IMR value associated with that signal. The network structure of the single neural network model in this step adopts an 8-layer neural network, which combines an encoder-decoder structure with a Unet structure. The encoding network (encoder) and the decoding network (decoder) are each 3-layer networks, and after being placed in one-to-one correspondence they form a Unet structure (i.e. a U-shaped network structure). The Unet structure uses skip connections from the encoder layers to the corresponding decoder layers, bypassing the intermediate layers and thereby avoiding information loss that may occur in the intermediate-layer coding. Two recurrent layers are embedded between the encoder and the decoder; in the embodiment of the present application the recurrent network consists of long short-term memory networks (LSTM), through which temporal characteristics between the feature codes, such as the fundamental frequency and formants of speech and the stationarity of noise, can be learned. The network structure of the neural network model is shown in fig. 2.
Step S104, signal recovery. The clean signal amplitude is recovered from the IMR and the noisy signal amplitude, and the complex spectrum of the clean signal is obtained from the noisy signal angle and the clean signal amplitude. An inverse Fourier transform of the complex spectrum gives the time-sequence signal, and overlap-adding the per-frame time-sequence signals gives the complete clean signal. As an example, the amplitude of the clean speech signal is obtained by multiplying the previously saved noisy speech signal amplitude by the predicted IMR. Combining the previously saved angle value of the noisy speech signal, used as the angle value of the clean speech signal, with the amplitude of the clean speech signal gives the complex spectrum of the clean speech signal. An inverse Fourier transform and overlap-add of the complex spectrum of the clean speech signal yield the clean speech signal corresponding to the noisy speech signal. The inverse Fourier transform converts the frequency-domain signal back to a time-domain signal, and overlap-add is the inverse operation of framing and windowing.
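A minimal numpy sketch of this recovery step: the saved noisy magnitude is scaled by the predicted IMR, the saved noisy angle supplies the phase, and the resulting complex spectrum is then passed to the inverse transform and overlap-add. The function and variable names are illustrative, not taken from the patent.

```python
import numpy as np

def recover_complex_spectrum(noisy_magnitude, noisy_angle, predicted_imr):
    """Rebuild the clean complex spectrum from the saved noisy features and the IMR."""
    clean_magnitude = noisy_magnitude * predicted_imr        # |X| = IMR * |Y|
    return clean_magnitude * np.exp(1j * noisy_angle)        # reuse the noisy phase
    # the result is then passed through the inverse FFT and overlap-add stage
```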
Step S105, de-preprocessing. The deep learning algorithm processes the high-pass-filtered noisy speech and yields high-pass-filtered clean speech, so low-pass filtering, i.e. de-emphasis, is needed to obtain the clean speech signal corresponding to the original noisy speech signal. Meanwhile, if the input noisy speech signal was resampled to 16000 Hz from another format, it must be resampled back to the original sampling rate.
In another specific embodiment, a speech noise reduction processing flow based on deep learning is provided, specifically: a noisy audio signal in a real environment is acquired. In a real environment, the corruption of a speech signal by a noise signal can generally be regarded as the addition of time-domain signals. Therefore, in the embodiment of the present application, the speech signal in the real environment is denoted x, the noise signal in the real environment is denoted n, and the noisy speech signal in the real environment is denoted y, so that y = x + n.
The amplitude values and angle values of the noisy audio signal are extracted. A Fourier transform is applied to y, x and n respectively; since the Fourier transform is linear, Y = X + N, where Y, X and N are the complex spectra obtained by Fourier-transforming the noisy speech signal y, the speech signal x and the noise signal n in the real environment. According to Euler's formula, Y can be expressed as Y = |Y| × e^(jw), where |Y| denotes the amplitude (spectral magnitude) of the noisy speech signal y in the real environment and w denotes its angle (phase). Similarly, X and N can be expressed in the same way according to Euler's formula, and the amplitude value and angle value of the noisy audio signal are extracted from this representation.
A neural network model M is trained using noisy audio signals in a real environment, and the neural network model M is used to predict the ideal amplitude ratio associated with a noisy audio signal. Specifically, since large numbers of real-environment speech signal data pairs (y and x) are not available for training the neural network model M, the embodiment of the present application adds a large number of real-environment speech signals x to real-environment noise signals n to obtain the corresponding noisy speech signals y, and trains the neural network model M with the obtained signal data pairs (y and x). Because the angles of the noisy audio signal, the speech signal and the noise signal have little influence, the embodiment of the present application assumes that Y, X and N have the same angle, so that |Y| = |X| + |N|. Accordingly, the embodiment of the present application defines the amplitude ratio (ideal amplitude ratio) in the real environment as g:

g = |X| / |Y|

where g is a value between 0 and 1. Therefore, in the training stage of the neural network model M, a large number of data pairs (y and x) are used to train it, and the mapping from y to g is learned by the neural network model M, yielding the predicted ideal amplitude ratio ĝ.
The training process and application process of the neural network model are shown in fig. 4. As can be seen from fig. 4, training the neural network model M under the input log(|Y|) predicts an ideal amplitude ratio (the IMR value in some embodiments), denoted ĝ, such that the predicted ideal amplitude ratio ĝ approaches the amplitude ratio g in the real environment. According to the definition of g, the square of g is actually the energy ratio of the speech signal x in the real environment to the noisy speech signal in the real environment. The energy of the speech signal is an important quantity, and the curve of energy concentration appears as the formants of speech. In order to preserve the formant characteristics of the speech signal, the present application proposes a loss function L defined on the predicted ideal amplitude ratio ĝ and the real-environment amplitude ratio g.
In the embodiment of the present application, the convergence criterion of the neural network model M is that the absolute value ΔL of the difference between the losses L of two adjacent training iterations is less than 0.001. When the neural network model M satisfies the convergence criterion, its training is complete and the trained model M can be put into use, for example to predict the ideal amplitude ratio.
In the embodiment of the present application, the structure of the neural network model M is shown in fig. 2. The neural network model M adopts a U-shaped network structure and is divided into an encoding network layer, a decoding network layer and a recurrent network; the encoding network layer consists of an encoder, the decoding network layer consists of a decoder, and the recurrent network consists of a long short-term memory network (LSTM). The encoder and the decoder each comprise three fully-connected layers. The activation function of the last fully-connected layer of the decoder is the Sigmoid function, σ(x) = 1/(1 + e^(-x)), while the activation functions of all other fully-connected layers are Leaky-ReLU: f(x) = x for x ≥ 0 and f(x) = a × x for x < 0, where a is a small positive slope.
In fig. 2, the recurrent network adopts a 2-layer long short-term memory network (LSTM). Dense(512) and Dense(1024) both denote fully-connected layers; the output layer of Dense(512) has 512 nodes and the output layer of Dense(1024) has 1024 nodes, and the number of input nodes of any layer in the encoder or decoder equals the number of output nodes of the preceding layer. The fully-connected layers of the encoder and of the decoder are symmetrical, and skip connections are arranged between them: through a skip connection, the output of a fully-connected encoder layer and the output of the layer above its symmetric decoder layer are concatenated together as the input of the next fully-connected decoder layer. In the embodiment of the present application, these skip connections avoid information loss in the decoding process, and the recurrent network learns temporal characteristics between the feature codes, such as the fundamental frequency and formants of speech and the stationarity of noise. In the application stage of the neural network model M, the spectral magnitude of the speech signal x contained in the noisy speech signal y in the real environment is estimated as |X̂| = ĝ × |Y|. The embodiment of the present application sets the spectral angle of the speech signal to be the same as that of the noisy speech signal; using this angle together with the estimated spectral magnitude |X̂| gives the complex spectrum X̂ of the speech signal x, and an inverse Fourier transform of X̂ yields the predicted speech signal x̂, i.e. the clean audio signal in some embodiments.
According to the above description, the specific process of training and applying the neural network model M is as follows:
Training data is determined. As shown in fig. 5, the embodiment of the present application prepares two signal data sets from the real environment, a speech signal data set and a noise signal data set, each containing a plurality of signal recordings. Each noise signal n is taken from the noise signal data set in turn, one speech signal x is randomly selected from the speech data set, and the selected speech signal x and the noise signal n are added to obtain a noisy speech signal y. During the addition, the noise signal n can be scaled appropriately so as to obtain noisy speech signals with different signal-to-noise ratios. For example, let the total number of sampling points of the noise signal n and of the speech signal x be T, let the sampling points of the noise signal be denoted n_i and the sampling points of the speech signal be denoted x_j, where i ≤ T and j ≤ T. Let the amplification factor applied to the noise signal n be α and the signal-to-noise ratio of the noisy speech signal after mixing be β; then:

β = 10 × log10( Σ x_j² / Σ (α × n_i)² )

The signal-to-noise ratio β is set to a random real number between -5 and 20, and the amplification factor is then calculated from the specific value of β:

α = sqrt( Σ x_j² / ( 10^(β/10) × Σ n_i² ) )

Finally, the noisy speech signal is obtained as y_{1,...,T} = x_{1,...,T} + α × n_{1,...,T}.
The above process of generating a noisy speech signal may be repeated multiple times to generate sufficient training data to train the neural network model. As an example, in the training data of the embodiment of the present application, the speech data set may contain 70,000 recordings, the noise data set may contain 30,000 recordings, and repeating the above operations may generate 200,000 training data pairs.
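A minimal numpy sketch of this noisy-speech generation step, assuming the speech and noise clips have already been trimmed to the same length T; the SNR range of -5 to 20 dB follows the description above, and the function name is illustrative.

```python
import numpy as np

def mix_at_random_snr(x, n, rng=None):
    """Add noise n to speech x at a random SNR between -5 and 20 dB."""
    rng = rng or np.random.default_rng()
    beta = rng.uniform(-5.0, 20.0)                       # target signal-to-noise ratio in dB
    alpha = np.sqrt(np.sum(x ** 2) / (10.0 ** (beta / 10.0) * np.sum(n ** 2)))
    return x + alpha * n                                 # noisy speech y
```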
The complex spectral amplitude features |X| and |Y| corresponding to the speech signal x and the noisy speech signal y are extracted respectively. The extraction of the complex spectral features comprises the stages of pre-emphasis, framing, windowing, discrete Fourier transform, separation of the complex spectral amplitude, separation of the complex spectral phase, and so on. In fig. 4, the preprocessing of the training phase of the neural network model M comprises three stages in sequence: pre-emphasis, framing and windowing. In the application phase of the neural network model M, the inverse operations of these three stages, namely windowing, overlap-add and de-emphasis, are applied in reverse order; these are collectively referred to as de-preprocessing in fig. 4. Specifically:
1) Pre-emphasis and de-emphasis: the speech signal can be regarded as a superposition of signals of different frequencies, of which the high-frequency components attenuate easily. Pre-emphasis removes the effect of the vocal cords and lips during utterance, compensates the high-frequency part of the speech signal that is suppressed by the articulatory system, and highlights the high-frequency formants. The pre-emphasis calculation in the embodiment of the present application is:

y'_{i+1} = y_{i+1} - μ × y_i

where y'_{i+1} is the signal after pre-emphasis, y_{i+1} and y_i are the signal before pre-emphasis, and i is the sampling point index. μ is usually between 0.9 and 1; for example, 0.97 may be selected. For convenience of calculation, after the speech signal has been pre-emphasized in the pre-processing stage, the pre-emphasized signal is still denoted by y in the embodiment of the present application.
Likewise, when the neural network model M is applied to predict a clean speech signal, i.e. in the application stage of the neural network model M, the inverse operation, de-emphasis, is required. The de-emphasis calculation is:

y''_{i+1} = y'_{i+1} + μ × y'_i

where y''_{i+1} is the signal after de-emphasis, y'_{i+1} and y'_i are the signal before de-emphasis, and i is the sampling point index. μ is usually between 0.9 and 1; for example, 0.97 may be selected. For convenience of calculation, after the speech signal has been de-emphasized, the de-emphasized signal may still be denoted by y in the de-preprocessing stage.
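A minimal numpy sketch of the two recurrences above with the example coefficient μ = 0.97. Note that the de-emphasis is implemented exactly as the recurrence is printed above (using the previous input sample); the exact algebraic inverse of pre-emphasis would instead feed back the previous output sample.

```python
import numpy as np

MU = 0.97  # example coefficient between 0.9 and 1

def pre_emphasis(y, mu=MU):
    out = y.astype(float)
    out[1:] = y[1:] - mu * y[:-1]        # y'[i+1] = y[i+1] - mu * y[i]
    return out

def de_emphasis(yp, mu=MU):
    out = yp.astype(float)
    out[1:] = yp[1:] + mu * yp[:-1]      # y''[i+1] = y'[i+1] + mu * y'[i], as printed above
    return out
```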
2) Framing and overlap-add: speech signals have a short-time stationary characteristic, and frequency-domain analysis of speech usually requires framing first. The framing process can be regarded as sliding along the signal and cutting out speech segments of fixed length; such a segment is called a frame. The sliding step can be fixed: the number of sampling points within a frame is called the frame length, and the number of sampling points skipped by each slide is called the sliding step. In the embodiment of the present application, the frame length may be set to 512 and the sliding step to 256.
When the neural network model M is used for predicting a pure speech signal, namely in the application stage of the neural network model M, the framed segments are synthesized into a time sequence signal, and the reverse operation of framing is required, which is called as an overlap addition algorithm; the overlap-add algorithm is to accumulate the sampling point values of the overlapped parts in time sequence between frames and divide the accumulated sampling point values by the sum of the coefficients of the window function in the overlapped parts; i.e., overlap-add convolution operations in some embodiments.
3) Windowing: windowing the frames avoids spectral leakage that may occur from sudden signal dropouts at the frame boundaries. The window function employed in the practice of this application is the square root hanning window:
Figure BDA0002937062430000121
wherein, K is the frame length, and the frame length is the same as the value set in the framing process, namely 512; w is aiIs the data of the corresponding position of the window. Windowing is the multiplication of data at corresponding positions of a frame and a window, namely: y isi=yiwi
When the neural network model M is applied to predict a clean speech signal, i.e. in the application stage of the neural network model M, a square-root Hanning window is applied again. Over the whole prediction process, i.e. the analysis stage and the application stage, the noisy speech signal y is therefore windowed twice with the square-root Hanning window, which is equivalent to applying the Hanning window once. Since the coefficients of the Hanning window sum to 1 over the overlapping portions of adjacent frames, there is no need to divide by the sum of the window coefficients during overlap-add.
4) Discrete Fourier transform:

Y_i = Σ_{k=0}^{K-1} y_k × e^(-j·2πik/K)

where Y_i denotes the i-th frequency component of the complex spectrum Y and j is the imaginary unit.
5) According to Euler's formula, the complex spectral amplitude |Y| and the angle are separated from the complex spectrum of the noisy speech signal y. The angle value does not need to be stored in the training stage of the neural network model M, but it must be stored in the application stage so that the signal can be recovered later. The logarithm of the complex spectral amplitude |Y| of the noisy speech signal y is taken to obtain the feature log(|Y|); the mean and variance of log(|Y|) over the noisy speech signals in the training set are calculated and stored, and the sample data are standardized by subtracting the mean and dividing by the deviation. The same two statistics are used to subtract the mean and normalize the variance of the input during the model application phase.
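A minimal sketch of this standardization step, assuming division by the standard deviation (the usual convention for the subtract-the-mean-then-divide operation described above); the statistics are computed once over the training set and reused at application time. The helper names are illustrative.

```python
import numpy as np

def fit_stats(train_log_mags):
    """Compute the mean and standard deviation of log(|Y|) over the training set."""
    all_vals = np.concatenate([m.ravel() for m in train_log_mags])
    return all_vals.mean(), all_vals.std()

def standardize(log_mag, mean, std, eps=1e-8):
    return (log_mag - mean) / (std + eps)   # subtract the mean, divide by the deviation
```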
As can be seen from fig. 4, log(|Y|) and the learning target g need to be input in the training phase of the neural network model M. Therefore |X| and |Y| are calculated according to the processes 1) to 5) described above, and the learning target g is then calculated. log(|Y|) is obtained by taking the logarithm of |Y|, and g is calculated as:

g = |X| / |Y|

where the learning target g is the ratio of the complex spectral amplitude |X| of the speech signal x in the real environment to the complex spectral amplitude |Y| of the noisy speech signal y in the real environment. In the present embodiment, a value of g greater than 1 is constrained to 1.
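A minimal sketch of computing the learning target g from the magnitude features, with the clipping to 1 stated above; the small epsilon in the denominator is an assumption for numerical safety.

```python
import numpy as np

def learning_target(clean_magnitude, noisy_magnitude, eps=1e-8):
    g = clean_magnitude / (noisy_magnitude + eps)   # g = |X| / |Y|
    return np.clip(g, 0.0, 1.0)                     # values greater than 1 constrained to 1
```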
As can be seen from fig. 4, training the neural network model M under the input log(|Y|) predicts the ideal amplitude ratio ĝ, such that the predicted ideal amplitude ratio ĝ approaches the ideal amplitude ratio g in the real environment. According to the definition of g, the square of g is actually the energy ratio of the speech signal x in the real environment to the noisy speech signal in the real environment. The energy of the speech signal is an important quantity, and the curve of energy concentration appears as the formants of speech. In order to preserve the formant characteristics of the speech signal, the present application proposes a loss function L defined on the predicted ideal amplitude ratio ĝ and the real-environment amplitude ratio g.
In the embodiment of the present application, the convergence criterion of the neural network model M is that the absolute value ΔL of the difference between the losses L of two adjacent training iterations is less than 0.001. When the neural network model M satisfies the convergence criterion, its training is complete and the trained model M can be put into use, for example to predict the ideal amplitude ratio.
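A minimal PyTorch training-loop sketch showing the |ΔL| < 0.001 stopping rule described above. The loss used here, the mean squared error between g² and ĝ², is only a stand-in reflecting the energy-ratio remark; the patent's exact loss formula is not reproduced here, and the Adam optimizer and learning rate are assumptions.

```python
import torch

def train(model, loader, max_epochs=100, tol=1e-3, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    prev_loss = None
    for _ in range(max_epochs):
        total = 0.0
        for log_mag, g in loader:                          # standardized log(|Y|), target g
            g_hat = model(log_mag)
            loss = torch.mean((g ** 2 - g_hat ** 2) ** 2)  # stand-in energy-ratio loss
            opt.zero_grad()
            loss.backward()
            opt.step()
            total += loss.item()
        if prev_loss is not None and abs(prev_loss - total) < tol:
            break                                          # |delta L| below tolerance: converged
        prev_loss = total
    return model
```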
Aiming at the existing problems, the method is designed based on deep learning and retains noise reduction capability in scenes with many kinds of noise. Because the audio signal is processed with a deep learning method, no idealized assumptions are needed, and the model can be trained on a wider noise data set, overcoming the insufficient robustness and insufficient noise reduction capability of traditional algorithms. The method solves the problem of insufficient noise reduction capability caused by the over-idealized assumptions of traditional noise reduction algorithms, as well as the problem of insufficient noise reduction robustness caused by the inability of traditional algorithms to be trained on large amounts of noise data. The method can remove machine noise, noisy crowd noise from roads, restaurants and the like, music noise, white noise and other noise; it retains noise reduction capability even in extreme noise pollution scenes (for example, a signal-to-noise ratio of -10 dB); and compared with the signal before noise reduction, the perceptual evaluation of speech quality (PESQ) score after noise reduction can be improved by 0.8. By training the model on a large amount of noise from many scenes, the problems of insufficient noise reduction performance and limited applicable noise reduction scenarios of traditional noise reduction algorithms can be solved. In addition, the design of the network structure combines the characteristics of an encoder-decoder structure, a Unet structure and an LSTM network.
As shown in fig. 6, the present invention further provides an audio noise reduction system, comprising:
the acquisition module M10 is used for acquiring one or more noisy audio signals;
an extraction module M20, configured to extract amplitude values and angle values of the one or more noisy audio signals;
a prediction module M30, configured to predict, according to the extracted amplitude values, ideal amplitude ratio values associated with the one or more noisy audio signals by using one or more neural network models;
a noise reduction module M40 configured to determine a complex spectrum of one or more clean audio signals according to the predicted ideal amplitude ratio, the amplitude values and the angle values of the one or more noisy audio signals, and to perform an inverse transform on the complex spectrum of the one or more clean audio signals to obtain one or more clean audio signals corresponding to the one or more noisy audio signals.
Aiming at the existing problems, the system is designed on the basis of deep learning and retains noise reduction capability in scenes with many kinds of noise. Because the audio signal is processed with a deep learning approach, no idealized assumptions are required, and the model can be trained on a broad noise data set, overcoming the insufficient robustness and insufficient noise reduction capability of traditional algorithms. The system not only solves the problem of insufficient noise reduction caused by the overly idealized assumptions of traditional noise reduction algorithms, but also the insufficient robustness that results from traditional algorithms not being trainable on large amounts of noise data. The system can suppress machine noise, babble from roads and restaurants, music noise, white noise and the like; it retains noise reduction capability even in extreme noise pollution scenes (for example, a signal-to-noise ratio of −10 dB); and compared with the signal before noise reduction, the perceptual evaluation of speech quality (PESQ) score after noise reduction can be improved by 0.8.
In an exemplary implementation, the network structure of the one or more neural network models in the prediction module M30 includes at least: an encryption (encoding) network layer, a decryption (decoding) network layer and a recurrent network; the recurrent network is embedded between the encryption network layer and the decryption network layer, and the encryption network layer and the decryption network layer correspond to each other to form a U-shaped network structure. As an example, the network structure of a single neural network model in the method employs an 8-layer neural network, combining an encoder-decoder network structure and a Unet network structure. The encoder of the encryption network and the decoder of the decryption network are each 3-layer networks, and their layers correspond to each other one to one to form the Unet structure (i.e. the U-shaped network structure). The Unet structure adds skip connections from each encoder layer to the corresponding decoder layer, bypassing the intermediate layers and thereby avoiding information that might otherwise be lost in the intermediate encoding. Two recurrent layers are embedded between the encoder and the decoder; in the embodiment of the present application the recurrent network is composed of long short-term memory (LSTM) layers, through which temporal characteristics of the feature codes, such as the fundamental frequency and formants of speech and the stationarity of noise, can be learned. The network structure of the neural network model is shown in fig. 2. As another example, the Dense (fully connected) layers in the neural network model may be replaced by convolutional neural network (CNN) layers: when the noise types are numerous and the noise intensities differ greatly, the model may converge with difficulty when magnitude-spectrum features are used as input, and a CNN layer may then be selected in place of a Dense layer so as to reduce the degradation of the neural network model's performance.
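A rough PyTorch sketch of such a U-shaped fully connected encoder / 2-layer LSTM / decoder with skip concatenations is given below; the layer widths, the Leaky ReLU slope and the 257-bin input size are assumptions made for illustration, not values fixed by the patent:

```python
import torch
import torch.nn as nn

class MaskNet(nn.Module):
    """U-shaped fully connected encoder / 2-layer LSTM / decoder predicting the ratio ĝ."""

    def __init__(self, n_bins=257):
        super().__init__()
        # 3-layer fully connected encoder
        self.enc1 = nn.Linear(n_bins, 1024)
        self.enc2 = nn.Linear(1024, 512)
        self.enc3 = nn.Linear(512, 256)
        # 2-layer recurrent network between encoder and decoder
        self.lstm = nn.LSTM(256, 256, num_layers=2, batch_first=True)
        # 3-layer fully connected decoder; each layer also receives the
        # activation of its symmetric encoder layer (Unet-style skip)
        self.dec1 = nn.Linear(256 + 256, 512)
        self.dec2 = nn.Linear(512 + 512, 1024)
        self.dec3 = nn.Linear(1024 + 1024, n_bins)
        self.act = nn.LeakyReLU(0.01)

    def forward(self, x):                       # x: (batch, frames, n_bins) log(|Y|)
        e1 = self.act(self.enc1(x))
        e2 = self.act(self.enc2(e1))
        e3 = self.act(self.enc3(e2))
        h, _ = self.lstm(e3)                    # temporal modelling across frames
        d1 = self.act(self.dec1(torch.cat([h, e3], dim=-1)))
        d2 = self.act(self.dec2(torch.cat([d1, e2], dim=-1)))
        g_hat = torch.sigmoid(self.dec3(torch.cat([d2, e1], dim=-1)))
        return g_hat                            # predicted ideal amplitude ratio in (0, 1)
```

The Sigmoid on the last decoder layer keeps the prediction inside (0, 1), which matches the constraint that the target g never exceeds 1.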
In light of the above description, in an exemplary implementation, the specific process by which the extraction module M20 extracts the amplitude values and the angle values of the one or more noisy audio signals includes: framing and windowing the acquired one or more noisy audio signals; performing a Fourier transform on the framed and windowed signals; calculating the amplitudes of the one or more noisy audio signals from the Fourier transform result and obtaining their angle values from the same result; and taking the log of the calculated amplitudes to obtain the amplitude values of the one or more noisy audio signals. In the embodiment of the present application, the speech signal has a short-time stationary characteristic, so speech is usually segmented, i.e. framed. The framed speech signal is windowed for smoothing, which avoids boundary effects between frames. A magnitude spectrum is then obtained through the Fourier transform, and characteristics of the speech signal such as formants can be analyzed intuitively in the frequency domain; taking the log of the amplitude after the Fourier transform compresses the dynamic range of the signal amplitude, which helps the network converge. Meanwhile, taking the log of the audio signal's amplitude according to the Fourier transform result gives the amplitude value of the audio signal in the frequency domain; the tangent formed by the real part and the imaginary part of the audio signal in the frequency domain is computed from the Fourier transform result, and its arc tangent gives the angle value of the audio signal in the frequency domain. The calculated amplitude values and angle values can be stored for convenient use when the signal is later restored. As another example, the method may also employ the magnitude spectrum itself as the input feature.
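The following NumPy sketch illustrates this framing / windowing / FFT / log-magnitude / angle pipeline under assumed parameters (512-sample frames, 256-sample hop, square-root Hanning window); it is an illustration rather than the patent's exact implementation:

```python
import numpy as np

def extract_features(y, frame_len=512, hop=256):
    """Frame, window and FFT a noisy signal; return log-magnitude and angle."""
    window = np.sqrt(np.hanning(frame_len))              # square-root Hanning window
    n_frames = 1 + (len(y) - frame_len) // hop           # assumes len(y) >= frame_len
    frames = np.stack([y[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=-1)                  # complex spectrum per frame
    log_mag = np.log(np.abs(spec) + 1e-8)                # log compresses the amplitude range
    angle = np.angle(spec)                               # arctan(imag / real), kept for restoration
    return log_mag, angle
```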
In accordance with the above description, in an exemplary implementation, the noise reduction module M40 performs the inverse transform on the complex spectrum of the one or more clean audio signals to obtain the one or more clean audio signals corresponding to the one or more noisy audio signals as follows: performing an inverse Fourier transform on the complex spectrum of the one or more clean audio signals to obtain per-frame time-domain signals of the one or more clean audio signals; and performing an overlap-add (overlap-add convolution) operation on the time-domain signal of each frame of the clean audio signal to obtain the one or more final clean audio signals corresponding to the one or more noisy audio signals.
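A minimal sketch of this synthesis step, assuming the same 512/256 framing and square-root Hanning window as above (again illustrative rather than prescriptive):

```python
import numpy as np

def synthesize(clean_spec, frame_len=512, hop=256):
    """Inverse-FFT each frame, window again, then overlap-add back to a waveform."""
    window = np.sqrt(np.hanning(frame_len))
    frames = np.fft.irfft(clean_spec, n=frame_len, axis=-1) * window
    out_len = (len(frames) - 1) * hop + frame_len
    out = np.zeros(out_len)
    win_sum = np.zeros(out_len)
    for i, frame in enumerate(frames):
        out[i * hop: i * hop + frame_len] += frame       # accumulate overlapping samples
        win_sum[i * hop: i * hop + frame_len] += window ** 2
    return out / np.maximum(win_sum, 1e-8)               # divide by summed window weight
```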
According to the above description, before the amplitude values and angle values of the one or more noisy audio signals are extracted, the one or more noisy audio signals are pre-emphasized; and after the one or more clean audio signals are obtained, de-pre-emphasis is performed on them. As an example, when the embodiment of the present application performs pre-emphasis, the audio signal may be pre-emphasized by high-pass filtering. Because high-frequency speech components attenuate more easily than low-frequency components, speech energy is mainly concentrated in the low-frequency region; pre-emphasis boosts the high-frequency energy so that the network can learn evenly across all frequency bands. When de-pre-emphasis is performed, the clean audio signal is low-pass filtered so that it is restored to the naturally attenuated high-frequency state; de-pre-emphasis is the inverse operation of pre-emphasis.
According to the above description, before the one or more noisy audio signals are obtained, the method further includes resampling them according to a preset sampling rate so that their sampling rate is unified to the preset value; and after the one or more clean audio signals are obtained, resampling them to restore their original sampling rate. As an example, all noisy audio signals may be unified to single-channel audio at a 16000 Hz sampling rate; audio at other sampling rates may be resampled to 16000 Hz, and multi-channel data may be noise-reduced channel by channel. After the clean audio signal is obtained, any signal that was resampled to 16000 Hz needs to be resampled back to its original sampling rate.
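A small sketch of this resampling step using SciPy's polyphase resampler; the function name to_16k is ours and any equivalent resampler would serve:

```python
import numpy as np
from scipy.signal import resample_poly

def to_16k(audio, orig_sr, target_sr=16000):
    """Resample one channel (or the last axis of multi-channel data) to 16 kHz."""
    if orig_sr == target_sr:
        return audio
    g = np.gcd(int(orig_sr), int(target_sr))
    return resample_poly(audio, target_sr // g, orig_sr // g, axis=-1)
```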
As a specific embodiment, as shown in fig. 3, a speech noise reduction processing method based on deep learning is provided, and noise reduction processing can be performed on a speech signal through the speech noise reduction processing method, which is not described in detail again in this system.
In another specific embodiment, a speech noise reduction processing flow based on deep learning is provided, specifically: a noisy audio signal in a real environment is acquired. In a real environment, the process by which a noise signal pollutes a speech signal can generally be regarded as an addition of time-domain signals. Therefore, in the embodiment of the present application, with the speech signal in the real environment denoted x, the noise signal in the real environment denoted n, and the noisy speech signal in the real environment denoted y, we have y = x + n.
Amplitude values and angle values of the noisy audio signal are extracted. Fourier transforms are applied to y, x and n respectively; since the Fourier transform is linear, Y = X + N, where Y, X and N are the complex spectra obtained by Fourier-transforming the noisy speech signal y, the speech signal x and the noise signal n in the real environment. According to Euler's formula, Y can be expressed as Y = |Y|·e^(jw), where |Y| represents the amplitude (spectral magnitude) of the noisy speech signal y in the real environment and w represents its angle (phase). Similarly, X and N can be expressed in the same form according to Euler's formula, and the amplitude value and the angle value of the noisy audio signal are extracted from this representation.
The neural network model M is trained using noisy audio signals in the real environment, and the trained model M is used to predict the ideal amplitude ratio associated with a noisy audio signal. Specifically, since large numbers of real-environment speech signal data pairs (y and x) are not available for training the neural network model M, the embodiment of the present application may add a large number of real-environment speech signals x and real-environment noise signals n to obtain corresponding noisy speech signals y, and train the neural network model M with the resulting data pairs (y and x). Because the angles of the noisy audio signal, the speech signal and the noise signal have little influence, the embodiment of the present application sets Y, X and N to have the same angle, so that |Y| = |X| + |N|. Accordingly, the amplitude ratio in the real environment (the ideal amplitude ratio, also written IMR for ideal magnitude ratio) may be defined as g:

g = |X| / |Y|

where g is a value between 0 and 1. Therefore, in the training stage of the neural network model M, a large number of data pairs (y and x) can be used for training; the mapping from y to g is learned by the model, yielding the predicted ideal amplitude ratio ĝ.
The training process and application process of the neural network model are shown in fig. 4. As can be seen from fig. 4, the neural network model M is trained to predict the ideal amplitude ratio (the IMR value in some embodiments), denoted by the parameter ĝ, from the input log(|Y|), so that the predicted ideal amplitude ratio ĝ approaches the amplitude ratio g in the real environment. According to the definition of g, the square of g is in fact the energy ratio of the speech signal x in the real environment to the noisy speech signal in the real environment. The energy of the speech signal is an important quantity, and the regions where energy concentrates appear as the formants of speech. In order to preserve the formant characteristics of the speech signal, the present application proposes the following loss function:
[Loss function L, defined in terms of the predicted ideal amplitude ratio ĝ and the target ratio g; rendered only as an image in the original.]
In the embodiment of the present application, the convergence criterion of the neural network model M is that the absolute value ΔL of the difference between the losses L of two adjacent training iterations is less than 0.001. When the neural network model M satisfies the convergence criterion, its training is complete, and the trained model can be put into use, for example to predict the ideal amplitude ratio.
In the embodiment of the present application, the structure of the neural network model M is as shown in fig. 2. The model adopts a U-shaped network structure divided into an encryption network layer, a decryption network layer and a recurrent network; the encryption network layer is composed of an encoder, the decryption network layer is composed of a decoder, and the recurrent network is composed of a long short-term memory network (LSTM). The encoder and the decoder each comprise three fully connected layers. The activation function of the last fully connected layer of the decoder is the Sigmoid function δ(x) = 1/(1 + e^(−x)), while the other fully connected layers use the Leaky ReLU:

f(x) = x for x ≥ 0, and f(x) = a·x for x < 0, where a is a small positive slope.
In fig. 2, the recurrent network adopts a 2-layer long short-term memory network (LSTM). Dense(512) and Dense(1024) denote fully connected layers whose output layers have 512 and 1024 nodes respectively; the number of input nodes of any layer in the encoder or decoder equals the number of output nodes of the preceding layer. The fully connected layers of the encoder and the decoder are symmetric, and skip connections are formed between them: the output of an encoder fully connected layer is concatenated with the output of the layer preceding its symmetric counterpart, and the concatenation serves as the input to the next decoder fully connected layer. Through these skips, the embodiment of the present application can avoid information loss in the decoding process, and the recurrent network can learn the temporal characteristics of the feature codes, such as the fundamental frequency and formants of speech and the stationarity of noise. In the application stage of the neural network model M, the spectral amplitude of the speech signal x contained in the noisy speech signal y in the real environment is estimated from the predicted ratio ĝ and the noisy amplitude |Y|, i.e. |X̂| = ĝ·|Y|. The embodiment of the present application sets the spectral angle of the speech signal to be the same as that of the noisy speech signal; using this angle together with the estimated spectral amplitude |X̂|, the complex spectrum X̂ of the speech signal x is obtained, and an inverse Fourier transform of X̂ yields the predicted speech signal x̂, i.e. the clean audio signal in some embodiments.
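A per-frame sketch of this application stage, where `model` stands for the trained predictor of ĝ and the estimate |X̂| = ĝ·|Y| combined with the noisy phase reflects the reconstruction described above; names are illustrative:

```python
import numpy as np

def denoise_frames(model, noisy_log_mag, noisy_mag, noisy_angle):
    """Estimate |X| = g_hat * |Y|, reuse the noisy phase, and invert the spectrum."""
    g_hat = model(noisy_log_mag)                         # predicted ideal amplitude ratio
    clean_mag = g_hat * noisy_mag                        # estimated clean magnitude
    clean_spec = clean_mag * np.exp(1j * noisy_angle)    # complex spectrum with noisy phase
    return np.fft.irfft(clean_spec, axis=-1)             # per-frame time-domain signal
```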
According to the above description, the specific process of training and applying the neural network model M is as follows:
Training data is determined. As shown in fig. 5, the embodiment of the present application prepares two signal data sets collected in real environments, a speech signal data set and a noise signal data set, each containing a plurality of pieces of signal data. Each noise signal n is taken from the noise signal data set in turn, one speech signal x is randomly selected from the speech data set, and the selected speech signal x and the noise signal n are added to obtain a noisy speech signal y. During this summation, the noise signal n can be scaled appropriately so as to obtain noisy speech signals with different signal-to-noise ratios. For example, let the total number of sampling points of the noise signal n and the speech signal x be T, let the sampling points of the noise signal n be denoted n_i and the sampling points of the speech signal be denoted x_j, where i ≤ T and j ≤ T. Let the amplification factor of the noise signal n be α and the signal-to-noise ratio of the noisy speech signal after noise addition be β; then:
β = 10 · log10( Σ_{j=1}^{T} x_j² / Σ_{i=1}^{T} (α · n_i)² )
The signal-to-noise ratio β is set to a random real number between −5 and 20, and the amplification factor is then calculated from the specific value of β as

α = sqrt( Σ_{j=1}^{T} x_j² / ( 10^(β/10) · Σ_{i=1}^{T} n_i² ) ),

and finally the noisy speech signal is obtained according to y_{1,...,T} = x_{1,...,T} + α × n_{1,...,T}.
The above process of generating a noisy speech signal may be repeated multiple times to generate sufficient training data to train the neural network model. As an example, in the training data of the embodiment of the present application, the speech data set may contain 70,000 utterances, the noise data set may contain 30,000 noise clips, and repeating the above operation may generate 200,000 training data pairs.
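A compact sketch of this mixing rule (random β in [−5, 20] dB, α derived from β); the names are illustrative, and x and n are assumed to have the same length T:

```python
import numpy as np

def mix_at_random_snr(x, n, snr_low=-5.0, snr_high=20.0):
    """Scale the noise n so that x + alpha*n reaches a randomly drawn SNR in dB."""
    beta = np.random.uniform(snr_low, snr_high)          # target signal-to-noise ratio
    alpha = np.sqrt(np.sum(x ** 2) /
                    (10 ** (beta / 10) * np.sum(n ** 2) + 1e-8))
    return x + alpha * n                                 # noisy speech y
```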
Complex-spectrum amplitude features |X| and |Y| corresponding to the speech signal x and the noisy speech signal y are extracted respectively. The extraction of the complex-spectrum features comprises pre-emphasis, framing, windowing, discrete Fourier transform, separation of the complex-spectrum amplitude and separation of the complex-spectrum phase. In fig. 4, the preprocessing of the training phase of the neural network model M comprises pre-emphasis, framing and windowing in sequence. In the application phase of the neural network model M, the inverse operations — windowing, overlap-add and de-emphasis, in that order — are applied, and are collectively referred to as de-preprocessing in fig. 4. Specifically:
1) Pre-emphasis and de-emphasis: the speech signal can be regarded as a superposition of components at different frequencies, among which the high-frequency components attenuate easily. Pre-emphasis removes the effect of the vocal cords and lips during speech production, compensates the high-frequency part of the speech signal that is suppressed by the articulation system, and highlights the high-frequency formants. The pre-emphasis calculation in the embodiment of the present application is:

y′_{i+1} = y_{i+1} − μ × y_i

where y′_{i+1} is the pre-emphasized signal, y_{i+1} and y_i are the signal before pre-emphasis, and i indexes the sampling points. μ is usually between 0.9 and 1; for example, 0.97 may be selected. For convenience of calculation, after the speech signal has been pre-emphasized in the preprocessing stage, the pre-emphasized signal is still denoted y.

Likewise, when the neural network model M is applied to predict a clean speech signal, i.e. in the application stage of the neural network model M, the inverse operation of de-emphasis is required. The de-emphasis calculation is y″_{i+1} = y′_{i+1} + μ × y′_i, where y″_{i+1} is the de-emphasized signal, y′_{i+1} and y′_i are the signal before de-emphasis, and i indexes the sampling points. μ is usually between 0.9 and 1; for example, 0.97 may be selected. For convenience of calculation, after de-emphasis the signal may still be denoted y in the de-preprocessing stage.
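A small sketch of both operations with μ = 0.97; the de-emphasis below is implemented as the exact recursive inverse of the pre-emphasis filter, which is our assumption about the intended inverse operation:

```python
import numpy as np

def pre_emphasis(y, mu=0.97):
    """y'[i+1] = y[i+1] - mu * y[i]; the first sample is left unchanged."""
    out = y.astype(float).copy()
    out[1:] = y[1:] - mu * y[:-1]
    return out

def de_emphasis(y_prime, mu=0.97):
    """Recursive inverse of the pre-emphasis filter, applied to the predicted signal."""
    out = y_prime.astype(float).copy()
    for i in range(1, len(out)):
        out[i] = y_prime[i] + mu * out[i - 1]
    return out
```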
2) Framing and overlap-add: speech signals have a short-time stationary characteristic, and frequency-domain analysis of speech usually requires framing first. Framing can be regarded as sliding a fixed-length window over the signal and cutting out a speech segment at each position; such a segment is called a frame. The sliding step can be fixed: the number of sampling points within a frame is called the frame length, and the number of sampling points the window advances each time is called the sliding step. In the embodiment of the present application, the frame length may be set to 512 and the sliding step to 256.
When the neural network model M is used to predict a clean speech signal, i.e. in the application stage of the neural network model M, the framed segments must be synthesized back into a time-domain signal; this inverse of framing is called the overlap-add algorithm. The overlap-add algorithm accumulates, in temporal order, the sampling-point values of the portions where adjacent frames overlap, and divides the accumulated values by the sum of the window-function coefficients over the overlapping portion; this is the overlap-add convolution operation referred to in some embodiments.
3) Windowing: windowing each frame avoids the spectral leakage that the abrupt truncation of the signal at the frame boundaries may cause. The window function employed in the embodiment of the present application is the square-root Hanning window:

w_i = sqrt( 0.5 × (1 − cos(2πi / (K − 1))) ), i = 0, 1, …, K − 1

where K is the frame length, equal to the value set in the framing process, namely 512, and w_i is the window value at the corresponding position. Windowing multiplies the data of the frame and the window at corresponding positions, i.e. y_i = y_i · w_i.
When the neural network model M is applied to predict a clean speech signal, i.e. in the application stage of the neural network model M, a square-root Hanning window is applied again. Over the whole prediction pipeline — the training process and the application process — the noisy speech signal y is therefore windowed twice with the square-root Hanning window, which is equivalent to applying the Hanning window once. Since the coefficients of the Hanning window sum to 1 over the overlapping portions of adjacent frames, the overlap-add result does not need to be divided by the sum of window coefficients.
4) Performing discrete Fourier transform:
Y_i = Σ_{k=0}^{K−1} y_k · e^(−j·2π·i·k / K), i = 0, 1, …, K − 1

where Y_i denotes the i-th frequency component of the complex spectrum Y and j is the imaginary unit.
5) According to Euler's formula, the complex-spectrum amplitude |Y| and the angle are separated from the noisy speech signal Y. The angle value does not need to be stored in the training stage of the neural network model M, but it must be stored in the application stage so that the signal can be recovered later. Taking the logarithm of the complex-spectrum amplitude |Y| of the noisy speech signal y gives the feature log(|Y|); the mean and variance of log(|Y|) over the training set are computed and stored, and the sample data are standardized by subtracting the mean and dividing by the standard deviation. The same two statistics are also used to subtract the mean from the input and divide it during the model application phase.
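A sketch of this standardization step, storing the training-set statistics so they can be reused unchanged at inference time; the function names are ours:

```python
import numpy as np

def fit_normalizer(train_log_mags):
    """Mean and standard deviation of log(|Y|) over the whole training set."""
    stacked = np.concatenate(train_log_mags, axis=0)     # list of (frames, bins) arrays
    return stacked.mean(axis=0), stacked.std(axis=0) + 1e-8

def standardize(log_mag, mean, std):
    """Subtract the stored mean and divide by the stored deviation (train and inference)."""
    return (log_mag - mean) / std
```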
As can be seen from fig. 4, log(|Y|) and the learning target g need to be input in the training phase of the neural network model M. Therefore, |X| and |Y| are calculated according to the above-described processes 1) to 5), and the learning target g is then calculated. log(|Y|) is obtained by taking the logarithm of |Y|, and g is calculated as follows:

g = |X| / |Y|

In this formula, the learning target g is the ratio of the complex-spectrum amplitude |X| of the speech signal x in the real environment to the complex-spectrum amplitude |Y| of the noisy speech signal y in the real environment. In the present embodiment, values of g greater than 1 are constrained to 1.
As can be seen from FIG. 4, the neural network model M is trained to predict the ideal amplitude ratio ĝ from the input log(|Y|), so that the predicted ideal amplitude ratio ĝ approaches the ideal amplitude ratio g in the real environment. According to the definition of g, the square of g is in fact the energy ratio of the speech signal x in the real environment to the noisy speech signal in the real environment. The energy of the speech signal is an important quantity, and the regions where energy concentrates appear as the formants of speech. In order to preserve the formant characteristics of the speech signal, the present application proposes the following loss function:
[Loss function L, defined in terms of the predicted ideal amplitude ratio ĝ and the target ratio g; rendered only as an image in the original.]
In the embodiment of the present application, the convergence criterion of the neural network model M is that the absolute value ΔL of the difference between the losses L of two adjacent training iterations is less than 0.001. When the neural network model M satisfies the convergence criterion, its training is complete, and the trained model can be put into use, for example to predict the ideal amplitude ratio.
Aiming at the existing problems, the system is designed on the basis of deep learning and retains noise reduction capability in scenes with many kinds of noise. Because the audio signal is processed with a deep learning approach, no idealized assumptions are required, and the model can be trained on a broad noise data set, overcoming the insufficient robustness and insufficient noise reduction capability of traditional algorithms. The system not only solves the problem of insufficient noise reduction caused by the overly idealized assumptions of traditional noise reduction algorithms, but also the insufficient robustness that results from traditional algorithms not being trainable on large amounts of noise data. The system can suppress machine noise, babble from roads and restaurants, music noise, white noise and the like; it retains noise reduction capability even in extreme noise pollution scenes (for example, a signal-to-noise ratio of −10 dB); and compared with the signal before noise reduction, the perceptual evaluation of speech quality (PESQ) score after noise reduction can be improved by 0.8. By training the model on a large amount of noise from various scenes, the insufficient noise reduction performance and limited applicable scenes of traditional noise reduction algorithms can be addressed. In addition, the characteristics of an encoder-decoder structure, a Unet structure and an LSTM network can be combined in the design of the network structure.
The embodiment of the present application further provides an audio noise reduction device based on deep learning, which is configured to perform the following operations:
obtaining one or more noisy audio signals;
extracting amplitude values and angle values of the one or more noisy audio signals, and predicting ideal amplitude ratio values associated with the one or more noisy audio signals by using one or more neural network models based on the extracted amplitude values;
determining a complex spectrum of one or more clean audio signals according to the predicted ideal amplitude ratio value, the amplitude values and the angle values of the one or more noisy audio signals;
and performing inverse transformation on the complex frequency spectrum of the one or more clean audio signals to obtain one or more clean audio signals corresponding to the one or more noisy audio signals.
In this embodiment, the audio noise reduction device executes the system or the method, and specific functions and technical effects may refer to the above embodiments, which are not described herein again.
An embodiment of the present application further provides an apparatus, which may include: one or more processors; and one or more machine readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method of fig. 1. In practical applications, the device may be used as a terminal device, and may also be used as a server, where examples of the terminal device may include: the mobile terminal includes a smart phone, a tablet computer, an electronic book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop, a vehicle-mounted computer, a desktop computer, a set-top box, an intelligent television, a wearable device, and the like.
The present embodiment also provides a non-volatile readable storage medium, where one or more modules (programs) are stored in the storage medium; when the one or more modules are applied to a device, the device may execute the instructions included in the data processing method of fig. 1 according to the present embodiment.
Fig. 7 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present application. As shown, the terminal device may include: an input device 1100, a first processor 1101, an output device 1102, a first memory 1103, and at least one communication bus 1104. The communication bus 1104 is used to implement communication connections between the elements. The first memory 1103 may include a high-speed RAM memory, and may also include a non-volatile storage NVM, such as at least one disk memory, and the first memory 1103 may store various programs for performing various processing functions and implementing the method steps of the present embodiment.
Alternatively, the first processor 1101 may be, for example, a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components, and the processor 1101 is coupled to the input device 1100 and the output device 1102 through a wired or wireless connection.
Optionally, the input device 1100 may include a variety of input devices, such as at least one of a user-oriented user interface, a device-oriented device interface, a software programmable interface, a camera, and a sensor. Optionally, the device interface facing the device may be a wired interface for data transmission between devices, or may be a hardware plug-in interface (e.g., a USB interface, a serial port, etc.) for data transmission between devices; optionally, the user-facing user interface may be, for example, a user-facing control key, a voice input device for receiving voice input, and a touch sensing device (e.g., a touch screen with a touch sensing function, a touch pad, etc.) for receiving user touch input; optionally, the programmable interface of the software may be, for example, an entry for a user to edit or modify a program, such as an input pin interface or an input interface of a chip; the output devices 1102 may include output devices such as a display, audio, and the like.
In this embodiment, the processor of the terminal device includes a function for executing each module of the speech recognition apparatus in each device, and specific functions and technical effects may refer to the above embodiments, which are not described herein again.
Fig. 8 is a schematic hardware structure diagram of a terminal device according to another embodiment of the present application. FIG. 8 is a specific embodiment of FIG. 7 in an implementation. As shown, the terminal device of the present embodiment may include a second processor 1201 and a second memory 1202.
The second processor 1201 executes the computer program code stored in the second memory 1202 to implement the method described in fig. 1 in the above embodiment.
The second memory 1202 is configured to store various types of data to support operations at the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, such as messages, pictures, videos, and so forth. The second memory 1202 may include a Random Access Memory (RAM) and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.
Optionally, a second processor 1201 is provided in the processing assembly 1200. The terminal device may further include: communication components 1203, power components 1204, multimedia components 1205, audio components 1206, input/output interfaces 1207, and/or sensor components 1208. The specific components included in the terminal device are set according to actual requirements, which is not limited in this embodiment.
The processing component 1200 generally controls the overall operation of the terminal device. The processing assembly 1200 may include one or more second processors 1201 to execute instructions to perform all or part of the steps of the method illustrated in fig. 1 described above. Further, the processing component 1200 can include one or more modules that facilitate interaction between the processing component 1200 and other components. For example, the processing component 1200 can include a multimedia module to facilitate interaction between the multimedia component 1205 and the processing component 1200.
The power supply component 1204 provides power to the various components of the terminal device. The power components 1204 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the terminal device.
The multimedia components 1205 include a display screen that provides an output interface between the terminal device and the user. In some embodiments, the display screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the display screen includes a touch panel, the display screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The audio component 1206 is configured to output and/or input speech signals. For example, the audio component 1206 includes a Microphone (MIC) configured to receive external voice signals when the terminal device is in an operational mode, such as a voice recognition mode. The received speech signal may further be stored in the second memory 1202 or transmitted via the communication component 1203. In some embodiments, audio component 1206 also includes a speaker for outputting voice signals.
The input/output interface 1207 provides an interface between the processing component 1200 and peripheral interface modules, which may be click wheels, buttons, etc. These buttons may include, but are not limited to: a volume button, a start button, and a lock button.
The sensor component 1208 includes one or more sensors for providing various aspects of status assessment for the terminal device. For example, the sensor component 1208 may detect an open/closed state of the terminal device, relative positioning of the components, presence or absence of user contact with the terminal device. The sensor assembly 1208 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, including detecting the distance between the user and the terminal device. In some embodiments, the sensor assembly 1208 may also include a camera or the like.
The communication component 1203 is configured to facilitate communications between the terminal device and other devices in a wired or wireless manner. The terminal device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one embodiment, the terminal device may include a SIM card slot therein for inserting a SIM card therein, so that the terminal device may log onto a GPRS network to establish communication with the server via the internet.
As can be seen from the above, the communication component 1203, the audio component 1206, the input/output interface 1207 and the sensor component 1208 in the embodiment of fig. 8 may be implemented as the input device in the embodiment of fig. 7.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-described embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical ideas disclosed by the present invention be covered by the claims of the present invention.

Claims (14)

1. An audio noise reduction method, comprising the steps of:
obtaining one or more noisy audio signals;
extracting amplitude values and angle values of the one or more noisy audio signals, and predicting ideal amplitude ratio values associated with the one or more noisy audio signals by using one or more neural network models based on the amplitude values;
determining the complex frequency spectrum of the corresponding one or more pure audio signals according to the predicted ideal amplitude ratio, the amplitude values and the angle values of the one or more noisy audio signals;
performing inverse transformation on the complex spectrum of the one or more clean audio signals to obtain one or more clean audio signals corresponding to the one or more noisy audio signals;
the training process of the neural network model comprises the following steps: determining whether the one or more neural network models converge or not according to the predicted ideal amplitude ratio and the amplitude ratio in the real environment, and finishing the training of the one or more neural network models when the one or more neural network models converge; wherein the network structure of the one or more neural network models comprises at least: an encryption network layer, a decryption network layer and a circulation network; the circulating network is embedded between the encryption network layer and the decryption network layer, and the encryption network layer and the decryption network layer correspond to each other to form a U-shaped network structure;
the acquisition process of the noisy audio signal comprises the following steps:
acquiring a noise signal n and a voice signal x, and the total sampling point number T of the noise signal n and the voice signal x;
calculating the signal-to-noise ratio β of the noisy speech signal after noise addition as:

β = 10 · log10( Σ_{i=1}^{T} x_i² / Σ_{j=1}^{T} (α · n_j)² )

where α is the amplification factor of the noise signal n, n_j is a sampling point of the noise signal n, and x_i is a sampling point of the speech signal x;
calculating the amplification factor α from the signal-to-noise ratio β as:

α = sqrt( Σ_{i=1}^{T} x_i² / ( 10^(β/10) · Σ_{j=1}^{T} n_j² ) )

and obtaining the noisy speech signal y_i according to y_i = x_i + α × n_j.
2. The audio denoising method of claim 1, wherein the training process of the one or more neural network models comprises:
acquiring a voice signal and a noise signal in a real environment, and obtaining a voice signal with noise in the real environment according to the voice signal and the noise signal in the real environment;
respectively extracting the amplitude corresponding to the voice signal in the real environment and the amplitude corresponding to the voice signal with noise in the real environment, and determining the amplitude ratio in the real environment according to the amplitude corresponding to the voice signal in the real environment and the amplitude corresponding to the voice signal with noise;
taking a log of a voice signal with noise in a real environment, and inputting the obtained numerical value into the one or more neural network models to obtain a predicted ideal amplitude ratio;
determining whether the one or more neural network models converge or not according to the predicted ideal amplitude ratio and the amplitude ratio in the real environment, and finishing the training of the one or more neural network models when the one or more neural network models converge;
wherein the network structure of the one or more neural network models comprises at least: an encryption network layer, a decryption network layer and a circulation network; the circulating network is embedded between the encryption network layer and the decryption network layer, and the encryption network layer and the decryption network layer correspond to each other to form a U-shaped network structure.
3. The audio noise reduction method of claim 2, further comprising determining a loss function based on the predicted ideal amplitude ratio and the amplitude ratio in the real environment;
judging whether the absolute value of the difference value of the loss functions of two adjacent training times is smaller than a preset target value or not;
if the target value is smaller than the preset target value, finishing the training of the one or more neural network models;
and if the target value is not less than the preset target value, updating the parameters of the one or more neural network models, and continuing to train the one or more neural network models.
4. The audio noise reduction method according to claim 1, wherein the specific process of extracting the amplitude values and the angle values of the one or more noisy audio signals comprises:
framing and windowing the acquired one or more noisy audio signals;
performing Fourier transform on one or more noisy audio signals after framing and windowing are completed;
calculating amplitudes of the one or more noisy audio signals based on a fourier transform result, and obtaining angle values of the one or more noisy audio signals according to the fourier transform result;
and taking a log of the calculated amplitude to obtain the amplitude values of the one or more noisy audio signals.
5. The audio noise reduction method of claim 4, wherein performing an inverse transform on the complex spectrum of the one or more clean audio signals to obtain one or more clean audio signals corresponding to the one or more noisy audio signals comprises:
performing inverse Fourier transform on the complex frequency spectrum of the one or more clean audio signals to obtain time sequence signals of the one or more clean audio signals;
and performing overlap-add convolution operation on the time sequence signal of each frame of the pure audio signal to obtain one or more final pure audio signals corresponding to the one or more noisy audio signals.
6. The audio noise reduction method according to claim 1 or 5, further comprising pre-emphasizing the one or more noisy audio signals before extracting amplitude values and angle values of the one or more noisy audio signals;
and after acquiring the one or more clean audio signals, performing de-pre-emphasis on the acquired one or more clean audio signals.
7. The audio noise reduction method according to claim 1 or 5, further comprising resampling one or more noisy audio signals according to a preset sampling rate before obtaining the one or more noisy audio signals, unifying the sampling rate of the one or more noisy audio signals to the preset sampling rate;
and after one or more pure audio signals are obtained, resampling the one or more pure audio signals, and restoring the sampling rate of the one or more pure audio signals.
8. An audio noise reduction system, comprising:
the acquisition module is used for acquiring one or more noisy audio signals;
the extraction module is used for extracting amplitude values and angle values of the one or more noisy audio signals;
a prediction module for predicting an ideal amplitude ratio associated with the one or more noisy audio signals using one or more neural network models based on the extracted amplitude values;
a noise reduction module for determining a complex spectrum of one or more clean audio signals according to the predicted ideal amplitude ratio, the amplitude values and the angle values of the one or more noisy audio signals; performing inverse transformation on the complex frequency spectrum of the one or more clean audio signals to obtain one or more clean audio signals corresponding to the one or more noisy audio signals;
the training process of the neural network model comprises the following steps: determining whether the one or more neural network models converge or not according to the predicted ideal amplitude ratio and the amplitude ratio in the real environment, and finishing the training of the one or more neural network models when the one or more neural network models converge; wherein the network structure of the one or more neural network models comprises at least: an encryption network layer, a decryption network layer and a circulation network; the circulating network is embedded between the encryption network layer and the decryption network layer, and the encryption network layer and the decryption network layer correspond to each other to form a U-shaped network structure;
the acquisition process of the noisy audio signal comprises the following steps:
acquiring a noise signal n and a voice signal x, and the total sampling point number T of the noise signal n and the voice signal x;
calculating the signal-to-noise ratio β of the noisy speech signal after noise addition as:

β = 10 · log10( Σ_{i=1}^{T} x_i² / Σ_{j=1}^{T} (α · n_j)² )

where α is the amplification factor of the noise signal n, n_j is a sampling point of the noise signal n, and x_i is a sampling point of the speech signal x;
calculating the amplification factor α from the signal-to-noise ratio β as:

α = sqrt( Σ_{i=1}^{T} x_i² / ( 10^(β/10) · Σ_{j=1}^{T} n_j² ) )

and obtaining the noisy speech signal y_i according to y_i = x_i + α × n_j.
9. The audio noise reduction system of claim 8, wherein the training process of the one or more neural network models in the prediction module comprises:
acquiring a voice signal and a noise signal in a real environment, and obtaining a voice signal with noise in the real environment according to the voice signal and the noise signal in the real environment;
respectively extracting the amplitude corresponding to the voice signal in the real environment and the amplitude corresponding to the voice signal with noise in the real environment, and determining the amplitude ratio in the real environment according to the amplitude corresponding to the voice signal in the real environment and the amplitude corresponding to the voice signal with noise;
taking a log of a voice signal with noise in a real environment, and inputting the obtained numerical value into the one or more neural network models to obtain a predicted ideal amplitude ratio;
determining whether the one or more neural network models converge or not according to the predicted ideal amplitude ratio and the amplitude ratio in the real environment, and finishing the training of the one or more neural network models when the one or more neural network models converge;
wherein the network structure of the one or more neural network models comprises at least: an encryption network layer, a decryption network layer and a circulation network; the circulating network is embedded between the encryption network layer and the decryption network layer, and the encryption network layer and the decryption network layer correspond to each other to form a U-shaped network structure.
10. The audio noise reduction system of claim 9, further comprising determining a loss function based on the predicted ideal amplitude ratio and the amplitude ratio in real environment;
judging whether the absolute value of the difference value of the loss functions of two adjacent training times is smaller than a preset target value or not;
if the target value is smaller than the preset target value, finishing the training of the one or more neural network models;
and if the target value is not less than the preset target value, updating the parameters of the one or more neural network models, and continuing to train the one or more neural network models.
11. The audio noise reduction system of claim 8, wherein the specific process of the extraction module extracting the amplitude values and the angle values of the one or more noisy audio signals comprises:
framing and windowing the acquired one or more noisy audio signals;
performing Fourier transform on one or more noisy audio signals after framing and windowing are completed;
calculating amplitudes of the one or more noisy audio signals based on a fourier transform result, and obtaining angle values of the one or more noisy audio signals according to the fourier transform result;
and taking a log of the calculated amplitude to obtain the amplitude values of the one or more noisy audio signals.
12. The audio noise reduction system of claim 11, wherein the noise reduction module performs an inverse transform on the complex spectrum of the one or more clean audio signals to obtain one or more clean audio signals corresponding to the one or more noisy audio signals, comprising:
performing inverse Fourier transform on the complex frequency spectrum of the one or more clean audio signals to obtain time sequence signals of the one or more clean audio signals;
and performing overlap-add convolution operation on the time sequence signal of each frame of the pure audio signal to obtain one or more final pure audio signals corresponding to the one or more noisy audio signals.
13. A computer device, comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method of any of claims 1-7.
14. One or more machine readable media having instructions stored thereon that, when executed by one or more processors, cause an apparatus to perform the method of any of claims 1-7.
CN202110164294.1A 2021-02-05 2021-02-05 Audio noise reduction method, system, device and medium Active CN112767960B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110164294.1A CN112767960B (en) 2021-02-05 2021-02-05 Audio noise reduction method, system, device and medium


Publications (2)

Publication Number Publication Date
CN112767960A CN112767960A (en) 2021-05-07
CN112767960B true CN112767960B (en) 2022-04-26

Family

ID=75705110

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110164294.1A Active CN112767960B (en) 2021-02-05 2021-02-05 Audio noise reduction method, system, device and medium

Country Status (1)

Country Link
CN (1) CN112767960B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113314147B (en) * 2021-05-26 2023-07-25 北京达佳互联信息技术有限公司 Training method and device of audio processing model, audio processing method and device
CN117480554A (en) * 2021-05-31 2024-01-30 华为技术有限公司 Voice enhancement method and related equipment
CN113436640B (en) * 2021-06-28 2022-11-25 歌尔科技有限公司 Audio noise reduction method, device and system and computer readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107452389A (en) * 2017-07-20 2017-12-08 大象声科(深圳)科技有限公司 A kind of general monophonic real-time noise-reducing method
CN108986835A (en) * 2018-08-28 2018-12-11 百度在线网络技术(北京)有限公司 Based on speech de-noising method, apparatus, equipment and the medium for improving GAN network
CN110808063A (en) * 2019-11-29 2020-02-18 北京搜狗科技发展有限公司 Voice processing method and device for processing voice
CN111312273A (en) * 2020-05-11 2020-06-19 腾讯科技(深圳)有限公司 Reverberation elimination method, apparatus, computer device and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7385381B2 (en) * 2019-06-21 2023-11-22 株式会社日立製作所 Abnormal sound detection system, pseudo sound generation system, and pseudo sound generation method


Also Published As

Publication number Publication date
CN112767960A (en) 2021-05-07


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant